This guide walks through building a complete retrieval-augmented generation (RAG) pipeline using Perplexity's Embeddings API and Agent API. It covers document chunking, embedding with both standard and contextualized models, building an in-memory vector index, querying for relevant context, and generating grounded answers.
A RAG pipeline retrieves relevant information from your own documents before generating an answer, grounding model responses in your data rather than relying solely on parametric knowledge. The steps are:
1. Chunk your source documents into manageable pieces with overlap.
2. Embed each chunk using a Perplexity embedding model.
3. Index the embeddings for similarity search.
4. Query by embedding the user question with the same model.
5. Retrieve the top-k most similar chunks.
6. Generate an answer by passing the retrieved context to the Agent API.
Split your documents into chunks small enough for the model’s context window while preserving semantic coherence. Overlapping chunks ensure that information at chunk boundaries is not lost.
```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

document = """Retrieval-augmented generation (RAG) is a technique that combines
information retrieval with text generation. Rather than relying solely on a
language model's training data, RAG systems first search a knowledge base for
relevant documents, then use those documents as context when generating a
response. This reduces hallucinations and allows the system to provide answers
grounded in specific, up-to-date sources."""

chunks = chunk_text(document, chunk_size=300, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:60]}...")
```
A chunk size of 300-500 characters with 50-100 characters of overlap works well for most use cases. For structured documents (markdown, HTML), consider splitting on headings or paragraph boundaries instead of raw character counts.
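If you do split on paragraph boundaries, a minimal sketch might look like the following. The `chunk_by_paragraphs` helper and its merging threshold are illustrative choices, not part of the API:

```python
def chunk_by_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, merging consecutive paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

This keeps each paragraph intact, trading uniform chunk sizes for semantic boundaries.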
Standard embeddings treat each text independently. Use them when chunks are self-contained and don’t rely on surrounding context.
```python
import base64

import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64_string: str) -> np.ndarray:
    """Decode a base64-encoded int8 embedding to a float32 numpy array."""
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

chunks = [
    "RAG combines retrieval with generation to ground responses in real data.",
    "Document chunking splits text into overlapping segments for embedding.",
    "Cosine similarity measures the angle between two embedding vectors.",
]

response = client.embeddings.create(input=chunks, model="pplx-embed-v1-4b")
embeddings = [decode_embedding(emb.embedding) for emb in response.data]
print(f"Embedded {len(embeddings)} chunks, each with {len(embeddings[0])} dimensions")
```
Contextualized embeddings understand that chunks belong to the same document. The model uses cross-chunk attention so that each chunk’s embedding incorporates information from its neighbors. The key API difference is the nested array structure: each inner array contains chunks from a single document.
```python
from perplexity import Perplexity

client = Perplexity()

# Two source documents, each split into chunks
doc1_chunks = [
    "RAG combines retrieval with generation to produce grounded answers.",
    "The retrieval step searches a vector index for chunks similar to the query.",
    "The generation step uses retrieved context to produce a final response.",
]
doc2_chunks = [
    "Embedding models convert text into dense vector representations.",
    "Cosine similarity is the standard metric for comparing embeddings.",
]

# Pass as nested arrays (one inner array per document)
response = client.contextualized_embeddings.create(
    input=[doc1_chunks, doc2_chunks],
    model="pplx-embed-context-v1-4b",
)

# Nested response: response.data[doc_idx].data[chunk_idx]
for doc in response.data:
    for chunk in doc.data:
        print(f"Doc {doc.index}, Chunk {chunk.index}: {chunk.embedding[:20]}...")
```
Chunk ordering matters. Chunks within each document must be passed in their original sequential order. The contextualized model uses positional context to relate neighboring chunks, so shuffling them will degrade embedding quality.
When using contextualized embeddings, wrap each query as a single-element inner list (e.g., `[[query]]`) so the API treats it as a single-chunk document:
```python
import base64

import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Index with contextualized model (chunks share cross-chunk attention)
doc_chunks = [
    "RAG combines retrieval with generation to produce grounded answers.",
    "The retrieval step finds chunks similar to the user query.",
    "The generation step uses retrieved context to produce a final response.",
]
ctx_response = client.contextualized_embeddings.create(
    input=[doc_chunks],  # nested array: one inner list per document
    model="pplx-embed-context-v1-4b",
)
index = [
    {"embedding": decode_embedding(chunk.embedding), "text": doc_chunks[chunk.index]}
    for chunk in ctx_response.data[0].data
]

# Query the index
query = "How does retrieval work in RAG?"
q_response = client.contextualized_embeddings.create(
    input=[[query]],
    model="pplx-embed-context-v1-4b",
)
q_emb = decode_embedding(q_response.data[0].data[0].embedding)

results = sorted(index, key=lambda x: cosine_similarity(q_emb, x["embedding"]), reverse=True)
print(f"Top result: {results[0]['text']}")
```
This example uses numpy for cosine similarity with a simple in-memory index. For production systems with millions of vectors, use a dedicated vector database (Pinecone, Weaviate, Qdrant, etc.).
```python
import base64

import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64_string: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents to index
documents = {
    "RAG Overview": [
        "Retrieval-augmented generation grounds LLM responses in external data.",
        "RAG reduces hallucinations by providing factual context to the model.",
        "A typical RAG pipeline has three stages: indexing, retrieval, and generation.",
    ],
    "Embedding Models": [
        "Embedding models map text to dense vector representations.",
        "Similar texts produce vectors that are close in the embedding space.",
        "Perplexity offers both standard and contextualized embedding models.",
    ],
}

# Build index: list of {embedding, text, doc_title} dicts
index = []
for title, chunks in documents.items():
    response = client.embeddings.create(input=chunks, model="pplx-embed-v1-4b")
    for emb_obj in response.data:
        index.append({
            "embedding": decode_embedding(emb_obj.embedding),
            "text": chunks[emb_obj.index],
            "doc_title": title,
        })

print(f"Indexed {len(index)} chunks")
```
The full query pipeline embeds the user question, retrieves the top-k most similar chunks, and passes them as context to the Agent API for answer generation.
```python
def rag_query(question: str, index: list[dict], top_k: int = 3, min_score: float = 0.3) -> str:
    """Embed question -> retrieve similar chunks -> generate answer."""
    # Step 1: Embed the question
    query_response = client.embeddings.create(input=[question], model="pplx-embed-v1-4b")
    query_emb = decode_embedding(query_response.data[0].embedding)

    # Step 2: Retrieve top-k chunks above the minimum similarity threshold
    scored = sorted(
        [{"score": cosine_similarity(query_emb, item["embedding"]), **item} for item in index],
        key=lambda x: x["score"],
        reverse=True,
    )[:top_k]
    scored = [item for item in scored if item["score"] >= min_score]
    if not scored:
        return "No relevant context found for this question."

    # Include source attribution alongside each chunk
    context = "\n\n".join(
        f"[Source: {item['doc_title']}]\n{item['text']}" for item in scored
    )

    # Step 3: Generate answer via Agent API
    response = client.responses.create(
        model="openai/gpt-5.4",
        input=question,
        instructions=(
            "Answer based only on the provided context. "
            "Cite sources by name when referencing specific information. "
            "If the context does not contain enough information, say so.\n\n"
            f"Context:\n{context}"
        ),
    )
    return response.output_text

answer = rag_query("What are the stages of a RAG pipeline?", index)
print(answer)
```
Start with `top_k=3` and `min_score=0.3` for most use cases. Raise `top_k` to 5–7 for broad questions or short chunks. Raise `min_score` to 0.5–0.7 if retrieved chunks contain irrelevant information. Lower it toward 0.2 for diverse or ambiguous queries.
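For larger in-memory indexes, the per-chunk Python loop above can be replaced by a single vectorized scoring pass. A minimal numpy sketch, assuming the `index` list of dicts built earlier (the `vectorized_top_k` helper is illustrative):

```python
import numpy as np

def vectorized_top_k(query_emb: np.ndarray, index: list[dict], top_k: int = 3) -> list[dict]:
    """Score all chunks at once with a normalized matrix-vector product."""
    matrix = np.stack([item["embedding"] for item in index])         # (n_chunks, dims)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # row-normalize
    query = query_emb / np.linalg.norm(query_emb)
    scores = matrix @ query                                          # cosine similarities
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [{**index[i], "score": float(scores[i])} for i in top_indices]
```

This computes every similarity in one matrix multiply instead of a Python-level loop, which matters once the index holds thousands of chunks.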
Perplexity embedding models support Matryoshka Representation Learning (MRL), which concentrates the most important information in the first N dimensions. You can request reduced dimensions directly via the API for faster search and smaller storage.
```python
import base64

import numpy as np
from perplexity import Perplexity

client = Perplexity()

texts = ["Matryoshka embeddings allow dimension reduction without re-embedding."]

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8)

# Full dimensions (2560 for 4b model)
full = client.embeddings.create(input=texts, model="pplx-embed-v1-4b")

# Reduced to 512 dimensions via the API
reduced = client.embeddings.create(input=texts, model="pplx-embed-v1-4b", dimensions=512)

print(f"Full: {len(decode_embedding(full.data[0].embedding))} dimensions")
print(f"Reduced: {len(decode_embedding(reduced.data[0].embedding))} dimensions")
```
Dimension reduction tradeoffs for the `pplx-embed-v1-4b` model:

| Dimensions | Storage per Vector | Relative Quality | Use Case |
| --- | --- | --- | --- |
| 2560 (full) | 2.5 KB | Highest | Maximum accuracy, small datasets |
| 1024 | 1 KB | Very high | Good balance for most applications |
| 512 | 512 B | High | Large-scale retrieval, fast search |
| 256 | 256 B | Moderate | Extremely large datasets, coarse filtering |
| 128 | 128 B | Lower | First-pass candidate filtering |
Use the `dimensions` parameter in the API call rather than manually truncating vectors. The API applies proper normalization for the requested dimension count. Start with full dimensions and reduce only when storage or latency becomes a bottleneck.
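One caveat when reducing dimensions: the query must be embedded with the same `dimensions` value as the indexed chunks, or the vectors won't be comparable. A minimal sketch, reusing the `client`, `chunks`, and `decode_embedding` helper from the earlier examples:

```python
DIMS = 512  # must match between index and query

index_resp = client.embeddings.create(input=chunks, model="pplx-embed-v1-4b", dimensions=DIMS)
query_resp = client.embeddings.create(
    input=["How does RAG work?"], model="pplx-embed-v1-4b", dimensions=DIMS
)

index_embs = [decode_embedding(e.embedding) for e in index_resp.data]
query_emb = decode_embedding(query_resp.data[0].embedding)

# A shape mismatch here would break cosine similarity downstream
assert index_embs[0].shape == query_emb.shape
```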
When embedding large document collections, process them in batches to stay within API rate limits. The standard API accepts up to 512 texts per request with a combined limit of 120,000 tokens.
```python
import asyncio
import base64

import numpy as np
from perplexity import AsyncPerplexity

def decode_embedding(b64_string: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

async def batch_embed(texts: list[str], batch_size: int = 100) -> list[np.ndarray]:
    """Embed texts in batches with rate limiting."""
    async with AsyncPerplexity() as client:
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = await client.embeddings.create(
                input=batch, model="pplx-embed-v1-4b"
            )
            all_embeddings.extend(decode_embedding(e.embedding) for e in response.data)
            print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)}")
            if i + batch_size < len(texts):
                await asyncio.sleep(0.1)  # Brief delay between batches
        return all_embeddings

# Usage
texts = [f"Document chunk number {i} with content." for i in range(500)]
embeddings = asyncio.run(batch_embed(texts, batch_size=100))
print(f"Total: {len(embeddings)} embeddings")
```
For contextualized embeddings, batch at the document level using `client.contextualized_embeddings.create(input=batch_of_doc_arrays)` with the same pattern. The contextualized API accepts up to 512 documents with 16,000 total chunks per request.
Rate limits: Keep batch sizes well within the API limits (512 texts / 120,000 tokens for standard; 512 documents / 16,000 chunks for contextualized) and add small delays between requests to avoid throttling.
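A minimal sketch of that document-level batching pattern, reusing the `decode_embedding` helper from earlier (the `batch_embed_documents` helper and its batch size of 50 are illustrative choices, well under the 512-document limit):

```python
import asyncio
from perplexity import AsyncPerplexity

async def batch_embed_documents(doc_arrays: list[list[str]], batch_size: int = 50) -> list[list]:
    """Embed documents (each a list of ordered chunks) in document-level batches."""
    async with AsyncPerplexity() as client:
        all_docs = []
        for i in range(0, len(doc_arrays), batch_size):
            batch = doc_arrays[i:i + batch_size]
            response = await client.contextualized_embeddings.create(
                input=batch, model="pplx-embed-context-v1-4b"
            )
            # Preserve the nested structure: one list of embeddings per document
            for doc in response.data:
                all_docs.append([decode_embedding(c.embedding) for c in doc.data])
            if i + batch_size < len(doc_arrays):
                await asyncio.sleep(0.1)  # brief pause between requests to avoid throttling
        return all_docs
```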
A self-contained pipeline that indexes two documents with contextualized embeddings and answers questions against the indexed content.
```python
import base64

import numpy as np
from perplexity import Perplexity

client = Perplexity()

# --- Helpers ---
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# --- Source documents ---
DOCUMENTS = {
    "Quantum Computing": (
        "Quantum computers use qubits that can exist in superposition, representing "
        "0 and 1 simultaneously. Unlike classical bits, qubits leverage quantum "
        "interference to perform calculations. Quantum entanglement allows qubits to "
        "be correlated, enabling parallel processing at scale. Current quantum computers "
        "from IBM, Google, and others have dozens to hundreds of physical qubits."
    ),
    "Machine Learning": (
        "Machine learning enables computers to learn from data without explicit "
        "programming. Supervised learning uses labeled examples to train models for "
        "classification and regression. Neural networks with many layers (deep learning) "
        "excel at image recognition and language tasks. Training requires large datasets "
        "and significant compute, often using GPUs or TPUs."
    ),
}

# --- Step 1: Index with the contextualized model ---
def build_index(documents: dict[str, str]) -> list[dict]:
    index = []
    for title, text in documents.items():
        chunks = chunk_text(text)
        response = client.contextualized_embeddings.create(
            input=[chunks], model="pplx-embed-context-v1-4b"
        )
        for chunk_obj in response.data[0].data:
            index.append({
                "embedding": decode_embedding(chunk_obj.embedding),
                "text": chunks[chunk_obj.index],
                "doc_title": title,
            })
    print(f"Indexed {len(index)} chunks from {len(documents)} documents")
    return index

# --- Step 2: Query the index, retrieve, generate ---
def rag_query(question: str, index: list[dict], top_k: int = 3, min_score: float = 0.3) -> str:
    q_resp = client.contextualized_embeddings.create(
        input=[[question]], model="pplx-embed-context-v1-4b"
    )
    q_emb = decode_embedding(q_resp.data[0].data[0].embedding)

    results = sorted(
        [{"score": cosine_similarity(q_emb, item["embedding"]), **item} for item in index],
        key=lambda x: x["score"],
        reverse=True,
    )[:top_k]
    results = [r for r in results if r["score"] >= min_score]
    if not results:
        return "No relevant context found for this question."

    context = "\n\n".join(f"[{r['doc_title']}]\n{r['text']}" for r in results)
    response = client.responses.create(
        model="openai/gpt-5.4",
        input=question,
        instructions=(
            "Answer based only on the provided context. "
            "Cite the source name in brackets when referencing information. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}"
        ),
    )
    return response.output_text

# --- Run ---
if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    questions = [
        "What makes qubits different from classical bits?",
        "What hardware is used to train machine learning models?",
    ]
    for q in questions:
        print(f"\nQ: {q}")
        print(f"A: {rag_query(q, index)}")
```