This guide focuses on the end-to-end pipeline. For API reference details on individual embedding types, see Standard Embeddings and Contextualized Embeddings.
Pipeline Overview
A RAG pipeline retrieves relevant information from your own documents before generating an answer, grounding model responses in your data rather than relying solely on parametric knowledge.
- Chunk your source documents into manageable pieces with overlap.
- Embed each chunk using a Perplexity embedding model.
- Index the embeddings for similarity search.
- Query by embedding the user question with the same model.
- Retrieve the top-k most similar chunks.
- Generate an answer by passing the retrieved context to the Agent API.
Prerequisites
Install the Perplexity SDK.
Get your Perplexity API Key
Navigate to the API Keys tab in the API Portal and generate a new key.
Document Chunking
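A minimal sketch of chunking with overlap, assuming word-based sliding windows. The function name `chunk_text` and its defaults (`chunk_size=200`, `overlap=40`) are illustrative choices, not values taken from the Perplexity documentation; a token-based splitter would track model limits more closely.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk when it is shorter than the overlap.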
Split your documents into chunks small enough for the model’s context window while preserving semantic coherence. Overlapping chunks ensure that information at chunk boundaries is not lost.
Embedding with the Standard Model
Standard embeddings treat each text independently. Use them when chunks are self-contained and don’t rely on surrounding context.
Embedding with the Contextualized Model
Contextualized embeddings understand that chunks belong to the same document. The model uses cross-chunk attention so that each chunk’s embedding incorporates information from its neighbors. The key API difference is the nested array structure: each inner array contains chunks from a single document.
Querying a Contextualized Index
When using contextualized embeddings, wrap each query as a single-element inner list (e.g., [[query]]) so the API treats it as a single-chunk document.
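The nesting can be sketched without calling the API. The helper names below are hypothetical; the lists they return are the shapes that would be passed as `input` to `client.contextualized_embeddings.create(...)`, with one inner list per document and chunks kept in document order.

```python
def build_contextualized_input(
    documents: dict[str, list[str]],
) -> tuple[list[list[str]], list[tuple[str, int]]]:
    """Return the nested input plus a flat map from result position to (doc_id, chunk_index)."""
    nested: list[list[str]] = []
    index_map: list[tuple[str, int]] = []
    for doc_id, chunks in documents.items():
        if not chunks:
            continue  # skip documents with no chunks rather than sending an empty inner list
        nested.append(list(chunks))  # one inner list per document, original order preserved
        index_map.extend((doc_id, i) for i in range(len(chunks)))
    return nested, index_map


def build_query_input(query: str) -> list[list[str]]:
    """Wrap a query as a single-chunk document: [[query]]."""
    return [[query]]
```

The `index_map` is a convenience for mapping flattened embeddings back to their source chunk; it is not part of the API.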
Building a Vector Index
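A minimal in-memory index, assuming embeddings are already available as float vectors. The class name `InMemoryIndex` and its methods are hypothetical; vectors are normalized on insert so that a dot product equals cosine similarity.

```python
import numpy as np


class InMemoryIndex:
    """Brute-force cosine-similarity index held entirely in memory."""

    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str, vector) -> None:
        v = np.asarray(vector, dtype=np.float32)
        norm = np.linalg.norm(v)
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        self.vectors.append(v / norm)  # store unit-length vectors
        self.texts.append(text)

    def search(self, query_vector, k: int = 3) -> list[tuple[str, float]]:
        """Return up to k (text, cosine_similarity) pairs, best match first."""
        if not self.vectors:
            return []
        q = np.asarray(query_vector, dtype=np.float32)
        q_norm = np.linalg.norm(q)
        if q_norm == 0:
            raise ValueError("query vector must be non-zero")
        scores = np.stack(self.vectors) @ (q / q_norm)  # cosine similarity per stored vector
        top = np.argsort(-scores)[:k]  # indices of the highest scores
        return [(self.texts[i], float(scores[i])) for i in top]
```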
This example uses numpy for cosine similarity with a simple in-memory index. For production systems with millions of vectors, use a dedicated vector database (Pinecone, Weaviate, Qdrant, etc.).
Query Pipeline
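One way to wire the query steps together, sketched with injected callables so that no specific SDK call is assumed. `embed_query`, `search`, and `generate` are hypothetical stand-ins for the embedding request, the index lookup, and the Agent API call; the prompt wording is illustrative only.

```python
from typing import Callable, Sequence


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer_question(
    question: str,
    embed_query: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], Sequence[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    query_vector = embed_query(question)  # must use the same model as the indexed chunks
    chunks = search(query_vector, k)      # top-k most similar chunks
    return generate(build_prompt(question, chunks))
```

Keeping the three stages behind plain callables makes it straightforward to swap the in-memory index for a vector database, or to test the flow with fakes.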
The full query pipeline embeds the user question, retrieves the top-k most similar chunks, and passes them as context to the Agent API for answer generation.
Standard vs Contextualized Comparison
| Aspect | Standard (pplx-embed-v1-4b) | Contextualized (pplx-embed-context-v1-4b) |
|---|---|---|
| Input format | Flat list of texts | Nested arrays grouped by document |
| Context awareness | Each text embedded independently | Chunks share cross-chunk context within each document |
| Best for | FAQ entries, standalone texts, short documents | Document paragraphs, article sections |
| Chunk ordering | Order does not matter | Must be in original document order |
| Query embedding | client.embeddings.create(input=[query]) | client.contextualized_embeddings.create(input=[[query]]) |
| Price (4b model) | $0.03 / 1M tokens | $0.05 / 1M tokens |
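The price row translates into corpus-level cost with simple arithmetic. A small sketch, using the per-million-token prices from the table; the function name `embedding_cost` is illustrative, and current pricing should be checked before relying on these constants.

```python
# Prices in USD per 1M tokens, copied from the comparison table above.
PRICE_PER_MILLION_TOKENS = {
    "pplx-embed-v1-4b": 0.03,
    "pplx-embed-context-v1-4b": 0.05,
}


def embedding_cost(total_tokens: int, model: str) -> float:
    """Estimated USD cost of embedding `total_tokens` tokens with `model`."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]
```

For a 10-million-token corpus this works out to roughly $0.30 with the standard model and $0.50 with the contextualized model.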
When to Use Standard Embeddings
- Chunks are self-contained and do not rely on surrounding context.
- Your content consists of FAQ pairs, product descriptions, or short independent entries.
- You need the lowest cost per token.
When to Use Contextualized Embeddings
- Chunks come from longer documents where meaning depends on neighboring text.
- A chunk like “This approach improves performance by 20%” only makes sense with its surrounding context.
- You are embedding paragraphs from articles, reports, or technical documentation.
- You want higher retrieval accuracy at a modest cost increase.
Matryoshka Dimensions
Perplexity embedding models support Matryoshka Representation Learning (MRL), which concentrates the most important information in the first N dimensions. You can request reduced dimensions directly via the API for faster search and smaller storage.
The table below applies to the pplx-embed-v1-4b model:
| Dimensions | Storage per Vector | Relative Quality | Use Case |
|---|---|---|---|
| 2560 (full) | 2.5 KB | Highest | Maximum accuracy, small datasets |
| 1024 | 1 KB | Very high | Good balance for most applications |
| 512 | 512 B | High | Large-scale retrieval, fast search |
| 256 | 256 B | Moderate | Extremely large datasets, coarse filtering |
| 128 | 128 B | Lower | First-pass candidate filtering |
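The storage column implies one byte per dimension (2560 dimensions at 2.5 KB, with 1 KB = 1024 bytes). Under that assumption, which is inferred from the table rather than stated by it, total index size can be estimated as follows; `index_size_bytes` is a hypothetical helper.

```python
BYTES_PER_DIMENSION = 1  # assumption inferred from the table: 2560 dimensions -> 2.5 KB


def index_size_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw vector storage only; excludes metadata and any index overhead."""
    return num_vectors * dimensions * BYTES_PER_DIMENSION
```

With one million vectors, full 2560-dimension embeddings need about 2.56 GB of raw storage, while 256-dimension embeddings need about 256 MB.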
Batch Processing
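A sketch of a batcher that respects both per-request limits covered in this section for the standard API (512 texts and 120,000 combined tokens). The token estimate of roughly four characters per token is an assumption for illustration; substitute a real tokenizer count where accuracy matters. The function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator

MAX_TEXTS_PER_REQUEST = 512
MAX_TOKENS_PER_REQUEST = 120_000


def estimate_tokens(text: str) -> int:
    # Assumption: about 4 characters per token; not an exact tokenizer count.
    return max(1, len(text) // 4)


def batch_texts(
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = estimate_tokens,
    max_texts: int = MAX_TEXTS_PER_REQUEST,
    max_tokens: int = MAX_TOKENS_PER_REQUEST,
) -> Iterator[list[str]]:
    """Yield batches that stay within both the text-count and token limits."""
    batch: list[str] = []
    batch_tokens = 0
    for text in texts:
        n = count_tokens(text)
        if n > max_tokens:
            raise ValueError("a single text exceeds the per-request token limit")
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= max_texts or batch_tokens + n > max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n
    if batch:
        yield batch
```

Each yielded batch can then be sent as one embedding request. For contextualized embeddings the same pattern would group whole documents instead of individual texts, against the 512-document and 16,000-chunk limits.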
When embedding large document collections, process them in batches to stay within API rate limits. The standard API accepts up to 512 texts per request with a combined limit of 120,000 tokens.
For contextualized embeddings, batch at the document level using client.contextualized_embeddings.create(input=batch_of_doc_arrays) with the same pattern. The contextualized API accepts up to 512 documents with 16,000 total chunks per request.
Complete Example
A self-contained pipeline that indexes two documents with contextualized embeddings and answers questions against the indexed content.
Next Steps
Standard Embeddings
API reference for standard embedding parameters and response format.
Contextualized Embeddings
API reference for contextualized embedding parameters and response format.
Best Practices
Encoding formats, similarity metrics, normalization, and error handling.
Agent API
Learn more about the Responses API used for answer generation.