Overview
This guide covers best practices for getting the most out of Perplexity’s Embeddings API, including Matryoshka dimension reduction, encoding formats, similarity metrics, RAG patterns, batch processing, and error handling.
Matryoshka Dimension Reduction
Perplexity embeddings support Matryoshka representation learning, allowing you to reduce embedding dimensions while maintaining quality. This enables faster similarity search and reduced storage costs.
from perplexity import Perplexity
client = Perplexity()
# Full dimensions (2560 for 4b model)
full_response = client.embeddings.create(
input=["Your text here"],
model="pplx-embed-v1-4b"
)
print(f"Full: {full_response.data[0].embedding}") # 2560-dim base64 string
# Reduced dimensions - faster search, smaller storage
reduced_response = client.embeddings.create(
input=["Your text here"],
model="pplx-embed-v1-4b",
dimensions=512
)
print(f"Reduced: {reduced_response.data[0].embedding}") # 512-dim base64 string
Trade-off: lower dimensions mean faster similarity search and smaller storage at a slight cost in retrieval quality. Start with full dimensions and reduce only if you need the speed or space.
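To size the trade-off, here is a quick back-of-the-envelope sketch (plain Python, no API calls). It assumes the default int8 encoding described below, which stores one byte per dimension; the 10-million-vector corpus is an illustrative figure:

num_vectors = 10_000_000  # illustrative corpus size
for dims in (2560, 1024, 512):
    # int8 encoding: one byte per dimension, so storage scales linearly
    corpus_gb = num_vectors * dims / 1e9
    print(f"{dims} dims: {dims} bytes/vector, ~{corpus_gb:.1f} GB total")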
Encoding Formats
Control the precision and size of embedding outputs with the encoding_format parameter:
| Format | Description | Decoded Size | Similarity Metric | Use Case |
|---|---|---|---|---|
| base64_int8 | Base64-encoded signed int8 (-128 to 127) | dimensions bytes | Cosine similarity | Default; good balance of quality and size |
| base64_binary | Base64-encoded packed bits (1 bit per dimension, LSB first) | dimensions / 8 bytes | Hamming distance | Maximum compression for large-scale retrieval |
import base64
import numpy as np
# Decode base64_int8 (default)
response = client.embeddings.create(
input=["Your text"],
model="pplx-embed-v1-4b"
)
int8_embedding = np.frombuffer(
base64.b64decode(response.data[0].embedding), dtype=np.int8
)
# Binary embeddings for large-scale retrieval systems
response = client.embeddings.create(
input=["Your text"],
model="pplx-embed-v1-4b",
encoding_format="base64_binary"
)
binary_bytes = np.frombuffer(
base64.b64decode(response.data[0].embedding), dtype=np.uint8
)
# Unpack bits: each byte contains 8 dimensions (LSB first)
binary_embedding = np.unpackbits(binary_bytes, bitorder="little")
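A quick sanity check ties the decoded arrays above back to the sizes in the table, assuming the 4b model's full 2560 dimensions (no dimensions parameter passed):

assert int8_embedding.shape[0] == 2560      # int8: one byte per dimension
assert binary_bytes.shape[0] == 2560 // 8   # binary: 320 packed bytes
assert binary_embedding.shape[0] == 2560    # unpacked: one bit per dimension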
base64_int8 matches bfloat16 quality at half the storage (one byte per dimension instead of two). Use base64_binary for extreme compression in large-scale systems.
Similarity Metrics
Perplexity embedding models produce unnormalized embeddings. Choosing the correct similarity metric is critical for accurate retrieval.
pplx-embed-v1 and pplx-embed-context-v1 natively produce unnormalized int8-quantized embeddings. You must compare them via cosine similarity. Using inner product or L2 distance on the raw vectors will produce incorrect rankings: those metrics agree with cosine similarity only for L2-normalized vectors, and unlike most embedding models, Perplexity embeddings are not pre-normalized.
int8 Embeddings (base64_int8)
Compare using cosine similarity. If your vector database does not support cosine similarity natively, convert the embeddings to float32 and L2-normalize them before storing:
import base64
import numpy as np
def decode_and_normalize(b64_string):
"""Decode and L2-normalize for vector DBs that only support inner product."""
embedding = np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)
norm = np.linalg.norm(embedding)
if norm > 0:
embedding = embedding / norm
return embedding
# After normalization, cosine similarity == inner product
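As a quick check (reusing client from earlier; the two input texts are placeholders), the inner product of two normalized vectors equals their cosine similarity, so an inner-product index now returns the same rankings:

resp = client.embeddings.create(
    input=["first example text", "second example text"],
    model="pplx-embed-v1-4b"
)
a = decode_and_normalize(resp.data[0].embedding)
b = decode_and_normalize(resp.data[1].embedding)
inner = float(np.dot(a, b))
cosine = inner / (np.linalg.norm(a) * np.linalg.norm(b))  # both norms are 1.0
print(abs(inner - cosine) < 1e-5)  # True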
Binary Embeddings (base64_binary)
Compare using Hamming distance. Binary embeddings encode each dimension as a single bit, so the natural distance metric is the number of differing bits between two vectors.
import numpy as np
def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
"""Hamming distance between two binary vectors (as uint8 packed bits)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
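A usage sketch (reusing client, base64, and numpy from the snippets above; the input texts are placeholders): request base64_binary embeddings for two texts, decode each to packed uint8 bytes, and compare them with the function above:

resp = client.embeddings.create(
    input=["first example text", "second example text"],
    model="pplx-embed-v1-4b",
    encoding_format="base64_binary"
)
a = np.frombuffer(base64.b64decode(resp.data[0].embedding), dtype=np.uint8)
b = np.frombuffer(base64.b64decode(resp.data[1].embedding), dtype=np.uint8)
print(hamming_distance(a, b))  # count of differing bits; lower means more similar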
Most vector databases (Pinecone, Weaviate, Qdrant, Milvus) support cosine similarity as a distance metric. Verify your database’s configuration before indexing embeddings.
RAG Pattern
Combine embeddings with Perplexity’s Agentic Research API for retrieval-augmented generation:
import base64
import numpy as np
from perplexity import Perplexity
client = Perplexity()
def decode_embedding(b64_string):
return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# 1. Your knowledge base (embed once, store in vector DB)
knowledge_base = [
"Perplexity API provides web-grounded AI responses",
"The Embeddings API supports Matryoshka dimension reduction",
"Contextualized embeddings share context across document chunks"
]
kb_response = client.embeddings.create(input=knowledge_base, model="pplx-embed-v1-4b")
kb_embeddings = [decode_embedding(emb.embedding) for emb in kb_response.data]
# 2. User query
user_query = "How do I reduce embedding dimensions?"
# 3. Find relevant context
query_response = client.embeddings.create(input=[user_query], model="pplx-embed-v1-4b")
query_embedding = decode_embedding(query_response.data[0].embedding)
scores = [(i, cosine_similarity(query_embedding, emb)) for i, emb in enumerate(kb_embeddings)]
top_docs = sorted(scores, key=lambda x: x[1], reverse=True)[:2]
context = "\n".join([knowledge_base[i] for i, _ in top_docs])
# 4. Generate answer with context
response = client.responses.create(
model="openai/gpt-5.2",
input=f"Answer using this context:\n\n{context}\n\nQuestion: {user_query}"
)
print(response.output[0].content[0].text)
Batch Processing
Process large datasets efficiently with async batching:
import asyncio
from perplexity import AsyncPerplexity
async def batch_embed(texts: list[str], batch_size: int = 100):
async with AsyncPerplexity() as client:
results = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
response = await client.embeddings.create(
input=batch,
model="pplx-embed-v1-4b"
)
results.extend(response.data)
print(f"Processed {min(i + batch_size, len(texts))}/{len(texts)}")
return results
# Usage
texts = ["Document " + str(i) for i in range(1000)]
embeddings = asyncio.run(batch_embed(texts))
print(f"Generated {len(embeddings)} embeddings")
Error Handling
import perplexity
from perplexity import Perplexity
client = Perplexity()
try:
response = client.embeddings.create(
input=["Your text"],
model="pplx-embed-v1-4b"
)
except perplexity.BadRequestError as e:
print(f"Invalid request: {e}")
except perplexity.RateLimitError:
print("Rate limited, please retry later")
except perplexity.APIStatusError as e:
print(f"API error: {e.status_code}")
Tips
Batch requests
Send up to 512 texts per request to maximize throughput and reduce API calls.
Match models
Always use the same embedding model for both queries and documents; embeddings from different models live in different vector spaces, so their similarity scores are not comparable.
Use cosine similarity
Perplexity embeddings are unnormalized. Always use cosine similarity for base64_int8 and Hamming distance for base64_binary. If your vector DB only supports inner product, L2-normalize the embeddings before storing.
Cache embeddings
Store computed embeddings in a vector database rather than recomputing them for the same text (see the sketch after these tips).
Use Matryoshka wisely
Start with full dimensions for best quality. Reduce dimensions only if you need faster search or smaller storage.
Binary for scale
Use base64_binary encoding format for large-scale retrieval systems where storage and speed are critical.
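As a minimal illustration of the caching tip above, the sketch below keys embeddings by a content hash and calls the API only on a miss. The in-memory dict and helper name are hypothetical, and it reuses client from earlier; in production the store would be your vector database:

import hashlib

_cache: dict[str, str] = {}  # content hash -> base64 embedding (stand-in for a vector DB)

def embed_cached(text: str, model: str = "pplx-embed-v1-4b") -> str:
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _cache:
        response = client.embeddings.create(input=[text], model=model)
        _cache[key] = response.data[0].embedding
    return _cache[key]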