This guide covers best practices for getting the most out of Perplexity’s Embeddings API, including dimension reduction, batch processing, RAG patterns, and error handling.
Perplexity embeddings support Matryoshka representation learning, allowing you to reduce embedding dimensions while maintaining quality. This enables faster similarity search and reduced storage costs.
```python
from perplexity import Perplexity

client = Perplexity()

# Full dimensions (2560 for the 4b model)
full_response = client.embeddings.create(
    input=["Your text here"],
    model="pplx-embed-v1-4b"
)
print(f"Full: {full_response.data[0].embedding}")  # 2560-dim base64 string

# Reduced dimensions - faster search, smaller storage
reduced_response = client.embeddings.create(
    input=["Your text here"],
    model="pplx-embed-v1-4b",
    dimensions=512
)
print(f"Reduced: {reduced_response.data[0].embedding}")  # 512-dim base64 string
```
Trade-off: Lower dimensions = faster search + less storage, but slightly lower quality. Start with full dimensions and reduce if needed.
base64_int8 produces the same quality as bfloat16 with significantly reduced storage. Use base64_binary for extreme compression in large-scale systems.
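As a rough illustration of the storage trade-off, here is the per-vector footprint at full dimensions, assuming two bytes per dimension for bfloat16, one byte per dimension for int8, and one packed bit per dimension for binary (illustrative back-of-the-envelope math, not the exact wire format):

```python
DIMS = 2560  # full dimensions of the 4b model

bf16_bytes = DIMS * 2     # bfloat16: 2 bytes per dimension
int8_bytes = DIMS * 1     # base64_int8: 1 byte per dimension
binary_bytes = DIMS // 8  # base64_binary: 1 bit per dimension, packed

print(bf16_bytes, int8_bytes, binary_bytes)  # 5120 2560 320
```

At this size, binary encoding cuts storage 16x relative to bfloat16, which is why it pays off in large-scale systems despite the coarser representation.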
Perplexity embedding models produce unnormalized embeddings. Choosing the correct similarity metric is critical for accurate retrieval.
pplx-embed-v1 and pplx-embed-context-v1 natively produce unnormalized int8-quantized embeddings. You must compare them via cosine similarity. Raw inner product and L2 distance only agree with cosine similarity when vectors are pre-normalized, as most embedding models' outputs are; because Perplexity embeddings are not, applying either metric directly produces incorrect rankings.
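A minimal sketch of why this matters, using synthetic unnormalized vectors (illustrative values, not real embeddings): the raw inner product rewards a large-norm vector even when a small-norm vector points in exactly the query's direction.

```python
import numpy as np

query = np.array([10.0, 2.0, 1.0])
doc_a = np.array([5.0, 1.0, 0.5])      # same direction as the query, small norm
doc_b = np.array([40.0, -30.0, 20.0])  # different direction, large norm

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw inner product favors the large-norm vector...
print(np.dot(query, doc_a), np.dot(query, doc_b))  # 52.5 360.0
# ...while cosine similarity correctly ranks the aligned vector first.
print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```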
Compare using cosine similarity. If your vector database does not support cosine similarity natively, convert the embeddings to float32 and L2-normalize them before storing:
```python
import base64

import numpy as np

def decode_and_normalize(b64_string):
    """Decode and L2-normalize for vector DBs that only support inner product."""
    embedding = np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)
    norm = np.linalg.norm(embedding)
    if norm > 0:
        embedding = embedding / norm
    return embedding

# After normalization, cosine similarity == inner product
```
Compare using Hamming distance. Binary embeddings encode each dimension as a single bit, so the natural distance metric is the number of differing bits between two vectors.
```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two binary vectors (as uint8 packed bits)."""
    return np.unpackbits(np.bitwise_xor(a, b)).sum()
```
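A quick usage sketch with synthetic bit vectors (hypothetical 16-dimensional binary embeddings, not real API output), packed into uint8 bytes with `np.packbits` as the helper expects:

```python
import numpy as np

# Hypothetical 16-bit binary embeddings, one bit per dimension
bits_a = np.array([1, 0, 1, 1, 0, 0, 1, 0] * 2, dtype=np.uint8)
bits_b = bits_a.copy()
bits_b[0] ^= 1  # flip one bit
bits_b[9] ^= 1  # and another

packed_a = np.packbits(bits_a)  # 16 bits -> 2 uint8 bytes
packed_b = np.packbits(bits_b)

def hamming_distance(a, b):
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

print(hamming_distance(packed_a, packed_b))  # 2
```

Two flipped bits yield a distance of 2; identical vectors yield 0.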
Most vector databases (Pinecone, Weaviate, Qdrant, Milvus) support cosine similarity as a distance metric. Verify your database’s configuration before indexing embeddings.
Combine embeddings with Perplexity’s Agentic Research API for retrieval-augmented generation:
```python
import base64

import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64_string):
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 1. Your knowledge base (embed once, store in a vector DB)
knowledge_base = [
    "Perplexity API provides web-grounded AI responses",
    "The Embeddings API supports Matryoshka dimension reduction",
    "Contextualized embeddings share context across document chunks",
]
kb_response = client.embeddings.create(input=knowledge_base, model="pplx-embed-v1-4b")
kb_embeddings = [decode_embedding(emb.embedding) for emb in kb_response.data]

# 2. User query
user_query = "How do I reduce embedding dimensions?"

# 3. Find relevant context
query_response = client.embeddings.create(input=[user_query], model="pplx-embed-v1-4b")
query_embedding = decode_embedding(query_response.data[0].embedding)
scores = [(i, cosine_similarity(query_embedding, emb)) for i, emb in enumerate(kb_embeddings)]
top_docs = sorted(scores, key=lambda x: x[1], reverse=True)[:2]
context = "\n".join([knowledge_base[i] for i, _ in top_docs])

# 4. Generate an answer with the retrieved context
response = client.responses.create(
    model="openai/gpt-5.4",
    input=f"Answer using this context:\n\n{context}\n\nQuestion: {user_query}",
)
print(response.output[0].content[0].text)
```
1
Batch requests
Send up to 512 texts per request to maximize throughput and reduce API calls.
2
Match models
Always use the same embedding model for both queries and documents to ensure consistent similarity scores.
3
Use cosine similarity
Perplexity embeddings are unnormalized. Always use cosine similarity for base64_int8 and Hamming distance for base64_binary. If your vector DB only supports inner product, L2-normalize the embeddings before storing.
4
Cache embeddings
Store computed embeddings in a vector database. Never recompute embeddings for the same text.
5
Use Matryoshka wisely
Start with full dimensions for best quality. Reduce dimensions only if you need faster search or smaller storage.
6
Binary for scale
Use base64_binary encoding format for large-scale retrieval systems where storage and speed are critical.
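The "cache embeddings" practice above can be sketched as a thin cache keyed by a hash of the input text. The `embed_fn` callable and the in-memory dict are illustrative stand-ins, not part of the Perplexity SDK; in production the dict would be a vector database or key-value store:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by SHA-256 of the text (illustrative sketch)."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # e.g. wraps client.embeddings.create
        self._store = {}
        self.misses = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)  # compute once per unique text
        return self._store[key]

# Usage with a stand-in embedding function
cache = EmbeddingCache(embed_fn=lambda t: [float(len(t))])
cache.get("hello")
cache.get("hello")  # served from cache, no recompute
print(cache.misses)  # 1
```

Hashing the text (rather than using it directly as a key) keeps keys fixed-size, which matters when cache keys end up in an external store.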