Embeddings turn text into vectors so you can search by meaning. Combined with chat, that’s RAG: retrieve relevant chunks, stuff them into the prompt, generate the answer.
This guide is opinionated. For the raw endpoint, see Embeddings API.
The pipeline
documents → chunk → embed → store with vectors
↓
user question → embed → similarity search → top-k chunks → chat completion → answer
Five components: a chunker, an embedding model, a vector store, a retriever, a chat model.
1. Pick an embedding model
| Model | Dim | Strong for | Cost (per 1M tokens) |
|---|---|---|---|
| text-embedding-3-small | 1536 (reducible) | Default; cheap, fast | $ |
| text-embedding-3-large | 3072 (reducible) | Highest English quality | $$ |
| gemini-embedding-001 | 3072 | 100+ languages | $ |
| qwen3-embedding-8b | 4096 | Code, multilingual | $ |
Higher dimensions ≠ always better — they cost more memory in your vector store. With OpenAI v3 models you can request a smaller dimensions value (e.g. 768 or 512) and keep most of the quality.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

emb = client.embeddings.create(
model="text-embedding-3-small",
input=texts, # list of strings
dimensions=512, # optional, MRL-truncated
)
Pick once, stick with it. Vectors from different models live in different spaces and can’t be compared. Migrating means re-embedding your entire corpus.
2. Chunk the documents
Don’t embed whole documents — embed chunks of ~200–500 tokens with ~50–100 token overlap. Smaller chunks = sharper retrieval; larger = more context per hit.
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
words = text.split()
out = []
i = 0
while i < len(words):
out.append(" ".join(words[i : i + size]))
i += size - overlap
return out
For mixed content (markdown, code, PDFs), chunk by structure first (headings, function boundaries), then by size.
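A minimal sketch of structure-first chunking for markdown, reusing the chunk() helper above as the size-based fallback; the heading regex and function name are illustrative assumptions, not part of this guide:

import re

def chunk_markdown(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    # Split on markdown headings first, then fall back to size-based chunking
    sections = re.split(r"\n(?=#{1,6} )", text)
    out = []
    for section in sections:
        if not section.strip():
            continue
        if len(section.split()) <= size:
            out.append(section)
        else:
            out.extend(chunk(section, size=size, overlap=overlap))
    return out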
3. Embed in batches
The endpoint accepts up to 2,048 input strings per call. Batch hard:
def embed_all(texts: list[str], batch: int = 256) -> list[list[float]]:
out = []
for i in range(0, len(texts), batch):
resp = client.embeddings.create(
model="text-embedding-3-small",
input=texts[i : i + batch],
)
out.extend(d.embedding for d in resp.data)
return out
Embedding 100k chunks at 256/batch ≈ 400 requests, well under any plan’s RPM.
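If you do hit a rate limit (huge corpus, small plan), wrapping each batch in a simple retry with exponential backoff is usually enough. A sketch, not official SDK behaviour:

import time

def embed_batch_with_retry(batch: list[str], attempts: int = 5) -> list[list[float]]:
    # Retry transient failures (rate limits, timeouts) with exponential backoff
    for attempt in range(attempts):
        try:
            resp = client.embeddings.create(
                model="text-embedding-3-small",
                input=batch,
            )
            return [d.embedding for d in resp.data]
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s...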
4. Store with metadata
Use a vector store that supports metadata filtering — pgvector, Qdrant, Pinecone, Weaviate, Chroma. Always store:
{
"id": "doc42_chunk7",
"vector": [0.014, -0.221, ...],
"text": "...the original chunk...",
"doc_id": "doc42",
"title": "Q3 Financial Report",
"url": "https://...",
"tokens": 412,
"indexed_at": "2026-04-15T10:00:00Z"
}
The text itself goes into the prompt later — don’t lose it.
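As one concrete example, indexing into Chroma looks roughly like this (any of the stores above works the same way conceptually); the collection name and exact metadata fields are illustrative:

import chromadb

chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")

def index(chunks: list[dict]) -> None:
    # Each chunk dict carries the fields shown above: id, text, doc_id, title, ...
    vectors = embed_all([c["text"] for c in chunks])
    collection.add(
        ids=[c["id"] for c in chunks],
        embeddings=vectors,
        documents=[c["text"] for c in chunks],  # keep the original text
        metadatas=[
            {"doc_id": c["doc_id"], "title": c["title"], "indexed_at": c["indexed_at"]}
            for c in chunks
        ],
    )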
5. Query
Embed the user’s question with the same model, then pull top-k:
def search(question: str, k: int = 6) -> list[dict]:
q = client.embeddings.create(
model="text-embedding-3-small",
input=question,
).data[0].embedding
return vector_store.query(vector=q, top_k=k, include_metadata=True)
Tune k empirically — usually 4–8. Below 3, you miss relevant chunks; above 10, you push noise into the prompt.
6. Generate the answer
Format retrieved chunks into the prompt with clear separators and source citations:
def answer(question: str) -> str:
hits = search(question, k=6)
context = "\n\n---\n\n".join(
f"[{i+1}] {h['text']}\n(Source: {h['title']})"
for i, h in enumerate(hits)
)
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content":
"Answer using ONLY the provided context. "
"Cite sources as [1], [2]. "
"If the context doesn't contain the answer, say so."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
],
)
return resp.choices[0].message.content
Improvements that earn their cost
- Hybrid search — combine vector similarity with BM25 keyword scores. ~10–15% recall lift on technical content.
- Re-ranking — fetch top-30 by vector, re-rank to top-6 with a cross-encoder (coming soon: dedicated reranker endpoint, see changelog).
- Query rewriting — for chat, ask the LLM to rewrite the user’s follow-up into a standalone search query before embedding (see the sketch after this list).
- Multi-query — generate 3–5 paraphrases of the question, search each, dedupe results. Catches lexical mismatches.
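A minimal sketch of query rewriting, assuming you keep the recent chat turns around; the prompt wording and the choice of a cheaper model are suggestions, not requirements:

def rewrite_query(history: list[dict], follow_up: str) -> str:
    # Turn "what about the previous quarter?" into a self-contained search query
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Rewrite the user's last message as a standalone search query, "
                "using the conversation for context. Return only the query."},
            *history,
            {"role": "user", "content": follow_up},
        ],
    )
    return resp.choices[0].message.content.strip()

Feed the rewritten query into search() instead of the raw follow-up.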
Common mistakes
- Mixing embedding models in the same store — vectors don’t compare. Re-embed everything when you switch.
- Embedding raw HTML/PDF bytes — extract clean text first.
- Chunks too large — model only “sees” the centre; edges are wasted.
- No metadata filter — searching all customers’ data for one customer’s question. Always filter by tenant first (see the sketch after this list).
- Forgetting to track indexed_at — when documents update, you need to know what’s stale.
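For the tenant filter, the exact argument depends on your store; a hypothetical sketch against the generic vector_store used earlier, assuming a tenant_id field was stored in each chunk’s metadata:

def search_for_tenant(question: str, tenant_id: str, k: int = 6) -> list[dict]:
    q = client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding
    # Restrict to this tenant's documents before ranking by similarity
    return vector_store.query(
        vector=q,
        top_k=k,
        filter={"tenant_id": tenant_id},
        include_metadata=True,
    )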
Cost ballpark
For 100,000 chunks × 400 tokens = 40M tokens:
- Index once with text-embedding-3-small (≈ $0.02 / 1M tokens) → **$0.80**
- Query at 1 query/sec (≈ 50 tokens each) → ≈ 4.3M tokens/day → ~$0.09 / day
Embeddings are the cheapest part of any RAG system. Spend the budget on the chat model.