Embeddings turn text into vectors so you can search by meaning. Combined with chat, that’s RAG: retrieve relevant chunks, stuff them into the prompt, generate the answer. This guide is opinionated. For the raw endpoint, see Embeddings API.

The pipeline

documents → chunk → embed → store with vectors

        user question → embed → similarity search → top-k chunks → chat completion → answer
Five components: a chunker, an embedding model, a vector store, a retriever, and a chat model.

1. Pick an embedding model

Model                     Dim                Strong for                Cost (per 1M tokens)
text-embedding-3-small    1536 (reducible)   Default; cheap, fast      $
text-embedding-3-large    3072 (reducible)   Highest English quality   $$
gemini-embedding-001      3072               100+ languages            $
qwen3-embedding-8b        4096               Code, multilingual        $
Higher dimensions ≠ always better: they cost more memory in your vector store. With OpenAI v3 models you can request a smaller dimensions value (e.g. 768 or 512) and keep most of the quality.
python
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,           # list of strings
    dimensions=512,        # optional, MRL-truncated
)
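The response preserves the input order, so pulling the vectors back out is a one-liner (shown here only as a quick shape check):
python
vectors = [d.embedding for d in emb.data]  # one vector per input string, same order as texts
print(len(vectors[0]))                     # 512 here, because we requested reduced dimensions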
Pick once, stick with it. Vectors from different models live in different spaces and can’t be compared. Migrating means re-embedding your entire corpus.

2. Chunk the documents

Don’t embed whole documents — embed chunks of ~200–500 tokens with ~50–100 token overlap. Smaller chunks = sharper retrieval; larger = more context per hit.
python
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        out.append(" ".join(words[i : i + size]))
        i += size - overlap
    return out
For mixed content (markdown, code, PDFs), chunk by structure first (headings, function boundaries), then by size.
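A minimal sketch of that two-pass idea for markdown, reusing the chunk() helper above; the heading regex is an assumption you would adapt to your own content types:
python
import re

def chunk_markdown(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    # First pass: split on markdown headings so each section stays self-contained.
    sections = re.split(r"\n(?=#{1,6} )", text)
    out: list[str] = []
    for section in sections:
        if len(section.split()) <= size:
            out.append(section.strip())
        else:
            # Second pass: fall back to the sliding-window chunker for oversized sections.
            out.extend(chunk(section, size=size, overlap=overlap))
    return [c for c in out if c]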

3. Embed in batches

The endpoint accepts up to 2,048 strings per call. Batch hard:
python
def embed_all(texts: list[str], batch: int = 256) -> list[list[float]]:
    out = []
    for i in range(0, len(texts), batch):
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts[i : i + batch],
        )
        out.extend(d.embedding for d in resp.data)
    return out
Embedding 100k chunks at 256/batch ≈ 400 requests, well under any plan’s RPM.

4. Store with metadata

Use a vector store that supports metadata filtering — pgvector, Qdrant, Pinecone, Weaviate, Chroma. Always store:
{
  "id": "doc42_chunk7",
  "vector": [0.014, -0.221, ...],
  "text": "...the original chunk...",
  "doc_id": "doc42",
  "title": "Q3 Financial Report",
  "url": "https://...",
  "tokens": 412,
  "indexed_at": "2026-04-15T10:00:00Z"
}
The text itself goes into the prompt later — don’t lose it.
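A minimal indexing sketch, assuming Qdrant as the store and reusing chunk() and embed_all() from above; the collection name, document metadata, and tenant_id field are illustrative, not prescriptive:
python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

store = QdrantClient(url="http://localhost:6333")
store.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # must match the embedding dim
)

chunks = chunk(document_text)  # document_text: clean text extracted from one source document
store.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=i,
            vector=vec,
            payload={  # metadata rides along as the payload, original text included
                "text": chunks[i],
                "doc_id": "doc42",
                "title": "Q3 Financial Report",
                "tenant_id": "acme",  # assumed multi-tenant field, filtered on at query time
                "indexed_at": "2026-04-15T10:00:00Z",
            },
        )
        for i, vec in enumerate(embed_all(chunks))
    ],
)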

5. Query

Embed the user’s question with the same model, then pull top-k:
python
def search(question: str, k: int = 6) -> list[dict]:
    q = client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding
    return vector_store.query(vector=q, top_k=k, include_metadata=True)
Tune k empirically — usually 4–8. Below 3, you miss relevant chunks; above 10, you push noise into the prompt.

6. Generate the answer

Format retrieved chunks into the prompt with clear separators and source citations:
python
def answer(question: str) -> str:
    hits = search(question, k=6)
    context = "\n\n---\n\n".join(
        f"[{i+1}] {h['text']}\n(Source: {h['title']})"
        for i, h in enumerate(hits)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
                "Answer using ONLY the provided context. "
                "Cite sources as [1], [2]. "
                "If the context doesn't contain the answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
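The strict system prompt is doing real work here: when retrieval misses, you want "the context doesn't contain the answer", not a confident guess. Calling it end to end (the question is an illustrative placeholder):
python
print(answer("What were the key risks flagged in the Q3 financial report?"))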

Improvements that earn their cost

  • Hybrid search — combine vector similarity with BM25 keyword scores; a rank-fusion sketch follows this list. ~10–15% recall lift on technical content.
  • Re-ranking — fetch top-30 by vector, re-rank to top-6 with a cross-encoder (coming soon: dedicated reranker endpoint, see changelog).
  • Query rewriting — for chat, ask the LLM to rewrite the user’s follow-up into a standalone search query before embedding (sketched below).
  • Multi-query — generate 3–5 paraphrases of the question, search each, dedupe results. Catches lexical mismatches.
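The first and third of these need little code. For hybrid search, reciprocal rank fusion is one common way to merge a BM25 ranking with the vector ranking; the inputs here are just ranked lists of chunk IDs from whatever keyword and vector searches you run:
python
def rrf(rankings: list[list[str]], k: int = 60, top_n: int = 6) -> list[str]:
    # Reciprocal rank fusion: score(id) = sum over rankings of 1 / (k + rank).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
Query rewriting is a single extra chat call before embedding; a small, cheap model is enough, and the prompt wording here is only a starting point:
python
def rewrite_query(history: list[dict], followup: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history + [
            {"role": "user", "content":
                "Rewrite my last question as a standalone search query, "
                "resolving any pronouns from our conversation. "
                f"Question: {followup}\nReturn only the query."},
        ],
    )
    return resp.choices[0].message.content.strip()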

Common mistakes

  • Mixing embedding models in the same store — vectors don’t compare. Re-embed everything when you switch.
  • Embedding raw HTML/PDF bytes — extract clean text first.
  • Chunks too large — model only “sees” the centre; edges are wasted.
  • No metadata filter — searching all customers’ data for one customer’s question. Always filter by tenant first (see the sketch after this list).
  • Forgetting to track indexed_at — when documents update, you need to know what’s stale.
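For the tenant-filter point, here is a filtered version of the step-5 search, again assuming the Qdrant setup from the indexing sketch; tenant_id matches the payload field stored there:
python
from qdrant_client.models import FieldCondition, Filter, MatchValue

def search_tenant(question: str, tenant_id: str, k: int = 6):
    q = client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding
    return store.search(
        collection_name="docs",
        query_vector=q,
        # Narrow to one tenant's chunks before ranking by similarity.
        query_filter=Filter(must=[
            FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
        ]),
        limit=k,
    )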

Cost ballpark

For 100,000 chunks × 400 tokens = 40M tokens:
  • Index once with text-embedding-3-small (≈ $0.02 / 1M tokens) → ~$0.80
  • Query at 1 query / sec (≈ 50 tokens each, ≈ 4.3M tokens / day) → ~$0.09 / day
Embeddings are the cheapest part of any RAG system. Spend the budget on the chat model.