Knowledge Base

Ground your agent’s answers in your own documents with retrieval-augmented generation

A knowledge base is a bundle of documents (PDF, plain text, markdown, or HTML) that your voice agent can consult during a call. You upload once; the server extracts, chunks, embeds, and indexes the content. Every agent attached to the knowledge base gets a built-in search_knowledge tool that retrieves the most relevant excerpts in real time.

Why use it

The LLM only knows what’s in its prompt. If you need it to answer from product manuals, policy documents, an FAQ, or internal runbooks, inlining everything into the system prompt is expensive and doesn’t scale past a few pages. A knowledge base gives the agent a cheap, fast way to look up exactly the passage it needs, when it needs it.

Create a knowledge base

```python
from speechify import Speechify

client = Speechify()

kb = client.tts.knowledge_bases.create(
    name="Product Handbook",
    description="Manuals, FAQs, troubleshooting",
)
print(kb.id)
```

Upload a document

Multipart upload. Max 10 MB per file.

```python
with open("manual.pdf", "rb") as f:
    doc = client.tts.knowledge_bases.upload_document(
        id=kb.id,
        file=f,
    )
print(doc.status, doc.chunk_count)
```

The response includes a `status` field that transitions from `embedding` to `ready` once every chunk is indexed. The upload call is synchronous; expect a few seconds per megabyte of input.

| Status | Meaning |
| --- | --- |
| `embedding` | Chunks are being embedded and inserted. |
| `ready` | All chunks indexed; the document is searchable. |
| `failed` | Extraction or embedding failed. See `error` for details. |
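If you want to wait for the `embedding`-to-`ready` transition before querying, a small poll loop is enough. The document-retrieval endpoint is not shown above, so this sketch takes any zero-argument status-returning callable rather than assuming a specific SDK method:

```python
import time

def wait_until_ready(fetch_status, timeout=60.0, interval=1.0):
    """Poll until the document is searchable.

    `fetch_status` is any zero-argument callable returning one of
    "embedding", "ready", or "failed" -- e.g. a lambda wrapping whatever
    document-retrieval call your SDK version exposes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError("document ingestion failed; check the error field")
        time.sleep(interval)
    raise TimeoutError("document not ready within timeout")
```

Since ingestion usually finishes within seconds, a short timeout with a sub-second interval is a reasonable default.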

Attach to an agent

```python
client.tts.agents.attach_knowledge_base(id=agent.id, kb_id=kb.id)
```

On the next conversation for that agent, search_knowledge is auto-registered as a function tool. The LLM decides when to call it based on the caller’s question; you don’t have to modify the agent prompt.

The tool is scoped to exactly the knowledge bases attached to the agent — it cannot query anything else, regardless of what the worker sends.
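For orientation, the auto-registered tool looks roughly like a standard function-tool definition. The exact schema is managed server-side; the field names below follow the common OpenAI function-tool shape and are an assumption, not the wire format:

```python
# Illustrative only: the server owns the real definition of this tool.
search_knowledge_tool = {
    "type": "function",
    "function": {
        "name": "search_knowledge",
        "description": (
            "Search the knowledge bases attached to this agent for "
            "passages relevant to the caller's question."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural-language search query.",
                },
            },
            "required": ["query"],
        },
    },
}
```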

Search via the API

You can also run semantic search directly, outside a conversation. Useful for UIs that want to show grounded snippets, or for verifying what the agent would retrieve.

```python
result = client.tts.knowledge_bases.search(
    kb_ids=[kb.id],
    query="what is the return policy for refurbished hardware",
    top_k=5,
)
for hit in result.hits:
    print(hit.filename, hit.score, hit.content[:120])
```

Each hit includes the source filename, the chunk content, and a cosine-similarity score. Scores are relative — use them for ranking, not as an absolute confidence metric.
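Because scores are only comparable within a single result set, one practical pattern is a relative cutoff against the top hit rather than a fixed absolute threshold. A minimal sketch, with plain dicts standing in for the SDK's hit objects:

```python
def filter_hits(hits, rel_threshold=0.8):
    """Keep hits scoring within `rel_threshold` of the best hit.

    Scores are relative to the query and result set, so the cutoff
    is anchored to the top score instead of a fixed absolute value.
    """
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    floor = ranked[0]["score"] * rel_threshold
    return [h for h in ranked if h["score"] >= floor]

hits = [
    {"filename": "manual.pdf", "score": 0.52},
    {"filename": "faq.md", "score": 0.49},
    {"filename": "notes.txt", "score": 0.31},
]
print(filter_hits(hits))  # drops the 0.31 outlier, keeps the close pair
```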

How it works

  1. Extract — the server reads the upload and extracts text. PDFs use per-page parsing with graceful skip-on-error; HTML is stripped to plain text; markdown and plain text pass through.
  2. Chunk — text is split into overlapping 1000-character windows with 200 characters of overlap. Chunk boundaries prefer paragraph breaks, then sentence ends, then spaces, so each chunk reads as a coherent passage.
  3. Embed — chunks are embedded in batches with OpenAI text-embedding-3-large (1536 dimensions via Matryoshka truncation).
  4. Index — embeddings land in Postgres pgvector with a cosine-distance IVFFlat index. Search is ANN (approximate nearest-neighbor), sub-50ms on the indexed path.
  5. Query — at call time, the search_knowledge tool sends the user’s question to the server. The server embeds the query, runs the ANN search, and returns the top-k chunks with filenames and scores for the LLM to quote.
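The chunking step above can be sketched as follows. This is an approximation of the described behavior (1000-character windows, 200 characters of overlap, boundary preference paragraph break, then sentence end, then space), not the server's exact implementation:

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping windows, preferring natural boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer a paragraph break, then a sentence end, then a space,
            # but only if the boundary is not in the first half of the window.
            for sep in ("\n\n", ". ", " "):
                cut = window.rfind(sep)
                if cut > size // 2:
                    end = start + cut + len(sep)
                    break
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```

The half-window guard keeps a degenerate early boundary from producing tiny chunks, and stepping back by `overlap` means each chunk repeats the tail of the previous one, so a passage split across a boundary still appears whole in at least one chunk.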

Tips

  • One knowledge base per topic. A “Product Manuals” KB and a “Billing Policies” KB will retrieve more relevantly than a single “Everything” KB, because ANN search ranks within the whole pool.
  • Curate your source documents. Out-of-date or contradictory documents will surface; the retriever has no way to know which version is correct.
  • Expect a little added latency on retrieval turns. The search_knowledge tool adds one embedding round-trip and one DB query to the turn; in our measurements this is typically 200-500ms, noticeable but not disruptive.
  • Monitor the transcript. Every search_knowledge call is logged as a role=tool message on the conversation, including the query the LLM used and the chunks returned. If the agent is answering incorrectly, that’s the first place to look.
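Pulling those tool messages out of a transcript for inspection is a short filter. The message shape here, a list of role/content dicts, is an assumption for illustration; adapt it to whatever structure your transcript endpoint returns:

```python
def tool_messages(transcript):
    """Return only the role=tool entries from a conversation transcript."""
    return [m for m in transcript if m.get("role") == "tool"]

# Hypothetical transcript excerpt for illustration.
transcript = [
    {"role": "user", "content": "Do refurbished units have a warranty?"},
    {"role": "tool", "content": "query='refurbished warranty' -> 3 chunks"},
    {"role": "assistant", "content": "Yes, refurbished units carry a 90-day warranty."},
]
for m in tool_messages(transcript):
    print(m["content"])
```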