Vector search & ingest, comxbot help

What is semantic search?

Instead of matching keywords, vector search matches meaning. 'How much is a boiler service?' will find your fees page even if that page doesn't use the words 'how much'.

We use OpenAI's text-embedding-3-small to convert each chunk of content (and each visitor question) into a 1,536-dimensional vector, then find the nearest neighbours by cosine similarity.
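To make "nearest neighbours by cosine similarity" concrete, here is a minimal sketch in plain Python. It uses toy 3-dimensional vectors rather than real 1,536-dimensional embeddings, and the names `cosine_similarity` and `nearest` are illustrative only, not part of comxbot.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def nearest(query, chunks, k=8):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), cid) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]

# Toy data: made-up vectors standing in for real embeddings.
toy_chunks = {
    "fees": [0.9, 0.1, 0.0],
    "contact": [0.0, 1.0, 0.0],
    "about": [0.5, 0.5, 0.0],
}
top_two = nearest([1.0, 0.0, 0.0], toy_chunks, k=2)
```

In production the comparison is done inside Postgres by pgvector rather than in application code; the arithmetic is the same.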

The ingest pipeline

When you add a knowledge source, it goes through five steps: text extraction → chunking → embedding → storage → status update.
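The five steps can be pictured as one function that calls each stage in turn. This is a rough sketch, not the actual pipeline: the `ingest` function, the dict-shaped `source`, and the four injected callables are hypothetical. The status names are the ones this page lists.

```python
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    HEALTHY = "HEALTHY"

def ingest(source, extract, chunk, embed, store):
    """Run one source through the five steps; returns the number of chunks stored.

    `extract`, `chunk`, `embed` and `store` are placeholders for the real stages.
    """
    source["status"] = Status.PROCESSING
    text = extract(source)                            # 1. text extraction
    chunks = chunk(text)                              # 2. chunking
    vectors = [embed(c) for c in chunks]              # 3. embedding
    store(source["id"], list(zip(chunks, vectors)))   # 4. storage
    source["status"] = Status.HEALTHY                 # 5. status update
    return len(chunks)

# Usage with stub stages, just to show the data flow.
stored = {}
demo_source = {"id": "src-1", "status": Status.PENDING, "body": "Fees. Opening hours."}
count = ingest(
    demo_source,
    extract=lambda s: s["body"],
    chunk=lambda text: text.split(" ", 1),
    embed=lambda c: [float(len(c))],
    store=lambda sid, rows: stored.update({sid: rows}),
)
```

How failures are recorded is not covered on this page, so the sketch simply lets exceptions propagate.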

Extractors used: unpdf for PDFs, mammoth for Word, SheetJS for Excel, officeparser for PowerPoint, plain fetch for URLs, sitemap walk for sitemaps.

Chunks are typically 1,200 characters with smart breaks at paragraph and sentence boundaries. The status moves PENDING → PROCESSING → HEALTHY.
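One way to get breaks at paragraph and sentence boundaries is to pack whole paragraphs greedily and only fall back to sentences when a paragraph is too long. The sketch below assumes that approach. The `chunk_text` function, its regular expressions, and the hard `target` limit are illustrative, not the production chunker.

```python
import re

def chunk_text(text, target=1200):
    """Split text into chunks of at most `target` characters where possible.

    Paragraphs are kept whole if they fit; oversized paragraphs are split
    into sentences. A single sentence longer than `target` is kept as its
    own oversized chunk rather than being cut mid-sentence.
    """
    units = []  # (separator to use before this unit, unit text)
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= target:
            units.append(("\n\n", para))
            continue
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", para) if s]
        for i, sentence in enumerate(sentences):
            units.append(("\n\n" if i == 0 else " ", sentence))

    chunks, current = [], ""
    for sep, unit in units:
        if not current:
            current = unit
        elif len(current) + len(sep) + len(unit) <= target:
            current += sep + unit
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks

# Small target so the behaviour is easy to see.
demo = chunk_text(
    "One two. Three four.\n\nFive six seven eight nine ten. Eleven.", target=20
)
```

With the small target used here, the first paragraph stays whole and the second is split at its sentence boundary.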

What runs the query side

For every visitor message, we embed the message into a vector, run a cosine-similarity query through a pgvector ivfflat index, and fetch the top 8 chunks.
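A query of this shape might look like the sketch below. The `<=>` cosine-distance operator, the `vector_cosine_ops` operator class and the `ivfflat` index type are pgvector's own. The table name `knowledge_chunks`, its columns, the `lists` value and the helper functions are assumptions made for illustration. The placeholders use psycopg-style named parameters, and no database connection is opened here.

```python
# Hypothetical schema: knowledge_chunks(id, workspace_id, source_id, content, embedding vector(1536))
INDEX_SQL = (
    "CREATE INDEX ON knowledge_chunks "
    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
)

TOP_K_SQL = """
SELECT id, source_id, content,
       1 - (embedding <=> %(q)s::vector) AS similarity
FROM knowledge_chunks
WHERE workspace_id = %(workspace_id)s
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s
"""

def to_vector_literal(embedding):
    """Format a list of numbers as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

def build_top_k_query(embedding, workspace_id, k=8):
    """Return (sql, params) ready to pass to a DB-API cursor's execute()."""
    params = {
        "q": to_vector_literal(embedding),
        "workspace_id": workspace_id,
        "k": k,
    }
    return TOP_K_SQL, params

sql, params = build_top_k_query([0.5, 1, -2], workspace_id="ws-1")
```

`<=>` returns cosine distance, so similarity is `1 - distance`, and ordering by ascending distance is what lets the ivfflat index serve the query.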

Each chunk's similarity score is then adjusted by a health-aware multiplier (recency and citation-rate boosters) before we re-rank. The top-k chunks after re-ranking are passed to the LLM as context.
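As an illustration of the re-ranking step, the sketch below multiplies each chunk's similarity by `1 + recency_boost + citation_boost` and sorts on the result. That formula, the `Candidate` fields and the boost values are assumptions; this page does not specify how the multiplier is calculated.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    chunk_id: str
    similarity: float      # cosine similarity from the vector query
    recency_boost: float   # hypothetical, e.g. 0.0 to 0.2
    citation_boost: float  # hypothetical, e.g. 0.0 to 0.2

def adjusted_score(c):
    """Similarity scaled by an assumed health-aware multiplier."""
    return c.similarity * (1.0 + c.recency_boost + c.citation_boost)

def rerank(candidates, k=8):
    """Order candidates by adjusted score, best first, and keep the top k."""
    return sorted(candidates, key=adjusted_score, reverse=True)[:k]

ranked = rerank([
    Candidate("a", similarity=0.80, recency_boost=0.0, citation_boost=0.0),
    Candidate("b", similarity=0.75, recency_boost=0.10, citation_boost=0.05),
])
order = [c.chunk_id for c in ranked]
```

Here chunk "b" starts with a lower raw similarity than "a" but moves ahead once its boosts are applied, which is the point of re-ranking after retrieval.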

The model cites sources using [n] notation, which the widget renders as clickable chips with the source name and chunk preview.
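Turning [n] markers into chips comes down to finding the markers and looking up the matching source. The sketch below assumes that numbering is 1-based and follows the order in which chunks were passed to the model. The `extract_citations` function and the shape of `sources` are hypothetical.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def extract_citations(answer, sources):
    """Return the distinct sources cited in `answer`, in order of first mention.

    Markers that point outside the supplied sources are ignored.
    """
    seen, cited = set(), []
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(sources) and n not in seen:
            seen.add(n)
            cited.append({"n": n, **sources[n - 1]})
    return cited

demo_sources = [
    {"name": "Booking page", "preview": "Book a visit online..."},
    {"name": "Fees page", "preview": "A boiler service costs..."},
]
chips = extract_citations(
    "A boiler service has a fixed fee [2]. Book online [1][2] or see [7].",
    demo_sources,
)
```

Each returned entry carries the source name and chunk preview that a chip would display.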

Source health monitoring

Every source has a health score computed from four factors: freshness against the freshness SLA window (default 7 days for URLs, longer for documents), parse-error count, 7-day retrieval hit rate, and 7-day citation rate.

Sources scoring below 0.3 are flagged 'critical' and effectively excluded from retrieval. The Source Health Centre shows the breakdown per source.
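To show how four factors could combine into a single score, the sketch below takes an unweighted average. Only the four inputs and the 0.3 'critical' threshold come from this page. The equal weighting, the freshness decay and the parse-error penalty are assumptions, and the real formula may differ.

```python
CRITICAL_THRESHOLD = 0.3  # from this page: below this a source is 'critical'

def health_score(age_days, sla_days, parse_errors, hit_rate, citation_rate):
    """Average four factors, each scaled to the range 0..1 (assumed weighting)."""
    if sla_days <= 0:
        raise ValueError("sla_days must be positive")
    # Fully fresh inside the SLA window, then decaying as the content ages.
    freshness = 1.0 if age_days <= sla_days else sla_days / age_days
    # Each parse error lowers this factor; no errors gives 1.0.
    parse_factor = 1.0 / (1.0 + parse_errors)
    hit = min(max(hit_rate, 0.0), 1.0)
    cited = min(max(citation_rate, 0.0), 1.0)
    return (freshness + parse_factor + hit + cited) / 4.0

def is_critical(score):
    return score < CRITICAL_THRESHOLD

fresh = health_score(age_days=2, sla_days=7, parse_errors=0,
                     hit_rate=0.5, citation_rate=0.5)
stale = health_score(age_days=70, sla_days=7, parse_errors=9,
                     hit_rate=0.0, citation_rate=0.0)
```

In this example the fresh, error-free source scores well above the threshold, while the stale source with repeated parse errors and no recent hits falls into 'critical'.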

Re-ingesting a source

Click 'Re-ingest' on any source to re-fetch its content and rebuild the chunks. This is useful when you've updated the underlying document and want the assistant to pick up the changes immediately rather than waiting for the freshness SLA to expire.

Why pgvector and not a vector database?

Your knowledge embeddings live in the same Postgres database as the rest of your workspace data. No separate vendor, no cross-system data sync, no extra latency.

pgvector's ivfflat index handles millions of chunks per workspace. For very large deployments we'd recommend dedicated indexing; talk to us if you're approaching 10M+ chunks.