Knowledge Sources, comxbot help

Supported source types

URLs: Provide individual page URLs or a sitemap URL. Comxbot will crawl all linked pages, extract text content, and chunk it for retrieval. Ideal for help centres, blogs, and product documentation.

PDFs: Upload PDF documents directly. Comxbot extracts the text (including OCR for scanned documents on Pro plans), splits it into semantic chunks, and embeds them.

Sitemaps: Provide a sitemap.xml URL and comxbot will discover and crawl all pages listed. Useful for indexing an entire website efficiently.

Google Drive: Connect a Google Drive folder. Comxbot will index all supported documents (Docs, Sheets, Slides, PDFs) within the folder and sub-folders. Requires OAuth authorisation.

OneDrive: Works like Google Drive. Connect a OneDrive or SharePoint folder and comxbot indexes its documents automatically.

Manual Q&A: Create custom question-and-answer pairs directly in the dashboard. Useful for FAQs, product specs, or information not available in existing documents.
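As an illustration of the sitemap discovery described above, here is a minimal sketch of how page URLs can be read out of a sitemap.xml file. It uses only the Python standard library and the public sitemap namespace. The function name urls_from_sitemap and the example.com URLs are hypothetical, and this is not comxbot's actual crawler.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document.

    Note: a sitemap index file also uses <loc> for its child sitemaps,
    so those entries would be returned as well.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap content, for illustration only.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/help/getting-started</loc></url>
  <url><loc>https://example.com/help/billing</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in urls_from_sitemap(EXAMPLE):
        print(url)
```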

Source health explained

Every knowledge source in comxbot receives a health score from 0-100, composed of three factors:

Freshness (0-100): Measures how recently the source content was updated or re-synced. Sources that haven't been refreshed in over 30 days begin to lose freshness points. Automatic re-sync schedules prevent freshness decay.

Citation Rate (0-100): Tracks how often this source is actually cited in assistant responses. A low citation rate may indicate the content isn't relevant to the questions being asked, or that chunks are poorly structured.

Retrieval Score (0-100): Measures how well the source's chunks match incoming queries during vector search. A low retrieval score suggests that the content may need to be re-chunked or that the embedding model may need to be updated.

Stale Detection: When freshness drops below 40, or when content at the source URL has changed significantly since the last crawl, comxbot flags the source as stale and can optionally deprioritise it in retrieval.

You can view health trends over time in the Source Health Centre. Set up alerts to be notified when a source drops below a threshold.
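The following sketch shows one way the three factors and the stale rule could fit together. The stale threshold of 40 comes from the description above. The equal weighting of the three factors is an assumption made purely for illustration, since the actual weighting is not documented here, and the class and field names are hypothetical.

```python
from dataclasses import dataclass

# Threshold taken from the Stale Detection description.
STALE_FRESHNESS_THRESHOLD = 40


@dataclass
class SourceHealth:
    freshness: float          # 0-100
    citation_rate: float      # 0-100
    retrieval_score: float    # 0-100
    content_changed: bool = False  # significant change since the last crawl

    def overall(self) -> float:
        # ASSUMPTION: equal weights. The real formula may weight factors differently.
        return (self.freshness + self.citation_rate + self.retrieval_score) / 3

    def is_stale(self) -> bool:
        # Stale if freshness is below the threshold OR the content has changed.
        return self.freshness < STALE_FRESHNESS_THRESHOLD or self.content_changed


if __name__ == "__main__":
    health = SourceHealth(freshness=30, citation_rate=90, retrieval_score=90)
    print(health.overall(), health.is_stale())
```

Note that a source can have a respectable overall score and still be flagged as stale, because the stale rule looks only at freshness and content changes.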

How ingestion works

When you add or refresh a source, comxbot runs an ingestion pipeline with these steps:

1. Fetch: Content is downloaded from the source (crawl URL, read PDF, pull from Drive API).

2. Extract: Raw text is extracted from HTML, PDF, or document formats. Metadata (title, headings, dates) is preserved.

3. Chunk: Text is split into semantic chunks using a sliding-window approach. The default chunk size is 512 tokens with a 50-token overlap. You can adjust both values in source settings.

4. Embed: Each chunk is converted to a vector embedding using your configured model (default: OpenAI text-embedding-3-small).

5. Index: Vectors are stored in the search index, tagged with source metadata for filtering and citation.

6. Health Score: Initial health metrics are computed and the source appears in your dashboard.
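To make the Chunk step more concrete, here is a minimal sketch of sliding-window chunking with the default values mentioned above (512 tokens, 50-token overlap). It works on an already tokenised list and uses fixed-size windows only. It does not attempt the semantic boundary detection that comxbot performs, and the function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 50) -> list[list]:
    """Split a token list into overlapping fixed-size windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    step = chunk_size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input,
        # so no trailing chunk is made up entirely of overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    tokens = list(range(1000))  # stand-in for real tokens
    for chunk in chunk_tokens(tokens):
        print(chunk[0], chunk[-1], len(chunk))
```

With the defaults, each new chunk starts 462 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.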

Ingestion jobs run asynchronously. You can track progress via the ingest jobs list or the API (GET /api/org/v1/ingest-jobs).
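As a rough sketch of checking ingestion jobs through the API, the snippet below builds a request for the endpoint named above using only the Python standard library. Only the path GET /api/org/v1/ingest-jobs comes from this page. The base URL, the Bearer token header, and the assumption that the response body is JSON are all hypothetical, so check your API settings for the real values.

```python
import json
import urllib.request

INGEST_JOBS_PATH = "/api/org/v1/ingest-jobs"  # path given in the text above


def build_ingest_jobs_request(base_url: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the ingest jobs list."""
    url = base_url.rstrip("/") + INGEST_JOBS_PATH
    return urllib.request.Request(
        url,
        headers={
            # ASSUMPTION: Bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {api_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_ingest_jobs(base_url: str, api_token: str, timeout: float = 10.0):
    """Send the request and decode the body, assuming it is JSON."""
    request = build_ingest_jobs_request(base_url, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because jobs run asynchronously, a script would typically call fetch_ingest_jobs repeatedly with a pause between calls until the job it cares about has finished.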

Troubleshooting failed sources

Source ingestion can fail for several reasons. Here are the most common issues and fixes:

401/403 Errors: The URL requires authentication or denies access to the crawler. Ensure the page is publicly accessible, or use the authenticated crawl option in source settings.

Timeout: Very large pages or slow servers may cause the crawler to time out. Try reducing the crawl scope or increasing the timeout in settings.

Empty Content: Some pages render their content only via JavaScript. Comxbot uses a static crawler by default, so it sees these pages as empty. Switch to the 'JavaScript rendering' option for single-page applications (SPAs).

PDF Extraction Failed: Corrupted or password-protected PDFs cannot be processed. Ensure the PDF is valid and unprotected.

Google Drive Permission Denied: Re-authorise the Google connection and ensure the folder is shared with the connected account.

Rate Limited: If comxbot crawls many URLs in quick succession, the target server may rate-limit its requests. Enable polite crawling, which adds delays between requests.

Check the source detail page for specific error messages and retry options. If issues persist, contact support with the source ID.
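For readers curious about what polite crawling means in practice, the sketch below shows the basic idea mentioned under Rate Limited: pause between consecutive requests so the target server is not flooded. It is a generic illustration, not comxbot's crawler. The function name, the one-second default delay, and the injectable fetch and sleep parameters are all hypothetical.

```python
import time
from typing import Callable, Iterable


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,  # ASSUMPTION: illustrative default, not a comxbot setting
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch each URL in turn, waiting delay_seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no delay before the very first request
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the example runs without network access.
    pages = polite_fetch_all(
        ["https://example.com/a", "https://example.com/b"],
        fetch=lambda url: f"content of {url}",
        delay_seconds=0.1,
    )
    print(pages)
```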