Configuring the KnowledgeBase Builder
The KnowledgeBase Builder supports the following environment variables:
- OPENAI_API_KEY: OpenAI API key (overrides api_key_file)
- VOYAGE_API_KEY: Voyage AI API key (overrides api_key_file)
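The override rule above can be pictured with a small shell sketch. The helper name `resolve_api_key` is hypothetical; it only mirrors the documented precedence (environment variable first, key file second) and is not the builder's actual implementation.

```shell
# resolve_api_key VAR_NAME KEY_FILE
# Hypothetical helper: prints the key from the environment variable if it is
# set and non-empty, otherwise the first line of the key file.
resolve_api_key() {
  var_name=$1
  key_file=$2
  # Indirect lookup of the variable named by $var_name (POSIX-compatible).
  eval "env_value=\${$var_name:-}"
  if [ -n "$env_value" ]; then
    printf '%s\n' "$env_value"
  elif [ -r "$key_file" ]; then
    head -n 1 "$key_file"
  else
    echo "no API key found for $var_name" >&2
    return 1
  fi
}
```

For example, running `OPENAI_API_KEY=sk-your-openai-key ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml` would use the environment value for that run even if a key file is configured.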
Building Your First KnowledgeBase
The following steps walk you through the process of building a KnowledgeBase:
- Install the prerequisites:
    # For Ollama (optional)
    ollama pull nomic-embed-text
- Set up API keys (for OpenAI or Voyage):
    # For OpenAI
    echo "sk-your-openai-key" > ~/.openai-api-key
    chmod 600 ~/.openai-api-key
    # For Voyage
    echo "pa-your-voyage-key" > ~/.voyage-api-key
    chmod 600 ~/.voyage-api-key
- Create the configuration file:
    cp pgedge-nla-kb-builder.yaml.example pgedge-nla-kb-builder.yaml
    # Edit pgedge-nla-kb-builder.yaml to configure sources and embedding providers
- Build the KnowledgeBase:
    ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml
- Configure the MCP server, and add the KnowledgeBase configuration:
    knowledgebase:
      enabled: true
      database_path: "./pgedge-nla-kb.db"
      embedding_provider: "openai"
      embedding_model: "text-embedding-3-small"
      embedding_openai_api_key_file: "~/.openai-api-key"
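Before running a build, it can help to confirm that a key file from the steps above exists and has the restrictive mode the steps set. The following is a minimal sketch; `check_key_file` is a hypothetical helper name, not part of the kb-builder.

```shell
# check_key_file PATH
# Succeeds only if PATH is a non-empty regular file with mode 600.
check_key_file() {
  key_file=$1
  if [ ! -s "$key_file" ]; then
    echo "missing or empty key file: $key_file" >&2
    return 1
  fi
  # find prints the path only when the permission bits are exactly 600.
  if [ -z "$(find "$key_file" -prune -type f -perm 600)" ]; then
    echo "key file should be mode 600: $key_file" >&2
    return 1
  fi
}
```

A typical call would be `check_key_file ~/.openai-api-key && ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml`.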
Incremental Updates
The kb-builder supports incremental processing:
- The kb-builder pulls Git repositories to get the latest changes.
- The kb-builder reprocesses only modified files.
- Unchanged files reuse existing chunks and embeddings.
- Use --skip-updates to skip git pull during development.
For example:
# Initial build (full processing)
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml
# Later update (only changed files)
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml
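The source does not say how the kb-builder detects modified files. Purely as an illustration of the general idea, the sketch below records a checksum per file and reports only files that are new or changed since the last manifest; the function names and the use of `cksum` are assumptions, not the tool's mechanism.

```shell
# write_manifest DIR: print "checksum size path" for every file under DIR.
write_manifest() {
  find "$1" -type f | sort | while IFS= read -r f; do
    printf '%s %s\n' "$(cksum < "$f")" "$f"
  done
}

# changed_files OLD_MANIFEST DIR: print files that are new or whose checksum
# differs from the entry recorded in OLD_MANIFEST.
changed_files() {
  write_manifest "$2" | while IFS= read -r line; do
    if ! grep -Fqx -- "$line" "$1" 2>/dev/null; then
      # Strip the leading "checksum size " fields to leave the path.
      printf '%s\n' "${line#* * }"
    fi
  done
}
```

Unchanged files produce identical manifest lines and are skipped, which is the same property that lets an incremental build reuse existing chunks and embeddings.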
Managing Embeddings
If a build fails partway through or you enable a new provider, you can generate any missing embeddings with the following command:
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --add-missing-embeddings
This command generates only the missing embeddings and skips files that already have them.
Clearing Embeddings
Use the following syntax to clear embeddings for a specific provider:
# Clear OpenAI embeddings
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --clear-embeddings openai
# Clear Voyage embeddings
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --clear-embeddings voyage
# Clear Ollama embeddings
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --clear-embeddings ollama
After clearing a provider's embeddings, you can rebuild them with:
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --add-missing-embeddings
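The clear-then-rebuild sequence can be wrapped in a small function. This is a sketch: `rebuild_embeddings`, `KB_BUILDER`, and `KB_CONFIG` are hypothetical names introduced here, while the flags are the ones shown above.

```shell
# Hypothetical defaults; override them in the environment if needed.
KB_BUILDER=${KB_BUILDER:-./pgedge-nla-kb-builder}
KB_CONFIG=${KB_CONFIG:-pgedge-nla-kb-builder.yaml}

# rebuild_embeddings PROVIDER
# Clears the embeddings for one provider, then regenerates whatever is missing.
rebuild_embeddings() {
  case "$1" in
    openai|voyage|ollama) ;;
    *) echo "unknown provider: $1" >&2; return 2 ;;
  esac
  # Only rebuild if the clear step succeeded.
  "$KB_BUILDER" --config "$KB_CONFIG" --clear-embeddings "$1" &&
    "$KB_BUILDER" --config "$KB_CONFIG" --add-missing-embeddings
}
```

For example, `rebuild_embeddings voyage` runs both commands in order and stops if clearing fails.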
Embedding Content from Different Sources
The following examples demonstrate using different sources for your KnowledgeBase information.
PostgreSQL Documentation Only
database_path: "pgedge-nla-kb.db"
doc_source_path: "doc-source"
sources:
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_17_STABLE"
    doc_path: "doc/src/sgml"
    project_name: "PostgreSQL"
    project_version: "17"
embeddings:
  openai:
    enabled: true
    api_key_file: "~/.openai-api-key"
    model: "text-embedding-3-small"
    dimensions: 1536
  voyage:
    enabled: false
  ollama:
    enabled: false
Multiple PostgreSQL Versions with Voyage AI
database_path: "postgres-multi-version-kb.db"
doc_source_path: "postgres-docs"
sources:
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_17_STABLE"
    doc_path: "doc/src/sgml"
    project_name: "PostgreSQL"
    project_version: "17"
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_16_STABLE"
    doc_path: "doc/src/sgml"
    project_name: "PostgreSQL"
    project_version: "16"
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_15_STABLE"
    doc_path: "doc/src/sgml"
    project_name: "PostgreSQL"
    project_version: "15"
embeddings:
  openai:
    enabled: false
  voyage:
    enabled: true
    api_key_file: "~/.voyage-api-key"
    model: "voyage-3"
  ollama:
    enabled: false
Local Development Sources with Ollama
database_path: "local-kb.db"
doc_source_path: "local-docs"
sources:
  - local_path: "~/projects/myapp"
    doc_path: "docs"
    project_name: "MyApp"
    project_version: "dev"
  - local_path: "/opt/docs/internal"
    doc_path: "."
    project_name: "Internal Docs"
    project_version: "latest"
embeddings:
  openai:
    enabled: false
  voyage:
    enabled: false
  ollama:
    enabled: true
    endpoint: "http://localhost:11434"
    model: "nomic-embed-text"
Then run:
# Make sure Ollama is running and model is pulled
ollama pull nomic-embed-text
# Build the knowledgebase
./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml
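To avoid pulling the model on every run, you can check whether it is already present. The sketch below assumes that `ollama list` prints a header row followed by one model per line with `NAME:TAG` in the first column; `has_model` is a hypothetical helper name.

```shell
# has_model NAME
# Reads `ollama list` output on stdin and succeeds if NAME is listed,
# with or without a ":tag" suffix.
has_model() {
  awk -v m="$1" '
    NR > 1 {
      split($1, parts, ":")
      if ($1 == m || parts[1] == m) found = 1
    }
    END { exit !found }
  '
}
```

A typical call would be `ollama list | has_model nomic-embed-text || ollama pull nomic-embed-text`.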
Multiple Embedding Providers (Recommended for Flexibility)
This configuration generates embeddings from all three providers, allowing the MCP server to use any available provider for search based on its own configuration.
database_path: "multi-provider-kb.db"
doc_source_path: "doc-source"
sources:
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_17_STABLE"
    doc_path: "doc/src/sgml"
    project_name: "PostgreSQL"
    project_version: "17"
embeddings:
  # Enable all three providers
  # MCP server can use any one for search
  openai:
    enabled: true
    api_key_file: "~/.openai-api-key"
    model: "text-embedding-3-small"
    dimensions: 1536
  voyage:
    enabled: true
    api_key_file: "~/.voyage-api-key"
    model: "voyage-3"
  ollama:
    enabled: true
    endpoint: "http://localhost:11434"
    model: "nomic-embed-text"
pgEdge Knowledgebase Builder Configuration - Example
The following example demonstrates a complete kb-builder configuration file with all available options.
# The kb-builder tool processes documentation from multiple sources (Git repos,
# local paths) and builds a searchable SQLite database with vector embeddings.
#
# Configuration Priority (highest to lowest):
# 1. Command line flags (--database, --config)
# 2. Configuration file values (this file)
# 3. Hard-coded defaults
#
# Copy this file to pgedge-nla-kb-builder.yaml and customize as needed.
# By default, kb-builder looks for config in the same directory as the binary.
# ============================================================================
# OUTPUT DATABASE CONFIGURATION
# ============================================================================
# Path to the output SQLite knowledgebase database
# Default: pgedge-nla-kb.db in same directory as config file
# Command line flag: --database or -d
database_path: "pgedge-nla-kb.db"
# ============================================================================
# DOCUMENTATION SOURCE DIRECTORY
# ============================================================================
# Directory for storing downloaded/processed documentation
# Git repositories will be cloned here
# Default: doc-source in same directory as config file
doc_source_path: "doc-source"
# ============================================================================
# DOCUMENTATION SOURCES
# ============================================================================
# List of documentation sources to process
# Each source can be either a Git repository or a local path
sources:
# -------------------------
# Git Repository Sources
# -------------------------
# Example: PostgreSQL 17 documentation from Git
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_17_STABLE"        # Git branch to use
    # tag: "REL_17_0"              # Alternative: use tag instead
    doc_path: "doc/src/sgml"       # Path within repo containing docs
    project_name: "PostgreSQL"     # Project identifier (required)
    project_version: "17"          # Version identifier (optional)
  # Example: PostgreSQL 16 documentation
  - git_url: "https://github.com/postgres/postgres.git"
    branch: "REL_16_STABLE"
    doc_path: "doc/src/sgml"
    project_name: "PostgreSQL"
    project_version: "16"
  # Example: pgEdge documentation
  # - git_url: "https://github.com/pgEdge/docs.git"
  #   branch: "main"
  #   doc_path: "."                # Docs at repo root
  #   project_name: "pgEdge"
  #   project_version: "latest"
  # -------------------------
  # Local Path Sources
  # -------------------------
  # Example: Local documentation directory
  # - local_path: "~/projects/my-project"
  #   doc_path: "docs"             # Optional subdirectory
  #   project_name: "My Project"
  #   project_version: "1.0"
  # Example: Absolute path
  # - local_path: "/opt/documentation/myapp"
  #   doc_path: "."                # Process entire directory
  #   project_name: "MyApp"
  #   project_version: "2.5"
# ============================================================================
# EMBEDDING PROVIDER CONFIGURATION
# ============================================================================
# Configure one or more embedding providers
# The knowledgebase will store embeddings from all enabled providers
# The MCP server can use any available provider for search
#
# IMPORTANT: Enable at least one provider
embeddings:
  # -------------------------
  # OpenAI Embeddings
  # -------------------------
  openai:
    # Enable OpenAI embeddings
    # Default: false
    enabled: true
    # Path to file containing OpenAI API key
    # Default: ~/.openai-api-key
    # Environment variable: OPENAI_API_KEY (takes priority)
    api_key_file: "~/.openai-api-key"
    # OpenAI embedding model
    # Options: text-embedding-3-small (1536 dim),
    #          text-embedding-3-large (3072 dim),
    #          text-embedding-ada-002 (1536 dim)
    # Default: text-embedding-3-small
    model: "text-embedding-3-small"
    # Embedding dimensions (optional, model-specific)
    # Only needed for models that support variable dimensions
    # Default: 1536 (for text-embedding-3-small)
    dimensions: 1536
  # -------------------------
  # Voyage AI Embeddings
  # -------------------------
  voyage:
    # Enable Voyage AI embeddings
    # Default: false
    enabled: false
    # Path to file containing Voyage API key
    # Default: ~/.voyage-api-key
    # Environment variable: VOYAGE_API_KEY (takes priority)
    api_key_file: "~/.voyage-api-key"
    # Voyage embedding model
    # Options: voyage-3 (1024 dim), voyage-3-lite (512 dim)
    # Default: voyage-3
    model: "voyage-3"
  # -------------------------
  # Ollama Local Embeddings
  # -------------------------
  ollama:
    # Enable Ollama embeddings (local, no API key needed)
    # Default: false
    enabled: false
    # Ollama API endpoint
    # Default: http://localhost:11434
    endpoint: "http://localhost:11434"
    # Ollama embedding model
    # Options: nomic-embed-text (768 dim), mxbai-embed-large (1024 dim)
    # Default: nomic-embed-text
    # Note: Model must be pulled first: ollama pull nomic-embed-text
    model: "nomic-embed-text"
# ============================================================================
# SUPPORTED DOCUMENT FORMATS
# ============================================================================
# The kb-builder automatically detects and converts:
# - Markdown (.md)
# - HTML (.html, .htm)
# - reStructuredText (.rst)
# - SGML (.sgml, .sgm)
# - DocBook XML (.xml)
#
# Documents are converted to Markdown, chunked intelligently, and embedded.
# ============================================================================
# COMMAND LINE USAGE
# ============================================================================
# Basic usage:
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml
#
# Override database path:
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --database /path/to/output.db
#
# Skip git pull for existing repos (faster for development):
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --skip-updates
#
# Add missing embeddings to existing database:
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --add-missing-embeddings
#
# Clear embeddings for a specific provider:
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --clear-embeddings openai
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --clear-embeddings voyage
# ./pgedge-nla-kb-builder --config pgedge-nla-kb-builder.yaml --clear-embeddings ollama
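The model dimensions listed in the configuration comments above can be captured in a small lookup, which may be handy when checking that a `dimensions` value matches the chosen model. `embedding_dim` is a hypothetical helper; the values are copied from the comments in this example file and cover only the models named there.

```shell
# embedding_dim MODEL
# Prints the embedding dimension listed for MODEL in the example configuration.
embedding_dim() {
  case "$1" in
    text-embedding-3-small|text-embedding-ada-002) echo 1536 ;;
    text-embedding-3-large) echo 3072 ;;
    voyage-3|mxbai-embed-large) echo 1024 ;;
    voyage-3-lite) echo 512 ;;
    nomic-embed-text) echo 768 ;;
    *) echo "unknown embedding model: $1" >&2; return 1 ;;
  esac
}
```

For example, `embedding_dim text-embedding-3-small` prints the value used for `dimensions` in the OpenAI section.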