Created: Apr 22, 2025
docs-mcp-server MCP Server
An MCP server for fetching and searching third-party package documentation.
Key Features
Versatile Scraping: Fetch documentation from diverse sources like websites, GitHub, npm, PyPI, or local files.
Intelligent Processing: Automatically split content semantically and generate embeddings using your choice of models (OpenAI, Google Gemini, Azure OpenAI, AWS Bedrock, Ollama, and more).
Optimized Storage: Leverage SQLite with sqlite-vec for efficient vector storage and FTS5 for robust full-text search.
Powerful Hybrid Search: Combine vector similarity and full-text search across different library versions for highly relevant results.
Asynchronous Job Handling: Manage scraping and indexing tasks efficiently with a background job queue and MCP/CLI tools.
Simple Deployment: Get up and running quickly using Docker or npx.
Overview
This project provides a Model Context Protocol (MCP) server designed to scrape, process, index, and search documentation for various software libraries and packages. It fetches content from specified URLs, splits it into meaningful chunks using semantic splitting techniques, generates vector embeddings with the configured embedding model, and stores the data in a SQLite database. The server utilizes sqlite-vec for efficient vector similarity search and FTS5 for full-text search capabilities, combining them for hybrid search results. It supports versioning, allowing documentation for different library versions (including unversioned content) to be stored and queried distinctly.
The server exposes MCP tools for:
Starting a scraping job (scrape_docs): Returns a jobId immediately.
Checking job status (get_job_status): Retrieves the current status and progress of a specific job.
Listing active/completed jobs (list_jobs): Shows recent and ongoing jobs.
Cancelling a job (cancel_job): Attempts to stop a running or queued job.
Searching documentation (search_docs).
Listing indexed libraries (list_libraries).
Finding appropriate versions (find_version).
Removing indexed documents (remove_docs).
Fetching single URLs (fetch_url): Fetches a URL and returns its content as Markdown.
Configuration
The following environment variables configure the embedding model behavior:
Embedding Model Configuration
DOCS_MCP_EMBEDDING_MODEL: Optional. Format: provider:model_name or just model_name (defaults to text-embedding-3-small). Supported providers and their required environment variables:
Vector Dimensions
The database schema uses a fixed dimension of 1536 for embedding vectors. Only models that produce vectors with dimension 1536 are supported, except for certain providers (like Gemini) that support dimension reduction.
For OpenAI-compatible APIs (like Ollama), use the openai provider with OPENAI_API_BASE pointing to your endpoint.
These variables can be set regardless of how you run the server (Docker, npx, or from source).
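For example, the embedding model can be selected with an environment variable in the provider:model_name format described above (the Gemini model name and the GOOGLE_API_KEY variable below are assumptions; check your provider's documentation):

```shell
# Select an embedding model using the provider:model_name format.
export DOCS_MCP_EMBEDDING_MODEL="gemini:text-embedding-004"
# The matching provider API key must also be set (variable name assumed):
export GOOGLE_API_KEY="your-google-api-key"
```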
Running the MCP Server
There are two ways to run the docs-mcp-server:
Option 1: Using Docker (Recommended)
This is the recommended approach for most users: it's straightforward and doesn't require a Node.js installation.
Ensure Docker is installed and running.
Configure your MCP settings.
Claude/Cline/Roo Configuration Example:
Add the following configuration block to your MCP settings file (adjust the path as needed), replace "sk-proj-..." with your actual OpenAI API key, and restart the application.
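A sketch of such a block, assuming the common mcpServers schema (the server name and exact file layout vary by client); the docker run flags mirror the Docker Container Settings listed below:

```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "OPENAI_API_KEY",
        "-v", "docs-mcp-data:/data",
        "ghcr.io/arabold/docs-mcp-server:latest"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-proj-..."
      }
    }
  }
}
```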
That's it! The server will now be available to your AI assistant.
Docker Container Settings:
-i: Keep STDIN open, crucial for MCP communication over stdio.
--rm: Automatically remove the container when it exits.
-e OPENAI_API_KEY: Required. Set your OpenAI API key.
-v docs-mcp-data:/data: Required for persistence. Mounts a Docker named volume docs-mcp-data to store the database. You can replace with a specific host path if preferred (e.g., -v /path/on/host:/data).
Any of the configuration environment variables (see Configuration above) can be passed to the container using the -e flag. For example:
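For instance (image tag taken from the CLI reference below; the embedding model value is only illustrative):

```shell
docker run -i --rm \
  -e OPENAI_API_KEY="sk-proj-..." \
  -e DOCS_MCP_EMBEDDING_MODEL="text-embedding-3-small" \
  -v docs-mcp-data:/data \
  ghcr.io/arabold/docs-mcp-server:latest
```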
Option 2: Using npx
This approach is recommended when you need local file access (e.g., indexing documentation from your local file system). While this can also be achieved by mounting paths into a Docker container, using npx is simpler but requires a Node.js installation.
Ensure Node.js is installed.
Configure your MCP settings.
Claude/Cline/Roo Configuration Example:
Add the following configuration block to your MCP settings file, replace "sk-proj-..." with your actual OpenAI API key, and restart the application.
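A sketch of such a block, assuming the common mcpServers schema (the server name and exact file layout vary by client; the package name is taken from the CLI reference below):

```json
{
  "mcpServers": {
    "docs-mcp": {
      "command": "npx",
      "args": ["-y", "@arabold/docs-mcp-server"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-..."
      }
    }
  }
}
```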
That's it! The server will now be available to your AI assistant.
Using the CLI
You can use the CLI to manage documentation directly, either via Docker or npx. Important: Use the same method (Docker or npx) for both the server and CLI to ensure access to the same indexed documentation.
Using Docker CLI
If you're running the server with Docker, use Docker for the CLI as well:
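For example, listing the indexed libraries through the containerized CLI might look like this (a sketch; the flags mirror the server's Docker settings):

```shell
docker run --rm \
  -e OPENAI_API_KEY="sk-proj-..." \
  -v docs-mcp-data:/data \
  ghcr.io/arabold/docs-mcp-server:latest docs-cli list
```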
Make sure to use the same volume name (docs-mcp-data in this example) as you did for the server. Any of the configuration environment variables (see Configuration above) can be passed using -e flags, just like with the server.
Using npx CLI
If you're running the server with npx, use npx for the CLI as well:
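For example (package name from the CLI reference below):

```shell
npx -y --package=@arabold/docs-mcp-server docs-cli list
```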
The npx approach will use the default data directory on your system (typically in your home directory), ensuring consistency between server and CLI.
(See "CLI Command Reference" below for available commands and options.)
CLI Command Reference
The docs-cli provides commands for managing the documentation index. Access it either via Docker (docker run -v docs-mcp-data:/data ghcr.io/arabold/docs-mcp-server:latest docs-cli ...) or npx (npx -y --package=@arabold/docs-mcp-server docs-cli ...).
General Help:
Command Specific Help: (Replace docs-cli with the npx... command if not installed globally)
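Assuming docs-cli is reachable directly (otherwise substitute the Docker or npx invocation above):

```shell
docs-cli --help            # general help
docs-cli scrape --help     # help for a specific command
```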
Fetching Single URLs (`fetch-url`)
Fetches a single URL and converts its content to Markdown. Unlike scrape, this command does not crawl links or store the content.
Options:
--no-follow-redirects: Disable following HTTP redirects (default: follow redirects).
--scrape-mode <mode>: HTML processing strategy: 'fetch' (fast, less JS), 'playwright' (slow, full JS), 'auto' (default).
Examples:
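A couple of sketches (the URL is a placeholder, and the positional-argument form is an assumption):

```shell
docs-cli fetch-url https://example.com/docs/intro.html
docs-cli fetch-url https://example.com/docs/intro.html --scrape-mode playwright
```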
Scraping Documentation (`scrape`)
Scrapes and indexes documentation from a given URL for a specific library.
Options:
-v, --version <string>: The specific version to associate with the scraped documents.
-p, --max-pages <number>: Maximum pages to scrape (default: 1000).
-d, --max-depth <number>: Maximum navigation depth (default: 3).
-c, --max-concurrency <number>: Maximum concurrent requests (default: 3).
--scope <scope>: Defines the crawling boundary: 'subpages' (default), 'hostname', or 'domain'.
--no-follow-redirects: Disable following HTTP redirects (default: follow redirects).
--scrape-mode <mode>: HTML processing strategy: 'fetch' (fast, less JS), 'playwright' (slow, full JS), 'auto' (default).
--ignore-errors: Ignore errors during scraping (default: true).
Examples:
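Illustrative invocations (library names, URLs, and the argument order are assumptions):

```shell
# Index a specific version:
docs-cli scrape react https://react.dev/reference/react --version 18.2.0
# Unversioned scrape with tighter crawl limits:
docs-cli scrape mylib https://mylib.example.com/docs --max-pages 200 --max-depth 2
```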
Searching Documentation (`search`)
Searches the indexed documentation for a library, optionally filtering by version.
Options:
-v, --version <string>: The target version or range to search within.
-l, --limit <number>: Maximum number of results (default: 5).
-e, --exact-match: Only match the exact version specified (disables fallback and range matching) (default: false).
Examples:
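Illustrative invocations (library name, query, and argument order are assumptions):

```shell
# Search within a version range:
docs-cli search react "useState hook" --version 18.x --limit 3
# Match only the exact version:
docs-cli search react "useState hook" --version 18.2.0 --exact-match
```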
Finding Available Versions (`find-version`)
Checks the index for the best matching version for a library based on a target, and indicates if unversioned documents exist.
Options:
-v, --version <string>: The target version or range. If omitted, finds the latest available version.
Examples:
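Illustrative invocations (library name is an assumption):

```shell
docs-cli find-version react                  # best match for the latest version
docs-cli find-version react --version 17.x   # best match within a range
```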
Listing Libraries (`list`)
Lists all libraries currently indexed in the store.
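Typically invoked without arguments:

```shell
docs-cli list
```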
Removing Documentation (`remove`)
Removes indexed documents for a specific library and version.
Options:
-v, --version <string>: The specific version to remove. If omitted, removes unversioned documents for the library.
Examples:
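Illustrative invocations (library name is an assumption):

```shell
docs-cli remove react --version 18.2.0   # remove one version
docs-cli remove react                    # remove unversioned docs
```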
Version Handling Summary
Scraping: Requires a specific, valid version (X.Y.Z, X.Y.Z-pre, X.Y, X) or no version (for unversioned docs). Ranges (X.x) are invalid for scraping.
Searching/Finding: Accepts specific versions, partials, or ranges (X.Y.Z, X.Y, X, X.x). Falls back to the latest older version if the target doesn't match. Omitting the version targets the latest available. Explicitly searching --version "" targets unversioned documents.
Unversioned Docs: Libraries can have documentation stored without a specific version (by omitting --version during scrape). These can be searched explicitly using --version "". The find-version command will also report if unversioned docs exist alongside any semver matches.
Development & Advanced Setup
This section covers running the server/CLI directly from the source code for development purposes. The primary usage method is the public Docker image, as described in "Option 1" above.
Running from Source (Development)
This builds and runs the server from source inside an isolated Docker container.
Clone the repository:
Create `.env` file:
Copy the example and add your OpenAI key (see "Environment Setup" below).
Build the Docker image:
Run the Docker container. The server inside the container runs directly using Node.js and communicates over stdio.
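The Docker steps above might look like this (the repository URL and image tag are assumptions):

```shell
git clone https://github.com/arabold/docs-mcp-server.git
cd docs-mcp-server
cp .env.example .env                 # then add your OPENAI_API_KEY
docker build -t docs-mcp-server .
docker run -i --rm --env-file .env -v docs-mcp-data:/data docs-mcp-server
```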
This method is useful for contributing to the project or running un-published versions.
Clone the repository:
Install dependencies:
Build the project:
This compiles TypeScript to JavaScript in the dist/ directory.
Setup Environment:
Create and configure your .env file as described in "Environment Setup" below. This is crucial for providing the OPENAI_API_KEY.
Run:
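A sketch of the source workflow (the repository URL and npm script names are assumptions based on convention):

```shell
git clone https://github.com/arabold/docs-mcp-server.git
cd docs-mcp-server
npm install
npm run build          # compiles TypeScript into dist/
cp .env.example .env   # add your OPENAI_API_KEY (see Environment Setup)
npm start
```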
Environment Setup (for Source/Docker)
Note: This .env file setup is primarily needed when running the server from source or using the Docker method. When using the npx integration method, the OPENAI_API_KEY is set directly in the MCP configuration file.
Create a .env file based on .env.example:
Update your OpenAI API key in .env:
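For example:

```shell
cp .env.example .env
# then edit .env and set your key:
# OPENAI_API_KEY=sk-proj-...
```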
Debugging (from Source)
Since MCP servers communicate over stdio when run directly via Node.js, debugging can be challenging. We recommend using the MCP Inspector, which is available as a package script after building:
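Assuming the script is named inspector (check package.json for the exact name):

```shell
npm run inspector
```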
The Inspector will provide a URL to access debugging tools in your browser.
Commit Messages: All commits merged into the main branch must follow the Conventional Commits specification.
Manual Trigger: The "Release" GitHub Actions workflow can be triggered manually from the Actions tab when you're ready to create a new release.
`semantic-release` Actions: Determines version, updates CHANGELOG.md & package.json, commits, tags, publishes to npm, and creates a GitHub Release.
What you need to do:
Use Conventional Commits.
Merge changes to main.
Trigger a release manually when ready from the Actions tab in GitHub.
Automation handles: Changelog, version bumps, tags, npm publish, GitHub releases.
Architecture
For details on the project's architecture and design principles, please see ARCHITECTURE.md.
Notably, the vast majority of this project's code was generated by the AI assistant Cline, leveraging the capabilities of this very MCP server.