Speech MCP

A Goose MCP extension for voice interaction with modern audio visualization.

Overview

Speech MCP provides a voice interface for Goose, allowing users to interact through speech rather than text. It includes:
  • Real-time audio processing for speech recognition
  • Local speech-to-text using faster-whisper (a faster implementation of OpenAI's Whisper model)
  • High-quality text-to-speech with multiple voice options
  • Modern PyQt-based UI with audio visualization
  • Simple command-line interface for voice interaction

Features

  • Modern UI: Sleek PyQt-based interface with audio visualization and dark theme
  • Voice Input: Capture and transcribe user speech using faster-whisper
  • Voice Output: Convert agent responses to speech with 54+ voice options
  • Multi-Speaker Narration: Generate audio files with multiple voices for stories and dialogues
  • Single-Voice Narration: Convert any text to speech with your preferred voice
  • Audio/Video Transcription: Transcribe speech from various media formats with optional timestamps and speaker detection
  • Voice Persistence: Remembers your preferred voice between sessions
  • Continuous Conversation: Automatically listen for user input after agent responses
  • Silence Detection: Automatically stops recording when the user stops speaking
  • Robust Error Handling: Graceful recovery from common failure modes with helpful voice suggestions

Installation

Important Note: After installation, the first time you use the speech interface, it may take several minutes to download the Kokoro voice models (approximately 523 KB per voice). During this initial setup period, the system will use a more robotic-sounding fallback voice. Once the Kokoro voices are downloaded, the high-quality voices will be used automatically.

IMPORTANT PREREQUISITES

Before installing Speech MCP, you MUST install PortAudio on your system. PortAudio is required for PyAudio to capture audio from your microphone.

PortAudio Installation Instructions

macOS and Linux: install PortAudio with your system package manager.
Windows: PortAudio is bundled with the PyAudio wheel, so no separate installation is required when installing PyAudio with pip.
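The exact package name varies by platform; these are the usual commands (a sketch assuming Homebrew on macOS and the stock PortAudio development packages on Linux):

```bash
# macOS (Homebrew)
brew install portaudio

# Linux (Debian/Ubuntu)
sudo apt-get install portaudio19-dev

# Linux (Fedora/RHEL/CentOS)
sudo dnf install portaudio-devel
```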
Note: If you skip this step, PyAudio installation will fail with "portaudio.h file not found" errors and the extension will not work.

Option 1: Quick Install (One-Click)

If you have Goose installed, use the one-click install link for this extension.

Option 2: Using Goose CLI (recommended)

Start Goose with your extension enabled:
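For example (a sketch; the --with-extension flag and the exact command string are assumptions based on standard Goose CLI usage):

```bash
# If speech-mcp is installed and on your PATH:
goose session --with-extension "speech-mcp"

# Or run it via uvx without a separate install:
goose session --with-extension "uvx speech-mcp"
```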

Option 3: Manual setup in Goose

  1. Run goose configure
  2. Select "Add Extension" from the menu
  3. Choose "Command-line Extension"
  4. Enter a name (e.g., "Speech Interface")
  5. For the command, enter: speech-mcp
  6. Follow the prompts to complete the setup

Option 4: Manual Installation

  1. Install PortAudio (see the Prerequisites section)
  2. Clone this repository
  3. Install the dependencies, or perform a complete installation including Kokoro TTS (see the sketch below)
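A minimal sketch of those steps, using a placeholder for the repository URL and assuming a standard editable pip install; the Kokoro step uses the install script mentioned under Technical Details:

```bash
# Clone the repository (placeholder URL) and enter it
git clone <repository-url>
cd speech-mcp

# Install the base dependencies
pip install -e .

# Optional: complete installation including Kokoro TTS
python scripts/install_kokoro.py
```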

Dependencies

  • Python 3.10+
  • PyQt5 (for modern UI)
  • PyAudio (for audio capture)
  • faster-whisper (for speech-to-text)
  • NumPy (for audio processing)
  • Pydub (for audio processing)
  • psutil (for process management)

Optional Dependencies

  • Kokoro TTS: For high-quality text-to-speech with multiple voices

Multi-Speaker Narration

The MCP supports generating audio files with multiple voices, perfect for creating stories, dialogues, and dramatic readings. You can use either JSON or Markdown format to define your conversations.

JSON Format Example:
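An illustrative sketch of what such a file could look like; the voice names and the pause_after parameter are described elsewhere in this document, while the surrounding field names ("conversation", "speaker", "text") are assumptions:

```json
{
  "conversation": [
    {
      "speaker": "Narrator",
      "voice": "af_heart",
      "text": "Once upon a time, in a quiet village by the sea...",
      "pause_after": 0.5
    },
    {
      "speaker": "Old Sailor",
      "voice": "bm_george",
      "text": "Who goes there at this hour?",
      "pause_after": 1.0
    }
  ]
}
```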

Markdown Format Example:

Available Voices by Category:

  1. American Female (af_*)
  2. American Male (am_*)
  3. British Female (bf_*)
  4. British Male (bm_*)
  5. Other English
  6. Other Languages

Usage Example:

Each voice in the conversation can be different, allowing for distinct character voices in stories and dialogues. The pause_after parameter adds natural pauses between segments.

Single-Voice Narration

For simple text-to-speech conversion, you can use the narrate tool. It will use your configured voice preference or the default voice (af_heart) to generate the audio file. You can change the default voice through the UI or by setting the SPEECH_MCP_TTS_VOICE environment variable.

Audio Transcription

The MCP can transcribe speech from various audio and video formats using faster-whisper:

Supported Formats:

  • Audio: mp3, wav, m4a, flac, aac, ogg
  • Video: mp4, mov, avi, mkv, webm (audio is automatically extracted)

Output Files:

The transcription tool generates two files:
  1. {input_name}.transcript.txt: Contains the transcription text
  2. {input_name}.metadata.json: Contains metadata about the transcription

Features:

  • Automatic language detection
  • Optional word-level timestamps
  • Optional speaker detection
  • Efficient audio extraction from video files
  • Progress tracking for long files
  • Detailed metadata written to the accompanying .metadata.json file

Usage

To use this MCP with Goose, simply ask Goose to talk to you or start a voice conversation:
  1. Start a conversation by saying something like "Talk to me" or "Let's have a voice conversation."
  2. Goose will automatically launch the speech interface and start listening for your voice input.
  3. When Goose responds, it will speak the response aloud and then automatically listen for your next input.
  4. The conversation continues naturally with alternating speaking and listening, just like talking to a person.
No need to call specific functions or use special commands - just ask Goose to talk and start speaking naturally.

UI Features

The new PyQt-based UI includes:
  • Modern Dark Theme: Sleek, professional appearance
  • Audio Visualization: Dynamic visualization of audio input
  • Voice Selection: Choose from 54+ voice options
  • Voice Persistence: Your voice preference is saved between sessions
  • Animated Effects: Smooth animations and visual feedback
  • Status Indicators: Clear indication of system state (ready, listening, processing)

Configuration

User preferences are stored in ~/.config/speech-mcp/config.json and include:
  • Selected TTS voice
  • TTS engine preference
  • Voice speed
  • Language code
  • UI theme settings
You can also set preferences via environment variables, such as:
  • SPEECH_MCP_TTS_VOICE - Set your preferred voice
  • SPEECH_MCP_TTS_ENGINE - Set your preferred TTS engine
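For example, to pick a voice and engine for the current shell session (the variable names are as listed above; the "kokoro" engine value is an assumption):

```bash
export SPEECH_MCP_TTS_VOICE=af_heart
export SPEECH_MCP_TTS_ENGINE=kokoro
```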

Troubleshooting

If you encounter issues with the extension freezing or not responding:
  1. Check the logs: Look at the log files in src/speech_mcp/ for detailed error messages.
  2. Reset the state: If the extension seems stuck, try deleting src/speech_mcp/speech_state.json or setting all states to false.
  3. Use the direct command: Instead of uv run speech-mcp, use the installed package with speech-mcp directly.
  4. Check audio devices: Ensure your microphone is properly configured and accessible to Python.
  5. Verify dependencies: Make sure all required dependencies are installed correctly.

Common PortAudio Issues

"PyAudio installation failed" or "portaudio.h file not found"

This typically means PortAudio is not installed or cannot be found on your system:
  • macOS: install PortAudio with Homebrew (see the Prerequisites section).
  • Linux: make sure the PortAudio development packages are installed (see the Prerequisites section).

"Audio device not found" or "No Default Input Device Available"

  • Check if your microphone is properly connected
  • Verify your system recognizes the microphone in your sound settings
  • Try selecting a specific device index in the code if you have multiple audio devices

Changelog

For a detailed list of recent improvements and version history, please see the Changelog.

Technical Details

Speech-to-Text

The MCP uses faster-whisper for speech recognition:
  • Uses the "base" model for a good balance of accuracy and speed
  • Processes audio locally without sending data to external services
  • Automatically detects when the user has finished speaking
  • Provides improved performance over the original Whisper implementation
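For reference, this is roughly how the faster-whisper base model is driven; a self-contained sketch, not the extension's internal code:

```python
from faster_whisper import WhisperModel

# Load the "base" model locally; int8 keeps CPU memory usage modest
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe a local file; segments are generated lazily
segments, info = model.transcribe("recording.wav", beam_size=5)

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```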

Text-to-Speech

The MCP supports multiple text-to-speech engines:

Default: pyttsx3

  • Uses system voices available on your computer
  • Works out of the box without additional setup
  • Limited voice quality and customization

Optional: Kokoro TTS

  • High-quality neural text-to-speech with multiple voices
  • Lightweight model (82M parameters) that runs efficiently on CPU
  • Multiple voice styles and languages
  • To install: python scripts/install_kokoro.py
Note about Voice Models: The voice models are .pt files (PyTorch models) that are loaded by Kokoro. Each voice model is approximately 523 KB in size and is automatically downloaded when needed.
Voice Persistence: The selected voice is automatically saved to a configuration file (~/.config/speech-mcp/config.json) and will be remembered between sessions. This allows users to set their preferred voice once and have it used consistently.

Available Kokoro Voices

Speech MCP supports 54+ high-quality voice models through Kokoro TTS. For a complete list of available voices and language options, please visit the Kokoro GitHub repository.
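For reference, a minimal sketch of driving Kokoro directly, assuming the KPipeline interface from the kokoro package and the soundfile library for writing WAV output; this is not the extension's internal code:

```python
import soundfile as sf
from kokoro import KPipeline

# "a" selects American English; the voice is one of the ~523 KB .pt models noted above
pipeline = KPipeline(lang_code="a")

# The pipeline yields audio chunks (24 kHz) for each generated segment
for i, (graphemes, phonemes, audio) in enumerate(
    pipeline("Hello from Speech MCP!", voice="af_heart")
):
    sf.write(f"kokoro_output_{i}.wav", audio, 24000)
```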

License

MIT License
