OmniMCP

![CI](https://github.com/OpenAdaptAI/OmniMCP/actions/workflows/ci.yml/badge.svg) ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg) ![Python Version](https://img.shields.io/badge/python-3.10%20|%203.11%20|%203.12-blue) ![Code style: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)

OmniMCP provides rich UI context and interaction capabilities to AI models through the Model Context Protocol (MCP) and microsoft/OmniParser. It focuses on enabling deep understanding of user interfaces through visual analysis, structured planning, and precise interaction execution.

Core Features

Visual Perception: Understands UI elements using OmniParser.

LLM Planning: Plans next actions based on goal, history, and visual state.

Agent Executor: Orchestrates the perceive-plan-act loop (omnimcp/agent_executor.py).

Action Execution: Controls mouse/keyboard via pynput (omnimcp/input.py).

CLI Interface: Simple entry point (cli.py) for running tasks.

Auto-Deployment: Optional OmniParser server deployment to AWS EC2 with auto-shutdown.

Debugging: Generates timestamped visual logs per step.

Overview

cli.py uses AgentExecutor to run a perceive-plan-act loop. It captures the screen (VisualState), plans using an LLM (core.plan_action_for_ui), and executes actions (InputController).
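The loop described above can be sketched in a few lines. This is a minimal illustration, not OmniMCP's actual code: the `Action` and `LoopResult` dataclasses, the `run_loop` function, and the callable signatures are all hypothetical stand-ins for VisualState, core.plan_action_for_ui, and InputController.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """Illustrative planned action; not OmniMCP's real schema."""
    kind: str                       # e.g. "click" or "type"
    target: Optional[str] = None    # element the action applies to
    text: Optional[str] = None      # text to type, if any
    is_goal_complete: bool = False  # planner signals that the goal is reached


@dataclass
class LoopResult:
    completed: bool
    history: List[Action] = field(default_factory=list)


def run_loop(
    goal: str,
    perceive: Callable[[], List[str]],
    plan: Callable[[str, List[Action], List[str]], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> LoopResult:
    """Run perceive -> plan -> act until the planner reports completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = perceive()                    # perceive: current UI state
        action = plan(goal, history, elements)   # plan: choose the next action
        if action.is_goal_complete:
            return LoopResult(True, history)
        act(action)                              # act: drive mouse/keyboard
        history.append(action)
    # Step budget exhausted without the planner declaring success.
    return LoopResult(False, history)
```

Passing the three stages in as callables keeps the loop testable without a screen or an LLM: fake implementations can stand in for each stage.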

Demos

Real Action (Calculator): python cli.py opens Calculator and computes 5*9.

Synthetic UI (Login): python demo_synthetic.py uses generated images (no real I/O). (Note: Pending refactor to use AgentExecutor).

Prerequisites

Python >=3.10, <3.13

uv installed (pip install uv)

Linux Runtime Requirement: Requires an active graphical session (X11/Wayland) for pynput. May need system libraries (libx11-dev, etc.) - see pynput docs.

(macOS display scaling dependencies are handled automatically during installation).
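The Python version bound above can be checked before installing. The following is a small sketch; the function name `python_version_supported` and the two constants are illustrative and not part of OmniMCP.

```python
import sys
from typing import Tuple

MIN_VERSION = (3, 10)  # inclusive lower bound from the prerequisites
MAX_VERSION = (3, 13)  # exclusive upper bound from the prerequisites


def python_version_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) satisfies >=3.10, <3.13."""
    return MIN_VERSION <= version < MAX_VERSION


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:2]))
    status = "supported" if python_version_supported() else "not supported"
    print(f"Python {current} is {status}")
```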

For AWS Deployment Features

Requires AWS credentials in .env (see .env.example). Warning: Creates AWS resources (EC2, Lambda, etc.) incurring costs. Use python -m omnimcp.omniparser.server stop to clean up.
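Before running anything that creates AWS resources, it can help to confirm that the credentials are actually present in .env. The sketch below assumes the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the keys OmniMCP really expects are listed in .env.example and may differ. The helper names are hypothetical, and the parser handles only simple KEY=VALUE lines.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed key names; check .env.example for the keys the project actually uses.
ASSUMED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str] = ASSUMED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def check_env_file(path: Path = Path(".env")) -> List[str]:
    """Return missing keys for the given .env file (all keys if the file is absent)."""
    if not path.is_file():
        return list(ASSUMED_KEYS)
    return missing_keys(parse_env_text(path.read_text()))
```

An empty list from `check_env_file()` means every assumed key has a non-empty value.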

Installation

Quick Start

Ensure environment is activated and .env is configured.

Debug outputs are saved in runs/<timestamp>/.

Note on MCP Server: An experimental MCP server (OmniMCP class in omnimcp/mcp_server.py) exists but is separate from the primary cli.py/AgentExecutor workflow.

Architecture

CLI (cli.py) - Entry point, setup, starts Executor.

Agent Executor (omnimcp/agent_executor.py) - Orchestrates loop, manages state/artifacts.

Visual State Manager (omnimcp/visual_state.py) - Perception (screenshot, calls parser).

OmniParser Client & Deploy (omnimcp/omniparser/) - Manages OmniParser server communication/deployment.

LLM Planner (omnimcp/core.py) - Generates action plan.

Input Controller (omnimcp/input.py) - Executes actions (mouse/keyboard).

(Optional) MCP Server (omnimcp/mcp_server.py) - Experimental MCP interface.

Development

Environment Setup & Checks

Debug Support

Running python cli.py saves timestamped runs in runs/, including:

step_N_state_raw.png

step_N_state_parsed.png (with element boxes)

step_N_action_highlight.png (with action highlight)

final_state.png

Detailed logs are in logs/run_YYYY-MM-DD_HH-mm-ss.log (LOG_LEVEL=DEBUG in .env recommended).
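To inspect a run, the artifacts listed above can be grouped by step. This sketch makes two assumptions: N in the file names is an integer step index, and run directory names sort chronologically as plain strings. The helper names are hypothetical and not part of OmniMCP.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional

# Matches the per-step file names listed above; assumes N is an integer.
STEP_FILE = re.compile(r"^step_(\d+)_(state_raw|state_parsed|action_highlight)\.png$")


def latest_run_dir(runs_root: Path) -> Optional[Path]:
    """Return the run directory whose name sorts last, or None if there is none."""
    if not runs_root.is_dir():
        return None
    dirs = [p for p in runs_root.iterdir() if p.is_dir()]
    return max(dirs, key=lambda p: p.name) if dirs else None


def artifacts_by_step(run_dir: Path) -> Dict[int, List[str]]:
    """Map each step index to the artifact kinds found for that step."""
    steps: Dict[int, List[str]] = defaultdict(list)
    for path in sorted(run_dir.iterdir()):
        match = STEP_FILE.match(path.name)
        if match:
            steps[int(match.group(1))].append(match.group(2))
    return dict(steps)


if __name__ == "__main__":
    run = latest_run_dir(Path("runs"))
    if run is None:
        print("No runs found")
    else:
        for step, kinds in sorted(artifacts_by_step(run).items()):
            print(f"step {step}: {', '.join(kinds)}")
```

A step missing its action_highlight image is a quick hint at where a run stopped.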

(Note: Details like timings, counts, IPs, instance IDs, and specific plans will vary)

Roadmap & Limitations

Key limitations & future work areas:

Performance: Reduce OmniParser latency (explore local models, caching, etc.) and optimize state management (avoid full re-parse).

Robustness: Improve LLM planning reliability (prompts, techniques like ReAct), add action verification/error recovery, enhance element targeting.

Target API/Architecture: Evolve towards a higher-level declarative API (e.g., @omni.publish style) and potentially integrate loop logic with the experimental MCP Server (OmniMCP class).

Consistency: Refactor demo_synthetic.py to use AgentExecutor.

Features: Expand action space (drag/drop, hover).

Testing: Add E2E tests, broaden cross-platform validation, define evaluation metrics.

Research: Explore fine-tuning, process graphs (RAG), framework integration.

Project Status

Core loop via cli.py/AgentExecutor is functional for basic tasks. Performance and robustness need significant improvement. MCP integration is experimental.

Contributing

Fork repository

Create feature branch

Implement changes & add tests

Ensure checks pass (uv run ruff format ., uv run ruff check . --fix, uv run pytest tests/)

Submit pull request

License

MIT License

Contact

Issues: GitHub Issues

Questions: Discussions

Security: security@openadapt.ai


