OmniMCP gives AI models UI context and interaction capabilities through the Model Context Protocol (MCP) and microsoft/OmniParser. It focuses on understanding user interfaces through visual analysis, structured planning, and precise execution of interactions.
Core Features
Visual Perception: Understands UI elements using OmniParser.
LLM Planning: Plans next actions based on goal, history, and visual state.
Agent Executor: Orchestrates the perceive-plan-act loop (omnimcp/agent_executor.py).
Action Execution: Controls mouse/keyboard via pynput (omnimcp/input.py).
CLI Interface: Simple entry point (cli.py) for running tasks.
Auto-Deployment: Optional OmniParser server deployment to AWS EC2 with auto-shutdown.
Debugging: Generates timestamped visual logs per step.
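To illustrate how the Visual Perception and Action Execution features might connect, here is a minimal sketch that turns a parsed element's bounding box into a click point and clicks it with pynput. The bounding-box layout (normalized x_min, y_min, x_max, y_max) and the helper names bbox_center_px and click_at are assumptions for illustration, not the actual interfaces in omnimcp/input.py.

```python
from typing import Tuple


def bbox_center_px(
    bbox: Tuple[float, float, float, float], screen_w: int, screen_h: int
) -> Tuple[int, int]:
    """Return the pixel center of a normalized bounding box.

    ASSUMPTION: bbox is (x_min, y_min, x_max, y_max) with values in [0, 1].
    The real parser output format may differ.
    """
    x_min, y_min, x_max, y_max = bbox
    if not (0 <= x_min <= x_max <= 1 and 0 <= y_min <= y_max <= 1):
        raise ValueError(f"bbox is not normalized: {bbox}")
    cx = (x_min + x_max) / 2 * screen_w
    cy = (y_min + y_max) / 2 * screen_h
    return round(cx), round(cy)


def click_at(x: int, y: int) -> None:
    """Move the pointer to (x, y) and left-click once using pynput.

    pynput is imported lazily so the coordinate math above can be used
    (and tested) on machines without a graphical session.
    """
    from pynput.mouse import Button, Controller

    mouse = Controller()
    mouse.position = (x, y)
    mouse.click(Button.left, 1)


if __name__ == "__main__":
    # Hypothetical element covering the lower-middle area of a 1920x1080 screen.
    print(bbox_center_px((0.25, 0.5, 0.75, 1.0), 1920, 1080))
```

Keeping the coordinate conversion separate from the click itself makes the targeting logic checkable without moving the real mouse.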
Overview
cli.py uses AgentExecutor to run a perceive-plan-act loop. It captures the screen (VisualState), plans using an LLM (core.plan_action_for_ui), and executes actions (InputController).
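The perceive-plan-act loop described above can be sketched roughly as follows. This is not the code in omnimcp/agent_executor.py: the Plan dataclass, the run_loop function, and the stub perceive/plan/act callables are hypothetical stand-ins for VisualState, core.plan_action_for_ui, and InputController, chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Plan:
    """Hypothetical planner output: one action plus a completion flag."""

    action: str
    target: str
    is_goal_complete: bool = False


def run_loop(
    perceive: Callable[[], Dict],
    plan: Callable[[Dict, str, List[str]], Plan],
    act: Callable[[Plan], None],
    goal: str,
    max_steps: int = 10,
) -> Tuple[bool, List[str]]:
    """Run perceive -> plan -> act until the planner reports success
    or the step budget is exhausted."""
    history: List[str] = []
    for _ in range(max_steps):
        state = perceive()                      # capture current UI state
        decision = plan(state, goal, history)   # ask the planner for the next action
        if decision.is_goal_complete:
            return True, history
        act(decision)                           # execute the action
        history.append(f"{decision.action} {decision.target}")
    return False, history


def demo() -> Tuple[bool, List[str]]:
    """Drive the loop with stubs that imitate pressing calculator keys."""
    screen = {"display": ""}
    keys = iter(["5", "*", "9", "="])

    def perceive() -> Dict:
        return dict(screen)

    def plan(state: Dict, goal: str, history: List[str]) -> Plan:
        key = next(keys, None)
        if key is None:
            return Plan("done", "", is_goal_complete=True)
        return Plan("click", key)

    def act(decision: Plan) -> None:
        screen["display"] += decision.target

    return run_loop(perceive, plan, act, goal="compute 5*9")


if __name__ == "__main__":
    print(demo())
```

Passing the three stages in as callables keeps the loop itself free of screen capture, LLM, and input dependencies, so it can be exercised with stubs like the ones in demo().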
Demos
Real Action (Calculator): python cli.py opens Calculator and computes 5*9.
OmniMCP Real Action Demo GIF
Synthetic UI (Login): python demo_synthetic.py uses generated images (no real I/O). (Note: Pending refactor to use AgentExecutor.)
OmniMCP Synthetic Demo GIF
Prerequisites
Python >=3.10, <3.13
uv installed (pip install uv)
Linux Runtime Requirement: pynput requires an active graphical session (X11/Wayland). System libraries such as libx11-dev may also be needed; see the pynput docs.
(macOS display scaling dependencies are handled automatically during installation).
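As a quick way to confirm the interpreter requirement listed above, the following sketch checks the running Python against the >=3.10, <3.13 range. The function name python_supported is hypothetical and not part of OmniMCP.

```python
import sys
from typing import Optional, Sequence


def python_supported(version: Optional[Sequence[int]] = None) -> bool:
    """Return True if the (major, minor) version is within >=3.10, <3.13.

    Defaults to the running interpreter when no version is given.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 13)


if __name__ == "__main__":
    if not python_supported():
        raise SystemExit(
            f"Unsupported Python {sys.version_info.major}.{sys.version_info.minor}; "
            "need >=3.10, <3.13"
        )
    print("Python version OK")
```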
For AWS Deployment Features
Requires AWS credentials in .env (see .env.example). Warning: Creates AWS resources (EC2, Lambda, etc.) incurring costs. Use python -m omnimcp.omniparser.server stop to clean up.
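Before running any deployment command, it can help to confirm that .env actually contains credentials. The sketch below parses a .env file and reports missing keys. The key names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS environment variable names and are an assumption here; .env.example is the authoritative list for this project.

```python
from typing import Dict, Iterable, List

# ASSUMPTION: standard AWS variable names; check .env.example for the real ones.
ASSUMED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: Iterable[str] = ASSUMED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    from pathlib import Path

    env_file = Path(".env")
    text = env_file.read_text() if env_file.exists() else ""
    print("Missing keys:", missing_keys(text))
```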
Installation
Quick Start
Ensure environment is activated and .env is configured.
Debug outputs are saved in runs/<timestamp>/.
Note on MCP Server: An experimental MCP server (OmniMCP class in omnimcp/mcp_server.py) exists but is separate from the primary cli.py/AgentExecutor workflow.
Architecture
CLI (cli.py) - Entry point, setup, starts Executor.
Detailed logs are in logs/run_YYYY-MM-DD_HH-mm-ss.log (LOG_LEVEL=DEBUG in .env recommended).
(Note: Details like timings, counts, IPs, instance IDs, and specific plans will vary)
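The log file naming and the LOG_LEVEL setting mentioned above could be wired up along these lines. This is a sketch using the standard logging module, not OmniMCP's actual logging setup; the function names and the INFO fallback level are assumptions.

```python
import logging
import os
from datetime import datetime
from pathlib import Path
from typing import Mapping, Optional


def log_path_for(started: datetime, log_dir: str = "logs") -> Path:
    """Build logs/run_YYYY-MM-DD_HH-mm-ss.log for the given start time."""
    return Path(log_dir) / f"run_{started.strftime('%Y-%m-%d_%H-%M-%S')}.log"


def resolve_level(env: Optional[Mapping[str, str]] = None) -> int:
    """Read LOG_LEVEL from the environment.

    ASSUMPTION: falls back to INFO when LOG_LEVEL is unset or unrecognized.
    """
    source = os.environ if env is None else env
    name = source.get("LOG_LEVEL", "INFO").upper()
    level = logging.getLevelName(name)  # returns an int for known level names
    return level if isinstance(level, int) else logging.INFO


if __name__ == "__main__":
    path = log_path_for(datetime.now())
    path.parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(filename=path, level=resolve_level())
    logging.getLogger(__name__).debug("logging initialised at %s", path)
```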
Roadmap & Limitations
Key limitations & future work areas:
Performance: Reduce OmniParser latency (explore local models, caching, etc.) and optimize state management (avoid full re-parse).
Robustness: Improve LLM planning reliability (prompts, techniques like ReAct), add action verification/error recovery, enhance element targeting.
Target API/Architecture: Evolve towards a higher-level declarative API (e.g., @omni.publish style) and potentially integrate loop logic with the experimental MCP Server (OmniMCP class).
Consistency: Refactor demo_synthetic.py to use AgentExecutor.
Research: Explore fine-tuning, process graphs (RAG), framework integration.
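One of the caching ideas under Performance could look like the sketch below: skip the parser call when the screenshot bytes are identical to a previously parsed frame. ParseCache and its parse_fn argument are hypothetical and not part of OmniMCP, and exact-match hashing is only the simplest option, since real screens often differ by a few pixels (cursor, clock) between frames.

```python
import hashlib
from typing import Any, Callable, Dict


class ParseCache:
    """Memoize parser results keyed by the SHA-256 of the screenshot bytes."""

    def __init__(self, parse_fn: Callable[[bytes], Any]) -> None:
        self._parse_fn = parse_fn
        self._cache: Dict[str, Any] = {}
        self.misses = 0  # number of times the underlying parser was called

    def parse(self, image_bytes: bytes) -> Any:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._parse_fn(image_bytes)
        return self._cache[key]


if __name__ == "__main__":
    # Stand-in parser: the real one would call the OmniParser server.
    cache = ParseCache(lambda data: {"num_bytes": len(data)})
    cache.parse(b"frame-1")
    cache.parse(b"frame-1")  # served from cache, no second parser call
    cache.parse(b"frame-2")
    print("parser calls:", cache.misses)
```

The cache here is unbounded; a long-running agent would likely want an eviction policy, and region-level diffing would be needed to avoid a full re-parse when only part of the screen changes.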
Project Status
Core loop via cli.py/AgentExecutor is functional for basic tasks. Performance and robustness need significant improvement. MCP integration is experimental.
Contributing
Fork repository
Create feature branch
Implement changes & add tests
Ensure checks pass (uv run ruff format ., uv run ruff check . --fix, uv run pytest tests/)