omniparser.com

OmniParser

Provides a robust UI automation framework leveraging advanced computer vision techniques for precise element detection,...

Created Apr 22, 2025

OmniMCP

OmniMCP provides rich UI context and interaction capabilities to AI models through the Model Context Protocol (MCP) and microsoft/OmniParser. It focuses on enabling deep understanding of user interfaces through visual analysis, structured planning, and precise interaction execution.

Core Features

  • Visual Perception: Understands UI elements using OmniParser.
  • LLM Planning: Plans next actions based on goal, history, and visual state.
  • Agent Executor: Orchestrates the perceive-plan-act loop (omnimcp/agent_executor.py).
  • Action Execution: Controls mouse/keyboard via pynput (omnimcp/input.py).
  • CLI Interface: Simple entry point (cli.py) for running tasks.
  • Auto-Deployment: Optional OmniParser server deployment to AWS EC2 with auto-shutdown.
  • Debugging: Generates timestamped visual logs per step.

Overview

cli.py uses AgentExecutor to run a perceive-plan-act loop. It captures the screen (VisualState), plans using an LLM (core.plan_action_for_ui), and executes actions (InputController).
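
The loop described above can be sketched in a few lines. This is a minimal illustration, not the project's AgentExecutor: the Action and LoopResult types, the run_loop function, and the three callables are hypothetical stand-ins for VisualState, core.plan_action_for_ui, and InputController, whose real signatures are not shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical planned action; not the project's real schema."""
    kind: str                      # e.g. "click", "type", or "done"
    target: Optional[str] = None   # description of the element to act on
    text: Optional[str] = None     # text to type, if any


@dataclass
class LoopResult:
    success: bool
    history: List[Action] = field(default_factory=list)


def run_loop(
    goal: str,
    perceive: Callable[[], List[str]],
    plan: Callable[[str, List[Action], List[str]], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> LoopResult:
    """Repeat perceive -> plan -> act until the planner reports completion."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = perceive()                   # perceive: current UI elements
        action = plan(goal, history, elements)  # plan: goal + history + state
        if action.kind == "done":
            return LoopResult(True, history)
        act(action)                             # act: mouse/keyboard input
        history.append(action)
    # The step budget ran out before the planner declared the goal reached.
    return LoopResult(False, history)
```

Passing the three stages in as callables keeps the loop testable with fakes, so no real screen or input device is needed to exercise it.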

Demos

  • Real Action (Calculator): python cli.py opens Calculator and computes 5*9.
  • Synthetic UI (Login): python demo_synthetic.py uses generated images (no real I/O). (Note: Pending refactor to use AgentExecutor.)

Prerequisites

  • Python >=3.10, <3.13
  • uv installed (pip install uv)
  • Linux Runtime Requirement: Requires an active graphical session (X11/Wayland) for pynput. May need system libraries (libx11-dev, etc.); see the pynput docs.
  • macOS: Display scaling dependencies are handled automatically during installation.
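
The first and third requirements can be checked before running anything. The sketch below is a hypothetical pre-flight helper (check_prerequisites is not part of the project). It assumes that a graphical session on Linux is signalled by the standard DISPLAY (X11) or WAYLAND_DISPLAY (Wayland) environment variables, and it does not check for uv or system libraries.

```python
import os
import sys
from typing import List, Mapping, Tuple


def check_prerequisites(
    version: Tuple[int, int] = sys.version_info[:2],
    platform: str = sys.platform,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems: List[str] = []

    # Python >=3.10, <3.13
    if not ((3, 10) <= version < (3, 13)):
        problems.append(
            f"Python {version[0]}.{version[1]} is outside the supported range >=3.10, <3.13"
        )

    # pynput needs an active graphical session on Linux.
    if platform.startswith("linux"):
        if not (env.get("DISPLAY") or env.get("WAYLAND_DISPLAY")):
            problems.append("No X11/Wayland session detected (DISPLAY and WAYLAND_DISPLAY unset)")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```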

For AWS Deployment Features

Requires AWS credentials in .env (see .env.example). Warning: This creates AWS resources (EC2, Lambda, etc.) that incur costs. Run python -m omnimcp.omniparser.server stop to clean up.

Installation

Quick Start

Ensure environment is activated and .env is configured.
Debug outputs are saved in runs/<timestamp>/.
Note on MCP Server: An experimental MCP server (OmniMCP class in omnimcp/mcp_server.py) exists but is separate from the primary cli.py/AgentExecutor workflow.

Architecture

  1. CLI (cli.py) - Entry point, setup, starts Executor.
  2. Agent Executor (omnimcp/agent_executor.py) - Orchestrates loop, manages state/artifacts.
  3. Visual State Manager (omnimcp/visual_state.py) - Perception (screenshot, calls parser).
  4. OmniParser Client & Deploy (omnimcp/omniparser/) - Manages OmniParser server communication/deployment.
  5. LLM Planner (omnimcp/core.py) - Generates action plan.
  6. Input Controller (omnimcp/input.py) - Executes actions (mouse/keyboard).
  7. (Optional) MCP Server (omnimcp/mcp_server.py) - Experimental MCP interface.
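
One step that sits between the perception and input components is turning a detected element into a pointer position. The sketch below illustrates that hand-off under stated assumptions: it assumes the parser reports bounds normalized to the 0..1 range of the screenshot, and that the input layer works in logical points, which differ from screenshot pixels by a display scale factor (for example 2.0 on a high-density display). The Bounds type and click_point function are hypothetical and are not taken from omnimcp.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Bounds:
    """Hypothetical element bounds, normalized to the screenshot (0..1)."""
    x: float
    y: float
    width: float
    height: float


def click_point(
    bounds: Bounds,
    screenshot_width: int,
    screenshot_height: int,
    scale: float = 1.0,
) -> Tuple[int, int]:
    """Return the element's center in logical points for the input layer."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    # Center of the element in screenshot pixels.
    center_x = (bounds.x + bounds.width / 2) * screenshot_width
    center_y = (bounds.y + bounds.height / 2) * screenshot_height
    # Screenshot pixels -> logical points (assumed conversion).
    return int(center_x / scale), int(center_y / scale)
```

For example, an element at Bounds(0.25, 0.5, 0.5, 0.25) in a 1600x800 screenshot has its center at pixel (800, 500); with a scale factor of 2.0 the pointer target becomes (400, 250).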

Development

Environment Setup & Checks

Debug Support

Running python cli.py saves timestamped runs in runs/, including:
  • step_N_state_raw.png
  • step_N_state_parsed.png (with element boxes)
  • step_N_action_highlight.png (with action highlight)
  • final_state.png
Detailed logs are in logs/run_YYYY-MM-DD_HH-mm-ss.log (LOG_LEVEL=DEBUG in .env recommended).
(Note: Details such as timings, counts, IPs, instance IDs, and specific plans vary from run to run.)
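
When post-processing a run, it can help to compute these artifact paths instead of typing them by hand. The helpers below are a hypothetical sketch (new_run_paths and step_artifacts are not project functions). The file names follow the list above; the assumption that the runs/ directory uses the same timestamp format as the log file is mine and may not match the actual layout.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict


def new_run_paths(now: datetime, root: Path = Path(".")) -> Dict[str, Path]:
    """Build the run directory and log file paths for one timestamped run."""
    # Mirrors logs/run_YYYY-MM-DD_HH-mm-ss.log, reading "mm" as minutes.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = root / "runs" / stamp  # assumed directory naming
    return {
        "run_dir": run_dir,
        "log_file": root / "logs" / f"run_{stamp}.log",
        "final_state": run_dir / "final_state.png",
    }


def step_artifacts(run_dir: Path, step: int) -> Dict[str, Path]:
    """Return the three per-step image paths for step number `step`."""
    return {
        "raw": run_dir / f"step_{step}_state_raw.png",
        "parsed": run_dir / f"step_{step}_state_parsed.png",       # element boxes
        "highlight": run_dir / f"step_{step}_action_highlight.png",  # action highlight
    }
```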

Roadmap & Limitations

Key limitations & future work areas:
  • Performance: Reduce OmniParser latency (explore local models, caching, etc.) and optimize state management (avoid full re-parse).
  • Robustness: Improve LLM planning reliability (prompts, techniques like ReAct), add action verification/error recovery, enhance element targeting.
  • Target API/Architecture: Evolve towards a higher-level declarative API (e.g., @omni.publish style) and potentially integrate loop logic with the experimental MCP Server (OmniMCP class).
  • Consistency: Refactor demo_synthetic.py to use AgentExecutor.
  • Features: Expand action space (drag/drop, hover).
  • Testing: Add E2E tests, broaden cross-platform validation, define evaluation metrics.
  • Research: Explore fine-tuning, process graphs (RAG), framework integration.
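
As one possible direction for the caching idea in the Performance item, the sketch below memoizes parser results by a hash of the screenshot bytes. It is an illustration only and is not implemented in the project: ParseCache is a hypothetical name, and the parse callable stands in for whatever function sends a screenshot to the OmniParser server. Exact-byte hashing only helps when the screen is pixel-identical between steps, so it addresses the simplest case of avoiding a full re-parse.

```python
import hashlib
from typing import Callable, Dict, List


class ParseCache:
    """Memoize parse results keyed by the SHA-256 of the screenshot bytes."""

    def __init__(self, parse: Callable[[bytes], List[dict]]) -> None:
        self._parse = parse
        self._cache: Dict[str, List[dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, screenshot: bytes) -> List[dict]:
        key = hashlib.sha256(screenshot).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            # Only call the (slow) parser for screenshots not seen before.
            self.misses += 1
            self._cache[key] = self._parse(screenshot)
        return self._cache[key]
```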

Project Status

Core loop via cli.py/AgentExecutor is functional for basic tasks. Performance and robustness need significant improvement. MCP integration is experimental.

Contributing

  1. Fork repository
  2. Create feature branch
  3. Implement changes & add tests
  4. Ensure checks pass (uv run ruff format ., uv run ruff check . --fix, uv run pytest tests/)
  5. Submit pull request

License

MIT License

Contact
