# AI-Cursor-Scraping-Assistant

*Created: Apr 23, 2025*
A powerful tool that leverages Cursor AI and MCP (Model Context Protocol) to easily generate web scrapers for various types of websites. This project helps you quickly analyze websites and generate proper Scrapy or Camoufox scrapers with minimal effort.
## Project Overview

This project contains two main components:

1. **Cursor Rules** - A set of rules that teach Cursor AI how to analyze websites and create different types of Scrapy spiders
2. **MCP Tools** - A collection of Model Context Protocol tools that enhance Cursor's capabilities for web scraping tasks

## Prerequisites

- [Cursor AI](https://cursor.sh/) installed
- Python 3.10+ installed
- Basic knowledge of web scraping concepts
## Installation

1. Clone this repository to your local machine.
2. Install the required dependencies.
3. If you plan to use Camoufox, fetch its browser binary.
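The commands below sketch these three steps. The repository URL is a placeholder, and the `requirements.txt` path assumes the dependency file sits in the repo root; `python -m camoufox fetch` is Camoufox's documented way of downloading its browser binary:

```shell
# Clone the repository (replace the URL with the actual repo location)
git clone https://github.com/<your-username>/AI-Cursor-Scraping-Assistant.git
cd AI-Cursor-Scraping-Assistant

# Install the Python dependencies (assumes a requirements.txt in the repo root)
pip install -r requirements.txt

# Optional: download the Camoufox browser binary
python -m camoufox fetch
```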
## Setup

### Setting Up the MCP Server

The MCP server provides tools that help Cursor AI analyze web pages and generate XPath selectors. To start the MCP server:

1. Navigate to the `MCPfiles` directory:

   ```bash
   cd MCPfiles
   ```

2. Update the `CAMOUFOX_FILE_PATH` in `xpath_server.py` to point to your local `Camoufox_template.py` file.

3. Start the MCP server:

   ```bash
   python xpath_server.py
   ```

4. In Cursor, connect to the MCP server by configuring it in the settings or using the MCP panel.
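Cursor can also launch the server itself from a `.cursor/mcp.json` file in the project (or `~/.cursor/mcp.json` globally). A minimal sketch, where the server name and path are illustrative:

```json
{
  "mcpServers": {
    "xpath-helper": {
      "command": "python",
      "args": ["/absolute/path/to/MCPfiles/xpath_server.py"]
    }
  }
}
```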
## Cursor Rules

The `cursor-rules` directory contains rules that teach Cursor AI how to analyze websites and create different types of scrapers. These rules are loaded automatically when you open the project in Cursor.

### Detailed Cursor Rules Explanation

The `cursor-rules` directory contains a set of MDC (Markdown Configuration) files that guide Cursor's behavior when creating web scrapers:
#### `prerequisites.mdc`

This rule handles initial setup tasks before creating any scrapers:

- Gets the full path of the current project using `pwd`
- Stores the path in context for later use by other rules
- Confirms the execution of preliminary actions before proceeding
#### `website-analysis.mdc`

This comprehensive rule guides Cursor through website analysis:

- Identifies the type of Scrapy spider to build (PLP, PDP, etc.)
- Fetches and stores the homepage HTML and cookies
- Strips CSS using the MCP tool to simplify HTML analysis
- Checks cookies for anti-bot protections (Akamai, Datadome, PerimeterX, etc.)
- For PLP scrapers: fetches category pages, analyzes their structure, and looks for embedded JSON data
- For PDP scrapers: fetches product pages, analyzes their structure, and looks for embedded JSON data
- Detects schema.org markup and modern frameworks such as Next.js
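The cookie check above boils down to matching cookie names against known anti-bot vendors. A minimal standalone sketch of that idea; the prefix-to-vendor mapping below uses commonly documented cookie names, not necessarily the rule's exact list:

```python
# Map well-known anti-bot cookie name prefixes to their vendor.
# These names are commonly observed in the wild; the rule's actual
# list may differ.
ANTIBOT_COOKIE_PREFIXES = {
    "_abck": "Akamai Bot Manager",
    "ak_bmsc": "Akamai Bot Manager",
    "bm_sv": "Akamai Bot Manager",
    "datadome": "Datadome",
    "_px": "PerimeterX",
    "__cf_bm": "Cloudflare Bot Management",
}


def detect_antibot(cookie_names):
    """Return the set of anti-bot vendors suggested by the given cookie names."""
    vendors = set()
    for name in cookie_names:
        for prefix, vendor in ANTIBOT_COOKIE_PREFIXES.items():
            if name.lower().startswith(prefix):
                vendors.add(vendor)
    return vendors
```

For example, `detect_antibot(["_abck", "sessionid"])` returns `{"Akamai Bot Manager"}`, signalling that the generated spider will likely need countermeasures for that vendor.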
#### `scrapy-step-by-step-process.mdc`

This rule provides the execution flow for creating scrapers:

- Outlines the sequence of steps to follow
- References the other rule files in the correct order
- Ensures prerequisite actions are completed before scraper creation
- Guides Cursor to analyze the website before generating code
#### `scrapy.mdc`

This extensive rule contains Scrapy best practices:

- Defines recommended code organization and directory structure
- Details file naming conventions and module organization
- Provides component architecture guidelines
- Offers strategies for code splitting and reuse
- Includes performance optimization recommendations
- Covers security practices, error handling, and logging
- Provides specific syntax examples and code snippets
#### `scraper-models.mdc`

This rule defines the different types of scrapers that can be created:

- **E-commerce PLP** (product listing page): details the data structure, field definitions, and implementation steps
- **E-commerce PDP** (product detail page): details the data structure, field definitions, and implementation steps
- Field mapping guidelines for all scraper types
- Step-by-step instructions for creating each type of scraper
- Default settings recommendations
- Anti-bot countermeasures for different protection systems
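As an illustration of what a PLP data structure might look like, here is a sketch as a Python dataclass. The field names are an assumption for illustration, not the exact schema defined in `scraper-models.mdc`:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PLPItem:
    """Hypothetical product-listing-page item; field names are illustrative."""

    product_url: str
    product_name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    image_url: Optional[str] = None


item = PLPItem(
    product_url="https://www.example.com/p/123",
    product_name="Leather Loafer",
    price=790.0,
    currency="EUR",
)
```

`asdict(item)` converts the item to a plain dict, which is the shape Scrapy pipelines and exporters expect.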
## Usage

Here's how to use the AI-Cursor-Scraping-Assistant:

1. Open the project in Cursor AI.
2. Make sure the MCP server is running.
3. Ask Cursor to create a scraper with a prompt like:

   ```
   Write an e-commerce PLP scraper for the website gucci.com
   ```

Cursor will then:

- Analyze the website structure
- Check for anti-bot protection
- Extract the relevant HTML elements
- Generate a complete Scrapy spider based on the website type