mcp-crew-risk

by deeppath-ai

An automated compliance-detection toolset for crawler developers and operators that evaluates a target website's crawler-friendliness and potential risks across three dimensions: legal, social/ethical, and technical.

README

<div align="center">

mcp-crew-risk

</div>

A Crawler Risk Assessor based on the Model Context Protocol (MCP).
This server provides a simple API interface that allows users to perform a comprehensive crawler compliance risk assessment for a specified webpage.

<div>

Crawler Compliance Risk Assessment Framework Description

This framework aims to provide crawler developers and operators with a comprehensive, automated compliance-detection toolset for evaluating the crawler-friendliness and potential risks of target websites. It covers three dimensions: legal, social/ethical, and technical. Through multi-level risk warnings and concrete recommendations, it helps teams plan crawler strategies sensibly, avoiding legal disputes and negative social impact while improving technical stability and efficiency.


Framework Structure

1. Legal Risk

Detection Content

  • Whether there are explicit Terms of Service restricting crawler activities
  • Whether the website declares copyright information and whether content is copyright protected
  • Whether pages contain sensitive personal data (e.g., emails, phone numbers, ID numbers)

Risk Significance

Violating terms may lead to breach of contract, infringement, or criminal liability; scraping sensitive data may violate privacy regulations such as the GDPR or CCPA.

Detection Examples

  • Detect <meta> tags and key terms in page content
  • Regex matching for emails and phone numbers (see the sketch below)
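
A minimal TypeScript sketch of the regex matching described above. The patterns are illustrative only; real-world phone and ID formats vary by region and need stricter, locale-aware rules:

```typescript
// Illustrative patterns; production detection needs region-aware rules.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const PHONE_RE = /\+?\d{1,3}[-\s]?\(?\d{2,4}\)?[-\s]?\d{3,4}[-\s]?\d{3,4}/g;

// Scan raw page HTML/text for potentially sensitive personal data.
function findSensitiveData(html: string): { emails: string[]; phones: string[] } {
  return {
    emails: html.match(EMAIL_RE) ?? [],
    phones: html.match(PHONE_RE) ?? [],
  };
}
```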

2. Social/Ethical Risk

Detection Content

  • Whether robots.txt disallows crawler access to specific paths
  • Anti-crawling technologies deployed by the site (e.g., Cloudflare JS Challenge)
  • Risks of collecting user privacy or sensitive information

Risk Significance

Excessive crawling may harm user experience and trust; collecting private data has ethical risks and social responsibility implications.

Detection Examples

  • Accessing and parsing robots.txt (see the sketch below)
  • Detecting anti-crawling mechanisms and JS challenges
  • Sensitive-information extraction warnings
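
As a sketch of the robots.txt step, the snippet below fetches and parses the file, keeping only the rules in `User-agent: *` sections. It uses Axios (listed later as a project dependency); this simplified parser ignores bot-specific sections and directives such as `Crawl-delay`:

```typescript
import axios from "axios";

interface RobotsRules { allow: string[]; disallow: string[] }

// Fetch robots.txt and collect Allow/Disallow rules for the "*" user-agent.
async function fetchRobots(origin: string): Promise<RobotsRules | null> {
  const res = await axios.get(`${origin}/robots.txt`, {
    timeout: 10_000,
    validateStatus: () => true, // don't throw on 4xx/5xx
  });
  if (res.status !== 200 || typeof res.data !== "string") return null;

  const rules: RobotsRules = { allow: [], disallow: [] };
  let applies = false; // are we inside a "User-agent: *" section?
  for (const raw of res.data.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (key.trim().toLowerCase()) {
      case "user-agent": applies = value === "*"; break;
      case "allow":      if (applies && value) rules.allow.push(value); break;
      case "disallow":   if (applies && value) rules.disallow.push(value); break;
    }
  }
  return rules;
}
```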

3. Technical Risk

Detection Content

  • Whether redirects, CAPTCHAs, or JS-rendering obstacles are encountered during access
  • Whether robots.txt can be fetched successfully to obtain crawler rules
  • Exposure of target API paths and possible permission or rate-limiting restrictions

Risk Significance

Technical risks may cause crawler failure, IP bans, or incomplete data, affecting business stability.

Detection Examples

  • HTTP status code and response-header analysis (see the sketch below)
  • Anti-crawling technology detection
  • API path scanning
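
A rough illustration of how status codes and response headers can be read as risk signals; the specific signals below are assumptions for illustration, not the server's actual rules:

```typescript
// Interpret a probe's status code and (lower-cased) headers as risk signals.
function analyzeResponse(status: number, headers: Record<string, string>) {
  return {
    ok: status >= 200 && status < 300,
    rateLimited: status === 429 || "retry-after" in headers,
    forbidden: status === 403,   // often a WAF or IP-level block
    unavailable: status === 503, // sometimes an anti-bot challenge page
    waf: /cloudflare|akamai/i.test(headers["server"] ?? ""),
  };
}
```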

Rating System

  • allowed: No obvious restrictions or risks, generally safe to crawl
  • partial: Some restrictions (e.g., robots.txt disallows some paths, anti-crawling measures), requires cautious operation
  • blocked: Severe restrictions or high risk (e.g., heavy JS anti-crawling challenges, sensitive data protection), crawling is not recommended
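
One possible way to fold individual findings into the three-level rating. The field names and the aggregation rule are illustrative assumptions, not the server's actual scoring logic:

```typescript
type Rating = "allowed" | "partial" | "blocked";

interface Findings {
  jsChallenge: boolean;     // e.g., a Cloudflare JS Challenge was detected
  sensitiveData: boolean;   // personal data found on the page
  robotsDisallows: boolean; // robots.txt disallows some paths
  antiBotSignals: boolean;  // WAF or other anti-crawling hints
}

// Severe signals dominate; softer restrictions downgrade to "partial".
function rate(f: Findings): Rating {
  if (f.jsChallenge || f.sensitiveData) return "blocked";
  if (f.robotsDisallows || f.antiBotSignals) return "partial";
  return "allowed";
}
```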

Recommendations

| Risk Dimension | Summary Recommendations |
| --- | --- |
| Legal Risk | Carefully read and comply with the target site's Terms of Service; avoid scraping sensitive or personal data; consult legal counsel if necessary. |
| Social/Ethical Risk | Control crawl frequency; avoid impacting server performance and user experience; be transparent about data sources and usage. |
| Technical Risk | Use appropriate crawler frameworks and strategies; support dynamic rendering and anti-crawling bypass; handle exceptions and monitor access health in real time. |

Implementation Process

  1. Pre-crawl Assessment
    Run compliance assessment on the target site to confirm risk levels and restrictions.

  2. Compliance Strategy Formulation
    Adjust crawler access frequency and content scope according to assessment results to avoid breaches or violations.

  3. Crawler Execution and Monitoring
    Continuously monitor technical exceptions and risk changes during crawling; regularly reassess.

  4. Data Processing and Protection
    Ensure crawled data complies with privacy protection requirements and perform necessary anonymization.
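
For step 4, a minimal anonymization pass that masks emails and phone numbers before storage; the patterns are illustrative:

```typescript
// Mask common personal identifiers in scraped text before persisting it.
function anonymize(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.-]+/g, "[email redacted]")
    .replace(/\+?\d[\d\s()-]{7,}\d/g, "[phone redacted]");
}
```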


Technical Implementation Overview

  • Use Axios + node-fetch for HTTP requests, supporting timeout and redirect control.
  • Parse robots.txt and page meta tags to automatically identify crawler rules.
  • Use regex to detect privacy-sensitive information (emails, phones, ID numbers, etc.).
  • Detect anti-crawling tech (e.g., Cloudflare JS Challenge) and exposed API endpoints.
  • Provide legal, social, and technical risk warnings and comprehensive suggestions via risk evaluation functions.
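
A heuristic sketch of the anti-crawling detection mentioned above. The Cloudflare markers (the `cf-ray` header, `cf_chl_`/`challenge-platform` body tokens) are common signals but not guaranteed, so treat the results as hints:

```typescript
// Heuristic anti-crawling detection from response headers and body.
function detectAntiCrawling(headers: Record<string, string>, body: string) {
  const server = (headers["server"] ?? "").toLowerCase();
  return {
    cloudflare: server.includes("cloudflare") || "cf-ray" in headers,
    jsChallenge: /challenge-platform|cf_chl_/i.test(body),
    captcha: /captcha/i.test(body),
  };
}
```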

Future Extensions

  • Integrate Puppeteer/Playwright for JavaScript-rendered page detection.
  • Automatically parse and notify on Terms of Service text updates.
  • Add dedicated detection modules for GDPR, CCPA, and other regional laws.
  • Combine machine learning models to improve privacy-sensitive data recognition accuracy.
  • Provide Web UI to display compliance reports and risk suggestions.

Summary

This compliance risk assessment framework provides a foundational, comprehensive risk evaluation for crawler development and operation. It helps teams comply with laws, regulations, and ethical principles while improving technical efficiency and data quality, and it reduces potential legal and social risks.

</div> <h1>✅ 1. Technical Checks</h1>

| Check Item | Description | Recommendation |
| --- | --- | --- |
| robots.txt existence | Access https://example.com/robots.txt | If it exists, parse and strictly follow the rules |
| Allowed crawling paths in robots.txt | Check the rules for your User-Agent (e.g., Disallow, Allow) | Set a proper User-Agent for matching |
| Meta robots tag | Whether `<meta name="robots" content="noindex, nofollow">` exists on the page | If present, avoid crawling/indexing the page content |
| X-Robots-Tag response header | Whether HTTP response headers contain X-Robots-Tag (e.g., noindex) | Follow the respective directives |
| Dynamically rendered content | Whether the page depends on JS loading (React/Vue, etc.) | May require a headless browser (e.g., Puppeteer) |
| IP rate limiting / WAF | Whether access-frequency limits, IP blocks, or CAPTCHAs exist | Implement rate limiting, retries, and proxy pools |
| Anti-crawling mechanism detection | Check for token validation, Referer checks, JS obfuscation | Use network analysis tools to investigate |
| API support | Whether the page data is also provided via a public API | Prefer the API for higher efficiency if available |
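
The meta robots and X-Robots-Tag rows above can be checked together. This sketch collects directives from both sources; the meta regex is simplified and assumes `name` appears before `content`:

```typescript
// Gather robots directives from <meta name="robots"> and the X-Robots-Tag header.
function robotsDirectives(html: string, headers: Record<string, string>): string[] {
  const directives: string[] = [];
  const meta = html.match(/<meta[^>]+name=["']robots["'][^>]*content=["']([^"']+)["']/i);
  if (meta) directives.push(...meta[1].split(",").map((s) => s.trim().toLowerCase()));
  const header = headers["x-robots-tag"];
  if (header) directives.push(...header.split(",").map((s) => s.trim().toLowerCase()));
  return directives; // e.g., ["noindex", "nofollow"]
}
```
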
<h1>2. Legal and Ethical Checks</h1>

| Check Item | Description | Recommendation |
| --- | --- | --- |
| Existence of Terms of Service | Check whether the ToS prohibits automated crawling | If explicitly prohibited, do not crawl |
| Website copyright declaration | Whether content copyright is declared in the footer | Avoid crawling copyrighted data for commercial use |
| Public data / open data policy | Some sites offer open data or licenses | Comply with the license or open-source agreement |
| Previous litigation due to crawling | Some sites (e.g., LinkedIn, Facebook) have strict anti-crawling stances | If prior cases exist, the risk is higher; avoid crawling |
<h1>3. Data Protection and Privacy</h1>

| Check Item | Description | Recommendation |
| --- | --- | --- |
| Presence of user-generated content | Comments, avatars, phone numbers, emails, locations, etc. | Scraping such data may violate privacy laws |
| Privacy Policy existence | Check data-usage boundaries and restrictions | Follow the data-processing terms stated in the policy |
| Involvement of EU or CA users | Subject to GDPR or CCPA | Do not store or analyze personal data without consent |
| Scraped personally identifiable info | Phone numbers, IDs, emails, IP addresses | Filter or anonymize unless necessary |
| Sensitive domain data | Medical, financial, minors, etc. | Requires strict compliance; avoid or anonymize |
<h1>4. Practical Operational Suggestions (Compliance-Friendly Strategies)</h1>

| Check Item | Description | Recommendation |
| --- | --- | --- |
| Set a reasonable User-Agent | Clearly indicate the tool's origin, e.g., MyCrawlerBot/1.0 (+email@example.com) | Increases credibility and makes it easy for the site to identify you |
| Set access-frequency limits | Avoid overly frequent requests (e.g., 1-2 requests/sec) | Reduces server load and avoids being blocked |
| Add Referer and Accept headers | Simulate normal browser behavior | Avoids anti-crawling blocks |
| Support a failure-retry mechanism | Handle 503, 429, and dropped-connection errors | Improves robustness |
| Logging and crawl-timing control | Save crawl logs and schedule crawls during off-peak hours | Coordinates with site maintenance windows |
| Indicate the data source in outputs | When used for display or research, cite the data source | Avoids copyright disputes |
| Data-storage anonymization | Especially for personal data | Avoids privacy-law violations |
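
Several of the rows above (a descriptive User-Agent, ~1 request/sec pacing, retries on 429/503) combine naturally into one polite fetch helper. This is a sketch; `MyCrawlerBot/1.0` is the example identity from the table, and the backoff constants are arbitrary:

```typescript
import axios from "axios";

const UA = "MyCrawlerBot/1.0 (+email@example.com)"; // identify your crawler
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Polite GET: descriptive headers, ~1 request/sec pacing, retry on 429/503.
async function politeGet(url: string, retries = 3): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    const res = await axios.get(url, {
      timeout: 10_000,
      headers: { "User-Agent": UA, Accept: "text/html" },
      validateStatus: () => true,
    });
    if ((res.status === 429 || res.status === 503) && attempt < retries) {
      await sleep(1_000 * 2 ** attempt); // exponential backoff before retrying
      continue;
    }
    await sleep(1_000); // pace subsequent requests at roughly 1 req/sec
    return String(res.data);
  }
}
```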

🧠 One-sentence summary:

No robots.txt ≠ permission to crawl freely; technical crawlability ≠ legal permission; respect data, websites, and users — that is the foundation of compliant crawling.

<h1 align="center"> 🚩Features </h1>

MCP-Based Website Crawler Compliance Risk Assessment – Main Features:


1. Target Website Access and Basic Status Check ✅ Completed

  • Access the target website's homepage, with timeout control and up to 5 redirects supported
  • Return HTTP status code to determine site accessibility
  • Detect redirects and warn about potential risks
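
A sketch of this accessibility check with Axios. Reading the final URL via `res.request.res.responseUrl` relies on the Node adapter's follow-redirects internals and is an assumption, hence the optional chaining:

```typescript
import axios from "axios";

// Probe the homepage with a timeout, following at most 5 redirects.
async function checkAccess(url: string) {
  const res = await axios.get(url, {
    timeout: 10_000,
    maxRedirects: 5,
    validateStatus: () => true, // report the status instead of throwing
  });
  const finalUrl: string = res.request?.res?.responseUrl ?? url;
  return {
    status: res.status,
    reachable: res.status < 400,
    redirected: finalUrl !== url, // warn when a redirect occurred
    finalUrl,
  };
}
```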

2. Anti-Crawling Mechanism Detection ✅ Completed

  • Detect if server uses Cloudflare or similar anti-crawling protection
  • Detect presence of JavaScript verification challenges (e.g., Cloudflare JS Challenge)
  • Parse page <meta name="robots"> tags and HTTP response header X-Robots-Tag
  • Automatically request and parse robots.txt, extract allowed and disallowed paths

3. Sensitive Content Detection and Legal Risk Warning ✅ Completed

  • Detect copyright notices and Terms of Service related information on pages
  • Regex match to identify possible personal private information (email, phone, ID)
  • Provide legal compliance warnings to prevent infringement and privacy leaks

4. Public API Endpoint Detection ✅ Completed

  • Access common API paths (e.g., /api/, /v1/, /rest/)
  • Determine whether APIs are open and whether authentication is required; warn about potential permission and rate limiting risks
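
A sketch of the endpoint probe. The path list comes from the bullet above; interpreting 401/403 as "authentication required" is a simplifying assumption:

```typescript
import axios from "axios";

const COMMON_API_PATHS = ["/api/", "/v1/", "/rest/"];

interface ApiProbe { path: string; status: number; exposed: boolean; authRequired: boolean }

// Probe common API prefixes and note exposure and auth requirements.
async function scanApiPaths(origin: string): Promise<ApiProbe[]> {
  const results: ApiProbe[] = [];
  for (const path of COMMON_API_PATHS) {
    const res = await axios.get(origin + path, {
      timeout: 5_000,
      validateStatus: () => true,
    });
    results.push({
      path,
      status: res.status,
      exposed: res.status !== 404,
      authRequired: res.status === 401 || res.status === 403,
    });
  }
  return results;
}
```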

5. Comprehensive Risk Evaluation and Classification ✅ Completed

  • Provide three-level crawl permissibility rating based on all detection results:
    • allowed: no obvious restrictions or risks
    • partial: some technical or compliance restrictions
    • blocked: obvious anti-crawling or high risk

6. Planned Features 🚧 Pending

<div align="center"> <img src="./doc/case1.jpg" width=800px/> <img src="./doc/case2.jpg" width=800px/> </div> <h1 align="center">⚙️Installation</h1>
```bash
git clone https://github.com/Joooook/mcp-crew-risk.git
npm i
```

<div align="center">▶️Quick Start</div>

CLI

```bash
npx -y mcp-crew-risk
```

MCP Server Configuration

```json
{
    "mcpServers": {
        "mcp-crew-risk": {
            "command": "npx",
            "args": [
                "-y",
                "mcp-crew-risk"
            ]
        }
    }
}
```

<div align="center">💭Murmurs</div>

This project is for learning purposes only. Contributions and feature requests are welcome.

<div align="center"><h1>Contact</h1></div> <img width="380" height="200" src="./doc/dpai.jpg" alt="mcp-crew-risk MCP server" />

Business Collaboration Email: deeppathai@outlook.com


🧠 MCP Access Links

  • 🌐 ModelScope MCP Address
    For testing and integrating the mcp-crew-risk service on the ModelScope platform.

  • 🛠️ Smithery.ai MCP Address
    For visually configuring and invoking the mcp-crew-risk service via the Smithery platform.
