Scrapling MCP Server

Platforms & Services · Editor's Pick

by d4vinci

Scrapling MCP Server is an intelligent scraping tool built for the modern web, able to bypass anti-bot systems such as Cloudflare.

This tool takes the headache out of scraping dynamic pages and anti-bot sites, making it a good fit for developers who need to collect e-commerce prices or news data in bulk. Note, however, that it relies on external browser engines and is resource-intensive, so it is not suited to lightweight tasks.

GitHub · 34.5k stars


README

<!-- mcp-name: io.github.D4Vinci/Scrapling --> <h1 align="center"> <a href="https://scrapling.readthedocs.io"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true"> <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true"> </picture> </a> <br> <small>Effortless Web Scraping for the Modern Web</small> </h1> <p align="center"> <a href="https://trendshift.io/repositories/14244" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14244" alt="D4Vinci%2FScrapling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> <br/> <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_AR.md">العربيه</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_ES.md">Español</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_FR.md">Français</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_DE.md">Deutsch</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_CN.md">简体中文</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_JP.md">日本語</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_RU.md">Русский</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_KR.md">한국어</a> <br/> <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests"> <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a> <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version"> <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a> <a href="https://clickpy.clickhouse.com/dashboard/scrapling" rel="nofollow"><img src="https://img.shields.io/pypi/dm/scrapling" alt="PyPI package downloads"></a> <a 
href="https://github.com/D4Vinci/Scrapling/tree/main/agent-skill" alt="AI Agent Skill directory"> <img alt="Static Badge" src="https://img.shields.io/badge/Skill-black?style=flat&label=Agent&link=https%3A%2F%2Fgithub.com%2FD4Vinci%2FScrapling%2Ftree%2Fmain%2Fagent-skill"></a> <a href="https://clawhub.ai/D4Vinci/scrapling-official" alt="OpenClaw Skill"> <img alt="OpenClaw Skill" src="https://img.shields.io/badge/Clawhub-darkred?style=flat&label=OpenClaw&link=https%3A%2F%2Fclawhub.ai%2FD4Vinci%2Fscrapling-official"></a> <br/> <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank"> <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ"> </a> <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)"> <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev"> </a> <br/> <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions"> <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a> </p> <p align="center"> <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection.html"><strong>Selection methods</strong></a> &middot; <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing.html"><strong>Fetchers</strong></a> &middot; <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a> &middot; <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a> &middot; <a href="https://scrapling.readthedocs.io/en/latest/cli/overview.html"><strong>CLI</strong></a> &middot; <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server.html"><strong>MCP</strong></a> </p>

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation - all in a few lines of Python. One library, zero compromises.

Blazing-fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users alike - there's something for everyone.

python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch website under the radar!
products = p.css('.product', auto_save=True)                                        # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)                                         # Later, if the website structure changes, pass `adaptive=True` to find them!

Or scale up to full crawls

python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
<p align="center"> <a href="https://dataimpulse.com/?utm_source=scrapling&utm_medium=banner&utm_campaign=scrapling" target="_blank" style="display:flex; justify-content:center; padding:4px 0;"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/DataImpulse.png" alt="At DataImpulse, we specialize in developing custom proxy services for your business. Make requests from anywhere, collect data, and enjoy fast connections with our premium proxies." style="max-height:60px;"> </a> </p>

Platinum Sponsors

<table> <tr> <td width="200"> <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"> </a> </td> <td> Scrapling handles Cloudflare Turnstile. For enterprise-grade protection, <a href="https://hypersolutions.co?utm_source=github&utm_medium=readme&utm_campaign=scrapling"> <b>Hyper Solutions</b> </a> provides API endpoints that generate valid antibot tokens for <b>Akamai</b>, <b>DataDome</b>, <b>Kasada</b>, and <b>Incapsula</b>. Simple API calls, no browser automation required. </td> </tr> <tr> <td width="200"> <a href="https://birdproxies.com/t/scrapling" target="_blank" title="At Bird Proxies, we eliminate your pains such as banned IPs, geo restriction, and high costs so you can focus on your work."> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/BirdProxies.jpg"> </a> </td> <td>Hey, we built <a href="https://birdproxies.com/t/scrapling"> <b>BirdProxies</b> </a> because proxies shouldn't be complicated or overpriced. Fast residential and ISP proxies in 195+ locations, fair pricing, and real support. <br /> <b>Try our FlappyBird game on the landing page for free data!</b> </td> </tr> <tr> <td width="200"> <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"> </a> </td> <td> <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling"> <b>Evomi</b> </a>: residential proxies from $0.49/GB. Scraping browser with fully spoofed Chromium, residential IPs, auto CAPTCHA solving, and anti-bot bypass. </br> <b>Scraper API for hassle-free results. 
MCP and N8N integrations are available.</b> </td> </tr> <tr> <td width="200"> <a href="https://tikhub.io/?utm_source=github.com/D4Vinci/Scrapling&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad" target="_blank" title="Unlock the Power of Social Media Data & AI"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TikHub.jpg"> </a> </td> <td> <a href="https://tikhub.io/?utm_source=github.com/D4Vinci/Scrapling&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad" target="_blank">TikHub.io</a> provides 900+ stable APIs across 16+ platforms including TikTok, X, YouTube & Instagram, with 40M+ datasets. <br /> Also offers <a href="https://ai.tikhub.io/?ref=KarimShoair" target="_blank">DISCOUNTED AI models</a> - Claude, GPT, GEMINI & more up to 71% off. </td> </tr> <tr> <td width="200"> <a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank" title="Scalable Web Data Access for AI Applications"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/nsocks.png"> </a> </td> <td> <a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank">Nsocks</a> provides fast Residential and ISP proxies for developers and scrapers. Global IP coverage, high anonymity, smart rotation, and reliable performance for automation and data extraction. Use <a href="https://www.xcrawl.com/?keyword=2p67aivg" target="_blank">Xcrawl</a> to simplify large-scale web crawling. </td> </tr> <tr> <td width="200"> <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"> </a> </td> <td> Close your laptop. Your scrapers keep running. <br /> <a href="https://petrosky.io/d4vinci" target="_blank">PetroSky VPS</a> - cloud servers built for nonstop automation. Windows and Linux machines with full control. From €6.99/mo. 
</td> </tr> <tr> <td width="200"> <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank" title="The #1 newsletter dedicated to Web Scraping"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TWSC.png"> </a> </td> <td> Read a full review of <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank">Scrapling on The Web Scraping Club</a> (Nov 2025), the #1 newsletter dedicated to Web Scraping. </td> </tr> <tr> <td width="200"> <a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank" title="Proxy-Seller provides reliable proxy infrastructure for Web Scraping"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxySeller.png"> </a> </td> <td> <a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank">Proxy-Seller</a> provides reliable proxy infrastructure for web scraping, offering IPv4, IPv6, ISP, Residential, and Mobile proxies with stable performance, broad geo coverage, and flexible plans for business-scale data collection. </td> </tr> <tr> <td width="200"> <a href="http://mangoproxy.com/?utm_source=D4Vinci&utm_medium=GitHub&utm_campaign=D4Vinci" target="_blank" title="Proxies You Can Rely On: Residential, Server, and Mobile"> <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/MangoProxy.png"> </a> </td> <td> <a href="http://mangoproxy.com/?utm_source=D4Vinci&utm_medium=GitHub&utm_campaign=D4Vinci" target="_blank">Stable proxies</a> for scraping, automation, and multi-accounting. Clean IPs, fast response, and reliable performance under load. Built for scalable workflows. </td> </tr> </table>

<i><sub>Do you want to show your ad here? Click here</sub></i>

Sponsors

<!-- sponsors -->

<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a> <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a> <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a> <a href="https://proxyempire.io/?ref=scrapling&utm_source=scrapling" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a> <a href="https://www.webshare.io/?referral_code=48r2m2cd5uz1" target="_blank" title="The Most Reliable Proxy with Unparalleled Performance"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/webshare.png"></a> <a href="https://www.crawleo.dev/?utm_source=github&utm_medium=sponsor&utm_campaign=scrapling" target="_blank" title="Supercharge your AI with Real-Time Web Intelligence"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/crawleo.png"></a> <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>

<!-- /sponsors -->

<i><sub>Do you want to show your ad here? Click here and choose the tier that suits you!</sub></i>


Key Features

Spiders - A Full Crawling Framework

  • 🕷️ Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, and Request/Response objects.
  • Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
  • 🔄 Multi-Session Support: Unified interface for HTTP requests, and stealthy headless browsers in a single spider - route requests to different sessions by ID.
  • 💾 Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
  • 📡 Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats - ideal for UI, pipelines, and long-running crawls.
  • 🛡️ Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
  • 🤖 Robots.txt Compliance: Optional robots_txt_obey flag that respects Disallow, Crawl-delay, and Request-rate directives with per-domain caching.
  • 📦 Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with result.items.to_json() / result.items.to_jsonl() respectively.
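The robots.txt behavior described above (respecting Disallow and Crawl-delay) maps closely onto Python's standard library; as a rough illustration of what compliance involves - this sketch uses only urllib.robotparser, not Scrapling's own implementation:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed in-memory instead of fetched
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Disallowed paths are rejected, everything else is allowed
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public"))        # True
print(rp.crawl_delay("MyBot"))                                    # 2
```

A compliant crawler would call `can_fetch` before queueing each URL and sleep `crawl_delay` seconds between requests to the same domain.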

Advanced Website Fetching with Session Support

  • HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.
  • Dynamic Loading: Fetch dynamic websites with full browser automation through the DynamicFetcher class supporting Playwright's Chromium and Google's Chrome.
  • Anti-bot Bypass: Advanced stealth capabilities with StealthyFetcher and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
  • Session Management: Persistent session support with FetcherSession, StealthySession, and DynamicSession classes for cookie and state management across requests.
  • Proxy Rotation: Built-in ProxyRotator with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
  • Domain Blocking: Block requests to specific domains (and their subdomains) in browser-based fetchers.
  • Async Support: Complete async support across all fetchers and dedicated async session classes.
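Scrapling's ProxyRotator API is its own; the cyclic strategy it defaults to can be sketched with the stdlib (the class name below is illustrative, not Scrapling's):

```python
from itertools import cycle

class CyclicProxyRotator:
    """Round-robin over a proxy pool - the cyclic rotation idea."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        # Each call hands out the next proxy, wrapping around at the end
        return next(self._pool)

rotator = CyclicProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])
picks = [rotator.next_proxy() for _ in range(4)]
# The fourth pick wraps back to the first proxy
```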

Adaptive Scraping & AI Integration

  • 🔄 Smart Element Tracking: Relocate elements after website changes using intelligent similarity algorithms.
  • 🎯 Smart Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
  • 🔍 Find Similar Elements: Automatically locate elements similar to found elements.
  • 🤖 MCP Server to be used with AI: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. (demo video)
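Scrapling's actual similarity algorithms are internal; as a toy illustration of the idea of relocating an element after a redesign, difflib can score candidate selectors against the one that stopped matching:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; higher means the two strings are more alike
    return SequenceMatcher(None, a, b).ratio()

# The selector that worked before the redesign (hypothetical)
old = "div.product > h2.title"

# Candidate selectors found on the updated page
candidates = ["div.item > h2.name", "div.product-card > h2.title", "footer > p"]

# Pick the candidate most similar to the old selector
best = max(candidates, key=lambda c: similarity(old, c))
print(best)  # div.product-card > h2.title
```

Real adaptive relocation compares far richer signals (attributes, text, tree position), but the "score candidates, keep the closest" shape is the same.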

High-Performance & Battle-Tested Architecture

  • 🚀 Lightning Fast: Optimized performance outperforming most Python scraping libraries.
  • 🔋 Memory Efficient: Optimized data structures and lazy loading for a minimal memory footprint.
  • Fast JSON Serialization: 10x faster than the standard library.
  • 🏗️ Battle tested: Not only does Scrapling have 92% test coverage and full type hints coverage, but it has been used daily by hundreds of Web Scrapers over the past year.

Developer/Web Scraper Friendly Experience

  • 🎯 Interactive Web Scraping Shell: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping scripts development, like converting curl requests to Scrapling requests and viewing requests results in your browser.
  • 🚀 Use it directly from the Terminal: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
  • 🛠️ Rich Navigation API: Advanced DOM traversal with parent, sibling, and child navigation methods.
  • 🧬 Enhanced Text Processing: Built-in regex, cleaning methods, and optimized string operations.
  • 📝 Auto Selector Generation: Generate robust CSS/XPath selectors for any element.
  • 🔌 Familiar API: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel.
  • 📘 Complete Type Coverage: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with PyRight and MyPy with each change.
  • 🔋 Ready Docker image: With each release, a Docker image containing all browsers is automatically built and pushed.

Getting Started

Here's a quick glimpse of what Scrapling can do, without diving too deep.

Basic Usage

HTTP requests with session support

python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()

Advanced stealth mode

python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()

Full browser automation

python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()

Spiders

Build full crawlers with concurrent requests, multiple session types, and pause/resume:

python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    
    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
            
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
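The JSONL flavor of the built-in export is simply one JSON document per line; for reference, the format `result.items.to_jsonl()` produces can be emulated with the stdlib (the helper function here is illustrative):

```python
import json

def to_jsonl(items: list[dict]) -> str:
    # One compact JSON document per line - the JSON Lines format
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

items = [
    {"text": "Quote A", "author": "Author A"},
    {"text": "Quote B", "author": "Author B"},
]
print(to_jsonl(items))
```

JSONL is append-friendly, which is why it suits long crawls: each scraped item can be flushed to disk the moment it arrives.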

Use multiple session types in a single spider:

python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback

Pause and resume long crawls with checkpoints by running the spider like this:

python
QuotesSpider(crawldir="./crawl_data").start()

Press Ctrl+C to pause gracefully - progress is saved automatically. Later, when you start the spider again, pass the same crawldir, and it will resume from where it stopped.
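Scrapling's checkpoint format is internal to the library; the general pattern - persist pending/seen URLs under `crawldir` and reload them on start - can be sketched with the stdlib (the file name and state layout below are made up for illustration):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def save_checkpoint(crawldir: Path, state: dict) -> None:
    # Persist crawl progress so a later run can pick up where this one stopped
    (crawldir / "checkpoint.json").write_text(json.dumps(state))

def load_checkpoint(crawldir: Path) -> dict:
    path = crawldir / "checkpoint.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"pending": [], "seen": []}  # no checkpoint yet: start a fresh crawl

with TemporaryDirectory() as d:
    crawldir = Path(d)
    save_checkpoint(crawldir, {"pending": ["https://example.com/page2"],
                               "seen": ["https://example.com/"]})
    state = load_checkpoint(crawldir)  # a restarted run resumes from here
```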

Advanced Parsing & Navigation

python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()

If you don't want to fetch anything, you can use the parser directly, like below:

python
from scrapling.parser import Selector

page = Selector("<html>...</html>")

And it works precisely the same way!

Async Session Management Examples

python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession

async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and can work in both sync/async patterns
    page1 = session.get('https://quotes.toscrape.com/')
    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')

# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
    tasks = []
    urls = ['https://example.com/page1', 'https://example.com/page2']
    
    for url in urls:
        task = session.fetch(url)
        tasks.append(task)
    
    print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
    results = await asyncio.gather(*tasks)
    print(session.get_pool_stats())

CLI & Interactive Shell

Scrapling includes a powerful command-line interface:

asciicast

Launch the interactive Web Scraping shell

bash
scrapling shell

Extract pages to a file directly, without writing any code (by default, the content inside the body tag is extracted). If the output file ends with .txt, the text content of the target is extracted; if it ends with .md, a Markdown representation of the HTML content is written; if it ends with .html, the raw HTML content itself is saved.

bash
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
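The extension-to-format rule described above amounts to a suffix dispatch; a minimal sketch of the rule (the function name and mapping are illustrative, not Scrapling's internals):

```python
from pathlib import Path

def output_mode(filename: str) -> str:
    # Map the output file's extension to the extraction mode described above
    modes = {".txt": "text", ".md": "markdown", ".html": "html"}
    return modes.get(Path(filename).suffix, "unsupported")

print(output_mode("content.md"))     # markdown
print(output_mode("content.txt"))    # text
print(output_mode("captchas.html"))  # html
```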

[!NOTE] There are many additional features, including the MCP server and the interactive Web Scraping Shell, but we want to keep this page concise. Check out the full documentation here

Performance Benchmarks

Scrapling isn't just powerful - it's also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries.

Text Extraction Speed Test (5000 nested elements)

| # | Library | Time (ms) | vs Scrapling |
|---|-------------------|-----------|--------------|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.257x |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |

Element Similarity & Text Search Performance

Scrapling's adaptive element finding capabilities significantly outperform alternatives:

| Library | Time (ms) | vs Scrapling |
|-------------|-----------|--------------|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |

All benchmarks represent averages of 100+ runs. See benchmarks.py for methodology.
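The "average of 100+ runs" methodology is straightforward to reproduce with the stdlib; a minimal harness (the workload below is a stand-in - benchmarks.py times real parsing instead):

```python
import timeit

def mean_ms(func, runs: int = 100) -> float:
    # Average wall-clock time per call, in milliseconds, over `runs` executions
    return timeit.timeit(func, number=runs) / runs * 1000

# Stand-in workload; substitute the parsing call you want to measure
elapsed = mean_ms(lambda: sorted(range(1000)))
print(f"{elapsed:.4f} ms per run")
```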

Installation

Scrapling requires Python 3.10 or higher:

bash
pip install scrapling

This installation only includes the parser engine and its dependencies, without any fetcher or command-line dependencies.

Optional Dependencies

  1. If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install fetchers' dependencies and their browser dependencies as follows:

    bash
    pip install "scrapling[fetchers]"
    
    scrapling install           # normal install
    scrapling install  --force  # force reinstall
    

    This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.

    Or you can install them from the code instead of running a command like this:

    python
    from scrapling.cli import install
    
    install([], standalone_mode=False)          # normal install
    install(["--force"], standalone_mode=False) # force reinstall
    
  2. Extra features:

    • Install the MCP server feature:
      bash
      pip install "scrapling[ai]"
      
    • Install shell features (Web Scraping shell and the extract command):
      bash
      pip install "scrapling[shell]"
      
    • Install everything:
      bash
      pip install "scrapling[all]"
      

    Remember that you need to install the browser dependencies with scrapling install after any of these extras (if you didn't already).

Docker

You can also pull a Docker image with all extras and browsers included, using the following command, from DockerHub:

bash
docker pull pyd4vinci/scrapling

Or download it from the GitHub registry:

bash
docker pull ghcr.io/d4vinci/scrapling:latest

This image is automatically built and pushed using GitHub Actions and the repository's main branch.

Contributing

We welcome contributions! Please read our contributing guidelines before getting started.

Disclaimer

[!CAUTION] This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.

🎓 Citations

If you have used our library for research purposes, please cite it with the following reference:

text
  @misc{scrapling,
    author = {Karim Shoair},
    title = {Scrapling},
    year = {2024},
    url = {https://github.com/D4Vinci/Scrapling},
    note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
  }

License

This work is licensed under the BSD-3-Clause License.

Acknowledgments

This project includes code adapted from:

  • Parsel (BSD License) - used for the translator submodule

<div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>

FAQ

What is Scrapling MCP Server?

Web scraping with stealth HTTP, real browsers, and Cloudflare bypass. CSS selectors supported.

Related Skills

MCP构建 (MCP Building)

by anthropics

Universal · Popular

Focused on high-quality MCP server development, covering protocol research, tool design, error handling, and transport selection; a good fit for wiring up external APIs and wrapping service capabilities with FastMCP or the MCP SDK.

If you want an LLM to call external APIs reliably, use MCP Building: with mature guidance from Python to Node, it helps you ship a high-quality MCP server faster.

Platforms & Services · Not scanned · 109.6k

Slack动图 (Slack GIFs)

by anthropics

Universal · Popular

A Slack-focused GIF-making skill with built-in size, frame-rate, and color constraints for emoji and message GIFs, plus validation and optimization workflows; good for quickly turning ideas or uploaded images into Slack-ready animations.

Helps you quickly produce GIFs tailored to Slack, with built-in constraint rules and validation tools that spare you upload and playback pitfalls; handy for emoji packs and demos alike.

Platforms & Services · Not scanned · 109.6k

接口设计评审 (API Design Review)

by alirezarezvani

Universal · Popular

Reviews whether a REST API design follows industry conventions, automatically checking naming, HTTP methods, status codes, and documentation coverage; it flags breaking changes and assigns a design score, making it a good fit for reviewing API proposals and gatekeeping before version iterations.

When designing APIs and architectures, it helps surface interface-design issues early and align with best practices; the review perspective is systematic, and team collaboration gets easier.

Platforms & Services · Not scanned · 9.0k

Related MCP Servers

Slack 消息 (Slack Messages)

Editor's Pick

by Anthropic

Popular

Slack is an MCP server that lets AI assistants read and write your Slack channels and messages directly.

This server addresses the pain point of teams needing AI to pull Slack information in real time, and is especially suited to development teams that want Claude to summarize channel discussions or send notifications. Note, however, that it is currently only a reference implementation with limited documentation; it is not recommended for direct production use and is better suited to developers studying how MCP integrates with third-party services.

Platforms & Services · 82.9k

by netdata

Popular

io.github.netdata/mcp-server is an MCP server that lets AI assistants monitor server metrics and logs in real time.

It removes the pain of ops staff checking system status by hand, and is best suited to DevOps teams that want Claude to analyze performance data automatically. Note, however, that it depends on an existing NetData deployment; if you have never used this monitoring platform, you'll need to spend time configuring it first.

Platforms & Services · 78.3k
Popular

Chrome DevTools MCP is a tool that lets AI assistants drive the Chrome browser directly for automated debugging and performance analysis.

It addresses the pain point of AI assistants being unable to operate a browser for live debugging, and is especially useful for frontend developers who want Claude to capture page performance data or simulate user interactions automatically. Be aware that it collects usage statistics by default; privacy-sensitive projects need to disable that manually.

Platforms & Services · 33.0k
