io.github.ofershap/scraper

Search & Retrieval

by ofershap

An MCP server for web scraping that extracts clean markdown, links, and metadata from any URL.

README

mcp-server-scraper

Extract clean, readable content from any URL. Returns markdown text, links, and metadata. No API keys, no config. A free alternative to Firecrawl for scraping docs, blogs, and articles.

```bash
npx mcp-server-scraper
```

Works with Claude Desktop, Cursor, VS Code Copilot, and any MCP client. No accounts or API keys needed.

Why

When you're working with an AI assistant and need to reference a docs page, a blog post, or an API reference, you usually end up copy-pasting content manually. Tools like Firecrawl solve this but require a paid API key. This server covers the common case for free: it fetches a URL, runs it through Mozilla Readability (the same engine behind Firefox Reader View), and returns clean markdown. It works well for server-rendered content like documentation sites, blog posts, and articles. It won't handle JavaScript-heavy SPAs, since no browser runs the page's scripts, but for the most common use case of "read this docs page and summarize it," it does the job.

Tools

| Tool | What it does |
| --- | --- |
| `scrape_url` | Extract clean text content from a URL (Readability-powered) |
| `extract_links` | Get all links with href and anchor text |
| `extract_metadata` | Get title, description, OG tags, canonical, favicon |
| `search_page` | Search for a query string within the page, return matching lines |
| `scrape_multiple` | Batch scrape multiple URLs, get title + excerpt per URL |
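
To make the search_page row concrete, here is a minimal TypeScript sketch of line-level matching over page text that has already been extracted. It is not the server's source: the `searchPage` function, the `LineMatch` shape, and the case-insensitive matching are all assumptions made for this example.

```typescript
// Hypothetical result shape; the real tool's output format is not documented here.
interface LineMatch {
  lineNumber: number; // 1-based position of the line in the page text
  text: string; // the matching line, trimmed
}

// Assumed behavior: case-insensitive substring match, one result per matching line.
function searchPage(pageText: string, query: string): LineMatch[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return []; // an empty query matches nothing

  const matches: LineMatch[] = [];
  pageText.split(/\r?\n/).forEach((line, index) => {
    if (line.toLowerCase().indexOf(needle) !== -1) {
      matches.push({ lineNumber: index + 1, text: line.trim() });
    }
  });
  return matches;
}

// Sample text standing in for a scraped docs page.
const sample = [
  "# Getting started",
  "Authentication uses a bearer token.",
  "Rate limits apply per key.",
  "See the authentication guide for details.",
].join("\n");

console.log(searchPage(sample, "authentication"));
```

With the sample text, the two lines that mention authentication come back with their 1-based line numbers (2 and 4).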

Quick Start

Cursor

Add to .cursor/mcp.json:

```json
{
  "mcpServers": {
    "scraper": {
      "command": "npx",
      "args": ["-y", "mcp-server-scraper"]
    }
  }
}
```

Claude Desktop

Add to claude_desktop_config.json:

```json
{
  "mcpServers": {
    "scraper": {
      "command": "npx",
      "args": ["-y", "mcp-server-scraper"]
    }
  }
}
```

VS Code

Add to your user settings.json (in a workspace .vscode/mcp.json, VS Code expects the "servers" object at the top level, without the "mcp" wrapper):

```json
{
  "mcp": {
    "servers": {
      "scraper": {
        "command": "npx",
        "args": ["-y", "mcp-server-scraper"]
      }
    }
  }
}
```

Examples

  • "Scrape the API docs from https://docs.example.com and summarize them"
  • "Extract all links from this page"
  • "What's the OG image and description for this URL?"
  • "Search this page for mentions of 'authentication'"
  • "Scrape these 5 URLs and give me a summary of each"
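
Behind a prompt like these, the MCP client sends the server a JSON-RPC `tools/call` request. The sketch below builds one for scrape_url. The `method` and `params` layout follows the MCP specification; the `url` argument name and the `buildToolCall` helper are assumptions, since the tool's input schema isn't shown here.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string; // tool name, e.g. "scrape_url"
    arguments: Record<string, unknown>; // tool-specific input
  };
}

let nextId = 1; // each request needs its own id so responses can be matched

// Hypothetical helper, not part of any SDK.
function buildToolCall(toolName: string, args: Record<string, unknown>): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// "url" is an assumed argument name; check the server's tool listing for the real schema.
const request = buildToolCall("scrape_url", { url: "https://docs.example.com" });
console.log(JSON.stringify(request, null, 2));
```

In practice your MCP client builds and sends this for you; the sketch only shows what travels over the wire.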

How it works

Uses Mozilla Readability (the engine behind Firefox Reader View) plus linkedom for fast HTML parsing in Node. No headless browser needed. Works best with server-rendered pages: docs, blogs, articles, news sites.
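
To see why server-rendered pages suit this approach, consider a rough heuristic that checks how much readable text the raw HTML carries before any script runs. This is an illustration only, not something the server does: `visibleText`, `looksServerRendered`, and the character threshold are hypothetical, and the regex-based stripping is far cruder than a real HTML parser.

```typescript
// Drop script/style/noscript blocks, strip remaining tags, collapse whitespace.
function visibleText(html: string): string {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Assumed rule of thumb: a page with little text in its raw HTML probably
// renders its content client-side, so a no-browser extractor will find nothing.
function looksServerRendered(html: string, minChars: number): boolean {
  return visibleText(html).length >= minChars;
}

// A typical SPA shell: an empty mount point plus a script bundle.
const spaShell =
  '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';

// A server-rendered page: the content is already in the markup.
const docsPage =
  "<html><body><article><h1>Install</h1>" +
  "<p>Run npx mcp-server-scraper to start the server, then point your MCP client at it.</p>" +
  "</article></body></html>";

console.log(looksServerRendered(spaShell, 50)); // the shell has no text at all
console.log(looksServerRendered(docsPage, 50)); // the article text is present
```

The SPA shell yields an empty string once its script tag is removed, which is all a fetch-and-parse scraper would ever see of that page.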

Development

```bash
npm install
npm run typecheck
npm run build
npm test
```

See also

More MCP servers and developer tools on my portfolio.

Author

Made by ofershap


License

MIT © Ofer Shapira
