TikTok Profile Scraper

by arulmozhiv

A browser-based TikTok profile discovery and scraping tool.

4.5k · Search & Retrieval · Not scanned · March 30, 2026

Installation

claude skill add --url https://github.com/openclaw/skills

Documentation

Part of ScrapeClaw — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, TikTok, and Facebook built with Python & Playwright, no API keys required.

yaml
---
name: tiktok-scraper
description: Discover and scrape TikTok profiles from your browser.
emoji: 🎵
version: 1.0.0
author: influenza
tags:
  - tiktok
  - scraping
  - social-media
  - influencer-discovery
metadata:
  clawdbot:
    requires:
      bins:
        - python3
        - chromium

    config:
      stateDirs:
        - data/output
        - data/queue
        - thumbnails
      outputFormats:
        - json
        - csv
---

Overview

This skill provides a two-phase TikTok scraping system:

  1. Profile Discovery
  2. Browser Scraping

Features

  • 🔍 Discover TikTok profiles by location and category
  • 🌐 Full browser simulation for accurate scraping
  • 🛡️ Browser fingerprinting, human behavior simulation, and stealth scripts
  • 📊 Profile info, stats, video thumbnails, and engagement data
  • 💾 JSON/CSV export with downloaded thumbnails
  • 🔄 Resume interrupted scraping sessions
  • ⚡ Auto-skip private accounts, low-follower accounts, and empty profiles
  • 🌍 Built-in residential proxy support with 4 providers

Getting Google API Credentials (Optional)

  1. Go to Google Cloud Console
  2. Create a new project or select an existing one
  3. Enable "Custom Search API"
  4. Create API credentials → API Key
  5. Go to Programmable Search Engine
  6. Create a search engine with tiktok.com as the site to search
  7. Copy the Search Engine ID
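
With the API key and Search Engine ID in hand, a discovery query can be built against Google's Custom Search JSON API. The sketch below is not taken from this skill's source. It assumes the public `https://www.googleapis.com/customsearch/v1` endpoint with its `key`, `cx`, and `q` parameters; the helper names `build_search_url` and `extract_usernames` and the query wording are hypothetical.

```python
import re
from urllib.parse import urlencode

# Public Custom Search JSON API endpoint (assumed; not confirmed by this skill's source).
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

# Matches profile URLs such as https://www.tiktok.com/@example_creator
PROFILE_RE = re.compile(r"tiktok\.com/@([A-Za-z0-9._]+)")


def build_search_url(api_key: str, search_engine_id: str,
                     location: str, category: str) -> str:
    """Build the request URL for one location/category pair (hypothetical helper)."""
    params = {
        "key": api_key,            # API key from Google Cloud Console
        "cx": search_engine_id,    # Programmable Search Engine ID
        "q": f"{category} creator {location}",  # query wording is an assumption
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


def extract_usernames(links: list[str]) -> list[str]:
    """Pull unique TikTok usernames out of result links, preserving order."""
    seen: dict[str, None] = {}
    for link in links:
        match = PROFILE_RE.search(link)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)
```

The URL can be fetched with any HTTP client; result links from the response would then be passed to `extract_usernames`.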

Usage

Agent Tool Interface

For OpenClaw agent integration, the skill provides JSON output:

bash
# Discover profiles (returns JSON)
discover --location "Miami" --category "dance" --output json

# Scrape single profile (returns JSON)
scrape --username charlidamelio --output json

Output Data

Profile Data Structure

json
{
  "username": "example_creator",
  "full_name": "Example Creator",
  "nickname": "Example",
  "bio": "Dance creator | NYC 💃",
  "bio_link": "https://example.com",
  "followers": 250000,
  "following": 800,
  "likes": 5000000,
  "videos_count": 120,
  "is_verified": false,
  "is_private": false,
  "influencer_tier": "macro",
  "category": "dance",
  "location": "New York",
  "profile_url": "https://www.tiktok.com/@example_creator",
  "profile_pic_local": "thumbnails/example_creator/profile_abc123.jpg",
  "content_thumbnails": [
    "thumbnails/example_creator/content_1_def456.jpg",
    "thumbnails/example_creator/content_2_ghi789.jpg"
  ],
  "video_views": [
    {"display": "1.2M", "count": 1200000},
    {"display": "500K", "count": 500000}
  ],
  "scrape_timestamp": "2026-03-02T14:30:00"
}
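
Each `video_views` entry pairs the abbreviated string shown on the page with an integer. One way to derive the integer is sketched below; `parse_count` and `to_view_entry` are hypothetical names, and the `B` suffix is an assumption that does not appear in the sample data.

```python
from decimal import Decimal

# Suffix multipliers for abbreviated counts; "B" is included as an assumption.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(display: str) -> int:
    """Convert an abbreviated count such as "1.2M" into an integer."""
    text = display.strip().upper().replace(",", "")
    if text and text[-1] in _MULTIPLIERS:
        # Decimal keeps the arithmetic exact instead of relying on float rounding.
        return int(Decimal(text[:-1]) * _MULTIPLIERS[text[-1]])
    return int(Decimal(text))


def to_view_entry(display: str) -> dict:
    """Build one item in the shape used by the "video_views" list."""
    return {"display": display, "count": parse_count(display)}
```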

Influencer Tiers

| Tier | Follower Range |
| --- | --- |
| nano | < 1,000 |
| micro | 1,000 - 10,000 |
| mid | 10,000 - 100,000 |
| macro | 100,000 - 1M |
| mega | > 1,000,000 |
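
The tier table can be expressed as a small lookup function. The ranges above share their endpoints (10,000 appears in both micro and mid), so this sketch assumes a shared boundary belongs to the higher tier, except that exactly 1,000,000 stays in macro because mega is listed as strictly greater. The function name `influencer_tier` is hypothetical.

```python
def influencer_tier(followers: int) -> str:
    """Map a follower count to a tier name from the table.

    Assumption: a shared boundary (1,000 / 10,000 / 100,000) falls into
    the higher tier; exactly 1,000,000 is still "macro".
    """
    if followers < 1_000:
        return "nano"
    if followers < 10_000:
        return "micro"
    if followers < 100_000:
        return "mid"
    if followers <= 1_000_000:
        return "macro"
    return "mega"
```

Under this reading, the sample profile with 250,000 followers maps to `macro`, matching its `influencer_tier` field.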

File Outputs

  • Queue files: data/queue/{location}_{category}_{timestamp}.json
  • Scraped data: data/output/{username}.json
  • Thumbnails: thumbnails/{username}/profile_*.jpg, thumbnails/{username}/content_*.jpg
  • Export files: data/export_{timestamp}.json, data/export_{timestamp}.csv
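
Because each scraped profile lands in `data/output/{username}.json`, an interrupted run can in principle be resumed by skipping usernames whose output file already exists. The sketch below illustrates that idea only. It assumes the queue file is a plain JSON array of usernames, which the documentation does not confirm, and `pending_usernames` is a hypothetical helper.

```python
import json
from pathlib import Path


def pending_usernames(queue_file: Path, output_dir: Path) -> list[str]:
    """Return queued usernames that have no output file yet.

    Assumption: the queue file holds a JSON array of username strings.
    """
    queued = json.loads(queue_file.read_text(encoding="utf-8"))
    return [
        name for name in queued
        if not (output_dir / f"{name}.json").exists()
    ]
```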

Configuration

Edit config/scraper_config.json:

json
{
  "proxy": {
    "enabled": false,
    "provider": "brightdata",
    "country": "",
    "sticky": true,
    "sticky_ttl_minutes": 10
  },
  "google_search": {
    "enabled": true,
    "api_key": "",
    "search_engine_id": "",
    "queries_per_location": 3
  },
  "scraper": {
    "headless": false,
    "min_followers": 1000,
    "download_thumbnails": true,
    "max_thumbnails": 6
  },
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["fashion", "beauty", "fitness", "food", "travel", "tech", "comedy", "dance", "music", "gaming"]
}

Filters Applied

The scraper automatically filters out:

  • ❌ Private accounts
  • ❌ Accounts with fewer than 1,000 followers (configurable via min_followers in the scraper config)
  • ❌ Accounts with no videos
  • ❌ Non-existent/removed accounts
  • ❌ Already scraped accounts (deduplication)
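
The filter list above can be read as a single decision function. This is a minimal sketch rather than the skill's actual logic: it reuses field names from the profile data structure (`is_private`, `followers`, `videos_count`, `username`), assumes a missing account is represented as `None`, and the name `skip_reason` and the reason strings are hypothetical.

```python
from typing import Optional


def skip_reason(profile: Optional[dict], seen: set[str],
                min_followers: int = 1000) -> Optional[str]:
    """Return why a profile should be skipped, or None to keep it."""
    if profile is None:
        return "not_found"        # non-existent or removed account (assumed shape)
    if profile["username"] in seen:
        return "duplicate"        # already scraped
    if profile.get("is_private"):
        return "private"
    if profile.get("followers", 0) < min_followers:
        return "low_followers"
    if profile.get("videos_count", 0) == 0:
        return "no_videos"
    return None
```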

Troubleshooting

No Profiles Discovered

  • Check Google API key and quota
  • Verify Search Engine ID is configured for tiktok.com
  • Try different location/category combinations

Rate Limiting

  • Reduce scraping speed (increase delays in config)
  • Run during off-peak hours
  • Use a residential proxy (see below)

CAPTCHA / Bot Detection

  • TikTok has aggressive bot detection — residential proxies are strongly recommended
  • The built-in anti-detection handles fingerprinting and stealth automatically
  • If you see CAPTCHAs, try running in non-headless mode and solve them manually

🌐 Residential Proxy Support

Why Use a Residential Proxy?

Running a scraper at scale without a residential proxy will get your IP blocked fast. Here's why proxies are essential for long-running scrapes:

| Advantage | Description |
| --- | --- |
| Avoid IP Bans | Residential IPs look like real household users, not data-center bots. TikTok is far less likely to flag them. |
| Automatic IP Rotation | Each request (or session) gets a fresh IP, so rate-limits never stack up on one address. |
| Geo-Targeting | Route traffic through a specific country/city so scraped content matches the target audience's locale. |
| Sticky Sessions | Keep the same IP for a configurable window (e.g. 10 min), which is critical for maintaining a consistent browsing session. |
| Higher Success Rate | Rotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on TikTok. |
| Long-Running Scrapes | Scrape thousands of profiles over hours or days without interruption. |
| Concurrent Scraping | Run multiple browser instances across different IPs simultaneously. |

Recommended Proxy Providers

We have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:

| Provider | Best For | Sign Up |
| --- | --- | --- |
| Bright Data | World's largest network, 72M+ IPs, enterprise-grade | 👉 Get Bright Data |
| IProyal | Pay-as-you-go, 195+ countries, no traffic expiry | 👉 Get IProyal |
| Storm Proxies | Fast & reliable, developer-friendly API, competitive pricing | 👉 Get Storm Proxies |
| NetNut | ISP-grade network, 52M+ IPs, direct connectivity | 👉 Get NetNut |
Setup Steps

1. Get Your Proxy Credentials

Sign up with any provider above, then grab:

  • Username (from your provider dashboard)
  • Password (from your provider dashboard)
  • Host and port (pre-configured per provider; supply your own when using a custom provider)

2. Configure via Environment Variables

bash
export PROXY_ENABLED=true
export PROXY_PROVIDER=brightdata    # brightdata | iproyal | stormproxies | netnut | custom
export PROXY_USERNAME=your_user
export PROXY_PASSWORD=your_pass
export PROXY_COUNTRY=us             # optional: two-letter country code
export PROXY_STICKY=true            # optional: keep same IP per session

3. Provider-Specific Host/Port Defaults

These are auto-configured when you set the provider name:

| Provider | Host | Port |
| --- | --- | --- |
| Bright Data | brd.superproxy.io | 22225 |
| IProyal | proxy.iproyal.com | 12321 |
| Storm Proxies | rotating.stormproxies.com | 9999 |
| NetNut | gw-resi.netnut.io | 5959 |

Override with PROXY_HOST / PROXY_PORT env vars if your plan uses a different gateway.
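
The default-plus-override behaviour described above could be resolved roughly as follows. This is an illustrative sketch, not the skill's `ProxyManager` code: the host/port values come from the table, the environment variable names come from this section, and `resolve_endpoint` plus the fallback to `custom` when no provider is set are assumptions.

```python
import os
from typing import Mapping

# Gateway defaults copied from the provider table above.
PROVIDER_DEFAULTS = {
    "brightdata": ("brd.superproxy.io", 22225),
    "iproyal": ("proxy.iproyal.com", 12321),
    "stormproxies": ("rotating.stormproxies.com", 9999),
    "netnut": ("gw-resi.netnut.io", 5959),
}


def resolve_endpoint(env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Pick host/port from provider defaults, letting PROXY_HOST/PROXY_PORT win."""
    provider = env.get("PROXY_PROVIDER", "custom").lower()  # fallback is an assumption
    default_host, default_port = PROVIDER_DEFAULTS.get(provider, (None, None))
    host = env.get("PROXY_HOST") or default_host
    port = env.get("PROXY_PORT") or default_port
    if host is None or port is None:
        raise ValueError("custom provider needs PROXY_HOST and PROXY_PORT")
    return host, int(port)
```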

4. Custom Proxy Provider

For any other proxy service, set provider to custom and supply host/port manually:

json
{
  "proxy": {
    "enabled": true,
    "provider": "custom",
    "host": "your.proxy.host",
    "port": 8080,
    "username": "user",
    "password": "pass"
  }
}

Running the Scraper with Proxy

Once configured, the scraper picks up the proxy automatically — no extra flags needed:

bash
# Discover and scrape as usual — proxy is applied automatically
python main.py discover --location "Miami" --category "dance"
python main.py scrape --username charlidamelio

# The log will confirm proxy is active:
# INFO - Proxy enabled: <ProxyManager provider=brightdata enabled host=brd.superproxy.io:22225>

Using the Proxy Manager Programmatically

python
from proxy_manager import ProxyManager

# From config (auto-reads config/scraper_config.json)
pm = ProxyManager.from_config()

# From environment variables
pm = ProxyManager.from_env()

# Manual construction
pm = ProxyManager(
    provider="brightdata",
    username="your_user",
    password="your_pass",
    country="us",
    sticky=True
)

# For Playwright browser context
proxy = pm.get_playwright_proxy()
# → {"server": "http://brd.superproxy.io:22225", "username": "user-country-us-session-abc123", "password": "pass"}

# For requests / aiohttp
proxies = pm.get_requests_proxy()
# → {"http": "http://user:pass@host:port", "https": "http://user:pass@host:port"}

# Force new IP (rotates session ID)
pm.rotate_session()

# Debug info
print(pm.info())

Best Practices for Long-Running Scrapes

  1. Use sticky sessions — TikTok requires consistent IPs during a browsing session. Set "sticky": true.
  2. Target the right country — Set "country": "us" (or your target region) so TikTok serves content in the expected locale.
  3. Combine with existing anti-detection — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.
  4. Rotate sessions between batches — Call pm.rotate_session() between large batches of profiles to get a fresh IP.
  5. Use delays — Even with proxies, respect delay_between_profiles in config to avoid aggressive patterns.
  6. Monitor your proxy dashboard — All providers have dashboards showing bandwidth usage and success rates.
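
Practices 4 and 5 can be combined in a small driver loop. The sketch below is an assumption about how a caller might wire things together, not code from this skill: `scrape_in_batches` is a hypothetical helper, the default batch size and delay are placeholders, and `rotate` stands in for something like `pm.rotate_session`.

```python
import time
from typing import Callable, Sequence


def scrape_in_batches(
    usernames: Sequence[str],
    scrape_one: Callable[[str], None],
    rotate: Callable[[], None],
    batch_size: int = 25,          # placeholder value
    delay_seconds: float = 5.0,    # placeholder value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scrape profiles in batches, rotating the proxy session between batches."""
    done = 0
    for start in range(0, len(usernames), batch_size):
        if start > 0:
            rotate()  # fresh IP before every batch after the first
        for username in usernames[start:start + batch_size]:
            if done > 0:
                sleep(delay_seconds)  # pause between consecutive profiles
            scrape_one(username)
            done += 1
    return done
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in normal use the default `time.sleep` applies.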
