Smart Video Extraction
YouTube Model Feeder
by celstnblacc
Food for your model — extract transcripts, key frames, OCR, slides, and LLM summaries from YouTube videos into structured AI-ready knowledge.
Install
claude skill add --url github.com/openclaw/skills/tree/main/skills/celstnblacc/youtube-model-feeder
YouTube Model Feeder
Food for your model.
Stop pausing videos every 30 seconds to screenshot, paste into Obsidian, and caption. A 20-minute tutorial shouldn't take an hour to document.
YouTube Model Feeder extracts everything from a YouTube video — timestamped transcript, key frame snapshots, OCR of code and slides, presentation slide detection, and LLM-generated summaries — and packages it into structured knowledge your AI assistant can search, reference, and reason about.
Why This Exists
The problem isn't transcription — ten tools do that. The problem is structured context. When you feed a raw transcript to a model, it has no visual context. It doesn't know what was on screen when the speaker said "as you can see here." It can't read the code in the terminal, the diagram on the slide, or the config file being edited.
YouTube Model Feeder captures all of that. The output isn't just text — it's a knowledge bundle: transcript segments aligned to timestamps, screenshots of every key moment, OCR text from code snippets and slides, and an LLM summary that ties it all together.
Combined with obsidian-semantic-search (also on ClawHub), every video you watch becomes permanently searchable by meaning in your Obsidian vault.
What It Extracts
Full Pipeline
| Step | Tool | What it produces |
|---|---|---|
| Download | yt-dlp | Video + audio + metadata (title, duration, thumbnail) |
| Transcribe | Whisper (Ollama) or YouTube captions | Timestamped transcript segments |
| Frame Extraction | FFmpeg | Key frame snapshots every 5s (configurable) |
| Slide Detection | SSIM analysis (OpenCV) | Identifies presentation slides via structural similarity between frames |
| OCR | Tesseract | Reads code, terminal output, and text from captured frames |
| LLM Summary | Ollama / OpenAI / Anthropic | Structured markdown with sections, code blocks, and key takeaways |
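The frame-extraction step in the table can be sketched as a small helper that builds an FFmpeg command line. This is only an illustration of sampling one frame every N seconds with FFmpeg's `fps` filter; the function name, the output naming pattern, and the JPEG quality setting are assumptions, not the project's actual code in `snapshot.py`.

```python
def frame_extraction_command(video: str, out_dir: str, interval_s: int = 5) -> list[str]:
    """Build an ffmpeg argv that saves one JPEG every `interval_s` seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be a positive number of seconds")
    # Hypothetical naming scheme: frames/frame_00001.jpg, frame_00002.jpg, ...
    pattern = f"{out_dir}/frame_%05d.jpg"
    return [
        "ffmpeg",
        "-i", video,
        "-vf", f"fps=1/{interval_s}",  # one output frame per interval
        "-q:v", "2",                   # JPEG quality (lower is better); an assumed value
        pattern,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building the argv separately keeps the command easy to inspect and test without invoking FFmpeg.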
Slide Detection (Deep)
Not just frame captures — intelligent slide boundary detection:
- Layout detection — classifies video as full-frame, picture-in-picture, or split panel
- SSIM transition scan — compares consecutive frames for structural changes (threshold: SSIM < 0.85)
- LLM disambiguation — borderline transitions (0.85–0.93 SSIM) sent to LLM for classification
- Slide grouping — merges transitions into slides with enforced minimum duration (3s)
- Final-state capture — saves the last frame of each slide as JPEG
- OCR extraction — runs Tesseract on each slide image
- Transcript alignment — maps transcript segments to slide time ranges
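The thresholding, grouping, and alignment steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the logic in `slide_detection.py`: it takes SSIM scores and transition timestamps as already computed, the `Slide` type and all function names are hypothetical, and a transcript segment is assigned to the slide that contains its start time.

```python
from dataclasses import dataclass, field

HARD_CUT = 0.85     # SSIM below this counts as a slide transition
BORDERLINE = 0.93   # SSIM from 0.85 to 0.93 is ambiguous and goes to the LLM
MIN_SLIDE_S = 3.0   # enforced minimum slide duration in seconds


def classify_transition(ssim: float) -> str:
    """Map an SSIM score between consecutive frames to a decision."""
    if ssim < HARD_CUT:
        return "transition"
    if ssim <= BORDERLINE:
        return "borderline"
    return "same"


@dataclass
class Slide:
    start: float
    end: float
    text: list[str] = field(default_factory=list)


def group_slides(transitions: list[float], duration: float) -> list[Slide]:
    """Turn transition timestamps into slides of at least MIN_SLIDE_S seconds."""
    slides: list[Slide] = []
    start = 0.0
    for t in sorted(transitions):
        if t - start >= MIN_SLIDE_S:
            slides.append(Slide(start, t))
            start = t
        # Otherwise ignore this boundary: the short span joins the next slide.
    if slides and duration - start < MIN_SLIDE_S:
        slides[-1].end = duration  # tail too short, extend the last slide
    else:
        slides.append(Slide(start, duration))
    return slides


def align_transcript(slides: list[Slide], segments: list[tuple[float, float, str]]) -> list[Slide]:
    """Attach each (start, end, text) segment to the slide containing its start."""
    for seg_start, _seg_end, text in segments:
        for slide in slides:
            if slide.start <= seg_start < slide.end:
                slide.text.append(text)
                break
    return slides
```

Aligning on the segment start is the simplest choice; a segment that straddles a slide boundary could instead be assigned by midpoint or by largest overlap.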
Output Formats
| Format | What you get |
|---|---|
| Markdown | Timestamped sections with headings, code blocks, image references |
| HTML | Styled single-page doc with embedded screenshots |
| Obsidian bundle | ZIP export: markdown + images, ready to drop into your vault |
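To make the Obsidian bundle concrete, here is a sketch of packing one markdown note and its images into a ZIP using only the standard library. The archive layout (note at the root, images under `attachments/`) and the function name are assumptions for illustration; the real export may be organised differently.

```python
import io
import zipfile


def build_obsidian_bundle(note_name: str, markdown: str, images: dict[str, bytes]) -> bytes:
    """Return ZIP bytes containing <note_name>.md plus images under attachments/."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{note_name}.md", markdown)
        for filename, data in images.items():
            # Hypothetical layout: keep screenshots next to the note in one folder.
            zf.writestr(f"attachments/{filename}", data)
    return buf.getvalue()
```

Unzipping the result into a vault gives a note whose `![[slide_001.jpg]]` style embeds can resolve, since Obsidian looks up embedded files by name within the vault.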
Installation
Prerequisites
# macOS
brew install ffmpeg tesseract
# Linux
apt install ffmpeg tesseract-ocr
Docker must be running for the full backend (Docker Desktop on macOS, or Docker Engine on Linux).
Start the Stack
git clone https://github.com/celstnblacc/youtube-model-feeder.git
cd youtube-model-feeder
docker-compose up -d
This starts 5 services:
| Service | Port | Purpose |
|---|---|---|
| api | 8000 | FastAPI backend + Swagger docs at /docs |
| celery_worker | — | Background video processing |
| postgres | 5432 | Job tracking, transcripts, documents |
| redis | 6379 | Task queue (Celery broker) |
| web | 3000 | Next.js frontend (optional) |
Verify
Open http://localhost:8000/docs — you should see the Swagger API documentation.
Usage
Via AI Assistant
Extract a video:
"Extract everything from this YouTube video and save it to my vault: https://youtube.com/watch?v=..."
Transcript only:
"Get the timestamped transcript for this video"
Slides and code screenshots:
"Extract all the code screenshots and presentation slides from this tutorial"
Obsidian export:
"Convert this video into an Obsidian note with screenshots and timestamps"
Via API
# Submit a video for processing
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{"url": "https://youtube.com/watch?v=dQw4w9WgXcQ"}'
# Check job status
curl http://localhost:8000/jobs/{job_id}
# Get the generated document
curl http://localhost:8000/videos/{video_id}
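The same job submission can be sketched from Python with the standard library. The endpoint and request body mirror the curl call above; the helper names are illustrative, and the shape of the JSON response is not documented here, so the sketch returns it unparsed beyond JSON decoding rather than assuming field names.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default port from the service table


def build_job_request(video_url: str, base: str = API_BASE) -> urllib.request.Request:
    """Build the POST /jobs request without sending it."""
    body = json.dumps({"url": video_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(video_url: str) -> dict:
    """Send the request and return the decoded JSON response as-is."""
    with urllib.request.urlopen(build_job_request(video_url), timeout=30) as resp:
        return json.load(resp)
```

Check the Swagger page at /docs for the actual response schema before relying on any particular key for polling.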
Via Web UI
Open http://localhost:3000, paste a YouTube URL, and watch the extraction happen in real time with progress tracking.
LLM Provider Selection
Per-user configuration — choose your summarization engine:
| Provider | Model (default) | Setup | Cost |
|---|---|---|---|
| Ollama (default) | Mistral 7B | Pre-installed locally | Free |
| OpenAI | GPT-4o-mini | Set OPENAI_API_KEY | Per-token |
| Anthropic | Claude Sonnet 4.6 | Set ANTHROPIC_API_KEY | Per-token |
Configure via the API: PATCH /settings/me with your preferred provider and API key (encrypted at rest with Fernet).
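A provider switch could look roughly like the following. Only the `PATCH /settings/me` endpoint comes from the text above; the JSON field names `provider` and `api_key`, the lowercase provider identifiers, and the helper name are assumptions, so confirm the real schema in the Swagger docs before using it.

```python
import json
import urllib.request
from typing import Optional

ALLOWED_PROVIDERS = {"ollama", "openai", "anthropic"}  # assumed identifiers


def build_settings_request(
    provider: str,
    api_key: Optional[str] = None,
    base: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a PATCH /settings/me request with hypothetical field names."""
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    payload = {"provider": provider}
    if api_key is not None:
        payload["api_key"] = api_key  # hypothetical field name
    return urllib.request.Request(
        f"{base}/settings/me",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```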
The Knowledge Pipeline
YouTube Model Feeder is designed to work with other ClawHub skills:
YouTube Video
→ youtube-model-feeder (extract transcript + snapshots + OCR + summary)
→ Obsidian vault (structured markdown + images)
→ obsidian-semantic-search (index with embeddings, search by meaning)
→ Your AI assistant (full context from every video you've ever watched)
Every video becomes permanent, searchable knowledge. Not buried in a playlist — indexed and queryable.
Architecture
youtube-model-feeder/
├── backend/ # FastAPI + Celery (Python 3.12)
│ ├── app/
│ │ ├── services/ # Core extraction pipeline (6372 LOC)
│ │ │ ├── youtube.py # yt-dlp wrapper
│ │ │ ├── transcript.py # Whisper + YouTube captions
│ │ │ ├── snapshot.py # FFmpeg frame extraction
│ │ │ ├── slide_detection.py # SSIM + OCR + alignment (643 LOC)
│ │ │ ├── llm.py # Summarization engine (1229 LOC)
│ │ │ └── llm_providers.py # Ollama/OpenAI/Anthropic abstraction
│ │ ├── routes/ # REST API endpoints
│ │ └── tasks.py # Celery background jobs
├── frontend/ # Next.js 14 + React 18 + Tailwind + Zustand
├── docker-compose.yml # 5 services
└── migrations/ # Alembic database migrations
Troubleshooting
| Problem | Fix |
|---|---|
| VirusTotal "suspicious" warning on install | False positive — skill describes video extraction patterns. Use clawhub install youtube-model-feeder --force |
| Docker services won't start | Ensure Docker Desktop is running. Check docker-compose logs api for errors |
| Transcription is slow | First run pulls the Whisper model (~1.5 GB). Subsequent runs are fast. Try YouTube captions first (faster, no model needed) |
| No slides detected | SSIM threshold may need tuning for your video type. Presentation-style videos work best |
| LLM summary is empty | Check LLM provider config. Default is Ollama — ensure Ollama is running with a model pulled |
| FFmpeg not found | brew install ffmpeg (macOS) or apt install ffmpeg (Linux) |
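Several of the fixes above come down to a missing binary. A quick preflight check, sketched here with the standard library, reports which command-line tools are not on the PATH. The tool list and function name are assumptions based on the prerequisites in this README, not part of the project.

```python
import shutil
from typing import Callable, Iterable, Optional

REQUIRED_TOOLS = ("ffmpeg", "tesseract", "docker")  # assumed from the prerequisites


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools()
    print("All tools found" if not absent else f"Missing: {', '.join(absent)}")
```

The `which` parameter exists only so the lookup can be swapped out in tests; normal use calls `missing_tools()` with no arguments.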
Links
- Source: https://github.com/celstnblacc/youtube-model-feeder
- Obsidian Semantic Search: https://clawhub.ai/skills/obsidian-semantic-search
- License: MIT-0 (this skill) / Apache 2.0 (source)
Built by celstnblacc — food for your model. 226 tests, 6 extraction stages, 3 LLM providers, Obsidian-ready output.