io.github.rsmdt/multimodal

by rsmdt

Call the media generation capabilities of multiple providers through one unified interface, with support for images, video, audio, and transcription.

README

multimodal-mcp

Multi-provider media generation MCP server. Generate images, videos, audio, and transcriptions from text prompts using OpenAI, xAI, Gemini, ElevenLabs, and BFL (FLUX) through a single unified interface.

Features

  • 🎨 Image Generation — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), Gemini (imagen-4), or BFL (FLUX Pro 1.1)
  • ✏️ Image Editing — Edit images via OpenAI, xAI, Gemini, or BFL (FLUX Kontext)
  • 🎬 Video Generation — Generate videos via OpenAI (sora-2), xAI (grok-imagine-video), or Gemini (veo-3.1)
  • 🔊 Audio Generation — Text-to-speech via OpenAI (tts-1), Gemini, or ElevenLabs (Flash v2.5). Sound effects via ElevenLabs
  • 🎙️ Audio Transcription — Speech-to-text via OpenAI (Whisper) or ElevenLabs (Scribe)
  • 🔄 Auto-Discovery — Automatically detects configured providers from environment variables
  • 🎯 Provider Selection — Auto-select a provider or explicitly choose one per request
  • 📁 File Output — Saves all generated media to disk with descriptive filenames
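
As a rough illustration of the auto-discovery feature, the sketch below maps environment variables to provider ids. It is not the server's actual code: `detectProviders` and `KEY_VARS` are hypothetical names, and the only facts taken from this README are the variable names, the `GOOGLE_API_KEY` alias, and the provider ids used by the `provider` parameter.

```typescript
// Hypothetical sketch: detect configured providers from environment variables.
// Variable names and provider ids come from this README; the rest is illustrative.
type ProviderId = "openai" | "xai" | "google" | "elevenlabs" | "bfl";

type Env = Record<string, string | undefined>;

// Each provider maps to the variable names that may carry its API key.
const KEY_VARS: Record<ProviderId, string[]> = {
  openai: ["OPENAI_API_KEY"],
  xai: ["XAI_API_KEY"],
  google: ["GEMINI_API_KEY", "GOOGLE_API_KEY"], // alias documented in this README
  elevenlabs: ["ELEVENLABS_API_KEY"],
  bfl: ["BFL_API_KEY"],
};

function detectProviders(env: Env): ProviderId[] {
  const ids = Object.keys(KEY_VARS) as ProviderId[];
  // A provider counts as configured when any of its variables is non-empty.
  return ids.filter((id) =>
    KEY_VARS[id].some((name) => (env[name] ?? "").trim() !== ""),
  );
}

const found = detectProviders({
  OPENAI_API_KEY: "sk-test",
  GOOGLE_API_KEY: "AIza-test",
});
console.log(found.join(", ")); // openai, google
```

Passing the environment in as a plain object, rather than reading a global, keeps the function easy to test.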

Quick Start

Set the API key for at least one provider. Most users only need one — add more to access additional providers.

```bash
# Using OpenAI
claude mcp add multimodal-mcp -e OPENAI_API_KEY=sk-... -- npx -y @r16t/multimodal-mcp@latest

# Or using xAI
# claude mcp add multimodal-mcp -e XAI_API_KEY=xai-... -- npx -y @r16t/multimodal-mcp@latest

# Or using Gemini
# claude mcp add multimodal-mcp -e GEMINI_API_KEY=AIza... -- npx -y @r16t/multimodal-mcp@latest

# Or using ElevenLabs (audio + transcription)
# claude mcp add multimodal-mcp -e ELEVENLABS_API_KEY=xi-... -- npx -y @r16t/multimodal-mcp@latest

# Or using BFL/FLUX (images)
# claude mcp add multimodal-mcp -e BFL_API_KEY=... -- npx -y @r16t/multimodal-mcp@latest
```

Using a different editor? See the Editor Setup section for Claude Desktop, Cursor, VS Code, Windsurf, and Cline.

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| OPENAI_API_KEY | At least one provider key | OpenAI API key. Enables image, video, audio generation, and transcription via gpt-image-1, sora-2, tts-1, and whisper-1 |
| XAI_API_KEY | At least one provider key | xAI API key. Enables image and video generation via grok-imagine-image and grok-imagine-video |
| GEMINI_API_KEY | At least one provider key | Gemini API key. Enables image, video, and audio generation via imagen-4, veo-3.1, and gemini-2.5-flash-preview-tts |
| GOOGLE_API_KEY | | Alias for GEMINI_API_KEY; either name is accepted |
| ELEVENLABS_API_KEY | At least one provider key | ElevenLabs API key. Enables audio generation (TTS, sound effects) and transcription via Flash v2.5 and Scribe v1 |
| BFL_API_KEY | At least one provider key | BFL API key. Enables image generation and editing via FLUX Pro 1.1 and FLUX Kontext |
| MEDIA_OUTPUT_DIR | No | Directory for saved media files. Defaults to the current working directory |
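
The output location follows a simple precedence: a per-request `outputDirectory` (a parameter of the generation tools), then `MEDIA_OUTPUT_DIR`, then the current working directory. A minimal sketch of that ordering, assuming empty strings are treated as unset; `resolveOutputDir` is a hypothetical helper, and how the server resolves relative paths is not covered here.

```typescript
// Hypothetical sketch of the documented precedence:
// outputDirectory parameter, then MEDIA_OUTPUT_DIR, then the working directory.
function resolveOutputDir(
  outputDirectory: string | undefined,
  env: Record<string, string | undefined>,
  cwd: string,
): string {
  // Assumption: an empty string counts as "not provided".
  if (outputDirectory !== undefined && outputDirectory !== "") {
    return outputDirectory;
  }
  const fromEnv = env["MEDIA_OUTPUT_DIR"];
  if (fromEnv !== undefined && fromEnv !== "") {
    return fromEnv;
  }
  return cwd;
}

console.log(resolveOutputDir(undefined, { MEDIA_OUTPUT_DIR: "/media" }, "/work")); // /media
```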

Available Tools

generate_image

Generate an image from a text prompt.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | Yes | Text description of the image to generate |
| provider | string | No | Provider to use: openai, xai, google, bfl. Auto-selects if omitted |
| aspectRatio | string | No | Aspect ratio: 1:1, 16:9, 9:16, 4:3, 3:4 |
| quality | string | No | Quality level: low, standard, high |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly |
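
MCP clients invoke tools with a JSON-RPC `tools/call` request, and most clients build it for you. The sketch below shows roughly what such a request for `generate_image` could look like. The interface mirrors the parameter table above; `imageRequest`, the request id, and the prompt text are illustrative assumptions.

```typescript
// Sketch: the shape of an MCP "tools/call" request for generate_image.
// Field names in GenerateImageArgs follow the parameter table above.
interface GenerateImageArgs {
  prompt: string;
  provider?: "openai" | "xai" | "google" | "bfl";
  aspectRatio?: "1:1" | "16:9" | "9:16" | "4:3" | "3:4";
  quality?: "low" | "standard" | "high";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
}

interface ToolCallRequest<A> {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: A };
}

// Hypothetical helper that wraps the arguments in a JSON-RPC envelope.
function imageRequest(
  id: number,
  args: GenerateImageArgs,
): ToolCallRequest<GenerateImageArgs> {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "generate_image", arguments: args },
  };
}

const imageCall = imageRequest(1, {
  prompt: "A lighthouse at dusk, watercolor style",
  aspectRatio: "16:9",
  quality: "high",
});
console.log(JSON.stringify(imageCall, null, 2));
```

Omitting `provider` here leaves the choice to the server's auto-selection.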

generate_video

Generate a video from a text prompt. Video generation is asynchronous and may take several minutes.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | Yes | Text description of the video to generate |
| provider | string | No | Provider to use: openai, xai, google. Auto-selects if omitted |
| duration | number | No | Video duration in seconds (provider limits apply) |
| aspectRatio | string | No | Aspect ratio: 16:9, 9:16, 1:1 |
| resolution | string | No | Resolution: 480p, 720p, 1080p |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly |
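
Because a video request can run for minutes before failing, it can help to check arguments on the client side first. The following is a minimal sketch of such a pre-flight check, using only the values listed in the table above. `validateVideoArgs` and its messages are hypothetical, and per-provider duration limits are not modelled because this README does not list them.

```typescript
// Hypothetical pre-flight check for generate_video arguments.
// Allowed values are copied from the parameter table above.
const VIDEO_ASPECT_RATIOS: string[] = ["16:9", "9:16", "1:1"];
const VIDEO_RESOLUTIONS: string[] = ["480p", "720p", "1080p"];

interface GenerateVideoArgs {
  prompt: string;
  provider?: string;
  duration?: number;
  aspectRatio?: string;
  resolution?: string;
}

// Returns a list of problems; an empty list means the arguments look plausible.
function validateVideoArgs(args: GenerateVideoArgs): string[] {
  const problems: string[] = [];
  if (args.prompt.trim() === "") {
    problems.push("prompt is required");
  }
  if (args.duration !== undefined && !(args.duration > 0)) {
    problems.push("duration must be a positive number of seconds");
  }
  if (
    args.aspectRatio !== undefined &&
    VIDEO_ASPECT_RATIOS.indexOf(args.aspectRatio) === -1
  ) {
    problems.push("aspectRatio must be one of " + VIDEO_ASPECT_RATIOS.join(", "));
  }
  if (
    args.resolution !== undefined &&
    VIDEO_RESOLUTIONS.indexOf(args.resolution) === -1
  ) {
    problems.push("resolution must be one of " + VIDEO_RESOLUTIONS.join(", "));
  }
  return problems;
}

console.log(validateVideoArgs({ prompt: "A paper boat in the rain", aspectRatio: "4:3" }));
```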

generate_audio

Generate audio from text. Supports text-to-speech and sound effects. Audio generation is synchronous.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to convert to speech, or a description of the sound effect to generate |
| provider | string | No | Provider to use: openai, google, elevenlabs. Auto-selects if omitted |
| voice | string | No | Voice name (provider-specific). OpenAI: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer. Google: Kore, Charon, Fenrir, Aoede, Puck, etc. ElevenLabs: voice ID |
| speed | number | No | Speech speed multiplier (OpenAI only): 0.25 to 4.0 |
| format | string | No | Output format (OpenAI only): mp3, opus, aac, flac, wav, pcm |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly. ElevenLabs: set mode: "sound-effect" for sound effects, model for TTS model selection |
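
The two modes differ only in their arguments. The sketch below shows one plausible argument object for text-to-speech and one for an ElevenLabs sound effect, plus a small check of the documented `speed` range. `checkAudioArgs` is a hypothetical helper. Flagging `speed` or `format` on non-OpenAI providers is an assumption; the server may simply ignore them.

```typescript
// Argument shape for generate_audio, following the parameter table above.
interface GenerateAudioArgs {
  text: string;
  provider?: "openai" | "google" | "elevenlabs";
  voice?: string;
  speed?: number;
  format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
}

// Text-to-speech with an OpenAI voice.
const speech: GenerateAudioArgs = {
  text: "Welcome back. Your build finished successfully.",
  provider: "openai",
  voice: "nova",
  speed: 1.25,
  format: "mp3",
};

// Sound effect via ElevenLabs: the text describes the sound.
const effect: GenerateAudioArgs = {
  text: "Heavy rain on a tin roof",
  provider: "elevenlabs",
  providerOptions: { mode: "sound-effect" },
};

// Hypothetical check; returns notes about arguments that look questionable.
function checkAudioArgs(args: GenerateAudioArgs): string[] {
  const notes: string[] = [];
  if (args.speed !== undefined && (args.speed < 0.25 || args.speed > 4.0)) {
    notes.push("speed must be between 0.25 and 4.0");
  }
  const nonOpenAi = args.provider !== undefined && args.provider !== "openai";
  if (nonOpenAi && args.speed !== undefined) {
    notes.push("speed is documented as OpenAI only");
  }
  if (nonOpenAi && args.format !== undefined) {
    notes.push("format is documented as OpenAI only");
  }
  return notes;
}

console.log(checkAudioArgs(speech).length, checkAudioArgs(effect).length); // 0 0
```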

transcribe_audio

Transcribe audio to text (speech-to-text).

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| audioPath | string | Yes | Absolute path to the audio file to transcribe |
| provider | string | No | Provider to use: openai, elevenlabs. Auto-selects if omitted |
| language | string | No | Language code (e.g., en, fr, es) to hint the transcription language |
| providerOptions | object | No | Provider-specific parameters passed through directly |
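
Since `audioPath` must be absolute, a caller may want to reject relative paths before sending the request. A minimal sketch under that assumption: `isAbsolutePath` and `transcribeArgs` are hypothetical helpers, and the path check is a deliberate simplification (POSIX roots and Windows drive letters only, no UNC paths).

```typescript
// Argument shape for transcribe_audio, following the parameter table above.
interface TranscribeAudioArgs {
  audioPath: string;
  provider?: "openai" | "elevenlabs";
  language?: string;
  providerOptions?: Record<string, unknown>;
}

// Simplified check: "/..." on POSIX or "C:\..." / "C:/..." on Windows.
function isAbsolutePath(p: string): boolean {
  return /^(\/|[A-Za-z]:[\\/])/.test(p);
}

// Hypothetical builder that refuses relative paths up front.
function transcribeArgs(audioPath: string, language?: string): TranscribeAudioArgs {
  if (!isAbsolutePath(audioPath)) {
    throw new Error("audioPath must be absolute, got: " + audioPath);
  }
  return language === undefined ? { audioPath } : { audioPath, language };
}

console.log(transcribeArgs("/home/me/recordings/standup.mp3", "en"));
```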

list_providers

List all configured media generation providers and their capabilities. Takes no parameters.

Provider Capabilities

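| Provider | Image | Image Editing | Video | Audio | Transcription | Key Models |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | Yes | Yes | Yes | Yes | Yes | gpt-image-1, sora-2, tts-1, whisper-1 |
| xAI | Yes | Yes | Yes | No | No | grok-imagine-image, grok-imagine-video |
| Gemini | Yes | Yes | Yes | Yes | No | imagen-4, veo-3.1, gemini-2.5-flash-preview-tts |
| ElevenLabs | No | No | No | Yes | Yes | eleven_flash_v2_5, scribe_v1 |
| BFL | Yes | Yes | No | No | No | flux-pro-1.1, flux-kontext-pro |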

Image Aspect Ratios

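| Provider | 1:1 | 16:9 | 9:16 | 4:3 | 3:4 |
| --- | --- | --- | --- | --- | --- |
| OpenAI | | | | | |
| xAI | | | | | |
| Gemini | | | | | |
| BFL | | | | | |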

Video Aspect Ratios & Resolutions

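| Provider | 16:9 | 9:16 | 1:1 | 480p | 720p | 1080p |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | | | | | | |
| xAI | | | | | | |
| Gemini | | | | | | |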

Audio Formats

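| Provider | mp3 | opus | aac | flac | wav | pcm |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | | | | | | |
| Gemini | | | | | | |
| ElevenLabs | | | | | | |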

Troubleshooting

No providers configured

```
[config] No provider API keys detected
```

Set at least one of OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY in the MCP server's env block.

Provider not available for requested media type

Each provider supports different media types (see Provider Capabilities). If you specify a provider that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the provider parameter to auto-select from configured providers.
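
The selection rules described here can be sketched roughly as follows. The capability map is derived from the Features list and the environment variable descriptions in this README. `selectProvider`, the error messages, and the auto-select order (first configured provider that supports the media type) are assumptions, not the server's documented behavior.

```typescript
// Hypothetical sketch of explicit vs. automatic provider selection.
type MediaType = "image" | "imageEdit" | "video" | "audio" | "transcription";
type Provider = "openai" | "xai" | "google" | "elevenlabs" | "bfl";

// Derived from the feature and capability descriptions in this README.
const CAPABILITIES: Record<Provider, MediaType[]> = {
  openai: ["image", "imageEdit", "video", "audio", "transcription"],
  xai: ["image", "imageEdit", "video"],
  google: ["image", "imageEdit", "video", "audio"],
  elevenlabs: ["audio", "transcription"],
  bfl: ["image", "imageEdit"],
};

function selectProvider(
  media: MediaType,
  configured: Provider[],
  requested?: Provider,
): Provider {
  const supports = (p: Provider): boolean => CAPABILITIES[p].indexOf(media) !== -1;
  if (requested !== undefined) {
    // An explicit choice must be both configured and capable.
    if (configured.indexOf(requested) === -1) {
      throw new Error("Provider " + requested + " is not configured");
    }
    if (!supports(requested)) {
      throw new Error("Provider " + requested + " does not support " + media);
    }
    return requested;
  }
  // Assumed auto-select rule: first configured provider that supports the type.
  for (const p of configured) {
    if (supports(p)) return p;
  }
  throw new Error("No configured provider supports " + media);
}

console.log(selectProvider("video", ["elevenlabs", "xai"])); // xai
```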

Video generation timeout

Video generation polls for up to 10 minutes. If your video hasn't completed in that window, the request will fail with a timeout error. Try a shorter duration or a simpler prompt.
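
A polling loop with a fixed deadline might look like the sketch below. Only the 10-minute window comes from this README. `pollUntilDone`, the `checkStatus` callback, the status names, and the 5-second interval are illustrative assumptions.

```typescript
// Hypothetical sketch of polling with a deadline.
type JobStatus = "pending" | "completed" | "failed";

const VIDEO_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes, per this README

// Pure helper so the deadline logic can be tested without waiting.
function hasTimedOut(
  startedAt: number,
  now: number,
  timeoutMs: number = VIDEO_TIMEOUT_MS,
): boolean {
  return now - startedAt >= timeoutMs;
}

async function pollUntilDone(
  checkStatus: () => Promise<JobStatus>, // stands in for a provider status call
  intervalMs: number = 5000, // assumed interval; the real value is not documented
  timeoutMs: number = VIDEO_TIMEOUT_MS,
): Promise<void> {
  const startedAt = Date.now();
  while (true) {
    const status = await checkStatus();
    if (status === "completed") return;
    if (status === "failed") throw new Error("Video generation failed");
    if (hasTimedOut(startedAt, Date.now(), timeoutMs)) {
      throw new Error("Video generation timed out");
    }
    // Wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A caller would pass a function that queries the provider for job state, for example `pollUntilDone(() => fetchJobStatus(jobId))`, where `fetchJobStatus` is likewise hypothetical.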

xAI image generation returned no data

This indicates the xAI API returned an empty response. Check that your XAI_API_KEY is valid and that your prompt does not violate xAI content policies.

Gemini image/video generation failed: 403

Verify that the Google Cloud project your GEMINI_API_KEY belongs to has the Generative Language API enabled in Google Cloud Console.

Development

```bash
npm run build      # Compile TypeScript to build/
npm test           # Run tests with Vitest
npm run lint       # Lint and auto-fix with ESLint
npm run typecheck  # Type-check without emitting
npm run dev        # Watch mode for TypeScript compilation
```

Editor Setup

The examples below use OPENAI_API_KEY. Replace it with the key variable for your provider of choice (XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY). You can set multiple keys to enable multiple providers.

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS path):

```json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

Cursor

Add to .cursor/mcp.json in your project root (or ~/.cursor/mcp.json globally):

```json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

VS Code (GitHub Copilot)

Add to .vscode/mcp.json in your project root:

```json
{
  "servers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

```json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

Cline

Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json (macOS path):

```json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

License

MIT
