io.github.rsmdt/multimodal

by rsmdt

Call the media generation capabilities of multiple providers through a single unified interface, with support for images, video, audio, and transcription.

README

multimodal-mcp

Multi-provider media generation MCP server. Generate images, videos, audio, and transcriptions from text prompts using OpenAI, xAI, Gemini, ElevenLabs, and BFL (FLUX) through a single unified interface.

Features

  • 🎨 Image Generation — Generate images via OpenAI (gpt-image-1), xAI (grok-imagine-image), Gemini (imagen-4), or BFL (FLUX Pro 1.1)
  • ✏️ Image Editing — Edit images via OpenAI, xAI, Gemini, or BFL (FLUX Kontext)
  • 🎬 Video Generation — Generate videos via OpenAI (sora-2), xAI (grok-imagine-video), or Gemini (veo-3.1)
  • 🔊 Audio Generation — Text-to-speech via OpenAI (tts-1), Gemini, or ElevenLabs (Flash v2.5). Sound effects via ElevenLabs
  • 🎙️ Audio Transcription — Speech-to-text via OpenAI (Whisper) or ElevenLabs (Scribe)
  • 🔄 Auto-Discovery — Automatically detects configured providers from environment variables
  • 🎯 Provider Selection — Auto-select or explicitly choose a provider per request
  • 📁 File Output — Saves all generated media to disk with descriptive filenames

Quick Start

Set the API key for at least one provider. Most users only need one — add more to access additional providers.

bash
# Using OpenAI
claude mcp add multimodal-mcp -e OPENAI_API_KEY=sk-... -- npx -y @r16t/multimodal-mcp@latest

# Or using xAI
# claude mcp add multimodal-mcp -e XAI_API_KEY=xai-... -- npx -y @r16t/multimodal-mcp@latest

# Or using Gemini
# claude mcp add multimodal-mcp -e GEMINI_API_KEY=AIza... -- npx -y @r16t/multimodal-mcp@latest

# Or using ElevenLabs (audio + transcription)
# claude mcp add multimodal-mcp -e ELEVENLABS_API_KEY=xi-... -- npx -y @r16t/multimodal-mcp@latest

# Or using BFL/FLUX (images)
# claude mcp add multimodal-mcp -e BFL_API_KEY=... -- npx -y @r16t/multimodal-mcp@latest

Using a different editor? See the Editor Setup section for Claude Desktop, Cursor, VS Code, Windsurf, and Cline.

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| OPENAI_API_KEY | At least one provider key | OpenAI API key; enables image, video, and audio generation and transcription via gpt-image-1, sora-2, tts-1, and whisper-1 |
| XAI_API_KEY | At least one provider key | xAI API key; enables image and video generation via grok-imagine-image and grok-imagine-video |
| GEMINI_API_KEY | At least one provider key | Gemini API key; enables image, video, and audio generation via imagen-4, veo-3.1, and gemini-2.5-flash-preview-tts |
| GOOGLE_API_KEY | | Alias for GEMINI_API_KEY; either name is accepted |
| ELEVENLABS_API_KEY | At least one provider key | ElevenLabs API key; enables audio generation (TTS, sound effects) and transcription via Flash v2.5 and Scribe v1 |
| BFL_API_KEY | At least one provider key | BFL API key; enables image generation and editing via FLUX Pro 1.1 and FLUX Kontext |
| MEDIA_OUTPUT_DIR | No | Directory for saved media files. Defaults to the current working directory |
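
The mapping from environment variables to providers can be pictured with a short sketch. This is an illustration only, not the server's actual detection code: the names `PROVIDER_ENV_VARS` and `detectProviders` are hypothetical, and treating an empty value as "not configured" is an assumption.

```typescript
// Hypothetical sketch of key-based provider discovery.
// Variable names come from the table above; everything else is assumed.
type Provider = "openai" | "xai" | "google" | "elevenlabs" | "bfl";

const PROVIDER_ENV_VARS: Record<Provider, string[]> = {
  openai: ["OPENAI_API_KEY"],
  xai: ["XAI_API_KEY"],
  google: ["GEMINI_API_KEY", "GOOGLE_API_KEY"], // either name is accepted
  elevenlabs: ["ELEVENLABS_API_KEY"],
  bfl: ["BFL_API_KEY"],
};

// Returns every provider for which at least one key variable is non-empty.
function detectProviders(env: Record<string, string | undefined>): Provider[] {
  return (Object.keys(PROVIDER_ENV_VARS) as Provider[]).filter((provider) =>
    PROVIDER_ENV_VARS[provider].some((name) => (env[name] ?? "").trim() !== ""),
  );
}

// Example: two keys set, one of them through the GOOGLE_API_KEY alias.
const configuredProviders = detectProviders({
  OPENAI_API_KEY: "sk-...",
  GOOGLE_API_KEY: "AIza...",
});
console.log(configuredProviders); // contains "openai" and "google"
```

With no keys set, the sketch returns an empty list, which corresponds to the "No providers configured" case in Troubleshooting.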

Available Tools

generate_image

Generate an image from a text prompt.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | Yes | Text description of the image to generate |
| provider | string | No | Provider to use: openai, xai, google, bfl. Auto-selects if omitted |
| aspectRatio | string | No | Aspect ratio: 1:1, 16:9, 9:16, 4:3, 3:4 |
| quality | string | No | Quality level: low, standard, high |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly |
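
As a rough illustration of the parameter table, the arguments can be written as a TypeScript type with a small client-side check. The type is transcribed from the table; `validateImageArgs` and `isOneOf` are hypothetical helpers, and the server's own validation may differ.

```typescript
// Argument shape transcribed from the generate_image table above.
type GenerateImageArgs = {
  prompt: string;
  provider?: "openai" | "xai" | "google" | "bfl";
  aspectRatio?: "1:1" | "16:9" | "9:16" | "4:3" | "3:4";
  quality?: "low" | "standard" | "high";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
};

const IMAGE_PROVIDERS = ["openai", "xai", "google", "bfl"];
const IMAGE_ASPECT_RATIOS = ["1:1", "16:9", "9:16", "4:3", "3:4"];
const IMAGE_QUALITIES = ["low", "standard", "high"];

function isOneOf(value: unknown, allowed: string[]): boolean {
  return typeof value === "string" && allowed.includes(value);
}

// Hypothetical pre-flight check: returns a list of problems (empty means OK).
function validateImageArgs(args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const { prompt, provider, aspectRatio, quality } = args;
  if (typeof prompt !== "string" || prompt.trim() === "") {
    problems.push("prompt is required and must be a non-empty string");
  }
  if (provider !== undefined && !isOneOf(provider, IMAGE_PROVIDERS)) {
    problems.push("provider must be one of: " + IMAGE_PROVIDERS.join(", "));
  }
  if (aspectRatio !== undefined && !isOneOf(aspectRatio, IMAGE_ASPECT_RATIOS)) {
    problems.push("aspectRatio must be one of: " + IMAGE_ASPECT_RATIOS.join(", "));
  }
  if (quality !== undefined && !isOneOf(quality, IMAGE_QUALITIES)) {
    problems.push("quality must be one of: " + IMAGE_QUALITIES.join(", "));
  }
  return problems;
}

const imageExample: GenerateImageArgs = {
  prompt: "A lighthouse at dusk, watercolor",
  provider: "bfl",
  aspectRatio: "16:9",
  quality: "high",
};
console.log(validateImageArgs(imageExample)); // []
```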

generate_video

Generate a video from a text prompt. Video generation is asynchronous and may take several minutes.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | Yes | Text description of the video to generate |
| provider | string | No | Provider to use: openai, xai, google. Auto-selects if omitted |
| duration | number | No | Video duration in seconds (provider limits apply) |
| aspectRatio | string | No | Aspect ratio: 16:9, 9:16, 1:1 |
| resolution | string | No | Resolution: 480p, 720p, 1080p |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly |
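
The outputDirectory fallback shared by the generation tools (explicit argument, then MEDIA_OUTPUT_DIR, then the current working directory) can be sketched as follows. The function name `resolveOutputDirectory` is hypothetical, and treating an empty string as "unset" is an assumption; how the server resolves relative paths is not covered here.

```typescript
// Hypothetical sketch of the documented precedence:
// outputDirectory argument > MEDIA_OUTPUT_DIR > current working directory.
function resolveOutputDirectory(
  outputDirectory: string | undefined,
  env: Record<string, string | undefined>,
  cwd: string,
): string {
  // `||` also skips empty strings, which this sketch treats as unset.
  return outputDirectory || env["MEDIA_OUTPUT_DIR"] || cwd;
}

console.log(resolveOutputDirectory(undefined, { MEDIA_OUTPUT_DIR: "/media" }, "/work")); // /media
```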

generate_audio

Generate audio from text. Supports text-to-speech and sound effects. Audio generation is synchronous.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text to convert to speech, or a description of the sound effect to generate |
| provider | string | No | Provider to use: openai, google, elevenlabs. Auto-selects if omitted |
| voice | string | No | Voice name (provider-specific). OpenAI: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer. Google: Kore, Charon, Fenrir, Aoede, Puck, etc. ElevenLabs: voice ID |
| speed | number | No | Speech speed multiplier (OpenAI only): 0.25 to 4.0 |
| format | string | No | Output format (OpenAI only): mp3, opus, aac, flac, wav, pcm |
| outputDirectory | string | No | Directory to save the generated file. Absolute or relative path. Defaults to MEDIA_OUTPUT_DIR or cwd |
| providerOptions | object | No | Provider-specific parameters passed through directly. ElevenLabs: set mode: "sound-effect" for sound effects, model for TTS model selection |
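
To make the difference between text-to-speech and sound-effect requests concrete, here is a sketch of the two argument shapes. The type is transcribed from the table; the builder functions `speechArgs` and `soundEffectArgs` are hypothetical conveniences and not part of the server.

```typescript
// Argument shape transcribed from the generate_audio table above.
type GenerateAudioArgs = {
  text: string;
  provider?: "openai" | "google" | "elevenlabs";
  voice?: string;
  speed?: number;
  format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
  outputDirectory?: string;
  providerOptions?: Record<string, unknown>;
};

// Hypothetical builder for an OpenAI text-to-speech request.
// speed and format are documented as OpenAI-only parameters.
function speechArgs(text: string, voice: string, speed: number = 1.0): GenerateAudioArgs {
  if (speed < 0.25 || speed > 4.0) {
    throw new RangeError("speed must be between 0.25 and 4.0");
  }
  return { text, provider: "openai", voice, speed, format: "mp3" };
}

// Hypothetical builder for an ElevenLabs sound effect: the text field carries
// the description, and providerOptions.mode switches away from TTS.
function soundEffectArgs(description: string): GenerateAudioArgs {
  return {
    text: description,
    provider: "elevenlabs",
    providerOptions: { mode: "sound-effect" },
  };
}

console.log(speechArgs("Welcome back.", "alloy", 1.25));
console.log(soundEffectArgs("Heavy rain on a tin roof"));
```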

transcribe_audio

Transcribe audio to text (speech-to-text).

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| audioPath | string | Yes | Absolute path to the audio file to transcribe |
| provider | string | No | Provider to use: openai, elevenlabs. Auto-selects if omitted |
| language | string | No | Language code (e.g., en, fr, es) to hint the transcription language |
| providerOptions | object | No | Provider-specific parameters passed through directly |

list_providers

List all configured media generation providers and their capabilities. Takes no parameters.

Provider Capabilities

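| Provider | Image | Image Editing | Video | Audio | Transcription | Key Models |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | Yes | Yes | Yes | Yes | Yes | gpt-image-1, sora-2, tts-1, whisper-1 |
| xAI | Yes | Yes | Yes | No | No | grok-imagine-image, grok-imagine-video |
| Gemini | Yes | Yes | Yes | Yes | No | imagen-4, veo-3.1, gemini-2.5-flash-preview-tts |
| ElevenLabs | No | No | No | Yes | Yes | eleven_flash_v2_5, scribe_v1 |
| BFL | Yes | Yes | No | No | No | flux-pro-1.1, flux-kontext-pro |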

Image Aspect Ratios

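| Provider | 1:1 | 16:9 | 9:16 | 4:3 | 3:4 |
| --- | --- | --- | --- | --- | --- |
| OpenAI | | | | | |
| xAI | | | | | |
| Gemini | | | | | |
| BFL | | | | | |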

Video Aspect Ratios & Resolutions

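| Provider | 16:9 | 9:16 | 1:1 | 480p | 720p | 1080p |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | | | | | | |
| xAI | | | | | | |
| Gemini | | | | | | |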

Audio Formats

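| Provider | mp3 | opus | aac | flac | wav | pcm |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | | | | | | |
| Gemini | | | | | | |
| ElevenLabs | | | | | | |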

Troubleshooting

No providers configured

code
[config] No provider API keys detected

Set at least one of OPENAI_API_KEY, XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, or BFL_API_KEY in the MCP server's env block.

Provider not available for requested media type

Each provider supports different media types (see Provider Capabilities). If you specify a provider that isn't configured (no API key) or doesn't support the requested media type, you'll receive an error. Omit the provider parameter to auto-select from configured providers.
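
The two failure cases described above (provider not configured, provider lacking the media type) can be sketched as a small selection function. The capability map follows the Features list; `selectProvider` is a hypothetical name, and picking the first configured match during auto-selection is an assumption, since the server's real ordering is not documented here.

```typescript
type MediaType = "image" | "imageEdit" | "video" | "audio" | "transcription";
type ProviderName = "openai" | "xai" | "google" | "elevenlabs" | "bfl";

// Capabilities as listed under Features.
const CAPABILITIES: Record<ProviderName, MediaType[]> = {
  openai: ["image", "imageEdit", "video", "audio", "transcription"],
  xai: ["image", "imageEdit", "video"],
  google: ["image", "imageEdit", "video", "audio"],
  elevenlabs: ["audio", "transcription"],
  bfl: ["image", "imageEdit"],
};

// Hypothetical selection logic. An explicit provider must be both configured
// and capable; otherwise the first capable configured provider is used.
function selectProvider(
  mediaType: MediaType,
  configured: ProviderName[],
  requested?: ProviderName,
): ProviderName {
  if (requested !== undefined) {
    if (!configured.includes(requested)) {
      throw new Error(`Provider "${requested}" is not configured (no API key)`);
    }
    if (!CAPABILITIES[requested].includes(mediaType)) {
      throw new Error(`Provider "${requested}" does not support ${mediaType}`);
    }
    return requested;
  }
  const match = configured.find((name) => CAPABILITIES[name].includes(mediaType));
  if (match === undefined) {
    throw new Error(`No configured provider supports ${mediaType}`);
  }
  return match;
}

// xAI cannot transcribe, so auto-selection falls through to ElevenLabs.
console.log(selectProvider("transcription", ["xai", "elevenlabs"])); // elevenlabs
```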

Video generation timeout

Video generation polls for up to 10 minutes. If your video hasn't completed in that window, the request will fail with a timeout error. Try a shorter duration or a simpler prompt.
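
The 10-minute window can be pictured as a deadline check performed on each poll. This is a sketch under assumptions: the status values and the function `decideNextStep` are hypothetical, and only the 10-minute limit comes from the text above.

```typescript
// 10-minute polling window, as described above.
const VIDEO_POLL_TIMEOUT_MS = 10 * 60 * 1000;

// Hypothetical job states and poll outcomes.
type JobStatus = "pending" | "completed" | "failed";
type PollDecision = "finish" | "fail" | "timeout" | "wait";

// Decides what a polling loop should do after one status check.
// A finished or failed job takes priority over the deadline.
function decideNextStep(status: JobStatus, startedAtMs: number, nowMs: number): PollDecision {
  if (status === "completed") return "finish";
  if (status === "failed") return "fail";
  return nowMs - startedAtMs >= VIDEO_POLL_TIMEOUT_MS ? "timeout" : "wait";
}

console.log(decideNextStep("pending", 0, 9 * 60 * 1000)); // wait
console.log(decideNextStep("pending", 0, 10 * 60 * 1000)); // timeout
```

Taking the clock values as parameters keeps the check easy to test without real waiting.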

xAI image generation returned no data

This indicates the xAI API returned an empty response. Check that your XAI_API_KEY is valid and that your prompt does not violate xAI content policies.

Gemini image/video generation failed: 403

Verify that the Generative Language API is enabled in Google Cloud Console for the project that owns your GEMINI_API_KEY.

Development

bash
npm run build      # Compile TypeScript to build/
npm test           # Run tests with Vitest
npm run lint       # Lint and auto-fix with ESLint
npm run typecheck  # Type-check without emitting
npm run dev        # Watch mode for TypeScript compilation

Editor Setup

Replace OPENAI_API_KEY with your provider of choice (XAI_API_KEY, GEMINI_API_KEY, ELEVENLABS_API_KEY, BFL_API_KEY). You can set multiple keys to enable multiple providers.

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (the macOS location):

json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cursor

Add to .cursor/mcp.json in your project root (or ~/.cursor/mcp.json globally):

json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

VS Code (GitHub Copilot)

Add to .vscode/mcp.json in your project root:

json
{
  "servers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}

Cline

Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json (the macOS location):

json
{
  "mcpServers": {
    "multimodal-mcp": {
      "command": "npx",
      "args": ["@r16t/multimodal-mcp@latest"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
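
The editor configurations above share one shape and differ only in the root key (VS Code uses servers, the others use mcpServers). As a sketch of how multiple provider keys fit into that shape, the hypothetical helper `buildEditorConfig` below generates the JSON; the helper is not part of the package, and the key values are placeholders.

```typescript
type ServerEntry = {
  command: string;
  args: string[];
  env: Record<string, string>;
};

// Hypothetical helper: builds the config object shown in the sections above.
// Pass "servers" as rootKey for VS Code, or keep the default for the others.
function buildEditorConfig(
  keys: Record<string, string>,
  rootKey: "mcpServers" | "servers" = "mcpServers",
): Record<string, Record<string, ServerEntry>> {
  return {
    [rootKey]: {
      "multimodal-mcp": {
        command: "npx",
        args: ["@r16t/multimodal-mcp@latest"],
        env: { ...keys },
      },
    },
  };
}

// Two providers enabled at once: OpenAI and ElevenLabs.
const editorConfig = buildEditorConfig({
  OPENAI_API_KEY: "sk-...",
  ELEVENLABS_API_KEY: "xi-...",
});
console.log(JSON.stringify(editorConfig, null, 2));
```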

License

MIT
