DataGen
Efficiency & Workflows, by kuoyusheng
Orchestrate DataGen deployments end to end, from validation and execution to monitoring; generate ready-to-use curl commands, input/output schemas, and Mermaid flowcharts; and build, test, and deploy Python automations with ease.
What is DataGen?
Orchestrate DataGen deployments end to end, from validation and execution to monitoring; generate ready-to-use curl commands, input/output schemas, and Mermaid flowcharts; and build, test, and deploy Python automations with ease.
Core Features (20 tools)
getDeploymentDetails
🔍 Get comprehensive deployment details for easy copying to Clay or other tools. Retrieves complete information about a specific deployment, including code, input examples, and ready-to-use curl commands for external integrations.

**Perfect for:**
- Getting curl commands for external API calls
- Understanding deployment input/output schemas
- Integrating deployments into external systems

**Parameters:**
- deployment_uuid: UUID of the deployment
- brief: (optional, default: false) set to true to get only essential information for LLM understanding (name, description, input/output schemas, and example values)

**Returns:**
- Complete deployment metadata and code (when brief=false)
- Ready-to-copy curl commands for sync and async execution
- Input/output schemas with examples
- API endpoint information for external use
- When brief=true: only name, description, input_vars, input_schema, output_schema, and default_input_vars

**📊 Create an Accessible Mermaid Flowchart from Code:** after receiving the response, analyze the `final_code` field and create clear Mermaid diagrams that help non-technical users understand and interact with the code.

**🚨 CRITICAL SYNTAX REQUIREMENT:** always use double-quoted brackets for all nodes: `A["Node Content"]`, NOT `A[Node Content]`.

**Process flow structure:**
- Use a **top-to-bottom flowchart** format (`flowchart TD`)
- Show the **main workflow** with clear start and end points
- Include **decision points** (diamonds) for conditional logic
- Use **descriptive labels** in plain English, avoiding technical jargon
- Group related functions into **logical sections** with subgraphs when helpful

**Function details:** for each function/process box, include:
- The **function name** in readable format (e.g., "Get Repository Data" instead of "mcp_GitHub_get_repo()")
- **Key arguments/inputs** that users might want to modify
- The **purpose** in simple terms (what it does, not how)
- Format: `A["**Function Name**<br/>Purpose: [what it accomplishes]<br/>Input: [key parameters]"]`

**Data classification:** use color coding and styling:
- 📝 **User inputs** (blue/cyan boxes): variables users can modify
- ⚙️ **Processing steps** (green boxes): data transformation and logic
- 🔗 **External calls** (orange boxes): MCP tools, APIs, external services
- 📊 **Outputs** (purple boxes): final results and return values
- 🔄 **Decision points** (yellow diamonds): conditional logic and branching
- ⚠️ **Hardcoded values** (red/pink boxes): fixed data that is not user-configurable (URLs, API keys, constants)

**Code analysis requirements:**
- **Parse `final_code`** to understand the real processing workflow
- **Extract function calls**, especially MCP tool calls (starting with `mcp_`)
- **Identify control flow**, including if/else conditions, loops, and try/except blocks
- **Map data transformations** showing how inputs become outputs
- **Detect service interactions** between different tools and APIs
- **Distinguish hardcoded values** embedded in code from user-configurable ones
- **Make technical concepts accessible** to non-technical users
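A minimal flowchart following the rules above might look like this (node names, labels, and the workflow itself are illustrative only):

```mermaid
flowchart TD
    A["📝 User Input<br/>Purpose: provide search query<br/>Input: query"] --> B["🔗 Get Repository Data<br/>Purpose: fetch repos from GitHub<br/>Input: query"]
    B --> C{"🔄 Any results?"}
    C -->|Yes| D["⚙️ Format Results<br/>Purpose: build summary table"]
    C -->|No| E["📊 Return empty report"]
    D --> F["📊 Final Report"]
```

Note the double-quoted brackets on every node, as the syntax requirement demands.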
validateDeploymentConnection
Validate a deployment's connections before executing a workflow. It is recommended to use this tool before running a deployment. Use it to confirm that the required MCP connections, secrets, and environment variables are configured for the authenticated user.

**Returns (JSON text):**
- `deployment_uuid`, `is_valid`, `status`, and `readiness_flag`
- A compact `missing` object for environment variables, secrets, and MCP connections (with auth/URL hints)
- A `note` with a manage URL whenever secrets are missing
- A `next_steps` array describing remediation actions
submitDeploymentRun
Execute a deployed DataGen workflow with custom inputs. Use this to run pre-built workflows (such as data processing pipelines, web scrapers, or automation scripts) that have been deployed as API endpoints. This starts an asynchronous execution: you get a run ID that you can monitor with 'checkRunStatus'.

**Typical workflow:**
1. Use the `validateDeploymentConnection` tool to validate the deployment and surface any missing MCP connections or secrets.
2. Use this tool to start the deployment.
3. Receive a run_uuid in the response.
4. Use `checkRunStatus` to monitor progress.
5. Retrieve results when complete.

**Use cases:** run data pipelines, execute scrapers, trigger automations, process files.

**Error handling:** if any MCP connections or secrets are reported missing, run the validateDeploymentConnection tool to identify exactly what needs to be configured.
checkRunStatus
Monitor the progress and results of a running DataGen workflow with automatic polling. After starting a workflow with 'submitDeploymentRun', use this to check whether it is still running, completed successfully, or failed. Provide a run UUID directly, or supply a deployment UUID to automatically locate the most recent run for that deployment.

**Status types:**
- 'pending': waiting in queue, check again in a few seconds
- 'running': still executing, check again in a few seconds
- 'completed': finished successfully, output data available
- 'failed': execution failed, error details provided

**Controls:**
- 'timeout_seconds' (default 180, max 600) to cap how long to poll
- 'poll_interval_seconds' (default 5, min 2, max 30) between checks

**Lookup options:**
- Provide 'run_uuid' for direct status polling
- Provide 'deployment_uuid' to look up the latest run automatically
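The submit-then-poll flow above can be sketched in Python. Here `submit_run` and `check_status` are hypothetical stand-ins for the submitDeploymentRun and checkRunStatus tools, with the status sequence stubbed out so the sketch is self-contained:

```python
import time

# Stubbed status sequence simulating a run moving through the queue.
_statuses = iter(["pending", "running", "completed"])

def submit_run(deployment_uuid: str, inputs: dict) -> dict:
    # Stand-in for submitDeploymentRun: real tool returns a run_uuid.
    return {"run_uuid": "run-123"}

def check_status(run_uuid: str) -> dict:
    # Stand-in for checkRunStatus: real tool reports status and output.
    status = next(_statuses)
    out = {"status": status}
    if status == "completed":
        out["output"] = {"ok": True}  # stubbed output data
    return out

run = submit_run("deployment-uuid-here", {"name": "World"})
report = {"status": "pending"}
while report["status"] not in ("completed", "failed"):
    report = check_status(run["run_uuid"])
    if report["status"] in ("pending", "running"):
        time.sleep(0)  # real code would sleep poll_interval_seconds (2-30s)

print(report["status"], report.get("output"))
```

The loop exits on either terminal status; real code should also respect the timeout_seconds cap rather than polling forever.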
executeCode
Execute Python code with full access to MCP tools and data processing libraries. This is your Python sandbox for building workflows, processing data, and integrating multiple tools. You can call MCP tools directly as Python functions (e.g., `mcp_Supabase_list_projects()`) and use libraries like pandas, requests, and more.

**Key features:**
- Call any tool as a Python function, including MCP tools
- Rich execution logs and error handling
- The output of every MCP tool (tools starting with `mcp_`) has already been passed through json.loads() when it is JSON-parsable; if json.loads() fails, the original string is returned. Because MCP output schemas may be undefined, always test the MCP tool's output data structure first, especially for user-defined data such as Google Sheets, Airtable, or Supabase SQL results.
- **Make sure to include the right MCP server name in mcp_server_names and the tool name in the required_tools array.**
- Do not guess the tool name or server name. If it is not in the context, use the searchTools tool to get the correct tool and server names before using this code execution tool.
- If you get a 401 Unauthorized error, it is very likely that you did not include the right server name in the mcp_server_names array.
- If the required server is not installed, prompt the user to install it with the "addRemoteMcpServer" tool.

**Do not use locals() or globals() in the code.** You can assume the input variables are already defined in the global scope and use them directly with the data types defined in the input_schema.

**Do not use async in the code; it will cause the code to not work.**

**When working with APIs:**
- Use httpx instead of requests.
- If an API key is needed, use the getUserSecrets tool to get it; if it is not there, prompt the user to add the API key in DataGen.
- Use the API key like any other variable. DO NOT use os.getenv() to fetch it.
- Use polling if the API is async and you want to wait for the result.

**Coding style:**
- Use Python 3.12 syntax.
- Keep it simple and readable; avoid extensive comments and loggers, printing only what is necessary for debugging.
- Focus on the main logic and do not add unnecessary code.

**Error handling:**
- If you get a 401 Unauthorized error, it is very likely that you did not include the right server name in the mcp_server_names array or are missing required_tools.
- If you have trouble parsing a tool response, use the native tool call to observe the response structure.
- <example> If you get a parsing error with mcp_Supabase_list_projects(), try the user's native list_projects tool in the LLM client (such as Cursor or Claude) to see the response structure. </example>

**See the 'how to use executeCode' resource for detailed examples and best practices.**
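As a sketch of the "test the output data structure first" advice, the snippet below stubs an MCP tool call. `mcp_GoogleSheets_get_values` and its return shape are hypothetical; inside executeCode the real call's JSON output would already be parsed:

```python
# Hypothetical stub simulating an MCP tool whose JSON output
# was already parsed by json.loads() in the sandbox.
def mcp_GoogleSheets_get_values(range_name: str):
    return {"values": [["name", "count"], ["alice", "3"], ["bob", "5"]]}

raw = mcp_GoogleSheets_get_values("Sheet1!A1:B3")

# Probe the structure first: MCP output schemas may be undefined,
# and the result could be a dict or an unparsed string.
print(type(raw))

# Once the shape is confirmed, process it as plain Python data.
rows = raw["values"]
header, body = rows[0], rows[1:]
records = [dict(zip(header, r)) for r in body]
total = sum(int(r["count"]) for r in records)
print(records, total)
```

Note the sketch follows the stated rules: no async, no locals()/globals(), and only minimal debug printing.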
asyncExecuteCode
Execute Python code asynchronously for long-running operations. This tool starts Python code execution in the background and returns immediately with an execution UUID. Use 'checkCodeExecStatus' to monitor progress and retrieve results. Perfect for long-running scripts, large data processing, or operations that might take several minutes; call this tool when you are dealing with long-running operations.

**Key advantages over executeCode:**
- Non-blocking execution
- No timeout limitations

**Workflow:**
1. Call this tool to start execution
2. Get an execution_uuid in the response
3. Use 'checkCodeExecStatus' to monitor progress
4. Retrieve results when status is 'completed'

**Do not use locals() or globals() in the code.** **Do not use async in the code; it will cause the code to not work.** **When working with APIs directly, use httpx instead of requests.**
checkCodeExecStatus
Monitor the progress and retrieve the results of asynchronous code execution with automatic polling. After starting code execution with 'asyncExecuteCode', use this tool to check the current status and get results when complete. This tool provides real-time updates on execution progress.

**Status types:**
- 'pending': execution queued but not yet started
- 'running': currently executing your code
- 'completed': finished successfully, results available
- 'failed': execution failed, error details provided
- 'cancelled': execution was cancelled

**Controls:**
- 'timeout_seconds' (default 300, max 900) to cap how long to poll
- 'poll_interval_seconds' (default 5, min 2, max 30) between checks

**Tip:** for long operations, poll every 10-30 seconds until completion.
getToolDetails
Get comprehensive documentation for any specific tool. Use it to find a tool's details and the server name to use with the code execution tools.
searchTools
🔎 Find tools by functionality, keywords, or provider. Smart search across all available tools when you know what you want to accomplish but aren't sure which specific tool to use. Search by functionality or keywords, or filter by tool type and provider.

**Search examples:**
- "database" → find all database-related tools
- "web scraping" → discover scraping and data extraction tools
- "Supabase" → all Supabase integration tools
- "file upload" → tools for handling file operations

**Filters help you** narrow down to exactly what you need (specific providers, etc.).
deployCode
Deploys working Python code as a DataGen standalone deployment. This tool orchestrates the complete workflow: it takes your Python code, tests it, and creates a standalone deployment as an API endpoint with default values. Perfect for converting working code into a production-ready deployment without flows. Uses OpenAPI/JSON Schema for rich input and output validation with descriptions, type constraints, default values, and comprehensive documentation.

**Schema example:**
input_schema: { 'type': 'object', 'properties': { 'name': {'type': 'string', 'description': 'User name'}, 'count': {'type': 'integer', 'minimum': 1, 'default': 10}, 'data': {'type': 'array', 'items': {'type': 'string'}} }, 'required': ['name'] }

**Do not use locals() or globals() in the code.** You can assume the input variables are already defined in the global scope and use them directly with the data types defined in the input_schema.

**Do not return anything as output.** Deployed code uses global variables to reference input and output values, so do not use a return statement in the main script; doing so triggers a ReturnException. To produce output, simply assign to a global variable and reference it in output_variables. For example, to return an output variable "result" from the main script, write result = "Hello, World!" and declare output_variables: ['result'].

**No async in the code.** Do not use async in the code; it will cause the code to not work.

**Steps to take before deploying code:**
<step0> Briefly explain the code or plan to the user. </step0>
<step1> Come up with the right input_schema and output_schema to define the input and output variables. </step1>
<step2> Confirm with the user that the inputs and outputs are correct; modify if needed. </step2>
<step3> Run the submitDeploymentRun tool to test that the code works on DataGen after the deployment is created. </step3>
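A minimal sketch of a main script matching the schema example above. In a real deployment, 'name' and 'count' would be injected as globals by DataGen; the first two assignments only simulate that so the sketch runs locally:

```python
name = "World"   # in deployment: injected from input_schema property 'name'
count = 3        # in deployment: injected from input_schema property 'count'

# Main logic: note there is no return statement anywhere.
greetings = [f"Hello, {name}!" for _ in range(count)]

# Output is just a global variable; declaring output_variables: ['result']
# tells DataGen to return it.
result = {"greetings": greetings, "total": len(greetings)}
```

A return statement here would trigger a ReturnException; assigning to a named global is the whole output mechanism.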
addRemoteMcpServer
Add a remote MCP server to DataGen with OAuth or a direct URL.

<Find Remote MCP Server URL> Before adding a remote MCP server, and if you have a web research tool, first search for the official remote MCP server and its URL. If no official remote MCP server is available, recommend that the user look at MCP hosting services such as smithery.ai (https://smithery.ai) or Klavis AI (https://klavis.ai) to add the remote MCP server. </Find Remote MCP Server URL>

<Add Remote MCP Server> Directly add remote MCP servers by providing a server name and URL. </Add Remote MCP Server>

Supports OAuth flows. Returns available tools upon successful connection. Perfect for:
- Connecting external MCP services to DataGen

**Input requirements:**
- server_name: display name for the server (must follow the naming rules)
- server_url: remote server endpoint (HTTP/SSE)

**Naming rules:**
- Use only alphanumeric characters (no spaces, underscores, or dashes)
- Start with an uppercase letter
- Use CamelCase for multiple words
- Examples: 'GitHub', 'Slack', 'GoogleDrive'

**Returns:**
- Server info plus a complete list of available tools with descriptions, or an auth URL if OAuth is required
- If success is false, the server was not found or the URL is not valid
- If an auth URL is returned, please format it properly, like [Auth url](https://your-auth-url.com), and use the checkRemoteMcpOauthStatus tool to check the status of the OAuth flow right after this tool call
checkRemoteMcpOauthStatus
Check the status of an OAuth flow for a remote MCP server connection, with polling. After receiving an OAuth redirect URL from addRemoteMcpServer, use this tool to check whether the user has completed authentication. This tool will poll the status until completion or timeout.

**Use this when:**
- addRemoteMcpServer returned requires_auth: true
- The user has completed OAuth authentication in the browser
- You want to confirm the server connection is established

**Returns:** the final connection status on success, or error details on failure/timeout.

**Next steps after success:**
- When status is "completed", the MCP server is connected and ready
- Use 'searchTools' to discover what tools are available from the newly connected server
- Example: searchTools({query: "server_name", tool_type: "mcp"})
ReAuthRemoteMcpServer
🔄 Reauthenticate an existing remote MCP server connection. When an existing remote MCP server's OAuth tokens have expired or become invalid, use this tool to initiate a fresh authentication flow. This will start a new OAuth flow while preserving the server configuration.

**Use this when:**
- Server tools stop working due to expired tokens
- You receive authentication errors from MCP tools
- OAuth tokens need to be refreshed for a connected server
- The server connection has been lost and needs re-authentication

**Process:**
1. Call this tool with the server name (must follow the naming rules)
2. If OAuth is required, you'll get an auth_url
3. Complete authentication in the browser
4. Use checkRemoteMcpOauthStatus to verify completion

**Naming rules:**
- Use only alphanumeric characters (no spaces, underscores, or dashes)
- Start with an uppercase letter
- Use CamelCase for multiple words
- Examples: 'GitHub', 'Slack', 'GoogleDrive', 'OpenAI'

**Returns:** either immediate success or OAuth flow details for browser authentication.
updateRemoteMcpServer
Update an existing remote MCP server with new configuration and refresh its tools list. Use this tool to change the server URL, update authentication credentials, or refresh environment variables.

**Perfect for:**
- Updating the server URL when endpoints change
- Refreshing API keys or authentication tokens
- Updating environment variables or configuration
- Migrating to new API versions or endpoints
- Getting the latest available tools after config changes

**Requirements:**
- A server with the given name must already exist
- The new server URL must be accessible
- The new authentication credentials must be valid

**Input requirements:**
- server_name: name of the existing server (must match exactly and follow the naming rules)
- server_url: new remote server endpoint URL
- env_args: updated environment variables/configuration

**Naming rules:**
- Use only alphanumeric characters (no spaces, underscores, or dashes)
- Start with an uppercase letter
- Use CamelCase for multiple words
- Examples: 'GitHub', 'Slack', 'GoogleDrive', 'OpenAI'

**Returns:** updated server info with a refreshed tools list.
getUserSecrets
Retrieve all available secret key names for the authenticated user. These keys can be referenced in Python code execution for MCP tool integrations, but the actual values are never exposed, for security.

**Perfect for:**
- Discovering what secret keys are available for workflow integrations
- Understanding which MCP providers are configured
- Planning workflows that require authentication with external services

**Returns:**
- A list of available secret keys with their names and providers
- Metadata including the total count and available providers
- Usage instructions for referencing secrets in executeCode

**Security note:** only secret key names and metadata are returned, never the actual secret values.
scheduleDeployment
🕐 Schedule a deployment to run at specific times or intervals. Set up automated execution of deployments using flexible scheduling options, including:
- One-time execution at a specific date/time
- Recurring schedules using cron expressions
- Simple interval-based schedules (daily, weekly, monthly)

**Perfect for:**
- Automated data processing workflows
- Regular report generation
- Periodic API data syncing
- Scheduled backup operations
- Time-based business process automation

**Schedule types:**
- 'once': execute once at a specific datetime
- 'cron': use a cron expression for complex schedules
- 'interval': simple recurring intervals (daily, weekly, monthly)

**Examples:**
- Daily at 9 AM: schedule_type='interval', interval='daily', time='09:00'
- Every Monday at 2 PM: schedule_type='cron', cron_expression='0 14 * * 1'
- Once on Dec 25, 2024 at 10:30 AM: schedule_type='once', datetime='2024-12-25T10:30:00Z'
listSchedules
List all scheduled deployments for the current user. View and manage all your scheduled deployment executions with filtering and pagination options.

**Perfect for:**
- Getting an overview of all scheduled tasks
- Finding specific schedules by deployment or status
- Managing and monitoring scheduled executions
- Planning workflow timing and coordination

**Returns:**
- A list of all schedules with details
- Schedule status and next execution times
- Deployment information and input variables
- Pagination support for large lists
memory_write
💾 Write a personalized memory for the authenticated user. Capture durable preferences, ongoing work, or contextual notes so future workflows can tailor their behaviour automatically.

**Great for:**
- Remembering preferred tone or formatting
- Storing project milestones or TODOs
- Persisting CRM or onboarding notes
- Tracking tool configuration choices
memory_search
🔍 Search memories previously saved for the current user. Run semantic search across stored context to quickly retrieve preferences, project history, or tagged notes.
deleteSchedule
🗑️ Delete a scheduled deployment permanently. Remove a scheduled deployment from the system. This action cannot be undone, but it will not affect any deployments that have already been executed.

**Perfect for:**
- Removing schedules that are no longer needed
- Cleaning up test or temporary schedules
- Managing schedule cleanup and maintenance

**Warning:** this action is permanent and cannot be undone.
FAQ
What is DataGen?
Orchestrate DataGen deployments end to end, from validation and execution to monitoring; generate ready-to-use curl commands, input/output schemas, and Mermaid flowcharts; and build, test, and deploy Python automations with ease.
What tools does DataGen provide?
It provides 20 tools, including getDeploymentDetails, validateDeploymentConnection, and submitDeploymentRun.
Related Skills
Spreadsheet Processing
by anthropics
Reads, writes, repairs, cleans, formats, computes formulas for, and converts .xlsx, .xlsm, .csv, and .tsv files. Suited to modifying existing spreadsheets, generating new reports, or turning messy data into delivery-grade spreadsheets.
✎ Makes Excel/CSV tasks painless: it reads, writes, repairs, cleans, and converts formats directly, and is especially good at turning messy spreadsheets into delivery-ready files.
PDF Processing
by anthropics
Reach for it for PDF reading and writing, text and table extraction, merging and splitting, rotating and watermarking, form filling, or encryption and decryption; it can also extract images, generate new PDFs, and turn scans into searchable documents via OCR.
✎ Stop switching tools for PDF odd jobs: text and table extraction, merging, splitting, and OCR are handled in one place, and even scans become searchable.
Word Documents
by anthropics
Covers creating, reading, editing, and restructuring Word/.docx documents. Suited to generating reports, memos, letters, and templates, it also handles tables of contents, headers and footers, page numbers, image replacement, find-and-replace, tracked changes and comments, and content extraction.
✎ Handles .docx creation, rewriting, and fine-grained layout; tables of contents, batch replacement, tracked changes and comments, and image updates can all be automated. Especially handy for formal documents.
Related MCP Servers
Filesystem
Editor's pick, by Anthropic
Filesystem is the official MCP reference server that lets LLMs safely read and write the local filesystem.
✎ This server solves the pain point of letting Claude operate on local files directly, such as automatically organizing documents or generating code files. It suits developers who need automated file handling, but note that it is only a reference implementation; harden it yourself before production use.
Desktop Commander
by wonderwhy-er
Desktop Commander is an MCP server that lets AI execute terminal commands and manage files and processes directly.
✎ This tool solves the pain point of AI being unable to operate on the local environment, and suits developers who need automated script debugging or batch file processing. It lets you drive the terminal in natural language, but be careful with permission controls: letting an AI run rm -rf is no joke.
EdgarTools
Editor's pick, by dgunning
EdgarTools is an open-source Python library that parses SEC EDGAR filings without requiring an API key.
✎ This tool solves the pain of acquiring financial data: it lets AI read structured filings directly, for example having Claude analyze Apple's 10-K. It suits quantitative analysts and finance developers who need to build data pipelines quickly. Note that it depends on the SEC website's availability and may lag at peak times.