io.github.txn2/mcp-datahub
Platforms & Services by txn2
An MCP server for the DataHub data catalog that discovers datasets, explores lineage, and accesses metadata.
README
An MCP server and composable Go library that connects AI assistants to DataHub metadata catalogs. Search datasets, explore schemas, trace lineage, and access glossary terms and domains.
mcp-datahub.txn2.com | Installation | Library Docs
MCP Data Platform Ecosystem
mcp-datahub is part of a broader suite of open-source MCP servers designed to work together as a composable data platform. Each component can run standalone or be combined to give AI assistants unified access to storage, query engines, and metadata catalogs.
Two Ways to Use
1. Standalone MCP Server
Install and connect to Claude Desktop, Cursor, or any MCP client:
Claude Desktop (Easiest) - Download the .mcpb bundle from releases and double-click to install:
- macOS Apple Silicon: mcp-datahub_X.X.X_darwin_arm64.mcpb
- macOS Intel: mcp-datahub_X.X.X_darwin_amd64.mcpb
- Windows: mcp-datahub_X.X.X_windows_amd64.mcpb
Other Installation Methods:
# Homebrew (macOS)
brew install txn2/tap/mcp-datahub
# Go install
go install github.com/txn2/mcp-datahub/cmd/mcp-datahub@latest
Manual Claude Desktop Configuration (if not using MCPB):
{
  "mcpServers": {
    "datahub": {
      "command": "/opt/homebrew/bin/mcp-datahub",
      "env": {
        "DATAHUB_URL": "https://datahub.example.com",
        "DATAHUB_TOKEN": "your_token"
      }
    }
  }
}
Multi-Server Configuration
Connect to multiple DataHub instances simultaneously:
# Primary server
export DATAHUB_URL=https://prod.datahub.example.com/api/graphql
export DATAHUB_TOKEN=prod-token
export DATAHUB_CONNECTION_NAME=prod
# Additional servers (JSON)
export DATAHUB_ADDITIONAL_SERVERS='{"staging":{"url":"https://staging.datahub.example.com/api/graphql","token":"staging-token"}}'
Use datahub_list_connections to discover available connections, then pass the connection parameter to any tool.
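To make the shape of DATAHUB_ADDITIONAL_SERVERS concrete, here is a minimal Go sketch that decodes the JSON map shown above using only the standard library. The `serverConfig` type and `parseAdditionalServers` function are illustrative names, not part of mcp-datahub, and only the `url` and `token` fields from the example are assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverConfig mirrors the per-connection fields from the example above.
// Only "url" and "token" appear in the README; nothing else is assumed.
type serverConfig struct {
	URL   string `json:"url"`
	Token string `json:"token"`
}

// parseAdditionalServers decodes a JSON map of connection name to settings.
// An empty string yields an empty map, so the variable stays optional.
func parseAdditionalServers(raw string) (map[string]serverConfig, error) {
	servers := map[string]serverConfig{}
	if raw == "" {
		return servers, nil
	}
	if err := json.Unmarshal([]byte(raw), &servers); err != nil {
		return nil, fmt.Errorf("parse additional servers: %w", err)
	}
	return servers, nil
}

func main() {
	raw := `{"staging":{"url":"https://staging.datahub.example.com/api/graphql","token":"staging-token"}}`
	servers, err := parseAdditionalServers(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("staging ->", servers["staging"].URL)
}
```

Each key of the map (here `staging`) is the name you would pass as the connection parameter.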
2. Composable Go Library
Import into your own MCP server for custom authentication, tenant isolation, and audit logging:
import (
	"log"

	"github.com/txn2/mcp-datahub/pkg/client"
	"github.com/txn2/mcp-datahub/pkg/tools"
)

// Create the client from environment variables and fail fast on a bad configuration
datahubClient, err := client.NewFromEnv()
if err != nil {
	log.Fatalf("create DataHub client: %v", err)
}
defer datahubClient.Close()

// Register all tools with your MCP server
toolkit := tools.NewToolkit(datahubClient, tools.Config{})
toolkit.RegisterAll(yourMCPServer)
Customizing Tool Descriptions
Override tool descriptions to match your deployment:
toolkit := tools.NewToolkit(datahubClient, tools.Config{},
	tools.WithDescriptions(map[tools.ToolName]string{
		tools.ToolSearch: "Search our internal data catalog for datasets and dashboards",
	}),
)
Customizing Tool Annotations
Override MCP tool annotations (behavior hints for AI clients):
// boolPtr is a small helper (not shown) that returns a pointer to its argument
toolkit := tools.NewToolkit(datahubClient, tools.Config{},
	tools.WithAnnotations(map[tools.ToolName]*mcp.ToolAnnotations{
		tools.ToolSearch: {ReadOnlyHint: true, OpenWorldHint: boolPtr(true)},
	}),
)
All 12 tools ship with default annotations: read tools are marked ReadOnlyHint: true; datahub_create is non-destructive and non-idempotent; datahub_update is non-destructive and idempotent; datahub_delete is destructive and idempotent.
Extensions (Logging, Metrics, Error Hints)
Enable optional middleware via the extensions package:
import "github.com/txn2/mcp-datahub/pkg/extensions"
// Load from environment variables (MCP_DATAHUB_EXT_*)
cfg := extensions.FromEnv()
opts := extensions.BuildToolkitOptions(cfg)
toolkit := tools.NewToolkit(datahubClient, toolsCfg, opts...)
// Or load from a YAML/JSON config file
serverCfg, _ := extensions.LoadConfig("config.yaml")
See the library documentation for middleware, selective tool registration, and enterprise patterns.
Combining with mcp-trino
Build a unified data platform MCP server by combining DataHub metadata with Trino query execution:
import (
	datahubClient "github.com/txn2/mcp-datahub/pkg/client"
	datahubTools "github.com/txn2/mcp-datahub/pkg/tools"
	trinoClient "github.com/txn2/mcp-trino/pkg/client"
	trinoTools "github.com/txn2/mcp-trino/pkg/tools"
)
// Add DataHub tools (search, lineage, schema, glossary)
dh, _ := datahubClient.NewFromEnv()
datahubTools.NewToolkit(dh, datahubTools.Config{}).RegisterAll(server)
// Add Trino tools (query execution, catalog browsing)
tr, _ := trinoClient.NewFromEnv()
trinoTools.NewToolkit(tr, trinoTools.Config{}).RegisterAll(server)
// AI assistants can now:
// - Search DataHub for tables -> Get schema -> Query via Trino
// - Explore lineage -> Understand data flow -> Run validation queries
See txn2/mcp-trino for the companion library.
Bidirectional Integration with QueryProvider
The library supports bidirectional context injection. While mcp-trino can pull semantic context from DataHub, mcp-datahub can receive query execution context back from a query engine:
import (
	"context"

	datahubTools "github.com/txn2/mcp-datahub/pkg/tools"
	"github.com/txn2/mcp-datahub/pkg/integration"
)

// QueryProvider enables query engines to inject context into DataHub tools
type myQueryProvider struct {
	trinoClient *trino.Client
}

func (p *myQueryProvider) Name() string { return "trino" }

func (p *myQueryProvider) ResolveTable(ctx context.Context, urn string) (*integration.TableIdentifier, error) {
	// Map DataHub URN to Trino table (catalog.schema.table)
	return &integration.TableIdentifier{
		Catalog: "hive", Schema: "production", Table: "users",
	}, nil
}

func (p *myQueryProvider) GetTableAvailability(ctx context.Context, urn string) (*integration.TableAvailability, error) {
	// Check if table is queryable
	return &integration.TableAvailability{Available: true}, nil
}

func (p *myQueryProvider) GetQueryExamples(ctx context.Context, urn string) ([]integration.QueryExample, error) {
	// Return sample queries for this entity
	return []integration.QueryExample{
		{Name: "sample", SQL: "SELECT * FROM hive.production.users LIMIT 10"},
	}, nil
}

// Wire it up
toolkit := datahubTools.NewToolkit(datahubClient, config,
	datahubTools.WithQueryProvider(&myQueryProvider{trinoClient: trino}),
)
When a QueryProvider is configured, tool responses are enriched:
- Search results: Include query_context with table availability
- Entity details: Include query_table, query_examples, query_availability
- Schema: Include query_table for immediate SQL usage
- Lineage: Include execution_context mapping URNs to tables
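The ResolveTable stub above returns a fixed table. As a rough sketch of what a real mapping could look like, the following standalone Go program parses a DataHub dataset URN of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. It assumes the dataset name is exactly `catalog.schema.table`, which depends on how your ingestion names datasets. `TableIdentifier` here is a local stand-in for illustration, not the library's `integration.TableIdentifier`, and `parseDatasetURN` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// TableIdentifier is a local stand-in with the three fields used in the
// stub above; it is not the library's type.
type TableIdentifier struct {
	Catalog, Schema, Table string
}

// parseDatasetURN splits a dataset URN into platform, name, and env, then
// assumes the name has the form "catalog.schema.table".
func parseDatasetURN(urn string) (*TableIdentifier, error) {
	const prefix = "urn:li:dataset:("
	if !strings.HasPrefix(urn, prefix) || !strings.HasSuffix(urn, ")") {
		return nil, fmt.Errorf("not a dataset URN: %q", urn)
	}
	body := strings.TrimSuffix(strings.TrimPrefix(urn, prefix), ")")
	parts := strings.Split(body, ",")
	if len(parts) != 3 {
		return nil, fmt.Errorf("expected platform, name, env in %q", urn)
	}
	name := strings.Split(parts[1], ".")
	if len(name) != 3 {
		return nil, fmt.Errorf("dataset name %q is not catalog.schema.table", parts[1])
	}
	return &TableIdentifier{Catalog: name[0], Schema: name[1], Table: name[2]}, nil
}

func main() {
	id, err := parseDatasetURN("urn:li:dataset:(urn:li:dataPlatform:trino,hive.production.users,PROD)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s.%s.%s\n", id.Catalog, id.Schema, id.Table)
}
```

Returning an error for names that do not split into three parts keeps the provider from guessing a table for entities it cannot map.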
Integration Middleware
Enterprise features like access control and audit logging are enabled through middleware adapters:
import (
	"context"

	datahubTools "github.com/txn2/mcp-datahub/pkg/tools"
	"github.com/txn2/mcp-datahub/pkg/integration"
)

// Access control - filter entities by user permissions
type myAccessFilter struct{}

func (f *myAccessFilter) CanAccess(ctx context.Context, urn string) (bool, error) { /* ... */ }
func (f *myAccessFilter) FilterURNs(ctx context.Context, urns []string) ([]string, error) { /* ... */ }

// Audit logging - track all tool invocations
type myAuditLogger struct{}

func (l *myAuditLogger) LogToolCall(ctx context.Context, tool string, params map[string]any, userID string) error { /* ... */ }

// Wire up with multiple integration options
toolkit := datahubTools.NewToolkit(datahubClient, config,
	datahubTools.WithAccessFilter(&myAccessFilter{}),
	datahubTools.WithAuditLogger(&myAuditLogger{}, func(ctx context.Context) string {
		// Comma-ok assertion avoids a panic when no user ID is set on the context
		userID, _ := ctx.Value("user_id").(string)
		return userID
	}),
	datahubTools.WithURNResolver(&myURNResolver{}),   // Map external IDs to URNs
	datahubTools.WithMetadataEnricher(&myEnricher{}), // Add custom metadata
)
See the library documentation for complete integration patterns.
Available Tools
Read Tools (always available)
| Tool | Description |
|---|---|
| datahub_search | Search for datasets, dashboards, pipelines by query and entity type |
| datahub_get_entity | Get entity metadata by URN (description, owners, tags, domain) |
| datahub_get_schema | Get dataset schema with field types and descriptions |
| datahub_get_lineage | Get upstream/downstream lineage (supports level=column for column-level) |
| datahub_get_queries | Get SQL queries associated with a dataset |
| datahub_browse | Browse catalog: list tags, domains, or data products |
| datahub_get_glossary_term | Get glossary term definition and properties |
| datahub_get_data_product | Get data product details (owners, domain, properties) |
| datahub_list_connections | List configured DataHub server connections (multi-server mode) |
Write Tools (require DATAHUB_WRITE_ENABLED=true)
Three CRUD tools use the "what" discriminator pattern, covering 35 operations in total:
| Tool | Operations | Description |
|---|---|---|
| datahub_create | 10 | Create tags, domains, glossary terms, data products, documents, applications, queries, incidents, structured properties, data contracts |
| datahub_update | 17 | Update descriptions, tags, glossary terms, links, owners, domains, structured properties, incidents, queries, documents, data contracts |
| datahub_delete | 8 | Delete queries, tags, domains, glossary entities, data products, applications, documents, structured properties |
Write tools are disabled by default for safety.
DataHub Version Compatibility
Minimum: DataHub 1.3.x. Full feature set: DataHub 1.4.x.
| DataHub Version | Features |
|---|---|
| 1.3.x+ (minimum) | All read tools, all write operations except documents (tags, domains, glossary, data products, queries, owners, links, descriptions, incidents, applications, structured properties incl. delete, data contracts) |
| 1.4.x+ (full) | + Documents (create/update/delete) |
The client gracefully handles version differences — read queries return empty results (not errors) when a feature is unavailable on older versions.
See the tools reference for detailed documentation.
Configuration
| Variable | Description | Default |
|---|---|---|
| DATAHUB_URL | DataHub GraphQL API URL | (required) |
| DATAHUB_TOKEN | API token | (required) |
| DATAHUB_TIMEOUT | Request timeout (seconds) | 30 |
| DATAHUB_DEFAULT_LIMIT | Default search limit | 10 |
| DATAHUB_MAX_LIMIT | Maximum limit | 100 |
| DATAHUB_CONNECTION_NAME | Display name for primary connection | datahub |
| DATAHUB_ADDITIONAL_SERVERS | JSON map of additional servers | (optional) |
| DATAHUB_WRITE_ENABLED | Enable write operations (true or 1) | false |
| DATAHUB_DEBUG | Enable debug logging (1 or true) | false |
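The defaults in the table can be read as a small loading routine. The sketch below is an illustration only, using the standard library and the variable names and defaults listed above. The `settings` type, the `load` and `intOr` helpers, and the choice to fall back to the default on an unparsable value are assumptions, not the server's documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// settings holds a subset of the variables from the table above.
type settings struct {
	URL          string
	Token        string
	Timeout      time.Duration
	DefaultLimit int
	MaxLimit     int
	WriteEnabled bool
}

// intOr parses a positive integer variable and falls back to def when the
// value is unset or invalid (an assumption made for this sketch).
func intOr(lookup func(string) string, key string, def int) int {
	if n, err := strconv.Atoi(lookup(key)); err == nil && n > 0 {
		return n
	}
	return def
}

// load reads settings through lookup and applies the documented defaults:
// 30 second timeout, default limit 10, maximum limit 100, writes disabled.
func load(lookup func(string) string) (settings, error) {
	s := settings{
		URL:          lookup("DATAHUB_URL"),
		Token:        lookup("DATAHUB_TOKEN"),
		Timeout:      time.Duration(intOr(lookup, "DATAHUB_TIMEOUT", 30)) * time.Second,
		DefaultLimit: intOr(lookup, "DATAHUB_DEFAULT_LIMIT", 10),
		MaxLimit:     intOr(lookup, "DATAHUB_MAX_LIMIT", 100),
	}
	v := lookup("DATAHUB_WRITE_ENABLED")
	s.WriteEnabled = v == "true" || v == "1"
	if s.URL == "" || s.Token == "" {
		return s, fmt.Errorf("DATAHUB_URL and DATAHUB_TOKEN are required")
	}
	return s, nil
}

// loadFromEnv reads the settings from the process environment.
func loadFromEnv() (settings, error) {
	return load(os.Getenv)
}

func main() {
	s, err := loadFromEnv()
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("timeout=%s default_limit=%d write_enabled=%t\n", s.Timeout, s.DefaultLimit, s.WriteEnabled)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the loader easy to exercise with a plain map.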
Extensions
| Variable | Description | Default |
|---|---|---|
| MCP_DATAHUB_EXT_LOGGING | Enable structured logging of tool calls | false |
| MCP_DATAHUB_EXT_METRICS | Enable metrics collection | false |
| MCP_DATAHUB_EXT_METADATA | Enable metadata enrichment on results | false |
| MCP_DATAHUB_EXT_ERRORS | Enable error hint enrichment | true |
Config File
As an alternative to environment variables, configure via YAML or JSON:
datahub:
  url: https://datahub.example.com
  token: "${DATAHUB_TOKEN}"
  timeout: "30s"
  write_enabled: true
toolkit:
  default_limit: 20
  descriptions:
    datahub_search: "Custom search description for your deployment"
extensions:
  logging: true
  errors: true
Load with extensions.LoadConfig("config.yaml"). Environment variables override file values for sensitive fields. Token values support $VAR / ${VAR} expansion.
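The `$VAR` / `${VAR}` syntax matches what Go's standard `os.Expand` handles, so token expansion can be sketched as below. Whether the library itself uses `os.Expand` is an assumption; `expandToken` is a hypothetical helper, and rejecting an empty result is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// expandToken replaces $VAR and ${VAR} references using lookup.
// os.Expand substitutes "" for unknown variables, so an empty result is
// reported as an error instead of being used as a token.
func expandToken(value string, lookup func(string) string) (string, error) {
	expanded := os.Expand(value, lookup)
	if expanded == "" {
		return "", fmt.Errorf("token %q expanded to an empty string", value)
	}
	return expanded, nil
}

func main() {
	token, err := expandToken("${DATAHUB_TOKEN}", os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("token length:", len(token))
}
```

Keeping the secret in the environment and only a reference in the file means the config file can be committed without exposing the token.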
See configuration reference for all options.
Development
make build # Build binary
make test # Run tests with race detection
make lint # Run golangci-lint
make security # Run gosec and govulncheck
make coverage # Generate coverage report
make verify # Run tidy, lint, and test
make help # Show all targets
Related Projects
- txn2/mcp-trino (docs) - Composable MCP toolkit for Trino query execution
- DataHub - The open-source metadata platform
Contributing
See CONTRIBUTING.md for guidelines.
License
Open source by Craig Johnston, sponsored by Deasil Works, Inc.