Spark SQL


by aidancorrell

Query Spark SQL clusters via Thrift/HiveServer2. Works with Spark, EMR, Hive, Impala.



Spark SQL MCP Server

<!-- mcp-name: io.github.aidancorrell/spark-sql-mcp-server -->

An MCP server that enables AI assistants to query Spark SQL clusters via the Thrift/HiveServer2 protocol.

Works with any HiveServer2-compatible system: Apache Spark, AWS EMR, Hive, Impala, Presto.

Features

  • Query Spark SQL — Execute read-only SQL queries against your Spark cluster
  • Schema Discovery — List databases, tables, and describe table structures
  • Multiple Auth Methods — NONE, LDAP, NOSASL, CUSTOM, and Kerberos authentication
  • EMR Compatible — Works with AWS EMR clusters out of the box
  • Read-Only Enforcement — Only SELECT, SHOW, DESCRIBE, EXPLAIN, and WITH statements are allowed
  • Safety Defaults — Automatic LIMIT clause on unbounded queries, sanitized error messages
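
To make the "automatic LIMIT" safety default concrete, here is a minimal sketch of how such a guard could work. It is not taken from the server's source: the function name `apply_default_limit` and the default of 1000 rows are assumptions made for illustration, and a real implementation may handle subqueries and comments differently.

```python
import re

# Hypothetical default; the README does not state the actual row cap.
DEFAULT_LIMIT = 1000


def apply_default_limit(sql: str, limit: int = DEFAULT_LIMIT) -> str:
    """Append a LIMIT clause to SELECT/WITH queries that do not end with one."""
    stripped = sql.strip().rstrip(";").rstrip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    if first_word not in {"SELECT", "WITH"}:
        # SHOW / DESCRIBE / EXPLAIN statements are returned unchanged.
        return stripped
    if re.search(r"\blimit\s+\d+\s*$", stripped, flags=re.IGNORECASE):
        # The query already ends with an explicit LIMIT.
        return stripped
    return f"{stripped} LIMIT {limit}"


print(apply_default_limit("SELECT * FROM default.orders;"))
```

The check only looks at the end of the statement, so a LIMIT inside a subquery does not count as bounding the outer query.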

Installation

bash
pip install spark-sql-mcp-server

Or run directly with uvx:

bash
uvx spark-sql-mcp-server

Quick Start

1. Set Environment Variables

bash
export SPARK_HOST="your-emr-master-node.amazonaws.com"
export SPARK_PORT="10000"        # default
export SPARK_DATABASE="default"  # default
export SPARK_AUTH="NONE"         # NONE | LDAP | KERBEROS | CUSTOM | NOSASL
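
For readers who want to see how these variables and their defaults fit together, the following Python sketch reads them into a small settings object. It is an illustration only: `ConnectionSettings` and `load_settings` are hypothetical names, and treating `SPARK_HOST` as required is an assumption, not documented behavior.

```python
import os
from dataclasses import dataclass

# Auth modes listed in the README.
VALID_AUTH = {"NONE", "LDAP", "KERBEROS", "CUSTOM", "NOSASL"}


@dataclass(frozen=True)
class ConnectionSettings:  # hypothetical name, not the package's own class
    host: str
    port: int = 10000
    database: str = "default"
    auth: str = "NONE"


def load_settings(env=None) -> ConnectionSettings:
    """Build settings from SPARK_* variables, applying the documented defaults."""
    env = os.environ if env is None else env
    host = env.get("SPARK_HOST")
    if not host:
        # Assumption: a host must be provided explicitly.
        raise ValueError("SPARK_HOST is required")
    auth = env.get("SPARK_AUTH", "NONE").upper()
    if auth not in VALID_AUTH:
        raise ValueError(f"Unsupported SPARK_AUTH value: {auth}")
    return ConnectionSettings(
        host=host,
        port=int(env.get("SPARK_PORT", "10000")),
        database=env.get("SPARK_DATABASE", "default"),
        auth=auth,
    )


print(load_settings({"SPARK_HOST": "localhost"}))
```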

2. Add to Claude Code

Global (all projects) — add to ~/.claude.json under your project's mcpServers:

json
{
  "mcpServers": {
    "spark-sql": {
      "command": "uvx",
      "args": ["spark-sql-mcp-server"],
      "env": {
        "SPARK_HOST": "your-emr-master-node.amazonaws.com",
        "SPARK_PORT": "10000",
        "SPARK_AUTH": "NONE"
      }
    }
  }
}

Project-level — add to .claude/mcp.json in your repo:

json
{
  "mcpServers": {
    "spark-sql": {
      "command": "uvx",
      "args": ["spark-sql-mcp-server"],
      "env": {
        "SPARK_HOST": "your-emr-master-node.amazonaws.com",
        "SPARK_PORT": "10000",
        "SPARK_AUTH": "NONE"
      }
    }
  }
}

3. Add to Claude Desktop

Add to your claude_desktop_config.json:

json
{
  "mcpServers": {
    "spark-sql": {
      "command": "uvx",
      "args": ["spark-sql-mcp-server"],
      "env": {
        "SPARK_HOST": "your-emr-master-node.amazonaws.com",
        "SPARK_PORT": "10000"
      }
    }
  }
}
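
If you would rather script this change than edit the file by hand, the merge step could look roughly like the sketch below. `add_mcp_server` is a hypothetical helper, not part of this package; it works on an already-parsed config dict and leaves reading and writing the file to the caller.

```python
import json

# The same entry shown in the JSON snippets above.
SPARK_SQL_ENTRY = {
    "command": "uvx",
    "args": ["spark-sql-mcp-server"],
    "env": {"SPARK_HOST": "your-emr-master-node.amazonaws.com", "SPARK_PORT": "10000"},
}


def add_mcp_server(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry added under mcpServers[name]."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other-server": {"command": "other"}}}
print(json.dumps(add_mcp_server(existing, "spark-sql", SPARK_SQL_ENTRY), indent=2))
```

Copying the `mcpServers` mapping before adding the entry keeps any servers that are already configured.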

4. Query

Ask Claude things like:

  • "What databases are available in our Spark cluster?"
  • "Show me the schema of the sales.transactions table"
  • "Query the top 10 customers by revenue from the analytics database"

Available Tools

| Tool | Description |
| --- | --- |
| list_databases | List all available databases |
| list_tables | List tables in a database |
| describe_table | Get table schema (columns, types) |
| execute_query | Run read-only SQL queries with formatted results |
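
An MCP client invokes these tools with a JSON-RPC `tools/call` request. The sketch below builds such a message in Python to show its shape. The tool name comes from the list above, but the argument name `query` is an assumption, since the README does not list each tool's parameters.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# "query" is an assumed parameter name, used here only for illustration.
request = build_tool_call(1, "execute_query", {"query": "SELECT 1"})
print(request)
```

In practice Claude builds and sends these messages for you; the example is only meant to show what travels over the wire.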

Authentication

No Auth (default)

bash
export SPARK_AUTH="NONE"

LDAP

bash
export SPARK_AUTH="LDAP"
export SPARK_USERNAME="your-username"
export SPARK_PASSWORD="your-password"

Kerberos

bash
export SPARK_AUTH="KERBEROS"
export SPARK_KERBEROS_SERVICE_NAME="hive"  # default
# Ensure you have a valid Kerberos ticket (kinit)

AWS EMR Setup

  1. Security Group — Allow inbound traffic on port 10000 from your IP
  2. SSH Tunnel (recommended):
    bash
    ssh -i your-key.pem -L 10000:localhost:10000 hadoop@your-emr-master
    
  3. Set SPARK_HOST=localhost so the server connects through the tunnel
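
Before pointing the MCP server at the tunnel, it can help to confirm that something is listening on the forwarded port. The following is a small standalone check using only the Python standard library; `port_is_open` is a hypothetical helper, not something this package provides.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # With the SSH tunnel from step 2 running, this should report True.
    print("Thrift port reachable:", port_is_open("localhost", 10000))
```

A successful TCP connection only shows that the port is reachable; it does not verify authentication or that the Thrift server is healthy.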

Development

bash
git clone https://github.com/aidancorrell/spark-sql-mcp-server.git
cd spark-sql-mcp-server
pip install -e ".[dev]"
pytest
ruff check .

Local Testing with Docker

A Docker Compose setup provides a local Spark Thrift Server with sample data for integration testing.

bash
# Start the Spark Thrift Server (the subshell keeps you in the repo root)
(cd docker && docker compose up -d)

# Wait for it to be ready (takes ~30s on first start)
docker logs -f spark-thrift-server  # look for "Sample data loaded.", then press Ctrl+C

# Run integration tests from the repo root
pytest -m integration -v

# Tear down
(cd docker && docker compose down -v)

The local server comes with sample tables: default.employees, default.orders, and test_db.metrics.

Unit tests run by default with pytest (integration tests are skipped unless -m integration is specified).

Using the local server with Claude Code

With the Docker Spark server running, add it to your MCP config to test the server interactively.

Global — add to ~/.claude.json under your project's mcpServers:

json
{
  "spark-sql": {
    "command": "uvx",
    "args": ["spark-sql-mcp-server"],
    "env": {
      "SPARK_HOST": "localhost",
      "SPARK_PORT": "10000",
      "SPARK_AUTH": "NONE"
    }
  }
}

Project-level — add to .claude/mcp.json:

json
{
  "mcpServers": {
    "spark-sql": {
      "command": "uvx",
      "args": ["spark-sql-mcp-server"],
      "env": {
        "SPARK_HOST": "localhost",
        "SPARK_PORT": "10000",
        "SPARK_AUTH": "NONE"
      }
    }
  }
}

Then start a new Claude Code session and ask it to query the sample data.

Security

Read-Only Enforcement

The execute_query tool only allows read-only SQL statements. Queries must start with one of: SELECT, SHOW, DESCRIBE, DESC, EXPLAIN, or WITH. All other statement types (DROP, INSERT, DELETE, CREATE, ALTER, SET, ADD JAR, etc.) are rejected before reaching the Spark cluster.
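
A prefix check of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the server's actual code: `is_read_only` is a hypothetical name, and skipping leading comments before reading the first keyword is a choice made for this example.

```python
import re

# Statement prefixes listed in the README as allowed.
ALLOWED_PREFIXES = {"SELECT", "SHOW", "DESCRIBE", "DESC", "EXPLAIN", "WITH"}

# Leading whitespace, "-- line" comments, and "/* block */" comments.
_LEADING_NOISE = re.compile(r"^(?:\s+|--[^\n]*(?:\n|$)|/\*.*?\*/)+", re.DOTALL)


def is_read_only(sql: str) -> bool:
    """Return True if the first keyword of sql is one of the allowed prefixes."""
    body = _LEADING_NOISE.sub("", sql)
    match = re.match(r"[A-Za-z]+", body)
    return bool(match) and match.group(0).upper() in ALLOWED_PREFIXES


for statement in ("SELECT 1", "  -- note\nshow tables", "/* x */ DROP TABLE t"):
    print(repr(statement), is_read_only(statement))
```

This sketch inspects only the first keyword. It does not look at the rest of the statement, so it says nothing about how the server treats input that contains several statements.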

Error Sanitization

Database errors are sanitized before being returned to the MCP client. Internal details such as server hostnames, file paths, and stack traces are not exposed. Connection failures report only the target host/port and error type.
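
As a rough illustration of what sanitizing a database error can involve, the sketch below drops Java stack-trace lines and redacts file paths and hostnames with regular expressions. The patterns, the placeholder strings, and the `sanitize_error` name are all assumptions made for this example; the server's real rules may differ.

```python
import re

# Illustrative patterns only; real rules are likely more thorough.
_STACK_LINE = re.compile(r"^\s*at\s+\S+\(.*\)\s*$", re.MULTILINE)
_PATH = re.compile(r"(?:/[\w.\-]+){2,}")
_HOST = re.compile(r"\b(?:[\w\-]+\.)+(?:com|net|org|internal|local)\b(?::\d+)?")


def sanitize_error(message: str) -> str:
    """Drop stack-trace lines, redact paths and hostnames, keep the first line."""
    cleaned = _STACK_LINE.sub("", message)
    cleaned = _PATH.sub("<path>", cleaned)
    cleaned = _HOST.sub("<host>", cleaned)
    lines = [line.strip() for line in cleaned.splitlines() if line.strip()]
    return lines[0] if lines else "Unknown error"


raw = (
    "AnalysisException: Table not found: foo\n"
    "\tat org.apache.spark.sql.Foo.bar(Foo.scala:12)\n"
    "\tat java.lang.Thread.run(Thread.java:748)"
)
print(sanitize_error(raw))
```

This sketch covers database errors only. Connection failures, which deliberately report the target host and port, would need a separate path.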

Credential Handling

  • Passwords are never included in log output or error messages
  • The SparkConfig object masks passwords in its string representation
  • SPARK_PASSWORD is marked as a secret in the MCP registry schema
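
One common way to keep a password out of logs is to override the config object's string representation. The sketch below shows the idea with a stand-in class: `MaskedConfig` and its fields are hypothetical and are not the actual `SparkConfig` definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(repr=False)
class MaskedConfig:  # hypothetical stand-in, not the package's SparkConfig
    host: str
    port: int = 10000
    username: Optional[str] = None
    password: Optional[str] = None

    def __repr__(self) -> str:
        # Show a fixed placeholder instead of the real password.
        shown = "***" if self.password else None
        return (
            f"MaskedConfig(host={self.host!r}, port={self.port!r}, "
            f"username={self.username!r}, password={shown!r})"
        )


cfg = MaskedConfig(host="localhost", username="analyst", password="hunter2")
print(cfg)
```

Because `str()` falls back to `__repr__` when no `__str__` is defined, both f-strings and `%s`-style logging of the object show the placeholder instead of the secret.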

Known Limitations

  • No TLS/SSL support — Thrift connections are unencrypted. For production use with LDAP auth, use an SSH tunnel to protect credentials in transit.
  • No query timeout — Long-running queries are not automatically cancelled. Rely on Spark cluster-level timeout configuration.
  • No per-user access control — All queries execute with the privileges of the configured Spark user. Use HiveServer2 authorization (Ranger, Sentry) to restrict access at the database level.
  • Auth mode defaults to NONE — Appropriate for local development but not for production. Set SPARK_AUTH to LDAP or KERBEROS for authenticated environments.

License

MIT

