Discovery Engine

Coding & Debugging

by leap-laboratories

Not another analyzer that writes pandas or SQL for you: it automatically discovers complex patterns, non-linear thresholds, and key subgroups in your data, and validates them.

What is Discovery Engine

Rather than an analyst that writes pandas or SQL on your behalf, Discovery Engine automatically searches tabular data for complex patterns, non-linear thresholds, and key subgroups, then statistically validates what it finds.

Core features (10 tools)

discovery_list_plans

List available Discovery Engine plans with pricing. No authentication required. Returns all available subscription tiers with credit allowances and pricing. Use this to help users choose a plan.

discovery_estimate

Estimate cost, time, and credit requirements before running an analysis. Returns the credit cost, estimated duration (low/median/high), whether you have sufficient credits, and whether a free public alternative exists. Always call this before discovery_analyze for private runs.

Args:
  • file_size_mb: Size of the dataset in megabytes.
  • num_columns: Number of columns in the dataset.
  • num_rows: Number of rows (optional, improves the time estimate).
  • depth_iterations: Search depth (1=fast, higher=deeper). Default 1.
  • visibility: "public" (free, results published) or "private" (costs credits).
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_analyze

Run Discovery Engine on tabular data to find novel, statistically validated patterns. This is NOT another data analyst — it's a discovery pipeline that systematically searches for feature interactions, subgroup effects, and conditional relationships nobody thought to look for, then validates each on hold-out data with FDR-corrected p-values and checks novelty against the academic literature.

This is a long-running operation (3–15 minutes). It returns a run_id immediately; use discovery_status to poll and discovery_get_results to fetch completed results. Use this when you need to go beyond answering questions about data and start finding things nobody thought to ask. Do NOT use it for summary statistics, visualization, or SQL queries. Public runs are free but results are published; private runs cost credits, so call discovery_estimate first to check the cost.

Args:
  • file_path: Path to the dataset file (CSV, TSV, Excel, JSON, Parquet, ARFF, Feather).
  • target_column: The column to analyze — what drives it, beyond what's obvious.
  • depth_iterations: Search depth (1=fast, higher=deeper). Default 1.
  • visibility: "public" (free) or "private" (costs credits). Default "public".
  • title: Optional title for the analysis.
  • description: Optional description of the dataset.
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_status

Check the status of a Discovery Engine run. Returns the current status (pending, processing, completed, failed) and progress information. Poll this after calling discovery_analyze — runs typically take 3–15 minutes. This is a lightweight status check; use discovery_get_results to fetch the full results once the run has completed.

Args:
  • run_id: The run ID returned by discovery_analyze.
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_get_results

Fetch the full results of a completed Discovery Engine run. Returns discovered patterns (with conditions, p-values, novelty scores, citations), feature importance scores, a summary with key insights, column statistics, a shareable report URL, and suggestions for what to explore next. Only call this after discovery_status returns "completed".

Args:
  • run_id: The run ID returned by discovery_analyze.
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_account

Check your Discovery Engine account status. Returns your current plan, available credits (subscription + purchased), and payment method status. Use this to verify you have sufficient credits before running a private analysis.

Args:
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_signup

Create a Discovery Engine account and get an API key. Zero-touch signup: provide an email address and get back a ready-to-use disco_ API key. The free tier (10 credits/month, unlimited public runs) is active immediately. No authentication required. Returns 409 if the email is already registered.

Args:
  • email: Email address for the new account.
  • name: Display name (optional — defaults to the email local part).

discovery_add_payment_method

Attach a Stripe payment method to your Discovery Engine account. The payment method must be tokenized via Stripe's API first — card details never touch Discovery Engine's servers. Required before purchasing credits or subscribing to a paid plan. To tokenize a card, call Stripe's API directly: POST https://api.stripe.com/v1/payment_methods with the stripe_publishable_key from your account info.

Args:
  • payment_method_id: Stripe payment method ID (pm_...) from Stripe's API.
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_purchase_credits

Purchase Discovery Engine credit packs using a stored payment method. Credits cost $1.00 each and are sold in packs of 20 ($20/pack). Credits are used for private analyses (public analyses are free). Requires a payment method on file — use discovery_add_payment_method first.

Args:
  • packs: Number of 20-credit packs to purchase. Default 1.
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

discovery_subscribe

Subscribe to or change your Discovery Engine plan. Available plans:

  • "free_tier": Explorer — free, 10 credits/month
  • "tier_1": Researcher — $49/month, 50 credits/month
  • "tier_2": Team — $199/month, 200 credits/month

Paid plans require a payment method on file. Credits roll over on paid plans.

Args:
  • plan: Plan tier ID ("free_tier", "tier_1", or "tier_2").
  • api_key: Discovery Engine API key (disco_...). Optional if the DISCOVERY_API_KEY env var is set.

README

Disco

Find novel, statistically validated patterns in tabular data — feature interactions, subgroup effects, and conditional relationships that correlation analysis and LLMs miss.

PyPI · License: MIT

Made by Leap Laboratories.


What it actually does

Most data analysis starts with a question. Disco starts with the data.

Without biases or assumptions, it searches for combinations of feature conditions that significantly shift your target column — things like "patients aged 45–65 with low HDL and high CRP have 3× the readmission rate" — without you needing to hypothesise that interaction first.

Each pattern is:

  • Validated on a hold-out set — increasing the chance it generalises
  • FDR-corrected — p-values included, adjusted for multiple testing
  • Checked against academic literature — to help you understand what you've found and identify whether it is novel

The output is structured: conditions, effect sizes, p-values, citations, and a novelty classification for every pattern found.

Use it when you're asking things like: "Which variables are most important with respect to X?", "Are there patterns we're missing?", "I don't know where to start with this data", or "I need to understand how A and B affect C".

Not for: summary statistics, visualisation, filtering, or SQL queries — use pandas for those.


Quickstart

bash
pip install discovery-engine-api

Get an API key:

bash
# Step 1: request verification code (no password, no card)
curl -X POST https://disco.leap-labs.com/api/signup \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com"}'

# Step 2: submit code from email → get key
curl -X POST https://disco.leap-labs.com/api/signup/verify \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "code": "123456"}'
# → {"key": "disco_...", "credits": 10, "tier": "free_tier"}

Or create a key at disco.leap-labs.com/docs.

Run your first analysis:

python
from discovery import Engine

engine = Engine(api_key="disco_...")
result = await engine.discover(
    file="data.csv",
    target_column="outcome",
)

for pattern in result.patterns:
    if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
        print(f"{pattern.description} (p={pattern.p_value:.4f})")

print(f"Explore: {result.report_url}")

Runs take 3–15 minutes. discover() polls automatically and logs progress — queue position, estimated wait, current pipeline step, and ETA. For background runs, see Running asynchronously.

Full Python SDK reference · Example notebook


What you get back

Each Pattern in result.patterns looks like this (real output from a crop yield dataset):

python
Pattern(
    description="When humidity is between 72–89% AND wind speed is below 12 km/h, "
                "crop yield increases by 34% above the dataset average",
    conditions=[
        {"type": "continuous", "feature": "humidity_pct",
         "min_value": 72.0, "max_value": 89.0},
        {"type": "continuous", "feature": "wind_speed_kmh",
         "min_value": 0.0, "max_value": 12.0},
    ],
    p_value=0.003,              # FDR-corrected
    novelty_type="novel",
    novelty_explanation="Published studies examine humidity and wind speed as independent "
                        "predictors, but this interaction effect — where low wind amplifies "
                        "the benefit of high humidity within a specific range — has not been "
                        "reported in the literature.",
    citations=[
        {"title": "Effects of relative humidity on cereal crop productivity",
         "authors": ["Zhang, L.", "Wang, H."], "year": "2021",
         "journal": "Journal of Agricultural Science"},
    ],
    target_change_direction="max",
    abs_target_change=0.34,     # 34% increase
    support_count=847,          # rows matching this pattern
    support_percentage=16.9,
)

Key things to notice:

  • Patterns are combinations of conditions — humidity AND wind speed together, not just "more humidity is better"
  • Specific thresholds — 72–89%, not a vague correlation
  • Novel vs confirmatory — every pattern is classified; confirmatory ones validate known science, novel ones are what you came for
  • Citations — shows what IS known, so you can see what's genuinely new
  • report_url links to an interactive web report with all patterns visualised
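The fields above are enough to triage results in code. Here is a minimal sketch using the documented Pattern attributes (`p_value`, `novelty_type`, `support_percentage`, `abs_target_change`); the dataclass is a stand-in for the SDK object, and `strongest_novel` is a hypothetical helper, not part of the SDK:

```python
from dataclasses import dataclass

# Stand-in mirroring the documented Pattern fields used below;
# the objects in result.patterns expose the same attributes.
@dataclass
class Pattern:
    description: str
    p_value: float             # FDR-corrected
    novelty_type: str          # "novel" vs confirmatory
    support_percentage: float  # % of rows matching the conditions
    abs_target_change: float   # effect size

def strongest_novel(patterns, alpha=0.05, min_support_pct=5.0):
    """Keep novel, significant, well-supported patterns, strongest first."""
    keep = [
        p for p in patterns
        if p.novelty_type == "novel"
        and p.p_value < alpha
        and p.support_percentage >= min_support_pct
    ]
    return sorted(keep, key=lambda p: p.abs_target_change, reverse=True)
```

Tightening `min_support_pct` trades breadth for robustness: a pattern supported by 1% of rows may be significant yet too narrow to act on.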

The result.summary gives an LLM-generated narrative overview:

python
result.summary.overview
# "Disco identified 14 statistically significant patterns. 5 are novel.
#  The strongest driver is a previously unreported interaction between humidity
#  and wind speed at specific thresholds."

result.summary.key_insights
# ["Humidity × low wind speed at 72–89% humidity produces a 34% yield increase — novel.",
#  "Soil nitrogen above 45 mg/kg shows diminishing returns when phosphorus is below 12 mg/kg.",
#  ...]

How it works

Disco is a pipeline, not prompt engineering over data. It:

  1. Trains machine learning models on a subset of your data
  2. Uses interpretability techniques to extract learned patterns
  3. Validates every pattern on the held-out data with FDR correction (Benjamini-Hochberg)
  4. Checks surviving patterns against academic literature via semantic search

You cannot replicate this by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.
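The multiple-testing correction in step 3 is the standard Benjamini-Hochberg step-up procedure. As a sketch of the textbook procedure itself (not Disco's internal code): sort the m p-values ascending, find the largest rank k with p_(k) ≤ (k/m)·α, and reject the hypotheses at ranks 1..k.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR control: return a reject flag per p-value."""
    m = len(p_values)
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value clears its step-up threshold (k/m)*alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject everything at or below that rank (step-up rule).
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject
```

Note the step-up behaviour: a p-value that misses its own threshold can still be rejected if a larger p-value further down the sorted list clears its threshold.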


Preparing your data

Before running, exclude columns that would produce meaningless findings. Disco finds statistically real patterns — but if the input includes columns that are definitionally related to the target, the patterns will be tautological.

Exclude:

  1. Identifiers — row IDs, UUIDs, patient IDs, sample codes
  2. Data leakage — the target renamed or reformatted (e.g., diagnosis_text when the target is diagnosis_code)
  3. Tautological columns — alternative encodings of the same construct as the target. If target is serious, then serious_outcome, not_serious, death are all part of the same classification. If target is profit, then revenue and cost together compose it. If target is a survey index, the sub-items are tautological.

Full guidance with examples: SKILL.md
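The leakage check in item 2 can be partly automated before you build `excluded_columns`. A crude heuristic sketch, assuming pandas; `leakage_suspects` is a hypothetical helper, not part of the SDK. It flags columns whose values map 1:1 onto the target — which also trips row IDs, but those belong in the exclusion list anyway:

```python
import pandas as pd

def leakage_suspects(df: pd.DataFrame, target: str) -> list:
    """Flag columns where every value determines a single target value —
    a sign the column is the target renamed, recoded, or an identifier."""
    suspects = []
    for col in df.columns.drop(target):
        sub = df[[col, target]].dropna()
        # Max target-values-per-group == 1 means the column fully
        # determines the target.
        if len(sub) and sub.groupby(col)[target].nunique().max() == 1:
            suspects.append(col)
    return suspects
```

This is a heuristic, not a guarantee: it misses noisy leaks (e.g. a column that is the target plus measurement error) and compositional leaks like revenue and cost jointly determining profit, so still review the column list by hand.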


Parameters

python
await engine.discover(
    file="data.csv",           # path, Path, or pd.DataFrame
    target_column="outcome",   # column to predict/explain
    analysis_depth=2,          # 2=default, higher=deeper (max: num_columns − 2)
    visibility="public",       # "public" (free) or "private" (costs credits)
    column_descriptions={      # improves pattern explanations and literature context
        "bmi": "Body mass index",
        "hdl": "HDL cholesterol in mg/dL",
    },
    excluded_columns=["id", "timestamp"],  # see "Preparing your data" above
    title="My dataset",
    description="...", # improves pattern explanations and literature context
)

Public runs are free but results are published. Set visibility="private" for private data — this costs credits.


Running asynchronously

Runs take 3–15 minutes. For agent workflows or scripts that do other work in parallel:

python
# Submit without waiting
run = await engine.run_async(file="data.csv", target_column="outcome", wait=False)
print(f"Submitted {run.run_id}, continuing...")

# ... do other things ...

result = await engine.wait_for_completion(run.run_id, timeout=1800)

For synchronous scripts and Jupyter notebooks:

python
result = engine.run(file="data.csv", target_column="outcome", wait=True)
# or: pip install discovery-engine-api[jupyter] for notebook compatibility

MCP server

Disco is available as an MCP server — no local install required.

json
{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}

Tools: discovery_list_plans, discovery_estimate, discovery_analyze, discovery_status, discovery_get_results, plus account management tools.

Full agent skill file · MCP docs


Pricing

| Item | Cost |
|------|------|
| Public runs | Free — results and data are published |
| Private runs | 1 credit per MB |
| Free tier | 10 credits/month, no card required |
| Researcher | $49/month — 50 credits |
| Team | $199/month — 200 credits |
| Extra credits | $1 per credit |

Estimate before running:

python
estimate = await engine.estimate(file_size_mb=10.5, num_columns=25, analysis_depth=2, visibility="private")
# estimate["cost"]["credits"] → 21
# estimate["cost"]["free_alternative"] → True
# estimate["account"]["sufficient"] → True/False

Expected data format

Disco expects a flat table — columns for features, rows for samples.

code
| patient_id | age | bmi  | smoker | outcome |
|------------|-----|------|--------|---------|
| 001        | 52  | 28.3 | yes    | 1       |
| 002        | 34  | 22.1 | no     | 0       |
| ...        | ... | ...  | ...    | ...     |
  • One row per observation — a patient, a sample, a transaction, a measurement, etc.
  • One column per feature — numeric, categorical, datetime, or free text are all fine
  • One target column — the outcome you want to understand. Must have at least 2 distinct values.
  • Missing values are OK — Disco handles them automatically. Don't drop rows or impute beforehand.
  • No pivoting needed — if your data is already in a flat table, it's ready to go

Supported formats: CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather. Max 5 GB.

Not supported: images, raw text documents, nested/hierarchical JSON, multi-sheet Excel (use the first sheet or export to CSV)
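For the unsupported cases, a short pandas preprocessing sketch (the record fields here are illustrative): flatten nested JSON with `json_normalize`, or export a single Excel sheet, then upload the resulting CSV.

```python
import pandas as pd

# Nested/hierarchical JSON -> flat table, one row per record.
records = [
    {"id": 1, "patient": {"age": 52, "bmi": 28.3}, "outcome": 1},
    {"id": 2, "patient": {"age": 34, "bmi": 22.1}, "outcome": 0},
]
flat = pd.json_normalize(records, sep="_")  # nested keys become patient_age, patient_bmi
flat.to_csv("data.csv", index=False)        # ready for Disco

# Multi-sheet Excel: pick one sheet, then export it.
# df = pd.read_excel("workbook.xlsx", sheet_name=0)
# df.to_csv("data.csv", index=False)
```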




FAQ

What is Discovery Engine?

Not an analyzer that writes pandas or SQL for you: it automatically discovers complex patterns, non-linear thresholds, and key subgroups in your data, and validates them.

What tools does Discovery Engine provide?

It provides 10 tools, including discovery_list_plans, discovery_estimate, and discovery_analyze.
