MLJAR Studio - Desktop App for AI Data Analysis

Which AI model is best for real data analysis?

Comparison prepared by MLJAR Team · Date: 14.04.2026 · Repository: github.com/pplonski/ai-for-data-analysis

This AI data analyst benchmark evaluates leading LLMs for data analysis on real Python tasks using our desktop application, MLJAR Studio.

In this project, we created practical pipelines across multiple domains, including exploratory data analysis, time series, machine learning, finance, statistics, and NLP. Each scenario simulates how a data analyst works with data step by step.

Every pipeline is a sequence of prompts. We send a prompt to the model, wait for its response, and then continue with the next step. The full interaction is saved as a Python notebook (.ipynb) artifact, letting you review complete conversations and see how each model approaches the same task.
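
A minimal sketch of this loop, assuming a generic chat-completion client (the client.chat call and helper names below are illustrative, not MLJAR Studio's actual internals):

import nbformat

def run_pipeline(client, model, prompts, out_path):
    # hypothetical driver: send each prompt in order and keep the full history
    nb = nbformat.v4.new_notebook()
    history = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        reply = client.chat(model=model, messages=history)  # hypothetical client API
        history.append({"role": "assistant", "content": reply})
        # record both sides of the turn so the conversation can be reviewed later
        nb.cells.append(nbformat.v4.new_markdown_cell(f"**Prompt:** {prompt}"))
        nb.cells.append(nbformat.v4.new_markdown_cell(reply))
    nbformat.write(nb, out_path)  # save the artifact as a .ipynb file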

All pipelines, run artifacts, and results are publicly available here and on GitHub. You can explore them, open full conversation traces, and compare model performance using shared score tables.

Key Findings

  • Best overall: gpt-oss:120b (9.87/10)
  • Most consistent: gpt-oss:120b (std 0.45)
  • Weakest on complex workflows: qwen3.5:397b (6.33/10)
  • Overall: all modern LLMs perform well on structured data analysis tasks.

Example prompt sequence

This is the list of prompts used in this example pipeline. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load iris dataset from scikit-learn and show basic statistics

  2. create a feature pairplot colored by species

  3. which features best separate the three species?
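
For context, this is roughly the kind of code a model is expected to produce for the first two prompts (a sketch only; each model writes its own version):

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# prompt 1: load iris and show basic statistics
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
print(df.describe())

# prompt 2: feature pairplot colored by species
sns.pairplot(df.drop(columns="target"), hue="species")
plt.show()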

The same system prompt was used for the AI Data Analyst across all tested LLMs. We also allowed up to three follow-up prompts for deeper insights, and each model decided how many follow-ups it needed. You can learn more about system prompts in MLJAR Studio here.

System prompt used for AI Data Analyst
You are an AI Data Analyst for Python notebooks.

Goal:
Guide the user in an iterative loop:
1) propose one next step,
2) provide one runnable code block for that step,
3) analyze latest available outputs/context,
4) if outputs are missing for your step, ask to run the code,
5) propose the next single step.

Rules:
- Do not provide a full end-to-end pipeline at once unless user explicitly asks.
- Default to one chart OR one table per step.
- Keep each step small, clear, and beginner-friendly.
- Use simple variable names and short comments.
- Base recommendations on observed notebook outputs, not assumptions.
- If required context is missing, ask a short clarifying question before writing code.

Visualization policy:
- Default chart library: seaborn (with matplotlib).
- Use Altair only when interactive visualization is explicitly requested or clearly beneficial (for example: tooltips, brush/select, linked filtering).
- If the user explicitly asks for Altair, use Altair.
- For Altair charts, enforce full-width notebook layout with good height by default: .properties(width='container', height=360) and prefer height in 320-480 range.
- Keep one chart per step unless user asks for more.
- Add clear titles and axis labels.
- If the user says only 'plot', 'chart', or 'visualize' without interactivity requirements, use seaborn/matplotlib by default.

Style:
- Write naturally, like a real data scientist collaborating with the user.
- Do not use rigid templates or section headers like 'Step objective', 'What to expect', or 'Next action'.
- Keep responses concise and conversational.
- When code is needed, include one runnable Python code block.
- If fresh execution outputs are already present in the provided notebook context, analyze them directly and provide concrete insights.
- Treat provided notebook state/outputs as the source of truth for the current turn.
- Never ask the user to 'share output' or 'share what you see'.
- Never end with coaching phrases like 'run this and share...'.
- If no fresh output is available, provide the next code step and stop there without asking for sharing.
- Prefer: concise insight + next step code (when needed), without instructional boilerplate.

How scoring works

We evaluate each run in a simple and transparent way. Our goal is to measure how well different LLMs perform on real scenarios using our AI Data Analyst in MLJAR Studio.

Each run is graded across five dimensions, with every dimension focusing on a different aspect of analysis quality:

  • Task Completion (0-2): Measures how fully the workflow answers the requested task and whether all required steps are completed.
  • Execution Correctness (0-2): Evaluates whether the generated code is correct, consistent, and likely to run without major fixes.
  • Output Quality (0-3): Assesses how well tables, charts, and results match expected analytical outcomes.
  • Reasoning Quality (0-2): Looks at the clarity and correctness of explanations, interpretations, and conclusions.
  • Reliability (0-1): Measures robustness, including low hallucination risk and stable behavior across the workflow.

The final score is the sum of all dimensions, ranging from 0 to 10. Higher scores indicate that the run is more complete, more accurate, and more reliable for a given scenario.

To ensure consistency, scoring is performed automatically using GPT-5.4-mini. Each notebook run is evaluated three times, and we report the median score to reduce variance and improve reliability.
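
As a small sketch of that aggregation (illustrative helper names, not the published evaluation harness):

from statistics import median

def total_score(dims):
    # dims example: {"task_completion": 2, "execution_correctness": 2,
    #                "output_quality": 3, "reasoning_quality": 2, "reliability": 1}
    return float(sum(dims.values()))  # 0-10 total

def final_score(three_runs):
    # three_runs holds the rubric scores from three independent evaluator passes;
    # the median reduces variance from any single noisy grading
    return median(total_score(run) for run in three_runs)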

Scoring prompt used
You are an expert evaluator of AI-generated data analysis workflows.

Your task is to evaluate how well a large language model (LLM) completed a data analysis task in Python.

You must score the workflow using a strict rubric and provide concise, evidence-based explanations.

Do NOT be lenient. Do NOT guess. Base your evaluation ONLY on the provided content.

---

## SCORING RUBRIC (0-10 total)

You must score each dimension:

1. Task Completion (0-2)
- 0 = failed or did not attempt core task
- 1 = partially completed
- 2 = fully completed all major steps

2. Execution Correctness (0-2)
- 0 = code is broken or contains major errors
- 1 = partially correct, requires fixes
- 2 = correct and likely runnable

3. Output Quality (0-3)
- 0 = missing or incorrect outputs
- 1 = weak or partially correct
- 2 = mostly correct
- 3 = fully matches expected outcomes semantically (exact syntax/format not required)

4. Reasoning Quality (0-2)
- 0 = incorrect or misleading reasoning
- 1 = partially correct or shallow
- 2 = clear, correct, and helpful

5. Reliability / Robustness (0-1)
- 0 = fragile, hallucinated, or unsafe
- 1 = reasonably robust and consistent

---

## IMPORTANT RULES

- Use ONLY the provided notebook content and expected outcomes.
- Do NOT assume missing steps were done.
- Do NOT reward verbosity.
- Penalize hallucinated functions, missing steps, or incorrect logic.
- If uncertain, choose the LOWER score.
- Be strict and consistent.
- Prefer semantic equivalence over literal string matching.
- Treat equivalent representations as correct and do NOT penalize presentation-only differences, including:
  - df.describe() vs df.describe().T
  - chart style/theme/color differences
  - minor wording differences in summaries
  - equivalent function choices producing the same analytical result
- Penalize only when required information is missing, incorrect, or contradictory.

---

## OUTPUT FORMAT (STRICT JSON)

Return ONLY valid JSON:

{
  "task_completion": {
    "score": <int>,
    "explanation": "<short explanation>"
  },
  "execution_correctness": {
    "score": <int>,
    "explanation": "<short explanation>"
  },
  "output_quality": {
    "score": <int>,
    "explanation": "<short explanation>"
  },
  "reasoning_quality": {
    "score": <int>,
    "explanation": "<short explanation>"
  },
  "reliability": {
    "score": <int>,
    "explanation": "<short explanation>"
  },
  "total_score": <float>
}

- total_score must be the sum of all scores
- explanations must be concise (1-2 sentences)
- no extra text outside JSON
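
On our side, a minimal sketch of consuming this strict-JSON response (variable and function names are illustrative, not the benchmark's actual harness):

import json

DIMENSIONS = ["task_completion", "execution_correctness",
              "output_quality", "reasoning_quality", "reliability"]

def parse_evaluation(raw):
    result = json.loads(raw)  # fails loudly if the model added text outside JSON
    computed = sum(result[d]["score"] for d in DIMENSIONS)
    # sanity check: total_score must equal the sum of the five dimension scores
    if computed != result["total_score"]:
        raise ValueError(f"total_score mismatch: {result['total_score']} vs {computed}")
    return result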

Global model comparison

The table below summarizes all scored model runs across published benchmark scenarios; the detailed table that follows breaks scores down per scenario. This makes it easy to compare GPT vs Qwen vs GLM on the same task definitions.

Model              Average score   Scored scenarios
gpt-oss:120b       9.87/10         23
gpt-5.4            9.65/10         23
glm-5.1            9.48/10         23
gemma4:31b         9.30/10         23
qwen3-coder-next   9.04/10         23
qwen3.5:397b       8.43/10         23

Detailed Scenario Comparison Table

This table compares model scores for each scenario. Open any score chip to jump directly to the selected model conversation and review full prompts, code, outputs, and score cards.

Why some models perform better for LLM data analysis

Score differences are usually driven by execution discipline, not only raw model capability. Stronger runs tend to keep consistent step-by-step structure, while weaker runs break down under longer multi-step workflows. For example, in air-passengers-forecast (gpt-5.4), stronger runs keep trend decomposition, forecast outputs, and interpretation aligned across turns; weaker runs often skip validation steps, for example air-passengers-forecast (qwen3-coder-next). In sentiment-analysis-python (gpt-oss:120b), better runs keep polarity scoring and conclusions consistent, while weaker runs can produce contradictory examples or shallow reasoning, for example sentiment-analysis-python (qwen3.5:397b).

Strong models

  • better code execution planning across multiple turns
  • fewer hallucinations and fewer unsupported assumptions
  • more consistent notebook structure and output quality

Weaker models

  • struggle with longer multi-step workflows
  • produce inconsistent outputs between notebook steps
  • show more fragile reasoning and weaker analytical conclusions

Where models fail in AI data analysis pipelines

Benchmark value comes from both successes and failures. We publish failure patterns to make model behavior transparent and help teams choose safer workflow setups.

Wrong statistical conclusions

Models sometimes produce plausible but incorrect interpretations, for example overconfident conclusions from weak evidence or missing checks. Example: risk-metrics-var (qwen3.5:397b).

Inconsistent outputs

In weaker runs, charts, tables, and narrative can drift out of sync across steps, reducing reliability for decision-making workflows. Example: sentiment-analysis-python (qwen3.5:397b).

Publishing these failure modes builds trust, improves reproducibility, and helps practitioners understand where guardrails are needed before using AI in production analysis.

What This AI Data Analyst Benchmark Shows

We evaluated multiple LLM models on the same step-by-step data analysis workflows using a shared scoring rubric. This allows for a fair, side-by-side comparison of how models perform in realistic analytical scenarios.

Across different domains, most models produce strong notebook outputs, with high task completion rates and useful analytical reasoning. The results confirm that modern LLMs can effectively support end-to-end data analysis workflows when guided with well-structured prompts.

You can use these examples as a practical reference for:

  • designing effective prompts,
  • selecting the right model for your use case,
  • understanding what high-quality analytical workflows look like.

These benchmarks are especially helpful before running similar analyses on your own data with the AI Data Analyst in MLJAR Studio, particularly if you are choosing the best AI for Python data analysis in production.

Start using AI for Data Analysis

MLJAR Studio helps you analyze data with AI, run machine learning workflows, and build reproducible notebook-based results on your own computer.

Runs locally • Supports local LLMs