Theme

Most AI code demos dazzle—then crash the moment real data hits. In this hackathon, you’ll stress-test Small Language Models where it matters: generating executable Python with Polars under Dockerized, GPU-backed constraints, balancing correctness, speed, and efficiency at scale. Build prompts, tune runtimes, break assumptions, and surface what “works” in production. Are you ready to engineer the benchmark that exposes tomorrow’s truly reliable models?

Detailed Schedule

Judging Criteria

The evaluation is designed specifically for this hackathon and reflects what matters most for the task: building the best text-to-Polars query generation model.

Submissions are evaluated primarily on their ability to generate correct Polars queries from natural language prompts. The most important factor is therefore the number of correct answers produced on the evaluation set.

Leaderboard scoring (Global submissions)

The public leaderboard ranks teams by their best completed Global submission. Scores come from the official benchmark endpoint (POST /submit_final), and higher is better.

Score = N / (T * VRAM^0.1 * RAM^0.01)

Where:

N = number of correct answers (exact matches)
T = total generation time across the evaluation set
VRAM = GPU memory usage
RAM = system memory usage

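The exponents encode the order of importance: N and T enter the score linearly, while VRAM (exponent 0.1) and RAM (exponent 0.01) are heavily damped: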

Correctness (maximize N) — most important
Generation speed (minimize T)
VRAM efficiency (minimize VRAM)
RAM efficiency (minimize RAM; least important)
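
To get a feel for how strongly each term moves the score, the formula can be evaluated directly. This is a minimal sketch: the function name `final_score` and the input values are illustrative assumptions, and the units the benchmark uses for T, VRAM, and RAM are not stated here, so only the relative changes are meaningful.

```python
def final_score(n_correct: int, total_time: float, vram: float, ram: float) -> float:
    """Score = N / (T * VRAM^0.1 * RAM^0.01), as given in the formula above."""
    if total_time <= 0 or vram <= 0 or ram <= 0:
        raise ValueError("T, VRAM and RAM must be positive")
    return n_correct / (total_time * vram ** 0.1 * ram ** 0.01)


# Illustrative inputs only; units are an assumption, not taken from the benchmark.
baseline = final_score(n_correct=80, total_time=100.0, vram=8.0, ram=16.0)

print(f"baseline:      {baseline:.4f}")
print(f"half the time: {final_score(80, 50.0, 8.0, 16.0):.4f}")   # 2x baseline
print(f"half the VRAM: {final_score(80, 100.0, 4.0, 16.0):.4f}")  # about 7% higher
print(f"half the RAM:  {final_score(80, 100.0, 8.0, 8.0):.4f}")   # about 0.7% higher
```

Halving T doubles the score, while halving VRAM or RAM changes it by single-digit percentages or less, which is why correctness and speed should be settled before memory tuning.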

All teams are evaluated using the same set of benchmark questions and consistent evaluation conditions so results are directly comparable.

Tie-breakers (if scores are very close)

If two projects receive very similar scores, judges may use submission quality as a tie-breaker (reproducibility, clarity, documentation), with correctness remaining the primary objective.

Test submissions vs Global submissions (important distinction)

The portal supports two submission modes:

Test submissions (`test`)

Use these to iterate quickly and debug. Test runs provide development feedback to help you improve correctness and performance, but may not match final leaderboard rankings.

Global submissions (`global`)

These are the submissions that count for the leaderboard and produce a Final Score using the formula above.

Best practices to score well

1) Maximize correctness (N)

Start by getting correct outputs reliably on test runs before optimizing performance.
Add regression tests and handle edge cases (nulls, empty frames, dtypes, parsing).
Prefer deterministic logic and explicit error handling over “best-effort” fallbacks.

2) Reduce total time (T)

Avoid Python loops over rows; prefer Polars expressions / vectorized operations.
Reduce unnecessary joins/sorts; drop unused columns early.
Cache repeated sub-computations where safe; avoid recomputing expensive intermediates.

3) Reduce VRAM and RAM

Keep models small and efficient where possible (quantization / smaller checkpoints if it doesn’t hurt correctness).
Avoid loading multiple model copies.
Free large intermediates and minimize peak allocations during generation and execution.
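
One simple way to avoid loading multiple model copies is to put the load behind a cached accessor. This is a sketch with a stand-in class: `DummyModel`, `MODEL_NAME`, and `get_model` are hypothetical names, and a real loader would call whatever inference library you use.

```python
from functools import lru_cache

MODEL_NAME = "my-small-model"  # hypothetical identifier


class DummyModel:
    """Stand-in for a real model; a real load is slow and memory-heavy."""

    load_count = 0

    def __init__(self, name: str) -> None:
        DummyModel.load_count += 1
        self.name = name


@lru_cache(maxsize=1)
def get_model() -> DummyModel:
    """Load the model on first use and return the same object afterwards."""
    return DummyModel(MODEL_NAME)


first = get_model()
second = get_model()
print(first is second, DummyModel.load_count)
```

Every caller goes through `get_model()`, so the weights are held in memory once regardless of how many modules need them.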

4) Make your submission reproducible

Provide a clear README with exact install and run steps.
Pin dependencies (lockfiles / fixed versions).
Ensure it runs non-interactively (no prompts), with sensible defaults and clear logs.
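
A non-interactive entrypoint with sensible defaults and clear logs might look like the sketch below. The flag names `--data-dir` and `--log-level` and the logger name are assumptions for illustration; the actual entrypoint and data location are defined by the template/runner contract.

```python
import argparse
import logging
import sys


def parse_args(argv=None) -> argparse.Namespace:
    """Every option has a default, so the program never needs to ask anything."""
    parser = argparse.ArgumentParser(description="Run the text-to-Polars pipeline.")
    parser.add_argument("--data-dir", default="data", help="hypothetical data location")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return parser.parse_args(argv)


def main(argv=None) -> int:
    args = parse_args(argv)
    logging.basicConfig(
        level=args.log_level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        stream=sys.stderr,
    )
    log = logging.getLogger("runner")
    log.info("starting with data_dir=%s", args.data_dir)
    try:
        pass  # load the model, generate, execute, write outputs
    except Exception:
        log.exception("run failed")  # full traceback in the logs
        return 1
    log.info("finished")
    return 0
```

In a script you would end with `sys.exit(main())` so that a failure produces a non-zero exit code the runner can detect.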

Submission checklist

Repository URL is correct and accessible to the runner
Your project runs end-to-end without manual steps
A Test submission completes successfully
A Global submission completes and shows a non-null Final Score
README includes setup/run instructions and key assumptions

Prizes

Handbook

Welcome to the Polars Bench hackathon. Your goal is to build a text → Polars system: given a natural-language prompt, your model should generate correct Polars queries / code that produces the right answer reliably and efficiently.

Key links (start here)

Evaluation platform: https://polarsbench.net
Leaderboard: https://polarsbench.net/leaderboard

You will submit your project on AIT as well (see “Final submission (AIT)” below).

Template repo (recommended starting point)

Use the provided GitHub template to get the correct structure and runner contract from day one:

Template: https://github.com/nisseya/submission_example

Recommended approach:

Click "Use this template" to create your own repository (clean history), or fork it.
Keep the entrypoint and required files consistent with the template, then swap in your model + logic.

Dataset access (important)

You do not download the evaluation dataset.

For official runs, the dataset is provided inside the evaluation runner (mounted/available in the runtime). This is done specifically to:

prevent dataset leakage, and
keep evaluation conditions consistent across teams.

Your code should read data from the runner-provided location(s) as defined by the template/runner contract.

What you’re building (hackathon theme)

Build a model + runtime that can:

read a natural language analytics request,
generate Polars operations (expressions / DataFrame code),
run them correctly on the provided data,
return valid outputs consistently (no crashes, no malformed responses).
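
The four steps above can be sketched as a single function that never crashes and always returns a well-formed response. Everything here is an assumption for illustration: the `answer` function, the convention that generated code assigns to a variable called `result`, and the response shape are not taken from the template. Executing model output with `exec` is only reasonable inside an isolated runner.

```python
from typing import Any, Callable, Dict


def answer(
    prompt: str,
    generate: Callable[[str], str],
    namespace: Dict[str, Any],
) -> Dict[str, Any]:
    """Generate code for a prompt, run it, and always return a structured response."""
    code = generate(prompt)  # the model call; any callable returning source code
    scope = dict(namespace)  # copy so one request cannot pollute the next
    try:
        exec(code, scope)  # assumes a sandboxed runtime
        if "result" not in scope:
            raise KeyError("generated code did not set `result`")
        return {"ok": True, "code": code, "result": scope["result"]}
    except Exception as exc:
        # Report the failure instead of crashing the whole run.
        return {"ok": False, "code": code, "error": f"{type(exc).__name__}: {exc}"}


# Stand-in generator and data; in practice the namespace would hold `pl`
# and the DataFrame(s) provided by the runner.
good = answer("total of values", lambda p: "result = sum(values)", {"values": [1, 2, 3]})
bad = answer("broken request", lambda p: "result = 1 / 0", {})
print(good["ok"], good["result"])
print(bad["ok"], bad["error"])
```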

How evaluation works on polarsbench.net

The platform supports two submission modes:

1) Test submissions (`test`)

Use these during development to iterate quickly and debug. Test runs provide development feedback (logs, outputs, and other signals) to help you improve correctness and performance, but they may not match final leaderboard rankings.

2) Global submissions (`global`)

These count for the public leaderboard. The leaderboard ranks teams by their best completed Global submission (status done) and uses the Final Score returned by the official benchmark endpoint (POST /submit_final).

Final Score = N / (T * VRAM^0.1 * RAM^0.01)

Where:

N = number of correct answers (exact matches)
T = total generation time across the evaluation set
VRAM = GPU memory usage
RAM = system memory usage

Higher is better. Only your best global score per team is shown on the leaderboard.
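
The ranking rule described in this section (best completed Global run per team, higher first) can be expressed in a few lines. The record fields `team`, `mode`, `status`, and `final_score` are a hypothetical shape chosen for the example, not the platform's actual schema.

```python
from typing import Any, Dict, List, Tuple


def leaderboard(runs: List[Dict[str, Any]]) -> List[Tuple[str, float]]:
    """Best completed global score per team, highest first."""
    best: Dict[str, float] = {}
    for run in runs:
        if run["mode"] != "global":
            continue  # test runs never count
        if run["status"] != "done" or run["final_score"] is None:
            continue  # unfinished or failed runs have no score to rank
        team, score = run["team"], run["final_score"]
        if team not in best or score > best[team]:
            best[team] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


runs = [
    {"team": "alpha", "mode": "global", "status": "done", "final_score": 0.61},
    {"team": "alpha", "mode": "global", "status": "done", "final_score": 0.74},
    {"team": "alpha", "mode": "test", "status": "done", "final_score": 0.99},
    {"team": "beta", "mode": "global", "status": "failed", "final_score": None},
    {"team": "beta", "mode": "global", "status": "done", "final_score": 0.80},
    {"team": "gamma", "mode": "global", "status": "running", "final_score": None},
]
print(leaderboard(runs))
```

A team with no completed global run (gamma in this made-up data) does not appear at all, which matches the troubleshooting questions later in this handbook.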

How to use the evaluation platform (recommended workflow)

Create / join a team on polarsbench.net.
Start from the template repo and implement your model.
Submit a test run early:
- confirm the runner can build your repo
- confirm outputs are valid
Iterate until you’re stable and correct.
Submit a global run when ready for an official ranking.
Keep iterating: your leaderboard entry updates when you achieve a better global Final Score.

Repo requirements (what the runner expects)

Your submission is the URL of a public GitHub repository that is:

Reproducible: pinned dependencies (lockfiles recommended)
Self-contained: includes all code/config required to run
Non-interactive: no prompts; fully automated execution
Runner-friendly:
- clear entrypoint
- predictable install/build steps
- sensible logging (enough to debug failures)

Best practices:

Pin versions (Python/Node deps, model revision, etc.).
Avoid downloading large artifacts at runtime unless required (and cache when possible).
Make your runtime deterministic where possible (seed, fixed decoding, stable formatting).
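
Determinism has two cheap parts that do not depend on the model library: seeding and stable output formatting. The helpers below are a sketch with hypothetical names. Fixed decoding (for example greedy decoding instead of sampling) is configured in whatever generation library you use and is not shown.

```python
import json
import random


def seed_everything(seed: int = 0) -> None:
    """Seed the standard-library RNG; seed numpy/torch here too if you use them."""
    random.seed(seed)


def format_answer(value) -> str:
    """Serialize an answer the same way every time (key order, spacing)."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


seed_everything(123)
first_draw = random.random()
seed_everything(123)
second_draw = random.random()

print(first_draw == second_draw)
print(format_answer({"b": 1, "a": [2, 3]}))
```

Because correctness is judged on exact matches, stable formatting matters as much as seeding: two runs that compute the same value should also emit the same characters.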

Best practices to score well (and demo well)

1) Correctness first (maximize N)

Ensure outputs follow the expected format every time.
Handle edge cases: nulls, empty results, dtype pitfalls, parsing issues.
Add quick regression tests for prompts you commonly fail.
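
A quick regression harness for prompts you commonly fail can be very small. In this sketch `run_regressions` and the example cases are hypothetical; `answer_fn` stands for whatever function maps a prompt to your final output.

```python
from typing import Any, Callable, List, Tuple


def run_regressions(
    answer_fn: Callable[[str], Any],
    cases: List[Tuple[str, Any]],
) -> List[Tuple[str, str]]:
    """Return (prompt, reason) for every case that fails or raises."""
    failures = []
    for prompt, expected in cases:
        try:
            got = answer_fn(prompt)
        except Exception as exc:
            failures.append((prompt, f"raised {type(exc).__name__}: {exc}"))
            continue
        if got != expected:  # exact match, mirroring how N is counted
            failures.append((prompt, f"expected {expected!r}, got {got!r}"))
    return failures


# Stand-in for a real pipeline, with one wrong answer and one crash.
def fake_answer(prompt: str) -> Any:
    table = {"count rows": 4, "max price": 9.5}
    return table[prompt]


cases = [("count rows", 4), ("max price", 10.0), ("mean units", 4.25)]
for prompt, reason in run_regressions(fake_answer, cases):
    print(f"FAIL {prompt}: {reason}")
```

Add a case each time a test run exposes a new failure, and rerun the list before every global submission.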

2) Then optimize latency (minimize T)

Keep generation short and structured.
Reduce overhead: model warm-up, repeated loads, unnecessary preprocessing.
Cache safely where it doesn’t change correctness.

3) Control memory (VRAM / RAM)

Prefer smaller / quantized models if they preserve accuracy.
Avoid loading duplicate model copies.
Watch peak allocations during generation and execution.

4) Make it easy to demo

Provide a one-command run path.
Add a short “Demo” section in your README with setup + example prompts + expected outputs.
Keep logs readable: show key timings + failure reasons.
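
For readable logs with key timings and failure reasons, a small context manager is often enough. This is a sketch: the names `timed` and `timings` and the step labels are illustrative assumptions.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator

log = logging.getLogger("demo")


@contextmanager
def timed(step: str, timings: Dict[str, float]) -> Iterator[None]:
    """Log how long a step took, or why it failed, then re-raise any error."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        elapsed = time.perf_counter() - start
        log.error("%s failed after %.3fs: %s: %s", step, elapsed, type(exc).__name__, exc)
        raise
    else:
        elapsed = time.perf_counter() - start
        timings[step] = elapsed
        log.info("%s took %.3fs", step, elapsed)


timings: Dict[str, float] = {}
with timed("generate", timings):
    sum(range(1000))  # placeholder for the model call
print(sorted(timings))
```

Collecting the per-step timings in one dictionary also makes it easy to print a summary at the end of a demo run.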

Final submission (AIT)

In addition to polarsbench.net runs, you must submit on the AIT platform.

Include in your AIT submission:

your team name
your public GitHub repo URL
your polarsbench.net team page URL
(recommended) your best global run / leaderboard proof (score + timestamp)
short notes on:
- model choice + why
- key optimizations
- reproducibility instructions (how to run)

Troubleshooting checklist

If a run fails or you don’t appear on the leaderboard:

Did you submit as global (not test)?
Did the run reach status = done?
Does your run have a non-null Final Score?
Is your repo public and buildable from scratch?
Are you printing / returning outputs in the expected format?

If you’re stuck, do a test run first, fix build/runtime/output issues, then go global.

Hackathon: Benchmarking Small Language Models in the Real World

Detailed Schedule

☕ Morning Kickoff

🛠️ Deep Work + Demos

🏁 Final Stretch

🏆 Judging & Awards

🌐 Wrap-Up
