feat(gsm8k): add dataset and 8-shot base-model task #1

Open
jack-scitix-ai wants to merge 1 commit into scitix:main from jack-scitix-ai:feat/gsm8k

jack-scitix-ai commented Jun 11, 2026

Type

  • feature — new benchmark, task, or capability

Summary

Test Plan

Automated

  • Lint/format clean (ruff check && ruff format --check)
  • Type check clean (ty check or mypy --strict)

Manual

  • Dataset loading tested (pdm run sieval dataset download gsm8k --data-dir /tmp/sieval-gsm8k-download-test) succeeds
  • Ran gsm8k_8shot_base_gen on the full GSM8K test split with Qwen2.5-72B Base
  • Run completed 1319/1319 samples with 0 failures
  • Actual score: 86.35 exact match; reference score: 88.30; diff: -1.95

| Model | Expected | Actual | Diff |
| --- | --- | --- | --- |
| Qwen2.5-72B Base | 88.30 | 86.35 | -1.95 (-2.21%) |
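
For reviewers who want to re-check the Diff column, the arithmetic can be sketched as below. This is a minimal illustration, not code from this PR: `score_diff` is a hypothetical helper name, and it assumes the relative diff is taken against the reference (expected) score.

```python
def score_diff(expected: float, actual: float) -> tuple[float, float]:
    """Return (absolute diff, relative diff in percent) of actual vs. expected.

    Hypothetical helper for illustration; assumes the relative diff is
    computed against the expected (reference) score.
    """
    abs_diff = actual - expected
    rel_diff = abs_diff / expected * 100.0
    return round(abs_diff, 2), round(rel_diff, 2)


# Scores reported in the table above for Qwen2.5-72B Base.
abs_diff, rel_diff = score_diff(expected=88.30, actual=86.35)
print(f"{abs_diff:+.2f} ({rel_diff:+.2f}%)")  # prints "-1.95 (-2.21%)"
```

Rounding to two decimals matches the precision used in the table.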

Checklist

Required (all PRs)

  • PR title follows conventional format (type(scope): description)
  • No internal paths, credentials, or personal info in committed files
  • AI-generated code has AI-Generated Code - <model> (<provider>) in module docstring
  • No new upper-layer dependencies added to core/
  • Deleted code verified — no remaining call sites depend on it
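
The title rule in the first checklist item can be checked mechanically. The sketch below is only an assumption about what `type(scope): description` means in practice: the regular expression and the `is_conventional_title` name are hypothetical, and the repository may enforce a stricter or different rule (for example, a fixed list of allowed types).

```python
import re

# Hypothetical pattern: lowercase type, parenthesized scope, a colon and a
# single space, then a non-empty description. Not taken from this repository.
TITLE_RE = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[a-z0-9_-]+)\): (?P<desc>\S.*)$")


def is_conventional_title(title: str) -> bool:
    """Return True if the title matches the assumed type(scope): description shape."""
    return TITLE_RE.match(title) is not None


print(is_conventional_title("feat(gsm8k): add dataset and 8-shot base-model task"))
print(is_conventional_title("Add gsm8k dataset"))
```

The first call uses this PR's title; the second shows a title with no type or scope.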

If: New or Modified Benchmark

  • Reference paper/repo linked in Summary
  • Score comparison table included (model, expected, actual, diff)
  • Dataset loading tested (sieval dataset download <name> succeeds)
  • Task registered in package-level __init__.py
