TSSS STEM Benchmark

The TSSS_STEM Benchmark

This benchmark grades a model's ability to solve math and physics problems and to 'reason'. LLMs cannot technically 'reason', so this benchmark uses a high-quality, uncontaminated dataset of problems that are less about pure number crunching and more about reasoning and knowing HOW to do the problem.

---

Note that this dataset was compiled by a single undergraduate student. It is not representative of all STEM use cases or subjects, nor is it trying to be. The leaderboard covers a limited number of models that I feel represent a wide range of sizes and providers.

31 questions total:
8 linear algebra, 11 calculus, 12 physics. All questions are undergraduate level. A problem was added to the dataset when LLMs were found to get it wrong consistently.

Parameters:
Temperature = 0
All other samplers disabled.
No system prefill or instruct formatting.
Zero-shot prompting only.

For local models:
Using kobold.cpp.
Using self-made GGUF quants.

Every model was allowed 3 rerolls per question.
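As a rough illustration of how a setup like this could be graded, here is a minimal Python sketch. It is not the actual harness used for this benchmark. The names `grade_question`, `score_percent` and `ask` are hypothetical, and two things are assumptions: that a question counts as solved if any allowed attempt matches the expected answer, and that the final score is the percentage of the 31 questions solved.

```python
from typing import Callable


def grade_question(ask: Callable[[str], str], question: str,
                   expected: str, max_attempts: int = 3) -> bool:
    """Return True if any attempt matches the expected answer.

    ASSUMPTION: "any attempt correct" is a guess at the reroll rule,
    not something stated by the benchmark. `ask` is a hypothetical
    stand-in for whatever sends a zero-shot prompt to the model.
    """
    for _ in range(max_attempts):
        answer = ask(question)
        if answer.strip().lower() == expected.strip().lower():
            return True
    return False


def score_percent(num_correct: int, num_questions: int = 31) -> float:
    """Percentage of questions solved, rounded to two decimals (assumed formula)."""
    return round(100 * num_correct / num_questions, 2)


if __name__ == "__main__":
    # Fake model that only gets the answer right on its third try.
    replies = iter(["(i)", "(ii)", "(ii), (iii)"])
    solved = grade_question(lambda q: next(replies), "example", "(ii), (iii)")
    print(solved, score_percent(24))
```

Exact string matching is only there to keep the sketch short; free-form math answers would realistically need manual or more lenient grading.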

Example Question:
This question was removed from the dataset so that it could be shown here.
Subject: Linear Algebra

$\text{Which of the following statements are always true for vectors in } \mathbb{R}^3 \text{?}$

$\text{(i) If } u \cdot (v \times w) = 4 \text{ then } w \cdot (u \times v) = -4$
$\text{(ii) } (2u + v) \times (u - 5v) = -11(u \times v)$
$\text{(iii) If } u \text{ is orthogonal to } v \text{ and } w \text{ then } u \text{ is also orthogonal to } ||w||v + ||v||w$

${\color{gray} \textit{Expected solution: (ii), (iii)}}$
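The expected solution can be sanity-checked numerically. The sketch below uses only the Python standard library, and all helper names are made up for illustration. It shows a concrete counterexample for (i), where the scalar triple product is unchanged under cyclic permutation, so the second value is 4 rather than -4. It then tests (ii) and (iii) on random integer vectors. Passing random trials supports the claims but does not prove them. The proofs follow from bilinearity of the cross product and linearity of the dot product.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(c, a):
    return tuple(c * x for x in a)


def norm(a):
    return math.sqrt(dot(a, a))


def statement_i_counterexample():
    """Return (u.(v x w), w.(u x v)) for one fixed choice of vectors."""
    u, v, w = (4, 0, 0), (0, 1, 0), (0, 0, 1)
    return dot(u, cross(v, w)), dot(w, cross(u, v))


def statement_ii_holds(u, v):
    """Check (2u + v) x (u - 5v) == -11 (u x v), exact for integer inputs."""
    lhs = cross(add(scale(2, u), v), add(u, scale(-5, v)))
    return lhs == scale(-11, cross(u, v))


def statement_iii_holds(v, w):
    """Take u = v x w, which is orthogonal to v and w, and test the combination."""
    u = cross(v, w)
    combo = add(scale(norm(w), v), scale(norm(v), w))
    return math.isclose(dot(u, combo), 0.0, abs_tol=1e-9)


def random_vector(rng):
    return tuple(rng.randint(-5, 5) for _ in range(3))


if __name__ == "__main__":
    rng = random.Random(0)
    print(statement_i_counterexample())  # (4, 4), so (i) is false
    print(all(statement_ii_holds(random_vector(rng), random_vector(rng))
              for _ in range(100)))
    print(all(statement_iii_holds(random_vector(rng), random_vector(rng))
              for _ in range(100)))
```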


Name	Score	Provider
Claude 3.5 Haiku	70.96	Anthropic
Claude 3.5 Sonnet	77.42	Anthropic
Gemini Pro 1.5	70.96	Google
Qwen 2.5 72B	70.96	Alibaba
o1-mini	83.87	OpenAI
Ministral 8B	38.71	Mistral
GPT-4o-mini (2024-07-18)	67.74	OpenAI
Nova Pro	72.19	Amazon
Deepseek V3	77.42	Deepseek
Deepseek V2.5	70.96	Deepseek
