
The TSSS_STEM Benchmark

This benchmark grades a model's ability to solve math and physics problems and to 'reason'. Since LLMs cannot technically 'reason', the benchmark uses a high-quality, uncontaminated dataset of problems that are less about pure number crunching and more about knowing HOW to do the problem.

---

Note that this dataset was compiled by a single undergraduate student. It is not representative of all STEM use cases or subjects, nor is it trying to be. The leaderboard covers a limited number of models that I feel represent a wide range of sizes and providers.

31 questions total: 8 linear algebra, 11 calculus, 12 physics.
All questions are undergraduate level. A problem was added to the dataset when LLMs were found to get it wrong consistently.

Parameters:
Temperature = 0
All other samplers disabled.
No system prefill or instruct formatting.
Zero-shot prompting only.

For local models:
Using kobold.cpp.
Using self-made GGUF quants.

All models were allotted 3 rerolls/question.
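
One possible reading of the reroll rule is sketched below. It assumes that "3 rerolls" means up to three attempts in total and that a question counts as solved if any attempt is correct; neither detail is spelled out above. The names `grade_question` and `ask`, and the exact-match comparison, are hypothetical stand-ins, not the actual grading procedure.

```python
from typing import Callable


def grade_question(
    ask: Callable[[str], str],
    prompt: str,
    expected: str,
    max_attempts: int = 3,
) -> bool:
    """Return True if any of up to max_attempts answers matches expected.

    ask is a hypothetical callable that sends the prompt to a model
    (zero-shot, no system prompt) and returns its final answer as text.
    Exact string matching is an assumption made for this sketch only.
    """
    for _ in range(max_attempts):
        answer = ask(prompt)
        if answer.strip() == expected.strip():
            return True  # stop early once one attempt is correct
    return False
```

Under this reading, a model that misses twice and succeeds on its third attempt still gets credit for the question.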

Example Question:
This question was removed from the dataset so it could be shown here.
Subject: Linear Algebra


$\text{Which of the following statements are always true for vectors in } \mathbb{R}^3 \text{?}$

$\text{(i) If } u \cdot (v \times w) = 4 \text{ then } w \cdot (u \times v) = -4$
$\text{(ii) } (2u + v) \times (u - 5v) = -11(u \times v)$
$\text{(iii) If } u \text{ is orthogonal to } v \text{ and } w \text{ then } u \text{ is also orthogonal to } ||w||v + ||v||w$


${\color{gray} \textit{Expected solution: (ii), (iii)}}$
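
As a sanity check on the expected solution, the three statements can be tested on concrete vectors. The sketch below uses plain Python with small helper functions (`dot`, `cross`, and so on are written here for illustration and are not part of the benchmark). A single numeric example can only disprove a statement such as (i); for (ii) and (iii) it merely agrees with the algebra, where (ii) follows from $u \times u = v \times v = 0$ and $v \times u = -(u \times v)$, and (iii) follows from linearity of the dot product.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(c, a):
    return tuple(c * x for x in a)


def norm(a):
    return math.sqrt(dot(a, a))


def check_statements(u, v, w):
    """Evaluate statements (i)-(iii) on one concrete choice of vectors."""
    # (i) claims the triple product changes sign under a cyclic shift.
    triple = dot(u, cross(v, w))
    shifted = dot(w, cross(u, v))
    holds_i = triple != 0 and shifted == -triple

    # (ii) (2u + v) x (u - 5v) should equal -11 (u x v).
    lhs = cross(add(scale(2, u), v), add(u, scale(-5, v)))
    holds_ii = lhs == scale(-11, cross(u, v))

    # (iii) needs a vector orthogonal to both v and w; v x w is one.
    u_perp = cross(v, w)
    combo = add(scale(norm(w), v), scale(norm(v), w))
    holds_iii = math.isclose(dot(u_perp, combo), 0.0, abs_tol=1e-9)

    return {"i": holds_i, "ii": holds_ii, "iii": holds_iii}


result = check_statements((1, 2, 3), (0, 1, 4), (5, 6, 0))
```

For these vectors both triple products equal 1, so the sign flip claimed in (i) does not occur, while (ii) and (iii) hold.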

| Name | Score | Provider |
| --- | --- | --- |
| Claude 3.5 Haiku | 70.96 | Anthropic |
| Claude 3.5 Sonnet | 77.42 | Anthropic |
| Gemini Pro 1.5 | 70.96 | Google |
| Qwen 2.5 72B | 70.96 | Alibaba |
| o1-mini | 83.87 | OpenAI |
| Ministral 8B | 38.71 | Mistral |
| GPT-4o-mini (2024-07-18) | 67.74 | OpenAI |
| Nova Pro | 72.19 | Amazon |
| Deepseek V3 | 77.42 | Deepseek |
| Deepseek V2.5 | 70.96 | Deepseek |