Public Problems and Model Solutions

Download our current public problems, their expert solutions, and model results in PDF format:

Download the chain-of-thought report for DeepSeek-R1:

Model performance on the public problems is summarized below:

Color coding in the plot:

  • Green: Correct result in at least 4 of 5 attempts
  • Blue: Correct result in 2-3 of 5 attempts
  • Yellow: Correct result in only 1 of 5 attempts
  • Red: No correct result in any of the 5 attempts

How the TP Bench GPA is calculated:

First, we assign each problem a numerical weight \(w\) from 1 to 5 based on its difficulty. Next, we translate the color coding above into a letter grade and a corresponding numerical grade \(g\) as follows:

(a) Green: letter grade A, assigned a numerical value of \(g=4\).
(b) Blue: letter grade B, assigned \(g=3\).
(c) Yellow: letter grade C, assigned \(g=2\).
(d) Red: letter grade F, assigned \(g=0\).

Finally, we apply the following formula to determine the GPA (out of 4): \[ {\rm GPA}=\frac{\sum_{i=1}^{N}w_{i}\times g_{i}}{\sum_{i=1}^{N}w_{i}} \] where \(N\) is the total number of problems, \(w_{i}\) is the weight of problem \(i\), and \(g_{i}\) is its numerical grade.
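As a concrete illustration, the grading scheme and formula above can be sketched in a few lines of Python. The function names, weights, and correct-attempt counts here are hypothetical, chosen only to demonstrate the weighted average; they are not actual benchmark data.

```python
def letter_grade(correct: int) -> tuple[str, int]:
    """Map the number of correct attempts (out of 5) to a letter grade
    and its numerical value g, following the color coding above."""
    if correct >= 4:
        return "A", 4   # Green: at least 4 of 5 correct
    if correct >= 2:
        return "B", 3   # Blue: 2-3 of 5 correct
    if correct == 1:
        return "C", 2   # Yellow: only 1 of 5 correct
    return "F", 0       # Red: no correct attempts

def tp_bench_gpa(problems: list[tuple[int, int]]) -> float:
    """problems: list of (weight w in 1..5, correct attempts out of 5) pairs.
    Returns the weighted GPA = sum(w_i * g_i) / sum(w_i)."""
    weighted_sum = sum(w * letter_grade(c)[1] for w, c in problems)
    total_weight = sum(w for w, _ in problems)
    return weighted_sum / total_weight

# Example with three hypothetical problems:
# (5*4 + 3*3 + 1*0) / (5 + 3 + 1) = 29/9 ≈ 3.22
print(round(tp_bench_gpa([(5, 5), (3, 2), (1, 0)]), 2))
```

Note that weighting by difficulty means a hard problem graded A pulls the GPA up far more than an easy one, matching the intent of the weight \(w\).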

Looking ahead, we plan to refine our evaluation and introduce additional metrics. One candidate would incorporate confidence and precision to better capture the variability of model responses across multiple attempts at the same problem, penalizing models that answer inconsistently and thereby giving a more nuanced assessment of reliability and robustness. More intriguingly, as we enable models to learn incrementally from their previous attempts rather than treating each attempt as independent, we will need more sophisticated metrics to evaluate their performance accurately.