{"id":2,"date":"2025-02-05T19:41:57","date_gmt":"2025-02-05T19:41:57","guid":{"rendered":"https:\/\/tpbench.org\/?page_id=2"},"modified":"2026-05-02T06:15:46","modified_gmt":"2026-05-02T06:15:46","slug":"sample-page","status":"publish","type":"page","link":"https:\/\/tpbench.org\/?page_id=2","title":{"rendered":"Public Problems and Model Solutions"},"content":{"rendered":"\n<h3 class=\"wp-block-heading has-text-align-center\">Public Problems and Model Solutions<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Download our current public problems, their expert solutions, and model results in PDF format:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/One-Pole-Problem.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 5 &#8211; One-Pole Problem<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/Bias-of-a-Sampled-Halo-Field.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 5 &#8211; Bias of a Sampled Halo Field<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/SHO-Vacuum-Entanglement.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 4 &#8211; SHO Vacuum Entanglement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/SUSY-Symmetry.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 4 &#8211; SUSY-Symmetry<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/Slow-Roll-Inflation.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 3 &#8211; Slow-Roll Inflation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/Scalar-Particle-Scattering.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 3 &#8211; Scalar Particle Scattering<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/Dark-Matter-Capture-as-a-Function-of-Time.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 2 &#8211; Dark Matter Capture as a Function of Time<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/A-3-State-QM-Problem.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 2 &#8211; A 3-state QM Problem<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/Blackbody-in-d-Dimensions.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 1 &#8211; Blackbody in d Dimensions<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/Boosted-Parabolic-Trajectory.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Level 1 &#8211; Boosted Parabolic Trajectory<\/a><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Download the chain-of-thought report for DeepSeek-R1:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/tpbench.org\/wp-content\/uploads\/2025\/02\/DeepSeek-R1-CoT.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">CoT report for DeepSeek-R1<\/a><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The model performance on public problems is as follows:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"787\" height=\"1024\" src=\"https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-787x1024.png\" alt=\"\" class=\"wp-image-499\" srcset=\"https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-787x1024.png 787w, https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-231x300.png 231w, https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-768x999.png 768w, https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-1181x1536.png 1181w, https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-1574x2048.png 1574w, https:\/\/tpbench.org\/wp-content\/uploads\/2026\/05\/website_figure-scaled.png 1968w\" sizes=\"auto, (max-width: 787px) 100vw, 787px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The meaning of colors in the plot:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Green: Get the correct result for at least 4 out of 5 attempts<\/li>\n\n\n\n<li>Blue: Get 2-3 correct results out of 5 attempts<\/li>\n\n\n\n<li>Yellow: Get only 1 correct result out of 5 attempts<\/li>\n\n\n\n<li>Red: Fail to get correct result for all 5 attempts<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The calculation method of TP Bench GPA:<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Firstly, to each problem, we assign a numerical weight \\(w\\) from 1 to 5 based on its difficulty level. Next, we translate the color coding used above to assign a letter and a corresponding numerical grade \\(g\\) as follows:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">(a) Green: Letter grade A which is assigned a numerical value of \\(g=4\\).<br>(b) Blue: Letter grade B and is assigned a value of \\(g=3\\).<br>(c) Yellow: Letter grade C and \\(g=2\\).<br>(d) Red: Letter grade F which is set to \\(g=0\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we apply the following formula to determine GPA (out of 4) \\[ {\\rm GPA}=\\frac{\\sum_{i=1}^{N}w_{i}\\times g_{i}}{\\sum_{i=1}^{N}w_{i}} \\] where \\(N\\) represents the total number of attempts across all problems.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>L<\/strong>ooking ahead, we plan to improve and introduce additional evaluation metrics, one of which could incorporate confidence and precision to better capture the variability of model responses across multiple attempts on the same problem. This metric would impose penalties on models that exhibit high inconsistency in their responses, ensuring a more nuanced assessment of a model&#8217;s reliability and robustness. More intriguingly, as we enable models to learn and improve incrementally from their previous attempts, rather than treating each attempt as independent, we will need to develop more sophisticated metrics to evaluate their performance accurately.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Public Problems and Model Solutions Download our current public problems, their expert solutions, and model results in PDF format: Download the chain-of-thought report for DeepSeek-R1: The model performance on public problems is as follows: The meaning of colors in the plot: The calculation method of TP Bench GPA: Firstly, to each problem, we assign a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":1,"comment_status":"closed","ping_status":"open","template":"","meta":{"footnotes":""},"class_list":["post-2","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/tpbench.org\/index.php?rest_route=\/wp\/v2\/pages\/2","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tpbench.org\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/tpbench.org\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/tpbench.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tpbench.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2"}],"version-history":[{"count":47,"href":"https:\/\/tpbench.org\/index.php?rest_route=\/wp\/v2\/pages\/2\/revisions"}],"predecessor-version":[{"id":503,"href":"https:\/\/tpbench.org\/index.php?rest_route=\/wp\/v2\/pages\/2\/revisions\/503"}],"wp:attachment":[{"href":"https:\/\/tpbench.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}