Advances in Cognitive Computing and Machine Learning


Explore the latest developments in cognitive computing, reinforcement learning, and Chain-of-Thought models. From AlphaGo to Large Language Monkeys, delve into supervised CoT and scaling laws in board games. Discover the evolution of AI reasoning and test-time computation techniques for enhanced problem-solving.




Presentation Transcript


  1. Reasoning models: ChatGPT o1/o3/o4, DeepSeek-R1, Gemini 2.0 Flash Thinking, Claude 3.7 Sonnet (Extended Thinking). [Figure: how these models answer the prompt "1+1="]

  2. Between <think> and </think>, reasoning models plan ("Let's first try to ..."), verify ("Let me check the answer"), and explore ("Let's try a different approach"). Producing this extra reasoning during inference is called Test-Time Compute.

  3. Training time vs. testing time: spending compute at test time is not new. AlphaGo already searched during play. https://www.nature.com/articles/nature16961

  4. Test-Time Scaling: Scaling Scaling Laws with Board Games, https://arxiv.org/abs/2104.03113

  5. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL)

  6. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL)

  7. Chain-of-Thought (CoT). Short CoT: Few-shot CoT (https://arxiv.org/abs/2201.11903) and Zero-shot CoT (https://arxiv.org/abs/2205.11916). Long CoT: https://arxiv.org/abs/2503.09567

  8. Supervised CoT with GPT-4o: https://arxiv.org/abs/2410.14198

  9. Long CoT

  10. (Chain-of-Thought, CoT) (Imitation Learning) (Reinforcement Learning, RL)

  11. Explore: sample several outputs (output 1, output 2, output 3) in parallel from the same input.

  12. Large Language Monkeys https://arxiv.org/abs/2407.21787

  13. Explore, then choose: Majority Vote / Self-consistency (https://arxiv.org/abs/2203.11171) or Confidence, as used in CoT decoding (https://arxiv.org/abs/2402.10200). Wrapping the final answer in <answer></answer> tags makes it easy to extract.
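The "pick the most frequent answer" idea can be sketched in a few lines of Python. The sampled answers are assumed to be already extracted as strings; a real system would parse them out of each chain of thought:

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency: sample many chains of thought, extract each final
    answer, and return the answer that appears most often."""
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer

# Final answers extracted from four sampled chains of thought,
# e.g. from between <answer> and </answer> tags.
samples = ["56088", "56088", "55988", "56088"]
print(majority_vote(samples))  # -> 56088
```

Note that this only needs the final answers to be comparable; the reasoning processes that produced them can differ freely.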

  14. Explore https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

  15. Verification: a Verifier assigns each output a score (e.g. 0.1, 0.9, 0.2), and Best-of-N returns the highest-scoring output. https://arxiv.org/abs/2110.14168
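Best-of-N selection is a one-liner once the verifier exists; `toy_verifier` below is only a stand-in for a learned verifier model:

```python
def best_of_n(candidates, verifier):
    """Best-of-N: score each of the N candidate outputs with a verifier
    and return the highest-scoring one."""
    return max(candidates, key=verifier)

# Stand-in for a learned verifier: here it simply checks whether the
# stated product of 123 x 456 is correct.
def toy_verifier(output):
    return 1.0 if output.endswith("56088") else 0.0

outputs = ["123 x 456 = 56078", "123 x 456 = 56088", "123 x 456 = 55088"]
print(best_of_n(outputs, toy_verifier))  # -> 123 x 456 = 56088
```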

  16. Training the Verifier. Training data: input + ground truth. Each sampled output is labeled 1.0 if its final answer is correct and 0.0 otherwise.

  17. Parallel vs. Sequential. Parallel: generate output 1, output 2, output 3 independently from the input. Sequential: generate output 1, then revise it into output 2, then into output 3.

  18. Parallel vs. Sequential: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, https://arxiv.org/abs/2408.03314. The two can be combined (Parallel + Sequential): sample outputs 1-1, 2-1, 3-1 in parallel, then revise each into 1-2, 2-2, 3-2.
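The combined scheme can be sketched as below; `generate` and `revise` are hypothetical stand-ins for sampling an answer and for asking the model to improve its previous answer:

```python
def parallel_plus_sequential(x, generate, revise, n_parallel=3, n_revise=1):
    """Sample n_parallel initial outputs independently (parallel scaling),
    then revise each one n_revise times (sequential scaling)."""
    results = []
    for i in range(n_parallel):
        output = generate(x, i)
        for _ in range(n_revise):
            output = revise(x, output)
        results.append(output)
    return results

# Toy stand-ins: "generation" tags the sample index, "revision" appends a mark.
gen = lambda x, i: f"{x}:draft{i + 1}"
rev = lambda x, out: out + ":revised"
print(parallel_plus_sequential("q", gen, rev))
# -> ['q:draft1:revised', 'q:draft2:revised', 'q:draft3:revised']
```

The budget split (how many parallel samples vs. how many revisions) is exactly what the cited paper tunes per problem difficulty.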

  19. 123 x 456 = ? Planning and verification can also be applied to each step of a solution, not only to the final answer.

  20. A Process Verifier scores each partial solution (e.g. after step 1). Let's Verify Step by Step: https://arxiv.org/abs/2305.20050

  21. Marking step boundaries with a delimiter such as </step> lets the Process Verifier score each step as soon as it ends. Let's Verify Step by Step: https://arxiv.org/abs/2305.20050
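A minimal sketch of why an explicit delimiter helps: with `</step>` markers (the delimiter name follows the slide; any unique marker works), partial solutions can be split mechanically and handed to the process verifier one step at a time:

```python
def split_steps(response):
    """Split a generated solution at the </step> delimiter, dropping empty
    fragments, so a process verifier can score each step or prefix."""
    return [part.strip() for part in response.split("</step>") if part.strip()]

text = ("123 x 400 = 49200</step>"
        "123 x 56 = 6888</step>"
        "49200 + 6888 = 56088</step>")
for i, step in enumerate(split_steps(text), 1):
    print(i, step)
```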

  22. Labeling steps without human annotation: from a given partial solution, roll out several complete solutions; the fraction whose final answer matches the ground truth (e.g. 2/3 or 1/3) becomes that step's label. Training data: input + ground truth.

  23. The Process Verifier is then trained to predict these rollout-based scores (2/3, 1/3). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, https://arxiv.org/abs/2312.08935
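The rollout-based labeling can be sketched as follows; `rollout` is a stand-in for sampling a completion from the policy model given a partial solution:

```python
def step_value(prefix, rollout, ground_truth, n_rollouts=3):
    """Math-Shepherd-style step label: the fraction of completions sampled
    from this partial solution whose final answer is correct."""
    hits = sum(rollout(prefix) == ground_truth for _ in range(n_rollouts))
    return hits / n_rollouts

# Deterministic stand-in for the policy model: two of the three sampled
# completions reach the right answer, so the step is labeled 2/3.
completions = iter(["56088", "55088", "56088"])
label = step_value("step 1: 123 x 400 = 49200",
                   lambda prefix: next(completions),
                   "56088", n_rollouts=3)
print(label)  # -> 0.666...
```

These (prefix, label) pairs then become the training set for the process verifier, with no human step annotations required.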

  24. Beam Search with a Process Verifier: at each step, generate N candidate next steps (each ending in </step>), score the partial solutions, and keep only the best ones. https://arxiv.org/abs/2305.00633 https://arxiv.org/abs/2401.17686
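Verifier-guided beam search can be sketched on a toy problem: building the string "abc" character by character stands in for generating reasoning steps, and the prefix-matching `score` stands in for a learned process verifier:

```python
def beam_search(x, expand, score, beam_width=2, depth=3):
    """At each depth, extend every kept partial chain with each candidate
    next step, score all extended chains with a process verifier, and keep
    only the top beam_width chains."""
    beams = [[]]
    for _ in range(depth):
        candidates = [chain + [step]
                      for chain in beams
                      for step in expand(x, chain)]
        candidates.sort(key=lambda chain: score(x, chain), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy task: assemble the target string one character per "step".
target = "abc"
expand = lambda x, chain: ["a", "b", "c", "x"]   # candidate next steps
score = lambda x, chain: sum(                    # verifier: matched prefix length
    1 for i, ch in enumerate("".join(chain)) if i < len(x) and ch == x[i])
print(beam_search(target, expand, score))  # -> ['a', 'b', 'c']
```

With beam_width=1 this degenerates to greedy step selection; with beam_width equal to the full branching factor it approaches exhaustive search, so the width controls the compute/quality trade-off.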

  25. https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

  26. Beyond beam search: e.g. Monte Carlo Tree Search (MCTS), a heuristic search algorithm (source of image: Wikipedia). Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning: https://arxiv.org/abs/2405.00451; ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search: https://arxiv.org/abs/2406.03816; Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers: https://arxiv.org/abs/2408.06195

  27. From an LLM without reasoning (Foundation Model) to an LLM with reasoning (Fine-tuned Model) via Post-Training, i.e. learning to reason: Chain-of-Thought (CoT), Imitation Learning, and Reinforcement Learning (RL).

  28. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL)

  29. Imitation learning needs training data of the form Input + reasoning process + ground truth. The open question: where does the reasoning process (???) come from?

  30. One recipe: training data starts as input + ground truth; a model generates a reasoning process + answer (CoT), and a Verifier keeps only the examples whose answer is correct.

  31. rStar-Math, https://arxiv.org/abs/2501.04519: search over reasoning steps (step 1 -> step 2 -> step 3 -> ans); paths that reach the correct answer become training data (input, reasoning process, ans).

  32. rStar-Math, https://arxiv.org/abs/2501.04519: the search tree can contain several paths from the input through step 1, step 2, step 3; each path that reaches the correct answer yields a training example.

  33. ( 9) !

  34. A path can reach the correct answer even though one of its intermediate steps is wrong (!), and still end up as training data (input, reasoning process, ans).

  35. Stream of Search (SoS), https://arxiv.org/abs/2404.03683: serialize the entire search, including failed branches and [Verifier] feedback, into a single training sequence, so the model learns to search and backtrack in text.
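The idea of training on the whole search, dead ends included, can be sketched as a serializer; the event names below are illustrative, not taken from the paper:

```python
def linearize_search(events):
    """Stream-of-Search-style training text: the full search trace,
    including failed branches and verifier feedback, becomes one sequence
    the model is trained to imitate."""
    lines = []
    for kind, text in events:
        lines.append(f"[Verifier: {text}]" if kind == "verifier" else text)
    return "\n".join(lines)

trace = [
    ("step", "try 123 x 456 = 55088"),
    ("verifier", "incorrect, backtrack"),
    ("step", "try 123 x 456 = 56088"),
    ("verifier", "correct"),
    ("answer", "56088"),
]
print(linearize_search(trace))
```

Training on such sequences is plain imitation learning, yet the resulting model has seen how to recover from mistakes, not only polished solutions.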

  36. https://arxiv.org/abs/2410.18982

  37. Knowledge Distillation: a Reasoning Model (teacher) generates a reasoning process + answer for each Input, and the student is fine-tuned on these traces. Sky-T1: https://novasky-ai.github.io/posts/sky-t1/ s1: https://arxiv.org/abs/2501.19393
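Building such a distillation set can be sketched as follows; the teacher and the correctness check are toy stand-ins, not the actual Sky-T1 or s1 pipelines:

```python
def build_distillation_data(inputs, teacher, is_correct):
    """Knowledge distillation: a strong reasoning model (teacher) writes a
    reasoning process + answer for each input; traces with a correct final
    answer become fine-tuning targets for the student."""
    data = []
    for x in inputs:
        trace, answer = teacher(x)
        if is_correct(x, answer):
            data.append({"input": x, "target": f"{trace}\n{answer}"})
    return data

# Toy teacher standing in for a model like DeepSeek-R1 or the s1 teacher.
def toy_teacher(x):
    a, b = map(int, x.split("+"))
    return f"<think>{a} plus {b} gives {a + b}</think>", str(a + b)

check = lambda x, ans: ans == str(sum(map(int, x.split("+"))))
data = build_distillation_data(["1+1", "2+3"], toy_teacher, check)
print(len(data))  # -> 2
```

Filtering on answer correctness is the cheap quality gate: the student never sees traces that ended in a wrong answer.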

  38. Knowledge Distillation into a Foundation Model: https://arxiv.org/abs/2501.12948

  39. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL): DeepSeek-R1

  40. DeepSeek-R1-Zero, https://arxiv.org/abs/2501.12948: Reinforcement Learning (RL) directly on DeepSeek-v3-base (Foundation Model). Training data: input + ground truth only; the model generates a reasoning process + answer, and accuracy is used as the reward.
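The "accuracy as reward" signal is rule-based, not a learned reward model. A minimal sketch, assuming answers are wrapped in <answer> tags as in the R1 template (the paper additionally uses a format reward, which is omitted here):

```python
import re

def accuracy_reward(response, ground_truth):
    """Rule-based reward in the style of DeepSeek-R1-Zero: extract the text
    between <answer> tags and compare it with the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0                      # malformed output earns nothing
    return 1.0 if match.group(1).strip() == ground_truth else 0.0

print(accuracy_reward("<think>...</think><answer>56088</answer>", "56088"))  # -> 1.0
print(accuracy_reward("<answer>55088</answer>", "56088"))                    # -> 0.0
```

Because the reward only looks at the final answer, the model is free to discover whatever reasoning process earns it, which is what makes the long chains of thought emerge.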

  41. Majority vote. Source of image: https://arxiv.org/abs/2501.12948

  42. Aha Moment. Source of image: https://arxiv.org/abs/2501.12948

  43. DeepSeek-R1-Zero's reasoning suffers from poor readability and language mixing, even though the recipe (RL on DeepSeek-v3-base with accuracy as reward) works. https://arxiv.org/abs/2501.12948

  44. From R1-Zero to R1, stage 1: build cold-start data (Input, reasoning process, ground truth) using few-shot prompting with a long CoT as an example and directly prompting models to generate detailed answers with reflection and verification; generated data plus human annotation (thousands of examples). Imitation Learning on DeepSeek-v3-base yields an intermediate model (Model A); RL with accuracy and language coherence as reward yields Model B.

  45. Stage 2: Model B generates reasoning process + answer for about 600k examples, with DeepSeek-v3 acting as verifier (including tasks without standard answers); chains of thought with mixed languages, long paragraphs, and code blocks are filtered out. Imitation Learning on DeepSeek-v3-base yields Model C; RL for safety and helpfulness yields DeepSeek-R1. Based on the DeepSeek-R1 paper, both the process verifier and MCTS were tried but ultimately not used.

  46. The Foundation Model matters: starting from Qwen-32B-Base, compare (a) RL directly on Qwen-32B-Base with (b) Imitation Learning from DeepSeek-R1 outputs (distillation) followed by RL.
