Advances in Cognitive Computing and Machine Learning


Explore the latest developments in cognitive computing, reinforcement learning, and Chain-of-Thought models. From AlphaGo to Large Language Monkeys, delve into supervised CoT and scaling laws in board games. Discover the evolution of AI reasoning and test-time computation techniques for enhanced problem-solving.




Presentation Transcript


  1. Reasoning models: ChatGPT o1/o3/o4, DeepSeek-R1, Gemini 2.0 Flash Thinking, Claude 3.7 Sonnet (Extended Thinking). [Figure: how these models answer the prompt "1+1="]

  2. Between <think> and </think>, reasoning models plan ("Let's first try to ..."), verify ("Let me check the answer"), and explore ("Let's try a different approach"). Producing this extra reasoning during inference is called Test-Time Compute.

  3. Training time vs. testing time: spending compute at test time is not new. AlphaGo already searched during play. https://www.nature.com/articles/nature16961

  4. Test-Time Scaling: Scaling Scaling Laws with Board Games, https://arxiv.org/abs/2104.03113

  5. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL)

  6. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL)

  7. Chain-of-Thought (CoT). Short CoT: Few-shot CoT (https://arxiv.org/abs/2201.11903) and Zero-shot CoT (https://arxiv.org/abs/2205.11916). Long CoT: https://arxiv.org/abs/2503.09567

  8. Supervised CoT with GPT-4o: https://arxiv.org/abs/2410.14198

  9. Long CoT

  10. (Chain-of-Thought, CoT) (Imitation Learning) (Reinforcement Learning, RL)

  11. Explore: sample several outputs (output 1, output 2, output 3) in parallel from the same input.

  12. Large Language Monkeys https://arxiv.org/abs/2407.21787

  13. Explore, then choose: Majority Vote / Self-consistency (https://arxiv.org/abs/2203.11171) or Confidence, as used in CoT decoding (https://arxiv.org/abs/2402.10200). Wrapping the final answer in <answer></answer> tags makes it easy to extract.
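The "pick the most frequent answer" idea can be sketched in a few lines of Python. The sampled answers are assumed to be already extracted as strings; a real system would parse them out of each chain of thought:

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency: sample many chains of thought, extract each final
    answer, and return the answer that appears most often."""
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer

# Final answers extracted from four sampled chains of thought,
# e.g. from between <answer> and </answer> tags.
samples = ["56088", "56088", "55988", "56088"]
print(majority_vote(samples))  # -> 56088
```

Note that this only needs the final answers to be comparable; the reasoning processes that produced them can differ freely.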

  14. Explore https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

  15. Verification: a Verifier assigns each output a score (e.g. 0.1, 0.9, 0.2), and Best-of-N returns the highest-scoring output. https://arxiv.org/abs/2110.14168
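Best-of-N selection is a one-liner once the verifier exists; `toy_verifier` below is only a stand-in for a learned verifier model:

```python
def best_of_n(candidates, verifier):
    """Best-of-N: score each of the N candidate outputs with a verifier
    and return the highest-scoring one."""
    return max(candidates, key=verifier)

# Stand-in for a learned verifier: here it simply checks whether the
# stated product of 123 x 456 is correct.
def toy_verifier(output):
    return 1.0 if output.endswith("56088") else 0.0

outputs = ["123 x 456 = 56078", "123 x 456 = 56088", "123 x 456 = 55088"]
print(best_of_n(outputs, toy_verifier))  # -> 123 x 456 = 56088
```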

  16. Training the Verifier. Training data: input + ground truth. Each sampled output is labeled 1.0 if its final answer is correct and 0.0 otherwise.

  17. Parallel vs. Sequential. Parallel: generate output 1, output 2, output 3 independently from the input. Sequential: generate output 1, then revise it into output 2, then into output 3.

  18. Parallel vs. Sequential: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, https://arxiv.org/abs/2408.03314. The two can be combined (Parallel + Sequential): sample outputs 1-1, 2-1, 3-1 in parallel, then revise each into 1-2, 2-2, 3-2.
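The combined scheme can be sketched as below; `generate` and `revise` are hypothetical stand-ins for sampling an answer and for asking the model to improve its previous answer:

```python
def parallel_plus_sequential(x, generate, revise, n_parallel=3, n_revise=1):
    """Sample n_parallel initial outputs independently (parallel scaling),
    then revise each one n_revise times (sequential scaling)."""
    results = []
    for i in range(n_parallel):
        output = generate(x, i)
        for _ in range(n_revise):
            output = revise(x, output)
        results.append(output)
    return results

# Toy stand-ins: "generation" tags the sample index, "revision" appends a mark.
gen = lambda x, i: f"{x}:draft{i + 1}"
rev = lambda x, out: out + ":revised"
print(parallel_plus_sequential("q", gen, rev))
# -> ['q:draft1:revised', 'q:draft2:revised', 'q:draft3:revised']
```

The budget split (how many parallel samples vs. how many revisions) is exactly what the cited paper tunes per problem difficulty.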

  19. 123 x 456 = ? Planning and verification can also be applied to each step of a solution, not only to the final answer.

  20. A Process Verifier scores each partial solution (e.g. after step 1). Let's Verify Step by Step: https://arxiv.org/abs/2305.20050

  21. Marking step boundaries with a delimiter such as </step> lets the Process Verifier score each step as soon as it ends. Let's Verify Step by Step: https://arxiv.org/abs/2305.20050
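A minimal sketch of why an explicit delimiter helps: with `</step>` markers (the delimiter name follows the slide; any unique marker works), partial solutions can be split mechanically and handed to the process verifier one step at a time:

```python
def split_steps(response):
    """Split a generated solution at the </step> delimiter, dropping empty
    fragments, so a process verifier can score each step or prefix."""
    return [part.strip() for part in response.split("</step>") if part.strip()]

text = ("123 x 400 = 49200</step>"
        "123 x 56 = 6888</step>"
        "49200 + 6888 = 56088</step>")
for i, step in enumerate(split_steps(text), 1):
    print(i, step)
```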

  22. Labeling steps without human annotation: from a given partial solution, roll out several complete solutions; the fraction whose final answer matches the ground truth (e.g. 2/3 or 1/3) becomes that step's label. Training data: input + ground truth.

  23. The Process Verifier is then trained to predict these rollout-based scores (2/3, 1/3). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations, https://arxiv.org/abs/2312.08935
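The rollout-based labeling can be sketched as follows; `rollout` is a stand-in for sampling a completion from the policy model given a partial solution:

```python
def step_value(prefix, rollout, ground_truth, n_rollouts=3):
    """Math-Shepherd-style step label: the fraction of completions sampled
    from this partial solution whose final answer is correct."""
    hits = sum(rollout(prefix) == ground_truth for _ in range(n_rollouts))
    return hits / n_rollouts

# Deterministic stand-in for the policy model: two of the three sampled
# completions reach the right answer, so the step is labeled 2/3.
completions = iter(["56088", "55088", "56088"])
label = step_value("step 1: 123 x 400 = 49200",
                   lambda prefix: next(completions),
                   "56088", n_rollouts=3)
print(label)  # -> 0.666...
```

These (prefix, label) pairs then become the training set for the process verifier, with no human step annotations required.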

  24. Beam Search with a Process Verifier: at each step, generate N candidate next steps (each ending in </step>), score the partial solutions, and keep only the best ones. https://arxiv.org/abs/2305.00633 https://arxiv.org/abs/2401.17686
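Verifier-guided beam search can be sketched on a toy problem: building the string "abc" character by character stands in for generating reasoning steps, and the prefix-matching `score` stands in for a learned process verifier:

```python
def beam_search(x, expand, score, beam_width=2, depth=3):
    """At each depth, extend every kept partial chain with each candidate
    next step, score all extended chains with a process verifier, and keep
    only the top beam_width chains."""
    beams = [[]]
    for _ in range(depth):
        candidates = [chain + [step]
                      for chain in beams
                      for step in expand(x, chain)]
        candidates.sort(key=lambda chain: score(x, chain), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy task: assemble the target string one character per "step".
target = "abc"
expand = lambda x, chain: ["a", "b", "c", "x"]   # candidate next steps
score = lambda x, chain: sum(                    # verifier: matched prefix length
    1 for i, ch in enumerate("".join(chain)) if i < len(x) and ch == x[i])
print(beam_search(target, expand, score))  # -> ['a', 'b', 'c']
```

With beam_width=1 this degenerates to greedy step selection; with beam_width equal to the full branching factor it approaches exhaustive search, so the width controls the compute/quality trade-off.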

  25. https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute

  26. Beyond beam search: e.g. Monte Carlo Tree Search (MCTS), a heuristic search algorithm (source of image: Wikipedia). Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning: https://arxiv.org/abs/2405.00451; ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search: https://arxiv.org/abs/2406.03816; Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers: https://arxiv.org/abs/2408.06195

  27. From an LLM without reasoning (Foundation Model) to an LLM with reasoning (Fine-tuned Model) via Post-Training, i.e. learning to reason: Chain-of-Thought (CoT), Imitation Learning, and Reinforcement Learning (RL).

  28. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL)

  29. Imitation learning needs training data of the form Input + reasoning process + ground truth. The open question: where does the reasoning process (???) come from?

  30. One recipe: training data starts as input + ground truth; a model generates a reasoning process + answer (CoT), and a Verifier keeps only the examples whose answer is correct.

  31. rStar-Math, https://arxiv.org/abs/2501.04519: search over reasoning steps (step 1 -> step 2 -> step 3 -> ans); paths that reach the correct answer become training data (input, reasoning process, ans).

  32. rStar-Math, https://arxiv.org/abs/2501.04519: the search tree can contain several paths from the input through step 1, step 2, step 3; each path that reaches the correct answer yields a training example.

  33. ( 9) !

  34. A path can reach the correct answer even though one of its intermediate steps is wrong (!), and still end up as training data (input, reasoning process, ans).

  35. Stream of Search (SoS), https://arxiv.org/abs/2404.03683: serialize the entire search, including failed branches and [Verifier] feedback, into a single training sequence, so the model learns to search and backtrack in text.
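The idea of training on the whole search, dead ends included, can be sketched as a serializer; the event names below are illustrative, not taken from the paper:

```python
def linearize_search(events):
    """Stream-of-Search-style training text: the full search trace,
    including failed branches and verifier feedback, becomes one sequence
    the model is trained to imitate."""
    lines = []
    for kind, text in events:
        lines.append(f"[Verifier: {text}]" if kind == "verifier" else text)
    return "\n".join(lines)

trace = [
    ("step", "try 123 x 456 = 55088"),
    ("verifier", "incorrect, backtrack"),
    ("step", "try 123 x 456 = 56088"),
    ("verifier", "correct"),
    ("answer", "56088"),
]
print(linearize_search(trace))
```

Training on such sequences is plain imitation learning, yet the resulting model has seen how to recover from mistakes, not only polished solutions.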

  36. https://arxiv.org/abs/2410.18982

  37. Knowledge Distillation: a Reasoning Model (teacher) generates a reasoning process + answer for each Input, and the student is fine-tuned on these traces. Sky-T1: https://novasky-ai.github.io/posts/sky-t1/ s1: https://arxiv.org/abs/2501.19393
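Building such a distillation set can be sketched as follows; the teacher and the correctness check are toy stand-ins, not the actual Sky-T1 or s1 pipelines:

```python
def build_distillation_data(inputs, teacher, is_correct):
    """Knowledge distillation: a strong reasoning model (teacher) writes a
    reasoning process + answer for each input; traces with a correct final
    answer become fine-tuning targets for the student."""
    data = []
    for x in inputs:
        trace, answer = teacher(x)
        if is_correct(x, answer):
            data.append({"input": x, "target": f"{trace}\n{answer}"})
    return data

# Toy teacher standing in for a model like DeepSeek-R1 or the s1 teacher.
def toy_teacher(x):
    a, b = map(int, x.split("+"))
    return f"<think>{a} plus {b} gives {a + b}</think>", str(a + b)

check = lambda x, ans: ans == str(sum(map(int, x.split("+"))))
data = build_distillation_data(["1+1", "2+3"], toy_teacher, check)
print(len(data))  # -> 2
```

Filtering on answer correctness is the cheap quality gate: the student never sees traces that ended in a wrong answer.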

  38. Knowledge Distillation into a Foundation Model: https://arxiv.org/abs/2501.12948

  39. Outline: Chain-of-Thought (CoT) / Imitation Learning / Reinforcement Learning (RL): DeepSeek-R1

  40. DeepSeek-R1-Zero, https://arxiv.org/abs/2501.12948: Reinforcement Learning (RL) directly on DeepSeek-v3-base (Foundation Model). Training data: input + ground truth only; the model generates a reasoning process + answer, and accuracy is used as the reward.
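The "accuracy as reward" signal is rule-based, not a learned reward model. A minimal sketch, assuming answers are wrapped in <answer> tags as in the R1 template (the paper additionally uses a format reward, which is omitted here):

```python
import re

def accuracy_reward(response, ground_truth):
    """Rule-based reward in the style of DeepSeek-R1-Zero: extract the text
    between <answer> tags and compare it with the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0                      # malformed output earns nothing
    return 1.0 if match.group(1).strip() == ground_truth else 0.0

print(accuracy_reward("<think>...</think><answer>56088</answer>", "56088"))  # -> 1.0
print(accuracy_reward("<answer>55088</answer>", "56088"))                    # -> 0.0
```

Because the reward only looks at the final answer, the model is free to discover whatever reasoning process earns it, which is what makes the long chains of thought emerge.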

  41. Majority vote. Source of image: https://arxiv.org/abs/2501.12948

  42. Aha Moment. Source of image: https://arxiv.org/abs/2501.12948

  43. DeepSeek-R1-Zero's reasoning suffers from poor readability and language mixing, even though the recipe (RL on DeepSeek-v3-base with accuracy as reward) works. https://arxiv.org/abs/2501.12948

  44. From R1-Zero to R1, stage 1: build cold-start data (Input, reasoning process, ground truth) using few-shot prompting with a long CoT as an example and directly prompting models to generate detailed answers with reflection and verification; generated data plus human annotation (thousands of examples). Imitation Learning on DeepSeek-v3-base yields an intermediate model (Model A); RL with accuracy and language coherence as reward yields Model B.

  45. Stage 2: Model B generates reasoning process + answer for about 600k examples, with DeepSeek-v3 acting as verifier (including tasks without standard answers); chains of thought with mixed languages, long paragraphs, and code blocks are filtered out. Imitation Learning on DeepSeek-v3-base yields Model C; RL for safety and helpfulness yields DeepSeek-R1. Based on the DeepSeek-R1 paper, both the process verifier and MCTS were tried but ultimately not used.

  46. The Foundation Model matters: starting from Qwen-32B-Base, compare (a) RL directly on Qwen-32B-Base with (b) Imitation Learning from DeepSeek-R1 outputs (distillation) followed by RL.
