How DeepSeek R1 Works - A Very Simple Explanation of Algorithms (video, 9m)

DeepSeek R1, a new large-language model, has taken the tech world by storm and represents a significant breakthrough in the AI research community. Last Sunday, during the 12-hour TikTok ban, a research team from China unveiled this model, showing that it performs on par with OpenAI's O1 model in areas like math, coding, and scientific reasoning. In this video, Alex discusses the three key takeaways from the paper: the use of Chain of Thought prompting to let the model self-evaluate its answers, pure reinforcement learning that lets the model guide its own training, and model distillation, which makes DeepSeek accessible to a much wider audience.

The Chain of Thought technique is a straightforward yet effective prompt engineering method that encourages the model to think aloud while providing step-by-step reasoning. This approach helps in identifying mistakes if the model wanders off track, enabling easy adjustments to the prompts to prevent repeat errors. For instance, when given a math problem, the model demonstrates its reasoning and outlines the process behind reaching an answer, leading to more accurate results compared to providing answers without the underlying thought process. Overall, introducing reflective reasoning increases the precision of the answers derived from the model.
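To make this concrete, here is a minimal sketch of Chain of Thought prompting against a locally hosted model. It assumes you have Ollama running on its default port with a DeepSeek R1 model pulled; the model tag deepseek-r1:7b, the example question, and the exact prompt wording are my assumptions for illustration, not something taken from the video or the paper.

```python
# Minimal Chain of Thought prompting sketch (illustrative assumptions throughout).
# Assumes a local Ollama server (http://localhost:11434) with a DeepSeek R1
# model pulled, e.g. via `ollama pull deepseek-r1:7b`.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's generate endpoint

def ask(prompt: str, model: str = "deepseek-r1:7b") -> str:
    """Send a single prompt to the local model and return its full response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

# Plain prompt: just ask for the answer.
plain = ask(question)

# Chain of Thought prompt: explicitly ask the model to think out loud,
# step by step, before committing to a final answer.
cot = ask(
    "Solve the following problem. Think out loud and explain your reasoning "
    "step by step, then state the final answer on its own line.\n\n" + question
)

print("--- Plain answer ---\n", plain)
print("--- Chain of Thought answer ---\n", cot)
```

Depending on the build, the model's visible reasoning may arrive wrapped in think-style tags before the final answer, which is exactly the "showing its work" behavior the paragraph above describes.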

DeepSeek's approach to reinforcement learning deviates from typical AI training methods. Rather than being trained on labeled question-and-answer pairs, the model learns on its own, much like a child learns to walk through trial and error. By exploring and optimizing its behavior (its policy) to maximize a reward signal, it achieves better results over time. As DeepSeek trains, it mimics the learning process of a child discovering which methods lead to faster and more effective solutions.
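To make the trial-and-error idea concrete, here is a deliberately tiny, self-contained toy sketch, not DeepSeek's actual training loop: the "policy" is just the probability of picking the shorter of two solution strategies, and it gets nudged toward whichever strategy earned more than the average reward.

```python
# Toy illustration of reward-driven policy learning (NOT DeepSeek's training code).
# The "policy" is a single probability of choosing the shorter of two ways to
# solve a problem; the shorter way pays a higher reward, so over many trials
# the policy drifts toward it -- trial and error, like a child learning to walk.
import random

REWARDS = {"short_method": 1.0, "long_method": 0.3}  # shorter solution earns more

def run_episode(p_short: float) -> tuple[str, float]:
    """Sample an action from the current policy and observe its reward."""
    action = "short_method" if random.random() < p_short else "long_method"
    return action, REWARDS[action]

p_short = 0.5          # start with no preference
learning_rate = 0.02
baseline = 0.0         # running average reward, used as a simple baseline

for step in range(2000):
    action, reward = run_episode(p_short)
    advantage = reward - baseline                       # better than average?
    direction = 1.0 if action == "short_method" else -1.0
    p_short += learning_rate * advantage * direction    # nudge policy toward reward
    p_short = min(max(p_short, 0.01), 0.99)             # keep it a valid probability
    baseline += 0.01 * (reward - baseline)              # update running average

print(f"Learned preference for the shorter method: {p_short:.2f}")
```

The real system replaces this single scalar with a full language-model policy and uses group relative policy optimization (sketched below) instead of this crude update, but the incentive structure is the same: behavior that beats the average reward becomes more likely.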

As the paper's results show, DeepSeek R1's accuracy improves significantly over the course of training. It eventually surpasses OpenAI's static O1 model, and the trend suggests it could approach 90-100% accuracy with extended training. The Chain of Thought approach lets the model assess and adjust its own responses, reinforcing the overall learning process. This adaptability illustrates how effective reinforcement learning can be when integrated into LLM training.
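The concrete objective behind this is group relative policy optimization (GRPO), which the transcript below walks through term by term. For reference, here is a reconstruction of that objective in my own notation, based on the description in the DeepSeek-R1 and DeepSeekMath papers; treat it as a sketch of the idea rather than a verbatim copy of the published formula.

```latex
% GRPO objective (reconstruction): for each question q, sample a group of G
% answers o_1..o_G from the old policy, score each with a reward r_i, and
% standardize the rewards within the group to get the advantage A_i.
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
\left[
\frac{1}{G} \sum_{i=1}^{G}
\min\!\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
\mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right]
\]

\[
A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}
\]
```

Here the ratio compares how likely the new policy is to produce an answer versus the old policy, A_i says whether that answer beat the group's average reward, the clip term limits how far any single update can move the policy, and the KL penalty keeps the policy close to a reference model, which is the stability concern the transcript describes.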

The third crucial technique in DeepSeek's architecture is model distillation. Although the DeepSeek R1 model comprises 671 billion parameters and requires substantial computational resources, training smaller models on the outputs of this larger LLM yields similar performance at a fraction of the memory cost. The research findings indicate that these distilled models outperform larger models like GPT-4o and Claude 3.5 Sonnet on mathematical, coding, and scientific reasoning tasks, making the LLM ecosystem far more accessible. At the time of writing, the video has 1,274,579 views and 35,797 likes, reflecting significant interest in the new DeepSeek R1 model.
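As a rough sketch of what this looks like in practice (illustrative only: the model tags, the example questions, the reuse of a local Ollama endpoint, and the JSONL format are my assumptions, not the researchers' pipeline), a large "teacher" model writes out Chain of Thought solutions, and those solutions become supervised fine-tuning data for a much smaller "student" model.

```python
# Illustrative sketch of building a distillation dataset (not the paper's pipeline).
# A large "teacher" model writes step-by-step solutions; those solutions become
# supervised fine-tuning data for a much smaller "student" model.
# Assumes a local Ollama server; model tags and the JSONL format are assumptions.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_teacher(question: str, model: str = "deepseek-r1:70b") -> str:
    """Have the (larger) teacher model produce a reasoned, step-by-step solution."""
    prompt = ("Solve the problem step by step, showing your reasoning, "
              "then give the final answer.\n\n" + question)
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

QUESTIONS = [
    "Simplify (3x + 6) / 3.",
    "What is 15% of 240?",
    "A rectangle is 4 cm by 9 cm. What is its area?",
]

# Write generic prompt/completion pairs that a standard supervised fine-tuning
# script for a small student model (e.g. a 7B checkpoint) could consume.
with open("distill_data.jsonl", "w", encoding="utf-8") as f:
    for q in QUESTIONS:
        record = {"prompt": q, "completion": ask_teacher(q)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Fine-tuning the student on this data is an ordinary supervised step; the point of the sketch is only the data flow, in which the teacher's Chain of Thought examples become the student's training targets.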

Timeline summary

  • 00:00 Introduction to the new large-language model, DeepSeek R1, released by an AI research team from China.
  • 00:10 Comparison of DeepSeek R1's performance with OpenAI's O1 model on reasoning tasks.
  • 00:25 Overview of three main takeaways from the DeepSeek R1 paper.
  • 00:33 Explanation of the Chain of Thought technique for self-evaluation.
  • 00:54 Demonstration of how Chain of Thought prompts models to think step-by-step.
  • 01:09 Example showing the model's reasoning process in solving math problems.
  • 01:40 Introduction to the use of reinforcement learning in training the model.
  • 01:56 Analogy of how a baby learns to walk to illustrate model learning via exploration.
  • 02:19 Description of how different solving strategies can yield varying rewards.
  • 02:45 Graph demonstrating DeepSeek R1's improved accuracy over time compared to OpenAI’s O1.
  • 02:56 Differentiation of training methods between static and dynamic models.
  • 03:29 Reinforcement learning's role in adjusting the model's behavior for optimal performance.
  • 03:59 Use of group relative policy optimization to assess model performance.
  • 04:17 Explanation of policy changes and stability concerns during training.
  • 06:30 Discussion on model distillation to make large models more accessible.
  • 06:56 Process of using a larger model to teach a smaller model through examples.
  • 08:06 Findings showing distilled models outperforming larger models in specific tasks.
  • 08:21 Summary of key concepts of DeepSeek R1 and encouragement to explore further.

Transcription

This new large-language model has taken the tech world by absolute storm and represents a big breakthrough in the AI research community. Last Sunday, while TikTok was banned for 12 hours, an AI research team from China released a new large-language model called DeepSeek R1. As you can see on the screen, DeepSeek R1's benchmark shows that it performs at a similar level to OpenAI's O1 model on reasoning problems like math, coding, and scientific reasoning. And in this video, I'll talk about the three main takeaways from their paper, including how they use Chain of Thought in order to have the model self-evaluate its performance, how it uses pure reinforcement learning to have the model guide itself, and how they use model distillation to make DeepSeek and other LLMs more accessible to everyone.

Chain of Thought is a very simple but effective prompt engineering technique, where we pretty much ask the model to think out loud, where we add to our prompts that we want the model to explain its reasoning step-by-step. That way, if the model makes any mistakes, we can easily pinpoint where in its reasoning it was off so that we can re-prompt the model to not make the mistake again. Here's an example from the paper, where if you give the model a question like this math problem, you can see that in its response, it actually reasons through it and gives you the steps to how it got to the solution. It showed its work. You can see in red, it says, wait, wait, there's an aha moment, as well as let's evaluate, let's re-evaluate this step-by-step. And in doing so, the model is going to have a more accurate response than if it were to just give the answer by itself without Chain of Thought reasoning.

The way DeepSeek uses reinforcement learning is a little different from how most AI models are trained. We don't give it the question and answer, we kind of let it learn on its own. This is exactly the same way a baby learns how to walk for the first time. If you've ever seen a baby, it's actually pretty funny. They stumble around the environment and they maybe hold on to things as they try to figure out how to walk. And in doing so, they're learning how to move and position their joints so that they don't fall. In the same way, reinforcement learning allows us to train a model by optimizing its policy, aka how the model behaves, and it does so to maximize the reward. As the model explores its environment over time, it learns which policies maximize the reward, and then it just picks whichever policy does that best. For example, if you're solving an equation like this, there are two or three different ways to solve it, but one of them is much shorter than the others and thus has a much higher reward. Reinforcement learning is exactly how most robots learn how to walk and how Tesla's self-driving cars learn how to drive through a city.

And if we go to the paper and look at this graph, we can see how DeepSeek R1 improves how accurately it can answer questions as we train it over time. Using reinforcement learning, instead of telling the model what the correct answer to a question is, since that kind of data is pretty expensive to obtain, we instead let it figure things out on its own while measuring how accurate the model is. You can see that while OpenAI's O1 model is static, DeepSeek R1 eventually outperforms OpenAI's O1 model.
And if we let it train for even longer, it looks like it's going to perform even better and get closer to 90 or even 100% accuracy if we kept training it. And you can see how the model uses chain of thought reasoning in order to improve its responses over time and self-reflect. In reinforcement learning, we can't exactly tell the model how to change its policy. So that's why we use chain of thought reasoning to force the model to self-reflect and re-evaluate, to change its behavior and get closer to the maximum reward. That way we can kind of give the model the right incentives using prompts, and the model can re-evaluate how it answers questions, and it can do so with increasing accuracy.

And this equation is the key behind how DeepSeek uses reinforcement learning in order to optimize its policy. It uses group relative policy optimization to essentially score how well it answered a question without having the correct answer. This looks very, very complicated, so I'll just briefly explain the most important parts of it. What we do is we take pretty much the expectation over the old answers from the old policy the model has. And remember, the policy pi is the key thing that we're trying to optimize with DeepSeek, where we want to change the policy so that DeepSeek can then output better and more correct answers. So what we do is we take a weighted average of how the model responded with its old policy, and how it used its old policy to answer questions, versus how the model's new policy answers questions. And we also multiply it by a standardization value, A_i. A_i is basically saying, compared to the average reward, how much does this new policy increase the reward.

What we also want is for the model's policy not to change too much, because that can cause a lot of instability in model training. If you look at most reinforcement learning charts and graphs, or even the example of a baby, the baby is going to fall down unpredictably so many times. What we want to do is make sure our model is as stable as possible and avoid a roller coaster of policy changes. That's where this clipping comes in. Clipping essentially restricts how much our policy can change, to between 1 minus epsilon and 1 plus epsilon, and we also standardize that. So the weighted average is basically asking how small a change we can make to our policy in order to maximize the reward. We also subtract a regularization term called KL divergence. This is pretty much another way for us to stabilize our model training by making sure it doesn't change too much. In short, all of this is saying that we don't want our model's policy to change too much, but we want to do so in a way that lets us compare our old answers with the new answers, and then we change our policy so that we ultimately maximize the reward from policy changes that are kept as small as possible. It's like a min-max kind of situation, and that's what it's doing here with the weighted average.

And so the third important technique that the DeepSeek researchers use with their R1 model is model distillation. And the idea here is that the actual DeepSeek model is 671 billion parameters, and to run this you pretty much need at least a couple-thousand-dollar GPU, as well as a pretty expensive computer, to actually run the full model.
So to make it more accessible, what they do is they take the larger LLM and then they use it to teach a smaller LLM how it reasons and how it answers questions, so that the smaller LLM can actually perform on the same level as the bigger LLM, but at an order of magnitude smaller parameter size, like 7 billion parameters. And in the paper, the DeepSeek researchers distilled from their DeepSeek model into Llama 3 as well as Qwen. The idea here is that the teacher again uses chain of thought reasoning in order to generate a lot of examples of it answering questions. And then those examples it just gives directly to the student as part of the prompt. And the student is supposed to answer the questions with a similar accuracy as the larger model. This makes the whole LLM ecosystem much more accessible for people who don't have as many resources.

And the key insight is that in this paper they found that the student model during reinforcement learning training actually outperforms the teacher model, just by a little bit. But it's doing so, again, at a small fraction of the memory and storage required to use it. And in the experiments from the paper, the researchers actually found that these smaller distilled models from DeepSeek, as I said, outperform larger models like GPT-4o and Claude 3.5 Sonnet on these math, coding, and scientific reasoning tasks, as you can see in the table below right here.

And from those three things, those are kind of the key concepts behind how DeepSeek works. Hopefully you enjoyed this video, and if you want to, you can go read the paper in the description below, as well as play around with DeepSeek on Ollama yourself.