
Bob McGrew of OpenAI has presented a new series of models, O1 and O1 Mini. These models aim to introduce a new approach to understanding and solving tasks. O1 is a reasoning model, which means that before answering a question it spends time thinking its response through. The goal of these models is to improve results through deliberation, that is, better handling of complex queries. O1 Mini, in turn, is meant to be a more affordable version that still retains the intelligence and can carry out analyses in less time.
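For readers who want to try the models themselves, below is a minimal sketch of querying them through the OpenAI Python SDK. It assumes the `openai` package is installed, an API key is set in the environment, and the model identifiers `o1-preview` and `o1-mini` announced at release; the prompt is just an arbitrary example.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="o1-preview",  # or "o1-mini" for the faster, cheaper variant
    messages=[
        {
            "role": "user",
            "content": "A train travels 120 km in 90 minutes, then 80 km "
                       "in 60 minutes. What is its average speed in km/h?",
        }
    ],
)

# The reasoning happens before this final message is produced; the
# intermediate "thinking" tokens are not returned to the caller.
print(response.choices[0].message.content)
```

The visible difference from earlier models is mainly latency: the model spends time reasoning before the final answer arrives.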

During the conversation, McGrew touched on the idea of reasoning, explaining that different questions call for different approaches: some need quick answers, others longer thinking. The OpenAI team was inspired by the successes of AlphaGo, which led it to explore combining the reinforcement learning and supervised learning paradigms. While work on the models had been under way for a long time, McGrew noted that the arrival of an "aha" moment during training was the key turning point, a stepping stone for further development and deployment.

Despite many successes, the team ran into numerous challenges while training the models. McGrew compared the difficulty to launching a rocket into space: even the smallest error can end in disaster. Overcoming these obstacles, however, gives the team satisfaction and fuels its determination to keep refining the models. They managed not only to introduce deliberate periods of thinking, but also to observe the models beginning to question themselves.

The testing of O1 and O1 Mini was also discussed, with McGrew highlighting improvements in how the models answer specific questions. The results show that the team overcame many obstacles to reach more consistent and accurate answers, which only strengthens its conviction about O1's potential across a range of applications, for example in programming. It also matters to the OpenAI team that their work is becoming more useful both in everyday tasks and in more complex projects.

Finally, the statistics for the video show that at the time of writing it had reached 274,083 views and gathered 6,333 likes. The OpenAI team hopes that the new O1 and O1 Mini models will bring a range of benefits and contribute to further progress in the field of artificial intelligence. Their enthusiasm and commitment to the search for modern solutions will surely attract the attention of many users and drive the technology forward.

Timeline summary

  • 00:00 Bob McGrew introduces himself and presents the new O1 and O1 Mini models.
  • 00:20 Explanation of the O1 series and its significance compared to earlier models.
  • 01:00 Discussion of reasoning and its importance for model performance.
  • 01:35 The inspiration behind the O1 models: the influence of AlphaGo and reinforcement learning.
  • 02:39 Reflections on key moments and breakthroughs in training models such as GPT-2 and GPT-3.
  • 03:14 The "aha" moment of training models to generate their own chains of thought.
  • 03:37 Improvements in the O1 model's quality at solving math problems.
  • 04:21 Discussion of how the O1 model exhibits human-like thinking.
  • 04:50 Experiences of using the O1 model for programming and debugging.
  • 08:25 Users describe how O1 makes learning and brainstorming easier.
  • 10:36 O1 Mini is introduced as a cost-effective alternative to O1.
  • 18:18 The motivation behind researching and developing reasoning models.
  • 20:27 The importance of reasoning as a fundamental primitive for AI tasks.
  • 21:31 Observations on the uniqueness and personality of every trained model.
  • 22:03 Closing remarks and congratulations on the release of the O1 models.

Transcription

All right. I'm Bob McGrew. I lead the research team here at OpenAI. We've just released a preview of our new series of models, O1 and O1 Mini, which we are very excited about. And we've got the whole team here to tell you about them. What exactly is O1?

So we're starting a series of new models with the new name O1. This is to highlight the fact that you might feel different when you use O1 as compared to previous models such as GPT-4o. As others will explain later, O1 is a reasoning model, so it will think more before answering your question. We are releasing two models: O1 Preview, which is to preview what's coming for O1, and O1 Mini, which is a smaller and faster model that is trained with a similar framework as O1. So we hope you like our new naming scheme, O1.

So what is reasoning, anyway?

So one way of thinking of reasoning is that there are times when we ask questions and we need answers immediately, because they're simple questions. For example, if you ask what's the capital of Italy, you know the answer is Rome, and you don't really have to think about it much. But if you wonder about a complex puzzle, or you want to write a really good business plan, or you want to write a novel, you probably want to think about it for a while. And the more you think about it, the better the outcome. So reasoning is the ability to turn thinking time into better outcomes, whatever the task you're doing.

So how long have you guys been working on this?

Early on at OpenAI, we were very inspired by the AlphaGo results and the potential of deep reinforcement learning. And so we were researching that heavily, and we saw great scaling on Dota and robotics. And we were thinking about how we can do reinforcement learning on a general domain to get to a very capable artificial intelligence. And then we saw the amazing results of scaling supervised learning in the GPT paradigm. And so ever since, we've been thinking about how to combine these two different paradigms into one. And it's hard to point to one exact instance where this whole effort got started, but we've had early explorations with Jakob and Shimon, we've had early explorations with Lukasz, Ilya. And of course, I think one moment in time here is consolidating things with Jerry and having him build out this large scale effort here.

So I mean, it's been going on for a long time, but I think what's really cool about research is there's that aha moment. There's that particular point in time where something surprising happens and things really click together. Are there any times for you all when you had that aha moment?

Yeah, I mean, we trained GPT-2, GPT-3, GPT-4. There was a first moment when the model was hot off the press. We started talking to the model, and people were like, wow, this model is really great. And I think that there was a certain moment in our training process where we put more compute into RL than before and trained a model that, for the first time, generated coherent chains of thought. And we saw, wow, this looks like something meaningfully different than before. And I think, for me, this is the moment. Wow.

Related to that, when we think about training a model for reasoning, one thing that immediately jumps to mind is you could have humans write out their thought process and train on that. One aha moment for me was when we saw that if you train the model using RL to generate and hone its own chain of thought, it can do even better than having humans write chains of thought for it. And that was an aha moment that you could really scale this and explore models reasoning that way.
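That idea, reward only the final answer and let the model discover its own chain of thought, can be made concrete with a deliberately tiny sketch. The following is in no way OpenAI's actual training setup, just a toy REINFORCE policy on a made-up task (summing digits), where "thinking" one more step folds in one more digit and answering too early gives the wrong total:

```python
import math
import random

# Toy setup: the "question" is a list of digits and the correct answer
# is their sum. The policy decides, step by step, whether to keep
# "thinking" (fold in one more digit) or to answer with its running
# total. Answering early always gives a wrong total, so reward only
# reaches chains of thought that think all the way through.

N_DIGITS = 5
# One (think, answer) logit pair per number of digits processed so far.
logits = [[0.0, 0.0] for _ in range(N_DIGITS)]

def softmax(pair):
    m = max(pair)
    exps = [math.exp(x - m) for x in pair]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode():
    digits = [random.randint(1, 9) for _ in range(N_DIGITS)]
    running, trajectory = 0, []
    for step in range(N_DIGITS):
        action = random.choices([0, 1], weights=softmax(logits[step]))[0]
        trajectory.append((step, action))
        if action == 1:            # answer now, with only a partial sum
            break
        running += digits[step]    # "think": fold in one more digit
    reward = 1.0 if running == sum(digits) else 0.0
    return trajectory, reward

learning_rate, baseline = 0.5, 0.0
for _ in range(5000):
    trajectory, reward = run_episode()
    advantage = reward - baseline
    baseline = 0.99 * baseline + 0.01 * reward   # running-average baseline
    # REINFORCE: raise the log-probability of the actions taken in
    # proportion to how much better than average this episode was.
    for step, action in trajectory:
        probs = softmax(logits[step])
        for a in (0, 1):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[step][a] += learning_rate * advantage * grad

print("P(keep thinking) per step:",
      [round(softmax(pair)[0], 2) for pair in logits])
```

After training, the printed probabilities of "keep thinking" approach 1 at every step: from answer-level reward alone, the policy has learned to spend the thinking time the task requires.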
For a lot of the time that I've been here, we've been trying to make the models better at solving math problems, as an example. And we've put a lot of work into this, and we've come up with a lot of different methods. But every time I would read these outputs from the models, I'd always be so frustrated: the model just would never seem to question what was wrong, or when it was making mistakes, or things like that. But one of these early O1 models, when we trained it, and we actually started talking to it, and we started asking it these questions, and it was scoring higher on these math tests we were giving it, we could look at how it was reasoning. And you could just see that it started to question itself and have really interesting reflection. And that was a moment for me where I was like, wow, we've uncovered something different. This is going to be something new. And it was just one of these coming-together moments that was really powerful.

So when you read the thoughts, does it feel like you're watching a human, or does it feel like you're watching a robot?

It's like a spiritual experience. It's a spiritual experience, but then you can empathize with the model. You're like, oh, that's a mistake that a lot of people would make. Or you can see it sort of questioning common conventions. And yeah, it's spiritual, but oddly human in its behavior.

It was also pretty cool at some point, in cases where there was a limited amount of thinking allowed for the model, when just before the timeout the model was like, oh, I have to finish it now, and, like, oh, here's the answer.

I spent a lot of time doing competition math when I was young, and that was really my whole reason for getting into AI, to try and automate this process. And so it's been a huge full-circle moment for me to see the model actually be able to follow through very close to the same steps I would use when solving these problems. And, you know, it's not exactly the same chain of thought, I would say, but very, very relatable.

It's also really cool that, you know, it's believable that these models are getting on the cusp of really advancing engineering and science. And if they can solve problems that are hard for us, and, you know, maybe we can call ourselves experts, then maybe they can solve problems that are hard for other experts too, and could advance science.

So we've talked a lot about some of the great moments and the times when everything just clicked. What are some of the hurdles? What are some of the places where it was actually really hard to make things work?

Training large models is fundamentally a very, very hard thing to do. There are like thousands of things that can go wrong, and there are at least hundreds that did go wrong in every training run. So a lot of us here, you know, put a lot of blood, sweat, and tears into training those things and figuring out how to keep them learning and improving on their path. The path of success is very narrow, and the ways of failure are plentiful. It's like launching a rocket to, let's say, some planet or moon: if you are off by one angle, you won't arrive at the destination. And that's our job.

So the model, as we said, is very good, oftentimes better than humans, like it has the equivalent of several PhDs.
And that is sometimes a challenge, because we often have to go and verify that the model isn't going off the rails, doing something nonsensical. And it started taking some serious time as we scaled the model. We were saturating all the industry-grade evals, and we didn't know what to look for next. So that is also a challenge.

Yeah, I do think all of these things we ran into have also been a point of fulfillment. Every time you have a puzzle, it's another hurdle for this team to overcome. And I'm really glad with all the little hurdles that we've overcome.

So what are some of the ways you tested the models? Did you have any favorite questions that you saw the model get better at?

How many R's are in strawberry? For whatever reason, ChatGPT wasn't able to solve this question reliably. But O1, okay, you know, we did like a year and a half of work, and now we can count the number of R's in strawberry. Reliably.

I have this habit, which I think other people here do too, of whenever you go on Twitter and you see some post that's like, large language models can't do this, you copy and paste it in, and then you confirm that, actually, it can do this.

To give people a sense of what they can use the model for, I'd love to hear some of the ways that you use O1.

So one way I've been using O1 is, obviously, for coding. A lot of my job is about coding. So more and more, I focus on the problem definition and use what's called TDD, test-driven development. So instead of writing the code that implements the functionality, I focus on writing, say, the unit tests that specify the correct behavior this piece of code has to pass. And because I can focus more on that and then pass it on to O1 to really implement the thing, I can focus on what's important, what's the high-level problem to solve, and so on. So this has been a really important way of shifting my focus. And another area is debugging. Now, when I get some error messages, I just pass them to O1, and it prints out something. Sometimes it solves the problem right away. Even if it doesn't, it at least gives some better questions to ask and provides some ways to think about the problem better. So it has been a really important change in how I work, and I hope this helps others, too.
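As a concrete illustration of that test-first workflow, here is a minimal sketch in Python. The `parse_duration` helper and its tests are hypothetical, invented for this example: the human writes the failing tests first, then asks the model to produce an implementation that makes them pass.

```python
import re
import unittest

# Step 1 (human): pin down the correct behavior as unit tests.
# `parse_duration` is a hypothetical helper, invented for this example.
class TestParseDuration(unittest.TestCase):
    def test_parses_minutes_and_seconds(self):
        self.assertEqual(parse_duration("2m30s"), 150)

    def test_parses_bare_seconds(self):
        self.assertEqual(parse_duration("45s"), 45)

    def test_rejects_garbage(self):
        with self.assertRaises(ValueError):
            parse_duration("soon")

# Step 2 (model): the kind of implementation one might ask O1 to
# produce, iterating until the tests above pass.
def parse_duration(text: str) -> int:
    """Convert strings like '2m30s' or '45s' into a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"not a duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds

if __name__ == "__main__":
    unittest.main()
```

The tests, rather than the prose prompt, become the specification the model's code has to satisfy, which is what lets the human stay focused on the problem definition.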
I like using O1 more and more for learning. The more I ask it about various complex technical subjects, the more I find it hallucinates less and explains those concepts better than previous models.

For me, I like to use O1 as a brainstorming partner. That can range from anything from how to solve some very specific machine learning problem to how to write a blog post or a tweet. For example, I recently wrote a blog post about language model evaluations, and I was asking O1 about ideas for the structure of the blog post, pros and cons of certain benchmarks, and even the style of the writing. And I think because it's able to think before it gives the final answer, it's able to connect ideas better, it can revise and critique candidate ideas, and things like that.

Yeah, I think if you have some short text and you want it more creative, something really different, that's a great use for it. Give me five different ideas. Also, if you have just some unstructured thoughts, it's a really brilliant thought partner. So you have some ideas, and it's like, well, how should I connect these things? What am I missing? And through its final answers, and through sort of reading its thought process, it can really lead to much better results for you.

Yeah, I use it to try out a bunch of our internal secret ideas and actually try to improve them.

Yeah, for standalone projects, it's great. Like, I had to add a GitHub plugin. I know nothing about adding GitHub plugins. And I just said, hey, I want a GitHub plugin that displays this and this information about the PR, and, like, yeah, it just produced the code. I would just ask it, OK, so where do I need to paste this code? I don't even know. And it's like, yeah, you paste it here. Let's go.

I think for a lot of people, it's hard to really feel the AGI until you see the models do something better than humans can at a domain that you really care about. And I think, you know, for Go players and chess players, that would have come a few years earlier. And for a lot of us that really value math and coding, I think we're starting to feel that now. I want our moms to be proud of us.

So are there any parts of this project, anything that really needed to be done, but, you know, people might not realize how important it is?

So I think building large-scale, reliable infrastructure to run our biggest flagship model training runs, as well as doing research experiments, is something that is not as exciting as doing research itself, but it has to be done, and it has a tremendous impact on the success of the entire project. I think there is something special in OpenAI about how we structure our research, in that we value algorithmic advancements in the same way as building reliable, large-scale systems and building the datasets that are needed either way for training those models. I'm really proud of OpenAI in that way.

I think that has been a consistent pattern throughout many of our big projects. Every time we scale a new thing up another order of magnitude, we see another host of problems, both algorithmic and infrastructural. And we've definitely built a capacity to advance in both with a lot of focus.

I feel the final model is just, like, literally a beautiful piece of art, right? In order to make it work, we have to make sure that every step works, right? We find some unexplained problem and solve it, right? I think that's really how OpenAI operates, and I'm very proud to work here.

And also, let's say, it's not only that really brilliant people are here, but also kind-hearted ones. It's just fun for me to work here. And I'm grateful to my colleagues who, you know, code with me, pair-code with me, hang out with me, eat lunch with me, speak with the model with me.

So what's it like to work on the Strawberry team?

You can have your brilliant ideas, but most of the time you spend on running them, and them not running, and failing. And it's very good to have people very close by in your office whom you can ask for help with whatever failed last time. Because, I mean, most of the time you spend your time debugging things that didn't work, and having people who can help you is... Speaking of this help, we had many times when we were trying to debug something for like a week, and then passing by Wenda and asking him, and he just solved it right away. I started calling it the Wenda blessing, and then blessing people, and that has been really, really effective. And I stopped thinking that it's too stupid to ask, and just ask right away.
One of the things I really appreciate about working at OpenAI is that from every big project like this, we really learn. I think from Dota we learned the importance of engineering, from GPT-4 we learned the importance of research, and we keep iterating like this. And the effect of that is that now the Strawberry team is again the best big research project team yet, because it's built on all of the things we've learned from the previous projects. And you can really see it working here. People have developed very good intuition: when do you hack something, where do you need to develop stronger fundamentals, when do you stay overnight, and when do you actually take a week off and come back to this particular problem with a fresh mind. I think it's really amazing to observe this progress we make as a company.

Yeah, one thing I like is just how organic this project has felt. The ideas have come literally from everywhere on this team, and people feel empowered to just say, hey, here's an idea I really believe in, and it's the thing that I'm going to push. And also, people are just willing to get their hands dirty. I feel like there've been a lot of deadlines, some self-imposed, but we've all really come together, and we're willing to put in the work to make it happen.

This project really demonstrated the power of momentum, where we get initial good results, and more and more people get excited about a particular field and particular research. They try to contribute their new ideas, those new ideas work even better, and then the thing starts snowballing and getting more and more momentum on its own, and people just believe that this is the right thing to do and that we should continue pushing this research.

Related to that, I think we have lots of very smart people, but also very opinionated people. But people are always willing to update their opinions once they see results to the contrary. And I think that makes things really fun. It's kind of cool to be in a place that's a combination of brilliant scientists and engineers and folks who can build out incredible systems. It's very humbling.

So one thing I remember from a few months ago: the model was very smart, but it was also kind of boring. What was it like to give the model a personality?

Yeah, so that's interesting. I remember I asked the model about the meaning of life, and it gave me an answer, 42, which is not that bad of an answer. And, you know, it was kind of similar when I asked the model, what is love? It told me, oh, it is like a strange human feeling. And once we actually gave the model a personality, made it actually work with chat, then the answers started being quite interesting. I asked about love, and it told me, you know, there's romantic love, familial love, self-love, unconditional love, conditional love, and it became more useful and also more fun. The funniest moment is that I asked the exact same question, and it tried to define love with algebra. I shouldn't have asked a math nerd that question.

So what's the story of O1 Mini? How did that come to be?

So the motivation is that we want to bring the O1 series to a broader audience with much lower cost. So we created O1 Mini, which was designed to be a minimal demonstration of the whole O1 pipeline, or the framework.
We made it a STEM reasoning specialist, which may not necessarily know the birth date of our favorite celebrity, but really, truly understands how to do reasoning effectively, right? And it truly has a lot of intelligence. The model is actually really smart, right? It's much smarter than our previous best model, GPT-4o, and also almost on par, right, with our best model, O1, but it only comes with a fraction of the cost and latency. It does have the limitation that it may not know a lot of the knowledge about the outside world, right, that is not about science or technology, but we try to make it roughly on par with our previous best mini model, GPT-4o mini, and we are working to improve it further. So I'm super excited for our external users to just try it out for this lightning experience of reasoning and thinking.

So what motivates you to do your research?

I just find it fascinating that in this world you have these things that can do intelligence and reasoning, and they're much smaller than you'd think, and they can do this in different ways. It's just super fascinating.

Good things in life take time, and our models just tend to answer too quickly. Eventually we want to have models that can do, for example, research for months or years, and I feel like this is the first step in the direction of models that can think very long about one problem. Right now we're at the level of minutes, and I think it's just the first step on a long path that hopefully takes us to models that can think for months or years as time goes by.

It feels very meaningful that I, together with a small number of people, can have some substantial positive impact on the world. And also, it's fun. Day-to-day it's just fun. I like, you know, speaking to the computer. I like starting a job on the cluster. I very much enjoy collaboration. It's just beautiful.

I really like our models to be useful, and I think technology has a chance and a promise to improve human life. I like our models to do work for us, to help us with our day-to-day problems, and giving them the ability to reason allows them to do things for us that they just couldn't before, and that will allow us to spend our time more productively.

Yeah, I'm very excited about this. I think these sorts of paradigms unlock things that the models couldn't do before. So it's not just answering some sets of queries a little bit better; it's actually getting to a point where, through planning, through error correction, it's able to unlock new capabilities. And the ability to produce new knowledge in the world, for science, for discovery, I think is one of the most exciting pieces of this. And I think in some short amount of time, it's going to become a larger and larger contributor to its own development, and I think that's a really exciting regime.

I think some of the people on this team, we were math or coding olympiad participants in the past, and there's this huge personal motivation to create a system that can beat us at what we do best. And I think the second thing really echoes the point that JT and Leo made: I do think reasoning is a much more powerful primitive than people give it credit for. When you think about accomplishing tasks reliably, really, that fundamental primitive has to be reasoning. You're going to hit bottlenecks, and you're going to have to navigate your way around them. So I'm really excited for that.
I think AI researchers' job is to find a way to put more compute in, and hardware people have been doing such a good job that the cost has been going down exponentially for a very long time. And we don't have much time to find another way to put in more compute, and it's kind of like a weight on my shoulders that just keeps getting larger and larger, and this new paradigm really finds a way to unload that, probably for a long time.

Is there anything else you've observed as we've been going through this project, for the whole time we've been doing it? Anything else that's worth calling out?

I think an interesting meta-observation that we've had is that every model we train is a little bit different. It has its own quirks, and it's almost artisanal, because when you look at a model that can do so many different tasks, each model you train won't have exactly the same performance at each task. It might be better at some tasks and worse at others, and so there's this uniqueness, or, like, personality, to every model that is almost a little bit beautiful.

Thank you, and congrats on releasing this.