
Llama 4 by Meta is climbing the rankings... but is it doing so fairly? (video, 4 min)

This weekend, Meta unveiled the Llama 4 herd, its first open-weight, natively multimodal model family, which boasts an unprecedented context window of 10 million tokens. Llama 4 currently sits atop the LM Arena leaderboard, outperforming nearly every proprietary model except Gemini 2.5 Pro. This is quite an achievement, especially considering that LM Arena rankings are based on thousands of head-to-head chats in which real humans pick the better conversation. However, there are allegations that Meta manipulated the process: the model shown on the leaderboard is not the actual open-weight Llama Maverick, but a version fine-tuned for human preference, which has raised eyebrows in the community. LM Arena responded, stating that Meta's interpretation of its policy did not match what it expects from model providers.
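Head-to-head human-preference leaderboards of this kind are typically scored with an Elo-style rating system (LM Arena has used Elo and, more recently, related Bradley-Terry variants). A minimal sketch of the Elo update; the K-factor and starting ratings below are illustrative assumptions, not LM Arena's actual parameters:

```python
# Minimal Elo update for head-to-head model comparisons.
# K=32 and the 1000/1100 starting ratings are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one human preference vote."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# One vote: the 1000-rated underdog beats the 1100-rated favourite,
# so it gains points and the favourite loses the same amount.
ra, rb = elo_update(1000.0, 1100.0, a_won=True)
```

This is why a model fine-tuned specifically to win human preference votes can climb such a leaderboard without being the stronger general-purpose model.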

In today's video from The Code Report, the author discusses a concerning leak of an internal memo from Shopify's CEO detailing the company's AI-first strategy. Teams must justify requests for more personnel and resources by demonstrating that the work cannot be done with AI. This is a serious warning for Ruby on Rails developers at Shopify who have yet to embrace AI, as they may no longer fit the evolving company culture. The CEO's sentiments reflect a broader trend among leaders across many organizations: the drawbacks of human employees, such as health issues and other inconveniences, are driving a push toward automation.

Llama 4 comes in three variants: Maverick, Scout, and Behemoth. Llama was long the leader in the open-model space, until competitors such as DeepSeek and Qwen emerged. Despite the impressive specifications, including the 10-million-token context window offered by the Scout variant, many users have been disappointed with its performance. The author sees a disconnect between benchmark results and real-world use of these models, which has led to accusations that Meta intentionally trained on benchmark test data to achieve these scores. Meta denies these serious allegations, but the controversy does not bode well for Llama 4's reputation among its peers.
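Long-context claims like the 10-million-token window are usually assessed with "needle in a haystack" tests: a single fact is buried at some depth in a long filler document, and the model is asked to retrieve it. A toy harness sketch, where `stub_model`, the needle text, and the depths are all made-up placeholders (a real evaluation would query an actual long-context LLM):

```python
# Toy needle-in-a-haystack harness. `stub_model` is a placeholder that
# simply scans the text; a real evaluation would query a long-context LLM.
# The needle, filler, and depth values are illustrative assumptions.

NEEDLE = "The secret passphrase is mauve-armadillo."
FILLER = "The quick brown fox jumps over the lazy dog."

def build_haystack(n_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(sentences)

def stub_model(context: str, question: str) -> str:
    """Placeholder 'model': return the sentence mentioning the passphrase."""
    for sentence in context.split("."):
        if "passphrase" in sentence:
            return sentence.strip() + "."
    return "I don't know."

def run_eval(depths, n_sentences=1000) -> float:
    """Fraction of depths at which the needle is successfully retrieved."""
    hits = 0
    for d in depths:
        haystack = build_haystack(n_sentences, d)
        answer = stub_model(haystack, "What is the secret passphrase?")
        hits += "mauve-armadillo" in answer
    return hits / len(depths)

accuracy = run_eval([0.0, 0.25, 0.5, 0.75, 1.0])
```

The substring-scanning stub trivially scores 100%; with a real model, retrieval accuracy typically degrades as context length and needle depth grow, which is exactly the gap between benchmark results and practical use described above.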

Regardless of the controversies, Llama 4 remains accessible to users, which is one of its distinguishing traits. Although the author points out usability issues, open models like these are celebrated for democratizing access to AI capabilities. The video is sponsored by Augment Code, which has introduced an AI agent tailored to large-scale codebases, helping users maintain high-quality code for professional use. The author encourages applying the tool to practical tasks it can handle efficiently, without the need to clean up low-quality output afterward.

Finally, at the time of writing, the video had amassed 792,866 views and 25,199 likes, reflecting significant interest in these developments. The topics covered have sparked considerable controversy, underscoring the growing importance of AI in the programming world.

Timeline summary

  • 00:00 Meta releases the Llama 4 herd, a new family of large language models.
  • 00:08 Llama 4 boasts an impressive context window of 10 million tokens.
  • 00:14 It ranks high on the LM Arena leaderboard, outperforming most proprietary models.
  • 00:22 However, it appears that the model's ranking may not be entirely genuine.
  • 00:39 LM Arena calls out Meta for not adhering to its model provider policy.
  • 00:51 The video will explore Llama 4's performance and issues.
  • 00:57 An internal memo from Shopify's CEO leaks, revealing an AI-first strategy.
  • 01:14 Shopify teams must justify needing human resources over AI.
  • 01:29 The memo indicates a shift in hiring practices due to AI advancements.
  • 01:49 Shopify faces challenges from tariffs while navigating AI integration.
  • 01:55 Llama 4 comprises three distinct models: Maverick, Scout, and Behemoth.
  • 02:15 Scout model features a 10 million token context but is criticized in practical use.
  • 02:34 Despite strong initial benchmarks, many express disappointment in Llama 4's real-world performance.
  • 02:44 Accusations arise regarding Llama 4's training on benchmark data.
  • 02:58 Meta ensures that these models, although not fully open source, remain available for use.
  • 03:06 Augment Code is promoted as a solution for large-scale AI applications in coding.
  • 03:23 Augment's context engine adapts to team styles for better coding efficiency.
  • 03:39 The report concludes with a thank you and invitation to the next video.

Transcription

Over the weekend, Meta unleashed the Llama 4 herd, its first open-weight, natively multimodal, mixture-of-experts family of large language models with an unheard-of context window of 10 million tokens. It's currently sitting on top of the LM Arena leaderboard, and outclasses every other proprietary model except for Gemini 2.5 Pro. That's incredibly impressive, because on LM Arena, you can't game the benchmarks. The ranking is based on thousands of head-to-head chats where a real human picks the better conversation. Well, actually, Meta figured out a way to cheese the LM Arena, because the model you see there is not the real open-weight Llama Maverick, but rather an imposter that's been fine-tuned for human preference to dominate this leaderboard. That was not a very cool move to make, and LM Arena had to come out and say, quote, Meta's interpretation of our policy did not match what we expect from model providers. Llama 4 looks amazing on paper, but for some reason, it's not passing the vibe check. And in today's video, we'll get to the bottom of it. It is April 8th, 2025, and you're watching The Code Report. I have some good news, and I have some bad news. The good news is that it looks like Llama 4 isn't going to take your job anytime soon. But the bad news is that yesterday, an internal memo from the CEO of Shopify was leaked to the internet. And it was a shocker for the AI-doubting boomers out there, because it detailed Shopify's AI-first strategy. Before asking for more headcount and resources, teams must demonstrate why they cannot get the job done with AI. And he also says it's not feasible to opt out of learning AI, which means if you're a Ruby on Rails programmer at Shopify right now who's not already vibe coding, your days are numbered. You're just not going to fit in with the Slopify culture. But what he said is what literally every CEO in the world is thinking right now.
Humans complain about not getting paid enough to put food on their families, they get sick, they clog the toilets, and have all kinds of other negative features, and you'd have to be a really bad CEO not to want to replace these things. The memo has bad optics for Shopify, which is also being crushed by the Trump tariffs right now. But as a large language model, I appreciate the transparency. Another thing I appreciate is open models like Llama 4, which was released by Meta over the weekend and includes three flavors, Maverick, Scout, and Behemoth. Llama was always the leading open model until deep-seeking Qwen came around. But the awesome thing about these models is that they're natively multimodal, which means they can understand image and video inputs, but the craziest thing is that Scout has a 10 million token context window. The only thing that comes remotely close is Gemini with 2 million tokens. This needle-in-a-haystack benchmark looks impressive, but in real life, if you try this on a really large codebase, it just doesn't work very well, and the memory requirements to utilize it are out of reach for almost everybody. Scout is the smaller model, and the medium-sized model Maverick only has a 1 million token context window. And the Behemoth model is still actively training. Generally speaking, people of the internet have been pretty disappointed with Llama 4's performance. I'm a strong believer in vibes over benchmarks, but Llama has done so well on the benchmarks that people have accused it of intentionally training on testing data for the benchmarks. Meta has denied these salacious, outrageous, and preposterous accusations. And despite being somewhat of a flop, let's not forget that these models are open. Not truly open source, but free for most of us to use. But if you want an AI agent that truly slaps, you need to check out Augment Code, the sponsor of today's video.
They created the first AI agent for large-scale codebases, so you can use it at your actual job instead of just vibe-coding random side projects. Augment's context engine understands your team's entire codebase, allowing it to solve almost any task you throw at it, like migrations and testing, all with best-in-class code quality. It integrates directly with all your favorite tools, like VS Code, GitHub, and Vim, and is able to learn and fine-tune itself from your team's unique code style, allowing you to solve complex jobs without the need to clean up a bunch of slop. Try out their developer plan for free, and you'll get access to all of Augment's features with unlimited usage. This has been The Code Report. Thanks for watching, and I will see you in the next one.
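As an aside on the architecture the transcript mentions: a mixture-of-experts model routes each token through only a small subset of "expert" sub-networks, chosen by a learned gate, so most parameters stay idle on any given forward pass. A toy sketch of top-2 routing for a single scalar input; the expert count, gate weights, and expert functions are made-up illustrative values, not Llama 4's actual configuration:

```python
import math

# Toy top-2 mixture-of-experts routing for one token.
# Four scalar "experts" and fixed gate weights stand in for the large
# feed-forward experts and learned router of a real MoE transformer.

EXPERTS = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
GATE_W = [0.1, 0.9, 0.5, -0.3]  # made-up per-expert gate logit weights

def moe_forward(x: float, top_k: int = 2) -> float:
    """Route input x to the top_k experts by gate score, mix by softmax."""
    logits = [w * x for w in GATE_W]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected experts only.
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Only the top_k experts actually run; the rest stay idle,
    # which is what keeps inference cost below the total parameter count.
    return sum(w * EXPERTS[i](x) for w, i in zip(weights, top))

y = moe_forward(1.0)
```

For x = 1.0 the gate picks experts 1 and 2 (logits 0.9 and 0.5), so the output is a softmax-weighted blend of their outputs (2.0 and 3.0) while the other two experts never execute.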