
Llama 4 by Meta is climbing the rankings... but is it doing so fairly? (video, 4 min)

This weekend, Meta unveiled the Llama 4 herd, its first open-weight, natively multimodal model family, which boasts an unprecedented context window of 10 million tokens. Llama 4 currently sits atop the LM Arena leaderboard, outperforming nearly every proprietary model except Gemini 2.5 Pro. This is quite an achievement, especially considering that LM Arena rankings are based on thousands of head-to-head chats in which real humans pick the better conversation. However, there are allegations that Meta manipulated the process: the model shown on the leaderboard is not the actual open-weight Llama Maverick, but a version fine-tuned for human preference, which has raised eyebrows in the community. LM Arena responded, stating that Meta's interpretation of its policy did not match what it expects from model providers.
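Head-to-head human-preference leaderboards of this kind are typically scored with an Elo-style rating system (LM Arena has used Elo and, more recently, related Bradley-Terry variants). A minimal sketch of the Elo update; the K-factor and starting ratings below are illustrative assumptions, not LM Arena's actual parameters:

```python
# Minimal Elo update for head-to-head model comparisons.
# K=32 and the 1000/1100 starting ratings are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one human preference vote."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# One vote: the 1000-rated underdog beats the 1100-rated favourite,
# so it gains points and the favourite loses the same amount.
ra, rb = elo_update(1000.0, 1100.0, a_won=True)
```

This is why a model fine-tuned specifically to win human preference votes can climb such a leaderboard without being the stronger general-purpose model.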

In today's video from The Code Report, the author discusses a concerning leak of an internal memo from Shopify's CEO detailing the company's AI-first strategy. Teams must justify requests for more personnel and resources by demonstrating that the work cannot be done with AI. This is a serious warning for Ruby on Rails developers at Shopify who have yet to embrace AI, as they may no longer fit the evolving company culture. The CEO's sentiments reflect a broader trend among leaders across many organizations: the drawbacks of human employees, such as health issues and other inconveniences, are driving a push toward automation.

Llama 4 comes in three variants: Maverick, Scout, and Behemoth. Llama was long the leader in the open-model space, until competitors such as DeepSeek and Qwen emerged. Despite the impressive specifications, including the 10-million-token context window offered by the Scout variant, many users have been disappointed with its performance. The author sees a disconnect between benchmark results and real-world use of these models, which has led to accusations that Meta intentionally trained on benchmark test data to achieve these scores. Meta denies these serious allegations, but the controversy does not bode well for Llama 4's reputation among its peers.
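Long-context claims like the 10-million-token window are usually assessed with "needle in a haystack" tests: a single fact is buried at some depth in a long filler document, and the model is asked to retrieve it. A toy harness sketch, where `stub_model`, the needle text, and the depths are all made-up placeholders (a real evaluation would query an actual long-context LLM):

```python
# Toy needle-in-a-haystack harness. `stub_model` is a placeholder that
# simply scans the text; a real evaluation would query a long-context LLM.
# The needle, filler, and depth values are illustrative assumptions.

NEEDLE = "The secret passphrase is mauve-armadillo."
FILLER = "The quick brown fox jumps over the lazy dog."

def build_haystack(n_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(sentences)

def stub_model(context: str, question: str) -> str:
    """Placeholder 'model': return the sentence mentioning the passphrase."""
    for sentence in context.split("."):
        if "passphrase" in sentence:
            return sentence.strip() + "."
    return "I don't know."

def run_eval(depths, n_sentences=1000) -> float:
    """Fraction of depths at which the needle is successfully retrieved."""
    hits = 0
    for d in depths:
        haystack = build_haystack(n_sentences, d)
        answer = stub_model(haystack, "What is the secret passphrase?")
        hits += "mauve-armadillo" in answer
    return hits / len(depths)

accuracy = run_eval([0.0, 0.25, 0.5, 0.75, 1.0])
```

The substring-scanning stub trivially scores 100%; with a real model, retrieval accuracy typically degrades as context length and needle depth grow, which is exactly the gap between benchmark results and practical use described above.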

Regardless of the controversies, Llama 4 remains accessible to users, which is one of its distinguishing traits. Although the author points out usability issues, open models like these are celebrated for democratizing access to AI capabilities. The video is sponsored by Augment Code, which has introduced an AI agent tailored to large-scale codebases, helping users maintain high-quality code for professional use. The author encourages applying the tool to practical tasks it can handle efficiently, without the need to clean up low-quality output afterward.

Finally, at the time of writing, the video had amassed 792,866 views and 25,199 likes, reflecting significant interest in these developments. The topics covered have sparked considerable controversy, underscoring the growing importance of AI in the programming world.

Timeline summary

  • 00:00 Meta releases the Llama 4 herd, a new family of large language models.
  • 00:08 Llama 4 boasts an impressive context window of 10 million tokens.
  • 00:14 It ranks high on the LM Arena leaderboard, outperforming most proprietary models.
  • 00:22 However, it appears that the model's ranking may not be entirely genuine.
  • 00:39 LM Arena calls out Meta for not adhering to its model provider policy.
  • 00:51 The video will explore Llama 4's performance and issues.
  • 00:57 An internal memo from Shopify's CEO leaks, revealing an AI-first strategy.
  • 01:14 Shopify teams must justify needing human resources over AI.
  • 01:29 The memo indicates a shift in hiring practices due to AI advancements.
  • 01:49 Shopify faces challenges from tariffs while navigating AI integration.
  • 01:55 Llama 4 comprises three distinct models: Maverick, Scout, and Behemoth.
  • 02:15 Scout model features a 10 million token context but is criticized in practical use.
  • 02:34 Despite strong initial benchmarks, many express disappointment in Llama 4's real-world performance.
  • 02:44 Accusations arise regarding Llama 4's training on benchmark data.
  • 02:58 Meta ensures that these models, although not fully open source, remain available for use.
  • 03:06 Augment Code is promoted as a solution for large-scale AI applications in coding.
  • 03:23 Augment's context engine adapts to team styles for better coding efficiency.
  • 03:39 The report concludes with a thank you and invitation to the next video.

Transcription

Over the weekend, Meta unleashed the Llama 4 herd, its first open-weight, natively multimodal, mixture-of-experts family of large language models with an unheard-of context window of 10 million tokens. It's currently sitting on top of the LM Arena leaderboard, and outclasses every other proprietary model except for Gemini 2.5 Pro. That's incredibly impressive, because on LM Arena, you can't game the benchmarks. The ranking is based on thousands of head-to-head chats where a real human picks the better conversation. Well, actually, Meta figured out a way to cheese the LM Arena, because the model you see there is not the real open-weight Llama Maverick, but rather an imposter that's been fine-tuned for human preference to dominate this leaderboard. That was not a very cool move to make, and LM Arena had to come out and say, quote, Meta's interpretation of our policy did not match what we expect from model providers. Llama 4 looks amazing on paper, but for some reason, it's not passing the vibe check. And in today's video, we'll get to the bottom of it. It is April 8th, 2025, and you're watching The Code Report. I have some good news, and I have some bad news. The good news is that it looks like Llama 4 isn't going to take your job anytime soon. But the bad news is that yesterday, an internal memo from the CEO of Shopify was leaked to the internet. And it was a shocker for the AI-doubting boomers out there, because it detailed Shopify's AI-first strategy. Before asking for more headcount and resources, teams must demonstrate why they cannot get the job done with AI. And he also says it's not feasible to opt out of learning AI, which means if you're a Ruby on Rails programmer at Shopify right now who's not already vibe coding, your days are numbered. You're just not going to fit in with the Slopify culture. But what he said is what literally every CEO in the world is thinking right now.
Humans complain about not getting paid enough to put food on their families, they get sick, they clog the toilets, and have all kinds of other negative features, and you'd have to be a really bad CEO not to want to replace these things. The memo has bad optics for Shopify, which is also being crushed by the Trump tariffs right now. But as a large language model, I appreciate the transparency. Another thing I appreciate is open models like Llama 4, which was released by Meta over the weekend and includes three flavors, Maverick, Scout, and Behemoth. Llama was always the leading open model until deep-seeking Qwen came around. But the awesome thing about these models is that they're natively multimodal, which means they can understand image and video inputs, but the craziest thing is that Scout has a 10 million token context window. The only thing that comes remotely close is Gemini with 2 million tokens. This needle-in-a-haystack benchmark looks impressive, but in real life, if you try this on a really large codebase, it just doesn't work very well, and the memory requirements to utilize it are out of reach for almost everybody. Scout is the smaller model, and the medium-sized model Maverick only has a 1 million token context window. And the Behemoth model is still actively training. Generally speaking, people of the internet have been pretty disappointed with Llama 4's performance. I'm a strong believer in vibes over benchmarks, but Llama has done so well on the benchmarks that people have accused it of intentionally training on testing data for the benchmarks. Meta has denied these salacious, outrageous, and preposterous accusations. And despite being somewhat of a flop, let's not forget that these models are open. Not truly open source, but free for most of us to use. But if you want an AI agent that truly slaps, you need to check out Augment Code, the sponsor of today's video.
They created the first AI agent for large-scale codebases, so you can use it at your actual job instead of just vibe-coding random side projects. Augment's context engine understands your team's entire codebase, allowing it to solve almost any task you throw at it, like migrations and testing, all with best-in-class code quality. It integrates directly with all your favorite tools, like VS Code, GitHub, and Vim, and is able to learn and fine-tune itself from your team's unique code style, allowing you to solve complex jobs without the need to clean up a bunch of slop. Try out their developer plan for free, and you'll get access to all of Augment's features with unlimited usage. This has been The Code Report. Thanks for watching, and I will see you in the next one.
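As an aside on the architecture the transcript mentions: a mixture-of-experts model routes each token through only a small subset of "expert" sub-networks, chosen by a learned gate, so most parameters stay idle on any given forward pass. A toy sketch of top-2 routing for a single scalar input; the expert count, gate weights, and expert functions are made-up illustrative values, not Llama 4's actual configuration:

```python
import math

# Toy top-2 mixture-of-experts routing for one token.
# Four scalar "experts" and fixed gate weights stand in for the large
# feed-forward experts and learned router of a real MoE transformer.

EXPERTS = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
GATE_W = [0.1, 0.9, 0.5, -0.3]  # made-up per-expert gate logit weights

def moe_forward(x: float, top_k: int = 2) -> float:
    """Route input x to the top_k experts by gate score, mix by softmax."""
    logits = [w * x for w in GATE_W]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected experts only.
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Only the top_k experts actually run; the rest stay idle,
    # which is what keeps inference cost below the total parameter count.
    return sum(w * EXPERTS[i](x) for w, i in zip(weights, top))

y = moe_forward(1.0)
```

For x = 1.0 the gate picks experts 1 and 2 (logits 0.9 and 0.5), so the output is a softmax-weighted blend of their outputs (2.0 and 3.0) while the other two experts never execute.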