Google Succeeds With LLMs While Meta and OpenAI Stumble


The early history of large language models (LLMs) was dominated by OpenAI and, to a lesser extent, Meta. OpenAI’s early GPT models established the frontier of LLM performance, while Meta carved out a healthy niche with open-weight models that delivered strong performance. Open-weight models have publicly accessible code that anyone can use, modify, and deploy freely.

That left some tech giants, including Google, behind the curve. The breakthrough research paper on the transformer architecture that underpins large language models came from Google in 2017, yet the company is often remembered more for its botched launch of Bard in 2023 than for its innovative AI research.

But strong new LLMs from Google, and misfires from Meta and OpenAI, are shifting the vibe.

Llama 4 herd gets off on the wrong hoof

News of Llama 4’s release unexpectedly came out of Meta on Saturday, 5 April.

If the decision to release a major model on a weekend strikes you as odd, you’re not alone. The timing caught everyone off guard and partially buried the announcement in the following week’s news cycle.

Meta’s new open-weight LLM does have its strengths. Llama 4 is multimodal, which means it can handle images, audio, and other modalities. It comes in three flavors, Llama 4 Behemoth, Maverick, and Scout, which have different sizes and strengths. Llama 4 Scout also boasts a huge context window of up to 10 million tokens. Tokens are the small units of text that LLMs process and generate, and the context window is the number of tokens a model can process at once. A larger context window helps the model “remember” and work with larger amounts of text in a single session. Most models have a context window of one million tokens or less.

But reception took a turn for the worse when critics noticed Meta’s sly approach to ranking on LMArena, a site that ranks LLMs based on user votes. The specific Llama 4 model that Meta used for the rankings wasn’t the same model available as part of its general release. In a statement, LMArena said Meta provided “a customized model to optimize for human preference.”

Meta also caught flak for its boast about Llama 4 Scout’s 10-million-token context window. While this figure appears to be technically accurate, a benchmark of long-context performance found that Llama 4 lagged behind competitive models.

Meta also didn’t release a Llama 4 “reasoning” or “thinking” model and held back smaller variants, though Meta says a reasoning model will become available.

“They deviated from the norm of a more systematic release, where they have all their ducks in a row,” says Ben Lorica, founder of the AI consulting company Gradient Flow. “This seems like they wanted to reassure people they have a new model, even if they don’t have all the components, like a reasoning model and smaller versions.”

GPT-4.5 is forced to retreat

OpenAI has experienced its share of difficulties in recent months, too.

GPT-4.5, released as a research preview on 27 February, was touted as the company’s “largest and best model for chat yet.” And OpenAI found that it did, in fact, generally outperform the prior model GPT-4o in benchmarks.

However, the model’s costs drew criticism. OpenAI priced API access to the model at US $150 per million output tokens. That was a staggering 15-fold increase over GPT-4o, which is priced at just $10 per million output tokens. The API is the method provided by OpenAI to developers looking to use OpenAI models in their apps and services.
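To see how those per-token prices add up, here is a rough back-of-the-envelope sketch. The function and the token count are illustrative, not part of any OpenAI SDK; the prices are the per-million output-token rates cited above.

```python
def api_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Estimate the cost of generating output tokens at a per-million-token rate."""
    return output_tokens / 1_000_000 * price_per_million

# Generating 2 million output tokens at the cited rates:
cost_gpt_4_5 = api_cost_usd(2_000_000, 150.0)  # GPT-4.5 at $150/M -> $300.00
cost_gpt_4o = api_cost_usd(2_000_000, 10.0)    # GPT-4o at $10/M   -> $20.00

print(cost_gpt_4_5 / cost_gpt_4o)  # the 15-fold difference
```

At any realistic volume, that multiplier dominates the decision of which model to call.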

“GPT-4.5 was probably the largest traditional LLM released during the first quarter of 2025. I estimated it to be a mixture-of-experts model with 5.4 trillion parameters,” says Alan D. Thompson, an AI consultant and analyst at Life Architect. “That kind of raw scale is difficult to justify with our current hardware limitations, and even more challenging to serve to a large user base now.”

On 14 April, OpenAI announced it would wind down GPT-4.5 access via the API after less than three months. GPT-4.5 will still be available, but only through the ChatGPT interface.

OpenAI made the announcement alongside the reveal of GPT-4.1, a more economical model priced at $8 per million tokens. OpenAI’s benchmarks show that GPT-4.1 isn’t quite as capable as GPT-4.5 overall, though it does perform better in some coding benchmarks.

OpenAI also released new reasoning models last week: o3 and o4-mini. The o3 model scores particularly well on benchmarks. Cost is once again a concern, however, as access to o3 via the API is priced at $40 per one million output tokens.

As competitors struggle, Google ascends

The middling reception of Llama 4 and GPT-4.5 left an opening for competitors—and they’ve pushed their advantage.

Meta’s rocky launch of Llama 4 is unlikely to move developers away from DeepSeek-V3, Google’s Gemma, and Alibaba’s Qwen2.5. These LLMs, which arrived in late 2024, are now the preferred open-weight models on LMArena and HuggingFace leaderboards. They’re competitive with or superior to Llama 4 on popular benchmarks, inexpensive to access via an API, and in some cases available to download and use on consumer-grade computer hardware.

But it’s Google’s new leading-edge LLM, Gemini 2.5 Pro, that really turned heads.

Released on 25 March, Google Gemini 2.5 Pro is a “thinking model,” similar to OpenAI’s o1 and DeepSeek-R1, which uses self-prompting to reason through tasks. Gemini 2.5 Pro is multimodal, has a context window of one million tokens, and supports deep research.

Gemini 2.5 quickly racked up benchmark wins including the top spot in SimpleBench (though it lost that to OpenAI’s o3 on 16 April), and on Artificial Analysis’s combined AI Intelligence Index. Gemini 2.5 Pro currently sits at the top of LMArena, as well. As of 14 April, Google models have nabbed 5 of the top 10 slots on LMArena (this includes Gemini 2.5 Pro, three variants of Gemini 2.0, and Gemma 3-27B).

Strong performance would be enough to attract attention, but Google is also a price leader. Google Gemini 2.5 is currently free to use through Google’s Gemini app and through Google’s AI Studio website. Google’s API pricing is also competitive; Gemini 2.5 Pro is priced at $10 per one million output tokens and Gemini 2.0 Flash is priced at just 40 cents per one million tokens.

“Honestly, when it comes to high volume, I probably end up using DeepSeek-R1 or Google Gemini for reasoning. I’ll use OpenAI, but I feel I must be more conscious in terms of the price,” says Lorica.

Of course, this isn’t to say Meta and OpenAI are sunk. OpenAI in particular has room to maneuver thanks to the popularity of ChatGPT, which reportedly now has one billion users. Still, Gemini’s strong rankings and benchmark performance show the winds of change are blowing in the world of LLMs—and they currently favor Google.
