OpenAI's highly anticipated return to the "open" aspect of its name happened yesterday with the launch of two new large language models (LLMs): gpt-oss-120B and gpt-oss-20B.
Although the models post benchmark scores comparable to OpenAI's advanced proprietary offerings, the initial response from the broader AI developer and user community has been decidedly mixed. If this release were a movie being rated on Rotten Tomatoes, the early reviews would split close to 50/50.
Here's some background: OpenAI has introduced these two new text-only language models (no image generation or analysis), both under the permissive open-source Apache 2.0 license—the first time since 2019 (prior to ChatGPT) that the company has done so with a state-of-the-art language model.
The entire ChatGPT era of the past two-plus years has been built on proprietary, closed-source models that OpenAI alone controlled: users paid for access (or used a limited free tier), customization was minimal, and there was no way to run the models offline or on private computing hardware.
But this changed with yesterday's release of the gpt-oss models: the larger and more powerful of the two runs on a single Nvidia H100 GPU, suitable for a small or medium-sized enterprise's data center, while the smaller one runs on a single consumer laptop or desktop PC, like those found in home offices.
Because the models are brand new, the AI power-user community has spent the intervening hours independently testing them against its own benchmarks and tasks.
The feedback now arriving spans a wide spectrum, from enthusiasm about the potential of these powerful, free, and efficient new models to dissatisfaction and dismay over perceived significant issues and limitations. The criticism is sharpest when the models are compared with the recent wave of similarly Apache 2.0-licensed, powerful open-source multimodal LLMs from Chinese startups, which U.S. companies (and companies worldwide) can likewise adapt, customize, and run locally on their own hardware.
High benchmarks, but still behind Chinese open-source leaders
Intelligence benchmarks place the gpt-oss models ahead of most American open-source options. According to independent third-party AI benchmarking firm Artificial Analysis, gpt-oss-120B is "the most intelligent American open weights model," though it still trails Chinese leaders such as DeepSeek's R1 and Alibaba's Qwen3 235B.

"In retrospect, that's all they did. Mogged on benchmarks," wrote self-proclaimed DeepSeek "stan" @teortaxesTex. "No good derivative models will be trained… No new use cases created… Barren claim to bragging rights."
This skepticism is echoed by pseudonymous open-source AI researcher Teknium (@Teknium1), co-founder of rival open-source AI model provider Nous Research, who described the release on X as "a legitimate nothing burger" and predicted a Chinese model will soon surpass it. "Overall very disappointed and I legitimately approached this with an open mind," they wrote.
Bench-maxxing on math and coding at the expense of writing?
Other criticisms targeted the gpt-oss models' apparent limited usefulness.
AI influencer "Lisan al Gaib (@scaling01)" highlighted that the models excel at math and coding but "completely lack taste and common sense." He added, "So it's just a math model?"
In creative writing tests, some users noticed the model inserting equations into poetic outputs. "This is what happens when you benchmarkmax," Teknium commented, sharing a screenshot where the model added an integral formula mid-poem.
And @kalomaze, a researcher at decentralized AI model training company Prime Intellect, commented that “gpt-oss-120b knows less about the world than what a good 32b does. probably wanted to avoid copyright issues so they likely pretrained on majority synth. pretty devastating stuff.”
Former Googler and independent AI developer Kyle Corbitt concurred that the gpt-oss pair of models seemed to be trained predominantly on synthetic data — that is, data generated by an AI model specifically for training another one — making it “extremely spiky.”
It’s “excellent at the tasks it’s trained on, but poor at everything else,” Corbitt noted, i.e., excellent on coding and math problems, but poor at more linguistic tasks such as creative writing or report generation.
Essentially, the allegation is that OpenAI deliberately trained the model on more synthetic data than real-world facts and figures in order to avoid copyrighted material scraped from websites and other repositories it doesn't own or have a license to use. OpenAI and many other leading AI companies have faced such accusations in the past, and several have led to ongoing lawsuits.
Others speculated OpenAI might have trained the model primarily on synthetic data to avoid safety and security issues, resulting in a decrease in quality compared to if it had been trained on more real-world (and presumably copyrighted) data.
Concerning third-party benchmark results
Evaluations on third-party benchmarks have also surfaced metrics that some users find worrying.
SpeechMap, which measures how willingly LLMs comply with user prompts for disallowed, biased, or politically sensitive outputs, put gpt-oss-120B's compliance score under 40%, near the bottom of its open-model peers. That suggests the model frequently refuses user requests and defaults to guardrails, potentially at the expense of delivering accurate information.
In Aider’s Polyglot evaluation, a coding benchmark spanning multiple programming languages, gpt-oss-120B scored just 41.8%, far below competitors like Kimi-K2 (59.1%) and DeepSeek-R1 (56.9%).
Some users also reported their tests showed the model is oddly resistant to generating criticism of China or Russia, contrasting with its treatment of the US and EU, raising questions about bias and training data filtering.
Other experts have applauded the release and what it signals for U.S. open source AI
Not all commentary is negative, though. Software engineer and dedicated AI observer Simon Willison called the release “really impressive” on X, elaborating in a blog post on the models’ efficiency and their ability to achieve parity with OpenAI’s proprietary o3-mini and o4-mini models.
He praised their strong performance on reasoning and STEM-heavy benchmarks, and commended the new “Harmony” prompt template format — which provides developers more structured terms for guiding model responses — and support for third-party tool use as significant contributions.
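To make the "Harmony" format concrete: as a rough illustration (paraphrased from OpenAI's published harmony spec; exact token and channel names may differ in the shipped implementation), a Harmony-formatted conversation tags each message with a role, and assistant turns carry a "channel" that separates chain-of-thought reasoning from the final answer:

```text
<|start|>system<|message|>You are a helpful assistant.<|end|>
<|start|>user<|message|>What is 2 + 2?<|end|>
<|start|>assistant<|channel|>analysis<|message|>Simple arithmetic: 2 + 2 = 4.<|end|>
<|start|>assistant<|channel|>final<|message|>4<|end|>
```

The structured roles and channels let developers route or hide the model's internal reasoning (the "analysis" channel) while surfacing only the "final" channel to end users.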
In a lengthy X post, Clem Delangue, CEO and co-founder of AI code-sharing and open-source community Hugging Face, encouraged users to avoid rushing to judgment, noting that inference for these models is complex, and early issues could stem from infrastructure instability and insufficient optimization among hosting providers.
“The power of open-source is that there’s no cheating,” Delangue wrote. “We’ll uncover all the strengths and limitations… progressively.”
More cautiously, Ethan Mollick, a professor at the University of Pennsylvania's Wharton School, wrote on X that "The US now likely has the leading open weights models (or close to it)", but questioned whether this is a one-time effort by OpenAI. "The lead will evaporate quickly as others catch up," he noted, adding that it's uncertain what incentive OpenAI has to keep the models updated.
Nathan Lambert, a prominent AI researcher at the rival open-source lab Allen Institute for AI (Ai2) and commentator, highlighted the symbolic importance of the release on his blog Interconnects, calling it “a phenomenal step for the open ecosystem, especially for the West and its allies, that the most renowned brand in the AI space has returned to openly releasing models.”
However, he cautioned on X that gpt-oss is “unlikely to significantly slow down [Chinese e-commerce giant Alibaba’s AI team] Qwen,” citing its usability, performance, and diversity.
He argued the release marks an important shift in the U.S. toward open models, but OpenAI still has a “long path back” to catch up in practical terms.
A split verdict
The verdict, for the time being, is divided.
OpenAI’s gpt-oss models are a landmark in terms of licensing and accessibility.
However, while the benchmarks appear solid, the real-world "vibes" — as many users describe it — are proving less compelling.
Whether developers can build robust applications and derivatives on top of gpt-oss will determine whether the release is remembered as a breakthrough or a mere blip.
