Researchers from the University of Pennsylvania and the Allen Institute for Artificial Intelligence have introduced an innovative tool that enables open-source AI systems to match or exceed the visual comprehension capabilities of proprietary models like GPT-4V and Gemini 1.5 Flash. This tool has the potential to transform the competitive dynamics between open and closed AI development.
Named CoSyn (Code-Guided Synthesis), this tool addresses a significant challenge in AI development: the shortage of high-quality training data needed for machines to understand intricate visual information such as scientific charts, medical diagrams, and financial documents. Instead of collecting millions of images from the internet—a method fraught with copyright and ethical issues—CoSyn utilizes the coding capabilities of existing language models to produce synthetic training data.
“We lack data to train the model. We need documents, charts with detailed annotations to train a vision language model for question answering over these images,” explained Yue Yang, a recent Penn Engineering Ph.D. graduate and co-first author of the research, in an exclusive interview with VentureBeat. “These images are actually more challenging to annotate compared to natural photos, like a picture of a dog, a cat, or a house.”
This breakthrough occurs as businesses increasingly demand AI systems that can comprehend and reason about complex visual information—skills essential for everything from automated document processing to AI agents that navigate digital interfaces independently. The research took place during Yang’s internship with the PRIOR team at the Allen Institute for AI, supported by the Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity, and the Defense Advanced Research Projects Agency.
How synthetic data generation solves AI’s biggest training challenge
Training AI to understand text-rich images has long been a challenge in the field. Unlike natural photographs, scientific figures, charts, and documents require extensive annotation, which is both time-consuming and costly. Traditional methods have relied on gathering images and their alt-text descriptions from the internet, but this approach often results in superficial and legally questionable training data.
CoSyn adopts a fundamentally different approach by recognizing that most text-rich images are initially created through code—Python scripts generate charts, LaTeX renders mathematical equations, HTML creates web interfaces. The research team’s insight was to reverse this process: using language models’ proven coding skills to generate the underlying code, which is then executed to create realistic synthetic images.
“One intuition is that images like charts and documents are rendered from programs or code, like using Python to generate charts. We use LaTeX or Word to write our documents,” Yang said. “So why not go the reverse way, generating the code since text-only language models have proven very good at writing code.”
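The reverse-generation idea can be sketched in a few lines. In this illustrative stand-in (not CoSyn's actual pipeline), a stubbed function plays the role of the language model: it writes rendering code, that code is executed to produce the image, and because the source data is known, a ground-truth question-answer annotation comes for free. The helper names and the SVG output format are assumptions for the sketch; CoSyn itself uses many rendering tools, including Matplotlib and LaTeX.

```python
# Sketch of code-guided synthesis. The "LLM" is stubbed out: it returns
# Python code that renders an SVG bar chart from the given data.
def llm_generate_render_code(data: dict) -> str:
    """Hypothetical stand-in for a language-model call."""
    bars = "".join(
        f'<rect x="{i * 40}" y="{200 - v}" width="30" height="{v}"/>'
        for i, v in enumerate(data.values())
    )
    return f'svg = \'<svg width="200" height="200">{bars}</svg>\''

data = {"Q1": 120, "Q2": 95, "Q3": 140}
code = llm_generate_render_code(data)

scope = {}
exec(code, scope)          # run the generated code to "render" the image
svg_image = scope["svg"]

# Because we wrote the underlying data, the annotation is automatic:
qa = {"question": "Which quarter is highest?",
      "answer": max(data, key=data.get)}
```

The key property this illustrates is that the annotation is derived from the same structured source as the image, so it is guaranteed accurate, unlike alt-text scraped from the web.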
Chris Callison-Burch, a computer science professor at Penn who co-advised the research, explained the approach more simply: “This is like taking a student who excels at writing and asking them to teach someone how to draw, just by describing what the drawing should look like. We’re essentially transferring the strengths of open-source AI from text to vision.”
CoSyn-trained models outperform GPT-4V and Gemini on key benchmarks
The results are remarkable. Using their synthetic dataset of 400,000 images and 2.7 million instruction pairs, models trained with CoSyn reached state-of-the-art performance among open-source systems, surpassing proprietary models on seven benchmark tests for text-rich image understanding.
Their 7-billion parameter model averaged a score of 80.9% across the benchmark suite, outperforming the previous best open-source model (Llama 3.2 11B) by 3.9 percentage points. Notably, even their “zero-shot” model—trained without any examples from the evaluation datasets—outperformed most open and closed models, demonstrating the transferability of capabilities learned from synthetic data.

In a particularly impressive demonstration, the researchers introduced a new benchmark called NutritionQA, comprising 100 questions about nutrition label photographs. With only 7,000 synthetically generated nutrition labels for training, their model outperformed others trained on millions of real images. “Despite being trained on millions of images, we find that open-source VLMs are not data-efficient and perform poorly on this novel task compared to GPT-4V,” the researchers noted in their paper.
Yang highlighted the importance: “Large corporations have vast resources for data collection and experimentation, but open source models can provide access to people, including the model weights, the data we trained on, or even the code and training scripts, allowing developers to build upon it.”
Real companies are already using vision AI for quality control and automation
The technology is already being applied in real-world scenarios across industries. Callison-Burch mentioned an example from one of his teaching assistants whose company uses vision-language models for cable installation quality assurance: “They have the workers on site take photographs of the installation process, and they use that to automatically verify that each step has been properly followed.”
This specialized visual understanding could revolutionize numerous enterprise workflows, from automated document processing in financial services to quality control in manufacturing. The ability to train models on specific visual tasks using synthetic data means companies can develop AI systems tailored to their particular needs without the extensive data collection efforts traditionally required.
For enterprise decision-makers, the research suggests a shift in AI data strategies. “I think synthetic data is a promising way to reduce human annotation efforts. It costs less, can automatically generate large-scale data, and avoids some copyright issues,” Yang noted.
The persona-driven approach that makes AI training data more diverse
One of CoSyn’s key innovations is its method of ensuring data diversity. To prevent repetitive outputs common in AI-generated content, the system employs what researchers term a “persona-driven mechanism.” Each time CoSyn generates a synthetic example, it pairs the request with a randomly sampled persona—a short description like “a sci-fi novelist constantly bouncing off ideas for new alien worlds” or “a chemistry teacher preparing lab materials.”
“Every time we generate one piece of synthetic data, we pair it with a randomly sampled persona,” Yang explained. “This diversifies the content and styles of the examples we generate, because, for example, if I provide the persona of a PhD student, it will generate something more scientific or related to academia.”
This approach allows the system to generate content across nine different categories: charts, documents, math problems, tables, diagrams, vector graphics, music sheets, electrical circuits, and chemical structures. The researchers used 11 different rendering tools, from Python’s Matplotlib for charts to LaTeX for mathematical expressions, supported by 20 specialized generation pipelines.
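The persona mechanism described above amounts to randomizing the prompt before each generation call. A minimal sketch, with illustrative personas and categories rather than the paper's actual lists:

```python
import random

# Each synthesis request is paired with a randomly sampled persona,
# so repeated runs yield varied prompts (and thus varied content/styles).
PERSONAS = [
    "a sci-fi novelist constantly bouncing off ideas for new alien worlds",
    "a chemistry teacher preparing lab materials",
    "a PhD student summarizing experimental results",
]
CATEGORIES = ["chart", "document", "table", "diagram"]

def build_prompt(rng: random.Random) -> str:
    persona = rng.choice(PERSONAS)
    category = rng.choice(CATEGORIES)
    return f"Write code that renders a {category} that {persona} might create."

rng = random.Random(0)          # seeded for reproducibility
prompts = {build_prompt(rng) for _ in range(20)}
# Random pairing yields many distinct prompts from a small persona pool.
```

Without the persona step, every request would be identical and the model would tend to produce near-duplicate outputs; the random pairing is what pushes the synthetic dataset toward diversity.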
Why this breakthrough could level the playing field between open source and Big Tech
The implications for the broader AI industry are profound. Major technology companies like OpenAI and Google have invested billions in developing their proprietary vision-language capabilities, with training methods and data sources remaining trade secrets. CoSyn offers a way for open-source alternatives to compete without requiring similar resource investments.
“Open source models are still behind closed source models, but with the collective efforts and resources from the open source community, we can catch up,” Yang said.
The commitment to openness goes beyond just releasing the model. The complete CoSyn codebase, the 400,000-image dataset, and all training scripts are publicly accessible, allowing researchers and companies worldwide to build upon the work. “From the academic perspective, much research relies on openness; we need access to data, code, everything to discover new findings and support our claims in papers,” Yang emphasized.
This transparency addresses growing concerns about the opaque nature of proprietary AI systems. “If you only rely on APIs from companies like OpenAI, it may not be reliable for validating scientific discoveries, because there may be unknowns in the backend,” Yang noted.
Teaching AI agents to click, scroll and navigate like humans
Beyond static image comprehension, CoSyn is pioneering capabilities crucial for the next generation of AI agents—systems that can autonomously navigate digital interfaces and perform complex tasks. The researchers developed synthetic “pointing data” that instructs models precisely where to click on screenshots, a fundamental requirement for web-based automation.
With 65,000 synthetic screenshots featuring click annotations, their model achieved state-of-the-art performance on ScreenSpot, a benchmark for click prediction, outperforming systems trained on 1.3 million real screenshots. “We only use around 100k synthetic screenshots, yet we can outperform previous models on millions of screenshots,” Yang said.
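Pointing data benefits from the same property as the chart example: when the screenshot layout is generated from code, the pixel coordinates of every UI element are known, so click targets can be annotated automatically rather than labeled by hand. A hedged sketch, with an invented two-button layout:

```python
from dataclasses import dataclass

@dataclass
class Element:
    """A UI element with a known bounding box (illustrative layout)."""
    label: str
    x: int
    y: int
    w: int
    h: int

    @property
    def center(self) -> tuple:
        # The natural click target is the center of the bounding box.
        return (self.x + self.w // 2, self.y + self.h // 2)

def make_pointing_example(elements, target_label):
    target = next(e for e in elements if e.label == target_label)
    return {"instruction": f"Click the '{target_label}' button",
            "click": target.center}

ui = [Element("Submit", 40, 300, 120, 32),
      Element("Cancel", 180, 300, 120, 32)]
example = make_pointing_example(ui, "Submit")
```

Scaling this generation loop across many synthetic layouts is what makes it possible to produce tens of thousands of click-annotated screenshots without any manual labeling.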
This capability is vital as the industry moves toward AI agents capable of performing knowledge work autonomously. “There are two prevailing models for implementing agents,” Callison-Burch explained. One approach uses specialized APIs, while the other relies on agents that “literally just use web browsing capabilities as you and I do.”
The vision-based approach, enabled by technologies like CoSyn, could prove more versatile: “You’re not just calling up a software function, which is relatively straightforward, but you actually have to take screenshots of the current state of the web browser, reason about where to click, and navigate your mouse to that location to click.”
How synthetic data sidesteps the growing copyright crisis in AI training
The synthetic data approach also offers a potential solution to mounting legal challenges surrounding AI training data. With ongoing litigation over whether training on copyrighted materials constitutes fair use, synthetic data generation provides an alternative that avoids many intellectual property concerns.
Callison-Burch, who testified before Congress on AI and copyright in 2023, views synthetic data as complementary to, rather than replacing, real-world training data: “I don’t think that synthetic data eliminates the need for having a wide array of diverse training data, as that remains a core element in training AI systems, but it does allow you to extend their capabilities in remarkable ways.”
The approach illustrates how existing knowledge can be transferred to new applications without directly using copyrighted materials. “The underlying principle here is that a large language model can write code, something it learned from its original data. We’re now applying that to a completely different application: the creation of new training data unlike any of the data it was originally trained on.”
The current limits of synthetic data and what comes next
Despite its promise, synthetic data generation faces significant limitations. “One limitation is it may inherit biases from the model that generates the synthetic data,” Yang acknowledged. The system can also struggle with diversity: “If you prompt a large network to generate data across different runs, it may produce similar data.”
The current research focuses on text-rich images rather than natural photographs, limiting its immediate applicability to some domains. “What about real photos or other natural images? It is challenging to generate synthetic data for such images, or even medical images like chest X-rays,” Yang noted, although she indicated ongoing efforts to extend the approach to medical imaging.
Looking ahead, Yang anticipates synthetic data generation becoming standard practice: “In the future, in two or three years, synthetic data will be a crucial component for teaching models different capabilities.” However, she emphasized that optimal results will likely require combining synthetic and real-world data: “Real-world data reflects real-world distributions, while synthetic data can be large-scale and more controllable.”
Early adopters from Meta to Amazon are already experimenting with the technology
Early adoption signals suggest the technology is already impacting industry practices. “I heard that companies like Meta and some teams at Amazon are trying to use our data to train their models,” Yang revealed during the interview.
For startups and smaller companies, the cost advantages could be particularly significant. “For some startups, it is cheaper to host their own open model on their server rather than relying on APIs, which offer less control,” Yang noted.
The research team’s decision to make everything open source reflects a broader philosophy about AI development. As Yang prepares to join the Allen Institute full-time after completing her Ph.D., the commitment to open science remains central to their mission. “Currently, vision language models are quite brittle. They just need the right data to gain the right capabilities,” she said. “If you find the right data, you can enhance the model's capability, benefiting society.”
The vision for AI that acts, not just describes
As the research transitions from academic labs to real-world applications, the implications extend far beyond improved benchmark scores. Yang and her colleagues are already envisioning applications that could transform how people with disabilities interact with technology, from AI that understands sign language for the hearing impaired to systems that describe complex medical images for those with visual impairments.
“I have an idea for the model to understand sign language or assist those with hearing difficulties,” Yang said, describing potential future applications.
Callison-Burch sees even broader possibilities, particularly in robotics and scientific discovery: “Synthetic data opens up many possible applications that lack naturally occurring data. One project Yang has worked on at the Allen Institute involves creating simulated training data for robots.”
The work is more than just a technical achievement—it’s a demonstration that open-source AI development can rival the well-funded efforts of major technology companies through innovative approaches to fundamental challenges. As Yang reflected on her choice to join the Allen Institute instead of accepting higher-paying offers from companies like Meta: “I think it’s still an early stage for multimodal models, and there are not many open resources or knowledge to share with the community.”
The message is clear: in the race to build AI that can truly see and understand the world, the advantage may not always go to those with the deepest pockets, but to those with the most creative solutions.
