Big players in AI are developing synthetic data to fill in the gaps for traffic modeling, trend analysis, autonomous vehicles and more. Today, as Christopher Court-Dobson discovers, this tech is at the center of a growing debate: can we really trust artificially generated facts?
It’s 2025 and the data used to construct pictures of our transportation world and predict the future is in high demand. Collected from sensors, cameras, mobile phones and scraped off social media, each wave of traffic technology innovation is more dependent on a steady, broad stream of data than the last. But what happens when we don’t have enough?
“Traditional rule-based simulation models, which are largely Western-centric in design, require large amounts of data for validation, and sensing technology is expensive to deploy,” says Allessandro Tricamo, a Dubai-based partner in the Transportation Practice at global consultancy Oliver Wyman.
This puts traffic managers in a bind. Without comprehensive coverage, simulations are prey to biases and inaccuracy.
But municipal budgets can only be stretched so far. Furthermore, the computer vision that underpins much of this data harvesting is still prone to inaccuracies, especially under outlier conditions. Synthetic data is proposed as an answer to many of these challenges.
What is synthetic data?
“Synthetic data is created through generative AI rather than real-life captured data. Now you have companies like NVIDIA developing Omniverse and Replicator, which essentially construct a digital twin environment,” says CVEDIA’s head of partnerships Natalia Simanovsky.
Two major applications for synthetic data have emerged in traffic technology. The first is generating artificial location and vector data, which can be used to derive anything from average speeds to vehicles per hour. The second is based in image and video: traffic camera footage serves as the source material for a digital twin, which then produces synthetic data by simulating that environment under a wide variety of conditions.
“Some researchers have found that one of the most effective methods to overcome the limitations of real-world data is to create synthetic datasets. Its effectiveness lies in the ability to carefully control and manipulate various environmental factors, scenarios, and data distributions that are difficult or even impossible to achieve in real-world data collection,” says Zhihang Song, Department of Automation, Tsinghua University.
With real-life data as input, the AI neural network outputs a prediction: a data point or image that resembles the input. This synthetic data is then tested against real-world data to gauge how accurate the simulation is and, if it holds up, it can be used to ‘fill in the gaps’.
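To make that concrete, here is a minimal, hedged sketch of the idea: a simple generative model is fitted to real observations, then sampled to pad out a sparse record. The spot-speed figures are toy numbers invented for illustration, not any vendor's pipeline.

```python
# Illustrative sketch only: fit a simple generative model to observed spot
# speeds, then sample synthetic values to fill gaps in the real-world record.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Toy "real" spot-speed observations (km/h) standing in for sensor data.
real_speeds = rng.normal(loc=52, scale=9, size=500).clip(5, 120)

# Fit a kernel density estimate to the observed distribution...
model = gaussian_kde(real_speeds)

# ...and sample as many synthetic speeds as the gaps require.
synthetic_speeds = model.resample(2000).ravel().clip(5, 120)

print(f"real mean/std:      {real_speeds.mean():.1f} / {real_speeds.std():.1f}")
print(f"synthetic mean/std: {synthetic_speeds.mean():.1f} / {synthetic_speeds.std():.1f}")
# The synthetic sample mirrors the shape of the real data; whether it is fit
# for purpose still has to be checked against fresh real-world measurements.
```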
“It is an incredibly cost effective, affordable technology that can be used over and over again that cuts the cost for R&D for large companies,” says Simanovsky.
Capture equipment is expensive and the process time consuming, and further labor is spent labeling the data. All of this adds up to significant costs, which can become a limiting factor. Synthetic data expands the usability of existing data at low cost, and is ‘self-labeling’ rather than requiring human input. However, the initial setup can be skill- and time-intensive.
“When the model is being developed in the R&D center, you’ve absolutely got tons of humans in the loop,” says Simanovsky.
After the initial effort, the benefits accumulate rapidly thanks to the sheer volume of data that can be generated.
Open access
A large variety of synthetic datasets are publicly available. The earliest, FRIDA, created by Tarel et al in 2010 using the SiVIC simulation software, comprised foggy versions of road-scene images, each paired with a depth map, allowing for more accurate estimates of road conditions in fog.
“In the early years, synthetic datasets were primarily focused on a particular autonomous driving perception task, such as optical flow estimation or semantic segmentation. However, as the generation techniques advanced, the synthetic datasets became more diverse and applicable to a broader range of tasks,” says Song.
Virtual KITTI, based on the real KITTI data sequences, is an example of a multi-task synthetic dataset. Its 17,000 high-resolution frames carry multi-task annotations for 2D object detection, depth estimation, optical flow estimation, pixel-level semantic and instance segmentation, and multi-object tracking.
Each image can be manipulated through different camera angles, lighting conditions, and clear, cloudy, foggy and heavy-rain weather conditions.
Video games such as GTA-V featured in early synthetic dataset experiments, and were even modified to produce simulated lidar. Game engines such as Unity3D have also been used, for ParallelEye and Virtual KITTI 2. Eventually, specialist simulation platforms emerged, including one called CARLA.

The most recent synthetic datasets and accompanying domain suites are defined by the multiple types of data they combine, reflecting the enhanced sensing capabilities of modern vehicles.
SHIFT, AIODrive and V2X-Sim, for example, include multi-view RGB cameras, 128-channel lidar, GPS/GNSS, optical flow sensors, depth cameras and IMU sensors, in combination with roadside unit sensors, SPAD lidar and mmWave radar.
What’s the controversy?
Ineffective use of synthetic data can exacerbate existing biases within the system. Conventional wisdom states that the solution to biased data is more data; however, this is not necessarily correct. If the real-life data carries a bias, that bias is simply reproduced at greater volume.
“AI is still a black box. Nobody, not the professors, not the experts, not Apple, not Amazon, nobody can tell you why the neural network detected the object with a camera tilt of 30° and not at 40°,” says Simanovsky.
In a process known as model drift, bias is further magnified by the use of synthetic data, because the synthetic data resembles the existing real-life data, including its flaws. Once that bias has set in, it is difficult to identify from the tools themselves, as they reflect the bias (see Varieties of Bias, below). Human insight, human intelligence and human investigative drive must be engaged to root out the issue, and only then can steps be taken to correct it, which might mean retraining the AI from scratch.
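A toy simulation can illustrate the mechanism. In the hedged sketch below (all numbers are invented), sensors under-record motorcycles, a generator fitted to that record reproduces the same imbalance, and repeatedly folding synthetic detections back into the training mix never recovers the true share.

```python
# Toy simulation (illustrative only): synthetic data reproduces, rather than
# repairs, a bias in the source data.
import numpy as np

rng = np.random.default_rng(0)

true_motorcycle_share = 0.12      # what actually drives on the road
observed_share = 0.05             # what the biased sensors happened to record

real_counts = rng.binomial(1, observed_share, size=1_000)  # 1 = motorcycle

share = real_counts.mean()
for generation in range(5):
    # Each round, a generator fitted to the current data produces synthetic
    # detections with the same class balance it has seen, flaws included.
    synthetic = rng.binomial(1, share, size=10_000)
    combined = np.concatenate([real_counts, synthetic])
    share = combined.mean()
    print(f"round {generation + 1}: motorcycle share in training mix = {share:.3f}")

print(f"true share on the road: {true_motorcycle_share:.3f}")
# The share never climbs back toward 0.12: the extra volume simply replicates
# the original sampling flaw at a larger scale.
```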
Is this the future?
Synthetic data is evaluated through reasonableness tests, in which its predictions are compared with new real-life data to see how accurate they are. What works and what doesn’t provides clues for the data design process. However, this requires skilled and knowledgeable data scientists to continually improve it.
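The sketch below shows what a minimal reasonableness test of this kind might look like, assuming you already have an array of synthetic traffic counts and a batch of freshly collected real ones; both here are toy Poisson samples standing in for the real thing.

```python
# Minimal reasonableness test: compare the distribution of synthetic counts
# against freshly collected real observations.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
synthetic_flows = rng.poisson(lam=310, size=5_000)   # toy synthetic vehicles/hour
fresh_real_flows = rng.poisson(lam=300, size=400)    # toy newly observed counts

stat, p_value = ks_2samp(synthetic_flows, fresh_real_flows)
print(f"KS statistic: {stat:.3f}  p-value: {p_value:.4f}")
# A large statistic (and tiny p-value) is a signal that the synthetic
# distribution has drifted from reality and the data design needs revisiting.
```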
“In general, the data in synthetic datasets and real datasets are distributed in two domains with large differences. Although the domain gap can be narrowed, the model trained using the synthetic datasets still needs to be fine-tuned on real datasets to learn the features of both domains simultaneously and apply them to real scenarios,” says Song.
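In practice that recipe looks something like the hedged PyTorch sketch below: pretrain on a large synthetic set, then fine-tune on a smaller real set at a lower learning rate. The tensors are random placeholders standing in for actual features and labels.

```python
# Sketch of the pretrain-on-synthetic, fine-tune-on-real recipe (placeholder data).
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(model, x, y, lr, epochs):
    opt = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Large synthetic set, small real set (random stand-ins).
x_syn, y_syn = torch.randn(5000, 16), torch.randn(5000, 1)
x_real, y_real = torch.randn(200, 16), torch.randn(200, 1)

pre_loss = train(model, x_syn, y_syn, lr=1e-3, epochs=50)    # pretrain on synthetic
ft_loss = train(model, x_real, y_real, lr=1e-4, epochs=20)   # fine-tune on real data
print(f"pretrain loss: {pre_loss:.3f}, fine-tune loss: {ft_loss:.3f}")
```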
The relative scarcity of safety-critical data remains an issue. Scenario engineering tools such as SafeBench, running on the CARLA platform, have been developed to aid this process by simulating a wide variety of challenging driving scenarios.
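The snippet below shows, in broad strokes, how such a challenging scenario can be scripted directly against the CARLA Python API. This is a generic illustration rather than SafeBench itself, and it assumes the CARLA client library is installed and a simulator is listening on localhost:2000.

```python
# Generic CARLA sketch: crank up fog and rain, then drop a vehicle into the scene.
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Rare, safety-critical weather that is hard to capture on real roads.
weather = carla.WeatherParameters(
    cloudiness=90.0,
    precipitation=80.0,
    fog_density=60.0,
    sun_altitude_angle=10.0,
)
world.set_weather(weather)

# Spawn a vehicle on autopilot to drive through the scenario.
blueprints = world.get_blueprint_library().filter("vehicle.*")
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprints[0], spawn_point)
vehicle.set_autopilot(True)
```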

Bias, and the overall ‘black box’ nature of synthetic data and AI in general, continue to slow progress despite the clear benefits the technology can provide. As a new technology, it still has much to prove in terms of efficacy, safety and efficiency.
But it is the sheer volume, and specificity, of data needed to make simulation and computer vision run accurately that is prompting the shift to synthetic data. It is unfeasible to capture every possible variety of physical and environmental condition in a cost- and time-effective way, and synthetic data, when correctly deployed, is producing results. The challenges are many, and the demands on human skills, expertise and capacity for lateral thinking are high.
“Better data quality and timely availability is a must, and synthetic data’s potential to deliver timely, tailored and targeted interventions is immense,” concludes Tricamo.
VARIETIES OF BIAS
The traffic management professional who wants to truly master synthetic data, reaping its benefits while avoiding its pitfalls, must become an expert in bias, which falls into several categories.
Sampling bias occurs when the way data is collected gives a skewed picture. For instance, data gathered from GPS points excludes those who don’t carry mobile phones; they might be older, or making shorter journeys, creating differences from the real-world situation (a simple illustration follows these definitions).
Temporal bias usually occurs when data is harvested at irregular intervals or during unrepresentative periods. For instance, data collected during the pandemic will give an inaccurate picture of road users’ journey habits.
Measurement bias is a systematic error: for instance, a particular camera that was improperly calibrated and overestimated the speed of every vehicle it detected until the error was corrected.
Geographic bias can occur when behaviors from one area are generalized to a wider area, as when data is only harvested from the center of a city, but used to simulate journeys to and from the periphery.
Demographic bias and gender bias. A classic example is that men tend to drive more on average, so data is skewed towards male driving patterns; there are also known differences between the routes that men and women choose, with women generally prioritizing safety.
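To make one of these concrete, here is a toy numpy illustration of the sampling bias described above: if the GPS-derived sample misses travelers without smartphones, who tend to make shorter trips, the estimated average trip length comes out too high. All numbers are invented.

```python
# Toy illustration of sampling bias in GPS-derived trip data.
import numpy as np

rng = np.random.default_rng(7)

# Whole travelling population: trip lengths in km.
phone_users = rng.gamma(shape=3.0, scale=4.0, size=8_000)      # longer trips
non_phone_users = rng.gamma(shape=2.0, scale=2.5, size=2_000)  # shorter trips
population = np.concatenate([phone_users, non_phone_users])

# The GPS-derived dataset only ever sees the phone users.
gps_sample = phone_users

print(f"true mean trip length:       {population.mean():.1f} km")
print(f"GPS-sample mean trip length: {gps_sample.mean():.1f} km")
# Any synthetic data generated from the GPS sample inherits this overestimate.
```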
SYNTHETIC DATA FOR AUTONOMOUS VEHICLES
Researchers at Waymo found that more data in general did not necessarily increase the safety performance of autonomous vehicles. Part of the problem was a comparative lack of safety-critical data: according to the California Department of Motor Vehicles, such conditions occur only once every 30,000 miles of driving.
More training data for challenging or dangerous driving conditions led to significantly enhanced performance, reducing collisions by 15% and increasing route adherence by 10%. Waymo’s solution was to run simulations that develop a difficulty model, i.e. one that intelligently assesses the challenge of each sample.
Difficult sections were then oversampled, while routine driving conditions were downsampled, so that the two were represented equally in the dataset. This approach achieved improved results while using only 10% of the original training data, again underlining that not all data is equally useful.
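A minimal sketch of that resampling idea might look like the following. This is not Waymo's actual pipeline; the difficulty scores and the threshold are invented for illustration.

```python
# Difficulty-based resampling: oversample hard samples, downsample easy ones,
# so both ends up equally represented in a much smaller training set.
import numpy as np

rng = np.random.default_rng(3)

n_samples = 100_000
difficulty = rng.beta(1, 9, size=n_samples)   # most driving is easy, a little is hard
is_hard = difficulty > 0.5                    # toy threshold for "safety-critical"

hard_idx = np.flatnonzero(is_hard)
easy_idx = np.flatnonzero(~is_hard)

# Build a balanced training set that is only 10% the size of the original.
budget = n_samples // 10
hard_resampled = rng.choice(hard_idx, size=budget // 2, replace=True)   # oversample
easy_resampled = rng.choice(easy_idx, size=budget // 2, replace=False)  # downsample
balanced = np.concatenate([hard_resampled, easy_resampled])

print(f"hard share before: {is_hard.mean():.3%}")
print(f"hard share after:  {np.isin(balanced, hard_idx).mean():.3%}")
```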

ENHANCING ENFORCEMENT
Working closely with NVIDIA, Smartcow has developed AI-enabled traffic management software called Roadmaster. Drawing on multiple cameras, it uses AI to identify driver mobile phone use, missing seatbelts, speeding and other rule violations. It can also identify vehicle type, color and make, and integrates ALPR (automatic license plate recognition) for enforcement.
Roadmaster makes use of synthetic data to train its AI recognition software, via its License Plate Synthetic Data Generator (LP-SDG). This simulates a wide variety of possible environmental conditions to challenge its PlateReader engine, helping it to learn faster.
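As a rough illustration of what such a generator does (this sketch is not SmartCow's LP-SDG), the snippet below renders a crude plate-like image and then applies randomized lighting, blur and noise so a recognition model sees conditions that are rare in real footage. The plate text and parameters are made up.

```python
# Hand-rolled sketch of environmental variation for synthetic plate images.
import numpy as np
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter

rng = np.random.default_rng(5)

def render_plate(text: str) -> Image.Image:
    """Draw a crude white plate with black text."""
    img = Image.new("RGB", (200, 60), "white")
    ImageDraw.Draw(img).text((20, 20), text, fill="black")
    return img

def degrade(img: Image.Image) -> Image.Image:
    """Apply random brightness, blur and sensor-style noise."""
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.4, 1.6))
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.0, 2.5)))
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0, 12, size=arr.shape)   # additive noise
    return Image.fromarray(arr.clip(0, 255).astype(np.uint8))

# Generate a small batch of degraded variants of one made-up plate.
variants = [degrade(render_plate("AB12 CDE")) for _ in range(8)]
variants[0].save("plate_variant_0.png")
```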
This article first appeared in the March 2025 edition of TTi magazine