
Andrew Miller
Waymos and Cybercabs see the world through very different sensors. Which technology wins out will determine the future of self-driving vehicles.
Picture a fall afternoon in Austin, Texas. The city is experiencing a sudden rainstorm, common there in October. Along a wet and darkened city street drive two robotaxis. Each has passengers. Neither has a driver.
Both cars drive themselves, but they perceive the world very differently.
One robotaxi is a Waymo. From its roof, a mounted lidar rig spins continuously, sending out laser pulses that bounce back from the road, the storefronts, and other vehicles, while radar signals emanate from its bumpers and side panels. The Waymo uses these sensors to generate a detailed 3D model of its surroundings, detecting pedestrians and cars that human drivers might struggle to see.
In the next lane is a Tesla Cybercab, operating in unsupervised full self-driving mode. It has no lidar and no radar, just eight cameras housed in pockets of glass. The car processes these video feeds through a neural network, identifying objects, estimating their dimensions, and planning its path accordingly.
This scenario is only partially imaginary. Waymo already operates, in limited fashion, in Austin, San Francisco, Los Angeles, Atlanta, and Phoenix, with announced plans to operate in many more cities. Tesla launched an Austin pilot of its robotaxi business in June 2025, albeit using Model Y vehicles with safety monitors rather than the still-in-development Cybercab. The outcome of their competition will tell us much about the future of urban transportation.
The engineers who built the earliest automated driving systems would find the Waymo unsurprising. For nearly two decades after the first automated vehicles emerged, a consensus prevailed: To operate safely, an AV required redundant sensing modalities. Cameras, lidar, and radar each had weaknesses, but they could compensate for each other. That consensus is why those engineers would find the Cybercab so remarkable. In 2016, Tesla broke with orthodoxy by embracing the idea that autonomy could ultimately be solved with vision and compute and without lidar — a philosophical stance it later embodied in its full vision-only system. What humans can do with their eyeballs and a brain, the firm reasoned, a car must also be able to do with sufficient cameras and compute. If a human can drive without lidar, so, too, can an AV… or so Tesla asserts.
This philosophical disagreement will shortly play out before our eyes in the form of a massive contest between AVs that rely on multiple sensing modalities — lidar, radar, cameras — and AVs that rely on cameras and compute alone.
The stakes of this contest are enormous. The global taxi and ride-hailing market was valued at approximately $243 billion in 2023 and is projected to reach $640 billion by 2032. In the United States alone, people take over 3.6 billion ride-hailing trips annually. Converting even a fraction of this market to AVs represents a multibillion-dollar opportunity. Serving just the American market, at maturity, will require millions of vehicles.
Given the scale involved, the cost of each vehicle matters. The figures are commercially sensitive, but it is certainly true that cameras are cheaper than lidar. If Tesla’s bet pays off, building a Cybercab will cost a fraction of what it will take to build a Waymo. Which vision wins out has profound implications for how quickly each company will be able to put vehicles into service, as well as for how quickly robotaxi service can scale to bring its benefits to ordinary consumers across the United States and beyond.
To understand how this cleavage between sensor-fusion and vision-only approaches emerged, we must begin with the earliest breakthroughs in driving automation.
Early computer driving (1994–2003)
Fantasies of self-driving vehicles are ancient, appearing in Aristotle’s Politics and The Arabian Nights. But the clearest antecedent to today’s robotaxis first emerged in 1994, when German engineer Ernst Dickmanns installed a rudimentary automated driving system into two Mercedes sedans.
Dickmanns’ sedans were able to drive on European highways at speeds up to 130 kilometers per hour while maintaining their lane position and even executing passing maneuvers in traffic. Dickmanns had been testing prototypes on closed streets since the 1980s, and by 1995 his team was ready to demonstrate their system on a 1,600-kilometer open-street journey, driving autonomously 95% of the time.
The vehicles sensed the world using two sets of forward-facing video cameras: one pair with wide-angle lenses for short-range peripheral vision and another pair with telephoto lenses for long-range detail. Cameras in 1995 were reasonably fit for Dickmanns’ purpose. The chief bottleneck his system faced was in computer capacity. His work-around involved what he called, grandly, “4-D dynamic vision”: algorithms that efficiently processed visual data by focusing limited computational resources on specific regions of interest, much like human visual attention.
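The general idea can be sketched in a few lines of modern Python. This is only an illustration of attention-style processing, not Dickmanns' actual implementation: a motion model predicts where each tracked object should appear, and only those small image patches are analyzed, so the compute budget scales with the number of tracked objects rather than with the full frame.

```python
# Illustrative sketch of region-of-interest processing in the spirit of
# 4-D dynamic vision; the code is hypothetical, not Dickmanns' software.
import numpy as np

def crop_roi(frame: np.ndarray, center: tuple[int, int], size: int = 64) -> np.ndarray:
    """Cut a small window around a predicted object location."""
    y, x = center
    half = size // 2
    return frame[max(0, y - half): y + half, max(0, x - half): x + half]

def process_frame(frame: np.ndarray, predicted_centers: list[tuple[int, int]]) -> list[float]:
    """Analyze only the patches where tracked objects are expected to appear."""
    measurements = []
    for center in predicted_centers:
        patch = crop_roi(frame, center)
        measurements.append(float(patch.mean()))  # stand-in for feature extraction
    return measurements

# A 768x576 grayscale frame with two hypothetical tracked objects.
frame = np.random.rand(576, 768)
print(process_frame(frame, predicted_centers=[(300, 400), (280, 650)]))
```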
Despite the vehicles’ impressive achievements, Dickmanns was candid about the limitations of 4-D dynamic vision. It could be confused by lane markings — the cameras could “see” only in black and white, and so were blind to information conveyed by color, like yellow lines painted over white ones in construction zones. It also struggled when lighting conditions changed.
Most importantly, 4-D dynamic vision failed when road conditions changed suddenly, such as when another car cut sharply into the lane ahead. Relying only on cameras to model the world around it, the system had to measure distance via motion parallax, looking for differences in the size or position of objects in two frames taken at different times.
This was a reasonable approach for a vehicle in its own lane that the automated driving system might slowly overtake. But it was dangerously unsafe for cars that suddenly entered the lane ahead. Without stereo vision or other range-finding sensors, the car needed several video frames to model the world accurately, which posed great risks when the car and its neighbors were moving at autobahn speeds.
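The geometry behind that delay can be made explicit with the standard pinhole-camera relation (a back-of-the-envelope model, not Dickmanns' published equations). A single moving camera recovers depth the way a stereo pair does, except that its baseline is the distance it has traveled between frames; the second relation shows that any pixel-level error in measuring the image shift produces a depth error that grows with the square of the distance, which is why reliable estimates require accumulating several frames.

```latex
% Depth from two views (stereo or motion parallax), pinhole approximation.
% Z: distance to the object   f: focal length (in pixels)
% B: baseline between the two views   d: the object's image shift (disparity)
Z \approx \frac{f\,B}{d},
\qquad
\Delta Z \approx \frac{Z^{2}}{f\,B}\,\Delta d .
```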
Dickmanns’ work suggested that the physics of visual perception imposed fundamental constraints that the algorithms of the day couldn't overcome. Other modalities were required.
DARPA and sensor fusion (2004–2016)
Amid the wars in Afghanistan and Iraq, Pentagon leaders increasingly looked to automation as a way to keep American soldiers out of harm’s way. Congress had already directed the military, in the 2001 defense budget, to pursue unmanned ground vehicles for logistics and combat roles by 2015. DARPA interpreted this mandate to require a push for autonomous resupply technologies, a goal that gained more immediacy as improvised explosive devices began inflicting significant casualties on US convoys in Iraq. DARPA's goal was to reduce the risk resupply operations posed to human soldiers. To that end, it organized its first Grand Challenge competition in 2004, offering a $1 million prize for an AV that could navigate a 142-mile desert course.
There were many sophisticated entrants from a variety of companies and universities. But the prize was large for a reason: The problem was daunting. No vehicle finished the course. The most successful entrant, Carnegie Mellon University's “Sandstorm” — a modified Humvee — traveled only 7.4 miles before its undercarriage stuck on a rock, leaving its wheels with insufficient traction to get it moving again. The other vehicles failed even earlier, getting stuck on embankments, being confused by fences, or in one case, flipping over due to aggressive steering.
The next year’s Grand Challenge had dramatically different results: Five vehicles finished the 2005 course. The winner, Stanford University's “Stanley,” a modified Volkswagen Touareg, crossed the finish line in six hours and 54 minutes, traveling 132 miles without human intervention.
What made the difference? In a word: sensor fusion. Stanley carried five laser scanners mounted on its roof rack, aimed forward at staggered tilt angles to produce a 3D view of the terrain ahead. All this was supplemented with a color camera focused for road pattern detection and two radar antennas mounted on the front to scan for large objects.
This collection of sensing modalities was not Stanley’s innovation. Sandstorm had also been equipped with cameras, lidar, and radar, as well as GPS. What Stanley had was the ability to collate the inputs of these sensors and fuse them into a consistent model of the vehicle’s surroundings. That fusion mitigated the weaknesses of individual modes. When dust kicked up by the lead vehicle obscured the camera and lidar, radar could still register metallic obstacles, while radar's lower resolution was supplemented by rich lidar point clouds and camera vision.
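The underlying principle is easy to illustrate with a toy example; this is not Stanley's actual software, just the basic move behind Kalman-style fusion. Each sensor reports a range estimate with an uncertainty, and the fused estimate weights each report by the inverse of its variance, so a sensor that is currently unreliable, such as a dust-blinded camera, contributes very little.

```python
# Toy inverse-variance fusion of range readings from three sensors.
# Illustrative only; not the software used on Stanley or any real vehicle.

def fuse(readings: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """readings maps sensor name -> (range_m, variance); returns fused (range_m, variance)."""
    weights = {name: 1.0 / var for name, (_, var) in readings.items()}
    total = sum(weights.values())
    fused_range = sum(w * readings[name][0] for name, w in weights.items()) / total
    return fused_range, 1.0 / total

# Clear desert air: all three sensors agree and are trusted.
print(fuse({"lidar": (42.0, 0.05), "radar": (41.5, 1.0), "camera": (43.0, 2.0)}))

# Dust cloud: camera and lidar variances balloon, so radar dominates the estimate.
print(fuse({"lidar": (10.0, 50.0), "radar": (41.8, 1.0), "camera": (5.0, 80.0)}))
```

Real fusion stacks operate on full 3D tracks rather than single range numbers, but the weighting intuition is the same.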
The 2007 DARPA Urban Challenge shifted the domain from the desert to a more challenging one: a mock city environment. Participants were expected to navigate intersections and parking lots while obeying traffic laws and avoiding collisions with other vehicles. These demands encouraged participants to take sensor fusion to new heights.
Carnegie Mellon University, which came in second in 2005, made a comeback with its winning vehicle, “Boss.” A modified Chevy Tahoe, Boss was notable for the full range of sensors it carried: 11 lidar sensors, five for long range and six for short; cameras; and four radar units. This rich set of sensor data, fused together, allowed Boss to handle otherwise-impossible scenarios, like detecting a car partially occluded by another at an intersection.
None of this was cheap. Boss’ sensor suite cost more than $250,000, exclusive of the computer-processing hardware that filled its trunk. So while Boss and vehicles like it were capable of automated driving, they were nowhere near ready to be rolled out to consumers.
Still, the DARPA competitors’ success demonstrated the potential of sensor fusion, which became the default approach in the nascent automated driving system sector. Google launched its self-driving car project in 2009 under Sebastian Thrun, who oversaw Stanley’s victory in the Grand Challenge for Stanford. From the start, this project — which was spun out into an independent subsidiary, Waymo, in 2016 — used a multisensor approach: lidar, radar, cameras, and detailed maps of the operational area. As limited deployment of AVs on public roads began in the mid-2010s, Waymo and its then-competitors, such as Cruise, Argo AI, Uber, and Aurora, were committed to sensor fusion.
Decades of work had yielded a consensus: Multiple sensor technologies, with outputs that could be fused by computers, transcended the limitations of any one sensor. It was expensive and complex, but it worked. All that was required was more deployment and time, to inch down the cost curve, year after year.
That consensus was about to be challenged.
The vision-only insurgency (2016–2019)
If you want to understand the Tesla perspective on driving automation, watch the firm's “Autonomy Day” video.
In an auditorium at Tesla's Palo Alto headquarters on April 22, 2019, Elon Musk and his technical leadership team flatly rejected the sensor-fusion consensus. Within minutes of taking the stage, Musk fired the first salvo: “What we're gonna explain to you today is that lidar is a fool's errand and anyone relying on lidar is doomed. Doomed! Expensive sensors that are unnecessary. It's like having a whole bunch of expensive appendixes ... appendices, that's bad. 'Well, now we'll put [in] a whole bunch of them'? That's ridiculous.”
After Musk's provocative opening, Andrej Karpathy, then the company's senior director of AI, took the stage to exhaustively dismantle the sensor-fusion consensus. “You all came here, you drove here, you used your 'neural net' and vision,” Karpathy said. “You were not shooting lasers out of your eyes and you still ended up here.” By this, Karpathy meant that human drivers can navigate their cars through the streets using only passive optical sensors — their eyes — coupled with powerful neural processing.
“Vision really understands the full details,” Karpathy argued. “The entire infrastructure that we have built up for roads is all designed for human visual consumption. … So all the signs, all the traffic lights, everything is designed for vision. That's where all that information is.” In this view, lidar and other nonvisual inputs weren't merely unnecessary but counterproductive. They were “a shortcut. … It gives a false sense of progress and is ultimately a crutch.”
Musk similarly dismissed high-definition mapping. “HD maps are a mistake. … You either need HD maps, in which case if anything changes about the environment, the car will break down, or you don't need HD maps, in which case, why are you wasting your time?” For Musk, depending on pre-mapped environments meant that the “system becomes extremely brittle. Any change to the system makes it [so that] it can't adapt.” A true automated driving system should be able to boot up anywhere and drive appropriately based purely on what it sees.
Tesla’s approach to driving automation was consistent with Musk’s design philosophy at all of his firms. Javier Verdura, reflecting on his time as Tesla’s director of product design, reminisced that
if we’re in a meeting and we ask, “Why are the two headlights on the cars shaped like this?” and someone replies, “Because that’s how they were designed when I was at Audi,” that’s the worst thing you can say. This means we’re telling how things are done at other companies that have been doing it for years without innovation. For Elon, everything we do must be started from scratch, stripping everything down to the basics and starting to rebuild it with new notions, without worrying about how things are normally done.
At Tesla, the goal was to do away with features other manufacturers took for granted. Musk has said that “the best part is no part. The best process is no process. It weighs nothing. Costs nothing. Can't go wrong.” Tesla's introduction of touchscreens as primary vehicle-control interfaces exemplifies this philosophy. By replacing the buttons and dials that stud a traditional dashboard with a touchscreen, Tesla streamlined user interactions and reduced the number of physical components. This minimalist design not only makes an aesthetic statement but also simplifies the car’s manufacturing and maintenance processes. In the process, of course, the car arguably becomes less safe to operate; but every design decision involves trade-offs.
The same logic that eliminated dashboard buttons militates against lidar in favor of a camera-only approach. If there is no lidar in the vehicle, then the lidar does not have to be sourced, does not have to be installed, does not have to be paid for, and does not need to be replaced; indeed, it cannot fail. While Waymo had to invest immense sums and effort in obtaining and installing and maintaining expensive lidar sets, Tesla was free of those burdens.
In its own way, Tesla’s choice to pursue minimalist design in sensor modalities was as audacious as when Apple did away with physical keyboards for the iPhone, or when SpaceX announced its plan to stop using single-use rockets. This break from orthodoxy was classic Musk: Like SpaceX’s unprecedented success with reusable boosters, it positioned Tesla as a company with an insight into what was possible, one that everyone else had fundamentally misunderstood.
In this case, the insight depended on recent progress in computer vision. In 2012, AlexNet, a neural network developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet challenge, marking the beginning of the deep learning era in vision. Tasks like detecting cars and pedestrians in camera images to a high level of accuracy were now feasible. Deep learning went from strength to strength between 2012 and 2016, when Tesla began equipping all vehicles with cameras and compute hardware designed for eventual self-driving. They believed that with sufficient data and computing power, the fundamental limitations of earlier camera-only systems could be overcome.
“Neural networks are very good at recognizing patterns,” Karpathy explained at Autonomy Day. “If you have a very large, varied dataset of all the weird things that can happen on the roads, and you have a sufficiently sophisticated neural network that can digest all that complexity, you can make it work.” This was Tesla's advantage: hundreds of thousands of consumer vehicles already on the road, collecting real-world driving data with every mile traveled.
Each Tesla vehicle was a data-gathering platform, continuously feeding information back to Tesla's training systems. The company had built what Karpathy called a “data engine” — an iterative process that identified situations in which its autonomous system performed poorly, sourced similar examples from the fleet, trained improved neural networks on that data, and redeployed them to the vehicles. Though Waymo was also collecting data, scale matters for neural networks. In 2019, Waymo had obtained approximately 10 million miles of driving-automation data, while Tesla had over one billion miles collected via vehicles equipped with Autopilot. That two-orders-of-magnitude difference meant, in Tesla's view, that their neural network would outperform any competitor.
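In outline, the loop Karpathy described looks something like the sketch below. Every class and method name here is an invented placeholder rather than Tesla's internal tooling; the point is the shape of the cycle: mine hard cases, gather more like them, retrain, redeploy.

```python
# Illustrative sketch of a fleet "data engine" loop. All names are
# hypothetical placeholders, not Tesla's internal systems.
from dataclasses import dataclass, field
import random

@dataclass
class Model:
    training_set: list = field(default_factory=list)
    version: int = 0

    def is_uncertain(self, clip) -> bool:
        # Stand-in for "the network performed poorly on this clip."
        return random.random() < 0.02

    def retrain(self):
        self.version += 1  # stand-in for a real training run

@dataclass
class Fleet:
    clips: list

    def collect_triggered_clips(self, model: Model) -> list:
        return [c for c in self.clips if model.is_uncertain(c)]

    def query_similar(self, examples: list) -> list:
        # Stand-in for a fleet-wide similarity search.
        return random.sample(self.clips, k=min(len(examples), len(self.clips)))

    def deploy(self, model: Model):
        print(f"deployed model v{model.version} trained on {len(model.training_set)} clips")

def data_engine_iteration(model: Model, fleet: Fleet):
    failures = fleet.collect_triggered_clips(model)   # 1. find hard cases
    similar = fleet.query_similar(failures)           # 2. mine the fleet for more
    model.training_set += failures + similar          # 3. label and add to the data
    model.retrain()                                   # 4. retrain ...
    fleet.deploy(model)                               #    ... and redeploy

random.seed(0)
model, fleet = Model(), Fleet(clips=list(range(10_000)))
for _ in range(3):
    data_engine_iteration(model, fleet)
```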
This data advantage complemented their hardware-cost advantage. In 2019, while a Waymo vehicle might have carried more than $100,000 worth of sensors and computing equipment, Tesla's vision-only approach added perhaps $2,000 to a vehicle's cost. In the firm’s view, these advantages would reinforce each other: Cheaper vehicles would mean more deployment, which would capture more data, which would improve their neural networks, which would make their product more competitive, enabling even more deployment. It was a virtuous cycle for scaling quickly.
“By the middle of next year,” Musk predicted during the 2019 event, “we'll have over a million Tesla cars on the road with full self-driving hardware, feature complete.”
Blind spots (2019–present)
Musk’s prediction did not come true in mid-2020. As of late 2025, it remains unfulfilled. Throughout the early 2020s, Musk continually asserted that Tesla's vehicles would be capable of "full self-driving" by year's end. These announcements triggered market excitement without ever coming true. Tesla did launch a robotaxi pilot in Austin in June 2025, but using Model Y vehicles with safety monitors in the passenger seat and operating in a geofenced area of approximately 245 square miles. (Musk stated in October 2025 that the safety monitors would be removed by year's end; even that would fall far short of the widespread, unrestricted deployment he had suggested, and it remains to be seen whether the promise will be kept.)
Those who have been paying only casual attention to the field may find this surprising. Don’t all Tesla vehicles come with something called “Autopilot”? Don’t many of them feature “Full Self-Driving”? If that feature is not actually full self-driving, then what is it?
Tesla's Autopilot is an Advanced Driver-Assist System that offers adaptive cruise control and lane-keeping. Full Self-Driving expands on this system, adding features like automatic lane changing and traffic-signal recognition. Despite the name, FSD requires constant supervision from a human driver who has to be ready to assume control of the vehicle at any moment. That requirement was obscured by the feature’s misleading name, in some cases with tragic results. Ultimately, Tesla decided truth was the better part of branding, and in early 2024 quietly renamed the feature Supervised Full Self-Driving (an oxymoron).
When will unsupervised FSD be ready? There are many hurdles Tesla must clear before that feature will be widely available. Technical limitations in Tesla's existing vehicle hardware are one; Musk has acknowledged that legacy Tesla vehicles equipped with Hardware 3 may not be capable of unsupervised FSD. Consequently, Tesla has committed to providing free upgrades to Hardware 4 for customers who have purchased the FSD package, ensuring their vehicles can support the technical demands of unsupervised driving.
But another delay, harder to overcome, comes from the limitations of the vision-only approach.
In May 2016 and again in March 2019, Tesla vehicles in Autopilot mode were involved in nearly identical fatal accidents in Florida, where they collided with the side of white tractor-trailers crossing highways. In both cases, the National Transportation Safety Board found that the vision systems failed to detect the broad side of a white truck against a bright sky. These incidents, occurring three years apart — the latter just weeks before the triumphant Autonomy Day presentation — demonstrated that even with substantial improvements, specific visual scenarios like light-colored objects against bright backgrounds continued to challenge Tesla's pattern-recognition systems.
As recently as October 2024, the National Highway Traffic Safety Administration opened a new investigation into Tesla's FSD following reports of four crashes in low-visibility conditions, including one that killed a pedestrian in Rimrock, Arizona, in November 2023. According to the NHTSA documentation, these crashes occurred when Tesla vehicles encountered sun glare, fog, or airborne dust: precisely the kinds of challenging visual conditions with which cameras struggle. This investigation, covering approximately 2.4 million Tesla vehicles from the 2016 through 2024 model years, represents a significant shift in regulatory approach. Rather than focusing solely on driver attentiveness, NHTSA is now examining whether the FSD system itself can “detect and respond appropriately to reduced roadway visibility conditions.” This broader scope is evidence of increasing regulatory scrutiny of the vision-only approach.
The pattern is clear. Despite years of neural network improvements and billions of miles of training data, vision-only systems continue to face fundamental limitations that software alone seems insufficient to overcome. These limitations include glare, darkness, and depth perception.
Glare is most obvious. Extreme contrasts of brightness — such as driving directly toward the setting sun or encountering headlights at night — can temporarily “blind” cameras. In these scenarios, human drivers typically slow down and proceed with caution, and camera-only systems should do the same. By contrast, an automated system equipped with lidar can continue to operate at speed.
Too much light is a problem for cameras, but so is too little. Lidar, radar, and sonar are “active” sensors: Each emits signals (lasers, radio waves, and sound waves, respectively) and measures the return reflections to determine object presence, distance, and velocity. Cameras, by contrast, are “passive” sensors, relying solely on ambient light in their environment. In its absence, they are inert.
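The arithmetic of active sensing is simple, which is part of its appeal: the sensor clocks the round trip of its own signal, so range falls straight out of the speed of light and needs no ambient illumination at all. (The numbers below are just a worked illustration.)

```latex
% Time-of-flight ranging for an active sensor (lidar or radar).
% c: speed of light   \Delta t: measured round-trip time of the pulse
d = \frac{c\,\Delta t}{2}
\quad\Longrightarrow\quad
d = \frac{(3\times 10^{8}\ \mathrm{m/s}) \times (400\ \mathrm{ns})}{2} = 60\ \mathrm{m}.
```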
As a consequence, Tesla vehicles struggle in a variety of conditions. Most obviously, at night there is often little light available. The car’s headlights help, of course, but are most useful for detecting reflective objects like road signs, well-painted and maintained lane lines, and the taillights of other cars. Nonreflective objects, like dark-clad pedestrians or road debris, are harder for cameras to notice.
This may seem to be a strange gap in Tesla’s capability. Since humans can drive at night using vision alone, shouldn’t camera-only vehicles be able to do the same? But human eyes have capabilities that cameras lack. Human eyes employ two distinct photoreceptor systems: rod cells for low-light monochrome vision and cone cells for color in daylight. When darkness falls, our eyes and brains effectively switch sensor modes. While cameras have advantages — perfect consistency, no fatigue, and the ability to deploy multiple synchronized units for 360-degree coverage — they still can’t match the low-light performance of a biological system that has been refined by millions of years of avoiding nocturnal predators.
But while human night vision under normal conditions is impressive, lidar is better. And while lidar, like cameras, is challenged by rain, snow, and fog — raindrops or snow can obscure a camera’s vision or cover its lens while also blocking a lidar’s pulse — radar is serenely unaffected. In one documented case, a Waymo One robotaxi encountered dense fog in San Francisco and responded safely: It did not complete its journey, but it found a place to pull over and suspended the trip, in a situation where a vision-reliant system could not have proceeded at all.
Camera-only systems can also struggle with depth perception. Cameras don't directly measure distance; the best they can do is provide depth estimates. One method is to compare images from slightly different angles. In nature, this arrangement is called stereoscopic vision, and it’s why predator animals usually have two front-facing eyes: Predators need to make quick estimations of distance. Another method is to compare images across time (i.e., motion parallax), like Dickmanns’ rudimentary automated driving system from the 1990s.
Tesla vehicles use both methods. They have multiple cameras strategically positioned around the vehicle. The current Hardware 4 suite includes eight external cameras: three forward-facing with different focal lengths for near-, medium-, and long-range detection; two pillar-mounted side cameras; one rear camera; and two fender-mounted cameras. Each camera is tuned for specific use cases, from high-speed highway driving to precise parking maneuvers, with overlapping fields of view to enable depth estimation through stereoscopic vision. Thanks to these cameras and computer-based depth-estimation tools, Tesla estimates that in practice its vehicles can detect other vehicles about 250 meters ahead on highways, roughly comparable to radar range.
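A rough calculation shows why stereo estimates degrade with range. The baseline and focal length below are assumptions chosen for illustration, not Tesla specifications; the key point is that, for a fixed error in measuring disparity, the depth error grows with the square of the distance.

```python
# Back-of-the-envelope stereo depth error. The baseline and focal length are
# assumed values for illustration, not Tesla specifications.
# For a pinhole stereo pair, Z = f * B / d, so a fixed disparity-measurement
# error (in pixels) produces a depth error of roughly Z^2 / (f * B) per pixel.

def depth_error(z_m: float, baseline_m: float, focal_px: float,
                disparity_err_px: float = 0.5) -> float:
    """Approximate depth uncertainty (meters) at range z_m."""
    return (z_m ** 2 / (focal_px * baseline_m)) * disparity_err_px

BASELINE_M = 0.3   # assumed spacing between two forward-facing cameras
FOCAL_PX = 2000.0  # assumed focal length, expressed in pixels

for z in (10, 50, 100, 250):
    print(f"at {z:>3} m: depth uncertainty ~ {depth_error(z, BASELINE_M, FOCAL_PX):5.1f} m")
```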
This approach works well, but only in ideal conditions. Problems arise at longer distances or when visibility is poor, and there are other, less obvious failure modes. Imagine cresting a hill to discover a stopped vehicle on the other side. No sensor can see through the hill, but lidar registers the stopped vehicle the instant it comes into view, and the car can respond immediately. A camera-only system needs more time to form stereoscopic or motion-parallax estimates, and as a result may not brake soon enough.
Beyond physical limitations like glare and darkness, vision-only systems face a more fundamental AI challenge: training data bias.
Neural networks can recognize only the patterns they've been trained on, and unusual scenarios may be underrepresented in their training data. This challenge is magnified by Tesla's end-to-end machine learning architecture, where camera inputs feed directly into neural networks that output driving commands. Unlike Waymo's modular architecture, which separates perception from planning so engineers can diagnose whether errors stem from misunderstanding the world or making bad decisions, Tesla's end-to-end system processes camera images inside a black box. Images go in one end, driving commands come out the other. If the system brakes, there is no way to be certain why it did so. Was it because it recognized the red light, or detected a pedestrian, or noticed another vehicle, or responded to some unrelated visual pattern, like a missing manhole cover? There is no way to tell. What that means, ironically, is that Tesla's massive data collection is both its greatest strength and a significant constraint. The vast quantities of data make processing by humans impractical. While competitors can address specific edge cases by updating discrete components or rules, Tesla must retrain its entire system.
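The architectural contrast can be caricatured in a few lines of code. This is a schematic, not either company's actual software: in the end-to-end style there is nothing observable between pixels and pedals, while the modular style exposes an intermediate world model that engineers can log, test, and debug on its own.

```python
# Schematic contrast between end-to-end and modular driving stacks.
# A caricature for illustration, not Tesla's or Waymo's actual code.
from dataclasses import dataclass

@dataclass
class Controls:
    steering: float
    braking: float

# End-to-end style: pixels in, controls out, nothing inspectable in between.
def end_to_end_policy(camera_images) -> Controls:
    # One learned function; if it brakes, the "why" lives inside the weights.
    return Controls(steering=0.0, braking=1.0)

# Modular style: an explicit, inspectable world model sits in the middle.
@dataclass
class WorldModel:
    objects: list        # e.g., [{"type": "pedestrian", "distance_m": 12.0}]
    traffic_light: str   # e.g., "red"

def perceive(sensor_data) -> WorldModel:
    return WorldModel(objects=[{"type": "pedestrian", "distance_m": 12.0}],
                      traffic_light="red")

def plan(world: WorldModel) -> Controls:
    too_close = any(o["distance_m"] < 20 for o in world.objects)
    brake = 1.0 if too_close or world.traffic_light == "red" else 0.0
    return Controls(steering=0.0, braking=brake)

world = perceive(sensor_data=None)  # this intermediate can be logged and unit-tested,
print(world)                        # so a braking decision can be traced back to it
print(plan(world))
```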
This long-tail problem is particularly challenging for safety-critical systems. When a Tesla encounters an unusual object configuration — like a truck with an unusual shape, a fallen tree, or construction equipment — the neural network may misclassify it or fail to detect it entirely. While expanding the training dataset helps, it's impossible to capture every edge case through fleet learning alone. Each new scenario requires collection, annotation, and retraining: a time-consuming process that creates inevitable gaps in the system's perception. This is in contrast to lidar and radar, which detect physical objects regardless of their visual appearance or how frequently they've been encountered before.
These limitations of vision-only systems are not theoretical; they manifest in the firm’s performance data. According to California DMV reports from 2023 to 2024, Waymo reported a remarkably low rate of 0.0004 disengagements (that is, instances of returning control to a human driver) per mile. Tesla, which does not yet offer a full automated driving system but merely a driver-assist system — its Supervised FSD — is not required to submit formal disengagement reports, but some third-party analyses estimate disengagement rates between 0.05 and 0.10 per mile. If these estimates are accurate, Tesla cars disengage roughly two orders of magnitude more often than dedicated sensor-fusion systems.
In November 2025, weeks after Waymo's co-CEO challenged the industry to “be transparent about what's happening with their fleets,” Tesla released its most detailed safety report to date. The release is good news, in that it finally offers a complete picture of Tesla’s contributions to road safety, but it’s also bad news, in that it underscores the gap between where Tesla is and where it needs to be — that is, where Waymo already is.
For Tesla’s Supervised FSD — where a human driver remains ready to intervene — Tesla reports approximately 2.9 million miles between major collisions in the most recent 12-month period, compared with roughly 505,000 miles for average US drivers. This represents about an 82% reduction in serious crash frequency, or roughly five to six times safer than human driving. For minor collisions, Tesla reports approximately 986,000 miles between incidents, compared with 178,000 miles for average drivers.
That’s good, but it fails to clear the bar that Waymo sets. Waymo reports even more dramatic safety improvements for its vehicles, which, unlike Teslas, operate without a human standby driver. Waymo's robotaxis are involved in 91% fewer crashes involving serious injury — roughly 11 times safer than humans in this respect — and 79% fewer crashes involving airbag deployment.
The most revealing comparison, however, comes from Tesla's robotaxi pilot in Austin. When Tesla attempted unsupervised operation — removing the human safety backup — early performance was catastrophic. In the first month of operation, covering approximately 7,000 miles, the vehicles were involved in three crashes. That rate of roughly one crash every 2,300 miles is orders of magnitude worse than Waymo's performance, and significantly worse than even average human drivers. While Tesla has not provided updated figures for the Austin pilot beyond this initial period, the contrast is stark. These differing safety records suggest that, despite advances in neural networks and computer vision, sensor-fusion systems continue to outperform vision-only approaches in real-world conditions.
Beyond the dichotomy
Perhaps in recognition of this reality, Tesla has quietly shifted its stance.
Throughout 2021 and 2022, Tesla ostentatiously removed radar from its vehicles as part of its commitment to “Tesla Vision.” Then, in late 2023, without fanfare, the company reintroduced radar, incorporating a high-resolution radar unit (codenamed “Phoenix”) into its Hardware 4 suite. The reintegration played to the firm’s strengths: Whereas radar had previously fed the driver-assistance system as a separate stream with hard overrides, its input was now fused directly into the ADAS neural network. Even so, for a company that had so loudly insisted on the sufficiency of cameras alone, this limited camera-and-radar sensor fusion represented a significant change. Similarly, Tesla vehicles quietly incorporated onboard mapping to understand their position in space.
Meanwhile, Waymo and other sensor-fusion companies have increasingly embraced neural networks. Waymo now employs transformer-based foundation models — the same technology powering advanced language models — across its entire self-driving pipeline: perception, prediction, and motion planning. The system is trained end-to-end, with gradients flowing backward through its components during training, presumably much as Tesla's is. However, Waymo has chosen to maintain distinct perception and planning networks: If the car makes a mistake, engineers can determine whether it misunderstood the world or made a poor decision. This modular architecture allows independent testing and validation of components.
One consequence of this is that Waymo needs fewer sensors, even as the economics driving these decisions have shifted dramatically. Early automotive lidars like Velodyne's HDL-64E cost upwards of $75,000 in 2007, making them impractical for mass-market vehicles. However, technological advances and economies of scale have caused prices to plummet. By 2020, Velodyne’s automotive-grade lidars were in the $500 range at production volumes: a cost reduction of more than 99% in just over a decade. Waymo used Velodyne lidars early in the firm’s life but has been building its own lidar in-house for years at what the firm said in 2024 was “a significantly reduced cost.” Computing hardware costs have followed a similar trajectory. Today, industry projections suggest that by 2030, comprehensive sensor suites including multiple lidars might add only $2,000 to $3,000 to vehicle cost, approaching the price premium of Tesla's camera array and computing hardware.
Waymo and Tesla are not alone in the self-driving car space, and their competitors are also converging on sophisticated AI, sensor fusion, and multiple sensor modes. Mobileye, which supplies driver-assist systems to dozens of automakers, relies on cameras and radar for basic capability while adding more sophisticated sensing as autonomy levels increase. Their robotaxi platform incorporates lidar for redundancy and robustness: The camera subsystem alone can drive safely, the lidar/radar subsystem alone can drive safely, and the two run in parallel. Like Tesla, Mobileye built its reputation on vision-based ADAS, but for higher levels of autonomy, the firm recognizes the value of sensor fusion.
Another instructive example is Wayve, a UK-based startup whose approach blurs the line between vision-only and sensor-fusion. Like Tesla, Wayve emphasizes end-to-end deep learning: Its neural networks take raw video input and directly output driving commands. But unlike Tesla, Wayve does not insist on a vision-only approach. Its vehicles incorporate inertial measurement units, GPS, and occasionally radar to augment their understanding of the environment. Their approach underscores how much the earlier dichotomy is breaking down.
The fundamental question of sensor fusion versus cameras-only is beginning to lose its sharpness. As it recedes, the question is no longer which sensing approach to use, but what standard of safety successful driving automation must meet.
The argument of Tesla’s 2019 Autonomy Day, which Musk still hypes on X, is that if humans drive with vision alone, so can cars.
It’s pithy. It’s memorable. And in several ways, it’s misleading.
It’s misleading because humans don't actually drive with vision alone. We have other senses to engage. We use hearing to detect sirens, screeching tires, and warnings from pedestrians. We have proprioception that helps us feel g-forces, vibrations, and loss of traction. It’s true that we drive primarily with vision, supplemented by brains that carry vast contextual knowledge about driving environments, and a well-trained neural network can claim something similar. But we can also — through reading facial expressions and gestures — rapidly discern other drivers’ intentions in ways that no computer can.
And despite these advantages, humans are terrible drivers.
Globally, human drivers cause approximately 1.19 million deaths annually. Human error contributes to over 90% of crashes. In the United States alone, roughly 40,000 people die in traffic accidents each year. Humans can’t shoot lasers out of our eyes, but if we could, we’d be much safer drivers. Our cars can. Why shouldn't we aspire to the level of safety that sensor fusion offers? Progress in this field, understood properly, should mean living up to driving automation's capability, not living down to human weakness.
So as Waymo robotaxis and Tesla's Model Y-based robotaxis now ply the streets of Austin, the two vehicles indeed embody different philosophies about how AVs should perceive the world. The Tesla robotaxi sports its array of cameras, while the Waymo spins its lidar alongside a suite of complementary sensors. But the competition will not be as sharp as it would have been in 2019.
Tesla challenged convention, but since then it has quietly reintroduced radar; it seems possible that it will bring in other modalities besides. Waymo pioneered comprehensive sensor fusion, but it has since streamlined its hardware and enhanced its AI capabilities. It seems certain that it will continue to do so. Looking ahead, the paths forward for each firm seem likely to converge.
If that’s correct, it means that observers — including the regulators who will admit this technology into the streets of other cities — have a different question to ask. Rather than cameras versus lidar, the real contest is between robotaxis that are as safe as human drivers and those that are better.
Which standard are we prepared to accept? What vehicles can meet the one we choose? How soon can those vehicles arrive? These questions aren’t technical but political, which means that, as citizens, it is up to us to decide.
The driving-automation future we get will depend on our answer.
Andrew Miller holds a Ph.D. in history from Johns Hopkins and is the co-author of the book The End of Driving (2025). Previously at Alphabet's Sidewalk Labs, he now writes the Changing Lanes newsletter on mobility innovation.