Comparison of Waymo Rider-Only crash rates by crash type to human benchmarks at 56.7 million miles

Abstract

Objective

SAE Level 4 Automated Driving Systems (ADSs) are deployed on public roads, including Waymo’s Rider-Only (RO) ride-hailing service (without a driver behind the steering wheel). The objective of this study was to perform a retrospective safety assessment of Waymo’s RO crash rate compared to human benchmarks, both in aggregate and disaggregated by crash type.

Methods

Eleven crash type groups were identified from commonly relied upon crash typologies that are derived from human crash databases. Human benchmarks were developed from state vehicle miles traveled (VMT) and police-reported crash data. Benchmarks were aligned to the same vehicle types, road types, and locations as where the Waymo Driver operated. Waymo crashes were extracted from the NHTSA Standing General Order (SGO). RO mileage was provided by the company via a public website. Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ crash outcomes were examined because they represented previously established, safety-relevant benchmarks where statistical testing could be performed at the current mileage.

Results

Data were examined over 56.7 million RO miles through the end of January 2025. For all crash types combined, Waymo had a statistically significant lower crashed vehicle rate than the benchmarks in the Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ outcomes. Of the crash types, V2V Intersection crash events represented the largest total crash reduction, with a 96% reduction in Any-Injury-Reported (87–99% confidence interval) and a 91% reduction in Airbag Deployment (76–98% confidence interval) events. Cyclist, Motorcycle, Pedestrian, Secondary Crash, and Single Vehicle crashes also showed statistically significant reductions for the Any-Injury-Reported outcome. There was no statistically significant disbenefit found in any of the 11 crash type groups.

Conclusions

This study represents the first retrospective safety assessment of an RO ADS that made statistical conclusions about more serious crash outcomes (Airbag Deployment and Suspected Serious Injury+) and analyzed crash rates on a crash type basis. The crash type breakdown applied in the current analysis provides unique insight into the direction and magnitude of safety impact being achieved by a currently deployed ADS. This work should be considered by stakeholders, regulators, and other ADS companies aiming to objectively evaluate the safety impact of ADS technology.

Introduction

Retrospective safety assessments, also known as safety impact or safety benefits studies, compare the in-field crash performance of vehicle safety technology to some benchmark. The approach, which often utilizes reported crash and exposure data, has been widely used to assess safety systems such as seatbelts (McCarthy 1989; Elliott et al. 2006), airbags (Thompson et al. 2002; McCartt and Kyrychenko 2007; Viano 2024), anti-lock brakes (Kahane 1994), electronic stability control (Ferguson 2007), forward crash prevention (Fildes et al. 2015; Isaksson-Hellman and Lindman 2016; Cicchino 2017), lane departure prevention (Sternlund et al. 2017; Cicchino 2018), and other systems. The analytical approaches used in these retrospective safety assessments have included comparing crash rates (or insurance claims rates or amounts) before and after the system was deployed, as well as induced exposure (which allows for before and after comparisons without having an explicit exposure measure).

Many of the systems evaluated in the past are most effective at preventing or mitigating a certain type of crash. For example, forward crash prevention systems like Automatic Emergency Braking (AEB) and Forward Collision Warning (FCW) are most effective in front-to-rear (also known as rear-end) crashes. Therefore, many studies present system effectiveness in reducing a target crash mode. The use of a conflict typology, also known as crash type groups, to subdivide the crash population to isolate common contributing factors and potential countermeasures has been a staple of traffic safety analysis (Najm and Smith 2007; Kusano et al. 2023). Conflict typologies are particularly useful for analyzing safety performance with respect to the scenarios with the highest safety burden and among unique causal mechanisms (Kusano et al. 2023).

Automated Driving Systems (ADSs), which are defined as SAE level 3 through 5 automation systems (SAE International (SAE) 2021), are just now being deployed on public roads in numbers that enable retrospective safety assessments. Past studies have quantified differences in ADS and human benchmark crash rates. Most of the literature on this subject has used early testing data from SAE level 4 ADS reported to the California Department of Motor Vehicles (DMV) and the National Highway Traffic Safety Administration (NHTSA) Standing General Order (SGO). This early testing data almost exclusively features an ADS with a human behind the steering wheel supervising the ADS with the capability to take over control of the vehicle if needed. See Scanlon et al. (2024a) and Goodall (2021, 2023) for thorough reviews of historical ADS safety impact studies to date. The presence of a human behind the wheel who can decide when to engage and disengage the ADS makes it difficult to separate the performance of the ADS from that of the human test drivers. Furthermore, the presence of a human supervising a level 4 system suggests a system under development that is not yet suitable for operation without a human supervisor. Schwall et al. (2020) found through a counterfactual simulation method that a human present was able to prevent 62% of crashes over 6.1 million miles of testing operations. This result highlighted the effect of a safety operator during early testing and the lack of comparability from testing operations to future on-road performance, unless counterfactual simulation is introduced after human disengagement (Schwall et al. 2020). The most relevant performance of current level 4 ADS, which are intended to be used as fleet-owned ride-hailing vehicles, is when the ADS vehicle is operating without a human behind the wheel, or in rider-only (RO) configuration.

Because of the interest in ADS retrospective safety impact studies by many stakeholders, there have been recent efforts in developing best practices in performing and evaluating such studies. The RAVE Checklist, written by a group with safety impact research experience from automotive industry, insurance, and academia and described by Scanlon et al. (2024b), is a list of requirements and recommendations to address the many challenges when performing retrospective safety impact studies. This study conforms to the required checklist items of the RAVE checklist (Scanlon et al. 2024b, see Supplementary Appendix for RAVE checklist analysis for this study).

There are several recent studies that compared the aggregate, or overall, crash rate of a level 4 ADS in RO configuration to a human benchmark. Kusano et al. (2024) compared Waymo’s crash performance in several different crash outcomes to human benchmarks from Phoenix, San Francisco, and Los Angeles over 7 million RO miles. The study found a statistically significant reduction in police-reported (55% reduction) and any-injury-reported (80% reduction) crashes. No comparisons to higher severity outcomes, such as airbag deployments or serious injuries, were made because the limited RO miles relative to the benchmark rates did not support statistically significant conclusions. Chen and Shladover (2024) compared Waymo’s RO crash rate to human benchmarks in San Francisco for just under 1 million RO miles at the any property damage or injury outcome level. No statistical testing was performed in that study, but the Waymo crash rate (reported as part of the NHTSA SGO) was found to be similar in magnitude to self-reported human transportation network company (TNC) crashes. It is unclear what definition of a crash is used for the self-reported TNC crash data, and whether that TNC crash definition is well matched to the ADS crashes reported as part of the NHTSA SGO. That is, there is an unknown amount of underreporting in the TNC crash data, while the ADS data from the SGO includes any amount of property damage with little to no underreporting. TNC drivers may have incentives to not report low severity collisions, as reported collisions may lead to deactivation from the platform. In two subsequent studies, Di Lillo et al. (2024a, 2024b) compared Waymo’s 3rd party liability property damage and bodily injury claims rate to a human benchmark weighted by garaged zip code proportional to the miles driven by the Waymo RO fleet over the first 3.9 million and 25.3 million miles.
A 3rd party liability claim is when a party involved in a crash asks for payment from another party’s insurance. These studies use a paid 3rd party liability claim as a proxy for whether the insured vehicle contributed to the crash, which is a complementary view of crash involvement to the overall crash rate analysis by Kusano et al. (2024), which compares crash rates regardless of contribution to the crash. Using the 25.3 million mile analysis (superseding the 3.9 million mile analysis), Di Lillo et al. (2024b) found a statistically significant reduction in both property damage (88% reduction) and bodily injury (92% reduction) claims. Di Lillo et al. (2024b) also compared Waymo RO 3rd party liability property damage and bodily injury claims rates to a benchmark of human insured latest-generation vehicles (model years 2018–2021), which have a lower claims rate than all human driven vehicles. The study found Waymo RO had an 86% reduction in property damage claims and 90% reduction in bodily injury claims compared to the latest-generation human driven vehicle benchmark.

As ADS deployments have continued to operate and expand to collect additional miles, there is now an opportunity to do such a safety impact analysis of rarer safety outcomes (such as serious injuries) and to disaggregate analysis by crash type as has been done in the past for other vehicle safety systems. Both of these types of analyses require sufficient mileage for statistical comparison and have thus been limited in the past. As the benchmark outcome becomes rarer (i.e., a lower crash rate), more miles and/or a larger relative difference in performance between the ADS and benchmark are needed to draw statistically significant conclusions. For example, Scanlon et al. (2024a) performed an example statistical power analysis that computed the number of miles needed for statistical significance for hypothetical ADS with different performances relative to the benchmarks. An ADS with a crash rate of 10% of the national Suspected Serious Injury+ benchmark (i.e., a 90% reduction of the benchmark of 0.11 Incidents per Million Miles, IPMM) would require 56.3 million miles. Waymo’s RO miles are now within the range where statistical conclusions could be drawn about such serious injury outcomes. The primary focus of road safety, aligned with the Vision Zero movement, is to eliminate serious and fatal injuries (Lie and Tingvall 2024). Thus, retrospective evidence of the performance of level 4 ADS would serve as a continuous confidence growth in the safety assurance process (Favarò et al. 2023) that used design-based and prospective studies during system development (e.g., Scanlon et al. 2021). Similarly, disaggregating the benchmark comparison by crash type reduces the benchmark crash rate under test, increasing the mileage requirement.
Because current level 4 ADS are deployed where the system is responsible for the entire dynamic driving task without the ability for humans to take over at any time, the ADS may have a safety impact on all types of crashes. Until now, retrospective safety assessment studies of ADS in RO configuration have only compared aggregate crash rates, including all types of crashes, as that has been the only level of analysis possible. Analyses of performance by crash type may provide insight into how level 4 ADS are achieving reduced overall crash rates and what distributional shifts in crash type are occurring.
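
The link between benchmark rarity and required mileage can be illustrated with a deliberately simplified calculation. The sketch below is not the power analysis of Scanlon et al. (2024a); it assumes a one-sided Poisson test against a fixed benchmark rate and asks how many miles an ADS would need to drive with zero events before the benchmark rate could be rejected. The function name and the comparison rates other than 0.11 IPMM are hypothetical.

```python
import math


def miles_for_zero_event_significance(benchmark_ipmm, alpha=0.025):
    """Million miles needed with zero observed events to reject the benchmark.

    Under the benchmark rate, P(0 events in m million miles) = exp(-rate * m).
    Setting that probability equal to alpha gives m = ln(1 / alpha) / rate.
    This is a simplified one-sided test, assumed here for illustration only.
    """
    if benchmark_ipmm <= 0:
        raise ValueError("benchmark rate must be positive")
    return math.log(1.0 / alpha) / benchmark_ipmm


# 0.11 IPMM is the national Suspected Serious Injury+ benchmark quoted above;
# 1.0 and 0.5 IPMM are hypothetical rates chosen only to show the trend.
for rate in (1.0, 0.5, 0.11):
    miles = miles_for_zero_event_significance(rate)
    print(f"{rate:.2f} IPMM -> {miles:.1f} million miles")
```

Even under this optimistic zero-event assumption, the required mileage scales inversely with the benchmark rate, which is why rare outcomes and crash type subsets could not be tested at earlier mileage levels.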

This study conducted a retrospective safety impact analysis of Waymo’s level 4 ADS in RO configuration on surface streets in San Francisco, Phoenix, Los Angeles, and Austin over 56.7 million RO miles with two primary research questions: (1) what is Waymo’s aggregate (all crash type) crash performance relative to aligned human benchmarks in the Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ outcomes? and (2) what is Waymo’s crash rate for disaggregated crash types relative to human benchmarks in the Any-Injury-Reported and Airbag Deployment outcomes? This study makes several contributions compared to past ADS safety impact studies. First, this study compares higher severity outcomes (Airbag Deployment and Suspected Serious Injury+) to human benchmarks than previous studies did. Second, past studies have not compared RO ADS performance to these human benchmarks disaggregated by crash type. Third, this study applied a spatial dynamic benchmark correction described by Chen et al. (2024) that adjusts the human benchmarks proportional to the spatial distribution of driving of the RO fleet within the geographies they operate in.

Methods

ADS and benchmark data alignment

The methodology was designed according to the RAVE checklist (Scanlon et al. 2024b) and was implemented in four main steps that are shown in Figure 1. First, raw data was extracted from a variety of available data sources. Next, the mileage and crash data from the benchmark and ADS crash data sources were aligned to maximize comparability. Specifically, the mileage and crash benchmark data was restricted to passenger vehicles and surface streets, and dynamic spatial adjustments were performed to make the benchmark driving distribution representative of the ADS driving. Crash rates were then generated by both crash type and outcome level. Finally, statistical testing was used to evaluate the meaningfulness of any observed differences in crash rates.

Figure 1. Data process methodology for ADS and benchmark comparison.

This study compared crash rates for Waymo’s RO service, derived from the NHTSA SGO and mileage self-reported by Waymo, to human benchmarks derived from crash and Vehicle Miles Traveled (VMT) databases maintained by the states of California, Arizona, and Texas. A full listing of the databases relied upon can be found in Table 1. Details of the data sources are in the Supplementary Appendix.

Table 1. Data sources relied upon in the current study.

Data alignment promotes comparability between datasets. This is sometimes referred to colloquially as making an “apples-to-apples” comparison (Scanlon et al. 2024b). This study performed alignment along four main dimensions known to influence crash risk: (1) vehicle type, (2) road type, (3) spatial driving distribution, and (4) in-transport status. Vehicle type and road type were accounted for through subselection. A dynamic benchmarking routine previously developed by Chen et al. (2024) was used to adjust for crash risk by where within their deployed geographic regions the ADS operated.

The Waymo RO operations in this study are identical to those described in Kusano et al. (2024). Waymo’s current RO operations have exclusively taken place using recent model run Chrysler Pacifica (not actively in operation) and Jaguar I-Pace (active) platforms, which are both classified as passenger vehicles according to the 49 CFR § 565.15 classification. For this study, surface streets refers to all roadways that are not “Interstates” or “Other Arterials – Other Freeways and Expressways” as defined using FHWA’s highway function classification coding (FHWA 2023b). The existing RO operations mostly occurred on surface streets, so this study only examines miles and crashes that occurred on surface streets. A final dimension considered was “in-transport” status, which refers to all vehicles that are not in designated parking, parked off the roadway, parked on private property, or working vehicles. Mileage is only accumulated while “in-transport”, and this variable is readily available in all police-reported databases, which makes it a straightforward dimension to align the data on. See the Supplementary Appendix for the procedure for determining in-transport status using SGO data, which is identical to the method used in Kusano et al. (2024).

The benchmark mileage and crashes include data from multiple vehicle and road types, and quantifying in-transport, surface street, passenger vehicle rates involves a combination of subselection, data joining, and re-weighting. This process was previously described in Scanlon et al. (2024a), with the current study extending the methodology to Texas data and developing individual benchmarks for 11 crash type groups. The crash data was directly subset for these three data requirements. The mileage data was subset for surface streets. The amount of mileage in each region was not broken down by vehicle type, so the total mileage (including all vehicle types) reported by the states was adjusted based on the proportion of passenger vehicle miles reported in the FHWA VM-4 tables. By definition, all mileage data is accumulated while “in-transport”. Each mileage and crash dataset has unique features for identifying surface streets, passenger vehicles, and in-transport status. The variables relied upon in this study are shown in the Supplementary Appendix.
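
The mileage adjustment described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: the function names are hypothetical, and the VMT total, passenger vehicle share, and crash count are made-up values (the study takes the share from the FHWA VM-4 tables).

```python
def passenger_surface_vmt(total_surface_vmt_millions, passenger_share):
    """Scale all-vehicle surface-street VMT by the passenger-vehicle share."""
    if not 0.0 <= passenger_share <= 1.0:
        raise ValueError("passenger_share must be between 0 and 1")
    return total_surface_vmt_millions * passenger_share


def crashed_vehicle_rate_ipmm(crashed_vehicles, vmt_millions):
    """Crashed vehicles per million miles (IPMM)."""
    if vmt_millions <= 0:
        raise ValueError("vmt_millions must be positive")
    return crashed_vehicles / vmt_millions


# Hypothetical inputs for a single county (not values from the study).
total_surface_vmt = 20_000.0          # million miles, all vehicle types
passenger_share = 0.9                 # hypothetical passenger-vehicle proportion
crashed_passenger_vehicles = 36_000   # in-transport, surface street, passenger

vmt = passenger_surface_vmt(total_surface_vmt, passenger_share)
rate = crashed_vehicle_rate_ipmm(crashed_passenger_vehicles, vmt)
print(f"{vmt:.0f} million passenger-vehicle miles, {rate:.2f} IPMM")
```

The crash count is subset directly in the crash data, whereas the mileage denominator is scaled, so both sides of the rate refer to in-transport passenger vehicles on surface streets.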

Spatial dynamic benchmark

The benchmark data relied upon were examined as four distinct geographic areas: Travis, Hays, and Williamson Counties in Texas (Austin); Los Angeles County, California; Maricopa County, Arizona (Phoenix); and San Francisco County, California. The dynamic benchmarking routine presented by Chen et al. (2024) and briefly summarized in the following paragraphs then effectively further subset and weighted the benchmark data to only include the area within the selected counties where the Waymo RO service drove, proportional to the miles driven.

Relying directly on the entirety of crash and mileage data from the counties making up the Waymo RO service area would effectively create what will be referred to as “unadjusted” benchmark crash rates, whereby the crash rates are representative of where the current driving population currently aggregates VMT. Waymo’s ODD has gradually changed over time and does not necessarily include the entire counties. Additionally, Waymo operates as a ride-hailing fleet with unique driving patterns that are responsive to user demand. Because of this, Chen et al. (2024) examined the spatial distribution of where VMT was being accumulated and noted distinct differences in the driving distributions. The driving mix differences had direct implications on the crash risk.

Chen et al. (2024) created a “dynamic” benchmark routine that reweights the human benchmark data to reflect the driving distribution of the ADS. This effectively models the crash rate of the benchmark given that the benchmark population drove with the same spatial distribution as the Waymo Driver. For a spatial adjustment, the routine discretizes the miles driven by the ADS (Waymo) and benchmark (HPMS human driven mileage) into level-13 S2 cells, thus providing a spatial distribution of driving miles throughout some bounded geographic area. The proportions of miles driven by Waymo and by humans within a given cell are used to reweight the benchmark data. Likewise, if an area of the benchmark data geographic region is not driven in by the Waymo Driver, it is re-weighted in a manner that essentially excludes it from the derived benchmark (zero-weighted).
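
A simplified version of this reweighting can be sketched as follows. It assumes that the dynamic benchmark is the average of per-cell human crash rates weighted by the share of ADS miles in each cell; the exact formulation of Chen et al. (2024) may differ. The cell identifiers, field names, and all numbers are hypothetical.

```python
def unadjusted_rate(cells):
    """Benchmark rate over the whole region: total crashes / total human miles."""
    crashes = sum(c["human_crashes"] for c in cells.values())
    miles = sum(c["human_miles"] for c in cells.values())
    return crashes / miles


def dynamic_rate(cells):
    """Benchmark rate reweighted to the ADS spatial driving distribution.

    Each cell's human crash rate is weighted by the fraction of ADS miles
    driven in that cell, so cells with no ADS driving get zero weight.
    """
    ads_total = sum(c["ads_miles"] for c in cells.values())
    if ads_total <= 0:
        raise ValueError("total ADS mileage must be positive")
    rate = 0.0
    for cell_id, c in cells.items():
        if c["ads_miles"] == 0:
            continue  # zero-weighted: effectively excluded from the benchmark
        if c["human_miles"] <= 0:
            raise ValueError(f"no human mileage in cell {cell_id}")
        weight = c["ads_miles"] / ads_total
        rate += weight * c["human_crashes"] / c["human_miles"]
    return rate


# Hypothetical cells; miles are in millions. Cell "C" has a high human crash
# rate but no ADS driving, so it drops out of the dynamic benchmark.
cells = {
    "A": {"human_crashes": 10, "human_miles": 5.0, "ads_miles": 3.0},
    "B": {"human_crashes": 2, "human_miles": 4.0, "ads_miles": 1.0},
    "C": {"human_crashes": 8, "human_miles": 1.0, "ads_miles": 0.0},
}
print(unadjusted_rate(cells), dynamic_rate(cells))
```

In this toy example the adjusted rate is lower than the unadjusted one only because the excluded cell happens to be high risk; the adjustment can move the benchmark in either direction depending on where the ADS actually drives.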

Crash type analysis

One of the contributions of this paper is to compare crash rates in individual crash types. There are many frameworks to consider for differentiating between crash types (Najm and Smith 2007; Kusano et al. 2023). There are two competing priorities that need to be considered. First, which crash types are informative for evaluating safety performance? Second, is there enough driving exposure to identify statistical differences? There is an inherent tradeoff in selecting crash type groupings, where a crash type grouping with too many categories could draw more specific conclusions about ADS performance but would suffer from a lack of statistical power. A crash type grouping with too few categories would have higher statistical power, but not aid in the understanding of ADS performance in specific crash types.

To balance the tradeoff of crash type groupings between analysis ability and statistical power, the approach in this study was to consider two dimensions, crash partner type and geometric configuration, in deriving crash type groupings. Crash types were only assigned for the first two involved parties (or one vehicle if it was a single vehicle crash). Vehicles involved in secondary contact events were indicated accordingly. Figure 2 shows the crash type groupings used in this study. Altogether, these crash types encompass 88% of the total police-reported crashes with at least minor injuries and 86% of the total fatal collisions nationally in the US (Kusano et al. 2023). Generally speaking, a crash partner type and geometrical lens is informative about (a) avoidability and (b) potential severity, and crashes within a group tend to share similar sets of causal mechanisms (Kusano et al. 2023).

Figure 2. Crash type groupings for ADS and human benchmark Crashed vehicle rate comparisons. The abbreviation “V2V” stands for vehicle-to-vehicle. The abbreviation F2R stands for Front-to-rear. An 11th group, “all others” is not pictured.

Cyclist, motorcycle, and pedestrian (often referred to as vulnerable road user or VRU) crashes were each examined as individual groups because passenger vehicle collisions with VRUs in the outcome groups examined in this study are rare, which would not support splitting these crashes into more groups based on maneuvers at the current VMT. Next, various vehicle-to-vehicle (V2V) crash groupings were examined: backing, front-to-rear (F2R), opposite direction (Opp. Dir.), intersection, and lateral. Single vehicle crashes (involving the passenger vehicle striking an object or the ground) were also examined as their own group. Secondary crashes, where a vehicle is involved in a crash with another vehicle that had previously crashed, were separated as their own group. All other crashes were classified into an “all others” crash category, which includes missing or unknown crash types in the human benchmark data. These groups were chosen based on the highest level aggregations of crashes used by NHTSA in their standardized crash databases (e.g., the Crash Report Sampling System) and other typologies (Najm and Smith 2007; Kusano et al. 2023). In total, there were 11 crash type groups examined in this study.
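
The two-dimensional grouping (crash partner type and geometric configuration) can be sketched as a simple lookup. The field values, the function name, and in particular the order of precedence (secondary crash checked first) are assumptions made for illustration; the actual classification variables are given in the Supplementary Appendix.

```python
# Group labels follow the 11 crash type groups named in the text.
VRU_GROUPS = {
    "cyclist": "Cyclist",
    "motorcycle": "Motorcycle",
    "pedestrian": "Pedestrian",
}
V2V_GROUPS = {
    "backing": "V2V Backing",
    "front_to_rear": "V2V F2R",
    "opposite_direction": "V2V Opposite Direction",
    "intersection": "V2V Intersection",
    "lateral": "V2V Lateral",
}
ALL_GROUPS = (
    list(VRU_GROUPS.values())
    + list(V2V_GROUPS.values())
    + ["Single Vehicle", "Secondary Crash", "All Others"]
)


def crash_type_group(partner=None, configuration=None, secondary=False):
    """Map hypothetical crash attributes to one of the 11 groups.

    partner: "cyclist", "motorcycle", "pedestrian", "vehicle", "object", or None
    configuration: geometric configuration for vehicle-to-vehicle crashes
    secondary: True if the other vehicle had already crashed
    """
    if secondary:  # assumed precedence, not confirmed by the source
        return "Secondary Crash"
    if partner in VRU_GROUPS:
        return VRU_GROUPS[partner]
    if partner == "object":
        return "Single Vehicle"
    if partner == "vehicle" and configuration in V2V_GROUPS:
        return V2V_GROUPS[configuration]
    return "All Others"  # includes missing or unknown crash types


print(crash_type_group("vehicle", "intersection"))
print(crash_type_group("pedestrian"))
print(crash_type_group())
```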

Outcome levels

A number of outcome levels were considered for the current analysis based on previously established levels and what was readily possible from the underlying data sources. Scanlon et al. (2024a) outlined multiple severity levels potentially useful in a benchmarking analysis, which range from any amount of property damage to fatal injuries. Consistent with the analysis and recommendations provided by Scanlon et al. (2024a), outcome levels were pre-selected to minimize potential bias in reporting between the benchmark and ADS population. Using police-reported data as a benchmark, there are underreporting considerations and geographic-specific reporting thresholds. It is difficult to draw conclusions about an Any Property Damage or Injury outcome level, which includes all in-transport and impacted ADS crashes, because the human data has uncertainty in the lower reporting threshold and underreporting (Kusano et al. 2024). Additionally, because some reportable crashes are not reported to or by police, and the degree to which this underreporting occurred is unknown, a police-reported threshold was not considered. As discussed in Scanlon et al. (2024a), the Police-Reported human benchmarks may suffer from systematic underreporting, especially in California, where the state police report crash database does not require police jurisdictions to report property damage only crashes. Scanlon et al. (2024a) also noted challenges with a tow-away outcome level and recommended against its usage. ADS-equipped vehicles can be towed for a variety of reasons during only minor damage collisions, which makes comparability to human-driven vehicles challenging.
For completeness, the comparison of ADS and benchmark crash rates for the Police-Reported and Any Property Damage or Injury outcome are provided with the Online Supplemental Materials, but for the aforementioned reasons, these estimates are considered less credible and are not a focus of the current study.

As the traditional focus of traffic safety research has been on preventing serious and fatal injuries, this study examined Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ outcomes. These outcome levels are the most injury-relevant outcomes that are readily available in both human and ADS crash data and where there is sufficient ADS mileage to draw statistically relevant conclusions.

The NHTSA SGO and benchmark crash data was subset by observed outcomes in order to align the comparisons. SGO crashes were limited to those where the ADS vehicle was in-transport (i.e., not parked in a parking space) and impacted (i.e., the Waymo ADS vehicle struck or was struck by another road user or object). The Suspected Serious Injury+, Airbag Deployment, and Any-Injury-Reported outcomes were the primary focus of this study. A Suspected Serious Injury+ crash is one where someone involved sustains a “Killed” or “Incapacitating” police-reported injury. For example, “A” injuries in California police reports on the KABCO scale are “severe laceration resulting in exposure of underlying tissues/muscles/organs or resulting in significant loss of blood”, “broken or distorted extremity (arm or leg)”, “crush injuries”, “suspected skull, chest or abdominal injuries other than bruises or lacerations,” “Significant burns (second and third degree burns over 10% or more of the body)”, “unconsciousness when taken from the collision scene”, and/or “paralysis” (CHP, 2017). Police report data was obtained through public record requests for the three SGO crashes with “Serious” or “Fatality” SGO maximum severity. A police report obtained for case 30270-8968 indicated the sole injury reported in the crash was reported as a “complain of pain” (or “C” injury) on the police report, but was reported as a “Serious” injury in the SGO because that occupant was transported in an ambulance to seek medical treatment. The other two “Serious” or “Fatality” SGO severity crashes indicated an “Incapacitating” or “Killed” maximum severity on the police reports. As the human benchmarks use the police-reported injury severity, only the two (2) SGO-reported crashes with confirmed “K” or “A” maximum severity from police reports were included in the Suspected Serious Injury+ crashes.
Both Suspected Serious Injury+ crashes involving a Waymo vehicle during the study period were Secondary Crashes, meaning the Waymo was not involved in the first event in the crash sequence. See the Supplementary Appendix for a complete description of the Suspected Serious Injury+ crashes.

An Airbag Deployment crash is one where one or more vehicles involved deploys any airbag due to the crash. An Any-Injury-Reported crash is one where any level of injury is reported due to the crash. Note that outcomes were classified at the crash level including all parties involved in the crash. The injury outcome levels (Suspected Serious Injury+ and Any-Injury-Reported) were selected if any party in the crash was injured, whether riding in the Waymo vehicle or otherwise. Similarly, the Airbag Deployment outcome was selected if any vehicle involved in the collision sequence had an airbag deploy, not just the Waymo vehicle. The classification routines and variables used for each benchmark dataset can be found in the Supplementary Appendix. The Any-Injury-Reported benchmark utilized the same underreporting correction as described in Scanlon et al. (2024a). No underreporting correction was applied to the Airbag Deployment and Suspected Serious Injury+ benchmarks, as no data is available to estimate the amount of underreporting in these outcome levels. There is reason to believe that the underreporting in human crashes in these outcomes is non-zero.
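
The crash-level classification rule (an outcome applies if any party in the crash meets it) can be sketched as below. The `Party` record and its fields are hypothetical; the injury codes follow the KABCO scale mentioned above, with "O" assumed here to denote no apparent injury.

```python
from dataclasses import dataclass


@dataclass
class Party:
    """One party in a crash (hypothetical record layout)."""
    injury: str = "O"            # KABCO code: K, A, B, C, or O
    airbag_deployed: bool = False


def crash_outcomes(parties):
    """Classify outcomes at the crash level across all involved parties."""
    injuries = {p.injury for p in parties}
    return {
        "any_injury_reported": bool(injuries & {"K", "A", "B", "C"}),
        "airbag_deployment": any(p.airbag_deployed for p in parties),
        "suspected_serious_injury_plus": bool(injuries & {"K", "A"}),
    }


# Hypothetical crash: the ADS vehicle's occupants are uninjured, while the
# other vehicle reports a "C" injury and an airbag deployment.
outcomes = crash_outcomes([Party("O", False), Party("C", True)])
print(outcomes)
```

Because the rule is applied across all parties, a crash counts toward an outcome even when the injury or airbag deployment occurred only in the other vehicle.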

Statistical testing

A statistical comparison between the ADS and benchmark crash rates was done using Clopper-Pearson limits to estimate 95% confidence intervals for the ratio of two Poisson mean occurrence rates, as described by Nelson (1970, Appendix I), which is the same method adopted by Kusano et al. (2024).
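
One common way to obtain such an interval is the conditional approach: given the total number of events, the ADS count is binomial, so Clopper-Pearson limits on the binomial proportion can be converted into limits on the rate ratio. The sketch below follows that approach using only the standard library; whether it matches the exact computation in Nelson (1970) is an assumption, it requires integer event counts, and the counts and mileages in the example are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), accumulated in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_increasing(f, target, iters=100):
    """Bisection on [0, 1] for a function f that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) limits for a binomial proportion x / n."""
    alpha = 1.0 - conf
    # Lower limit solves P(X >= x) = alpha / 2; upper solves P(X <= x) = alpha / 2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0)
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2.0)
    return lower, upper


def rate_ratio_ci(ads_events, ads_miles, bench_events, bench_miles, conf=0.95):
    """Point estimate and interval for (ADS rate) / (benchmark rate)."""
    n = ads_events + bench_events
    if n == 0:
        raise ValueError("at least one event is required")
    p_lo, p_hi = clopper_pearson(ads_events, n, conf)
    scale = bench_miles / ads_miles

    def to_ratio(p):
        return math.inf if p >= 1.0 else scale * p / (1.0 - p)

    point = (math.inf if bench_events == 0
             else (ads_events / ads_miles) / (bench_events / bench_miles))
    return point, to_ratio(p_lo), to_ratio(p_hi)


# Hypothetical example: 2 ADS events in 50 million miles compared with
# 400 benchmark events in 1,000 million miles.
point, low, high = rate_ratio_ci(2, 50.0, 400, 1000.0)
print(f"ratio {point:.3f}, 95% CI ({low:.3f}, {high:.3f})")
print(f"reduction {100 * (1 - point):.0f}% "
      f"({100 * (1 - high):.0f}% to {100 * (1 - low):.0f}%)")
```

A reduction is statistically significant at the chosen level when the upper limit of the rate ratio is below 1, which is equivalent to the lower limit of the percent reduction being above zero.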

Results

ADS and benchmark population comparisons

The ADS and human benchmark data were extracted from the same locations (counties), vehicle types (passenger vehicles), and road types (surface streets) in an attempt to account for these factors that can affect crash rates. The dynamic benchmark adjustment for spatial driving mix further aligns the benchmark and ADS driving population. The research question of this study addresses a comparison between the Waymo RO service and the overall driving population in these areas. Therefore, human driver demographics are not of particular importance because the entire driving population is represented in the data. Table 2 shows the number of Waymo RO miles by location and calendar year. Most of the benchmark data is from calendar year 2022, which is the last year with complete data available at the time of writing.

Table 2. Waymo Rider-Only millions of miles by calendar year and location.

Aggregate crash rate comparison

Table 3 lists the number of events with the Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ outcomes by location for the Waymo RO service during the study period. During the study period, there were 31.159 million miles in Phoenix, 18.260 million miles in San Francisco, 6.448 million miles in Los Angeles, and 0.834 million miles in Austin driven by the Waymo RO service for a total mileage of 56.700 million miles. Table 4 shows the comparison of the aggregate Waymo RO and benchmark crash rates for the Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ outcomes. These comparisons were not statistically significant for Los Angeles due to limited mileage. The point estimates in Los Angeles, however, are of similar magnitude to those in Phoenix and San Francisco. The Waymo RO service had a statistically significant reduction in Any-Injury-Reported and Airbag Deployment outcomes in Phoenix, San Francisco, and all locations combined. The Waymo RO service had a statistically significant reduction in Suspected Serious Injury+ crashes when considering all locations combined, with an 85% reduction (39% to 99% reduction 95% confidence interval), in Phoenix, with a 100% reduction (3% to 100% reduction 95% confidence interval), and in San Francisco, with a 76% reduction (3% to 98% reduction 95% confidence interval).

Table 3. Event counts by outcome and location (through January 2025, 56.7 M RO miles).

Table 4. Comparison of Waymo RO and human benchmark crashed vehicle rates for Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+ crashes (through January 2025, 56.7 M RO miles).

Crash rate comparison by crash type

The number of observed Waymo events by outcome level is listed in the Supplementary Appendix. Table 5 compares the Waymo RO and benchmark crash rates by crash type for the Any-Injury-Reported outcome in all locations combined. Results for individual locations by crash type are included in the Supplementary Appendix. The results show a statistically significant reduction in Any-Injury-Reported crashes in Cyclist, Motorcyclist, Pedestrian, Secondary Crash, Single Vehicle, V2V Intersection, and V2V Lateral crash types in all locations combined. When evaluating individual locations, the same crash types had a statistically significant reduction in Any-Injury-Reported crashes in San Francisco except for Secondary Crashes. In addition, there was a statistically significant reduction in V2V F2R crashes in San Francisco. Only V2V Intersection crashes had a statistically significant reduction for Any-Injury-Reported crashes in Phoenix. The reduction in V2V Intersection crashes was also statistically significant in Los Angeles.

Table 5. Comparison of ADS and human benchmark (with dynamic benchmark adjustment) crashed vehicle rates in all locations combined by crash type in Any-Injury-Reported crashes (through January 2025, 56.7 M RO miles).

Table 6 compares the Waymo RO and benchmark crash rates by crash type for the Airbag Deployment outcome in all locations combined. Results for individual locations by crash type are included in the Supplementary Appendix. In all locations combined, the Single Vehicle and V2V Intersection crash types had a statistically significant reduction in Airbag Deployment crashes. The reduction in the Airbag Deployment V2V Intersection crash type was also statistically significant when considering Phoenix, San Francisco, and Los Angeles alone. Additionally, there was a statistically significant reduction in Airbag Deployment Single Vehicle crashes in San Francisco and Secondary Crashes in Phoenix. Other crash type comparisons were not statistically significant. See the Supplementary Appendix for a description of additional trends in select crash types.

Table 6. Comparison of ADS and human benchmark (with dynamic benchmark adjustment) crashed vehicle rates in all locations combined by crash type in Airbag Deployment crashes (through January 2025, 56.7 M RO miles).

Figures 3 and 4 compare the number of crashes expected for a driver with the average human benchmark rate driving the same distance as the Waymo RO service against the number of crashes the Waymo RO service actually experienced, along with the percent reductions.
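The arithmetic behind such a comparison can be sketched as follows. This is an illustrative sketch only: the function and field names are hypothetical, the inputs in the usage line are placeholder values rather than study results, and a single benchmark rate is used where the study combines location-specific, adjusted benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    expected: float           # crashes expected at the benchmark rate
    fewer: float              # expected minus observed
    percent_reduction: float  # 100 * (1 - observed / expected)


def compare_to_benchmark(observed: int,
                         benchmark_rate_ipmm: float,
                         miles_millions: float) -> Comparison:
    """Compare an observed crash count with a benchmark rate over the same miles."""
    expected = benchmark_rate_ipmm * miles_millions
    if expected <= 0:
        raise ValueError("benchmark rate and mileage must be positive")
    fewer = expected - observed
    percent_reduction = 100.0 * (1.0 - observed / expected)
    return Comparison(expected, fewer, percent_reduction)


# Placeholder inputs (NOT study data): 10 observed crashes, a benchmark of
# 1.0 incidents per million miles, and 50 million miles driven.
example = compare_to_benchmark(observed=10, benchmark_rate_ipmm=1.0, miles_millions=50.0)
print(example)
```

A negative `fewer` value would correspond to more observed crashes than the benchmark predicts, as reported in the Discussion for F2R Struck crashes.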

Figure 3. Comparison of observed Waymo and average benchmark Any-Injury-Reported crashes in all locations over 56.7 million miles. Comparisons labeled with an asterisk (*) had a statistically significant difference in crashed vehicle rates (See Table 5).

Figure 4. Comparison of observed Waymo and average benchmark Airbag Deployment crashes in all locations over 56.7 million miles. Comparisons labeled with an asterisk (*) had a statistically significant difference in crashed vehicle rates (See Table 6).

Discussion

Interpretation of results

The results of this study show for the first time using retrospective crash data that the Waymo RO SAE level 4 ADS has a statistically significant reduction in a Suspected Serious Injury+ outcome, in addition to crashes resulting in injury of any severity, which are dominated by frequent minor injuries. The result of an 85% reduction in Suspected Serious Injury+ crashes (39% to 99% reduction 95% confidence interval) is an indication of an effect, but is subject to a low number of observations (2 ADS crashes) with large confidence intervals. The 2 Suspected Serious Injury+ crashes involving Waymo vehicles are summarized in the Supplementary Appendix; both were of the Secondary Crash type. In one of the two crashes, the Waymo vehicle was stationary in traffic when the crash occurred; in the other, the Waymo vehicle was in a Secondary Crash with a high-speed red light runner, which was redirected and struck pedestrians on the sidewalk after the crash. As discussed below, future research could develop objective measures of crash contribution that can be applied to both benchmark and ADS crashes. The magnitude of the effect is similar to past simulation studies of the Waymo Driver’s performance in reconstructed fatal human crashes (Scanlon et al. 2021). The previous simulation study was performed in Chandler, AZ, which is a more suburban location compared to the current operating areas that have the most driving in densely populated areas. This Suspected Serious Injury+ result is also in line with the multiple complementary design-based methods used to set requirements and evaluate the Waymo Driver prior to deployment (Webb et al. 2020). These methods include simulation-based methods, including the Collision Avoidance Testing (CAT) method (Kusano et al. 2023) that compares the Waymo Driver’s performance to a Non-Impaired Eyes On conflict (NIEON) model (Engström et al. 2024). Future studies could continue to study the retrospective performance of the Suspected Serious Injury+ and other high severity outcomes as more ADS mileage is collected that enables further analysis.

When analyzing the crash performance by crash type, the Waymo RO service had statistically significant reductions in Cyclist, Motorcycle, Pedestrian, Secondary Crash, Single Vehicle, V2V Intersection, and V2V Lateral crashes for the Any-Injury-Reported outcome and in Single Vehicle and V2V Intersection crashes for the Airbag Deployment outcome. Crashes involving VRUs (including Motorcyclists and Pedestrians), Single Vehicle, and V2V Intersection account for a large proportion of the benchmark crash rate for the Any-Injury-Reported and Airbag Deployment outcome levels. By significantly reducing crashes in these frequent crash modes, the Waymo RO service was able to achieve overall reductions in aggregate crash rates in these two outcomes. Comparisons in all other crash types for these two outcomes were not statistically significant, although generally lower for the SAE level 4 ADS. These non-significant results suggest the need to accumulate more ADS miles in order to determine whether Waymo RO and benchmark crash rates differ. Given the low number of observed Suspected Serious Injury+ ADS crashes, more miles are needed to draw statistical conclusions about Suspected Serious Injury+ performance in individual crash types. Taken together, the methods and data examined in this study represent the most comprehensive attempt to account for multiple confounding factors between ADS and benchmark data and feature the largest dataset analyzed to date. Therefore, the results of this study represent the most compelling evidence of a meaningful safety benefit of the Waymo RO SAE level 4 ADS operating in a ride-hailing setting in San Francisco, Los Angeles, and Phoenix.

One important reason for performing analysis by crash type is to isolate crashes with similar contributing factors and thus draw some conclusions about the safety system’s performance in those types of crashes. The severity of collisions can differ dramatically between different crash types. Using US crash data, Kusano et al. (2023) found that front-to-rear crashes made up 36% of police-reported crashes resulting in property damage or minor injury but only 8% of fatal crashes, while crashes between passenger vehicles and pedestrians or motorcyclists made up 36% and 13% of fatal crashes, respectively, but less than 1% each of minor injury crashes. Intersection crashes are common in both minor injury (25%) and fatal crashes (27%). Therefore, it is particularly promising that the Waymo RO service had reductions in crash types associated with serious injuries (V2V Intersection, Motorcycle, and Pedestrian crash types). Although there are not yet sufficient miles to analyze the Suspected Serious Injury+ outcome by crash type, it is promising to see an apparent reduction in the number of Suspected Serious Injury+ outcomes at the aggregate level.

Although there were no statistically significant results suggesting that the Waymo RO service had an elevated crash rate relative to the benchmark in any of the 11 crash modes examined, a supplemental analysis that split the F2R crash type into F2R Striking and F2R Struck rates found the Waymo vehicle had a lower F2R Striking rate for Airbag Deployment and Any-Injury-Reported at a statistically significant level and a statistically significant increase in F2R Struck crashes at the Any-Injury-Reported outcome level in Phoenix (see Supplementary Appendix). As more data becomes available, more statistically significant conclusions will be drawn, and thus it is reasonable to ask how one should interpret an ADS with no change or an increase in certain crash types but decreases in others relative to some benchmark. This study did not account for crash contribution. Different crash types may have different levels of contribution from parties depending on the maneuver of the party. Another possibility is that different crash types have different proportions of crashes where one or more parties have little to no opportunity to contribute to the outcome of the crash. For example, in Scanlon et al. (2021), which simulated the Waymo Driver placed in reconstructed fatal crashes involving human drivers, in 8% of responder role simulations there was little to no opportunity for the Waymo Driver to avoid the collision (i.e., the vehicle was stopped at a traffic light when struck from behind at high rates of speed). This type of collision, where the Waymo ADS vehicle was stationary in traffic with little to no ability to prevent the collision, accounted for 1 out of the 2 Suspected Serious Injury+ crashes during the study period (30270-9724).

Regardless of crash contribution, past road safety systems, such as automated red light enforcement cameras and front crash prevention systems, have shown the ability to reduce a certain type of collision while slightly increasing the rate of a different type of collision. Because road safety, and the harm that results from traffic crashes, is often seen as a public health issue, systems are judged on their aggregate, overall contribution to safety, and some risk redistribution is tolerated if the aggregate safety benefit is deemed acceptable. For example, red light enforcement cameras (infrastructure that can automatically detect and ticket drivers who run red traffic lights) have been found to reduce fatal red light running and all fatal crashes at intersections after installation, but increase rear-end collisions to a lesser degree in some situations (McGee and Eccles 2003; Hu and Cicchino 2017). The main safety intervention (to disincentivize red light running) was effective in reducing intersection crashes, but may have also caused some drivers to suddenly come to a stop to avoid the potential of running a red light, thus increasing the risk of being struck in the rear. Because red light running crashes, which feature vehicles often traveling at high speeds and side impacts which are more injurious than front-to-rear impacts, were reduced to a large degree, the potential increase of rear-end crashes was accepted because of the overall benefit of this safety intervention. Similarly, a large retrospective study using insurance claims data for front crash prevention systems, including FCW and AEB, found a combination of FCW and AEB reduced crashes by 50% and injury crashes by 56%, but increased rear-end struck crashes by 20% (Cicchino 2017). Overall, these front crash prevention systems were estimated to have a potential to prevent 1 million crashes per year and 400,000 injuries (Cicchino 2017), which is one of the reasons 20 automakers voluntarily committed to making AEB standard equipment on all new vehicles in the US by 2022. These are just two examples of numerous other road safety innovations that have reduced overall harm but had some measurable residual risks. Therefore, ADS should be considered in a similar way as a replacement for human driving where the evidence suggests there are large aggregate safety benefits. Potential increases over a benchmark in sub-groups of crashes should be considered in the context of the overall safety benefit of the system as well as the role of crash contribution, which is not considered in this study.

Similarly, the current study found the Waymo RO service had large reductions in V2V Intersection, Single Vehicle, and VRU crashes. In all locations and compared to a driver with the average benchmark rate driving the same 56.7 million miles as the Waymo Driver, the Waymo Driver experienced 86 fewer V2V Intersection, 45 fewer VRU, 14 fewer Single Vehicle, 12 fewer V2V Lateral, 10 fewer V2V F2R, and 8 fewer Secondary Crash Any-Injury-Reported crashes. Compared to an average human driver, the Waymo RO service experienced 50 fewer V2V Intersection, 11 fewer Single Vehicle, and 5 fewer F2R Airbag Deployment crashes. When breaking down F2R crashes by F2R Striking and F2R Struck, the Waymo RO service had 17 fewer F2R Striking and 8 more F2R Struck Any-Injury-Reported crashes and 6 fewer F2R Striking and 1 more F2R Struck Airbag Deployment crashes compared to an average human. Like past safety systems, the magnitudes of the Waymo RO service's reductions in most crash groups, including those that most often result in the most serious injuries, were far larger than the increases in F2R Struck crashes.

See the Supplementary Appendix for additional results comparisons to prior studies.

Difficulties in examining ADS crash contribution

The research questions addressed in the current study are related to the Waymo ADS overall crash rate compared to human benchmarks, regardless of contribution. This overall crash rate view complements past studies that use 3rd party liability claims rates as a surrogate for ADS crash contribution and have found the Waymo RO service has a large reduction in 3rd party property damage and personal injury liability claims (Di Lillo et al. 2024a, 2024b). Even 3rd party liability claims have limitations in that frequency and/or payment amounts associated with insurance claims may not always be due to responsibility in the collision. Additionally, companies that operate fleets of vehicles like the current ride-hailing ADS deployments may have different insurance risk profiles than the private insurance that is used by many human drivers. Although insurance claims data is likely a good proxy to investigate ADS contribution in crashes, there is a research need to develop more objective assessments of crash contribution. One possible way to develop such objective crash contribution assessments is through the use of reference behavior models, sometimes called driver models. The reference models used for crash contribution assessments should reflect proper Drivership, including behavior that matches normative expectations of good driving (Fraade-Blanar et al. 2025).

Limitations

This study had several limitations. Although the study takes steps to align the benchmark and ADS data using dimensions available in both human and ADS data sources (such as geographic location, road type, vehicle type, and spatial driving density), there is an endless list of possible factors to potentially account for and of adjustment methodologies to refine. One dimension discussed in the dynamic benchmark adjustment done by Chen et al. (2024), but not implemented in the current study, is an adjustment for time of day. Chen et al. (2024) found that, through the first 21.9 million RO miles, the Waymo ADS fleet in San Francisco drove slightly more in the evening and overnight (6:30PM–3AM, 41% vs 24%) and early morning (3AM–6AM, 6% vs 3%) and less during the daytime (9AM–3:30PM, 16% vs 40%). In total, however, the time-of-day adjustment resulted in a benchmark that was higher than the baseline benchmark by 1.05 times [1.03, 1.06] for the Any-Injury-Reported outcome and 1.16 times [1.13, 1.18] for the Airbag Deployment outcome in San Francisco (Chen et al. 2024). This suggests that the benchmark in San Francisco used in this study is conservative, in that it likely underestimates the elevated crash risk associated with driving more at higher risk times of day. This time-of-day adjustment was only possible in San Francisco based on extrapolating a single traffic study that had human VMT by time-of-day. No such data could be found in other cities, and this single study is likely less robust than the spatial VMT data used by Chen et al. (2024) to perform the spatial dynamic adjustment used in the current study. This study used an underreporting adjustment for the benchmark Any-Injury-Reported outcome, which accounts for the 33% of injury crashes that were estimated to not be reported to police by Blincoe et al. (2023). The Airbag Deployment and Suspected Serious Injury+ benchmarks, however, did not have an underreporting adjustment applied even though there is likely non-zero underreporting in human crash data. This lack of underreporting adjustment for these outcomes likely makes the comparisons in these levels conservative, because there is assumed to be little underreporting in the ADS crash data due to automated collision reporting using vehicle sensors and operational procedures (Kusano et al. 2024).
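The direction and rough size of these benchmark adjustments can be illustrated with a short sketch. The exact adjustment formulas are not given in this section, so the code below assumes that the underreporting adjustment divides the police-reported rate by the reported share, and that the time-of-day adjustment is a simple multiplier. Function names are hypothetical and the starting rate is a unit placeholder, not a study value.

```python
def adjust_for_underreporting(reported_rate: float, unreported_fraction: float) -> float:
    """Scale a police-reported rate up to an estimated true rate.

    Assumption: if a fraction u of crashes is never reported, the reported
    rate is (1 - u) of the true rate, so true = reported / (1 - u).
    """
    if not 0.0 <= unreported_fraction < 1.0:
        raise ValueError("unreported_fraction must be in [0, 1)")
    return reported_rate / (1.0 - unreported_fraction)


def apply_time_of_day_factor(benchmark_rate: float, factor: float) -> float:
    """Multiply a benchmark rate by a time-of-day adjustment factor."""
    return benchmark_rate * factor


UNIT_RATE = 1.0  # placeholder police-reported benchmark rate (arbitrary units)

# Any-Injury-Reported: 33% underreporting (from the text), then the 1.05
# San Francisco time-of-day factor (from the text, not applied in the study).
any_injury = adjust_for_underreporting(UNIT_RATE, 0.33)
any_injury_tod = apply_time_of_day_factor(any_injury, 1.05)

# Airbag Deployment: no underreporting adjustment, 1.16 time-of-day factor.
airbag_tod = apply_time_of_day_factor(UNIT_RATE, 1.16)

print(round(any_injury, 3), round(any_injury_tod, 3), round(airbag_tod, 3))
```

Both adjustments raise the benchmark, which is why omitting them makes the comparison conservative with respect to the ADS.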

Aligning the benchmark and Waymo crash and mileage data also comes with its own implementation challenges. On the Waymo data side, the authors have detailed information about the mileage and crashes that have been made available. On the human driving side, there is uncertainty and potential bias introduced due to the nature of the mileage and crash data. A variety of geographic-specific variable and value pairings are needed to perform this study’s classification routines for vehicle type, road type, and crash type (see the Supplementary Appendix). The underlying raw data being relied upon (e.g., police reports) have limited specificity and are also subject to input error. The mileage estimates are also derived from a variety of geographic-specific traffic sampling methodologies, and, in the absence of some ground truth data to validate, it is not clear how much uncertainty or bias should be attributed to these estimates. In the crash type groups used in this paper, there was an “All Others” category to capture both unknown crash types and crash types that have a configuration that is not described by the other 10 crash type groups. Only the human data had “unknown” crash type, as sufficient details to determine crash type were present for all SGO crashes examined. Examples of crashes included in the “All Other” category include “vehicle-to-vehicle dooring crashes” (where the open door of a vehicle is struck by another road user) and other unique collision circumstances that did not fit into the existing 10 categories.

Scanlon et al. (2024a) examined the benchmark crash rates over time and found that after a disruption during the COVID-19 pandemic in 2020, the 2021 and 2022 crash rates were relatively stable. As most of the Waymo RO driving was performed in 2023 and beyond (98% of the miles), future work could examine the effect of changes in the human benchmark with time.

Similar to the analysis in Kusano et al. (2024), the Waymo ADS vehicle was not always occupied while driving in RO configuration (e.g., traveling between dropping off and picking up passengers). The research question of the current study investigates the effect of the Waymo RO service on the current status quo of human driving. As Waymo is operating a ride-hailing service at relatively small scales relative to the overall human driving population, it is not unreasonable to assume that much of the VMT driven by the Waymo RO service would have been serviced by human ride-hailing services. In human ride-hailing, there is a human driver driving the vehicle between dropping off and picking up passengers with the potential to be injured if a crash occurs. The Waymo service thus reduces the risk of injury by removing additional people from potentially hazardous crashes. Additionally, the analysis includes any injury to any participant involved in a crash, including other human-operated vehicles involved in the crash. Even if the Waymo vehicle was completely unoccupied at all times, there could still be the potential for human injuries in the event of crashes, especially given that the Waymo mileage is being accumulated in dense urban areas where vulnerable road user crashes are more common. The Airbag Deployment outcome was intentionally included as an independent, complementary outcome that is not sensitive to the occupancy status of the Waymo vehicle. The Waymo vehicles comply with all relevant Federal Motor Vehicle Safety Standards (FMVSSs), and thus many of the airbag systems, including the driver front and side airbags, will fire regardless of occupancy. Future research could address the potential impacts of VMT shifts caused by automated driving and shifts in vehicle occupancy and/or vehicle seating position that may affect observed outcomes, but it should be noted that this removes the inherent benefits achieved from a lack of an occupant to be injured in between ride-hailing pickups. Another alternative to examining outcomes is to use injury risk functions that are functions of crash inputs (like delta-V) and are insensitive to occupancy or reported outcomes.

Some human crash databases define injuries using the Abbreviated Injury Scale (AIS) (AAAM 2016). The AIS is an anatomical injury scoring scheme that is coded by certified professionals based on medical records. The ADS crashes reported in the NHTSA SGO and used in this study do not have AIS codes available. Having available AIS, or other more detailed injury data than police-reported maximum severity, would make the ADS SGO more comparable to some human crash data sources that also have AIS injury data. This AIS coding, however, would require additional resources dedicated to crash investigation and a framework to address privacy concerns. Typically, human crash data with AIS codes in the US are overseen by NHTSA and information is collected by independent crash investigation firms, and not self-reported by manufacturers like the NHTSA SGO.

If multiple comparisons are made across many different dimensions, the probability of detecting false positive significant results increases. This study performs a number of comparisons across different outcome levels (3), crash modes (11), and locations (4). The approach to examine multiple outcome levels, which each have unique reporting challenges, was an attempt to draw broad conclusions about the performance of ADS relative to the current human driving population. The crash type comparison in the study was performed to gain insight into which crash modes contribute to the observed aggregate safety benefits. However, due to multiple comparisons, care should be taken when considering the statistical significance of individual comparisons as no multiple comparison corrections have been made.
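To illustrate what a multiple comparison correction would involve, the sketch below applies the Holm-Bonferroni step-down adjustment. This technique was not used in the study and is shown only as one common option. The p-values are hypothetical placeholders, and the count of 3 × 11 × 4 = 132 is an upper bound, since not every outcome was analyzed by crash type.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Upper bound on the number of comparisons: outcomes x crash modes x locations.
N_COMPARISONS = 3 * 11 * 4
bonferroni_threshold = 0.05 / N_COMPARISONS

hypothetical_p = [0.01, 0.04, 0.03]  # placeholder p-values, NOT study results
print(N_COMPARISONS, bonferroni_threshold, holm_adjust(hypothetical_p))
```

With up to 132 comparisons, a plain Bonferroni threshold would be about 0.0004 per comparison, which shows how much stricter a corrected analysis could be than the uncorrected 0.05 level.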

Conclusions

An increase in ADS mileage on public roads in recent years enables additional retrospective safety assessments to be performed. This study compared crash rates of the Waymo RO ride-hailing service to aligned benchmarks in Phoenix, San Francisco, Los Angeles, and Austin for the outcomes of Any-Injury-Reported, Airbag Deployment, and Suspected Serious Injury+. Compared to past ADS safety impact studies, the 56.7 million RO miles and the ADS relative performance to the benchmark during the study period through January 2025 allowed for the first time a statistically relevant comparison to a Suspected Serious Injury+ outcome at the aggregate (all crash) level and comparisons disaggregated into 11 crash types for the Any-Injury-Reported and Airbag Deployment outcomes. At the aggregate crash level, the study found statistically significant reductions in Any-Injury-Reported and Airbag Deployment outcomes when considered in all locations combined (79% CI: [71%, 85%] and 81% CI: [69%, 90%] reduction, respectively) and in San Francisco and Phoenix individually. The ADS reductions in these two outcome groups were of a similar magnitude to those in an earlier study of the first 7.1 million Waymo RO miles (Kusano et al. 2024). The current study found the Waymo RO service had a statistically significant reduction in Suspected Serious Injury+ outcome crashes compared to the benchmark when considering all locations combined (85% CI: [39%, 99%] reduction), in Phoenix (100% CI: [3%, 100%]), and in San Francisco (76% CI: [3%, 98%]). Comparisons in Los Angeles were not statistically significant. Both (2) Suspected Serious Injury+ crashes involving a Waymo vehicle during the study period were Secondary Crashes, meaning the Waymo was not involved in the first event in the crash sequence. Compared to the benchmark crash rate representing the current driving fleet, the Waymo Driver experienced 181 fewer Any-Injury-Reported, 78 fewer Airbag Deployment, and 11 fewer Suspected Serious Injury+ crashes during the study period.

The aggregate crash reductions by the Waymo ADS in the Any-Injury-Reported outcome group were driven by statistically significant reductions in Cyclist (82%), Motorcycle (82%), Pedestrian (92%), Secondary Crash (66%), Single Vehicle (93%), V2V Intersection (96%), and V2V Lateral (74%) crash types when considering all locations. The reductions in Single Vehicle (100% reduction) and V2V Intersection (91% reduction) Airbag Deployment crashes were also statistically significant in all locations combined. All other crash rate comparisons disaggregated by crash type found the Waymo RO crash rate was not statistically different from the benchmark rate in all locations combined. The V2V Intersection crash rate was lower than the benchmark at a statistically significant level in Phoenix, San Francisco, and Los Angeles when each was examined separately, for both Any-Injury-Reported and Airbag Deployment crashes.

The results of this study suggest increasing confidence that a level 4 ADS reduces Any-Injury-Reported and Airbag Deployment outcome crashes, primarily by reducing V2V Intersection and Single Vehicle crashes for both outcomes and VRU (i.e., cyclist, motorcyclist, and pedestrian), Secondary, and Lateral crashes for the Any-Injury-Reported outcome. The crash groups with significant reductions represent some of the most frequent crash modes in the environment in which the Waymo ADS currently operates. The results of this study strongly suggest a safety benefit of the Waymo RO service, an SAE level 4 ADS, operating as a ride-hailing vehicle in the Any-Injury-Reported and Airbag Deployment outcome levels, and suggest a benefit for Suspected Serious Injury+ crashes. The methodology of this study crafted the benchmarks through subselection and adjustments that attempted to account for many known factors that affect crash risk. Because conservative assumptions were made where appropriate and sensitivity analysis was performed, the remaining uncertainty in the benchmarks would likely not change the conclusions of the study. Future research should continue refining the alignment between benchmark and ADS crash and mileage data sources and continue to monitor serious and fatal severity injury outcomes, like the Suspected Serious Injury+ outcome, which have traditionally been the primary focus of road safety efforts.

Disclosure statement

All authors are employed by Waymo LLC.

Data availability statement

Public data sources used in this analysis are cited and available from state agencies and NHTSA. Waymo mileage data is self-reported and available for download at https://www.waymo.com/safety/impact. Data files with full study results and listing of NHTSA SGO cases used in this analysis are provided as supplemental online downloads.