Studying a year's commutes using Google Location Services data.
Introduction¶
I'm a physicist at the University of Chicago, and often work on experiments at Fermilab. UChicago is in Hyde Park on the south side of Chicago, while Fermilab is a bit west near Batavia, IL. I live in the middle, and thus spend a lot of time on Chicagoland's glorious highway system getting to these places.
As a scientist I love data, and I want to optimize my commute so that I can get on with the science-ing. I realized that Google Location Services has been my trusty companion on all my travels, all that data just waiting to be analyzed.
In this post I'll do some analysis on my own commutes, and encourage you to download your location history and do the same. You can grab this and much more from https://takeout.google.com. It's your data!
For this project, we'll use Python and a few excellent packages: Jupyter (this notebook), Pandas, NumPy, and SciPy for analysis, matplotlib for plotting; plus we'll take statsmodels' regression modeling with R-style formulas for a spin.
Preparing the Data¶
First, let's understand the Location History data, which is available for download as a giant JSON file from https://takeout.google.com/.
The schema goes something like:
"locations": [
{
"timestampMs": "< UTC timestamp in milliseconds >",
"latitudeE7: < Latitude * 10^7 >,
"longitudeE7: < Longitude * 10^7 >,
"accuracy": < Units? >,
"altitude": < If accurate, probably in meters >,
"verticalAccuracy": < Also meters? >,
"velocity": < Units? >,
"heading": < Degrees >,
"activity": [
{
"timestampMs": "< Slightly different than above [ms] >",
"activity": [
{
"type": "< An activity type, see below >",
"confidence": < Number 0-100, doesn't add to 100 >,
},
...
],
"extra": [
{
"type": "VALUE",
"name": "vehicle_personal_confidence",
"intVal": 100
}
]
},
...
}
},
...
]
The velocity, heading, and activity fields are optional. When the extra block under
activity appears (which it does only sporadically), it always takes the form shown
above, at least in my data. A single record can carry a whole bunch of candidate
activities, each with a confidence score from 0-100; the scores are not mutually
exclusive and don't sum to 100.
Activities include: 'IN_VEHICLE', 'UNKNOWN', 'ON_FOOT', 'TILTING', 'STILL', 'EXITING_VEHICLE', 'ON_BICYCLE', 'WALKING', 'RUNNING', 'IN_ROAD_VEHICLE', 'IN_RAIL_VEHICLE'. I assume this is based on activity recognition done by my phone, so it may be device-dependent.
First, import all the things!
And actually load the JSON data we've grabbed from Google:
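As a minimal, self-contained sketch of that step (here writing a tiny two-record sample in the Takeout schema, then loading it the same way you'd load the real, much larger file; the filename is an assumption):

```python
import json

# Tiny sample in the Takeout schema, standing in for the real export
sample = {'locations': [
    {'timestampMs': '1491840000000', 'latitudeE7': 417944000,
     'longitudeE7': -875994000, 'accuracy': 20},
    {'timestampMs': '1491840060000', 'latitudeE7': 417950000,
     'longitudeE7': -876000000, 'accuracy': 25},
]}
with open('Location History.json', 'w') as f:
    json.dump(sample, f)

# The actual loading step: one json.load over the whole file
with open('Location History.json') as f:
    locations = json.load(f)['locations']
print('Loaded %d records' % len(locations))
```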
For this project, I'm only interested in records where I'm in a vehicle. I'll trust the activity recognition to get that mostly right, and reduce the activity list down to just the most likely activity string:
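The original cell isn't shown here, but the reduction might look something like this sketch (the choice of the first activity report and max-confidence tie-breaking are assumptions based on the schema above):

```python
def most_likely_activity(record):
    """Return the highest-confidence activity type for a record, or None."""
    reports = record.get('activity', [])
    if not reports:
        return None
    # Use the first (closest-in-time) activity report for this record
    candidates = reports[0]['activity']
    best = max(candidates, key=lambda a: a['confidence'])
    return best['type']

# A record shaped like the schema above
sample = {
    'timestampMs': '1491840000000',
    'latitudeE7': 418000000,
    'longitudeE7': -876000000,
    'activity': [{
        'timestampMs': '1491840001000',
        'activity': [
            {'type': 'IN_VEHICLE', 'confidence': 85},
            {'type': 'ON_FOOT', 'confidence': 10},
        ],
    }],
}
print(most_likely_activity(sample))  # IN_VEHICLE
```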
Now, we load the data into a pandas.DataFrame, and do a little tidying up:
- convert the timestamp to a datetime with the correct timezone (Chicago),
- convert latitude and longitude to floating point,
- drop a few unneeded columns.
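Those tidying steps might look like the following sketch (column names follow the schema above; the two sample records are synthetic):

```python
import pandas as pd

# Synthetic records standing in for the parsed Takeout data
records = [
    {'timestampMs': '1491840000000', 'latitudeE7': 417944000,
     'longitudeE7': -875994000, 'accuracy': 20},
    {'timestampMs': '1491840060000', 'latitudeE7': 417950000,
     'longitudeE7': -876000000, 'accuracy': 25},
]
df = pd.DataFrame(records)

# Millisecond UTC timestamps -> timezone-aware datetimes in Chicago time
df['timestamp'] = (pd.to_datetime(df.timestampMs.astype('int64'), unit='ms')
                     .dt.tz_localize('UTC')
                     .dt.tz_convert('America/Chicago'))

# E7-scaled integer coordinates -> floating-point degrees
df['latitude'] = df.latitudeE7 / 1e7
df['longitude'] = df.longitudeE7 / 1e7

# Drop the raw columns we no longer need
df = df.drop(columns=['timestampMs', 'latitudeE7', 'longitudeE7'])
print(df.dtypes)
```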
Now, there is some work to do to turn a series of timestamped records into discrete commutes with durations. Incorrect activity recognition and other noise will complicate things, and we'll need to make some assumptions. I define a commute as:
- In a vehicle
- On a weekday
- Between 30 minutes and 3 hours long
- Occurring between 5 and 10 AM (morning) or 2 and 10 PM (evening)
As a trick for extracting the start and end times, I assume that commutes are isolated in time, with at least $N$ non-vehicular minutes on either side. Programmatically, this means finding a set of VEHICLE-tagged samples meeting the above cuts, where the $\Delta t$ to the previous/next VEHICLE-tagged sample is at least $N$ minutes. This should yield the start and end timestamps, and if there are more than two such samples for a given time window, we take the first two. Later we'll apply more filters to the commutes to further clean things up.
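The gap-finding trick can be sketched on a toy series of VEHICLE-tagged timestamps (here with $N$ = 30 minutes, an illustrative choice; the real cut values aren't shown in this excerpt):

```python
import pandas as pd

# Toy VEHICLE-tagged sample times: two trips separated by a long
# non-driving gap
times = pd.to_datetime([
    '2017-06-05 07:30', '2017-06-05 07:40', '2017-06-05 08:10',  # trip 1
    '2017-06-05 17:00', '2017-06-05 17:25', '2017-06-05 17:50',  # trip 2
])
s = pd.Series(times)

gap = pd.Timedelta(minutes=30)
# A sample starts a trip if the previous vehicle sample is more than N
# minutes earlier (or doesn't exist); it ends one if the next vehicle
# sample is more than N minutes later (or doesn't exist).
starts = s[s.diff().isna() | (s.diff() > gap)]
ends = s[s.diff(-1).isna() | (-s.diff(-1) > gap)]

trips = list(zip(starts, ends))
for start, end in trips:
    print(start, '->', end, '(%s)' % (end - start))
```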
Next, we look at the start and end points, to tag the commutes with the destination and ensure they match up with work/home as they should for morning/evening.
Finally, we apply some quality checks to the commutes: we start at home in the morning and end at home in the evening, and cut out outliers that are implausibly long or short, perhaps if I didn't go straight home.
We also do a bit more cleanup based on inspecting the location samples themselves.
Scanning through scatter plots of all sampled points between the start and stop timestamps, there are some clear examples of times I didn't go straight home. Most of these can be cut by detecting when I go too far.
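One way to implement that "went too far" cut, sketched with a haversine distance and hypothetical home/work coordinates (the 1.5x margin is an illustrative choice, not the notebook's actual value):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (vectorized over arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Hypothetical endpoints, plus one trip's sampled points
home = (41.85, -88.05)
work = (41.79, -87.60)
direct_km = haversine_km(*home, *work)

lats = np.array([41.85, 41.83, 41.81, 41.79])
lons = np.array([-88.05, -87.90, -87.75, -87.60])

# Cut the commute if any sample strays too far from both endpoints:
# here, farther than 1.5x the direct home-work distance
max_leg = np.maximum(haversine_km(lats, lons, *home),
                     haversine_km(lats, lons, *work)).max()
went_too_far = max_leg > 1.5 * direct_km
print(went_too_far)  # False
```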
I also noticed that the sampling becomes far more frequent (about double) on September 22, 2017. I wonder what's with that! Some of the earlier data is so sparsely sampled that it's unlikely to yield an accurate travel time, so we further remove any commutes containing fewer than 30 samples.
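The minimum-sample cut is a one-line groupby; a sketch on a toy samples table (the column names are assumptions):

```python
import pandas as pd

# Toy samples table: one row per location sample, tagged with a commute id
samples = pd.DataFrame({
    'commute_id': [0] * 35 + [1] * 5,
    'latitude': [41.8] * 40,
})

# Keep only commutes with at least 30 samples
counts = samples.groupby('commute_id').size()
keep = counts[counts >= 30].index
filtered = samples[samples.commute_id.isin(keep)]
print(sorted(filtered.commute_id.unique()))  # [0]
```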
Here's a snapshot of the cleaned data:
Now we can make some plots, starting with histograms of commute times for Hyde Park and Fermilab in the morning and evening.
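A sketch of one such histogram, using synthetic durations in place of the cleaned commute table (the distributions here are invented for illustration):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic duration samples (minutes) for the two destinations
rng = np.random.default_rng(42)
hp_mins = rng.normal(45, 8, 200)
fnal_mins = rng.normal(55, 10, 200)

fig, ax = plt.subplots()
bins = np.arange(20, 101, 5)
ax.hist(hp_mins, bins=bins, alpha=0.6, label='Hyde Park')
ax.hist(fnal_mins, bins=bins, alpha=0.6, label='Fermilab')
ax.set_xlabel('Trip duration (minutes)')
ax.set_ylabel('Commutes')
ax.legend()
fig.savefig('durations.png')
```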
And some basic statistics:
Duration and Departure Time¶
Something I've noticed is that it sure feels like commutes get worse through the week. Is it so? Just growing impatience? Or leaving at different times as the week goes on and catching more traffic? Plots will tell.
Here, we look at the distribution of departure times and trip durations by day of the week using a
set of box plots, and also take a look at the correlations through the joint distribution. There,
we'll bring in statsmodels to fit a linear model.
# Plot the trip duration and departure time by weekday, plus the
# joint distribution
duration_v_depart = []
for name, dest in [('Hyde Park', hp), ('Fermilab', fnal)]:
    fig, axes = plt.subplots(2, 3, figsize=(12, 12), dpi=100)
    for title, dft, ax in [
            ('Morning, %s' % name, commutes[dest & morn], 0),
            ('Evening, %s' % name, commutes[dest & eve], 1)]:
        dft['depart'] = dft.timestamp.dt.hour + dft.timestamp.dt.minute / 60
        dft['duration_mins'] = dft.duration.dt.total_seconds() / 60
        dft['dow'] = pd.Categorical.from_codes(
            dft.timestamp.dt.dayofweek,
            ('MON', 'TUE', 'WED', 'THU', 'FRI'))

        dft.boxplot(column='duration_mins', by='dow', ax=axes[ax, 0])
        dft.boxplot(column='depart', by='dow', ax=axes[ax, 1])
        dft.plot.scatter('depart', 'duration_mins', ax=axes[ax, 2])

        # Fit a linear model to duration vs. departure time
        dft.sort_values(by='depart', inplace=True)
        res = smf.ols('duration_mins ~ depart', data=dft).fit()
        duration_v_depart.append((title, res.params, res.bse))
        prstd, iv_l, iv_u = wls_prediction_std(res)

        x = dft['depart']
        y = dft['duration_mins']
        axes[ax, 2].plot(x, y, 'o', label='Data')
        axes[ax, 2].plot(x, res.fittedvalues, 'r--.', label='Model', color='orange')
        axes[ax, 2].plot(x, iv_u, 'r--', color='orange')
        axes[ax, 2].plot(x, iv_l, 'r--', color='orange')
        axes[ax, 2].legend(loc='best')

        axes[ax, 0].set_title(title)
        axes[ax, 0].set_ylabel('Trip duration (minutes)')
        axes[ax, 0].set_xlabel('Day of the Week')
        axes[ax, 0].set_ylim(20, 100)

        axes[ax, 1].set_title(title)
        axes[ax, 1].set_ylabel('Departure time (hour of day)')
        axes[ax, 1].set_xlabel('Day of the Week')
        lim = (6.5, 9) if ax == 0 else (15, 21)
        axes[ax, 1].set_ylim(*lim)

        axes[ax, 2].set_title(title)
        axes[ax, 2].set_xlabel('Departure time (hour of day)')
        axes[ax, 2].set_ylabel('Trip duration (minutes)')
        axes[ax, 2].set_ylim(20, 100)
        axes[ax, 2].set_xlim(*lim)

    plt.suptitle('')
    plt.tight_layout()
    plt.show()
The right panels show a linear model fit to the joint distribution, along with the standard errors. The fits are not excellent, but they capture the general trend in a situation where the correct underlying model is difficult to know without more data. The fitted slope quantifies the change in trip duration as a function of delay in departure:
That is, leaving an hour later on morning commutes to Hyde Park costs 5 extra minutes, while waiting an extra hour to leave Fermilab in the evening knocks 14 minutes off the trip. For morning commutes to Fermilab, it doesn't much matter when I leave... good to know!
Weather¶
Next, let's see if there's any correlation between commute time and weather conditions like snow or rain. We can use the DarkSky weather API (see https://darksky.net/dev/docs). You'll need an API key, which provides 1000 free API calls per day.
Add a column with a summary of the weather data:
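A sketch of one way to do that with Dark Sky's Time Machine endpoint, which returns conditions at a past instant (the helper names and placeholder key are mine; each commute costs one API call against the daily free quota):

```python
import json
from urllib.request import urlopen

DARKSKY_KEY = 'your-api-key-here'  # placeholder; get one at darksky.net/dev

def time_machine_url(key, lat, lon, unix_ts):
    """Dark Sky Time Machine endpoint for historical conditions."""
    return ('https://api.darksky.net/forecast/%s/%.4f,%.4f,%d'
            % (key, lat, lon, unix_ts))

def weather_summary(lat, lon, unix_ts):
    """Fetch the conditions summary at a point in time (one API call)."""
    with urlopen(time_machine_url(DARKSKY_KEY, lat, lon, unix_ts)) as resp:
        report = json.load(resp)
    return report['currently']['summary']

# e.g. apply weather_summary() to each commute's start point and time
print(time_machine_url('KEY', 41.79, -87.60, 1505000000))
```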
Unfortunately the statistics are a bit too low to learn much about the impact of weather on travel time, but a preliminary look suggests the impact is not too significant. I guess we're already moving pretty slowly, after all.
For an illustration, let's look at a higher-statistics sample: morning commutes to Hyde Park.
A look at the p-value for a two-sample Kolmogorov-Smirnov test shows that the distributions are highly compatible:
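The test itself is a single SciPy call; here it's run on synthetic samples standing in for the weather-split durations (drawn from the same distribution, so a large p-value is expected):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the weather-split duration samples (minutes)
rng = np.random.default_rng(1)
clear_mins = rng.normal(45, 8, 80)
precip_mins = rng.normal(45, 8, 20)

# Two-sample Kolmogorov-Smirnov test: a large p-value means we cannot
# distinguish the two duration distributions
ks_stat, p_value = stats.ks_2samp(clear_mins, precip_mins)
print('KS statistic = %.3f, p-value = %.3f' % (ks_stat, p_value))
```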
Conclusions¶
I hope this has been an interesting illustration of how to use your Location Services data to draw your own insights. Here I've looked at my commutes to better understand how I can minimize my car time, and have a more quantitative model for the impact of departure time. But this same approach could be used for all sorts of studies. It's great that Google makes it relatively easy to access this data -- have some fun with it!
And happy commuting 🚘!