The motivation for this analysis came from a colleague at Zendesk, who theorized that the large number of 0$ tips recorded when the payment was with a credit card was due to employees in the service sector not wanting to put the tip on their credit card.
The dataset used is the Google BigQuery public dataset nyc-tlc:yellow:trips. In order to try to isolate when employees in the service sector would be most likely to be taking cabs, I used the geographic area corresponding to downtown Manhattan (Greenwich Village and East Village), which has many restaurants and bars. This geographic area is bounded by a box from latitude,longitude values from upper left (Northwest) corner (40.741610, -74.010472) to lower right (Southeast) corner (40.719170, -73.971076).
One population, known as the early morning dataset, has data from 12AM - 3AM, which represents the theoretical population with a higher percentage of workers in the service economy. The control population is between 9AM - 12PM, as is known as the late morning dataset. The data was collected for the years 2011 - 2014.
Below is the Standard SQL query used to get the "early morning" time period of 12AM - 3 AM.
select DATE(pickup_datetime) as dt, SUM(CASE WHEN tip_amount = 0 THEN 1 ELSE 0 END) as card_no_tip, SUM(CASE WHEN tip_amount >0 THEN 1 ELSE 0 END) as card_with_tip from `nyc-tlc.yellow.trips` WHERE 1=1 AND DATE(pickup_datetime) BETWEEN '2011-01-01' and '2015-01-01' AND EXTRACT(HOUR FROM (DATETIME(pickup_datetime, "America/New_York"))) BETWEEN 0 and 3 --AND EXTRACT(HOUR FROM (DATETIME(pickup_datetime, "America/New_York"))) BETWEEN 9 and 12 -- for the late morning period. AND pickup_longitude BETWEEN -74.010472 AND -73.971076 AND pickup_latitude BETWEEN 40.719170 AND 40.741610 AND rate_code in ('1','6') # standard and group rates AND fare_amount > 0.0 AND payment_type='CRD' # only credit card fares show non-cash tips. GROUP BY dt;
Here is an example of what the data looks like in the resulting csv files.
The general gist of the analysis done for each dataset is as follows.
1. import data into a dataframe in a Jupyter Notebook. 2. calculate the ratio of rides in a given day with no tip to rides with a non-zero tip. 3. check this new variable, tip_ratio, for normality and remove any suspected outliers. 4. Run a two sample T-test, with unequal variances, to test the NULL hypothesis, H0.
The NULL hypothesis is that there is no statistically significant difference in the means between the two populations. The alternative hypothesis, HA, is that there is a difference between the means.
T-statistic = 5.254. p-value = 1.67e-07.
Based on the results of the T-test, where the listed p-value is less than the the alpha value (0.01) we would reject the null hypothesis that the means between the two populations are same. In fact, the mean of the early morning population, as compared to the late morning population is higher, and has a higher 75% value. This indicates that there may be some validity to the contention that the early morning cab taking public gives a greater percentage of their tips in cash than the late morning control group.
Here are the statistics of the distribution of the early morning tip ratio.
count | 1447.000000
mean | 2.819412
std | 1.500241
min | 0.000000
25% | 1.757237
50% | 2.185445
75% | 3.744168
max | 7.020598
Here are the statistics of the distribution of the late morning tip ratio.
count | 1447.000000
mean | 2.602116
std | 0.473757
min | 1.228501
25% | 2.271167
50% | 2.555296
75% | 2.933287
max | 3.814757
Complete details on the analysis, including the data files, are located in this Github repo.