COVID-19 Data Science Challenge Fall 2020: Protect Purdue

Predicting social crowding in Tippecanoe County.

Section 1


In this project, the aim is to predict foot traffic for week 44 (counted from the week Indiana recorded its first COVID-19 case) at 1804 Points of Interest (POIs) in Tippecanoe County, Indiana, United States of America. This is a regression task. The data were collected from week 1 to week 43 and are available as a BigQuery project. The schema of the tables used in this project can be found here.

To start with, the six tables are imported from the BigQuery project and stored as pandas DataFrames. Exploratory Data Analysis (EDA) is then carried out on the data to better understand the relationships among the tables and to guide feature selection for the modeling task. The models I tried for this challenge included:

  • ARIMA
  • Random Forest
  • LightGBM
  • CatBoost

The results presented in this report were obtained using CatBoost. The report also visualizes the effectiveness of the policies adopted during the fall at the five POI locations that reported the highest number of cases of the virus.

Section 2

Terms and Abbreviations


Data:

  • main_data: The primary data used for analysis
  • prediction: File of predicted cases for week 44
  • df_merge_col: Merged data for spatial representation

Terms:

  • poi_id: Unique and consistent ID that is tied to a particular POI in the dataset
  • CBG: The census_block_group for the poi_id

The Temporal Development for the Twenty Most Crowded Places


The time series of the twenty poi_ids with the highest predicted values for week 44 were plotted to show their development over the 33 weeks, as shown below. Purdue University, Tippecanoe Plaza and the University's Main Campus are notable for their sharp reduction in the number of cases. They can help inform policy for curbing the spread in places like Beck Plaza, which has been experiencing a steady increase in reported cases.

In [102]:
fig, ax = plt.subplots(figsize=(20,9))
ax.set_title('Temporal Development for the 20 most Crowded Places')
for j in top_20:
    plot_gra(new_data(j))
plt.savefig('temporal_20.png')
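
The selection of the twenty poi_ids is not shown in this report. As a minimal sketch of how such a ranking could be done, the snippet below picks the n largest predictions from a mapping of poi_id to predicted value. The `predictions` dictionary, its values and the helper name `top_n_pois` are hypothetical and only serve as an illustration; on a pandas Series the same idea would typically be expressed with `Series.nlargest`.

```python
import heapq

# Hypothetical week-44 predictions: poi_id -> predicted value.
# The numbers are made up for illustration, not taken from the challenge data.
predictions = {
    "poi_a": 120.0,
    "poi_b": 870.5,
    "poi_c": 43.0,
    "poi_d": 512.25,
    "poi_e": 301.0,
}


def top_n_pois(pred, n):
    """Return the n poi_ids with the largest predicted value, largest first."""
    # Iterating over a dict yields its keys; pred.get looks up each value.
    return heapq.nlargest(n, pred, key=pred.get)


top_3 = top_n_pois(predictions, 3)
print(top_3)
```

With n=20 and the real predictions, a list like this could play the role of `top_20` in the loop above.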

The Temporal Development for the Twenty Places with Greatest "Increase"


By subtracting the reported values for week 40 from the predicted values for week 44, the twenty places with the highest increase were separated out. A time series chart of the cases at these places was then plotted, as shown below. Meijer and Menard's are two notably worrying places for the pandemic.

In [101]:
fig, ax = plt.subplots(figsize=(20, 9))
# Shrink the axes to 80% of their width to leave room on the right (e.g. for a legend).
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_title('Temporal Development for the 20 most Increased Places from week 40 to 44')
# Plot the first entry last so that it is drawn after the others.
for j in top_20_increase[1:]:
    plot_gra(new_data(j))
plot_gra(new_data(top_20_increase[0]))
plt.savefig('temporal_increased.png')
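
The computation behind the ranking is not included in this report. The sketch below shows one way it could look, taking the increase to be the predicted week-44 value minus the reported week-40 value and keeping only poi_ids that appear in both weeks. The dictionaries, their values and the function names are hypothetical and for illustration only.

```python
# Hypothetical values, for illustration only (not the challenge data).
reported_week_40 = {"poi_a": 100, "poi_b": 400, "poi_c": 250, "poi_d": 90}
predicted_week_44 = {"poi_a": 180, "poi_b": 380, "poi_c": 600, "poi_e": 75}


def increase_by_poi(reported, predicted):
    """Return predicted minus reported for poi_ids present in both mappings."""
    common = reported.keys() & predicted.keys()
    return {poi: predicted[poi] - reported[poi] for poi in common}


def top_increases(reported, predicted, n):
    """Return the n poi_ids with the largest increase, largest first."""
    diff = increase_by_poi(reported, predicted)
    return sorted(diff, key=diff.get, reverse=True)[:n]


print(increase_by_poi(reported_week_40, predicted_week_44))
print(top_increases(reported_week_40, predicted_week_44, 2))
```

A negative difference (here `poi_b`) indicates a predicted decrease, so such places fall to the bottom of the ranking.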

Temporal Mapping


The predictions for week 44 and the POIs with the greatest increase from week 40 to week 44 are visualized on interactive maps showing places around Tippecanoe County together with their predicted values. The HTML files of the maps are attached separately to this markdown report.

Ten Most Crowded CBGs


The task featured a hundred CBGs. The predicted values were aggregated for each CBG, and the ten CBGs with the highest number of cases were selected for the time series analysis shown below. Mike Raisor Lincoln, Bob Automotive group and Aster Place are notably of concern given the predicted spikes.

In [289]:
fig, ax = plt.subplots(figsize=(20,9))
ax.set_title('Temporal Development for the 10 most Crowded CBGs')
for j in top_cbg:
    mod_plot_gra(mod_new_data(j))
plt.savefig('temporal_cbg_11.png')
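
The report does not state how the per-CBG aggregation was done. A minimal sketch, assuming the aggregation is a simple sum of the predicted values of all POIs in a CBG, is given below. The rows, the CBG labels and the function names are hypothetical; with a pandas DataFrame a comparable result would usually come from a `groupby` on the CBG column followed by `sum`.

```python
from collections import defaultdict

# Hypothetical rows: (poi_id, CBG, predicted week-44 value).
# Values are made up for illustration only.
rows = [
    ("poi_a", "cbg_1", 120.0),
    ("poi_b", "cbg_2", 300.0),
    ("poi_c", "cbg_1", 80.0),
    ("poi_d", "cbg_3", 50.0),
    ("poi_e", "cbg_2", 25.0),
]


def total_by_cbg(rows):
    """Sum the predicted values of all POIs that belong to the same CBG."""
    totals = defaultdict(float)
    for _poi_id, cbg, value in rows:
        totals[cbg] += value
    return dict(totals)


def top_cbgs(rows, n):
    """Return the n CBGs with the largest summed prediction, largest first."""
    totals = total_by_cbg(rows)
    return sorted(totals, key=totals.get, reverse=True)[:n]


print(total_by_cbg(rows))
print(top_cbgs(rows, 2))
```

Summing favours CBGs that contain many POIs; a mean per POI would be an alternative if the goal were to compare typical crowding rather than total volume.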

Ten Most Increased CBGs


Similar to the time series analyses above, the increase was aggregated for each CBG and the CBGs with the highest increase were plotted. The Gas America Service CBG stood out; efforts should therefore be directed towards controlling the spread rate there.

In [284]:
fig, ax = plt.subplots(figsize=(20,9))
ax.set_title('Temporal Development for the 10 most Increased CBGs')
for j in top_cbg_1:
    mod_plot_gra_1(mod_new_data_1(j))
plt.savefig('temporal_cbg_10.png')

Concluding Remarks

Based on the results of the analysis, it is apparent that some locations are responding well to preventive measures while others are recording an upsurge in the number of cases. A notable instance is Purdue University Main Campus in the first plot: the weekly cases recorded at the beginning of fall 2020 peaked at about 8000 and then fell to about 3000 cases per week, which can be attributed to adherence to the Protect Purdue pledge. The curve for Purdue University fell correspondingly. These are classic illustrations that simple measures such as masking up, washing hands and social distancing can indeed help flatten the curve. The second chart illustrated that supermarkets are among the places that recorded the highest increase in the number of cases; this needs to be monitored. The CBGs around Mike Raisor and Bob Automotive group are some of the most crowded areas, and policy change efforts should be focused there.

In conclusion, visual cues can be obtained from the charts and maps to gauge the performance of the previously used policies and to plan further. The Purdue case, however, affirmatively shows that the simple measures are indeed effective in curbing the spread of the virus.