COVID-19 Data Science Challenge Protect Purdue: Predicting social crowding in Tippecanoe County

COVID-19 has disrupted the social and economic activity of faculty and students at Purdue. The Protect Purdue initiative has implemented a range of measures that keep classrooms safe. However, the risk outside the classroom (in bars, restaurants, churches, gyms, grocery stores, and similar venues) remains, so the "social concentration" of people at places in our region is a concern. An increase in foot traffic* at public places around Purdue, in Tippecanoe County, suggests that daily routines are returning to normal.

The task here is to build a statistical model using the historical movement, COVID-19 incidence, and policy intervention data collected from week 11 to week 43 to predict the foot traffic in week 44 for 1804 places of interest (POIs) in Tippecanoe County. Since each time series is short (33 data points), neural networks and attention mechanisms would likely overfit the data. Hence, I opted for simpler and faster approaches like ARIMA models. Of all the submissions I made, the best-performing one was an autoregressive model, which did better than optimized Random Forest Regressors. This suggests that, for short time series like these, simple models can do better than complex black-box models.

This report presents various visualizations prepared from the given datasets. The objective is to communicate my analyses and to help policymakers decide how to curb the spread of infection.


   * Foot traffic here denotes the number of people who visit a particular POI in a week.

Heatmap animation showing foot traffic in each POI

Use slider to move between weeks. Press Play button to see full animation

The heatmap shows the 'raw_visit_counts' field from the challenge dataset, which is the foot traffic at a particular POI. During the initial part of the lockdown (week 11 to week 20), foot traffic is high in shopping centers and at Tippecanoe Mall. In particular, the two Walmart shopping centers in the west and southwest of the county see large visit counts. Foot traffic then falls after week 25. After week 33, it starts to increase at Purdue University and stays elevated through the end of week 43.


Map showing percentage change in foot traffic in each POI

Click on blue location icon to see the % change

This map shows the percent change in foot traffic over the last three weeks of the data provided (week 40 to week 43). It is a clustered marker map: markers that are close together are grouped into blobs for visualization purposes. Clicking on a numbered blob zooms into the area and shows the lower-level blobs or the individual POI markers. The percent change shows how visitor counts shifted in the final weeks of the data. POIs with a large positive change in foot traffic deserve the most attention.
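The percent change itself is simple arithmetic: (end - start) / start * 100. The sketch below is a minimal illustration, not the code used to build the map; the function name `percent_change` and the sample counts are made up, and treating a zero starting count as undefined (NaN) is an assumption about how such POIs might be handled.

```python
import numpy as np

def percent_change(start_visits, end_visits):
    """Percent change from start to end; NaN where the starting count is zero."""
    start = np.asarray(start_visits, dtype=float)
    end = np.asarray(end_visits, dtype=float)
    change = np.full(start.shape, np.nan)
    nonzero = start != 0  # avoid dividing by zero for POIs with no starting visits
    change[nonzero] = (end[nonzero] - start[nonzero]) / start[nonzero] * 100.0
    return change

# Made-up visit counts for three hypothetical POIs (week 40 vs. week 43).
week40 = [200, 50, 0]
week43 = [250, 40, 12]
print(percent_change(week40, week43))
```

POIs whose change comes out as NaN would need separate handling before being placed on a map.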


Map showing absolute predictions for week 44 in each POI

Click on blue location icon to see the predicted foot traffic for week 44

This clustered marker map shows the absolute predicted values for week 44, as opposed to the color gradient shown in the heatmap animation. The progression of foot traffic for the busiest POIs can be seen in the plots shown below.


Time series plots of visit counts for top 50 busiest POIs

These plots show the visit counts for the 50 busiest POIs, that is, those with the highest forecast foot traffic for week 44. The X axis shows the week number and the Y axis shows the visit count. Historical data is shown as a blue line; the week 44 forecast is shown as an orange dot.

Most of these busiest POIs are either educational institutions or shopping centers, so controlling foot traffic at those locations should help curb the spread of infection.



General stats about foot traffic

Apart from the visit counts, various other metrics were collected and published. These metrics can provide an idea about the overall change in foot traffic in each POI. For example, the plots below show the average visit concentration and average median dwell time for each week.

It is evident from these plots that visit concentration and median dwell time decreased as the lockdown progressed. This suggests a reasonable degree of public compliance with the lockdown.



Time series modelling

Initial thoughts

The time series for each POI is short: 33 data points. The goal is to forecast the next data point. Initial exploratory visualizations suggested that the foot traffic series is stationary (its statistical properties, such as mean and variance, stay constant over time). Models like ARIMA can capture the pattern in such a series well enough to make good predictions without overfitting the data.
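A rough numerical counterpart to the visual check is to compare the mean and standard deviation of the first and second halves of a series; similar values are consistent with stationarity but do not prove it. The sketch below is an illustration only, not the analysis used in this report. The function name `half_split_summary` and the sample counts are made up. A formal alternative would be a unit-root test such as the augmented Dickey-Fuller test.

```python
import numpy as np

def half_split_summary(series):
    """Return ((mean, std), (mean, std)) for the first and second half of a series."""
    values = np.asarray(series, dtype=float)
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return (first.mean(), first.std()), (second.mean(), second.std())

# Made-up weekly visit counts for one hypothetical POI.
visits = [120, 135, 110, 128, 131, 117, 125, 122]
(first_mean, first_std), (second_mean, second_std) = half_split_summary(visits)
print(f"first half:  mean={first_mean:.1f}, std={first_std:.1f}")
print(f"second half: mean={second_mean:.1f}, std={second_std:.1f}")
```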

Model usage

This hackathon was organized to build a model that can support the policy-making decisions of public health officials. Simpler models like autoregression are easier for non-technical people to understand, which helps build confidence in using data science to address public issues. Therefore, model interpretability was given the same importance as accuracy.

Winning model

The time series model that scored best in the hackathon is a simple autoregressive model of order 1, AR(1), which uses foot traffic as the endogenous variable. The Python library 'statsmodels' was used to build the model. Variations of the model were tried, but the least complex model outperformed all the more complex ones. This shows that not every problem needs a deep neural network; even simple, naive models can answer important questions if used properly.
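An autoregressive model of order 1 predicts each value from the previous one: y_t = c + phi * y_(t-1) + noise. The sketch below fits c and phi by ordinary least squares in NumPy so that the arithmetic is visible. It is not the submitted code, which was built with statsmodels; the function names `fit_ar1` and `forecast_next` and the sample series are hypothetical.

```python
import numpy as np

def fit_ar1(series):
    """Estimate (c, phi) in y_t = c + phi * y_{t-1} by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # Design matrix: a column of ones (intercept) and the lagged values.
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return float(coef[0]), float(coef[1])

def forecast_next(series, c, phi):
    """One-step-ahead forecast from the last observed value."""
    return c + phi * float(series[-1])

# Made-up weekly visit counts for one hypothetical POI.
visits = [100.0, 60.0, 40.0, 30.0, 25.0, 22.5]
c, phi = fit_ar1(visits)
print(f"c={c:.3f}, phi={phi:.3f}, next week={forecast_next(visits, c, phi):.2f}")
```

Because the model has only two parameters per POI, both can be read off directly: phi measures how strongly last week's traffic carries over, and c sets the baseline level.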

Spectral Analysis - an unconventional approach

Surprisingly, extrapolating the time series with the Fourier transform outperformed Random Forest Regressors, one of the more powerful regression models in machine learning, on this task. A Python function was written from scratch to decompose and reconstruct a time series using the Fourier transform. The forecast points are generated during reconstruction of the Fourier series. The number of harmonics (terms) used in the Fourier series played a vital role in generating accurate forecasts.
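One common way to implement this idea is sketched below: remove a linear trend, take the FFT, keep the strongest harmonics, and evaluate the resulting cosine series beyond the last observed week. This is an illustrative reconstruction under those assumptions, not the function written for the challenge; the name `fourier_extrapolate`, the linear detrending step, and the rule of picking harmonics by amplitude are all choices made here for the example.

```python
import numpy as np

def fourier_extrapolate(series, n_predict=1, n_harmonics=3):
    """Rebuild a series from its strongest harmonics and extend it n_predict steps."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    t = np.arange(n)
    # Remove a linear trend so the FFT only has to describe the oscillations.
    slope, intercept = np.polyfit(t, y, 1)
    detrended = y - (slope * t + intercept)
    spectrum = np.fft.rfft(detrended)
    freqs = np.fft.rfftfreq(n)  # frequencies in cycles per week
    # Indices of the strongest non-constant harmonics (bin 0 is skipped).
    strongest = np.argsort(np.abs(spectrum[1:]))[::-1][:n_harmonics] + 1
    t_ext = np.arange(n + n_predict)
    rebuilt = np.zeros(n + n_predict)
    for k in strongest:
        # Every rfft bin except the Nyquist bin stands for a conjugate pair.
        pair = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
        amplitude = pair * np.abs(spectrum[k]) / n
        phase = np.angle(spectrum[k])
        rebuilt += amplitude * np.cos(2 * np.pi * freqs[k] * t_ext + phase)
    # Add the trend back, extended over the forecast horizon.
    return rebuilt + slope * t_ext + intercept

# Made-up weekly visit counts for one hypothetical POI.
visits = [30, 42, 35, 50, 38, 47, 33, 52, 40, 49]
extended = fourier_extrapolate(visits, n_predict=1, n_harmonics=2)
print(f"forecast for the next week: {extended[-1]:.1f}")
```

A Fourier series is periodic, so the oscillating part of the forecast repeats the start of the fitted series. Using fewer harmonics smooths the reconstruction, while using all of them reproduces the observed data exactly, which is one reason the number of harmonics matters for forecast quality.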

Important takeaways from modelling

  1. Model explainability is important in data science, particularly in sensitive, high-risk fields like public health. Using simpler algorithms like autoregression can increase transparency and interpretability.
  2. Neural networks are not always the solution. Simpler models can outperform advanced models at times.


Concluding remarks

From these maps and plots, we can confidently say that shopping centers and educational institutions were the busiest areas in Tippecanoe County during the pandemic. It is also worth noting that these places did not get crowded simultaneously. During the first half of the data, shopping centers received high foot traffic, whereas visit counts at Purdue University started to increase only in the second half. After week 30, food outlets like McDonald's and Chick-fil-A also received a large number of visitors.

To conclude, locations like Walmart, Purdue University, Sam's Club, and McDonald's can be classified as high-risk areas, and public health officials should focus more on these locations in order to reduce the spread of infection.