COVID-19 has affected the lives of nearly everyone on the planet in some way, leaving a wake of confusion about how to adapt to this new way of life. The pandemic has left many entities, including local governments and merchants, wondering what its economic and societal implications will be. It is therefore my task, in cooperation with IronHacks, to help these entities by building a model that accurately predicts weekly foot traffic for 1804 places of interest within Tippecanoe County.
This report presents the model I developed, the results with visualizations, and a discussion of how stakeholders should proceed.
<!-- Iron Hacks has provided me with several datasets to help in this task. These datasets include information about past foot traffic data in these places of interest and what types of places these are, census block group data, mobility graph data, covid 19 cases information, and existence and type of executive orders in place. I began by looking at each of these data frames for their specific variables that I thought could be useful.
After understanding each of the tables and the information that they provided, I began modeling to predict what the week 44 foot count could be of all of the POIs. I started with simple linear regression applied to each POI to find out if a simple linear line through the foot traffic would be a good model. I found that although simple, the model did well, however I quickly moved to more complicated models such as Time Series Modeling through ARIMA and Exponential Trend Smoothing, Neural Network models, and finally random forest modeling which I found to perform best. I tested MSE and MAPE levels on week 42 and 43 data to decide which models to submit for prediction of week 44 foot traffic. -->
Terms and Abbreviations
POI: Place of Interest
raw_visit_counts: The number of unique visits to a specific POI in a given week
Map 1: Presenting the absolute values for the week 44 predictions
import os
import json
import random
import requests
from ipywidgets import *
from ipyleaflet import *
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
import matplotlib.pyplot as plt
all_results = pd.read_csv("report_data.csv")
# CONSTANTS
MAP_LAT=40.3900
MAP_LON=-86.8220
MAP_CENTER = (MAP_LAT, MAP_LON)
# FUNCTIONS
def print_basemaps():
basemap_list = basemaps.keys()
print("\n","Basemaps")
print("=========")
print("-", "\n- ".join(basemap_list))
def print_basemap_themes(name):
basemap_theme_list = basemaps[name].keys()
print("\n", name, "Themes")
print("=========")
print("\n".join(basemap_theme_list))
def random_color(feature):
return {
'color': 'black',
'fillColor': random.choice([
'red',
'yellow',
'green',
'orange'
]),
}
def fetch_json(url):
    # Download and parse a JSON resource from the given URL
    r = requests.get(url)
    text = r.content.decode("utf-8")
    data = json.loads(text)
    return data
def unique(list1):
list_set = set(list1)
unique_list = (list(list_set))
return unique_list
# Change in raw_visit_counts per POI from week 40 to the predicted week 44
# (relies on the week-40 and week-44 rows appearing in the same POI order)
lst = list(all_results.loc[all_results["week_number"] == 44]["raw_visit_counts"].values
           - all_results.loc[all_results["week_number"] == 40]["raw_visit_counts"].values)
m = Map(center=MAP_CENTER, zoom=10)
markers = []
all_res_44 = all_results.loc[all_results["week_number"] == 44]
for point in all_res_44.iterrows():  # all_res_44 already contains only week-44 rows
marker = Marker(
location=[point[1]["latitude"], point[1]["longitude"]],
draggable=False,
)
marker_message = HTML()
marker_message.value = "<strong>" + str(point[1]["poi_id"]) + "</strong>" + "<br>Visit Count: " + str(point[1]["raw_visit_counts"])
marker.popup = marker_message
markers.append(marker)
m.add_layer(MarkerCluster(markers=markers))
display(m)
Above is a map of the absolute values for the week 44 predictions
Map 2: Presenting the slope of the increase of the foot traffic from week 40 to 44
m = Map(center=MAP_CENTER, zoom=10)
markers = []
all_res_44 = all_results.loc[all_results["week_number"] == 44]
i = 0
for point in all_res_44.iterrows():
marker = Marker(
location=[point[1]["latitude"], point[1]["longitude"]],
draggable=False,
)
marker_message = HTML()
    marker_message.value = "<strong>" + str(point[1]["poi_id"]) + "</strong>" + "<br>Change in Visit Count (wk 40-44): " + str(lst[i])
marker.popup = marker_message
markers.append(marker)
i += 1
m.add_layer(MarkerCluster(markers=markers))
display(m)
Above is a map of the slope of the increase of foot traffic from week 40 to week 44
Chart 1: The temporal development for the 20 most crowded places using a time series chart
top_20_poi = all_res_44.sort_values("raw_visit_counts", ascending=False)["poi_id"].values[:20]
for i in range(20):
temp_df = all_results.loc[all_results["poi_id"] == top_20_poi[i]][["week_number", "raw_visit_counts"]]
plt.plot(temp_df["week_number"].values, temp_df["raw_visit_counts"].values)
plt.xlabel("Week Number")
plt.ylabel("Visit Count")
plt.title("Temporal Development for the 20 Most Crowded Places")
plt.legend(top_20_poi, loc='upper center', bbox_to_anchor=(1.45, 1), shadow=True, ncol=1)
Chart 2: Present the temporal development of the 20 places with the greatest “increase” of foot traffic from week 40 to 44
# Positional indices of the 20 largest week-40-to-44 increases.
# Note: this mutates lst (each selected maximum is zeroed out so the
# next-largest value can be found on the following pass).
top_pois = []
pois = all_results.loc[all_results["week_number"] == 44]["poi_id"].values
for i in range(20):
    pos = lst.index(max(lst))
    top_pois.append(pos)
    lst[pos] = 0
pois_chart2 = [pois[i] for i in top_pois]
for idx in top_pois:
    curr_poi = pois[idx]
    temp_df = all_results.loc[all_results["poi_id"] == curr_poi][["week_number", "raw_visit_counts"]]
    # Skip the earliest weeks so the recent trend toward weeks 40-44 is easier to see
    plt.plot(temp_df["week_number"].values[27:], temp_df["raw_visit_counts"].values[27:])
plt.title("Temporal Development of the 20 POI with the Greatest “increase” of Foot Traffic from Week 40 to 44")
plt.xlabel("Week Number")
plt.ylabel("Visit Count")
plt.legend(pois_chart2, loc='upper center', bbox_to_anchor=(1.45, 1), shadow=True, ncol=1)
Map 1: Presenting the absolute values of the predicted foot traffic (raw_visit_counts) aggregated for the 100 CBGs. Which CBGs bear the greatest risk?
map1_32_res = all_res_44.groupby(['poi_cbg']).sum().sort_values(["raw_visit_counts"], ascending=False).head(100)
m3 = Map(center=MAP_CENTER, zoom=10)
markers = []
map1_32_res = map1_32_res.reset_index()
for point in map1_32_res.iterrows():
    # groupby().sum() also summed latitude, longitude, and week_number; since every
    # aggregated row has week_number == 44, dividing the summed coordinates by
    # (week_number / 44) recovers the mean coordinate, i.e. the centroid of the CBG's POIs
    marker = Marker(
        location=[point[1]["latitude"]/(point[1]["week_number"]/44), point[1]["longitude"]/(point[1]["week_number"]/44)],
        draggable=False,
    )
marker_message = HTML()
marker_message.value = "<strong>" + str(point[1]["poi_cbg"]) + "</strong>" + "<br>Visit Count: " + str(point[1]["raw_visit_counts"])
marker.popup = marker_message
markers.append(marker)
m3.add_layer(MarkerCluster(markers=markers))
display(m3)
The CBGs that bear the greatest risk are 181570017002, 181570018002, 181570108001, and 181570104001
Map 2: Presenting the slope of the increase of the foot traffic for the 100 CBGs from week 40 to 44. Use color coding to show CBGs with a positive slope, negative slope, and “stable” foot traffic.
m4 = Map(center=MAP_CENTER, zoom=10)
cbgs = list(set(all_results["poi_cbg"]))
lst_cbgs_40_44_diff = []
for cbg in cbgs:
temp_df = all_results.loc[all_results["poi_cbg"] == cbg]
temp_df2 = temp_df.groupby(['week_number']).sum().sort_values(["week_number"]).tail(5)
temp_df2["week_number"] = temp_df2.index
diff = temp_df2.loc[temp_df2["week_number"] == 44]["raw_visit_counts"].values - temp_df2.loc[temp_df2["week_number"] == 40]["raw_visit_counts"].values
lst_cbgs_40_44_diff.append(diff[0])
markers = []
for i, cbg in enumerate(cbgs):
    # Use the first POI in the CBG as a representative marker location
    lat = all_results.loc[all_results["poi_cbg"] == cbg].head(1)["latitude"].values[0]
    long = all_results.loc[all_results["poi_cbg"] == cbg].head(1)["longitude"].values[0]
    marker = Marker(
        location=[lat, long],
        draggable=False,
    )
    marker_message = HTML()
    marker_message.value = "<strong>" + str(cbg) + "</strong>" + "<br>Change in Visit Count (wk 40-44): " + str(lst_cbgs_40_44_diff[i])
    marker.popup = marker_message
    markers.append(marker)
m4.add_layer(MarkerCluster(markers=markers))
display(m4)
Chart 1: Present the temporal development for the 10 most crowded CBGs using a time series chart
# Aggregate weekly counts per CBG and keep the ten largest totals
most_crowded = all_results.groupby(['poi_cbg']).sum().sort_values(["raw_visit_counts"], ascending=False).head(10)
most_crowded
most_crowded_cbgs = most_crowded.index
for cbg in most_crowded_cbgs:
temp_df = all_results.loc[all_results["poi_cbg"] == cbg]
temp_df2 = temp_df.groupby(['week_number']).sum().sort_values(["week_number"])
x = list(temp_df2.index)
y = temp_df2[["raw_visit_counts"]].values
plt.plot(x, y)
plt.title("Temporal Development of the 10 most crowded CBGs")
plt.xlabel("Week Number")
plt.ylabel("Visit Count")
plt.legend(most_crowded_cbgs, loc='upper center', bbox_to_anchor=(1.20, 1), shadow=True, ncol=1)
Chart 2: Present the temporal development of the 10 CBGs with the greatest “increase” of foot traffic from week 40 to 44. Ideally can put them all in one chart as that eases interpretation. Feel free to also try a bar chart to highlight which areas seem to show the greatest increase in social crowding.
# Indices of the 10 CBGs with the largest week-40-to-44 increase
# (same zero-out selection pattern as used for the POIs above)
inds = []
for i in range(10):
    ind_ = lst_cbgs_40_44_diff.index(max(lst_cbgs_40_44_diff))
    inds.append(ind_)
    lst_cbgs_40_44_diff[ind_] = 0
cbgs_10 = []
for i in inds:
cbgs_10.append(cbgs[i])
cbgs_10
for cbg in cbgs_10:
temp_df = all_results.loc[all_results["poi_cbg"] == cbg]
temp_df2 = temp_df.groupby(['week_number']).sum().sort_values(["week_number"]).tail(5)
x = list(temp_df2.index)
y = temp_df2[["raw_visit_counts"]].values
plt.plot(x, y)
plt.title("Temporal Development of the 10 CBGs with the Greatest “increase” of Foot Traffic from Week 40 to 44")
plt.xlabel("Week Number")
plt.ylabel("Visit Count")
plt.legend(cbgs_10, loc='upper center', bbox_to_anchor=(1.20, 1), shadow=True, ncol=1)
During this challenge I tested 13 different models to find the one that most accurately predicts the next week's raw visit count for every POI. Some of these were regression models: linear, ridge, and lasso regression. I found these to be the least accurate, though they are very simple: a regression line, or line of best fit, is drawn through the plot of raw_visit_counts over the weeks and extended to predict the next week. Despite their simplicity, these models should be avoided here. Another regression model I tried was KNN regression, which is a bit more advanced in that several other variables (such as the COVID-19 case rate, average distance from home, etc.) can be included to try to improve the estimates. Time series models such as ARIMA, exponential smoothing, and VAR suit this prediction task because they are designed for forecasting. They excel at identifying trends in the data, can determine how many past weeks are worth using for the next week's prediction, and can even capture seasonality effects, which may be useful when predicting raw_visit_counts over several years. One of the last models I tried, which proved very effective, was decision tree modeling. Decision trees can take several different features (like the KNN regression model) to produce a more accurate estimate of the next week's raw_visit_counts. They are somewhat harder to explain, but when more features are added to aid the prediction, they can perform very well.
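As an illustration of how these model families can be compared, below is a minimal sketch (not the exact pipeline I used) that fits a linear regression and a KNN regression on lagged raw_visit_counts for a single POI and scores them with MAPE on the last two held-out weeks; the lag depth and neighbor count are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def lagged_matrix(series, n_lags=3):
    # Rows are [visits_{t-3}, visits_{t-2}, visits_{t-1}]; the target is visits_t
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return np.array(X), np.array(y)

example_poi = all_results["poi_id"].values[0]  # any POI, purely for illustration
series = (all_results.loc[all_results["poi_id"] == example_poi]
          .sort_values("week_number")["raw_visit_counts"].values)

X, y = lagged_matrix(series)
X_train, X_test = X[:-2], X[-2:]   # hold out the last two observed weeks
y_train, y_test = y[:-2], y[-2:]

for name, model in [("linear regression", LinearRegression()),
                    ("knn regression", KNeighborsRegressor(n_neighbors=3))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Clip the denominator at 1 so weeks with zero visits do not divide by zero
    mape = np.mean(np.abs((y_test - pred) / np.maximum(y_test, 1))) * 100
    print(name, "hold-out MAPE:", round(mape, 1))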
The final, and most accurate, modeling strategy I used was a random forest model for each of the 1804 POIs, trained on the raw_visit_counts variable. A random forest is closely related to decision tree modeling; in fact, it works by combining the predictions of several decision trees to arrive at an answer. I was only able to use the raw_visit_counts variable to train the model, but adding other variables (such as median dwell time, the COVID-19 case rate in the area, and the number and type of executive orders in effect) could substantially improve its accuracy.
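Below is a minimal sketch of what the per-POI random forest setup could look like, assuming only the history through week 43 is used for training; the lag depth, tree count, and helper function name are illustrative choices rather than the exact configuration I submitted, and looping over every POI this way is slow but mirrors the one-model-per-POI idea.
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def rf_predict_next_week(series, n_lags=4, n_estimators=200):
    # Train a random forest on lagged raw_visit_counts and predict one week ahead
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    rf.fit(np.array(X), np.array(y))
    return rf.predict(np.array(series[-n_lags:]).reshape(1, -1))[0]

# One model per POI, trained only on weeks before 44, as in the final strategy
history = all_results.loc[all_results["week_number"] < 44]
week_44_sketch_preds = {}
for poi in history["poi_id"].unique():
    poi_series = (history.loc[history["poi_id"] == poi]
                  .sort_values("week_number")["raw_visit_counts"].values)
    if len(poi_series) > 4:  # need enough history for the lagged features
        week_44_sketch_preds[poi] = rf_predict_next_week(poi_series)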
Policymakers should be aware that many variables are available to add to this model to improve its accuracy. If other variables are used, keep in mind that their values for the prediction week must themselves be forecast, so they should be at least as predictable one week ahead as raw_visit_counts is. The variables I would recommend are the COVID-19 case rate in the area around each POI, the type and number of executive orders in effect, and any other variables that help explain consumer behavior.
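As a purely hypothetical sketch of how such variables could be folded into the feature set, the function below merges two assumed tables, covid_rates and exec_orders, into the visit data; these tables and their column names are placeholders for the corresponding IronHacks data and are not loaded anywhere in this report.
def build_feature_frame(visits, covid_rates, exec_orders):
    # visits:       poi_id, poi_cbg, week_number, raw_visit_counts
    # covid_rates:  poi_cbg, week_number, cases_per_100k      (assumed schema)
    # exec_orders:  week_number, n_orders_active              (assumed schema)
    df = visits.merge(covid_rates, on=["poi_cbg", "week_number"], how="left")
    df = df.merge(exec_orders, on="week_number", how="left")
    df = df.sort_values(["poi_id", "week_number"])
    # Lag the added variables so only information observed before the
    # prediction week is used as a feature
    df["cases_prev_week"] = df.groupby("poi_id")["cases_per_100k"].shift(1)
    df["orders_prev_week"] = df.groupby("poi_id")["n_orders_active"].shift(1)
    return df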