COVID-19 Data Science Challenge Protect Purdue: Predicting social crowding in Tippecanoe county

Background

This notebook predicts social crowding behavior in Tippecanoe County. The provided data contains weekly visit counts for each POI (Place of Interest) from week 11 to week 43. The goal is to use this data to predict the week 44 visit count for each POI. The steps to obtain the prediction are:

  • Retrieve the data from the database and visualize the parts of the data that are of interest. The data visualization covers:
    1. The number of data entries in each week
    2. City distribution
    3. Postal code distribution
    4. Top category distribution
    5. The number of weeks of data available for each POI
  • Start with simple statistical models and explore the univariate model predictions first. Extract the poi_id, week_number, and raw_visit_counts data into one CSV file for further use, then train the univariate models and calculate the MAE based on week 43 data. The selected models are as follows:
    1. Autoregression (AR)
    2. Moving Average (MA)
    3. Autoregressive Moving-Average (ARMA)
    4. Autoregressive Integrated Moving Average (ARIMA)
    5. Seasonal Autoregressive Integrated Moving-Average (SARIMA)
    6. Simple Exponential Smoothing (SES)
    7. Holt Winter’s Exponential Smoothing (HWES)
    8. Holt’s exponential smoothing (HOLT)
    9. Use auto_arima to automatically select a univariate model for each POI
  • Then, try the time series model LSTM to predict.
    1. Univariate LSTM
    2. Multivariate LSTM
  • Linear models and an ensemble model are also used for prediction:
    1. Average model
    2. Linear regression
    3. Support vector machine
    4. KNN
    5. Random forest
  • Next, a CNN model is used for prediction. The CNN model consists of:
    • An input layer
    • A convolution layer (ReLU)
    • A flatten layer
    • A dense layer (ReLU)
    • An output layer
  • All model prediction results are evaluated with the MAE (mean absolute error) on week 43 data

The final model selected is a univariate CNN model, which achieves an MAE of 9.2.
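
For reference, the evaluation metric used to compare all models is the mean absolute error over the $n$ POIs (a standard definition, stated here for clarity):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the observed visit count of POI $i$ in the evaluation week and $\hat{y}_i$ is the corresponding prediction.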

Log, imports, and config file

In [1]:
%%capture
%logstop
%logstart -t -r -q ipython_command_log.py global

#- IRONHACKS RESEARCH TRACKING CODE
#----------------------------------
# The following code is used to help our research team understand how you use
# our notebook environment. We do not collect any personal information with
# the following code; it is used to measure when and how often you work on
# your submission files.

import os
from datetime import datetime
import IPython.core.history as history

ha = history.HistoryAccessor()
ha_tail = ha.get_tail(1)
ha_cmd = next(ha_tail)
session_id = str(ha_cmd[0])
command_id = str(ha_cmd[1])
timestamp = datetime.utcnow().isoformat()
history_line = ','.join([session_id, command_id, timestamp]) + '\n'
logfile = open(os.environ['HOME']+'/ipython_session_log.csv', 'a')
logfile.write(history_line)
logfile.close()
In [243]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from google.cloud import bigquery
from google.oauth2 import service_account
from google.cloud.bigquery import magics
In [3]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] ='key.json'
bigquery_client = bigquery.Client(project='ironhacks-covid19-data')

Database tables:

  • ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
  • ironhacks-covid19-data.ironhacks_covid19_competition.mobility_graph
  • ironhacks-covid19-data.ironhacks_covid19_competition.cbg_social_distancing
  • ironhacks-covid19-data.ironhacks_covid19_competition.covid19_cases
  • ironhacks-covid19-data.ironhacks_covid19_competition.executive_orders
  • ironhacks-covid19-data.ironhacks_covid19_competition.prediction_list_poi

Data visualization

SQL table: weekly_patterns

Query and plot the number of data entries for each week.

In [4]:
QUERY = """
SELECT week_number, Count(*) as count
FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
GROUP BY week_number
ORDER BY week_number
"""
query_job = bigquery_client.query(QUERY)
week_number_count = query_job.to_dataframe()
week_number_count.head(10)
Out[4]:
week_number count
0 11 1612
1 12 1578
2 13 1564
3 14 1571
4 15 1548
5 16 1571
6 17 1585
7 20 1616
8 21 1612
9 22 1623
In [5]:
## fill the missing weeks' data with 0
week_number_count_array = np.insert(week_number_count.to_numpy(), 7, np.array([[18, 0],[19,0]]), axis=0)
# draw a bar plot
week_number = week_number_count_array[:,0]
week_count = week_number_count_array[:,1]
plt.figure(figsize=(12,8))
plt.barh(week_number, week_count)
plt.yticks(week_number)
plt.ylabel("Week number")
plt.xlabel("Number of Data Entries")
for i, v in enumerate(week_count):
    plt.text(v + 15, i + 10.7, str(v))
plt.show()

The query result shows that week 18 and week 19 data are missing. Weeks 40, 41, 42, and 43 each have 1804 entries, the most of any week. Therefore, week 40 data is used to further explore the dataset.

In [6]:
QUERY = """
SELECT Count(DISTINCT(poi_id)) as N_POI
FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
WHERE week_number = 40
"""
query_job = bigquery_client.query(QUERY)
week_40_count = query_job.to_dataframe()
week_40_count
Out[6]:
N_POI
0 1804

Counting with DISTINCT shows 1804 unique POIs in week 40, which indicates that each of the 1804 data entries belongs to a different POI.

Choose data from week 40 to explore the POIs' information

City

In [7]:
QUERY = """
SELECT city, Count(1) as N_city
FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
WHERE week_number = 40
GROUP BY city
ORDER BY N_city DESC
"""
query_job = bigquery_client.query(QUERY)
week_40_city_count = query_job.to_dataframe()
week_40_city_count
Out[7]:
city N_city
0 Lafayette 1254
1 West Lafayette 474
2 Battle Ground 20
3 Dayton 14
4 Shadeland 8
5 Romney 7
6 West Point 7
7 Stockwell 5
8 Americus 5
9 Clarks Hill 3
10 Buck Creek 2
11 Westpoint 2
12 W Lafayette 1
13 La Fayette 1
14 Otterbein 1
In [8]:
city_count = {}
for i in range(len(week_40_city_count)):
    if(week_40_city_count["city"][i] == "W Lafayette"):
        city_count["West Lafayette"] += week_40_city_count["N_city"][i]
    elif(week_40_city_count["city"][i] == "La Fayette"):
        city_count["Lafayette"] += week_40_city_count["N_city"][i]
    else:
        city_count[week_40_city_count["city"][i]] = week_40_city_count["N_city"][i]
In [9]:
city_label = list(city_count.keys())[::-1]
city_poi_count = list(city_count.values())[::-1]
plt.figure(figsize=(12,8))
plt.barh(city_label, city_poi_count)
plt.yticks(city_label)
plt.ylabel("City")
plt.xlabel("Number of POIs")
for i, v in enumerate(city_poi_count):
    plt.text(v + 10, i, str(v))
plt.show()

Counting the POIs in each city, the bar plot shows that Lafayette has the most POIs (1255) and West Lafayette ranks second with 475 POIs. The other cities fall far behind these two. City can serve as a categorical feature.

Postal_code

In [10]:
QUERY = """
SELECT postal_code, Count(1) as N_postal
FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
WHERE week_number = 40
GROUP BY postal_code
ORDER BY N_postal DESC
"""
query_job = bigquery_client.query(QUERY)
week_40_post_code_count = query_job.to_dataframe()
week_40_post_code_count
Out[10]:
postal_code N_postal
0 47905 651
1 47906 446
2 47904 296
3 47909 204
4 47901 126
5 47907 37
6 47920 13
7 47992 9
8 47941 8
9 47981 7
10 47930 3
11 47983 3
12 47924 1
In [11]:
post_code_count = {}
for i in range(len(week_40_post_code_count)):
        post_code_count[week_40_post_code_count["postal_code"][i]] = week_40_post_code_count["N_postal"][i]
In [12]:
post_code_label = list(map(str, list(post_code_count.keys())[::-1]))
post_code_value = list(post_code_count.values())[::-1]
plt.figure(figsize=(12,8))
plt.barh(post_code_label, post_code_value)
plt.ylabel("Postal Code")
plt.xlabel("Number of POIs")
for i, v in enumerate(post_code_value):
    plt.text(v + 5, i - 0.15, str(v))
plt.show()

From the bar plot, postal code 47905 has the most POIs. 47905 is located in Lafayette, which aligns with the previous city result. Postal code can serve as a categorical feature. Moreover, there are more postal codes than cities, and both encode the POIs' location, so choosing postal code as a feature gives more detailed location information than city.
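
The postal code feature is not actually fed into any of the models later in this notebook; as a minimal sketch (poi_info is a hypothetical per-POI dataframe, e.g. poi_id and postal_code queried from weekly_patterns for week 40), it could be one-hot encoded like this:

# Minimal sketch (assumption, not used later): one-hot encode postal_code.
# poi_info is a hypothetical dataframe with one row per POI and a postal_code column.
poi_postal_onehot = pd.get_dummies(poi_info["postal_code"].astype(str), prefix="zip")
poi_features = pd.concat([poi_info[["poi_id"]], poi_postal_onehot], axis=1)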

Top_category

In [13]:
QUERY = """
SELECT top_category, Count(1) as N_category
FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
WHERE week_number = 40
GROUP BY top_category
ORDER BY N_category DESC
"""
query_job = bigquery_client.query(QUERY)
week_40_category_count = query_job.to_dataframe()
week_40_category_count.head(10)
Out[13]:
top_category N_category
0 Restaurants and Other Eating Places 351
1 Museums, Historical Sites, and Similar Institu... 107
2 Religious Organizations 83
3 Automotive Repair and Maintenance 63
4 Other Amusement and Recreation Industries 61
5 Offices of Physicians 56
6 Health and Personal Care Stores 49
7 Elementary and Secondary Schools 49
8 Offices of Dentists 48
9 Personal Care Services 48
In [14]:
category_count = {}
for i in range(len(week_40_category_count)):
    if(week_40_category_count["top_category"][i] is None):
        category_count["None"] = week_40_category_count["N_category"][i]
    else:
        category_count[week_40_category_count["top_category"][i]] = week_40_category_count["N_category"][i]
In [15]:
len(category_count)
Out[15]:
105
In [16]:
category_label = list(category_count.keys())[:40][::-1]
category_value = list(category_count.values())[:40][::-1]
plt.figure(figsize=(12,8))
plt.barh(category_label, category_value)
plt.ylabel("Top Category")
plt.xlabel("Number of POIs")
for i, v in enumerate(category_value):
    plt.text(v + 5, i - 0.5, str(v))
plt.show()

Summarizing the top 40 of the 105 distinct categories among the listed POIs shows that "Restaurants and Other Eating Places" is the largest group, with 351 POIs under this category. Top category can serve as a categorical feature. Notice that there are 9 POIs with a "None" category within the top 40 categories.

Number of weeks in the database for each POI

In [102]:
QUERY = """
SELECT week_count, COUNT(week_count) as number_of_POI
FROM (
    SELECT poi_id, COUNT(week_number) as week_count
    FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
    GROUP BY poi_id
    ORDER BY week_count DESC
)
GROUP BY week_count
ORDER BY number_of_POI
"""
query_job = bigquery_client.query(QUERY)
poi_weeks = query_job.to_dataframe()
In [103]:
week_count_label = list(map(str, list(poi_weeks["week_count"])))
number_of_POI = list(poi_weeks["number_of_POI"])
plt.figure(figsize=(12,9))
plt.barh(week_count_label, number_of_POI)
plt.yticks(week_count_label)
plt.ylabel("The number of weeks of data included")
plt.xlabel("Number of POIs")
for i, v in enumerate(number_of_POI):
    plt.text(v + 15, i - 0.2, str(v))
plt.show()

Considering that two weeks of data are missing, the dataset has 1360 POIs with visit information for all 31 available weeks. Analyzing a sample of those POIs may reveal more about the numerical variables, since raw_visit_counts is a continuous feature.

Select 10 POIs with all weeks data

In [104]:
# select 10 of 1360 POIs with all weeks' data and further explore the numerical variables
QUERY = """
SELECT poi_id
FROM (
    SELECT poi_id, COUNT(week_number) as week_count
    FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
    GROUP BY poi_id
)
WHERE week_count = 31
LIMIT 10
"""
query_job = bigquery_client.query(QUERY)
ten_poi_with_all_weeks = query_job.to_dataframe()
ten_poi_with_all_weeks
Out[104]:
poi_id
0 6a2d40f8-c477-49b6-a5d5-69c31f001afc
1 49636347-5a9d-4bf3-9d9a-c21a4ba7d8f1
2 d98b37cd-ed64-456c-b580-c2c299c9f6d6
3 16f4e696-2065-4854-8ba6-a8cab2b67bf7
4 156f82b5-cba9-4a2a-b8ac-4b5cb206d27d
5 7137f730-7aaa-4d32-9e05-fa02ba002055
6 439d8d34-4278-431a-bc0c-b648204d9129
7 eb699a93-f2b4-4656-a1d3-e7aee5db262b
8 7e1debba-8343-4e4b-846f-efff3fcbe516
9 c4f6ee6d-c1fd-49c9-9c3c-f359f19e6bfe
In [105]:
# query the 10 selected POI data and store them as a list of dataframe with all the data
data_missing = []
for poi_id in ten_poi_with_all_weeks["poi_id"]:
    QUERY="""
    SELECT poi_id, week_number, visits_concentration, distance_from_home, median_dwell, raw_visit_counts
    FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
    WHERE poi_id="%s"
    """%(poi_id)
    query_job = bigquery_client.query(QUERY)
    data_missing.append(query_job.to_dataframe())
In [106]:
# fill the missing weeks with 0
data_fill = data_missing.copy()
for i in range(len(data_fill)):
    rows = pd.DataFrame([[data_fill[i]["poi_id"][0], 18, 0, 0, 0, 0], [data_fill[i]["poi_id"][0], 19, 0, 0, 0, 0]], columns=list(data_fill[i].columns))
    data_fill[i] = data_fill[i].append(rows, ignore_index=True)
    data_fill[i] = data_fill[i].sort_values(by=["week_number"])
    data_fill[i] = data_fill[i].reset_index(drop=True)
In [107]:
data_fill[0]
Out[107]:
poi_id week_number visits_concentration distance_from_home median_dwell raw_visit_counts
0 6a2d40f8-c477-49b6-a5d5-69c31f001afc 11 510.40 5834.0 109.0 23
1 6a2d40f8-c477-49b6-a5d5-69c31f001afc 12 844.44 12091.0 80.0 15
2 6a2d40f8-c477-49b6-a5d5-69c31f001afc 13 661.16 25415.0 31.0 22
3 6a2d40f8-c477-49b6-a5d5-69c31f001afc 14 666.67 4045.0 59.0 15
4 6a2d40f8-c477-49b6-a5d5-69c31f001afc 15 416.33 18115.0 61.0 35
5 6a2d40f8-c477-49b6-a5d5-69c31f001afc 16 528.00 24581.0 40.0 25
6 6a2d40f8-c477-49b6-a5d5-69c31f001afc 17 438.07 19384.0 61.0 43
7 6a2d40f8-c477-49b6-a5d5-69c31f001afc 18 0.00 0.0 0.0 0
8 6a2d40f8-c477-49b6-a5d5-69c31f001afc 19 0.00 0.0 0.0 0
9 6a2d40f8-c477-49b6-a5d5-69c31f001afc 20 269.59 14549.0 62.0 63
10 6a2d40f8-c477-49b6-a5d5-69c31f001afc 21 329.86 23359.0 39.5 48
11 6a2d40f8-c477-49b6-a5d5-69c31f001afc 22 351.54 28819.0 56.0 43
12 6a2d40f8-c477-49b6-a5d5-69c31f001afc 23 371.09 13749.0 35.5 32
13 6a2d40f8-c477-49b6-a5d5-69c31f001afc 24 487.51 17484.0 34.0 29
14 6a2d40f8-c477-49b6-a5d5-69c31f001afc 25 468.75 24259.0 55.0 32
15 6a2d40f8-c477-49b6-a5d5-69c31f001afc 26 659.72 32624.0 89.0 24
16 6a2d40f8-c477-49b6-a5d5-69c31f001afc 27 466.67 13047.0 52.0 30
17 6a2d40f8-c477-49b6-a5d5-69c31f001afc 28 1122.45 12676.0 48.5 14
18 6a2d40f8-c477-49b6-a5d5-69c31f001afc 29 429.69 33269.0 55.5 32
19 6a2d40f8-c477-49b6-a5d5-69c31f001afc 30 486.11 22483.0 98.0 24
20 6a2d40f8-c477-49b6-a5d5-69c31f001afc 31 449.22 7172.0 70.5 32
21 6a2d40f8-c477-49b6-a5d5-69c31f001afc 32 448.98 9004.0 80.0 35
22 6a2d40f8-c477-49b6-a5d5-69c31f001afc 33 509.89 22378.0 52.0 31
23 6a2d40f8-c477-49b6-a5d5-69c31f001afc 34 405.83 10231.0 30.0 31
24 6a2d40f8-c477-49b6-a5d5-69c31f001afc 35 637.12 6691.0 80.0 19
25 6a2d40f8-c477-49b6-a5d5-69c31f001afc 36 416.17 6691.0 20.0 29
26 6a2d40f8-c477-49b6-a5d5-69c31f001afc 37 572.32 17666.0 58.0 31
27 6a2d40f8-c477-49b6-a5d5-69c31f001afc 38 495.87 19441.0 37.0 22
28 6a2d40f8-c477-49b6-a5d5-69c31f001afc 39 466.67 21678.0 69.0 30
29 6a2d40f8-c477-49b6-a5d5-69c31f001afc 40 452.67 6519.0 51.0 27
30 6a2d40f8-c477-49b6-a5d5-69c31f001afc 41 486.11 7223.0 35.5 24
31 6a2d40f8-c477-49b6-a5d5-69c31f001afc 42 566.41 15415.0 47.5 32
32 6a2d40f8-c477-49b6-a5d5-69c31f001afc 43 459.18 3910.0 29.5 28
In [81]:
def plot_visit_count(data):
    plt.figure(figsize=(15,8))
    for i in range(len(data)):
        plt.plot(list(map(str, data[i]["week_number"])), data[i]["raw_visit_counts"], "o-", label=data[i]["poi_id"][0])
    plt.legend(bbox_to_anchor=(1.01, 1), loc='upper left', borderaxespad=0.)
    plt.grid()
    plt.show()
plot_visit_count(data_missing)
In [82]:
def plot_numerical_data(input_data):
    for data in input_data:
        x_time = list(map(str, data["week_number"]))
        y1 = data["visits_concentration"]
        y2 = data["distance_from_home"]
        y3 = data["median_dwell"]
        y = data["raw_visit_counts"]

        plt.figure(figsize=(8,4))
        plt.title(data["poi_id"][0])
        plt.plot(x_time, y1 / 10, "*-", label="visits_concentration / 10")
        plt.plot(x_time, y2 / 300, "^-", label="distance_from_home / 300")
        plt.plot(x_time, y3, "+-", label="median_dwell")
        plt.plot(x_time, y, "ro-", label='raw_visit_counts')
        plt.xticks(x_time)
        plt.grid()
        plt.legend()
        plt.show()
In [ ]:
plot_numerical_data(data_missing)

Select POIs with no more than 5 weeks of data

In [99]:
QUERY = """
SELECT poi_id
FROM (
    SELECT poi_id, COUNT(week_number) as week_count
    FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
    GROUP BY poi_id
)
WHERE week_count <= 5
"""
query_job = bigquery_client.query(QUERY)
week_40_poi_less_5_weeks = query_job.to_dataframe()
In [85]:
# select all POIs data with no more than 5 weeks of data
data_less_5 = []
for poi_id in week_40_poi_less_5_weeks["poi_id"]:
    QUERY="""
    SELECT *
    FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
    WHERE poi_id="%s"
    """%(poi_id)
    query_job = bigquery_client.query(QUERY)
    data_less_5.append(query_job.to_dataframe())
In [86]:
week_available = []
for i in range(len(data_less_5)):
    week_available.append(list(data_less_5[i]["week_number"]))
In [87]:
plt.figure(figsize=(4,3))
plt.eventplot(week_available, linewidths=5)
plt.grid()
plt.show()
In [ ]:
plot_numerical_data(data_less_5)

The event plot shows which weeks are present for the POIs with no more than 5 weeks of data. It indicates that most of them only have the last four weeks.

Univariate models

The first trial is to test models on a single variable, raw_visit_counts. The steps are:

  • Prepare the training data for each POI
  • Train each model separately
  • Predict the upcoming week for each POI
  • (Optional) For POIs with no more than 5 weeks of data, take the average of the existing data as the prediction result.
  • Calculate the MAE
In [89]:
def mse_loss(y_true, y_pred):
    """
    MSE loss function
    Inputs param
    -------------------------
    y_true: list
        The ground truth values
    y_pred: list
        The predicted values
    -------------------------
    return: number
        mean square error of the two input lists
    """
    if(len(y_true) != len(y_pred)):
        print("True label len:" + str(len(y_true)) + ", Predict label len: " + str(len(y_pred)))
        raise Exception("Input lists have different length")
    mse = np.mean((np.array(y_true) - np.array(y_pred))**2)
    return mse
In [90]:
def mae_loss(y_true, y_pred):
    """
    MAE loss function
    Inputs param
    -------------------------
    y_true: list
        The ground truth values
    y_pred: list
        The predicted values
    -------------------------
    return: number
        mean absolute error of the two input lists
    """
    if(len(y_true) != len(y_pred)):
        print("True label len:" + str(len(y_true)) + ", Predict label len: " + str(len(y_pred)))
        raise Exception("Input lists have different length")
    mae = np.mean(np.absolute(np.array(y_true) - np.array(y_pred)))
    return mae

For the POIs with no more than 5 weeks of data

Extract all weeks' raw_visit_counts data from the database

In [91]:
week_visit_less_5 = {} # dict stores all visit count grouped by week, example: week_visit_less_5[40] retrieves week 40's raw_visit_count of all selected POIs
poi_ids_less_5 = [] # list stores all selected POIs' IDs
for data in data_less_5:
    poi_ids_less_5.append(data["poi_id"][0])
    if not week_visit_less_5:
        for i in range(len(data["raw_visit_counts"])):
            week_visit_less_5[data["week_number"][i]] = []
    for i in range(len(data["raw_visit_counts"])):
        week_visit_less_5[data["week_number"][i]].append(data["raw_visit_counts"][i])
print(poi_ids_less_5)
print(week_visit_less_5)
['d70a16ea-4982-4391-b7bd-6789f7018718', 'e71e79cd-71ae-4fd3-abf8-34f31a384702', '5576691b-4038-4b20-9736-4fc4c6e3a65f', '780472e8-9738-4832-a3d5-9736ed0a8f93', 'b5bbc683-775d-4fb5-b120-1c304302e136', '2d4ebc33-75c8-4822-a48d-57401f36ddfc', 'db4b221a-b849-4ffe-ac01-b26d46a15e57', 'e3a4fcba-f285-4113-81f5-dab8df7a4c41', '03e9a7b8-8502-4587-aef8-874041b57fd9', '77baea34-296d-4f0f-9b0b-1e191233101a', '48337bca-d4fd-46bf-b70a-9ada732d116a', '9ea86083-43d8-43b0-b014-3136616c32b7', '235937be-0295-4874-a3a7-7956e559a834', '67dfe851-d702-494b-9ee6-e68c836dcf84', 'a4ad5af3-ece6-4298-ac4a-e0a0af6329fb', '5919c7d1-c861-4d92-860e-3ef51f52f054', '72e0f3da-e654-43b9-8a4b-e21dba4101ef', '35644203-45ba-4f4a-9ee0-942a392927e8', 'a0d73d31-39d0-4753-8d12-edf8b44d31b2']
{39: [526], 40: [342, 46, 9, 261, 396, 63, 1, 7, 2, 9, 159, 33, 78, 5, 18, 683, 118, 8, 8], 41: [304, 42, 9, 229, 390, 54, 2, 4, 5, 12, 157, 35, 80, 6, 18, 669, 83, 2, 8], 42: [312, 40, 6, 237, 377, 75, 3, 2, 3, 12, 104, 41, 74, 7, 22, 653, 193, 10, 8], 43: [330, 39, 17, 267, 363, 66, 5, 6, 3, 10, 84, 39, 48, 7, 22, 652, 51, 4, 7]}

Using average model

Calculate the average raw_visit_counts of the existing data as the prediction

In [92]:
average_predict_result_less_5 = []
for data in data_less_5:
    average_predict_result_less_5.append(np.rint(data["raw_visit_counts"].sum() / len(data["raw_visit_counts"])))

sum_mae = 0
for i in range(4):
    mae = mae_loss(week_visit_less_5[i + 40], np.array(average_predict_result_less_5))
    sum_mae += mae
    print("MAE Loss for week " + str(i + 40) + " is: " + str(mae))
print("Average MAE is: " + str(sum_mae / 4))
MAE Loss for week 40 is: 7.2105263157894735
MAE Loss for week 41 is: 9.68421052631579
MAE Loss for week 42 is: 11.473684210526315
MAE Loss for week 43 is: 12.157894736842104
Average MAE is: 10.131578947368421

Linear regression model

Fit a linear regression model and predict the result.

In [93]:
from sklearn.linear_model import LinearRegression

linear_predict_result_less_5 = []
for data in data_less_5:
    X = np.array(data["week_number"]).reshape(-1, 1)
    y = np.array(data["raw_visit_counts"]).reshape(-1, 1)
    model = LinearRegression()
    model.fit(X, y)
    pred = model.predict(np.array([40, 41, 42, 43, 44]).reshape(-1, 1))
    linear_predict_result_less_5.append(np.rint(pred.reshape(1, -1)[0]))

sum_mae = 0
for i in range(np.array(linear_predict_result_less_5).shape[1] - 1):
    mae = mae_loss(week_visit_less_5[i + 40], np.array(linear_predict_result_less_5)[:, i])
    sum_mae += mae
    print("MAE Loss for week " + str(i + 40) + " is: " + str(mae))
print("Average MAE is: " + str(sum_mae / 4))
MAE Loss for week 40 is: 6.052631578947368
MAE Loss for week 41 is: 8.210526315789474
MAE Loss for week 42 is: 8.31578947368421
MAE Loss for week 43 is: 7.473684210526316
Average MAE is: 7.5131578947368425

Support vector regression model (kernel='linear')

Fit an SVR model and predict the result.

In [94]:
from sklearn.svm import SVR

SVR_predict_result_less_5 = []
for data in data_less_5:
    X = np.array(data["week_number"]).reshape(-1, 1)
    y = np.array(data["raw_visit_counts"]).reshape(-1, 1)
    model = SVR(kernel='linear')
    model.fit(X, y)
    pred = model.predict(np.array([40, 41, 42, 43, 44]).reshape(-1, 1))
    SVR_predict_result_less_5.append(list(np.rint(pred)))

sum_mae = 0
for i in range(np.array(SVR_predict_result_less_5).shape[1] - 1):
    mae = mae_loss(week_visit_less_5[i + 40], np.array(SVR_predict_result_less_5)[:, i])
    sum_mae += mae
    print("MAE Loss for week " + str(i + 40) + " is: " + str(mae))
print("Average MAE is: " + str(sum_mae / 4))
MAE Loss for week 40 is: 4.157894736842105
MAE Loss for week 41 is: 6.7894736842105265
MAE Loss for week 42 is: 10.157894736842104
MAE Loss for week 43 is: 8.263157894736842
Average MAE is: 7.342105263157895

KNN regression model (N = 2)

Fit a KNN regression model and predict the result. n_neighbors = 2

In [177]:
from sklearn.neighbors import KNeighborsRegressor

knn_predict_result_less_5 = []
for data in data_less_5:
    X = np.array(data["week_number"]).reshape(-1, 1)
    y = np.array(data["raw_visit_counts"]).reshape(-1, 1)
    model = KNeighborsRegressor(n_neighbors = 2)
    model.fit(X, y)
    pred = model.predict(np.array([40, 41, 42, 43, 44]).reshape(-1, 1))
    knn_predict_result_less_5.append(list((np.rint(pred).reshape(1, -1)[0])))

sum_mae = 0
for i in range(np.array(knn_predict_result_less_5).shape[1] - 1):
    mae = mae_loss(week_visit_less_5[i + 40], np.array(knn_predict_result_less_5)[:, i])
    sum_mae += mae
    print("MAE Loss for week " + str(i + 40) + " is: " + str(mae))
print("Average MAE is: " + str(sum_mae / 4))
MAE Loss for week 40 is: 8.157894736842104
MAE Loss for week 41 is: 4.157894736842105
MAE Loss for week 42 is: 7.052631578947368
MAE Loss for week 43 is: 7.578947368421052
Average MAE is: 6.736842105263157

Random forest model

Fit a random forest model (a RandomForestClassifier over the integer visit counts) and predict the result.

In [96]:
from sklearn.ensemble import RandomForestClassifier

rf_predict_result_less_5 = []
for data in data_less_5:
    X = np.array(data["week_number"]).reshape(-1, 1)
    y = np.array(data["raw_visit_counts"]).reshape(-1, 1)
    model = RandomForestClassifier(max_features=None)
    model.fit(X, y)
    pred = model.predict(np.array([40, 41, 42, 43, 44]).reshape(-1, 1))
    rf_predict_result_less_5.append(list((np.rint(pred).reshape(1, -1)[0])))

sum_mae = 0
for i in range(np.array(rf_predict_result_less_5).shape[1] - 1):
    mae = mae_loss(week_visit_less_5[i + 40], np.array(rf_predict_result_less_5)[:, i])
    sum_mae += mae
    print("MAE Loss for week " + str(i + 40) + " is: " + str(mae))
print("Average MAE is: " + str(sum_mae / 4))
MAE Loss for week 40 is: 0.0
MAE Loss for week 41 is: 0.0
MAE Loss for week 42 is: 0.0
MAE Loss for week 43 is: 0.0
Average MAE is: 0.0

Although the random forest reports an MAE of 0.0, weeks 40 to 43 are part of its training data, so it is memorizing those weeks rather than forecasting. Based on the results shown above, KNN is selected to predict the POIs with no more than 5 weeks of data.

In [98]:
poi_less_5_week_44_pred = {"poi_id": poi_ids_less_5, "raw_visit_counts": np.array(knn_predict_result_less_5)[:, -1]}
poi_less_5_week_44_pred_df = pd.DataFrame.from_dict(poi_less_5_week_44_pred)
poi_less_5_week_44_pred_df
Out[98]:
poi_id raw_visit_counts
0 d70a16ea-4982-4391-b7bd-6789f7018718 321.0
1 e71e79cd-71ae-4fd3-abf8-34f31a384702 40.0
2 5576691b-4038-4b20-9736-4fc4c6e3a65f 12.0
3 780472e8-9738-4832-a3d5-9736ed0a8f93 252.0
4 b5bbc683-775d-4fb5-b120-1c304302e136 370.0
5 2d4ebc33-75c8-4822-a48d-57401f36ddfc 70.0
6 db4b221a-b849-4ffe-ac01-b26d46a15e57 4.0
7 e3a4fcba-f285-4113-81f5-dab8df7a4c41 4.0
8 03e9a7b8-8502-4587-aef8-874041b57fd9 3.0
9 77baea34-296d-4f0f-9b0b-1e191233101a 11.0
10 48337bca-d4fd-46bf-b70a-9ada732d116a 94.0
11 9ea86083-43d8-43b0-b014-3136616c32b7 40.0
12 235937be-0295-4874-a3a7-7956e559a834 61.0
13 67dfe851-d702-494b-9ee6-e68c836dcf84 7.0
14 a4ad5af3-ece6-4298-ac4a-e0a0af6329fb 22.0
15 5919c7d1-c861-4d92-860e-3ef51f52f054 652.0
16 72e0f3da-e654-43b9-8a4b-e21dba4101ef 122.0
17 35644203-45ba-4f4a-9ee0-942a392927e8 7.0
18 a0d73d31-39d0-4753-8d12-edf8b44d31b2 8.0

Test the KNN model on the entire dataset

In [101]:
# select all POIs
QUERY = """
SELECT poi_id
FROM ironhacks-covid19-data.ironhacks_covid19_competition.prediction_list_poi
"""
query_job = bigquery_client.query(QUERY)
all_poi_id = query_job.to_dataframe()
In [109]:
data_all = []
for poi_id in all_poi_id["poi_id"]:
    QUERY="""
    SELECT poi_id, week_number, raw_visit_counts
    FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
    WHERE poi_id="%s"
    """%(poi_id)
    query_job = bigquery_client.query(QUERY)
    data_all.append(query_job.to_dataframe())
In [130]:
week_all = {} # dict stores all visit count grouped by week
poi_ids_all = all_poi_id["poi_id"].tolist() # list stores all selected POIs' IDs
for data in data_all:
    for i in range(len(data["raw_visit_counts"])):
        week_all[data["week_number"][i]] = []

for data in data_all:
    for i in range(len(data["raw_visit_counts"])):
        week_all[data["week_number"][i]].append(data["raw_visit_counts"][i])
In [256]:
knn_predict_all = []
for data in data_all:
    X = np.array(data["week_number"]).reshape(-1, 1)
    y = np.array(data["raw_visit_counts"]).reshape(-1, 1)
    model = KNeighborsRegressor(n_neighbors = 2)
    model.fit(X, y)
    pred = model.predict(np.array([40, 41, 42, 43, 44]).reshape(-1, 1))
    knn_predict_all.append(list((np.rint(pred).reshape(1, -1)[0])))
In [257]:
sum_mae = 0
for i in range(4):
    mae = mae_loss(week_all[i + 40], np.array(knn_predict_all)[:, i])
    sum_mae += mae
    print("MAE Loss for week " + str(i + 40) + " is: " + str(mae))
print("Average MAE is: " + str(sum_mae / 4))
MAE Loss for week 40 is: 9.459534368070953
MAE Loss for week 41 is: 6.42239467849224
MAE Loss for week 42 is: 6.976164079822617
MAE Loss for week 43 is: 7.635254988913526
Average MAE is: 7.623337028824834
In [172]:
knn_all_poi_univariate_predict = {"poi_id": poi_ids_all, "raw_visit_counts": np.array(knn_predict_all)[:, -1]}
knn_all_poi_univariate_predict_df = pd.DataFrame.from_dict(knn_all_poi_univariate_predict)
knn_all_poi_univariate_predict_df
Out[172]:
poi_id raw_visit_counts
0 ad5fbefe-9e6a-47da-8011-8dadd9ad4cfe 4.0
1 0cd903d2-00a1-44ac-b262-fbc7475eb2bd 4.0
2 780472e8-9738-4832-a3d5-9736ed0a8f93 252.0
3 e9316e9d-8db6-4134-8a50-1c1da16fb9bc 44.0
4 5545ebf0-4f14-4f52-a499-8a5bdaa1df10 22.0
... ... ...
1799 8fb606bf-5918-4c16-af3c-1ab39ae46251 5.0
1800 801457d7-be9f-41c6-a87c-038ffd7f8b81 20.0
1801 a09953ae-81ed-4ed8-9dd5-be160e91297f 195.0
1802 45fe4a7f-30ec-4be2-ae71-34b833df2424 16.0
1803 097307c9-bc10-4b58-b27e-350acd49ff14 46.0

1804 rows × 2 columns

Statistical model prediction function

Define the prediction function for the univariate models. Specifically, the selected models are:

  • Autoregression (AR)
  • Moving Average (MA)
  • Autoregressive Moving-Average (ARMA)
  • Autoregressive Integrated Moving Average (ARIMA)
  • Seasonal Autoregressive Integrated Moving-Average (SARIMA)
  • Simple Exponential Smoothing (SES)
  • Holt Winter’s Exponential Smoothing (HWES)
  • Holt’s exponential smoothing (HOLT)
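
The prediction function in the next cell relies on three globals that are prepared outside this section: poi_visit_array (one visit-count series per POI covering weeks 11 to 43, with missing weeks filled), week_43_true (the observed week 43 counts), and poi_pred_id (the POI ids in the same order). A minimal sketch of how they could be built from the CSV extracted in the Background step (the file name, column layout, and zero-filling are assumptions):

# Hypothetical preparation of the globals used by univariate_model_predict.
poi_week_visit = pd.read_csv("poi_week_visit.csv")  # poi_id, week_number, raw_visit_counts

# one row per POI, one column per week (11-43); missing weeks filled with 0
poi_visit_df = (poi_week_visit
                .pivot(index="poi_id", columns="week_number", values="raw_visit_counts")
                .reindex(columns=range(11, 44))
                .fillna(0))

poi_pred_id = poi_visit_df.index.tolist()   # POI ids in a fixed order
poi_visit_array = poi_visit_df.to_numpy()   # shape: (number of POIs, 33 weeks)
week_43_true = poi_visit_df[43].tolist()    # ground truth for week 43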
In [13]:
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.holtwinters import Holt

def univariate_model_predict(model_type, predict_week=44):
    """
    univariate models prediction function.
    param
    -------------------
    model_type : string
        Select the univariate model that to be trained.
    predict_week : number
        Only 43 and 44. The week to be predicted. Default week is 44.
    -------------------
    return
    predict : list
        Return predicted result of all poi visit counts in a list.
    model : object
        Trained model.
    """
    predict = []
    for data in poi_visit_array:
        if(predict_week == 43):
            train_data = data[:-1]
        elif(predict_week == 44):
            train_data = data

        if(model_type == "AR"):
            model = AutoReg(train_data, lags=1)
        elif(model_type == "MA"):
            model = ARIMA(train_data, order=(0, 0, 1))
        elif(model_type == "ARMA"):
            model = ARIMA(train_data, order=(2, 0, 1))
        elif(model_type == "ARIMA"):
            model = ARIMA(train_data, order=(1, 1, 1))
        elif(model_type == "SARIMA"):
            model = SARIMAX(train_data, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
        elif(model_type == "SES"):
            model = SimpleExpSmoothing(train_data)
        elif(model_type == "HWES"):
            model = ExponentialSmoothing(train_data)
        elif(model_type == "HOLT"):
            model = Holt(train_data)

        if(model_type == "SARIMA"):
            model_fit = model.fit(disp=False)
        else:
            model_fit = model.fit()

        if(model_type == "ARIMA"):
            pred = model_fit.predict(len(train_data), len(train_data), typ="levels")
        else:
            pred = model_fit.predict(len(train_data), len(train_data))
        predict.append(np.rint(pred[0]))
    return predict, model_fit

In [13]:
# AR model predict
ar_pred, ar_model = univariate_model_predict("AR", 43)
ar_mae = mae_loss(week_43_true, ar_pred)
print("AR predict MAE for week 43 is: " + str(ar_mae))
AR predict MAE for week 43 is: 14.932926829268293
In [60]:
# MA model predict
ma_pred, ma_model = univariate_model_predict("MA", 43)
ma_mae = mae_loss(week_43_true, ma_pred)
print("MA predict MAE for week 43 is: " + str(ma_mae))
MA predict MAE for week 43 is: 21.431818181818183
In [61]:
# ARMA model predict
arma_pred, arma_model = univariate_model_predict("ARMA", 43)
arma_mae = mae_loss(week_43_true, arma_pred)
print("ARMA predict MAE for week 43 is: " + str(arma_mae))
ARMA predict MAE for week 43 is: 15.05820399113082
In [62]:
# ARIMA model predict
arima_pred, arima_model = univariate_model_predict("ARIMA", 43)
arima_mae = mae_loss(week_43_true, arima_pred)
print("ARIMA predict MAE for week 43 is: " + str(arima_mae))
ARIMA predict MAE for week 43 is: 15.241130820399112
In [63]:
# SARIMA model predict
sarima_pred, sarima_model = univariate_model_predict("SARIMA", 43)
sarima_mae = mae_loss(week_43_true, sarima_pred)
print("SARIMA predict MAE for week 43 is: " + str(sarima_mae))
SARIMA predict MAE for week 43 is: 15.241130820399112
In [64]:
# SES model predict
ses_pred, ses_model = univariate_model_predict("SES", 43)
ses_mae = mae_loss(week_43_true, ses_pred)
print("SES predict MAE for week 43 is: " + str(ses_mae))
SES predict MAE for week 43 is: 14.394124168514413
In [65]:
# HWES model predict
hwes_pred, hwes_model = univariate_model_predict("HWES", 43)
hwes_mae = mae_loss(week_43_true, hwes_pred)
print("HWES predict MAE for week 43 is: " + str(hwes_mae))
HWES predict MAE for week 43 is: 14.394124168514413
In [66]:
# HOLT model predict
holt_pred, holt_model = univariate_model_predict("HOLT", 43)
holt_mae = mae_loss(week_43_true, holt_pred)
print("HOLT predict MAE for week 43 is: " + str(holt_mae))
HOLT predict MAE for week 43 is: 16.795454545454547
In [18]:
# Select the best univariate model to predict week 44
best_pred, best_model = univariate_model_predict("SES")
week_44 = {'poi_id':poi_pred_id, 'raw_visit_counts': best_pred}
week_44_df = pd.DataFrame(week_44)
week_44_df
Out[18]:
poi_id raw_visit_counts
0 00243ce2-6ecc-4c63-8d58-1138476289e4 6.0
1 00251e06-3f79-49f1-9e5a-7735e9b7836f 21.0
2 004e8d59-b91b-488a-b051-ed2769d8e8e3 193.0
3 00529b9c-54fe-4690-b319-48cf4c9eb226 30.0
4 005ec245-ca7b-4814-b706-8f0737c045b2 5.0
... ... ...
1799 ff017941-e5c5-4c3e-ad8e-51bee7a0d3f3 211.0
1800 ff047527-176f-4d9b-9d91-0d0ab8805a44 14.0
1801 ffa5b55e-5f61-4cd7-8f66-741f492c34c6 150.0
1802 ffdda836-f200-470e-9e9d-c8917a164043 9.0
1803 fffcc243-6692-466d-a427-4696a2f7ab61 11.0

1804 rows × 2 columns

In [ ]:
week_44_df.to_csv("prediction_list_poi.csv", index=False)
In [13]:
%%capture
!pip install pmdarima

Use auto_arima to select univariate models for each POI

In [14]:
from pmdarima import auto_arima
def auto_univariate_predict(predict_week=44):
    prediction = []
    for data in poi_visit_array:
        if(predict_week == 43):
            train_data = data[:-1]
        elif(predict_week == 44):
            train_data = data
        result = auto_arima(train_data, start_p=1, start_q=1, max_p=10, max_d=7,max_q=10, max_P=4, max_D=4,max_Q=4)
        pred = np.around(result.fit_predict(train_data, n_periods=1))
        prediction.append(pred)
    return prediction

auto_model_week_43_pred = auto_univariate_predict(predict_week=43)
print(mae_loss(week_43_true, auto_model_week_43_pred))
116.41370986376664

LSTM model

In [15]:
%%capture
!pip install keras
!pip install tensorflow
In [14]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

def lstm_univariate_model(n_feature):
    model = Sequential()
#     model.add(Dense(10, activation="relu"))
    model.add(LSTM(4, input_shape=(1, n_feature)))
    model.add(Dense(1))
    model.compile(loss="mae", optimizer="adam", metrics=['accuracy'])
    return model
In [ ]:
# split into train and test sets
dataset = poi_visit_array[0][:-1]
train_size = int(len(dataset) * 0.7)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size], dataset[train_size:len(dataset)]
print(train, test)

# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back)]
        dataX.append(a)
        dataY.append(dataset[i + look_back])
    return np.array(dataX), np.array(dataY)

look_back=2
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

lstm_uni_model = lstm_univariate_model(look_back)
lstm_uni_model.summary()
lstm_uni_data = lstm_uni_model.fit(trainX, trainY, validation_data=(testX, testY), epochs=1000, batch_size=1, verbose=2)
In [16]:
np.rint(lstm_uni_model.predict(testX)[-1])
Out[16]:
array([4.], dtype=float32)
In [17]:
plt.figure()
plt.plot(lstm_uni_data.history['loss'])
plt.plot(lstm_uni_data.history['val_loss'])
plt.title('LSTM Loss vs Epoch')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'test'], loc='upper left')
Out[17]:
<matplotlib.legend.Legend at 0x7fbae6849750>
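
The next cell relies on two globals that are not built in this section: POI_ID (an iterable of all 1804 POI ids) and weekly_patterns_data (the weekly_patterns rows as a dataframe). A minimal sketch of how they could be obtained (the exact column selection is an assumption):

# Hypothetical preparation of the globals used in the next cell.
QUERY = """
SELECT poi_id, week_number, visits_concentration, median_dwell
FROM ironhacks-covid19-data.ironhacks_covid19_competition.weekly_patterns
"""
weekly_patterns_data = bigquery_client.query(QUERY).to_dataframe()
POI_ID = all_poi_id["poi_id"].tolist()  # all_poi_id was queried earlier from prediction_list_poi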
In [18]:
POI_CONCEN = {k:[] for k in POI_ID}
POI_MED_DWELL = {k:[] for k in POI_ID}

# init with all 0
for k in POI_CONCEN:
    for i in range(11,44):
        POI_CONCEN[k].append({i:0})
        POI_MED_DWELL[k].append({i:0})

for i in range(0,len(weekly_patterns_data)):
    for week_concen, week_med_dwell in zip(POI_CONCEN[weekly_patterns_data["poi_id"][i]], POI_MED_DWELL[weekly_patterns_data["poi_id"][i]]):
        if(weekly_patterns_data["week_number"][i] in week_concen):
            week_concen[weekly_patterns_data["week_number"][i]] = weekly_patterns_data["visits_concentration"][i]
            week_med_dwell[weekly_patterns_data["week_number"][i]] = weekly_patterns_data["median_dwell"][i]
poi_concen_df = pd.DataFrame.from_dict(POI_CONCEN)
poi_dwell_df = pd.DataFrame.from_dict(POI_MED_DWELL)
In [ ]:
for i in range(0, 33):
    for j in range(0, 1804):
        d1 = poi_concen_df.iloc[i,j]
        d2 = poi_dwell_df.iloc[i,j]
        poi_concen_df.iloc[i,j] = next(iter(d1.values()))
        poi_dwell_df.iloc[i,j] = next(iter(d2.values()))
In [ ]:
print(poi_visit_df.shape)
print(poi_concen_df.shape)
print(poi_dwell_df.shape)
In [ ]:
poi_concen_array = poi_concen_df.T.to_numpy()
poi_dwell_array = poi_dwell_df.T.to_numpy()

NUM_PLOTS = 1804
# fill the missing weeks 18 and 19 (indices 7 and 8) with the 5-week moving
# average of the preceding weeks
for l1, l2 in zip(poi_concen_array, poi_dwell_array):
    temp_sum1, temp_sum2 = 0, 0
    for i in range(2,7):
        temp_sum1 += l1[i]
        temp_sum2 += l2[i]

        if i == 6:
            l1[i+1] = temp_sum1 / 5
            l2[i+1] = temp_sum2 / 5
            temp_sum1 += l1[i+1] - l1[2]
            temp_sum2 += l2[i+1] - l2[2]
            l1[i+2] = temp_sum1 / 5
            l2[i+2] = temp_sum2 / 5

plt.figure(figsize=(15,5))
for i in range(0, NUM_PLOTS):
    plt.plot(poi_concen_array[i])

plt.figure(figsize=(15,5))
for i in range(0, NUM_PLOTS):
    plt.plot(poi_dwell_array[i])
In [6]:
# VAR (vector autoregression) demo on synthetic data, kept as a reference; not part of the final prediction pipeline
from statsmodels.tsa.vector_ar.var_model import VAR
from random import random
data = []
for i in range(100):
    v1 = i + random()
    v2 = v1 + random()
    v3 = v2 + random()
    row = [v1, v2, v3]
    data.append(row)
model = VAR(data)
model_fit = model.fit()
# make prediction
yhat = model_fit.forecast(model_fit.y, steps=10)
print(yhat)
[[100.33382282 100.80847979 101.17674311]
 [101.33181272 101.81319217 102.2230662 ]
 [102.32875227 102.81134276 103.21327589]
 [103.32584419 103.80897534 104.2099197 ]
 [104.32298401 104.80654484 105.20550124]
 [105.32019597 105.80417301 106.20128148]
 [106.31747797 106.80186901 107.19710928]
 [107.31483049 107.7996353  108.19301007]
 [108.31225351 108.79747207 109.18898079]
 [109.30974704 109.79537937 110.18502197]]
In [ ]:
# normal lstm
print(poi_visit_array)
print(poi_concen_array)
print(poi_dwell_array)
In [ ]:
# look back steps
# convert to [rows, columns] structure


aaa = 0
in_seq1 = poi_dwell_array[aaa][:-1].reshape((len(poi_dwell_array[aaa][:-1]), 1))
in_seq2 = poi_concen_array[aaa][:-1].reshape((len(poi_concen_array[aaa][:-1]), 1))
out_seq = poi_visit_array[aaa][:-1].reshape((len(poi_visit_array[aaa][:-1]), 1))
in_seq1 = nor_seq(in_seq1)
in_seq2 = nor_seq(in_seq2)

# horizontally stack columns
dataset = np.hstack((in_seq1, in_seq2, out_seq))
# choose a number of time steps
n_steps = 2
# convert into input/output
X, y = split_sequences(dataset, n_steps)
print(X.shape, y.shape)
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
# print(np.array(in_seq1).astype(None))
In [333]:
n_features = X.shape[2]
x_input = []
for n in range(n_steps):
    x_input.append([in_seq1[n-n_steps], in_seq2[n-n_steps]])
x_input = np.array(x_input).reshape((1, n_steps, n_features)).astype(None)
print(x_input)
[[[0.0501139  0.183673  ]
  [0.64464692 0.5       ]]]
In [344]:
n_features = X.shape[2]
model = Sequential()
model.add(LSTM(5, activation='relu', input_shape=(n_steps, n_features)))
# model.add(Dense(8, activation='relu'))
# model.add(Dense(32, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mae')
model_data = model.fit(X, y, epochs=500, batch_size=8, verbose=0, shuffle=False)
yhat = model.predict(x_input)
print(np.rint(yhat).astype(int)[0][0])
4
In [345]:
plt.plot(model_data.history['loss'])
Out[345]:
[<matplotlib.lines.Line2D at 0x7fd2a31f4690>]
In [351]:
def multi_lstm(n_steps, n_features):
    """
    Multivariate LSTM model
    Input: param
    ------------------------
    n_steps: number
        Number of LSTM look back steps
    n_features: number
        Number of features used as X
    ------------------------
    return: object
        An LSTM model with input shape (n_steps, n_features)
    """
    model = Sequential()
    model.add(LSTM(5, activation='relu', input_shape=(n_steps, n_features)))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')
    return model

def nor_seq(seq):
    """
    Normalization function, normalize a list to 0~1 scale
    Input param
    ------------------------
    seq: list
        Input list to be normalized
    ------------------------
    return: list
        Normalized list
    """
    return [i/max(seq) for i in seq]

def split_sequences(sequences, n_steps):
    """
    split a multivariate sequence into samples
    # https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
    Input param
    ------------------------
    sequences: array
        Input array to be split, using look-back windows, into features X and labels y
    n_steps: number
        Number of look back steps to generate the model formula
    ------------------------
    return
    ------------------------
    X: numpy array
        X feature array
    y: numpy array
        y labels
    """
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix >= len(sequences):
            break
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix, -1]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X).astype(None), np.array(y).astype(None)
In [352]:
def train_multi_lstm(n_steps = 2, epochs=500):
    """
    Use data to train individual LSTM models for each POI and predict.
    Input param
    -------------------------
    n_steps: number
        Number of LSTM look back steps, default 2
    epochs: number
        Training epochs, default 500
    -------------------------
    return: list, list
        return the LSTM predict results of all POIs visit count for week 43 and 44
    """
    predict_43 = []
    predict_44 = []

    for i in range(len(poi_visit_array)):
        in_seq1 = poi_dwell_array[i].reshape((len(poi_dwell_array[i]), 1))
        in_seq2 = poi_concen_array[i].reshape((len(poi_concen_array[i]), 1))
        out_seq = poi_visit_array[i].reshape((len(poi_visit_array[i]), 1))

        # normalize input features
        in_seq1 = nor_seq(in_seq1)
        in_seq2 = nor_seq(in_seq2)

        # horizontally stack columns
        dataset = np.hstack((in_seq1, in_seq2, out_seq))

        # convert into input/output
        X, y = split_sequences(dataset, n_steps)

        # create the LSTM model and train it
        n_features = X.shape[2]
        model = multi_lstm(n_steps, n_features=n_features)
        model.fit(X, y, epochs=epochs, batch_size=16, verbose=0, shuffle=False)

        # generate prediction features
        x_input_43 = []
        x_input_44 = []

        for n in range(n_steps):
            x_input_43.append([in_seq1[n-n_steps-1], in_seq2[n-n_steps-1]])
            x_input_44.append([in_seq1[n-n_steps], in_seq2[n-n_steps]])

        x_input_43 = np.array(x_input_43).reshape((1, n_steps, n_features)).astype(None)
        x_input_44 = np.array(x_input_44).reshape((1, n_steps, n_features)).astype(None)

        y_pred_43 = np.rint(model.predict(x_input_43)).astype(int)[0][0]
        y_pred_44 = np.rint(model.predict(x_input_44)).astype(int)[0][0]

        predict_43.append(y_pred_43)
        predict_44.append(y_pred_44)

        print(str(round(100*i/1804, 2))+"% "+str(y_pred_43)+", "+str(y_pred_44))
    return predict_43, predict_44
In [ ]:
import os, logging, time, datetime
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
logging.getLogger("tensorflow").setLevel(logging.CRITICAL)
logging.getLogger("tensorflow_hub").setLevel(logging.CRITICAL)

start_time = time.time()
pred_43, pred_44 = train_multi_lstm()
print(datetime.timedelta(seconds=(time.time() - start_time)))
In [380]:
lstm_mae = mae_loss(week_43_true, pred_43)
print(lstm_mae)
week_44_lstm = {'poi_id':poi_pred_id, 'raw_visit_counts': pred_44}
week_44_lstm_df = pd.DataFrame(week_44_lstm)
week_44_lstm_df.to_csv("submission_prediction_output.csv", index=False)
week_44_lstm_df
55.73115299334812
Out[380]:
poi_id raw_visit_counts
0 00243ce2-6ecc-4c63-8d58-1138476289e4 6
1 00251e06-3f79-49f1-9e5a-7735e9b7836f 14
2 004e8d59-b91b-488a-b051-ed2769d8e8e3 1
3 00529b9c-54fe-4690-b319-48cf4c9eb226 36
4 005ec245-ca7b-4814-b706-8f0737c045b2 3
... ... ...
1799 ff017941-e5c5-4c3e-ad8e-51bee7a0d3f3 28
1800 ff047527-176f-4d9b-9d91-0d0ab8805a44 15
1801 ffa5b55e-5f61-4cd7-8f66-741f492c34c6 68
1802 ffdda836-f200-470e-9e9d-c8917a164043 10
1803 fffcc243-6692-466d-a427-4696a2f7ab61 13

1804 rows × 2 columns

CNN model

In [ ]:
from keras.layers import Conv1D, Flatten

def cnn_model(n_steps=1, n_features=1):
    model = Sequential()
    model.add(Conv1D(filters=64, kernel_size=2, activation='relu', input_shape=(n_steps, n_features)))
    model.add(Flatten())
    model.add(Dense(50, activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mae')
    return model
In [ ]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

cnn_predit_all = []

n_steps = 3
n_features = 1
i = 0
for data in data_all:
    X, y = create_dataset(data["raw_visit_counts"], look_back = n_steps)
    X = X.reshape((X.shape[0], X.shape[1], 1))
    model = cnn_model(n_steps = n_steps, n_features = n_features)
    model.fit(X, y, epochs=1000, verbose = 0)
    input_x = np.array(data["raw_visit_counts"][-3:]).reshape(1, n_steps, 1)
    pred = np.rint(model.predict(input_x, verbose = 0))[0][0]
    print(str(i + 1) + "/1804: " + str(pred))
    i = i + 1
    cnn_predit_all.append(pred)
In [261]:
print(mae_loss(week_all[43], cnn_predit_all))
9.512749445676276
In [262]:
for i in range(len(cnn_predit_all)):
    if(cnn_predit_all[i] < 0):
        cnn_predit_all[i] = 0
print(mae_loss(week_all[43], cnn_predit_all))
9.240576496674057
In [269]:
cnn_all_poi_univariate_predict = {"poi_id": poi_ids_all, "raw_visit_counts": np.array(cnn_predit_all)}
cnn_all_poi_univariate_predict_df = pd.DataFrame.from_dict(cnn_all_poi_univariate_predict)
cnn_all_poi_univariate_predict_df
Out[269]:
poi_id raw_visit_counts
0 ad5fbefe-9e6a-47da-8011-8dadd9ad4cfe 5.0
1 0cd903d2-00a1-44ac-b262-fbc7475eb2bd 2.0
2 780472e8-9738-4832-a3d5-9736ed0a8f93 266.0
3 e9316e9d-8db6-4134-8a50-1c1da16fb9bc 50.0
4 5545ebf0-4f14-4f52-a499-8a5bdaa1df10 23.0
... ... ...
1799 8fb606bf-5918-4c16-af3c-1ab39ae46251 8.0
1800 801457d7-be9f-41c6-a87c-038ffd7f8b81 13.0
1801 a09953ae-81ed-4ed8-9dd5-be160e91297f 189.0
1802 45fe4a7f-30ec-4be2-ae71-34b833df2424 16.0
1803 097307c9-bc10-4b58-b27e-350acd49ff14 57.0

1804 rows × 2 columns
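
The univariate CNN above is the final model selected (MAE of about 9.2 on week 43, as noted in the Background). A submission file could then be written in the same format as the earlier models; the file name below follows the earlier LSTM cell and is an assumption here:

# Assumed final export, mirroring the earlier submission cells.
cnn_all_poi_univariate_predict_df.to_csv("submission_prediction_output.csv", index=False)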