1. Introduction
Coverage of time series analysis with RNNs tends to overlook how to shape tabular data into the input shape an RNN expects. Articles and books that do cover this all too often obscure the data shaping with preprocessing, data cleaning, and similar steps. While these steps are a necessary part of analyzing real data, they can cause confusion when the goal is to understand how to shape input data correctly. In this article, we use a tabular data set and explicitly show how to format it for use in RNNs. We then run the RNN and plot results to verify that our data has been shaped properly.
2. Data
The data can be downloaded from here. Below is a snippet.
x0 | x1 | x2 | x3 |
---|---|---|---|
0.8855309006 | 0.8751045805 | 0.898579328 | 0.8897903446 |
0.8634506763 | 0.8523858494 | 0.8770729802 | 0.8676927599 |
0.4464741327 | 0.4233513496 | 0.4328841807 | 0.4112926447 |
0.2384104806 | 0.2092709983 | 0.2682779031 | 0.2421611309 |
0.3114450687 | 0.2844175706 | 0.3394142844 | 0.3152531419 |
Rows run from past to present: the top row is the oldest observation and the bottom row is the most recent. For convenience, we will refer to a single time period as a day. There are 4 features; we use all 4 as input and a future value of x3 as the target to be predicted. Each sample uses 25 consecutive days of data, which we call days_backward, and the target is the value of x3 5 days after the last of those days, which we call days_forward. Here is the code to read in the tabular data:
    import os
    from pathlib import Path

    import pandas as pd


    def foo():
        # create then save data
        # NOTE: base_dir (a directory path ending in '/') must be defined before calling
        df_data = pd.read_csv(base_dir + 'df_data.csv')
        list_of_features = list(df_data.columns)  # ['x0', 'x1', 'x2', 'x3']
        num_features_str = str(len(list_of_features))

        backward_days = 25
        forward_days = 5

        data_directory = base_dir + 'features_' + num_features_str + '/'
        if not Path(data_directory).is_dir():
            os.mkdir(data_directory)

        create_save_data(backward_days, forward_days, list_of_features,
                         df_data, data_directory)
Here is where we shape the data.
    import numpy as np
    from sklearn.model_selection import train_test_split


    def create_save_data(days_backward, days_forward, list_features, df_data, data_dir):
        # format data
        num_points = df_data.shape[0]
        x = df_data[list_features].values  # shape = (num_points, len(list_features))
        z = df_data['x3'].values           # shape = (num_points,)

        # get chunks of data: each sample is a window of days_backward rows,
        # and its target is x3 days_forward rows after the end of that window
        list_x_backward = []
        list_forward = []
        for i in range(num_points - days_backward - days_forward + 1):
            backward = x[i:days_backward + i]
            list_x_backward.append(backward)
            forward = z[i + days_backward + days_forward - 1]
            list_forward.append(forward)

        # shape = (num_points - days_backward - days_forward + 1, days_backward, len(list_features))
        features_array = np.array(list_x_backward)
        # shape = (num_points - days_backward - days_forward + 1,)
        target_array = np.array(list_forward)

        # split into calibration (first 80%) and production (last 20%)
        num_split = int(0.8 * len(target_array))
        features_calibration = features_array[:num_split]
        target_calibration = target_array[:num_split]
        features_production = features_array[num_split:]
        target_production = target_array[num_split:]
        np.save(data_dir + 'features_production.npy', features_production)
        np.save(data_dir + 'target_production.npy', target_production)

        # split calibration into train/test, then save
        f_train, f_test, t_train, t_test = train_test_split(
            features_calibration, target_calibration, test_size=0.2, shuffle=True)
        np.save(data_dir + 'features_train.npy', f_train)
        np.save(data_dir + 'features_test.npy', f_test)
        np.save(data_dir + 'target_train.npy', t_train)
        np.save(data_dir + 'target_test.npy', t_test)
The input feature arrays now have shape (number of samples, days backward, number of features). For the production data this is (95, 25, 4).
The target arrays have shape (number of samples,). For the production data this is (95,).
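To make the index arithmetic easier to check by hand, here is a minimal sketch of the same windowing loop on a toy array. The 10-day, 2-feature table and the small window sizes are assumptions chosen purely for illustration; they are not the article's data. Each value encodes its own row number, so it is easy to see which days end up in a window and which day supplies the target.

```python
import numpy as np

# Toy stand-in for the real table (illustrative assumption): 10 days, 2 features.
# Feature 0 is the day index, feature 1 is the day index + 0.5.
num_points, days_backward, days_forward = 10, 3, 2
x = np.column_stack([np.arange(num_points), np.arange(num_points) + 0.5])
z = x[:, 1]  # the last column plays the role of x3

windows, targets = [], []
for i in range(num_points - days_backward - days_forward + 1):
    windows.append(x[i:i + days_backward])                   # days i .. i+2
    targets.append(z[i + days_backward + days_forward - 1])  # 2 days after the window ends

features_array = np.array(windows)
target_array = np.array(targets)

print(features_array.shape)      # (6, 3, 2)
print(target_array.shape)        # (6,)
print(features_array[0, :, 0])   # [0. 1. 2.]  days in the first window
print(target_array[0])           # 4.5  -> day 4, two days after day 2
```

The first window covers days 0 to 2 and its target comes from day 4, which is days_forward days after the last input day.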
While this is not the most efficient way to shape the data, our goal was to make it easy to read and easy to use. Note that we only have to specify days backward, days forward, and number of features.
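For readers who do want a faster version, the following is one possible vectorized sketch of the loop in create_save_data, using NumPy's sliding_window_view (available in NumPy 1.20 and later). The helper name make_windows and the random test array are hypothetical and only serve the illustration; this is an alternative under those assumptions, not the approach used in this article.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def make_windows(x, z, days_backward, days_forward):
    """Hypothetical vectorized equivalent of the windowing loop."""
    num_samples = x.shape[0] - days_backward - days_forward + 1
    # The window axis is appended last: (n_windows, n_features, days_backward)
    windows = sliding_window_view(x, days_backward, axis=0)
    # Keep only windows that have a target, then reorder the axes to
    # (samples, days_backward, features)
    features = windows[:num_samples].transpose(0, 2, 1)
    # The target for window i sits at row i + days_backward + days_forward - 1
    targets = z[days_backward + days_forward - 1:]
    return features, targets


# Random stand-in data (assumption): 40 days, 4 features, last column as target
rng = np.random.default_rng(0)
x = rng.random((40, 4))
z = x[:, 3]

features, targets = make_windows(x, z, days_backward=25, days_forward=5)
print(features.shape, targets.shape)  # (11, 25, 4) (11,)
```

Note that sliding_window_view returns a read-only view into the original array, so call np.array(features) or features.copy() if the windows need to be modified afterwards.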
3. The Recurrent Neural Network
Our RNN parameters are:
- epochs = 300
- epochs_stop = 10
- batch_size = 128
- verbose = 0
- loss = 'mean_squared_error'
- layer_1_units_rnn = 32
- layer_1_units_mlp = 16
The RNN code:
    from keras.models import Sequential, load_model
    from keras.layers import Dense, Dropout, GRU
    from keras.callbacks import ModelCheckpoint, EarlyStopping


    def gru_early_stop(xtrain_rnn, ytrain, xtest_rnn, ytest, xproduction_rnn, yproduction,
                       results_dir, epochs_num, stop_epochs, batch_num, verbosity,
                       loss_type, layer1_units_rnn, layer1_units_mlp):
        results_file = 'gru_results'
        model_file = results_dir + 'gru.h5'

        # dimensions
        num_timesteps = xtrain_rnn.shape[1]  # days backward
        num_features = xtrain_rnn.shape[2]
        shape_of_input = (num_timesteps, num_features)

        # stop when val_loss stops improving and keep only the best model on disk
        callbacks_list = [
            EarlyStopping(monitor='val_loss', patience=stop_epochs),
            ModelCheckpoint(filepath=model_file, monitor='val_loss', save_best_only=True)]

        # GRU model
        model = Sequential()
        model.add(GRU(units=layer1_units_rnn, input_shape=shape_of_input,
                      activation="relu", recurrent_dropout=0.3))
        model.add(Dense(layer1_units_mlp, activation="relu"))
        model.add(Dropout(0.3))
        model.add(Dense(1))
        model.compile(loss=loss_type, optimizer='Adam')

        h = model.fit(xtrain_rnn, ytrain, epochs=epochs_num, batch_size=batch_num,
                      validation_data=(xtest_rnn, ytest), verbose=verbosity,
                      callbacks=callbacks_list)

        # get the best model
        model_best = load_model(model_file)
        yproduction_predicted = model_best.predict(xproduction_rnn)
The key variable is the tuple passed as input_shape to the first recurrent layer. It must be (days backward, number of features), which matches the shape of each individual sample in the input feature array. The leading number-of-samples dimension is not part of input_shape.
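As a quick sanity check on this point, the sketch below derives the tuple from an array and compares it with the shape of a single sample. The zero-filled array is a stand-in (an assumption) that only reuses the production shape quoted earlier; no model is built here.

```python
import numpy as np

# Stand-in array with the production shape quoted earlier:
# (samples, days backward, features)
features_production = np.zeros((95, 25, 4))

num_samples, num_timesteps, num_features = features_production.shape
shape_of_input = (num_timesteps, num_features)  # the sample axis is left out

print(shape_of_input)                # (25, 4)
print(features_production[0].shape)  # (25, 4)  one sample
```

If the array handed to the network is 2-D, or its last two axes are swapped, this comparison fails before any training time is spent.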
Here is a summary of the RNN:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    gru_1 (GRU)                  (None, 32)                3552
    _________________________________________________________________
    dense_1 (Dense)              (None, 16)                528
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 16)                0
    _________________________________________________________________
    dense_2 (Dense)              (None, 1)                 17
    =================================================================
    Total params: 4,097
    Trainable params: 4,097
    Non-trainable params: 0
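The parameter counts in this summary can be reproduced by hand, which is another check that the network sees 4 features per time step. The sketch below assumes the GRU formulation with a single bias vector per gate, which is consistent with the 3552 reported above; Keras versions that default to reset_after=True keep two bias vectors per gate and would report a different number. The helper names are hypothetical.

```python
def gru_params(units, features):
    # 3 weight sets (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights, and one bias vector (assumed)
    return 3 * (units * (features + units) + units)


def dense_params(inputs, units):
    # weight matrix plus one bias per unit
    return inputs * units + units


gru = gru_params(units=32, features=4)
dense_1 = dense_params(inputs=32, units=16)
dense_2 = dense_params(inputs=16, units=1)

print(gru, dense_1, dense_2, gru + dense_1 + dense_2)  # 3552 528 17 4097
```

Note that the number of time steps (25) does not appear in the count: the same recurrent weights are reused at every step.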
4. Results
- Mean Squared Errors:
  - Train = 0.0955
  - Test = 0.1210
  - Production = 0.0626
The code runs without errors and the resulting errors are reasonable, which indicates that the data was shaped correctly. We made no attempt to optimize parameters.
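One last shaping detail matters when computing errors like these by hand: the targets have shape (number of samples,), while a network ending in Dense(1) returns predictions of shape (number of samples, 1). Subtracting the two directly broadcasts to a square matrix instead of pairing values element by element. The sketch below shows the pitfall and one way around it; the helper name mse_1d and the small arrays are illustrative assumptions, not the code used to produce the numbers above.

```python
import numpy as np


def mse_1d(y_true, y_pred):
    """Mean squared error after flattening both arrays to 1-D (hypothetical helper)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must contain the same number of values")
    return float(np.mean((y_true - y_pred) ** 2))


y_true = np.array([0.2, 0.4, 0.6])        # shape (3,), like the target arrays
y_pred = np.array([[0.1], [0.4], [0.9]])  # shape (3, 1), like a Dense(1) output

print((y_true - y_pred).shape)  # (3, 3): broadcasting, not element-wise pairs
print(mse_1d(y_true, y_pred))   # roughly 0.0333
```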