Shaping Input Data for Analyzing Multivariate Time Series with Recurrent Neural Networks

1. Introduction

Coverage of time series analysis with RNNs tends to overlook how to shape tabular data into the input shape the network expects. Articles and books that do cover it all too often obscure the shaping step with preprocessing, data cleaning, and so on. While these steps are a necessary part of analyzing real data, they can cause confusion when trying to understand how to shape input data correctly. In this article, we take a tabular data set and show explicitly how to format it for use in RNNs. We then run the RNN and plot results to verify that our data has been shaped properly.

2. Data

The data can be downloaded from here. Below is a snippet.

x0	x1	x2	x3
0.8855309006	0.8751045805	0.898579328	0.8897903446
0.8634506763	0.8523858494	0.8770729802	0.8676927599
0.4464741327	0.4233513496	0.4328841807	0.4112926447
0.2384104806	0.2092709983	0.2682779031	0.2421611309
0.3114450687	0.2844175706	0.3394142844	0.3152531419

Rows run from the oldest observation at the top to the most recent at the bottom. We will refer to a single time period as a day for convenience. There are 4 features; we use all 4 as input and a future value of x3 as the target to be predicted. Each input sample is a window of 25 consecutive days (days_backward), and its target is the value of x3 5 days after the last day of that window (days_forward). Here is the code to read in the tabular data:

import pandas as pd
import os
from pathlib import Path

def foo(base_dir):
    # read the tabular data, then create and save the shaped arrays
    # base_dir: directory holding df_data.csv, given with a trailing '/'
    df_data = pd.read_csv(base_dir + 'df_data.csv')
    list_of_features = list(df_data.columns)  # ['x0','x1','x2','x3']
    num_features_str = str(len(list_of_features))
    backward_days = 25
    forward_days = 5
    
    data_directory = base_dir + 'features_' + num_features_str + '/'
    if not Path(data_directory).is_dir():
        os.mkdir(data_directory)
        
    create_save_data(backward_days, forward_days, list_of_features, 
                     df_data, data_directory)

Here is where we shape the data.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def create_save_data(days_backward, days_forward, list_features, 
                     df_data, data_dir):    

    # format data
    num_points = df_data.shape[0]
    x = df_data[list_features].values  # shape = (num_points, len(list_features))
    z = df_data['x3'].values  # shape = (num_points,) 
    
    # get chunks of data
    list_x_backward = []
    list_forward = []
    for i in range(num_points - days_backward - days_forward + 1):
        backward = x[i:days_backward + i]
        list_x_backward.append(backward)
        
        forward = z[i + days_backward + days_forward - 1]
        list_forward.append(forward)
        
    features_array = np.array(list_x_backward)
    # shape=(num_points-days_backward-days_forward+1, days_backward, len(list_features))
    
    target_array = np.array(list_forward)
    # shape=(num_points-days_backward-days_forward+1,)
    
    # split into calibration and production
    num_split = int(0.8*len(target_array))
    
    features_calibration = features_array[:num_split]
    target_calibration = target_array[:num_split]
    
    features_production = features_array[num_split:]
    np.save(data_dir + 'features_production.npy', features_production)
    target_production = target_array[num_split:]
    np.save(data_dir + 'target_production.npy', target_production)
    
    
    # split calib into train/test, then save
    f_train, f_test, t_train, t_test = train_test_split(
        features_calibration, target_calibration, test_size=0.2, shuffle=True)
    np.save(data_dir + 'features_train.npy', f_train)
    np.save(data_dir + 'features_test.npy', f_test)
    np.save(data_dir + 'target_train.npy', t_train)
    np.save(data_dir + 'target_test.npy', t_test)

The input feature arrays now have shape (number of samples, days backward, number of features). For the production data this is (95, 25, 4).
The target arrays have shape (number of samples,). For the production data this is (95,).
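The sample counts follow directly from the loop bounds and the 80/20 split in create_save_data. The sketch below reproduces that arithmetic. The helper name window_counts is hypothetical, and the 500-row table is only an assumed example: the actual number of rows in the data set is not stated here, although 500 rows would be consistent with the 95 production samples quoted above.

```python
def window_counts(num_points, days_backward, days_forward, calibration_fraction=0.8):
    """Return (total samples, calibration samples, production samples)."""
    # one sample per window start that still leaves room for its target
    num_samples = num_points - days_backward - days_forward + 1
    # same rule as num_split in create_save_data
    num_calibration = int(calibration_fraction * num_samples)
    num_production = num_samples - num_calibration
    return num_samples, num_calibration, num_production


# assumed example: a table with 500 rows
print(window_counts(500, 25, 5))  # (471, 376, 95)
```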

While this is not the most efficient way to shape the data, our goal was to make it easy to read and easy to use. Note that we only have to specify days backward, days forward, and the list of features.
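For readers who do want a more efficient version, one possible alternative to the loop is NumPy's sliding_window_view (available in NumPy 1.20 and later). This is a sketch, not the method used in this article; the function name make_windows and the random demo data are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(x, z, days_backward, days_forward):
    """Vectorized counterpart of the windowing loop: returns (features, target)."""
    num_points = x.shape[0]
    num_samples = num_points - days_backward - days_forward + 1
    # sliding_window_view appends the window axis last:
    # shape = (num_points - days_backward + 1, num_features, days_backward)
    windows = sliding_window_view(x, window_shape=days_backward, axis=0)
    # keep only windows that have a target, then move time to axis 1
    features = np.ascontiguousarray(windows[:num_samples].transpose(0, 2, 1))
    # target for window i is z[i + days_backward + days_forward - 1]
    target = z[days_backward + days_forward - 1:]
    return features, target


# random stand-in for the real table: 40 days, 4 features
rng = np.random.default_rng(0)
x_demo = rng.random((40, 4))
z_demo = x_demo[:, 3]
features_demo, target_demo = make_windows(x_demo, z_demo, 25, 5)
print(features_demo.shape, target_demo.shape)  # (11, 25, 4) (11,)
```

The result has the same (number of samples, days backward, number of features) layout as the loop version, so the rest of the pipeline would not need to change.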

3. The Recurrent Neural Network

Our RNN parameters are:

epochs = 300
epochs_stop = 10
batch_size = 128
verbose = 0
loss = 'mean_squared_error'
layer_1_units_rnn = 32
layer_1_units_mlp = 16

The RNN code:

from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, GRU, LSTM
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model 
from keras.layers import Dropout
    
def gru_early_stop(xtrain_rnn, ytrain, xtest_rnn, ytest, xproduction_rnn, yproduction,
                   results_dir, epochs_num, stop_epochs, batch_num, verbosity,
                   loss_type, layer1_units_rnn, layer1_units_mlp):        
    results_file = 'gru_results'    
    model_file = results_dir + 'gru.h5'
    
    # dimensions
    num_timesteps = xtrain_rnn.shape[1]  # days backward
    num_features = xtrain_rnn.shape[2]
    shape_of_input = (num_timesteps, num_features)

    
    callbacks_list = [
            EarlyStopping(monitor='val_loss', patience=stop_epochs),
            ModelCheckpoint(filepath=model_file, monitor='val_loss', save_best_only=True)]
    
    # GRU model
    model = Sequential()
    model.add(GRU(units=layer1_units_rnn, 
                  input_shape=shape_of_input, 
                  activation="relu", recurrent_dropout=0.3))        
    model.add(Dense(layer1_units_mlp, activation="relu")) 
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.compile(loss=loss_type, optimizer='Adam')
    
    h = model.fit(xtrain_rnn, ytrain, epochs=epochs_num, batch_size=batch_num,
                  validation_data=(xtest_rnn, ytest), verbose=verbosity,
                  callbacks=callbacks_list)
    
    # get the best model
    model_best = load_model(model_file)
    
    yproduction_predicted = model_best.predict(xproduction_rnn)

The key variable is the tuple that specifies input_shape for the initial RNN layer. It must be (days backward, number of features), which matches the shape of each individual sample in the input feature array.
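A quick check before building the model can catch shaping mistakes early. The sketch below derives the input_shape tuple from a feature array and validates it against the target array. The helper name rnn_input_shape is hypothetical, and the zero-filled arrays only stand in for the saved production data.

```python
import numpy as np


def rnn_input_shape(features, target):
    """Return (timesteps, num_features) after basic sanity checks."""
    if features.ndim != 3:
        raise ValueError(f"expected 3-D features, got shape {features.shape}")
    if features.shape[0] != target.shape[0]:
        raise ValueError("features and target differ in number of samples")
    # drop the sample axis; what remains is the shape of one sample
    return features.shape[1:]


# stand-ins with the production shapes quoted earlier
features_demo = np.zeros((95, 25, 4))
target_demo = np.zeros(95)
print(rnn_input_shape(features_demo, target_demo))  # (25, 4)
```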

Here is a summary of the RNN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
gru_1 (GRU)                  (None, 32)                3552
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17
=================================================================
Total params: 4,097
Trainable params: 4,097
Non-trainable params: 0
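The parameter counts in the summary can be reproduced by hand, which is another way to confirm that the network sees 4 features per time step. The sketch below uses the standard GRU and dense-layer formulas. The helper names are hypothetical, and the reset_after flag is included on the assumption that newer Keras versions, which use two bias vectors per GRU weight set by default, would report a larger GRU count than the 3552 shown above.

```python
def gru_params(units, num_features, reset_after=False):
    """Parameters in a GRU layer: three weight sets (update gate,
    reset gate, candidate state), each with an input kernel,
    a recurrent kernel, and a bias."""
    bias_per_set = 2 * units if reset_after else units
    return 3 * (units * num_features + units * units + bias_per_set)


def dense_params(num_inputs, units):
    """Parameters in a fully connected layer: weights plus biases."""
    return num_inputs * units + units


gru = gru_params(units=32, num_features=4)
dense_1 = dense_params(num_inputs=32, units=16)
dense_2 = dense_params(num_inputs=16, units=1)
print(gru, dense_1, dense_2, gru + dense_1 + dense_2)  # 3552 528 17 4097
```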

4. Results

Mean Squared Errors:
Train = 0.0955
Test = 0.1210
Production = 0.0626

The code runs properly and the results are reasonable, thus verifying that our data shaping was correct. We made no attempt to optimize parameters.
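The evaluation code is not shown in this article, so the following is only an assumed way to compute a mean squared error from the predictions. It highlights one shaping detail: Keras predict returns an array of shape (number of samples, 1), while the target array has shape (number of samples,), so subtracting them directly would broadcast to a square matrix. The function name and the demo values are hypothetical.

```python
import numpy as np


def mean_squared_error_1d(y_true, y_pred):
    """MSE after flattening both arrays to 1-D to avoid accidental broadcasting."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred differ in number of samples")
    return float(np.mean((y_true - y_pred) ** 2))


# made-up values: targets with shape (3,), predictions with shape (3, 1)
y_true_demo = np.array([0.2, 0.4, 0.6])
y_pred_demo = np.array([[0.1], [0.4], [0.9]])
mse_demo = mean_squared_error_1d(y_true_demo, y_pred_demo)
print(mse_demo)
```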
