Time Series Learning Notes

These are my notes on time series from the Coursera course Sequences, Time Series, and Prediction, taught by Laurence Moroney. The course link is https://www.coursera.org/learn/tensorflow-sequences-time-series-and-prediction/home/welcome

Basic sequence and prediction

A time series can be broken into three components: trend, seasonality, and noise. We try to predict future values from the trend and seasonality; the noise is unpredictable. Some basic methods are straightforward and do not use machine learning techniques.

The practice below uses univariate synthetic data. Univariate data shows up in practice when predicting temperature, customer growth, prices, and so on from historical data, where the “independent variable” is time and the target is the quantity we are interested in.

1. Naive prediction

Naive prediction is very intuitive: it takes the current value as the prediction for the next value. This makes sense and happens all the time in daily life. For example, although it is not always the case, we usually expect tomorrow's gasoline price to be the same as today's. This is the simplest way to predict the future from historical data. In Python, we can simply shift the series back by one step to get the predictions.

naive_forecast = series[validation_split_point - 1:-1]
# this is the prediction on the validation data, starting from a pre-defined validation split index.

The naive prediction is the baseline for all models. The error (usually MSE or MAE) of a more complex model should be smaller than the baseline's.
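As a rough sketch of how such a baseline could be scored, the snippet below builds a small toy series (the series, `split_time`, and the variable names are assumptions for illustration, not the course data) and computes the MSE and MAE of the naive forecast on the validation part.

```python
import numpy as np

# Hypothetical toy series: an upward trend plus a repeating wave.
time = np.arange(200)
series = 0.1 * time + 5.0 * np.sin(2 * np.pi * time / 20)

split_time = 150                 # assumed train/validation split index
x_valid = series[split_time:]

# Naive forecast: each validation point is predicted by the value just before it.
naive_forecast = series[split_time - 1:-1]

errors = naive_forecast - x_valid
baseline_mse = np.mean(errors ** 2)
baseline_mae = np.mean(np.abs(errors))
print(f"baseline MSE={baseline_mse:.3f}, MAE={baseline_mae:.3f}")
```

Any later model can then be compared against `baseline_mse` and `baseline_mae` on the same validation slice.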

2. Moving average

A moving average takes the average over a time window to predict the value at the next time step. For example, we calculate the 30-day average price in the stock market as a reference line for understanding the future stock price.

import numpy as np

# Calculate the average of the historical values in a window and
# return the results as an array indexed by time.
def moving_average_forecast(series, window_size):
    forecast = []
    # start from the first index that has a full window of prior data
    for i in range(window_size, len(series)):
        forecast.append(series[i - window_size:i].mean())
    return np.array(forecast)

# Call the moving average function and keep the forecasts from the
# validation time onward, looking back 30 days.
moving_avg = moving_average_forecast(series, 30)[split_time - 30:]

The moving average seems like a good way to estimate the future. However, its error can be even higher than that of the naive prediction. How does this happen? The moving average does not take trend and seasonality into account, so averaging smooths out some of the important movements we are interested in. Below is the example from the course.

Moving Average Prediction
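One way to see why averaging can hurt is to try it on a series that is nothing but trend. The sketch below uses a hypothetical straight line with slope 2 (not the course data) and compares the moving average forecast with the naive forecast.

```python
import numpy as np

def moving_average_forecast(series, window_size):
    """Forecast each point as the mean of the window_size values before it."""
    return np.array([series[i - window_size:i].mean()
                     for i in range(window_size, len(series))])

series = 2.0 * np.arange(100)    # hypothetical series: pure trend, +2 per step
window_size = 30

ma_forecast = moving_average_forecast(series, window_size)
naive_forecast = series[window_size - 1:-1]
actual = series[window_size:]

ma_mae = np.mean(np.abs(actual - ma_forecast))
naive_mae = np.mean(np.abs(actual - naive_forecast))
print(ma_mae, naive_mae)
```

On this line the naive forecast is always one step (2 units) behind, while the 30-step average sits at the middle of its window and so trails by 31 units. The wider the window, the further the average lags behind a trend.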

3. Advanced moving average

Removing the trend and seasonality helps the moving average method give a better result. Since the synthetic data has a seasonality of one year, we can subtract the value at the same time last year from the value at each time step to get the year-on-year change. We can apply the moving average to this difference and then add the trend and seasonality back by summing last year's values with the predicted difference.

# Calculate the difference: the data from the second year to the end
# minus the data from the first year to the second-to-last year (leap
# years are not considered here). The first year has no reference
# year, so drop it from the time axis.
diff_series = series[365:] - series[:-365]
diff_time = time[365:]

# Apply the moving average to diff_series. The diff series starts
# from the second year, so subtract 365 (plus the 30-day window) from
# the index when taking the predictions for the validation set.
diff_moving_avg = moving_average_forecast(diff_series, 30)[split_time - 365 - 30:]

We can then add the trend and seasonality back by adding the prediction of the difference and the value on the same day but last year.

diff_moving_avg_plus_past = series[split_time - 365:-365] + diff_moving_avg
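To check that the index arithmetic lines up, here is a small self-contained sketch on a noise-free toy series. The 7-step season, the `pattern` values, and the 5-step window are assumptions chosen to keep the example short; the course data uses a 365-day season.

```python
import numpy as np

def moving_average_forecast(series, window_size):
    return np.array([series[i - window_size:i].mean()
                     for i in range(window_size, len(series))])

period = 7                                        # hypothetical season length
pattern = np.array([0.0, 3.0, 5.0, 4.0, 1.0, -2.0, -4.0])
time = np.arange(20 * period)
series = 0.5 * time + pattern[time % period]      # trend + seasonality, no noise

split_time = 15 * period
window_size = 5

# Season-on-season difference removes the seasonality and leaves the trend step.
diff_series = series[period:] - series[:-period]
diff_moving_avg = moving_average_forecast(
    diff_series, window_size)[split_time - period - window_size:]

# Add the value from one season earlier back onto the smoothed difference.
forecast = series[split_time - period:-period] + diff_moving_avg
max_error = np.max(np.abs(forecast - series[split_time:]))
print(max_error)
```

Because this toy series has no noise, the differenced series is constant and the forecast matches the validation data up to rounding. With real data the noise remains, which is what the extra smoothing step is meant to reduce.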

We can further average the data from last year to get rid of some noise.

diff_moving_avg_smooth_plus_past = moving_average_forecast(series[split_time - 370:-360], 10) + diff_moving_avg

After smoothing last year's values, the MSE can be higher than before, while the MAE can be lower. This is because the MSE squares the errors, which magnifies the influence of the larger ones. However, the smoothed prediction is more general, since it depends less on the noise in last year's values.
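A tiny illustration of how the two metrics can disagree, using two made-up error lists (they are not results from the course): one forecast is off by a steady amount, the other is mostly exact but has a single large miss.

```python
def mse(errors):
    """Mean squared error of a list of forecast errors."""
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error of a list of forecast errors."""
    return sum(abs(e) for e in errors) / len(errors)

steady_errors = [2.0, 2.0, 2.0, 2.0]   # hypothetical: always off by 2
spiky_errors = [0.0, 0.0, 0.0, 6.0]    # hypothetical: one large miss

print("steady:", mse(steady_errors), mae(steady_errors))
print("spiky: ", mse(spiky_errors), mae(spiky_errors))
```

The spiky forecast has the lower MAE (1.5 versus 2.0) but the higher MSE (9.0 versus 4.0), because squaring makes the single error of 6 count for 36.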

DNN and Time Series

Moving on from the basic statistical analysis of time series, we can bring in machine learning to try to improve the prediction accuracy further. We will use a simple deep neural network to do the job.

1. Prepare the windowed data

In supervised machine learning models, usually, we have a set of independent variables (x) and one dependent variable (y). So what are the x and y in time series problems? The ‘x’ is actually a set of historical values and the ‘y’ is the next value we need to predict. To build the time series machine learning model, we need to create the windowed data which takes old data as the features and the next unknown value as the target.

To create windowed data in Python from the synthetic data of the last section (which has a seasonality of 365 days and an upward trend), follow the code below:

import tensorflow as tf

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    dataset = tf.data.Dataset.from_tensor_slices(series)
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
    dataset = dataset.batch(batch_size).prefetch(1)
    return dataset

Where:
tf.data.Dataset.from_tensor_slices: converts the series into a TensorFlow dataset.
window: slices the data into windows, each shifted by one value from the previous one. drop_remainder=True drops the windows at the end that do not have enough values left to fill a full window.
flat_map: flattens each window (itself a small dataset) into a single tensor.
shuffle: shuffles the order of the tensors created by flat_map. This avoids sequence bias, where the order of the examples affects the training results. For example, people tend to pick the first or last item they see in a list when they have no preference among the items.
map: splits each window into the features x and the target y; y is the last value in the window and x is all the values before it.
batch and prefetch: group the (x, y) pairs into batches and return a prefetched dataset that can be fed directly to a TensorFlow DNN model.
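To make the window, flatten, and split steps concrete, here is a plain NumPy stand-in that produces the same (x, y) pairs without shuffling or batching. The helper name `make_windows` is hypothetical and is not part of TensorFlow or the course code.

```python
import numpy as np

def make_windows(series, window_size):
    """Return features x of shape [n, window_size] and targets y of shape [n]."""
    series = np.asarray(series)
    n = len(series) - window_size          # only keep windows that are complete
    x = np.stack([series[i:i + window_size] for i in range(n)])
    y = series[window_size:]               # the value right after each window
    return x, y

x, y = make_windows(np.arange(10), window_size=4)
print(x[0], y[0])     # first window and its target
print(x[-1], y[-1])   # last window and its target
```

For the series 0 to 9 with a window of 4, the first pair is [0 1 2 3] with target 4 and the last is [5 6 7 8] with target 9, six pairs in total.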

2. Single-layer “DNN”

We can then create a one-layer neural network using the dataset from the windowed data function. The input is the set of x values, with length equal to the window size, and the y value is the last value in the windowed data. Below is the code to build the model.

# create a variable to hold the layer
l0 = tf.keras.layers.Dense(1, input_shape=[window_size])
# add the layer to the model
model = tf.keras.models.Sequential([l0])
# Use mean squared error as the loss function and stochastic gradient
# descent to find the parameters, with a learning rate of 1e-6 and a
# momentum of 0.9 (an exponentially weighted average of the gradients,
# roughly averaging the last ten iterations as a rule of thumb).
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6, momentum=0.9))
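The momentum term can be sketched in a few lines of plain Python. This assumes the update rule given in the Keras SGD documentation (velocity = momentum * velocity - learning_rate * gradient, then w = w + velocity); the quadratic objective and the learning rate of 0.1 are made up for the demonstration and are unrelated to the model above.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate, momentum):
    """One SGD-with-momentum update; returns the new weight and velocity."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

def grad_fn(w):
    """Gradient of the toy objective (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)

w, velocity = 0.0, 0.0
w, velocity = sgd_momentum_step(w, velocity, grad_fn(w),
                                learning_rate=0.1, momentum=0.9)
first_step_w = w

for _ in range(300):
    w, velocity = sgd_momentum_step(w, velocity, grad_fn(w),
                                    learning_rate=0.1, momentum=0.9)
print(first_step_w, w)
```

With momentum 0.9, each past gradient keeps 90% of its weight at every step, so the velocity behaves like a weighted average over roughly 1 / (1 - 0.9) = 10 recent gradients.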

Next, we fit the model on the dataset returned by the windowed dataset function and use the model to make predictions.

model.fit(dataset, epochs=100, verbose=0)

# predict
# np.newaxis adds the batch dimension that the model expects
forecast = []
for t in range(len(series) - window_size):
    forecast.append(model.predict(series[t:t + window_size][np.newaxis]))
forecast = forecast[split_time - window_size:]
# get the numbers as a numpy array
results = np.array(forecast)[:, 0, 0]
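A single Dense(1) layer with no activation computes a weighted sum of the window plus a bias, so it is a linear model over the window. The sketch below fits the same kind of model with NumPy's closed-form least squares instead of SGD (a swapped-in technique, used only so the example runs without TensorFlow) on a hypothetical noise-free sine series.

```python
import numpy as np

window_size = 5
time = np.arange(200)
series = np.sin(2 * np.pi * time / 25)   # hypothetical seasonal series, no noise

# Windowed features and next-value targets.
x = np.stack([series[i:i + window_size]
              for i in range(len(series) - window_size)])
y = series[window_size:]

# Dense(1) computes x @ weights + bias; a column of ones lets lstsq fit the bias.
x_with_bias = np.hstack([x, np.ones((len(x), 1))])
params, *_ = np.linalg.lstsq(x_with_bias, y, rcond=None)
weights, bias = params[:-1], params[-1]

forecast = x @ weights + bias
mae = np.mean(np.abs(forecast - y))
print(weights.shape, mae)
```

A pure sine wave can be predicted exactly from its previous values by a linear rule, so the error here is close to zero. On the noisy synthetic series a linear model will not fit this well.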

3. Multi-layer DNN

We can further increase the complexity of the DNN model by adding more layers. Below is an example of a three-layer DNN.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, input_shape=[window_size], activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1)
])
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6, momentum=0.9))
model.fit(dataset, epochs=100)

RNN and Time Series

1. Simple RNN

In time series, where we take old data points as the input features to predict the next point, a recurrent neural network (RNN) naturally comes to mind. An RNN takes the output of the previous step and carries it into the next step. However, an RNN has a short-memory problem: the longer a piece of information stays in memory (that is, the further a data point is from the one being predicted), the less influence it has, because of the vanishing gradient.

A sequence-to-sequence RNN has a sequence as both input and output. A sequence-to-vector RNN has a sequence as input and a single vector as output.
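The recurrence can be sketched as a bare NumPy forward pass. This is a minimal illustration with random, untrained weights, assuming the usual simple RNN update h_t = tanh(x_t W_x + h_{t-1} W_h + b); the function name and the sizes are hypothetical.

```python
import numpy as np

def simple_rnn_forward(inputs, w_x, w_h, b):
    """inputs: [timesteps, features]. Returns hidden states [timesteps, units]."""
    h = np.zeros(w_h.shape[0])            # initial hidden state
    states = []
    for x_t in inputs:
        # the new state depends on the current input and the previous state
        h = np.tanh(x_t @ w_x + h @ w_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, features, units = 6, 1, 3
inputs = rng.normal(size=(timesteps, features))
w_x = rng.normal(size=(features, units))
w_h = rng.normal(size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(inputs, w_x, w_h, b)   # sequence to sequence
last_state = all_states[-1]                            # sequence to vector
print(all_states.shape, last_state.shape)
```

Keeping every hidden state corresponds to the sequence-to-sequence case, and keeping only the final one corresponds to the sequence-to-vector case.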

The input to an RNN has three dimensions: batch size, number of time steps, and series dimensionality. A Lambda layer is used to reshape the windowed input by adding the series dimensionality as a trailing axis (1 for univariate data). Setting input_shape=[None] means the model can take sequences of any length.

model = tf.keras.models.Sequential([
    tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1), input_shape=[None]),
    tf.keras.layers.SimpleRNN(20, return_sequences=True),
    tf.keras.layers.SimpleRNN(20),
    tf.keras.layers.Dense(1),
    # tanh outputs lie in [-1, 1]; scale up toward the range of the series values
    tf.keras.layers.Lambda(lambda x: x * 100.0)
])
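The effect of the two Lambda layers can be previewed with NumPy, whose `expand_dims` takes the same `axis=-1` argument as `tf.expand_dims`. The batch of 3 windows with 4 time steps is a made-up example.

```python
import numpy as np

windows = np.arange(12.0).reshape(3, 4)         # hypothetical batch: 3 windows, 4 steps
rnn_input = np.expand_dims(windows, axis=-1)    # add the series dimensionality axis
print(windows.shape, "->", rnn_input.shape)     # (3, 4) -> (3, 4, 1)

# tanh squashes values into [-1, 1]; multiplying by 100 stretches that range.
tanh_output = np.tanh(np.array([-3.0, 0.0, 3.0]))
scaled_output = tanh_output * 100.0
print(scaled_output)
```

The first step turns [batch, time steps] into [batch, time steps, 1], and the second shows why a final scaling layer lets the network reach values far outside [-1, 1].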

2. Optimizing the learning rate.

We need a scheduler that steps through many learning rates over a range, so that we can see which one works best. The Keras package has a callback to do this.

# raise the learning rate tenfold every 20 epochs, starting from 1e-8
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-8, momentum=0.9)
# the Huber loss function is less sensitive to outliers
model.compile(loss=tf.keras.losses.Huber(),
              optimizer=optimizer, metrics=['mae'])
# the scheduler only takes effect when passed to fit as a callback
history = model.fit(dataset, epochs=100, callbacks=[lr_schedule])
# Plot the tested learning rates against the mae and choose the
# learning rate that gives the lowest mae before it gets unstable.
# Retrain the model with the chosen learning rate afterward.
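Two pieces of the setup above can be checked by hand in plain Python: the learning rates the schedule visits, and how the Huber loss treats large errors. The helper names are hypothetical, and the Huber formula assumes the default threshold delta = 1.0.

```python
def swept_learning_rate(epoch, start=1e-8, epochs_per_decade=20):
    """Learning rate at a given epoch: grows tenfold every epochs_per_decade epochs."""
    return start * 10 ** (epoch / epochs_per_decade)

def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

rates = [swept_learning_rate(epoch) for epoch in range(100)]
print(rates[0], rates[20], rates[40])

# An error of 10 costs 50 under half squared error, but much less under Huber.
print(huber(0.5), huber(10.0), 0.5 * 10.0 ** 2)
```

Over 100 epochs the sweep covers roughly five orders of magnitude, from 1e-8 up to just under 1e-3. Because the Huber loss grows linearly beyond delta, a single outlier pulls on the fit far less than it would under a squared loss.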

The prediction from the simple RNN is shown below, where it has a weird plateau. This can be improved with an LSTM, which basically addresses the short-memory problem of the RNN.

3. LSTM

LSTM is a whole topic of its own, and I may write a separate story on it in detail when I get to that point. For now, we can use an existing package to build the LSTM model and compare the result with the simple RNN model.

tf.keras.backend.clear_session()
# this helps us clear existing variables before the experiment
dataset = windowed_dataset(x_train, window_size, batch_size,
                           shuffle_buffer_size)
model = tf.keras.models.Sequential([
    tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1),
                           input_shape=[None]),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32,
                                                       return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
    tf.keras.layers.Lambda(lambda x: x * 100.0)
])
model.compile(loss='mse', optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5,
                                                            momentum=0.9))
model.fit(dataset, epochs=500, verbose=0)

The LSTM significantly improves on the result of the simple RNN, but there is still room for improvement.

Will continue …

CNN

Explanations:

Momentum

Batch Size

Back Propagation