Preparing a windowed dataset for time series forecasting



A windowed dataset is required by many forecasting methods, especially machine learning-based approaches. As an example, given a time series such as [1, 3, 5, 7, 9] and a window size of 2, the corresponding windowed dataset could look as follows. It can be used for training a machine learning model, where each element in X is the history within a window and each element in y is the corresponding label/target, i.e., the next value to be predicted.

X = [[1,3], [3,5], [5,7]]
y = [[5], [7], [9]]
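
To make the mapping concrete, here is a minimal sketch (not part of the original example; the variable names are arbitrary) that builds the same X and y with plain list slicing:

series = [1, 3, 5, 7, 9]
window_size = 2
# Each window holds window_size consecutive values; the label is the value right after it
X = [series[i:i + window_size] for i in range(len(series) - window_size)]
y = [series[i + window_size:i + window_size + 1] for i in range(len(series) - window_size)]
print(X)  # [[1, 3], [3, 5], [5, 7]]
print(y)  # [[5], [7], [9]]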

Importing packages


import tensorflow as tf
import numpy as np

print(tf.__version__)
print(np.__version__)

2.4.1
1.19.2  


For simplicity, we create a sequence of numbers as our example time series.


ts = np.arange(1, 100, 2)

array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99])


Implementation

The window_dataset() function below produces the windowed dataset for a given time series (ts) and window size (window_size). We walk through the details of the code in the sections that follow.


def window_dataset(ts, window_size=2):
    """ Process the time series into a windowed dataset

    :parameter ts: time series data
    :parameter window_size: number of past values in each window

    Return data, targets where data and targets are both lists:
        each element in data is the history within a window,
        each element in targets is the next value
    """
    data = list()
    targets = list()
    
    dataset = tf.data.Dataset.from_tensor_slices(ts)
    dataset = dataset.window(
        window_size+1, 
        shift=1, 
        drop_remainder=True
    )
    dataset = dataset.flat_map(lambda w: w.batch(window_size+1))
    dataset = dataset.map(lambda w: (w[:-1], w[-1:]))
    
    for (x, y) in dataset.as_numpy_iterator():
        data.append(x)
        targets.append(y)
        
    return data, targets


data, targets = window_dataset(ts)

for x,y in zip(data, targets):
    print(x,y)

[1 3] [5]
[3 5] [7]
[5 7] [9]
[7 9] [11]
[ 9 11] [13]
[11 13] [15]
[13 15] [17]
[15 17] [19]
[17 19] [21]
[19 21] [23]
[21 23] [25]
[23 25] [27]
[25 27] [29]
[27 29] [31]
[29 31] [33]
[31 33] [35]
[33 35] [37]
[35 37] [39]
[37 39] [41]
[39 41] [43]
[41 43] [45]
[43 45] [47]
[45 47] [49]
[47 49] [51]
[49 51] [53]
[51 53] [55]
[53 55] [57]
[55 57] [59]
[57 59] [61]
[59 61] [63]
[61 63] [65]
[63 65] [67]
[65 67] [69]
[67 69] [71]
[69 71] [73]
[71 73] [75]
[73 75] [77]
[75 77] [79]
[77 79] [81]
[79 81] [83]
[81 83] [85]
[83 85] [87]
[85 87] [89]
[87 89] [91]
[89 91] [93]
[91 93] [95]
[93 95] [97]
[95 97] [99]

Given the windowed dataset, we can fit any forecasting model. Here, we simply fit a linear regression for illustration.

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
model.fit(data, targets)
model.predict([[97,99]])

array([[101.]])
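
As a quick follow-up (an illustration not in the original post), the fitted model can be used recursively to forecast several steps ahead by feeding each prediction back into the next window:

window = [97, 99]
forecasts = []
for _ in range(3):
    # Predict the next value from the current window, then slide the window forward
    next_value = model.predict([window])[0][0]
    forecasts.append(next_value)
    window = [window[1], next_value]

print(forecasts)  # approximately [101.0, 103.0, 105.0]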


Details

Now, we move on to some details of the window_dataset() function. The first step is creating a TensorFlow dataset: from_tensor_slices(ts) creates a Dataset whose elements are slices of the given tensor.


dataset = tf.data.Dataset.from_tensor_slices(ts)
for d in dataset:
    print(d)
    break

tf.Tensor(1, shape=(), dtype=int64)
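
As a quick aside (an extra illustration, not from the original post), "slices" means the input is sliced along its first dimension; for a 2-D array, from_tensor_slices() yields one element per row:

example = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4]]))
for d in example:
    print(d)  # prints the rows [1 2] and [3 4] as two separate tensors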


The TensorFlow dataset provides a window() method, which returns a dataset of "windows". According to the documentation, each "window" is itself a dataset that contains a subset of elements of the input dataset. Each of these is a finite dataset of size size (the first argument, here window_size+1), or possibly fewer elements if there are not enough input elements to fill the window and drop_remainder evaluates to False. Since each window is still a dataset, we use list() together with as_numpy_iterator(), which returns an iterator that converts all elements of the dataset to numpy, to inspect its contents.


window_size = 2
dataset = dataset.window(
    window_size+1, 
    shift=1, 
    drop_remainder=True
)
for d in dataset:
    # Each d will be sub-dataset of the dataset
    print(list(d.as_numpy_iterator()))
    break

[1, 3, 5]
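
To illustrate the drop_remainder behaviour mentioned above (a small aside, not part of the original code; the variable name example is arbitrary), with drop_remainder=False the trailing windows that cannot be filled to the full size are kept:

example = tf.data.Dataset.from_tensor_slices(np.array([1, 3, 5, 7, 9]))
example = example.window(3, shift=1, drop_remainder=False)
for w in example:
    print(list(w.as_numpy_iterator()))

# [1, 3, 5]
# [3, 5, 7]
# [5, 7, 9]
# [7, 9]
# [9]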


Next, we grab all of the data in each sub-dataset and flatten the result into a single dataset of tensors.


# Maps .batch across each sub-dataset
dataset = dataset.flat_map(lambda w: w.batch(window_size+1))
for d in dataset:
    print(d)
    break

tf.Tensor([1 3 5], shape=(3,), dtype=int64)
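
To see what the batch() inside flat_map() is doing (a small aside, not from the original code), batching a window packs its scalar elements into a single tensor, and flat_map() then concatenates those tensors into one flat dataset:

example = tf.data.Dataset.from_tensor_slices(np.array([1, 3, 5]))
for t in example.batch(3):
    print(t)  # a single tensor [1 3 5] instead of three scalar elements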


Finally, we split each element of the dataset into the data part (the window of history) and the target part (the next value); each (x, y) pair becomes a training example for a forecasting model.


dataset = dataset.map(lambda w: (w[:-1], w[-1:]))
for (x, y) in dataset.as_numpy_iterator():
    print(x, y)
    break

[1 3] [5]
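
As a side note (a sketch under assumptions, not part of the original post; the helper name windowed_tf_dataset and the layer size are arbitrary choices), the same tf.data pipeline can be batched and fed directly to a Keras model instead of materializing Python lists first:

def windowed_tf_dataset(ts, window_size=2, batch_size=8):
    ds = tf.data.Dataset.from_tensor_slices(tf.cast(ts, tf.float32))
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    ds = ds.map(lambda w: (w[:-1], w[-1:]))
    return ds.shuffle(100).batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

train_ds = windowed_tf_dataset(ts)
# A single linear layer, which mirrors the linear regression used above
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(2,))])
model.compile(loss="mse", optimizer="adam")
model.fit(train_ds, epochs=100, verbose=0)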

That's it for preparing a windowed dataset for time series forecasting. Although we used TensorFlow to implement the preprocessing step, you can implement the same functionality in other ways as long as you derive the same output.
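
For instance, here is a minimal NumPy-only sketch (an assumption for illustration, not from the original post) that produces the same windows and targets:

def window_dataset_np(ts, window_size=2):
    n = len(ts) - window_size
    # Build an (n, window_size) index matrix: row i selects ts[i : i + window_size]
    idx = np.arange(window_size)[None, :] + np.arange(n)[:, None]
    data = ts[idx]                       # shape (n, window_size)
    targets = ts[window_size:][:, None]  # shape (n, 1), the next value after each window
    return data, targets

data_np, targets_np = window_dataset_np(ts)
print(data_np[0], targets_np[0])    # [1 3] [5]
print(data_np[-1], targets_np[-1])  # [95 97] [99]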
