How to forecast time series in Python with ARIMA?

What is ARIMA?


ARIMA, which stands for AutoRegressive Integrated Moving Average, is a widely adopted statistical method for time series forecasting, valued for its simplicity and effectiveness. It generalizes the simpler AutoRegressive Moving Average (ARMA) model by adding the notion of integration. This lets ARIMA model non-stationary time series (a time series with a trend is non-stationary; for more information, see "What is stationary time series?").
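
To make non-stationarity concrete, here is a minimal sketch on synthetic data (not the dataset used later). It assumes a linear trend plus Gaussian noise; the helper `half_means` is a hypothetical name introduced only for this illustration. The mean of the trended series drifts between the first and second half, while the mean of its first difference (each observation minus the previous one) stays roughly constant.

```python
import numpy as np

# Synthetic series (assumption): linear trend with slope 0.5 plus unit-variance noise.
rng = np.random.default_rng(0)
t = np.arange(200)
trended = 0.5 * t + rng.normal(scale=1.0, size=t.size)

# First-order differencing: y[t] - y[t-1]
differenced = np.diff(trended)

def half_means(x):
    """Return the mean of the first and of the second half of x."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

raw_first, raw_second = half_means(trended)
diff_first, diff_second = half_means(differenced)

print("raw series   : mean %.2f -> %.2f" % (raw_first, raw_second))
print("differenced  : mean %.2f -> %.2f" % (diff_first, diff_second))
```

A drifting mean is only one symptom of non-stationarity, so this check is an illustration rather than a formal test.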

As the full name suggests, ARIMA consists of three major components, which are reflected in the three parameters ($p, d, q$) of the model:
  1. Auto-Regression (AR), which leverages the relationship between an observation and $p$ lagged observations.
  2. Integrated (I), which differences the raw observations (i.e., subtracts the previous observation from the current one) in order to make the time series stationary. The parameter $d$ denotes the number of times the raw observations are differenced.
  3. Moving Average (MA), which exploits the dependency between an observation and the residual errors of a moving average model applied to lagged observations. The parameter $q$ is the order of the moving average, i.e., the number of lagged error terms.
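
The three components can be combined into a single equation. The form below is the common textbook notation; the symbols $c$ (constant), $\phi_i$ (AR coefficients), $\theta_j$ (MA coefficients), $\varepsilon_t$ (error term), and $B$ (backshift operator, $B y_t = y_{t-1}$) are introduced here for illustration and do not appear elsewhere in this article.

$$
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t,
\qquad y'_t = (1 - B)^d \, y_t
$$

Here $y'_t$ is the series after differencing $d$ times. For example, with $d = 1$ we get $y'_t = y_t - y_{t-1}$.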


How to use it in Python for time series forecasting?

 

We use a VM workload dataset (GWA-T-12 BitBrains) from http://gwa.ewi.tudelft.nl/datasets/ to forecast the CPU usage (%) of a VM. Download the dataset into a "data" folder.

The CPU workload of a sample VM looks as follows:


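import pandas as pd

# The file uses ";\t" as separator; a multi-character separator is parsed
# as a regex, which requires the Python parsing engine.
df = pd.read_csv("./data/fastStorage/2013-8/1.csv", sep=";\t", header=0, engine="python")
series = df["CPU usage [%]"]
series.plot()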
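
If the file is not at hand, the effect of the `";\t"` separator can be checked on a tiny in-memory sample. This is a sketch only: the three rows and the reduced set of columns below are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Made-up sample (assumption) that mimics the ";<TAB>" separated layout.
sample = (
    "Timestamp [ms];\tCPU cores;\tCPU usage [%]\n"
    "1376314846;\t4;\t3.14\n"
    "1376315146;\t4;\t2.50\n"
)

# A separator longer than one character is treated as a regular expression,
# so the Python engine is requested explicitly.
sample_df = pd.read_csv(io.StringIO(sample), sep=";\t", header=0, engine="python")
sample_series = sample_df["CPU usage [%]"]
print(sample_series)
```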




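Next, we use an ARIMA model for rolling forecasting of the VM's CPU usage, with the first 70% of the dataset used for initial training. In a rolling forecast, the model is refit on all observations seen so far and predicts one step ahead; the true value is then added to the history before the next step.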

from matplotlib import pyplot
# Recent statsmodels releases removed statsmodels.tsa.arima_model;
# statsmodels.tsa.arima.model is its replacement.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error

# rolling forecast: refit on the history, then predict one step ahead
def run_rolling_forecasting(series, training_percentage):
    X = series.values
    train_size = int(len(X) * training_percentage)
    train, test = X[:train_size], X[train_size:]
    history = list(train)
    predictions = []
    for t in range(len(test)):
        model = ARIMA(history, order=(5, 1, 0))  # p=5, d=1, q=0
        model_fit = model.fit()
        yhat = model_fit.forecast()[0]  # one-step-ahead forecast
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)  # reveal the true value before the next step
        print("%d/%d predicted=%f, expected=%f" % (t, len(test), yhat, obs))
    mae = mean_absolute_error(test, predictions)
    mse = mean_squared_error(test, predictions)

    return test, predictions, mse, mae

test, predictions, mse, mae = \
    run_rolling_forecasting(series, training_percentage=.7)
print("Test MAE: %.3f; MSE: %.3f" % (mae, mse))
# plot
pyplot.plot(test)
pyplot.plot(predictions, color="red")
pyplot.show()



In our run, ARIMA achieves a test MAE of 1.855 and an MSE of 74.258. As one might expect, tuning the hyper-parameters ($p, d, q$) may improve the results further.