What is ARIMA?
ARIMA, which stands for AutoRegressive Integrated Moving Average, is a widely used statistical method for time series forecasting, valued for its simplicity and effectiveness. It generalizes the simpler AutoRegressive Moving Average (ARMA) model by adding the notion of integration. This makes ARIMA suitable for non-stationary time series (for example, a time series with a trend is non-stationary; for more information, see What is stationary time series?).
As the full name suggests, ARIMA consists of three major components, which are reflected in 3 parameters ($p, d, q$) of the model:
- Auto-Regression (AR), which leverages the relationship between an observation and a number of lagged observations. The parameter $p$ denotes the number of lagged observations (the lag order).
- Integrated (I), which uses differencing of the raw observations (e.g., subtracting the previous observation from the current one) in order to make the time series stationary. The parameter $d$ denotes the number of times that the raw observations are differenced.
- Moving Average (MA), which exploits the dependency between an observation and the residual errors of a moving average model applied to lagged observations. The parameter $q$ denotes the order of the moving average, i.e., the number of lagged forecast errors included in the model.
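To make the role of $d$ concrete, the sketch below differences a toy series by hand. The data and the difference helper are illustrative assumptions, not part of the dataset or of any library. Differencing once maps $y_t$ to $y_t - y_{t-1}$, and $d$ is simply how many times this step is repeated.

```python
def difference(values, d=1):
    """Apply first-order differencing d times: y[t] - y[t-1]."""
    result = list(values)
    for _ in range(d):
        # Pair each observation with its predecessor and subtract.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Hypothetical series with a steady upward trend (not from the BitBrains trace).
trend = [3, 5, 7, 9, 11, 13]

once = difference(trend, d=1)   # the trend becomes a constant level
twice = difference(trend, d=2)  # differencing again leaves only zeros

print(once)
print(twice)
```

Each round of differencing shortens the series by one observation, which is one reason to keep $d$ as small as the data allows.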
How to use it in Python for time series forecasting?
We use a virtual machine (VM) workload dataset (GWA-T-12 BitBrains) from http://gwa.ewi.tudelft.nl/datasets/ to forecast the CPU usage (%) of a VM. Download the dataset into a folder named "data" in the working directory.
The CPU workload in a sample VM is as follows:
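import pandas as pd

# Columns in the trace are separated by ";\t"; a multi-character separator
# is treated as a regular expression and needs the Python parsing engine.
df = pd.read_csv("./data/fastStorage/2013-8/1.csv", sep=";\t", header=0, engine="python")
series = df["CPU usage [%]"]
series.plot()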
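Next, we use an ARIMA model for rolling forecasting of the VM's CPU usage. The first 70% of the series is used for training. For each point in the remaining 30%, the model is refit on all observations seen so far, a one-step-ahead forecast is made, and the observed value is then added to the history.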
from matplotlib import pyplot
# Recent statsmodels releases removed statsmodels.tsa.arima_model;
# the ARIMA class is now imported from statsmodels.tsa.arima.model.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error

# rolling forecast
def run_rolling_forecasting(series, training_percentage):
    X = series.values
    # chronological split: first part for training, remainder for testing
    train_size = int(len(X) * training_percentage)
    train, test = X[:train_size], X[train_size:]
    history = list(train)
    predictions = []
    for t in range(len(test)):
        # refit ARIMA(p=5, d=1, q=0) on every observation seen so far
        model = ARIMA(history, order=(5, 1, 0))
        model_fit = model.fit()
        # one-step-ahead forecast
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        # reveal the true value before forecasting the next step
        obs = test[t]
        history.append(obs)
        print("%d/%d predicted=%f, expected=%f" % (t, len(test), yhat, obs))
    mae = mean_absolute_error(test, predictions)
    mse = mean_squared_error(test, predictions)
    return test, predictions, mse, mae

test, predictions, mse, mae = \
    run_rolling_forecasting(series, training_percentage=.7)
print("Test MAE: %.3f; MSE: %.3f" % (mae, mse))

# plot observed values against the rolling forecasts (red)
pyplot.plot(test)
pyplot.plot(predictions, color="red")
pyplot.show()
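For reference, the two error metrics returned by the function above can be written out by hand. The sketch below uses made-up numbers (not values from the BitBrains trace) and plain Python instead of scikit-learn; the function names are hypothetical. It is only meant to show why a few large misses, such as an unpredicted usage spike, inflate the MSE far more than the MAE.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of the absolute errors |actual - predicted|."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def mean_squared_error_manual(actual, predicted):
    """Average of the squared errors (actual - predicted)^2."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)


# Hypothetical values: three close forecasts and one missed spike.
actual = [10.0, 10.0, 10.0, 50.0]
predicted = [10.0, 11.0, 9.0, 10.0]

mae_demo = mean_absolute_error_manual(actual, predicted)
mse_demo = mean_squared_error_manual(actual, predicted)
print("MAE: %.3f; MSE: %.3f" % (mae_demo, mse_demo))
```

Because the errors are squared before averaging, the single miss of 40 dominates the MSE, while the MAE stays on the scale of the data.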
With this setup, ARIMA reaches a test MAE of 1.855 and an MSE of 74.258. As one might expect, tuning the hyper-parameters ($p, d, q$) may further improve the results.
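One simple way to tune the order is a small grid search over candidate ($p, d, q$) values. The sketch below is an assumption about how this could be done, not the procedure used to obtain the numbers above. The names grid_search_orders and toy_score are hypothetical, and the scoring function is passed in so that any evaluation can be plugged in, for example a variant of run_rolling_forecasting that takes the order as an argument and returns the MSE.

```python
import itertools


def grid_search_orders(score_order, p_values, d_values, q_values):
    """Return (best_order, best_score) over all (p, d, q) combinations.

    score_order is a caller-supplied function mapping an order tuple to an
    error value (lower is better). Orders whose evaluation raises an
    exception, e.g. because the model fails to fit, are skipped.
    """
    best_order, best_score = None, float("inf")
    for order in itertools.product(p_values, d_values, q_values):
        try:
            score = score_order(order)
        except Exception:
            continue
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score


# Stand-in scorer so the sketch runs without data; in practice this would
# fit the model with the given order and return its test error.
def toy_score(order):
    p, d, q = order
    return (p - 2) ** 2 + (d - 1) ** 2 + q


best_order, best_score = grid_search_orders(
    toy_score, p_values=range(0, 4), d_values=range(0, 3), q_values=range(0, 3)
)
print(best_order, best_score)
```

Since every candidate order requires refitting the model at each test step, the search becomes expensive quickly; keeping the ranges small or scoring on a shorter validation window keeps it tractable.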