# Original link: tecdat.cn/?p=12272

# Original source: Tuoduan Data Tribe Official Account

*Using the ARIMA model, you can use the past values of the series to predict a time series. In this article, we built an optimal ARIMA model from scratch and extended it to Seasonal ARIMA (SARIMA) and SARIMAX models.*

## 1. Introduction to time series forecasting

A time series is a sequence of recording metrics at regular intervals.

According to the frequency, the time series can be yearly (for example: annual budget), every quarter (for example: expenditure), weekly (for example: sales quantity), every day (for example, weather), hourly (for example: stock price), minute (for example, : Incoming call in the incoming call notification), even a few seconds (for example: network traffic).

### Why forecast?

Because forecasting time series (such as demand and sales) usually have huge commercial value.

In most manufacturing companies, it drives basic business planning, purchasing, and production activities. Any error in the forecast will spread to the entire supply chain or any business environment related to it. Therefore, it is important to make accurate predictions to save costs, which is critical to success.

Not only in manufacturing, the technology and concepts behind time series forecasting can also be applied to any business.

Now, forecast time series can be roughly divided into two types.

If only the previous value of the time series is used to predict its future value, it is called **univariate time series forecasting** .

If you use predictor variables other than the series (also called exogenous variables) for forecasting, it is called **multivariate time series forecasting** .

This article focuses on a special type of forecasting method called **ARIMA** modeling.

ARIMA is a forecasting algorithm based on the idea that the information in the past values of the time series can be used alone to predict future values.

## 2. Introduction to ARIMA model

So what exactly is the ARIMA model?

ARIMA is a type of model that can "interpret" a given time series based on its own past value (ie its own lag and lagging forecast error), so it can use equations to predict future value.

Any "non-seasonal" time series that has a pattern and is not random white noise can be modeled using the ARIMA model.

The ARIMA model is characterized by 3 terms: p, d, q

p is the AR term

q is the MA item

d is the order of difference required to make the time series stationary

If the time series has a seasonal pattern, you need to add seasonal conditions, and the time series will become SARIMA (short for "seasonal ARIMA"). Once ARIMA is completed.

So, what does "the order of AR terms" really mean? Let's take a look at "d" first.

## 3. What do p, d and q in the ARIMA model mean?

The first step in building an ARIMA model is to make the time series stationary .

why?

Because the term "autoregressive" in ARIMA means that it is a linear regression model , using its own lag as a predictor. As you know, linear regression models are most effective when the predictors are uncorrelated and independent of each other.

So how to make a sequence stationary?

The most common method is to differentiate. That is, the previous value is subtracted from the current value.

Therefore, the value of d is the minimum difference order required to make the sequence stationary. If the time series is already stationary, then d = 0.

Next, what are "p" and "q"?

"P" is the order of the "autoregressive" (AR) term. It refers to the lag order of Y to be used as a predictor variable. And "q" is the order of the "moving average" (MA) term. It refers to the number of lagging prediction errors that should be entered into the ARIMA model.

## 4. What are AR and MA models

So what are AR and MA models? What are the actual mathematical formulas of AR and MA models?

**The AR model** is a **model** in which Yt depends only on its own hysteresis. In other words, Yt is a function of "Yt lag".

Similarly, a pure **moving average (only MA) model** is a **model** in which Yt depends only on the lagging prediction error.

The error term is the error of the autoregressive model of each lag. The errors Et and E(t-1) are the errors from the following equations:

Those are AR and MA models respectively.

So what is the equation of the ARIMA model?

The ARIMA model is a model in which the time series is differentiated at least once to make it stationary, and then AR and MA terms are combined. Therefore, the equation becomes:

Therefore, the goal is to identify the values of p, d, and q.

## 5. How to find the difference order in the ARIMA model (d)

The purpose of difference is to make the time series stationary.

But you need to be careful not to differentiate the sequence too much. Because the super-difference sequence may still be stationary, which in turn will affect the model parameters.

So how to determine the correct difference order?

The correct difference order is to obtain the smallest difference of an approximately stationary sequence that fluctuates around a defined average value, and the ACF curve reaches zero fairly quickly.

If the autocorrelation is positive after many orders (10 or more), then the sequence needs to be further differentiated.

In this case, you cannot really determine the difference between the two difference orders, and then choose the order that gives the smallest standard deviation in the difference sequence.

Let us look at an example.

1. I will use Augmented Dickey Fuller test() to check whether the sequence is stable.

**why?**

Because, the difference is only needed when the series is non-stationary. Otherwise, no difference is needed, that is, d=0.

The null hypothesis of the ADF test is that the time series is non-stationary. Therefore, if the p-value of the test is less than the significance level (0.05), the null hypothesis is rejected and it is concluded that the time series is indeed stationary.

Therefore, in our case, if the P value is> 0.05, we will continue to look for the order of difference.

from statsmodels.tsa.stattools import adfuller from numpy import log result = adfuller(df.value.dropna()) print('ADF Statistic: %f'% result[0]) print('p-value: %f'% result[1]) Copy code

ADF Statistic: -2.464240 p-value: 0.124419 Copy code

Since the P value is greater than the significance level, let's differentiate the series and see what the autocorrelation graph looks like.

import numpy as np, pandas as pd from statsmodels.graphics.tsaplots import plot_acf, plot_pacf import matplotlib.pyplot as plt plt.rcParams.update({'figure.figsize':(9,7),'figure.dpi':120}) # Import Data df = pd.read_csv('wwwusage.csv', names=['value'], header=0) # Raw data fig, axes = plt.subplots(3, 2, sharex=True) axes[0, 0].plot(df.value); axes[0, 0].set_title('Original Series') plot_acf(df.value, ax=axes[0, 1]) # First difference axes[1, 0].plot(df.value.diff()); axes[1, 0].set_title('1st Order Differencing') plot_acf(df.value.diff().dropna(), ax=axes[1, 1]) # 2.order difference axes[2, 0].plot(df.value.diff().diff()); axes[2, 0].set_title('2nd Order Differencing') plot_acf(df.value.diff().diff().dropna(), ax=axes[2, 1]) plt.show() Copy code

### difference

For the above sequence, the time series is stationary and has two different orders. However, when viewing the autocorrelation graph of the second difference, the lag quickly enters the negative region, which indicates that the series may have been differentiated.

Therefore, even if the series is not completely stationary (weak stationarity), I will temporarily set the order of the difference to 1.

## Adf inspection ndiffs(y, test='adf') # 2 # KPSS inspection ndiffs(y, test='kpss') # 0 # PP inspection: ndiffs(y, test='pp') # 2 Copy code

2 0 2Copy code

## 6. How to find the order of AR term (p)

The next step is to determine whether the model requires AR. You can find the required AR order by checking the partial autocorrelation (PACF) graph.

### But what is PACF?

After excluding the influence of partial lag, partial autocorrelation can be imagined as the correlation between the sequence and its lag. Therefore, PACF transmission conveys a pure correlation between lag and sequence. In this way, you will know if the hysteresis is needed in AR.

# How to find the order of AR term?

Any autocorrelation in a stationary series can be corrected by adding enough AR terms. Therefore, we initially set the order of the AR term equal to the lag order that exceeds the significance interval in the PACF chart.

# Partial autocorrelation coefficient diagram of first-order difference plt.show() Copy code

AR order

It can be observed that the PACF lagging by 1 order is very important because it is much higher than the significance line. The second order lag is also important, slightly exceeding the significance interval (blue area).

## 7. How to find the order of MA term (q)

Just as we view the order of the AR term on the PACF chart, you can also view the order of the MA term on the ACF chart. MA is technically the error of lagging prediction.

ACF indicates how many MA terms are needed to eliminate the autocorrelation in the stationary series.

Let us look at the autocorrelation graph of the difference sequence.

fig, axes = plt.subplots(1, 2, sharex=True) axes[0].plot(df.value.diff()); axes[0].set_title('1st Differencing') axes[1].set(ylim=(0,1.2)) plot_acf(df.value.diff().dropna(), ax=axes[1]) plt.show() Copy code

### MA order

Several lags are well above the limit. Therefore, let us temporarily fix q to 2.

## 8. How to deal with the time series difference value is too low or too high

How to deal with it?

If your sequence difference value is too low, usually add one or more other AR items. Similarly, if the difference value is too high, try adding other MA items.

## 9. How to build an ARIMA model

Now that the values of p, d and q have been determined, all the conditions for fitting the ARIMA model have been met.

ARIMA Model Results ================================================== ============================= Dep. Variable: D.value No. Observations: 99 Model: ARIMA(1, 1, 2) Log Likelihood -253.790 Method: css-mle SD of innovations 3.119 Date: Wed, 06 Feb 2019 AIC 517.579 Time: 23:32:56 BIC 530.555 Sample: 1 HQIC 522.829 ================================================== ================================ coef std err z P>|z| [0.025 0.975] -------------------------------------------------- ------------------------------- const 1.1202 1.290 0.868 0.387 -1.409 3.649 ar.L1.D.value 0.6351 0.257 2.469 0.015 0.131 1.139 ma.L1.D.value 0.5287 0.355 1.489 0.140 -0.167 1.224 ma.L2.D.value -0.0010 0.321 -0.003 0.998 -0.631 0.629 Roots ================================================== =========================== Real Imaginary Modulus Frequency -------------------------------------------------- --------------------------- AR.1 1.5746 +0.0000j 1.5746 0.0000 MA.1 -1.8850 +0.0000j 1.8850 0.5000 MA.2 545.3515 +0.0000j 545.3515 0.0000 -------------------------------------------------- --------------------------- Copy code

The model summary reveals a lot of information. The middle table is the coefficient table, where the value under "coef" is the weight of the corresponding item.

Please note that the coefficient of the MA2 term here is close to zero. Ideally, the value of each X should be less than 0.05.

So let's rebuild the model without MA2.

ARIMA Model Results ================================================== ============================= Dep. Variable: D.value No. Observations: 99 Model: ARIMA(1, 1, 1) Log Likelihood -253.790 Method: css-mle SD of innovations 3.119 Date: Sat, 09 Feb 2019 AIC 515.579 Time: 12:16:06 BIC 525.960 Sample: 1 HQIC 519.779 ================================================== ================================ coef std err z P>|z| [0.025 0.975] -------------------------------------------------- ------------------------------- const 1.1205 1.286 0.871 0.386 -1.400 3.641 ar.L1.D.value 0.6344 0.087 7.317 0.000 0.464 0.804 ma.L1.D.value 0.5297 0.089 5.932 0.000 0.355 0.705 Roots ================================================== =========================== Real Imaginary Modulus Frequency -------------------------------------------------- --------------------------- AR.1 1.5764 +0.0000j 1.5764 0.0000 MA.1 -1.8879 +0.0000j 1.8879 0.5000 -------------------------------------------------- --------------------------- Copy code

The AIC model has been reduced, which is good. The P values of AR1 and MA1 have increased and are very significant (<< 0.05).

Let's plot the residuals.

# Residual density

The residuals seem to be good, the mean is close to zero and the variance is uniform. Let's use plot actual values and fitted values.

**Actual vs fit**

Set up

In other words, the model is trained to the previous value for the next prediction.

Therefore, we seem to have a good ARIMA model. But is that the best?

This cannot be said at the moment, because we have not really forecasted future data, but compared the forecast with actual data.

Therefore, cross-validation is now required.

## 10. How to manually find the best ARIMA model using cross validation

In "cross-validation", future data can be predicted. Then, you compare the predicted value with the actual value.

To cross-validate, you need to create training and test data sets by dividing the time series into two consecutive parts at a ratio of approximately 75:25 or a reasonable ratio based on the time frequency of the series.

# Why not randomly sample the training data?

This is because the sequence of the time series should be intact in order to be used for forecasting.

Now you can build an ARIMA model on the training data set, predict and plot it.

# Drawing plt.figure(figsize=(12,5), dpi=100) plt.plot(train, label='training') plt.plot(test, label='actual') plt.plot(fc_series, label='forecast') plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.15) plt.title('Forecast vs Actuals') plt.legend(loc='upper left', fontsize=8) plt.show() Copy code

### Forecast and Actual

From the chart, the ARIMA(1,1,1) model seems to give a forecast in the right direction. The actual observations are within the 95% confidence interval.

But the forecast of each forecast is always lower than the actual. This means that by adding a small constant to our prediction, the accuracy will definitely increase. Therefore, there is definitely room for improvement.

So, what I want to do is increase the order of difference to 2, that is, set it,

When performing this operation, I will pay attention to the P values of AR and MA items in the model summary. They should be as close to zero as possible, and ideally should be less than 0.05.

ARIMA Model Results ================================================== ============================= Dep. Variable: D2.value No. Observations: 83 Model: ARIMA(3, 2, 1) Log Likelihood -214.248 Method: css-mle SD of innovations 3.153 Date: Sat, 09 Feb 2019 AIC 440.497 Time: 12:49:01 BIC 455.010 Sample: 2 HQIC 446.327 ================================================= ================================ coef std err z P>|z| [0.025 0.975] -------------------------------------------------- -------------------------------- const 0.0483 0.084 0.577 0.565 -0.116 0.212 ar.L1.D2.value 1.1386 0.109 10.399 0.000 0.924 1.353 ar.L2.D2.value -0.5923 0.155 -3.827 0.000 -0.896 -0.289 ar.L3.D2.value 0.3079 0.111 2.778 0.007 0.091 0.525 ma.L1.D2.value -1.0000 0.035 -28.799 0.000 -1.068 -0.932 Roots ================================================== =========================== Real Imaginary Modulus Frequency -------------------------------------------------- --------------------------- AR.1 1.1557 -0.0000j 1.1557 -0.0000 AR.2 0.3839 -1.6318j 1.6763 -0.2132 AR.3 0.3839 +1.6318j 1.6763 0.2132 MA.1 1.0000 +0.0000j 1.0000 0.0000 -------------------------------------------------- --------------------------- Copy code

### Revised forecast and actual value

AIC has been reduced from 515 to 440. The P value of the X term is less than <0.05, which is good.

So overall it is much better.

Ideally, you should return to multiple points in time, such as 1, 2, 3, and 4 quarters, and see how well the forecasts are at each point in the year.

## 11. Accuracy indicators for time series forecasting

The common accuracy indicators used to judge forecasts are:

- Mean absolute percentage error (MAPE)
- Mean Error (ME)
- Mean Absolute Error (MAE)
- Mean percentage error (MPE)
- Root Mean Square Error (RMSE)
- Lag 1 autocorrelation error (ACF1)
- Correlation between actual and forecast (corr)
- Minimum maximum error (minmax)

Generally, if you want to compare the predictions of two different sequences, you can use MAPE, Correlation and Min-Max Error.

### Why not use other indicators?

Because only the above three are percentage errors, the error varies from 0 to 1. Therefore, regardless of the size of the sequence, you can judge the quality of the prediction.

The other error measure is quantity. This means that a sequence with an average value of 1000 has an RMSE of 100, and a sequence with an average value of 10 has an RMSE of 5. Therefore, they cannot really be used to compare the forecasts of two time series with different proportions.

forecast_accuracy(fc, test.values) #> {'mape': 0.02250131357314834, #>'me': 3.230783108990054, #>'mae': 4.548322194530069, #>'mpe': 0.016421001932706705, #>'rmse': 6.373238534601827, #>'acf1': 0.5105506325288692, #>'corr': 0.9674576513924394, #>'minmax': 0.02163154777672227} Copy code

A MAPE of about 2.2% indicates that the accuracy of the model in predicting the next 15 observations is about 97.8%.

However, in the case of industrial applications, you will be provided with a lot of time series for forecasting, and forecasting will be repeated regularly.

Therefore, we need a way to automate the best model selection process.

## 12. How to make automatic Arima prediction in Python

Use a stepwise method to search multiple combinations of p, d, q parameters, and select the best model with the smallest AIC.

print(model.summary()) #> Fit ARIMA: order=(1, 2, 1); AIC=525.586, BIC=535.926, Fit time=0.060 seconds #> Fit ARIMA: order=(0, 2, 0); AIC=533.474, BIC=538.644, Fit time=0.005 seconds #> Fit ARIMA: order=(1, 2, 0); AIC=532.437, BIC=540.192, Fit time=0.035 seconds #> Fit ARIMA: order=(0, 2, 1); AIC=525.893, BIC=533.648, Fit time=0.040 seconds #> Fit ARIMA: order=(2, 2, 1); AIC=515.248, BIC=528.173, Fit time=0.105 seconds #> Fit ARIMA: order=(2, 2, 0); AIC=513.459, BIC=523.798, Fit time=0.063 seconds #> Fit ARIMA: order=(3, 2, 1); AIC=512.552, BIC=528.062, Fit time=0.272 seconds #> Fit ARIMA: order=(3, 2, 0); AIC=515.284, BIC=528.209, Fit time=0.042 seconds #> Fit ARIMA: order=(3, 2, 2); AIC=514.514, BIC=532.609, Fit time=0.234 seconds #> Total fit time: 0.865 seconds #> ARIMA Model Results #> =============================================== =============================== #> Dep. Variable: D2.y No. Observations: 98 #> Model: ARIMA(3, 2, 1) Log Likelihood -250.276 #> Method: css-mle SD of innovations 3.069 #> Date: Sat, 09 Feb 2019 AIC 512.552 #> Time: 12:57:22 BIC 528.062 #> Sample: 2 HQIC 518.825 #> #> =============================================== =============================== #> coef std err z P>|z| [0.025 0.975] #> ------------------------------------------------ ------------------------------ #> const 0.0234 0.058 0.404 0.687 -0.090 0.137 #> ar.L1.D2.y 1.1586 0.097 11.965 0.000 0.969 1.348 #> ar.L2.D2.y -0.6640 0.136 -4.890 0.000 -0.930 -0.398 #> ar.L3.D2.y 0.3453 0.096 3.588 0.001 0.157 0.534 #> ma.L1.D2.y -1.0000 0.028 -36.302 0.000 -1.054 -0.946 #> Roots #> =============================================== ============================= #> Real Imaginary Modulus Frequency #> ------------------------------------------------ ----------------------------- #> AR.1 1.1703 -0.0000j 1.1703 -0.0000 #> AR.2 0.3763 -1.5274j 1.5731 -0.2116 #> AR.3 0.3763 +1.5274j 1.5731 0.2116 #> MA.1 1.0000 +0.0000j 1.0000 0.0000 #> ------------------------------------------------ ----------------------------- Copy code

## 13. How to interpret the residual plot in the ARIMA model

Let's look at the residual plot.

# Residual plot

So how to explain?

**Top left: The** residual error seems to fluctuate around zero mean and has a uniform variance.

**Upper right: The** density plot suggests a normal distribution with a mean value of zero.

**Lower left:** All dots should be exactly the same as the red line. Any obvious deviation means that the distribution is skewed.

**Lower right: The** Correlogram (aka ACF) graph shows that the residual error is not autocorrelated. Any autocorrelation will imply that there is a pattern in the residuals that is not explained in the model. Therefore, you will need to find more X (predictor variables) for the model.

Overall, the model is suitable. Let us predict.

## 14. How to automatically build a SARIMA model in python

The problem with the ordinary ARIMA model is that it does not support seasonality.

If your time series defines seasonality, then use the seasonal difference SARIMA.

Seasonal differences are similar to regular differences, but you can subtract the value from the previous season instead of subtracting continuous terms.

Therefore, the model will be expressed as SARIMA (p, d, q) x (P, D, Q), where P, D, and Q are respectively SAR, the order of seasonal difference and the SMA term, and

If your model has a well-defined seasonal pattern, then enforce D = 1 for a given frequency "x".

Here are some practical suggestions for building a SARIMA model:

Generally, the model parameter is set to D not to exceed 1. And the total score'd + D'does not exceed 2. If the model has a seasonal component, try to keep only the SAR or SMA terms.

We build a SARIMA model on the drug sales data set.

### Seasonal difference

After applying the usual difference (lag 1), the seasonal peak is complete. In view of this, corrections should be made after seasonal differences.

Let's build a model using SARIMA. To do this, you need to set

Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=534.818, BIC=551.105, Fit time=1.742 seconds Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=624.061, BIC=630.576, Fit time=0.028 seconds Fit ARIMA: order=(1, 0, 0) seasonal_order=(1, 1, 0, 12); AIC=596.004, BIC=609.034, Fit time=0.683 seconds Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=611.475, BIC=624.505, Fit time=0.709 seconds Fit ARIMA: order=(1, 0, 1) seasonal_order=(1, 1, 1, 12); AIC=557.501, BIC=577.046, Fit time=3.687 seconds (...TRUNCATED...) Fit ARIMA: order=(3, 0, 0) seasonal_order=(1, 1, 1, 12); AIC=554.570, BIC=577.372, Fit time=2.431 seconds Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=554.094, BIC=570.381, Fit time=0.220 seconds Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 2, 12); AIC=529.502, BIC=552.305, Fit time=2.120 seconds Fit ARIMA: order=(3, 0, 0) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds Total fit time: 31.613 seconds Copy code

The model estimates AIC, and the P value of the coefficient seems to be important. Let us look at the diagnostic plot of the residuals.

Best model

Let us predict the next 24 months.

SARIMA-final forecast

## 15. How to build a SARIMAX model with exogenous variables

The SARIMA model we built is very good.

But for the sake of completeness, let's try to add external predictors (also called "exogenous variables") to the model. This model is called the SARIMAX model.

The only requirement for using exogenous variables is that you also need to know the value of the variable during the forecast period.

To demonstrate, I will use the seasonal index in the classical seasonal decomposition for the data of the most recent 36 months .

Why a seasonal index? Is SARIMA already simulating seasonality?

You're right.

Also, I want to see how the model will show if we impose the most recent seasonal pattern on training and forecasting.

Secondly, this is a good variable for demonstration purposes. Therefore, you can use it as a template and insert any variables into the code. The seasonal index is a good exogenous variable because it repeats every frequency period, in this case 12 months.

Therefore, you will always know what value the seasonal index will hold for future forecasts.

Let's calculate the seasonal index so that it can be used as the (external) predictor of the SARIMAX model.

The exogenous variable (seasonal index) is ready. Let us build the SARIMAX model.

Fit ARIMA: order=(1, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=536.818, BIC=556.362, Fit time=2.083 seconds Fit ARIMA: order=(0, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=626.061, BIC=635.834, Fit time=0.033 seconds Fit ARIMA: order=(1, 0, 0) seasonal_order=(1, 1, 0, 12); AIC=598.004, BIC=614.292, Fit time=0.682 seconds Fit ARIMA: order=(0, 0, 1) seasonal_order=(0, 1, 1, 12); AIC=613.475, BIC=629.762, Fit time=0.510 seconds Fit ARIMA: order=(1, 0, 1) seasonal_order=(1, 1, 1, 12); AIC=559.530, BIC=582.332, Fit time=3.129 seconds (...Truncated...) Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 0, 12); AIC=556.094, BIC=575.639, Fit time=0.260 seconds Fit ARIMA: order=(3, 0, 0) seasonal_order=(0, 1, 2, 12); AIC=531.502, BIC=557.562, Fit time=2.375 seconds Fit ARIMA: order=(3, 0, 0) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds Total fit time: 30.781 seconds Copy code

Therefore, we have a model with exogenous terms. But the coefficient is for

We have effectively imposed the latest seasonal effects of the last 3 years on the model.

Let us predict the next 24 months. For this, you need the seasonal index value for the next 24 months.

SARIMAX forecast

Most popular insights

1. Use lstm and pytorch for time series forecasting in python

2. Use the long and short-term memory model lstm for time series prediction analysis in python

3. Use R language for time series (arima, exponential smoothing) analysis

4. Multivariate copula-garch-model time series prediction in r language

5. R language copulas and financial time series case

6. Use R language random fluctuation model sv to deal with random fluctuations in time series

7. R language time series tar threshold autoregressive model

8. R language k-shape time series clustering method to cluster stock price time series