Original link: tecdat.cn/?p=21773
Original source: Tuoduan Data Tribe Official Account
Building EWMA and ARIMA models to forecast time series in Python and R
- Learn the steps to create a time series forecast
- Focus on the Dickey-Fuller test and the ARIMA (autoregressive integrated moving average) model
- Learn these concepts in theory and their implementation in Python
Time series (referred to as TS from now on) is considered one of the lesser-known skills in the field of data science.
Creating a time series forecast in Python
We use the following steps:
- What is a time series?
- Load and process time series
- How to test the stationarity of a time series?
- How to make the time series stationary?
- Forecast time series
1. What is a time series?
As the name suggests, TS is a collection of data points collected at regular intervals.
These data are analyzed to determine long-term trends, to forecast the future, or to perform some other form of analysis.
But what makes TS different from conventional regression problems?
- It depends on time. Therefore, the basic assumption of the linear regression model, that the observations are independent, does not hold in this case.
- Along with an increasing or decreasing trend, most TS have some form of seasonality, that is, variations specific to a particular time frame. For example, if you look at the sales of woolen jackets over time, you will invariably find higher sales in the winter months.
Due to its inherent characteristics, its analysis involves various steps. These issues will be discussed in detail below. Let's start by loading a TS object in Python. We use air passenger data.
2. Loading and processing a time series
Pandas has dedicated support for handling TS objects, in particular the datetime64[ns] dtype, which stores time information and lets us perform some operations very quickly. Let's start by importing the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')
Let us understand these arguments one by one:
- parse_dates: specifies the columns that contain date/time information. As mentioned above, the column name here is 'Month'.
- index_col: a key idea behind using Pandas for TS data is that the index should be the variable that describes the date/time information, so this argument tells pandas to use the 'Month' column as the index.
- date_parser: specifies a function that converts an input string into a datetime object. By default, pandas reads data in the "YYYY-MM-DD HH:MM:SS" format; if the data is not in this format, the format must be defined manually, and a custom parsing function (often called dateparse) can be passed for this purpose.
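To make the date_parser idea concrete, here is a minimal sketch. The "YYYY-MM" layout of the 'Month' column is an assumption about the file, and dateparse is a hypothetical helper name; the small in-memory frame stands in for the CSV so the example is self-contained:

```python
import pandas as pd

# Hypothetical parser for "YYYY-MM" strings (the file's actual format is an assumption)
dateparse = lambda x: pd.to_datetime(x, format='%Y-%m')

# Small stand-in for the first rows of AirPassengers.csv
data = pd.DataFrame(
    {'Month': ['1949-01', '1949-02', '1949-03'],
     '#Passengers': [112, 118, 132]}
)
data['Month'] = data['Month'].apply(dateparse)
data = data.set_index('Month')
print(data.index.dtype)  # datetime64[ns]
```

Once the index is a datetime64[ns] index, the date-based slicing shown below works directly.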
Now we can see that the data is indexed by the time object and #Passengers is the column. We can use the following command to cross-check the data type of the index:
Note that dtype='datetime64[ns]' confirms it is a datetime index. As a personal preference, I convert the column to a Series object so that I don't have to quote the column name every time I use the TS.
ts = data['#Passengers']
ts.head(10)
Before going further, I will discuss some indexing techniques for TS data. Let's start by selecting a specific value in the Series object. This can be achieved in two ways:
#1. Specify the index as a string constant:
ts['1949-01-01']

#2. Import the datetime library and use the datetime function:
from datetime import datetime
ts[datetime(1949, 1, 1)]
Both will return the value "112", which can also be confirmed from the previous output. Suppose we want all the data before May 1949. This can be achieved in two ways:
#1. Specify the entire range:
ts['1949-01-01':'1949-05-01']

#2. Use ':' if one of the endpoints is open:
ts[:'1949-05-01']
Both will produce the following output:
There are two points to note here:
- Unlike numeric indexing, the end index is included here. For example, slicing a list with [:5] returns the values at indices [0, 1, 2, 3, 4], but here the index '1949-05-01' is included in the output.
- The index must be sorted for the range to work.
Consider another example that requires all the values in 1949. This can be achieved in the following ways:
ts['1949']
The month part is omitted. Similarly, to select all dates in a given month, the day part can be omitted.
Now, let us continue to analyze TS.
3. How to test the stationarity of a time series?
A TS is said to be stationary if its statistical properties, such as mean and variance, remain constant over time. But why does it matter? Most TS models assume the TS is stationary. Intuitively, if a TS has behaved in a particular way over some period of time, it is very likely to behave similarly in the future. Also, the theory for stationary series is more mature and easier to implement than that for non-stationary series.
Stationarity is defined using very strict criteria. However, for practical purposes, we can assume that the sequence is stationary if it has constant statistical properties over time, namely:
- Constant average
- Constant variance
- Time-independent autocovariance.
I will skip the rigorous details here. Let's move on to ways of testing stationarity. The first and most important step is to simply plot the data and inspect it visually. The data can be plotted with the following command:
plt.plot(ts)
Obviously, with some seasonal changes, the overall data shows an upward trend. However, it may not always be possible to make such visual intuitions (we will see this later). So, more formally, we can check stationarity using the following method:
- Plot rolling statistics: we can plot the moving average or moving variance and see whether it changes over time. By moving average/variance I mean that, at any instant 't', we take the average/variance of the last year, i.e. the last 12 months. This is still mostly a visual technique.
- Dickey-Fuller test: one of the statistical tests for stationarity. The null hypothesis is that the TS is non-stationary. The test result consists of a test statistic and critical values at several confidence levels. If the test statistic is less than the critical value, we can reject the null hypothesis and say that the series is stationary.
At this point, these concepts may not sound very intuitive.
Returning to the stationarity check: since the rolling-statistics plots plus the Dickey-Fuller results involve a lot of output, I have defined a function that takes a TS as input and generates both for us. Note that I plot the standard deviation instead of the variance so that the units stay comparable to the mean.
from statsmodels.tsa.stattools import adfuller

def test_stat(timeseries):
    # Determine rolling statistics over the last 12 months
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()
    # Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='original series')
    mean = plt.plot(rolmean, color='red', label='rolling average')
    std = plt.plot(rolstd, color='black', label='rolling standard deviation')
    plt.legend(loc='best')
    plt.title('Rolling average and standard deviation')
    plt.show(block=False)
    # Perform the Dickey-Fuller test:
    dftest = adfuller(timeseries, maxlag=13, regression='ctt', autolag='BIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test statistic', 'p-value',
                                             '#Lags used', 'Number of observations used'])
    for key, value in dftest[4].items():
        dfoutput['Critical value (%s)' % key] = value
    print(dfoutput)
Let's run it for the input sequence:
test_stat(ts)
Although the variation in the standard deviation is small, the mean clearly increases over time, so this is not a stationary series. Also, the test statistic is much larger than the critical values. (Note that we compare signed values, not absolute values.)
In the next step, we will discuss techniques that can be used to make this TS stationary.
4. How to make a time series stationary?
Although many TS models adopt stationarity assumptions, almost none of the actual time series are stationary. Statisticians have found a way to make the sequence stationary, and we will discuss it now. In fact, it is almost impossible to make a sequence completely stationary, but we have to get as close as possible to it.
Let us understand what makes TS unstable. There are two main reasons for the non-stationarity of TS:
- Trend: the mean varies over time. For example, in this case we saw that, on average, the number of passengers grows over time.
- Seasonality: variations within specific time frames. For example, people may tend to buy more in particular months because of salary increments or festivals.
The underlying principle is to model or estimate the trend and seasonality in the series and remove them to obtain a stationary series. Statistical forecasting techniques can then be applied to this series. The final step is to convert the forecast values back to the original scale by applying the trend and seasonality components back.
Note: I will discuss some methods. In this case, some may work well, and some may not. But our idea is to master all the methods, not just focus on the problem at hand.
Let's start with the trend part.
Estimate and eliminate trends
One of the first tricks to reduce trend is transformation. For example, in this case we can clearly see a significant positive trend, so we can apply a transformation that penalizes higher values more than smaller ones, such as a log, square root, or cube root. Let's take the simple log transform here:
ts_log = np.log(ts)
In this simple case, it is easy to see a forward trend in the data, but in the presence of noise it is not very intuitive. So we can use techniques to estimate or model this trend and then remove it from the series. There are many ways to do this; the most common are:
- Aggregation-take the average over a period of time, such as monthly/weekly averages
- Smooth-take rolling average
- Polynomial fitting-fitting regression model
I will discuss smoothing here; you should try the other techniques as well, which may work better for other problems. Smoothing refers to taking rolling estimates, i.e. considering a few past instances. There are several methods, but I will discuss two of them here.
Moving (rolling) average
In this approach, we take the average of 'k' consecutive values, depending on the frequency of the time series. Here we can take the average over the last year, i.e. the last 12 values. Pandas has built-in functions for computing rolling statistics.
moving_avg = ts_log.rolling(12).mean()
plt.plot(ts_log)
plt.plot(moving_avg, color='red')
The red line represents the rolling average. We subtract this from the original sequence. Note that because we take the average of the last 12 values, the rolling average does not define the first 11 values. This can be observed:
ts_log_moving_avg_diff = ts_log - moving_avg
Note that the first 11 values are NaN. Let's drop these NaN values and check the plots to verify stationarity.
ts_log_moving_avg_diff.dropna(inplace=True)
This looks like a better series. The rolling value seems to change slightly, but there is no specific trend. Moreover, the test statistic is less than the critical value of 5%, so we can say that this is a stationary series with 95% confidence.
Exponentially weighted moving average (EWMA)
However, a drawback of this approach is that the time period has to be strictly defined. Here we could take yearly averages, but for complex situations like forecasting a stock price it is difficult to come up with a number. So we take a "weighted moving average" in which more recent values are given higher weights. There are many techniques for assigning weights; a popular one is the exponentially weighted moving average, where weights are assigned to all previous values with a decay factor. It can be implemented as:
expwighted_avg = ts_log.ewm(halflife=12).mean()
Note that the parameter 'halflife' is used here to define the amount of exponential decay. This is just an assumption and depends largely on the business domain. Other parameters, such as span and center of mass (com), can also be used to define the decay. Now, let's remove this from the series and check stationarity:
ts_log_ewma_diff = ts_log - expwighted_avg
The change in the mean and standard deviation of this TS is even smaller. In addition, the test statistic is less than the critical value of 1%, which is better than the previous case. Note that in this case, there will be no missing values, because all values from the beginning are given weights. So it can work even if there is no previous value.
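The decay parameterizations mentioned above (halflife, span, center of mass) can be compared in a small sketch; the series and parameter values are illustrative, not taken from the passenger data:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(24, dtype=float))

# Each keyword is just a different parameterization of the smoothing factor alpha.
ew_half = s.ewm(halflife=12).mean()  # weight halves every 12 observations
ew_span = s.ewm(span=12).mean()      # alpha = 2 / (span + 1)
ew_com = s.ewm(com=5.5).mean()       # alpha = 1 / (1 + com); com=5.5 matches span=12
```

Since span=12 and com=5.5 imply the same alpha, those two averages coincide, while halflife=12 gives a slower decay and hence a different curve.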
Eliminate trends and seasonality
The simple trend reduction techniques discussed before are not suitable for all situations, especially those with high seasonality. Let us discuss two ways to eliminate trends and seasonality:
- Differencing: taking the difference with a particular time lag
- Decomposition: modeling both trend and seasonality and removing them from the series
One of the most common methods of dealing with both trend and seasonality is differencing. In this technique, we take the difference between the observation at a particular instant and the observation at the previous instant. This mostly works well to improve stationarity. First-order differencing can be done as follows:
ts_log_diff = ts_log - ts_log.shift()
This seems to greatly reduce this trend. Let's use the drawing to verify:
We can see that the mean and standard deviation vary very little over time. Also, the Dickey-Fuller test statistic is less than the 10% critical value, so the TS is stationary with 90% confidence. We can also take second- or third-order differences, which may give better results in some applications.
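Higher-order differencing can be sketched with Series.diff on a toy series; a deterministic quadratic trend needs two differences to become constant:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float) ** 2)  # quadratic trend: 0, 1, 4, 9, ...

d1 = s.diff()         # first-order difference: still trending (1, 3, 5, ...)
d2 = s.diff().diff()  # second-order difference: constant (2)
```

In practice, each extra difference removes one polynomial degree of trend, at the cost of one more observation lost to NaN at the start.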
Decomposition
In this approach, both trend and seasonality are modeled separately and the remaining part of the series is returned. I will skip the statistical details and go straight to the results:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
Here we can see the trend, the seasonality is separated from the data, and we can model the residuals. Let's check the stationarity of the residuals:
ts_log_decompose = residual
ts_log_decompose.dropna(inplace=True)
The Dickey-Fuller test statistic is significantly lower than the 1% critical value, so this TS is very close to stationary. You should also note that, in this case, converting the residuals back into original values for future data is not very intuitive.
5. Forecasting the time series
We have seen different techniques, all of which work fairly well for making a TS stationary. Let's build a model on the differenced TS, since differencing is a very popular technique; it is also relatively easy to add noise and seasonality back into the predicted residuals in this case. Having performed trend and seasonality estimation, two situations can arise:
- A strictly stationary series with no dependencies between values. This is a simple example, we can model the residual as white noise. But this is very rare.
- A series with significant dependence among values. In this case, we need to use a statistical model like ARIMA to forecast the data.
Let me give a brief introduction to ARIMA. I will not go into the technical details, but you should understand these concepts in detail if you wish to apply them more effectively. ARIMA stands for AutoRegressive Integrated Moving Average. An ARIMA forecast for a stationary time series is nothing but a linear equation (much like linear regression). The predictors depend on the (p, d, q) parameters of the ARIMA model:
- Number of AR (autoregressive) terms (p): The AR term is the lag of the dependent variable. For example, if p is 5, the predictor of x(t) is x(t-1)...x(t-5).
- Number of MA (moving average) terms (q): MA terms are lagged forecast errors in the prediction equation. For example, if q is 5, the predictors of x(t) will be e(t-1)...e(t-5), where e(i) is the difference between the moving-average value at the i-th instant and the actual value.
- Number of differences (d): the number of non-seasonal differences; in this case we took a first-order difference. So we can either pass the differenced variable with d=0 or pass the original variable with d=1. Both produce the same result.
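The "linear equation" view of the predictors above can be sketched by simulating an AR(2) process; the coefficients below are illustrative, not estimated from the passenger data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(2) process: x(t) = 0.5*x(t-1) + 0.3*x(t-2) + e(t)
phi1, phi2 = 0.5, 0.3
e = rng.normal(scale=0.1, size=200)
x = np.zeros(200)
for t in range(2, 200):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + e[t]

# With p=2, the one-step forecast is just a linear combination of the
# last two lags -- the linear-regression-like equation described above.
forecast = phi1 * x[-2] + phi2 * x[-3]
```

The gap between the realized value x[-1] and this forecast is exactly the error term e[-1], which is what the MA terms would model.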
An important issue here is how to determine the values of "p" and "q". Let's discuss it first.
- Autocorrelation function (ACF): a measure of the correlation between the TS and a lagged version of itself. For instance, at lag 5, ACF compares the series at instants 't1'...'t2' with the series at 't1-5'...'t2-5' ('t1-5' and 't2-5' being the end points).
- Partial autocorrelation function (PACF): measures the correlation between the TS and its lag, but after removing the variations already explained by the intervening lags. For example, at lag 5, it checks the correlation but removes the effects already explained by lags 1 to 4.
The ACF and PACF diagrams of TS after the difference can be drawn as:
from statsmodels.tsa.stattools import acf, pacf
ts_log_diff.dropna(inplace=True)
lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')
# Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
plt.title('Autocorrelation Function')
# Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
In this figure, the two dashed lines on either side of 0 are the confidence intervals. These can be used to determine the "p" and "q" values as follows:
- p: the lag value at which the PACF chart crosses the upper confidence interval for the first time. If you look closely, in this case p=2.
- q: the lag value at which the ACF chart crosses the upper confidence interval for the first time. If you look closely, in this case q=2.
Now, let's build 3 different ARIMA models, considering the individual as well as the combined effects. I will also print the RSS for each. Please note that the RSS here is for the residual values, not the actual series.
We need to load the ARIMA model first:
from statsmodels.tsa.arima_model import ARIMA
(On newer versions of statsmodels, this class lives in statsmodels.tsa.arima.model instead, and fit() takes no disp argument.)
You can use the order parameter of ARIMA to specify the values of p, d, and q. This parameter uses a tuple (p, d, q). Let us simulate 3 cases:
# AR model
model = ARIMA(ts_log, order=(2, 1, 0))
results_AR = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((results_AR.fittedvalues - ts_log_diff) ** 2))
# MA model
model = ARIMA(ts_log, order=(0, 1, 2))
results_MA = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((results_MA.fittedvalues - ts_log_diff) ** 2))
# Combined ARIMA model
model = ARIMA(ts_log, order=(2, 1, 2))
results_ARIMA = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((results_ARIMA.fittedvalues - ts_log_diff) ** 2))
Here, we can see that the AR and MA models have almost the same RSS, but the combination will be better. Now, we are left with the last step, which is to restore these values to their original proportions.
Back to the original ratio
Since the combined model gives the best results, let's scale it back to the original value and see how it performs as predicted. The first step is to store the prediction result as a separate sequence and observe it.
Note that these start from '1949-02-01' and not from the first month. Why? Because we took a lag of 1, and the first element has nothing before it to subtract. The way to convert the differencing back to the log scale is to add these differences consecutively to the base number. An easy way to do this is to first determine the cumulative sum at each index and then add it to the base number. The cumulative sum is as follows:
You can quickly check these against the previous output. Next we have to add them to the base number. For this, let's create a series with all values equal to the base number and add the differences to it. This can be done as follows:
The first element here is the base itself, and then the added value is accumulated from there. The last step is to take the exponent and compare it with the original series.
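The whole back-transform can be sketched on a toy log-series; here we pretend the model reproduced the first differences exactly, so the reconstruction should recover the original values:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('1949-01-01', periods=6, freq='MS')
ts_log = pd.Series(np.log([112.0, 118.0, 132.0, 129.0, 121.0, 135.0]), index=idx)

# Stand-in for the model's predicted differences (assumed perfect here)
pred_diff = ts_log - ts_log.shift()

# 1. cumulative sum of the differences
pred_diff_cumsum = pred_diff.cumsum()
# 2. add the cumulative sum onto a base series holding the first log value
pred_log = pd.Series(ts_log.iloc[0], index=idx).add(pred_diff_cumsum, fill_value=0)
# 3. exponentiate to return to the original scale
pred = np.exp(pred_log)
```

The fill_value=0 handles the leading NaN from the shift, so the first element of the reconstruction is the base itself.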
plt.plot(ts)
plt.plot(predictions_ARIMA)
That's all for the Python part. Now let's learn how to implement time series forecasting in R.
R time series forecasting
Step 1: Read the data and calculate the basic summary
# Install the packages and load the libraries
install.packages("tseries")
library(tseries)
library(summarytools)  # for dfSummary
# Read the AirPassengers data
tsdata <- AirPassengers
# Identify the data class
class(tsdata)
# Observe the time series data
tsdata
# Data summary
dfSummary(tsdata)
> class(tsdata)
[1] "ts"
> tsdata
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
> dfSummary(tsdata)
Dimensions: 144 x 1; Duplicates: 26; 118 distinct values; Valid: 144 (100%); Missing: 0 (0%)
Mean (sd): 280.3 (120); min < med < max: 104 < 265.5 < 622; IQR (CV): 180.5 (0.4)
Start: 1949-01; End: 1960-12
Step 2: Check the period of the time series data and plot the original data
# Check the data and plot the raw data
plot(tsdata, ylab = "Passengers (1000s)", type = "o")
> cycle(tsdata)
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949   1   2   3   4   5   6   7   8   9  10  11  12
1950   1   2   3   4   5   6   7   8   9  10  11  12
1951   1   2   3   4   5   6   7   8   9  10  11  12
1952   1   2   3   4   5   6   7   8   9  10  11  12
1953   1   2   3   4   5   6   7   8   9  10  11  12
1954   1   2   3   4   5   6   7   8   9  10  11  12
1955   1   2   3   4   5   6   7   8   9  10  11  12
1956   1   2   3   4   5   6   7   8   9  10  11  12
1957   1   2   3   4   5   6   7   8   9  10  11  12
1958   1   2   3   4   5   6   7   8   9  10  11  12
1959   1   2   3   4   5   6   7   8   9  10  11  12
1960   1   2   3   4   5   6   7   8   9  10  11  12
Step 3: Decompose the time series data
# Decompose the data into trend, seasonal and random error components
tsdata_decom <- decompose(tsdata)
plot(tsdata_decom)
Step 4: Check the stationarity of the data
# Test the stationarity of the data
# Unit root test
adf.test(tsdata)
# Autocorrelation test (autoplot() requires the ggfortify package)
autoplot(acf(tsdata, plot = FALSE)) + labs(title = "Correlogram of air passenger data")
autoplot(acf(tsdata_decom$random[7:138], plot = FALSE))
Augmented Dickey-Fuller Test

data:  tsdata
Dickey-Fuller = -7.3186, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
The p-value is 0.01, which is less than 0.05; therefore, we reject the null hypothesis, and hence the time series is stationary.
The maximum lag occurs at 1 month or 12 months, indicating a positive relationship with the 12-month cycle.
The autoplot covers random-component observations 7 to 138, excluding the NA values.
Step 5: Fit the model
# Fit the models
# Linear model
autoplot(tsdata) + geom_smooth(method = "lm")
# ARIMA model (auto.arima is from the forecast package)
arimats <- auto.arima(tsdata)
arimats
Series: tsdata
ARIMA(2,1,1)(0,1,0)

Coefficients:
         ar1     ar2      ma1
      0.5960  0.2143  -0.9819
s.e.  0.0888  0.0880   0.0292

sigma^2 estimated as 132.3:  log likelihood = -504.92
AIC=1017.85   AICc=1018.17   BIC=1029.35
Step 6: Forecast
# Forecast with the ARIMA model
forecast(arimats, level = c(95))
At the end we have a prediction of the original data.