Methods of Forecasting Based on Artificial Neural Networks

Abstract – This article presents an overview of artificial neural network (ANN) applications in forecasting and of possible improvements in forecasting accuracy. Artificial neural networks are computational models and universal approximators that can be applied to time series forecasting with high accuracy. Research activity in using artificial neural networks for forecasting has risen considerably. This paper examines multi-layer perceptrons (MLPs), namely the back-propagation neural network (BPNN), the Elman recurrent neural network (ERNN) and the grey relational artificial neural network (GRANN), as well as hybrid systems, i.e., models that fuse artificial neural networks with wavelets and with the auto-regressive integrated moving average (ARIMA) model.


I. INTRODUCTION
Artificial neural networks (ANNs) are a form of artificial intelligence that attempts to mimic the function of real neurons found in the human brain [2]. ANNs are among the most accurate and widely used forecasting models, with applications to social, economic, business, engineering, foreign exchange, stock market and other problems. The structure of artificial neural networks makes them well suited to forecasting tasks that require good accuracy.
As opposed to traditional model-based empirical and statistical methods, such as regression and Box-Jenkins approaches, which need prior knowledge about the nature of the relationships in the data, artificial neural networks are self-adaptive methods that learn from data and need only a few a priori assumptions about the data [1].
Neural networks learn from examples and can find functional relationships in the data even if these relationships are unknown or their physical meaning is difficult to explain [2]. Therefore, ANNs are well suited to problems whose solutions require knowledge that is difficult to specify but for which there are enough data or observations. Artificial neural networks can generalize [8]. After learning from the input data (a sample or pattern), ANNs can often correctly process a previously unseen sample even if the sample data are noisy. Neural networks are less sensitive to error term assumptions, and they tolerate noise and chaotic components better than most other methods. Artificial neural networks are also universal function approximators: it has been proved that a neural network can approximate any continuous function to any desired accuracy [1].

II. TYPICAL STRUCTURE OF ANN
An ANN structure includes input data and artificial neurons, which are also known as "processing elements", "nodes" or "units" [10]. The multilayer perceptron includes an input layer, an output layer and one or more intermediate layers called hidden layers. The size and nature of the data set affect the number of hidden layers and the number of neurons within each layer. In forecasting applications, ANNs with one or two hidden layers often perform better than neural networks with a large number of hidden layers.

A. The Propagation of Information in MLPs
The propagation of information in MLPs starts when the input data are taken into the input layer. The inputs are weighted and passed to each node in the next layer. Each processing element in a specific layer is fully or partially connected to many other processing elements by weighted connections [2].
For a time series forecasting problem, a training pattern consists of historical data with a fixed number of observations. If the input data set contains N observations $y_1, y_2, \ldots, y_N$ and the ANN has n input nodes, then there are N − n training patterns that can be used for short-term (one value ahead) forecasting. The first training pattern contains $y_1, y_2, \ldots, y_n$ as inputs and $y_{n+1}$ as the output. The second training pattern contains $y_2, y_3, \ldots, y_{n+1}$ as inputs and $y_{n+2}$ as the output. The last training pattern contains $y_{N-n}, y_{N-n+1}, \ldots, y_{N-1}$ as inputs and $y_N$ as the output. The pattern $y_{N-n+1}, y_{N-n+2}, \ldots, y_N$ is then used to obtain the forecast value $y_{N+1}$. The ANN performs the following unknown function mapping:
$$y_{t+1} = f(y_t, y_{t-1}, \ldots, y_{t-n+1}), \qquad (1)$$
where $y_t$ is the observation at time t [1].
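The construction of training patterns described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited studies; the function names `make_training_patterns` and `forecast_inputs` and the toy series are assumptions made for the example.

```python
def make_training_patterns(series, n):
    """Return the N - n (inputs, target) pairs for one-step-ahead training."""
    if not 1 <= n < len(series):
        raise ValueError("n must satisfy 1 <= n < len(series)")
    patterns = []
    for start in range(len(series) - n):
        inputs = list(series[start:start + n])  # n consecutive observations
        target = series[start + n]              # the observation that follows them
        patterns.append((inputs, target))
    return patterns


def forecast_inputs(series, n):
    """Return the last n observations, used as inputs to forecast y_{N+1}."""
    return list(series[-n:])


# Toy series with N = 7 observations and n = 3 input nodes (illustrative values).
y = [10, 11, 13, 12, 15, 16, 18]
patterns = make_training_patterns(y, 3)
```

With N = 7 and n = 3 the sketch yields N − n = 4 patterns; the last one uses the three observations before the final value as inputs and the final value as the target.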
The scalar weights, along with the network architecture, store the knowledge of a trained network and determine the strength of the connections between interconnected neurons [7]. If a weight value is zero, there is no connection between two neurons, and if a weight value is negative, the relationship between the two neurons is inhibitory. An individual processing element receives weighted inputs from the previous layer, which are summed in each node using a combination function, and a bias (threshold) term is added or subtracted. A bias neuron is connected to every hidden or output unit, and its value is one.
Neural networks are similar to linear and non-linear least squares regression, and the bias neuron serves a similar purpose to the intercept in regression models. The bias shifts the net input of the transfer function, which can improve the convergence properties of the neural network; scaling the inputs to a useful range such as [0, 1] or [−1, 1] is a separate normalization step.
The result of this combined summation is passed through a transfer function to produce the nodal output of the processing element (Fig. 1), which is weighted and passed to the processing elements in the next layer [2]. The combination function and the transfer function together constitute the activation function. In the majority of cases, input layer neurons do not have an activation function, as their role is to transfer the inputs to the hidden layer. The most widely used activation function for the output layer is the linear function, as a non-linear activation function may introduce distortion into the predicted output. The sigmoid (logistic), hyperbolic tangent, quadratic or linear functions are often used as the hidden layer transfer function. The relationship between the output (the predicted value $y_t$) and the inputs (the past observations $y_{t-1}, \ldots, y_{t-p}$ of the time series) is given by [5]
$$y_t = w_0 + \sum_{j=1}^{q} w_j \, f\Big(w_{0,j} + \sum_{i=1}^{p} w_{i,j}\, y_{t-i}\Big) + \varepsilon_t, \qquad (2)$$
where $w_j$ (j = 1, …, q) are output layer weights, $w_{i,j}$ (i = 1, …, p; j = 1, …, q) are input layer weights, $w_0$ and $w_{0,j}$ are the bias terms of the output node and the hidden nodes, f is a transfer function, q is the number of hidden nodes, p is the number of input nodes, and $\varepsilon_t$ is a random error at time t. The network uses a learning rule to correct its weights until it finds a set of weights that produces the smallest possible error between the observed value and the predicted value at time t. This process is known as "learning" or "training". For this reason, network training is actually an unconstrained optimization (nonlinear minimization) problem.
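The mapping from p past observations to a predicted value through q hidden nodes can be illustrated with a small forward pass. The sketch below assumes a logistic transfer function in the hidden layer and a linear output node; the function names and the argument layout (`hidden_weights[j][i]` is the weight from input i to hidden node j) are choices made for this example only.

```python
import math


def logistic(x):
    """Sigmoid (logistic) transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forecast(past, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-step forecast of a single-hidden-layer network with a linear output.

    past           -- [y_{t-1}, ..., y_{t-p}], the p most recent observations
    hidden_weights -- q rows of p input-layer weights
    hidden_biases  -- q bias terms, one per hidden node
    output_weights -- q output-layer weights
    output_bias    -- bias term of the output node
    """
    hidden_outputs = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        net = bias + sum(w * y for w, y in zip(weights, past))  # combination function
        hidden_outputs.append(logistic(net))                    # transfer function
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden_outputs))
```

Training would then search for the weights and biases that minimize the error between such forecasts and the observed values; the random error term is not part of the forward pass.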
The neural network (2) can approximate any continuous function when the number of hidden nodes q is sufficiently large [5]. In practice, a network structure with a small number of hidden nodes often works well in "out-of-sample" forecasting, i.e., on data that were not used in training. The reason is the overfitting effect found in the neural network modeling process: an overfitted model has good accuracy on the training data but poor accuracy on "out-of-sample" data [9].
To improve the accuracy of the neural network, the input data usually need to be pre-processed: normalized (rescaled to the range [−1, 1] or [0, 1]) or standardized, and transformed to make the time series stationary. The transformation can be implemented by taking logarithmic returns of the time series, by differencing the time series, etc.
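The pre-processing steps mentioned above are simple to express in code. The following sketch shows min-max rescaling to a chosen range, first differencing and logarithmic returns; the helper names are hypothetical, and log returns assume a strictly positive series.

```python
import math


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so that min -> low and max -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot rescale a constant series")
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


def difference(values):
    """First difference: y_t - y_{t-1} (removes a linear trend)."""
    return [b - a for a, b in zip(values, values[1:])]


def log_returns(values):
    """Logarithmic returns: ln(y_t / y_{t-1}); requires positive values."""
    return [math.log(b / a) for a, b in zip(values, values[1:])]
```

Forecasts produced on the transformed scale have to be mapped back (by inverting the scaling or cumulating the differences) before they are compared with the original series.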

B. Classes of Neural Networks
The ANN learning process can be supervised or unsupervised. In supervised learning (e.g., the multi-layer feed-forward neural network), the network is presented with historical data, where each training pattern contains the independent variables and the corresponding (desired) outputs, which are the dependent variables in the training data. The network processes the inputs, the nodal output of the network is compared with the observed value of the time series, and an error is calculated. This error is used to correct the connection weights between the model inputs and outputs in order to reduce the error between the observed values of the time series and the outputs predicted by the ANN. The input data used for learning are called the "training set". Supervised learning is suitable for forecasting and classification tasks. In unsupervised learning (e.g., the Kohonen network), no dependent variable is specified in the input (training) data. The network corrects the connection weights according to the input values. The idea of training in unsupervised networks is to cluster the input data into classes of similar features, where similar input data should generate the same output. This can be referred to as self-organization, and it is suitable for clustering tasks.
Based on the connections between processing elements, the ANN structure can be regarded as feed-forward (e.g., the back-propagation network) or feedback (e.g., the recurrent network). A feed-forward neural network (FNN) propagates information in the forward direction only, while feedback networks propagate information in both the forward and backward directions [2].
Back-propagation neural networks (BPNNs) are a class of feed-forward neural networks with supervised learning rules. The back-propagation network is the most popular and robust multi-layer network and is used in the majority of forecasting applications. In the learning process, back-propagation neural networks use the gradient-descent search method to correct the connection weights and reduce the error. The main problem of the standard back-propagation algorithm is its slow convergence, which is a typical problem for simple gradient descent methods [1].
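The gradient-descent weight correction can be illustrated on the smallest possible case, a single linear neuron with one input. This is not full back-propagation (there is no hidden layer through which the error is propagated); it only shows the update rule "weight minus learning rate times gradient" that back-propagation applies to every connection. The function name, learning rate and data are assumptions made for the example.

```python
def train_linear_neuron(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit y ~ w * x + b by batch gradient descent on half the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]   # predicted - observed
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n  # dE/dw
        grad_b = sum(errors) / n                             # dE/db
        w -= learning_rate * grad_w                          # move against the gradient
        b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
w, b = train_linear_neuron([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Even on this tiny problem many epochs are needed before the weights settle, which hints at the slow convergence of plain gradient descent noted above.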
Other neural networks that are also used in time series forecasting include recurrent networks, probabilistic networks and fuzzy neural networks. Although feed-forward neural networks are used in many forecasting applications, another type of neural network, the Elman recurrent neural network (ERNN), is also used in forecasting applications with good accuracy. According to the general principle of recurrent networks, there is a feedback connection from the outputs of some neurons in the hidden layer to neurons in the context layer, which stores the delayed hidden layer outputs. The most important advantage of the ERNN is its robust feature extraction ability, as the context layer stores useful information about past data points. Since the ERNN contains the context layer, it is possible to improve forecasting accuracy by using an ERNN instead of an FNN [3].
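The role of the context layer can be seen in a minimal forward pass of an Elman-style network. The sketch assumes a hyperbolic tangent hidden layer, a linear output and a context layer initialized to zero; the function name and weight layout are illustrative assumptions, and training is not shown.

```python
import math


def elman_forward(inputs, w_in, w_ctx, b_hidden, w_out, b_out):
    """Run an Elman-style network over a sequence of input vectors.

    w_in[j]  -- weights from the inputs to hidden node j
    w_ctx[j] -- weights from the context layer to hidden node j
    """
    context = [0.0] * len(b_hidden)  # delayed hidden outputs, initially empty
    outputs = []
    for x in inputs:
        hidden = []
        for j in range(len(b_hidden)):
            net = b_hidden[j]
            net += sum(w * v for w, v in zip(w_in[j], x))
            net += sum(w * c for w, c in zip(w_ctx[j], context))
            hidden.append(math.tanh(net))
        outputs.append(b_out + sum(w * h for w, h in zip(w_out, hidden)))
        context = hidden  # stored for the next time step
    return outputs
```

Because the context layer feeds the previous hidden outputs back in, the response to an input depends on what came before it, which a feed-forward network with the same inputs cannot reproduce.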

III. HYBRID MODEL: ANN AND WAVELETS
Real-world time series are often complex in nature, and no single forecasting model can learn different patterns equally well. Many studies in time series forecasting have found that forecasting improves with combined models and that integrated forecasting techniques outperform individual forecasts. In hybrid models, the aim is to reduce the risk that the chosen model is inappropriate, and the combination helps to obtain more accurate results [5]. Hybrid models can be homogeneous, such as those using differently configured neural networks, or heterogeneous, such as those with both linear and nonlinear models. Hybrid forecasting has been implemented either by combining nonlinear models, e.g., an ANN with a genetic algorithm (GA) or fuzzy logic (FL), or by combining a linear model and a nonlinear model, e.g., the auto-regressive integrated moving average (ARIMA) model with an ANN, since in reality time series data typically contain both linear and nonlinear patterns [6].
In study [4], the hybrid wavelet and ANN (WANN) model was obtained by combining two methods, the discrete wavelet transform and an ANN model. The ANN model used in this study was the multi-layer feed-forward network.

A. Wavelets
The wavelet transform is a mathematical tool that is used as a time-frequency representation of an analyzed signal. There are some important differences between wavelets and Fourier analysis, which is also used to represent signals. The Fourier coefficients contain only globally averaged information, and the Fourier transform does not give local information. Small frequency changes in the Fourier transform produce changes everywhere in the time domain. Wavelets are local in both the time and frequency domains. Wavelet transformations provide a useful decomposition of the original time series, and useful information can be obtained at every decomposition level. Wavelet transforms can be very effective with nonstationary time series data [12]. Many classes of functions can be represented by wavelets in a more compact way. For a discrete time series x(t), the discrete wavelet transform (DWT) is given by
$$W_{m,n} = 2^{-m/2} \sum_{t=0}^{N-1} \Psi(2^{-m} t - n)\, x(t), \qquad (3)$$
where $W_{m,n}$ is the wavelet coefficient for the discrete wavelet, m is an integer that controls the scale, n is an integer that controls the time shift, t is time, N is the number of observations in the time series, and Ψ is the transforming function (mother wavelet). Therefore, a time series of length N is broken into N components with zero redundancy.
The inverse DWT is given by
$$x(t) = T + \sum_{m=1}^{M} \sum_{n} W_{m,n}\, 2^{-m/2}\, \Psi(2^{-m} t - n), \qquad (4)$$
where T is the mean value of the time subseries and M is the number of decomposition levels [4].
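A multi-level wavelet decomposition and its exact inversion can be demonstrated with the Haar wavelet, the simplest member of the Daubechies family. The sketch below is an assumption-laden illustration rather than the procedure of study [4]: it uses Haar instead of a higher-order Daubechies wavelet, requires the series length to be divisible by 2^levels, and returns the components already mapped back to the time domain (an approximation subseries plus one detail subseries per level) instead of raw coefficients.

```python
def haar_approximation(x, level):
    """Haar approximation at a given level: block means over 2**level samples."""
    block = 2 ** level
    if len(x) % block:
        raise ValueError("series length must be divisible by 2**level")
    smooth = []
    for start in range(0, len(x), block):
        mean = sum(x[start:start + block]) / block
        smooth.extend([mean] * block)
    return smooth


def haar_decompose(x, levels):
    """Return (A_M, [D_1, ..., D_M]) such that x = A_M + D_1 + ... + D_M."""
    approximations = [list(x)]
    for m in range(1, levels + 1):
        approximations.append(haar_approximation(x, m))
    details = [
        [fine - coarse for fine, coarse in zip(approximations[m - 1], approximations[m])]
        for m in range(1, levels + 1)
    ]
    return approximations[-1], details


# Toy series of length 8, decomposed into three levels (illustrative values).
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a3, (d1, d2, d3) = haar_decompose(x, 3)
reconstructed = [a + u + v + w for a, u, v, w in zip(a3, d1, d2, d3)]
```

Adding the approximation and all detail subseries recovers the original series exactly, which is the zero-redundancy property in additive form. For real data, a dedicated wavelet library would normally be used instead of this hand-written transform.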

B. WANN Model
In study [4], the Daubechies wavelet, one of the most widely used wavelet families, was chosen as the wavelet function to decompose the original time series into subseries components, which were passed to the ANN to improve the model accuracy.
The number of decomposition levels was chosen by a formula that gives the optimal (maximum) number of decomposition levels L. In a simple format, (4) is given by
$$x(t) = A_M(t) + \sum_{m=1}^{M} D_m(t),$$
where $A_M(t)$ is the approximation subseries, or residual term, at level M and $D_m(t)$ (m = 1, 2, …, M) are the detail subseries, which can detect small features of interpretational value in the data. The optimal number of decomposition levels for the original time series data in this study was three. The original time series was decomposed into three detail components (D1, D2 and D3) and an approximation component (A3). Each decomposition level component has a definite role in the original time series and a different effect on it. In this study, the effectiveness of the wavelet components was determined using the coefficient of determination ($R^2$) between each decomposition level subseries and the original data. The wavelet components D2 and D3 showed a significantly higher $R^2$ than D1, and according to the $R^2$ analysis, the effective wavelet components were selected as the dominant wavelet components, from which the combination DW was calculated.
Fig. 2 shows the structure of the WANN model. The hybrid model showed a great improvement in time series modeling and produced better forecasts than the ANN model alone, as well as the GARCH and ARIMA models alone. The study concluded that the forecasting abilities of the WANN model improved when the wavelet transformation technique was adopted for data pre-processing. The decomposed periodic components obtained with the DWT technique were found to be most effective in giving accurate forecasts when they were used as inputs to the ANN model. The accurate forecasting results showed that the WANN model provides a good and potentially very useful new method for time series forecasting [4].
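The selection of dominant wavelet components by their coefficient of determination can be sketched as follows. The example assumes that $R^2$ is computed as the squared Pearson correlation between a component and the original series, and that a component is kept when its $R^2$ reaches a user-chosen threshold; both the threshold and the function names are hypothetical, as the source does not state the exact selection rule.

```python
def r_squared(component, original):
    """Squared Pearson correlation between a component and the original series."""
    n = len(original)
    mean_c = sum(component) / n
    mean_o = sum(original) / n
    cov = sum((c - mean_c) * (o - mean_o) for c, o in zip(component, original))
    var_c = sum((c - mean_c) ** 2 for c in component)
    var_o = sum((o - mean_o) ** 2 for o in original)
    if var_c == 0 or var_o == 0:
        return 0.0  # a constant series explains no variation
    return cov * cov / (var_c * var_o)


def select_components(components, original, threshold):
    """Names of the components whose R^2 with the original reaches the threshold."""
    return [
        name
        for name, series in components.items()
        if r_squared(series, original) >= threshold
    ]
```

The selected components would then be combined and passed to the ANN as inputs, in the spirit of the WANN model described above.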

IV. HYBRID MODEL: ANN AND ARIMA
ANNs can model both linear and nonlinear structures in time series; however, they cannot capture both structures equally well. In recent years, a growing number of hybrid forecasting models that combine the auto-regressive integrated moving average model with artificial neural networks have been proposed. These hybrid models showed good forecasting accuracy [3].

A. Auto-regressive Integrated Moving Average
One of the most widely used linear time series models is the autoregressive integrated moving average (ARIMA) model. In the ARIMA (p, d, q) model, the future value of a time series is assumed to be a linear function of several past observations and random errors. Here p is a non-negative integer that refers to the order of the autoregressive function, d is a non-negative integer that refers to the order of differencing, and q is a non-negative integer that refers to the order of the moving average. An auto-regressive integrated moving average process has three parts: an autoregressive (AR) function that describes how each time series value is a function of the previous p observations, a moving average (MA) function that describes how each time series value is a function of the previous q errors, and an integrated (I) part that describes how the data series is made stationary by differencing d times. The ARIMA model cannot capture nonlinear patterns [11].
Before the ARIMA model can be used for forecasting, a check for stationarity is carried out. A stationary time series has statistical characteristics, such as the mean, variance and autocorrelation structure, that do not depend on time. When the observed time series exhibits a trend or heteroscedasticity, it is not stationary. Differencing and power transformation are then applied to the data to remove the trend and to stabilize the variance before an ARIMA model can be used.
If a time series is generated from an ARIMA process, it should have some theoretical autocorrelation properties. By comparing the empirical autocorrelation patterns with the theoretical patterns, it is often possible to identify one or several potential ARIMA models for the given time series. The autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the sample data are often used as the basic tools to identify the order (the p, d and q values) of the ARIMA model [5].
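The empirical autocorrelation pattern that is compared with the theoretical one can be computed directly from the data. The sketch below uses the common sample ACF estimator (lagged products of deviations from the mean, divided by the lag-0 sum of squares); the function name is an assumption, and the PACF is not shown.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, r_1, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares
    if c0 == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]
```

A slowly decaying sample ACF typically suggests that differencing is still needed, while a sharp cut-off after a few lags is the kind of pattern used to propose candidate model orders.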

B. ANN and ARIMA
In study [5], a time series was considered to be a nonlinear function of several past observations and random errors, where f is a nonlinear function determined by the neural network and $e_t$ is the residual at time t. The past observation $z_t$ is defined in terms of the backward shift (lag) operator B, the original time series value $y_t$ at time t and the mean value µ of the time series generated by the ARIMA process. After the residuals had been calculated, a neural network was used to model the nonlinear and linear relationships existing in the residuals and the original data. In this network, $z_t$ is the value of the time series predicted by the neural network, f is a transfer function, $w_j$ are output layer weights, $w_{i,j}$ are input layer weights, q is the number of hidden nodes, p is the number of input nodes, and $\varepsilon_t$ is a random error at time t. In such hybrids, the neural network model deals with nonlinearity, while the auto-regressive integrated moving average model deals with the non-stationary linear component.
A similar approach was used in study [3]. Here, seasonal ARIMA (SARIMA) models were used to analyze the linear part of the time series. The linear part of the time series, $\hat{L}_t$, was obtained from the SARIMA model. The ANN model used in this study was an Elman recurrent neural network. The ERNN model was developed to fit the residuals obtained from the SARIMA model by
$$e_t = y_t - \hat{L}_t.$$
With p inputs, the ERNN model can be written as
$$e_t = f(e_{t-1}, e_{t-2}, \ldots, e_{t-p}) + \varepsilon_t, \qquad (13)$$
where f is a nonlinear function determined by the ERNN and $\varepsilon_t$ is a random error. The estimation of $e_t$ in (13) gives the forecast of the nonlinear component of the time series, $\hat{N}_t$. Therefore, the forecast values of the time series were obtained by adding the estimates of the linear and nonlinear components of the time series:
$$\hat{y}_t = \hat{L}_t + \hat{N}_t. \qquad (14)$$
It was observed that this model gave better results than other methods, such as Zhang's hybrid model of a feed-forward neural network and ARIMA, and the Kajitani self-exciting threshold auto-regressive (SETAR) model.
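The additive structure of such a hybrid (a linear fit, a model of its residuals, and the sum of the two forecasts) can be sketched without any forecasting library. In the example below, an AR(1) model fitted by least squares stands in for the SARIMA model, and the residual model is passed in as an arbitrary callable that stands in for the trained ERNN; both substitutions, and all names, are assumptions made to keep the sketch short.

```python
def fit_ar1(series):
    """Least-squares fit of y_t = c + phi * y_{t-1}; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("cannot fit AR(1) to a constant series")
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - phi * mean_x, phi


def hybrid_forecast(series, residual_model, p):
    """One-step forecast: linear forecast plus a forecast of the residuals.

    residual_model -- callable taking the last p residuals and returning the
                      predicted next residual (stand-in for the nonlinear model)
    """
    c, phi = fit_ar1(series)
    linear_fit = [c + phi * prev for prev in series[:-1]]                # linear estimates
    residuals = [obs - fit for obs, fit in zip(series[1:], linear_fit)]  # e_t
    linear_next = c + phi * series[-1]                                   # linear forecast
    nonlinear_next = residual_model(residuals[-p:])                      # residual forecast
    return linear_next + nonlinear_next, residuals
```

For instance, `hybrid_forecast(data, lambda e: sum(e) / len(e), 3)` would use the mean of the last three residuals as a crude residual forecast; in the cited study this role is played by the ERNN.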

V. HYBRID MODEL: GRANN AND ARIMA
In study [6], the GRANN_ARIMA hybrid model (Fig. 3) was used. Grey relational analysis (GRA) was integrated with the ANN to remove redundant inputs. The GRANN_ARIMA model integrates a nonlinear grey relational artificial neural network (GRANN) and a linear ARIMA model, combining multivariate time series data with grey relational analysis to select the appropriate inputs.

A. Grey Analysis
Grey relational analysis is a method that can be used to evaluate the degree of correlation between different data sequences. The degree of correlation between a data sequence (x) and the reference sequence (y) is expressed by a scalar value in the interval [0, 1]. A degree of correlation near 1 indicates a high correlation between x and y. There are three main steps in grey relational analysis. The first step is data pre-processing. Data pre-processing is normally required because the range and unit of one data sequence may differ from those of the others. Therefore, the data must first be normalized, scaled and polarized into comparable sequences before proceeding to the other steps. In this study, the following equation for data pre-processing was used:
$$x_i^*(k) = \frac{x_i^0(k) - \min x_i^0(k)}{\max x_i^0(k) - \min x_i^0(k)},$$
where i = 1, …, m; k = 1, …, n; m is the number of experimental data items, n is the number of parameters, $x_i^0(k)$ is the original sequence, $x_i^*(k)$ is the sequence after data pre-processing, and $\min x_i^0(k)$ and $\max x_i^0(k)$ are the smallest and the largest values of $x_i^0(k)$, respectively. The data are thus rescaled to the range [0, 1].
The second step is to calculate the grey relational coefficient by using
$$\xi_i(k) = \frac{\Delta_{\min} + \varsigma \Delta_{\max}}{\Delta_{0,i}(k) + \varsigma \Delta_{\max}}, \qquad (16)$$
where $\xi_i(k)$ is the grey relational coefficient at data point k and ς is the identification coefficient within the interval [0, 1]; normally ς = 0.5 is used. $\Delta_{0,i}(k)$ is the deviation sequence of the reference sequence and the comparability sequence,
$$\Delta_{0,i}(k) = |x_0^*(k) - x_i^*(k)|,$$
where $x_0^*(k)$ is the reference sequence and $x_i^*(k)$ is the comparative sequence. $\Delta_{\min}$ in (16) is given by
$$\Delta_{\min} = \min_{j} \min_{k} |x_0^*(k) - x_j^*(k)|$$
and $\Delta_{\max}$ in (16) is given by
$$\Delta_{\max} = \max_{j} \max_{k} |x_0^*(k) - x_j^*(k)|.$$
The third step is to calculate the grey relational grade, which is the average value of the grey relational coefficients and is defined as
$$\gamma_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k),$$
where n is the number of data points in the reference sequence $x_0^*(k)$. The grey relational grade $\gamma_i$ represents the level of correlation between the reference sequence and the comparability sequence.
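The grey relational coefficient and grade can be computed with a few list operations. The sketch below assumes the standard textbook form of grey relational analysis with sequences that have already been pre-processed to the range [0, 1]; the function name and the parameter name `identification_coefficient` are choices made for this example.

```python
def grey_relational_grades(reference, candidates, identification_coefficient=0.5):
    """Grey relational grade of each candidate sequence against the reference.

    All sequences are assumed to be pre-processed to [0, 1] and equally long.
    """
    # Deviation sequences: absolute differences from the reference sequence.
    deviations = [
        [abs(r - c) for r, c in zip(reference, seq)] for seq in candidates
    ]
    flat = [d for row in deviations for d in row]
    d_min, d_max = min(flat), max(flat)  # global smallest and largest deviation
    if d_max == 0:
        return [1.0] * len(candidates)   # every candidate equals the reference
    grades = []
    for row in deviations:
        coefficients = [
            (d_min + identification_coefficient * d_max)
            / (d + identification_coefficient * d_max)
            for d in row
        ]
        grades.append(sum(coefficients) / len(coefficients))  # average coefficient
    return grades
```

A candidate input with a grade close to 1 follows the reference sequence closely, so ranking candidates by grade gives a simple way to keep influential inputs and drop redundant ones.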

B. GRANN_ARIMA model
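In the GRANN_ARIMA model, ARIMA is used as a linear model, $L_t$, and GRANN is used as a nonlinear model, $N_t$:
$$y_t = L_t + N_t.$$
Here $y_t$ is a value of the original time series. As $\hat{N}_t$ is the forecast value of the GRANN model at time t, the residuals $e_t$ are obtained by
$$e_t = y_t - \hat{N}_t.$$
The residuals now represent the linear part of the data, and ARIMA can be used to model them. Residual modeling by ARIMA can be represented similarly to (13), but here f is a linear function modeled by the ARIMA model. Therefore, the hybridized forecast model can be written as in (14), but here the linear part of the model, $\hat{L}_t$, is obtained from (13). Unlike the model in (8), the nonlinear model is applied first and is then followed by the linear model.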
To validate the performance of the proposed model, the Kuala Lumpur Stock Exchange (KLSE) daily close price was used as the time series data. Grey relational analysis was used as a feature selection tool, and four factors were found to be the most influential for the KLSE close price: the Syariah index, the trading/service index, the composite index and the industrial index. A three-layer feed-forward neural network with four input units, nine hidden units and a single output unit, with a learning rate of 0.5 and a momentum of 0.9, was used to model the nonlinear part of the forecast. An ARIMA (0,1,3) model was used to model the residuals of the close price, i.e., the linear part of the forecast. The network structure and learning parameters were determined by trial and error. The forecasting accuracy was compared with that of several models, including individual models (ARIMA, multiple regression, grey relational artificial neural network), several hybrid models (MARMA, MR ANN, ARIMA ANN), and an artificial neural network trained using the Levenberg-Marquardt algorithm. The experiments showed that the GRANN_ARIMA model outperformed the other models, with a MAPE of 0.16 % and a forecasting accuracy of 99.84 %. The empirical results showed that the GRANN_ARIMA model could be a good alternative for time series forecasting due to its promising forecasting accuracy.
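The mean absolute percentage error (MAPE) used in this comparison can be computed as shown below. The sketch uses the usual definition of MAPE; treating the forecasting accuracy as 100 % minus the MAPE is an assumption inferred from the reported pair of figures (0.16 % and 99.84 %), not a formula given in the source.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def forecasting_accuracy(actual, predicted):
    """Accuracy in percent, taken here as 100 minus the MAPE (assumed convention)."""
    return 100.0 - mape(actual, predicted)
```

Under this convention, a model whose forecasts deviate from the observed close prices by 0.16 % on average would be reported as 99.84 % accurate.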

VI. CONCLUSION
In this paper, an overview of artificial neural network applications in forecasting and of possible improvements in forecasting accuracy was presented. Artificial neural networks are computational models and universal approximators that can be applied to time series forecasting with high accuracy. Back-propagation neural networks are a class of feed-forward neural networks and are the most popular and robust multi-layer networks, used in the majority of forecasting applications. Although feed-forward neural networks are used in many forecasting applications of ANNs, another type of neural network, the Elman recurrent neural network (ERNN), is also used in forecasting applications with good accuracy. Compared with other types of multi-layer networks, the most important advantage of the ERNN is its robust feature extraction ability. Since the ERNN contains the context layer, it is possible to improve forecasting accuracy by using an ERNN instead of an FNN. Real-world time series are often complex in nature, and no single forecasting model can learn different patterns equally well. Forecasting improves with combined models, and integrated forecasting techniques outperform individual forecasts. Hybrid models can be homogeneous, such as those using differently configured neural networks, or heterogeneous, such as those with both linear and nonlinear models, since in reality time series data typically contain both linear and nonlinear patterns. The hybrid models WANN, ARIMA ANN and GRANN_ARIMA showed good forecasting accuracy and outperformed other forecasting models.