Artificial neural networks have emerged as an important quantitative modeling tool for business forecasting. This chapter provides an overview of forecasting with neural networks. We provide a brief description of neural networks, their advantages over traditional forecasting models, and their applications for business forecasting. In addition, we address several important modeling issues for forecasting applications.
G. Peter Zhang, Georgia State University, USA
The recent upsurge in research activities into artificial neural networks (ANNs) has proven that neural networks have powerful pattern classification and prediction capabilities. ANNs have been successfully used for a variety of tasks in many fields of business, industry, and science (Widrow et al., 1994). They have fast become a standard class of quantitative modeling tools for researchers and practitioners. Interest in neural networks is evident from the growth in the number of papers published in journals of diverse scientific disciplines. A search of several major databases can easily result in hundreds or even thousands of "neural networks" articles published in one year.
One of the major application areas of ANNs is forecasting, and interest in forecasting with ANNs has grown in recent years. Forecasting has a long history, and the importance of this old subject is reflected by the diversity of its applications in disciplines ranging from business to engineering. The ability to accurately predict the future is fundamental to many decision processes in planning, scheduling, purchasing, strategy formulation, policy making, and supply chain operations. As such, forecasting is an area in which much effort has been invested in the past. Yet it remains an important and active field at the present time and will continue to be in the future. A survey of research needs for forecasting has been provided by Armstrong (1988).
Forecasting has been dominated by linear methods for many decades. Linear methods are easy to develop and implement, and they are also relatively simple to understand and interpret. However, linear models have a serious limitation: they cannot capture nonlinear relationships in the data, and their approximation of complicated nonlinear relationships is not always satisfactory. In the early 1980s, Makridakis (1982) organized a large-scale forecasting competition (often called the M-competition) in which a majority of commonly used linear methods were tested on more than 1,000 real time series. The mixed results show that no single linear model is globally the best, which may be interpreted as the failure of linear modeling to account for the varying degree of nonlinearity that is common in real-world problems.
ANNs provide a promising alternative tool for forecasters. The inherently nonlinear structure of neural networks is particularly useful for capturing the complex underlying relationship in many real world problems. Neural networks are perhaps more versatile methods for forecasting applications in that not only can they find nonlinear structures in a problem, they can also model linear processes. For example, the capability of neural networks in modeling linear time series has been studied and confirmed by a number of researchers (Hwang, 2001; Medeiros et al., 2001; Zhang, 2001).
In addition to the nonlinear modeling capability, ANNs also have several other features that make them valuable for forecasting tasks. First, ANNs are data-driven nonparametric methods that do not require many restrictive assumptions on the underlying process from which data are generated. As such, they are less susceptible to the model misspecification problem than parametric methods. This "learning from data or experience" feature of ANNs is highly desirable in various forecasting situations where data are usually easy to collect, but the underlying data-generating mechanism is not known or pre-specifiable. Second, neural networks have been mathematically shown to have the universal functional approximating capability in that they can accurately approximate many types of complex functional relationships. This is an important and powerful characteristic, as any forecasting model aims to accurately capture the functional relationship between the variable to be predicted and other relevant factors or variables. The combination of the above-mentioned characteristics makes ANNs a very general and flexible modeling tool for forecasting.
Research efforts on neural networks as forecasting models are considerable and applications of ANNs for forecasting have been reported in a large number of studies. Although some theoretical and empirical issues remain unsolved, the field of neural network forecasting has surely made significant progress during the last decade. It will not be surprising to see even greater advancement and success in the next decade.
Artificial neural networks (ANNs) are computing models for information processing and pattern identification. They grow out of research interest in modeling biological neural systems, especially human brains. An ANN is a network of many simple computing units called neurons or cells, which are highly interconnected and organized in layers. Each neuron performs the simple task of information processing by converting received inputs into processed outputs. Through the linking arcs among these neurons, knowledge can be generated and stored regarding the strength of the relationship between different nodes. Although the ANN models used in all applications are much simpler than actual neural systems, they are able to perform a variety of tasks and achieve remarkable results.
Over the last several decades, many types of ANN models have been developed, each aimed at solving different problems. But by far the most widely and successfully used for forecasting has been the feedforward type neural network. Figure 1 shows the architecture of a three-layer feedforward neural network that consists of neurons (circles) organized in three layers: input layer, hidden layer, and output layer. The neurons in the input nodes correspond to the independent or predictor variables (x) that are believed to be useful for forecasting the dependent variable (y) which corresponds to the output neuron. Neurons in the hidden layer are connected to both input and output neurons and are key to learning the pattern in the data and mapping the relationship from input variables to the output variable. With nonlinear transfer functions, hidden neurons can process complex information received from input neurons and then send processed information to the output layer for further processing to generate forecasts. In feedforward ANNs, the information flow is one directional from the input layer to the hidden layer then to the output layer without any feedback from the output layer.
Figure 1. A Typical Feedforward Neural Network
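The information flow in a three-layer feedforward network can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the network size (three inputs, two hidden neurons, one output), and the random weights are hypothetical, and a logistic transfer function in the hidden layer with a linear output neuron is one common choice rather than the only one.

```python
import numpy as np

def logistic(z):
    """Logistic transfer function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One pass from the input layer through the hidden layer to the output."""
    hidden = logistic(W_hidden @ x + b_hidden)  # nonlinear hidden neurons
    return float(w_out @ hidden + b_out)        # single linear output neuron

# Hypothetical network: 3 input neurons, 2 hidden neurons, 1 output neuron.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 3))  # one row of weights per hidden neuron
b_hidden = np.zeros(2)
w_out = rng.normal(size=2)
b_out = 0.0

x = np.array([0.5, -1.2, 0.3])      # values of the predictor variables
y_hat = forward(x, W_hidden, b_hidden, w_out, b_out)
```

Nothing flows back from the output to earlier layers, which is what makes the network feedforward.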
In developing a feedforward neural network model for forecasting tasks, specifying its architecture in terms of the number of input, hidden, and output neurons is an important task. Most ANN applications use only one output neuron for both one-step-ahead and multi-step-ahead forecasting. However, as argued by Zhang et al. (1998), it may be beneficial to employ multiple output neurons for direct multi-step-ahead forecasting. The input neurons or variables are very important in any modeling endeavor and especially important for ANN modeling because the success of an ANN depends to a large extent on the patterns represented by the input variables. What and how many variables to use should be considered carefully. For a causal forecasting problem, we need to specify a set of appropriate predictor variables and use them as the input variables. On the other hand, for a time series forecasting problem, we need to identify a number of past lagged observations as the inputs. In either situation, knowledge of the forecasting problem as well as some experimentation based on neural networks may be necessary to determine the best number of input neurons. Finally, the number of hidden nodes is usually unknown before building an ANN model and must be chosen during the model-building process. This parameter is useful for approximating the nonlinear relationship between input and output variables.
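For a time series problem, using past lagged observations as inputs amounts to sliding a window over the series. The sketch below assumes a univariate series and one-step-ahead targets; the helper name `make_lagged` and the sample values are illustrative only.

```python
def make_lagged(series, n_lags):
    """Turn a univariate series into (inputs, targets) training pairs.

    Each input row holds n_lags consecutive past observations, oldest first,
    and the target is the observation that immediately follows them.
    """
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(list(series[t - n_lags:t]))
        targets.append(series[t])
    return inputs, targets

# Three lagged observations feed three input neurons.
X, y = make_lagged([10, 12, 13, 15, 18, 21], n_lags=3)
# X == [[10, 12, 13], [12, 13, 15], [13, 15, 18]]
# y == [15, 18, 21]
```

The number of lags equals the number of input neurons, so it is one of the quantities to determine by experimentation.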
Before a neural network can be used for forecasting, it must be trained. Neural network training refers to the estimation of connection weights. Although the estimation process is very similar to that in linear regression where we minimize the sum of squared errors (SSE), the ANN training process is more difficult and complicated due to the nature of nonlinear optimization involved. There are many training algorithms developed in the literature and the most influential one is the backpropagation algorithm by Werbos (1974) and Rumelhart et al. (1986). The basic idea of backpropagation training is to use a gradient-descent approach to adjust and determine weights such that an overall error function such as SSE can be minimized.
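The gradient-descent idea behind backpropagation can be illustrated with a minimal batch version for a one-hidden-layer network. This is a sketch under assumptions, not the algorithm as originally published: the learning rate, number of epochs, hidden-layer size, weight initialization, and the toy sine data are all arbitrary choices made for the example.

```python
import numpy as np

def train(X, y, n_hidden=4, lr=0.05, epochs=2000, seed=0):
    """Fit a one-hidden-layer network by batch gradient descent on SSE / 2."""
    rng = np.random.default_rng(seed)
    n_obs, n_in = X.shape
    W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))  # input-to-hidden weights
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.5, size=n_hidden)          # hidden-to-output weights
    b2 = 0.0
    history = []
    for _ in range(epochs):
        # Forward pass: logistic hidden layer, linear output neuron.
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
        err = H @ w2 + b2 - y
        history.append(float(err @ err))               # SSE before this update
        # Backward pass: propagate the output error back to each weight.
        grad_w2 = H.T @ err
        grad_b2 = err.sum()
        delta = np.outer(err, w2) * H * (1.0 - H)      # error at hidden neurons
        grad_W1 = X.T @ delta
        grad_b1 = delta.sum(axis=0)
        # Gradient-descent step, scaled by the number of observations.
        W1 -= lr * grad_W1 / n_obs
        b1 -= lr * grad_b1 / n_obs
        w2 -= lr * grad_w2 / n_obs
        b2 -= lr * grad_b2 / n_obs
    return (W1, b1, w2, b2), history

# Toy data, assumed for illustration only.
X = np.linspace(-2.0, 2.0, 40).reshape(-1, 1)
y = np.sin(X).ravel()
weights, history = train(X, y)
```

Because the error surface is nonlinear in the weights, a different random seed can end at a different solution, which is one reason training is harder than fitting a linear regression.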
In addition to the most popular feedforward ANNs, many other types of neural networks can also be used for forecasting purposes. In particular, recurrent neural networks (Connor et al., 1994; Kuan et al., 1995; Kermanshahi, 1998; Vermaak & Botha, 1998; Parlos et al., 2000; Mandic & Chambers, 2001; Husken & Stagge, 2003) that explicitly account for the dynamic nonlinear pattern can be a good alternative to feedforward type ANNs for certain time series forecasting problems. In a recurrent ANN, there are cycles or feedback connections among neurons. Outputs from a recurrent network can be directly fed back to inputs, generating dynamic feedbacks on errors of past patterns. In this sense, recurrent ANNs can model richer dynamics than feedforward ANNs in the same way that linear autoregressive and moving average (ARMA) models have certain advantages over autoregressive (AR) models. However, much less attention has been paid to the research and applications of recurrent ANNs, and the superiority of recurrent ANNs over feedforward ANNs has not been established. The practical difficulty of using recurrent neural networks may lie in the fact that (1) recurrent networks can assume many different architectures, so it may be difficult to specify appropriate model structures to experiment with, and (2) recurrent ANNs are more difficult to train due to the unstable nature of training algorithms.
For an in-depth coverage of many aspects of ANNs, readers are referred to a number of excellent books including Smith (1993), Bishop (1995), and Ripley (1996). For ANNs for forecasting research and applications, readers may consult Azoff (1994), Weigend and Gershenfeld (1994), Gately (1996), Zhang et al. (1998), and Remus and O'Connor (2001).
Developing an ANN model for a particular forecasting application is not a trivial task. Although many good software packages exist to ease users' effort in building an ANN model, it is still critical for forecasters to understand many important issues surrounding the model building process. It is important to point out that building a successful neural network is a combination of art and science and software alone is not sufficient to solve all problems in the process. It is a pitfall to blindly throw data into a software package and then hope it will automatically give a satisfactory solution.
An important point in effectively using ANN forecasting is the understanding of the issue of learning and generalization inherent in all ANN forecasting applications. This issue of learning and generalization can be understood with the concepts of model bias and variance (Geman et al., 1992). Bias and variance are important statistical properties associated with any empirical model. Model bias measures the systematic error of a forecasting model in learning the underlying relations among variables or time series observations. Model variance, on the other hand, relates to the stability of models built on different data samples from the same process and therefore offers insights on generalizability of the prediction model.
A pre-specified or parametric model, which is less dependent on the data, may misrepresent the true functional relationship and, hence, cause a large bias. On the other hand, a flexible, data-driven model may be too dependent on the specific data set and, hence, have a large variance. Bias and variance are two conflicting terms that impact a model's usefulness. Although it is desirable to have both low bias and low variance, we may not be able to reduce both terms at the same time for a given data set because these goals are conflicting. A model that is less dependent on the data tends to have low variance but high bias if the pre-specified model is incorrect. On the other hand, a model that fits the data well tends to have low bias but high variance when applied to different data sets. Hence, a good predictive model should have an "appropriate" balance between model bias and model variance.
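The trade-off can be stated with the standard decomposition of expected squared prediction error. The notation is assumed for illustration and does not come from the chapter: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model fitted to a random training sample, and expectations are taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{squared bias}}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model. A rigid model makes the first term large when its functional form is wrong, while a flexible model makes the second term large when it tracks the particular sample too closely.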
As a model-free approach to data analysis, neural networks tend to fit the training data well and thus have low bias. But the price to pay is the potential overfitting effect that causes high variance. Therefore, attention should be paid to address issues of overfitting and the balance of bias and variance in neural network model building.
The major decisions a neural network forecaster must make include data preparation, input variable selection, choice of network type and architecture, transfer function, and training algorithm, as well as model validation, evaluation and selection. Some of these can be solved during the model building process while others must be considered before actual modeling starts.
Neural networks are data-driven techniques. Therefore, data preparation is a critical step in building a successful neural network model. Without a good, adequate, and representative data set, it is impossible to develop a useful, predictive ANN model. Thus, the reliability of ANN models depends to a large extent on the quality of data.
There are several practical issues around the data requirement for an ANN model. The first is the size of the sample used to build a neural network. While there is no specific rule that can be followed for all situations, the advantage of having large samples should be clear: not only do neural networks typically have a large number of parameters to estimate, but it is also often necessary to split the data into several portions to avoid overfitting, select models, and perform model evaluation and comparison. A larger sample provides a better chance for neural networks to adequately approximate the underlying data structure. Although large samples do not always give superior performance over small samples, forecasters should strive to get as large a sample as they can. In time series forecasting problems, Box and Jenkins (1976) have suggested that at least 50 or, even better, 100 observations are necessary to build linear ARIMA models. For nonlinear modeling, therefore, a larger sample size should be even more desirable. In fact, using the longest time series available for developing forecasting models is a time-tested principle in forecasting (Armstrong, 2001). Of course, if data in the sample are not homogeneous or the underlying data generating process in a time series changes over time, then a larger sample may even hurt the performance of static neural networks as well as other traditional methods.
The second issue is data splitting. Typically for neural network applications, all available data are divided into an in-sample and an out-of-sample. The in-sample data are used for model fitting and selection, while the out-of-sample is used to evaluate the predictive ability of the model. The in-sample data sometimes are further split into a training sample and a validation sample. Because of the bias and variance issue, it is critical to test an ANN model with an independent out-of-sample that is not used in the neural network training and model selection phase. This division of data means that the true size of the sample used in model building is smaller than the initial sample size. Although there is no consensus on how to split the data, the general practice is to allocate more data for model building and selection. That is, most studies in the literature use a convenient splitting ratio for in-sample and out-of-sample data such as 70%:30%, 80%:20%, or 90%:10%. It is important to note that in data splitting, the issue is not what proportion of data should be allocated to each sample but whether each sample contains sufficient data points to ensure adequate learning, validation, and testing. Granger (1993) suggests that for nonlinear modeling at least 20% of the data should be held back for an out-of-sample evaluation. Hoptroff (1993) recommends that at least 10 data points should be in the test sample, while Ashley (2003) suggests that a much larger out-of-sample size is necessary in order to achieve statistically significant improvement for forecasting problems.
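The splitting scheme can be sketched as follows. The sketch assumes time-ordered data, so the most recent observations form the out-of-sample and the in-sample is then divided into training and validation samples. The function name and the default fractions are illustrative choices, not recommendations from the chapter.

```python
def split_series(data, test_frac=0.2, valid_frac=0.2):
    """Split time-ordered data into training, validation, and test samples.

    test_frac is the share of all data held back as out-of-sample;
    valid_frac is the share of the remaining in-sample used for validation.
    """
    n_test = round(len(data) * test_frac)
    in_sample = data[:len(data) - n_test]
    test = data[len(data) - n_test:]
    n_valid = round(len(in_sample) * valid_frac)
    train = in_sample[:len(in_sample) - n_valid]
    valid = in_sample[len(in_sample) - n_valid:]
    return train, valid, test

observations = list(range(100))
train, valid, test = split_series(observations)
# 64 training, 16 validation, and 20 test observations, in time order.
```

With 100 observations only 64 remain for estimating the weights, which illustrates why the effective sample for model building is smaller than the initial sample.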
Data preprocessing is another issue that is often recommended to highlight important relationships or to create more uniform data to facilitate ANN learning, meet algorithm requirements, and avoid computation problems. Azoff (1994) summarizes four methods typically used for input data normalization. They are: along channel normalization, across channel normalization, mixed channel normalization, and external normalization. However, the necessity and effect of data normalization on network learning and forecasting are still not universally agreed upon. For example, in modeling and forecasting seasonal time series, some researchers (Gorr, 1994) believe that data preprocessing is not necessary because the ANN is a universal approximator and is able to capture all of the underlying patterns well. Recent empirical studies (Nelson et al., 1999), however, find that pre-deseasonalization of the data is critical in improving forecasting performance. Zhang and Qi (2002) further demonstrate that for time series containing both trend and seasonal variations, preprocessing the data by both detrending and deseasonalization should be the most appropriate way to build neural networks for best forecasting performance.
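As one concrete instance of normalization, the sketch below rescales a single variable into [0, 1]. It is plain min-max scaling and is not claimed to reproduce any of the four schemes summarized by Azoff (1994). The helper names are hypothetical, and taking the minimum and maximum from the training sample only is an assumption made here to keep test data out of model building.

```python
def fit_minmax(train_values):
    """Return the (min, max) of the training sample for later scaling."""
    lo, hi = min(train_values), max(train_values)
    if hi == lo:
        raise ValueError("constant series cannot be min-max scaled")
    return lo, hi

def scale(values, lo, hi):
    """Map values so that lo becomes 0.0 and hi becomes 1.0."""
    return [(v - lo) / (hi - lo) for v in values]

def unscale(scaled, lo, hi):
    """Invert scale() to express forecasts in the original units."""
    return [s * (hi - lo) + lo for s in scaled]

lo, hi = fit_minmax([10, 20, 30])
scaled = scale([10, 20, 30], lo, hi)   # [0.0, 0.5, 1.0]
restored = unscale(scaled, lo, hi)     # [10.0, 20.0, 30.0]
```

Values in later samples that fall outside the training range will map outside [0, 1], which matters if a bounded transfer function is used at the output.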
Neural network design and architecture selection are important yet difficult tasks. Not only are there many ways to build an ANN model and a large number of choices to be made during the model building and selection process, but also numerous parameters and issues have to be estimated and experimented with before a satisfactory model may emerge. Adding to the difficulty is the lack of standards in the process. Numerous rules of thumb are available, but not all of them can be applied blindly to a new situation. In building an appropriate model for the forecasting task at hand, some experiments are usually necessary. Therefore, a good experiment design is needed. For discussions of many aspects of modeling issues, readers may consult Kaastra et al. (1996), Zhang et al. (1998), Coakley and Brown (1999), and Remus and O'Connor (2001).
As stated earlier, many types of ANN have been used for forecasting. However, the multilayer feedforward architecture is by far the best developed and most widely applied one for forecasting applications. Therefore, our discussion will focus on this type of neural network, although much of it may also apply to other types of ANN.
A feedforward ANN is characterized by its architecture and determined by the number of layers, the number of nodes in each layer, the transfer or activation function used in each layer, as well as how the nodes in each layer are connected to nodes in adjacent layers. Although partial connections between nodes in adjacent layers and direct connections from input layer to output layer are possible, the most commonly used ANN is the so-called "fully connected" network in that each node at one layer is fully connected only to all of the nodes in the adjacent layers.
The size of the output layer is usually determined by the nature of the problem. For example, in most forecasting problems, one output node is naturally used for one-step-ahead forecasting. One output node can also be employed for multi-step-ahead forecasting, in which case an iterative forecasting mode must be used. That is, forecasts for two or more steps ahead in the time horizon must be based on earlier forecasts.
This may not be effective for multi-step forecasting as pointed out by Zhang et al. (1998), which is in line with Chatfield (2001) who discusses the potential benefits of using different forecasting models for different lead times. Therefore, for multi-step forecasting, one may either use multiple output nodes or develop multiple neural networks each for one particular step forecasting.
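The iterative mode with a single output node can be sketched as a loop that feeds each forecast back in as the newest lagged input. The stand-in model below (the mean of the input window) is purely hypothetical and only takes the place of a trained network; any callable that maps a window of lags to one forecast would fit.

```python
def iterate_forecast(one_step, history, n_lags, horizon):
    """Produce multi-step forecasts from a one-step-ahead model.

    After each step the oldest lag is dropped and the newest forecast is
    appended, so later forecasts are built on earlier forecasts.
    """
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step(window)
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]
    return forecasts

def mean_of_window(window):
    """Stand-in for a trained one-output network (illustration only)."""
    return sum(window) / len(window)

forecasts = iterate_forecast(mean_of_window, [1.0, 2.0, 3.0], n_lags=2, horizon=3)
# forecasts == [2.5, 2.75, 2.625]
```

From the third step onward every input is itself a forecast, so errors can accumulate. The direct alternatives avoid this by producing each lead time from observed data only.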
The number of input nodes is perhaps the most important parameter for designing an effective neural network forecaster. For causal forecasting problems, it corresponds to the number of independent or predictor variables that forecasters believe are important in predicting the dependent variable. For univariate time series forecasting problems, it is the number of past lagged observations. Determining an appropriate set of input variables is vital for neural networks to capture the essential underlying relationship that can be used for successful forecasting. How many and what variables to use in the input layer will directly affect the performance of the neural network in both in-sample fitting and out-of-sample forecasting, and a poor choice can result in under-learning or overfitting. Empirical results (Lennon et al., 2001; Zhang et al., 2001; Zhang, 2001) also suggest that the input layer is more important than the hidden layer in time series forecasting problems. Therefore, considerable attention should be given to determining the input variables, especially for time series forecasting.
Although there is substantial flexibility in choosing the number of hidden layers and the number of hidden nodes in each layer, most forecasting applications use only one hidden layer and a small number of hidden nodes. In practice, the number of hidden nodes is often determined by experimenting with a number of choices and then selected by the cross-validation approach or performance on the validation set. Although the number of hidden nodes is an important factor, a number of studies have found that forecasting performance of neural networks is not very sensitive to this parameter (Bakirtzis et al., 1996; Khotanzad et al., 1997; Zhang et al., 2001).
For forecasting applications, the most popular transfer function for hidden nodes is either the logistic or the hyperbolic tangent function, and for output nodes it is the linear or identity function, although many other choices can be used. If the data, especially the output data, have been normalized into the range of [0, 1], then the logistic function can be used for the output layer. In general, different choices of transfer function should not have much impact on the performance of a neural network model.
Once a particular ANN architecture is of interest to the forecaster, it must be trained so that the parameters of the network can be estimated from the data. To be effective in performing this task, a good training algorithm is needed. Training a neural network can be treated as a nonlinear mathematical optimization problem and different solution approaches or algorithms can have quite different effects on the training result. As a result, training with different algorithms and repeating with multiple random initial weights can be helpful in getting a better solution to the neural network training problem. In addition to the popular basic backpropagation training algorithm, users should be aware of many other (sometimes more effective) algorithms. These include so-called second-order approaches, such as conjugate gradient descent, quasi-Newton, and Levenberg-Marquardt (Bishop, 1995).
ANN model selection is typically done with the basic cross-validation process. That is, the in-sample data is split into a training set and a validation set. The ANN parameters are estimated with the training sample, while the performance of the model is evaluated with the validation sample. The best model selected is the one that has the best performance on the validation sample. Of course, in choosing competing models, we must also apply the principle of parsimony. That is, a simpler model that has about the same performance as a more complex model should be preferred.
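Validation-based selection combined with the principle of parsimony can be sketched as below. The validation errors are made-up numbers standing in for results of trained networks, and the 5% tolerance used to decide what counts as "about the same performance" is an arbitrary assumption.

```python
def select_model(valid_errors, tolerance=0.05):
    """Pick the smallest network whose validation error is near the best.

    valid_errors maps the number of hidden nodes to the error measured on
    the validation sample; tolerance is the relative slack allowed.
    """
    best = min(valid_errors.values())
    threshold = best * (1.0 + tolerance)
    eligible = [size for size, err in valid_errors.items() if err <= threshold]
    return min(eligible)

# Hypothetical validation errors for networks with 1, 2, 4, and 8 hidden nodes.
valid_errors = {1: 0.90, 2: 0.52, 4: 0.50, 8: 0.51}
chosen = select_model(valid_errors)
# chosen == 2: slightly worse than the 4-node network, but simpler.
```

Setting the tolerance to zero recovers the plain rule of choosing the model with the lowest validation error.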
Model selection can also be done with all of the in-sample data. This can be achieved with several in-sample selection criteria that modify the total error function to include a penalty term that penalizes for the complexity of the model. In-sample model selection approaches are typically based on some information-based criteria such as Akaike's information criterion (AIC) and Bayesian (BIC) or Schwarz information criterion (SIC). However, it is important to note the limitation of these criteria as empirically demonstrated by Swanson and White (1995) and Qi and Zhang (2001). Other in-sample approaches are based on pruning methods such as node and weight pruning (Reed, 1993), as well as constructive methods such as the upstart and cascade correlation approaches (Fahlman & Lebiere, 1990; Frean, 1990).
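The chapter does not give formulas for these criteria. One commonly used least-squares form, shown here as an assumption, penalizes the log of the mean squared in-sample error by the number of estimated parameters. The function names and example figures are illustrative.

```python
import math

def aic(sse, n_obs, n_params):
    """Akaike's criterion in a common least-squares form (lower is better)."""
    return n_obs * math.log(sse / n_obs) + 2 * n_params

def bic(sse, n_obs, n_params):
    """Bayesian (Schwarz) criterion in the same form; penalty grows with n_obs."""
    return n_obs * math.log(sse / n_obs) + n_params * math.log(n_obs)

# Hypothetical comparison on 100 in-sample observations: a larger network
# fits better (lower SSE) but uses many more weights.
small = {"sse": 100.0, "n_params": 5}
large = {"sse": 90.0, "n_params": 25}
aic_small = aic(small["sse"], 100, small["n_params"])
aic_large = aic(large["sse"], 100, large["n_params"])
```

In this made-up case the smaller network has the lower AIC because the improvement in fit does not offset the penalty for twenty additional weights. For a network, `n_params` would be the total number of weights and biases.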
After the modeling process, the finally selected model must be evaluated using data not used in the model-building stage. In addition, as ANNs are often used as a nonlinear alternative to traditional statistical models, the performance of ANNs needs to be compared to that of statistical methods. As Adya and Collopy (1998) point out, "if such a comparison is not conducted, it is difficult to argue that the study has taught us much about the value of ANNs." They further propose three evaluation criteria to objectively evaluate the performance of an ANN: (1) comparing it to well-accepted (traditional) models; (2) using true out-of-samples; and (3) ensuring enough sample size in the out-of-sample (40 for classification problems and 75 for time series problems). It is important to note that the test sample served as out-of-sample should not in any way be used in the model-building process. If the cross-validation is used for model selection and experimentation, the performance on the validation sample should not be treated as the true performance of the model.
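The comparison against a traditional benchmark on a true out-of-sample can be sketched as follows. The naive "last observed value" forecast is used here only as a simple stand-in benchmark, and the test values and network forecasts are invented for illustration. In practice the benchmark would be a well-accepted statistical model and the test sample would be much larger.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

last_in_sample = 9
test = [10, 12, 11, 13]                    # held-out observations
ann_forecasts = [10.5, 11.5, 11.5, 12.5]   # hypothetical network output

# One-step naive benchmark: each forecast is the previous actual value.
naive_forecasts = [last_in_sample] + test[:-1]

rmse_ann = rmse(test, ann_forecasts)
rmse_naive = rmse(test, naive_forecasts)
relative = rmse_ann / rmse_naive           # below 1 favors the network
```

Four test points are far too few to support any conclusion. They are used here only to keep the arithmetic easy to follow.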
Although some of the above issues are unique to neural networks, others are general to any forecasting method. Therefore, good forecasting practice and principles should be followed. It is beneficial to consult Armstrong (2001), which is a good source of information on useful principles for forecasting model building, evaluation, and use.
Artificial neural networks have emerged as an important tool for business forecasting. ANNs have many desired features that are quite suitable for practical forecasting applications. This chapter provides a general overview of the neural networks for forecasting applications. Successful forecasting application areas of ANNs, as well as critical modeling issues are reviewed. It should be emphasized that each forecasting situation requires a careful study of the problem characteristics, prudent design of modeling strategy, and full consideration of modeling issues. Many rules of thumb in ANNs may not be useful for a new application, although good forecasting principles and established guidelines should be followed.
ANNs have achieved remarkable successes in the field of business forecasting. It is, however, important to note that they may not be a panacea for every forecasting task under all circumstances. Forecasting competitions suggest that no single method, including neural networks, is universally the best for all types of problems in every situation. Thus, it may be beneficial to combine several different models in improving forecasting performance. Indeed, efforts to find better ways to use ANNs for forecasting should never cease. The subsequent chapters of this book will provide a number of new forecasting applications and address some practical issues in improving ANN forecasting performance.