Training Neural Networks
By Lou Mendelsohn
The application of neural networks to financial forecasting has quickly become a hot topic in today’s globalized trading environment. With extensive technical, intermarket and fundamental data available for analysis, neural networks are well suited to pattern recognition and quantifying relationships between interrelated markets. However, neural networks are not easy to develop. Here, S&C contributor Lou Mendelsohn examines the best ways to train and test neural networks for maximum performance.
Successfully developing neural networks to implement synergistic market analysis for financial forecasting in today’s global markets requires both knowledge of the financial markets and expertise in the design and application of artificial intelligence technologies. Now let us examine the process of training and testing neural networks for synergistic analysis in which technical, fundamental and intermarket data are used to find hidden patterns and relationships within the data.
To accomplish this, a set of data facts must be selected for presentation to the network. In addition, various training parameters must be optimized during the training process, and a protocol must be devised that automates the training and testing process to assure proper training of the network. If done properly, this can yield superior performance and more accurate forecasts than rule-based technical analysis methods that rely on single-market linear modeling of market dynamics.
A fact is a single input vector and its associated output vector. A fact is typically represented as a row of related numbers where the first n numbers correspond to the n network inputs and the last m numbers correspond to the m network outputs. A fact set is a group of related facts used to train and test a neural network. Decisions about data inclusion must be made because a fact set should be representative of the problem space. For instance, should Standard & Poor’s 500 data from the October 1987 period be included in a fact set?
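The row layout just described can be sketched in a few lines of Python. This is only an illustration: the names Fact and row_to_fact, and the sample numbers, are assumptions made for the example rather than part of any particular development tool.

```python
from typing import List, NamedTuple, Sequence


class Fact(NamedTuple):
    """One example: an input vector and its associated output vector."""
    inputs: tuple
    outputs: tuple


def row_to_fact(row: Sequence[float], n_inputs: int, n_outputs: int) -> Fact:
    """Split a row of n_inputs + n_outputs numbers into a Fact."""
    if len(row) != n_inputs + n_outputs:
        raise ValueError("row length must equal n_inputs + n_outputs")
    return Fact(tuple(row[:n_inputs]), tuple(row[n_inputs:]))


# Hypothetical rows: three network inputs followed by one network output.
rows = [
    [0.12, 0.40, 0.33, 0.51],
    [0.15, 0.38, 0.35, 0.49],
]
fact_set: List[Fact] = [row_to_fact(r, n_inputs=3, n_outputs=1) for r in rows]
```

Storing each fact as a pair of tuples keeps it hashable, which makes later checks for duplicate facts straightforward.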
While internal market data on a target market is readily available, it is sometimes difficult to find appropriate fundamental data, which is often subject to revision and not always reported in a data-compatible format. Similarly, desirable intermarket data may be unavailable, depending on when each related market started trading. For example, while the yen began trading in 1972, the Nikkei 225 index only began trading as a futures contract in September 1990. To use both markets’ data in a neural network application for currency predictions forces compromises in data selection, due to the rigors of sound neural network design.
Since the back-propagation network is perhaps the most familiar to traders, it will be used to illustrate problems that can occur during training and testing and to highlight common pitfalls. Once the fact set has been selected, it is separated into training and testing subsets. Back-propagation networks typically operate in two modes. In the learning or training mode, the network modifies its internal representation by changing the values of its weights in an attempt to improve the mapping of inputs to outputs. In the recall or testing mode, the network is fed new inputs and utilizes the representation it had previously learned to generate associated outputs without changing its weights.
Because the neural network operates in these two modes, it is useful to divide the fact set into at least two subsets: a training set and an out-of-sample testing set. The facts in the training set are used during the network’s learning mode, while the facts in the testing set are used during the network’s walk-forward recall mode. The comparative performance of various nets on the test set helps identify which net should be used in the final application.
Numerous criteria can be used to determine the composition of the training and testing sets. At the very least, they should be mutually exclusive; that is, a specific fact should not reside in both sets. In addition, if two facts have exactly the same input and output values, one of these facts should be removed from the fact set before it is split into two subsets. Care must be taken when dividing the original fact set into training and testing subsets.
For instance, in an 80/20 split, some commercial development tools remove every fifth fact for inclusion in the test set rather than randomly assigning facts to the training and testing subsets. If the facts are in chronological order prior to this split, all Friday data might end up in the test set, while only Monday through Thursday data would be included in the training set. Since this is not a reasonable way to split financial market data, the facts should be randomized prior to splitting them into subsets. In any case, the facts in the training and testing sets should be randomized once the data is split.
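The weekday problem is easy to reproduce. The sketch below assumes a hypothetical fact set of 100 consecutive trading days with no holidays, where each fact is reduced to a day number and a weekday label. The function names are invented for the illustration.

```python
import random

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
# Hypothetical chronological fact set: 100 trading days with no holidays,
# each fact reduced to (day number, weekday) for illustration.
facts = [(day, WEEKDAYS[day % 5]) for day in range(100)]


def every_fifth_split(facts):
    """Systematic 80/20 split: every fifth fact goes to the test set."""
    train = [fact for i, fact in enumerate(facts) if i % 5 != 4]
    test = facts[4::5]
    return train, test


def randomized_split(facts, test_fraction=0.2, seed=0):
    """Drop exact duplicates, shuffle, then cut off the test fraction."""
    unique = list(dict.fromkeys(facts))   # keeps first occurrence, preserves order
    random.Random(seed).shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return unique[n_test:], unique[:n_test]


_, systematic_test = every_fifth_split(facts)
train, test = randomized_split(facts)

print({weekday for _, weekday in systematic_test})  # {'Fri'}: only Fridays are tested
print(set(train).isdisjoint(test))                   # True: the subsets share no fact
```

The randomized version also removes exact duplicates before splitting, so the two subsets are mutually exclusive by construction.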
Even when randomly removing facts from a set, there is a chance that all facts with a certain characteristic might be removed. One way to prevent this is to identify the most important characteristics thought to be associated with the data and determine the fact set’s underlying distribution related to these characteristics. Then the initial fact set can be split, with similar distributions present in both the training and testing sets. This can be accomplished through statistical analysis or by clustering algorithms. A thorough analysis of the fact set will also help identify outliers that might adversely affect the training process. Many times, outliers that are not well represented in the facts are removed from the training and testing sets.
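One simple way to keep similar distributions in both subsets is to split each group of facts separately. The following is a minimal sketch, assuming the important characteristic can be computed by a function supplied by the developer. The name stratified_split and the up-day/down-day grouping are illustrative assumptions, not a prescribed method.

```python
import random
from collections import defaultdict


def stratified_split(facts, characteristic, test_fraction=0.2, seed=0):
    """Split facts so that each group defined by characteristic(fact)
    contributes roughly the same fraction to the test set."""
    groups = defaultdict(list)
    for fact in facts:
        groups[characteristic(fact)].append(fact)

    rng = random.Random(seed)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical facts: (daily price change, target), grouped into up and down days.
facts = [(change, 0.0) for change in [-3, -2, -1, -1, -1, 1, 1, 2, 2, 3] * 5]
train, test = stratified_split(facts, characteristic=lambda fact: fact[0] > 0)
```

With very small groups the rounding can leave a group out of the test set entirely, so group sizes are worth checking before relying on a split like this.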
It is best to experiment with a variety of data handling methods before settling on one. We have developed a training-testing procedure in which the fact set is split into three mutually exclusive sets rather than just two. While there is still a training set and a testing set, a second testing set is also created. It contains examples of those facts that are determined to be most important in judging network performance, so it can be used to compare various networks.
TRAINING AND TESTING
After fact selection is complete, the training process can be initiated. First, the initial weights must be set. The weights are the mechanism that allows the network to adjust its internal representation when modeling a problem. If all weights in the network are initially set to the same value and the solution to the problem requires unequal weights, the network will never learn, because error is propagated back through the weights in proportion to their values, so neurons that start with identical weights receive identical corrections and remain identical. As a result, small random weights are used to initialize the network.
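A minimal sketch of small random initialization follows. The uniform range of plus or minus 0.1 and the function name init_weights are assumptions chosen for illustration; suitable ranges vary by tool and problem.

```python
import random


def init_weights(n_inputs, n_neurons, scale=0.1, seed=None):
    """Return an n_neurons x n_inputs matrix of small random weights
    drawn uniformly from [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_neurons)]


# One row of weights per hidden neuron, one column per network input.
hidden_weights = init_weights(n_inputs=4, n_neurons=3, seed=42)
```

Because every neuron starts from different weights, each receives a different correction during training.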
The network learns by changing its weights, based on error information back-propagated from the output layer. Each time the weights change, the network is taking a step on a multidimensional surface, which represents the overall error space. During training the network is traversing the surface to find the lowest point or minimum error. The weight changes are proportional to a constant called the learning rate. The largest possible learning rate that does not cause oscillation should be selected.
As an example of oscillation, imagine that a network’s position is halfway down a valley on a two-dimensional error surface (Figure 1). If the learning rate is too large, the network’s next step might be to the other side of the valley, as opposed to moving closer toward the bottom. Then the following step might return to the original side. In this example, the network would not be making any progress toward the bottom where the solution lies. Conversely, if the learning rate is too small, training could take too long to get to the bottom of the valley. Since each problem space has a unique error surface, different learning rates are used to strike the best balance between training time and overall error reduction.
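The valley example can be reproduced numerically. The sketch below assumes a deliberately simple error surface, E(w) = w squared, with a single weight. Real error surfaces are far more complex, and the three learning rates are picked only to show the three behaviors.

```python
def descend(learning_rate, start=1.0, steps=10):
    """Gradient descent on the one-weight error surface E(w) = w**2."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * w              # slope of the error surface at w
        w -= learning_rate * gradient   # step size is proportional to the learning rate
        path.append(w)
    return path


too_large = descend(learning_rate=1.0)   # jumps between w = 1 and w = -1 indefinitely
too_small = descend(learning_rate=0.01)  # creeps slowly toward the bottom at w = 0
moderate = descend(learning_rate=0.4)    # reaches the bottom in a few steps
```

With the largest rate, each step lands on the opposite wall of the valley at the same height, so the error never falls.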
Another standard parameter associated with training is momentum. The learning rules used in some commercial development tools include a momentum term that acts as a filter to reduce oscillatory behavior. Thus, higher learning rates can be used to obtain solutions similar to those found with lower learning rates without increasing training time. Certain tools allow additional parameters such as temperature, gain and noise to be modified during the training process.
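In its common form, a momentum term adds a fraction of the previous weight change to the current one. The sketch below shows only the shape of that update on the illustrative surface E(w) = w squared. The function names and parameter values are assumptions for the example, and individual tools may implement momentum differently.

```python
def train_with_momentum(gradient_fn, w, learning_rate, momentum, steps):
    """Gradient descent in which each weight change blends in the previous change."""
    previous_change = 0.0
    for _ in range(steps):
        change = -learning_rate * gradient_fn(w) + momentum * previous_change
        w += change
        previous_change = change
    return w


def gradient(w):
    """Gradient of the illustrative error surface E(w) = w**2."""
    return 2.0 * w


# With momentum set to zero, the rule reduces to plain gradient descent.
plain = train_with_momentum(gradient, w=1.0, learning_rate=0.1, momentum=0.0, steps=1)
final = train_with_momentum(gradient, w=1.0, learning_rate=0.1, momentum=0.5, steps=100)
```

Because each change carries part of the previous one, steps that keep reversing direction partly cancel, while steps in a consistent direction build on one another.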
The sheer number of decisions to be made when developing a neural network necessitates automating the training and testing process. This is especially true for setting training parameters, selecting preprocessing and choosing the number of hidden layers and neurons. To expedite parameter space searches, we use in-house development tools. Genetic algorithms are one example. These algorithms are effective for many parameter optimization tasks. They use simple mechanisms analogous to those used in genetics to breed populations of superior solutions to optimization problems.
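To give a flavor of how such a search operates, here is a toy genetic algorithm over two parameters. It is a generic sketch and not the in-house tool mentioned above. The function stand_in_test_error is an invented placeholder for the expensive step of training a network and measuring its test error, and all settings are illustrative assumptions.

```python
import random


def stand_in_test_error(learning_rate, momentum):
    """Placeholder for training a network with these parameters and returning
    its test error. The formula is invented for illustration only."""
    return (learning_rate - 0.3) ** 2 + (momentum - 0.8) ** 2


def evolve(error_fn, population_size=20, generations=30, mutation_scale=0.05, seed=0):
    """Tiny genetic search over (learning_rate, momentum) pairs in [0, 1] x [0, 1]."""
    rng = random.Random(seed)
    population = [(rng.random(), rng.random()) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the half of the population with the lowest error.
        population.sort(key=lambda pair: error_fn(*pair))
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: take each parameter from one parent or the other.
            child = [rng.choice(genes) for genes in zip(mother, father)]
            # Mutation: small Gaussian nudge, clipped back into [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, mutation_scale)))
                     for g in child]
            children.append(tuple(child))
        population = parents + children
    return min(population, key=lambda pair: error_fn(*pair))


best_learning_rate, best_momentum = evolve(stand_in_test_error)
```

Because the better half of each generation survives unchanged, the best solution found so far is never lost.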
Certain forms of simulated annealing have also been found to be useful for automating learning rate adjustments during training. This method of training simulates annealing by including a temperature term that directly affects the learning rate. In simulated annealing, temperature refers to the energy of a neural network. The temperature begins at a high level, allowing the network both to learn quickly and to move quickly over the error surface. The temperature then drops as training proceeds. When the network cools, learning becomes less rapid and the network settles upon a near-optimum solution. Figure 2 depicts a two-dimensional example of simulated annealing, in which the step size is reduced to avoid oscillation while finding a minimum point on the error surface.
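The cooling idea can be sketched by scaling the learning rate with a temperature that falls after every step. This is a simplified illustration on the surface E(w) = w squared, not a full simulated annealing implementation. The geometric cooling factor of 0.8 and the function name anneal are assumptions chosen by hand.

```python
def anneal(start=1.0, initial_rate=1.0, cooling=0.8, steps=25):
    """Gradient descent on E(w) = w**2 with a learning rate scaled by a
    temperature that falls geometrically."""
    w, temperature = start, 1.0
    path = [w]
    for _ in range(steps):
        learning_rate = initial_rate * temperature
        w -= learning_rate * 2.0 * w   # the gradient of w**2 is 2w
        temperature *= cooling         # cool down after every step
        path.append(w)
    return path


path = anneal()
```

The first step overshoots to the opposite wall of the valley, as it would with a fixed rate of 1.0. As the temperature drops, the steps shrink and the weight settles near the minimum. Cooling too quickly carries its own risk, since the steps can become tiny before the minimum is reached.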
In neural networks, one of the major pitfalls is overtraining, analogous to curve fitting for rule-based trading systems. Overtraining occurs when a network has learned not only the basic mapping associated with input and output data, but also the subtle nuances and even the errors specific to the training set. If too much training occurs, the network only memorizes the training set and loses its ability to generalize to new data. The result is a network that performs well on the training set but performs poorly on out-of-sample test data and later during actual trading.
The simplest way to avoid overtraining is to devise an automated training-testing routine in which the network training is halted periodically at predetermined intervals, and the network is then run in recall mode on the test set to evaluate the network’s performance on various error criteria. Then the training is continued from the point at which it was halted. This process continues iteratively without human intervention, with interim results that meet the error criteria saved for later use. When the performance on the test set begins to degrade, it can be assumed that the network has begun to overtrain. The best saved network configurations up to this point are then used for further evaluation. Through automation, testing is incorporated as an integral facet of the training process, rather than a procedure that is performed afterward.
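The routine described above might be organized along the following lines. This is a sketch under stated assumptions: the three callbacks stand for whatever the development tool provides, the patience setting (how many non-improving checks to tolerate) is an added illustrative parameter, and the stand-in network simply follows an invented error curve.

```python
def train_with_periodic_testing(train_one_interval, test_error, snapshot,
                                max_intervals=100, patience=3):
    """Alternate training and testing. Stop once the test error has failed to
    improve for `patience` consecutive checks and return the best saved state."""
    best_error, best_state, checks_since_best = float("inf"), None, 0
    for _ in range(max_intervals):
        train_one_interval()      # learning mode: weights change
        error = test_error()      # recall mode: weights stay fixed
        if error < best_error:
            best_error, best_state, checks_since_best = error, snapshot(), 0
        else:
            checks_since_best += 1
            if checks_since_best >= patience:
                break             # test performance is degrading
    return best_state, best_error


# Stand-in network: test error falls until interval 10, then rises (invented curve).
state = {"intervals_trained": 0}


def train_one_interval():
    state["intervals_trained"] += 1


def test_error():
    return (state["intervals_trained"] - 10) ** 2 / 100 + 0.2


def snapshot():
    return dict(state)


best_state, best_error = train_with_periodic_testing(
    train_one_interval, test_error, snapshot)
```

Here training runs for 13 intervals, but the configuration saved at interval 10, where the test error was lowest, is the one returned for further evaluation.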
The choice of metrics to be used for testing should also be considered. There are many ways to evaluate a network’s performance on test data. Assume, for example, that a network has been designed to predict the high for the next day. One measure might be the difference between the actual high and the network’s output. This value would be determined for each fact in the test set and then summed and divided by the number of facts in the test set. This is a standard error measure called average error. Examples of error measures based on the distance from the target value include average absolute error, sum-of-squares error and root mean squared (RMS) error.
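These distance-based measures are simple to compute. The sketch below uses made-up highs and network outputs, and the function name error_measures is an assumption for the example.

```python
import math


def error_measures(actual, predicted):
    """Compute the distance-based error measures for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    sum_of_squares = sum(e * e for e in errors)
    return {
        "average_error": sum(errors) / n,
        "average_absolute_error": sum(abs(e) for e in errors) / n,
        "sum_of_squares_error": sum_of_squares,
        "rms_error": math.sqrt(sum_of_squares / n),
    }


# Hypothetical next-day highs and network outputs (made-up numbers).
actual_highs = [101.0, 102.0, 100.0, 103.0]
predicted_highs = [100.0, 103.0, 100.0, 101.0]
measures = error_measures(actual_highs, predicted_highs)
```

For these numbers the average error is 0.5 while the average absolute error is 1.0. The gap arises because positive and negative errors cancel in the plain average, which is one reason to look at more than one measure.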
Other measures can be devised that calculate how often the network predicts a move in the right direction or how well network predictions match the shape of the actual price movement over the same period. In addition, if a neural network is developed to generate trading signals rather than make price predictions, criteria such as maximum drawdown, net profit and percentage profitable trades can be used as testing error measures. Since many off-the-shelf neural network development tools are limited with respect to the types of errors that can be back-propagated through the network during training, algorithms that implement custom error functions directly into the training process are required. By tailoring these functions to the specific application and outputs, real world neural network performance can be improved.
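Two of these alternative criteria are sketched below as test-time measures only. They do not show how to back-propagate a custom error function. The function names and all numbers are illustrative assumptions.

```python
def directional_accuracy(previous, actual, predicted):
    """Fraction of facts where the predicted move away from `previous`
    has the same sign as the actual move. Unchanged moves count as misses."""
    hits = sum(
        1 for prev, act, pred in zip(previous, actual, predicted)
        if (act - prev) * (pred - prev) > 0
    )
    return hits / len(actual)


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline in an equity curve, in the curve's own units."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst


# Made-up prices and account equity for illustration.
previous_close = [100.0, 101.0, 102.0, 101.0]
actual_close = [101.0, 102.0, 101.0, 103.0]
predicted_close = [100.5, 100.0, 101.5, 102.0]
equity = [1000.0, 1100.0, 1050.0, 900.0, 1200.0]

print(directional_accuracy(previous_close, actual_close, predicted_close))  # 0.75
print(max_drawdown(equity))                                                 # 200.0
```

The network in this example gets the direction right three times out of four, even though none of its price predictions is exact.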
Certain problems can easily be avoided during the training stage of neural network development. Here are some general suggestions when training and testing neural networks:
Facts that best represent those elements that the neural network is to model should be selected.
Various training and testing fact sets should be constructed during development.
A training/testing methodology should be clearly defined to conduct a rigorous comparison of various networks as the architectures, selection of raw data inputs, preprocessing and training parameters are refined.
Initial weights should be randomized.
Learning rates and momentum should be adjusted through experimentation.
Testing should be alternated with training to detect and avoid overtraining.
Error metrics that best measure those characteristics most important in the final application should be incorporated into the testing methodology.
Automation should be employed to increase network performance while reducing training time.
The last point cannot be overstated. The development of a successful neural network requires considerable time and effort. Even with extensive in-house research and development tools and access to a multitude of commercial tools, successful neural net development to implement synergistic market analysis for financial forecasting is a time-consuming and labor-intensive task that requires expertise in several domains.
Next, I will address network implementation, in which neural networks are incorporated into information and trading systems, as well as the results of some experiments that utilize various concepts that have been discussed.
Lou Mendelsohn, 813 973-0496, designs and tests neural trading systems for the financial industry. He is president of Market Technologies Corporation, of Wesley Chapel, FL, an AI research, software development and consulting firm.
Hecht-Nielsen, R. [1990]. Neurocomputing. Addison-Wesley Publishing Company, Inc.
Mendelsohn, Lou. “Preprocessing data for neural networks.” STOCKS & COMMODITIES. October.
_____. “Neural network development for financial forecasting.” STOCKS & COMMODITIES. September.
_____. “The basics of developing a neural trading system.” Technical Analysis of STOCKS & COMMODITIES. Volume 9: June.
Rumelhart, D.E., and J.L. McClelland [1986]. Parallel Distributed Processing, Volumes 1 & 2. The Massachusetts Institute of Technology.
Reprinted from Technical Analysis of Stocks & Commodities magazine. (C) 1993 Technical Analysis, Inc., 4757 California Avenue S.W., Seattle, WA 98116-4499, (800) 832-4642.