IES20020061A2 - Feature selection for neural networks - Google Patents

Feature selection for neural networks

Info

Publication number
IES20020061A2
Authority
IE
Ireland
Prior art keywords
features
feature
core
candidate
model
Prior art date
Application number
IE20020061A
Inventor
John Carney
Original Assignee
Predictions Dynamics Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Predictions Dynamics Ltd filed Critical Predictions Dynamics Ltd
Priority to IE20020061A priority Critical patent/IES20020061A2/en
Publication of IES20020061A2 publication Critical patent/IES20020061A2/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2115Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination

Abstract

Features for training and run-time use of a prediction model are selected. A set of core features certain to be relevant is initially identified (2), as is a set of candidate features which are possibly relevant. In a first phase, a performance score is determined (5) for training vectors, each comprising the core features and one candidate feature. The candidate feature for the vector providing the best score is chosen (7). This feature is added to the set of core features, and a new phase is performed. New phases are commenced until there is no score improvement.

Description

INTRODUCTION

Field of the Invention

The invention relates to prediction models having neural network or statistical model engines, or other means to generate prediction outputs based on input data.
Prior Art Discussion

Many prediction models are generated by providing a prototype model and training the model by inputting at least one training set of training vectors. Each training vector comprises a value for each of a number of features (or "factors"), together with a target (correct) value for the prediction. For example, a training vector for a weather prediction model may have a value for humidity, rainfall and temperature at each of ten meteorological stations on a particular day. Thus, there are 30 features, some of which have little or no contribution to predicting the weather in a certain area. Thus, to train (build) a good prediction model it is important to choose the best set of features for the training vectors.
Heretofore, feature selection has often been performed manually, based on the skill and experience of the person involved in developing the model. However, this approach is time-consuming and is error-prone.
Automation of feature selection has also been described in the art. For example, US6038533 (Lucent) describes a method of selecting a subset of data. Sets of feature vectors are mapped into matrices according to natural or preselected divisions. A processor processes the matrices to determine a near-optimum sub-matrix. This approach appears to suffer from being very processor-intensive.
US Patent No. US517962 (Hitachi) describes a speech recognition apparatus which has a number of neural networks. Each network extracts specific features from an input signal. A fuzzy logic circuit determines the feature having the greatest certainty. This approach appears to assist with real-time operation of this type of model. However, it does not appear that it would be of more general benefit, or that it would be of assistance during training.
The invention is therefore directed towards providing an improved method and system for feature selection in prediction models.
SUMMARY OF THE INVENTION

According to the invention, there is provided a method for selecting features for input data to a prediction model, the method being implemented by a computerised system and comprising the steps of: receiving at an interface a set of core features that are certainly relevant for operation of the prediction model; receiving at the interface a set of candidate features which may be relevant for operation of the prediction model; iteratively operating the model with training vectors, wherein each iteration is performed with a vector having the core features and at least one of the candidate features; and comparing prediction outputs of the iterations and selecting the candidate features used to achieve the best prediction results in combination with the core features.

In one embodiment, each iteration is performed with a training vector comprising the core features and one candidate feature.
In another embodiment, all candidate features are used in iterations of one phase, and a plurality of phases are performed in which: the best candidate feature of the previous phase is added to the set of core features; a current phase is implemented with the fresh set of core features and each candidate feature in iterations; performance is evaluated and compared with the performance of the previous phase; if the current performance is better, a new phase is implemented; and if the current performance is not better, all core features of the previous phase are selected.
In one embodiment, the model comprises an ensemble of neural networks connected at their outputs.
In another embodiment, the ensemble does not overfit the training data.
In a further embodiment, the model comprises a plurality of ensembles and performance for each feature vector is determined by summing the outputs of the ensembles and calculating an average staged generalisation error.
In one embodiment, the invention comprises the further steps of subsequently: (a) generating different initial conditions for the model; (b) selecting features with the different initial conditions; (c) repeating steps (a) and (b) a plurality of times; and (d) ranking the features on the basis of their selection.
In one embodiment, feature selection data is written to a table, and features are ranked according to the number of times they have been selected.
The invention also provides a model development system comprising means for selecting features in a method as defined above.
DETAILED DESCRIPTION OF THE INVENTION

Brief Description of the Drawings

The invention will be more clearly understood from the following description of some embodiments thereof, given by way of example only with reference to the accompanying drawings in which:

Fig. 1 is a flow diagram illustrating an automated feature selection method of the invention; and

Fig. 2 is a flow diagram illustrating a development of the method illustrated in Fig. 1.

Description of the Embodiments

Referring to Fig. 1, a method 1 is illustrated for selection of the most relevant set of features for generation of a neural network prediction model. Of course, this information is subsequently of enormous benefit to the user of the model after it has been developed.
In an initialisation step 2, a model development system receives from a user a set SF of core features which are certain to be relevant. It also receives a set R of candidate features which may be relevant, but of which the user is not certain.
In step 3 the system selects a candidate feature Fi, and in step 4 a training set is assembled comprising the core features and the selected candidate feature. In step 5, the system inputs this training set to a prototype prediction model and generates a score based on the estimated prediction performance of this model. The accuracy of this score is important: a score generated with an unstable prototype model will mislead the search process. To ensure the accuracy and stability of this score function, an ensemble of neural networks is used as the prototype prediction model. The training method corrects for both bias and variance. In the method, a prediction model is generated by training an ensemble of multiple neural networks and estimating the performance error of the ensemble. In a subsequent stage a further ensemble is trained using an adapted training set so that the bias component of the preceding performance error is modelled and compensated for in the new ensemble. In each successive stage the error is compared with that of all of the preceding ensembles combined. No further stages take place when there is no improvement in error. Within each stage, the optimum number of iterative weight updates is determined, so that the variance component of performance error is minimised. This method is described in our concurrent International patent application having the same filing date as this application and entitled "Neural Network Training".

Step 6 compares the score with the best score and, if it is better, step 7 updates the best score and adds the current candidate feature Fi to the set of core features SF. As indicated by the decision step 8, steps 3 to 7 are repeated for each candidate feature Fi. When all candidate features have been exhausted, the system in step 9 determines if there has been an improvement with this set of core features over the last set of core features.
If so, in step 10 the steps 3 to 9 are repeated for a fresh core set SF and candidate set R, in which SF now includes a feature previously in R. If there is no change for the fresh set, the process is terminated with the current core set being returned, as indicated by the steps 9 to 11.
Thus, the invention provides a feature selection process which is a development of forward sequential search (FSS) that starts with a core set of features that the model builder is confident will be relevant. Additional features are added to this set from a set of possibly-relevant features until the addition of features does not produce an improvement in the predictiveness of the feature set.
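The forward sequential search described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: `score_fn` stands in for the ensemble-based error function (lower is better), and the function and parameter names are introduced here for illustration.

```python
def forward_select(core, candidates, score_fn):
    """Forward sequential search starting from a trusted core set.

    core: features certain to be relevant
    candidates: possibly-relevant features
    score_fn: maps a feature subset to an error score (lower is better)
    """
    selected = list(core)
    remaining = list(candidates)
    best = score_fn(selected)
    improved = True
    while improved and remaining:
        improved = False
        best_feature = None
        for f in remaining:
            s = score_fn(selected + [f])
            if s < best:  # this candidate improves on the current subset
                best, best_feature, improved = s, f, True
        if improved:
            # promote the phase's best candidate into the core set
            selected.append(best_feature)
            remaining.remove(best_feature)
    return selected
```

Each pass of the `while` loop corresponds to one phase: every remaining candidate is scored alongside the current core set, the best one is promoted, and the search stops when no addition improves the score.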
A sample scenario is as follows. The user determines 5 input features that are certainly relevant, and 20 other features that are possibly relevant. 300 training patterns are available, for example 600 days of data with values for each possible input feature and a value for the output feature (prediction target).
Each of the 20 features is considered in turn, being added to the 5 core features. Models are developed with each of these 20 sets of 6 features and the accuracy of these models is measured using the data.
The set of 6 features that produces the best model is identified and taken as the starting point for a next phase to identify a set of 7 features.
The process continues until the best set of n+1 features is not an improvement on the set of n features.
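As a rough cost estimate for the scenario above (my arithmetic, not the patent's): phase one builds 20 candidate models, phase two builds 19, and so on, so the total number of model builds grows with the number of completed phases. A small helper makes this concrete; `model_builds` is a hypothetical name introduced here for illustration.

```python
def model_builds(n_candidates, phases):
    """Number of models trained when each phase evaluates every
    remaining candidate and then promotes exactly one of them."""
    # one model per remaining candidate in each phase
    return sum(n_candidates - i for i in range(phases))

# e.g. 3 completed phases over 20 candidates: 20 + 19 + 18 = 57 builds
```

This linear-per-phase cost is what makes the approach far cheaper than evaluating all 2^20 candidate subsets exhaustively.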
This set of n features is returned.

The following is pseudo code for the development system illustrated in Fig. 1 in more detail.
Function FEATURE-SUBSET-SELECTION(Core-feature-list, Possible-features, Historic-data, E, S, B)
Returns SF                      // selected features
Inputs:
    Core-feature-list           // features that must be included
    Possible-features           // candidate features that might be included
    Historic-data               // data samples with values for core-feature-list
                                // and possible-features, and values for the
                                // dependent variable
    E                           // maximum number of epochs (i.e. number of iterative weight updates)
    S                           // maximum number of stages
    B                           // maximum number of networks in an ensemble
Functions:
    ENSEMBLE-ERR-FN             // function to determine the error for a feature mask

SF <- Core-feature-list
R <- Possible-features
Best-score <- ENSEMBLE-ERR-FN(Historic-data, SF, E, S, B)
Repeat
    Flag <- FALSE
    For each Fi in R                    // for each of the possible remaining features
        SFi <- SF u {Fi}                // SFi, a new feature subset to test
        Scorei <- ENSEMBLE-ERR-FN(Historic-data, SFi, E, S, B)   // score that feature subset
        If Scorei < Best-score then
            Best-score <- Scorei        // remember new score
            SF' <- SFi                  // remember new feature set
            F <- Fi                     // remember new feature
            Flag <- TRUE
        End-If
    End-For
    // If an improved feature subset was found by adding one of the remaining features:
    If (Flag = TRUE)
        SF <- SF'                       // keep this new feature subset
        R <- R \ {F}                    // remove F from the remaining features to consider
    End-If
While (Flag = TRUE)
// SF now contains the core-feature-list and some possible
// features that were found to improve prediction.
Return SF

Scoring a Feature Subset

In order for the feature subset selection process to be robust it is important that the score the ENSEMBLE-ERR-FN assigns to a feature subset should be reliable. This is ensured by building an ensemble to provide a stable estimate of the error for that feature subset.
Since the ensemble has been designed not to overfit the training data, the score for the ensemble can be determined from the training data without the need for separate test data. This reduces the amount of data required for the feature selection process, as it eliminates the need for a separate validation data set.
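The stability argument behind ensemble scoring can be illustrated with a toy experiment. This is my illustration with made-up noise figures, not the patent's training method: each "model" returns a true error of 1.0 plus Gaussian noise, and averaging B independent estimates shrinks the variance of the score by roughly a factor of B.

```python
import random

def noisy_model_score(rng):
    # stand-in for one unstable network's error estimate
    return 1.0 + rng.gauss(0.0, 0.5)

def ensemble_score(rng, b):
    # average b independent noisy estimates, as an ensemble does
    return sum(noisy_model_score(rng) for _ in range(b)) / b

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
singles = [noisy_model_score(rng) for _ in range(2000)]
ensembles = [ensemble_score(rng, 10) for _ in range(2000)]
# the ensemble score varies far less run-to-run than a single model's score
```

A search guided by the low-variance ensemble score is therefore much less likely to be misled by training noise when comparing feature subsets.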
The following is pseudo code for determining the error for a feature subset. The code is explained by comments following "//", and the term "NeuralDVB" means the neural network training method described above, which comprises building stages of ensembles having training sets, and after each stage adapting the training set so that bias is identified and compensated for.
Function ENSEMBLE-ERR-FN(Data, SF, E, S, B)
Returns Err                     // error score for feature subset SF
Inputs:
    Data                        // data samples
    SF, E, S, B                 // as above
Functions:
    MASKED-DATA                 // produces a data-set with input features from SF only
    NeuralDVB                   // function which builds the optimised models

Err <- 0
T <- MASKED-DATA(Data, SF)
N <- |T|
W* <- NeuralDVB(E, S, B, T)
stages <- |W*|                  // determine the number of stages in the trained model
                                // and break out the sets of weights for each stage
For each j = 1 to stages        // for each of the stages in the overall model
    Mj <- PROP-STAGE(N, j, T, Wj)   // determine the set of predictions for that stage.
        // PROP-STAGE is a function which propagates training vectors through an
        // ensemble just trained and stores the ensemble responses for each
        // training vector; this can be broken out into individual predictions.
End-For
// Sum model outputs across stages for each data point
For n = 1 to N
    Sn <- sum over j of Mj,n
End-For
// Calculate the average staged ensemble generalisation error from the summed outputs
Return Err

It will be appreciated that the feature subset selection process uses a forward sequential selection approach that starts with a subset of possible features. Forward sequential selection is an approach in which the model itself is used to evaluate the different feature subset alternatives. It is well known that the instability of neural networks makes them poor candidates for this task, because small changes in the training process can produce radically different NN models. Thus the accuracy an individual NN shows with a particular feature subset is not a reliable indication of the quality of that feature subset, since the accuracy of the NN has high variance.
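The staged summation in ENSEMBLE-ERR-FN can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: each stage's ensemble is assumed to contribute an additive correction to the previous stages, and mean squared error stands in for the specification's staged generalisation error, whose exact formula is not reproduced above.

```python
def staged_error(stage_outputs, targets):
    """stage_outputs: one list of per-sample predictions per stage.
    The model's prediction for sample n is the sum over stages (each
    stage models the residual bias left by the preceding stages).
    Returns the average squared error over the data set."""
    n = len(targets)
    # sum the stage outputs for each data point
    summed = [sum(stage[i] for stage in stage_outputs) for i in range(n)]
    # average squared error of the summed predictions
    return sum((summed[i] - targets[i]) ** 2 for i in range(n)) / n
```

For example, with a first stage predicting [1.0, 2.0] and a bias-correcting second stage predicting [0.5, -0.5], the combined prediction for both samples is 1.5.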
The process evaluates the quality of a feature mask using an ensemble of neural networks combined at their outputs: ensembles have low variance and thus give a more reliable estimate. Another advantage is that the process is a reliable means of training the ensemble without overfitting, using small amounts of training data.
Referring now to Fig. 2, a feature selection method 20 comprises multiple executions of the method 1 to generate a matrix such as that of Table 1 below. The method 1 is executed to provide one row of data for the table; the training data is shuffled in a step 21 and repartitioned in a step 22, and the method 1 is repeated. The data is stored in step 23 and, as indicated by the decision step 24, steps 21 to 23 are repeated up to a pre-set desired number of times (15 in this embodiment). In step 25 the data of the table is used to rank the features.
The method 1 is effective at minimising instability. In some situations such as where the amount of training data is limited, the feature selection process may produce different feature subsets given different starting conditions (for example different partitions of the training data). The method 20 addresses this problem and has the added advantage that it returns a ranking of the features - something that is of great interest to users.
The multiple runs of the method 1 must have different initial conditions that will result in different feature subsets. This is implicit for most neural network training methods as weights in neural networks are initialised to small random values. If more diversity is required, then this can be achieved by generating different partitions of the training data by shuffling the training data between runs as described above.
The result of each run is stored as a row in the table, with a 1 in a column when a feature (F1, F2 ... F20) has been selected and a 0 otherwise. After the 15 runs the numbers in the columns are summed, and the sum for each column gives a rank for the features. In the example in the table, feature F10 was selected in each run and is the top-ranking feature with a score of 15; the second-ranking feature is F3 with a score of 12, and so forth.
Using the method 20, the feature ranking can be returned to the user, allowing the user to decide which set to select. Alternatively the system can automatically select the features selected in more than half the runs (F3, F4, F6, F10, F11, F12). As a further alternative, the features can be added to the feature subset in order of rank and evaluated. This process terminates when the addition of a feature produces no improvement. In this example the order of evaluation would be: F10, F10+F3, F10+F3+F11, F10+F3+F11+F12, etc.
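The ranking step can be sketched in Python. This is a minimal sketch of the column-summing logic described above, not the patented system; `rank_features` is a name introduced here for illustration.

```python
def rank_features(selection_matrix, names):
    """selection_matrix: one 0/1 row per run, one column per feature.
    Returns the per-feature selection counts, the features in rank
    order (highest count first, ties kept in original order), and
    the features selected in more than half the runs."""
    totals = [sum(row[j] for row in selection_matrix)
              for j in range(len(names))]
    order = sorted(range(len(names)), key=lambda j: -totals[j])
    ranked = [names[j] for j in order]
    majority = [names[j] for j in range(len(names))
                if totals[j] > len(selection_matrix) / 2]
    return totals, ranked, majority
```

With the 15-run matrix of Table 1 this yields F10 first (count 15), F3 second (count 12), and the majority set F3, F4, F6, F10, F11, F12.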
[Table 1: a 15 x 20 matrix of 0/1 selections, one row per run (Run 1 to Run 15) and one column per feature (F1 to F20); column totals below.]

Feature:  F1  F2  F3  F4  F5  F6  F7  F8  F9 F10 F11 F12 F13 F14 F15 F16 F17 F18 F19 F20
Total:     2   5  12   8   5   8   5   5   2  15  11  10   7   5   3   3   1   0   0   2

The invention is not limited to the embodiments described but may be varied in construction and detail. For example, the ensemble error function described in the embodiment is not essential and alternative ensemble-based error functions may be used. Also, the method may be applied to prediction models having technologies other than neural networks.

Claims (5)

Claims
1. A method for selecting features for input data to a prediction model, the method being implemented by a computerised system and comprising the steps of: receiving at an interface a set of core features that are certainly relevant for operation of the prediction model; receiving at the interface a set of candidate features which may be relevant for operation of the prediction model; iteratively operating the model with training vectors, wherein each iteration is performed with a vector having the core features and at least one of the candidate features; and comparing prediction outputs of the iterations and selecting the candidate features used to achieve the best prediction results in combination with the core features.
2. A method as claimed in claim 1, wherein each iteration is performed with a training vector comprising the core features and one candidate feature, and wherein all candidate features are used in iterations of one phase, and a plurality of phases are performed in which: the best candidate feature of the previous phase is added to the set of core features; a current phase is implemented with the fresh set of core features and each candidate feature in iterations; performance is evaluated and compared with the performance of the previous phase; if the current performance is better, implementing a new phase; and if the current performance is not better, selecting all core features of the previous phase.
3. A method as claimed in any preceding claim, wherein the model comprises an ensemble of neural networks connected at their outputs, and wherein the ensemble does not overfit the training data, and wherein the model comprises a plurality of ensembles and performance for each feature vector is determined by summing the outputs of the ensembles and calculating an average staged generalisation error.
4. A method as claimed in any preceding claim, comprising the further steps of subsequently: (a) generating different initial conditions for the model; (b) selecting features with the different initial conditions; (c) repeating steps (a) and (b) a plurality of times; and (d) ranking the features on the basis of their selection, and wherein feature selection data is written to a table, and features are ranked according to the number of times they have been selected.
5. A development system comprising means for implementing a method as claimed in any preceding claim.
IE20020061A 2001-01-31 2002-01-31 Feature selection for neural networks IES20020061A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
IE20020061A IES20020061A2 (en) 2001-01-31 2002-01-31 Feature selection for neural networks

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IE20010076 2001-01-31
IE20020061A IES20020061A2 (en) 2001-01-31 2002-01-31 Feature selection for neural networks

Publications (1)

Publication Number Publication Date
IES20020061A2 true IES20020061A2 (en) 2002-08-07

Family

ID=11042724

Family Applications (1)

Application Number Title Priority Date Filing Date
IE20020061A IES20020061A2 (en) 2001-01-31 2002-01-31 Feature selection for neural networks

Country Status (4)

Country Link
EP (1) EP1405263A2 (en)
AU (1) AU2002230050A1 (en)
IE (1) IES20020061A2 (en)
WO (1) WO2002061678A2 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100367300C (en) * 2006-07-07 2008-02-06 华中科技大学 Characteristic selecting method based on artificial nerve network
FI20070159A0 (en) * 2007-02-23 2007-02-23 Teknillinen Korkeakoulu Procedure for integration of information, choice and learning of representation
CN107480686B (en) * 2016-06-08 2021-03-30 阿里巴巴集团控股有限公司 Method and device for screening machine learning characteristics
EP3562162A1 (en) * 2018-04-27 2019-10-30 InterDigital VC Holdings, Inc. Method and apparatus for video encoding and decoding based on neural network implementation of cabac
CN109934255B (en) * 2019-01-22 2023-05-30 小黄狗环保科技有限公司 Model fusion method suitable for classification and identification of delivered objects of beverage bottle recycling machine
CN114268625B (en) * 2020-09-14 2024-01-02 腾讯科技(深圳)有限公司 Feature selection method, device, equipment and storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US5359699A (en) * 1991-12-02 1994-10-25 General Electric Company Method for using a feed forward neural network to perform classification with highly biased data
CA2161655A1 (en) * 1993-04-30 1994-11-10 James David Keeler Method and apparatus for determining the sensitivity of inputs to a neural network on output parameters

Also Published As

Publication number Publication date
EP1405263A2 (en) 2004-04-07
IE20020062A1 (en) 2002-08-07
WO2002061678A3 (en) 2004-01-22
AU2002230050A1 (en) 2002-08-12
WO2002061678A2 (en) 2002-08-08

Similar Documents

Publication Publication Date Title
Troughton et al. A reversible jump sampler for autoregressive time series, employing full conditionals to achieve efficient model space moves
Bisht Hybrid genetic-simulated annealing algorithm for optimal weapon allocation in multilayer defence scenario
CN114090663B (en) User demand prediction method applying artificial intelligence and big data optimization system
Sun et al. Spectr: Fast speculative decoding via optimal transport
CN113011471A (en) Social group dividing method, social group dividing system and related devices
IES20020061A2 (en) Feature selection for neural networks
Muñoz et al. Enabling nas with automated super-network generation
Marino et al. Hard optimization problems have soft edges
Katsikas et al. Genetically determined variable structure multiple model estimation
CN115878094B (en) Code searching method, device, equipment and storage medium
binti Oseman et al. Data mining in churn analysis model for telecommunication industry
CN115150152B (en) Network user actual authority quick reasoning method based on authority dependency graph reduction
Fischer et al. Improved bounds for testing Dyck languages
Thomson et al. On funnel depths and acceptance criteria in stochastic local search
IE83593B1 (en) Feature Selection for Neural Networks
Rodriguez et al. An IR-based artificial bee colony approach for traceability link recovery
Baita et al. Genetic algorithm with redundancies for the vehicle scheduling problem
Hong et al. Parameter estimation based on stacked regression and evolutionary algorithms
Lourenço et al. Evolving energy demand estimation models over macroeconomic indicators
Dube et al. Runtime Prediction of Machine Learning Algorithms in Automl Systems
Abbas et al. An adaptive evolutionary algorithm for Volterra system identification
Sukthanker et al. Weight-Entanglement Meets Gradient-Based Neural Architecture Search
Chebykin et al. Evolutionary neural cascade search across supernetworks
Dhivya et al. Weighted particle swarm optimization algorithm for randomized unit testing
CN112086135B (en) State analysis method, device and equipment

Legal Events

Date Code Title Description
FD4E Short term patents deemed void under section 64
MM4A Patent lapsed