Disclosure of Invention
One or more embodiments of the present disclosure describe a method and an apparatus for selecting features for a machine learning model, which may improve the effectiveness of feature screening by randomly perturbing variables, evaluating the importance of the features, and selecting the features based on the importance of the features.
According to a first aspect, there is provided a method of selecting features for a constructed machine learning model, comprising: acquiring m sample data pairs, wherein each sample data pair comprises n candidate features extracted for a corresponding sample and a sample label pre-labeled for the sample, the n candidate features comprising at least a first feature; training the machine learning model with the m sample data pairs to obtain a first importance of the first feature through the trained machine learning model; randomly exchanging sample labels among the m sample data pairs, and training the machine learning model with the m sample data pairs after the random label exchange to obtain a second importance of the first feature through the trained machine learning model; and determining whether to select the first feature as a feature of the machine learning model based at least on a comparison of the first importance and the second importance.
In one embodiment, the training the machine learning model with the m sample data pairs to obtain the first importance of the first feature through the trained machine learning model comprises: randomly sorting the m sample data pairs k1 times, and after each random sorting, training the machine learning model with the m randomly sorted sample data pairs, so as to obtain k1 first evaluation scores of the first feature through the machine learning model after each training; and determining the first importance based on the k1 first evaluation scores.
In a further embodiment, the determining the first importance based on the k1 first evaluation scores comprises: determining the average of the k1 first evaluation scores as the first importance.
In another embodiment, the randomly exchanging sample labels for the m sample data pairs and training the machine learning model with the m sample data pairs after the random label exchange to obtain the second importance of the first feature through the trained machine learning model comprises: randomly exchanging the sample labels of the m sample data pairs k2 times, and after each exchange, training the machine learning model with the m label-exchanged sample data pairs, so as to obtain k2 second evaluation scores of the first feature through the machine learning model after each training; and determining the second importance based on the k2 second evaluation scores.
In a further embodiment, the determining the second importance based on the k2 second evaluation scores comprises: determining the average of the k2 second evaluation scores as the second importance.
In one embodiment, the determining whether to select the first feature as a feature of the machine learning model based at least on the comparison of the first importance and the second importance comprises: excluding the first feature from features of the machine learning model if a difference of the second importance minus the first importance exceeds a first threshold.
In another embodiment, the determining whether to select the first feature as a feature of the machine learning model based at least on a comparison of the first importance and the second importance comprises: excluding the first feature from the features of the machine learning model if the difference of the second importance minus the first importance exceeds twice the second variance of the k2 second evaluation scores.
In one embodiment, the determining whether to select the first feature as a feature of the machine learning model based at least on the comparison of the first importance and the second importance comprises: determining to select the first feature as a feature of the machine learning model if a difference of the second importance minus the first importance is less than a second threshold.
In another embodiment, the determining whether to select the first feature as the feature of the machine learning model based at least on the comparison of the first importance and the second importance comprises: determining to select the first feature as a feature of the machine learning model if the difference of the second importance minus the first importance is less than the second variance of the k2 second evaluation scores.
In one embodiment, the determining whether to select the first feature as a feature of the machine learning model based on at least the first importance and the second importance comprises: determining a composite importance of the first feature based on the first importance and the second importance; determining whether to select the first feature as a feature of the machine learning model according to the composite importance.
In one embodiment, the composite importance of the first feature is determined based on a composite indicator of the first feature obtained by combining the first importance and the second importance, the composite indicator of the first feature being the sum of the difference between the second importance and the first importance and the ratio of the second importance to the first importance.
In a further embodiment, the composite importance of the first feature is the ratio of the composite indicator of the first feature to the maximum of the n composite indicators corresponding to the n candidate features.
In another further embodiment, the composite importance of the first feature is the ratio of the composite indicator of the first feature to the maximum among the composite indicators of similar candidate features, the similar candidate features being those of the n candidate features that satisfy the same predetermined condition as the first feature.
According to a second aspect, there is provided an apparatus for selecting features for a constructed machine learning model, comprising:
an acquisition unit configured to acquire m sample data pairs, wherein each sample data pair comprises n candidate features extracted for a corresponding sample and a sample label pre-labeled for the sample, the n candidate features comprising at least a first feature;
a first determining unit configured to train the machine learning model by using m sample data pairs to obtain a first importance of the first feature through the trained machine learning model;
a second determining unit, configured to randomly swap sample labels for the m sample data pairs, and train the machine learning model using the m sample data pairs after randomly swapping the sample labels, so as to obtain a second importance of the first feature through the trained machine learning model;
a selection unit configured to determine whether to select the first feature as a feature of the machine learning model based on at least a comparison of the first importance and the second importance.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus for selecting features for a constructed machine learning model provided by embodiments of this specification, m sample data pairs are first acquired, and random perturbation is then applied to them to analyze the importance of the features. Specifically, on one hand, the machine learning model is trained with the m sample data pairs, so that a first importance of a first feature is obtained through the trained machine learning model; on the other hand, the sample labels of the m sample data pairs are randomly exchanged, the machine learning model is trained with the m sample data pairs after the exchange, and a second importance of the first feature is obtained through the trained machine learning model. The first importance and the second importance of each feature are then compared, and features are selected for the constructed machine learning model according to the comparison result, which can improve the effectiveness of feature selection for the machine learning model.
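The overall flow can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions, not the claimed implementation: train_and_score is a hypothetical stand-in for "train the model on (X, y) and read back one importance score per feature" (e.g., the per-feature importance a gradient-boosting model reports), and a single label exchange here stands in for the repeated perturbations described later.

```python
import random

def select_features(X, y, train_and_score, threshold=0.0, seed=0):
    """Keep feature j only if its importance does not rise after the
    sample labels are randomly exchanged (second - first <= threshold)."""
    rng = random.Random(seed)
    first = train_and_score(X, y)             # first importance: true labels
    y_exchanged = y[:]
    rng.shuffle(y_exchanged)                  # random exchange of sample labels
    second = train_and_score(X, y_exchanged)  # second importance: exchanged labels
    return [j for j in range(len(first)) if second[j] - first[j] <= threshold]
```

An important feature's score should drop once the labels are scrambled, so second minus first is strongly negative for features worth keeping.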
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of this specification. The scenario mainly comprises a data module, a feature selection module, and a model training module. The method for selecting features for a constructed machine learning model provided by this specification is mainly applicable to the feature selection module in Fig. 1. The data module, the feature selection module, and the model training module may be disposed on the same computing platform (e.g., a server or a server cluster) or on different computing platforms, which is not limited herein. The data module may include, for example, various storage media for storing the training data set.
The basic idea of the embodiments of this specification is random perturbation. It will be appreciated that if a feature is important, its importance should drop once the sample labels are scrambled. For example, consider a binary classification distinguishing the elderly from children, with one feature being "age". Suppose sample A has the age value 85 with sample label "elderly", sample B has the age value 5 with sample label "child", and so on; the feature "age" is obviously very important. Now shuffle the labels: suppose the sample label of sample A becomes "child", the sample label of sample B remains "child", and so on. The "age" feature can no longer effectively distinguish the two categories of elderly and children. That is, the importance of the feature "age" decreases, while the importance of originally unimportant features, such as lifestyle habits, behavior habits, or hair color, may increase.
Therefore, the change in feature importance under random perturbation of the sample labels can be analyzed to determine whether a feature is important. Here, the importance of a feature can be understood as the degree to which the model relies on it: e.g., weights in a linear model, or the degree to which features (at the corresponding nodes) of a tree-structured model distinguish different classes. For a tree-structured model, the training process itself scores each feature, representing the feature's importance in the currently constructed machine learning model. Machine learning models such as LightGBM (Light Gradient Boosting Machine), XGBoost (eXtreme Gradient Boosting), and Random Forest can output the importance of each feature as part of training.
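The models named above report an importance score per feature as a by-product of training. As a library-free illustration only, not the gain computation used by LightGBM or XGBoost, the toy scorer below rates each feature by the absolute Pearson correlation between its column and a numeric label; it shares the one property that matters here: one number per feature, higher meaning more useful for separating the classes.

```python
def feature_scores(X, y):
    """Toy stand-in for model-reported importance: |Pearson correlation|
    between each feature column and the label. Constant columns score 0."""
    m, n = len(X), len(X[0])
    mean_y = sum(y) / m
    var_y = sum((v - mean_y) ** 2 for v in y) / m
    scores = []
    for j in range(n):
        col = [row[j] for row in X]
        mean_x = sum(col) / m
        cov = sum((col[i] - mean_x) * (y[i] - mean_y) for i in range(m)) / m
        var_x = sum((v - mean_x) ** 2 for v in col) / m
        denom = (var_x * var_y) ** 0.5
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return scores
```

On the elderly-vs-children example, the "age" column correlates almost perfectly with the label, while a constant column scores zero.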
Therefore, on one hand, the constructed machine learning model can be trained with the sample data in the training data set to determine the first importance of each feature; on the other hand, the sample labels can be randomly perturbed and the constructed machine learning model retrained with the perturbed sample data to obtain the second importance of each feature. Then, for each feature, the second importance obtained under random label perturbation can be compared with the first importance obtained by training on the normal samples, and the more important features can be selected for the machine learning model.
Specifically, as shown in Fig. 1, in the above application scenario, a training data set may be stored in the data module, the data set including at least m training samples. The feature selection module may obtain the sample data pairs of m training samples from the data module. Each sample data pair may include n candidate features extracted for the respective sample and a sample label pre-labeled for that sample. For example, for sample 1, sample 2 … sample m, the corresponding sample data pairs are [n features, label 1], [n features, label 2] … [n features, label m]. Table 1 gives a schematic of the sample data pair for each sample when the n candidate features include features 1 to n.
Table 1: Sample data pair for each sample

Sample 1 | X11, X21 … Xn1 | Y1
Sample 2 | X12, X22 … Xn2 | Y2
……       | ……              | ……
Sample m | X1m, X2m … Xnm | Ym
In Table 1, the n features are denoted X1, X2 … Xn, and the label is denoted Y. The features and label of each sample carry a subscript corresponding to the sample; for example, the features of sample 1 carry the subscript 1 and are denoted X11, X21 … Xn1, and its label is denoted Y1. It should be noted that the label subscripts are for convenience of description only and do not imply distinct label values: for example, Y1 and Ym may both be "apple", while Y2 may be "pear", and so on. By training the constructed machine learning model with the m sample data pairs, the first importance of each feature can be obtained.
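The construction of Table 1's pairs can be written directly. In the sketch below, extract_features is a hypothetical per-sample extractor producing the n candidate feature values X1 … Xn for one sample; it is not named by the text.

```python
def build_pairs(samples, labels, extract_features):
    """Pair each sample's n extracted candidate features with its
    pre-labeled sample label, yielding the [features, label] pairs."""
    assert len(samples) == len(labels)
    return [(extract_features(s), y) for s, y in zip(samples, labels)]
```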
On the other hand, if the sample labels in the sample data pairs are randomly disturbed, that is, the sample labels are randomly exchanged, m sample data pairs shown in tables 2 and 3 below can be obtained. Table 2 and table 3 show the m sample data pairs after two random perturbations of the sample label.
Table 2: One random perturbation of the sample labels

Sample 1 | X11, X21 … Xn1 | Y2
Sample 2 | X12, X22 … Xn2 | Ym
……       | ……              | ……
Sample m | X1m, X2m … Xnm | Y1
Table 3: Another random perturbation of the sample labels
As can be seen from Tables 2 and 3, the random perturbation changes which sample label is attached to each sample. The meaning represented by a sample's label may or may not change as a result. For example, in Table 2, the label of sample 1 changes from Y1 ("apple") to Y2 ("pear") after the perturbation; in Table 3, the label of sample 1 changes from Y1 to Ym, which is still "apple"; and so on. After one random exchange of the sample labels, there are still m sample data pairs. Similarly, these sample data pairs can be used to train the constructed machine learning model to obtain the second importance of each feature.
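A random exchange of the sample labels, as in Tables 2 and 3, is a permutation of the label column while the feature rows stay put: the label multiset is unchanged, only its assignment to samples moves. A minimal sketch:

```python
import random

def exchange_labels(pairs, seed=None):
    """Return new sample data pairs with the labels randomly permuted
    across samples; the feature vectors keep their original order."""
    rng = random.Random(seed)
    features = [f for f, _ in pairs]
    labels = [l for _, l in pairs]
    rng.shuffle(labels)  # random exchange of the sample labels
    return list(zip(features, labels))
```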
It will be appreciated that a feature is likely to be a valid feature if its importance drops sharply after the sample labels are randomly perturbed. Therefore, whether to select a feature as a feature of the machine learning model may be determined according to a comparison of its first importance and second importance. The selected features may then be used by the model training module for machine learning model training. The feature screening process is described in detail below.
FIG. 2 illustrates a flow diagram of a method of selecting features for a built machine learning model, according to one embodiment. The execution subject of the method can be any system, device, apparatus, platform or server with computing and processing capabilities.
As shown in Fig. 2, the method comprises the following steps: step 21, obtaining m sample data pairs, each comprising n candidate features extracted for a corresponding sample and a sample label pre-labeled for the sample; step 22, training the machine learning model with the m sample data pairs to obtain a first importance of each feature through the trained machine learning model; step 23, randomly exchanging sample labels among the m sample data pairs and training the machine learning model with the m sample data pairs after the exchange to obtain a second importance of each feature through the trained machine learning model; and step 24, determining whether to select each feature as a feature of the machine learning model based at least on a comparison of its first importance and second importance.
First, in step 21, m sample data pairs are acquired. Here, each sample data pair may include n features to be selected extracted for a corresponding sample, and a sample label pre-labeled for the sample.
The m sample data pairs can be obtained from the training sample set. The training data set may comprise no fewer than m (e.g., 2m) training samples for training the machine learning model. m may be a positive integer large enough to meet the requirements of training the machine learning model, e.g., 1000. In some cases, the training data set may also include test samples for testing whether the machine learning model trained on the training samples satisfies the conditions of use. For convenience of description, assume the training data set includes (or at least allows extracting) m samples.
Taking the constructed machine learning model as a credit risk-control model as an example, the samples in the training data set may include credit records of a plurality of users, such as their loan and repayment records. In this case, one user may be treated as one sample. Each sample may also correspond to a sample label, e.g., user A corresponds to "trustworthy user". The sample label may be labeled manually or in other reliable ways, which is not limited herein. M sample data pairs corresponding to m samples may be obtained from the training data set: for each of the m samples, n candidate features are extracted and the pre-labeled label is obtained, forming a sample data pair as shown in Table 1. The sample data pairs in Table 1 may also be expressed in the easily understood form [features, label], where the features include X1 to Xn and the label is denoted Y.
Next, in step 22, the machine learning model may be trained with the m sample data pairs to obtain the first importance of each feature through the trained machine learning model. For convenience of description, assume the n candidate features include at least the first feature, where the first feature may be any one of the n candidate features. As mentioned above, in the process of training with the m sample data pairs, the importance of each feature can be determined through the machine learning model, which is not repeated here.
It can be understood that, in model training, a different input order of the samples yields somewhat different model parameters for the trained model. To make the first importance more accurate, according to an embodiment, the arrangement order of the m samples may be randomly perturbed, and the first importance of each feature obtained from the feature importances observed under the individual perturbations. Here, for convenience of distinction and description, the feature importance obtained under each perturbation is called a first evaluation score. Specifically, k1 random perturbations may be performed on the m sample data pairs, and the first evaluation score of each feature detected under each perturbation. Note that a random perturbation of the m sample data pairs can also be understood as a random ordering of the m sample data pairs. Tables 4 and 5 show the arrangement order of the m sample data pairs after two random orderings.
Table 4: One random ordering of the sample data pairs

Sample 2 | X12, X22 … Xn2 | Y2
Sample m | X1m, X2m … Xnm | Ym
……       | ……              | ……
Sample 1 | X11, X21 … Xn1 | Y1
Table 5: Another random ordering of the sample data pairs

Sample m | X1m, X2m … Xnm | Ym
Sample 1 | X11, X21 … Xn1 | Y1
……       | ……              | ……
Sample 2 | X12, X22 … Xn2 | Y2
As can be seen from tables 4 and 5, random disturbance to the sample data pair only changes the whole arrangement order of the sample data pair, and does not change the sample data pair itself.
It is understood that with different input orders of the samples, the trained models may differ somewhat, and the first evaluation scores obtained for the same feature also differ. Taking the first feature (e.g., X1) as an example: the m sample data pairs are randomly sorted k1 times, and after each random sorting, the m randomly sorted sample data pairs are used to train the machine learning model, yielding k1 first evaluation scores.
Assuming the first random ordering happens to give the order shown in Table 1, when the constructed machine learning model is trained for the first time, the sample data pairs of the m samples are input into the machine learning model in the order sample 1, sample 2, …, and a first evaluation score of each of the n features is determined during training.
Assuming the second random ordering is as shown in Table 4, when the constructed machine learning model is trained for the second time, the sample data pairs of the m samples are input into the machine learning model in the order sample 2, sample m, …, and a first evaluation score of each of the n features is determined again during training.
And so on, until k1 random orderings have been performed; each of the n features (such as the first feature) then has k1 first evaluation scores, as shown in Table 6.
Table 6: First evaluation scores of the n candidate features under k1 random orderings
As Table 6 shows, each of the n candidate features yields k1 first evaluation scores.
Then, the first importance of each feature can be determined based on its k1 first evaluation scores. Taking the first feature as an example, in some embodiments the first importance may be the average of its k1 first evaluation scores; in other embodiments it may be the median of the k1 first evaluation scores. The first importance of each feature may also be determined from the k1 first evaluation scores in other reasonable ways, which is not limited herein. Thus, training the machine learning model on each random-ordering result yields a first evaluation score for each of the features X1 to Xn; after k1 random orderings of the sample data pairs, each of X1 to Xn has k1 first evaluation scores, from which the first importance of each feature is obtained.
It can be appreciated that for the more important features, the first evaluation score remains relatively stable regardless of the sample input order when training the model, whereas differences in the input order still introduce some fluctuation. Therefore, by randomly sorting the sample data pairs multiple times, the interference from the input order of the sample data can be averaged out, and a more accurate first importance of each feature is obtained.
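The k1-random-ordering, train, score, average procedure above can be sketched as follows. Here train_and_score is again a hypothetical stand-in that trains on the (ordered) pairs and returns one evaluation score per feature; the averaging corresponds to the "average of the k1 first evaluation scores" embodiment.

```python
import random

def first_importance(pairs, train_and_score, k1, seed=0):
    """Average the per-feature first evaluation scores over k1 random
    orderings of the sample data pairs (only the order is perturbed)."""
    rng = random.Random(seed)
    n = len(pairs[0][0])  # number of candidate features
    totals = [0.0] * n
    for _ in range(k1):
        shuffled = pairs[:]
        rng.shuffle(shuffled)               # perturb only the input order
        scores = train_and_score(shuffled)  # one first evaluation score per feature
        totals = [t + s for t, s in zip(totals, scores)]
    return [t / k1 for t in totals]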
On the other hand, in step 23, the sample labels are randomly exchanged for the m sample data pairs, and the machine learning model is trained by using the m sample data pairs after the sample labels are randomly exchanged, so as to obtain the second importance of each feature through the trained machine learning model. Wherein, the m sample data pairs after exchanging the sample labels are shown in table 2 or table 3, and the process of determining the second importance of each feature is similar to the process of determining the first importance of each feature in step 22, and is not described herein again.
It can be understood that the effect of a single random perturbation of the sample labels can also vary greatly. For example, a random perturbation may happen to restore the labels of the positive and negative samples: although the sample label of sample 1 becomes that of sample 2, the label content is unchanged and is still "apple", so the importance of each feature may be no different from the first importance. As another example, a random perturbation may exchange exactly half of the positive-sample labels with original negative samples: say the labels of samples 1 to m/2 were "apple" and those of samples 1+m/2 to m were "pear", and after the exchange samples 1 to m/4 carry "pear", samples 1+m/4 to m/2 still carry "apple", samples 1+m/2 to 3m/4 now carry "apple", and samples 1+3m/4 to m still carry "pear"; then the importance of a genuinely important feature (e.g., black dots on the skin) may be greatly reduced.
Thus, according to one possible design, the sample labels of the m sample data pairs may be perturbed multiple times, e.g., k2 random perturbations, with the machine learning model retrained after each random perturbation of the sample labels. Each training process yields a second evaluation score for each of the features X1 to Xn; after k2 random exchanges, each of X1 to Xn has k2 second evaluation scores. Like the first evaluation score, the term "second evaluation score" is used for descriptive distinction from the second importance; it essentially indicates the importance of a feature in one model training. The second importance of each candidate feature can then be determined based on its k2 second evaluation scores. Taking the first feature as an example, the second importance may be the average of its k2 second evaluation scores, or their median, which is not limited herein.
Therefore, by randomly exchanging the sample labels of the sample data pairs multiple times, interference from chance special cases can be avoided, and a relatively accurate second importance of each feature is obtained.
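The k2 label exchanges can be sketched in the same style; this version also returns the per-feature variance of the k2 second evaluation scores, since the later selection rules use that variance as a threshold. train_and_score is the same hypothetical train-then-score stand-in as before.

```python
import random

def second_importance(pairs, train_and_score, k2, seed=0):
    """Average per-feature scores over k2 independent random label
    exchanges; also return the per-feature variance of the k2 scores."""
    rng = random.Random(seed)
    n = len(pairs[0][0])
    runs = []
    for _ in range(k2):
        labels = [l for _, l in pairs]
        rng.shuffle(labels)  # one random exchange of the sample labels
        exchanged = [(f, l) for (f, _), l in zip(pairs, labels)]
        runs.append(train_and_score(exchanged))
    means = [sum(r[j] for r in runs) / k2 for j in range(n)]
    variances = [sum((r[j] - means[j]) ** 2 for r in runs) / k2 for j in range(n)]
    return means, variances
```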
Then, it is determined whether to select each feature as a feature of the machine learning model based on at least the comparison of the first and second importance of each feature, via step 24. It can be understood that, for each feature, the first importance is determined by training the machine learning model constructed by normal samples, the second importance is determined under the condition that the sample label randomly changes, and the importance of each feature to the constructed machine learning model can be evaluated by comprehensively analyzing the first importance and the second importance of each feature, so that effective features can be selected from each feature to train the machine learning model.
The following takes the first feature as an example to describe how the first importance and the second importance are jointly analyzed. According to the above analysis, when the labels are randomly exchanged, the importance of more important features may drop sharply, while the importance of less important features may rise or stay roughly the same as the important features' importance falls. Thus, in one embodiment, the first feature may be excluded from the features of the machine learning model if the difference of the second importance minus the first importance exceeds a first threshold. The first threshold may be predetermined; for example, with a first threshold of 0, the first feature is excluded whenever the second importance is greater than the first importance, that is, it is not selected as a feature of the constructed machine learning model. On the other hand, as described for step 23, in some embodiments the second importance is determined from k2 second evaluation scores; in corresponding alternative embodiments, the first threshold may therefore be determined based on the k2 second evaluation scores, e.g., as twice the variance of the k2 second evaluation scores.
Further, if the difference of the second importance minus the first importance does not exceed the first threshold, the second importance and the first importance of the first feature may be compared further, and the first feature may still be selected as a feature of the constructed machine learning model.
According to one possible design, the first feature may be determined to be selected as a feature of the constructed machine learning model if the difference of the second importance minus the first importance is less than a second threshold. The second threshold may be predetermined, e.g., one half of the negative of the second importance. On the other hand, as described for step 23, in some embodiments the second importance is determined from k2 second evaluation scores; in corresponding alternative embodiments, the second threshold may therefore be determined based on the k2 second evaluation scores, e.g., as the variance of the k2 second evaluation scores. If the difference of the second importance minus the first importance is greater than the second threshold, the first feature may be directly excluded from the features of the constructed machine learning model, or judged further, which is not limited herein.
It should be noted that, in the above process of determining whether to select the first feature as a feature of the constructed machine learning model, the first-threshold and second-threshold judgments may be combined, in which case the first threshold is greater than the second threshold. When the difference between the second importance and the first importance of the first feature falls between the two thresholds, whether to select the feature for the constructed machine learning model may be decided by other rules, for example, sorting the n candidate features by the difference between their second and first importance and selecting a predetermined number of features with the smallest differences. It is understood that, since the importance of an important feature drops sharply after the labels are randomly scrambled, its difference (second importance minus first importance) is negative; the smaller the difference, the larger its absolute value, and the more the corresponding feature's importance has dropped.
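The combined first-threshold/second-threshold rule amounts to a three-way split on the difference; a minimal sketch, with illustrative threshold values not prescribed by the text:

```python
def select_by_thresholds(first_imp, second_imp, t1, t2):
    """Three-way split, assuming t1 > t2: exclude feature j if
    second - first exceeds t1, select it if the difference is below t2,
    otherwise leave it undecided for further rules (e.g. ranking)."""
    selected, excluded, undecided = [], [], []
    for j, (f, s) in enumerate(zip(first_imp, second_imp)):
        d = s - f
        if d > t1:
            excluded.append(j)
        elif d < t2:
            selected.append(j)
        else:
            undecided.append(j)
    return selected, excluded, undecided
```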
According to some optional implementations, in the case that the first importance is determined by k1 first evaluation scores and the second importance is determined by k2 second evaluation scores, the first importance and the second importance may also be compared and analyzed in other ways. For example, features whose first importance is greater than a third threshold and whose second importance is less than a fourth threshold may be selected as features of the constructed machine learning model, and so on.
For convenience of description, the features selected as features of the constructed machine learning model may also be referred to as valid features.
According to another possible design, the comprehensive importance of each feature may be determined based on the first importance and the second importance corresponding to that feature, and the valid features of the machine learning model may be determined according to the comprehensive importance.
In one embodiment, the first importance and the second importance may be combined to obtain a composite indicator of the first feature, and the comprehensive importance of the first feature may then be determined based on the composite indicator. In one embodiment, the composite indicator may be the ratio of the second importance to the first importance. In another embodiment, the composite indicator may be the sum of the difference between the second importance and the first importance and the ratio of the second importance to the first importance, that is: (second importance - first importance) + second importance/first importance. Further, the composite indicator itself may be used as the comprehensive importance of the first feature, or the composite indicator may be further processed, for example mapped to a predetermined range such as [0, 1], as the comprehensive importance of the first feature.
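Assuming both importances are positive scalars (the function names are illustrative, and the min-max mapping into [0, 1] is only one possible choice, since the text leaves the concrete mapping open), the second form of the composite indicator could be sketched as:

```python
def composite_indicator(first_imp, second_imp):
    # (second importance - first importance) + second/first importance,
    # as described above; assumes first_imp is non-zero
    return (second_imp - first_imp) + second_imp / first_imp

def map_to_unit_range(indicators):
    # one possible min-max mapping of composite indicators into [0, 1]
    lo, hi = min(indicators), max(indicators)
    if hi == lo:
        return [1.0] * len(indicators)
    return [(v - lo) / (hi - lo) for v in indicators]
```

For a feature whose importance drops from 0.5 to 0.1 after label shuffling, the indicator is (0.1 - 0.5) + 0.1/0.5 = -0.2; strongly important features thus receive strongly negative indicators.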
In one implementation, the comprehensive importance of the first feature may be the ratio of the composite indicator of the first feature to the maximum of the n composite indicators corresponding to the n candidate features. In this manner, the relative comprehensive importance of each feature among the n features can be determined.
In another implementation, the comprehensive importance of the first feature may be the ratio of the composite indicator of the first feature to the maximum among the composite indicators of candidate features of the same kind. The candidate features of the same kind are those of the n candidate features that satisfy the same predetermined condition as the first feature. For example, if the first feature satisfies the predetermined condition that the difference between the second importance and the first importance is smaller than the first threshold, then each of the n candidate features for which that difference is smaller than the first threshold is a candidate feature of the same kind. In this specification, the candidate features of the same kind as the first feature include the first feature itself. In this way, the relative comprehensive importance can be determined separately within each category, with the maximum comprehensive importance in each category being 1.
Referring to fig. 3, to make the above description clearer, this step is described below through a specific example. In this example, each of the n candidate features has a first importance determined by k1 first evaluation scores in step 22, and a second importance determined by k2 second evaluation scores in step 23. First, for each feature, the first importance and the second importance are computed, together with the first variance of the k1 first evaluation scores and the second variance of the k2 second evaluation scores. The n candidate features are then roughly classified according to the predetermined conditions that these means and variances satisfy. Specifically:
the first class satisfies the condition: the difference between the second importance and the first importance is greater than or equal to 2 times the second variance;
the second class satisfies the condition: the difference between the second importance and the first importance is greater than or equal to the second variance and less than 2 times the second variance;
the third class satisfies the condition: the difference between the second importance and the first importance is less than the second variance.
From the above grouping it can be seen that: after the sample labels are randomly shuffled, the importance of the first class of features decreases only slightly or not at all; these features are of low importance to the constructed machine learning model and are invalid features (reject class). After the sample labels are randomly shuffled, the importance of the third class of features decreases sharply; these features are important to the constructed machine learning model and are valid features (keep class). The second class of features lies between the first and the third class; these are intermediate features (latent class). In general, a valid feature contributes greatly to discriminating among samples; however, its amount of information may be small, and the model may not predict well if that feature of a prediction object is not obvious or is missing. Thus, in this example, the valid features and the intermediate features may be selected together as features of the constructed machine learning model.
Further, in practical applications, it may be necessary to provide a more detailed ranking of feature importance for the user's reference, or to leave more room for selecting features; for example, when there are many features, a part of the features may be removed from the intermediate features of the second class. In this case, as shown in fig. 3, the composite indicator of each feature in each category may be calculated, and the comprehensive importance of the feature may be determined from the composite indicator. A specific method for calculating the comprehensive importance is: divide the composite indicator of each feature by the largest composite indicator in its category. Thus, each feature in a class corresponds to a score between 0 and 1 that indicates the relative importance of the feature within the class. If it is desired to reduce the computation of the model by further removing candidate features, the features with lower comprehensive importance can be removed from the intermediate features.
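The three-way grouping and the within-class scoring of this example can be sketched as follows. The function name is illustrative, the composite indicator uses the diff-plus-ratio form from the embodiment above, and per-feature variances are assumed to be supplied as an array (or a scalar broadcast over all features):

```python
import numpy as np

def classify_and_score(first_imp, second_imp, second_var):
    """Group features into reject / latent / keep classes by comparing
    the importance change with the variance of the shuffled-label
    evaluation scores, then score each feature relative to the largest
    composite indicator in its class, as in the example above."""
    diff = second_imp - first_imp
    classes = np.where(diff >= 2 * second_var, "reject",
              np.where(diff >= second_var, "latent", "keep"))
    composite = diff + second_imp / first_imp   # diff-plus-ratio form
    scores = np.empty_like(composite)
    for c in ("reject", "latent", "keep"):
        mask = classes == c
        if mask.any():
            # divide by the largest composite indicator in the category
            scores[mask] = composite[mask] / composite[mask].max()
    return classes, scores
```

A design note: dividing by the in-class maximum reproduces the calculation described in the text; when a class's composite indicators are all negative (as is typical for the keep class), dividing by the largest absolute value instead would be one way to preserve the ordering of scores.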
Reviewing the above process: in selecting features for the constructed machine learning model, random perturbation is used. On the one hand, the constructed machine learning model is trained normally on the samples to determine the first importance of each feature; on the other hand, the sample labels are randomly exchanged and the second importance of each feature is determined, so that feature selection is performed by comparing the second importance with the first importance, which can improve the effectiveness of feature screening in the process of constructing the machine learning model. Further, the determination of the first importance may be performed multiple times by randomly shuffling the order of the sample data pairs, making the first importance more accurate. In addition, the second importance is determined by shuffling the sample labels multiple times, which reduces the interference of random perturbation in special cases and further improves the accuracy of the second importance.
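Under stated assumptions, the whole procedure reviewed above could be sketched end to end. Here `importance_fn` is an assumed stand-in for "train the model and read one evaluation score per feature" (for instance a tree model's feature importances), and the keep rule follows the example's threshold of twice the second variance; the function name and defaults are illustrative:

```python
import numpy as np

def select_by_label_shuffling(X, y, importance_fn, k1=5, k2=5, seed=0):
    """Sketch of the reviewed procedure: average k1 importance runs on
    randomly reordered sample pairs (first importance), average k2 runs
    on randomly shuffled labels (second importance and its variance),
    then keep the features whose importance drop clears the example's
    two-times-variance threshold (valid plus intermediate features)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    # first importance: k1 random orderings of the sample data pairs
    first = np.mean(
        [importance_fn(X[idx], y[idx])
         for idx in (rng.permutation(m) for _ in range(k1))], axis=0)
    # second importance: k2 random shufflings of the sample labels
    shuffled = np.array(
        [importance_fn(X, rng.permutation(y)) for _ in range(k2)])
    second, second_var = shuffled.mean(axis=0), shuffled.var(axis=0)
    # keep features whose difference is below twice the second variance
    return np.flatnonzero(second - first < 2 * second_var)
```

As a usage sketch, `importance_fn` could be any per-feature scorer; with a scorer based on the absolute correlation of each column with the labels, a column that determines the label is kept while pure-noise columns tend to be rejected.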
According to an embodiment of another aspect, a device for selecting features for the constructed machine learning model is also provided. FIG. 4 shows a schematic block diagram of an apparatus for feature selection for a constructed machine learning model according to one embodiment. As shown in fig. 4, the apparatus 400 for selecting features for a constructed machine learning model includes: an obtaining unit 41, configured to obtain m sample data pairs, where each sample data pair includes n to-be-selected features extracted for a corresponding sample and a sample label pre-labeled for the sample, where the n to-be-selected features at least include a first feature; a first determining unit 42 configured to train the machine learning model by using the m sample data pairs to obtain a first importance of the first feature through the trained machine learning model; a second determining unit 43, configured to randomly swap sample labels for the m sample data pairs, and train the machine learning model using the m sample data pairs after randomly swapping the sample labels, so as to obtain a second importance of the first feature through the trained machine learning model; a selection unit 44 configured to determine whether to select the first feature as a feature of the machine learning model based on at least a comparison of the first importance and the second importance.
In an embodiment of an aspect, the first determining unit 42 may be further configured to:
perform k1 random sortings of the m sample data pairs, and after each random sorting, train the constructed machine learning model using the m randomly sorted sample data pairs, so as to obtain, through the machine learning model after each training, k1 first evaluation scores of the first feature;
determine the first importance based on the k1 first evaluation scores.
Further, the first determining unit 42 may determine the average of the k1 first evaluation scores as the first importance of the first feature.
In an embodiment of another aspect, the second determining unit 43 may be further configured to:
randomly exchange the sample labels of the m sample data pairs k2 times, and after each exchange, train the constructed machine learning model using the m sample data pairs whose sample labels have been randomly exchanged, so as to obtain, through the machine learning model after each training, k2 second evaluation scores of the first feature;
determine the second importance of the first feature based on the k2 second evaluation scores.
Further, the second determining unit 43 may determine the average of the k2 second evaluation scores as the second importance of the first feature.
According to one possible design, the selection unit 44 may further be configured to:
in the event that the difference of the second importance minus the first importance exceeds a first threshold, the first feature is excluded from the features of the machine learning model being built.
In some embodiments, when the second importance is determined by the k2 second evaluation scores, the selection unit 44 may further be configured to: in the case that the difference of the second importance minus the first importance exceeds twice the second variance of the k2 second evaluation scores, exclude the first feature from the features of the constructed machine learning model. That is, the first threshold is twice the second variance of the k2 second evaluation scores.
According to another possible design, the selection unit 44 may further be configured to:
in the case that a difference of the second importance minus the first importance is smaller than a second threshold, it is determined to select the first feature as a feature of the machine learning model constructed.
In some embodiments, when the second importance is determined by the k2 second evaluation scores, the selection unit 44 may further be configured to: in the case that the difference of the second importance minus the first importance is less than the second variance of the k2 second evaluation scores, determine to select the first feature as a feature of the constructed machine learning model. That is, the second threshold may be the second variance of the k2 second evaluation scores.
According to an embodiment, the selection unit 44 may be further configured to:
determining a composite importance of the first feature based on the first importance and the second importance;
and determining whether to select the first feature as the feature of the constructed machine learning model according to the comprehensive importance.
In one embodiment, the comprehensive importance of the first feature is determined based on a composite indicator of the first feature obtained by combining the first importance and the second importance, where the composite indicator of the first feature is the sum of the difference between the second importance and the first importance and the ratio of the second importance to the first importance.
In a further embodiment, the comprehensive importance of the first feature is the ratio of the composite indicator of the first feature to the maximum of the n composite indicators corresponding to the n candidate features.
In another further embodiment, the comprehensive importance of the first feature is the ratio of the composite indicator of the first feature to the maximum among the composite indicators of candidate features of the same kind, where the candidate features of the same kind are those of the n candidate features that satisfy the same predetermined condition as the first feature.
It should be noted that the apparatus 400 shown in fig. 4 is an apparatus embodiment corresponding to the method embodiment shown in fig. 2, and the corresponding description in the method embodiment shown in fig. 2 is also applicable to the apparatus 400, and is not repeated herein.
Through the apparatus, in the process of selecting features for the constructed machine learning model, random perturbation is used: on the one hand, the constructed machine learning model is trained normally on the samples to determine the first importance of each feature; on the other hand, the sample labels are randomly exchanged to determine the second importance of each feature. Feature selection is then performed by comparing the second importance with the first importance, which can improve the effectiveness of feature screening in the process of constructing the machine learning model.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.