CN111860630A - Model establishing method and system based on feature importance - Google Patents

Model establishing method and system based on feature importance

Info

Publication number
CN111860630A
CN111860630A (application CN202010661710.4A)
Authority
CN
China
Prior art keywords
feature
importance
model
feature importance
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010661710.4A
Other languages
Chinese (zh)
Other versions
CN111860630B (en)
Inventor
Lin Jianming (林建明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wuyu Technology Co ltd
Original Assignee
Shenzhen Wuyu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wuyu Technology Co ltd filed Critical Shenzhen Wuyu Technology Co ltd
Priority to CN202010661710.4A priority Critical patent/CN111860630B/en
Publication of CN111860630A publication Critical patent/CN111860630A/en
Application granted granted Critical
Publication of CN111860630B publication Critical patent/CN111860630B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a model building method and system based on feature importance. The model building method comprises the following steps: step S1, initializing feature data; step S2, sampling the feature data to form a plurality of groups of feature data combinations, each group serving as a sub-model; step S3, setting model parameters for each group of feature models, with every group given the same model parameter range; step S4, training the models and calculating the feature importance of each sub-model; step S5, calculating the comprehensive weighted feature importance using all sub-models; step S6, sorting the feature importance to obtain a new importance ranking; and step S7, modeling again according to the feature ordering. The model building method and system based on feature importance can reduce fluctuation in the relative ranking of the calculated feature importance and improve reliability.

Description

Model establishing method and system based on feature importance
Technical Field
The invention belongs to the technical field of big data processing, relates to a model building method, and particularly relates to a model building method and system based on feature importance.
Background
In the field of big-data risk-control modeling, the data sources in common use are diverse; because data acquisition rates differ and access times are sequential, the data ultimately usable for risk-control modeling is of low quality. Faced with this situation, traditional algorithms perform poorly, require a large amount of data preprocessing, and cannot meet the business's risk-control requirements, so choosing a higher-performance model is the primary goal. XGBOOST is one of the commonly used algorithms, and the content of this patent is an extension based on the XGBOOST algorithm.
The XGBOOST algorithm uses an ensemble-learning scheme to construct trees, ultimately learning N trees; the feature importance over those N trees is the proportion of times a feature appears at tree nodes. Because the tree-construction process must also account for other factors such as overfitting and the stability of the model on the test set, parameters such as maximum tree depth and node-splitting conditions are constrained, so the features used by the whole forest are limited. As a result, among a large number of modelled feature variables most feature importances are zero, which hinders subsequent feature-screening work — and an importance of zero does not mean the feature carries no information.
In view of the above, there is an urgent need to design a new model building method to overcome at least some of the above-mentioned disadvantages of the existing model building methods.
Disclosure of Invention
The invention provides a model establishing method and system based on feature importance, which can reduce the relative sorting fluctuation of the feature importance obtained by calculation and improve the reliability.
In order to solve the technical problem, according to one aspect of the present invention, the following technical solutions are adopted:
a model building method based on feature importance comprises the following steps:
step S1, initializing feature data;
step S2, sampling the feature data to form a plurality of groups of feature data combinations, each group of feature data combination serving as a sub-model;
step S3, setting model parameters of each group of feature models; each group of feature combinations is given the same model parameter range;
step S4, training the models and calculating the feature importance of each sub-model;
step S5, calculating the comprehensive weighted feature importance using all sub-models;
step S6, sorting the feature importance to obtain a new importance ranking;
and step S7, modeling again according to the feature ordering.
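A minimal, illustrative sketch of steps S2–S6 in Python follows — `train_submodel` is a hypothetical stand-in for the XGBOOST training the patent describes, assumed to return a sub-model's test-set KS value and a per-feature importance mapping:

```python
import random

def rank_features(data_columns, train_submodel, n_groups=5, seed=0):
    """Steps S2-S6: sample disjoint feature groups, train one sub-model per
    group, KS-weight each sub-model's importances, then average and rank."""
    rng = random.Random(seed)
    cols = list(data_columns)
    rng.shuffle(cols)                      # step S2: random, without replacement
    size = len(cols) // n_groups
    groups = [cols[i * size:(i + 1) * size] for i in range(n_groups)]

    weighted = {c: 0.0 for c in data_columns}   # step S5 accumulator
    m = len(groups)
    for group in groups:                   # steps S3-S4: same parameter range per group
        ks, fimp = train_submodel(group)   # hypothetical: returns (ks_i, {feature: fimp_i})
        for feat, imp in fimp.items():
            weighted[feat] += ks * imp     # new importance contribution: ks_i * fimp_i
    fimp_new = {c: v / m for c, v in weighted.items()}  # mean over all m sub-models

    # step S6: new importance ranking, highest first
    return sorted(fimp_new, key=fimp_new.get, reverse=True)
```

Step S7 would then retrain on a top-ranked prefix of the returned ordering.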
As an embodiment of the present invention, the step S5 includes:
step S51, traversing each sub-model;
step S52, modifying the feature importance calculation for a single sub-model: the new importance of each feature is the test-set KS value ks_i of the sub-model multiplied by that feature's importance fimp_i;
Step S53, aggregating the feature importance of all sub-models, and solving the mean value of the new feature importance of each feature to obtain the comprehensive weighted feature importance;
fimp_new = (1/m) · Σ_{i=1}^{m} ks_i · fimp_i
where m represents the total number of trained models, ks_i represents the test-set KS value of the i-th model, and fimp_i represents the feature importance of all modelled features of the i-th model.
In one embodiment of the present invention, in step S2, the feature data is randomly sampled without replacement.
In one embodiment of the present invention, in step S2, the feature data is sampled in equal percentages with replacement.
In one embodiment of the present invention, in step S2, features are combined randomly without replacement: equal-ratio feature subsets are sampled without replacement multiple times, ensuring that every feature participates in model training with the same weight, i.e., every feature takes part in the same number of trained models.
As an embodiment of the present invention, in the step S2, a random combination method is introduced and combined with weighting by the final model's test-set KS and averaging of the sub-model feature importances, so that the resulting feature importance is more representative and provides a smoother quantitative evaluation index for each feature, giving an excellent basis for subsequent modeling feature screening.
In an embodiment of the invention, in step S4, the XGBOOST model is used as the training model.
In an embodiment of the present invention, in step S4, each group of models trains multiple groups of parameters, resulting in multiple groups of basic XGBOOST models.
In an embodiment of the present invention, in step S4, the obtained feature importance is combined with the feature IV value to form a new evaluation index, compatible with both the XGBOOST tree evaluation method and the linear IV evaluation method. Weight combination schemes with different weights for IV and feature importance are set, deriving additional feature evaluation indexes; given the IV weight a and the feature importance weight b, the derived new feature evaluation index is f_index_new = IV × a + f_imp_new × b.
As an embodiment of the present invention, the step S6 further includes: modeling in an increasing, decreasing, or stepwise manner according to the obtained feature ordering.
According to another aspect of the invention, the following technical scheme is adopted: a feature importance-based modeling system, the modeling system comprising:
the characteristic data initialization module is used for initializing the characteristic data;
the characteristic data sampling module is used for sampling the characteristic data to form a plurality of groups of characteristic data combinations, and each group of characteristic data combination is used as a sub-model;
The model parameter setting module is used for setting model parameters of each group of characteristic models; each group of feature combinations is provided with the same model parameter range;
the model training module is used for training the model and calculating the importance of the characteristics of each sub-model;
the importance calculating module is used for calculating the importance of the comprehensive weighting characteristics by using all the sub models;
the importance ranking module is used for ranking the feature importance to obtain a new importance ranking; and
and the training model modeling module is used for selecting the top-ranked feature subset according to the feature ordering, setting the model parameter range again, and retraining the model for modeling.
As an embodiment of the present invention, the importance calculating module includes:
the submodel traversing unit is used for traversing each submodel;
the feature importance calculating unit is used for modifying the feature importance calculation of a single sub-model as follows: the new importance of each feature is the test-set KS value ks_i of the sub-model multiplied by that feature's importance fimp_i; and
the feature importance aggregation unit is used for aggregating the feature importance of all the submodels and solving the mean value of the new feature importance of each feature to obtain the comprehensive weighted feature importance;
fimp_new = (1/m) · Σ_{i=1}^{m} ks_i · fimp_i
where m represents the total number of trained models, ks_i represents the test-set KS value of the i-th model, and fimp_i represents the feature importance of all modelled features of the i-th model.
The core ideas of the invention's integrated weighted feature importance calculation are: (1) random without-replacement feature combination: equal-ratio feature subsets are sampled without replacement multiple times, so every feature participates in model training with the same weight (i.e., every feature takes part in the same number of trained models); (2) each group of models trains multiple sets of parameters, producing multiple sets of basic XGBOOST models, from which a comprehensive multi-model feature importance is calculated.
The overall improvements of the invention are: (1) the random without-replacement combined sampling shuffles and recombines the features; the more combinations there are, the more readily both good and poor feature combinations appear, and the larger the differences in the corresponding model performance — that is, the ordering of model performance is tied to the features each model uses. (2) Native XGBOOST feature importance can fluctuate with different model parameters and is not consistent between runs, whereas the feature importance calculated by the scheme of the invention shows small fluctuation in relative ranking and high reliability. (3) After the feature importances of multiple groups of models are KS-weighted, no feature's importance is zero, solving the problem that a single XGBOOST model assigns zero importance to a large number of features. (4) Compared with native XGBOOST feature importance, the calculated comprehensive feature importance values are smoother.
The invention has the beneficial effects that: the model establishing method based on the feature importance can reduce the relative sorting fluctuation of the feature importance obtained by calculation and improve the reliability.
Drawings
Fig. 1 is a flowchart of a feature importance-based model building method according to an embodiment of the present invention.
Fig. 2 is a schematic composition diagram of a model building system based on feature importance according to an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating an importance calculation module of the model building system according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
For a further understanding of the invention, reference will now be made to the preferred embodiments of the invention by way of example, and it is to be understood that the description is intended to further illustrate features and advantages of the invention, and not to limit the scope of the claims.
The description in this section is for several exemplary embodiments only, and the present invention is not limited only to the scope of the embodiments described. It is within the scope of the present disclosure and protection that the same or similar prior art means and some features of the embodiments may be interchanged.
The invention discloses a model building method based on feature importance, and FIG. 1 is a flow chart of the model building method based on feature importance in an embodiment of the invention; referring to fig. 1, the model building method includes:
Step S1 initializes the feature data.
Step S2, the feature data are sampled to form a plurality of sets of feature data combinations, and each set of feature data combination is used as a sub-model.
In one embodiment of the invention, the feature data is randomly sampled without replacement. Features are combined randomly without replacement, and equal-ratio feature subsets are sampled without replacement multiple times, ensuring that every feature participates in model training with the same weight, i.e., every feature takes part in the same number of trained models. By introducing a random combination method and combining it with weighting by the final model's test-set KS and averaging of the sub-model feature importances, the resulting feature importance is more representative and provides a smoother quantitative evaluation index for each feature, giving an excellent basis for subsequent modeling feature screening.
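This "equal participation" constraint can be realised by repeating a shuffle-and-split over several rounds — a stdlib-only sketch, assuming the group size divides the feature count evenly (function and parameter names are illustrative, not from the source):

```python
import random

def equal_weight_groups(features, group_size, n_rounds, seed=0):
    """Without-replacement, equal-ratio sampling: each round shuffles the
    feature list and splits it into disjoint groups of group_size, so every
    feature joins exactly n_rounds sub-models (equal training weight)."""
    assert len(features) % group_size == 0, "assumed: group_size divides feature count"
    rng = random.Random(seed)
    groups = []
    for _ in range(n_rounds):
        pool = list(features)
        rng.shuffle(pool)                 # within a round, no feature repeats
        groups += [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
    return groups
```

Every feature appears in exactly `n_rounds` of the returned groups, so each participates in the same number of sub-model trainings.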
In another embodiment of the present invention, the feature data is sampled in equal percentages with replacement.
Step S3, setting model parameters of each group of feature models; each set of feature combinations sets the same model parameter ranges.
Step S4, the model is trained, and the importance of each sub-model feature is calculated.
In an embodiment of the invention, the XGBOOST model is adopted as the training model. In one embodiment, each group of models trains multiple sets of parameters, resulting in multiple sets of basic XGBOOST models.
In an embodiment of the invention, the feature importance is obtained based on the XGBOOST scheme, whose evaluation core mostly depends on the tree-node splitting calculation. Combining the obtained feature importance with the feature IV value can form a new evaluation index that accounts for both the XGBOOST tree evaluation method and the linear IV evaluation method. Weight combination schemes with different weights for IV and feature importance are set, deriving additional feature evaluation indexes; given the IV weight a and the feature importance weight b, the derived new feature evaluation index is f_index_new = IV × a + f_imp_new × b. For example, given an IV weight of 50% and a feature importance weight of 50%, the derived new feature evaluation index is f_index_new = IV × 0.5 + f_imp_new × 0.5.
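The derived index is a simple per-feature linear blend; the sketch below mirrors the 50/50 example from the text (the dict-based signature is an illustrative choice, not from the source):

```python
def combined_index(iv, fimp_new, a=0.5, b=0.5):
    """Blend the linear IV evaluation with the tree-based weighted
    importance, per feature: f_index_new = IV * a + f_imp_new * b."""
    return {feat: iv[feat] * a + fimp_new[feat] * b for feat in iv}
```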
Step S5, the importance of the integrated weighted feature is calculated using all sub models.
In an embodiment of the present invention, the step S5 includes:
step S51, traversing each sub-model;
step S52, modifying the feature importance calculation for a single sub-model as follows: the new importance of each feature is the test-set KS value ks_i of the sub-model multiplied by that feature's importance fimp_i;
Step S53, aggregating the feature importance of all sub-models, and solving the mean value of the new feature importance of each feature to obtain the comprehensive weighted feature importance;
fimp_new = (1/m) · Σ_{i=1}^{m} ks_i · fimp_i
where m represents the total number of trained models, ks_i represents the test-set KS value of the i-th model, and fimp_i represents the feature importance of all modelled features of the i-th model.
In one embodiment of the invention, KS is used as the weight in the comprehensive weighted feature importance calculation; in other embodiments, other evaluation metrics such as AUC and LIFT may be used.
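For reference, the KS weight used above is the two-sample Kolmogorov–Smirnov statistic on model scores — the maximum gap between the cumulative score distributions of positives and negatives. A stdlib-only sketch (quadratic-time for clarity; `scipy.stats.ks_2samp` is the usual production choice):

```python
def ks_statistic(scores, labels):
    """Two-sample KS: max |CDF_pos(t) - CDF_neg(t)| over score thresholds t.
    scores: model outputs; labels: 1 = positive, 0 = negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best
```

A perfectly separating model scores KS = 1.0; a model whose positive and negative score distributions coincide scores 0.0.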
Step S6, the feature importance is ranked to obtain a new importance ranking.
In an embodiment of the present invention, the step S6 further includes: modeling in an increasing, decreasing, or stepwise manner according to the obtained feature ordering. This improves the stability and the performance ceiling of the model while reducing the number of features used, thereby greatly reducing the difficulty of deploying the model online.
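The increasing-mode search can be sketched as a prefix scan over the new ranking — `fit_and_score` is a hypothetical callable standing in for the retraining and evaluation of step S7:

```python
def best_prefix(ranked_features, fit_and_score, step=1):
    """Incremental mode: train on the top-k features for growing k and keep
    the smallest prefix achieving the best score, trading fewer deployed
    features against model performance."""
    best_k, best_score = None, float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        score = fit_and_score(ranked_features[:k])
        if score > best_score:          # strict '>' keeps the smallest best k
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score
```

Decreasing or stepwise modes would iterate the prefix length downward or in larger strides.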
Step S7, modeling again according to the feature ordering;
step S8 ends.
The invention also discloses a model building system based on the feature importance, and FIG. 2 is a schematic composition diagram of the model building system based on the feature importance in an embodiment of the invention; referring to fig. 2, in an embodiment of the present invention, the model building system includes: the system comprises a characteristic data initialization module 1, a characteristic data sampling module 2, a model parameter setting module 3, a model training module 4, an importance calculating module 5, an importance sequencing module 6 and a training model modeling module 7.
The characteristic data initialization module 1 is used for initializing characteristic data; the characteristic data sampling module 2 is used for sampling the characteristic data to form a plurality of groups of characteristic data combinations, and each group of characteristic data combination is used as a sub-model; the model parameter setting module 3 is used for setting model parameters of each group of feature models; each group of feature combinations is provided with the same model parameter range; the model training module 4 is used for training the model and calculating the importance of the characteristics of each sub-model; the importance calculating module 5 is used for calculating the importance of the comprehensive weighting characteristic by using all the sub models; the importance ranking module 6 is used for ranking the feature importance to obtain a new importance ranking; the training model modeling module 7 is used for setting the model parameter range again by taking the characteristic subset with the top rank according to the characteristic sequence and retraining the model modeling.
FIG. 3 is a schematic diagram illustrating an importance calculation module of the model building system according to an embodiment of the present invention; referring to fig. 3, in an embodiment of the present invention, the importance calculating module 5 includes: a submodel traversing unit 51, a feature importance calculating unit 52, and a feature importance aggregating unit 53.
The sub-model traversal unit 51 is used to traverse the sub-models. The feature importance calculating unit 52 is used to modify the feature importance calculation of a single sub-model as follows: the new importance of each feature is the test-set KS value ks_i of the sub-model multiplied by that feature's importance fimp_i. The feature importance aggregating unit 53 is configured to aggregate the feature importance of all sub-models and calculate the mean of each feature's new importance, obtaining the integrated weighted feature importance;
fimp_new = (1/m) · Σ_{i=1}^{m} ks_i · fimp_i
where m represents the total number of trained models, ks_i represents the test-set KS value of the i-th model, and fimp_i represents the feature importance of all modelled features of the i-th model.
In summary, the feature importance-based model building method and system provided by the invention can reduce the relative sorting fluctuation of the calculated feature importance and improve the reliability.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The description and applications of the invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. Effects or advantages referred to in the embodiments may not be reflected in the embodiments due to interference of various factors, and the description of the effects or advantages is not intended to limit the embodiments. Variations and modifications of the embodiments disclosed herein are possible, and alternative and equivalent various components of the embodiments will be apparent to those skilled in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, proportions, and with other components, materials, and parts, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims (10)

1. A model building method based on feature importance is characterized by comprising the following steps:
step S1, initializing feature data;
step S2, sampling the feature data to form a plurality of groups of feature data combinations, each group of feature data combination serving as a sub-model;
step S3, setting model parameters of each group of feature models; each group of feature combinations is given the same model parameter range;
step S4, training the models and calculating the feature importance of each sub-model;
step S5, calculating the comprehensive weighted feature importance using all sub-models;
step S6, sorting the feature importance to obtain a new importance ranking;
and step S7, selecting the top-ranked feature subset according to the feature ordering, setting the model parameter range again, and retraining the model for modeling.
2. The feature importance-based model building method according to claim 1, wherein:
the step S5 includes:
step S51, traversing each sub-model;
step S52, modifying the feature importance calculation for a single sub-model: the new importance of each feature is the test-set KS value ks_i of the sub-model multiplied by that feature's importance fimp_i;
Step S53, aggregating the feature importance of all sub-models, and solving the mean value of the new feature importance of each feature to obtain the comprehensive weighted feature importance;
fimp_new = (1/m) · Σ_{i=1}^{m} ks_i · fimp_i
where m represents the total number of trained models, ks_i represents the test-set KS value of the i-th model, and fimp_i represents the feature importance of all modelled features of the i-th model.
3. The feature importance-based model building method according to claim 1, wherein:
in step S2, the feature data is randomly sampled without replacement; alternatively, the feature data is sampled in equal percentages with replacement.
4. The feature importance-based model building method according to claim 1, wherein:
in step S2, features are combined randomly without replacement, and equal-ratio feature subsets are sampled without replacement multiple times, so that every feature participates in model training with the same weight, that is, every feature takes part in the same number of trained models.
5. The feature importance-based model building method according to claim 1, wherein:
in the step S2, by introducing a random combination method and combining it with weighting by the final model's test-set KS and averaging of the sub-model feature importances, the obtained feature importance is more representative and provides a smoother quantitative evaluation index for the features, giving an excellent basis for subsequent modeling feature screening.
6. The feature importance-based model building method according to claim 1, wherein:
in the step S4, the training model adopts an XGBOOST model; and each group of models trains multiple sets of parameters to obtain multiple sets of basic XGBOOST models.
7. The feature importance-based model building method according to claim 1, wherein:
in the step S4, the obtained feature importance and the feature IV value are combined to form a new evaluation index that accounts for both the XGBOOST tree evaluation method and the linear IV evaluation method;
weight combination schemes with different weights for IV and feature importance are set, deriving additional feature evaluation indexes; given the IV weight a and the feature importance weight b, the derived new feature evaluation index is f_index_new = IV × a + f_imp_new × b.
8. The feature importance-based model building method according to claim 1, wherein:
the step S6 further includes: modeling in an increasing, decreasing, or stepwise manner according to the obtained feature ordering.
9. A model building system based on feature importance, the model building system comprising:
the characteristic data initialization module is used for initializing the characteristic data;
The characteristic data sampling module is used for sampling the characteristic data to form a plurality of groups of characteristic data combinations, and each group of characteristic data combination is used as a sub-model;
the model parameter setting module is used for setting model parameters of each group of characteristic models; each group of feature combinations is provided with the same model parameter range;
the model training module is used for training the model and calculating the importance of the characteristics of each sub-model;
the importance calculating module is used for calculating the importance of the comprehensive weighting characteristics by using all the sub models;
the importance ranking module is used for ranking the feature importance to obtain a new importance ranking; and
and the training model modeling module is used for setting the model parameter range again by taking the characteristic subset with the top rank according to the characteristic sequence and retraining the model modeling.
10. The feature importance based model building system according to claim 9, wherein:
the importance calculation module comprises:
the submodel traversing unit is used for traversing each submodel;
the feature importance calculating unit is used for modifying the feature importance calculation of a single sub-model as follows: the new importance of each feature is the test-set KS value ks_i of the sub-model multiplied by that feature's importance fimp_i; and
the feature importance aggregation unit is used for aggregating the feature importance of all the submodels and solving the mean value of the new feature importance of each feature to obtain the comprehensive weighted feature importance;
fimp_new = (1/m) · Σ_{i=1}^{m} ks_i · fimp_i
where m represents the total number of trained models, ks_i represents the test-set KS value of the i-th model, and fimp_i represents the feature importance of all modelled features of the i-th model.
CN202010661710.4A 2020-07-10 2020-07-10 Model building method and system based on feature importance Active CN111860630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661710.4A CN111860630B (en) 2020-07-10 2020-07-10 Model building method and system based on feature importance


Publications (2)

Publication Number Publication Date
CN111860630A true CN111860630A (en) 2020-10-30
CN111860630B CN111860630B (en) 2023-10-13

Family

ID=73153137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661710.4A Active CN111860630B (en) 2020-07-10 2020-07-10 Model building method and system based on feature importance

Country Status (1)

Country Link
CN (1) CN111860630B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102013225768A1 (en) * 2013-12-12 2015-06-18 Robert Bosch Gmbh Method and apparatus for determining a LOLIMOT model
CN105589683A (en) * 2014-10-22 2016-05-18 腾讯科技(深圳)有限公司 Sample extraction method and apparatus
CN107316082A (en) * 2017-06-15 2017-11-03 第四范式(北京)技术有限公司 For the method and system for the feature importance for determining machine learning sample
CN107730154A (en) * 2017-11-23 2018-02-23 安趣盈(上海)投资咨询有限公司 Based on the parallel air control application method of more machine learning models and system
WO2018145596A1 (en) * 2017-02-13 2018-08-16 腾讯科技(深圳)有限公司 Method and device for extracting feature information, server cluster, and storage medium
CN108764597A (en) * 2018-04-02 2018-11-06 华南理工大学 A kind of product quality control method based on integrated study
CN109035003A (en) * 2018-07-04 2018-12-18 北京玖富普惠信息技术有限公司 Anti- fraud model modelling approach and anti-fraud monitoring method based on machine learning
CN109460825A (en) * 2018-10-24 2019-03-12 阿里巴巴集团控股有限公司 For constructing the Feature Selection Algorithms, device and equipment of machine learning model
CN110334773A (en) * 2019-07-12 2019-10-15 四川新网银行股份有限公司 Model based on machine learning enters the screening technique of modular character
CN110908908A (en) * 2019-11-21 2020-03-24 深圳无域科技技术有限公司 Method and device for testing micro-service Dubbo interface
CN110991474A (en) * 2019-10-12 2020-04-10 未鲲(上海)科技服务有限公司 Machine learning modeling platform
US20220012639A1 (en) * 2020-07-08 2022-01-13 Vmware, Inc. Quantizing training data sets using ml model metadata

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cui Xiaoxu: "Research on Combined Disease Prediction Models Based on Data Mining", China Master's Theses Full-text Database, no. 12, pages 062-154 *

Also Published As

Publication number Publication date
CN111860630B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN107633301A Training and testing method for a BP neural network regression model and its application system
CN107330902A Chaos-genetic BP neural network image segmentation method based on the Arnold transform
CN108805268A Deep learning policy network training method based on an evolutionary algorithm
CN109816150A Post-harvest shelf-life prediction method and device for table grapes
CN109670655B (en) Multi-target particle swarm optimization scheduling method for electric power system
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN113240263A (en) Comprehensive energy system planning combination evaluation method based on entropy weight fuzzy
CN111461284A (en) Data discretization method, device, equipment and medium
CN110110447B Strip steel thickness prediction method using a shuffled frog leaping feedback extreme learning machine
CN108197280A (en) Mining ability evaluation method based on industrial equipment data
CN115775026A (en) Federated learning method based on organization similarity
CN105426959B Aluminum electrolysis energy-saving and emission-reduction method based on a BP neural network and an adaptive MBFO algorithm
CN113705098A (en) Air duct heater modeling method based on PCA and GA-BP network
CN111860630A (en) Model establishing method and system based on feature importance
CN108256623A (en) Particle swarm optimization on multiple populations based on period interaction mechanism and knowledge plate synergistic mechanism
CN112132259B (en) Neural network model input parameter dimension reduction method and computer readable storage medium
CN115936317A (en) Suspension cable-stayed cooperative system scheme evaluation method and system
CN116014764A (en) Distributed energy storage optimization processing method and device
CN115438842A Load prediction method based on an adaptively improved mayfly algorithm and a BP neural network
CN107590540A A forest crown width estimation method based on neighboring tree features
CN114781244A (en) Grouping and parameter optimization method in wind power plant
CN111553398B (en) Wind power scene uncertain continuous interval obtaining method based on multidimensional normal distribution
CN112817959A (en) Construction method of ancient biomorphic phylogenetic tree based on multi-metric index weight
CN113255883A (en) Weight initialization method based on power law distribution
Lou An intelligent teaching test paper generation system based on ant colony hybrid genetic algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant