CN111860630B

CN111860630B - Model building method and system based on feature importance

Info

Publication number: CN111860630B
Application number: CN202010661710.4A
Authority: CN
Inventors: 林建明
Original assignee: Shenzhen Wuyu Technology Co ltd
Current assignee: Shenzhen Wuyu Technology Co ltd
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2023-10-13
Anticipated expiration: 2040-07-10
Also published as: CN111860630A

Abstract

The invention discloses a model building method and a system based on feature importance, wherein the model building method comprises the following steps: step S1, initializing characteristic data; step S2, sampling the characteristic data to form a plurality of groups of characteristic data combinations, wherein each group of characteristic data combination is used as a sub-model; s3, setting model parameters of each group of feature models; each group of feature combinations is provided with the same model parameter range; s4, training a model, and calculating the importance of each sub-model feature; s5, calculating importance of the comprehensive weighting characteristics by using all sub-models; s6, sorting the feature importance to obtain a new importance sorting; and S7, modeling again according to the characteristic sequence. The model establishing method and system based on the feature importance can reduce the relative sorting fluctuation of the feature importance obtained by calculation and improve the credibility.

Description

Model building method and system based on feature importance

Technical Field

The invention belongs to the technical field of big data processing, relates to a model building method, and particularly relates to a model building method and system based on feature importance.

Background

In the field of big data wind control modeling, commonly used data sources are numerous, and due to the fact that the data searching yields are different and the access time is sequential, the quality of the data which can be used for wind control modeling is low. In the face of the data situation, the traditional algorithm has limited performance, and needs to do a large amount of data preprocessing work, and cannot meet the control demands of business on risks, so that a model with higher performance is selected as a primary target, XGBOOST is one of the common algorithms, and the patent is based on XGBOOST algorithm development.

The XGBOOST algorithm adopts an integrated learning scheme to construct the tree, finally learns N trees, and is characterized in that the number of times of the N trees appearing on the nodes of the tree is proportional, parameters such as maximum tree depth, splitting conditions of the nodes and the like are required to be limited because other factors such as fitting, stability and the like of a test set of the model are required to be considered in the tree construction process, and finally the characteristics used by the whole tree forest are limited, so that most of characteristic variables are zero in the model entering characteristic variables, the condition that most of characteristic importance is zero is unfavorable for subsequent characteristic screening work, and the characteristic importance is zero and does not represent useless characteristics.

In view of this, there is an urgent need to design a new model building method to overcome at least some of the above-mentioned drawbacks of the existing model building methods.

Disclosure of Invention

The invention provides a model building method and a system based on feature importance, which can reduce the relative sorting fluctuation of the feature importance obtained by calculation and improve the credibility.

In order to solve the technical problems, according to one aspect of the present invention, the following technical scheme is adopted:

a model building method based on feature importance, the model building method comprising:

step S1, initializing characteristic data;

step S2, sampling the characteristic data to form a plurality of groups of characteristic data combinations, wherein each group of characteristic data combination is used as a sub-model;

s3, setting model parameters of each group of feature models; each group of feature combinations is provided with the same model parameter range;

s4, training a model, and calculating the importance of each sub-model feature;

s5, calculating importance of the comprehensive weighting characteristics by using all sub-models;

s6, sorting the feature importance to obtain a new importance sorting;

and S7, modeling again according to the characteristic sequence.

As an embodiment of the present invention, the step S5 includes:

step S51, traversing each sub-model;

step S52, modifying the feature importance calculation mode of the single submodel to be: test set ks of the submodel _i +ks _i * Importance fimp of the feature _i ；

Step S53, feature importance of all sub-models is aggregated, and the average value of new feature importance of each feature is calculated, so that comprehensive weighted feature importance is obtained;wherein m represents the number of models trained overall, ks _i Ks value, fimp, representing the ith model _i Representing the feature importance of all the in-mold features of the ith model.

In one embodiment of the present invention, in the step S2, the feature data is randomly sampled without substitution.

In the step S2, the feature data is sampled with a percentage of the same value as the feature data.

In step S2, the random non-replacement combined feature is adopted, and multiple sampling is performed through the feature with the non-replacement sampling equal ratio, so that each feature can participate in model training, and the participating weights are the same, i.e. the number of models of each feature participating in training is the same.

In step S2, a random combination method is introduced, and a final model test set ks and a sub-model feature importance weighting and averaging method are combined, so that the obtained feature importance is more representative, and a smoother quantitative evaluation index is provided for the feature, which provides an excellent basis for subsequent modeling feature screening.

In step S4, the XGBOOST model is used as the training model.

In step S4, each set of models trains multiple sets of parameters to obtain multiple sets of basic XGBOOST models.

In the step S4, the obtained feature importance and the feature IV value are combined to form one evaluation index again, and the evaluation method of XGBOOST tree and the evaluation method of linear IV are compatible. Setting a weight combination scheme with iv and different feature importance, and deriving additional feature evaluation indexes again; given iv weight a, feature importance weight b, the new feature evaluation index after derivation is f_index_new=iv_a+f_imp_new_new.

As an embodiment of the present invention, the step S6 further includes: the progressive, decremental or step wise modeling is performed according to the resulting feature ordering lines.

According to another aspect of the invention, the following technical scheme is adopted: a feature importance based model building system, the model building system comprising:

the feature data initializing module is used for initializing feature data;

the characteristic data sampling module is used for sampling characteristic data to form a plurality of groups of characteristic data combinations, and each group of characteristic data combination is used as a sub-model;

the model parameter setting module is used for setting model parameters of each group of characteristic models; each group of feature combinations is provided with the same model parameter range;

the model training module is used for training a model and calculating the importance of each sub-model characteristic;

an importance calculation module for calculating the importance of the comprehensive weighted features using all sub-models;

the importance ranking module is used for ranking the feature importance to obtain a new importance ranking; and

and the training model modeling module is used for taking the feature subset which is ranked at the front according to the feature sequence to set the model parameter range again and retrain the model modeling again.

As one embodiment of the present invention, the importance calculating module includes:

the sub-model traversing unit is used for traversing each sub-model;

the feature importance calculating unit is used for modifying the feature importance calculating mode of the single submodel to be as follows: test set ks of the submodel _i +ks _i * Importance fimp of the feature _i The method comprises the steps of carrying out a first treatment on the surface of the And

the feature importance aggregation unit is used for aggregating the feature importance of all the submodels and solving the mean value of the new feature importance of each feature to obtain the comprehensive weighted feature importance;wherein m represents the number of models trained overall, ks _i Ks value, fimp, representing the ith model _i Representing the feature importance of all the in-mold features of the ith model.

The invention synthesizes the core calculation thought of the weighted feature importance: (1) random non-put back combination feature: sampling is carried out for a plurality of times without replaced sampling equal ratio characteristics, so that each characteristic can participate in model training, and the participating weights are the same (namely, the quantity of models of each characteristic participating in training is consistent). (2) Each group of models trains a plurality of groups of parameters to obtain a plurality of groups of basic XGBOOST models, and the comprehensive feature importance of the plurality of models is calculated.

The invention has the overall improvement effect that: (1) The random sampling mode without the put-back combination is used for combining the characteristics in disorder, the more the combination is, the more good characteristic combination and poor characteristic combination are easy to appear, and the larger the corresponding model effect is, namely the ordering of the model effect is related to the use characteristics. (2) The original XGBOOST feature importance may have fluctuation due to different model parameters and is not consistent every time, but the feature importance calculated by the scheme of the invention has small fluctuation of relative ordering and higher reliability. (3) After the feature importance of the multiple groups of models is calculated through the ks weighting, the importance of each feature is not zero, namely the problem that a large number of features of a single XGBOOST model are zero is solved. (4) Compared with the original characteristic importance of XGBOOST, the calculated comprehensive characteristic importance is smoother in value.

The invention has the beneficial effects that: the model establishment method based on the feature importance can reduce the relative sequencing fluctuation of the feature importance obtained by calculation and improve the credibility.

Drawings

FIG. 1 is a flow chart of a method for modeling based on feature importance in an embodiment of the invention.

FIG. 2 is a schematic diagram of a feature importance based modeling system according to an embodiment of the present invention.

FIG. 3 is a schematic diagram showing the components of a model building system importance calculating module according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

For a further understanding of the present invention, preferred embodiments of the invention are described below in conjunction with the examples, but it should be understood that these descriptions are merely intended to illustrate further features and advantages of the invention, and are not limiting of the claims of the invention.

The description of this section is intended to be illustrative of only a few exemplary embodiments and the invention is not to be limited in scope by the description of the embodiments. It is also within the scope of the description and claims of the invention to interchange some of the technical features of the embodiments with other technical features of the same or similar prior art.

The invention discloses a model building method based on feature importance, and FIG. 1 is a flow chart of the model building method based on feature importance in an embodiment of the invention; referring to fig. 1, the method for establishing the model includes:

initializing characteristic data (step S1).

And (S2) sampling the characteristic data to form a plurality of groups of characteristic data combinations, wherein each group of characteristic data combination is used as a submodel.

In one embodiment of the invention, the feature data is randomly non-subsampled. The random non-replacement combined features are adopted, and sampling is carried out for multiple times through the features of non-replacement sampling equal ratio, so that each feature can participate in model training, the participating weights are the same, and the number of models of each feature participating in training is consistent. By introducing a random combination method and combining a final model test set ks and sub-model feature importance weighting average method, the obtained feature importance is more representative, and a smoother quantitative evaluation index is provided for the feature, so that an excellent basis is provided for subsequent modeling feature screening.

In another embodiment of the invention, the feature data is sampled with a put back equal percentage.

Setting model parameters of each group of feature models; the same model parameter range is set for each set of feature combinations.

Training the model and calculating the importance of each sub-model feature (step S4).

In one embodiment of the invention, the training model employs the XGBOOST model. In one embodiment, each set of models trains multiple sets of parameters, resulting in multiple sets of basic XGBOOST models.

In an embodiment of the present invention, the feature importance is obtained based on XGBOOST scheme, the core of the evaluation is mostly dependent on the segmentation calculation mode of the tree node, the obtained feature importance and the feature IV value are combined together, an evaluation index can be formed again, and the evaluation mode of XGBOOST tree and the evaluation mode of linear IV are considered. Setting a weight combination scheme with iv and different feature importance, and deriving additional feature evaluation indexes again; given iv weight a, feature importance weight b, the new feature evaluation index after derivation is f_index_new=iv_a+f_imp_new_new. For example, given that iv weights are 50%, feature importance weights are 50%, the new feature evaluation index after derivation is f_index_new=iv×0.5+f_imp_new×0.5.

Step S5 the importance of the comprehensive weighting characteristics is calculated using all sub-models.

In an embodiment of the present invention, the step S5 includes:

step S51, traversing each sub-model;

In one embodiment of the present invention, ks is used as the integrated feature weighted importance calculation; in another embodiment of the present invention, evaluation indexes of other models such as AUC, LIFT, etc. may be used.

And (6) sorting the feature importance to obtain a new importance sort.

In an embodiment of the present invention, the step S6 further includes: modeling in an incremental, decremental or step wise manner according to the obtained feature ordering row; the stability and the performance upper limit of the model are improved, and meanwhile, the used characteristics are reduced, so that the deployment difficulty of the online model is greatly reduced.

Step S7, performing feature re-modeling according to the feature sequence;

the step S8 ends.

The invention also discloses a model building system based on the feature importance, and FIG. 2 is a schematic diagram of the model building system based on the feature importance in an embodiment of the invention; referring to fig. 2, in an embodiment of the present invention, the modeling system includes: the system comprises a feature data initialization module 1, a feature data sampling module 2, a model parameter setting module 3, a model training module 4, an importance calculating module 5, an importance sorting module 6 and a training model modeling module 7.

The characteristic data initializing module 1 is used for initializing characteristic data; the feature data sampling module 2 is used for sampling feature data to form a plurality of groups of feature data combinations, and each group of feature data combinations is used as a sub-model; the model parameter setting module 3 is used for setting the model parameters of each group of feature models; each group of feature combinations is provided with the same model parameter range; the model training module 4 is used for training a model and calculating the importance of each sub-model feature; the importance calculating module 5 is used for calculating the importance of the comprehensive weighting characteristics by using all sub-models; the importance ranking module 6 is used for ranking the feature importance to obtain a new importance ranking; the training model modeling module 7 is configured to take the feature subset with the top rank according to the feature order to set the model parameter range again, and retrain the model modeling.

FIG. 3 is a schematic diagram showing the components of a model building system importance calculating module according to an embodiment of the present invention; referring to fig. 3, in an embodiment of the present invention, the importance calculating module 5 includes: a sub-model traversing unit 51, a feature importance calculating unit 52, and a feature importance aggregating unit 53.

The sub-model traversing unit 51 is used for traversing each sub-model. The feature importance calculating unit 52 is configured to modify the feature importance calculating manner of the single submodel as follows: test set ks of the submodel _i +ks _i * Importance fimp of the feature _i . The feature importance aggregation unit 53 is configured to aggregate feature importance of all sub-models, and calculate a mean value of new feature importance of each feature, thereby obtaining a comprehensive weighted feature importance;wherein m represents the number of models trained overall, ks _i Ks value, fimp, representing the ith model _i Representing the feature importance of all the in-mold features of the ith model.

In summary, the model building method and system based on the feature importance provided by the invention can reduce the relative sorting fluctuation of the feature importance obtained by calculation and improve the reliability.

The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The description and applications of the present invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. Effects or advantages referred to in the embodiments may not be embodied in the embodiments due to interference of various factors, and description of the effects or advantages is not intended to limit the embodiments. Variations and modifications of the embodiments disclosed herein are possible, and alternatives and equivalents of the various components of the embodiments are known to those of ordinary skill in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, proportions, and with other assemblies, materials, and components, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims

1. The model building method based on the feature importance is characterized by being used in the field of big data wind control modeling, and comprises the following steps of:

step S1, initializing characteristic data;

s4, training a model, and calculating the importance of each sub-model feature;

s6, sorting the feature importance to obtain a new importance sorting;

s7, according to the feature sequence, the feature subset which is ranked at the front is taken to set the parameter range of the model again, and model modeling is retrained;

in the step S2, random non-replacement combined features are adopted, and multiple sampling is carried out through non-replacement sampling equal ratio features, so that each feature can participate in model training, the participating weights are the same, namely the number of models of each feature participating in training is consistent;

in the step S4, the training model adopts an XGBOOST model; training multiple groups of parameters by each group of models to obtain multiple groups of basic XGBOOST models;

the step S5 includes:

step S51, traversing each sub-model;

step S52, modifying the feature weight of the single submodelThe significance calculation mode is as follows: test set ks of the submodel _i +ks _i * Importance fimp of the feature _i ；

2. The feature importance-based model building method according to claim 1, characterized in that:

in the step S2, random unreplaced sampling is carried out on the characteristic data; alternatively, the feature data is sampled with a put back equal percentage.

3. The feature importance-based model building method according to claim 1, characterized in that:

in the step S2, a method of random combination is introduced, and a final model test set ks and sub-model feature importance weighting averaging method is combined, so that the obtained feature importance is more representative, and a smoother quantitative evaluation index is provided for the feature, so that an excellent basis is provided for subsequent modeling feature screening.

4. The feature importance-based model building method according to claim 1, characterized in that:

in the step S4, the obtained feature importance and the feature IV value are combined together to form an evaluation index again, and the evaluation mode of the XGBOOST tree and the evaluation mode of the linear IV are considered;

setting weight combination schemes with different IV and feature importance, and deriving additional feature evaluation indexes again; given IV weight a, feature importance weight b, and the new feature evaluation index after derivation is f_index new W

iv*a+f_imp_new*b。

5. The feature importance-based model building method according to claim 1, characterized in that:

the step S6 further includes: the progressive, decremental or step wise modeling is performed according to the resulting feature ordering lines.

6. A model building system based on feature importance, wherein the model building system is used in the big data wind control modeling field, the model building system comprises:

the feature data initializing module is used for initializing feature data;

the characteristic data sampling module is used for sampling characteristic data to form a plurality of groups of characteristic data combinations, and each group of characteristic data combination is used as a sub-model; the feature data sampling module adopts random non-replacement combined features, and performs multiple sampling through the features of non-replacement sampling equal ratio, so that each feature can participate in model training, and the weights of the participation are the same, namely the number of models of each feature participating in training is consistent;

the model training module is used for training a model and calculating the importance of each sub-model characteristic; the training model adopts an XGBOOST model; training multiple groups of parameters by each group of models to obtain multiple groups of basic XGBOOST models;

the training model modeling module is used for taking the feature subset which is ranked at the front according to the feature sequence to set the model parameter range again, and retraining model modeling;

the importance calculating module includes:

the sub-model traversing unit is used for traversing each sub-model;