CN112036476A - Data feature selection method and device based on binary classification service and computer equipment


Info

Publication number
CN112036476A
CN112036476A
Authority
CN
China
Prior art keywords
feature
data
data features
initial
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010888882.5A
Other languages
Chinese (zh)
Inventor
顾凌云
谢旻旗
段湾
张涛
潘峻
汪仁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai IceKredit Inc
Original Assignee
Shanghai IceKredit Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai IceKredit Inc filed Critical Shanghai IceKredit Inc
Priority to CN202010888882.5A
Publication of CN112036476A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The data feature selection method, device and computer equipment based on binary classification services provided by the embodiments of the invention train a first tree model with an extracted initial feature combination, obtain initial gain contribution values and screen m first data features out of n initial data features; determine correlation coefficients and feature iv values for the first data features, and screen j second data features out of the m first data features based on the correlation coefficients and feature iv values; then train a second tree model to obtain current gain contribution values, and screen k third data features out of the j second data features according to the current gain contribution values and plotted iv graphs. In this way, the gain contribution, inter-feature correlation and feature iv value of each data feature are all taken into account, so a small number of irreplaceable data features can be selected quickly and efficiently, which reduces both the time consumed in modeling and training the tree model and the processing load on the computer.

Description

Data feature selection method and device based on binary classification services and computer equipment
Technical Field
The invention relates to the technical field of data analysis, and in particular to a data feature selection method and device based on binary classification services, and to computer equipment.
Background
Tree models are widely used in binary classification service scenarios, and processing a binary classification service with a tree model improves both the efficiency and the accuracy of service processing. As the business scale and the volume of business data grow, the tree model must be modeled and trained to keep binary classification accurate and reliable. Modeling and training a tree model requires the data features of the business data as input; however, common methods of extracting data features from business data suffer from two problems:
(1) modeling and training the tree model on the extracted data features takes a long time;
(2) the correlation between data features is not considered, so the feature combination used for modeling and training the tree model contains many features whose effects are mutually replaceable; the number of data features is therefore large, which increases the processing load on the computer.
Disclosure of Invention
To solve the above problems, the present invention provides a data feature selection method and device based on binary classification services, and computer equipment.
First, a data feature selection method based on a binary classification service is provided. The method is applied to computer equipment and comprises at least the following steps:
acquiring target service data with a binary classification label, and performing feature extraction on the target service data to obtain an initial feature combination corresponding to the target service data; the initial feature combination comprises n initial data features corresponding to the target service data, where n is a positive integer;
training a first tree model with the initial feature combination, and acquiring an initial gain contribution value of each initial data feature in the first tree model; screening m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination, where m is a positive integer less than n;
determining correlation coefficients between the first data features in the first feature combination and a feature iv value for each first data feature, and screening j second data features out of the m first data features based on the correlation coefficients and feature iv values to form a second feature combination, where j is a positive integer less than m;
training a second tree model with the second feature combination, and acquiring a current gain contribution value of each second data feature of the second feature combination in the second tree model; drawing an iv graph for each second data feature, screening k third data features out of the j second data features according to the current gain contribution values and the iv graphs to form a third feature combination, and determining the third feature combination as the final feature combination of the target service data, where k is a positive integer less than j.
Further, screening m first data features out of the n initial data features according to the initial gain contribution values comprises:
determining, as the first data features, the initial data features whose initial gain contribution values are greater than a set threshold.
Further, screening m first data features out of the n initial data features according to the initial gain contribution values comprises:
sorting the n initial data features in descending order of initial gain contribution value to obtain a sorted sequence of initial data features;
and selecting the first m initial data features of the sorted sequence as the first data features.
Further, determining the correlation coefficients between the first data features in the first feature combination comprises:
judging whether the first data features in the first feature combination follow a normal distribution;
if the first data features in the first feature combination follow a normal distribution, determining Pearson correlation coefficients between them;
if they do not follow a normal distribution, determining Spearman correlation coefficients between them.
Further, screening j second data features out of the m first data features based on the correlation coefficients and feature iv values comprises:
initializing a sample feature combination identical to the first feature combination, the sample feature combination comprising the m first data features;
pairing the first data features in the sample feature combination with the first data features in the first feature combination to obtain a plurality of feature pairs;
calculating a target correlation coefficient between the two first data features of each feature pair;
judging whether the target correlation coefficient is greater than a set correlation coefficient; if so, determining whether both first data features of the feature pair are still present in the sample feature combination, and if they are, deleting the first data feature of the pair with the smaller feature iv value and keeping the one with the larger feature iv value;
determining the j first data features retained in the sample feature combination as the second data features.
Further, screening k third data features out of the j second data features according to the current gain contribution values and the iv graphs comprises:
analyzing each iv graph in descending order of current gain contribution value to obtain an analysis result;
and if the analysis result indicates that the positive-example proportion across the feature's code groups does not vary consistently with the ascending or descending order of the codes, removing the second data feature corresponding to that iv graph, and determining the k retained second data features as the k third data features.
Further, the feature iv value of each first data feature is determined by:
extracting the feature code of each first data feature and splitting the feature code into a plurality of code groups;
calculating, for each code group, a first proportion of its positive examples to the global positive examples and a second proportion of its negative examples to the global negative examples;
and determining the feature iv value of each first data feature from the first proportions and the second proportions.
Second, a data feature selection device based on a binary classification service is provided. The device is applied to computer equipment and comprises at least the following functional modules:
a feature extraction module, configured to acquire target service data with a binary classification label and perform feature extraction on the target service data to obtain an initial feature combination corresponding to the target service data; the initial feature combination comprises n initial data features corresponding to the target service data, where n is a positive integer;
a first selection module, configured to train a first tree model with the initial feature combination and acquire an initial gain contribution value of each initial data feature in the first tree model; and to screen m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination, where m is a positive integer less than n;
a second selection module, configured to determine correlation coefficients between the first data features in the first feature combination and a feature iv value for each first data feature, and to screen j second data features out of the m first data features based on the correlation coefficients and feature iv values to form a second feature combination, where j is a positive integer less than m;
a third selection module, configured to train a second tree model with the second feature combination and acquire a current gain contribution value of each second data feature of the second feature combination in the second tree model; and to draw an iv graph for each second data feature, screen k third data features out of the j second data features according to the current gain contribution values and the iv graphs to form a third feature combination, and determine the third feature combination as the final feature combination of the target service data, where k is a positive integer less than j.
There is then provided a computer device comprising a processor and a memory, the processor being in communication with the memory, the processor being arranged to retrieve a computer program from the memory and to implement the data feature selection method described above by running the computer program.
Finally, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when executed, implements the above-described data characteristic selection method.
The data feature selection method, device and computer equipment based on binary classification services provided by the embodiments of the present invention first perform feature extraction on the acquired target service data to obtain an initial feature combination; then train a first tree model with the initial feature combination, obtain an initial gain contribution value of each initial data feature in the first tree model, and screen m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination; then determine correlation coefficients between the first data features in the first feature combination and a feature iv value for each first data feature, and screen j second data features out of the m first data features based on the correlation coefficients and feature iv values to form a second feature combination; and finally train a second tree model with the second feature combination, obtain the current gain contribution value of each second data feature in the second tree model, and screen k third data features out of the j second data features according to the current gain contribution values and the drawn iv graphs to form the final feature combination of the target service data. In this way, the gain contribution, inter-feature correlation and feature iv value of each data feature are all taken into account, so a small number of irreplaceable data features can be selected quickly and efficiently, which reduces both the time consumed in modeling and training the tree model and the processing load on the computer.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of a data feature selection method based on a binary classification service according to an embodiment of the present invention.
Fig. 2 is a block diagram of a data feature selection apparatus based on a binary classification service according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
To aid understanding of the technical solutions of the present invention, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples are detailed illustrations of the technical solutions, not limitations of them, and that the technical features in the embodiments and examples may be combined with one another provided they do not conflict.
Through investigation and research, the inventors found that the common data feature extraction methods are the recursive feature addition method and the recursive feature elimination method. The recursive feature addition method generally comprises the following five steps.
(1) The computer acquires data with binary classification labels from a text file or a database. The two labels are commonly called the positive example and the negative example; for instance, in loan data, a label of 1 indicates that the audit failed and no loan was granted, i.e., a positive example, while a label of 0 indicates that the audit passed and the loan was granted, i.e., a negative example. All features are then obtained from the data.
(2) The computer initializes a feature combination based on all of the above features. For example, default features that must be included in the combination can be specified manually according to business requirements; the initialized combination may also be empty, i.e., contain no features.
(3) The computer device then cyclically adds new features to the feature combination in a certain order and models with the new combination. The order may be a feature-importance order obtained by modeling all features in advance, or a manually assigned order over the categories to which the features belong, such as address, education and income.
(4) During the loop, the modeling effect of the feature combination is measured with indexes that meet the evaluation requirements of the service, such as the accuracy, recall, AUC and KS of the model built from the new feature combination on data prediction. If the effect improves by a certain margin, the features newly added in that round are kept; otherwise they are abandoned.
(5) The loop ends, and the optimal feature combination is obtained, when the number of features in the expanded combination reaches a certain value, or the modeling effect of the expanded combination meets expectations, or all features have been tried; the target number and the expected effect are set manually according to business requirements.
Further, the recursive feature elimination method is similar to the recursive feature addition method, except that it shrinks the feature combination through a loop instead of expanding it: it generally initializes a combination containing all features and then removes features one after another, stopping the loop and taking the final feature combination once the number of features is appropriate or the modeling effect would otherwise fall below expectations. A sketch of this elimination loop is given below.
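As a point of reference only, the elimination loop just described can be sketched with scikit-learn's RFE; the library, the base model and the stopping count are assumptions for illustration, since this publication does not prescribe any of them.

```python
# Minimal sketch of the prior-art recursive feature elimination loop.
# GradientBoostingClassifier and the stopping count of 10 are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Start from all 50 features and repeatedly drop the least important one
# until only 10 remain (the stopping count is set by business requirements).
selector = RFE(GradientBoostingClassifier(), n_features_to_select=10, step=1)
selector.fit(X, y)
kept_mask = selector.support_  # boolean mask of the surviving features
```

Each refit inside RFE is a full modeling pass, which is exactly the repeated-modeling cost criticized below.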
However, both the recursive feature elimination method and the recursive feature addition method require a loop, and typical business settings involve large amounts of data and many features, so a single modeling pass already takes a long time and looped modeling takes far longer; the overall rate is therefore low. In addition, recursive feature addition and elimination generally do not consider the correlation between features, so the feature combination ends up containing several features whose effects are mutually replaceable. The contribution to the model is split evenly among them, so no single feature can exert its full effect, and the number of features in the combination is large, which increases the modeling cost and the computational load on the computer.
To solve the above problems, embodiments of the present invention provide a data feature selection method and device based on binary classification services, and computer equipment, which take into account the gain contribution, inter-feature correlation and feature iv value of each data feature, so that a small number of irreplaceable data features can be selected quickly and efficiently; this reduces the time consumed in modeling and training the tree model and the processing load on the computer.
Referring to fig. 1, a flowchart of a data feature selection method based on a binary classification service according to an embodiment of the present invention is shown. The method is applied to a computer device and may specifically include the contents described in the following steps S110 to S140.
Step S110, acquiring target service data with a binary classification label, and performing feature extraction on the target service data to obtain an initial feature combination corresponding to the target service data.
In this embodiment, the initial feature combination comprises n initial data features corresponding to the target service data, where n is a positive integer. The two classification labels are commonly called the positive example and the negative example; for instance, loan data marked with a label of 1 indicates that the audit failed and no loan was granted, i.e., a positive example, while loan data marked with a label of 0 indicates that the audit passed and the loan was granted, i.e., a negative example.
Step S120, training a first tree model with the initial feature combination, and acquiring an initial gain contribution value of each initial data feature in the first tree model; and screening m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination.
In this embodiment, m is a positive integer less than n. The initial gain contribution value can be obtained by dividing the sum of the information gains contributed by a data feature at every split node across the whole tree ensemble by the number of times that feature is used. It is one of the importance indexes of a data feature: the higher the gain contribution, the more important the feature.
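As an illustration, this quantity can be read directly from common gradient-boosting libraries. The sketch below assumes XGBoost, which this publication does not name; its "gain" importance is the total split gain of a feature divided by the number of splits using it, matching the definition above.

```python
# Sketch: training a first tree model and reading per-feature gain
# contributions. XGBoost and the toy data are illustrative assumptions.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)  # X, y stand in for the initial feature combination and labels

gain = model.get_booster().get_score(importance_type="gain")
# e.g. {"f0": 12.3, "f7": 0.004, ...}; a higher gain means a more
# important initial data feature.
```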
Step S130, determining a correlation coefficient between first data features in the first feature combination and a feature iv value of each first data feature, and screening j second data features from the m first data features based on the correlation coefficient and the feature iv value to form a second feature combination.
In this embodiment, j is a positive integer less than m. The correlation coefficient measures how closely the variation trends and directions of two data features track each other; the usual statistical indexes for this are the Pearson correlation coefficient and the Spearman correlation coefficient. If the low values of two data features correspond to each other and so do the high values, i.e., the features are highly positively correlated, the two features can substitute for each other to some extent when entering the model; if both enter the model at the same time, they split the contribution to the modeling effect between them.
The feature iv value is used to evaluate the coding and predictive power of a data feature. After coding, the feature values are divided into t groups; for each group, the proportion of its positive examples yi to the global positive examples ys and the proportion of its negative examples ni to the global negative examples ns are calculated, and the feature iv value is then computed from these proportions by a preset formula.
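The publication leaves the formula as "preset"; the standard information value computation over the quantities just defined, which is presumably what is meant, is

iv = sum over i = 1 ... t of [ (yi/ys) - (ni/ns) ] * ln[ (yi/ys) / (ni/ns) ],

i.e., for each of the t groups, the difference between its positive and negative proportions weighted by the corresponding weight of evidence, summed over all groups.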
In addition, a corresponding iv graph can be drawn with the code groups on the horizontal axis and the sample count of each group and its positive-example proportion on twin vertical axes. The iv graph intuitively shows how the positive-example proportion trends as the value of the data feature changes.
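A minimal sketch of such a chart follows; the bar/line layout and the illustrative numbers are assumptions, since the publication only states which quantities are plotted.

```python
# Sketch of an iv graph: group sizes as bars on the left axis,
# positive-example proportion as a line on the right axis.
import matplotlib.pyplot as plt

group_counts = [120, 340, 280, 90]          # illustrative group sizes
positive_ratio = [0.05, 0.12, 0.31, 0.44]   # illustrative positive proportions
groups = range(len(group_counts))

fig, ax_count = plt.subplots()
ax_count.bar(groups, group_counts, color="lightgray")
ax_count.set_xlabel("feature code group")
ax_count.set_ylabel("sample count")

ax_ratio = ax_count.twinx()  # second vertical axis
ax_ratio.plot(groups, positive_ratio, marker="o", color="tab:red")
ax_ratio.set_ylabel("positive-example proportion")
plt.show()
```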
Step S140, training a second tree model by using the second feature combination, and acquiring a current gain contribution value of each second data feature in the second feature combination in the second tree model; drawing an iv graph of each second data feature, screening k third data features from the j second data features according to the current gain contribution value and the iv graph, forming a third feature combination, and determining the third feature combination as a final feature combination of the target service data; wherein k is a positive integer less than j.
It can be understood from steps S110 to S140 above that feature extraction is first performed on the acquired target service data to obtain an initial feature combination; next, the first tree model is trained with the initial feature combination, the initial gain contribution value of each initial data feature in the first tree model is obtained, and m first data features are screened out of the n initial data features according to the initial gain contribution values to form a first feature combination; then, the correlation coefficients between the first data features in the first feature combination and the feature iv value of each first data feature are determined, and j second data features are screened out of the m first data features based on the correlation coefficients and feature iv values to form a second feature combination; finally, the second tree model is trained with the second feature combination, the current gain contribution value of each second data feature in the second tree model is obtained, and k third data features are screened out of the j second data features according to the current gain contribution values and the drawn iv graphs to form the final feature combination of the target service data.
In this way, the gain contribution, inter-feature correlation and feature iv value of each data feature are all taken into account, so a small number of irreplaceable data features can be selected quickly and efficiently, which reduces both the time consumed in modeling and training the tree model and the processing load on the computer.
In an alternative embodiment, the screening of m first data features out of the n initial data features according to the initial gain contribution values in step S120 may be implemented in either of the following two ways, although concrete implementations are not limited to these.
First, the initial data features whose initial gain contribution values are greater than a set threshold are determined to be the first data features.
Second, the n initial data features are sorted in descending order of initial gain contribution value to obtain a sorted sequence, and the first m initial data features of the sorted sequence are selected as the first data features.
In this embodiment, the threshold and m may be chosen according to the distribution of the gain contributions. For example, if data features with a gain contribution below 0.005 are judged to contribute little to the tree model, and roughly 100 data features have a gain contribution above 0.005, then the remaining feature combination neither loses too many important features nor falls short of a first screening of the unimportant ones, so setting the threshold a to 0.005, or m to 100, is appropriate. The specific implementation is not limited to these values. Both strategies are sketched below.
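For illustration, both strategies reduce to a few lines; the `gain` mapping below is assumed to come from a trained tree model as in the earlier sketch.

```python
# Sketch of the two screening strategies over initial gain contributions.
gain = {"f0": 0.120, "f1": 0.003, "f2": 0.045, "f3": 0.0049}  # illustrative

# Strategy 1: keep features whose gain contribution exceeds the threshold a.
a = 0.005
first_features = [f for f, g in gain.items() if g > a]         # ["f0", "f2"]

# Strategy 2: sort by gain contribution, descending, and keep the top m.
m = 2
first_features = sorted(gain, key=gain.get, reverse=True)[:m]  # ["f0", "f2"]
```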
It can be understood from the above that the gain contribution of each data feature is taken into account, ensuring that the first data features obtained by the screening have satisfactory gain contributions.
In a possible implementation manner, the determining of the correlation coefficient between the first data features in the first feature combination described in step S130 may specifically include the following descriptions of step S1311 to step S1313.
Step S1311, judging whether the first data features in the first feature combination follow a normal distribution; if so, proceeding to step S1312, and if not, proceeding to step S1313.
Step S1312, determining the Pearson correlation coefficients between the first data features in the first feature combination.
Step S1313, determining the Spearman correlation coefficients between the first data features in the first feature combination.
Through the above steps S1311 to S1313, the correlation coefficients between the first data features can be determined accurately.
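A minimal sketch of this branch follows; the publication does not name a normality test, so D'Agostino's normaltest from SciPy is an assumption here.

```python
# Sketch of steps S1311-S1313: Pearson if both features look normal,
# Spearman otherwise. The test and significance level are assumptions.
import numpy as np
from scipy import stats

def correlation(x: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> float:
    normal = (stats.normaltest(x)[1] > alpha) and (stats.normaltest(y)[1] > alpha)
    if normal:
        return stats.pearsonr(x, y)[0]   # S1312: Pearson correlation
    return stats.spearmanr(x, y)[0]      # S1313: Spearman correlation
```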
On the basis of the foregoing steps S1311 to S1313, the screening out j second data features from the m first data features based on the correlation coefficient and the feature iv value, which is described in step S130, may specifically include what is described in the following steps S1321 to S1325.
Step S1321, initializing a sample feature combination identical to the first feature combination; the sample feature combination comprises the m first data features.
Step S1322, pairing the first data features in the sample feature combination with the first data features in the first feature combination to obtain a plurality of feature pairs.
Step S1323, calculating a target correlation coefficient between the two first data features of each feature pair.
Step S1324, judging whether the target correlation coefficient is greater than a set correlation coefficient; if so, determining whether both first data features of the feature pair are still present in the sample feature combination, and if they are, deleting the first data feature of the pair with the smaller feature iv value and retaining the one with the larger feature iv value.
Step S1325, determining the j first data features retained in the sample feature combination as the second data features.
In this embodiment, the first feature combination may be defined as F1, the sample feature combination may be defined as F2, the set correlation coefficient may be defined as b, the list of feature iv values may be defined as I, and the correlation matrix of the first data feature may be defined as C.
Further, steps S1321 to S1325 can be implemented by the algorithm sketched below.
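The original publication presents this algorithm only as an image, so the following Python sketch is a reconstruction from steps S1321 to S1325 using the names defined above (F1, F2, b, I, C); the concrete data structures and the default value of b are assumptions.

```python
# Reconstruction of steps S1321-S1325. F1: first feature combination,
# C: correlation matrix as a nested dict, I: feature iv values as a dict,
# b: the set correlation coefficient (0.8 here is illustrative).
from itertools import combinations

def filter_correlated(F1, C, I, b=0.8):
    F2 = set(F1)                          # S1321: sample combination = F1
    for f, g in combinations(F1, 2):      # S1322: all feature pairs
        r = C[f][g]                       # S1323: target correlation
        if abs(r) > b and f in F2 and g in F2:   # S1324: both still retained?
            F2.discard(f if I[f] < I[g] else g)  # drop the smaller-iv feature
    return F2                             # S1325: the j second data features
```

For example, with illustrative values C["age"]["age_group"] = 0.95 and I["age"] = 0.31 versus I["age_group"] = 0.12, only "age" would be retained.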
It can be understood from steps S1321 to S1325 that the correlation between data features is taken into account: when the correlation between two data features is high, only the feature with the better iv value is retained. This both weakens the correlation among the data features that finally enter the model and ensures that the retained data features discriminate well, which benefits the modeling effect.
In a possible implementation manner, the step S140 of screening k third data features from the j second data features according to the current gain contribution value and the iv map may specifically include the following sub-steps S141 to S142.
Step S141, analyzing each iv graph in descending order of current gain contribution value to obtain an analysis result.
Step S142, if the analysis result indicates that the positive-example proportion across the feature's code groups does not vary consistently with the ascending or descending order of the codes, removing the second data feature corresponding to that iv graph, and determining the k retained second data features as the k third data features.
It can be understood that, through steps S141 to S142, screening can be combined with the iv graphs, which reduces the number of data features that finally enter the model, further simplifies the model, and lowers the overall business cost.
Optionally, the feature iv value of each first data feature described in step S130 may be determined as follows: extract the feature code of each first data feature and split it into a plurality of code groups; for each code group, calculate a first proportion of its positive examples to the global positive examples and a second proportion of its negative examples to the global negative examples; and determine the feature iv value of each first data feature from the first proportions and the second proportions. In this way, the accuracy of the feature iv value can be ensured. A sketch of this computation is given below.
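The sketch below implements this procedure with the standard information value formula noted earlier; the smoothing term eps, added to avoid division by zero in empty groups, is an assumption.

```python
# Sketch: feature iv value from a feature's code groups and binary labels.
import numpy as np

def feature_iv(codes: np.ndarray, labels: np.ndarray, eps: float = 1e-6) -> float:
    iv = 0.0
    ys = (labels == 1).sum()              # global positive examples
    ns = (labels == 0).sum()              # global negative examples
    for code in np.unique(codes):         # one iteration per code group
        mask = codes == code
        yi = (labels[mask] == 1).sum() / ys + eps   # first proportion
        ni = (labels[mask] == 0).sum() / ns + eps   # second proportion
        iv += (yi - ni) * np.log(yi / ni)           # weight-of-evidence term
    return iv
```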
Based on the same inventive concept, and referring to fig. 2, a data feature selection apparatus 200 based on a binary classification service is provided. The data feature selection apparatus 200 is applied to computer equipment and comprises at least the following functional modules:
a feature extraction module 210, configured to acquire target service data with a binary classification label and perform feature extraction on the target service data to obtain an initial feature combination corresponding to the target service data; the initial feature combination comprises n initial data features corresponding to the target service data, where n is a positive integer;
a first selection module 220, configured to train a first tree model with the initial feature combination and acquire an initial gain contribution value of each initial data feature in the first tree model; and to screen m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination, where m is a positive integer less than n;
a second selection module 230, configured to determine correlation coefficients between the first data features in the first feature combination and a feature iv value for each first data feature, and to screen j second data features out of the m first data features based on the correlation coefficients and feature iv values to form a second feature combination, where j is a positive integer less than m;
a third selection module 240, configured to train a second tree model with the second feature combination and acquire a current gain contribution value of each second data feature of the second feature combination in the second tree model; and to draw an iv graph for each second data feature, screen k third data features out of the j second data features according to the current gain contribution values and the iv graphs to form a third feature combination, and determine the third feature combination as the final feature combination of the target service data, where k is a positive integer less than j.
Further, referring to fig. 3, a computer device 300 is provided, which includes a processor 310 and a memory 320, the processor 310 is in communication with the memory 320, and the processor 310 is configured to retrieve a computer program from the memory 320 and execute the computer program to implement the data feature selection method described above.
Further, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when executed, implements the above-mentioned data feature selection method.
The above are merely examples of the present application and are not intended to limit it. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within the scope of its claims.

Claims (10)

1. A data feature selection method based on a binary classification service, applied to computer equipment, the method comprising at least the following steps:
acquiring target service data with a binary classification label, and performing feature extraction on the target service data to obtain an initial feature combination corresponding to the target service data; the initial feature combination comprises n initial data features corresponding to the target service data, wherein n is a positive integer;
training a first tree model with the initial feature combination, and acquiring an initial gain contribution value of each initial data feature in the first tree model; screening m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination; wherein m is a positive integer less than n;
determining a correlation coefficient between first data features in the first feature combination and a feature iv value of each first data feature, and screening j second data features from the m first data features based on the correlation coefficient and the feature iv value to form a second feature combination; wherein j is a positive integer less than m;
training a second tree model by adopting the second feature combination, and acquiring a current gain contribution value of each second data feature in the second feature combination in the second tree model; drawing an iv graph of each second data feature, screening k third data features from the j second data features according to the current gain contribution value and the iv graph, forming a third feature combination, and determining the third feature combination as a final feature combination of the target service data; wherein k is a positive integer less than j.
2. The method of claim 1, wherein screening m first data features out of the n initial data features according to the initial gain contribution values comprises:
determining, as the first data features, the initial data features whose initial gain contribution values are greater than a set threshold.
3. The method of claim 1, wherein screening m first data features out of the n initial data features according to the initial gain contribution values comprises:
sorting the n initial data features in descending order of initial gain contribution value to obtain a sorted sequence of initial data features;
and selecting the first m initial data features of the sorted sequence as the first data features.
4. The method of claim 1, wherein determining the correlation coefficients between the first data features in the first feature combination comprises:
judging whether the first data features in the first feature combination follow a normal distribution;
if the first data features in the first feature combination follow a normal distribution, determining Pearson correlation coefficients between them;
if they do not follow a normal distribution, determining Spearman correlation coefficients between them.
5. The method of claim 4, wherein screening j second data features out of the m first data features based on the correlation coefficients and feature iv values comprises:
initializing a sample feature combination identical to the first feature combination, the sample feature combination comprising the m first data features;
pairing the first data features in the sample feature combination with the first data features in the first feature combination to obtain a plurality of feature pairs;
calculating a target correlation coefficient between the two first data features of each feature pair;
judging whether the target correlation coefficient is greater than a set correlation coefficient; if so, determining whether both first data features of the feature pair are still present in the sample feature combination, and if they are, deleting the first data feature of the pair with the smaller feature iv value and keeping the one with the larger feature iv value;
determining the j first data features retained in the sample feature combination as the second data features.
6. The method of claim 1, wherein screening k third data features out of the j second data features according to the current gain contribution values and the iv graphs comprises:
analyzing each iv graph in descending order of current gain contribution value to obtain an analysis result;
and if the analysis result indicates that the positive-example proportion across the feature's code groups does not vary consistently with the ascending or descending order of the codes, removing the second data feature corresponding to that iv graph, and determining the k retained second data features as the k third data features.
7. The method of any one of claims 1 to 6, wherein the feature iv value of each first data feature is determined by:
extracting the feature code of each first data feature and splitting the feature code into a plurality of code groups;
calculating, for each code group, a first proportion of its positive examples to the global positive examples and a second proportion of its negative examples to the global negative examples;
and determining the feature iv value of each first data feature from the first proportions and the second proportions.
8. A data feature selection device based on a binary classification service, applied to computer equipment, the device comprising at least the following functional modules:
a feature extraction module, configured to acquire target service data with a binary classification label and perform feature extraction on the target service data to obtain an initial feature combination corresponding to the target service data; the initial feature combination comprises n initial data features corresponding to the target service data, wherein n is a positive integer;
a first selection module, configured to train a first tree model with the initial feature combination and acquire an initial gain contribution value of each initial data feature in the first tree model; and to screen m first data features out of the n initial data features according to the initial gain contribution values to form a first feature combination, wherein m is a positive integer less than n;
a second selection module, configured to determine correlation coefficients between the first data features in the first feature combination and a feature iv value for each first data feature, and to screen j second data features out of the m first data features based on the correlation coefficients and feature iv values to form a second feature combination, wherein j is a positive integer less than m;
a third selection module, configured to train a second tree model with the second feature combination and acquire a current gain contribution value of each second data feature of the second feature combination in the second tree model; and to draw an iv graph for each second data feature, screen k third data features out of the j second data features according to the current gain contribution values and the iv graphs to form a third feature combination, and determine the third feature combination as the final feature combination of the target service data, wherein k is a positive integer less than j.
9. A computer device comprising a processor and a memory, the processor being in communication with the memory, the processor being configured to retrieve a computer program from the memory and to execute the computer program to perform the data feature selection method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, which computer program, when executed, implements the data feature selection method of any one of claims 1 to 7.
CN202010888882.5A 2020-08-28 2020-08-28 Data feature selection method and device based on binary classification service and computer equipment Pending CN112036476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010888882.5A CN112036476A (en) Data feature selection method and device based on binary classification service and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010888882.5A CN112036476A (en) Data feature selection method and device based on binary classification service and computer equipment

Publications (1)

Publication Number Publication Date
CN112036476A true CN112036476A (en) 2020-12-04

Family

ID=73587709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010888882.5A Pending CN112036476A (en) Data feature selection method and device based on binary classification service and computer equipment

Country Status (1)

Country Link
CN (1) CN112036476A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199559A (en) * 2020-12-07 2021-01-08 上海冰鉴信息科技有限公司 Data feature screening method and device and computer equipment
CN112199559B (en) * 2020-12-07 2021-02-19 上海冰鉴信息科技有限公司 Data feature screening method and device and computer equipment
CN112529477A (en) * 2020-12-29 2021-03-19 平安普惠企业管理有限公司 Credit evaluation variable screening method, device, computer equipment and storage medium
CN113255806A (en) * 2021-06-03 2021-08-13 上海冰鉴信息科技有限公司 Sample feature determination method, sample feature determination device and electronic equipment
CN113312552A (en) * 2021-06-10 2021-08-27 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN113516513A (en) * 2021-07-20 2021-10-19 重庆度小满优扬科技有限公司 Data analysis method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
WO2021164382A1 (en) Method and apparatus for performing feature processing for user classification model
CN107391772B (en) Text classification method based on naive Bayes
Hu A multivariate grey prediction model with grey relational analysis for bankruptcy prediction problems
CN109783805B (en) Network community user identification method and device and readable storage medium
CN111931809A (en) Data processing method and device, storage medium and electronic equipment
CN112818162A (en) Image retrieval method, image retrieval device, storage medium and electronic equipment
Mousavi et al. Improving customer clustering by optimal selection of cluster centroids in k-means and k-medoids algorithms
CN113807940A (en) Information processing and fraud identification method, device, equipment and storage medium
CN113468538A (en) Vulnerability attack database construction method based on similarity measurement
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN106845229B (en) Virus characteristic extraction method and system based on FTS model
CN116049644A (en) Feature screening and clustering and binning method and device, electronic equipment and storage medium
CN115757900A (en) User demand analysis method and system applying artificial intelligence model
Letteri et al. Dataset Optimization Strategies for Malware Traffic Detection
CN115292303A (en) Data processing method and device
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN111143303B (en) Log classification method based on information gain and improved KNN algorithm
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
CN115345248A (en) Deep learning-oriented data depolarization method and device
CN114897290A (en) Evolution identification method and device of business process, terminal equipment and storage medium
CN115982634A (en) Application program classification method and device, electronic equipment and computer program product
CN114611592A (en) Semi-supervised feature selection method, system, medium, equipment and terminal
CN112529319A (en) Grading method and device based on multi-dimensional features, computer equipment and storage medium
CN111930883A (en) Text clustering method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201204)