CN112149702A

CN112149702A - Feature processing method and device

Info

Publication number: CN112149702A
Application number: CN201910576748.9A
Authority: CN
Inventors: 王倩; 徐晓飞; 杨海华
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2020-12-29

Abstract

Embodiments of the invention provide a feature processing method and device. The method comprises: obtaining a data set to be processed, wherein the data set comprises at least two types of features and at least two pieces of data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, wherein the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved and the modeling effect is improved.

Description

Feature processing method and device

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to a feature processing method and device.

Background

With the advent of the big data era, more and more data is used as training samples to build business models. Each piece of data has multiple types of features, and business models are usually trained on these features. However, collinearity may exist among the features, which makes the trained model insufficiently accurate. Collinearity means that, in a linear regression model, exact or high correlation among the explanatory variables distorts the model estimate or makes it difficult to estimate accurately.

At present, to avoid collinearity, the correlation coefficient between every two features is generally calculated, and features are screened according to the correlation coefficients alone. However, the accuracy of this screening is not high: features that are actually needed may be filtered out, so that the trained model is not accurate enough.

Disclosure of Invention

The embodiment of the invention provides a feature processing method and device, which are used for improving the accuracy of data screening and improving the modeling effect.

In a first aspect, an embodiment of the present invention provides a feature processing method, including:

acquiring a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature;

calculating a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation between the two types of features;

acquiring an importance ranking of the at least two types of features by using an importance analysis model;

and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training.

In one possible design, the filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training includes:

for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.

In one possible design, the calculating a correlation coefficient between every two types of features in the data set to be processed includes:

calculating a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.

In one possible design, the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$

wherein $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of pieces of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the t-th piece of data, $x_{jt}$ is the value corresponding to the jth feature of the t-th piece of data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.

In a second aspect, an embodiment of the present invention provides a processing apparatus, including:

an obtaining module, configured to obtain a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature;

a calculation module, configured to calculate a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation between the two types of features;

the obtaining module being further configured to obtain an importance ranking of the at least two types of features by using an importance analysis model;

and a filtering module, configured to filter the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training.

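In one possible design, the filtering module is specifically configured to:

for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filter out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.

In one possible design, the calculation module is specifically configured to:

calculate a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.

In one possible design, the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$

wherein $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of pieces of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the t-th piece of data, $x_{jt}$ is the value corresponding to the jth feature of the t-th piece of data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.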

In a third aspect, an embodiment of the present invention provides an electronic device, including:

a memory and a processor;

the memory is configured to store program code;

the processor is configured to call the program code to execute the feature processing method of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a readable storage medium on which a computer program is stored; when executed, the computer program performs the feature processing method of the first aspect.

Drawings

FIG. 1 is a flow chart of a method for processing features provided in accordance with an embodiment of the present invention;

FIG. 2 is a flow chart of a method of processing features provided in another embodiment of the invention;

FIG. 3 is a schematic diagram of a processing device according to one embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The feature processing method provided by the embodiments of the invention can be applied to data modeling scenarios. For example, in fields such as finance, the internet, and retail, features are generally screened when a business model is trained, to avoid collinearity among the different types of features input to the model.

To address the low accuracy of feature screening based on correlation coefficients alone in the prior art, an embodiment of the invention provides a feature processing method comprising: obtaining a data set to be processed, wherein the data set comprises at least two types of features and at least two pieces of data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, wherein the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved and the modeling effect is improved.

The technical solution of the embodiments of the present invention is described in detail below through several specific embodiments. FIG. 1 is a flowchart of a feature processing method according to an embodiment of the present invention. The method may be executed by an electronic device, such as a notebook computer, a mobile phone, or a tablet computer, which is not limited in this embodiment. As shown in FIG. 1, the method comprises the following steps:

S101, acquiring a data set to be processed.

Generally, in the modeling process, collinearity may exist among the multiple types of features obtained in advance for training a business model. Collinearity means that, in a linear regression model, exact or high correlation among the explanatory variables distorts the model estimate or makes it difficult to estimate accurately. The features therefore need to be screened, which improves the modeling effect and the accuracy of the trained model.

In this embodiment, the data set to be processed includes at least two types of features and at least two data corresponding to each type of feature, where the at least two types of features may include age, height, gender, and the like, and may be determined specifically according to modeling requirements, which is not limited in this embodiment.

Illustratively, the data set to be processed includes n types of features, where n is a positive integer greater than or equal to 2, each type of feature corresponds to at least two pieces of data, each piece of data is denoted as an n-dimensional vector, each dimension represents a type of feature, where the dimensions of the pieces of data are the same, and the same dimension of the pieces of data represents the same feature. The features of the same dimension representation in each data may be predefined, e.g. the first dimension all representing age, the second dimension all representing height, the third dimension all representing gender.

In a practical application, suppose a retail business model is to be built to predict customers' purchasing power, and features need to be screened during modeling to improve the modeling effect. First, a data set to be processed is obtained. The data set comprises three types of features, and each type of feature corresponds to two pieces of data, denoted data 1 and data 2. Data 1 and data 2 are three-dimensional vectors in which the first dimension is predefined as age, the second as gender, and the third as height; that is, data 1 and data 2 each contain the three feature types age, gender, and height. Data 1 may be represented as vector 1 (30, 01, 1.6) and data 2 as vector 2 (40, 10, 1.7), where data 1 represents a 30-year-old woman 1.6 meters tall and data 2 represents a 40-year-old man 1.7 meters tall.

It should be noted that gender is converted into a number according to a preset rule. In an actual modeling process, every feature may likewise be converted into a corresponding number according to the actual situation and the preset rule for that feature, which is not limited in this embodiment.
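The layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `GENDER_CODES`, `FEATURE_NAMES`, and `encode` are hypothetical, and the codes 01 and 10 from the example are read here as the plain integers 1 and 10, which the text does not specify.

```python
# Hypothetical preset rule for turning gender into a number. The codes mirror
# the "01" and "10" values of the example, read here as plain integers.
GENDER_CODES = {"female": 1, "male": 10}

# The meaning of each dimension is fixed in advance and shared by all data.
FEATURE_NAMES = ["age", "gender", "height"]


def encode(record):
    """Turn one raw record into an n-dimensional numeric vector."""
    return [
        float(record["age"]),
        float(GENDER_CODES[record["gender"]]),
        float(record["height"]),
    ]


raw_records = [
    {"age": 30, "gender": "female", "height": 1.6},  # data 1
    {"age": 40, "gender": "male", "height": 1.7},    # data 2
]

# m rows (pieces of data) by n columns (feature types).
dataset = [encode(record) for record in raw_records]

# Column view: one list of m values per feature type. This is the shape
# needed later, when two feature types are compared with each other.
columns = {
    name: [row[k] for row in dataset]
    for k, name in enumerate(FEATURE_NAMES)
}
```

Keeping both a row view and a column view is a convenience of this sketch: rows match the "one vector per piece of data" description, while columns match the per-feature comparison performed in S102.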

S102, calculating a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation of the two types of features.

The correlation coefficient represents the degree of correlation between two types of features. Its value lies between -1 and 1, so its absolute value lies between 0 and 1. The larger the absolute value, the higher the degree of correlation; the smaller the absolute value, the lower the degree of correlation; an absolute value of 0 indicates no correlation.

In a modeling process, if the features input to the model are collinear, the model estimate is distorted or difficult to estimate accurately, which harms the modeling effect. Therefore, in this embodiment, the correlation coefficient between every two types of features in the data set to be processed is calculated; that is, for each pair of feature types among the at least two types, one correlation coefficient is calculated over all the data in the data set.

Illustratively, the correlation coefficients between age and height, between age and gender, and between gender and height in S101 are calculated respectively.

In one possible design, the correlation coefficient is the Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient (PPMCC or PCC). It measures the linear correlation between two variables; its value lies between -1 and 1, and its absolute value lies between 0 and 1.
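The pairwise calculation of S102 can be sketched as below, assuming the standard Pearson product-moment definition. The helper names `pearson` and `pairwise_correlations` and all data values are hypothetical and serve only to show that one coefficient is produced per unordered pair of feature types.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of values."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant feature has no defined correlation with anything.
        raise ValueError("correlation is undefined for a constant feature")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Map each unordered pair of feature names to its coefficient."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical data: five pieces of data, three feature types.
columns = {
    "age": [25, 30, 35, 40, 45],
    "gender": [0, 1, 0, 1, 1],
    "height": [1.60, 1.75, 1.62, 1.80, 1.78],
}
correlations = pairwise_correlations(columns)
```

For n feature types this yields n(n-1)/2 coefficients, here the three pairs age and gender, age and height, and gender and height.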

S103, acquiring an importance ranking of the at least two types of features by using an importance analysis model.

In this embodiment, to avoid the inaccurate feature filtering that results from relying on a single criterion, after the correlation coefficient between every two types of features in the data set to be processed has been calculated, an importance analysis model may further be used to rank the features by importance. In the importance ranking, the features may be ordered from most to least important or from least to most important, which is not limited in this embodiment.

The importance ranking may be a ranking of the importance scores of the respective features: the larger the importance score, the higher the importance, and the smaller the importance score, the lower the importance.

In this embodiment, the importance analysis model may be an xgboost model or a gbdt model, and may be obtained by pre-training. The data in the data set to be processed is input into the importance analysis model to obtain an importance score for each feature, and the importance ranking is then obtained from these scores.
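Turning importance scores into a ranking can be sketched as follows. The scores are hypothetical placeholders standing in for the output of a pre-trained importance analysis model; no real model is trained here, and the names `rank_by_importance` and `rank_of` are assumptions of this sketch.

```python
def rank_by_importance(scores, descending=True):
    """Return the feature names ordered by their importance score."""
    return sorted(scores, key=scores.get, reverse=descending)


# Hypothetical scores, standing in for the output of a trained model.
importance_scores = {"age": 0.52, "gender": 0.13, "height": 0.35}

# Most important feature first.
ranking = rank_by_importance(importance_scores)

# Position of each feature in the ranking; 0 is the most important.
rank_of = {name: position for position, name in enumerate(ranking)}
```

Either ordering direction carries the same information, which is why the embodiment leaves it open; the `descending` flag only changes how the list is read.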

S104, filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.

In this embodiment, compared with the prior art, feature filtering combines the correlation coefficient with importance to obtain the features for model training, which improves the accuracy of the model and the modeling effect.

S104 specifically comprises the following:

If the absolute value of a correlation coefficient is greater than the preset threshold, the two types of features corresponding to that coefficient are highly correlated. Training a model on both features would introduce collinearity and make the model inaccurate, so one of them needs to be filtered out. Therefore, in this embodiment, the less important of the two features is filtered out according to the importance ranking, and the remaining features are the features for model training.

The preset threshold may be determined according to actual conditions, which is not limited in this embodiment.
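The filtering rule of S104 can be sketched as follows. The function name `filter_features`, the correlation values, the importance scores, and the threshold of 0.8 are all hypothetical. The sketch follows the text literally: every coefficient is examined, including pairs in which one feature has already been dropped, and a tie in importance drops the second feature of the pair, which is a choice of this sketch.

```python
def filter_features(features, correlations, importance, threshold):
    """Drop the less important feature of every highly correlated pair."""
    dropped = set()
    for (a, b), coefficient in correlations.items():
        # The absolute value is compared, so strong negative correlation
        # is treated the same way as strong positive correlation.
        if abs(coefficient) > threshold:
            dropped.add(a if importance[a] < importance[b] else b)
    return [name for name in features if name not in dropped]


# Hypothetical inputs for illustration only.
features = ["age", "gender", "height"]
correlations = {
    ("age", "gender"): 0.10,
    ("age", "height"): 0.92,
    ("gender", "height"): -0.35,
}
importance = {"age": 0.52, "gender": 0.13, "height": 0.35}

selected = filter_features(features, correlations, importance, threshold=0.8)
```

In this example only the pair age and height exceeds the threshold, and height has the lower score, so age and gender remain for model training.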

This embodiment provides a feature processing method comprising: obtaining a data set to be processed, wherein the data set comprises at least two types of features and at least two pieces of data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, wherein the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved and the modeling effect is improved.

On the basis of the foregoing embodiment, fig. 2 is a flowchart of a feature processing method according to another embodiment of the present invention, and as shown in fig. 2, the feature processing method includes the following steps:

S201, acquiring a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature.

The implementation of step S201 is similar to that of step S101, and is not described herein again.

S202, calculating to obtain a Pearson correlation coefficient between every two types of features in the data set to be processed.

The correlation coefficient comprises a Pearson correlation coefficient, which is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$

wherein $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of pieces of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the t-th piece of data, $x_{jt}$ is the value corresponding to the jth feature of the t-th piece of data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.

For example, $x_{i1}$ denotes the value of the ith feature in the 1st piece of data, and $x_{j1}$ denotes the value of the jth feature in the 1st piece of data. If each piece of data is an n-dimensional vector, that is, each piece of data has n features, then i ranges from 1 to n and j ranges from 1 to n.

In this embodiment, the Pearson correlation coefficient between every two types of features in the data set to be processed is calculated; that is, for each pair of feature types among the at least two types, one Pearson correlation coefficient is calculated over all the data in the data set.

Taking data 1 and data 2 in S101 as an example, the data set to be processed includes two pieces of data, where data 1 is represented as vector 1 (30, 01, 1.6) and data 2 is represented as vector 2 (40, 10, 1.7), so m = 2 and n = 3. Take the calculation of the Pearson correlation coefficient between the 1st feature and the 3rd feature as an example, where age is the first feature in the data and height is the third feature. Then i = 1, j = 3, $x_{i1}$ is 30, $\bar{x}_i$ is 35, $x_{j1}$ is 1.6, $\bar{x}_j$ is 1.65, $x_{i2}$ is 40, and $x_{j2}$ is 1.7. Substituting these parameters into the above formula gives the Pearson correlation coefficient between age and height over the 2 pieces of data. If there are more pieces of data, the Pearson correlation coefficient between age and height is calculated in the same way.
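As a worked check of this example, the substitution can be written out in full. The formula is assumed here to be the standard Pearson product-moment form, which agrees with the symbol definitions given for $S_{ij}$; with i = 1 (age), j = 3 (height), and m = 2:

$$S_{13} = \frac{(30-35)(1.6-1.65) + (40-35)(1.7-1.65)}{\sqrt{(30-35)^2 + (40-35)^2}\,\sqrt{(1.6-1.65)^2 + (1.7-1.65)^2}} = \frac{0.25 + 0.25}{\sqrt{50}\,\sqrt{0.005}} = \frac{0.5}{0.5} = 1$$

With only two pieces of data, any two non-constant features lie exactly on a straight line, so the coefficient is always 1 or -1. The two-record data set therefore only illustrates the arithmetic; a meaningful coefficient needs more data.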

S203, acquiring an importance ranking of the at least two types of features by using an importance analysis model.

The implementation of step S203 is similar to that of step S103, and is not described herein again.

S204, for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain features for model training, wherein the correlation coefficient comprises a Pearson correlation coefficient.

In this embodiment, for each correlation coefficient, an absolute value greater than the preset threshold indicates that the two types of features corresponding to the correlation coefficient are highly correlated. The less important of the two features is then filtered out according to the importance ranking, which yields the features used for model training.

Illustratively, after the correlation coefficient between age and height is obtained in S202, if its absolute value is greater than the preset threshold, the less important of age and height is filtered out according to the importance ranking. The Pearson correlation coefficients between age and gender and between gender and height are calculated in the same manner, and feature filtering is performed in the same manner, to obtain the features for training the model.

This embodiment provides a feature processing method comprising: obtaining a data set to be processed, wherein the data set comprises at least two types of features and at least two pieces of data corresponding to each type of feature; calculating a Pearson correlation coefficient between every two types of features in the data set; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and, for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved and the modeling effect is improved.

FIG. 3 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the processing device 10 of this embodiment includes:

an obtaining module 11, configured to obtain a data set to be processed, where the data set to be processed includes at least two types of features and at least two pieces of data corresponding to each type of feature;

a calculating module 12, configured to calculate a correlation coefficient between every two types of features in the to-be-processed data set, where the correlation coefficient represents a degree of association between two types of features;

the obtaining module 11 is further configured to obtain importance ranks of the at least two types of features by using an importance analysis model;

and the filtering module 13 is configured to filter the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.

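In one possible design, the filtering module 13 is specifically configured to:

for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filter out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.

In one possible design, the calculation module 12 is specifically configured to:

calculate a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.

In one possible design, the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$

wherein $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of pieces of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the t-th piece of data, $x_{jt}$ is the value corresponding to the jth feature of the t-th piece of data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t is an integer between 1 and m.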

In the feature processing apparatus of this embodiment, an obtaining module is configured to obtain a to-be-processed data set, where the to-be-processed data set includes at least two types of features and at least two pieces of data corresponding to each type of feature, a calculating module is configured to calculate a correlation coefficient between every two types of features in the to-be-processed data set, where the correlation coefficient represents a degree of association between the two types of features, and the obtaining module is further configured to obtain an importance ranking of the at least two types of features using an importance analysis model, and a filtering module is configured to filter the at least two types of features according to the correlation coefficient and the importance ranking to obtain features used for model training. The feature screening is carried out by combining the correlation coefficient and the importance, so that the accuracy of the feature screening is improved, and the modeling effect is improved.
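The division into an obtaining module, a calculation module, and a filtering module can be sketched as three small Python classes. This is one possible arrangement under stated assumptions, not the structure of the actual device: the class and method names are hypothetical, the importance scores are placeholders for the output of a trained importance analysis model, the data values are invented, and the Pearson coefficient is assumed to take its standard form.

```python
from itertools import combinations
from math import sqrt


class ObtainingModule:
    """Holds the data set and the importance scores (hypothetical stand-in
    for data loading and for a trained importance analysis model)."""

    def __init__(self, columns, importance_scores):
        self.columns = columns
        self.importance_scores = importance_scores

    def get_dataset(self):
        return self.columns

    def get_importance_ranking(self):
        # Most important feature first.
        return sorted(self.importance_scores,
                      key=self.importance_scores.get, reverse=True)


class CalculationModule:
    """Computes one Pearson coefficient per pair of feature types."""

    @staticmethod
    def pearson(xs, ys):
        # Assumes neither feature is constant, so the divisor is non-zero.
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        norm = sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
        return cov / norm

    def correlations(self, columns):
        return {(a, b): self.pearson(columns[a], columns[b])
                for a, b in combinations(columns, 2)}


class FilteringModule:
    """Drops the lower-ranked feature of each highly correlated pair."""

    def __init__(self, threshold):
        self.threshold = threshold

    def filter(self, ranking, correlations):
        position = {name: k for k, name in enumerate(ranking)}
        dropped = set()
        for (a, b), r in correlations.items():
            if abs(r) > self.threshold:
                # A larger position means a lower importance.
                dropped.add(a if position[a] > position[b] else b)
        return [name for name in ranking if name not in dropped]


# Hypothetical data and scores, chosen so that age and height are collinear.
obtaining = ObtainingModule(
    columns={
        "age": [20, 30, 40, 50],
        "gender": [0, 1, 1, 0],
        "height": [1.5, 1.6, 1.7, 1.8],
    },
    importance_scores={"age": 0.5, "height": 0.3, "gender": 0.2},
)
calculation = CalculationModule()
filtering = FilteringModule(threshold=0.8)

correlations = calculation.correlations(obtaining.get_dataset())
selected = filtering.filter(obtaining.get_importance_ranking(), correlations)
```

Here the filtering module works from ranking positions rather than raw scores, which matches the wording "according to the importance ranking". As the later paragraphs note, this split is only one logical division, and the modules could equally be merged into a single unit.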

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, the electronic device 20 according to this embodiment may include: a memory 21 and a processor 22. The memory 21 and the processor 22 may be connected by a bus 23, for example.

The memory 21 is configured to store program code;

the processor 22 is configured to implement the feature processing method provided in any implementation of the foregoing method embodiments by executing the program code.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the feature processing method provided by any implementation of the method embodiments described above.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise high-speed RAM, and may further comprise non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is drawn as a single line in the figures of the present application, which does not mean that there is only one bus or one type of bus.

The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

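1. A method for processing features, comprising:

acquiring a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature;

calculating a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation between the two types of features;

acquiring an importance ranking of the at least two types of features by using an importance analysis model;

and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training.

2. The method of claim 1, wherein the filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training comprises:

for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.

3. The method of claim 1, wherein the calculating a correlation coefficient between every two types of features in the data set to be processed comprises:

calculating a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.

4. The method of claim 3, wherein the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$

wherein $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of pieces of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the t-th piece of data, $x_{jt}$ is the value corresponding to the jth feature of the t-th piece of data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.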

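5. A feature processing apparatus, comprising:

an obtaining module, configured to obtain a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature;

a calculation module, configured to calculate a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation between the two types of features;

the obtaining module being further configured to obtain an importance ranking of the at least two types of features by using an importance analysis model;

and a filtering module, configured to filter the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training.

6. The apparatus of claim 5, wherein the filtering module is specifically configured to:

for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filter out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.

7. The apparatus of claim 5, wherein the calculation module is specifically configured to:

calculate a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.

8. The apparatus of claim 7, wherein the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$

wherein $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of pieces of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the t-th piece of data, $x_{jt}$ is the value corresponding to the jth feature of the t-th piece of data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.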

9. An electronic device, comprising: a memory and a processor;

the memory is configured to store program code;

the processor is configured to call the program code to perform the feature processing method according to any one of claims 1 to 4.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program; the computer program, when executed, implements the feature processing method according to any one of claims 1 to 4.