CN112149702A - Feature processing method and device - Google Patents


Info

Publication number
CN112149702A
Authority
CN
China
Prior art keywords
features
correlation coefficient
types
importance
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910576748.9A
Other languages
Chinese (zh)
Inventor
王倩
徐晓飞
杨海华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the invention provide a feature processing method and device. The method includes: obtaining a data set to be processed, where the data set includes at least two types of features and at least two data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, where the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved, which improves the modeling effect.

Description

Feature processing method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a feature processing method and device.
Background
With the advent of the big data age, more and more data is used as training samples to build business models. Each data has multiple types of features, and business models are usually trained on these multiple types of features. However, collinearity may exist among the multiple types of features, which makes the trained model insufficiently accurate. Collinearity means that the model estimate is distorted or difficult to estimate accurately because the explanatory variables in a linear regression model are exactly or highly correlated.
At present, to avoid collinearity, the correlation coefficient between every two features is generally calculated, and feature screening is performed according to the correlation coefficients alone. However, the accuracy of feature screening in this method is not high, and required features may be filtered out, so the trained model is not accurate enough.
Disclosure of Invention
The embodiment of the invention provides a feature processing method and device, which are used for improving the accuracy of data screening and improving the modeling effect.
In a first aspect, an embodiment of the present invention provides a feature processing method, including:
acquiring a data set to be processed, wherein the data set to be processed comprises at least two types of characteristics and at least two data corresponding to each type of characteristic;
calculating to obtain a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the correlation degree of the two types of features;
acquiring importance sequences of the at least two types of features by adopting an importance analysis model;
and filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.
In one possible design, the filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training includes:
for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.
In one possible design, the calculating a correlation coefficient between every two types of features in the to-be-processed data set includes:
and calculating a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.
In one possible design, the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)(x_{jt}-\bar{x}_j)}{\sqrt{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m}(x_{jt}-\bar{x}_j)^2}}$$

where $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the tth data, $x_{jt}$ is the value corresponding to the jth feature of the tth data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.
In a second aspect, an embodiment of the present invention provides a processing apparatus, including:
an obtaining module, configured to obtain a data set to be processed, where the data set to be processed includes at least two types of features and at least two data corresponding to each type of feature;
the calculation module is used for calculating a correlation coefficient between every two types of features in the data set to be processed, and the correlation coefficient represents the correlation degree of the two types of features;
the obtaining module is further configured to obtain importance ranks of the at least two types of features by using an importance analysis model;
and the filtering module is used for filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.
In one possible design, the filtering module is specifically configured to:
for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filter out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.
In one possible design, the calculation module is specifically configured to:
and calculating a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.
In one possible design, the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)(x_{jt}-\bar{x}_j)}{\sqrt{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m}(x_{jt}-\bar{x}_j)^2}}$$

where $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the tth data, $x_{jt}$ is the value corresponding to the jth feature of the tth data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a memory and a processor;
the memory for storing program code;
the processor is configured to call the program code to execute the feature processing method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a readable storage medium on which a computer program is stored; when the computer program is executed, the feature processing method of the first aspect is performed.
The embodiments of the invention provide a feature processing method and device. The method includes: obtaining a data set to be processed, where the data set includes at least two types of features and at least two data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, where the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved, which improves the modeling effect.
Drawings
FIG. 1 is a flow chart of a method for processing features provided in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of a method of processing features provided in another embodiment of the invention;
FIG. 3 is a schematic diagram of a processing device according to one embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The feature processing method provided by the embodiments of the invention can be applied to data modeling scenarios. For example, in fields such as finance, internet, and retail, features are generally screened when a business model is trained, in order to avoid collinearity among the different types of features input to the model.
To address the low accuracy of prior-art feature screening based on the correlation coefficient alone, the embodiments of the invention provide a feature processing method. The method includes: obtaining a data set to be processed, where the data set includes at least two types of features and at least two data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, where the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved, which improves the modeling effect.
The technical solution of the embodiments of the present invention is described in detail below through several specific embodiments. FIG. 1 is a flowchart of a feature processing method according to an embodiment of the present invention. The executing entity of this embodiment may be an electronic device, such as a notebook computer, a mobile phone, or a tablet computer, which is not limited in this embodiment. As shown in FIG. 1, the method includes the following steps:
S101, acquiring a data set to be processed.
Generally, in the modeling process, collinearity may exist among the multiple types of features obtained in advance for training a business model. Collinearity means that the model estimate is distorted or difficult to estimate accurately because the explanatory variables in a linear regression model are exactly or highly correlated. Therefore, the multiple types of features need to be screened to improve the modeling effect and the accuracy of the trained model.
In this embodiment, the data set to be processed includes at least two types of features and at least two data corresponding to each type of feature, where the at least two types of features may include age, height, gender, and the like, and may be determined specifically according to modeling requirements, which is not limited in this embodiment.
Illustratively, the data set to be processed includes n types of features, where n is a positive integer greater than or equal to 2, each type of feature corresponds to at least two pieces of data, each piece of data is denoted as an n-dimensional vector, each dimension represents a type of feature, where the dimensions of the pieces of data are the same, and the same dimension of the pieces of data represents the same feature. The features of the same dimension representation in each data may be predefined, e.g. the first dimension all representing age, the second dimension all representing height, the third dimension all representing gender.
In the actual application process, it is assumed that a retail business model needs to be established to predict the purchasing power of customers, and in the modeling process, features need to be screened to improve the modeling effect. Firstly, a data set to be processed is obtained, the data set to be processed comprises three types of features, each type of feature corresponds to two data and is marked as data 1 and data 2, the data 1 and the data 2 are three-dimensional vectors, the first dimension in each data is defined in advance as age, the second dimension is defined as gender, and the third dimension is defined as height, namely, the data 1 and the data 2 comprise three types of features, namely, the age, the gender and the height. Data 1 may be represented as vector 1(30, 01, 1.6) and data 2 may be represented as vector 2(40, 10, 1.7), where data 1 represents age 30, woman, 1.6 meters high and data 2 represents age 40, man, 1.7 meters high.
It should be noted that, gender may be converted into a number according to a preset rule, and of course, in the actual modeling process, each feature may also be converted into a corresponding number according to an actual situation and the preset rule of each feature, which is not limited in this embodiment.
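The data layout described in S101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `FEATURE_NAMES`, `GENDER_CODES`, `to_vector`, and `to_columns` are hypothetical, and the 0/1 gender encoding is an assumed preset rule that differs from the "01"/"10" codes in the example above.

```python
from typing import Dict, List

# Assumed, predefined dimension order: every vector uses the same order.
FEATURE_NAMES = ["age", "gender", "height"]

# Assumed preset rule for turning gender into a number (not from the source).
GENDER_CODES = {"woman": 0.0, "man": 1.0}


def to_vector(record: dict) -> List[float]:
    """Convert one record into an n-dimensional numeric vector."""
    return [
        float(record["age"]),
        GENDER_CODES[record["gender"]],
        float(record["height"]),
    ]


def to_columns(vectors: List[List[float]]) -> Dict[str, List[float]]:
    """Regroup the vectors so each feature maps to its values across all data."""
    return {
        name: [vector[k] for vector in vectors]
        for k, name in enumerate(FEATURE_NAMES)
    }


# Data 1 and data 2 from the example above.
records = [
    {"age": 30, "gender": "woman", "height": 1.6},
    {"age": 40, "gender": "man", "height": 1.7},
]
vectors = [to_vector(r) for r in records]
columns = to_columns(vectors)
```

Grouping values by feature is convenient for the next step, because a correlation coefficient is computed between two features across all data.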
S102, calculating a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation of the two types of features.
The correlation coefficient represents the degree of correlation between the two types of features. Its value lies between -1 and 1, so its absolute value lies between 0 and 1. The larger the absolute value, the higher the degree of correlation; the smaller the absolute value, the lower the degree of correlation. An absolute value of 0 indicates no correlation.
In the modeling process, if the features input to a model are collinear, the model estimate is distorted or difficult to estimate accurately, which affects the modeling effect. Therefore, in this embodiment, the correlation coefficient between every two types of features in the data set to be processed is calculated; that is, for each pair of feature types among the at least two types of features, a correlation coefficient is calculated over the data in the data set to be processed.
Illustratively, correlation coefficients between age and height, age and sex, sex and height in S101 are calculated, respectively.
In one possible design, the correlation coefficient is a Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient (PPMCC or PCCs), which measures the linear correlation between two variables. Its value lies between -1 and 1, and its absolute value lies between 0 and 1.
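As a rough sketch of S102, the Pearson correlation coefficient can be computed for every pair of features using only the standard library. The function names and the sample values are illustrative assumptions. Returning 0.0 for a constant feature (zero denominator) is also an assumption, since the source does not discuss that case.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two features over m data."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("both features need the same number of data, m >= 2")
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    if denominator == 0:
        # Assumption: a constant feature is treated as uncorrelated.
        return 0.0
    return numerator / denominator


def pairwise_correlations(
    columns: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlation coefficient for every unordered pair of feature types."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Made-up values for illustration only.
columns = {
    "age": [30.0, 40.0, 50.0, 35.0],
    "gender": [0.0, 1.0, 1.0, 0.0],
    "height": [1.6, 1.7, 1.65, 1.8],
}
correlations = pairwise_correlations(columns)
```

With three feature types, this yields three coefficients: age and gender, age and height, gender and height.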
S103, acquiring importance sequences of the at least two types of features by adopting an importance analysis model.
In this embodiment, in order to avoid that the feature filtering is not accurate enough due to single processing, after the correlation coefficient between every two types of features in the data set to be processed is obtained through calculation, an importance analysis model may be further used to rank the importance of each feature, and in the importance ranking, the features may be ranked from high to low according to the importance level, or from low to high, which is not limited in this embodiment.
The importance ranking may be a ranking of the importance scores of the respective features; that is, the greater the importance score, the higher the importance, and the smaller the importance score, the lower the importance.
In this embodiment, the importance analysis model may be an xgboost model or a gbdt model, and the importance analysis model may be obtained by pre-training, so that data in the data set to be processed may be input into the importance analysis model to obtain importance scores of each feature, and thus, the importance ranking of each feature is obtained according to the importance score of each feature.
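The ranking part of S103 can be sketched as below. The importance scores are assumed to have been produced already by an importance analysis model such as the xgboost or gbdt model mentioned above. The numbers here are made up, and `rank_by_importance` is a hypothetical helper name.

```python
from typing import Dict, List


def rank_by_importance(scores: Dict[str, float]) -> List[str]:
    """Order feature names from most important to least important."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Made-up importance scores; in practice these would come from the
# pre-trained importance analysis model.
scores = {"age": 0.55, "gender": 0.15, "height": 0.30}
ranking = rank_by_importance(scores)
```

Sorting from high to low is one of the two orders the embodiment allows; reversing the list gives the low-to-high order.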
S104, filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.
In this embodiment, compared with the prior art, feature filtering combines the correlation coefficient with importance to obtain the features for model training, which improves the accuracy of the model and the modeling effect.
S104 specifically includes the following step:
for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.
If the absolute value of the correlation coefficient is greater than the preset threshold, the two types of features corresponding to the correlation coefficient are highly correlated. Training a model with both features would cause collinearity and make the model inaccurate, so one of the two features needs to be filtered out. Therefore, in this embodiment, the less important of the two types of features is filtered out according to the importance ranking, and the remaining features are the features for model training.
The preset threshold may be determined according to actual conditions, which is not limited in this embodiment.
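A minimal sketch of the filtering rule in S104 follows. The function name `filter_features`, the sample coefficients, the scores, and the threshold 0.8 are all illustrative assumptions. Removing the second feature of a pair when two scores tie is also an assumption, because the source does not specify tie handling.

```python
from typing import Dict, List, Tuple


def filter_features(
    features: List[str],
    correlations: Dict[Tuple[str, str], float],
    scores: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Drop the less important feature of every highly correlated pair."""
    removed = set()
    for (a, b), coefficient in correlations.items():
        # The comparison uses the absolute value, so strong negative
        # correlation is treated the same as strong positive correlation.
        if abs(coefficient) > threshold:
            # Assumption: on a tie, the second feature of the pair is removed.
            removed.add(a if scores[a] < scores[b] else b)
    return [name for name in features if name not in removed]


# Made-up inputs for illustration only.
features = ["age", "gender", "height"]
correlations = {
    ("age", "gender"): 0.20,
    ("age", "height"): 0.90,
    ("gender", "height"): -0.10,
}
scores = {"age": 0.55, "gender": 0.15, "height": 0.30}
kept = filter_features(features, correlations, scores, threshold=0.8)
```

Here only the age and height pair exceeds the threshold, and height has the lower score, so age and gender remain. The sketch applies the rule to each coefficient independently, as the text states; it does not skip pairs whose member was already removed.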
This embodiment provides a feature processing method. The method includes: obtaining a data set to be processed, where the data set includes at least two types of features and at least two data corresponding to each type of feature; calculating a correlation coefficient between every two types of features in the data set, where the correlation coefficient represents the degree of correlation between the two types of features; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and filtering the at least two types of features according to the correlation coefficients and the importance ranking to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved, which improves the modeling effect.
On the basis of the foregoing embodiment, fig. 2 is a flowchart of a feature processing method according to another embodiment of the present invention, and as shown in fig. 2, the feature processing method includes the following steps:
S201, acquiring a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two data corresponding to each type of feature.
The implementation of step S201 is similar to that of step S101, and is not described herein again.
S202, calculating to obtain a Pearson correlation coefficient between every two types of features in the data set to be processed.
The correlation coefficient comprises a Pearson correlation coefficient, and the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)(x_{jt}-\bar{x}_j)}{\sqrt{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m}(x_{jt}-\bar{x}_j)^2}}$$

where $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the tth data, $x_{jt}$ is the value corresponding to the jth feature of the tth data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t ranges from 1 to m.
For example, $x_{i1}$ denotes the value of the ith feature in the 1st data, and $x_{j1}$ denotes the value of the jth feature in the 1st data. If each data is an n-dimensional vector, that is, each data has n features, then i ranges from 1 to n and j ranges from 1 to n.
In this embodiment, the Pearson correlation coefficient between every two types of features in the data set to be processed is calculated; that is, for each pair of feature types among the at least two types of features, a Pearson correlation coefficient is calculated over the data in the data set to be processed.
Taking data 1 and data 2 in S101 as an example, the data set to be processed includes two data, where data 1 is represented as vector 1(30, 01, 1.6) and data 2 is represented as vector 2(40, 10, 1.7), so m is 2 and n is 3. Take the calculation of the Pearson correlation coefficient between the 1st feature and the 3rd feature as an example, where age is the first feature in the data and height is the third feature, so i is 1 and j is 3. Then $x_{i1}$ is 30, $\bar{x}_i$ is 35, $x_{j1}$ is 1.6, $\bar{x}_j$ is 1.65, $x_{i2}$ is 40, and $x_{j2}$ is 1.7. Substituting these parameters into the above formula yields the Pearson correlation coefficient between age and height for the 2 data. If there are more data, the Pearson correlation coefficient between age and height is calculated in the same way.
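For reference, the substitution for this two-data example can be worked through by hand, assuming the standard Pearson product-moment form with i = 1 (age) and j = 3 (height):

$$S_{13} = \frac{(30-35)(1.6-1.65) + (40-35)(1.7-1.65)}{\sqrt{(30-35)^2 + (40-35)^2}\,\sqrt{(1.6-1.65)^2 + (1.7-1.65)^2}} = \frac{0.25 + 0.25}{\sqrt{50}\,\sqrt{0.005}} = \frac{0.5}{0.5} = 1$$

With only two data (m = 2) and two features that both vary, the Pearson correlation coefficient is always exactly 1 or -1, so this example only illustrates the arithmetic. A data set with more data is needed for the coefficient to be informative.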
S203, acquiring importance sequences of the at least two types of features by adopting an importance analysis model.
The implementation of step S203 is similar to that of step S103, and is not described herein again.
S204, for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain features for model training, wherein the correlation coefficient comprises a Pearson correlation coefficient.
In this embodiment, for each correlation coefficient, if the absolute value of the correlation coefficient is greater than the preset threshold, the two types of features corresponding to the correlation coefficient are highly correlated. The less important of the two types of features is then filtered out according to the importance ranking, and the features used for model training are obtained.
Illustratively, after the correlation coefficient between age and height is obtained in S202, if its absolute value is greater than the preset threshold, whichever of age and height is less important is filtered out according to the importance ranking. The Pearson correlation coefficients between age and gender and between gender and height are calculated in the same manner, and feature filtering is performed in the same manner, so as to obtain the features for training the model.
This embodiment provides a feature processing method. The method includes: obtaining a data set to be processed, where the data set includes at least two types of features and at least two data corresponding to each type of feature; calculating a Pearson correlation coefficient between every two types of features in the data set, where the correlation coefficient includes the Pearson correlation coefficient; obtaining an importance ranking of the at least two types of features by using an importance analysis model; and, for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain features for model training. Because feature screening combines the correlation coefficient with importance, the accuracy of feature screening is improved, which improves the modeling effect.
FIG. 3 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the processing device 10 of the present embodiment includes:
an obtaining module 11, configured to obtain a data set to be processed, where the data set to be processed includes at least two types of features and at least two pieces of data corresponding to each type of feature;
a calculating module 12, configured to calculate a correlation coefficient between every two types of features in the to-be-processed data set, where the correlation coefficient represents a degree of association between two types of features;
the obtaining module 11 is further configured to obtain importance ranks of the at least two types of features by using an importance analysis model;
and the filtering module 13 is configured to filter the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.
In one possible design, the filter module 13 is specifically configured to:
for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filter out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.
In one possible design, the calculation module 12 is specifically configured to:
and calculating a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.
In one possible design, the Pearson correlation coefficient is calculated by the following formula:

$$S_{ij} = \frac{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)(x_{jt}-\bar{x}_j)}{\sqrt{\sum_{t=1}^{m}(x_{it}-\bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m}(x_{jt}-\bar{x}_j)^2}}$$

where $S_{ij}$ is the Pearson correlation coefficient between the ith feature and the jth feature, m is the number of data in the data set to be processed, m is an integer greater than or equal to 2, $x_{it}$ is the value corresponding to the ith feature of the tth data, $x_{jt}$ is the value corresponding to the jth feature of the tth data, $\bar{x}_i$ is the average of the values corresponding to the ith feature, $\bar{x}_j$ is the average of the values corresponding to the jth feature, and t is an integer between 1 and m.
In the feature processing apparatus of this embodiment, an obtaining module is configured to obtain a to-be-processed data set, where the to-be-processed data set includes at least two types of features and at least two pieces of data corresponding to each type of feature, a calculating module is configured to calculate a correlation coefficient between every two types of features in the to-be-processed data set, where the correlation coefficient represents a degree of association between the two types of features, and the obtaining module is further configured to obtain an importance ranking of the at least two types of features using an importance analysis model, and a filtering module is configured to filter the at least two types of features according to the correlation coefficient and the importance ranking to obtain features used for model training. The feature screening is carried out by combining the correlation coefficient and the importance, so that the accuracy of the feature screening is improved, and the modeling effect is improved.
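The module division of the processing device can be sketched as a small class that only wires the modules together. This is one possible logical division under stated assumptions: the class name `FeatureProcessingDevice`, its field names, and the stub callables in the usage example are hypothetical stand-ins, and the stubs return fixed values rather than performing real computation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Columns = Dict[str, List[float]]
Correlations = Dict[Tuple[str, str], float]
Scores = Dict[str, float]


@dataclass
class FeatureProcessingDevice:
    # Obtaining module: supplies the data set and the importance scores.
    obtain_data: Callable[[], Columns]
    obtain_importance: Callable[[Columns], Scores]
    # Calculating module: correlation coefficient for every feature pair.
    calculate: Callable[[Columns], Correlations]
    # Filtering module: combines coefficients and importance.
    filter_features: Callable[[List[str], Correlations, Scores], List[str]]

    def run(self) -> List[str]:
        """Run the modules in order and return the features kept for training."""
        columns = self.obtain_data()
        correlations = self.calculate(columns)
        scores = self.obtain_importance(columns)
        return self.filter_features(list(columns), correlations, scores)


# Stub modules with fixed, made-up outputs, used only to show the wiring.
device = FeatureProcessingDevice(
    obtain_data=lambda: {"age": [30.0, 40.0], "height": [1.6, 1.7]},
    obtain_importance=lambda columns: {"age": 0.7, "height": 0.3},
    calculate=lambda columns: {("age", "height"): 1.0},
    filter_features=lambda names, corr, scores: [
        n for n in names if n != "height"
    ],
)
selected = device.run()
```

Passing the modules in as callables keeps the division purely logical, so any of them could be merged or replaced without changing the overall flow.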
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device 20 of this embodiment may include a memory 21 and a processor 22, which may be connected, for example, by a bus 23.
The memory 21 is configured to store program code;
the processor 22 is configured to implement, by executing the program code, the feature processing method provided in any implementation of the foregoing method embodiments.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the feature processing method provided by any implementation of the method embodiments described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into modules is only a logical division, and other divisions are possible in practice: a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or of another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise high-speed RAM, and may further comprise non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for processing features, comprising:
acquiring a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature;
calculating a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation between the two types of features;
obtaining an importance ranking of the at least two types of features by using an importance analysis model; and
filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.
2. The method of claim 1, wherein the filtering the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training comprises:
for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filtering out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.
3. The method of claim 1, wherein the calculating a correlation coefficient between every two types of features in the data set to be processed comprises:
calculating a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.
4. The method of claim 3, wherein the Pearson correlation coefficient is calculated by the following formula:
$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$
wherein $S_{ij}$ is the Pearson correlation coefficient between the i-th feature and the j-th feature, $m$ is the number of pieces of data in the data set to be processed and is an integer greater than or equal to 2, $x_{it}$ is the value of the i-th feature for the t-th piece of data, $x_{jt}$ is the value of the j-th feature for the t-th piece of data, $\bar{x}_i$ is the average of the values of the i-th feature, $\bar{x}_j$ is the average of the values of the j-th feature, and $t$ ranges from 1 to $m$.
5. A feature processing apparatus, comprising:
an obtaining module, configured to acquire a data set to be processed, wherein the data set to be processed comprises at least two types of features and at least two pieces of data corresponding to each type of feature;
a calculation module, configured to calculate a correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient represents the degree of correlation between the two types of features;
the obtaining module being further configured to obtain an importance ranking of the at least two types of features by using an importance analysis model; and
a filtering module, configured to filter the at least two types of features according to the correlation coefficient and the importance ranking to obtain features for model training.
6. The apparatus of claim 5, wherein the filtering module is specifically configured to:
for each correlation coefficient, if the absolute value of the correlation coefficient is greater than a preset threshold, filter out, according to the importance ranking, the less important of the two types of features corresponding to the correlation coefficient, to obtain the features for model training.
7. The apparatus of claim 5, wherein the calculation module is specifically configured to:
calculate a Pearson correlation coefficient between every two types of features in the data set to be processed, wherein the correlation coefficient comprises the Pearson correlation coefficient.
8. The apparatus of claim 7, wherein the Pearson correlation coefficient is calculated by the following formula:
$$S_{ij} = \frac{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{m} (x_{it} - \bar{x}_i)^2}\,\sqrt{\sum_{t=1}^{m} (x_{jt} - \bar{x}_j)^2}}$$
wherein $S_{ij}$ is the Pearson correlation coefficient between the i-th feature and the j-th feature, $m$ is the number of pieces of data in the data set to be processed and is an integer greater than or equal to 2, $x_{it}$ is the value of the i-th feature for the t-th piece of data, $x_{jt}$ is the value of the j-th feature for the t-th piece of data, $\bar{x}_i$ is the average of the values of the i-th feature, $\bar{x}_j$ is the average of the values of the j-th feature, and $t$ ranges from 1 to $m$.
9. An electronic device, comprising: a memory and a processor;
the memory is configured to store program code;
the processor is configured to call the program code to perform the feature processing method according to any one of claims 1 to 4.
10. A readable storage medium, wherein the readable storage medium has stored thereon a computer program; the computer program, when executed, implements the feature processing method according to any one of claims 1 to 4.
CN201910576748.9A 2019-06-28 2019-06-28 Feature processing method and device Pending CN112149702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910576748.9A CN112149702A (en) 2019-06-28 2019-06-28 Feature processing method and device

Publications (1)

Publication Number Publication Date
CN112149702A true CN112149702A (en) 2020-12-29

Family

ID=73869473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910576748.9A Pending CN112149702A (en) 2019-06-28 2019-06-28 Feature processing method and device

Country Status (1)

Country Link
CN (1) CN112149702A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480696A (en) * 2017-07-12 2017-12-15 深圳信息职业技术学院 A kind of disaggregated model construction method, device and terminal device
CN107608938A (en) * 2017-08-08 2018-01-19 安徽师范大学 The factor screening method towards two-value classification of tree algorithm is returned based on enhancing
CN108256907A (en) * 2018-01-09 2018-07-06 北京腾云天下科技有限公司 A kind of construction method and computing device of customer grouping model
CN109784377A (en) * 2018-12-26 2019-05-21 平安科技(深圳)有限公司 Multiple recognition model building method, device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407680A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Heterogeneous integrated model screening method and electronic equipment
CN113407680B (en) * 2021-06-30 2023-06-02 竹间智能科技(上海)有限公司 Heterogeneous integrated model screening method and electronic equipment
CN114662698A (en) * 2022-02-11 2022-06-24 南京英锐祺科技有限公司 Industrial internet multi-modal machine learning data processing method
CN116720058A (en) * 2023-04-28 2023-09-08 贵研铂业股份有限公司 Method for realizing key feature combination screening of machine learning candidate features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination