CN111046972A - Feature selection method and device - Google Patents

Feature selection method and device

Info

Publication number
CN111046972A
Authority
CN
China
Prior art keywords
feature
characteristic
value
correlation coefficient
discrete
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911354862.3A
Other languages
Chinese (zh)
Inventor
李虹锋
曹清鑫
樊丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN201911354862.3A priority Critical patent/CN111046972A/en
Publication of CN111046972A publication Critical patent/CN111046972A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method and device. The method comprises: determining the discrete feature values of each feature in the overall sample; calculating, from the discrete feature values, the information gain obtained by dividing the overall sample by each feature; calculating the correlation coefficients between features from their discrete feature values; calculating, for each feature, a correlation coefficient comprehensive measure value with respect to all other features in the overall sample; and determining a feature comprehensive value for each feature from its information gain and its correlation coefficient comprehensive measure value, and selecting features in the overall sample according to the feature comprehensive values. The invention provides a method for calculating a score that measures each feature, enabling a comprehensive evaluation of every feature.

Description

Feature selection method and device
Technical Field
The invention relates to the field of data processing technology, and in particular to a feature selection method and device.
Background
Feature selection is an important data preprocessing step. In engineering practice based on machine learning, the curse of dimensionality is frequently encountered; it is caused by too many features participating in training. If important features can be selected from the original feature set, so that in the subsequent learning process the model is trained on only a subset of features, the curse of dimensionality can be greatly alleviated. Furthermore, removing features irrelevant to the modeling target reduces the difficulty of the learning task. For these two reasons, feature selection is usually performed on the modeling dataset, and the algorithm model is then trained on the reduced dataset.
Feature selection typically involves two key links: how to select a candidate feature subset, and how to evaluate the quality of each feature in that subset. The first link is a "subset search" problem; the second is a "subset evaluation" problem, i.e., weighing the "importance" of each feature in the subset to the modeling target through some quantitative calculation, and then performing feature selection accordingly. Existing feature selection methods generally focus on the importance of each modeling feature to the modeling target while ignoring the relationships between modeling features, so strong correlations may exist among the selected features, causing feature redundancy and harming the final modeling effect.
Disclosure of Invention
The present invention provides a feature selection method and apparatus to solve at least one technical problem in the background art.
In order to achieve the above object, according to one aspect of the present invention, there is provided a feature selection method including:
determining discrete characteristic values of all the characteristics in the overall sample;
calculating information gain obtained by dividing the overall sample by each feature according to the discrete feature value of each feature;
calculating a correlation coefficient among the features according to the discrete feature values of the features;
calculating, for each feature, a correlation coefficient comprehensive measure value with the other features according to the correlation coefficients between that feature and all other features in the overall sample;
and determining a feature comprehensive value of each feature according to the information gain obtained by dividing the overall sample by that feature and its correlation coefficient comprehensive measure value with the other features, and selecting the features in the overall sample according to the feature comprehensive values.
Optionally, determining the discrete feature values of each feature in the overall sample specifically includes:
and discretizing the continuous characteristic in the overall sample to obtain a discrete characteristic value of the continuous characteristic.
Optionally, the calculating, according to the discrete feature value of each feature, an information gain obtained by dividing the total sample by each feature includes:
calculating the information entropy of a sample subset generated by dividing the overall sample according to each discrete characteristic value of each characteristic;
and calculating the information gain obtained by dividing the overall sample by each feature according to the information entropy of the sample subset generated by dividing the overall sample by each discrete feature value of each feature and the information entropy of the overall sample.
Optionally, the correlation coefficient comprehensive measure value of each feature with the other features is calculated according to the correlation coefficients between that feature and all other features in the overall sample, using the following formula:

Z^{(i)} = \sum_{j=1, j \neq i}^{n} \left( 1 - \left| r_{ij} \right| \right)

where Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, r_{ij} is the correlation coefficient of feature i and feature j, and n is the number of features in the overall sample.
In order to achieve the above object, according to another aspect of the present invention, there is provided a feature selection apparatus including:
the discrete characteristic value determining unit is used for determining discrete characteristic values of all the characteristics in the overall sample;
the information gain calculation unit is used for calculating the information gain obtained by dividing the overall sample by each feature according to the discrete feature value of each feature;
the correlation coefficient calculation unit is used for calculating the correlation coefficient among the characteristics according to the discrete characteristic values of the characteristics;
the correlation coefficient comprehensive measure value calculating unit is used for calculating the correlation coefficient comprehensive measure value of each feature with the other features according to the correlation coefficients between that feature and all other features in the overall sample;
and the feature comprehensive value calculating unit is used for determining the feature comprehensive value of each feature according to the information gain obtained by dividing the overall sample by that feature and its correlation coefficient comprehensive measure value with the other features, so as to select the features in the overall sample according to the feature comprehensive values.
Optionally, the discrete feature value determining unit includes:
and the discretization processing module is used for discretizing the continuous characteristic in the overall sample to obtain a discrete characteristic value of the continuous characteristic.
Optionally, the information gain calculating unit includes:
the information entropy calculation module is used for calculating the information entropy of a sample subset generated by dividing the overall sample by each discrete characteristic value of each characteristic;
and the information gain calculation module is used for calculating the information gain obtained by dividing the overall sample by each characteristic according to the information entropy of the sample subset generated by dividing the overall sample by each discrete characteristic value of each characteristic and the information entropy of the overall sample.
Optionally, the feature comprehensive value calculating unit calculates the correlation coefficient comprehensive measure value of each feature with the other features according to the following formula:

Z^{(i)} = \sum_{j=1, j \neq i}^{n} \left( 1 - \left| r_{ij} \right| \right)

where Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, r_{ij} is the correlation coefficient of feature i and feature j, and n is the number of features in the overall sample.
In order to achieve the above object, according to another aspect of the present invention, there is also provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the above feature selection method when executing the computer program.
In order to achieve the above object, according to another aspect of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps in the above-described feature selection method.
The invention has the following beneficial effect: based on the two statistical indexes of information gain and the correlation coefficient, and considering both the importance of each feature to the modeling target and the correlation between features, the invention provides a calculation method for measuring feature scores.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts. In the drawings:
FIG. 1 is a flow chart of a feature selection method of an embodiment of the invention;
FIG. 2 is a flow chart of calculating information gain according to an embodiment of the present invention;
FIG. 3 is a block diagram of a feature selection apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of an information gain calculating unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a computer apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Some of the terms appearing in the description and claims of the invention are explained below.
Overall sample: the totality of the objects under study in a statistical analysis.
Feature (independent variable): A quantitative index that varies is called a feature; its concrete expression is a specific feature value or variable value.
Feature matrix: The set of feature values of each individual over the features is called a feature matrix.
Continuous feature: A feature that can take any value within some interval is called a continuous feature; its values are continuous, and between any two adjacent values infinitely many further values can be taken.
Discrete feature: A feature whose value space is finite or countably infinite, with probability 1 distributed over the individual values with certain probabilities, is called a discrete feature.
Modeling target (dependent variable): Refers to the quantity that changes as the features (independent variables) change; it is the object to be studied and modeled.
k-means binning method: The clusters produced by k-means clustering are presented as a histogram whose horizontal axis represents the groups and whose bar heights represent the frequency of the corresponding group.
Information entropy: Borrowed from the definition of entropy in thermodynamics, where it describes the degree of disorder of a substance, information entropy measures uncertainty: the more disordered a system, the greater the uncertainty and the higher the entropy value. Entropy is the expected value of the amount of information contained, expressed by the following formula:

Ent(D) = -\sum_{k=1}^{K} p_k \log_2 p_k

where p_k is the proportion of samples of the k-th class in the sample set D.
Information gain: In probability theory and information theory, information gain is an asymmetric measure of the difference between two probability distributions P and Q; it describes the difference between coding with Q and coding with P. Typically, P represents the distribution of samples or observations, and Q represents a theory, model, description, or approximation of P.
Correlation statistic: A vector in which each component corresponds to an initial feature; the importance of a feature subset is determined by the sum of the components of the correlation statistic corresponding to the features in the subset.
Correlation coefficient: is a quantity for researching the degree of linear correlation between variables and can only reflect the linear correlation between the variables.
Correlation matrix: also called a correlation coefficient matrix, which is formed by the correlation coefficients between the columns of the matrix.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a flowchart of a feature selection method according to an embodiment of the present invention, and as shown in fig. 1, the feature selection method according to the embodiment includes steps S101 to S105.
Step S101, determining discrete characteristic values of all characteristics in the overall sample.
In the embodiment of the invention, this step directly takes the feature values of the discrete features in the overall sample as their discrete feature values. Continuous features need to be discretized first to obtain their discrete feature values. In an alternative embodiment of the invention, discretizing a continuous feature may include the following specific steps: discretize the original feature values with the k-means binning method and record the segment values, which can then be used as the discrete feature values.
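For illustration, the following Python sketch shows one way this k-means binning step could be realized. The patent names no library, so the use of scikit-learn's KBinsDiscretizer, the bin count, and the sample data are assumptions:

```python
# A minimal sketch of k-means binning for one continuous feature.
# Assumptions: scikit-learn as the implementation and n_bins=5; the
# patent specifies neither.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.default_rng(0).normal(size=(1000, 1))  # one continuous feature

# strategy='kmeans' derives the bin edges from 1-D k-means cluster
# centers, matching the k-means binning referenced in step S101.
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
X_discrete = discretizer.fit_transform(X)  # segment index per sample

print(discretizer.bin_edges_[0])  # the resulting segment value intervals
```

The ordinal codes in X_discrete then serve as the discrete feature values for the subsequent steps.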
And step S102, calculating information gain obtained by dividing the overall sample by each feature according to the discrete feature value of each feature.
This step specifically comprises: calculating the information entropy of each sample subset generated by dividing the overall sample by each discrete feature value of each feature; and calculating the information gain obtained by dividing the overall sample by each feature according to the information entropy of those sample subsets and the information entropy of the overall sample.
Step S103, calculating the correlation coefficient among the characteristics according to the discrete characteristic value of each characteristic.
In the embodiment of the invention, the correlation coefficient between discrete features can be calculated by various methods in the prior art, such as the Pearson coefficient.
In an alternative embodiment of the present invention, calculating the correlation coefficient between the features may specifically include the following steps.
Assume that two continuous features A and B have been divided by step S101 into m and n segment value intervals, respectively, as follows:
{[a_0, a_1], [a_1, a_2], …, [a_{m-1}, a_m]}
{[b_0, b_1], [b_1, b_2], …, [b_{n-1}, b_n]}
                 [b_0, b_1]   [b_1, b_2]   …   [b_{n-1}, b_n]
[a_0, a_1]         N_11         N_12       …       N_1n
[a_1, a_2]         N_21         N_22       …       N_2n
…                   …            …         …        …
[a_{m-1}, a_m]     N_m1         N_m2       …       N_mn

TABLE 1
In Table 1, N_11 denotes the number of samples that fall both in the segment value interval [a_0, a_1] of feature A and in the segment value interval [b_0, b_1] of feature B; the other cells are defined analogously.
Setting o = mn for the number of cells in Table 1, let x_i = (a_{i-1} + a_i)/2 and y_j = (b_{j-1} + b_j)/2 denote the segment midpoints, and let

N = \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij}

be the total number of samples. Define:

\bar{x} = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} x_i, \quad \bar{y} = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} y_j

S_{xx} = \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} (y_j - \bar{y})^2

S_{xy} = \sum_{i=1}^{m} \sum_{j=1}^{n} N_{ij} (x_i - \bar{x}) (y_j - \bar{y})

The correlation coefficient r_{AB} of features A and B is:

r_{AB} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
This step calculates a correlation coefficient reflecting whether feature A and feature B have a linear correlation relationship; its value lies in the interval [-1, 1]. If r_{AB} > 0, features A and B have a positive linear correlation, i.e., A and B change in the same direction; if r_{AB} = 0, features A and B have no linear correlation, i.e., their variations are unrelated; if r_{AB} < 0, features A and B have an inverse linear correlation, i.e., A and B change in opposite directions. In the embodiment of the invention, the correlation coefficient r_{AB} of features A and B calculated by the above formula is also called a "pseudo correlation coefficient".
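A minimal Python sketch of this computation, treating each segment by its midpoint and weighting by the cell counts N_ij of Table 1; the midpoint representation and all data are illustrative assumptions:

```python
# Weighted-Pearson sketch of the "pseudo correlation coefficient" r_AB,
# assuming segment midpoints stand in for the binned values.
import numpy as np

def pseudo_corr(counts, a_edges, b_edges):
    """counts: (m, n) array of N_ij; a_edges: m+1 segment edges of A;
    b_edges: n+1 segment edges of B."""
    x = (np.asarray(a_edges[:-1]) + np.asarray(a_edges[1:])) / 2  # A midpoints
    y = (np.asarray(b_edges[:-1]) + np.asarray(b_edges[1:])) / 2  # B midpoints
    n_total = counts.sum()
    x_bar = counts.sum(axis=1) @ x / n_total
    y_bar = counts.sum(axis=0) @ y / n_total
    dx, dy = x - x_bar, y - y_bar
    s_xy = dx @ counts @ dy              # weighted co-deviation S_xy
    s_xx = counts.sum(axis=1) @ dx**2    # weighted deviation S_xx of A
    s_yy = counts.sum(axis=0) @ dy**2    # weighted deviation S_yy of B
    return s_xy / np.sqrt(s_xx * s_yy)

counts = np.array([[30, 5], [10, 55]])  # a toy 2x2 Table 1
print(pseudo_corr(counts, [0, 1, 2], [0, 10, 20]))  # ~0.69: same-direction change
```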
And step S104, calculating for each feature a correlation coefficient comprehensive measure value with the other features according to the correlation coefficients between that feature and all other features in the overall sample.
In the embodiment of the present invention, this step builds on the linear correlation coefficient r_{AB} of features A and B calculated in step S103. Assume the overall sample contains the features {P_1, P_2, …, P_n}. Two features are selected at a time, and the correlation coefficient r_{ij} of P_i and P_j in the overall sample is obtained by the calculation of step S103; repeating this for all pairs finally yields a correlation coefficient matrix, shown in Table 2 below:

        P_1     P_2     …    P_n
P_1     1       r_12    …    r_1n
P_2     r_21    1       …    r_2n
…       …       …       …    …
P_n     r_n1    r_n2    …    1

TABLE 2
This step then calculates the correlation coefficient comprehensive measure value of each feature with the other features. The value range of a correlation coefficient is [-1, 1]; the larger the absolute value of the correlation coefficient, the stronger the linear correlation between the two features, and the smaller the comprehensive measure value. In an alternative embodiment of the invention, this step may specifically calculate the correlation coefficient comprehensive measure value Z^{(i)} of feature i with the other features by the following formula:

Z^{(i)} = \sum_{j=1, j \neq i}^{n} \left( 1 - \left| r_{ij} \right| \right)

where Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, r_{ij} is the correlation coefficient of feature i and feature j, and n is the number of features in the overall sample.
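A short sketch of this step, assuming the formula above for Z^{(i)}; the correlation matrix values are illustrative:

```python
# Comprehensive measure Z^(i) per feature from the correlation matrix
# of Table 2, assuming Z^(i) = sum over j != i of (1 - |r_ij|).
import numpy as np

def z_measures(R):
    """R: (n, n) correlation coefficient matrix with ones on the diagonal."""
    penalty = 1.0 - np.abs(R)       # per-pair term: 0 when |r_ij| = 1
    np.fill_diagonal(penalty, 0.0)  # exclude the j == i term
    return penalty.sum(axis=1)      # Z^(i) for i = 1..n

R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(z_measures(R))  # feature 3 is least correlated, so its Z is largest
```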
And step S105, determining a feature comprehensive value of each feature according to the information gain obtained by dividing the overall sample by that feature and its correlation coefficient comprehensive measure value with the other features, and selecting the features in the overall sample according to the feature comprehensive values.
In the embodiment of the present invention, this step determines the feature comprehensive value of each feature from the information gain obtained by dividing the overall sample by that feature, calculated in step S102, and the correlation coefficient comprehensive measure value of that feature with the other features, calculated in steps S103 and S104. The feature comprehensive value is a composite score for each feature. In an alternative embodiment of the present invention, the feature comprehensive value Score(i) of feature i is calculated as follows:

Score(i) = Gain(D, i) + Z^{(i)}

where Gain(D, i) is the information gain obtained by dividing the overall sample D by feature i, with value range [0, 1]; the larger the value, the better the feature improves the sample classification. Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, with value range (0, n-1], where n is the total number of features; the larger the value, the weaker the linear correlation between this feature and the others, and the less the modeling effect is harmed. The feature comprehensive value obtained by adding the two measures both the feature's classification effect on the samples and the linear correlation between features, providing a theoretical basis for further screening the most significant feature subset.
In an alternative embodiment of the present invention, this step further ranks the feature comprehensive values Score(i) of all features by score, as shown in Table 3 below:

Score(i)    Rank
Score(1)    1
Score(2)    2
……          ……
Score(n)    n

TABLE 3
In an alternative embodiment of the present invention, feature selection and feature screening may be performed according to Table 3, as sketched in the example below. Specifically, a threshold τ may be specified first, and the features whose feature comprehensive value Score(i) is smaller than τ removed; or the number k of features to select may be specified, and the features ranked after the k-th removed; and so on.
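The following sketch ties the steps together: it combines the information gain and the comprehensive measure value into Score(i), ranks the features, and applies both selection rules; all concrete numbers are illustrative assumptions:

```python
# Score(i) = Gain(D, i) + Z^(i), then selection by threshold tau or top-k.
import numpy as np

gains = np.array([0.45, 0.30, 0.20, 0.05])  # Gain(D, i) from step S102 (toy values)
z = np.array([1.0, 0.9, 1.7, 0.3])          # Z^(i) from step S104 (toy values)
scores = gains + z                           # feature comprehensive value Score(i)

order = np.argsort(-scores)                  # descending rank, as in Table 3
print("ranking:", order + 1)                 # 1-based feature indices

tau = 1.0                                    # threshold rule: drop Score(i) < tau
kept_by_threshold = np.flatnonzero(scores >= tau) + 1
k = 2                                        # top-k rule: keep the k best features
kept_top_k = order[:k] + 1
print(kept_by_threshold, kept_top_k)
```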
From the above description, it can be seen that, based on the two statistical indexes of information gain and the correlation coefficient, and considering both the importance of features to the modeling target and the correlation between features, the invention provides a calculation method for measuring feature scores.
Fig. 2 is a flowchart of calculating the information gain according to an embodiment of the present invention. As shown in Fig. 2, in the embodiment of the present invention, calculating in step S102 the information gain obtained by dividing the overall sample by each feature according to its discrete feature values specifically includes step S201 and step S202.
Step S201, calculating information entropy of a sample subset generated by dividing the total sample according to each discrete feature value of each feature.
Step S202, calculating the information gain obtained by dividing the overall sample by each feature according to the information entropy of the sample subset generated by dividing the overall sample by each discrete feature value of each feature and the information entropy of the overall sample.
In an alternative embodiment of the present invention, the information entropy of the overall sample can be calculated as follows. Assume the modeling target in the current sample set D takes K classes, and the proportion of samples of the k-th class is p_k; the dependent-variable proportions of D can then be represented as [p_1, p_2, p_3, …, p_K], where k = 1, 2, 3, …, K. The information entropy Ent(D) of the overall sample can be calculated as:

Ent(D) = -\sum_{k=1}^{K} p_k \log_2 p_k
For each feature I (independent variable), its information gain is calculated as follows.
Assume that the discrete feature I has V possible values {I^1, I^2, I^3, …, I^V}. If I is used to divide the original sample set D, V sample subsets are generated, where the v-th subset contains all the samples in D whose value on feature I is I^v, denoted D^v. The information entropy of each D^v can be calculated according to the formula for the information entropy of the overall sample. Considering that different subsets contain different numbers of samples, each subset is given the weight

|D^v| / |D|

i.e., the greater the number of samples, the greater the influence of the subset. The information gain obtained by dividing the sample set D by the feature I can then be calculated.
Further, the information gain obtained by dividing the overall sample D by the feature I can be calculated from the information entropy Ent(D) of the overall sample D and the weighted information entropies of the subsets D^v; the specific calculation formula may be:

Gain(D, I) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)

where Gain(D, I) is the information gain obtained by dividing the overall sample D by the feature I. In general, a larger information gain means that the feature I provides greater discrimination of the original sample set.
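A compact Python sketch of steps S201 and S202 following the formulas above; the toy labels and feature values are illustrative:

```python
# Ent(D) over the class labels, then
# Gain(D, I) = Ent(D) - sum_v (|D^v|/|D|) * Ent(D^v).
import numpy as np

def ent(labels):
    """Information entropy of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain(feature_values, labels):
    """Information gain obtained by dividing the sample set by one discrete feature."""
    total, n = ent(labels), len(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v                   # subset D^v
        total -= mask.sum() / n * ent(labels[mask])  # weight |D^v| / |D|
    return total

y = np.array([0, 0, 1, 1, 1, 0])  # modeling target (two classes)
x = np.array([0, 0, 1, 1, 1, 1])  # discrete values of feature I
print(gain(x, y))                  # ~0.459: I improves class separation
```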
As can be seen from the above embodiments, the feature selection method of the present invention achieves at least the following advantageous effects:
1. the invention provides a method for quantitatively screening modeling features based on the information gain and correlation coefficient theories of information theory and statistics, which is more scientific and efficient than manual feature selection;
2. the invention designs a method for measuring feature importance based on variable statistics, provides a method for calculating the comprehensive score value Score of each feature from the "pseudo information gain" and the "pseudo correlation coefficient", emphasizes the features that have the greatest influence on the sample classification effect, and reduces the blindness of modeling feature selection.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
Based on the same inventive concept, embodiments of the present invention further provide a feature selection apparatus, which can be used to implement the feature selection method described in the above embodiments, as described in the following embodiments. Because the principle of the feature selection apparatus for solving the problem is similar to that of the feature selection method, the embodiment of the feature selection apparatus can be referred to the embodiment of the feature selection method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a feature selection apparatus according to an embodiment of the present invention, and as shown in fig. 3, the feature selection apparatus according to the embodiment of the present invention includes: the device comprises a discrete characteristic value determining unit 1, an information gain calculating unit 2, a correlation coefficient calculating unit 3, a correlation coefficient comprehensive measure value calculating unit 4 and a characteristic comprehensive value calculating unit 5.
And the discrete characteristic value determining unit is used for determining the discrete characteristic value of each characteristic in the overall sample.
In an alternative embodiment of the present invention, the discrete feature value determination unit 1 includes: and the discretization processing module is used for discretizing the continuous characteristic in the overall sample to obtain a discrete characteristic value of the continuous characteristic.
And the information gain calculation unit is used for calculating, according to the discrete feature values of each feature, the information gain obtained by dividing the overall sample by that feature.
And the correlation coefficient calculating unit is used for calculating the correlation coefficient among the characteristics according to the discrete characteristic values of the characteristics.
And the correlation coefficient comprehensive measure value calculating unit is used for calculating the correlation coefficient comprehensive measure value of each feature with the other features according to the correlation coefficients between that feature and all other features in the overall sample.
In an optional embodiment of the present invention, the feature comprehensive value calculating unit specifically calculates the correlation coefficient comprehensive measure value of each feature with the other features according to the following formula:

Z^{(i)} = \sum_{j=1, j \neq i}^{n} \left( 1 - \left| r_{ij} \right| \right)

where Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, r_{ij} is the correlation coefficient of feature i and feature j, and n is the number of features in the overall sample.
And the feature comprehensive value calculating unit is used for determining the feature comprehensive value of each feature according to the information gain obtained by dividing the overall sample by that feature and its correlation coefficient comprehensive measure value with the other features, so as to select the features in the overall sample according to the feature comprehensive values.
Fig. 4 is a block diagram of the information gain calculating unit 2 according to the embodiment of the present invention, and as shown in fig. 4, the information gain calculating unit 2 according to the embodiment of the present invention includes: an information entropy calculation module 201 and an information gain calculation module 202.
And the information entropy calculation module 201 is used for calculating the information entropy of the sample subset generated by dividing the overall sample by each discrete characteristic value of each characteristic.
And the information gain calculation module 202 is configured to calculate an information gain obtained by dividing the overall sample by each feature according to the information entropy of the sample subset generated by dividing the overall sample by each discrete feature value of each feature and the information entropy of the overall sample.
To achieve the above object, according to another aspect of the present application, there is also provided a computer apparatus. As shown in fig. 5, the computer device comprises a memory, a processor, a communication interface and a communication bus, wherein a computer program that can be run on the processor is stored in the memory, and the steps of the method of the above embodiment are realized when the processor executes the computer program.
The processor may be a Central Processing Unit (CPU). The Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or a combination thereof.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and units, such as the corresponding program units in the above-described method embodiments of the present invention. The processor executes various functional applications of the processor and the processing of the work data by executing the non-transitory software programs, instructions and modules stored in the memory, that is, the method in the above method embodiment is realized.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more units are stored in the memory and when executed by the processor perform the method of the above embodiments.
The specific details of the computer device may be understood by referring to the corresponding related descriptions and effects in the above embodiments, and are not described herein again.
In order to achieve the above object, according to another aspect of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps in the above-described feature selection method. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of feature selection, comprising:
determining discrete characteristic values of all the characteristics in the overall sample;
calculating information gain obtained by dividing the overall sample by each feature according to the discrete feature value of each feature;
calculating a correlation coefficient among the features according to the discrete feature values of the features;
calculating, for each feature, a correlation coefficient comprehensive measure value with the other features according to the correlation coefficients between that feature and all other features in the overall sample;
and determining a feature comprehensive value of each feature according to the information gain obtained by dividing the overall sample by that feature and its correlation coefficient comprehensive measure value with the other features, and selecting the features in the overall sample according to the feature comprehensive values.
2. The feature selection method according to claim 1, wherein determining the discrete feature values of each feature in the overall sample specifically comprises:
and discretizing the continuous characteristic in the overall sample to obtain a discrete characteristic value of the continuous characteristic.
3. The feature selection method according to claim 1, wherein the calculating an information gain obtained by dividing the population sample according to each feature based on the discrete feature value of each feature comprises:
calculating the information entropy of a sample subset generated by dividing the overall sample according to each discrete characteristic value of each characteristic;
and calculating the information gain obtained by dividing the overall sample by each feature according to the information entropy of the sample subset generated by dividing the overall sample by each discrete feature value of each feature and the information entropy of the overall sample.
4. The feature selection method according to claim 1, wherein the correlation coefficient comprehensive measure value of each feature with the other features is calculated according to the correlation coefficients between that feature and all other features in the overall sample, using the following formula:

Z^{(i)} = \sum_{j=1, j \neq i}^{n} \left( 1 - \left| r_{ij} \right| \right)

where Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, r_{ij} is the correlation coefficient of feature i and feature j, and n is the number of features in the overall sample.
5. A feature selection apparatus, comprising:
the discrete characteristic value determining unit is used for determining discrete characteristic values of all the characteristics in the overall sample;
the information gain calculation unit is used for calculating the information gain obtained by dividing the overall sample by each feature according to the discrete feature value of each feature;
the correlation coefficient calculation unit is used for calculating the correlation coefficient among the characteristics according to the discrete characteristic values of the characteristics;
the correlation coefficient comprehensive measure value calculating unit is used for calculating the correlation coefficient comprehensive measure value of each feature with the other features according to the correlation coefficients between that feature and all other features in the overall sample;
and the feature comprehensive value calculating unit is used for determining the feature comprehensive value of each feature according to the information gain obtained by dividing the overall sample by that feature and its correlation coefficient comprehensive measure value with the other features, so as to select the features in the overall sample according to the feature comprehensive values.
6. The feature selection device according to claim 5, wherein the discrete eigenvalue determination unit comprises:
and the discretization processing module is used for discretizing the continuous characteristic in the overall sample to obtain a discrete characteristic value of the continuous characteristic.
7. The feature selection device according to claim 5, wherein the information gain calculation unit includes:
the information entropy calculation module is used for calculating the information entropy of a sample subset generated by dividing the overall sample by each discrete characteristic value of each characteristic;
and the information gain calculation module is used for calculating the information gain obtained by dividing the overall sample by each characteristic according to the information entropy of the sample subset generated by dividing the overall sample by each discrete characteristic value of each characteristic and the information entropy of the overall sample.
8. The feature selection device according to claim 5, wherein the feature comprehensive value calculating unit calculates the correlation coefficient comprehensive measure value of each feature with the other features according to the following formula:

Z^{(i)} = \sum_{j=1, j \neq i}^{n} \left( 1 - \left| r_{ij} \right| \right)

where Z^{(i)} is the correlation coefficient comprehensive measure value of feature i with the other features, r_{ij} is the correlation coefficient of feature i and feature j, and n is the number of features in the overall sample.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when executed in a computer processor, implements the method of any one of claims 1 to 4.
CN201911354862.3A 2019-12-25 2019-12-25 Feature selection method and device Pending CN111046972A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911354862.3A CN111046972A (en) 2019-12-25 2019-12-25 Feature selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911354862.3A CN111046972A (en) 2019-12-25 2019-12-25 Feature selection method and device

Publications (1)

Publication Number Publication Date
CN111046972A true CN111046972A (en) 2020-04-21

Family

ID=70240278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911354862.3A Pending CN111046972A (en) 2019-12-25 2019-12-25 Feature selection method and device

Country Status (1)

Country Link
CN (1) CN111046972A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898027A (en) * 2020-08-06 2020-11-06 北京字节跳动网络技术有限公司 Method, device, electronic equipment and computer readable medium for determining feature dimension


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220907

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

TA01 Transfer of patent application right