CN114118246A - Method and device for selecting fully-relevant features based on Shapley value and hypothesis test - Google Patents

Method and device for selecting fully-relevant features based on Shapley value and hypothesis test

Info

Publication number
CN114118246A
Authority
CN
China
Prior art keywords
features
feature
importance
global
local
Prior art date
Legal status
Pending
Application number
CN202111384278.XA
Other languages
Chinese (zh)
Inventor
陈丹
殷丁泽
汤云波
李小俚
熊明福
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN202111384278.XA
Publication of CN114118246A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method and a device for selecting fully-relevant features based on Shapley values and hypothesis testing. The method applies to feature sets for supervised tasks. A feature selection model is designed to solve the full-relevance problem: the model first uses a Shapley attribution algorithm to compute the local importance of each feature, then uses random features to construct an adaptive threshold, and finally uses the importance and the threshold to evaluate the relevance of each feature. For the selection strategy, the invention designs a double hypothesis test: a local hypothesis test quickly eliminates irrelevant features, and a global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.

Description

Method and device for selecting fully-relevant features based on Shapley value and hypothesis test
Technical Field
The invention relates to the technical field of feature selection, and in particular to a method and a device for selecting fully-relevant features based on Shapley values and hypothesis testing.
Background
Feature selection is one of the central problems in feature engineering: its task is to select, from the original feature set, a subset of features relevant to the problem domain, with the goal of improving the interpretability and predictive performance of the feature set. Solving this problem is crucial in feature-data-centric scenarios. Traditional feature selection research mainly addresses the minimal-optimal problem, namely selecting the smallest feature subset with the best classification performance. According to how feature subsets are evaluated, these methods divide into filter methods and wrapper methods. Filter methods rank all features by a specific statistic and select the feature subset according to the ranking. Wrapper methods evaluate candidate feature subsets with a learning algorithm, modify the candidates over multiple iterations, and then select the best subset according to criteria such as classification accuracy and number of features. Feature selection aimed at the minimal-optimal problem has the advantage that the resulting subsets classify well and contain few features, so the subsequently built model is simpler. Its disadvantage is that the minimal-optimal feature set yields a black-box prediction model, and the interpretability of the feature set is difficult to guarantee.
To better understand the latent knowledge of the problem domain, a feature selection method should preferably solve the full-relevance problem, that is, identify all features relevant to the problem domain. Solving the full-relevance problem has its own difficulties: when the model has strong fitting capability, spurious relevance is widespread and relevance indices are hard to define and evaluate; and it is hard to select all relevant features, especially weakly relevant ones.
Disclosure of Invention
The main object of the invention is to provide a method and a device for selecting fully-relevant features based on Shapley values and hypothesis testing, so as to solve the problems that feature relevance cannot be evaluated effectively and that all relevant features cannot be identified adaptively.
In a first aspect, the present invention provides a method for selecting fully-relevant features based on Shapley values and hypothesis testing, the method comprising:
Step 1: relevance evaluation.
The input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the n-th sample is $x^{(n)} = (x_1, \dots, x_M)$, so that there are M candidate features in total, which together form the candidate feature set.
The importance of the M candidate features is quantified with Shapley values to obtain the local importance $I_m$ and the global importance $GI_m$ of each feature.
An importance threshold $T$ is obtained from the global importance $GI^*$ of the randomized features and the adaptive coefficient $c$.
The local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated.
Step 2: selection strategy.
The input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features.
Based on $R_m$ and $GR_m$, the features are divided into a relevant feature set, an irrelevant feature set, and a pending feature set.
Whether the irrelevant feature set is empty is then checked:
if the irrelevant feature set is not empty, the irrelevant features are deleted and step 1 is executed again;
if the irrelevant feature set is empty, whether the adaptive coefficient $c$ equals 1 is checked;
if $c$ is not 1, $c$ is increased by 0.1 and step 1 is executed again;
if $c$ is 1, whether the pending feature set is empty is checked;
if the pending feature set is not empty, step 1 is executed again;
if the pending feature set is empty, the iteration stops.
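The control flow of steps 1 and 2 can be sketched as a loop. This is a minimal sketch under stated assumptions, not the patented implementation: `evaluate_and_partition` is a hypothetical callback standing in for step 1 plus the partition into three sets, `max_iter` is an assumed safety cap, and the coefficient is tracked in integer tenths so that the "c is 1" check is exact.

```python
from typing import Callable, Iterable, Set, Tuple

# (relevant, irrelevant, pending) feature names returned by one evaluation pass
Partition = Tuple[Set[str], Set[str], Set[str]]


def select_fully_relevant(
    features: Iterable[str],
    evaluate_and_partition: Callable[[Set[str], float], Partition],
    max_iter: int = 100,  # assumed safety cap, not taken from the source
) -> Tuple[Set[str], Set[str], float]:
    """Iterate step 1 and step 2 until the stopping rule is met.

    Returns the relevant set, the pending set, and the final coefficient c.
    """
    active = set(features)
    tenths = 1  # c = 0.1 initially; integer tenths keep the "c == 1" test exact
    relevant: Set[str] = set()
    pending: Set[str] = set(active)
    for _ in range(max_iter):
        c = tenths / 10
        relevant, irrelevant, pending = evaluate_and_partition(active, c)
        if irrelevant:
            # Irrelevant set not empty: delete it and run step 1 again.
            active -= irrelevant
        elif tenths < 10:
            # Nothing to delete and c < 1: raise c by 0.1 and run step 1 again.
            tenths += 1
        elif not pending:
            # c == 1 and no pending features: stop.
            break
        # Otherwise (c == 1, pending not empty): run step 1 again unchanged.
    return relevant, pending, tenths / 10


def toy_partition(active: Set[str], c: float) -> Partition:
    """Hypothetical stand-in: flags the feature named 'noise' as irrelevant."""
    irrelevant = {name for name in active if name == "noise"}
    return active - irrelevant, irrelevant, set()


relevant, pending, final_c = select_fully_relevant(
    ["age", "dose", "noise"], toy_partition
)
```

With the toy callback, the first pass removes `noise`, the next nine passes raise c from 0.1 to 1, and the final pass stops because nothing is pending.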
Optionally, quantifying the importance of the M candidate features with Shapley values to obtain the local importance $I_m$ and the global importance $GI_m$ comprises:
determining the input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the n-th sample is $x^{(n)} = (x_1, \dots, x_M)$ and its label is $y^{(n)}$, together with the candidate feature set;
expressing the Shapley attribution algorithm as:
[formula image: Shapley attribution expression]
using the Shapley attribution algorithm, the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ is attributed to the contribution of the m-th candidate feature [formula image: contribution of feature m on sample n], where [formula image: base value] is the mean of the model output;
for a classification task, [formula image: per-class contribution] represents the contribution of the m-th feature of the n-th sample to class $l$, and the local importance is [formula image: local importance for classification], where $l = y^{(n)}$;
for a regression task, the contribution is expressed directly as a contribution value [formula image: contribution value], and the local importance is [formula image: local importance for regression];
the global importance is the average of the local importance over all samples, i.e. $GI_m = E(I_m)$.
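The step from Shapley contributions to local and global importance can be sketched as follows. This is an illustrative sketch, assuming the contributions have already been computed by some attribution tool and stored as an array of shape `(N, M)` for regression or `(N, M, L)` for classification with L classes; the function names and array layout are assumptions, and the sketch keeps the signed contribution because the original formula images do not show whether an absolute value is taken.

```python
import numpy as np


def local_importance(phi, y=None):
    """Return an (N, M) array of local importances.

    phi: Shapley contributions, shape (N, M) for regression or
         (N, M, L) for classification with L classes.
    y:   integer class labels of shape (N,), required for classification.
    """
    phi = np.asarray(phi, dtype=float)
    if phi.ndim == 3:
        if y is None:
            raise ValueError("class labels are required for classification")
        y = np.asarray(y)
        # For each sample n, keep the contributions to its true class l = y[n].
        return phi[np.arange(phi.shape[0]), :, y]
    return phi


def global_importance(local):
    """GI_m: average of the local importance of feature m over all samples."""
    return np.asarray(local).mean(axis=0)


# Two samples, two features, two classes (illustrative numbers only).
phi = np.array([
    [[0.1, -0.1], [0.3, -0.3]],
    [[0.2, -0.2], [-0.5, 0.5]],
])
labels = np.array([0, 1])
local = local_importance(phi, labels)
gi = global_importance(local)
```

Here sample 0 has true class 0 and sample 1 has true class 1, so `local` picks the first class column for the first sample and the second class column for the second.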
Optionally, in the method for selecting fully-relevant features based on Shapley values and hypothesis testing, the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and $c$ is the adaptive coefficient.
Optionally, in the method for selecting fully-relevant features based on Shapley values and hypothesis testing, the local relevance is
[formula image: definition of $R_m$]
where $R_m$ is the number of times the local importance of the m-th feature exceeds the adaptive threshold; the larger $R_m$, the greater the relevance;
the global relevance is
[formula image: definition of $GR_m$]
where MI is the maximum number of iterations and $GR_m$ is the number of iterations in which the global importance of the m-th feature exceeds the threshold.
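The threshold and the two relevance counts can be sketched in a few lines. This is a sketch under assumptions: `gi_star` is taken to be a single number summarizing the random features' global importance (how several random features are aggregated is not recoverable from the text), and the function names are hypothetical.

```python
import numpy as np


def adaptive_threshold(gi_star: float, c: float) -> float:
    """T = c * GI*, with GI* the global importance of the random features."""
    return c * gi_star


def local_relevance(local_imp, threshold: float) -> np.ndarray:
    """R_m: number of samples whose local importance for feature m exceeds T."""
    return (np.asarray(local_imp) > threshold).sum(axis=0)


def update_global_relevance(gr, global_imp, threshold: float) -> np.ndarray:
    """Add one to GR_m for each feature whose GI_m exceeds T in this iteration."""
    return np.asarray(gr) + (np.asarray(global_imp) > threshold).astype(int)


# Three samples, two features (illustrative numbers only).
local_imp = np.array([[0.5, 0.00],
                      [0.3, 0.10],
                      [0.1, 0.05]])
T = adaptive_threshold(gi_star=0.4, c=0.5)
R = local_relevance(local_imp, T)
GR = update_global_relevance(np.zeros(2, dtype=int), local_imp.mean(axis=0), T)
```

The global count is accumulated across iterations, so `update_global_relevance` would be called once per pass through step 1.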
Optionally, deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance and defining the feature sets that fall into the left and right rejection regions as:
[formula image: feature set in the left rejection region of the local test]
[formula image: feature set in the right rejection region of the local test]
where $\alpha$ is the significance level; features that fall into the left rejection region are locally irrelevant, and features that fall into the right rejection region are locally relevant;
performing a hypothesis test on the global relevance to obtain two further feature sets:
[formula image: feature set in the left rejection region of the global test]
[formula image: feature set in the right rejection region of the global test]
where features that fall into the left rejection region are globally irrelevant, and features that fall into the right rejection region are globally relevant;
partitioning the feature set according to the two hypothesis tests:
[formula image: relevant feature set]
[formula image: irrelevant feature set]
[formula image: pending feature set]
which yields the relevant feature set, the irrelevant feature set, and the pending feature set.
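A single binomial test of the kind described above can be sketched with the standard library alone. Several details are assumptions rather than statements of the original formulas: the significance level is split as α/2 per tail, the number of trials is the number of samples for the local test and the number of completed iterations for the global test, and all names are hypothetical. The rule that combines the local and global sets into relevant, irrelevant, and pending sets is given only as formula images in the original, so the sketch stops at the four rejection-region sets.

```python
from math import comb
from typing import Dict, Set, Tuple


def binom_cdf(k: int, n: int, p: float = 0.5) -> float:
    """F(k) = P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def rejection_region(count: int, n_trials: int, alpha: float = 0.05) -> str:
    """Return 'left', 'right', or 'none' for one relevance count under H0."""
    lower_tail = binom_cdf(count, n_trials)            # P(X <= count)
    upper_tail = 1.0 - binom_cdf(count - 1, n_trials)  # P(X >= count)
    if lower_tail < alpha / 2:
        return "left"   # far fewer hits than chance: irrelevant
    if upper_tail < alpha / 2:
        return "right"  # far more hits than chance: relevant
    return "none"


def split_by_region(counts: Dict[str, int], n_trials: int,
                    alpha: float = 0.05) -> Tuple[Set[str], Set[str]]:
    """Apply the test to every feature; return (left set, right set)."""
    regions = {name: rejection_region(k, n_trials, alpha)
               for name, k in counts.items()}
    left = {name for name, region in regions.items() if region == "left"}
    right = {name for name, region in regions.items() if region == "right"}
    return left, right


# Local test: counts R_m out of 20 samples (illustrative numbers only).
local_left, local_right = split_by_region(
    {"f1": 18, "f2": 2, "f3": 10}, n_trials=20)
# Global test: counts GR_m out of 20 iterations (illustrative numbers only).
global_left, global_right = split_by_region(
    {"f1": 19, "f2": 9, "f3": 11}, n_trials=20)
```

A count near half the number of trials falls in neither rejection region, which is what leaves a feature undecided in that test.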
In a second aspect, the present invention also provides a device for selecting fully-relevant features based on Shapley values and hypothesis testing, the device comprising:
an evaluation module 10, configured to perform step 1: relevance evaluation.
The input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the n-th sample is $x^{(n)} = (x_1, \dots, x_M)$, so that there are M candidate features in total, which together form the candidate feature set.
The importance of the M candidate features is quantified with Shapley values to obtain the local importance $I_m$ and the global importance $GI_m$ of each feature.
An importance threshold $T$ is obtained from the global importance $GI^*$ of the randomized features and the adaptive coefficient $c$.
The local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated.
A selection module 20, configured to perform step 2: selection strategy.
The input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features.
Based on $R_m$ and $GR_m$, the features are divided into a relevant feature set, an irrelevant feature set, and a pending feature set.
Whether the irrelevant feature set is empty is then checked:
if the irrelevant feature set is not empty, the irrelevant features are deleted and step 1 is executed again;
if the irrelevant feature set is empty, whether the adaptive coefficient $c$ equals 1 is checked;
if $c$ is not 1, $c$ is increased by 0.1 and step 1 is executed again;
if $c$ is 1, whether the pending feature set is empty is checked;
if the pending feature set is not empty, step 1 is executed again;
if the pending feature set is empty, the iteration stops.
Optionally, the evaluation module 10 is further configured to:
determine the input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the n-th sample is $x^{(n)} = (x_1, \dots, x_M)$ and its label is $y^{(n)}$, together with the candidate feature set;
express the Shapley attribution algorithm as:
[formula image: Shapley attribution expression]
using the Shapley attribution algorithm, attribute the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ to the contribution of the m-th candidate feature [formula image: contribution of feature m on sample n], where [formula image: base value] is the mean of the model output;
for a classification task, [formula image: per-class contribution] represents the contribution of the m-th feature of the n-th sample to class $l$, and the local importance is [formula image: local importance for classification], where $l = y^{(n)}$;
for a regression task, the contribution is expressed directly as a contribution value [formula image: contribution value], and the local importance is [formula image: local importance for regression];
the global importance is the average of the local importance over all samples, i.e. $GI_m = E(I_m)$.
Optionally, in the device for selecting fully-relevant features based on Shapley values and hypothesis testing, the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and $c$ is the adaptive coefficient.
Optionally, in the device for selecting fully-relevant features based on Shapley values and hypothesis testing, the local relevance is
[formula image: definition of $R_m$]
where $R_m$ is the number of times the local importance of the m-th feature exceeds the adaptive threshold; the larger $R_m$, the greater the relevance;
the global relevance is
[formula image: definition of $GR_m$]
where MI is the maximum number of iterations and $GR_m$ is the number of iterations in which the global importance of the m-th feature exceeds the threshold.
Optionally, the selection module 20 is further configured to:
define the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose distribution function is $F(\cdot)$;
perform a hypothesis test on the local relevance and define the feature sets that fall into the left and right rejection regions as:
[formula image: feature set in the left rejection region of the local test]
[formula image: feature set in the right rejection region of the local test]
where $\alpha$ is the significance level; features that fall into the left rejection region are locally irrelevant, and features that fall into the right rejection region are locally relevant;
perform a hypothesis test on the global relevance to obtain two further feature sets:
[formula image: feature set in the left rejection region of the global test]
[formula image: feature set in the right rejection region of the global test]
where features that fall into the left rejection region are globally irrelevant, and features that fall into the right rejection region are globally relevant;
partition the feature set according to the two hypothesis tests:
[formula image: relevant feature set]
[formula image: irrelevant feature set]
[formula image: pending feature set]
which yields the relevant feature set, the irrelevant feature set, and the pending feature set.
The invention discloses a method and a device for selecting fully-relevant features based on Shapley values and hypothesis testing. The method applies to feature sets for supervised tasks. A feature selection model is designed to solve the full-relevance problem: the model first uses a Shapley attribution algorithm to compute the local importance of each feature, then uses random features to construct an adaptive threshold, and finally uses the importance and the threshold to evaluate the relevance of each feature. For the selection strategy, the invention designs a double hypothesis test: a local hypothesis test quickly eliminates irrelevant features, and a global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.
Drawings
FIG. 1 is a schematic flow chart of a method for selecting fully-relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a first embodiment of a device for selecting fully-relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention.
The implementation, functional features, and advantages of the present invention are further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In a first aspect, embodiments of the present invention provide a method for selecting fully-relevant features based on Shapley values and hypothesis testing.
Referring to FIG. 1, FIG. 1 is a schematic flow chart of a method for selecting fully-relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention.
As shown in FIG. 1, the method for selecting fully-relevant features based on Shapley values and hypothesis testing comprises:
Step 1: relevance evaluation.
The input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the n-th sample is $x^{(n)} = (x_1, \dots, x_M)$, so that there are M candidate features in total, which together form the candidate feature set.
The importance of the M candidate features is quantified with Shapley values to obtain the local importance $I_m$ and the global importance $GI_m$ of each feature.
An importance threshold $T$ is obtained from the global importance $GI^*$ of the randomized features and the adaptive coefficient $c$.
The local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated.
In this embodiment, the method is applicable to supervised tasks and requires sample data together with labels $y^{(n)}$, where each sample $x^{(n)} = (x_1, \dots, x_M)$ has M features in total; these features form the feature set. The features are then randomized: features are sampled from the feature set and randomized to obtain a random feature set. Importance is computed with the Shapley attribution algorithm, giving each feature in the feature set a local importance $I_m$ and a global importance $GI_m = E(I_m)$, where N is the number of samples. The global importance $GI^*$ is obtained from the random feature set.
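One common way to build such random features is to copy sampled columns and shuffle their values across samples, which keeps each column's marginal distribution but breaks its link to the label. The sketch below takes that approach as an assumption; the text only says that features are sampled and randomized, and the function name, the `n_random` parameter, and the use of permutation are illustrative choices.

```python
import numpy as np


def make_random_features(X, n_random: int, rng: np.random.Generator):
    """Sample n_random columns of X and permute each one independently.

    Returns the (N, n_random) array of random features and the indices
    of the source columns they were derived from.
    """
    X = np.asarray(X)
    n_samples, n_features = X.shape
    # Sample without replacement unless more random features are requested
    # than there are source columns.
    cols = rng.choice(n_features, size=n_random, replace=n_random > n_features)
    shadows = np.empty((n_samples, n_random), dtype=X.dtype)
    for j, col in enumerate(cols):
        shadows[:, j] = rng.permutation(X[:, col])
    return shadows, cols


rng = np.random.default_rng(0)
X_demo = np.arange(12.0).reshape(4, 3)
shadows, source_cols = make_random_features(X_demo, n_random=2, rng=rng)
```

The random columns would be appended to the real features before fitting the model, so that their Shapley-based global importance can serve as the reference level $GI^*$.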
Step 2: selection strategy.
The input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features.
Based on $R_m$ and $GR_m$, the features are divided into a relevant feature set, an irrelevant feature set, and a pending feature set.
Whether the irrelevant feature set is empty is then checked:
if the irrelevant feature set is not empty, the irrelevant features are deleted and step 1 is executed again;
if the irrelevant feature set is empty, whether the adaptive coefficient $c$ equals 1 is checked;
if $c$ is not 1, $c$ is increased by 0.1 and step 1 is executed again;
if $c$ is 1, whether the pending feature set is empty is checked;
if the pending feature set is not empty, step 1 is executed again;
if the pending feature set is empty, the iteration stops.
Further, in an embodiment, quantifying the importance of the M candidate features with Shapley values to obtain the local importance $I_m$ and the global importance $GI_m$ comprises:
determining the input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the n-th sample is $x^{(n)} = (x_1, \dots, x_M)$ and its label is $y^{(n)}$, together with the candidate feature set;
expressing the Shapley attribution algorithm as:
[formula image: Shapley attribution expression]
using the Shapley attribution algorithm, the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ is attributed to the contribution of the m-th candidate feature [formula image: contribution of feature m on sample n], where [formula image: base value] is the mean of the model output;
for a classification task, [formula image: per-class contribution] represents the contribution of the m-th feature of the n-th sample to class $l$, and the local importance is [formula image: local importance for classification], where $l = y^{(n)}$;
for a regression task, the contribution is expressed directly as a contribution value [formula image: contribution value], and the local importance is [formula image: local importance for regression];
the global importance is the average of the local importance over all samples, i.e. $GI_m = E(I_m)$.
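To make the attribution step concrete, the following sketch computes exact Shapley values for a tiny model by enumerating feature subsets. It is a simplification under stated assumptions: features outside a coalition are replaced by a single fixed `baseline` vector, so the base value is the model output at the baseline rather than the mean model output mentioned in the text, and `shapley_values` and `toy_model` are hypothetical names. Exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(f: Callable[[List[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley contribution of each feature of x to f(x) - f(baseline)."""
    n_features = len(x)

    def value(coalition) -> float:
        # Features in the coalition take their value from x, the rest from baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n_features)]
        return f(z)

    phi = [0.0] * n_features
    for m in range(n_features):
        others = [i for i in range(n_features) if i != m]
        for size in range(n_features):
            # Shapley weight |S|! (M - |S| - 1)! / M! for coalitions of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = set(subset)
                phi[m] += weight * (value(coalition | {m}) - value(coalition))
    return phi


def toy_model(z: List[float]) -> float:
    """Hypothetical regression model with one interaction term."""
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]


x_demo = [1.0, 2.0, 3.0]
baseline_demo = [0.0, 0.0, 0.0]
phi_demo = shapley_values(toy_model, x_demo, baseline_demo)
```

The contributions sum to the difference between the model output at `x_demo` and at the baseline, and the interaction term `z[0] * z[2]` is split equally between the two features involved.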
Further, in an embodiment of the method for selecting fully-relevant features based on Shapley values and hypothesis testing, the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and $c$ is the adaptive coefficient.
In this embodiment, the adaptive threshold $T$ is computed from the adaptive coefficient $c$ and the global importance $GI^*$ of the random feature set, that is, $GI^*$ is multiplied by the coefficient $c$, where $c$ has an initial value of 0.1 and a maximum value of 1.
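One practical detail of a coefficient that starts at 0.1, grows by 0.1, and is compared with 1: in binary floating point, repeatedly adding 0.1 does not land exactly on 1.0, so an equality check can fail. The sketch below shows the effect and one way around it, generating the schedule from integer tenths; the function name and the integer-tenths approach are illustrative choices, not part of the source.

```python
from typing import List


def coefficient_schedule(start_tenths: int = 1, stop_tenths: int = 10) -> List[float]:
    """Values of c from 0.1 to 1.0 in steps of 0.1, built from integer tenths."""
    return [k / 10 for k in range(start_tenths, stop_tenths + 1)]


# Naive accumulation: start at 0.1 and add 0.1 nine times.
naive_c = 0.1
for _ in range(9):
    naive_c += 0.1

schedule = coefficient_schedule()
reaches_one_naively = (naive_c == 1.0)      # False because of rounding error
reaches_one_safely = (schedule[-1] == 1.0)  # True: 10 / 10 is exactly 1.0
```

An alternative that keeps a running float is to compare with a tolerance, for example `abs(c - 1.0) < 1e-9`.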
Further, in an embodiment of the method for selecting fully-relevant features based on Shapley values and hypothesis testing, the local relevance is
[formula image: definition of $R_m$]
where $R_m$ is the number of times the local importance of the m-th feature exceeds the adaptive threshold; the larger $R_m$, the greater the relevance.
The global relevance is
[formula image: definition of $GR_m$]
where MI is the maximum number of iterations and $GR_m$ is the number of iterations in which the global importance of the m-th feature exceeds the threshold.
In this embodiment, relevance is evaluated as follows. Local relevance: the local importance $I_m$ of a feature is compared with the threshold $T$, and the number of times it exceeds $T$ is the local relevance $R_m$ of the feature. Global relevance: given that MI iterations have been performed, the number of those iterations in which the global importance $GI_m$ exceeds the threshold is the global relevance $GR_m$.
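Written out, the threshold and the two counts described above can be expressed with indicator functions. This is a reconstruction from the verbal definitions, not a copy of the original formula images: the superscript $(n)$ for the sample index, the superscript $(i)$ for the iteration index, and the per-iteration threshold $T^{(i)}$ are notational assumptions.

```latex
\begin{align}
T    &= c \cdot GI^{*}, \\
R_m  &= \sum_{n=1}^{N} \mathbb{1}\!\left[ I_m^{(n)} > T \right], \\
GR_m &= \sum_{i=1}^{MI} \mathbb{1}\!\left[ GI_m^{(i)} > T^{(i)} \right].
\end{align}
```

For example, with $N = 3$ samples, local importances $0.5$, $0.3$, $0.1$ for feature $m$, and $T = 0.2$, two of the three values exceed the threshold, so $R_m = 2$.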
Further, in an embodiment, deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance and defining the feature sets that fall into the left and right rejection regions as:
[formula image: feature set in the left rejection region of the local test]
[formula image: feature set in the right rejection region of the local test]
where $\alpha$ is the significance level; features that fall into the left rejection region are locally irrelevant, and features that fall into the right rejection region are locally relevant;
performing a hypothesis test on the global relevance to obtain two further feature sets:
[formula image: feature set in the left rejection region of the global test]
[formula image: feature set in the right rejection region of the global test]
where features that fall into the left rejection region are globally irrelevant, and features that fall into the right rejection region are globally relevant;
partitioning the feature set according to the two hypothesis tests:
[formula image: relevant feature set]
[formula image: irrelevant feature set]
[formula image: pending feature set]
which yields the relevant feature set, the irrelevant feature set, and the pending feature set.
In this embodiment, after a round of feature selection, the relevant feature set replaces the original feature set in preparation for subsequent supervised tasks.
The embodiment of the invention discloses a method and a device for selecting fully-relevant features based on Shapley values and hypothesis testing. The method applies to feature sets for supervised tasks. A feature selection model is designed to solve the full-relevance problem: the model first uses a Shapley attribution algorithm to compute the local importance of each feature, then uses random features to construct an adaptive threshold, and finally uses the importance and the threshold to evaluate the relevance of each feature. For the selection strategy, the embodiment designs a double hypothesis test: a local hypothesis test quickly eliminates irrelevant features, and a global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.
In a second aspect, the embodiments of the present invention further provide a device for selecting fully relevant features based on the salpril values and hypothesis testing.
Referring to fig. 2, fig. 2 is a functional schematic diagram of a first embodiment of a holocorrelation feature selection device based on a salpril value and hypothesis testing, as involved in an embodiment of the present invention.
In this embodiment, the device for selecting fully-correlated features based on the salpril value and hypothesis testing includes:
an evaluation module for performing step 1: evaluating the relevance;
the input to step 1 is a data set consisting of N samples, noted
Figure BDA0003357241390000113
Wherein the feature vector of the nth sample is x(n)=(x1,...,xM) A total of M candidate features, and features are recorded as
Figure BDA0003357241390000114
Quantifying the importance of the M candidate features by using the Shapril value to obtain a local importance value
Figure BDA0003357241390000115
And global importance GIm
Global importance Using randomized features
Figure BDA0003357241390000116
Obtaining an importance threshold T by the adaptive coefficient c;
evaluating the local relevance indexes R of the M candidate characteristicsmAnd global relevance index GRm
a selection module for performing step 2: selection strategy;
the input of step 2 is the local relevance index R_m and the global relevance index GR_m of the M candidate features;
based on the local relevance index R_m and the global relevance index GR_m of the M candidate features, deriving a relevant feature set
Figure BDA0003357241390000121
an irrelevant feature set
Figure BDA0003357241390000122
and a pending feature set
Figure BDA0003357241390000123
detecting whether the irrelevant feature set
Figure BDA0003357241390000124
is empty;
if the irrelevant feature set
Figure BDA0003357241390000125
is not empty, deleting the irrelevant feature set and executing step 1;
if the irrelevant feature set
Figure BDA0003357241390000126
is empty, detecting whether the adaptive coefficient c is 1;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1;
if the adaptive coefficient c is 1, detecting whether the pending feature set
Figure BDA0003357241390000127
is empty;
if the pending feature set
Figure BDA0003357241390000128
is not empty, executing step 1;
if the pending feature set
Figure BDA0003357241390000129
is empty, stopping the execution of step 1.
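The control flow of step 2 described above can be sketched roughly as follows. This is only an illustrative outline, not the patented implementation: `run_selection`, the `evaluate` callback (a stand-in for step 1 plus the partition into three sets), the starting value `c_start = 0.1` and the `max_rounds` safety cap are all assumptions introduced for this sketch, since the text does not state an initial value of c.

```python
from typing import Callable, Set, Tuple

# (relevant, irrelevant, pending) feature sets produced by one pass of step 1
Partition = Tuple[Set[str], Set[str], Set[str]]


def run_selection(
    features: Set[str],
    evaluate: Callable[[Set[str], float], Partition],  # hypothetical stand-in for step 1
    c_start: float = 0.1,    # assumed starting value; not stated in the text
    max_rounds: int = 1000,  # safety cap added for this sketch only
) -> Set[str]:
    remaining = set(features)
    c = c_start
    relevant: Set[str] = set()
    for _ in range(max_rounds):
        relevant, irrelevant, pending = evaluate(remaining, c)
        if irrelevant:
            # Irrelevant set not empty: delete it and execute step 1 again.
            remaining -= irrelevant
            continue
        if c < 1.0 - 1e-9:
            # Irrelevant set empty and c is not 1: raise c by 0.1 and repeat.
            c = min(1.0, round(c + 0.1, 10))
            continue
        if not pending:
            # c is 1 and the pending set is empty: stop.
            break
        # c is 1 but features are still pending: execute step 1 again.
    return relevant
```

Rounding c after each increment avoids floating-point drift, so the comparison with 1 behaves as the text intends.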
Further, in an embodiment, the evaluation module is further configured to:
determining an input data set
Figure BDA00033572413900001210
wherein the feature vector of the n-th sample is x^(n) = (x_1, ..., x_M), the label is y^(n), and the feature set is
Figure BDA00033572413900001211
the Shapley attribution algorithm is expressed as
Figure BDA00033572413900001212
using the Shapley attribution algorithm, the output of the classification/regression model f(·) for the sample x^(n) is attributed to the contribution of the m-th candidate feature
Figure BDA00033572413900001213
wherein
Figure BDA00033572413900001214
is the mean value of the model output;
if it is a classification task, then
Figure BDA00033572413900001215
Figure BDA00033572413900001216
represents the contribution of the m-th feature of the n-th sample to class l, in which case the local importance is
Figure BDA00033572413900001217
wherein l = y^(n);
if it is a regression task, the contribution is directly expressed as a contribution value
Figure BDA00033572413900001218
in which case the local importance is
Figure BDA00033572413900001219
the global importance is the average of the local importance over all samples, i.e., the global importance is
Figure BDA00033572413900001220
Figure BDA00033572413900001221
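As a rough illustration of the classification case and of the global average described above, the following sketch assumes that the Shapley contributions have already been computed by some attribution tool and are available as a nested list indexed `[sample][feature][class]`. The function names and this data layout are hypothetical. The regression case is not sketched because its local-importance formula appears only in the figures.

```python
from typing import List, Sequence


def local_importance_classification(
    contributions: Sequence[Sequence[Sequence[float]]],  # [n][m][l], assumed layout
    labels: Sequence[int],                               # y^(n) as class indices
) -> List[List[float]]:
    """For each sample n and feature m, keep the contribution to the true class l = y^(n)."""
    return [
        [per_class[labels[n]] for per_class in sample]
        for n, sample in enumerate(contributions)
    ]


def global_importance(local: Sequence[Sequence[float]]) -> List[float]:
    """Average the local importance over all N samples, giving one value per feature."""
    n_samples = len(local)
    n_features = len(local[0])
    return [sum(row[m] for row in local) / n_samples for m in range(n_features)]
```

With two samples, two features and two classes, `local_importance_classification` returns a 2 x 2 table and `global_importance` returns one averaged value per feature.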
Further, in an embodiment, in the device for selecting fully-relevant features based on the Shapley value and hypothesis testing, the adaptive threshold is expressed as
Figure BDA00033572413900001222
wherein
Figure BDA00033572413900001223
is the global importance of the random feature, and c is the adaptive coefficient.
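The threshold formula itself is given only in the figure. A minimal sketch, assuming the threshold is simply the product of the adaptive coefficient c and the global importance of a random feature, could look like this. Both the product form and the construction of the random feature by shuffling a real column are assumptions made for illustration, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def make_random_feature(column: Sequence[float], seed: int = 0) -> List[float]:
    """Return a shuffled copy of a feature column (assumed way to randomize a feature)."""
    shuffled = list(column)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def adaptive_threshold(random_feature_global_importance: float, c: float) -> float:
    """Assumed form T = c * GI_random; the exact expression is in the figure."""
    if not 0.0 < c <= 1.0:
        raise ValueError("c is expected to lie in (0, 1]")
    return c * random_feature_global_importance
```

Under this assumption a small c gives a lenient threshold in early rounds, and the threshold tightens as c grows toward 1.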
Further, in an embodiment, in the device for selecting fully-relevant features based on the Shapley value and hypothesis testing, the local relevance is
Figure BDA0003357241390000131
wherein R_m is the number of times the local importance of the m-th feature is higher than the adaptive threshold, and the larger R_m is, the higher the degree of relevance;
the global relevance is
Figure BDA0003357241390000132
wherein MI is the maximum number of iterations, and GR_m is the number of times the global importance of the m-th feature is above the threshold during the iterative process.
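The two relevance counts can be sketched as simple counters. The sketch reads R_m as a count over samples (how many samples have local importance above the threshold for feature m) and GR_m as a running count over iterations. This reading, and the function names, are assumptions, since the defining formulas appear only in the figures.

```python
from typing import List, Sequence


def local_relevance(local: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """R_m: number of samples whose local importance for feature m exceeds the threshold."""
    n_features = len(local[0])
    return [sum(1 for row in local if row[m] > threshold) for m in range(n_features)]


def update_global_relevance(
    gr: Sequence[int], global_imp: Sequence[float], threshold: float
) -> List[int]:
    """GR_m: add one for each feature whose global importance exceeds the threshold
    in the current iteration; call once per iteration, up to MI times."""
    return [count + (1 if gi > threshold else 0) for count, gi in zip(gr, global_imp)]
```

Starting from `gr = [0] * M` and calling `update_global_relevance` once per iteration yields counts between 0 and MI, which is the form a binomial test expects.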
Further, in an embodiment, the selecting module is further configured to:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, and the probability distribution function is F(·);
performing a hypothesis test on the local relevance, and defining the feature sets falling into the left and right rejection regions as follows:
Figure BDA0003357241390000133
Figure BDA0003357241390000134
wherein α is the significance level,
Figure BDA0003357241390000135
represents the features falling within the left rejection region, which are locally irrelevant features;
Figure BDA0003357241390000136
represents the features falling within the right rejection region, which are locally relevant features;
performing a hypothesis test on the global relevance to obtain two feature sets:
Figure BDA0003357241390000137
Figure BDA0003357241390000138
wherein
Figure BDA0003357241390000139
represents the features falling within the left rejection region, which are globally irrelevant features;
Figure BDA00033572413900001310
represents the features falling within the right rejection region, which are globally relevant features;
the partition of the feature set is obtained from the two hypothesis tests:
Figure BDA00033572413900001311
Figure BDA00033572413900001312
Figure BDA00033572413900001313
wherein
Figure BDA00033572413900001314
is the relevant feature set,
Figure BDA00033572413900001315
is the irrelevant feature set, and
Figure BDA00033572413900001316
is the pending feature set.
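A single one of these binomial tests can be sketched with the standard library alone. The sketch tests one relevance count against the null hypothesis of a binomial distribution with probability 0.5 and reports which rejection region, if any, the count falls into. Using the full significance level on each tail, the default of 0.05 and the function names are assumptions. How the local and global results are combined into the three final sets is defined in the figures and is not reproduced here.

```python
from math import comb


def binom_half_cdf(k: int, n: int) -> float:
    """P(X <= k) for X ~ Binomial(n, 0.5), computed with exact integer sums."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


def rejection_region(count: int, trials: int, alpha: float = 0.05) -> str:
    """Return 'left', 'right' or 'undecided' for one relevance count.

    'left'  : count is improbably low under H0 (candidate irrelevant feature)
    'right' : count is improbably high under H0 (candidate relevant feature)
    """
    left_tail = binom_half_cdf(count, trials)             # P(X <= count)
    right_tail = 1.0 - binom_half_cdf(count - 1, trials)  # P(X >= count)
    if left_tail < alpha:
        return "left"
    if right_tail < alpha:
        return "right"
    return "undecided"
```

For the local test, `trials` would be the number of samples N and `count` would be R_m; for the global test, `trials` would be the number of iterations and `count` would be GR_m.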
The function implementation of each module in the above device for selecting fully-relevant features based on the Shapley value and hypothesis testing corresponds to the steps in the embodiment of the method for selecting fully-relevant features based on the Shapley value and hypothesis testing, and the functions and implementation processes are not described in detail here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a terminal device to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for selecting fully-relevant features based on the Shapley value and hypothesis testing, characterized in that the method comprises:
step 1: evaluating the relevance;
the input to step 1 is a data set consisting of N samples, denoted
Figure FDA0003357241380000011
wherein the feature vector of the n-th sample is x^(n) = (x_1, ..., x_M), there are M candidate features in total, and the feature set is denoted
Figure FDA0003357241380000012
quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance
Figure FDA0003357241380000013
and the global importance GI_m;
using the global importance of the randomized features
Figure FDA0003357241380000014
and the adaptive coefficient c to obtain an importance threshold T;
evaluating the local relevance index R_m and the global relevance index GR_m of the M candidate features;
step 2: selection strategy;
the input of step 2 is the local relevance index R_m and the global relevance index GR_m of the M candidate features;
based on the local relevance index R_m and the global relevance index GR_m of the M candidate features, deriving a relevant feature set
Figure FDA0003357241380000015
an irrelevant feature set
Figure FDA0003357241380000016
and a pending feature set
Figure FDA0003357241380000017
detecting whether the irrelevant feature set
Figure FDA0003357241380000018
is empty;
if the irrelevant feature set
Figure FDA0003357241380000019
is not empty, deleting the irrelevant feature set and executing step 1;
if the irrelevant feature set
Figure FDA00033572413800000110
is empty, detecting whether the adaptive coefficient c is 1;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1;
if the adaptive coefficient c is 1, detecting whether the pending feature set
Figure FDA00033572413800000111
is empty;
if the pending feature set
Figure FDA00033572413800000112
is not empty, executing step 1;
if the pending feature set
Figure FDA00033572413800000113
is empty, stopping the execution of step 1.
2. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 1, characterized in that quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance
Figure FDA0003357241380000021
and the global importance GI_m comprises:
determining an input data set
Figure FDA0003357241380000022
wherein the feature vector of the n-th sample is x^(n) = (x_1, ..., x_M), the label is y^(n), and the feature set is
Figure FDA0003357241380000023
the Shapley attribution algorithm is expressed as
Figure FDA0003357241380000024
using the Shapley attribution algorithm, the output of the classification/regression model f(·) for the sample x^(n) is attributed to the contribution of the m-th candidate feature
Figure FDA0003357241380000025
wherein
Figure FDA0003357241380000026
is the mean value of the model output;
if it is a classification task, then
Figure FDA0003357241380000027
Figure FDA0003357241380000028
represents the contribution of the m-th feature of the n-th sample to class l, in which case the local importance is
Figure FDA0003357241380000029
wherein l = y^(n);
if it is a regression task, the contribution is directly expressed as a contribution value
Figure FDA00033572413800000210
in which case the local importance is
Figure FDA00033572413800000211
the global importance is the average of the local importance over all samples, i.e., the global importance is
Figure FDA00033572413800000212
Figure FDA00033572413800000213
3. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 2, characterized in that:
the adaptive threshold is expressed as
Figure FDA00033572413800000214
wherein
Figure FDA00033572413800000215
is the global importance of the random feature, and c is the adaptive coefficient.
4. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 3, characterized in that:
the local relevance is
Figure FDA00033572413800000216
wherein R_m is the number of times the local importance of the m-th feature is higher than the adaptive threshold, and the larger R_m is, the higher the degree of relevance;
the global relevance is
Figure FDA00033572413800000217
wherein MI is the maximum number of iterations, and GR_m is the number of times the global importance of the m-th feature is above the threshold during the iterative process.
5. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 4, characterized in that deriving, based on the local relevance index R_m and the global relevance index GR_m of the M candidate features, the relevant feature set
Figure FDA0003357241380000031
the irrelevant feature set
Figure FDA0003357241380000032
and the pending feature set
Figure FDA0003357241380000033
comprises:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, and the probability distribution function is F(·);
performing a hypothesis test on the local relevance, and defining the feature sets falling into the left and right rejection regions as follows:
Figure FDA0003357241380000034
Figure FDA0003357241380000035
wherein α is the significance level,
Figure FDA0003357241380000036
represents the features falling within the left rejection region, which are locally irrelevant features;
Figure FDA0003357241380000037
represents the features falling within the right rejection region, which are locally relevant features;
performing a hypothesis test on the global relevance to obtain two feature sets:
Figure FDA0003357241380000038
Figure FDA0003357241380000039
wherein
Figure FDA00033572413800000310
represents the features falling within the left rejection region, which are globally irrelevant features;
Figure FDA00033572413800000311
represents the features falling within the right rejection region, which are globally relevant features;
the partition of the feature set is obtained from the two hypothesis tests:
Figure FDA00033572413800000312
Figure FDA00033572413800000313
Figure FDA00033572413800000314
wherein
Figure FDA00033572413800000315
is the relevant feature set,
Figure FDA00033572413800000316
is the irrelevant feature set, and
Figure FDA00033572413800000317
is the pending feature set.
6. A device for selecting fully-relevant features based on the Shapley value and hypothesis testing, characterized in that the device comprises:
an evaluation module for performing step 1: evaluating the relevance;
the input to step 1 is a data set consisting of N samples, denoted
Figure FDA0003357241380000041
wherein the feature vector of the n-th sample is x^(n) = (x_1, ..., x_M), there are M candidate features in total, and the feature set is denoted
Figure FDA0003357241380000042
quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance
Figure FDA0003357241380000043
and the global importance GI_m;
using the global importance of the randomized features
Figure FDA0003357241380000044
and the adaptive coefficient c to obtain an importance threshold T;
evaluating the local relevance index R_m and the global relevance index GR_m of the M candidate features;
a selection module for performing step 2: selection strategy;
the input of step 2 is the local relevance index R_m and the global relevance index GR_m of the M candidate features;
based on the local relevance index R_m and the global relevance index GR_m of the M candidate features, deriving a relevant feature set
Figure FDA0003357241380000045
an irrelevant feature set
Figure FDA0003357241380000046
and a pending feature set
Figure FDA0003357241380000047
detecting whether the irrelevant feature set
Figure FDA0003357241380000048
is empty;
if the irrelevant feature set
Figure FDA0003357241380000049
is not empty, deleting the irrelevant feature set and executing step 1;
if the irrelevant feature set
Figure FDA00033572413800000410
is empty, detecting whether the adaptive coefficient c is 1;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1;
if the adaptive coefficient c is 1, detecting whether the pending feature set
Figure FDA00033572413800000411
is empty;
if the pending feature set
Figure FDA00033572413800000412
is not empty, executing step 1;
if the pending feature set
Figure FDA00033572413800000413
is empty, stopping the execution of step 1.
7. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 6, characterized in that the evaluation module is further configured to:
determining an input data set
Figure FDA00033572413800000414
wherein the feature vector of the n-th sample is x^(n) = (x_1, ..., x_M), the label is y^(n), and the feature set is
Figure FDA00033572413800000415
the Shapley attribution algorithm is expressed as
Figure FDA0003357241380000051
using the Shapley attribution algorithm, the output of the classification/regression model f(·) for the sample x^(n) is attributed to the contribution of the m-th candidate feature
Figure FDA0003357241380000052
wherein
Figure FDA0003357241380000053
is the mean value of the model output;
if it is a classification task, then
Figure FDA0003357241380000054
Figure FDA0003357241380000055
represents the contribution of the m-th feature of the n-th sample to class l, in which case the local importance is
Figure FDA0003357241380000056
wherein l = y^(n);
if it is a regression task, the contribution is directly expressed as a contribution value
Figure FDA0003357241380000057
in which case the local importance is
Figure FDA0003357241380000058
the global importance is the average of the local importance over all samples, i.e., the global importance is
Figure FDA0003357241380000059
Figure FDA00033572413800000510
8. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 7, characterized in that the adaptive threshold is expressed as
Figure FDA00033572413800000511
wherein
Figure FDA00033572413800000512
is the global importance of the random feature, and c is the adaptive coefficient.
9. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 8, characterized in that the local relevance is
Figure FDA00033572413800000513
wherein R_m is the number of times the local importance of the m-th feature is higher than the adaptive threshold, and the larger R_m is, the higher the degree of relevance;
the global relevance is
Figure FDA00033572413800000514
wherein MI is the maximum number of iterations, and GR_m is the number of times the global importance of the m-th feature is above the threshold during the iterative process.
10. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing according to claim 9, characterized in that the selection module is further configured to:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, and the probability distribution function is F(·);
performing a hypothesis test on the local relevance, and defining the feature sets falling into the left and right rejection regions as follows:
Figure FDA0003357241380000061
Figure FDA0003357241380000062
wherein α is the significance level,
Figure FDA0003357241380000063
represents the features falling within the left rejection region, which are locally irrelevant features;
Figure FDA0003357241380000064
represents the features falling within the right rejection region, which are locally relevant features;
performing a hypothesis test on the global relevance to obtain two feature sets:
Figure FDA0003357241380000065
Figure FDA0003357241380000066
wherein
Figure FDA0003357241380000067
represents the features falling within the left rejection region, which are globally irrelevant features;
Figure FDA0003357241380000068
represents the features falling within the right rejection region, which are globally relevant features;
the partition of the feature set is obtained from the two hypothesis tests:
Figure FDA0003357241380000069
Figure FDA00033572413800000610
Figure FDA00033572413800000611
wherein
Figure FDA00033572413800000612
is the relevant feature set,
Figure FDA00033572413800000613
is the irrelevant feature set, and
Figure FDA00033572413800000614
is the pending feature set.
CN202111384278.XA 2021-11-16 2021-11-16 Method and device for selecting fully-relevant features based on Shapril value and hypothesis test Pending CN114118246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111384278.XA CN114118246A (en) 2021-11-16 2021-11-16 Method and device for selecting fully-relevant features based on Shapril value and hypothesis test

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111384278.XA CN114118246A (en) 2021-11-16 2021-11-16 Method and device for selecting fully-relevant features based on Shapril value and hypothesis test

Publications (1)

Publication Number Publication Date
CN114118246A true CN114118246A (en) 2022-03-01

Family

ID=80439074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111384278.XA Pending CN114118246A (en) 2021-11-16 2021-11-16 Method and device for selecting fully-relevant features based on Shapril value and hypothesis test

Country Status (1)

Country Link
CN (1) CN114118246A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953248A (en) * 2023-03-01 2023-04-11 支付宝(杭州)信息技术有限公司 Wind control method, device, equipment and medium based on Shapril additive interpretation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009014A (en) * 2019-03-24 2019-07-12 北京工业大学 A kind of feature selection approach merging related coefficient and mutual information

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953248A (en) * 2023-03-01 2023-04-11 支付宝(杭州)信息技术有限公司 Wind control method, device, equipment and medium based on Shapril additive interpretation
CN115953248B (en) * 2023-03-01 2023-05-16 支付宝(杭州)信息技术有限公司 Wind control method, device, equipment and medium based on saprolitic additivity interpretation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220301)