CN114118246A - Method and device for selecting fully relevant features based on Shapley values and hypothesis testing - Google Patents
- Publication number
- CN114118246A (application number CN202111384278.XA)
- Authority
- CN
- China
- Prior art keywords
- features
- feature
- importance
- global
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a method and a device for selecting fully relevant features based on Shapley values and hypothesis testing. The method applies to the feature sets of supervised tasks. A feature selection model is designed to solve the full-relevance problem: the model first uses the Shapley attribution algorithm to compute the local importance of each feature, then uses random features to construct an adaptive threshold, and finally uses the importance and the threshold to evaluate the relevance of each feature. For the selection strategy, the invention designs a dual hypothesis test: the local hypothesis test quickly eliminates irrelevant features, and the global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.
Description
Technical Field
The invention relates to the technical field of feature selection, and in particular to a method and a device for selecting fully relevant features based on Shapley values and hypothesis testing.
Background
Feature selection is one of the central problems in feature engineering: its task is to select, from the original feature set, a subset of features relevant to the problem domain. Its goal is to improve the interpretability and predictive performance of the feature set, which is crucial in feature-data-centric scenarios. Traditional feature selection research mainly addresses the minimal-optimal problem, that is, selecting the smallest feature subset with the best classification performance. By the criterion used to evaluate feature subsets, such methods divide into filter methods and wrapper methods. A filter method ranks all features by a specific statistic and selects the feature subset according to the ranking. A wrapper method evaluates candidate feature subsets with a learning algorithm, changes the candidate subsets over multiple iterations, and then selects the best subset according to evaluation criteria such as classification accuracy and number of features. The advantage of minimal-optimal feature selection is that the resulting subset classifies well and contains few features, so the model built on it is simpler. The disadvantage is that the minimal-optimal feature set yields a black-box prediction model, and the interpretability of the feature set is hard to guarantee.
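The filter-style ranking described above can be sketched as follows. This is only an illustration of the prior-art idea, not part of the invention: the use of the absolute Pearson correlation as the ranking statistic and the names `pearson` and `filter_select` are assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def filter_select(X, y, k):
    """Rank features by |correlation with y| and return the indices of the top k."""
    n_features = len(X[0])
    scores = [abs(pearson([row[m] for row in X], y)) for m in range(n_features)]
    ranked = sorted(range(n_features), key=lambda m: scores[m], reverse=True)
    return ranked[:k]

# Rows are samples, columns are features (toy data, for illustration only).
X = [[1, 5, 0], [2, 3, 1], [3, 6, 0], [4, 2, 1]]
y = [1, 2, 3, 4]
top_two = filter_select(X, y, k=2)
```

A ranking of this kind keeps only the k best-scoring features, which is why it targets a small, well-performing subset rather than the full set of relevant features.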
To better understand the underlying knowledge of the problem domain, a feature selection method should preferably solve the full-relevance problem, that is, identify all features relevant to the problem domain. Solving the full-relevance problem has its own difficulties: for example, when the model has strong fitting capability, spurious relevance is widespread, so a relevance index is hard to define and evaluate; and it is hard to select all relevant features, especially weakly relevant ones.
Disclosure of Invention
The main object of the invention is to provide a method and a device for selecting fully relevant features based on Shapley values and hypothesis testing, so as to solve the problems that feature relevance cannot be evaluated effectively and that all relevant features cannot be identified adaptively.
In a first aspect, the present invention provides a method for selecting fully relevant features based on Shapley values and hypothesis testing, the method comprising:
Step 1: relevance evaluation;
the input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, giving M candidate features in total, which form the candidate feature set;
quantifying the importance of the M candidate features using Shapley values to obtain the local importance $I_m^{(n)}$ and the global importance $GI_m$;
obtaining an importance threshold T from the global importance $GI^*$ of the randomized features and the adaptive coefficient c;
evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
Step 2: selection strategy;
the input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
deriving, from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features, a relevant feature set, an irrelevant feature set, and a pending feature set;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1 again;
if the adaptive coefficient c is 1, checking whether the pending feature set is empty.
Optionally, quantifying the importance of the M candidate features using Shapley values to obtain the local importance $I_m^{(n)}$ and the global importance $GI_m$ comprises:
determining an input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, its label is $y^{(n)}$, and the M features form the candidate feature set;
the Shapley attribution algorithm is expressed as $f(x^{(n)}) = \phi_0 + \sum_{m=1}^{M} \phi_m^{(n)}$, that is, it attributes the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ to the contribution $\phi_m^{(n)}$ of the mth candidate feature, where $\phi_0$ is the mean value of the model output;
if the task is a classification task, $\phi_{m,l}^{(n)}$ represents the contribution of the mth feature of the nth sample to class l, and the local importance is $I_m^{(n)} = \phi_{m,l}^{(n)}$, where $l = y^{(n)}$;
if the task is a regression task, the contribution is expressed directly as the contribution value $\phi_m^{(n)}$, from which the local importance $I_m^{(n)}$ is obtained.
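As an illustration of Shapley attribution, the sketch below computes exact Shapley values for a small model by enumerating all feature coalitions. It is a minimal example under stated assumptions, not the implementation of the invention: the value function (features outside a coalition are replaced by values from a background data set), the toy `model`, and the names `shapley_values`, `background`, and `x` are all hypothetical. Exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model(x); absent features are filled from background rows."""
    M = len(x)

    def value(subset):
        # Mean model output when only the features in `subset` take their values from x.
        total = 0.0
        for row in background:
            z = [x[j] if j in subset else row[j] for j in range(M)]
            total += model(z)
        return total / len(background)

    phi = [0.0] * M
    for m in range(M):
        others = [j for j in range(M) if j != m]
        for size in range(M):
            # Shapley weight of a coalition of this size that excludes feature m.
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[m] += weight * (value(s | {m}) - value(s))
    phi0 = value(set())  # mean model output over the background data
    return phi0, phi

def model(z):
    # Toy regression model, assumed for the example.
    return 2.0 * z[0] + z[1] * z[2]

background = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
x = [3.0, 1.0, 2.0]
phi0, phi = shapley_values(model, x, background)
```

The contributions sum, together with the mean output `phi0`, to the model output on `x`, which is the additive property the attribution expression relies on.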
Optionally, in the method for selecting fully relevant features based on Shapley values and hypothesis testing:
the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and c is the adaptive coefficient.
Optionally, in the method for selecting fully relevant features based on Shapley values and hypothesis testing:
the local relevance is $R_m = \sum_{n=1}^{N} \mathbb{1}(I_m^{(n)} > T)$, where $R_m$ is the number of samples for which the local importance of the mth feature is higher than the adaptive threshold; the larger $R_m$, the higher the relevance;
the global relevance is $GR_m = \sum_{i=1}^{MI} \mathbb{1}(GI_m^{(i)} > T^{(i)})$, where the superscript (i) denotes the value in the ith iteration, MI is the maximum number of iterations, and $GR_m$ is the number of iterations in which the global importance of the mth feature is above the threshold.
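The two relevance counts can be sketched as follows. The data layout (one row per sample or per iteration, one column per feature), the toy numbers, and the function names `local_relevance` and `global_relevance` are assumptions made for illustration.

```python
def local_relevance(local_importance, threshold):
    """R_m: number of samples whose local importance for feature m exceeds T."""
    n_features = len(local_importance[0])
    return [sum(1 for row in local_importance if row[m] > threshold)
            for m in range(n_features)]

def global_relevance(gi_history, threshold_history):
    """GR_m: number of iterations in which GI_m exceeded that iteration's threshold."""
    n_features = len(gi_history[0])
    return [sum(1 for gi, t in zip(gi_history, threshold_history) if gi[m] > t)
            for m in range(n_features)]

# Local importance: rows are samples, columns are features (toy values).
I = [[0.9, 0.1, 0.3],
     [0.8, 0.0, 0.6],
     [0.7, 0.2, 0.1],
     [0.9, 0.1, 0.5]]
gi_star = 0.5           # global importance of the random features (assumed value)
c = 0.8                 # adaptive coefficient
T = c * gi_star         # adaptive threshold
R = local_relevance(I, T)

# Global importance per iteration and the threshold used in each iteration.
gi_history = [[0.80, 0.1, 0.40],
              [0.80, 0.1, 0.30],
              [0.85, 0.1, 0.45]]
threshold_history = [0.2, 0.3, 0.4]
GR = global_relevance(gi_history, threshold_history)
```

Here the first feature exceeds the threshold on every sample and in every iteration, the second never does, and the third does so only part of the time.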
Optionally, deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance at significance level $\alpha$, which defines two feature sets from the left and right rejection regions:
the features falling within the left rejection region are locally irrelevant features, and the features falling within the right rejection region are locally relevant features;
performing a hypothesis test on the global relevance, which likewise yields two feature sets:
the features falling within the left rejection region are globally irrelevant features, and the features falling within the right rejection region are globally relevant features;
combining the two hypothesis tests gives the partition of the feature set
into the relevant feature set, the irrelevant feature set, and the pending feature set.
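A single test of a relevance count against the null hypothesis can be sketched as below. The exact rejection-region formulas are not reproduced in this text, so the sketch makes its own assumptions: the significance level is split evenly between the two tails, the boundaries are inclusive, and the names `binom_cdf` and `classify` are hypothetical. The same function would apply to a local count (with n = N samples) or a global count (with n = MI iterations).

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """F(k) = P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(min(k, n) + 1))

def classify(count, n, alpha=0.05):
    """Test H0: count ~ Binomial(n, 0.5); alpha/2 per tail is an assumption."""
    if binom_cdf(count, n) <= alpha / 2:
        return "irrelevant"   # left rejection region: count is unusually low
    if 1.0 - binom_cdf(count - 1, n) <= alpha / 2:
        return "relevant"     # right rejection region: P(X >= count) is small
    return "pending"          # H0 not rejected: no decision yet

labels = [classify(k, n=20) for k in (3, 6, 10, 17)]
```

Features whose counts fall in neither rejection region stay undecided, which is the role the pending feature set plays in the partition.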
In a second aspect, the present invention also provides a device for selecting fully relevant features based on Shapley values and hypothesis testing, the device comprising:
an evaluation module 10, configured to perform step 1: relevance evaluation;
the input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, giving M candidate features in total, which form the candidate feature set;
quantifying the importance of the M candidate features using Shapley values to obtain the local importance $I_m^{(n)}$ and the global importance $GI_m$;
obtaining an importance threshold T from the global importance $GI^*$ of the randomized features and the adaptive coefficient c;
evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
a selection module 20, configured to perform step 2: selection strategy;
the input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
deriving, from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features, a relevant feature set, an irrelevant feature set, and a pending feature set;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1 again;
if the adaptive coefficient c is 1, checking whether the pending feature set is empty.
Optionally, the evaluation module 10 is further configured to perform:
determining an input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, its label is $y^{(n)}$, and the M features form the candidate feature set;
the Shapley attribution algorithm is expressed as $f(x^{(n)}) = \phi_0 + \sum_{m=1}^{M} \phi_m^{(n)}$, that is, it attributes the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ to the contribution $\phi_m^{(n)}$ of the mth candidate feature, where $\phi_0$ is the mean value of the model output;
if the task is a classification task, $\phi_{m,l}^{(n)}$ represents the contribution of the mth feature of the nth sample to class l, and the local importance is $I_m^{(n)} = \phi_{m,l}^{(n)}$, where $l = y^{(n)}$;
if the task is a regression task, the contribution is expressed directly as the contribution value $\phi_m^{(n)}$, from which the local importance $I_m^{(n)}$ is obtained.
Optionally, in the device for selecting fully relevant features based on Shapley values and hypothesis testing, the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and c is the adaptive coefficient.
Optionally, in the device for selecting fully relevant features based on Shapley values and hypothesis testing, the local relevance is $R_m = \sum_{n=1}^{N} \mathbb{1}(I_m^{(n)} > T)$, where $R_m$ is the number of samples for which the local importance of the mth feature is higher than the adaptive threshold; the larger $R_m$, the higher the relevance;
the global relevance is $GR_m = \sum_{i=1}^{MI} \mathbb{1}(GI_m^{(i)} > T^{(i)})$, where the superscript (i) denotes the value in the ith iteration, MI is the maximum number of iterations, and $GR_m$ is the number of iterations in which the global importance of the mth feature is above the threshold.
Optionally, the selection module 20 is further configured to perform:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance at significance level $\alpha$, which defines two feature sets from the left and right rejection regions:
the features falling within the left rejection region are locally irrelevant features, and the features falling within the right rejection region are locally relevant features;
performing a hypothesis test on the global relevance, which likewise yields two feature sets:
the features falling within the left rejection region are globally irrelevant features, and the features falling within the right rejection region are globally relevant features;
combining the two hypothesis tests gives the partition of the feature set
into the relevant feature set, the irrelevant feature set, and the pending feature set.
The invention discloses a method and a device for selecting fully relevant features based on Shapley values and hypothesis testing. The method applies to the feature sets of supervised tasks. A feature selection model is designed to solve the full-relevance problem: the model first uses the Shapley attribution algorithm to compute the local importance of each feature, then uses random features to construct an adaptive threshold, and finally uses the importance and the threshold to evaluate the relevance of each feature. For the selection strategy, the invention designs a dual hypothesis test: the local hypothesis test quickly eliminates irrelevant features, and the global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.
Drawings
FIG. 1 is a schematic flow chart of the method for selecting fully relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a first embodiment of the device for selecting fully relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention.
The implementation, functional features, and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In a first aspect, embodiments of the present invention provide a method for selecting fully relevant features based on Shapley values and hypothesis testing.
Referring to FIG. 1, FIG. 1 is a schematic flow chart of the method for selecting fully relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention.
As shown in FIG. 1, the method for selecting fully relevant features based on Shapley values and hypothesis testing comprises:
Step 1: relevance evaluation;
the input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, giving M candidate features in total, which form the candidate feature set;
quantifying the importance of the M candidate features using Shapley values to obtain the local importance $I_m^{(n)}$ and the global importance $GI_m$;
obtaining an importance threshold T from the global importance $GI^*$ of the randomized features and the adaptive coefficient c;
evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
In this embodiment, the method applies to supervised tasks and requires sample data $X = \{x^{(n)} \mid n = 1, \dots, N\}$ and labels $y^{(n)}$, where $x^{(n)} = (x_1, \dots, x_M)$ has M features in total, which form the candidate feature set.
Importance is computed with the Shapley attribution algorithm, giving for each feature in the feature set the local importance $I_m^{(n)}$, $n = 1, \dots, N$, and the global importance $GI_m = E(I_m)$, where N is the number of samples. The global importance $GI^*$ is obtained in the same way from the set of random features.
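The following sketch shows one way the random features and the global importance could be prepared. How the random features are constructed is not specified here, so shuffling each feature column independently (which keeps each feature's value distribution but breaks its link to the labels) is an assumption, as are the names `add_shadow_features` and `global_importance`.

```python
import random

def add_shadow_features(X, seed=0):
    """Append to every row a shuffled copy of each feature column (assumed construction)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    shadows = []
    for m in range(n_features):
        column = [row[m] for row in X]
        rng.shuffle(column)          # permute values across samples
        shadows.append(column)
    return [list(row) + [shadows[m][n] for m in range(n_features)]
            for n, row in enumerate(X)]

def global_importance(local_importance):
    """GI_m = E(I_m): mean of the per-sample local importance of each feature."""
    n_samples = len(local_importance)
    n_features = len(local_importance[0])
    return [sum(row[m] for row in local_importance) / n_samples
            for m in range(n_features)]

X = [[1, 10], [2, 20], [3, 30], [4, 40]]
X_extended = add_shadow_features(X, seed=1)   # columns 2 and 3 are the random features
GI = global_importance([[1.0, 2.0], [3.0, 4.0]])
```

A model trained on the extended data then yields importance values for both the candidate features and the random features, so the latter can serve as the reference for the threshold.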
Step 2: selection strategy;
the input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
deriving, from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features, a relevant feature set, an irrelevant feature set, and a pending feature set;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1 again;
if the adaptive coefficient c is 1, checking whether the pending feature set is empty.
Further, in an embodiment, quantifying the importance of the M candidate features using Shapley values to obtain the local importance $I_m^{(n)}$ and the global importance $GI_m$ comprises:
determining an input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, its label is $y^{(n)}$, and the M features form the candidate feature set;
the Shapley attribution algorithm is expressed as $f(x^{(n)}) = \phi_0 + \sum_{m=1}^{M} \phi_m^{(n)}$, that is, it attributes the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ to the contribution $\phi_m^{(n)}$ of the mth candidate feature, where $\phi_0$ is the mean value of the model output;
if the task is a classification task, $\phi_{m,l}^{(n)}$ represents the contribution of the mth feature of the nth sample to class l, and the local importance is $I_m^{(n)} = \phi_{m,l}^{(n)}$, where $l = y^{(n)}$;
if the task is a regression task, the contribution is expressed directly as the contribution value $\phi_m^{(n)}$, from which the local importance $I_m^{(n)}$ is obtained.
Further, in an embodiment, in the method for selecting fully relevant features based on Shapley values and hypothesis testing:
the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and c is the adaptive coefficient.
In this embodiment, the adaptive threshold $T = c \cdot GI^*$ is computed from the adaptive coefficient c and the global importance $GI^*$ of the random feature set, that is, $GI^*$ is multiplied by the coefficient c, where c has an initial value of 0.1 and a maximum value of 1.
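The schedule of the adaptive coefficient and the resulting thresholds can be sketched as follows. The generator name `threshold_schedule` and the value used for `gi_star` are assumptions; stepping an integer counter rather than repeatedly adding 0.1 is a design choice of the sketch that avoids accumulated floating-point error when checking whether c has reached 1.

```python
def threshold_schedule(gi_star, start_tenths=1, stop_tenths=10):
    """Yield (c, T) with T = c * GI* for c = 0.1, 0.2, ..., 1.0."""
    for tenths in range(start_tenths, stop_tenths + 1):
        c = tenths / 10          # exact step count, no drift from repeated += 0.1
        yield c, c * gi_star

schedule = list(threshold_schedule(gi_star=0.5))
```

Early iterations use a low threshold, and the threshold rises toward the full random-feature importance as c approaches 1.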
Further, in an embodiment, in the method for selecting fully relevant features based on Shapley values and hypothesis testing:
the local relevance is $R_m = \sum_{n=1}^{N} \mathbb{1}(I_m^{(n)} > T)$, where $R_m$ is the number of samples for which the local importance of the mth feature is higher than the adaptive threshold; the larger $R_m$, the higher the relevance;
the global relevance is $GR_m = \sum_{i=1}^{MI} \mathbb{1}(GI_m^{(i)} > T^{(i)})$, where the superscript (i) denotes the value in the ith iteration, MI is the maximum number of iterations, and $GR_m$ is the number of iterations in which the global importance of the mth feature is above the threshold.
In this embodiment, the relevance is evaluated as follows. Local relevance: the local importance $I_m^{(n)}$ of a feature is compared with the threshold T, and the number of times it is above T is the local relevance of the feature, i.e. $R_m = \sum_{n=1}^{N} \mathbb{1}(I_m^{(n)} > T)$. Global relevance: given that MI iterations have been performed, the number of those iterations in which the global importance $GI_m$ is above the threshold is recorded as the global relevance, i.e. $GR_m = \sum_{i=1}^{MI} \mathbb{1}(GI_m^{(i)} > T^{(i)})$.
Further, in an embodiment, deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance at significance level $\alpha$, which defines two feature sets from the left and right rejection regions:
the features falling within the left rejection region are locally irrelevant features, and the features falling within the right rejection region are locally relevant features;
performing a hypothesis test on the global relevance, which likewise yields two feature sets:
the features falling within the left rejection region are globally irrelevant features, and the features falling within the right rejection region are globally relevant features;
combining the two hypothesis tests gives the partition of the feature set
into the relevant feature set, the irrelevant feature set, and the pending feature set.
In this embodiment, after one round of feature selection, the relevant feature set replaces the original feature set in preparation for subsequent supervised tasks.
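Replacing the original feature set with the relevant feature set amounts to keeping only the selected columns of the data. A minimal sketch, in which the function name `keep_relevant`, the feature names, and the toy data are assumptions:

```python
def keep_relevant(X, feature_names, relevant):
    """Return the data and feature names restricted to the relevant feature set."""
    keep = [j for j, name in enumerate(feature_names) if name in relevant]
    reduced = [[row[j] for j in keep] for row in X]
    return reduced, [feature_names[j] for j in keep]

X = [[1, 2, 3], [4, 5, 6]]
names = ["a", "b", "c"]
X_relevant, names_relevant = keep_relevant(X, names, relevant={"a", "c"})
```

The reduced data can then be passed to whichever supervised model follows.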
The embodiment of the invention discloses a method and a device for selecting fully relevant features based on Shapley values and hypothesis testing. The method applies to the feature sets of supervised tasks. A feature selection model is designed to solve the full-relevance problem: the model first uses the Shapley attribution algorithm to compute the local importance of each feature, then uses random features to construct an adaptive threshold, and finally uses the importance and the threshold to evaluate the relevance of each feature. For the selection strategy, the embodiment of the invention designs a dual hypothesis test: the local hypothesis test quickly eliminates irrelevant features, and the global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.
In a second aspect, embodiments of the present invention further provide a device for selecting fully relevant features based on Shapley values and hypothesis testing.
Referring to FIG. 2, FIG. 2 is a functional block diagram of a first embodiment of the device for selecting fully relevant features based on Shapley values and hypothesis testing according to an embodiment of the present invention.
In this embodiment, the device for selecting fully relevant features based on Shapley values and hypothesis testing comprises:
an evaluation module, configured to perform step 1: relevance evaluation;
the input to step 1 is a data set of N samples, denoted $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, giving M candidate features in total, which form the candidate feature set;
quantifying the importance of the M candidate features using Shapley values to obtain the local importance $I_m^{(n)}$ and the global importance $GI_m$;
obtaining an importance threshold T from the global importance $GI^*$ of the randomized features and the adaptive coefficient c;
evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
a selection module, configured to perform step 2: selection strategy;
the input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
deriving, from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features, a relevant feature set, an irrelevant feature set, and a pending feature set;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1 again;
if the adaptive coefficient c is 1, checking whether the pending feature set is empty.
Further, in an embodiment, the evaluation module is further configured to perform:
determining an input data set $X = \{x^{(n)} \mid n = 1, \dots, N\}$, where the feature vector of the nth sample is $x^{(n)} = (x_1, \dots, x_M)$, its label is $y^{(n)}$, and the M features form the candidate feature set;
the Shapley attribution algorithm is expressed as $f(x^{(n)}) = \phi_0 + \sum_{m=1}^{M} \phi_m^{(n)}$, that is, it attributes the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ to the contribution $\phi_m^{(n)}$ of the mth candidate feature, where $\phi_0$ is the mean value of the model output;
if the task is a classification task, $\phi_{m,l}^{(n)}$ represents the contribution of the mth feature of the nth sample to class l, and the local importance is $I_m^{(n)} = \phi_{m,l}^{(n)}$, where $l = y^{(n)}$;
if the task is a regression task, the contribution is expressed directly as the contribution value $\phi_m^{(n)}$, from which the local importance $I_m^{(n)}$ is obtained.
Further, in an embodiment, in the device for selecting fully relevant features based on Shapley values and hypothesis testing, the adaptive threshold is expressed as $T = c \cdot GI^*$, where $GI^*$ is the global importance of the random features and c is the adaptive coefficient.
Further, in an embodiment, in the device for selecting fully relevant features based on Shapley values and hypothesis testing, the local relevance is $R_m = \sum_{n=1}^{N} \mathbb{1}(I_m^{(n)} > T)$, where $R_m$ is the number of samples for which the local importance of the mth feature is higher than the adaptive threshold; the larger $R_m$, the higher the relevance;
the global relevance is $GR_m = \sum_{i=1}^{MI} \mathbb{1}(GI_m^{(i)} > T^{(i)})$, where the superscript (i) denotes the value in the ith iteration, MI is the maximum number of iterations, and $GR_m$ is the number of iterations in which the global importance of the mth feature is above the threshold.
Further, in an embodiment, the selection module is further configured to perform:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance at significance level $\alpha$, which defines two feature sets from the left and right rejection regions:
the features falling within the left rejection region are locally irrelevant features, and the features falling within the right rejection region are locally relevant features;
performing a hypothesis test on the global relevance, which likewise yields two feature sets:
the features falling within the left rejection region are globally irrelevant features, and the features falling within the right rejection region are globally relevant features;
combining the two hypothesis tests gives the partition of the feature set
into the relevant feature set, the irrelevant feature set, and the pending feature set.
The functions of the modules in the device for selecting fully relevant features correspond to the steps in the embodiments of the method for selecting fully relevant features based on Shapley values and hypothesis testing, and their functions and implementation are not described again here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a terminal device to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A method for selecting a fully relevant feature based on a salpril value and hypothesis testing, wherein the method for selecting the fully relevant feature based on the salpril value and hypothesis testing comprises:
step 1: evaluating relevance;
the input to step 1 is a data set consisting of N samples, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, giving a total of M candidate features;
quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance values and the global importance $GI_m$;
obtaining an importance threshold T from the global importance of randomized features and the adaptive coefficient c;
evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
step 2: selection strategy;
the input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
deriving a relevant feature set, an irrelevant feature set, and a pending feature set based on the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1 again;
if the adaptive coefficient c is 1, detecting whether the pending feature set is empty;
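The loop control at the end of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not give a starting value for c, so `start_tenths` is a hypothetical parameter, and `coefficient_schedule` is an illustrative name that does not appear in the patent. Counting in integer tenths is a design choice of this sketch; it avoids the floating-point drift that can make a repeatedly incremented c miss the exact value 1.

```python
def coefficient_schedule(start_tenths):
    """Yield c = start_tenths/10, ..., 1.0 in steps of 0.1.

    start_tenths is a hypothetical starting point (the claim does not state one).
    Counting in integer tenths keeps the final value exactly 1.0.
    """
    for tenths in range(start_tenths, 11):
        yield tenths / 10


for c in coefficient_schedule(8):
    # Step 1 (relevance evaluation) and step 2 (selection) would run here.
    last_round = (c == 1.0)
    print(f"c = {c:.1f}, last round: {last_round}")
```

Only in the last round, when c equals 1, would the pending feature set be checked for emptiness.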
2. The method for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 1, wherein quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance values and the global importance $GI_m$ comprises:
determining an input data set, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$ and its label is $y^{(n)}$;
using the Shapley attribution algorithm, attributing the output of the classification/regression model $f(\cdot)$ for sample $x^{(n)}$ to the contribution of the mth candidate feature, relative to the mean value of the model output;
if the task is a classification task, the contribution of the mth feature of the nth sample to class l is obtained, and the local importance is taken with $l = y^{(n)}$;
if the task is a regression task, the contribution is expressed directly as a single contribution value, from which the local importance is obtained;
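The attribution step in this claim can be illustrated with an exact Shapley computation on a tiny model. This is a sketch under stated assumptions, not the patented implementation: absent features are replaced by a single hypothetical `background` point instead of averaging over a data set, the function name `shapley_values` and the toy `linear_model` are illustrative, and the sketch stops at the per-feature contributions because the claim's formula for turning them into local importance values is not reproduced in the text.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley contributions of each feature to f(x) - f(background).

    Features outside a coalition take their value from `background`
    (an assumption of this sketch).
    """
    M = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else background[j] for j in range(M)]
        return f(z)

    phi = []
    for m in range(M):
        others = [j for j in range(M) if j != m]
        total = 0.0
        for k in range(M):
            # Shapley weight for coalitions of size k that exclude feature m.
            weight = factorial(k) * factorial(M - k - 1) / factorial(M)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {m}) - value(s))
        phi.append(total)
    return phi


def linear_model(z):
    # Toy regression model; the third feature has no effect.
    return 2.0 * z[0] - 3.0 * z[1] + 0.0 * z[2]


phi = shapley_values(linear_model, x=[1.0, 2.0, 5.0], background=[0.0, 0.0, 0.0])
print(phi)
```

The contributions sum to the difference between the model output at the sample and at the background point, which is the additive property the attribution relies on. Exact enumeration is exponential in M, so it is practical only for a handful of features.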
3. The method for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 2, wherein the method comprises:
4. The method for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 3, wherein the method comprises:
the local relevance $R_m$ refers to the number of times the local importance of the mth feature is higher than the adaptive threshold; the greater $R_m$, the greater the degree of relevance;
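The counting behind the local relevance index can be sketched as follows. Two points are assumptions of this sketch rather than statements of the claim: the count is taken over the N samples, and the threshold is formed by scaling the largest randomized-feature importance by c. The claim's own threshold formula is not reproduced in the text, and the names `adaptive_threshold` and `local_relevance` are illustrative.

```python
def adaptive_threshold(shadow_importances, c):
    # Hypothetical rule: scale the largest randomized-feature importance by c.
    return c * max(shadow_importances)


def local_relevance(local_importance, threshold):
    # local_importance[n][m]: local importance of feature m in sample n.
    # R_m counts the samples in which feature m exceeds the threshold.
    n_features = len(local_importance[0])
    return [
        sum(1 for row in local_importance if row[m] > threshold)
        for m in range(n_features)
    ]


importance = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.6]]
T = adaptive_threshold([0.3, 0.5], c=1.0)
R = local_relevance(importance, T)
print(T, R)
```

A smaller c lowers the threshold, so more features pass it in early rounds; raising c toward 1 makes the comparison stricter.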
5. The method for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 4, wherein deriving the relevant feature set, the irrelevant feature set, and the pending feature set based on the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance, and defining the feature sets falling into the left and right rejection regions as follows:
wherein $\alpha$ is the significance level; features falling within the left rejection region are locally irrelevant features, and features falling within the right rejection region are locally relevant features;
performing a hypothesis test on the global relevance to obtain two feature sets:
wherein features falling within the left rejection region are globally irrelevant features, and features falling within the right rejection region are globally relevant features;
the division of the feature set is obtained according to the two hypothesis tests:
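A minimal sketch of one such hypothesis test, applied to a list of relevance counts, is given below. It assumes a two-sided test that places α/2 in each tail and uses illustrative names (`binom_cdf`, `split_by_test`); the claim's exact rejection-region formulas and the rule that combines the local and global tests into the relevant, irrelevant, and pending sets are not reproduced in the text, so neither is implemented here.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def split_by_test(counts, n_trials, alpha=0.05):
    """Return (left, right): feature indices in the left and right rejection regions.

    Assumes H0: each count follows Binomial(n_trials, 0.5), with alpha/2 per tail.
    """
    left, right = set(), set()
    for m, r in enumerate(counts):
        if binom_cdf(r, n_trials) < alpha / 2:
            left.add(m)   # significantly few hits: irrelevant under this test
        elif 1 - binom_cdf(r - 1, n_trials) < alpha / 2:
            right.add(m)  # significantly many hits: relevant under this test
    return left, right


left, right = split_by_test([19, 10, 1], n_trials=20)
print(left, right)
```

Features in neither region, such as the one with 10 hits out of 20 here, are the ones a test of this kind cannot decide.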
6. A device for selecting fully relevant features based on the Shapley value and hypothesis testing, wherein the device comprises:
an evaluation module for performing step 1: evaluating relevance;
the input to step 1 is a data set consisting of N samples, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, giving a total of M candidate features;
quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance values and the global importance $GI_m$;
obtaining an importance threshold T from the global importance of randomized features and the adaptive coefficient c;
evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
a selection module for performing step 2: selection strategy;
the input to step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
deriving a relevant feature set, an irrelevant feature set, and a pending feature set based on the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;
if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1 again;
if the adaptive coefficient c is 1, detecting whether the pending feature set is empty;
7. The device for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 6, wherein the evaluation module is further configured for:
determining an input data set, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$ and its label is $y^{(n)}$;
using the Shapley attribution algorithm, attributing the output of the classification/regression model $f(\cdot)$ for sample $x^{(n)}$ to the contribution of the mth candidate feature, relative to the mean value of the model output;
if the task is a classification task, the contribution of the mth feature of the nth sample to class l is obtained, and the local importance is taken with $l = y^{(n)}$;
if the task is a regression task, the contribution is expressed directly as a single contribution value, from which the local importance is obtained;
9. The device for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 8, wherein the local relevance $R_m$ refers to the number of times the local importance of the mth feature is higher than the adaptive threshold; the greater $R_m$, the greater the degree of relevance;
10. The device for selecting fully relevant features based on the Shapley value and hypothesis testing as claimed in claim 9, wherein the selection module is further configured for:
defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;
performing a hypothesis test on the local relevance, and defining the feature sets falling into the left and right rejection regions as follows:
wherein $\alpha$ is the significance level; features falling within the left rejection region are locally irrelevant features, and features falling within the right rejection region are locally relevant features;
performing a hypothesis test on the global relevance to obtain two feature sets:
wherein features falling within the left rejection region are globally irrelevant features, and features falling within the right rejection region are globally relevant features;
the division of the feature set is obtained according to the two hypothesis tests:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111384278.XA CN114118246A (en) | 2021-11-16 | 2021-11-16 | Method and device for selecting fully-relevant features based on Shapril value and hypothesis test |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114118246A (en) | 2022-03-01 |
Family
ID=80439074
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111384278.XA Pending CN114118246A (en) | 2021-11-16 | 2021-11-16 | Method and device for selecting fully-relevant features based on Shapril value and hypothesis test |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114118246A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110009014A (en) * | 2019-03-24 | 2019-07-12 | Beijing University of Technology | A kind of feature selection approach merging related coefficient and mutual information |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115953248A (en) * | 2023-03-01 | 2023-04-11 | Alipay (Hangzhou) Information Technology Co., Ltd. | Wind control method, device, equipment and medium based on Shapril additive interpretation |
CN115953248B (en) * | 2023-03-01 | 2023-05-16 | Alipay (Hangzhou) Information Technology Co., Ltd. | Wind control method, device, equipment and medium based on saprolitic additivity interpretation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20220301 |