CN114118246A

CN114118246A - Method and device for selecting fully-relevant features based on Shapley value and hypothesis test

Info

Publication number: CN114118246A
Application number: CN202111384278.XA
Authority: CN
Inventors: 陈丹; 殷丁泽; 汤云波; 李小俚; 熊明福
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2022-03-01

Abstract

The invention discloses a method and a device for selecting fully-relevant features based on the Shapley value and hypothesis testing. The method is suitable for feature sets used in supervised tasks. A feature selection model is designed to solve the fully-relevant problem: the model first uses a Shapley attribution algorithm to calculate the local importance of each feature, then uses randomized features to construct an adaptive threshold, and finally evaluates the relevance of each feature from its importance and the threshold. For the selection strategy, the invention designs a double hypothesis test: a local hypothesis test quickly eliminates irrelevant features, and a global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.

Description

Method and device for selecting fully-relevant features based on Shapley value and hypothesis test

Technical Field

The invention relates to the technical field of feature selection, in particular to a method and a device for selecting fully-relevant features based on the Shapley value and hypothesis testing.

Background

Feature selection is one of the important problems in feature engineering. Its task is to select, from the original feature set, a subset of features that are relevant to the problem domain, with the goal of improving the interpretability and predictive performance of the feature set. Solving this problem is crucial in scenarios centered on feature data. At present, research on traditional feature selection mainly addresses the minimal-optimal problem, namely selecting the smallest feature subset with the best classification performance. According to the criterion used to evaluate feature subsets, these methods can be divided into filter methods and wrapper methods. A filter method ranks all features by a specific statistic and selects the feature subset according to the ranking. A wrapper method evaluates candidate feature subsets with a learning algorithm, changes the candidate subsets over multiple iterations, and then selects the best subset according to evaluation criteria such as classification accuracy and number of features. Feature selection aimed at the minimal-optimal problem has the advantages that the resulting feature subset classifies well, contains few features, and leads to a simpler downstream model. Its disadvantage is that a model built on the minimal-optimal feature set is a black-box predictor, and the interpretability of the feature set is difficult to guarantee.

To better understand the latent knowledge of the problem domain, a feature selection method should preferably solve the fully-relevant problem, that is, determine all features relevant to the problem domain. Solving the fully-relevant problem has its own difficulties: under a model with strong fitting capability, spurious relevance is widespread, so a relevance index is difficult to define and evaluate; and it is difficult to select all relevant features, especially weakly relevant ones.

Disclosure of Invention

The main purpose of the invention is to provide a method and a device for selecting fully-relevant features based on the Shapley value and hypothesis testing, so as to solve the problems that feature relevance cannot be evaluated effectively and that all relevant features cannot be identified adaptively.

In a first aspect, the present invention provides a method for selecting fully-relevant features based on the Shapley value and hypothesis testing, the method comprising:

Step 1: evaluating the relevance;

the input to step 1 is a data set consisting of N samples, denoted $X = \{x^{(n)} \mid n = 1, \ldots, N\}$, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, giving a total of M candidate features;

quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance $I_m$ and the global importance $GI_m$ of each feature;

obtaining an importance threshold $T$ from the global importance $GI^{*}$ of the randomized features and the adaptive coefficient c;

evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;

Step 2: selecting a strategy;

the input of step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;

deriving a relevant feature set, an irrelevant feature set, and a pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;

detecting whether the irrelevant feature set is empty;

if the irrelevant feature set is not empty, deleting the irrelevant features and executing step 1;

if the irrelevant feature set is empty, detecting whether the adaptive coefficient c is 1;

if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1;

if the adaptive coefficient c is 1, detecting whether the pending feature set is empty;

if the pending feature set is not empty, executing step 1;

if the pending feature set is empty, stopping the execution of step 1.
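The control flow of steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the callback `evaluate`, which stands in for step 1 plus the partition of step 2, and the safety cap `max_rounds` are hypothetical names introduced here. The coefficient c is stored in tenths so that the comparison with 1 is exact.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical signature: given the active features and the coefficient c,
# return the (relevant, irrelevant, pending) feature sets of steps 1 and 2.
Evaluate = Callable[[List[int], float], Tuple[Set[int], Set[int], Set[int]]]


def select_features(features: List[int], evaluate: Evaluate,
                    max_rounds: int = 1000) -> Set[int]:
    """Iterate steps 1 and 2 until no pending feature remains at c = 1."""
    active = list(features)
    c_tenths = 1  # c starts at 0.1; an integer avoids floating-point drift
    relevant: Set[int] = set()
    for _ in range(max_rounds):  # safety cap, an assumption of this sketch
        relevant, irrelevant, pending = evaluate(active, c_tenths / 10)
        if irrelevant:
            # Delete the irrelevant features and run step 1 again.
            active = [m for m in active if m not in irrelevant]
        elif c_tenths < 10:
            # Nothing to delete: raise c by 0.1 and run step 1 again.
            c_tenths += 1
        elif not pending:
            # c = 1 and no pending feature: stop.
            break
        # Otherwise c = 1 with pending features: run step 1 again unchanged.
    return relevant
```

Because the irrelevant features are removed before c is raised, the threshold only becomes stricter once the current threshold no longer eliminates anything.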

Optionally, quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance and the global importance $GI_m$ comprises:

determining an input data set $X = \{x^{(n)} \mid n = 1, \ldots, N\}$, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, its label is $y^{(n)}$, and the feature set consists of the M candidate features;

using the Shapley attribution algorithm, the output of the classification/regression model $f(\cdot)$ on sample $x^{(n)}$ is attributed to the contribution of each candidate feature m, added to a base value that is the mean of the model output;

if the task is a classification task, the attribution gives the contribution of the mth feature of the nth sample to each class l, and the local importance is the contribution to the labelled class, i.e. $l = y^{(n)}$;

if the task is a regression task, the attribution is directly expressed as a single contribution value, from which the local importance is obtained;

the global importance is the average of the local importance over all samples, i.e. $GI_m = \frac{1}{N} \sum_{n=1}^{N} I_m^{(n)}$, where $I_m^{(n)}$ denotes the local importance of the mth feature on the nth sample.
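The attribution and averaging described above can be illustrated with an exact Shapley computation on a small model. The sketch below makes two assumptions that are not in the source: absent features are replaced by a fixed `baseline` vector (so the base value is the model output at the baseline rather than the mean model output), and all subsets are enumerated, which is only feasible for a handful of features. The function names are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(f: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values of f at x; absent features take the baseline."""
    m = len(x)

    def value(subset):
        # Features in `subset` keep their real value, the rest the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(m)]
        return f(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


def global_importance(local: List[List[float]]) -> List[float]:
    """GI_m: average over samples of the local importance of feature m."""
    n = len(local)
    return [sum(row[m] for row in local) / n for m in range(len(local[0]))]
```

For a linear model such as f(z) = 2·z[0] + 3·z[1] with a zero baseline, the values returned at x = (1, 1) are the per-feature terms 2 and 3, and they sum to f(x) minus f(baseline).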

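Optionally, in the method for selecting fully-relevant features based on the Shapley value and hypothesis testing:

the adaptive threshold is expressed as $T = c \cdot GI^{*}$, wherein $GI^{*}$ is the global importance of the random features and c is the adaptive coefficient;

the local relevance is $R_m = \sum_{n=1}^{N} \mathbb{1}\left[ I_m^{(n)} > T \right]$, where $I_m^{(n)}$ is the local importance of the mth feature on the nth sample; that is, $R_m$ is the number of samples on which the local importance of the mth feature is higher than the adaptive threshold, and the larger $R_m$, the higher the degree of relevance;

the global relevance is $GR_m = \sum_{i=1}^{MI} \mathbb{1}\left[ GI_m^{(i)} > T^{(i)} \right]$, where MI is the maximum number of iterations and the superscript $(i)$ marks a value obtained in the ith iteration; that is, $GR_m$ is the number of iterations in which the global importance of the mth feature is above the threshold.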
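The threshold and the two relevance counts can be sketched in a few lines. The sketch assumes that the global importance of the random features is summarized as a single number `gi_random` and that local importance is stored as one row per sample; both are assumptions of this illustration, as are the function names.

```python
from typing import List


def adaptive_threshold(gi_random: float, c: float) -> float:
    """T = c * GI*, with GI* the global importance of the random features."""
    return c * gi_random


def local_relevance(local: List[List[float]], t: float) -> List[int]:
    """R_m: number of samples whose local importance of feature m exceeds T."""
    m_count = len(local[0])
    return [sum(1 for row in local if row[m] > t) for m in range(m_count)]


def global_relevance(gi_history: List[List[float]],
                     t_history: List[float]) -> List[int]:
    """GR_m: number of iterations in which GI_m exceeded that iteration's T."""
    m_count = len(gi_history[0])
    return [sum(1 for gi, t in zip(gi_history, t_history) if gi[m] > t)
            for m in range(m_count)]
```

With three samples and T = 0.1, a feature whose local importances are 0.5, 0.3 and 0.4 gets a local relevance of 3, while a feature with 0.01, 0.02 and 0.2 gets 1.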

Optionally, deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:

defining the null hypothesis (H0): the relevance of a feature follows a binomial distribution with probability 0.5, whose probability distribution function is $F(\cdot)$;

performing a hypothesis test on the local relevance at significance level $\alpha$ and defining two feature sets: the features falling into the left rejection region, which are locally irrelevant features, and the features falling into the right rejection region, which are locally relevant features;

performing a hypothesis test on the global relevance to obtain two further feature sets: the features falling into the left rejection region, which are globally irrelevant features, and the features falling into the right rejection region, which are globally relevant features;

dividing the feature set according to the two hypothesis tests into the relevant feature set, the irrelevant feature set, and the pending feature set.
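A binomial test of this kind can be sketched with the standard library alone. The exact rejection rule is not recoverable from the text, so the sketch assumes a two-sided test that puts $\alpha/2$ in each tail; `rejection_region` and its return labels are hypothetical. For the local test the number of trials would be the number of samples N, and for the global test the number of iterations performed.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))


def rejection_region(count: int, trials: int, alpha: float = 0.05) -> str:
    """Classify a relevance count under H0: count ~ Binomial(trials, 0.5)."""
    lower_tail = binom_cdf(count, trials)            # P(X <= count)
    upper_tail = 1.0 - binom_cdf(count - 1, trials)  # P(X >= count)
    if lower_tail < alpha / 2:
        return "left"   # significantly few hits: irrelevant at this level
    if upper_tail < alpha / 2:
        return "right"  # significantly many hits: relevant at this level
    return "none"       # H0 not rejected: the feature stays undecided
```

A feature that exceeds the threshold in none of 20 trials lands in the left region, one that exceeds it in all 20 lands in the right region, and one that exceeds it in 10 of 20 is left undecided.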

In a second aspect, the present invention also provides a device for selecting fully-relevant features based on the Shapley value and hypothesis testing, the device comprising:

an evaluation module 10, configured to perform step 1: evaluating the relevance;

the input to step 1 is a data set consisting of N samples with M candidate features; the importance of the M candidate features is quantified by using the Shapley value to obtain the local importance $I_m$ and the global importance $GI_m$; an importance threshold $T$ is obtained from the global importance $GI^{*}$ of the randomized features and the adaptive coefficient c; and the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated;

a selecting module 20, configured to perform step 2: selecting a strategy;

a relevant feature set, an irrelevant feature set, and a pending feature set are derived from the local relevance index $R_m$ and the global relevance index $GR_m$, and whether the irrelevant feature set is empty is detected;

if the irrelevant feature set is not empty, the irrelevant features are deleted and step 1 is executed;

if the irrelevant feature set is empty, whether the adaptive coefficient c is 1 is detected;

if the adaptive coefficient c is not 1, the adaptive coefficient c is increased by 0.1 and step 1 is executed;

if the adaptive coefficient c is 1, whether the pending feature set is empty is detected;

if the pending feature set is not empty, step 1 is executed;

if the pending feature set is empty, the execution of step 1 is stopped.

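Optionally, the evaluation module 10 is further configured to:

determine an input data set $X = \{x^{(n)} \mid n = 1, \ldots, N\}$, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, its label is $y^{(n)}$, and the feature set consists of the M candidate features;

attribute, using the Shapley attribution algorithm, the output of the classification/regression model on sample $x^{(n)}$ to the contribution of each candidate feature, added to a base value that is the mean of the model output;

if the task is a classification task, take as local importance the contribution of the mth feature of the nth sample to the labelled class, i.e. $l = y^{(n)}$;

if the task is a regression task, obtain the local importance from the single contribution value;

take as global importance the average of the local importance over all samples.

Optionally, in the device, the adaptive threshold is expressed as $T = c \cdot GI^{*}$, wherein $GI^{*}$ is the global importance of the random features and c is the adaptive coefficient.

Optionally, in the device, the local relevance $R_m$ is the number of samples on which the local importance of the mth feature is higher than the adaptive threshold, and the global relevance $GR_m$ is the number of iterations, out of at most MI, in which the global importance of the mth feature is above the threshold.

Optionally, the selecting module 20 is further configured to:

define the null hypothesis (H0) that the relevance of a feature follows a binomial distribution with probability 0.5;

perform a hypothesis test on the local relevance and on the global relevance, wherein $\alpha$ is the significance level, features falling into the left rejection region are irrelevant features, and features falling into the right rejection region are relevant features;

divide the feature set according to the two hypothesis tests into the relevant feature set, the irrelevant feature set, and the pending feature set.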

Drawings

FIG. 1 is a schematic flow chart of the method for selecting fully-relevant features based on the Shapley value and hypothesis testing according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of a first embodiment of the device for selecting fully-relevant features based on the Shapley value and hypothesis testing according to an embodiment of the present invention.

The implementation, functional features, and advantages of the present invention are further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In a first aspect, embodiments of the present invention provide a method for selecting fully-relevant features based on the Shapley value and hypothesis testing.

Referring to FIG. 1, FIG. 1 is a schematic flow chart of the method for selecting fully-relevant features based on the Shapley value and hypothesis testing according to an embodiment of the present invention.

As shown in FIG. 1, the method for selecting fully-relevant features based on the Shapley value and hypothesis testing comprises:

Step 1: evaluating the relevance;

the input to step 1 is a data set consisting of N samples with M candidate features; the importance of the M candidate features is quantified by using the Shapley value to obtain the local importance $I_m$ and the global importance $GI_m$; an importance threshold $T$ is obtained from the global importance $GI^{*}$ of the randomized features and the adaptive coefficient c; and the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated.

In this embodiment, the method is suitable for a supervised task, so sample data $X = \{x^{(n)} \mid n = 1, \ldots, N\}$ and labels $y^{(n)}$ must be provided, wherein $x^{(n)} = (x_1, \ldots, x_M)$ has a total of M features, which together form the feature set.

The features are randomized: the feature set is sampled and randomized to obtain a random feature set.

The Shapley attribution algorithm is then used to calculate the local importance $I_m$ of each feature in the feature set and the global importance $GI_m = E(I_m)$, the expectation being taken over the N samples. The global importance $GI^{*}$ is obtained in the same way from the random feature set.
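One common way to obtain randomized features is to shuffle each feature column independently across samples, which keeps the marginal distribution of the feature but breaks its link to the label. The source does not state the exact randomization scheme, so the sketch below is an assumption, and `randomize_features` is a hypothetical name.

```python
import random
from typing import List


def randomize_features(data: List[List[float]], seed: int = 0) -> List[List[float]]:
    """Return a copy of `data` with every feature column shuffled independently."""
    rng = random.Random(seed)
    n_samples, n_features = len(data), len(data[0])
    shuffled = [row[:] for row in data]  # leave the input untouched
    for m in range(n_features):
        column = [row[m] for row in data]
        rng.shuffle(column)  # break any link between feature m and the label
        for n in range(n_samples):
            shuffled[n][m] = column[n]
    return shuffled
```

The importance computed on such columns reflects what a feature with no real signal can reach by chance, which is the role $GI^{*}$ plays in the threshold.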

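Step 2: selecting a strategy;

the input of step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features, from which a relevant feature set, an irrelevant feature set, and a pending feature set are derived;

detecting whether the irrelevant feature set is empty;

if the irrelevant feature set is not empty, deleting the irrelevant features and executing step 1;

if the irrelevant feature set is empty, detecting whether the adaptive coefficient c is 1;

if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1;

if the adaptive coefficient c is 1, detecting whether the pending feature set is empty;

if the pending feature set is not empty, executing step 1;

if the pending feature set is empty, stopping the execution of step 1.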

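Further, in an embodiment, quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance and the global importance $GI_m$ comprises:

determining an input data set $X = \{x^{(n)} \mid n = 1, \ldots, N\}$ with labels $y^{(n)}$;

using the Shapley attribution algorithm, the output of the classification/regression model on sample $x^{(n)}$ is attributed to the contribution of each candidate feature, added to a base value that is the mean of the model output;

if the task is a classification task, the local importance is the contribution of the mth feature of the nth sample to the labelled class, i.e. $l = y^{(n)}$;

if the task is a regression task, the local importance is obtained from the single contribution value;

the global importance is the average of the local importance over all samples.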
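For reference, the additive form usually associated with Shapley attribution, together with the averaging that defines the global importance, can be written as below. The symbols $\phi_0$ and $\phi_m^{(n)}$ are standard notation assumed here for illustration; the patent's own symbols for the base value and the contributions are not reproduced in this text.

```latex
f\bigl(x^{(n)}\bigr) = \phi_0 + \sum_{m=1}^{M} \phi_m^{(n)},
\qquad
GI_m = \frac{1}{N} \sum_{n=1}^{N} I_m^{(n)}
```

Here $\phi_0$ is the mean of the model output, $\phi_m^{(n)}$ is the contribution of the mth feature on the nth sample, and $I_m^{(n)}$ is the local importance derived from that contribution.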

Further, in an embodiment, the adaptive threshold is expressed as $T = c \cdot GI^{*}$, wherein $GI^{*}$ is the global importance of the random features and c is the adaptive coefficient.

In this embodiment, the adaptive threshold $T$ is calculated from the adaptive coefficient c and the global importance $GI^{*}$ of the random feature set, namely $GI^{*}$ multiplied by the coefficient c, where c has an initial value of 0.1 and a maximum value of 1.

The local relevance is $R_m = \sum_{n=1}^{N} \mathbb{1}\left[ I_m^{(n)} > T \right]$ and the global relevance is $GR_m = \sum_{i=1}^{MI} \mathbb{1}\left[ GI_m^{(i)} > T^{(i)} \right]$, where $I_m^{(n)}$ is the local importance of the mth feature on the nth sample and the superscript $(i)$ marks a value obtained in the ith iteration.

In this embodiment, the relevance needs to be evaluated. Local relevance: the local importance $I_m$ of a feature is compared with the threshold $T$, and the number of times it is above $T$ is the local relevance $R_m$ of the feature. Global relevance: given that MI iterations have been performed, the number of those iterations in which the global importance $GI_m$ is above the threshold is the global relevance $GR_m$.

Further, in an embodiment, deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:

defining the null hypothesis (H0) that the relevance of a feature follows a binomial distribution with probability 0.5;

performing a hypothesis test on the local relevance and on the global relevance, wherein $\alpha$ is the significance level, features falling into the left rejection region are irrelevant features, and features falling into the right rejection region are relevant features;

dividing the feature set according to the two hypothesis tests into the relevant feature set, the irrelevant feature set, and the pending feature set.

In this embodiment, once feature selection has been performed, the relevant feature set replaces the original feature set in preparation for subsequent supervised tasks.
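Replacing the original feature set with the relevant feature set amounts to keeping only the selected columns of every sample. A minimal sketch, assuming samples are stored as rows and features are identified by column index (`keep_relevant` is a hypothetical helper):

```python
from typing import List, Sequence


def keep_relevant(data: List[List[float]],
                  relevant: Sequence[int]) -> List[List[float]]:
    """Project every sample onto the relevant feature indices, in sorted order."""
    cols = sorted(set(relevant))
    return [[row[m] for m in cols] for row in data]
```

The reduced data set can then be passed to whatever supervised model follows.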

The embodiment of the invention discloses a method and a device for selecting fully-relevant features based on the Shapley value and hypothesis testing. The method is suitable for feature sets used in supervised tasks. A feature selection model is designed to solve the fully-relevant problem: the model first uses a Shapley attribution algorithm to calculate the local importance of each feature, then uses randomized features to construct an adaptive threshold, and finally evaluates the relevance of each feature from its importance and the threshold. For the selection strategy, the embodiment designs a double hypothesis test: a local hypothesis test quickly eliminates irrelevant features, and a global hypothesis test reduces the risk of mistakenly deleting relevant features. The result is the set of all features relevant to the problem domain, which improves the interpretability of the feature set and the reliability of prediction.

In a second aspect, the embodiments of the present invention further provide a device for selecting fully-relevant features based on the Shapley value and hypothesis testing.

Referring to FIG. 2, FIG. 2 is a functional block diagram of a first embodiment of the device for selecting fully-relevant features based on the Shapley value and hypothesis testing according to an embodiment of the present invention.

In this embodiment, the device for selecting fully-relevant features based on the Shapley value and hypothesis testing includes:

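an evaluation module, configured to perform step 1: evaluating the relevance;

the input to step 1 is a data set consisting of N samples with M candidate features; the importance of the M candidate features is quantified by using the Shapley value to obtain the local importance $I_m$ and the global importance $GI_m$; an importance threshold $T$ is obtained from the global importance $GI^{*}$ of the randomized features and the adaptive coefficient c; and the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated;

a selection module, configured to perform step 2: selecting a strategy;

a relevant feature set, an irrelevant feature set, and a pending feature set are derived from the local relevance index $R_m$ and the global relevance index $GR_m$, and whether the irrelevant feature set is empty is detected;

if the irrelevant feature set is not empty, the irrelevant features are deleted and step 1 is executed;

if the irrelevant feature set is empty, whether the adaptive coefficient c is 1 is detected;

if the adaptive coefficient c is not 1, the adaptive coefficient c is increased by 0.1 and step 1 is executed;

if the adaptive coefficient c is 1, whether the pending feature set is empty is detected;

if the pending feature set is not empty, step 1 is executed;

if the pending feature set is empty, the execution of step 1 is stopped.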

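Further, in an embodiment, the evaluation module is further configured to:

determine an input data set $X = \{x^{(n)} \mid n = 1, \ldots, N\}$ with labels $y^{(n)}$;

attribute, using the Shapley attribution algorithm, the output of the classification/regression model on sample $x^{(n)}$ to the contribution of each candidate feature, added to a base value that is the mean of the model output;

if the task is a classification task, take as local importance the contribution of the mth feature of the nth sample to the labelled class, i.e. $l = y^{(n)}$;

if the task is a regression task, obtain the local importance from the single contribution value;

take as global importance the average of the local importance over all samples.

Further, in an embodiment of the device, the adaptive threshold is expressed as $T = c \cdot GI^{*}$, wherein $GI^{*}$ is the global importance of the random features and c is the adaptive coefficient.

Further, in an embodiment of the device, the local relevance $R_m$ is the number of samples on which the local importance of the mth feature is higher than the adaptive threshold, and the global relevance $GR_m$ is the number of iterations, out of at most MI, in which the global importance of the mth feature is above the threshold.

Further, in an embodiment, the selection module is further configured to:

define the null hypothesis (H0) that the relevance of a feature follows a binomial distribution with probability 0.5;

perform a hypothesis test on the local relevance and on the global relevance, wherein $\alpha$ is the significance level, features falling into the left rejection region are irrelevant features, and features falling into the right rejection region are relevant features;

divide the feature set according to the two hypothesis tests into the relevant feature set, the irrelevant feature set, and the pending feature set.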

The function of each module in the device corresponds to a step in the embodiment of the method for selecting fully-relevant features based on the Shapley value and hypothesis testing; the functions and implementation processes are not described again in detail here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a terminal device to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

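1. A method for selecting fully-relevant features based on the Shapley value and hypothesis testing, characterized in that the method comprises:

step 1: evaluating the relevance;

the input to step 1 is a data set consisting of N samples, denoted $X = \{x^{(n)} \mid n = 1, \ldots, N\}$, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, giving a total of M candidate features;

quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance $I_m$ and the global importance $GI_m$;

obtaining an importance threshold $T$ from the global importance $GI^{*}$ of the randomized features and the adaptive coefficient c;

evaluating the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;

step 2: selecting a strategy;

the input of step 2 is the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features;

deriving a relevant feature set, an irrelevant feature set, and a pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$;

detecting whether the irrelevant feature set is empty;

if the irrelevant feature set is not empty, deleting the irrelevant features and executing step 1;

if the irrelevant feature set is empty, detecting whether the adaptive coefficient c is 1;

if the adaptive coefficient c is not 1, increasing the adaptive coefficient c by 0.1 and executing step 1;

if the adaptive coefficient c is 1, detecting whether the pending feature set is empty;

if the pending feature set is not empty, executing step 1;

if the pending feature set is empty, stopping the execution of step 1.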

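2. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 1, characterized in that quantifying the importance of the M candidate features by using the Shapley value to obtain the local importance and the global importance $GI_m$ comprises:

determining an input data set $X = \{x^{(n)} \mid n = 1, \ldots, N\}$, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, its label is $y^{(n)}$, and the feature set consists of the M candidate features;

using the Shapley attribution algorithm, the output of the classification/regression model on sample $x^{(n)}$ is attributed to the contribution of each candidate feature, added to a base value that is the mean of the model output;

if the task is a classification task, the local importance is the contribution of the mth feature of the nth sample to the labelled class, i.e. $l = y^{(n)}$;

if the task is a regression task, the local importance is obtained from the single contribution value;

the global importance is the average of the local importance over all samples, i.e. $GI_m = \frac{1}{N} \sum_{n=1}^{N} I_m^{(n)}$, where $I_m^{(n)}$ denotes the local importance of the mth feature on the nth sample.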

3. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 2, characterized in that:

the adaptive threshold is expressed as $T = c \cdot GI^{*}$, wherein $GI^{*}$ is the global importance of the random features and c is the adaptive coefficient.

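4. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 3, characterized in that:

the local relevance $R_m$ is the number of samples on which the local importance of the mth feature is higher than the adaptive threshold;

the global relevance $GR_m$ is the number of iterations, out of at most MI, in which the global importance of the mth feature is above the threshold.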

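5. The method for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 4, characterized in that deriving the relevant feature set, the irrelevant feature set, and the pending feature set from the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features comprises:

defining the null hypothesis (H0) that the relevance of a feature follows a binomial distribution with probability 0.5;

performing a hypothesis test on the local relevance and on the global relevance, wherein $\alpha$ is the significance level, features falling into the left rejection region are irrelevant features, and features falling into the right rejection region are relevant features;

dividing the feature set according to the two hypothesis tests into the relevant feature set, the irrelevant feature set, and the pending feature set.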

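6. A device for selecting fully-relevant features based on the Shapley value and hypothesis testing, characterized in that the device comprises:

an evaluation module, configured to perform step 1: evaluating the relevance;

the input to step 1 is a data set consisting of N samples with M candidate features; the importance of the M candidate features is quantified by using the Shapley value to obtain the local importance $I_m$ and the global importance $GI_m$; an importance threshold $T$ is obtained from the global importance $GI^{*}$ of the randomized features and the adaptive coefficient c; and the local relevance index $R_m$ and the global relevance index $GR_m$ of the M candidate features are evaluated;

a selection module, configured to perform step 2: selecting a strategy;

a relevant feature set, an irrelevant feature set, and a pending feature set are derived from the local relevance index $R_m$ and the global relevance index $GR_m$, and whether the irrelevant feature set is empty is detected;

if the irrelevant feature set is not empty, the irrelevant features are deleted and step 1 is executed;

if the irrelevant feature set is empty, whether the adaptive coefficient c is 1 is detected;

if the adaptive coefficient c is not 1, the adaptive coefficient c is increased by 0.1 and step 1 is executed;

if the adaptive coefficient c is 1, whether the pending feature set is empty is detected;

if the pending feature set is not empty, step 1 is executed;

if the pending feature set is empty, the execution of step 1 is stopped.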

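7. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 6, characterized in that the evaluation module is further configured to:

determine an input data set $X = \{x^{(n)} \mid n = 1, \ldots, N\}$, wherein the feature vector of the nth sample is $x^{(n)} = (x_1, \ldots, x_M)$, its label is $y^{(n)}$, and the feature set consists of the M candidate features;

attribute, using the Shapley attribution algorithm, the output of the classification/regression model on sample $x^{(n)}$ to the contribution of each candidate feature, added to a base value that is the mean of the model output;

if the task is a classification task, take as local importance the contribution of the mth feature of the nth sample to the labelled class, i.e. $l = y^{(n)}$;

if the task is a regression task, obtain the local importance from the single contribution value;

take as global importance the average of the local importance over all samples.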

8. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 7, characterized in that the adaptive threshold is expressed as $T = c \cdot GI^{*}$, wherein $GI^{*}$ is the global importance of the random features and c is the adaptive coefficient.

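9. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 8, characterized in that the local relevance $R_m$ is the number of samples on which the local importance of the mth feature is higher than the adaptive threshold, and the global relevance $GR_m$ is the number of iterations, out of at most MI, in which the global importance of the mth feature is above the threshold.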

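10. The device for selecting fully-relevant features based on the Shapley value and hypothesis testing as claimed in claim 9, characterized in that the selection module is further configured to:

define the null hypothesis (H0) that the relevance of a feature follows a binomial distribution with probability 0.5;

perform a hypothesis test on the local relevance and on the global relevance, wherein $\alpha$ is the significance level, features falling into the left rejection region are irrelevant features, and features falling into the right rejection region are relevant features;

divide the feature set according to the two hypothesis tests into the relevant feature set, the irrelevant feature set, and the pending feature set.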