CN112802555B - Complementary differential expression gene selection method based on mvAUC - Google Patents


Info

Publication number: CN112802555B (application CN202110147526.2A; also published as CN112802555A)
Authority: CN (China)
Prior art keywords: feature, auc, positive, class, opms
Original language: Chinese (zh)
Inventors: 卫金茂 (Wei Jinmao), 苏月 (Su Yue), 杜科宇 (Du Keyu), 刘健 (Liu Jian)
Current and original assignee: Nankai University
Legal status: Expired - Fee Related
Application filed by Nankai University; priority to CN202110147526.2A.


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression


Abstract

The invention provides a feature selection method based on a multivariate AUC (mvAUC), which selects the most complementary gene subset from the differential expression data of cancer so as to maximize global classification performance. First, a new view of AUC computation is introduced, based on a feature's possible misclassification set; then, for a feature set, the common possible misclassification set of the set is determined and a new AUC is computed for each feature after combination; the difference between a feature's new AUC and its original AUC reflects the complementary effect that the other features in the set exert on that feature's classification ability after combination. Finally, the mvAUC is computed from the new AUCs after feature combination, and the candidate feature that maximizes the current mvAUC is incrementally added to the selected feature subset. The method of the invention has the advantage that the global class-discrimination ability of the selected feature subset can be evaluated directly, without computing redundant information pairwise between a candidate feature and each selected feature.

Description

Complementary differential expression gene selection method based on mvAUC
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a method for selecting complementary differential expression genes based on mvAUC.
Background
In the biomedical field, with the rapid development and maturation of next-generation sequencing (NGS) technology, sequencing costs have dropped sharply, data such as cancer gene expression profiles have accumulated rapidly, and analyses and applications based on NGS big data have grown quickly. Gene expression datasets typically contain thousands or even hundreds of thousands of genes but only a relatively small number of samples, from hundreds to thousands. Of these thousands of genes, only a small fraction are involved in the development of cancer, and the presence of large numbers of irrelevant, redundant genes can severely distort analysis of the data and introduce bias. Identifying the genes that contribute most to cancer classification is therefore increasingly important. This identification process is called gene selection; its key point is to establish an evaluation criterion for selecting the most discriminative gene subset, so as to reduce the spatial dimension, improve classification accuracy, and find potential target genes.
In machine learning and data mining, gene selection is known as feature selection; the screening of genes can therefore be realized with feature selection techniques. Many feature selection methods pick the subset with the strongest class-discriminating power by measuring the information shared between features and the class. Methods such as FAST and Relief evaluate the relevance of each candidate feature to the class and add highly relevant features to the selected subset. However, such methods do not consider redundancy among features, so the selected features may be highly correlated, and the joint classification performance of several individually strong features is not necessarily better than that of a combination of individually weak ones. In response to this problem, much research has focused on reducing inter-feature redundancy: methods such as ARCO, mRMR and CIFE evaluate feature redundancy by measuring correlations between features, and add features with high class correlation and low mutual correlation to the selected subset. However, the features that jointly provide the most information for classification, and thus maximize global class discrimination, are not necessarily uncorrelated; they are more likely to be complementary. Moreover, whether they measure class-relevant information or redundant information, these methods do not take into account, when adding a new feature, the information the already selected feature subset retains for identifying the target class. Two features with the same relevance to a class may have completely different effects on the selected feature subset.
In addition, for computational feasibility, existing methods compute class-relevant information and inter-feature correlations in a pairwise manner. This can overestimate both the class-recognition ability of feature pairs and the redundancy between features, ignoring how the selected feature subset cooperates as a whole and affects global classification performance. Considering these problems, the invention provides a method for selecting complementary differentially expressed genes based on mvAUC.
Disclosure of Invention
The invention provides a complementary differential expression gene selection method based on mvAUC, which selects the most complementary gene subset from the differential expression data of cancer so as to maximize global classification performance. This method has the advantage that the global discrimination ability of the selected feature subset can be evaluated directly, without computing redundant information pairwise between a candidate feature and each selected feature.
In order to achieve the purpose, the invention provides the following scheme:
a method for selecting complementary differentially expressed genes based on mvAUC comprises the following steps:
calculating the ordered possible misclassification set OPMS of each gene characteristic;
for a feature set, determining a possible misclassification set PMS common to the feature set and calculating a new AUC of each feature based on the possible misclassification set;
the mvAUC is calculated based on the new AUC after the gene feature combination, and the candidate feature which maximizes the current mvAUC is added into the selected feature subset in an incremental manner.
Preferably, the AUC is defined as the area under the ROC curve, with the formula:

$AUC = \int_0^1 P(\theta)\, dF(\theta)$

where θ is a given classification threshold, F(θ) is the false positive rate, i.e. the proportion of negative instances incorrectly classified as positive, and P(θ) is the true positive rate, i.e. the proportion of positive instances correctly classified as positive;
the AUC value represents the sample information that the feature can classify correctly; the larger the AUC, the more the feature is related to the target class and the stronger the feature's classification ability;
AAC is the area above the ROC curve, with the formula:

$AAC = \int_0^1 F(\theta)\, dP(\theta)$

where θ, F(θ) and P(θ) are as defined above;
the smaller the AAC value, the stronger the class-discrimination ability of the feature; AAC represents the sample information that the feature may misclassify, and the set formed by the feature's possibly misclassified samples is defined as the possible misclassification set PMS.
Preferably, the ROC curve is a two-dimensional graph with a false positive rate F (θ) as an x-axis and a true positive rate P (θ) as a y-axis, and is used to represent the classification capability of a feature.
Preferably, the steps of calculating the new AUC based on the possible misclassification set are:
calculating the ordered possible misclassification set: let $X=\{x_1,x_2,\dots,x_n\}$ be a data set comprising n instances, each instance $x_i$ represented by the m features $F=\{f_1,f_2,\dots,f_m\}$, where $x_{ij}$ is the value of instance $x_i$ on feature $f_j$; $n_0$ and $n_1$ respectively denote the numbers of positive and negative class instances in the data set, with $n_0+n_1=n$; for a feature f, all samples in the data set X are sorted in ascending order of their values on f, obtaining the ordered sample sequence $S_f=(x_{(1)},x_{(2)},\dots,x_{(n)})$;
it is stipulated that the first sample at the left end of the sequence belongs to the negative class; traversing from the left and right ends of the sequence, the first positive class sample found from the left end and the first negative class sample found from the right end are denoted $x_{(p)}$ and $x_{(q)}$ respectively; the sample subsequence from $x_{(p)}$ to $x_{(q)}$ is defined as the ordered possible misclassification set OPMS, and the set formed by all samples in this subsequence is the possible misclassification set PMS;
taking the value of a positive instance in the OPMS as the threshold θ, all negative samples and positive samples to the right of that instance are false positive samples and true positive samples respectively; taking each positive sample in the OPMS in turn as a threshold from right to left, the AAC is calculated as:

$\widetilde{AAC} = \frac{1}{n_0 n_1} \sum_{k=1}^{n'} \sum_{l=1}^{k} FP_l$

where the tilde indicates that the calculation is performed on the ordered possible misclassification set OPMS, k refers to the k-th positive instance counted from the right end of the OPMS, $n_0$ and $n_1$ respectively represent the numbers of positive and negative class instances in the data set, n' is the number of positive class instances in the OPMS, and $FP_l$ is the number of false positive samples between the (l-1)-th and the l-th positive sample from the right end;
the final OPMS-based AUC expression is:

AUC = 1 - AAC

if the OPMS is empty, all positive class instances rank above all negative class instances; in that case AAC = 0 and AUC = 1, and all instances can be correctly classified into the two classes.
Preferably, the specific steps of determining the common possible misclassification set and calculating the new AUC after each feature combination are as follows:
calculating the common possible misclassification set of the combined features; calculating the new OPMS of each combined single feature; calculating the mvAUC of the combined features.
Preferably, for a feature set F, the set formed by the samples that no feature in F can classify correctly is the common possible misclassification set after feature combination, expressed as:

$M_F = \bigcap_{f_j \in F} M_{f_j}$

where F is the feature set, $M_F$ is the common possible misclassification set of F, $f_j$ is a feature in the feature set F, and $M_{f_j}$ is the original possible misclassification set PMS of feature $f_j$.
Preferably, the new OPMS of each feature $f_j$ in the feature set F after combination with the other features consists of all instances in $M_F$; arranging the instances of $M_F$ in ascending order of their values on $f_j$ yields the new OPMS of feature $f_j$.
Preferably, each feature's new combined AUC value is calculated on its own new OPMS, and the mean of the new AUCs of all features in the combination is the mvAUC of the feature set F.
The invention has the beneficial effects that:
(1) The original AUC can only evaluate the classification ability of a single feature, whereas the mvAUC can evaluate the joint classification performance of multiple features. Compared with various AUC-based feature selection methods such as FAST, ARCO and AVC, the mvAUC can not only measure the correlation between the target class and a feature combination, but also evaluate the complementarity between the selected feature set and a candidate feature. The mvAUC treats multiple features as a whole and directly and effectively evaluates the global complementarity and classification performance of the feature set.
(2) The mvAUC avoids the overestimation of class correlation and feature redundancy caused by pairwise computation, making the evaluation more accurate. In addition, the mvAUC simultaneously measures the new classification information brought by a candidate feature and the class-relevant information retained by the already selected feature subset, and the use of set operations makes the method easier to implement and compute.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the ROC curve of the present invention, wherein (a) is a schematic diagram of AUC integration and (b) is a schematic diagram of AAC integration;
FIG. 3 is a schematic diagram showing a comparison of ROC curves before and after gene combination associated with prostate cancer in examples of the present invention;
FIG. 4 is a schematic diagram showing a comparison of ROC curves before and after combination of the genes with the largest mvAUC among the prostate-cancer-related genes published in COSMIC, in the examples of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a method for selecting complementary differential expression genes based on mvAUC (as shown in figure 1), which specifically comprises the following steps:
the invention firstly defines the concept of the AAC area above the ROC curve and the possible misclassification set from the angle of the ROC curve integration, and specifically comprises the following steps:
the ROC curve is a two-dimensional graph with a false positive rate (false positive rate) as an x-axis and a true positive rate (true positive rate) as a y-axis, and can be used to represent the classification capability of a feature, as shown in fig. 2 (a). The closer the ROC curve is to the upper left corner, the stronger the discrimination of the features to the classes. Given a classification threshold θ, the true positive rate P (θ) indicates how many positive instances are correctly classified as positive classes, and the false positive rate F (θ) indicates how many negative instances are incorrectly classified as positive classes. AUC (area Under the ROC curve) is defined as the area Under the ROC curve, and is given by the formula:
Figure BDA0002930800330000071
the AUC value represents sample information that the feature can be classified correctly, and a larger AUC means that the feature is more relevant to the target class, and the classification capability of the feature is stronger.
As shown in FIG. 2(b), AAC (Area Above the ROC Curve) is defined as the area above the ROC curve, with the formula:

$AAC = \int_0^1 F(\theta)\, dP(\theta)$    (2)
the smaller the AAC value, the more discriminating the feature is. AAC represents sample information of possible misclassifications of features, where a set of possible misclassification samples of features is defined as a Possible Misclassification Set (PMS).
A new AUC calculation method based on the possible misclassification set is provided on the basis of the related definition, and the specific steps are as follows:
step one, calculating an Ordered Possible Misclassification Set (OPMS).
Let $X=\{x_1,x_2,\dots,x_n\}$ represent a data set comprising n instances, each instance $x_i$ represented by the m features $F=\{f_1,f_2,\dots,f_m\}$; $x_{ij}$ is the value of instance $x_i$ on feature $f_j$. $n_0$ and $n_1$ respectively denote the numbers of positive and negative class instances in the data set, with $n_0+n_1=n$. For a feature f, all samples in the data set X are sorted in ascending order of their values on f, giving the ordered sample sequence $S_f=(x_{(1)},x_{(2)},\dots,x_{(n)})$.
Stipulating that the first sample at the left end of the sequence belongs to the negative class, the sequence is traversed from its left and right ends; the first positive class sample found from the left end and the first negative class sample found from the right end are denoted $x_{(p)}$ and $x_{(q)}$ respectively. The sample subsequence from $x_{(p)}$ to $x_{(q)}$ is defined as the ordered possible misclassification set, and the set of all samples in this subsequence is the possible misclassification set.
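The OPMS construction described above can be sketched in Python (a minimal sketch, assuming labels coded 1 for the positive class and 0 for the negative class, with low feature values on the negative side, matching the left-end convention; all names are mine):

```python
def ordered_pms(values, labels):
    """Ordered possible misclassification set (OPMS) of one feature.

    Sort samples ascending by feature value, then keep only the span from
    the first positive sample (found from the left) to the first negative
    sample (found from the right); samples outside that span are always
    classified correctly by this feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    lab = [labels[i] for i in order]
    left = next((k for k, c in enumerate(lab) if c == 1), len(lab))
    right = next((k for k in range(len(lab) - 1, -1, -1) if lab[k] == 0), -1)
    if left > right:  # the feature separates the classes perfectly
        return []
    return [(values[order[k]], lab[k]) for k in range(left, right + 1)]
```

For example, with values [1, 2, 3, 4, 5, 6] and labels [0, 0, 1, 0, 1, 1], only the samples with values 3 and 4 can still be misclassified, so the OPMS is [(3, 1), (4, 0)].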
And step two, calculating the AUC based on the OPMS.
Taking the value of a positive instance in the OPMS as the threshold θ, all negative samples and positive samples on the right side of the instance are false positive and true positive, respectively. According to the integral definition of AAC, each positive sample in OPMS is sequentially regarded as a threshold from right to left, and then the calculation formula of AAC is as follows:
$\widetilde{AAC} = \frac{1}{n_0 n_1} \sum_{k=1}^{n'} \sum_{l=1}^{k} FP_l$    (3)

where the tilde indicates that the calculation is performed on the ordered possible misclassification set OPMS; n' is the number of positive class instances in the OPMS, and $FP_l$ is the number of false positive samples between the (l-1)-th and the l-th positive sample from the right end. This AAC calculation is in fact a numerical evaluation of equation (2), i.e. an integration of the false positive rate. According to the definitions of AUC and AAC, the final OPMS-based AUC can be calculated as:
AUC=1-AAC (4)
if OPMS is empty, all positive class instances will be ranked higher than negative class instances. When AAC is 0 and AUC is 1, it means that all instances can be correctly classified into two classes.
And finally, based on the public possible misclassification set of the feature combination and a new AUC calculation method, providing the mvAUC of the combined features, and measuring the global complementarity and class discrimination capability of the combined features on the whole. The specific calculation steps are as follows:
step one, calculating a common possible misclassification set of the combination characteristics.
For a feature set F, a set composed of samples in which all features in F cannot be classified correctly is defined as a common possible misclassification set after feature combination:
$M_F = \bigcap_{f_j \in F} M_{f_j}$    (5)

where $f_j$ denotes a feature in the feature set F and $M_{f_j}$ is the original possible misclassification set of $f_j$.
And step two, calculating the new OPMS of each combined single feature.
The new OPMS of each feature $f_j$ in F after combination with the other features consists of all instances in $M_F$; arranging the instances of $M_F$ in ascending order of their values on $f_j$ yields the new OPMS of feature $f_j$.
And step three, calculating the mvAUC of the combined features.
According to equation (3), a new AUC value can be calculated for each combined feature on its own new OPMS. The mean of the new AUCs of all features in the set is the mvAUC of the feature set F. Since the common possible misclassification set is contained in each feature's original PMS, each feature's new AUC must be greater than or equal to its original AUC. The difference between the two represents the contribution of the other features in F to enhancing that feature's classification ability; the larger the difference, the stronger the complementarity. The larger the mvAUC of a feature set, the stronger the complementarity and global class-discrimination performance of the features in the set.
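The three steps above can be combined into a single sketch (all names are my own; X maps each feature name to its value list and y holds 1/0 labels):

```python
def mvauc(X, y, features):
    """mvAUC of a feature set: intersect the per-feature PMSs into the
    common set M_F, re-sort M_F by each feature to get its new OPMS, and
    average the resulting per-feature AUCs."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos

    def pms_indices(f):
        # possible misclassification set of one feature, as sample indices
        order = sorted(range(n), key=lambda i: X[f][i])
        lab = [y[i] for i in order]
        left = next((k for k, c in enumerate(lab) if c == 1), n)
        right = next((k for k in range(n - 1, -1, -1) if lab[k] == 0), -1)
        return set(order[left:right + 1])

    m_f = set.intersection(*(pms_indices(f) for f in features))

    def new_auc(f):
        seq = sorted(m_f, key=lambda i: X[f][i])  # the new OPMS of f
        bad = negs_seen = 0
        for i in reversed(seq):
            if y[i] == 0:
                negs_seen += 1
            else:
                bad += negs_seen
        return 1.0 - bad / (n_pos * n_neg)

    return sum(new_auc(f) for f in features) / len(features)
```

On the toy data X = {'f1': [1, 2, 3, 4, 5, 6], 'f2': [6, 5, 4, 3, 2, 1]} with y = [0, 0, 1, 0, 1, 1], feature f1 alone has AUC 8/9, but the common misclassification set of {f1, f2} shrinks to the two ambiguous samples and the combined mvAUC rises to 17/18, illustrating the complementarity effect.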
Based on the OPMS-based AUC calculation, as shown in Table 1, Algorithm 1 takes the data set and feature set as inputs and computes and outputs the possible misclassification set and AUC of each feature in the feature set. The algorithm first sorts all samples in ascending order of their values on each feature $f_j$ to determine the ordered possible misclassification sequence OPMS(j); by the definition of PMS, PMS(j) is the set of samples in OPMS(j). The initial AUC value of each feature $f_j$ is then obtained from the AAC calculation formula.
TABLE 1
[Table 1: pseudocode of Algorithm 1, rendered as an image in the original document]
The mvAUC-based gene selection then proceeds as shown in Table 2: taking as inputs the data set, the feature set, and the possible misclassification set PMS and AUC of each feature produced by Algorithm 1, Algorithm 2 computes the mvAUC of combined gene features and selects the gene subset with the strongest global complementarity. The larger the mvAUC of a feature combination, the stronger the global complementarity of the feature subset. The core of the algorithm is to incrementally add to the selected feature subset the candidate feature that maximizes the mvAUC of the currently selected subset.
In the initialization phase of Algorithm 2, the feature $f^*$ with the maximum AUC is added to $F^*$ as the initial selected feature subset and removed from the candidate feature set F.
An iterative procedure is then adopted: each iteration adds one candidate feature to the selected feature subset, and the algorithm stops when adding a new candidate no longer changes the mvAUC, or when the selected feature subset reaches a specified size. In each iteration, the common possible misclassification set $M_{F^*}$ of all features in the selected subset is computed by intersection operations. Then, for each candidate feature, the mvAUC after combining it with the selected subset is computed in three steps: first, for candidate $f_j$, the common possible misclassification set $M_{F^* \cup \{f_j\}}$ of the candidate and the selected subset is computed; next, for each feature $f_k$ in $F^* \cup \{f_j\}$, the samples in $M_{F^* \cup \{f_j\}}$ are sorted by their values on $f_k$ to obtain the combined OPMS of $f_k$, and its AUC is calculated; finally, the mean of the new AUCs of all features in $F^* \cup \{f_j\}$ is the mvAUC of the combination. The candidate feature $f^*$ corresponding to the maximum mvAUC is then added to the selected feature subset $F^*$, and the above steps are repeated while the stopping condition is not met.
TABLE 2
[Table 2: pseudocode of Algorithm 2, rendered as an image in the original document]
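Putting the pieces together, the incremental selection of Algorithm 2 can be sketched end to end (a minimal sketch under my own naming, not the patented implementation; the stopping rule interprets "mvAUC is not changed" as "does not increase", and argmax ties are broken arbitrarily):

```python
def select_genes(X, y, max_size):
    """Greedy forward selection by mvAUC.  X maps feature name -> value
    list, y holds 1/0 labels.  Starts from the single feature with the
    largest AUC, then repeatedly adds the candidate that maximises the
    combined mvAUC."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos

    def pms(f):
        # possible misclassification set of one feature, as sample indices
        order = sorted(range(n), key=lambda i: X[f][i])
        lab = [y[i] for i in order]
        left = next((k for k, c in enumerate(lab) if c == 1), n)
        right = next((k for k in range(n - 1, -1, -1) if lab[k] == 0), -1)
        return set(order[left:right + 1])

    def auc_on(subset, f):
        # OPMS-based AUC of feature f, restricted to the given sample set
        seq = sorted(subset, key=lambda i: X[f][i])
        bad = negs_seen = 0
        for i in reversed(seq):
            if y[i] == 0:
                negs_seen += 1
            else:
                bad += negs_seen
        return 1.0 - bad / (n_pos * n_neg)

    pms_of = {f: pms(f) for f in X}
    best = max(X, key=lambda f: auc_on(pms_of[f], f))
    selected, candidates = [best], set(X) - {best}
    current = auc_on(pms_of[best], best)
    while candidates and len(selected) < max_size:
        m_sel = set.intersection(*(pms_of[f] for f in selected))

        def combined_mvauc(fj):
            m = m_sel & pms_of[fj]
            feats = selected + [fj]
            return sum(auc_on(m, f) for f in feats) / len(feats)

        f_star = max(candidates, key=combined_mvauc)
        new = combined_mvauc(f_star)
        if new <= current:          # mvAUC no longer improves: stop
            break
        selected.append(f_star)
        candidates.remove(f_star)
        current = new
    return selected
```

On a toy data set with one discriminative feature and one complementary feature, the sketch first picks the higher-AUC feature and then adds the candidate that lifts the combined mvAUC.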
FIG. 3 shows the results of experiments performed with the present invention on the prostate cancer data set in the TCGA database. The figure shows the ROC curves and AUC values, before and after combination, of five of the seven genes selected by the method of the present invention. It can be seen that only two of the original single genes have an AUC above 0.9, yet the combined mvAUC value reaches 0.9989. This indicates that a combined gene set can have high class-discrimination ability even when it contains weakly discriminative genes with low class correlation; the selected gene set exhibits strong complementarity and disease-identification capability. FIG. 4 compares the ROC curves and AUC values, before and after combination, of five genes among the seven genes with the maximum mvAUC from the prostate-tumor-related genes published in COSMIC; it can be seen that the performance of the gene combination obtained by the present method is clearly superior to that of this group of genes.
The original AUC can only evaluate the classification capability of a single feature, while the mvAUC can be used to evaluate the joint classification performance of multiple features. Compared with various AUC-based feature selection methods, such as FAST, ARCO, AVC, mvAUC can not only measure the correlation between the target class and the feature combination, but also evaluate the complementarity between a selected set of features and a candidate feature. The mvAUC considers a plurality of features as a whole and directly and effectively evaluates the global complementation and classification performance of the feature set.
Compared with classical and recently proposed feature selection methods such as mRMR, ReliefF and MRI, as well as PAM, a method oriented directly at gene selection, the mvAUC avoids the overestimation of class correlation and feature redundancy caused by pairwise computation, making the evaluation more accurate. Furthermore, the mvAUC simultaneously measures the new classification information brought by a candidate feature and the class-relevant information retained by the already selected feature subset, and the use of set operations makes the method easier to implement and compute.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (3)

1. A method for selecting complementary differentially expressed genes based on mvAUC is characterized by comprising the following steps:
calculating the ordered possible misclassification set OPMS of each gene characteristic;
for a feature set, determining a possible misclassification set PMS common to the feature set and calculating a new AUC of each feature based on the possible misclassification set;
calculating the mvAUC based on the new AUC after the gene feature combination, and incrementally selecting candidate features which maximize the current mvAUC to be added into the selected feature subset;
the step of calculating the new AUC of each feature based on the possible misclassification set is as follows:
calculating an ordered set of possible misclassifications: let

X = {x_1, x_2, …, x_n}

where X represents a data set comprising n instances; each instance x_i is represented by the M features of the feature set

F = {f_1, f_2, …, f_M}

and x_ij refers to the value of instance x_i on feature f_j; n_0 and n_1 respectively represent the number of positive-class and negative-class instances in the data set, and n_0 + n_1 = n; for a feature f, all samples in the data set X are sorted in ascending order according to their values on the feature f, obtaining an ordered sample sequence

S = (s_1, s_2, …, s_n)

it is stipulated that samples at the left end of the sequence belong to the negative class; traversing from the left end and the right end of the sequence respectively, the first positive-class sample from the left end and the first negative-class sample from the right end are marked as s_p and s_q respectively; the sample sequence in the interval from s_p to s_q is then defined as the ordered possible misclassification set OPMS, and the set formed by all samples in this sequence is the possible misclassification set PMS;
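The OPMS construction just described can be illustrated with a minimal Python sketch (names are our own; labels use 0 for the negative class and 1 for the positive class):

```python
def ordered_pms(values, labels):
    """Sort samples by feature value (ascending), then keep the span from the
    first positive-class sample found from the left end to the first
    negative-class sample found from the right end.  Convention: the left end
    of the sequence is the negative class (label 0).  Returns the OPMS as a
    list of (value, label) pairs; the set of its samples is the PMS."""
    seq = sorted(zip(values, labels), key=lambda p: p[0])
    left = next((i for i, (_, y) in enumerate(seq) if y == 1), None)
    if left is None:                      # no positive instance at all
        return []
    right = next((i for i in range(len(seq) - 1, -1, -1) if seq[i][1] == 0),
                 None)
    if right is None or right < left:     # classes already perfectly separated
        return []
    return seq[left:right + 1]

# A feature that separates the classes perfectly yields an empty OPMS:
print(ordered_pms([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))   # []
# One overlapping positive/negative pair produces a non-empty OPMS:
print(ordered_pms([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))   # [(0.4, 1), (0.6, 0)]
```

Only the samples inside this span can be misordered by any threshold on the feature, which is why the subsequent AAC computation is restricted to the OPMS.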
regarding the value of a certain positive-class instance in the OPMS as a threshold θ, all negative-class samples and positive-class samples on the right side of that instance are false positive samples and true positive samples respectively; taking each positive-class sample in the OPMS from right to left as the threshold in turn, the calculation formula of the AAC is:

AAC = (1 / (n_0 · n_1)) · Σ_{k=1}^{n'} Σ_{l=1}^{k} FP_l

where the summation is performed on the ordered possible misclassification set OPMS; k refers to the k-th positive-class instance counted from the right end of the OPMS; n_0 and n_1 respectively represent the number of positive-class and negative-class instances in the data set; n' is the number of positive-class instances in the OPMS; and FP_l is the number of false positive samples between the (l-1)-th and the l-th positive-class samples counted from the right end;
the final OPMS-based AUC expression is:
AUC=1-AAC
if the OPMS is empty, all positive-class instances are ranked higher than all negative-class instances; in this case AAC = 0 and AUC = 1, and all instances can be correctly classified into the two classes;
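The AUC = 1 − AAC computation on the OPMS can be sketched as follows (illustrative names; `opms` is the ordered value/label sequence defined in claim 1, with label 0 for the negative class and 1 for the positive class):

```python
def opms_auc(opms, n_pos, n_neg):
    """AUC = 1 - AAC computed on an ordered possible misclassification set:
    AAC = (1/(n_pos*n_neg)) * sum, over positive instances taken right to
    left, of the number of negative samples lying to their right."""
    fp_seen = 0        # negatives encountered so far, scanning from the right
    misordered = 0     # accumulated sum over k of sum_{l<=k} FP_l
    for _, label in reversed(opms):
        if label == 0:             # negative-class sample
            fp_seen += 1
        else:                      # positive-class sample taken as threshold
            misordered += fp_seen
    return 1.0 - misordered / (n_pos * n_neg)

# Empty OPMS -> AAC = 0, AUC = 1 (perfect separation):
print(opms_auc([], 2, 2))                           # 1.0
# One negative ranked above one positive, with 2 positives and 2 negatives:
print(opms_auc([(0.4, 1), (0.6, 0)], 2, 2))         # 0.75
```

Scanning right to left makes each positive instance accumulate exactly the false positives to its right, matching the double sum in the AAC formula without materializing the FP_l counts.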
the specific steps of determining the common possible misclassification set PMS and calculating the new AUC after each feature combination are as follows:
calculating a common possible misclassification set of the combined features; calculating a new OPMS of the combined single features; calculating mvAUC of the combined features;
for a feature set F, the set formed by the samples that none of the features in F can classify correctly is the common possible misclassification set after feature combination, expressed as:

M_F = ∩_{f_j ∈ F} M_{f_j}

where F is the feature set, M_F is the common possible misclassification set of the feature set F, f_j is a feature in the feature set F, and M_{f_j} refers to the original possible misclassification set PMS of the feature f_j;

for each feature f_j in the feature set F, the new OPMS after combination with the other features is composed of all instances in M_F: arranging the instances of M_F in ascending order according to their values on the feature f_j yields the new OPMS of the feature f_j;

the new AUC value of each combined feature is calculated on its new OPMS, and the average of the new AUC values of all the features in the feature combination is the mvAUC of the feature set F.
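The combination step can be sketched end to end in Python (illustrative names, not the patent's reference implementation; `cols` is a list of feature columns, `y` the labels with 0 = negative class, 1 = positive class):

```python
def pms_indices(col, y):
    """Indices of the samples a single feature may misclassify (its PMS):
    after sorting by feature value (ascending), the span from the first
    positive-class sample from the left to the first negative-class sample
    from the right."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    pos = [k for k, i in enumerate(order) if y[i] == 1]
    neg = [k for k, i in enumerate(order) if y[i] == 0]
    if not pos or not neg or neg[-1] < pos[0]:
        return set()
    return set(order[pos[0]:neg[-1] + 1])

def auc_on(indices, col, y, n_pos, n_neg):
    """New AUC of a feature on a given instance subset: sort the subset by
    the feature value, count misordered (positive, negative) pairs, and
    return AUC = 1 - AAC."""
    seq = sorted(indices, key=lambda i: col[i])
    fp_seen = mis = 0
    for i in reversed(seq):
        if y[i] == 0:
            fp_seen += 1
        else:
            mis += fp_seen
    return 1.0 - mis / (n_pos * n_neg)

def mv_auc(cols, y):
    """mvAUC of a feature set: intersect the per-feature PMSs to obtain the
    common set M_F, re-evaluate each feature's AUC on M_F, and average."""
    n_pos, n_neg = sum(y), len(y) - sum(y)
    m_f = set.intersection(*(pms_indices(c, y) for c in cols))
    return sum(auc_on(m_f, c, y, n_pos, n_neg) for c in cols) / len(cols)

# Two individually imperfect but complementary features reach mvAUC = 1.0:
y = [0, 0, 1, 1]
print(mv_auc([[0.1, 0.6, 0.4, 0.9]], y))                     # 0.75
print(mv_auc([[0.1, 0.6, 0.4, 0.9], [0.5, 0.2, 0.3, 0.8]], y))  # 1.0
```

The example shows the intended complementarity effect: each feature misclassifies a different sample, so their common set M_F shrinks and both re-evaluated AUCs rise.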
2. The method for selecting complementary differentially expressed genes according to claim 1, wherein the AUC is defined as the area under the ROC curve, with the formula:

AUC = ∫₀¹ P(θ) dF(θ)

where θ is a given classification threshold, F(θ) represents the proportion of negative-class instances incorrectly classified into the positive class, and P(θ) represents the proportion of positive-class instances correctly classified into the positive class;

the AUC value represents the sample information that the feature can classify correctly; the larger the AUC, the more related the feature is to the target class and the stronger its classification ability;

the AAC is the area above the ROC curve, with the formula:

AAC = ∫₀¹ F(θ) dP(θ)

where θ, F(θ) and P(θ) are defined as above;

the smaller the AAC value, the stronger the class discrimination ability of the feature; the AAC represents the sample information that the feature may misclassify, and the set formed by the samples a feature may misclassify is defined as the possible misclassification set PMS.
3. The method for selecting complementary differentially expressed genes according to claim 2, wherein the ROC curve is a two-dimensional graph having a false positive rate F (θ) as an x-axis and a true positive rate P (θ) as a y-axis, and represents the classification ability of a feature.
CN202110147526.2A 2021-02-03 2021-02-03 Complementary differential expression gene selection method based on mvAUC Expired - Fee Related CN112802555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147526.2A CN112802555B (en) 2021-02-03 2021-02-03 Complementary differential expression gene selection method based on mvAUC


Publications (2)

Publication Number Publication Date
CN112802555A CN112802555A (en) 2021-05-14
CN112802555B true CN112802555B (en) 2022-04-19

Family

ID=75813880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110147526.2A Expired - Fee Related CN112802555B (en) 2021-02-03 2021-02-03 Complementary differential expression gene selection method based on mvAUC

Country Status (1)

Country Link
CN (1) CN112802555B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105740653A (en) * 2016-01-27 2016-07-06 北京工业大学 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis
CN105745659A (en) * 2013-09-16 2016-07-06 佰欧迪塞克斯公司 Classifier generation method using combination of mini-classifiers with regularization and uses thereof
CN105938523A (en) * 2016-03-31 2016-09-14 陕西师范大学 Feature selection method and application based on feature identification degree and independence
CN110379521A (en) * 2019-06-24 2019-10-25 南京理工大学 Medical data collection feature selection approach based on information theory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156542B (en) * 2015-03-27 2018-09-14 深圳华大基因科技有限公司 The method that the immunity difference of the individual two class states of analysis, auxiliary determine individual state
CN110444248B (en) * 2019-07-22 2021-09-24 山东大学 Cancer biomolecule marker screening method and system based on network topology parameters


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity; Lei Sun et al.; BMC Bioinformatics; 2017; full text *
Feature selection method based on AUC and inter-feature complementarity; Sun Lei; Chinese Master's Theses Full-text Database, Basic Sciences; 2018-04-15; full text *
Multi-task feature learning algorithm based on preserving classification information; Wang Jun et al.; Journal of Computer Research and Development; 2017; Vol. 54, No. 3; full text *

Also Published As

Publication number Publication date
CN112802555A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
Lustgarten et al. Measuring stability of feature selection in biomedical datasets
Bhanot et al. A robust meta‐classification strategy for cancer detection from MS data
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
CN109886284B (en) Fraud detection method and system based on hierarchical clustering
Pont-Tuset et al. Supervised assessment of segmentation hierarchies
Dai et al. Feature selection via max-independent ratio and min-redundant ratio based on adaptive weighted kernel density estimation
Kotanchek et al. Symbolic regression via genetic programming as a discovery engine: Insights on outliers and prototypes
Cui et al. MMCO-Clus–an evolutionary co-clustering algorithm for gene selection
López-Oriona et al. Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences
CN112802555B (en) Complementary differential expression gene selection method based on mvAUC
Suresh et al. Data clustering using multi-objective differential evolution algorithms
Yun et al. Experimental comparison of feature subset selection methods
Yun et al. An experimental study on feature subset selection methods
TWI399661B (en) A system for analyzing and screening disease related genes using microarray database
Cai et al. Fuzzy criteria in multi-objective feature selection for unsupervised learning
Liang et al. A sequential three-way classification model based on risk preference and decision correction
Vukicevic et al. Internal evaluation measures as proxies for external indices in clustering gene expression data
Breimann et al. AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales
AlSaif Large scale data mining for banking credit risk prediction
CN111316366A (en) Method for simultaneous multivariate feature selection, feature generation and sample clustering
Alexandre et al. Integrating Statistical Significance and Discriminative Power in Pattern Discovery
Laborda et al. Feature Selection in Credit Scoring Model. Mathematics 2021, 9, 746
Chlis et al. Extracting reliable gene expression signatures through stable bootstrap validation
Wu et al. Evaluation of error-sensitive attributes
Hatami et al. Diverse accurate feature selection for microarray cancer diagnosis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220419