CN112802555B - Complementary differential expression gene selection method based on mvAUC - Google Patents


Info

Publication number: CN112802555B (application CN202110147526.2A; also published as CN112802555A)
Authority: CN (China)
Prior art keywords: feature, auc, positive, class, opms
Original language: Chinese (zh)
Inventors: 卫金茂 (Wei Jinmao), 苏月 (Su Yue), 杜科宇 (Du Keyu), 刘健 (Liu Jian)
Current and original assignee: Nankai University
Legal status: Expired - Fee Related
Application filed by Nankai University; priority to CN202110147526.2A.


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression


Abstract

The invention provides a feature selection method based on a multivariate AUC (mvAUC), which selects the most complementary gene subset from the differential expression data of cancer so as to maximize global classification performance. First, a new view of AUC computation is introduced, based on a feature's possible misclassification set; then, for a feature set, the common possible misclassification set of the set is determined and a new AUC is computed for each feature after combination; the difference between a feature's new AUC and its original AUC reflects the complementary effect that the other features in the set exert on that feature's classification ability after combination. Finally, the mvAUC is computed from the new AUCs after feature combination, and the candidate feature that maximizes the current mvAUC is incrementally added to the selected feature subset. The method of the invention has the advantage that the global class-discrimination ability of the selected feature subset can be evaluated directly, without computing redundant information pairwise between a candidate feature and each selected feature.

Description

Complementary differential expression gene selection method based on mvAUC
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a method for selecting complementary differential expression genes based on mvAUC.
Background
In the biomedical field, with the rapid development and maturation of next-generation sequencing (NGS) technology, sequencing costs have dropped sharply, data such as cancer gene expression profiles have accumulated rapidly, and analyses and applications based on NGS big data have grown quickly. Gene expression datasets typically contain thousands or even hundreds of thousands of genes but only a relatively small number of samples, from hundreds to thousands. Of these thousands of genes, only a small fraction are involved in the development of cancer, and the presence of large numbers of irrelevant, redundant genes can severely distort analysis of the data and introduce bias. Identifying the genes that contribute most to cancer classification is therefore increasingly important. This identification process is called gene selection; its key point is to establish an evaluation criterion for selecting the most discriminative gene subset, so as to reduce the spatial dimension, improve classification accuracy, and find potential target genes.
In machine learning and data mining, gene selection is known as feature selection; the screening of genes can therefore be realized with feature selection techniques. Many feature selection methods pick the subset with the strongest class-discriminating power by measuring the information shared between features and the class. Methods such as FAST and Relief evaluate the relevance of each candidate feature to the class and add highly relevant features to the selected subset. However, such methods do not consider redundancy among features, so the selected features may be highly correlated, and the joint classification performance of several individually strong features is not necessarily better than that of a combination of individually weak ones. In response to this problem, much research has focused on reducing inter-feature redundancy: methods such as ARCO, mRMR and CIFE evaluate feature redundancy by measuring correlations between features, and add features with high class correlation and low mutual correlation to the selected subset. However, the features that jointly provide the most information for classification, and thus maximize global class discrimination, are not necessarily uncorrelated; they are more likely to be complementary. Moreover, whether they measure class-relevant information or redundant information, these methods do not take into account, when adding a new feature, the information the already selected feature subset retains for identifying the target class. Two features with the same relevance to a class may have completely different effects on the selected feature subset.
In addition, for computational feasibility, existing methods compute class-relevant information and inter-feature correlations in a pairwise manner. This can overestimate both the class-recognition ability of feature pairs and the redundancy between features, ignoring how the selected feature subset cooperates as a whole and affects global classification performance. Considering these problems, the invention provides a method for selecting complementary differentially expressed genes based on mvAUC.
Disclosure of Invention
The invention provides a complementary differential expression gene selection method based on mvAUC, which selects the most complementary gene subset from the differential expression data of cancer so as to maximize global classification performance. This method has the advantage that the global discrimination ability of the selected feature subset can be evaluated directly, without computing redundant information pairwise between a candidate feature and each selected feature.
In order to achieve the purpose, the invention provides the following scheme:
a method for selecting complementary differentially expressed genes based on mvAUC comprises the following steps:
calculating the ordered possible misclassification set OPMS of each gene characteristic;
for a feature set, determining a possible misclassification set PMS common to the feature set and calculating a new AUC of each feature based on the possible misclassification set;
the mvAUC is calculated based on the new AUC after the gene feature combination, and the candidate feature which maximizes the current mvAUC is added into the selected feature subset in an incremental manner.
Preferably, the AUC is defined as the area under the ROC curve, with the formula:

$AUC = \int_0^1 P(\theta)\, dF(\theta)$

where θ is a given classification threshold, F(θ) is the false positive rate, i.e. the proportion of negative instances incorrectly classified as positive, and P(θ) is the true positive rate, i.e. the proportion of positive instances correctly classified as positive;
the AUC value represents the sample information that the feature can classify correctly; the larger the AUC, the more the feature is related to the target class and the stronger the feature's classification ability;
AAC is the area above the ROC curve, with the formula:

$AAC = \int_0^1 F(\theta)\, dP(\theta)$

where θ, F(θ) and P(θ) are as defined above;
the smaller the AAC value, the stronger the class-discrimination ability of the feature; AAC represents the sample information that the feature may misclassify, and the set formed by the feature's possibly misclassified samples is defined as the possible misclassification set PMS.
Preferably, the ROC curve is a two-dimensional graph with a false positive rate F (θ) as an x-axis and a true positive rate P (θ) as a y-axis, and is used to represent the classification capability of a feature.
Preferably, the steps of calculating the new AUC based on the possible misclassification set are:
calculating the ordered possible misclassification set: let $X=\{x_1,x_2,\dots,x_n\}$ be a data set comprising n instances, each instance $x_i$ represented by the m features $F=\{f_1,f_2,\dots,f_m\}$, where $x_{ij}$ is the value of instance $x_i$ on feature $f_j$; $n_0$ and $n_1$ respectively denote the numbers of positive and negative class instances in the data set, with $n_0+n_1=n$; for a feature f, all samples in the data set X are sorted in ascending order of their values on f, obtaining the ordered sample sequence $S_f=(x_{(1)},x_{(2)},\dots,x_{(n)})$;
it is stipulated that the first sample at the left end of the sequence belongs to the negative class; traversing from the left and right ends of the sequence, the first positive class sample found from the left end and the first negative class sample found from the right end are denoted $x_{(p)}$ and $x_{(q)}$ respectively; the sample subsequence from $x_{(p)}$ to $x_{(q)}$ is defined as the ordered possible misclassification set OPMS, and the set formed by all samples in this subsequence is the possible misclassification set PMS;
taking the value of a positive instance in the OPMS as the threshold θ, all negative samples and positive samples to the right of that instance are false positive samples and true positive samples respectively; taking each positive sample in the OPMS in turn as a threshold from right to left, the AAC is calculated as:

$\widetilde{AAC} = \frac{1}{n_0 n_1} \sum_{k=1}^{n'} \sum_{l=1}^{k} FP_l$

where the tilde indicates that the calculation is performed on the ordered possible misclassification set OPMS, k refers to the k-th positive instance counted from the right end of the OPMS, $n_0$ and $n_1$ respectively represent the numbers of positive and negative class instances in the data set, n' is the number of positive class instances in the OPMS, and $FP_l$ is the number of false positive samples between the (l-1)-th and the l-th positive sample from the right end;
the final OPMS-based AUC expression is:

AUC = 1 - AAC

if the OPMS is empty, all positive class instances rank above all negative class instances; in that case AAC = 0 and AUC = 1, and all instances can be correctly classified into the two classes.
Preferably, the specific steps of determining the common possible misclassification set and calculating the new AUC after each feature combination are as follows:
calculating the common possible misclassification set of the combined features; calculating the new OPMS of each combined single feature; calculating the mvAUC of the combined features.
Preferably, for a feature set F, the set formed by the samples that no feature in F can classify correctly is the common possible misclassification set after feature combination, expressed as:

$M_F = \bigcap_{f_j \in F} M_{f_j}$

where F is the feature set, $M_F$ is the common possible misclassification set of F, $f_j$ is a feature in the feature set F, and $M_{f_j}$ is the original possible misclassification set PMS of feature $f_j$.
Preferably, the new OPMS of each feature $f_j$ in the feature set F after combination with the other features consists of all instances in $M_F$; arranging the instances of $M_F$ in ascending order of their values on $f_j$ yields the new OPMS of feature $f_j$.
Preferably, each feature's new combined AUC value is calculated on its own new OPMS, and the mean of the new AUCs of all features in the combination is the mvAUC of the feature set F.
The invention has the beneficial effects that:
(1) The original AUC can only evaluate the classification ability of a single feature, whereas the mvAUC can evaluate the joint classification performance of multiple features. Compared with various AUC-based feature selection methods such as FAST, ARCO and AVC, the mvAUC can not only measure the correlation between the target class and a feature combination, but also evaluate the complementarity between the selected feature set and a candidate feature. The mvAUC treats multiple features as a whole and directly and effectively evaluates the global complementarity and classification performance of the feature set.
(2) The mvAUC avoids the overestimation of class correlation and feature redundancy caused by pairwise computation, making the evaluation more accurate. In addition, the mvAUC simultaneously measures the new classification information brought by a candidate feature and the class-relevant information retained by the already selected feature subset, and the use of set operations makes the method easier to implement and compute.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the ROC curve of the present invention, wherein (a) is a schematic diagram of AUC integration and (b) is a schematic diagram of AAC integration;
FIG. 3 is a schematic diagram showing a comparison of ROC curves before and after gene combination associated with prostate cancer in examples of the present invention;
FIG. 4 is a schematic diagram showing a comparison of ROC curves before and after combination of the genes with the largest mvAUC among the prostate-cancer-related genes published in COSMIC, in the examples of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention provides a method for selecting complementary differential expression genes based on mvAUC (as shown in figure 1), which specifically comprises the following steps:
the invention firstly defines the concept of the AAC area above the ROC curve and the possible misclassification set from the angle of the ROC curve integration, and specifically comprises the following steps:
the ROC curve is a two-dimensional graph with a false positive rate (false positive rate) as an x-axis and a true positive rate (true positive rate) as a y-axis, and can be used to represent the classification capability of a feature, as shown in fig. 2 (a). The closer the ROC curve is to the upper left corner, the stronger the discrimination of the features to the classes. Given a classification threshold θ, the true positive rate P (θ) indicates how many positive instances are correctly classified as positive classes, and the false positive rate F (θ) indicates how many negative instances are incorrectly classified as positive classes. AUC (area Under the ROC curve) is defined as the area Under the ROC curve, and is given by the formula:
Figure BDA0002930800330000071
the AUC value represents sample information that the feature can be classified correctly, and a larger AUC means that the feature is more relevant to the target class, and the classification capability of the feature is stronger.
As shown in FIG. 2(b), AAC (Area Above the ROC Curve) is defined as the area above the ROC curve, with the formula:

$AAC = \int_0^1 F(\theta)\, dP(\theta)$    (2)
the smaller the AAC value, the more discriminating the feature is. AAC represents sample information of possible misclassifications of features, where a set of possible misclassification samples of features is defined as a Possible Misclassification Set (PMS).
A new AUC calculation method based on the possible misclassification set is provided on the basis of the related definition, and the specific steps are as follows:
step one, calculating an Ordered Possible Misclassification Set (OPMS).
Let $X=\{x_1,x_2,\dots,x_n\}$ represent a data set comprising n instances, each instance $x_i$ represented by the m features $F=\{f_1,f_2,\dots,f_m\}$; $x_{ij}$ is the value of instance $x_i$ on feature $f_j$. $n_0$ and $n_1$ respectively denote the numbers of positive and negative class instances in the data set, with $n_0+n_1=n$. For a feature f, all samples in the data set X are sorted in ascending order of their values on f, giving the ordered sample sequence $S_f=(x_{(1)},x_{(2)},\dots,x_{(n)})$.
Stipulating that the first sample at the left end of the sequence belongs to the negative class, the sequence is traversed from its left and right ends; the first positive class sample found from the left end and the first negative class sample found from the right end are denoted $x_{(p)}$ and $x_{(q)}$ respectively. The sample subsequence from $x_{(p)}$ to $x_{(q)}$ is defined as the ordered possible misclassification set, and the set of all samples in this subsequence is the possible misclassification set.
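The OPMS construction described above can be sketched in Python (a minimal sketch, assuming labels coded 1 for the positive class and 0 for the negative class, with low feature values on the negative side, matching the left-end convention; all names are mine):

```python
def ordered_pms(values, labels):
    """Ordered possible misclassification set (OPMS) of one feature.

    Sort samples ascending by feature value, then keep only the span from
    the first positive sample (found from the left) to the first negative
    sample (found from the right); samples outside that span are always
    classified correctly by this feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    lab = [labels[i] for i in order]
    left = next((k for k, c in enumerate(lab) if c == 1), len(lab))
    right = next((k for k in range(len(lab) - 1, -1, -1) if lab[k] == 0), -1)
    if left > right:  # the feature separates the classes perfectly
        return []
    return [(values[order[k]], lab[k]) for k in range(left, right + 1)]
```

For example, with values [1, 2, 3, 4, 5, 6] and labels [0, 0, 1, 0, 1, 1], only the samples with values 3 and 4 can still be misclassified, so the OPMS is [(3, 1), (4, 0)].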
And step two, calculating the AUC based on the OPMS.
Taking the value of a positive instance in the OPMS as the threshold θ, all negative samples and positive samples on the right side of the instance are false positive and true positive, respectively. According to the integral definition of AAC, each positive sample in OPMS is sequentially regarded as a threshold from right to left, and then the calculation formula of AAC is as follows:
$\widetilde{AAC} = \frac{1}{n_0 n_1} \sum_{k=1}^{n'} \sum_{l=1}^{k} FP_l$    (3)

where the tilde indicates that the calculation is performed on the ordered possible misclassification set OPMS; n' is the number of positive class instances in the OPMS, and $FP_l$ is the number of false positive samples between the (l-1)-th and the l-th positive sample from the right end. This AAC calculation is in fact a numerical evaluation of equation (2), i.e. an integration of the false positive rate. According to the definitions of AUC and AAC, the final OPMS-based AUC can be calculated as:
AUC=1-AAC (4)
if OPMS is empty, all positive class instances will be ranked higher than negative class instances. When AAC is 0 and AUC is 1, it means that all instances can be correctly classified into two classes.
And finally, based on the public possible misclassification set of the feature combination and a new AUC calculation method, providing the mvAUC of the combined features, and measuring the global complementarity and class discrimination capability of the combined features on the whole. The specific calculation steps are as follows:
step one, calculating a common possible misclassification set of the combination characteristics.
For a feature set F, a set composed of samples in which all features in F cannot be classified correctly is defined as a common possible misclassification set after feature combination:
$M_F = \bigcap_{f_j \in F} M_{f_j}$    (5)

where $f_j$ denotes a feature in the feature set F and $M_{f_j}$ is the original possible misclassification set of $f_j$.
And step two, calculating the new OPMS of each combined single feature.
The new OPMS of each feature $f_j$ in F after combination with the other features consists of all instances in $M_F$; arranging the instances of $M_F$ in ascending order of their values on $f_j$ yields the new OPMS of feature $f_j$.
And step three, calculating the mvAUC of the combined features.
According to equation (3), a new AUC value can be calculated for each combined feature on its own new OPMS. The mean of the new AUCs of all features in the set is the mvAUC of the feature set F. Since the common possible misclassification set is contained in each feature's original PMS, each feature's new AUC must be greater than or equal to its original AUC. The difference between the two represents the contribution of the other features in F to enhancing that feature's classification ability; the larger the difference, the stronger the complementarity. The larger the mvAUC of a feature set, the stronger the complementarity and global class-discrimination performance of the features in the set.
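The three steps above can be combined into a single sketch (all names are my own; X maps each feature name to its value list and y holds 1/0 labels):

```python
def mvauc(X, y, features):
    """mvAUC of a feature set: intersect the per-feature PMSs into the
    common set M_F, re-sort M_F by each feature to get its new OPMS, and
    average the resulting per-feature AUCs."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos

    def pms_indices(f):
        # possible misclassification set of one feature, as sample indices
        order = sorted(range(n), key=lambda i: X[f][i])
        lab = [y[i] for i in order]
        left = next((k for k, c in enumerate(lab) if c == 1), n)
        right = next((k for k in range(n - 1, -1, -1) if lab[k] == 0), -1)
        return set(order[left:right + 1])

    m_f = set.intersection(*(pms_indices(f) for f in features))

    def new_auc(f):
        seq = sorted(m_f, key=lambda i: X[f][i])  # the new OPMS of f
        bad = negs_seen = 0
        for i in reversed(seq):
            if y[i] == 0:
                negs_seen += 1
            else:
                bad += negs_seen
        return 1.0 - bad / (n_pos * n_neg)

    return sum(new_auc(f) for f in features) / len(features)
```

On the toy data X = {'f1': [1, 2, 3, 4, 5, 6], 'f2': [6, 5, 4, 3, 2, 1]} with y = [0, 0, 1, 0, 1, 1], feature f1 alone has AUC 8/9, but the common misclassification set of {f1, f2} shrinks to the two ambiguous samples and the combined mvAUC rises to 17/18, illustrating the complementarity effect.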
Based on the OPMS-based AUC calculation, as shown in Table 1, Algorithm 1 takes the data set and feature set as inputs and computes and outputs the possible misclassification set and AUC of each feature in the feature set. The algorithm first sorts all samples in ascending order of their values on each feature $f_j$ to determine the ordered possible misclassification sequence OPMS(j); by the definition of PMS, PMS(j) is the set of samples in OPMS(j). The initial AUC value of each feature $f_j$ is then obtained from the AAC calculation formula.
TABLE 1
[Table 1: pseudocode of Algorithm 1, rendered as an image in the original document]
The mvAUC-based gene selection then proceeds as shown in Table 2: taking as inputs the data set, the feature set, and the possible misclassification set PMS and AUC of each feature produced by Algorithm 1, Algorithm 2 computes the mvAUC of combined gene features and selects the gene subset with the strongest global complementarity. The larger the mvAUC of a feature combination, the stronger the global complementarity of the feature subset. The core of the algorithm is to incrementally add to the selected feature subset the candidate feature that maximizes the mvAUC of the currently selected subset.
In the initialization phase of Algorithm 2, the feature $f^*$ with the maximum AUC is added to $F^*$ as the initial selected feature subset and removed from the candidate feature set F.
An iterative procedure is then adopted: each iteration adds one candidate feature to the selected feature subset, and the algorithm stops when adding a new candidate no longer changes the mvAUC, or when the selected feature subset reaches a specified size. In each iteration, the common possible misclassification set $M_{F^*}$ of all features in the selected subset is computed by intersection operations. Then, for each candidate feature, the mvAUC after combining it with the selected subset is computed in three steps: first, for candidate $f_j$, the common possible misclassification set $M_{F^* \cup \{f_j\}}$ of the candidate and the selected subset is computed; next, for each feature $f_k$ in $F^* \cup \{f_j\}$, the samples in $M_{F^* \cup \{f_j\}}$ are sorted by their values on $f_k$ to obtain the combined OPMS of $f_k$, and its AUC is calculated; finally, the mean of the new AUCs of all features in $F^* \cup \{f_j\}$ is the mvAUC of the combination. The candidate feature $f^*$ corresponding to the maximum mvAUC is then added to the selected feature subset $F^*$, and the above steps are repeated while the stopping condition is not met.
TABLE 2
[Table 2: pseudocode of Algorithm 2, rendered as an image in the original document]
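Putting the pieces together, the incremental selection of Algorithm 2 can be sketched end to end (a minimal sketch under my own naming, not the patented implementation; the stopping rule interprets "mvAUC is not changed" as "does not increase", and argmax ties are broken arbitrarily):

```python
def select_genes(X, y, max_size):
    """Greedy forward selection by mvAUC.  X maps feature name -> value
    list, y holds 1/0 labels.  Starts from the single feature with the
    largest AUC, then repeatedly adds the candidate that maximises the
    combined mvAUC."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos

    def pms(f):
        # possible misclassification set of one feature, as sample indices
        order = sorted(range(n), key=lambda i: X[f][i])
        lab = [y[i] for i in order]
        left = next((k for k, c in enumerate(lab) if c == 1), n)
        right = next((k for k in range(n - 1, -1, -1) if lab[k] == 0), -1)
        return set(order[left:right + 1])

    def auc_on(subset, f):
        # OPMS-based AUC of feature f, restricted to the given sample set
        seq = sorted(subset, key=lambda i: X[f][i])
        bad = negs_seen = 0
        for i in reversed(seq):
            if y[i] == 0:
                negs_seen += 1
            else:
                bad += negs_seen
        return 1.0 - bad / (n_pos * n_neg)

    pms_of = {f: pms(f) for f in X}
    best = max(X, key=lambda f: auc_on(pms_of[f], f))
    selected, candidates = [best], set(X) - {best}
    current = auc_on(pms_of[best], best)
    while candidates and len(selected) < max_size:
        m_sel = set.intersection(*(pms_of[f] for f in selected))

        def combined_mvauc(fj):
            m = m_sel & pms_of[fj]
            feats = selected + [fj]
            return sum(auc_on(m, f) for f in feats) / len(feats)

        f_star = max(candidates, key=combined_mvauc)
        new = combined_mvauc(f_star)
        if new <= current:          # mvAUC no longer improves: stop
            break
        selected.append(f_star)
        candidates.remove(f_star)
        current = new
    return selected
```

On a toy data set with one discriminative feature and one complementary feature, the sketch first picks the higher-AUC feature and then adds the candidate that lifts the combined mvAUC.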
FIG. 3 shows the results of experiments performed with the present invention on the prostate cancer data set in the TCGA database. The figure shows the ROC curves and AUC values, before and after combination, of five of the seven genes selected by the method of the present invention. It can be seen that only two of the original single genes have an AUC above 0.9, yet the combined mvAUC value reaches 0.9989. This indicates that a combined gene set can have high class-discrimination ability even when it contains weakly discriminative genes with low class correlation; the selected gene set exhibits strong complementarity and disease-identification capability. FIG. 4 compares the ROC curves and AUC values, before and after combination, of five genes among the seven genes with the maximum mvAUC from the prostate-tumor-related genes published in COSMIC; it can be seen that the performance of the gene combination obtained by the present method is clearly superior to that of this group of genes.
The original AUC can only evaluate the classification capability of a single feature, while the mvAUC can be used to evaluate the joint classification performance of multiple features. Compared with various AUC-based feature selection methods, such as FAST, ARCO, AVC, mvAUC can not only measure the correlation between the target class and the feature combination, but also evaluate the complementarity between a selected set of features and a candidate feature. The mvAUC considers a plurality of features as a whole and directly and effectively evaluates the global complementation and classification performance of the feature set.
Compared with classical and recently proposed feature selection methods such as mRMR, ReliefF and MRI, as well as PAM, a method oriented directly at gene selection, the mvAUC avoids the overestimation of class correlation and feature redundancy caused by pairwise computation, making the evaluation more accurate. Furthermore, the mvAUC simultaneously measures the new classification information brought by a candidate feature and the class-relevant information retained by the already selected feature subset, and the use of set operations makes the method easier to implement and compute.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.

Claims (3)

1. A method for selecting complementary differentially expressed genes based on mvAUC is characterized by comprising the following steps:
calculating the ordered possible misclassification set OPMS of each gene characteristic;
for a feature set, determining a possible misclassification set PMS common to the feature set and calculating a new AUC of each feature based on the possible misclassification set;
calculating the mvAUC based on the new AUC after the gene feature combination, and incrementally selecting candidate features which maximize the current mvAUC to be added into the selected feature subset;
the step of calculating the new AUC of each feature based on the possible misclassification set is as follows:
calculating an ordered set of possible misclassifications: let

X = {x_1, x_2, …, x_n}

where X represents a data set comprising n instances; each instance x_i is represented by the M features of the feature set

F = {f_1, f_2, …, f_M}

and x_ij refers to the value of instance x_i on feature f_j; n_0 and n_1 respectively represent the number of positive-class and negative-class instances in the data set, and n_0 + n_1 = n; for a feature f, all samples in the data set X are sorted in ascending order according to their values on the feature f, obtaining an ordered sample sequence

S = (s_1, s_2, …, s_n)

it is stipulated that samples at the left end of the sequence belong to the negative class; traversing from the left end and the right end of the sequence respectively, the first positive-class sample from the left end and the first negative-class sample from the right end are marked as s_p and s_q respectively; the sample sequence in the interval from s_p to s_q is then defined as the ordered possible misclassification set OPMS, and the set formed by all samples in this sequence is the possible misclassification set PMS;
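The OPMS construction just described can be illustrated with a minimal Python sketch (names are our own; labels use 0 for the negative class and 1 for the positive class):

```python
def ordered_pms(values, labels):
    """Sort samples by feature value (ascending), then keep the span from the
    first positive-class sample found from the left end to the first
    negative-class sample found from the right end.  Convention: the left end
    of the sequence is the negative class (label 0).  Returns the OPMS as a
    list of (value, label) pairs; the set of its samples is the PMS."""
    seq = sorted(zip(values, labels), key=lambda p: p[0])
    left = next((i for i, (_, y) in enumerate(seq) if y == 1), None)
    if left is None:                      # no positive instance at all
        return []
    right = next((i for i in range(len(seq) - 1, -1, -1) if seq[i][1] == 0),
                 None)
    if right is None or right < left:     # classes already perfectly separated
        return []
    return seq[left:right + 1]

# A feature that separates the classes perfectly yields an empty OPMS:
print(ordered_pms([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))   # []
# One overlapping positive/negative pair produces a non-empty OPMS:
print(ordered_pms([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))   # [(0.4, 1), (0.6, 0)]
```

Only the samples inside this span can be misordered by any threshold on the feature, which is why the subsequent AAC computation is restricted to the OPMS.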
regarding the value of a certain positive-class instance in the OPMS as a threshold θ, all negative-class samples and positive-class samples on the right side of that instance are false positive samples and true positive samples respectively; taking each positive-class sample in the OPMS from right to left as the threshold in turn, the calculation formula of the AAC is:

AAC = (1 / (n_0 · n_1)) · Σ_{k=1}^{n'} Σ_{l=1}^{k} FP_l

where the summation is performed on the ordered possible misclassification set OPMS; k refers to the k-th positive-class instance counted from the right end of the OPMS; n_0 and n_1 respectively represent the number of positive-class and negative-class instances in the data set; n' is the number of positive-class instances in the OPMS; and FP_l is the number of false positive samples between the (l-1)-th and the l-th positive-class samples counted from the right end;
the final OPMS-based AUC expression is:
AUC=1-AAC
if the OPMS is empty, all positive-class instances are ranked higher than all negative-class instances; in this case AAC = 0 and AUC = 1, and all instances can be correctly classified into the two classes;
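The AUC = 1 − AAC computation on the OPMS can be sketched as follows (illustrative names; `opms` is the ordered value/label sequence defined in claim 1, with label 0 for the negative class and 1 for the positive class):

```python
def opms_auc(opms, n_pos, n_neg):
    """AUC = 1 - AAC computed on an ordered possible misclassification set:
    AAC = (1/(n_pos*n_neg)) * sum, over positive instances taken right to
    left, of the number of negative samples lying to their right."""
    fp_seen = 0        # negatives encountered so far, scanning from the right
    misordered = 0     # accumulated sum over k of sum_{l<=k} FP_l
    for _, label in reversed(opms):
        if label == 0:             # negative-class sample
            fp_seen += 1
        else:                      # positive-class sample taken as threshold
            misordered += fp_seen
    return 1.0 - misordered / (n_pos * n_neg)

# Empty OPMS -> AAC = 0, AUC = 1 (perfect separation):
print(opms_auc([], 2, 2))                           # 1.0
# One negative ranked above one positive, with 2 positives and 2 negatives:
print(opms_auc([(0.4, 1), (0.6, 0)], 2, 2))         # 0.75
```

Scanning right to left makes each positive instance accumulate exactly the false positives to its right, matching the double sum in the AAC formula without materializing the FP_l counts.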
the specific steps of determining the common possible misclassification set PMS and calculating the new AUC after each feature combination are as follows:
calculating a common possible misclassification set of the combined features; calculating a new OPMS of the combined single features; calculating mvAUC of the combined features;
for a feature set F, the set formed by the samples that none of the features in F can classify correctly is the common possible misclassification set after feature combination, expressed as:

M_F = ∩_{f_j ∈ F} M_{f_j}

where F is the feature set, M_F is the common possible misclassification set of the feature set F, f_j is a feature in the feature set F, and M_{f_j} refers to the original possible misclassification set PMS of the feature f_j;

for each feature f_j in the feature set F, the new OPMS after combination with the other features is composed of all instances in M_F: arranging the instances of M_F in ascending order according to their values on the feature f_j yields the new OPMS of the feature f_j;

the new AUC value of each combined feature is calculated on its new OPMS, and the average of the new AUC values of all the features in the feature combination is the mvAUC of the feature set F.
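The combination step can be sketched end to end in Python (illustrative names, not the patent's reference implementation; `cols` is a list of feature columns, `y` the labels with 0 = negative class, 1 = positive class):

```python
def pms_indices(col, y):
    """Indices of the samples a single feature may misclassify (its PMS):
    after sorting by feature value (ascending), the span from the first
    positive-class sample from the left to the first negative-class sample
    from the right."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    pos = [k for k, i in enumerate(order) if y[i] == 1]
    neg = [k for k, i in enumerate(order) if y[i] == 0]
    if not pos or not neg or neg[-1] < pos[0]:
        return set()
    return set(order[pos[0]:neg[-1] + 1])

def auc_on(indices, col, y, n_pos, n_neg):
    """New AUC of a feature on a given instance subset: sort the subset by
    the feature value, count misordered (positive, negative) pairs, and
    return AUC = 1 - AAC."""
    seq = sorted(indices, key=lambda i: col[i])
    fp_seen = mis = 0
    for i in reversed(seq):
        if y[i] == 0:
            fp_seen += 1
        else:
            mis += fp_seen
    return 1.0 - mis / (n_pos * n_neg)

def mv_auc(cols, y):
    """mvAUC of a feature set: intersect the per-feature PMSs to obtain the
    common set M_F, re-evaluate each feature's AUC on M_F, and average."""
    n_pos, n_neg = sum(y), len(y) - sum(y)
    m_f = set.intersection(*(pms_indices(c, y) for c in cols))
    return sum(auc_on(m_f, c, y, n_pos, n_neg) for c in cols) / len(cols)

# Two individually imperfect but complementary features reach mvAUC = 1.0:
y = [0, 0, 1, 1]
print(mv_auc([[0.1, 0.6, 0.4, 0.9]], y))                     # 0.75
print(mv_auc([[0.1, 0.6, 0.4, 0.9], [0.5, 0.2, 0.3, 0.8]], y))  # 1.0
```

The example shows the intended complementarity effect: each feature misclassifies a different sample, so their common set M_F shrinks and both re-evaluated AUCs rise.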
2. The method for selecting complementary differentially expressed genes according to claim 1, wherein the AUC is defined as the area under the ROC curve, with the formula:

AUC = ∫₀¹ P(θ) dF(θ)

where θ is a given classification threshold, F(θ) represents the proportion of negative-class instances incorrectly classified into the positive class, and P(θ) represents the proportion of positive-class instances correctly classified into the positive class;

the AUC value represents the sample information that the feature can classify correctly; the larger the AUC, the more related the feature is to the target class and the stronger its classification ability;

the AAC is the area above the ROC curve, with the formula:

AAC = ∫₀¹ F(θ) dP(θ)

where θ, F(θ) and P(θ) are defined as above;

the smaller the AAC value, the stronger the class discrimination ability of the feature; the AAC represents the sample information that the feature may misclassify, and the set formed by the samples a feature may misclassify is defined as the possible misclassification set PMS.
3. The method for selecting complementary differentially expressed genes according to claim 2, wherein the ROC curve is a two-dimensional graph having a false positive rate F (θ) as an x-axis and a true positive rate P (θ) as a y-axis, and represents the classification ability of a feature.
CN202110147526.2A 2021-02-03 2021-02-03 Complementary differential expression gene selection method based on mvAUC Expired - Fee Related CN112802555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147526.2A CN112802555B (en) 2021-02-03 2021-02-03 Complementary differential expression gene selection method based on mvAUC


Publications (2)

Publication Number Publication Date
CN112802555A CN112802555A (en) 2021-05-14
CN112802555B true CN112802555B (en) 2022-04-19

Family

ID=75813880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110147526.2A Expired - Fee Related CN112802555B (en) 2021-02-03 2021-02-03 Complementary differential expression gene selection method based on mvAUC

Country Status (1)

Country Link
CN (1) CN112802555B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279520A (en) * 2015-09-25 2016-01-27 天津师范大学 Optimal character subclass selecting method based on classification ability structure vector complementation
CN105740653A (en) * 2016-01-27 2016-07-06 北京工业大学 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis
CN105745659A (en) * 2013-09-16 2016-07-06 佰欧迪塞克斯公司 Classifier generation method using combination of mini-classifiers with regularization and uses thereof
CN105938523A (en) * 2016-03-31 2016-09-14 陕西师范大学 Feature selection method and application based on feature identification degree and independence
CN110379521A (en) * 2019-06-24 2019-10-25 南京理工大学 Medical data collection feature selection approach based on information theory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156542B (en) * 2015-03-27 2018-09-14 深圳华大基因科技有限公司 The method that the immunity difference of the individual two class states of analysis, auxiliary determine individual state
CN110444248B (en) * 2019-07-22 2021-09-24 山东大学 Cancer biomolecule marker screening method and system based on network topology parameters


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity; Lei Sun et al.; BMC Bioinformatics; 2017; full text *
Feature selection method based on AUC and inter-feature complementarity; Sun Lei; Chinese Master's Theses Full-text Database, Basic Sciences; 2018-04-15; full text *
Multi-task feature learning algorithm based on preserving classification information; Wang Jun et al.; Journal of Computer Research and Development; 2017; Vol. 54, No. 3; full text *

Also Published As

Publication number Publication date
CN112802555A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
Lustgarten et al. Measuring stability of feature selection in biomedical datasets
Bhanot et al. A robust meta‐classification strategy for cancer detection from MS data
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
CN109886284B (en) Fraud detection method and system based on hierarchical clustering
Pont-Tuset et al. Supervised assessment of segmentation hierarchies
Dai et al. Feature selection via max-independent ratio and min-redundant ratio based on adaptive weighted kernel density estimation
Kotanchek et al. Symbolic regression via genetic programming as a discovery engine: Insights on outliers and prototypes
Cui et al. MMCO-Clus–an evolutionary co-clustering algorithm for gene selection
López-Oriona et al. Hard and soft clustering of categorical time series based on two novel distances with an application to biological sequences
CN112802555B (en) Complementary differential expression gene selection method based on mvAUC
Suresh et al. Data clustering using multi-objective differential evolution algorithms
Yun et al. Experimental comparison of feature subset selection methods
Yun et al. An experimental study on feature subset selection methods
TWI399661B (en) A system for analyzing and screening disease related genes using microarray database
Cai et al. Fuzzy criteria in multi-objective feature selection for unsupervised learning
Liang et al. A sequential three-way classification model based on risk preference and decision correction
Vukicevic et al. Internal evaluation measures as proxies for external indices in clustering gene expression data
Breimann et al. AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales
AlSaif Large scale data mining for banking credit risk prediction
CN111316366A (en) Method for simultaneous multivariate feature selection, feature generation and sample clustering
Alexandre et al. Integrating Statistical Significance and Discriminative Power in Pattern Discovery
Laborda et al. Feature Selection in Credit Scoring Model. Mathematics 2021, 9, 746
Chlis et al. Extracting reliable gene expression signatures through stable bootstrap validation
Wu et al. Evaluation of error-sensitive attributes
Hatami et al. Diverse accurate feature selection for microarray cancer diagnosis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220419