CN115662504A

CN115662504A - Multi-angle fusion-based biological omics data analysis method

Info

Publication number: CN115662504A
Application number: CN202211361898.6A
Authority: CN
Inventors: 王堃宇; 林晓惠
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2023-01-31

Abstract

A multi-angle fusion-based method for analyzing biological omics data systematically examines, from multiple angles, the association between diseases and genomic, metabolomic and other omics data, and constructs several feature subspaces rich in biological information. To limit the effect of small sample size and high dimensionality on the analysis, and in view of the diverse relationships among the components of a living organism, three feature selection methods, each working from a different angle, are used to construct three representative feature subspaces rich in biological information, and a fusion classification model is built on these subspaces to carry out the data analysis. Results on several public data sets from different omics show that the multi-angle fusion method yields effective analyses and better classification performance. It provides a practical and effective means of data analysis for research on genomics, metabolomics, proteomics and other biological omics data, and has high application value.

Description

Multi-angle fusion-based biological omics data analysis method

Technical Field

The invention belongs to the technical field of biological omics data analysis, and relates to a biological omics data analysis method based on multi-angle fusion.

Background

With the rapid development of science and technology and the continuous progress of omics technology, large amounts of biological omics data continue to emerge. Common omics data include genomic, transcriptomic, proteomic and metabolomic data. Genomics is the collective, quantitative study of all genes in an organism and of the differences between genes, and is currently the most mature of these fields. It focuses on entire genomes rather than on the single gene or few genes of interest in traditional genetics, and it provides a reliable basis for deciphering genetic information and studying complex diseases and specific genetic variations. Through transcription, translation and related processes, genes give rise to proteins, which embody life activity and are closely tied to the biochemical reactions in cells. Proteomics has therefore attracted wide attention after genomics. Proteomics studies protein expression levels, post-translational modifications and protein interactions. Proteins in the human body change dynamically and are inherently complex, and analyzing the information contained in proteomic data is crucial to understanding life processes. However, genomics and proteomics alone are not sufficient to decode human life: for example, the same genotype may show different traits, because traits are shaped by both genetic and environmental factors. A disease may be related to a gene mutation, or to an error in the transcription, translation or other processing of a gene. The role of the other omics in the human body should therefore not be underestimated.
Transcriptomics studies transcription across the whole genome and the rules of transcriptional regulation; epigenomics studies, as a whole, the modifications of genomic DNA and DNA-binding proteins; metabolomics quantitatively analyzes all metabolites in an organism (for example amino acids, fatty acids and carbohydrates) and relates them to the corresponding diseases. Omics data are therefore of great significance for understanding the phenomena of life, analyzing organisms, finding features rich in biological information, and exploring specific research questions such as the occurrence and development of disease.

However, most biological data share a serious problem: high dimensionality, high noise and a small number of samples. These properties limit researchers in analyzing and mining the data, so achieving effective analysis and mining of data with these properties is of great significance for disease research, medical treatment and related directions in biology.

Taking multi-angle fusion as its starting point and focusing on how the feature subspaces of omics data are determined, the invention uses three different feature selection methods, namely ERGS, mRMR and a feature selection method based on a Spearman difference correlation network, to screen features rich in biological information from three different angles and to determine feature subspaces that reflect different physiological and pathological states of an organism. A fusion classifier is then established on the three feature subspaces, so that the biological omics data are effectively analyzed and mined. In this way the method builds an effective analysis model for biological omics data, screens feature subspaces with discriminative ability out of the original data set, and achieves good classification performance.

Disclosure of Invention

Biological data are high-dimensional, have few samples and much noise, and show complex and varied relationships among features. Given these properties, the invention aims to mine feature subspaces rich in biological information by using three different feature selection methods from multiple angles, and thereby to analyze the biological data effectively. The model is suitable for the analysis and study of biological omics data, can mine important information in omics data from different angles, and can be used in fields such as omics data analysis and precision medicine. The core of the method is the determination of different feature subspaces on the basis of multi-angle fusion.

In order to achieve the above object, the technical scheme adopted by the invention is as follows:

a multi-angle fusion-based method for analyzing biological omics data comprises the following steps:

Step one, data preprocessing

Preprocessing of the data set consists of two parts. The first part handles missing values: features whose number of missing values in each class of samples exceeds eighty percent of the number of samples in that class are deleted, and the missing values of the remaining features are filled with the mean of that feature over the samples of the same class. Let $F = \{f_1, f_2, \ldots, f_m\}$ be the feature set, where $m$ is the number of features; let $Y = \{y_j;\ j = 1, 2\}$ be the set of class labels; and let $S = \{s_1, s_2, \ldots, s_n\}$ be the sample set, where $n$ is the number of samples.
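The missing-value step can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: missing entries are assumed to be stored as `None`, a feature is assumed to be deleted as soon as the threshold is exceeded in any one class (the text could also be read as requiring it in every class), and the helper name `filter_and_impute` is hypothetical.

```python
from statistics import mean

def filter_and_impute(X, y, max_missing=0.8):
    """X: list of samples (rows), each a list of floats or None; y: class labels.
    Returns (kept_feature_indices, imputed_rows)."""
    classes = sorted(set(y))
    rows_by_class = {c: [i for i, lab in enumerate(y) if lab == c] for c in classes}

    kept = []
    for f in range(len(X[0])):
        drop = False
        for c in classes:
            rows = rows_by_class[c]
            n_missing = sum(1 for i in rows if X[i][f] is None)
            # Assumption: exceeding the threshold in any single class drops the feature.
            if n_missing > max_missing * len(rows):
                drop = True
                break
        if not drop:
            kept.append(f)

    # Copy the kept columns, then fill each gap with the same-class mean.
    imputed = [[X[i][f] for f in kept] for i in range(len(X))]
    for col, f in enumerate(kept):
        for c in classes:
            rows = rows_by_class[c]
            class_mean = mean(X[i][f] for i in rows if X[i][f] is not None)
            for i in rows:
                if imputed[i][col] is None:
                    imputed[i][col] = class_mean
    return kept, imputed
```

A kept feature always has at least one observed value in every class, because a class in which all values are missing exceeds the threshold, so the class mean is always defined.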

The second part standardizes the data with the Z-Score method, computed as in formula (1):

$$f^{scaled}_{ik} = \frac{f_{ik} - u_i}{\sigma_i} \qquad (1)$$

where $f^{scaled}_{ik}$ is the value of feature $f_i$ on sample $s_k$ after Z-Score standardization, $f_{ik}$ is the original value of feature $f_i$ on sample $s_k$, $u_i$ is the mean of feature $f_i$ over all samples, and $\sigma_i$ is the standard deviation of feature $f_i$ over all samples. This yields the standardized feature set $F = \{f^{scaled}_1, f^{scaled}_2, \ldots, f^{scaled}_m\}$.
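A minimal sketch of the Z-Score standardization described above. It assumes the population standard deviation, since the text does not say whether the population or the sample form is intended, and it maps constant features to 0 to avoid division by zero. The function name `zscore_columns` is hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(X):
    """Standardize every feature (column) of X to zero mean and unit variance."""
    n_features = len(X[0])
    means = [mean(row[f] for row in X) for f in range(n_features)]
    # Assumption: population standard deviation; the source does not specify.
    stds = [pstdev([row[f] for row in X]) for f in range(n_features)]
    return [
        [(row[f] - means[f]) / stds[f] if stds[f] > 0 else 0.0
         for f in range(n_features)]
        for row in X
    ]
```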

Step two, determining a feature subspace from multiple angles

Determining three different feature subspaces from multiple angles by using an ERGS feature selection method, an mRMR feature selection method and a feature selection method based on a Spearman difference correlation network;

The first feature subspace (Subset-ERGS) is determined with the ERGS feature selection method. The specific formulas of ERGS are as follows:

$$w_i = 1 - AC_i / \max\{AC_u : u = 1, 2\} \qquad (3)$$

In formula (2):

$R_{ij}$ is the effective range of feature $f^{scaled}_i$ in class $y_j$ ($j = 1, 2$);

$r^{+}_{ij}$ and $r^{-}_{ij}$ are the upper and lower bounds of the effective range $R_{ij}$;

$u_{ij}$ is the mean of feature $f^{scaled}_i$ in class $y_j$;

$\sigma_{ij}$ is the standard deviation of feature $f^{scaled}_i$ in class $y_j$;

$p_j$ is the prior probability of class $y_j$;

the coefficient 1.732 is determined by the Chebyshev inequality and ensures that the effective range contains at least 2/3 of the samples.

In formula (3):

$w_i$ is the weight of feature $f^{scaled}_i$.

In formula (4):

$OA_i$ is the overlapping area of the effective ranges of feature $f^{scaled}_i$ between the different classes;

$AC_i$ is an intermediate value used to calculate $w_i$ and represents the overlapping-area ratio of the effective ranges.

Finally, the ERGS method selects the features with high weights $w_i$.
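The ERGS weighting can be sketched for the two-class case as follows. Formulas (2) and (4) are not reproduced in this text, so the sketch assumes the form in which ERGS is usually described: effective-range bounds $u_{ij} \pm (1 - p_j) \cdot 1.732 \cdot \sigma_{ij}$, $OA_i$ as the length of the overlap between the two ranges, and $AC_i$ as $OA_i$ divided by the total span of the ranges. It also assumes that the maximum in formula (3) runs over all features. The function name `ergs_weights` is hypothetical.

```python
from statistics import mean, pstdev

GAMMA = 1.732  # coefficient named in the text

def ergs_weights(X, y):
    """X: rows of standardized feature values; y: labels with exactly two classes.
    Returns one weight per feature; higher means less class overlap."""
    classes = sorted(set(y))
    n = len(y)
    ac = []
    for f in range(len(X[0])):
        bounds = []
        for c in classes:
            vals = [X[i][f] for i in range(n) if y[i] == c]
            prior = len(vals) / n
            # Assumed form of formula (2): mean +/- (1 - prior) * GAMMA * std.
            half = (1.0 - prior) * GAMMA * pstdev(vals)
            bounds.append((mean(vals) - half, mean(vals) + half))
        (lo1, hi1), (lo2, hi2) = bounds
        overlap = max(0.0, min(hi1, hi2) - max(lo1, lo2))  # OA_i (assumed form)
        span = max(hi1, hi2) - min(lo1, lo2)
        ac.append(overlap / span if span > 0 else 1.0)     # AC_i (assumed form)
    max_ac = max(ac)
    if max_ac == 0:
        return [1.0] * len(ac)  # no feature overlaps at all
    # Assumption: normalization over all features.
    return [1.0 - a / max_ac for a in ac]
```

A feature whose class ranges do not overlap receives weight 1, and the feature with the largest overlap ratio receives weight 0.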

The second feature subspace (Subset-mRMR) is determined with the mRMR feature selection method, whose specific formula is as follows:

In formula (5):

$w_i$ is the final score of feature $f^{scaled}_i$ calculated by the mRMR method;

$I(f^{scaled}_i; y_j)$ is the mutual information between feature $f^{scaled}_i$ and class label $y_j$;

$I(f^{scaled}_i; x)$ is the mutual information between feature $f^{scaled}_i$ and a selected feature $x$, where $X$ denotes the set of selected features.

The mRMR feature selection method selects features according to the final score $w$, keeping the features with high scores.
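A greedy mRMR selection can be sketched as below. Formula (5) is not reproduced in this text, so the sketch assumes the common difference form of the score (relevance to the class label minus mean mutual information with the features already selected). It also assumes that mutual information is estimated after equal-width binning, which the source does not specify. The names `discretize`, `mutual_info` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from math import log

def discretize(values, bins=3):
    """Equal-width binning of a list of numbers into integer bin labels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_info(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[z])) for (x, z), c in pab.items())

def mrmr_select(X, y, k, bins=3):
    """Greedy mRMR; returns the indices of k features in selection order."""
    cols = [discretize([row[f] for row in X], bins) for f in range(len(X[0]))]
    relevance = [mutual_info(col, y) for col in cols]
    selected = []
    while len(selected) < min(k, len(cols)):
        best, best_score = None, float("-inf")
        for f in range(len(cols)):
            if f in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(mutual_info(cols[f], cols[s]) for s in selected) / len(selected)
            score = relevance[f] - redundancy  # assumed difference form of formula (5)
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected
```

In the test data used to check this sketch, an exact copy of the first selected feature is passed over in favour of a less relevant but less redundant feature, which is the behaviour the redundancy term is meant to produce.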

The third feature subspace (Subset-Spearman) is determined with a feature selection method based on a Spearman difference correlation network: Spearman correlation coefficients between features are calculated, and a difference correlation network is built to screen the features, which completes the determination of the feature subspace. The procedure is as follows.

First, according to the class label set $Y = \{y_j;\ j = 1, 2\}$, all samples are divided into two classes, and the correlation between features is calculated separately within each class using the Spearman correlation formula:

In formula (6):

$q_{ij}$ is the resulting Spearman correlation score between features $f^{scaled}_i$ and $f^{scaled}_j$; $u_i$ and $u_j$ are the means of $f^{scaled}_i$ and $f^{scaled}_j$ over all samples; $f^{scaled}_{ik}$ and $f^{scaled}_{jk}$ are the values of $f^{scaled}_i$ and $f^{scaled}_j$ on sample $s_k$. The larger the final value of $q_{ij}$, the higher the Spearman correlation between the two features $f^{scaled}_i$ and $f^{scaled}_j$.

The difference between the Spearman correlations obtained on the two classes is then calculated:

where $o_{ij}$ is the final Spearman correlation difference score of the two features $f^{scaled}_i$ and $f^{scaled}_j$ across the two classes, and $q^{+}_{ij}$ and $q^{-}_{ij}$ are the Spearman correlation scores of the two features on the two different classes. Once the Spearman correlation difference $o_{ij}$ has been calculated, the final difference correlation network is constructed. Each feature in the feature set $F = \{f^{scaled}_1, f^{scaled}_2, \ldots, f^{scaled}_m\}$ is defined as a node of the network, and the $o_{ij}$ obtained from formula (7) is used as the weight of the edge between the two corresponding nodes. The edge weights and node weights of the constructed difference network are netEdge_Weight and netNode_Weight respectively:

$$\mathrm{netEdge\_Weight}_{ij} = o_{ij} \qquad (8)$$

where $\mathrm{netNode\_Weight}_i$ is the final weight of the network node corresponding to feature $f^{scaled}_i$, obtained by summing the weights of all edges connected to that node, that is, $\sum_{j \neq i} \mathrm{netEdge\_Weight}_{ij}$. The final difference correlation network netS is thus constructed. Each node of netS is evaluated by its final weight netNode_Weight, the nodes with high weights are selected, and the feature subspace Subset-Spearman rich in biological information is constructed.
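The node weights of the difference correlation network can be sketched as follows. Formula (7) is not reproduced in this text, so the sketch assumes that the difference score is the absolute difference of the two per-class Spearman coefficients. Spearman correlation is computed here as the Pearson correlation of average ranks, and the names `ranks`, `spearman` and `difference_node_weights` are hypothetical.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            r[order[t]] = avg
        pos = end + 1
    return r

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (z - mb) for x, z in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((z - mb) ** 2 for z in rb)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def difference_node_weights(X, y):
    """Sum of edge weights per feature node; exactly two classes are assumed."""
    class_a, class_b = sorted(set(y))
    m = len(X[0])
    cols = {c: [[row[f] for row, lab in zip(X, y) if lab == c] for f in range(m)]
            for c in (class_a, class_b)}
    weights = [0.0] * m
    for i in range(m):
        for j in range(i + 1, m):
            q_a = spearman(cols[class_a][i], cols[class_a][j])
            q_b = spearman(cols[class_b][i], cols[class_b][j])
            o_ij = abs(q_a - q_b)  # assumed form of formula (7)
            weights[i] += o_ij     # each edge contributes to both of its nodes
            weights[j] += o_ij
    return weights
```

The pairwise loop is quadratic in the number of features, which is manageable at the scale of a few hundred features.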

The three feature selection methods work from three different angles: scoring individual features, weighing the relevance between features and class labels against the redundancy between feature pairs, and building a difference correlation network that considers the synergy among all feature variables. They yield the feature subspaces Subset-ERGS, Subset-mRMR and Subset-Spearman respectively.

Step three, establishing a fusion classifier on different feature subspaces

On the three feature subspaces obtained, a fusion classifier is established using a support vector machine (SVM) and a deep neural network (DNN) to classify the data.

The feature subspaces obtained by the three feature selection methods are Subset-ERGS, Subset-mRMR and Subset-Spearman. Each carries different biological information from a different angle, which ensures the information richness of the selected feature subspaces. Subset-ERGS is obtained by ranking individual feature scores. Subset-mRMR is built by jointly considering the redundancy between feature pairs and the relevance between features and class labels. Subset-Spearman is obtained by constructing an overall difference correlation network from the Spearman correlation coefficients, which takes the synergy among all features into account.

The SVM and DNN classification methods are applied to the three feature subspaces Subset-ERGS, Subset-mRMR and Subset-Spearman to build a classifier on each. The classification results are integrated by majority voting to form the overall fusion classifier, and the complete data analysis is carried out to obtain the final classification result.
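The majority-voting fusion can be sketched as follows, assuming one trained classifier per feature subspace whose predicted labels are already available. Training the SVM or DNN classifiers is outside this sketch, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of predicted labels per classifier, all of equal length.
    Returns the most frequent label for each sample."""
    fused = []
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused
```

With three classifiers and two classes, as here, a vote can never tie.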

Taking into account processes in living organisms such as gene regulation, metabolic reactions and protein synthesis, the invention systematically analyzes, from multiple angles, the association between diseases and genomic, metabolomic and other omics data, constructs several feature subspaces rich in biological information, and ensures the information richness of the selected feature subspaces. To limit the effect of small sample size and high dimensionality on the analysis, and in view of the diverse relationships among the components of a living organism, three feature selection methods, each working from a different angle, are used to construct three representative feature subspaces rich in biological information, and a fusion classification model is built on these subspaces to carry out the data analysis. Results on several public data sets from different omics show that, compared with other commonly used data analysis methods, the multi-angle fusion method of the invention yields effective analyses and better classification performance. Theoretical and experimental analysis indicates that the invention provides a practical and effective means of data analysis for research on genomics, metabolomics, proteomics and other biological omics data, and has strong application value.

Drawings

Fig. 1 is the overall architecture diagram of the integrated data analysis model established by the invention.

Fig. 2 is the network structure diagram of the DNN classifier used in the invention.

Fig. 3 is a PCA plot of the training-set portion of the human gastric cancer miRNA data set after the feature subspace is screened with the ERGS feature selection method.

Fig. 4 is a PCA plot of the training-set portion of the human gastric cancer miRNA data set after the feature subspace is screened with the mRMR feature selection method.

Fig. 5 is a PCA plot of the training-set portion of the human gastric cancer miRNA data set after the feature subspace is screened with the feature selection method based on the Spearman difference correlation network.

Detailed Description

The following further describes the specific embodiments of the present invention in conjunction with the technical solutions. The miRNA dataset of human gastric cancer is taken as an example to briefly explain the execution process.

The public omics data set used in this example is a miRNA data set of human gastric cancer. After the samples were analyzed and processed with the relevant bioanalytical techniques, the data set comprises 44 samples in total, 22 diseased and 22 non-diseased, with 556 features. It therefore has the small sample size and high dimensionality that characterize the public omics data sets the invention addresses.

Step one, preprocessing the human gastric cancer miRNA data set. Before the data analysis proper, the preprocessing steps of missing-value filling and Z-Score standardization are applied to the data set, yielding standardized data ready for further analysis.

Step two, determining a feature subspace from multiple angles

In this embodiment there are 44 samples and 556 features. To remove features of little relevance to the disease and keep the effective ones, feature selection is performed on the data set with the three methods from different angles: ERGS, mRMR and the feature selection method based on the Spearman difference correlation network. The number of features in each feature subspace is uniformly set to 100, giving the feature subspaces Subset-ERGS, Subset-mRMR and Subset-Spearman, each containing 100 features. Figs. 3 to 5 show the PCA plots of the two classes of samples in the training-set portion of the human gastric cancer miRNA data set after the corresponding feature subspaces are determined by the three feature selection methods. In all three plots the two classes of samples show a relatively clear separation trend, which indicates that the three feature subspaces determined from three angles have relatively strong discriminative ability.
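Fixing the size of each subspace at 100 amounts to keeping the 100 highest-scoring features under each method. A minimal sketch, assuming a list of per-feature scores is already available from one of the selection methods; the helper name `top_k_features` is hypothetical.

```python
def top_k_features(weights, k=100):
    """Indices of the k largest weights, best first (ties keep original order)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]
```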

Step three, establishing a fusion classifier on the three feature subspaces

On each of the three feature subspaces Subset-ERGS, Subset-mRMR and Subset-Spearman constructed on the data set, classifiers are built with SVM and with DNN, and the final classification results are integrated by majority voting. The SVM classifier uses a linear kernel. For the DNN, grid search is used to tune several parameters, including the network structure, learning rate, activation function, number of training epochs and batch size. Fifty repetitions of five-fold cross-validation are used to evaluate the classification performance indexes accuracy (AUC), specificity (SPE) and sensitivity (SEN). Fig. 1 shows the complete structure of the integrated classification model established by the invention, and Fig. 2 shows the network architecture of the DNN classifier after tuning.

The following tables compare the classification performance of the methods of the invention (EMS-SVM and EMS-DNN) with other data analysis methods commonly used for biological omics data, namely SVM-RFE, RF and XGBOOST, under fifty repetitions of five-fold cross-validation on ten public data sets. The best performance obtained on each data set is shown in bold. The results show that the classification performance of the method is far higher than that of the other techniques on the AUC, SPE and SEN indexes alike, which demonstrates the effectiveness of the method.

TABLE 1 Accuracy comparison of EMS-DNN and EMS-SVM with other effective methods

TABLE 2 Sensitivity comparison of EMS-DNN and EMS-SVM with other effective methods

TABLE 3 Specificity comparison of EMS-DNN and EMS-SVM with other effective methods
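The evaluation protocol can be sketched as follows: fifty repetitions of five-fold cross-validation, with accuracy, sensitivity and specificity computed from the predictions of each fold. The sketch assumes stratified folds and a fixed random seed, neither of which is stated in the text, and it treats label 1 as the diseased (positive) class. Classifier training is omitted, and the names `confusion_metrics` and `repeated_stratified_folds` are hypothetical.

```python
import random

def confusion_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

def repeated_stratified_folds(y, n_splits=5, n_repeats=50, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for c in sorted(set(y)):
            idx = [i for i, lab in enumerate(y) if lab == c]
            rng.shuffle(idx)
            # Deal the shuffled samples of each class round-robin into the folds.
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test
```

For 44 samples with 22 per class, this produces 250 train/test splits with test folds of 8 or 10 samples.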

Claims

1. A multi-angle fusion-based method for analyzing biological omics data comprises the following steps:

Step one, data preprocessing

Preprocessing of the data set consists of two parts. The first part handles missing values: features whose number of missing values in each class of samples exceeds eighty percent of the number of samples in that class are deleted, and the missing values of the remaining features are filled with the mean of that feature over the samples of the same class. Let $F = \{f_1, f_2, \ldots, f_m\}$ be the feature set, where $m$ is the number of features; let $Y = \{y_j;\ j = 1, 2\}$ be the set of class labels; and let $S = \{s_1, s_2, \ldots, s_n\}$ be the sample set, where $n$ is the number of samples;

wherein $f^{scaled}_{ik}$ is the value of feature $f_i$ on sample $s_k$ after Z-Score standardization, $f_{ik}$ is the original value of feature $f_i$ on sample $s_k$, $u_i$ is the mean of feature $f_i$ over all samples, and $\sigma_i$ is the standard deviation of feature $f_i$ over all samples; the standardized feature set $F = \{f^{scaled}_1, f^{scaled}_2, \ldots, f^{scaled}_m\}$ is thereby obtained;

Step two, determining the feature subspace from multiple angles

the first feature subspace (Subset-ERGS) is determined with the ERGS feature selection method, and the specific formulas of ERGS are as follows:

$$w_i = 1 - AC_i / \max\{AC_u : u = 1, 2\} \qquad (3)$$

In formula (2):

$R_{ij}$ is the effective range of feature $f^{scaled}_i$ in class $y_j$ ($j = 1, 2$);

$r^{+}_{ij}$ and $r^{-}_{ij}$ are the upper and lower bounds of the effective range $R_{ij}$;

$u_{ij}$ is the mean of feature $f^{scaled}_i$ in class $y_j$;

$\sigma_{ij}$ is the standard deviation of feature $f^{scaled}_i$ in class $y_j$;

$p_j$ is the prior probability of class $y_j$;

In formula (3):

$w_i$ is the weight of feature $f^{scaled}_i$;

In formula (4):

finally, the ERGS method selects the features with high weights $w_i$;

In formula (5):

$w_i$ is the final score of feature $f^{scaled}_i$ calculated by the mRMR method;

$I(f^{scaled}_i; x)$ is the mutual information between feature $f^{scaled}_i$ and a selected feature $x$, where $X$ denotes the set of selected features;

the mRMR feature selection method selects features according to the final score $w$, keeping the features with high scores;

In formula (6):

$q_{ij}$ is the resulting Spearman correlation score between features $f^{scaled}_i$ and $f^{scaled}_j$; $u_i$ and $u_j$ are the means of $f^{scaled}_i$ and $f^{scaled}_j$ over all samples; $f^{scaled}_{ik}$ and $f^{scaled}_{jk}$ are the values of $f^{scaled}_i$ and $f^{scaled}_j$ on sample $s_k$; the larger the final value of $q_{ij}$, the higher the Spearman correlation between the two features $f^{scaled}_i$ and $f^{scaled}_j$;

wherein $o_{ij}$ is the final Spearman correlation difference score of the two features $f^{scaled}_i$ and $f^{scaled}_j$ across the two classes, and $q^{+}_{ij}$ and $q^{-}_{ij}$ are the Spearman correlation scores of the two features on the two different classes; once the Spearman correlation difference $o_{ij}$ has been calculated, the final difference correlation network is constructed; each feature in the feature set $F = \{f^{scaled}_1, f^{scaled}_2, \ldots, f^{scaled}_m\}$ is defined as a node of the network, and the $o_{ij}$ obtained from formula (7) is used as the weight of the edge between the two corresponding nodes; the edge weights and node weights of the constructed difference network are netEdge_Weight and netNode_Weight respectively:

$$\mathrm{netEdge\_Weight}_{ij} = o_{ij} \qquad (8)$$

step three, establishing a fusion classifier on different feature subspaces