CN113192556A - Genotype and phenotype association analysis method in multi-omics data based on small samples - Google Patents

Genotype and phenotype association analysis method in multi-omics data based on small samples

Info

Publication number
CN113192556A
Authority
CN
China
Prior art keywords
gene
cluster
minimum
snp
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110288323.5A
Other languages
Chinese (zh)
Other versions
CN113192556B (en)
Inventor
郭新鹏
宋亚飞
刘帅忱
刘树慧
王艺菲
尚学群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Air Force Engineering University of PLA
Original Assignee
Northwestern Polytechnical University
Air Force Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University, Air Force Engineering University of PLA filed Critical Northwestern Polytechnical University
Priority to CN202110288323.5A priority Critical patent/CN113192556B/en
Publication of CN113192556A publication Critical patent/CN113192556A/en
Application granted granted Critical
Publication of CN113192556B publication Critical patent/CN113192556B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G16B40/30 Unsupervised data analysis
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is a genotype and phenotype association analysis method for multi-omics data based on small samples, comprising the following steps: generating a weighted undirected gene association graph from a protein network and gene expression values, and clustering the graph with the SPICi clustering method to generate gene clusters; screening the gene clusters with the group Lasso method; obtaining the SNP clusters corresponding to the screened gene clusters from eQTL data; constructing each SNP cluster, its corresponding gene cluster and the phenotype into a three-layer network class block, regressing the SNP-gene associations in each class block with the sparse partial least squares method, and modeling the gene-phenotype associations with logistic regression; and averaging the prediction results of all class blocks to obtain the final prediction result. The method addresses the problem that, in a three-layer network with small samples, the number of features is too large for effective regression; it improves prediction accuracy, gives the results a clearer biological meaning, and takes tissue specificity into account.

Description

Genotype and phenotype association analysis method in multi-omics data based on small samples
Technical Field
The invention relates to the field of bioinformatics, and in particular to a method for studying genotype and phenotype associations in multi-omics data based on small samples.
Background
An important goal of current genetics is to establish a complete functional link between genotype and phenotype, the so-called genotype-phenotype map. Studying the association between genotype and phenotype makes the process of genetic variation clearer. Genome-wide association studies (GWAS) are a common and effective way to reveal associations between an individual's genetic background and a particular disease or trait. The principle is to find the variant sites across the whole genome and test them for association with the phenotype. Over the past decade, many genome-wide association studies have identified a large number of genetic variants associated with complex human diseases or other traits. These findings can identify novel variant-trait associations, provide insight into ethnic differences in complex traits, and support a variety of clinical applications. However, most of the variants found to date account for only a small fraction of the causal genetic factors. Although GWAS have discovered thousands of single nucleotide polymorphisms for complex diseases and traits, a single omics level can reveal only limited biological mechanisms, and the functional meaning and mechanism of the associated sites remain largely unknown.
Because of the limitations of a single omics level, other omics data need to be fused in order to predict the biological association between genotype and phenotype more accurately and to make the process of genetic variation clearer. With such data as support, the interactions among the omics layers can also be studied. This gives researchers a new opportunity to detect true genotype-phenotype associations and, at the same time, to reveal the mechanisms behind them. One example is analyzing the effect of single nucleotide polymorphism (SNP) data on a phenotype in conjunction with gene expression data. Current multi-omics data fusion methods follow two main ideas: multi-level fusion analysis and multi-dimensional fusion analysis.
When multi-omics biological networks are used to mine the association between genotype and phenotype, differences in phenotypic traits are generally assumed to arise from the successive influence of each omics layer. For example, differences at SNP sites change gene expression, which in turn changes protein expression and finally causes disease. This layer-by-layer approach is generally called "multi-level fusion analysis". Its main workflow is to establish the association between every two adjacent omics layers with methods such as linear regression, partial least squares, canonical correlation analysis and correlation coefficients, and finally to perform tasks such as disease prediction through the hierarchical relationships among the omics layers. The most common multi-level fusion method at present is the "three-layer method" (S. Lee, S. Kong, E.P. Xing, A network-driven approach for genome-wide association mapping, Bioinformatics, 32 (2016) i164-i173). The idea is as follows: first, linear regression is used to establish the association between SNPs and genes; then logistic regression is used to establish the association between genes and the phenotype (the phenotype takes only the values 0 and 1, indicating whether or not a certain disease is present). Predicting the disease by analyzing the influence of SNPs on gene expression levels is more accurate than predicting the disease directly from SNPs, which shows that a three-layer network better reflects the true biological relationships. However, when the regression model is built on the three-layer network, the associations within each omics data set are not considered, so the accuracy of the model is low.
"Multi-dimensional fusion analysis" can be subdivided into three specific implementations: integration based on feature association, integration based on intermediate transformation, and integration based on the fusion of single-omics models. The feature-association approach preprocesses the raw omics data before analysis and modeling, fuses the features of the processed omics data with a machine learning algorithm into a more comprehensive input matrix, and builds a prediction model from that matrix. The intermediate-transformation approach first converts each omics data set into an intermediate form and then fuses the intermediate forms to generate the prediction model. The single-omics model fusion approach builds a separate prediction model from each omics data set and then fuses these models into a final prediction model. These multi-dimensional fusion methods share a common defect: all omics data are treated as having equal status, so each omics data set can only be used to analyze the phenotype from a different angle, and the biological associations among the omics layers cannot be analyzed. Although multi-dimensional fusion analysis can improve the accuracy of phenotype prediction, the biological meaning of the fused multi-omics data is not particularly clear.
Moreover, studying the association between genotype and phenotype with multi-omics analysis methods requires all omics data to come from the same sample set. Because the number of SNP features is huge, both types of multi-omics analysis methods need large sample sizes to build their models, yet clinical data are difficult to obtain because of the protection of patients' personal privacy and each institution's own requirements on data. Therefore, publicly available clinical data cannot meet the requirements of multi-omics data fusion methods in terms of either sample size or number of omics types.
Disclosure of Invention
The invention provides a genotype and phenotype association analysis method for multi-omics data based on small samples, which specifically comprises the following steps:
first, generating a weighted undirected gene association graph from a protein network and gene expression values, and clustering the graph with the SPICi clustering method to generate gene clusters;
a weighted gene network graph is generated from the protein network data and the gene expression data; the generated gene network graph is clustered with the SPICi clustering method; the SPICi method has three hyperparameters, namely the minimum cluster size, the minimum support threshold and the minimum cluster density; the three parameters jointly determine the number of clusters and the number of elements in each cluster; the settings of the three hyperparameters are analyzed further as follows;
the minimum cluster size decides whether a cluster is kept by comparing it with the number of genes the cluster contains, namely, if the number of elements in the cluster is greater than the minimum cluster size, the cluster is kept, otherwise it is discarded; if the minimum cluster size is set too small, the purpose of capturing the associations between genes cannot be achieved, but if it is set too large, clusters are deleted by mistake; based on tests on different data, the minimum cluster size is finally set in the range [4, 6]; in the gene network graph $G = (V, E)$, $V$ denotes the set of all vertices of $G$ and $E$ denotes the set of all edges of $G$; for any vertex $u$ and a set of vertices $S \subseteq V$ connected to $u$, the support is defined as:
$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$
support(u, S) is the sum of the weights of all edges connecting vertex $u$ to the vertices of $S$, and $w_{u,v}$ denotes the weight of the edge between vertex $u$ and vertex $v$; the weight of an edge is the Pearson correlation coefficient of the two vertex vectors, taken in absolute value, so that $w_{u,v} \in (0, 1]$; the minimum support threshold is set in the range [0.4, 0.7]; the cluster density density(S) is defined as the sum of the edge weights within $S$ divided by the total number of possible edges, and reflects the compactness of the subgraph; the formula is:
$$\mathrm{density}(S) = \frac{\sum_{(u,v) \in E,\; u,v \in S} w_{u,v}}{|S|\,(|S|-1)/2}$$
the minimum cluster density is set in the range [0.1, 0.6];
second, screening the gene clusters with the group Lasso method;
because the number of genes is large, the number of gene clusters obtained in the first step is also relatively large, so the grouped least-angle regression algorithm (group Lasso) is used to regress the phenotype on the gene clusters; if the gene clusters form $L$ groups, the selection of individual features in Lasso regression is generalized to the selection of groups of features, and the objective function is:
$$\min_{\beta} \; \frac{1}{2} \Big\| Y - \sum_{l=1}^{L} X_l \beta_l \Big\|_2^2 + \lambda \sum_{l=1}^{L} \sqrt{p_l}\, \|\beta_l\|_2$$
where $\lambda$ is the regularization parameter controlling the overall penalty, $X$ and $Y$ are the independent-variable and dependent-variable matrices respectively, $X_l$ is the part of $X$ belonging to group $l$, $\beta$ is the coefficient vector, $\beta_l$ is the coefficient vector of group $l$, and $\sqrt{p_l}$, with $p_l$ the number of features in group $l$, is the weight of each group, which is adjusted as needed; if $\beta_l = 0$, the corresponding gene cluster is deleted, whereas if $\beta_l \neq 0$, the corresponding gene cluster is kept, which achieves the screening of gene clusters;
third, obtaining the SNP clusters corresponding to the screened gene clusters from eQTL data;
the second step screens out the gene clusters whose coefficients are not zero, and these gene clusters are considered the most likely to influence the phenotype; these gene clusters cause disease mainly through the influence of variant sites on the genes in the clusters, so the association between SNPs and genes needs to be established in order to reflect the complete pathway from genotype to phenotype; the expression quantitative trait loci (eQTL) in the GTEx data reflect the associations between SNPs and genes in each tissue, and the SNPs associated with the genes of each cluster are looked up in the eQTL data, giving the SNP cluster corresponding to each gene cluster;
fourth, constructing each SNP cluster, its corresponding gene cluster and the phenotype into a three-layer network class block, regressing the SNP-gene associations in each class block with the sparse partial least squares method, and modeling the gene-phenotype associations with logistic regression;
each SNP cluster, its corresponding gene cluster and the phenotype are combined into a three-layer network, called a class block, so that each class block forms one three-layer network; when the SNP-gene associations of each class block are processed, the associations between layers and within layers are considered together; the SNP-gene associations are solved with the sparse partial least squares (SPLS) method; the gene-phenotype associations are solved with logistic regression;
fifth, averaging the prediction results of all class blocks to obtain the final prediction result;
the fourth step constructs a number of class blocks for predictive analysis; there is no strong dependence among the class blocks, so they can be processed in parallel, and the prediction results of the class blocks are averaged to obtain the final prediction result.
In a specific embodiment of the invention, the parameter tests in the first step are carried out in increments of 0.1.
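The parameter ranges stated for the first step can be enumerated as a small search grid. The sketch below is only an illustration: it assumes the 0.1 increment applies to both the minimum support threshold and the minimum cluster density, and that the minimum cluster size moves in integer steps; the variable names are hypothetical and are not taken from the SPICi software.

```python
from itertools import product

# Ranges from the text; the step sizes are assumptions (see lead-in).
min_cluster_sizes = [4, 5, 6]                            # range [4, 6], integer steps
min_support_thresholds = [t / 10 for t in range(4, 8)]   # 0.4 .. 0.7 in steps of 0.1
min_cluster_densities = [d / 10 for d in range(1, 7)]    # 0.1 .. 0.6 in steps of 0.1

# Every combination of the three hyperparameters that would be tested.
grid = list(product(min_cluster_sizes,
                    min_support_thresholds,
                    min_cluster_densities))

print(len(grid), "parameter combinations")
print("first:", grid[0], "last:", grid[-1])
```

Each tuple would then be passed to one clustering run, and the resulting cluster counts and sizes compared.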
The advantages of the method are:
a) it addresses the problem that, in a three-layer network with small samples, the number of features is too large for effective regression;
b) it considers the associations within each omics layer, which improves prediction accuracy;
c) it analyzes the biological pathway associations between different omics layers, which gives the results a clearer biological meaning;
d) it takes tissue specificity into account: each tissue of a multicellular organism has characteristics that distinguish it from other tissues, and these differences are largely due to the specific morphology and physiological functions conferred by the genes specifically expressed in each tissue;
e) it combines the advantages of existing multi-omics fusion methods: it keeps the ability of multi-level fusion analysis to reflect biologically meaningful associations, and it incorporates the ensemble learning idea used in multi-dimensional fusion analysis.
Drawings
FIG. 1 shows the genotype-phenotype association analysis model based on gene cluster grouping;
FIG. 2 shows the intra-layer and inter-layer network association analysis within each class block;
FIG. 3 shows the workflow of the genotype-phenotype association analysis based on small-sample multi-omics data;
FIG. 4 shows the ROC curves and AUC values of four genotype-phenotype association analyses on the GSE33356 data, where FIG. 4(a), FIG. 4(b), FIG. 4(c) and FIG. 4(d) show AUC values of 0.811, 0.619, 0.441 and 0.586, respectively;
FIG. 5 shows the ROC curves and AUC values of four genotype-phenotype association analyses on the GSE114269 data, where FIG. 5(a), FIG. 5(b), FIG. 5(c) and FIG. 5(d) show AUC values of 0.779, 0.707, 0.671 and 0.701, respectively.
Detailed Description
The invention is described in detail below with reference to the drawings. The method is divided into the following five steps.
The first step is to generate a weighted undirected gene association graph from a protein network and gene expression values, and to cluster the graph with the SPICi clustering method to generate gene clusters.
Protein network data come from the PICKLE database (Protein InteraCtion KnowLedgebasE, http://www.pickle.gr/). The protein network data are mapped to associations between genes, forming an unweighted gene association graph. Pairwise Pearson correlation coefficients between genes are then calculated from the gene expression data and used as edge weights, producing a weighted gene network graph. These procedures are well known to those skilled in the art and are not described further. In the gene network graph, closely associated genes are more likely to influence expression through their interactions, so the invention clusters the gene association graph to obtain a number of gene association clusters, as shown in the middle layer of FIG. 1. The invention compared several clustering methods, including MCODE, RNSC, CFinder, NetworkBLAST, DPClus and MCUPGMA, and finally adopted the SPICi (Speed and Performance In Clustering) method (P. Jiang, M. Singh, SPICi: a fast clustering algorithm for large biological networks, Bioinformatics, 26 (2010) 1105-1111) to cluster the generated gene network graph. The SPICi method has three hyperparameters: the minimum cluster size, the minimum support threshold and the minimum cluster density. Together, the three parameters determine the number of clusters and the number of elements in each cluster. To facilitate subsequent analysis, the final clustering must keep both the number of clusters and the number of elements per cluster within a suitable range. For example, if a cluster contains too many elements, then in the subsequent grouped least-angle regression algorithm (group Lasso) (Huang J, Breheny P, Ma S. A selective review of group selection in high-dimensional models. Stat Sci, 2012, 27: 481), the more elements each group contains, the larger the objective error for the same penalty parameter. If a cluster contains too few elements, the influence of gene associations on the disease cannot be analyzed effectively. On the basis of this general principle, the settings of the three hyperparameters are analyzed further according to their individual characteristics.
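The graph construction described above (protein-network edges weighted by the absolute Pearson correlation of the two genes' expression vectors) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the edge-list input format and the toy data are all hypothetical, and real PICKLE or expression files would need their own parsing.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # constant vector: correlation undefined, treated as no association
    return cov / (sx * sy)

def build_weighted_graph(ppi_edges, expression):
    """Weight each protein-network edge by |Pearson r| of the two genes' expression."""
    graph = {}
    for u, v in ppi_edges:
        if u not in expression or v not in expression:
            continue  # no expression data for one endpoint: edge cannot be weighted
        w = min(1.0, abs(pearson(expression[u], expression[v])))
        if w > 0.0:  # keep weights in (0, 1]
            graph[frozenset((u, v))] = w
    return graph

# Hypothetical toy data: three protein-network edges, expression for three genes.
ppi_edges = [("G1", "G2"), ("G2", "G3"), ("G1", "G4")]
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [4.0, 3.0, 2.0, 1.0],
}
graph = build_weighted_graph(ppi_edges, expression)
for edge, weight in graph.items():
    print(sorted(edge), round(weight, 3))
```

Storing each undirected edge under a `frozenset` key means the pair (u, v) and the pair (v, u) refer to the same entry, which matches the undirected graph in the text.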
The minimum cluster size decides whether a cluster is kept by comparing it with the number of genes the cluster contains: if the number of elements in the cluster is greater than the minimum cluster size, the cluster is kept; otherwise it is discarded. If the minimum cluster size is set too small, the purpose of capturing the associations between genes cannot be achieved, but if it is set too large, clusters are deleted by mistake. Based on tests on different data, the minimum cluster size is finally set in the range [4, 6]. In the gene network graph $G = (V, E)$, $V$ denotes the set of all vertices of $G$ and $E$ denotes the set of all edges of $G$. For any vertex $u$ and a set of vertices $S \subseteq V$ connected to $u$, the support is defined as:
$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$
support(u, S) is the sum of the weights of all edges connecting vertex $u$ to the vertices of $S$, and $w_{u,v}$ denotes the weight of the edge between vertex $u$ and vertex $v$. In the invention, the weight of an edge is the Pearson correlation coefficient of the two vertex vectors. Because the Pearson correlation coefficient can be positive or negative, terms could cancel each other when support(u, S) is computed, so the invention takes the absolute value of the Pearson correlation coefficient, giving $w_{u,v} \in (0, 1]$. A Pearson correlation coefficient above 0.2 is generally regarded as at least a weak correlation. Too many weakly associated genes may introduce additional noise, so the invention requires a gene to be connected by at least two edges of weak correlation or stronger, which corresponds to a support of at least 2 × 0.2 = 0.4, and sets the lower bound of the minimum support threshold to 0.4. Tests on various data show that when the minimum support threshold exceeds 0.7, the total number of genes drops sharply, and the influence of gene associations on the disease is no longer adequately reflected. Therefore, the invention sets the minimum support threshold in the range [0.4, 0.7]. The cluster density density(S) is defined as the sum of the edge weights within $S$ divided by the total number of possible edges, and reflects the compactness of the subgraph. The formula is:
$$\mathrm{density}(S) = \frac{\sum_{(u,v) \in E,\; u,v \in S} w_{u,v}}{|S|\,(|S|-1)/2}$$
As the formula shows, a minimum cluster density that is too small yields clusters with more elements and fewer clusters in total. When the density(S) of a cluster does not reach the minimum cluster density, the program splits the cluster into two or more smaller clusters. The minimum cluster density therefore directly affects the total number of clusters and is the hyperparameter with the largest influence on the clustering result. Based on experimental comparison, the minimum cluster density is set in the range [0.1, 0.6], and in the experiments the parameter was tested in increments of 0.1.
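The two quantities defined in this passage, support(u, S) and density(S), can be computed directly from an edge-weight table. The sketch below follows the verbal definitions in the text (sum of edge weights between u and S; sum of edge weights inside S divided by the number of possible edges). The dictionary layout and the toy weights are hypothetical, and the code does not reproduce SPICi's cluster-growing procedure itself.

```python
def support(u, S, w):
    """Sum of the weights of the edges between vertex u and the vertices in S."""
    return sum(w.get(frozenset((u, v)), 0.0) for v in S if v != u)

def density(S, w):
    """Sum of edge weights inside S divided by the number of possible edges."""
    nodes = sorted(S)
    n = len(nodes)
    if n < 2:
        return 0.0  # a single vertex has no possible edges
    total = sum(
        w.get(frozenset((nodes[i], nodes[j])), 0.0)
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)

# Hypothetical edge weights (absolute Pearson correlations in (0, 1]).
w = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.6,
    frozenset(("B", "C")): 0.3,
    frozenset(("C", "D")): 0.2,
}
cluster = {"A", "B", "C"}
print("support of D with respect to the cluster:", support("D", cluster, w))
print("density of the cluster:", round(density(cluster, w), 3))
```

In this toy graph, vertex D is linked to the cluster by a single weak edge, so its support falls below the 0.4 lower bound discussed in the text.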
The second step is to screen the gene clusters with the group Lasso method.
Because the number of genes is large, the number of gene clusters obtained in the first step is also relatively large, so the grouped least-angle regression algorithm (group Lasso) can be used to regress the phenotype on the gene clusters. This algorithm is a generalization of Lasso regression (Tibshirani R. Regression Shrinkage and Selection via the Lasso [J]. Journal of the Royal Statistical Society, 1996, 58 (1): 267-288). If the gene clusters form $L$ groups, the selection of individual features in Lasso regression is generalized to the selection of groups of features, and the objective function is:
$$\min_{\beta} \; \frac{1}{2} \Big\| Y - \sum_{l=1}^{L} X_l \beta_l \Big\|_2^2 + \lambda \sum_{l=1}^{L} \sqrt{p_l}\, \|\beta_l\|_2$$
where $\lambda$ is the regularization parameter controlling the overall penalty, $X$ and $Y$ are the independent-variable and dependent-variable matrices respectively, $X_l$ is the part of $X$ belonging to group $l$, $\beta$ is the coefficient vector, $\beta_l$ is the coefficient vector of group $l$, and $\sqrt{p_l}$, with $p_l$ the number of features in group $l$, is the weight of each group and can be adjusted as needed. If $\beta_l = 0$, the corresponding gene cluster is deleted, whereas if $\beta_l \neq 0$, the corresponding gene cluster is kept, which achieves the screening of gene clusters.
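The screening rule (delete a cluster when its whole coefficient vector is zero) comes from the way the group penalty shrinks each group as a unit. A minimal sketch of that group-wise soft-thresholding step is given below. It is an illustration under stated assumptions, not a full solver: it assumes the common choice of the square root of the group size as the group weight, and it applies the shrinkage once to hypothetical coefficient vectors, which gives the exact solution only when the design matrix is orthonormal. In a proximal-gradient solver the same step would be repeated after every gradient update.

```python
import math

def group_soft_threshold(z, threshold):
    """Shrink a whole group of coefficients toward zero; zero it if its norm is small."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= threshold:
        return [0.0] * len(z)          # the whole group is removed
    scale = 1.0 - threshold / norm
    return [scale * v for v in z]

def screen_clusters(coeffs, lam):
    """Apply the group-wise shrinkage and keep clusters whose coefficients survive."""
    kept = {}
    for name, z in coeffs.items():
        # Assumed group weight: sqrt(group size), scaled by the penalty lam.
        beta = group_soft_threshold(z, lam * math.sqrt(len(z)))
        if any(b != 0.0 for b in beta):
            kept[name] = beta
    return kept

# Hypothetical unpenalised coefficient vectors for three gene clusters.
coeffs = {
    "cluster_1": [3.0, 4.0, 0.0, 0.0],
    "cluster_2": [0.3, 0.4, 0.0, 0.0],
    "cluster_3": [0.0, 6.0, 8.0, 0.0],
}
kept = screen_clusters(coeffs, lam=1.0)
print(sorted(kept))
```

Here the cluster with a small overall coefficient norm is dropped as a unit, while the other two are kept with shrunken coefficients. Raising `lam` removes more clusters.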
The third step is to obtain the SNP clusters corresponding to the screened gene clusters from eQTL data.
The second step selects the gene clusters whose coefficients are not zero; these are considered the most likely to influence the phenotype. The invention studies genotype-phenotype association. For the genes in each gene cluster, the gene-phenotype association can be established by multiple regression, which is well known to those skilled in the art and is not described further. These gene clusters cause disease mainly through the influence of variant sites on the genes in the clusters, so the association between SNPs and genes needs to be established in order to reflect the complete pathway from genotype to phenotype. The expression quantitative trait loci (eQTL) in the GTEx data (Lonsdale J., Thomas J., Salvatore M., Phillips R. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013, 45, 580-585) reflect the associations between SNPs and genes in each tissue. The SNPs associated with the genes of each cluster can be looked up in the eQTL data, which gives the SNP cluster corresponding to each gene cluster.
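The lookup described here amounts to a join between the screened gene clusters and a table of SNP-gene pairs. The sketch below assumes the eQTL data have already been reduced to (SNP, gene) pairs for the tissue of interest; the function name, cluster names and identifiers are hypothetical, and the real GTEx files carry more columns than are used here.

```python
from collections import defaultdict

def snp_clusters_from_eqtl(gene_clusters, eqtl_pairs):
    """For every gene cluster, collect the SNPs that the eQTL data link to its genes."""
    gene_to_snps = defaultdict(set)
    for snp, gene in eqtl_pairs:
        gene_to_snps[gene].add(snp)
    return {
        name: sorted(set().union(*(gene_to_snps.get(g, set()) for g in genes)))
        for name, genes in gene_clusters.items()
    }

# Hypothetical screened gene clusters and (SNP, gene) eQTL pairs for one tissue.
gene_clusters = {"cluster_1": ["G1", "G2"], "cluster_3": ["G5"]}
eqtl_pairs = [("rs1", "G1"), ("rs2", "G1"), ("rs2", "G2"), ("rs9", "G7")]

snp_clusters = snp_clusters_from_eqtl(gene_clusters, eqtl_pairs)
for name, snps in snp_clusters.items():
    print(name, snps)
```

A gene cluster with no eQTL hits ends up with an empty SNP cluster; whether such a cluster should be dropped before the fourth step is not stated in the text.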
And fourthly, constructing each SNP cluster, the corresponding gene cluster and the phenotype into a three-layer network class block, performing regression operation on the association relation between the SNP and the gene in each class block by adopting a sparse partial least square method, and performing operation on the association relation between the gene and the phenotype by adopting logistic regression.
Corresponding SNP clusters, gene clusters and phenotypes are combined into a three-layer network, which is called as a block (block) in the invention, and each block can construct a three-layer network. The number of SNP and genes in the three-layer structure after the classification block is sharply reduced, the amount of samples required during effective regression is also reduced, and help is provided for the incapability of regression of large features under the condition of small samples. When the three-layer structure of each class block is analyzed, partial path relation can be predicted only by using an interlayer regression method, but the intra-omic association relation is not considered, so that the method is not in line with biological reality. However, if only the intra-omic association relationship is considered, the pathway association between other omics is not considered, the whole biological system cannot be well reflected, and only local conditions can be seen, as shown in fig. 2, wherein the upper nodes represent SNPs, the middle nodes represent genes, the bottom nodes represent phenotypes, the solid lines represent intra-omic association relationships, and the dotted lines represent inter-omic association relationships. When the invention processes the association relationship between the SNP and the gene of each block, the original association relationship only considering the single element between layers (the left half part of figure 2) is improved into the association relationship between layers and in-layer (the right half part of figure 2). Namely, the original many-to-one incidence relation is changed into a many-to-many incidence relation. 
The sparse partial least squares (SPLS) method (A. Csala, F. Voorbraak, A. H. Zwinderman, M. H. Hof, Sparse redundancy analysis of high-dimensional genetic and genomic data, Bioinformatics, 33 (2017) 3228-3234) combines the advantages of principal component analysis, canonical correlation analysis and linear regression analysis. It can effectively handle the cases in which the number of samples is far smaller than the number of features, so that ordinary regression fails, and in which the features are multicollinear. The invention therefore replaces the original multivariate regression with SPLS when solving for the association between SNPs and genes. SPLS is obtained by adding a penalty function to the solution procedure of partial least squares (PLS) (de Jong S. SIMPLS: an alternative approach to partial least squares regression [J]. Chemometrics and Intelligent Laboratory Systems, 1993, 18(3): 251-263). Both PLS and SPLS are known to those skilled in the art and are not described in detail.
The association between genes and the phenotype in each block is modeled with logistic regression, which is well known to those skilled in the art and is not described further.
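The assembly of blocks described in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Block` class, the `build_blocks` function, the input formats and the rule of skipping a gene cluster with no linked SNP are all assumptions made for the example. The regression steps (SPLS and logistic regression) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One three-layer block: SNP layer, gene layer, phenotype layer."""
    snps: list
    genes: list
    phenotype: str


def build_blocks(gene_clusters, eqtl_pairs, phenotype):
    """Attach to each screened gene cluster the SNPs linked to its genes.

    gene_clusters: list of lists of gene names (hypothetical input format).
    eqtl_pairs: iterable of (snp_id, gene_name) pairs (hypothetical input format).
    """
    # Index the eQTL pairs by gene for fast lookup.
    snps_by_gene = {}
    for snp, gene in eqtl_pairs:
        snps_by_gene.setdefault(gene, set()).add(snp)

    blocks = []
    for genes in gene_clusters:
        snps = set()
        for gene in genes:
            snps |= snps_by_gene.get(gene, set())
        # Assumption: a cluster without any linked SNP yields no block.
        if snps:
            blocks.append(Block(sorted(snps), list(genes), phenotype))
    return blocks


clusters = [["G1", "G2"], ["G3"], ["G4"]]
pairs = [("rs1", "G1"), ("rs2", "G2"), ("rs1", "G2"), ("rs3", "G3")]
blocks = build_blocks(clusters, pairs, "tumor_vs_normal")
for block in blocks:
    print(block)
```

Each resulting block holds far fewer SNPs and genes than the full data set, which is what makes a per-block regression feasible with few samples.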
Fifth, the prediction results of all blocks are averaged to obtain the final prediction result.
Through the fourth step, multiple blocks can be constructed for prediction analysis, as shown in fig. 1. There is no strong dependence among the blocks, so they can be processed in parallel. This resembles ensemble learning, in which a learner is trained on a random subset of the samples, several learners are trained by repeated sampling, and the learners are then combined. Instead of sampling at random, the invention selects the gene clusters most likely to affect the phenotype, which works better than random sampling. Inter-layer association analysis is performed on each block as in the fourth step to obtain a prediction result per block, and these results are averaged to give the final prediction; the whole process is shown in fig. 3.
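The averaging step can be sketched as below, assuming each block outputs one predicted probability per sample. The function name, the example numbers and the 0.5 decision threshold are assumptions made for illustration; the source only states that the block results are averaged.

```python
def average_block_predictions(block_probs):
    """Average per-sample predicted probabilities across blocks.

    block_probs: list of lists, one inner list of probabilities per block,
    all of the same length (one entry per sample).
    """
    if not block_probs:
        raise ValueError("at least one block is required")
    n_samples = len(block_probs[0])
    if any(len(p) != n_samples for p in block_probs):
        raise ValueError("all blocks must score the same samples")
    n_blocks = len(block_probs)
    return [sum(p[i] for p in block_probs) / n_blocks for i in range(n_samples)]


# Three blocks, three samples (made-up numbers).
probs = average_block_predictions([
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.2],
    [0.8, 0.3, 0.1],
])
# Assumed decision threshold of 0.5 for the binary phenotype.
labels = [int(p >= 0.5) for p in probs]
print(probs, labels)
```

Because the blocks do not depend on one another, the per-block predictions could be computed in parallel before this final combination.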
Specific examples
1. Data source and preprocessing
To verify the effectiveness of the method, it was tested on two data sets from the GEO database (Gene Expression Omnibus database, https://www.ncbi.nlm.nih.gov/geo/). GSE33356 is a study of lung adenocarcinoma. It contains lung tumor tissue and adjacent normal tissue collected from the same patients. Lung tumor and normal specimens from 84 non-smoking female adenocarcinoma patients were analyzed using Affymetrix SNP 6.0 and Affymetrix U133 Plus 2.0 chips. GSE114269 compares medullary breast cancer (MBC) with non-medullary basal-like breast cancer (non-MBC BLC) and contains 48 samples. These two data sets were chosen mainly to show that the method of the invention applies broadly to genotype-phenotype association problems on small-sample data with many features.
Protein network data were derived from the PICKLE database (Protein InteraCtion KnowLedgebasE, http://www.pickle.gr/). PICKLE is a meta-database of human protein interactions that integrates several publicly available protein interaction databases through gene ontology information.
eQTL data were derived from GTEx Analysis V7 (dbGaP Accession phs000424.v7.p2). To improve prediction accuracy, tissue-specific eQTL data matching each data set were selected: lung eQTL data for the first data set and breast tissue eQTL data for the second.
Data preprocessing mainly consists of unifying the SNP and gene naming across the data sources and cleaning the SNP data. SNPs with more than 10% missing values are removed; where fewer than 10% of the values are missing, they are filled with the most frequent value or the average value. Only SNPs with a minor allele frequency (MAF) greater than 0.1 are retained.
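The SNP filtering rules above can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are coded as counts of one allele (0, 1 or 2) with `None` for a missing call, missing values are filled with the most frequent value (the source also allows the average), and an SNP with exactly 10% missing values is kept. The function name and input format are hypothetical.

```python
from collections import Counter


def preprocess_snps(genotypes, max_missing=0.10, min_maf=0.10):
    """Filter and impute SNP genotypes.

    genotypes: dict mapping SNP id -> list of allele counts (0, 1, 2) or None.
    Returns a dict with the retained SNPs and their imputed genotype lists.
    """
    kept = {}
    for snp, calls in genotypes.items():
        observed = [c for c in calls if c is not None]
        n_missing = len(calls) - len(observed)
        # Drop SNPs whose missing rate exceeds the threshold.
        if not observed or n_missing / len(calls) > max_missing:
            continue
        # Impute missing calls with the most frequent observed value.
        mode = Counter(observed).most_common(1)[0][0]
        filled = [mode if c is None else c for c in calls]
        # Allele frequency from allele counts; MAF is the smaller of the two.
        freq = sum(filled) / (2 * len(filled))
        maf = min(freq, 1 - freq)
        if maf > min_maf:
            kept[snp] = filled
    return kept


raw = {
    "rs_ok": [0, 1, 2, 1, 0, 1, None, 1, 0, 1],          # 10% missing, MAF 0.4
    "rs_missing": [None, None, 0, 1, 2, 1, 0, 1, 0, 1],  # 20% missing
    "rs_rare": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],           # MAF 0.05
}
kept = preprocess_snps(raw)
print(sorted(kept))
```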
2. Analysis of predicted results
At present there is no established method for genotype-phenotype association analysis on small-sample, high-dimensional data, so the method of the invention is compared with the following three alternatives to verify its feasibility. In the comparative experiments, the method of the invention is named "grouping + partial least squares".
1) After gene clustering, multivariate regression is applied to the resulting three-layer network structure instead of the sparse partial least squares method. This comparison tests the influence of the associations among genes (intra-omic relationships) on the result. This method is named "grouping + multiple regression".
2) The genes are not clustered or screened; eQTL data are used to select the SNPs related to the genes, one large three-layer network is built from all the data, and lasso regression is applied directly to this structure. Because neither SNPs nor genes are clustered and the associations among genes are ignored, this comparison tests the effect on the result of considering only the inter-omic associations without gene clustering. This method is named "multiple regression method".
3) SNP and gene data are analyzed jointly with a multi-omics fusion method, which does not consider the associations between SNPs and genes. This comparison tests the effect of a multidimensional fusion method that ignores inter-omic associations; such a method cannot analyze the pathway relationships between omics. This method is named "multi-omics fusion".
Fig. 4 and fig. 5 compare the method of the invention with the above three methods on the GSE33356 and GSE114269 data sets respectively; from left to right they show the four methods "grouping + partial least squares", "grouping + multiple regression", "multiple regression method" and "multi-omics fusion". Because the task is a binary classification problem, the methods can be compared by their receiver operating characteristic (ROC) curves, which are well known to those skilled in the art and are not described further. Table 1 lists the AUC values of the four methods, where AUC (Area Under the Curve) is the area enclosed by the ROC curve and the coordinate axes.
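For reference, the AUC of a binary classifier can be computed without plotting the ROC curve, using the equivalent rank interpretation: it is the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses made-up labels and scores, not the values from Table 1, and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


auc = roc_auc([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.35, 0.8, 0.7, 0.1])
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, which is acceptable at the sample sizes discussed here (tens of samples).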
Table 1. Comparison of the AUC values of the four methods on the two data sets
Figure RE-GSB0000194193640000121
As can be seen from fig. 4, fig. 5 and Table 1, the method of the invention outperforms the other methods on both data sets. "Grouping + partial least squares" is clearly better than "grouping + multiple regression", which shows that taking the associations within an omics layer into account during the analysis is closer to the reality of a biological system. Both "grouping + partial least squares" and "grouping + multiple regression", which cluster the gene network, have an advantage over the other methods, indicating that clustering and screening the gene data and then applying ensemble learning is well suited to small-sample data. The "multiple regression method" performs worst of all, because it is not suited to small samples, although it does offer guidance for analyzing pathways between omics. The literature indicates that, for association analysis of SNPs, genes and phenotypes, this algorithm is appropriate when the sample size reaches 500 or more, and that it cannot regress effectively on small samples. "Multi-omics fusion" is a multidimensional fusion analysis method; its results differ little from those of "grouping + multiple regression" and in the tests of the invention were sometimes even better, but it cannot analyze the associations between omics, and with small samples "grouping + partial least squares" clearly gives better predictions.

Claims (2)

1. A method for genotype-phenotype association analysis in multi-omics data based on small samples, characterized by comprising the following steps:
first, generating a weighted undirected gene association graph from a protein network and gene expression values, and clustering the undirected graph with the SPICi clustering method to generate gene clusters;
generating a weighted gene network graph from the protein network data and the gene expression data; clustering the generated gene network graph with the SPICi clustering method; the SPICi method has three hyperparameters, namely the minimum cluster size, the minimum support threshold and the minimum cluster density; the three parameters jointly determine the number of clusters and the number of elements in each cluster; the settings of the three hyperparameters are analyzed further as follows;
the minimum cluster size determines whether a cluster is kept, by comparison with the number of genes the cluster contains: if the number of elements in the cluster is greater than the minimum cluster size, the cluster is kept, otherwise it is discarded; if the minimum cluster size is set too small, the purpose of capturing the associations between genes cannot be achieved, but if it is set too large, clusters are deleted by mistake; based on tests on different data, the minimum cluster size is finally set within the interval [4, 6]; in the gene network graph $G = (V, E)$, $V$ denotes the set of all vertices of $G$ and $E$ denotes the set of all edges of $G$; for any vertex $u$ and a set of vertices connected to $u$
$$S \subset V$$
the support is defined as:
$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$
$\mathrm{support}(u, S)$ is the sum of the weights of all edges connecting vertex $u$ to the vertices in $S$, and $w_{u,v}$ denotes the weight of the edge between vertex $u$ and vertex $v$; the edge weight is the absolute value of the Pearson correlation coefficient of the two vertex vectors, so that $w_{u,v} \in (0, 1]$; the minimum support threshold is set within the interval [0.4, 0.7]; the cluster density $\mathrm{density}(S)$ is defined as the sum of the edge weights divided by the total number of possible edges, and reflects the compactness of the subgraph; the formula is:
$$\mathrm{density}(S) = \frac{\sum_{(u,v) \in S} w_{u,v}}{|S|\,(|S| - 1)/2}$$
the minimum cluster density is set within the range [0.1, 0.6];
second, screening the gene clusters with the group Lasso method;
because the number of genes is large, the number of gene clusters obtained in the first step is also relatively large, and the group least angle regression algorithm is used to regress the phenotype on the gene clusters; if the gene clusters form L groups, the selection of individual features in Lasso regression is generalized to the selection of groups of features in the group least angle regression algorithm, with the objective function:
Figure FSA0000236239680000021
where $\lambda$ is the regularization parameter controlling the overall penalty, $X$ and $Y$ are the independent-variable and dependent-variable matrices respectively, $\beta$ is the coefficient vector, $\beta_l$ is the coefficient vector of the $l$-th group, and
Figure FSA0000236239680000022
is the weight of each group, adjusted as needed; if $\beta_l = 0$, the corresponding gene cluster is deleted; conversely, if $\beta_l \neq 0$, the corresponding gene cluster is retained, which achieves the purpose of gene cluster screening;
third, obtaining the SNP clusters corresponding to the screened gene clusters from eQTL data;
the gene clusters with non-zero coefficients are screened out by the second step and are considered the most likely to affect the phenotype; these gene clusters cause disease mainly because different loci influence the genes in the clusters, so the associations between SNPs and genes must be established in order to fully reflect the pathway relationship between genotype and phenotype; the expression quantitative trait loci (eQTL) in the GTEx data reflect the associations between SNPs and genes in each tissue, and the SNPs associated with the genes of each cluster are looked up in the eQTL data, giving the SNP cluster corresponding to each gene cluster;
fourth, constructing each SNP cluster, the corresponding gene cluster and the phenotype into a three-layer network block, regressing the association between SNPs and genes in each block with the sparse partial least squares method, and modeling the association between genes and the phenotype with logistic regression;
the corresponding SNP cluster, gene cluster and phenotype are combined into a three-layer network, called a block, and each block constitutes one three-layer network; when the association between SNPs and genes in each block is processed, the associations between layers are considered at the same time; the association between SNPs and genes is solved with the sparse partial least squares (SPLS) method; the association between genes and the phenotype is solved with logistic regression;
fifth, averaging the prediction results of all blocks to obtain the final prediction result;
multiple blocks are constructed for prediction analysis through the fourth step; there is no strong dependence among the blocks, so they can be processed in parallel, and the prediction results of the blocks are averaged to obtain the final prediction result.
2. The method for genotype-phenotype association analysis in multi-omics data based on small samples according to claim 1, characterized in that, in the experiments of the first step, the parameters are tested in increments of 0.1.
CN202110288323.5A 2021-03-17 2021-03-17 Genotype and phenotype association analysis method in multigroup chemical data based on small sample Expired - Fee Related CN113192556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110288323.5A CN113192556B (en) 2021-03-17 2021-03-17 Genotype and phenotype association analysis method in multigroup chemical data based on small sample

Publications (2)

Publication Number Publication Date
CN113192556A true CN113192556A (en) 2021-07-30
CN113192556B CN113192556B (en) 2022-04-26

Family

ID=76973366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110288323.5A Expired - Fee Related CN113192556B (en) 2021-03-17 2021-03-17 Genotype and phenotype association analysis method in multigroup chemical data based on small sample

Country Status (1)

Country Link
CN (1) CN113192556B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040126782A1 (en) * 2002-06-28 2004-07-01 Holden David P. System and method for SNP genotype clustering
US20090298063A1 (en) * 2005-10-25 2009-12-03 Interleuken Genetics, Inc. IL-1 Gene Cluster and Associated Inflammatory Polymorphisms and Haplotypes
CN109643578A (en) * 2016-06-01 2019-04-16 生命科技股份有限公司 For designing the method and system of the assortment of genes
CN109994200A (en) * 2019-03-08 2019-07-09 华南理工大学 A kind of multiple groups cancer data confluence analysis method based on similarity fusion
US10734096B1 (en) * 2019-11-29 2020-08-04 Kpn Innovations, Llc Methods and systems for optimizing supplement decisions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINPENG GUO.ET.: "IPMM: Cancer Subtype Clustering Model Based on Multiomics Data and Pathway and Motif Information", 《ADVANCED DATA MINING AND APPLICATIONS》 *
张睿: "基于元启发式算法的基因表达数据谱的聚类分析", 《中国优秀博硕士学位论文全文数据库(硕士) 基础科学辑》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643760A (en) * 2021-08-27 2021-11-12 西北工业大学 Multivariate Gaussian distribution based missing eQTL statistic inference method
CN113643760B (en) * 2021-08-27 2024-01-09 西北工业大学 Missing eQTL statistic inference method based on multi-variable Gaussian distribution
WO2024065070A1 (en) * 2022-09-26 2024-04-04 之江实验室 Graph clustering-based genetic coding breeding prediction method and apparatus

Also Published As

Publication number Publication date
CN113192556B (en) 2022-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20220426
CF01 Termination of patent right due to non-payment of annual fee