CN113192556B

CN113192556B - Genotype and phenotype association analysis method in multi-omics data based on small samples

Info

Publication number: CN113192556B
Application number: CN202110288323.5A
Authority: CN
Inventors: 郭新鹏; 宋亚飞; 刘帅忱; 刘树慧; 王艺菲; 尚学群
Original assignee: Northwestern Polytechnical University; Air Force Engineering University of PLA
Current assignee: Northwestern Polytechnical University; Air Force Engineering University of PLA
Priority date: 2021-03-17
Filing date: 2021-03-17
Publication date: 2022-04-26
Anticipated expiration: 2041-03-17
Also published as: CN113192556A

Abstract

Disclosed is a genotype and phenotype association analysis method for small-sample multi-omics data, comprising the following steps: generating a weighted undirected gene association graph from a protein network and gene expression values, and clustering the graph with the SPICi clustering method to generate gene clusters; screening the gene clusters with the group Lasso method; obtaining the SNP clusters corresponding to the screened gene clusters from eQTL data; constructing each SNP cluster, its corresponding gene cluster and the phenotype into a three-layer network class block, modelling the SNP-gene association in each class block by sparse partial least squares regression, and modelling the gene-phenotype association by logistic regression; and averaging the prediction results of all class blocks to obtain the final prediction result. The method addresses the problem that, in a three-layer network with small samples, the number of features is too large for effective regression; it improves prediction accuracy, makes the biological significance clearer, and takes tissue specificity into account.

Description

Genotype and phenotype association analysis method in multi-omics data based on small samples

Technical Field

The invention relates to the field of bioinformatics, in particular to a genotype and phenotype association analysis method for small-sample multi-omics data.

Background

An important goal of current genetics is to establish a complete functional link between genotype and phenotype, the so-called genotype-phenotype mapping. Studying the association between genotype and phenotype makes the process of genetic variation clearer. The genome-wide association study (GWAS) is a common and effective way to reveal associations between an individual's genetic background and a particular disease or trait. Its principle is to find the variant sites across the whole genome and test them for association with the phenotype. Over the past decade, many genome-wide association studies have identified genetic variants associated with complex human diseases and other traits. These findings identify novel variant-trait associations, provide insight into ethnic variation in complex traits, and support a variety of clinical applications. However, most of the variants found to date account for only a small fraction of the causal genetic factors. Although GWAS has discovered thousands of single nucleotide polymorphisms for complex diseases and traits, a single omics level offers only limited insight into biological mechanisms, and the functional meaning and mechanism of the associated sites remain largely unknown.

Because of the limitations of the single-omics level, other omics data need to be fused in to predict the biological association between genotype and phenotype more accurately. With such data as support, the interactions between omics layers can be studied. This gives researchers a new opportunity to detect true genotype-phenotype associations while revealing their mechanism, making the process of genetic variation clearer. One example is analyzing the effect of single nucleotide polymorphism (SNP) data on phenotype in conjunction with gene expression data. Current multi-omics data fusion methods follow two main ideas: multi-level fusion analysis and multi-dimensional fusion analysis.

When multi-omics biological networks are used to mine genotype-phenotype associations, differences in phenotypic traits are generally considered to arise from the stepwise influence of each omics layer. For example, a difference at an SNP site changes gene expression, which in turn changes protein expression and finally leads to disease. This layer-by-layer approach is generally called "multi-level fusion analysis". Its main workflow is to establish the association between each pair of adjacent omics layers by methods such as linear regression, partial least squares, canonical correlation analysis and correlation coefficients, and finally to predict disease through the hierarchical relationship between the omics layers. The most common multi-level fusion method at present is the "three-layer method" (S. Lee, S. Kong, E.P. Xing, A network-driven approach for genome-wide association mapping, Bioinformatics, 32 (2016) i164-i173). Its idea is as follows: first, linear regression establishes the association between SNPs and genes; then logistic regression establishes the association between genes and the phenotype (the phenotype takes only the values 0 and 1, indicating the absence or presence of a disease). Predicting disease through the influence of SNPs on gene expression levels is more accurate than predicting disease directly from SNPs, which indicates that a three-layer network better reflects the true biological relationships. However, when the regression model is built on the three-layer network, the associations within each omics data set are not considered, so the accuracy of the model is low.

"Multi-dimensional fusion analysis" can be subdivided into three specific approaches: integration based on feature association, integration based on intermediate transformation, and integration based on the fusion of single-omics models. The feature-association-based approach processes the raw omics data before analysis and modelling, fuses the features of the processed omics data through a machine learning algorithm into a more comprehensive input matrix, and builds a prediction model on that matrix. The intermediate-transformation-based approach first transforms each omics data set into an intermediate form and then fuses the intermediate forms to generate the prediction model. The single-omics-model-fusion approach builds a separate prediction model on each omics data set and then fuses these models into a final prediction model. Multi-dimensional fusion analysis has a common shortcoming: all omics data are given equal status, so each omics data set can only be used to analyze the phenotype from a different angle, and the biological associations between omics layers cannot be analyzed. Although multi-dimensional fusion analysis can improve the accuracy of phenotype prediction, the biological significance of the multi-omics fusion is not particularly clear.

Moreover, studying genotype-phenotype association with multi-omics analysis requires that all omics data come from the same sample set. Because the number of SNP features is huge, both types of multi-omics analysis need large sample sizes to build their models, yet clinical data are difficult to obtain owing to the protection of patient privacy and each institution's own requirements on its data. Publicly available clinical data therefore cannot meet the requirements of multi-omics data fusion methods in terms of either sample size or number of omics layers.

Disclosure of Invention

The invention provides a genotype and phenotype association analysis method for small-sample multi-omics data, which specifically comprises the following steps:

firstly, generating a weighted undirected gene association graph by using a protein network and a gene expression value, and clustering the undirected graph by using a SPICi clustering method to generate a gene cluster;

generating a weighted gene network graph from the protein network data and the gene expression data; clustering the generated gene network graph with the SPICi clustering method; the SPICi method has three hyperparameters, namely the minimum cluster size, the minimum support threshold and the minimum cluster density; the three parameters jointly determine the number of clusters and the number of elements in each cluster; the settings of the three hyperparameters are analyzed further as follows;

the minimum cluster size decides whether a cluster is kept, based on the number of genes it contains: if the number of elements in a cluster is greater than the minimum cluster size, the cluster is kept, otherwise it is discarded; if the minimum cluster size is set too small, the association between genes cannot be captured, but if it is set too large, clusters are deleted by mistake; based on tests on different data, the minimum cluster size is set within the interval [4, 6]; in the gene network graph $G = (V, E)$, $V$ denotes the set of all vertices of $G$ and $E$ the set of all edges of $G$; for any vertex $u$ and a set $S \subseteq V$ of vertices connected to $u$, the support is defined as:

$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$

support(u, S) is the sum of the weights of all edges connecting vertex $u$ to the vertices of $S$, where $w_{u,v}$ is the weight of the edge between vertex $u$ and vertex $v$; the edge weight is the absolute value of the Pearson correlation coefficient of the two vertex vectors, so that $w_{u,v} \in (0, 1]$; the minimum support threshold is set within the interval [0.4, 0.7]; the cluster density density(S) is defined as the sum of the edge weights divided by the total number of possible edges, and reflects the compactness of the subgraph; the formula is as follows:

$$\mathrm{density}(S) = \frac{\sum_{(u,v) \in E,\ u,v \in S} w_{u,v}}{|S|\,(|S| - 1)/2}$$

the value range of the minimum cluster density is set to [0.1, 0.6];

secondly, screening the gene cluster by using a group Lasso method;

because the number of genes is large, the number of gene clusters obtained in the first step is also relatively large, so the group Lasso algorithm is used to regress the phenotype on the gene clusters; with the gene clusters forming $L$ groups, the selection of individual features in Lasso regression is generalized to the selection of groups of features in group Lasso, and the objective function is as follows:

$$\min_{\beta}\ \| Y - X\beta \|_2^2 + \lambda \sum_{l=1}^{L} \omega_l \, \| \beta_l \|_2$$

wherein $\lambda$ is the regularization parameter, which controls the overall strength of the penalty, $X$ and $Y$ are the independent variable and dependent variable matrices respectively, $\beta$ is the coefficient vector, $\beta_l$ is the coefficient vector of group $l$, and $\omega_l$ is the weight of each group, adjusted as needed; if $\beta_l = 0$, the corresponding gene cluster is deleted, whereas if $\beta_l \neq 0$, the corresponding gene cluster is retained, which achieves the screening of the gene clusters;

thirdly, obtaining SNP clusters corresponding to the screened gene clusters through eQTL data;

the second step screens out the gene clusters whose coefficients are not zero, which are considered the most likely to influence the phenotype; the main way these gene clusters cause disease is through variant sites affecting the genes in the cluster, so the SNP-gene association needs to be established in order to reflect the full pathway from genotype to phenotype; the expression quantitative trait loci (eQTL) in the GTEx data reflect the SNP-gene associations in each tissue, and the SNPs associated with the genes in each cluster are looked up in the eQTL data, so that the SNP cluster corresponding to each gene cluster is obtained;

fourthly, constructing each SNP cluster, the corresponding gene cluster and the phenotype into a three-layer network class block, performing regression operation on the association relation between the SNP and the gene in each class block by adopting a sparse partial least square method, and performing operation on the association relation between the gene and the phenotype by adopting logistic regression;

combining each SNP cluster, its corresponding gene cluster and the phenotype into a three-layer network, called a class block, so that each class block forms one three-layer network; when processing the SNP-gene association in each class block, the intra-layer association is considered together with the inter-layer association; the SNP-gene association is solved by the sparse partial least squares (SPLS) method; the gene-phenotype association is solved by logistic regression;

fifthly, averaging the prediction results of all the blocks to obtain a final prediction result;

the fourth step constructs a plurality of class blocks for prediction analysis; there is no strong dependence among the class blocks, so they can be processed in parallel; the prediction results of the class blocks are averaged to obtain the final prediction result.

In a particular embodiment of the invention, during the first step of the experiment, the parametric tests were carried out with an increment of 0.1.

The method has the advantages that:

a) it addresses the problem that, with small samples in a three-layer network, the number of features is too large for effective regression;

b) the internal incidence relation among the omics is considered, and the prediction accuracy is improved;

c) the correlation relationship of biological pathways among different omics layers is analyzed, so that the biological significance is more definite;

d) tissue specificity is considered. Each tissue of a multicellular organism has characteristics that distinguish it from other tissues, largely owing to the specific morphological structure and physiological function conferred by the genes specifically expressed in that tissue;

e) it integrates the advantages of existing multi-omics fusion methods. It retains the ability of multi-level fusion analysis to reflect biologically meaningful associations, and incorporates the ensemble learning idea found in multi-dimensional fusion analysis.

Drawings

FIG. 1 shows a genotype-phenotype association analysis model based on gene cluster grouping;

FIG. 2 illustrates various intra-block and inter-layer network association analysis processes;

FIG. 3 shows the flow of the genotype-phenotype association analysis based on small-sample multi-omics data;

FIG. 4 shows the ROC curves and AUC values of four genotype-phenotype association methods on the GSE33356 data, where FIG. 4(a), FIG. 4(b), FIG. 4(c) and FIG. 4(d) correspond to AUC values of 0.811, 0.619, 0.441 and 0.586, respectively;

FIG. 5 shows the ROC curves and AUC values of four genotype-phenotype association methods on the GSE114269 data, where FIG. 5(a), FIG. 5(b), FIG. 5(c) and FIG. 5(d) correspond to AUC values of 0.779, 0.707, 0.671 and 0.701, respectively.

Detailed Description

The invention is described in detail below with reference to the drawings, and is specifically divided into the following five steps.

The first step is to generate a weighted undirected gene association graph from a protein network and gene expression values, and to cluster the graph with the SPICi clustering method to generate gene clusters.

Protein network data are derived from the PICKLE database (Protein InteraCtion KnowLedgebasE, http://www.pickle.gr/). Using the protein network data, the associations between genes can be mapped to form an unweighted gene association graph. The Pearson correlation coefficient between each pair of genes is then computed from the gene expression data and used as the edge weight, yielding a weighted gene network graph. These procedures are well known to those skilled in the art and are not described further. In the gene network graph, more closely associated genes are more likely to affect expression through interaction, so the invention clusters the gene association graph to obtain a plurality of gene clusters, as shown in the middle layer of FIG. 1. After comparing several clustering methods, including MCODE, RNSC, CFinder, NetworkBLAST, DPClus and MCUPGMA, the invention adopts the SPICi (Speed and Performance In Clustering) method (P. Jiang, M. Singh, SPICi: a fast clustering algorithm for large biological networks, Bioinformatics, 26 (2010) 1105-1111) to cluster the generated gene network graph. SPICi has three hyperparameters: the minimum cluster size, the minimum support threshold and the minimum cluster density. Together they determine the number of clusters and the number of elements in each cluster. To facilitate subsequent analysis, the final clustering must keep both the number of clusters and the cluster sizes within a suitable range. For example, if a cluster contains too many elements, then in the subsequent group Lasso step (Huang J, Breheny P, Ma S. A selective review of group selection in high-dimensional models. Stat Sci, 2012, 27: 481), the more elements each group contains, the larger the error for a given penalty parameter. If a cluster contains too few elements, the influence of gene associations on the disease cannot be analyzed effectively. Beyond this general principle, each of the three hyperparameters has its own characteristics, so their settings are analyzed further as follows.
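The graph-construction step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the gene identifiers and the edge-list input format are assumptions, and the protein-to-gene mapping from PICKLE is taken as already done.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def build_weighted_gene_graph(ppi_edges, expression):
    """Weight each interaction edge by |Pearson r| of the two genes' expression.

    ppi_edges:  iterable of (gene_a, gene_b) pairs (hypothetical input format).
    expression: dict gene -> expression values measured over the same samples.
    Returns {frozenset({gene_a, gene_b}): weight}; zero-weight edges are dropped.
    """
    graph = {}
    for a, b in ppi_edges:
        if a == b or a not in expression or b not in expression:
            continue  # skip self-loops and genes without expression data
        w = abs(pearson(expression[a], expression[b]))
        if w > 0:
            graph[frozenset((a, b))] = w
    return graph

# Toy data (illustrative only).
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with G1
    "G3": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with G1
}
ppi_edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G4")]  # G4 has no expression data
graph = build_weighted_gene_graph(ppi_edges, expression)
```

Taking the absolute value means the anti-correlated pair G1-G3 receives the same weight as the correlated pair G1-G2, which matches the use of absolute Pearson coefficients as edge weights in this step.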

The minimum cluster size decides whether a cluster is kept, based on the number of genes it contains: if the number of elements in a cluster is greater than the minimum cluster size, the cluster is kept, otherwise it is discarded. If the minimum cluster size is set too small, the association between genes cannot be captured, but if it is set too large, clusters are deleted by mistake. Based on tests on different data, the minimum cluster size is set within the interval [4, 6]. In the gene network graph $G = (V, E)$, $V$ denotes the set of all vertices of $G$ and $E$ the set of all edges of $G$. For any vertex $u$ and a set $S \subseteq V$ of vertices connected to $u$, the support is defined as:

$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$

support(u, S) is the sum of the weights of all edges connecting vertex $u$ to the vertices of $S$, where $w_{u,v}$ is the weight of the edge between vertex $u$ and vertex $v$. The invention uses the Pearson correlation coefficient of the two vertex vectors as the edge weight. Because the Pearson correlation coefficient can be positive or negative, terms could cancel each other when support(u, S) is computed, so the invention takes the absolute value of the coefficient, giving $w_{u,v} \in (0, 1]$. Generally, a Pearson correlation coefficient above 0.2 is considered to indicate at least a weak correlation. Too many weakly associated genes may introduce additional noise, so the invention requires a gene to be connected by at least two edges of weak correlation or stronger (2 × 0.2 = 0.4) and sets the lower bound of the minimum support threshold to 0.4. Tests on various data show that when the minimum support threshold exceeds 0.7, the total number of genes drops sharply and is no longer sufficient to reflect the influence of gene associations on the disease. The invention therefore sets the minimum support threshold within the interval [0.4, 0.7]. The cluster density density(S) is defined as the sum of the edge weights divided by the total number of possible edges, and reflects the compactness of the subgraph. The formula is as follows:

$$\mathrm{density}(S) = \frac{\sum_{(u,v) \in E,\ u,v \in S} w_{u,v}}{|S|\,(|S| - 1)/2}$$

As can be seen from the formula, too small a minimum cluster density results in more elements per cluster and fewer clusters in total. When the density(S) of a cluster does not reach the minimum cluster density, the program splits the cluster into two or more smaller clusters. The minimum cluster density therefore directly affects the total number of clusters, and of the three hyperparameters it has the largest influence on the clustering result. Through experimental comparison, the value range of the minimum cluster density is set to [0.1, 0.6], and in the experiments the parameter is tested in increments of 0.1.
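The two quantities defined above, and the hyperparameter ranges that are searched, can be illustrated with a short sketch. It is not the SPICi implementation itself; the function names and the edge-weight dictionary format are assumptions, and stepping the minimum cluster size through the integers 4, 5, 6 is likewise an assumption (the text only states the interval).

```python
from itertools import combinations, product

def support(u, cluster, weights):
    """Sum of the weights of edges joining vertex u to the members of `cluster`."""
    return sum(weights.get(frozenset((u, v)), 0.0) for v in cluster if v != u)

def density(cluster, weights):
    """Sum of edge weights inside `cluster` divided by the number of possible edges."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(cluster, 2))
    return total / (n * (n - 1) / 2)

def frange(lo, hi, step=0.1):
    """Inclusive float range, rounded to avoid accumulated floating-point error."""
    count = int(round((hi - lo) / step)) + 1
    return [round(lo + i * step, 10) for i in range(count)]

# Toy weighted graph: edges a-b and a-c exist, b-c does not (illustrative only).
weights = {frozenset(("a", "b")): 0.8, frozenset(("a", "c")): 0.4}
cluster = ["a", "b", "c"]
s = support("a", cluster, weights)   # 0.8 + 0.4
d = density(cluster, weights)        # (0.8 + 0.4) / 3 possible edges

# Candidate settings: (min cluster size, min support threshold, min cluster density).
grid = list(product([4, 5, 6], frange(0.4, 0.7), frange(0.1, 0.6)))
```

Each tuple in `grid` would correspond to one SPICi run; how the resulting clusterings are compared is not specified in the text and is left out here.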

The second step is to screen the gene clusters with the group Lasso method.

Because the number of genes is large, the number of gene clusters obtained in the first step is also relatively large, so the group Lasso algorithm is used to regress the phenotype on the gene clusters. Group Lasso is a generalization of Lasso regression (Tibshirani R. Regression Shrinkage and Selection via the Lasso [J]. Journal of the Royal Statistical Society, 1996, 58(1): 267-288). With the gene clusters forming $L$ groups, the selection of individual features in Lasso regression is generalized to the selection of groups of features in group Lasso, and the objective function is as follows:

$$\min_{\beta}\ \| Y - X\beta \|_2^2 + \lambda \sum_{l=1}^{L} \omega_l \, \| \beta_l \|_2$$

where $\lambda$ is the regularization parameter, which controls the overall strength of the penalty, $X$ and $Y$ are the independent variable and dependent variable matrices respectively, $\beta$ is the coefficient vector, $\beta_l$ is the coefficient vector of group $l$, and $\omega_l$ is the weight of each group, which can be adjusted as desired. If $\beta_l = 0$, the corresponding gene cluster is deleted; if $\beta_l \neq 0$, the corresponding gene cluster is retained. This achieves the screening of the gene clusters.
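The group-wise selection described above can be sketched with a small proximal-gradient solver. This is an illustrative sketch under stated assumptions, not the solver used in the invention: the loss is scaled by 1/(2n), the group weight is taken to be the square root of the group size (one common choice; the text only says the weight is adjustable), and the data are a toy orthogonal design with a continuous response.

```python
import math

def group_lasso(X, y, groups, lam, n_iter=300):
    """Proximal gradient descent for
         (1 / 2n) * ||y - X beta||^2 + lam * sum_l sqrt(p_l) * ||beta_l||_2
    X: list of sample rows; groups: list of column-index lists, one per gene cluster."""
    n, p = len(X), len(X[0])
    # Safe step size: n / ||X||_F^2 never exceeds 1 / lambda_max(X^T X / n).
    step = n / sum(v * v for row in X for v in row)
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        z = [beta[j] - step * grad[j] for j in range(p)]
        for idx in groups:
            # Block soft-thresholding: shrink the whole group, or zero it out.
            norm = math.sqrt(sum(z[j] ** 2 for j in idx))
            thr = step * lam * math.sqrt(len(idx))
            scale = max(0.0, 1.0 - thr / norm) if norm > 0 else 0.0
            for j in idx:
                beta[j] = scale * z[j]
    return beta

# Toy design: four mutually orthogonal "gene" columns, two clusters of two genes.
c0 = [1, 1, 1, 1, -1, -1, -1, -1]
c1 = [1, 1, -1, -1, 1, 1, -1, -1]
c2 = [1, -1, 1, -1, 1, -1, 1, -1]
c3 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [list(map(float, row)) for row in zip(c0, c1, c2, c3)]
y = [2 * a - b + 0.1 * c for a, b, c, _ in X]  # driven almost entirely by cluster 0
groups = [[0, 1], [2, 3]]

beta = group_lasso(X, y, groups, lam=0.5)
kept = [l for l, idx in enumerate(groups) if any(beta[j] != 0.0 for j in idx)]
```

In this toy case the second cluster's coefficients are driven exactly to zero, so only cluster 0 is retained; the retained cluster's coefficients are shrunk relative to the unpenalized fit, which is the usual price of the penalty.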

The third step is to obtain the SNP clusters corresponding to the screened gene clusters from eQTL data.

In the second step, the gene clusters with non-zero coefficients are selected; these are considered the most likely to influence the phenotype. Since the invention studies genotype-phenotype association, the genes in each gene cluster can be related to the phenotype by multiple regression, which is well known to those skilled in the art and is not described further. The main way these gene clusters cause disease is through variant sites affecting the genes in the cluster, so the SNP-gene association must also be established in order to reflect the full pathway from genotype to phenotype. The expression quantitative trait loci (eQTL) in the GTEx data (Lonsdale J., Thomas J., Salvatore M., Phillips R. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013, 45, 580-585) reflect the SNP-gene associations in each tissue. The SNPs associated with the genes of each cluster are looked up in the eQTL data, giving the SNP cluster corresponding to each gene cluster.
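The lookup from gene clusters to SNP clusters amounts to a join against the eQTL table. The sketch below assumes the tissue-specific eQTL data have already been reduced to (SNP, gene) pairs; the function name and all identifiers are hypothetical.

```python
def snp_clusters_from_eqtl(gene_clusters, eqtl_pairs):
    """Map each gene cluster to the SNPs that the eQTL data associate with its genes.

    gene_clusters: dict cluster_id -> list of gene identifiers.
    eqtl_pairs:    iterable of (snp_id, gene_id) pairs from a tissue-specific eQTL table.
    Returns dict cluster_id -> sorted list of SNP identifiers.
    """
    gene_to_snps = {}
    for snp, gene in eqtl_pairs:
        gene_to_snps.setdefault(gene, set()).add(snp)

    snp_clusters = {}
    for cluster_id, genes in gene_clusters.items():
        snps = set()
        for gene in genes:
            snps |= gene_to_snps.get(gene, set())
        snp_clusters[cluster_id] = sorted(snps)
    return snp_clusters

# Toy data (illustrative identifiers only).
gene_clusters = {"cluster_1": ["GENE_A", "GENE_B"], "cluster_2": ["GENE_C"]}
eqtl_pairs = [("rs1", "GENE_A"), ("rs2", "GENE_A"), ("rs2", "GENE_B"), ("rs3", "GENE_D")]
snp_clusters = snp_clusters_from_eqtl(gene_clusters, eqtl_pairs)
```

A cluster whose genes have no eQTL entries, like `cluster_2` here, ends up with an empty SNP cluster; how such clusters are handled is not stated in the text.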

The fourth step is to construct each SNP cluster, its corresponding gene cluster and the phenotype into a three-layer network class block, to model the SNP-gene association in each class block by sparse partial least squares regression, and to model the gene-phenotype association by logistic regression.

Each SNP cluster, its corresponding gene cluster and the phenotype are combined into a three-layer network, called a class block (block) in the invention; each class block forms one three-layer network. After partitioning into class blocks, the numbers of SNPs and genes in each three-layer structure drop sharply, and so does the sample size needed for effective regression, which helps with the problem that a large number of features cannot be regressed with small samples. When the three-layer structure of each class block is analyzed, using only inter-layer regression predicts part of the pathway relationship but ignores intra-omics associations, which is biologically unrealistic. Conversely, considering only intra-omics associations ignores the pathway associations between omics layers, so the biological system as a whole is not reflected and only local behaviour is seen. This is illustrated in FIG. 2, where the upper nodes represent SNPs, the middle nodes represent genes, the bottom nodes represent phenotypes, the solid lines represent intra-omics associations and the dotted lines represent inter-omics associations. When processing the SNP-gene association in each class block, the invention extends the original inter-layer association of single elements (left half of FIG. 2) to an association that is both inter-layer and intra-layer (right half of FIG. 2); that is, the original many-to-one association becomes a many-to-many association. Sparse partial least squares (SPLS) (A. Csala, F. Voorbraak, A.H. Zwinderman, M.H. Hof, Sparse redundancy analysis of high-dimensional genetic and genomic data, Bioinformatics, 33 (2017) 3228-3234) combines the strengths of principal component analysis, canonical correlation analysis and linear regression analysis, and can handle cases where the number of samples is far smaller than the number of features, so that ordinary regression fails, and where the features are multicollinear. The invention therefore replaces the original multiple regression with SPLS to solve the SNP-gene association. SPLS adds a penalty function to the solution process of partial least squares (PLS) (de Jong S. SIMPLS: an alternative approach to partial least squares regression [J]. Chemometrics and Intelligent Laboratory Systems, 1993, 18(3): 251-263). Both PLS and SPLS are known to those skilled in the art and are not described in detail.
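To make the idea of "PLS with a penalty" concrete, the sketch below computes a single sparse direction by soft-thresholding the leading left singular vector of the SNP-gene cross-covariance matrix, then regresses all genes on the resulting score. This is a deliberately simplified, one-component variant written for illustration; the text does not specify which SPLS algorithm or software is used, and the function names, the threshold parameter `eta` and the toy data are assumptions.

```python
import math

def center_columns(A):
    """Subtract each column's mean."""
    n = len(A)
    means = [sum(col) / n for col in zip(*A)]
    return [[v - m for v, m in zip(row, means)] for row in A]

def sparse_pls_one_component(X, Y, eta=0.5, n_iter=200):
    """One sparse PLS-style component for multi-response regression of Y on X.

    X: n x p SNP matrix, Y: n x q gene-expression matrix (lists of rows).
    Returns (w, B): the sparse unit direction over SNPs and the p x q coefficients."""
    Xc, Yc = center_columns(X), center_columns(Y)
    n, p, q = len(Xc), len(Xc[0]), len(Yc[0])
    # Cross-covariance matrix M = Xc^T Yc (p x q).
    M = [[sum(Xc[i][j] * Yc[i][k] for i in range(n)) for k in range(q)] for j in range(p)]
    # Power iteration on M M^T gives the leading left singular vector of M.
    u = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        v = [sum(M[j][k] * u[j] for j in range(p)) for k in range(q)]
        u = [sum(M[j][k] * v[k] for k in range(q)) for j in range(p)]
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
    # Penalty step: soft-threshold small loadings to exactly zero, then renormalize.
    thr = eta * max(abs(a) for a in u)
    w = [math.copysign(max(abs(a) - thr, 0.0), a) for a in u]
    norm = math.sqrt(sum(a * a for a in w))
    w = [a / norm for a in w]
    # Regress every gene on the latent score t = Xc w.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(a * a for a in t)
    c = [sum(t[i] * Yc[i][k] for i in range(n)) / tt for k in range(q)]
    B = [[w[j] * c[k] for k in range(q)] for j in range(p)]
    return w, B

# Toy block: 4 SNP columns, 2 genes; both genes are driven mainly by SNP 0.
c0 = [1, 1, 1, 1, -1, -1, -1, -1]
c1 = [1, 1, -1, -1, 1, 1, -1, -1]
c2 = [1, -1, 1, -1, 1, -1, 1, -1]
c3 = [1, -1, -1, 1, 1, -1, -1, 1]
X = [list(map(float, row)) for row in zip(c0, c1, c2, c3)]
Y = [[2 * a + 0.1 * c, -a + 0.1 * d] for a, _, c, d in X]

w, B = sparse_pls_one_component(X, Y)
n_selected = sum(1 for a in w if a != 0.0)
```

Because all genes in the block share the same sparse direction, the fit is many-to-many in the sense described above: one set of selected SNPs explains the whole gene cluster jointly rather than one gene at a time. A full SPLS fit would extract further components and tune `eta`, which this sketch omits.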

The gene-phenotype association in each class block is modelled by logistic regression, which is well known to those skilled in the art and is not described further.

The fifth step is to average the prediction results of all class blocks to obtain the final prediction result.

Through the fourth step, a plurality of class blocks are constructed for prediction analysis, as shown in FIG. 1. There is no strong dependence among the class blocks, so they can be processed in parallel. This is similar to ensemble learning, in which a learner is trained on a random subset of the samples, repeated draws yield multiple learners, and the learners are then combined. The invention instead selects the gene clusters most likely to affect the phenotype for analysis, which works better than random draws. The inter-layer association analysis of the fourth step gives a prediction result for each class block, and averaging these results gives the final prediction result; the whole process is shown in FIG. 3.
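The per-block logistic regression and the final averaging can be sketched as follows. This is a minimal illustration with plain gradient descent; the function names, learning rate and toy data are assumptions, and for brevity each block is given gene-level features directly rather than the SNP-derived gene values of the fourth step.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        errs = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
                for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b

def predict_proba(model, X):
    """Predicted probability of phenotype 1 for each sample."""
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def average_block_predictions(block_probs):
    """Average the per-sample probabilities over all class blocks."""
    return [sum(col) / len(col) for col in zip(*block_probs)]

# Toy example: the same four samples seen through two independent class blocks.
y = [0, 0, 1, 1]
blocks = {
    "block_A": [[-2.0], [-1.0], [1.0], [2.0]],
    "block_B": [[-1.0, 0.5], [-2.0, -0.5], [1.5, 0.2], [1.0, -0.3]],
}
# Each block is fitted independently, so this loop could run in parallel.
block_probs = [predict_proba(fit_logistic(Xb, y), Xb) for Xb in blocks.values()]
final = average_block_predictions(block_probs)
predicted = [int(p > 0.5) for p in final]
```

In practice the blocks would be fitted on training samples and applied to held-out samples; the sketch predicts on its own training data only to keep the example short.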

Specific examples

1. Data source and preprocessing

To verify the effectiveness of the method, the invention is validated on two data sets from the GEO database (Gene Expression Omnibus database, https://www.ncbi.nlm.nih.gov/geo/). GSE33356 is a lung adenocarcinoma study. It includes lung tumor tissue and adjacent normal tissue collected from the same patients. Lung tumor and normal specimens from 84 non-smoking female adenocarcinoma patients were analyzed using Affymetrix SNP 6.0 and Affymetrix U133 Plus 2.0 chips. GSE114269 compares medullary breast cancer (MBC) with non-medullary basal-like breast cancer (non-MBC BLC) and contains 48 samples. These two data sets are chosen mainly to demonstrate that the method of the invention is broadly applicable to genotype-phenotype association problems with small samples and a large number of features.

Protein network data are derived from the PICKLE database (Protein InteraCtion KnowLedgebasE, http://www.pickle.gr/). PICKLE is a meta-database of human protein interactions that integrates several publicly available protein interaction databases through gene ontology information.

eQTL data are derived from GTEx Analysis V7 (dbGaP Accession phs000424.v7.p2). For prediction accuracy, and to account for tissue specificity, the eQTL data of the tissue matching each data set are selected: lung eQTL data for the first data set and breast tissue eQTL data for the second.

Data preprocessing mainly consists of unifying SNP and gene naming across the data sets, removing SNPs whose missing rate exceeds 10%, filling the remaining missing values (missing rate below 10%) with the most frequent value or the mean, and keeping only SNPs with a minor allele frequency (MAF) greater than 0.1.
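The SNP filtering rules above can be sketched as follows. The sketch assumes genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call, and it fills missing values with the most frequent genotype; the coding, the function name and the toy SNP identifiers are assumptions.

```python
from collections import Counter

def preprocess_snps(snps, max_missing=0.10, min_maf=0.10):
    """Filter and impute SNP genotypes.

    snps: dict snp_id -> list of genotype calls (0, 1, 2, or None if missing).
    Drops SNPs whose missing rate exceeds `max_missing`, fills remaining missing
    calls with the most frequent genotype, and keeps SNPs with MAF > `min_maf`.
    """
    kept = {}
    for snp_id, calls in snps.items():
        observed = [g for g in calls if g is not None]
        n_missing = len(calls) - len(observed)
        if not observed or n_missing > max_missing * len(calls):
            continue
        mode = Counter(observed).most_common(1)[0][0]
        filled = [mode if g is None else g for g in calls]
        freq = sum(filled) / (2 * len(filled))   # frequency of the counted allele
        maf = min(freq, 1.0 - freq)              # minor allele frequency
        if maf > min_maf:
            kept[snp_id] = filled
    return kept

# Toy genotypes for 20 samples (illustrative identifiers only).
snps = {
    "rs1": [0] * 9 + [1] * 8 + [2] * 2 + [None],  # 5% missing, MAF 0.3 after filling
    "rs2": [0] * 15 + [None] * 5,                 # 25% missing -> removed
    "rs3": [0] * 19 + [1],                        # MAF 0.025 -> removed
}
clean = preprocess_snps(snps)
```

Computing the MAF after imputation is a design choice of this sketch; computing it on observed calls only would be an equally plausible reading of the text.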

2. Analysis of predicted results

At present, no typical method is used for correlating and analyzing genotypes and phenotypes with large attributes of small samples, so the method disclosed by the invention is used for carrying out comparative experiments to verify the feasibility of the method. In comparative experiments, the method herein is named "grouping + partial least squares".

1) For the three-layer network structure formed after gene clustering, multivariate regression is used instead of sparse partial least squares. This comparison verifies the influence of the correlations among genes (intra-omics relationships) on the result. This method is named "grouping + multiple regression".

2) The genes are not clustered or screened. eQTL data are used to select the SNPs of the related genes, a single large three-layer network is built from all the data, and lasso regression is applied directly to this three-layer structure. Because SNPs and genes are not grouped by clustering, this comparison verifies the effect on the result of ignoring the associations among genes and considering only the associations between omics layers. This method is named "multiple regression method".

3) SNP and gene data are analyzed jointly with a multi-omics fusion method. The method does not consider the association between SNPs and genes, so it verifies the effect of multidimensional fusion on the result when inter-omics associations are ignored; it cannot analyze the pathway relationships between omics layers. This method is named "multi-omics fusion".

Fig. 4 and Fig. 5 compare the method of the present invention with the above three methods on the GSE33356 and GSE114269 data sets, respectively; from left to right they show the four methods "grouping + partial least squares", "grouping + multiple regression", "multiple regression method", and "multi-omics fusion". Because this is a binary classification problem, the methods can be compared through their Receiver Operating Characteristic (ROC) curves, which are well known to those skilled in the art and are not described further. Table 1 shows the AUC values of the four methods; AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes.
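For reference, the AUC of a binary classifier can be computed without drawing the curve, using the identity that the area under the ROC curve equals the fraction of positive/negative pairs ranked correctly. The following is a minimal sketch of that identity; the function name `roc_auc` and the example scores are illustrative and are not taken from the experiments.

```python
import numpy as np


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    # Compare every positive score with every negative score;
    # a tie counts as half a correctly ranked pair.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Illustrative scores only: three of the four positive/negative pairs
# are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.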

TABLE 1 Comparison of AUC values of the four methods on the two data sets

As can be seen from the results of Fig. 4 and Table 1, the present invention performed better than the other methods on both data sets. The result of grouping + partial least squares is clearly better than that of grouping + multiple regression, which shows that taking the associations between the omics layers into account during the analysis better reflects the reality of a biological system. Grouping + partial least squares and grouping + multiple regression, which both cluster the gene network, have advantages over the other methods, indicating that clustering and screening the gene data and then applying ensemble learning is well suited to small-sample data. The multiple regression method performs worst because it is not suited to small samples, although it provides guidance for analyzing the pathways between omics layers; the literature reports that, for analyzing the relationships among SNPs, genes and phenotypes, this algorithm is suitable when the sample size reaches 500 or more, and it cannot achieve effective regression with small samples. Multi-omics fusion is a multidimensional fusion analysis method. Its results differ little from those of grouping + multiple regression and were sometimes even better during testing, but it cannot analyze the associations between omics layers, and grouping + partial least squares clearly outperforms it in prediction under small-sample conditions.

Claims

1. A genotype-phenotype association analysis method for multi-omics data based on small samples, characterized by comprising the following steps:

firstly, generating a weighted undirected gene association graph from a protein network and gene expression values, and clustering the weighted undirected gene association graph with the SPICi clustering method to generate gene clusters;

generating a weighted gene network graph from the protein network data and the gene expression data; clustering the generated gene network graph with the SPICi clustering method; the SPICi method has three hyperparameters, namely the minimum cluster size, the minimum support threshold, and the minimum cluster density; the three parameters jointly influence the number of clusters and the number of elements in each cluster; the settings of the three hyperparameters are analyzed further as follows;

the minimum cluster size is used to decide whether a cluster is retained by comparing it with the number of genes the cluster contains, namely, if the number of elements in the cluster is greater than the minimum cluster size, the cluster is kept, otherwise the cluster is discarded; if the minimum cluster size is set too small, the purpose of capturing the associations between genes cannot be achieved, but if it is set too large, clusters are discarded by mistake; according to tests on different data, the minimum cluster size is finally set within the interval [4, 6]; in the gene network graph $G = (V, E)$, $V$ denotes the set of all vertices in $G$ and $E$ denotes the set of all edges in $G$; for any vertex $u$ and a set of vertices $S \subseteq V$ connected to $u$,

the support is defined as:

$$\mathrm{support}(u, S) = \sum_{v \in S} w_{u,v}$$

support(u, S) is the sum of the weights of all edges connecting vertex $u$ to the vertices in $S$, where $w_{u,v}$ denotes the weight of the edge between vertex $u$ and vertex $v$; the weight of an edge is the absolute value of the Pearson correlation coefficient of the two vertex vectors, so that $w_{u,v} \in (0, 1]$; the minimum support threshold is defined within the interval [0.4, 0.7]; the cluster density density(S) is defined as the sum of the edge weights divided by the total number of possible edges, reflecting the compactness of the subgraph; the formula is as follows:

$$\mathrm{density}(S) = \frac{\sum_{(u,v) \in E,\; u,v \in S} w_{u,v}}{|S|\,(|S|-1)/2}$$

the minimum cluster density parameter is set within the range [0.1, 0.6];

secondly, screening the gene clusters by using a group Lasso method;

2. The method for analyzing genotype-phenotype association in multi-omics data based on small samples according to claim 1, wherein, during the experiments of the first step, the minimum cluster density parameter is tested parametrically in increments of 0.1.
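The quantities used in the first step (edge weight, support, cluster density) and the 0.1-increment parameter test can be illustrated with a minimal sketch. This is not the SPICi algorithm itself, which grows clusters greedily from seed vertices; it only evaluates the definitions on a fixed candidate cluster. The function names, the dictionary-based graph representation, and the toy weights are assumptions made for illustration.

```python
import numpy as np


def edge_weight(expr_u, expr_v):
    """Absolute Pearson correlation between two gene expression vectors."""
    return abs(float(np.corrcoef(expr_u, expr_v)[0, 1]))


def support(u, cluster, weights):
    """Sum of the weights of the edges from vertex u to vertices of the cluster."""
    return sum(weights.get(frozenset((u, v)), 0.0) for v in cluster if v != u)


def density(cluster, weights):
    """Sum of edge weights inside the cluster over the number of possible edges."""
    nodes = list(cluster)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(
        weights.get(frozenset((nodes[i], nodes[j])), 0.0)
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


# Toy weighted graph (hypothetical weights); edges are keyed by vertex pair.
weights = {
    frozenset(("A", "B")): 0.75,
    frozenset(("B", "C")): 0.5,
    frozenset(("A", "C")): 0.25,
}
cluster = ["A", "B", "C"]

# Parameter test over the range [0.1, 0.6] in increments of 0.1.
thresholds = [k / 10 for k in range(1, 7)]
passes = {t: density(cluster, weights) >= t for t in thresholds}
print(support("A", cluster, weights), density(cluster, weights), passes)
```

Generating the thresholds as `k / 10` from integers avoids the drift that repeated floating-point addition of 0.1 would introduce at the ends of the range.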