CN110797080A

CN110797080A - Predicting synthetic lethal genes based on cross-species transfer learning

Info

Publication number: CN110797080A
Application number: CN201910991037.8A
Authority: CN
Inventors: 卢新国; 屈强; 朱正浩; 王新宇; 陈浩文
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2020-02-14

Abstract

The invention belongs to the field of bioinformatics, and particularly relates to a method for predicting synthetic lethal genes based on cross-species transfer learning. The method transfers the synthetic lethal gene knowledge learned from Saccharomyces cerevisiae to human in order to predict human synthetic lethal genes. The method consists of two basic steps. First, manifold feature learning is performed to learn new feature representations of the two species. Second, a dynamic distribution alignment method quantitatively evaluates the relative importance of the marginal distribution and the conditional distribution, and adaptively minimizes the difference of both distributions between the two species. Finally, a domain-invariant synthetic lethal gene classifier is learned by combining these two steps. The invention can be used to predict synthetic lethal genes in humans.

Description

Predicting synthetic lethal genes based on cross-species transfer learning

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a method for predicting synthetic lethal genes based on cross-species transfer learning.

Background

The current screening methods for Synthetic Lethal (SL) gene pairs can be categorized into three groups.

The first is a model organism-based approach. Model organisms have small genomes that are easy to mutate and cross, so gene silencing techniques are easier to perform in them. However, as with all homology inference methods based on model organisms, most genes in the SL gene pairs of model organisms have no homologous genes in the human genome. Even when homologous genes can be found in the human genome, their function has often changed greatly, so the pairs cannot be directly translated into human SL gene pairs.

The second screening method is gene silencing in mammalian cells, for which two approaches have been developed. One is hypothesis-driven screening based on a priori knowledge. A potential SL gene pair comprises two genes, a mutated cancer gene and an SL partner gene; the candidate SL partner genes are therefore knocked out directly and tested individually. The other is unbiased genome-wide screening based on high-throughput experimental techniques. siRNA and CRISPR screening have proved to be the most reliable methods for detecting SL gene pairs. However, human cell systems face greater challenges in genome-wide siRNA or CRISPR screening than model genetic systems. Moreover, these methods are much more expensive and more labor- and time-intensive, and many of the essential genes discovered are either restricted to these cell line models or are often overexpressed in cancer.

The third is a computational approach based on big data and data mining. Such data-driven methods in turn include biological network topology methods, data mining methods, and statistical screening methods. Computational methods are an attractive alternative to genome-wide siRNA or CRISPR-based screening of human cell lines, and can help identify and prioritize potential SL genes for further experimental validation. These methods include: inferring human homologous SL genes from yeast SL genes; evaluating the importance of a gene pair by using the robustness of the tumor PPI network; computing mutual exclusivity with a statistical model of gene mutation and transcription expression data; DAISY, a data-driven SL detection approach that combines cell copy number changes, siRNA screening, cell survival and gene co-expression information and achieves good results; and combining three features (mutation coverage, driver mutation probability and network information centrality) in a learning-based training and prediction pipeline with a manifold ranking model to generate a ranked list of potential SL pairs.

In conclusion, the existing methods for predicting human synthetic lethal genes are costly and consume a great deal of labor and time.

Disclosure of Invention

Aiming at the problems that the effectiveness of existing supervised learning methods is limited and that the amount of human synthetic lethal gene data is small, the invention provides a method for predicting synthetic lethal genes based on cross-species transfer learning. Human synthetic lethal genes are predicted by using the abundant, experimentally verified synthetic lethal interactions available from model organisms such as yeast and mice. The method comprises the following steps:

1. Data collection phase

PPI networks are generated from data collected from the BioGrid protein interaction database, with each node representing a protein and each edge representing an interaction between proteins. The gene pairs of the source species and the target species obtained from the PPI networks are then labeled for training the classifier: gene pairs with synthetic lethality form the positive data set, and gene pairs without synthetic lethality form the negative data set. The known synthetic lethality between two genes is represented by the binary matrices Y_s and Y_t, with 1 representing synthetic lethality and 0 representing no synthetic lethality.
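The binary label matrix can be illustrated with a minimal Python sketch. The gene identifiers, the example pair, the function name `build_label_matrix` and the gene-by-gene layout of the matrix are all assumptions made for illustration; the text above only states that a binary matrix with entries 1 and 0 is used.

```python
import numpy as np

def build_label_matrix(genes, sl_pairs):
    """Return a symmetric 0/1 matrix Y with Y[i, j] = 1 for known SL pairs."""
    index = {g: i for i, g in enumerate(genes)}
    Y = np.zeros((len(genes), len(genes)), dtype=int)
    for a, b in sl_pairs:
        i, j = index[a], index[b]
        Y[i, j] = Y[j, i] = 1  # synthetic lethality is treated as symmetric
    return Y

genes = ["geneA", "geneB", "geneC"]   # hypothetical gene identifiers
sl_pairs = [("geneA", "geneB")]       # hypothetical known SL pair
Y_s = build_label_matrix(genes, sl_pairs)
```

Entries left at 0 stand for pairs without known synthetic lethality.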

2. Data preprocessing stage

PPI network topological similarity is measured for the source species and the target species to obtain the topological similarity matrices N_s ∈ R^{n×k} and N_t ∈ R^{m×k}, where k is the number of network parameters of a gene pair. GO semantic similarity is measured for the source species and the target species to obtain the semantic similarity matrices G_s ∈ R^{n×d} and G_t ∈ R^{m×d}, where d is the number of methods used to calculate GO similarity. The feature matrices X_s and X_t of the source species and the target species are then obtained by linearly combining the PPI network topological similarity matrix and the GO semantic similarity matrix:

X_s = [N_s, G_s]

X_t = [N_t, G_t]
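The bracket notation above reads as a column-wise concatenation of the two similarity matrices. A minimal NumPy sketch of that reading, with random placeholder values and toy sizes that are not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 5, 4, 3            # toy sizes: 5 gene pairs, 4 topology features, 3 GO measures
N_s = rng.random((n, k))     # placeholder topological similarity features
G_s = rng.random((n, d))     # placeholder GO semantic similarity features

# X_s = [N_s, G_s]: one row per gene pair, k + d columns
X_s = np.hstack([N_s, G_s])
```

The target matrix X_t would be built the same way from N_t and G_t, so that both species share the same k + d feature columns.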

The cross-species transfer learning approach consists of two basic steps. First, manifold feature learning is performed to learn new feature representations of the two species. Second, a dynamic distribution alignment method quantitatively evaluates the relative importance of the marginal distribution and the conditional distribution, and adaptively minimizes the difference of both distributions between the two species. Finally, the domain-invariant synthetic lethality classifier f is learned by combining these two steps. Formally, the manifold feature learning function is denoted g(·), and the objective function is expressed as follows:

where the first term represents the loss on the data samples, ‖f‖_K^2 is the squared norm of f, D_f(·,·) represents the dynamic distribution alignment, R_f(·,·) is the Laplacian regularization, and η, λ and ρ are the corresponding regularization parameters.
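A form of the objective that is consistent with the terms just described can be written as below. This is a reconstruction under assumptions: the generic loss ℓ, the hypothesis space H_K and the summation over the n labeled source samples are not stated in the surviving text.

```latex
f = \arg\min_{f \in \mathcal{H}_K}
    \sum_{i=1}^{n} \ell\bigl(f(g(x_i)),\, y_i\bigr)
    + \eta \,\lVert f \rVert_K^2
    + \lambda \, D_f(D_s, D_t)
    + \rho \, R_f(D_s, D_t)
```

The three weights η, λ and ρ reappear in the closed-form solution, where they multiply the identity matrix, the distribution alignment matrix and the Laplacian matrix respectively.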

3. Manifold feature learning phase

The goal of manifold feature learning is to determine a new feature space in which the source and target species exhibit common features. The new feature representation is domain-invariant, which enables the classifier to be transferred from the source species to the target species. We embed the source and target data sets into the Grassmann manifold G(d), which can be regarded as the collection of all d-dimensional subspaces, with the geodesic flow {Φ(t): 0 ≤ t ≤ 1} defined on it. For the original D-dimensional feature vectors x_i and x_j of two gene pairs, we compute Φ(t)^T x, the projection of a feature vector x onto the subspace, for continuous t from 0 to 1, and concatenate all projections into the infinite-dimensional feature vectors z_i and z_j. The inner product of z_i and z_j yields a positive semi-definite geodesic flow kernel:

Thus, the original feature space is transformed into the Grassmann manifold feature space by z = g(x) = √G x, where G can be computed efficiently by singular value decomposition, and the objective function can be expressed as:

The structural risk on D_s is then minimized:

where ‖·‖_F is the Frobenius norm, K ∈ R^{(n+m)×(n+m)} is the kernel matrix with K_ij = K(z_i, z_j), and A ∈ R^{(n+m)×(n+m)} is a diagonal matrix with A_ii = 1 if i ∈ D_s and A_ii = 0 otherwise. Y = [y_1, y_2, ..., y_{n+m}] is the label matrix of Saccharomyces cerevisiae and the target species, and tr(·) is the trace operation.
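The quantities defined in this phase fit together as sketched below. The formulas are a reconstruction from the surrounding definitions (geodesic flow kernel, Frobenius norm, K, A, Y and the trace), under the assumptions of a squared loss and of a kernel expansion of f with coefficients β; they are not copied from the patent.

```latex
% Geodesic flow kernel: inner product of the infinite-dimensional features
\langle z_i, z_j \rangle
  = \int_0^1 \bigl(\Phi(t)^{T} x_i\bigr)^{T} \bigl(\Phi(t)^{T} x_j\bigr)\, dt
  = x_i^{T} G \, x_j

% Kernel expansion of the classifier (assumed)
f(z) = \sum_{i=1}^{n+m} \beta_i \, K(z_i, z)

% Structural risk on the source domain D_s in matrix form
\sum_{i=1}^{n} \bigl(y_i - f(z_i)\bigr)^2 + \eta \,\lVert f \rVert_K^2
  = \bigl\lVert (Y - \beta^{T} K)\, A \bigr\rVert_F^2
    + \eta \,\mathrm{tr}\bigl(\beta^{T} K \beta\bigr)
```

The diagonal matrix A masks out the unlabeled target samples, so only source samples contribute to the loss.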

4. Dynamic distribution alignment phase

The main objective of dynamic distribution alignment is distribution adaptation, that is, minimizing the distribution differences between domains. The importance of the marginal distribution (P) and the conditional distribution (Q) between the two species is quantitatively evaluated by a dynamic distribution alignment method. To this end, an adaptive factor μ is introduced, and the dynamic distribution alignment function is defined as:

(1) Measurement of distribution divergence

The maximum mean discrepancy (MMD) is used to measure the divergence of the marginal distribution P and of the conditional distribution Q between the two domains, and is defined as follows:

The dynamic distribution alignment function can therefore be expressed as:

where the first term represents the marginal distribution divergence between species and the second term represents the conditional distribution divergence. By further applying the representer theorem and kernel tricks, the dynamic distribution alignment function in the above equation can be converted into:
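One way to write the quantities of this phase, consistent with the description of a marginal term and a conditional term weighted by μ, is sketched below. The class index c, the number of classes C and the matrices M_0 and M_c are notation assumed here for the reconstruction.

```latex
% Dynamic distribution alignment with adaptive factor \mu
D_f(D_s, D_t) = (1 - \mu)\, D_f(P_s, P_t) + \mu \sum_{c=1}^{C} D_f^{(c)}(Q_s, Q_t)

% MMD between the marginal distributions in the kernel space
D_f(P_s, P_t) = \bigl\lVert \mathbb{E}[f(z_s)] - \mathbb{E}[f(z_t)] \bigr\rVert_{\mathcal{H}_K}^2

% Kernelized form after the representer theorem
D_f(D_s, D_t) = \mathrm{tr}\bigl(\beta^{T} K M K \beta\bigr),
\qquad M = (1 - \mu)\, M_0 + \mu \sum_{c=1}^{C} M_c
```

With μ close to 0 the marginal term dominates, and with μ close to 1 the conditional term dominates. For synthetic lethality prediction there are two classes, SL and non-SL.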

(2) Adaptive factor μ

The A-distance is used as the basic measure to obtain the adaptive factor. The A-distance is defined from the error of a linear classifier built to distinguish the two domains. Let ε(h) denote the error of the linear classifier h in discriminating the two domains D_s and D_t. The A-distance is defined as follows:

d_A(D_s, D_t) = 2(1 − 2ε(h))

μ can then be estimated as:

where d_M denotes the A-distance of the marginal distributions, and d_c denotes the A-distance of the conditional distributions for the c-th class.
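The estimate of μ can be sketched in Python as follows. The A-distance follows the formula given above; the ratio used for μ, namely 1 − d_M / (d_M + Σ_c d_c), is an assumption consistent with the roles of d_M and d_c, since the original equation is not reproduced here. The error values are hypothetical.

```python
def a_distance(error):
    """A-distance from the error of a linear domain classifier."""
    return 2.0 * (1.0 - 2.0 * error)

def estimate_mu(marginal_error, class_errors):
    """Assumed estimate: mu = 1 - d_M / (d_M + sum_c d_c)."""
    d_m = a_distance(marginal_error)                  # marginal A-distance
    d_c = sum(a_distance(e) for e in class_errors)    # per-class A-distances
    return 1.0 - d_m / (d_m + d_c)

# Hypothetical domain-classifier errors: overall, then one per class
mu = estimate_mu(0.1, [0.2, 0.3])
```

A domain classifier with error 0.5 cannot separate the domains and gives an A-distance of 0. If every distance were 0 the ratio would be undefined, so a real implementation would need to handle that case.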

5. Laplacian regularization phase

Laplacian regularization is introduced to further exploit the similar geometric properties of neighboring points on the manifold G. The pairwise affinity matrix W is as follows:

where sim(·,·) is a similarity function (e.g., cosine similarity) that measures the closeness of two points, and N_p(z_i) denotes the set of the p nearest neighbors of point z_i. p is a free parameter that must be set in advance. By introducing the Laplacian matrix L = D − W, where D is the diagonal matrix with D_ii = Σ_j W_ij, the final Laplacian regularization term is obtained.
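A minimal NumPy sketch of the affinity matrix and the Laplacian follows. It assumes cosine similarity and assumes that two points are connected when either one is among the p nearest neighbors of the other; the function name `knn_laplacian` and the toy data are illustrative only.

```python
import numpy as np

def knn_laplacian(Z, p):
    """Cosine-similarity p-nearest-neighbour affinity W and Laplacian L = D - W."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T                          # cosine similarity between all rows
    np.fill_diagonal(S, -np.inf)           # a point is not its own neighbour
    nn = np.argsort(-S, axis=1)[:, :p]     # indices of the p most similar points
    mask = np.zeros(S.shape, dtype=bool)
    mask[np.repeat(np.arange(len(Z)), p), nn.ravel()] = True
    mask = mask | mask.T                   # z_i in N_p(z_j) or z_j in N_p(z_i)
    W = np.where(mask, S, 0.0)             # keep similarities of neighbours only
    D = np.diag(W.sum(axis=1))             # degree matrix, D_ii = sum_j W_ij
    return W, D - W

rng = np.random.default_rng(0)
Z = rng.random((6, 3))                     # toy manifold features, 6 samples
W, L = knn_laplacian(Z, p=2)
```

Because W is symmetric and D holds its row sums, every row of L sums to zero.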

The final objective function is expressed as:

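Setting the derivative of the objective function with respect to β to zero yields the solution:

β^* = ((A + λM + ρL)K + ηI)^{-1} A Y^T

where M is the MMD matrix of the dynamic distribution alignment term.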
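The closed-form solution can be evaluated with a linear solve instead of an explicit inverse. The sketch below uses toy data: the linear kernel, the zero placeholder matrices for the MMD and Laplacian terms, the one-hot label layout and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def solve_beta(K, A, M, L, Y, eta, lam, rho):
    """Solve ((A + lam*M + rho*L) K + eta*I) beta = A Y^T for beta."""
    n = K.shape[0]
    lhs = (A + lam * M + rho * L) @ K + eta * np.eye(n)
    return np.linalg.solve(lhs, A @ Y.T)

rng = np.random.default_rng(0)
Z = rng.random((5, 3))                    # 3 source + 2 target samples (toy data)
K = Z @ Z.T                               # linear kernel as a stand-in
A = np.diag([1.0, 1.0, 1.0, 0.0, 0.0])    # 1 for source samples, 0 for target
M = np.zeros((5, 5))                      # placeholder MMD matrix
L = np.zeros((5, 5))                      # placeholder Laplacian matrix
Y = np.array([[1, 0, 1, 0, 0],            # one-hot labels, one column per sample;
              [0, 1, 0, 0, 0]], float)    # target columns are left at zero
beta = solve_beta(K, A, M, L, Y, eta=0.1, lam=10.0, rho=1.0)
scores = K @ beta                         # classifier output per sample and class
```

Taking the largest score in each target row would give the predicted class for that target sample.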

Drawings

FIG. 1: similarity measure of gene pairs

FIG. 2: manifold feature matrix transformation

FIG. 3: dynamic distribution alignment

FIG. 4: two different target domains

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to experiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

1. Data collection

We obtained species-specific PPI networks from the BioGrid database, which included 740000 protein interactions in Saccharomyces cerevisiae, 74000 protein interactions in Schizosaccharomyces, and 470000 protein interactions in humans. The BioGrid database also provides experimentally verified synthetic lethality between genes, including 14000 synthetic lethal gene pairs in Saccharomyces cerevisiae, 900 in Schizosaccharomyces, and 800 in humans. GO has three sub-ontologies, namely Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). BP contains 29660 terms, MF contains 11120 terms, and CC contains 4115 terms. Among the various GO-based semantic similarity calculation methods, we used the protein semantic similarity tool proposed by Mazandu et al. The synthetic lethality prediction algorithm is as follows:

2. Synthetic lethal gene prediction for fission yeast

We applied the TLSL model with Saccharomyces cerevisiae (S. cerevisiae) as the source species and Schizosaccharomyces as the target species. In S. cerevisiae, we constructed a PPI network including synthetic lethality data from 9000 experiments. The PPI network covers 904 synthetic lethality, 50 dosage lethality, 200 negative genetic, 200 synthetic growth defect, and 200 positive genetic interactions. In S. cerevisiae, we took 8500 synthetic lethal gene pairs as the positive data set and generated 18000 random pairs within the connected component of the network as the negative data set. In fission yeast, 906 SL pairs formed the positive data set and 8237 non-SL (NSL) pairs formed the negative data set. Next, the topology-based PPI similarity matrix and the GO-based semantic similarity matrix were calculated. Finally, gene pairs with missing functional similarity values were removed, and the feature matrices of S. cerevisiae and fission yeast, X_s ∈ R^{25039×35} and X_t ∈ R^{8463×35}, were obtained by linear combination. The feature matrices were used as the input of the transfer learning model to obtain the synthetic lethality prediction result Y_t for fission yeast. To evaluate the performance of the proposed method in predicting SL pairs, we used a series of performance metrics: accuracy (ACC), sensitivity (Se), specificity (Sp), precision (Pr), F1-measure (F1), G-mean (GM) and Matthews correlation coefficient (MCC). TLSL was also used to identify SL pairs missing in Schizosaccharomyces: we hoped to find 177 SL pairs, but only 65 were found. Table 1 shows that the sensitivity of the method is 95.9-80.5%, the specificity is 91.6-89.7%, and the accuracy is 88.6-85.1%.
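The listed metrics can all be computed from the four confusion-matrix counts. The sketch below uses the standard definitions, with G-mean taken as the square root of sensitivity times specificity; the counts are hypothetical and are not results from the patent.

```python
import math

def sl_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    pr = tp / (tp + fp)                      # precision
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    f1 = 2 * pr * se / (pr + se)             # harmonic mean of Pr and Se
    gm = math.sqrt(se * sp)                  # geometric mean of Se and Sp
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "Se": se, "Sp": sp, "Pr": pr,
            "F1": f1, "GM": gm, "MCC": mcc}

# Hypothetical counts for a small test set of gene pairs
m = sl_metrics(tp=8, fp=2, tn=7, fn=3)
```

MCC and G-mean are useful here because the positive and negative sets differ considerably in size, which can make accuracy alone misleading.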

TABLE 1 comparison of the Performance of the synthetic lethality prediction model of Schizosaccharomyces

3. Prediction of synthetic lethal genes in humans

We used Saccharomyces cerevisiae as the labeled source species and human as the unlabeled target species. The source data set was the same as in the prediction of synthetic lethal genes for Schizosaccharomyces. The human PPI network was constructed from the BioGrid database and comprises 6645 genes and 17083 physical interactions. 803 SL pairs were randomly selected as the positive data set and 6000 NSL pairs as the negative data set. Next, the topology-based PPI similarity matrix and the GO-based semantic similarity matrix were calculated. Finally, gene pairs with missing functional similarity values were removed, and the human feature matrix X_t ∈ R^{8463×35} was obtained by linear combination. The feature matrix was used as the input of the transfer learning model to obtain the synthetic lethality prediction result Y_t for human. To evaluate the prediction performance of the TLSL method, the results were compared with the SINaTRA method. The results (Table 2) show that TLSL performs best on all classification metrics for human SL gene pairs, with a clear improvement on each metric.

TABLE 2 comparison of the Performance of the human synthetic lethality prediction model

4. Experiment and analysis of results

Experimental results show that the transfer learning model outperforms state-of-the-art classifiers in the cross-species learning task of transferring synthetic lethal gene knowledge from Saccharomyces cerevisiae to human. The empirical success of the transfer learning model can be attributed to the following advantages. First, manifold feature learning enables the model to learn a new feature representation of common features that is invariant across the two species. Shallow models that focus only on the covariance of the observed variables, such as random forests and support vector machines, have difficulty capturing these common features between the two domains. Second, dynamic distribution alignment takes into account both the marginal distribution and the conditional distribution between species and adaptively weighs the importance of each. Traditional classifiers, such as random forests, typically cannot capture inter-domain distribution differences, which limits their performance on cross-species tasks.

Claims

1. A method for predicting synthetic lethal genes based on cross-species transfer learning, characterized by the following implementation steps:

(1) data collection, namely generating PPI networks from the protein interaction data of Saccharomyces cerevisiae, Schizosaccharomyces and human collected from the BioGrid protein interaction database;

(2) data preprocessing, namely calculating the PPI network topological similarity and the GO-based semantic similarity for a source species and a target species, and obtaining feature matrices through linear combination;

(3) manifold feature learning, namely embedding the source data set and the target data set into a Grassmann manifold, and then converting the source feature space into the Grassmann manifold feature space;

(4) dynamic distribution alignment, namely quantitatively evaluating the importance of the marginal distribution and the conditional distribution between the two species by a dynamic distribution alignment method;

(5) Laplacian regularization, namely exploiting the similar geometric properties of neighboring points on the manifold G to obtain a pairwise affinity matrix.

2. The method for predicting synthetic lethal genes based on cross-species transfer learning according to claim 1, characterized in that, in the data collection phase:

species-specific PPI networks, including those of Saccharomyces cerevisiae, Schizosaccharomyces and human, are obtained from the BioGrid protein interaction database.

3. The method for predicting synthetic lethal genes based on cross-species transfer learning according to claim 1, characterized in that the data preprocessing stage comprises:

(1) measuring the PPI network topological similarity of the source species and the target species;

(2) measuring the GO-based semantic similarity of the source species and the target species;

(3) linearly combining the PPI network topological similarity matrix and the GO-based semantic similarity matrix:

X_s = [N_s, G_s]

X_t = [N_t, G_t].

4. The method for predicting synthetic lethal genes based on cross-species transfer learning according to claim 1, characterized in that the manifold feature learning phase comprises:

(1) embedding the source and target data sets into the Grassmann manifold;

(2) adopting the GFK (Geodesic Flow Kernel) algorithm, and obtaining a positive semi-definite geodesic flow kernel from the inner product of the feature vectors:

(3) converting the source feature space into the manifold feature space.

5. The method for predicting synthetic lethal genes based on cross-species transfer learning according to claim 1, characterized in that the dynamic distribution alignment phase comprises:

(1) calculating the maximum mean discrepancy (MMD) for the marginal distribution P and the conditional distribution Q:

(2) measuring the dynamic distribution alignment divergence:

(3) calculating the adaptive factor μ:

。

6. The method for predicting synthetic lethal genes based on cross-species transfer learning according to claim 1, characterized in that the Laplacian regularization phase comprises:

(1) calculating the Laplacian regularization term:

(2) setting the derivative of the objective function with respect to β to zero, which yields the solution:

β^* = ((A + λM + ρL)K + ηI)^{-1} A Y^T.