CN113838525A - Method and system for predicting pathogenic gene pair

Info

Publication number: CN113838525A
Application number: CN202111150222.8A
Authority: CN
Inventors: 袁杨杨; 李淼新
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-24
Anticipated expiration: 2041-09-29
Also published as: CN113838525B

Abstract

The invention discloses a method and a system for predicting pathogenic gene pairs. The method comprises the following steps: constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set; introducing features and constructing, on the basis of the reference data set, a whole-genome double-gene interaction effect potential prediction model based on a random forest model to obtain a prediction model; predicting, with the prediction model, the double-gene interaction effect potential between whole-genome coding gene pairs, and storing the results in triplet form to obtain a prediction result; and compressing the prediction result with a pairwise data compression method. The system comprises a data set construction module, a training module, a prediction module and a compression module. The invention helps to discover potentially interacting pathogenic gene pairs and can be widely applied in the field of gene pair prediction.

Description

Method and system for predicting pathogenic gene pair

Technical Field

The invention relates to the field of gene pair prediction, in particular to a method and a system for predicting a pathogenic gene pair.

Background

Human genetic diseases can be broadly divided into three categories: Mendelian (monogenic) diseases, oligogenic diseases and polygenic diseases. Monogenic inheritance is the simplest genetic pattern of disease; in theory, one or a few pathogenic sites or genes are sufficient to cause the disease phenotype, as in the common examples of cystic fibrosis and thalassemia. However, the known susceptibility genes of monogenic diseases do not fully explain the corresponding disease phenotypes. In addition, owing to the diversity of environmental factors and the complexity of the human genome, the phenotypes of many diseases are intricate, which makes diagnosis more difficult. With the rapid development of whole-genome sequencing technology in the current big-data era, susceptibility genes of complex diseases are continuously being mined, and the localization of disease susceptibility genes faces both great opportunities and great challenges.

The traditional methods for locating gene pairs with a double-gene interaction effect include hybridization experiments, family-based association analysis, genome-wide association analysis and multi-omics association analysis. However, locating double-gene interaction effects with these methods requires certain preconditions, so the methods are severely limited and are difficult to apply to screening at the whole-genome scale.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides a method and a system for predicting a pathogenic gene pair. The algorithm is a supervised learning method and can help to reveal potentially interacting pathogenic gene pairs.

The first technical scheme adopted by the invention is as follows: a method for predicting a pathogenic gene pair comprises the following steps:

S1, constructing a data set based on the double-gene disease database, and filtering and screening the data to obtain a reference data set;

S2, introducing features and constructing, on the basis of the reference data set, a whole-genome double-gene interaction effect potential prediction model based on a random forest model to obtain a prediction model;

S3, predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model, and storing the potential values in triplet form to obtain a prediction result;

S4, compressing the prediction result based on the pairwise data compression method.

Further, the step of constructing a data set based on the double-gene disease database, and performing data filtering and screening to obtain a reference data set specifically includes:

S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as a positive training sample;

S12, taking the gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample;

S13, taking the gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample;

S14, taking the gene pairs obtained by pairwise combination of the main pathogenic genes of monogenic diseases with the loss-of-function genes as a third negative training sample;

S15, taking the gene pairs obtained by randomly selecting two protein-coding genes from the whole genome as a fourth negative training sample;

S16, taking the gene pairs obtained by random pairwise combination of the genes in the double-gene disease database as a fifth negative training sample;

S17, filtering and screening the positive training sample, the first negative training sample, the second negative training sample, the third negative training sample, the fourth negative training sample and the fifth negative training sample based on normal samples and the feature missing rate to obtain a reference data set.
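The construction of the five negative sample sets can be illustrated with a minimal Python sketch. The function names, the use of plain gene-symbol lists as input, and the removal of known positive pairs from every negative set are assumptions made for illustration; they are not specified in the text above.

```python
import itertools
import random


def pairwise(genes):
    """All unordered pairs within one gene list (MD- and LOF-style negatives)."""
    return {tuple(sorted(p)) for p in itertools.combinations(sorted(set(genes)), 2)}


def cross_pairs(genes_a, genes_b):
    """Unordered pairs with one gene from each list (MDLOF-style negatives)."""
    return {tuple(sorted((a, b))) for a in set(genes_a) for b in set(genes_b) if a != b}


def random_pairs(genes, n, seed=0):
    """Up to n distinct random unordered pairs drawn from a gene list."""
    rng = random.Random(seed)
    pool = sorted(set(genes))
    n = min(n, len(pool) * (len(pool) - 1) // 2)
    out = set()
    while len(out) < n:
        a, b = rng.sample(pool, 2)
        out.add(tuple(sorted((a, b))))
    return out


def build_negative_sets(positive_pairs, md_genes, lof_genes, coding_genes, n_random):
    """Return the five negative sets keyed by the labels used in the description."""
    positives = {tuple(sorted(p)) for p in positive_pairs}
    digenic_genes = {g for p in positives for g in p}
    sets = {
        "MD": pairwise(md_genes),
        "LOF": pairwise(lof_genes),
        "MDLOF": cross_pairs(md_genes, lof_genes),
        "Random": random_pairs(coding_genes, n_random),
        "DIDA_NDI": pairwise(digenic_genes),
    }
    # Assumption: a known positive pair is never kept as a negative.
    return {name: pairs - positives for name, pairs in sets.items()}
```

Each pair is stored as a sorted tuple so that (A, B) and (B, A) are treated as the same gene pair.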

Further, the step of introducing features and constructing, on the basis of the reference data set, a whole-genome double-gene interaction effect potential prediction model based on a random forest model to obtain the prediction model specifically comprises the following steps:

S21, randomly drawing, by equal-proportion sampling, the same number of samples from each negative sample set in the reference data set to form a negative sample subset whose size equals the number of positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set;

S22, sampling with replacement from each negative sample set and repeating step S21 until a preset number of times is reached, to obtain a plurality of sub-training sets;

S23, for each sub-training set, introducing features and training on the basis of a random forest method to obtain sub-models;

S24, calculating the out-of-bag error rate of the sub-models, and selecting the sub-models with an out-of-bag error rate larger than a preset value to obtain the selected sub-models;

S25, weighting all the selected sub-models by taking the out-of-bag error rate as the weight to obtain a whole-genome double-gene interaction effect potential prediction model.
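The undersampling in steps S21 and S22 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name is hypothetical, "sampling with replacement" is read as drawing from the full negative pools again in every round, and any remainder of the quota is given to the first negative sets in alphabetical order. Training a random forest on each sub-training set (S23 to S25) is left out.

```python
import random


def make_sub_training_sets(positives, negative_sets, n_rounds, seed=0):
    """Build n_rounds balanced sub-training sets of (sample, label) tuples.

    positives     -- list of positive samples (label 1)
    negative_sets -- dict mapping a set name to a list of negative samples (label 0)
    """
    rng = random.Random(seed)
    base, extra = divmod(len(positives), len(negative_sets))
    subsets = []
    for _ in range(n_rounds):
        negatives = []
        for idx, (_name, pool) in enumerate(sorted(negative_sets.items())):
            # Equal share per negative set; the first `extra` sets take one more.
            quota = base + (1 if idx < extra else 0)
            # Each round samples from the full pool again (assumed meaning of S22).
            negatives.extend(rng.sample(pool, min(quota, len(pool))))
        samples = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
        rng.shuffle(samples)
        subsets.append(samples)
    return subsets
```

Because every round keeps all positives and redraws the negatives, each sub-model sees a balanced class ratio while the ensemble as a whole covers far more of the negative data than any single sub-training set.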

Further, the step of introducing features and training to obtain the sub-models based on a random forest method for each sub-training set specifically comprises:

S231, for each sub-training set, a feature value of 0 indicates that there is no similarity between the two genes, and a missing feature value is represented by -1;

S232, calculating the number of missing feature values of each sample in each sub-training set and using this number as a new feature;

S233, through 10-fold cross-validation, with the weighted harmonic mean of precision and recall as the evaluation criterion, tuning the parameters of the random forest by grid search to obtain a sub-model;

the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples in a leaf node.
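The missing-value encoding of S231 and S232 and the evaluation criterion of S233 can be illustrated with a short Python sketch. The names `encode_features` and `f_beta`, the use of `None` to mark a missing raw value, and the default `beta = 1` are assumptions for illustration; the text does not state which weighting of precision and recall is used.

```python
MISSING = -1.0  # sentinel for a missing feature value, distinct from 0 (no similarity)


def encode_features(raw):
    """Replace None with -1 and append the count of missing values as a new feature."""
    encoded = [MISSING if v is None else float(v) for v in raw]
    n_missing = sum(1 for v in raw if v is None)
    return encoded + [float(n_missing)]


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (the F-beta score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `encode_features([0.5, None, 0, None])` yields `[0.5, -1.0, 0.0, -1.0, 2.0]`: the two missing values become -1 and the final entry records that two values were missing.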

Further, the features include mutation level information, gene level information, protein interaction level information, protein structure information, expression level information, and phenotype level information.

Further, the step of compressing the prediction result based on the paired data compression method specifically includes:

constructing a gene name dictionary file according to the order of the gene names;

converting the potential values of the gene pairs into integers according to a preset rule and storing the integers as a data file to obtain a compressed prediction result.

Further, the preset rule includes:

if 2 decimal places are retained, L_d = 1 byte is used to represent each potential value; the value is converted to the integer v' = 100v and saved as v' & 0xFF;

if 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value; the value is converted to the integer v' = 10000v and saved as [v' & 0xFF, (v' >> 8) & 0xFF];

if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value; the value is converted to the integer v' = 1000000v and saved as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
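The preset rule amounts to fixed-point encoding with the low byte first. A minimal Python sketch is given below; the function names are hypothetical, and rounding to the nearest integer is an assumption, since the text only states that the value is converted to an integer.

```python
# Bytes per value (L_d) mapped to the scale factor for 2, 4 and 6 decimal places.
SCALES = {1: 100, 2: 10_000, 3: 1_000_000}


def encode_potential(v, n_bytes):
    """Encode a potential value v in [0, 1] as n_bytes bytes, low byte first."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("potential value must lie in [0, 1]")
    scaled = round(v * SCALES[n_bytes])  # assumption: round to nearest integer
    return bytes((scaled >> (8 * k)) & 0xFF for k in range(n_bytes))


def decode_potential(buf):
    """Reverse the encoding: rebuild the integer and divide by the scale factor."""
    scaled = int.from_bytes(buf, "little")
    return scaled / SCALES[len(buf)]
```

The largest scaled values (100, 10000 and 1000000) fit in 1, 2 and 3 bytes respectively, so no value in [0, 1] overflows its slot.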

Further, the method also comprises a step of quickly accessing the compressed prediction result, which specifically comprises the following steps:

for a gene pair (G_i, G_j) to be queried;

reading the gene names recorded in the gene name dictionary file, and constructing an index according to their order;

obtaining the indices i and j of gene G_i and gene G_j in the dictionary, and obtaining the start address I(i, j) of the gene pair in the data file;

moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;

decoding the obtained byte array by reversing the procedure used during encoding.
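The access steps can be sketched in Python as below. The layout of the data file is an assumption: the sketch stores only the pairs with i < j, row by row, so the start address is (i(2n - i - 1)/2 + (j - i - 1)) · L_d for n genes. This is one plausible layout and is not taken from the source. The helper names and the file names used in the usage note are likewise hypothetical.

```python
def pair_offset(i, j, n, n_bytes):
    """Start address of pair (i, j) in an upper-triangular, row-major data file."""
    if i == j:
        raise ValueError("a gene is not paired with itself")
    if i > j:
        i, j = j, i
    return (i * (2 * n - i - 1) // 2 + (j - i - 1)) * n_bytes


def write_store(dict_path, data_path, genes, scores, n_bytes, scale):
    """Write the tab-separated dictionary file and the fixed-width data file."""
    names = sorted(genes)  # code-point order, which is ASCII order for ASCII names
    with open(dict_path, "w", encoding="ascii") as fh:
        fh.write("\t".join(names))
    with open(data_path, "wb") as fh:
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                value = round(scores[(names[i], names[j])] * scale)
                fh.write(value.to_bytes(n_bytes, "little"))


def lookup(dict_path, data_path, gene_a, gene_b, n_bytes, scale):
    """Seek directly to one gene pair and decode its potential value."""
    with open(dict_path, encoding="ascii") as fh:
        names = fh.read().split("\t")
    index = {name: k for k, name in enumerate(names)}
    offset = pair_offset(index[gene_a], index[gene_b], len(names), n_bytes)
    with open(data_path, "rb") as fh:
        fh.seek(offset)
        buf = fh.read(n_bytes)
    return int.from_bytes(buf, "little") / scale
```

A call such as `lookup("demo.d", "demo.b", "TP53", "BRCA1", 2, 10000)` reads only 2 bytes from the data file, regardless of how many gene pairs it contains, because every record has the same width and its position follows from the two dictionary indices alone.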

The second technical scheme adopted by the invention is as follows: a system for predicting pairs of disease-causing genes, comprising:

the data set construction module is used for constructing a data set based on the double-gene disease database, and filtering and screening data to obtain a reference data set;

the training module is used for introducing features and constructing, on the basis of the reference data set, a whole-genome double-gene interaction effect potential prediction model based on a random forest model to obtain a prediction model;

the prediction module is used for predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model and storing the potential values in triplet form to obtain a prediction result;

and the compression module is used for compressing the prediction result based on a paired data compression method.

The method and the system have the following beneficial effects: the invention is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on the biological similarity or correlation between genes; it can evaluate the double-gene interaction effect of different gene pairs without prior knowledge of candidate genes; and it obtains more reliable prediction results by filtering and screening a high-quality reference data set.

Drawings

FIG. 1 is a flow chart showing the steps of a method for predicting a pair of pathogenic genes according to the present invention;

FIG. 2 is a block diagram showing the construction of a system for predicting a pair of pathogenic genes according to the present invention;

FIG. 3 is a diagram illustrating compression and fast access to predicted results according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

Referring to fig. 1, the present invention provides a method for predicting a pathogenic gene pair, the method comprising the steps of:

Further, as a preferred embodiment of the method, the step of constructing a data set based on the double-gene disease database, and performing data filtering and screening to obtain a reference data set specifically includes:

S12, taking the gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample (MD);

specifically, we consider that there is no double-gene interaction effect between the main pathogenic genes of common monogenic diseases;

S13, taking the gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample (LOF);

specifically, we consider that LOF (loss-of-function) genes have essentially no double-gene interaction effect;

S14, taking the gene pairs obtained by pairwise combination of the main pathogenic genes of monogenic diseases with the loss-of-function genes as a third negative training sample (MDLOF);

S15, taking the gene pairs obtained by randomly selecting and combining two protein-coding genes from the whole genome as a fourth negative training sample (Random);

specifically, we consider that the double-gene interaction effect is uncommon, so the probability that it exists between two genes randomly selected from the whole genome is extremely low;

S16, taking the gene pairs obtained by random pairwise combination of the genes in the double-gene disease database as a fifth negative training sample (DIDA_NDI);

Specifically, we use data from the 1000 Genomes Project for filtering. The 1000 Genomes samples are normal samples, i.e. samples without a disease phenotype. We consider that if non-synonymous mutations with allele frequencies of 1% or less are present in both genes of a gene pair, and at least 2 individuals in the 1000 Genomes samples carry such a pair of mutated genes, the pair is a true negative pair; only such pairs are retained in the final negative training set. In addition, many features are used to train the model, but the missing rates of different features differ, and a large number of missing values reduces the performance of the model. Gene pairs with high feature missing rates are therefore deleted, which improves the reliability of the training set and the performance of the model.
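The two filters described in this paragraph can be sketched in Python as follows. The record layout (a list of dicts with `gene`, `af`, `nonsynonymous` and `samples` keys), the function names and the `max_missing_rate` parameter are hypothetical; the text does not state the missing-rate cut-off that was actually used.

```python
def rare_carriers(variants, max_af=0.01):
    """Map each gene to the individuals carrying a rare non-synonymous variant in it."""
    carriers = {}
    for v in variants:
        if v["nonsynonymous"] and v["af"] <= max_af:
            carriers.setdefault(v["gene"], set()).update(v["samples"])
    return carriers


def filter_negative_pairs(pairs, carriers, min_individuals=2):
    """Keep pairs for which at least min_individuals carry rare variants in both genes."""
    kept = []
    for a, b in pairs:
        shared = carriers.get(a, set()) & carriers.get(b, set())
        if len(shared) >= min_individuals:
            kept.append((a, b))
    return kept


def drop_sparse_pairs(features, max_missing_rate):
    """Drop gene pairs whose fraction of missing (None) features exceeds the cut-off."""
    return {
        pair: values
        for pair, values in features.items()
        if sum(v is None for v in values) / len(values) <= max_missing_rate
    }
```

The reasoning behind the first filter is that healthy individuals who carry rare non-synonymous variants in both genes are direct evidence that the combination does not cause a disease phenotype.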

Further as a preferred embodiment of the method, the step of introducing features and constructing, on the basis of the reference data set, a whole-genome double-gene interaction effect potential prediction model based on a random forest model to obtain the prediction model specifically comprises:

specifically, this step further includes sampling and constructing a test set.

The formula is as follows:

specifically, i denotes the selected sub-model, oob_i is the out-of-bag error rate of the i-th sub-model, RF_i represents the prediction result of the i-th sub-model, and RF_Final is the weighted double-gene interaction effect potential value.
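The weighting formula itself is not reproduced in this text. As an illustration only, the sketch below assumes a normalized weighted average, which keeps the combined value in [0, 1] when every sub-model prediction lies in [0, 1]; the function name is hypothetical, and the actual formula may differ.

```python
def weighted_potential(predictions, weights):
    """Combine sub-model predictions RF_i with weights oob_i (normalization assumed)."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, predictions)) / total
```

With predictions 0.8 and 0.4 and weights 0.9 and 0.3, the combined value is (0.72 + 0.12) / 1.2 = 0.7.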

Further, as a preferred embodiment of the method, the step of introducing features and training based on a random forest method to obtain the sub-models for each sub-training set specifically includes:

specifically, since we use the biological similarity or correlation between two genes to assess the double-gene interaction effect, a feature value of 0 means that there is no such similarity or correlation between the two genes; where a feature value is missing, we define the missing value as -1 to distinguish it from 0;

specifically, in order to evaluate whether the number of missing values affects the judgment of the final model, the number of missing values (-1) in each sample is calculated and added to the model training as a new feature;

Finally, the weighted model is used to predict the double-gene interaction effect potential values between whole-genome coding gene pairs. All coding genes in the whole genome are downloaded from the official website of the HUGO Gene Nomenclature Committee (HGNC); the whole-genome coding gene pairs are then obtained by pairwise combination, and the relevant features of all gene pairs are obtained from the corresponding databases. As in step S23, missing features are replaced with -1, and for each sample the number of missing values is calculated and added as a feature. The triplet form is shown at C in fig. 3.
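Enumerating the whole-genome gene pairs and emitting triplets can be sketched in Python as below. The function names and the `score_fn` callback, which stands in for the weighted prediction model, are hypothetical.

```python
import itertools


def n_pairs(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


def iter_triplets(genes, score_fn):
    """Yield (gene_a, gene_b, potential) for every unordered pair, in sorted order."""
    for a, b in itertools.combinations(sorted(set(genes)), 2):
        yield (a, b, score_fn(a, b))
```

The pair count grows quadratically: `n_pairs(20000)` is 199,990,000, so a gene list of that order of magnitude yields about 2 × 10^8 triplets, which is why the prediction result is compressed in the next step.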

Further as a preferred embodiment of the method, the features include:

mutation level information: a score of the tolerance of the gene to mutation;

gene level information: the probability that a gene is a recessive pathogenic gene, the essentiality status of the gene pair for basic functional development, the probability of being an essential gene, whether the gene pair participates in the same pathway, the semantic similarity of the gene pair in Gene Ontology (GO), the magnitude of the gene-gene interaction effect, the number and similarity of interacting genes shared by the gene pair, haploinsufficiency, the destructive strength of the genes, the biological distance between the gene pair, and the tolerance of the gene pair to loss of function;

protein interaction level information: protein-protein interaction effects, mainly from the BioGRID and STRING databases;

protein structure information: mainly the gene domain information provided in the UniProtKB database;

expression level information: the expression profiles of the gene pair in different tissues, protein abundance, and the degree of gene co-expression;

phenotype level information: the semantic similarity of the gene pair in Disease Ontology (DO), i.e. the degree of similarity of the disease phenotypes associated with the two genes.

Further as a preferred embodiment of the method, reliable features are screened for model training, and feature screening mainly comprises:

because different features differ in information coverage, features with high missing rates are deleted first, so that missing information does not reduce model performance;

the 14 best-performing input features are then selected by recursive feature elimination (RFE), and using these high-quality features further ensures the accuracy and reliability of the model.

Further, as a preferred embodiment of the method, the step of compressing the prediction result by using the pairwise data compression method specifically includes:

constructing a gene name dictionary file with the extension *.d according to the gene name order (ASCII code order), wherein the file is a single-line file with tab as the separator;

specifically, the potential value v of a gene pair lies in [0, 1] and is stored in a data file *.b according to a preset rule.

Further as a preferred embodiment of the method, the preset rule includes:

if 2 decimal places are retained, L_d = 1 byte is used to represent each potential value; the value is converted to the integer v' = 100v and saved as v' & 0xFF;

if 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value; the value is converted to the integer v' = 10000v and saved as [v' & 0xFF, (v' >> 8) & 0xFF];

if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value; the value is converted to the integer v' = 1000000v and saved as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].

Further, as a preferred embodiment of the method, the method further includes a step of quickly accessing the compressed prediction result, which specifically includes:

for a gene pair (G_i, G_j) to be queried;

reading the gene names recorded in the *.d gene name dictionary file, and constructing an index according to their order;

specifically, the calculation formula of the start address is as follows:

wherein i and j are the indices of gene G_i and gene G_j in the dictionary, and L_d is the length in bytes used to store each potential value.

The advantages of the invention relative to the existing method are mainly reflected in the following three points:

1. the traditional methods for locating double-gene interaction effects require that a pathogenic gene and a candidate interacting gene be known in advance, whereas the invention is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on the biological similarity or correlation between genes, so that the double-gene interaction effect of different gene pairs can be evaluated even when the candidate genes are unknown;

2. the training set is a high-quality reference data set obtained after filtering and screening and has relatively wide coverage, and a reliable feature set helps us to construct a relatively robust prediction model, in which protein-protein interaction, semantic similarity in Gene Ontology (GO) and semantic similarity in Disease Ontology (DO) are the 3 most important features;

3. the model construction method based on undersampling and ensembling enables us to obtain more reliable prediction results.

As shown in fig. 2, a system for predicting a pair of pathogenic genes includes:

Further as a preferred embodiment of the present system, the present system further comprises:

and the access module is used for quickly accessing the compressed prediction result.

The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for predicting a pathogenic gene pair, comprising the steps of:

2. The method for predicting pathogenic gene pairs as claimed in claim 1, wherein the step of constructing a data set based on the double-gene disease database, and performing data filtering and screening to obtain a reference data set specifically comprises:

3. The method for predicting pathogenic gene pairs as claimed in claim 2, wherein the step of introducing features and constructing, on the basis of the reference data set, a whole-genome double-gene interaction effect potential prediction model based on a random forest model to obtain a prediction model specifically comprises:

4. The method for predicting pathogenic gene pairs as claimed in claim 3, wherein the step of introducing features and training based on a random forest method to obtain the sub-models for each sub-training set specifically comprises:

5. The method of claim 4, wherein the characteristics include mutation level information, gene level information, protein interaction level information, protein structure information, expression level information, and phenotype level information.

6. The method for predicting pathogenic gene pairs as claimed in claim 5, wherein the step of compressing the prediction result based on the paired data compression method specifically comprises:

7. The method for predicting pathogenic gene pair according to claim 6, wherein the predetermined rule comprises:

if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value; the value is converted to the integer v' = 1000000v and saved as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].

8. The method for predicting pathogenic gene pairs according to claim 7, further comprising a step of quickly accessing the compressed predicted results, which specifically comprises:

for a gene pair (G_i, G_j) to be queried;

9. A system for predicting a pair of disease-causing genes, comprising: