CN113838525B

CN113838525B - Prediction method and system for pathogenic gene pair

Info

Publication number: CN113838525B
Application number: CN202111150222.8A
Authority: CN
Inventors: 袁杨杨; 李淼新
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2023-09-29
Anticipated expiration: 2041-09-29
Also published as: CN113838525A

Abstract

The application discloses a method and a system for predicting pathogenic gene pairs. The method comprises the following steps: constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set; introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model; predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model, and storing the values in triplet form to obtain a prediction result; and compressing the prediction result based on a paired data compression method. The system comprises a data set construction module, a training module, a prediction module and a compression module. The application can help reveal potentially interacting pathogenic gene pairs, and the method and system can be widely applied in the field of gene pair prediction.

Description

Prediction method and system for pathogenic gene pair

Technical Field

The application relates to the field of gene pair prediction, in particular to a method and a system for predicting pathogenic gene pairs.

Background

Human genetic diseases can be broadly divided into three classes: Mendelian (monogenic) diseases, oligogenic diseases and polygenic diseases. The monogenic model is the simplest genetic model of disease: in theory, one or a few causative sites or genes are sufficient to produce the disease phenotype, as in the common examples of cystic fibrosis and thalassemia. However, the known susceptibility genes of monogenic diseases do not fully explain the corresponding disease phenotypes. In addition, the variability of environmental factors and the complexity of the human genome make the phenotypes of many diseases complex, which makes diagnosis more difficult. With the rapid development of whole-genome sequencing in the current era of big data, susceptibility genes of complex diseases are continuously being mined, and locating disease susceptibility genes presents both great opportunities and great challenges.

Traditional methods for locating gene pairs with a double-gene interaction effect include crossing experiments, family-based association analysis, genome-wide association analysis and multi-omics joint analysis. However, locating double-gene interaction effects with these methods requires certain preconditions, and the methods are limited in scope and difficult to apply to screening across the whole genome.

Disclosure of Invention

In order to solve the above technical problems, the present application aims to provide a method and a system for predicting pathogenic gene pairs. The method is a supervised learning method that can help reveal potentially interacting pathogenic gene pairs.

The first technical scheme adopted by the application is as follows: a method of predicting a pathogenic gene pair comprising the steps of:

S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;

S2, introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model;

S3, predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model, and storing the values in triplet form to obtain a prediction result;

S4, compressing the prediction result based on a paired data compression method.

Further, the step of constructing a data set based on the double-gene disease database and performing data filtering and screening to obtain a reference data set specifically comprises the following steps:

S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;

S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample;

S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample;

S14, taking gene pairs obtained by pairing the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample;

S15, taking gene pairs obtained by randomly selecting and combining two protein-coding genes from the whole genome as a fourth negative training sample;

S16, taking gene pairs obtained by randomly recombining the genes of the double-gene disease database as a fifth negative training sample;

S17, filtering and screening the positive training samples, the first negative training sample, the second negative training sample, the third negative training sample, the fourth negative training sample and the fifth negative training sample based on normal samples and the feature missing rate to obtain the reference data set.
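The candidate sets of steps S11 to S16 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `build_candidate_sets`, its parameters and the exclusion of known positive pairs from every negative set are hypothetical choices made for the example.

```python
from itertools import combinations, product
import random


def unordered(a, b):
    """Return a gene pair in canonical (sorted) order so (A, B) == (B, A)."""
    return (a, b) if a <= b else (b, a)


def build_candidate_sets(digenic_pairs, monogenic_genes, lof_genes,
                         coding_genes, n_random, seed=0):
    """Build the positive set and the five negative candidate sets (S11-S16).

    All inputs are plain collections of gene symbols; how they are obtained
    from the underlying databases is outside the scope of this sketch.
    """
    rng = random.Random(seed)
    positives = {unordered(a, b) for a, b in digenic_pairs}

    # S12: pairwise combinations of monogenic disease genes (MD).
    md = set(combinations(sorted(set(monogenic_genes)), 2))
    # S13: pairwise combinations of loss-of-function genes (LOF).
    lof = set(combinations(sorted(set(lof_genes)), 2))
    # S14: one monogenic disease gene paired with one LOF gene (MDLOF).
    mdlof = {unordered(a, b)
             for a, b in product(monogenic_genes, lof_genes) if a != b}

    # S15: random pairs of protein-coding genes (Random).
    coding = sorted(set(coding_genes))
    n = len(coding)
    cap = max(0, n * (n - 1) // 2 - len(positives))
    rand = set()
    while len(rand) < min(n_random, cap):
        pair = unordered(*rng.sample(coding, 2))
        if pair not in positives:
            rand.add(pair)

    # S16: recombined genes of the digenic database, minus known pairs.
    dida_genes = sorted({g for pair in positives for g in pair})
    dida_ndi = set(combinations(dida_genes, 2))

    # Assumption: known positive pairs are never used as negatives.
    return {
        "positive": positives,
        "MD": md - positives,
        "LOF": lof - positives,
        "MDLOF": mdlof - positives,
        "Random": rand,
        "DIDA_NDI": dida_ndi - positives,
    }
```

The filtering of step S17 would then be applied to each of the returned sets.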

Further, the step of introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model specifically comprises the following steps:

S21, randomly drawing equal numbers of samples from each of the negative sample sets in the reference data set (equal-proportion sampling) to form a negative sample subset of the same size as the positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set;

S22, performing the sampling with replacement within each negative sample set, and repeating step S21 until a preset number of iterations is reached, so as to obtain a plurality of sub-training sets;

S23, introducing features for each sub-training set and training with the random forest method to obtain sub-models;

S24, calculating the out-of-bag error rate of each sub-model and selecting the sub-models whose out-of-bag error rate is greater than a preset value to obtain the selected sub-models;

S25, weighting all selected sub-models, with the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.
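The undersampling of steps S21 and S22 and the selection and weighting of steps S24 and S25 can be sketched as below. The random forest training of step S23 is deliberately left out. The function names are hypothetical, the even split of the negative quota across sources is one reading of "equal-proportion sampling", and normalizing by the sum of the weights is an assumption that keeps the combined value in the same range as the sub-model outputs.

```python
import random


def make_sub_training_sets(positives, negatives_by_source, n_rounds, seed=0):
    """S21-S22: build balanced sub-training sets by undersampling negatives."""
    rng = random.Random(seed)
    sources = sorted(negatives_by_source)
    sub_sets = []
    for _ in range(n_rounds):
        # Split the positive count as evenly as possible across the sources.
        base, extra = divmod(len(positives), len(sources))
        negatives = []
        for k, name in enumerate(sources):
            quota = base + (1 if k < extra else 0)
            # Sampling with replacement within each negative sample set.
            negatives.extend(rng.choices(negatives_by_source[name], k=quota))
        sub_sets.append((list(positives), negatives))
    return sub_sets


def select_and_weight(sub_model_outputs, oob_values, threshold):
    """S24-S25: keep sub-models above the threshold and combine their outputs.

    `oob_values` holds the out-of-bag value of each sub-model, used both for
    selection and as the weight, following the wording of the steps above.
    """
    kept = [(out, w) for out, w in zip(sub_model_outputs, oob_values)
            if w > threshold]
    if not kept:
        raise ValueError("no sub-model passed the selection threshold")
    total = sum(w for _, w in kept)
    # Assumption: weights are normalized so the result is a weighted mean.
    return sum(out * w for out, w in kept) / total
```

For example, with sub-model outputs 0.2, 0.8 and 0.5, out-of-bag values 0.9, 0.6 and 0.3, and a threshold of 0.5, only the first two sub-models are kept and the combined value is (0.2 × 0.9 + 0.8 × 0.6) / 1.5 = 0.44.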

Further, the step of introducing features for each sub-training set and training with the random forest method to obtain sub-models specifically comprises:

S231, for each sub-training set, using a feature value of 0 to indicate that the two genes have no similarity, and setting missing values to -1 to indicate that the feature value is missing;

S232, calculating the number of missing feature values of each sample in each sub-training set and taking this number as a new feature;

S233, performing 10-fold cross-validation, taking the weighted harmonic mean of precision and recall as the evaluation criterion, and tuning the random forest parameters by grid search to obtain the sub-model;

the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples per leaf node.
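Steps S231 to S233 can be illustrated with a short sketch. The helper names are hypothetical. The keys of `PARAM_GRID` follow the parameter names of scikit-learn's `RandomForestClassifier`, which is an assumption since the text does not name a library, and the candidate values are placeholders rather than the values used in the application.

```python
MISSING = -1  # marker for a missing feature value, distinct from 0


def encode_features(raw_values):
    """S231-S232: replace missing values with -1 and append the missing count."""
    encoded = [MISSING if v is None else v for v in raw_values]
    n_missing = sum(v is None for v in raw_values)
    return encoded + [n_missing]


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical search grid over the five tuned parameters (S233).
PARAM_GRID = {
    "n_estimators": [100, 300, 500],       # number of trees
    "max_features": ["sqrt", "log2"],      # maximum number of features
    "max_depth": [None, 10, 20],           # maximum depth
    "min_samples_split": [2, 5, 10],       # minimum samples to split a node
    "min_samples_leaf": [1, 2, 4],         # minimum samples per leaf node
}
```

With `beta=1.0` the score reduces to the ordinary F1 score; the weighting actually used is not stated in the text.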

Further, the features include mutation-level information, gene-level information, protein-interaction-level information, protein structure information, expression-level information, and phenotype-level information.

Further, the step of compressing the prediction result based on the paired data compression method specifically includes:

constructing a gene name dictionary file in the sorted order of the gene names;

and converting the potential value of each gene pair into an integer according to a preset rule, and storing the integer in a data file to obtain the compressed prediction result.

Further, the preset rule includes:

if 2 decimal places are retained, L_d = 1 byte is used to represent each potential value v, which is converted to the integer v' = 100v and stored as v' & 0xFF;

if 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value v, which is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];

if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value v, which is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
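The preset rule amounts to a fixed-point, little-endian encoding, which can be sketched as follows. The function names are hypothetical, and rounding to the nearest integer is an assumption, since the text only says the value is converted to an integer.

```python
def encode_potential(v, decimals):
    """Encode a potential value v in [0, 1] as L_d = decimals / 2 bytes."""
    if decimals not in (2, 4, 6):
        raise ValueError("decimals must be 2, 4 or 6")
    n_bytes = decimals // 2
    scaled = round(v * 10 ** decimals)  # v' = 100v, 10000v or 1000000v
    # Least significant byte first: [v' & 0xFF, (v' >> 8) & 0xFF, ...]
    return bytes((scaled >> (8 * k)) & 0xFF for k in range(n_bytes))


def decode_potential(data, decimals):
    """Reverse the encoding: rebuild v' from the bytes and rescale."""
    return int.from_bytes(data, "little") / 10 ** decimals
```

For instance, `encode_potential(0.1234, 4)` gives the two bytes `0xD2, 0x04`, because 1234 = 0x04D2. The largest scaled values (100, 10000 and 1000000) fit in 1, 2 and 3 bytes respectively, so no overflow occurs for v in [0, 1].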

Further, the method also comprises a step of quickly accessing the compressed prediction result, which specifically comprises the following steps:

searching for a gene pair (G_i, G_j);

reading the gene names recorded in the gene name dictionary file, and constructing an index according to their order;

obtaining the indexes i and j of gene G_i and gene G_j in the dictionary, and obtaining the start address I(i, j) of the gene pair in the data file;

moving the file pointer to I(i, j), and reading L_d bytes to obtain a byte array;

and decoding the obtained byte array by reversing the encoding procedure.
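The access steps can be sketched as follows. The start-address computation in `pair_offset` is an assumption made only for illustration: it supposes that the pairs are stored as a lower triangle without the diagonal, ordered by the larger index and then by the smaller one. The layout actually used by the application may differ, and the function names are hypothetical.

```python
import io


def pair_offset(i, j, n_bytes):
    """Start address of pair (i, j) under the assumed triangular layout."""
    if i == j:
        raise ValueError("a gene is not paired with itself")
    lo, hi = (i, j) if i < j else (j, i)
    return (hi * (hi - 1) // 2 + lo) * n_bytes


def read_potential(data_file, gene_names, gene_a, gene_b, decimals):
    """Look up one potential value without loading the whole data file."""
    n_bytes = decimals // 2  # L_d
    # Build the index from the order of names in the dictionary.
    index = {name: k for k, name in enumerate(gene_names)}
    offset = pair_offset(index[gene_a], index[gene_b], n_bytes)
    data_file.seek(offset)           # move the file pointer to I(i, j)
    raw = data_file.read(n_bytes)    # read L_d bytes
    # Decode by reversing the little-endian fixed-point encoding.
    return int.from_bytes(raw, "little") / 10 ** decimals
```

Because each value occupies a fixed number of bytes, a lookup needs one seek and one short read regardless of how many gene pairs the file holds. Any binary file object works, including `io.BytesIO` for in-memory data.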

The second technical scheme adopted by the application is as follows: a predictive system for a pathogenic gene pair comprising:

the data set construction module is used for constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set;

the training module is used for introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model;

the prediction module is used for predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model and storing the values in triplet form to obtain a prediction result;

and the compression module is used for compressing the prediction result based on the paired data compression method.

The method and the system have the following beneficial effects: the application is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on biological similarity or correlation between genes, and can evaluate the double-gene interaction effect of different gene pairs without prior knowledge of candidate genes; and because the reference data set is of high quality after filtering and screening, more reliable prediction results can be obtained.

Drawings

FIG. 1 is a flow chart showing the steps of a method for predicting a pathogenic gene pair according to the present application;

FIG. 2 is a block diagram of a predictive system for a pathogenic gene pair of the present application;

FIG. 3 is a diagram illustrating compression and quick access of prediction results in accordance with an embodiment of the present application.

Detailed Description

The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.

Referring to fig. 1, the present application provides a method for predicting a pathogenic gene pair, the method comprising the steps of:

Further as a preferred embodiment of the method, the step of constructing a dataset based on the database of double-gene disease and performing data filtering and screening to obtain a reference dataset specifically includes:

S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample (MD);

Specifically, we assume that there is no double-gene interaction effect between the main causative genes of common monogenic diseases.

S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample (LOF);

Specifically, we assume that loss-of-function (LOF) genes are substantially free of double-gene interaction effects;

S14, taking gene pairs obtained by pairing the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample (MDLOF);

S15, taking gene pairs obtained by randomly selecting and combining two protein-coding genes from the whole genome as a fourth negative training sample (Random);

Specifically, we consider the double-gene interaction effect to be uncommon, so the probability that a double-gene interaction effect exists between two genes randomly selected from the whole genome is extremely low;

S16, taking gene pairs obtained by randomly recombining the genes of the double-gene disease database as a fifth negative training sample (DIDA_NDI);

Specifically, we used data from the 1000 Genomes Project for filtering. The 1000 Genomes samples are normal samples, i.e., samples without a disease phenotype. If both genes of a gene pair carry non-synonymous mutations with allele frequencies of 1% or less, and at least 2 individuals in the 1000 Genomes samples carry such a mutated gene pair, the gene pair is considered a true negative gene pair; only such gene pairs are retained in the final negative training set. In addition, many features are used to train the model, but the missing rates of different features differ, and too many missing values reduce model performance; gene pairs with a high feature missing rate are therefore removed, which improves the reliability of the training set and the performance of the model.
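The two filters described above can be sketched as follows. The input `rare_carriers_by_gene` is a hypothetical precomputed mapping from each gene to the set of normal individuals who carry a non-synonymous mutation with allele frequency of 1% or less in that gene. The default `max_missing_rate` is a placeholder, since the text does not state the threshold.

```python
def carriers_of_pair(pair, rare_carriers_by_gene):
    """Individuals carrying a rare non-synonymous mutation in both genes."""
    gene_a, gene_b = pair
    return (rare_carriers_by_gene.get(gene_a, set())
            & rare_carriers_by_gene.get(gene_b, set()))


def filter_negative_pairs(pairs, rare_carriers_by_gene, features,
                          min_carriers=2, max_missing_rate=0.5):
    """Keep pairs seen in enough normal individuals and with few missing features.

    `features` maps a gene pair to its list of feature values, with None
    standing for a missing value.
    """
    kept = []
    for pair in pairs:
        # Filter 1: at least `min_carriers` normal individuals carry the pair.
        if len(carriers_of_pair(pair, rare_carriers_by_gene)) < min_carriers:
            continue
        values = features.get(pair)
        if not values:
            continue
        # Filter 2: drop pairs whose feature missing rate is too high.
        missing_rate = sum(v is None for v in values) / len(values)
        if missing_rate <= max_missing_rate:
            kept.append(pair)
    return kept
```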

Further as a preferred embodiment of the method, the step of introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model specifically comprises the following steps:

Specifically, sampling is also used to construct a test set.

The formula is as follows:

Specifically, i denotes a selected sub-model, oob_i is the out-of-bag error rate of the i-th sub-model, RF_i is the prediction result of the i-th sub-model, and RF_Final is the weighted double-gene interaction effect potential value.
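The formula itself is not reproduced in this text. One form consistent with the symbol definitions, given here only as an assumption, is a weighted mean in which the weights are normalized by their sum so that the result stays within the range of the sub-model predictions:

```latex
RF_{Final} = \frac{\sum_{i} oob_i \cdot RF_i}{\sum_{i} oob_i}
```

Here the sums run over the selected sub-models only.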

Further as a preferred embodiment of the method, the step of introducing features and training to obtain a sub-model based on a random forest method for each sub-training set specifically includes:

Specifically, since we use biological similarity or correlation between two genes to evaluate the double-gene interaction effect, a feature value of 0 means that there is no such similarity or correlation between the two genes; where a feature value is missing, we define the missing value as -1 to distinguish it from 0;

Specifically, to evaluate whether the number of missing values affects the final model decision, we calculated the number of missing values (-1) in each sample and added this number to model training as a new feature;

Finally, the weighted model is used to predict the double-gene interaction effect potential between whole-genome coding gene pairs. All coding genes of the whole genome are downloaded from the official website of the HUGO Gene Nomenclature Committee (HGNC); the whole-genome coding gene pairs are then obtained by pairwise combination, and the relevant features of all gene pairs are obtained from the corresponding databases described herein. As in step S23, missing features are replaced with -1, and for each sample the number of missing values is calculated and added as an additional feature. The triplet form is shown at C in fig. 3.
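The whole-genome prediction pass can be sketched as a generator of triplets. The names `predict_triplets`, `feature_lookup` and `predict` are hypothetical: `predict` stands in for the weighted model and is any callable that maps an encoded feature vector to a potential value.

```python
from itertools import combinations


def predict_triplets(coding_genes, feature_lookup, predict, n_features):
    """Yield (gene_a, gene_b, potential) for every pair of coding genes.

    `feature_lookup` maps a sorted gene pair to its raw feature list, with
    None for a missing value; pairs absent from it are treated as having
    every feature missing.
    """
    for gene_a, gene_b in combinations(sorted(set(coding_genes)), 2):
        raw = feature_lookup.get((gene_a, gene_b), [None] * n_features)
        # Replace missing values with -1, as in step S23.
        encoded = [-1 if v is None else v for v in raw]
        # Append the number of missing values as an additional feature.
        encoded.append(sum(v is None for v in raw))
        yield gene_a, gene_b, predict(encoded)
```

Using a generator avoids holding every pair in memory at once, which matters because the number of pairs grows quadratically with the number of coding genes.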

Further as a preferred embodiment of the method, the features include:

mutation-level information: a score of the gene's tolerance to mutation;

gene-level information: the probability that a gene is a recessive pathogenic gene, the essentiality status of a gene for basic functional development, the probability that a gene is essential, whether the gene pair participates in the same pathway, the semantic similarity of the gene pair in the Gene Ontology (GO), the magnitude of the gene-gene interaction effect, the number and similarity of genes that interact with both members of the gene pair, haploinsufficiency, the damage strength of a gene, the biological distance between the gene pair, and the tolerance of a gene to loss of function;

protein-interaction-level information: protein-protein interaction effects, mainly from the BioGRID and STRING databases;

protein structure information: mainly the gene domain information provided in the UniProtKB database;

expression-level information: the expression of the gene pair in different tissues, protein abundance, and the degree of gene co-expression;

phenotype-level information: the semantic similarity of the gene pair in the Disease Ontology (DO), i.e., the degree of similarity of the disease phenotypes associated with the two genes.

Further as a preferred embodiment of the method, reliable features are screened for model training, and the feature screening mainly comprises the following steps:

because different features cover different amounts of information, features with a high missing rate are removed first, to avoid the loss of model performance caused by missing information;

the 14 best-performing input features are then obtained by recursive feature elimination (RFE); these high-quality features further ensure the accuracy and reliability of the model.

Further as a preferred embodiment of the method, the step of compressing the prediction result based on the paired data compression method specifically includes:

Specifically, a gene name dictionary file with the extension *.d is constructed in the order of the gene names (ASCII order); it is a single file with tab as the separator;

Specifically, the potential value v ∈ [0, 1] of each gene pair is saved in a data file with the extension *.b according to a preset rule.
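Writing the two files can be sketched as follows. Several details are assumptions made for the example: the dictionary is written as one tab-separated line, the pairs are written as a lower triangle ordered by the larger index and then the smaller one, values are rounded to the nearest integer, and every pair of genes is expected to be present in `triplets`. The function name is hypothetical.

```python
def write_compressed(triplets, prefix, decimals=4):
    """Write a gene name dictionary (prefix.d) and a data file (prefix.b)."""
    n_bytes = decimals // 2
    values = {}
    names = set()
    for gene_a, gene_b, v in triplets:
        names.update((gene_a, gene_b))
        key = (gene_a, gene_b) if gene_a < gene_b else (gene_b, gene_a)
        values[key] = v

    # Python compares str by code point, which is ASCII order for ASCII names.
    genes = sorted(names)
    with open(prefix + ".d", "w", encoding="ascii") as fh:
        fh.write("\t".join(genes))

    with open(prefix + ".b", "wb") as fh:
        for hi in range(1, len(genes)):
            for lo in range(hi):
                v = values[(genes[lo], genes[hi])]
                scaled = round(v * 10 ** decimals)
                fh.write(scaled.to_bytes(n_bytes, "little"))
    return genes
```

Since the gene names are stored once in the dictionary and each value takes a fixed 1 to 3 bytes, the data file is much smaller than a text file that repeats both gene names on every triplet line.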

Further as a preferred embodiment of the method, the preset rule includes:

If 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value v, which is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];

If 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value v, which is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].

Further as a preferred embodiment of the method, the method further includes a step of quickly accessing the compressed prediction result, which specifically includes:

searching for a gene pair (G_i, G_j);

Specifically, the calculation formula of the start address is as follows:

wherein i and j are the indexes of gene G_i and gene G_j in the dictionary, and L_d is the byte length used to store a potential value.

Moving the file pointer to I(i, j), and reading L_d bytes to obtain a byte array;

The advantages of the application over the prior art are mainly represented by the following three points:

1. Whereas traditional methods for locating double-gene interaction effects require one known pathogenic gene and candidate interacting genes, the application is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on biological similarity or correlation between genes, and can evaluate the double-gene interaction effect of different gene pairs without knowing candidate genes;

2. The training set is a high-quality reference data set obtained by filtering and screening, with relatively wide coverage, and the reliable feature set helps build a relatively robust prediction model, in which protein-protein interaction, semantic similarity in the Gene Ontology (GO) and semantic similarity of genes in the Disease Ontology (DO) are three relatively important features;

3. The model construction method based on undersampling and ensembling yields more reliable prediction results.

As shown in fig. 2, a prediction system of a pathogenic gene pair comprises:

Further as a preferred embodiment of the present system, further comprising:

and the access module is used for quickly accessing the compressed prediction result.

The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.

While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims

1. A method for predicting a pathogenic gene pair, comprising the steps of:

S4, compressing the prediction result based on a paired data compression method;

the step of constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set specifically comprises the following steps:

S17, filtering and screening the positive training samples, the first negative training sample, the second negative training sample, the third negative training sample, the fourth negative training sample and the fifth negative training sample based on normal samples and the feature missing rate to obtain the reference data set;

the step of introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model comprises the following steps:

2. The method according to claim 1, wherein for each sub-training set, the step of introducing features and training based on a random forest method to obtain sub-models comprises:

3. The method of claim 2, wherein the characteristics include mutation level information, gene level information, protein interaction level information, protein structure information, expression level information, and phenotype level information.

4. A method for predicting a pathogenic gene pair according to claim 3, wherein the step of compressing the predicted result based on a paired data compression method comprises:

5. The method of claim 4, wherein the predetermined rule comprises:

If 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value v, which is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];

If 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value v, which is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].

6. The method according to claim 5, further comprising the step of rapidly accessing the compressed prediction result, which specifically comprises:

searching for a gene pair (G_i, G_j);

7. A predictive system for a pathogenic gene pair comprising:

the compression module is used for compressing the prediction result based on a paired data compression method;

the step of constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set specifically comprises: S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples; S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample; S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample; S14, taking gene pairs obtained by pairing the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample; S15, taking gene pairs obtained by randomly selecting and combining two protein-coding genes from the whole genome as a fourth negative training sample; S16, taking gene pairs obtained by randomly recombining the genes of the double-gene disease database as a fifth negative training sample; S17, filtering and screening the positive training samples, the first negative training sample, the second negative training sample, the third negative training sample, the fourth negative training sample and the fifth negative training sample based on normal samples and the feature missing rate to obtain the reference data set;

the step of introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model using random forests to obtain a prediction model specifically comprises: S21, randomly drawing equal numbers of samples from each of the negative sample sets in the reference data set (equal-proportion sampling) to form a negative sample subset of the same size as the positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set; S22, performing the sampling with replacement within each negative sample set, and repeating step S21 until a preset number of iterations is reached, so as to obtain a plurality of sub-training sets; S23, introducing features for each sub-training set and training with the random forest method to obtain sub-models; S24, calculating the out-of-bag error rate of each sub-model and selecting the sub-models whose out-of-bag error rate is greater than a preset value to obtain the selected sub-models; S25, weighting all selected sub-models, with the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.