CN113838525A - Method and system for predicting pathogenic gene pair - Google Patents

Publication number
CN113838525A
Authority
CN
China
Prior art keywords
gene
double
pair
data set
predicting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111150222.8A
Other languages
Chinese (zh)
Other versions
CN113838525B (en
Inventor
袁杨杨
李淼新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202111150222.8A
Publication of CN113838525A
Application granted
Publication of CN113838525B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Strategic Management (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Tourism & Hospitality (AREA)
  • Genetics & Genomics (AREA)
  • Biophysics (AREA)
  • Development Economics (AREA)
  • Analytical Chemistry (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Molecular Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Chemical & Material Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a system for predicting a pathogenic gene pair. The method comprises the following steps: constructing a data set based on a double-gene (digenic) disease database, and filtering and screening the data to obtain a reference data set; introducing features and, based on the reference data set, building a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain a prediction model; predicting the double-gene interaction effect potential between whole-genome coding gene pairs with the prediction model, and storing the results as triples to obtain a prediction result; and compressing the prediction result with a pairwise data compression method. The system comprises a data set construction module, a training module, a prediction module and a compression module. The invention can help discover potentially interacting pathogenic gene pairs, and the method and system can be widely applied in the field of gene pair prediction.

Description

Method and system for predicting pathogenic gene pair
Technical Field
The invention relates to the field of gene pair prediction, in particular to a method and a system for predicting a pathogenic gene pair.
Background
Human genetic diseases can be broadly divided into three categories: Mendelian (monogenic) diseases, oligogenic diseases and polygenic diseases. Monogenic inheritance is the simplest pattern, in which, in theory, one or a few pathogenic sites or genes are sufficient to cause the disease phenotype, as in cystic fibrosis and thalassemia. However, the known susceptibility genes of monogenic diseases do not fully explain the corresponding disease phenotypes. In addition, because of the variety of environmental factors and the complexity of the human genome, the phenotypes of many diseases are intricate, which makes diagnosis more difficult. With the rapid development of whole-genome sequencing technology and the arrival of the big-data era, susceptibility genes of complex diseases are continually being mined, and locating disease susceptibility genes presents both great opportunities and great challenges.
Traditional methods for locating gene pairs with a double-gene interaction effect include hybridization experiments, family-based association analysis, genome-wide association analysis and multi-omics association analysis. However, these methods require certain preconditions, have substantial limitations, and are difficult to apply to screening across the whole genome.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a method and a system for predicting a pathogenic gene pair. The algorithm is a supervised learning method and can help reveal potentially interacting pathogenic gene pairs.
The first technical scheme adopted by the invention is a method for predicting a pathogenic gene pair, comprising the following steps:
S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;
S2, introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain a prediction model;
S3, predicting the double-gene interaction effect potential values between whole-genome coding gene pairs with the prediction model, and storing the potential values as triples to obtain a prediction result;
S4, compressing the prediction result with a pairwise data compression method.
Further, the step of constructing a data set based on the double-gene disease database, and filtering and screening the data to obtain a reference data set, specifically includes:
S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;
S12, taking the gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as the first negative training samples;
S13, taking the gene pairs obtained by pairwise combination of loss-of-function genes as the second negative training samples;
S14, taking the gene pairs obtained by pairing the main pathogenic genes of monogenic diseases with loss-of-function genes as the third negative training samples;
S15, taking the gene pairs obtained by randomly selecting two protein-coding genes from the whole genome as the fourth negative training samples;
S16, taking the gene pairs obtained by randomly combining genes of the double-gene disease database in pairs as the fifth negative training samples;
S17, filtering and screening the positive training samples and the first, second, third, fourth and fifth negative training samples based on normal (disease-free) samples and the feature missing rate to obtain the reference data set.
Further, the step of introducing features and constructing, based on the reference data set, a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain the prediction model specifically comprises:
S21, randomly drawing an equal number of samples from each negative sample set in the reference data set by equal-proportion sampling to form a negative sample subset whose size equals the number of positive training samples, and combining all the positive samples with the undersampled negative samples to form a sub-training set;
S22, sampling each negative sample set with replacement and repeating step S21 until a preset number of repetitions is reached, to obtain a plurality of sub-training sets;
S23, for each sub-training set, introducing features and training with a random forest method to obtain a sub-model;
S24, calculating the out-of-bag error rate of each sub-model, and selecting the sub-models whose out-of-bag error rate is greater than a preset value;
S25, weighting all the selected sub-models, with the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.
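The undersampling in steps S21 and S22 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_sub_training_sets`, the dictionary layout of the negative sets, and the reading of "sampling with replacement" as "every round draws again from the full negative pool" are all assumptions.

```python
import random


def build_sub_training_sets(positives, negative_sets, n_rounds, seed=0):
    """Return n_rounds balanced sub-training sets of (gene_pair, label) tuples.

    positives     -- list of positive gene pairs (label 1)
    negative_sets -- dict mapping a negative category name to its list of pairs
    n_rounds      -- preset number of repetitions of step S21
    """
    rng = random.Random(seed)
    # Equal share per negative category, so negatives roughly match positives.
    per_set = len(positives) // len(negative_sets)
    sub_sets = []
    for _ in range(n_rounds):
        negatives = []
        for pairs in negative_sets.values():
            # No duplicates within a round; each round samples the full pool again.
            negatives.extend(rng.sample(pairs, min(per_set, len(pairs))))
        labelled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
        sub_sets.append(labelled)
    return sub_sets
```

Each returned sub-training set would then be passed to step S23 to train one random forest sub-model.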
Further, the step of introducing features and training with a random forest method to obtain a sub-model for each sub-training set specifically comprises:
S231, for each sub-training set, using a feature value of 0 to indicate that there is no similarity between the two genes, and using -1 to indicate that the feature value is missing;
S232, counting the number of missing feature values of each sample in each sub-training set and using the count as a new feature;
S233, tuning the random forest parameters by grid search under 10-fold cross-validation, with the weighted harmonic mean of precision and recall (the F-score) as the evaluation criterion, to obtain the sub-model;
the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples in a leaf node.
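As an illustration of the evaluation criterion and the search space in step S233, the sketch below computes the weighted harmonic mean of precision and recall and enumerates a parameter grid. The grid keys follow the parameter names of scikit-learn's `RandomForestClassifier`, which is an assumed choice of library; the candidate values and the default `beta=1.0` are hypothetical and are not given in the text.

```python
from itertools import product


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives the F1 score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical candidate values for the five tuned parameters.
param_grid = {
    "n_estimators": [100, 500],        # number of trees
    "max_features": ["sqrt", "log2"],  # maximum number of features
    "max_depth": [None, 10],           # maximum depth
    "min_samples_split": [2, 5],       # minimum samples to split a node
    "min_samples_leaf": [1, 2],        # minimum samples in a leaf node
}

# Every combination that a grid search would evaluate with 10-fold cross-validation.
grid_points = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
```

A grid search would fit one model per entry of `grid_points` on each fold and keep the setting with the highest mean `f_beta`.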
Further, the features include mutation level information, gene level information, protein interaction level information, protein structure information, expression level information, and phenotype level information.
Further, the step of compressing the prediction result with the pairwise data compression method specifically comprises:
constructing a gene name dictionary file according to the order of the gene names;
converting the potential values of the gene pairs into integers according to a preset rule and storing them as a data file to obtain the compressed prediction result.
Further, the preset rule includes:
if 2 decimal places are retained, L_d = 1 byte is used to represent each potential value; the value v is converted to the integer v' = 100v and stored as v' & 0xFF;
if 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value; the value v is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];
if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value; the value v is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
Further, the method also comprises a step of quickly accessing the compressed prediction result, which specifically comprises:
searching for a gene pair (G_i, G_j);
reading the gene names recorded in the gene name dictionary file and building an index according to their order;
obtaining the indices i and j of gene G_i and gene G_j in the dictionary, and computing the start address I(i, j) of the gene pair in the data file;
moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;
decoding the byte array by reversing the encoding procedure.
The second technical scheme adopted by the invention is a system for predicting a pathogenic gene pair, comprising:
a data set construction module, used for constructing a data set based on the double-gene disease database, and filtering and screening the data to obtain a reference data set;
a training module, used for introducing features and, based on the reference data set, building a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain a prediction model;
a prediction module, used for predicting the double-gene interaction effect potential values between whole-genome coding gene pairs with the prediction model and storing them as triples to obtain a prediction result;
a compression module, used for compressing the prediction result with a pairwise data compression method.
The method and the system have the following beneficial effects: the invention is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on the biological similarity or correlation between genes; it can evaluate the double-gene interaction effect of different gene pairs even when the candidate genes are unknown; and, by filtering and screening a high-quality reference data set, it yields more reliable prediction results.
Drawings
FIG. 1 is a flow chart showing the steps of a method for predicting a pair of pathogenic genes according to the present invention;
FIG. 2 is a block diagram showing the construction of a system for predicting a pair of pathogenic genes according to the present invention;
FIG. 3 is a diagram illustrating compression and fast access to predicted results according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to FIG. 1, the present invention provides a method for predicting a pathogenic gene pair, the method comprising the following steps:
S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;
S2, introducing features and, based on the reference data set, constructing a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain a prediction model;
S3, predicting the double-gene interaction effect potential values between whole-genome coding gene pairs with the prediction model, and storing the potential values as triples to obtain a prediction result;
S4, compressing the prediction result with a pairwise data compression method.
Further, as a preferred embodiment of the method, the step of constructing a data set based on the double-gene disease database, and filtering and screening the data to obtain a reference data set, specifically includes:
S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;
S12, taking the gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as the first negative training samples (MD);
specifically, we assume that there is no double-gene interaction effect between the main pathogenic genes of common monogenic diseases;
S13, taking the gene pairs obtained by pairwise combination of loss-of-function genes as the second negative training samples (LOF);
specifically, we assume that loss-of-function (LOF) genes have essentially no double-gene interaction effect;
S14, taking the gene pairs obtained by pairing the main pathogenic genes of monogenic diseases with loss-of-function genes as the third negative training samples (MDLOF);
S15, taking the gene pairs obtained by randomly selecting and combining two protein-coding genes from the whole genome as the fourth negative training samples (Random);
specifically, we consider the double-gene interaction effect to be uncommon, so the probability that two genes randomly selected from the whole genome have such an effect is extremely low;
S16, taking the gene pairs obtained by randomly combining genes of the double-gene disease database in pairs as the fifth negative training samples (DIDA_NDI);
S17, filtering and screening the positive training samples and the first, second, third, fourth and fifth negative training samples based on normal (disease-free) samples and the feature missing rate to obtain the reference data set.
Specifically, data from the 1000 Genomes Project are used for filtering. The 1000 Genomes samples are treated as normal samples, i.e. samples without a disease phenotype. If both genes of a gene pair carry non-synonymous mutations with an allele frequency of 1% or less, and at least 2 individuals in the 1000 Genomes samples carry such a pair of mutated genes, we consider the pair a true negative pair, and only such pairs are retained in the final negative training set. In addition, many features are used to train the model, but the missing rates of different features differ, and a large number of missing values reduces model performance. Gene pairs with a high feature missing rate are therefore removed, which improves the reliability of the training set and the performance of the model.
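The pairwise combination of steps S12 to S16 and the normal-sample filter of step S17 can be sketched as below. The helper names and the `carriers` mapping (gene name to the set of normal individuals carrying a rare non-synonymous mutation in that gene) are hypothetical; building that mapping from actual genotype data is outside the scope of this sketch.

```python
from itertools import combinations


def candidate_pairs(genes):
    """All unordered pairs of distinct genes, in sorted name order."""
    return list(combinations(sorted(set(genes)), 2))


def is_true_negative(pair, carriers, min_individuals=2):
    """Keep a pair only if enough normal individuals carry rare
    non-synonymous mutations in both genes at the same time."""
    gene_a, gene_b = pair
    shared = carriers.get(gene_a, set()) & carriers.get(gene_b, set())
    return len(shared) >= min_individuals


def filter_negatives(genes, carriers, min_individuals=2):
    """Candidate negative pairs that pass the normal-sample filter."""
    return [p for p in candidate_pairs(genes)
            if is_true_negative(p, carriers, min_individuals)]
```

For example, with `carriers = {"A": {"s1", "s2", "s3"}, "B": {"s2", "s3"}, "C": {"s3"}}`, only the pair `("A", "B")` is retained, because it is the only pair co-carried by at least two individuals.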
Further, as a preferred embodiment of the method, the step of introducing features and constructing, based on the reference data set, a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain the prediction model specifically comprises:
S21, randomly drawing an equal number of samples from each negative sample set in the reference data set by equal-proportion sampling to form a negative sample subset whose size equals the number of positive training samples, and combining all the positive samples with the undersampled negative samples to form a sub-training set;
specifically, a test set is also constructed by sampling;
S22, sampling each negative sample set with replacement and repeating step S21 until a preset number of repetitions is reached, to obtain a plurality of sub-training sets;
S23, for each sub-training set, introducing features and training with a random forest method to obtain a sub-model;
S24, calculating the out-of-bag error rate of each sub-model, and selecting the sub-models whose out-of-bag error rate is greater than a preset value;
S25, weighting all the selected sub-models, with the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.
The formula is as follows:
Figure BDA0003286968640000051
Specifically, i denotes a selected sub-model, oob_i is the out-of-bag error rate of the i-th sub-model, RF_i is the prediction result of the i-th sub-model, and RF_Final is the weighted double-gene interaction effect potential value.
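A minimal sketch of the selection and weighting in steps S24 and S25 is given below. The formula image is not reproduced in this text, so the normalization by the sum of the weights is an assumption, chosen so that the combined value stays in the same range as the sub-model outputs. The selection rule ("greater than a preset value") and the use of oob_i as the weight follow the text as written; the function and parameter names are hypothetical.

```python
def weighted_potential(sub_predictions, oob_values, threshold):
    """Combine sub-model predictions RF_i into RF_Final.

    sub_predictions -- RF_i, one predicted potential value per sub-model
    oob_values      -- oob_i, the out-of-bag value reported for each sub-model
    threshold       -- preset value used to select sub-models
    """
    selected = [(w, p) for w, p in zip(oob_values, sub_predictions) if w > threshold]
    if not selected:
        raise ValueError("no sub-model passed the selection threshold")
    total_weight = sum(w for w, _ in selected)
    # Assumed form: weighted average of the selected sub-model predictions.
    return sum(w * p for w, p in selected) / total_weight
```

For instance, `weighted_potential([0.2, 0.8, 0.5], [0.9, 0.6, 0.3], 0.5)` keeps the first two sub-models and returns (0.9 × 0.2 + 0.6 × 0.8) / 1.5 = 0.44.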
Further, as a preferred embodiment of the method, the step of introducing features and training with a random forest method to obtain a sub-model for each sub-training set specifically includes:
S231, for each sub-training set, using a feature value of 0 to indicate that there is no similarity between the two genes, and using -1 to indicate that the feature value is missing;
specifically, since the biological similarity or correlation between two genes is used to assess the double-gene interaction effect, a feature value of 0 means that the two genes have no such similarity or correlation; to distinguish a missing feature value from 0, a missing value is defined as -1;
S232, counting the number of missing feature values of each sample in each sub-training set and using the count as a new feature;
specifically, in order to evaluate whether the number of missing values affects the judgment of the final model, the number of missing values (-1) in each sample is counted and added to model training as a new feature;
S233, tuning the random forest parameters by grid search under 10-fold cross-validation, with the weighted harmonic mean of precision and recall (the F-score) as the evaluation criterion, to obtain the sub-model;
the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples in a leaf node.
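The missing-value handling of steps S231 and S232 can be illustrated with a short sketch. The function name, the dictionary input format and the example feature names (`ppi`, `go_sim`, `do_sim`) are hypothetical; only the -1 encoding and the missing-count feature come from the text.

```python
MISSING = -1  # sentinel for an absent feature value, distinct from 0 (no similarity)


def encode_sample(raw_features, feature_names):
    """Build the feature vector for one gene pair.

    raw_features  -- dict of feature name to value; absent or None means missing
    feature_names -- fixed feature order used by the model
    Returns the encoded values followed by the count of missing values.
    """
    values = [raw_features.get(name) for name in feature_names]
    encoded = [MISSING if v is None else v for v in values]
    n_missing = sum(1 for v in values if v is None)
    return encoded + [n_missing]
```

For a pair with a known value of 0.0 for `ppi`, 0.7 for `go_sim` and no value for `do_sim`, the call `encode_sample({"ppi": 0.0, "go_sim": 0.7}, ["ppi", "go_sim", "do_sim"])` yields `[0.0, 0.7, -1, 1]`, so a true zero and a missing value remain distinguishable.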
Finally, the weighted model is used to predict the double-gene interaction effect potential value between whole-genome coding gene pairs. All coding genes in the genome are downloaded from the official website of the HUGO Gene Nomenclature Committee (HGNC), the whole-genome coding gene pairs are obtained by pairwise combination, and the relevant features of all gene pairs are obtained from the corresponding databases described herein. As in step S23, missing features are replaced with -1, and the number of missing values in each sample is counted and supplied to the model as an additional feature. The triple form is shown in part C of FIG. 3.
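The whole-genome prediction step can be sketched as a generator of (gene, gene, potential) triples. The `predict` callable stands in for the weighted model and is a hypothetical interface; for n genes the generator yields n(n-1)/2 triples.

```python
from itertools import combinations
from math import comb


def predict_triples(genes, predict):
    """Yield (gene_1, gene_2, potential) for every unordered pair of genes.

    genes   -- iterable of gene names
    predict -- callable taking two gene names and returning a potential value
    """
    for gene_1, gene_2 in combinations(sorted(set(genes)), 2):
        yield (gene_1, gene_2, predict(gene_1, gene_2))


def n_pairs(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return comb(n_genes, 2)
```

Because the pairs are produced lazily and in sorted name order, the triples can be streamed directly into the compression step without holding all of them in memory.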
Further, as a preferred embodiment of the method, the features include:
mutation level information: the score of a gene's tolerance to mutation;
gene level information: the probability that a gene is a recessive pathogenic gene, the essentiality status of the gene pair for basic functional development, the probability of being an essential gene, whether the gene pair participates in the same pathway, the semantic similarity of the gene pair in Gene Ontology (GO), the magnitude of the gene-gene interaction effect, the number and similarity of interacting genes shared by the gene pair, haploinsufficiency, the destructive strength of the genes, the biological distance between the gene pair, and the tolerance of the gene pair to loss of function;
protein interaction level information: the protein-protein interaction effect, mainly from the BioGRID and STRING databases;
protein structure information: mainly the gene domain information provided in the UniProtKB database;
expression level information: the expression profiles of the gene pair in different tissues, protein abundance, and the degree of gene co-expression;
phenotype level information: the semantic similarity of the gene pair in Disease Ontology (DO), i.e. the degree of similarity of the disease phenotypes associated with the two genes.
Further, as a preferred embodiment of the method, reliable features are screened for model training. Feature screening mainly comprises:
because different features differ in information coverage, features with a high missing rate are deleted first, to prevent missing information from reducing model performance;
the 14 best-performing input features are then selected by recursive feature elimination (RFE), and these high-quality features further ensure the accuracy and reliability of the model.
Further, as a preferred embodiment of the method, the step of compressing the prediction result with the pairwise data compression method specifically includes:
constructing a gene name dictionary file according to the order of the gene names;
specifically, the gene name dictionary file has the extension *.d and is built in gene name order (ASCII code order); it is a single-line file with tab as the separator;
converting the potential values of the gene pairs into integers according to a preset rule and storing them as a data file to obtain the compressed prediction result.
Specifically, the potential value v of a gene pair lies in [0, 1] and is stored in a data file with the extension *.b according to the preset rule.
Further, as a preferred embodiment of the method, the preset rule includes:
if 2 decimal places are retained, L_d = 1 byte is used to represent each potential value; the value v is converted to the integer v' = 100v and stored as v' & 0xFF;
if 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value; the value v is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];
if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value; the value v is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
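The preset rule amounts to a fixed-point, little-endian byte encoding, which the following sketch illustrates. The function names are hypothetical, and rounding to the nearest integer before packing is an assumption, since the text only states v' = 100v, 10000v or 1000000v.

```python
# Number of bytes L_d for each supported number of retained decimal places.
BYTES_PER_VALUE = {2: 1, 4: 2, 6: 3}


def encode_potential(v, decimals):
    """Pack a potential value v in [0, 1] into L_d bytes, low byte first."""
    n_bytes = BYTES_PER_VALUE[decimals]
    scaled = round(v * 10 ** decimals)  # v' (rounding is an assumption)
    return bytes((scaled >> (8 * k)) & 0xFF for k in range(n_bytes))


def decode_potential(data, decimals):
    """Reverse the encoding: rebuild v' from the bytes and rescale."""
    return int.from_bytes(data, "little") / 10 ** decimals
```

With 4 decimal places, 0.1234 becomes v' = 1234 = 0x04D2 and is stored as the two bytes 0xD2, 0x04. Since v is at most 1, v' never exceeds 100, 10000 or 1000000, each of which fits in 1, 2 or 3 bytes respectively.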
Further, as a preferred embodiment of the method, the method further includes a step of quickly accessing the compressed prediction result, which specifically includes:
searching for a gene pair (G_i, G_j);
reading the gene names recorded in the *.d gene name dictionary file and building an index according to their order;
obtaining the indices i and j of gene G_i and gene G_j in the dictionary, and computing the start address I(i, j) of the gene pair in the data file;
specifically, the start address is calculated as follows:
Figure BDA0003286968640000071
where i and j are the indices of gene G_i and gene G_j in the dictionary, and L_d is the number of bytes used to store each potential value;
moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;
decoding the byte array by reversing the encoding procedure.
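The fast-access procedure can be sketched as a seek into the data file followed by a fixed-length read. The start-address formula is not reproduced in this text, so the layout used here is an assumption: pairs (i, j) with i < j are stored row by row over the upper triangle of the gene-by-gene matrix, with 0-based dictionary indices. All function names are hypothetical.

```python
import io

BYTES_PER_VALUE = {2: 1, 4: 2, 6: 3}  # L_d for each number of decimal places


def pair_offset(i, j, n_genes, n_bytes):
    """Start address I(i, j) under the assumed upper-triangular layout."""
    if i == j:
        raise ValueError("a gene pair needs two different genes")
    if i > j:
        i, j = j, i
    # Pairs stored before row i, plus the position of j within row i.
    rank = i * (2 * n_genes - i - 1) // 2 + (j - i - 1)
    return rank * n_bytes


def read_potential(data_file, names, gene_a, gene_b, decimals):
    """Look up one gene pair in an open binary data file."""
    n_bytes = BYTES_PER_VALUE[decimals]
    index = {name: k for k, name in enumerate(names)}  # dictionary order
    offset = pair_offset(index[gene_a], index[gene_b], len(names), n_bytes)
    data_file.seek(offset)
    raw = data_file.read(n_bytes)
    return int.from_bytes(raw, "little") / 10 ** decimals


# Usage with an in-memory file holding 6 pairs of 4 genes at 2 decimal places.
demo_names = ["A", "B", "C", "D"]
demo_file = io.BytesIO(bytes([10, 20, 30, 40, 50, 60]))
demo_value = read_potential(demo_file, demo_names, "C", "B", 2)
```

Because every value occupies exactly L_d bytes, a lookup costs one seek and one short read regardless of how many gene pairs the file contains.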
Compared with existing methods, the advantages of the invention are mainly reflected in the following three points:
1. traditional methods for locating the double-gene interaction effect presuppose that a pathogenic gene and candidate interacting genes are known, whereas the invention is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on the biological similarity or correlation between genes, so that the double-gene interaction effect of different gene pairs can be evaluated even when the candidate genes are unknown;
2. the training set is a high-quality reference data set obtained by filtering and screening, with relatively wide coverage, and a reliable feature set helps build a relatively robust prediction model, in which protein-protein interaction, semantic similarity in Gene Ontology (GO) and semantic similarity in Disease Ontology (DO) are 3 relatively important features;
3. the model construction method based on undersampling and ensembling yields more reliable prediction results.
As shown in FIG. 2, a system for predicting a pathogenic gene pair includes:
a data set construction module, used for constructing a data set based on the double-gene disease database, and filtering and screening the data to obtain a reference data set;
a training module, used for introducing features and, based on the reference data set, building a whole-genome double-gene interaction effect potential prediction model with a random forest model to obtain a prediction model;
a prediction module, used for predicting the double-gene interaction effect potential values between whole-genome coding gene pairs with the prediction model and storing them as triples to obtain a prediction result;
a compression module, used for compressing the prediction result with a pairwise data compression method.
Further, as a preferred embodiment of the present system, the system further comprises:
an access module, used for quickly accessing the compressed prediction result.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A method for predicting a pathogenic gene pair, comprising the steps of:
S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;
S2, introducing features and, based on the reference data set, constructing a random-forest-based model for predicting the whole-genome double-gene interaction effect potential, to obtain a prediction model;
S3, predicting, based on the prediction model, the double-gene interaction effect potential values between pairs of coding genes across the whole genome, and storing the potential values as triples to obtain a prediction result;
and S4, compressing the prediction result based on a paired data compression method.
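By way of non-limiting illustration, the triple form of step S3 can be sketched as follows. The function names and the toy scoring function are hypothetical stand-ins introduced only for this example; the trained prediction model would take the place of `toy_score`.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, float]  # (gene_i, gene_j, potential value)


def predict_triples(genes: Iterable[str],
                    score: Callable[[str, str], float]) -> List[Triple]:
    """Score every unordered gene pair and return the results as triples."""
    ordered = sorted(set(genes))  # fixed gene order, duplicates removed
    return [(a, b, score(a, b)) for a, b in combinations(ordered, 2)]


def toy_score(a: str, b: str) -> float:
    """Hypothetical placeholder for the trained model's potential value."""
    return round(1.0 / (1 + abs(len(a) - len(b))), 2)


triples = predict_triples(["BRCA1", "TP53", "KCNQ1"], toy_score)
for gene_i, gene_j, potential in triples:
    print(gene_i, gene_j, potential)
```

Each unordered pair appears once, so n genes produce n(n-1)/2 triples.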
2. The method for predicting a pathogenic gene pair as claimed in claim 1, wherein the step of constructing a data set based on the double-gene disease database, and filtering and screening the data to obtain a reference data set, specifically comprises:
S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;
S12, combining the pathogenic genes of monogenic diseases pairwise, and taking the resulting gene pairs as a first negative training sample;
S13, combining loss-of-function genes pairwise, and taking the resulting gene pairs as a second negative training sample;
S14, pairing the main pathogenic genes of monogenic diseases with the loss-of-function genes, and taking the resulting gene pairs as a third negative training sample;
S15, randomly selecting two protein-coding genes from the whole genome to form a gene pair, and taking such gene pairs as a fourth negative training sample;
S16, randomly combining the genes of the double-gene disease database pairwise, and taking the resulting gene pairs as a fifth negative training sample;
and S17, filtering and screening the positive training sample and the first, second, third, fourth and fifth negative training samples based on normal samples and the feature missing rate to obtain the reference data set.
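By way of non-limiting illustration, the pairwise combinations of steps S12 to S16 and the missing-rate part of step S17 can be sketched as follows. All function names, the use of `None` to mark a missing feature, and the threshold value are assumptions made for this example; the filtering against normal samples is not shown.

```python
import random
from itertools import combinations


def all_pairs(genes):
    """Every unordered pair within one gene list (as in S12, S13, S16)."""
    return {tuple(sorted(p)) for p in combinations(set(genes), 2)}


def cross_pairs(genes_a, genes_b):
    """Pairs with one gene taken from each list (as in S14)."""
    return {tuple(sorted((a, b))) for a in genes_a for b in genes_b if a != b}


def random_pairs(genes, n, rng):
    """n distinct random pairs drawn from a genome-wide gene list (as in S15)."""
    pool = sorted(set(genes))
    if n > len(pool) * (len(pool) - 1) // 2:
        raise ValueError("more pairs requested than exist")
    chosen = set()
    while len(chosen) < n:
        chosen.add(tuple(sorted(rng.sample(pool, 2))))
    return chosen


def filter_by_missing_rate(pairs, features, max_missing):
    """Keep pairs whose fraction of missing (None) features is at most max_missing."""
    kept = []
    for pair in sorted(pairs):
        values = features.get(pair)
        if values is None:
            continue  # no feature vector at all: drop the pair
        if sum(v is None for v in values) / len(values) <= max_missing:
            kept.append(pair)
    return kept


features = {
    ("A", "B"): [0.1, None, 0.3, 0.4],     # 25% missing
    ("A", "C"): [None, None, None, 0.2],   # 75% missing
}
kept = filter_by_missing_rate(all_pairs(["A", "B", "C"]), features, max_missing=0.5)
print(kept)
```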
3. The method for predicting a pathogenic gene pair as claimed in claim 2, wherein the step of introducing features and, based on the reference data set, constructing a random-forest-based model for predicting the whole-genome double-gene interaction effect potential to obtain a prediction model specifically comprises:
S21, randomly drawing an equal number of samples from each negative sample set in the reference data set by equal-proportion sampling to form a negative sample subset whose size equals the number of positive training samples, and combining all the positive samples with the negative samples obtained by undersampling to form a sub-training set;
S22, returning the drawn samples to each negative sample set (sampling with replacement) and repeating step S21 until a preset number of repetitions is reached, to obtain a plurality of sub-training sets;
S23, for each sub-training set, introducing features and training based on a random forest method to obtain a sub-model;
S24, calculating the out-of-bag error rate of each sub-model, and selecting the sub-models whose out-of-bag error rate is larger than a preset value to obtain the selected sub-models;
S25, weighting all the selected sub-models, using the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.
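By way of non-limiting illustration, the undersampling of steps S21 and S22 and the selection and weighting of steps S24 and S25 can be sketched as follows. The function names are hypothetical, the training of each random forest (step S23) is omitted, and the per-model out-of-bag values are passed in as plain numbers. The sketch assumes that each negative sample set is at least as large as its quota and that samples are returned to the pool between rounds.

```python
import random


def build_sub_training_sets(positives, negative_groups, n_rounds, seed=0):
    """S21/S22: draw a balanced negative subset per round, equally from each group."""
    rng = random.Random(seed)
    base, extra = divmod(len(positives), len(negative_groups))
    subsets = []
    for _ in range(n_rounds):
        negatives = []
        for idx, group in enumerate(negative_groups):
            quota = base + (1 if idx < extra else 0)  # spread any remainder
            negatives.extend(rng.sample(group, quota))
        # Every round reuses all positives; negatives are redrawn from full groups.
        subsets.append((list(positives), negatives))
    return subsets


def select_sub_models(oob_values, preset_value):
    """S24: keep the sub-models whose out-of-bag value exceeds the preset value."""
    return {i: v for i, v in enumerate(oob_values) if v > preset_value}


def weighted_potential(sub_model_outputs, weights):
    """S25: weighted average of the selected sub-models' outputs."""
    total = sum(weights)
    return sum(p * w for p, w in zip(sub_model_outputs, weights)) / total


positives = list(range(10))
groups = [list(range(100, 120)), list(range(200, 220)), list(range(300, 320))]
subsets = build_sub_training_sets(positives, groups, n_rounds=3)

selected = select_sub_models([0.9, 0.5, 0.8], preset_value=0.6)
potential = weighted_potential([1.0, 0.0], list(selected.values()))
print(len(subsets), selected, potential)
```

Because each round draws from the complete negative sets, different sub-training sets may share negative samples while every one of them remains balanced against the positives.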
4. The method for predicting a pathogenic gene pair as claimed in claim 3, wherein the step of introducing features and training based on a random forest method to obtain a sub-model for each sub-training set specifically comprises:
S231, for each sub-training set, using a feature value of 0 to indicate that there is no similarity between the two genes, and encoding missing feature values as -1;
S232, counting the number of missing feature values in each sub-training set and using the count as a new feature;
S233, tuning the parameters of the random forest by grid search with 10-fold cross-validation, using the weighted harmonic mean of precision and recall as the evaluation criterion, to obtain the sub-model;
wherein the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples per leaf node.
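By way of non-limiting illustration, the missing-value handling of steps S231 and S232 and the evaluation criterion and parameter grid of step S233 can be sketched as follows. The function names, the per-gene-pair counting of missing values, and the grid values are assumptions made for this example. The grid keys follow the parameter names of scikit-learn's `RandomForestClassifier`, which the source does not name.

```python
from itertools import product

MISSING = -1  # S231: code used for a missing feature value


def encode_features(raw):
    """Replace None with -1 and append the missing-value count as a new feature."""
    encoded = [MISSING if v is None else v for v in raw]
    n_missing = sum(v is None for v in raw)  # S232, counted per gene pair here
    return encoded + [n_missing]


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical grid over the five parameters listed in the claim.
PARAM_GRID = {
    "n_estimators": [100, 500],        # number of trees
    "max_features": ["sqrt", "log2"],  # maximum number of features
    "max_depth": [None, 10],           # maximum depth
    "min_samples_split": [2, 5],       # minimum samples to split a node
    "min_samples_leaf": [1, 2],        # minimum samples per leaf node
}


def grid_candidates(grid):
    """Enumerate every parameter combination examined by a grid search."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


print(encode_features([0.7, None, 0.0, None]))
print(f_beta(0.5, 1.0), len(grid_candidates(PARAM_GRID)))
```

With `beta=1.0` the criterion reduces to the ordinary F1 score; other values of `beta` shift the weight between precision and recall.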
5. The method for predicting a pathogenic gene pair as claimed in claim 4, wherein the features include mutation-level information, gene-level information, protein-interaction-level information, protein structure information, expression-level information, and phenotype-level information.
6. The method for predicting a pathogenic gene pair as claimed in claim 5, wherein the step of compressing the prediction result based on the paired data compression method specifically comprises:
constructing a gene name dictionary file according to the order of the gene names;
and converting the potential values of the gene pairs into integers according to a preset rule and storing the integers as a data file to obtain a compressed prediction result.
7. The method for predicting a pathogenic gene pair according to claim 6, wherein the preset rule comprises:
if 2 decimal places are retained, L_d = 1 byte is used to represent each potential value v, which is converted to the integer v' = 100v and stored as v' & 0xFF;
if 4 decimal places are retained, L_d = 2 bytes are used to represent each potential value v, which is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];
if 6 decimal places are retained, L_d = 3 bytes are used to represent each potential value v, which is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
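By way of non-limiting illustration, the preset rule can be sketched as follows. The function names are hypothetical, and the sketch assumes that the potential value v lies in the range 0 to 1, so that the scaled integer fits in L_d bytes, and that the scaled value is rounded to the nearest integer.

```python
def encode_potential(v, decimals):
    """Scale v to an integer and emit L_d bytes, least significant byte first."""
    if decimals not in (2, 4, 6):
        raise ValueError("decimals must be 2, 4 or 6")
    n_bytes = decimals // 2              # L_d = 1, 2 or 3 bytes
    scaled = round(v * 10 ** decimals)   # v' = 100v, 10000v or 1000000v
    return bytes((scaled >> (8 * k)) & 0xFF for k in range(n_bytes))


def decode_potential(data, decimals):
    """Reverse the encoding: reassemble v' and divide by the scale factor."""
    scaled = int.from_bytes(data, "little")
    return scaled / 10 ** decimals


for value, decimals in [(0.87, 2), (0.1234, 4), (0.999999, 6)]:
    raw = encode_potential(value, decimals)
    print(value, decimals, list(raw), decode_potential(raw, decimals))
```

Under this rule a value kept to 2 decimal places occupies a single byte, compared with 4 or 8 bytes for a floating-point number.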
8. The method for predicting a pathogenic gene pair according to claim 7, further comprising a step of quickly accessing the compressed prediction result, which, for a gene pair (G_i, G_j) to be looked up, specifically comprises:
reading the gene names recorded in the gene name dictionary file, and constructing an index according to their order;
obtaining the indices i and j of gene G_i and gene G_j in the dictionary, and from them the start address I(i, j) of the gene pair in the data file;
moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;
and decoding the byte array by reversing the procedure used during encoding.
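By way of non-limiting illustration, the lookup can be sketched as follows. The function names are hypothetical. The layout of the data file is an assumption made for this example: the values of all pairs with i < j are stored row by row (upper triangle without the diagonal), which gives one possible form of the start address I(i, j); the layout actually used is not restated in this claim.

```python
import io


def pair_offset(i, j, n_genes, bytes_per_value):
    """Start address I(i, j) under the assumed row-major upper-triangle layout."""
    if i == j:
        raise ValueError("a gene pair needs two different genes")
    if i > j:
        i, j = j, i  # pairs are unordered
    rank = i * n_genes - i * (i + 1) // 2 + (j - i - 1)
    return rank * bytes_per_value


def lookup(data_file, gene_index, gene_a, gene_b, decimals):
    """Seek to the pair's address, read L_d bytes and decode the potential value."""
    n_bytes = decimals // 2  # L_d
    i, j = gene_index[gene_a], gene_index[gene_b]
    data_file.seek(pair_offset(i, j, len(gene_index), n_bytes))
    raw = data_file.read(n_bytes)
    return int.from_bytes(raw, "little") / 10 ** decimals


# Dictionary file contents: gene names in a fixed order.
gene_names = ["A", "B", "C"]
gene_index = {name: idx for idx, name in enumerate(gene_names)}

# Data file with 2 decimal places: (A,B)=0.12, (A,C)=0.50, (B,C)=0.99.
data_file = io.BytesIO(bytes([12, 50, 99]))

print(lookup(data_file, gene_index, "C", "B", decimals=2))
```

Because the address is computed directly from the two indices, a single seek and a read of L_d bytes suffice, and the data file never has to be loaded in full.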
9. A system for predicting a pathogenic gene pair, comprising:
a data set construction module, configured to construct a data set based on a double-gene disease database and to filter and screen the data to obtain a reference data set;
a training module, configured to introduce features and, based on the reference data set, construct a random-forest-based model for predicting the whole-genome double-gene interaction effect potential, thereby obtaining a prediction model;
a prediction module, configured to predict, based on the prediction model, the double-gene interaction effect potential values between pairs of coding genes across the whole genome, and to store the values as triples to obtain a prediction result;
and a compression module, configured to compress the prediction result based on a paired data compression method.
CN202111150222.8A 2021-09-29 2021-09-29 Prediction method and system for pathogenic gene pair Active CN113838525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111150222.8A CN113838525B (en) 2021-09-29 2021-09-29 Prediction method and system for pathogenic gene pair

Publications (2)

Publication Number Publication Date
CN113838525A (en) 2021-12-24
CN113838525B (en) 2023-09-29

Family

ID=78967648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150222.8A Active CN113838525B (en) 2021-09-29 2021-09-29 Prediction method and system for pathogenic gene pair

Country Status (1)

Country Link
CN (1) CN113838525B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341366A (en) * 2017-07-19 2017-11-10 西安交通大学 A kind of method that complex disease susceptibility loci is predicted using machine learning
CN109727642A (en) * 2019-01-22 2019-05-07 袁隆平农业高科技股份有限公司 Full-length genome prediction technique and device based on Random Forest model
CN110136773A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of phytoprotein interaction network construction method based on deep learning
CN113066586A (en) * 2021-04-01 2021-07-02 北京果壳生物科技有限公司 Method for constructing disease classification model based on multi-gene risk scoring

Also Published As

Publication number Publication date
CN113838525B (en) 2023-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant