CN113838525B - Prediction method and system for pathogenic gene pair - Google Patents

Prediction method and system for pathogenic gene pair

Info

Publication number
CN113838525B
Authority
CN
China
Prior art keywords
gene
double
training sample
negative
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111150222.8A
Other languages
Chinese (zh)
Other versions
CN113838525A (en
Inventor
袁杨杨
李淼新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111150222.8A priority Critical patent/CN113838525B/en
Publication of CN113838525A publication Critical patent/CN113838525A/en
Application granted granted Critical
Publication of CN113838525B publication Critical patent/CN113838525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Strategic Management (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Tourism & Hospitality (AREA)
  • Genetics & Genomics (AREA)
  • Biophysics (AREA)
  • Development Economics (AREA)
  • Analytical Chemistry (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Molecular Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Chemical & Material Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a method and a system for predicting pathogenic gene pairs. The method comprises the following steps: constructing a data set based on a double-gene (digenic) disease database, and filtering and screening the data to obtain a reference data set; introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model; predicting the double-gene interaction effect potential values between whole-genome coding gene pairs with the prediction model, and storing them in triplet form to obtain a prediction result; and compressing the prediction result with a paired data compression method. The system comprises a data set construction module, a training module, a prediction module and a compression module. The application can help reveal pathogenic gene pairs with potential interactions, and the method and system can be widely applied in the field of gene pair prediction.

Description

Prediction method and system for pathogenic gene pair
Technical Field
The application relates to the field of gene pair prediction, in particular to a method and a system for predicting pathogenic gene pairs.
Background
Human genetic diseases can be broadly divided into three classes: Mendelian (monogenic) diseases, oligogenic diseases and polygenic diseases. The monogenic model is the simplest genetic model of disease: in theory, one or a few causative sites or genes are sufficient to produce the disease phenotype, as in cystic fibrosis and thalassemia. However, the known susceptibility genes of monogenic diseases do not fully explain the corresponding disease phenotypes. In addition, variable environmental factors and the complexity of the human genome make the phenotypes of many diseases complex, which makes diagnosis more difficult. With the rapid development of whole-genome sequencing in the current era of big data, susceptibility genes for complex diseases are continually being discovered, and locating disease susceptibility genes presents both great opportunities and great challenges.
Traditional methods for locating gene pairs with a double-gene interaction effect include hybridization tests, family-based association analysis, genome-wide association analysis and joint multi-omics analysis. Locating double-gene interaction effects with these methods requires certain preconditions, and the methods are limited in scope and difficult to apply to screening across the whole genome.
Disclosure of Invention
In order to solve the above technical problems, the present application aims to provide a method and a system for predicting pathogenic gene pairs. The algorithm is a supervised learning method that can help reveal pathogenic gene pairs with potential interactions.
The first technical scheme adopted by the application is as follows: a method of predicting a pathogenic gene pair comprising the steps of:
S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;
S2, introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model;
S3, predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model, and storing the potential values in triplet form to obtain a prediction result;
S4, compressing the prediction result based on a paired data compression method.
Further, the step of constructing a data set based on the double-gene disease database and performing data filtering and screening to obtain a reference data set specifically comprises the following steps:
S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;
S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample;
S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample;
S14, taking gene pairs obtained by pairwise combination of the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample;
S15, taking gene pairs obtained by randomly combining two protein-coding genes from the whole genome as a fourth negative training sample;
S16, taking gene pairs obtained by randomly combining genes of the double-gene disease database as a fifth negative training sample;
S17, filtering and screening the positive training sample and the first to fifth negative training samples based on normal samples and the feature missing rate to obtain the reference data set.
Further, the step of introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model specifically comprises the following steps:
S21, randomly drawing, by equal-proportion sampling, the same number of samples from each class of negative samples in the reference data set to form a negative sample subset equal in size to the positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set;
S22, sampling with replacement within each negative sample set and repeating step S21 until a preset number of iterations is reached, to obtain a plurality of sub-training sets;
S23, for each sub-training set, introducing features and training with a random forest method to obtain a sub-model;
S24, calculating the out-of-bag error rate of each sub-model and selecting the sub-models whose out-of-bag error rate is larger than a preset value to obtain the selected sub-models;
S25, weighting all selected sub-models, with the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.
Further, the step of introducing features and training with a random forest method to obtain a sub-model for each sub-training set specifically includes:
S231, for each sub-training set, a feature value of 0 indicates that the two genes have no similarity, and a missing feature value is set to -1;
S232, counting the missing feature values of each sample in each sub-training set and using the count as a new feature;
S233, performing 10-fold cross-validation with the weighted harmonic mean of precision and recall (the F-score) as the evaluation criterion, and tuning the random forest parameters by grid search to obtain a sub-model;
the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples in a leaf node.
Further, the characteristics include mutation level information, gene level information, protein interaction level information, protein structure information, expression level information, and phenotype level information.
Further, the step of compressing the prediction result based on the paired data compression method specifically includes:
constructing a gene name dictionary file in gene-name order;
and converting the potential value of each gene pair into an integer according to a preset rule, and storing the integer in a data file to obtain a compressed prediction result.
Further, the preset rule includes:
if 2 decimal places are retained, each potential value is represented by L_d = 1 byte: it is converted to the integer v' = 100v and stored as v' & 0xFF;
if 4 decimal places are retained, each potential value is represented by L_d = 2 bytes: it is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];
if 6 decimal places are retained, each potential value is represented by L_d = 3 bytes: it is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
Further, the method also comprises a step of quickly accessing the compressed prediction result, which specifically comprises the following steps:
taking the gene pair (G_i, G_j) to be looked up;
reading the gene names recorded in the gene name dictionary file and building an index in that order;
obtaining the indexes i and j of gene G_i and gene G_j in the dictionary, and obtaining the start address I(i, j) of the gene pair in the data file;
moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;
and restoring the potential value by reversing the encoding on the obtained byte array.
The second technical scheme adopted by the application is as follows: a predictive system for a pathogenic gene pair comprising:
the data set construction module is used for constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set;
the training module is used for introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model;
the prediction module predicts the double-gene interaction effect potential value between the whole genome coding gene pairs based on the prediction model and stores the double-gene interaction effect potential value in a triplet form to obtain a prediction result;
and the compression module is used for compressing the prediction result based on the paired data compression method.
The beneficial effects of the method and the system are as follows: the application is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on biological similarity or correlation between genes; it can evaluate double-gene interaction effects on different gene pairs without prior knowledge of candidate genes; and, because the reference data set has been filtered and screened for quality, it yields more reliable prediction results.
Drawings
FIG. 1 is a flow chart showing the steps of a method for predicting a pathogenic gene pair according to the present application;
FIG. 2 is a block diagram of a predictive system for a pathogenic gene pair of the present application;
FIG. 3 is a diagram illustrating compression and quick access of prediction results in accordance with an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to fig. 1, the present application provides a method for predicting a pathogenic gene pair, the method comprising the steps of:
S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;
S2, introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model;
S3, predicting the double-gene interaction effect potential values between whole-genome coding gene pairs based on the prediction model, and storing the potential values in triplet form to obtain a prediction result;
S4, compressing the prediction result based on a paired data compression method.
Further as a preferred embodiment of the method, the step of constructing a dataset based on the database of double-gene disease and performing data filtering and screening to obtain a reference dataset specifically includes:
S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;
S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample (MD);
Specifically, we assume that there is no double-gene interaction effect between the main pathogenic genes of common monogenic diseases.
S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample (LOF);
Specifically, we assume that LOF (loss of function) genes are essentially free of double-gene interaction effects.
S14, taking gene pairs obtained by pairwise combination of the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample (MDLOF);
S15, taking gene pairs obtained by randomly combining two protein-coding genes from the whole genome as a fourth negative training sample (Random);
Specifically, we consider the double-gene interaction effect to be uncommon, so the probability that two genes selected at random from the whole genome have such an effect is extremely low.
S16, taking gene pairs obtained by randomly combining genes of the double-gene disease database as a fifth negative training sample (DIDA_NDI);
S17, filtering and screening the positive training sample and the first to fifth negative training samples based on normal samples and the feature missing rate to obtain the reference data set.
Specifically, we used data from the 1000 Genomes Project for filtering. The 1000 Genomes samples are normal samples, i.e., samples without a disease phenotype. If both genes of a gene pair carry non-synonymous mutations with an allele frequency of 1% or less, and at least 2 individuals in the 1000 Genomes samples carry such mutations in both genes, the pair is considered a true negative gene pair; only such pairs are retained in the final negative training set. In addition, many features are used to train the model, but their missing rates differ, and a large number of missing values reduces model performance. Gene pairs with a high feature missing rate are therefore removed, which improves the reliability of the training set and the performance of the model.
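The two filters described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the `rare_carriers` mapping (gene to the set of healthy individuals carrying a rare non-synonymous variant in that gene) is a hypothetical precomputed input, and the 30% missing-rate threshold is an assumed value that the text does not specify.

```python
from itertools import combinations


def filter_negative_pairs(candidate_pairs, rare_carriers, min_carriers=2):
    """Keep candidate pairs co-carried by at least `min_carriers` healthy individuals.

    rare_carriers: gene -> set of sample IDs carrying a rare (AF <= 1%)
    non-synonymous variant in that gene (hypothetical precomputed input).
    """
    kept = []
    for gene_a, gene_b in candidate_pairs:
        shared = rare_carriers.get(gene_a, set()) & rare_carriers.get(gene_b, set())
        if len(shared) >= min_carriers:
            kept.append((gene_a, gene_b))
    return kept


def drop_sparse_pairs(features, max_missing_rate=0.3):
    """Drop pairs whose share of missing (None) feature values exceeds the threshold.

    The 0.3 default is an assumption; the source gives no numeric cut-off.
    """
    return {
        pair: values
        for pair, values in features.items()
        if sum(v is None for v in values) / len(values) <= max_missing_rate
    }


carriers = {"G1": {"s1", "s2", "s3"}, "G2": {"s2", "s3"}, "G3": {"s3"}}
pairs = list(combinations(sorted(carriers), 2))
print(filter_negative_pairs(pairs, carriers))  # [('G1', 'G2')]
```

Only the pair G1/G2 is kept in this toy example, because it is the only pair for which two healthy individuals carry rare variants in both genes.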
Further as a preferred embodiment of the method, the step of introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model specifically comprises the following steps:
S21, randomly drawing, by equal-proportion sampling, the same number of samples from each class of negative samples in the reference data set to form a negative sample subset equal in size to the positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set;
Specifically, a test set is also constructed by sampling.
S22, sampling with replacement within each negative sample set and repeating step S21 until a preset number of iterations is reached, to obtain a plurality of sub-training sets;
S23, for each sub-training set, introducing features and training with a random forest method to obtain a sub-model;
S24, calculating the out-of-bag error rate of each sub-model and selecting the sub-models whose out-of-bag error rate is larger than a preset value to obtain the selected sub-models;
S25, weighting all selected sub-models, with the out-of-bag error rate as the weight, to obtain the whole-genome double-gene interaction effect potential prediction model.
The formula is as follows:
Specifically, i indexes the selected sub-models, oob_i is the out-of-bag error rate of the i-th sub-model, RF_i is the prediction result of the i-th sub-model, and RF_Final is the weighted double-gene interaction effect potential value.
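Steps S21 to S25 can be illustrated with the sketch below. Several points are assumptions: the weighting formula is not reproduced in this text, so the code takes RF_Final to be the oob-weighted average of the selected sub-model predictions, normalised by the sum of the weights; the selection rule follows the wording of S24 literally (value larger than the preset threshold); and the function names are hypothetical. Training of the random forest sub-models themselves is left out.

```python
import random


def build_sub_training_sets(positives, negative_sets, n_rounds, seed=0):
    """S21-S22: per round, draw an equal share from every negative set
    (with replacement) and combine the draw with all positive samples."""
    rng = random.Random(seed)
    per_set = len(positives) // len(negative_sets)  # equal-proportion share
    sub_sets = []
    for _ in range(n_rounds):
        negatives = []
        for negative_set in negative_sets:
            negatives.extend(rng.choices(negative_set, k=per_set))
        labelled = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
        sub_sets.append(labelled)
    return sub_sets


def weighted_potential(sub_predictions, oob_values, threshold):
    """S24-S25: keep sub-models whose oob value exceeds the threshold and
    return their oob-weighted average (normalisation is an assumption)."""
    selected = [(p, w) for p, w in zip(sub_predictions, oob_values) if w > threshold]
    if not selected:
        raise ValueError("no sub-model passed the selection threshold")
    total_weight = sum(w for _, w in selected)
    return sum(p * w for p, w in selected) / total_weight


sets = build_sub_training_sets(["p1", "p2", "p3", "p4"], [["a", "b", "c"], ["d", "e"]], n_rounds=3)
print(len(sets), len(sets[0]))  # 3 8
```

Dividing by the sum of the weights keeps the combined value in [0, 1] when each sub-model outputs a probability, which matches the range later used for compression.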
Further as a preferred embodiment of the method, the step of introducing features and training to obtain a sub-model based on a random forest method for each sub-training set specifically includes:
S231, for each sub-training set, a feature value of 0 indicates that the two genes have no similarity, and a missing feature value is set to -1;
Specifically, since we use biological similarity or correlation between two genes to evaluate the double-gene interaction effect, a feature value of 0 means that no such similarity or correlation exists between the two genes; where a feature value is missing, we set it to -1 to distinguish it from 0.
S232, counting the missing feature values of each sample in each sub-training set and using the count as a new feature;
Specifically, to evaluate whether the number of missing values affects the final model decision, we count the missing values (-1) in each sample and add this count as a new feature for model training.
S233, performing 10-fold cross-validation with the weighted harmonic mean of precision and recall (the F-score) as the evaluation criterion, and tuning the random forest parameters by grid search to obtain a sub-model;
the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples in a leaf node.
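The encoding in S231 and S232 can be sketched as below. The use of `None` for a missing raw value and the function name are assumptions made for illustration.

```python
MISSING = -1  # sentinel used for missing feature values, distinct from 0


def encode_features(raw_values):
    """Replace missing values (None) with -1 and append the number of
    missing values in the sample as an extra feature."""
    n_missing = sum(value is None for value in raw_values)
    encoded = [MISSING if value is None else value for value in raw_values]
    return encoded + [n_missing]


print(encode_features([0.5, None, 0, None]))  # [0.5, -1, 0, -1, 2]
```

If scikit-learn were used for S233 (the source does not name a library), the five tuned parameters would correspond to the `RandomForestClassifier` arguments `n_estimators`, `max_features`, `max_depth`, `min_samples_split` and `min_samples_leaf`.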
Finally, the weighted model is used to predict the double-gene interaction effect potential between whole-genome coding gene pairs. All coding genes in the whole genome are downloaded from the official website of the HUGO Gene Nomenclature Committee (HGNC); the whole-genome coding gene pairs are then obtained by pairwise combination, and the relevant features of all gene pairs are obtained from the corresponding databases described herein. As in step S23, missing features are replaced with -1, and for each sample the number of missing values is counted and added as an additional feature for prediction. The triplet form is shown in panel C of fig. 3.
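A minimal sketch of the pairwise enumeration and the triplet output follows. The `score_pair` callable stands in for the weighted prediction model and is hypothetical, as is the function name.

```python
from itertools import combinations


def predict_all_pairs(genes, score_pair):
    """Yield (gene_a, gene_b, potential) triplets for every unordered pair
    of genes, enumerated in sorted gene-name order."""
    for gene_a, gene_b in combinations(sorted(genes), 2):
        yield gene_a, gene_b, score_pair(gene_a, gene_b)


triplets = list(predict_all_pairs(["B", "A", "C"], lambda a, b: 0.5))
print(triplets)  # [('A', 'B', 0.5), ('A', 'C', 0.5), ('B', 'C', 0.5)]
```

For n genes this produces n(n-1)/2 triplets, which is why the later compression step matters for whole-genome output.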
Further as a preferred embodiment of the method, the features include:
mutation-level information: scores of a gene's tolerance to mutation;
gene-level information: the probability that a gene is a recessive pathogenic gene, the essentiality status of a gene for basic functional development, the probability that a gene is essential, whether the gene pair participates in the same pathway, the semantic similarity of the gene pair in the Gene Ontology (GO), the magnitude of the gene-gene interaction effect, the number and similarity of genes that interact with both genes of the pair, haploinsufficiency, the destructive strength of a gene, the biological distance between the genes of the pair, and the tolerance of a gene to loss of function;
protein-interaction-level information: protein-protein interaction effects, mainly from the BioGRID and STRING databases;
protein structure information: mainly the gene domain information provided in the UniProtKB database;
expression-level information: the expression of the gene pair in different tissues, protein abundance, and the degree of gene co-expression;
phenotype-level information: the semantic similarity of the gene pair in the Disease Ontology (DO), i.e., the degree of similarity of the disease phenotypes associated with the two genes.
Further as a preferred embodiment of the method, reliable features are screened for training the model. Feature screening mainly comprises the following steps:
because the information coverage of different features differs, features with a high missing rate are deleted first, which avoids the loss of model performance caused by missing information;
the 14 input features with the best evaluated performance are then obtained by recursive feature elimination (RFE); these high-quality features further ensure the accuracy and reliability of the model.
Further as a preferred embodiment of the method, the step of compressing the prediction result based on the paired data compression method specifically includes:
constructing a gene name dictionary file in gene-name order;
Specifically, a gene name dictionary file with the extension *.d is constructed in gene-name (ASCII) order; it is a single file with tab as the separator.
and converting the potential value of each gene pair into an integer according to a preset rule, and storing the integer in a data file to obtain a compressed prediction result.
Specifically, the potential value v ∈ [0, 1] of each gene pair is saved in a *.b data file according to the preset rule.
Further as a preferred embodiment of the method, the preset rule includes:
if 2 decimal places are retained, each potential value is represented by L_d = 1 byte: it is converted to the integer v' = 100v and stored as v' & 0xFF;
if 4 decimal places are retained, each potential value is represented by L_d = 2 bytes: it is converted to the integer v' = 10000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF];
if 6 decimal places are retained, each potential value is represented by L_d = 3 bytes: it is converted to the integer v' = 1000000v and stored as [v' & 0xFF, (v' >> 8) & 0xFF, (v' >> 16) & 0xFF].
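The preset rule amounts to a fixed-point, least-significant-byte-first encoding, sketched below. The function names are hypothetical, and rounding to the nearest integer is an assumption, since the text only says the value is converted to an integer.

```python
BYTES_PER_DECIMALS = {2: 1, 4: 2, 6: 3}  # decimal places -> L_d


def encode_potential(v, decimals=2):
    """Encode v in [0, 1] as L_d bytes, least significant byte first."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("potential value must lie in [0, 1]")
    length = BYTES_PER_DECIMALS[decimals]
    scaled = round(v * 10 ** decimals)  # v' = 100v, 10000v or 1000000v
    return bytes((scaled >> (8 * k)) & 0xFF for k in range(length))


def decode_potential(data, decimals=2):
    """Reverse the encoding: rebuild v' from its bytes and rescale."""
    scaled = sum(byte << (8 * k) for k, byte in enumerate(data))
    return scaled / 10 ** decimals


print(list(encode_potential(0.1234, decimals=4)))  # [210, 4]
```

The byte widths are sufficient because the largest scaled values (100, 10000 and 1000000) fit in 1, 2 and 3 bytes respectively.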
Further as a preferred embodiment of the method, the method further includes a step of quickly accessing the compressed prediction result, which specifically includes:
taking the gene pair (G_i, G_j) to be looked up;
reading the gene names recorded in the gene name dictionary file and building an index in that order;
obtaining the indexes i and j of gene G_i and gene G_j in the dictionary, and obtaining the start address I(i, j) of the gene pair in the data file;
Specifically, the calculation formula of the start address is as follows:
wherein i and j are the indexes of gene G_i and gene G_j in the dictionary, and L_d is the byte length used to hold a potential value.
moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;
and restoring the potential value by reversing the encoding on the obtained byte array.
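The lookup can be sketched as follows. The start-address formula is not reproduced in this text, so the layout used here is an assumption: values are taken to be stored for unordered pairs with i < j, row by row, without the diagonal. All function names are hypothetical.

```python
import io


def load_index(dictionary_text):
    """Build gene name -> position from the tab-separated dictionary content."""
    names = dictionary_text.strip().split("\t")
    return {name: position for position, name in enumerate(names)}


def pair_offset(i, j, n_genes, length):
    """Start address of pair (i, j) under the assumed upper-triangle layout."""
    if i == j:
        raise ValueError("a gene is not paired with itself")
    if i > j:
        i, j = j, i
    rank = i * n_genes - i * (i + 1) // 2 + (j - i - 1)
    return rank * length


def read_potential(data_file, index, gene_a, gene_b, decimals=2):
    """Seek to the pair's start address, read L_d bytes and decode them."""
    length = {2: 1, 4: 2, 6: 3}[decimals]
    offset = pair_offset(index[gene_a], index[gene_b], len(index), length)
    data_file.seek(offset)
    raw = data_file.read(length)
    return int.from_bytes(raw, "little") / 10 ** decimals


index = load_index("A\tB\tC\n")
data = io.BytesIO(bytes([10, 20, 30]))  # (A,B)=0.10, (A,C)=0.20, (B,C)=0.30
print(read_potential(data, index, "C", "B"))  # 0.3
```

Because every value occupies exactly L_d bytes, a single seek and one short read retrieve any pair without decompressing the rest of the file.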
The advantages of the application over the prior art are mainly represented by the following three points:
1. Traditional methods for locating double-gene interaction effects require one pathogenic gene and its candidate interacting genes to be known in advance. The application is the first to use a machine learning method to mine double-gene pathogenic gene pairs based on biological similarity or correlation between genes, and it can evaluate double-gene interaction effects on different gene pairs without prior knowledge of candidate genes;
2. The training set is a high-quality reference data set that has been filtered and screened and has relatively wide coverage, and the reliable feature set helps build a relatively robust prediction model. The 3 most important features are protein-protein interaction, semantic similarity in the Gene Ontology (GO), and semantic similarity in the Disease Ontology (DO);
3. The model construction method, based on undersampling and ensembling, yields more reliable prediction results.
As shown in fig. 2, a prediction system of a pathogenic gene pair comprises:
the data set construction module is used for constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set;
the training module is used for introducing features and, based on the reference data set, constructing a random-forest-based whole-genome double-gene interaction effect potential prediction model to obtain a prediction model;
the prediction module predicts the double-gene interaction effect potential value between the whole genome coding gene pairs based on the prediction model and stores the double-gene interaction effect potential value in a triplet form to obtain a prediction result;
and the compression module is used for compressing the prediction result based on the paired data compression method.
Further as a preferred embodiment of the present system, further comprising:
and the access module is used for quickly accessing the compressed prediction result.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (7)

1. A method for predicting a pathogenic gene pair, comprising the steps of:
S1, constructing a data set based on a double-gene disease database, and filtering and screening the data to obtain a reference data set;
S2, introducing features and, based on the reference data set, constructing a random-forest-based whole genome double-gene interaction effect potential prediction model to obtain a prediction model;
S3, predicting the double-gene interaction effect potential values between whole genome coding gene pairs based on the prediction model, and storing them in triplet form to obtain a prediction result;
S4, compressing the prediction result based on a paired data compression method;
wherein the step of constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set specifically comprises:
S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples;
S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample;
S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample;
S14, taking gene pairs obtained by pairwise combination of the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample;
S15, taking gene pairs obtained by randomly combining two protein coding genes from the whole genome as a fourth negative training sample;
S16, taking gene pairs obtained by randomly combining genes of the double-gene disease database as a fifth negative training sample;
S17, filtering and screening the positive training sample and the first, second, third, fourth and fifth negative training samples based on normal samples and the feature missing rate to obtain the reference data set;
and wherein the step of introducing features and, based on the reference data set, constructing a random-forest-based whole genome double-gene interaction effect potential prediction model to obtain a prediction model comprises:
S21, randomly drawing, by equal-proportion sampling, an equal number of samples from each of the negative sample sets in the reference data set to form a negative sample subset of the same size as the positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set;
S22, sampling with replacement within each negative sample set and repeating step S21 until a preset number of iterations is reached, to obtain a plurality of sub-training sets;
S23, for each sub-training set, introducing features and training based on a random forest method to obtain sub-models;
S24, calculating the out-of-bag error rate of each sub-model and selecting the sub-models whose out-of-bag error rate is greater than a preset value to obtain the selected sub-models;
S25, weighting all selected sub-models, using the out-of-bag error rate as the weight, to obtain the whole genome double-gene interaction effect potential prediction model.
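Steps S21, S22, S24 and S25 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the seed, and the way a remainder is spread over the negative sets are assumptions made for illustration, and the sub-model training of S23 is left out. The out-of-bag quantity is passed in as a plain number, and the selection rule follows the wording of S24 literally (keep values greater than the preset value).

```python
import random
from typing import List, Sequence, Tuple

GenePair = Tuple[str, str]


def build_sub_training_sets(
    positives: Sequence[GenePair],
    negative_sets: Sequence[Sequence[GenePair]],
    rounds: int,
    seed: int = 0,  # hypothetical parameter, only for reproducibility
) -> List[Tuple[List[GenePair], List[GenePair]]]:
    """S21-S22: in each round, draw with replacement an (almost) equal share
    from every negative set so that negatives match the number of positives."""
    rng = random.Random(seed)
    share, remainder = divmod(len(positives), len(negative_sets))
    sub_sets = []
    for _ in range(rounds):
        negatives: List[GenePair] = []
        for k, neg in enumerate(negative_sets):
            # Assumption: the first `remainder` sets contribute one extra sample.
            quota = share + (1 if k < remainder else 0)
            negatives.extend(rng.choices(list(neg), k=quota))
        sub_sets.append((list(positives), negatives))
    return sub_sets


def combine_sub_models(
    scores: Sequence[float],
    oob_values: Sequence[float],
    threshold: float,
) -> float:
    """S24-S25: keep sub-models whose out-of-bag value exceeds the threshold
    and return the average of their scores weighted by that value."""
    kept = [(s, w) for s, w in zip(scores, oob_values) if w > threshold]
    if not kept:
        raise ValueError("no sub-model passed the threshold")
    total_weight = sum(w for _, w in kept)
    return sum(s * w for s, w in kept) / total_weight
```

Here `scores` stands for the potential values that the individual sub-models assign to one gene pair; in practice each would come from a random forest trained on one sub-training set.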
2. The method according to claim 1, wherein the step of introducing features for each sub-training set and training based on a random forest method to obtain sub-models comprises:
S231, for each sub-training set, a feature value of 0 indicates that the two genes have no similarity, and a value of -1 indicates that the feature value is missing;
S232, calculating the number of missing feature values in each sub-training set and using this number as a new feature;
S233, performing 10-fold cross-validation, using the weighted harmonic mean of precision and recall as the evaluation criterion, and tuning the random forest parameters by grid search to obtain a sub-model;
the parameters include the number of trees, the maximum number of features, the maximum depth, the minimum number of samples required to split a node, and the minimum number of samples per leaf node.
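The following Python sketch illustrates S232 and the evaluation criterion of S233 under stated assumptions. It assumes the missing-value count is taken per gene pair (one row of feature values), which the claim does not spell out. The weighted harmonic mean of precision and recall is the standard F-beta score. The grid keys are borrowed from scikit-learn's `RandomForestClassifier` argument names for the five parameters listed above, and the candidate values are purely hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List, Sequence

MISSING = -1  # sentinel for a missing feature value (S231)


def add_missing_count(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """S232: append the number of missing feature values as a new feature."""
    return [list(row) + [sum(1 for v in row if v == MISSING)] for row in rows]


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def iter_grid(grid: Dict[str, Sequence]) -> Iterator[Dict[str, object]]:
    """Enumerate every parameter combination for a grid search."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


# Hypothetical candidate values; the patent text does not list them.
PARAM_GRID = {
    "n_estimators": [100, 500],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
```

Each combination produced by `iter_grid(PARAM_GRID)` would be scored by 10-fold cross-validation with `f_beta`, and the best-scoring combination kept for the sub-model.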
3. The method of claim 2, wherein the features include mutation level information, gene level information, protein interaction level information, protein structure information, expression level information, and phenotype level information.
4. A method for predicting a pathogenic gene pair according to claim 3, wherein the step of compressing the predicted result based on a paired data compression method comprises:
constructing a gene name dictionary file according to the order of the gene names;
and converting the potential value of each gene pair into an integer according to a preset rule, and storing the integer in a data file to obtain a compressed prediction result.
5. The method of claim 4, wherein the predetermined rule comprises:
if 2 decimal places are retained, L_d = 1 byte represents each potential value v, which is converted to the integer v′ = 100·v and stored as [v′ & 0xFF];
if 4 decimal places are retained, L_d = 2 bytes represent each potential value v, which is converted to the integer v′ = 10000·v and stored as [v′ & 0xFF, (v′ >> 8) & 0xFF];
if 6 decimal places are retained, L_d = 3 bytes represent each potential value v, which is converted to the integer v′ = 1000000·v and stored as [v′ & 0xFF, (v′ >> 8) & 0xFF, (v′ >> 16) & 0xFF].
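The rule of claim 5 amounts to fixed-point scaling followed by little-endian byte packing. A minimal Python sketch follows, assuming the potential value lies in [0, 1] and that scaling uses rounding to the nearest integer. Neither assumption is stated in the claim, and the function names are illustrative.

```python
# Number of bytes L_d used for each supported number of decimal places.
BYTES_PER_DECIMALS = {2: 1, 4: 2, 6: 3}


def encode_potential(v: float, decimals: int) -> bytes:
    """Scale v to an integer and emit its low-order bytes first."""
    n_bytes = BYTES_PER_DECIMALS[decimals]
    scaled = round(v * 10 ** decimals)  # assumption: round to nearest
    if not 0 <= scaled < (1 << (8 * n_bytes)):
        raise ValueError("value does not fit in the chosen byte width")
    return bytes((scaled >> (8 * k)) & 0xFF for k in range(n_bytes))


def decode_potential(data: bytes, decimals: int) -> float:
    """Reverse the encoding: reassemble the integer and undo the scaling."""
    return int.from_bytes(data, "little") / 10 ** decimals
```

With 4 decimal places, for example, the value 0.1234 becomes the integer 1234 (0x04D2) and is stored as the two bytes 0xD2, 0x04.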
6. The method according to claim 5, further comprising the step of rapidly accessing the compressed prediction result, which specifically comprises:
searching for a gene pair (G_i, G_j);
reading the gene names recorded in the gene name dictionary file, and constructing an index according to their order;
obtaining the indexes i and j of gene G_i and gene G_j in the dictionary, and obtaining the start address I(i, j) of the gene pair in the data file;
moving the file pointer to I(i, j) and reading L_d bytes to obtain a byte array;
and restoring the potential value from the obtained byte array by reversing the encoding.
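The lookup of claim 6 can be sketched in Python as below. The claim does not define how the start address I(i, j) is computed, so the sketch assumes that unordered pairs are stored row by row in upper-triangular order without the diagonal. That layout, the function names and the gene names are all hypothetical.

```python
import io
from typing import BinaryIO, Dict


def pair_offset(i: int, j: int, n_genes: int, n_bytes: int) -> int:
    """Assumed start address I(i, j) for an upper-triangular layout."""
    if i == j:
        raise ValueError("a gene pair needs two different genes")
    if i > j:
        i, j = j, i
    rank = i * (2 * n_genes - i - 1) // 2 + (j - i - 1)
    return rank * n_bytes


def read_potential(
    data_file: BinaryIO,
    gene_index: Dict[str, int],
    gene_a: str,
    gene_b: str,
    n_bytes: int,
    decimals: int,
) -> float:
    """Seek to the pair's address, read L_d bytes and undo the encoding."""
    i, j = gene_index[gene_a], gene_index[gene_b]
    data_file.seek(pair_offset(i, j, len(gene_index), n_bytes))
    raw = data_file.read(n_bytes)
    return int.from_bytes(raw, "little") / 10 ** decimals


# Toy data: four hypothetical genes, six pairs, 4 decimals in 2 bytes each.
GENES = ["G1", "G2", "G3", "G4"]
GENE_INDEX = {name: k for k, name in enumerate(GENES)}
SCALED = [1234, 0, 10000, 4321, 50, 9999]  # order: (0,1) (0,2) (0,3) (1,2) (1,3) (2,3)
DATA = io.BytesIO(b"".join(x.to_bytes(2, "little") for x in SCALED))
```

Because every pair occupies the same number of bytes, a lookup needs only one seek and one short read, regardless of how many gene pairs the file holds.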
7. A prediction system for a pathogenic gene pair, comprising:
a data set construction module, configured to construct a data set based on the double-gene disease database and to filter and screen the data to obtain a reference data set;
a training module, configured to introduce features and, based on the reference data set, construct a random-forest-based whole genome double-gene interaction effect potential prediction model to obtain a prediction model;
a prediction module, configured to predict the double-gene interaction effect potential values between whole genome coding gene pairs based on the prediction model and to store them in triplet form to obtain a prediction result;
a compression module, configured to compress the prediction result based on a paired data compression method;
wherein constructing a data set based on the double-gene disease database and filtering and screening the data to obtain a reference data set specifically comprises: S11, taking the double-gene pathogenic gene pairs in the double-gene disease database as positive training samples; S12, taking gene pairs obtained by pairwise combination of the pathogenic genes of monogenic diseases as a first negative training sample; S13, taking gene pairs obtained by pairwise combination of loss-of-function genes as a second negative training sample; S14, taking gene pairs obtained by pairwise combination of the main pathogenic genes of monogenic diseases with loss-of-function genes as a third negative training sample; S15, taking gene pairs obtained by randomly combining two protein coding genes from the whole genome as a fourth negative training sample; S16, taking gene pairs obtained by randomly combining genes of the double-gene disease database as a fifth negative training sample; S17, filtering and screening the positive training sample and the first, second, third, fourth and fifth negative training samples based on normal samples and the feature missing rate to obtain the reference data set;
and wherein introducing features and, based on the reference data set, constructing a random-forest-based whole genome double-gene interaction effect potential prediction model to obtain a prediction model specifically comprises: S21, randomly drawing, by equal-proportion sampling, an equal number of samples from each of the negative sample sets in the reference data set to form a negative sample subset of the same size as the positive training samples, and combining all positive samples with the undersampled negative samples to form a sub-training set; S22, sampling with replacement within each negative sample set and repeating step S21 until a preset number of iterations is reached, to obtain a plurality of sub-training sets; S23, for each sub-training set, introducing features and training based on a random forest method to obtain sub-models; S24, calculating the out-of-bag error rate of each sub-model and selecting the sub-models whose out-of-bag error rate is greater than a preset value to obtain the selected sub-models; S25, weighting all selected sub-models, using the out-of-bag error rate as the weight, to obtain the whole genome double-gene interaction effect potential prediction model.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111150222.8A CN113838525B (en) 2021-09-29 2021-09-29 Prediction method and system for pathogenic gene pair

Publications (2)

Publication Number Publication Date
CN113838525A CN113838525A (en) 2021-12-24
CN113838525B true CN113838525B (en) 2023-09-29

Family

ID=78967648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150222.8A Active CN113838525B (en) 2021-09-29 2021-09-29 Prediction method and system for pathogenic gene pair

Country Status (1)

Country Link
CN (1) CN113838525B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341366A (en) * 2017-07-19 2017-11-10 西安交通大学 A kind of method that complex disease susceptibility loci is predicted using machine learning
CN109727642A (en) * 2019-01-22 2019-05-07 袁隆平农业高科技股份有限公司 Full-length genome prediction technique and device based on Random Forest model
CN110136773A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of phytoprotein interaction network construction method based on deep learning
CN113066586A (en) * 2021-04-01 2021-07-02 北京果壳生物科技有限公司 Method for constructing disease classification model based on multi-gene risk scoring

Also Published As

Publication number Publication date
CN113838525A (en) 2021-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant