CN110890131B - Method for predicting cancer risk based on genetic gene mutation - Google Patents

Info

Publication number
CN110890131B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911063276.3A
Other languages
Chinese (zh)
Other versions
CN110890131A (en)
Inventor
李嘉路
华芮
杨安力
刘洪�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huajia Biological Intelligence Technology Co ltd
Original Assignee
Shenzhen Huajia Biological Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for predicting cancer risk based on hereditary (germline) gene mutations. The method belongs to the field of medical statistics and constructs an individualized, quantitative risk prediction model for primary cancer from the medical history and saliva samples of family members of primary cancer patients. Using molecular biology assays, the invention provides a strategy for accurately measuring germline mutations of the relevant genes from the oral epithelial cells of counselees and then constructing a statistical prediction model in combination with family medical history data. The proposed model can integrate information on multiple primary cancers, make full use of family structure information to correct ascertainment bias, and analyze the residual correlation among members of a family. Simulation experiments show that the method accurately recovers the model parameters. Fitting to real patient data shows that the method reaches the highest accuracy currently known on the cross-validation data set and predicts primary cancer risk more stably on an independent data set. The invention can be used to build a cancer risk assessment system for genetic counseling and to screen key gene loci.

Description

Method for predicting cancer risk based on genetic gene mutation
Technical Field
The invention belongs to the field of medical statistics, and particularly relates to a method for constructing a primary cancer risk prediction model based on hereditary mutation data.
Background
A hereditary mutation, also called a germline mutation, is present in all cells of an individual, somatic and germ cells alike, and already exists at the fertilized-egg stage. Cancers that result from germline mutations in cancer-related genes are referred to as hereditary cancers. About 10% of cancer patients have hereditary cancers. Because hereditary cancers are less random than non-hereditary cancers, research into the pathogenesis and prevention of known hereditary cancers is relatively mature. Many data show that early cancer surveillance for populations at high risk of hereditary cancer can significantly improve their prognosis. If a family at high risk of hereditary primary cancer can be identified at an early stage, the family members who have not yet developed cancer will benefit. However, a risk assessment method can be applied widely only if the screening method is reliable and non-invasive.
The risk prediction model is the core tool of risk assessment. Early prediction models were based on susceptibility gene mutation information, combined with epidemiological study data, and gave largely qualitative risk assessments. In recent years, researchers have found incomplete penetrance in more and more hereditary cancer syndromes, i.e., carriers of a susceptibility gene are not necessarily at absolute high risk of developing cancer, so it is important to construct a quantitative model of cancer risk. Wang et al. (1) used Bayes' rule to calculate the penetrance of multiple primary melanoma. Their calculation relies on previously published penetrance and relative-risk estimates, such as the estimated penetrance of mutation carriers, the ratio of mutation carriers to wild-type patients among patients with multiple primary melanoma, and the ratio of patients with multiple primary melanoma to patients with a single primary melanoma among mutation carriers. The method is non-parametric and cannot take into account the influence of age and other risk factors on cancer risk estimation. It therefore has great limitations and cannot be applied in clinical practice. Iversen et al. (2) developed a statistical method to correct ascertainment bias, but it has not been applied in practice to risk prediction models.
More complex statistical models are increasingly being used to construct risk prediction models. For example, Shin et al. (3) fitted data from patients with Li-Fraumeni syndrome (LFS) using a semiparametric recurrent-event survival analysis model and estimated the model parameters with a Bayesian method. Although this approach predicts the risk of the first and second primary cancers, it has the following significant drawbacks: a) because a large amount of mutation data is missing in the training data, the method integrates a complicated missing-data imputation model, which makes the model overly complex and hard to converge during parameter estimation; b) the model can integrate mutation information from only a single gene, which makes it difficult to study further genetic risk factors; c) the method over-emphasizes the prediction of multiple primary cancers (such as second primary cancers), whereas in practice the risk prediction of the first primary cancer is often more important; d) the work describes only the construction of a statistical model and does not form a system that can be deployed in practice. Overall, the method has not been turned into a broadly applicable clinical tool for hereditary cancer risk assessment, nor is it applicable to constructing risk prediction methods for syndromes other than LFS.
Shin et al. (4) fitted LFS patient data using a semiparametric competing-risk survival analysis model to predict the risk of three different types of cancer. The main technical drawback of this approach is that, although it specifically predicts the risk of the first cancer, it does not use the information on second cancers in the medical history data, so its prediction accuracy is not clearly better than that of traditional methods. In addition, the method also integrates a missing-data imputation model and only predicts the overall mutation probability of a single gene, so it cannot be used when a hereditary cancer is associated with multiple genes or loci.
Building on this earlier research, the invention provides a method for constructing a cancer risk prediction model based on molecular biology experiments and a statistical algorithm. Simulation experiments show that the method accurately recovers the parameters of the simulation model. Validation on real data shows that, using only the information of a single germline mutation, the accuracy of predicting the first primary cancer within 5 years reaches AUC = 0.84, the highest value currently known for comparable research problems. The invention thus provides a non-invasive, practical, and generally applicable method for constructing a hereditary cancer risk prediction model.
Disclosure of Invention
The invention mainly addresses the technical problem of constructing a hereditary cancer risk prediction method. The technical scheme accurately measures germline gene mutations from oral epithelial cells using molecular biology experiments and constructs a statistical survival analysis model from them. The overall workflow is shown in Fig. 1, and the specific construction process is as follows.
1. Non-invasive measurement of the sequence at key germline mutation sites: the invention sequences saliva samples. Because the main users are healthy people and the method serves as an early screening tool, obtaining germline mutation information at key sites conveniently and non-invasively is particularly important. The specific procedure is as follows.
(1) Verifying the stability of double-stranded DNA (dsDNA) in saliva samples stored at room temperature:
A set of experiments verified whether the dsDNA concentration in human saliva samples changes significantly between time points during room-temperature storage. Multiple oral epithelial cell samples were collected and three different saliva preservation solutions were tested; dsDNA concentrations were compared in parallel on the third and seventh days after sample collection. As shown in Fig. 2 and Fig. 3, none of the three preservation methods showed a significant difference in dsDNA concentration between the two time points, and for one preservation solution the difference and variance of the concentration data were small, so that solution can be used in practice.
(2) Detecting key site sequences using saliva samples stored until day seven: taking the site examined in the specific embodiment as an example, PCR primers (Fig. 4) were designed to amplify the sequence around the mutation site, Sanger sequencing was performed, and whether the tested person carries a germline mutation at the site was determined from the allele frequency at that site.
2. Training a broad-spectrum risk prediction model for the first primary cancer from germline mutation data and family history data: because hereditary cancers cluster in families, the proposed risk prediction model is built on family medical history data, kinship information among family members, and germline mutation data. There are two main reasons for fitting the model to family data:
a) Even if all genetic factors are considered, family members still share unknown risk factors, such as unrecorded lifestyle and dietary habits. Training the model on family data allows these unknown factors to be captured at the same time, and integrating more relevant risk factors should, in theory, improve the predictive performance of the model;
b) The most effective statistical methods currently available for correcting ascertainment bias operate at the family level.
The family-based strategy for constructing the risk prediction model proposed by the invention can improve the stability of the model when predicting risk across different family groups. The specific procedure is as follows.
(1) Quantitative cancer risk prediction model:
For the j-th patient from the i-th family, let v_ij and K_ij denote the censoring time and the total number of observed primary cancers, respectively. Assume the observed data are the onset times t_ij,k, where t_ij,k is the time of onset of the k-th primary cancer. The covariate vector x_ij(t) comprises the patient's genotype and sex, or other relevant risk factors, together with a time-dependent indicator function of whether the patient has already developed a first primary cancer by time t. Given x_ij(t) and the family random effect ξ_i, the proposed model describes the conditional intensity function λ with a multiplicative model.
Assume that, given x_ij and ξ_i, the cancer histories of related patients are independent of each other. The conditional likelihood L_ij(θ, ξ) of the j-th patient in the i-th family is then defined from the conditional intensity λ, the onset times t_ij,k, and the censoring time v_ij.
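A sketch of how the multiplicative intensity and the resulting likelihood could be written, assuming the proportional-intensity frailty form used in the cited recurrent-event work (3) and the symbols defined in claim 3. The baseline intensity \lambda_0, the coefficient vector \beta, and the convention t_{ij,0} = 0 are assumptions introduced here for illustration, not taken from the source:

```latex
\lambda\bigl(t \mid x_{ij}(t), \xi_i\bigr)
  = \xi_i \, \lambda_0(t) \, \exp\bigl(\beta^{\top} x_{ij}(t)\bigr)

L_{ij}(\theta, \xi_i) =
  \left[ \prod_{k=1}^{K_{ij}}
    \lambda\bigl(t_{ij,k} \mid x_{ij}(t_{ij,k}), \xi_i\bigr)
    \exp\!\left( -\int_{t_{ij,k-1}}^{t_{ij,k}}
      \lambda\bigl(u \mid x_{ij}(u), \xi_i\bigr)\, du \right) \right]
  \exp\!\left( -\int_{t_{ij,K_{ij}}}^{v_{ij}}
      \lambda\bigl(u \mid x_{ij}(u), \xi_i\bigr)\, du \right),
  \qquad t_{ij,0} = 0
```

Each observed cancer contributes its intensity at the onset time, and every interval without a cancer, including the final interval up to the censoring time v_ij, contributes a survival term.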
Because the proposed model also uses the onset time of each patient's second cancer, the sample size available for parameter estimation increases, which further improves statistical power.
With ξ_1, …, ξ_I ~ Gamma(φ, φ), the cancer risk value is calculated from the conditional intensity as a probability on the gap time to the next primary cancer.
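The risk calculation can be illustrated with a minimal Python sketch. It assumes a constant baseline intensity `lam0` and a log-linear covariate effect `beta` (both hypothetical; the source does not give the baseline form), and it averages over a Gamma(φ, φ) frailty in closed form rather than using a family-specific posterior:

```python
import math


def cumulative_intensity(lam0, beta, x, w):
    """Cumulative intensity over a window of length w for a constant baseline
    lam0 and time-fixed covariates x (an illustrative simplification)."""
    linear = sum(b * xi for b, xi in zip(beta, x))
    return lam0 * math.exp(linear) * w


def risk_given_frailty(cum, xi):
    """Pr(gap time <= w | frailty xi) = 1 - exp(-xi * cumulative intensity)."""
    return 1.0 - math.exp(-xi * cum)


def risk_marginal(cum, phi):
    """Risk averaged over xi ~ Gamma(shape=phi, rate=phi), using the Laplace
    transform E[exp(-xi * cum)] = (phi / (phi + cum)) ** phi."""
    return 1.0 - (phi / (phi + cum)) ** phi


# Hypothetical inputs: carrier (x[0] = 1), second covariate 0, five-year window.
cum = cumulative_intensity(lam0=0.01, beta=[1.2, 0.3], x=[1, 0], w=5.0)
five_year_risk = risk_marginal(cum, phi=3.0)
```

As φ grows, the frailty concentrates at 1 and the marginal risk approaches the risk conditional on ξ = 1, which is a quick sanity check on the formula.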
(2) Correcting ascertainment bias:
Ascertainment bias is corrected using the ascertainment-corrected joint (ACJ) likelihood of Iversen et al. (2). The ACJ likelihood of the i-th family conditions the joint likelihood of the family's cancer histories h_i and observed genotypes on the ascertainment event A_i = 1,
where θ denotes the parameters to be estimated. The correction requires the probability that the family is selected through ascertainment.
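One way to write the ACJ likelihood that agrees with the symbol list in claim 1 is sketched below. The symbol x_i for the genotype-independent covariates is an assumed name, and the proportionality holds only under the additional assumption that the selection probability given the full family data does not depend on θ:

```latex
L_i^{\mathrm{ACJ}}(\theta, \xi_i)
  = \Pr\bigl(h_i, g_{i,\mathrm{obs}} \mid x_i, \theta, \xi_i, A_i = 1\bigr)
  \propto
  \frac{\Pr\bigl(h_i, g_{i,\mathrm{obs}} \mid x_i, \theta, \xi_i\bigr)}
       {\Pr\bigl(A_i = 1 \mid x_i, \theta, \xi_i\bigr)}
```

The denominator is the ascertainment probability of the family; dividing by it down-weights the likelihood of families that were likely to be sampled because of their cancer history.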
3. Model parameters are estimated using a Bayesian Markov chain Monte Carlo (MCMC) method: let γ denote any parameter to be estimated. For a given parameter, the t-th iteration proceeds as follows:
a) Prior distribution: a flat prior is set on γ;
b) Proposal (sampling) distribution: a candidate γ* is generated based on γ^(t-1);
c) Iterative update: first calculate the correction weight for the proposal distribution. Then, given the cancer phenotype or time-to-event data, calculate the acceptance rate.
The candidate γ* is accepted with probability equal to the acceptance rate, and γ^(t-1) is retained otherwise: a uniform random number on (0, 1) is drawn; if it is below the acceptance rate, set γ^(t) = γ*, otherwise set γ^(t) = γ^(t-1).
Because some parameters in the model take only positive values, the invention uses a log-normal distribution as the proposal distribution for them. Assume γ^(t-1) ∈ (0, +∞) with a proposal of the form lnN(μ, σ). To draw a new sample, a normal variate is first generated on the logarithmic scale around ln γ^(t-1) and then exponentiated, which gives γ* ∈ (0, +∞). Because this proposal distribution is asymmetric, a correction weight is calculated.
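The update described above can be sketched as a Metropolis-Hastings sampler for a single positive parameter. For a log-normal random-walk proposal centered at ln γ^(t-1), the Hastings correction q(γ^(t-1) | γ*) / q(γ* | γ^(t-1)) reduces to γ* / γ^(t-1). The target `toy_log_post` below is a stand-in (an unnormalized Gamma(3, 2) density), not the patent's posterior, and the function names are hypothetical:

```python
import math
import random


def mh_positive(log_post, gamma0, sigma, n_iter, seed=0):
    """Metropolis-Hastings for one positive parameter with a log-normal proposal."""
    rng = random.Random(seed)
    gamma = gamma0
    samples = []
    for _ in range(n_iter):
        # Propose on the log scale, then exponentiate so the candidate stays positive.
        cand = math.exp(math.log(gamma) + rng.gauss(0.0, sigma))
        # Hastings weight for the asymmetric proposal: cand / gamma (on the log scale).
        log_weight = math.log(cand) - math.log(gamma)
        log_ratio = log_post(cand) - log_post(gamma) + log_weight
        accept_prob = math.exp(min(0.0, log_ratio))
        # Accept the candidate with probability accept_prob, else keep the old value.
        if rng.random() < accept_prob:
            gamma = cand
        samples.append(gamma)
    return samples


def toy_log_post(g):
    """Unnormalized log density of Gamma(shape=3, rate=2); its mean is 1.5."""
    return 2.0 * math.log(g) - 2.0 * g


draws = mh_positive(toy_log_post, gamma0=1.0, sigma=0.5, n_iter=20000)
kept = draws[2000:]  # discard the first samples as burn-in
posterior_mean = sum(kept) / len(kept)
```

With a flat prior the log posterior equals the log likelihood up to a constant, so in the actual model `log_post` would be the ascertainment-corrected log likelihood evaluated at the candidate value.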
For the random effect ξ_i, the hyperparameter φ is assigned a prior based on the Gamma distribution, parameterized by its shape and rate. The posterior distribution of φ is then determined by this prior and by the random effects ξ_1, …, ξ_I,
where I is the total number of families in the data.
Drawings
FIG. 1 is a schematic diagram of a cancer risk prediction quantification method construction and application flow.
FIG. 2 is a gel electrophoresis result of a stability test of dsDNA extracted from saliva samples after seven days of storage at room temperature, wherein A, B, and C represent three different saliva preservation methods, each of which was repeated three times; M: marker 2000 bp plus.
FIG. 3 is a comparison of the concentration of dsDNA extracted at day three and day seven when saliva samples were stored at room temperature.
FIG. 4 is sequencing primer information for the mutation site of TP53 gene in the embodiment wherein the gray region is an exon, underlined is a PCR primer, and capital letters are mutation sites detected.
FIG. 5 is a simulation run of the risk prediction model of the present invention, with the horizontal line representing the true parameter values and Corrected and Uncorrected representing the estimates with and without ascertainment bias correction, respectively.
FIG. 6 is a predictive evaluation of the risk prediction model of the present invention in retrospective primary cancer patient data, with the dashed lines representing the individual cross-validation results and the solid line representing their median.
FIG. 7 is the posterior distribution of the parameter estimates of the non-frailty model of an embodiment of the present invention, with the horizontal line representing the posterior median estimate.
FIG. 8 is the posterior distribution of the parameter estimates of the frailty model of an embodiment of the present invention, with the horizontal line representing the posterior median estimate.
FIG. 9 is a predictive evaluation of the risk prediction model of the present invention in independent, primary cancer patient data not used to train the model, wherein the dashed line represents the predicted outcome using the traditional Kaplan-Meier method.
Detailed Description
The present invention is described in further detail below in connection with the detection of genetic mutations of the TP53 gene, and in connection with the specific embodiments of simulation experiments and real data.
1. Evaluating the room-temperature storage stability of saliva samples. The specific steps and results are as follows:
(1) Saliva samples were collected from three people, about 4 ml each. Each sample was divided into four 2.0 ml centrifuge tubes with 0.8 ml of saliva per tube, an equal volume of preservation solution from the following three saliva collection tubes was added to the centrifuge tubes, and the tubes were mixed and left at room temperature;
(2) DNA was extracted from the samples at two time points (the third and seventh days after saliva collection) using a column-based saliva and urine genomic DNA extraction kit (bioengineering limited), and the dsDNA concentration was measured using Qubit. The gel electrophoresis and dsDNA concentration results are shown in Fig. 2 and Fig. 3: for all three saliva preservation solutions, the concentration of dsDNA extracted on day three did not differ significantly from that on day seven.
2. Sequencing of the key mutation sites: as shown in FIG. 4, specific primers were designed for PCR, thereby amplifying key mutation sites of TP53 gene as shown in the following Table, and sequencing was performed using Sanger's method.
3. Simulation test: the test comprises simulated data generation and model fitting. The data are generated as follows:
(1) To simulate family data for patients with multiple primary cancers, the gene mutation status of the proband is simulated first. Conditional on it, the first and second gap times between cancers are generated, with a rate parameter that depends on the covariates,
where the indicator for an earlier primary cancer equals 1 when the model describes the second gap time;
A baseline rate is fixed, and the family-level random effect is either included or omitted, to simulate data with or without family-level random effects, respectively. The hyperparameter of the random effect is set to one of two values to simulate strong or weak residual dependencies among related individuals;
The two gap times generated above are compared with a separately generated censoring time to determine whether a censoring event occurs. To introduce ascertainment bias, a family is retained in the simulated data only if its proband has at least one primary cancer;
(2) Simulating family data: if the proband carries no mutation, all family members are considered mutation-free. If the proband carries a mutation, one of the proband's parents is selected at random as the mutation carrier, and each sibling or offspring of the proband is a carrier with probability 50%. If a sibling of the proband is a carrier, each of that sibling's offspring is a carrier with probability 50%. All family members not related by blood are set to be mutation-free;
(3) The first and second gap times and the cancer occurrence status of the proband's relatives are simulated in the same manner as above. A total of 100 families were simulated, each containing 30 members;
The proposed risk prediction model was fitted to these data, generating 5000 posterior samples of which the first 1000 were discarded as burn-in, and the process was repeated 50 times. As shown in Fig. 5, the proposed method correctly recovers the true model parameters, and adding the ascertainment bias correction clearly improves reliability compared with the uncorrected model.
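The data-generating steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the family layout (two parents, a spouse, a fixed number of siblings and children) is much smaller than the 30-member families of the embodiment, gap times and censoring times are drawn from exponential distributions, and `p_mut`, `lam0`, `beta`, and `cens_rate` are hypothetical values not given in the source:

```python
import math
import random


def simulate_family(rng, p_mut=0.2, n_sibs=2, n_kids=2,
                    lam0=0.005, beta=2.0, cens_rate=0.02):
    """Simulate carrier status and up to two cancer onsets for one family."""
    proband = rng.random() < p_mut
    carriers = {"proband": proband}
    # If the proband carries the mutation, exactly one random parent carries it.
    carrier_parent = rng.choice(["father", "mother"]) if proband else None
    carriers["father"] = carrier_parent == "father"
    carriers["mother"] = carrier_parent == "mother"
    carriers["spouse"] = False  # not related by blood, so mutation-free
    for s in range(n_sibs):
        sib = proband and rng.random() < 0.5
        carriers[f"sib{s}"] = sib
        for c in range(n_kids):
            carriers[f"sib{s}_child{c}"] = sib and rng.random() < 0.5
    for c in range(n_kids):
        carriers[f"child{c}"] = proband and rng.random() < 0.5

    family = {}
    for name, is_carrier in carriers.items():
        rate = lam0 * math.exp(beta * is_carrier)  # assumed log-linear effect
        gap1 = rng.expovariate(rate)
        gap2 = rng.expovariate(rate)
        censor = rng.expovariate(cens_rate)
        # Keep only the onsets that occur before the censoring time.
        onsets = [t for t in (gap1, gap1 + gap2) if t <= censor]
        family[name] = {"carrier": is_carrier, "onsets": onsets, "censor": censor}
    return family


def simulate_ascertained(n_families, seed=0):
    """Keep a family only if its proband has at least one primary cancer."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_families:
        family = simulate_family(rng)
        if family["proband"]["onsets"]:
            kept.append(family)
    return kept


families = simulate_ascertained(20, seed=1)
```

Discarding families whose proband has no cancer is what introduces the ascertainment bias that the correction step is meant to remove.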
4. Real data fitting: the real data used in this embodiment come from primary cancer patients followed by a hospital, together with their family history information. They comprise two data sets, used for cross-validated training and for independent validation (hereinafter the cross-training validation set and the independent validation set); the summary statistics of each data set are shown in the following table:
A 10-fold cross-validation was used to fit the model to the cross-training validation set, with TP53 gene mutation, sex, and the interaction between the two as the model's input covariates. To verify the predictive performance of the model, the time of onset of the primary cancer or of censoring was taken as the starting point and rolled back five years; the trained model was used to calculate the risk of onset within those five years, and ROC (receiver operating characteristic) curves were used to compare the prediction accuracy across thresholds. As shown in Fig. 6, the median ROC curve over the cross-validation folds has an area under the curve of AUC = 0.84.
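The AUC used in this evaluation can be computed without plotting the ROC curve, because it equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. A minimal sketch, where the scores and labels are made-up illustrative values and `roc_auc` is a hypothetical helper name:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Label 1: a primary cancer occurred within the five-year window; 0: it did not.
predicted_risk = [0.9, 0.4, 0.35, 0.8, 0.1]
observed = [1, 0, 1, 1, 0]
auc = roc_auc(predicted_risk, observed)  # 5 of 6 case/non-case pairs are ordered correctly
```

In a 10-fold setting the same function would be applied to the held-out fold of each split, and the fold-wise values summarized by their median.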
The final model was trained on the cross-training validation set, generating 50,000 posterior samples in MCMC parameter estimation, of which the first 5,000 were discarded as burn-in. Figs. 7 and 8 show the models without and with random effects, respectively; the posterior distributions of the parameters generated by MCMC converge well. In addition, the 95% confidence interval of the random-effect hyperparameter φ is (1.774, 6.620), which shows that, although the random effect is small, other unknown residual correlations exist among family members, and researchers should collect more information on relevant risk factors to train a better prediction model.
The trained model was used to predict the five-year risk of developing cancer for each individual in the independent validation set. The prediction accuracy was evaluated using the ROC method and compared with the conventional Kaplan-Meier method. As shown in Fig. 9, the area under the curve of the proposed risk prediction model is AUC = 0.754, against 0.698 for the Kaplan-Meier method, indicating that the proposed scheme outperforms the existing method on this data set.
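For reference, the Kaplan-Meier baseline estimates survival non-parametrically from onset and censoring times. The sketch below uses made-up times and hypothetical function names; how the embodiment stratified the Kaplan-Meier estimate (for example by genotype or sex) is not stated in the source:

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up time per individual; events: 1 = cancer onset, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all individuals sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            surv *= 1.0 - n_events / n_at_risk
            curve.append((t, surv))
        n_at_risk -= n_removed
    return curve


def km_risk(curve, horizon):
    """Estimated probability of an event by `horizon`: 1 - S(horizon)."""
    surv = 1.0
    for t, s in curve:
        if t <= horizon:
            surv = s
    return 1.0 - surv


curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
risk_by_3_5 = km_risk(curve, 3.5)
```

Because the Kaplan-Meier estimate assigns the same risk to everyone in a stratum, it cannot use individual covariates beyond the stratification, which is one reason a covariate-based model can rank individuals more accurately.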
References:
(1) Wang, W., Niendorf, K. B., Patel, D., Blackford, A., Marroni, F., Sober, A. J., ... & Tsao, H. (2010). Estimating CDKN2A carrier probability and personalizing cancer risk assessments in hereditary melanoma using MelaPRO. Cancer Research, 70(2), 552-559.
(2) Iversen Jr, E. S., & Chen, S. (2005). Population-calibrated gene characterization: Estimating age at onset distributions associated with cancer genes. Journal of the American Statistical Association, 100(470), 399-409.
(3) Shin, S. J., Li, J., Ning, J., Bojadzieva, J., Strong, L. C., & Wang, W. (2018). Bayesian estimation of a semiparametric recurrent event model with applications to the penetrance estimation of multiple primary cancers in Li-Fraumeni syndrome. Biostatistics (Oxford, England).
(4) Shin, S. J., Yuan, Y., Strong, L. C., Bojadzieva, J., & Wang, W. (2018). Bayesian Semiparametric Estimation of Cancer-Specific Age-at-Onset Penetrance With Application to Li-Fraumeni Syndrome. Journal of the American Statistical Association, 1-12.

Claims (6)

1. A method for predicting cancer risk based on genetic mutations, comprising:
detecting genetic mutation information at key gene loci from saliva samples using molecular biology experimental techniques;
correcting ascertainment bias using family data and analyzing residual correlations within the family;
fitting multiple primary cancer follow-up data using a recurrent-event survival analysis model;
estimating the real-valued parameters of the model using a Bayesian Markov chain Monte Carlo method;
validating the performance of the model using simulation experiments, cross-validation on patient data, and independent validation on patient data, with prediction of primary cancer onset within 5 years as the criterion;
the statistical method that integrates ascertainment bias correction at the family level is as follows:
wherein Pr represents a likelihood function; the left-hand side represents the ascertainment-corrected likelihood value of the i-th family; h_i represents the cancer history of all family members of the i-th family; g_i represents the genotype-related covariates of the i-th family; obs denotes observed values; a further symbol represents the genotype-independent covariates of the i-th family; θ is the parameter to be estimated; ξ_i represents the random effect of the i-th family; A_i represents the ascertainment indicator variable of the i-th family; and the denominator term is the probability that the family is selected through ascertainment;
wherein the residual correlation includes lifestyle factors and eating habit factors.
2. The method as recited in claim 1, further comprising: collecting human saliva samples, storing them at room temperature for seven days before extracting nucleic acid, performing PCR amplification with site-specific primers and sequencing, and determining whether the tested person carries the genetic mutation at the site based on the allele frequency of the mutation site.
3. The method of claim 1, wherein fitting multiple primary cancer follow-up data using a repeat event survival analysis model comprises: the statistical efficacy of the first cancer risk prediction is improved by integrating the second cancer incidence information, the conditional likelihood for the jth patient in the ith family being defined as:
wherein L is ij (θ, ζ) represents a likelihood function of the jth patient in the ith family; θ is the parameter to be estimated; ζ represents a random effect; k (K) ij Representing the total observed number of primary cancers for the jth patient in the ith household; λ represents a conditional intensity function; t is t ij,k Representing the time to onset of k primary cancers in the jth patient in the ith household; x is x ij A covariate vector representing the jth patient in the ith household; zeta type toy i Is the random effect of the ith family, representing the residual correlation inside this family; t is t ij,k-1 Represents the time to onset of k-1 primary cancers in the jth patient in the ith household; u represents the average value of the period of onset;representing the time of onset of the kth primary cancer in the jth patient in the ith household; v ij Indicating the time of deletion of the jth patient in the ith household; setting xi 1 ,…,ξ I ~Gamma(φ,φ),ξ 1 Representing the random effect of family 1, ζ I Representing the random effect of the ith household, gamma representing Gamma distribution, phi representing the super-parameters of the random effect; the cancer risk value is calculated from the following formula:
wherein Pr denotes the probability (likelihood) function; $W_1$ denotes the gap time of the 1st primary cancer; $W$ denotes a gap time; $x(t)$ denotes the covariate vector at the time of onset; $x(u)$ denotes the covariate vector at the average onset time.
4. The method of claim 1, wherein estimating parameters with real values in the model using the Markov chain Monte Carlo (MCMC) Bayesian method comprises: estimating the model parameters by the MCMC Bayesian method, wherein, for a parameter whose value range is positive, a log-normal distribution is used as the sampling distribution, with $\gamma^{(t-1)} \in (0, +\infty) \sim \ln N(\mu, \sigma)$, where $\gamma$ denotes a model parameter estimated by the Bayesian method, $t$ denotes the iteration number, and $\ln N(\mu, \sigma)$ denotes the log-normal distribution; a new sample $\gamma^{*} = \exp(\ln \gamma^{(t-1)}) \in (0, +\infty)$ is drawn, and the asymmetry of the sampling probability distribution is corrected by the following weight:
5. The method of claim 1, further comprising: verifying the reliability of the model fitting using simulated experimental data; and verifying the accuracy of the model using data from actual primary cancer patients, with the primary cancer predicted to develop within 5 years.
6. The method of claim 1, wherein the method is applied to data analysis in related basic or clinical trials.
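By way of illustration only, the fitting and verification steps of claims 3 to 5 may be sketched in Python as follows. The sketch is not the claimed implementation and rests on several assumptions that do not appear in the source: a constant baseline intensity with no covariates, a known frailty hyperparameter, an Exponential(1) prior on the baseline rate, and the function names `simulate_family`, `log_posterior`, `sample_rate` and `five_year_risk`, all of which are hypothetical. For a log-normal random-walk proposal the ratio q(γ^(t-1) | γ*) / q(γ* | γ^(t-1)) equals γ* / γ^(t-1), and the sketch uses that ratio as the asymmetry correction; whether it coincides with the weight formula of claim 4, which did not survive extraction, is an assumption.

```python
import math
import random


def simulate_family(rng, base_rate, phi, members, follow_up):
    """Total primary-cancer count for one family sharing a Gamma(phi, phi) frailty."""
    xi = rng.gammavariate(phi, 1.0 / phi)  # shape phi, scale 1/phi, so E[xi] = 1
    total = 0
    for _ in range(members):
        t = rng.expovariate(xi * base_rate)  # gap time to the first event
        while t <= follow_up:  # events after the censoring time are unobserved
            total += 1
            t += rng.expovariate(xi * base_rate)
    return total


def log_posterior(gamma, phi, counts, exposure):
    """Log posterior of the baseline rate with the family frailty integrated out.

    Per family, up to a constant: K*ln(gamma) - (K + phi)*ln(phi + gamma*S),
    where K is the family event count and S the summed follow-up time.
    The Exponential(1) prior on gamma is an assumption made for illustration.
    """
    loglik = sum(
        k * math.log(gamma) - (k + phi) * math.log(phi + gamma * exposure)
        for k in counts
    )
    return loglik - gamma


def sample_rate(rng, phi, counts, exposure, n_iter=3000, sigma=0.1, start=0.5):
    """Metropolis-Hastings with a log-normal random-walk proposal."""
    gamma = start
    current = log_posterior(gamma, phi, counts, exposure)
    draws = []
    for _ in range(n_iter):
        proposal = math.exp(math.log(gamma) + rng.gauss(0.0, sigma))
        candidate = log_posterior(proposal, phi, counts, exposure)
        # Asymmetric proposal: the correction factor proposal / gamma enters
        # the acceptance ratio (added here on the log scale).
        log_accept = candidate - current + math.log(proposal) - math.log(gamma)
        if rng.random() < math.exp(min(0.0, log_accept)):
            gamma, current = proposal, candidate
        draws.append(gamma)
    return draws


def five_year_risk(rate, horizon=5.0):
    """Pr(first primary cancer within `horizon` years) at a constant rate, frailty 1."""
    return 1.0 - math.exp(-rate * horizon)


# Simulation check: generate data from known parameters, then try to recover them.
rng = random.Random(0)
TRUE_RATE, PHI, MEMBERS, FOLLOW_UP = 0.05, 2.0, 3, 10.0
counts = [simulate_family(rng, TRUE_RATE, PHI, MEMBERS, FOLLOW_UP) for _ in range(1000)]
draws = sample_rate(rng, PHI, counts, MEMBERS * FOLLOW_UP)
kept = draws[500:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean rate: {posterior_mean:.4f} (simulated with {TRUE_RATE})")
print(f"5-year risk at posterior mean: {five_year_risk(posterior_mean):.4f}")
```

Because every proposal is the exponential of a real number, each draw stays in (0, +∞) without any boundary handling, which is the practical reason for choosing a log-normal proposal for a positive parameter. Comparing the posterior mean with the value used for simulation corresponds to the simulation-based reliability check of claim 5.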
CN201911063276.3A 2019-11-04 2019-11-04 Method for predicting cancer risk based on genetic gene mutation Active CN110890131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911063276.3A CN110890131B (en) 2019-11-04 2019-11-04 Method for predicting cancer risk based on genetic gene mutation

Publications (2)

Publication Number Publication Date
CN110890131A CN110890131A (en) 2020-03-17
CN110890131B CN110890131B (en) 2023-08-25

Family

ID=69746764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911063276.3A Active CN110890131B (en) 2019-11-04 2019-11-04 Method for predicting cancer risk based on genetic gene mutation

Country Status (1)

Country Link
CN (1) CN110890131B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831300B (en) * 2022-09-29 2023-12-29 广州金域医学检验中心有限公司 Detection method, device, equipment and medium based on patient information
CN115577951B (en) * 2022-10-19 2023-09-19 北京爱科农科技有限公司 Summer corn lodging early warning algorithm based on corn growth mechanism model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510863A (en) * 2009-03-17 2009-08-19 江苏大学 Method for recognizing MPSK modulation signal
CN104059966A (en) * 2014-05-20 2014-09-24 吴松 STAG2 gene mutant sequence and detection method thereof as well as use of STAG2 gene mutation in detecting bladder cancer
RU2535157C1 (en) * 2013-08-13 2014-12-10 Федеральное государственное бюджетное учреждение "Научно-исследовательский институт онкологии имени Н.Н. Петрова" Министерства здравоохранения Российской Федерации Method for identifying recessive factors of genetic predisposition to breast cancer
CN107201401A (en) * 2017-05-23 2017-09-26 深圳市第二人民医院 A kind of Multiple-Factor Model and its method for building up for pathogenesis of breast carcinoma risk profile
CN108922628A (en) * 2018-04-23 2018-11-30 华北电力大学 A kind of Prognosis in Breast Cancer survival rate prediction technique based on dynamic Cox model
CN109136368A (en) * 2017-06-28 2019-01-04 海门中科基因生物科技有限公司 Heredity thyroid cancer pathogenic mutation site primer kit
CN110085324A (en) * 2019-04-25 2019-08-02 深圳市华嘉生物智能科技有限公司 A kind of method of multiple existence end results Conjoint Analysis
CN110310701A (en) * 2019-06-05 2019-10-08 复旦大学 Based on EucDiff value prediction mutation to the method and relevant device of RNA secondary structure influence degree

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030170638A1 (en) * 2002-03-07 2003-09-11 White Raymond L. Methods to determine genetic risk through analysis of very large families

Also Published As

Publication number Publication date
CN110890131A (en) 2020-03-17

Similar Documents

Publication Publication Date Title
Gilks et al. Modelling complexity: applications of Gibbs sampling in medicine
US20180247010A1 (en) Integrated method and system for identifying functional patient-specific somatic aberations using multi-omic cancer profiles
US20050032066A1 (en) Method for assessing risk of diseases with multiple contributing factors
US20200402614A1 (en) A computer-implemented method of analysing genetic data about an organism
Scott-Boyer et al. An integrated hierarchical Bayesian model for multivariate eQTL mapping
CN110890131B (en) Method for predicting cancer risk based on genetic gene mutation
WO2022170909A1 (en) Drug sensitivity prediction method, electronic device and computer-readable storage medium
EP4022626B1 (en) Computer-implemented method and apparatus for analysing genetic data
CN113096728B (en) Method, device, storage medium and equipment for detecting tiny residual focus
JP2022549737A (en) Polygenic risk score for in vitro fertilization
CA3158101A1 (en) Systems and methods for evaluating longitudinal biological feature data
WO2018137496A1 (en) Method and device for determining proportion of free nucleotide from predetermined source in biological sample
US20240038330A1 (en) Computer-implemented method and apparatus for analysing genetic data
Yoruk et al. A comprehensive statistical model for cell signaling
WO2022156610A1 (en) Prediction tool for determining sensitivity of liver cancer to drug and long-term prognosis of liver cancer on basis of genetic testing, and application thereof
JP2023543719A (en) Detecting cross-contamination in sequencing data
US20200105374A1 (en) Mixture model for targeted sequencing
US20240105280A1 (en) Computer-implemented method and apparatus for analysing genetic data
Temple et al. Modeling recent positive selection in Americans of European ancestry
Huang et al. Extending models via gradient boosting: an application to Mendelian models
Stone Threshold Parameter Optimization in Weighted Quantile Sum Regression
Zhou et al. A Bayesian model averaging approach for observational gene expression studies
Iniesta et al. Assessment of genetic association using haplotypes inferred with uncertainty via markov chain monte carlo
Ogundijo Bayesian Inference for Genomic Data Analysis
Wang Efficient Statistical Models For Detecting And Analyzing Human Genetic Variations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant