CN117373678B - Disease risk prediction model construction method and analysis method based on mutation signature

Info

Publication number
CN117373678B
Authority
CN
China
Prior art keywords
mutation
disease risk
prediction model
risk prediction
signature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311678281.1A
Other languages
Chinese (zh)
Other versions
CN117373678A (en)
Inventor
濮梦辰
郑炜圣
李晓荣
樊可悦
田凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wangshi Intelligent Technology Co ltd
Original Assignee
Beijing Wangshi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wangshi Intelligent Technology Co ltd
Priority to CN202311678281.1A
Publication of CN117373678A
Application granted
Publication of CN117373678B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Analytical Chemistry (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of biological genes, and provides a disease risk prediction model construction method and an analysis method based on mutation signatures. The disease risk prediction model construction method based on mutation signature comprises the following steps: acquiring first training data, wherein the first training data comprises mutation signature data of a plurality of patients, and the mutation signature data comprises at least one mutation signature and mutation signature activity values corresponding to the mutation signatures; training the first disease risk prediction model according to the first training data to obtain a second disease risk prediction model; inputting the first training data into a second disease risk prediction model to obtain mutation types corresponding to each patient; and according to each mutation type, adjusting the second disease risk prediction model to obtain a third disease risk prediction model. According to the invention, the generalization performance of the model is improved, so that the accuracy of predicting the disease risk by the third disease risk prediction model is higher.

Description

Disease risk prediction model construction method and analysis method based on mutation signature
Technical Field
The invention relates to the technical field of biological genes, in particular to a disease risk prediction model construction method and an analysis method based on mutation signatures.
Background
For most cancers, cancer cell metastasis is a major cause of cancer disease progression and death. Studies have shown that nearly 90% of cancer deaths are caused by cancer cell metastasis. Cancer genomes are susceptible to numerous mutations and rearrangements, exhibiting genomic instability and heterogeneity. These variations regulate the expression and function of genes associated with cell growth, differentiation, survival and migration. The main cause of cancer-related morbidity and mortality is metastatic spread, i.e., the spread of cancer cells from a primary site to other parts of the body through blood or lymphatic vessels. This process typically requires a source of cellular stress and environmental impact, resulting in dramatic changes in the cancer cell genome. These changes may confer an adaptive advantage to cancer cells, such as enhanced invasiveness and resistance to treatment. Thus, determining whether a patient's tumor is carcinoma in situ or metastatic is critical to the establishment of an effective strategy for preventing and treating cancer.
Individuals accumulate somatic mutations over their lifetime due to various genetic and environmental factors. These somatic mutations can occur in any part of the genome and can affect the function and regulation of genes in different ways. While most somatic mutations are neutral and accumulate passively, some genomic changes alter the regulation and function of DNA sequences, causing cells to exhibit abnormal phenotypes. Accumulation of somatic mutations in key regulatory genes may lead to diseases such as cancer. Thus, the prior art studies somatic mutations to decipher the stage of tumor development and to analyze whether a tumor is carcinoma in situ or metastatic cancer.
Mutation signatures refer to patterns of abnormal somatic mutations in DNA, including single base substitution (SBS), double base substitution (DBS), small insertion and deletion (ID) signatures, and the like. In recent years, mutation signatures have provided a basis for disease risk prediction: whether an individual's cancer is carcinoma in situ or metastatic cancer can be predicted from mutation signatures. In the prior art, prediction models have been established to distinguish carcinoma in situ from metastatic cancer, but the existing prediction models perform poorly on data sets from different sources, and the accuracy of their prediction results is low.
Disclosure of Invention
In order to improve the accuracy of disease risk prediction, the invention provides a disease risk prediction model construction method and an analysis method based on mutation signatures.
In a first aspect, the present invention provides a method for constructing a disease risk prediction model based on mutation signature, the method comprising:
acquiring first training data, wherein the first training data comprises mutation signature data of a plurality of patients, and the mutation signature data comprises at least one mutation signature and mutation signature activity values corresponding to the mutation signatures;
training the first disease risk prediction model according to the first training data to obtain a second disease risk prediction model;
inputting the first training data into a second disease risk prediction model to obtain mutation types corresponding to each patient;
and according to each mutation type, adjusting the second disease risk prediction model to obtain a third disease risk prediction model.
In this method, an initial model is first trained with the first training data to obtain the second disease risk prediction model. The mutation types that the second disease risk prediction model predicts from the first training data are then used to further adjust that model, yielding the final disease risk prediction model, namely the third disease risk prediction model. Because the model is further adjusted according to its own prediction results, its generalization performance is improved, and the third disease risk prediction model predicts disease risk with higher accuracy.
In an alternative embodiment, the second disease risk prediction model is adjusted according to each mutation type to obtain a third disease risk prediction model, including:
calculating the correlation between each mutation signature and each mutation type in the first training data;
screening all mutation signatures in the first training data according to the correlations to obtain second training data, wherein the second training data comprises the mutation signatures after screening and mutation signature activity values corresponding to the mutation signatures after screening in the first training data;
and training the second disease risk prediction model according to the second training data to obtain a third disease risk prediction model.
Through the embodiment, the mutation signature is screened by utilizing the correlation between the mutation signature and the mutation type of the predicted result, the mutation signature with higher correlation with the mutation type of the predicted result is used as the mutation signature after screening, and the second disease risk prediction model is further trained according to the mutation signature after screening and the mutation signature activity value corresponding to the mutation signature, so that the prediction accuracy of the obtained third disease risk prediction model on the disease risk is higher.
In an alternative embodiment, the correlation is represented by mutual information, and each mutation signature in the first training data is screened according to each correlation to obtain second training data, which includes:
deleting mutation signatures of a first preset proportion from the mutation signatures according to mutual information between the mutation signatures and mutation types in the first training data to obtain screened mutation signatures;
and taking the mutation signature after screening and the mutation signature activity value corresponding to the mutation signature after screening in the first training data as the second training data.
Through the embodiment, the correlation between each mutation signature and each mutation type in the first training data is represented by the mutual information, and the mutation signatures with higher correlation with the predicted result are obtained by deleting the mutation signatures with preset proportion according to the mutual information.
In an alternative embodiment, the second disease risk prediction model is a fully connected neural network model comprising a plurality of neurons, and training the second disease risk prediction model according to the second training data to obtain a third disease risk prediction model includes:
acquiring weight values among all neurons in the second disease risk prediction model;
deleting a second preset proportion of the weights according to the weight values to obtain screened weights;
adjusting the second disease risk prediction model according to the screened weights to obtain a fourth disease risk prediction model;
and training the fourth disease risk prediction model according to the second training data to obtain a third disease risk prediction model.
Through the embodiment, the weights of the neurons in the second disease risk prediction model are screened through the weight values among the neurons in the second disease risk prediction model, and the model is reduced by removing unimportant weights, so that the complexity of the model is reduced, the generalization capability of the model is improved, the accuracy of the adjusted disease risk prediction model on the disease risk prediction is higher, and whether a cancer patient is in situ cancer or metastatic cancer can be accurately predicted.
In a second aspect, the present invention also provides a disease risk analysis method based on mutation signature, the method comprising:
acquiring a plurality of first mutation signatures of a patient and mutation signature activity values corresponding to the first mutation signatures;
inputting the mutation signature activity values into a disease risk prediction model to predict the mutation type of the patient, wherein the disease risk prediction model is obtained through the method for constructing the disease risk prediction model based on the mutation signature in the first aspect or any embodiment of the first aspect.
By this method, the disease risk of the patient is predicted by the disease risk prediction model, so that whether the patient has carcinoma in situ or metastatic cancer can be accurately analyzed, providing a basis for the patient's subsequent treatment.
In an alternative embodiment, the method further comprises:
obtaining a plurality of genes for a patient;
and calculating the cumulative contribution abundance value of each gene and each first mutation signature, wherein the cumulative contribution abundance value characterizes the contribution degree of the genes to the first mutation signatures.
In this embodiment, the genes are combined with the mutation signatures: the cumulative contribution abundance value of a gene and a mutation signature represents the degree to which the gene contributes to the mutation signature. Because the pathogenic mechanism of the patient is related to the genes, the relationship among the genes, the mutation signatures and the predicted mutation type can be analyzed more intuitively.
In an alternative embodiment, calculating the cumulative contribution abundance value for each gene and each first mutation signature comprises:
calculating a correlation between the mutation type and each first mutation signature;
screening each first mutation signature according to each correlation to obtain at least one second mutation signature;
the cumulative contribution abundance values between each gene and each second mutation signature are calculated.
Through the embodiment, the correlation between the mutation type and the first mutation signature is utilized to screen the first mutation signature, so that the mutation signature with higher correlation degree with the predicted result is obtained through screening, and the cumulative contribution abundance value of each gene and the mutation signature with higher correlation degree is calculated.
In an alternative embodiment, calculating the cumulative contribution abundance value between the gene and the second mutation signature comprises:
determining a first number of mutations in the gene in somatic cells of the patient;
determining a second number of occurrences of a second mutation signature in somatic cells of the patient;
calculating a first influence factor of the gene according to the first number;
calculating a second influence factor of the second mutation signature according to the second number;
and calculating a cumulative contribution abundance value between the gene and the second mutation signature according to the first influence factor and the second influence factor.
In this embodiment, the first influence factor of the gene on the disease risk is calculated from the number of mutations of the gene, the second influence factor of the mutation signature on the disease risk is calculated from the number of occurrences of the mutation signature, and the cumulative contribution abundance value between the gene and the mutation signature is calculated from the first influence factor and the second influence factor to represent the degree to which the gene contributes to the mutation signature.
In an alternative embodiment, the method further comprises:
determining genes corresponding to the second mutation signatures according to the accumulated contribution abundance values;
and carrying out enrichment analysis on genes corresponding to the second mutation signatures to obtain biological functions corresponding to the genes.
Through the embodiment, the genes corresponding to the second mutation signatures are screened through accumulating the contribution abundance values, and then the genes obtained through screening are subjected to enrichment analysis to obtain the biological functions corresponding to the genes, so that the mutation signatures, the genes and the biological functions are related, and a basis is provided for researching pathological mechanisms.
In an alternative embodiment, determining the genes corresponding to each second mutation signature based on the cumulative contribution abundance values comprises:
for each second mutation signature, determining the gene with the largest cumulative contribution abundance value as the gene corresponding to that second mutation signature.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for constructing a disease risk prediction model based on mutation signatures, according to an exemplary embodiment;
FIG. 2 is a flowchart of a method for disease risk analysis based on mutation signatures, according to an exemplary embodiment;
FIG. 3 is a graph showing the results of enrichment analysis of genes in the groups predicted to be carcinoma in situ and metastatic cancer;
FIG. 4 shows the results of gene enrichment analyses dominated by two different mutation signatures in a group of patients predicted to have metastatic cancer;
FIG. 5 is a schematic structural diagram of a disease risk prediction model construction device based on mutation signature according to an exemplary embodiment;
FIG. 6 is a schematic structural diagram of a disease risk analysis device based on mutation signature according to an exemplary embodiment;
FIG. 7 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
In order to improve the accuracy of disease risk prediction, the invention provides a disease risk prediction model construction method and an analysis method based on mutation signatures.
Large-scale pan-cancer studies have found correlations between certain mutation signatures and somatic mutations. For example, whole genome sequencing (WGS) analysis of the mutation signatures of metastatic solid tumors across cancer types has revealed mutation patterns specific to metastatic tumors. Further analysis of matched carcinoma in situ and metastatic cancer groups has shown how mutation signature profiles shift from primary to metastatic tumors in colorectal cancer and papillary thyroid carcinoma. In addition, the related art distinguishes advanced cancers from early tumors based on SBS mutation signatures. These findings suggest that analysis methods based on mutation signatures have potential as diagnostic tools in cancer management and can provide valuable information about cancer progression.
Fig. 1 is a flowchart of a disease risk prediction model construction method based on mutation signatures according to an exemplary embodiment. As shown in fig. 1, the disease risk prediction model construction method based on the mutation signature includes the following steps S101 to S104.
Step S101: acquire first training data, wherein the first training data comprises mutation signature data of a plurality of patients, and the mutation signature data comprises at least one mutation signature and the mutation signature activity value corresponding to each mutation signature.
In an alternative embodiment, the mutation signatures include, but are not limited to, single base substitution (SBS), double base substitution (DBS), small insertion and deletion (ID) signatures, and the like.
In an alternative embodiment, the mutation signature activity value characterizes the proportion of the patient's somatic mutations accounted for by the corresponding mutation signature.
In an alternative embodiment, the mutation signature activity value may be obtained from somatic mutation data of the patient.
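As a small illustration of an activity value expressed as a ratio, the sketch below assumes that the number of somatic mutations attributed to each signature is already known for a patient. The function name, the signature labels and the counts are hypothetical and are not taken from the embodiment.

```python
def signature_activities(attributed_counts):
    """Convert per-signature mutation counts into activity values (proportions)."""
    total = sum(attributed_counts.values())
    if total == 0:
        # No somatic mutations attributed: every activity value is zero.
        return {name: 0.0 for name in attributed_counts}
    return {name: count / total for name, count in attributed_counts.items()}


# Hypothetical patient: somatic mutations attributed to three signatures.
counts = {"SBS1": 120, "SBS5": 60, "ID2": 20}
activities = signature_activities(counts)
```

Under this reading, the activity values of one patient sum to 1, which keeps patients with very different mutation burdens on a comparable scale.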
Step S102: train the first disease risk prediction model according to the first training data to obtain a second disease risk prediction model.
In an alternative embodiment, the first disease risk prediction model may be a fully connected neural network model. The fully connected neural network model comprises multiple fully connected layers, each of which includes a batch normalization layer and an activation function, and a final normalization function (the softmax function) outputs the predicted mutation type. In the embodiment of the invention, the first disease risk prediction model is implemented with an open-source artificial neural network library (the Keras framework) using an end-to-end open-source machine learning platform (TensorFlow) as the back end.
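To make the layer structure concrete, the following is a minimal forward-pass sketch in plain Python rather than Keras, so that it runs without any dependency. It assumes one hidden layer, ReLU as the activation function (the embodiment does not name one), inference-time batch normalization with unit scale and zero shift, and hand-picked untrained parameters; all of these are illustrative assumptions.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer; weights holds one row per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def batch_norm(x, mean, var, eps=1e-5):
    """Inference-time batch normalization using stored statistics."""
    return [(v - m) / math.sqrt(s + eps) for v, m, s in zip(x, mean, var)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict(activity_values, params):
    """Dense -> batch norm -> ReLU -> dense -> softmax."""
    hidden = dense(activity_values, params["w1"], params["b1"])
    hidden = relu(batch_norm(hidden, params["mean"], params["var"]))
    return softmax(dense(hidden, params["w2"], params["b2"]))


# Hypothetical untrained parameters: 3 activity values -> 2 hidden -> 2 classes.
params = {
    "w1": [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
    "b1": [0.0, 0.1],
    "mean": [0.0, 0.0],
    "var": [1.0, 1.0],
    "w2": [[1.0, -1.0], [-1.0, 1.0]],
    "b2": [0.0, 0.0],
}
# probs[0] and probs[1] stand for the two mutation types (order is arbitrary here).
probs = predict([0.6, 0.3, 0.1], params)
```

The softmax output is a probability for each mutation type, and the predicted type is the one with the larger probability.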
In an alternative embodiment, the parameters of the first disease risk prediction model may be optimized by a Bayesian optimization method. The parameters include, but are not limited to, the number of nodes per layer, the learning rate, the weight decay, and the activation function. The Bayesian optimization method determines the configuration parameters of the model by using a probabilistic model.
Step S103: input the first training data into the second disease risk prediction model to obtain the mutation type corresponding to each patient.
In an alternative embodiment, the mutation types include carcinoma in situ and metastatic cancer, where carcinoma in situ indicates a primary tumor and metastatic cancer indicates a metastatic tumor.
Step S104: adjust the second disease risk prediction model according to each mutation type to obtain a third disease risk prediction model.
In an alternative embodiment, the training data of the second disease risk prediction model may be adjusted according to the correlation between the mutation types and the training data, and the second disease risk prediction model may then be further trained to obtain the third disease risk prediction model.
In an alternative embodiment, the second disease risk prediction model may be adjusted according to the correlation between the mutation types and the parameters of the second disease risk prediction model to obtain the third disease risk prediction model.
In this method, an initial model is first trained with the first training data to obtain the second disease risk prediction model. The mutation types that the second disease risk prediction model predicts from the first training data are then used to further adjust that model, yielding the final disease risk prediction model, namely the third disease risk prediction model. The disease risk prediction model is thus further adjusted according to its own prediction results.
In an example, in the above step S101, the first training data is acquired by:
First, high-throughput sequencing data of a patient is acquired. Illustratively, the high-throughput sequencing data may be whole genome sequencing (WGS) or whole exome sequencing (WES) data.
Then, based on the high-throughput sequencing data, the somatic mutations of the patient are obtained. Illustratively, the somatic mutations of a patient can be obtained by aligning the high-throughput sequencing data to the genome with a bioinformatics analysis tool such as the Burrows-Wheeler Aligner (BWA) and then analyzing the aligned data with a bioinformatics suite such as GATK.
Finally, according to the somatic mutations of the patient, a plurality of mutation signatures in the first training data and the mutation signature activity values corresponding to the mutation signatures are obtained by using a non-negative matrix factorization algorithm.
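The factorization step can be sketched as follows. The embodiment only names "a non-negative matrix factorization algorithm", so the choice of Lee-Seung multiplicative updates for the squared-error objective, the matrix layout (mutation channels as rows, patients as columns), the helper names and the toy counts are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((a - b) ** 2 for rv, ra in zip(v, approx) for a, b in zip(rv, ra))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor v (channels x patients) into w (channels x k) and h (k x patients)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def to_activity_values(h):
    """Normalize each patient's column of h so that the activities sum to 1."""
    totals = [sum(col) for col in zip(*h)]
    return [[row[j] / totals[j] for j in range(len(totals))] for row in h]


# Hypothetical mutation counts: 4 mutation channels (rows) x 3 patients (columns).
counts = [[30, 5, 18], [20, 4, 12], [2, 25, 13], [1, 15, 8]]
signatures, exposures = nmf(counts, k=2)
activity = to_activity_values(exposures)
```

Here each column of `signatures` plays the role of one mutation signature, and each column of `activity` holds one patient's activity values for those signatures.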
In an example, in the step S104, the second disease risk prediction model is adjusted to obtain the third disease risk prediction model by:
Step a1: calculate the correlation between each mutation signature and each mutation type in the first training data.
Step a2: screen all mutation signatures in the first training data according to the correlations to obtain second training data, wherein the second training data comprises the mutation signatures after screening and the mutation signature activity values corresponding to the mutation signatures after screening in the first training data.
Step a3: train the second disease risk prediction model according to the second training data to obtain a third disease risk prediction model.
In the embodiment of the invention, the correlation between the mutation signature and the mutation type of the predicted result is utilized to screen the mutation signature, the mutation signature with higher correlation with the mutation type of the predicted result is used as the mutation signature after screening, and the second disease risk prediction model is further trained according to the mutation signature after screening and the mutation signature activity value corresponding to the mutation signature, so that the prediction accuracy of the obtained third disease risk prediction model on the disease risk is higher.
In an alternative embodiment, the correlation is represented by mutual information, and in the step a2, the second training data is obtained by screening each mutation signature in the first training data as follows:
First, a first preset proportion of the mutation signatures is deleted according to the mutual information between each mutation signature and the mutation types in the first training data, to obtain the screened mutation signatures. The first preset proportion can be set according to actual needs. The mutation signatures with the smallest mutual information, up to the first preset proportion, are deleted, so that mutation signatures strongly correlated with the mutation types are retained and mutation signatures uncorrelated with the mutation types are reduced.
Then, the mutation signatures after screening and the mutation signature activity values corresponding to the mutation signatures after screening in the first training data are taken as the second training data.
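The screening described above can be sketched as follows. The embodiment does not say how mutual information is estimated for continuous activity values, so equal-width binning is an assumption here, as are the function names, the signature labels and the toy data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def discretize(values, bins=3):
    """Equal-width binning of continuous activity values (an assumed estimator)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant feature: everything falls in bin 0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def screen_signatures(activities, predicted_types, drop_fraction):
    """Delete the given fraction of signatures with the lowest mutual information."""
    scores = {name: mutual_information(discretize(vals), predicted_types)
              for name, vals in activities.items()}
    n_drop = int(len(scores) * drop_fraction)
    ranked = sorted(scores, key=scores.get)  # ascending mutual information
    kept = set(ranked[n_drop:])
    return {name: vals for name, vals in activities.items() if name in kept}, scores


# Hypothetical activity values for six patients and their predicted mutation types.
predicted = ["in_situ"] * 3 + ["metastatic"] * 3
activities = {
    "sig_a": [0.10, 0.12, 0.08, 0.60, 0.65, 0.70],
    "sig_b": [0.30, 0.30, 0.30, 0.30, 0.30, 0.30],
    "sig_c": [0.05, 0.40, 0.06, 0.41, 0.07, 0.39],
    "sig_d": [0.20, 0.50, 0.20, 0.50, 0.20, 0.20],
}
second_training_data, scores = screen_signatures(activities, predicted, drop_fraction=0.5)
```

With a first preset proportion of 0.5, the two signatures carrying no information about the predicted type are deleted, and the retained signatures and their activity values form the second training data.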
In one example, the second disease risk prediction model is a fully connected neural network model (Deep Neural Networks, DNN) comprising a plurality of neurons, and in step a3, the second disease risk prediction model is trained by:
First, the weight values between the neurons in the second disease risk prediction model are obtained.
Second, a second preset proportion of the weights is deleted according to the weight values to obtain the screened weights. The second preset proportion may be set according to actual needs and is not particularly limited herein. In the embodiment of the invention, the weights with the smallest weight values, up to the second preset proportion, can be deleted so as to reduce the size and complexity of the model.
Third, the second disease risk prediction model is adjusted according to the screened weights to obtain a fourth disease risk prediction model.
Finally, the fourth disease risk prediction model is trained according to the second training data to obtain the third disease risk prediction model.
In the embodiment of the invention, the weights of the neurons in the second disease risk prediction model are screened through the weight values among the neurons in the second disease risk prediction model, and the complexity of the model is reduced by removing unimportant weights, so that the generalization capability of the model is improved, and the accuracy of the adjusted disease risk prediction model for the disease risk prediction of different patients is higher.
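The weight-deletion step can be sketched as magnitude-based pruning. Reading "smaller weight value" as smaller absolute value, pruning across all layers at once, and representing a deleted weight as a zero with a keep/delete mask are assumptions of this sketch; the function name and the toy weights are hypothetical.

```python
def prune_weights(layers, prune_fraction):
    """Zero the given fraction of weights with the smallest absolute value.

    layers is a list of weight matrices (lists of rows). Returns the pruned
    layers and matching masks (1 = kept, 0 = deleted). Weights tied with the
    threshold are all deleted, so slightly more than the fraction may go.
    """
    magnitudes = sorted(abs(w) for layer in layers for row in layer for w in row)
    n_prune = int(len(magnitudes) * prune_fraction)
    threshold = magnitudes[n_prune - 1] if n_prune > 0 else -1.0
    masks = [[[0 if abs(w) <= threshold else 1 for w in row] for row in layer]
             for layer in layers]
    pruned = [[[w if m else 0.0 for w, m in zip(row, mask_row)]
               for row, mask_row in zip(layer, mask_layer)]
              for layer, mask_layer in zip(layers, masks)]
    return pruned, masks


# Hypothetical weights of a small two-layer network (10 weights in total).
layers = [
    [[0.5, -0.02, 0.1], [-0.3, 0.8, 0.01]],
    [[1.0, -0.05], [-0.9, 0.4]],
]
pruned_layers, masks = prune_weights(layers, prune_fraction=0.3)
```

In this sketch the pruned weights would initialize the fourth model, and one way to keep the deleted connections at zero during the subsequent training on the second training data is to reapply the masks after every update.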
Fig. 2 is a flowchart of a disease risk analysis method based on mutation signatures according to an exemplary embodiment. The method comprises the following steps S201 to S202:
Step S201: a plurality of first mutation signatures of the patient and mutation signature activity values corresponding to the first mutation signatures are obtained.
Step S202: the mutation signature activity values are input into a disease risk prediction model to predict the mutation type of the patient, where the disease risk prediction model is obtained by the mutation signature-based disease risk prediction model construction method of the above embodiment.
By this method, the disease risk of the patient is predicted with the disease risk prediction model, whether the patient has carcinoma in situ or metastatic cancer can be accurately analyzed, and a basis is provided for the next step of treatment.
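Steps S201 to S202 amount to a forward pass through the trained network. The sketch below assumes a single hidden layer with ReLU activation, a sigmoid output, a 0.5 decision threshold and the labels "metastatic" and "in_situ"; none of these details, nor the weights in the hypothetical `model` dictionary, are given in the source.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def predict_mutation_type(activity_values, model, threshold=0.5):
    """Return (label, probability) for one patient's mutation signature activity values."""
    hidden = relu(dense(activity_values, model["w1"], model["b1"]))
    logit = dense(hidden, model["w2"], model["b2"])[0]
    prob = 1.0 / (1.0 + math.exp(-logit))
    return ("metastatic" if prob >= threshold else "in_situ"), prob


# Hypothetical parameters for two input signatures and two hidden neurons.
model = {
    "w1": [[1.0, -1.0], [0.5, 0.5]],
    "b1": [0.0, 0.0],
    "w2": [[2.0, -1.0]],
    "b2": [0.0],
}
label_a, prob_a = predict_mutation_type([0.9, 0.1], model)
label_b, prob_b = predict_mutation_type([0.1, 0.9], model)
```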
In an example, the disease risk analysis method based on mutation signature provided by the embodiment of the invention further includes the following contents:
first, a plurality of genes of a patient are acquired.
Then, a cumulative contribution abundance value of each gene and each first mutation signature is calculated, the cumulative contribution abundance value characterizing the contribution of the gene to the first mutation signature.
In the embodiment of the invention, the gene is combined with the mutation signature, and the degree of contribution of the gene to the mutation signature is represented by the cumulative contribution abundance value of the gene and the mutation signature. Because the pathogenic mechanism of a patient is related to the genes, the relevance among the gene, the mutation signature and the predicted mutation type can be analyzed more intuitively.
Before the step of calculating the cumulative contribution abundance value of each gene and each first mutation signature, the first mutation signatures are screened according to mutation types of patients, and the first mutation signatures with high correlation with the mutation types are selected to calculate the cumulative contribution abundance value. In the embodiment of the invention, the calculation of the cumulative contribution abundance value of each gene and each first mutation signature is realized through the following steps:
step b1: a correlation between the mutation type and each of the first mutation signatures is calculated.
In an alternative embodiment, the correlation between the mutation type and the first mutation signature may be obtained with the SHAP data analysis tool. The tool constructs an additive explanation model in which all mutation signatures are regarded as contributors. For a given patient, the tool yields the correlation of each first mutation signature with the mutation type, that is, the degree of influence of the first mutation signature on the mutation type: the larger the correlation, the greater the influence of the first mutation signature on the mutation type.
Step b2: screening each first mutation signature according to each correlation to obtain at least one second mutation signature; illustratively, first mutation signatures whose correlation is less than a preset threshold are deleted, and first mutation signatures whose correlation is greater than the preset threshold are retained.
Step b3: the cumulative contribution abundance values between each gene and each second mutation signature are calculated.
In the embodiment of the invention, the correlation between the mutation type and the first mutation signature is utilized to screen the first mutation signature, so that the mutation signature with higher correlation degree with the predicted result is obtained through screening, and then the cumulative contribution abundance value of each gene and the mutation signature with higher correlation degree is calculated.
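Steps b1 and b2 can be illustrated with a small sketch. Instead of calling the SHAP library, it computes exact Shapley values by enumerating feature subsets, which is only feasible for a handful of signatures. Replacing absent features with a baseline value, comparing the absolute attribution with the threshold, the function names and the toy linear model are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`; absent features take `baseline` values."""
    n = len(x)
    features = range(n)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for k in range(n):  # size of the coalition that excludes feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def screen_signatures(names, phi, threshold):
    """Step b2: keep signatures whose absolute attribution exceeds the threshold."""
    return [name for name, p in zip(names, phi) if abs(p) > threshold]


# Hypothetical model and patient: a linear score over three signature activities.
def toy_model(z):
    return 2.0 * z[0] + 0.1 * z[1] - 1.5 * z[2]


names = ["SBS1", "SBS5", "SBS8"]
patient = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, patient, baseline)
second_signatures = screen_signatures(names, phi, threshold=0.5)
```

For a real DNN with dozens of signatures, an approximate explainer would be needed, since exact enumeration grows exponentially with the number of features.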
In one example, in step b3 above, the cumulative contribution abundance value between the gene and the second mutation signature is calculated by:
First, a first number of times that the gene is mutated in the somatic cells of the patient is determined.
Second, a second number of times that the second mutation signature occurs in the somatic cells of the patient is determined.
Next, a first influence factor of the gene is calculated based on the first number of times. In an embodiment of the present invention, the ratio of the first number of times to the total number of mutations occurring in all genes in the somatic cells of the patient is used as the first influence factor of the gene.
Then, a second influence factor of the second mutation signature is calculated based on the second number of times. In an embodiment of the present invention, the ratio of the second number of times to the total number of occurrences of all mutation signatures in the somatic cells of the patient is used as the second influence factor of the second mutation signature.
Finally, the cumulative contribution abundance value between the gene and the second mutation signature is calculated based on the first influence factor and the second influence factor. In an embodiment of the invention, the product of the first influence factor and the second influence factor is taken as the cumulative contribution abundance value between the gene and the second mutation signature.
By this embodiment, the first influence factor of the gene on the disease risk is calculated according to the number of mutations of the gene, the second influence factor of the mutation signature on the disease risk is calculated according to the number of occurrences of the mutation signature, and the cumulative contribution abundance value between the gene and the mutation signature is calculated from the first influence factor and the second influence factor to represent the degree of contribution of the gene to the mutation signature.
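The calculation above reduces to two ratios and a product per gene and signature pair. A minimal sketch follows; the function name, the dictionary layout, the gene names and the counts are hypothetical and serve only to make the arithmetic concrete.

```python
def cumulative_contribution_abundance(gene_counts, signature_counts):
    """Cumulative contribution abundance value for every (gene, signature) pair.

    gene_counts: gene -> number of mutations of that gene in the patient's somatic cells
    signature_counts: signature -> number of occurrences of that signature
    """
    total_gene = sum(gene_counts.values())
    total_sig = sum(signature_counts.values())
    if total_gene == 0 or total_sig == 0:
        raise ValueError("counts must contain at least one mutation and one occurrence")
    return {
        (gene, sig): (g / total_gene) * (s / total_sig)  # first factor * second factor
        for gene, g in gene_counts.items()
        for sig, s in signature_counts.items()
    }


# Hypothetical counts for one patient.
gene_counts = {"TP53": 6, "PIK3CA": 3, "BRCA2": 1}
signature_counts = {"SBS8": 30, "ID1": 10}
ccav = cumulative_contribution_abundance(gene_counts, signature_counts)
```

Here TP53 has a first influence factor of 6/10 and SBS8 a second influence factor of 30/40, so their cumulative contribution abundance value is about 0.45; the values over all pairs sum to 1.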
In an example, the disease risk analysis method based on mutation signature provided by the embodiment of the invention further includes the following contents:
First, the genes corresponding to the second mutation signatures are determined based on the cumulative contribution abundance values. A greater cumulative contribution abundance value indicates a greater degree of contribution of the gene to the mutation signature. Therefore, in the embodiment of the present invention, the gene with the greatest cumulative contribution abundance value is determined as the gene corresponding to each second mutation signature.
And then, carrying out enrichment analysis on genes corresponding to the second mutation signatures to obtain biological functions corresponding to the genes.
In the embodiment of the invention, the genes corresponding to the second mutation signature are screened through accumulating contribution abundance values, and then enrichment analysis is carried out on the screened genes to obtain the biological functions corresponding to the genes, so that the mutation signatures, the genes and the biological functions are related, and a basis is provided for researching pathological mechanisms.
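The two steps above can be sketched as an argmax per signature followed by an over-representation test. The source does not specify the enrichment statistic, so the one-sided hypergeometric test used here is an assumption, as are the function names, the gene identifiers and all values.

```python
from math import comb


def top_gene_per_signature(ccav):
    """For each signature, pick the gene with the greatest cumulative contribution abundance."""
    best = {}
    for (gene, sig), value in ccav.items():
        if sig not in best or value > best[sig][1]:
            best[sig] = (gene, value)
    return {sig: gene for sig, (gene, _) in best.items()}


def enrichment_p_value(study_genes, term_genes, background_genes):
    """One-sided hypergeometric p-value P(X >= k) for over-representation of a term."""
    N = len(background_genes)
    K = len(term_genes & background_genes)               # background genes annotated to the term
    n = len(study_genes & background_genes)              # selected genes
    k = len(study_genes & term_genes & background_genes) # selected genes annotated to the term
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


# Hypothetical per-pair values (not derived from real data).
ccav = {
    ("TP53", "SBS8"): 0.45,
    ("PIK3CA", "SBS8"): 0.225,
    ("TP53", "ID1"): 0.15,
    ("BRCA2", "ID1"): 0.20,
}
assigned = top_gene_per_signature(ccav)

# Hypothetical gene sets: 20 background genes, 5 annotated to one molecular function.
background = {f"g{i}" for i in range(20)}
term = {"g0", "g1", "g2", "g3", "g4"}
study = {"g0", "g1", "g2", "g10"}
p_value = enrichment_p_value(study, term, background)
```

In practice the study set would collect the assigned genes across the patients of one group, and the p-values would be corrected for the number of terms tested.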
In one example, enrichment analysis can be performed on the genes of multiple patients predicted to have the same mutation type, so as to compare the differences between the biological functions of different patients.
In one example, enrichment analysis can be performed on the genes of multiple patients corresponding to the same mutation signature, so as to compare the differences between the biological functions of different mutation types.
In the embodiment of the invention, 986 breast cancer patient samples were first collected for training, validation and testing of the disease risk prediction model, and a further 603 breast cancer patient samples were collected for external testing of the disease risk prediction model. The training data, validation data and test data each comprise a plurality of mutation signatures and the mutation signature activity values corresponding to the mutation signatures. First, a preset number of mutation signatures in the training data were taken as the input of a first disease risk prediction model (initial model), and a DNN model was constructed as the second disease risk prediction model by using the training data and a Bayesian optimization method. Then, the input mutation signatures and the intermediate-layer neuron weights of the second disease risk prediction model were pruned to obtain a pruned sparse model, namely the third disease risk prediction model. The effect of the third disease risk prediction model was compared with that of the first disease risk prediction model (initial model) by using the validation data and the test data, and the comparison result is shown in Table 1. The effects of the two models were also compared by using the external test data, and the comparison result is shown in Table 2. It can be seen from Tables 1 and 2 that the effect of the initial model on the external test data drops considerably relative to its effect on the validation data and test data, whereas the effect of the third disease risk prediction model obtained by the embodiment of the invention on the external test data does not drop significantly relative to its effect on the validation data and test data.
Therefore, the generalization performance of the third disease risk prediction model after intermediate layer weight screening and mutation signature screening is superior to that of the initial model.
Table 1: Model effect comparison using validation data and test data
Table 2: Model effect comparison using external test data
Interpretability analysis is performed on the third disease risk prediction model by using SHAP analysis software to obtain the mutation signature with the largest correlation with the mutation type in the test data. The test data are grouped by predicted carcinoma in situ or metastatic cancer and analyzed by using the gene cumulative contribution abundance value, to obtain the genes corresponding to the mutation signature with the largest mutation type correlation in each of the two groups of data, and the genes corresponding to the mutation signature with the largest mutation type correlation in each piece of test data.
FIG. 3 is a graph showing the results of an enrichment analysis of genes corresponding to mutation signatures of two groups of patients predicted to have carcinoma in situ and metastatic cancer. The left side of FIG. 3 shows the enrichment results (molecular functions) for the carcinoma in situ samples, involving molecular functions including gamma-catenin binding, 1-phosphatidylinositol-3-kinase activity, histone deacetylase modulator activity, nitric oxide synthase modulator activity, and MutLalpha complex binding; the right side shows the enrichment results (molecular functions) for the metastatic cancer samples, involving molecular functions including platelet-derived growth factor binding, HMG box domain binding, opioid receptor binding, intron transcriptional regulatory sequence specific DNA binding, and nitric oxide synthase modulator activity. The value on the abscissa characterizes the degree of enrichment, a larger value indicating a higher degree of enrichment.
FIG. 3 shows the results of gene enrichment analysis on the test data for a group of patients predicted to have metastatic cancer and a group of patients predicted to have carcinoma in situ. The gene enrichment results (p < 0.05) of the two groups of patients show significant differences: the top ten enrichment results are almost completely different, and there are significant differences in the biological mechanism pathways affecting the disease. Taking the most significantly enriched pathways as an example: the genes of patients with carcinoma in situ are more enriched in gamma-catenin binding functions, while the genes of patients with metastatic cancer are more enriched in platelet-derived growth factor (PDGF) binding functions. It has been reported that overexpression or mutation of PDGF receptor signaling may drive excessive proliferation of tumor cells; PDGF binds to the PDGF receptors of peripheral vascular cells, fibroblasts and myofibroblasts in the stroma of solid tumors, and induces these cells to promote tumorigenesis. PDGF-D is a member of the PDGF growth factor family, is highly expressed in human breast cancer, regulates cell proliferation, migration and survival by binding to the PDGF receptor, and can promote tumor growth and lymph node metastasis. Inhibition of PDGF receptor signaling has been shown to be useful in treating tumor patients. Imatinib is a multi-target small-molecule tyrosine kinase inhibitor targeting the PDGF/PDGFR-beta signaling pathway. Its main indications are leukemia, gastrointestinal stromal tumor and melanoma. However, studies have reported that Imatinib impedes the proliferation and invasiveness of human epithelial breast cancer cells with different invasion potentials. It can inhibit the expression of receptor tyrosine kinases, especially the PDGFR-beta pathway, in breast cancer cell lines, and inhibits the proliferation of breast cancer cell lines through apoptosis induction associated with G2/M-phase cell cycle arrest.
In addition, it has negligible effect on breast non-tumor cell lines. This makes it possible to produce a therapeutic effect while having a reduced possibility of adverse reactions.
FIG. 4 shows the results of gene enrichment analyses for two gene sets dominated by different mutation signatures within the group of patients predicted to have metastatic cancer. On the right is the COSMIC mutation signature SBS8 and on the left is the COSMIC mutation signature ID1.
As shown in fig. 4, the enrichment of molecular functions of genes associated with the ID1 mutation signature is mainly focused on apoptosis-related functions. Dysregulation of apoptosis is a hallmark of cancer, and many cancer cells have developed mechanisms that evade apoptosis, enabling them to survive and proliferate uncontrollably. The two functions with the highest enrichment significance are focused on BH3 domain binding. The BH domain is a conserved sequence motif in Bcl-2 family proteins. The Bcl-2 family includes pro-apoptotic and anti-apoptotic members, and the balance between these two classes of proteins determines whether cells undergo apoptosis. There are four types of BH domain: BH1, BH2, BH3 and BH4. All four are conserved sequence motifs in Bcl-2 family proteins that play a key role in regulating apoptosis, and they differ mainly in amino acid sequence and structure. BH1 and BH2 domains are found primarily in anti-apoptotic members of the Bcl-2 family, where they promote protein dimerization by forming hydrophobic helices, thereby inhibiting apoptosis. The BH3 domain is found primarily in pro-apoptotic members of the Bcl-2 family. The BH3 domain can bind to the BH1, BH2 and BH3 domains of other Bcl-2 family proteins, thereby inducing apoptosis. The BH4 domain is found primarily in anti-apoptotic members of the Bcl-2 family, which are capable of binding to BH3 domains, thereby inhibiting apoptosis. Targeting the BH3 domain proteins of the Bcl-2 family has become a promising cancer treatment strategy because it can restore the normal balance between pro-apoptotic and anti-apoptotic signals and induce apoptosis in cancer cells.
Death domain binding is also a mechanism associated with apoptosis. The death domain (DD) is a protein interaction module consisting of six alpha helices. The DD is a subclass of the death fold protein motif, and is related in sequence and structure to the death effector domain (DED) and the caspase recruitment domain (CARD), which function in similar pathways and exhibit similar interaction characteristics. DDs may bind to each other to form oligomers. Mammals possess many different classes of DD-containing proteins. Because the death domain protein superfamily is large, some death domain proteins may play a role in cancer and many infections through several DD protein families and specific genetic alterations whose downstream function is to induce apoptosis. Many of these alterations occur in genes encoding mediators of apoptosis or necrosis, potentially leading to enhanced resistance to cell death, an important feature of cancer.
Protein phosphatase 2A (PP2A) binding: PP2A is a serine/threonine phosphatase involved in regulating transcription, translation, metabolism, apoptosis, etc. PP2A is overexpressed in many cancers, and its reactivation can induce apoptosis in cancer cells. PP2A can regulate apoptosis through a variety of mechanisms. For example, PP2A can dephosphorylate anti-apoptotic proteins (such as Bcl-2), which makes them less effective in preventing apoptosis. In addition to directly regulating apoptosis, PP2A can also indirectly regulate apoptosis by affecting cell cycle regulation, signal transduction, and other cellular processes involved in apoptosis. PP2A mutations can lead to apoptosis defects, increasing the risk of cancer. For example, mutations in the C subunit of PP2A can lead to the development of cancer by making the cells more resistant to apoptosis.
Ion channel inhibitor activity: ion channels play a role in a variety of cellular processes, including cell signaling, cell proliferation and apoptosis, and a variety of ion channel inhibitors, including calcium, potassium and chloride channels, can induce apoptosis by a variety of different mechanisms, e.g., T-type calcium channel inhibitors can induce apoptosis in medulloblastoma cells and are associated with altered metabolic activity. In addition, ion channel inhibitors can inhibit cell proliferation through a variety of mechanisms. One mechanism is by blocking ion channels, resulting in a decrease in intracellular calcium ion levels. The decrease in intracellular calcium ion levels results in inhibition of cell proliferation signals, and thus inhibition of cell proliferation.
SBS8 mutation signature-related gene enrichment is primarily focused on cell proliferation and DNA damage related functions.
Glucuronosyl-N-acetylglucosaminyl-proteoglycan 4-alpha-N-acetylglucosaminyltransferase activity: this enzyme is a glycosyltransferase that catalyzes a chemical reaction that produces uridine diphosphate (UDP). UDP may be involved in the synthesis of glycoproteins and glycolipids, and may ultimately produce UTP via a degradation pathway for protein glycosylation and glycogen synthesis. The pyrimidine nucleotides to which UTP belongs are essential for cell growth and proliferation, and cancer cell growth can be reduced by regulating the level of UTP.
The cAMP response element binding protein (CREB) is a transcription factor that plays an important role in cancer and is associated with overall survival and therapeutic response in tumor patients. cAMP regulates the transcription of various target genes through protein kinase A (PKA) and its downstream effectors such as CREB. In addition, PKA can phosphorylate many kinases such as Raf, GSK3 and FAK. Aberrant cAMP-PKA signaling is involved in various types of human tumors. The cAMP signal may have tumor-inhibiting or tumor-promoting effects, depending on tumor type and background. The cAMP-PKA signal may regulate the growth, migration, invasion and metabolism of cancer cells.
Both damaged DNA binding (DDB) and protein N-terminal binding (PNB) are associated with DNA damage. Damaged DNA binding is an important step in DNA damage repair. DDB proteins can recognize and bind damaged DNA and promote DNA repair in a variety of ways. Protein N-terminal binding (PNB) plays an important role in various biological processes such as cell signaling, apoptosis and gene expression. The N-terminus of a protein can bind to DNA, which is an important mechanism for regulating gene expression. The N-terminus of a protein may bind to a specific sequence of DNA, thereby preventing or initiating transcription of the gene. The N-terminus of the protein may also bind to specific domains of DNA, thereby regulating the damage repair of DNA.
From figs. 3 and 4, it can be seen that the pathogenesis inside the somatic cells changes greatly after the cancer has metastasized. Even among patients with metastatic breast cancer, the mechanisms that initiate metastasis may be quite different. By analyzing the whole genome sequencing (WGS) data of the patient, the disease risk prediction model construction method provided by the embodiment of the invention can accurately predict the current disease stage of the patient and identify factors that may increase the patient's metastasis risk, such as biological pathways/molecular functions and genes with a larger influence. This can help doctors evaluate drugs or treatment methods targeting those pathways, better understand the patient's response to different treatment schemes, and provide more personalized and more effective treatment schemes for the patient.
In addition, the same mutation signature data can be grouped into a group, and the gene set corresponding to the same mutation signature data can be subjected to enrichment analysis.
Based on the same inventive concept, the embodiment of the invention further provides a disease risk prediction model construction device based on mutation signature, as shown in fig. 5, the device comprises:
the first obtaining module 501 is configured to obtain first training data, where the first training data includes mutation signature data of a plurality of patients, and the mutation signature data includes at least one mutation signature and mutation signature activity values corresponding to the mutation signatures; the details are described in step S101 in the above embodiments, and are not described herein.
The training module 502 is configured to train the first disease risk prediction model according to the first training data to obtain a second disease risk prediction model; the details refer to the description of step S102 in the above embodiment, and are not repeated here.
A first prediction module 503, configured to input first training data into a second disease risk prediction model, and obtain mutation types corresponding to each patient; the details are described in step S103 in the above embodiments, and are not described herein.
And an adjustment module 504, configured to adjust the second disease risk prediction model according to each mutation type, so as to obtain a third disease risk prediction model. The details are referred to the description of step S104 in the above embodiment, and will not be repeated here.
In an example, the adjustment module 504 includes:
the first computing sub-module is used for computing the correlation between each mutation signature and each mutation type in the first training data; the details are described in the above embodiments, and are not repeated here.
The first screening submodule is used for screening each mutation signature in the first training data according to each correlation to obtain second training data, wherein the second training data comprises the mutation signature after screening and mutation signature activity values corresponding to the mutation signature after screening in the first training data; the details are described in the above embodiments, and are not repeated here.
And the training sub-module is used for training the second disease risk prediction model according to the second training data to obtain a third disease risk prediction model. The details are described in the above embodiments, and are not repeated here.
In an example, the correlation is characterized by mutual information, and the first screening submodule includes:
the first screening unit is used for deleting mutation signatures with a first preset proportion from the mutation signatures according to mutual information between the mutation signatures and the mutation types in the first training data to obtain screened mutation signatures; the details are described in the above embodiments, and are not repeated here.
The first determining unit is used for taking the mutation signature after screening and the mutation signature activity value corresponding to the mutation signature after screening in the first training data as the second training data. The details are described in the above embodiments, and are not repeated here.
In one example, the second disease risk prediction model is a fully connected neural network model including a plurality of neurons therein, and the training submodule includes:
the acquisition unit is used for acquiring weight values among all neurons in the second disease risk prediction model; the details are described in the above embodiments, and are not repeated here.
The second screening unit is used for deleting the weights of the second preset proportion according to the weight values to obtain the screened weights; the details are described in the above embodiments, and are not repeated here.
The adjusting unit is used for adjusting the second disease risk prediction model according to the weights to obtain a fourth disease risk prediction model; the details are described in the above embodiments, and are not repeated here.
And the training unit is used for training the fourth disease risk prediction model according to the second training data to obtain a third disease risk prediction model. The details are described in the above embodiments, and are not repeated here.
Based on the same inventive concept, the embodiment of the present invention further provides a disease risk analysis device based on mutation signature, as shown in fig. 6, the device includes:
a second obtaining module 601, configured to obtain a plurality of first mutation signatures of a patient, and mutation signature activity values corresponding to the first mutation signatures; the details are described in step S201 in the above embodiments, and are not described herein.
The second prediction module 602 is configured to input each mutation signature activity value into a disease risk prediction model, and predict a mutation type of a patient, where the disease risk prediction model is obtained by using the mutation signature-based disease risk prediction model construction method provided in the foregoing embodiment. The details of step S202 in the above embodiments are described in the foregoing, and are not described in detail herein.
In an example, the apparatus further comprises:
a third acquisition module for acquiring a plurality of genes of the patient; the details are described in the above embodiments, and are not repeated here.
The calculation module is used for calculating the cumulative contribution abundance value of each gene and each first mutation signature, and the cumulative contribution abundance value characterizes the contribution degree of the genes to the first mutation signatures. The details are described in the above embodiments, and are not repeated here.
In an example, the computing module includes:
a second calculation sub-module for calculating the correlation between the mutation type and each first mutation signature; the details are described in the above embodiments, and are not repeated here.
The second screening submodule is used for screening each first mutation signature according to each correlation to obtain at least one second mutation signature; the details are described in the above embodiments, and are not repeated here.
And a third calculation sub-module for calculating cumulative contribution abundance values between each gene and each second mutation signature. The details are described in the above embodiments, and are not repeated here.
In an example, the third computing sub-module includes:
a second determining unit for determining a first number of times that the gene has been mutated in somatic cells of the patient; the details are described in the above embodiments, and are not repeated here.
A third determining unit for determining a second number of occurrences of the second mutation signature in somatic cells of the patient; the details are described in the above embodiments, and are not repeated here.
A first calculation unit for calculating a first influence factor of the gene based on the first number of times; the details are described in the above embodiments, and are not repeated here.
A second calculation unit for calculating a second influence factor of a second mutation signature according to the second number of times; the details are described in the above embodiments, and are not repeated here.
A third calculation unit for calculating a cumulative contribution abundance value between the gene and the second mutation signature based on the first influence factor and the second influence factor. The details are described in the above embodiments, and are not repeated here.
In an example, the apparatus further comprises:
the determining module is used for determining genes corresponding to the second mutation signatures according to the accumulated contribution abundance values; the details are described in the above embodiments, and are not repeated here.
And the analysis module is used for carrying out enrichment analysis on the genes corresponding to the second mutation signatures to obtain the biological functions corresponding to the genes. The details are described in the above embodiments, and are not repeated here.
For the specific limitations and beneficial effects of the devices, reference may be made to the limitations of the mutation signature-based disease risk prediction model construction method and the mutation signature-based disease risk analysis method above, which are not repeated here. The various modules described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the above modules.
Fig. 7 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment. As shown in fig. 7, the device includes one or more processors 710 and a memory 720, the memory 720 including persistent memory, volatile memory and a hard disk, one processor 710 being illustrated in fig. 7. The apparatus may further include: an input device 730 and an output device 740.
Processor 710, memory 720, input device 730, and output device 740 may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 7.
The processor 710 may be a central processing unit (Central Processing Unit, CPU). The processor 710 may also be a chip such as other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or a combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 720, which is a non-transitory computer readable storage medium, includes a persistent memory, a volatile memory, and a hard disk, may be used to store a non-transitory software program, a non-transitory computer executable program, and a module, such as program instructions/modules corresponding to a disease risk prediction model construction method based on a mutation signature and a disease risk analysis method based on a mutation signature in the embodiments of the present application. Processor 710 executes various functional applications and data processing of the server by running non-transitory software programs, instructions, and modules stored in memory 720, that is, implementing any of the above-described mutation signature-based disease risk prediction model construction methods, mutation signature-based disease risk analysis methods.
Memory 720 may include a program storage area and a data storage area. The program storage area may store an operating system and at least one application program required for a function; the data storage area may store data used as needed. In addition, memory 720 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 720 may optionally include memory located remotely from processor 710, which may be connected to the data processing apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numeric or character information and generate signal inputs related to user settings and function control. The output device 740 may include a display device such as a display screen.
One or more modules are stored in memory 720 that, when executed by one or more processors 710, perform the method as shown in fig. 1.
The above product can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to executing the method. Technical details not described in this embodiment can be found in the embodiment shown in fig. 1.
The present invention also provides a non-transitory computer storage medium storing computer executable instructions that can perform the method of any of the above-described method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid state drive (Solid-State Drive, SSD); the storage medium may also comprise a combination of the above kinds of memory.
It should be noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a description of exemplary embodiments of the present invention, provided to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A disease risk prediction model construction method based on mutation signatures, the method comprising:
acquiring first training data, wherein the first training data comprise mutation signature data of a plurality of patients, the mutation signature data comprise at least one mutation signature and mutation signature activity values corresponding to the mutation signatures, and the mutation signature activity values represent the proportion of the mutation signatures corresponding to the mutation signature activity values in somatic mutation of the patients;
training a first disease risk prediction model according to the first training data to obtain a second disease risk prediction model;
inputting the first training data into the second disease risk prediction model to obtain mutation types corresponding to the patients;
adjusting the second disease risk prediction model according to each mutation type to obtain a third disease risk prediction model;
wherein adjusting the second disease risk prediction model according to each mutation type to obtain the third disease risk prediction model comprises:
calculating the correlation between each mutation signature and each mutation type in the first training data;
screening each mutation signature in the first training data according to each correlation to obtain second training data, wherein the second training data comprises the screened mutation signatures and the mutation signature activity values corresponding to the screened mutation signatures in the first training data;
training the second disease risk prediction model according to the second training data to obtain the third disease risk prediction model;
wherein the correlation is represented by mutual information, and screening each mutation signature in the first training data according to each correlation to obtain the second training data comprises:
deleting a first preset proportion of the mutation signatures according to the mutual information between each mutation signature and each mutation type in the first training data, to obtain the screened mutation signatures;
taking the screened mutation signatures and the mutation signature activity values corresponding to the screened mutation signatures in the first training data as the second training data;
wherein the second disease risk prediction model is a fully connected neural network model comprising a plurality of neurons, and training the second disease risk prediction model according to the second training data to obtain the third disease risk prediction model comprises:
acquiring the weight values between the neurons in the second disease risk prediction model;
deleting a second preset proportion of the weights according to the weight values to obtain screened weights;
adjusting the second disease risk prediction model according to the screened weights to obtain a fourth disease risk prediction model;
and training the fourth disease risk prediction model according to the second training data to obtain the third disease risk prediction model.
2. A method of disease risk analysis based on mutation signatures, the method comprising:
acquiring a plurality of first mutation signatures of a patient and mutation signature activity values corresponding to the first mutation signatures;
inputting each mutation signature activity value into a disease risk prediction model for predicting the mutation type of the patient, wherein the disease risk prediction model is obtained by the disease risk prediction model construction method based on mutation signatures as set forth in claim 1.
3. The method according to claim 2, wherein the method further comprises:
obtaining a plurality of genes of the patient;
calculating a cumulative contribution abundance value for each of the genes and each of the first mutation signatures, the cumulative contribution abundance value characterizing a degree of contribution of the genes to the first mutation signatures.
4. The method of claim 3, wherein calculating a cumulative contribution abundance value for each of the genes and each of the first mutation signatures comprises:
calculating a correlation between the mutation type and each of the first mutation signatures;
screening each first mutation signature according to each correlation to obtain at least one second mutation signature;
and calculating a cumulative contribution abundance value between each of the genes and each of the second mutation signatures.
5. The method of claim 4, wherein calculating a cumulative contribution abundance value between the gene and the second mutation signature comprises:
determining a first number of mutations in somatic cells of the patient for the gene;
determining a second number of occurrences of the second mutation signature in somatic cells of the patient;
calculating a first influence factor of the gene according to the first number;
calculating a second influence factor of the second mutation signature according to the second number;
calculating a cumulative contribution abundance value between the gene and the second mutation signature based on the first and second influence factors.
6. The method of claim 5, wherein the method further comprises:
determining genes corresponding to the second mutation signatures according to the cumulative contribution abundance values;
and carrying out enrichment analysis on the genes corresponding to the second mutation signatures to obtain biological functions corresponding to the genes.
7. The method of claim 6, wherein determining the gene for each of the second mutation signatures based on the cumulative contribution abundance values comprises:
determining, for each second mutation signature, the gene with the largest cumulative contribution abundance value as the gene corresponding to that second mutation signature.
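As a non-authoritative illustration of the screening and pruning steps recited in claim 1, the following Python sketch ranks mutation signatures by their mutual information with the mutation types and removes a preset proportion of them, then zeroes a preset proportion of network weights. The claims do not fix these details, so the following are assumptions: activity values are binned into equal-width intervals before mutual information is computed, the signatures with the lowest mutual information are the ones deleted, the weights with the smallest absolute value are the ones deleted, and all function names are hypothetical. Retraining the pruned (fourth) model on the second training data is not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def discretize(values, bins=4):
    """Map activity values in [0, 1] to equal-width bin indices (assumed choice)."""
    return [min(int(v * bins), bins - 1) for v in values]


def screen_signatures(activities, mutation_types, drop_fraction):
    """Return the signature names kept after dropping the lowest-MI fraction.

    activities: dict of signature name -> per-patient activity values.
    mutation_types: per-patient labels produced by the second model.
    drop_fraction: stands in for the "first preset proportion".
    """
    scores = {
        name: mutual_information(discretize(vals), mutation_types)
        for name, vals in activities.items()
    }
    n_drop = int(len(scores) * drop_fraction)
    ranked = sorted(scores, key=scores.get)  # ascending mutual information
    dropped = set(ranked[:n_drop])
    return [name for name in activities if name not in dropped]


def prune_weights(weights, drop_fraction):
    """Zero the smallest-magnitude weights of one layer (list of rows).

    drop_fraction stands in for the "second preset proportion". Weights tied
    with the threshold are also zeroed, so slightly more may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_drop = int(len(flat) * drop_fraction)
    if n_drop == 0:
        return [row[:] for row in weights]
    threshold = flat[n_drop - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
```

In this sketch a deleted weight is represented by a zero entry, so the layer keeps its shape and the remaining weights can be retrained with the zeroed positions held fixed.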
CN202311678281.1A 2023-12-08 2023-12-08 Disease risk prediction model construction method and analysis method based on mutation signature Active CN117373678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311678281.1A CN117373678B (en) 2023-12-08 2023-12-08 Disease risk prediction model construction method and analysis method based on mutation signature

Publications (2)

Publication Number Publication Date
CN117373678A CN117373678A (en) 2024-01-09
CN117373678B CN117373678B (en) 2024-03-05

Family

ID=89395072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311678281.1A Active CN117373678B (en) 2023-12-08 2023-12-08 Disease risk prediction model construction method and analysis method based on mutation signature

Country Status (1)

Country Link
CN (1) CN117373678B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347837A (en) * 2019-07-17 2019-10-18 电子科技大学 A kind of unplanned Risk Forecast Method of being hospitalized again of cardiovascular disease
CN111000571A (en) * 2019-12-26 2020-04-14 上海市精神卫生中心(上海市心理咨询培训中心) Method for predicting risk of post-traumatic psychological disease, related device and storage device
CN112601826A (en) * 2018-02-27 2021-04-02 康奈尔大学 Ultrasensitive detection of circulating tumor DNA by whole genome integration
CN112602156A (en) * 2018-02-27 2021-04-02 康奈尔大学 System and method for detecting residual disease
CN113764038A (en) * 2021-08-31 2021-12-07 华南理工大学 Method for constructing myelodysplastic syndrome whitening gene prediction model
CN114937473A (en) * 2022-07-20 2022-08-23 中日友好医院(中日友好临床医学研究所) VTE risk assessment model based on polygenic mutation characteristics, construction method and application
CN115035951A (en) * 2022-05-27 2022-09-09 中国科学院深圳理工大学(筹) Mutation signature prediction method and device, terminal equipment and storage medium
CN116194995A (en) * 2020-07-27 2023-05-30 索菲亚遗传股份有限公司 Method for identifying chromosomal dimensional instability such as homologous repair defects in next generation sequencing data of low coverage
CN117178302A (en) * 2021-02-19 2023-12-05 迪普赛尔公司 Systems and methods for cell analysis

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10283221B2 (en) * 2016-10-27 2019-05-07 International Business Machines Corporation Risk assessment based on patient similarity determined using image analysis
US20190392951A1 (en) * 2018-06-20 2019-12-26 California Institute Of Technology Mutation profile and related labeled genomic components, methods and systems
WO2020068506A1 (en) * 2018-09-24 2020-04-02 President And Fellows Of Harvard College Systems and methods for classifying tumors
JP2022504916A (en) * 2018-10-12 2022-01-13 ヒューマン ロンジェヴィティ インコーポレイテッド Multi-omics search engine for integrated analysis of cancer genes and clinical data
US11705226B2 (en) * 2019-09-19 2023-07-18 Tempus Labs, Inc. Data based cancer research and treatment systems and methods
US20210102197A1 (en) * 2019-10-07 2021-04-08 The Broad Institute, Inc. Designing sensitive, specific, and optimally active binding molecules for diagnostics and therapeutics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
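Title
"Application and progress of next-generation sequencing technology in molecular biology research of breast cancer"; Ye Xinxing et al.; Journal of Southeast University (Medical Science Edition); pp. 813-816 *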

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant