WO2024021037A1 - Disease analysis method, disease analysis model training method, and device - Google Patents

Disease analysis method, disease analysis model training method, and device

Info

Publication number
WO2024021037A1
Authority
WO
WIPO (PCT)
Prior art keywords
omics
model
sample data
data
group
Prior art date
Application number
PCT/CN2022/109023
Other languages
English (en)
French (fr)
Inventor
宋阳
丁丁
Original Assignee
京东方科技集团股份有限公司
Application filed by 京东方科技集团股份有限公司
Priority to CN202280002458.3A (CN117795617A)
Priority to PCT/CN2022/109023 (WO2024021037A1)
Publication of WO2024021037A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • Embodiments of the present disclosure relate to, but are not limited to, the field of biological information technology, and in particular, to a disease analysis method, a disease analysis model training method, and a device.
  • Current tumor prognosis methods have the following problems: 1) A single-omics evaluation system considers limited factors and cannot conduct a comprehensive evaluation, while a multi-omics evaluation system merely combines the omics and does not effectively exploit the multi-factor conditions; both reduce the accuracy of the evaluation. 2) Patients with similar clinical manifestations are not effectively linked, although such related patients may be consistent in disease diagnosis, treatment and prognosis.
  • Embodiments of the present disclosure provide a disease analysis method, including:
  • obtaining first omics data and second omics data of the patient, the first omics data including a plurality of first sites, and the second omics data including a plurality of second sites;
  • inputting the first omics data and the second omics data into a fusion algorithm model to obtain the patient's prediction result.
  • the fusion algorithm model is formed based on the fusion of the first omics model and the second omics model.
  • the first omics model is constructed based on the sample data of the first omics
  • the second omics model is constructed based on the sample data of the second omics.
  • An embodiment of the present disclosure also provides a disease analysis device, including a memory and a processor connected to the memory, where the memory is used to store instructions, and the processor is configured to perform, based on the instructions stored in the memory, the steps of the disease analysis method described in any embodiment of the present disclosure.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the disease analysis method described in any embodiment of the present disclosure is implemented.
  • Embodiments of the present disclosure also provide a training method for a disease analysis model, including:
  • Calculate the first similarity matrix between the sample data of the first omics, and construct a first undirected link network based on the calculated first similarity matrix; calculate the second similarity matrix between the sample data of the second omics, and construct a second undirected link network based on the calculated second similarity matrix;
  • a first omics model is constructed;
  • a second omics model is constructed;
  • a fusion algorithm model is constructed based on the two-dimensional undirected link network and the product probability matrix, and the two-dimensional undirected link network is the first undirected link network or the second undirected link network.
  • An embodiment of the present disclosure also provides a training device for a disease analysis model, including a memory and a processor connected to the memory, where the memory is used to store instructions, and the processor is configured to execute, based on the instructions stored in the memory, the steps of the training method of the disease analysis model described in any embodiment of the present disclosure.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the training method of the disease analysis model described in any embodiment of the present disclosure is implemented.
  • Figure 1 is a schematic flow chart of a disease analysis method provided by an exemplary embodiment of the present disclosure
  • Figure 2A is a schematic diagram of a methylation data set after data preprocessing provided by an exemplary embodiment of the present disclosure
  • Figure 2B is a schematic diagram of a gene mutation data set after data preprocessing provided by an exemplary embodiment of the present disclosure
  • Figure 3A is a schematic diagram of a methylation data set filtered by low variance features provided by an exemplary embodiment of the present disclosure
  • Figure 3B is a schematic diagram of a gene mutation data set filtered by low variance features provided by an exemplary embodiment of the present disclosure
  • Figure 4A is a schematic diagram of a methylation data set after low-variance feature filtering and recursive dimensionality reduction feature filtering provided by an exemplary embodiment of the present disclosure
  • Figure 4B is a schematic diagram of a gene mutation data set after low-variance feature filtering and recursive dimensionality reduction feature filtering provided by an exemplary embodiment of the present disclosure
  • Figure 5A is a schematic diagram of the first undirected link network relationship data set obtained according to the methylation data set of Figure 3A;
  • Figure 5B is a schematic diagram of the second undirected link network relationship data set obtained based on the gene mutation data set of Figure 3B;
  • Figure 5C is a network diagram of the completed first undirected link network or the second undirected link network
  • Figure 6A is a ROC curve chart of the methylation algorithm model constructed based on the first undirected link network relationship data set of Figure 5A;
  • Figure 6B is the ROC curve of the gene mutation algorithm model constructed based on the second undirected link network relationship data set of Figure 5B;
  • Figure 7A is a ROC curve chart of the methylation algorithm model constructed based on the first undirected link network relationship methylation data set in Figure 4A after low variance feature filtering and recursive dimensionality reduction feature filtering;
  • Figure 7B is the ROC curve of the gene mutation algorithm model constructed based on the second undirected link network relationship gene mutation data set in Figure 4B after low variance feature filtering and recursive dimensionality reduction feature filtering;
  • Figure 8 is a ROC curve chart of the fusion algorithm model created based on the methylation algorithm model of Figure 7A and the gene mutation algorithm model of Figure 7B;
  • Figure 9A is a schematic diagram of the top 10 clinical molecular marker system of methylation obtained according to the disease analysis method of the embodiment of the present disclosure.
  • Figure 9B is a schematic diagram of the clinical molecular marker system of the top 10 gene mutations obtained according to the disease analysis method of the embodiment of the present disclosure.
  • Figures 10A and 10B are schematic flow charts of two other disease analysis methods provided by exemplary embodiments of the present disclosure.
  • Figure 11 is a schematic structural diagram of a disease analysis device provided by an exemplary embodiment of the present disclosure.
  • Figure 12 is a schematic flowchart of a training method for a disease analysis model provided by an exemplary embodiment of the present disclosure.
  • the embodiment of the present disclosure provides a disease analysis method, including the following steps:
  • Step 101 Obtain the first omics data and the second omics data of the patient.
  • the first omics data includes multiple first sites, and the second omics data includes multiple second sites;
  • Step 102 Input the first omics data and the second omics data into the fusion algorithm model to obtain the prediction result of the patient.
  • the fusion algorithm model is formed based on the fusion of the first omics model and the second omics model.
  • the first omics model is constructed based on the sample data of the first omics, and the second omics model is constructed based on the sample data of the second omics.
  • the embodiments of the present disclosure can more accurately judge the patient's disease diagnosis, treatment or prognosis, avoiding the problems that a single-omics evaluation system cannot evaluate comprehensively and that a multi-omics evaluation system does not effectively exploit multi-factor conditions.
  • the disease described in the embodiments of the disclosure may be a tumor or other diseases, and the embodiments of the disclosure are not limited thereto.
  • omics mainly includes DNA methylomics, genomics, proteomics, transcriptomics, etc.
  • the first site may be a DNA methylation site and the second site may be a gene site; or
  • the first site may be a gene site and the second site may be a DNA methylation site, where the information included in a gene site may include mutation status and/or expression status.
  • the embodiments of the present disclosure are not limited to this.
  • the first site can be a DNA methylation site
  • the second site can be a gene site
  • the information included in the gene site is gene mutation information.
  • methylation is catalyzed by enzymes and involves heavy metal modification, regulation of gene expression, regulation of protein function, and ribonucleic acid processing. Changes in methylation are often associated with abnormal expression of disease genes.
  • the specific location that a gene occupies on a chromosome is called a gene locus. There are many genes, but the number of chromosomes is small, so one chromosome contains many genes, and the genes are arranged in a single line on the chromosome.
  • Gene mutations refer to changes in genes in cells, including point mutations of single bases, duplications, insertions, and deletions of multiple bases. Gene mutations can lead to changes in protein expression, thereby affecting cell function.
  • the first omics model is constructed based on the sample data of the first omics, which may include:
  • a first omics model is constructed.
  • the first undirected link network and the second undirected link network described below are both two-dimensional undirected link networks.
  • a two-dimensional undirected link network is a two-dimensional network composed of connections between nodes. Nodes whose data are similar are connected by undirected line segments, and the data characteristics of surrounding nodes are taken into account when building the model and predicting unknown nodes.
  • the embodiments of the present disclosure establish a local network of patients with intrinsic connections through a two-dimensional undirected link network, which can implement similar diagnosis and treatment plans for patients in the local network and more effectively judge the prognosis of the patients.
  • performing data preprocessing on the sample data of the first omics may include:
  • deleting one or more first sites from the sample data of the first omics, where each deleted first site has a missing data value in at least one patient sample.
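
The missing-value deletion step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the use of `None` to encode a missing value, and the site labels `cg_a`, `cg_b`, `cg_c` are all assumptions made for the example.

```python
def drop_sites_with_missing(samples, site_names):
    """Remove every site (column) that is missing in at least one patient sample.

    samples: list of rows, one per patient; None marks a missing value (assumed encoding).
    site_names: column labels aligned with each row.
    """
    keep = [j for j in range(len(site_names))
            if all(row[j] is not None for row in samples)]
    kept_rows = [[row[j] for j in keep] for row in samples]
    kept_names = [site_names[j] for j in keep]
    return kept_rows, kept_names


# Toy data: the second site is missing for the first patient, so it is dropped.
rows = [[0.1, None, 0.7],
        [0.3, 0.5, 0.9]]
clean_rows, clean_names = drop_sites_with_missing(rows, ["cg_a", "cg_b", "cg_c"])
print(clean_names)
```

Dropping whole columns rather than imputing values keeps the remaining matrix free of estimated entries, at the cost of discarding sites that are only partly observed.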
  • the first omics includes multiple methylation sites
  • performing data preprocessing on the sample data of the first omics includes: classifying each patient in the sample data of the first omics according to the patient's prognosis or disease stage, with the classification results including at least two categories.
  • the overall survival in years is calculated from the number of survival days, and patients are classified by overall survival in years. For example, when the patient's overall survival is less than or equal to 2 years, the patient is classified as category I; when the patient's overall survival is greater than 2 years, the patient is classified as category II. After all patients in the sample data have been classified, the determined patient categories are added to the last column of the methylation data set. In this example, patients are divided into two categories according to the number of years of overall survival. However, in other examples, patients can also be divided into three or more categories as needed, and the embodiment of the present disclosure does not limit this.
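
The two-category rule in this example can be sketched as follows. The function names and the 365-day year are assumptions for illustration; the text only states that survival days are converted to years and split at 2 years.

```python
def survival_category(survival_days, cutoff_years=2.0, days_per_year=365.0):
    """Return category 'I' for overall survival <= cutoff_years, else 'II'.

    days_per_year=365.0 is an assumed conversion factor, not taken from the source.
    """
    years = survival_days / days_per_year
    return "I" if years <= cutoff_years else "II"


def append_category(row, survival_days):
    """Append the patient category as the last column of a data row."""
    return list(row) + [survival_category(survival_days)]


labelled = append_category([0.12, 0.80], survival_days=900)
print(labelled)  # the last element is the patient category
```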
  • when the prediction result of the algorithm model of the embodiment of the present disclosure is the patient's disease stage, the clinical staging categories can be used directly as categories I, II, III, etc.
  • the determined patient categories are added to the last column of the methylation data set.
  • Embodiments of the present disclosure can classify patients into two, three, or more categories according to their disease staging, and this embodiment of the present disclosure does not limit this.
  • after data preprocessing, the methylation data set includes 68 patient samples and 374905 methylation sites, as shown in Figure 2A, where the row names represent different patients, columns 1 to 374905 represent different methylation sites, and column 374906 represents the patient category.
  • the second omics model is constructed based on the sample data of the second omics, which may include:
  • a second omics model is constructed.
  • the second site may be a gene site, however, the embodiments of the present disclosure are not limited thereto.
  • performing data preprocessing on the sample data of the second omics may include:
  • after data preprocessing, the gene mutation data set includes 68 patient samples and 17513 gene loci, as shown in Figure 2B, where the row names represent different patients, columns 1 to 17513 represent different gene loci, and column 17514 represents the patient category.
  • the patient categories in the gene mutation dataset are the same as the corresponding patient categories in the methylation dataset.
  • performing feature filtering on the sample data of the first omics may include:
  • first sites whose calculated variance is less than a preset first variance threshold are deleted.
  • performing feature filtering on the sample data of the second omics may include:
  • second sites whose calculated variance is less than a preset second variance threshold are deleted.
  • the disease analysis method in the embodiment of the present disclosure determines candidate features that can optimize the performance of subsequent models through feature filtering.
  • the size of the preset first variance threshold and the second variance threshold can be set according to the actual situation of the data set of the first site and the data set of the second site.
  • the preset first variance threshold can be set so that, after deleting the first sites whose calculated variance is less than the preset first variance threshold, the number of remaining first sites is 0.5% to 2% of the number of first sites before deletion; similarly, the preset second variance threshold can be set so that, after deleting the second sites whose calculated variance is less than the preset second variance threshold, the number of remaining second sites is 0.5% to 2% of the number of second sites before deletion.
  • the preset first site number threshold and the preset second site number threshold can be dynamically adjusted based on the final model accuracy.
  • the variance is calculated for each column of methylation site data in the methylation data set, and methylation sites with low-variance features are filtered out based on the preset first variance threshold.
  • the variance is calculated for each column of gene mutation data in the gene mutation data set, and gene sites with low-variance features are filtered out based on the preset second variance threshold.
  • Filtering out methylation sites or gene sites whose variance is less than the preset variance threshold can be expressed by the following formula 1: $$\mathrm{Column}_a = \{\, i \mid \sigma^2_{n_i} \ge \delta \,\}$$
  • where $\sigma^2_{n_i}$ is the variance of the methylation values or gene mutation values of all patients at a specific methylation site or gene site $i$, and $\delta$ is the preset first variance threshold or second variance threshold.
  • when $\sigma^2_{n_i}$ is greater than or equal to $\delta$, the methylation site or gene site is retained; when $\sigma^2_{n_i}$ is less than $\delta$, the methylation site or gene site is discarded; $\mathrm{Column}_a$ is the feature column remaining after filtering.
  • after low-variance feature filtering, the methylation data set includes 68 patient samples and 2040 methylation sites, as shown in Figure 3A.
  • the gene mutation data set includes 68 patient samples and 283 gene loci, as shown in Figure 3B.
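
The low-variance filter described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the source does not say whether population or sample variance is used, so population variance (`statistics.pvariance`) is assumed here, and the function name and site labels are hypothetical.

```python
from statistics import pvariance


def low_variance_filter(samples, site_names, threshold):
    """Keep only the sites whose variance across patients is >= threshold.

    Uses population variance; this choice is an assumption, not stated in the source.
    """
    kept = [j for j in range(len(site_names))
            if pvariance([row[j] for row in samples]) >= threshold]
    filtered_rows = [[row[j] for j in kept] for row in samples]
    kept_names = [site_names[j] for j in kept]
    return filtered_rows, kept_names


# Toy methylation values: cg_b is constant and cg_c barely varies.
rows = [[0.1, 0.5, 0.2],
        [0.9, 0.5, 0.2],
        [0.2, 0.5, 0.6],
        [0.8, 0.5, 0.2]]
filtered, kept_names = low_variance_filter(rows, ["cg_a", "cg_b", "cg_c"], threshold=0.1)
print(kept_names)
```

In practice the threshold would be tuned until the retained fraction falls in the target range, for example by sorting the per-site variances and reading off the value at the desired quantile.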
  • performing feature filtering on the sample data of the first omics may also include:
  • the base model may be a linear regression model, a logistic regression model, or a decision tree model.
  • performing feature filtering on the sample data of the second omics may also include:
  • selecting a base model, using the base model and the sample data of the second omics to perform multiple rounds of training, and removing y second sites from the sample data of the second omics after each round, where the weight values of the removed y second sites are the y lowest among all second-site weight values obtained in that round and y is a natural number greater than or equal to 1, until the number of remaining second sites equals the preset second site number threshold.
  • a type of model algorithm is selected as the base model, the data set is traversed and fitted, and methylation sites that contribute less to the model (i.e., have lower weights) are recursively filtered out until the number of remaining methylation sites equals the preset first site number threshold.
  • similarly, a type of model algorithm is selected as the base model, the data set is traversed and fitted, and gene sites that contribute less to the model (i.e., have lower weights) are recursively filtered out until the number of remaining gene sites equals the preset second site number threshold.
  • the preset first site number threshold and the preset second site number threshold can be set based on the number of first sites and the number of second sites remaining after low-variance feature filtering. For example, the preset first site number threshold can be set to more than 40% of the number of first sites remaining after low-variance feature filtering, and the preset second site number threshold can be set to more than 40% of the number of second sites remaining after low-variance feature filtering.
  • the preset first site number threshold and the preset second site number threshold can be dynamically adjusted based on the final model accuracy.
  • Filtering out methylation sites or gene sites that contribute less to the model can be expressed by formula 2:
  • $S_{i}$ is each feature subset. After the feature variables are ranked by importance (i.e., weight), the N most important features are selected and a new feature subset is constructed, until the length of the feature subset equals the preset number of features; $\mathrm{Column}_b$ is the feature column remaining after further filtering of $\mathrm{Column}_a$.
  • Embodiments of the present disclosure can effectively improve the accuracy of the final result (the running time after feature filtering is reduced and the accuracy is improved) through a feature filtering algorithm dominated by low variance filtering and recursive dimensionality reduction.
  • after low-variance feature filtering and recursive dimensionality reduction feature filtering, the methylation data set includes 68 patient samples and 1,000 methylation sites, as shown in Figure 4A.
  • the gene mutation data set includes 68 patient samples and 250 gene loci, as shown in Figure 4B.
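
The recursive filtering step can be sketched as below. This is only one possible reading: it assumes a linear regression base model fitted by plain batch gradient descent, uses the absolute coefficient as the site weight, and assumes features on a comparable scale. The function names, learning rate, epoch count, and gene labels are all hypothetical.

```python
def fit_linear_weights(X, y, lr=0.3, epochs=5000):
    """Least-squares linear regression by batch gradient descent.

    Returns the feature weights (the bias term is fitted but not returned).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [b + sum(wj * xj for wj, xj in zip(w, row)) - t
                  for row, t in zip(X, y)]
        b -= lr * sum(errors) / n
        w = [wj - lr * sum(e * row[j] for e, row in zip(errors, X)) / n
             for j, wj in enumerate(w)]
    return w


def recursive_feature_elimination(X, y, names, target_count, step=1):
    """Retrain repeatedly, dropping the `step` lowest-weight sites per round."""
    X, names = [list(row) for row in X], list(names)
    while len(names) > target_count:
        weights = fit_linear_weights(X, y)
        n_drop = min(step, len(names) - target_count)
        order = sorted(range(len(names)), key=lambda j: abs(weights[j]))
        drop = set(order[:n_drop])
        keep = [j for j in range(len(names)) if j not in drop]
        names = [names[j] for j in keep]
        X = [[row[j] for j in keep] for row in X]
    return X, names


# Toy 0/1 mutation matrix in which only the first gene tracks the label.
X = [[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]
X_kept, kept = recursive_feature_elimination(X, y, ["GENE_A", "GENE_B", "GENE_C"], 1)
print(kept)
```

The `step` argument plays the role of y in the text, and `target_count` plays the role of the preset site number threshold.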
  • a cosine algorithm may be used to calculate the first similarity matrix between the sample data of the first omics or the second similarity matrix between the sample data of the second omics.
  • the cosine algorithm is used to calculate the similarity score between patient A and patient B, which can be expressed by the following formula 3: $$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
  • where $A_i$ is the methylation level value of patient A at a certain methylation site or the mutation status of patient A at a certain gene site, $B_i$ is the methylation level value of patient B at the same methylation site or the mutation status of patient B at the same gene site, and $n$ is the number of all methylation sites or the number of all genes.
  • the difference between patient A and patient B is measured according to the similarity score (i.e., the cosine value $\cos(\theta)$).
  • the similarity threshold of the methylation data set is set to 0.8.
  • when the methylation site similarity score between two patients is greater than or equal to 0.8, a network relationship is considered to exist between the two patients; when the score is less than 0.8, no network relationship is considered to exist between them. The first undirected link network is established in this way.
  • the similarity threshold of the gene mutation data set is set to 0.5 according to the statistics of gene locus similarity scores.
  • when the gene locus similarity score between two patients is greater than or equal to 0.5, a network relationship is considered to exist between the two patients; when the score is less than 0.5, no network relationship is considered to exist between them. The second undirected link network is established in this way.
  • the first undirected link network relationship data set established is shown in Figure 5A.
  • the second undirected link network relationship data set established is shown in Figure 5B.
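
The cosine scoring and thresholding described above can be sketched as follows. The adjacency-set representation, the function names, and the handling of all-zero rows (scored as 0.0) are assumptions for illustration; the thresholds 0.8 and 0.5 come from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two patients' feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero row, not from the source
    return dot / (norm_a * norm_b)


def build_undirected_network(samples, threshold):
    """Link patients i and j (i != j) when their similarity is >= threshold."""
    neighbours = {i: set() for i in range(len(samples))}
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if cosine_similarity(samples[i], samples[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)
    return neighbours


# Toy data: patients 0 and 1 are similar, patient 2 is not.
patients = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
network = build_undirected_network(patients, threshold=0.8)
print(network)
```

Because each pair is scored once and the edge is stored in both directions, the resulting structure is undirected by construction.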
  • the first similarity matrix or the second similarity matrix can be established according to formula 4, where $l_i$ and $l_j$ are the i-th row of data (the i-th patient) and the j-th row of data (the j-th patient) in the methylation data set or gene locus data set, respectively, with $i \neq j$, and $S(l_i, l_j)$ is the similarity score between the i-th patient and the j-th patient in that data set.
  • the network diagram of the first undirected link network or the second undirected link network after construction is completed is shown in Figure 5C.
  • A, B, C... are patient numbers
  • $a_{Hi}$ is the methylation value of patient A at methylation site i, where the methylation sites include 1, 2, 3...$b_h$
  • $a_{Mi}$ is the gene mutation status of patient A at gene locus i, where the gene loci include 1, 2, 3...$b_m$.
  • the first undirected link network or the second undirected link network can be constructed according to the following formula 5:
  • N i is the neighboring node of node i.
  • the first undirected link network is constructed based on the first similarity matrix
  • the second undirected link network is constructed based on the second similarity matrix.
  • Both the first undirected link network and the second undirected link network are two-dimensional undirected link networks. Through the two-dimensional undirected link network, the embodiment of the present disclosure considers both the characteristics of a node itself and the characteristics of its neighboring nodes when building the algorithm model, and uses multi-fold cross-validation to evaluate model performance.
  • a methylation algorithm model is constructed based on the first undirected link network relationship data set in Figure 5A, and the resulting ROC curve is shown in Figure 6A. After 5-fold cross-validation, the median AUC value of the methylation algorithm model is 0.76.
  • the gene mutation algorithm model is constructed based on the second undirected link network relationship data set in Figure 5B, and the resulting ROC curve is shown in Figure 6B. After 5-fold cross-validation, the median AUC value of the gene mutation algorithm model is 0.69.
  • The receiver operating characteristic (ROC) curve is also known as the sensitivity curve.
  • the reason for this name is that every point on the curve reflects the same sensitivity: all points are responses to the same signal stimulus, obtained under different decision criteria.
  • the ROC curve is plotted with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis. It is drawn from the different results that subjects obtain under specific stimulus conditions when using different decision criteria.
  • the AUC (area under the ROC curve) is the area below the ROC curve. A larger AUC represents better performance.
  • when the AUC is in (0.5, 0.7], the diagnostic method has low accuracy; when the AUC is in (0.7, 0.9], the diagnostic method has a certain accuracy; when the AUC is greater than 0.9, the diagnostic method has high accuracy.
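
The AUC and its interpretation bands can be illustrated with a short sketch. It computes the AUC through the equivalent pairwise-ranking definition (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as half) rather than by integrating a plotted curve. The function names and the label for values at or below 0.5 are assumptions.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy_band(auc):
    """Map an AUC value to the accuracy bands given in the text."""
    if auc > 0.9:
        return "high"
    if auc > 0.7:
        return "certain"
    if auc > 0.5:
        return "low"
    return "no better than chance"  # band not described in the source; assumed label


auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc, accuracy_band(auc))  # 0.75 certain
```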
  • a first undirected link network relationship methylation data set is established.
  • the resulting ROC curve is shown in Figure 7A. After 5-fold cross-validation, the median AUC value of the methylation algorithm model is 0.82, showing that the model performs better after the second filtering (median AUC 0.82) than after the first filtering (median AUC 0.76).
  • the fusion algorithm model is formed based on the fusion construction of the first omics model and the second omics model, which may include:
  • the two-dimensional undirected link network can be the first undirected link network or the second undirected link network.
  • the two types of models can be further fused to form a fusion algorithm model.
  • the fusion algorithm model integrates multiple models into a final model, which is better than any single model in terms of model performance.
  • the product probability matrix is calculated according to the following formula 6:
  • the obtained product probability matrix is used as the new feature values of each patient, replacing the original features of the two-dimensional undirected link network (the methylation level values of the methylation sites and the gene mutation status). Using the previously constructed first undirected link network or second undirected link network as the basis, a fusion algorithm model is built, and cross-validation is used to evaluate model performance.
  • the ROC curve of the fusion algorithm model is shown in Figure 8. After 5-fold cross-validation, the median AUC value of the fusion algorithm model is 0.99, indicating that the fusion algorithm model performs better than the separate methylation algorithm model and gene mutation algorithm model after secondary filtering.
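
Formula 6 is not reproduced in this text, so the following is only a guess at what a product probability matrix might look like: it assumes each omics model outputs per-class probabilities for every patient and that the two matrices are multiplied element by element. The function name and the toy probabilities are hypothetical.

```python
def product_probability_matrix(probs_first, probs_second):
    """Element-wise product of two per-patient class-probability matrices.

    Assumption: both matrices have one row per patient and one column per class,
    in the same order. This is an illustrative reading, not the patent's formula 6.
    """
    return [[p * q for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(probs_first, probs_second)]


# Toy outputs of a methylation model and a gene mutation model for two patients.
probs_methylation = [[0.8, 0.2], [0.3, 0.7]]
probs_mutation = [[0.5, 0.5], [0.25, 0.75]]
fused = product_probability_matrix(probs_methylation, probs_mutation)
print(fused)
```

Under this reading, each fused row would then serve as that patient's new feature vector when the fusion model is trained on the existing undirected link network.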
  • the disease analysis method may further include:
  • the first omics model is rebuilt, and the mean absolute error between the prediction results of the rebuilt first omics model and the real results is evaluated;
  • the randomly shuffled first sites corresponding to the N1 largest mean absolute errors are used as candidate molecular markers, where N1 is a natural number greater than or equal to 1.
  • candidate molecular markers are the molecular markers that are clinically most relevant to the occurrence and development of the disease. They can be biological substances such as genes, methylation sites, and proteins, and they have reference value for the diagnosis and treatment of diseases.
  • the disease analysis method may further include:
  • For the multiple first sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected first site, and combine the shuffled sample data of that first site with the sample data of the other first sites to form new feature values of the first omics; use K-fold cross-validation to rebuild the first omics model K times; evaluate the mean absolute error between the prediction results of each of the K rebuilt first omics models and the real results, and calculate the mean of the K mean absolute errors;
  • the randomly shuffled first sites corresponding to the N1 largest means of the K mean absolute errors are used as candidate molecular markers, where N1 is a natural number greater than or equal to 1, and K is a natural number greater than 1.
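  • the mean of the K mean absolute errors can be calculated according to the following formula 7: $$\overline{\mathrm{MAE}}_l = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n}\sum_{j=1}^{n}\left| y_{\mathrm{true},j} - y_{\mathrm{pred},j}^{\mathrm{shuffle}(l)} \right|$$
  • where $y_{\mathrm{true}}$ is the true value, $y_{\mathrm{pred}}^{\mathrm{shuffle}(l)}$ is the prediction result after randomly shuffling the $l$-th column in the data set, $n$ is the total number of prediction results, and $K$ is the number of folds into which cross-validation divides the data set.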
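
The shuffle-and-rebuild scoring can be sketched as below. A one-nearest-neighbour predictor stands in for the omics model purely to keep the example self-contained; the patent's model also uses the undirected link network, which is not reproduced here. The function names, the interleaved fold assignment, and the fixed seed are assumptions.

```python
import random


def one_nn_predict(train_X, train_y, x):
    """Hypothetical stand-in model: label of the nearest training sample."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]


def kfold_mean_mae(X, y, k):
    """Mean over K folds of the mean absolute error on each held-out fold."""
    n = len(X)
    fold_maes = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        errors = [abs(y[i] - one_nn_predict(train_X, train_y, X[i])) for i in test_idx]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / k


def shuffled_site_scores(X, y, k, seed=0):
    """For each site, shuffle its column, rebuild K times, and record the mean MAE."""
    rng = random.Random(seed)
    scores = {}
    for col in range(len(X[0])):
        values = [row[col] for row in X]
        rng.shuffle(values)
        shuffled = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, values)]
        scores[col] = kfold_mean_mae(shuffled, y, k)
    return scores


# Toy data: site 0 separates the classes, site 1 is constant and carries no signal.
X = [[0.10, 0.5], [0.20, 0.5], [0.15, 0.5], [0.12, 0.5],
     [0.90, 0.5], [0.80, 0.5], [0.85, 0.5], [0.88, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
scores = shuffled_site_scores(X, y, k=2)
ranked = sorted(scores, key=scores.get, reverse=True)  # candidate markers first
print(scores, ranked)
```

Sites whose shuffling raises the error most are the ones the model depends on, which is why the largest scores are taken as candidate markers.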
  • the disease analysis method may further include:
  • For the multiple second sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected second site, and combine the shuffled sample data of that second site with the sample data of the other second sites to form new feature values of the second omics;
  • the second omics model is rebuilt based on the new feature values of the second omics and the second undirected link network, and the mean absolute error between the prediction results of the rebuilt second omics model and the real results is evaluated;
  • the randomly shuffled second sites corresponding to the N2 largest mean absolute errors are used as candidate molecular markers, where N2 is a natural number greater than or equal to 1.
  • the disease analysis method may further include:
  • For the multiple second sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected second site, and combine the shuffled sample data of that second site with the sample data of the other second sites to form new feature values of the second omics;
  • K-fold cross-validation is used to rebuild the second omics model K times; the mean absolute error between the prediction results of each of the K rebuilt second omics models and the real results is evaluated, and the mean of the K mean absolute errors is calculated;
  • the randomly shuffled second sites corresponding to the N2 largest means of the K mean absolute errors are used as candidate molecular markers, where N2 is a natural number greater than or equal to 1, and K is a natural number greater than 1.
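  • the mean of the K mean absolute errors can be calculated according to the following Formula 7:
  • $$\overline{\mathrm{MAE}}_{l} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n}\sum_{i=1}^{n}\left| y_{\mathrm{true},i} - y_{\mathrm{pred},i}^{\mathrm{shuffle}(l)} \right|$$
  • y_true is the true value
  • y_pred^shuffle(l) is the prediction result after randomly shuffling the l-th column in the data set
  • n is the total number of prediction results
  • K is the number of folds into which cross-validation divides the data set.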
  • a clinical molecular marker system of the top 10 methylation sites is obtained, as shown in Figure 9A, and a clinical molecular marker system of the top 10 gene mutations is obtained, as shown in Figure 9B.
  • the methylation sites found are cg14419975, cg12886942, cg05922253, cg18525352, cg10375890, cg19019537, cg07513622, cg26646370, cg10762626, and cg14745270.
  • the genes involved are DOCK2, ANK3, KMT2B, CDH23, CFH, LAMA2, ABCA4, PLXNB2, ABCA10, and ARHGAP31.
  • Taking the case where the first site is a DNA methylation site and the second site is a gene site as an example, the technical solution of the embodiments of the present disclosure is described in detail below.
  • the disease analysis method provided by the embodiments of the present disclosure can construct a clinical molecular marker system and a systematic algorithm for the diagnosis and treatment of tumor patients by comprehensively analyzing DNA methylation sites and gene sites. As shown in Figure 10A and Figure 10B, this method mainly includes the following steps:
  • the disease analysis method of the embodiment of the present disclosure can more accurately judge the prognosis of the patient and establish a local network of patients with intrinsic connections. Each step is explained in detail below.
  • S1: Perform data preprocessing on the methylation data and the gene mutation data.
  • S1.2: Screen gene mutations of the set types. For each patient sample, mark a gene as 1 if the sample has a mutation in that gene and as 0 if it does not, forming a gene mutation data set; alternatively, count the number of mutations present in that gene in the sample and record it as n, forming a gene mutation data set.
  • S1.3: Convert the patient's prognosis into numerical data and construct two data sets: methylation-prognosis and gene mutation-prognosis.
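The two encodings described in S1.2 can be sketched as follows. This is an illustrative sketch under assumptions: the input layout (one list entry per detected mutation of the selected types), the function name `encode_mutations`, and the patient identifiers are hypothetical and do not come from the source.

```python
from collections import Counter

def encode_mutations(records, genes, mode="binary"):
    """Build a patient-by-gene table from per-patient mutation lists.

    records: {patient_id: [gene, gene, ...]}, one entry per detected mutation
             (assumed to be already restricted to the selected mutation types).
    mode:    "binary" marks 1/0 for mutated/not mutated;
             "count" records the number of mutations n in each gene.
    """
    if mode not in ("binary", "count"):
        raise ValueError("mode must be 'binary' or 'count'")
    table = {}
    for patient, mutated in records.items():
        counts = Counter(mutated)  # missing genes count as 0
        if mode == "binary":
            table[patient] = [1 if counts[g] > 0 else 0 for g in genes]
        else:
            table[patient] = [counts[g] for g in genes]
    return table

genes = ["DOCK2", "ANK3", "CFH"]
records = {"P1": ["DOCK2", "DOCK2", "CFH"], "P2": []}  # hypothetical patients
print(encode_mutations(records, genes, mode="binary"))
print(encode_mutations(records, genes, mode="count"))
```

The binary form discards how often a gene is hit, which is part of why many gene columns end up with very low variance in the later filtering step.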
  • S2: Perform feature filtering on the methylation data and the gene mutation data respectively.
  • the filtering method is as follows: for each methylation site or gene, compute the variance of the methylation values or gene mutation values of all patients at that site or gene, and remove the site or gene when this variance is less than the set variance threshold.
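A minimal sketch of the low-variance filter follows. It assumes the population variance (dividing by the number of patients); the source does not state which variance estimator is used, and the function name and toy matrix are hypothetical.

```python
from statistics import pvariance

def low_variance_filter(matrix, names, threshold):
    """Keep the columns whose variance across patients is at least `threshold`.

    matrix: list of patient rows, one value per site or gene.
    names:  column names, in the same order as the row entries.
    Assumption: population variance (pvariance) is the intended estimator.
    """
    columns = list(zip(*matrix))
    keep = [j for j, col in enumerate(columns) if pvariance(col) >= threshold]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]

# Four hypothetical patients, three binary gene columns.
matrix = [[1, 0, 0], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
filtered, kept = low_variance_filter(matrix, ["g1", "g2", "g3"], threshold=0.08)
print(kept, filtered)
```

Column `g2` is constant, so its variance is zero and it is removed; the other two columns are retained.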
  • S2.4 (optional): For the genes retained after S2.2, select a model algorithm as the base model, fit it to the data set repeatedly, and recursively filter out the gene sites that contribute less to the model (that is, have lower weights) until the number of remaining gene sites equals the preset number of features.
  • the filtering method is as follows:
  • S_{i} is each feature subset: after the feature variables are ranked by importance, the N features with higher importance are selected to construct the feature subset, and this is repeated until the length of the feature subset equals the set number of features; Column_b is the set of feature columns remaining after further filtering of Column_a.
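The recursive filtering of S2.4 can be sketched as below. The source does not name the base model, so `centroid_gap_weights` is a deliberately simple stand-in (an assumption): it scores each feature by the gap between the class means. With this stand-in the ranking does not change between rounds; a real base model would be refitted each round and its weights could shift as features are removed.

```python
from statistics import fmean

def centroid_gap_weights(X, y):
    """Stand-in base model: weight = |mean in class 1 - mean in class 0|."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [abs(fmean(p) - fmean(n)) for p, n in zip(zip(*pos), zip(*neg))]

def recursive_elimination(X, y, n_keep, fit_weights=centroid_gap_weights, step=1):
    """Refit, drop the `step` lowest-weight features, repeat until `n_keep` remain.

    Returns the original column indices of the surviving features.
    """
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        sub = [[row[j] for j in remaining] for row in X]
        weights = fit_weights(sub, y)
        n_drop = min(step, len(remaining) - n_keep)
        order = sorted(range(len(remaining)), key=lambda k: weights[k])
        dropped = set(order[:n_drop])
        remaining = [j for k, j in enumerate(remaining) if k not in dropped]
    return remaining

# Hypothetical data: column 0 separates the classes, column 1 does not.
X = [[1.0, 1.0, 0.9], [1.0, 0.0, 0.6], [0.0, 1.0, 0.4], [0.0, 0.0, 0.3]]
y = [1, 1, 0, 0]
print(recursive_elimination(X, y, n_keep=2))
```

Passing a different `fit_weights` callable is how another base model would be plugged in without changing the elimination loop.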
  • S3: Construct a patient similarity matrix for each of the feature-filtered methylation data set and gene mutation data set, and then construct a two-dimensional undirected link network with patients as nodes.
  • the calculation method is as follows:
  • $$S(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}$$
  • a_i is the methylation level value of patient A at a certain methylation site or the mutation status at a certain gene site
  • b_i is the methylation level value of patient B at the same site or the mutation status of the same gene
  • n is the number of all methylation sites or the number of all gene sites.
  • l_i and l_j are any two rows of patient data, with i ≠ j.
  • when the similarity score S(l_i, l_j) is greater than the set threshold, there is a network relationship between patient i and patient j; otherwise there is no network relationship.
  • the network diagram after construction is shown in Figure 5C.
  • A, B, C... are patient numbers
  • a_Hi is the methylation value of patient A at methylation site i
  • the methylation sites include: 1, 2, 3...b_h
  • a_Mi is the gene mutation of patient A at gene site i; the gene sites include: 1, 2, 3...b_m.
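The construction in S3 can be sketched as a pairwise cosine score followed by thresholding. Two choices here are assumptions: the comparison uses "greater than or equal to", following the worked examples later in the text, and a patient whose feature vector is all zeros is given a score of 0 because the cosine is undefined in that case. The patient identifiers and values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two patient feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumption: all-zero vectors score 0

def build_network(features, threshold):
    """Return the undirected edges between patients scoring at least `threshold`.

    features: {patient_id: [value per site or gene]}.
    Each edge is stored once as a sorted (patient, patient) tuple.
    """
    ids = sorted(features)
    edges = set()
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            if cosine_similarity(features[p], features[q]) >= threshold:
                edges.add((p, q))
    return edges

features = {
    "A": [0.9, 0.1, 0.8],
    "B": [0.8, 0.2, 0.9],
    "C": [0.1, 0.9, 0.0],
}
print(build_network(features, threshold=0.8))
```

Patients A and B have nearly parallel profiles and are linked, while C is left without neighbors at this threshold.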
  • S4: Build a methylation algorithm model and a gene mutation algorithm model based on the similarity matrix.
  • This step considers both the features of each node of the two-dimensional undirected link network and the features of its neighboring nodes when building the algorithm model, and uses cross-validation to evaluate model performance.
  • the calculation method is as follows:
  • f_i^(l) is the feature of node i in the l-th layer
  • a nonlinear activation function is applied to the aggregated features
  • C_ij is the normalization factor
  • N_i is the set of neighboring nodes of node i.
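The formula for S4 is not preserved in this text, so the sketch below is only one plausible reading of the symbol definitions: the widely used graph-convolution propagation rule, in which each node sums its neighbors' features divided by a normalization factor, applies a weight matrix, and passes the result through an activation. The weight matrix, the degree-based choice c_ij = sqrt(|N_i| * |N_j|), the inclusion of each node in its own neighbor set, and the ReLU activation are all assumptions.

```python
import math

def graph_conv_layer(features, neighbors, weight, activation=lambda v: max(0.0, v)):
    """One propagation step: f_i' = act( sum_{j in N_i} (f_j / c_ij) W ).

    features:  {node: [d_in values]}
    neighbors: {node: set of nodes}, symmetric; each node lists itself (assumption)
    weight:    d_in x d_out matrix as a list of rows (hypothetical parameter)
    """
    d_in, d_out = len(weight), len(weight[0])
    out = {}
    for i, nbrs in neighbors.items():
        agg = [0.0] * d_in
        for j in nbrs:
            c_ij = math.sqrt(len(neighbors[i]) * len(neighbors[j]))
            for k in range(d_in):
                agg[k] += features[j][k] / c_ij
        out[i] = [
            activation(sum(agg[k] * weight[k][m] for k in range(d_in)))
            for m in range(d_out)
        ]
    return out

features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [2.0, -1.0]}
neighbors = {"A": {"A", "B"}, "B": {"A", "B"}, "C": {"C"}}  # C has no links
identity = [[1.0, 0.0], [0.0, 1.0]]
print(graph_conv_layer(features, neighbors, identity))
```

Linked patients A and B end up with the same blended features, while the isolated patient C keeps its own values (after the activation), which illustrates how a prediction for one patient draws on similar patients.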
  • S5: Fuse the methylation algorithm model and the gene mutation algorithm model.
  • the calculation method is as follows:
  • S5.2: Use the obtained product probability matrix as the new feature of each patient, take the patient two-dimensional undirected link network constructed in S3 as the benchmark (either the network constructed from the methylation data set or the network constructed from the gene mutation data set may be used), construct the fusion algorithm model according to the model construction method of S4, and use cross-validation to evaluate model performance.
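The product probability matrix used in S5 can be sketched as an element-wise product of the per-category probabilities from the two models. The function name and the probability values are hypothetical; the products are left unnormalized because the source uses them as new features and does not describe renormalizing them.

```python
def product_probabilities(probs_a, probs_b):
    """Multiply same-category probabilities from two models, patient by patient.

    probs_a, probs_b: {patient_id: [P(category I), P(category II)]}.
    """
    if probs_a.keys() != probs_b.keys():
        raise ValueError("both models must score the same patients")
    return {
        patient: [a * b for a, b in zip(probs_a[patient], probs_b[patient])]
        for patient in probs_a
    }

# Hypothetical outputs of the methylation and gene mutation models.
methylation = {"P1": [0.8, 0.2], "P2": [0.4, 0.6]}
mutation = {"P1": [0.7, 0.3], "P2": [0.1, 0.9]}
fused = product_probabilities(methylation, mutation)
print(fused)
```

When both models agree on a category the corresponding product stays large, and when they disagree both products shrink, so the fused features carry the degree of agreement between the two omics.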
  • S6: Use an ablation algorithm to build a clinical molecular marker system.
  • the methylation level values of each methylation-site column, or the gene mutation data of each gene column, are randomly shuffled in turn, and the mean absolute error between the prediction results after shuffling and the real results is evaluated.
  • the columns are then ranked by the mean absolute error obtained after shuffling.
  • the top N methylation sites or gene sites are used as candidate molecular markers.
  • the calculation method is as follows:
  • $$\overline{\mathrm{MAE}}_{l} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n}\sum_{i=1}^{n}\left| y_{\mathrm{true},i} - y_{\mathrm{pred},i}^{\mathrm{shuffle}(l)} \right|$$
  • y_true is the true value
  • y_pred^shuffle(l) is the prediction result after randomly shuffling the l-th column in the data set
  • n is the total number of prediction results
  • K is the number of folds into which cross-validation divides the data set.
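The shuffle-and-rebuild procedure of S6 can be sketched end to end as below. The model here is a nearest-centroid classifier used purely as a stand-in (an assumption); in the source the rebuilt model is the network-based algorithm model. The interleaved fold split, the function names, and the toy data are also assumptions.

```python
import random
from statistics import fmean

def fit_centroids(X, y):
    """Stand-in model (assumption): one mean feature vector per category."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, t in zip(X, y) if t == label]
        centroids[label] = [fmean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, row):
    """Return the category whose centroid is closest to `row`."""
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(row, centroids[lab])))

def shuffled_column_error(X, y, col, k, rng):
    """Shuffle column `col`, rebuild the model in each of K folds, average the MAEs."""
    values = [row[col] for row in X]
    rng.shuffle(values)
    Xs = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, values)]
    fold_errors = []
    for fold in range(k):
        test_idx = list(range(fold, len(X), k))  # simple interleaved split
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([Xs[i] for i in train_idx], [y[i] for i in train_idx])
        fold_errors.append(
            fmean(abs(y[i] - predict(model, Xs[i])) for i in test_idx))
    return fmean(fold_errors)

def rank_markers(X, y, names, n_top, k=5, seed=0):
    """Rank columns by the mean MAE observed after shuffling each one."""
    rng = random.Random(seed)
    errors = {name: shuffled_column_error(X, y, j, k, rng)
              for j, name in enumerate(names)}
    ranked = sorted(names, key=lambda name: errors[name], reverse=True)
    return ranked[:n_top], errors

# Toy data: column 0 tracks the category, column 1 is constant.
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X = [[10.0 * label, 1.0] for label in y]
top, errors = rank_markers(X, y, ["informative", "constant"], n_top=1)
print(top, errors)
```

Shuffling the constant column leaves the data unchanged, so its error stays at zero, while shuffling the informative column can only make predictions worse; the informative column is therefore ranked first.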
  • for methylation, when a methylation site has one or more patient samples in which no methylation level value was measured, the site is deleted and the remaining sites are retained.
  • the methylation data set after screening is shown in Figure 2A.
  • the overall survival in years is calculated from the number of survival days. When it is less than or equal to 2 years, the patient's prognosis is judged to be category I; when it is greater than 2 years, the patient's prognosis is judged to be category II. The resulting prognostic categories are added as the last column of the data set.
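The prognosis labelling can be sketched as follows. The conversion of 365 days per year is an assumption, since the source does not state how days are converted to years; the function names are hypothetical.

```python
def prognosis_category(survival_days, cutoff_years=2.0, days_per_year=365.0):
    """Return "I" for overall survival of at most `cutoff_years`, otherwise "II".

    Assumption: one year is counted as 365 days.
    """
    years = survival_days / days_per_year
    return "I" if years <= cutoff_years else "II"

def append_prognosis(rows, survival_days):
    """Add each patient's category as the last column of that patient's row."""
    return [row + [prognosis_category(days)]
            for row, days in zip(rows, survival_days)]

rows = [[0.1, 0.9], [0.4, 0.2]]  # hypothetical methylation values
print(append_prognosis(rows, [300, 1500]))
```

Exactly two years (730 days under this assumption) falls in category I, matching the "less than or equal to" wording of the text.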
  • the cosine algorithm is used to calculate the similarity score between every two patients.
  • the calculation method is:
  • $$S(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}$$
  • a_i is the methylation level value of patient A at a certain methylation site
  • b_i is the methylation level value of patient B at the same site
  • n is the number of all methylation sites.
  • the threshold of the methylation data set is set to 0.8.
  • when the score of two patients is greater than or equal to 0.8, a network relationship is considered to exist between the two patients.
  • the obtained network relationship data set is shown in Figure 5A.
  • a two-dimensional undirected link network graph is constructed. Based on the two-dimensional undirected link network, with the methylation level values as features, a methylation algorithm model is constructed according to the aforementioned method, and the performance of the methylation algorithm model is evaluated using cross-validation.
  • the ROC curve of the methylation algorithm model is shown in Figure 6A. It can be found that after 5-fold cross-validation, the median AUC value of the methylation algorithm model is 0.76.
  • the gene mutation data set after screening is shown in Figure 2B.
  • the overall survival in years is calculated from the number of survival days. When it is less than or equal to 2 years, the patient's prognosis is judged to be category I; when it is greater than 2 years, the patient's prognosis is judged to be category II. The resulting prognostic categories are added as the last column of the data set.
  • set the low-variance threshold to 0.08, calculate the variance of each column of the gene mutation data set, and delete the genes whose variance is less than 0.08.
  • the gene mutation data set after deletion is shown in Figure 3B. After the screening is completed, there are 68 patient samples and 283 genes. Because each patient sample is marked as a binary value according to whether the gene is mutated, most of the values for some genes are 0, their variance is small, and most of the sites are filtered out.
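A short worked calculation shows what the 0.08 threshold means for a binary gene column. It assumes the population variance is used (the source does not say which estimator), with p denoting the fraction of the 68 patients carrying a mutation in the gene.

```latex
\begin{align*}
\operatorname{Var} &= p(1-p) \\
p(1-p) < 0.08 \;&\Longleftrightarrow\; p < \frac{1-\sqrt{1-4\cdot 0.08}}{2} \approx 0.0877
  \quad\text{or}\quad p > \frac{1+\sqrt{1-4\cdot 0.08}}{2} \approx 0.9123 \\
p = \tfrac{5}{68} &:\quad \tfrac{5}{68}\cdot\tfrac{63}{68} \approx 0.0681 < 0.08 \quad\text{(removed)} \\
p = \tfrac{6}{68} &:\quad \tfrac{6}{68}\cdot\tfrac{62}{68} \approx 0.0804 \ge 0.08 \quad\text{(kept)}
\end{align*}
```

Under this assumption, a gene is removed when it is mutated in at most 5 of the 68 patients (or, symmetrically, in at least 63), which is consistent with the observation that sparsely mutated genes are filtered out.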
  • the cosine algorithm is used to calculate the similarity score between every two patients.
  • the calculation method is:
  • $$S(A, B) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}$$
  • a_i is the mutation status of patient A in a certain gene
  • b_i is the mutation status of patient B in the same gene
  • n is the number of all genes.
  • the threshold of the gene mutation data set is set to 0.5.
  • when the score of two patients is greater than or equal to 0.5, a network relationship is considered to exist between the two patients.
  • the obtained network relationship data set is shown in Figure 5B.
  • a two-dimensional undirected link network graph is constructed. Based on the two-dimensional undirected link network, with the gene mutation status as features, a gene mutation algorithm model is constructed according to the aforementioned method, and the performance of the gene mutation algorithm model is evaluated using cross-validation.
  • the ROC curve of the gene mutation algorithm model is shown in Figure 6B. It can be found that after 5-fold cross-validation, the median AUC value of the gene mutation algorithm model is 0.69.
  • secondary feature screening is performed on the feature subsets of methylation sites and gene sites respectively.
  • the methylation data set after secondary feature screening is shown in Figure 4A. After the screening is completed, there are 68 patient samples and 1,000 methylation sites.
  • the gene mutation data set after secondary feature screening is shown in Figure 4B. After the screening is completed, there are 68 patient samples and 250 genes.
  • the cosine algorithm is applied to the methylation sites or gene sites of every two patients to obtain the similarity between them. Based on the similarity scores, a two-dimensional undirected link network graph is constructed. Based on the two-dimensional undirected link network, with the methylation level values or gene mutation values as features, the methylation algorithm model or gene mutation algorithm model is constructed according to the method mentioned above (Formula 5), and cross-validation is used to evaluate the performance of the methylation algorithm model or the gene mutation algorithm model.
  • the ROC curve of the methylation algorithm model is shown in Figure 7A. After 5-fold cross-validation, the median AUC value of the methylation algorithm model is 0.82, so the performance of the model after secondary filtering is significantly better than after the first filtering (median AUC 0.76).
  • the ROC curve of the gene mutation algorithm model is shown in Figure 7B. After 5-fold cross-validation, the median AUC value of the gene mutation algorithm model is 0.85. Similarly, the performance of the model after secondary filtering is significantly better than after the first filtering (median AUC 0.69). At the same time, because the number of methylation-site and gene features is reduced, the time cost of building the algorithm model after feature filtering is further reduced.
  • multiply the probability that the patient's prognosis is category I as predicted by the methylation algorithm model by the probability that the patient's prognosis is category I as predicted by the gene mutation algorithm model, and multiply the probability that the patient's prognosis is category II as predicted by the methylation algorithm model by the probability that the patient's prognosis is category II as predicted by the gene mutation algorithm model.
  • the resulting numerical matrix for category I and category II is then used as the feature, replacing the methylation level values of the methylation sites or the gene mutation values in the two-dimensional undirected link network.
  • a fusion algorithm model is then constructed, and 5-fold cross-validation is used to evaluate the model.
  • the ROC curve of the fusion algorithm model is shown in Figure 8. After 5-fold cross-validation, the median AUC value of the fusion algorithm model is 0.99, indicating that the performance of the fusion algorithm model is superior to that of the separate methylation algorithm model and gene mutation algorithm model after secondary filtering.
  • to determine the molecular marker system, the 1,000 methylation-site columns of the methylation data set, or the 250 gene columns of the gene mutation data set, are traversed and shuffled one column at a time; the shuffled data replaces the feature matrix in the two-dimensional undirected link network, and the algorithm model is rebuilt using the same 5-fold cross-validation. After each column of methylation site data or gene mutation data is shuffled, the mean absolute error between all predicted values of the model and the true values is evaluated and then averaged over the five data splits, and the 10 methylation sites or genes with the largest resulting values are taken as the clinical molecular marker system.
  • the calculation method is as follows: c1, c2, c3, c4, and c5 are the mean absolute error values obtained from the 5 cross-validation folds respectively, and mean is the average of these 5 mean absolute error values.
  • the clinical molecular marker system of the top 10 methylation sites is shown in Figure 9A
  • the clinical molecular marker system of the top 10 gene mutations is shown in Figure 9B.
  • the methylation sites involved are cg14419975, cg12886942, cg05922253, cg18525352, cg10375890, cg19019537, cg07513622, cg26646370, cg10762626, and cg14745270.
  • the genes involved are DOCK2, ANK3, KMT2B, CDH23, CFH, LAMA2, ABCA4, PLXNB2, ABCA10, and ARHGAP31.
  • the disease analysis method provided by the embodiments of the present disclosure can judge the prognosis of patients more accurately through cross-integration analysis of DNA methylation and gene mutations. Through the calculation method of the two-dimensional link network, a local network of patients with intrinsic connections can be established, so that similar diagnosis and treatment plans can be applied to the patients in the local network and the patient's prognosis can be judged more effectively. Through the feature filtering algorithm based mainly on low-variance filtering and recursive dimensionality reduction, the accuracy of the final results can be effectively improved (running time is reduced and accuracy is improved after feature filtering).
  • An embodiment of the present disclosure also provides a disease analysis device, including a memory and a processor connected to the memory, where the memory is used to store instructions and the processor is configured to perform, based on the instructions stored in the memory, the steps of the disease analysis method described in any embodiment of the present disclosure.
  • the disease analysis device may include a processor 1110, a memory 1120, and a bus system 1130, where the processor 1110 and the memory 1120 are connected through the bus system 1130, the memory 1120 is used to store instructions, and the processor 1110 is configured to execute the instructions stored in the memory 1120 to: obtain first omics data and second omics data of the patient, the first omics including a plurality of first sites and the second omics including a plurality of second sites; and input the first omics data and the second omics data into the fusion algorithm model to obtain the patient's prediction results, where the fusion algorithm model is formed by fusing the first omics model and the second omics model, the first omics model is built from the sample data of the first omics, and the second omics model is built from the sample data of the second omics.
  • the processor 1110 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • Memory 1120 may include read-only memory and random access memory and provides instructions and data to the processor 1110. A portion of memory 1120 may also include non-volatile random access memory. For example, memory 1120 may also store device type information.
  • bus system 1130 may also include a power bus, a control bus, a status signal bus, etc.
  • the various buses are labeled as bus system 1130 in FIG. 11 .
  • the processing performed by the processing device may be completed by instructions in the form of hardware integrated logic circuits or software in the processor 1110 . That is to say, the method steps of the embodiments of the present disclosure may be implemented by a hardware processor, or may be executed by a combination of hardware and software modules in the processor.
  • Software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media.
  • the storage medium is located in the memory 1120.
  • the processor 1110 reads the information in the memory 1120 and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the disease analysis method described in any embodiment of the present disclosure is implemented.
  • the method of performing prognosis analysis by executing the executable instructions is basically the same as the disease analysis method provided by the above embodiments of the present disclosure, and is not described again here.
  • various aspects of the disease analysis method provided by this application can also be implemented in the form of a program product that includes program code. When the program product is run on a computer device, the program code causes the computer device to execute the steps of the disease analysis method according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may execute the disease analysis method described in the embodiments of the present application.
  • the program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • an embodiment of the present disclosure also provides a training method for a disease analysis model, including:
  • Step 1201: Perform data preprocessing on the sample data of the first omics and the sample data of the second omics respectively;
  • Step 1202: Perform feature filtering on the sample data of the first omics and the sample data of the second omics respectively;
  • Step 1203: Calculate the first similarity matrix between the sample data of the first omics and construct a first undirected link network based on the calculated first similarity matrix; calculate the second similarity matrix between the sample data of the second omics and construct a second undirected link network based on the calculated second similarity matrix;
  • Step 1204 Construct a first omics model based on the constructed first undirected link network and the characteristic values of the first omics; based on the constructed second undirected link network and the characteristic values of the second omics, Build a second omics model;
  • Step 1205 Multiply the probability values of the same predicted category in the first omics model and the second omics model to obtain a product probability matrix
  • Step 1206 Construct a fusion algorithm model based on the two-dimensional undirected link network and the product probability matrix.
  • the two-dimensional undirected link network is the first undirected link network or the second undirected link network.
  • the training method may further include:
  • For each of the multiple first sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected first site, and combine the shuffled sample data of that first site with the sample data of the other first sites to form new feature values of the first omics model;
  • rebuild the first omics model; evaluate the mean absolute error between the prediction results of the rebuilt first omics model and the real results;
  • the shuffled first sites corresponding to the N1 largest mean absolute errors are used as candidate molecular markers, where N1 is a natural number greater than or equal to 1.
  • the training method may also include:
  • For each of the multiple first sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected first site, and combine the shuffled sample data of that first site with the sample data of the other first sites to form new feature values of the first omics; use K-fold cross-validation to rebuild the first omics model K times; evaluate the mean absolute error between the prediction results of each of the K rebuilt first omics models and the real results, and calculate the mean of the K mean absolute errors;
  • the shuffled first sites corresponding to the N1 largest means of the K mean absolute errors are used as candidate molecular markers, where N1 is a natural number greater than or equal to 1, and K is a natural number greater than 1.
  • the training method may also include:
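  • For each of the multiple second sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected second site, and combine the shuffled sample data of that second site with the sample data of the other second sites to form new feature values of the second omics model; rebuild the second omics model based on the new feature values of the second omics model and the second undirected link network; evaluate the mean absolute error between the prediction results of the rebuilt second omics model and the real results;
  • the shuffled second sites corresponding to the N2 largest mean absolute errors are used as candidate molecular markers, where N2 is a natural number greater than or equal to 1.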
  • the training method may also include:
  • For each of the multiple second sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected second site, and combine the shuffled sample data of that second site with the sample data of the other second sites to form new feature values of the second omics;
  • use K-fold cross-validation to rebuild the second omics model K times; evaluate the mean absolute error between the prediction results of each of the K rebuilt second omics models and the real results, and calculate the mean of the K mean absolute errors;
  • the shuffled second sites corresponding to the N2 largest means of the K mean absolute errors are used as candidate molecular markers, where N2 is a natural number greater than or equal to 1, and K is a natural number greater than 1.
  • An embodiment of the present disclosure also provides a training device for a disease analysis model, including a memory and a processor connected to the memory, where the memory is used to store instructions and the processor is configured to execute, based on the instructions stored in the memory, the steps of the training method for a disease analysis model described in any embodiment of the present disclosure.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the training method for a disease analysis model described in any embodiment of the present disclosure is implemented.
  • various aspects of the training method for the disease analysis model provided by this application can also be implemented in the form of a program product that includes program code. When the program product is run on a computer device, the program code causes the computer device to execute the steps of the training method for the disease analysis model according to the various exemplary embodiments of the present application described above; for example, the computer device can execute the training method for the disease analysis model described in the embodiments of the present application.
  • the program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, tapes, disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

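A disease analysis method, and a training method and device for a disease analysis model. The disease analysis method includes: obtaining first omics data and second omics data of a patient, the first omics including a plurality of first sites and the second omics including a plurality of second sites; and inputting the first omics data and the second omics data into a fusion algorithm model to obtain a prediction result for the patient, where the fusion algorithm model is constructed by fusing a first omics model and a second omics model, the first omics model is constructed from sample data of the first omics, and the second omics model is constructed from sample data of the second omics.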

Description

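Disease Analysis Method, and Training Method and Device for Disease Analysis Model
Technical Field
Embodiments of the present disclosure relate to, but are not limited to, the field of biological information technology, and in particular to a disease analysis method, and a training method and device for a disease analysis model.
Background
Some tumors are highly invasive and patients have a poor prognosis. How to improve the accuracy of patient prognosis, establish an effective evaluation system, and provide personalized treatment guidance for patients is a matter of great concern in current national policy planning and scientific research. The mechanism of tumor development is complex and is affected by multiple factors at the genetic and epigenetic levels; deoxyribonucleic acid (DNA) methylation and gene mutations are closely related to the occurrence and development of tumors.
Current tumor prognosis methods have the following problems: 1) A single-omics evaluation system considers a limited set of factors and cannot give a comprehensive evaluation, while multi-omics evaluation systems integrate the data only superficially and do not effectively exploit the advantages of multiple factors; both affect evaluation accuracy. 2) Patients with similar clinical presentations are not effectively linked, although interrelated patients may be consistent in disease diagnosis, treatment, and prognosis.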
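Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
An embodiment of the present disclosure provides a disease analysis method, including:
obtaining first omics data and second omics data of a patient, the first omics including a plurality of first sites and the second omics including a plurality of second sites;
inputting the first omics data and the second omics data into a fusion algorithm model to obtain a prediction result for the patient, where the fusion algorithm model is constructed by fusing a first omics model and a second omics model, the first omics model is constructed from sample data of the first omics, and the second omics model is constructed from sample data of the second omics.
An embodiment of the present disclosure also provides a disease analysis device, including a memory and a processor connected to the memory, where the memory is used to store instructions and the processor is configured to perform, based on the instructions stored in the memory, the steps of the disease analysis method described in any embodiment of the present disclosure.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the disease analysis method described in any embodiment of the present disclosure is implemented.
An embodiment of the present disclosure also provides a training method for a disease analysis model, including:
performing data preprocessing on the sample data of the first omics and the sample data of the second omics respectively;
performing feature filtering on the sample data of the first omics and the sample data of the second omics respectively;
calculating a first similarity matrix between the sample data of the first omics and constructing a first undirected link network based on the calculated first similarity matrix; calculating a second similarity matrix between the sample data of the second omics and constructing a second undirected link network based on the calculated second similarity matrix;
constructing a first omics model based on the constructed first undirected link network and the feature values of the first omics; constructing a second omics model based on the constructed second undirected link network and the feature values of the second omics;
multiplying the probability values of the same predicted category in the first omics model and the second omics model to obtain a product probability matrix;
constructing a fusion algorithm model based on a two-dimensional undirected link network and the product probability matrix, where the two-dimensional undirected link network is the first undirected link network or the second undirected link network.
An embodiment of the present disclosure also provides a training device for a disease analysis model, including a memory and a processor connected to the memory, where the memory is used to store instructions and the processor is configured to perform, based on the instructions stored in the memory, the steps of the training method for a disease analysis model described in any embodiment of the present disclosure.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the training method for a disease analysis model described in any embodiment of the present disclosure is implemented.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.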
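Brief Description of the Drawings
The drawings are used to provide a further understanding of the technical solution of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they are used to explain the technical solution of the present disclosure and do not limit it. The shapes and sizes of the components in the drawings do not reflect true proportions and are only intended to illustrate the present disclosure schematically.
Figure 1 is a schematic flowchart of a disease analysis method provided by an exemplary embodiment of the present disclosure;
Figure 2A is a schematic diagram of a methylation data set after data preprocessing provided by an exemplary embodiment of the present disclosure;
Figure 2B is a schematic diagram of a gene mutation data set after data preprocessing provided by an exemplary embodiment of the present disclosure;
Figure 3A is a schematic diagram of a methylation data set after low-variance feature filtering provided by an exemplary embodiment of the present disclosure;
Figure 3B is a schematic diagram of a gene mutation data set after low-variance feature filtering provided by an exemplary embodiment of the present disclosure;
Figure 4A is a schematic diagram of a methylation data set after low-variance feature filtering and recursive dimensionality-reduction feature filtering provided by an exemplary embodiment of the present disclosure;
Figure 4B is a schematic diagram of a gene mutation data set after low-variance feature filtering and recursive dimensionality-reduction feature filtering provided by an exemplary embodiment of the present disclosure;
Figure 5A is a schematic diagram of the first undirected link network relationship data set obtained from the methylation data set of Figure 3A;
Figure 5B is a schematic diagram of the second undirected link network relationship data set obtained from the gene mutation data set of Figure 3B;
Figure 5C is a simplified network diagram of the constructed first undirected link network or second undirected link network;
Figure 6A is the ROC curve of the methylation algorithm model constructed from the first undirected link network relationship data set of Figure 5A;
Figure 6B is the ROC curve of the gene mutation algorithm model constructed from the second undirected link network relationship data set of Figure 5B;
Figure 7A is the ROC curve of the methylation algorithm model constructed from the first undirected link network relationship methylation data set of Figure 4A after low-variance feature filtering and recursive dimensionality-reduction feature filtering;
Figure 7B is the ROC curve of the gene mutation algorithm model constructed from the second undirected link network relationship gene mutation data set of Figure 4B after low-variance feature filtering and recursive dimensionality-reduction feature filtering;
Figure 8 is the ROC curve of the fusion algorithm model created from the methylation algorithm model of Figure 7A and the gene mutation algorithm model of Figure 7B;
Figure 9A is a schematic diagram of the clinical molecular marker system of the top 10 methylation sites obtained by the disease analysis method according to an embodiment of the present disclosure;
Figure 9B is a schematic diagram of the clinical molecular marker system of the top 10 gene mutations obtained by the disease analysis method according to an embodiment of the present disclosure;
Figure 10A and Figure 10B are schematic flowcharts of two other disease analysis methods provided by exemplary embodiments of the present disclosure;
Figure 11 is a schematic structural diagram of a disease analysis device provided by an exemplary embodiment of the present disclosure;
Figure 12 is a schematic flowchart of a training method for a disease analysis model provided by an exemplary embodiment of the present disclosure.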
具体实施方式
为使本公开的目的、技术方案和优点更加清楚明白,下文中将结合附图对本公开的实施例进行详细说明。需要说明的是,在不冲突的情况下,本公开中的实施例及实施例中的特征可以相互任意组合。
除非另外定义,本公开实施例公开使用的技术术语或者科学术语应当为本公开所属领域内具有一般技能的人士所理解的通常意义。本公开实施例中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性,而只是用来区分不同的组成部分。“包括”或者“包含”等类似的词语意指出该词前面的元件或物件涵盖出现在该词后面列举的元件或者物件及其等同,而不排除其他元件或者物件。
如图1所示,本公开实施例提供了一种疾病分析方法,包括如下步骤:
步骤101、获取患者的第一组学数据和第二组学数据,第一组学包括多个第一位点,第二组学包括多个第二位点;
步骤102、将第一组学数据和第二组学数据输入融合算法模型,以得到患者的预测结果,融合算法模型根据第一组学模型和第二组学模型进行融合构建形成,第一组学模型根据第一组学的样本数据构建,第二组学模型根据第二组学的样本数据构建。
本公开实施例通过根据第一组学模型和第二组学模型建立融合算法模型,并根据融合算法模型对疾病进行预测,能更准确的对患者的疾病诊治或预后情况进行判断,避免单组学的评估体系不能全面的进行评价以及多组学的评估体系没有有效发挥出多因素条件优势的问题。本公开实施例所述的疾病可以为肿瘤,也可以为其他疾病,本公开实施例对此不作限制。
本公开实施例中,组学主要包括DNA甲基化组学、基因组学、蛋白组学和转录组学等等。
在一些示例性实施方式中,第一位点可以是:DNA甲基化位点,第二位点可以是:基因位点;或者,第一位点可以是:基因位点,第二位点可以是:DNA甲基化位点,其中,基因位点可包括的信息有突变情况和/或表达情况,然而,本公开实施例对此不作限制。
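For example, the first site may be a DNA methylation site and the second site may be a gene locus, where the information carried by the gene locus is gene mutation information. In biological systems, methylation is enzyme-catalyzed; it is involved in heavy metal modification, the regulation of gene expression, the regulation of protein function, and RNA processing. Changes in methylation are commonly associated with abnormal expression of disease genes. The specific position that a gene occupies on a chromosome is called its gene locus. Genes are numerous while chromosomes are few, so each chromosome carries many genes, arranged linearly in a single row along the chromosome. A gene mutation is a change in a gene within a cell, including single-base point mutations and multi-base duplications, insertions, and deletions. Gene mutations lead to changes in protein expression and thereby affect cell function.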
In some exemplary implementations, building the first omics model from the sample data of the first omics may include:
对第一组学的样本数据进行数据预处理;
对第一组学的样本数据进行特征过滤;
计算第一组学的样本数据之间的第一相似度矩阵,根据计算出的第一相似度矩阵构建第一无向链接网络;
根据构建的第一无向链接网络以及第一组学的特征值,构建第一组学模型。
本公开实施例中,第一无向链接网络和后文所述的第二无向链接网络均为二维无向链接网络。二维无向链接网络是节点与节点间的连线构成的二维网络,在数据角度相似的节点会利用无向线段相连,在模型构建和对未知节点预测时会综合考虑周围节点的数据特征。本公开实施例通过二维无向链接网络建立存在内在联系的患者局部网络,能针对性的对局部网络中的患者实施相似的诊治方案,且更有效的对患者的预后情况进行判断。
在一些示例性实施方式中,对第一组学的样本数据进行数据预处理,可以包括:
删除第一组学的样本数据中的一个或多个第一位点,所删除的每个第一位点至少存在一个患者样本不存在数据值。
示例性的,假设第一组学包括多个甲基化位点,那么,在对甲基化位点的样本数据进行数据预处理时,当某个甲基化位点存在一个或一个以上的患者样本未测到甲基化水平值时,将该甲基化位点删除。
在一些示例性实施方式中,对第一组学的样本数据进行数据预处理,可以包括:
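Delete one or more first sites from the sample data of the first omics, where each deleted first site lacks a data value in more than b% of the patient samples, b being a real number greater than 0;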
对第一组学的样本数据中不存在数据值的患者样本进行数据填充。
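For example, again taking a first omics that includes multiple methylation sites, if b = 20, then during data preprocessing of the methylation sample data a methylation site is deleted when more than 20% of the patient samples have no measured methylation level at that site. For each retained methylation site, any missing values are filled with the median or mean of that site, yielding the final methylation sites and forming the methylation dataset.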
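The deletion and filling rule above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `preprocess_sites`, the use of `None` to mark an unmeasured value, and the site names are all assumptions made for the example.

```python
from statistics import median

def preprocess_sites(data, max_missing_pct=20.0):
    """Drop sparsely measured sites, then fill the remaining gaps.

    data maps a site name to a list of per-patient values; None marks a
    patient whose value was not measured (an assumed encoding).
    """
    cleaned = {}
    for site, values in data.items():
        observed = [v for v in values if v is not None]
        missing_pct = 100.0 * (len(values) - len(observed)) / len(values)
        if missing_pct > max_missing_pct or not observed:
            continue  # more than b% of patients lack a value: delete the site
        fill = median(observed)  # the mean is the other option named in the text
        cleaned[site] = [fill if v is None else v for v in values]
    return cleaned

example = {
    "site_a": [0.1, 0.2, None, 0.4, 0.5],   # 20% missing: kept and filled
    "site_b": [None, None, 0.3, 0.4, 0.5],  # 40% missing: deleted
    "site_c": [0.9, 0.8, 0.7, 0.6, 0.5],    # complete: kept unchanged
}
cleaned = preprocess_sites(example, max_missing_pct=20.0)
```

Setting `max_missing_pct=0.0` reproduces the stricter variant in which a site is deleted as soon as a single patient lacks a value.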
在一些示例性实施方式中,对第一组学的样本数据进行数据预处理,包括:根据每个患者的预后情况或疾病分期情况对第一组学的样本数据中的每个患者进行分类,分类结果包括至少两类。
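For example, suppose the prediction result of the algorithm model in this embodiment is the patient's prognosis. During data preprocessing, the overall survival in years is computed from the survival days of each patient in the sample data, and the patient is classified accordingly. For instance, a patient whose overall survival is less than or equal to 2 years is assigned to class I, and a patient whose overall survival is greater than 2 years is assigned to class II. Once every patient in the sample data has been classified, the patient class is appended as the last column of the methylation dataset. This example divides patients into two classes by overall survival; in other examples, patients may be divided into three or more classes as needed, and the embodiments of the present disclosure place no limitation on this.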
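The two-class labelling rule can be written as a small helper. This is a sketch under stated assumptions: the text does not say how survival days are converted to years, so the 365-day year and the name `prognosis_class` are illustrative choices.

```python
def prognosis_class(survival_days, cutoff_years=2.0, days_per_year=365.0):
    """Return "I" for overall survival <= cutoff_years, otherwise "II"."""
    years = survival_days / days_per_year  # assumed conversion, not from the text
    return "I" if years <= cutoff_years else "II"

labels = [prognosis_class(d) for d in (100, 730, 731, 1500)]
```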
示例性的,假设本公开实施例的算法模型的预测结果为患者的疾病分期情况,那么,在进行数据预处理时,对于样本数据中的每个患者,根据患者的疾病分期情况对其进行分类,例如,可以直接用临床的分期类别来作为I、II、III等类别。对样本数据中的所有患者都判定完类别后,将判定的患者类别添加到甲基化数据集的最后一列。本公开实施例可以根据患者的疾病分期情况将患者分为两类、三类或三类以上,本公开实施例对此不作限制。
示例性的,假设经过数据预处理之后,甲基化数据集包括68个患者样本,374905个甲基化位点,如图2A所示,其中,行名表示不同的患者,第1列至第374905列表示不同的甲基化位点,第374906列表示患者类别。
在一些示例性实施方式中,第二组学模型根据第二组学的样本数据构建,可以包括:
对第二组学的样本数据进行数据预处理;
对第二组学的样本数据进行特征过滤;
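Compute the second similarity matrix between the sample data of the second omics, and build the second undirected link network from the computed second similarity matrix;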
根据构建的第二无向链接网络以及第二组学的特征值,构建第二组学模型。
在一些示例性实施方式中,第二位点可以为基因位点,然而,本公开实施例对此不作限制。
在一些示例性实施方式中,对第二组学的样本数据进行数据预处理,可以包括:
筛选所设类型的基因突变,针对每个患者样本,如果该患者样本在某个基因上存在基因突变则标记为1,如果不存在基因突变则标记为0,形成基因突变数据集;或者,统计该患者样本在某个基因上存在的突变个数并标记为n,形成基因突变数据集。
示例性的,假设经过数据预处理之后,基因突变数据集包括68个患者样本,17513个基因位点,如图2B所示,其中,行名表示不同的患者,第1列至第17513列表示不同的基因位点,第17514列表示患者类别。基因突变数据集中的患者类别与甲基化数据集中对应的患者类别相同。
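A minimal sketch of how the 0/1 (or count) gene mutation matrix described above might be assembled. The input format, a list of `(patient, gene)` pairs with one pair per retained mutation, and the function name `build_mutation_matrix` are assumptions made for illustration.

```python
def build_mutation_matrix(records, patients, genes, count=False):
    """Build a patients x genes matrix from (patient, gene) mutation records.

    With count=False each cell is 1 if the patient has any mutation in the
    gene and 0 otherwise; with count=True each cell is the mutation count n.
    """
    row_of = {p: i for i, p in enumerate(patients)}
    col_of = {g: j for j, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in patients]
    for patient, gene in records:
        if patient in row_of and gene in col_of:
            i, j = row_of[patient], col_of[gene]
            matrix[i][j] = matrix[i][j] + 1 if count else 1
    return matrix

records = [("P1", "DOCK2"), ("P1", "DOCK2"), ("P2", "ANK3")]
binary = build_mutation_matrix(records, ["P1", "P2", "P3"], ["DOCK2", "ANK3"])
counts = build_mutation_matrix(records, ["P1", "P2", "P3"], ["DOCK2", "ANK3"], count=True)
```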
在一些示例性实施方式中,对第一组学的样本数据进行特征过滤,可以包括:
对每个第一位点的数据计算方差;
比较计算出的方差与预设第一方差阈值的大小;
删除计算出的方差小于预设第一方差阈值的第一位点。
在一些示例性实施方式中,对第二组学的样本数据进行特征过滤,可以包括:
对每个第二位点的数据计算方差;
比较计算出的方差与预设第二方差阈值的大小;
删除计算出的方差小于预设第二方差阈值的第二位点。
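Through feature filtering, the disease analysis method of the embodiments of the present disclosure determines the candidate features that give the subsequent models the best performance. The preset first and second variance thresholds can be set according to the actual datasets of the first sites and second sites. For example, the preset first variance threshold may be chosen so that, after the first sites whose variance is below it are deleted, the remaining first sites amount to between 0.5% and 2% of the first sites before deletion; likewise, the preset second variance threshold may be chosen so that, after the second sites whose variance is below it are deleted, the remaining second sites amount to between 0.5% and 2% of the second sites before deletion. The preset first variance threshold and the preset second variance threshold can be adjusted dynamically according to the final model accuracy.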
示例性的,仍以第一位点为DNA甲基化位点、第二位点为基因位点为例,针对甲基化数据集每一列甲基化位点数据计算方差,依据预先设定的第一方差阈值对低方差特征的甲基化位点进行过滤。针对基因位点突变数据集每一列基因突变情况计算方差,依据预先设定的第二方差阈值对低方差特征的基因位点进行过滤。
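Filtering out the methylation sites or gene loci whose variance is below the preset variance threshold (the preset first variance threshold or the preset second variance threshold) can be expressed by Formula 1:
$$\mathrm{Column}_a=\{\,n_i \mid \delta_{n_i}\ge\varepsilon\,\}\quad\text{(Formula 1)}$$
where $\delta_{n_i}$ is the variance, across all patients, of the methylation values or gene mutation values at a given methylation site or gene locus $n_i$, and $\varepsilon$ is the preset first or second variance threshold. When $\delta_{n_i}$ is greater than or equal to $\varepsilon$, the methylation site or gene locus is retained; when $\delta_{n_i}$ is less than $\varepsilon$, it is discarded. $\mathrm{Column}_a$ is the set of feature columns remaining after filtering.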
示例性的,仍以前述的甲基化数据集和基因突变数据集为例,经过低方差特征过滤之后,甲基化数据集包括68个患者样本,2040个甲基化位点,如图3A所示。基因突变数据集包括68个患者样本,283个基因位点,如图3B所示。
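The low-variance filter can be sketched with the standard library alone. The text does not specify whether population or sample variance is used; this sketch assumes population variance (`statistics.pvariance`), and the name `low_variance_filter` and the column names are illustrative.

```python
from statistics import pvariance

def low_variance_filter(columns, threshold):
    """Keep the sites whose per-patient variance is >= threshold.

    columns maps a site name to its list of per-patient values.
    """
    return {site: values for site, values in columns.items()
            if pvariance(values) >= threshold}

columns = {
    "gene_x": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # mutated in 10%: variance 0.09
    "gene_y": [0] * 19 + [1],                  # mutated in 5%: variance 0.0475
    "gene_z": [0, 1, 0, 1],                    # mutated in 50%: variance 0.25
}
kept = low_variance_filter(columns, threshold=0.08)
```

For a 0/1 mutation column with mutated fraction p, the population variance is p(1 - p), so a threshold of 0.08 removes genes mutated in fewer than roughly 9% (or more than roughly 91%) of patients. This is consistent with the observation that most gene loci are filtered out at this step.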
在一些示例性实施方式中,对第一组学的样本数据进行特征过滤,还可以包括:
选取基模型,采用基模型与第一组学的样本数据进行多次训练,每次训练结束移除第一组学的样本数据中的x个第一位点,所述x个第一位点的权 重值为每次训练得到的所有第一位点的权重值中较低的x个权重值,x为大于或等于1的自然数,直到所剩第一位点的数量等于预设第一位点数量阈值。
在一些示例性实施方式中,基模型可以为线性回归模型、逻辑回归模型或者决策树模型。
在一些示例性实施方式中,对第二组学的样本数据进行特征过滤,还可以包括:
选取基模型,采用基模型与第二组学的样本数据进行多次训练,每次训练结束移除第二组学的样本数据中的y个第二位点,所述y个第二位点的权重值为每次训练得到的所有第二位点的权重值中较低的y个权重值,y为大于或等于1的自然数,直到所剩第二位点的数量等于预设第二位点数量阈值。
示例性的,仍以第一位点为DNA甲基化位点、第二位点为基因位点为例,针对上述特征过滤后的甲基化位点,选取一类模型算法作为基模型,遍历拟合数据集,递归过滤掉对模型贡献较低(即权重较低)的甲基化位点,直到所剩甲基化位点数等于预设第一位点数量阈值。针对上述特征过滤后的基因位点,选取一类模型算法作为基模型,遍历拟合数据集,递归过滤掉对模型贡献较低(即权重较低)的基因位点,直到所剩基因位点数等于预设第二位点数量阈值。预设第一位点数量阈值和预设第二位点数量阈值可以根据低方差特征过滤后的第一位点的数量和低方差特征过滤后的第二位点的数量进行设定。示例性的,预设第一位点数量阈值的大小可以设定为低方差特征过滤后的第一位点的数量的40%以上;预设第二位点数量阈值的大小可以设定为低方差特征过滤后的第二位点的数量的40%以上。预设第一位点数量阈值和预设第二位点数量阈值可根据最终模型精度动态调整。
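Filtering out the methylation sites or gene loci that contribute little to the model (that is, those with low weights) can be expressed by Formula 2:
$$\mathrm{Column}_b=\mathrm{Sort}(S_i),\quad i=1,2,3,\ldots\quad\text{(Formula 2)}$$
where $S_i$ is the $i$-th feature subset: the feature variables are ranked by importance (that is, by weight), the $N$ most important features are selected, and a new feature subset is constructed from them, repeating until the length of the feature subset equals the set number of features. $\mathrm{Column}_b$ is the set of feature columns remaining after further filtering of $\mathrm{Column}_a$.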
本公开实施例通过以低方差过滤和递归降维为主导的特征过滤算法,能有效提高最终结果的准确度(特征过滤后的运行时间减少且准确度提高)。
示例性的,仍以前述的甲基化数据集和基因突变数据集为例,经过递归降维特征过滤之后,甲基化数据集包括68个患者样本,1000个甲基化位点,如图4A所示。基因突变数据集包括68个患者样本,250个基因位点,如图4B所示。
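Recursive elimination as described above can be illustrated with a small self-contained sketch. The base model here is a linear regression fitted by plain gradient descent, chosen only so that the example needs no external library; the learning rate, epoch count, and function names are assumptions, and the features are assumed to lie in [0, 1] (as methylation levels and 0/1 mutation flags do).

```python
def fit_linear_weights(X, y, epochs=2000, lr=0.5):
    """Fit y ~ X.w + b by batch gradient descent and return the weights w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w

def recursive_feature_elimination(X, y, names, n_keep, step=1):
    """Repeatedly refit the base model and drop the `step` lowest-weight sites."""
    cols = list(range(len(names)))
    while len(cols) > n_keep:
        sub = [[row[c] for c in cols] for row in X]
        w = fit_linear_weights(sub, y)
        k = min(step, len(cols) - n_keep)
        weakest = set(sorted(range(len(cols)), key=lambda j: abs(w[j]))[:k])
        cols = [c for j, c in enumerate(cols) if j not in weakest]
    return [names[c] for c in cols]

# Toy data: the target depends on site_a and site_c but not on site_b.
X = [[float(a), float(b), float(c)] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [2.0 * row[0] + 1.0 * row[2] for row in X]
kept = recursive_feature_elimination(X, y, ["site_a", "site_b", "site_c"], n_keep=2)
```

Here `step` plays the role of x (or y) in the text, and `n_keep` the preset site-count threshold. A logistic regression or decision tree could be substituted as the base model, as long as it exposes a per-feature weight or importance.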
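In some exemplary implementations, the cosine algorithm can be used to compute the first similarity matrix between the sample data of the first omics or the second similarity matrix between the sample data of the second omics. The similarity score between patient A and patient B, computed with the cosine algorithm, can be expressed by Formula 3:
$$\cos(\theta)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}\quad\text{(Formula 3)}$$
where $A_i$ is patient A's methylation level at a given methylation site or mutation status at a given gene locus, $B_i$ is patient B's methylation level at the same site or mutation status in the same gene, and $n$ is the total number of methylation sites or the total number of genes.
The similarity score (that is, the cosine value $\cos(\theta)$) measures how different patient A and patient B are: the closer $\cos(\theta)$ is to 1, the more similar the two patients are; the closer $\cos(\theta)$ is to 0, the less similar they are.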
示例性的,假设根据甲基化位点相似度得分的统计情况,甲基化数据集的相似度阈值设定为0.8,当两个患者之间的甲基化位点相似度得分大于或等于0.8时,认为两个患者之间存在网络关系;当两个患者之间的甲基化位点相似度得分小于0.8时,认为两个患者之间不存在网络关系,从而建立第一无向链接网络。
示例性的,假设根据基因位点相似度得分的统计情况,基因突变数据集的相似度阈值设定为0.5,当两个患者之间的基因位点相似度得分大于或等于0.5时,认为两个患者之间存在网络关系;当两个患者之间的基因位点相似度得分小于0.5时,认为两个患者之间不存在网络关系,从而建立第二无向链接网络。
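Taking the dataset of Figure 3A as the basis, the resulting first undirected link network relationship dataset is shown in Figure 5A; taking the dataset of Figure 3B as the basis, the resulting second undirected link network relationship dataset is shown in Figure 5B.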
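The first or second similarity matrix can be built according to Formula 4, where $l_i$ and $l_j$ are the $i$-th row (patient $i$) and the $j$-th row (patient $j$) of the methylation dataset or gene locus dataset, $i\ne j$, and $S(l_i,l_j)$ is the similarity score between patient $i$ and patient $j$ in that dataset:
$$\mathrm{Link}(l_i,l_j)=\begin{cases}1,& S(l_i,l_j)\ge\varepsilon\\ 0,& S(l_i,l_j)<\varepsilon\end{cases}\qquad i\ne j\quad\text{(Formula 4)}$$
When the similarity score $S(l_i,l_j)$ is greater than or equal to the set similarity threshold $\varepsilon$, patient $i$ and patient $j$ have a network relationship; otherwise they do not. A simplified diagram of the resulting first or second undirected link network is shown in Figure 5C, where A, B, C, ... are patient identifiers, $A_{Hi}$ is patient A's methylation value at methylation site $i$ with $i=1,2,3,\ldots,b_h$, and $A_{Mi}$ is patient A's gene mutation status at gene locus $i$ with $i=1,2,3,\ldots,b_m$.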
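The cosine score and the thresholding step can be combined into a short sketch that returns the adjacency matrix of the patient network. The function names are illustrative, and two choices are assumptions: an all-zero row (for instance a patient with no retained mutations) is given similarity 0, and an edge is created when the score is greater than or equal to the threshold, as in the 0.8 and 0.5 examples above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two patients' feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumed convention for all-zero rows

def build_patient_network(rows, threshold):
    """Return a symmetric 0/1 adjacency matrix with patients as nodes."""
    n = len(rows)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # i != j; each unordered pair is visited once
            if cosine_similarity(rows[i], rows[j]) >= threshold:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

rows = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
adjacency = build_patient_network(rows, threshold=0.5)
```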
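In some exemplary implementations, the first or second undirected link network can be built according to Formula 5:
$$f_i^{(l+1)}=\sigma\left(\sum_{j\in N_i}\frac{1}{C_{ij}}\,f_j^{(l)}\,W^{(l)}\right)\quad\text{(Formula 5)}$$
where $f_i^{(l)}$ is the feature of node $i$ at layer $l$, $\sigma$ is a nonlinear activation function, $C_{ij}$ is a normalization factor, $N_i$ is the set of neighboring nodes of node $i$, and $W^{(l)}$ is the weight matrix of layer $l$.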
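In this embodiment, the first undirected link network is built from the first similarity matrix and the second undirected link network from the second similarity matrix; both are two-dimensional undirected link networks. Through the two-dimensional undirected link network, the embodiments of the present disclosure take into account both a node's own features and the features of its neighboring nodes when building the algorithm model, and evaluate model performance by multi-fold cross-validation.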
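How a node's own features and its neighbors' features are combined can be illustrated with one graph-convolution-style propagation step. This is a sketch under several assumptions that the text does not fix: self-loops are added so that a node's own feature enters the sum, the normalization factor is taken as the square root of the product of the two node degrees, the activation is ReLU, and a weight matrix `weight` is applied after aggregation.

```python
import math

def gcn_layer(adjacency, features, weight):
    """One propagation step: normalized neighbor sum, linear map, ReLU."""
    n = len(adjacency)
    # Add self-loops so each node also aggregates its own feature (assumption).
    a = [[1 if i == j else adjacency[i][j] for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a]
    out = []
    for i in range(n):
        agg = [0.0] * len(features[0])
        for j in range(n):
            if a[i][j]:
                c_ij = math.sqrt(degree[i] * degree[j])  # assumed normalization
                for k, value in enumerate(features[j]):
                    agg[k] += value / c_ij
        projected = [sum(agg[k] * weight[k][m] for k in range(len(agg)))
                     for m in range(len(weight[0]))]
        out.append([max(0.0, v) for v in projected])  # ReLU activation
    return out

adjacency = [[0, 1], [1, 0]]
hidden = gcn_layer(adjacency, [[1.0], [3.0]], weight=[[1.0]])
```

In a trained model the weights would be learned and several such layers stacked before a classification output; the sketch only shows the neighborhood aggregation itself.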
示例性的,根据图5A的第一无向链接网络关系数据集构建甲基化算法模型,得到的ROC曲线图如图6A所示,可以发现在5折交叉验证后,甲基化算法模型的AUC中位值为0.76。根据图5B的第二无向链接网络关系数据集构建基因突变算法模型,得到的ROC曲线图如图6B所示,可以发现在5折交叉验证后,基因突变算法模型的AUC中位值为0.69。
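The receiver operating characteristic (ROC) curve is also called the sensitivity curve. The name reflects that every point on the curve represents the same sensitivity: all are responses to the same signal stimulus, obtained under different decision criteria. The ROC curve plots the false positive rate (FPR) on the horizontal axis against the true positive rate (TPR) on the vertical axis, tracing the different results a subject produces under a given stimulus condition as the decision criterion varies. The AUC (area under the ROC curve) is the area lying beneath the ROC curve; a larger AUC indicates better performance. In general, AUC = 0.5 indicates that the diagnostic method has no diagnostic value, an AUC in (0.5, 0.7] indicates low accuracy, an AUC in (0.7, 0.9] indicates moderate accuracy, and an AUC above 0.9 indicates high accuracy.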
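For intuition, the AUC can be computed directly as the fraction of (positive, negative) patient pairs in which the positive patient receives the higher score, with ties counting one half. This pairwise form is equivalent to the area under the ROC curve; the function name `roc_auc` and the toy scores are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly: 0.75
```

In a 5-fold cross-validation, one such value is obtained per fold, and the median of the five is the figure reported in the examples.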
示例性的,仍以前述的甲基化数据集和基因突变数据集为例,根据图4A的二次特征筛选(低方差过滤和递归降维)过后的甲基化数据集,建立第一无向链接网络关系数据集,进而构建甲基化算法模型,得到的ROC曲线图如图7A所示,可以发现在5折交叉验证后,甲基化算法模型的AUC中位值为0.82,可以看到二次过滤后模型的性能明显优于第一次过滤。根据图4B的二次特征筛选(低方差过滤和递归降维)过后的基因突变数据集,建立第二无向链接网络关系数据集,进而构建基因突变算法模型,得到的ROC曲线图如图7B所示,可以发现在5折交叉验证后,基因突变算法模型的AUC中位值为0.85,同样在二次过滤后模型的性能明显优于第一次过滤。同时,由于甲基化位点和基因位点特征数的减少,在特征过滤后算法模型构建的时间成本进一步降低。
在一些示例性实施方式中,融合算法模型根据第一组学模型和第二组学模型进行融合构建形成,可以包括:
将第一组学模型与第二组学模型中预测类别相同的概率值相乘,得到乘积概率矩阵;
根据二维无向链接网络和乘积概率矩阵,构建融合算法模型,二维无向链接网络可以为第一无向链接网络或第二无向链接网络。
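In this embodiment, once the methylation algorithm model and the gene mutation algorithm model have been built, the two models can be further fused into a fusion algorithm model. The fusion algorithm model integrates multiple models into one final model whose performance exceeds that of any single model.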
示例性的,仍以第一位点为DNA甲基化位点、第二位点为基因位点为例,将甲基化算法模型和基因突变算法模型预测类别相同的概率值相乘,形成乘积概率矩阵,其中,矩阵宽度为所需预测类别数。
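Assuming there are two prediction classes, the product probability matrix is computed according to Formula 6:
$$P=\begin{bmatrix}P^{H}_{I}\times P^{M}_{I} & P^{H}_{II}\times P^{M}_{II}\end{bmatrix}\quad\text{(Formula 6)}$$
where $P^{H}_{I}\times P^{M}_{I}$ is the product of the probability that the methylation algorithm model predicts the patient's prognosis class as class I and the probability that the gene mutation algorithm model predicts the patient's prognosis class as class I, and $P^{H}_{II}\times P^{M}_{II}$ is the product of the probability that the methylation algorithm model predicts class II and the probability that the gene mutation algorithm model predicts class II. Note that the matrix width grows with the number of classes, for example $\begin{bmatrix}P^{H}_{I}\times P^{M}_{I} & P^{H}_{II}\times P^{M}_{II} & P^{H}_{III}\times P^{M}_{III}\end{bmatrix}$ for three classes.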
将获得的乘积概率矩阵作为每个患者的新特征值,替换掉二维链接网络的甲基化位点甲基化水平值和基因突变情况原始特征,并以先前构建的第一无向链接网络或第二无向链接网络为基准,构建融合算法模型,并利用交叉验证方法评估模型性能。
示例性的,仍以前述的甲基化数据集和基因突变数据集为例,融合算法模型的ROC曲线图如图8所示,可以发现在5折交叉验证后,融合算法模型的AUC中位值为0.99,说明融合算法模型的性能优于二次过滤后单独的甲基化算法模型和基因突变算法模型。
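The fusion features are simply the class-wise products of the two models' predicted probabilities, one row per patient. A minimal sketch, with illustrative names and made-up probabilities:

```python
def product_probability_matrix(probs_first, probs_second):
    """Multiply, class by class, the probabilities predicted by two models.

    Each argument is a list of per-patient rows; a row holds one predicted
    probability per class, in the same class order for both models.
    """
    return [[p * q for p, q in zip(row_first, row_second)]
            for row_first, row_second in zip(probs_first, probs_second)]

methylation_probs = [[0.75, 0.25], [0.5, 0.5]]  # hypothetical model outputs
mutation_probs = [[0.5, 0.5], [0.25, 0.75]]     # hypothetical model outputs
fused_features = product_probability_matrix(methylation_probs, mutation_probs)
```

Each row of `fused_features` then replaces that patient's original methylation or mutation features as the node feature in the previously built patient network, and the fusion model is trained on that network. Note that the rows are not renormalized, so they no longer sum to 1.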
在一些示例性实施方式中,该疾病分析方法还可以包括:
对多个第一位点,逐个执行如下操作:随机打乱当前选择的第一位点的样本数据,将随机打乱的第一位点的样本数据与其他第一位点的样本数据组成第一组学的新的特征值,根据第一组学的新的特征值和第一无向链接网络重新构建第一组学模型;评估重新构建的第一组学模型的预测结果与真实结果的平均绝对误差;
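The randomly shuffled first sites corresponding to the N1 largest mean absolute errors are taken as candidate molecular markers, where N1 is a natural number greater than or equal to 1.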
本公开实施例中,候选分子标志物指的是临床上与疾病发生发展最相关的分子标志物,可以是基因、甲基化位点、蛋白等生物学物质,在疾病诊治上有较好的参考价值。
在一些示例性实施方式中,该疾病分析方法还可以包括:
对多个第一位点,逐个执行如下操作:随机打乱当前选择的第一位点的样本数据,将随机打乱的第一位点的样本数据与其他第一位点的样本数据组成第一组学的新的特征值,通过K折交叉验证重新构建K次第一组学模型;评估重新构建的K个第一组学模型的预测结果与真实结果的平均绝对误差,计算K个平均绝对误差的均值;
将前N1个较大的K个平均绝对误差的均值对应的随机打乱的第一位点作为候选分子标志物,其中,N1为大于或等于1的自然数,K为大于1的自然数。
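In some exemplary implementations, the mean of the K mean absolute errors can be computed according to Formula 7:
$$\overline{\mathrm{MAE}}_{l}=\frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{n}\sum_{i=1}^{n}\left|\,y\_true_{i}-y\_pred_{i}^{\,shuffle(l)}\right|\right)\quad\text{(Formula 7)}$$
where $y\_true$ is the true value, $y\_pred^{\,shuffle(l)}$ is the prediction obtained after randomly shuffling column $l$ of the dataset, $n$ is the total number of predictions, and $K$ is the number of folds into which the dataset is split for cross-validation.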
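The shuffle-and-score procedure can be sketched generically. Everything model-specific here is a stand-in: `one_nn_predict` is a toy nearest-neighbor classifier used in place of the network-based omics model, folds are assigned by index modulo K, and the function names and the seed are illustrative assumptions.

```python
import random

def one_nn_predict(train_X, train_y, test_X):
    """Toy stand-in model: predict the label of the nearest training row."""
    predictions = []
    for x in test_X:
        nearest = min(range(len(train_X)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
        predictions.append(train_y[nearest])
    return predictions

def shuffled_feature_mae(X, y, col, k_folds, fit_predict, rng):
    """Shuffle one column, rebuild the model per fold, and average the K MAEs."""
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    Xs = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    fold_maes = []
    for k in range(k_folds):
        test_idx = [i for i in range(len(X)) if i % k_folds == k]
        train_idx = [i for i in range(len(X)) if i % k_folds != k]
        preds = fit_predict([Xs[i] for i in train_idx], [y[i] for i in train_idx],
                            [Xs[i] for i in test_idx])
        fold_maes.append(sum(abs(y[i] - p) for i, p in zip(test_idx, preds)) / len(test_idx))
    return sum(fold_maes) / k_folds

def rank_candidate_markers(X, y, names, k_folds, top_n, fit_predict=one_nn_predict, seed=0):
    """Return the top_n sites whose shuffling hurts the predictions most."""
    rng = random.Random(seed)
    scores = {name: shuffled_feature_mae(X, y, col, k_folds, fit_predict, rng)
              for col, name in enumerate(names)}
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:top_n], scores

# Toy data: the label equals the first column; the second column is constant.
X = [[float(i % 2), 0.0] for i in range(8)]
y = [i % 2 for i in range(8)]
top, scores = rank_candidate_markers(X, y, ["signal_site", "constant_site"], k_folds=4, top_n=1)
```

Shuffling the constant column changes nothing, so its averaged error stays at the unshuffled baseline (0 in this toy case), whereas shuffling an informative column can only leave the error unchanged or raise it. Sites are ranked by that averaged error and the top N are kept as candidate markers.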
在一些示例性实施方式中,该疾病分析方法还可以包括:
对多个第二位点,逐个执行如下操作:随机打乱当前选择的第二位点的样本数据,将随机打乱的第二位点的样本数据与其他第二位点的样本数据组成第二组学的新的特征值,根据第二组学的新的特征值和第二无向链接网络重新构建第二组学模型;评估重新构建的第二组学模型的预测结果与真实结果的平均绝对误差;
将前N2个较大的平均绝对误差对应的随机打乱的第二位点作为候选分子标志物,其中,N2为大于或等于1的自然数。
在一些示例性实施方式中,该疾病分析方法还可以包括:
对多个第二位点,逐个执行如下操作:随机打乱当前选择的第二位点的样本数据,将随机打乱的第二位点的样本数据与其他第二位点的样本数据组成第二组学的新的特征值,通过K折交叉验证重新构建K次第二组学模型;评估重新构建的K个第二组学模型的预测结果与真实结果的平均绝对误差,计算K个平均绝对误差的均值;
将前N2个较大的K个平均绝对误差的均值对应的随机打乱的第二位点作为候选分子标志物,其中,N2为大于或等于1的自然数,K为大于1的自然数。
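In some exemplary implementations, the mean of the K mean absolute errors can be computed according to Formula 7:
$$\overline{\mathrm{MAE}}_{l}=\frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{n}\sum_{i=1}^{n}\left|\,y\_true_{i}-y\_pred_{i}^{\,shuffle(l)}\right|\right)\quad\text{(Formula 7)}$$
where $y\_true$ is the true value, $y\_pred^{\,shuffle(l)}$ is the prediction obtained after randomly shuffling column $l$ of the dataset, $n$ is the total number of predictions, and $K$ is the number of folds into which the dataset is split for cross-validation.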
示例性的,仍以前述的甲基化数据集和基因突变数据集为例,根据前述的疾病分析方法,得到甲基化前10的临床分子标志物体系如图9A所示,基因突变前10的临床分子标志物体系如图9B所示,其中,c1,c2,c3,c4,c5分别为5次交叉验证所获得的平均绝对误差值,mean为5个平均绝对误差值的平均值,涉及到的甲基化位点有cg14419975,cg12886942,cg05922253,cg18525352,cg10375890,cg19019537,cg07513622,cg26646370,cg10762626,cg14745270。涉及到的基因有DOCK2,ANK3,KMT2B,CDH23,CFH,LAMA2,ABCA4,PLXNB2,ABCA10,ARHGAP31。
下面以本公开实施例提供的疾病分析方法中第一位点为DNA甲基化位点、第二位点为基因位点为例,详细说明本公开实施例的技术方案。
本公开实施例提供的疾病分析方法,通过对DNA甲基化位点和基因位点进行综合分析,可构建肿瘤患者诊治的临床分子标志物体系和系统性算法,如图10A和图10B所示,该方法主要包含以下步骤:
S1. Preprocess the patients' DNA methylation and gene mutation data together with the corresponding overall survival of each patient;
S2、分别对DNA甲基化/基因突变数据的甲基化位点/基因等特征值进行特征过滤;
S3、分别利用相似度算法,通过DNA甲基化/基因突变数据特征值构建患者的相似度矩阵,根据相似度得分构建以患者为节点的二维链接网络;
S4、以患者的二维链接网络和对应特征值作为输入对象,构建包含输入特征节点、中间层节点、输出结果节点的算法模型,评估算法模型的精度。
S5、将DNA甲基化和基因突变组学数据的算法模型的各分类预测概率进行特征集成,融合两个组学进行算法模型构建,评估算法模型的精度,并比较融合算法模型和DNA甲基化/基因突变的单组学算法模型的精度高低。
S6、利用消融算法评估DNA甲基化/基因突变每个特征随机打乱后的损失值,选取损失值较高的几类特征作为临床分子标志物体系。
本公开实施例的疾病分析方法,能更准确的对患者的预后情况进行判断,并建立存在内在联系的患者局部网络。下面对各个步骤进行详细说明。
S1对甲基化数据和基因突变数据进行数据预处理。
S1.1当甲基化位点存在一个或一个以上患者样本未测到甲基化水平值时,对该位点进行删除,或保留80%以上患者存在数据的甲基化位点,若有缺失数据,采用中位值或平均值进行填充,获得最终甲基化位点,形成甲基化数据集。
S1.2筛选所设类型的基因突变,针对每一位患者样本,如果该样本在某一基因上存在突变即标记为1,如果不存在突变即标记为0,形成基因突变数据集;或统计该样本在某一基因存在的突变个数记为n,形成基因突变数据集。
S1.3将患者的预后情况转换为数字型,构建甲基化-预后情况,基因突变-预后情况两类数据集。
S2对甲基化数据和基因突变数据分别进行特征过滤。
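S2.1 Compute the variance of each column of methylation site data in the methylation dataset, and filter out low-variance methylation sites according to the set threshold.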
S2.2针对基因突变数据集每一列基因突变情况计算方差,依据设定的特定阈值对低方差特征的基因进行过滤。
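Optionally, the filtering is performed as follows:
$$\mathrm{Column}_a=\{\,n_i \mid \delta_{n_i}\ge\varepsilon\,\}\quad\text{(Formula 1)}$$
where $\delta_{n_i}$ is the variance, across all patients, of the methylation values or gene mutation values at a given methylation site or gene $n_i$, and $\varepsilon$ is the set variance threshold. When $\delta_{n_i}$ is greater than or equal to the set variance threshold, the methylation site or gene is retained. $\mathrm{Column}_a$ is the set of feature columns remaining after filtering.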
S2.3可选的,针对S2.1过滤后的甲基化位点,选取一类模型算法作为基模型,遍历拟合数据集,递归过滤掉对模型贡献较低(即权重较低)的甲基化位点,直到所剩甲基化位点数等于预设特征数量。
S2.4可选的,针对S2.2过滤后的基因,选取一类模型算法作为基模型,遍历拟合数据集,递归过滤掉对模型贡献较低(即权重较低)的基因位点,直到所剩基因位点数等于预设特征数量。
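Optionally, the filtering is performed as follows:
$$\mathrm{Column}_b=\mathrm{Sort}(S_i),\quad i=1,2,3,\ldots\quad\text{(Formula 2)}$$
where $S_i$ is the $i$-th feature subset: the feature variables are ranked by importance, the $N$ most important features are selected, and a new feature subset is constructed from them, repeating until the length of the feature subset equals the set number of features. $\mathrm{Column}_b$ is the set of feature columns remaining after further filtering of $\mathrm{Column}_a$.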
S3对特征过滤后的甲基化数据集和基因突变数据集进行患者相似度矩阵构建,进而构建以患者为节点的二维无向链接网络。
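The similarity is computed between any two rows of patient data in the methylation or gene mutation dataset; when the similarity is greater than or equal to the set threshold, the two patients are considered to have a network relationship, and the two-dimensional undirected link network is built on this basis.
Optionally, the computation is as follows:
The cosine algorithm is used to compute the similarity score between every two patients in the methylation or gene mutation dataset, as expressed by Formula 3:
$$\cos(\theta)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}\quad\text{(Formula 3)}$$
where $A_i$ is patient A's methylation level at a given methylation site or mutation status at a given gene locus, $B_i$ is patient B's methylation level at the same site or mutation status in the same gene, and $n$ is the total number of methylation sites or the total number of gene loci.
$$\mathrm{Link}(l_i,l_j)=\begin{cases}1,& S(l_i,l_j)\ge\varepsilon\\ 0,& S(l_i,l_j)<\varepsilon\end{cases}\qquad i\ne j\quad\text{(Formula 4)}$$
where $l_i$ and $l_j$ are any two rows of patient data with $i\ne j$. When the similarity score $S(l_i,l_j)$ is greater than or equal to the set threshold $\varepsilon$, patient $i$ and patient $j$ have a network relationship; otherwise they do not. A simplified diagram of the resulting network is shown in Figure 5C, where A, B, C, ... are patient identifiers, $A_{Hi}$ is patient A's methylation value at methylation site $i$ with $i=1,2,3,\ldots,b_h$, and $A_{Mi}$ is patient A's gene mutation status at gene locus $i$ with $i=1,2,3,\ldots,b_m$.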
S4依据相似度矩阵构建甲基化算法模型和基因突变算法模型。
本步骤考虑二维无向链接网络节点自身特征和邻近节点的特征,综合构建算法模型,并利用交叉验证方法评估模型性能。
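Optionally, the computation is as follows:
$$f_i^{(l+1)}=\sigma\left(\sum_{j\in N_i}\frac{1}{C_{ij}}\,f_j^{(l)}\,W^{(l)}\right)\quad\text{(Formula 5)}$$
where $f_i^{(l)}$ is the feature of node $i$ at layer $l$, $\sigma$ is a nonlinear activation, $C_{ij}$ is a normalization factor, $N_i$ is the set of neighboring nodes of node $i$, and $W^{(l)}$ is the weight matrix of layer $l$.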
S5将甲基化算法模型和基因突变算法模型进行模型融合。
S5.1将甲基化算法模型和基因突变算法模型预测类别相同的概率值相乘,形成乘积概率矩阵,其中矩阵宽度为所需预测类别数。
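Optionally, the computation is as follows:
$$P=\begin{bmatrix}P^{H}_{I}\times P^{M}_{I} & P^{H}_{II}\times P^{M}_{II}\end{bmatrix}\quad\text{(Formula 6)}$$
where $P^{H}_{I}\times P^{M}_{I}$ is the product of the probabilities with which the methylation algorithm model and the gene mutation algorithm model predict class I, and $P^{H}_{II}\times P^{M}_{II}$ is the product of the probabilities with which the two models predict class II. Note that the matrix width grows with the number of classes, for example $\begin{bmatrix}P^{H}_{I}\times P^{M}_{I} & P^{H}_{II}\times P^{M}_{II} & P^{H}_{III}\times P^{M}_{III}\end{bmatrix}$ for three classes.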
S5.2将获得的乘积概率矩阵作为每个患者的新特征,并以先前S3构建的患者二维无向链接网络(可以为根据甲基化数据集构建的患者二维无向链接网络,也可以为根据基因突变数据集构建的患者二维无向链接网络)为基准,按S4的模型构建方法构建融合算法模型,并利用交叉验证方法评估模型性能。
S6利用消融算法构建临床分子标志物体系。
按顺序,将每一列甲基化位点甲基化水平值或基因的基因突变情况数据进行随机打乱,评估打乱后预测结果与真实结果的平均绝对误差,筛选打乱后平均绝对误差较大的前N个甲基化位点或基因位点作为候选分子标志物。
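Optionally, the computation is as follows:
$$\overline{\mathrm{MAE}}_{l}=\frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{n}\sum_{i=1}^{n}\left|\,y\_true_{i}-y\_pred_{i}^{\,shuffle(l)}\right|\right)\quad\text{(Formula 7)}$$
where $y\_true$ is the true value, $y\_pred^{\,shuffle(l)}$ is the prediction obtained after randomly shuffling column $l$ of the dataset, $n$ is the total number of predictions, and $K$ is the number of folds into which the dataset is split for cross-validation.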
下面以实际的甲基化数据集和基因突变数据集为例,来进一步详细说明本公开实施例的技术方案。
甲基化方面,当甲基化位点存在一个或一个以上患者样本未测到甲基化水平值时,对该位点进行删除,保留剩下位点。
筛选完成后的甲基化数据集如图2A所示,存在68个患者样本,374905个甲基化位点。对于患者的预后数据,通过生存天数计算获得总生存期年数,当年数小于等于2年时,判定患者预后为I类,当年数大于2年时,判定患者预后为II类。并将处理后的预后类别添加到数据集的最后一列。
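The low-variance threshold is set to 0.08. The variance of each column of the methylation dataset is computed, and methylation sites whose variance is below 0.08 are deleted. The methylation dataset after deletion is shown in Figure 3A; after filtering, it contains 68 patient samples and 2040 methylation sites.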
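Taking the dataset built above as the basis, the cosine algorithm is used to compute the similarity score between every two patients. Optionally, the computation is:
$$\cos(\theta)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
where $A_i$ is patient A's methylation level at a given methylation site, $B_i$ is patient B's methylation level at the same site, and $n$ is the total number of methylation sites.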
根据相似度得分的统计情况,甲基化数据集的阈值设定为0.8,当两个患者得分大于等于0.8时,认为两个患者之间存在网络关系。获得的网络关系数据集如图5A所示。依据前述提到的方法,构建二维链接无向网络图。以二维链接无向网络为基础,甲基化水平值情况作为特征,依据前述提到的方法构建甲基化算法模型,利用交叉验证的方法,评估甲基化算法模型的性能。
甲基化算法模型的ROC曲线图如图6A所示,可以发现在5折交叉验证后,甲基化算法模型的AUC中位值为0.76。
基因突变方面,针对每一位患者样本,如果该样本在某一基因上存在突变即标记为1,如果不存在突变即标记为0。
筛选完成后的基因突变数据集如图2B所示,存在68个患者样本,17513个基因位点。对于患者的预后数据,通过生存天数计算获得总生存期年数,当年数小于等于2年时,判定患者预后为I类,当年数大于2年时,判定患者预后为II类。并将处理后的预后类别添加到数据集的最后一列。
设定低方差阈值为0.08,计算基因突变数据集的每一列方差,对方差小于0.08的基因进行删除,删除后的基因突变数据集如图3B所示,筛选完成后,存在68个患者样本,283个基因。可以看出由于针对每一位患者样本,按该基因是否突变做二分类标记,使得部分基因的大部分数据为0,方差偏小,过滤了绝大部分位点。
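Taking the dataset built above as the basis, the cosine algorithm is used to compute the similarity score between every two patients. Optionally, the computation is:
$$\cos(\theta)=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
where $A_i$ is patient A's mutation status in a given gene, $B_i$ is patient B's mutation status in the same gene, and $n$ is the total number of genes.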
根据相似度得分的统计情况,基因突变数据集的阈值设定为0.5,当两个患者得分大于等于0.5时,认为两个患者之间存在网络关系。获得的网络关系数据集如图5B所示。依据前述提到的方法,构建二维链接无向网络图。以二维链接无向网络为基础,基因突变情况作为特征,依据前述提到的方法构建基因突变算法模型,利用交叉验证的方法,评估基因突变算法模型的性能。
基因突变算法模型的ROC曲线图如图6B所示,可以发现在5折交叉验证后,基因突变算法模型的AUC中位值为0.69。
在对甲基化数据集和基因突变数据集的特征进行过滤时,还可以在前述实施例的基础上进行进一步的特征筛选,选取线性回归方法作为基模型,设定甲基化数据集所需的特征数量为1000,基因突变数据集所需的特征数量为250,递归删除特征后使用剩余特征拟合基模型,对模型准确度进行统计,判断哪些特征组合对模型的性能贡献度最高。根据如下公式:
$$\mathrm{Column}_b=\mathrm{Sort}(S_i),\quad i=1,2,3,\ldots,a;$$
在前述实施例进行一次特征筛选数据集的基础上,分别对甲基化位点和基因位点的特征子集进行二次特征筛选。二次特征筛选过后的甲基化数据集如图4A所示,筛选完成后,存在68个患者样本,1000个甲基化位点。二次特征筛选过后的基因突变数据集如图4B所示,筛选完成后,存在68个患者样本,250个基因。
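Taking the datasets built above as the basis, the cosine algorithm is applied to the methylation site or gene locus data of every two patients to obtain their similarity. A two-dimensional undirected link network graph is built according to the distribution of the similarity scores. With the two-dimensional undirected link network as the basis and the methylation levels or gene mutation values as features, the methylation algorithm model or gene mutation algorithm model is built by the method described above (Formula 5), and its performance is evaluated by cross-validation.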
其中甲基化算法模型的ROC曲线图如图7A所示,可以发现在5折交叉验证后,甲基化算法模型的AUC中位值为0.82,可以看到二次过滤后模型的性能明显优于第一次过滤。基因突变算法模型的ROC曲线图如图7B所示,可以发现在5折交叉验证后,基因突变算法模型的AUC中位值为0.85,同样在二次过滤后模型的性能明显优于第一次过滤。同时,由于甲基化位点和基因特征数的减少,在特征过滤后算法模型构建上时间成本进一步降低。
在对甲基化算法模型和基因突变算法模型构建完成的基础上,可进一步进行两类模型的融合。可选的,根据如下公式:
$$P=\begin{bmatrix}P^{H}_{I}\times P^{M}_{I} & P^{H}_{II}\times P^{M}_{II}\end{bmatrix}$$
将甲基化算法模型中预测患者预后为I类的概率与基因突变算法模型中预测患者预后类别为I类的概率相乘,并将甲基化算法模型中预测患者预后为II类的概率与基因突变算法模型中预测患者预后类别为II类的概率相乘,随后将I类和II类的数值矩阵作为特征,替换掉二维链接网络的甲基化位点甲基化水平值和基因突变情况原始特征,进一步构建融合算法模型,并使用5折交叉验证对模型进行评价。
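The ROC curve of the fusion algorithm model is shown in Figure 8. After 5-fold cross-validation, the median AUC of the fusion algorithm model is 0.99, indicating that the fusion algorithm model outperforms both the standalone methylation algorithm model and the standalone gene mutation algorithm model obtained after the second filtering.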
进一步的,在前述实施例的基础上,对分子标志物体系进行判定,将甲基化数据集中的1000个甲基化位点数据或基因突变数据集中的250个基因突变情况数据遍历进行打乱,替换掉二维链接无向网络中的特征矩阵,同样采用5折交叉验证的方法进行算法模型构建,评估每一列甲基化位点数据或基因突变情况数据打乱后模型所有预测值与真实值间的平均绝对误差,随后对5类数据拆分方式不同的平均绝对误差再一次取平均值,取最后数值较大的前10个甲基化位点或基因作为临床分子标志物体系。可选的,计算方法如下:
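$$\overline{\mathrm{MAE}}_{l}=\frac{1}{K}\sum_{k=1}^{K}\left(\frac{1}{n}\sum_{i=1}^{n}\left|\,y\_true_{i}-y\_pred_{i}^{\,shuffle(l)}\right|\right)$$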
其中,甲基化前10的临床分子标志物体系如图9A所示,基因突变前10的临床分子标志物体系如图9B所示,c1,c2,c3,c4,c5分别为5次交叉验证所获得的平均绝对误差值,mean为5个平均绝对误差值的平均值,涉及到的甲基化位点有cg14419975,cg12886942,cg05922253,cg18525352,cg10375890,cg19019537,cg07513622,cg26646370,cg10762626,cg14745270。涉及到的基因有DOCK2,ANK3,KMT2B,CDH23,CFH,LAMA2,ABCA4,PLXNB2,ABCA10,ARHGAP31。
本公开实施例提供的疾病分析方法,通过对DNA甲基化和基因突变进行交叉整合分析,能更准确的对患者的预后情况进行判断;通过二维链接网络的计算方法,建立成体系的患者网络,能针对性的对局部网络中的患者实施相似的诊治方案,且更有效的对患者的预后情况进行判断;通过以低方差过滤和递归降维为主导的特征过滤算法,能有效提高最终结果的准确度(特征过滤后的运行时间减少且准确度提高)。
本公开实施例还提供了一种疾病分析装置,包括存储器;和连接至所述存储器的处理器,所述存储器用于存储指令,所述处理器被配置为基于存储在所述存储器中的指令,执行如本公开任一实施例所述的疾病分析方法的步骤。
如图11所示,在一个示例中,该疾病分析装置可包括:处理器1110、存储器1120和总线系统1130,其中,处理器1110和存储器1120通过总线系统1130相连,存储器1120用于存储指令,处理器1110用于执行存储器1120存储的指令,以获取患者的第一组学数据和第二组学数据,第一组学包括多个第一位点,第二组学包括多个第二位点;将第一组学数据和第二组学数据输入融合算法模型,以得到患者的预测结果,融合算法模型根据第一组学模型和第二组学模型进行融合构建形成,第一组学模型根据第一组学的样本数据构建,第二组学模型根据第二组学的样本数据构建。
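It should be understood that the processor 1110 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.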
存储器1120可以包括只读存储器和随机存取存储器,并向处理器1110提供指令和数据。存储器1120的一部分还可以包括非易失性随机存取存储器。例如,存储器1120还可以存储设备类型的信息。
总线系统1130除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图11中将各种总线都标为总线系统1130。
在实现过程中,处理设备所执行的处理可以通过处理器1110中的硬件的集成逻辑电路或者软件形式的指令完成。即本公开实施例的方法步骤可以体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等存储介质中。该存储介质位于存储器1120,处理器1110读取存储器1120中的信息,结合其硬件完成上述方法的步骤。为避免重复,这里不再详细描述。
本公开实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本公开任一实施例所述的疾病分析方法。通过执行可执行指令驱动预后分析的方法与本公开上述实施例提供的疾病分析方法基本相同,在此不做赘述。
在一些可能的实施方式中,本申请提供的疾病分析方法的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在计算机设备上运行时,所述程序代码用于使所述计算机设备执行本说明书上述描述的根据本申请各种示例性实施方式的疾病分析方法中的步骤,例如,所述计算机设备可以执行本申请实施例所记载的疾病分析方法。
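The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.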
如图12所示,本公开实施例还提供了一种疾病分析模型的训练方法,包括:
步骤1201、对第一组学的样本数据和第二组学的样本数据分别进行数据预处理;
步骤1202、对第一组学的样本数据和第二组学的样本数据分别进行特征过滤;
步骤1203、计算第一组学的样本数据之间的第一相似度矩阵,根据计算出的第一相似度矩阵构建第一无向链接网络;计算第二组学的样本数据之间的第二相似度矩阵,根据计算出的第二相似度矩阵构建第二无向链接网络;
步骤1204、根据构建的第一无向链接网络以及所述第一组学的特征值,构建第一组学模型;根据构建的第二无向链接网络以及所述第二组学的特征值,构建第二组学模型;
步骤1205、将第一组学模型与第二组学模型中预测类别相同的概率值相乘,得到乘积概率矩阵;
步骤1206、根据二维无向链接网络和乘积概率矩阵,构建融合算法模型,二维无向链接网络为所述第一无向链接网络或第二无向链接网络。
在一些示例性实施方式中,该训练方法还可以包括:
对多个第一位点,逐个执行如下操作:随机打乱当前选择的所述第一位点的样本数据,将随机打乱的所述第一位点的样本数据与其他所述第一位点的样本数据组成所述第一组学的新的特征值,重新构建第一组学模型;评估重新构建的第一组学模型的预测结果与真实结果的平均绝对误差;
将前N1个较大的平均绝对误差对应的随机打乱的第一位点作为候选分子标志物,其中,N1为大于或等于1的自然数。
在一些示例性实施方式中,该训练方法还可以包括:
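For each of the multiple first sites, perform the following operations one by one: randomly shuffle the sample data of the currently selected first site; combine the shuffled sample data of that first site with the sample data of the other first sites to form new feature values of the first omics; rebuild the first omics model K times by K-fold cross-validation; evaluate the mean absolute error between the predictions of each of the K rebuilt first omics models and the true results; and compute the mean of the K mean absolute errors;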
将前N1个较大的K个平均绝对误差的均值对应的随机打乱的第一位点作为候选分子标志物,其中,N1为大于或等于1的自然数,K为大于1的自然数。
在一些示例性实施方式中,该训练方法还可以包括:
对多个第二位点,逐个执行如下操作:随机打乱当前选择的第二位点的样本数据,将随机打乱的第二位点的样本数据与其他第二位点的样本数据组成第二组学的新的特征值,根据第二组学的新的特征值和第二无向链接网络重新构建第二组学模型;评估重新构建的第二组学模型的预测结果与真实结果的平均绝对误差;
将前N2个较大的平均绝对误差对应的随机打乱的第二位点作为候选分子标志物,其中,N2为大于或等于1的自然数。
在一些示例性实施方式中,该训练方法还可以包括:
对多个第二位点,逐个执行如下操作:随机打乱当前选择的第二位点的样本数据,将随机打乱的第二位点的样本数据与其他第二位点的样本数据组成第二组学的新的特征值,通过K折交叉验证重新构建K次第二组学模型;评估重新构建的K个第二组学模型的预测结果与真实结果的平均绝对误差,计算K个平均绝对误差的均值;
将前N2个较大的K个平均绝对误差的均值对应的随机打乱的第二位点作为候选分子标志物,其中,N2为大于或等于1的自然数,K为大于1的自然数。
本公开实施例还提供了一种疾病分析模型的训练装置,包括存储器;和连接至所述存储器的处理器,所述存储器用于存储指令,所述处理器被配置为基于存储在所述存储器中的指令,执行如本公开任一实施例所述的疾病分析模型的训练方法的步骤。
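The embodiments of the present disclosure further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the training method of the disease analysis model according to any embodiment of the present disclosure.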
在一些可能的实施方式中,本申请提供的疾病分析模型的训练方法的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在计算机设备上运行时,所述程序代码用于使所述计算机设备执行本说明书上述描述的根据本申请各种示例性实施方式的疾病分析模型的训练方法中的步骤,例如,所述计算机设备可以执行本申请实施例所记载的疾病分析模型的训练方法。
所述程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以是但不限于:电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
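Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and apparatuses, may be implemented as software, firmware, hardware, or an appropriate combination thereof. In a hardware implementation, the division between the functional modules/units mentioned above does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed jointly by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.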
虽然本公开所揭露的实施方式如上,但所述的内容仅为便于理解本公开而采用的实施方式,并非用以限定本发明。任何所属领域内的技术人员,在不脱离本公开所揭露的精神和范围的前提下,可以在实施的形式及细节上进行任何的修改与变化,但本发明的专利保护范围,仍须以所附的权利要求书所界定的范围为准。

Claims (20)

  1. 一种疾病分析方法,包括:
    获取患者的第一组学数据和第二组学数据,所述第一组学包括多个第一位点,所述第二组学包括多个第二位点;
    将所述第一组学数据和第二组学数据输入融合算法模型,以得到患者的预测结果,所述融合算法模型根据第一组学模型和第二组学模型进行融合构建形成,所述第一组学模型根据第一组学的样本数据构建,所述第二组学模型根据第二组学的样本数据构建。
  2. 根据权利要求1所述的疾病分析方法,其中,所述第一位点为DNA甲基化位点,所述第二位点为基因位点;或者,所述第一位点为基因位点,所述第二位点为DNA甲基化位点,其中,基因位点包括基因突变情况信息和/或基因表达情况信息。
  3. 根据权利要求1所述的疾病分析方法,其中,所述第一组学模型根据第一组学的样本数据构建,包括:
    对所述第一组学的样本数据进行数据预处理;
    对所述第一组学的样本数据进行特征过滤;
    计算所述第一组学的样本数据之间的第一相似度矩阵,根据计算出的第一相似度矩阵构建第一无向链接网络;
    根据构建的第一无向链接网络以及所述第一组学的特征值,构建第一组学模型。
  4. 根据权利要求3所述的疾病分析方法,其中,所述对所述第一组学的样本数据进行数据预处理,包括:
    删除所述第一组学的样本数据中的一个或多个第一位点,所删除的每个第一位点至少存在一个患者样本不存在数据值。
  5. 根据权利要求3所述的疾病分析方法,其中,所述对所述第一组学的样本数据进行数据预处理,包括:
    删除所述第一组学的样本数据中的一个或多个第一位点,所删除的每个第一位点存在大于b%的患者样本不存在数据值,其中,b为大于0的实数;
    对所述第一组学的样本数据中不存在数据值的患者样本进行数据填充。
  6. 根据权利要求5所述的疾病分析方法,其中,所述对所述第一组学的样本数据进行数据预处理,包括:根据每个患者的预后情况或疾病分期情况对所述第一组学和第二组学的样本数据中的每个患者进行分类,所述分类结果包括至少两类。
  7. 根据权利要求3所述的疾病分析方法,其中,所述对所述第一组学的样本数据进行特征过滤,包括:
    对每个所述第一位点的数据计算方差;
    比较计算出的方差与预设第一方差阈值的大小;
    删除计算出的方差小于预设第一方差阈值的第一位点。
  8. 根据权利要求7所述的疾病分析方法,其中,所述对所述第一组学的样本数据进行特征过滤,还包括:
    选取基模型,采用所述基模型与所述第一组学的样本数据进行多次训练,每次训练结束移除所述第一组学的样本数据中的x个第一位点,所述x个第一位点的权重值为每次训练得到的所有第一位点的权重值中较低的x个权重值,x为大于或等于1的自然数,直到所剩所述第一位点的数量等于预设第一位点数量阈值。
  9. 根据权利要求8所述的疾病分析方法,其中,所述基模型为以下任意一种:线性回归模型、逻辑回归模型或者决策树模型。
  10. 根据权利要求1或3所述的疾病分析方法,其中,所述第二组学模型根据第二组学的样本数据构建,包括:
    对所述第二组学的样本数据进行数据预处理;
    对所述第二组学的样本数据进行特征过滤;
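    compute the second similarity matrix between the sample data of the second omics, and build the second undirected link network from the computed second similarity matrix;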
    根据构建的第二无向链接网络以及所述第二组学的特征值,构建第二组学模型。
  11. 根据权利要求10所述的疾病分析方法,其中,所述融合算法模型根据第一组学模型和第二组学模型进行融合构建形成,包括:
    将所述第一组学模型与第二组学模型中预测类别相同的概率值相乘,得到乘积概率矩阵;
    根据二维无向链接网络和乘积概率矩阵,构建融合算法模型,所述二维无向链接网络为根据所述第一组学的样本数据建立的第一无向链接网络或根据所述第二组学的样本数据建立的第二无向链接网络。
  12. 根据权利要求1所述的疾病分析方法,所述方法还包括:
    对多个所述第一位点,逐个执行如下操作:随机打乱当前选择的所述第一位点的样本数据,将随机打乱的所述第一位点的样本数据与其他所述第一位点的样本数据组成所述第一组学的新的特征值,重新构建第一组学模型;评估重新构建的第一组学模型的预测结果与真实结果的平均绝对误差;
    将前N1个较大的平均绝对误差对应的随机打乱的第一位点作为候选分子标志物,其中,N1为大于或等于1的自然数。
  13. 根据权利要求1所述的疾病分析方法,所述方法还包括:
    对多个所述第一位点,逐个执行如下操作:随机打乱当前选择的所述第一位点的样本数据,将随机打乱的所述第一位点的样本数据与其他所述第一位点的样本数据组成所述第一组学的新的特征值,通过K折交叉验证重新构建K次第一组学模型;评估重新构建的K个第一组学模型的预测结果与真实结果的平均绝对误差,计算K个平均绝对误差的均值;
    将前N1个较大的K个平均绝对误差的均值对应的随机打乱的第一位点作为候选分子标志物,其中,N1为大于或等于1的自然数,K为大于1的自然数。
  14. 根据权利要求1所述的疾病分析方法,所述方法还包括:
    对多个所述第二位点,逐个执行如下操作:随机打乱当前选择的所述第二位点的样本数据,将随机打乱的所述第二位点的样本数据与其他所述第二位点的样本数据组成所述第二组学的新的特征值,重新构建第二组学模型;评估重新构建的第二组学模型的预测结果与真实结果的平均绝对误差;
    将前N2个较大的平均绝对误差对应的随机打乱的第二位点作为候选分子标志物,其中,N2为大于或等于1的自然数。
  15. 根据权利要求1所述的疾病分析方法,所述方法还包括:
    对多个所述第二位点,逐个执行如下操作:随机打乱当前选择的所述第二位点的样本数据,将随机打乱的所述第二位点的样本数据与其他所述第二位点的样本数据组成所述第二组学的新的特征值,通过K折交叉验证重新构建K次第二组学模型;评估重新构建的K个第二组学模型的预测结果与真实结果的平均绝对误差,计算K个平均绝对误差的均值;
    将前N2个较大的K个平均绝对误差的均值对应的随机打乱的第二位点作为候选分子标志物,其中,N2为大于或等于1的自然数,K为大于1的自然数。
  16. 一种疾病分析装置,包括存储器;和连接至所述存储器的处理器,所述存储器用于存储指令,所述处理器被配置为基于存储在所述存储器中的指令,执行如权利要求1至15中任一项所述的疾病分析方法的步骤。
  17. 一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如权利要求1至15中任一项所述的疾病分析方法。
  18. 一种疾病分析模型的训练方法,包括:
    对第一组学的样本数据和第二组学的样本数据分别进行数据预处理;
    对所述第一组学的样本数据和第二组学的样本数据分别进行特征过滤;
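    compute the first similarity matrix between the sample data of the first omics, and build the first undirected link network from the computed first similarity matrix; compute the second similarity matrix between the sample data of the second omics, and build the second undirected link network from the computed second similarity matrix;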
    根据构建的第一无向链接网络以及所述第一组学的特征值,构建第一组学模型;根据构建的第二无向链接网络以及所述第二组学的特征值,构建第二组学模型;
    将所述第一组学模型与第二组学模型中预测类别相同的概率值相乘,得到乘积概率矩阵;
    根据二维无向链接网络和乘积概率矩阵,构建融合算法模型,所述二维无向链接网络为所述第一无向链接网络或第二无向链接网络。
  19. 一种疾病分析模型的训练装置,包括存储器;和连接至所述存储器的处理器,所述存储器用于存储指令,所述处理器被配置为基于存储在所述存储器中的指令,执行如权利要求18所述的疾病分析模型的训练方法的步骤。
  20. 一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如权利要求18所述的疾病分析模型的训练方法。
PCT/CN2022/109023 2022-07-29 2022-07-29 疾病分析方法、疾病分析模型的训练方法及装置 WO2024021037A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280002458.3A CN117795617A (zh) 2022-07-29 2022-07-29 疾病分析方法、疾病分析模型的训练方法及装置
PCT/CN2022/109023 WO2024021037A1 (zh) 2022-07-29 2022-07-29 疾病分析方法、疾病分析模型的训练方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/109023 WO2024021037A1 (zh) 2022-07-29 2022-07-29 疾病分析方法、疾病分析模型的训练方法及装置

Publications (1)

Publication Number Publication Date
WO2024021037A1 true WO2024021037A1 (zh) 2024-02-01

Family

ID=89705050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/109023 WO2024021037A1 (zh) 2022-07-29 2022-07-29 疾病分析方法、疾病分析模型的训练方法及装置

Country Status (2)

Country Link
CN (1) CN117795617A (zh)
WO (1) WO2024021037A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109994200A (zh) * 2019-03-08 2019-07-09 华南理工大学 一种基于相似度融合的多组学癌症数据整合分析方法
CN111161882A (zh) * 2019-12-04 2020-05-15 深圳先进技术研究院 一种基于深度神经网络的乳腺癌生存期预测方法
CN112735529A (zh) * 2021-01-18 2021-04-30 中国医学科学院肿瘤医院 乳腺癌预后模型的构建方法及应用方法、电子设备
CN113486922A (zh) * 2021-06-01 2021-10-08 安徽大学 基于栈式自编码器的数据融合优化方法及其系统
CN114171197A (zh) * 2021-11-12 2022-03-11 东莞市人民医院 一种乳腺癌her2状态的预测方法及相关设备
CN114398983A (zh) * 2022-01-14 2022-04-26 腾讯科技(深圳)有限公司 分类预测方法、装置、设备、存储介质及计算机程序产品

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG GE, PENG ZHEN, YAN CHAOKUN, WANG JIANLIN, LUO JUNWEI, LUO HUIMIN: "A novel liver cancer diagnosis method based on patient similarity network and DenseGCN", SCIENTIFIC REPORTS, NATURE PUBLISHING GROUP, US, vol. 12, no. 1, 26 April 2022 (2022-04-26), US , pages 6797, XP093132592, ISSN: 2045-2322, DOI: 10.1038/s41598-022-10441-3 *

Also Published As

Publication number Publication date
CN117795617A (zh) 2024-03-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22952484

Country of ref document: EP

Kind code of ref document: A1