WO2020124584A1 - Disease risk prediction method, electronic device and storage medium - Google Patents
Disease risk prediction method, electronic device and storage medium
- Publication number
- WO2020124584A1; PCT/CN2018/122786 (CN2018122786W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- reference object
- driving force
- activity
- gene
- clustering
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Definitions
- the present application relates to biotechnology, in particular to a method for predicting disease risk, electronic equipment and storage media.
- Breast cancer is one of the most serious threats to women's health worldwide, with about 1.3 million new cases and about 500,000 deaths each year. Taking the statistics for China in 2015 and the United States in 2018 as examples, the incidence of breast cancer ranks first among all cancers in women in both countries, and its mortality ranks fifth and second, respectively. As of the time of the statistics, the total number of surviving patients exceeded 260,000. On average, a woman has a 12% chance of developing breast cancer in her lifetime. A number of retrospective studies have shown that early prevention, early detection, and early treatment significantly improve the prognosis of breast cancer patients, especially for triple-negative breast cancer, which is characterized by early onset, poor prognosis, and an unknown mechanism.
- the purpose of this application is to provide a solution for predicting the risk of disease based on the signal pathway information.
- this application provides a disease risk prediction method, which is executed by an electronic device and includes:
- the risk of the detected object suffering from the specific disease is output according to the first clustering result obtained after performing the first clustering.
- Another aspect of the present application provides an electronic device, including: a memory, a processor, and a program stored in the memory, where the program is configured to be executed by the processor, and the following is implemented when the processor executes the program:
- the present application provides a storage medium that stores a computer program, where the following is implemented when the computer program is executed by a processor:
- the driving force information of the mutant gene of the detected object on the changes of the activities of several predetermined signaling pathways is used to realize the prediction of the risk of disease.
- the entire germline genetic information is used as the basis for comprehensively evaluating the overall characteristics of germline genetics, so the evaluation can cover the risks caused by germline inheritance in various sporadic and familial genetic diseases (such as breast cancer), which improves the sensitivity of detecting at-risk individuals.
- the discrete, high-dimensional, multivariately correlated, and non-standardized germline variation features can be projected onto predicted gene expression features and signal pathway activity features that have a continuous value range, relatively low dimensionality, and gradually converging correlation; in this way, a quantitative model that converts discrete qualitative data into a continuous space is constructed.
- it retains the global characteristics of the data.
- it becomes a link between germline genetic information and other deterministic events in breast cancer (including but not limited to pathophysiological characteristics such as lymph node metastasis and age at onset), and provides the basis for classification.
- the risk rating and clinical feature association of sporadic genetic breast cancer, such as triple-negative breast cancer, can be graded according to pathway activity, which fills the coverage gap of knowledge-driven methods based on gene panels and significantly reduces the false negative rate.
- the risk of disease can be correlated with other clinical, pathological, physiological, or behavioral deterministic event characteristics, so that the model can provide a basis, grounded in germline genetic information, for patient prognosis assessment, early clinical intervention, and management.
- FIG. 1 is a schematic flowchart of a method for obtaining a deterministic event in a cell according to an embodiment of the present application
- FIG. 2 is a schematic flowchart of a method for obtaining a deterministic event in a cell according to another embodiment of the present application
- FIG. 3 is a schematic flowchart of a method for predicting a disease risk according to an embodiment of the present application
- FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the application.
- global germline genetic information refers to all genetic information that is derived from the parent and is encoded in the genome of all normal cells developed from the embryo, carried by the individual for life, and can be passed on to the offspring through reproduction. Its form includes but is not limited to genomic DNA sequence, epigenetic modification information and so on.
- intracellular deterministic events refer to event characteristics that are ultimately generated by the interaction of various molecules in the organism according to known or unknown mechanisms and that can be detected qualitatively or quantitatively by various methods, including but not limited to the activation or inhibition of signaling pathways, the type and content of metabolites, the interaction patterns, states, and changes of biomolecules (including macromolecules such as proteins and nucleic acids, lipids, small-molecule drugs, metabolic products, and inorganic metal ions), i.e. the interactome, and the structure and changes of polymers, cells, and tissues.
- deterministic events within the cell include germline genetically determined gene expression, signaling pathway activity, the risk or resistance to breast cancer, and the probability of occurrence of breast cancer-related pathophysiological states
- FIG. 1 is a schematic flowchart of a method for obtaining a deterministic event in a cell according to an embodiment of the present application. The method may be executed by an electronic device, including:
- the determining at least one predetermined type of intracellular deterministic event of the detected object in S14 includes:
- S142 Determine the second type of intracellular deterministic event information of the detected object according to the first type of intracellular deterministic event information of the detected object.
- the detected object may be a living organism, for example, but not limited to, a human being.
- the predetermined genome may be, for example, some or all genes in the known human genome.
- mutant genes of the detected object belong to a predetermined genome, which may be rare germline mutant genes or global germline mutant genes, depending on the actual situation.
- global germline genetic information of the detected object, such as whole exome sequencing data, can be obtained, and the rare germline mutant genes are determined from it.
- the rare germline mutant genes of the detected object may be determined by, for example, checking whether each mutant gene in the whole exome sequencing data of the detected object belongs to a predetermined rare mutant genome.
- the rare germline mutant genome can be determined by a set mutation frequency threshold. In other words, if the frequency with which a mutant gene occurs in the population is less than the set mutation frequency threshold, the gene is a rare germline mutant gene.
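The threshold test can be sketched as follows. This is a minimal illustration, assuming that "rare" means a population frequency below the threshold; the function name, the frequency table, the gene names, and the 0.01 default are all hypothetical and not taken from the application.

```python
def rare_germline_genes(mutant_genes, population_frequency, threshold=0.01):
    """Keep the mutant genes whose population frequency is below the threshold.

    Assumption: genes missing from the (hypothetical) frequency table are
    treated as never observed in the population, i.e. as rare.
    """
    return [
        gene for gene in mutant_genes
        if population_frequency.get(gene, 0.0) < threshold
    ]


# Illustrative values only; they are not taken from the application.
population_frequency = {"GENE_A": 0.002, "GENE_B": 0.20, "GENE_C": 0.0005}
rare = rare_germline_genes(["GENE_A", "GENE_B", "GENE_C"], population_frequency)
print(rare)
```

In this toy input, GENE_B is common in the population and is therefore dropped.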
- high-throughput global data may also be used instead of whole exome sequencing data; such high-throughput global data include but are not limited to whole exome sequencing, whole genome sequencing, gene chip, and expression chip data.
- the aforementioned first type of intracellular deterministic event information may be driving force information of the mutant genes of the detected object for the change in the activity of at least one predetermined signaling pathway, and the second type of intracellular deterministic event information may be the predicted risk of the detected object suffering from a specific disease.
- FIG. 2 is a schematic flowchart of a method for obtaining a deterministic event in a cell according to an embodiment of the present application.
- the method may be executed by an electronic device.
- the driving force of the plurality of mutant genes of the detected object for the change in the activity of at least one predetermined signaling pathway can be obtained.
- the method of this embodiment includes:
- gene expression refers to the detectable amount of RNA product transcribed from a gene on the genome, or the amount of protein translated from it.
- the gene expression amount can be a value in a continuous range, which can be obtained from existing data.
- determining the at least one predetermined type of intracellular deterministic event information of the detected object includes: determining the driving force information of the plurality of mutant genes of the detected object for the change in the activity of multiple predetermined signaling pathways.
- the plurality of predetermined signal pathways may be selected from signal pathways known in the prior art; for example, a signal pathway may be selected when the overlap between the genes contained in the signal pathway and the genes in the above-mentioned predetermined genome is greater than a predetermined threshold.
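One way to read the overlap criterion is sketched below. The text does not say whether the overlap is a gene count or a fraction; this sketch assumes the fraction of a pathway's genes found in the predetermined genome. The function name, the 0.5 default, and the pathway and gene names are hypothetical.

```python
def select_pathways(pathways, genome, min_overlap=0.5):
    """Return the names of pathways whose gene overlap with the genome
    exceeds min_overlap (assumed here to be a fraction of the pathway's genes).

    Each pathway is assumed to contain at least one gene.
    """
    genome = set(genome)
    selected = []
    for name, genes in pathways.items():
        genes = set(genes)
        overlap = len(genes & genome) / len(genes)
        if overlap > min_overlap:
            selected.append(name)
    return selected


# Illustrative gene and pathway names only.
genome = ["G1", "G2", "G3", "G4"]
pathways = {
    "P1": ["G1", "G2", "X1"],        # 2 of 3 genes are in the genome
    "P2": ["G3", "X1", "X2"],        # 1 of 3 genes are in the genome
    "P3": ["G1", "G2", "G3", "G4"],  # all genes are in the genome
}
chosen = select_pathways(pathways, genome)
print(chosen)
```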
- the driving force of a mutant gene for the change in the activity of a signaling pathway indicates the ability of the mutant gene to influence the activity of that signaling pathway.
- obtaining, in S22, the driving force information of each mutant gene in the plurality of mutant genes for changing the gene expression of each gene in the predetermined genome includes:
- the method for obtaining the template data includes: performing the following processing for each gene gi in the predetermined genome:
- dividing the predetermined reference cell lines into a first cell line group and a second cell line group, wherein the first cell line group includes the reference cell lines that carry the mutant gene gi.
- the second cell line group includes the reference cell lines that do not carry the mutant gene gi.
- the p reference cell lines are divided into two groups: a first cell line group (also called the mutant group) mti and a second cell line group (also called the wild group) wti, wherein the first cell line group includes those of the p reference cell lines that carry the mutant gene gi (their number being pi1), and the second cell line group includes those that do not (their number being pi2).
- de_ij is the difference between the average expression value of the gene gj over the reference cell lines in the mutant group mti corresponding to the gene gi and the average expression value of the gene gj over the reference cell lines in the wild group wti, that is, de_ij = μ_mt,ij - μ_wt,ij
- μ_mt,ij represents the average expression value of the gene gj over the reference cell lines in the mutant group mti
- μ_wt,ij represents the average expression value of the gene gj over the reference cell lines in the wild group wti.
- the above-mentioned difference de_ij may be subjected to noise reduction processing.
- random simulations may be performed a predetermined number of times (for example, but not limited to, 10,000 times).
- in each random simulation, the p cell lines are randomly divided into a mutant group and a wild group, with pi1 reference cell lines in the mutant group and pi2 reference cell lines in the wild group; the difference de_null between the average expression values of the gene gj in these two randomly divided groups is then calculated.
- de_ij is subjected to noise reduction processing (also called normalization processing) using the differences de_null obtained from the random simulations, and the value obtained after the normalization processing represents the driving force df.
- This normalization processing can be achieved by the following formula: df_ij = (de_ij - mean(de_null)) / std(de_null)
- df_ij is the driving force information of the gene gi for changing the gene expression of the gene gj.
- mean(de_null) and std(de_null) are the average and standard deviation of de_null calculated over the 10,000 random simulations.
- the above process calculates the driving force of one gene gi for changing the gene expression of each gene gj.
- the above calculation process is performed for each gene in the predetermined genome to obtain the driving force information, that is, the template data, of each gene in the predetermined genome for changing the gene expression of each gene in the predetermined genome.
- the template data can be represented by an n x n matrix in which each row corresponds to a gene gi, each column corresponds to a gene gj, and each value represents the driving force of the row gene for changing the gene expression of the column gene.
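The calculation of a single template entry can be sketched as below. It is a minimal illustration under stated assumptions: the normalization is taken to be a z-score against the random simulations, the sample standard deviation is used, and the function names, the expression values, and the reduced simulation count are hypothetical.

```python
import random
import statistics


def mean_difference(expression, mutant):
    """de: mean expression in the mutant group minus mean in the wild group."""
    mt = [x for line, x in enumerate(expression) if line in mutant]
    wt = [x for line, x in enumerate(expression) if line not in mutant]
    return statistics.mean(mt) - statistics.mean(wt)


def driving_force(expression, mutant, n_sim=1000, seed=0):
    """df: z-score of the observed de against randomly re-drawn groups.

    expression holds the expression of gene gj in each of the p reference
    cell lines; mutant lists the cell lines that carry the mutant gene gi.
    Both groups are assumed to be non-empty.
    """
    rng = random.Random(seed)
    lines = range(len(expression))
    observed = mean_difference(expression, set(mutant))
    # Each simulation keeps the group sizes (pi1, pi2) but redraws membership.
    null = [
        mean_difference(expression, set(rng.sample(lines, len(mutant))))
        for _ in range(n_sim)
    ]
    return (observed - statistics.mean(null)) / statistics.stdev(null)


# Illustrative expression of gj in p = 8 cell lines; lines 0-2 carry gi.
expression_gj = [5.0, 5.2, 4.9, 1.0, 1.1, 0.9, 1.2, 1.0]
df_ij = driving_force(expression_gj, [0, 1, 2])
print(df_ij)
```

Repeating this for every pair (gi, gj) would fill the n x n template matrix; the sketch uses 1,000 simulations only to keep the example fast.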
- determining the driving force information of each of the m mutant genes of the detected object for changing the gene expression of each gene in the predetermined genome may include: obtaining, from the above matrix, the m rows of data corresponding to the m mutant genes, which gives an m x n matrix.
- a method for obtaining the driving force information of the several mutant genes of the detected object for changing the gene expression of each gene in the predetermined genome includes: performing the following processing for each gene gj in the predetermined genome:
- the driving force of each mutant gene can be weighted (w), and then the average value DF can be obtained.
- DF_j is the average driving force of all m mutant genes of the detected object for changing the gene expression of the gene gj in the predetermined genome
- i_k is the row index of the k-th mutant gene of the detected object in the n x n matrix
- df is the value at the corresponding position in the aforementioned n x n matrix.
- a simple approach is to assume that the driving forces of all mutant genes have the same weight; it is understandable that the weights may also differ.
- the weighted average values DF_null obtained from the random simulations are used to perform noise reduction processing (also called normalization processing) on DF_j.
- This normalization processing can be achieved by the following formula: ZDF_j = (DF_j - mean(DF_null)) / std(DF_null)
- ZDF_j represents the driving force of all m mutant genes carried by the detected object for changing the gene expression of the gene gj in the predetermined genome
- mean(DF_null) and std(DF_null) are the mean and standard deviation of DF_null calculated over the 10,000 random simulations, respectively.
- a 1 x n matrix is obtained.
- although each detected object carries a different number of mutant genes, through the above processing the different m x n matrices corresponding to different detected objects are converted into 1 x n matrices of the same size, which can subsequently be compared in the same dimension.
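The reduction from an m x n matrix to a 1 x n profile can be sketched as follows. Two points are assumptions rather than statements of the application: the weighted average divides by the sum of the weights, and each random simulation draws m random rows of the template matrix. The names and the toy matrix are hypothetical.

```python
import random
import statistics


def average_driving_force(df, rows, j, weights=None):
    """DF_j: weighted average of column j over the given template rows."""
    weights = weights or [1.0] * len(rows)
    return sum(w * df[i][j] for w, i in zip(weights, rows)) / sum(weights)


def zdf_profile(df, mutant_rows, n_sim=500, seed=0):
    """Return the 1 x n profile ZDF for one detected object."""
    rng = random.Random(seed)
    n = len(df)
    profile = []
    for j in range(n):
        observed = average_driving_force(df, mutant_rows, j)
        # Assumed null model: the same number of rows, drawn at random.
        null = [
            average_driving_force(df, rng.sample(range(n), len(mutant_rows)), j)
            for _ in range(n_sim)
        ]
        profile.append(
            (observed - statistics.mean(null)) / statistics.stdev(null)
        )
    return profile


# Toy 6 x 6 template matrix with arbitrary values.
df = [[((7 * i + 3 * j) % 11) - 5 for j in range(6)] for i in range(6)]
profile_a = zdf_profile(df, [0, 2])      # object carrying 2 mutant genes
profile_b = zdf_profile(df, [1, 3, 4])   # object carrying 3 mutant genes
print(len(profile_a), len(profile_b))
```

Both objects end up with a profile of length n even though they carry different numbers of mutant genes, which is what makes them comparable.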
- obtaining, in S24, the driving force information of the several mutant genes of the detected object for the change in the activity of at least one predetermined signal pathway includes: performing the following processing for each signal pathway sj:
- S242 Obtain comprehensive information on the influence of several mutant genes of the detected object on the activity of the signal pathway sj according to the information on the influence of each gene gi in the predetermined genome on the activity of the signal pathway sj.
- obtaining information on the influence of each gene gi in the predetermined genome on the activity of the signaling pathway sj in S241 includes:
- S2413 Obtain, according to the driving force information obtained in S2411 and the influence information obtained in S2412, influence information of each gene gi in the predetermined genome on the activity of the signaling pathway sj.
- DFP_ij is the influence value of a gene gi in the predetermined genome on the activity of the j-th signal pathway
- df is the value at the corresponding position of the aforementioned n x n matrix
- j_a is the column index, in the n x n matrix, of the a-th gene in the j-th signal pathway; sig_a is the effect of the a-th gene on the activity of the j-th signal pathway, which can be obtained from existing data: the value is 1 for up-regulation and -1 for down-regulation.
- noise reduction processing can be performed on DFP_ij.
- random simulations may be performed a predetermined number of times (for example, but not limited to, 10,000 times).
- in each random simulation, the data corresponding to k genes are randomly selected from the aforementioned n x n matrix and DFP_null is calculated by the above formula.
- the DFP_null values obtained from the random simulations are used to perform noise reduction processing (also called normalization) on DFP_ij.
- This normalization processing can be achieved by the following formula: ZDFP_ij = (DFP_ij - mean(DFP_null)) / std(DFP_null)
- ZDFP_ij is the driving force of a gene gi in the predetermined genome for the change in the activity of the j-th signal pathway
- mean(DFP_null) and std(DFP_null) are the average and standard deviation of DFP_null calculated over the 10,000 random simulations, respectively.
- after obtaining the driving force ZDFP_ij of each gene gi of the n genes in the predetermined genome for changing the activity of each of the q predetermined signal pathways sj, an n x q matrix can be obtained.
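A sketch of one entry of the n x q matrix follows. The formula for DFP_ij is not reproduced in the text, so this sketch assumes a signed sum of the template values of the pathway's genes, and it assumes that each random simulation draws k random columns from the same row. All names and numbers are hypothetical.

```python
import random
import statistics


def pathway_influence(df_row, columns, signs):
    """DFP (assumed form): signed sum of one gene's driving forces on the
    genes of a pathway, with sign +1 for up-regulation and -1 for down-regulation."""
    return sum(df_row[c] * s for c, s in zip(columns, signs))


def zdfp(df_row, columns, signs, n_sim=1000, seed=0):
    """ZDFP: z-score of the observed DFP against random gene sets of size k."""
    rng = random.Random(seed)
    observed = pathway_influence(df_row, columns, signs)
    k = len(columns)
    null = [
        pathway_influence(df_row, rng.sample(range(len(df_row)), k), signs)
        for _ in range(n_sim)
    ]
    return (observed - statistics.mean(null)) / statistics.stdev(null)


# One row of the template matrix (driving forces of gi on 8 genes).
df_row = [2.0, -1.5, 0.3, 1.8, -0.2, 0.1, -2.2, 0.4]
# Hypothetical pathway with k = 3 genes in columns 0, 3 and 6.
pathway_columns = [0, 3, 6]
pathway_signs = [1, 1, -1]
z = zdfp(df_row, pathway_columns, pathway_signs)
print(z)
```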
- the comprehensive influence information of the several mutant genes of the detected object on the activity of the signaling pathway sj in S242 can be obtained by the following formula:
- IDFP_j is the comprehensive influence of the m mutant genes of the detected object on the activity of the signal pathway sj
- i_a is the row index, in the aforementioned n x q matrix (here q = 60), of the a-th gene in the j-th signal pathway.
- noise reduction processing can be performed on IDFP_j.
- random simulations may be performed a predetermined number of times (for example, but not limited to, 10,000 times).
- in each random simulation, m rows are randomly selected from the n x q matrix (here q = 60) and IDFP_null is calculated by the above formula.
- the IDFP_null values obtained from the random simulations are used to perform noise reduction processing (also called normalization) on IDFP_j.
- This normalization processing can be implemented by the following formula: ZIDFP_j = (IDFP_j - mean(IDFP_null)) / std(IDFP_null)
- ZIDFP_j is the driving force of all m mutant genes carried by the detected object for the change in the activity of the j-th signal pathway.
- mean(IDFP_null) and std(IDFP_null) are the average and standard deviation of IDFP_null calculated over the 10,000 random simulations, respectively.
- each detected object is thus represented by a 1 x q matrix, regardless of the mutant gene data and the specific mutant genes of the detected object.
- FIG. 3 shows a schematic flowchart of a method for predicting a disease risk according to an embodiment of the present application.
- the method may be executed by an electronic device, including:
- the specific disease is triple negative breast cancer. It is understandable that the disease risk prediction method of this embodiment can also be used for other suitable specific diseases, and is not limited to triple negative breast cancer.
- the method further includes: combining the several clusters obtained after performing the first clustering into multiple groups.
- the method further includes: obtaining and outputting at least one of the clinical or pathological deterministic event characteristics, pathological characteristics, physiological characteristics, and behavioral characteristics of the reference objects that have the same disease risk grade as the detected object.
- the first clustering is performed on the detected object and each reference object in the first and second reference object groups using the NMRCLUST clustering method.
- other clustering methods can be selected for the first clustering according to the actual situation, including but not limited to hierarchical methods (e.g. k-nearest-neighbor (kNN) algorithms, etc.), partition-based methods (e.g. K-Means clustering, etc.), density-based methods (e.g. Density-Based Spatial Clustering of Applications with Noise, referred to as DBSCAN, etc.), grid-based methods (e.g. the STatistical INformation Grid (STING) algorithm, etc.), and model-based methods (e.g. Gaussian Mixture Models, abbreviated as GMM).
- before obtaining the driving force information of the mutant genes of the detected object for the change in the activity of the several predetermined signal pathways, the method includes: determining the several predetermined signal pathways from among multiple reference signal pathways.
- before determining the plurality of predetermined signal pathways from the multiple reference signal pathways, the method includes: determining a pre-classification type corresponding to the detected object; determining, according to the pre-classification type, the first reference object group from a third reference object group, wherein each reference object of the third reference object group belongs to the healthy category and the first reference object group corresponds to the pre-classification type; and determining, according to the pre-classification type, the second reference object group from a fourth reference object group, wherein each reference object of the fourth reference object group belongs to the category of objects with the specific disease and the second reference object group corresponds to the pre-classification type.
- determining the plurality of predetermined signal pathways from the multiple reference signal pathways includes: determining the plurality of predetermined signal pathways from the multiple reference signal pathways according to the pre-classification type.
- determining the pre-classification type corresponding to the detected object includes: obtaining driving force information of the mutant genes of the detected object for the change in the activity of the multiple reference signal pathways; obtaining driving force information of the mutant genes of each reference object in the third and fourth reference object groups for the change in the activity of the multiple reference signal pathways; and performing the second clustering on the detected object and each reference object in the third and fourth reference object groups according to these two sets of driving force information.
- the second clustering is performed on the detected object and each reference object in the third and fourth reference object groups using the Ward Hierarchical Clustering method.
- it can be understood that other clustering methods can be selected for the second clustering according to the actual situation, for example, hierarchical methods (such as the k-nearest-neighbor (kNN) algorithm, etc.), partition-based methods (e.g. K-Means clustering, etc.), density-based methods (e.g. Density-Based Spatial Clustering of Applications with Noise, referred to as DBSCAN), grid-based methods (such as the STatistical INformation Grid (STING) algorithm, etc.), and model-based methods (e.g. Gaussian Mixture Models, abbreviated as GMM).
- determining the plurality of predetermined signal pathways from the multiple reference signal pathways according to the pre-classification type includes: determining, according to the pre-classification type, a fifth reference object group corresponding to the pre-classification type from the third reference object group; determining, according to the pre-classification type, a sixth reference object group corresponding to the pre-classification type from the fourth reference object group; for each signal pathway sk in the multiple reference signal pathways, determining the difference between the driving force information of the mutant genes of each reference object in the fifth reference object group for the activity change of the signal pathway sk and the driving force information of the mutant genes of each reference object in the sixth reference object group for the activity change of the signal pathway sk; and determining, according to the difference, the plurality of predetermined signal pathways that satisfy a preset difference significance condition from the multiple reference signal pathways.
- the method for determining the difference between the driving force information of the mutant genes of each reference object in the fifth reference object group for the activity change of the signal pathway sk and the driving force information of the mutant genes of each reference object in the sixth reference object group for the activity change of the signal pathway sk includes: obtaining the difference between the average driving force value of the mutant genes of the reference objects in the sixth reference object group for the activity change of the signal pathway sk and the average driving force value of the mutant genes of the reference objects in the fifth reference object group for the activity change of the signal pathway sk.
- noise reduction can be performed on the difference.
- outputting the risk of the detected object suffering from the specific disease according to the first clustering result obtained after performing the first clustering includes: determining and outputting the risk of the detected object suffering from the specific disease at least according to the ratio, within the cluster to which the detected object belongs, of the number of reference objects belonging to the second reference object group to the number of reference objects belonging to the first reference object group.
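The ratio used for the risk output can be sketched as below, assuming the first clustering has already assigned a cluster label to the detected object and to every reference object. How the ratio is mapped to a risk grade is not specified here; the function name, the labels "healthy" and "disease", and the handling of a cluster without healthy references are hypothetical choices.

```python
def cluster_risk_ratio(cluster_of, group_of, subject):
    """Ratio of diseased to healthy reference objects in the subject's cluster.

    cluster_of maps every object (including the subject) to a cluster label;
    group_of maps every reference object to "healthy" (first reference
    object group) or "disease" (second reference object group).
    """
    cluster = cluster_of[subject]
    members = [
        ref for ref, label in cluster_of.items()
        if label == cluster and ref != subject
    ]
    n_disease = sum(1 for ref in members if group_of[ref] == "disease")
    n_healthy = sum(1 for ref in members if group_of[ref] == "healthy")
    if n_healthy == 0:
        # Assumed convention: no healthy reference in the cluster.
        return float("inf") if n_disease else None
    return n_disease / n_healthy


# Illustrative clustering result with two clusters (0 and 1).
cluster_of = {"subject": 1, "h1": 1, "h2": 0, "h3": 0, "d1": 1, "d2": 1, "d3": 1}
group_of = {"h1": "healthy", "h2": "healthy", "h3": "healthy",
            "d1": "disease", "d2": "disease", "d3": "disease"}
ratio = cluster_risk_ratio(cluster_of, group_of, "subject")
print(ratio)
```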
- the following uses triple-negative breast cancer as a specific example to describe in detail the disease risk prediction method of the present application.
- the driving force information of the several mutant genes of the detected object for the change in the activity of q predetermined signal pathways, obtained in the foregoing embodiment of the method for obtaining a deterministic event in a cell, can be used to predict the risk of the detected object suffering from triple-negative breast cancer.
- triple-negative breast cancer refers to breast cancer that is negative for the estrogen receptor (ER), the progesterone receptor (PR), and HER2; it accounts for about 15% of all breast cancer patients and is characterized by early onset, poor prognosis, unclear pathogenesis, and low response to treatment.
- each person can be represented by a 1 x q matrix as described above, which represents the driving force information of that person's mutant genes for the change in the activity of the q signal pathways.
- Cluster analysis of the n1 1 x q matrices, that is, of an n1 x q matrix (for example, by the Ward Hierarchical Clustering method), found that these reference objects can be divided into two categories: type A and type B.
- each patient can be represented by the aforementioned 1 x q matrix, which represents the driving force information of that patient's mutant genes for the change in the activity of the q signal pathways.
- Cluster analysis of the n2 1 x q matrices, that is, of an n2 x q matrix (for example, by the Ward Hierarchical Clustering method), found that these people can also be divided into two categories: type A and type B.
- the reference objects in the third and fourth reference object groups can be divided into two types, A and B, both of which include healthy people and triple-negative breast cancer patients.
- the 1 x q matrix of the detected object can be obtained according to the method in the foregoing embodiment. Then, the 1 x q matrix of the detected object is combined with the n1 x q and n2 x q matrices corresponding to the third and fourth reference object groups, and the second clustering is performed, for example by the Ward Hierarchical Clustering method, to determine the pre-classification type of the detected object.
- after the second clustering, the reference objects in the third and fourth reference object groups are divided into the two categories A and B, and the detected object is clustered into A or B; that is, the pre-classification type of the detected object can be determined to be type A or type B.
- a fifth reference object group corresponding to class A is determined from the third reference object group
- a sixth reference object group corresponding to class A is determined from the fourth reference object group.
- the fifth reference object group may include some or all class A reference objects in the third reference object group
- the sixth reference object group may include some or all class A reference objects in the fourth reference object group.
- the difference DP k between the driving force information of the mutated genes of each class A triple-negative breast cancer patient in the sixth reference object group on the activity change of the k-th signaling pathway sk and the driving force information of the mutated genes of each class A healthy person in the fifth reference object group on the activity change of the k-th signaling pathway sk
- the difference DP k can be determined by the following formula:
- ZIDFP ik is the driving force of the mutated genes carried by the i-th triple-negative breast cancer patient on the activity change of the k-th signaling pathway
- ZIDFP jk is the driving force of the mutated genes carried by the j-th healthy person on the activity change of the k-th signaling pathway.
- DP k may be subjected to noise reduction processing.
- a random simulation may be performed a predetermined number of times (for example, but not limited to 1,000,000 times).
- in each random simulation, the healthy-person or triple-negative-breast-cancer-patient label of each reference object is randomly shuffled, and DP null is calculated according to the above formula.
- the DP null values obtained from the random simulations are used to denoise (also called standardize) DP k .
- This standardization can be implemented by the following formula:
- mean(DP null ) and std(DP null ) are the mean and standard deviation, respectively, of DP null calculated over the 1000000 random simulations.
- ZDP k : the further ZDP k deviates from 0, the less likely it is that the difference in activity of this signaling pathway between triple-negative breast cancer patients and healthy people is random, and the more likely it is to have specific biological significance.
- according to the difference in driving force information on pathway activity changes between the two groups, several signaling pathways that satisfy a preset difference significance condition are determined from the q signaling pathways.
- the q1 (for example, 8) signaling pathways with the largest absolute value of ZDP k among the q signaling pathways may be selected for subsequent analysis.
- the q1 entries corresponding to the q1 signaling pathways are taken from the 1 x q matrix of the detected object, giving the driving force information of the detected object's mutated genes on the activity changes of the q1 reference signaling pathways.
- the pre-classification type of the detected object is class A
- the first reference object group, corresponding to class A healthy people, is determined from the third reference object group
- the second reference object group, corresponding to class A triple-negative breast cancer patients, is determined from the fourth reference object group
- the q1 entries corresponding to the q1 signaling pathways are taken from the 1 x q matrix of each reference object in the first and second reference object groups, giving the driving force information of each such reference object's mutated genes on the activity changes of the q1 reference signaling pathways.
- the first reference object group may include some or all class A reference objects in the third reference object group
- the second reference object group may include some or all class A reference objects in the fourth reference object group.
- the first reference object group may be the same as or different from the fifth reference object group
- the second reference object group may be the same as or different from the sixth reference object group.
- according to this driving force information, a first clustering is performed on the detected object and each reference object in the first and second reference object groups, giving u clusters.
- the first clustering can be implemented, for example, with the NMRCLUST clustering method.
- The NMRCLUST clustering method uses average-linkage clustering and then a penalty function to optimize the number of clusters and the distance between clusters simultaneously. For example, the number of clusters with the minimum penalty value can be chosen, so that the class A detected object and each reference object in the first and second reference object groups are clustered into u (for example, 15) clusters, each of which can correspond to a different disease risk level. Other clustering methods can be chosen for the first clustering as the situation requires, and the present application is not limited in this respect.
- the risk of triple negative breast cancer of the detected object is output.
- the number of reference objects in each cluster that belong to the second reference object group (that is, the number of triple-negative breast cancer patients).
- the clusters obtained from the first clustering may be merged according to data distribution characteristics, to obtain groups with more distinctive features. For example, the u disease risk levels are merged into a smaller number of disease risk levels, for easier reference by the detected object.
- the pre-classification type of the detected object may be determined by comparing preset classification rules of the classes with the information of the detected object that corresponds to those rules.
- a second clustering may be performed on the reference objects in the foregoing third and fourth reference object groups, dividing them into class A and class B; statistics are then computed on the relevant information of the class A and class B reference objects (for example, the driving force information of each person's mutated genes on the activity changes of the q signaling pathways) to obtain the classification rule of each class;
- the information of the detected object that corresponds to the classification rules (for example, the driving force information of the detected object's mutated genes on the activity changes of the q signaling pathways)
- is compared with the classification rule of each class, and the detected object is assigned to the closest class.
- the application determines the pre-classification type corresponding to the detected object according to the preset classification rules of each class.
- the application is not limited to this, for example, in other embodiments,
- the classification rule of each class may be determined by other means, and the information of the detected object corresponding to the classification rule is not limited to the exemplary information mentioned above.
- reference objects that belong to the same disease risk level as the detected object (for example, the same cluster or the same group)
- clinically or pathologically relevant deterministic event features (for example, age of onset, lymph node metastasis, etc.)
- pathological features (for example, drug response, primary or metastatic status, etc.)
- physiological features (for example, immune function, cardiovascular and respiratory function, etc.)
- behavioral features (for example, diet and exercise, etc.)
- the application has been described above with triple negative breast cancer as an example, but the application does not limit the need to perform pre-classification or limit the types of pre-classification to only two types. In other embodiments of the present application, for example, in the method for predicting the risk of other diseases, there may be more than two types of pre-classification, or pre-classification may not be required.
- FIG. 4 shows an electronic device 40 according to an embodiment of the present application, including a memory 42, a processor 44, and a program 46 stored in the memory 42. The program 46 is configured to be executed by the processor 44; when executing the program, the processor 44 implements at least part of the aforementioned method for obtaining intracellular deterministic events, at least part of the aforementioned disease risk prediction method, or a combination of the two methods.
- the present application also provides a storage medium storing a computer program, where the computer program, when executed by a processor, implements at least part of the aforementioned method for obtaining intracellular deterministic events, at least part of the aforementioned disease risk prediction method, or a combination of the two methods.
- all germline genetic information is used to comprehensively evaluate the overall characteristics of germline inheritance, so the risk assessment can cover the various sporadic and familial hereditary breast cancers caused by germline inheritance, which improves the sensitivity of detecting at-risk individuals.
- discrete, high-dimensional, multivariately correlated, non-standardized germline variation features can be projected onto predicted gene expression features and signaling pathway activity features that have a continuous value range, relatively low dimensionality, and gradually converging correlation
- a quantitative model that converts discrete qualitative data into a continuous space is constructed.
- on the one hand, it retains the global characteristics of the data.
- on the other hand, it provides a data-driven classification basis for associating germline genetic information with other deterministic events in breast cancer (including but not limited to pathophysiological features such as lymph node metastasis and age of onset).
- the risk rating and clinical feature association of sporadic hereditary breast cancers such as triple-negative breast cancer can be graded according to pathway activity
- this fills the coverage gap of knowledge-driven methods based on gene panels and significantly reduces the false negative rate.
- disease risk can be associated with other clinical, pathological, physiological, or behavioral deterministic event features, so that the model can provide a basis, from germline genetic information, for prognosis assessment, early clinical intervention, and management of patients.
- the electronic device may be a user terminal device, a server, or a network device in some embodiments.
- the memory includes at least one type of readable storage medium, the readable storage medium including flash memory, hard disk, multimedia card, card-type memory (such as SD or DX memory, etc.), random access memory (RAM), static random access memory ( SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
- the memory stores the operating system installed on the service node device, together with various application software and data.
- the processor may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
- all or part of the processes in the methods of the above embodiments of the present invention can also be carried out by a computer program instructing the relevant hardware.
- the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by the processor, the steps of the foregoing method embodiments can be implemented.
- the computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file, or some intermediate form, etc.
- the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals and software distribution media, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Public Health (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Pathology (AREA)
- Primary Health Care (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
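The present application discloses a disease risk prediction method, an electronic device, and a storage medium. The method includes: obtaining driving force information of the mutated genes of a detected object, belonging to a predetermined gene set, on the activity changes of several predetermined signaling pathways; obtaining driving force information of the mutated genes, belonging to the predetermined gene set, of each reference object in first and second reference object groups on the activity changes of the several predetermined signaling pathways, where each reference object in the first reference object group is a healthy object and each reference object in the second reference object group is an object suffering from a specific disease; performing a first clustering on the detected object and the reference objects in the first and second reference object groups according to the driving force information of the detected object and of each reference object; and outputting the risk of the detected object suffering from the specific disease according to the first clustering result obtained from the first clustering.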
Description
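The present application relates to biotechnology, and in particular to a disease risk prediction method, an electronic device, and a storage medium.
Breast cancer is one of the leading threats to women's health worldwide, with about 1.3 million new cases and about 500,000 deaths globally each year. Taking the 2015 statistics for China and the 2018 statistics for the United States as examples, breast cancer ranks first in incidence among cancers of all sites in women in both countries and ranks fifth and second, respectively, in mortality; the total number of surviving patients at the time of the statistics exceeded 260,000 in each country. On average, a woman has about a 12% lifetime chance of developing breast cancer. Multiple retrospective studies have shown that early prevention, early detection, and early treatment significantly improve the prognosis of breast cancer patients, especially for triple-negative breast cancer, which has early onset, poor prognosis, and an unclear mechanism.
With the development of biological technology, it has been found that signaling pathways control many crucial cell-biological processes during tumor development.
The present application aims to provide a solution for predicting disease risk based on signaling pathway information.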
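In one aspect, the present application provides a disease risk prediction method, executed by an electronic device, including:
obtaining driving force information of the mutated genes of a detected object, belonging to a predetermined gene set, on the activity changes of several predetermined signaling pathways;
obtaining driving force information of the mutated genes, belonging to the predetermined gene set, of each reference object in first and second reference object groups on the activity changes of the several predetermined signaling pathways, where each reference object in the first reference object group is a healthy object and each reference object in the second reference object group is an object suffering from a specific disease;
performing a first clustering on the detected object and the reference objects in the first and second reference object groups, according to the driving force information of the detected object's mutated genes on the activity changes of the several predetermined signaling pathways and the driving force information of the mutated genes of each reference object in the first and second reference object groups on the activity changes of the several predetermined signaling pathways; and
outputting the risk of the detected object suffering from the specific disease according to the first clustering result obtained from the first clustering.
In another aspect, the present application provides an electronic device, including a memory, a processor, and a program stored in the memory, the program being configured to be executed by the processor, where the processor, when executing the program, implements:
the disease risk prediction method described above.
In yet another aspect, the present application provides a storage medium storing a computer program, where the computer program, when executed by a processor, implements:
the disease risk prediction method described above.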
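In some embodiments of the present application, disease risk is predicted on the basis of signaling pathway information, using the driving force information of the detected object's mutated genes on the activity changes of several predetermined signaling pathways.
In some embodiments of the present application, all germline genetic information is used to comprehensively evaluate the overall characteristics of germline inheritance, so the risk assessment can cover the various sporadic and familial hereditary diseases (for example, breast cancer) caused by germline inheritance, which improves the sensitivity of detecting at-risk individuals.
In some embodiments of the present application, discrete, high-dimensional, multivariately correlated, non-standardized germline variation features are projected onto predicted gene expression features and signaling pathway activity features that have a continuous value range, relatively low dimensionality, and gradually converging correlation. This builds a quantitative model that converts discrete qualitative data into a continuous space. On the one hand, the model retains the global characteristics of the data; on the other hand, it provides a data-driven classification basis for associating germline genetic information with other deterministic events in breast cancer (including but not limited to pathophysiological features such as lymph node metastasis and age of onset).
In some embodiments of the present application, because the input is global rare germline variants, the risk rating and clinical feature association of sporadic hereditary breast cancers such as triple-negative breast cancer can be graded according to pathway activity, which fills the coverage gap of knowledge-driven methods based on gene panels and significantly reduces the false negative rate.
In some embodiments of the present application, because disease risk can be associated with other clinical, pathological, physiological, or behavioral deterministic event features, the model can provide a basis, from germline genetic information, for prognosis assessment, early clinical intervention, and management of patients.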
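To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for obtaining intracellular deterministic events according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for obtaining intracellular deterministic events according to another embodiment of the present application;
FIG. 3 is a schematic flowchart of a disease risk prediction method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.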
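To help those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments are described clearly below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The term "include" and any variants of it in the description, claims, and drawings of the present application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device. In addition, the terms "first", "second", "third", and so on are used to distinguish different objects, not to describe a particular order.
In the present application, global germline genetic information refers to all genetic information that originates from the parents, is encoded in the genome of all normal cells developed from the embryo, is carried by the individual for life, and can be passed to offspring through reproduction. Its forms include, but are not limited to, genomic DNA sequence and epigenetic modification information.
In the present application, an intracellular deterministic event refers to an event feature that results from the interaction of various molecules in an organism, by known or unknown mechanisms, and that can be detected qualitatively or quantitatively by various methods. Such events include, but are not limited to: activation or inhibition of signaling pathways; changes in the types and amounts of metabolites; the interaction patterns and states of biomolecules and their changes (the interactome), where biomolecules include macromolecules such as proteins and nucleic acids and small molecules such as lipids, small-molecule drugs, metabolites, and inorganic metal ions; and the structural morphology of polymers, cells, tissues, and organs and its changes. In the present application, intracellular deterministic events include gene expression determined by germline inheritance, signaling pathway activity, the risk of or resistance to breast cancer, and the probability of breast-cancer-related pathophysiological states.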
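FIG. 1 is a schematic flowchart of a method for obtaining intracellular deterministic events according to an embodiment of the present application. The method can be executed by an electronic device and includes:
S11: obtaining several mutated genes of a detected object that belong to a predetermined gene set.
S12: obtaining driving force information of each of the several mutated genes on the change of each gene in the predetermined gene set.
S13: according to the driving force information of each of the several mutated genes on the change of each gene in the predetermined gene set, obtaining the driving force information of the several mutated genes as a whole on the change of each gene in the predetermined gene set; and
S14: according to the driving force information of the several mutated genes on the change of each gene in the predetermined gene set, determining at least one predetermined type of intracellular deterministic event of the detected object.
In one implementation, determining at least one predetermined type of intracellular deterministic event of the detected object in S14 includes:
S141: obtaining intracellular deterministic event information of a first type for the detected object; and
S142: according to the intracellular deterministic event information of the first type, determining intracellular deterministic event information of a second type for the detected object.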
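In the present application, the detected object can be a living organism, for example, but not limited to, a human.
Taking humans as an example, the predetermined gene set can be, for example, some or all of the genes in the known human genome.
The several mutated genes of the detected object belong to the predetermined gene set and can be rare germline mutated genes or global germline mutated genes, depending on the actual situation.
In one implementation, the global germline genetic information of the detected object, for example whole-exome sequencing data, can be obtained, and the rare germline mutated genes determined from it. The rare germline mutated genes of the detected object can be determined, for example, by checking whether the mutated genes in the detected object's whole-exome sequencing data belong to a predetermined set of rare mutated genes. The set of rare germline mutated genes can be determined by a set variation frequency threshold; in other words, if the probability that a gene is mutated in the population is lower than the set variation frequency threshold, the gene is a rare germline mutated gene.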
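The frequency-threshold filter described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `rare_mutated_genes`, the gene identifiers, the threshold value 0.01, and the choice to treat a gene with no recorded population frequency as rare are all assumptions, and "rare" is read here as a population frequency below the threshold.

```python
def rare_mutated_genes(subject_genes, population_frequency, threshold=0.01):
    """Return the subject's mutated genes whose population variant
    frequency is below the threshold (hypothetical helper)."""
    return sorted(
        g for g in subject_genes
        # Assumption: a gene with no recorded frequency counts as rare.
        if population_frequency.get(g, 0.0) < threshold
    )

# Toy data; gene names and frequencies are illustrative only.
population_frequency = {"G1": 0.002, "G2": 0.15, "G3": 0.009}
subject_genes = ["G1", "G2", "G3", "G4"]  # G4 has no recorded frequency
rare = rare_mutated_genes(subject_genes, population_frequency)
```

Here `rare` holds the genes that would be passed on as the detected object's rare germline mutated genes.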
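It is understood that, in other implementations, other high-throughput global data can be used in place of whole-exome sequencing data; such data includes, but is not limited to, whole-exome sequencing, whole-genome sequencing, gene chip, and expression chip data.
In a specific example, the aforementioned intracellular deterministic event information of the first type can be the driving force information of the several mutated genes of the detected object on the activity change of at least one predetermined signaling pathway, and the intracellular deterministic event information of the second type can be the predicted risk of the detected object suffering from a specific disease.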
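FIG. 2 is a schematic flowchart of a method for obtaining intracellular deterministic events according to an embodiment of the present application. The method can be executed by an electronic device. In this embodiment, the driving force of the several mutated genes of the detected object on the activity change of at least one predetermined signaling pathway can be obtained. The method of this embodiment includes:
S21: obtaining several mutated genes of a detected object that belong to a predetermined gene set;
S22: obtaining driving force information of each of the several mutated genes on the change in gene expression of each gene in the predetermined gene set;
S23: according to the driving force information of each of the several mutated genes on the change in gene expression of each gene in the predetermined gene set, obtaining the driving force information of the several mutated genes as a whole on the change in gene expression of each gene in the predetermined gene set; and
S24: according to the driving force information of the several mutated genes on the change in gene expression of each gene in the predetermined gene set, determining the driving force information of the several mutated genes of the detected object on the activity change of at least one predetermined signaling pathway.
In the present application, gene expression refers to the amount of RNA product transcribed from, or the amount of protein translated from, a detectable gene on the genome. The gene expression level can be a value in a continuous range and can be obtained from existing data.
In one implementation of the present application, determining the at least one predetermined type of intracellular deterministic event information of the detected object includes: determining the driving force information of the several mutated genes of the detected object on the activity changes of multiple predetermined signaling pathways. The multiple predetermined signaling pathways can be selected from signaling pathways known in the prior art; for example, signaling pathways can be selected whose genes overlap with the genes in the predetermined gene set to a degree greater than a predetermined threshold.
The driving force of a mutated gene on the activity change of a signaling pathway represents the ability of the mutated gene to influence the activity change of that pathway.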
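The overlap-based pathway selection mentioned above can be illustrated with a short sketch. The text does not define the degree of overlap, so this example assumes it is the fraction of a pathway's genes that also appear in the predetermined gene set; the function names, gene identifiers, and the threshold 0.5 are hypothetical.

```python
def overlap_fraction(pathway_genes, gene_set):
    """Fraction of a pathway's genes that also belong to the gene set
    (an assumed definition of the degree of overlap)."""
    pathway_genes = set(pathway_genes)
    return len(pathway_genes & set(gene_set)) / len(pathway_genes)

def select_pathways(pathways, gene_set, threshold):
    """Keep the pathways whose overlap with the gene set exceeds the threshold."""
    return [name for name, genes in pathways.items()
            if overlap_fraction(genes, gene_set) > threshold]

# Toy data; names are illustrative only.
gene_set = {"G1", "G2", "G3", "G4"}
pathways = {
    "pathway_a": ["G1", "G2", "G9"],        # 2 of 3 genes overlap
    "pathway_b": ["G7", "G8", "G9", "G1"],  # 1 of 4 genes overlap
}
selected = select_pathways(pathways, gene_set, threshold=0.5)
```

Other overlap measures (for example, a Jaccard index) would fit the same description; the choice here is only for illustration.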
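In one implementation of the present application, obtaining, in S22, the driving force information of each of the several mutated genes on the change in gene expression of each gene in the predetermined gene set includes:
obtaining this driving force information from previously obtained template data, where the template data includes the driving force information of each gene in the predetermined gene set on the change in gene expression of every gene in the predetermined gene set.
In one implementation of the present application, the method for obtaining the template data includes performing the following processing for each gene gi in the predetermined gene set:
S221: dividing the predetermined reference cell lines into a first cell line group and a second cell line group, where the first cell line group includes the reference cell lines that carry the mutated gene gi, and the second cell line group includes the reference cell lines that do not carry the mutated gene gi.
S222: for each gene gj in the predetermined gene set, obtaining difference information between the average gene expression information of gene gj in the reference cell lines of the first cell line group and the average gene expression information of gene gj in the reference cell lines of the second cell line group.
S223: denoising the difference information.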
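A specific example follows.
Let n be the number of genes in the predetermined gene set and p the number of reference cell lines.
For each gene gi in the predetermined gene set, the p reference cell lines are divided into two groups: a first cell line group (also called the mutant group) $mt_i$ and a second cell line group (also called the wild-type group) $wt_i$. The first cell line group includes those of the p reference cell lines that carry the mutated gene gi (let their number be $p_{i1}$), and the second cell line group includes those that do not (let their number be $p_{i2}$).
Then, for each gene gj in the predetermined gene set, the difference information between the average gene expression information of gene gj in the $p_{i1}$ reference cell lines of the first cell line group and that in the $p_{i2}$ reference cell lines of the second cell line group is calculated. Specifically, this can be the difference de between the mean expression value of gene gj in the $p_{i1}$ reference cell lines of the first group and the mean expression value of gene gj in the $p_{i2}$ reference cell lines of the second group:
$$de_{ij} = \mu_{mt,ij} - \mu_{wt,ij}$$
where $de_{ij}$ is the difference between the mean expression value of gene gj in the reference cell lines of the mutant group $mt_i$ of gene gi and the mean expression value of gene gj in the reference cell lines of the wild-type group $wt_i$; $\mu_{mt,ij}$ is the mean expression value of gene gj in the reference cell lines of the mutant group $mt_i$; and $\mu_{wt,ij}$ is the mean expression value of gene gj in the reference cell lines of the wild-type group $wt_i$.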
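Further, the difference $de_{ij}$ can be denoised.
In one implementation, a predetermined number (for example, but not limited to, 10000) of random simulations are performed first. In each simulation, the p cell lines are randomly assigned to the mutant group and the wild-type group, with $p_{i1}$ reference cell lines in the mutant group and $p_{i2}$ in the wild-type group. The difference $de_{null}$ between the mean expression values of each gene in the two randomly formed groups is then calculated.
The differences $de_{null}$ obtained from the random simulations are then used to denoise (also called standardize) $de_{ij}$. The standardized value represents the driving force df, and the standardization can be implemented by the following formula:
$$df_{ij} = \frac{de_{ij} - \mathrm{mean}(de_{null})}{\mathrm{std}(de_{null})}$$
where $df_{ij}$ is the driving force information of gene gi on the change in gene expression of gene gj, and $\mathrm{mean}(de_{null})$ and $\mathrm{std}(de_{null})$ are the mean and standard deviation, respectively, of $de_{null}$ calculated over the 10000 random simulations.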
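The calculation of $de_{ij}$ and its standardization into $df_{ij}$ can be sketched for a single gene pair as below. This is an illustrative sketch under stated assumptions: the function names and toy expression values are hypothetical, the random regrouping is done by shuffling group membership (which keeps the group sizes $p_{i1}$ and $p_{i2}$), and the population standard deviation is used because the text does not say which estimator applies.

```python
import random
from statistics import mean, pstdev

def mean_difference(values, in_mutant_group):
    """de: mean expression in the mutant group minus mean in the wild-type group."""
    mutant = [v for v, m in zip(values, in_mutant_group) if m]
    wild = [v for v, m in zip(values, in_mutant_group) if not m]
    return mean(mutant) - mean(wild)

def driving_force(values, in_mutant_group, n_sim=10000, seed=0):
    """df: z-score of the observed difference against random regroupings."""
    rng = random.Random(seed)
    observed = mean_difference(values, in_mutant_group)
    flags = list(in_mutant_group)
    null = []
    for _ in range(n_sim):
        rng.shuffle(flags)  # random split with the same group sizes
        null.append(mean_difference(values, flags))
    # Assumption: population standard deviation of the null differences.
    return (observed - mean(null)) / pstdev(null)

# Toy expression values of gene gj in 10 reference cell lines;
# the first three lines carry the mutated gene gi.
expr_gj = [9.0, 10.0, 11.0, 1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0]
mutated_in_gi = [True] * 3 + [False] * 7
df_ij = driving_force(expr_gj, mutated_in_gi, n_sim=2000)
```

Repeating this for every pair (gi, gj) would fill the n x n template matrix described in the text.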
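The above process calculates the driving force of one gene gi on the change in gene expression of each gene gj. Performing this calculation for all n genes in the predetermined gene set gives the driving force information of each gene in the predetermined gene set on the change in gene expression of every gene in the set, that is, the template data. In one implementation, the template data can be represented by an n x n matrix in which each row corresponds to a gene gi, each column corresponds to a gene gj, and each value is the driving force of the row gene on the change in gene expression of the column gene.
Each detected object carries a different number of mutated genes; suppose the detected object carries m mutated genes. In one implementation, determining the driving force information of each of the m mutated genes on the change in gene expression of each gene in the predetermined gene set can include: taking the m rows corresponding to these m mutated genes from the n x n matrix, giving an m x n matrix.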
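In one implementation of the present application, the method in S23 for obtaining the driving force information of the several mutated genes of the detected object on the change in gene expression of each gene in the predetermined gene set includes performing the following processing for each gene gj in the predetermined gene set:
S231: performing weighted averaging on the driving force information of each of the detected object's mutated genes on the change in gene expression of each gene in the predetermined gene set.
To determine the overall effect of the m mutated genes of the detected object, the driving forces of the individual genes can be weighted (w) and then averaged to give DF:
$$DF_j = \frac{1}{m}\sum_{k=1}^{m} w_k \, df_{i_k j}$$
where $DF_j$ is the average driving force of all m mutated genes of the detected object on the change in gene expression of gene gj in the predetermined gene set, $i_k$ is the row number of the k-th mutated gene of the detected object in the n x n matrix, $w_k$ is the weight of that gene, and df is the value at the corresponding position of the n x n matrix.
A simple approach is to assume that the driving forces of all mutated genes have the same weight; it is understood that the weights can also differ.
S232: denoising the result $DF_j$ of the weighted averaging. In one implementation, a predetermined number (for example, but not limited to, 10000) of random simulations are performed first. In each simulation, m genes are randomly drawn from the n genes of the predetermined gene set and weighted-averaged to obtain $DF_{null}$.
The weighted averages $DF_{null}$ obtained from the random simulations are then used to denoise (also called standardize) $DF_j$; the standardization can be implemented by the following formula:
$$ZDF_j = \frac{DF_j - \mathrm{mean}(DF_{null})}{\mathrm{std}(DF_{null})}$$
where $ZDF_j$ is the driving force of all m mutated genes carried by the detected object on the change in gene expression of gene gj in the predetermined gene set, and $\mathrm{mean}(DF_{null})$ and $\mathrm{std}(DF_{null})$ are the mean and standard deviation, respectively, of $DF_{null}$ calculated over the 10000 random simulations.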
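Steps S231 and S232 can be sketched as follows, turning an object's m rows of the template matrix into a single vector of length n. The sketch assumes equal weights by default, draws the m random genes without replacement, and uses the population standard deviation; the function names and the randomly generated 6 x 6 template are illustrative only.

```python
import random
from statistics import mean, pstdev

def combined_force(template, rows, j, weights=None):
    """DF_j: weighted average of df[i_k][j] over the object's mutated genes."""
    if weights is None:
        weights = [1.0] * len(rows)  # equal weights, the simple case in the text
    return sum(w * template[i][j] for w, i in zip(weights, rows)) / len(rows)

def subject_gene_vector(template, rows, n_sim=10000, seed=0):
    """ZDF: one standardized value per gene gj, giving a 1 x n vector."""
    rng = random.Random(seed)
    n, m = len(template), len(rows)
    vector = []
    for j in range(len(template[0])):
        observed = combined_force(template, rows, j)
        # Null distribution: m genes drawn at random from the n genes.
        null = [combined_force(template, rng.sample(range(n), m), j)
                for _ in range(n_sim)]
        vector.append((observed - mean(null)) / pstdev(null))
    return vector

# Toy 6 x 6 template matrix with made-up driving forces.
gen = random.Random(1)
template = [[gen.gauss(0.0, 1.0) for _ in range(6)] for _ in range(6)]
zdf = subject_gene_vector(template, rows=[0, 3], n_sim=500)
```

Because every object ends up with a vector of the same length, objects carrying different numbers of mutated genes become directly comparable.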
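After the driving force of all m mutated genes carried by the detected object on the change in gene expression of each gene in the predetermined gene set has been obtained, the result is a 1 x n matrix. Although each detected object carries a different number of mutated genes, this processing converts the different m x n matrices of different detected objects into 1 x n matrices of the same shape, which can then be compared in the same dimension.
In one implementation of the present application, let q be the number of predetermined signaling pathways. Obtaining, in S24, the driving force information of the several mutated genes of the detected object on the activity change of at least one predetermined signaling pathway includes performing the following processing for each signaling pathway sj:
S241: obtaining information on the effect of each gene gi in the predetermined gene set on the activity of the signaling pathway sj; and
S242: according to the information on the effect of each gene gi in the predetermined gene set on the activity of the signaling pathway sj, obtaining information on the combined effect of the several mutated genes of the detected object on the activity of the signaling pathway sj.
In one implementation of the present application, obtaining, in S241, the information on the effect of each gene gi in the predetermined gene set on the activity of the signaling pathway sj includes:
S2411: obtaining the driving force information of each gene gi on the change in gene expression of each gene in the signaling pathway sj;
S2412: obtaining information on the effect of a change in gene expression of each gene ak in the signaling pathway sj on the signaling pathway sj; and
S2413: according to the driving force information obtained in S2411 and the effect information obtained in S2412, obtaining the information on the effect of each gene gi in the predetermined gene set on the activity of the signaling pathway sj.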
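In one implementation of the present application, the information on the effect of each gene gi in the predetermined gene set on the activity of the signaling pathway sj is obtained first. Suppose a signaling pathway consists of k genes, and the effect of a change in gene expression of each gene in the pathway on the activity of the pathway is one of two kinds, up-regulation (up) or down-regulation (down). The effect of gene gi on the activity of the j-th signaling pathway can then be determined by the following formula:
where $DFP_{ij}$ is the value of the effect of a gene gi in the predetermined gene set on the activity of the j-th signaling pathway; df is the value at the corresponding position of the aforementioned n x n matrix; $j_a$ is the column number, in the n x n matrix, of the a-th gene of the j-th signaling pathway; and $sig_a$ is the effect of the a-th gene on the activity of the j-th signaling pathway, which can be obtained from existing data and which, in one example, takes the value 1 for up-regulation and -1 for down-regulation.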
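The formula for $DFP_{ij}$ is not reproduced in this text. One reading that is consistent with the symbol definitions above is a signed sum over the k genes of the pathway, $DFP_{ij} = \sum_{a=1}^{k} sig_a \cdot df_{i\,j_a}$; this form is an assumption, and an average instead of a sum would differ only by a constant factor. The sketch below uses that assumed form, with hypothetical names and toy values.

```python
def pathway_effect(template, i, pathway_columns, signs):
    """DFP_ij under the assumed form: sum over the pathway's genes a of
    sig_a * df[i][j_a], where sig_a is +1 (up) or -1 (down)."""
    return sum(s * template[i][c] for c, s in zip(pathway_columns, signs))

# Toy excerpt of the template matrix: two gene rows, four gene columns.
template = [
    [0.5, -1.0, 2.0, 0.0],
    [1.5, 0.5, -0.5, 1.0],
]
# A hypothetical pathway made of the genes in columns 1 and 2, where
# the first gene down-regulates and the second up-regulates the pathway.
dfp = pathway_effect(template, i=0, pathway_columns=[1, 2], signs=[-1, 1])
```

With these toy values, gene row 0 pushes the pathway activity upward through both of its genes.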
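Further, $DFP_{ij}$ can be denoised.
In one implementation, a predetermined number (for example, but not limited to, 10000) of random simulations are performed first. In each simulation, the data corresponding to k randomly chosen genes can be taken from the aforementioned n x n matrix and used to calculate $DFP_{null}$ by the above formula.
The $DFP_{null}$ values obtained from the random simulations are then used to denoise (also called standardize) $DFP_{ij}$; the standardization can be implemented by the following formula:
$$ZDFP_{ij} = \frac{DFP_{ij} - \mathrm{mean}(DFP_{null})}{\mathrm{std}(DFP_{null})}$$
where $ZDFP_{ij}$ is the driving force of a gene gi in the predetermined gene set on the activity change of the j-th signaling pathway, and $\mathrm{mean}(DFP_{null})$ and $\mathrm{std}(DFP_{null})$ are the mean and standard deviation, respectively, of $DFP_{null}$ calculated over the 10000 random simulations.
After the driving force $ZDFP_{ij}$ of each gene gi of the n genes in the predetermined gene set on the activity change of each signaling pathway sj of the q predetermined signaling pathways has been obtained, the result is an n x q matrix.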
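In one implementation of the present application, the information on the combined effect of the several mutated genes of the detected object on the activity of the signaling pathway sj in S242 can be obtained by the following formula:
where $IDFP_j$ is the combined effect of the m mutated genes of the detected object on the activity of the signaling pathway sj, and $i_a$ is the row number, in the aforementioned n x q matrix, of the a-th mutated gene of the detected object.
Further, $IDFP_j$ can be denoised.
In one implementation, a predetermined number (for example, but not limited to, 10000) of random simulations are performed first. In each simulation, m rows are randomly taken from the n x q matrix and used to calculate $IDFP_{null}$ by the above formula.
The $IDFP_{null}$ values obtained from the random simulations are then used to denoise (also called standardize) $IDFP_j$; the standardization can be implemented by the following formula:
$$ZIDFP_j = \frac{IDFP_j - \mathrm{mean}(IDFP_{null})}{\mathrm{std}(IDFP_{null})}$$
where $ZIDFP_j$ is the driving force of all m mutated genes carried by the detected object on the activity change of the j-th signaling pathway, and $\mathrm{mean}(IDFP_{null})$ and $\mathrm{std}(IDFP_{null})$ are the mean and standard deviation, respectively, of $IDFP_{null}$ calculated over the 10000 random simulations.
After the driving force of all m mutated genes carried by the detected object on the activity change of each signaling pathway has been obtained, the result is a 1 x q matrix. In this way, every detected object is represented by a 1 x q matrix, regardless of how many mutated genes it carries or which genes are mutated.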
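The step from the n x q matrix to an object's 1 x q profile can be sketched as below. The formula for $IDFP_j$ is not reproduced in this text, so the sketch assumes it is the sum of the $ZDFP$ values in column j over the rows of the object's m mutated genes; a mean would give the same standardized result. The function names and the randomly generated 8 x 3 matrix are hypothetical, and the population standard deviation is an assumption.

```python
import random
from statistics import mean, pstdev

def combined_pathway_effect(zdfp, rows, j):
    """IDFP_j under the assumed form: sum of ZDFP[i][j] over the object's
    mutated genes (rows of the n x q matrix)."""
    return sum(zdfp[i][j] for i in rows)

def pathway_profile(zdfp, rows, n_sim=10000, seed=0):
    """ZIDFP: the object's 1 x q vector of standardized pathway driving forces."""
    rng = random.Random(seed)
    n, q, m = len(zdfp), len(zdfp[0]), len(rows)
    profile = []
    for j in range(q):
        observed = combined_pathway_effect(zdfp, rows, j)
        # Null distribution: m rows drawn at random from the n rows.
        null = [combined_pathway_effect(zdfp, rng.sample(range(n), m), j)
                for _ in range(n_sim)]
        profile.append((observed - mean(null)) / pstdev(null))
    return profile

# Toy n x q matrix (8 genes, 3 pathways) with made-up values.
gen = random.Random(2)
zdfp = [[gen.gauss(0.0, 1.0) for _ in range(3)] for _ in range(8)]
profile = pathway_profile(zdfp, rows=[1, 4, 6], n_sim=500)
```

The resulting `profile` plays the role of the 1 x q matrix that represents one detected object or one reference object in the clustering steps.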
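FIG. 3 is a schematic flowchart of a disease risk prediction method according to an embodiment of the present application. The method can be executed by an electronic device and includes:
S31: obtaining driving force information of the mutated genes of a detected object, belonging to a predetermined gene set, on the activity changes of several predetermined signaling pathways;
S32: obtaining driving force information of the mutated genes, belonging to the predetermined gene set, of each reference object in first and second reference object groups on the activity changes of the several predetermined signaling pathways, where each reference object in the first reference object group is a healthy object and each reference object in the second reference object group is an object suffering from a specific disease;
S33: performing a first clustering on the detected object and the reference objects in the first and second reference object groups, according to the driving force information of the detected object's mutated genes on the activity changes of the several predetermined signaling pathways and the driving force information of the mutated genes of each reference object in the first and second reference object groups on the activity changes of the several predetermined signaling pathways; and
S34: outputting the risk of the detected object suffering from the specific disease according to the first clustering result obtained from the first clustering.
In a specific example, the specific disease is triple-negative breast cancer. It is understood that the disease risk prediction method of this embodiment can also be used for other suitable specific diseases and is not limited to triple-negative breast cancer.
In one implementation, after the first clustering is performed on the detected object and the reference objects in the first and second reference object groups, the method further includes: merging the clusters obtained from the first clustering into multiple groups.
In one implementation, after the first clustering is performed on the detected object and the reference objects in the first and second reference object groups, the method further includes: obtaining and outputting at least one of the clinically or pathologically relevant deterministic event features, pathological features, physiological features, and behavioral features of the reference objects that belong to the same disease risk level as the detected object.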
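In one implementation, the first clustering of the detected object and the reference objects in the first and second reference object groups is performed with the NMRCLUST clustering method. It is understood that other clustering methods can be chosen for the first clustering as the situation requires, including but not limited to hierarchical methods (for example, the k-nearest-neighbor (kNN) algorithm), partition-based methods (for example, K-means clustering), density-based methods (for example, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)), grid-based methods (for example, the STatistical INformation Grid (STING) algorithm), or model-based methods (for example, Gaussian mixture models (GMM)); the present application includes but is not limited to these.
In one implementation, before the driving force information of the detected object's mutated genes on the activity changes of the several predetermined signaling pathways is obtained, the method includes: determining the several predetermined signaling pathways from multiple reference signaling pathways.
In one implementation, before the several predetermined signaling pathways are determined from the multiple reference signaling pathways, the method includes: determining the pre-classification type of the detected object; according to the pre-classification type, determining the first reference object group from a third reference object group, where each reference object in the third reference object group is a healthy object and the first reference object group corresponds to the pre-classification type; and, according to the pre-classification type, determining the second reference object group from a fourth reference object group, where each reference object in the fourth reference object group is an object suffering from the specific disease and the second reference object group corresponds to the pre-classification type.
Determining the several predetermined signaling pathways from the multiple reference signaling pathways includes: determining the several predetermined signaling pathways from the multiple reference signaling pathways according to the pre-classification type.
In one implementation, determining the pre-classification type of the detected object includes: obtaining driving force information of the detected object's mutated genes on the activity changes of the multiple reference signaling pathways; obtaining driving force information of the mutated genes of each reference object in the third and fourth reference object groups on the activity changes of the multiple reference signaling pathways; and, according to both sets of driving force information, performing a second clustering on the detected object and the reference objects in the third and fourth reference object groups.
In one implementation, the second clustering of the detected object and the reference objects in the third and fourth reference object groups is performed with the Ward hierarchical clustering method. It is understood that other clustering methods can be chosen for the second clustering as the situation requires, for example hierarchical methods (for example, the k-nearest-neighbor (kNN) algorithm), partition-based methods (for example, K-means clustering), density-based methods (for example, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)), grid-based methods (for example, the STatistical INformation Grid (STING) algorithm), or model-based methods (for example, Gaussian mixture models (GMM)); the present application includes but is not limited to these.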
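The second clustering can be illustrated with a small, naive implementation of Ward's agglomerative criterion, which at each step merges the two clusters whose union increases the within-cluster sum of squares the least. This is a sketch for illustration only: the function names and toy 2-dimensional profiles are hypothetical, the cut at two clusters is assumed from the class A / class B example, and a production analysis would more likely use an established statistics library.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clusters(vectors, k):
    """Agglomerate until k clusters remain; returns lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                cost = ward_cost([vectors[i] for i in clusters[x]],
                                 [vectors[i] for i in clusters[y]])
                if best is None or cost < best[0]:
                    best = (cost, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy pathway profiles: four reference objects followed by the detected object.
references = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
detected = [4.8, 5.1]
clusters = ward_clusters(references + [detected], k=2)
detected_index = len(references)
detected_cluster = next(c for c in clusters if detected_index in c)
```

The reference objects that share `detected_cluster` with the detected object would then define its pre-classification type.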
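In one implementation of the present application, determining the several predetermined signaling pathways from the multiple reference signaling pathways according to the pre-classification type includes: according to the pre-classification type, determining from the third reference object group a fifth reference object group corresponding to the pre-classification type; according to the pre-classification type, determining from the fourth reference object group a sixth reference object group corresponding to the pre-classification type; for each signaling pathway sk of the multiple signaling pathways, determining the difference between the driving force information of the mutated genes of the reference objects in the fifth reference object group on the activity change of the signaling pathway sk and the driving force information of the mutated genes of the reference objects in the sixth reference object group on the activity change of the signaling pathway sk; and, according to this difference, determining from the multiple signaling pathways the several predetermined signaling pathways that satisfy a preset difference significance condition.
In one implementation of the present application, the method for determining this difference includes: obtaining the difference between the average driving force value of the mutated genes of the reference objects in the sixth reference object group on the activity change of the signaling pathway sk and the average driving force value of the mutated genes of the reference objects in the fifth reference object group on the activity change of the signaling pathway sk.
Further, the difference can be denoised.
In one implementation of the present application, outputting the risk of the detected object suffering from the specific disease according to the first clustering result includes: determining and outputting the risk of the detected object suffering from the specific disease according to at least the cluster to which the detected object belongs and the ratio, within that cluster, of the number of reference objects belonging to the second reference object group to the number of reference objects belonging to the first reference object group.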
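The disease risk prediction method of the present application is described in detail below through a specific example, taking triple-negative breast cancer as the disease. In this embodiment, the driving force information of the several mutated genes of the detected object on the activity changes of q predetermined signaling pathways, obtained in the foregoing embodiments of the method for obtaining intracellular deterministic events, can be used to predict the risk of the detected object suffering from triple-negative breast cancer.
In the present application, triple-negative breast cancer (TNBC) refers to breast cancer in which the estrogen receptor (ER), the progesterone receptor (PR), and the HER2 gene all test negative in breast cancer molecular typing. It accounts for about 15% of all breast cancer patients and is characterized by early onset, poor prognosis, unclear pathogenesis, and low response to treatment.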
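For a third reference object group consisting of $n_1$ healthy people, each person can be represented by the aforementioned 1 x q matrix, which gives the driving force information of that person's mutated genes on the activity changes of the q signaling pathways. Cluster analysis of these $n_1$ 1 x q matrices, that is, of the $n_1$ x q matrix (for example, by the Ward hierarchical clustering method), shows that these reference objects can be divided into two classes: class A and class B.
For a fourth reference object group consisting of $n_2$ triple-negative breast cancer patients, each patient can be represented by the aforementioned 1 x q matrix, which gives the driving force information of that patient's mutated genes on the activity changes of the q signaling pathways. Cluster analysis of these $n_2$ 1 x q matrices, that is, of the $n_2$ x q matrix (for example, by the Ward hierarchical clustering method), shows that these patients can also be divided into two classes: class A and class B.
In other words, cluster analysis of the $n_1$ x q matrix and the $n_2$ x q matrix of the third and fourth reference object groups divides the reference objects in these groups into class A and class B, and each class contains both healthy people and triple-negative breast cancer patients.
When the risk of a detected object suffering from triple-negative breast cancer is to be predicted, the 1 x q matrix of the detected object can be obtained by the method of the foregoing embodiments. The 1 x q matrix of the detected object is then clustered together with the $n_1$ x q and $n_2$ x q matrices of the third and fourth reference object groups, for example by the Ward hierarchical clustering method, as the second clustering, to determine the pre-classification type of the detected object. As described above, the reference objects in the third and fourth reference object groups are divided into class A and class B, and the detected object is clustered into class A or class B; that is, after the second clustering, the pre-classification type of the detected object is determined to be class A or class B.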
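Suppose the pre-classification type of the detected object is class A. A fifth reference object group corresponding to class A is determined from the third reference object group, and a sixth reference object group corresponding to class A is determined from the fourth reference object group. It is understood that the fifth reference object group can include some or all of the class A reference objects in the third reference object group, and the sixth reference object group can include some or all of the class A reference objects in the fourth reference object group. Let the numbers of class A healthy people in the fifth reference object group and class A triple-negative breast cancer patients in the sixth reference object group be $n_{1a}$ and $n_{2a}$, respectively. The difference $DP_k$ between the driving force information of the mutated genes of the class A triple-negative breast cancer patients in the sixth reference object group on the activity change of the k-th signaling pathway sk and that of the class A healthy people in the fifth reference object group can then be determined by the following formula:
$$DP_k = \frac{1}{n_{2a}}\sum_{i=1}^{n_{2a}} ZIDFP_{ik} - \frac{1}{n_{1a}}\sum_{j=1}^{n_{1a}} ZIDFP_{jk}$$
where $ZIDFP_{ik}$ is the driving force of the mutated genes carried by the i-th triple-negative breast cancer patient on the activity change of the k-th signaling pathway, and $ZIDFP_{jk}$ is the driving force of the mutated genes carried by the j-th healthy person on the activity change of the k-th signaling pathway.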
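Further, $DP_k$ can be denoised.
In one implementation, a predetermined number (for example, but not limited to, 1000000) of random simulations are performed first. In each random simulation, the healthy-person or triple-negative-breast-cancer-patient label of each reference object is randomly shuffled, and $DP_{null}$ is calculated according to the above formula.
The $DP_{null}$ values obtained from the random simulations are then used to denoise (also called standardize) $DP_k$; the standardization can be implemented by the following formula:
$$ZDP_k = \frac{DP_k - \mathrm{mean}(DP_{null})}{\mathrm{std}(DP_{null})}$$
where $\mathrm{mean}(DP_{null})$ and $\mathrm{std}(DP_{null})$ are the mean and standard deviation, respectively, of $DP_{null}$ calculated over the 1000000 random simulations. The further $ZDP_k$ deviates from 0, the less likely it is that the difference in activity of this signaling pathway between triple-negative breast cancer patients and healthy people is random, and the more likely it is to have specific biological significance.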
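Next, according to the obtained difference between the driving force information of the mutated genes of the reference objects in the fifth reference object group on the activity changes of the q signaling pathways and that of the reference objects in the sixth reference object group, several signaling pathways that satisfy the preset difference significance condition can be determined from the q signaling pathways.
In one implementation, the q1 (for example, 8) signaling pathways with the largest absolute value of $ZDP_k$ among the q signaling pathways can be selected for subsequent analysis.
The q1 entries corresponding to these q1 signaling pathways are taken from the 1 x q matrix of the detected object, giving the driving force information of the detected object's mutated genes on the activity changes of the q1 reference signaling pathways.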
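The pathway selection above can be sketched as follows: $DP_k$ is computed as the patient mean minus the healthy mean for each pathway, standardized against label permutations to give $ZDP_k$, and the q1 pathways with the largest $|ZDP_k|$ are kept. The function names, the toy profiles, the reduced simulation count, and the use of the population standard deviation are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

def group_difference(values, is_patient):
    """DP_k: mean over patients minus mean over healthy reference objects."""
    patients = [v for v, p in zip(values, is_patient) if p]
    healthy = [v for v, p in zip(values, is_patient) if not p]
    return mean(patients) - mean(healthy)

def zdp(values, is_patient, n_sim, rng):
    """ZDP_k: z-score of DP_k against randomly shuffled labels."""
    observed = group_difference(values, is_patient)
    labels = list(is_patient)
    null = []
    for _ in range(n_sim):
        rng.shuffle(labels)  # shuffle the healthy / patient labels
        null.append(group_difference(values, labels))
    return (observed - mean(null)) / pstdev(null)

def top_pathways(profiles, is_patient, q1, n_sim=10000, seed=0):
    """Indices of the q1 pathways with the largest |ZDP_k|, plus all scores."""
    rng = random.Random(seed)
    q = len(profiles[0])
    scores = [zdp([row[k] for row in profiles], is_patient, n_sim, rng)
              for k in range(q)]
    ranked = sorted(range(q), key=lambda k: abs(scores[k]), reverse=True)
    return sorted(ranked[:q1]), scores

# Toy 1 x q profiles (q = 3) for three patients and three healthy people.
profiles = [
    [3.0, 0.1, -2.0], [3.2, -0.2, -2.2], [2.9, 0.3, -1.9],  # patients
    [-3.1, 0.2, 2.1], [-2.8, -0.1, 1.8], [-3.0, 0.0, 2.0],  # healthy
]
is_patient = [True, True, True, False, False, False]
selected, scores = top_pathways(profiles, is_patient, q1=2, n_sim=2000)
# Keep only the selected pathways from a detected object's profile.
detected_profile = [2.5, 0.0, -1.5]
detected_q1 = [detected_profile[k] for k in selected]
```

In this toy data, the first and third pathways separate the two groups clearly, while the second one behaves like noise and is dropped.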
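In addition, since the pre-classification type of the detected object is class A, the first reference object group, corresponding to class A healthy people, is determined from the third reference object group, and the second reference object group, corresponding to class A triple-negative breast cancer patients, is determined from the fourth reference object group. The q1 entries corresponding to the q1 signaling pathways are taken from the 1 x q matrix of each reference object in the first and second reference object groups, giving the driving force information of each such reference object's mutated genes on the activity changes of the q1 reference signaling pathways.
It is understood that the first reference object group can include some or all of the class A reference objects in the third reference object group, and the second reference object group can include some or all of the class A reference objects in the fourth reference object group. The first reference object group can be the same as or different from the fifth reference object group, and the second reference object group can be the same as or different from the sixth reference object group.
Then, according to the driving force information of the detected object's mutated genes on the activity changes of the q1 reference signaling pathways and the driving force information of the mutated genes of the reference objects in the first and second reference object groups on the activity changes of the q1 reference signaling pathways, a first clustering is performed on the detected object and the reference objects in the first and second reference object groups, giving u clusters.
The first clustering can be implemented, for example, with the NMRCLUST clustering method. The NMRCLUST clustering method uses average-linkage clustering and then a penalty function to optimize the number of clusters and the distance between clusters simultaneously. For example, the number of clusters with the minimum penalty value can be chosen, so that the class A detected object and the reference objects in the first and second reference object groups are clustered into u (for example, 15) clusters, each of which can correspond to a different disease risk level. It is understood that other clustering methods can be chosen for the first clustering as the situation requires, and the present application is not limited in this respect.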
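Next, the risk of the detected object suffering from triple-negative breast cancer is output according to the first clustering result obtained from the first clustering. After the first clustering, it can be determined which of the u clusters the detected object belongs to, as well as, for each cluster, the number of reference objects belonging to the first reference object group (that is, the number of healthy people) and the number belonging to the second reference object group (that is, the number of triple-negative breast cancer patients). The percentage ratio of the number of triple-negative breast cancer patients to the number of healthy people in each cluster is then calculated as a quantitative parameter of the disease risk level; the larger the percentage, the more likely the disease. Sorting the percentages of the clusters by size determines the relative disease risk level of each cluster. The risk of the detected object suffering from triple-negative breast cancer can therefore be predicted from the cluster to which it belongs.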
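The ranking of clusters by disease burden can be sketched as below. One detail is an assumption: to avoid division by zero in clusters without healthy people, the sketch scores each cluster by the fraction of its reference objects that are patients, rather than by the patient-to-healthy ratio named in the text; both orderings agree whenever the ratio is defined. Function and variable names are hypothetical.

```python
from collections import Counter

def cluster_risk_levels(cluster_of_reference, is_patient):
    """Return (patient fraction per cluster, risk level per cluster),
    where level 1 is the lowest risk."""
    patients, totals = Counter(), Counter()
    for c, p in zip(cluster_of_reference, is_patient):
        totals[c] += 1
        patients[c] += int(p)
    # Assumption: patient fraction stands in for the patient/healthy ratio.
    fraction = {c: patients[c] / totals[c] for c in totals}
    ranked = sorted(fraction, key=lambda c: fraction[c])
    level = {c: rank + 1 for rank, c in enumerate(ranked)}
    return fraction, level

# Toy first-clustering result for nine reference objects in three clusters.
cluster_of_reference = [0, 0, 0, 1, 1, 1, 1, 2, 2]
is_patient = [False, False, True, True, True, True, False, False, False]
fraction, level = cluster_risk_levels(cluster_of_reference, is_patient)

detected_cluster = 1  # cluster that the detected object fell into
detected_risk = fraction[detected_cluster]
detected_level = level[detected_cluster]
```

The detected object inherits the score and the level of its own cluster, which is the quantity the method outputs as its risk.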
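It is understood that the risk of the detected object suffering from triple-negative breast cancer can also be determined and output directly from the cluster to which the detected object belongs and the ratio, within that cluster, of the number of reference objects belonging to the second reference object group to the number belonging to the first reference object group.
Further, when the first clustering produces a large number of clusters, the clusters can be merged according to data distribution characteristics to obtain groups with more distinctive features. For example, the u disease risk levels are merged into a smaller number of disease risk levels, for easier reference by the detected object.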
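In another implementation, the pre-classification type of the detected object can be determined by comparing preset classification rules of the classes with the information of the detected object that corresponds to those rules. For example, in one instance, a second clustering can be performed on the reference objects in the aforementioned third and fourth reference object groups, dividing them into class A and class B; statistics are then computed on the relevant information of the class A and class B reference objects (for example, the driving force information of each person's mutated genes on the activity changes of the q signaling pathways) to obtain the classification rule of each class. When the pre-classification type of the detected object is determined, the information of the detected object that corresponds to the classification rules (for example, the driving force information of the detected object's mutated genes on the activity changes of the q signaling pathways) can be compared with the classification rule of each class, and the detected object is assigned to the closest class. It is understood that the above is only one specific example of determining the pre-classification type of the detected object according to preset classification rules, and the present application is not limited to it. For example, in other embodiments, the classification rules of the classes can be determined in other ways, and the information of the detected object that corresponds to the classification rules is not limited to the exemplary information mentioned above.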
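One simple way to realize such a rule-based pre-classification is a nearest-centroid rule, sketched below. The text leaves the form of the classification rules open, so the choice of the class mean as the rule and of squared Euclidean distance as the closeness measure are assumptions, as are the function names and toy profiles.

```python
def centroid(vectors):
    """Coordinate-wise mean of the pathway profiles of one class."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def preclassify(detected_profile, class_members):
    """Assign the detected object to the class whose centroid is closest
    (squared Euclidean distance); the centroids act as the preset rules."""
    rules = {name: centroid(members) for name, members in class_members.items()}

    def dist2(center):
        return sum((s - c) ** 2 for s, c in zip(detected_profile, center))

    return min(rules, key=lambda name: dist2(rules[name]))

# Toy pathway profiles of reference objects already split into classes A and B.
class_members = {
    "A": [[1.0, 1.0], [1.2, 0.8]],
    "B": [[-1.0, -1.0], [-0.8, -1.2]],
}
pre_type = preclassify([0.9, 1.1], class_members)
```

Because the rules are computed once from the reference groups, a new detected object can be pre-classified without re-running the second clustering.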
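In one implementation of the present application, in addition to outputting the predicted risk of the detected object suffering from triple-negative breast cancer, the method can also obtain and output, for the reference objects that belong to the same disease risk level as the detected object (for example, the same cluster or the same group), clinically or pathologically relevant deterministic event features (for example, age of onset and lymph node metastasis), pathological features (for example, drug response and primary or metastatic status), physiological features (for example, immune function and cardiovascular and respiratory function), and behavioral features (for example, diet and exercise).
It is understood that the present application has been described above using triple-negative breast cancer as an example, but the present application does not require pre-classification, nor does it limit the pre-classification types to two. In other embodiments of the present application, for example in risk prediction methods for other diseases, there can be more than two pre-classification types, or pre-classification may not be needed.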
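FIG. 4 shows an electronic device 40 according to an embodiment of the present application, including a memory 42, a processor 44, and a program 46 stored in the memory 42. The program 46 is configured to be executed by the processor 44; when executing the program, the processor 44 implements at least part of the aforementioned method for obtaining intracellular deterministic events, at least part of the aforementioned disease risk prediction method, or a combination of the two methods.
The present application also provides a storage medium storing a computer program, where the computer program, when executed by a processor, implements at least part of the aforementioned method for obtaining intracellular deterministic events, at least part of the aforementioned disease risk prediction method, or a combination of the two methods.
In some embodiments of the present application, all germline genetic information is used to comprehensively evaluate the overall characteristics of germline inheritance, so the risk assessment can cover the various sporadic and familial hereditary breast cancers caused by germline inheritance, which improves the sensitivity of detecting at-risk individuals.
In some embodiments of the present application, discrete, high-dimensional, multivariately correlated, non-standardized germline variation features are projected onto predicted gene expression features and signaling pathway activity features that have a continuous value range, relatively low dimensionality, and gradually converging correlation. This builds a quantitative model that converts discrete qualitative data into a continuous space. On the one hand, the model retains the global characteristics of the data; on the other hand, it provides a data-driven classification basis for associating germline genetic information with other deterministic events in breast cancer (including but not limited to pathophysiological features such as lymph node metastasis and age of onset).
In some embodiments of the present application, because the input is global rare germline variants, the risk rating and clinical feature association of sporadic hereditary breast cancers such as triple-negative breast cancer can be graded according to pathway activity, which fills the coverage gap of knowledge-driven methods based on gene panels and significantly reduces the false negative rate.
In some embodiments of the present application, because disease risk can be associated with other clinical, pathological, physiological, or behavioral deterministic event features, the model can provide a basis, from germline genetic information, for prognosis assessment, early clinical intervention, and management of patients.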
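In some embodiments, the electronic device can be a user terminal device, a server, or a network device, for example a mobile phone, smartphone, notebook computer, digital broadcast receiver, PDA (personal digital assistant), PAD (tablet computer), PMP (portable multimedia player), navigation device, in-vehicle device, digital TV, or desktop computer, or a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The memory includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and so on. The memory stores the operating system installed on the service node device, together with various application software and data.
In some embodiments, the processor can be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed or recorded in one embodiment, refer to the related descriptions of the other embodiments.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners can use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
All or part of the processes in the methods of the above embodiments of the present invention can also be carried out by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, which can be in source code form, object code form, an executable file, some intermediate form, and so on. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium can be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals. The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and all fall within the scope of protection of the present invention.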
Claims (16)
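- A method for predicting a risk of suffering from a disease, performed by an electronic device, comprising: obtaining driving force information of mutated genes of a test object, the mutated genes belonging to a predetermined gene group, with respect to activity changes of a number of predetermined signaling pathways; obtaining driving force information of mutated genes, belonging to the predetermined gene group, of each reference object in a first reference object group and a second reference object group with respect to activity changes of the predetermined signaling pathways, wherein each reference object in the first reference object group is a healthy object and each reference object in the second reference object group is an object suffering from a specific disease; performing a first clustering on the test object and the reference objects in the first and second reference object groups according to the driving force information of the mutated genes of the test object with respect to activity changes of the predetermined signaling pathways and the driving force information of the mutated genes of each reference object in the first and second reference object groups with respect to activity changes of the predetermined signaling pathways; and outputting the risk of the test object suffering from the specific disease according to a first clustering result obtained from the first clustering.
- The method for predicting a risk of suffering from a disease according to claim 1, wherein the specific disease is triple-negative breast cancer.
- The method for predicting a risk of suffering from a disease according to claim 1, wherein, after the first clustering is performed on the test object and the reference objects in the first and second reference object groups, the method further comprises: merging the clusters obtained from the first clustering into a plurality of groups.
- The method for predicting a risk of suffering from a disease according to claim 1, wherein, after the first clustering is performed on the test object and the reference objects in the first and second reference object groups, the method further comprises: obtaining and outputting at least one of the clinical, pathological, physiological, or behavior-related deterministic event features of the reference objects that belong to the same disease risk level as the test object.
- The method for predicting a risk of suffering from a disease according to claim 1, wherein the first clustering is performed on the test object and the reference objects in the first and second reference object groups using the NMRCLUST clustering method, a hierarchy-based method, a partition-based method, a density-based method, a network-based method, or a model-based method.
- The method for predicting a risk of suffering from a disease according to claim 1, wherein, before the driving force information of the mutated genes of the test object with respect to activity changes of the predetermined signaling pathways is obtained, the method comprises: determining the predetermined signaling pathways from a plurality of reference signaling pathways.
- The method for predicting a risk of suffering from a disease according to claim 6, wherein, before the predetermined signaling pathways are determined from the plurality of reference signaling pathways, the method comprises: determining a pre-classification type corresponding to the test object; determining, according to the pre-classification type, the first reference object group from a third reference object group, wherein each reference object in the third reference object group is a healthy object and the first reference object group corresponds to the pre-classification type; and determining, according to the pre-classification type, the second reference object group from a fourth reference object group, wherein each reference object in the fourth reference object group is an object suffering from the specific disease and the second reference object group corresponds to the pre-classification type; and wherein determining the predetermined signaling pathways from the plurality of reference signaling pathways comprises: determining, according to the pre-classification type, the predetermined signaling pathways from the plurality of reference signaling pathways.
- The method for predicting a risk of suffering from a disease according to claim 7, wherein determining the pre-classification type corresponding to the test object comprises: obtaining driving force information of the mutated genes of the test object with respect to activity changes of the plurality of reference signaling pathways; obtaining driving force information of the mutated genes of each reference object in the third and fourth reference object groups with respect to activity changes of the plurality of reference signaling pathways; and performing a second clustering on the test object and the reference objects in the third and fourth reference object groups according to the driving force information of the mutated genes of the test object with respect to activity changes of the plurality of reference signaling pathways and the driving force information of the mutated genes of each reference object in the third and fourth reference object groups with respect to activity changes of the plurality of reference signaling pathways.
- The method for predicting a risk of suffering from a disease according to claim 8, wherein the second clustering is performed on the test object and the reference objects in the third and fourth reference object groups using the Ward Hierarchical Clustering method, a hierarchy-based method, a partition-based method, a density-based method, a network-based method, or a model-based method.
- The method for predicting a risk of suffering from a disease according to claim 7, wherein determining the pre-classification type corresponding to the test object comprises: comparing preset classification rules for each class with the information of the test object that corresponds to the classification rules, and thereby determining the pre-classification type corresponding to the test object.
- The method for predicting a risk of suffering from a disease according to claim 7, wherein determining, according to the pre-classification type, the predetermined signaling pathways from the plurality of reference signaling pathways comprises: determining, according to the pre-classification type, a fifth reference object group corresponding to the pre-classification type from the third reference object group; determining, according to the pre-classification type, a sixth reference object group corresponding to the pre-classification type from the fourth reference object group; for each signaling pathway sk among the plurality of signaling pathways, determining the difference between the driving force information of the mutated genes of the reference objects in the fifth reference object group with respect to the activity change of that signaling pathway sk and the driving force information of the mutated genes of the reference objects in the sixth reference object group with respect to the activity change of that signaling pathway sk; and determining, according to the difference, the predetermined signaling pathways that satisfy a preset difference significance condition from the plurality of signaling pathways.
- The method for predicting a risk of suffering from a disease according to claim 11, wherein determining the difference between the driving force information of the mutated genes of the reference objects in the fifth reference object group with respect to the activity change of the signaling pathway sk and the driving force information of the mutated genes of the reference objects in the sixth reference object group with respect to the activity change of the signaling pathway sk comprises: obtaining the difference between the average driving force value of the mutated genes of the reference objects in the sixth reference object group with respect to the activity change of the signaling pathway sk and the average driving force value of the mutated genes of the reference objects in the fifth reference object group with respect to the activity change of the signaling pathway sk.
- The method for predicting a risk of suffering from a disease according to claim 12, wherein determining the difference between the driving force information of the mutated genes of the reference objects in the fifth reference object group with respect to the activity change of the signaling pathway sk and the driving force information of the mutated genes of the reference objects in the sixth reference object group with respect to the activity change of the signaling pathway sk further comprises: performing noise reduction on the difference value.
- The method for predicting a risk of suffering from a disease according to claim 1, wherein outputting the risk of the test object suffering from the specific disease according to the first clustering result obtained from the first clustering comprises: determining and outputting the risk of the test object suffering from the specific disease at least according to the cluster to which the test object belongs and the ratio, within that cluster, of the number of reference objects belonging to the second reference object group to the number of reference objects belonging to the first reference object group.
- An electronic device, comprising: a memory, a processor, and a program stored in the memory, the program being configured to be executed by the processor, wherein the processor, when executing the program, implements the method for predicting a risk of suffering from a disease according to any one of claims 1 to 14.
- A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for predicting a risk of suffering from a disease according to any one of claims 1 to 14.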
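The pathway selection, first clustering, and cluster-based risk output recited in the claims above can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The following are assumptions made only for illustration: driving force information is represented as one number per pathway; the "preset difference significance condition" is replaced by a hypothetical absolute threshold; the clustering is a naive average-linkage hierarchical method (one instance of the "hierarchy-based" family, not NMRCLUST); and the risk is reported as the fraction of diseased references in the test object's cluster rather than the raw ratio.

```python
from statistics import mean


def select_pathways(healthy, diseased, threshold):
    """Keep pathway indices whose mean driving force differs between groups.

    healthy / diseased: lists of per-object vectors, one value per pathway.
    threshold: hypothetical stand-in for the preset significance condition.
    """
    selected = []
    for k in range(len(healthy[0])):
        diff = mean(v[k] for v in diseased) - mean(v[k] for v in healthy)
        if abs(diff) >= threshold:
            selected.append(k)
    return selected


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative(points, n_clusters):
    """Naive average-linkage hierarchical clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(euclidean(points[i], points[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]   # merge the two closest clusters
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def predict_risk(test_object, healthy, diseased, threshold=0.5, n_clusters=2):
    """Fraction of diseased references in the test object's cluster (None if empty)."""
    idx = select_pathways(healthy, diseased, threshold)
    everyone = [test_object] + healthy + diseased
    points = [[v[k] for k in idx] for v in everyone]
    labels = agglomerative(points, n_clusters)
    own = labels[0]
    n_healthy = sum(1 for l in labels[1:1 + len(healthy)] if l == own)
    n_diseased = sum(1 for l in labels[1 + len(healthy):] if l == own)
    total = n_healthy + n_diseased
    return n_diseased / total if total else None
```

With toy data in which the third pathway carries no group difference, `select_pathways` drops that pathway, and a test object whose remaining values resemble the diseased references lands in their cluster. A real system would need a proper significance test, the noise reduction step of the dependent claim, and a principled choice of the number of clusters, none of which this sketch attempts.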
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201880003024.9A CN111670476B (zh) | 2018-12-21 | 2018-12-21 | Method for predicting a risk of suffering from a disease, electronic device and storage medium |
US17/416,919 US20220068491A1 (en) | 2018-12-21 | 2018-12-21 | Method for predicting a risk of suffering from a disease, electronic device and storage medium |
PCT/CN2018/122786 WO2020124584A1 (zh) | 2018-12-21 | 2018-12-21 | Method for predicting a risk of suffering from a disease, electronic device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/122786 WO2020124584A1 (zh) | 2018-12-21 | 2018-12-21 | Method for predicting a risk of suffering from a disease, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020124584A1 (zh) | 2020-06-25 |
Family ID=71102489
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/122786 WO2020124584A1 (zh) | Method for predicting a risk of suffering from a disease, electronic device and storage medium | 2018-12-21 | 2018-12-21 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20220068491A1 (zh) |
CN (1) | CN111670476B (zh) |
WO (1) | WO2020124584A1 (zh) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110246409A1 (en) * | 2010-04-05 | 2011-10-06 | Indian Statistical Institute | Data set dimensionality reduction processes and machines |
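CN107924430A (zh) * | 2015-08-17 | 2018-04-17 | Koninklijke Philips N.V. | Multi-level architecture for pattern recognition in biological data |
CN109036571A (zh) * | 2014-12-08 | 2018-12-18 | 20/20 GeneSystems, Inc. | Method and machine learning system for predicting the likelihood or risk of having cancer |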
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2439282A1 (en) * | 2010-10-06 | 2012-04-11 | bioMérieux | Method for determining a biological pathway activity |
EP3172362A4 (en) * | 2014-07-23 | 2018-01-10 | Ontario Institute for Cancer Research | Systems, devices and methods for constructing and using a biomarker |
CN107658023B (zh) * | 2017-09-25 | 2021-07-13 | Taikang Insurance Group Co., Ltd. | Disease prediction method, apparatus, medium and electronic device |
-
2018
- 2018-12-21 CN CN201880003024.9A patent/CN111670476B/zh active Active
- 2018-12-21 WO PCT/CN2018/122786 patent/WO2020124584A1/zh active Application Filing
- 2018-12-21 US US17/416,919 patent/US20220068491A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
US20220068491A1 (en) | 2022-03-03 |
CN111670476A (zh) | 2020-09-15 |
CN111670476B (zh) | 2023-04-25 |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18943846; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22/10/2021) |
122 | Ep: pct application non-entry in european phase | Ref document number: 18943846; Country of ref document: EP; Kind code of ref document: A1 |