CN111105879A - Probabilistic identification model for breast cancer prognosis generated by deep machine learning - Google Patents

Publication number
CN111105879A
Authority
CN
China
Prior art keywords
gene
genes
model
machine learning
prognosis
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811265590.5A
Other languages
Chinese (zh)
Inventor
Zhang Peisen (张培森)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-10-29
Filing date
2018-10-29
Publication date
2020-05-05
Application filed by Individual
Priority to CN201811265590.5A
Publication of CN111105879A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Technical field of the invention: an identification model for breast cancer prognosis, used to predict clinical outcome and to determine whether adjuvant chemotherapy is worthwhile, in support of precision treatment. A probabilistic identification model for cancer prognosis was developed by applying a self-developed deep machine learning data mining algorithm. The "70-gene signature" was the first, and so far the only, U.S. FDA-approved test for breast cancer prognosis. Using the same clinical 25,000-RNA dataset (151 lymph-node-negative breast cancer patients: 97 who survived five years or more and 54 controls), my deep machine learning started with 2 genes, added 1 gene at a time, and at each step selected the combination with the strongest discriminating ability. It produced a 7-gene "probabilistic identification model for breast cancer prognosis" whose discriminating ability exceeds that of the "70-gene signature".

Description

Probabilistic identification model for breast cancer prognosis generated by deep machine learning
The inventor: Zhang Peisen
(I) Technical field
An identification model for breast cancer prognosis, used to predict clinical outcome and to determine whether adjuvant chemotherapy is worthwhile.
(II) Background of the invention
(2.1) Overview:
"Breast cancer patients at the same disease stage may have significantly different treatment responses and outcomes. Clinical predictors of metastasis (e.g., lymph node status and histological grade) do not accurately classify breast tumors. Chemotherapy or hormone treatment can reduce the risk of metastasis by about one-third; however, 70-80% of patients receiving such treatment would have survived without it." (technical literature [1])
Several gene-based identification models have been developed to predict clinical outcome and to determine whether adjuvant chemotherapy is worthwhile. Among them is the "70-gene signature" (technical literature [1, 2, 3]), a test that classifies tumors as having a good or poor prognosis according to the 5-year risk of recurrence. The translational research consortium TRANSBIG is a network of about 40 partners in 21 countries, including the Breast International Group (BIG). An independent validation study by this consortium showed that the "70-gene signature", which is approved by the U.S. Food and Drug Administration (FDA), can distinguish patients at high risk of metastatic recurrence and death from low-risk patients. (technical literature [3])
The expression of some genes is correlated. Correlated genes are duplicated and redundant in an identification model. Redundant genes increase the testing cost, introduce noise, and increase errors. Some of the 70 genes used in the "70-gene signature" are correlated with one another. It is therefore desirable to build the model from genes that are as independent of one another as possible, which reduces the testing cost, reduces noise, and improves accuracy.
(2.2) Data sources (patient selection, RNA isolation, and biochip expression):
We used the same clinical 25,000-RNA dataset as the "70-gene signature" (151 lymph-node-negative breast cancer patients: 97 who survived more than five years and 54 controls). The "70-gene signature" was selected using a 78-patient subsample of these 151 patients (34 patients with no metastasis for more than five years, 44 controls). Our probabilistic identification model uses the entire sample set of 151 cases.
Tumors from 295 women with breast cancer were selected from the fresh-frozen tissue bank of the Netherlands Cancer Institute according to the following criteria: the tumor was primary invasive breast cancer less than 5 cm in diameter at pathological examination (pT1 or pT2); the apical axillary lymph nodes were tumor negative, as determined by biopsy of the infraclavicular lymph node; the age at diagnosis was 52 years or less; the year of diagnosis was between 1984 and 1995; and there was no previous history of cancer, except non-melanoma skin cancer. All patients were treated by modified radical mastectomy or breast-conserving surgery, including axillary lymph node dissection, followed by radiation therapy if indicated. Of the 295 patients, 151 were lymph-node negative (pathological examination result pN0) and 144 were lymph-node positive (pN+). (technical literature [2])
Tumor material was snap frozen in liquid nitrogen within 1 hour after surgery. Frozen sections were stained with hematoxylin and eosin; only samples with more than 50% tumor cells were selected. Sections of 30 μm were used for RNA isolation. Total RNA was isolated using RNAzol B and dissolved in RNase-free water. Then 25 μg of RNA was treated with the Qiagen RNase-free DNase kit and RNeasy spin columns and dissolved in RNase-free water to a final concentration of 0.2 μg/μl. Using 5 μg of total RNA, cRNA was generated by in vitro transcription with T7 RNA polymerase and labeled with Cy3 or Cy5 (CyDye, Amersham Pharmacia Biotech). Five micrograms of Cy-labeled cRNA from one breast cancer tumor was mixed with the same amount of reverse-color Cy-labeled product from a pool consisting of equal amounts of cRNA from each patient. The labeled cRNA was fragmented to an average size of about 50 to 100 nucleotides by heating the sample to 60 °C in the presence of 10 mM zinc chloride, and was added to a hybridization buffer containing 1 M sodium chloride, 0.5% sodium sarcosine, and 50 mM morpholinoethanesulfonic acid (MES, pH 6.5), with formamide at a final concentration of 30%; the final volume was 3 ml, at 40 °C. The microarray included 24,479 biological oligonucleotides and 1,281 control probes. After hybridization, the slides were washed and scanned with a confocal laser scanner (Agilent Technologies). The fluorescence intensities of the scanned images were quantified, corrected for background level, and normalized. (technical literature [2])
(2.3) Big data analysis and data mining algorithm of the "70-gene signature" (technical literature [1]):
In the first step, the "70-gene signature" screens the 24,479 genes on the biochip down to about 5,000 significant genes. These genes showed at least a two-fold change in expression, with significance p < 0.01, in more than 5 experiments.
In the second step, for each of the 5,000 significant genes, the "70-gene signature" calculates the correlation coefficient between the prognostic class (metastasis vs. no metastasis) and the logarithmic expression ratio across all 78 samples. It found 231 genes with correlation coefficients greater than 0.3 ("correlated genes") or less than -0.3 ("anti-correlated genes").
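The correlation filter in the second step can be illustrated with a minimal sketch. It assumes a hypothetical 0/1 encoding of the prognostic class (1 = metastasis) and uses invented toy values; the names `pearson`, `select_correlated_genes`, `toy_expression`, and `toy_outcome` are illustrative and do not come from the cited literature.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant gene or label carries no correlation signal
    return cov / math.sqrt(vx * vy)

def select_correlated_genes(expression, outcome, threshold=0.3):
    """Keep genes whose |r| with the outcome exceeds the threshold.

    expression: dict mapping gene name -> list of log expression ratios
    outcome:    list of 0/1 class labels (assumed encoding: 1 = metastasis)
    Returns (gene, r) pairs sorted by |r|, strongest first.
    """
    scored = [(gene, pearson(values, outcome)) for gene, values in expression.items()]
    kept = [(gene, r) for gene, r in scored if abs(r) > threshold]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)

# Invented toy data: 3 genes, 6 samples (not taken from the clinical dataset).
toy_expression = {
    "gene_a": [0.9, 0.8, 1.1, -0.7, -0.9, -1.0],   # tracks the outcome
    "gene_b": [-0.5, -0.6, -0.4, 0.6, 0.5, 0.7],   # anti-correlated
    "gene_c": [0.1, -0.1, 0.0, 0.0, 0.1, -0.1],    # unrelated
}
toy_outcome = [1, 1, 1, 0, 0, 0]
selected = select_correlated_genes(toy_expression, toy_outcome)
```

With these toy values, `gene_a` and `gene_b` pass the 0.3 threshold (one correlated, one anti-correlated) and `gene_c` is dropped.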
In the third step, the "70-gene signature" is cross-validated with the "leave-one-out" method. One sample is held out at a time, the remaining samples are used for learning to generate a model, and the model is then used to classify the held-out sample. This is repeated, one sample at a time, until every sample has been held out. This approach avoids information leakage: the sample to be classified is never in the "learning set". Specifically, the "70-gene signature" holds out one sample and uses the remaining 77 samples to define a classifier based on the 231 discriminating genes; the outcome of the held-out sample is then predicted. The prediction is based on the correlation coefficients of the sample with a "good prognosis" template and a "poor prognosis" template, where the templates are the average expression profiles of the clinically "good" and "poor" samples among the 77 samples. The correlation coefficients are calculated using the selected reporter genes. This procedure is repeated until each of the 78 samples has been held out once. Finally, the numbers of correct and incorrect predictions are counted. The performance of the classifier is measured by the type 1 (false negative) and type 2 (false positive) error rates for the selected gene set. The "70-gene signature" repeats this leave-one-out performance assessment while adding 5 more marker genes at a time from the top of the candidate list, until all 231 genes are used as discriminators. The numbers of type 1 and type 2 mispredictions vary significantly with the number of marker genes used. The combined error rate is lowest when the top 70 genes of the candidate list are used. The "70-gene signature" therefore takes this group of 70 genes as the optimal marker gene set for classifying patients into two prognostic subgroups, a "good prognosis" group and a "poor prognosis" group.
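The leave-one-out procedure with "good" and "poor" templates might be sketched as follows. This is a simplified illustration under stated assumptions: the gene list is fixed in advance, a "false negative" is taken to mean a poor-prognosis sample predicted as good, and the six toy profiles are invented. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def mean_profile(profiles):
    """Gene-wise average of a list of expression profiles (the class template)."""
    return [sum(column) / len(column) for column in zip(*profiles)]

def predict(profile, good_template, poor_template):
    """Assign the class whose template correlates best with the profile."""
    if pearson(profile, good_template) >= pearson(profile, poor_template):
        return "good"
    return "poor"

def leave_one_out(profiles, labels):
    """Hold out each sample once; return (false_negatives, false_positives).

    Assumption: a false negative is a "poor" sample predicted as "good".
    Each class must keep at least one sample in every training fold.
    """
    false_neg = false_pos = 0
    for i, (held_out, truth) in enumerate(zip(profiles, labels)):
        train = [(p, lab) for j, (p, lab) in enumerate(zip(profiles, labels)) if j != i]
        good_template = mean_profile([p for p, lab in train if lab == "good"])
        poor_template = mean_profile([p for p, lab in train if lab == "poor"])
        guess = predict(held_out, good_template, poor_template)
        if truth == "poor" and guess == "good":
            false_neg += 1
        elif truth == "good" and guess == "poor":
            false_pos += 1
    return false_neg, false_pos

# Invented toy data: 6 samples x 4 reporter genes.
toy_profiles = [
    [1.0, 0.8, -0.9, -1.1], [0.9, 1.1, -1.0, -0.8], [1.2, 0.9, -0.8, -1.0],
    [-1.0, -0.9, 1.1, 0.8], [-0.8, -1.1, 0.9, 1.0], [-1.1, -0.8, 1.0, 0.9],
]
toy_labels = ["good", "good", "good", "poor", "poor", "poor"]
errors = leave_one_out(toy_profiles, toy_labels)
```

Because the templates are rebuilt inside the loop from the training samples only, the held-out sample never contributes to the model that classifies it.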
Interestingly, the accuracy of predicting the prognosis of "sporadic" breast cancer patients is rather low when only a few marker genes are used. Accuracy increases with the number of marker genes until an optimal number (70 genes) is reached. Beyond the optimal number of marker genes, however, accuracy deteriorates because of the noise that is introduced.
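The search for the optimal number of marker genes (adding 5 genes at a time from the top of the ranked list and keeping the size with the fewest errors) could be sketched as below. The error function here is an invented stand-in with a single minimum; in the procedure described above it would be the combined leave-one-out error count. The names `sweep_gene_count` and `toy_error_count` are hypothetical.

```python
def sweep_gene_count(ranked_genes, error_count, step=5):
    """Evaluate the top-k genes for k = step, 2*step, ... up to the full list.

    ranked_genes: genes ordered from strongest to weakest candidate
    error_count:  callable taking a gene list and returning a number of errors
    Returns (best_k, errors_by_k); ties go to the smaller k.
    """
    sizes = list(range(step, len(ranked_genes) + 1, step))
    if not sizes or sizes[-1] != len(ranked_genes):
        sizes.append(len(ranked_genes))  # make sure the full list is evaluated
    errors_by_k = {k: error_count(ranked_genes[:k]) for k in sizes}
    best_k = min(errors_by_k, key=errors_by_k.get)
    return best_k, errors_by_k

# Invented stand-in: too few genes underfit, too many add noise.
def toy_error_count(genes):
    return (len(genes) - 10) ** 2 + 3

toy_ranked = [f"gene_{i}" for i in range(23)]
best_k, errors_by_k = sweep_gene_count(toy_ranked, toy_error_count)
```

The shape of `toy_error_count` only mimics the behavior described in the text (errors fall, reach a minimum, then rise); it is not derived from any real data.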
(2.4) The "70-gene signature" was granted a U.S. patent in 2007, U.S. Patent No. 7,171,311 (patent document [1]).
(III) Disclosure of the invention
(3.1) Overview:
By applying our self-developed deep machine learning data mining algorithm, we developed a probabilistic identification model for cancer prognosis. The "70-gene signature" was the first, and so far the only, U.S. FDA-approved test for breast cancer prognosis. Our probabilistic identification model is based on the same clinical 25,000-RNA dataset (151 lymph-node-negative breast cancer patients: 97 who survived five years or more and 54 controls); it reduces the number of genes to be tested and achieves higher accuracy than the "70-gene signature".
(3.2) Our deep machine learning data mining algorithm:
First, we construct our own identification model with a deep machine learning algorithm. Taking the 70 genes of the "70-gene signature" as the basis, our self-developed deep machine learning data mining algorithm starts from 2 genes, adds 1 gene at a time, and computes the discriminating ability at each step. All combinations of genes are learned. For example, choosing 5 genes out of 70 requires about 12 million learning runs, and choosing 6 genes out of 70 requires about 130 million. Before each learning run, the data are normalized to ensure the accuracy of the data.
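The number of learning runs needed to cover every combination follows from the binomial coefficient. A short check, assuming one learning run per gene subset (the variable names are illustrative):

```python
import math

N_GENES = 70  # size of the "70-gene signature" gene pool

# Number of distinct k-gene subsets, i.e. learning runs if every subset is tried once.
subset_counts = {k: math.comb(N_GENES, k) for k in (2, 5, 6)}

for k, count in subset_counts.items():
    print(f"{k} genes out of {N_GENES}: {count:,} combinations")
# 5 genes out of 70: 12,103,014 combinations
# 6 genes out of 70: 131,115,985 combinations
```

Each additional gene multiplies the count by roughly ten at these sizes, which is why the cost of exhaustive learning grows so quickly.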
Second, our identification model requires the expression of the genes to be as independent of one another as possible, so that each gene can contribute fully to the identification. Our identification model is an independent probabilistic model. Breast cancer patients are classified by their probabilities of "good prognosis" and "poor prognosis": a patient is assigned to the class with the higher probability. The model uses independent gene expression probabilities, so the probability of "good prognosis" (or "poor prognosis") is the product of the corresponding probabilities for the expression of each gene. The final classification is made according to these probabilities.
How are the probabilities of "good prognosis" and "poor prognosis" determined for a single gene of a sample? First, the sample dataset used for machine learning (the learning set) is divided into a "good" set (survived more than 5 years) and a "poor" set (survived less than 5 years) according to clinical 5-year survival. Next, the mean expression intensity of each individual gene (RNA) is calculated in the "good" set and in the "poor" set. Using the midpoint of these two means as a boundary, the expression values of the entire learning set for this gene are divided into two groups. The group containing the mean of the "good" set is called the "near-good group"; likewise, the group containing the mean of the "poor" set is called the "near-poor group". The same boundary places the gene expression of the sample to be identified in either the "near-good group" or the "near-poor group". The probabilities of "good prognosis" and "poor prognosis" are calculated separately for each of the two groups. For example, suppose the expression of the gene in the sample being tested falls in the "near-good group", and the "near-good group" contains 80 members of the "good" set and 10 members of the "poor" set; then, for this gene, the probability that the test sample has a "good prognosis" is 80/90 and the probability that it has a "poor prognosis" is 10/90. Whichever group the gene (RNA) expression intensity of the test sample falls in, that group's probabilities of "good prognosis" and "poor prognosis" are taken as the sample's probabilities for this gene. The total probability of "good prognosis" (or "poor prognosis") for the test sample is the product of the corresponding probabilities over all genes. The test sample is assigned to the class with the higher total probability.
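The per-gene probabilities and their product can be sketched in a few lines. This is a minimal illustration, not the inventor's implementation: how to treat a value exactly on the boundary and what to do with an empty group are not stated in the text, so both are marked as assumptions in the code. The toy data are constructed to reproduce the 80/90 and 10/90 example above; all names are hypothetical.

```python
def group_of(value, boundary, good_side_is_high):
    """Place an expression value in the 'near_good' or 'near_poor' group.

    Assumption: a value exactly on the boundary counts as the low side.
    """
    on_high_side = value > boundary
    return "near_good" if on_high_side == good_side_is_high else "near_poor"

def fit_gene(values, labels):
    """Learn one gene's boundary and the class probabilities of its two groups."""
    good = [v for v, lab in zip(values, labels) if lab == "good"]
    poor = [v for v, lab in zip(values, labels) if lab == "poor"]
    mean_good = sum(good) / len(good)
    mean_poor = sum(poor) / len(poor)
    boundary = (mean_good + mean_poor) / 2  # midpoint of the two class means
    good_side_is_high = mean_good > mean_poor
    counts = {"near_good": {"good": 0, "poor": 0},
              "near_poor": {"good": 0, "poor": 0}}
    for v, lab in zip(values, labels):
        counts[group_of(v, boundary, good_side_is_high)][lab] += 1
    probs = {}
    for group, c in counts.items():
        total = c["good"] + c["poor"]
        # Assumption: an empty group falls back to 0.5 / 0.5.
        probs[group] = {lab: (n / total if total else 0.5) for lab, n in c.items()}
    return boundary, good_side_is_high, probs

def classify(sample, models):
    """Multiply per-gene probabilities and pick the class with the larger product.

    sample: dict gene -> expression value; models: dict gene -> fit_gene(...) result
    """
    p_good = p_poor = 1.0
    for gene, (boundary, good_side_is_high, probs) in models.items():
        group = group_of(sample[gene], boundary, good_side_is_high)
        p_good *= probs[group]["good"]
        p_poor *= probs[group]["poor"]
    return ("good" if p_good >= p_poor else "poor"), p_good, p_poor

# Toy learning set for one gene: 80 "good" and 10 "poor" samples on the high side,
# 40 "poor" samples on the low side (invented values).
toy_values = [1.0] * 80 + [1.0] * 10 + [-1.0] * 40
toy_labels = ["good"] * 80 + ["poor"] * 10 + ["poor"] * 40
toy_models = {"gene_x": fit_gene(toy_values, toy_labels)}
label, p_good, p_poor = classify({"gene_x": 1.0}, toy_models)
```

In this toy case the tested value falls in the near-good group, so `p_good` is 80/90 and `p_poor` is 10/90, matching the worked example. With several genes in `models`, the same loop multiplies one factor per gene.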
Third, we cross-validate with the "leave-one-out" method to verify the discriminating ability of the model built by our deep machine learning.
(IV) 7-gene probabilistic identification model for breast cancer prognosis
We started the deep machine learning with 2 genes, added 1 gene at a time, and at each step selected the combination with the strongest discriminating ability. The discriminating ability of the 7-gene combination exceeds that of the "70-gene signature". We selected the following 7 genes to construct our identification model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC. Because our model contains only 7 genes, detecting the expression of these 7 genes (RNA) is much simpler than detecting 70 genes; the cost can be as low as one tenth, less noise is introduced, and accuracy improves. The detection can be done by RT-PCR. Technical literature [2] published a comparison of the accuracy of the "70-gene signature" with the traditional clinical criteria (St. Gallen and NIH). Here we add the results of the 7-gene probabilistic identification model. Clinical samples: 151 lymph-node-negative breast cancer patients, 97 who survived more than five years and 54 controls. Accuracy: "7-gene model", 84.1%; "70-gene signature", 80.8%; St. Gallen, 59.0%; NIH, 46.2%.
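The size-by-size search (start at 2 genes, grow by 1, keep the strongest combination at each size) might look like the sketch below, assuming every combination of each size is scored. The scoring function is an invented stand-in that rewards informative genes and penalizes a redundant pair; in the procedure described above the score would be the leave-one-out discriminating ability. All names and weights are hypothetical.

```python
from itertools import combinations

def best_subsets_by_size(genes, score, min_size=2, max_size=3):
    """For each subset size, score every combination and keep the best one."""
    return {k: max(combinations(genes, k), key=score)
            for k in range(min_size, max_size + 1)}

# Invented stand-in scorer: per-gene usefulness minus a penalty for redundant pairs.
toy_weight = {"g1": 0.30, "g2": 0.25, "g3": 0.10, "g4": 0.20, "g5": 0.01}
toy_redundant_pairs = {frozenset(("g1", "g2"))}  # pretend g1 and g2 are correlated

def toy_score(subset):
    gain = sum(toy_weight[g] for g in subset)
    penalty = sum(0.2 for pair in combinations(subset, 2)
                  if frozenset(pair) in toy_redundant_pairs)
    return gain - penalty

toy_genes = ["g1", "g2", "g3", "g4", "g5"]
best = best_subsets_by_size(toy_genes, toy_score)
```

Here the second-strongest single gene `g2` is passed over in favor of less redundant genes, which illustrates why relatively independent genes are preferred. For a pool of 70 genes this exhaustive form becomes expensive at larger sizes, as the combination counts in section (3.2) indicate.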
(V) Detailed description of the preferred embodiments
(5.1) Overview:
The "learning set" of our 7-gene probabilistic identification model for breast cancer prognosis includes the "learning set" of the "70-gene signature". Our "7-gene model" is the result of deep machine learning and can be regarded as an upgraded version of the "70-gene signature".
(5.2) Detection method:
We will produce a test kit for the 7 genes of the "7-gene model" to serve hospitals and other institutions that need it. We are also preparing to set up a third-party testing facility to carry out testing of the 7 genes of the "7-gene model".
(5.3) Calculation of the "7-gene model":
We will provide an online computation server for the "7-gene model". A mobile phone app for computing the "7-gene model" will also be provided.
Technical literature
【1】Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002 Jan 31;415(6871):530-536.
【2】A Gene-Expression Signature as a Predictor of Survival in Breast Cancer. N Engl J Med, Vol. 347, No. 25, December 19, 2002.
【3】70-Gene Signature as an Aid to Treatment Decisions in Early-Stage Breast Cancer. N Engl J Med 2016;375:717-29.
Patent document
【1】Methods of assigning treatment to breast cancer patients. US Patent 7,171,311; January 30, 2007.

Claims (1)

1. Probabilistic identification model for breast cancer prognosis generated by deep machine learning
The inventor: Zhang Peisen
1, Independent claim
(I) Description of the invention
Our invention is a "probabilistic identification model for breast cancer prognosis generated by deep machine learning". It is a gene-based identification model for breast cancer prognosis, used to predict clinical outcome and to determine whether adjuvant chemotherapy is worthwhile. We used the same clinical 25,000-RNA dataset as the "70-gene signature" (151 lymph-node-negative breast cancer patients: 97 who survived more than five years and 54 controls).
(II) Characterizing portion
Our invention is characterized by deep machine learning and a probabilistic identification model. From the 151-case dataset of the 70 genes of the "70-gene signature", we developed a probabilistic identification model for cancer prognosis by applying our self-developed deep machine learning data mining algorithm. We started the deep machine learning with 2 genes, added 1 gene at a time, and at each step selected the combination with the strongest discriminating ability. The discriminating ability of the 7-gene combination exceeds that of the "70-gene signature". We selected the following 7 genes to construct our identification model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.
Dependent claims
2, Deep machine learning:
(2.1) Our invention is a "probabilistic identification model for breast cancer prognosis generated by deep machine learning".
(2.2) Our invention is characterized by deep machine learning. Variations of the machine learning pattern (e.g., starting with 3 genes and adding 2 genes at a time) shall all be regarded as included in the present invention.
3, Probabilistic identification model:
(3.1) Our invention is a "probabilistic identification model for breast cancer prognosis generated by deep machine learning".
(3.2) Our invention is characterized in that the identification model uses independent gene expression probabilities: the probability of "good prognosis" (or "poor prognosis") is the product of the corresponding probabilities for the expression of each gene. Modifications of the identification model (e.g., classification tree models) shall all be regarded as included in the present invention.
4, Gene combination:
(4.1) We selected the following 7 genes to construct our identification model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.
(4.2) Our invention is characterized in that the 7-gene model formed by this combination of 7 genes is one of the best models obtained by deep learning; other gene combinations can form similar models with very close accuracy. Such similar gene combinations shall be regarded as included in the present invention.
5, Detection method:
(5.1) We selected the following 7 genes to construct our identification model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.
(5.2) We will produce a test kit for the 7 genes of the "7-gene model" to serve hospitals and other institutions that need it. We are also preparing to set up a testing facility to carry out testing of the 7 genes of the "7-gene model".
6, Calculation of the "7-gene model":
(6.1) We selected the following 7 genes to construct our identification model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.
(6.2) We will provide an online computation server for the "7-gene model". A mobile phone app for computing the "7-gene model" will also be provided.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811265590.5A CN111105879A (en) 2018-10-29 2018-10-29 Probabilistic identification model for breast cancer prognosis generated by deep machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811265590.5A CN111105879A (en) 2018-10-29 2018-10-29 Probabilistic identification model for breast cancer prognosis generated by deep machine learning

Publications (1)

Publication Number Publication Date
CN111105879A (en) 2020-05-05

Family

ID=70420268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811265590.5A Pending CN111105879A (en) 2018-10-29 2018-10-29 Probabilistic identification model for breast cancer prognosis generated by deep machine learning

Country Status (1)

Country Link
CN (1) CN111105879A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113764101A (en) * 2021-09-18 2021-12-07 新疆医科大学第三附属医院 CNN-based breast cancer neoadjuvant chemotherapy multi-modal ultrasonic diagnosis system
CN113764101B (en) * 2021-09-18 2023-08-25 新疆医科大学第三附属医院 Novel auxiliary chemotherapy multi-mode ultrasonic diagnosis system for breast cancer based on CNN

Similar Documents

Publication Publication Date Title
Ozawa et al. A microRNA signature associated with metastasis of T1 colorectal cancers to lymph nodes
CN105121665B (en) Medical prognosis and prediction using the active therapeutic response of cell multiplex signal transduction path
Tothill et al. An expression-based site of origin diagnostic method designed for clinical application to cancer of unknown origin
JP7186700B2 (en) Methods to Distinguish Tumor Suppressor FOXO Activity from Oxidative Stress
JP7354099B2 (en) How to diagnose, stage, and monitor melanoma using microRNA gene expression
ES2821300T3 (en) Prognostic Prediction for Cancer Melanoma
Smeets et al. Prediction of lymph node involvement in breast cancer from primary tumor tissue using gene expression profiling and miRNAs
CN113462776B (en) m 6 Application of A modification-related combined genome in prediction of immunotherapy efficacy of renal clear cell carcinoma patient
EP3931318A1 (en) Purity independent subtyping of tumors (purist), a platform and sample type independent single sample classifier for treatment decision making in pancreatic cancer
CN114150063A (en) Urine miRNA marker for bladder cancer diagnosis, diagnostic reagent and kit
Maxwell et al. Transcript expression in endometrial cancers from Black and White patients
CN111105879A (en) Probabilistic identification model for breast cancer prognosis generated by deep machine learning
WO2009002175A1 (en) A method of typing a sample comprising colorectal cancer cells
CN111748626B (en) System for predicting treatment effect and prognosis of neoadjuvant radiotherapy and chemotherapy of esophageal squamous carcinoma patient and application of system
Kroon et al. Microarray gene‐expression profiling to predict lymph node metastasis in penile carcinoma
TW202242143A (en) Risk estimation method of breast cancer recurrence or metastasis and kit thereof
Wen et al. Breast Cancer Pathology in the Era of Genomics
EP2459748B1 (en) Determination of the risk of distant metastases in surgically treated patients with non-small cell lung cancer in stage i-iiia
KR102138517B1 (en) Extracting method for biomarker for diagnosis of pancreatic cancer, computing device therefor, biomarker, and pancreatic cancer diagnosis device comprising same
CN117004711A (en) Tool for measuring prognosis marker of breast cancer local recurrence risk and application thereof
Nomikou Investigating Nuclear Morphological Features and Chromatin Architecture in Cancer
WO2024002599A1 (en) Novel signatures for lung cancer detection
Mook et al. Personalized Medicine by the Use of Microarray Gene Expression Profiling
Kroon et al. Microarray gene-expression profiling to predict lymph node metastasis in penile carcinoma
WO2022018086A1 (en) Prognostic and treatment response predictive method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200505