CN111105879A

CN111105879A - Probabilistic identification model for breast cancer prognosis generated by deep machine learning

Info

Publication number: CN111105879A
Application number: CN201811265590.5A
Authority: CN
Inventors: 张培森
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-10-29
Filing date: 2018-10-29
Publication date: 2020-05-05

Abstract

Technical field: a model for identifying breast cancer prognosis, used to estimate clinical prognosis and to decide whether adjuvant chemotherapy is worthwhile, in support of precision treatment. A probabilistic identification model for cancer prognosis was developed by applying an independently developed deep machine learning data mining algorithm. The "70-gene signature" was the first, and so far the only, US FDA-approved test for breast cancer prognosis. Working from the same clinical dataset of about 25,000 RNAs (151 lymph-node-negative breast cancer patients: 97 who survived five years or more and 54 controls), my deep machine learning started with 2 genes, added 1 gene at a time, and selected the combination with the strongest recognition ability. It produced a 7-gene "probabilistic identification model for breast cancer prognosis" whose identification ability exceeds that of the "70-gene signature".

Description

Probabilistic identification model for breast cancer prognosis generated by deep machine learning

Inventor: Zhang Peisen

(I) Technical field

A model for identifying breast cancer prognosis, used to estimate clinical prognosis and to decide whether adjuvant chemotherapy is worthwhile.

(II) Background of the invention

(2.1) Overview:

"Breast cancer patients at the same disease stage may have markedly different treatment responses and outcomes. Clinical predictors of metastasis (e.g., lymph node status and histological grade) do not accurately classify breast tumors. Chemotherapy or hormonal therapy can reduce the risk of metastasis by about one-third; however, 70-80% of patients receiving such treatment would have survived without it." (technical literature [ 1 ])

Several gene-based recognition models have been developed to predict clinical outcome and to determine whether adjuvant chemotherapy is worthwhile. Among them is the "70-gene signature" (technical literature [ 1, 2, 3 ]), a test that classifies tumors as having a good or poor prognosis according to the 5-year risk of recurrence. TRANSBIG, the translational research consortium associated with the Breast International Group (BIG), is a network of about 40 partners in 21 countries. An independent validation study by this consortium showed that the "70-gene signature", which is approved by the U.S. Food and Drug Administration (FDA), could distinguish patients at significant risk of metastatic recurrence and death from low-risk patients. (technical literature [ 3 ])

The expression of some genes is correlated. Correlated genes are redundant in a recognition model: they raise the detection cost, introduce noise, and increase errors. Some of the 70 genes used in the "70-gene signature" are correlated with one another. It is therefore desirable to build the model from genes that are as independent as possible, which reduces detection cost and noise and improves precision.

(2.2) Data sources (patient selection, RNA isolation, and biochip expression):

We used the same clinical dataset of about 25,000 RNAs as the "70-gene signature" (151 lymph-node-negative breast cancer patients: 97 who survived for more than five years and 54 controls). The "70-gene signature" was selected using a 78-patient subset of the 151 patients (34 patients with no metastasis for more than five years and 44 controls). Our probabilistic recognition model uses the entire sample set of 151 cases.

Tumors from 295 women with breast cancer were selected from the fresh-frozen tissue bank of the Netherlands Cancer Institute according to the following criteria: the tumor was primary invasive breast cancer less than 5 cm in diameter at pathological examination (pT1 or pT2); the apical axillary lymph nodes were tumor-negative, as determined by biopsy of the infraclavicular lymph nodes; age at diagnosis was 52 years or younger; the diagnosis was made between 1984 and 1995; and there was no previous history of cancer, except non-melanoma skin cancer. All patients were treated by modified radical mastectomy or breast-conserving surgery, including axillary lymph node dissection, followed by radiotherapy if indicated. Of the 295 patients, 151 were lymph-node negative (pathological examination result pN0) and 144 were lymph-node positive (pN+). (technical literature [ 2 ])

Tumor material was snap-frozen in liquid nitrogen within 1 hour after surgery. Frozen sections were stained with hematoxylin and eosin; only samples with more than 50% tumor cells were selected. 30-μm sections were used for RNA isolation. Total RNA was isolated using RNAzol B and dissolved in RNase-free water. 25 μg of total RNA was then treated with the Qiagen RNase-free DNase kit and an RNeasy spin column and dissolved in RNase-free water to a final concentration of 0.2 μg/μl. cRNA was generated by in vitro transcription using T7 RNA polymerase and 5 μg of total RNA, and labeled with Cy3 or Cy5 (CyDye, Amersham Pharmacia Biotech). 5 μg of Cy-labeled cRNA from one breast cancer tumor was mixed with the same amount of reverse-color Cy-labeled product from a pool consisting of equal amounts of cRNA from each patient. The labeled cRNA was fragmented to an average size of about 50 to 100 nucleotides by heating the sample to 60 °C in the presence of 10 mM zinc chloride, and was then added to a hybridization buffer containing 1 M sodium chloride, 0.5% sodium sarcosine, 50 mM morpholinoethanesulfonic acid (MES, pH 6.5), and formamide (final concentration 30%); the final volume was 3 ml at 40 °C. The microarray included 24,479 biological oligonucleotides and 1,281 control probes. After hybridization, the slides were washed and scanned with a confocal laser scanner (Agilent Technologies). The fluorescence intensity of the scanned images was quantified, background-corrected, and normalized. (technical literature [ 2 ])

(2.3) Big data analysis and data mining algorithm of the "70-gene signature" (technical literature [ 1 ]):

In the first step, the "70-gene signature" screened the 24,479 genes on the biochip down to 5,000 significant genes: those showing at least a twofold change in expression, with significance p < 0.01, in more than 5 experiments.

In the second step, the "70-gene signature" calculated, for each of the 5,000 significant genes, the correlation coefficient between the prognostic class (metastasis vs. no metastasis) and the log expression ratio across all 78 samples. This identified 231 genes with correlation coefficients greater than 0.3 ("correlated genes") or less than -0.3 ("anti-correlated genes").
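The correlation filter in this step can be sketched as follows. This is a minimal illustration, not the original study's code: the function names, the 0/1 coding of the prognostic class, and the toy data are all assumptions made for the example; only the 0.3 threshold comes from the text.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def select_correlated_genes(log_ratios, outcomes, threshold=0.3):
    """Return {gene: r} for genes whose |r| with the outcome exceeds threshold.

    log_ratios: dict mapping gene name -> list of log expression ratios
    outcomes:   list of 0/1 labels (assumed coding: 1 = metastasis, 0 = none)
    """
    selected = {}
    for gene, values in log_ratios.items():
        r = pearson(values, outcomes)
        if abs(r) > threshold:
            selected[gene] = r
    return selected

# Toy data invented for illustration only (not taken from the study).
toy = {
    "geneA": [0.9, 0.8, 0.7, -0.6, -0.8, -0.9],  # correlated with the outcome
    "geneB": [-0.5, -0.7, -0.6, 0.6, 0.4, 0.8],  # anti-correlated
    "geneC": [0.1, -0.1, 0.1, 0.1, -0.1, 0.1],   # essentially uncorrelated
}
labels = [1, 1, 1, 0, 0, 0]
picked = select_correlated_genes(toy, labels)
```

With the toy data, `picked` keeps `geneA` (positive coefficient) and `geneB` (negative coefficient) and drops `geneC`.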

In the third step, the "70-gene signature" was cross-validated using the "leave-one-out" method. One sample at a time is held out, the remaining samples are used for learning to generate a model, and the model is then used to identify the held-out sample. This is repeated until every sample has been held out once. The approach avoids information leakage, because the sample to be identified is never in the "learning set". Specifically, the "70-gene signature" holds out one sample at a time and uses the remaining 77 samples to define a classifier based on the 231 discriminating genes; the outcome of the held-out sample is then predicted. The prediction is based on the sample's correlation coefficients with a "good prognosis" template and a "poor prognosis" template, where the templates are the average expression profiles of the clinically "good" and "poor" samples among the 77 learning samples. Correlation coefficients are calculated using the selected reporter genes. This procedure is repeated until each of the 78 samples has been left out, and the numbers of correct and incorrect predictions are then counted. The performance of the classifier is measured by the type 1 (false negative) and type 2 (false positive) error rates of the selected gene set.

The "70-gene signature" repeats this "leave-one-out" performance assessment while adding 5 more marker genes at a time from the top of the candidate list, until all 231 genes are used as discriminators. The number of type 1 and type 2 mispredictions varies significantly with the number of marker genes used. The combined error rate is lowest when the top 70 genes of the candidate list are used. The "70-gene signature" therefore treats this group of 70 genes as the best marker gene set for classifying patients into two prognostic subgroups, a "good prognosis" group and a "poor prognosis" group.

Interestingly, the accuracy of predicting the prognosis of "sporadic" breast cancer patients is rather low when only a few marker genes are used. Accuracy increases with the number of marker genes until the optimal number (70 genes) is reached. Beyond the optimal number, however, accuracy deteriorates because noise is introduced.
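The template-correlation classifier and its leave-one-out assessment, as described above, might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the gene list is taken as fixed and only the templates are rebuilt in each round, ties go to "good", and the function names and toy profiles are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def mean_profile(profiles):
    """Gene-wise average of a list of expression profiles (the template)."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]

def classify_by_template(sample, good_template, poor_template):
    """Assign the class whose template correlates better with the sample."""
    r_good = pearson(sample, good_template)
    r_poor = pearson(sample, poor_template)
    return "good" if r_good >= r_poor else "poor"  # tie rule is an assumption

def leave_one_out_errors(profiles, labels):
    """Count held-out samples that the template classifier gets wrong."""
    errors = 0
    for i, held_out in enumerate(profiles):
        # Templates are built only from the other samples.
        train = [(p, l) for j, (p, l) in enumerate(zip(profiles, labels)) if j != i]
        good_template = mean_profile([p for p, l in train if l == "good"])
        poor_template = mean_profile([p for p, l in train if l == "poor"])
        if classify_by_template(held_out, good_template, poor_template) != labels[i]:
            errors += 1
    return errors

# Toy 3-gene profiles invented for illustration only.
toy_profiles = [
    [1.0, 0.2, -1.0], [0.8, 0.1, -0.9], [1.1, 0.0, -0.7],
    [-1.0, 0.1, 0.9], [-0.8, 0.3, 1.0], [-0.9, 0.0, 1.2],
]
toy_labels = ["good", "good", "good", "poor", "poor", "poor"]
errors = leave_one_out_errors(toy_profiles, toy_labels)
```

On these well-separated toy profiles every held-out sample is classified correctly, so `errors` is 0; real expression data would not be this clean.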

(2.4) The "70-gene signature" was granted a US patent in 2007, U.S. Patent No. 7,171,311 (patent document [ 1 ])

(III) Disclosure of the invention

(3.1) Overview:

By applying our independently developed deep machine learning data mining algorithm, we developed a probabilistic recognition model for cancer prognosis. The "70-gene signature" was the first, and so far the only, US FDA-approved test for breast cancer prognosis. Our probabilistic recognition model is based on the same clinical dataset of about 25,000 RNAs (151 lymph-node-negative breast cancer patients: 97 who survived five years or more and 54 controls); it reduces the number of genes to be detected, and its accuracy exceeds that of the "70-gene signature".

(3.2) Our deep machine learning data mining algorithm:

In the first step, we construct our own recognition model using a deep machine learning algorithm. Taking the 70 genes of the "70-gene signature" as the basis, we use our deep machine learning data mining algorithm to calculate the detection ability of gene combinations, starting from 2 genes and adding 1 gene at a time. All combinations of genes are learned. For example, taking 5 genes from 70 requires about 12 million learning runs, and taking 6 genes from 70 requires about 130 million. Before each learning run the data are normalized to ensure their accuracy.
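The number of gene combinations of a given size is a binomial coefficient, which can be checked with a few lines of Python. The helper name `subset_counts` is hypothetical; `math.comb` is the standard-library binomial function (Python 3.8+).

```python
import math

def subset_counts(n_genes, k_min, k_max):
    """Number of distinct k-gene combinations for each k in [k_min, k_max]."""
    return {k: math.comb(n_genes, k) for k in range(k_min, k_max + 1)}

counts = subset_counts(70, 2, 7)
for k, count in counts.items():
    print(f"{k} genes out of 70: {count:,} combinations")
```

For 70 genes this gives 12,103,014 combinations of 5 genes and 131,115,985 combinations of 6 genes, which shows how quickly the cost of learning every combination grows with the subset size.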

In the second step, our recognition model aims for gene expressions that are as independent of one another as possible, so that each gene contributes fully to the identification. Our recognition model is an independent probabilistic model. Breast cancer patients are classified by their probabilities of "good prognosis" and "poor prognosis": a patient is assigned to the class with the higher probability. The model treats the expression of each gene as independent, so the probability of good prognosis and the probability of poor prognosis are each the product of the corresponding per-gene probabilities. Classification is then made according to these probabilities.

How are the probabilities of "good prognosis" and "poor prognosis" determined for a single gene of a sample? First, the sample dataset used for machine learning (the learning set) is divided into a "good" set (survived more than 5 years) and a "poor" set (survived less than 5 years) according to clinical 5-year survival. Then, for each gene (RNA), the mean expression intensity is calculated in the "good" set and in the "poor" set. Using the midpoint of the two means as a boundary, the expression values of the entire learning set for this gene are divided into two groups. The group containing the mean of the "good" set is called the "near-good group"; similarly, the group containing the mean of the "poor" set is called the "near-poor group". The same boundary places the gene expression of the sample to be identified in either the "near-good group" or the "near-poor group". The probabilities of "good prognosis" and "poor prognosis" are calculated separately for each group. For example, suppose the expression of the gene in the sample being tested falls in the "near-good group", and the "near-good group" contains 80 members of the "good" set and 10 members of the "poor" set; then, for this gene, the probability of "good prognosis" for the test sample is 80/90 and the probability of "poor prognosis" is 10/90. In general, the probabilities of "good prognosis" and "poor prognosis" of the group to which the sample's gene (RNA) expression intensity belongs are taken as the probabilities for that sample at that gene. The total probability of "good prognosis" (or "poor prognosis") of the sample is the product of the per-gene probabilities of "good prognosis" (or "poor prognosis"). The sample is assigned to the class with the higher total probability.
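The per-gene probability rule and the product over genes can be sketched in Python as below. This is an illustration of the description above rather than the inventor's implementation: the function names, the handling of values exactly on the boundary, the fallback for an empty group, and the tie rule are assumptions, and the toy data are built only to reproduce the 80/90 and 10/90 example.

```python
def fit_gene(values, labels):
    """Learn one gene's boundary and group counts from the learning set.

    values: expression intensities of this gene, one per learning-set sample
    labels: "good" or "poor" for each sample (clinical 5-year survival)
    """
    good = [v for v, l in zip(values, labels) if l == "good"]
    poor = [v for v, l in zip(values, labels) if l == "poor"]
    good_mean = sum(good) / len(good)
    poor_mean = sum(poor) / len(poor)
    boundary = (good_mean + poor_mean) / 2
    good_is_high = good_mean > poor_mean
    # counts[True] is the near-good group, counts[False] the near-poor group;
    # each entry is [number of "good" members, number of "poor" members].
    counts = {True: [0, 0], False: [0, 0]}
    for v, l in zip(values, labels):
        near_good = (v > boundary) == good_is_high
        counts[near_good][0 if l == "good" else 1] += 1
    return boundary, good_is_high, counts

def gene_probs(model, value):
    """(P(good), P(poor)) for a new sample's expression value at this gene."""
    boundary, good_is_high, counts = model
    n_good, n_poor = counts[(value > boundary) == good_is_high]
    total = n_good + n_poor
    if total == 0:
        return 0.5, 0.5  # assumption: an empty group gives no information
    return n_good / total, n_poor / total

def classify(models, sample):
    """Multiply per-gene probabilities and pick the larger product."""
    p_good = p_poor = 1.0
    for model, value in zip(models, sample):
        g, p = gene_probs(model, value)
        p_good *= g
        p_poor *= p
    return ("good" if p_good >= p_poor else "poor"), p_good, p_poor

# Toy learning set built so that the near-good group holds 80 "good" and
# 10 "poor" members, matching the worked example in the text.
values = [2.0] * 80 + [0.0] * 5 + [2.0] * 10 + [0.0] * 40
labels = ["good"] * 85 + ["poor"] * 50
model = fit_gene(values, labels)
example = gene_probs(model, 1.9)
```

Here `example` is `(80/90, 10/90)`. Note that the products are not normalized and include no class prior, which follows the text: only the comparison between the two products is used.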

In the third step, we perform cross-validation using the "leave-one-out" method to verify the recognition ability of the model built by our deep machine learning.
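A generic leave-one-out harness can be written independently of the model it evaluates. In the sketch below, `fit` and `predict` are hypothetical callables, and the one-gene midpoint classifier is only a stand-in so that the example runs; it is not the 7-gene model itself.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)  # the held-out sample is never seen here
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def fit_midpoint(xs, ys):
    """Stand-in model: boundary at the midpoint of the two class means."""
    good = [x for x, y in zip(xs, ys) if y == "good"]
    poor = [x for x, y in zip(xs, ys) if y == "poor"]
    good_mean = sum(good) / len(good)
    poor_mean = sum(poor) / len(poor)
    return (good_mean + poor_mean) / 2, good_mean > poor_mean

def predict_midpoint(model, x):
    boundary, good_is_high = model
    return "good" if (x > boundary) == good_is_high else "poor"

# Toy one-gene data invented for illustration; the last sample is atypical.
toy_x = [2.1, 1.9, 2.3, 0.2, 0.4, 1.8]
toy_y = ["good", "good", "good", "poor", "poor", "poor"]
accuracy = leave_one_out_accuracy(toy_x, toy_y, fit_midpoint, predict_midpoint)
```

On the toy data five of the six held-out samples are identified correctly, so `accuracy` is 5/6. Because the model is refitted without the held-out sample each time, the atypical "poor" sample at 1.8 cannot influence its own prediction.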

(IV) 7-gene probability identification model for breast cancer prognosis

We started the deep machine learning with 2 genes and added 1 gene at a time, each time selecting the combination with the strongest recognition ability. The recognition ability of the 7-gene combination exceeds that of the "70-gene signature". We selected the following 7 genes to construct our recognition model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC. Because our model contains only 7 genes, detecting the expression of these 7 genes (RNAs) is much simpler than detecting 70 genes: the cost can be as low as one tenth, less noise is introduced, and precision is improved. The detection can be done by RT-PCR. Technical literature [ 2 ] published a comparison of the accuracy of the "70-gene signature" with the traditional clinical criteria. Here we add the results of the 7-gene probabilistic recognition model. Clinical samples: 151 lymph-node-negative breast cancer patients, 97 who survived for more than five years and 54 controls. "7-gene model": accuracy, 84.1%; "70-gene signature": accuracy, 80.8%; "St. Gallen": accuracy, 59.0%; "NIH": accuracy, 46.2%.
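One possible reading of "start with 2 genes, add 1 gene at a time, keep the strongest combination" is a greedy forward selection, sketched below. This reading is an assumption: the description also states that all combinations are learned, which would correspond to an exhaustive search at every size. The function name and the toy scoring function are hypothetical; in practice the score would be something like the leave-one-out accuracy of the probabilistic model on the chosen genes.

```python
from itertools import combinations

def forward_select(candidates, score, start_size=2, max_size=7):
    """Best starting subset by exhaustive search, then add one gene at a time.

    Returns a list of (subset, score) pairs, one per subset size.
    """
    best = max(combinations(candidates, start_size), key=score)
    history = [(best, score(best))]
    while len(best) < max_size:
        remaining = [g for g in candidates if g not in best]
        if not remaining:
            break
        # Try every one-gene extension and keep the highest-scoring one.
        best = max((best + (g,) for g in remaining), key=score)
        history.append((best, score(best)))
    return history

# Toy score invented for illustration: each gene has a weight, and g1 and g2
# are treated as redundant, so using both is penalized.
TOY_WEIGHTS = {"g1": 5, "g2": 4, "g3": 3, "g4": 1}

def toy_score(subset):
    total = sum(TOY_WEIGHTS[g] for g in subset)
    if "g1" in subset and "g2" in subset:
        total -= 6  # redundancy penalty
    return total

history = forward_select(list(TOY_WEIGHTS), toy_score, start_size=2, max_size=3)
```

With the toy score, the search starts from `("g1", "g3")` and then adds `g4` rather than the redundant `g2`, which mirrors the stated preference for genes that are as independent as possible.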

(V) Detailed description of the preferred embodiments

(5.1) Overview:

The "learning set" of our 7-gene probabilistic identification model for breast cancer prognosis includes the "learning set" of the "70-gene signature". Our "7-gene model" is the result of deep machine learning and can be regarded as an upgraded version of the "70-gene signature".

(5.2) Detection method:

We will produce a 7-gene detection kit for the "7-gene model" to serve hospitals and other institutions that need it. We are also preparing to set up a third-party testing facility to perform the 7-gene testing for the "7-gene model".

(5.3) Calculation of the "7-gene model":

We will provide a networked computation server for the "7-gene model". A mobile phone app for computing the "7-gene model" will also be provided.

Technical literature

【1】Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002 Jan 31; 415(6871): 530-536.

【2】A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med, Vol. 347, No. 25, December 19, 2002.

【3】70-Gene Signature as an Aid to Treatment Decisions in Early-Stage Breast Cancer. N Engl J Med 2016; 375: 717-29.

Patent document

【1】Methods of assigning treatment to breast cancer patients. US Patent 7,171,311; January 30, 2007.

Claims

1. Probabilistic identification model for breast cancer prognosis generated by deep machine learning

Inventor: Zhang Peisen

1. Independent claim

(I) Description of the invention

Our invention is a "probabilistic identification model for breast cancer prognosis generated by deep machine learning". It is a gene recognition model for breast cancer prognosis, used to estimate clinical prognosis and to determine whether adjuvant chemotherapy is worthwhile. We used the same clinical dataset of about 25,000 RNAs as the "70-gene signature" (151 lymph-node-negative breast cancer patients: 97 who survived for more than five years and 54 controls).

(II) Characterizing portion

Our invention is characterized by deep machine learning and a probabilistic recognition model. From the 151-case dataset of the 70 genes of the "70-gene signature", we developed a probabilistic identification model for cancer prognosis by applying our independently developed deep machine learning data mining algorithm. We started the deep machine learning with 2 genes and added 1 gene at a time, each time selecting the combination with the strongest recognition ability. The recognition ability of the 7-gene combination exceeds that of the "70-gene signature". We selected the following 7 genes to construct our recognition model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.

Dependent claims

2. Deep machine learning:

(2.1) Our invention is a "probabilistic identification model for breast cancer prognosis generated by deep machine learning".

(2.2) The invention is characterized by deep machine learning. Variations of the machine learning scheme (e.g., starting with 3 genes and adding 2 genes at a time) shall all be considered to be included in the present invention.

3. Probability recognition model:

(3.1) Our invention is a "probabilistic identification model for breast cancer prognosis generated by deep machine learning".

(3.2) Our invention is characterized in that the recognition model treats the expression of each gene as independent, so that the probability of good prognosis and the probability of poor prognosis are each the product of the corresponding per-gene probabilities. Modifications of the recognition model (e.g., classification tree models) shall all be considered to be included in the present invention.

4. Gene combination:

(4.1) We selected the following 7 genes to construct our recognition model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.

(4.2) Our invention is characterized in that the 7-gene model formed by this 7-gene combination is one of the best models obtained by deep learning; other gene combinations can form similar models with very close precision. Such similar gene combinations shall be considered to be included in the present invention.

5. Detection method:

(5.1) We selected the following 7 genes to construct our recognition model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.

(5.2) We will produce a 7-gene detection kit for the "7-gene model" to serve hospitals and other institutions that need it. We are also preparing to set up a testing facility to perform the 7-gene testing for the "7-gene model".

6. Calculation of the "7-gene model":

(6.1) We selected the following 7 genes to construct our recognition model: Contig46223_RC, X05610, NM_006931, Contig55725_RC, NM_020386, AF055033, Contig2399_RC.

(6.2) We will provide a networked computation server for the "7-gene model". A mobile phone app for computing the "7-gene model" will also be provided.