CN114300139A - Construction of breast cancer prognosis model, application method and storage medium thereof

Info

Publication number: CN114300139A
Application number: CN202210037681.3A
Authority: CN
Inventors: 黄琛
Original assignee: Macau University of Science and Technology
Current assignee: Macau University of Science and Technology
Priority date: 2022-01-13
Filing date: 2022-01-13
Publication date: 2022-04-08

Abstract

The application relates to a construction method of a breast cancer prognosis model, an application method and a storage medium thereof, wherein the construction method comprises the following steps: acquiring a breast cancer training dataset and preprocessing data of each patient in the breast cancer training dataset, wherein the data of each patient comprises a plurality of features; based on the plurality of features, dividing the breast cancer training dataset into different groups, the different groups being determined based on infiltration levels of different immune cells; analyzing the differentially expressed genes among the different groups, and filtering to obtain candidate genes, wherein the differentially expressed genes comprise differentially expressed genes related to immune infiltration and differentially expressed genes involved in a metastasis mechanism; and constructing a risk scoring model based on the candidate genes. By integrating metastasis-associated, immune-related gene signals, not only can the prognosis of a patient be assessed more accurately, but the treatment of the patient can also be guided.

Description

Construction of breast cancer prognosis model, application method and storage medium thereof

Technical Field

The application relates to construction of a breast cancer prognosis model, an application method and a storage medium thereof.

Background

Cancer has a long history in humans and remains a leading cause of death, with breast cancer being one of the most common malignancies in women worldwide. Breast cancer is also the second most common cause of cancer-related death in women. Although great medical advances have reduced mortality over the years, the high heterogeneity of breast cancer still makes prognosis and treatment challenging.

Over the past decade, a great deal of work has been done to develop prognostic indicators of breast cancer progression. Most breast cancers (about 80%) become invasive, and about 20-30% lead to distant metastasis after treatment. Metastasis is thus the most lethal development of breast cancer, reducing long-term survival from 90% to 5%. However, most metastasis-based signatures were developed from organ-specific metastatic events, whereas breast cancer consists of tumors with extremely diverse cell types, which leads to differences in prognosis and survival. As a result, currently available metastasis-based prognostic indicators perform poorly. On the other hand, tumor-infiltrating lymphocytes have been reported to be closely linked to therapeutic efficacy and patient survival in many cancers. Many prognostic predictors have been developed by assessing the level of immune cell infiltration into tumors and are preferred for the prognosis of various cancers. These histological strategies, based on small-scale analysis of immune cell marker genes, support the prognostic significance of immune infiltration but are still limited. The first limitation of current research is the strategy used to describe the level of immune infiltration. Specifically, each immune cell subset is estimated computationally against a reference, based on an overall analysis of the tissue sample. This is a major disadvantage because the transcriptional program of immune cells exhibits high plasticity in the tumor microenvironment. Second, while most studies have used immune infiltration to improve cancer prognosis, they include only one or two immune cell subsets, which lack functional variation, and thus treatments based on these indicators fail to achieve a satisfactory immune response.
The above methods rely on prognostic indicators based on only a single characteristic, and are not sufficient to accurately assess risk stratification and guide treatment strategies.

Disclosure of Invention

The embodiments of the application provide a construction method of a breast cancer prognosis model, an application method of the breast cancer prognosis model and a storage medium, which aim to solve at least the problem in the related art that indicators based on only one characteristic cannot accurately evaluate risk stratification or guide treatment strategies.

In a first aspect, the present application provides a method for constructing a breast cancer prognosis model, including: acquiring a breast cancer training dataset and preprocessing data of each patient in the breast cancer training dataset, wherein the data of each patient comprises a plurality of features; based on the plurality of features, dividing the breast cancer training dataset into different groups, the different groups being determined based on infiltration levels of different immune cells; analyzing the differentially expressed genes among the different groups, and filtering to obtain candidate genes, wherein the differentially expressed genes comprise differentially expressed genes related to immune infiltration and differentially expressed genes involved in a metastasis mechanism; and constructing a risk scoring model based on the candidate genes.

In some embodiments, the acquiring a breast cancer training dataset and preprocessing data for each patient in the breast cancer training dataset comprises: acquiring original matrix data of the breast cancer training dataset, wherein row data of the original matrix data represent different probe sets, and column data represent different patients; obtaining the maximum expression value of each probe set, and screening out the duplicate genes detected by the probe sets; and standardizing the screened gene expression data.

In some embodiments, the dividing the breast cancer training dataset into different groups based on the plurality of features comprises: quantifying the data in each feature of the gene expression data separately using single-sample gene set enrichment analysis (ssGSEA); and, based on the quantified results, dividing the breast cancer training dataset into a first immune infiltration group and a second immune infiltration group using a clustering method.
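The grouping step above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the text does not name the clustering method, so a plain two-cluster k-means is assumed, the per-patient enrichment scores are assumed to have been computed already, and the patient identifiers and score values are hypothetical.

```python
from statistics import mean


def two_means(scores, n_iter=50):
    """Split patients into two clusters with a plain 2-means loop.

    scores: dict mapping patient id -> list of enrichment scores
            (one score per immune cell type), assumed precomputed.
    Returns a dict mapping patient id -> "high" or "low".
    """
    ids = list(scores)
    # Deterministic start: the patients with the lowest and highest mean score.
    lo_id = min(ids, key=lambda p: mean(scores[p]))
    hi_id = max(ids, key=lambda p: mean(scores[p]))
    centroids = [list(scores[lo_id]), list(scores[hi_id])]

    def dist2(a, b):
        # Squared Euclidean distance between two score vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = {}
    for _ in range(n_iter):
        new_labels = {
            p: min((0, 1), key=lambda k: dist2(scores[p], centroids[k]))
            for p in ids
        }
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in (0, 1):
            members = [scores[p] for p in ids if labels[p] == k]
            if members:
                centroids[k] = [mean(col) for col in zip(*members)]

    # The cluster whose centroid has the larger mean is called "high".
    high = max((0, 1), key=lambda k: mean(centroids[k]))
    return {p: ("high" if labels[p] == high else "low") for p in ids}


# Hypothetical enrichment scores for four patients and three immune cell types.
toy_scores = {
    "P1": [0.82, 0.75, 0.90],
    "P2": [0.78, 0.80, 0.85],
    "P3": [0.10, 0.20, 0.15],
    "P4": [0.05, 0.25, 0.12],
}
groups = two_means(toy_scores)
```

Here "high" and "low" stand in for the first and second immune infiltration groups; any other clustering method could be substituted without changing the rest of the flow.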

In some embodiments, said filtering to obtain candidate genes based on differentially expressed genes between said different groups comprises: analyzing, based on selection criteria and using a Wilcoxon rank-sum test, differentially expressed genes related to immune infiltration and involved in a metastasis mechanism between the first immune infiltration group and the second immune infiltration group to obtain a first candidate genome; screening a second candidate genome from the first candidate genome based on a univariate Cox proportional hazards regression model; and obtaining a third candidate genome by performing correlation analysis on the second candidate genome; wherein the correlation of the genes with the overall survival of the patient increases in order from the first candidate genome to the second candidate genome to the third candidate genome.
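The first screening stage above can be illustrated with a small sketch of a two-sided Wilcoxon rank-sum test. It is written under stated assumptions: the normal approximation with midranks and a tie correction is used, the significance threshold of 0.05 is an assumed selection criterion (the text does not state the criteria), and the gene names and expression values are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, midranks).

    Returns (z, p), where z > 0 means values in x tend to be larger.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        midrank = (i + 1 + j) / 2.0    # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    mean_w = n1 * (n + 1) / 2.0
    var_w = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_x - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical expression values per gene in the two immune infiltration groups.
expression = {
    "GENE_A": ([9, 10, 11, 12, 13, 14, 15, 16], [1, 2, 3, 4, 5, 6, 7, 8]),
    "GENE_B": ([1, 3, 5, 7, 9, 11, 13, 15], [2, 4, 6, 8, 10, 12, 14, 16]),
}
ALPHA = 0.05  # assumed threshold; the selection criteria are not given in the text
results = {g: rank_sum_test(a, b) for g, (a, b) in expression.items()}
selected = [g for g, (z, p) in results.items() if p < ALPHA]
```

In this toy input only GENE_A separates the two groups, so it alone passes the assumed threshold. The later stages (univariate Cox regression and correlation analysis) would further narrow such a list and are not shown here.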

In some embodiments, the constructing a risk scoring model based on the candidate genes comprises: constructing a multilayer perceptron neural network based on the third candidate genome, wherein the multilayer perceptron neural network is used for optimizing the weight of each candidate gene in the third candidate genome, and the optimized weight is used as the maximum weight of a hidden layer of the neural network; and determining the risk scoring model.

In some embodiments, the risk scoring model is represented as:

wherein MIRS_i represents the risk score of the i-th patient computed from the third candidate genome; weight represents the maximum weight of the hidden layer of the multilayer perceptron neural network; I_{protective gene} and I_{risk gene} represent the expression values of the i-th gene used for constructing the risk scoring model, the genes being classified as protective genes or risk genes based on a hazard ratio and a score cutoff value; m and n respectively represent the number of protective genes and the number of risk genes, and the sum of m and n is equal to the total number of genes of the third candidate genome; the score cutoff value is determined based on the expression values of the genes of all patients; and I_{protective gene} and I_{risk gene} are respectively expressed as:

In some embodiments, the construction method further comprises: calculating a risk score for each patient in the breast cancer training dataset based on the risk scoring model; determining a group cutoff value from the risk scores of all patients in the breast cancer training dataset, and dividing the patients in the breast cancer training dataset into a first high-risk subgroup and a first low-risk subgroup according to the group cutoff value and the risk scores of the patients; and evaluating the difference in survival distribution between the first high-risk subgroup and the first low-risk subgroup using Kaplan-Meier curves of the breast cancer training dataset.
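The stratification and survival comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the group cutoff value is derived, so the median risk score is assumed here, and the patient identifiers, risk scores and follow-up data are hypothetical.

```python
from statistics import median


def split_by_cutoff(risk_scores, cutoff=None):
    """Divide patients into high- and low-risk subgroups.

    The median of all risk scores is an assumed default cutoff.
    """
    if cutoff is None:
        cutoff = median(risk_scores.values())
    high = [p for p, s in risk_scores.items() if s > cutoff]
    low = [p for p, s in risk_scores.items() if s <= cutoff]
    return cutoff, high, low


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each event time.

    events[i] is 1 if the patient died at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


# Hypothetical risk scores and follow-up (months, event indicator).
risk = {"P1": 0.875, "P2": 0.75, "P3": 0.25, "P4": 0.125, "P5": 0.625, "P6": 0.375}
follow_up = {"P1": (12, 1), "P2": (20, 1), "P5": (30, 0),
             "P3": (40, 0), "P4": (50, 1), "P6": (60, 0)}

cutoff, high, low = split_by_cutoff(risk)
km_high = kaplan_meier([follow_up[p][0] for p in high], [follow_up[p][1] for p in high])
km_low = kaplan_meier([follow_up[p][0] for p in low], [follow_up[p][1] for p in low])
```

A formal comparison of the two curves (for example, a log-rank test) would normally follow; it is omitted to keep the sketch short.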

In some embodiments, the construction method further comprises: evaluating the independence of the genes in the third candidate genome based on multivariate Cox proportional hazards regression analysis.

In some embodiments, the construction method further comprises: acquiring a breast cancer validation dataset; and validating the performance of the risk scoring model based on the breast cancer validation dataset.

In a second aspect, the embodiments of the present application provide a method for applying a breast cancer prognosis model, where the breast cancer prognosis model includes a risk score model constructed by the above construction method, and the application method includes: obtaining gene expression data of a patient, wherein the gene expression data of the patient comprises gene expression data related to immune infiltration and involved in metastasis mechanisms for constructing the risk score model; calculating a risk score for the patient according to the risk score model based on the gene expression data for the patient.

In a third aspect, an embodiment of the present application provides an electronic device for breast cancer prognosis, including: a memory for storing a program; and a processor for executing the program stored in the memory, and when the processor executes the program stored in the memory, the processor is configured to execute the above construction method or the above application method.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program for executing the above-mentioned construction method or the above-mentioned application method.

According to the embodiments of the application, breast cancer patients are divided into a high-risk group and a low-risk group by integrating metastasis-related and immune-related gene signals, so that the prognosis of the patients can be evaluated more accurately and potential treatment strategies can be identified to guide the treatment of the patients. In addition, by constructing a neural network model to estimate gene weights and then establishing a metastasis and immune risk score, the score shows a significantly better ability to predict survival status than the single-feature indicators of the prior art.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.

FIG. 1 illustrates a flowchart of a method for constructing a Metastasis and Immune Risk Score (MIRS) model according to an embodiment of the present application;

FIG. 2 illustrates sample composition information for a data set used by an embodiment of the present application;

FIG. 3 illustrates a step flow diagram of step S100 of FIG. 1 according to an embodiment of the present application;

FIG. 4 illustrates a step flow diagram of step S200 of FIG. 1 according to an embodiment of the present application;

FIG. 5 illustrates a schematic flow chart for constructing a prognostic risk score model according to an embodiment of the present application;

FIG. 6 shows a schematic representation of an immune cell infiltration grouping according to an embodiment of the present application;

FIG. 7A shows a box plot of the expression levels of CD family genes in the high and low immune infiltration groups;

FIG. 7B shows a box plot of the expression levels of IL family genes in the high and low immune infiltration groups;

FIG. 8 is a graph illustrating a comparison of evaluation results between the high and low immune infiltration groups calculated using ESTIMATE according to one embodiment of the present application;

FIG. 9 is a graph showing a comparison of the proportions of immune cell types between the high and low immune infiltration groups calculated using CIBERSORT according to one embodiment of the present application;

FIG. 10 illustrates a step flow diagram of step S300 of FIG. 1 according to an embodiment of the present application;

FIG. 11 illustrates a schematic diagram of selecting a first candidate genome based on differentially expressed genes, according to an embodiment of the present application;

FIG. 12A shows a heat map of differentially expressed genes between the first and second immune infiltration groups in the TCGA cohort;

FIG. 12B shows a heat map of differentially expressed genes between the primary and metastatic groups in the GSE10893 cohort;

FIG. 12C shows a heat map of differentially expressed genes between the primary and metastatic groups in the GSE3521 cohort;

FIG. 13 illustrates a list of the second candidate genome selected according to an embodiment of the present application;

FIG. 14A shows a schematic diagram of screening candidate genes from the second candidate genome using variance inflation factor values according to an embodiment of the present application;

FIG. 14B shows a schematic diagram of screening candidate genes from the second candidate genome using Pearson correlation coefficients according to an embodiment of the present application;

FIG. 15 illustrates a flowchart of the operation of step S400 of FIG. 1 according to an embodiment of the present application;

FIG. 16 illustrates a schematic diagram of a neural network training model in accordance with an embodiment of the present application;

FIG. 17 illustrates an ROC curve obtained using a TCGA cohort based on the MIRS model according to an embodiment of the present application;

FIG. 18 illustrates a flow chart of a method for constructing the MIRS model of FIG. 1 according to another embodiment of the present application;

FIG. 19A illustrates an OS curve obtained using the TCGA cohort based on the MIRS model according to an embodiment of the present application;

FIG. 19B illustrates an OS curve obtained using the GSE20685 cohort based on the MIRS model according to an embodiment of the present application;

FIG. 20A illustrates an ROC curve obtained using the GSE96058 cohort based on the MIRS model according to another embodiment of the present application;

FIG. 20B illustrates an ROC curve obtained using the GSE86166 cohort based on the MIRS model according to another embodiment of the present application;

FIG. 20C illustrates an ROC curve obtained using the GSE96058 cohort based on the MIRS model according to an embodiment of the present application;

FIG. 20D illustrates an OS curve obtained using the GSE96058 cohort based on the MIRS model according to an embodiment of the present application;

FIG. 20E illustrates an OS curve obtained using the GSE86166 cohort based on the MIRS model according to an embodiment of the present application;

FIG. 20F illustrates an OS curve obtained using the GSE96058 cohort based on the MIRS model according to an embodiment of the present application;

FIG. 21A illustrates a comparison of the evaluation results between the high and low immune infiltration groups in the GSE86166 cohort calculated using ESTIMATE according to an embodiment of the present application;

FIG. 21B illustrates a comparison of the evaluation results between the high and low immune infiltration groups in the GSE96058 cohort calculated using ESTIMATE according to an embodiment of the present application;

FIG. 22 shows a schematic representation of the results of a functional enrichment analysis performed with Metascape based on the candidate genes according to an embodiment of the present application;

FIG. 23 shows a comparison of the results of an ssGSEA analysis based on 17 immune-related biological functions and pathways according to an embodiment of the present application;

FIG. 24A shows a graphical representation of the results of correlation analysis of MIRS with three important immune checkpoint molecules (PD-1, PD-L1, and CTLA4) according to an embodiment of the present application;

FIG. 24B shows a schematic graph of the results of correlation analysis of TCGA cohort-based MIRSs with three important immune checkpoint molecules (PD-1, PD-L1, and CTLA4) according to an embodiment of the present application;

FIG. 24C shows a schematic of the results of correlation analysis of MIRS based on the GSE96058 cohort with three important immune checkpoint molecules (PD-1, PD-L1, and CTLA4) according to an embodiment of the present application;

FIG. 25 shows a graphical representation of the results of an analysis of the correlation of MIRS scores with expression levels of PD-1, PD-L1, and CTLA4 according to an embodiment of the present application;

FIG. 26 is a schematic diagram showing the results of an ssGSEA analysis based on 23 metastasis-associated genes according to an embodiment of the present application;

FIG. 27A is a box plot showing the expression values of the DCC, MMP9, and ETS1 genes in different groups according to an embodiment of the present application;

FIG. 27B shows a schematic diagram of the correlation analysis results of DCC, MMP9, and ETS1 genes with MIRS scores according to an embodiment of the present application;

FIG. 28A shows a Sankey diagram of MIRS scores and different intrinsic molecular subtypes in the TCGA cohort according to an embodiment of the present application;

FIG. 28B illustrates a violin plot of MIRS score distributions for different intrinsic molecular subtypes in the TCGA cohort according to an embodiment of the present application;

FIG. 28C illustrates a Sankey diagram of MIRS scores and different intrinsic molecular subtypes in the METABRIC cohort according to an embodiment of the present application;

FIG. 28D illustrates a violin plot of MIRS score distributions for different intrinsic molecular subtypes in the METABRIC cohort according to an embodiment of the present application;

FIG. 29 is a schematic representation of the results of a GSEA analysis based on the APOA5 gene in the GSE20685 cohort according to an embodiment of the present application;

FIG. 30A shows a schematic diagram of an analysis of the enrichment results for EMT according to an embodiment of the present application;

FIG. 30B shows a schematic diagram of an analysis of the enrichment results for TNFA signaling via NFKB according to one embodiment of the present application;

FIG. 30C shows a schematic diagram of an analysis of enrichment results for an immune response modulating signaling pathway according to an embodiment of the present application;

FIG. 31 shows a schematic diagram of the analysis of the results of the expression of the APOA5 gene between different groups according to an embodiment of the present application;

FIG. 32A illustrates a graph of the difference in survival distribution for different APOA5 expression values in breast cancer patients according to an embodiment of the present application;

FIG. 32B illustrates a graph of the difference in survival distribution for different APOA5 expression values in head and neck squamous cell carcinoma patients according to an embodiment of the present application;

FIG. 32C illustrates a graph of the difference in survival distribution for different APOA5 expression values in gastric adenocarcinoma patients according to an embodiment of the present application;

FIG. 32D illustrates a graph of the difference in survival distribution for different APOA5 expression values in renal clear cell carcinoma patients according to an embodiment of the present application;

FIG. 32E illustrates a graph of the difference in survival distribution for different APOA5 expression values in lung adenocarcinoma patients according to an embodiment of the present application;

FIG. 32F illustrates a graph of the difference in survival distribution for different APOA5 expression values in squamous cell lung carcinoma patients according to an embodiment of the present application;

FIG. 33A illustrates a schematic diagram of the 10 genes with the highest mutation frequency in the MIRS^high group of the TCGA BRCA cohort according to an embodiment of the present application;

FIG. 33B illustrates a schematic diagram of the 10 genes with the highest mutation frequency in the MIRS^low group of the TCGA BRCA cohort according to an embodiment of the present application;

FIG. 34 illustrates a box plot of MIRS scores at different TMBs, according to an embodiment of the present application;

FIG. 35 is a graph illustrating the results of a correlation analysis of MIRS and TMB according to one embodiment of the present application;

FIG. 36A illustrates OS curves for patients receiving adjuvant chemotherapy in the GSE20685 cohort according to an embodiment of the present application;

FIG. 36B shows OS curves of patients receiving adjuvant chemotherapy and patients not receiving adjuvant chemotherapy in the MIRS^high group of the GSE20685 cohort according to one embodiment of the present application;

FIG. 36C shows OS curves of patients receiving adjuvant chemotherapy and patients not receiving adjuvant chemotherapy in the MIRS^low group of the GSE20685 cohort according to one embodiment of the present application;

FIG. 37A is a graph showing the results of a drug sensitivity assay for platinum compounds in a TCGA cohort according to one embodiment of the present application;

FIG. 37B is a graph showing the results of a vincristine drug sensitivity assay in a TCGA cohort according to an embodiment of the present application;

FIG. 37C shows a schematic diagram of the results of an analysis of drug sensitivity to imatinib in the TCGA cohort according to an embodiment of the present application;

FIG. 38A shows a comparison of the estimated IC50 of imatinib between the MIRS^high and MIRS^low groups according to an embodiment of the present application;

FIG. 38B shows a comparison of the estimated IC50 of platinum compounds between the MIRS^high and MIRS^low groups according to an embodiment of the present application;

FIG. 38C shows a comparison of the estimated IC50 of gemcitabine between the MIRS^high and MIRS^low groups according to an embodiment of the present application;

FIG. 39 illustrates a box plot of TIS scores for the MIRS^high and MIRS^low groups in the GSE20711 cohort according to an embodiment of the present application;

FIG. 40A shows ROC curves for MIRS and PD-1 without anti-PD-1 treatment according to an embodiment of the present application;

FIG. 40B shows a graphical representation of ROC curves comparing MIRS at different time periods under anti-PD-1 treatment according to an embodiment of the present application;

FIG. 40C shows a graphical representation of ROC curves comparing PD-1 at different time periods under anti-PD-1 treatment according to an embodiment of the present application;

FIG. 41A illustrates OS curves of the MIRS^high and MIRS^low groups according to an embodiment of the present application;

FIG. 41B illustrates OS curves of the MIRS^high and MIRS^low groups according to another embodiment of the present application;

FIG. 42A illustrates a violin plot of MIRS values for CR/PR and SD/PD patients according to an embodiment of the present application;

FIG. 42B illustrates a violin plot of MIRS values for CR/PR and SD/PD patients according to another embodiment of the present application;

FIG. 43A shows a schematic of the distribution of CR/PR and SD/PD patients within the MIRS^high and MIRS^low groups according to an embodiment of the present application;

FIG. 43B shows a schematic of the distribution of CR/PR and SD/PD patients within the MIRS^high and MIRS^low groups according to another embodiment of the present application;

FIG. 44A illustrates a schematic diagram of an analysis of MIRS performance results in different cohorts according to an embodiment of the present application;

FIG. 44B shows a schematic diagram of an analysis of mPS performance results in different cohorts;

FIG. 44C illustrates a performance result analysis diagram for a scenario in different cohorts;

FIG. 45A illustrates OS curves of the MIRS^high and MIRS^low groups in the Liu et al. cohort according to one embodiment of the present application;

FIG. 45B shows OS curves of the mPS MIRS^high and MIRS^low groups in the Liu et al. cohort;

FIG. 45C shows OS curves of the MIRS^high and MIRS^low groups in the Liu et al. cohort;

FIG. 46 illustrates ROC curves for MIRS, mPS, and tret at different time periods;

FIG. 47 illustrates a flowchart of a method for applying a risk scoring model according to an embodiment of the present application;

The file of this patent or application contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided upon request and payment of the necessary fee. FIGS. 5 to 9, 11 to 12C, 14A to 14B, 16 to 17, and 19A to 46 are color drawings; details of the corresponding color drawings are described in the substantive examination reference materials.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It should be understood that, in the description of the embodiments of the present application, a plurality means two or more; terms such as "greater than", "less than" and "exceeding" are understood as excluding the stated number, and terms such as "above", "below" and "within" are understood as including the stated number. If "first", "second", "third", etc. are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or the precedence of the indicated technical features.

Metastasis is the most lethal characteristic of breast cancer. Most metastasis-based signatures were developed from organ-specific metastatic events; however, tumors contain very diverse cell types, so currently available metastasis-based prognostic indicators perform poorly. In the prior art, many prognostic indicators were developed by assessing the level of immune cell infiltration into tumors and are preferred for the prognosis of various cancers, but they have two limitations: one is the choice of strategy for describing the level of immune infiltration, and the other is that only one or two immune cell subsets are included, so a satisfactory immune response is not achieved either. Two major features, tumor metastasis and immune infiltration, have been widely documented as being associated with tumor development, drug resistance and patient prognosis in breast cancer. Considerable research has revealed the role of metastasis and tumor immune infiltration as prognostic factors for predicting breast cancer survival outcomes. Unfortunately, breast tumors are highly heterogeneous between individuals. Much of the current work considers only organ-specific metastasis or immune infiltration levels, which is not sufficient to obtain satisfactory prognostic predictive power and thus cannot accurately guide medical treatment. To address this issue, embodiments of the present application provide a comprehensive and effective prognostic model that takes into account both metastasis and immune infiltration levels, to help clinicians provide accurate treatment strategies for breast cancer patients.

The embodiments of the application provide a method for constructing a breast cancer prognosis model. The method comprises: obtaining a plurality of public breast cancer cohorts and preprocessing the gene expression data in the cohorts; analyzing the features of the preprocessed gene expression data and dividing the data in the cohorts into different immune infiltration groups; analyzing the differentially expressed genes related to immune infiltration and metastasis characteristics among the different groups, and screening them to obtain a candidate genome significantly related to overall survival (OS); and constructing a risk scoring neural network model based on the candidate genes, so as to automatically calculate a patient's prognostic risk from the patient's gene expression data and clinical data and thereby further guide the treatment of the patient.

The embodiments of the present application will be further explained with reference to the drawings.

Fig. 1 is a flowchart of a method for constructing a Metastasis and Immune Risk Score (MIRS) model according to an embodiment of the present disclosure. The method comprises the following steps:

Step S100, acquiring a breast cancer training dataset, and preprocessing data of each patient in the breast cancer training dataset, wherein the data of each patient comprises a plurality of features;

Step S200, dividing the breast cancer training dataset into different groups based on the plurality of features, wherein the different groups are determined based on the infiltration levels of different immune cells;

Step S300, analyzing differentially expressed genes among the different groups, and filtering to obtain candidate genes, wherein the differentially expressed genes comprise differentially expressed genes related to immune infiltration and differentially expressed genes involved in a metastasis mechanism;

Step S400, constructing a risk scoring model based on the candidate genes.

In some embodiments, the breast cancer dataset comprises a gene expression profile dataset and a clinical dataset. All gene expression profiles and corresponding clinical datasets used herein were collected from the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). Only available cohorts with sufficient overall survival information were included, giving 8424 patients from 14 cohorts. The sample composition of each cohort is shown in FIG. 2.

Two public cohorts containing 1243 breast cancer patients were analyzed during the training phase to generate risk scores. Cohorts GSE86166, GSE96058, GSE20685, GSE20711, GSE58812, GSE9893, GSE3143, GSE425678 and METABRIC, containing a total of 6598 breast cancer patients, were used to test the robustness of the constructed risk scoring model in the validation phase. Both the expression profiles and the clinical data of the skin cutaneous melanoma cohort (TCGA-SKCM) were downloaded from the TCGA database; patients in TCGA-SKCM received various immunotherapies, such as immune checkpoint inhibitors, vaccines and cytokines, and patients in the Liu et al. data received anti-PD-1 treatment. The two malignant melanoma cohorts (TCGA-SKCM and TCGA-BRCA) were used as training data sets to enhance the prediction of immunotherapy response by MIRS, the METABRIC data set and part of the GEO data sets were used as validation data sets to validate the accuracy of the constructed risk scoring model, and all 14 cohorts together were considered the complete cohort.

Fig. 3 is a flowchart of the steps of step S100 of fig. 1 according to an embodiment of the present application. As shown in the figure, in some embodiments, the step S100 specifically includes the following steps:

step S110, acquiring original matrix data of a breast cancer training data set, wherein row data of the original matrix data represent different probe sets, and column data represent different patients;

step S120, respectively obtaining the maximum expression value of each probe set, and screening the duplicate genes detected by the probe sets;

step S130, the screened gene expression data is standardized.

It should be noted that the operations of data processing, analysis, risk scoring model construction, training, testing, verification, and the like, which are referred to in the present application, are all executed based on the R language environment, but in practical applications, a suitable language and tool can be selected according to a specific execution environment, which is not limited herein.

Affymetrix chips store a large amount of bioinformatic data. GEO data come in four basic types: Sample, Platform, Series and Dataset. A Series is the record of one study, including processed data, summaries and analyses, so information can be obtained quickly by parsing its series matrix (GSEMatrix) file. In some embodiments, raw expression matrices (series matrix files) generated on Affymetrix platforms are downloaded from the GEO database using the GEOquery R package, with row data representing different probe sets and column data representing different patients, and are then processed and analyzed.

In some embodiments, the maximum expression value of each probe set is obtained, and duplicate genes detected by multiple probe sets are screened. The original matrix data contains many different gene features, but some genes have low correlation with breast cancer, and analyzing them is wasted effort. Therefore, the maximum expression value of each probe set is obtained first, and the duplicate genes detected by a plurality of probe sets are resolved on that basis and retained as the initial candidate genome. The initial candidate genome data is then normalized; in some embodiments, gene expression values are normalized using a log2 transform. Each GEO cohort and each Next Generation Sequencing (NGS) cohort is processed independently. Preliminary screening of the original matrix data reduces the processing of invalid data, and normalization reduces the heterogeneity of the data, which facilitates the subsequent analyses.
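The preprocessing of steps S120 and S130 can be illustrated with a minimal Python sketch. It is only one possible reading of the text and is not the R implementation of the embodiment: it assumes that, when several probe sets map to the same gene, the probe set with the largest maximum expression value is kept, and that a pseudocount of 1 is added before the log2 transform. The names `collapse_probes`, `log2_transform`, `probe_to_gene` and `pseudocount` are hypothetical.

```python
import math


def collapse_probes(rows, probe_to_gene):
    """Keep one probe set per gene: the one whose maximum expression is largest.

    rows maps probe-set id -> list of expression values (one per patient).
    probe_to_gene maps probe-set id -> gene symbol (hypothetical annotation).
    """
    best = {}
    for probe, values in rows.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe set without a gene annotation is skipped
        if gene not in best or max(values) > max(best[gene]):
            best[gene] = values
    return best


def log2_transform(matrix, pseudocount=1.0):
    """Log2-normalize every value; the pseudocount (an assumption) avoids log2(0)."""
    return {
        gene: [math.log2(v + pseudocount) for v in values]
        for gene, values in matrix.items()
    }


rows = {"p1": [1.0, 3.0], "p2": [7.0, 0.0], "p3": [3.0, 3.0]}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
normalized = log2_transform(collapse_probes(rows, probe_to_gene))
```

In this toy input, GENE_A is detected by two probe sets and only the values of `p2` are retained before normalization.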

Fig. 4 is a flowchart of steps of step S200 of fig. 1 according to an embodiment of the present application. As shown in the figure, in some embodiments, the step S200 specifically includes the following steps:

step S210, quantifying data in each feature of the gene expression data by using single-sample gene set enrichment analysis;

step S220, based on the quantification result, dividing the data set into a first immune infiltration group and a second immune infiltration group using a clustering method.

In order to obtain important prognostic biomarkers for breast cancer, the embodiments of the present application propose a systematic bioinformatic analysis scheme. Given that tumor metastasis and immune infiltration play important roles in cancer development, the embodiments assume that the expression of genes associated with metastasis and immune infiltration in tumors may be correlated with the Overall Survival (OS) of cancer patients, and a prognostic signature is identified based on these two features. The overall process for screening candidate genes based on gene expression data and constructing a risk scoring model is shown in FIG. 5. For ease of understanding, each step is described in detail below in conjunction with FIG. 5.

In the related art, the biomarker panel includes 45 immune signatures associated with immune cell types, immune-related pathways and functions. Single-sample gene set enrichment analysis (ssGSEA) was performed with the GSVA R package to quantify the infiltration levels of different immune cells and the activity of immune gene pathways and immune-related functions. Based on the ssGSEA results, the gene expression profiles of 1100 patients from the TCGA cohort were analyzed by hierarchical clustering using the "hclust" function in R, and the patients were divided into a first immune cell infiltration group and a second immune cell infiltration group. The two groups represent different infiltration levels, for example a high immune infiltration group and a low immune infiltration group; the division results are shown in fig. 6.

In some embodiments, to verify the reliability of the immune cell infiltration grouping strategy, the expression levels of two immune-related gene families, CD and IL, were compared between the two groups. The results are shown in fig. 7A and 7B, where fig. 7A is a box plot of the expression levels of CD family genes in the high and low immune infiltration groups, and fig. 7B is a box plot of the expression levels of IL family genes in the high and low immune infiltration groups. As can be seen from the figures, the expression of both immune-related gene families is significantly higher in the high immune infiltration group than in the low immune infiltration group. The Stromal Score, Immune Score, ESTIMATE Score, and Tumor Purity were calculated for each breast cancer patient using the ESTIMATE R package with default parameter settings. As shown in fig. 8, the high immune infiltration group showed a higher proportion of immune cells and stromal cells, but lower tumor purity. Furthermore, in some embodiments, the CIBERSORT algorithm with 1000 permutations was used to assess the proportions of immune cell types in the different immune infiltration groups. As shown in fig. 9, for most immune cell types the proportion in the high immune cell infiltration group was significantly greater than in the low immune cell infiltration group.

Fig. 10 is a flowchart of steps of step S300 in fig. 1 according to an embodiment of the present application, which is described in conjunction with the flowchart of step S300' in fig. 5, in some embodiments, step S300 specifically includes the following steps:

step S310, analyzing differentially expressed genes related to immune infiltration and participating in a metastasis mechanism from the first immune infiltration group and the second immune infiltration group respectively, based on selection criteria and using the Wilcoxon rank-sum method, to obtain a first candidate genome;

step S320, screening a second candidate genome from the first candidate genome based on a univariate Cox proportional risk regression model;

step S330, a third candidate genome is obtained by performing correlation analysis on the second candidate genome; wherein the correlation of genes within the first candidate genome, the second candidate genome, and the third candidate genome to the overall survival of the patient increases in order.

In some embodiments, to identify candidate genes associated with immune infiltration, the Wilcoxon rank-sum method was used with the filter criteria |log2FC| greater than 0.5 and adjusted p-value less than 0.05, where FC is the fold change, which measures the magnitude of differential gene expression between the immune infiltration groups, and p-values were adjusted using the Benjamini & Hochberg method (BH method). Differentially expressed genes associated with tumor immune infiltration were detected between the high and low immune infiltration groups (corresponding to the first and second immune infiltration groups, respectively), and 1222 differentially expressed genes associated with immune infiltration were identified. In another example, to identify candidate genes associated with metastasis, 2159 differentially expressed genes involved in the metastasis mechanism were screened by comparing metastatic and primary tumors in the dataset using the Wilcoxon rank-sum method with the same criteria. It should be noted that an imbalance between primary and metastatic tumor samples biases differential expression analysis; for example, the TCGA breast cancer cohort contains 1165 primary tumor samples but only 23 metastatic tumor samples. To ensure a relatively balanced sample size between the metastatic and primary tumor groups, two cohorts, GSE10893 and GSE3521, were selected, and the 2159 genes were identified from them.
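The filtering criteria described above can be sketched in Python as follows. This is an illustrative sketch rather than the R code of the embodiment: it assumes that the raw p-values of the Wilcoxon rank-sum test have already been computed for each gene, and that group means are positive values on the linear scale. The function names `bh_adjust` and `select_degs` are hypothetical.

```python
import math


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, mean_a, mean_b, pvalues, lfc_cut=0.5, p_cut=0.05):
    """Keep genes with |log2FC| > lfc_cut and BH-adjusted p-value < p_cut."""
    selected = []
    for gene, a, b, q in zip(genes, mean_a, mean_b, bh_adjust(pvalues)):
        log2_fc = math.log2(a / b)
        if abs(log2_fc) > lfc_cut and q < p_cut:
            selected.append(gene)
    return selected


genes = ["g1", "g2", "g3", "g4"]
degs = select_degs(genes, [4.0, 4.0, 1.0, 8.0], [1.0, 1.0, 1.0, 1.0],
                   [0.01, 0.04, 0.03, 0.5])
```

The first candidate genome would then be the intersection of the two gene lists obtained this way, for example `set(immune_degs) & set(metastasis_degs)` for two hypothetical lists.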

In some embodiments, 52 genes were selected by intersecting the 1222 immune infiltration-associated genes and the 2159 metastasis-associated genes, as shown in fig. 11, where the intersection represents the first candidate genome, associated with both immune infiltration and metastasis. Fig. 12A is a heat map of differentially expressed genes between the first and second immune infiltration groups in the TCGA cohort. Fig. 12B is a heat map of differentially expressed genes between the primary and metastatic groups in the GSE10893 cohort. Fig. 12C is a heat map of differentially expressed genes between the primary and metastatic groups in the GSE3521 cohort.

In some embodiments, univariate Cox proportional hazards regression analysis was used to screen the 52 candidate genes in the TCGA breast cancer cohort for features associated with Overall Survival (OS). Of these 52 candidate genes, only genes with an absolute Hazard Ratio (HR) greater than 1 and a p-value less than 0.05 were retained; the resulting 15 genes were used for the subsequent studies, and the filtered gene list is shown in fig. 13. Because excessive redundant variables lead to overfitting in a linear model, the qualified candidate genes were further filtered to eliminate collinearity, using the criteria that the square root of the variance inflation factor (VIF) is less than 2 and the Pearson correlation coefficient is less than 0.5; the screening results are shown in fig. 14A and 14B. Finally, 12 prognostic genes significantly associated with patient OS were obtained as the third candidate genome, specifically: APOA5, FAM9C, IVL, PAGE5, CACNA1E, CCL25, CD1A, CD1B, GPR55, LAX1, TNFRSF8 and WNT10A.
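The Pearson part of the collinearity filter can be sketched as below. The text does not state which gene of a correlated pair is discarded, so this sketch assumes a greedy rule: genes are visited in a given priority order (for example by ascending Cox p-value) and a gene is kept only if its absolute correlation with every gene already kept is below 0.5. The VIF criterion requires a regression fit and is not shown. The names `pearson` and `filter_correlated` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_correlated(expr, ranked_genes, max_abs_r=0.5):
    """Greedy filter (an assumed rule): keep a gene only if it is weakly
    correlated with all genes kept so far."""
    kept = []
    for gene in ranked_genes:
        if all(abs(pearson(expr[gene], expr[k])) < max_abs_r for k in kept):
            kept.append(gene)
    return kept


expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.5],   # almost proportional to A
    "C": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with A
}
kept = filter_correlated(expr, ["A", "B", "C"])
```

Here gene B is dropped because it is nearly collinear with the higher-priority gene A, while C is retained.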

Fig. 15 is a flowchart of an operation of step S400 in fig. 1 according to an embodiment of the present disclosure, which is described in conjunction with the flowchart of step S400' in fig. 5, in some embodiments, step S400 specifically includes the following steps:

step S410, constructing a multilayer perceptron neural network based on the third candidate genome, wherein the multilayer perceptron neural network is used for optimizing the weight of each candidate gene in the third candidate genome, and the optimized weight is used as the maximum weight of a hidden layer of the neural network;

step S420, a risk scoring model is determined.

In the prior art, several machine learning methods have proved successful on a variety of data mining problems, including those involving transcriptomic data. Therefore, the present application estimates the weights of the 12 prognostic genes by constructing a multilayer perceptron neural network model. Fig. 16 is a schematic diagram of a neural network training model according to an embodiment of the present application.

In some embodiments, the neural network is constructed using the TensorFlow and Keras packages, with rectified linear units (ReLU) as the activation function in the hidden layer and a Softmax function on the two nodes of the output layer; cross-entropy error is used as the loss function, and the Adam method is used to optimize the prognostic gene weights. After model training is finished, the coefficient of each prognostic gene is determined as the maximum weight of the hidden layer.
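The building blocks named above can be written out in a few lines of plain Python for illustration; this is not the TensorFlow/Keras model of the embodiment and it performs no training. The last function reflects one assumed reading of "the maximum weight of the hidden layer", namely that the coefficient of a gene is the largest weight among its connections to the hidden units. The weight values and the names `relu`, `softmax`, `cross_entropy` and `gene_coefficients` are hypothetical.

```python
import math


def relu(z):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in z]


def softmax(z):
    """Softmax over the output nodes, shifted by the maximum for stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample whose true class index is given."""
    return -math.log(probs[true_class])


def gene_coefficients(genes, input_to_hidden):
    """Assumed reading: coefficient of a gene = largest weight on its
    connections to the hidden layer (one row of weights per gene)."""
    return {gene: max(row) for gene, row in zip(genes, input_to_hidden)}


# Hypothetical trained weights for two genes and three hidden units.
coefficients = gene_coefficients(
    ["APOA5", "IVL"],
    [[0.1, -0.3, 0.7], [0.2, 0.05, -0.4]],
)
```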

In some embodiments, each of the 12 finally determined prognostic genes is assigned to one of two classes: protective, where the HR is less than 1, and dangerous (risk), where the HR is greater than 1. The expression status of a protective mRNA is assigned 1 if its expression level is above the score cutoff value, and 0 otherwise. Conversely, the expression status of a risk mRNA is assigned 1 if its expression value is below the score cutoff value, and 0 otherwise. In some embodiments, the score cutoff value is the median of the expression values of all patients in the dataset. The protective genes and the dangerous genes are distributed in the breast cancer patient population according to MIRS scores, with the dangerous genes defined as being larger than the score cutoff value and the protective genes smaller than the score cutoff value. In clinical terms, a gene is protective if higher expression corresponds to longer survival, and dangerous if higher expression corresponds to worse survival. The Hazard Ratio is the ratio of two hazard rates. In the embodiments of the application, it is obtained as the ratio of the survival rates of two groups of patients (the groups with high and low MIRS scores, namely the first high-risk subgroup and the first low-risk subgroup); the larger the ratio, the larger the difference in survival time and the greater the influence of the gene on breast cancer prognosis. The definitions of protective genes and risk genes are represented by the following formulae:

In some embodiments, the risk score of each patient is the sum, over the genes within the third candidate genome, of the product of the prognostic gene weight and the expression value of each metastasis- and immune-associated gene, specifically represented by the formula:

wherein MIRS_i represents the risk score of the i-th patient calculated from the third candidate genome; weight represents the maximum weight of the hidden layer of the multilayer perceptron neural network; I_{protective gene} and I_{dangerous gene} respectively represent the expression values of the i-th gene used for constructing the risk score model; and m and n represent the numbers of protective genes and risk genes, respectively, the sum of m and n being equal to the total number of genes in the third candidate genome, here 12.
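As an illustration only, the score described in words above can be sketched in Python. The sketch assumes a plain weighted sum of binarized expression states, with the state assigned as first stated in the text (a protective gene counts as 1 when its expression is above the per-gene median, a risk gene counts as 1 when its expression is below it). The gene names, weights and expression values are invented, and the names `binary_state` and `mirs_scores` are hypothetical.

```python
from statistics import median


def binary_state(value, cutoff, is_protective):
    """Binarize one expression value against the score cutoff value.

    Assumed from the text: protective gene -> 1 if above the cutoff,
    risk gene -> 1 if below the cutoff, otherwise 0.
    """
    if is_protective:
        return 1 if value > cutoff else 0
    return 1 if value < cutoff else 0


def mirs_scores(expr, weights, protective):
    """Weighted sum of binarized states per patient (a sketch, not the
    patented formula). expr maps gene -> list of values, one per patient."""
    genes = list(weights)
    n_patients = len(expr[genes[0]])
    cutoffs = {g: median(expr[g]) for g in genes}  # per-gene median cutoff
    scores = []
    for i in range(n_patients):
        total = 0.0
        for g in genes:
            state = binary_state(expr[g][i], cutoffs[g], g in protective)
            total += weights[g] * state
        scores.append(total)
    return scores


expr = {"G1": [1.0, 2.0, 3.0], "G2": [5.0, 1.0, 3.0]}  # invented data
weights = {"G1": 0.5, "G2": 2.0}                        # invented weights
scores = mirs_scores(expr, weights, protective={"G1"})
```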

In some embodiments, the detailed information of the classes of the 12 genes based on the third candidate genome and the corresponding weights of the risk scoring model is shown in table 1.

TABLE 1 12 candidate genes in the TCGA dataset for calculating risk score

In some embodiments, MIRS is used to predict the survival status of a patient. To assess the predictive performance of the MIRS risk score, a Receiver Operating Characteristic (ROC) curve is generated using the pROC package, and the Area Under the Curve (AUC) is calculated as an indicator of MIRS model performance. The AUC assesses the goodness of fit of models constructed from different prognostic indicators and ranges between 0.5 and 1: the closer the AUC is to 1.0, the higher the reliability of the risk scoring model; the closer it is to 0.5, the lower the reliability. Fig. 17 shows the ROC curve obtained using the TCGA training dataset, with an AUC of 0.875.
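For reference, the AUC can also be computed directly from scores and binary outcomes without drawing the ROC curve, because it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). The following Python sketch uses that pairwise definition; it is not the pROC implementation, and the function name `auc` and the example values are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and
    negatives; labels are 1 (event) or 0 (no event)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The quadratic loop is adequate for cohort-sized data; a rank-based formula gives the same value more efficiently on large inputs.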

Fig. 18 is a flowchart of a method for constructing the MIRS model of fig. 1 according to another embodiment of the present application, further including:

step S430, calculating a risk score of each patient in the data set based on a risk score model;

step S440, determining a group cutoff value according to the risk scores of all patients in the data set, and dividing the patients in the data set into a first high-risk subgroup and a first low-risk subgroup according to the group cutoff value and the risk scores of the patients;

step S450, using the Kaplan-Meier (KM) curve of the data set, the survival distribution difference between the first high risk subgroup and the first low risk subgroup is evaluated.

In some embodiments, the risk score of each patient in the training dataset is calculated according to the MIRS risk score model determined in the above steps, and a risk grouping cutoff value is determined from the risk scores of all patients. It should be noted that the risk grouping cutoff value may be the median or the mean of the risk scores of all patients, and it should be understood that any suitable value may be used. Based on the risk grouping cutoff value, all patients are classified into a first risk subgroup and a second risk subgroup, corresponding to the high-risk type and the low-risk type (the first high-risk subgroup and the first low-risk subgroup of step S440), respectively, according to whether the value of (MIRS of the patient)/(median MIRS of all patients) is greater than 1. Survival analysis is carried out using the survival package in R: OS curves are constructed with the Kaplan-Meier survival curve function and visualized with ggsurvplot. The OS curves generated from the TCGA data set are shown in FIG. 19A and those generated from the GSE20685 cohort in FIG. 19B, from which it can be seen that the OS or disease-free survival (DFS) of patients in the MIRS^low group was significantly longer than that of the MIRS^high group (log-rank p < 0.001). Differences in survival distribution between the risk subgroups were then assessed using a two-sided log-rank test.
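Steps S440 and S450 can be sketched in Python as a median split followed by a Kaplan-Meier estimate for each subgroup. This is a minimal illustration, not the R survival workflow of the embodiment; it assumes a positive median, so that comparing a score with the median is equivalent to testing whether their ratio exceeds 1. The names `split_by_median` and `kaplan_meier` and the example data are hypothetical.

```python
from statistics import median


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival probability) at each
    time where at least one event occurred. events: 1 = death, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


groups = split_by_median([1.0, 2.0, 3.0, 4.0])
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Applying `kaplan_meier` separately to the patients labelled "high" and "low" gives the two curves that a log-rank test would then compare.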

In some embodiments, to further test the robustness and feasibility of the MIRS model, a comprehensive survival analysis was performed in three separate test cohorts using the KM method (corresponding to step S500' of the flow diagram in fig. 5). Notably, MIRS showed good prediction ability in GSE96058, GSE86166 and GSE20685, with AUCs of 0.934, 0.901 and 0.904, respectively, as shown in figures 20A, 20B and 20C. In the survival analysis, consistent with the results on the training data, patients classified into the MIRS^high group had significantly worse overall survival than patients in the MIRS^low group, as shown in figures 20D, 20E and 20F. These analyses indicate that MIRS has superior prognostic power in breast cancer: a higher MIRS score corresponds to a poorer outcome, while a lower MIRS score indicates a better outcome.

In some embodiments, the independence of the candidate genes is assessed based on multivariate Cox proportional hazards regression analysis. In determining candidate genomes, targeted prognostic genes significantly associated with OS were revealed based on univariate Cox proportional risk regression analysis, and risk ratios, 95% confidence intervals, and p-values were evaluated. To assess whether the risk score is an independent prognostic factor compared to other important clinical features, a multivariate Cox proportional hazards regression model was used for statistical test analysis. All statistical tests were considered significant with p values < 0.05.

In some embodiments, to examine how metastasis and immune genes relate to MIRS in breast cancer patients, the correlation between MIRS and immune cells, stromal cells, and tumor purity was studied using ESTIMATE in the GSE86166 cohort. The results show that the proportion of immune cells and stromal cells in the MIRS^low group was high but the tumor purity was low, as shown in fig. 21A. A similar situation is observed in the GSE96058 cohort, as shown in fig. 21B. It is reasonable to infer that, in patients of the MIRS^low group, the higher proportion of immune cells and the lower tumor purity reflect high levels of infiltrating T lymphocytes, consistent with the better outcomes of this group in the preceding survival analyses.

In some embodiments, 730 genes were identified as being associated with the 12 MIRS genes (Spearman correlation coefficient not less than 0.4). Subsequent functional enrichment analysis of these associated genes with Metascape indicated that many immune-related processes and pathways were significantly enriched, including T cell activation, cytokine-cytokine receptor interaction, and the like, as shown in figure 22. This observation revealed a strong correlation of MIRS with immune activity. In another embodiment, ssGSEA was applied using 17 immune-related biological functions and pathways imported from an immune-related database to assess the level of immune infiltration in the test cohort. The results showed that most of the 17 items differed significantly between the MIRS^high and MIRS^low groups, as shown in fig. 23. Notably, in the MIRS^low group, all immune-related biological processes and pathways with significant differences showed higher levels of immune infiltration, consistent with the previous analyses.

In another example, the association of MIRS with three important immune checkpoint molecules (PD-1, PD-L1, and CTLA4) was estimated. As shown in figure 24A, compared with the MIRS^high group, the MIRS^low group showed significantly higher expression (Wilcoxon rank-sum test, p less than 0.05). Similar results were observed in the TCGA and GSE96058 cohorts, as shown in fig. 24B and 24C. The MIRS score was negatively correlated with the expression levels of PD-1, PD-L1, and CTLA4, as shown in figure 25. Overall, the difference in tumor immunogenicity between the MIRS groups was significant: the level of immune infiltration was relatively low in the MIRS^high group and relatively high in the MIRS^low group. This finding suggests that the MIRS^low group may respond better to immune checkpoint blockade therapy.

In some embodiments, to investigate the correlation between MIRS scores and metastatic mechanisms, a Metastatic Breast Cancer (MBC) cohort containing primary and metastatic tumors was downloaded from the Human Cancer Metastasis Database (HCMDB). Functional analysis was performed with GSEA, filtering the results by a Normalized Enrichment Score (NES) greater than 1, a nominal (NOM) p-value less than 0.05, and a False Discovery Rate (FDR) q-value less than 0.25, which yielded 23 qualifying metastasis-associated gene sets. Here the NOM p-value represents the credibility of the enrichment result, and the FDR q-value is the p-value corrected for multiple hypothesis testing. Thereafter, ssGSEA was used to evaluate differences in metastatic functions and pathways between the MIRS^high and MIRS^low groups. Most of the metastasis gene sets differed significantly between the MIRS groups, and for most of them the MIRS^high group had the higher ssGSEA score, as shown in figure 26. A higher ssGSEA score indicates higher activity of the metastatic function or pathway. Similar observations were made in the TCGA and GSE96058 cohorts. In addition, differences were found in the expression of three genes well known to be associated with invasion and metastasis of breast cancer (DCC, MMP9, and ETS1), as shown in fig. 27A, and MIRS was negatively associated with the expression of these genes, as shown in fig. 27B.

In some embodiments, the relationship between intrinsic molecular subtypes and MIRS was studied. For the TCGA cohort, the proportions of intrinsic molecular subtypes in the MIRS^high and MIRS^low groups are not balanced, as shown in FIG. 28A. The MIRS^high group contained 48.09% LumA tumors and 22.5% normal-like tumors, whereas the MIRS^low group contained 32.33% LumB tumors. However, the proportion of basal-like tumors was higher in the MIRS^low group. Existing studies have shown that the number of tumor-infiltrating lymphocytes is highest in the basal-like subtype, which may support the observed enrichment of basal-like tumors in the MIRS^low group. Furthermore, statistically significant differences between the five intrinsic molecular subtypes were detected using the Kruskal-Wallis method. As shown in fig. 28B, MIRS in the normal subtype is significantly lower than in the other molecular subtypes, whereas MIRS in the LumB subtype is the highest. Similar results were found in the METABRIC cohort, as shown in fig. 28C and 28D. These analyses indicate that the MIRS groups do not map cleanly onto the classical molecular subtypes, which may be attributed to the high tumor heterogeneity of breast cancer.

The foregoing analysis indicates that MIRS is highly correlated with the tumor infiltration microenvironment and with tumor metastasis. In some embodiments, the molecular mechanisms by which the 12 genes affect breast cancer prognosis are explored further. Most of these prognosis-related genes have been well characterized and reported to be involved in tumorigenesis or carcinogenesis. It is worth mentioning that APOA5, which encodes an apolipoprotein, has been reported to be associated with cardiovascular disease, but little work has revealed its role in tumorigenesis and cancer prognosis. To describe its potential prognostic role in breast cancer, one example of the present application analyzed the gene expression profiles of 327 breast cancer patients in the GSE20685 cohort. All patients were ranked by APOA5 expression and divided into four quartiles, and GSEA was then performed between the quartiles with the highest and lowest expression, as shown in figure 29. Notably, in the group with high APOA5 expression, significant enrichment of many metastasis- and immune-related pathways was observed, as shown in figures 30A, 30B, and 30C, including epithelial-mesenchymal transition (EMT), tumor necrosis factor alpha (TNF-alpha) signaling via NFKB, and immune response regulatory signaling pathways. Interestingly, many lipid metabolic pathways involved in immune activities were significantly enriched in patients with low APOA5 expression, as shown in figure 29. Indeed, a growing number of studies have shown that energy metabolism, including lipid metabolism, has a significant impact on both the immune and clinical responses of cancer patients. Notably, although the KM curves based on APOA5 expression showed no significant difference between the two groups, as shown in FIG. 31, a subsequent pan-cancer survival analysis based on TCGA cohorts, obtained from the Kaplan-Meier Plotter (https://kmplot.com/analysis/), showed that APOA5 could be a prognostic indicator for a variety of cancers, such as breast cancer, head and neck squamous cell carcinoma, and gastric adenocarcinoma, with the worst survival seen in patients with the highest APOA5 expression, as shown in FIGS. 32A-32C. In contrast, referring to fig. 32D-32F, overexpression of APOA5 was associated with significantly improved survival in patients with renal clear cell carcinoma, lung adenocarcinoma, and lung squamous cell carcinoma. Most importantly, the analysis suggests that APOA5 may exert its prognostic function by participating in lipid metabolism and thereby affecting immune activity in breast cancer, and that it is a potential target for breast cancer treatment.
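The quartile grouping used for the APOA5 comparison can be sketched as follows. The sketch ranks patients by expression and returns the lowest and highest quarters; it assumes that quartile boundaries are taken by simple rank with integer division, which the text does not specify. The function name `extreme_quartiles` and the patient identifiers and values are hypothetical.

```python
def extreme_quartiles(expression):
    """Return (lowest-quartile patients, highest-quartile patients).

    expression maps patient id -> expression value of the gene of interest.
    Quartile size is len // 4 (an assumed convention).
    """
    ranked = sorted(expression, key=expression.get)
    size = len(ranked) // 4
    if size == 0:
        return [], []  # fewer than four patients: no quartiles to compare
    return ranked[:size], ranked[-size:]


apoa5 = {"p1": 5.0, "p2": 1.0, "p3": 9.0, "p4": 3.0,
         "p5": 7.0, "p6": 2.0, "p7": 8.0, "p8": 4.0}  # invented values
low_group, high_group = extreme_quartiles(apoa5)
```

The two returned groups correspond to the samples that would be contrasted in the enrichment analysis.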

Genomic mutations are closely associated with survival prognosis in various cancers. In some embodiments, the association between somatic mutations and MIRS is tested in the breast cancer TCGA data. In line with the relevant analyses in existing studies, only genes with a somatic mutation frequency of more than 2.5% were included. By analyzing the mutation annotation of the TCGA BRCA cohort, the examples of the present application select the 10 genes with the highest mutation frequency in each MIRS group. As shown in FIGS. 33A and 33B, the frequency of mutation events in the MIRS^low group is higher than in the MIRS^high group. Related studies indicate that patients with more mutations are likely to have more neoantigens, which enhance the response to immunotherapy. In the examples of this application, this result may explain why the MIRS^low group has a better prognostic outcome than the MIRS^high group.

Recently, Tumor Mutational Burden (TMB) has emerged as an important prognostic indicator in cancer survival. In another embodiment of the present application, the link between MIRS and TMB was further investigated. As shown in FIG. 34, compared with the MIRS^high group, patients in the MIRS^low group showed significantly higher TMB. Related studies indicate that high TMB is associated with increased survival and with improved response to PD-1 blocking therapy. Correlation analysis showed that the MIRS score was negatively correlated with TMB (Spearman coefficient: R = -0.1, p = 0.0011, as shown in figure 35). These findings indicate that MIRS has prognostic and therapeutic value, particularly in immunotherapy. Patients with lower MIRS may respond better to immunotherapy, consistent with the previous analysis, as shown in figure 23.

In another embodiment of the present application, to further validate the therapeutic value of MIRS, its predictive potential was examined from the perspectives of chemotherapy and immunotherapy. The cohort comprised the breast cancer patients in GSE20685 who received adjuvant chemotherapy. The cut point for MIRS was set at the median; it should be understood that other suitable values may be selected, such as the mean MIRS of all patients. The patients were then stratified into a MIRS^high group and a MIRS^low group. Survival analysis showed that, under adjuvant chemotherapy, MIRS^low breast cancer patients had better survival than MIRS^high breast cancer patients, as shown in figure 36A. The prognosis of each MIRS subtype with or without adjuvant chemotherapy was also investigated according to an embodiment of the present application. As shown in FIG. 36B, within the MIRS^high group there was a statistically significant difference between patients receiving adjuvant chemotherapy and patients not receiving it. However, no such difference was observed among MIRS^low patients, as shown in fig. 36C. These results indicate that adjuvant chemotherapy may be more beneficial for MIRS^high patients. GSEA based on gene sets for different drug treatments retrieved from the MSigDB database showed that, in the TCGA cohort, MIRS^high was significantly correlated with drug sensitivity, as shown in fig. 37A, 37B and 37C. The pRRophetic R package was used to assess sensitivity to four chemotherapeutic drugs (imatinib, platinum compounds, gemcitabine and vinblastine) that have been commonly used for breast cancer treatment. The results show that the estimated IC50 values for these drugs are significantly lower in the MIRS^high subtype, as shown in FIG. 38A, 38B and 38C.

From an immunotherapy perspective, TIS is a good predictor of the clinical response to pembrolizumab in various tumor types. In one example of the present application, the effectiveness of MIRS in predicting the immunotherapy response of breast cancer patients was demonstrated. In the GSE20711 cohort, the TIS scores of MIRS^low patients were significantly higher than those of MIRS^high patients, as shown in FIG. 39. These box plots suggest that the MIRS^low group is associated with response to immune checkpoint inhibitors. To further evaluate the prognostic ability of MIRS in immunotherapy, KM survival analysis was used to compare overall survival between the MIRS^high group and the MIRS^low group in a test cohort. Unfortunately, to date there are few published datasets for breast cancer patients receiving immunotherapy. Instead, the melanoma data of Liu et al. and the TCGA-SKCM cohort receiving immunotherapy were analyzed in the examples of this application. In these cohorts, MIRS showed a robust AUC compared with the PD-1 biomarker in patients who received anti-PD-1 treatment, as shown in FIGS. 40A, 40B and 40C. Furthermore, the overall survival of MIRS^high patients was significantly shorter than that of MIRS^low patients, as shown in FIGS. 41A and 41B. MIRS values were significantly higher in patients with Stable Disease (SD) or Progressive Disease (PD) than in patients with Complete Response (CR) or Partial Response (PR), as shown in FIGS. 42A and 42B. In addition, the distribution of CR/PR and SD/PD in the MIRS^high group and the MIRS^low group was also examined. As shown in FIGS. 43A and 43B, patients in the MIRS^low group responded better to immunotherapy than those in the MIRS^high group.
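For readers unfamiliar with the KM survival analysis used above, the following is a minimal sketch of the Kaplan-Meier product-limit estimator for a single group. The function name `kaplan_meier` and the follow-up data are hypothetical; comparing two groups would additionally require a test such as the log-rank test, which is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up data for one group, for illustration only.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Running the estimator separately on the MIRS^high and MIRS^low groups yields the two curves that a KM plot compares.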

These results indicate that MIRS has great potential in predicting the response of breast cancer patients to chemotherapy and immunotherapy. In summary, MIRS^high patients may benefit from chemotherapy, whereas MIRS^low patients may be more sensitive to immunotherapy.

Referring now to FIGS. 44A-46, the prognostic risk assessment ability of the MIRS model is evaluated by comparing MIRS with previous predictive models.

Prior to the creation of MIRS, Shimizu et al. demonstrated, in "A 23 gene-based molecular prognostic score precisely predicts overall survival of breast cancer patients", that a 23-gene panel (mPS) built on a neural network model helps predict OS in breast cancer patients; Chi et al., in their work on a prognostic gene expression signature for patients with breast cancer receiving adjuvant chemotherapy, constructed an 8-gene signature score based on the traditional Lasso regression (Lasso Cox) model. In an embodiment of the present application, the predictive power of MIRS, mPS, and the score of Chi et al. was evaluated comprehensively by regression analysis on various public datasets. As shown in FIGS. 44A, 44B and 44C, MIRS performs very well in different cohorts. Although mPS appears to be more robust than MIRS in many cohorts, some of the hazard ratios (HR) for mPS are not significant (P values > 0.05). Among these models, the score of Chi et al. performs the worst.

In addition, the predictive potential of these three models for immunotherapy response was studied in one example of the present application using the malignant melanoma cohort treated with anti-PD-1. The cut points for the score of Chi et al. and for mPS were set at the median. KM survival curves for MIRS show a significant difference in overall survival between the MIRS^high group and the MIRS^low group, as shown in FIG. 45A. In contrast, survival analysis of mPS and the score of Chi et al. shows no statistically significant difference between patients with low and high values of either score, as shown in FIGS. 45B and 45C. MIRS, mPS, and the score of Chi et al. were also included in a time-dependent ROC analysis in the test cohort for predicting immunotherapy efficacy. Notably, MIRS has better predictive power than mPS and the score of Chi et al. for OS at 1 year, 1.5 years, and 2 years, as shown in FIG. 46.
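A time-dependent ROC analysis asks, for a given horizon such as 1 year, how well a score separates patients who have died by that time from those still alive. The sketch below is a deliberately simplified version, not the method used in the application: patients censored before the horizon are dropped rather than reweighted, as dedicated time-dependent ROC methods would do. The function name `auc_at_horizon` and all values are hypothetical.

```python
def auc_at_horizon(scores, times, events, horizon):
    """Naive AUC for predicting death by `horizon` from a risk score.

    Cases died at or before the horizon; controls were followed past it.
    Patients censored before the horizon are dropped (a simplification).
    """
    cases = [s for s, t, e in zip(scores, times, events) if e == 1 and t <= horizon]
    controls = [s for s, t in zip(scores, times) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # AUC = probability that a random case scores higher than a random control,
    # counting ties as half a win.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical risk scores and follow-up (in years), for illustration only.
scores = [0.9, 0.5, 0.3, 0.4, 0.2, 0.6]
times = [0.5, 0.8, 2.5, 3.0, 0.7, 1.5]
events = [1, 1, 0, 1, 0, 1]
auc_1y = auc_at_horizon(scores, times, events, horizon=1.0)
```

Evaluating the same function at horizons of 1, 1.5, and 2 years for each score gives a rough analogue of the comparison described above.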

FIG. 47 shows an application method of the risk scoring model provided according to an embodiment of the present application. The application method specifically comprises the following steps:

step S500, acquiring gene expression data of a patient, wherein the gene expression data of the patient comprises expression data of the genes used to construct the risk score model, which are related to immune infiltration and involved in the metastasis mechanism;

step S600, calculating the risk score of the patient according to the risk score model based on the gene expression data of the patient.
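Steps S500 and S600 can be illustrated with a short sketch. The scoring formula itself is not reproduced in this section, so the code assumes, purely for illustration, that the risk score is a weighted sum of the expression values of the model genes. The function name `risk_score`, the gene names, and the weights are hypothetical placeholders.

```python
def risk_score(expression, weights):
    """Compute a risk score as a weighted sum of gene expression values.

    expression -- dict mapping gene name to the patient's expression value
    weights    -- dict mapping each model gene to its weight
    The weighted-sum form is an assumption made for this sketch.
    """
    missing = [g for g in weights if g not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    # Genes outside the model (not present in `weights`) are ignored.
    return sum(weights[g] * expression[g] for g in weights)


# Hypothetical model genes, weights, and patient data, for illustration only.
weights = {"GENE_A": 0.5, "GENE_B": -0.25, "GENE_C": 1.0}
patient = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 0.5, "GENE_X": 9.9}
score = risk_score(patient, weights)
```

Checking for missing genes before scoring makes it explicit that the patient data acquired in step S500 must cover every gene used by the model.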

In some embodiments, the patient's gene expression data is input into the breast cancer prognosis risk scoring model constructed as described above, so that the patient's risk score is calculated automatically, allowing the patient's prognosis to be assessed accurately or guidance to be provided for the patient's subsequent treatment.

The embodiments of the application aim to construct a breast cancer prognosis risk scoring model based on immune infiltration and metastasis characteristics. By integrating metastasis-associated immune-related gene signatures, breast cancer patients are divided into a high-risk group and a low-risk group, so that the prognosis of patients can be evaluated more accurately and potential treatment strategies can be guided. Gene weights are estimated by constructing a neural network model, which shows outstanding performance in binary classification, and a metastasis and immune gene risk score is then established, which predicts survival status significantly better than the single-feature indicators in the prior art. In addition, the ability of MIRS to predict treatment response was also confirmed, suggesting its potential to guide therapeutic strategies for breast cancer.

One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

Computer-readable media may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media, as known to those skilled in the art.

While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for constructing a breast cancer prognosis model is characterized by comprising the following steps:

acquiring a breast cancer training dataset and preprocessing data of each patient in the breast cancer training dataset, wherein the data of each patient comprises a plurality of features;

based on the plurality of features, dividing the breast cancer training dataset into different groupings, the different groupings determined based on infiltration levels of different immune cells;

analyzing the differentially expressed genes among the different groupings, and filtering to obtain candidate genes, wherein the differentially expressed genes comprise differentially expressed genes related to immune infiltration and differentially expressed genes involved in a metastasis mechanism;

and constructing a risk scoring model based on the candidate genes.

2. The construction method according to claim 1, wherein the acquiring a breast cancer training dataset and preprocessing data of each patient in the breast cancer training dataset comprises:

acquiring original matrix data of the breast cancer training data set, wherein row data of the original matrix data represent different probe sets, and column data represent different patients;

respectively obtaining the maximum expression value of each probe set, and screening the repetitive genes detected by the probe sets;

and standardizing the screened gene expression data.

3. The construction method according to claim 2, wherein the dividing the breast cancer training data set into different groups based on the plurality of features comprises:

quantifying data in each feature of the gene expression data separately using a single sample gene set enrichment analysis;

based on the quantified results, the breast cancer training dataset is divided into a first immune-infiltrated group and a second immune-infiltrated group using a clustering method, respectively.

4. The method of constructing according to claim 3, wherein the filtering candidate genes based on the differentially expressed genes between the different groupings comprises:

analyzing differentially expressed genes related to immune infiltration and involved in a metastasis mechanism from the first immune infiltration group and the second immune infiltration group respectively, based on selection criteria, by using a Wilcoxon rank sum test to obtain a first candidate genome;

screening a second candidate genome from the first candidate genome based on a univariate Cox proportional hazards regression model;

obtaining a third candidate genome by performing correlation analysis on the second candidate genome;

wherein the correlation of genes within the first candidate genome, the second candidate genome, and the third candidate genome to the overall survival of the patient increases in order.

5. The method of constructing according to claim 4, wherein constructing a risk score model based on the candidate genes comprises:

constructing a multilayer perceptron neural network based on the third candidate genome, wherein the multilayer perceptron neural network is used for optimizing the weight of each candidate gene in the third candidate genome, and the optimized weight is used as the maximum weight of a hidden layer of the neural network;

determining the risk scoring model.

6. The method of construction of claim 5, wherein the risk scoring model is represented as:

7. The construction method according to claim 6, further comprising:

calculating a risk score for each patient in the breast cancer training dataset based on the risk score model;

determining a group cutoff value from the risk scores of all patients in the breast cancer training dataset and dividing the patients in the breast cancer training dataset into a first high risk subgroup and a first low risk subgroup according to the group cutoff value and the risk scores of the patients;

evaluating a difference in survival distribution between the first high-risk subgroup and the first low-risk subgroup using a Kaplan-Meier curve of the breast cancer training dataset.

8. The construction method according to claim 7, further comprising:

evaluating the independence of the genes in the third candidate genome based on multivariate Cox proportional hazards regression analysis.

9. The method of constructing according to claim 8, wherein the constructing a risk score model based on the candidate genes further comprises:

acquiring a breast cancer verification dataset;

validating performance of the risk scoring model based on the breast cancer validation dataset.

10. A method for applying a breast cancer prognosis model, wherein the breast cancer prognosis model comprises a risk score model obtained by the construction method according to any one of claims 1-9, and the application method comprises the following steps:

obtaining gene expression data of a patient, wherein the gene expression data of the patient comprises gene expression data related to immune infiltration and involved in metastasis mechanisms for constructing the risk score model;

calculating a risk score for the patient according to the risk score model based on the gene expression data for the patient.

11. An electronic device for breast cancer prognosis, comprising:

a memory for storing a program; and

a processor for executing the memory-stored program, the processor being configured to perform the construction method of any one of claims 1 to 9 or the application method of claim 10 when the processor executes the memory-stored program.

12. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute instructions of the construction method according to any one of claims 1 to 9 or the application method according to claim 10.