CN116564409A - Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer

Info

Publication number: CN116564409A
Application number: CN202310505357.4A
Authority: CN
Inventors: 张子龙; 段昊; 崔菲菲; 李兴风; 张清辰
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2023-05-06
Filing date: 2023-05-06
Publication date: 2023-08-08

Abstract

The invention provides a machine learning-based method for identifying metastatic breast cancer transcriptome sequencing data. It relates to the technical field of bioinformatics and aims to solve the problem that, in the prior art, metastatic breast cancer transcriptome sequencing data are identified with low accuracy. With this technical scheme, metastatic breast cancer transcriptome sequencing data can be accurately identified within breast cancer transcriptome sequencing data.

Description

Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to a machine learning-based identification method for sequencing data of a transcriptome of metastatic breast cancer.

Background

Breast cancer is often referred to as the "pink killer", and its incidence ranks first among malignant tumors in women. Identifying metastatic breast cancer transcriptome sequencing data means picking out the metastatic samples from breast cancer transcriptome sequencing data. This provides data-level technical support for research on breast cancer metastasis and thereby helps advance that research.

The prior art is limited by the complexity of the transcriptome sequencing data itself, so the recognition accuracy of transcriptome sequencing data for metastatic breast cancer is very low.

Disclosure of Invention

The purpose of the invention is to provide a machine learning-based identification method for transcriptome sequencing data of metastatic breast cancer, addressing the low accuracy with which such data are identified in the prior art.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a machine learning based method for identifying transcriptome sequencing data of metastatic breast cancer, comprising the steps of:

acquiring transcriptome sequencing data of a breast cancer patient;

acquiring breast cancer metastasis key gene expression data from transcriptome sequencing data of a breast cancer patient;

training a plurality of different classification models by using breast cancer metastasis key gene expression data, and selecting the classification model with highest classification precision as a metastatic breast cancer transcriptome sequencing data identification model;

utilizing a metastatic breast cancer transcriptome sequencing data identification model to identify metastatic breast cancer transcriptome sequencing data;

the specific steps for acquiring the breast cancer metastasis key gene expression data from the transcriptome sequencing data of the breast cancer patient are as follows:

step one: performing differential analysis on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis differential gene set;

step two: performing WGCNA (weighted gene co-expression network analysis) on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis hub gene set;

step three: extracting common genes in the breast cancer metastasis difference gene set and the breast cancer metastasis hub gene set;

step four: screening the common genes by LASSO regression analysis to obtain breast cancer metastasis key genes, and looking up the key genes in the transcriptome sequencing data of the breast cancer patient to obtain breast cancer metastasis key gene expression data.

Further, the step of obtaining transcriptome sequencing data of the breast cancer patient specifically comprises the following steps:

firstly, acquiring a GSE data set of breast cancer metastasis;

secondly, extracting transcriptome sequencing data, clinical characteristics and GEO chip platform numbers of the breast cancer patients from the acquired GSE data set in RStudio, and acquiring gene names corresponding to each gene probe in the chip according to the GEO chip platform numbers.

Further, the specific steps of the differential analysis are as follows:

performing differential analysis on the transcriptome sequencing data of the breast cancer patient using the limma package in RStudio; adding a column of gene names to the differential analysis result, based on the gene name obtained for each gene probe, so that the expression of the gene detected by each probe can be determined; and then screening the differential analysis result with P-value < 0.05 and |logFC| > 0.5 as the criteria to obtain the differential genes of breast cancer metastasis.
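The screening criterion can be illustrated with a minimal sketch. The record layout (`gene`, `p_value`, `log_fc`) and the example rows are hypothetical stand-ins for a limma result table; only the two cutoffs come from the text.

```python
from typing import NamedTuple


class DiffResult(NamedTuple):
    """One row of a differential analysis result (hypothetical layout)."""
    gene: str
    p_value: float
    log_fc: float


def screen_differential_genes(results, p_cutoff=0.05, log_fc_cutoff=0.5):
    """Keep genes with P-value < p_cutoff and |logFC| > log_fc_cutoff."""
    return [
        row.gene
        for row in results
        if row.p_value < p_cutoff and abs(row.log_fc) > log_fc_cutoff
    ]


# Made-up rows, not data from the GSE data sets.
example_results = [
    DiffResult("GENE_A", 0.001, 1.20),   # passes both cutoffs
    DiffResult("GENE_B", 0.030, -0.75),  # passes: the sign of logFC is ignored
    DiffResult("GENE_C", 0.200, 2.00),   # fails the P-value cutoff
    DiffResult("GENE_D", 0.010, 0.30),   # fails the |logFC| cutoff
]
selected = screen_differential_genes(example_results)
```

Both conditions must hold at once, so a gene with a large fold change but a weak P-value is dropped.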

Further, the specific steps of the WGCNA are as follows:

step 2.1: clustering all samples to obtain a sample cluster tree and identify outlier samples, and then setting a cutHeight value according to the position of the outlier samples in the sample cluster tree so as to remove the outliers and obtain the remaining samples;

step 2.2: based on the remaining samples, evaluating candidate soft thresholds using the soft threshold calculation function in the WGCNA package, plotting the scale-free topology fit index and the mean connectivity against the soft threshold, and selecting the optimal soft threshold as the one at which the scale-free topology fit index exceeds 0.9 and the mean connectivity levels off;

step 2.3: constructing a scale-free network using the optimal soft threshold to obtain a hierarchical clustering dendrogram with identified modules, correlating the modules with clinical characteristics to obtain the correlation coefficient between each module and breast cancer metastasis, and selecting the modules with the highest and second-highest correlation coefficients;

step 2.4: extracting the genes in the modules with the highest and second-highest correlation coefficients to obtain the breast cancer metastasis hub genes.
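The soft threshold selection in the steps above can be sketched as follows. This is an illustration under assumptions: it takes the smallest candidate power whose scale-free topology fit index exceeds 0.9, and it does not model the levelling-off of the mean connectivity, which the text judges from a plot. The function name and the table values are hypothetical.

```python
def pick_soft_threshold(candidates, min_fit=0.9):
    """Return the smallest power whose scale-free fit index exceeds min_fit.

    candidates: iterable of (power, fit_index, mean_connectivity) tuples.
    Returns None when no candidate reaches the required fit.
    """
    eligible = [power for power, fit, _ in candidates if fit > min_fit]
    return min(eligible) if eligible else None


# Made-up values for illustration only: (power, fit index, mean connectivity).
table = [
    (1, 0.20, 950.0),
    (2, 0.55, 410.0),
    (3, 0.78, 190.0),
    (4, 0.88, 95.0),
    (5, 0.92, 52.0),
    (6, 0.94, 30.0),
]
best_power = pick_soft_threshold(table)
```

Preferring the smallest qualifying power is a common convention because larger powers reduce connectivity further; the text itself only states the two criteria.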

Further, the specific steps of screening the common genes by using LASSO regression analysis are as follows:

performing LASSO regression analysis on the common genes using the glmnet package in RStudio, and selecting the genes retained at the lambda value for which the mean square error of the LASSO model is smallest; these genes are the breast cancer metastasis key genes.

Further, the plurality of different classification models includes: logistic regression models, random forest models, support vector machine models, GBDT models, and XGBoost models.

Further, the specific steps of training a plurality of different classification models by using breast cancer metastasis key gene expression data are as follows:

based on grid search and five-fold cross-validation, performing classification training and hyperparameter optimization on each of the classification models by searching all hyperparameter combinations of the model within the parameter space.

Further, the identification model with the highest classification accuracy is selected by evaluating the classification effect of each model, with the following specific steps:

first, the model with the largest F1-Score among the classification models is taken as the optimal model; if the F1-Score values are the same, the model with the largest Accuracy is taken as the optimal model; if the Accuracy values are also the same, the AUC values are compared and the model with the largest AUC is taken as the optimal model; if the F1-Score, Accuracy and AUC values are all the same, the model is selected according to the priority order XGBoost model > GBDT model > support vector machine model > random forest model > logistic regression model.
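The selection cascade amounts to a lexicographic comparison, which the following minimal sketch illustrates. The model labels, the function name and the metric values are hypothetical; ties are detected by exact equality of the stored values.

```python
# Preference order used only when F1-Score, Accuracy and AUC are all tied.
PRIORITY = ["XGBoost", "GBDT", "SVM", "RandomForest", "LogisticRegression"]


def select_best_model(metrics):
    """Pick the best model name from {name: (f1, accuracy, auc)}.

    Tuples compare element by element, so F1-Score decides first, then
    Accuracy, then AUC, and finally the fixed priority order.
    """
    def sort_key(name):
        f1, accuracy, auc = metrics[name]
        # A lower index in PRIORITY means higher preference, so negate it.
        return (f1, accuracy, auc, -PRIORITY.index(name))

    return max(metrics, key=sort_key)


# Made-up scores in which SVM and XGBoost tie on all three metrics.
scores = {
    "LogisticRegression": (0.80, 0.82, 0.85),
    "RandomForest": (0.84, 0.85, 0.88),
    "SVM": (0.86, 0.87, 0.90),
    "GBDT": (0.84, 0.86, 0.89),
    "XGBoost": (0.86, 0.87, 0.90),
}
best = select_best_model(scores)
```

In this made-up example the full tie between the support vector machine and XGBoost is resolved by the priority order.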

Further, the F1-Score and Accuracy are expressed as:

where ACC denotes Accuracy, TP is the number of metastatic breast cancer transcriptome sequencing data samples identified as metastatic, FP is the number of non-metastatic breast cancer transcriptome sequencing data samples identified as metastatic, TN is the number of non-metastatic samples identified as non-metastatic, and FN is the number of metastatic samples identified as non-metastatic.
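In their standard form, and using the counts TP, FP, TN and FN defined above, the two metrics can be written as follows. Precision and Recall are intermediate quantities introduced here only to express the F1-Score.

```latex
\begin{align}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{Precision} &= \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN} \\
\mathrm{F1\text{-}Score} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
                                {\mathrm{Precision} + \mathrm{Recall}}
                          = \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```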

Further, the GSE dataset is a GSE9893 dataset and a GSE43837 dataset.

The beneficial effects of the invention are as follows:

Each trained classification model is used to construct an identification classifier for metastatic breast cancer transcriptome sequencing data. The expression data of the breast cancer metastasis key genes in the breast cancer transcriptome sequencing data to be identified are input into the classifier to obtain a classification result, which completes the identification. With this technical scheme, metastatic breast cancer transcriptome sequencing data can be accurately identified within breast cancer transcriptome sequencing data.

The application identifies whether breast cancer transcriptome sequencing data are metastatic breast cancer transcriptome sequencing data. The breast cancer metastasis key genes are obtained by jointly screening genes with differential analysis, WGCNA and LASSO regression analysis, which effectively improves the identification accuracy of metastatic breast cancer transcriptome sequencing data.

Drawings

FIG. 1 is an overall flow chart of the present application;

FIG. 2 is a graph showing the results of differential analysis under different data sets;

FIG. 3 is a schematic diagram of WGCNA results;

FIG. 4 is a graph showing the mean square error of the LASSO model at different lambda values;

FIG. 5 is a schematic diagram showing comparison of recognition effects of various classification models;

FIG. 6 is a schematic structural diagram of a device for identifying transcriptome sequencing data of metastatic breast cancer.

Detailed Description

It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.

Embodiment 1: referring to FIG. 1, the machine learning-based method for identifying transcriptome sequencing data of metastatic breast cancer according to this embodiment includes the following steps:

acquiring transcriptome sequencing data of a breast cancer patient;

Embodiment 2: this embodiment further describes Embodiment 1 and differs from it in that the step of obtaining transcriptome sequencing data of a breast cancer patient specifically includes:

firstly, acquiring a GSE data set of breast cancer metastasis;

Embodiment 3: this embodiment further describes Embodiment 2 and differs from it in that the specific steps of the differential analysis are as follows:

Embodiment 4: this embodiment further describes Embodiment 3 and differs from it in that the specific steps of the WGCNA are as follows:

Embodiment 5: this embodiment further describes Embodiment 4 and differs from it in that the specific steps of screening the common genes by LASSO regression analysis are as follows:

Embodiment 6: this embodiment further describes Embodiment 5 and differs from it in that the plurality of different classification models includes: logistic regression models, random forest models, support vector machine models, GBDT models, and XGBoost models.

Embodiment 7: this embodiment further describes Embodiment 6 and differs from it in that training the plurality of different classification models using breast cancer metastasis key gene expression data includes the following specific steps:

Embodiment 8: this embodiment further describes Embodiment 6 and differs from it in that the metastatic breast cancer transcriptome sequencing data identification model with the highest classification accuracy is selected by evaluating the classification effect of each model, with the following specific steps:

Embodiment 9: this embodiment further describes Embodiment 8 and differs from it in that the F1-Score and Accuracy are expressed as:

Embodiment 10: this embodiment further describes Embodiment 9 and differs from it in that the GSE data set is a GSE9893 data set and a GSE43837 data set.

As an embodiment of the present application, as shown in fig. 1, the method includes the following steps:

S101, acquiring transcriptome sequencing data and clinical information of a breast cancer patient.

The transcriptome sequencing data are expression levels of a plurality of genes measured from a tissue sample of the breast cancer patient, and the clinical information is information such as the patient's physical condition and breast cancer progression during treatment.

In some alternative embodiments, the transcriptome sequencing data and clinical information of breast cancer patients come from two data sets in the GEO database: the GSE9893 data set (155 samples in total, of which 48 are metastatic breast cancer transcriptome sequencing data samples and 107 are non-metastatic; 22656 genes measured) and the GSE43837 data set (38 samples in total, of which 19 are metastatic and 19 are non-metastatic; 61359 genes measured).

S102, primarily screening breast cancer metastasis key genes based on difference analysis and WGCNA.

Step S102 includes the following substeps S1021 to S1023.

S1021, performing differential analysis on the transcriptome sequencing data of the breast cancer patient to obtain a breast cancer metastasis differential gene set.

In some alternative embodiments, the transcriptome sequencing data and clinical information of the breast cancer patients are preprocessed in RStudio as follows: the expression matrix and clinical information are extracted from the two data sets, the patients are grouped according to whether the clinical information records breast cancer metastasis, and the gene probes in the expression matrix are annotated.

In some alternative embodiments, differential analysis is performed on the transcriptome sequencing data of the breast cancer patients in RStudio using the limma package, and differential genes related to distant metastasis of breast cancer are initially screened using P-value < 0.05 and |logFC| > 0.5 as the screening criteria. The differential analysis results are shown in FIG. 2: 6188 differential genes were screened from the GSE9893 dataset and 2122 differential genes were screened from the GSE43837 dataset.

S1022, performing WGCNA on the transcriptome sequencing data of the breast cancer patient to obtain a breast cancer metastasis hub gene set.

In some alternative embodiments, all samples of the GSE9893 dataset are clustered to check for outlier samples, and cutHeight is set to 150 as the criterion for removing them. After the outliers are removed, the sample cluster tree is rebuilt from the remaining samples, and the association between the phenotype data and the samples is visualized.

In some alternative embodiments, the optimal soft threshold is determined to be 6 based on how the scale-free topology fit index and the mean connectivity vary with the soft threshold. A scale-free network is then constructed with the soft threshold of 6 to obtain a hierarchical clustering dendrogram with identified modules, and the modules are correlated with clinical characteristics to obtain the correlation coefficient between each module and the phenotype data (whether the sample is metastatic breast cancer transcriptome sequencing data). The WGCNA results, i.e. these correlation coefficients, are shown in FIG. 3. The yellow-green module and the blue module have relatively high correlation coefficients with the phenotype data, and the genes of these two modules (3404 genes in total) are taken as breast cancer metastasis hub genes for further analysis and screening.

S1023, extracting common genes in the breast cancer metastasis differential gene set and the breast cancer metastasis hub gene set.

In some alternative embodiments, the 114 genes common to the 6188 differential genes of the GSE9893 dataset, the 2122 differential genes of the GSE43837 dataset and the 3404 breast cancer metastasis hub genes are extracted for further analysis and screening.
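Extracting the common genes is a set intersection, which the following minimal sketch shows. The function name and the placeholder gene identifiers are hypothetical and do not come from the data sets.

```python
def common_genes(*gene_sets):
    """Return, in sorted order, the genes present in every input collection."""
    if not gene_sets:
        return []
    shared = set(gene_sets[0])
    for genes in gene_sets[1:]:
        shared &= set(genes)
    return sorted(shared)


# Placeholder identifiers standing in for the three gene sets.
diff_genes_first = {"G1", "G2", "G3", "G4"}   # differential genes, data set 1
diff_genes_second = {"G2", "G3", "G5"}        # differential genes, data set 2
hub_genes = {"G2", "G3", "G6"}                # WGCNA hub genes
shared = common_genes(diff_genes_first, diff_genes_second, hub_genes)
```

Sorting the result only makes the output order reproducible; membership is what matters for the later screening.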

S103, carrying out further screening on the breast cancer metastasis key genes based on LASSO regression analysis.

In some alternative embodiments, LASSO regression analysis is performed on the 114 common genes in RStudio using the glmnet package. The absolute values of the coefficients of these genes in the LASSO model decrease continuously as the lambda value increases, eventually reaching 0 (at which point the gene no longer contributes to the model). It is therefore necessary to determine the lambda value at which the LASSO model performs best, and from it the genes that actually contribute to the model. The mean square error of the LASSO model at different lambda values is shown in FIG. 4. The lambda value with the smallest mean square error is selected, and the 21 genes with stronger predictive ability retained at that lambda value, such as ENPP2, are taken as the breast cancer metastasis key genes.
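The selection logic described here can be sketched independently of glmnet. The sketch assumes the cross-validated error and the coefficient path are already available as plain dictionaries; that layout, the function name and all numbers are hypothetical.

```python
def select_key_genes(cv_mse, coef_path):
    """Choose the lambda with the smallest error and the genes it retains.

    cv_mse: dict mapping lambda -> mean cross-validated error.
    coef_path: dict mapping lambda -> {gene: coefficient}.
    Returns (best_lambda, sorted list of genes with non-zero coefficient).
    """
    best_lambda = min(cv_mse, key=cv_mse.get)
    kept = sorted(
        gene for gene, coef in coef_path[best_lambda].items() if coef != 0.0
    )
    return best_lambda, kept


# Made-up error curve and coefficient path for three placeholder genes.
cv_mse = {0.100: 0.31, 0.050: 0.24, 0.015: 0.19, 0.005: 0.22}
coef_path = {
    0.100: {"G1": 0.0, "G2": 0.0, "G3": 0.12},
    0.050: {"G1": 0.0, "G2": -0.05, "G3": 0.20},
    0.015: {"G1": 0.0, "G2": -0.11, "G3": 0.27},
    0.005: {"G1": 0.03, "G2": -0.15, "G3": 0.30},
}
best_lambda, key_genes = select_key_genes(cv_mse, coef_path)
```

In glmnet, the lambda with the smallest cross-validated error is exposed as `lambda.min` on a `cv.glmnet` fit; whether the described method reads it that way is not stated in the text.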

S104, constructing a metastatic breast cancer transcriptome sequencing data identification model based on the breast cancer metastasis key genes.

The candidate models are a logistic regression model, a random forest model, a support vector machine model, a GBDT model and an XGBoost model. These models are trained on the training set, i.e. 80% of the GSE9893 data set selected at random, to obtain trained classification models. Step S104 specifically includes:

based on grid search and five-fold cross-validation, performing classification training and hyperparameter optimization on the logistic regression model, the random forest model, the support vector machine model, the GBDT model and the XGBoost model by searching all hyperparameter combinations of each model within the parameter space.
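An exhaustive search over hyperparameter combinations with five-fold cross-validation can be sketched in plain Python as below. This is an illustration, not the training code of the described method: the `fit`/`predict` callables, the toy threshold model and the use of accuracy as the cross-validation score are all assumptions.

```python
import itertools
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search_cv(fit, predict, param_grid, X, y, k=5):
    """Evaluate every hyperparameter combination with k-fold cross-validation.

    fit(X_train, y_train, **params) -> model; predict(model, X) -> labels.
    Returns (best_params, best_mean_accuracy).
    """
    folds = k_fold_indices(len(X), k)
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            preds = predict(model, [X[i] for i in held_out])
            correct = sum(p == y[i] for p, i in zip(preds, held_out))
            fold_scores.append(correct / len(held_out))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy stand-in model: its only "hyperparameter" is a decision threshold,
# and fitting ignores the training data.
def fit_threshold(X_train, y_train, threshold):
    return threshold


def predict_threshold(model, X):
    return [1 if row[0] > model else 0 for row in X]


X = [[v] for v in range(20)]             # one made-up feature per sample
y = [1 if v >= 10 else 0 for v in range(20)]
best_params, best_score = grid_search_cv(
    fit_threshold, predict_threshold, {"threshold": [4.5, 9.5, 14.5]}, X, y
)
```

In practice each candidate model would replace the toy threshold classifier, with its own parameter grid, while the folds stay fixed so that all combinations are scored on the same splits.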

The classification effect is then evaluated.

In some alternative embodiments, the indices for evaluating the classification effect include ACC, F1-Score and AUC, which are calculated as follows:

where TP represents the number of metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, FP represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, TN represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data, and FN represents the number of metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data.

F1-Score is a statistical indicator used to measure the accuracy of binary classification models. It takes both the precision and the recall of the classification model into account and can be regarded as the harmonic mean of the model's precision and recall.
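A minimal sketch of ACC and F1-Score computed from the four counts follows, using the standard definitions. The function names and the example counts are hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up confusion counts for illustration.
tp, fp, tn, fn = 8, 2, 18, 3
acc = accuracy(tp, fp, tn, fn)   # (8 + 18) / 31
f1 = f1_score(tp, fp, fn)        # equals 2*TP / (2*TP + FP + FN)
```

Unlike ACC, the F1-Score does not use TN, which makes it less sensitive to a large non-metastatic majority class.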

AUC is an evaluation index for measuring the quality of binary classification models. It represents the probability that a randomly chosen positive sample is ranked ahead of a randomly chosen negative sample by the model's predicted score. The AUC value is the area under the ROC curve.
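The ranking interpretation of AUC can be computed directly by comparing every positive sample with every negative sample, as in this minimal sketch. The function name and the example labels and scores are hypothetical; tied scores are counted as half a win, which is the usual convention.

```python
def auc_by_ranking(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for positive (metastatic), 0 for negative.
    scores: predicted scores, higher meaning "more likely positive".
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: one positive (score 0.6) is ranked below one negative (0.7).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_by_ranking(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for data sets of the size described here.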

S105, acquiring breast cancer metastasis key gene expression data in breast cancer transcriptome sequencing data to be identified;

S106, identifying the metastatic breast cancer transcriptome sequencing data in the breast cancer transcriptome sequencing data based on the identification model of the metastatic breast cancer transcriptome sequencing data.

Each trained classification model is used to construct an identification classifier for metastatic breast cancer transcriptome sequencing data; the expression data of the breast cancer metastasis key genes in the breast cancer transcriptome sequencing data to be identified are input into the classifier to obtain a classification result, which completes the identification of metastatic breast cancer transcriptome sequencing data.

As an embodiment of the present application, the method comprises the following steps:

LASSO is used to select the breast cancer metastasis key genes. The relationship between the lambda value and the mean square error of the LASSO model is shown in FIG. 4. The key genes are selected from the genes common to the breast cancer metastasis differential gene set and the breast cancer metastasis hub gene set, which determines both the number and the names of the key genes. The mean square error of the LASSO model is smallest when the lambda value is approximately -4.2. The genes that actually contribute to the LASSO model at that value are therefore taken, namely the 21 breast cancer metastasis key genes with stronger predictive ability, such as ENPP2.

The gene expression data of the breast cancer metastasis key genes in each sample are then obtained; these are the training data and validation data needed to construct the metastatic breast cancer transcriptome sequencing data identification model. The candidate models, namely the logistic regression model, random forest model, support vector machine model, GBDT model and XGBoost model, are trained on the training data: classification training and hyperparameter optimization are performed by searching all hyperparameter combinations of each model within the parameter space, which yields the metastatic breast cancer transcriptome sequencing data identification models.

Finally, the identification models are compared using the same data sets and the same evaluation indices (ACC, F1-Score and AUC), as shown in FIG. 5. The comparison shows that the identification models built on the support vector machine model and the XGBoost model outperform the other identification models. They are effective in identifying metastatic breast cancer transcriptome sequencing data and can offer a new approach for research on such identification.

It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims

1. The machine learning-based identification method for the transcriptome sequencing data of the metastatic breast cancer is characterized by comprising the following steps of:

acquiring transcriptome sequencing data of a breast cancer patient;

2. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 1, wherein the step of obtaining transcriptome sequencing data of breast cancer patient comprises the steps of:

firstly, acquiring a GSE data set of breast cancer metastasis;

3. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 2, wherein the specific steps of the differential analysis are:

4. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 3, wherein the specific steps of WGCNA are:

5. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 4, wherein the specific steps of screening common genes by using LASSO regression analysis are as follows:

6. The machine learning based method of identifying transcriptome sequencing data for metastatic breast cancer according to claim 5, wherein said plurality of different classification models comprises: logistic regression models, random forest models, support vector machine models, GBDT models, and XGBoost models.

7. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 6, wherein the training of a plurality of different classification models using breast cancer metastasis key gene expression data comprises the following specific steps:

8. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 6, wherein the identification model of transcriptome sequencing data of metastatic breast cancer with highest classification accuracy in the selected classification model is obtained by evaluating the classification effect of each model, and the specific steps are as follows:

9. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 8, wherein said F1-Score and Accuracy are expressed as:

10. The machine learning based method of identifying transcriptome sequencing data of metastatic breast cancer according to claim 9, wherein the GSE dataset is a GSE9893 dataset and a GSE43837 dataset.