CN116564409A - Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer - Google Patents

Info

Publication number
CN116564409A
Authority
CN
China
Prior art keywords
breast cancer
sequencing data
transcriptome sequencing
model
metastatic breast
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310505357.4A
Other languages
Chinese (zh)
Inventor
张子龙
段昊
崔菲菲
李兴风
张清辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan University
Original Assignee
Hainan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hainan University
Priority to CN202310505357.4A
Publication of CN116564409A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a machine learning-based method for identifying metastatic breast cancer transcriptome sequencing data. It relates to the technical field of bioinformatics and addresses the low accuracy with which the prior art identifies metastatic breast cancer transcriptome sequencing data. With this technical scheme, metastatic breast cancer transcriptome sequencing data can be accurately identified among breast cancer transcriptome sequencing data.

Description

Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
Technical Field
The invention relates to the technical field of biological information, in particular to a machine learning-based identification method for sequencing data of a transcriptome of metastatic breast cancer.
Background
Breast cancer is often referred to as the "pink killer", and its incidence ranks first among malignant tumors in women. Identifying metastatic breast cancer transcriptome sequencing data means picking out the metastatic samples from breast cancer transcriptome sequencing data. This provides a degree of data-level technical support for breast cancer metastasis research and further promotes its progress.
The prior art is limited by the complexity of the transcriptome sequencing data itself, so the recognition accuracy of transcriptome sequencing data for metastatic breast cancer is very low.
Disclosure of Invention
The purpose of the invention is to provide a machine learning-based method for identifying metastatic breast cancer transcriptome sequencing data, addressing the low identification accuracy of the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a machine learning based method for identifying transcriptome sequencing data of metastatic breast cancer, comprising the steps of:
acquiring transcriptome sequencing data of a breast cancer patient;
acquiring breast cancer metastasis key gene expression data from transcriptome sequencing data of a breast cancer patient;
training a plurality of different classification models by using breast cancer metastasis key gene expression data, and selecting the classification model with highest classification precision as a metastatic breast cancer transcriptome sequencing data identification model;
utilizing a metastatic breast cancer transcriptome sequencing data identification model to identify metastatic breast cancer transcriptome sequencing data;
the specific steps for acquiring the breast cancer metastasis key gene expression data from the transcriptome sequencing data of the breast cancer patient are as follows:
step one: performing differential analysis on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis differential gene set;
step two: performing weighted gene co-expression network analysis (WGCNA) on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis hub gene set;
step three: extracting the common genes of the breast cancer metastasis differential gene set and the breast cancer metastasis hub gene set;
step four: screening the common genes by LASSO regression analysis to obtain breast cancer metastasis key genes, and matching the breast cancer metastasis key genes against the transcriptome sequencing data of the breast cancer patient to obtain breast cancer metastasis key gene expression data.
Further, the step of obtaining transcriptome sequencing data of the breast cancer patient specifically comprises the following steps:
firstly, acquiring a GSE data set of breast cancer metastasis;
secondly, extracting transcriptome sequencing data, clinical characteristics and GEO chip platform numbers of the breast cancer patients from the acquired GSE data set in RStudio, and acquiring gene names corresponding to each gene probe in the chip according to the GEO chip platform numbers.
Further, the specific steps of the differential analysis are as follows:
performing differential analysis on transcriptome sequencing data of a breast cancer patient using the limma package in RStudio; adding a column of gene names to the differential analysis result, based on the gene name obtained for each gene probe, so that the expression of the gene detected by each probe can be determined; and then screening the differential analysis result with P-value < 0.05 and |logFC| > 0.5 as screening criteria to obtain the differential genes of breast cancer metastasis.
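The screening criteria above can be illustrated with a minimal Python sketch. It is not the limma workflow itself: it assumes the differential analysis result has already been exported as rows with hypothetical field names `gene`, `logFC` and `p_value`, and the values are toy numbers rather than data from GSE9893 or GSE43837.

```python
from typing import Any, Dict, List

# Hypothetical row layout: one dict per probe, mirroring the columns of a
# differential analysis result table (gene name, log fold change, P-value).
Row = Dict[str, Any]


def filter_differential_genes(rows: List[Row],
                              p_cutoff: float = 0.05,
                              logfc_cutoff: float = 0.5) -> List[str]:
    """Return gene names with P-value < p_cutoff and |logFC| > logfc_cutoff."""
    selected: List[str] = []
    for row in rows:
        if row["p_value"] < p_cutoff and abs(row["logFC"]) > logfc_cutoff:
            # Several probes may map to the same gene; keep each name once.
            if row["gene"] not in selected:
                selected.append(row["gene"])
    return selected


# Toy values for illustration only.
table = [
    {"gene": "GENE_A", "logFC": 1.20, "p_value": 0.001},
    {"gene": "GENE_B", "logFC": -0.80, "p_value": 0.030},
    {"gene": "GENE_C", "logFC": 0.30, "p_value": 0.010},  # |logFC| too small
    {"gene": "GENE_D", "logFC": 0.90, "p_value": 0.200},  # P-value too large
]
print(filter_differential_genes(table))
```

Both up-regulated and down-regulated genes pass the filter, because the criterion is applied to the absolute value of logFC.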
Further, the specific steps of the WGCNA are as follows:
step 2.1: clustering all samples to obtain a sample cluster tree and identify outlier samples, and then setting a cutHeight value according to the position of the outlier samples in the sample cluster tree so as to remove the outlier samples and obtain the remaining samples;
step 2.2: based on the remaining samples, calculating candidate soft thresholds using the soft threshold calculation function in the WGCNA package, plotting how the scale-free topology fit index and the average connectivity change with the soft threshold, and selecting the optimal soft threshold as the one at which the scale-free topology fit index is greater than 0.9 and the average connectivity levels off;
step 2.3: constructing a scale-free network with the optimal soft threshold to obtain a hierarchical clustering dendrogram with module identification, correlating the modules with clinical characteristics to obtain the correlation coefficient between each module in the dendrogram and breast cancer metastasis, and selecting the modules with the highest and the second-highest correlation coefficients;
step 2.4: extracting the genes in the modules with the highest and the second-highest correlation coefficients to obtain the breast cancer metastasis hub gene set.
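The soft threshold choice in the WGCNA steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the fit table is assumed to have been computed already (for example by the WGCNA package), the function name `pick_soft_threshold` is hypothetical, and the sketch applies only the fit index criterion of more than 0.9. Whether the average connectivity has levelled off is left to inspection of the trend plot.

```python
from typing import List, Optional, Tuple

# One row per candidate power:
# (soft threshold power, scale-free topology fit index, mean connectivity)
FitRow = Tuple[int, float, float]


def pick_soft_threshold(fit_table: List[FitRow],
                        min_fit: float = 0.9) -> Optional[int]:
    """Return the smallest power whose fit index exceeds min_fit, or None."""
    for power, fit_index, _mean_connectivity in sorted(fit_table):
        if fit_index > min_fit:
            return power
    return None


# Toy numbers for illustration only, not values from the patent's datasets.
toy_fit = [
    (1, 0.12, 950.0),
    (2, 0.41, 410.0),
    (4, 0.78, 95.0),
    (6, 0.91, 28.0),
    (8, 0.93, 11.0),
]
print(pick_soft_threshold(toy_fit))
```

Choosing the smallest qualifying power is a common convention because larger powers reduce connectivity further; the text itself only states the two criteria.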
Further, the specific steps of screening the common genes by using LASSO regression analysis are as follows:
performing LASSO regression analysis on the common genes using the glmnet package in RStudio, and selecting the genes retained at the lambda value where the mean square error of the LASSO model is smallest; these are the breast cancer metastasis key genes.
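The selection rule can be sketched in Python as below. The sketch does not fit a LASSO model; it assumes a regularization path has already been computed (for example by glmnet) and exported as tuples of lambda, cross-validated mean square error and gene coefficients. The name `select_key_genes` and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# One entry per lambda on the path:
# (lambda value, cross-validated mean square error, {gene name: coefficient})
PathEntry = Tuple[float, float, Dict[str, float]]


def select_key_genes(path: List[PathEntry]) -> Tuple[float, List[str]]:
    """Pick the lambda with the smallest MSE; return it with the genes
    whose coefficients are non-zero at that lambda."""
    best_lambda, _best_mse, best_coefs = min(path, key=lambda entry: entry[1])
    key_genes = sorted(gene for gene, coef in best_coefs.items() if coef != 0.0)
    return best_lambda, key_genes


# Toy path for illustration only.
toy_path = [
    (0.100, 0.231, {"GENE_A": 0.00, "GENE_B": 0.00, "GENE_C": 0.12}),
    (0.015, 0.187, {"GENE_A": 0.31, "GENE_B": 0.00, "GENE_C": 0.25}),
    (0.001, 0.204, {"GENE_A": 0.44, "GENE_B": -0.08, "GENE_C": 0.29}),
]
print(select_key_genes(toy_path))
```

Genes whose coefficients have been shrunk to exactly zero at the chosen lambda are dropped, which is what makes LASSO usable as a gene screening step.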
Further, the plurality of different classification models includes: logistic regression models, random forest models, support vector machine models, GBDT models, and XGBoost models.
Further, the specific steps of training a plurality of different classification models by using breast cancer metastasis key gene expression data are as follows:
based on grid search and five-fold cross-validation, performing classification training and hyperparameter optimization on each of the classification models by searching all hyperparameter combinations of the model within the parameter space.
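A minimal, library-free Python sketch of exhaustive grid search with five-fold cross-validation is given below. The helper names (`k_fold_indices`, `grid_search_cv`, `threshold_rule_score`) and the toy one-dimensional data are assumptions made for illustration; the patent does not disclose its parameter grids, and a real run would fit each of the five classifiers inside the scoring callback.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

ScoreFn = Callable[[dict, List[int], List[int]], float]


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search_cv(param_grid: Dict[str, Sequence],
                   fit_and_score: ScoreFn,
                   n_samples: int,
                   k: int = 5) -> Tuple[Optional[dict], float]:
    """Try every hyperparameter combination; return the best combination
    and its mean validation score over the k folds."""
    folds = k_fold_indices(n_samples, k)
    names = sorted(param_grid)
    best_params: Optional[dict] = None
    best_score = float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out in range(k):
            val_idx = folds[held_out]
            train_idx = [i for j, fold in enumerate(folds)
                         if j != held_out for i in fold]
            fold_scores.append(fit_and_score(params, train_idx, val_idx))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy one-feature data and a fixed threshold rule, for illustration only.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def threshold_rule_score(params: dict, train_idx: List[int],
                         val_idx: List[int]) -> float:
    # A real model would be fitted on train_idx; this rule has nothing to fit.
    hits = sum((xs[i] > params["threshold"]) == bool(ys[i]) for i in val_idx)
    return hits / len(val_idx)


best_params, best_score = grid_search_cv(
    {"threshold": [0.25, 0.55, 0.75]}, threshold_rule_score, len(xs))
print(best_params, best_score)
```

Every sample is used for validation exactly once, so the mean fold score is less sensitive to one lucky or unlucky split than a single hold-out estimate.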
Furthermore, the classification model with the highest classification precision is selected as the identification model by evaluating the classification effect of each model, and the specific steps are as follows:
first, the model with the largest F1-Score among the classification models is taken as the optimal model; if the F1-Score values are equal, the model with the largest Accuracy is taken as the optimal model; if the Accuracy values are also equal, the AUC values are compared and the model with the largest AUC is taken as the optimal model; if the F1-Score, Accuracy and AUC values are all equal, the model is selected in the priority order XGBoost model > GBDT model > support vector machine model > random forest model > logistic regression model.
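The tie-breaking order described above maps naturally onto a tuple comparison. The following Python sketch assumes each model's metrics are available in a dictionary with hypothetical keys `f1`, `acc` and `auc`; the model labels and scores are toy values, and ties are detected by exact equality.

```python
from typing import Dict, Tuple

# Tie-break order stated in the text: a model earlier in this list wins.
PRIORITY = ["XGBoost", "GBDT", "SVM", "RandomForest", "LogisticRegression"]


def select_best_model(results: Dict[str, Dict[str, float]]) -> str:
    """Compare by F1-Score, then Accuracy, then AUC, then fixed priority."""
    def sort_key(name: str) -> Tuple[float, float, float, int]:
        metrics = results[name]
        # Negated index so that a higher-priority model gives a larger key.
        return (metrics["f1"], metrics["acc"], metrics["auc"],
                -PRIORITY.index(name))
    return max(results, key=sort_key)


# Toy scores for illustration only.
toy_results = {
    "LogisticRegression": {"f1": 0.70, "acc": 0.74, "auc": 0.78},
    "RandomForest":       {"f1": 0.72, "acc": 0.75, "auc": 0.80},
    "SVM":                {"f1": 0.75, "acc": 0.79, "auc": 0.81},
    "GBDT":               {"f1": 0.75, "acc": 0.77, "auc": 0.83},
    "XGBoost":            {"f1": 0.73, "acc": 0.78, "auc": 0.82},
}
print(select_best_model(toy_results))
```

In the toy data SVM and GBDT tie on F1-Score, so Accuracy decides between them; AUC and the fixed priority order are consulted only when the earlier criteria are equal.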
Further, the F1-Score and Accuracy are expressed as:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where ACC denotes Accuracy, TP represents the number of metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, FP represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, TN represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data, and FN represents the number of metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data.
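Given the TP, FP, TN and FN counts defined above, Accuracy and F1-Score can be computed as in the following Python sketch. It uses the standard textbook definitions, which is an assumption about the formulas intended here; the function names and counts are illustrative.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy confusion counts for illustration only.
tp, fp, tn, fn = 8, 2, 18, 2
print(round(accuracy(tp, fp, tn, fn), 4), round(f1_score(tp, fp, fn), 4))
```

F1-Score ignores TN, so it is less flattered than Accuracy when non-metastatic samples dominate a dataset.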
Further, the GSE datasets are the GSE9893 dataset and the GSE43837 dataset.
The beneficial effects of the invention are as follows:
Each trained classification model is used to construct a corresponding identification classifier for metastatic breast cancer transcriptome sequencing data. The expression data of the breast cancer metastasis key genes in the breast cancer transcriptome sequencing data to be identified are input into the identification classifier to obtain a classification result, which completes the identification of metastatic breast cancer transcriptome sequencing data. With this technical scheme, metastatic breast cancer transcriptome sequencing data can be accurately identified among breast cancer transcriptome sequencing data.
The application identifies whether given breast cancer transcriptome sequencing data are metastatic breast cancer transcriptome sequencing data. The breast cancer metastasis key genes are obtained by screening genes jointly with differential analysis, WGCNA and LASSO regression analysis, which effectively improves the identification accuracy of metastatic breast cancer transcriptome sequencing data.
Drawings
FIG. 1 is an overall flow chart of the present application;
FIG. 2 is a graph showing the results of differential analysis under different data sets;
FIG. 3 is a schematic diagram of WGCNA results;
FIG. 4 is a graph showing the mean square error of the LASSO model at different lambda values;
FIG. 5 is a schematic diagram showing comparison of recognition effects of various classification models;
FIG. 6 is a schematic structural diagram of a device for identifying transcriptome sequencing data of metastatic breast cancer.
Detailed Description
It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.
Embodiment 1: referring to FIG. 1, the machine learning-based method for identifying metastatic breast cancer transcriptome sequencing data according to this embodiment comprises the following steps:
acquiring transcriptome sequencing data of a breast cancer patient;
acquiring breast cancer metastasis key gene expression data from transcriptome sequencing data of a breast cancer patient;
training a plurality of different classification models by using breast cancer metastasis key gene expression data, and selecting the classification model with highest classification precision as a metastatic breast cancer transcriptome sequencing data identification model;
utilizing a metastatic breast cancer transcriptome sequencing data identification model to identify metastatic breast cancer transcriptome sequencing data;
the specific steps for acquiring the breast cancer metastasis key gene expression data from the transcriptome sequencing data of the breast cancer patient are as follows:
step one: performing differential analysis on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis differential gene set;
step two: performing weighted gene co-expression network analysis (WGCNA) on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis hub gene set;
step three: extracting the common genes of the breast cancer metastasis differential gene set and the breast cancer metastasis hub gene set;
step four: screening the common genes by LASSO regression analysis to obtain breast cancer metastasis key genes, and matching the breast cancer metastasis key genes against the transcriptome sequencing data of the breast cancer patient to obtain breast cancer metastasis key gene expression data.
Embodiment 2: this embodiment further describes Embodiment 1 and differs from it in that the step of acquiring transcriptome sequencing data of a breast cancer patient specifically comprises:
firstly, acquiring a GSE data set of breast cancer metastasis;
secondly, extracting transcriptome sequencing data, clinical characteristics and GEO chip platform numbers of the breast cancer patients from the acquired GSE data set in RStudio, and acquiring gene names corresponding to each gene probe in the chip according to the GEO chip platform numbers.
Embodiment 3: this embodiment further describes Embodiment 2 and differs from it in that the differential analysis comprises the following specific steps:
performing differential analysis on transcriptome sequencing data of a breast cancer patient using the limma package in RStudio; adding a column of gene names to the differential analysis result, based on the gene name obtained for each gene probe, so that the expression of the gene detected by each probe can be determined; and then screening the differential analysis result with P-value < 0.05 and |logFC| > 0.5 as screening criteria to obtain the differential genes of breast cancer metastasis.
Embodiment 4: this embodiment further describes Embodiment 3 and differs from it in that the WGCNA comprises the following specific steps:
step 2.1: clustering all samples to obtain a sample cluster tree and identify outlier samples, and then setting a cutHeight value according to the position of the outlier samples in the sample cluster tree so as to remove the outlier samples and obtain the remaining samples;
step 2.2: based on the remaining samples, calculating candidate soft thresholds using the soft threshold calculation function in the WGCNA package, plotting how the scale-free topology fit index and the average connectivity change with the soft threshold, and selecting the optimal soft threshold as the one at which the scale-free topology fit index is greater than 0.9 and the average connectivity levels off;
step 2.3: constructing a scale-free network with the optimal soft threshold to obtain a hierarchical clustering dendrogram with module identification, correlating the modules with clinical characteristics to obtain the correlation coefficient between each module in the dendrogram and breast cancer metastasis, and selecting the modules with the highest and the second-highest correlation coefficients;
step 2.4: extracting the genes in the modules with the highest and the second-highest correlation coefficients to obtain the breast cancer metastasis hub gene set.
Embodiment 5: this embodiment further describes Embodiment 4 and differs from it in that screening the common genes by LASSO regression analysis comprises the following specific steps:
performing LASSO regression analysis on the common genes using the glmnet package in RStudio, and selecting the genes retained at the lambda value where the mean square error of the LASSO model is smallest; these are the breast cancer metastasis key genes.
Embodiment 6: this embodiment further describes Embodiment 5 and differs from it in that the plurality of different classification models includes: logistic regression models, random forest models, support vector machine models, GBDT models, and XGBoost models.
Embodiment 7: this embodiment further describes Embodiment 6 and differs from it in that training the plurality of different classification models using breast cancer metastasis key gene expression data comprises the following specific steps:
based on grid search and five-fold cross-validation, performing classification training and hyperparameter optimization on each of the classification models by searching all hyperparameter combinations of the model within the parameter space.
Embodiment 8: this embodiment further describes Embodiment 6 and differs from it in that the classification model with the highest classification precision is selected as the metastatic breast cancer transcriptome sequencing data identification model by evaluating the classification effect of each model, and the specific steps are as follows:
first, the model with the largest F1-Score among the classification models is taken as the optimal model; if the F1-Score values are equal, the model with the largest Accuracy is taken as the optimal model; if the Accuracy values are also equal, the AUC values are compared and the model with the largest AUC is taken as the optimal model; if the F1-Score, Accuracy and AUC values are all equal, the model is selected in the priority order XGBoost model > GBDT model > support vector machine model > random forest model > logistic regression model.
Embodiment 9: this embodiment further describes Embodiment 8 and differs from it in that the F1-Score and Accuracy are expressed as:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where ACC denotes Accuracy, TP represents the number of metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, FP represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, TN represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data, and FN represents the number of metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data.
Embodiment 10: this embodiment further describes Embodiment 9 and differs from it in that the GSE datasets are the GSE9893 dataset and the GSE43837 dataset.
As an embodiment of the present application, as shown in fig. 1, the method includes the following steps:
S101, acquiring transcriptome sequencing data and clinical information of breast cancer patients.
The transcriptome sequencing data of a breast cancer patient are the expression levels of a plurality of genes measured from the patient's tissue sample, and the clinical information of a breast cancer patient includes information such as the patient's physical condition and breast cancer progression during treatment.
In some alternative embodiments, the transcriptome sequencing data and clinical information of breast cancer patients comprise 2 datasets in total: the GSE9893 dataset in the GEO database (155 samples in total, of which 48 are metastatic breast cancer transcriptome sequencing data samples and 107 are non-metastatic breast cancer transcriptome sequencing data samples; 22656 genes measured), and the GSE43837 dataset in the GEO database (38 samples in total, of which 19 are metastatic breast cancer transcriptome sequencing data samples and 19 are non-metastatic breast cancer transcriptome sequencing data samples; 61359 genes measured).
S102, primarily screening breast cancer metastasis key genes based on difference analysis and WGCNA.
Step S102 includes the following substeps S1021-S1023:
S1021, performing differential analysis on the transcriptome sequencing data of the breast cancer patient to obtain a breast cancer metastasis differential gene set.
In some alternative embodiments, the transcriptome sequencing data and clinical information of the breast cancer patients are preprocessed in RStudio: the expression matrix and clinical information are extracted from the two datasets, the breast cancer patients are grouped according to whether the clinical information records breast cancer metastasis, and the gene probes in the expression matrix are annotated.
In some alternative embodiments, differential analysis is performed on the transcriptome sequencing data of the breast cancer patients using the limma package in RStudio, and genes related to distant metastasis of breast cancer are preliminarily screened using P-value < 0.05 and |logFC| > 0.5 as screening criteria. The differential analysis results are shown in FIG. 2: 6188 differential genes were screened from the GSE9893 dataset and 2122 differential genes were screened from the GSE43837 dataset.
S1022, performing WGCNA on the transcriptome sequencing data of the breast cancer patient to obtain a breast cancer metastasis hub gene set.
In some alternative embodiments, all samples of the GSE9893 dataset are clustered to check for outlier samples, and cutHeight is set to 150 as the criterion for removing them. After the outliers are removed, the sample cluster tree is rebuilt from the remaining samples, and the association between the phenotype data and the samples is visualized.
In some alternative embodiments, the smallest suitable soft threshold is determined to be 6 from the trends of the scale-free topology fit index and the average connectivity as the soft threshold varies. A scale-free network is then constructed with the soft threshold of 6 to obtain the hierarchical clustering dendrogram with module identification, and the modules are correlated with clinical characteristics to obtain the correlation coefficient between each module and the phenotype data (whether or not the sample is metastatic breast cancer transcriptome sequencing data). The WGCNA results, i.e., these correlation coefficients, are shown in FIG. 3. The yellow-green module and the blue module have relatively high correlation coefficients with the phenotype data, so the genes of these two modules (3404 genes in total) are taken as the breast cancer metastasis hub genes for further analysis and screening.
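The module selection step can be sketched in Python as follows. The sketch assumes the module-trait correlation coefficients and module memberships have already been computed by WGCNA; the function names, the extra modules "brown" and "grey", and all numbers are hypothetical. It ranks by the signed coefficient as the text describes, although ranking by absolute value would be a reasonable alternative.

```python
from typing import Dict, List


def top_two_modules(module_trait_cor: Dict[str, float]) -> List[str]:
    """Return the modules with the highest and second-highest correlation."""
    ranked = sorted(module_trait_cor,
                    key=lambda name: module_trait_cor[name],
                    reverse=True)
    return ranked[:2]


def hub_genes(module_genes: Dict[str, List[str]],
              modules: List[str]) -> List[str]:
    """Collect the genes belonging to the selected modules."""
    genes = set()
    for module in modules:
        genes.update(module_genes[module])
    return sorted(genes)


# Toy correlation coefficients and memberships, for illustration only.
toy_cor = {"yellow-green": 0.31, "blue": 0.27, "brown": 0.05, "grey": -0.12}
toy_members = {
    "yellow-green": ["GENE_A", "GENE_B"],
    "blue": ["GENE_C"],
    "brown": ["GENE_D"],
    "grey": ["GENE_E"],
}
selected = top_two_modules(toy_cor)
print(selected, hub_genes(toy_members, selected))
```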
S1023, extracting the common genes of the breast cancer metastasis differential gene set and the breast cancer metastasis hub gene set.
In some alternative embodiments, the 114 genes common to the 6188 differential genes of the GSE9893 dataset, the 2122 differential genes of the GSE43837 dataset, and the 3404 breast cancer metastasis hub genes are extracted for further analysis and screening.
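Extracting the common genes amounts to a set intersection over the three gene lists. The Python sketch below uses a hypothetical helper `common_genes` and toy gene names; in the embodiment the inputs would be the 6188, 2122 and 3404 genes mentioned above.

```python
from typing import Iterable, List


def common_genes(*gene_sets: Iterable[str]) -> List[str]:
    """Return the genes present in every input set, sorted by name.

    At least one gene set must be supplied.
    """
    sets = [set(genes) for genes in gene_sets]
    return sorted(set.intersection(*sets))


# Toy gene lists for illustration only.
deg_first_dataset = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
deg_second_dataset = ["GENE_B", "GENE_C", "GENE_E"]
hub_gene_set = ["GENE_C", "GENE_B", "GENE_F"]
print(common_genes(deg_first_dataset, deg_second_dataset, hub_gene_set))
```

Gene names must be written consistently across the inputs (same symbol convention and letter case) for the intersection to be meaningful.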
S103, further screening the breast cancer metastasis key genes based on LASSO regression analysis.
In some alternative embodiments, LASSO regression analysis is performed on the 114 common genes in RStudio using the glmnet package. The absolute values of the coefficients of the 114 genes in the LASSO model shrink as the lambda value increases, and can reach 0 (i.e., the gene no longer contributes to the model). It is therefore necessary to determine the lambda value at which the LASSO model performs best, and thereby determine which genes actually contribute to the model. The mean square error of the LASSO model at different lambda values is shown in FIG. 4. The lambda value with the best performance, i.e., the smallest mean square error, is selected, and the 21 genes with stronger predictive ability retained at that lambda value, such as ENPP2, are taken as the breast cancer metastasis key genes.
S104, constructing a metastatic breast cancer transcriptome sequencing data identification model based on the breast cancer metastasis key genes.
The candidate models are a logistic regression model, a random forest model, a support vector machine model, a GBDT model, and an XGBoost model. Each candidate model is trained for classification on the training set, i.e., a randomly selected 80% of the GSE9893 dataset, to obtain the trained classification models. Step S104 specifically includes:
based on grid search and five-fold cross-validation, performing classification training and hyperparameter optimization on the logistic regression model, the random forest model, the support vector machine model, the GBDT model, and the XGBoost model by searching all hyperparameter combinations of each model within the parameter space.
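The random 80% training split mentioned in S104 can be sketched in Python as below. The function name, the seed and the sample identifiers are hypothetical, and the sketch assumes all 155 GSE9893 samples enter the split. The text does not say whether the split is stratified by metastasis status, so a plain random split is shown.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(sample_ids: Sequence[str],
                     train_fraction: float = 0.8,
                     seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle the sample IDs and cut them into training and held-out parts."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # fixed seed for reproducibility
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Placeholder sample IDs; 155 is the GSE9893 sample count given in the text.
samples = [f"sample_{i:03d}" for i in range(155)]
train_ids, test_ids = split_train_test(samples)
print(len(train_ids), len(test_ids))
```

With 107 non-metastatic and 48 metastatic samples, a stratified split would keep the class ratio equal in both parts; that would be a design choice beyond what the text states.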
The classification effect is then evaluated.
In some alternative embodiments, the indices for evaluating the classification effect include ACC, F1-Score, and AUC. ACC and F1-Score are calculated as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP represents the number of metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, FP represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, TN represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data, and FN represents the number of metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data.
F1-Score is a statistical index used to measure the accuracy of a binary classification model. It takes both the precision and the recall of the classification model into account and can be regarded as the harmonic mean of the model's precision and recall.
AUC is an evaluation index for measuring the quality of a binary classification model and represents the probability that a positive sample is ranked ahead of a negative sample by the predicted score. The AUC value is the area under the ROC curve.
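The ranking interpretation of AUC can be made concrete with a short Python sketch that counts, over all positive-negative pairs, how often the positive sample receives the higher score. The function name and the scores are illustrative assumptions; ties are counted as half a win, which is the usual convention.

```python
from typing import List


def auc_by_ranking(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy predicted scores (1 = metastatic, 0 = non-metastatic), illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
print(round(auc_by_ranking(toy_scores, toy_labels), 4))
```

The pairwise count is quadratic in the number of samples, which is acceptable for datasets of the size described here; this pairwise value equals the area under the ROC curve.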
S105, acquiring breast cancer metastasis key gene expression data in breast cancer transcriptome sequencing data to be identified;
S106, identifying the metastatic breast cancer transcriptome sequencing data in the breast cancer transcriptome sequencing data based on the identification model of the metastatic breast cancer transcriptome sequencing data.
Specifically, an identification classifier for metastatic breast cancer transcriptome sequencing data is constructed from each trained classification model. The expression data of the breast cancer metastasis key genes in the breast cancer transcriptome sequencing data to be identified are input into the identification classifier to obtain a classification result, which completes the identification of the metastatic breast cancer transcriptome sequencing data.
As an embodiment of the present application, the method comprises the following steps:
LASSO is used to select the breast cancer metastasis key genes. The selection is performed on the genes common to the breast cancer metastasis differential gene set and the breast cancer metastasis hub gene set, and determines the number and the specific names of the breast cancer metastasis key genes. A plot of the lambda value against the mean square error of the LASSO model is shown in fig. 4. The mean square error of the LASSO model is minimal when the lambda value is approximately -4.2. Therefore, the genes that actually contribute to the LASSO model at a lambda value of approximately -4.2 are taken as the key genes, namely 21 breast cancer metastasis key genes with strong predictive ability, such as ENPP2.
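The idea of keeping only the genes that contribute to the LASSO model at the lambda with the smallest mean square error can be sketched as follows. This is an illustration under stated assumptions, not the glmnet procedure used in the embodiment: the coordinate-descent solver has no intercept (features and response are assumed to be centred), a single validation set stands in for whatever validation scheme produced fig. 4, and the gene names other than ENPP2, the lambda grid and the data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=100):
    """Minimise (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b for the initial b = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam) / z
            if new != beta[j]:
                delta = new - beta[j]
                residual = [r - c * delta for r, c in zip(residual, col)]
                beta[j] = new
    return beta


def mse(X, y, beta):
    """Mean square error of the linear predictor X b."""
    preds = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return sum((t - q) ** 2 for t, q in zip(y, preds)) / len(y)


def select_key_genes(X_train, y_train, X_val, y_val, lambdas, gene_names):
    """Pick the lambda with the smallest validation MSE and return the genes
    whose coefficients are non-zero at that lambda."""
    best_lam = min(
        lambdas, key=lambda lam: mse(X_val, y_val, lasso_fit(X_train, y_train, lam))
    )
    beta = lasso_fit(X_train, y_train, best_lam)
    selected = [g for g, b in zip(gene_names, beta) if abs(b) > 1e-12]
    return best_lam, selected


# Toy data with orthogonal columns; only the first gene drives the response.
genes = ["ENPP2", "GENE_B", "GENE_C"]  # GENE_B and GENE_C are hypothetical
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2, -2, 2, -2]
best_lam, key_genes = select_key_genes(X, y, X, y, [3.0, 0.5, 0.01], genes)
```

A larger lambda shrinks more coefficients to exactly zero, so the lambda chosen by minimum error also fixes how many genes survive the screening.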
The gene expression data of the breast cancer metastasis key genes in each sample are then obtained; these are the training data and validation data required to construct the metastatic breast cancer transcriptome sequencing data identification model. The seed models, namely the logistic regression model, the random forest model, the support vector machine model, the GBDT model and the XGBoost model, are trained on the training data: classification training and hyperparameter optimization are performed by searching all hyperparameter combinations of each model within its parameter space, thereby obtaining the identification models of the metastatic breast cancer transcriptome sequencing data.
Finally, the identification models of the metastatic breast cancer transcriptome sequencing data are compared. The comparison uses the same data set and the same evaluation indices (ACC, F1-Score and AUC) for every model, as shown in fig. 5. The comparison shows that the identification models constructed from the support vector machine model and the XGBoost model outperform the other identification models, show a certain effectiveness in identifying metastatic breast cancer transcriptome sequencing data, and can provide a new approach for research on the identification of such data.
It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims (10)

1. The machine learning-based identification method for the transcriptome sequencing data of the metastatic breast cancer is characterized by comprising the following steps of:
acquiring transcriptome sequencing data of a breast cancer patient;
acquiring breast cancer metastasis key gene expression data from transcriptome sequencing data of a breast cancer patient;
training a plurality of different classification models by using breast cancer metastasis key gene expression data, and selecting the classification model with highest classification precision as a metastatic breast cancer transcriptome sequencing data identification model;
utilizing a metastatic breast cancer transcriptome sequencing data identification model to identify metastatic breast cancer transcriptome sequencing data;
the specific steps for acquiring the breast cancer metastasis key gene expression data from the transcriptome sequencing data of the breast cancer patient are as follows:
step one: performing differential analysis on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis differential gene set;
step two: WGCNA is carried out on transcriptome sequencing data of a breast cancer patient to obtain a breast cancer metastasis hub gene set;
step three: extracting common genes in the breast cancer metastasis difference gene set and the breast cancer metastasis hub gene set;
step four: screening the common genes by using LASSO regression analysis to obtain breast cancer metastasis key genes, and matching the breast cancer metastasis key genes against the transcriptome sequencing data of the breast cancer patient to obtain the breast cancer metastasis key gene expression data.
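Steps three and four of claim 1 amount to a set intersection followed by a lookup of expression values, which can be sketched as below. The function names, the dictionary layout (gene name mapped to per-sample expression values) and all gene names except ENPP2 are hypothetical, and the LASSO screening between the two steps is represented only by a hand-picked list.

```python
def common_genes(differential_genes, hub_genes):
    """Genes present in both the differential gene set and the hub gene set."""
    return sorted(set(differential_genes) & set(hub_genes))


def key_gene_expression(expression, key_genes):
    """Keep only the expression rows of the key genes.

    `expression` maps a gene name to its list of per-sample expression values.
    """
    return {gene: expression[gene] for gene in key_genes if gene in expression}


# Illustrative inputs (values are made up for the example).
differential_set = ["ENPP2", "GENE_A", "GENE_B"]
hub_set = ["GENE_B", "GENE_C", "ENPP2"]
expression = {
    "ENPP2": [7.1, 6.8, 9.3],
    "GENE_A": [5.0, 5.2, 4.9],
    "GENE_B": [3.3, 3.1, 3.6],
    "GENE_C": [8.8, 8.9, 8.7],
}

shared = common_genes(differential_set, hub_set)
# Suppose the LASSO screening kept only ENPP2 out of the shared genes.
key_expression = key_gene_expression(expression, ["ENPP2"])
```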
2. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 1, wherein the step of obtaining transcriptome sequencing data of breast cancer patient comprises the steps of:
firstly, acquiring a GSE data set of breast cancer metastasis;
secondly, extracting transcriptome sequencing data, clinical characteristics and GEO chip platform numbers of the breast cancer patients from the acquired GSE data set in RStudio, and acquiring gene names corresponding to each gene probe in the chip according to the GEO chip platform numbers.
3. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 2, wherein the specific steps of the differential analysis are:
performing differential analysis on the transcriptome sequencing data of the breast cancer patient by using the limma package in RStudio, adding a column of gene names to the differential analysis result according to the obtained gene name corresponding to each gene probe so as to determine the expression of the gene detected by each gene probe, and then screening the differential analysis result with P-value < 0.05 and |logFC| > 0.5 as the screening criteria to obtain the differential genes of breast cancer metastasis.
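The screening criterion of claim 3 can be written as a simple filter over the differential analysis result. The sketch below assumes the result has already been exported as a list of records with hypothetical field names `gene`, `p_value` and `logFC`; the rows are made-up values for illustration.

```python
def filter_differential_genes(results, p_cutoff=0.05, lfc_cutoff=0.5):
    """Keep genes with P-value < p_cutoff and |logFC| > lfc_cutoff."""
    return [
        row["gene"]
        for row in results
        if row["p_value"] < p_cutoff and abs(row["logFC"]) > lfc_cutoff
    ]


# Illustrative rows (not results from the application).
results = [
    {"gene": "ENPP2", "p_value": 0.001, "logFC": -1.20},
    {"gene": "GENE_A", "p_value": 0.030, "logFC": 0.40},  # effect too small
    {"gene": "GENE_B", "p_value": 0.200, "logFC": 1.50},  # not significant
    {"gene": "GENE_C", "p_value": 0.049, "logFC": 0.51},
]
differential_genes = filter_differential_genes(results)
```

Both conditions must hold, so a gene with a large fold change but a P-value above the cutoff is excluded.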
4. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 3, wherein the specific steps of WGCNA are:
step two,: clustering all samples to obtain a sample cluster tree and an outlier sample, and then setting a cutHeight value according to the position of the outlier sample in the sample cluster tree so as to remove the outlier value and obtain a residual sample;
step two: based on the residual samples, calculating a soft threshold by using a soft threshold calculation function in the WGCNA package, drawing a variation trend graph of the scale-free topology fitting index and the average connectivity along with the variation of the soft threshold according to the calculation result, and selecting the optimal soft threshold by taking the scale-free topology fitting index being more than 0.9 and the average connectivity trend leveling position as the standard;
step two, three: constructing a non-scale network according to an optimal soft threshold value to obtain a hierarchical clustering tree diagram of module identification, correlating the modules with clinical characteristics to obtain a correlation coefficient of each module in the hierarchical clustering tree diagram and breast cancer metastasis, and selecting the module with the highest correlation coefficient and the next highest correlation coefficient;
step two, four: extracting genes in the module with the highest correlation coefficient and the next highest correlation coefficient to obtain the breast cancer metastasis junction gene.
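The two selection rules in claim 4 can be sketched as follows, assuming the soft threshold table and the module-trait correlations have already been computed (for example by the WGCNA package). The function names, module names and numbers are illustrative assumptions. The first function applies only the fit-index criterion and returns the smallest qualifying power; judging where the mean connectivity levels off is left to inspection of the plot. The second ranks modules by the raw correlation coefficient, as the claim states; ranking by absolute value would be a different design choice.

```python
def pick_soft_threshold(fit_table, min_fit=0.9):
    """Return the smallest power whose scale-free topology fit index exceeds min_fit.

    `fit_table` is a list of (power, fit_index, mean_connectivity) tuples.
    Returns None when no candidate power reaches the required fit.
    """
    for power, fit_index, _mean_connectivity in sorted(fit_table):
        if fit_index > min_fit:
            return power
    return None


def top_two_modules(module_trait_correlation):
    """Return the modules with the highest and second-highest correlation
    coefficient with the trait (here, breast cancer metastasis)."""
    ranked = sorted(
        module_trait_correlation,
        key=lambda module: module_trait_correlation[module],
        reverse=True,
    )
    return ranked[:2]


# Illustrative values only.
fit_table = [
    (1, 0.20, 900.0),
    (2, 0.55, 300.0),
    (3, 0.80, 120.0),
    (4, 0.91, 60.0),
    (5, 0.93, 35.0),
]
correlations = {"blue": 0.42, "brown": 0.35, "grey": 0.05, "turquoise": -0.30}

power = pick_soft_threshold(fit_table)
hub_modules = top_two_modules(correlations)
```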
5. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 4, wherein the specific steps of screening common genes by using LASSO regression analysis are as follows:
performing LASSO regression analysis on the common genes by using the glmnet package in RStudio, and selecting the genes corresponding to the lambda value at which the mean square error of the LASSO model is minimal, these genes being the breast cancer metastasis key genes.
6. The machine learning based method of identifying transcriptome sequencing data for metastatic breast cancer according to claim 5, wherein said plurality of different classification models comprises: logistic regression models, random forest models, support vector machine models, GBDT models, and XGBoost models.
7. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 6, wherein the training of a plurality of different classification models using breast cancer metastasis key gene expression data comprises the following specific steps:
based on grid search and five-fold cross-validation, respectively performing classification training and hyperparameter optimization on the plurality of classification models by searching all hyperparameter combinations of each model within its parameter space.
8. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 6, wherein the classification model with the highest classification precision is selected as the metastatic breast cancer transcriptome sequencing data identification model by evaluating the classification effect of each model, and the specific steps are as follows:
first, the model with the largest F1-Score value among the plurality of classification models is taken as the optimal model; if the F1-Score values are the same, the model with the largest Accuracy value is taken as the optimal model; if the Accuracy values are also the same, the AUC values of the classification models are compared and the model with the largest AUC value is taken as the optimal model; if the F1-Score, Accuracy and AUC values of the classification models are all the same, the optimal model is selected according to the priority order XGBoost model > GBDT model > support vector machine model > random forest model > logistic regression model.
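The tie-breaking cascade of claim 8 is a lexicographic comparison, which the sketch below expresses with a tuple sort key. The model labels, the function name and the scores are illustrative assumptions; only the order of the criteria and the priority order come from the claim.

```python
# Priority order from the claim, highest priority first (labels are shorthand).
PRIORITY = ["XGBoost", "GBDT", "SVM", "RandomForest", "LogisticRegression"]


def select_best_model(scores):
    """Pick the optimal model from `scores`, a dict mapping a model label to
    an (f1, accuracy, auc) tuple.

    Tuples compare element by element, so F1-Score decides first, then
    Accuracy, then AUC, and finally the fixed priority order.
    """
    return max(
        scores,
        key=lambda name: (*scores[name], -PRIORITY.index(name)),
    )


# Illustrative scores (not results from the application).
scores = {
    "LogisticRegression": (0.80, 0.78, 0.85),
    "RandomForest": (0.84, 0.83, 0.88),
    "SVM": (0.84, 0.81, 0.90),
    "GBDT": (0.82, 0.80, 0.86),
    "XGBoost": (0.83, 0.82, 0.89),
}
best = select_best_model(scores)

# When every index ties, the priority order decides.
all_tied = {name: (0.8, 0.8, 0.8) for name in PRIORITY}
best_when_tied = select_best_model(all_tied)
```

Here the support vector machine and random forest entries tie on F1-Score, so the higher Accuracy of the random forest entry decides the result.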
9. The machine learning based identification method of transcriptome sequencing data of metastatic breast cancer according to claim 8, wherein said F1-Score and Accuracy are expressed as:

$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

wherein ACC denotes Accuracy, TP represents the number of metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, FP represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as metastatic breast cancer transcriptome sequencing data, TN represents the number of non-metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data, and FN represents the number of metastatic breast cancer transcriptome sequencing data samples identified as non-metastatic breast cancer transcriptome sequencing data.
10. The machine learning based method of identifying transcriptome sequencing data of metastatic breast cancer according to claim 9, wherein the GSE dataset is a GSE9893 dataset and a GSE43837 dataset.
Priority Applications (1)

Application Number: CN202310505357.4A
Priority Date: 2023-05-06
Filing Date: 2023-05-06
Title: Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
Status: Pending
Publication: CN116564409A (en), published 2023-08-08

Family ID: 87497616

Country Status (1)

Country: CN (1)
Link: CN116564409A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409965A (en) * 2023-09-28 2024-01-16 江苏先声医学诊断有限公司 Risk prediction system suitable for Asian HER2 positive breast cancer patients
CN117746983A (en) * 2023-12-19 2024-03-22 南昌大学 Construction method and application of senile breast cancer aging scoring model

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344907A (en) * 2018-10-30 2019-02-15 顾海艳 Based on the method for discrimination for improving judgment criteria sorting algorithm
CN110120264A (en) * 2019-04-19 2019-08-13 上海依智医疗技术有限公司 A kind of prognostic evaluation methods and device of asthma
CN111081317A (en) * 2019-12-10 2020-04-28 山东大学 Gene spectrum-based breast cancer lymph node metastasis prediction method and prediction system
CN111899882A (en) * 2020-08-07 2020-11-06 北京科技大学 Method and system for predicting cancer
CN113130002A (en) * 2021-04-29 2021-07-16 吉林大学 Novel method for lung adenocarcinoma biomarker screening, prognosis model construction and biological verification
CN113140320A (en) * 2021-05-13 2021-07-20 广州市妇女儿童医疗中心 Construction method of prediction model for postoperative long-term malnutrition of infant suffering from congenital heart disease operation
CN114360642A (en) * 2022-01-14 2022-04-15 吉林省蒲川生物医药有限公司 Cancer transcriptome data processing method based on gene co-expression network analysis
CN114496066A (en) * 2022-04-13 2022-05-13 南京墨宁医疗科技有限公司 Construction method and application of gene model for prognosis of triple negative breast cancer
CN115659245A (en) * 2022-10-24 2023-01-31 东华理工大学 Sandstone-type uranium deposit rock stratum type identification method and device based on machine learning
CN115938590A (en) * 2023-02-09 2023-04-07 四川大学华西医院 Construction method and prediction system of colorectal cancer postoperative LARS prediction model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高裴裴: "《智能计算技术与应用基础:面向新文科》", vol. 1, 31 August 2022, 北京邮电大学出版社, pages: 111 - 112 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination