CN117594133A - Screening method of biomarker for distinguishing uterine lesion type and application thereof
- Publication number
- CN117594133A (CN202410081934.6A)
- Authority
- CN
- China
- Prior art keywords
- uterine
- biomarker
- gene set
- genes
- candidate characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/20—Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a method of screening for biomarkers for distinguishing uterine lesion categories and an application thereof, and belongs to the technical field of biomarkers. The screening method comprises the following steps: first, counting the expression value of each gene in platelets; then performing unsupervised clustering and screening a candidate characteristic gene set based on the consistency between the unsupervised clustering results and the disease labels; drawing an ROC curve for each candidate gene, calculating its AUC value, and screening a new candidate characteristic gene set according to the AUC values; and finally, training a machine learning model using the expression values of all genes in the new candidate characteristic gene set and the disease labels to obtain a trained machine learning model and its corresponding gene combination, and taking the gene combination as the biomarker for distinguishing uterine lesion categories. The invention enables noninvasive detection in clinical application; the number of biomarkers is small, yet they carry comprehensive information while avoiding high detection cost, and they can be detected by qPCR.
Description
Technical Field
The invention relates to the technical field of biomarkers, in particular to a screening method of biomarkers for distinguishing uterine lesion categories and application thereof.
Background
Uterine fibroids (hysteromyoma) and uterine sarcoma are two types of uterine lesion, but their surgical treatment and prognosis differ greatly. Uterine fibroids are generally treated by myomectomy, the prognosis is good, and postoperative management consists mainly of observation; uterine sarcoma is generally treated by hysterectomy, the prognosis is poor, and drug treatment is also needed after surgery. Accurate discrimination of the uterine lesion type before surgery is therefore particularly important. In current clinical practice, however, accurate discrimination faces two difficulties: first, the sensitivity for identifying uterine sarcoma is low, and a uterine sarcoma misjudged as a uterine fibroid puts the patient at greater survival risk; second, the specificity for identifying uterine fibroids is low, and a uterine fibroid misjudged as a uterine sarcoma leads to hysterectomy and overtreatment, causing the patient to lose fertility and increasing the economic burden. Therefore, in order to improve the accuracy of discrimination of uterine lesions, it is necessary to screen for relevant biomarkers.
Some molecular features have already been proposed for distinguishing uterine fibroids from uterine sarcoma, but they suffer from the following three drawbacks. First, some molecular features are only single proteins, such as the proteins encoded by the CA125 gene or the LDH gene, and discriminating uterine fibroids from uterine sarcoma solely by the level of a single protein in the patient's blood has low accuracy. Second, some molecular features are obtained by genome or transcriptome sequencing of uterine fibroid and uterine sarcoma lesion tissue samples followed by comparison between the two, and include differing genomic mutations, differing gene copy number variations, differentially expressed genes, and so on; the corresponding differences are not observed in the patient's blood, so these features cannot be used for clinical diagnosis in the form of noninvasive detection. Third, the clinical performance of the molecular features currently used to distinguish uterine fibroids from uterine sarcoma has not been validated in independent clinical cohorts, so the level of evidence is insufficient and their reliability is questionable.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
The first aspect of the present invention provides a method of screening for biomarkers for discriminating uterine lesion categories, comprising:
for patients with hysteromyoma and uterine sarcoma, counting the expression value of each gene in the platelets according to the transcriptome sequencing data of the platelets respectively;
performing unsupervised clustering based on the expression values of genes in platelets, and screening to obtain a candidate characteristic gene set based on the consistency of unsupervised clustering results and disease labels, wherein the disease labels are uterine fibroids or uterine sarcomas;
drawing ROC curves according to the distribution of the expression value of each candidate characteristic gene in the candidate characteristic gene set in all hysteromyoma and hysterosarcoma patients and the disease label, calculating AUC values, and screening a new candidate characteristic gene set according to the AUC values;
and training a machine learning model by using the expression values of all genes in the new candidate characteristic gene set and the disease label to obtain a trained machine learning model and a gene combination corresponding to the machine learning model, and taking the gene combination as a biomarker for distinguishing uterine lesion types.
Preferably, the transcriptome sequencing data of the platelets is obtained by the following method:
collecting peripheral blood plasma samples from patients with uterine fibroids and from patients with uterine sarcoma, and separating platelets from the plasma by gradient centrifugation;
extracting RNA from the separated platelets, and reverse-transcribing the RNA into complementary DNA;
constructing a sequencing library from the complementary DNA and performing transcriptome sequencing to obtain the transcriptome sequencing data of the platelets.
Preferably, the performing unsupervised clustering based on the expression value of the genes in the platelets, and screening to obtain the candidate characteristic gene set based on the consistency of the unsupervised clustering result and the disease label comprises:
step one, carrying out differential analysis on the expression values of genes in platelets of patients with uterine fibroids and uterine sarcoma, and ranking the genes by the degree of difference from largest to smallest to obtain an initial candidate characteristic gene set;
step two, removing the genes with the least difference of gene expression values in the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set;
step three, performing unsupervised clustering on patients with hysteromyoma and uterine sarcoma based on the expression values of all initial candidate feature genes in the updated initial candidate feature gene set, and calculating the consistency of clustering results and disease labels;
and step four, iteratively repeating step two to step three, wherein the updated initial candidate characteristic gene set obtained in the previous iteration is used as the initial candidate characteristic gene set in the next iteration, until the consistency between the clustering result and the disease label no longer increases, and the updated initial candidate characteristic gene set of the previous iteration is then used as the candidate characteristic gene set.
Preferably, the screening the new candidate set of signature genes based on AUC values comprises: and selecting candidate characteristic genes with AUC values larger than a threshold value to form a new candidate characteristic gene set.
Preferably, the training the machine learning model by using the expression values of all genes in the new candidate characteristic gene set and the disease label, and obtaining the trained machine learning model and the corresponding gene combination thereof includes:
the first step, randomly removing 5 genes from all genes in the current new candidate characteristic gene set, then training an SVM model by adopting the expression values of the rest genes and the patient disease label, and recording the accuracy of the SVM model; repeating the operation until all combinations of 5 genes in the new candidate signature gene set are removed;
secondly, selecting a gene set corresponding to the SVM model with highest accuracy from all SVM models obtained through training in the first step as an updated candidate characteristic gene set;
and thirdly, continuously iterating the first step to the second step until the accuracy is not increased any more, taking the candidate characteristic gene set in the previous iteration period as a finally screened biomarker for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma, and simultaneously taking the SVM model in the previous iteration period as a machine learning model for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma.
Preferably, the screening method for biomarkers for discriminating uterine lesion categories provided by the invention further comprises the step of validating the biomarkers.
In a second aspect, the present invention provides a biomarker for discriminating uterine lesion classification obtained by a screening method for discriminating uterine lesion classification according to the first aspect.
Preferably, the biomarkers for discriminating uterine lesion categories provided by the invention include EZH2, COPG1, SUMO3, CLIP1, GSR, SLA2 and TREML1.
Preferably, the discriminating of the uterine lesion type adopts the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.
In a third aspect the present invention provides the use of a biomarker, or a detection reagent for a biomarker, as described in the second aspect, in the manufacture of a product for discriminating uterine lesion categories.
Preferably, the method for distinguishing uterine lesion type comprises the following steps: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.
Preferably, the detection reagent for the biomarker comprises a reagent for detecting the content or expression level of the biomarker; and/or the product comprises a reagent, a kit, a test paper, a gene chip, a protein chip, a high throughput sequencing platform or a proteomic analysis product.
In a fourth aspect, the invention provides a product for distinguishing uterine lesion categories, which comprises the biomarker for distinguishing uterine lesion categories according to the second aspect.
The beneficial effects of the invention are as follows: the screening method and its application are based on platelet transcriptome sequencing data and combine unsupervised clustering with machine learning to screen a group of biomarkers capable of distinguishing uterine lesions (uterine fibroids and uterine sarcoma); a machine learning model that integrates the information from these biomarkers is obtained to distinguish the two types of uterine lesion, and the discrimination performance of the model can be verified in an independent clinical cohort. Because the invention measures gene expression in the patient's platelets, it enables noninvasive detection in clinical application; in addition, the number of screened biomarkers is small while still carrying comprehensive information, which avoids high detection cost and allows detection by qPCR (real-time fluorescence quantitative polymerase chain reaction).
Drawings
Fig. 1 is a flow chart of a screening method for discriminating uterine lesion type biomarkers according to the invention.
Detailed Description
In order to better understand the above technical solutions, the following description will refer to the drawings and specific embodiments.
As shown in fig. 1, an embodiment of the present invention provides a screening method for determining biomarkers of uterine lesion type, including:
S101, for patients with uterine fibroids and uterine sarcoma, counting the expression value of each gene in platelets according to the transcriptome sequencing data of the platelets;
S102, performing unsupervised clustering based on the expression values of genes in platelets, and screening a candidate characteristic gene set based on the consistency between the unsupervised clustering results and the disease labels, wherein the disease label is uterine fibroid or uterine sarcoma;
S103, for each candidate characteristic gene in the candidate characteristic gene set, drawing an ROC (receiver operating characteristic) curve and calculating an AUC (area under the ROC curve) value according to the distribution of its expression value across all uterine fibroid and uterine sarcoma patients and the disease labels, and screening a new candidate characteristic gene set according to the AUC values;
S104, training a machine learning model using the expression values of all genes in the new candidate characteristic gene set and the disease labels to obtain a trained machine learning model and its corresponding gene combination, and taking the gene combination as the biomarker for distinguishing uterine lesion categories.
Wherein, in step S101, the transcriptome sequencing data of the platelets may be obtained by the following method:
collecting peripheral blood plasma samples from patients with uterine fibroids and from patients with uterine sarcoma, and separating platelets from the plasma by gradient centrifugation;
extracting RNA from the separated platelets, and reverse-transcribing the RNA into complementary DNA;
constructing a sequencing library from the complementary DNA and performing transcriptome sequencing to obtain the transcriptome sequencing data of the platelets.
The reads from transcriptome sequencing are then aligned to the human reference genome and, with reference to the human gene structure annotation, the number of reads falling on exon-intron junctions is counted for each gene as the expression value of that gene in the platelet sample.
The step S102 is performed by the following steps:
Step one, carrying out differential analysis on the expression values of genes in platelets of patients with uterine fibroids and uterine sarcoma, and ranking the genes by the degree of difference from largest to smallest to obtain an initial candidate characteristic gene set; specifically, the Wilcoxon rank-sum test can be used to compare gene expression values between patients with uterine fibroids and patients with uterine sarcoma, yielding a p-value (tail probability) that indicates the degree of differential expression. A smaller p-value suggests a larger difference in expression of the gene between the two patient groups; conversely, a larger p-value suggests a smaller difference. When ranking, all genes may be sorted in ascending order of this p-value. The initial candidate characteristic gene set obtained in step one contains not only the genes themselves but also the expression value of each gene in platelets and the p-value indicating its degree of differential expression.
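The ranking in step one can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses only the Python standard library, computes the two-sided rank-sum p-value with a normal approximation and no tie correction, and the names `average_ranks`, `rank_sum_p`, `rank_genes` and the label strings "fibroid"/"sarcoma" are hypothetical. In practice a statistics library would normally be used instead.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def rank_genes(expr, labels):
    """expr: {gene: [expression value per patient]}; labels: one label per patient.
    Returns [(gene, p_value)] sorted from most to least different (ascending p)."""
    result = []
    for gene, values in expr.items():
        sarcoma = [v for v, lab in zip(values, labels) if lab == "sarcoma"]
        fibroid = [v for v, lab in zip(values, labels) if lab == "fibroid"]
        result.append((gene, rank_sum_p(sarcoma, fibroid)))
    return sorted(result, key=lambda gp: gp[1])
```

The normal approximation is coarse for very small cohorts; an exact test would be preferable there.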
Step two, removing the genes with the least difference of gene expression values in the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set; in the embodiment of the invention, 10 genes with the largest p-value indicating the differential degree of gene expression can be removed from the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set.
Step three, performing unsupervised clustering on patients with uterine fibroids and uterine sarcoma based on the expression values of all initial candidate characteristic genes in the updated initial candidate characteristic gene set, and calculating the consistency between the clustering results and the disease labels; specifically, k-means clustering with the number of clusters set to 2 (k=2) may be used to cluster the patients. In addition, Fisher's exact test can be used to test the consistency between the k-means clustering result of the patients and their disease labels, yielding a consistency p-value. A smaller consistency p-value indicates better consistency between the k-means clustering result and the disease labels; conversely, a larger consistency p-value indicates poorer consistency.
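A minimal sketch of step three, assuming a plain k-means with k=2 and a two-sided Fisher's exact test on the 2x2 table of cluster versus disease label. The function names (`two_means`, `fisher_exact_p`, `consistency_p`), the deterministic centre initialisation and the label strings are assumptions made for illustration; the source does not specify them.

```python
import math

def two_means(points, iters=100):
    """Cluster points (lists of floats) into 2 groups; returns a 0/1 cluster id per point."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    # Assumed initialisation: the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))
    centers = [list(c0), list(c1)]
    assign = None
    for _ in range(iters):
        new_assign = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                      for p in points]
        if new_assign == assign:   # converged
            break
        assign = new_assign
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:            # recompute the centre as the column-wise mean
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def prob(x):  # hypergeometric probability of x in the top-left cell
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def consistency_p(points, labels):
    """points: one expression vector per patient; labels: 'fibroid' or 'sarcoma'."""
    clusters = two_means(points)
    table = [[0, 0], [0, 0]]
    for cl, lab in zip(clusters, labels):
        table[cl][1 if lab == "sarcoma" else 0] += 1
    return fisher_exact_p(table[0][0], table[0][1], table[1][0], table[1][1])
```

Because cluster ids are arbitrary, the two-sided test is used here so that the p-value does not depend on which cluster is numbered 0.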
And step four, iteratively repeating step two to step three, wherein the updated initial candidate characteristic gene set obtained in the previous iteration is used as the initial candidate characteristic gene set in the next iteration, until the consistency between the clustering result and the disease label no longer increases (specifically, the consistency p-value no longer decreases), and the updated initial candidate characteristic gene set of the previous iteration is then used as the candidate characteristic gene set.
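The iteration over steps two to four can be sketched as a simple loop. This is an illustration under stated assumptions: `prune_by_consistency` is a hypothetical name, the consistency p-value is supplied by the caller as a function of the gene list (for example, clustering followed by Fisher's exact test), and the first pruned set is always accepted because the source does not describe a baseline p-value for the unpruned set.

```python
def prune_by_consistency(ranked_genes, consistency_p, drop=10):
    """ranked_genes: genes sorted from most to least different (ascending p-value).
    consistency_p: caller-supplied function, gene list -> consistency p-value.
    Removes the `drop` least-different genes per round while the p-value keeps falling.
    Returns (gene set of the last improving round, its p-value)."""
    best = list(ranked_genes)
    best_p = None
    while len(best) > drop:
        candidate = best[:-drop]          # step two: drop the least-different genes
        p = consistency_p(candidate)      # step three: cluster and test consistency
        if best_p is not None and p >= best_p:
            break                         # step four: consistency no longer improves
        best, best_p = candidate, p
    return best, best_p
```

With the embodiment's setting of 10 genes removed per round, `drop=10` reproduces the described step size.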
In step S103, for each candidate characteristic gene screened in step S102, an ROC curve is drawn and an AUC value is calculated based on the distribution of the expression values of the gene across all patients and the patients' disease labels (uterine fibroid or uterine sarcoma). A larger AUC value indicates that the expression value of the gene distinguishes patients with uterine fibroids from patients with uterine sarcoma more effectively. Candidate characteristic genes whose AUC value is greater than a threshold are selected to form a new candidate characteristic gene set. Specifically, 0.7 can be used as the threshold, and all genes with an AUC value greater than 0.7 form the new candidate characteristic gene set.
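A minimal sketch of the per-gene AUC filter in S103. It computes the AUC as the fraction of (sarcoma, fibroid) patient pairs in which the sarcoma patient has the higher expression value, counting ties as one half, which is equivalent to the area under the ROC curve. The names `auc` and `filter_by_auc` and the label strings are hypothetical. Note that a gene expressed at lower levels in sarcoma would score below 0.5 here; whether the source treats such genes symmetrically is not stated.

```python
def auc(values, labels, positive="sarcoma"):
    """Area under the ROC curve of a single gene's expression values."""
    pos = [v for v, lab in zip(values, labels) if lab == positive]
    neg = [v for v, lab in zip(values, labels) if lab != positive]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def filter_by_auc(expr, labels, threshold=0.7):
    """expr: {gene: [expression value per patient]}. Keep genes with AUC > threshold."""
    return [gene for gene, values in expr.items() if auc(values, labels) > threshold]
```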
The execution of step S104 may be performed as follows:
The first step, arbitrarily removing 5 genes from all genes in the current new candidate characteristic gene set, then training an SVM (support vector machine) model using the expression values of the remaining genes and the patient disease labels, and recording the accuracy of the SVM model; repeating this operation until every combination of 5 genes in the new candidate characteristic gene set has been removed in turn. In each training run after 5 genes have been removed, the expression values of the remaining candidate characteristic genes of each patient are used as the model input, and the patient's disease label is used as the standard for judging the accuracy of the model output. The support vector machine calculates a probability from the input of each patient: if the probability is greater than 0.5, the model judges the patient to be a uterine sarcoma patient; if the probability is less than 0.5, the model judges the patient to be a uterine fibroid patient. The probability is calculated for each patient separately in this manner, and a model-predicted disease label (uterine sarcoma or uterine fibroid) is obtained for each patient according to the above rule. Finally, the accuracy of the model is assessed against each patient's actual disease label (uterine sarcoma or uterine fibroid).
In the invention, the accuracy can be calculated using the following formula: accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives). Here, positive refers to uterine sarcoma and negative refers to uterine fibroid. A true positive is a sample that the machine learning model predicts as uterine sarcoma and whose true label is also uterine sarcoma; a true negative is a sample that the model predicts as uterine fibroid and whose true label is also uterine fibroid; a false positive is a sample that the model predicts as uterine sarcoma but whose true label is uterine fibroid; a false negative is a sample that the model predicts as uterine fibroid but whose true label is uterine sarcoma.
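The accuracy formula can be written out directly. In this short sketch the function name `accuracy` and the label strings are hypothetical, and uterine sarcoma is treated as the positive class as in the text.

```python
def accuracy(predicted, actual, positive="sarcoma"):
    """accuracy = (TP + TN) / (TP + TN + FP + FN), with `positive` as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    tn = sum(p != positive and a != positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with one true positive, two true negatives, one false positive and one false negative, the accuracy is 3/5 = 0.6.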
Secondly, selecting the gene set corresponding to the SVM model with the highest accuracy among all SVM models trained in the first step as the updated candidate characteristic gene set; here, "the gene set corresponding to an SVM model" refers to the input genes used for training that SVM model. The gene set used as input differs for each SVM model, because "the candidate characteristic genes remaining after 5 genes are arbitrarily removed" differ each time. The gene set is the input for model training and is independent of the patient disease labels, which serve as the standard for judging the accuracy of the model output.
And thirdly, continuously iterating the first step to the second step until the accuracy is not increased any more, taking the candidate characteristic gene set in the previous iteration period as a finally screened biomarker which can be used for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma, and simultaneously taking the SVM model in the previous iteration period as a machine learning model for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma.
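The three steps of S104 amount to a backward elimination over gene subsets. The sketch below shows only that search loop: training the SVM and measuring its accuracy is abstracted into a caller-supplied `score` function, which is an assumption made so the example stays self-contained. The name `eliminate_genes` is hypothetical, and the first round is always accepted because the source does not describe a baseline accuracy for the full set.

```python
from itertools import combinations

def eliminate_genes(genes, score, k=5):
    """genes: current candidate genes (unique names).
    score: caller-supplied function, gene list -> accuracy of a model trained on it.
    Each round tries every way of removing k genes and keeps the best subset;
    stops when the best accuracy no longer increases."""
    current = list(genes)
    best_score = None
    while len(current) > k:
        round_best, round_subset = None, None
        for removed in combinations(current, k):        # first step: all k-gene removals
            subset = [g for g in current if g not in removed]
            s = score(subset)
            if round_best is None or s > round_best:    # second step: keep the most accurate
                round_best, round_subset = s, subset
        if best_score is not None and round_best <= best_score:
            break                                       # third step: accuracy stopped rising
        current, best_score = round_subset, round_best
    return current, best_score
```

Each round trains one model per combination, so the cost grows as the binomial coefficient C(n, 5) in the number n of remaining genes; the AUC filter in S103 keeps n small enough for this to be feasible.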
The 7 biomarkers obtained by the screening method provided by the invention comprise EZH2, COPG1, SUMO3, CLIP1, GSR, SLA2 and TREML1, and the differentiation and diagnosis of hysteromyoma and uterine sarcoma can be performed based on the expression values of the 7 biomarkers in platelets.
In a preferred embodiment, the screening method for biomarkers for discriminating uterine lesion categories may further comprise the step of validating the biomarkers. Specifically, the performance of the biomarkers and the machine learning model is validated on an independent clinical cohort. First, peripheral blood plasma samples from patients with uterine fibroids and uterine sarcoma are collected independently of the samples used for characteristic gene screening and machine learning model training; platelets are separated from the plasma by gradient centrifugation, and gene expression values in the platelet samples are measured as described in S101. Second, the expression values of the 7 biomarkers screened in S104 are extracted for each sample. Third, the SVM model constructed in S104 is applied to the expression values of the 7 biomarkers in each sample to classify each patient as having uterine fibroids or uterine sarcoma. Fourth, the classification results from the third step are compared with the actual disease labels of the same patients, and the AUC value of the machine learning model is calculated to measure the effectiveness of the screened biomarkers and the constructed model. Fifth, the AUC value calculated in the fourth step is 0.85, which indicates that the biomarkers and the machine learning model can effectively discriminate between uterine fibroids and uterine sarcoma.
The biomarker, or a detection reagent for the biomarker, can be used to prepare products for discriminating uterine lesion types. The uterine lesion type can be discriminated as follows: the content or expression level of the biomarker is detected in patients with uterine lesions of different categories and used as the input of the trained machine learning model, which outputs a discrimination probability. If the output discrimination probability is greater than 0.5, the uterine lesion is judged to be a uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be a uterine fibroid. The detection reagent for the biomarker may include a reagent for detecting the content or expression level of the biomarker; and/or the product comprises a reagent, a kit, a test paper, a gene chip, a protein chip, a high-throughput sequencing platform or a proteomic analysis product.
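The decision rule above can be written as a short function. This is only a sketch: the function name `discriminate` is hypothetical, and because the text does not say how a probability of exactly 0.5 is handled, returning "undetermined" for that case is an assumption.

```python
def discriminate(probability):
    """Map the model's discrimination probability to a lesion category."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > 0.5:
        return "uterine sarcoma"
    if probability < 0.5:
        return "uterine fibroid"
    # Exactly 0.5 is not covered by the text; this outcome is an assumption.
    return "undetermined"

calls = [discriminate(p) for p in (0.83, 0.12, 0.5)]
```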
The invention also provides a product for distinguishing uterine lesion types, and the product comprises the biomarker provided by the invention or a detection reagent of the biomarker.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (13)
1. A screening method for biomarkers for discriminating uterine lesion categories, comprising:
for patients with hysteromyoma and patients with uterine sarcoma, respectively, quantifying the expression value of each gene in platelets from the transcriptome sequencing data of the platelets;
performing unsupervised clustering based on the expression values of genes in platelets, and screening to obtain a candidate characteristic gene set based on the consistency of unsupervised clustering results and disease labels, wherein the disease labels are uterine fibroids or uterine sarcomas;
drawing ROC curves according to the distribution of the expression value of each candidate characteristic gene in the candidate characteristic gene set in all hysteromyoma and hysterosarcoma patients and the disease label, calculating AUC values, and screening a new candidate characteristic gene set according to the AUC values;
and training a machine learning model by using the expression values of all genes in the new candidate characteristic gene set and the disease label to obtain a trained machine learning model and a gene combination corresponding to the machine learning model, and taking the gene combination as a biomarker for distinguishing uterine lesion types.
2. The method of claim 1, wherein the transcriptome sequencing data of platelets is obtained by:
collecting peripheral blood plasma samples from patients with hysteromyoma and from patients with uterine sarcoma, respectively, and separating platelets from the plasma by gradient centrifugation;
extracting RNA from the separated platelets and reverse-transcribing the RNA into complementary DNA;
constructing a sequencing library from the complementary DNA and performing transcriptome sequencing to obtain the transcriptome sequencing data of the platelets.
3. The method for screening biomarkers for discriminating uterine lesion categories according to claim 1, wherein said performing unsupervised clustering based on the expression values of genes in platelets, and screening to obtain a candidate characteristic gene set based on the consistency of the unsupervised clustering results with the disease labels, comprises:
step one, carrying out differential analysis on the expression values of genes in platelets of patients with hysteromyoma and uterine sarcoma, and ranking the genes in descending order of difference to obtain an initial candidate characteristic gene set;
step two, removing the genes with the smallest difference in expression value from the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set;
step three, performing unsupervised clustering on patients with hysteromyoma and uterine sarcoma based on the expression values of all initial candidate characteristic genes in the updated initial candidate characteristic gene set, and calculating the consistency of the clustering results with the disease labels;
and step four, iteratively repeating step two to step three, wherein the updated initial candidate characteristic gene set obtained in the previous iteration is used as the initial candidate characteristic gene set in the next iteration, until the consistency between the clustering results and the disease labels no longer increases, and taking the updated initial candidate characteristic gene set from the previous iteration as the candidate characteristic gene set.
4. The method of claim 1, wherein screening the new candidate characteristic gene set based on AUC values comprises: selecting candidate characteristic genes with AUC values larger than a threshold value to form the new candidate characteristic gene set.
5. The method for screening biomarkers for discriminating uterine lesion categories according to claim 1, wherein training a machine learning model using the expression values of all genes in the new candidate characteristic gene set and the disease labels to obtain a trained machine learning model and its corresponding gene combination comprises:
a first step of randomly removing 5 genes from all genes in the current new candidate characteristic gene set, then training an SVM model using the expression values of the remaining genes and the patient disease labels, and recording the accuracy of the SVM model; repeating this operation until every combination of 5 genes in the new candidate characteristic gene set has been removed once;
a second step of selecting, from all SVM models trained in the first step, the gene set corresponding to the SVM model with the highest accuracy as an updated candidate characteristic gene set;
and a third step of iterating the first step to the second step until the accuracy no longer increases, taking the candidate characteristic gene set from the previous iteration as the finally screened biomarker for distinguishing and diagnosing hysteromyoma and uterine sarcoma, and taking the SVM model from the previous iteration as the machine learning model for distinguishing and diagnosing hysteromyoma and uterine sarcoma.
6. The method for screening biomarkers for discriminating uterine lesion categories according to claim 1, further comprising the step of: validating the biomarkers.
7. A biomarker for discriminating uterine lesion classification, characterized in that it is obtained by using the screening method for discriminating a biomarker for uterine lesion classification according to any of claims 1-6.
8. The biomarker for discriminating uterine lesion categories according to claim 7 including EZH2, COPG1, SUMO3, CLIP1, GSR, SLA2 and TREML1.
9. The biomarker for discriminating a uterine lesion class according to claim 7, characterized in that the discriminating a uterine lesion class employs the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.
10. Use of a biomarker or a detection reagent for a biomarker in the manufacture of a product for discriminating uterine lesion categories, wherein the biomarker is a biomarker according to claim 7.
11. The use of claim 10, wherein the discrimination of uterine lesion categories is by the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.
12. The use of claim 10, wherein the detection reagent for the biomarker comprises a reagent for detecting the level of expression or the content of the biomarker; and/or the product comprises a reagent, a kit, a test paper, a gene chip, a protein chip, a high throughput sequencing platform or a proteomic analysis product.
13. A product for discriminating uterine lesion classification, characterized in that the product comprises the biomarker for discriminating uterine lesion classification according to claim 7.
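The clustering-based screening loop recited in claim 3 can be illustrated with a minimal sketch. The claims do not name a clustering algorithm or a consistency measure, so the tiny deterministic 2-means routine, the agreement fraction used as "consistency", the removal of one gene per round, and the use of the full ranked set as the starting baseline are all assumptions. The function names, gene names and expression values are hypothetical.

```python
def two_means(points, iters=20):
    """Tiny deterministic 2-means (an assumed clustering method): seeds are
    the first point and the point farthest from it."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [points[0], max(points, key=lambda p: d2(p, points[0]))]
    assign = []
    for _ in range(iters):
        assign = [0 if d2(p, centers[0]) <= d2(p, centers[1]) else 1 for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def consistency(assign, labels):
    """Fraction of samples whose cluster matches the disease label, taking
    the better of the two possible cluster-to-label mappings."""
    agree = sum(a == y for a, y in zip(assign, labels))
    return max(agree, len(labels) - agree) / len(labels)

def screen(expr, labels, ranked_genes):
    """ranked_genes is ordered from most to least differential. Each round
    drops the least differential gene; stop when consistency no longer
    increases and return the gene set from the previous round."""
    def score(genes):
        pts = [tuple(row[g] for g in genes) for row in expr]
        return consistency(two_means(pts), labels)
    current = list(ranked_genes)
    best = score(current)  # assumed baseline: the full ranked set
    while len(current) > 1:
        trial = current[:-1]
        s = score(trial)
        if s <= best:
            break
        best, current = s, trial
    return current, best

# Made-up toy data: 0 = uterine fibroid, 1 = uterine sarcoma.
expr = [
    {"g_info": 1, "g_noise": 0}, {"g_info": 1, "g_noise": 10}, {"g_info": 1, "g_noise": 0},
    {"g_info": 3, "g_noise": 10}, {"g_info": 3, "g_noise": 0}, {"g_info": 3, "g_noise": 10},
]
disease = [0, 0, 0, 1, 1, 1]
candidate_set, agreement = screen(expr, disease, ["g_info", "g_noise"])
```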
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410081934.6A CN117594133A (en) | 2024-01-19 | 2024-01-19 | Screening method of biomarker for distinguishing uterine lesion type and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117594133A (en) | 2024-02-23 |
Family
ID=89913798
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410081934.6A Withdrawn CN117594133A (en) | 2024-01-19 | 2024-01-19 | Screening method of biomarker for distinguishing uterine lesion type and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117594133A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005076005A2 (en) * | 2004-01-30 | 2005-08-18 | Medizinische Universität Wien | A method for classifying a tumor cell sample based upon differential expression of at least two genes |
CN103025890A (en) * | 2010-04-06 | 2013-04-03 | 卡里斯生命科学卢森堡控股 | Circulating biomarkers for disease |
CN103409501A (en) * | 2013-05-10 | 2013-11-27 | 新疆医科大学 | Method for screening candidate plasma protein markers by cervical carcinoma specificity difference expression |
US20180258499A1 (en) * | 2013-05-28 | 2018-09-13 | Beijing Normal University | Neuroglioma molecular subtyping gene group and use thereof |
CN109680060A (en) * | 2017-10-17 | 2019-04-26 | 华东师范大学 | Methylate marker and its application in diagnosing tumor, classification |
CN110444248A (en) * | 2019-07-22 | 2019-11-12 | 山东大学 | Cancer Biology molecular marker screening technique and system based on network topology parameters |
CN111739581A (en) * | 2020-06-12 | 2020-10-02 | 大连理工大学 | Comprehensive screening method for genome variables |
CN112397153A (en) * | 2020-11-18 | 2021-02-23 | 河南科技大学第一附属医院 | Method for screening biomarker for predicting esophageal squamous cell carcinoma prognosis |
CN112927757A (en) * | 2021-02-24 | 2021-06-08 | 河南大学 | Gastric cancer biomarker identification method based on gene expression and DNA methylation data |
CN113862351A (en) * | 2020-06-30 | 2021-12-31 | 清华大学 | Kit and method for identifying extracellular RNA biomarkers in body fluid sample |
CN114582425A (en) * | 2022-03-14 | 2022-06-03 | 上海交通大学医学院附属仁济医院 | NMIBC prognosis prediction molecular marker, screening method and modeling method |
CN115287347A (en) * | 2022-08-01 | 2022-11-04 | 华中农业大学 | Asymptomatic mitral valve myxomatosis-like lesion biomarker for dogs and application thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112888459B (en) | Convolutional neural network system and data classification method | |
CN110444248B (en) | Cancer biomolecule marker screening method and system based on network topology parameters | |
CN110100013A (en) | Method and system for lesion detection | |
US11929148B2 (en) | Systems and methods for enriching for cancer-derived fragments using fragment size | |
US20200219587A1 (en) | Systems and methods for using fragment lengths as a predictor of cancer | |
CN111778326B (en) | Gene marker combination for endometrial receptivity assessment and application thereof | |
KR20210113237A (en) | Characterization of cell-free DNA ends | |
KR20200080272A (en) | Use of nucleic acid size ranges for non-invasive prenatal testing and cancer detection | |
US20220336043A1 (en) | cfDNA CLASSIFICATION METHOD, APPARATUS AND APPLICATION | |
TW201920683A (en) | Enhancement of cancer screening using cell-free viral nucleic acids | |
CN113362893A (en) | Construction method and application of tumor screening model | |
KR101990430B1 (en) | System and method of biomarker identification for cancer recurrence prediction | |
CN117594133A (en) | Screening method of biomarker for distinguishing uterine lesion type and application thereof | |
US20220042106A1 (en) | Systems and methods of using cell-free nucleic acids to tailor cancer treatment | |
CN115803448A (en) | Micronucleus DNA from peripheral red blood cells and uses thereof | |
KR20230007010A (en) | Method and system for predicting metabolic disease risk | |
KR102225231B1 (en) | IDENTIFYING METHOD FOR TUMOR PATIENT BASED ON miRNA IN EXOSOME AND APPARATUS FOR THE SAME | |
CN113393901B (en) | Glioma sorting device based on tumor nucleic acid is gathered to monocyte | |
CN115678999B (en) | Application of marker in lung cancer recurrence prediction and prediction model construction method | |
WO2023102786A1 (en) | Application of gene marker in prediction of premature birth risk of pregnant woman | |
CN116844638A (en) | Child acute leukemia typing system and method based on high-throughput transcriptome sequencing | |
KR20230059423A (en) | Method for diagnosing and predicting cancer type using methylated cell free DNA | |
CN106909767B (en) | System for classifying hepatitis B-related cirrhosis | |
CN117766028A (en) | Method and device for predicting sample sources based on methylation differences | |
CN117095745A (en) | Method and device for detecting fetal aneuploidy and copy number variation in maternal plasma free DNA and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20240223 |