CN117594133A

CN117594133A - Screening method of biomarker for distinguishing uterine lesion type and application thereof

Info

Publication number: CN117594133A
Application number: CN202410081934.6A
Authority: CN
Inventors: 季序我; 赵义; 李哲
Original assignee: Beijing Pukang Ruiren Medical Laboratory Co ltd; Predatum Biomedicine Suzhou Co ltd; Precision Scientific Technology Beijing Co ltd
Current assignee: Beijing Pukang Ruiren Medical Laboratory Co ltd; Predatum Biomedicine Suzhou Co ltd; Precision Scientific Technology Beijing Co ltd
Priority date: 2024-01-19
Filing date: 2024-01-19
Publication date: 2024-02-23

Abstract

The invention discloses a method for screening biomarkers that distinguish uterine lesion categories, and applications thereof, and belongs to the technical field of biomarkers. The screening method comprises the following steps: first, counting the expression value of each gene in platelets; then performing unsupervised clustering and screening a candidate characteristic gene set based on the consistency between the unsupervised clustering results and the disease labels; next, drawing an ROC curve, calculating an AUC value, and screening a new candidate characteristic gene set according to the AUC value; and finally, training a machine learning model with the expression values of all genes in the new candidate characteristic gene set and the disease labels to obtain a trained machine learning model and its corresponding gene combination, and taking the gene combination as the biomarker for distinguishing uterine lesion categories. The invention enables noninvasive detection in clinical application; the number of biomarkers is small, so the panel carries comprehensive information while avoiding high detection cost, and the biomarkers can be detected by qPCR.

Description

Screening method of biomarker for distinguishing uterine lesion type and application thereof

Technical Field

The invention relates to the technical field of biomarkers, in particular to a screening method of biomarkers for distinguishing uterine lesion categories and application thereof.

Background

Uterine fibroids (hysteromyoma) and uterine sarcoma are two types of uterine lesions whose surgical treatment and prognosis differ greatly. Uterine fibroids are generally treated by myomectomy, the prognosis is good, and postoperative management consists mainly of observation; uterine sarcoma is generally treated by hysterectomy, the prognosis is poor, and drug treatment is also needed after surgery. Accurately discriminating the type of uterine lesion before surgery is therefore particularly important. In current clinical practice, however, accurate discrimination is difficult in two respects: first, the sensitivity for identifying uterine sarcoma is low, and misjudging a uterine sarcoma as a uterine fibroid exposes the patient to a greater survival risk; second, the specificity for identifying uterine fibroids is low, and a patient whose fibroid is misjudged as a uterine sarcoma will undergo hysterectomy and be overtreated, losing fertility and bearing an increased economic burden. To improve the accuracy of discriminating uterine lesions, it is therefore necessary to screen for relevant biomarkers.

Some molecular features have been proposed for distinguishing uterine fibroids from uterine sarcoma, but they suffer from three drawbacks. First, some molecular features are single proteins, such as the proteins encoded by the CA125 gene or the LDH gene, and judging uterine fibroids versus uterine sarcoma solely from the level of a single protein in the patient's blood has low accuracy. Second, some molecular features are derived from genome or transcriptome sequencing of uterine fibroid and uterine sarcoma lesion tissue samples and comparison between the two, and include differing genomic mutations, gene copy number variations, and differentially expressed genes; corresponding differences are not observed in patients' blood, so these features cannot support clinical diagnosis in the form of a noninvasive test. Third, the clinical performance of the molecular features currently used to distinguish uterine fibroids from uterine sarcoma has not been validated in independent clinical cohorts, so the level of evidence is insufficient and their reliability is questionable.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides the following technical scheme.

The first aspect of the present invention provides a screening method for biomarkers for discriminating uterine lesion categories, comprising:

for patients with hysteromyoma and uterine sarcoma, counting the expression value of each gene in the platelets according to the transcriptome sequencing data of the platelets respectively;

performing unsupervised clustering based on the expression values of genes in platelets, and screening to obtain a candidate characteristic gene set based on the consistency of unsupervised clustering results and disease labels, wherein the disease labels are uterine fibroids or uterine sarcomas;

drawing ROC curves according to the distribution of the expression value of each candidate characteristic gene in the candidate characteristic gene set in all hysteromyoma and hysterosarcoma patients and the disease label, calculating AUC values, and screening a new candidate characteristic gene set according to the AUC values;

and training a machine learning model by using the expression values of all genes in the new candidate characteristic gene set and the disease label to obtain a trained machine learning model and a gene combination corresponding to the machine learning model, and taking the gene combination as a biomarker for distinguishing uterine lesion types.

Preferably, the transcriptome sequencing data of the platelets is obtained by the following method:

peripheral blood plasma samples are respectively collected aiming at patients with hysteromyoma and hysterosarcoma, and platelets are separated from the plasma based on a gradient centrifugation experimental method;

extracting RNA from the separated platelets, and reversely transcribing the RNA into complementary DNA;

a sequencing library is constructed for complementary DNA and transcriptome sequencing is performed to obtain transcriptome sequencing data for platelets.

Preferably, the performing unsupervised clustering based on the expression value of the genes in the platelets, and screening to obtain the candidate characteristic gene set based on the consistency of the unsupervised clustering result and the disease label comprises:

step one, performing differential analysis on the expression values of genes in platelets between patients with hysteromyoma and patients with uterine sarcoma, and sorting the genes by degree of difference from large to small to obtain an initial candidate characteristic gene set;

step two, removing the genes with the least difference of gene expression values in the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set;

step three, performing unsupervised clustering on patients with hysteromyoma and uterine sarcoma based on the expression values of all initial candidate feature genes in the updated initial candidate feature gene set, and calculating the consistency of clustering results and disease labels;

and step four, iteratively repeating step two to step three, wherein the updated initial candidate characteristic gene set obtained in the previous iteration is used as the initial candidate characteristic gene set in the next iteration, until the consistency between the clustering result and the disease label no longer increases, whereupon the updated initial candidate characteristic gene set of the previous iteration is taken as the candidate characteristic gene set.

Preferably, the screening the new candidate set of signature genes based on AUC values comprises: and selecting candidate characteristic genes with AUC values larger than a threshold value to form a new candidate characteristic gene set.

Preferably, the training the machine learning model by using the expression values of all genes in the new candidate characteristic gene set and the disease label, and obtaining the trained machine learning model and the corresponding gene combination thereof includes:

the first step, randomly removing 5 genes from all genes in the current new candidate characteristic gene set, then training an SVM model by adopting the expression values of the rest genes and the patient disease label, and recording the accuracy of the SVM model; repeating the operation until all combinations of 5 genes in the new candidate signature gene set are removed;

secondly, selecting a gene set corresponding to the SVM model with highest accuracy from all SVM models obtained through training in the first step as an updated candidate characteristic gene set;

and thirdly, continuously iterating the first step to the second step until the accuracy is not increased any more, taking the candidate characteristic gene set in the previous iteration period as a finally screened biomarker for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma, and simultaneously taking the SVM model in the previous iteration period as a machine learning model for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma.

Preferably, the screening method for biomarkers for discriminating uterine lesion categories provided by the invention further comprises the step of validating the biomarkers.

In a second aspect, the present invention provides a biomarker for discriminating uterine lesion classification obtained by a screening method for discriminating uterine lesion classification according to the first aspect.

Preferably, the biomarkers for discriminating uterine lesion categories provided by the invention include EZH2, COPG1, SUMO3, CLIP1, GSR, SLA2 and TREML1.

Preferably, the discriminating of the uterine lesion type adopts the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.

In a third aspect the present invention provides the use of a biomarker, or a detection reagent for a biomarker, as described in the second aspect, in the manufacture of a product for discriminating uterine lesion categories.

Preferably, the method for distinguishing uterine lesion type comprises the following steps: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.

Preferably, the detection reagent for the biomarker comprises a reagent for detecting the content or expression level of the biomarker; and/or the product comprises a reagent, a kit, a test paper, a gene chip, a protein chip, a high throughput sequencing platform or a proteomic analysis product.

In a fourth aspect, the invention provides a product for distinguishing uterine lesion categories, which comprises the biomarker for distinguishing uterine lesion categories according to the second aspect.

The beneficial effects of the invention are as follows: the screening method for biomarkers for distinguishing uterine lesion categories, and the application thereof, are based on platelet transcriptome sequencing data combined with unsupervised clustering and machine learning. The method screens a group of biomarkers capable of distinguishing uterine lesions (hysteromyoma and uterine sarcoma) and yields a machine learning model that integrates their information to distinguish the two types of uterine lesion; the discrimination performance of the machine learning model can be verified in an independent clinical cohort. Because the invention detects gene expression in the patient's platelets, it enables noninvasive detection in clinical application. In addition, the number of screened biomarkers is small, so the panel carries comprehensive information while avoiding high detection cost, and the biomarkers can be detected by qPCR (real-time fluorescence quantitative polymerase chain reaction).

Drawings

Fig. 1 is a flow chart of a screening method for discriminating uterine lesion type biomarkers according to the invention.

Detailed Description

In order to better understand the above technical solutions, the following description will refer to the drawings and specific embodiments.

As shown in Fig. 1, an embodiment of the present invention provides a screening method for biomarkers for discriminating uterine lesion categories, including:

S101, for patients with hysteromyoma and uterine sarcoma, counting the expression value of each gene in platelets according to the transcriptome sequencing data of the platelets;

S102, performing unsupervised clustering based on the expression values of genes in platelets, and screening to obtain a candidate characteristic gene set based on the consistency between the unsupervised clustering results and the disease labels, wherein the disease labels are uterine fibroids or uterine sarcomas;

S103, drawing an ROC (receiver operating characteristic) curve and calculating an AUC (area under the ROC curve) value according to the distribution of the expression value of each candidate characteristic gene in the candidate characteristic gene set across all hysteromyoma and uterine sarcoma patients and the disease labels, and screening a new candidate characteristic gene set according to the AUC value;

S104, training a machine learning model by using the expression values of all genes in the new candidate characteristic gene set and the disease labels to obtain a trained machine learning model and its corresponding gene combination, and taking the gene combination as a biomarker for distinguishing uterine lesion categories.

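In step S101, the transcriptome sequencing data of the platelets may be obtained by the following method: peripheral blood plasma samples are collected from patients with hysteromyoma and patients with uterine sarcoma, and platelets are separated from the plasma by gradient centrifugation; RNA is extracted from the separated platelets and reverse-transcribed into complementary DNA; a sequencing library is constructed from the complementary DNA, and transcriptome sequencing is performed to obtain the transcriptome sequencing data of the platelets.

The reads obtained from transcriptome sequencing are then aligned to the human reference genome and, with reference to the annotation of human gene structure, the number of reads that fall at the junction of an exon and an intron is counted for each gene as the expression value of that gene in the platelet sample.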

Step S102 may be performed as follows:

Step one, performing differential analysis on the expression values of genes in platelets between patients with hysteromyoma and patients with uterine sarcoma, and sorting the genes by degree of difference from large to small to obtain an initial candidate characteristic gene set. Specifically, the Wilcoxon rank-sum test can be used to compare gene expression values between patients with uterine fibroids and patients with uterine sarcoma, yielding for each gene a p-value (tail probability) that indicates the degree of differential expression. A smaller p-value suggests a larger difference in the gene's expression between the two patient groups; conversely, a larger p-value suggests a smaller difference. When sorting, all genes may be ordered by this p-value from smallest to largest, so that the most differentially expressed genes come first. The initial candidate characteristic gene set obtained in step one contains not only the genes themselves but also the expression value of each gene in platelets and the p-value indicating its degree of differential expression.

Step two, removing the genes with the least difference of gene expression values in the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set; in the embodiment of the invention, 10 genes with the largest p-value indicating the differential degree of gene expression can be removed from the initial candidate characteristic gene set to obtain an updated initial candidate characteristic gene set.

Step three, performing unsupervised clustering on patients with hysteromyoma and uterine sarcoma based on the expression values of all initial candidate characteristic genes in the updated initial candidate characteristic gene set, and calculating the consistency between the clustering results and the disease labels. Specifically, the k-means clustering method may be used to cluster the patients, with the number of clusters set to 2 (k=2). Fisher's exact test can then be used to test the consistency between each patient's k-means cluster assignment and disease label, yielding a consistency p-value. A smaller consistency p-value indicates better consistency between the k-means clustering result and the disease labels; a larger consistency p-value indicates poorer consistency.

And step four, iteratively repeating the step two to the step three, wherein the updated initial candidate characteristic gene set obtained in the previous iteration period is used as the initial candidate characteristic gene set in the next iteration period until the consistency between the clustering result and the disease label is not increased (specifically, the obtained consistency p-value is not reduced), and the updated initial candidate characteristic gene set in the previous iteration period is used as the candidate characteristic gene set.
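Steps one to four of S102 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the function names, the `"fibroid"` / `"sarcoma"` label strings, the normal approximation (without tie correction) for the rank-sum p-value, the deterministic k-means seeding, and the toy data are all hypothetical. The toy call drops 1 gene per iteration instead of 10 so that a three-gene example is meaningful.

```python
import math

def rank_sum_p(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted(list(group_a) + list(group_b))
    rank_of = {}
    for value in set(pooled):
        # Tied values share the average of the 1-based positions they occupy.
        rank_of[value] = pooled.index(value) + (pooled.count(value) + 1) / 2
    n_a, n_b = len(group_a), len(group_b)
    w = sum(rank_of[v] for v in group_a)
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return math.erfc(abs(w - mean_w) / (sd_w * math.sqrt(2)))

def kmeans_two(points, max_iter=100):
    """k-means with k=2, seeded with the first sample and the sample farthest from it."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    farthest = max(points, key=lambda p: dist2(p, points[0]))
    centers = [list(points[0]), list(farthest)]
    assignment = None
    for _ in range(max_iter):
        new = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1 for p in points]
        if new == assignment:
            break
        assignment = new
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(row1 + row2, col1)
    cutoff = prob(a) * (1 + 1e-9)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    return sum(prob(x) for x in support if prob(x) <= cutoff)

def consistency_p(genes, expression, labels):
    """Cluster patients on the given genes and test agreement with the disease labels."""
    points = [[expression[g][i] for g in genes] for i in range(len(labels))]
    table = [[0, 0], [0, 0]]
    for cluster, label in zip(kmeans_two(points), labels):
        table[cluster][0 if label == "fibroid" else 1] += 1
    return fisher_exact_p(table[0][0], table[0][1], table[1][0], table[1][1])

def screen_candidates(expression, labels, n_drop=10):
    """Rank genes, then drop the least different ones while consistency keeps improving."""
    def p_of(gene):
        fib = [v for v, lab in zip(expression[gene], labels) if lab == "fibroid"]
        sar = [v for v, lab in zip(expression[gene], labels) if lab == "sarcoma"]
        return rank_sum_p(fib, sar)
    current = sorted(expression, key=p_of)               # step one: smallest p-value first
    best = None
    while len(current) > n_drop:
        updated = current[:-n_drop]                      # step two: drop largest p-values
        p = consistency_p(updated, expression, labels)   # step three
        if best is not None and p >= best:               # step four: no further improvement
            break
        best, current = p, updated
    return current

# Toy data (hypothetical): three fibroid patients followed by three sarcoma patients.
labels = ["fibroid"] * 3 + ["sarcoma"] * 3
expression = {
    "G1": [1, 2, 1, 9, 8, 9],
    "G2": [2, 1, 2, 8, 9, 8],
    "N1": [50, -50, 50, -50, 50, -50],
}
selected = screen_candidates(expression, labels, n_drop=1)
```

In this sketch the first updated gene set is always accepted, because the text evaluates consistency only after the first removal; stopping when fewer than `n_drop` genes remain is likewise an assumption, since the source does not describe that case.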

Step S103 may be performed as follows: for each candidate characteristic gene screened in step S102, an ROC curve is drawn and an AUC value is calculated from the distribution of the gene's expression values across all patients and the patients' disease labels (uterine fibroids or uterine sarcoma). A larger AUC value indicates that the gene's expression value distinguishes patients with hysteromyoma from patients with uterine sarcoma more effectively. Candidate characteristic genes whose AUC value is greater than a threshold are selected to form a new candidate characteristic gene set. Specifically, 0.7 can be used as the threshold, so that all genes with an AUC value greater than 0.7 form the new candidate characteristic gene set.
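The per-gene ROC and AUC filter of S103 can be illustrated with the short sketch below. It assumes that a higher expression value points toward uterine sarcoma (the source does not state the direction), uses hypothetical function names and label strings, and computes the area with the trapezoidal rule.

```python
def roc_points(values, labels, positive="sarcoma"):
    """ROC points obtained by lowering a cut-off over the observed expression values."""
    n_pos = sum(lab == positive for lab in labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(values), reverse=True):
        tp = sum(v >= cut and lab == positive for v, lab in zip(values, labels))
        fp = sum(v >= cut and lab != positive for v, lab in zip(values, labels))
        points.append((fp / n_neg, tp / n_pos))  # (false positive rate, true positive rate)
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

def select_by_auc(expression, labels, threshold=0.7):
    """Keep genes whose AUC is strictly greater than the threshold."""
    scores = {gene: auc(roc_points(values, labels)) for gene, values in expression.items()}
    return {gene: score for gene, score in scores.items() if score > threshold}

# Toy data (hypothetical): three fibroid patients followed by three sarcoma patients.
labels = ["fibroid"] * 3 + ["sarcoma"] * 3
expression = {
    "GENE_A": [1, 2, 3, 4, 5, 6],   # perfectly separated
    "GENE_B": [1, 5, 3, 2, 6, 4],   # overlapping groups
}
selected = select_by_auc(expression, labels)
```

A gene that is lower in sarcoma than in fibroids would score below 0.5 under this convention; how such genes are handled is not described in the source, so the sketch leaves them out.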

The execution of step S104 may be performed as follows:

The first step, 5 genes are arbitrarily removed from all genes in the current new candidate characteristic gene set, an SVM (support vector machine) model is trained using the expression values of the remaining genes and the patient disease labels, and the accuracy of the SVM model is recorded; this operation is repeated until every combination of 5 genes in the new candidate characteristic gene set has been removed once. In each training run after 5 genes are removed, the expression values of the remaining candidate characteristic genes of each patient are used as the model input, and the patient's disease label is used as the standard against which the accuracy of the model output is checked. The support vector machine calculates a probability from the input of each patient: if the probability is greater than 0.5, the model judges the patient to be a uterine sarcoma patient; if the probability is less than 0.5, the model judges the patient to be a uterine fibroid patient. The probability is calculated for each patient separately in this manner, giving a model-predicted disease label (uterine sarcoma or uterine fibroid) for each patient according to the above rule. Finally, the accuracy of the model is assessed against the patients' actual disease labels.

In the invention, the accuracy can be calculated using the following formula: accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives). Here, positive refers to uterine sarcoma and negative refers to uterine fibroid. A true positive is a sample that the machine learning model predicts as uterine sarcoma and whose true label is also uterine sarcoma; a true negative is a sample predicted as uterine fibroid whose true label is also uterine fibroid; a false positive is a sample predicted as uterine sarcoma whose true label is uterine fibroid; and a false negative is a sample predicted as uterine fibroid whose true label is uterine sarcoma.

The second step, from all SVM models trained in the first step, the gene set corresponding to the SVM model with the highest accuracy is selected as the updated candidate characteristic gene set. Here, "the gene set corresponding to an SVM model" refers to the input genes used to train that SVM model. The gene set used as input differs from one SVM model to another, because the candidate characteristic genes remaining after 5 genes are arbitrarily removed differ each time. The gene set is the input for model training and is independent of the patient disease labels, which serve as the standard for judging the accuracy of the model output.

And thirdly, continuously iterating the first step to the second step until the accuracy is not increased any more, taking the candidate characteristic gene set in the previous iteration period as a finally screened biomarker which can be used for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma, and simultaneously taking the SVM model in the previous iteration period as a machine learning model for distinguishing and diagnosing the hysteromyoma and the uterine sarcoma.
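The three steps of S104 amount to a backward elimination search, which the sketch below illustrates together with the accuracy formula. To stay self-contained it swaps in a nearest-centroid classifier scored on its own training samples as a stand-in for SVM training; that substitution, the function names, the label strings, and the toy data are assumptions for illustration only. The source also does not say whether accuracy is measured on the training samples or on held-out samples.

```python
from itertools import combinations

def accuracy(predicted, actual, positive="sarcoma"):
    """(TP + TN) / (TP + TN + FP + FN), with uterine sarcoma as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    tn = sum(p != positive and a != positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    return (tp + tn) / (tp + tn + fp + fn)

def centroid_score(genes, expression, labels):
    """STAND-IN for SVM training: nearest-centroid classifier scored on its training samples."""
    samples = [[expression[g][i] for g in genes] for i in range(len(labels))]
    centroids = {}
    for lab in sorted(set(labels)):
        rows = [s for s, l in zip(samples, labels) if l == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(sample):
        def dist2(lab):
            return sum((x - c) ** 2 for x, c in zip(sample, centroids[lab]))
        return min(centroids, key=dist2)
    return accuracy([predict(s) for s in samples], labels)

def backward_eliminate(genes, score, n_remove=5):
    """Repeatedly remove the n_remove-gene combination that leaves the best-scoring set."""
    current, best_acc = list(genes), None
    while len(current) > n_remove:
        trials = []
        for removed in combinations(current, n_remove):       # first step: every combination
            kept = [g for g in current if g not in removed]
            trials.append((score(kept), kept))
        round_acc, round_genes = max(trials, key=lambda t: t[0])  # second step: best model
        if best_acc is not None and round_acc <= best_acc:    # third step: no improvement,
            break                                             # keep the previous gene set
        best_acc, current = round_acc, round_genes
    return current, best_acc

# Toy data (hypothetical): one informative gene and two noise genes.
labels = ["fibroid"] * 3 + ["sarcoma"] * 3
expression = {
    "G1": [1, 2, 1, 9, 8, 9],
    "N1": [0.3, -0.3, 0.3, -0.3, 0.3, -0.3],
    "N2": [-0.2, 0.2, 0.2, 0.2, -0.2, -0.2],
}
selected, best_acc = backward_eliminate(
    list(expression), lambda kept: centroid_score(kept, expression, labels), n_remove=1
)
```

The toy call removes one gene per round rather than five, because the number of 5-gene combinations grows quickly with the size of the candidate set and a three-gene example could not otherwise run a round at all.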

The 7 biomarkers obtained by the screening method provided by the invention comprise EZH2, COPG1, SUMO3, CLIP1, GSR, SLA2 and TREML1, and the differentiation and diagnosis of hysteromyoma and uterine sarcoma can be performed based on the expression values of the 7 biomarkers in platelets.

In a preferred embodiment, the screening method for biomarkers for discriminating uterine lesion categories may further comprise the step of validating the biomarkers. Specifically, the performance of the biomarkers and the machine learning model is validated in an independent clinical cohort. First, peripheral blood plasma samples of patients with uterine fibroids and uterine sarcoma are collected independently of the samples used above for characteristic gene screening and machine learning model training, platelets are separated from the plasma by gradient centrifugation, and gene expression values in the platelet samples are measured as described in S101. Second, the expression values of the 7 biomarkers screened in S104 are extracted for each sample. Third, the SVM model constructed in S104 is applied to the expression values of the 7 biomarkers in each sample to classify each patient as having a uterine fibroid or a uterine sarcoma. Fourth, the classification results obtained in the third step are compared with the actual disease labels of the same patients and the AUC value of the machine learning model is calculated; the AUC value measures the effectiveness of the screened biomarkers and the constructed machine learning model. Fifth, the AUC value calculated in the fourth step is 0.85, which shows that the biomarkers and the machine learning model can effectively discriminate between hysteromyoma and uterine sarcoma.
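The third and fourth validation steps can be sketched as follows, assuming the trained model returns a discrimination probability for each patient in the independent cohort. The probabilities below are made-up illustrative values, not data from the invention; the function names and the "undetermined" outcome for a probability of exactly 0.5 are also assumptions, since the source defines only the cases above and below 0.5.

```python
def predict_label(probability):
    """Apply the 0.5 rule to one discrimination probability."""
    if probability > 0.5:
        return "sarcoma"
    if probability < 0.5:
        return "fibroid"
    return "undetermined"  # assumption: the source does not cover exactly 0.5

def cohort_auc(probabilities, labels, positive="sarcoma"):
    """AUC as the fraction of (sarcoma, fibroid) pairs ranked correctly; ties count one half."""
    pos = [p for p, lab in zip(probabilities, labels) if lab == positive]
    neg = [p for p, lab in zip(probabilities, labels) if lab != positive]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

# Hypothetical model outputs for six patients of an independent cohort.
probabilities = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
labels = ["sarcoma", "sarcoma", "sarcoma", "fibroid", "fibroid", "fibroid"]

predicted = [predict_label(p) for p in probabilities]
validation_auc = cohort_auc(probabilities, labels)
```

Computing the AUC from the probabilities rather than from the thresholded labels keeps the full ranking information; whether the invention does the same is not stated in the text.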

The biomarker or the detection reagent of the biomarker can be used for preparing products for distinguishing uterine lesion types. Wherein, the discrimination of the uterine lesion type can be carried out by the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma. The detection reagent for the biomarker may include a reagent for detecting the content or expression level of the biomarker; and/or the product comprises a reagent, a kit, a test paper, a gene chip, a protein chip, a high throughput sequencing platform or a proteomic analysis product.

The invention also provides a product for distinguishing uterine lesion types, and the product comprises the biomarker provided by the invention or a detection reagent of the biomarker.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A screening method for biomarkers for discriminating uterine lesion categories, comprising:

2. The method of claim 1, wherein the transcriptome sequencing data of platelets is obtained by:

3. The method for screening biomarkers for discriminating uterine lesion categories according to claim 1 wherein said performing unsupervised clustering based on gene expression values in platelets and screening based on consistency of unsupervised clustering results with disease signatures to obtain a set of candidate signature genes comprises:

4. The method of claim 1, wherein screening the new candidate signature gene set based on AUC values comprises: selecting candidate characteristic genes with AUC values larger than a threshold value to form a new candidate characteristic gene set.

5. The method for screening biomarkers for uterine lesion classification according to claim 1, wherein training a machine learning model using expression values of all genes in the new candidate signature gene set and disease tags, the obtaining a trained machine learning model and corresponding gene combinations thereof comprises:

6. The method of screening for biomarkers for the discrimination of uterine lesion categories according to claim 1, further comprising the step of: validating the biomarkers.

7. A biomarker for discriminating uterine lesion classification, characterized in that it is obtained by using the screening method for discriminating a biomarker for uterine lesion classification according to any of claims 1-6.

8. The biomarker for discriminating uterine lesion categories according to claim 7 including EZH2, COPG1, SUMO3, CLIP1, GSR, SLA2 and TREML1.

9. The biomarker for discriminating a uterine lesion class according to claim 7, characterized in that the discriminating a uterine lesion class employs the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.

10. Use of a biomarker or a detection reagent for a biomarker in the manufacture of a product for discriminating uterine lesion categories, wherein the biomarker is a biomarker according to claim 7.

11. The use of claim 10, wherein the discrimination of uterine lesion categories is by the following method: detecting the content or the expression level of the biomarker of uterine lesion patients of different categories, taking the content or the expression level of the biomarker as the input of a trained machine learning model, and taking the discrimination probability as the output of the trained machine learning model: if the output discrimination probability is greater than 0.5, judging that the uterine lesion is uterine sarcoma; if the output discrimination probability is less than 0.5, the uterine lesion is judged to be hysteromyoma.

12. The use of claim 10, wherein the detection reagent for the biomarker comprises a reagent for detecting the level of expression or the content of the biomarker; and/or the product comprises a reagent, a kit, a test paper, a gene chip, a protein chip, a high throughput sequencing platform or a proteomic analysis product.

13. A product for discriminating uterine lesion classification, characterized in that the product comprises the biomarker for discriminating uterine lesion classification according to claim 7.