CN116987789A

CN116987789A - UTUC molecular typing, single sample classifier and construction method thereof

Info

Publication number: CN116987789A
Application number: CN202310791539.2A
Authority: CN
Inventors: 金鸽; 赵婷婷; 徐小红; 曹建军
Original assignee: Shanghai Rendong Medical Laboratory Co ltd
Current assignee: Shanghai Rendong Medical Laboratory Co ltd
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-11-03

Abstract

The invention discloses a molecular typing method for upper urinary tract urothelial cancer (UTUC), which types UTUC according to the lncRNA features of UTUC patients. The invention also discloses a single sample classifier for identifying the molecular type of UTUC and a method for constructing it. Transcriptome sequencing data of UTUC patients are analyzed to obtain lncRNA abundance expression data; prognosis-related lncRNAs are then screened and clustered to obtain the molecular typing of UTUC; and a single sample classifier capable of identifying the molecular type is constructed. The influence of a patient's lncRNA abundance expression features on prognosis is thereby analyzed at the molecular level, which facilitates accurate identification and practical application of UTUC molecular typing.

Description

UTUC molecular typing, single sample classifier and construction method thereof

Technical Field

The invention relates to the field of urologic oncology, in particular to upper urinary tract urothelial carcinoma (UTUC), and more particularly to the molecular typing of UTUC and the construction of a single sample classifier.

Background

UTUC is a type of urothelial carcinoma (UC) that occurs in the renal pelvis and ureter. In China, UTUC accounts for about 10%-30% of all UC, significantly higher than the 5%-10% reported in Western countries. UTUC shares some clinicopathological features with urothelial bladder carcinoma (UBC), but it also has distinctive characteristics, such as insidious onset, high grade, strong invasiveness, and a high recurrence rate. Current studies show that factors such as sex, age, tumor stage and grade, and lymph node metastasis may be risk factors affecting its prognosis. Compared with UBC, research on the molecular mechanisms of UTUC development is limited, and accurate biomarkers for molecular typing and prognosis are lacking.

In 2017, Moss TJ et al. of the MD Anderson Cancer Center (USA) reported in European Urology the first integrated analysis of whole-exome sequencing and RNA sequencing of 31 UTUC samples (Moss T J, Qi Y, Xi L, et al. Comprehensive genomic characterization of upper tract urothelial carcinoma [J]. European Urology, 2017, 72(4): 641-649). The whole-exome sequencing (WES) results show that the genes most frequently mutated in UTUC include FGFR3, KMT2D, PIK3CA and TP53. Unsupervised cluster analysis of the RNA sequencing data classified UTUC into four types, characterized as follows: 1) Type I: no PIK3CA mutation, no smoking history, high-grade < pT2 tumors, high recurrence; 2) Type II: 100% FGFR3 mutation, low-grade tumors, smoking history, non-muscle-invasive disease, no recurrence; 3) Type III: 100% FGFR3 mutation, 71% PIK3CA mutation, no TP53 mutation, five recurrences, smoking history, and tumor stage < pT2; 4) Type IV: 62.5% KMT2D mutation, 50% FGFR3 mutation, 50% TP53 mutation, no PIK3CA mutation, high-grade tumors, smoking history, carcinoma in situ, and short survival.

In 2019, Robinson BD et al. of the Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, described the molecular characteristics of 37 high-grade UTUCs. Integrated analysis of WES and RNA-seq data showed that the vast majority were of the luminal-papillary type. UTUC has a T-cell-depleted immune environment and highly expresses FGFR3. Furthermore, sporadic UTUC has a lower tumor mutational burden than UBC.

In 2021, the research group of Seishi Ogawa at Kyoto University, Japan, comprehensively characterized the molecular pathogenesis of UTUC through integrative analysis of gene mutations, copy number variation, DNA methylation and gene expression profiles of 199 UTUC samples. UTUC was first classified into five types according to the mutation status of TP53, MDM2, RAS and FGFR3: hypermutated (5.5%), TP53/MDM2 (37.7%), RAS (HRAS/KRAS/NRAS, 15.1%), FGFR3 (35.2%), and triple-negative (6.5%). In addition, five expression subtypes, C1-C5, were identified by RNA sequencing of 158 UTUC samples followed by unsupervised cluster analysis. Most of the FGFR3-mutated and hypermutated subtypes fall into expression subtype C1; the TP53/MDM2-mutated and triple-negative subtypes fall mainly into expression subtypes C3-C5; and most of the RAS-mutated subtype, together with a subset of the FGFR3-mutated subtype, belongs to expression subtype C2. The authors also performed unsupervised cluster analysis based on the DNA methylation status of tumor-specific CpG islands, yielding three DNA methylation subclasses.

In general, few molecular typing studies of UTUC have been carried out so far, and they are based primarily on genomic and mRNA transcriptome data. There is an urgent need to integrate omics information from other dimensions to explore comprehensively the biological processes of disease invasion, recurrence and progression. lncRNA (long non-coding RNA) refers to non-coding RNA longer than 200 nucleotides; it is highly heterogeneous and is mainly involved in transcriptional regulation, post-transcriptional regulation, translational regulation, mediation of chromatin modification, and the like. lncRNA can be extracted non-invasively from body fluids, tissues and cells. In recent years, lncRNA has received extensive attention and is believed to be involved in developmental processes and various diseases.

Disclosure of Invention

One of the technical problems to be solved by the invention is to provide a molecular typing method for the upper urinary tract urothelial cancer, which can analyze the prognosis difference between different UTUC patients from the lncRNA level.

In order to solve the technical problems, the molecular typing method for the upper urinary tract urothelial cancer provided by the invention is used for carrying out molecular typing on the upper urinary tract urothelial cancer according to the lncRNA characteristics of an upper urinary tract urothelial cancer patient, and comprises the following steps of:

1) Obtaining tumor tissue transcriptome sequencing data and clinical information of patients with upper urinary tract urothelial cancer;

2) Aligning the sequencing data obtained in step 1) to a human reference genome, and annotating genes using the GTF gene annotation file of the corresponding version of the reference genome;

3) Quantifying, filtering, normalizing and log2-transforming the genes annotated in step 2);

4) Screening lncRNAs that are related to the prognosis of upper urinary tract urothelial cancer and whose abundance expression values vary widely across all patient samples, as candidate typing features;

5) Performing cluster analysis on all patient samples based on the expression matrix of the candidate typing features obtained in step 4), to obtain the optimal molecular typing result for upper urinary tract urothelial cancer.

In step 1) above, the clinical information includes the progression-free survival time and the progression-free survival status.

In step 3) above, the normalization preferably uses the TPM normalization method.

In step 4) above, the typing features are preferably screened using, in sequence, a univariate Cox proportional hazards model, a LASSO model and the median absolute deviation (MAD).

Step 5) above is preferably performed by consensus cluster analysis. The optimal molecular typing result classifies upper urinary tract urothelial cancer into three molecular subtypes: type I, type II and type III.

The second technical problem to be solved by the present invention is to provide a group of markers for molecular typing of upper urinary tract urothelial cancer, the group of markers comprising 46 lncRNA shown in the following table 1:

TABLE 1

ENSG00000235491	ENSG00000203706	ENSG00000228873	ENSG00000283684
ENSG00000259439	ENSG00000285280	ENSG00000226674	ENSG00000231246
ENSG00000204588	ENSG00000289326	ENSG00000240040	ENSG00000125462
ENSG00000224165	ENSG00000224616	ENSG00000289062	ENSG00000224559
ENSG00000225087	ENSG00000226780	ENSG00000203709	ENSG00000229021
ENSG00000233593	ENSG00000225643	ENSG00000175147	ENSG00000225077
ENSG00000291077	ENSG00000287670	ENSG00000189223	ENSG00000289305
ENSG00000226994	ENSG00000286572	ENSG00000227088	ENSG00000230186
ENSG00000289033	ENSG00000228971	ENSG00000238122	ENSG00000287064
ENSG00000228794	ENSG00000224875	ENSG00000228044	ENSG00000287628
ENSG00000289077	ENSG00000287305	ENSG00000228852	ENSG00000231407
ENSG00000289483	ENSG00000288007

The third technical problem to be solved by the invention is to provide a single sample classifier for typing upper urinary tract urothelial cancer. The single sample classifier mainly comprises a storage module and a correlation calculation module. The storage module stores, for each UTUC subtype, the abundance expression center point values of all subtype-specific lncRNA features of UTUC in the samples of that subtype. The correlation calculation module calculates the correlation (which may be the Pearson correlation or the Spearman correlation) between the abundance expression values of all subtype-specific lncRNA features in the UTUC sample whose subtype is to be identified and the abundance expression center point values of the specific lncRNA features of each subtype stored in the storage module.

The fourth technical problem to be solved by the invention is to provide a construction method of the above-mentioned single sample classifier for upper urinary tract urothelial carcinoma, which specifically comprises the following steps:

1) Obtaining tumor tissue transcriptome sequencing data of a patient with the upper urinary tract urothelial cancer;

4) For each molecular subtype, calculating the AUC value of each lncRNA obtained in step 3) for predicting that subtype, and retaining lncRNAs with an AUC value greater than 0.7 as the specific lncRNA features of that subtype;

5) Merging and de-duplicating the specific lncRNA features screened in step 4) to obtain all subtype-specific lncRNA features of upper urinary tract urothelial cancer;

6) Calculating the average abundance expression value of each specific lncRNA feature obtained in step 5) in the samples of each subtype, and taking it as the abundance expression center point value of that feature, so that each subtype finally has a set of data containing the center point values of all subtype-specific lncRNA features.

The fifth technical problem to be solved by the present invention is to provide a method for typing and identifying a UTUC sample by using the single sample classifier, the method comprising the steps of:

calculating the correlation between the abundance expression values of all subtype-specific lncRNA features in the UTUC samples requiring identification of subtype categories and a set of abundance expression center point values containing all subtype-specific lncRNA features corresponding to each subtype, and classifying the samples into the subtype corresponding to the highest correlation.

All subtype-specific lncRNA features of UTUC and their abundance expression center point values in samples of the three molecular subtypes I, II and III of UTUC are preferably as shown in Table 3:

TABLE 3

The invention analyzes transcriptome sequencing data of UTUC to obtain lncRNA data, and then screens lncRNAs related to UTUC prognosis for consensus clustering, obtaining three lncRNA-based molecular types of UTUC. The invention further constructs a UTUC single sample classifier by screening the specific lncRNA features corresponding to each molecular subtype. The single sample classifier can identify the lncRNA molecular type of a UTUC patient, thereby enabling subtype identification and prognostic stratification of UTUC patients.

Drawings

Fig. 1 is a consensus clustering diagram of the present invention when k=3.

Fig. 2 shows that the three lncRNA-based molecular types of Example 1 of the present invention are significantly related to progression-free survival (PFS).

Fig. 3 shows the prognostic differences between the types after typing 403 urothelial cancer samples from the TCGA public database using the single sample classifier of Example 2 of the present invention.

Detailed Description

For a more specific understanding of the technical content, features and effects of the present invention, the technical solution of the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments:

Example 1: lncRNA-based UTUC molecular typing

1. Collecting the lncRNA abundance expression data and clinical data of UTUC

Tumor tissue transcriptome sequencing (RNA-seq) data of 156 UTUC patients (dataset accession EGAD00001007667) were downloaded from the EGA public database as BAM files aligned to the human reference genome hg19. The GTF gene annotation file corresponding to hg19, gencode.v43lift37.annotation.gtf, was downloaded from the GENCODE website, and genes were annotated and quantified using the featureCounts tool. Genes whose gene type in the GTF annotation file is "lncRNA" and whose median expression abundance is greater than 0 were retained, finally yielding 7698 lncRNAs.
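The filtering rule described above can be sketched as follows. This is a minimal illustration, not the tool chain used in the example: the function names are hypothetical, the count table is assumed to be a plain dictionary of per-sample values, and the attribute key `gene_type` is assumed to follow the GENCODE GTF convention.

```python
import re
from statistics import median

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def lncrna_gene_ids(gtf_lines):
    """Collect gene_id values of 'gene' records annotated as lncRNA."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get("gene_type") == "lncRNA":
            ids.add(attrs["gene_id"])
    return ids

def filter_expressed_lncrnas(counts, lncrna_ids):
    """Keep lncRNA genes whose median value across samples is > 0.

    counts: dict mapping gene_id -> list of per-sample values (assumed layout).
    """
    return {g: v for g, v in counts.items()
            if g in lncrna_ids and median(v) > 0}

# Toy data standing in for a GTF file and a quantification table.
toy_gtf = [
    "##description: toy annotation\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "lncRNA"; gene_name "A";\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t+\t.\tgene_id "G2"; gene_type "protein_coding"; gene_name "B";\n',
    'chr1\tHAVANA\tgene\t400\t500\t.\t-\t.\tgene_id "G3"; gene_type "lncRNA"; gene_name "C";\n',
    'chr1\tHAVANA\texon\t1\t100\t.\t+\t.\tgene_id "G4"; gene_type "lncRNA";\n',
]
toy_counts = {"G1": [0, 5, 9], "G2": [3, 3, 3], "G3": [0, 0, 7]}
kept = filter_expressed_lncrnas(toy_counts, lncrna_gene_ids(toy_gtf))
```

In the toy data, G3 is an lncRNA but is dropped because its median is 0, and G2 is dropped because it is protein coding.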

TPM standardization and log2 conversion are carried out on the abundance expression values of 7698 lncRNAs, and standardized lncRNA abundance expression values are obtained, wherein the TPM standardization conversion formula is as follows:
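A conversion consistent with the symbol definitions that follow can be written as below. This is a sketch that assumes the standard FPKM and TPM definitions; the summation index k, running over all retained genes of sample i, is an assumption introduced here for illustration.

```latex
F_{ij} = \frac{R_{ij} \times 10^{9}}{L_{j}\, T_{i}}, \qquad
\mathrm{TPM}_{ij} = \frac{F_{ij}}{\sum_{k} F_{ik}} \times 10^{6}
```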

where i is the sample index, j is the gene index, R_ij is the read count of gene j in sample i, F_ij is the FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) value of gene j in sample i, L_j is the length of the coding region of gene j, and T_i is the number of sequencing reads of sample i.
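The normalization and log2 transformation can be sketched for a single sample as follows. This is a minimal sketch assuming the standard FPKM-to-TPM conversion; the function names and the pseudocount of 1 in the log2 step are assumptions, since the text does not state how zero values are handled.

```python
import math

def fpkm(read_counts, lengths_bp):
    """FPKM per gene for one sample: R * 1e9 / (L * T)."""
    total_reads = sum(read_counts.values())   # T_i
    return {g: r * 1e9 / (lengths_bp[g] * total_reads)
            for g, r in read_counts.items()}

def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM values so that they sum to one million."""
    total = sum(fpkm_values.values())
    return {g: f / total * 1e6 for g, f in fpkm_values.items()}

def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount is an assumed convention."""
    return {g: math.log2(v + pseudocount) for g, v in values.items()}

# Toy sample: two genes with the same read density.
counts = {"a": 100, "b": 300}
lengths = {"a": 1000, "b": 3000}
tpm = tpm_from_fpkm(fpkm(counts, lengths))
log_tpm = log2_transform(tpm)
```

Because both toy genes have the same number of reads per base, they receive equal TPM values, which illustrates the length correction.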

In addition, clinical information corresponding to the 156 patients described above, including progression-free survival time and progression-free survival status, was collected.

2. Selection of lncRNA associated with prognosis of UTUC

The 7698 lncRNAs were screened using, in sequence, a univariate Cox proportional hazards model, a LASSO model and the median absolute deviation (MAD). First, lncRNAs whose statistical test p value in the univariate Cox proportional hazards analysis was less than 0.05 were retained. Next, the LASSO model analysis was repeated 100 times, and lncRNAs with non-zero coefficients in more than 60 of the runs were selected; these lncRNAs are considered to be significantly associated with the progression-free survival of the patients. Finally, the MAD of each of these prognosis-related lncRNAs was calculated, the lncRNAs were ranked by MAD from largest to smallest, and the top 50% were selected as candidate typing features, yielding 46 lncRNAs that are related to prognosis and vary widely in abundance expression (their gene IDs in the Ensembl database are shown in Table 1).
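The last two screening stages can be sketched as follows, taking the per-run LASSO selections as given. This is a minimal sketch under stated assumptions: the function names are hypothetical, "more than 60 runs" is read as a strict inequality, and "top 50%" is rounded down; the Cox and LASSO fits themselves are not shown.

```python
from collections import Counter
from statistics import median

def stable_lasso_features(selected_per_run, min_runs=60):
    """Keep features with non-zero coefficients in more than min_runs runs.

    selected_per_run: one collection of selected feature IDs per LASSO run.
    """
    counts = Counter(g for run in selected_per_run for g in set(run))
    return sorted(g for g, n in counts.items() if n > min_runs)

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def top_half_by_mad(expr, genes):
    """Rank genes by MAD (largest first) and keep the top 50%.

    expr: dict mapping gene -> list of per-sample expression values.
    """
    ranked = sorted(genes, key=lambda g: mad(expr[g]), reverse=True)
    return ranked[: len(ranked) // 2]   # rounding down is an assumption

# Toy example: 100 runs in which "a" is always selected, "b" 61 times, "c" 39 times.
runs = [["a", "b"]] * 61 + [["a", "c"]] * 39
stable = stable_lasso_features(runs)

toy_expr = {"a": [1, 1, 1, 1], "b": [0, 5, 10, 15],
            "c": [2, 3, 2, 3], "d": [0, 100, 0, 100]}
candidates = top_half_by_mad(toy_expr, ["a", "b", "c", "d"])
```

Selection frequency across repeated runs is used because a single LASSO fit with cross-validated penalty can vary with the random fold assignment.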

TABLE 1: 46 lncRNAs related to UTUC prognosis and with large variation in abundance expression

3. Consensus clustering to obtain molecular typing

Based on the 46 screened lncRNAs related to UTUC prognosis, consensus clustering was performed on all patient samples using the CancerSubtypes package, with the clustering algorithm set to pam, the distance method set to "pearson", and the number of clusters set to 2-4. The average silhouette coefficient of the samples was calculated for each of the 2-4 cluster solutions, and the three-cluster solution (k=3), which had the largest average silhouette coefficient, was taken as the optimal clustering result. This gave the final lncRNA-based UTUC molecular typing result (shown in Fig. 1), in which 61 patients belong to type I, 34 patients to type II and 61 patients to type III.
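The criterion used to choose k can be sketched as follows. This is a minimal sketch of the average silhouette coefficient for a given labeling, assuming that the "pearson" distance is 1 minus the Pearson correlation between samples; it does not reproduce the consensus clustering or the pam algorithm, and the function names are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def average_silhouette(samples, labels):
    """Mean silhouette width using 1 - Pearson correlation as distance."""
    n = len(samples)
    dist = [[1.0 - pearson(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n)
               if j != i and labels[j] == labels[i]]
        if not own:                       # singleton cluster: width 0 by convention
            widths.append(0.0)
            continue
        a = mean(own)                     # mean distance within own cluster
        b = min(mean([dist[i][j] for j in range(n) if labels[j] == other])
                for other in set(labels) if other != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)

# Toy data: two samples trend upward, two trend downward across features.
toy = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 1.5]]
good = average_silhouette(toy, [0, 0, 1, 1])
bad = average_silhouette(toy, [0, 1, 0, 1])
```

In practice the function would be evaluated on the labelings obtained for k = 2, 3 and 4, and the k with the largest value kept.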

4. Validation of the prognostic stratification ability of the molecular typing

Based on the UTUC molecular typing results obtained above, the prognostic differences among the three types were assessed by univariate Kaplan-Meier survival analysis. As shown in Fig. 2, type I has the best prognosis and type III the worst, and the log-rank test p value among the three types is less than 0.001. This indicates that progression-free survival differs significantly among UTUC patients of different types, and that the molecular typing of this example can distinguish the prognostic risk of UTUC patients.
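The survival curves compared in this analysis can be illustrated with a minimal Kaplan-Meier estimator for one group. This is a sketch only: the function name and the 1/0 coding of the progression-free survival status are assumptions, and the log-rank test is not implemented here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up time per patient.
    events: 1 if progression was observed at that time, 0 if censored (assumed coding).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)           # patients leaving at t
        d_at_t = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        if d_at_t:
            survival *= 1.0 - d_at_t / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
        i += n_at_t
    return curve

# Toy group of five patients; two are censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Computing one such curve per subtype and plotting them together gives the kind of comparison shown in the survival figures.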

Example 2: UTUC typing identification

1. Construction of a single sample classifier based on the molecular typing labels

(1) Screening for subtype-specific lncRNA signatures

Using the 7698 lncRNAs of the 156 samples collected in Example 1, the AUC value of each lncRNA for predicting each UTUC molecular subtype was calculated with the auc function of the R package pROC, and lncRNAs with an AUC value greater than 0.7 were retained as the specific features of that subtype. Type I yielded 7 specific lncRNA features, type II yielded 148, and type III yielded 6; 3 of these features appear in two subtypes. After all specific lncRNA features were merged and de-duplicated, 158 subtype-specific lncRNA features were finally obtained (their gene IDs in the Ensembl database are shown in Table 2).
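The one-versus-rest AUC screening can be sketched in plain Python as follows. This is a minimal sketch rather than a reimplementation of pROC: the function names are hypothetical, and taking the larger of AUC and 1 - AUC is an assumption meant to make the score independent of whether the lncRNA is higher or lower in the subtype.

```python
def auc_one_vs_rest(values, labels, subtype):
    """Rank-based AUC of one feature for separating `subtype` from the rest."""
    pos = [v for v, l in zip(values, labels) if l == subtype]
    neg = [v for v, l in zip(values, labels) if l != subtype]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1.0 - auc)            # direction-agnostic (assumption)

def subtype_specific_features(expr, labels, threshold=0.7):
    """Map each subtype to the features whose AUC exceeds the threshold."""
    return {s: sorted(g for g, v in expr.items()
                      if auc_one_vs_rest(v, labels, s) > threshold)
            for s in sorted(set(labels))}

def merge_features(per_subtype):
    """Union of all subtype-specific features, de-duplicated."""
    return sorted({g for genes in per_subtype.values() for g in genes})

# Toy data: g1 separates the subtypes, g2 is uninformative.
toy_labels = ["I", "I", "I", "II", "II", "III"]
toy_expr = {"g1": [1, 2, 3, 10, 11, 12], "g2": [5, 5, 5, 5, 5, 5]}
specific = subtype_specific_features(toy_expr, toy_labels)
all_features = merge_features(specific)
```

The merge step explains the arithmetic in the text: 7 + 148 + 6 = 161 features, of which 3 are shared between two subtypes, leaving 158 unique features.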

TABLE 2: Specific lncRNA features of the three UTUC subtypes

(2) Calculating the abundance expression center point value of each subtype-specific lncRNA feature

The average expression value of each of the 158 specific lncRNA features was calculated separately in the samples of each of the three subtypes and taken as the center point value of that feature. Each subtype thus obtains a set of data containing the center point values of all specific lncRNA features (see Table 3), which can be used for UTUC single sample classification.
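The center point calculation amounts to a per-subtype mean of each feature, which can be sketched as follows. The function name and the dictionary layout of the expression matrix are assumptions made for illustration.

```python
from statistics import mean

def subtype_centroids(expr, labels):
    """Mean expression of every feature within each subtype.

    expr:   dict mapping feature -> list of values, one per sample.
    labels: subtype label of each sample, in the same order.
    Returns dict mapping subtype -> {feature: center point value}.
    """
    centroids = {}
    for subtype in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == subtype]
        centroids[subtype] = {g: mean([v[i] for i in idx])
                              for g, v in expr.items()}
    return centroids

# Toy data: four samples, two features, two subtypes.
toy_expr = {"g1": [1, 3, 10, 20], "g2": [2, 2, 4, 8]}
toy_centroids = subtype_centroids(toy_expr, ["I", "I", "II", "II"])
```

Storing only these center point values is what allows a new sample to be typed on its own, without re-clustering the training cohort.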

TABLE 3: Center point values of the 158 subtype-specific lncRNA features

2. UTUC molecular typing identification of new samples

A urothelial cancer dataset comprising 403 urothelial cancer samples with complete gene abundance expression matrices and clinical information (including overall survival time) was downloaded from the TCGA public database and used to verify the single sample classifier constructed in this example. For each sample, the Pearson correlation between the abundance expression values of the 158 subtype-specific lncRNA features in the sample and the center point values of those features for each subtype in Table 3 was calculated, and the sample was assigned to the subtype with the highest correlation. In total, 111 samples were identified as type I, 180 as type II and 112 as type III.
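The assignment rule can be sketched as a nearest-centroid classifier under Pearson correlation. This is a minimal sketch: the function names are hypothetical and the center point values below are toy numbers, not the values of Table 3.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def classify_sample(sample, centroids):
    """Assign a sample to the subtype whose center points correlate best.

    sample:    dict mapping feature -> expression value in the new sample.
    centroids: dict mapping subtype -> {feature: center point value}.
    Returns (best subtype, {subtype: correlation}).
    """
    features = sorted(sample)
    x = [sample[g] for g in features]
    scores = {s: pearson(x, [c[g] for g in features])
              for s, c in centroids.items()}
    return max(scores, key=scores.get), scores

# Toy center points for three features (illustrative values only).
toy_centroids = {
    "I":   {"g1": 1, "g2": 5, "g3": 2},
    "II":  {"g1": 5, "g2": 1, "g3": 3},
    "III": {"g1": 2, "g2": 2, "g3": 6},
}
subtype, scores = classify_sample({"g1": 1.2, "g2": 4.5, "g3": 2.1}, toy_centroids)
```

Because correlation is insensitive to a sample-wide shift or scaling of expression values, this rule tolerates some differences in overall signal between the training cohort and a new sample.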

Based on the molecular typing results obtained, the prognostic differences among the three types were assessed by univariate Kaplan-Meier survival analysis. As shown in Fig. 3, type I has the best prognosis and type III the worst, and the log-rank test p value among the three types is less than 0.05. This is consistent with the trend observed in the earlier typing and indicates that the single sample classifier of this example can identify the UTUC molecular type of a new sample.

The foregoing embodiments are merely examples of possible or preferred embodiments of the present invention, which are not intended to limit the scope of the present invention, and therefore, all equivalent changes and modifications that are consistent with the scope of the present invention shall fall within the scope of the present invention.

Claims

1. A method for molecular typing of upper urinary tract urothelial cancer, said method not being used for the diagnosis and treatment of diseases, characterized in that the molecular typing of upper urinary tract urothelial cancer is performed on the basis of the lncRNA features of patients with upper urinary tract urothelial cancer.

2. The molecular typing method according to claim 1, wherein the method comprises the steps of:

3. The molecular typing method of claim 2, wherein step 1) the clinical information includes progression free survival time and progression free survival status.

4. The molecular typing method of claim 2, wherein in step 3), the filtering comprises: retaining genes whose gene type in the GTF gene annotation file is "lncRNA" and whose median expression abundance is greater than 0.

5. The molecular typing method of claim 2, wherein in step 3), the normalization process uses a TPM normalization method, and a TPM normalization conversion formula is:

where i is the sample index, j is the gene index, R_ij is the read count of gene j in sample i, F_ij is the FPKM value of gene j in sample i, L_j is the length of the coding region of gene j, and T_i is the number of sequencing reads of sample i.

6. The molecular typing method according to claim 2, wherein in step 4), the typing features are screened using, in sequence, a univariate Cox proportional hazards model, a LASSO model, and the median absolute deviation.

7. The molecular typing method according to claim 6, wherein the screening comprises: retaining lncRNAs with a p value less than 0.05 in the univariate Cox proportional hazards model analysis; repeating the LASSO model analysis 100 times and retaining lncRNAs with non-zero coefficients in more than 60 of the runs; calculating the median absolute deviation of these lncRNAs and ranking them from largest to smallest; and selecting the lncRNAs in the top 50% by median absolute deviation as candidate typing features.

8. The method of molecular typing according to any one of claims 2, 6 or 7, wherein in step 4), the typing profile comprises 46 lncRNA as shown in table 1:

TABLE 1

9. The molecular typing method of claim 2, wherein step 5) employs consensus cluster analysis, the average silhouette coefficient being calculated and the clustering with the highest average silhouette coefficient being taken as the optimal clustering result.

10. The molecular typing method according to claim 2, wherein in step 5), the optimal molecular typing result classifies upper urinary tract urothelial cancer into three molecular subtypes: type I, type II and type III.

11. An upper urinary tract urothelial cancer molecular typing marker, comprising 46 lncRNA as shown in table 1:

TABLE 1

12. A single sample classifier for typing upper urinary tract urothelial cancer, characterized by comprising a storage module and a correlation calculation module, wherein the storage module stores, for each UTUC subtype, the abundance expression center point values of all subtype-specific lncRNA features of UTUC in the samples of that subtype; and the correlation calculation module calculates the correlation between the abundance expression values of all subtype-specific lncRNA features in the UTUC sample whose subtype is to be identified and the abundance expression center point values of the specific lncRNA features of each subtype stored by the storage module.

13. The single sample classifier of claim 12, wherein the subtype-specific lncRNA features and their abundance expression center point values in samples of the three molecular subtypes of upper urinary tract urothelial cancer are shown in Table 3:

TABLE 3

14. A method of constructing a single sample classifier for upper urinary tract urothelial carcinoma according to claim 12 or 13, said method not being used for disease diagnosis and treatment purposes, comprising the steps of:

15. The method of claim 14, wherein in step 3), the filtering comprises: retaining genes whose gene type in the GTF gene annotation file is "lncRNA" and whose median expression abundance is greater than 0.

16. The method of claim 14, wherein in step 3), the normalization process uses a TPM normalization method, and the TPM normalization conversion formula is:

17. The method of claim 14, wherein in step 6), the set of data for each subtype comprising the center point values of all subtype-specific lncRNA features is as shown in Table 3:

TABLE 3

18. A method for typing an upper urinary tract urothelial cancer sample using the single sample classifier of claim 12 or 13, said method not being used for the diagnosis and treatment of diseases, comprising the steps of:

calculating the correlation between the abundance expression values of all subtype-specific lncRNA features in the upper urinary tract urothelial cancer sample of which the subtype types need to be identified and a group of abundance expression central point values containing all subtype-specific lncRNA features corresponding to each subtype, and classifying the sample into the subtype corresponding to the highest correlation.