WO2021253544A1 - Model using 87 genes serving as biomarkers to predict cell proliferation activity - Google Patents

Model using 87 genes serving as biomarkers to predict cell proliferation activity

Info

Publication number
WO2021253544A1
Authority
WO
WIPO (PCT)
Prior art keywords
cell
genes
gene
cell proliferation
expression
Application number
PCT/CN2020/101544
Other languages
French (fr)
Chinese (zh)
Inventor
吴超
郑敏
Original Assignee
浙江大学
Application filed by 浙江大学
Publication of WO2021253544A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers

Definitions

  • the invention belongs to the fields of gene technology and biomedicine, and specifically relates to a method for predicting cell proliferation activity using 87 genes as biomarkers
  • the disorderly proliferation of cancer cells is a key mechanism for tumorigenesis.
  • treatment methods such as chemotherapy have been developed.
  • a number of cell proliferation gene markers such as MKI67, MCM2 and PCNA have been developed to use their mRNA or protein expression levels to indicate the proliferation activity of cancer cells, thereby assisting in evaluating the prognosis of patients after surgery.
  • the Ki-67 index was developed to mark the ratio of Ki-67-positive cells in pathological samples, so as to evaluate the prognosis of patients with lung cancer, breast cancer, prostate cancer, cervical cancer, colorectal cancer, bladder cancer, lymphoma and other cancers.
  • Proliferation is not unique to cancer cells. Studies have shown that there are a large number of cells with proliferation activity in human skin, bone marrow and gastrointestinal tissues. When cancer occurs in the above-mentioned tissues, the expression of cell proliferation markers such as MKI67 in cancer tissue samples of patients after surgery is partly derived from cancer cells and partly derived from normal proliferating cells, which will not accurately reflect the proliferation activity of cancer cells. Due to the lack of sufficient data support, the American Society of Clinical Oncology (ASCO) Tumor Marker Steering Committee does not recommend the Ki-67 index as a routine prognostic marker for patients newly diagnosed with breast cancer.
  • the present invention provides a method for evaluating cell proliferation activity using a collection of 87 cell proliferation genes as markers. In order to achieve this objective, the present invention adopts the following technical solutions.
  • single-cell RNA-Seq data of different types of normal cells are obtained from the Tabula Muris database (https://tabula-muris.ds.czbiohub.org/); RNA-Seq data of cancer and adjacent tissues from the Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/); tissue RNA-Seq data from the GTEx (Genotype-Tissue Expression Project) database (https://www.gtexportal.org/); and cell line RNA-Seq data and cell proliferation activity data from the CCLE (Cancer Cell Line Encyclopedia) database (https://portals.broadinstitute.org/ccle).
  • n is the number of cells of cell type i in which the read count of gene j is greater than 0.
  • a) Obtain the expression values of the genes in the stem/progenitor cell-specific gene set in each normal tissue sample in the GTEx database. In most normal tissues, terminally differentiated cells without proliferative activity predominate. The above genes were therefore analyzed by hierarchical clustering, yielding a group of 87 genes with low expression in normal tissues (87 genes including ANLN, ARHGAP11A, ASF1B, ATAD2, AURKA, AURKB, BIRC5, BRCA2, BUB1, BUB1B, CCNA2, CCNB1, CCNB2, CDC20, CDC45, CDCA2, CDCA5, CDCA8, CDK1, CDT1, CENPA, CENPE, CENPF, CENPH, CENPK, CENPM, CENPW, CEP55, CKAP2, CKAP2L, CLSPN, DBF4, DLGAP5, ECT2, ESCO2, FEN1, FOXM1, HIRIP3, HIST1H2AG, HMMR
  • the cell proliferation gene set predicts the proliferation activity of cancer cell lines in vitro
  • a) Obtain the expression values of the genes in the cell proliferation gene set in each cancer cell line in the CCLE database. Similarly, for a gene j (1 ≤ j ≤ 87), calculate its Z-score-normalized expression value Z_j across all cell line samples. For a cell line sample k, list the expression vector of its 87 genes as {Z_{1k}, Z_{2k}, …, Z_{87k}}, and then calculate the expression value of the gene set as the median of these 87 values (median{Z_{1k}, Z_{2k}, …, Z_{87k}}). Calculate the expression value of the cell proliferation gene set for each cell line sample.
  • the 81 cell types are clustered into 2-3 cell type groups. For each group, the vector of cell proliferation gene set expression values is obtained, and the expression values of the different groups are compared (T test, two-tailed). Using P < 0.05 as the threshold, it is judged whether the expression value of the cell proliferation gene set of one cell type group is significantly higher than that of the other groups, thereby identifying the cell types with high and with low expression of cell proliferation genes and evaluating the proliferation activity of the 81 cell types.
  • the expression level of the cell proliferation gene set has been used to evaluate the proliferation activity of 81 normal cell types in vivo.
  • the present invention uses single-cell RNA-Seq data to identify a cell proliferation-related gene marker set composed of 87 genes. Using this set, the proliferation activity of different normal cell types in the body is evaluated, and the normal cell types that proliferate rapidly in the body are comprehensively identified. This technology can help determine whether cancer tissues contain normal cells with proliferation activity. When such cells are abundant in cancer tissue, treatment and evaluation methods that rely on cell proliferation markers are subject to interference and may fail.
  • the advantages of the present invention are as follows. (1) Culture-based methods for judging cell proliferation activity require in vitro culture of normal tissue cells. At present, some tissue cells cannot be cultured in vitro, and others are so affected by culture conditions that their proliferation activity differs greatly between in vivo and in vitro.
  • This method uses single-cell technology to evaluate the proliferation activity of tissue cells directly in the body, and can therefore evaluate it accurately.
  • the normal cell proliferation activity results obtained by this method can assist in determining whether cancer tissues contain a large number of normal cells with proliferation ability, so as to provide guidance for cancer treatment and evaluation methods that target cell proliferation mechanisms.
  • Figure 1 Heat map of cluster analysis of genes highly expressed in the stem/progenitor cell group.
  • one column represents a cell type, and one row represents a gene.
  • Epi-SC indicates epidermal stem cells
  • numbers 1-7 indicate Slamf1-positive multipotent progenitor cells (1), megakaryocyte-erythroid progenitor cells (2), late pro-B cells (3), granulocyte-monocyte progenitor cells (4), granulocytopoietic cells (5), common lymphoid progenitors (6) and pre-natural killer cells (7); together with the epidermal stem cells, these 8 cell types form the stem/progenitor cell group.
  • Figure 2 Heat map of cluster analysis of the stem/progenitor cell-specific gene set in samples from 54 normal tissues.
  • one column represents one sample, one row represents one gene, and samples of the same color belong to the same tissue type.
  • Cluster analysis of the stem/progenitor cell-specific gene set was performed on 17382 samples from 54 normal tissues, and the genes were clustered into 2 gene clusters. The cluster formed by 87 genes was highly expressed only in (1) cultured skin fibroblasts, (2) EBV-transformed lymphocytes and (3) testis, and lowly expressed in the other tissues.
  • Figure 3 Box plot of expression levels of the cell proliferation gene set in different cancers. The expression level of the cell proliferation gene set was obtained for 9630 samples of cancer and adjacent tissues from 32 cancers, and all adjacent tissue samples were merged into one group (Control). The t-test was used to compare the expression level of the cell proliferation gene set in each cancer with that in Control. Using a two-tailed P-value < 0.05 as the criterion, the cancer types highlighted in red have a significantly higher expression level of the cell proliferation gene set than the adjacent tissue group.
  • Figure 4 Correlation analysis between the expression level of cell proliferation gene set and the optimal doubling time of cells. Each point in the figure represents a cell line, the abscissa represents the expression level of the cell line's cell proliferation gene set, and the ordinate represents the doubling time of the cell line (provided by the supplier). Calculate the Pearson correlation coefficient and P-value of the cell proliferation gene set expression level and the optimal cell doubling time.
  • Figure 5 Cluster analysis heat map of 81 different normal cell types in Tabula Muris.
  • one column represents a cell type, and one row represents a gene in a cell proliferation gene set.
  • 81 normal cell types are grouped into three types.
  • Example 1 Using the Tabula Muris, TCGA, GTEx and CCLE databases to establish a cell proliferation gene set containing 87 genes, predict the proliferation activity of the 81 different normal in vivo cell types collected in the Tabula Muris database, and assist in determining whether the cancer tissues in the TCGA database contain a large number of normal cells with proliferation activity, so as to guide cancer treatment and evaluation methods that rely on cell proliferation markers.
  • RNA-Seq data of 81 different normal cell types, generated by Smart-Seq2 single-cell sequencing technology, were obtained from the Tabula Muris database. RNA-Seq data of 9630 cancer and adjacent tissue samples from 32 cancers, and prognostic data for 31 cancers, were obtained from the Cancer Genome Atlas (TCGA) database. RNA-Seq data of 17382 tissue samples from 54 tissues were obtained from the GTEx database. RNA-Seq data of 1019 cell line samples, together with the culture mode (suspension/adherent/semi-adherent) and doubling time of some of the cell lines, were obtained from the CCLE database.
  • the stem/progenitor cell group includes epidermal stem cells (stem cell of epidermis), Slamf1-positive multipotent progenitor cells, megakaryocyte-erythroid progenitor cells, late pro-B cells, granulocyte monocyte progenitor cells, granulocytopoietic cells, common lymphoid progenitors and pre-natural killer cells; the other cell group includes the remaining 73 types of cells.
  • Table 1 81 normal cell types in Tabula Muris database
  • the expression values of the above 87 genes in a total of 9630 samples from 32 cancers (Table 3) in cancer tissues and adjacent tissues in the TCGA database were obtained.
  • for a gene j (1 ≤ j ≤ 87), calculate its Z-score-normalized expression value Y_j across all cancer and adjacent tissue samples.
  • for a sample k, list the expression vector of its 87 genes as {Y_{1k}, Y_{2k}, …, Y_{87k}}, and then calculate the expression value of the gene set as the median of these 87 values (median{Y_{1k}, Y_{2k}, …, Y_{87k}}).
  • the T test is then used to compare the gene set expression values of the samples of each cancer with those of all adjacent tissue samples. Since most cancer tissues are composed of highly proliferating cancer cells, it is further confirmed that in 28 of the 32 cancer types the expression value of the cell proliferation gene set is higher in cancer tissue than in adjacent tissue (P < 0.05, Table 3 and Figure 3).
  • Table 3 Chinese and English comparison of the names of 32 cancers in the TCGA database
  • the CCLE database provides the culture method (suspension/semi-adherent/adherent) and doubling time (provided by the supplier or measured by CCLE staff) of some cell lines. The doubling time provided by the supplier is taken to represent the optimal doubling time of a cell line and is used as an indicator of its proliferation activity. Doubling times (provided by the supplier) were therefore obtained for 99 semi-adherent or adherent cell lines. The expression value of the cell proliferation gene set of these cell lines was found to be negatively correlated with the cell doubling time (Pearson correlation analysis, Figure 4).
  • the shorter the doubling time, the stronger the cell proliferation activity. This result therefore indicates a positive correlation between the expression value of the cell proliferation gene set and cell proliferation activity.
  • solid tumors grow in a semi-adherent or adherent manner, while hematological tumors grow in suspension. This result indicates that the expression value of the cell proliferation gene set of solid tumors can predict their cell proliferation activity.
  • the single cells in the Tabula Muris database are classified into 81 categories by cell type, and the gene expression values of various types of cells are calculated as above. For these 81 different cell types, the expression values of 87 genes in the cell proliferation gene set in each cell type were obtained.
  • the expression value of its cell proliferation gene set was calculated.
  • For a cell type i, take the expression value X_{ji} of each gene j (1 ≤ j ≤ 87), list the expression vector of the 87 genes as {X_{1i}, X_{2i}, …, X_{87i}}, and then calculate the expression value of the cell proliferation gene set as the median of these 87 values (median{X_{1i}, X_{2i}, …, X_{87i}}).
  • the T test, using a two-tailed P < 0.05 as the threshold, showed that the expression value of the cell proliferation gene set of the stem/progenitor cell group was significantly greater than that of the significant proliferation group, and that the expression value of the significant proliferation group was in turn greater than that of the rare proliferation group. In this way, the 81 different normal cell types are divided into three cell type groups with different proliferation capabilities, and the proliferation ability of each cell type is graded.
  • solid tumors may arise in the tissues that contain immature B cells, basal cells of the epidermis (epidermis basal cell) and epithelial cells of the large intestine (large intestine epithelial cell).
  • DLBC (diffuse large B cell lymphoma) cancer tissue may contain a large number of immature B cells.
  • HNSC (head and neck squamous cell carcinoma), LUSC (lung squamous cell carcinoma), ESCA (esophageal cancer) and CESC (cervical squamous cell carcinoma and adenocarcinoma) all contain squamous cell carcinoma, and their cancer tissue may contain a large number of epidermal basal cells.
  • COAD (colon cancer) and READ (rectal adenocarcinoma) cancer tissue may contain a large number of large intestinal epithelial cells.
  • Table 4 Prognostic analysis of the cancer cell proliferation marker MKI67 in 7 cancers whose tissues contain a large number of significantly proliferating normal cell types
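Several of the comparisons listed above rely on a two-tailed T test with P < 0.05 as the threshold. The following is a minimal sketch of such a comparison between two groups of gene set expression values. The source says only "T test, two-tailed", so the equal-variance (Student) form is an assumption, as are the function names, the Simpson integration used to obtain the p-value without external libraries, and the toy numbers.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * integral of the density over [0, |t|].

    The integral is approximated with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def student_t_test(a, b):
    """Equal-variance two-sample t test (assumption: the source does not
    state which variant was used). Returns (t statistic, two-tailed p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    if pooled == 0:
        raise ValueError("both groups are constant; t is undefined")
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, two_tailed_p(t, df)


# Hypothetical gene set expression values for two cell type groups.
group_high = [1.2, 1.5, 1.1, 1.4, 1.3]
group_low = [-0.4, -0.2, -0.5, -0.3, -0.1]
t_stat, p_value = student_t_test(group_high, group_low)
significantly_higher = t_stat > 0 and p_value < 0.05
```

In practice a library routine such as `scipy.stats.ttest_ind` would normally replace the hand-written integration; it is avoided here only to keep the sketch dependency-free.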

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Oncology (AREA)
  • Hospice & Palliative Care (AREA)
  • Physiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Microbiology (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Provided is a model using 87 genes serving as biomarkers to predict cell proliferation activity. The expression level of a set of cell proliferation genes correlates positively with cell proliferation activity. The present invention provides a set of methods to evaluate cell proliferation activity without requiring in vitro culture. Combined with single-cell sequencing, the model can measure, quickly and easily, the proliferation activity of various in vivo cell types, determining whether cancer tissues have normal cells that are significantly proliferating.

Description

A model for predicting cell proliferation activity using 87 genes as biomarkers
Technical field
The invention belongs to the fields of gene technology and biomedicine, and specifically relates to a method for predicting cell proliferation activity using 87 genes as biomarkers.
Background art
The large-scale disorderly proliferation of cancer cells is a key mechanism of tumorigenesis. Treatments such as chemotherapy have been developed to target the cell proliferation mechanism. In addition, several cell proliferation gene markers, such as MKI67, MCM2 and PCNA, have been developed; their mRNA or protein expression levels are used to indicate the proliferation activity of cancer cells, thereby assisting in evaluating the postoperative prognosis of patients. In particular, based on the protein expression level of MKI67, the Ki-67 index was developed to mark the ratio of Ki-67-positive cells in pathological samples, so as to evaluate the prognosis of patients with lung cancer, breast cancer, prostate cancer, cervical cancer, colorectal cancer, bladder cancer, lymphoma and other cancers.
Proliferation is not unique to cancer cells. Studies have shown that human skin, bone marrow, gastrointestinal and other tissues contain a large number of cells with proliferation activity. When cancer occurs in these tissues, the expression of cell proliferation markers such as MKI67 in postoperative cancer tissue samples derives partly from cancer cells and partly from normal proliferating cells, and therefore cannot accurately reflect the proliferation activity of the cancer cells. Owing to the lack of sufficient supporting data, the American Society of Clinical Oncology (ASCO) Tumor Marker Steering Committee does not recommend the Ki-67 index as a routine prognostic marker for patients newly diagnosed with breast cancer. Part of the reason is that immune organs such as normal bone marrow and lymph nodes also contain a large number of proliferating cells. In patient pathological samples, the Ki-67 index cannot precisely distinguish normal proliferating cells from tumor cells, which reduces the accuracy of the estimated cancer cell proliferation activity and, in turn, the ability to predict patient prognosis.
In vitro culture can help identify the proliferation ability of normal cells. However, this approach currently faces great difficulties: 1. Some cells cannot be cultured in vitro; 2. Because the living environments differ greatly, the proliferation ability of some cells under in vitro culture conditions does not reflect their true proliferation ability in vivo.
Summary of the invention
In view of the differences in proliferation activity among different human cell types and the difficulty of evaluating cell proliferation activity with current culture-based methods, the present invention provides a method for evaluating cell proliferation activity using a set of 87 cell proliferation genes as markers. To achieve this objective, the present invention adopts the following technical solutions.
1. Establish the cell proliferation gene set, which consists of 87 genes. The specific implementation steps are as follows:
(1) Data collection
Single-cell RNA-Seq data of different types of normal cells are obtained from the Tabula Muris database (https://tabula-muris.ds.czbiohub.org/); RNA-Seq data of cancer and adjacent tissues from the Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/); tissue RNA-Seq data from the GTEx (Genotype-Tissue Expression Project) database (https://www.gtexportal.org/); and cell line RNA-Seq data and cell proliferation activity data from the CCLE (Cancer Cell Line Encyclopedia) database (https://portals.broadinstitute.org/ccle).
(2) Mining the stem/progenitor cell-specific gene set
a) The normal in vivo single cells in the Tabula Muris database are classified into 81 classes by cell type, and the gene expression values of each class are calculated. For a gene j in a cell type i, its expression value X_{ji} is calculated as follows:
[Formula image: Figure PCTCN2020101544-appb-000001]
where m is the total number of cells belonging to cell type i, and n is the number of cells of cell type i in which the read count of gene j is greater than 0. In this way, the expression values of all genes in cell type i are calculated, and then, in turn, the expression values of all genes in all 81 cell types.
b) Divide the 81 cell types into two groups: the stem/progenitor cell group and the other cell group.
c) Use hierarchical clustering analysis to mine the genes that are highly expressed in the stem/progenitor cell group and expressed at very low levels in the other cell group; these form the stem/progenitor cell-specific gene set.
(3) Mining the cell proliferation gene set
a) Obtain the expression values of the genes in the stem/progenitor cell-specific gene set in each normal tissue sample in the GTEx database. In most normal tissues, terminally differentiated cells without proliferative activity predominate. The above genes were therefore analyzed by hierarchical clustering, yielding a group of 87 genes with low expression in normal tissues (the 87 genes are ANLN, ARHGAP11A, ASF1B, ATAD2, AURKA, AURKB, BIRC5, BRCA2, BUB1, BUB1B, CCNA2, CCNB1, CCNB2, CDC20, CDC45, CDCA2, CDCA5, CDCA8, CDK1, CDT1, CENPA, CENPE, CENPF, CENPH, CENPK, CENPM, CENPW, CEP55, CKAP2, CKAP2L, CLSPN, DBF4, DLGAP5, ECT2, ESCO2, FEN1, FOXM1, HIRIP3, HIST1H2AG, HMMR, KIF11, KIF15, KIF20A, KIF20B, KIF23, KIFC1, LMNB1, LMNB2, LRWD1, MAD2L1, MCM2, MKI67, NCAPG, NCAPG2, NCAPH, NDC80, NEIL3, NUDT1, NUF2, NUSAP1, PBK, PKMYT1, PLK1, PLK4, PRC1, RACGAP1, RAD51, RCC1, RRM2, SHCBP1, SKA1, SMC2, SNRNP25, SPC24, SPC25, SYCE2, TACC3, TK1, TOP2A, TPX2, TRIM59, TRIP13, TYMS, UBE2C, UBE2T, UHRF1 and VRK1).
b) Obtain the expression values of the above 87 genes in the cancer and adjacent tissue samples in the TCGA database. For a gene j (1 ≤ j ≤ 87), calculate its Z-score-normalized expression value Y_j across all cancer and adjacent tissue samples. For a sample k, list the expression vector of its 87 genes as {Y_{1k}, Y_{2k}, …, Y_{87k}}, and then calculate the expression value of the gene set as the median of these 87 values (median{Y_{1k}, Y_{2k}, …, Y_{87k}}). The T test is then used to compare the gene set expression values of the samples of each cancer with those of all adjacent tissue samples. Since most cancer tissues are composed of highly proliferative cancer cells, this further confirms that the gene set is highly expressed in cancer tissue and lowly expressed in adjacent tissue. The group of 87 genes is thereby confirmed as the cell proliferation gene set.
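The gene set expression value described in step b) (Z-score each gene across samples, then take the median over genes within each sample) can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary layout, the use of the population standard deviation, the handling of constant genes and the toy values are all assumptions not stated in the source.

```python
from statistics import mean, median, pstdev


def zscore_rows(expr):
    """Z-score each gene (one list per gene) across all samples."""
    scaled = {}
    for gene, values in expr.items():
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population SD; the source does not say
        # A gene with constant expression carries no information; map it to 0.
        scaled[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return scaled


def gene_set_scores(expr):
    """Per-sample gene set expression value: median of the per-gene Z-scores."""
    scaled = zscore_rows(expr)
    n_samples = len(next(iter(scaled.values())))
    return [median(scaled[g][k] for g in scaled) for k in range(n_samples)]


# Toy matrix: 3 genes x 4 samples (hypothetical values, not TCGA data).
toy = {
    "MKI67": [1.0, 2.0, 3.0, 4.0],
    "TOP2A": [2.0, 4.0, 6.0, 8.0],
    "CDK1": [5.0, 5.0, 5.0, 5.0],
}
scores = gene_set_scores(toy)
```

Using the median rather than the mean keeps a single outlying gene from dominating a sample's score, which is one plausible reason for the choice made in the text.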
2. Use the above cell proliferation gene set to build a model for predicting cell proliferation activity. The specific implementation steps are as follows:
(1) Use the cell proliferation gene set to predict the proliferation activity of cancer cell lines cultured in vitro
a) Obtain the expression values of the genes in the cell proliferation gene set in each cancer cell line in the CCLE database. As before, for a gene j (1 ≤ j ≤ 87), calculate its Z-score-normalized expression value Z_j across all cell line samples. For a cell line sample k, list the expression vector of its 87 genes as {Z_{1k}, Z_{2k}, …, Z_{87k}}, and then calculate the expression value of the gene set as the median of these 87 values (median{Z_{1k}, Z_{2k}, …, Z_{87k}}). Calculate the expression value of the cell proliferation gene set for each cell line sample.
b) Obtain the cell proliferation activity data (doubling time) available for some cell lines in the CCLE database.
c) Perform Pearson correlation analysis on the cell proliferation gene set expression values of the cell line samples and the doubling times of the corresponding cell lines. This confirms that, in cancer cell lines derived from solid tumors, cell proliferation activity is significantly positively correlated with the expression value of the 87-gene cell proliferation gene set; that is, the expression level of the cell proliferation gene set can predict the proliferation activity of cancer cell lines derived from solid tumors.
(2) Establish the cell proliferation activity prediction model
a) Classify the single cells in the Tabula Muris database into 81 cell types and obtain the gene expression values of each cell type as described above.
b) Using the expression values of the above 87 genes, perform hierarchical cluster analysis on the 81 cell types and group them into 2-3 clusters.
c) For each of the 81 cell types, obtain the expression values of the 87 genes in the cell proliferation gene set and compute the gene set expression value. For cell type $i$, let $X_{ji}$ denote the expression value of gene $j$ ($1 \le j \le 87$); form the expression vector $\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$ and take the gene set expression value to be its median, $\mathrm{median}\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$.
d) Following the clustering result, which groups the 81 cell types into 2-3 cell type groups, collect for each group the vector of gene set expression values of its cell types and compare the groups with a two-tailed t-test. With P < 0.05 as the threshold, determine whether the gene set expression value of one cell type group is significantly higher than that of the others. This separates cell types with high expression of cell proliferation genes from those with low expression, and thereby evaluates the proliferation activity of the 81 cell types.
With this, the expression level of the cell proliferation gene set has been used to evaluate the proliferation activity of 81 normal cell types in vivo.
Using single-cell RNA-Seq data, the present invention identifies a set of 87 cell proliferation-related marker genes. With this set we evaluate the proliferation activity of different normal cell types in vivo and comprehensively identify the normal cell types that proliferate rapidly in the body. This helps determine whether a cancer tissue contains normal cells with proliferative activity. When a cancer tissue contains a large number of such cells, treatment and evaluation methods that target cell proliferation markers will be confounded and may fail.
The advantages of the present invention are as follows. (1) Culture-based methods for judging cell proliferation activity require normal tissue cells to be cultured in vitro. At present some tissue cells cannot be cultured in vitro, and for others the culture conditions cause large differences between in vivo and in vitro proliferation activity. This method uses single-cell technology to evaluate the proliferation activity of tissue cells directly in vivo, and can therefore evaluate it accurately. (2) The normal cell proliferation activity results obtained by this method can help determine whether a cancer tissue contains a large number of normal cells with proliferative capacity, and thus guide cancer treatment and evaluation methods that target cell proliferation mechanisms.
Description of the drawings
Figure 1: Heat map of the cluster analysis of genes highly expressed in the stem/progenitor cell group. Each column represents a cell type and each row a gene. Genes with an expression level > 0.5 in any cell type of the stem/progenitor cell group were clustered into 15 gene groups. One group of 162 genes was markedly expressed in the stem/progenitor cell group and expressed at very low levels in the other cell types. In the figure, Epi-SC denotes epidermal stem cells, and the numbers 1-7 denote Slamf1-positive multipotent progenitor cells (1), megakaryocyte-erythroid progenitor cells (2), late pro-B cells (3), granulocyte monocyte progenitor cells (4), granulocytopoietic cells (5), common lymphoid progenitors (6) and pre-natural killer cells (7). These 8 cell types make up the stem/progenitor cell group.
Figure 2: Heat map of the cluster analysis of the stem/progenitor cell-specific gene set across samples from 54 normal human tissues. Each column represents a sample and each row a gene; samples of the same color belong to the same tissue type. The genes of the stem/progenitor cell-specific gene set were clustered across 17,382 samples from 54 normal human tissues into 2 gene groups. The group formed by 87 genes was highly expressed only in (1) cultured skin fibroblasts, (2) EBV-transformed lymphocytes and (3) testis, and lowly expressed in all other tissues.
Figure 3: Box plot of cell proliferation gene set expression levels in different cancers. Gene set expression levels were obtained for 9,630 samples of cancer and tumor-adjacent tissue from 32 cancer types, and all tumor-adjacent samples were pooled (Control). A t-test was used to compare the gene set expression level of each cancer type with that of Control. Cancer types whose gene set expression level was significantly higher than Control (two-tailed P-value < 0.05) are highlighted in red.
Figure 4: Correlation between cell proliferation gene set expression level and optimal cell doubling time. Each point represents a cell line. The horizontal axis shows the cell line's cell proliferation gene set expression level, and the vertical axis shows its doubling time as provided by the vendor. The Pearson correlation coefficient and P-value between gene set expression level and optimal doubling time were calculated.
Figure 5: Heat map of the cluster analysis of the 81 normal cell types in Tabula Muris. Each column represents a cell type and each row a gene of the cell proliferation gene set. The 81 normal cell types were grouped into three clusters according to the expression levels of the genes in the cell proliferation gene set.
Detailed description
The present invention is described in detail below with reference to the drawings and an example. What follows is only a preferred embodiment of the invention. A person of ordinary skill in the art may make improvements and additions without departing from the method of the invention, and such improvements and additions shall also fall within the scope of protection of the invention.
Example 1: Use the Tabula Muris, TCGA, GTEx and CCLE databases to establish a cell proliferation gene set of 87 genes, predict the proliferation activity of the 81 normal cell types collected in vivo in the Tabula Muris database, help determine whether the cancer tissues in the TCGA database contain a large number of normal cells with proliferative activity, and guide cancer treatment and evaluation methods that target cell proliferation markers.
(1) Data collection
From the Tabula Muris database we obtained RNA-Seq data for 53,760 single cells from 81 normal cell types, generated with Smart-Seq2 single-cell sequencing. From The Cancer Genome Atlas (TCGA) database we obtained RNA-Seq data for 9,630 cancer and tumor-adjacent tissue samples from 32 cancer types, together with prognostic data for 31 of those cancer types. From the GTEx database we obtained RNA-Seq data for 17,382 tissue samples from 54 tissues. From the CCLE database we obtained RNA-Seq data for 1,019 cell line samples, together with the culture mode (suspension/adherent/semi-adherent) and doubling time of some of the cell lines.
(2) Mining the stem/progenitor cell-specific gene set
First, we classified the 53,760 normal single cells collected in vivo in the Tabula Muris database into 81 cell types and computed the expression value of each gene in each cell type. For gene $j$ in cell type $i$, the expression value $X_{ji}$ is computed as follows:
Figure PCTCN2020101544-appb-000002
where $m$ is the total number of cells belonging to cell type $i$, and $n$ is the number of cells of cell type $i$ in which the read count of gene $j$ is greater than 0. The expression values of all genes in cell type $i$ are computed in this way, and the procedure is repeated for all 81 cell types.
Second, the 81 cell types were divided into two groups: the stem/progenitor cell group and the other cell group (Table 1). The stem/progenitor cell group comprises the stem cell of epidermis, Slamf1-positive multipotent progenitor cell, megakaryocyte-erythroid progenitor cell, late pro-B cell, granulocyte monocyte progenitor cell, granulocytopoietic cell, common lymphoid progenitor and pre-natural killer cell; the other cell group comprises the remaining 73 cell types.
Finally, genes with an expression value > 0.5 in any cell type of the stem/progenitor cell group were selected. Hierarchical cluster analysis of these genes identified a group of 162 genes that are highly expressed in the stem/progenitor cell group and expressed at very low levels in the other cell group. This group was taken as the stem/progenitor cell-specific gene set (Figure 1).
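The candidate selection that precedes the clustering can be sketched as a simple filter. This is a minimal illustration only: the gene names, cell types and values below are hypothetical toy data, and `select_candidate_genes` is a name introduced here, not something from the source.

```python
# Hypothetical toy data: expression[gene][cell_type] plays the role of X_ji.
# None of these numbers come from Tabula Muris.
expression = {
    "GeneA": {"stem cell of epidermis": 0.9, "late pro-B cell": 0.2, "hepatocyte": 0.0},
    "GeneB": {"stem cell of epidermis": 0.1, "late pro-B cell": 0.3, "hepatocyte": 0.8},
    "GeneC": {"stem cell of epidermis": 0.4, "late pro-B cell": 0.7, "hepatocyte": 0.1},
}
stem_progenitor_types = ["stem cell of epidermis", "late pro-B cell"]
THRESHOLD = 0.5  # the cutoff stated in the text


def select_candidate_genes(expression, group_types, threshold):
    """Return genes whose value exceeds `threshold` in at least one group cell type."""
    return sorted(
        gene
        for gene, by_type in expression.items()
        if any(by_type[cell_type] > threshold for cell_type in group_types)
    )


candidates = select_candidate_genes(expression, stem_progenitor_types, THRESHOLD)
print(candidates)
```

Only the genes that pass this filter would go on to the hierarchical clustering step; GeneB is dropped because it is high only outside the stem/progenitor group.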
Table 1: The 81 normal cell types in the Tabula Muris database
Figure PCTCN2020101544-appb-000003
(3) Mining the cell proliferation gene set
First, the expression values of the 162 genes of the stem/progenitor cell-specific gene set were obtained for every normal tissue sample in the GTEx database. For each gene, its expression values across the 17,382 tissue samples were Z-score normalized, giving normalized expression values for all 162 genes. Hierarchical cluster analysis was then performed on the genes. Because terminally differentiated cells without proliferative activity make up the bulk of most normal tissues, the analysis yielded a group of 87 genes that are highly expressed only in (i) cultured skin fibroblasts, (ii) EBV-transformed lymphocytes and (iii) testis, and lowly expressed in the other 51 tissues (Figure 2). These 87 genes constitute the cell proliferation gene set (Table 2).
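The per-gene Z-score normalization described above can be sketched as follows. The source does not say whether the population or the sample standard deviation is used, so the population form here is an assumption, and the matrix values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Z-score each row (one gene) across its columns (samples)."""
    normalized = []
    for row in matrix:
        mu = mean(row)
        sigma = pstdev(row)  # population SD: an assumption, not stated in the source
        if sigma == 0:
            # A gene with constant expression carries no information; map it to zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(value - mu) / sigma for value in row])
    return normalized


# Hypothetical expression matrix: 2 genes x 4 tissue samples.
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],
]
z = zscore_rows(raw)
print(z[0])
```

After this step every gene has mean 0 and unit variance across samples, so genes with very different absolute expression levels contribute comparably to the clustering.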
Table 2: Genes in the cell proliferation gene set
ANLN       CCNB2   CENPK   ESCO2      KIF23   NCAPG    PLK4      SMC2      TRIP13
ARHGAP11A  CDC20   CENPM   FEN1       KIFC1   NCAPG2   PRC1      SNRNP25   TYMS
ASF1B      CDC45   CENPW   FOXM1      LMNB1   NCAPH    RACGAP1   SPC24     UBE2C
ATAD2      CDCA2   CEP55   HIRIP3     LMNB2   NDC80    RAD51     SPC25     UBE2T
AURKA      CDCA5   CKAP2   HIST1H2AG  LRWD1   NEIL3    RCC1      SYCE2     UHRF1
AURKB      CDCA8   CKAP2L  HMMR       MAD2L1  NUDT1    RRM2      TACC3     VRK1
BIRC5      CDK1    CLSPN   KIF11      MCM2    NUF2     SHCBP1    TK1
BRCA2      CDT1    DBF4    KIF15      MKI67   NUSAP1   SKA1      TOP2A
BUB1       CENPA   DLGAP5  KIF20A     PBK     PKMYT1   TPX2
BUB1B      CCNA2   CENPE   CENPF      CENPH   ECT2     KIF20B    PLK1      TRIM59
CCNB1
Second, the expression values of the above 87 genes were obtained for a total of 9,630 cancer and tumor-adjacent tissue samples from the 32 cancer types in the TCGA database (Table 3). For each gene $j$ ($1 \le j \le 87$), its Z-score-normalized expression value $Y_j$ was computed across all cancer and tumor-adjacent tissue samples. For each sample $k$, the expression vector of its 87 genes, $\{Y_{1k}, Y_{2k}, \ldots, Y_{87k}\}$, was formed, and the gene set expression value was taken to be the median of that vector, $\mathrm{median}\{Y_{1k}, Y_{2k}, \ldots, Y_{87k}\}$. A t-test was then used to compare the gene set expression values of the samples of each cancer type with those of all tumor-adjacent samples. Because the vast majority of cancer tissue consists of highly proliferative cancer cells, the gene set is expected to be more highly expressed in cancer tissue. This was confirmed in 28 of the 32 cancer types, where the gene set expression value in cancer tissue was higher than in tumor-adjacent tissue (P < 0.05; Table 3 and Figure 3).
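A minimal sketch of the scoring and comparison just described: the per-sample gene set value is the median of the Z-scored gene values, and one cancer type is compared with the pooled control samples by a t statistic. The source says only "t-test", so the use of Welch's unequal-variance form is an assumption, and the small matrix is hypothetical. Converting the statistic to a P-value needs the Student t distribution from a statistics package and is left out.

```python
from math import sqrt
from statistics import mean, median, variance


def gene_set_score(z_by_gene, sample_index):
    """Median of the Z-scored values of all genes in one sample."""
    return median(row[sample_index] for row in z_by_gene)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical, already normalized matrix: 3 genes x 6 samples.
# Columns 0-2 stand for one cancer type, columns 3-5 for pooled control samples.
z = [
    [1.2, 0.8, 1.0, -1.1, -0.9, -1.0],
    [0.9, 1.1, 1.3, -0.8, -1.2, -1.3],
    [1.0, 0.7, 1.4, -1.0, -0.7, -1.4],
]
scores = [gene_set_score(z, k) for k in range(6)]
tumor, control = scores[:3], scores[3:]
t_stat, dof = welch_t(tumor, control)
print(scores, round(t_stat, 2), round(dof, 2))
```

Taking the median rather than the mean keeps the score robust to a few genes with extreme values in a given sample.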
Table 3: Abbreviations and names of the 32 cancer types in the TCGA database
HNSC  Head and neck squamous cell carcinoma
KICH  Chromophobe renal cell carcinoma
KIRC  Renal clear cell carcinoma
KIRP  Renal papillary cell carcinoma
LAML  Acute myeloid leukemia
LGG   Low-grade glioma of the brain
LIHC  Hepatocellular carcinoma
LUAD  Lung adenocarcinoma
LUSC  Lung squamous cell carcinoma
ACC   Adrenocortical carcinoma
MESO  Mesothelioma
OV    Ovarian serous cystadenocarcinoma
PAAD  Pancreatic cancer
PCPG  Pheochromocytoma and paraganglioma
PRAD  Prostate cancer
READ  Rectal adenocarcinoma
SKCM  Skin melanoma
STAD  Stomach cancer
BLCA  Bladder urothelial carcinoma
TGCT  Testicular cancer
THCA  Thyroid cancer
THYM  Thymic cancer
UCEC  Endometrial cancer
UCS   Uterine sarcoma
UVM   Uveal melanoma
BRCA  Invasive breast carcinoma
CESC  Cervical squamous cell carcinoma and adenocarcinoma
COAD  Colon cancer
DLBC  Diffuse large B-cell lymphoma
ESCA  Esophageal cancer
GBM   Glioblastoma multiforme
SARC  Sarcoma
(4) Predicting cancer cell proliferation activity with the cell proliferation gene set
First, the expression values of the 87 genes of the cell proliferation gene set were obtained for all cancer cell lines in the CCLE database.
For each gene $j$ ($1 \le j \le 87$), its Z-score-normalized expression value $Z_j$ was computed across the 1,019 cell line samples. For each cell line sample $k$, the expression vector of its 87 genes, $\{Z_{1k}, Z_{2k}, \ldots, Z_{87k}\}$, was formed, and the gene set expression value was taken to be the median of those 87 values, $\mathrm{median}\{Z_{1k}, Z_{2k}, \ldots, Z_{87k}\}$.
The cell proliferation gene set expression value of every cell line sample was computed in this way.
Second, the CCLE database provides, for some cell lines, the culture mode (suspension/semi-adherent/adherent) and the doubling time (either provided by the vendor or measured by CCLE staff). We take the vendor-provided doubling time to represent the optimal doubling time of a cell line, so it can serve as an indicator of the cell line's proliferation activity. We therefore obtained the vendor-provided doubling times of 99 cell lines cultured in semi-adherent or adherent mode as the indicator of their proliferation activity. Finally, we found that in these cell lines the cell proliferation gene set expression value is negatively correlated with doubling time (Pearson correlation analysis, Figure 4). Because a shorter doubling time means stronger proliferation activity, this result indicates a positive correlation between gene set expression value and cell proliferation activity. Solid tumors generally grow in semi-adherent or adherent mode, whereas hematological tumors grow in suspension. The result therefore indicates that, for solid tumors, the cell proliferation gene set expression value can predict cell proliferation activity.
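The correlation step can be illustrated with a direct implementation of the Pearson coefficient. The scores and doubling times below are hypothetical values chosen only to show the expected sign of the relationship; they are not CCLE data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical adherent cell lines: gene set score and vendor doubling time in hours.
gene_set_scores = [-0.8, -0.3, 0.1, 0.6, 1.2]
doubling_hours = [70.0, 55.0, 48.0, 36.0, 24.0]

r = pearson_r(gene_set_scores, doubling_hours)
print(round(r, 3))
```

A negative coefficient here corresponds to the relationship reported in the text: higher gene set expression goes with shorter doubling time, that is, with stronger proliferation activity.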
(5) Evaluation of cell proliferation activity
First, the single cells in the Tabula Muris database were classified into 81 cell types and the gene expression values of each cell type were computed as described above. For each of these 81 cell types, the expression values of the 87 genes of the cell proliferation gene set were obtained.
Second, hierarchical cluster analysis was performed on the 81 normal cell types using the expression values of the above 87 genes. The analysis used the R package "factoextra" with the distance metric "euclidean" and the clustering method "ward.D2". Based on the resulting hierarchical clustering tree, the 81 cell types were grouped into three clusters (Figure 5). One cluster is the stem/progenitor cell group. The other two come from the other cell group (Table 1): the significant proliferation group, whose cells show marked expression of cell proliferation genes and therefore have some proliferative capacity, and the rare proliferation group, whose cells rarely express cell proliferation genes and therefore have very weak proliferative capacity.
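The source performs this clustering in R. The sketch below is a standard-library Python stand-in, not the authors' code: it greedily merges the pair of clusters with the smallest increase in within-cluster sum of squares, which is the Ward minimum-variance criterion that "ward.D2" is understood to apply to Euclidean distances, and it stops once three clusters remain. The data points are hypothetical.

```python
def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b are merged.
        na, nb = len(a["members"]), len(b["members"])
        d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
        return na * nb / (na + nb) * d2

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [
                (na * x + nb * y) / (na + nb)
                for x, y in zip(a["centroid"], b["centroid"])
            ],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c["members"]) for c in clusters)


# Hypothetical data: 6 cell types x 3 proliferation genes (high, medium, low expression).
cell_types = [
    [2.0, 2.1, 1.9], [2.2, 2.0, 2.1],
    [1.0, 0.9, 1.1], [1.1, 1.0, 0.9],
    [0.0, 0.1, 0.0], [0.1, 0.0, 0.1],
]
groups = ward_clusters(cell_types, 3)
print(groups)
```

The quadratic pair search is fine for 81 cell types; for larger inputs a library implementation would be the practical choice.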
Finally, the cell proliferation gene set expression value was computed for each of the 81 cell types from the expression values of the 87 genes of the set. For cell type $i$, let $X_{ji}$ denote the expression value of gene $j$ ($1 \le j \le 87$); the expression vector $\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$ was formed, and the gene set expression value was taken to be its median, $\mathrm{median}\{X_{1i}, X_{2i}, \ldots, X_{87i}\}$. The gene set expression values of the three cell type groups were then compared. Using a t-test with a two-tailed P < 0.05 as the threshold, we found that the gene set expression value of the stem/progenitor cell group is significantly greater than that of the significant proliferation group, which in turn is greater than that of the rare proliferation group. The 81 normal cell types were thus divided into three groups with different proliferative capacities, giving a graded evaluation of the proliferative capacity of each cell type.
(6) Using the proliferation activity of normal cell types to guide the clinical application of cell proliferation markers
First, based on the cell type information, we found that within the significant proliferation group the tissues containing immature B cells, basal cells of epidermis and epithelial cells of large intestine may give rise to solid tumors.
Second, we analyzed the 31 solid tumor types in TCGA that have clinical prognostic information. DLBC (diffuse large B-cell lymphoma) tissue may contain a large number of immature B cells. HNSC (head and neck squamous cell carcinoma), LUSC (lung squamous cell carcinoma), ESCA (esophageal cancer) and CESC (cervical squamous cell carcinoma and adenocarcinoma) all include squamous cell carcinomas, whose cancer tissue may contain a large number of basal cells of epidermis. COAD (colon cancer) and READ (rectal adenocarcinoma) tissue may contain a large number of epithelial cells of large intestine. All 7 of these cancer types therefore contain large numbers of normal cells with significant proliferative activity, and treatment and prognosis prediction based on cell proliferation markers may fail.
Finally, using the TCGA cancer tissue RNA-Seq data and the clinical progression-free interval (PFI) data, we applied a Cox proportional hazards regression model (with expression as a continuous variable) to test whether the expression of the proliferation marker MKI67 in postoperative cancer tissue samples from DLBC, HNSC, LUSC, ESCA, CESC, COAD and READ patients predicts their progression-free interval. With a P-value < 0.05 as the threshold, MKI67 expression did not predict the progression-free interval in any of these 7 cancer types (Table 4). This result is consistent with our prediction.
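To make the regression step concrete, here is a minimal sketch of a Cox proportional hazards fit with one continuous covariate, solved by Newton-Raphson on the partial likelihood. It is not the authors' analysis: the Breslow treatment of tied event times, the Wald test used for the P-value, the normal-approximation 95% interval and the function name `cox_single_covariate` are all assumptions made for illustration, and the eight patients are hypothetical.

```python
from math import erfc, exp, sqrt


def cox_single_covariate(times, events, x, iterations=25):
    """Return (hazard ratio, 95% CI, Wald P-value) for a one-covariate Cox model."""
    beta, info = 0.0, 0.0
    for _ in range(iterations):
        grad, info = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored subjects only appear in other subjects' risk sets
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            weights = [exp(beta * x_j) for x_j in risk]
            s0 = sum(weights)
            s1 = sum(w * x_j for w, x_j in zip(weights, risk))
            s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk))
            grad += x_i - s1 / s0                  # score contribution
            info += s2 / s0 - (s1 / s0) ** 2       # observed information contribution
        step = grad / info
        beta += step
        if abs(step) < 1e-10:
            break
    se = 1.0 / sqrt(info)
    p_value = erfc(abs(beta / se) / sqrt(2.0))     # two-sided Wald test
    interval = (exp(beta - 1.96 * se), exp(beta + 1.96 * se))
    return exp(beta), interval, p_value


# Hypothetical patients: follow-up time, progression flag (1 = event), marker expression.
times = [5, 8, 12, 14, 20, 22, 30, 35]
events = [1, 1, 0, 1, 1, 0, 1, 0]
marker = [2.1, 0.4, 1.5, 0.9, 1.8, 0.3, 1.0, 0.7]

hr, ci, p = cox_single_covariate(times, events, marker)
print(round(hr, 2), tuple(round(v, 2) for v in ci), round(p, 3))
```

A hazard ratio whose confidence interval spans 1, with a P-value above the threshold, is the pattern reported for MKI67 in Table 4. For real data an established survival analysis package would be the appropriate tool.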
Table 4: Prognostic analysis of the cell proliferation marker MKI67 in the 7 cancer types whose tissue contains large numbers of significantly proliferating normal cell types
Cancer type                                          Hazard ratio (95% confidence interval)   Type 3 P-value
Cervical squamous cell carcinoma and adenocarcinoma  1.03 (0.77-1.37)                         0.8566
Colon cancer                                         0.87 (0.54-1.4)                          0.5555
Diffuse large B-cell lymphoma                        1 (0.55-1.81)                            0.9902
Esophageal cancer                                    1.04 (0.89-1.21)                         0.6189
Head and neck squamous cell carcinoma                0.97 (0.84-1.12)                         0.6502
Lung squamous cell carcinoma                         1.18 (0.95-1.45)                         0.1277
Rectal adenocarcinoma                                1.61 (0.73-3.55)                         0.242

Claims (4)

  1. A model for predicting cell proliferation activity using 87 genes as biomarkers, characterized in that it is implemented through the following steps:
    (1) Establishing the cell proliferation gene set:
    1) Data collection
    Obtain single-cell RNA-Seq data of different types of normal cells from the Tabula Muris database, RNA-Seq data of cancer and tumor-adjacent tissues from The Cancer Genome Atlas database, tissue RNA-Seq data from the GTEx database, and cell line RNA-Seq data and cell proliferation activity data from the CCLE database;
    2) Mining the stem/progenitor cell-specific gene set
    a) Classify the normal single cells collected in vivo in the Tabula Muris database into 81 cell types and compute the gene expression values of each cell type; for gene $j$ in cell type $i$, the expression value $X_{ji}$ is computed as follows:
    Figure PCTCN2020101544-appb-100001
    where $m$ is the total number of cells belonging to cell type $i$, and $n$ is the number of cells of cell type $i$ in which the read count of gene $j$ is greater than 0; compute the expression values of all genes in cell type $i$, and repeat for all 81 cell types;
    b) Divide the 81 cell types into two groups: the stem/progenitor cell group and the other cell group;
    c) Use hierarchical cluster analysis to identify genes that are highly expressed in the stem/progenitor cell group and expressed at very low levels in the other cell group, as the stem/progenitor cell-specific gene set;
    3) Mining the cell proliferation gene set
    a) Obtain the expression values of the genes of the stem/progenitor cell-specific gene set in each normal tissue sample in the GTEx database; because terminally differentiated cells without proliferative activity make up the bulk of most normal tissues, perform hierarchical cluster analysis on the genes to obtain a gene group of 87 genes that are lowly expressed in normal tissues;
    b) Obtain the expression values of the above 87 genes in the cancer and tumor-adjacent tissue samples of the TCGA database; for each gene $j$, compute its Z-score-normalized expression value $Y_j$ across all cancer and tumor-adjacent tissue samples; for each sample $k$, form the expression vector of its 87 genes, $\{Y_{1k}, Y_{2k}, \ldots, Y_{87k}\}$, and take the gene set expression value to be the median of that vector; use a t-test to compare the gene set expression values of the samples of each cancer type with those of all tumor-adjacent samples; because the vast majority of cancer tissue consists of highly proliferative cancer cells, this confirms that the gene set is highly expressed in cancer tissue and lowly expressed in tumor-adjacent tissue, and that the group of 87 genes is a cell proliferation gene set;
    (2)使用上述细胞增殖基因集合建立预测细胞增殖活性的模型:(2) Use the above cell proliferation gene set to establish a model for predicting cell proliferation activity:
    1)细胞增殖基因集合预测体外培养癌细胞系增殖活性1) Cell proliferation gene set predicts the proliferation activity of cancer cell lines in vitro
    a)获得CCLE数据库中各癌症细胞系中细胞增殖基因集合中基因的表达值,对某一个基 因j,计算其在所有细胞系样本中Z-score标准化的基因表达值Z j,对某一个细胞系样本k,列举其87个基因的表达向量为{Z 1k,Z 2k,…,Z 87k},然后,计算基因集合的表达值为上述87个基因表达值的中值,计算每一个细胞系样本的细胞增殖基因集合表达值; a) Obtain the expression values of genes in the cell proliferation gene set in each cancer cell line in the CCLE database. For a certain gene j, calculate the Z-score normalized gene expression value Z j in all cell line samples. For a certain cell Line sample k, enumerate the expression vector of its 87 genes as {Z 1k ,Z 2k ,...,Z 87k }, and then calculate the expression value of the gene set as the median value of the above 87 gene expression values, and calculate each cell line The expression value of the cell proliferation gene set of the sample;
    b) Obtain the cell proliferation activity data available for part of the cells in the CCLE database;
    c) Perform Pearson correlation analysis between the cell proliferation gene set expression values of the cell line samples and the doubling time data of the corresponding cell lines, confirming that, in cancer cell lines derived from solid tumors, cell proliferation activity is significantly positively correlated with the expression value of the 87-gene cell proliferation gene set, so that the proliferation activity of cancer cell lines derived from solid tumors is predicted from the expression level of the cell proliferation gene set;
    2) Establish the cell proliferation activity prediction model
    a) Group the single cells in the Tabula Muris database into 81 classes by cell type, and obtain the gene expression values of each class of cells as above;
    b) Perform hierarchical clustering of the 81 cell types using the expression values of the above 87 genes, so that the cell types are clustered into 2-3 groups;
    c) For each of the 81 cell types, calculate its cell proliferation gene set expression value: obtain the expression values of the 87 genes of the cell proliferation gene set in each cell type; for a given cell type i, with X_ji denoting the expression value of gene j, list the expression vector of its 87 genes as {X_1i, X_2i, …, X_87i}, then calculate the expression value of the cell proliferation gene set as the median of those 87 gene expression values;
    d) Based on the clustering results, which group the 81 cell types into 2-3 different cell type groups, obtain for each cell type group the vector of its cell proliferation gene set expression values, and compare these values between the cell type groups. Using P<0.05 as the threshold, determine whether the cell proliferation gene set expression value of one cell type group is significantly higher than that of the other groups, thereby identifying the cell types with high expression and with low expression of cell proliferation genes and evaluating the proliferation activity of the 81 cell types.
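The gene set scoring and the correlation step of claim 1 can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the function names, the three-gene toy matrix, and the doubling times are hypothetical, and the population standard deviation used for the Z-score is an assumption, since the claims do not state which variant is used.

```python
from statistics import mean, median, pstdev

def zscore_rows(expr):
    """Z-score each gene (dict key) across its samples (list positions)."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)  # population SD: an assumption
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def gene_set_scores(expr):
    """Per-sample gene set score: median of the z-scored gene values."""
    z = zscore_rows(expr)
    n_samples = len(next(iter(z.values())))
    return [median(z[g][k] for g in z) for k in range(n_samples)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical toy data: 3 of the 87 genes measured in 4 cell lines.
toy_expr = {
    "MKI67": [1.0, 2.0, 3.0, 4.0],
    "TOP2A": [2.0, 4.0, 6.0, 8.0],
    "CDK1": [0.5, 1.0, 1.5, 2.0],
}
toy_doubling_hours = [60.0, 45.0, 30.0, 20.0]  # made-up values

scores = gene_set_scores(toy_expr)
r = pearson(scores, toy_doubling_hours)
print(scores, r)
```

A shorter doubling time means faster proliferation, so in this toy example the score correlates negatively with doubling time; a positive correlation with proliferation activity corresponds to correlating against a rate-like quantity instead.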
  2. The model for predicting cell proliferation activity using 87 genes as biomarkers according to claim 1, characterized in that the 87 genes are: ANLN, ARHGAP11A, ASF1B, ATAD2, AURKA, AURKB, BIRC5, BRCA2, BUB1, BUB1B, CCNA2, CCNB1, CCNB2, CDC20, CDC45, CDCA2, CDCA5, CDCA8, CDK1, CDT1, CENPA, CENPE, CENPF, CENPH, CENPK, CENPM, CENPW, CEP55, CKAP2, CKAP2L, CLSPN, DBF4, DLGAP5, ECT2, ESCO2, FEN1, FOXM1, HIRIP3, HIST1H2AG, HMMR, KIF11, KIF15, KIF20A, KIF20B, KIF23, KIFC1, LMNB1, LMNB2, LRWD1, MAD2L1, MCM2, MKI67, NCAPG, NCAPG2, NCAPH, NDC80, NEIL3, NUDT1, NUF2, NUSAP1, PBK, PKMYT1, PLK1, PLK4, PRC1, RACGAP1, RAD51, RCC1, RRM2, SHCBP1, SKA1, SMC2, SNRNP25, SPC24, SPC25, SYCE2, TACC3, TK1, TOP2A, TPX2, TRIM59, TRIP13, TYMS, UBE2C, UBE2T, UHRF1 and VRK1.
  3. The model for predicting cell proliferation activity using 87 genes as biomarkers according to claim 1, characterized in that the databases are: the Tabula Muris database: https://tabula-muris.ds.czbiohub.org/; The Cancer Genome Atlas (TCGA) database: http://cancergenome.nih.gov/; the GTEx database: https://www.gtexportal.org/; and the CCLE database: https://portals.broadinstitute.org/ccle.
  4. The model for predicting cell proliferation activity using 87 genes as biomarkers according to claim 1, characterized in that: obtaining the cell proliferation activity data available for part of the cells in the CCLE database in step (2) means obtaining the cell doubling time proliferation activity data available for part of the cells in the CCLE database; and the expression values of the cell proliferation gene sets of different cell type groups are compared using a T test.
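The clustering and group comparison described in claims 1 and 4 can be sketched in the same spirit. The claims do not state the distance metric, the linkage rule, or the T test variant, so Euclidean distance, average linkage, and Welch's unequal-variance T test are assumptions here; the cell type names and expression values are hypothetical toy data.

```python
from math import dist, sqrt
from statistics import mean, median, variance

def average_linkage_clusters(vectors, n_clusters):
    """Agglomerative clustering: merge the two closest clusters
    (average Euclidean distance) until n_clusters remain."""
    clusters = [[name] for name in vectors]

    def cluster_distance(a, b):
        return mean(dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(cluster_distance(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

def welch_t(a, b):
    """Welch T statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical toy data: 6 cell types, 3 proliferation genes each.
toy_vectors = {
    "type_A": [5.0, 6.0, 5.5],
    "type_B": [5.2, 5.8, 6.1],
    "type_C": [4.9, 6.3, 5.7],
    "type_D": [0.4, 0.2, 0.5],
    "type_E": [0.1, 0.6, 0.3],
    "type_F": [0.3, 0.4, 0.2],
}

clusters = average_linkage_clusters(toy_vectors, 2)
# Gene set score per cell type: median of its gene expression values.
set_scores = {name: median(vec) for name, vec in toy_vectors.items()}
group_scores = [[set_scores[name] for name in c] for c in clusters]
t_stat, df = welch_t(group_scores[0], group_scores[1])
print(clusters, t_stat, df)
```

The sketch stops at the T statistic because the Python standard library has no t distribution; applying the P<0.05 threshold would need a p-value, which `scipy.stats.ttest_ind(a, b, equal_var=False)` provides if SciPy is available.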
PCT/CN2020/101544 2020-06-17 2020-07-13 Model using 87 genes serving as biomarkers to predict cell proliferation activity WO2021253544A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010554703.4 2020-06-17
CN202010554703.4A CN111739586B (en) 2020-06-17 2020-06-17 Model for predicting cell proliferation activity by using 87 genes as biomarkers

Publications (1)

Publication Number Publication Date
WO2021253544A1 true WO2021253544A1 (en) 2021-12-23

Family

ID=72649544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/101544 WO2021253544A1 (en) 2020-06-17 2020-07-13 Model using 87 genes serving as biomarkers to predict cell proliferation activity

Country Status (2)

Country Link
CN (1) CN111739586B (en)
WO (1) WO2021253544A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116259360A (en) * 2023-03-16 2023-06-13 中国人民解放军空军军医大学 Identification and characteristic gene set of hyperproliferative tumor subgroup in lung adenocarcinoma and application

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN114042161B (en) * 2021-11-17 2023-05-30 浙江省人民医院 Application of CENPW inhibitor in preparation of antitumor drugs
GB2613386A (en) * 2021-12-02 2023-06-07 Apis Assay Tech Limited Diagnostic test
CN117965734B (en) * 2024-02-02 2024-09-24 奥明星程(杭州)生物科技有限公司 Gene marker for detecting hard fibroid, kit, detection method and application

Citations (3)

Publication number Priority date Publication date Assignee Title
WO2014066984A1 (en) * 2012-10-29 2014-05-08 Ontario Institute For Cancer Research (Oicr) Method for identifying a target molecular profile associated with a target cell population
CN109797221A (en) * 2019-03-13 2019-05-24 上海市第十人民医院 A kind of biomarker combination and its application for Myometrial involvement bladder cancer progress molecule parting and/or prognosis prediction
WO2020033866A1 (en) * 2018-08-10 2020-02-13 Omniseq, Inc. Methods and systems for assessing proliferative potential and resistance to immune checkpoint blockade

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
EP2549399A1 (en) * 2011-07-19 2013-01-23 Koninklijke Philips Electronics N.V. Assessment of Wnt pathway activity using probabilistic modeling of target gene expression
CN108424969B (en) * 2018-06-06 2022-07-15 深圳市颐康生物科技有限公司 Biomarker, method for diagnosing or predicting death risk
KR102170726B1 (en) * 2018-10-04 2020-10-27 사회복지법인 삼성생명공익재단 Method for selecting biomarker and method for providing information for diagnosis of cancer using thereof
CN109859801B (en) * 2019-02-14 2023-09-19 辽宁省肿瘤医院 Model for predicting lung squamous carcinoma prognosis by using seven genes as biomarkers and establishing method
CN110441523A (en) * 2019-08-09 2019-11-12 首都医科大学附属北京朝阳医院 ATAD2 albumen is judging the application in oophoroma vegetative state as marker

Non-Patent Citations (3)

Title
LACHMANN ALEXANDER, TORRE DENIS, KEENAN ALEXANDRA B., JAGODNIK KATHLEEN M., LEE HOYJIN J., WANG LILY, SILVERSTEIN MOSHE C., MA’AYA: "Massive mining of publicly available RNA-seq data from human and mouse", NATURE COMMUNICATIONS, vol. 9, no. 1, 1 December 2018 (2018-12-01), XP055882013, DOI: 10.1038/s41467-018-03751-6 *
ROONEY MICHAEL S ET AL: "Molecular and genetic properties of tumors associated with local immune cytolytic activity.", CELL, US, vol. 160, no. 1-2, 15 January 2015 (2015-01-15), US , pages 48 - 61, XP002782862, ISSN: 1097-4172, DOI: 10.1016/j.cell.2014.12.033 *
ZHANG JIANHUA, YING ANNA;YU MENGQI;YU BINGNAN: "Research Report RNA-seq Analysis of Differentially Expressed Genes and Protein-Protein Interaction in Hepatocellular Carcinoma Based on RNA-seq Data", GENOMICS AND APPLIED BIOLOGY, vol. 38, no. 1, 28 February 2019 (2019-02-28), pages 461 - 467, XP055882019, DOI: 10.13417/j.gab.038.000461 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116259360A (en) * 2023-03-16 2023-06-13 中国人民解放军空军军医大学 Identification and characteristic gene set of hyperproliferative tumor subgroup in lung adenocarcinoma and application
CN116259360B (en) * 2023-03-16 2024-02-09 中国人民解放军空军军医大学 Identification and characteristic gene set of hyperproliferative tumor subgroup in lung adenocarcinoma and application

Also Published As

Publication number Publication date
CN111739586A (en) 2020-10-02
CN111739586B (en) 2024-04-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940594

Country of ref document: EP

Kind code of ref document: A1