WO2024027591A1 - Multi-cancer methylation detection kit and use thereof - Google Patents


Publication number
WO2024027591A1
WO2024027591A1 PCT/CN2023/109837 CN2023109837W WO2024027591A1 WO 2024027591 A1 WO2024027591 A1 WO 2024027591A1 CN 2023109837 W CN2023109837 W CN 2023109837W WO 2024027591 A1 WO2024027591 A1 WO 2024027591A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
biomarker combination
dmrs
dmr
tumor
Application number
PCT/CN2023/109837
Inventor
李冰思
许佳悦
邱福俊
汉雨生
张之宏
Original Assignee
广州燃石医学检验所有限公司
Application filed by 广州燃石医学检验所有限公司
Publication of WO2024027591A1


Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers


Abstract

Provided are a multi-cancer methylation detection kit and use thereof. Specifically, a biomarker combination is provided for evaluating the correlation between a sample to be tested and the risk of tumor formation and/or the tissue of origin of a tumor, wherein the reference genome version for the differentially methylated regions (DMRs) is hg19. Also provided is use of a reagent for detecting the biomarker combination in preparing a kit for determining the risk of tumor formation and/or evaluating the tumor tissue of origin of a sample. The present invention is suitable for risk assessment and tissue-of-origin tracing of various cancers, and has the advantages of low cost and high accuracy.

Description

A multi-cancer methylation detection kit and use thereof
Technical Field
This application relates to the field of biomedicine, and specifically to a multi-cancer methylation detection kit and use thereof.
Background Art
DNA methylation is known to play an important role in the regulation of gene expression. Aberrant DNA methylation marks have been reported in the onset and progression of many diseases, including cancer. As a high-resolution, high-throughput technology, DNA methylation sequencing is increasingly recognized for its role in cancer screening, diagnosis, and monitoring. Whole-genome bisulfite sequencing (WGBS) is the gold standard for methylation sequencing, but the severe DNA damage caused during processing and the high sequencing cost make it difficult to apply clinically. More importantly, most regions of the human genome are not active during cancer onset and progression; cancer-related alterations tend to concentrate in certain specific regions, such as CpG islands, which creates a good opportunity for targeted sequencing.
Nevertheless, discovering and screening cancer-related differentially methylated regions (DMRs) is challenging, because population heterogeneity, including disease status, age, and other conditions, introduces non-specific changes in methylation profiles. These non-cancer but abnormal signals therefore need to be handled when building the cancer assessment (DOC) model. Finally, for multi-cancer detection, building a tissue-of-origin tracing (TOO) model is an important aid for tracing the organ from which a cancer alteration likely arises, for determining the downstream diagnosis and treatment pathway, and for saving medical costs.
Summary of the Invention
This application establishes a low-cost, high-accuracy method that uses DNA or RNA oligonucleotide sequences to capture regions of methylation variation in multiple different cancers, as well as regions with methylation features specific to various organs, in order to determine whether a tumor component (ctDNA) is present in blood cell-free DNA (cfDNA) and to assess its tissue of origin.
In one aspect, the present application provides a biomarker combination for assessing the correlation between a sample to be tested and the risk of tumor formation, characterized in that the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1A, wherein the reference genome version for the DMRs in the table is hg19.
In one aspect, the present application provides a biomarker combination for assessing the correlation between a sample to be tested and the tissue of origin of a tumor, characterized in that the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1B, wherein the reference genome version for the DMRs in the table is hg19.
In one aspect, the present application provides a biomarker combination for assessing the correlation between a sample to be tested and the risk of tumor formation and/or the tissue of origin of a tumor, characterized in that the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1C, wherein the reference genome version for the DMRs in the table is hg19.
In one aspect, the present application provides a kit comprising the biomarker combination described herein and, optionally, second-generation high-throughput sequencing reagents.
In one aspect, the present application provides use of a reagent for detecting the biomarker combination described herein in the preparation of a kit for diagnosing the risk of tumor formation and/or the tissue of origin of a tumor.
In one aspect, the present application provides a method for assessing the correlation between a sample to be tested and the risk of tumor formation and/or the tissue of origin of a tumor, the method comprising: detecting the methylation level of a biomarker combination in the sample to be tested, wherein the biomarker combination comprises the biomarker combination described herein.
In one aspect, the present application provides a storage medium storing a program capable of running the method described herein.
In one aspect, the present application provides a device comprising the storage medium described herein, and optionally comprising a processor coupled to the storage medium, the processor being configured to execute the program stored in the storage medium so as to implement the method described herein.
The biomarker combination, kit, method, device, storage medium, and use of the present application are applicable to risk assessment and tissue-of-origin tracing for multiple cancers, and have the advantages of low cost and high accuracy.
Those skilled in the art will readily recognize other aspects and advantages of the present application from the detailed description below. Only exemplary embodiments of the present application are shown and described in the detailed description below. As those skilled in the art will appreciate, the content of this application enables them to modify the specific embodiments disclosed without departing from the spirit and scope of the invention to which this application relates. Accordingly, the drawings and the descriptions in the specification are merely exemplary and not restrictive.
Brief Description of the Drawings
The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention can be better understood by reference to the exemplary embodiments described in detail below and to the accompanying drawings. The drawings are briefly described as follows:
Figure 1 shows an exemplary case (a theoretical illustration that does not represent actual sequencing results).
Figure 2 shows another exemplary case (a theoretical illustration that does not represent actual sequencing results).
Figures 3A-3C show another exemplary case (a theoretical illustration that does not represent actual sequencing results).
Figure 4 shows that a tissue-of-origin tracing accuracy of 98% (95% CI: 96-99%) can be achieved in 5-fold cross-validation.
Figure 5 shows the result of controlling the weights assigned to confounding-related features in the Salmon-DOC model of the present application.
Figures 6A-6F show that, in the tumor group, the Salmon-DOC model of the present application can efficiently detect six cancer types at different stages.
Figures 7A-7F show that, in the healthy group, the Salmon-DOC model of the present application overcomes the earlier weakness of methylation false positives increasing with age and remains balanced across age groups (horizontal axis: age; vertical axis: model cancer probability score).
Figures 8A-8D show that the tracing accuracy of the two-layer Salmon-TOO model of the present application exceeds that of the single-layer model in both cross-validation and independent validation.
Figure 9 shows the tissue-of-origin assessment results obtained from the 103 TOO-related DMRs.
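The description of Figure 4 reports an accuracy together with a 95% confidence interval but does not say how the interval was computed. As a minimal sketch only, the Wilson score interval is one common way to put an interval around a proportion such as classification accuracy. The function name and the counts used in the usage note are hypothetical and are not taken from the application.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Illustrative assumption: the application does not state which
    interval method was used for the figure it reports.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width
```

For example, `wilson_interval(98, 100)` returns a lower and an upper bound that bracket the observed accuracy of 0.98; the interval narrows as the hypothetical sample count grows.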
Detailed Description
Embodiments of the invention are described below by way of specific examples; those familiar with the art can readily understand other advantages and effects of the invention from the content disclosed in this specification.
Definitions of Terms
In this application, the term "differentially methylated region" (DMR) generally refers to a region of DNA containing one or more differentially methylated sites. For example, a DMR that contains a greater number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be called a hypermethylated DMR. Likewise, a DMR that contains a smaller number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be called a hypomethylated DMR.
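The distinction between hypermethylated and hypomethylated DMRs can be illustrated with a small sketch that compares the mean methylation of one region between a condition of interest and a reference group. The function name, the input layout, and the `min_delta` cutoff are assumptions made for illustration; the application does not specify a threshold.

```python
from statistics import mean


def classify_dmr(case_levels, control_levels, min_delta=0.2):
    """Label one region by the difference in mean methylation.

    case_levels / control_levels: per-sample methylation fractions
    (0.0 to 1.0) for the same region. min_delta is a hypothetical
    cutoff chosen only for this example.
    """
    delta = mean(case_levels) - mean(control_levels)
    if delta >= min_delta:
        return "hypermethylated"
    if delta <= -min_delta:
        return "hypomethylated"
    return "not differential"
```

A region whose mean methylation is much higher in the cancer samples than in the reference samples is labelled hypermethylated, and the reverse case is labelled hypomethylated.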
In this application, the terms "second-generation gene sequencing (NGS)", "high-throughput sequencing", and "next-generation sequencing" generally refer to second-generation high-throughput sequencing technology and the higher-throughput sequencing methods developed afterwards. Next-generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. As sequencing technology continues to develop, those skilled in the art will understand that other sequencing methods and devices can also be used in the present method. For example, second-generation gene sequencing can offer high sensitivity, high throughput, high sequencing depth, or low cost. Classified by development history, influence, sequencing principle, and technology, the main types are: Massively Parallel Signature Sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, and Complete Genomics' DNA nanoarray with combinatorial probe-anchor ligation sequencing. Second-generation gene sequencing makes a detailed, comprehensive analysis of the transcriptome and genome of a species possible, which is why it is also called deep sequencing. For example, the method of the present application can equally be applied to first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing, or single-molecule sequencing (SMS).
In this application, the term "sample to be tested" generally refers to a sample that needs to be tested. For example, it can be determined whether one or more gene regions in the sample to be tested carry a modification.
In this application, the terms "polynucleotide", "nucleotide", "nucleic acid", and "oligonucleotide" are used interchangeably. They denote a polymeric form of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogs thereof. A polynucleotide can have any three-dimensional structure and can perform any function, known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined by linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), short interfering RNA (siRNA), short hairpin RNA (shRNA), microRNA (miRNA), ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers, and adapters. A polynucleotide may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.
In this application, the term "methylation" generally refers to the methylation state of a gene fragment, a nucleotide, or a base thereof in this application. For example, the DNA fragment in which a gene of this application is located may be methylated on one strand or on multiple strands. For example, the DNA fragment in which a gene of this application is located may be methylated at one site or DMR, or at multiple sites or DMRs.
In this application, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome exists in different versions; for example, it may be hg19, GRCh37, or Ensembl 75.
In this application, the term "machine learning model" generally refers to a system, or a collection of program instructions and/or data, configured to implement an algorithm, process, or mathematical model. In this application, the algorithm, process, or mathematical model can evaluate a given input and provide a desired output. The parameters of the machine learning model need not be explicitly programmed, and, in the traditional sense, the machine learning model need not be explicitly designed to follow specific rules in order to provide the desired output for a given input. For example, using a machine learning model may mean that the model, and/or the data structure or set of rules that constitutes the model, has been trained by a machine learning algorithm.
In this application, the term "comprising" generally means including the explicitly specified features without excluding other elements.
In this application, the term "about" generally refers to a variation within the range of 0.5% to 10% above or below the specified value, for example 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
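The definition of "about" amounts to simple arithmetic, which the following sketch makes explicit. The function name is hypothetical; the 0.5% to 10% bounds come from the definition above.

```python
def about_range(value, tolerance_pct):
    """Return the (low, high) bounds implied by "about value".

    tolerance_pct must lie in the 0.5 to 10 range given in the
    definition of "about".
    """
    if not 0.5 <= tolerance_pct <= 10:
        raise ValueError("tolerance must be between 0.5% and 10%")
    delta = abs(value) * tolerance_pct / 100
    return value - delta, value + delta
```

Under this reading, "about 94 DMRs" at the widest tolerance of 10% covers roughly 84.6 to 103.4, and at the narrowest tolerance of 0.5% covers roughly 93.53 to 94.47.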
To detect six cancer types with high incidence and high mortality (lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer, and esophageal cancer), this application combines mining of a public database (TCGA) with mining of internal data. Using a new algorithm that compares methylation variation and spatial position across the genome at the same time, a total of 2536 differentially methylated regions (DMRs) highly associated with cancer were selected.
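The application does not disclose the details of its selection algorithm. Purely as a generic illustration of how methylation difference and genomic position can be considered together, the sketch below keeps CpG sites whose methylation difference exceeds a cutoff and merges neighbouring sites into candidate regions. The function name, the input layout, and both thresholds are assumptions, and this is not the algorithm of the application.

```python
def merge_sites_into_regions(sites, min_delta=0.2, max_gap=200):
    """Group nearby differential CpG sites into candidate regions.

    sites: iterable of (chrom, position, delta), where delta is the
    methylation difference between two groups at that site.
    min_delta and max_gap (in bp) are hypothetical cutoffs.
    Returns a list of (chrom, start, end, n_sites) tuples.
    Note: the sign of delta is ignored here for brevity.
    """
    selected = sorted((c, p) for c, p, d in sites if abs(d) >= min_delta)
    regions = []
    for chrom, pos in selected:
        last = regions[-1] if regions else None
        if last and last[0] == chrom and pos - last[2] <= max_gap:
            last[2] = pos      # extend the current region
            last[3] += 1       # count one more supporting site
        else:
            regions.append([chrom, pos, pos, 1])
    return [tuple(r) for r in regions]
```

A fuller treatment would also keep hypermethylated and hypomethylated sites in separate regions and would test each region statistically; those steps are omitted here.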
Summary of the Invention
In one aspect, the present application provides a biomarker combination for assessing the correlation between a sample to be tested and the risk of tumor formation, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1A, wherein the reference genome version for the DMRs in the table is hg19.
For example, the biomarker combination comprises the 94 DMRs in Table 1A. For example, the biomarker combination comprises about 94 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1A.
In one aspect, the present application provides a biomarker combination for assessing the correlation between a sample to be tested and the tissue of origin of a tumor, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1B, wherein the reference genome version for the DMRs in the table is hg19.
For example, the biomarker combination comprises the 103 DMRs in Table 1B. For example, the biomarker combination comprises about 103 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1B.
In one aspect, the present application provides a biomarker combination for assessing the correlation between a sample to be tested and the risk of tumor formation and/or the tissue of origin of a tumor, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1C, wherein the reference genome version for the DMRs in the table is hg19.
For example, the biomarker combination comprises any at least 222 DMRs in Table 1E. For example, the biomarker combination comprises about 222 DMRs, any at least about 220 DMRs, any at least about 210 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1E.
For example, the biomarker combination comprises the 488 DMRs in Table 1D. For example, the biomarker combination comprises about 488 DMRs, any at least about 480 DMRs, any at least about 450 DMRs, any at least about 400 DMRs, any at least about 300 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1D.
For example, the biomarker combination comprises the 860 DMRs in Table 1C. For example, the biomarker combination comprises about 860 DMRs, any at least about 850 DMRs, any at least about 800 DMRs, any at least about 700 DMRs, any at least about 600 DMRs, any at least about 500 DMRs, 400 DMRs, any at least about 300 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1C.
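The recurring requirement that a combination comprise "any at least 10" DMRs from a given table can be pictured as a membership check over genomic intervals tied to the hg19 build. The sketch below is an assumption-laden illustration: the `DMR` record, the function name, and the coordinates in the usage note are hypothetical, and none of them reproduce entries from Tables 1A to 1E.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DMR:
    """A genomic interval; build records the reference genome version."""
    chrom: str
    start: int
    end: int
    build: str = "hg19"


def is_valid_panel(panel, table, minimum=10):
    """True if the panel holds at least `minimum` distinct DMRs from `table`.

    Duplicates in the panel are counted once, and DMRs that are not in
    the table (including ones on a different build) are ignored.
    """
    table_set = set(table)
    chosen = {dmr for dmr in panel if dmr in table_set}
    return len(chosen) >= minimum
```

With a made-up table such as `[DMR("chr1", i * 1000, i * 1000 + 200) for i in range(94)]`, any ten distinct entries satisfy the check, while nine entries or ten copies of the same entry do not.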
For example, the tumor is derived from homogeneous tumors, heterogeneous tumors, hematological cancers, and/or solid tumors. For example, the tumor is derived from one or more of the following group of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, gastric cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, thoracic malignancies other than lung cancer, melanoma, and testicular cancer. For example, the tumor comprises lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
In one aspect, the present application provides a kit comprising the biomarker combination described herein and, optionally, second-generation high-throughput sequencing reagents. For example, the kit can be used to assess the correlation between a sample to be tested and the risk of tumor formation and/or the tissue of origin of a tumor.
In one aspect, the present application provides use of a reagent for detecting the biomarker combination described herein in the preparation of a kit for diagnosing the risk of tumor formation and/or the tissue of origin of a tumor. For example, the tumor is derived from homogeneous tumors, heterogeneous tumors, hematological cancers, and/or solid tumors. For example, the tumor is derived from one or more of the following group of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, gastric cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, thoracic malignancies other than lung cancer, melanoma, and testicular cancer. For example, the tumor comprises lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
In one aspect, the present application provides a method for assessing the correlation between a sample to be tested and the risk of tumor formation and/or the tumor tissue of origin of the sample, the method comprising: detecting the methylation level of a biomarker combination in the sample to be tested, wherein the biomarker combination comprises the biomarker combination described herein.
For example, the sample is selected from the group consisting of: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage fluid, peritoneal effusion, peritoneal lavage fluid, enema fluid, and cerebrospinal fluid.
一方面,本申请提供了一种储存介质,其记载可以运行本申请所述的方法的程序。例如,所述非易失性计算机可读存储介质可以包括软盘、柔性盘、硬盘、固态存储(SSS)(例如固态驱动(SSD))、固态卡(SSC)、固态模块(SSM))、企业级闪存驱动、磁带或任何其他非临时性磁介质等。非易失性计算机可读存储介质还可以包括打孔卡、纸带、光标片(或任何其他具有孔型图案或其他光学可识别标记的物理介质)、压缩盘只读存储器(CD-ROM)、可重写式光盘(CD-RW)、数字通用光盘(DVD)、蓝光光盘(BD)和/或任何其他非临时性光学介质。On the one hand, the present application provides a storage medium recording a program that can run the method described in the present application. For example, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, a solid state storage (SSS) (such as a solid state drive (SSD)), a solid state card (SSC), a solid state module (SSM)), an enterprise class flash drives, tapes, or any other non-transitory magnetic media, etc. Non-volatile computer-readable storage media may also include punched cards, paper tape, cursor pads (or any other physical media having a hole pattern or other optically identifiable markings), compact disk read-only memory (CD-ROM) , Compact Disc Rewritable (CD-RW), Digital Versatile Disc (DVD), Blu-ray Disc (BD) and/or any other non-transitory optical media.
一方面,本申请提供了一种设备,所述设备包含本申请所述的储存介质,以及所述设备任选地包含耦接至所述储存介质的处理器,所述处理器被配置为基于存储在所述储存介质中的程序执行以实现本申请所述的方法。In one aspect, the application provides an apparatus comprising a storage medium as described herein, and the apparatus optionally includes a processor coupled to the storage medium, the processor configured to The program stored in the storage medium is executed to implement the method described in this application.
不欲被任何理论所限,下文中的实施例仅仅是为了阐释本申请的试剂盒、方法和用途等,而不用于限制本申请发明的范围。Without intending to be limited by any theory, the following examples are only to illustrate the kits, methods and uses of the present application, and are not intended to limit the scope of the invention of the present application.
Examples
Example 1
Samples were subjected to exemplary bisulfite-treated next-generation sequencing; the resulting sequencing data include the methylation level and sequencing coverage depth of each CpG methylation site. Optionally, noise removal is performed using the genomic methylation signal at CpG sites and the noise-region CHH/CHG sites. Then, for the "tumor" (C) and "normal" (N) groups, a p-value is computed by weighted logistic regression. The explanatory variable is continuous, namely the methylation level of each CpG site, and the response variable is binary, (0, 1), corresponding to C and N. The weighted logistic regression tests, for each CpG site, whether the site distinguishes C from N; the null hypothesis is that the difference between C and N at that CpG site is not statistically significant, and the weights are determined by the coverage depth of each CpG site.
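The per-site test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function name `weighted_logit_pvalue`, the coding of tumor as 1 and normal as 0, the normalization of depths to a mean of 1, and the use of a model-based Wald test on the slope are all assumptions. The source states only that a logistic regression weighted by coverage depth yields a p-value per CpG site.

```python
import math
import numpy as np

def weighted_logit_pvalue(meth, label, depth, n_iter=50, ridge=1e-8):
    """Two-sided Wald p-value for the slope of a depth-weighted logistic regression.

    meth  : methylation level of one CpG site in each sample (continuous)
    label : 1 for tumor (C), 0 for normal (N); this coding is an assumption
    depth : coverage depth of the site in each sample, used as observation weight
    """
    x = np.asarray(meth, dtype=float)
    y = np.asarray(label, dtype=float)
    w = np.asarray(depth, dtype=float)
    w = w / w.mean()                          # assumption: weights rescaled to mean 1
    X = np.column_stack([np.ones_like(x), x])  # intercept + methylation level
    beta = np.zeros(2)
    for _ in range(n_iter):                   # Newton steps (IRLS)
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        v = w * p * (1.0 - p)
        hess = X.T @ (X * v[:, None]) + ridge * np.eye(2)
        grad = X.T @ (w * (y - p))
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = w * p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * v[:, None]) + ridge * np.eye(2))
    z = beta[1] / math.sqrt(cov[1, 1])        # Wald statistic for the slope
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability

# Hypothetical data for one CpG site: 8 tumor and 8 normal samples.
meth = [0.6, 0.7, 0.2, 0.8, 0.65, 0.3, 0.75, 0.9,
        0.1, 0.2, 0.5, 0.15, 0.05, 0.7, 0.25, 0.1]
label = [1] * 8 + [0] * 8
depth = [500, 500, 200, 400, 350, 150, 600, 450,
         500, 300, 250, 400, 550, 100, 450, 500]
print(weighted_logit_pvalue(meth, label, depth))
```

The iteration does not converge when the two groups are perfectly separated at a site, so a production pipeline would need a penalized or exact alternative for such sites.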
DMR partitioning
The partitioning of DMR regions is determined from the methylation level and sequencing coverage depth of the CpG sites. Specifically, the methylation level and sequencing coverage depth of each CpG site are combined according to the following formula:

Here $d_{ij}$ is the effective coverage depth of the j-th site of the i-th sample in group C, and $M_{ij}$ is the methylation level of the j-th site of the i-th sample in group C; the formula evaluates the similarity of methylation levels at spatially consecutive genomic sites. The deeper the coverage, the larger the parameter P, and the higher the assessed similarity of methylation levels between adjacent CpG sites within the same group.
Figure 1 shows an exemplary case (a theoretical illustration that does not represent actual sequencing data).
For the first CpG site in the region, samples A and B each obtained coverage of 500 valid reads, and sample C obtained coverage of 200 valid reads. For sample A, the methylation level of this CpG site is 0.2 and the methylation level of the second CpG site is 0. The coverage depth parameter P for the first CpG site of this group, computed over the three samples, is 0.617. Then $\beta_{ij} = |0.2 - 0| \cdot e^{(1 - 0.617)} = 0.29$. Because a difference in methylation level between two adjacent CpG sites (here measured by $\beta_{ij}$) of less than 0.25 is one of the necessary conditions for assigning the two sites to the same DMR, the first and second CpG sites in this example are not assigned to the same DMR.
Figure 2 shows another exemplary case (a theoretical illustration that does not represent actual sequencing data).
Suppose the samples are replaced by A, B, and D, where sample D obtained coverage of 400 valid reads at the first CpG site. As before, the methylation level of this CpG site in sample A is 0.2 and that of the second CpG site is 0. However, because sample D has greater sequencing coverage, the coverage depth parameter P for the first CpG site computed over the three samples is 0.962. Then $\beta_{ij} = |0.2 - 0| \cdot e^{(1 - 0.962)} = 0.21$, which is below the threshold of 0.25, so based on sample A the first and second CpG sites meet this prerequisite for being assigned to the same DMR.
Therefore, incorporating the coverage depth of CpG sites, as in the method of the present application, can significantly improve the accuracy of DMR partitioning.
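The two worked examples can be reproduced with a few lines of arithmetic. The sketch below assumes the form $|M_{j} - M_{j+1}| \cdot e^{(1 - P)}$, which is inferred from the numbers in the examples; the formula for P itself is not reproduced in the text, so P is taken as an input. The names `adjusted_difference` and `DMR_THRESHOLD` are illustrative.

```python
import math

DMR_THRESHOLD = 0.25  # threshold quoted in the worked examples

def adjusted_difference(m_first, m_second, p):
    """Coverage-adjusted methylation difference |M_j - M_{j+1}| * e^(1 - P)."""
    return abs(m_first - m_second) * math.exp(1.0 - p)

# P = 0.617 (samples A, B, C) and P = 0.962 (samples A, B, D), as in the text.
for p in (0.617, 0.962):
    beta = adjusted_difference(0.2, 0.0, p)
    print(p, round(beta, 2), beta < DMR_THRESHOLD)
# 0.617 0.29 False  -> sites not merged
# 0.962 0.21 True   -> prerequisite for merging is met
```

The raw difference is 0.2 in both cases; only the depth-dependent factor moves the value across the threshold.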
Further optionally, $B_{ij}$ within a region is calculated as follows:

Figures 3A-3C show another exemplary case (a theoretical illustration that does not represent actual sequencing data). When a DMR region contains 10 CpG sites, the $B_{ij}$ values of all samples are pooled and averaged to compute the score of each DMR.
The steps for calculating the B value within the DMR region shown for group A are given in Table 1 below:
Table 1. Calculation of B values within the DMR region of group A
The B-value score is 0.1.
Similarly, the B-value score within the DMR shown for group B is 0.7, and the B-value score within the DMR shown for group C is 1.233.
The DMR regions selected by this method contain not only cancer-associated alteration information for the various cancer types but also tissue-specific features, and they show better segmentation at region boundaries.
Figure 4 shows that a tissue-of-origin accuracy of 98% (95% CI: 96-99%) can be achieved in 5-fold cross-validation.
Example 2
Cancer assessment (DOC) model construction
The ctDNA content in blood varies greatly across cancers and across stages of progression, and is readily affected by experimental batch effects. In addition, methylation variation is associated with age, disease, ethnicity, and other factors; left unaddressed, these act as confounding variables that may impair the accuracy of the classification model. The present application uses a model construction method called Salmon, which first quantifies the bias introduced by confounding variables (for example, but not limited to, using the Hilbert-Schmidt independence criterion) and then embeds it in the regularization term of the model for correction, improving the accuracy and generalizability of the model.
Algorithm construction
Assume m samples, with feature vectors $X = (x_1, \dots, x_m)$, class labels $Y = (y_1, \dots, y_m)$, and confounding variables $Z = (z_1, \dots, z_m)$, where $x_i$ is an n-dimensional vector representing the methylation features of sample i, $y_i \in \{-1, +1\}$ is the class label of $x_i$, and $z_i$ is a confounding variable of sample i.
Here $L_H$ denotes the Hilbert-Schmidt independence criterion, which measures the degree of independence between the variables X and Z; $h(x)$ and $h(z)$ are the kernel functions of X and Z; $P_{h(x)h(z)}$ denotes the joint probability distribution of $h(x)$ and $h(z)$; F and G denote the reproducing kernel Hilbert spaces of X and Z, respectively, which can be understood as the domains onto which X and Z are mapped after nonlinear processing; $C_{h(x)h(z)}$ denotes the correlation coefficient of the two kernel functions; and the subscript HS denotes the Hilbert-Schmidt norm.
$$\|C_{h(x)h(z)}\|^2 = (E_{h(x)h(z)} - E_{h(x)}E_{h(z)})^2 = (E_{h(x)h(z)})^2 + (E_{h(x)}E_{h(z)})^2 - 2E_{h(x)h(z)}E_{h(x)}E_{h(z)}$$
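For intuition, the independence criterion can be estimated from data with the standard biased empirical estimator $\mathrm{trace}(KHLH)/(m-1)^2$, where K and L are kernel matrices for X and Z and H is the centering matrix. The sketch below is an illustration only: the choice of Gaussian (RBF) kernels, the bandwidths, and the function names `rbf_kernel` and `hsic` are assumptions, and the source does not state which estimator or kernel the applicants used.

```python
import numpy as np

def rbf_kernel(v, sigma=1.0):
    """Gaussian kernel matrix for samples in v (1-D or 2-D array, one row per sample)."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    sq = np.sum((v[:, None, :] - v[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def hsic(x, z, sigma_x=1.0, sigma_z=1.0):
    """Biased empirical HSIC: trace(K H L H) / (m - 1)^2."""
    K = rbf_kernel(x, sigma_x)
    L = rbf_kernel(z, sigma_z)
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (m - 1) ** 2

# Hypothetical data: a feature that tracks the confounder vs. one that does not.
rng = np.random.default_rng(0)
z = rng.normal(size=200)                       # confounder, e.g. standardized age
x_dependent = z + 0.1 * rng.normal(size=200)
x_independent = rng.normal(size=200)
print(hsic(x_dependent, z), hsic(x_independent, z))
```

A value near zero indicates little detectable dependence between the feature and the confounder; larger values indicate stronger dependence.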
A support vector machine (SVM) is used as the main classifier:
$$f(x; w, b) = \operatorname{sgn}(w^T x + b)$$
$$\operatorname{sgn}(a) = \begin{cases} 1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
The classification boundary is determined by solving the following objective:

$$\text{s.t. } y_i(w^T x_i + b) \ge 1$$
For non-separable data, the soft-margin SVM introduces a penalty term for training errors:

$$\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
Here C controls the balance between minimizing training error and maximizing the classification margin, and $\xi_i$ denotes the degree to which sample $x_i$ violates the margin constraint.
To control for confounding factors, Salmon adds a regularization term to the objective solved by the SVM; the parameter $\lambda$ controls the balance between confounder-related error and maximizing the margin width during training. The objective is:

$$\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
Here C and $\lambda$ control the balance among minimizing training error, minimizing the correlation between the confounding variables and the explanatory variables, and maximizing the classification margin.
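To make the three-way trade-off concrete, the sketch below evaluates a confounder-regularized hinge-loss objective for a given (w, b). The first two terms are the textbook soft-margin SVM objective, $\tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$, which matches the constraints stated above. The third term is an assumption made purely for illustration: $\lambda$ times the squared sample covariance between the decision scores and the confounder, which equals the biased empirical HSIC with linear kernels. The exact Salmon regularizer is not reproduced in the text, and `confounder_regularized_objective` is a hypothetical name.

```python
import numpy as np

def confounder_regularized_objective(w, b, X, y, z, C=1.0, lam=1.0):
    """0.5*||w||^2 + C*sum(hinge slack) + lam*cov(scores, z)^2 (illustrative form)."""
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    scores = X @ w + b
    slack = np.maximum(0.0, 1.0 - y * scores)   # smallest feasible xi_i for this (w, b)
    margin_term = 0.5 * float(w @ w)
    cov = float(np.sum((scores - scores.mean()) * (z - z.mean()))) / (len(z) - 1)
    return margin_term + C * float(slack.sum()) + lam * cov ** 2

# Hypothetical toy data: four samples, two features, labels in {-1, +1}.
X = [[1, 0], [2, 0], [-1, 0], [-2, 0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0
z_confounded = [1, 2, -1, -2]   # confounder aligned with the decision scores
z_unrelated = [1, -1, 1, -1]    # confounder uncorrelated with the scores
print(confounder_regularized_objective(w, b, X, y, z_confounded))
print(confounder_regularized_objective(w, b, X, y, z_unrelated))
```

With these inputs every sample satisfies the margin, so the two objective values differ only through the confounder term: a separating direction that tracks the confounder is penalized, while one that is uncorrelated with it is not.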
Figure 5 shows how the weights of confounder-associated features are controlled in the Salmon-DOC model of the present application.
Each data point represents a blood sample used to build the Salmon-DOC model. The horizontal axis is the confounding factor of the sample, and the vertical axes are the original uncorrected variable coefficient (panel A) and the corrected variable coefficient (panel B). Comparing the values before and after correction shows that the weights of confounder-associated features are controlled in Salmon-DOC.
Retrospective cohort data
The present application uses retrospective clinical samples from six cancer types, divided into a training set and a validation set, to evaluate the accuracy of the Salmon binary classifier (cancer vs. non-cancer).
Figures 6A-6F show that, in the tumor group, the Salmon-DOC model of the present application efficiently detects the six cancer types at different stages.
Figures 7A-7F show that, in the healthy group, the Salmon-DOC model of the present application overcomes the weakness of earlier methylation approaches, in which false positives increase with age, and remains balanced across age groups (horizontal axis: age; vertical axis: model cancer probability score).
Example 3
Tissue-of-origin (TOO) model construction
First-layer TOO model construction
The TOO model is essentially a multi-class classification problem. The probability calculation for each class can be simplified to voting over the pairwise binary classification results and selecting the class with the most votes. For possible clinical applications of a tissue-of-origin model, however, producing a single classification result is not enough; only by producing class probabilities does stacking (assembly) of models become possible.
Therefore, the first step of the Salmon-TOO model of the present application is to quantify the pairwise voting results. This quantification can be justified probabilistically. Given a data point x and a label y, assume that the pairwise class probabilities $\mu_{ij}$ exist. From the i-th and j-th classes of the training set, a model can be obtained such that, for any new data point x, the computed $r_{ij}$ serves as an approximate estimate of $\mu_{ij}$. The problem then reduces to estimating the probability of the i-th class from all $r_{ij}$:
$$p_i = P(y = i \mid x), \quad i = 1, \dots, k$$
Define $r_{ij}$ as the estimate of $\mu_{ij}$, and assume $\mu_{ij} + \mu_{ji} = 1$. For the multi-class problem, a voting rule is used:
$$\mu_{ij} \equiv P(y = i \mid y = i \text{ or } j, x)$$
Define I as the indicator function: $I\{x\} = 1$ if x is true and 0 otherwise. The probability calculation can then be written as:
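One common way to turn pairwise votes into class probabilities, consistent with the definitions above, is the voting estimate from the pairwise-coupling literature: $p_i = \frac{2}{k(k-1)} \sum_{j \ne i} I\{r_{ij} > r_{ji}\}$. The sketch below implements that rule; whether the applicants use exactly this normalization is an assumption, and `vote_probabilities` is an illustrative name.

```python
import numpy as np

def vote_probabilities(r):
    """Class probabilities from pairwise estimates by voting.

    r[i, j] estimates P(y = i | y is i or j, x), with r[i, j] + r[j, i] = 1.
    Diagonal entries are ignored.
    """
    r = np.asarray(r, dtype=float)
    k = r.shape[0]
    wins = (r > r.T).astype(float)        # I{r_ij > r_ji}
    np.fill_diagonal(wins, 0.0)
    return 2.0 * wins.sum(axis=1) / (k * (k - 1))

# Hypothetical pairwise estimates for three tissue classes.
r = np.array([[0.5, 0.8, 0.6],
              [0.2, 0.5, 0.3],
              [0.4, 0.7, 0.5]])
print(vote_probabilities(r))   # class 0 wins both of its pairings
```

Because each pair contributes exactly one vote when there are no ties, the resulting values sum to 1.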
Second-layer TOO model construction
The second layer of the Salmon-TOO model performs an MLR fit across the classes.
Suppose probabilities must be computed for several tissue origins; the first layer then yields one quantified binary classification probability for each pair of tissue origins, with values in $(-\infty, +\infty)$. Because the actual distributions of the pairwise probabilities differ from pair to pair, the quantified pairwise probabilities can further be used as the explanatory variables of a logistic regression whose response variable is multi-class, corresponding to the known tissue origins used during modeling.
Table 2. Binary classification probabilities for each pair of tissue classes
As shown in Table 2 above, each column represents one feature variable of the logistic regression, namely the binary classification probability for a pair of tissue classes; each row represents a response variable $y_1$, namely the tissue class.
The feature variables are those used to explain the binary classification probabilities. Assuming there are J categorical response levels in total, the evaluation result is converted into $Y_{i1}, \dots, Y_{iJ}$, and $\beta_j$ is the feature weight for each response level.
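The second layer can be pictured as a multinomial logistic regression that maps the pairwise scores of one sample to a probability for each tissue class. The sketch below shows only the prediction step, using the usual softmax form $P(Y = j \mid r) \propto \exp(\alpha_j + \beta_j^T r)$. The intercepts, the weight values, and the function name `mlr_probabilities` are hypothetical; fitting the weights, and the weight correction described in the text, are not shown.

```python
import numpy as np

def mlr_probabilities(features, weights, intercepts):
    """Softmax over linear scores: one probability per tissue class."""
    features = np.asarray(features, dtype=float)     # pairwise scores for one sample
    weights = np.asarray(weights, dtype=float)       # shape (J classes, n features)
    intercepts = np.asarray(intercepts, dtype=float)  # shape (J,)
    scores = intercepts + weights @ features
    scores = scores - scores.max()                   # stabilize the exponentials
    expo = np.exp(scores)
    return expo / expo.sum()

# Hypothetical example: 3 classes, hence 3 pairwise features (0 vs 1, 0 vs 2, 1 vs 2).
features = [2.0, 1.0, -0.5]
weights = [[1, 1, 0],      # class 0 gains from winning pairs (0,1) and (0,2)
           [-1, 0, 1],     # class 1 gains from losing (0,1) and winning (1,2)
           [0, -1, -1]]    # class 2 gains from losing (0,2) and (1,2)
print(mlr_probabilities(features, weights, [0.0, 0.0, 0.0]))
```

Letting the regression learn a separate weight for each pairwise score is what allows the second layer to compensate for pairwise classifiers whose outputs are on different scales.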
In the Salmon-DOC model, samples of some cancer types are called negative while samples of other cancer types are called positive. To account for this, the tissue classes are weight-corrected during tissue-of-origin modeling using a quasi-maximum-likelihood estimation method; taking binary logistic regression as an example, this can be expressed as:
Retrospective cohort data
All data in the retrospective cohort were randomly split 1:1 into a training set and a validation set. First, tissue-of-origin results were obtained by cross-validation on the training set, during which the model parameters were iteratively optimized and finally locked. All data in the validation set were then evaluated with the locked model. The training set for the tissue-of-origin model comprised 300 samples across the six cancer types, with relatively balanced numbers per cancer type and stage: 36 lung cancer cases (stages I-IV: 4/12/5/15), 62 intestinal cancer cases (8/18/18/18), 74 liver cancer cases (25/14/22/13), 48 ovarian cancer cases (1/4/38/5), 40 pancreatic cancer cases (3/6/13/18), and 42 esophageal cancer cases (5/10/15/12). The validation set comprised 224 samples: 31 lung cancer cases (stages I-IV: 4/5/12/10), 52 intestinal cancer cases (7/15/13/17), 55 liver cancer cases (17/11/20/7), 27 ovarian cancer cases (3/4/8/12), 25 pancreatic cancer cases (4/6/6/9), and 34 esophageal cancer cases (4/7/8/15).
Figure 8 shows that the tissue-of-origin accuracy of the two-layer Salmon-TOO model of the present application exceeds that of the single-layer model in both cross-validation and independent validation.
Figures 8A and 8B show the tissue-of-origin results from cross-validation on the six-cancer training set. Figure 8A shows the output with only the first-layer TOO model: the accuracy is 0.87 (260/300), or 0.93 (279/300) if the second-ranked prediction is also counted. Figure 8B shows the output after adding the second-layer MLR model on top of the first-layer TOO model: the accuracy rises to 0.90 (270/300), or 0.95 (284/300) if the second-ranked prediction is also counted. Similarly, Figures 8C and 8D show the tissue-of-origin results from independent validation on the six-cancer validation set. Figure 8C shows the output with only the first-layer TOO model: the accuracy is 0.77 (173/224), or 0.87 (194/224) if the second-ranked prediction is also counted. Figure 8D shows the output after adding the second-layer MLR model: the accuracy rises to 0.84 (187/224), or 0.89 (199/224) if the second-ranked prediction is also counted.
In summary, the two-layer Salmon-TOO model of the present application is more accurate than the single-layer model in both training-set cross-validation and independent validation.
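The two accuracy figures reported for each model (best prediction only, and best or second-best prediction) correspond to top-1 and top-2 accuracy. A minimal sketch of that calculation follows; the probability matrix and the function name `top_k_accuracy` are hypothetical and are not taken from the cohort data.

```python
import numpy as np

def top_k_accuracy(prob, truth, k=1):
    """Fraction of samples whose true class is among the k highest-probability classes."""
    prob = np.asarray(prob, dtype=float)
    truth = np.asarray(truth)
    top = np.argsort(-prob, axis=1)[:, :k]      # indices of the k largest per row
    hits = (top == truth[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical class probabilities for four samples and three tissue classes.
prob = [[0.70, 0.20, 0.10],
        [0.35, 0.45, 0.20],
        [0.10, 0.30, 0.60],
        [0.50, 0.30, 0.20]]
truth = [0, 0, 2, 1]
print(top_k_accuracy(prob, truth, k=1))   # 0.5
print(top_k_accuracy(prob, truth, k=2))   # 1.0
```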
Example 4
DOC cancer detection model
Table 3A lists the 94 DMR regions used in the DOC cancer detection model.
Table 3A. List of DMR regions used in the DOC cancer detection model

Based on the 94 DOC-associated DMR regions, 100 healthy samples and 318 six-cancer-positive samples in independent validation set 1 were evaluated; the overall sensitivity was 80.5% (256/318) and the overall specificity was 95% (95/100). With specificity held at 90%, the sensitivities by cancer type and stage are shown in Table 3B below:
Table 3B. Sensitivity by cancer type and stage
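Reporting sensitivity "with specificity held at 90%" implies choosing the score cut-off from the healthy samples and then applying it to the cancer samples. The sketch below shows one way to do this; the scores, the convention that a score above the cut-off is called positive, and the function names are assumptions for illustration, not the applicants' procedure.

```python
import numpy as np

def threshold_at_specificity(healthy_scores, target_specificity):
    """Smallest cut-off at which at least the target fraction of healthy scores is <= cut-off."""
    s = np.sort(np.asarray(healthy_scores, dtype=float))
    n_neg = int(np.ceil(target_specificity * len(s) - 1e-9))  # guard against float rounding
    return float(s[n_neg - 1])

def sensitivity_specificity(cancer_scores, healthy_scores, cutoff):
    """Sensitivity and specificity when scores above the cut-off are called positive."""
    cancer = np.asarray(cancer_scores, dtype=float)
    healthy = np.asarray(healthy_scores, dtype=float)
    return float((cancer > cutoff).mean()), float((healthy <= cutoff).mean())

# Hypothetical model scores.
healthy = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
cancer = [0.2, 0.5, 0.7, 0.8, 0.95]
cutoff = threshold_at_specificity(healthy, 0.90)
print(cutoff, sensitivity_specificity(cancer, healthy, cutoff))
```

Fixing the cut-off on healthy samples first keeps the comparison across cancer types and stages on the same operating point.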
The test was then repeated, each time using a random 50 of the 94 DOC regions. With specificity held at 90% (90/100), the sensitivities for the six-cancer-positive samples over five repeats are shown in Table 3C below:
Table 3C. Sensitivity for six-cancer-positive samples in five repeated tests at 90% specificity

Example 5
TOO tissue-of-origin model
Table 4A lists the 103 DMR regions used in the TOO tissue-of-origin model.
Table 4A. List of DMR regions used in the TOO tissue-of-origin model

Based on the 103 TOO-associated DMR regions, tissue of origin was evaluated for the 473 six-cancer-positive samples in independent validation set 2. The accuracy of the first-ranked prediction was 63.0% (298/473), rising to 71.5% (338/473) when the second-ranked prediction is also counted.
Figure 9 shows the tissue-of-origin results obtained with the 103 TOO-associated DMR regions.
Four rounds of repeated testing were then performed, each using a random 50 of the 103 TOO regions. The tissue-of-origin accuracies for the four rounds are shown in Table 4B below:
Table 4B. Tissue-of-origin accuracy in four rounds of repeated testing
Example 6
Combined DOC and TOO evaluation using DMRs:
Table 5A lists the 860 DMR regions used for the DOC and TOO evaluation models.
Table 5A. List of DMR regions used for the DOC and TOO evaluation models

Table 5B lists the 488 DMR regions used for the DOC and TOO evaluation models.
Table 5B. List of DMR regions used for the DOC and TOO evaluation models

Table 5C lists the 222 DMR regions used for the DOC and TOO evaluation models.
Table 5C. List of DMR regions used for the DOC and TOO evaluation models

In independent validation set 3, for 473 negative samples and 473 six-cancer-positive samples, sensitivity and tissue-of-origin accuracy were computed at a uniform specificity of 95.1% (450/473) as the number of markers was progressively reduced. The tumor detection and tissue-of-origin results are shown in Tables 5D and 5E below:
Table 5D. Tumor detection results for the different DMR sets
Table 5E. Tissue-of-origin results for the different DMR sets
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Many variations of the embodiments presently set forth herein will be apparent to those of ordinary skill in the art and remain within the scope of the appended claims and their equivalents.

Claims (22)

  1. 一种用于评估待测样本与肿瘤形成风险相关性的生物标志物组合,其特征在于,所述生物标志物组合包含表1A中所示的任意至少10个差异甲基化区域DMR,其中所述表中的DMR涉及的参考基因版本为hg19版本。A biomarker combination for assessing the correlation between a test sample and the risk of tumor formation, characterized in that the biomarker combination includes any of at least 10 differentially methylated region DMRs shown in Table 1A, wherein the The reference gene version involved in the DMR in the table is the hg19 version.
  2. 权利要求1所述的生物标志物组合,所述生物标志物组合包含表1A中任意至少50个DMR。The biomarker combination of claim 1, said biomarker combination comprising any at least 50 DMRs in Table 1A.
  3. 权利要求1-2中任一项所述的生物标志物组合,所述生物标志物组合包含表1A中94个DMR。The biomarker combination of any one of claims 1-2, said biomarker combination comprising 94 DMRs in Table 1A.
  4. 一种用于评估待测样本与肿瘤组织来源相关性的生物标志物组合,其特征在于,所述生物标志物组合包含表1B中所示的任意至少10个差异甲基化区域DMR,其中所述表中的DMR涉及的参考基因版本为hg19版本。A biomarker combination for assessing the correlation between a test sample and the origin of a tumor tissue, characterized in that the biomarker combination includes any of at least 10 differentially methylated region DMRs shown in Table 1B, wherein the The reference gene version involved in the DMR in the table is the hg19 version.
  5. 权利要求4所述的生物标志物组合,所述生物标志物组合包含表1B中任意至少50个DMR。The biomarker combination of claim 4, comprising any of at least 50 DMRs in Table IB.
  6. 权利要求4-5中任一项所述的生物标志物组合,所述生物标志物组合包含表1B中103个DMR。The biomarker combination of any one of claims 4-5, said biomarker combination comprising 103 DMRs in Table 1B.
  7. 一种用于评估待测样本与肿瘤形成风险和/或肿瘤组织来源的相关性的生物标志物组合,其特征在于,所述生物标志物组合包含表1C中所示的任意至少10个差异甲基化区域DMR,其中所述表中的DMR涉及的参考基因版本为hg19版本。A biomarker combination for assessing the correlation between a test sample and the risk of tumor formation and/or the origin of tumor tissue, characterized in that the biomarker combination includes any of at least 10 differential A shown in Table 1C Kylation region DMR, where the reference gene version involved in the DMR in the table is the hg19 version.
  8. 权利要求7所述的生物标志物组合,所述生物标志物组合包含1E、表1D或表1C中任意至少50个DMR。The biomarker combination of claim 7, said biomarker combination comprising any at least 50 DMRs in IE, Table 1D or Table 1C.
  9. 权利要求7-8中任一项所述的生物标志物组合,所述生物标志物组合包含表1E中222个DMR。The biomarker combination of any one of claims 7-8, said biomarker combination comprising 222 DMRs in Table 1E.
  10. 权利要求7-9中任一项所述的生物标志物组合,所述生物标志物组合包含表1D中488个DMR。The biomarker combination of any one of claims 7-9, said biomarker combination comprising 488 DMRs in Table 1D.
  11. 权利要求7-10中任一项所述的生物标志物组合,所述生物标志物组合包含表1C中860个DMR。The biomarker combination of any one of claims 7-10, said biomarker combination comprising 860 DMRs in Table 1C.
  12. 权利要求1-11中任一项所述的生物标志物组合,所述肿瘤来自于同质肿瘤(homogenous tumors)、异质肿瘤、血液癌和/或实体瘤;优选地,所述肿瘤来自于以下组的癌症中的一种或多种:脑癌、肺癌、皮肤癌、鼻咽癌、咽喉癌、肝癌、骨癌、淋巴瘤、胰腺癌、皮肤癌、肠癌、直肠癌、甲状腺癌、膀胱癌、肾癌、口腔癌、胃癌、实体瘤、卵巢癌、食管癌、胆囊癌、胆道癌、乳腺癌、宫颈癌、子宫癌、前列腺癌、头颈癌、肉瘤、除肺外胸腔恶性肿瘤、黑色素瘤、和睾丸癌。The biomarker combination of any one of claims 1-11, the tumor is derived from homogenous tumors (homogenous tumors), heterogeneous tumors, blood cancers and/or solid tumors; preferably, the tumor is derived from One or more cancers from the following groups: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, bowel cancer, rectal cancer, thyroid cancer, Bladder cancer, kidney cancer, oral cancer, gastric cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, bile duct cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, thoracic malignant tumors except lung, melanoma, and testicular cancer.
  13. 权利要求1-12中任一项所述的生物标志物组合,所述肿瘤包含肺癌,肠癌,肝癌,卵巢癌,胰腺癌,和/或食管癌。The biomarker combination of any one of claims 1-12, the tumor comprising lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
  14. 一种试剂盒,所述试剂盒包含权利要求1-13中任一项所述的生物标志物组合,以及任选地包含二代高通量测序试剂。A kit comprising the biomarker combination according to any one of claims 1-13 and optionally a second-generation high-throughput sequencing reagent.
  15. 权利要求14所述的试剂盒,所述试剂盒用于评估待测样本与肿瘤的形成风险和/或肿瘤组织来源的相关性。The kit according to claim 14, which is used to evaluate the correlation between the sample to be tested and the risk of tumor formation and/or the origin of tumor tissue.
  16. 用于检测权利要求1-13中任一项所述的生物标志物组合的试剂在制备诊断肿瘤形成风险和/或肿瘤组织来源的试剂盒中的应用。Use of a reagent for detecting the biomarker combination according to any one of claims 1 to 13 in the preparation of a kit for diagnosing the risk of tumor formation and/or the origin of tumor tissue.
  17. 如权利要求16所述的应用,所述肿瘤来自于同质肿瘤(homogenous tumors)、异质肿瘤、血液癌和/或实体瘤;优选地,所述肿瘤来自于以下组的癌症中的一种或多种:脑癌、肺癌、皮肤癌、鼻咽癌、咽喉癌、肝癌、骨癌、淋巴瘤、胰腺癌、皮肤癌、肠癌、直肠癌、甲状腺癌、膀胱癌、肾癌、口腔癌、胃癌、实体瘤、卵巢癌、食管癌、胆囊癌、胆道癌、乳腺癌、宫颈癌、子宫癌、前列腺癌、头颈癌、肉瘤、除肺外胸腔恶性肿瘤、黑色素瘤、和睾丸癌。The application of claim 16, wherein the tumor is derived from homogenous tumors, heterogeneous tumors, blood cancers and/or solid tumors; preferably, the tumor is derived from one of the following groups of cancers or more: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, bowel cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer , gastric cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, bile duct cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, thoracic malignant tumors other than lung, melanoma, and testicular cancer.
  18. The use according to any one of claims 16-17, wherein the tumor comprises lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
  19. A method for evaluating the correlation of a sample to be tested with the risk of tumor formation and/or the tumor tissue of origin, the method comprising: detecting the methylation level of a biomarker combination in the sample to be tested, wherein the biomarker combination comprises the biomarker combination according to any one of claims 1-13.
  20. The evaluation method according to claim 19, wherein the sample is selected from the following group: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage fluid, peritoneal effusion, peritoneal lavage fluid, enema fluid, and cerebrospinal fluid.
  21. A storage medium on which is recorded a program capable of executing the method according to any one of claims 19-20.
  22. A device comprising the storage medium according to claim 21, the device optionally comprising a processor coupled to the storage medium, the processor being configured to execute the program stored in the storage medium so as to implement the method according to any one of claims 19-20.
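
The detection step recited in claim 19 can be illustrated with a minimal sketch. The claims do not specify how a methylation level is calculated, so the sketch below assumes a common convention: the level of a region is the fraction of methylated CpG observations among all CpG observations covering that region. The names `CpGCount`, `methylation_levels`, and the region identifiers are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CpGCount:
    """Read counts at one CpG site (hypothetical input format)."""
    region_id: str      # identifier of the region (DMR) containing the site
    methylated: int     # observations supporting a methylated cytosine
    unmethylated: int   # observations supporting an unmethylated cytosine


def methylation_levels(counts: Iterable[CpGCount]) -> Dict[str, Optional[float]]:
    """Return, per region, methylated / (methylated + unmethylated).

    This ratio is an assumed definition of "methylation level"; a region
    with no coverage is reported as None rather than as zero.
    """
    totals: Dict[str, tuple] = {}
    for c in counts:
        m, u = totals.get(c.region_id, (0, 0))
        totals[c.region_id] = (m + c.methylated, u + c.unmethylated)

    levels: Dict[str, Optional[float]] = {}
    for region, (m, u) in totals.items():
        depth = m + u
        levels[region] = m / depth if depth else None
    return levels
```

Under these assumptions, the per-region levels would then serve as input to whatever downstream evaluation the method applies; that evaluation is outside the scope of this sketch.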
PCT/CN2023/109837 2022-08-01 2023-07-28 Multi-cancer methylation detection kit and use thereof WO2024027591A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210914446.XA CN117535404A (en) 2022-08-01 2022-08-01 Multi-cancer methylation detection kit and application thereof
CN202210914446.X 2022-08-01

Publications (1)

Publication Number Publication Date
WO2024027591A1 (en) 2024-02-08

Family

ID=89784781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/109837 WO2024027591A1 (en) 2022-08-01 2023-07-28 Multi-cancer methylation detection kit and use thereof

Country Status (2)

Country Link
CN (1) CN117535404A (en)
WO (1) WO2024027591A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190316209A1 (en) * 2018-04-13 2019-10-17 Grail, Inc. Multi-Assay Prediction Model for Cancer Detection
CN112820407A (en) * 2021-01-08 2021-05-18 清华大学 Deep learning method and system for detecting cancer by using plasma free nucleic acid
US20210310075A1 (en) * 2020-03-30 2021-10-07 Grail, Inc. Cancer Classification with Synthetic Training Samples
CN114171115A (en) * 2021-11-12 2022-03-11 深圳吉因加医学检验实验室 Differential methylation region screening method and device thereof
CN114736968A (en) * 2022-06-13 2022-07-12 南京世和医疗器械有限公司 Application of plasma free DNA methylation marker in lung cancer early screening and lung cancer early screening device
CN115132273A (en) * 2022-08-01 2022-09-30 广州燃石医学检验所有限公司 Method and system for evaluating tumor formation risk and tumor tissue source

Also Published As

Publication number Publication date
CN117535404A (en) 2024-02-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23849299

Country of ref document: EP

Kind code of ref document: A1