TWI670495B

TWI670495B - Method and system for identifying tumor burden in a sample

Info

Publication number: TWI670495B
Application number: TW106131581A
Authority: TW
Inventors: 薄世平; 梁覃斯; 任軍; 陸思嘉
Original assignee: 大陸商上海億康醫學檢驗所有限公司
Priority date: 2016-09-22
Filing date: 2017-09-14
Publication date: 2019-09-01
Also published as: WO2018054254A1; CN106367512A; TW201814290A

Abstract

The present invention provides a method and system for identifying tumor burden in a sample. Specifically, it provides a non-diagnostic method for identifying tumor burden in a sample, comprising the steps of: (i) providing a sample to be tested; (ii) sequencing the sample to obtain its genomic sequences; (iii) aligning the genomic sequences obtained in step (ii) to a reference genome to obtain the positions of the genomic sequences on the reference genome; (iv) dividing the reference genome into M regional fragments, each regional fragment being a window b, and calculating the copy number of each window b; (v) performing a Z-test on each window b of step (iv) to calculate the Z value of each window b; and (vi) calculating the genomic abnormality score (GAS) from the Z values obtained in step (v), and identifying the tumor burden in the sample based on the GAS value. The method and system of the invention can improve the sensitivity and general applicability of tumor detection.

Description

Method and system for identifying tumor burden in a sample

The present invention relates to the field of biotechnology, and in particular to a method and system for identifying tumor burden in a sample.

In biomedical research and clinical practice, the tumor cells of cancer patients often carry a large number of genomic copy number variations. Copy number variations can be found in tumor tissue and in body fluids (such as blood, interstitial fluid, lymph, cerebrospinal fluid, urine and saliva); in body fluids they are carried by circulating tumor cells (CTC), cell-free DNA (cfDNA), exosomes and the like. The copy number variation status of the genome in body fluids is an important indicator of tumor burden, and identifying tumor burden is useful for early tumor screening, diagnosis, disease monitoring, prognosis and treatment.

The main methods currently used to detect copy number variation in tumor genomes are comparative genomic hybridization (CGH), real-time fluorescence quantitative PCR (RTFQ PCR), fluorescence in situ hybridization (FISH) and multiplex ligation-dependent probe amplification (MLPA).

However, comparative genomic hybridization has low resolution (Mb level), low throughput and high cost. Fluorescence quantitative PCR likewise has low throughput and high cost, and can measure only one copy number variation at a time. Fluorescence in situ hybridization targets only specific positions, has low resolution, and its probe hybridization efficiency is unstable. Multiplex ligation-dependent probe amplification is complicated to operate, has low throughput, high cost and small coverage, and is prone to PCR contamination. Beyond these technical shortcomings, most of the above techniques examine only specific regions of the genome, whereas tumors are highly heterogeneous, so one or a few specific sites cannot give an effective, comprehensive assessment of the tumor burden in body fluids.

There is therefore an urgent need in the art for a method and device that can assess tumor burden in body fluids more effectively and comprehensively, and that improve the sensitivity and general applicability of tumor detection.

The present invention provides a method and device that can assess tumor burden in body fluids more effectively and comprehensively, and that improve the sensitivity and general applicability of tumor detection.

A first aspect of the present invention provides a non-diagnostic method for identifying tumor burden in a sample, comprising the steps of: (i) providing a sample to be tested; (ii) sequencing the sample to obtain its genomic sequences; (iii) aligning the genomic sequences obtained in step (ii) to a reference genome to obtain the positions of the genomic sequences on the reference genome; (iv) dividing the reference genome into M regional fragments, each regional fragment being a window b, and calculating the copy number of each window b; (v) performing a Z-test on each window b of step (iv) to calculate the Z value of each window b; and (vi) calculating the genomic abnormality score (GAS) from the Z values obtained in step (v), and identifying the tumor burden in the sample based on the GAS value.

In another preferred example, the reference genome may be continuous or discontinuous.

In another preferred example, the reference genome includes the whole genome.

In another preferred example, the reference genome refers to the full length of all chromosomes of the species (such as human), the full length of one or more chromosomes, a portion of one or more chromosomes, or a combination thereof.

In another preferred example, the reference genome covers 50% or more of the whole genome, preferably 60% or more, more preferably 70% or more, more preferably 80% or more, and most preferably 95% or more.

In another preferred example, the sample is from an individual to be tested.

In another preferred example, the individual to be tested is a human or a non-human mammal.

In another preferred example, the sample is a solid sample or a liquid sample.

In another preferred example, the sample includes a body fluid sample.

In another preferred example, the sample is selected from the group consisting of blood, plasma, interstitial fluid, lymph, cerebrospinal fluid, urine, saliva, aqueous humor, semen, and combinations thereof.

In another preferred example, the sample is selected from the group consisting of circulating tumor cells (CTC), cell-free DNA (cfDNA), exosomes, and combinations thereof.

In another preferred example, the sequencing is selected from the group consisting of single-end sequencing, paired-end sequencing, and combinations thereof.

In another preferred example, step (iv) further includes correcting the copy number of each window b and calculating the corrected copy number of each window b.

In another preferred example, the correction method is selected from the group consisting of Loess correction, the weighting method, the residual method, and combinations thereof.

In another preferred example, the number of sequences falling into each window b, their base distribution, and the base distribution of the reference genome are counted according to the positions of the genomic sequences on the reference genome.

In another preferred example, the copy number of each window b is corrected according to the sequences and base content of each window b.
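The correction by base content described above can be illustrated with a small sketch. The text names Loess correction among the options; the code below deliberately swaps in a simpler technique, median normalization within GC-content bins, so that it stays dependency-free. The function name `gc_bin_correct`, the `bin_width` parameter and the choice of the median are assumptions made for illustration and are not taken from the patent.

```python
from statistics import median


def gc_bin_correct(counts, gc_fractions, bin_width=0.01):
    """Scale each window's read count by (global median / median of its GC bin).

    counts       -- raw read count of each window
    gc_fractions -- GC fraction (0.0 to 1.0) of each window in the reference
    bin_width    -- width of a GC bin (hypothetical default, not from the source)
    """
    # Group the raw counts by GC bin.
    bins = {}
    for count, gc in zip(counts, gc_fractions):
        bins.setdefault(int(gc / bin_width), []).append(count)

    global_median = median(counts)
    corrected = []
    for count, gc in zip(counts, gc_fractions):
        bin_median = median(bins[int(gc / bin_width)])
        # Windows whose GC bin has a zero median cannot be rescaled; report 0.
        corrected.append(count * global_median / bin_median if bin_median > 0 else 0.0)
    return corrected
```

After this step, windows with different GC content have comparable expected counts, which is the property the later per-window Z-test relies on. A smoother such as Loess would replace the per-bin median with a fitted curve.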

In another preferred example, the Z value of each window b is calculated using the following formula: $Z_i = \dfrac{x_i - \mu_i}{\sigma_i}$, where i is any positive integer from 1 to M; M is the total number of windows into which the reference genome is divided, M being a positive integer ≥ 50, preferably 50 ≤ M ≤ 10⁵, more preferably 100 ≤ M ≤ 10⁵, most preferably 200 ≤ M ≤ 10⁵; x_i is the copy number value of the sample to be tested in the i-th window b_i; b_i is the i-th window; μ_i is the arithmetic mean of the copy numbers of the normal control samples in window b_i, calculated as $\mu_i = \dfrac{1}{N}\sum_{j=1}^{N} X_j$, where j is any positive integer from 1 to N; N is the total number of normal control samples, N being a positive integer ≥ 30, preferably 30 ≤ N ≤ 10⁸, more preferably 50 ≤ N ≤ 10⁷, most preferably 100 ≤ N ≤ 10⁴; X_j is the copy number value of the j-th normal control sample in window b_i; and σ_i is the standard deviation of the copy numbers of the normal control samples in window b_i, computed from X_j and μ_i, with N, j, X_j and μ_i as defined above.
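A minimal sketch of the per-window Z value, Z_i = (x_i − μ_i)/σ_i, may help make the definitions concrete. The function name `window_z_scores` and the data layout (one list of window values per sample) are hypothetical. The text does not state whether the standard deviation uses a divisor of N or N − 1; the population form (divisor N) is assumed here.

```python
import math


def window_z_scores(sample, controls):
    """Return the Z value of every window of `sample`.

    sample   -- list of M copy-number values x_i, one per window
    controls -- list of N lists; controls[j][i] is X_j for window b_i
    """
    n = len(controls)
    z_values = []
    for i, x_i in enumerate(sample):
        values = [control[i] for control in controls]  # X_1 .. X_N for window b_i
        mu_i = sum(values) / n                          # arithmetic mean
        # Assumption: population standard deviation (divisor N).
        sigma_i = math.sqrt(sum((v - mu_i) ** 2 for v in values) / n)
        # Raises ZeroDivisionError if all controls agree exactly in this window.
        z_values.append((x_i - mu_i) / sigma_i)
    return z_values
```

A window whose copy number matches the control mean gets a Z value near 0; gains give positive values and losses negative ones, in units of the control standard deviation.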

In another preferred example, the normal control sample refers to a sample of the same type taken from a normal individual of the same species.

In another preferred example, the genomic abnormality score is calculated using the following formula: $\mathrm{GAS} = \sum_{i=m_b}^{p_b} |Z_i|$, where the windows are ordered by ascending absolute Z value, m_b is the window ranked at the m-th percentile and p_b is the window ranked at the p-th percentile; m is 30-98, preferably 40-97, more preferably 60-96, most preferably 80-95, most preferably 95; p is 80-100, preferably 85-100, more preferably 90-100, most preferably 100; and p − m ≥ 2 (preferably ≥ 5, more preferably ≥ 10, more preferably ≥ 15, most preferably ≥ 20).
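The score described above, the sum of absolute Z values of the windows ranked between the m-th and p-th percentiles, can be sketched as follows. The function name `genomic_abnormality_score` is hypothetical. The sketch assumes that ranks are spread evenly from 0% (smallest |Z|) to 100% (largest |Z|) and that both percentile boundaries are inclusive; the text does not specify how boundary windows are handled.

```python
def genomic_abnormality_score(z_values, m=95.0, p=100.0):
    """Sum |Z| over the windows ranked from the m-th to the p-th percentile."""
    if not 0.0 <= m < p <= 100.0:
        raise ValueError("require 0 <= m < p <= 100")

    abs_z = sorted(abs(z) for z in z_values)  # ascending |Z|
    k = len(abs_z)
    total = 0.0
    for rank, value in enumerate(abs_z):
        # Assumption: rank 0 maps to 0 %, rank k-1 maps to 100 %, evenly spaced.
        percentile = 100.0 * rank / (k - 1) if k > 1 else 100.0
        if m <= percentile <= p:  # assumption: inclusive boundaries
            total += value
    return total
```

Because only the upper tail of the sorted |Z| values is summed, a sample whose windows mostly match the controls scores low even if a few windows are noisy, while widespread copy number changes push many windows into the tail and raise the score.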

In another preferred example, the following step is carried out before the genomic abnormality score is calculated: (a) based on the sequence features of the reference genome, removing regions of the genome that high-throughput sequencing cannot cover, such as centromeres, telomeres, satellites and heterochromatin, and removing regions of length L adjacent to the centromeres, telomeres, satellites and heterochromatin, where L is any length less than 3 Mb; or (b) based on the copy number features of the samples, removing regions of the genome that high-throughput sequencing cannot cover, such as centromeres, telomeres, satellites and heterochromatin.

In another preferred example, the following steps are carried out before step (v): (iv1) calculating the coefficient of variation CV_i of each window b in the normal control samples from the copy number of each window b obtained in step (iv); and (iv2) sorting the CV_i values in ascending order and removing the n% of windows with the largest values, where n is any value greater than 0 and less than or equal to 5, preferably n = 1, 2, 2.5, 3, 3.1, 4, 4.2 or 5.

In another preferred example, the coefficient of variation CV_i is calculated using the following formula: $CV_i = \dfrac{\sigma_i}{\mu_i}$, where μ_i is the arithmetic mean of the copy numbers of the normal control samples, calculated as $\mu_i = \dfrac{1}{N}\sum_{j=1}^{N} X_j$, and σ_i is the standard deviation of the copy numbers of the normal control samples; N, j, X_j, μ_i and σ_i are as defined above.
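The window filter based on CV_i = σ_i/μ_i can be sketched as below. The function name `filter_windows_by_cv` is hypothetical, the population standard deviation (divisor N) is an assumption, and rounding the number of removed windows down to an integer is also an assumption, since the text does not say how a fractional count is handled.

```python
import math


def filter_windows_by_cv(controls, n_percent=5.0):
    """Return the indices of windows kept after dropping the top n% by CV.

    controls  -- list of N lists; controls[j][i] is the copy number of
                 normal control sample j in window b_i
    n_percent -- percentage of windows to drop (0 < n <= 5 in the text)
    """
    n_samples = len(controls)
    n_windows = len(controls[0])
    cv = []
    for i in range(n_windows):
        values = [control[i] for control in controls]
        mu = sum(values) / n_samples
        # Assumption: population standard deviation (divisor N).
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n_samples)
        cv.append(sigma / mu)  # raises ZeroDivisionError if mu is 0

    order = sorted(range(n_windows), key=lambda i: cv[i])  # ascending CV
    n_drop = int(n_windows * n_percent / 100.0)            # assumption: round down
    return sorted(order[: n_windows - n_drop])
```

Windows that vary strongly even among normal controls are unreliable references, so removing them before the Z-test keeps noisy regions from inflating the final score.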

A second aspect of the present invention provides a system (device) for identifying tumor burden in a sample, comprising: a sequencing unit, used to perform nucleic acid sequencing on the sample to be tested and thereby obtain the genomic sequences of the sample; an alignment unit, connected to the sequencing unit and used to align the obtained genomic sequences of the sample to a reference genome, thereby obtaining the positions of the genomic sequences on the reference genome; a calculation and testing unit, connected to the alignment unit and used to calculate the copy number of each window b of the reference genome and to perform a Z-test on each window, thereby calculating the Z value of each window b; and an identification unit, connected to the calculation and testing unit and used to calculate the genomic abnormality score (GAS) from the Z values obtained and to identify the tumor burden in the sample based on the GAS value.

In another preferred example, the system further includes a correction unit, connected to the calculation and testing unit and used to correct the copy number of each window b of the reference genome, thereby calculating the corrected copy number of each window b.

In another preferred example, in the calculation and testing unit, before the Z-test is performed on each window b, the coefficient of variation CV_i of each window b may be calculated from the copy number of each window b, the CV_i values sorted in ascending order, and the n% of windows with the largest values removed, where n is any value greater than 0 and less than or equal to 5, preferably n = 1, 2, 2.5, 3, 3.1, 4, 4.2 or 5.

It should be understood that, within the scope of the present invention, the technical features described above and those specifically described below (for example in the examples) may be combined with one another to form new or preferred technical solutions. For brevity, these combinations are not enumerated here.

Through extensive and in-depth research, the inventors have for the first time established a method for identifying tumor burden in a sample that is effective and improves the sensitivity and general applicability of tumor detection. Specifically, a genomic abnormality score (GAS) is calculated, and the tumor burden in the sample is identified based on its value.

In addition, the present invention provides a system (device) for identifying tumor burden in a sample, comprising a sequencing unit, an alignment unit, a calculation and testing unit, and an identification unit. In a preferred example of the present invention, a correction unit is also included. The inventors completed the present invention on this basis.

Terms

As used herein, the term "copy number variation (CNV)" refers to an abnormal copy number of a chromosome or chromosome fragment in the sample genome, including but not limited to chromosomal aneuploidy, deletions and duplications, and microdeletions and microduplications larger than 1000 bp.

As used herein, the term "genomic abnormality score (GAS)" is a score calculated from the copy number abnormalities of chromosomes or chromosome fragments in the sample genome. The detection range of the score includes, but is not limited to, the whole genome, specific chromosomes, chromosome fragments and specific genes.

As used herein, the term "Z value (Z-score)", also called the standard score, is the difference between a value and the mean, divided by the standard deviation. As a formula: Z score = (x − μ)/σ, where x is a specific value, μ is the arithmetic mean and σ is the standard deviation. The Z value represents the distance between the original value and the reference mean, measured in units of the standard deviation.

As used herein, the term "partial response (PR)" means that the sum of the longest diameters of the target lesions decreases by ≥30% and that this is maintained for at least 4 weeks.

As used herein, the term "progressive disease (PD)" means that the sum of the longest diameters of the target lesions increases by ≥20%, or that new lesions appear.

As used herein, the terms "system" and "device" have the same meaning.

Reference genome

In the present invention, taking human as an example, the reference genome may be the whole genome or a partial genome, and it may be continuous or discontinuous. When the reference genome is a partial genome, its total coverage (F) is 50% or more of the whole genome, preferably 60% or more, more preferably 70% or more, more preferably 80% or more, and most preferably 95% or more, where the total coverage (F) is the percentage of the whole genome occupied by the reference genome.

In a preferred embodiment, the reference genome is the whole genome.

In a preferred embodiment, the reference genome is the full length of all chromosomes of the species (such as human), the full length of one or more chromosomes, a portion of one or more chromosomes, or a combination thereof.

Tumor burden

In the present invention, "tumor burden" refers to the degree of harm a tumor causes to the body, for example the size of the tumor, how active it is, whether it has metastasized, and how dangerous tumors at different sites are to the body. Indicators for evaluating tumor burden include (but are not limited to): tumor size, tumor marker levels, clinical symptoms (wheezing, pain, etc.), related complications (superior vena cava syndrome, etc.), and wasting (anemia, hypoproteinemia, etc.).

Sequencing

In the present invention, sequencing can be performed with conventional sequencing techniques and platforms. The sequencing platform is not particularly limited. Second-generation sequencing platforms include (but are not limited to): GA, GAII, GAIIx, HiSeq1000/2000/2500/3000/4000, X Ten, X Five, NextSeq500/550, MiSeq, MiSeqDx, MiSeq FGx and MiniSeq from Illumina; SOLiD from Applied Biosystems; 454 FLX from Roche; Ion Torrent, Ion PGM and Ion Proton I/II from Thermo Fisher Scientific (Life Technologies); BGISEQ1000, BGISEQ500 and BGISEQ100 from BGI (華大基因); BioelectronSeq 4000 from CapitalBio (博奧生物集團); DA8600 from Da An Gene Co., Ltd. of Sun Yat-sen University; NextSeq CN500 from Berry Genomics (貝瑞和康); BIGIS from Zhongke Zixin (中科紫鑫), a subsidiary of Zixin Pharmaceutical; and HYK-PSTAR-IIA from Huayinkang Gene (華因康基因).

Third-generation single-molecule sequencing platforms include (but are not limited to): the HeliScope system from Helicos BioSciences, the SMRT system from Pacific Biosciences, and GridION and MinION from Oxford Nanopore Technologies. The sequencing type may be single-end or paired-end. The read length may be 30 bp, 40 bp, 50 bp, 100 bp, 300 bp or any other length greater than 30 bp, and the sequencing depth may be 0.01, 0.02, 0.1, 1, 5, 10 or 30 times the genome, or any other multiple greater than 0.01.

In the present invention, the Illumina HiSeq2500 high-throughput sequencing platform is preferred, with single-end sequencing, a read length of 41 bp and a sequencing data volume of 5M.

Data processing

In the present invention, data processing generally includes the following steps:

(a) extracting nucleic acid from the sample to be tested and sequencing it to obtain genomic sequences;
(b) aligning the genomic sequences of the sample to a reference genome to obtain the positions of the sequences on the reference genome;
(c) dividing the reference genome into windows of a certain length and calculating the copy number of each window b;
(d) performing a Z-test on each window b and calculating the Z value of each window; and
(e) calculating the genomic abnormality score (GAS).

Step (a) further includes the following. The sample to be tested is a body fluid, which may be blood, interstitial fluid (also called tissue fluid or intercellular fluid), lymph, cerebrospinal fluid, urine or saliva. The detection target is the DNA contained in the body fluid, which is carried by circulating tumor cells (CTC), cell-free DNA (cfDNA), exosomes and the like. Methods for extracting DNA from the sample include (but are not limited to) column extraction and magnetic bead extraction. A library is constructed from the sample, and the sample is sequenced on a high-throughput sequencing platform.

Step (b) further includes the following. Adapters and low-quality data are removed from the sequencing results, and the remaining reads are aligned to the reference genome. The reference genome may be the whole genome, any chromosome, or part of a chromosome. A generally accepted sequence is usually chosen as the reference genome; for human, this may be the NCBI or UCSC assembly hg18 (NCBI36), hg19 (GRCh37) or hg38 (GRCh38), or any single chromosome or part of a chromosome. Any free or commercial alignment software may be used, such as BWA (Burrows-Wheeler Alignment tool), SOAPaligner/soap2 (Short Oligonucleotide Analysis Package) or Bowtie/Bowtie2. Aligning the sequences to the reference genome gives their positions on the genome. Sequences that align uniquely to the genome may be selected and sequences that align to multiple places removed, which eliminates the error that repetitive sequences introduce into the copy number calculation.

Step (c) further includes the following. The genome is divided into windows of a certain length; depending on the amount of sequencing data, the window lengths may be the same or different integers in the range 100 bp to 3,000,000 bp (3 Mb). The number of windows may be any integer in the range 1,000 to 30,000,000. Based on the positions of the sequenced reads on the genome, the number of sequences falling into each window, their base distribution, and the base distribution of the reference genome are counted. The copy number of each window is corrected according to the sequences and GC content of the window, using a correction method that includes but is not limited to Loess correction, and the corrected copy number of each window is calculated.

Step (d) further includes the following. Samples are taken from N normal individuals (N being a natural number not less than 30) and processed under the same extraction, library construction and sequencing conditions by repeating steps (a) to (c) above, to serve as the reference data set. Each window b_i then has N corresponding normal copy number values X_1, X_2, X_3, ..., X_N.

The arithmetic mean μ_i of the copy numbers of the normal control samples is calculated as $\mu_i = \dfrac{1}{N}\sum_{j=1}^{N} X_j$, and the standard deviation σ_i of the copy numbers of the normal control samples is calculated from X_j and μ_i.

The Z value of each window b_i of the sample to be tested is calculated as $Z_i = \dfrac{x_i - \mu_i}{\sigma_i}$, where x_i is the copy number value measured in window b_i.

Step (e) further includes the following. Highly repetitive regions exist across the whole genome and around particular chromosomes, chromosome fragments or genes, for example near centromeres, telomeres, satellites and heterochromatin. These highly repetitive regions are removed first so that they do not affect the calculation of the score.

In a preferred embodiment, the removal methods include (but are not limited to):

a. Removal based on the sequence features of the reference genome. Regions of the genome that high-throughput sequencing cannot cover, such as centromeres, telomeres, satellites and heterochromatin, are removed, together with regions of length L adjacent to them, where L may be any length less than 3 Mb; or

b. Removal based on the copy number features of the normal samples. For each window b_i, the coefficient of variation CV_i of the normal control samples in that window is calculated as $CV_i = \dfrac{\sigma_i}{\mu_i}$, where μ_i is the arithmetic mean and σ_i the standard deviation of the copy numbers of the normal control samples. The CV values are sorted in ascending order and the n% of windows with the largest values are removed, where n may be any value greater than 0 and less than or equal to 5.

Step (e) also includes the calculation of the genomic abnormality score (GAS), as follows. First, the detection range of the score is determined. The detection range includes, but is not limited to, the whole genome, a specific chromosome, a specific chromosome fragment or a specific gene, and may be any size from 1 Mb up to the length of the genome (about 3 Gb for human). Within the detection range, the absolute values of the Z values of the windows that remain after the repeat-affected windows are removed are sorted in ascending order and distributed evenly over the range 0% to 100%, with the smallest absolute Z value assigned to 0% and the largest to 100%. The absolute Z values of the windows falling between the m-th and p-th percentiles are then summed, where m is 30-98, preferably 40-97, more preferably 60-96, most preferably 80-95, most preferably 95; p is 80-100, preferably 85-100, more preferably 90-100, most preferably 100; and p − m ≥ 2 (preferably ≥ 5, more preferably ≥ 10, more preferably ≥ 15, most preferably ≥ 20). This sum is the genomic abnormality score (GAS): $\mathrm{GAS} = \sum_{i=m_b}^{p_b} |Z_i|$, where m_b is the window ranked at the m-th percentile and p_b is the window ranked at the p-th percentile. The GAS value is used to identify the tumor burden in body fluids.
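Steps (b) and (c) of the data-processing outline, keeping uniquely aligned reads and counting them per window, can be sketched as follows. The function name `count_reads_per_window`, the tuple layout of `alignments`, and the use of a mapping-quality threshold (`min_mapq`) as a stand-in for "uniquely aligned" are all assumptions for illustration; the text does not say how uniqueness is determined, and fixed-length windows are only one of the layouts it allows.

```python
def count_reads_per_window(alignments, chrom_lengths, window_size=1_000_000, min_mapq=30):
    """Count aligned reads in fixed-length windows along each chromosome.

    alignments    -- iterable of (chrom, zero_based_start, mapq) tuples
    chrom_lengths -- dict mapping chromosome name to its length in bp
    window_size   -- window length in bp (the text allows 100 bp to 3 Mb)
    min_mapq      -- hypothetical threshold used as a proxy for unique alignment
    """
    # One counter per window; the last window of a chromosome may be shorter.
    counts = {
        chrom: [0] * ((length + window_size - 1) // window_size)
        for chrom, length in chrom_lengths.items()
    }
    for chrom, start, mapq in alignments:
        if chrom not in counts or mapq < min_mapq:
            continue  # skip unknown contigs and ambiguously placed reads
        if 0 <= start < chrom_lengths[chrom]:
            counts[chrom][start // window_size] += 1
    return counts
```

The resulting per-window counts are the raw input to GC correction and to the per-window Z-test; in practice the tuples would come from an aligner's output rather than being built by hand.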

Method for identifying tumor burden in a sample

The present invention provides a method for identifying tumor burden in a sample that is effective and improves the sensitivity and general applicability of tumor detection, comprising the steps of: (i) providing a sample to be tested; (ii) sequencing the sample to obtain its genomic sequences; (iii) aligning the genomic sequences obtained in step (ii) to a reference genome to obtain the positions of the genomic sequences on the reference genome; (iv) dividing the reference genome into M regional fragments, each regional fragment being a window b, and calculating the copy number of each window b; (v) performing a Z-test on each window b of step (iv) to calculate the Z value of each window b; and (vi) calculating the genomic abnormality score (GAS) from the Z values obtained in step (v), and identifying the tumor burden in the sample based on the GAS value.

In a preferred example of the present invention, the method comprises the steps of: (a) performing nucleic acid extraction and sequencing on the sample genome to obtain genomic sequences; (b) aligning the sequences to a reference genome to obtain their positions on the genome; (c) dividing the reference genome into windows b of a certain length and calculating the copy number of each window b; and (d) performing a Z-test on each window b, calculating the Z value of each window b, and calculating the genomic disorder (GAS), so as to identify the tumor burden in the sample based on the value of the genomic disorder.

System (equipment) for identifying tumor burden in a sample

The present invention also provides a system (equipment) for identifying tumor burden in a sample, comprising:

a sequencing unit, configured to perform nucleic acid sequencing on a sample to be tested to obtain the genomic sequence of the sample;

an alignment unit, connected to the sequencing unit and configured to align the obtained genomic sequence of the sample with a reference genome to obtain position information of the genomic sequence on the reference genome;

a calculation and test unit, connected to the alignment unit and configured to calculate the copy number of each window b of the reference genome and to perform a Z-test on each window, thereby calculating the Z value of each window b; and

an identification unit, connected to the calculation and test unit and configured to calculate the genomic disorder (GAS) from the obtained Z values and to identify the tumor burden in the sample based on the value of the genomic disorder.

In a preferred embodiment, the system further comprises a correction unit, connected to the calculation and test unit and configured to correct the copy number of each window b of the reference genome, thereby calculating the corrected copy number of each window b.

The main advantages of the present invention include:

(1) The present invention establishes, for the first time, a method and system for identifying tumor burden in a sample; the method and system can identify tumor burden in a sample accurately and effectively.

(2) The method and system of the present invention can improve the sensitivity and versatility of tumor detection.

(3) The method and system of the present invention can reduce the pain caused by sampling when testing tumor patients and enable non-invasive testing.

(4) The method and system of the present invention can effectively test patients from whom samples cannot be taken for certain conventional tests.

(5) The method and system of the present invention can test tumor patients in real time, monitor the efficacy of medication, and provide a degree of guidance for physicians' medication and treatment decisions.

The present invention is further described below with reference to specific examples. It should be understood that these examples are only intended to illustrate the present invention and not to limit its scope. Experimental methods for which detailed conditions are not specified in the following examples are generally carried out under conventional conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or under the conditions recommended by the manufacturer. Unless otherwise stated, percentages and parts are by weight.

Unless otherwise specified, the materials used in the examples are commercially available products.

Example 1

The present invention has been applied to 15 cases with good results. To make its use and effect easier to understand, one case is described in further detail below. A brief flow chart of the implementation is shown in Figure 1, and the detailed procedure is as follows:

1. Nucleic acid extraction and sequencing of the sample genome

In this example, the test sample was the blood of a gastric cancer patient, from which cell-free DNA (cfDNA) and white blood cells were extracted. Nucleic acids were extracted with the CW2603 nucleic acid extraction kit from Kangwei Century Biotechnology Co., Ltd., following the product instructions provided by Kangwei Century Biotechnology Co., Ltd.

The library was constructed with the CW2185 library construction kit from Kangwei Century Biotechnology Co., Ltd. and then sequenced. Sequencing was performed on the Illumina HiSeq2500 high-throughput sequencing platform, following the instructions provided by Illumina. The sequencing type was single-end (Single End) sequencing, the read length was 41 bp, and the amount of sequencing data was 5M.

2. Align the sequences to the reference genome to obtain their positions on the genome

Adapters and low-quality data were removed from the sequencing results, and the remaining reads were aligned to the reference genome. The reference genome was the human genome UCSC hg19 (GRCh37), and the alignment software was BWA (Burrows-Wheeler Alignment tool), run with default parameters. The reads were aligned to the reference genome to obtain their positions on the genome, and only reads that aligned uniquely to the genome were selected.

3. Divide the reference genome into windows of a certain length and calculate the copy number of each window

The genome was divided into 15489 windows b (regions), each 200 kb in length. Based on the positions of the reads on the genome, the number of reads falling into each window b, their base composition, and the base composition of the reference genome were counted. The copy number of each window b was then corrected according to the reads in the window and their GC content, using Loess as the correction method, and the corrected copy number of each window b was calculated.
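The counting part of this step can be sketched as below. This is a minimal illustration, not the procedure actually used in the example: the function names, the `(chromosome, 0-based start)` input format, and the rule of assigning a read to the window containing its start position are all assumptions, and the Loess GC correction itself is not shown.

```python
from collections import Counter

WINDOW_SIZE = 200_000  # 200 kb windows, as in the example


def count_reads_per_window(read_positions, window_size=WINDOW_SIZE):
    """Count uniquely aligned reads per (chromosome, window index).

    read_positions: iterable of (chromosome, 0-based start) pairs.
    A read is assigned to the window that contains its start (an assumption).
    """
    counts = Counter()
    for chrom, start in read_positions:
        counts[(chrom, start // window_size)] += 1
    return counts


def gc_fraction(sequence):
    """Fraction of G and C among called bases (A, C, G, T); None if there are none."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


if __name__ == "__main__":
    reads = [("chr1", 0), ("chr1", 199_999), ("chr1", 200_000), ("chr2", 450_000)]
    print(count_reads_per_window(reads))
    print(gc_fraction("ACGTNN"))
```

The per-window read counts and GC fractions are the inputs a Loess-style correction would need; any smoothing implementation can be fitted to count versus GC fraction and used to rescale the counts.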

4. Calculate the CV value of each window

Samples from 100 normal individuals were taken and processed under the same extraction, library construction, and sequencing conditions by repeating steps 1, 2, and 3 above. The resulting normal control sample data served as the reference data set for calculating the CV value of each window b_i of the sample to be tested.

Each window b_i thus corresponds to N normal copy number values (N = 100 in this example).

The arithmetic mean μ_i of the copy numbers of the normal control samples is calculated as:

$$\mu_i = \frac{1}{N}\sum_{j=1}^{N} X_j$$

The standard deviation σ_i of the copy numbers of the normal control samples is calculated from the same values, where X_1, X_2, X_3, ..., X_N are the copy number values of the N normal samples in window b_i.

The CV value of each window b_i of the sample to be tested is then calculated as:

$$\mathrm{CV}_i = \frac{\sigma_i}{\mu_i}$$
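As a rough sketch of step 4, the per-window reference statistics can be computed from the normal control panel as follows. The function name `reference_stats` is hypothetical, and the use of the population standard deviation (dividing by N) is an assumption, since the normalisation of σ_i is not recoverable from this text; `statistics.stdev` (dividing by N-1) would be the alternative.

```python
from statistics import mean, pstdev


def reference_stats(normal_copy_numbers):
    """Return (mu_i, sigma_i, CV_i) for one window b_i.

    normal_copy_numbers: the N copy number values X_1..X_N of the
    normal control samples in this window.
    """
    mu = mean(normal_copy_numbers)
    # Assumption: population standard deviation (divide by N).
    sigma = pstdev(normal_copy_numbers, mu)
    if mu == 0:
        raise ValueError("mean copy number is zero; CV is undefined")
    return mu, sigma, sigma / mu


if __name__ == "__main__":
    # Toy panel of four normal controls for a single window.
    print(reference_stats([1.9, 2.1, 2.0, 2.0]))
```

With N = 100 controls, as in this example, the two standard-deviation conventions differ by about 0.5%, so the choice has little effect on which windows rank as the noisiest.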

5. Perform a Z-test on each window and calculate the Z value of each window

The Z value of each window b_i of the sample to be tested is calculated as:

$$Z_i = \frac{x_i - \mu_i}{\sigma_i}$$

where x_i is the copy number value measured in window b_i, μ_i is the arithmetic mean of the copy numbers of the normal control samples, and σ_i is the standard deviation of the copy numbers of the normal control samples, both calculated as in step 4.
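The Z value computation is a per-window standardisation against the normal controls, which might look like the sketch below. The function name and the list-based inputs are illustrative assumptions, as is the choice to reject windows whose reference standard deviation is zero, a case the text does not address.

```python
def window_z_scores(sample_copy_numbers, ref_means, ref_sds):
    """Compute Z_i = (x_i - mu_i) / sigma_i for every window b_i.

    All three inputs are sequences indexed by window, in the same order.
    """
    if not (len(sample_copy_numbers) == len(ref_means) == len(ref_sds)):
        raise ValueError("inputs must cover the same windows")
    z_values = []
    for x, mu, sigma in zip(sample_copy_numbers, ref_means, ref_sds):
        if sigma == 0:
            # Assumption: such windows must be filtered out beforehand.
            raise ValueError("reference standard deviation is zero")
        z_values.append((x - mu) / sigma)
    return z_values


if __name__ == "__main__":
    print(window_z_scores([2.5, 1.0], [2.0, 2.0], [0.25, 0.5]))
```

A positive Z value indicates a window with more copies than the normal controls and a negative value indicates fewer, which is why the later disorder score works on absolute values.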

6. Calculate the genomic disorder (GAS)

In this example, the windows were sorted by CV from smallest to largest, and the 5% of windows with the largest CV were removed and excluded from the disorder calculation below. The detection range of the disorder was the entire genome. The absolute values of the Z values were sorted from smallest to largest, and the absolute Z values of the windows ranked from m% to p% were summed; this cumulative value is the genomic disorder (GAS), calculated as:

$$\mathrm{GAS} = \sum_{b=m_b}^{p_b} \lvert Z_b \rvert$$

where m_b is the window ranked at m% and p_b is the window ranked at p%, with m = 95 and p = 100.

The GAS value was used to identify the tumor burden in body fluids.

7. Test results

More than a dozen samples were tested. A typical case is described below.

The test results are shown in Table 1, Figure 2, and Figure 3.

Table 1. Tumor burden test results for the clinical medication response of a gastric cancer patient in Example 1

The results show that before clinical medication, when the patient was diagnosed with gastric cancer, the cfDNA copy number was severely abnormal (Figure 3, S1), the whole-genome disorder was 999.84, and the tumor burden in the blood was relatively severe.

With medication, the cfDNA copy number had returned to normal by the fourth cycle, and the whole-genome disorder was 728.80, close to the value of 729.86 for normal white blood cells.

Using the same method as in this example, the whole-genome disorder of the 100 normal individuals above was calculated; the normal range was 722.87-739.89, with an arithmetic mean of 733.22. The whole-genome disorder values of the fourth medication cycle and of the white blood cells in this example fall within the normal range, indicating that the tumor burden in the blood was very small, which corresponds to the clinical efficacy evaluation of PR (partial response).

With further medication, the tumor developed drug resistance, the cfDNA copy number abnormalities became severe again, the whole-genome disorder score increased, and the tumor burden in the blood became more severe. By the seventh medication cycle the whole-genome disorder was at its highest, which corresponds to the clinical efficacy evaluation of PD (progressive disease).

The results show that genomic disorder can effectively identify tumor burden in body fluids.

All documents mentioned in the present invention are incorporated by reference in this application, as if each document were individually incorporated by reference. In addition, it should be understood that after reading the above teachings of the present invention, those skilled in the art can make various changes or modifications to the present invention, and these equivalent forms also fall within the scope defined by the claims appended to this application.
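Two supporting operations from Example 1, removing the highest-CV windows before scoring and comparing a GAS value with the range observed in the normal controls, can be sketched as follows. The function names are hypothetical, rounding the number of removed windows down is an assumption, and the range check is only an illustration of how the reported figures relate to each other, not a diagnostic rule.

```python
import math

# Range of GAS values reported for the 100 normal controls in Example 1.
NORMAL_GAS_RANGE = (722.87, 739.89)


def drop_high_cv_windows(cv_by_window, top_percent=5.0):
    """Return window ids sorted by CV, without the top_percent largest.

    cv_by_window: dict mapping a window id to its CV value.
    The number of removed windows is rounded down (an assumption).
    """
    ranked = sorted(cv_by_window, key=cv_by_window.get)
    n_drop = math.floor(len(ranked) * top_percent / 100.0)
    return ranked[: len(ranked) - n_drop]


def within_normal_range(gas, normal_range=NORMAL_GAS_RANGE):
    """True if a GAS value lies inside the range seen in normal controls."""
    low, high = normal_range
    return low <= gas <= high


if __name__ == "__main__":
    toy_cv = {f"w{i}": i / 100 for i in range(20)}
    print(len(drop_high_cv_windows(toy_cv)))
    # Values reported in Example 1: before medication and at the fourth cycle.
    print(within_normal_range(999.84), within_normal_range(728.80))
```

Filtering on CV removes windows that are unstable even among normal samples, so that a high GAS reflects deviation in the tested sample rather than noisy regions of the reference panel.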

Figure 1 shows a flow chart of the analytical method for identifying tumor burden in body fluids. Figure 2 shows the tumor burden test results of the patient over different clinical medication cycles. Figure 3 shows the genome-wide copy number variation of S1-7 and the corresponding GAS.

Claims

A method for non-diagnostically identifying tumor burden in a sample, comprising the steps of: (i) providing a sample to be tested; (ii) sequencing the sample to be tested to obtain a genomic sequence of the sample; (iii) aligning the genomic sequence obtained in step (ii) with a reference genome to obtain position information of the genomic sequence on the reference genome; (iv) dividing the reference genome into M region fragments, each region fragment being one window b, and calculating the copy number of each window b; (v) performing a Z-test on each window b of step (iv) to calculate the Z value of each window b; and (vi) calculating the genomic disorder from the Z values obtained in step (v), and identifying the tumor burden in the sample to be tested based on the value of the genomic disorder, wherein the genomic disorder is calculated using the following formula:

$$\mathrm{GAS} = \sum_{b=m_b}^{p_b} \lvert Z_b \rvert$$

wherein m_b is the window ranked at m%, p_b is the window ranked at p%, m is 30-98, p is 80-100, and p-m ≥ 2.

The method of claim 1, wherein the reference genome comprises a whole genome.

The method according to claim 1 or 2, wherein the coverage of the reference genome reaches more than 50% of the entire genome.

The method of claim 1, wherein the sample is selected from the group consisting of blood, plasma, interstitial fluid, lymph fluid, cerebrospinal fluid, urine, saliva, aqueous humor, semen, or a combination thereof.

The method according to claim 1, wherein the step (iv) further comprises the steps of correcting the copy number of each window b and calculating the corrected copy number of each window b.

The method of claim 1, wherein the Z value of each window b is calculated using the following formula:

$$Z_i = \frac{x_i - \mu_i}{\sigma_i}$$

wherein i is any positive integer from 1 to M; M is the total number of windows into which the reference genome is divided, M being a positive integer ≥ 50; x_i is the copy number value measured for the sample to be tested in the i-th window b_i; b_i is the i-th window; μ_i is the arithmetic mean of the copy numbers of the normal control samples in window b_i, calculated using the following formula:

$$\mu_i = \frac{1}{N}\sum_{j=1}^{N} X_j$$

wherein j is any positive integer from 1 to N; N is the total number of normal control samples, N being a positive integer ≥ 30; X_j is the copy number value measured for the j-th normal control sample in window b_i; and σ_i is the standard deviation of the copy numbers of the normal control samples in window b_i, calculated from N, j, X_j and μ_i as defined above.

The method according to claim 1, wherein m is 40-97, p is 85-100, and p-m ≥ 5.

The method according to claim 1, wherein before step (v), the method further comprises the following steps: (iv1) calculating the coefficient of variation CV_i of each window b in the normal control samples according to the copy number of each window b in step (iv); and (iv2) sorting the CV_i values from smallest to largest and removing the n% of windows with the largest CV_i, where n is any value greater than 0 and less than or equal to 5.

The method according to claim 8, wherein the coefficient of variation CV_i is calculated using the following formula:

$$\mathrm{CV}_i = \frac{\sigma_i}{\mu_i}$$

wherein μ_i is the arithmetic mean of the copy numbers of the normal control samples, calculated using the following formula:

$$\mu_i = \frac{1}{N}\sum_{j=1}^{N} X_j$$

and σ_i is the standard deviation of the copy numbers of the normal control samples, where N, j, X_j, μ_i and σ_i are as defined above.

The method according to claim 3, wherein the coverage of the reference genome reaches more than 60% of the whole genome.