WO2020224504A1 - cfDNA classification method, device and use - Google Patents

cfDNA classification method, device and use

Info

Publication number
WO2020224504A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
cfdna
chry
chr8
chr7
Prior art date
Application number
PCT/CN2020/087830
Other languages
English (en)
French (fr)
Inventor
慈维敏
葛广哲
周媛媛
李学松
Original Assignee
中国科学院北京基因组研究所
北京大学第一医院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院北京基因组研究所, 北京大学第一医院
Priority to US17/609,036 (published as US20220336043A1)
Publication of WO2020224504A1

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10Ploidy or copy number detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00ICT specially adapted for the handling or processing of medical references
    • G16H70/60ICT specially adapted for the handling or processing of medical references relating to pathologies
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/158Expression markers

Definitions

  • The present invention belongs to the fields of genomics and bioinformatics, and relates to a cfDNA classification method, device and use.
  • Tumors of the genitourinary system are serious diseases that endanger human health.
  • Current diagnostic and monitoring methods for genitourinary system tumors are usually invasive or lack sensitivity and specificity.
  • Kidney cancer accounts for about 3% of adult malignant tumors and 90%-95% of kidney tumors, of which about 75% are renal clear cell carcinomas. At present, surgical treatment is still the most effective treatment for localized renal cancer, but approximately 20%-40% of patients will relapse after surgery. Renal cell carcinoma has low sensitivity to radiotherapy and chemotherapy. The mortality rate of renal cancer patients is as high as 40%. The high mortality rate caused by renal cancer is mainly due to the lack of obvious clinical symptoms in the early stage and the lack of effective treatment methods in the advanced stage. At present, imaging, fine needle aspiration (FNA), and core biopsy (CB) can only assist in monitoring and cannot give a clear diagnosis. Currently, there is no tumor marker with good sensitivity and specificity that can be used for early diagnosis and postoperative follow-up of renal cancer.
  • Urothelial cancer is a malignant tumor arising from the transitional epithelium that lines the renal pelvis, ureter, bladder and urethra. It mainly comprises upper urothelial cancer, located in the renal pelvis and ureter, and bladder cancer. Upper urothelial cancer is relatively rare, accounting for only 5%-10% of urothelial cancers, but in China its proportion is as high as 30%. A number of studies have shown that this regional pattern may be related to the use of traditional Chinese medicines containing aristolochic acid and its analogues. In addition, although they arise from the same tissue, upper urothelial cancer and bladder cancer have very different clinicopathological characteristics.
  • Cystoscopy is expensive and invasive, and it adds to the patient's pain.
  • Bladder cancer has a high recurrence rate, and cystoscopy is inconvenient for long-term, lifelong prognostic monitoring.
  • Prostate cancer is a common malignant tumor in men, and the incidence is on the rise to a certain extent. Prostate cancer has no symptoms in the early stage. When the tumor develops to a certain extent, it will block the urethra or invade the bladder neck, causing frequent urination, urgency, and incontinence. Many patients are in the advanced stage when they are diagnosed, and many patients in the advanced stage have bone metastases.
  • The accepted examination methods for prostate cancer are digital rectal examination and prostate-specific antigen (PSA) testing, but PSA levels can also be affected by factors such as prostatitis, urinary retention, catheterization and drugs, resulting in many false positives.
  • Liquid biopsy mainly includes the detection of circulating tumor cells (CTCs), circulating tumor DNA (ctDNA), exosomes and circulating RNA, whereas traditional diagnosis relies on clinical symptoms or imaging.
  • The inventors surprisingly found that detecting cell-free DNA (cfDNA) in the urine supernatant facilitates the detection or diagnosis of early-stage, low-grade, non-invasive tumors of the urinary system. The inventors further designed and completed the experiments, sequencing and analysis. By detecting cfDNA copy number variation (CNV) in the urine supernatant, the diagnosis and classification of up to 3 genitourinary system tumors can be completed in a single test.
  • The following inventions are therefore provided:
  • One aspect of the present invention relates to a cfDNA classification method, including:
  • a classifier model is used to determine the classification to which the target cfDNA belongs.
  • the classification method, wherein determining the classification to which the target cfDNA belongs includes:
  • a random forest model is used to determine the correlation between the cfDNA copy number variation data of each classification label and the human urogenital system tumor;
  • the classifier model is used to determine the classification to which the target cfDNA belongs.
  • the classification method wherein determining the correlation between the cfDNA copy number variation data of each classification label and the tumor of the human urogenital system includes:
  • the vector sequence is input into the random forest model, and the correlation between the cfDNA copy number variation data of the classification label and the tumor of the human urogenital system is determined.
  • the classification method wherein the human genitourinary system tumor is one or more selected from prostate cancer, urothelial cancer and renal cancer;
  • the kidney cancer is clear cell renal cell carcinoma
  • the urothelial cancer is upper urothelial cancer and/or bladder cancer,
  • the prostate cancer is prostate adenocarcinoma
  • the human urogenital system tumor is diagnosed by tissue biopsy of surgical samples.
  • the classification method, wherein the random forest model comprises at least 3 random forest binary classifiers and is selected from any one, two, three or all four of the following groups I-IV:
  • Normal-vs-kidney cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
  • Kidney cancer-vs-normal, kidney cancer-vs-urothelial cancer, kidney cancer-vs-prostate cancer;
  • the classification method, wherein each group casts its votes and the class corresponding to the group with the most votes is the final classification; if the vote counts are tied, the class with the highest predicted probability among the tied groups is the final classification. The inventors name this integrated classification method GUdetector.
  • the classification method, wherein the cfDNA copy number variation data in the target sample and/or the cfDNA copy number variation data of each classification label are calculated from sequencing data of cfDNA in a urine sample; preferably, the sequencing data are whole-genome sequencing data; preferably, the sequencing depth is 1X-5X.
  • the classification method wherein the cfDNA copy number variation data in the target sample and/or the cfDNA copy number variation data of each classification label are calculated according to the following method:
  • A is the actual number of reads in a bin after GC content correction
  • B is the theoretical number of reads in the bin, which is the total number of reads measured by the sample divided by the total number of bins;
  • the ratio A/B is the copy number variation.
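The ratio defined above can be sketched in a few lines of Python. This is only an illustration: the function name `copy_number_ratios` and the sample counts are hypothetical, and the sketch assumes that the "total number of reads measured by the sample" is the sum of the per-bin, GC-corrected counts.

```python
def copy_number_ratios(corrected_reads):
    """Return the ratio A/B for every bin of one sample.

    corrected_reads: GC-corrected read count (A) of each bin.
    B, the theoretical read count, is identical for all bins:
    total reads divided by the number of bins.
    """
    if not corrected_reads:
        raise ValueError("at least one bin is required")
    total_reads = sum(corrected_reads)
    if total_reads == 0:
        raise ValueError("sample has no reads")
    expected = total_reads / len(corrected_reads)  # B
    return [a / expected for a in corrected_reads]


# Hypothetical sample with four bins: B = 400 / 4 = 100 reads per bin.
ratios = copy_number_ratios([100, 150, 50, 100])
print(ratios)  # [1.0, 1.5, 0.5, 1.0]
```

In this toy sample the second bin has more reads than expected and the third has fewer, while the other two match the theoretical count.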
  • the classification method, wherein the genome of the sample to be tested is divided, by software or algorithms such as Varbin, CNVnator, ReadDepth or SegSeq, into 5,000-500,000 bins of equal length or of equal theoretical simulation copy number.
  • the ratio A/B of the number of reads corresponding to each bin is calculated by software or algorithms such as Varbin, CNVnator, ReadDepth, or SegSeq.
  • the classification method, wherein the genome of the sample to be tested is divided into 10,000-200,000 bins of equal length or of equal theoretical simulation copy number.
  • the classification method, wherein the genome of the sample to be tested is divided into 10,000-150,000 bins of equal length or of equal theoretical simulation copy number.
  • the classification method, wherein the genome of the sample to be tested is divided into 10,000-100,000 (for example, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000 or 100,000) bins of equal length or of equal theoretical simulation copy number.
  • the classification method wherein the urine sample is morning urine; preferably, the urine sample is morning urine supernatant.
  • the classification method wherein the ratio A/B is the ratio A/B of each biomarker in the biomarker combination
  • the biomarker combination is any one of the biomarker combinations of the present invention described below.
  • Another aspect of the present invention relates to a method for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of human genitourinary system tumors, including the following steps (1), step (2), and optional steps (3), step (4):
  • the cfDNA fragments are classified according to any one of the classification methods of the present invention.
  • the cfDNA fragment is the cfDNA fragment obtained in step (2) or the cfDNA fragment in the whole genome library in step (3).
  • the method wherein the urogenital system tumor is one or more selected from prostate cancer, urothelial cancer and renal cancer;
  • the kidney cancer is clear cell renal cell carcinoma
  • the urothelial cancer is upper urothelial cancer and/or bladder cancer,
  • the prostate cancer is prostate adenocarcinoma.
  • the method wherein, in step (1), the urine sample is morning urine; preferably, the urine sample is morning urine supernatant.
  • the method wherein, in step (2), the screening is magnetic bead screening.
  • Another aspect of the present invention relates to a device for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of human genitourinary system tumors, including:
  • Normal-vs-kidney cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
  • Kidney cancer-vs-normal, kidney cancer-vs-urothelial cancer, kidney cancer-vs-prostate cancer;
  • Another aspect of the present invention relates to a device for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of human genitourinary system tumors,
  • the memory stores program instructions executed by the processor, and the program instructions include any one, any two, any three or all four of the following four decision units, where each decision unit contains 3 random forest binary classifiers:
  • Normal-vs-kidney cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
  • Kidney cancer-vs-normal, kidney cancer-vs-urothelial cancer, kidney cancer-vs-prostate cancer;
  • the device wherein the processor is configured to execute the classification method according to any one of the present invention based on instructions stored in the memory device.
  • the device wherein the genitourinary system tumor is one or more selected from prostate cancer, urothelial cancer and renal cancer;
  • the kidney cancer is clear cell renal cell carcinoma
  • the urothelial cancer is upper urothelial cancer and/or bladder cancer,
  • the prostate cancer is prostate adenocarcinoma.
  • Another aspect of the present invention relates to the use of any one selected from the following items 1) to 3) in the preparation of drugs for the detection, diagnosis, disease risk assessment or prognosis assessment of human genitourinary system tumors:
  • the urine is morning urine
  • the cfDNA is 90-300bp cfDNA or 100-300bp cfDNA; more preferably, the cfDNA is 90-150bp cfDNA or 100-150bp cfDNA;
  • DNA library which is prepared from item 2); preferably, the DNA library is a whole genome library;
  • the urogenital system tumor is one or more selected from prostate cancer, urothelial cancer and renal cancer;
  • the kidney cancer is clear cell renal cell carcinoma
  • the urothelial cancer is upper urothelial cancer and/or bladder cancer,
  • the prostate cancer is prostate adenocarcinoma.
  • Another aspect of the present invention relates to any one selected from the following 1) to 3), which is used for the detection, diagnosis, disease risk assessment or prognosis assessment of human genitourinary system tumors:
  • the urine is morning urine
  • the cfDNA is 90-300bp cfDNA or 100-300bp cfDNA; more preferably, the cfDNA is 90-150bp cfDNA or 100-150bp cfDNA;
  • DNA library which is prepared from item 2); preferably, the DNA library is a whole genome library;
  • the urogenital system tumor is one or more selected from prostate cancer, urothelial cancer and renal cancer;
  • the kidney cancer is clear cell renal cell carcinoma
  • the urothelial cancer is upper urothelial cancer and/or bladder cancer,
  • the prostate cancer is prostate adenocarcinoma.
  • biomarker combination which comprises m biomarkers, and m is a positive integer greater than or equal to 50;
  • the biomarker is a segment of DNA whose start site on the chromosome is A±n1 and whose end site is B±n2;
  • n1 and n2 are independently non-negative integers less than or equal to 60,000;
  • the chromosomes, A and B are selected from any one, any two, any three, any four, any five, any six (for example, the first 6) or all 7 of the following groups (1)-(7);
  • Biomarkers for kidney cancer vs. normal (the smaller the marker number, the stronger its classification power)
  • Biomarkers for urothelial cancer vs. kidney cancer (the smaller the marker number, the stronger its classification power)
  • Biomarkers for urothelial cancer vs. prostate cancer (the smaller the marker number, the stronger its classification power)
  • Biomarkers for normal vs. prostate cancer (to account for gender differences, only men are included in the normal population; the smaller the marker number, the stronger its classification power)
  • the biomarker combination, wherein m is 50-300 or greater than 300, such as 50-100, 100-150, 150-200, 200-250, 250-300, 50, 100, 150, 200, 250 or 300.
  • the biomarker combination wherein n1 and n2 are independently 5000, 4000, 3000, 2000, 1500, 1000, 500, 300, 200, 150, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, or 0.
  • the biomarker combination wherein the biomarker is a piece of cfDNA; preferably, the cfDNA is derived from human urine, especially human urine supernatant.
  • biomarker combination wherein:
  • the chromosomes, A and B are shown in any 1, any 2 groups, any 3 groups, any 4 groups, any 5 groups, any 6 groups, or all 7 groups in the groups (1) to (7).
  • "Bin" is a general term in genomics for a segment obtained by artificially dividing the genome into pieces of a given length. For example, if the human genome of about 3 billion base pairs is divided into 3,000 bins, each bin is about one million base pairs long.
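The arithmetic in this definition can be checked with a minimal sketch. The helper names `bin_size` and `bin_index` are hypothetical, and the sketch assumes equal-length bins and 0-based positions.

```python
def bin_size(genome_length, n_bins):
    """Length in base pairs of each bin when the genome is cut into equal bins."""
    return genome_length // n_bins


def bin_index(position, size):
    """0-based index of the bin that contains a 0-based genomic position."""
    return position // size


size = bin_size(3_000_000_000, 3_000)
print(size)                        # 1000000
print(bin_index(2_500_000, size))  # 2
```

With the same approximate genome length, 50,000 bins would give bins of about 60,000 base pairs each.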
  • "cfNA" is the abbreviation of cell-free nucleic acid, which refers to free nucleic acid in plasma, that is, nucleic acid fragments located outside cells in the peripheral circulation.
  • "cfDNA" is the abbreviation of cell-free DNA, which refers to free DNA in plasma, that is, DNA fragments located outside cells in the peripheral circulation.
  • "Coverage" refers to the proportion of the whole genome made up of regions that have been detected at least once; it measures how well the genome is covered by the data. Because the genome contains complex structures such as high-GC regions and repetitive sequences, the sequence obtained by the final assembly often fails to cover certain regions; these unobtained regions are called gaps. For example, if a bacterial genome is sequenced with 98% coverage, 2% of the sequence was not obtained by sequencing.
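As a small illustration of this definition, coverage can be computed from the genomic intervals spanned by the reads. The function name `coverage` and the interval convention (0-based, half-open `(start, end)` pairs) are assumptions made for this sketch.

```python
def coverage(intervals, genome_length):
    """Fraction of the genome covered at least once.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping intervals are counted only once.
    """
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


# Toy genome of 100 bp in which two overlapping reads leave a 2 bp gap.
print(coverage([(0, 60), (50, 98)], 100))  # 0.98
```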
  • "Read" or "reads" refers to sequenced fragments, i.e., measured sequences.
  • "Paired-end reads" refers to paired reads.
  • "CNVs" refers to copy number variations.
  • "Theoretical simulation copy number" means that the genome is divided, by copy number calculation software and/or methods, into several regions of equal or unequal length such that, by data simulation, each region contains the same theoretical copy number.
  • Tissue-specific diagnosis: this addresses the problem of determining which tumor is present when the tumor type is unknown. Based on the biomarker sets selected by the established classification system, the inventors can determine with high accuracy, in a single test, which urinary system tumor a sample comes from.
  • Urine collection is simple and non-invasive, and the patient has no pain, which is conducive to sample collection, diagnosis, long-term and regular prognostic monitoring.
  • Figure 1: Random forest binary classifier results, kidney cancer vs. normal: sensitivity 72.2%, specificity 93.1%, accuracy 85.1%.
  • Figure 2: Random forest binary classifier results, urothelial cancer vs. normal: sensitivity 76.2%, specificity 100%, accuracy 90.0%.
  • Figure 3: Random forest binary classifier results, prostate cancer vs. normal: sensitivity 71.4%, specificity 93.1%, accuracy 86.1%.
  • Figure 4: Random forest binary classifier results, kidney cancer vs. prostate cancer: sensitivity 72.2%, specificity 85.7%, accuracy 78.1%.
  • Figure 5: Random forest binary classifier results, urothelial cancer vs. kidney cancer: sensitivity 95.2%, specificity 77.8%, accuracy 87.2%.
  • Figure 6: Random forest binary classifier results, urothelial cancer vs. prostate cancer: sensitivity 85.7%, specificity 85.7%, accuracy 85.7%.
  • Figure 7A: Schematic diagram of the GUdetector integrated classification model.
  • Figure 7B: Four-class classification results of the integrated classification decision system (GUdetector). Prediction accuracy by class: normal 89.7%, urothelial cancer 76.2%, prostate cancer 64.3%, kidney cancer 44.4%; overall accuracy 72.0%.
  • Figure 8: Diagnosis model for prostate cancer in male samples. Prostate cancer vs. normal: accuracy 96.7%.
  • Figure 9: SVM four-class classification results (to account for gender, all markers on the sex chromosomes were removed). Prediction accuracy by class: normal 84.7%, urothelial cancer 74.3%, prostate cancer 52.2%, kidney cancer 55.8%; overall accuracy 70.1%.
  • Figure 10: SVM three-class classification results. Prediction accuracy by class: normal 88.5%, urothelial cancer 76.1%, renal cancer 64.8%; overall accuracy 78.4%.
  • Figure 11: SVM classification results for urothelial carcinoma (defined as UCdetector), compared with the LASSO and random forest methods.
  • SVM prediction accuracy: normal 94.7%, urothelial carcinoma 86.5%; overall accuracy 91.4%.
  • LASSO prediction accuracy: normal 94.7%, urothelial carcinoma 75.0%; overall accuracy 86.72%.
  • Random forest prediction accuracy: normal 97.4%, urothelial cancer 80.8%; overall accuracy 89.8%.
  • Figures 12A-12D: Examples of dynamic monitoring of therapeutic efficacy in urothelial cancer, in which:
  • Figure 12A: Postoperative dynamic monitoring of patient 1.
  • Figure 12B: Postoperative dynamic monitoring of patient 2.
  • Figure 12C: Postoperative dynamic monitoring of patient 3.
  • Figure 12D: Summary of postoperative dynamic monitoring of the 3 patients.
  • 172 patients, including 58 patients with clear cell renal cell carcinoma (ccRCC), 69 patients with urothelial carcinoma and 45 patients with prostate cancer. All diagnoses were confirmed by tissue biopsy of surgical samples.
  • Urine cell-free DNA extraction kit: ZYMO Quick-DNA Urine Kit (ZYMO, Cat#: D3061).
  • Magnetic beads: AMPure XP beads (Beckman Coulter, Cat#: A63880).
  • NEBNext End Repair Module, Item No. E6050S.
  • NEBNext dA-Tailing Module, Item No. E6053S.
  • Samples to be tested: the libraries of 267 cases prepared in Example 2 above.
  • the output sequencing depth of each sample is approximately 1X-5X.
  • Varbin algorithm: Genome-wide copy number analysis of single cells. Nature Protocols 7, 1024-1041 (2012), doi:10.1038/nprot.2012.039.
  • the genome of each sample is first divided into 50,000 bins, and then combined with the previous
  • the number of reads and the GC content in each bin were calculated and normalized against the total number of reads and the GC content obtained by sequencing each library, giving, for each bin of each sample, the original number of reads and the actual number of reads after GC content correction (A).
  • The correction method is LOWESS smoothing. The ratio A/B of the number of reads in each bin to the theoretical number of reads in that bin is then obtained, where:
  • A is the actual number of reads in a bin after GC content correction;
  • B is the theoretical number of reads in the bin, that is, the total number of reads measured for the sample divided by the total number of bins (50,000). For a given sample, the theoretical number of reads is therefore the same in every bin.
  • A ratio A/B greater than 1 indicates that the region is likely to have a copy number gain; a ratio equal to 1 indicates no change; a ratio less than 1 indicates that the region is likely to have a copy number loss.
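The GC correction step can be illustrated with a small, self-contained LOWESS sketch. It is not the Varbin implementation: the functions `lowess_fit` and `gc_correct`, the window fraction of 2/3, the omission of robustness iterations, and the rescaling to the mean raw count are all assumptions made for illustration.

```python
def lowess_fit(x, y, frac=2 / 3):
    """Locally weighted linear regression with tricube weights
    (LOWESS without robustness iterations); returns the fit at every x."""
    n = len(x)
    k = min(n, max(2, round(frac * n)))      # points in each local window
    fitted = []
    for xi in x:
        dist = [abs(xj - xi) for xj in x]
        h = sorted(dist)[k - 1] or 1e-12     # window half-width
        w = [max(0.0, 1.0 - (d / h) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted


def gc_correct(raw_counts, gc_content, frac=2 / 3):
    """Divide each raw count by the read count expected at its GC content,
    then rescale to the mean raw count so totals stay comparable."""
    expected = lowess_fit(gc_content, raw_counts, frac)
    mean_count = sum(raw_counts) / len(raw_counts)
    return [c / e * mean_count if e > 0 else 0.0
            for c, e in zip(raw_counts, expected)]


# Hypothetical bins whose read counts depend only on GC content.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
raw = [100 + 200 * g for g in gc]
corrected = gc_correct(raw, gc)
```

In this toy input the counts are a pure function of GC content, so the corrected counts A all come out equal to the mean raw count, and every ratio A/B would be 1.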
  • Each bin is compared pairwise between groups, one bin after another, until all 50,000 bins have been checked. That is, a t-test is performed on the ratio A/B of each of the 50,000 bins, and the bins (markers) whose ratio A/B differs significantly (p<0.05) are retained. For example, for a given bin, the ratios A/B in the normal group and the kidney cancer group are compared; the bin is retained if the test is significant and discarded otherwise, and this is repeated for all 50,000 bins. In this way, the 6 pairwise group combinations yield 6 sets of significantly different markers.
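The per-bin screening can be sketched with the standard library alone. The source says only "t test", so this sketch assumes a two-sample Student's t-test with pooled variance; the function names and the toy ratios are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule.

```python
import math
from statistics import mean, variance


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on the density over [0, |t|]."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def student_t_test(a, b):
    """Two-sided p-value of a pooled-variance two-sample t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    diff = mean(a) - mean(b)
    if pooled == 0:
        return 1.0 if diff == 0 else 0.0
    return two_sided_p(diff / math.sqrt(pooled * (1 / na + 1 / nb)), df)


def significant_bins(group_a, group_b, alpha=0.05):
    """group_x[s][i] is the ratio A/B of bin i in sample s.
    Returns the indices of bins with p < alpha."""
    n_bins = len(group_a[0])
    return [i for i in range(n_bins)
            if student_t_test([s[i] for s in group_a],
                              [s[i] for s in group_b]) < alpha]


# Hypothetical ratios for two bins in four normal and four cancer samples.
normal = [[1.00, 1.00], [1.10, 0.90], [0.90, 1.10], [1.05, 1.00]]
cancer = [[1.50, 1.00], [1.60, 1.10], [1.40, 0.90], [1.45, 0.95]]
print(significant_bins(normal, cancer))  # [0]
```

Only the first bin, whose ratio is clearly shifted in the hypothetical cancer group, passes the threshold and would be kept as a candidate marker.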
  • Specifically, the ratios A/B of the 6 sets of markers are fed into random forest classifiers to train binary classification models. The markers are ranked by feature importance (an output of the random forest algorithm; the more important a marker is for the classification, the higher it ranks), and the top-ranked markers (for example the top 500, 300, 100, 50 or 10) are used to retrain the random forest model. The prediction accuracy on the training set and the test set is evaluated for each marker set, and the marker set with high accuracy is selected as the final set (when accuracies are essentially the same, the inventors prefer the smaller marker set). The 6 random forest binary classifiers thus yield 6 sets of markers, each containing 50 markers, as shown in Tables 1-6 above.
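The selection rule in this step (rank by importance, retrain on the top k, prefer the smaller set when accuracy is essentially the same) can be sketched as follows. The importance scores, accuracies, marker names and the 0.02 tolerance are all hypothetical; the source does not state how "basically the same" is quantified, and the random forest training itself is not shown.

```python
def top_markers(importance, k):
    """Names of the k markers with the highest feature importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


def choose_marker_count(accuracy_by_k, tolerance=0.02):
    """Smallest marker count whose accuracy is within `tolerance`
    of the best accuracy observed."""
    best = max(accuracy_by_k.values())
    return min(k for k, acc in accuracy_by_k.items() if best - acc <= tolerance)


# Hypothetical importance scores from a trained random forest.
importance = {"bin_a": 0.30, "bin_b": 0.05, "bin_c": 0.20}
print(top_markers(importance, 2))  # ['bin_a', 'bin_c']

# Hypothetical test-set accuracy after retraining on the top-k markers.
accuracy_by_k = {500: 0.85, 300: 0.85, 100: 0.86, 50: 0.855, 10: 0.70}
print(choose_marker_count(accuracy_by_k))  # 50
```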
  • the inventor combines these six binary classification models to perform multi-class classification by voting.
  • the specific method is as follows:
  • I. 'Normal decision unit': normal-vs-kidney cancer, normal-vs-urothelial cancer, normal-vs-prostate cancer;
  • II. 'Kidney cancer decision unit': kidney cancer-vs-normal, kidney cancer-vs-urothelial cancer, kidney cancer-vs-prostate cancer;
  • III. 'Urothelial cancer decision unit': urothelial cancer-vs-normal, urothelial cancer-vs-kidney cancer, urothelial cancer-vs-prostate cancer;
  • IV. 'Prostate cancer decision unit': prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial cancer.
  • each decision unit that is, the ratio A/B of the 6 groups of markers corresponding to a sample is input into the respective classifiers of the above 4 decision units for predictive classification, such as'normal decision unit' Normal prediction votes are N 1 ,'kidney cancer decision unit', kidney cancer group prediction votes are N 2 ,'prostate cancer decision unit', prostate cancer prediction votes are N 3 ,'urothelial cancer decision unit', urothelial cancer prediction votes As N 4 , the prediction unit with the highest number of votes finally corresponds to the final prediction classification. If the number of votes is equal, the category with the highest prediction probability in the group with the same number of votes is the final prediction classification.
  • TCGA contains copy number data for various tumor tissues (primary tumor and normal tissue). The corresponding four datasets are downloaded, the values corresponding to the 6 marker sets are calculated (TCGA provides segment values, which measure copy number change), and these are fed into random forest models for training and prediction to evaluate accuracy.
  • KIRC: kidney cancer
  • UC: urothelial cancer
  • PRAD: prostate cancer
  • Normal: healthy individuals. All results are predictions on the 30% test set. In general, the training set is used to select markers and train the classification model, and the test set is used to evaluate prediction accuracy.
  • The results are the classification performance of the random forest binary classifiers evaluated after the final 6 marker sets had been selected, computed with functions in the R language.
  • Kidney cancer VS normal: sensitivity 72.2%, specificity 93.1%.
  • Urothelial cancer VS normal: sensitivity 76.2%, specificity 100%.
  • Prostate cancer VS normal: sensitivity 71.4%, specificity 93.1%.
  • Kidney cancer VS prostate cancer: sensitivity 72.2%, specificity 85.7%.
  • Urothelial cancer VS kidney cancer: sensitivity 95.2%, specificity 77.8%.
  • Urothelial cancer VS prostate cancer: sensitivity 85.7%, specificity 85.7%.
  • Ensemble classification system (GUdetector): simultaneous classification of the 4 groups.
  • Prostate cancer diagnosis model for male samples. Using the experimental methods and samples of Examples 1-3, copy number data from 43 males in the non-tumor population and 45 prostate cancer patients were used to build the classification model.
  • Prostate cancer VS normal: accuracy, AUC = 0.967.
  • Per-class prediction accuracy was 89.7% for the normal group, 76.2% for urothelial cancer, 64.3% for prostate cancer and 44.4% for kidney cancer; overall accuracy was 72.0%.
  • An SVM model was used for simultaneous three-class classification.
  • Per-class accuracy was 88.5% for the normal group, 76.1% for urothelial cancer and 64.8% for kidney cancer.
  • The overall accuracy was 78.4%.
  • An SVM model was used to diagnose urothelial cancer and was compared with the LASSO and random forest methods.
  • SVM prediction accuracy was 94.7% for the normal group and 86.5% for urothelial cancer; overall accuracy was 91.4%.
  • LASSO prediction accuracy was 94.7% for the normal group and 75.0% for urothelial cancer; overall accuracy was 86.72%.
  • Random forest prediction accuracy was 97.4% for the normal group and 80.8% for urothelial cancer; overall accuracy was 89.8%.
  • Sensitivity is the ability to pick out tumor patients, and specificity is the ability to pick out healthy individuals. For example, given 1,000 tumor patients and 1,000 healthy individuals, a classifier with 72.2% sensitivity and 93.1% specificity correctly identifies 722 people in the tumor group and 931 people in the normal group.
  • Sensitivity and specificity between two cancers measure the ability to separate the two tumors. Although these two concepts are normally used to evaluate negative versus positive, or normal versus abnormal, the inventors also use them here to evaluate two tumors. One class is defined as the positive class, shown as the 'positive' class at the bottom of each result.
  • Accuracy refers to the overall accuracy.
  • The confusion matrix at the top of each result gives the number of samples of each group classified correctly and the number misclassified into another group.
  • Prediction refers to the predicted class. For example, in the UC group, 16 UC samples are predicted as UC (correct), 2 as Normal, 3 as PRAD and none as KIRC; the other groups read likewise;
  • the overall accuracy is 0.7195;
  • the prediction accuracy of each class is the corresponding Sensitivity below. Specificity can be ignored here, because sensitivity and specificity are binary-classification concepts and this is a 4-class problem; only the overall accuracy and the per-class sensitivity matter.
  • The inventors are the first to establish a urine-based cfDNA copy number classification system. Using the selected biomarker sets, it can predict the tissue of origin of an unknown genitourinary tumor in a single test, with high sensitivity and specificity. In addition, because only men need prostate cancer risk assessment, the inventors also retrained prostate cancer classification markers specifically for men. Furthermore, setting sex aside, a three-class model for normal, kidney cancer and urothelial cancer was trained. The ensemble voting method cannot be used for three-class classification, so the inventors compared machine learning classifiers including SVM, LASSO and random forest, and found that the SVM model clearly outperformed the other two (LASSO and random forest).
  • Magnetic beads screen DNA fragments of 100bp-300bp
  • Example 6 Screening of diagnostic markers for prostate cancer considering gender differences
  • Prostate cancer occurs only in men. If sex is ignored and the healthy population includes both men and women, sex chromosome copy number will cause the classifier's diagnostic accuracy to be overestimated. Therefore, to diagnose whether an unknown male subject has prostate cancer, the markers can be re-screened using only healthy men (healthy men vs. prostate cancer patients, Table 7). For a male subject in an outpatient clinic, the following method can be used:
  • Magnetic beads screen DNA fragments of 100bp-300bp
  • Example 7 Screening of markers for diagnosis and classification of normal, renal cell carcinoma and urothelial carcinoma
  • Magnetic beads screen DNA fragments of 100bp-300bp
  • Example 8 Example of dynamic monitoring of therapeutic efficacy of urothelial cancer
  • The copy number analysis of cfDNA can also be performed with other algorithms, such as ichorCNA. This method divides the genome into uniform regions of 1,000,000 bp and then calculates the copy number variation and the proportion of tumor-derived DNA. For a patient examined in the outpatient clinic before surgery and after treatment, the following method can be used:
  • Magnetic beads screen DNA fragments of 100bp-300bp
  • Comparative example 1 Using LASSO algorithm model
  • the input data is the ratio A/B corresponding to the 6 groups of biomarkers in Table 1 to Table 6.
  • Reference: CancerLocator: non-invasive cancer diagnosis and tissue-of-origin prediction using methylation profiles of cell-free DNA.
  • the input data is the ratio A/B corresponding to the 6 groups of biomarkers in Table 1 to Table 6.
  • Reference: Epigenetic profiling for the molecular classification of metastatic brain tumors.
  • the input data is the ratio A/B corresponding to the 6 groups of biomarkers in Table 1 to Table 6.


Abstract

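A cfDNA classification method, device and use, belonging to the fields of genomics and bioinformatics. The method comprises: calculating copy number variation data of cfDNA in a target sample; calculating the similarity between the target cfDNA copy number variation data and the cfDNA copy number variation data of each classification label; and determining, according to the similarity, the class to which the target cfDNA belongs using a classifier model. The method can diagnose up to three genitourinary tumors in a single test, with relatively high sensitivity and specificity. In particular, its sensitivity and specificity for the diagnosis and dynamic monitoring of urothelial carcinoma exceed those of detection methods in current clinical use.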

Description

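cfDNA classification method, device and use
Technical Field
The present invention belongs to the fields of genomics and bioinformatics, and relates to a cfDNA classification method, device and use.
Background Art
Genitourinary tumors (prostate cancer, urothelial carcinoma and kidney cancer) are serious diseases that endanger human health. Yet the methods for diagnosing and monitoring genitourinary tumors are usually invasive, or lack sensitivity and specificity.
Kidney cancer accounts for about 3% of adult malignancies and 90%-95% of renal tumors, of which about 75% are clear cell renal cell carcinoma. At present, surgery remains the most effective treatment for localized kidney cancer, but about 20%-40% of patients relapse after surgery. Renal cell carcinoma responds poorly to radiotherapy and chemotherapy. Mortality among kidney cancer patients is as high as 40%, mainly because the disease lacks obvious clinical symptoms in its early stage and lacks effective treatment in its advanced stage. At present, imaging, fine needle aspiration (FNA) cytology and core biopsy (CB) can only assist monitoring and cannot give a definite diagnosis. No tumor marker with both good sensitivity and good specificity is currently available for the early diagnosis and postoperative follow-up of kidney cancer.
Urothelial carcinoma is a malignant tumor arising from the transitional epithelium lining the renal pelvis, ureter, bladder and urethra, and mainly comprises upper tract urothelial carcinoma (in the renal pelvis and ureter) and bladder cancer. Upper tract urothelial carcinoma is relatively rare, accounting for only 5%-10% of urothelial carcinomas, but in China its share is as high as 30%. Several studies indicate that this geographic pattern may be related to the use of traditional Chinese medicines containing aristolochic acid and its analogues. In addition, although they share the same tissue of origin, upper tract urothelial carcinoma and bladder cancer differ considerably in clinicopathological features. Any search for new risk factors, new targets, and new markers for the diagnosis, prognosis and dynamic monitoring of urothelial carcinoma must consider both subtypes. Moreover, the high recurrence rate of urothelial carcinoma leads to more operations, more complications and higher treatment costs. Patients with recurrence ultimately require radical cystectomy or bilateral nephroureterectomy, which greatly reduces survival and quality of life. At present, bladder cancer can be diagnosed with the aid of imaging, fluorescence in situ hybridization (FISH) and urine cytology, but the sensitivity for low-grade bladder tumors is only 4%-31%. The main method for diagnosing bladder cancer is currently cystoscopy, but cystoscopy is expensive and invasive and adds to the patient's suffering. In addition, bladder cancer has a high recurrence rate, and cystoscopy is poorly suited to long-term, lifelong and prognostic monitoring.
Prostate cancer is a common malignant tumor in men, and its incidence is rising to some extent. Early prostate cancer is asymptomatic; once the tumor has progressed, it obstructs the urethra or invades the bladder neck, causing urinary frequency, urgency, incontinence and so on. Many patients are already at an advanced stage when diagnosed, and bone metastases are common at that stage. The currently accepted methods for prostate cancer are digital rectal examination and the prostate-specific antigen (PSA) test, but PSA levels are also affected by factors such as prostatitis, urinary retention, catheterization and medication, resulting in a considerable false-positive rate.
With the development of science and technology, tumor diagnostics keep advancing. In June 2017, an expert committee of the World Economic Forum and Scientific American jointly selected the top ten emerging technologies of 2017, and non-invasive tumor diagnosis topped the list. The emergence of non-invasive tumor diagnosis, i.e. liquid biopsies, marks a major step forward in the fight against cancer. Compared with traditional tissue biopsy, liquid biopsy offers unique advantages such as real-time dynamic detection, overcoming tumor heterogeneity, and providing comprehensive information. In current clinical research, liquid biopsy mainly includes the detection of circulating tumor cells (CTCs), circulating tumor DNA (ctDNA), exosomes and circulating RNA. Compared with traditional diagnosis based on clinical symptoms or imaging, liquid biopsy can detect disease progression earlier. Liquid biopsy is expected to play a major role in assessing tumor dynamics and burden during treatment, monitoring treatment efficacy in real time, and monitoring minimal residual disease, recurrence, prognosis and the emergence of drug resistance.
There remains a need for new means of detecting genitourinary tumors that have good specificity and sensitivity, are more convenient for repeated, long-term and prognostic monitoring, and reduce patient suffering.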
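Summary of the Invention
After intensive research and creative work, the inventors surprisingly found that detecting cell-free DNA (cfDNA) in urine supernatant facilitates the detection or diagnosis of early-stage, low-grade, non-invasive tumors of the urinary system. Further, the inventors designed and carried out experiments, sequencing and analysis, and showed that detecting cfDNA copy number variation (CNV) in urine supernatant makes it possible to diagnose and classify up to three genitourinary tumors in a single test. The following invention is therefore provided:
One aspect of the present invention relates to a cfDNA classification method, comprising:
calculating copy number variation data of cfDNA in a target sample;
calculating the similarity between the target cfDNA copy number variation data and the cfDNA copy number variation data of each classification label;
determining, according to the similarity, the class to which the target cfDNA belongs using a classifier model.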
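In some embodiments of the present invention, in the classification method, determining the class to which the target cfDNA belongs comprises:
determining, according to the similarity and using a random forest model, the relevance of the cfDNA copy number variation data of each classification label to human genitourinary tumors;
determining, according to the relevance, the class to which the target cfDNA belongs using the classifier model.
In some embodiments of the present invention, in the classification method, determining the relevance of the cfDNA copy number variation data of each classification label to human genitourinary tumors comprises:
sorting the cfDNA copy number variation data according to the relevance to form a vector sequence;
inputting the vector sequence into the random forest model to determine the relevance of the cfDNA copy number variation data of the classification label to human genitourinary tumors.
In some embodiments of the present invention, in the classification method, the human genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
preferably, the kidney cancer is clear cell renal cell carcinoma;
preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer;
preferably, the prostate cancer is prostate adenocarcinoma;
preferably, the human genitourinary tumor is confirmed by tissue biopsy of surgical specimens.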
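In some embodiments of the present invention, in the classification method, the random forest model consists of at least 3 random forest binary classifiers, selected from any one, two, three or four of the following groups I-IV:
I.
normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
II.
kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
III.
urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
IV.
prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.
In some embodiments of the present invention, in the classification method, each group casts votes and the class corresponding to the group with the most votes is taken as the final class; if groups tie on votes, the class with the highest predicted probability among the tied groups is taken as the final class. The inventors call this ensemble classification method GUdetector.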
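In some embodiments of the present invention, in the classification method, the cfDNA copy number variation data of the target sample and/or of each classification label are calculated from sequencing data of cfDNA in urine samples; preferably, the sequencing data are whole-genome sequencing data; preferably, the sequencing depth is 1X-5X.
In some embodiments of the present invention, in the classification method, the cfDNA copy number variation data of the target sample and/or of each classification label are calculated as follows:
dividing the genome of the sample to be tested into 5,000-500,000 bins of equal length or of equal theoretical simulated copy number (for example 50,000 bins); normalizing the sequencing data, and calculating for each bin the read-count ratio A/B,
where:
A is the actual number of reads in a bin after GC-content correction;
B is the theoretical number of reads in the bin, i.e. the total number of reads measured for the sample divided by the total number of bins;
the ratio A/B is the copy number variation.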
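In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 5,000-500,000 bins of equal length or of equal theoretical simulated copy number using software or algorithms such as Varbin, CNVnator, ReadDepth or SegSeq.
In one or more embodiments of the present invention, in the classification method, the read-count ratio A/B of each bin is calculated using software or algorithms such as Varbin, CNVnator, ReadDepth or SegSeq.
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 10,000-200,000 bins of equal length or of equal theoretical simulated copy number.
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 10,000-150,000 bins of equal length or of equal theoretical simulated copy number.
In one or more embodiments of the present invention, in the classification method, the genome of the sample to be tested is divided into 10,000-100,000 (for example 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000 or 100,000) bins of equal length or of equal theoretical simulated copy number.
In some embodiments of the present invention, in the classification method, the urine sample is morning urine; preferably, the urine sample is morning urine supernatant.
In some embodiments of the present invention, in the classification method, the ratio A/B is the ratio A/B of each biomarker in a biomarker combination,
where
the biomarker combination is the biomarker combination of any embodiment of the present invention described below.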
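Another aspect of the present invention relates to a method for the detection, diagnosis, classification, risk assessment or prognosis assessment of human genitourinary tumors, comprising the following step (1), step (2), optional step (3), and step (4):
(1) collecting a urine sample and extracting cfDNA;
(2) selecting cfDNA fragments of 90-300 bp or of 100-300 bp;
(3) constructing a whole-genome library from the cfDNA fragments obtained; preferably, performing whole-genome sequencing on the library;
(4) classifying the cfDNA fragments by the classification method of any embodiment of the present invention. The cfDNA fragments are those obtained in step (2), or those in the whole-genome library of step (3).
In some embodiments of the present invention, in the method, the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
preferably, the kidney cancer is clear cell renal cell carcinoma;
preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer;
preferably, the prostate cancer is prostate adenocarcinoma.
In some embodiments of the present invention, in step (1) of the method, the urine sample is morning urine; preferably, the urine sample is morning urine supernatant.
In some embodiments of the present invention, in step (2) of the method, the selection is magnetic bead selection.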
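A further aspect of the present invention relates to a device for the detection, diagnosis, classification, risk assessment or prognosis assessment of human genitourinary tumors, comprising:
I. 'Normal decision unit':
normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
II. 'Kidney cancer decision unit':
kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
III. 'Urothelial carcinoma decision unit':
urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
IV. 'Prostate cancer decision unit':
prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.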
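A further aspect of the present invention relates to a device for the detection, diagnosis, classification, risk assessment or prognosis assessment of human genitourinary tumors,
comprising a memory; and a processor coupled to the memory,
where
the memory stores program instructions executed by the processor, the program instructions comprising any 1, any 2, any 3 or all 4 of the following 4 decision units, each decision unit containing 3 random forest binary classifiers:
I. 'Normal decision unit':
normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
II. 'Kidney cancer decision unit':
kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
III. 'Urothelial carcinoma decision unit':
urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
IV. 'Prostate cancer decision unit':
prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.
In some embodiments of the present invention, in the device, the processor is configured to execute the classification method of any embodiment of the present invention based on instructions stored in the memory.
In some embodiments of the present invention, in the device, the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
preferably, the kidney cancer is clear cell renal cell carcinoma;
preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer;
preferably, the prostate cancer is prostate adenocarcinoma.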
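A further aspect of the present invention relates to the use of any one of the following items 1)-3) in the preparation of a medicament for the detection, diagnosis, risk assessment or prognosis assessment of human genitourinary tumors:
1) the biomarker combination of any embodiment of the present invention;
2) cfDNA in human urine, in particular cfDNA in human urine supernatant;
preferably, the urine is morning urine;
preferably, the cfDNA is 90-300 bp cfDNA or 100-300 bp cfDNA; more preferably, the cfDNA is 90-150 bp cfDNA or 100-150 bp cfDNA;
3) a DNA library prepared from item 2); preferably, the DNA library is a whole-genome library;
preferably, the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
preferably, the kidney cancer is clear cell renal cell carcinoma;
preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer;
preferably, the prostate cancer is prostate adenocarcinoma.
A further aspect of the present invention relates to any one of the following items 1)-3), for use in the detection, diagnosis, risk assessment or prognosis assessment of human genitourinary tumors:
1) the biomarker combination of any embodiment of the present invention;
2) cfDNA in human urine, in particular cfDNA in human urine supernatant;
preferably, the urine is morning urine;
preferably, the cfDNA is 90-300 bp cfDNA or 100-300 bp cfDNA; more preferably, the cfDNA is 90-150 bp cfDNA or 100-150 bp cfDNA;
3) a DNA library prepared from item 2); preferably, the DNA library is a whole-genome library;
preferably, the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
preferably, the kidney cancer is clear cell renal cell carcinoma;
preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer;
preferably, the prostate cancer is prostate adenocarcinoma.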
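Yet another aspect of the present invention relates to a biomarker combination comprising m biomarkers, m being a positive integer greater than or equal to 50;
each biomarker is a stretch of DNA whose start position on the chromosome is A±n1 and whose end position is B±n2;
where n1 and n2 are independently non-negative integers less than or equal to 60,000;
where the chromosome, A and B are selected from any 1, any 2, any 3, any 4, any 5, any 6 (for example the first 6) or all 7 of the following groups (1)-(7):
(1) Biomarkers for kidney cancer VS normal (the smaller the marker number, the stronger its classification power)
Table 1
No. Chromosome A B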
1 chr14 105173382 105228468
2 chr4 126141989 126199070
3 chr2 38340335 38396819
4 chr4 120896519 120952988
5 chr1 225263465 225322410
6 chr3 49627990 49683004
7 chr12 55710185 55770826
8 chr2 198023323 198078345
9 chr8 104278540 104334789
10 chr15 102366051 102531392
11 chr5 56684537 56739554
12 chr12 2875899 2930969
13 chr5 8084151 8143261
14 chr13 24239617 24294704
15 chr14 63064067 63121825
16 chr10 32966493 33022298
17 chr18 34499871 34555093
18 chr18 27538044 27593083
19 chr19 52518298 52574358
20 chr3 148084127 148140439
21 chr11 23395282 23450515
22 chr19 53868391 53924718
23 chr7 36856760 36911789
24 chr19 55851675 55906675
25 chr12 130622755 130677832
26 chr8 88140900 88196181
27 chr8 98015299 98073611
28 chr22 24279186 24375790
29 chr10 58285076 58342675
30 chr1 193398457 193455292
31 chr11 44170591 44225937
32 chr3 99497035 99552049
33 chr18 70229325 70284364
34 chr3 86800483 86855497
35 chr7 85391699 85446714
36 chr2 222217699 222274614
37 chr12 51953090 52017679
38 chr2 231506603 231561625
39 chr7 54479671 54534725
40 chr5 40826473 40882045
41 chr3 61041867 61097030
42 chr1 71530378 71587704
43 chr19 30375804 30434948
44 chr5 103365336 103426037
45 chr16 72331875 72390386
46 chr12 77381964 77436979
47 chr19 35419205 35474205
48 chr8 131286269 131341291
49 chr21 30776557 30834320
50 chr9 17638202 17695124
(2) Biomarkers for urothelial carcinoma VS normal (the smaller the marker number, the stronger its classification power)
Table 2
No. Chromosome A B
1 chr1 165542998 165598528
2 chr20 45298182 45353725
3 chr7 110250206 110305749
4 chr8 34086369 34141392
5 chr11 3080528 3135556
6 chr8 81773551 81828573
7 chr7 20604578 20660880
8 chr8 101664207 101719230
9 chr8 127300805 127363897
10 chr3 175419548 175474633
11 chr7 17433047 17488061
12 chr11 126763962 126818990
13 chr8 81328435 81383788
14 chr1 160347268 160402416
15 chr3 150917292 150976246
16 chr8 78266536 78321853
17 chr2 127233784 127288805
18 chr9 119009696 119064910
19 chr7 88363140 88418154
20 chr6 168087004 168142398
21 chr8 101056393 101111465
22 chr9 121669613 121725772
23 chr8 32804682 32859711
24 chr1 160016845 160071870
25 chr8 52860841 52916007
26 chr1 184863212 184918237
27 chr8 103059578 103114914
28 chr11 131771420 131826541
29 chr11 132772276 132827397
30 chr8 142309304 142365059
31 chr11 20866407 20922555
32 chr9 9389289 9445177
33 chr8 86975952 87030974
34 chr8 68297698 68353353
35 chr9 122009782 122064791
36 chr8 61387868 61442890
37 chr8 82499446 82554469
38 chr9 118116705 118171814
39 chr8 117772819 117827841
40 chr9 135838140 135893149
41 chr14 101522031 101577065
42 chr8 81105039 81160812
43 chr3 161042779 161098402
44 chr9 104364444 104420690
45 chr8 61111592 61166615
46 chr20 31048866 31103880
47 chr15 26890253 26945265
48 chr4 28406811 28462319
49 chr5 35031116 35086691
50 chr10 101035266 101090283
(3) Biomarkers for prostate cancer VS normal (the smaller the marker number, the stronger its classification power)
Table 3
No. Chromosome A B
1 chr6 150259849 150319419
2 chr11 50065867 50143253
3 chr2 223609354 223664376
4 chr3 178315458 178370471
5 chr5 142022744 142077815
6 chr3 72366362 72421541
7 chr14 51571751 51628678
8 chr10 69911981 69966998
9 chr9 75793867 75850925
10 chr16 34486643 34542808
11 chr16 75960918 76016022
12 chr1 213593324 213648410
13 chr14 81176000 81231314
14 chr14 48680148 48735914
15 chr1 66328295 66385662
16 chr2 236695859 236750881
17 chr16 34310644 34370518
18 chr13 70644019 70699054
19 chr1 104971030 105026648
20 chr19 20033425 20088912
21 chr12 41633765 41689196
22 chr1 111186072 111241148
23 chr11 81515081 81570551
24 chr6 164934635 164990438
25 chr7 88753879 88809024
26 chr2 204421512 204476533
27 chr13 38205109 38260137
28 chr19 57310235 57365579
29 chr5 172615261 172670278
30 chr13 100608580 100663608
31 chr1 248513391 248569321
32 chr5 78269787 78325922
33 chr10 12753021 12808156
34 chr7 101911102 101966116
35 chr17 30274080 30334227
36 chr12 87935928 87995848
37 chr9 12175965 12231559
38 chr5 97385699 97441111
39 chr8 3970051 4025074
40 chr7 20604578 20660880
41 chr8 32416104 32471278
42 chr7 12021765 12077292
43 chr20 11563548 11624648
44 chr7 51785230 51840244
45 chr19 16615231 16670336
46 chr10 67343243 67399416
47 chr11 10953369 11008630
48 chr2 22332272 22390528
49 chr17 10390372 10446415
50 chr4 976667 1032082
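(4) Biomarkers for kidney cancer VS prostate cancer (the smaller the marker number, the stronger its classification power)
Table 4
No. Chromosome A B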
1 chr4 163059481 163114735
2 chr4 6580383 6635407
3 chr6 132270265 132325276
4 chr2 82257259 82312280
5 chr1 159394058 159452969
6 chr9 105154079 105209849
7 chr2 187699497 187754518
8 chr4 126199070 126254087
9 chr20 18854392 18909406
10 chr7 15040427 15095480
11 chr3 44690964 44747019
12 chr11 57212694 57267722
13 chr2 48829261 48885035
14 chr12 133782920 133851895
15 chr5 98900964 98963876
16 chr11 86090264 86145292
17 chr7 128477838 128533737
18 chr2 32933311 32988604
19 chr7 12693292 12748805
20 chr4 95879059 95934075
21 chr8 59989616 60044780
22 chr12 32405135 32460143
23 chr7 37972210 38027551
24 chr11 128601685 128656714
25 chr6 64185537 64240615
26 chr7 107787926 107843035
27 chr18 29036127 29091424
28 chr16 47711531 47767836
29 chr7 14590286 14645354
30 chr11 55525982 55582014
31 chr5 174061726 174116744
32 chr14 44456533 44512749
33 chr3 168694552 168750070
34 chr4 114652704 114707721
35 chr2 27431778 27486799
36 chr4 107314339 107370716
37 chr2 182718295 182773317
38 chr10 19690582 19745774
39 chr10 23594781 23649798
40 chr3 3972580 4034015
41 chr6 31323092 31379758
42 chr8 128874896 128929933
43 chr1 26256318 26311633
44 chr5 161340570 161395587
45 chr12 91346168 91401202
46 chr19 2637431 2692582
47 chr7 36856760 36911789
48 chr9 27809024 27864032
49 chr2 116615151 116670172
50 chr9 112566383 112621994
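(5) Biomarkers for urothelial carcinoma VS kidney cancer (the smaller the marker number, the stronger its classification power)
Table 5
No. Chromosome A B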
1 chr4 163059481 163114735
2 chr4 6580383 6635407
3 chr6 132270265 132325276
4 chr2 82257259 82312280
5 chr1 159394058 159452969
6 chr9 105154079 105209849
7 chr2 187699497 187754518
8 chr4 126199070 126254087
9 chr20 18854392 18909406
10 chr7 15040427 15095480
11 chr3 44690964 44747019
12 chr11 57212694 57267722
13 chr2 48829261 48885035
14 chr12 133782920 133851895
15 chr5 98900964 98963876
16 chr11 86090264 86145292
17 chr7 128477838 128533737
18 chr2 32933311 32988604
19 chr7 12693292 12748805
20 chr4 95879059 95934075
21 chr8 59989616 60044780
22 chr12 32405135 32460143
23 chr7 37972210 38027551
24 chr11 128601685 128656714
25 chr6 64185537 64240615
26 chr7 107787926 107843035
27 chr18 29036127 29091424
28 chr16 47711531 47767836
29 chr7 14590286 14645354
30 chr11 55525982 55582014
31 chr5 174061726 174116744
32 chr14 44456533 44512749
33 chr3 168694552 168750070
34 chr4 114652704 114707721
35 chr2 27431778 27486799
36 chr4 107314339 107370716
37 chr2 182718295 182773317
38 chr10 19690582 19745774
39 chr10 23594781 23649798
40 chr3 3972580 4034015
41 chr6 31323092 31379758
42 chr8 128874896 128929933
43 chr1 26256318 26311633
44 chr5 161340570 161395587
45 chr12 91346168 91401202
46 chr19 2637431 2692582
47 chr7 36856760 36911789
48 chr9 27809024 27864032
49 chr2 116615151 116670172
50 chr9 112566383 112621994
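(6) Biomarkers for urothelial carcinoma VS prostate cancer (the smaller the marker number, the stronger its classification power)
Table 6
No. Chromosome A B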
1 chr3 88025277 88080310
2 chr19 39394315 39449482
3 chr20 31436554 31491568
4 chr7 48432792 48487842
5 chr8 87141019 87196120
6 chr4 13859414 13914431
7 chr1 160292243 160347268
8 chr8 112245103 112300126
9 chr8 11530043 11585066
10 chr8 13932292 13987366
11 chr3 152913886 152973883
12 chr9 109516082 109571205
13 chr11 8343925 8398954
14 chr3 122030664 122085678
15 chr5 87727661 87782722
16 chr5 60881889 60936907
17 chr14 40518423 40573582
18 chr8 94667609 94724236
19 chr8 101719230 101774274
20 chr5 113527635 113584160
21 chr3 103853900 103909150
22 chr8 62393903 62449668
23 chr8 124248002 124303024
24 chr17 74131207 74186417
25 chr14 52519339 52574927
26 chr3 144795549 144851338
27 chr3 84803116 84858323
28 chr8 50523567 50578589
29 chr8 88545977 88603606
30 chr1 42119088 42174113
31 chr20 43860121 43915135
32 chr9 121061199 121116207
33 chr9 118676908 118734641
34 chr11 13163841 13219126
35 chr11 57212694 57267722
36 chr8 131892873 131948409
37 chr11 16410024 16465871
38 chr8 109405759 109460782
39 chr5 158002797 158058189
40 chr11 1579888 1635511
41 chr8 51749113 51804136
42 chr9 118562723 118621899
43 chr17 29154317 29209332
44 chr6 73471411 73528437
45 chr3 87522168 87578480
46 chr1 231915581 231971963
47 chr8 117772819 117827841
48 chr1 241691293 241746318
49 chr9 92506773 92712072
50 chr4 19120611 19176371
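(7) Biomarkers for normal VS prostate cancer (to account for sex differences, the normal population includes only men; the smaller the marker number, the stronger its classification power)
Table 7
No. Chromosome A B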
1 chr11 40374531 40429896
2 chr12 61310253 61365625
3 chr19 56809188 56866674
4 chr2 145644444 145702420
5 chr6 98011442 98066653
6 chr7 88753879 88809024
7 chr9 98761758 98817567
8 chrY 4474368 4588559
9 chrY 18884928 18940043
10 chrY 5632826 5746826
11 chrY 24371813 24427746
12 chrY 5948790 6035624
13 chrY 19228861 19283946
14 chrY 21484883 21542276
15 chrY 5746826 5851679
16 chrY 28707448 28764196
17 chrY 6599942 6664881
18 chrY 23799512 23860617
19 chrY 3427018 3545705
20 chrY 13573548 13635016
21 chrY 18387555 18551943
22 chrY 16529414 16585431
23 chrY 19111726 19166891
24 chrY 9020782 9081054
25 chrY 19451088 19508211
26 chrY 6720180 6778075
27 chrY 6349316 6458079
28 chrY 4163770 4261597
29 chrY 28648165 28707448
30 chrY 8741265 8796960
31 chrY 19283946 19339589
32 chrY 3970433 4073487
33 chrY 7346142 7402799
34 chrY 15149848 15205024
35 chrY 18774055 18829409
36 chrY 7290613 7346142
37 chrY 23743018 23799512
38 chrY 4700163 4811039
39 chrY 16473510 16529414
40 chrY 21654324 21709511
41 chrY 14418460 14477812
42 chrY 5851679 5948790
43 chrY 8685630 8741265
44 chrY 14650141 14705375
45 chrY 15605187 15663531
46 chrY 4073487 4163770
47 chrY 9399760 9457656
48 chrY 4366038 4474368
49 chrY 4937971 5066009
50 chrY 19564127 21039220
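In some embodiments of the present invention, in the biomarker combination, m is 50-300 or greater than 300, for example 50-100, 100-150, 150-200, 200-250, 250-300, 50, 100, 150, 200, 250 or 300.
In one or more embodiments of the present invention, in the biomarker combination, n1 and n2 are independently 5000, 4000, 3000, 2000, 1500, 1000, 500, 300, 200, 150, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5 or 0.
In one or more embodiments of the present invention, in the biomarker combination, the biomarker is a stretch of cfDNA; preferably, the cfDNA is derived from human urine, in particular human urine supernatant.
In one or more embodiments of the present invention, in the biomarker combination,
the chromosome, A and B are as shown in any 1, any 2, any 3, any 4, any 5, any 6 or all 7 of groups (1)-(7).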
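The positional tolerance defined for a biomarker (start within A±n1, end within B±n2, with n1 and n2 at most 60,000) can be sketched as a simple interval check. This is only a minimal illustration, not the inventors' implementation; the names `Marker` and `matches_marker` are hypothetical, and the coordinates are taken from row 1 of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    chrom: str
    start: int  # position A in the tables
    end: int    # position B in the tables


def matches_marker(chrom, start, end, marker, n1=0, n2=0):
    """Return True if a region lies on the marker's chromosome with its
    start within A±n1 and its end within B±n2."""
    if not (0 <= n1 <= 60_000 and 0 <= n2 <= 60_000):
        raise ValueError("n1 and n2 must be integers in [0, 60000]")
    return (chrom == marker.chrom
            and abs(start - marker.start) <= n1
            and abs(end - marker.end) <= n2)


# Row 1 of Table 1 (kidney cancer VS normal).
table1_row1 = Marker("chr14", 105173382, 105228468)

exact = matches_marker("chr14", 105173382, 105228468, table1_row1)
shifted = matches_marker("chr14", 105173882, 105228468, table1_row1, n1=500)
too_far = matches_marker("chr14", 105173882, 105228468, table1_row1, n1=100)
```

With n1 = n2 = 0 only the exact table coordinates match; larger tolerances admit bins whose boundaries are shifted, for instance by a different binning run.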
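Some of the terms used in the present invention are explained below.
The term "bin" (interval/region) is the general term in genomics for a region obtained by dividing the genome into segments of a chosen length. For example, if the roughly 3 billion base pairs of the human genome are divided evenly into 3,000 bins, each bin is about one million base pairs.
The term "cfNA" is short for cell-free nucleic acid and refers to plasma free nucleic acid, i.e. extracellular nucleic acid fragments in the peripheral circulation.
The term "cfDNA" is short for cell-free DNA and refers to plasma free DNA, i.e. extracellular DNA fragments in the peripheral circulation.
The term "coverage" refers to the proportion of the genome that is detected at least once. Coverage measures the extent to which the genome is covered by data. Because of complex structures in the genome such as high-GC regions and repeats, the sequence finally assembled from sequencing often cannot cover all regions; the regions not obtained are called gaps. For example, if a bacterial genome is sequenced at 98% coverage, 2% of the sequence was not obtained by sequencing.
The term "sequencing depth" refers to the ratio of the total number of bases sequenced (bp) to the genome size, or equivalently the average number of times each base in the genome is sequenced. For example, if a genome is 2 Mb and the total data obtained is 20 Mb, the sequencing depth is 20M/2M = 10X.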
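The two arithmetic examples in these definitions (bin size and sequencing depth) can be reproduced with a short sketch. The function names are hypothetical and chosen only for illustration.

```python
def bin_size(genome_size_bp, n_bins):
    """Approximate size of each bin when the genome is split evenly."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    return genome_size_bp / n_bins


def sequencing_depth(total_bases_bp, genome_size_bp):
    """Sequencing depth = total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases_bp / genome_size_bp


# About 3 billion bp split into 3,000 bins: about one million bp per bin.
human_bin = bin_size(3_000_000_000, 3_000)

# 20 Mb of data on a 2 Mb genome: 10X depth.
depth = sequencing_depth(20_000_000, 2_000_000)
```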
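The term "read" or "reads" refers to a sequenced fragment, i.e. the sequence obtained.
The term "pair-end reads" refers to paired reads.
The term "copy number variations (CNVs)" refers to deletions or duplications of relatively large DNA fragments, commonly gains or losses of copies of DNA fragments from a few hundred bp to several million bp. CNVs result from genomic rearrangement and are one of the important pathogenic factors of tumors.
The term "theoretical simulated copy number" means that copy number calculation software and/or methods divide the genome into a number of regions of equal or unequal length such that, by data simulation, each region contains the same theoretical copy number.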
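Beneficial Effects of the Invention
(1) Low-input detection that reduces sequencing cost and works at low, shallow coverage. The tumor-derived fraction of cfDNA released by early tumors is generally below one percent or even one in ten thousand, so detecting SNV (single nucleotide variation) and INDEL (insertion/deletion) level variation in ctDNA is very challenging for current DNA detection technology and requires very deep sequencing. By contrast, the inventors use whole-genome sequencing of cfDNA to detect its copy number variation, which is feasible both in theory and in practice. The inventors' samples were sequenced at a depth of only 1X to 5X and still achieved diagnosis with high sensitivity and specificity.
(2) Highly accurate diagnosis of individual urinary system tumors.
(3) Tissue-specific diagnosis. This solves the problem of identifying which tumor is present when the tumor type is unknown. Using the biomarker sets selected with the established classification system, the inventors can determine in a single test, with relatively high accuracy, which urinary system tumor a sample comes from.
(4) Truly non-invasive. Urine collection is simple and non-invasive and causes the patient no pain, which facilitates sample collection, diagnosis, and regular long-term and prognostic monitoring.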
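Brief Description of the Drawings
Figure 1: Random forest binary classifier, kidney cancer VS normal: sensitivity 72.2%, specificity 93.1%, accuracy 85.1%.
Figure 2: Random forest binary classifier, urothelial carcinoma VS normal: sensitivity 76.2%, specificity 100%, accuracy 90.0%.
Figure 3: Random forest binary classifier, prostate cancer VS normal: sensitivity 71.4%, specificity 93.1%, accuracy 86.1%.
Figure 4: Random forest binary classifier, kidney cancer VS prostate cancer: sensitivity 72.2%, specificity 85.7%, accuracy 78.1%.
Figure 5: Random forest binary classifier, urothelial carcinoma VS kidney cancer: sensitivity 95.2%, specificity 77.8%, accuracy 87.2%.
Figure 6: Random forest binary classifier, urothelial carcinoma VS prostate cancer: sensitivity 85.7%, specificity 85.7%, accuracy 85.7%.
Figure 7A: Schematic of the GUdetector ensemble classification model.
Figure 7B: Four-class results of the ensemble classification decision system (GUdetector). Per-class prediction accuracy: normal 89.7%, urothelial carcinoma 76.2%, prostate cancer 64.3%, kidney cancer 44.4%; overall accuracy 72.0%.
Figure 8: Prostate cancer diagnosis model for male samples. Prostate cancer VS normal: accuracy 96.7%.
Figure 9: Four-class SVM results (accounting for sex by removing all markers on sex chromosomes). Per-class prediction accuracy: normal 84.7%, urothelial carcinoma 74.3%, prostate cancer 52.2%, kidney cancer 55.8%; overall accuracy 70.1%.
Figure 10: Three-class SVM results. Per-class prediction accuracy: normal 88.5%, urothelial carcinoma 76.1%, kidney cancer 64.8%; overall accuracy 78.4%.
Figure 11: SVM urothelial carcinoma classification results (defined as UCdetector), compared with the LASSO and random forest methods. SVM prediction accuracy: normal 94.7%, urothelial carcinoma 86.5%; overall accuracy 91.4%. LASSO prediction accuracy: normal 94.7%, urothelial carcinoma 75.0%; overall accuracy 86.72%. Random forest prediction accuracy: normal 97.4%, urothelial carcinoma 80.8%; overall accuracy 89.8%.
Figures 12A-12D: Examples of dynamic monitoring of treatment efficacy in urothelial carcinoma, where:
Figure 12A: postoperative dynamic monitoring of patient 1.
Figure 12B: postoperative dynamic monitoring of patient 2.
Figure 12C: postoperative dynamic monitoring of patient 3.
Figure 12D: summary of postoperative dynamic monitoring of the 3 patients.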
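Detailed Description of the Embodiments
Embodiments of the present invention are described in detail below with reference to examples, but those skilled in the art will understand that the following examples serve only to illustrate the invention and should not be regarded as limiting its scope. Where specific conditions are not stated in the examples, conventional conditions or those recommended by the manufacturer were used. Reagents or instruments whose manufacturer is not stated are conventional products available commercially.
Example 1: Preparation of cfDNA samples
1. Target population
95 healthy individuals;
172 patients, comprising 58 with clear cell renal cell carcinoma (ccRCC), 69 with urothelial carcinoma and 45 with prostate cancer. All were confirmed by tissue biopsy of surgical specimens.
Healthy individuals and patients totaled 267.
2. Experimental method
(1) Morning urine was collected from the healthy individuals above and, before surgery, from the tumor patients. Each urine sample (about 20-50 ml) was collected in a 50 ml centrifuge tube, placed in an ice box at 4°C, and extracted within half an hour to avoid cfDNA degradation.
(2) The collected morning urine samples were each centrifuged at 3500 rpm for 15 minutes, and the supernatants were collected.
(3) cfDNA was extracted with the Zymo Quick-DNA™ Urine Kit. After extraction, the concentration was measured with a Qubit 4 fluorometer, and the samples were stored at -80°C.
267 cfDNA samples were obtained.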
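Example 2: Construction of whole-genome libraries
1. Samples, reagents and instruments
The 267 cfDNA samples obtained in Example 1.
Urine cell-free DNA extraction kit: ZYMO Quick-DNA Urine Kit (ZYMO, Cat#: D3061).
Magnetic beads: AMPure XP beads (Beckman Coulter, Cat#: A63880).
Standard centrifuge.
2. Experimental method
(1) cfDNA of 100-300 bp was selected with magnetic beads (the size range of DNA fragments bound by the beads can be controlled through the ratio of bead volume to cfDNA sample volume). The procedure was as follows:
0.6 volumes of magnetic beads were added to the extracted urine cfDNA; after 5 minutes of binding, the beads were discarded and the supernatant kept. Then 0.3 volumes of beads were added to the supernatant; after 5 minutes of binding, the supernatant was discarded and the beads kept. (Note: the 0.6 volumes of beads bind large DNA fragments, which are discarded; the 0.3 volumes of beads then added to the supernatant bind the small target DNA fragments, which are thereby recovered.) The beads were washed twice with 80% ethanol, and the DNA was finally dissolved in water.
(2) End repair and A-tailing. See the kit instructions: NEBNext End Repair Module, Cat# E6050S; NEBNext dA-Tailing Module, Cat# E6053S.
(3) Ligation of PE adapters. See the kit instructions: T4 DNA Ligase, Cat# M0202L.
(4) PCR amplification with adapter-specific primers.
(5) The PCR products were purified with magnetic beads, yielding a DNA library (whole-genome library) for each of the 267 samples.
In addition, the 267 libraries were quality-checked on an Agilent 2100 Bioanalyzer, confirming that none showed adapter contamination after library construction.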
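Example 3: Sequencing on the HiSeq X10 system
1. Reagents and instruments
Samples to be tested: the 267 libraries prepared in Example 2.
2. Experimental method
Whole-genome sequencing was performed. Sequencing was outsourced to Novogene.
3. Results
150 bp paired-end reads were obtained for each of the 267 libraries. The sequencing depth of each sample was about 1X-5X. These data were used in the tumor marker analysis below.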
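Example 4: Screening, analysis and application of tumor markers
1. Experimental method
(1) Calculation of the ratio A/B
Following the Varbin algorithm (Genome-wide copy number analysis of single cells. Nature Protocols 7, 1024-1041, doi:10.1038/nprot.2012.039 (2012)), the genome of each sample was first divided into 50,000 bins. Using the sequencing results of Example 3, the number of reads and the GC content of each bin were calculated, and the values were normalized by the total number of reads and the GC content of each library. This gave, for each bin of each sample, the raw number of reads and the actual number of reads after GC-content correction (A); the correction used locally weighted scatterplot smoothing (LOWESS). The ratio A/B of each bin's read count to its theoretical read count was then obtained, where:
A is the actual number of reads in a bin after GC-content correction;
B is the theoretical number of reads in the bin, i.e. the total number of reads measured for the sample divided by the total number of bins (50,000); for a given sample, the theoretical number of reads is therefore the same in every bin.
A ratio A/B greater than 1 suggests a likely copy number gain in the region, a ratio equal to 1 indicates no change, and a ratio less than 1 suggests a likely copy number loss.
Each sample finally yields 50,000 ratios; these 50,000 ratios (also called features) are used for the marker screening below.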
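The ratio calculation above can be sketched in a few lines. Note the assumptions: this is not the Varbin code, and the GC correction shown here is a simple per-GC-stratum median scaling used as a stand-in for the LOWESS correction named in the text. All function names, the `n_strata` parameter and the toy counts are hypothetical.

```python
from statistics import median


def gc_corrected_counts(raw_counts, gc_fractions, n_strata=20):
    """Scale each bin by (overall median / median of bins with similar GC).

    A simplified stand-in for LOWESS GC correction, for illustration only.
    """
    overall = median(raw_counts)
    strata = {}
    for count, gc in zip(raw_counts, gc_fractions):
        key = min(int(gc * n_strata), n_strata - 1)
        strata.setdefault(key, []).append(count)
    stratum_median = {k: median(v) for k, v in strata.items()}
    corrected = []
    for count, gc in zip(raw_counts, gc_fractions):
        ref = stratum_median[min(int(gc * n_strata), n_strata - 1)]
        corrected.append(count * overall / ref if ref > 0 else 0.0)
    return corrected


def copy_number_ratios(corrected_counts, total_reads):
    """Ratio A/B per bin: A is the corrected read count of the bin,
    B = total reads / number of bins (identical for every bin)."""
    b = total_reads / len(corrected_counts)
    return [a / b for a in corrected_counts]


def interpret(ratio):
    """Reading of the ratio as described in the text."""
    if ratio > 1:
        return "likely gain"
    if ratio < 1:
        return "likely loss"
    return "no change"


# Toy example: 5 bins with identical GC content, so the correction is neutral.
raw = [100, 100, 200, 100, 0]
gc = [0.4] * 5
ratios = copy_number_ratios(gc_corrected_counts(raw, gc), total_reads=sum(raw))
calls = [interpret(r) for r in ratios]
```

In a real run there would be 50,000 bins per sample, and a small tolerance around 1 would probably be needed before calling a gain or loss; the text does not specify one.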
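(2) Marker screening
Each of the 4 groups of samples (healthy individuals, clear cell renal cell carcinoma patients, urothelial carcinoma patients and prostate cancer patients) was randomly divided into a training set (about 70%) and a test set (about 30%), giving 4 training sets and 4 corresponding test sets; the number of individuals in each is shown in Table 8 below.
Table 8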
Figure PCTCN2020087830-appb-000001
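A per-group random split of roughly 70% training and 30% test samples might look like the sketch below. The group sizes are those given in Example 1; the sample identifiers, the seed and the rounding rule are assumptions, so the resulting counts are not claimed to reproduce Table 8.

```python
import random


def split_group(sample_ids, train_fraction=0.7, seed=0):
    """Shuffle one group's samples and split them into (train, test)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Group sizes from Example 1; the IDs are placeholders.
group_sizes = {"Normal": 95, "KIRC": 58, "UC": 69, "PRAD": 45}
splits = {
    name: split_group([f"{name}_{i}" for i in range(n)])
    for name, n in group_sizes.items()
}
```

Splitting each group separately keeps the class proportions of the training and test sets close to those of the full cohort.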
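The 4 training sets were first compared pairwise. Specifically, each bin was compared between two groups, one bin after another, until all 50,000 bins had been tested. That is, a t-test was applied to the ratio A/B of each of the 50,000 bins, ratios with a significant difference (p<0.05) were selected, and the corresponding markers (bins) were identified. For example, for a given bin, the ratio A/B in healthy individuals was compared with that in kidney cancer patients; the bin was kept if the test was significant and discarded otherwise, and this was repeated for all 50,000 bins. This gave 6 pairwise combinations and 6 sets of significantly different markers.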
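The per-bin screening can be illustrated as follows. The text does not say which t-test variant was used; this sketch assumes Welch's unequal-variance statistic and, to stay within the standard library, approximates the p-value with the standard normal distribution rather than the t distribution. The function names and the toy ratios are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    diff = mean(x) - mean(y)
    if se == 0.0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def approx_p_value(t):
    """Two-sided p-value under a normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


def screen_bins(group_a, group_b, alpha=0.05):
    """group_a and group_b hold one list of bin ratios per sample.
    Return the indices of bins whose ratios differ with p < alpha."""
    kept = []
    for j in range(len(group_a[0])):
        x = [sample[j] for sample in group_a]
        y = [sample[j] for sample in group_b]
        if approx_p_value(welch_t(x, y)) < alpha:
            kept.append(j)
    return kept


# Toy data: 4 samples per group, 2 bins. Only bin 1 differs between groups.
tumor = [[1.0, 1.5], [1.1, 1.4], [0.9, 1.6], [1.0, 1.5]]
normal = [[1.0, 1.0], [1.05, 0.9], [0.95, 1.1], [1.0, 1.0]]
significant = screen_bins(tumor, normal)
```

Running the same screen for each of the 6 group pairs would give the 6 candidate marker sets. With 50,000 bins and p<0.05, many bins would pass by chance alone, which is one reason a second, importance-based selection step follows.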
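These 6 sets of markers were then screened further. The ratios A/B of each of the 6 marker sets were fed into a random forest classifier to train a binary classification model, and the markers were ranked by feature importance (an output of the random forest algorithm; the more important a marker is for classification, the higher it ranks). The top-ranked markers (e.g. top 500, top 300, top 100, top 50, top 10) were used to train random forest models again, the prediction accuracy on the training and test sets was evaluated for each marker set, and the markers giving high accuracy were chosen as the final marker set (when accuracies were essentially equal, the inventors preferred the smaller marker set). The 6 random forest binary classifiers thus yielded 6 sets of markers, each containing 50 markers, as shown in Tables 1-6 above.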
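The selection rule described here (rank by importance, evaluate several top-k subsets, prefer the smaller subset when accuracy is essentially equal) can be sketched independently of any particular random forest library. In this sketch `evaluate` is a hypothetical callback that would retrain a model on the given markers and return its accuracy; `tolerance` is an assumed numeric reading of "essentially equal", and the toy importances and accuracies are invented for illustration.

```python
def choose_marker_set(importances, evaluate,
                      sizes=(500, 300, 100, 50, 10), tolerance=0.01):
    """Pick the smallest top-k marker set whose accuracy is within
    `tolerance` of the best accuracy observed over all candidate sizes."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    results = {}
    for k in sizes:
        if k <= len(ranked):
            results[k] = evaluate(ranked[:k])
    if not results:
        raise ValueError("no candidate size fits the number of markers")
    best = max(results.values())
    chosen_k = min(k for k, acc in results.items() if acc >= best - tolerance)
    return ranked[:chosen_k], results


# Toy example with 6 markers and made-up accuracies per subset size.
toy_importance = {"bin_a": 0.30, "bin_b": 0.25, "bin_c": 0.20,
                  "bin_d": 0.15, "bin_e": 0.06, "bin_f": 0.04}
toy_accuracy = {6: 0.85, 4: 0.85, 2: 0.70}
chosen, scores = choose_marker_set(
    toy_importance, lambda markers: toy_accuracy[len(markers)], sizes=(6, 4, 2)
)
```

In the toy run the top-4 and top-6 sets tie on accuracy, so the smaller top-4 set is chosen, mirroring the stated preference for fewer markers.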
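The data for the 6 sets of biomarkers in Tables 1-6 (the ratios A/B of the 6 marker sets) were extracted separately and trained with the random forest algorithm, giving 6 binary classification models.
(3) Construction of the ensemble classification system (GUdetector)
The inventors combined these 6 binary classification models to perform multi-class classification by voting, as follows:
The inventors designed 4 decision units, each containing 3 random forest binary classifiers:
I. 'Normal decision unit': normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
II. 'Kidney cancer decision unit': kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
III. 'Urothelial carcinoma decision unit': urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
IV. 'Prostate cancer decision unit': prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.
Votes were then counted for each decision unit: the ratios A/B of the 6 marker sets for a sample were input into the corresponding classifiers of the 4 decision units for prediction. For example, the 'normal decision unit' receives N1 votes for normal, the 'kidney cancer decision unit' receives N2 votes for kidney cancer, the 'prostate cancer decision unit' receives N3 votes for prostate cancer, and the 'urothelial carcinoma decision unit' receives N4 votes for urothelial carcinoma. The class of the decision unit with the most votes is the final predicted class; if units tie on votes, the class with the highest predicted probability among the tied units is the final predicted class.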
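The voting scheme can be sketched as follows. Each of the 6 binary classifiers is represented only by its predicted probability for the first class of its pair, so no model training is shown. One detail is an assumption: the text breaks ties by "the highest predicted probability" without saying how the three probabilities of a unit are combined, and this sketch uses the maximum probability any of the unit's classifiers assigns to the unit's class. The function name and the probabilities are hypothetical.

```python
CLASSES = ("Normal", "KIRC", "UC", "PRAD")


def gudetector_vote(pair_probs):
    """pair_probs maps (a, b) to the probability of class a given by the
    a-vs-b binary classifier; class b implicitly gets 1 - p.

    Returns (predicted_class, votes_per_class)."""
    votes = {c: 0 for c in CLASSES}
    best_prob = {c: 0.0 for c in CLASSES}
    for (a, b), p in pair_probs.items():
        winner = a if p >= 0.5 else b
        votes[winner] += 1  # one vote for the winner's decision unit
        best_prob[a] = max(best_prob[a], p)
        best_prob[b] = max(best_prob[b], 1.0 - p)
    top = max(votes.values())
    tied = [c for c in CLASSES if votes[c] == top]
    # Tie-break (assumed rule): highest probability among the tied units.
    return max(tied, key=lambda c: best_prob[c]), votes


# Clear winner: UC wins all three of its pairwise comparisons.
clear = {("Normal", "KIRC"): 0.8, ("Normal", "UC"): 0.3, ("Normal", "PRAD"): 0.9,
         ("KIRC", "UC"): 0.2, ("KIRC", "PRAD"): 0.6, ("UC", "PRAD"): 0.7}
label_clear, votes_clear = gudetector_vote(clear)

# Tie: Normal and KIRC both get 2 votes; Normal has the higher probability.
tie = {("Normal", "KIRC"): 0.9, ("Normal", "UC"): 0.4, ("Normal", "PRAD"): 0.8,
       ("KIRC", "UC"): 0.7, ("KIRC", "PRAD"): 0.6, ("UC", "PRAD"): 0.3}
label_tie, votes_tie = gudetector_vote(tie)
```

Because normal-vs-kidney and kidney-vs-normal are the same classifier viewed from two units, counting the 6 pairwise winners once gives each unit its vote total N1 to N4.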
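In addition, the reliability of the 6 marker sets was validated in the public TCGA database. TCGA contains copy number data for various tumor tissues (primary tumor and normal tissue). The corresponding four datasets were downloaded, the values corresponding to the 6 marker sets were calculated (TCGA provides segment values, which measure copy number change), and these were fed into random forest models for training and prediction to evaluate accuracy.
2. Marker analysis results:
See Figures 1-12 (Figure 12 comprising 12A-12D). KIRC denotes kidney cancer, UC urothelial carcinoma, PRAD prostate cancer, and Normal healthy individuals. All results are predictions on the 30% test set; in general, the training set is used to select markers and train the classification model, and the test set is used to evaluate prediction accuracy.
The results are the classification performance of the random forest binary classifiers evaluated after the final 6 marker sets had been selected, computed with functions in the R language.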
1)如图1所示。
肾癌VS正常:敏感性72.2%,特异性93.1%。
2)如图2所示。
尿路上皮癌VS正常:敏感性76.2%,特异性100%。
3)如图3所示。
前列腺癌VS正常:敏感性71.4%,特异性93.1%。
4)如图4所示。
肾癌VS前列腺癌:敏感性72.2%,特异性85.7%。
5)如图5所示。
尿路上皮癌VS肾癌:敏感性95.2%,特异性77.8%。
6)如图6所示。
尿路上皮癌VS前列腺:敏感性85.7%,特异性85.7%。
7)如图7A和图7B所示。
参照实施例1-3的实验方法和样本。集成分类系统(GUdetector)4组同时分类。
8)如图8所示。
男性样本的前列腺癌诊断模型。参照实施例1-3的实验方法和样本,采用非肿瘤人群中的43个男性患者和45个前列腺癌患者的拷贝数数据,进行分类模型的构建。
前列腺癌VS正常:准确率AUC=0.967。
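9) As shown in Figure 9.
To take sex into account, all markers on the sex chromosomes were removed and, with reference to the experimental methods and samples of Examples 1-3, an SVM model was used to classify the 4 groups simultaneously.
The per-class prediction accuracies were 89.7% for the normal group, 76.2% for urothelial carcinoma, 64.3% for prostate cancer and 44.4% for kidney cancer, with an overall accuracy of 72.0%.
10) As shown in Figure 10.
With reference to the experimental methods and samples of Examples 1-3, an SVM model was used to classify 3 groups simultaneously. The per-class prediction accuracies were 88.5% for the normal group, 76.1% for urothelial carcinoma and 64.8% for kidney cancer, with an overall accuracy of 78.4%.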
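11) As shown in Figure 11.
With reference to the experimental methods and samples of Examples 1-3, and using only 90 non-tumor individuals and 65 urothelial carcinoma patients, an SVM model was used to diagnose urothelial carcinoma and was compared with the LASSO and random forest methods. For SVM, the prediction accuracies were 94.7% for the normal group and 86.5% for urothelial carcinoma, with an overall accuracy of 91.4%. For LASSO, the prediction accuracies were 94.7% for the normal group and 75.0% for urothelial carcinoma, with an overall accuracy of 86.72%. For random forest, the prediction accuracies were 97.4% for the normal group and 80.8% for urothelial carcinoma, with an overall accuracy of 89.8%.
12) As shown in Figures 12A-12D.
With reference to the experimental methods and samples of Examples 1-3, treatment efficacy was monitored dynamically in 3 urothelial carcinoma cases. For the three patients, the cfDNA copy number and the proportion of tumor DNA in total cfDNA before and after surgery were obtained with the ichorCNA algorithm. In all three patients, copy number changes and tumor DNA were detected before surgery but not after surgery, which is consistent with the patients' other examinations; none of the three patients relapsed. These results support the use of the present invention for non-invasive prognostic monitoring as well.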
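It should further be noted that:
Specificity and sensitivity are indices for evaluating the classification performance of the markers. Sensitivity is the ability to identify tumor patients, and specificity is the ability to identify healthy subjects. For example, suppose there are 1000 tumor patients and 1000 healthy subjects. With a classifier having a sensitivity of 72.2% and a specificity of 93.1%, 722 subjects in the tumor group and 931 subjects in the normal group are correctly identified.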
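The worked example above corresponds to the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch using the counts from the example (the function names are illustrative):

```python
def sensitivity(tp, fn):
    """Fraction of tumor patients correctly identified as tumor."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of healthy subjects correctly identified as healthy."""
    return tn / (tn + fp)

# Counts from the example in the text: 1000 tumor patients, 1000 healthy subjects.
tumor_total, normal_total = 1000, 1000
tp, tn = 722, 931                      # correctly identified in each group
fn, fp = tumor_total - tp, normal_total - tn

sens = sensitivity(tp, fn)             # 0.722
spec = specificity(tn, fp)             # 0.931
```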
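The sensitivity and specificity between two cancers are used to evaluate the ability to separate the two tumors. Although these two concepts are normally used for negative versus positive, or normal versus abnormal, the inventors also use them here to evaluate two tumors; the inventors defined a positive class, which is shown at the bottom of each result as the 'positive' class.
Besides the sensitivity and specificity values, Accuracy denotes the overall accuracy. The confusion matrix at the top of each result shows, for each group, the number of samples classified correctly and the number misclassified into another group.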
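In the confusion matrix, Reference denotes the true class and Prediction denotes the predicted class. For the UC group, for example, 16 UC samples were predicted as UC (correct), 2 UC samples were predicted as Normal, 3 UC samples were predicted as PRAD and none was predicted as KIRC; the other groups are read in the same way;
The overall accuracy is 0.7195;
The prediction accuracy of each class is the corresponding Sensitivity value listed below it. Specificity can be ignored here, because sensitivity and specificity are concepts from binary classification; for the present 4-class classification, only the overall accuracy and the sensitivity of each class are of interest.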
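As a small check of how a per-class sensitivity is read off the confusion matrix, the UC counts quoted above can be used directly. Only the UC reference class is given in the text (the full matrix appears in the figure), so the sketch is limited to that class; the helper `overall_accuracy` is included to show the formula and is not evaluated on invented data.

```python
# Predictions for samples whose true (Reference) class is UC, as quoted in the text.
uc_reference = {"UC": 16, "Normal": 2, "PRAD": 3, "KIRC": 0}

def class_sensitivity(reference_counts, true_label):
    """Correctly predicted samples of a class divided by all samples of that class."""
    return reference_counts[true_label] / sum(reference_counts.values())

def overall_accuracy(confusion):
    """confusion: dict true_label -> dict predicted_label -> count."""
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

uc_sens = class_sensitivity(uc_reference, "UC")   # 16 / 21
```

The value 16/21 rounds to 76.2%, which matches the urothelial carcinoma accuracy reported for the 4-class SVM result.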
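3. Discussion of results:
The inventors were the first to establish a urine-based cfDNA copy number classification system which, using the selected biomarker sets, can predict the different tissues of origin of unknown genitourinary tumors in a single test, with relatively high sensitivity and specificity. In addition, considering sex differences, only men need an assessment of prostate cancer risk, so the inventors also retrained prostate cancer classification markers specifically for men. Furthermore, with the sex factor excluded, a 3-class classification model for normal, kidney cancer and urothelial carcinoma was trained. The ensemble voting method cannot be used for 3-class classification, so the inventors compared machine learning classification methods such as SVM, LASSO and random forest, and found that the SVM model was clearly superior to the other two machine learning models (LASSO and random forest).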
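Example 5: Diagnosis example
For a random unknown subject in an outpatient clinic (who may be a healthy person or a patient with a genitourinary tumor), the following method can be used:
1. Collect morning urine and extract cfDNA;
2. Select DNA fragments of 100 bp-300 bp with magnetic beads;
3. Construct a whole-genome library;
4. Perform whole-genome sequencing of the library to obtain sequencing data;
5. Divide the genome of the sample to be tested into 50000 bins; normalize the sequencing data, and use the varbin algorithm to calculate the read ratio for each of the 50000 bins;
6. Extract the ratios corresponding to the 300 markers shown in Tables 1 to 6 and input them into the ensemble classification system (GUdetector) described above for prediction.
For the specific operations of steps 1-4 above, refer to Examples 1-4 respectively.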
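Steps 5 and 6 can be sketched as follows, using the definition of the ratio given in the claims: A is the GC-corrected read count of a bin and B is the total read count of the sample divided by the number of bins. The varbin binning and the GC correction are not reproduced; the sketch assumes the corrected counts are already available and takes their sum as the sample total. Function names, counts and marker indices are hypothetical.

```python
def bin_ratios(gc_corrected_counts):
    """Ratio A/B per bin, with B = total reads / number of bins.
    Assumes the input already holds GC-corrected read counts (A) for every bin."""
    expected = sum(gc_corrected_counts) / len(gc_corrected_counts)   # B
    return [a / expected for a in gc_corrected_counts]

def marker_features(ratios, marker_bin_indices):
    """Pick the ratios of the marker bins, in marker order, as the classifier input."""
    return [ratios[i] for i in marker_bin_indices]

# Toy genome of 4 bins instead of 50000 (hypothetical counts).
counts = [100, 150, 50, 100]
ratios = bin_ratios(counts)                    # [1.0, 1.5, 0.5, 1.0]
features = marker_features(ratios, [1, 2])     # ratios of two hypothetical marker bins
```

A ratio near 1 indicates the expected coverage for a bin, while values above or below 1 suggest a copy number gain or loss in that bin.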
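Example 6: Screening of prostate cancer diagnostic markers taking sex differences into account
Prostate cancer is a tumor specific to men. Therefore, if sex is not taken into account and the healthy population contains both men and women, the copy number of the sex chromosomes will cause the diagnostic accuracy of the classifier to be overestimated. Accordingly, to diagnose whether an unknown male subject has prostate cancer, the markers can be re-screened using only the men of the healthy population (healthy men vs. prostate cancer patients, Table 7). For a male subject in an outpatient clinic, the following method can be used:
1. Collect morning urine and extract cfDNA;
2. Select DNA fragments of 100 bp-300 bp with magnetic beads;
3. Construct a whole-genome library;
4. Perform whole-genome sequencing of the library to obtain sequencing data;
5. Divide the genome of the sample to be tested into 50000 bins; normalize the sequencing data, and use the varbin algorithm to calculate the read ratio for each of the 50000 bins;
6. Extract the ratios corresponding to the 50 markers shown in Table 7 and predict, with a machine learning algorithm such as SVM, whether the unknown sample is prostate cancer.
For the specific operations of steps 1-4 above, refer to Examples 1-4 respectively.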
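Example 7: Screening of markers for the diagnosis and classification of normal, kidney cancer and urothelial carcinoma
For a random unknown subject in an outpatient clinic (who may be a healthy person or a patient with kidney cancer or urothelial carcinoma), the following method can be used:
1. Collect morning urine and extract cfDNA;
2. Select DNA fragments of 100 bp-300 bp with magnetic beads;
3. Construct a whole-genome library;
4. Perform whole-genome sequencing of the library to obtain sequencing data;
5. Divide the genome of the sample to be tested into 50000 bins; normalize the sequencing data, and use the varbin algorithm to calculate the read ratio for each of the 50000 bins;
6. Extract the ratios corresponding to the 150 markers shown in Tables 1, 2 and 5 and predict, with a machine learning algorithm such as SVM, whether the unknown sample is normal, kidney cancer or urothelial carcinoma.
For the specific operations of steps 1-4 above, refer to Examples 1-4 respectively.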
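Example 8: Example of dynamic monitoring of treatment efficacy in urothelial carcinoma
Copy number analysis of cfDNA can equally be carried out with other algorithms, for example the ichorCNA algorithm. This method divides the genome into uniform regions of 1,000,000 bp and then calculates the copy number variation and the proportion of tumor-derived DNA. For a patient in an outpatient clinic who is examined before surgery and re-examined after treatment, the following method can be used:
1. Collect morning urine before surgery and at regular follow-up visits, and extract cfDNA;
2. Select DNA fragments of 100 bp-300 bp with magnetic beads;
3. Construct a whole-genome library;
4. Perform whole-genome sequencing of the library to obtain sequencing data;
5. Use the ichorCNA method to obtain the copy number variation profile of the urine cfDNA and the estimated tumor DNA content of the tumor patient before surgery and at follow-up;
6. Evaluate the patient's treatment efficacy and recurrence by comparing the above profiles and tumor DNA contents.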
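Comparative Example 1: LASSO model
1. Experimental method
The method of the reference "Circulating tumour DNA methylation markers for diagnosis and prognosis of hepatocellular carcinoma" was followed.
The input data were the ratios A/B corresponding to the 6 sets of biomarkers (markers) in Tables 1 to 6.
2. Experimental results
The results are shown in Table 9 below.
Table 9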
Figure PCTCN2020087830-appb-000002
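The results show that, with the LASSO classification model, the per-class prediction accuracies were lower than those of the ensemble classification system (GUdetector) proposed by the inventors, and the overall accuracy was only 58.5%.
Comparative Example 2: SVM model
1. Experimental method
The method of the reference "CancerLocator: non-invasive cancer diagnosis and tissue-of-origin prediction using methylation profiles of cell-free DNA" was followed.
The input data were the ratios A/B corresponding to the 6 sets of biomarkers (markers) in Tables 1 to 6.
2. Experimental results
The results are shown in Table 10 below.
Table 10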
Figure PCTCN2020087830-appb-000003
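The results show that, with the SVM classification model, the per-class prediction accuracies were lower than those of the ensemble classification system (GUdetector) proposed by the inventors, and the overall accuracy was only 54.7%.
Comparative Example 3: Random forest four-class classification model
1. Experimental method
The method of the reference "Epigenetic profiling for the molecular classification of metastatic brain tumors" was followed.
The input data were the ratios A/B corresponding to the 6 sets of biomarkers (markers) in Tables 1 to 6.
2. Experimental results
The results are shown in Table 11 below.
Table 11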
Figure PCTCN2020087830-appb-000004
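The results show that, with the random forest four-class classification model, the per-class prediction accuracies were lower than those of the ensemble classification system (GUdetector) proposed by the inventors, and the overall accuracy was only 65.1%.
Although specific embodiments of the present invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, various modifications and substitutions may be made to those details, and all such changes fall within the scope of protection of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.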

Claims (26)

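  1. A cfDNA classification method, comprising:
    calculating copy number variation data of cfDNA in a target sample;
    calculating the similarity between the target cfDNA copy number variation data and the cfDNA copy number variation data of each classification label;
    determining, based on the similarity, the class to which the target cfDNA belongs using a classifier model.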
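  2. The classification method according to claim 1, wherein determining the class to which the target cfDNA belongs comprises:
    determining, based on the similarity, the correlation between the cfDNA copy number variation data of each classification label and a human genitourinary tumor using a random forest model;
    determining, based on the correlation, the class to which the target cfDNA belongs using the classifier model.
  3. The classification method according to claim 2, wherein determining the correlation between the cfDNA copy number variation data of each classification label and the human genitourinary tumor comprises:
    ranking the cfDNA copy number variation data according to the correlation to form a vector sequence;
    inputting the vector sequence into the random forest model to determine the correlation between the cfDNA copy number variation data of the classification labels and the human genitourinary tumor.
  4. The classification method according to claim 3, wherein the human genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
    preferably, the kidney cancer is clear cell renal cell carcinoma,
    preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer,
    preferably, the prostate cancer is prostate adenocarcinoma;
    preferably, the human genitourinary tumor is confirmed by tissue biopsy of a surgical specimen.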
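  5. The classification method according to claim 3 or 4, wherein the random forest model is at least 3 random forest binary classifiers selected from any one, two, three or four of the following groups I-IV:
    I.
    normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
    II.
    kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
    III.
    urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
    IV.
    prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.
  6. The classification method according to claim 5, wherein a vote is taken for each group and the class corresponding to the group with the most votes is the final class; if the numbers of votes are equal, the class with the highest prediction probability among the tied groups is the final class.
  7. The classification method according to any one of claims 1 to 6, wherein the copy number variation data of the cfDNA in the target sample and/or the cfDNA copy number variation data of each classification label are calculated from sequencing data of cfDNA in a urine sample; preferably, the sequencing data are whole-genome sequencing data; preferably, the sequencing depth is 1X-5X.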
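  8. The classification method according to any one of claims 1 to 7, wherein the copy number variation data of the cfDNA in the target sample and/or the cfDNA copy number variation data of each classification label are calculated as follows:
    dividing the genome of the sample to be tested into 5000-500000 bins of equal length or of equal theoretically simulated copy number; normalizing the sequencing data, and calculating the ratio A/B of the read count for each bin,
    wherein:
    A is the actual number of reads in a bin after GC content correction;
    B is the theoretical number of reads in that bin, obtained by dividing the total number of reads measured for the sample by the total number of bins;
    the ratio A/B is the copy number variation.
  9. The classification method according to claim 8, wherein the genome of the sample to be tested is divided into 5000-500000 bins of equal length or of equal theoretically simulated copy number by Varbin, CNVnator, ReadDepth or SegSeq;
    and/or
    the ratio A/B of the read count for each bin is calculated by Varbin, CNVnator, ReadDepth or SegSeq.
  10. The classification method according to any one of claims 7 to 9, wherein the urine sample is morning urine; preferably, the urine sample is morning urine supernatant.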
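  11. The classification method according to claim 8 or 9, wherein the ratio A/B is the ratio A/B of each biomarker in a biomarker combination,
    wherein
    the biomarker combination comprises m biomarkers, m being a positive integer greater than or equal to 50;
    the biomarker is a segment of DNA whose start position on the chromosome is A±n1 and whose end position is B±n2;
    wherein n1 and n2 are independently non-negative integers less than or equal to 60000;
    wherein the chromosome, A and B are selected from any 1, any 2, any 3, any 4, any 5, any 6 or all 7 of the following groups (1)-(7);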
    (1) Biomarkers for kidney cancer vs normal
    Table 1
    No.  Chromosome  A  B
    1  chr14  105173382  105228468
    2  chr4  126141989  126199070
    3  chr2  38340335  38396819
    4  chr4  120896519  120952988
    5  chr1  225263465  225322410
    6  chr3  49627990  49683004
    7  chr12  55710185  55770826
    8  chr2  198023323  198078345
    9  chr8  104278540  104334789
    10  chr15  102366051  102531392
    11  chr5  56684537  56739554
    12  chr12  2875899  2930969
    13  chr5  8084151  8143261
    14  chr13  24239617  24294704
    15  chr14  63064067  63121825
    16  chr10  32966493  33022298
    17  chr18  34499871  34555093
    18  chr18  27538044  27593083
    19  chr19  52518298  52574358
    20  chr3  148084127  148140439
    21  chr11  23395282  23450515
    22  chr19  53868391  53924718
    23  chr7  36856760  36911789
    24  chr19  55851675  55906675
    25  chr12  130622755  130677832
    26  chr8  88140900  88196181
    27  chr8  98015299  98073611
    28  chr22  24279186  24375790
    29  chr10  58285076  58342675
    30  chr1  193398457  193455292
    31  chr11  44170591  44225937
    32  chr3  99497035  99552049
    33  chr18  70229325  70284364
    34  chr3  86800483  86855497
    35  chr7  85391699  85446714
    36  chr2  222217699  222274614
    37  chr12  51953090  52017679
    38  chr2  231506603  231561625
    39  chr7  54479671  54534725
    40  chr5  40826473  40882045
    41  chr3  61041867  61097030
    42  chr1  71530378  71587704
    43  chr19  30375804  30434948
    44  chr5  103365336  103426037
    45  chr16  72331875  72390386
    46  chr12  77381964  77436979
    47  chr19  35419205  35474205
    48  chr8  131286269  131341291
    49  chr21  30776557  30834320
    50  chr9  17638202  17695124
    (2) Biomarkers for urothelial carcinoma vs normal
    Table 2
    No.  Chromosome  A  B
    1  chr1  165542998  165598528
    2  chr20  45298182  45353725
    3  chr7  110250206  110305749
    4  chr8  34086369  34141392
    5  chr11  3080528  3135556
    6  chr8  81773551  81828573
    7  chr7  20604578  20660880
    8  chr8  101664207  101719230
    9  chr8  127300805  127363897
    10  chr3  175419548  175474633
    11  chr7  17433047  17488061
    12  chr11  126763962  126818990
    13  chr8  81328435  81383788
    14  chr1  160347268  160402416
    15  chr3  150917292  150976246
    16  chr8  78266536  78321853
    17  chr2  127233784  127288805
    18  chr9  119009696  119064910
    19  chr7  88363140  88418154
    20  chr6  168087004  168142398
    21  chr8  101056393  101111465
    22  chr9  121669613  121725772
    23  chr8  32804682  32859711
    24  chr1  160016845  160071870
    25  chr8  52860841  52916007
    26  chr1  184863212  184918237
    27  chr8  103059578  103114914
    28  chr11  131771420  131826541
    29  chr11  132772276  132827397
    30  chr8  142309304  142365059
    31  chr11  20866407  20922555
    32  chr9  9389289  9445177
    33  chr8  86975952  87030974
    34  chr8  68297698  68353353
    35  chr9  122009782  122064791
    36  chr8  61387868  61442890
    37  chr8  82499446  82554469
    38  chr9  118116705  118171814
    39  chr8  117772819  117827841
    40  chr9  135838140  135893149
    41  chr14  101522031  101577065
    42  chr8  81105039  81160812
    43  chr3  161042779  161098402
    44  chr9  104364444  104420690
    45  chr8  61111592  61166615
    46  chr20  31048866  31103880
    47  chr15  26890253  26945265
    48  chr4  28406811  28462319
    49  chr5  35031116  35086691
    50  chr10  101035266  101090283
    (3) Biomarkers for prostate cancer vs normal
    Table 3
    No.  Chromosome  A  B
    1  chr6  150259849  150319419
    2  chr11  50065867  50143253
    3  chr2  223609354  223664376
    4  chr3  178315458  178370471
    5  chr5  142022744  142077815
    6  chr3  72366362  72421541
    7  chr14  51571751  51628678
    8  chr10  69911981  69966998
    9  chr9  75793867  75850925
    10  chr16  34486643  34542808
    11  chr16  75960918  76016022
    12  chr1  213593324  213648410
    13  chr14  81176000  81231314
    14  chr14  48680148  48735914
    15  chr1  66328295  66385662
    16  chr2  236695859  236750881
    17  chr16  34310644  34370518
    18  chr13  70644019  70699054
    19  chr1  104971030  105026648
    20  chr19  20033425  20088912
    21  chr12  41633765  41689196
    22  chr1  111186072  111241148
    23  chr11  81515081  81570551
    24  chr6  164934635  164990438
    25  chr7  88753879  88809024
    26  chr2  204421512  204476533
    27  chr13  38205109  38260137
    28  chr19  57310235  57365579
    29  chr5  172615261  172670278
    30  chr13  100608580  100663608
    31  chr1  248513391  248569321
    32  chr5  78269787  78325922
    33  chr10  12753021  12808156
    34  chr7  101911102  101966116
    35  chr17  30274080  30334227
    36  chr12  87935928  87995848
    37  chr9  12175965  12231559
    38  chr5  97385699  97441111
    39  chr8  3970051  4025074
    40  chr7  20604578  20660880
    41  chr8  32416104  32471278
    42  chr7  12021765  12077292
    43  chr20  11563548  11624648
    44  chr7  51785230  51840244
    45  chr19  16615231  16670336
    46  chr10  67343243  67399416
    47  chr11  10953369  11008630
    48  chr2  22332272  22390528
    49  chr17  10390372  10446415
    50  chr4  976667  1032082
    (4) Biomarkers for kidney cancer vs prostate cancer
    Table 4
    No.  Chromosome  A  B
    1  chr4  163059481  163114735
    2  chr4  6580383  6635407
    3  chr6  132270265  132325276
    4  chr2  82257259  82312280
    5  chr1  159394058  159452969
    6  chr9  105154079  105209849
    7  chr2  187699497  187754518
    8  chr4  126199070  126254087
    9  chr20  18854392  18909406
    10  chr7  15040427  15095480
    11  chr3  44690964  44747019
    12  chr11  57212694  57267722
    13  chr2  48829261  48885035
    14  chr12  133782920  133851895
    15  chr5  98900964  98963876
    16  chr11  86090264  86145292
    17  chr7  128477838  128533737
    18  chr2  32933311  32988604
    19  chr7  12693292  12748805
    20  chr4  95879059  95934075
    21  chr8  59989616  60044780
    22  chr12  32405135  32460143
    23  chr7  37972210  38027551
    24  chr11  128601685  128656714
    25  chr6  64185537  64240615
    26  chr7  107787926  107843035
    27  chr18  29036127  29091424
    28  chr16  47711531  47767836
    29  chr7  14590286  14645354
    30  chr11  55525982  55582014
    31  chr5  174061726  174116744
    32  chr14  44456533  44512749
    33  chr3  168694552  168750070
    34  chr4  114652704  114707721
    35  chr2  27431778  27486799
    36  chr4  107314339  107370716
    37  chr2  182718295  182773317
    38  chr10  19690582  19745774
    39  chr10  23594781  23649798
    40  chr3  3972580  4034015
    41  chr6  31323092  31379758
    42  chr8  128874896  128929933
    43  chr1  26256318  26311633
    44  chr5  161340570  161395587
    45  chr12  91346168  91401202
    46  chr19  2637431  2692582
    47  chr7  36856760  36911789
    48  chr9  27809024  27864032
    49  chr2  116615151  116670172
    50  chr9  112566383  112621994
    (5) Biomarkers for urothelial carcinoma vs kidney cancer
    Table 5
    No.  Chromosome  A  B
    1  chr4  163059481  163114735
    2  chr4  6580383  6635407
    3  chr6  132270265  132325276
    4  chr2  82257259  82312280
    5  chr1  159394058  159452969
    6  chr9  105154079  105209849
    7  chr2  187699497  187754518
    8  chr4  126199070  126254087
    9  chr20  18854392  18909406
    10  chr7  15040427  15095480
    11  chr3  44690964  44747019
    12  chr11  57212694  57267722
    13  chr2  48829261  48885035
    14  chr12  133782920  133851895
    15  chr5  98900964  98963876
    16  chr11  86090264  86145292
    17  chr7  128477838  128533737
    18  chr2  32933311  32988604
    19  chr7  12693292  12748805
    20  chr4  95879059  95934075
    21  chr8  59989616  60044780
    22  chr12  32405135  32460143
    23  chr7  37972210  38027551
    24  chr11  128601685  128656714
    25  chr6  64185537  64240615
    26  chr7  107787926  107843035
    27  chr18  29036127  29091424
    28  chr16  47711531  47767836
    29  chr7  14590286  14645354
    30  chr11  55525982  55582014
    31  chr5  174061726  174116744
    32  chr14  44456533  44512749
    33  chr3  168694552  168750070
    34  chr4  114652704  114707721
    35  chr2  27431778  27486799
    36  chr4  107314339  107370716
    37  chr2  182718295  182773317
    38  chr10  19690582  19745774
    39  chr10  23594781  23649798
    40  chr3  3972580  4034015
    41  chr6  31323092  31379758
    42  chr8  128874896  128929933
    43  chr1  26256318  26311633
    44  chr5  161340570  161395587
    45  chr12  91346168  91401202
    46  chr19  2637431  2692582
    47  chr7  36856760  36911789
    48  chr9  27809024  27864032
    49  chr2  116615151  116670172
    50  chr9  112566383  112621994
    (6) Biomarkers for urothelial carcinoma vs prostate cancer
    Table 6
    No.  Chromosome  A  B
    1  chr3  88025277  88080310
    2  chr19  39394315  39449482
    3  chr20  31436554  31491568
    4  chr7  48432792  48487842
    5  chr8  87141019  87196120
    6  chr4  13859414  13914431
    7  chr1  160292243  160347268
    8  chr8  112245103  112300126
    9  chr8  11530043  11585066
    10  chr8  13932292  13987366
    11  chr3  152913886  152973883
    12  chr9  109516082  109571205
    13  chr11  8343925  8398954
    14  chr3  122030664  122085678
    15  chr5  87727661  87782722
    16  chr5  60881889  60936907
    17  chr14  40518423  40573582
    18  chr8  94667609  94724236
    19  chr8  101719230  101774274
    20  chr5  113527635  113584160
    21  chr3  103853900  103909150
    22  chr8  62393903  62449668
    23  chr8  124248002  124303024
    24  chr17  74131207  74186417
    25  chr14  52519339  52574927
    26  chr3  144795549  144851338
    27  chr3  84803116  84858323
    28  chr8  50523567  50578589
    29  chr8  88545977  88603606
    30  chr1  42119088  42174113
    31  chr20  43860121  43915135
    32  chr9  121061199  121116207
    33  chr9  118676908  118734641
    34  chr11  13163841  13219126
    35  chr11  57212694  57267722
    36  chr8  131892873  131948409
    37  chr11  16410024  16465871
    38  chr8  109405759  109460782
    39  chr5  158002797  158058189
    40  chr11  1579888  1635511
    41  chr8  51749113  51804136
    42  chr9  118562723  118621899
    43  chr17  29154317  29209332
    44  chr6  73471411  73528437
    45  chr3  87522168  87578480
    46  chr1  231915581  231971963
    47  chr8  117772819  117827841
    48  chr1  241691293  241746318
    49  chr9  92506773  92712072
    50  chr4  19120611  19176371
    (7) Biomarkers for normal vs prostate cancer
    Table 7
    No.  Chromosome  A  B
    1  chr11  40374531  40429896
    2  chr12  61310253  61365625
    3  chr19  56809188  56866674
    4  chr2  145644444  145702420
    5  chr6  98011442  98066653
    6  chr7  88753879  88809024
    7  chr9  98761758  98817567
    8  chrY  4474368  4588559
    9  chrY  18884928  18940043
    10  chrY  5632826  5746826
    11  chrY  24371813  24427746
    12  chrY  5948790  6035624
    13  chrY  19228861  19283946
    14  chrY  21484883  21542276
    15  chrY  5746826  5851679
    16  chrY  28707448  28764196
    17  chrY  6599942  6664881
    18  chrY  23799512  23860617
    19  chrY  3427018  3545705
    20  chrY  13573548  13635016
    21  chrY  18387555  18551943
    22  chrY  16529414  16585431
    23  chrY  19111726  19166891
    24  chrY  9020782  9081054
    25  chrY  19451088  19508211
    26  chrY  6720180  6778075
    27  chrY  6349316  6458079
    28  chrY  4163770  4261597
    29  chrY  28648165  28707448
    30  chrY  8741265  8796960
    31  chrY  19283946  19339589
    32  chrY  3970433  4073487
    33  chrY  7346142  7402799
    34  chrY  15149848  15205024
    35  chrY  18774055  18829409
    36  chrY  7290613  7346142
    37  chrY  23743018  23799512
    38  chrY  4700163  4811039
    39  chrY  16473510  16529414
    40  chrY  21654324  21709511
    41  chrY  14418460  14477812
    42  chrY  5851679  5948790
    43  chrY  8685630  8741265
    44  chrY  14650141  14705375
    45  chrY  15605187  15663531
    46  chrY  4073487  4163770
    47  chrY  9399760  9457656
    48  chrY  4366038  4474368
    49  chrY  4937971  5066009
    50  chrY  19564127  21039220
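  12. The classification method according to claim 11, wherein m is 50-300 or greater than 300, for example 50-100, 100-150, 150-200, 200-250, 250-300, 50, 100, 150, 200, 250 or 300.
  13. The classification method according to claim 11, wherein n1 and n2 are independently 5000, 4000, 3000, 2000, 1500, 1000, 500, 300, 200, 150, 100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5 or 0.
  14. The classification method according to claim 11, wherein the biomarker is a segment of cfDNA; preferably, the cfDNA is derived from human urine, in particular human urine supernatant.
  15. The classification method according to any one of claims 11 to 14, wherein
    the chromosome, A and B are as shown in any 1, any 2, any 3, any 4, any 5, any 6 or all 7 of groups (1)-(7).
  16. A method for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of a human genitourinary tumor, comprising the following step (1), step (2), optional step (3) and step (4):
    (1) collecting a urine sample and extracting cfDNA;
    (2) selecting cfDNA fragments of 90-300 bp or cfDNA fragments of 100-300 bp;
    (3) constructing a whole-genome library from the cfDNA fragments obtained;
    (4) classifying the cfDNA fragments by the classification method according to any one of claims 1 to 15.
  17. The method according to claim 16, wherein the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer; preferably, the kidney cancer is clear cell renal cell carcinoma, the urothelial carcinoma includes upper tract urothelial carcinoma and bladder cancer, and the prostate cancer is prostate adenocarcinoma.
  18. The method according to claim 16, wherein, in step (1), the urine sample is morning urine; preferably, the urine sample is morning urine supernatant.
  19. The method according to claim 16, wherein, in step (2), the selection is magnetic bead selection.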
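  20. An apparatus for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of a human genitourinary tumor, comprising:
    I. 'Normal decision unit':
    normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
    II. 'Kidney cancer decision unit':
    kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
    III. 'Urothelial carcinoma decision unit':
    urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
    IV. 'Prostate cancer decision unit':
    prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.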
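  21. An apparatus for the detection, diagnosis, classification, disease risk assessment or prognosis assessment of a human genitourinary tumor,
    comprising a memory; and a processor coupled to the memory,
    wherein
    the memory stores program instructions executed by the processor, the program instructions comprising any 1, any 2, any 3 or all 4 of the following 4 decision units, wherein each decision unit contains 3 random forest binary classifiers:
    I. 'Normal decision unit':
    normal-vs-kidney cancer, normal-vs-urothelial carcinoma, normal-vs-prostate cancer;
    II. 'Kidney cancer decision unit':
    kidney cancer-vs-normal, kidney cancer-vs-urothelial carcinoma, kidney cancer-vs-prostate cancer;
    III. 'Urothelial carcinoma decision unit':
    urothelial carcinoma-vs-normal, urothelial carcinoma-vs-kidney cancer, urothelial carcinoma-vs-prostate cancer;
    IV. 'Prostate cancer decision unit':
    prostate cancer-vs-normal, prostate cancer-vs-kidney cancer, prostate cancer-vs-urothelial carcinoma.
  22. The apparatus according to claim 21, wherein the processor is configured to execute, based on instructions stored in the memory device, the classification method according to any one of claims 1 to 15.
  23. The apparatus according to any one of claims 20 to 22, wherein the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
    preferably, the kidney cancer is clear cell renal cell carcinoma,
    preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer,
    preferably, the prostate cancer is prostate adenocarcinoma.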
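  24. Use of any one of the following items 1)-3) in the preparation of a medicament for the detection, diagnosis, disease risk assessment or prognosis assessment of a human genitourinary tumor:
    1) the biomarker combination according to any one of claims 11 to 15;
    2) cfDNA in human urine, in particular cfDNA in human urine supernatant;
    preferably, the urine is morning urine;
    preferably, the cfDNA is cfDNA of 90-300 bp or cfDNA of 100-300 bp; more preferably, the cfDNA is cfDNA of 90-150 bp or cfDNA of 100-150 bp;
    3) a DNA library prepared from item 2); preferably, the DNA library is a whole-genome library;
    preferably, the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
    preferably, the kidney cancer is clear cell renal cell carcinoma,
    preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer,
    preferably, the prostate cancer is prostate adenocarcinoma.
  25. Any one of the following items 1)-3), for use in the detection, diagnosis, disease risk assessment or prognosis assessment of a human genitourinary tumor:
    1) the biomarker combination according to any one of claims 11 to 15;
    2) cfDNA in human urine, in particular cfDNA in human urine supernatant;
    preferably, the urine is morning urine;
    preferably, the cfDNA is cfDNA of 90-300 bp or cfDNA of 100-300 bp; more preferably, the cfDNA is cfDNA of 90-150 bp or cfDNA of 100-150 bp;
    3) a DNA library prepared from item 2); preferably, the DNA library is a whole-genome library;
    preferably, the genitourinary tumor is one or more selected from prostate cancer, urothelial carcinoma and kidney cancer;
    preferably, the kidney cancer is clear cell renal cell carcinoma,
    preferably, the urothelial carcinoma is upper tract urothelial carcinoma and/or bladder cancer,
    preferably, the prostate cancer is prostate adenocarcinoma.
  26. A biomarker combination, which is the biomarker combination described in any one of claims 11 to 15.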
PCT/CN2020/087830 2019-05-07 2020-04-29 cfDNA classification method, apparatus and application WO2020224504A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/609,036 US20220336043A1 (en) 2019-05-07 2020-04-29 cfDNA CLASSIFICATION METHOD, APPARATUS AND APPLICATION

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910374094.1 2019-05-07
CN201910374094.1A CN111833963B (zh) 2019-05-07 2019-05-07 cfDNA classification method, apparatus and application

Publications (1)

Publication Number Publication Date
WO2020224504A1 true WO2020224504A1 (zh) 2020-11-12

Family

ID=72912303

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087830 WO2020224504A1 (zh) 2019-05-07 2020-04-29 cfDNA classification method, apparatus and application

Country Status (3)

Country Link
US (1) US20220336043A1 (zh)
CN (1) CN111833963B (zh)
WO (1) WO2020224504A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
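CN113257360B (zh) * 2021-06-24 2021-10-15 北京橡鑫生物科技有限公司 Cancer screening model, and method and device for constructing the cancer screening model
CN115148287B (zh) * 2022-09-01 2024-05-31 中山大学肿瘤防治中心(中山大学附属肿瘤医院、中山大学肿瘤研究所) Method for constructing a gene focal amplification typing model and method for typing tumor samples
CN115691667B (zh) * 2022-12-30 2023-04-18 北京橡鑫生物科技有限公司 Urothelial carcinoma early screening device, model construction method and apparatus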

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
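CN105102634A (zh) * 2013-03-15 2015-11-25 伊穆科Gti诊治股份有限公司 Methods and compositions for assessing renal status using urine cell-free DNA
CN105567846A (zh) * 2016-02-14 2016-05-11 上海交通大学医学院附属仁济医院 Kit for detecting bacterial DNA in feces and its use in colorectal cancer diagnosis
CN108763859A (zh) * 2018-05-17 2018-11-06 北京博奥医学检验所有限公司 Method for building a simulated data set required for CNV detection based on samples with unknown CNVs
CN108846259A (zh) * 2018-04-26 2018-11-20 河南师范大学 Gene classification method and system based on clustering and random forest algorithms
CN109182526A (zh) * 2018-10-10 2019-01-11 杭州翱锐生物科技有限公司 Kit for auxiliary diagnosis of early liver cancer and detection method thereof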

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BHUVAN MOLPARIA,ESHAAN NICHANI,ALI TORKAMANI: "Assessment of Circulating Copy Number Variant Detection for Cancer Screening", PLOS ONE, e0180647, 7 July 2017 (2017-07-07), pages 1 - 18, XP055751956, ISSN: 1932-6203, DOI: 10.1371/journal.pone.0180647 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838533A (zh) * 2021-08-17 2021-12-24 福建和瑞基因科技有限公司 Cancer detection model, construction method thereof and kit
CN113838533B (zh) * 2021-08-17 2024-03-12 福建和瑞基因科技有限公司 Cancer detection model, construction method thereof and kit

Also Published As

Publication number Publication date
CN111833963B (zh) 2024-06-11
US20220336043A1 (en) 2022-10-20
CN111833963A (zh) 2020-10-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20801954

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20801954

Country of ref document: EP

Kind code of ref document: A1