WO2020124625A1 - ctDNA-based gene detection method, device, storage medium, and computer system - Google Patents

ctDNA-based gene detection method, device, storage medium, and computer system

Info

Publication number
WO2020124625A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
ctdna
sequencing
sample
cnv
Prior art date
Application number
PCT/CN2018/123705
Other languages
English (en)
French (fr)
Inventor
徐寒黎
张静波
方楠
王建伟
伍启熹
刘倩
刘珂弟
唐宇
Original Assignee
北京优迅医学检验实验室有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京优迅医学检验实验室有限公司
Publication of WO2020124625A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer

Definitions

  • the present invention relates to the field of gene detection, and in particular, to a gene detection method, device, storage medium, and computer system based on ctDNA.
  • the first type comprises traditional screening methods.
  • the second type comprises early screening methods based on ctDNA.
  • The more mature traditional methods include mammography to screen for breast cancer, and Pap smears and other cytological methods to screen for cervical cancer.
  • Puncture (needle biopsy) tissue samples are taken for pathological or genetic testing.
  • early screening based on ctDNA is expected to resolve these problems. Because cancer is essentially a genetic disease, mutations at the DNA level occur earlier than changes at the cellular and tissue levels. Since early tumors release ctDNA, detecting ctDNA makes early screening of cancer possible. There are currently two methods:
  • a panel captures cancer-related gene regions, and ultra-high-depth sequencing (tens of thousands ×) is performed to detect ultra-low-frequency mutations in ctDNA and infer whether early canceration has occurred.
  • tissue localization is achieved by screening CpG methylated haplotype tags.
  • the limitations of this kind of method are: 1) the experimental process of methylation sequencing is complicated and the operating threshold is high; 2) methylation sequencing is more expensive; 3) bisulfite treatment or methylation-sensitive enzyme digestion before sequencing may reduce accuracy.
  • Embodiments of the present invention provide a ctDNA-based gene detection method, device, storage medium, and computer system to solve the problem of high cost of ctDNA gene detection in the prior art.
  • a ctDNA-based gene detection method includes: acquiring ctDNA sequencing data of a sample to be tested; comparing the sequencing data with a reference genome and retaining the comparison data that meet preset conditions; and analyzing the comparison data according to at least one of the following parameters to determine the result corresponding to the ctDNA: the mutation error spectrum and the CNV characteristics of the comparison data, where the mutation error spectrum is obtained by classifying the mutation errors and calculating the abundance of each type of mutation error.
  • the mutation error is a base that is inconsistent with the base of the reference genome except for the polymorphic site.
  • the mutation error spectrum is obtained by classifying mutation errors according to the following information and calculating the abundance of each type of mutation error: reference base, measured base, positive strand, negative strand, and background.
  • the mutation errors are classified into at least the following categories: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
  • the CNV feature is obtained by the following steps: dividing the reference genome into a series of windows of a predetermined width as the minimum analysis unit; using a hidden Markov model to remove, per minimum analysis unit, the population-level CNV in the comparison data to obtain a first data set; performing GC correction on the first data set to obtain a second data set; removing the interference of germline CNV from the second data set to obtain a third data set; and reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  • determining the corresponding result of the ctDNA according to the parameters includes: predicting the corresponding parameter in the sequencing data of the ctDNA of the test sample based on the parameter corresponding to the sequencing data of the known category obtained in advance to obtain a prediction result; determining according to the prediction result The category corresponding to the ctDNA of the test sample is used as the result corresponding to the ctDNA of the test sample.
  • the category corresponding to the ctDNA of the test sample is a tumor patient or a non-tumor patient.
  • predicting the parameters corresponding to the sequencing data of the ctDNA of the sample to be tested based on the parameters corresponding to the sequencing data of the known category obtained in advance includes: using a support vector machine to establish a relationship model between the known phenotype and at least one of the mutation error spectrum and the CNV characteristics in the sequencing data of a population of known phenotype; and using the relationship model and at least one of the mutation error spectrum and the CNV characteristics corresponding to the ctDNA of the sample to be tested to make predictions.
  • obtaining sequencing data of ctDNA includes: sequencing ctDNA derived from the sample to be tested to obtain original data; and performing quality control on the original data to obtain sequencing data.
  • performing quality control on the original data to obtain sequencing data includes: deleting at least one of the following kinds of reads from the original data: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20.
  • comparing the sequencing data with the reference genome and retaining the alignment data that meet the preset conditions includes: comparing the sequencing data with the reference genome and retaining the reads that align completely to the reference genome as the alignment data.
  • sequencing the ctDNA derived from the sample to be tested includes extracting ctDNA from the sample to be tested and performing low-depth sequencing of the whole genome.
  • a storage medium on which computer-executable program code is stored; when the program code is executed by one or more processors of a computer system, the computer system executes a ctDNA-based gene detection method.
  • the computer-executable program code includes: code for obtaining the ctDNA sequencing data of the sample to be tested; code for comparing the sequencing data with the reference genome and retaining the alignment data that meet the preset conditions; and code for analyzing the comparison data according to at least one of the following parameters and determining the gene result corresponding to the ctDNA: the mutation error spectrum and the CNV characteristics of the comparison data, where the mutation error spectrum is obtained by classifying the mutation errors and calculating the abundance of each type of mutation error.
  • the mutation error is a base that is inconsistent with the base of the reference genome except for the polymorphic site.
  • the mutation error spectrum is obtained by classifying mutation errors according to the following information and calculating the abundance of each type of mutation error: reference base, measured base, positive strand, negative strand, and background.
  • the mutation errors are classified into at least the following categories: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
  • the CNV feature is obtained by executing the following code: code for dividing the reference genome into a series of windows of a predetermined width as the minimum analysis unit; code for using a hidden Markov model to remove, per minimum analysis unit, the population-level CNV in the comparison data to obtain a first data set; code for performing GC correction on the first data set to obtain a second data set; code for removing the interference of germline CNV from the second data set to obtain a third data set; and code for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  • the code for determining the result corresponding to the ctDNA according to the parameters includes: code for predicting the corresponding parameters in the ctDNA sequencing data of the sample to be tested, based on the parameters corresponding to previously obtained sequencing data of known categories, to obtain a prediction result; and code for determining, according to the prediction result, the category corresponding to the ctDNA of the sample to be tested as the result corresponding to that ctDNA.
  • the category corresponding to the ctDNA of the test sample is a tumor patient or a non-tumor patient.
  • the code for predicting the corresponding parameters in the sequencing data of the ctDNA of the sample to be tested based on the parameters corresponding to the sequencing data of the known category obtained in advance includes: code for using a support vector machine to establish a relationship model between the known phenotype and at least one of the mutation error spectrum and the CNV characteristics in the sequencing data of a population of known phenotype; and code for using the relationship model and at least one of the mutation error spectrum and the CNV characteristics corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample to be tested.
  • the code for obtaining the ctDNA sequencing data of the sample to be tested includes: a code for sequencing the ctDNA derived from the sample to obtain the original data; and a code for performing quality control on the original data to obtain the sequencing data.
  • the code for performing quality control on the original data to obtain sequencing data includes: code for deleting at least one of the following kinds of reads from the original data: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20.
  • the code for comparing the sequencing data with the reference genome and retaining the alignment data that meet the preset conditions includes: code for comparing the sequencing data with the reference genome and retaining the reads that align completely to the reference genome as the comparison data.
  • the code for sequencing the ctDNA from the test sample to obtain the original data includes: a code for extracting the ctDNA from the test sample and performing low-depth sequencing of the whole genome.
  • a computer system including a processor, a system memory, and one or more computer-readable storage media, where computer-executable instructions are stored on the storage media, and the storage media are any of the storage media described above.
  • a ctDNA-based gene detection device, which includes: an acquisition module for acquiring ctDNA sequencing data of a sample to be tested; an alignment module for comparing the sequencing data with a reference genome and retaining the comparison data that meet preset conditions; and an analysis determination module for analyzing the comparison data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum and the CNV characteristics of the comparison data, where the mutation error spectrum is obtained by classifying the mutation errors and calculating the abundance of each type of mutation error.
  • the mutation error is a base that is inconsistent with the base of the reference genome except for the polymorphic site.
  • the analysis and determination module further includes a mutation error spectrum module.
  • the mutation error spectrum module classifies the mutation errors according to the following information and calculates the abundance of each type of mutation error: reference base, measured base, positive strand, negative strand, and background.
  • the mutation error spectrum module classifies the mutation errors into at least the following categories according to the information: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
  • the analysis determination module further includes a CNV feature extraction module for extracting CNV features.
  • the CNV feature extraction module includes: a window division submodule for dividing the reference genome into a series of windows of a predetermined width as the minimum analysis unit; a first correction submodule for removing, per minimum analysis unit, the population-level CNV in the comparison data to obtain a first data set; a second correction submodule for performing GC correction on the first data set to obtain a second data set; a third correction submodule for removing the interference of germline CNV from the second data set to obtain a third data set; and a CNV extraction submodule for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  • the analysis and determination module includes: a prediction module for predicting the corresponding parameters in the ctDNA sequencing data of the test sample, based on the parameters corresponding to previously obtained sequencing data of known categories, to obtain a prediction result; and a determination module for determining, according to the prediction result, the category corresponding to the ctDNA of the test sample as the result corresponding to the ctDNA of the test sample.
  • the category corresponding to the ctDNA of the test sample is a tumor patient or a non-tumor patient.
  • the prediction module includes: a model building module configured to use a support vector machine to establish a relationship model between the known phenotype and at least one of the mutation error spectrum and the CNV characteristics in the sequencing data of people with known phenotype; and a phenotype prediction module for predicting the phenotype of the sample to be tested using the relationship model and at least one of the mutation error spectrum and the CNV characteristics corresponding to the ctDNA of the sample to be tested.
  • the acquisition module includes: a sequencing module for sequencing ctDNA derived from the sample to be tested to obtain original data; and a quality control module for quality control of the original data to obtain sequencing data.
  • the quality control module includes: a deletion unit for deleting at least one of the following kinds of reads from the original data: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20.
  • the comparison module includes: a comparison sub-module for comparing the sequencing data with the reference genome and retaining the reads that align completely to the reference genome as the comparison data.
  • the provided ctDNA-based gene detection method creatively establishes the mutation error spectrum as a parameter and uses it and/or the CNV characteristics to predict the detection result corresponding to the ctDNA of the sample to be tested.
  • This method uses conventional NGS low-depth sequencing data to achieve gene mutation detection of ctDNA without adding any additional experiments and sequencing costs.
  • FIG. 1 is a flowchart of a gene detection method based on ctDNA according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the detection of four bases based on the two-color fluorescent channels of an existing sequencing platform;
  • FIG. 3 is a detailed flowchart of a gene detection method based on ctDNA according to an embodiment of the present invention;
  • FIGS. 4A and 4B are the distribution diagrams before and after GC correction of the sequencing data after removing CNV at the population level;
  • FIG. 5 shows the CNV patterns corresponding to the different phenotypes of non-cancer patients and tumor patients;
  • FIG. 6 is a structural diagram of a ctDNA-based gene detection device according to an embodiment of the present invention;
  • FIG. 7 is a detailed structural diagram of a ctDNA-based gene detection device according to an embodiment of the present invention.
  • cfDNA: cell-free DNA, also called free DNA. Most cfDNA consists of DNA fragments released by the apoptosis of normal cells; the ctDNA of a tumor is also part of cfDNA.
  • ctDNA: circulating tumor DNA. As the name implies, it consists of DNA fragments released into the peripheral blood circulation when cells of a primary tumor, or even of a newly formed metastatic tumor, rupture. Because ctDNA is derived from the tumor, it can be used for detection early in tumorigenesis.
  • the gene detection method based on ctDNA in this application is essentially a method for genetic detection of cfDNA.
  • SNP: single nucleotide polymorphism. It mainly refers to the DNA sequence polymorphism caused by the variation of a single nucleotide at the genome level. It is the most common type of human heritable variation, accounting for more than 90% of all known polymorphisms. SNPs are widespread in the human genome, with an average of 1 in every 500-1000 base pairs.
  • a ctDNA-based gene detection method includes: S10, obtaining the ctDNA sequencing data of the sample to be tested; S30, comparing the sequencing data with the reference genome and retaining the alignment data that meet the preset conditions; S50, analyzing the comparison data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum and the CNV characteristics of the comparison data, where the mutation error spectrum is obtained by classifying the mutation errors and calculating the abundance of each type of mutation error, and a mutation error is a base that is inconsistent with the base of the reference genome.
  • the above method of the present application creatively establishes the mutation error spectrum as a parameter from the sequencing data of ctDNA and uses it together with the CNV characteristics to predict the detection result corresponding to the ctDNA of the sample to be tested, which not only enables genetic testing from routine ctDNA sequencing data but also greatly reduces cost.
  • the above mutation error is a base that is inconsistent with the base of the reference genome.
  • the "base that is inconsistent with the base of the reference genome" refers to a mismatching site other than the SNP sites that are polymorphic in the normal population. Sites that do not belong to the population polymorphisms are called "non-SNP sites" in this application.
  • mutation errors are defined as bases that do not coincide with bases in the reference genome except for polymorphic sites.
  • because existing public databases such as HapMap, dbSNP, and gnomAD have limited sample sizes, and their samples differ from the Chinese population, the number of polymorphic sites included in these databases is too small to meet the actual analysis needs of Chinese population data.
  • This application uses the data of millions of Chinese pregnant women accumulated previously as a reference database to find out the various polymorphic sites of a relatively complete Chinese population. At non-SNP sites, if the base measured on a certain read in the sample is inconsistent with the reference genome, there are only two possibilities: one is the sequencing error of the sequencing platform itself; the other is the rare mutation caused by the tumor.
  • for each of the 4 possible reference bases, the measured base can be any of the other 4 - 1 = 3 bases, and each read can lie on the positive or the negative strand (2 cases).
  • in total there are 4 × 3 × 2 = 24 categories.
  • according to the reference base, the measured base, and the positive or negative strand of the read, the mutation errors can be divided into: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
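The classification into 24 categories and the abundance calculation can be sketched in a few lines of Python. This is only an illustrative sketch: the input layout (one tuple per mismatching base), the function name, and the choice to express abundance as each category's share of all counted mutation errors are assumptions, not details given in the application.

```python
from collections import Counter
from itertools import permutations

BASES = "ATCG"
# 4 reference bases x 3 differing measured bases x 2 strands = 24 categories
CATEGORIES = [f"{ref}>{alt}({strand})"
              for ref, alt in permutations(BASES, 2)
              for strand in "+-"]


def mutation_error_spectrum(observations, polymorphic_sites):
    """Return the relative abundance of each of the 24 mutation error types.

    observations: iterable of (chrom, pos, ref_base, read_base, strand)
                  tuples, one per aligned base (hypothetical layout).
    polymorphic_sites: set of (chrom, pos) known population polymorphisms,
                  which are excluded from the spectrum.
    """
    counts = Counter()
    for chrom, pos, ref, alt, strand in observations:
        if (chrom, pos) in polymorphic_sites:
            continue  # population polymorphism, not a mutation error
        if ref == alt or ref not in BASES or alt not in BASES:
            continue  # matching base, or an uncalled base such as N
        counts[f"{ref}>{alt}({strand})"] += 1
    total = sum(counts.values())
    # Assumption: abundance = share of all counted mutation errors
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}
```

The resulting dictionary has a fixed set of 24 keys, so it can be turned directly into a fixed-length feature vector for a classifier.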
  • in healthy people, the source of mutations in the mutation error spectrum is the sequencing error of the sequencing platform. Because sequencing errors are biased, the mutation error spectrum has certain characteristics. Taking Illumina's NovaSeq sequencing platform as an example, the four bases are detected through two-color fluorescence channels (as shown in Figure 2): if only the green channel emits light, the base is read as T; if only the red channel emits light, it is read as C; if both the red and green channels emit light, it is read as A; if neither channel emits light, it is read as G. In practice, because bubbles can block fluorescence, the probability of a base being miscalled as G is slightly higher. In addition, the error probability of a channel going from on to off (such as A>T) differs from that of a channel going from off to on (such as T>A).
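The two-color decoding rule described above can be written as a small lookup table, which also makes it easy to see which miscalls result when a lit channel is not detected. The function names are hypothetical and the model of a "dropout" (exactly one lit channel failing) is a simplifying assumption for illustration only.

```python
# (red_on, green_on) -> called base, following the rule described above
CHANNEL_TO_BASE = {
    (False, True): "T",   # only green emits light
    (True, False): "C",   # only red emits light
    (True, True): "A",    # both channels emit light
    (False, False): "G",  # neither channel emits light
}
BASE_TO_CHANNEL = {base: signal for signal, base in CHANNEL_TO_BASE.items()}


def call_base(red_on, green_on):
    """Decode one two-channel measurement into a base."""
    return CHANNEL_TO_BASE[(bool(red_on), bool(green_on))]


def single_channel_dropouts(true_base):
    """Bases that would be called if exactly one lit channel went dark."""
    red, green = BASE_TO_CHANNEL[true_base]
    calls = set()
    if red:
        calls.add(call_base(False, green))
    if green:
        calls.add(call_base(red, False))
    return calls
```

Under this simplified model, a dropout turns T or C into G, and turns A into T or C, while G has no lit channel to lose. This is consistent with the observation that blocked fluorescence biases miscalls toward G.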
  • in tumor patients, the mutations in the mutation error spectrum come from both sequencing errors of the sequencing platform and rare mutations caused by the tumor.
  • the mutation of the tumor also has a certain tendency, so there is another characteristic of the mutation error spectrum.
  • the frequency of transition (for example, T>C) mutations is also higher than the frequency of transversion (for example, T>A) mutations.
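A transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); every other substitution is a transversion. The sketch below classifies a substitution and computes a transition/transversion ratio from category counts. The key format ("T>C(+)") follows the category names used in this application; the function names and the ratio itself are illustrative assumptions rather than part of the described method.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-base substitution as a transition or transversion."""
    if ref == alt:
        raise ValueError("reference and measured base are identical")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(category_counts):
    """Ratio of transition to transversion counts.

    category_counts: dict keyed by category names such as "T>C(+)".
    """
    ts = tv = 0
    for key, n in category_counts.items():
        ref, alt = key[0], key[2]  # "T>C(+)" -> ref "T", alt "C"
        if substitution_type(ref, alt) == "transition":
            ts += n
        else:
            tv += n
    return ts / tv if tv else float("inf")
```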
  • tumor mutations are also related to the background of the base. Therefore, the mutation error spectrum of healthy people and cancer patients will be significantly different.
  • the mutation error spectrum of the comparison data is obtained according to step S104 shown in FIG. 3.
  • CNV features are obtained through the steps shown in S105: the reference genome is divided into a series of windows of a predetermined width as the minimum analysis unit; according to the minimum analysis unit, a hidden Markov model is used to remove the population-level CNV in the sequencing data to obtain the first data set; GC correction is performed on the first data set to obtain the second data set; the germline CNV is removed from the second data set to obtain the third data set; and principal component analysis is used to reduce the dimensionality of the third data set and extract the CNV features.
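Two of the steps above, window division and GC correction, can be sketched as follows. The application does not say how GC correction is computed, so the approach here (scaling each window count by the global median divided by the median of windows with similar GC content) is an assumption chosen because it is a common and simple scheme; the function names, the 1% GC bin size, and the default window width and step are likewise illustrative.

```python
from statistics import median


def sliding_windows(chrom_length, width=100_000, step=50_000):
    """Yield half-open (start, end) windows along one chromosome."""
    start = 0
    while start < chrom_length:
        yield start, min(start + width, chrom_length)
        start += step


def gc_correct(counts, gc_fractions, bin_size=0.01):
    """Rescale per-window read counts so that GC content no longer
    shifts the typical count (assumed median-ratio scheme).

    counts: read count per window.
    gc_fractions: GC fraction (0..1) of each window, same order.
    """
    global_median = median(counts)
    bins = {}
    for count, gc in zip(counts, gc_fractions):
        bins.setdefault(round(gc / bin_size), []).append(count)
    corrected = []
    for count, gc in zip(counts, gc_fractions):
        bin_median = median(bins[round(gc / bin_size)])
        corrected.append(count * global_median / bin_median if bin_median else 0.0)
    return corrected
```

After correction, windows that differ only in GC content end up with comparable values, so remaining differences are more likely to reflect copy number.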
  • the specific steps for obtaining CNV features are as follows: 1) Divide the reference genome into a series of windows of predetermined width as the minimum analysis unit, finally selecting a window width of 100 kb and a step size of 50 kb; 2) Use a hidden Markov model to remove the windows containing population-level CNV in the sequencing data; 3) As shown in FIGS.
  • PCA: principal component analysis.
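As a rough illustration of dimensionality reduction by PCA, the sketch below extracts only the first principal component of a samples × windows matrix using power iteration in plain Python. It is a teaching sketch under stated assumptions: a real analysis would normally keep several components and use a numerical library, and the function name, the starting vector, and the iteration count are all hypothetical choices.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (component, scores) for the first principal component.

    rows: list of samples, each a list of per-window values.
    component: unit vector over windows; scores: one value per sample.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Deterministic, asymmetric start vector; power iteration can stall
    # if the start is exactly orthogonal to the leading component.
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        # w = X^T (X v), proportional to the covariance matrix times v
        proj = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left in the direction of v
        v = [x / norm for x in w]

    scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
    return v, scores
```

Each sample's score along the component is a single number summarising its genome-wide copy number profile, which is the kind of compact feature a classifier can consume.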
  • cfDNA refers to free DNA in plasma and ctDNA is free DNA secreted from tumor cells in plasma. For tumor patients, their cfDNA contains a small portion of ctDNA. In actual experiments, cfDNA can only be extracted for sequencing, and there is no way to sequence ctDNA alone.
  • gDNA refers to the DNA of white blood cells. The specific operation is to collect a blood sample, centrifuge it to separate plasma from blood cells, and then separately extract cfDNA from the plasma and gDNA from the leukocytes for sequencing.
  • CNV: copy number variation.
  • germline CNV refers to CNV that an individual is born with and that persists after differentiation, such as the CNV carried by leukocytes.
  • the CNV of healthy people and cancer patients is significantly different.
  • for healthy people, after removing possible germline CNV, there should be no other copy number abnormalities in the plasma.
  • for tumor patients, in addition to germline CNV, the plasma also carries many tumor-related copy number abnormalities, and this type of CNV has certain cancer characteristics (as shown in Figure 5, which shows the CNV patterns corresponding to the different phenotypes of non-cancer patients and tumor patients).
  • Image source: Qiu Z W, Bi J H, Gazdar A F, et al. Genes Chromosomes & Cancer, 2017, 56(7): 559.
  • determining the result corresponding to the ctDNA according to the parameters includes: predicting the parameters corresponding to the ctDNA of the test sample, based on the parameters corresponding to previously obtained ctDNA of known categories, to obtain a prediction result; and determining, according to the prediction result, the category corresponding to the ctDNA of the test sample as the result for the ctDNA of the test sample.
  • the category corresponding to the ctDNA of the sample to be tested is a tumor patient or a non-tumor patient.
  • predicting the parameters corresponding to the ctDNA of the test sample based on the parameters corresponding to the ctDNA of the known class obtained in advance includes: using a support vector machine to establish a relationship model between the known phenotype and at least one of the mutation error spectrum and the CNV characteristics in the ctDNA sequencing data of people with known phenotypes; and using the relationship model and at least one of the mutation error spectrum and the CNV characteristics corresponding to the ctDNA of the test sample to predict the phenotype of the sample to be tested.
  • the above-mentioned populations with known phenotypes are healthy people and patients with stage I to IV tumors.
  • data from a number of healthy people and patients with stage I to IV tumors (i.e., whole-genome sequencing data) are collected.
  • two-thirds of the sample data are selected as the training set.
  • the sample data contain the mutation error spectra and/or CNV features, and the classification labels are the phenotypic information.
  • the support vector machine model is used to predict the phenotype corresponding to the mutation error spectrum and CNV features in the sequencing data of the sample to be tested.
  • either the mutation error spectrum or the CNV feature alone can also predict the phenotype of the sample to be tested, but prediction performance is best when the mutation error spectrum and the CNV feature are related to the phenotype together.
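The training procedure described above (two-thirds of the labelled samples for training, a support vector machine relating features to phenotype) can be sketched in plain Python. The application does not state the kernel, the optimiser, or any hyperparameters, so this sketch assumes a linear SVM trained by sub-gradient descent on the regularised hinge loss; the function names, the learning rate, the regularisation strength, and the label convention (+1 tumor, -1 non-tumor) are all hypothetical.

```python
import random


def split_two_thirds(n_samples, seed=0):
    """Shuffle sample indices and use two-thirds of them for training."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = (2 * n_samples) // 3
    return indices[:cut], indices[cut:]


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights and bias by sub-gradient descent on the hinge loss.

    X: feature vectors (e.g. the 24 spectrum values plus CNV features).
    y: labels, +1 or -1.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: move toward the sample, with shrinkage
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: apply only the regularisation shrinkage
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b


def predict(model, x):
    """Return +1 or -1 for one feature vector."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In use, the indices returned by `split_two_thirds` would select the training rows passed to `train_linear_svm`, and the held-out third would be passed through `predict` to estimate performance.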
  • the step of obtaining the ctDNA sequencing data of the sample to be tested may be an existing step.
  • obtaining ctDNA sequencing data includes: sequencing ctDNA derived from the sample to be tested to obtain original data; and performing quality control on the original data to obtain sequencing data.
  • the specific quality control method is similar to the existing original data quality control method, and both include the step of filtering the original data to obtain sequencing data. That is, from raw data to clean data.
  • performing quality control on the original data to obtain sequencing data includes: deleting at least one of the following kinds of reads from the original data: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20.
  • the low quality here has the same meaning as the low quality in the field of conventional high-throughput sequencing, and broadly refers to data that cannot perform effective data processing or obviously has an adverse effect on the processing results.
  • the base N means that there will be undetectable bases in the original sequencing data, which is represented by N.
  • a variety of existing software can report the sequencing quality of each base, so it is easy to screen out reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20.
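The two per-read filters (uncalled bases and a 5-nucleotide sliding quality window) are simple enough to sketch directly. The function names and parameter names are hypothetical, the quality string is assumed to use Phred+33 encoding, and `max_n=1` reflects a literal reading of "more than one base N"; it can be set to 0 if reads with any N should be dropped. Removal of PCR duplicates is not covered here.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_quality_control(sequence, qualities, window=5,
                           min_mean_quality=20, max_n=1):
    """Return True if a read survives the N-count and quality filters.

    sequence: read bases as a string.
    qualities: per-base Phred scores, same length as sequence.
    """
    if sequence.count("N") > max_n:
        return False
    # Reject the read if any run of `window` consecutive bases has a
    # mean quality below the threshold.
    for i in range(len(qualities) - window + 1):
        if sum(qualities[i:i + window]) / window < min_mean_quality:
            return False
    return True
```

A read filter over a FASTQ file would call `passes_quality_control(seq, phred33(qual))` for each record and keep only the reads for which it returns True.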
  • comparing the sequencing data with the reference genome and retaining the alignment data that meet the preset conditions includes: comparing the sequencing data with the reference genome and retaining the reads that align completely to the reference genome as the alignment data.
  • the amount of comparison data used for subsequent analysis is not specifically limited and can be set reasonably according to the sample source. Preferably there are at least 4M reads.
  • extracting and sequencing ctDNA of the sample to be tested comprises: extracting ctDNA of the sample to be tested and performing whole-genome low-depth sequencing.
  • the low-depth sequencing here can make the target coverage range from 0.1x to 0.5x.
  • the low-depth sequencing mentioned in this application refers to the coverage of the entire sample from 0.1x to 0.5x.
  • the coverage of 2 or 3 refers to the depth of some of the sites. For example, there are 3 billion sites in a sample, some sites have a depth of 0, some sites have a depth of 1, some sites have a depth of 2, other sites may have similar differences in depth, but the average
  • the depth of the overall sample is 0.1x to 0.5x.
  • The method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform; it can also be implemented in hardware, but in many cases the former is the better implementation.
  • The technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a computing device or a processor to perform the method described in each embodiment of the present invention.
  • A storage medium stores computer-executable program code. When the program code is executed by one or more processors of a computer system, the computer system performs a ctDNA-based gene detection method. The program code includes: code for obtaining the ctDNA sequencing data of the sample to be tested; code for aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset conditions; and code for analyzing at least one of the following parameters of the alignment data and determining the gene result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base, outside polymorphic sites, that differs from the base of the reference genome.
  • The mutation error spectrum is obtained by classifying mutation errors according to the following information and calculating the abundance of each class: reference base, measured base, positive strand, negative strand, and background (sequence context).
  • According to this information, mutation errors are classified into classes that include at least: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
  • The CNV features are obtained by executing the following code: code for dividing the reference genome into a series of windows of predetermined width as the minimum analysis unit; code for using a hidden Markov model, per minimum analysis unit, to remove population-level CNVs from the alignment data, giving a first data set; code for applying GC correction to the first data set, giving a second data set; code for removing the interference of germline CNVs from the second data set, giving a third data set; and code for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  • The code for determining the result corresponding to the ctDNA according to the parameters includes: code for making a prediction from the corresponding parameters in the ctDNA sequencing data of the sample to be tested, using the parameters of previously obtained sequencing data of known classes as the basis, to obtain a prediction result; and code for determining, from the prediction result, the class to which the ctDNA of the sample belongs, as the result corresponding to the ctDNA of the sample.
  • The class corresponding to the ctDNA of the sample to be tested is tumor patient or non-tumor patient.
  • The code for making the prediction includes: code for using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype; and code for using the relationship model and at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample.
  • The code for obtaining the ctDNA sequencing data of the sample to be tested includes: code for extracting ctDNA from the sample and sequencing it to obtain raw data; and code for performing quality control on the raw data to obtain the sequencing data.
  • The code for performing quality control on the raw data includes code for deleting at least one of the following kinds of reads: reads of duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
  • The code for comparing the sequencing data with the reference genome and retaining the alignment data that meet the preset conditions includes: code for aligning the sequencing data to the reference genome and retaining, as alignment data, the reads that align completely to the reference genome.
  • The code for extracting and sequencing the ctDNA of the sample to be tested includes: code for extracting ctDNA from the sample and performing low-depth whole-genome sequencing.
  • A computer system includes a processor, system memory, and one or more computer-readable storage media.
  • The storage media store computer-executable instructions, and each storage medium is any one of the storage media described above.
  • A ctDNA-based gene detection device stores or runs modules, or the modules are an integral part of the device; the modules are one or more software modules that perform any of the above gene detection methods.
  • The above device includes an acquisition module 20, a comparison module 40 and an analysis determination module 60.
  • The acquisition module 20 obtains the ctDNA sequencing data of the sample to be tested.
  • The comparison module 40 aligns the sequencing data to the reference genome and retains the alignment data that meet the preset conditions.
  • The analysis determination module 60 analyzes the alignment data according to at least one of the following parameters and determines the result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base, outside polymorphic sites, that differs from the base of the reference genome.
  • The ctDNA-based gene detection device provided by the above embodiments creatively uses the mutation error spectrum of the ctDNA sequencing data as a parameter and uses the mutation error spectrum and/or the CNV features to predict the detection result for the ctDNA of the sample to be tested. This both enables gene detection on conventional ctDNA sequencing data and greatly reduces cost.
  • The attribute values can include the type of the input file and whether variant data of known phenotype and healthy-control data are present.
  • The acquisition module 20 in the above embodiment may further include a quality control module 202, and the analysis determination module may further include a mutation error spectrum module 501, a CNV feature module 502, a model establishment module 601, and a phenotype prediction module 602.
  • The above embodiments may further include a control module 101, which controls input and output, obtains file and parameter attribute values, controls the calling of the other modules, and determines the design of the gene detection workflow.
  • The control module 101 controls the calling of the other modules and determines the gene detection workflow as follows: it decides whether to call the quality control module, whether to generate a training data set, and which analysis determination module to select.
  • The control module 101 controls the design and execution of the entire gene detection workflow. First, it checks the attribute value of the input file: if the input is raw sequencing data, it calls the quality control module 202; otherwise it calls the comparison module 40. Second, when variant data of known phenotype and healthy-control data are input, it calls the mutation error spectrum module 501 and/or the CNV feature module 502 to generate a training data set. Third, it selects the gene detection method based on the ctDNA alignment data of the sample to be tested. Finally, it calls the corresponding model establishment module 601 and phenotype prediction module 602.
  • The analysis determination module 60 further includes a mutation error spectrum module 501 for computing the mutation error spectrum.
  • The mutation error spectrum module may further include an information classification module and an abundance calculation module. The information classification module classifies mutation errors according to the following information: reference base, measured base, positive strand, negative strand, and background (sequence context).
  • The mutation error spectrum module classifies the mutation errors into classes that include at least: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
  • The abundance calculation module calculates the abundance of each class of mutation error after this classification, giving the mutation error spectrum.
  • The analysis determination module 60 further includes a CNV feature module 502 for extracting CNV features.
  • The CNV feature module 502 may further include: a window division submodule, which divides the reference genome into a series of windows of predetermined width as the minimum analysis unit; a first correction submodule, which uses a hidden Markov model, per minimum analysis unit, to remove population-level CNVs from the alignment data, giving a first data set; a second correction submodule, which applies GC correction to the first data set, giving a second data set; a third correction submodule, which removes the interference of germline CNVs from the second data set, giving a third data set; and a CNV extraction submodule, which reduces the dimensionality of the third data set by principal component analysis and extracts the CNV features.
  • The analysis determination module 60 may further include a prediction module and a determination module. The prediction module makes a prediction from the corresponding parameters in the ctDNA sequencing data of the sample to be tested, using the parameters of previously obtained sequencing data of known classes as the basis, to obtain a prediction result. The determination module determines, from the prediction result, the class to which the ctDNA of the sample belongs, as the result for the ctDNA of the sample.
  • The class corresponding to the ctDNA of the sample to be tested is tumor patient or non-tumor patient.
  • The prediction module includes a model establishment module 601 and a phenotype prediction module 602.
  • The model establishment module uses a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype. The phenotype prediction module uses the relationship model and at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample.
  • The acquisition module 20 includes a sequencing submodule 201 and a quality control submodule 202.
  • The sequencing submodule 201 sequences the ctDNA derived from the sample to be tested to obtain raw data; the quality control submodule 202 performs quality control on the raw data to obtain the sequencing data.
  • The quality control module 202 includes a deletion unit for deleting at least one of the following kinds of reads from the raw data: reads of duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
  • The comparison module 40 includes a comparison submodule, which aligns the sequencing data to the reference genome and retains, as alignment data, the reads that align completely to the reference genome.
  • The ctDNA-based gene detection device proposed in this application has multiple built-in functional modules. Its control module can automatically design the most suitable gene detection workflow for each data type and automatically call and integrate the corresponding modules, so that gene detection is performed efficiently.
  • The detection method and device are rigorous, comprehensive, and easy to operate.
  • The above storage media, computer systems and devices allow a computer to perform the above ctDNA-based gene detection methods and output the corresponding detection results. These products achieve gene detection of ctDNA without adding any extra experimental or sequencing cost, and the device has low detection cost and high accuracy.
  • The disclosed technical content may be implemented in other ways.
  • The device embodiments described above are only schematic.
  • The division into units is only a division by logical function.
  • Other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
  • Units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • The functional units in each embodiment of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • The integrated unit can be implemented in the form of hardware or of a software functional unit.
  • If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • The technical solution of the present invention, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied as a software product. The software product is stored in a storage medium and includes several instructions that cause a computing device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.
  • This example selected patients clinically diagnosed with breast cancer or lung cancer at different stages (Stage I to Stage IV), excluding patients who had undergone surgery, chemotherapy or radiotherapy, and tested healthy people and patients with benign tumors or precancerous lesions as controls.
  • The specific method follows the detailed flowchart shown in FIG. 2.
  • The improved scheme of this application has a wide range of application and is not restricted by tumor type. It is safe and convenient, can detect tumors at an early stage, and has higher detection accuracy.
  • Among traditional tumor screening methods, imaging-based methods such as PET-CT detect tissue-level changes, which occur later than DNA-level changes, so early detection is difficult and the sensitivity is below 60%.
  • The other class of methods, based on serum molecular markers, still has high false-negative and false-positive rates in tumor screening, so early screening is essentially not achievable.
  • The present application can be implemented by software plus a necessary general-purpose hardware platform. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments, or of parts of the embodiments, of the present application.
  • This application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • The modules or steps of the present invention can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

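A ctDNA-based gene detection method, device, storage medium and computer system. The detection method includes: obtaining sequencing data of the ctDNA of a sample to be tested; aligning the sequencing data to a reference genome and retaining the alignment data that meet preset conditions; and analyzing the alignment data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base that differs from the base of the reference genome. Without adding any extra experimental or sequencing cost, the method uses conventional low-depth NGS data to detect gene mutations in ctDNA.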

Description

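ctDNA-based gene detection method, device, storage medium and computer system
Technical Field
The present invention relates to the field of gene detection, and in particular to a ctDNA-based gene detection method, device, storage medium and computer system.
Background
Existing cancer screening approaches fall into two classes: traditional screening methods, and ctDNA-based early screening methods.
Most of the relatively mature clinical screening approaches, such as serological tumor markers, PET-CT and needle biopsy, either fail to detect early cancerous changes or detect them with insufficient accuracy. The clinic lacks a method that effectively detects early-stage cancer.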
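1. Imaging or cytological screening methods
The more mature examples include mammography (molybdenum-target X-ray imaging of the breast) for breast cancer screening and cytological methods such as the Pap smear for cervical cancer screening.
The limitations of these methods are: 1) early cancerous changes are hard to detect, because cell- and tissue-level changes occur later than DNA-level changes; 2) each method targets only a specific cancer type and is not general; 3) detection sensitivity is insufficient, at less than 60%.
2. Serological tumor marker screening
Biochemical assays test whether the serum concentration of markers associated with tumorigenesis is elevated.
The limitations of these methods are: 1) neither sensitivity nor specificity is satisfactory: on the one hand, factors unrelated to cancer can also raise the level of a marker; on the other hand, a person with a given cancer does not necessarily show an elevated concentration of the marker associated with that cancer; 2) a marker targets only one or a few cancers, there is no universal cancer marker, and not every cancer has a corresponding marker.
3. Needle biopsy
A tissue sample is taken by needle puncture for pathological or genetic testing.
The limitations of this method are: 1) sampling is troublesome and public acceptance is low; 2) it is invasive and carries some risk.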
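ctDNA-based early screening methods are expected to address the shortcomings of these traditional methods. Because cancer is essentially a genetic disease, DNA-level variation arises earlier than cell- and tissue-level variation. Because early tumors release ctDNA, detecting ctDNA makes early cancer screening possible. Two such methods currently exist:
1. Early tumor screening by low-frequency mutation detection
Cancer-related gene regions are captured with a panel and sequenced at ultra-high depth (tens of thousands x) to detect ultra-low-frequency mutations originating from ctDNA, from which it is inferred whether early cancerous change has occurred.
The limitations of this method are: 1) because the mutation frequency is low, a custom panel is needed to capture specific genes for ultra-high-depth sequencing, which increases the required blood volume; 2) capture panels are expensive and ultra-high-depth sequencing is costly, so the screening is too expensive for routine population screening; 3) accuracy is unsatisfactory and does not meet the needs of large-scale clinical use.
2. Early tumor screening by methylation detection
Each tissue in the body has its own characteristic methylation pattern, so this property can be used to locate the tissue of origin. In this newer method, the tissue is located by screening CpG methylation haplotype tags.
The limitations of this method are: 1) the experimental workflow of methylation sequencing is complex and the barrier to operation is high; 2) methylation sequencing is relatively expensive; 3) pre-sequencing treatment with bisulfite or with methylation-sensitive restriction enzymes may reduce accuracy.
No corresponding solution has yet been proposed for these problems in the prior art.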
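Summary of the Invention
Embodiments of the present invention provide a ctDNA-based gene detection method, device, storage medium and computer system, to solve the problem of the high cost of ctDNA gene detection in the prior art.
To achieve the above object, according to one aspect of the present application, a ctDNA-based gene detection method is provided. The detection method includes: obtaining sequencing data of the ctDNA of a sample to be tested; aligning the sequencing data to a reference genome and retaining the alignment data that meet preset conditions; and analyzing the alignment data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base, outside polymorphic sites, that differs from the base of the reference genome.
Further, the mutation error spectrum is obtained by classifying mutation errors according to the following information and calculating the abundance of each class: reference base, measured base, positive strand, negative strand, and background (sequence context).
Further, according to this information the mutation errors are classified into classes that include at least: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).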
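Further, the CNV features are obtained by the following steps: dividing the reference genome into a series of windows of predetermined width as the minimum analysis unit; using a hidden Markov model, per minimum analysis unit, to remove population-level CNVs from the alignment data, giving a first data set; applying GC correction to the first data set, giving a second data set; removing the interference of germline CNVs from the second data set, giving a third data set; and reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
Further, determining the result corresponding to the ctDNA according to the parameters includes: making a prediction from the corresponding parameters in the sequencing data of the ctDNA of the sample to be tested, using the parameters of previously obtained sequencing data of known classes as the basis, to obtain a prediction result; and determining, from the prediction result, the class to which the ctDNA of the sample belongs, as the result corresponding to the ctDNA of the sample.
Further, the class corresponding to the ctDNA of the sample to be tested is tumor patient or non-tumor patient.
Further, making the prediction from the parameters of the sequencing data of the ctDNA of the sample to be tested, on the basis of the parameters of previously obtained sequencing data of known classes, includes: using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype; and using the relationship model and at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample.
Further, obtaining the sequencing data of the ctDNA includes: sequencing ctDNA derived from the sample to be tested to obtain raw data; and performing quality control on the raw data to obtain the sequencing data.
Further, performing quality control on the raw data to obtain the sequencing data includes deleting at least one of the following kinds of reads from the raw data: reads of duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
Further, aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset conditions includes: aligning the sequencing data to the reference genome and retaining, as alignment data, the reads that align completely to the reference genome.
Further, sequencing the ctDNA derived from the sample to be tested includes: extracting ctDNA from the sample and performing low-depth whole-genome sequencing.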
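According to a second aspect of the present application, a storage medium is provided, on which computer-executable program code is stored. When the program code is executed by one or more processors of a computer system, the computer system performs a ctDNA-based gene detection method. The computer-executable program code includes: code for obtaining sequencing data of the ctDNA of a sample to be tested; code for aligning the sequencing data to a reference genome and retaining the alignment data that meet preset conditions; and code for analyzing at least one of the following parameters of the alignment data and determining the gene result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base, outside polymorphic sites, that differs from the base of the reference genome.
Further, the mutation error spectrum is obtained by classifying mutation errors according to the following information and calculating the abundance of each class: reference base, measured base, positive strand, negative strand, and background (sequence context).
Further, according to this information the mutation errors are classified into classes that include at least: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
Further, the CNV features are obtained by executing the following code: code for dividing the reference genome into a series of windows of predetermined width as the minimum analysis unit; code for using a hidden Markov model, per minimum analysis unit, to remove population-level CNVs from the alignment data, giving a first data set; code for applying GC correction to the first data set, giving a second data set; code for removing the interference of germline CNVs from the second data set, giving a third data set; and code for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
Further, the code for determining the result corresponding to the ctDNA according to the parameters includes: code for making a prediction from the corresponding parameters in the sequencing data of the ctDNA of the sample to be tested, using the parameters of previously obtained sequencing data of known classes as the basis, to obtain a prediction result; and code for determining, from the prediction result, the class to which the ctDNA of the sample belongs, as the result corresponding to the ctDNA of the sample.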
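Further, the class corresponding to the ctDNA of the sample to be tested is tumor patient or non-tumor patient.
Further, the code for making the prediction includes: code for using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype; and code for using the relationship model and at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample.
Further, the code for obtaining the sequencing data of the ctDNA of the sample to be tested includes: code for sequencing ctDNA derived from the sample to obtain raw data; and code for performing quality control on the raw data to obtain the sequencing data.
Further, the code for performing quality control on the raw data includes code for deleting at least one of the following kinds of reads from the raw data: reads of duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
Further, the code for aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset conditions includes: code for aligning the sequencing data to the reference genome and retaining, as alignment data, the reads that align completely to the reference genome.
Further, the code for sequencing ctDNA derived from the sample to be tested to obtain raw data includes: code for extracting ctDNA from the sample and performing low-depth whole-genome sequencing.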
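According to a third aspect of the present application, a computer system is provided, including a processor, system memory and one or more computer-readable storage media; the storage media store computer-executable instructions, and each storage medium is any one of the storage media described above.
According to a fourth aspect of the present application, a ctDNA-based gene detection device is provided. The device includes: an acquisition module for obtaining sequencing data of the ctDNA of a sample to be tested; a comparison module for aligning the sequencing data to a reference genome and retaining the alignment data that meet preset conditions; and an analysis determination module for analyzing the alignment data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base, outside polymorphic sites, that differs from the base of the reference genome.
Further, the analysis determination module also includes a mutation error spectrum module, which obtains the mutation error spectrum by classifying mutation errors according to the following information and calculating the abundance of each class: reference base, measured base, positive strand, negative strand, and background (sequence context).
Further, according to this information the mutation error spectrum module classifies the mutation errors into classes that include at least: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-).
Further, the analysis determination module also includes a CNV feature extraction module for extracting CNV features. The CNV feature extraction module includes: a window division submodule, which divides the reference genome into a series of windows of predetermined width as the minimum analysis unit; a first correction submodule, which uses a hidden Markov model, per minimum analysis unit, to remove population-level CNVs from the alignment data, giving a first data set; a second correction submodule, which applies GC correction to the first data set, giving a second data set; a third correction submodule, which removes the interference of germline CNVs from the second data set, giving a third data set; and a CNV extraction submodule, which reduces the dimensionality of the third data set by principal component analysis and extracts the CNV features.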
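Further, the analysis determination module includes: a prediction module, which makes a prediction from the corresponding parameters in the sequencing data of the ctDNA of the sample to be tested, using the parameters of previously obtained sequencing data of known classes as the basis, to obtain a prediction result; and a determination module, which determines, from the prediction result, the class to which the ctDNA of the sample belongs, as the result corresponding to the ctDNA of the sample.
Further, the class corresponding to the ctDNA of the sample to be tested is tumor patient or non-tumor patient.
Further, the prediction module includes: a model establishment module, which uses a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype; and a phenotype prediction module, which uses the relationship model and at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample.
Further, the acquisition module includes: a sequencing module, which sequences ctDNA derived from the sample to be tested to obtain raw data; and a quality control module, which performs quality control on the raw data to obtain the sequencing data.
Further, the quality control module includes a deletion unit for deleting at least one of the following kinds of reads from the raw data: reads of duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
Further, the comparison module includes a comparison submodule, which aligns the sequencing data to the reference genome and retains, as alignment data, the reads that align completely to the reference genome.
In the embodiments of the present invention, the ctDNA-based gene detection method creatively establishes the mutation error spectrum as a parameter and uses it and/or the CNV features to predict the detection result for the ctDNA of the sample to be tested. Without adding any extra experimental or sequencing cost, the method uses conventional low-depth NGS data to detect gene mutations in ctDNA.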
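Brief Description of the Drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a ctDNA-based gene detection method according to an embodiment of the present invention;
FIG. 2 is a schematic of how an existing sequencing platform detects the four bases using two-color fluorescence channels;
FIG. 3 is a detailed flowchart of a ctDNA-based gene detection method according to an embodiment of the present invention;
FIG. 4A and FIG. 4B show the distribution of the sequencing data, after removal of population-level CNVs, before and after GC correction respectively;
FIG. 5 shows the CNV patterns corresponding to the different phenotypes of non-cancer and tumor patients;
FIG. 6 is a structural diagram of a ctDNA-based gene detection device according to an embodiment of the present invention;
FIG. 7 is a detailed structural diagram of a ctDNA-based gene detection device according to an embodiment of the present invention.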
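Detailed Description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Note that the terms "include" and "have", and any variants of them, in the description, claims and drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
Explanation of terms:
cfDNA: cell-free DNA. Most of it consists of DNA fragments from the apoptosis of normal cells; tumor ctDNA is also part of cfDNA.
ctDNA: circulating tumor DNA. As the name suggests, these are DNA fragments shed when cells of the primary tumor, or even of new tumors formed by metastasis, rupture, and that have entered the peripheral blood circulation. Because ctDNA originates from the primary tumor, it allows detection at an early stage of tumor development.
In an actual experiment, only cfDNA can be extracted and sequenced; ctDNA cannot be sequenced on its own. The ctDNA-based gene detection method of this application is therefore, in essence, a method of gene detection on cfDNA.
SNP: DNA sequence polymorphism at the genome level caused by variation of a single nucleotide. It is the most common kind of heritable human variation, accounting for more than 90% of all known polymorphisms. SNPs are widespread in the human genome, with on average 1 in every 500 to 1000 base pairs.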
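Because ctDNA-based gene detection methods in the prior art are costly, a preferred embodiment of this application provides a ctDNA-based gene detection method to improve on this situation. As shown in FIG. 1, the method includes: S10, obtaining sequencing data of the ctDNA of a sample to be tested; S30, aligning the sequencing data to a reference genome and retaining the alignment data that meet preset conditions; S50, analyzing the alignment data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum and the CNV features of the alignment data, where the mutation error spectrum is obtained by classifying mutation errors and calculating the abundance of each class, and a mutation error is a base that differs from the base of the reference genome.
The above method creatively establishes the mutation error spectrum as a parameter of the ctDNA sequencing data and uses it together with the CNV features to predict the detection result for the ctDNA of the sample to be tested. This both enables gene detection on conventional ctDNA sequencing data and greatly reduces cost.
A mutation error is a base that differs from the base of the reference genome. Here, "a base that differs from the base of the reference genome" means a differing site other than the normal population-polymorphic SNP sites. This application calls such sites, which are not population polymorphisms, "non-SNP sites".
This application defines a mutation error as a base, outside polymorphic sites, that differs from the base of the reference genome. The reason is that existing public databases such as HapMap, dbSNP and gnomAD have limited sample sizes, and their samples differ in population background from the Chinese population, so they contain too few polymorphic sites and do not meet the practical analysis needs of Chinese population data. This application uses previously accumulated data from several million Chinese pregnant women as the reference database, from which a relatively complete set of polymorphic sites of the Chinese population is identified. At a non-SNP site, if the base measured on a read differs from the reference genome, there are only two possibilities: a sequencing error of the platform itself, or a rare mutation introduced by a tumor.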
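The definition above can be sketched in code. This is a minimal illustration rather than the patent's implementation: it assumes an ungapped alignment, 0-based coordinates, and a hypothetical set `polymorphic_sites` standing in for the population polymorphism database.

```python
def find_mutation_errors(read_start, read_seq, reference, polymorphic_sites):
    """Return (position, ref_base, read_base) for mismatches outside polymorphic sites.

    Assumes an ungapped alignment: read base i sits at reference position
    read_start + i (0-based). Uncalled bases (N) are skipped.
    """
    errors = []
    for offset, read_base in enumerate(read_seq):
        pos = read_start + offset
        ref_base = reference[pos]
        if read_base == "N" or read_base == ref_base:
            continue  # not a mismatch
        if pos in polymorphic_sites:
            continue  # known population polymorphism, not a mutation error
        errors.append((pos, ref_base, read_base))
    return errors


reference = "ACGTACGTAC"
known_snps = {5}  # hypothetical polymorphic site
errors = find_mutation_errors(2, "GTTAGT", reference, known_snps)
print(errors)
```

In this toy read, the mismatch at position 5 is ignored because that site is listed as polymorphic, while the mismatch at position 4 is kept as a mutation error.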
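Viewed at the single base position of a mutation error, the reference base has 4 possibilities, the measured base has (4 - 1) = 3 possibilities, and each read lies on either the positive or the negative strand, giving 4*3*2 = 24 classes. As shown in FIG. 2, by reference base, measured base and read strand, the classes are: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+) and G>T(-), 24 in total.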
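The 4*3*2 count can be checked by enumerating the classes directly. The sketch below only reproduces the labels; the function name is illustrative and the enumeration order is not meant to match the order of the list in the text.

```python
from itertools import product

BASES = "ATCG"
STRANDS = "+-"


def mutation_error_classes():
    """Enumerate ref>alt(strand) labels for every pair of distinct bases."""
    return [
        f"{ref}>{alt}({strand})"
        for ref, alt in product(BASES, repeat=2)
        if ref != alt
        for strand in STRANDS
    ]


classes = mutation_error_classes()
print(len(classes))
```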
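Taking the context of the erroneous base into account as well, that is, considering the adjacent 2 to 3 bases together, a class such as A>T(+) can be subdivided further according to its context, and likewise for the other classes.
For a healthy person, the mutations in the mutation error spectrum come from the sequencing errors of the platform. Because sequencing errors have certain biases, the spectrum shows characteristic features. Taking Illumina's NovaSeq platform as an example, the 4 bases are detected through two-color fluorescence channels (as shown in FIG. 2). If only the green channel lights up, the base is called T; if only the red channel lights up, it is called C; if both channels light up, it is called A; if neither lights up, it is called G. In practice, because bubbles can block fluorescence, the probability of a miscall as G is slightly higher. In addition, errors in which a channel goes from on to off (for example A>T) and errors in which a channel goes from off to on (for example T>A) have different probabilities.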
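The two-channel calling rule described above amounts to a four-entry lookup table. The sketch below encodes only that rule, with channel states as booleans; the function name is illustrative and no real base-calling software is being modelled.

```python
# (green_on, red_on) -> called base, following the rule stated in the text
TWO_CHANNEL_CALLS = {
    (True, False): "T",   # green only
    (False, True): "C",   # red only
    (True, True): "A",    # both channels
    (False, False): "G",  # no signal
}


def call_base(green_on, red_on):
    """Return the base called for a given pair of channel states."""
    return TWO_CHANNEL_CALLS[(green_on, red_on)]


# A true A (both channels on) whose red signal is lost reads as T,
# which is the on-to-off error A>T mentioned in the text.
print(call_base(True, True), call_base(True, False))
```

The table also shows why blocked fluorescence biases calls toward G: any base whose signal is lost in both channels falls into the "no signal" entry.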
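For a tumor patient, the mutations in the mutation error spectrum come from both the sequencing errors of the platform and rare mutations introduced by the tumor. Tumor mutations also have certain biases, so the spectrum shows a different signature. Transition mutations (for example T>C) are more frequent than transversion mutations (for example T>A). Tumor mutations are also related to the base context. The mutation error spectra of healthy people and tumor patients therefore differ significantly.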
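The transition/transversion distinction used above follows the standard definition: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any other substitution is a transversion. A small helper, with an illustrative name, makes this concrete.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    if ref == alt or ref not in "ACGT" or alt not in "ACGT":
        raise ValueError("expected two different bases from A, C, G, T")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


print(substitution_type("T", "C"), substitution_type("T", "A"))
```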
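Accordingly, using the reference base, measured base, positive strand, negative strand and background (sequence context) together, the errors are divided into several dozen classes and the abundance of each class is calculated, giving the mutation error spectrum. In some preferred embodiments, the mutation error spectrum of the alignment data is obtained according to step S104 shown in FIG. 3.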
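A minimal sketch of the spectrum calculation follows. The text does not define "abundance" precisely; this example assumes it is the fraction of all observed mutation errors that fall into each class, and it ignores the sequence context for brevity. Both choices are assumptions made for illustration.

```python
from collections import Counter


def mutation_error_spectrum(errors):
    """Map each class label to its share of all observed mutation errors.

    errors: iterable of (ref_base, read_base, strand) tuples.
    """
    counts = Counter(f"{ref}>{alt}({strand})" for ref, alt, strand in errors)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


observed = [("A", "T", "+"), ("A", "T", "+"), ("C", "T", "-"), ("G", "A", "+")]
spectrum = mutation_error_spectrum(observed)
print(spectrum)
```

To include context, the label could be extended with the neighbouring bases, which multiplies the number of classes as described in the text.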
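In some preferred embodiments, as shown in FIG. 3, the CNV features are obtained by the steps shown in S105: dividing the reference genome into a series of windows of predetermined width as the minimum analysis unit; using a hidden Markov model, per minimum analysis unit, to remove population-level CNVs from the sequencing data, giving a first data set; applying GC correction to the first data set, giving a second data set; removing the germline CNVs from the second data set, giving a third data set; and reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
In the above preferred embodiment, the specific steps for obtaining the CNV features are: 1) divide the reference genome into a series of windows of predetermined width as the minimum analysis unit; a window width of 100 kb and a step of 50 kb were finally chosen; 2) use a hidden Markov model to remove the windows of the sequencing data that contain population-level CNVs; 3) as shown in FIG. 4A and FIG. 4B, apply GC correction based on the smoothing-spline method to the sequencing data after removal of population-level CNVs, to eliminate GC bias; 4) divide the cfDNA data by the gDNA data to further remove the interference of germline CNVs; 5) more than 50,000 windows are finally retained; principal component analysis is used to reduce the dimensionality of the data after removal of germline CNV interference, and the first p principal components are kept as the p CNV features.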
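Step 1 above can be sketched as follows, using the stated 100 kb width and 50 kb step. Coordinates are 0-based and half-open, and the last window is truncated at the chromosome end; both conventions are assumptions, since the text does not say how the chromosome tail is handled.

```python
def sliding_windows(chrom_length, width=100_000, step=50_000):
    """Return (start, end) windows covering one chromosome, overlapping by width - step."""
    windows = []
    start = 0
    while start < chrom_length:
        end = min(start + width, chrom_length)
        windows.append((start, end))
        if end == chrom_length:
            break  # the chromosome end has been reached
        start += step
    return windows


windows = sliding_windows(250_000)
print(windows)
```

With a step of half the width, every position away from the chromosome ends is covered by two windows.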
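Principal component analysis (PCA) is a statistical method for dimensionality reduction. It uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, called the principal components. The method extracts the most influential factors from multidimensional data for analysis, which simplifies data processing while keeping the bias of the analysis results small.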
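To make the idea concrete, the sketch below finds only the first principal component of a small data matrix by power iteration on the covariance matrix, using the standard library alone. It is a teaching example under simplifying assumptions (a fixed starting vector and a fixed number of iterations); a real analysis over tens of thousands of windows would use a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-row scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # fixed starting vector (assumption of this sketch)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("data have no variance along the starting vector")
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Points lying exactly on the line y = 2x: one component explains everything.
direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print(direction, scores)
```

The scores are the one-dimensional representation of each sample; keeping the first p such components gives p features per sample.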
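In the step of dividing the cfDNA data by the gDNA data to further remove the interference of germline CNVs, cfDNA refers to cell-free DNA in plasma, and ctDNA is the cell-free DNA in plasma that is released by tumor cells. In a tumor patient, cfDNA contains a small fraction of ctDNA. In an actual experiment, only cfDNA can be extracted and sequenced; ctDNA cannot be sequenced on its own. gDNA refers to the DNA of white blood cells. In practice, a blood sample is collected and centrifuged to separate plasma from blood cells; cfDNA is then extracted from the plasma and sequenced, and gDNA is extracted from the white blood cells and sequenced.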
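The division step can be illustrated per window. The sketch assumes both inputs are already normalized per-window depth values of equal length; windows with zero gDNA depth are returned as `None`, which is a choice made here and not something the text specifies.

```python
def normalize_by_gdna(cfdna_depth, gdna_depth):
    """Divide cfDNA depth by gDNA depth window by window."""
    if len(cfdna_depth) != len(gdna_depth):
        raise ValueError("both inputs must cover the same windows")
    return [cf / g if g > 0 else None for cf, g in zip(cfdna_depth, gdna_depth)]


# Window 2 carries a germline gain (seen in both cfDNA and gDNA);
# window 4 carries a gain seen in cfDNA only.
cfdna = [1.0, 1.5, 1.0, 1.5]
gdna = [1.0, 1.5, 1.0, 1.0]
ratios = normalize_by_gdna(cfdna, gdna)
print(ratios)
```

The germline gain cancels to a ratio of 1.0, while the cfDNA-only gain remains visible, which is the effect the step is meant to achieve.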
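CNV (copy number variation) includes somatic CNV and germline CNV. Germline CNV here means CNV that an individual is born with and that is still present after differentiation, such as the CNV carried by white blood cells. The CNVs of healthy people and tumor patients differ significantly. For a healthy person, once any germline CNVs have been removed, the plasma should show no other copy number abnormality. For a tumor patient, besides germline CNVs, the plasma also carries many tumor-related copy number abnormalities, and these CNVs show features characteristic of the cancer type (as shown in FIG. 5, which displays the CNV patterns corresponding to the different phenotypes of non-cancer and tumor patients. Image source: Qiu Z W, Bi J H, Gazdar A F, et al. Genes Chromosomes & Cancer, 2017, 56(7):559.)
In some preferred embodiments, as shown in FIG. 3, determining the result corresponding to the ctDNA according to the parameters includes: making a prediction from the parameters of the ctDNA of the sample to be tested, using the parameters of previously obtained ctDNA of known classes as the basis, to obtain a prediction result; and determining, from the prediction result, the class to which the ctDNA of the sample belongs, as the result corresponding to the ctDNA of the sample.
The class corresponding to the ctDNA of the sample to be tested is tumor patient or non-tumor patient.
In some preferred embodiments, as shown in step S106 of FIG. 3, making the prediction from the parameters of the ctDNA of the sample to be tested, on the basis of the parameters of previously obtained ctDNA of known classes, includes: using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the ctDNA sequencing data of a population of known phenotype; and using the relationship model and at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the sample to be tested to predict the phenotype of the sample.
The population of known phenotype consists of healthy people and stage I-IV tumor patients. In this application, data (whole-genome sequencing data) from a number of healthy people and stage I-IV tumor patients are obtained, and two thirds of the sample data are selected as the training set; the sample data contain the mutation error spectrum and/or the CNV features, and the class label is the phenotype. A support vector machine model is used to predict the phenotype corresponding to the mutation error spectrum and CNV features in the sequencing data of the sample to be tested. Note that either the mutation error spectrum or the CNV features alone can also predict the phenotype of the sample, but prediction performance is best when both are associated with the phenotype together.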
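The two-thirds training split can be sketched with the standard library. How the samples are chosen is not stated in the text; this example assumes a seeded random shuffle and treats the remaining third as held-out data, both of which are assumptions. Fitting the support vector machine itself would be delegated to a machine-learning library and is not shown.

```python
import random


def split_two_thirds(samples, seed=0):
    """Shuffle the samples reproducibly and return (training set, held-out set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Each sample would pair a feature vector (spectrum and/or CNV features)
# with a phenotype label; integers stand in for samples here.
train, held_out = split_two_thirds(range(9))
print(len(train), len(held_out))
```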
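In the above method, the sequencing data of the ctDNA of the sample to be tested can be obtained by existing steps. In a preferred embodiment, as shown in S101 and S102 of FIG. 3, obtaining the sequencing data of the ctDNA includes: sequencing ctDNA derived from the sample to be tested to obtain raw data; and performing quality control on the raw data to obtain the sequencing data.
The quality control procedure is similar to existing raw-data quality control: the raw data are filtered to obtain the sequencing data, that is, raw data are processed into clean data. In some preferred embodiments, as shown in S103 of FIG. 3, performing quality control on the raw data includes deleting at least one of the following kinds of reads from the raw data: reads of duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.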
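Two of the three filters can be sketched per read. The example assumes Phred quality scores given as integers and interprets the quality rule as "any window of 5 consecutive bases with a mean below 20 fails"; the N threshold is exposed as a parameter because the wording can be read as "one or more" or "more than one". Removal of PCR duplicates requires alignment positions and is not covered here.

```python
def passes_quality_control(sequence, qualities, max_n=0, window=5, min_mean_quality=20):
    """Return True if the read survives the N-base and sliding-window quality filters."""
    if sequence.count("N") > max_n:
        return False  # too many uncalled bases
    for i in range(len(qualities) - window + 1):
        if sum(qualities[i:i + window]) / window < min_mean_quality:
            return False  # a low-quality stretch of `window` bases
    return True


reads = [
    ("ACGTACGT", [30, 30, 30, 30, 30, 30, 30, 30]),
    ("ACGNACGT", [30, 30, 30, 30, 30, 30, 30, 30]),
    ("ACGTACGT", [30, 30, 10, 10, 10, 10, 10, 30]),
]
clean = [seq for seq, qual in reads if passes_quality_control(seq, qual)]
print(len(clean))
```

A single low-quality base does not fail a read under this rule as long as every 5-base window still averages at least 20.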
此处的低质量与常规高通量测序领域的低质量的涵义相同,广义上指无法进行有效的数据处理或者明显对处理结果有不利影响的数据。
上述优选实施例中,碱基N表示测序的原始数据中会有无法测出来的碱基,用N来表示。现有多种软件可以检测测序中碱基的测序质量,因而能够很方便地将连续5个核苷酸的平均测序质量低于20的reads筛选出来。
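In the alignment step, the preset condition for retention is chosen so as to improve detection accuracy. In some preferred embodiments, as shown in Figure 3, aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset condition includes: aligning the sequencing data to the reference genome and retaining, as the alignment data, the reads that align completely to the reference genome.
Only reads that align completely to the reference genome are retained as alignment data for subsequent analysis, to ensure that the base type detected at each SNP site is real rather than the result of a sequencing error. The amount of alignment data used for subsequent analysis is not limited and can be set reasonably according to the sample source. Preferably there are at least 4M reads.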
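One possible reading of "align completely" is that the whole read is placed on the reference without clipping, insertions, or deletions. Under that assumption, which the specification does not spell out, the filter can be sketched on the CIGAR string of a SAM record; the function names are hypothetical.

```python
import re

# Only the alignment-match operations M, = and X are allowed; any clipping
# (S, H), insertion (I), deletion (D), skip (N) or padding (P) disqualifies
# the read. This interpretation is an assumption.
_FULL_ALIGNMENT = re.compile(r"^(\d+[M=X])+$")


def is_fully_aligned(cigar):
    """True if the CIGAR string contains only M, = or X operations."""
    return bool(_FULL_ALIGNMENT.match(cigar))


def enough_reads(n_reads, minimum=4_000_000):
    """Check the preferred minimum of 4M retained reads from the text."""
    return n_reads >= minimum
```

Mismatches inside a fully aligned read are kept on purpose: they are the raw material for the mutation error spectrum.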
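Extraction of ctDNA from the test sample and its sequencing can use existing conventional sequencing; neither high-depth sequencing nor paired-end sequencing is required, and current low-depth sequencing at 0.1x is sufficient. Of course, high-depth sequencing also meets the requirements. In a preferred embodiment, extracting ctDNA from the test sample and sequencing it includes: extracting ctDNA from the test sample and performing low-depth whole-genome sequencing. Low-depth sequencing here means a target coverage of 0.1x to 0.5x.
Low-depth sequencing as used in this application means that the coverage of the whole sample is 0.1x to 0.5x, whereas a coverage of 2 or 3 refers to the depth at particular sites. For example, a sample has 3 billion sites; some have a depth of 0, some a depth of 1, some a depth of 2, and the depths at the other sites may likewise vary, but averaged over all sites the depth of the whole sample is 0.1x to 0.5x.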
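The relation between read count and average depth can be made concrete with a short calculation. The read length of 100 bases is an assumption used only for this example; the 3 billion sites and the 4M read minimum come from the text.

```python
def mean_depth(n_reads, read_length, genome_size=3_000_000_000):
    """Average depth = total sequenced bases / number of genome sites."""
    return n_reads * read_length / genome_size


# 4M reads of an assumed 100 bases each over 3 billion sites:
print(round(mean_depth(4_000_000, 100), 3))   # about 0.133x
# 15M reads of the same assumed length reach the upper end of the range:
print(mean_depth(15_000_000, 100))            # 0.5x
```

At an average depth of about 0.13x most sites are not covered at all, which is why individual sites may still show a depth of 1, 2, or 3 while the sample as a whole counts as low depth.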
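It should be noted that, for simplicity of description, the foregoing method embodiments are each presented as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the order of actions described, because according to the invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the invention.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions that cause a computing device, or a processor, to execute the methods described in the embodiments of the invention.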
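In a second preferred embodiment, a storage medium is provided on which computer-executable program code is stored. When the program code is executed by one or more processors of a computer system, the computer system performs a ctDNA-based gene detection method. The computer-executable program code includes: code for obtaining the ctDNA sequencing data of a test sample; code for aligning the sequencing data to a reference genome and retaining the alignment data that meet a preset condition; and code for analyzing at least one of the following parameters in the alignment data and determining the gene result corresponding to the ctDNA: the mutation error spectrum of the alignment data and the CNV features, wherein the mutation error spectrum is obtained by classifying the mutation errors and then calculating the abundance of each class of mutation error, and a mutation error is a base, other than at polymorphic sites, that is inconsistent with the base of the reference genome.
In some preferred embodiments, the mutation error spectrum is obtained by classifying the mutation errors according to the following information and then calculating the abundance of each class of mutation error: reference base, observed base, positive strand, negative strand, and background.
In some preferred embodiments, the mutation errors are classified according to this information into classes that include at least the following: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+), and G>T(-).
In some preferred embodiments, the CNV features are obtained by executing the following code: code for dividing the reference genome into a series of windows of predetermined width as the minimum analysis units; code for removing, per minimum analysis unit, population-level CNVs from the alignment data using a hidden Markov model to obtain a first data set; code for applying GC correction to the first data set to obtain a second data set; code for removing germline CNV interference from the second data set to obtain a third data set; and code for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
In some preferred embodiments, the code for determining the result corresponding to the ctDNA according to the parameters includes: code for predicting from the corresponding parameters in the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, to obtain a prediction result; and code for determining, according to the prediction result, the class to which the ctDNA of the test sample corresponds, as the result corresponding to the ctDNA of the test sample.
In some preferred embodiments, the class corresponding to the ctDNA of the test sample is tumor patient or non-tumor patient.
In some preferred embodiments, the code for predicting from the corresponding parameters in the ctDNA sequencing data of the test sample on the basis of previously obtained parameters corresponding to sequencing data of known classes includes: code for using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype; and code for using the relationship model together with at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the test sample to predict the phenotype of the test sample.
In some preferred embodiments, the code for obtaining the ctDNA sequencing data of the test sample includes: code for obtaining raw data after ctDNA has been extracted from the test sample and sequenced; and code for performing quality control on the raw data to obtain the sequencing data.
In some preferred embodiments, the code for performing quality control on the raw data to obtain the sequencing data includes code for deleting from the raw data at least one of the following kinds of reads: reads that are duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
In some preferred embodiments, the code for aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset condition includes: code for aligning the sequencing data to the reference genome and retaining, as the alignment data, the reads that align completely to the reference genome.
In some preferred embodiments, the code for extracting ctDNA from the test sample and sequencing it includes: code for extracting ctDNA from the test sample and performing low-depth whole-genome sequencing.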
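In a third preferred embodiment, a computer system is provided that includes a processor, system memory, and one or more computer-readable storage media on which computer-executable instructions are stored, the storage medium being any of the storage media described above.
In a fourth preferred embodiment, a ctDNA-based gene detection apparatus is provided. The apparatus is used to store or run modules, or the modules are components of the apparatus; the modules are one or more software modules used to execute any of the gene detection methods described above.
Preferably, as shown in Figure 6, the apparatus includes an acquisition module 20, an alignment module 40, and an analysis and determination module 60. The acquisition module 20 obtains the ctDNA sequencing data of the test sample; the alignment module 40 aligns the sequencing data to the reference genome and retains the alignment data that meet the preset condition; the analysis and determination module 60 analyzes the alignment data according to at least one of the following parameters and determines the result corresponding to the ctDNA: the mutation error spectrum of the alignment data and the CNV features, wherein the mutation error spectrum is obtained by classifying the mutation errors and then calculating the abundance of each class of mutation error, and a mutation error is a base, other than at polymorphic sites, that is inconsistent with the base of the reference genome.
The above embodiment of this application provides a ctDNA-based gene detection apparatus that makes inventive use of the mutation error spectrum in the ctDNA sequencing data as a parameter. By using the mutation error spectrum and/or the CNV features to predict the detection result for the ctDNA of the test sample, it not only enables gene detection on conventional ctDNA sequencing data but also greatly reduces cost.
It should be noted that, before gene detection is performed, the system needs to obtain the attribute values of the files and parameters and, according to these attribute values, decide whether quality control is needed, whether a training data set needs to be generated, and which gene detection method to use. The attribute values may include the type of the input file and whether variation data of known phenotype and healthy control data are available.
As shown in Figure 7, the acquisition module 20 in the above embodiment may further include a quality control module 202, and the analysis and determination module may further include a mutation error spectrum module 501, a CNV feature module 502, a model building module 601, and a phenotype prediction module 602.
Preferably, the above embodiment may further include a control module 101, which controls input and output, obtains the file and parameter attribute values, controls the invocation of the other modules, and determines the design of the gene detection method. Further, the control module 101 may control the invocation of the other modules and determine the gene detection workflow as follows: deciding whether to invoke the quality control module, whether to generate a training data set, and which analysis and determination module to select.
Specifically, the control module 101 controls the design and execution of the whole gene detection workflow. First, it checks the attribute value of the input file: if the input is raw sequencing data, it invokes the quality control module 202; otherwise it invokes the alignment module 40. Second, when variation data of known phenotype and healthy control data are supplied, it invokes the mutation error spectrum module 501 and/or the CNV feature module 502 to generate the training data set. Third, it selects the gene detection method for the ctDNA alignment data of the test sample. Finally, it invokes the corresponding model building module 601 and phenotype prediction module 602.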
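The decision logic of the control module 101 can be sketched as a small planning function. The step names, the string encoding of the plan, and the assumption that alignment always follows quality control are hypothetical; the text specifies only which modules are invoked under which conditions.

```python
def plan_pipeline(input_type, has_labeled_data,
                  use_error_spectrum=True, use_cnv=True):
    """Return the ordered list of module invocations for one run.

    input_type: "raw" for raw sequencing data, anything else for data
    that has already passed quality control (assumed encoding).
    """
    feature_modules = []
    if use_error_spectrum:
        feature_modules.append("error_spectrum")   # module 501
    if use_cnv:
        feature_modules.append("cnv_features")     # module 502
    if not feature_modules:
        raise ValueError("at least one feature type must be selected")

    steps = []
    if input_type == "raw":
        steps.append("quality_control")            # module 202
    steps.append("alignment")                      # module 40
    if has_labeled_data:
        # Known-phenotype and healthy-control data are available, so a
        # training data set is generated from the selected features.
        steps.append("build_training_set:" + "+".join(feature_modules))
    steps.append("model_building")                 # module 601
    steps.append("phenotype_prediction")           # module 602
    return steps
```

For raw input with labeled data, `plan_pipeline("raw", True)` yields quality control, alignment, training set generation from both feature types, model building, and phenotype prediction, in that order.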
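Preferably, before the analysis and determination module 60 analyzes the alignment data according to the mutation error spectrum, the analysis and determination module 60 further includes a mutation error spectrum module 501 for computing the mutation error spectrum. The mutation error spectrum module may further include an information classification module and an abundance calculation module; the information classification module classifies the mutation errors according to the following information: reference base, observed base, positive strand, negative strand, and background.
Preferably, the mutation error spectrum module classifies the mutation errors according to this information into classes that include at least the following: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+), and G>T(-). The abundance calculation module calculates the abundance of each class of mutation error after this classification, thereby producing the mutation error spectrum.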
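The classification into 24 classes and the abundance calculation can be sketched as follows. Abundance is taken here to be the fraction of all classified mismatches that fall into each class, which is an assumption, as is the input format; the "background" information is not used in this sketch.

```python
from collections import Counter
from itertools import permutations

BASES = "ATCG"
# 12 ordered (reference, observed) base pairs x 2 strands = 24 classes.
CLASSES = [f"{ref}>{obs}({strand})"
           for ref, obs in permutations(BASES, 2)
           for strand in "+-"]


def error_spectrum(mismatches):
    """Return {class: abundance} for an iterable of
    (reference_base, observed_base, strand) tuples, strand being '+' or '-'.

    Mismatches at polymorphic sites are assumed to have been excluded
    before this function is called.
    """
    counts = Counter(f"{r}>{o}({s})" for r, o, s in mismatches)
    total = sum(counts[c] for c in CLASSES)
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}
```

With two A>T(+) events, one C>T(-) event, and one G>A(+) event, the spectrum assigns 0.5 to A>T(+), 0.25 to each of the other two observed classes, and 0 to the remaining 21 classes.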
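Preferably, before the analysis and determination module 60 analyzes the alignment data according to the CNV features, the analysis and determination module 60 further includes a CNV feature module 502 for extracting the CNV features. The CNV feature module 502 may further include: a window division submodule for dividing the reference genome into a series of windows of predetermined width as the minimum analysis units; a first correction submodule for removing, per minimum analysis unit, population-level CNVs from the alignment data using a hidden Markov model to obtain a first data set; a second correction submodule for applying GC correction to the first data set to obtain a second data set; a third correction submodule for removing germline CNV interference from the second data set to obtain a third data set; and a CNV extraction submodule for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
Preferably, the analysis and determination module 60 may further include a prediction module and a determination module. The prediction module predicts from the corresponding parameters in the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, to obtain a prediction result; the determination module determines, according to the prediction result, the class to which the ctDNA of the test sample corresponds, as the result corresponding to the ctDNA of the test sample.
Preferably, the class corresponding to the ctDNA of the test sample is tumor patient or non-tumor patient.
Preferably, the prediction module includes a model building module 601 and a phenotype prediction module 602. The model building module uses a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype; the phenotype prediction module uses the relationship model together with at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the test sample to predict the phenotype of the test sample.
Preferably, the acquisition module 20 includes a sequencing submodule 201 and a quality control submodule 202. The sequencing submodule 201 sequences ctDNA derived from the test sample to obtain raw data; the quality control submodule 202 performs quality control on the raw data to obtain the sequencing data.
Preferably, the quality control module 202 includes a deletion unit for deleting from the raw data at least one of the following kinds of reads: reads that are duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
Preferably, the alignment module 40 includes an alignment submodule for aligning the sequencing data to the reference genome and retaining, as the alignment data, the reads that align completely to the reference genome.
As can be seen from the above, the ctDNA-based gene detection apparatus proposed in this application has multiple built-in functional modules. The control module can automatically design the most suitable gene detection workflow for different data types and automatically invoke and integrate the corresponding modules, so that gene detection is carried out efficiently. The detection method and apparatus are methodologically rigorous, full-featured, and simple to operate.
The storage medium, computer system, and apparatus described above can all be used by a computer to execute the ctDNA-based gene detection method and output the corresponding detection result. These products achieve gene detection on ctDNA without adding any extra experimental or sequencing cost, and the apparatus offers low detection cost and high accuracy.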
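In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied as a software product stored in a storage medium and including a number of instructions that cause a computing device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.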
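The following description refers to optional examples.
Example 1
To verify the feasibility of the method of this application, this example selected patients clinically diagnosed with breast cancer or lung cancer at different stages (Stage I to Stage IV), excluding those who had undergone surgery, chemotherapy, or radiotherapy, and used healthy people and patients with benign tumors or precancerous lesions as controls. The procedure followed the detailed flowchart shown in Figure 2.
The samples are described in Table 1, and the detection results are given in Table 2.
Table 1: Samples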
Figure PCTCN2018123705-appb-000001
Figure PCTCN2018123705-appb-000002
Table 2: Validation results
Figure PCTCN2018123705-appb-000003
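As can be seen from the above example, the solution of this application has the following advantages:
1) The library construction workflow is simplified: whole-genome sequencing (WGS) is used and no capture step is needed. The workflow can be automated and integrated.
2) The cost is low. On the one hand, the cost of capture chips is avoided, as is capture bias; on the other hand, the pioneering use of low-depth sequencing greatly reduces sequencing cost compared with ultra-high-depth sequencing.
3) The improved scheme of this application is widely applicable and is not limited by tumor type. It is also safe and convenient, can detect tumors at an early stage, and offers higher detection accuracy. Among traditional tumor screening methods, one class is based on imaging, such as PET-CT, and detects changes at the tissue level; because tissue-level changes appear later than DNA-level changes, early detection is difficult, and sensitivity is below 60%. The other class is based on serum molecular markers; because false-negative and false-positive rates in tumor screening remain high, early screening is essentially not achievable.
From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by software plus a necessary general-purpose hardware platform. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product that can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and that includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments of this application or of certain parts of the embodiments.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, the embodiments may be consulted against one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are largely similar to the method embodiments; for the relevant details, refer to the corresponding parts of the method embodiments.
This application can be used in many general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The invention is thus not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the invention. It should be pointed out that those of ordinary skill in the art can make a number of improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the invention.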

Claims (33)

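  1. A ctDNA-based gene detection method, characterized in that the detection method comprises:
    obtaining ctDNA sequencing data of a test sample;
    aligning the sequencing data to a reference genome and retaining the alignment data that meet a preset condition;
    analyzing the alignment data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum of the alignment data and the CNV features, wherein the mutation error spectrum is obtained by classifying mutation errors and then calculating the abundance of each class of mutation error, and a mutation error is a base, other than at polymorphic sites, that is inconsistent with the base of the reference genome.
  2. The detection method according to claim 1, characterized in that the mutation error spectrum is obtained by classifying the mutation errors according to the following information and then calculating the abundance of each class of mutation error: reference base, observed base, positive strand, negative strand, and background.
  3. The detection method according to claim 2, characterized in that the mutation errors are classified according to the information into classes that include at least the following: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+), and G>T(-).
  4. The detection method according to claim 1, characterized in that the CNV features are obtained by the following steps:
    dividing the reference genome into a series of windows of predetermined width as the minimum analysis units;
    removing, per minimum analysis unit, population-level CNVs from the alignment data using a hidden Markov model to obtain a first data set;
    applying GC correction to the first data set to obtain a second data set;
    removing germline CNV interference from the second data set to obtain a third data set;
    reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  5. The detection method according to any one of claims 1 to 4, characterized in that determining the result corresponding to the ctDNA according to the parameters comprises:
    predicting from the corresponding parameters in the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, to obtain a prediction result;
    determining, according to the prediction result, the class to which the ctDNA of the test sample corresponds, as the result corresponding to the ctDNA of the test sample.
  6. The method according to claim 5, characterized in that the class corresponding to the ctDNA of the test sample is tumor patient or non-tumor patient.
  7. The method according to claim 5, characterized in that predicting from the parameters corresponding to the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, comprises:
    using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype;
    using the relationship model together with at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the test sample to predict the phenotype of the test sample.
  8. The detection method according to any one of claims 1 to 4, characterized in that obtaining the ctDNA sequencing data comprises:
    sequencing ctDNA derived from the test sample to obtain raw data;
    performing quality control on the raw data to obtain the sequencing data.
  9. The detection method according to claim 8, characterized in that performing quality control on the raw data to obtain the sequencing data comprises:
    deleting from the raw data at least one of the following kinds of reads: reads that are duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
  10. The detection method according to claim 1, characterized in that aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset condition comprises:
    aligning the sequencing data to the reference genome and retaining, as the alignment data, the reads that align completely to the reference genome.
  11. The detection method according to claim 8, characterized in that sequencing ctDNA derived from the test sample comprises:
    extracting ctDNA from the test sample and performing low-depth whole-genome sequencing.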
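  12. A storage medium on which computer-executable program code is stored, characterized in that, when the program code is executed by one or more processors of a computer system, the computer system performs a ctDNA-based gene detection method, the computer-executable program code comprising:
    code for obtaining ctDNA sequencing data of a test sample;
    code for aligning the sequencing data to a reference genome and retaining the alignment data that meet a preset condition; and
    code for analyzing at least one of the following parameters in the alignment data and determining the gene result corresponding to the ctDNA: the mutation error spectrum of the alignment data and the CNV features, wherein the mutation error spectrum is obtained by classifying mutation errors and then calculating the abundance of each class of mutation error, and a mutation error is a base, other than at polymorphic sites, that is inconsistent with the base of the reference genome.
  13. The storage medium according to claim 12, characterized in that the mutation error spectrum is obtained by classifying the mutation errors according to the following information and then calculating the abundance of each class of mutation error: reference base, observed base, positive strand, negative strand, and background.
  14. The storage medium according to claim 13, characterized in that the mutation errors are classified according to the information into classes that include at least the following: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+), and G>T(-).
  15. The storage medium according to claim 12, characterized in that the CNV features are obtained by executing the following code:
    code for dividing the reference genome into a series of windows of predetermined width as the minimum analysis units;
    code for removing, per minimum analysis unit, population-level CNVs from the alignment data using a hidden Markov model to obtain a first data set;
    code for applying GC correction to the first data set to obtain a second data set;
    code for removing germline CNV interference from the second data set to obtain a third data set;
    code for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  16. The storage medium according to any one of claims 12 to 15, characterized in that the code for determining the result corresponding to the ctDNA according to the parameters comprises:
    code for predicting from the corresponding parameters in the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, to obtain a prediction result;
    code for determining, according to the prediction result, the class to which the ctDNA of the test sample corresponds, as the result corresponding to the ctDNA of the test sample.
  17. The storage medium according to claim 16, characterized in that the class corresponding to the ctDNA of the test sample is tumor patient or non-tumor patient.
  18. The storage medium according to claim 16, characterized in that the code for predicting from the corresponding parameters in the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, comprises:
    code for using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype;
    code for using the relationship model together with at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the test sample to predict the phenotype of the test sample.
  19. The storage medium according to any one of claims 12 to 15, characterized in that the code for obtaining the ctDNA sequencing data of the test sample comprises:
    code for sequencing ctDNA derived from the test sample to obtain raw data;
    code for performing quality control on the raw data to obtain the sequencing data.
  20. The storage medium according to claim 19, characterized in that the code for performing quality control on the raw data to obtain the sequencing data comprises:
    code for deleting from the raw data at least one of the following kinds of reads: reads that are duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
  21. The storage medium according to claim 12, characterized in that the code for aligning the sequencing data to the reference genome and retaining the alignment data that meet the preset condition comprises:
    code for aligning the sequencing data to the reference genome and retaining, as the alignment data, the reads that align completely to the reference genome.
  22. The storage medium according to claim 19, characterized in that the code for sequencing ctDNA derived from the test sample to obtain raw data comprises:
    code for extracting ctDNA from the test sample and performing low-depth whole-genome sequencing.
  23. A computer system comprising a processor, system memory, and one or more computer-readable storage media on which computer-executable instructions are stored, characterized in that the storage medium is the storage medium according to any one of claims 12 to 22.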
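  24. A ctDNA-based gene detection apparatus, characterized in that the apparatus comprises:
    an acquisition module for obtaining ctDNA sequencing data of a test sample;
    an alignment module for aligning the sequencing data to a reference genome and retaining the alignment data that meet a preset condition;
    an analysis and determination module for analyzing the alignment data according to at least one of the following parameters and determining the result corresponding to the ctDNA: the mutation error spectrum of the alignment data and the CNV features, wherein the mutation error spectrum is obtained by classifying mutation errors and then calculating the abundance of each class of mutation error, and a mutation error is a base, other than at polymorphic sites, that is inconsistent with the base of the reference genome.
  25. The apparatus according to claim 24, characterized in that the analysis and determination module further comprises a mutation error spectrum module, and the mutation error spectrum module obtains the mutation error spectrum by classifying the mutation errors according to the following information and then calculating the abundance of each class of mutation error: reference base, observed base, positive strand, negative strand, and background.
  26. The apparatus according to claim 25, characterized in that the mutation error spectrum module classifies the mutation errors according to the information into classes that include at least the following: A>T(+), A>T(-), A>C(+), A>C(-), A>G(+), A>G(-), T>A(+), T>A(-), T>C(+), T>C(-), T>G(+), T>G(-), C>A(+), C>A(-), C>G(+), C>G(-), C>T(+), C>T(-), G>A(+), G>A(-), G>C(+), G>C(-), G>T(+), and G>T(-).
  27. The apparatus according to claim 24, characterized in that the analysis and determination module further comprises a CNV feature extraction module for extracting the CNV features, the CNV feature extraction module comprising:
    a window division submodule for dividing the reference genome into a series of windows of predetermined width as the minimum analysis units;
    a first correction submodule for removing, per minimum analysis unit, population-level CNVs from the alignment data using a hidden Markov model to obtain a first data set;
    a second correction submodule for applying GC correction to the first data set to obtain a second data set;
    a third correction submodule for removing germline CNV interference from the second data set to obtain a third data set;
    a CNV extraction submodule for reducing the dimensionality of the third data set by principal component analysis and extracting the CNV features.
  28. The apparatus according to any one of claims 24 to 27, characterized in that the analysis and determination module comprises:
    a prediction module for predicting from the corresponding parameters in the ctDNA sequencing data of the test sample, on the basis of previously obtained parameters corresponding to sequencing data of known classes, to obtain a prediction result;
    a determination module for determining, according to the prediction result, the class to which the ctDNA of the test sample corresponds, as the result corresponding to the ctDNA of the test sample.
  29. The apparatus according to claim 28, characterized in that the class corresponding to the ctDNA of the test sample is tumor patient or non-tumor patient.
  30. The apparatus according to claim 28, characterized in that the prediction module comprises:
    a model building module for using a support vector machine to build a relationship model between the known phenotypes and at least one of the mutation error spectrum and the CNV features in the sequencing data of a population of known phenotype;
    a phenotype prediction module for using the relationship model together with at least one of the mutation error spectrum and the CNV features corresponding to the ctDNA of the test sample to predict the phenotype of the test sample.
  31. The apparatus according to any one of claims 24 to 27, characterized in that the acquisition module comprises:
    a sequencing module for sequencing ctDNA derived from the test sample to obtain raw data;
    a quality control module for performing quality control on the raw data to obtain the sequencing data.
  32. The apparatus according to claim 31, characterized in that the quality control module comprises:
    a deletion unit for deleting from the raw data at least one of the following kinds of reads: reads that are duplicate fragments introduced by PCR amplification, reads containing one or more N bases, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20.
  33. The apparatus according to claim 24, characterized in that the alignment module comprises:
    an alignment submodule for aligning the sequencing data to the reference genome and retaining, as the alignment data, the reads that align completely to the reference genome.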
PCT/CN2018/123705 2018-12-20 2018-12-26 ctDNA-based gene detection method, apparatus, storage medium and computer system WO2020124625A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811565001.5 2018-12-20
CN201811565001.5A CN109712671B (zh) 2018-12-20 2018-12-20 ctDNA-based gene detection apparatus, storage medium and computer system

Publications (1)

Publication Number Publication Date
WO2020124625A1 true WO2020124625A1 (zh) 2020-06-25

Family

ID=66256987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/123705 WO2020124625A1 (zh) 2018-12-20 2018-12-26 ctDNA-based gene detection method, apparatus, storage medium and computer system

Country Status (2)

Country Link
CN (1) CN109712671B (zh)
WO (1) WO2020124625A1 (zh)

Also Published As

Publication number Publication date
CN109712671B (zh) 2020-06-26
CN109712671A (zh) 2019-05-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18943913

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18943913

Country of ref document: EP

Kind code of ref document: A1