WO2019157791A1 - Method, device, and computer-readable medium for detecting copy number variation - Google Patents


Info

Publication number
WO2019157791A1
Authority
WO
WIPO (PCT)
Prior art keywords
copy number
sample
detected
sequencing
average
Prior art date
Application number
PCT/CN2018/090086
Other languages
English (en)
French (fr)
Inventor
邵阳
汪笑男
吴雪
常志力
刘思思
那成龙
Original Assignee
南京世和基因生物技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京世和基因生物技术有限公司
Publication of WO2019157791A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection

Definitions

  • The invention relates to a method, a device, and a computer-readable medium for detecting copy number variation, in particular to a technique for detecting copy number variation from high-throughput sequencing data, and belongs to the technical field of bioinformatics.
  • CNVs: Copy Number Variations
  • SVs: Structural Variations
  • Copy number variations (CNVs) are a form of structural variation (SV): deletions, insertions, duplications, and complex multi-site variations of DNA fragments ranging in size from 50 bp to several Mb, relative to the reference genome.
  • CNVs of genomic fragments affect gene expression by changing gene dosage or chromosome conformation, which leads to pathological changes in organisms and affects disease development; they occupy an increasingly important position in research on phenotypic polymorphism and evolution.
  • NGS: Next Generation Sequencing
  • NGS-based detection is a recently developed approach to detecting CNVs.
  • The number of sequencing fragments in a given region of the genome is used to characterize the amount of the gene in that region, and thereby to identify regions where the gene content is abnormal.
  • The advantage of this approach is that NGS data with sufficient sequencing depth gives more accurate breakpoint positions and finer detection resolution than chip-based detection, allowing more detailed genome-wide CNV detection.
  • NGS-based CNV detection methods fall into four categories: read-depth based, paired-read based, sequence-assembly based, and split-read based. By core detection technique, they fall into two types: methods based on probabilistic statistical models and methods based on machine learning.
  • Statistical methods detect copy number variation regions mainly from the statistical characteristics of the read-depth signal.
  • A premise of the statistical methods is that the sequencing process is uniform: the read-depth signal along windows of the chromosome follows a certain distribution, such as a Poisson distribution, and the read-depth signal is linear in the copy number. An increase or decrease in the read-depth signal over consecutive windows therefore indicates an increase or decrease in copy number, which is how a copy number variation region is predicted.
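  • The read-depth premise can be illustrated with a minimal Python sketch. It assumes a Poisson model whose mean is proportional to copy number; the function names, the diploid mean of 100 reads per window, and the cap of 6 copies are illustrative assumptions, not values from the patent.

```python
import math


def expected_reads(diploid_mean: float, copy_number: int, ploidy: int = 2) -> float:
    """Expected read count in a window if depth is linear in copy number."""
    return diploid_mean * copy_number / ploidy


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def most_likely_copy_number(count: int, diploid_mean: float, max_cn: int = 6) -> int:
    """Pick the copy number in 1..max_cn whose Poisson model best explains the count.

    Copy number 0 is left out because its expected depth is 0 under this
    simplified model, which would need separate handling.
    """
    return max(
        range(1, max_cn + 1),
        key=lambda cn: poisson_pmf(count, expected_reads(diploid_mean, cn)),
    )


if __name__ == "__main__":
    # A window with about twice the diploid depth is best explained by 4 copies.
    print(most_likely_copy_number(210, diploid_mean=100.0))
```

  • Real sequencing depth is usually over-dispersed relative to a Poisson model, which is one reason a per-window call like this is only a starting point.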
  • The present invention aims to solve the problem that the calculated copy number of a specific gene is unstable when detection uses a panel composed of a plurality of detection regions.
  • A method is provided for calculating the copy number of one or more detection regions/sites/genes, in which each region/site/gene is treated as a detection region; the detection regions can be combined with other sites/regions to form a panel that is then sequenced and analyzed together by NGS.
  • The technical solution is:
  • A method for detecting copy number variation includes the following steps:
  • Step S1: the region to be detected and other regions are combined into a panel, and the sample to be tested, the control sample, and the normal samples are subjected to NGS sequencing using the panel;
  • Step S2: according to the result of step S1, the copy number of the region to be detected is calculated for each normal sample and for the sample to be tested;
  • The copy number of the region to be detected in the sample to be tested and in the normal samples is calculated as follows: the NGS sequencing results of the sample and of the control sample are aligned to the reference genome sequence to obtain the reads uniquely aligned to the detection regions, and the sequencing depth of each detection region in each sample is calculated; a T-distribution fit is performed on the sequencing depths of the detection regions within the same sample, and the mean of the fitted distribution is taken as the average sequencing depth of that sample; the copy number of the region to be detected in each sample is then calculated from the average sequencing depth.
  • The number of detection regions in the panel is from 3 to 50,000.
  • The baseline value consists of mean ± standard deviation (sd).
  • The region to be detected includes, but is not limited to, any one of the following genes: AKT1, AKT2, AKT3, CD274 (PD-L1), DDR2, EGFR, ERBB2, FGFR1, FGFR3, FLT4, HGF, MET, MTOR, MYC, PDCD1LG2 (PD-L2), PDGFRA, PDGFRB, SOX2, TP53, VEGFA, BRAF, FLT1, HRAS, KDR, KRAS, MAP2K1, MAP2K2, NRAS, PIK3CA, RB1, TOP1, VEGFA, BCL2, BCL6, CCND1, CCND3, CDK6, CEBPA, CEBPB, CEBPD, HOXA10, IL3, IRF4, KMT2A, LYL1, MUC1, MYC, NOTCH1, SETBP1, TAL1, ZAP70.
  • The method for detecting copy number variation is for non-therapeutic and non-diagnostic purposes.
  • A method for detecting copy number variation of a gene comprises: calculating the copy numbers of a plurality of regions to be detected within the gene, taking the average of these copy numbers as the copy number of the gene, and performing comparative analysis. The technical solution is:
  • A method for detecting copy number variation includes the following steps:
  • Step S2: according to the result of step S1, the copy number of the gene to be detected is calculated for each sample to be tested and each normal sample;
  • The copy number of the gene to be detected in the sample to be tested and in the normal samples is the average of the copy numbers of its regions to be detected.
  • The copy number of a region to be detected is calculated as follows: the NGS sequencing results of the sample and of the control sample are aligned to the reference genome sequence to obtain the reads uniquely aligned to the detection regions, and the sequencing depth of each detection region in each sample is calculated; a T-distribution fit is performed on the sequencing depths of the detection regions within the same sample, and the mean of the fitted distribution is taken as the average sequencing depth of that sample; the copy number of the region to be detected in each sample is then calculated from the average sequencing depth.
  • The number of detection regions in the panel is from 3 to 50,000.
  • The average of the copy numbers of the regions to be detected may be an arithmetic mean, a mean-square average, an F-distribution mean, or a T-distribution mean.
  • The genes to be detected include, but are not limited to, AKT1, AKT2, AKT3, CD274 (PD-L1), DDR2, EGFR, ERBB2, FGFR1, FGFR3, FLT4, HGF, MET, MTOR, MYC, PDCD1LG2 (PD-L2), PDGFRA, PDGFRB, SOX2, TP53, VEGFA, BRAF, FLT1, HRAS, KDR, KRAS, MAP2K1, MAP2K2, NRAS, PIK3CA, RB1, TOP1, VEGFA, BCL2, BCL6, CCND1, CCND3, CDK6, CEBPA, CEBPB, CEBPD, HOXA10, IL3, IRF4, KMT2A, LYL1, MUC1, MYC, NOTCH1, SETBP1, TAL1, ZAP70.
  • A detection device is provided for calculating the copy number of one or more detection regions/sites/genes, in which each region/site/gene is treated as a detection region; the detection regions can be combined with other sites/regions to form a panel that is then sequenced and analyzed together by NGS.
  • The technical solution is:
  • A detection device for copy number variation, comprising:
  • a sequencing data acquisition module, configured to perform NGS sequencing on the sample to be tested, the control sample, and the normal samples using a panel composed of the region to be detected and other regions, and to align the sequencing data of each sample to the reference genome sequence to obtain the reads uniquely aligned to the detection regions;
  • a sequencing depth calculation module, configured to calculate the sequencing depth of each detection region in each sample from the uniquely aligned reads, to perform a T-distribution fit on the sequencing depths of the detection regions within the same sample, and to take the mean of the fitted distribution as the average sequencing depth of that sample;
  • a copy number calculation module, configured to calculate the copy number of the region to be detected in each normal sample and in the sample to be tested;
  • a baseline calculation module, configured to fit a T distribution to the copy numbers of the region to be detected in the normal samples and to use the mean of the fitted distribution as the average copy number baseline of the region to be detected;
  • an analysis module, configured to compare the copy number of the region to be detected in the sample to be tested with the average copy number baseline to determine whether copy number variation is present.
  • The number of detection regions in the panel is from 3 to 50,000.
  • A device for detecting copy number variation of a gene is also provided, which is applied to detecting the copy number of a gene containing a plurality of regions to be detected (for example, exon regions).
  • The technical solution is:
  • A detection device for copy number variation, comprising:
  • a sequencing data acquisition module, configured to perform NGS sequencing on the sample to be tested, the control sample, and the normal samples using a panel, and to align the sequencing data of each sample to the reference genome sequence to obtain the reads uniquely aligned to the detection regions; the panel is composed of at least two regions to be detected and other regions, and the regions to be detected are selected from the sequence of the gene to be detected;
  • a sequencing depth calculation module, configured to calculate the sequencing depth of each detection region in each sample from the uniquely aligned reads, to perform a T-distribution fit on the sequencing depths of the detection regions within the same sample, and to take the mean of the fitted distribution as the average sequencing depth of that sample;
  • a copy number calculation module, configured to calculate the copy number of the gene to be detected in each normal sample and in the sample to be tested;
  • a baseline calculation module, configured to fit a T distribution to the copy numbers of the gene to be detected in the normal samples and to use the mean of the fitted distribution as the average copy number baseline of the gene to be detected;
  • an analysis module, configured to compare the copy number of the gene to be detected in the sample to be tested with the average copy number baseline to determine whether copy number variation is present.
  • The copy number of the gene to be detected in the sample to be tested and in the normal samples is the average of the copy numbers of its regions to be detected.
  • The number of detection regions in the panel is from 3 to 50,000.
  • A computer-readable medium stores a program that can run the above method for detecting copy number variation.
  • When the detection method provided by the invention is applied to NGS sequencing of a panel composed of a plurality of detection regions, it effectively avoids the large fluctuations in calculated copy number, and the inconsistent results across panels, that arise when copy number variation in other regions raises or lowers their sequencing depth. Obtaining the average sequencing depth by T-distribution fitting effectively removes the influence of depth fluctuations in individual detection regions on the result, and calculating a copy number baseline makes it possible to identify the detection regions where an abnormality exists.
  • FIG. 1 is a flow chart of a detection method provided by the present invention.
  • FIG. 2 is a structural view of a detecting device provided by the present invention.
  • Fig. 3 is a comparison diagram of MYCN gene detection results of the detection method provided by the present invention.
  • FIG. 4 is a comparison diagram of MET gene detection results of the detection method provided by the present invention.
  • Fig. 5 is a comparison diagram of the detection results of the CDKN2A gene by the detection method provided by the present invention.
  • The copy number variation detection method proposed by the invention uses the read-depth approach to determine the copy number variation of each detection region. During detection, the detection regions together constitute a panel for high-throughput sequencing; these regions may be exons or other genes of interest.
  • The detection principle is as follows: if copy number variation occurs in a region of a chromosome, the distribution of sequence fragments in that region changes under high-throughput sequencing. A copy number deletion lowers the sequence density, and a copy number amplification raises it.
  • The detection method provided by the present invention is used only to determine from sequencing results whether copy number variation is present, and is not for treatment or diagnosis purposes.
  • The number of detection regions may be several tens, several hundred, or several thousand. In a given test, only some of the genes or sites in the panel may be of interest; these are the gene or locus regions to be detected.
  • The "region to be detected" in the present invention refers to a region that is to be examined when performing copy number variation detection.
  • The "other regions" in the present invention refer to regions that are added to the panel together with the region to be detected. They may be important under other conditions and need to be detected, or their copy number variation may also be tested in subsequent steps of this assay; "other regions" are added to improve the efficiency of high-throughput sequencing.
  • Where the specification refers to "each detection region" or "detection regions in a panel", it means both the "regions to be detected" and the "other regions", because both are regions detected in a single high-throughput sequencing run, and data such as the sequencing depth of the "other regions" also enter the calculations.
  • High-throughput sequencing refers to second-generation high-throughput sequencing techniques and to higher-throughput sequencing methods developed later.
  • Next-generation sequencing platforms include, but are not limited to, Illumina (Miseq, Hiseq2000, Hiseq2500, Hiseq3000, Hiseq4000, HiseqX Ten, etc.), ABI-Solid, and Roche-454 sequencing platforms. As sequencing technology continues to evolve, those skilled in the art will appreciate that other sequencing methods and apparatus can also be used for this assay.
  • A nucleic acid tag according to an embodiment of the present invention can be used for sequencing on at least one of the Illumina, ABI-Solid, and Roche-454 sequencing platforms, among others.
  • Next-generation sequencing technologies, such as Illumina sequencing, have the following advantages. (1) High sensitivity: a next-generation sequencer such as the Miseq can generate up to 15 G of base data in one run. With a fixed number of sequences, this high data throughput gives each sequence a higher sequencing depth, so mutations of lower abundance can be detected; because the depth is high, each mutation site is covered many times, which also makes the sequencing result more reliable.
  • (2) High throughput and low cost: with the tag sequence according to an embodiment of the present invention, tens of thousands of samples can be detected in one sequencing run, which greatly reduces cost.
  • The "sample to be tested" in the present invention is the sample that requires detection, that is, the sample for which it is to be determined whether one or more gene regions have copy number variation.
  • "Normal sample" refers to a sample used to calculate the decision baseline. A certain number of normal human blood samples can be selected for analysis (several hundred normal human samples can be used; the sample size should be statistically significant given actual conditions). Calculating the copy number also requires a "control", against which the sequencing depth of the sample to be tested or the normal sample is compared.
  • The "control sample" can be a sample paired with the sample to be tested. When the sample to be tested has its own control, that control can be used directly as the control sample. If there is no such control, a standard cell DNA sample can be used; for example, when the sample to be tested is of human origin, the DNA sample NA18535 from B lymphocytes of a healthy Beijing individual can be used, among others.
  • The term "alignment" refers to the process of comparing a read or tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read can be mapped to the reference sequence or, in some embodiments, to a particular location in the reference sequence. In some cases, the alignment simply tells whether the read is a member of a particular reference sequence (i.e., whether the read is present in or absent from the reference sequence). For example, aligning a read to the reference sequence of human chromosome 13 tells whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may be called a set membership tester.
  • In some cases, the alignment additionally indicates where in the reference sequence the read or tag maps.
  • For example, the alignment can indicate that the read is present on chromosome 13, and can further indicate that the read is on a particular strand and/or locus of chromosome 13.
  • The term "reference genome" or "reference sequence" refers to any particular known genomic sequence (whether partial or complete) of any organism or virus that can be used as a reference for identifying sequences from a subject.
  • Reference genomes for human subjects and many other organisms can be found at the National Center for Biotechnology Information (ncbi.nlm.nih.gov).
  • The reference sequence can be the sequence of human genome hg18 or hg19. More related databases are currently available for hg19, and hg19 has more bases than hg18, so the sample alignment rate is relatively high; hg19 is therefore preferred.
  • The term "read" refers to a sequence read from a portion of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of adjacent base pairs in the sample.
  • A read can be represented by the base-pair sequence (in A, T, C, G) of the sample portion. It can be stored in a storage device and processed as appropriate to determine whether it matches the reference sequence or meets other criteria. Reads can be obtained directly from the sequencing device or indirectly from stored sequence information relating to the sample.
  • In some cases, a read is a DNA sequence of sufficient length (e.g., at least about 30 bp) to identify a larger sequence or region, e.g., one that can be aligned and specifically assigned to a chromosome, genomic region, or gene.
  • The sequence information for each target region consists of the sequencing fragments in the alignment result that contain the site, and the sequencing depth of the site is the number of sequencing fragments in the alignment result that contain it.
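  • As a minimal Python sketch of this counting step, the depth of a detection region can be taken as the number of uniquely aligned reads that overlap it. The `Read` and `Region` types, the half-open coordinates, the overlap rule, and the region names are all illustrative assumptions; the patent does not prescribe a data layout.

```python
from typing import Dict, Iterable, NamedTuple


class Read(NamedTuple):
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    unique: bool  # True if the read aligned to exactly one location


class Region(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def region_depths(reads: Iterable[Read], regions: Iterable[Region]) -> Dict[str, int]:
    """Count uniquely aligned reads overlapping each detection region."""
    regions = list(regions)
    depths = {region.name: 0 for region in regions}
    for read in reads:
        if not read.unique:
            continue  # only uniquely aligned reads contribute to depth
        for region in regions:
            same_chrom = read.chrom == region.chrom
            overlaps = read.start < region.end and region.start < read.end
            if same_chrom and overlaps:
                depths[region.name] += 1
    return depths


if __name__ == "__main__":
    reads = [
        Read("chr7", 100, 250, True),
        Read("chr7", 240, 390, True),
        Read("chr7", 245, 395, False),  # multi-mapped, ignored
        Read("chr8", 100, 250, True),
    ]
    regions = [
        Region("region_A", "chr7", 200, 300),    # hypothetical region names
        Region("region_B", "chr8", 1000, 1100),
    ]
    print(region_depths(reads, regions))
```

  • The nested loop is quadratic and only suitable for illustration; a production pipeline would more likely use an interval index or an existing coverage tool.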
  • The method provided by the present invention for detecting copy number variation may be applied to the following genes:
  • Cancer-related sensitive genes, such as:
  • Lung cancer-related hotspot genes: AKT1, AKT2, AKT3, CD274 (PD-L1), DDR2, EGFR, ERBB2, FGFR1, FGFR3, FLT4, HGF, MET, MTOR, MYC, PDCD1LG2 (PD-L2), PDGFRA, PDGFRB, SOX2, TP53, VEGFA.
  • Intestinal cancer-related hotspot genes: AKT1, AKT2, AKT3, BRAF, EGFR, ERBB2, FLT1, FLT4, HRAS, KDR, KRAS, MAP2K1, MAP2K2, MET, MTOR, NRAS, PIK3CA, RB1, TOP1, VEGFA.
  • Blood cancer-related hotspot genes: BCL2, BCL6, CCND1, CCND3, CDK6, CEBPA, CEBPB, CEBPD, FGFR3, HOXA10, IL3, IRF4, KMT2A, LYL1, MUC1, MYC, NOTCH1, SETBP1, TAL1, ZAP70.
  • The region to be detected may be any region of interest; it is not limited to exons or genes and has no length limit.
  • The number of detection regions above may be 3 to 50,000.
  • When regions of interest are detected, they can be combined with any other regions to form a panel for simultaneous NGS sequencing.
  • Probes designed from the reference gene sequences capture these regions of interest for detection. If a large gene is the region of interest, it is also possible to detect only a few of its sub-regions (e.g., exon regions) and to combine these sub-regions with other regions into a panel for simultaneous NGS sequencing.
  • In that case, probes are designed from the reference sequences of the sub-regions to capture and detect them, and the average of the sub-region copy numbers is taken as the copy number of the gene (here the average may be an arithmetic mean, a mean-square average, an F-distribution mean, or a T-distribution mean).
  • In step 101, the normal samples, the sample to be tested, and the control sample are subjected to NGS sequencing, and the sequence information of the corresponding sites is obtained by high-throughput sequencing. This can be carried out according to conventional experimental methods, textbooks, probe design methods, and the sequencer manual.
  • The main process includes: extracting DNA from tissue or whole-blood samples of each sample to be tested and each normal sample to obtain genomic DNA; for samples whose DNA fragments are too large, shearing the DNA mechanically by ultrasonication to 200-350 base pairs; performing end repair, adenine addition, and library adapter ligation on the fragmented DNA molecules to obtain a DNA library; hybridizing the library with single-stranded biotin-labeled DNA probes 120 bases in length and separating the captured DNA library molecules with streptavidin-coated magnetic beads; and sequencing on an Illumina next-generation sequencer.
  • The data obtained from the sequencing reaction are analyzed by bioinformatics. After the sequencing information is obtained, the data can be preprocessed by conventional methods.
  • Preprocessing here mainly means filtering each sample's sequences separately to remove unqualified sequences and adapter sequences, where the samples include a target sample (i.e., variant tissue) and a control sample (i.e., normal tissue). A sequence is unqualified in at least one of the following cases: the number of bases whose sequencing quality is below a certain threshold exceeds a certain proportion (for example, 50%) of the bases in the whole sequence; or the number of bases whose sequencing result is indeterminate (for example, N in Illumina GA sequencing results) exceeds a certain proportion (for example, 10%) of the bases in the whole sequence.
  • The high-throughput sequencing technology can be Illumina GA or HiSeq sequencing, or another existing high-throughput sequencing technology, and the low-quality threshold is determined by the specific sequencing technology and sequencing environment.
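  • The two filtering criteria can be sketched as a small Python predicate. The 50% and 10% proportions come from the examples in the text; the function name and the quality cutoff of 20 are assumptions, since the text leaves the threshold to the sequencing technology and environment.

```python
from typing import Sequence


def is_unqualified(
    seq: str,
    quals: Sequence[int],
    min_quality: int = 20,             # assumed cutoff, platform dependent
    max_low_quality_frac: float = 0.5,  # example proportion from the text
    max_n_frac: float = 0.1,            # example proportion from the text
) -> bool:
    """Return True if a read should be discarded before alignment."""
    if not seq or len(seq) != len(quals):
        raise ValueError("seq and quals must be non-empty and of equal length")
    low_quality = sum(1 for q in quals if q < min_quality)
    undetermined = seq.upper().count("N")
    too_many_low = low_quality / len(seq) > max_low_quality_frac
    too_many_n = undetermined / len(seq) > max_n_frac
    return too_many_low or too_many_n


if __name__ == "__main__":
    print(is_unqualified("ACGTACGTAC", [30] * 10))  # clean read
    print(is_unqualified("ACGTNNGTAC", [30] * 10))  # 20% undetermined bases
```

  • Adapter trimming is a separate step that this sketch does not cover.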
  • Each filtered sample sequence is aligned to the reference genome sequence, and the aligned sequences of each sample are screened to obtain the uniquely aligned sample sequences. Alignment can use a short-sequence mapping program, e.g., the Short Oligonucleotide Analysis Package (SOAP).
  • The sequencing depth of each target region in each sample is then counted from the uniquely aligned sequences obtained above, i.e., the reads on each target region are counted. Because the invention sequences a panel of multiple target regions as a whole, copy number variation at some positions changes the overall average sequencing depth of the sample. As a result, the calculated average sequencing depth over the panel varies with the choice of target regions, which makes the copy number result unstable. The effect grows as the number of target regions decreases and as the degree of copy number variation increases.
  • In step 103, the sequencing depths of all detection regions in the same sample are pooled, with the goal of obtaining the average sequencing depth of that sample. For the problem described in step 102, the invention unexpectedly found that calculating the average by T-distribution fitting solves the stability problem very effectively.
  • T-distribution statistics are applied to the sequencing depths of all detection regions in one sample, the mean of the fitted distribution is calculated, and this value is used as the average sequencing depth of the sample. Similarly, for a panel containing multiple sub-regions of a gene, T-distribution statistics are applied to the sequencing depths of all the sub-regions and the other regions to obtain the average sequencing depth. In this step, the average sequencing depths of the normal samples, the sample to be tested, and the control sample can each be computed separately.
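  • The text does not specify how the T distribution is fitted. As one possible reading, the Python sketch below estimates the location of a Student's t distribution with an EM iteration at a fixed number of degrees of freedom. The function name, the choice of 4 degrees of freedom, and the example depths are assumptions for illustration only.

```python
import math
import statistics
from typing import Sequence, Tuple


def fit_t_location(
    depths: Sequence[float],
    dof: float = 4.0,      # assumed degrees of freedom, not from the patent
    max_iter: int = 500,
    tol: float = 1e-9,
) -> Tuple[float, float]:
    """Estimate location and scale of a Student's t distribution by EM.

    Regions far from the current centre get small weights, so a few
    amplified or deleted regions barely move the estimated average depth.
    """
    n = len(depths)
    if n < 3:
        raise ValueError("need at least 3 detection regions")
    mu = float(statistics.median(depths))
    scale2 = float(statistics.pvariance(depths))
    if scale2 == 0.0:
        return mu, 0.0  # all regions have identical depth
    for _ in range(max_iter):
        # E-step: weight of each region under the current fit
        weights = [(dof + 1.0) / (dof + (x - mu) ** 2 / scale2) for x in depths]
        # M-step: weighted location, then weighted scale
        new_mu = sum(w * x for w, x in zip(weights, depths)) / sum(weights)
        new_scale2 = sum(w * (x - new_mu) ** 2 for w, x in zip(weights, depths)) / n
        converged = abs(new_mu - mu) < tol and abs(new_scale2 - scale2) < tol
        mu, scale2 = new_mu, new_scale2
        if converged or scale2 == 0.0:
            break
    return mu, math.sqrt(scale2)


if __name__ == "__main__":
    # Eight regions near depth 100 and one strongly amplified region.
    depths = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0, 400.0]
    location, scale = fit_t_location(depths)
    print("arithmetic mean:", statistics.mean(depths))
    print("t-fit location:", location)
```

  • In this example the arithmetic mean is pulled well above 100 by the single amplified region, while the fitted location stays close to 100, which matches the stability argument made in the text.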
  • In step 104, the copy number of each region to be detected in each sample is calculated by the odds ratio (OR) method.
  • For the sample to be tested: copy number = (sequencing depth of the region to be detected in the sample to be tested / average sequencing depth of the sample to be tested) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
  • For a normal sample: copy number = (sequencing depth of the region to be detected in the normal sample / average sequencing depth of the normal sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
  • For a sub-region in the sample to be tested: copy number = (sequencing depth of the sub-region in the sample to be tested / average sequencing depth of the sample to be tested) / (sequencing depth of the sub-region in the control sample / average sequencing depth of the control sample) × ploidy.
  • For a sub-region in a normal sample: copy number = (sequencing depth of the sub-region in the normal sample / average sequencing depth of the normal sample) / (sequencing depth of the sub-region in the control sample / average sequencing depth of the control sample) × ploidy.
  • The ratio-of-ratios calculation achieves the following. First, it reduces the effect of a region's GC content on sequencing depth: because the sample to be tested and the control sample are compared over the same region, the effect cancels in the division. Second, it handles regions whose depth is lower than that of other regions because of intrinsic characteristics such as being hard to capture. Such a region would ordinarily be taken to have a reduced copy number, but after the ratio calculation only the change of the sample to be tested relative to the control sample matters, not the absolute value, so the copy number variation region can still be identified accurately. Finally, if a standard control sample is used, results from different samples can be compared with one another, because they are all measured against the same standard.
  • For a gene, its sub-regions are detected, and the average of the sub-region copy numbers is then used as the copy number of the gene.
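  • The ratio formula and the gene-level averaging can be written as a short Python sketch. The function and parameter names are illustrative, and the arithmetic mean is used for the gene-level average, which is only one of the averaging options the text allows.

```python
from statistics import mean
from typing import Sequence


def copy_number(
    region_depth: float,
    sample_avg_depth: float,
    control_region_depth: float,
    control_avg_depth: float,
    ploidy: float = 2.0,
) -> float:
    """Ratio of normalized depths (sample vs. control) scaled by ploidy."""
    sample_ratio = region_depth / sample_avg_depth
    control_ratio = control_region_depth / control_avg_depth
    return sample_ratio / control_ratio * ploidy


def gene_copy_number(sub_region_copy_numbers: Sequence[float]) -> float:
    """Gene-level copy number as the arithmetic mean over its sub-regions."""
    return mean(sub_region_copy_numbers)


if __name__ == "__main__":
    # A region sequenced at 3x the sample average, where the control sits
    # at 1.5x its own average, comes out at twice the diploid copy number.
    print(copy_number(300.0, 100.0, 150.0, 100.0))
```

  • Because the control's normalized depth appears in the denominator, a region that is systematically hard to capture is under-covered in both samples and the bias cancels.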
  • The purpose of this step is to obtain a baseline for determining whether the target region has copy number variation: a T-distribution fit is performed on the copy numbers of the region to be detected (or of the gene containing sub-regions) across all normal samples, and the resulting mean is used as the copy number baseline. The baseline value may be composed of mean ± standard deviation (sd).
  • The criteria are statistically grounded. For example, mean ± 2 sd covers roughly 95% of samples and mean ± 3 sd covers more than 99% of samples. If a value falls outside this range, a p-value can be calculated by hypothesis testing to assess its significance.
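  • A minimal Python sketch of the baseline comparison follows. To stay short it substitutes the plain sample mean and standard deviation for the T-distribution fit described in the text; the function names, the multiplier `k = 2`, and the labels "gain", "loss", and "normal" are illustrative assumptions.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def copy_number_baseline(
    normal_copy_numbers: Sequence[float], k: float = 2.0
) -> Tuple[float, float]:
    """Return the (low, high) bounds mean - k*sd and mean + k*sd.

    Uses the sample mean and standard deviation as a stand-in for the
    T-distribution fit described in the text.
    """
    center = mean(normal_copy_numbers)
    spread = stdev(normal_copy_numbers)
    return center - k * spread, center + k * spread


def classify(test_copy_number: float, baseline: Tuple[float, float]) -> str:
    """Compare a test sample's copy number against the baseline bounds."""
    low, high = baseline
    if test_copy_number > high:
        return "gain"
    if test_copy_number < low:
        return "loss"
    return "normal"


if __name__ == "__main__":
    normals = [1.9, 2.0, 2.1, 2.05, 1.95, 2.0]
    bounds = copy_number_baseline(normals)
    for value in (4.0, 1.0, 2.05):
        print(value, classify(value, bounds))
```

  • A value outside the bounds is only a candidate; as the text notes, its significance can then be assessed with a hypothesis test.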
  • The present invention also provides a device for detecting copy number variation, which may be composed of a plurality of modules, as shown in FIG. 2.
  • When used to detect a region to be detected, the device includes: a sequencing data acquisition module, configured to perform NGS sequencing on the sample to be tested and the normal samples using a panel containing a plurality of detection regions, and to align the sequencing data of each sample to the reference genome sequence to obtain the reads uniquely aligned to the detection regions, the panel being composed of the region to be detected and other regions; a sequencing depth calculation module, configured to calculate the sequencing depth of each detection region in each sample from the uniquely aligned reads, to perform a T-distribution fit on the sequencing depths of the detection regions within the same sample, and to take the mean of the fitted distribution as the average sequencing depth of that sample; a copy number calculation module, configured to calculate the copy number of the region to be detected in each normal sample and in the sample to be tested; a baseline calculation module, configured to fit a T distribution to the copy numbers of the region to be detected in the normal samples and to use the mean of the fitted distribution as the average copy number baseline of the region to be detected; and an analysis module, configured to compare the copy number of the region to be detected in the sample to be tested with the average copy number baseline to determine whether copy number variation is present.
  • When used to detect a gene comprising several sub-regions, the apparatus includes: a sequencing data acquisition module, for performing NGS sequencing on the sample under test and the normal samples using a panel containing multiple detection regions, and for aligning the raw sequencing data of each sample to the reference genome sequence to obtain the reads that align uniquely to the detection regions, the detection regions being composed of the sub-regions under test of the gene to be detected and other regions; a sequencing depth calculation module, for calculating the sequencing depth of each detection region in each sample from the uniquely aligned reads, performing a T-distribution fit on the sequencing depths of the detection regions within one sample, and taking the mean of the distribution curve as the average sequencing depth of that sample; and a copy number calculation module, for calculating the copy number of the gene to be detected in each normal sample and in the sample under test.
  • The apparatus further includes a baseline calculation module, for performing a T-distribution fit on the copy numbers of the gene to be detected in the normal samples and taking the mean of the distribution curve as the average copy number baseline of that gene; and an analysis module, for comparing the copy number of the gene to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present. The copy number of the gene to be detected in the sample under test and in the normal samples is the mean of the copy numbers of its sub-regions under test.
  • In one embodiment, copy number = (sequencing depth of the sub-region under test in the sample / average sequencing depth of the sample) / (sequencing depth of the sub-region under test in the control sample / average sequencing depth of the control sample) × ploidy.
  • The modules or steps of the present invention described above can be implemented on general-purpose computing devices, either concentrated on a single computing device or distributed over a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.
  • The target of this example is the copy number of the MYCN gene, which contains two exon regions; the main information on the target region is as follows:
  • the average value of the z-distribution curve of their MYCN gene OR value is 1.08, and the sd is 0.075.
  • The patient sample here is the HD786 reference standard supplied by the company at https://www.horizondiscovery.com. It contains a MYCN amplification; the officially stated copy number OR is 4.75, and the copy number OR obtained for the gene by the NGS method and the algorithm described above is about 3.95.
  • the control sample used here is the NA18535 standard cell.
  • The detection panel also included exon regions of other sensitive genes; different numbers of these regions were combined with the two exons of the MYCN gene to form panels.
  • A total of 4,684 other candidate regions were selected; information on 100 of them, for example, is as follows:
  • The two exons of the MYCN gene are combined with 10, 50, 100, 300, 400, ... 4000, and 4684 exon regions of other hotspot genes to form panels, which are sequenced together by NGS.
  • The number of regions in each panel is shown in Table 3.
  • Where 10, 50, or 100 other regions are used, they are the exon regions numbered 1 to 10, 1 to 50, and 1 to 100 in Table 2; the remaining regions are not listed here.
  • the detection steps used are as follows:
  • The sequences under test are sequenced with Illumina high-throughput sequencing technology. After the sequencing reads are received, they are filtered to remove unqualified sequences, and sample adapter sequences are removed from the sequence fragments.
  • An unqualified sequence is one in which the number of bases with a sequencing quality value below 5 exceeds 50% of the bases in the whole sequence, or the number of N in the sequencing result exceeds 10% of the bases in the whole sequence.
  • The Short Oligonucleotide Analysis Package (SOAP) mapping program is used to align the sequenced fragments of the cancer samples and para-cancerous samples obtained by high-throughput sequencing to the human reference genome sequence. Sequences with multiple alignments are screened out, duplicated sequencing reads are removed, and reads with identical base sequences in the single-end sequencing data are collapsed (only one copy is retained) to reduce false positives. Finally, the chromosome number and position information to be processed are extracted from the alignment results as required.
  • For the sample under test, the copy number of a particular detection region = (sequencing depth of the detection region in the sample under test / average sequencing depth of the sample under test) / (sequencing depth of that region in the control sample / average sequencing depth of the control sample) × ploidy.
  • For normal population samples, copy number = (sequencing depth of the detection region in the normal sample / average sequencing depth of the normal sample) / (sequencing depth of that region in the control sample / average sequencing depth of the control sample) × ploidy.
  • This gives the copy number of every region in each sample under test and normal sample.
  • the MYCN gene has two exons, and after calculating the copy number for the two exon regions, the arithmetic mean of the two is taken as the copy number of the gene.
  • As a control, the average sequencing depth of each sample in step 3 was instead calculated as the arithmetic mean of the sequencing depths of all detection regions in that sample.
  • The two MYCN exons were also combined with 10 different other regions to form panels for copy number detection.
  • The exon regions numbered 1 to 10, 21 to 30, 41 to 50, 61 to 70, and 81 to 90 in Table 2 were each combined with MYCN to form a panel (groups 1 to 5), and the MYCN copy number was detected in the same manner; the results are shown in Table 5:
  • The target of this example is the copy number of the MET gene, which contains 20 exon regions; the main information on the target region is shown in the following table:
  • NM_000245 e2 Chr7:116339138 Chr7:116340338
  • NM_000245 e3 Chr7:116371721 Chr7:116371913
  • NM_000245 e4 Chr7:116380003 Chr7:116380138
  • NM_000245 e5 Chr7:116380905 Chr7:116381079
  • NM_000245 e6 Chr7:116395408 Chr7:116395569
  • NM_000245 e7 Chr7:116397490 Chr7:116397593
  • NM_000245 e8 Chr7:116397691 Chr7:116397828
  • NM_000245 e9 Chr7:116398512 Chr7:116398674
  • NM_000245 e10 Chr7:116399444 Chr7:116399544
  • NM_000245 e11
  • the average value of the z-distribution curve of their MET gene OR values is 1.01, and the sd is 0.039.
  • The patient sample here is the HD786 reference standard supplied by the company at https://www.horizondiscovery.com. It contains a MET amplification; the officially stated copy number OR is 2.25, and the OR obtained for the gene by the NGS method and the algorithm described above is 1.72.
  • the control sample used here is the NA18535 standard cell.
  • The detection panel also included exon regions of other sensitive genes; different numbers of these regions were combined with the 20 exons of the MET gene to form panels. A total of 4666 other candidate regions were selected; information on 100 of them, for example, is shown in Table 2.
  • The 20 exons of the MET gene are combined with 10, 50, 100, 300, 400, ... 4000, and 4666 exon regions of other hotspot genes to form panels.
  • The number of regions in each panel is shown in Table 3.
  • Where 10, 50, or 100 other regions are used, they are the exon regions numbered 1 to 10, 1 to 50, and 1 to 100 in Table 2; the remaining regions are not listed here.
  • the detection procedure employed was the same as in Example 1.
  • the arithmetic mean of the copy number of the 20 MET exon regions was taken as the MET gene copy value, and the average sequencing depth of each sample was also calculated as the control using the arithmetic mean in step 3.
  • Panel composition, arithmetic-mean method, t-distribution method:
    20 MET exons + 4666 other regions 1.72 1.72
    20 MET exons + 4000 other regions 1.98 1.71
    20 MET exons + 3000 other regions 2.62 1.72
    20 MET exons + 2000 other regions 3.87 1.71
    20 MET exons + 1000 other regions 7.33 1.71
    20 MET exons + 900 other regions 8.14 1.7
  • The results here are similar to those of Example 1.
  • The MET gene has 20 exons; after the copy number of each of the 20 exon regions is calculated, their arithmetic mean is taken as the copy number of the gene.
  • the influence of the number of other detection areas in the panel can be avoided, so that the value of each detection result is consistent with the actual value of the standard.
  • The MET exons were also combined with 10 different other regions to form panels for copy number detection.
  • The exon regions numbered 1 to 10, 21 to 30, 41 to 50, 61 to 70, and 81 to 90 in Table 2 were each combined with MET to form a panel (groups 1 to 5), and the MET copy number was detected in the same manner.
  • The results are shown in Table 8:
  • The target of this example is the copy number of the CDKN2A gene, which contains five exon regions; the main information on the target region is shown in the following table:
  • the average value of the z-distribution curve of the OR value of their CDKN2A gene is 1.01, and the sd is 0.054.
  • The patient sample here is the HD786 reference standard supplied by the company at https://www.horizondiscovery.com.
  • No official copy number variation value is provided for this gene in the product, but the OR obtained for the gene by the NGS method and the algorithm described above is about 0.62.
  • The detection panel also included exon regions of other sensitive genes; different numbers of these regions were combined with the 5 exons of the CDKN2A gene to form panels. A total of 4681 other candidate regions were selected; information on 100 of them, for example, is shown in Table 2.
  • The 5 exons of the CDKN2A gene are combined with 25, 50, 100, 300, 400, ... 4000, and 4681 exon regions of other hotspot genes to form panels.
  • The number of regions in each panel is shown in Table 3.
  • Where 25, 50, or 100 other regions are used, they are the exon regions numbered 1 to 25, 1 to 50, and 1 to 100 in Table 2; the remaining regions are not listed here.
  • the detection procedure employed was the same as in Example 1.
  • the arithmetic mean of the copy number of the 5 CDKN2A exon regions was taken as the CDKN2A gene copy value, and the average sequencing depth of each sample was also calculated as the control using the arithmetic mean in step 3.
  • The results here are similar to those of Example 1.
  • The CDKN2A gene has five exons; after the copy number of each of the five exon regions is calculated, the arithmetic mean of the five copy numbers is taken as the copy number of the gene.
  • the influence of the number of other detection areas in the panel can be avoided, so that the value of each detection result is consistent with the actual value of the standard.
  • The CDKN2A exons were also combined with 25 different other regions to form panels for copy number detection.
  • The exon regions numbered 1 to 25, 26 to 50, 51 to 75, and 76 to 100 in Table 2 were each combined with CDKN2A to form a panel (groups 1 to 4), and the CDKN2A copy number was detected in the same manner.
  • The results are shown in Table 11:
  • The Variation 91720 site of the PIK3CA gene was used as the region to be detected and combined with the 4681 other gene regions of Example 3 into a panel for high-throughput sequencing; the copy number of the Variation 91720 site was calculated by the method of the present invention.
  • The case group comprised 33 cases from the Han population of southern China; the normal population was the 349 individuals used in the examples above.
  • the median age of the case group was 56 years old, with an average of 55.1 years.
  • the pathological type was ESCC, accounting for 84.8%, followed by adenocarcinoma, accounting for 12.1%, and other pathological types accounting for 3%.
  • Decision criteria: a gene copy number above or below the baseline mean ± 3sd is called a copy number amplification or deletion of the gene; the average sequencing depth of the control sample over the detection region must reach 5x, and the covered fraction must exceed 70% of the region length.
  • the detection result of the present invention can be applied to the detection of copy number variation of a case, and the detection result is close to the result of detecting the gene alone by real-time fluorescent PCR.


Abstract

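The invention discloses a method, an apparatus, and a computer-readable medium for detecting copy number variation of gene fragments, and in particular a method for detecting copy number variation of gene fragments from high-throughput sequencing (NGS) data; it belongs to the technical field of bioinformatics. When the detection method is applied to NGS sequencing of a group of genes (a panel) made up of multiple detection regions, the average sequencing depth is obtained by T-distribution fitting, from which a copy number baseline is calculated and abnormal gene fragments are identified. The analysis effectively removes the influence on the result of fluctuations in the sequencing depth of each detection region caused by the NGS experimental procedure, and avoids the large fluctuations in copy number values, and the inconsistent results across different panels, that arise when a copy number variation raises or lowers the sequencing depth of other regions.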

Description

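Method, apparatus, and computer-readable medium for detecting copy number variation
Technical Field
The present invention relates to a method, an apparatus, and a computer-readable medium for detecting copy number variation, and in particular to a technique for detecting copy number variation from high-throughput sequencing data; it belongs to the technical field of bioinformatics.
Background Art
The human genome contains a large number of variations. According to the number of bases affected, genetic variation in the genome is divided into single nucleotide variants (SNVs) and structural variations (SVs). Copy number variations (CNVs) are one form of structural variation: deletions, insertions, duplications, and complex multi-site variations of DNA fragments ranging from 50 bp to several Mb, relative to a reference genome. Recent studies show that CNVs of genomic fragments affect gene expression by changing gene dosage or chromosome conformation, causing pathological changes in the organism and influencing the development of disease, and that they play an increasingly important role in studies of phenotypic polymorphism and evolution. At present, genome-wide searches for CNVs rely mainly on two technologies: the DNA chip and next-generation sequencing (NGS).
NGS-based detection is a CNV detection approach that has developed rapidly in recent years. The number of sequenced fragments in a genomic region represents the amount of that region's genes, from which regions with abnormal gene content are identified. Its advantage is that, given sufficient sequencing depth, NGS data yield more accurate breakpoint positions and higher resolution than chip-based detection, allowing finer genome-wide CNV detection. NGS-based methods fall into four types: read-depth based, paired-read based, sequence assembly based, and split-read based. By core detection technique they fall into two classes: those based on probabilistic statistical models and those based on machine learning. Statistical methods detect copy number variation regions mainly from the statistical features of the read-depth signal. They assume that sequencing is uniform, that is, that the read-depth signal of windows along a chromosome follows some distribution, such as the Poisson distribution, and that the read-depth signal is linearly related to copy number. An increase or decrease in the read-depth signal over consecutive windows therefore indicates an increase or decrease in copy number, and hence a copy number variation region.
In existing high-throughput sequencing workflows, multiple genes or multiple sensitive site regions are usually combined into a panel for detection. However, once a copy number variation occurs in some genes, the sequencing depth of some other regions is affected and rises or falls. This makes the average sequencing depth calculated over the panel unstable, which in turn makes the calculated results inconsistent across different panels.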
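Summary of the Invention
The problem to be solved by the present invention is that the calculated copy number of a specific gene is unstable when a panel made up of multiple detection regions is tested.
A first aspect of the present invention provides a method for calculating the copy number of one or more detection regions/sites/genes. In this method a region/site/gene is treated as one detection region; the detection region can be combined with other sites/regions into a panel that is sequenced and analyzed together by NGS. The technical solution is:
A method for detecting copy number variation, comprising the following steps:
S1, combining the region to be detected with other regions into a panel, and performing NGS sequencing on the sample under test, the control sample, and the normal samples using the panel;
S2, calculating, from the result of step S1, the copy number of the region to be detected in each normal sample and in the sample under test;
S3, performing a T-distribution fit on the copy numbers of the region to be detected in the normal samples, and taking the mean of the distribution curve as the average copy number baseline of the region to be detected;
S4, comparing the copy number of the region to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present;
wherein the copy number of the region to be detected in the sample under test and in the normal samples is calculated as follows: the NGS sequencing results of the sample and of the control sample are aligned to the reference genome sequence to obtain the reads that align uniquely to the detection regions, and the sequencing depth of each detection region in each sample is calculated; a T-distribution fit is performed on the sequencing depths of the detection regions within one sample, and the mean of the distribution curve is taken as the average sequencing depth of that sample; the copy number of the region to be detected in each sample is then calculated from the average sequencing depth.
In one embodiment, the number of detection regions in the panel is 3 to 50,000.
In one embodiment, the copy number is calculated as: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
In one embodiment, the baseline value consists of the mean ± standard deviation (sd).
In one embodiment, the region to be detected includes, but is not limited to, any fragment of the following genes: AKT1, AKT2, AKT3, CD274(PD-L1), DDR2, EGFR, ERBB2, FGFR1, FGFR3, FLT4, HGF, MET, MTOR, MYC, PDCD1LG2(PD-L2), PDGFRA, PDGFRB, SOX2, TP53, VEGFA, BRAF, FLT1, HRAS, KDR, KRAS, MAP2K1, MAP2K2, NRAS, PIK3CA, RB1, TOP1, VEGFA, BCL2, BCL6, CCND1, CCND3, CDK6, CEBPA, CEBPB, CEBPD, HOXA10, IL3, IRF4, KMT2A, LYL1, MUC1, MYC, NOTCH1, SETBP1, TAL1, ZAP70.
The method for detecting copy number variation is used for non-therapeutic and non-diagnostic purposes.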
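A second aspect of the present invention provides a method for detecting copy number variation of a gene. The method is mainly applied to genes containing multiple regions to be detected (for example, exon regions): the copy number of each region to be detected in the gene is calculated separately, the mean of these copy numbers is taken as the copy number of the gene, and the comparison is then performed. The technical solution is:
A method for detecting copy number variation, comprising the following steps:
S1, selecting at least two regions to be detected from the sequence of the gene under test, combining them with other regions into a panel, and performing NGS sequencing on the sample under test, the control sample, and the normal samples using the panel;
S2, calculating, from the result of step S1, the copy number of the gene to be detected in each sample under test and each normal sample;
S3, performing a T-distribution fit on the copy numbers of the gene to be detected in the normal samples, and taking the mean of the distribution curve as the average copy number baseline of the gene to be detected;
S4, comparing the copy number of the gene to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present;
wherein the copy number of the gene to be detected in the sample under test and in the normal samples is the mean of the copy numbers of its regions to be detected;
the copy number of a region to be detected is calculated as follows: the NGS sequencing results of the sample and of the control sample are aligned to the reference genome sequence to obtain the reads that align uniquely to the detection regions, and the sequencing depth of each detection region in each sample is calculated; a T-distribution fit is performed on the sequencing depths of the detection regions within one sample, and the mean of the distribution curve is taken as the average sequencing depth of that sample; the copy number of the region to be detected in each sample is then calculated from the average sequencing depth.
In one embodiment, the number of detection regions in the panel is 3 to 50,000.
In one embodiment, the copy number is calculated as: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
In one embodiment, the mean of the copy numbers of the regions to be detected is the arithmetic mean, the normal-distribution mean, the chi-square-distribution mean, the F-distribution mean, or the T-distribution mean.
The gene to be detected includes, but is not limited to: AKT1, AKT2, AKT3, CD274(PD-L1), DDR2, EGFR, ERBB2, FGFR1, FGFR3, FLT4, HGF, MET, MTOR, MYC, PDCD1LG2(PD-L2), PDGFRA, PDGFRB, SOX2, TP53, VEGFA, BRAF, FLT1, HRAS, KDR, KRAS, MAP2K1, MAP2K2, NRAS, PIK3CA, RB1, TOP1, VEGFA, BCL2, BCL6, CCND1, CCND3, CDK6, CEBPA, CEBPB, CEBPD, HOXA10, IL3, IRF4, KMT2A, LYL1, MUC1, MYC, NOTCH1, SETBP1, TAL1, ZAP70.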
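A third aspect of the present invention provides a detection apparatus implementing the method for calculating the copy number of one or more detection regions/sites/genes. The apparatus treats a region/site/gene as one detection region; the detection region can be combined with other sites/regions into a panel that is sequenced and analyzed together by NGS. The technical solution is:
An apparatus for detecting copy number variation, comprising:
a sequencing data acquisition module, for performing NGS sequencing on the sample under test, the control sample, and the normal samples using a panel composed of the region to be detected and other regions, and for aligning the raw sequencing data of the samples to the reference genome sequence to obtain the reads that align uniquely to the detection regions;
a sequencing depth calculation module, for calculating the sequencing depth of each detection region in each sample from the reads that align uniquely to the detection regions, performing a T-distribution fit on the sequencing depths of the detection regions within one sample, and taking the mean of the distribution curve as the average sequencing depth of that sample;
a copy number calculation module, for calculating the copy number of the region to be detected in each normal sample and in the sample under test;
a baseline calculation module, for performing a T-distribution fit on the copy numbers of the region to be detected in the normal samples and taking the mean of the distribution curve as the average copy number baseline of the region to be detected;
an analysis module, for comparing the copy number of the region to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present.
In one embodiment, the copy number is calculated by the following formula: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
In one embodiment, the number of detection regions in the panel is 3 to 50,000.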
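A fourth aspect of the present invention provides an apparatus for detecting copy number variation of a gene, applied to genes containing multiple regions to be detected (for example, exon regions). The technical solution is:
An apparatus for detecting copy number variation, comprising:
a sequencing data acquisition module, for performing NGS sequencing on the sample under test, the control sample, and the normal samples using a panel, and for aligning the raw sequencing data of the samples to the reference genome sequence to obtain the reads that align uniquely to the detection regions; the panel is composed of at least two regions to be detected and other regions, the regions to be detected being selected from the sequence of the gene under test;
a sequencing depth calculation module, for calculating the sequencing depth of each detection region in each sample from the reads that align uniquely to the detection regions, performing a T-distribution fit on the sequencing depths of the detection regions within one sample, and taking the mean of the distribution curve as the average sequencing depth of that sample;
a copy number calculation module, for calculating the copy number of the gene to be detected in each normal sample and in the sample under test;
a baseline calculation module, for performing a T-distribution fit on the copy numbers of the gene to be detected in the normal samples and taking the mean of the distribution curve as the average copy number baseline of the gene to be detected;
an analysis module, for comparing the copy number of the gene to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present;
wherein the copy number of the gene to be detected in the sample under test and in the normal samples is the mean of the copy numbers of its regions to be detected.
In one embodiment, the copy number is calculated by the following formula: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
In one embodiment, the number of detection regions in the panel is 3 to 50,000.
A fifth aspect of the present invention provides:
A computer-readable medium on which is recorded a program capable of running the above method for detecting copy number variation.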
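Beneficial Effects
When the method for detecting copy number variation provided by the present invention is applied to NGS sequencing of a panel made up of multiple detection regions, it effectively avoids the large fluctuations in copy number values, and the inconsistent results across different panels, that arise when a copy number variation raises or lowers the sequencing depth of other regions. Obtaining the average sequencing depth by T-distribution fitting effectively removes the influence of fluctuations in the sequencing depth of each detection region on the calculation, and calculating a copy number baseline makes it possible to identify abnormal detection regions effectively.
Brief Description of the Drawings
Figure 1 is a flowchart of the detection method provided by the present invention.
Figure 2 is a structural diagram of the detection apparatus provided by the present invention.
Figure 3 is a comparison chart of MYCN gene detection results for the detection method provided by the present invention.
Figure 4 is a comparison chart of MET gene detection results for the detection method provided by the present invention.
Figure 5 is a comparison chart of CDKN2A gene detection results for the detection method provided by the present invention.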
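Detailed Description of the Embodiments
The present invention is described in further detail below through specific embodiments. Those skilled in the art will understand, however, that the following examples only illustrate the invention and should not be taken as limiting its scope. Where specific techniques or conditions are not stated in the examples, the techniques or conditions described in the literature of the field, or the product instructions, are followed. Reagents or instruments whose manufacturer is not stated are conventional products that are commercially available.
The words "comprise", "include", "have", and any variants thereof as used herein are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Unless the context clearly dictates otherwise, the singular forms "a/an" and "the" include plural referents.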
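The copy number variation detection method proposed by the present invention determines the copy number variation of each detection region by a read-depth approach. During detection, the detection regions together form a panel for high-throughput sequencing; these regions may be exons or other genes of interest. The principle is as follows: if a copy number variation occurs in a chromosomal region, the distribution of sequenced fragments in that region changes during high-throughput sequencing, that is, a copy number loss lowers the fragment density and a copy number gain raises it. The detection method provided by the present invention is used only to determine from sequencing results whether a copy number variation is present, and is not used for therapeutic or diagnostic purposes.
In practice, to exploit the high throughput and efficiency of high-throughput sequencing, many gene fragments and regions are usually combined into one detection panel and tested together. In common panels the number of detection regions may be a few dozen, or several hundred or several thousand. In a given test, only some of the genes or sites in the panel may be of interest; these are the genes or site regions to be detected. In the present invention, "region to be detected" means a region that must be tested when detecting copy number variation. "Other regions" means the regions added to the panel together with the regions to be detected; they may be regions that are important and need testing under other conditions, or regions whose copy number variation will be tested in a later step of the same assay. "Other regions" are added to raise the efficiency of high-throughput sequencing. Under these definitions, when this specification refers to "each detection region" or "the detection regions in the panel", it means both the "regions to be detected" and the "other regions", because both kinds are tested in a single high-throughput sequencing run, and data such as the sequencing depth of the "other regions" also enter the detection calculation.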
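As used herein, the terms "high-throughput sequencing", "next-generation sequencing", and the like refer to second-generation high-throughput sequencing technology and the higher-throughput sequencing methods developed after it. Next-generation sequencing platforms include, but are not limited to, Illumina (Miseq, Hiseq2000, Hiseq2500, Hiseq3000, Hiseq4000, HiseqX Ten, etc.), ABI-Solid, and Roche-454 sequencing platforms. As sequencing technology continues to develop, those skilled in the art will understand that other sequencing methods and devices may also be used for this detection. According to specific examples of the present invention, nucleic acid tags according to embodiments of the invention can be used for sequencing on at least one of the Illumina, ABI-Solid, and Roche-454 sequencing platforms. Next-generation sequencing technology such as Illumina sequencing has the following advantages: (1) High sensitivity: next-generation sequencing, for example Miseq, has a large sequencing throughput; at present one experimental run can generate up to 15G bases of data. For a fixed number of sequences, high data throughput gives each sequence a greater sequencing depth, so mutations present at lower levels can be detected; and because the depth is high and mutation sites are covered many times, the sequencing results are also more reliable. (2) High throughput and low cost: with tag sequences according to embodiments of the invention, tens of thousands of samples can be tested in a single sequencing run, which greatly reduces cost.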
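In the present invention, "sample under test" means a sample that needs to be tested to determine whether one or more gene regions in it carry a copy number variation. "Normal samples" are the samples used to calculate the decision baseline; they may be blood samples from a certain number of normal individuals selected for statistical analysis (several hundred normal samples may be used, and the number can be adjusted to the circumstances so that it is statistically meaningful). Calculating the copy number also requires a "control sample", against which the sequencing depth of the sample under test or normal sample is compared. The "control sample" may be a sample paired with the sample under test: when the sample under test has its own control, that control can be used directly; when it does not, a reference-standard cell DNA sample can be used as the control. For example, when the sample under test is of human origin, NA18535, a B-lymphocyte DNA sample from a healthy Beijing individual, can be used.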
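As used herein, the term "alignment" refers to the process of comparing a read or tag with a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read can be mapped to the reference sequence or, in some embodiments, to a particular position in the reference sequence. In some cases, alignment simply tells whether a read is or is not a member of a particular reference sequence (that is, whether the read is present or absent in the reference sequence). For example, aligning a read to the reference sequence of human chromosome 13 tells whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may be called a set membership tester. In some cases, alignment additionally indicates the position in the reference sequence to which the read or tag maps. For example, if the reference sequence is the whole human genome sequence, the alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13. The term "reference genome" or "reference sequence" refers to any specific known genome sequence, partial or complete, of any organism or virus that can be used as a reference for identified sequences from a subject. For example, reference genomes for human subjects and many other organisms can be found at the National Center for Biotechnology Information (ncbi.nlm.nih.gov). For human samples the reference sequence may be the human genome hg18 or hg19. At present there are relatively more databases for hg19, and hg19 contains more sequenced bases than hg18, so the sample alignment rate is relatively higher; hg19 is therefore preferred.
The term "read" refers to a sequence read from a portion of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. A read may be represented symbolically by the base-pair sequence (in ATCG) of the sample portion. It may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (for example, at least about 30 bp) that it can be used to identify a larger sequence or region, for example by being aligned with and specifically assigned to a chromosome, genomic region, or gene.
The sequence information for each target region is the set of sequenced fragments in the alignment result that contain the site; the sequencing depth information for a site is the number of sequenced fragments in the alignment result that contain the site.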
The method provided by the present invention may be used to detect copy number variation in genes such as the following:
ABL1, APC, ARID2, AURKA, BCL2, BLM, BTK, CCND2, CDC73, CDK8, CEBPA, CRKL, CTNNB1, EGFR, EPHB1, ESR1, FANCC, FANCL, FGF23, FGFR2, FLT4, GID4, GPR124, IDH1, IL7R, JAK2, KDM5C, KLHL6, MAP2K4, MED12, MLH1, MSH2, MYCL1, NFE2L2, NPM1, NUP93, PDGFRA, PIK3R1, PRKDC, RAD51, RICTOR, SF3B1, SMO, SPOP, SUFU, TOP1, VHL, ZNF703, AKT1, AR, ASXL1, AURKB, BCL2L2, BRAF, CARD11, CCND3, CDH1, CDKN1B, CHEK1, CRLF2, DAXX, EMSY, ERBB2, EZH2, FANCD2, FBXW7, FGF3, FGFR3, FOXL2, GNA11, GRIN2A, IDH2, INHBA, JAK3, KDM6A, KRAS, MAP3K1, MEF2B, MLL, MSH6, MYCN, NFKBIA, NRAS, PAK3, PDGFRB, PIK3R2, PTCH1, RAF1, RNF43, SMAD2, SOCS1, SRC, TET2, TP53, WISP3, BRCA1, AKT2, ARAF, ATM, AXL, BCL6, CSF1R, CBFB, CCNE1, CDK12, CDKN2A, CHEK2, FGF10, DDR2, EP300, ERBB3, FAM123B, FANCE, IGF1R, FGF4, FGFR4, GATA1, GNA13, GSK3B, MEN1, IRF4, JUN, KDR, LRP1B, MCL1, PALB2, MLL2, MTOR, MYD88, NKX2-1, NTRK1, SMAD4, PDK1, PPP2R1A, PTEN, RARA, RPTOR, BRCA2, SOX10, STAG2, TGFBR2, TSC1, WT1, CTCF, AKT3, ARFRP1, ATR, BAP1, BCOR, FGF14, CBL, CD79A, CDK4, CDKN2B, CIC, IKBKE, DNMT3A, EPHA3, ERBB4, FAM46C, FANCF, MET, FGF6, FLT1, GATA2, GNAQ, HGF, PAX5, IRS2, KAT6A, KEAP1, MAP2K1, MDM2, SMARCA4, MPL, MUTYH, NF1, NOTCH1, NTRK2, BRIP1, PIK3CA, PRDM1, PTPN11, RB1, RUNX1, CTNNA1, SOX2, STAT4, TNFAIP3, TSC2, XPO1, FGF19, ALK, ARID1A, ATRX, BARD1, BCORL1, IKZF1, CCND1, CD79B, CDK6, CDKN2C, CREBBP, MITF, DOT1L, EPHA5, ERG, FANCA, FANCG, PBRM1, FGFR1, FLT3, GATA3, GNAS, HRAS, SMARCB1, JAK1, KDM5A, KIT, MAP2K2, MDM4, MRE11A, MYC, NF2, NOTCH2, NTRK3, PIK3CG, PRKAR1A, RAD50, RET, SETD2, SPEN, STK11, TNFRSF14, TSHR, and ZNF217.
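It may also be applied to cancer-related sensitive genes such as the following:
Hotspot genes related to lung cancer: AKT1, AKT2, AKT3, CD274(PD-L1), DDR2, EGFR, ERBB2, FGFR1, FGFR3, FLT4, HGF, MET, MTOR, MYC, PDCD1LG2(PD-L2), PDGFRA, PDGFRB, SOX2, TP53, VEGFA.
Hotspot genes related to intestinal cancer: AKT1, AKT2, AKT3, BRAF, EGFR, ERBB2, FLT1, FLT4, HRAS, KDR, KRAS, MAP2K1, MAP2K2, MET, MTOR, NRAS, PIK3CA, RB1, TOP1, VEGFA.
Hotspot genes related to blood cancer: BCL2, BCL6, CCND1, CCND3, CDK6, CEBPA, CEBPB, CEBPD, FGFR3, HOXA10, IL3, IRF4, KMT2A, LYL1, MUC1, MYC, NOTCH1, SETBP1, TAL1, ZAP70.
The detected region can be any region of interest; it is not limited to exons and genes, and its length is not restricted. The number of detection regions mentioned above can be 3 to 50,000.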
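For example, when several regions of interest are to be tested, they can be combined with any number of other regions into a panel and sequenced together by NGS; during sequencing these regions of interest are captured and detected with probes designed from their reference gene sequences. When a larger gene is the region of interest, it is also possible to test only several sub-regions of the gene (for example, exon regions): these sub-regions are combined with any number of other regions into a panel and sequenced together by NGS, probes designed from the reference sequences of the sub-regions capture and detect them, and the mean of the sub-region copy numbers is taken as the copy number of the gene (the mean here may be the arithmetic mean, the normal-distribution mean, the chi-square-distribution mean, the F-distribution mean, or the T-distribution mean).
The flow of the copy number variation detection method provided by the present invention is shown in Figure 1:
In step 101, NGS sequencing is performed on the normal samples, the sample under test, and the control sample. Obtaining the sequence information of the relevant sites by high-throughput sequencing can follow conventional experimental methods, textbooks, probe design methods, and sequencer manuals. The main workflow is: DNA is extracted from the tissue sample or whole-blood sample of each sample under test and normal sample to obtain genomic DNA; samples whose DNA fragments are too large are sheared mechanically by sonication to 200-350 base pairs; end repair, adenine addition, and library adapter ligation are performed on the fragmented DNA molecules; the resulting DNA library is hybridized with 120-base single-stranded biotin-labeled DNA probes, and the captured library molecules are separated with streptavidin-coated magnetic beads; sequencing is performed on an Illumina next-generation sequencer. The data from the sequencing reaction are analyzed bioinformatically. Once the sequencing information is obtained, the data can be preprocessed by conventional methods. The preprocessing mainly filters the sequences of each sample to remove unqualified sequences and adapter sequences, where the samples include the target sample (that is, variant tissue) and the control sample (that is, normal tissue). Specifically, the sample sequences from high-throughput sequencing are filtered to remove unqualified sequences and adapter sequences, where an unqualified sequence meets at least one of the following conditions: the number of bases whose sequencing quality is below a threshold exceeds a set proportion (for example, 50%) of the bases in the whole sequence, or the number of bases with an undetermined sequencing result (for example, N in Illumina GA results) exceeds a set proportion (for example, 10%) of the bases in the whole sequence. The high-throughput sequencing technology may be Illumina GA or HiSeq, or another existing high-throughput sequencing technology; the low-quality threshold can be set according to the specific sequencing technology and sequencing environment. After the reads are preprocessed, the filtered sequences of each sample are aligned to the reference genome sequence, the aligned sequences of each sample are screened to obtain the uniquely aligned sequences, the position of each uniquely aligned sequence relative to the reference genome is determined, and the positions are sorted. Specifically: (1) any short-read mapping program (for example, the Short Oligonucleotide Analysis Package, SOAP) can first be used to align the filtered sequences of each sample (that is, the sequences made up of the data of many sequenced fragments) to the reference genome sequence (for example, the human reference genome) to obtain their positions on the reference genome; (2) the alignment results are then screened, for example by removing sequences that align to multiple positions (because such a sequence cannot provide an accurate, unique alignment position) and removing duplicated sequences (because these may be errors introduced by upstream experiments, for example by sequencing errors, and are removed to make the result more accurate), to obtain the uniquely aligned sequences.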
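The read-filtering rule described for step 101 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `is_unqualified` and its parameters are hypothetical, the 50% and 10% limits are the example proportions given in the text, and the quality cutoff of 5 is the value used in Example 1.

```python
def is_unqualified(bases, quals, min_qual=5, max_low_frac=0.5, max_n_frac=0.1):
    """Return True when a read should be discarded.

    bases: read sequence as a string (A, C, G, T, N).
    quals: per-base quality values, one per base.
    A read is unqualified when more than `max_low_frac` of its bases have
    quality below `min_qual`, or more than `max_n_frac` of its bases are N.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    n = len(bases)
    if n == 0:
        return True  # an empty read carries no information
    low_quality = sum(1 for q in quals if q < min_qual)
    undetermined = sum(1 for b in bases if b in "Nn")
    return low_quality > max_low_frac * n or undetermined > max_n_frac * n


# Hypothetical reads: (sequence, qualities)
reads = [
    ("ACGTACGTAC", [30] * 10),            # kept
    ("ACGTACGTAC", [2] * 6 + [30] * 4),   # 60% low-quality bases: discarded
    ("ANNTACGTAC", [30] * 10),            # 20% N: discarded
]
kept = [seq for seq, q in reads if not is_unqualified(seq, q)]
```

Both limits are strict ("exceeds"), so a read sitting exactly on a limit is kept; whether the boundary case should be kept is an assumption of this sketch.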
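In step 102, the sequencing depth of each target region in each sample is counted from the uniquely aligned sequences obtained above, that is, the reads on each target region are counted. Because the present invention tests a panel of multiple target regions as a whole, a copy number variation at some positions changes the overall average sequencing depth of the sample. The average sequencing depth calculated over the panel regions therefore varies with the choice of target regions, which makes the copy number result unstable; the effect grows as the number of target regions falls, and it also grows with the magnitude of the copy number variation. For a panel composed of several regions to be detected and some other regions, the sequencing depth is calculated on each region to be detected and on each of the other regions; for a panel containing multiple sub-regions of one gene, the sequencing depths of these sub-regions and of the other regions are obtained in the same way.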
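Counting reads per target region, as step 102 describes, might look like the sketch below. It assumes each uniquely aligned read is available as a `(chromosome, start, end)` tuple and treats the depth of a region as the number of reads overlapping it; the data layout and the function name `region_depths` are assumptions made for illustration. The two regions are the MYCN exons listed in Table 1; the reads are made up.

```python
def region_depths(reads, regions):
    """Count the uniquely aligned reads that overlap each region.

    reads:   iterable of (chrom, start, end), closed intervals.
    regions: dict mapping region name -> (chrom, start, end), closed intervals.
    Returns a dict mapping region name -> read count.
    """
    counts = {name: 0 for name in regions}
    for chrom, start, end in reads:
        for name, (r_chrom, r_start, r_end) in regions.items():
            # Two closed intervals overlap when each starts before the other ends.
            if chrom == r_chrom and start <= r_end and end >= r_start:
                counts[name] += 1
    return counts


regions = {
    "MYCN:e1": ("chr2", 16082186, 16082976),
    "MYCN:e2": ("chr2", 16085614, 16086219),
}
reads = [  # hypothetical uniquely aligned reads
    ("chr2", 16082100, 16082200),
    ("chr2", 16082900, 16083050),
    ("chr2", 16085700, 16085850),
    ("chr2", 16084000, 16084150),  # falls between the two exons
    ("chr7", 16082186, 16082300),  # same coordinates, different chromosome
]
depths = region_depths(reads, regions)
```

The nested loop is fine for a few thousand regions; a real pipeline with millions of reads would more likely use a sorted or indexed interval structure.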
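In step 103, the sequencing depths of all detection regions in one sample are analyzed statistically in order to obtain the average sequencing depth of that sample. The present invention unexpectedly found that, for the problem described in step 102, calculating the mean by t-distribution fitting resolves the instability of the result very effectively. For a panel composed of several regions to be detected and some other regions, a T-distribution fit is applied to the sequencing depths of all detection regions in one sample, and the fitted distribution mean is taken as the average sequencing depth of that sample. Likewise, for a panel containing multiple sub-regions of one gene, the T-distribution fit is applied to the sequencing depths of all the sub-regions and the other regions to obtain the average sequencing depth. In this step the average sequencing depth can be calculated separately for each normal sample, sample under test, and control sample.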
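The text does not say how the T-distribution fit of step 103 is carried out. One possible reading, sketched below under that assumption, estimates the location of a Student's t distribution with a fixed number of degrees of freedom by expectation-maximization: each depth is down-weighted according to its distance from the current center, so a few strongly amplified regions barely move the estimate. The function name `t_location`, the choice `dof=4`, and the depth values are all hypothetical.

```python
import statistics


def t_location(values, dof=4.0, tol=1e-9, max_iter=200):
    """Estimate the center of a location-scale Student's t distribution.

    Uses EM with fixed degrees of freedom `dof`; returns the location only.
    """
    values = [float(v) for v in values]
    if len(values) < 2:
        raise ValueError("need at least two values")
    n = len(values)
    mu = statistics.median(values)          # robust starting point
    sigma2 = statistics.pvariance(values)   # starting squared scale
    if sigma2 == 0.0:
        return mu                           # all values identical
    for _ in range(max_iter):
        # E-step: points far from the center receive small weights.
        weights = [(dof + 1.0) / (dof + (x - mu) ** 2 / sigma2) for x in values]
        # M-step: weighted mean and weighted squared scale.
        new_mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        new_sigma2 = sum(w * (x - new_mu) ** 2 for w, x in zip(weights, values)) / n
        converged = abs(new_mu - mu) <= tol * (1.0 + abs(mu))
        mu = new_mu
        if new_sigma2 <= 0.0 or converged:
            break
        sigma2 = new_sigma2
    return mu


# Hypothetical panel: 20 regions near depth 100 plus 2 strongly amplified regions.
depths = [95, 98, 100, 102, 105] * 4 + [400, 410]
arithmetic_mean = statistics.mean(depths)   # pulled upward by the two outliers
fitted_center = t_location(depths)          # stays close to the bulk of the regions
```

In this toy panel the arithmetic mean is dragged well above 100 by two regions out of 22, while the fitted center stays near 100, which mirrors the small-panel behavior the text attributes to the arithmetic mean. A library routine such as SciPy's `scipy.stats.t.fit`, which also estimates the degrees of freedom, could be used instead.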
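In step 104, the copy number of each region under test in each sample is calculated by the odds ratio (OR) method.
For a panel composed of several regions to be detected and some other regions, the calculation uses the following formulas:
For the sample under test, copy number = (sequencing depth of the region under test in the sample under test / average sequencing depth of the sample under test) / (sequencing depth of the region under test in the control sample / average sequencing depth of the control sample) × ploidy.
For normal population samples, copy number = (sequencing depth of the region under test in the normal sample / average sequencing depth of the normal sample) / (sequencing depth of the region under test in the control sample / average sequencing depth of the control sample) × ploidy.
For a panel containing multiple sub-regions of one gene, the calculation similarly uses the following formulas:
For the sample under test, copy number = (sequencing depth of the sub-region under test in the sample under test / average sequencing depth of the sample under test) / (sequencing depth of the sub-region under test in the control sample / average sequencing depth of the control sample) × ploidy.
For normal population samples, copy number = (sequencing depth of the sub-region under test in the normal sample / average sequencing depth of the normal sample) / (sequencing depth of the sub-region under test in the control sample / average sequencing depth of the control sample) × ploidy.
Calculating with an odds ratio serves several purposes. First, it reduces the effect of a region's GC content on its sequencing depth: the odds ratio compares the sample under test and the control sample over the same region, so the effect cancels in the division. Second, it removes the effect of a region having a lower sequencing depth than other regions because of its own characteristics, such as being difficult to capture. Such a region might otherwise be read as a copy number loss, but the odds ratio tracks only the change of the sample under test relative to the control sample rather than the absolute value, so copy number variation regions can still be identified accurately. Finally, when control samples are used, results obtained with different control samples can be compared, because they are all measured against the same standard.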
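The odds-ratio formula and the cancellation argument can be checked with a small numeric sketch. The function name `copy_number` and all depth values are hypothetical; region-specific capture efficiency is modeled, as an assumption, by a single multiplicative factor that applies equally to the sample and the control.

```python
def copy_number(depth, mean_depth, ctrl_depth, ctrl_mean_depth, ploidy):
    """Odds-ratio copy number of one region.

    (depth / mean_depth) / (ctrl_depth / ctrl_mean_depth) * ploidy
    """
    for name, value in (("mean_depth", mean_depth),
                        ("ctrl_depth", ctrl_depth),
                        ("ctrl_mean_depth", ctrl_mean_depth)):
        if value <= 0:
            raise ValueError(f"{name} must be positive")
    return (depth / mean_depth) / (ctrl_depth / ctrl_mean_depth) * ploidy


def simulated(efficiency):
    """Copy number of a twofold-amplified region captured with a given efficiency."""
    sample_mean, control_mean, fold = 100.0, 80.0, 2.0
    sample_depth = sample_mean * efficiency * fold   # amplified in the sample
    control_depth = control_mean * efficiency        # not amplified in the control
    return copy_number(sample_depth, sample_mean, control_depth, control_mean, ploidy=2)


well_captured = simulated(1.0)
poorly_captured = simulated(0.5)
```

Both calls give the same value because the efficiency factor appears in the numerator and the denominator and divides out, which is the cancellation the text describes for GC content and capture difficulty.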
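When the copy number of a gene comprising multiple sub-regions is tested, the sub-regions of the gene are tested, and the mean of the sub-region copy numbers is taken as the copy number of the gene.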
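Aggregating sub-region copy numbers into a gene-level value can be sketched as below. The arithmetic mean is used here because it is the option the examples in this document apply; the function name `gene_copy_number` and the per-exon values are hypothetical.

```python
from statistics import fmean


def gene_copy_number(subregion_copy_numbers):
    """Gene-level copy number as the arithmetic mean of its sub-region values."""
    values = list(subregion_copy_numbers)
    if not values:
        raise ValueError("at least one sub-region copy number is required")
    return fmean(values)


# Hypothetical copy numbers for the two exons of one gene.
exon_copy_numbers = {"e1": 4.1, "e2": 3.8}
gene_value = gene_copy_number(exon_copy_numbers.values())
```

Swapping `fmean` for another estimator would cover the other averaging options the text lists.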
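Step 105 obtains the baseline used to decide whether a target region carries a copy number variation. A T-distribution fit is applied to the copy numbers of the region under test (or of the gene containing sub-regions) across all normal samples, and the mean is taken as the copy number baseline; the baseline value may be expressed as mean ± standard deviation (sd).
In step 106, the calculated copy number of a region (or gene containing sub-regions) in the sample under test is compared with the baseline value of that region; when it is above or below a set value, a copy number gain or loss is called. For example, a loss or gain may be called when the copy number is <=0.65 or >=1.6. The mean ± sd range may be Mean ± 2SD, Mean ± 2.5SD, Mean ± 3SD, and so on. Using the mean plus or minus a multiple of sd gives the decision statistical meaning: for example, mean ± 2sd covers 96% of samples and mean ± 3sd covers 99% of samples. If a value falls outside this range, a p-value can then be calculated by a hypothesis test to assess its significance.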
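The decision rule of step 106 can be sketched as a comparison against the interval mean ± k·sd. The function name `call_cnv` and the default `k=3` are illustrative choices (the text also allows 2 and 2.5); the baseline means, sd values, and measured ORs in the usage lines are the ones reported in this document for MYCN, MET, and CDKN2A, while the value 1.10 is made up to show an in-range case.

```python
def call_cnv(value, baseline_mean, baseline_sd, k=3.0):
    """Classify a copy number against the interval baseline_mean ± k * baseline_sd."""
    if baseline_sd < 0 or k < 0:
        raise ValueError("baseline_sd and k must be non-negative")
    if value > baseline_mean + k * baseline_sd:
        return "gain"
    if value < baseline_mean - k * baseline_sd:
        return "loss"
    return "normal"


mycn = call_cnv(3.95, 1.08, 0.075)      # MYCN OR vs. its baseline
met = call_cnv(1.72, 1.01, 0.039)       # MET OR vs. its baseline
cdkn2a = call_cnv(0.62, 1.01, 0.054)    # CDKN2A OR vs. its baseline
in_range = call_cnv(1.10, 1.08, 0.075)  # hypothetical value inside the interval
```

The sketch stops at the interval test; the follow-up hypothesis test and p-value mentioned in the text are not shown.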
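Based on the above method, the present invention also provides an apparatus for detecting copy number variation, which may be composed of multiple modules, as shown in Figure 2:
When used to detect a region under test, the apparatus includes: a sequencing data acquisition module, for performing NGS sequencing on the sample under test and the normal samples using a panel containing multiple detection regions, and for aligning the raw sequencing data of each sample to the reference genome sequence to obtain the reads that align uniquely to the detection regions, the panel being composed of the region to be detected and other regions; a sequencing depth calculation module, for calculating the sequencing depth of each detection region in each sample from the uniquely aligned reads, performing a T-distribution fit on the sequencing depths of the detection regions within one sample, and taking the mean of the distribution curve as the average sequencing depth of that sample; a copy number calculation module, for calculating the copy number of the region under test in each normal sample and in the sample under test; a baseline calculation module, for performing a T-distribution fit on the copy numbers of the region to be detected in the normal samples and taking the mean of the distribution curve as the average copy number baseline of that region; and an analysis module, for comparing the copy number of the region to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present. In one embodiment, the copy number is calculated by the following formula: copy number = (sequencing depth of the region under test in the sample / average sequencing depth of the sample) / (sequencing depth of the region under test in the control sample / average sequencing depth of the control sample) × ploidy.
When used to detect a gene comprising several sub-regions, the apparatus includes: a sequencing data acquisition module, for performing NGS sequencing on the sample under test and the normal samples using a panel containing multiple detection regions, and for aligning the raw sequencing data of each sample to the reference genome sequence to obtain the reads that align uniquely to the detection regions, the detection regions being composed of the sub-regions under test of the gene to be detected and other regions; a sequencing depth calculation module, for calculating the sequencing depth of each detection region in each sample from the uniquely aligned reads, performing a T-distribution fit on the sequencing depths of the detection regions within one sample, and taking the mean of the distribution curve as the average sequencing depth of that sample; a copy number calculation module, for calculating the copy number of the gene to be detected in each normal sample and in the sample under test;
a baseline calculation module, for performing a T-distribution fit on the copy numbers of the gene to be detected in the normal samples and taking the mean of the distribution curve as the average copy number baseline of that gene; and an analysis module, for comparing the copy number of the gene to be detected in the sample under test with the average copy number baseline to determine whether a copy number variation is present; wherein the copy number of the gene to be detected in the sample under test and in the normal samples is the mean of the copy numbers of its sub-regions under test. In one embodiment, the copy number is calculated by the following formula: copy number = (sequencing depth of the sub-region under test in the sample / average sequencing depth of the sample) / (sequencing depth of the sub-region under test in the control sample / average sequencing depth of the control sample) × ploidy.
It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.
Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented on general-purpose computing devices, either concentrated on a single computing device or distributed over a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.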
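Example 1: Detection of MYCN gene copy number
The target of this example is the copy number of the MYCN gene, which contains two exon regions. The main information on the target region is shown in the following table:
Table 1. The two exons of the MYCN gene
Region name, start position on chromosome, end position on chromosome
MYCN:NM_005378:e1 chr2:16082186 chr2:16082976
MYCN:NM_005378:e2 chr2:16085614 chr2:16086219
There are 349 normal individuals here; the mean of the z-distribution curve of their MYCN gene OR values is 1.08, and the sd is 0.075. The patient sample is the HD786 reference standard supplied by the company at https://www.horizondiscovery.com. It contains a MYCN amplification; the officially stated copy number OR is 4.75, and the copy number OR obtained for the gene by the NGS method and the algorithm described above is about 3.95. The control sample used here is the NA18535 reference-standard cell.
The detection panel also included exon regions of other sensitive genes. Different numbers of these regions were combined with the two exons of the MYCN gene to form panels; a total of 4684 other candidate regions were selected, and information on 100 of them, for example, is as follows:
Table 2. Information on the other detection regions in the panel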
Figure PCTCN2018090086-appb-000001
Figure PCTCN2018090086-appb-000002
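For the calculation, the two exons of the MYCN gene were combined with 10, 50, 100, 300, 400, ... 4000, and 4684 exon regions of other hotspot genes to form panels that were sequenced together by NGS. The number of regions in each panel is shown in Table 3. Where 10, 50, or 100 other regions were used, they were the exon regions numbered 1 to 10, 1 to 50, and 1 to 100 in Table 2; the remaining regions are not listed here one by one.
The detection steps were as follows:
1. The sequences under test were sequenced with Illumina high-throughput sequencing technology. After the sequencing reads were received, they were filtered to remove unqualified sequences, and sample adapter sequences were removed from the sequence fragments. An unqualified sequence is one in which the number of bases with a sequencing quality value below 5 exceeds 50% of the bases in the whole sequence, or the number of N in the sequencing result exceeds 10% of the bases in the whole sequence;
2. The Short Oligonucleotide Analysis Package (SOAP) mapping program was used to align the sequenced fragments of the cancer samples and para-cancerous samples obtained by high-throughput sequencing to the human reference genome sequence. Sequences with multiple alignments were screened out, duplicated sequencing reads were removed, and reads with identical base sequences in the single-end sequencing data were collapsed (only one copy was retained) to reduce false positives. Finally, the chromosome number and position information to be processed were extracted from the alignment results as required.
3. The sequencing depth of each region was calculated from the reads on each detection region of the sample under test, the normal samples, and the control sample. For each individual sample, a T-distribution fit was performed on the sequencing depths of its detection regions (for example, when 100 regions are to be tested, the sequencing depths of these 100 regions in one sample are fitted with a T distribution), the mean of the T-distribution curve was calculated, and this mean was taken as the average sequencing depth of that sample. The average sequencing depth of the other samples was calculated in the same way, giving the average sequencing depth of every sample in the set of samples under test, the set of normal samples, and the set of control samples.
4. The copy number of each detection region in each sample was calculated by the odds ratio method, as follows:
For the sample under test, the copy number of a particular detection region = (sequencing depth of the detection region in the sample under test / average sequencing depth of the sample under test) / (sequencing depth of that region in the control sample / average sequencing depth of the control sample) × ploidy.
For normal population samples, copy number = (sequencing depth of the detection region in the normal sample / average sequencing depth of the normal sample) / (sequencing depth of that region in the control sample / average sequencing depth of the control sample) × ploidy.
This gives the copy number of every region in each sample under test and normal sample.
In this example the MYCN gene has two exons; after the copy number of each of the two exon regions is calculated, their arithmetic mean is taken as the copy number of the gene.
5. A T-distribution fit was performed on the MYCN gene copy numbers of the samples in the normal sample set, and the mean of the distribution curve was taken as the MYCN copy number baseline value, with the range mean ± 3sd used as the decision interval;
6. The MYCN gene copy number of the sample under test was compared with the MYCN copy number baseline obtained in step 5 to determine whether the copy number is abnormal.
As a control, the average sequencing depth of each sample in step 3 was instead calculated as the arithmetic mean of the sequencing depths of all detection regions in that sample.
The results are shown in Table 4, and the number of regions is plotted against the copy number in Figure 3.
Table 4. MYCN gene copy number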
Figure PCTCN2018090086-appb-000003
Figure PCTCN2018090086-appb-000004
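The table shows that the copy number of the two MYCN exon regions calculated with the T-distribution method of the present invention has no direct relation to the number of other detection regions in the panel. The number of other regions does not affect the detected MYCN copy number, which stays between 3.94 and 4.27, varies little, and is close to the officially stated copy number of 4.75 for the reference standard. In sharp contrast, the copy number calculated with the arithmetic mean changes markedly with the number of other detection regions in the panel. Figure 3 shows that only when the number of detection regions in the panel is very large does the arithmetic-mean method give a result consistent with the value for the reference standard; in a small panel, the gene's copy number has a greater influence on the sample's sequencing depth, so the calculated result deviates markedly from the true value.
In addition, the two MYCN exons were combined with 10 different other regions to form panels for copy number detection. The exon regions numbered 1 to 10, 21 to 30, 41 to 50, 61 to 70, and 81 to 90 in Table 2 were each combined with MYCN to form a panel (referred to as groups 1 to 5), and the MYCN copy number was detected in the same manner. The results are shown in Table 5:
Table 5. MYCN gene copy number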
Figure PCTCN2018090086-appb-000005
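The table shows that, when MYCN is combined with 10 other detection regions into a panel, the choice of the other regions makes the result obtained with the arithmetic-mean method differ markedly, so the reliability of that method is low. The result calculated with the T-distribution fitting method has no obvious relation to the other regions in the panel; the detection result stays stable and is highly reliable.
Example 2: Detection of MET gene copy number
The target of this example is the copy number of the MET gene, which contains 20 exon regions. The main information on the target region is shown in the following table:
Table 6
Region name, start position on chromosome, end position on chromosome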
MET:NM_000245:e2 chr7:116339138 chr7:116340338
MET:NM_000245:e3 chr7:116371721 chr7:116371913
MET:NM_000245:e4 chr7:116380003 chr7:116380138
MET:NM_000245:e5 chr7:116380905 chr7:116381079
MET:NM_000245:e6 chr7:116395408 chr7:116395569
MET:NM_000245:e7 chr7:116397490 chr7:116397593
MET:NM_000245:e8 chr7:116397691 chr7:116397828
MET:NM_000245:e9 chr7:116398512 chr7:116398674
MET:NM_000245:e10 chr7:116399444 chr7:116399544
MET:NM_000245:e11 chr7:116403103 chr7:116403322
MET:NM_000245:e12 chr7:116409698 chr7:116409845
MET:NM_000245:e13 chr7:116411551 chr7:116411708
MET:NM_000245:e14 chr7:116411902 chr7:116412043
MET:NM_000245:e15 chr7:116414934 chr7:116415165
MET:NM_000245:e16 chr7:116417442 chr7:116417523
MET:NM_000245:e17 chr7:116418829 chr7:116419011
MET:NM_000245:e18 chr7:116422041 chr7:116422151
MET:NM_000245:e19 chr7:116423357 chr7:116423523
MET:NM_000245:e20 chr7:116435708 chr7:116435845
MET:NM_000245:e21 chr7:116435940 chr7:116436178
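The normal population here comprises 349 individuals; the z-distribution curve formed by their MET gene OR values has a mean of 1.01 and an sd of 0.039. The patient sample used here is the HD786 reference standard provided by https://www.horizondiscovery.com, which contains a MET amplification. The officially provided copy number OR is 2.25, and the OR of this gene obtained with NGS and the algorithm described above is 1.72. The control sample used here is the NA18535 reference cell line.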
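As a worked check, and assuming the mean ± 3sd decision interval of step 5 in Example 1 is applied directly to the OR values quoted here, the interval and the position of the measured value are:

```latex
\begin{aligned}
\text{lower bound} &= 1.01 - 3 \times 0.039 = 0.893 \\
\text{upper bound} &= 1.01 + 3 \times 0.039 = 1.127 \\
1.72 &> 1.127 \;\Rightarrow\; \text{the measured OR lies above the interval}
\end{aligned}
```

Under that assumption the measured value of 1.72 falls on the amplification side, in line with the MET amplification stated for the reference standard.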
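The panel also includes exon regions of some other sensitive genes. Different numbers of these regions were combined with the 20 MET exons to form panels; in total 4666 other candidate regions were selected, and information on 100 of them is given in Table 2 as an example.
For the calculation, the 20 MET exons were combined with 10, 50, 100, 300, 400, ... 4000 and 4666 exon regions of other hotspot genes to form panels that were sequenced together by NGS. The numbers of regions in the panels are listed in Table 3. When 10, 50 or 100 other regions were used, they were the exon regions numbered 1-10, 1-50 and 1-100 in Table 2; the remaining regions are not described one by one here.
The detection steps are the same as in Example 1. The arithmetic mean of the copy numbers of the 20 MET exon regions is taken as the MET gene copy number. As a control, the average sequencing depth of each sample in step 3 was also calculated with the arithmetic mean.
The results are shown in Table 7, and the copy number values plotted against the number of regions are shown in Figure 4.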
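Table 7 MET gene copy number
Panel composition  Mean method  t-distribution method
20 MET exons + 4666 other regions 1.72 1.72
20 MET exons + 4000 other regions 1.98 1.71
20 MET exons + 3000 other regions 2.62 1.72
20 MET exons + 2000 other regions 3.87 1.71
20 MET exons + 1000 other regions 7.33 1.71
20 MET exons + 900 other regions 8.14 1.7
20 MET exons + 800 other regions 9.14 1.72
20 MET exons + 700 other regions 10.46 1.73
20 MET exons + 600 other regions 11.73 1.72
20 MET exons + 500 other regions 13.74 1.71
20 MET exons + 400 other regions 16.77 1.72
20 MET exons + 300 other regions 21.32 1.75
20 MET exons + 200 other regions 27.84 1.74
20 MET exons + 100 other regions 32.21 1.72
20 MET exons + 50 other regions 43.49 1.7
20 MET exons + 10 other regions 57.41 1.73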
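The results here are similar to those of Example 1. In this example the MET gene has 20 exons; after the copy numbers of these 20 exon regions are calculated, the arithmetic mean of the 20 copy numbers is taken as the copy number of the gene. The method of the present invention avoids the influence of the number of other detection regions in the panel, so that the value of each detection is consistent with the actual value of the reference standard.
The MET exons were also combined with 10 different other regions to form panels for copy number detection. The exon regions numbered 1-10, 21-30, 41-50, 61-70 and 81-90 in Table 2 were each combined with MET to form a panel (referred to as groups 1 to 5 respectively), and the MET copy number was detected by the same method. The results are shown in Table 8: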
Table 8 MET gene copy number
Panel composition  Mean method  t-distribution method
Group 1 57.41 1.73
Group 2 43.21 1.72
Group 3 49.74 1.70
Group 4 61.21 1.72
Group 5 40.55 1.71
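As the table shows, when the MET gene is combined with 10 other detection regions to form a panel, different choices of the other regions lead to markedly different results with the arithmetic-mean method, so the results of that method have low reliability. The results calculated with the T-distribution fitting method show no obvious relationship with the other regions in the panel and remain stable, so their reliability is high.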
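The stability claim can be quantified directly from the five groups of Table 8. The sketch below copies the table values and reports the range (maximum minus minimum) for each method; the helper name `spread` is made up for this illustration.

```python
# Copy numbers for groups 1-5, copied from Table 8.
mean_method = [57.41, 43.21, 49.74, 61.21, 40.55]
t_method = [1.73, 1.72, 1.70, 1.72, 1.71]


def spread(values):
    """Range of the values (max - min), rounded to two decimals."""
    return round(max(values) - min(values), 2)


print(spread(mean_method))  # 20.66
print(spread(t_method))     # 0.03
```

Across the five panels the arithmetic-mean results differ by about 20 copy-number units, while the T-distribution results differ by 0.03.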
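Example 3 Detection of CDKN2A gene copy number
The target in this example is the copy number of the CDKN2A gene, which contains 5 exon regions. The main information on the target regions is shown in the table below:
Table 9
Region name  Start position on chromosome  End position on chromosome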
CDKN2A:NM_000077:e3 chr9:21968227 chr9:21968241
CDKN2A:NM_001195132:e3 chr9:21968723 chr9:21968770
CDKN2A:NM_000077:e2 chr9:21970900 chr9:21971207
CDKN2A:NM_000077:e1 chr9:21974676 chr9:21974826
CDKN2A:NM_000077:utr3 chr9:21974827 chr9:21975132
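The normal population here comprises 349 individuals; the z-distribution curve formed by their CDKN2A gene OR values has a mean of 1.01 and an sd of 0.054. The patient sample used here is the HD786 reference standard provided by https://www.horizondiscovery.com. No copy number variation of this gene is officially reported for it, but the OR of this gene obtained with NGS and the algorithm described above is about 0.62.
The panel also includes exon regions of some other sensitive genes. Different numbers of these regions were combined with the 5 CDKN2A exons to form panels; in total 4681 other candidate regions were selected, and information on 100 of them is given in Table 2 as an example.
For the calculation, the 5 CDKN2A exons were combined with 25, 50, 100, 300, 400, ... 4000 and 4681 exon regions of other hotspot genes to form panels that were sequenced together by NGS. The numbers of regions in the panels are listed in Table 3. When 25, 50 or 100 other regions were used, they were the exon regions numbered 1-25, 1-50 and 1-100 in Table 2; the remaining regions are not described one by one here.
The detection steps are the same as in Example 1. The arithmetic mean of the copy numbers of the 5 CDKN2A exon regions is taken as the CDKN2A gene copy number. As a control, the average sequencing depth of each sample in step 3 was also calculated with the arithmetic mean.
The results are shown in Table 10, and the copy number values plotted against the number of regions are shown in Figure 5.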
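Table 10 CDKN2A gene copy number
Panel composition  Mean method  T-distribution method
5 CDKN2A exons + 4681 other regions 0.62 0.62
5 CDKN2A exons + 4000 other regions 0.72 0.62
5 CDKN2A exons + 3000 other regions 0.96 0.62
5 CDKN2A exons + 2000 other regions 1.45 0.62
5 CDKN2A exons + 1000 other regions 2.92 0.63
5 CDKN2A exons + 900 other regions 3.19 0.62
5 CDKN2A exons + 800 other regions 3.59 0.62
5 CDKN2A exons + 700 other regions 4.21 0.63
5 CDKN2A exons + 600 other regions 4.79 0.62
5 CDKN2A exons + 500 other regions 5.83 0.63
5 CDKN2A exons + 400 other regions 7.08 0.62
5 CDKN2A exons + 300 other regions 9.59 0.63
5 CDKN2A exons + 200 other regions 14.21 0.62
5 CDKN2A exons + 100 other regions 26.52 0.6
5 CDKN2A exons + 50 other regions 53.75 0.63
5 CDKN2A exons + 25 other regions 96.3 0.66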
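The results here are similar to those of Example 1. In this example the CDKN2A gene has 5 exon regions; after the copy numbers of these 5 regions are calculated, the arithmetic mean of the 5 copy numbers is taken as the copy number of the gene. The method of the present invention avoids the influence of the number of other detection regions in the panel, so that the value of each detection is consistent with the actual value of the reference standard.
The CDKN2A exons were also combined with 25 different other regions to form panels for copy number detection. The exon regions numbered 1-25, 26-50, 51-75 and 76-100 in Table 2 were each combined with CDKN2A to form a panel (referred to as groups 1 to 4 respectively), and the CDKN2A copy number was detected by the same method. The results are shown in Table 11: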
Table 11 CDKN2A gene copy number
Panel composition  Mean method  t-distribution method
Group 1 96.3 0.66
Group 2 75.4 0.64
Group 3 67.7 0.63
Group 4 103.3 0.66
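As the table shows, when the CDKN2A gene is combined with 25 other detection regions to form a panel, different choices of the other regions lead to markedly different results with the arithmetic-mean method, so the results of that method have low reliability. The results calculated with the T-distribution fitting method show no obvious relationship with the other regions in the panel and remain stable, so their reliability is high.
Example 4 Detection of PIK3CA gene copy number variation in samples from esophageal cancer patients
The Variation 91720 locus of the PIK3CA gene was used as the region to be detected and was combined with the other 4681 gene regions of Example 3 to form a panel for high-throughput sequencing. The copy number of the Variation 91720 locus was calculated with the method of the present invention.
As a control, the copy number of the Variation 91720 locus was also detected by TaqMan probe-based real-time quantitative PCR, and the copy number was calculated with Copy Caller v2.0 software.
The case group comprised 33 patients from the Han population of southern China; the normal population was the 349 individuals of the examples above. The median age of the case group was 56 years and the mean age 55.1 years. ESCC was the predominant pathological type (84.8%), followed by adenocarcinoma (12.1%); other pathological types together accounted for 3%.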
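Decision criteria: a gene whose copy number lies above or below the mean ± 3sd range of the baseline is judged to carry a copy number amplification or deletion, respectively. The control sample must also meet the depth requirement in the detection region, i.e. an average sequencing depth greater than 5x and coverage of more than 70% of the region length.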
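These criteria can be written as a small decision function. This is a sketch under stated assumptions: the function name, the label strings and the example values are hypothetical, and treating a failed depth or coverage check as a separate "qc_fail" outcome is a design choice of this illustration rather than something the document specifies.

```python
def call_cnv(copy_number, baseline_mean, baseline_sd,
             control_depth, control_coverage,
             min_depth=5.0, min_coverage=0.70, k=3.0):
    """Classify a region as 'amplification', 'deletion', 'normal' or 'qc_fail'."""
    # Control-sample quality gate: depth > 5x and coverage > 70% of the region
    if control_depth <= min_depth or control_coverage <= min_coverage:
        return "qc_fail"
    if copy_number > baseline_mean + k * baseline_sd:
        return "amplification"
    if copy_number < baseline_mean - k * baseline_sd:
        return "deletion"
    return "normal"


# Hypothetical values: baseline 2.0 +/- 0.1, control depth 120x, coverage 95%.
print(call_cnv(2.6, 2.0, 0.1, 120.0, 0.95))  # above 2.3
print(call_cnv(1.5, 2.0, 0.1, 120.0, 0.95))  # below 1.7
print(call_cnv(2.1, 2.0, 0.1, 120.0, 0.95))  # inside the interval
```

Keeping the quality gate ahead of the interval test means a low-depth control never produces an amplification or deletion call.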
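The copy number results for the Variation 91720 locus in the 33 case samples are shown in Table 12:
Table 12 PIK3CA gene copy number in samples from esophageal cancer patients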
Figure PCTCN2018090086-appb-000006
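As the table above shows, the method of the present invention can be applied to detecting copy number variation in clinical cases, and its results are close to those obtained by testing this gene alone with real-time fluorescent PCR.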

Claims (16)

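  1. A method for detecting copy number variation, characterized by comprising the following steps:
    S1, forming a panel from the region to be detected and other regions, and performing NGS sequencing on the test sample, the control sample and the normal samples with said panel;
    S2, calculating, from the result of step S1, the copy number of the region to be detected in each normal sample and in the test sample;
    S3, fitting a T-distribution to the copy numbers of said region to be detected in the normal samples, and taking the mean of the distribution curve as the average copy number baseline of the region to be detected;
    S4, comparing the copy number of said region to be detected in the test sample with the average copy number baseline to determine whether a copy number variation exists;
    wherein the copy number of the region to be detected in the test sample and in the normal samples is calculated as follows: the NGS sequencing results of the sample and of the control sample are aligned to the reference genome sequence to obtain the reads uniquely aligned to said detection regions, and the sequencing depth of each detection region in each sample is calculated; a T-distribution is fitted to the sequencing depths of all detection regions in the same sample, and the mean of the distribution curve is taken as the average sequencing depth of that sample; and the copy number of the region to be detected in each sample is calculated from the average sequencing depth.
  2. The method for detecting copy number variation according to claim 1, characterized in that the number of detection regions in the panel is 3 to 50000.
  3. The method for detecting copy number variation according to claim 1, characterized in that the copy number is calculated by the formula: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
  4. The method for detecting copy number variation according to claim 1, characterized in that the baseline value consists of the mean ± standard deviation (sd).
  5. The method for detecting copy number variation according to claim 1, characterized in that the region to be detected includes, but is not limited to, a fragment of any one of the following genes: AKT1、AKT2、AKT3、CD274(PD-L1)、DDR2、EGFR、ERBB2、FGFR1、FGFR3、FLT4、HGF、MET、MTOR、MYC、PDCD1LG2(PD-L2)、PDGFRA、PDGFRB、SOX2、TP53、VEGFA、BRAF、FLT1、HRAS、KDR、KRAS、MAP2K1、MAP2K2、NRAS、PIK3CA、RB1、TOP1、VEGFA、BCL2、BCL6、CCND1、CCND3、CDK6、CEBPA、CEBPB、CEBPD、HOXA10、IL3、IRF4、KMT2A、LYL1、MUC1、MYC、NOTCH1、SETBP1、TAL1、ZAP70.
  6. The method for detecting copy number variation according to claim 1, characterized in that the method is used for non-therapeutic and non-diagnostic purposes.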
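  7. A method for detecting copy number variation, characterized by comprising the following steps:
    S1, selecting at least two regions to be detected from the sequence of the gene to be tested, combining them with other regions to form a panel, and performing NGS sequencing on the test sample and the normal samples with said panel;
    S2, calculating, from the result of step S1, the copy number of the gene to be detected in each normal sample and in the test sample;
    S3, fitting a T-distribution to the copy numbers of the gene to be detected in the normal samples, and taking the mean of the distribution curve as the average copy number baseline of the gene to be detected;
    S4, comparing the copy number of said gene to be detected in the test sample with the average copy number baseline to determine whether a copy number variation exists;
    wherein the copy number of the gene to be detected in the test sample and in the normal samples is the average of the copy numbers of the regions to be detected;
    the copy number of a region to be detected is calculated as follows: the NGS sequencing results of the sample and of the control sample are aligned to the reference genome sequence to obtain the reads uniquely aligned to said detection regions, and the sequencing depth of each detection region in each sample is calculated;
    a T-distribution is fitted to the sequencing depths of all detection regions in the same sample, and the mean of the distribution curve is taken as the average sequencing depth of that sample; and the copy number of the region to be detected in each sample is calculated from the average sequencing depth.
  8. The method for detecting copy number variation according to claim 7, characterized in that the number of detection regions in the panel is 3 to 50000.
  9. The method for detecting copy number variation according to claim 7, characterized in that the copy number is calculated by the formula: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
  10. The method for detecting copy number variation according to claim 7, characterized in that the average of the copy numbers of the regions to be detected is an arithmetic mean, a normal-distribution mean, a chi-square-distribution mean, an F-distribution mean or a T-distribution mean.
  11. The method for detecting copy number variation according to claim 7, characterized in that the method is used for non-therapeutic and non-diagnostic purposes.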
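  12. A device for detecting copy number variation, characterized by comprising:
    a sequencing data acquisition module, configured to perform NGS sequencing on the test sample and the normal samples with a panel composed of the region to be detected and other regions, and to align the raw sequencing output of the samples with the reference genome sequence to obtain the reads uniquely aligned to said detection regions;
    a sequencing depth calculation module, configured to calculate the sequencing depth of each detection region in each sample from the reads uniquely aligned to said detection regions, to fit a T-distribution to the sequencing depths of all detection regions in the same sample, and to take the mean of the distribution curve as the average sequencing depth of that sample;
    a copy number calculation module, configured to calculate the copy number of the region to be detected in each normal sample and in the test sample;
    a baseline calculation module, configured to fit a T-distribution to the copy numbers of the region to be detected in the normal samples, and to take the mean of the distribution curve as the average copy number baseline of the region to be detected;
    an analysis module, configured to compare the copy number of said region to be detected in the test sample with the average copy number baseline to determine whether a copy number variation exists.
  13. The device for detecting copy number variation according to claim 12, characterized in that the copy number is calculated by the following formula: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
  14. A device for detecting copy number variation, characterized by comprising:
    a sequencing data acquisition module, configured to perform NGS sequencing on the test sample, the control sample and the normal samples with a panel, and to align the raw sequencing output of the samples with the reference genome sequence to obtain the reads uniquely aligned to said detection regions; said panel is composed of at least two regions to be detected and other regions, and said regions to be detected are selected from the sequence of the gene to be tested;
    a sequencing depth calculation module, configured to calculate the sequencing depth of each detection region in each sample from the reads uniquely aligned to said detection regions, to fit a T-distribution to the sequencing depths of all detection regions in the same sample, and to take the mean of the distribution curve as the average sequencing depth of that sample;
    a copy number calculation module, configured to calculate the copy number of the gene to be detected in each normal sample and in the test sample;
    a baseline calculation module, configured to fit a T-distribution to the copy numbers of said gene to be detected in the normal samples, and to take the mean of the distribution curve as the average copy number baseline of the gene to be detected;
    an analysis module, configured to compare the copy number of said gene to be detected in the test sample with the average copy number baseline to determine whether a copy number variation exists;
    wherein the copy number of the gene to be detected in the test sample and in the normal samples is the average of the copy numbers of the regions to be detected.
  15. The device for detecting copy number variation according to claim 14, characterized in that the copy number is calculated by the following formula: copy number = (sequencing depth of the region to be detected in the sample / average sequencing depth of the sample) / (sequencing depth of the region to be detected in the control sample / average sequencing depth of the control sample) × ploidy.
  16. A computer-readable medium on which is recorded a program capable of running the method for detecting copy number variation according to any one of claims 1 to 11.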
PCT/CN2018/090086 2018-02-14 2018-06-06 Method and device for detecting copy number variation, and computer-readable medium WO2019157791A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810151291.2 2018-02-14
CN201810151291.2A CN108427864B (zh) 2018-02-14 2018-02-14 Method and device for detecting copy number variation, and computer-readable medium

Publications (1)

Publication Number Publication Date
WO2019157791A1 true WO2019157791A1 (zh) 2019-08-22

Family

ID=63157045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/090086 WO2019157791A1 (zh) 2018-02-14 2018-06-06 Method and device for detecting copy number variation, and computer-readable medium

Country Status (2)

Country Link
CN (1) CN108427864B (zh)
WO (1) WO2019157791A1 (zh)

Also Published As

Publication number Publication date
CN108427864A (zh) 2018-08-21
CN108427864B (zh) 2019-01-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18906700

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18906700

Country of ref document: EP

Kind code of ref document: A1
