WO2018209704A1 - 基于dna测序数据的样本来源检测方法、装置和存储介质 - Google Patents

基于dna测序数据的样本来源检测方法、装置和存储介质 Download PDF

Info

Publication number
WO2018209704A1
WO2018209704A1 PCT/CN2017/085150 CN2017085150W WO2018209704A1 WO 2018209704 A1 WO2018209704 A1 WO 2018209704A1 CN 2017085150 W CN2017085150 W CN 2017085150W WO 2018209704 A1 WO2018209704 A1 WO 2018209704A1
Authority
WO
WIPO (PCT)
Prior art keywords
curve
depth
sample
segments
segment
Prior art date
Application number
PCT/CN2017/085150
Other languages
English (en)
French (fr)
Inventor
梁瀚
李甫强
吴逵
赵鑫
乔斯坦
史旭莲
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院 filed Critical 深圳华大基因研究院
Priority to CN201780089043.3A priority Critical patent/CN110462056B/zh
Priority to PCT/CN2017/085150 priority patent/WO2018209704A1/zh
Publication of WO2018209704A1 publication Critical patent/WO2018209704A1/zh

Links

Images

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention relates to the field of bioinformatics technology, and in particular, to a sample source detection method, apparatus and storage medium based on DNA sequencing data.
  • SNV single nuclein variation
  • CNV structure variation
  • SV structure variation
  • Machine learning methods have a common feature, that is, they are easy to saturate during training, and the amount of information is large. In the case of the situation, the characteristics of the training set cannot be fully exploited, so it is also called shallow learning.
  • Relative deep learning methods have been proposed in recent years, and have achieved great success in image processing and speech recognition, and a series of variants have been developed.
  • the size of the biological genome is often large, and in humans, it can reach 3G.
  • the amount of data of this scale cannot be directly applied to shallow learning methods.
  • the general practice is to first statistic on the variability of the genome and then train on the basis of the statistical results. Focus on different levels and analyze the variation at the corresponding level. For example, on the single nucleotide level, some types of unique SNV can be analyzed; on the 1K to 3MB surface layer, CNV can be analyzed; at the level of 3MB or more On top, the SV can be analyzed.
  • the smaller the granularity the more difficult it is to extract valid information from a large amount of noise information; the larger the granularity, the more severe the information loss.
  • the shallow learning method is based on the variation on the one hand, and the information is largely lost in the statistical variation; on the other hand, due to the limitation of its structure, the information cannot be effectively utilized in the case of large amount of information, and the application is limited.
  • the sudden emergence of deep learning methods has also received great attention in the bio-industry. Since the bio-genome data and the deep-study image and speech data that have achieved great success are essentially different, they cannot be directly used, and the corresponding learning methods need to be developed in a targeted manner.
  • Copy number variation refers to the repeated arrangement of DNA fragments of 1KB to 3MB in length on the chromosome, which is widely present in the biological genome and covers a much larger number of nucleotides than other types of mutations. An important source of biological genetic diversity is closely related to many diseases.
  • the tissue source eg, predicting which cancer a tumor tissue belongs to
  • samples from some specific sources such as free DNA (Cell DNA, cfDNA) samples in peripheral blood
  • the main component of cfDNA is a DNA fragment released into the blood during normal apoptosis.
  • the sample contributor has cancer
  • the circulating DNA (ctDNA) released by cancer cell apoptosis may also appear in the blood.
  • ctDNA content is very small, accounting for about 1% of cfDNA, and is susceptible to many aspects, it is difficult to search for stable and effective copy number variation, which limits the application of the methods reported in the above literature in the important field of liquid biopsy.
  • the invention provides a sample source detection method, device and storage medium based on DNA sequencing data, and directly searches for the characteristics of different tissues directly from the comparison results of the DNA sequencing data rather than the statistical variation, thereby predicting the sample source.
  • an embodiment provides a sample source detection method based on DNA sequencing data, comprising:
  • DNA sequencing data of samples of at least two different sources data of multiple samples of the same source are respectively compared to a reference genome, and a window is divided on the above reference genome, and the sequencing depth in each window is separately counted, and according to The depth is generated to generate a depth curve;
  • each curve segment of the similar curve segment set is at least partially consistent with the corresponding depth curve
  • Generating a depth curve of the sample to be tested by using DNA sequencing data of the sample to be tested, and comparing the depth curve of the sample to be tested with the set of two complementary effective curve segments, and respectively according to the depth curve of the sample to be tested and the two complementary effective curve segments The degree of matching of the set determines the source of the sample to be tested.
  • an embodiment provides a sample source detection device based on DNA sequencing data, including:
  • DNA sequencing data of samples of at least two different sources data of multiple samples of the same source are respectively compared to a reference genome, and a window is divided on the above reference genome, and the sequencing depth in each window is separately counted, and according to The depth is generated to generate a depth curve;
  • each curve segment of the similar curve segment set is at least partially consistent with the corresponding depth curve
  • Generating a depth curve of the sample to be tested by using DNA sequencing data of the sample to be tested, and comparing the depth curve of the sample to be tested with the set of two complementary effective curve segments, and respectively according to the depth curve of the sample to be tested and the two complementary effective curve segments The degree of matching of the set determines the source of the sample to be tested.
  • an embodiment provides a computer readable storage medium comprising a program executable by a processor to implement the method:
  • DNA sequencing data of samples of at least two different sources data of multiple samples of the same source are respectively compared to a reference genome, and a window is divided on the above reference genome, and the sequencing depth in each window is separately counted, and according to The depth is generated to generate a depth curve;
  • Generating a depth curve of the sample to be tested by using DNA sequencing data of the sample to be tested, and comparing the depth curve of the sample to be tested with the set of two complementary effective curve segments, and respectively according to the depth curve of the sample to be tested and the two complementary effective curve segments The degree of matching of the set determines the source of the sample to be tested.
  • the method of the present invention directly searches for the characteristics of different tissues directly from the alignment results of the DNA sequencing data instead of the statistical variation, thereby predicting the source of the sample, and can be used for the free DNA sample in the peripheral blood with high specificity and sensitivity.
  • Source detection can not only detect the source of non-diagnostic DNA samples, but also detect the source of diagnostic samples such as tumor circulating DNA. It can be used in liquid biopsy-related detection and molecular typing of tumors.
  • FIG. 1 is a flow chart of a sample source detection method based on DNA sequencing data in an embodiment of the present invention
  • FIG. 2 is a schematic diagram showing the principle of extracting a depth curve according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of the principle of extracting a set of complementary balance curve segments in an embodiment of the present invention.
  • the present invention proposes a highly sensitive data mining method, from the copy number variation (CNV) precursor data - the DNA fragment in the sequencing result (sequencing read length or reads)
  • CNV copy number variation
  • the distribution information is used to construct a model to identify tissue signals in the blood beyond the accuracy of the prior art.
  • the method of the present invention is based on the biological principle that the presence of CNV increases/decreases repetitive fragments of a region, and when aligned to a normal reference genome, appears to be aligned to the corresponding region of the reference genome.
  • the number of DNA fragments is abnormally increased/decreased, that is, the "depth" is increased/decreased.
  • the depths of different regions are inconsistent, forming a depth curve.
  • the prediction of unknown sample sources is achieved by mining characteristic depth curve segments of a certain type of tissue.
  • group As used herein, the terms "group,” “sample source,” “(a class of) organization,” “(organization) source” have substantially the same or equivalent meaning, eg, lung tissue, stomach tissue, Each of prostate tissue and lymphoid tissue can be considered a specific group and a specific source of sample.
  • sample is the original material that provides DNA sequencing data, such as free DNA in the peripheral blood of each normal person or patient can be considered a sample, different samples. This may have the same source (for example, from the same organization) or it may have different sources. Therefore, in the present invention, “samples of different origins” and “samples of the same source” are distinguished.
  • curve segment As used herein, the terms “curve segment”, “sub-segment”, “subsequence” have substantially the same or equivalent meaning, for example, “curve segment” can be considered as part of a depth curve when the "depth curve" is When a sequence (such as a window number sequence) is expressed, a “curve segment” may be referred to as a "subsequence.”
  • set has the mathematical meaning that a set of multiple elements, such as a set of similar curve segments, is a set of multiple similar curve segments.
  • the method of the present invention mainly comprises the steps of extracting a depth curve, extracting a set of similar curve segments, extracting a set of effective curve segments, optionally extracting a set of balanced curve segments, and finally performing a score prediction on the sample to be tested.
  • a sample source detection method based on DNA sequencing data in an embodiment of the present invention is as shown in FIG. 1 , and specifically includes:
  • Step S101 extracting a depth curve
  • the data of multiple samples from the same source are respectively compared to the reference genome, the window is divided on the reference genome, and the sequencing depth in each window is separately counted, and according to the depth Generate a depth curve.
  • the input data of the invention is the DNA sequencing data of the sample (sequencing read length or reads), and is not limited to the sequencing platform and the database construction method, and is not limited to the specific parameters in the middle, and the existing mainstream sequencing scheme can be directly used.
  • DNA sequencing data needs to be aligned to the reference genome.
  • This step can be performed using existing software such as Burrows-Wheeler Aligner (BWA), Short Oligonucleotide Analysis Package (SOAP), and the like.
  • BWA Burrows-Wheeler Aligner
  • SOAP Short Oligonucleotide Analysis Package
  • Fig. 2 after the alignment is completed, the depth information (i.e., the number of DNA fragments) on each chromosome is counted by window.
  • the scheme of window division can be designed according to specific data, and adopts a fixed length or variable length mode, and the length is, for example, 10-100K/window, preferably 40K/window.
  • the depth curve is drawn based on the depth information of each window.
  • the reference genome shown in Fig. 2 is exemplarily divided into 5 windows of 1 to 5, and the alignment is indicated by a broken line.
  • the DNA fragment to the region, in the coordinate system, the solid line indicates the number of DNA fragments that fall in the window at the beginning of the corresponding window (which can also be the end or midpoint or any fixed relative or absolute position) as the depth value. . Connect the depth value of each window to get the depth curve.
  • the curve of FIG. 2 can be described as: 1, 2, 4, 3, 5 or 1, 2, 4, 5, 3.
  • Step S102 Extracting a similar curve segment set
  • a set of similar curve segments is extracted from depth curves of a plurality of samples of the same source, each curve segment of the similar curve segment set being at least partially coincident with a corresponding depth curve.
  • extracting a set of similar curve segments is accomplished in two steps:
  • one embodiment of the present invention provides an overview of how to find these frequently occurring subsequences.
  • the following method is used: If a group has a frequently occurring subsequence "10, 20, 30, 40", since the number of possible subsequences is exponential with the total number of windows, it is practically impossible to obtain this by direct search. Subsequences, where iterative stitching is used.
  • Splicing is defined as: if there are two sequences, the first end of one sequence is consistent with the end of the other, then the entire sequence of the latter and the part of the former that is inconsistent with the latter are connected to obtain a new sequence. It is called splicing.
  • Step S103 extracting a set of valid curve segments
  • group A Similar curve segments mined from a certain group (hereinafter referred to as group A) are not guaranteed to be unique, and need to be filtered for another specific group (hereinafter referred to as group B), after filtering
  • group B another specific group
  • the set can be used as a basis for distinguishing between the two groups.
  • each segment is scored as its weight.
  • the p-value of Fisher's exact test is used as the scoring basis.
  • the specific process is as follows: for a certain curve segment, the number of matched samples and the number of unmatched samples in the A group are respectively counted, and two values of a and b are obtained, and the number of matched samples in the B group is not matched according to the same method. The number of samples, get c, d.
  • the so-called "match” is defined as follows: If the common elements of the two sequences are identical in order, they are considered to be matched. For example, for the sequence "1, 2, 3, 4" and the sequence "2, 3, 4", the common elements are 2, 3, 4. Since the order of the three elements 2, 3, and 4 is identical in the two sequences, the two sequences are considered to be matched.
  • a curve segment matches a curve extracted from a sample, the curve segment is considered to match the sample.
  • the p value is less than the set threshold, correct it to the value, and use the reciprocal of the p value as the weight of the curve segment. .
  • the curve segments whose weights are lower than the set threshold are filtered out to obtain a filtered set of valid curve segments.
  • the above operation is also performed on the B group, and a corresponding set of effective curve segments is obtained.
  • the sample set used to filter the similar curve segments is the A group set.
  • the above process results in a set of two effective curve segments being a set of complementary effective curve segments.
  • Step S104 determining the source of the sample to be tested
  • the source of the sample to be tested is determined according to the matching degree of the sample depth curve to the pair of complementary effective curve segments. in particular:
  • the sample depth curve to be tested is obtained.
  • the sample depth curve to be tested is compared with both pairs of complementary effective curve segment sets. If the sample depth curve to be tested matches a certain curve segment of one side, the weight value of the curve segment is added to the type. After all the sample depth curves to be tested are checked, the scores of both parties are compared, and the higher score is used as the winner. In some cases where high specificity is required, the difference between the scores of both parties may be required to reach a certain threshold, otherwise it is considered impossible.
  • a pair of complementary effective curve segment sets can be used as a two-two classification model.
  • all valid curve segment sets are combined and trained in pairs to obtain multiple two-two classification models.
  • the sample depth curve to be tested is compared with each two-two classification model, and the score of the winning party in each model is increased by 1, and finally the party with the highest score among all types is judged as a many-to-many judgment.
  • the difference between the highest score and the second highest score can also be required to reach a certain threshold to increase the specificity.
  • the two-two classification model is first used for judgment. Assuming that the real type of the sample is A, it is judged as A in the two-two classification model AB, A is also judged as A in the two-two classification model AC, and judged as B (or C) in the two-two classification model BC. . Of the three judgments, two were judged as A, so A got 2 points, the last judgment was B (or C), and B (or C) scored 1 point. Since 2 points of A are more than 1 point of B (or C), the sample is judged to be of type A.
  • the determination of the source of the sample to be tested can be achieved by completing step S104.
  • the number of segments and the weight of the complementary effective curve segment set may be inconsistent, so that the segment in the effective curve segment set of one side is larger than the other party and the weight is high, and is used for classification prediction.
  • the prediction result is tilted to the type, it causes bias.
  • it needs to be truncated, and the pruning process is performed simultaneously for both sides of a pair of complementary effective curve segment sets.
  • a set of complementary balance curve segments is extracted from the set of complementary effective curve segments, and the ratio of the set of complementary balance curve segments is compared with the depth curve of the sample to be tested.
  • Bias when comparing the set of complementary effective curve segments with the depth curve of the sample to be tested Low sex. in particular:
  • the two sides of the set of complementary effective curve segments are respectively arranged according to the weights from highest to lowest (the first element is the element with the highest weight or one), and two sorted sequences are obtained, and each of them is provided with a tentative pointer, pointing to each of the two sides.
  • An element When the pointer points to the Mth element in a sequence, a partial subset of the first to M elements (the number of elements is M) is extracted, and a subset of the number of elements in the other complementary set is N, which is used to extract the curve.
  • the two sets of samples of the fragment are evaluated to observe the prediction accuracy, and the prediction accuracy can also be estimated using Fisher's exact test.
  • the number of samples of group A is judged as A by a method, and the number of B is judged as b;
  • the sample of the group B is judged to be the number of A, and the number of B is judged as d.
  • the p value of Fisher's exact test is calculated according to the four values a, b, c, and d. The smaller the p value, the higher the prediction accuracy.
  • the sample to be tested is scored and predicted, and the source of the sample to be tested is determined. in particular:
  • a set of complementary balanced curve segments can be used as a two-two classification model.
  • all training sets are combined and trained to obtain multiple two-two classification models.
  • the sample depth curve to be tested is compared with each two-two classification model, and the score of the winning party in each model is increased by one.
  • the party with the highest score in the type has the result of many-to-many judgment.
  • the difference between the highest score and the second highest score can also be required to reach a certain threshold to improve the specificity.
  • the program may be stored in a computer readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc.
  • the computer executes the program to implement the above functions.
  • the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the above functions can be realized.
  • the program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a mobile hard disk, and may be saved by downloading or copying.
  • the system is updated in the memory of the local device, or the system of the local device is updated.
  • an embodiment of the present invention further provides a sample source detecting device based on DNA sequencing data, including:
  • DNA sequencing data of samples of at least two different sources data of multiple samples of the same source are respectively compared to a reference genome, and a window is divided on the above reference genome, and the sequencing depth in each window is separately counted, and according to The depth is generated to generate a depth curve;
  • each curve segment of the similar curve segment set is at least partially consistent with the corresponding depth curve
  • Generating a depth curve of the sample to be tested by using DNA sequencing data of the sample to be tested, and comparing the depth curve of the sample to be tested with the set of two complementary effective curve segments, and respectively according to the depth curve of the sample to be tested and the two complementary effective curve segments The degree of matching of the set determines the source of the sample to be tested.
  • DNA sequencing data of samples of at least two different sources data of multiple samples of the same source are respectively compared to a reference genome, and a window is divided on the above reference genome, and the sequencing depth in each window is separately counted, and according to The depth is generated to generate a depth curve;
  • each curve segment of the similar curve segment set is at least partially consistent with the corresponding depth curve
  • Generating a depth curve of the sample to be tested by using DNA sequencing data of the sample to be tested, and comparing the depth curve of the sample to be tested with the set of two complementary effective curve segments, and respectively according to the depth curve of the sample to be tested and the two complementary effective curve segments The degree of matching of the set determines the source of the sample to be tested.
  • This example demonstrates the feasibility of the method of the present invention using the DNA sequencing data shown in Table 1.
  • the human reference genome GRCh37.p13
  • the DNA sequencing data is compared to the reference genome by using the BWA comparison software, and the reference genome is divided into 40K size windows, and the first end of the corresponding window falls in the window.
  • the number of DNA fragments is taken as the depth value.
  • Table 1 sample related information
  • the training set used 80% of each of the above four tissues.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Biotechnology (AREA)
  • Zoology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Biochemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Microbiology (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)

Abstract

一种基于DNA测序数据的样本来源检测方法、装置和存储介质,该方法包括:分别将同一来源的多个样本的DNA测序数据比对到参考基因组上,分别统计参考基因组上每个窗口中的测序深度,并生成深度曲线;从同一来源的多个样本的深度曲线中提取相似曲线片段集合;对于两两不同来源的相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;将待测样本深度曲线与两两互补有效曲线片段集合进行比较,根据待测样本深度曲线分别与两两互补有效曲线片段集合的匹配程度判断待测样本的来源。本方法直接从DNA测序数据的比对结果搜索不同组织的特征,从而对样本来源进行预测。

Description

基于DNA测序数据的样本来源检测方法、装置和存储介质 技术领域
本发明涉及生物信息学技术领域,具体涉及一种基于DNA测序数据的样本来源检测方法、装置和存储介质。
背景技术
自1977年Sanger等推出基于双脱氧链终止法的第一代测序仪以来,测序技术不断发展,取得一系列突破,现已推出基于单分子测序技术的第三代测序仪。测序技术的每一代发展,都极大地推进了生命科学各个领域的发展,特别是生物医药方面的研究。
通过对DNA进行测序分析,人们认识到不同个体的基因组间存在一系列差异,并将其按区域大小分为单核酸变异(single nuclein variation,SNV)(包括插入缺失和替换)、拷贝数变异(copy number variation,CNV)和结构变异(structure variation,SV)等。这些差异的存在极大地丰富了基因组遗传变异的多样性。
通过对基因组的差异进行归纳总结,人们可凭借测序结果对样本所属的物种、种群、个体乃至组织来源作出判断,尤其是组织来源的判断在生物医药方面有着重要意义。如癌症由于其具有高特异性,即使发生在同一组织器官的肿瘤也可能有完全不同的生长机理,需要不同的治疗策略。然而肿瘤在影像学上可能极为相似,难以直接区分,需要测序分析进行辨认,根据其变异特征分类到相应的类别中。
上述归纳总结一般使用机器学习方法实现。人工智能领域提出了一系列通用机器学习方法,并成功地应用在基因组信息分析上,如逻辑回归(logistic regression)、决策树(decision tree)、支持向量机(Support Vector Machine,SVM)等,这些机器学习方法有一共同特点,即在训练过程中容易饱和,在信息量大 的情况下无法充分挖掘训练集的特征,因此又被称为浅层学习。相对的深度学习(deep learning)方法在近几年内被提出,并且在图像处理、语音识别方面获得极大的成功,现已发展出一系列变形。
生物基因组尺寸往往较大,以人类为例,可达到3G。此类规模的数据量无法直接应用浅层学习方法,一般做法是先对基因组上的变异进行统计,再在统计结果的基础上进行训练。关注不同的层面,可分析相应层面上的变异,如单核苷酸层面上,可分析某些类型特有的SNV;在1K到3MB的面层上,可对CNV进行分析;在3MB以上的层面上,可对SV进行分析。一般而言,粒度越小,在大量噪音信息中提取有效信息的难度越大;粒度越大,则信息丢失越严重。随着液体活检的兴起以及测序技术在液体活检中发挥着越来越重要的作用,提出了对学习方法的进一步要求。浅层学习方法,一方面其建立在变异的基础上,而在统计变异时信息大量丢失;另一方面由于自身结构的限制,在信息量大的情况下无法有效利用信息,应用受到限制。深度学习方法的异军突起在生物行业也受到极大关注,由于生物基因组数据与深度学习已取得重大成功的图像和语音数据存在本质上的不同,无法直接使用,需要针对性地开发相应的学习方法。
目前根据测序数据对样本来源进行预测方面的文献包括,Matthew W.Snyder等发表在CELL杂志上的Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin,其通过机体染色质核小体占位(Nucleosome positioning varies)的多样性对肿瘤循环DNA进行预测定位;以及Shicheng Guo等发表在Nature Genetics杂志上的Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA,其通过机体甲基化单倍型(CpG methylation haplotypes)的多样性对肿瘤循环DNA进行预测定位。
拷贝数变异是指长度介于1KB到3MB的DNA片段在染色体上的重复排列,在生物基因组中广泛存在,其覆盖的核苷酸数目远远超过其他类型的变异,是 生物遗传多样性的重要来源,与众多疾病密切相关。
我们的前期工作证实,通过检测DNA测序数据中的拷贝数变异可在较高特异性和敏感度下,预测其组织来源(如预测某个肿瘤组织属于何种癌症)。然而,对于部分特定来源的样本,如外周血中的游离DNA(Cell free DNA,cfDNA)样本,可能不适用。cfDNA的主要成分为正常细胞凋亡时释放到血液中的DNA片段,在样本贡献者患有癌症时,血液中也可能出现癌细胞凋亡释放出的肿瘤循环DNA(circulating tumor DNA,ctDNA),后者正是癌症早期筛查的重点关注对象。由于ctDNA含量极少,约占cfDNA的1%,且易受到多方面的影响,难以搜索稳定有效的拷贝数变异,限制了上述文献报道的方法在液体活检这一重要领域的应用。
发明内容
本发明提供一种基于DNA测序数据的样本来源检测方法、装置和存储介质,直接从DNA测序数据的比对结果而非统计到的变异搜索不同组织的特征,从而对样本来源进行预测。
根据第一方面,一种实施例中提供一种基于DNA测序数据的样本来源检测方法,包括:
对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在上述参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据上述深度生成深度曲线;
从上述同一来源的多个样本的深度曲线中提取相似曲线片段集合,上述相似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致;
对于两两不同来源的上述相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉上述权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;
用待测样本的DNA测序数据生成待测样本深度曲线,将上述待测样本深度曲线与上述两两互补有效曲线片段集合进行比较,根据上述待测样本深度曲线分别与上述两两互补有效曲线片段集合的匹配程度判断上述待测样本的来源。
根据第二方面,一种实施例中提供一种基于DNA测序数据的样本来源检测装置,包括:
存储器,用于存储程序;
处理器,用于通过执行上述存储器存储的程序以实现如下的方法:
对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在上述参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据上述深度生成深度曲线;
从上述同一来源的多个样本的深度曲线中提取相似曲线片段集合,上述相似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致;
对于两两不同来源的上述相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉上述权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;
用待测样本的DNA测序数据生成待测样本深度曲线,将上述待测样本深度曲线与上述两两互补有效曲线片段集合进行比较,根据上述待测样本深度曲线分别与上述两两互补有效曲线片段集合的匹配程度判断上述待测样本的来源。
根据第三方面,一种实施例中提供一种计算机可读存储介质,包括程序,该程序能够被处理器执行以实现如下的方法:
对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在上述参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据上述深度生成深度曲线;
从上述同一来源的多个样本的深度曲线中提取相似曲线片段集合,上述相 似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致;
对于两两不同来源的上述相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉上述权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;
用待测样本的DNA测序数据生成待测样本深度曲线,将上述待测样本深度曲线与上述两两互补有效曲线片段集合进行比较,根据上述待测样本深度曲线分别与上述两两互补有效曲线片段集合的匹配程度判断上述待测样本的来源。
本发明的方法直接从DNA测序数据的比对结果而不是统计到的变异搜索不同组织的特征,从而对样本来源进行预测,能够高特异性、高灵敏度地用于外周血中的游离DNA样本的来源检测,既可实现非诊断性DNA样本来源检测,也可实现肿瘤循环DNA等诊断性样本来源检测,可应用在液体活检相关检测以及肿瘤的分子分型中。
附图说明
图1为本发明一种实施例中的基于DNA测序数据的样本来源检测方法的流程图;
图2为本发明一种实施例中提取深度曲线的原理示意图;
图3为本发明一种实施例中提取互补平衡曲线片段集合的原理示意图。
具体实施方式
下面通过具体实施方式结合附图对本发明作进一步详细说明。在以下的实施方式中,很多细节描述是为了使得本发明能被更好的理解。然而,本领域技术人员可以毫不费力的认识到,其中部分特征在不同情况下是可以省略的,或者可以由其他元件、材料、方法所替代。在某些情况下,本发明相关的一些操 作并没有在说明书中显示或者描述,这是为了避免本发明的核心部分被过多的描述所淹没,而对于本领域技术人员而言,详细描述这些相关操作并不是必要的,他们根据说明书中的描述以及本领域的一般技术知识即可完整了解相关操作。
另外,说明书中所描述的特点、操作或者特征可以以任意适当的方式结合形成各种实施方式。同时,方法描述中的各步骤或者动作也可以按照本领域技术人员所能显而易见的方式进行顺序调换或调整。因此,说明书和附图中的各种顺序只是为了清楚描述某一个实施例,并不意味着是必须的顺序,除非另有说明其中某个顺序是必须遵循的。
为了解决现有技术中的至少部分技术问题,本发明提出了一种高灵敏度的数据挖掘方法,从拷贝数变异(CNV)的前体数据——测序结果中的DNA片段(测序读长或reads)分布信息入手,构建模型,以超越现有技术的准确性识别出血液中的组织信号。
本发明的方法是基于这样的生物学原理:CNV的存在使得某一区域的重复性片段增加/减少,将其比对到正常的参考基因组上时,表现为比对到参考基因组的相应区域上的DNA片段数异常增多/减少,也即“深度”增加/降低。不同区域的深度不一致,形成深度曲线。通过挖掘某一类组织的特征性深度曲线片段,来实现未知样本来源的预测。
在本发明说明书中涉及到的一些术语说明如下:
如本文中所使用的,术语“群体”、“样本来源”、“(某一类)组织”、“(组织)来源”具有基本相同或等同的含义,例如,肺部组织、胃部组织、前列腺组织和淋巴组织中的每一个可以认为是一个特定的群体,也是一个特定的样本来源。
如本文中所使用的,术语“样本”即提供DNA测序数据的那个最初的材料,例如每个正常人或患者的外周血中的游离DNA可以认为是一个样本,不同的样 本可能有相同的来源(例如来源于同一组织),也可能有不同来源。因此,本发明中区分“不同来源的样本”和“同一来源的样本”。
如本文中所使用的,术语“曲线片段”、“子片段”、“子序列”具有基本相同或等同的含义,例如,“曲线片段”可以认为是深度曲线的一部分,当“深度曲线”以序列(例如窗口编号序列)表达时,“曲线片段”就可以称作“子序列”。
如本文中所使用的,术语“集合”具有数学上的含义,即多个元素的集,例如相似曲线片段集合即是多个相似曲线片段的集。
本发明的方法主要包括提取深度曲线、提取相似曲线片段集合、提取有效曲线片段集合、任选的提取平衡曲线片段集合以及最后的对待测样本进行打分预测的步骤。
本发明一种实施例中的基于DNA测序数据的样本来源检测方法如图1所示,具体包括:
步骤S101:提取深度曲线
对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据深度生成深度曲线。
本发明的输入数据为样本的DNA测序数据(测序读长或reads),不限测序平台和建库方法,也不限中间的具体参数,直接采用现有主流测序方案即可。
DNA测序数据需要比对到参考基因组上,这一步可利用现有的软件进行,如Burrows-Wheeler Aligner(BWA)、Short Oligonucleotide Analysis Package(SOAP)等。如图2所示,比对完成后,按窗口分别统计每一条染色体上的深度信息(即DNA片段数)。窗口划分的方案可以根据具体的数据进行设计,采用定长或不定长的模式,长度例如10-100K/窗口,优选40K/窗口。根据每个窗口的深度信息绘制深度曲线。
图2中所示的参考基因组被示例性地分割成1~5共5个窗口,用虚线表示比对 到该区域的DNA片段,坐标系中用实线表示相应窗口中端首(也可以是端尾或中点或任一固定的相对或绝对位置)落在该窗口的DNA片段数,作为深度值。连接每个窗口的深度值,得到深度曲线。
若不考虑窗口间平均深度的具体差距,只关注大概的形状,则可以通过下面的方式描述这一曲线:将所有窗口按深度值由大到小(或者相反)排列,得到排序后的窗口编号序列。例如,按照这个方法,图2的曲线可以被描述为:1、2、4、3、5或1、2、4、5、3。
步骤S102:提取相似曲线片段集合
从同一来源的多个样本的深度曲线中提取相似曲线片段集合,该相似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致。具体而言:
为了找到一个群体(或来源)中相似的深度曲线片段,需要对“相似”作严格的数学定义:假如一系列曲线按同样的方法,得到的窗口编号序列一致,则认为它们是“相似”的。例如,按步骤S101的方法,曲线A得到的序列为“1,2,3”,曲线B得到的序列也为“1,2,3”,则认为两个曲线是相似的。
通过这种变换,使得问题演变为,如何在这些序列中寻找频繁出现的子序列(即“子片段”或“曲线片段”),子序列不要求连续。其中,“不要求连续”的意义体现在,假如存在曲线1、2、3、4、5和曲线1、2、4、3、5,从定义上看,二者是不相似的,但是如果去掉3、4两个窗口,剩下的1、2、5部分则变得一致,此时曲线变得不连续。
在本发明的一个实施例中,提取相似曲线片段集合分两步完成:
(1)寻找包含两个窗口长度的子片段(或子序列),子片段(或子序列)是连续的或不连续的;
(2)将多个子片段按照两两前后有窗口重叠的方式迭代拼接成大于两个窗口长度的子片段,直至没有更长的子片段生成,则所有子片段组成所述相似曲线片段集合。
具体而言,对于如何找到这些频繁出现的子序列,本发明的一个实施例提 出了以下方法:假如某一群体存在一个频繁出现的子序列“10,20,30,40”,由于可能存在的子序列数量与窗口总数是指数级的关系,实际上无法通过直接搜索得到这个子序列,在此采用迭代拼接的方式。
拼接被定义为:假如存在两个序列,其中一个序列的首端部分与另一个的末端部分一致,则将后者整个序列以及前者与后者不一致的部分连接起来,得到新序列,这个过程被称为拼接。
继续上面的例子。从最短长度为2的子序列(L2)开始,假如存在频繁子序列“10,20”和“20,30”,前者的末端“20”和后者的首端“20”一致,则可以将其拼接得到“10,20,30”。下一轮拼接中,如果存在“10,20,30”和“20,30,40”则能够拼接得到“10,20,30,40”。即每轮将其长度增加1,不断迭代,直到没有新的子序列产生。
值得注意的是,并不是所有拼接得到的序列都是频繁的,需要进行检查,过滤掉非频繁的部分。
在开始上述拼接前,需要找到所有L2。由于允许子序列不连续,对于窗口数量为N的深度曲线,其可能存在的L2数量为N2级别,为了减少计算量,限制不连续的最大间距为R,从而使L2数量下降到NR级别。
要得到所有符合条件的L2,可以通过以下方法:对于被划分成N个窗口的染色体,将每个窗口的位置作为其编号;搜索其包含编号为x的窗口(1≤x≤N)的L2时,遍历编号在从x-R到x+R范围内的窗口(1≤x-R<x+R≤N),观察此范围内的编号为i(i≠x)的窗口内归一化深度(例如平均深度或总深度的某一比例)大于或小于编号为x的窗口的样本数占总样本数的比例,若该比例大于设定值,则认为片段(i,x)是频繁的,从而将其保留下来进入接下来的迭代拼接环节。
例如,对于具有N=100个窗口的序列,我们要得到包含编号x=10的窗口的L2,当R=20时,范围(x-R,x+R)等于(10-20,10+20),即(-10,30),由于1≤x-R<x+R≤N,范围被修正为(1,30)。统计在此范围内的编号为i=5的窗口的平均深度(或总深度的某一比例)大于(或小于)编号为x=10的样本数量, 假设该数量为80,而总样本数为100,得到比例为0.8。而设定的比例阈值为0.5,因为0.8>0.5,则将序列“5,10”记录下来,存储进L2集合。
步骤S103:提取有效曲线片段集合
对于两两不同来源的相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合。具体而言:
从某一群体(以下称为A群体)中挖掘得到的相似曲线片段,并不能保证均是其独有的,需要针对另一个特定的群体(以下称为B群体)进行过滤,过滤之后的子集可作为区分这两个群体的依据。
在过滤过程中,对每个片段进行打分,作为其权重。打分有多种方式,在本发明的一个实施例中,采用Fisher精确检验的p值作为打分依据。具体过程如下:对于某一曲线片段,分别统计A群体中与其匹配的样本数和不匹配的样本数,得到a、b两个值,按同样方法统计B群体中与其匹配的样本数和不匹配的样本数,得到c、d。所谓“匹配”定义如下:假如两个序列的共有元素在先后顺序上是一致的,则认为它们是匹配的。例如,对于序列“1,2,3,4”和序列“2,3,4”,其共有元素为2、3、4。由于2、3、4这三个元素在两个序列中的先后顺序是一致的,所以认为这两个序列是匹配的。
假如某一曲线片段与提取自某一样本的曲线是匹配的,则认为该曲线片段与该样本匹配。以a、b、c、d四个值作为参数,计算Fisher精确检验的p值,当p值小于设定阈值时,将其校正为该值,并将p值的倒数作为此曲线片段的权重。过滤掉权重低于设定阈值的曲线片段,得到过滤后的有效曲线片段集合。
对B群体也进行上述操作,得到相应的有效曲线片段集合,此过程中用来过滤相似曲线片段的样本集为A群体集合。上述过程得到两个有效曲线片段集合为一对互补有效曲线片段集合。
步骤S104:判断待测样本的来源
用待测样本的DNA测序数据生成待测样本深度曲线,将待测样本深度曲线 与两两互补有效曲线片段集合进行比较,根据待测样本深度曲线分别与两两互补有效曲线片段集合的匹配程度判断待测样本的来源。具体而言:
对于待测的未知类型样本,经过与步骤S101相同的深度曲线提取,得到待测样本深度曲线。将待测样本深度曲线与一对互补有效曲线片段集合的双方进行比较,若待测样本深度曲线与某一方的某一曲线片段匹配,则给该类型加上此曲线片段的权重值。全部待测样本深度曲线检查完毕后,比较双方的得分情况,以分数较高方作为获胜者。在某些需要高特异性的情况下,可以要求双方分数之差达到某一设定阈值,否则认为无法判断。
一对互补有效曲线片段集合可以作为一个两两分类模型。当要求多对多分类时,将所有的有效曲线片段集合两两组合训练,得到多个两两分类模型。将待测样本深度曲线分别与各个两两分类模型进行比较,每个模型中获胜的一方分数加1,最后所有类型中分数最高的一方作为多对多的判断结果。同样,也可以要求最高分和次高分之差达到某一设定阈值,以提高特异性。
例如,在要求将某一样本判断为A、B、C三个类型之一时,首先运用两两分类模型判断。假设该样本的真实类型是A,则在两两分类模型AB中被判断为A,在两两分类模型AC中也被判断为A,在两两分类模型BC中被判断为B(或C)。3次判断中,有两次被判断为A,故A得到2分,最后一次判断为B(或C),B(或C)得到1分。由于A的2分多于B(或C)的1分,所以该样本被判断为A类型。
一般情况下,完成步骤S104即可实现待测样本来源的判断。然而,在一些实际操作中,互补有效曲线片段集合的双方片段数和权重高低可能不一致,从而出现某一方的有效曲线片段集合中的片段比另一方数量多且权重高的情况,用于分类预测时,会使预测结果向该类型倾斜,造成偏向性。为了解决该问题,需要对其进行删减,删减过程针对一对互补有效曲线片段集合的双方同时进行。
在本发明的一个实施例中,按照图3所示的原理,从互补有效曲线片段集合中提取互补平衡曲线片段集合,该互补平衡曲线片段集合与待测样本深度曲线比较时的偏向性,比互补有效曲线片段集合与待测样本深度曲线比较时的偏向 性低。具体而言:
将互补有效曲线片段集合的双方分别按权重由高到低排列(第1个元素为权重最高的元素或者之一),得到两个排序后的序列,并各提供一个试探指针,指向双方各自的某一元素。当指针指向一个序列中的第M个元素时,提取第1到M个元素的部分子集(元素数量为M),结合另一互补集合中元素数量为N的子集,对用来提取曲线片段的两个集合样本进行评估,观察预测准确性,此处预测准确性同样可以使用Fisher精确检验来估算。
例如,假如一对互补有效曲线片段集合分别由群体A和群体B挖掘所得,用前面所讲的方法将群体A的样本判断为A的数量记为a,判断为B的数量记为b;将群体B的样本判断为A的数量记为c,判断为B的数量记为d。根据a、b、c、d四个值计算Fisher精确检验的p值,p值越小代表预测准确性越高。
为找到最佳的未知M和N,需要进行迭代试验。首先,将N设为合理区间内的随机值,遍历M所有可能的取值,记下预测准确性最高的位置M’;然后,把M的值设为M’,遍历N所有可能的取值,同样记下预测准确性最高的位置N’,并将N的值设为N’。不断重复迭代上述过程,直到M’和N’不再变化。
重复多次此过程,并保证每次的初始随机位置不同。记下其中能使预测准确性达到最佳的一对N’和M’。分别根据这两个值提取相应的子序列,得到一对互补平衡曲线片段集合。
然后,依据上述互补平衡曲线片段集合,对待测样本进行打分预测,判断待测样本的来源。具体而言:
将前述待测样本深度曲线与两两互补有效曲线片段集合进行比较的判断过程中的“互补有效曲线片段集合”替换成最后得到的“互补平衡曲线片段集合”得到最终的判断方法。
一对互补平衡曲线片段集合可以作为一个两两分类模型。当要求多对多分类时,将所有训练集两两组合训练,得到多个两两分类模型。待测样本深度曲线分别与各个两两分类模型进行比较,每个模型中获胜的一方分数加1,最后所 有类型中分数最高的一方作为多对多的判断结果。同样此处也可以要求最高分和次高分之差达到某一设定阈值,以提高特异性。
本领域技术人员可以理解,上述实施方式中各种方法的全部或部分功能可以通过硬件的方式实现,也可以通过计算机程序的方式实现。当上述实施方式中全部或部分功能通过计算机程序的方式实现时,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:只读存储器、随机存储器、磁盘、光盘、硬盘等,通过计算机执行该程序以实现上述功能。例如,将程序存储在设备的存储器中,当通过处理器执行存储器中程序,即可实现上述全部或部分功能。另外,当上述实施方式中全部或部分功能通过计算机程序的方式实现时,该程序也可以存储在服务器、另一计算机、磁盘、光盘、闪存盘或移动硬盘等存储介质中,通过下载或复制保存到本地设备的存储器中,或对本地设备的系统进行版本更新,当通过处理器执行存储器中的程序时,即可实现上述实施方式中全部或部分功能。
因此,本发明的一种实施例还提供一种基于DNA测序数据的样本来源检测装置,包括:
存储器,用于存储程序;
处理器,用于通过执行上述存储器存储的程序以实现如下的方法:
对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在上述参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据上述深度生成深度曲线;
从上述同一来源的多个样本的深度曲线中提取相似曲线片段集合,上述相似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致;
对于两两不同来源的上述相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉上述权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;
用待测样本的DNA测序数据生成待测样本深度曲线,将上述待测样本深度曲线与上述两两互补有效曲线片段集合进行比较,根据上述待测样本深度曲线分别与上述两两互补有效曲线片段集合的匹配程度判断上述待测样本的来源。
本发明的另一种实施例还提供一种计算机可读存储介质,包括程序,该程序能够被处理器执行以实现如下的方法:
对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在上述参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据上述深度生成深度曲线;
从上述同一来源的多个样本的深度曲线中提取相似曲线片段集合,上述相似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致;
对于两两不同来源的上述相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉上述权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;
用待测样本的DNA测序数据生成待测样本深度曲线,将上述待测样本深度曲线与上述两两互补有效曲线片段集合进行比较,根据上述待测样本深度曲线分别与上述两两互补有效曲线片段集合的匹配程度判断上述待测样本的来源。
以下通过实施例详细说明本发明的技术方案和效果,应当理解,实施例仅是示例性的,不能理解为对本发明保护范围的限制。
实施例
本实施例使用表1所示的DNA测序数据验证了本发明的方法的可行性。本实施例中,采用人类参考基因组(GRCh37.p13),采用BWA比对软件将DNA测序数据比对到参考基因组上,参考基因组上划分窗口40K大小,以相应窗口中端首落在该窗口的DNA片段数作为深度值。
表1样本相关信息
Figure PCTCN2017085150-appb-000001
Figure PCTCN2017085150-appb-000002
训练集使用了以上4种组织各自80%的样本,测试集使用了剩余20%的样本以及两个来源共8+48=56例肺癌患者的血浆样本。结果如表2所示:
表2样本来源判断结果
Figure PCTCN2017085150-appb-000003
以上应用了具体个例对本发明进行阐述,只是用于帮助理解本发明,并不用以限制本发明。对于本发明所属技术领域的技术人员,依据本发明的思想, 还可以做出若干简单推演、变形或替换。

Claims (10)

  1. 一种基于DNA测序数据的样本来源检测方法,其特征在于,包括:
    对于至少两个不同来源的样本的DNA测序数据,分别将同一来源的多个样本的数据比对到参考基因组上,在所述参考基因组上划分窗口,分别统计每个窗口中的测序深度,并根据所述深度生成深度曲线;
    从所述同一来源的多个样本的深度曲线中提取相似曲线片段集合,所述相似曲线片段集合中的每一曲线片段与相对应的深度曲线至少部分一致;
    对于两两不同来源的所述相似曲线片段集合,计算每一曲线片段分别属于这两个不同来源的集合的权重,并过滤掉所述权重低于设定阈值的曲线片段,得到两两互补有效曲线片段集合;
    用待测样本的DNA测序数据生成待测样本深度曲线,将所述待测样本深度曲线与所述两两互补有效曲线片段集合进行比较,根据所述待测样本深度曲线分别与所述两两互补有效曲线片段集合的匹配程度判断所述待测样本的来源。
  2. 根据权利要求1所述的方法,其特征在于,所述测序深度为端首或端尾或中点或任一固定的相对或绝对位置落在相应窗口的DNA片段数;所述深度曲线表示为所有窗口按测序深度由大到小或由小到大排列得到的排序的窗口编号序列。
  3. 根据权利要求1所述的方法,其特征在于,所述提取相似曲线片段集合具体包括:
    寻找包含两个窗口长度的子片段,所述子片段是连续的或不连续的;
    将多个所述子片段按照两两前后有窗口重叠的方式迭代拼接成大于两个窗口长度的子片段,直至没有更长的子片段生成,则所有子片段组成所述相似曲线片段集合。
  4. 根据权利要求3所述的方法,其特征在于,所述寻找包含两个窗口长度的子片段具体包括:
    对于被划分成N个窗口的参考基因组,将每个窗口的位置作为其编号;搜索包含编号为x的窗口的子片段时,遍历编号为x-R至x+R范围内的窗口;计算所述范围内编号为i的窗口内归一化深度大于或小于编号为x的窗口的样本数占总样本数的比例,若该比例大于设定值,则认为片段(i,x)是可用于迭代拼接的子片段;
    其中,1≤x≤N,1≤x-R<x+R≤N,i≠x,R表示所述子片段不连续的最大间距值。
  5. 根据权利要求1所述的方法,其特征在于,所述计算每一曲线片段分别属于这两个不同来源的集合的权重具体包括:
    对于每一曲线片段,分别统计其与这两个不同来源的样本中匹配的样本数和不匹配的样本数;然后计算Fisher精确检验的p值;将该p值的倒数作为所述曲线片段的权重。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,在得到所述两两互补有效曲线片段集合之后且在生成所述待测样本深度曲线之前,所述方法还包括:
    从所述互补有效曲线片段集合中提取互补平衡曲线片段集合,所述互补平衡曲线片段集合与所述待测样本深度曲线比较时的偏向性,比所述互补有效曲线片段集合与所述待测样本深度曲线比较时的偏向性低。
  7. 根据权利要求6所述的方法,其特征在于,所述从所述互补有效曲线片段集合中提取互补平衡曲线片段集合具体包括:
    将所述两两互补有效曲线片段集合中的元素分别按权重由高到低排列,得到两个序列,提取一个集合中M个元素的子集,结合另一互补集合中N个元素的子集,对所述两两互补有效曲线片段集合的样本进行评估,观察预测准确性;
    为找到最佳的M和N,将N设为合理区间内的随机值,遍历M所有可能的取值,记下所述预测准确性最高的位置M’;然后把M设为M’,遍历N所有 可能的取值,记下所述预测准确性最高的位置N’,并将N设为N’,不断重复迭代该过程,直到M’和N’不再变化;
    根据上述两个不再变化的M’和N’值,提取相应的子集,得到一对互补平衡曲线片段集合。
  8. 根据权利要求6或7所述的方法,其特征在于,所述方法对于样本来源大于两个来源的情况,所述判断所述待测样本的来源具体包括:
    将所述待测样本深度曲线分别与所有组合的两两互补平衡曲线片段集合进行比较,统计各个集合的预测分数,以所述预测分数最高的集合对应的样本来源作为所述待测样本的来源。
  9. 一种基于DNA测序数据的样本来源检测装置,其特征在于,包括:
    存储器,用于存储程序;
    处理器,用于通过执行所述存储器存储的程序以实现如权利要求1至8中任一项所述的方法。
  10. 一种计算机可读存储介质,其特征在于,包括程序,所述程序能够被处理器执行以实现如权利要求1至8中任一项所述的方法。
PCT/CN2017/085150 2017-05-19 2017-05-19 基于dna测序数据的样本来源检测方法、装置和存储介质 WO2018209704A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780089043.3A CN110462056B (zh) 2017-05-19 2017-05-19 基于dna测序数据的样本来源检测方法、装置和存储介质
PCT/CN2017/085150 WO2018209704A1 (zh) 2017-05-19 2017-05-19 基于dna测序数据的样本来源检测方法、装置和存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/085150 WO2018209704A1 (zh) 2017-05-19 2017-05-19 基于dna测序数据的样本来源检测方法、装置和存储介质

Publications (1)

Publication Number Publication Date
WO2018209704A1 true WO2018209704A1 (zh) 2018-11-22

Family

ID=64273080

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/085150 WO2018209704A1 (zh) 2017-05-19 2017-05-19 基于dna测序数据的样本来源检测方法、装置和存储介质

Country Status (2)

Country Link
CN (1) CN110462056B (zh)
WO (1) WO2018209704A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110504014A (zh) * 2019-08-20 2019-11-26 福州大学 一种具有实时反馈功能的颈椎康复训练信息管理方法及系统
WO2023236058A1 (zh) * 2022-06-07 2023-12-14 深圳华大生命科学研究院 肺结节筛查模型的组建方法和装置以及肺结节筛查方法和装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105349678A (zh) * 2015-12-03 2016-02-24 上海美吉生物医药科技有限公司 一种染色体拷贝数变异的检测方法
WO2016183106A1 (en) * 2015-05-11 2016-11-17 Natera, Inc. Methods and compositions for determining ploidy
US20170009287A1 (en) * 2015-07-08 2017-01-12 Quest Diagnostics Investments Incorporated Detecting genetic copy number variation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9783850B2 (en) * 2010-02-19 2017-10-10 Nucleix Identification of source of DNA samples
JP2014530629A (ja) * 2011-10-28 2014-11-20 ビージーアイダイアグノーシス カンパニー リミテッドBgi Diagnosis Co., Ltd. 染色体の微細欠失及び微細重複を検出する方法
CN105765076B (zh) * 2013-12-17 2019-07-19 深圳华大基因股份有限公司 一种染色体非整倍性检测方法及装置
CN103955630A (zh) * 2014-03-26 2014-07-30 田埂 制备参考数据库及对待测游离核酸样本进行目标区域序列比对的方法
TWI813141B (zh) * 2014-07-18 2023-08-21 香港中文大學 Dna混合物中之組織甲基化模式分析
WO2016090583A1 (zh) * 2014-12-10 2016-06-16 深圳华大基因研究院 测序数据处理装置和方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183106A1 (en) * 2015-05-11 2016-11-17 Natera, Inc. Methods and compositions for determining ploidy
US20170009287A1 (en) * 2015-07-08 2017-01-12 Quest Diagnostics Investments Incorporated Detecting genetic copy number variation
CN105349678A (zh) * 2015-12-03 2016-02-24 上海美吉生物医药科技有限公司 一种染色体拷贝数变异的检测方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110504014A (zh) * 2019-08-20 2019-11-26 福州大学 一种具有实时反馈功能的颈椎康复训练信息管理方法及系统
WO2023236058A1 (zh) * 2022-06-07 2023-12-14 深圳华大生命科学研究院 肺结节筛查模型的组建方法和装置以及肺结节筛查方法和装置

Also Published As

Publication number Publication date
CN110462056B (zh) 2023-08-29
CN110462056A (zh) 2019-11-15

Similar Documents

Publication Publication Date Title
AU2019229273B2 (en) Ultra-sensitive detection of circulating tumor DNA through genome-wide integration
CN109072309B (zh) 癌症进化检测和诊断
US11961589B2 (en) Models for targeted sequencing
CN109767810B (zh) 高通量测序数据分析方法及装置
CN113257350B (zh) 基于液体活检的ctDNA突变程度分析方法和装置、ctDNA性能分析装置
CN116403644B (zh) 一种用于癌症风险预测的方法及装置
EP3729441A1 (en) Microsatellite instabilty detection
CN113838533A (zh) 一种癌症检测模型及其构建方法和试剂盒
WO2018209704A1 (zh) 基于dna测序数据的样本来源检测方法、装置和存储介质
CN107851136B (zh) 用于对未知重要性的变体划分优先级顺序的系统和方法
US11535896B2 (en) Method for analysing cell-free nucleic acids
US20200105374A1 (en) Mixture model for targeted sequencing
EP4018003A1 (en) Systems and methods for predicting and monitoring treatment response from cell-free nucleic acids
Hua et al. Combining protein-protein interactions information with support vector machine to identify chronic obstructive pulmonary disease related genes
US20220336044A1 (en) Read-Tier Specific Noise Models for Analyzing DNA Data
US20230407405A1 (en) Method for diagnosing cancer and predicting type of cancer based on single nucleotide variant in cell-free dna
Wong et al. Regional and bilateral MRI and gene signatures in facioscapulohumeral dystrophy: implications for clinical trial design and mechanisms of disease progression
Chieruzzi Identification of RAS co-occurrent mutations in colorectal cancer patients: workflow assessment and enhancement
Wang et al. Prediction of miRNA-Disease Associations Based on Stacked Autoencoders and Extremely Randomized Trees
Zhang Error Correction Algorithms for Genomic Sequencing Data
CN116312814A (zh) 一种肺腺癌分子分型模型的构建方法、设备、装置以及试剂盒
WO2023077114A1 (en) Determining fragmentomic signatures based on latent variables of nucleic acid molecules
Yu Statistical Methods and Analyses for Next-generation Sequencing Data
CN112992273A (zh) 一种预测早期结直肠癌风险评估模型及系统
Kariotis et al. Robust clustering of transcriptomes through multiple testing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17910089

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17910089

Country of ref document: EP

Kind code of ref document: A1