WO2020227952A1 - 基于测序数据的碱基突变检测方法、装置及存储介质 - Google Patents

基于测序数据的碱基突变检测方法、装置及存储介质 Download PDF

Info

Publication number
WO2020227952A1
WO2020227952A1 PCT/CN2019/086972 CN2019086972W WO2020227952A1 WO 2020227952 A1 WO2020227952 A1 WO 2020227952A1 CN 2019086972 W CN2019086972 W CN 2019086972W WO 2020227952 A1 WO2020227952 A1 WO 2020227952A1
Authority
WO
WIPO (PCT)
Prior art keywords
base
tested
mutation
specific
sample
Prior art date
Application number
PCT/CN2019/086972
Other languages
English (en)
French (fr)
Inventor
刘斯洋
黄树嘉
金鑫
Original Assignee
深圳华大基因股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因股份有限公司 filed Critical 深圳华大基因股份有限公司
Priority to SG11202112408QA priority Critical patent/SG11202112408QA/en
Priority to HUE19928972A priority patent/HUE061763T2/hu
Priority to PL19928972.9T priority patent/PL3971902T3/pl
Priority to PCT/CN2019/086972 priority patent/WO2020227952A1/zh
Priority to EP19928972.9A priority patent/EP3971902B1/en
Priority to ES19928972T priority patent/ES2942359T3/es
Priority to IL288026A priority patent/IL288026A/en
Priority to DK19928972.9T priority patent/DK3971902T3/da
Priority to CN201980093764.0A priority patent/CN113795886B/zh
Publication of WO2020227952A1 publication Critical patent/WO2020227952A1/zh
Priority to US17/522,920 priority patent/US20220068437A1/en

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the embodiments of the present application relate to gene mutation detection technology, for example, to a method, device, and storage medium for detecting base mutation based on sequencing data.
  • Genome mutations include single-base mutations, small indels, and structural variations. Different genomic mutations have different detection algorithms. This application focuses on single-base mutation detection. There are relatively stable open source algorithms for single-base mutation detection in the second-generation sequencing technology, such as the Genome Analysis Toolkit (GATK).
  • GATK Genome Analysis Toolkit
  • the single-base mutation detection algorithm in the related technology still has the following main defects: first, it is impossible to detect three-base and four-base mutation sites; second, it is not suitable for low sequencing depth and large sample data Scene.
  • non-invasive prenatal testing is a type of genome-wide ultra-low depth sequencing, with an average sequencing depth as low as 0.06x and a genome coverage rate of about 6%.
  • the sequencing depth is very low, a large amount of non-invasive prenatal sequencing data, such as one hundred thousand, one million or even ten million sequencing data, constitutes an excellent original data set for mutation detection.
  • the world has accumulated about 7.2 million non-invasive prenatal sequencing data.
  • the population-specific variation spectrum obtained through the accumulated population genome sequencing data has an important guiding role for disease research and drug development.
  • This application discloses a base mutation detection method, device and storage medium based on sequencing data, by which the method can realize base mutation detection on sequencing data with low sequencing depth and large data volume.
  • the embodiment of the application discloses a base mutation detection method based on sequencing data, including:
  • the specific base includes adenine A base, thymine T base, cytosine C base or guanine G base.
  • the embodiment of the application also discloses a base mutation detection device based on sequencing data, including:
  • the initial frequency determination module is set to determine the initial frequency of specific bases at the research site in the sequencing data of multiple samples to be tested;
  • the expected value calculation module is configured to calculate the expected value of each sample to be tested as a specific base at the research site based on the initial frequency;
  • An update module configured to update the initial frequency of the sequencing data of the multiple samples to be tested as specific bases at the research site by using each expected value
  • the iterative module is set to use the updated initial frequency to continue to calculate the expected value of each sample to be tested as a specific base at the research site, and use each new expected value to analyze the sequencing data of the multiple samples to be tested in the research site
  • the point is the initial frequency of the specific base to continue to update, repeat the above iterative process, until each sample to be tested converges to the expected value of the specific base at the research site;
  • the mutation type determination module is set to determine the base mutation type and mutation confidence of each sample to be tested at the research site according to the expected value of each convergence;
  • the specific base includes adenine A base, thymine T base, cytosine C base or guanine G base.
  • the embodiment of the present application also discloses a storage medium containing computer-executable instructions, which are executed by a computer processor to implement any of the aforementioned methods for detecting base mutations based on sequencing data.
  • a method for base mutation detection based on sequencing data disclosed in the embodiment of the application is to determine the initial frequency of a specific base at the research site in the sequencing data of multiple samples to be detected;
  • the expected value of the sample at the research site is a specific base; each expected value is used to update the initial frequency of the sequencing data of the sample to be tested as the specific base at the research site; the updated initial frequency is used to continue to calculate each pending Detect the expected value of the sample at the research site as a specific base, and use each new expected value to continue to update the sequencing data of the multiple samples to be tested at the initial frequency of the specific base at the research site, and repeat the above iterative process , Until the expected value of each sample to be tested for a specific base at the research site converges; according to the expected value of each convergence, the technical means to determine the type of base mutation and the mutation confidence of each sample to be tested at the research site are realized. Detection of base mutations in sequencing data with low sequencing depth and large data volume.
  • FIG. 1 is a schematic flowchart of a method for detecting base mutations based on sequencing data according to an embodiment of the application;
  • FIG. 2 is a schematic structural diagram of a base mutation detection device based on sequencing data according to an embodiment of the application;
  • FIG. 3 is a schematic diagram of the detection rate of different mutation frequencies of the above-mentioned base mutation detection method provided by an embodiment of the application under different sample data conditions;
  • FIG. 4 is a schematic diagram of the mutation detection rate of different types of mutations under different mutation frequencies in the above-mentioned base mutation detection method when the number of samples is 140,000 according to an embodiment of the application;
  • FIG. 5 is a schematic diagram of the difference between the estimated mutation frequency of the mutation site and the true value of the above-mentioned base mutation detection method under different mutation frequencies according to an embodiment of the application;
  • FIG. 6 is a schematic diagram of the difference between the estimated mutation frequency of the mutation site and the true value of the mutation frequency of the mutation site in the above-mentioned base mutation detection method provided by another embodiment of the application.
  • the base mutation detection method based on sequencing data disclosed in the examples of this application can be applied to prenatal testing or other ultra-large-scale population sequencing data. Usually, the sequencing depth will be low. Of course, the base mutation based on sequencing data provided by this application The detection method can also be used in situations where the sequencing depth is high. Although the sequencing depth of each sample is very low, when there are many samples, each site is covered by many multiplying base sequences (hereinafter referred to as reads). Each base from independent reads covering the current position corresponds to a base quality value generated by the sequencer. The base quality value is recorded as p(d i
  • the base quality value reflects the probability of wrong detection of the corresponding base.
  • the base covered on the first read is an A
  • the base quality value of A is 30, which means that the first read is detected at the first position
  • the corresponding correct probability is 1-10 -3 ;
  • A represents adenine base
  • T represents thymine base
  • C represents cell Pyrimidine base
  • G represents guanine base.
  • FIG. 1 is a schematic flow chart of a method for detecting base mutations based on sequencing data provided by an embodiment of the application. As shown in Figure 1, the method includes the following steps.
  • step 110 the initial frequency at which the sequencing data of multiple samples to be tested are specific bases at the research site is determined.
  • the research site refers to the site to be detected whether there is a base mutation;
  • the specific base includes adenine A base, thymine T base, cytosine C base or guanine G base.
  • determining the initial frequency of sequencing data of multiple samples to be tested as specific bases at the research site includes: counting the number of specific bases carried in the sequencing data of multiple samples to be tested and the number of specific bases to be tested. The total number of the four bases in the sequencing data of the test sample; the quotient of the number of the specific bases and the total number of the four bases is used as the sequencing data of the multiple samples to be tested as the specific base at the research site The initial frequency of the base.
  • bi j represents the number of specific base j carried in the sequencing data of the sample i to be tested
  • N represents the total number of samples to be tested.
  • n represents the total number of four bases in the sequencing data of multiple samples to be tested
  • p j represents the number of sequencing data of multiple samples to be tested in
  • the research site is the initial frequency of a specific base j
  • j ⁇ 0,1,2,3 ⁇
  • p 0 represents the initial frequency of base A in the sequence data of multiple samples to be tested at the research site
  • p 1 represents
  • the sequencing data of multiple samples to be tested is the initial frequency of base C at the research site
  • p 2 represents the initial frequency of the sequencing data of multiple samples to be tested as base G at the research site
  • p 3 represents the initial frequency of base G at the research site
  • the sequence data of the sample is the initial frequency of base T at the study site.
  • step 120 the expected value of each sample to be tested as a specific base at the research site is calculated based on the initial frequency.
  • the expected value of each sample to be tested as a specific base at the research site is calculated by the following formula (2):
  • b i, j indicate that the sample i to be tested is a specific base j at the research site
  • p j indicates that the sequencing data of multiple samples to be tested is the initial frequency of a specific base j at the research site
  • d i represents the test to be tested
  • p j ,d i ) represents the expected value of the sample i to be tested as a specific base j at the research site, p( b i,j
  • p j ) represents the prior probability that the sample i to be tested is a specific base j at the research site given p j , p(d i
  • each expected value is used to update the initial frequency at which the sequencing data of the multiple samples to be tested are specific bases at the research site.
  • the following formula (3) is used to update the initial frequency at which the sequencing data of the multiple samples to be tested are specific bases at the research site using each expected value:
  • p j ,d i ) indicates that the sample i to be tested is a specific base j at the research site
  • N represents the number of samples to be tested
  • n represents the total number of four bases in the sequencing data of multiple samples to be tested
  • the four bases refer to bases A, T, C, and G.
  • step 140 use the updated initial frequency to continue to calculate the expected value of each sample to be tested as a specific base at the research site, and use each new expected value to analyze the sequencing data of the multiple samples to be tested in the research site.
  • the point is the initial frequency of the specific base continues to be updated, and the above iterative process is repeated until each sample to be tested converges to the expected value of the specific base at the research site.
  • the new expected value of each sample to be tested as a specific base at the research site is obtained by formula (2), and the new expected value is used to compare the multiple
  • the sequencing data of the sample to be tested continues to be updated at the initial frequency of the specific base at the research site until the expected value obtained by formula (2) converges.
  • the conditions for the expected value convergence include: the difference between the expected values obtained in adjacent iterations is less than one ten thousandth for three consecutive times; for example, the expected value obtained from the 10th iteration is q10, the expected value obtained from the 11th iteration is q11, and the 12th iteration The expected value obtained in the 13th iteration is q12, and the expected value obtained in the 13th iteration is q13. If (q11-q10) ⁇ 0.0001, and (q12-q11) ⁇ 0.0001, and (q13-q12) ⁇ 0.0001, it can be determined that the expected value has reached convergence Condition, the expected value q13 is the expected value of convergence.
  • the optimal base frequency can be obtained, that is, the value of the above formula (3) corresponding to the convergence of the expected value, that is, the frequency at which the research site is a specific base j.
  • step 150 the base mutation type and the mutation confidence of each sample to be tested at the research site are determined according to the expected value of each convergence.
  • determining the base mutation type of each sample to be tested at the research site according to the expected value of each convergence includes: calculating multiple samples to be tested that belong to four specific types at the research site according to each expected value of convergence.
  • the maximum likelihood estimation value of each specific base mutation type in the base mutation type calculate the ratio of the maximum likelihood estimation value of two adjacent specific base mutation types; the ratio is processed according to preset rules to Obtain the probability corresponding to the ratio; in the case that the probability is less than the set threshold, determine that the base mutation type of each sample to be tested at the research site is the specific base mutation type corresponding to the current denominator; wherein, the The four types of specific base mutations include: single-base mutation, two-base mutation, three-base mutation and four-base mutation.
  • the calculation of the maximum likelihood estimation value of each of the four specific base mutation types of the four specific base mutation types of the multiple samples to be tested according to the expected value of each convergence includes: calculating according to the following formula (4) The maximum likelihood estimates of each of the four specific base mutation types of the four specific base mutation types at the research site:
  • D represents the observation data composed of the set of bases in the base sequence covered by all the samples to be tested at the research site
  • p j represents the sequencing data of multiple samples to be tested obtained according to the expected value of each convergence.
  • the point is the frequency of a specific base j
  • the calculation of the ratio of the maximum likelihood estimates of two adjacent specific base mutation types includes: assuming that the study site is a four-base mutation and the maximum likelihood estimate is f 4 , the study site The point is that the maximum likelihood estimate of the three-base mutation is f 3 ; Determined as the ratio of the maximum likelihood estimates of two adjacent specific base mutation types.
  • the maximum likelihood estimate is f 2
  • the minimum of the maximum likelihood estimates of the four mutation combinations in the three-base mutation is f 3 min;
  • the mutation combinations are: ⁇ A,T,C ⁇ , ⁇ A,T,G ⁇ , ⁇ T,C,G ⁇ and ⁇ A,C,G ⁇ , each mutation combination corresponds to one of the above formula (4)
  • the maximum likelihood estimate of. will Determined as the ratio of the maximum likelihood estimates of two adjacent specific base mutation types.
  • the maximum likelihood estimation value is f 1
  • the minimum of the maximum likelihood estimation values of the 16 mutation combinations in the two-base mutation is f 2 min; Determined as the ratio of the maximum likelihood estimates of two adjacent specific base mutation types.
  • processing the ratio according to a preset rule to obtain the probability corresponding to the ratio includes: performing a natural logarithmic operation on the ratio to obtain a first result; and multiplying the obtained first result by ⁇ 2.
  • the mutation frequency of any one base is assumed to be 0, in two-base mutation, the mutation frequency of any two bases is assumed to be 0, and in single-base mutation, the mutation frequency of any three bases is assumed. Is 0, so the above ratio ( as well as The numerator of) has a frequency parameter p j less than the denominator. Therefore, the statistic obtained by taking the natural logarithm of the above ratio and multiplying by -2 obeys the chi-square distribution of one degree of freedom, so it can be passed The square value distribution table finds the probability corresponding to each ratio.
  • the probability is less than the set threshold, for example, if the probability corresponding to LRT 4vs3 is less than 10 -6 , it is determined that the hypothesis is not true, that is, it is determined that the research site is not a three-base mutation but a four-base mutation; if LRT 3vs2 If the corresponding probability is less than 10 -6 , it is determined that the research site is not a two-base mutation but a three-base mutation; if the corresponding probability of LRT 2vs1 is less than 10 -6 , it is determined that the research site is not a single-base mutation. It is a two-base mutation.
  • the determination of the variation confidence of each sample to be tested at the research site according to the expected value of each convergence includes: The corresponding probability is subjected to a conventional Phred-scale transformation to obtain a Phred quality value; the Phred quality value is determined as the variation confidence of each sample to be tested at the research site.
  • the mutation confidence can further determine whether there is a base mutation in the research site.
  • the embodiment of the application discloses a base mutation detection method based on sequencing data, which is completely different from the mutation detection method in the related technology, and has obvious advantages in time complexity.
  • the low data characteristics directly use the method based on the likelihood function of the observed data of allele mutations (instead of using the likelihood function based on the observed data of the genotype), which makes the whole detection method fast and can Complete the analysis of samples greater than 100,000 or even millions (for example, one million cases of prenatal testing data); in addition, the embodiments of this application do not presuppose that the research site is a two-base mutation, and for the first time, multiple likelihood tests are used The maximum likelihood estimation values of different base combinations are tested.
  • the base mutation detection method of the embodiment of the present application is the first to detect two-base mutations, as well as single-base and multi-base mutations. Detection.
  • the base mutation detection method of the embodiment of the present application is used to analyze the data of one hundred thousand and one million cases of prenatal testing, high-precision population mutation site and frequency information can be obtained, and this information has both scientific research and industrial value.
  • the device includes: an initial frequency determination module 210, an expected value calculation module 220, an update module 230, an iteration module 240, and a mutation type determination module 250;
  • the initial frequency determination module 210 is configured to determine the initial frequency of the sequence data of a plurality of samples to be tested as a specific base at the research site; the expected value calculation module 220 is configured to calculate each target frequency based on the initial frequency.
  • the detection sample is the expected value of the specific base at the research site; the update module 230 is configured to use each expected value to update the initial frequency of the sequencing data of the multiple samples to be tested as the specific base at the research site; iterative module 240.
  • the mutation type determination module 250 is set to determine each base based on the expected value of each convergence The base mutation type and the mutation confidence of the sample to be tested at the research site; wherein, the specific base includes adenine A base, thymine T base, cytosine C base or guanine G base.
  • An apparatus for detecting base mutations based on sequencing data determines the initial frequency of a specific base at the research site in the sequencing data of multiple samples to be detected;
  • the expected value of the sample at the research site being a specific base; using each expected value to update the initial frequency of the sequencing data of the multiple samples to be tested as the specific base at the research site; using the updated initial frequency to continue calculating each
  • the expected value of each sample to be tested is a specific base at the research site, and each new expected value is used to update the sequencing data of the multiple samples to be tested at the initial frequency of the specific base at the research site, repeat the above Iterative process until each sample to be tested converges to the expected value of a specific base at the research site; the technical means to determine the type of base mutation and the mutation confidence of each sample to be tested at the research site according to the expected value of each convergence To detect the base mutation of the sequencing data with low sequencing depth and large data volume.
  • the above-mentioned base mutation detection method is used to analyze large-scale simulation data.
  • the purpose is to test the mutation detection rate, false positive rate, and the difference between the estimated mutation site frequency and the true value of the above-mentioned base mutation detection method.
  • the simulation data simulates a total of 100 monopolymic loci, 50,000 di-allelic loci, 50,000 tri-allelic loci, and 50,000 four-base Locus (tetra-allelic loci).
  • the minimum allelic mutation frequency is set at one-tenths of an interval, as shown in Table 1:
  • the sequencing depth of a given sample is 0.06x, and the sequencing error rate is about 0.01.
  • the main purpose of the simulation data is to observe the mutation detection rate and false positive rate of the above-mentioned base mutation detection method based on sequencing data provided in the examples of this application under different conditions (different sample numbers, different mutation frequencies, and different mutation types). And the difference between the estimated frequency of mutation sites and the true value.
  • the horizontal axis represents the mutation frequency
  • the vertical axis represents the detection rate.
  • Figure 3 shows the detection rates of 44,000 (44k) people, 140,000 (140k) people, and 1 million (1M) people with different mutation frequencies.
  • attention is paid to the minimum mutation frequency that the above-mentioned sequencing data-based base mutation detection method can detect (when the detection rate>0) under a certain sample size and the minimum mutation frequency when the detection rate reaches 100%.
  • the lowest mutation can be detected (when the detection rate>0)
  • the frequency and the lowest mutation frequency when the detection rate reaches 100% are 0.002 and 0.008, respectively.
  • the lowest mutation frequency that can be detected when the detection rate> 0
  • the lowest mutation frequency when the detection rate reaches 100% The mutation frequency is 0.001 and 0.008, respectively.
  • no non-mutation sites were detected, indicating that the method is highly accurate.
  • the result of FIG. 4 shows that the method for detecting base mutations based on sequencing data provided by the embodiments of the present application can sensitively and accurately detect three-base or four-base mutations.
  • Figure 5 reflects the average difference between the frequency of each frequency band and the true mutation frequency, which intuitively reflects the approximate difference between the detected frequency and the true frequency, but because different frequency bands have different metrics, it is not possible to compare the difference of different frequencies horizontally.
  • Figure 6 unifies the metrics of different frequency bands, and can compare the difference between the detection frequency and the true frequency of the method under different frequency bands.
  • An embodiment of the present application also discloses a storage medium containing computer-executable instructions, when the computer-executable instructions are executed by a computer processor, are used to execute a method for detecting base mutations based on sequencing data, the method comprising: determining The sequencing data of multiple samples to be tested is the initial frequency of a specific base at the research site; based on the initial frequency, the expected value of each sample to be tested as a specific base at the research site is calculated; The sequencing data of a sample to be tested is updated at the initial frequency of the specific base at the research site; the updated initial frequency is used to continue to calculate the expected value of each sample to be tested as a specific base at the research site, and use the new expected value The sequencing data of the multiple samples to be tested is continuously updated at the initial frequency of the specific base at the research site, and the above iterative process is repeated until the expected value converges; according to the expected value of convergence, it is determined that each sample to be tested is at the research site The type of base mutation and the mutation confidence of the point; where

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Public Health (AREA)
  • General Physics & Mathematics (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

本文公布一种基于测序数据的碱基突变检测方法、装置及存储介质,所述方法包括确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,直到每个待检测样本在研究位点为特定碱基的期望值收敛;根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度。

Description

基于测序数据的碱基突变检测方法、装置及存储介质 技术领域
本申请实施例涉及基因变异检测技术,例如涉及一种基于测序数据的碱基突变检测方法、装置及存储介质。
背景技术
自2008年第二代测序技术走向成熟且被推向商业化以来,测序数据持续增长,基于测序数据进行多种下游应用的算法也处于活跃开发状态中。其中一类具有广泛使用前景的算法叫做基于测序数据的碱基突变检测方法。基因组变异包括单碱基突变,小插入缺失和结构性变异,不同的基因组变异各自有不同的检测算法,本申请专注于单碱基突变的变异检测。针对第二代测序技术中单碱基突变检测已经有较为稳定的开源算法,比如基因组分析工具包(Genome Analysis ToolKit,GATK)。
但是,相关技术中的单碱基突变检测算法仍存在如下主要缺陷:第一,无法检测出三碱基和四碱基突变位点;第二,不适用于测序深度偏低且样本数据很大的场景。例如,由于内存、速度等问题,单碱基突变检测算法甚至无法稳定地分析出大于5万的样本数据。然而,无创产前检测就是一种全基因组超低深度的测序类型,平均测序深度低至0.06x,基因组覆盖率约为6%。虽然测序深度很低,但是大量无创产前测序数据比如说十万,百万甚至千万例的测序数据,构成了变异检测一个极好的原始数据集。且,全球累积了大概720万人的无创产前测序数据,通过对这些累积起来的群体基因组测序数据获得的人群特异变异频谱对于疾病研究,药物研发都有重要的指引作用。
在这一超大规模样本但是测序深度偏低的场景中,相关技术中的单碱基突变检测算法已经无法正常运行,因此,开发一种新的单碱基突变检测算法以适应上述应用场景则显得非常有必要。
发明内容
本申请公布一种基于测序数据的碱基突变检测方法、装置及存储介质,通过所述方法可实现对测序深度低,数据量大的测序数据进行碱基突变检测。
本申请实施例公布了一种基于测序数据的碱基突变检测方法,包括:
确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;
基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;
利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;
利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛;
根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度;
其中,所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
本申请实施例还公布了一种基于测序数据的碱基突变检测装置,包括:
初始频率确定模块,设置为确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;
期望值计算模块,设置为基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;
更新模块,设置为利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;
迭代模块,设置为利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛;
变异类型确定模块,设置为根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度;
其中,所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
本申请实施例还公布一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行以实现如上所述的任意一种基于测序数据的碱基突变检测方法。
本申请实施例公布的一种基于测序数据的碱基突变检测方法,通过确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;利用每个期望值对所述待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛;根据每个收敛的期望值确定每个待检测样本在研 究位点的碱基突变类型以及变异置信度的技术手段,实现了对测序深度低,数据量大的测序数据的碱基突变检测。
附图说明
图1为本申请一实施例提供的一种基于测序数据的碱基突变检测方法的流程示意图;
图2为本申请一实施例提供的一种基于测序数据的碱基突变检测装置的结构示意图;
图3为本申请一实施例提供的上述碱基突变检测方法在不同样本数据的条件下对不同突变频率的检出率示意图;
图4为本申请一实施例提供的在样本数目为14万的情况下,上述碱基突变检测方法在不同突变频率下对不同突变类型的突变检出率的示意图;
图5为本申请一实施例提供的上述碱基突变检测方法在不同突变频率下对突变位点的突变频率估算与真实值之间的差异示意图;
图6为本申请另一实施例提供的上述碱基突变检测方法在不同突变频率下对突变位点的突变频率估算与真实值之间的差异示意图。
具体实施方式
以下是对本文描述的主题的概述。本概述并非是为了限制权利要求的保护范围。
除非另有定义,本文所使用的所有的技术和科学术语与本技术领域的技术人员通常理解的含义相同。本文中所使用的术语是为了描述具体的实施例,不用于限制本申请。本文所使用的术语“和/或”包括一个或多个相关的所列项 目的任意的和所有的组合。
本申请实施例公布的基于测序数据的碱基突变检测方法可应用于产前检测或者其他超大规模的人群测序数据中,通常测序深度都会偏低,当然本申请提供的基于测序数据的碱基突变检测方法也可用于测序深度较高的情况。虽然每个样本的测序深度很低,但是当样本很多时,每个位点都有很多乘碱基序列(以下简称为reads)的覆盖。每个覆盖到当前位点的来自独立reads的碱基都对应有一个由测序仪产生的碱基质量值,该碱基质量值记为p(d i|b i,j),具体表示覆盖到当前位点的来自样本i独立reads d i的碱基j的质量值。所述碱基质量值反应了对应碱基测错的概率。例如第一个位点L1={A,T}中第一条reads上覆盖的碱基是一个A,A的碱基质量值为30,则表示第一条reads在第一个位点上检测到碱基A的错误概率为10 –(30/10)=10 -3,对应的正确概率为1-10 -3;其中,A表示腺嘌呤碱基,T表示胸腺嘧啶碱基,C表示胞嘧啶碱基,G表示鸟嘌呤碱基。如果样本是产前样本,且当前位点有多于一条reads的覆盖,为了避免胎儿和母亲DNA混杂导致reads之间不能独立的问题,每个样本只抽取一条reads序列。
图1为本申请实施例提供的基于测序数据的碱基突变检测方法的流程示意图。如图1所示,所述方法包括如下步骤。
在步骤110中,确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率。
其中,所述研究位点指待检测是否存在碱基突变的位点;所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
在一实施例中,确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率,包括:统计多个待检测样本的测序数据中携带的特定碱基的数量以及多个待检测样本的测序数据中四种碱基的总数量;将所述特定碱基的数量与 所述四种碱基的总数量的商作为多个待检测样本的测序数据在研究位点为特定碱基的初始频率。
利用公式表示上述确定所述初始频率的过程为:
Figure PCTCN2019086972-appb-000001
其中,bi j表示待检测样本i的测序数据中携带的特定碱基j的数量,N表示待检测样本的总数,
Figure PCTCN2019086972-appb-000002
表示多个待检测样本的测序数据中携带的特定碱基j的总数,n表示多个待检测样本的测序数据中四种碱基的总数量,p j表示多个待检测样本的测序数据在研究位点为特定碱基j的初始频率,j={0,1,2,3},p 0表示多个待检测样本的测序数据在研究位点为碱基A的初始频率,p 1表示多个待检测样本的测序数据在研究位点为碱基C的初始频率,p 2表示多个待检测样本的测序数据在研究位点为碱基G的初始频率,p 3表示多个待检测样本的测序数据在研究位点为碱基T的初始频率。
在步骤120中,基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值。
在一实施例中,通过如下公式(2)计算每个待检测样本在研究位点为特定碱基的期望值:
Figure PCTCN2019086972-appb-000003
其中,b i,j表示待检测样本i在研究位点为特定碱基j,p j表示多个待检测样本的测序数据在研究位点为特定碱基j的初始频率,d i表示待检测样本i在研究位点覆盖到的碱基序列中的碱基集合,p(b i,j|p j,d i)表示待检测样本i在研 究位点为特定碱基j的期望值,p(b i,j|p j)表示在给定p j的情况下,待检测样本i在研究位点为特定碱基j的先验概率,p(d i|b i,j)表示待检测样本i在研究位点覆盖到的碱基序列的碱基质量值。
在步骤130中,利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新。
在一实施例中,通过如下公式(3)利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新:
Figure PCTCN2019086972-appb-000004
其中,
Figure PCTCN2019086972-appb-000005
表示多个待检测样本的测序数据在研究位点为特定碱基更新后的初始频率,p(b i,j|p j,d i)表示待检测样本i在研究位点为特定碱基j的期望值,N表示待检测样本的数量,n表示多个待检测样本的测序数据中四种碱基的总数量,所述四种碱基指碱基A、T、C和G。
在步骤140中,利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛。
在一实施例中,利用上述公式(3)的结果
Figure PCTCN2019086972-appb-000006
替换公式(2)中的p j,通过公式(2)得到每个待检测样本在研究位点为特定碱基的新的期望值,并利用新的期望值按照上述公式(3)对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,直到公式(2)得出的期望值收敛。期望值收 敛的条件包括:相邻次迭代得到期望值之间的差连续三次小于万分之一;例如,第10次迭代得到的期望值为q10,第11次迭代得到的期望值为q11,第12次迭代得到的期望值为q12,第13次迭代得到的期望值为q13,若(q11-q10)<0.0001,且(q12-q11)<0.0001,且(q13-q12)<0.0001,则可确定期望值达到了收敛条件,期望值q13为收敛的期望值。
直到所述期望值收敛时,可得到最优的碱基频率,即所述期望值收敛时对应的上述公式(3)的值,即研究位点为特定碱基j的频率。
在步骤150中,根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度。
在一实施例中,根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型,包括:根据每个收敛的期望值计算多个待检测样本在研究位点属于四种特定碱基突变类型中每种特定碱基突变类型的最大似然估计值;计算两种相邻特定碱基突变类型的最大似然估计值的比值;对所述比值按照预设规则进行处理,以得到所述比值对应的概率;在所述概率小于设定阈值的情况下,确定每个待检测样本在研究位点的碱基突变类型为当前分母对应的特定碱基突变类型;其中,所述四种特定碱基突变类型包括:单碱基突变、二碱基突变、三碱基突变和四碱基突变。
所述根据每个收敛的期望值计算多个待检测样本在研究位点属于四种特定碱基突变类型中每种特定碱基突变类型的最大似然估计值,包括:按照如下公式(4)计算多个待检测样本在研究位点属于四种特定碱基突变类型中每种特定碱基突变类型的最大似然估计值:
Figure PCTCN2019086972-appb-000007
其中,D表示所有待检测样本在研究位点覆盖到的碱基序列中的碱基集合构成的观察数据,p j表示根据每个收敛的期望值得到的多个待检测样本的测序数据在研究位点为特定碱基j的频率,p(Dp j)表示待检测样本在研究位点为与j对应的碱基突变类型的最大似然估计值;在j=0的情况下,对应的碱基突变类型为单碱基突变,在j=1的情况下,对应的碱基突变类型为二碱基突变,j=2时,对应的碱基突变类型为三碱基突变,在j=3的情况下,对应的碱基突变类型为四碱基突变。
在一实施例中,所述计算两种相邻特定碱基突变类型的最大似然估计值的比值,包括:假设研究位点是四碱基突变的最大似然估计值为f 4,研究位点是三碱基突变的最大似然估计值为f 3;将
Figure PCTCN2019086972-appb-000008
确定为两种相邻特定碱基突变类型的最大似然估计值的比值。
假设研究位点是二碱基突变的最大似然估计值为f 2,三碱基突变中四种突变组合的最大似然估计值中的最小值为f 3min;三碱基突变中四种突变组合分别为:{A,T,C}、{A,T,G}、{T,C,G}和{A,C,G},每种突变组合对应一个上述公式(4)所述的最大似然估计值。将
Figure PCTCN2019086972-appb-000009
确定为两种相邻特定碱基突变类型的最大似然估计值的比值。
假设研究位点是单碱基突变的最大似然估计值为f 1,二碱基突变中16种突变组合的最大似然估计值中的最小值为f 2min;将
Figure PCTCN2019086972-appb-000010
确定为两种相邻特定碱基突变类型的最大似然估计值的比值。
对应的,对所述比值按照预设规则进行处理,以得到所述比值对应的概率,包括:对所述比值进行取自然对数操作,得到第一结果;将得到的第一结果乘 以-2,得到第二结果;通过查找卡方值分布表得到与所述第二结果对应的概率。
由于三碱基突变中,假设任意一个碱基的突变频率为0,二碱基突变中,假设任意两个碱基的突变频率为0,单碱基突变中假设任意三个碱基的突变频率为0,因此,上述比值(
Figure PCTCN2019086972-appb-000011
以及
Figure PCTCN2019086972-appb-000012
)的分子都比分母少了一个频率参数p j,因此,通过对上述比值进行取自然对数操作,并乘以-2之后得到的统计量服从一个自由度的卡方分布,故可以通过卡方值分布表求出每个比值对应的概率。
对上述比值进行取自然对数操作,并乘以-2之后得到的统计量为:
Figure PCTCN2019086972-appb-000013
Figure PCTCN2019086972-appb-000014
Figure PCTCN2019086972-appb-000015
当所述概率小于设定阈值时,例如,若LRT 4vs3对应的概率小于10 -6,则确定假设不成立,即确定研究位点不属于三碱基突变,而是四碱基突变;若LRT 3vs2对应的概率小于10 -6,则确定研究位点不属于二碱基突变,而是三碱基突变;若LRT 2vs1对应的概率小于10 -6,则确定研究位点不属于单碱基突变,而是二碱基突变。
在一实施例中,根据每个收敛的期望值确定每个待检测样本在研究位点的变异置信度,包括:将
Figure PCTCN2019086972-appb-000016
对应的概率进行常规的Phred-scale转化,得到Phred质量值;将所述Phred质量值确定为每个待检测样本在研究位点的变异置信度。
通过变异置信度可进一步确定研究位点是否真实存在碱基突变。
本申请实施例公开的一种基于测序数据的碱基突变检测方法,与相关技术中的变异检测方法截然不同,且在时间复杂度上具备明显的优势,通过利用样本数据多但是单个样本测序深度低的数据特点,直接使用了基于等位基因突变型观察到数据的似然函数的方法(而不是使用基于基因型观察到数据的似然函数),使得整个检测方法速度变得很快,能完成大于十万甚至百万样本(例如百万例产前检测数据)的分析;另外,本申请实施例没有预先假设研究位点是二碱基突变,并且首次运用了多个似然检验的办法对不同碱基组合的最大似然估计值进行了检验,因此,本申请实施例的碱基突变检测方法首创除了能进行二碱基突变的检测,还能进行单碱基以及多碱基突变的检测。当运用本申请实施例的碱基突变检测方法对十万、百万例产前检测数据进行分析时,可获得高精度的人群突变位点和频率信息,这种信息同时具有科研和产业价值。
图2为本申请实施例提供的基于测序数据的碱基突变检测装置的结构示意图。如图2所示,所述装置包括:初始频率确定模块210、期望值计算模块220、更新模块230、迭代模块240和变异类型确定模块250;
本实施例中,初始频率确定模块210,设置为确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;期望值计算模块220,设置为基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;更新模块230,设置为利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;迭代模块240,设置为利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛;变异类型确定模块250,设置为根据每个收敛的期望值确定每个待检测样本在研 究位点的碱基突变类型以及变异置信度;其中,所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
本申请实施例公布的一种基于测序数据的碱基突变检测装置,通过确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基期望值收敛;根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度的技术手段,实现了对测序深度低,数据量大的测序数据的碱基突变检测。
需要说明的是,装置实施例中,每个组成部分的具体工作原理,请参见方法实施例对应部分,此次不再赘述。
在一实施例中,为了验证上述碱基突变检测方法的效果,现提供一份利用上述碱基突变检测方法进行分析的实例,实例中利用上述碱基突变检测方法对大规模模拟数据进行分析,目的是检验上述碱基突变检测方法的突变检出率,假阳率,以及突变位点频率估算和真实值的差异。
实例
模拟数据概述
模拟数据一共模拟了100个单碱基位点(monopolymic loci),50000个二碱基位点(di-allelic loci),50000个三碱基位点(tri-allelic loci)以及50000个四碱基位点(tetra-allelic loci)。对于上述三组50000位点,都按照万分之一的间隔,设置最小 等位基因突变的频率,举例如表一所示:
表一:模拟数据的突变频率分布
Figure PCTCN2019086972-appb-000017
在设定了表一所示的碱基突变频率的分布后,给定样本测序深度为0.06x,测序错误率在0.01左右。模拟数据的主要目的在于观察本申请实施例提供的上述基于测序数据的碱基突变检测方法在不同条件下(不同样本数目,不同突变频率,不同突变类型)的突变检出率,假阳率,以及突变位点频率估算和真实值的差异。参见图3所示的本申请实施例提供的上述基于测序数据的碱基突变检测方法在不同样本数据的条件下对不同突变频率的检出率示意图,横轴表示突变频率,纵轴表示检出率,图3中示出了在不同突变频率下样本数分别为44000(44k)人、140000(140k)人以及1百万(1M)人的检出率。通常会关注在一定样本数量下上述基于测序数据的碱基突变检测方法能够检出(检出率>0时)的最低突变频率以及检出率达到100%时的最低突变频率。从图3中可以看到,随着样本数目从44k升至140k至1M,检出率>0时的最低检出突变频率分别为0.002,0.001和0.00002,而检出率达到100%时的最低突变频率(完全检测突变频率)分别为:0.015,0.005,0.002,可见上述基于测序数据的碱基突变检测方法的性能较好。特别的,100个单碱基突变都未被错误检出,即假阳率为0。图3的结果表明了本申请实施例提供的基于测序数据的碱基突变检测方法能够灵敏且准确地从大量深度低至0.06x的测序数据中探测到单碱基突变。
同时参见图4所示的在样本数目为14万时,上述碱基突变检测方法在不同突变频率下对不同突变类型的突变检出率的示意图,其中,横轴表示突变频率,纵轴表示检出率。从图4中可以看出,对于二碱基突变,能够检出(检出率>0时)的最低突变频率以及检出率达到100%时的最低突变频率分别是0.002和0.005。较难的三碱基突变以及四碱基突变虽然检出率比二碱基突变稍低,但是效果相差不大,对于三碱基突变,能够检出(检出率>0时)的最低突变频率以及检出率 达到100%时的最低突变频率分别是0.002和0.008,对于四碱基突变,能够检出(检出率>0时)的最低突变频率以及检出率达到100%时的最低突变频率分别是0.001和0.008。同样地,无一非突变位点被检出,说明了本方法的准确度较高。图4的结果表明本申请实施例提供的基于测序数据的碱基突变检测方法能够灵敏且准确地探测到三碱基或者四碱基突变。
同时参见图5和图6所示的上述碱基突变检测方法在不同突变频率下对突变位点的突变频率估算与真实值之间的差异示意图,其中,横轴表示突变频率,图5的纵轴表示均方根误差,图6的纵轴表示均方根误差变异系数。图5反映的是每一个频段频率和真实突变频率的平均差异,直观反映了检出频率与真实频率地大致差异,但因为不同频率段度量不同,所以不能横向对不同频率差异进行比较。图6统一了不同频率段的度量,可以比较不同频率段下方法的检出频率和真实频率的差异大小。从图5可以看到,随着模拟突变频率的增加,检出频率与真实频率存在一定差异但较小。从图6可以看到,低频频率(<0.003)与真实频段的度量差异在1-1.5倍左右,而频率大于0.003的突变小于度量差异。通过这个分析可以看到对于探测到的突变位点,上述方法能够准确地探测到真实的突变频率。
本申请实施例还公开一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行一种基于测序数据的碱基突变检测方法,该方法包括:确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;利用所述期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用新的期望值对所述多个待检测样本的测 序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到所述期望值收敛;根据收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度;其中,所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。

Claims (13)

  1. 一种基于测序数据的碱基突变检测方法,包括:
    确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;
    基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;
    利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;
    利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛;
    根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度;
    其中,所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
  2. 根据权利要求1所述的方法,其中,所述确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率,包括:
    统计多个待检测样本的测序数据中特定碱基的数量以及所述多个待检测样本的测序数据中四种碱基的总数量;其中,所述四种碱基包括:腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基和鸟嘌呤G碱基;
    将所述特定碱基的数量与所述四种碱基的总数量的商作为所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率。
  3. 根据权利要求1所述的方法,其中,所述基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值,包括:
    通过如下公式计算每个待检测样本在研究位点为特定碱基的期望值:
    Figure PCTCN2019086972-appb-100001
    其中,b i,j表示待检测样本i在研究位点为特定碱基j,p j表示多个待检测样本的测序数据在研究位点为特定碱基j的初始频率,d i表示待检测样本i在研究位点覆盖到的碱基序列中的碱基集合,p(b i,j|p j,d i)表示待检测样本i在研究位点为特定碱基j的期望值,p(b i,j|p j)表示在给定p j的情况下,待检测样本i在研究位点为特定碱基j的先验概率,p(d i|b i,j)表示待检测样本i在研究位点覆盖到的碱基序列的碱基质量值。
  4. 根据权利要求3所述的方法,其中,利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新,包括:
    通过如下公式利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新:
    Figure PCTCN2019086972-appb-100002
    其中,
    Figure PCTCN2019086972-appb-100003
    表示多个待检测样本的测序数据在研究位点为特定碱基更新后的初始频率,p(b i,j|p j,d i)表示待检测样本i在研究位点为特定碱基j的期望值,N表示待检测样本的数量,n表示多个待检测样本的测序数据中四种碱基的总数量。
  5. 根据权利要求1或4所述的方法,其中,根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型,包括:
    根据每个收敛的期望值计算多个待检测样本在研究位点属于四种特定碱基 突变类型中每种特定碱基突变类型的最大似然估计值;
    计算两种相邻特定碱基突变类型的最大似然估计值的比值;
    对所述比值按照预设规则进行处理,以得到所述比值对应的概率;
    在所述概率小于设定阈值的情况下,确定每个待检测样本在研究位点的碱基突变类型为当前分母对应的特定碱基突变类型;
    其中,所述四种特定碱基突变类型包括:单碱基突变、二碱基突变、三碱基突变和四碱基突变。
  6. 根据权利要求5所述的方法,其中,根据每个收敛的期望值计算多个待检测样本在研究位点属于四种特定碱基突变类型中每种特定碱基突变类型的最大似然估计值,包括:
    按照如下公式计算多个待检测样本在研究位点属于四种特定碱基突变类型中每种特定碱基突变类型的最大似然估计值:
    Figure PCTCN2019086972-appb-100004
    其中,D表示所有待检测样本在研究位点覆盖到的碱基序列中的碱基集合构成的观察数据,p j表示根据每个收敛的期望值得到的多个待检测样本的测序数据在研究位点为特定碱基j的频率,p(D|p j)表示多个待检测样本在研究位点为与j对应的碱基突变类型的最大似然估计值,p(b i,j|p j)表示在给定p j的情况下,待检测样本i在研究位点为特定碱基j的先验概率,p(d i|b i,j)表示待检测样本i在研究位点覆盖到的碱基序列的碱基质量值;在j=0的情况下,对应的特定碱基突变类型为单碱基突变,在j=1的情况下,对应的特定碱基突变类型为二碱基突变,在j=2的情况下,对应的特定碱基突变类型为三碱基突变,在j=3的 情况下,对应的特定碱基突变类型为四碱基突变。
  7. 根据权利要求6所述的方法,其中,所述计算两种相邻特定碱基突变类型的最大似然估计值的比值,包括:
    在研究位点是四碱基突变的最大似然估计值为f 4,研究位点是三碱基突变的最大似然估计值为f 3的情况下;
    Figure PCTCN2019086972-appb-100005
    确定为两种相邻特定碱基突变类型的最大似然估计值的比值。
  8. 根据权利要求6所述的方法,其中,所述计算两种相邻特定碱基突变类型的最大似然估计值的比值,包括:
    在研究位点是二碱基突变的最大似然估计值为f 2,三碱基突变中四种突变组合的最大似然估计值中的最小值为f 3min的情况下;
    Figure PCTCN2019086972-appb-100006
    确定为两种相邻特定碱基突变类型的最大似然估计值的比值。
  9. 根据权利要求6所述的方法,其中,所述计算两种相邻特定碱基突变类型的最大似然估计值的比值,包括:
    在研究位点是单碱基突变的最大似然估计值为f 1,二碱基突变中16种突变组合的最大似然估计值中的最小值为f 2min的情况下;
    Figure PCTCN2019086972-appb-100007
    确定为两种相邻特定碱基突变类型的最大似然估计值的比值。
  10. 根据权利要求7-9任一项所述的方法,其中,对所述比值按照预设规则进行处理,以得到所述比值对应的概率,包括:
    对所述比值进行取自然对数操作,得到第一结果;
    将得到的第一结果乘以-2,得到第二结果;
    通过查找卡方值分布表得到与所述第二结果对应的概率。
  11. 根据权利要求9所述的方法,其中,根据每个收敛的期望值确定每个待检测样本在研究位点的变异置信度,包括:
    Figure PCTCN2019086972-appb-100008
    对应的概率进行常规的Phred-scale转化,得到Phred质量值;
    将所述Phred质量值确定为每个待检测样本在研究位点的变异置信度。
  12. 一种基于测序数据的碱基突变检测装置,包括:
    初始频率确定模块,设置为确定多个待检测样本的测序数据在研究位点为特定碱基的初始频率;
    期望值计算模块,设置为基于所述初始频率计算每个待检测样本在研究位点为特定碱基的期望值;
    更新模块,设置为利用每个期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率进行更新;
    迭代模块,设置为利用更新后的初始频率继续计算每个待检测样本在研究位点为特定碱基的期望值,并利用每个新的期望值对所述多个待检测样本的测序数据在研究位点为特定碱基的初始频率继续进行更新,重复上述迭代过程,直到每个待检测样本在研究位点为特定碱基的期望值收敛;
    变异类型确定模块,设置为根据每个收敛的期望值确定每个待检测样本在研究位点的碱基突变类型以及变异置信度;
    其中,所述特定碱基包括腺嘌呤A碱基、胸腺嘧啶T碱基、胞嘧啶C碱基或者鸟嘌呤G碱基。
  13. 一种包含计算机可执行指令的存储介质,所述计算机可执行指令在由计算机处理器执行以实现如权利要求1-11中任一所述的一种基于测序数据的碱基突变检测方法。
PCT/CN2019/086972 2019-05-15 2019-05-15 基于测序数据的碱基突变检测方法、装置及存储介质 WO2020227952A1 (zh)

Priority Applications (10)

Application Number Priority Date Filing Date Title
SG11202112408QA SG11202112408QA (en) 2019-05-15 2019-05-15 Base mutation detection method and apparatus based on sequencing data, and storage medium
HUE19928972A HUE061763T2 (hu) 2019-05-15 2019-05-15 Szekvenálási adatokon alapuló bázismutáció-kimutatási eljárás és berendezés, valamint tárolóközeg
PL19928972.9T PL3971902T3 (pl) 2019-05-15 2019-05-15 Metoda i aparat do wykrywania mutacji punktowych w oparciu o dane z sekwencjonowania oraz nośnik danych
PCT/CN2019/086972 WO2020227952A1 (zh) 2019-05-15 2019-05-15 基于测序数据的碱基突变检测方法、装置及存储介质
EP19928972.9A EP3971902B1 (en) 2019-05-15 2019-05-15 Base mutation detection method and apparatus based on sequencing data, and storage medium
ES19928972T ES2942359T3 (es) 2019-05-15 2019-05-15 Método y aparato de detección de mutaciones de bases basados en datos de secuenciación, y medio de almacenamiento
IL288026A IL288026A (en) 2019-05-15 2019-05-15 Method and device for basic mutation detection based on sequencing data, and storage media
DK19928972.9T DK3971902T3 (da) 2019-05-15 2019-05-15 Fremgangsmåde og indretning til basismutationsdetektion på basis af sekventeringsdata samt lagermedium
CN201980093764.0A CN113795886B (zh) 2019-05-15 2019-05-15 基于测序数据的碱基突变检测方法、装置及存储介质
US17/522,920 US20220068437A1 (en) 2019-05-15 2021-11-10 Base mutation detection method and apparatus based on sequencing data, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/086972 WO2020227952A1 (zh) 2019-05-15 2019-05-15 基于测序数据的碱基突变检测方法、装置及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/522,920 Continuation US20220068437A1 (en) 2019-05-15 2021-11-10 Base mutation detection method and apparatus based on sequencing data, and storage medium

Publications (1)

Publication Number Publication Date
WO2020227952A1 true WO2020227952A1 (zh) 2020-11-19

Family

ID=73290105

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/086972 WO2020227952A1 (zh) 2019-05-15 2019-05-15 基于测序数据的碱基突变检测方法、装置及存储介质

Country Status (10)

Country Link
US (1) US20220068437A1 (zh)
EP (1) EP3971902B1 (zh)
CN (1) CN113795886B (zh)
DK (1) DK3971902T3 (zh)
ES (1) ES2942359T3 (zh)
HU (1) HUE061763T2 (zh)
IL (1) IL288026A (zh)
PL (1) PL3971902T3 (zh)
SG (1) SG11202112408QA (zh)
WO (1) WO2020227952A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798578B (zh) * 2022-12-06 2024-06-18 中国人民解放军军事科学院军事医学研究院 一种分析与检测病毒新流行变异株的装置及方法

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570350A (zh) * 2015-12-17 2017-04-19 复旦大学 单核苷酸多态位点分型算法
WO2017115741A1 (ja) * 2015-12-28 2017-07-06 国立大学法人東北大学 アリル特異的遺伝子発現頻度の推定方法、推定用コンピュータシステム及び推定用プログラム

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491666B (zh) * 2017-09-01 2020-11-10 深圳裕策生物科技有限公司 异常组织中单样本体细胞突变位点检测方法、装置和存储介质
CN108733975B (zh) * 2018-03-29 2021-09-07 深圳裕策生物科技有限公司 基于二代测序的肿瘤克隆变异检测方法、装置和存储介质

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570350A (zh) * 2015-12-17 2017-04-19 复旦大学 单核苷酸多态位点分型算法
WO2017115741A1 (ja) * 2015-12-28 2017-07-06 国立大学法人東北大学 アリル特異的遺伝子発現頻度の推定方法、推定用コンピュータシステム及び推定用プログラム

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUA XING: "Study on Statistical Methods for Cancer Genome Sequencing Data", CHINESE DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, 1 April 2012 (2012-04-01), pages 1 - 124, XP055860181 *
SAHARON R. ET AL.: "Maximum Likelihood Estimation of Site-Specific Mutation Rates in Human Mitochondrial DNA From Partial Phylogenetic Classification", GENETICS, vol. 180, no. 3, 30 November 2008 (2008-11-30), XP055753427, DOI: 20200215182716X *

Also Published As

Publication number Publication date
CN113795886A (zh) 2021-12-14
EP3971902A4 (en) 2022-05-04
US20220068437A1 (en) 2022-03-03
DK3971902T3 (da) 2023-04-17
EP3971902A1 (en) 2022-03-23
CN113795886B (zh) 2024-08-06
EP3971902B1 (en) 2023-01-18
HUE061763T2 (hu) 2023-08-28
PL3971902T3 (pl) 2023-07-10
ES2942359T3 (es) 2023-05-31
SG11202112408QA (en) 2021-12-30
IL288026A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
Garrison et al. Haplotype-based variant detection from short-read sequencing
JP6718885B2 (ja) コピー数多型検出のための方法及びシステム
US12006533B2 (en) Detecting cross-contamination in sequencing data using regression techniques
Crawford et al. Assessing the accuracy and power of population genetic inference from low-pass next-generation sequencing data
CN111916150A (zh) 一种基因组拷贝数变异的检测方法和装置
WO2020132572A1 (en) Source of origin deconvolution based on methylation fragments in cell-free-dna samples
Yuan et al. Detection of significant copy number variations from multiple samples in next-generation sequencing data
WO2020227952A1 (zh) 基于测序数据的碱基突变检测方法、装置及存储介质
Zhang et al. GAEP: a comprehensive genome assembly evaluating pipeline
CN115035950A (zh) 基因型检测方法、样本污染检测方法、装置、设备及介质
Friedlander et al. A numerical framework for genetic hitchhiking in populations of variable size
AU2021342561A1 (en) Detecting cross-contamination in sequencing data
Doria-Belenguer et al. Probabilistic graphlets capture biological function in probabilistic molecular networks
AU2022346858A1 (en) Methylation fragment probabilistic noise model with noisy region filtration
Temple et al. Modeling recent positive selection in Americans of European ancestry
Niehus et al. PopDel identifies medium-size deletions jointly in tens of thousands of genomes
Sitarčík et al. epiBAT: Multi-objective bat algorithm for detection of epistatic interactions
CN118098345B (zh) 一种染色体非整倍体的检测方法、装置、设备及存储介质
KR102441856B1 (ko) 중요도 샘플링을 활용한 다중변이 연관연구 방법
Greenberg Information theoretic limits of metagenomic binning
CN117174178A (zh) 一种基于二代短读长序列的单倍型距离评估方法及装置
Zucker et al. Generating simulated SNP array and sequencing data to assess genomic segmentation algorithms
Nguyen et al. SASeq: a selective and adaptive shrinkage approach to detect and quantify active transcripts using RNA-Seq
Luo Accurate and Integrative Detection of Copy Number Variants With High-Throughput Data
Liu Variable selection for longitudinal and irregular high-dimensional data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19928972

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019928972

Country of ref document: EP

Effective date: 20211215