WO2020063052A1 - Method, acquisition device, storage medium, and electronic device for obtaining fetal free DNA concentration - Google Patents

Method, acquisition device, storage medium, and electronic device for obtaining fetal free DNA concentration

Info

Publication number
WO2020063052A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
fetus
fetal
mother
sequencing data
Prior art date
Application number
PCT/CN2019/096367
Other languages
English (en)
French (fr)
Inventor
关永涛
徐寒黎
张静波
王伟伟
伍启熹
方楠
白灵
王建伟
Original Assignee
北京优迅医疗器械有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京优迅医疗器械有限公司
Publication of WO2020063052A1

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6809Methods for determination or identification of nucleic acids involving differential detection
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813Hybridisation assays
    • C12Q1/6827Hybridisation assays for detection of mutation or polymorphism
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844Nucleic acid amplification reactions
    • C12Q1/6858Allele-specific amplification
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844Nucleic acid amplification reactions
    • C12Q1/686Polymerase chain reaction [PCR]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/118Prognosis of disease development
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/154Methylation markers

Definitions

  • The invention relates to the field of biological detection, and in particular to a method, an acquisition device, a storage medium, and an electronic device for obtaining fetal free DNA concentration.
  • Quantifying the fetal free nucleic acid concentration is important in non-invasive prenatal testing (NIPT), because it determines whether an NIPT result is valid.
  • The importance of quantifying the fetal nucleic acid concentration is reflected in the following. First, when the fetal concentration is known, samples with a very low fetal concentration (for example, less than 3%) can be reported as "no conclusion", and the pregnant woman can be advised to choose other prenatal testing methods. This largely avoids NIPT false negatives, since low fetal concentration is their main cause.
  • Second, when the fetal concentration is known, the expected change in chromosome content is also known, which greatly improves the statistical power of NIPT screening.
  • NIPT yields low-depth whole-genome sequencing data from the peripheral blood of pregnant women.
  • An estimated value of the content of each chromosome is obtained.
  • The basis of this method is that Y chromosome fragments can only come from a male fetus. The higher the fetal concentration, the higher the Y chromosome content.
  • A male fetus has only one X chromosome, so the higher the fetal concentration, the lower the X chromosome content. Therefore, the fetal concentration of male fetuses can be inferred from the content of the sex chromosomes.
  • This type of method must use paired-end sequencing, so that the length of each cfDNA fragment can be inferred from the aligned positions of Read1 and Read2.
  • The basis of this type of method is that the length distribution of fetal cfDNA differs from that of maternal cfDNA.
  • The higher the fetal concentration, the more the cfDNA in the peripheral blood of pregnant women shifts: the cfDNA peak at 143 bp increases significantly, while the cfDNA peak at 166 bp decreases significantly. Therefore, the fetal concentration can be inferred from the distribution of cfDNA fragment lengths in the peripheral plasma of pregnant women.
  • This type of method can use deep targeted NGS to sequence several SNP sites across the genome at high depth in the peripheral blood of pregnant women, treating the cfDNA as a mixture of compound genotypes (AAAA, AAAB, ABAA, ABAB; the first two letters of each group represent the maternal genotype, and the last two represent the fetal genotype). The fetal cfDNA concentration is then estimated directly from the heterozygosity ratio in the sequencing data.
  • This method is based on the fact that fetal DNA methylation is different from maternal DNA methylation, and methylation sequencing is used to distinguish fetal and maternal cfDNA, thereby inferring fetal free nucleic acid concentration.
  • The basis of this method is that the cfDNA of highly expressed genes is more easily degraded.
  • Fetal cfDNA is derived from the placenta, which has its own specific pattern of gene expression.
  • Samples from pregnancies with male fetuses are used to build a statistical prediction model that captures the correlation between fetal concentration and genome-wide coverage data; this model is then used to predict the fetal concentration.
  • Accurate quantification of fetal concentration has always been technically difficult, and each existing method has drawbacks.
  • The traditional method of quantifying fetal concentration based on sex chromosomes cannot quantify the fetal concentration of female fetuses.
  • Methods based on the difference between fetal and maternal cfDNA fragment lengths require paired-end sequencing, which increases sequencing costs, and they are not highly accurate.
  • Methods based on the allele frequency of SNP loci require high-depth sequencing.
  • The 0.1x low-depth sequencing used in NIPT cannot meet this requirement.
  • Methods based on methylation involve tedious experimental processing steps and high sequencing costs. Methods based on the non-uniform coverage of fetal free DNA across the genome are not accurate enough.
  • Embodiments of the present invention provide a method, an acquisition device, a storage medium, and an electronic device for obtaining fetal free DNA concentration to solve the problem of high fetal concentration detection cost in the prior art.
  • A method for obtaining fetal free DNA concentration includes: obtaining sequencing data of a sample to be tested, wherein the sample to be tested is taken from a mother who is pregnant with a fetus; establishing a joint probability distribution model of mother and fetal genotypes, wherein the joint probability distribution model includes one or more factors that affect read heterozygosity, read heterozygosity being the ratio of the number of SNP sites covered by different bases to the total number of sites in the sequencing data; and substituting the values of the one or more factors and the read heterozygosity value into the joint probability distribution model, and performing maximum likelihood estimation on the parameters of the joint probability distribution model to obtain the fetal free DNA concentration.
  • The one or more factors include at least one of the following: the inbreeding coefficient of the mother, the inbreeding coefficient of the fetus, the sequencing error rate, and population allele frequency information. Before the values of the one or more factors and the read heterozygosity are substituted into the joint probability distribution model, the values of the one or more factors are obtained.
  • The mother's inbreeding coefficient is obtained by low-depth sequencing of white blood cells, or is obtained together with the fetal free DNA concentration by performing maximum likelihood estimation on the joint probability distribution model.
  • The inbreeding coefficient of the fetus is obtained by one of the following: setting the inbreeding coefficient of the fetus to 0; performing white blood cell sequencing on the father of the fetus to obtain the inbreeding coefficient of the fetus; or using the population average of the inbreeding coefficient as the inbreeding coefficient of the fetus.
  • The population allele frequency information is obtained by one of the following: from data of the population to which the mother belongs; or by calculation from a predetermined number of included NIPT samples.
  • Obtaining the sequencing data of the sample to be tested includes: extracting free DNA from the sample to be tested and sequencing it to obtain the original sequencing data; and processing the original sequencing data to obtain the sequencing data, wherein the processing turns the original sequencing data into sequencing data suitable for obtaining read heterozygosity.
  • Processing the original sequencing data to obtain sequencing data includes: deleting low-quality reads; and aligning the reads retained after deletion to a reference genome, and taking the reads that satisfy the alignment strategy as the sequencing data.
  • The low-quality reads include at least one of the following: duplicate reads introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20; and/or the alignment strategy includes one of the following: allowing at most one mismatch, and retaining only uniquely aligned reads.
  • Extracting free DNA from the sample to be tested and performing sequencing includes: extracting free DNA from the sample to be tested and performing low-depth whole-genome sequencing.
  • In the joint probability distribution model, the MMFF column represents the genotypes of the mother and the fetus; A and B represent the two alleles at a SNP locus; the Prob column represents the joint probability of the genotypes of the mother and the fetus; p and q represent the population allele frequencies of alleles A and B, respectively; F1 represents the inbreeding coefficient of the mother; F2 represents the inbreeding coefficient of the fetus; e represents the sequencing error rate; the f_A column represents the frequency of allele A in the sequencing data of the sample; and h represents the fetal free DNA concentration.
  • A device for obtaining fetal free DNA concentration is provided. The device is used to store or run a module, or the module is a component of the device; the module is a software module, there are one or more software modules, and the software modules are used to perform any of the methods described above.
  • A storage medium stores a computer program, and the computer program is configured to execute, when run, the steps in any one of the foregoing method embodiments.
  • An electronic device includes a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the foregoing method embodiments.
  • A method for obtaining fetal free DNA concentration is provided: a joint probability distribution model of mother and fetal genotypes is established, and the fetal free DNA concentration is obtained by solving the model with the values of its factors and the value of the read heterozygosity affected by these factors.
  • This method can use the conventional low-depth NGS data of NIPT. Without adding any experimental or sequencing cost, it achieves quantitative detection of fetal concentration at low cost and with high accuracy, and it is also applicable to detecting the fetal concentration of female fetuses.
  • FIG. 1 is a flowchart of a method for obtaining a fetal free DNA concentration according to Embodiment 1 of the present invention
  • FIG. 2 is a graph comparing the actual fetal concentration obtained based on the simulated mixed sample data with the expected result according to Example 2 of the present invention
  • FIG. 3 is a graph comparing the fetal concentration obtained from a real mixed sample with the mixing concentration according to Example 3 of the present invention
  • FIG. 4 is a graph comparing the fetal concentration obtained from a real male fetal NIPT sample with the sex chromosome inferred concentration according to Example 4 of the present invention
  • FIG. 5 is a graph comparing the jointly inferred maternal inbreeding coefficient and fetal concentration with the concentration inferred from the sex chromosomes according to Embodiment 5 of the present invention.
  • FIG. 6 is a structural block diagram of a fetal free DNA concentration acquisition device according to Embodiment 6 of the present invention.
  • FIG. 7 is a detailed structural block diagram of a fetal free DNA concentration acquisition device according to Embodiment 6 of the present invention.
  • FIG. 1 is a flowchart of a method for obtaining a fetal free DNA concentration according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:
  • Step S102: obtaining sequencing data of a sample to be tested, wherein the sample to be tested is taken from a mother who is pregnant with a fetus;
  • Step S104: establishing a joint probability distribution model of mother and fetal genotypes, wherein the joint probability distribution model includes one or more factors affecting read heterozygosity, and read heterozygosity is the number of SNP sites covered by different bases in the sequencing data as a proportion of the total number of sites;
  • Step S106: substituting the values of the one or more factors and the value of the read heterozygosity into the joint probability distribution model, and performing maximum likelihood estimation on the joint probability distribution model to obtain the fetal free DNA concentration.
  • The above method establishes a joint probability distribution model of mother and fetal genotypes, and obtains the fetal free DNA concentration by maximum likelihood estimation using the values of the factors in the model and the value of the read heterozygosity affected by these factors.
  • This method can use the conventional low-depth NGS data of NIPT. Without adding any experimental or sequencing cost, it achieves quantitative detection of fetal concentration at low cost and with high accuracy, and it is also applicable to detecting the fetal concentration of female fetuses.
  • the execution subject of the foregoing steps may be a base station or a terminal, but is not limited thereto.
  • the above method further includes: obtaining values of one or more factors.
  • the number of the above factors affecting read heterozygosity varies according to the source of sequencing data, and the values of each of these factors are also different. For example, when the sequencing quality is high, the sequencing error rate e is usually around 0.001.
  • the population allele frequency information varies from one population to another. For example, the population allele frequency information obtained from East Asian populations is different from the population allele frequency information obtained from European and American populations.
  • Both the maternal inbreeding coefficient F1 and the fetal inbreeding coefficient F2 affect the statistics of SNP heterozygous sites in the sequencing data. The higher the inbreeding coefficient, the lower the probability of heterozygous sites, and the lower the inbreeding coefficient, the higher the probability of heterozygous sites.
  • The mother's inbreeding coefficient F1 can be obtained by low-depth (0.1x to 0.5x) sequencing of white blood cells. Specifically, a model similar to that of the present application is established from the low-depth white blood cell sequencing data, and the fetal concentration h can be obtained. Alternatively, F1 can be obtained together with the fetal concentration by maximum likelihood estimation on the joint probability distribution model using cfDNA low-depth sequencing data.
  • The inbreeding coefficient F2 of the fetus is obtained by one of the following: setting the inbreeding coefficient F2 of the fetus to 0; performing white blood cell sequencing on the father of the fetus to obtain the inbreeding coefficient F2 of the fetus; or using the population average of the inbreeding coefficient as the inbreeding coefficient F2 of the fetus.
  • The inbreeding coefficient F2 of the fetus is in theory affected by both the mother and the father, so in theory the father's white blood cells need to be sequenced. In practice, the population mean of the inbreeding coefficient is sufficient for obtaining the fetal free DNA concentration, because the fetal free DNA concentration is generally only about 10%.
  • The population allele frequency information is obtained by one of the following: from data of the population to which the mother belongs; or by calculation from a predetermined number of included NIPT samples.
  • Obtained from data of the population to which the mother belongs: for example, if the mother is East Asian, the information can be obtained from the East Asian population data of 1000genome (the 1000 Genomes Project) and gnomAD. Calculated from a predetermined number of included NIPT samples: for example, the information can be calculated from a large number of real NIPT samples, where the number of samples can be in the thousands or tens of thousands.
  • the step of obtaining the sequencing data of the sample to be tested can be an existing step.
  • Obtaining the sequencing data of the sample to be tested includes: extracting free DNA from the sample to be tested and sequencing it to obtain original sequencing data; and processing the original sequencing data to obtain sequencing data, wherein the processing turns the original sequencing data into sequencing data suitable for obtaining read heterozygosity.
  • Processing the original sequencing data to obtain sequencing data includes: deleting low-quality reads; and aligning the reads retained after deletion to a reference genome, and taking the reads that satisfy the alignment strategy as the sequencing data.
  • Low quality here has the same meaning as low quality in the field of conventional high-throughput sequencing. In a broad sense, it means data that cannot be processed effectively or that has a significant adverse effect on the processing result.
  • The low-quality reads include at least one of the following: duplicate reads introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20; and/or the alignment strategy includes one of the following: allowing at most one mismatch, and retaining only uniquely aligned reads.
  • Base N denotes a base that could not be determined in the original sequencing data; N is recorded in its place.
  • Various existing software tools report the sequencing quality of each base, so reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20 can easily be screened out.
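The two per-read criteria above (more than one base N, and any window of 5 consecutive nucleotides with an average quality below 20) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `is_low_quality` and its parameters are hypothetical, quality values are assumed to be already decoded to integer Phred scores, and PCR duplicate removal is assumed to be handled separately by an existing tool.

```python
from typing import Sequence


def is_low_quality(seq: str, quals: Sequence[int], window: int = 5,
                   min_mean_q: float = 20.0, max_n: int = 1) -> bool:
    """Return True if a read should be discarded (hypothetical helper).

    seq   -- read bases, e.g. "ACGTN..."
    quals -- per-base Phred quality scores, same length as seq
    """
    # Criterion 1: the read contains more than `max_n` undetermined bases (N).
    if seq.upper().count("N") > max_n:
        return True
    # Criterion 2: some window of `window` consecutive bases has a mean
    # quality below `min_mean_q`.
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return True
    return False
```

A read that passes both checks would then go on to alignment, where the "at most one mismatch" and "uniquely aligned" conditions are applied.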
  • In the alignment strategy, at most one mismatch is allowed so that the sequencing data used for subsequent processing is of higher quality: an observed base is then more likely to be a true base type than the result of a sequencing error, which helps make the fetal free DNA concentration more accurate. Keeping only uniquely aligned reads means that the data finally used for subsequent analysis aligns unambiguously to the reference genome, which ensures that the base type detected at each SNP site is true.
  • The specific amount of data after alignment is not limited, and can be set reasonably according to the sample source.
  • the sequencing data obtained after processing has at least 4M reads.
  • extracting the free DNA from the sample to be tested and performing sequencing includes: extracting the free DNA from the sample to be tested and performing low-depth sequencing of the whole genome.
  • the low-depth sequencing here can make the target coverage between 0.1x and 0.5x.
  • The theoretical basis for establishing a joint probability distribution model of maternal and fetal genotypes is that, even for low-depth sequencing data such as NIPT data, enough 1000genome SNP sites are covered by more than one read, and the coverage of these 1000genome SNP sites follows a Poisson distribution.
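To make the Poisson argument concrete, the sketch below computes the fraction of sites expected to be covered by at least a given number of reads at a given mean depth. The function names are hypothetical, and treating the per-site depth as Poisson with mean equal to the overall sequencing depth is the simplifying assumption.

```python
import math


def poisson_pmf(k: int, mean_depth: float) -> float:
    """Probability that a site is covered by exactly k reads."""
    return math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)


def fraction_covered_at_least(min_depth: int, mean_depth: float) -> float:
    """Fraction of sites expected to have depth >= min_depth."""
    return 1.0 - sum(poisson_pmf(k, mean_depth) for k in range(min_depth))


if __name__ == "__main__":
    for depth in (0.1, 0.3, 0.5):
        share = fraction_covered_at_least(2, depth)
        print(f"mean depth {depth}x: {share:.4%} of sites have >= 2 reads")
```

At a mean depth of 0.1x, roughly 0.47% of sites are expected to carry two or more reads. That is a small share, but across the millions of catalogued SNP sites it still leaves a large absolute number of informative sites.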
  • A site covered by more than one read can be defined as read homozygous (all reads carry the same base) or read heterozygous (the reads carry different bases).
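Read heterozygosity can then be computed by counting sites, as in the following sketch. The in-memory `pileup` dictionary (site label mapped to the list of bases observed there) is a hypothetical input format chosen for illustration, and restricting the denominator to sites covered by at least two reads is an assumption about what "total number of sites" means here.

```python
from typing import Dict, List


def read_heterozygosity(pileup: Dict[str, List[str]], min_depth: int = 2) -> float:
    """Share of multiply-covered SNP sites whose reads show different bases."""
    informative = 0   # sites covered by at least `min_depth` reads
    heterozygous = 0  # informative sites with more than one distinct base
    for bases in pileup.values():
        if len(bases) < min_depth:
            continue
        informative += 1
        if len(set(bases)) > 1:
            heterozygous += 1
    if informative == 0:
        raise ValueError("no site is covered by enough reads")
    return heterozygous / informative


if __name__ == "__main__":
    example = {
        "chr1:100": ["A", "A"],       # read homozygous
        "chr1:200": ["A", "G"],       # read heterozygous
        "chr1:300": ["C"],            # skipped: only one read
        "chr2:50": ["T", "T", "C"],   # read heterozygous
    }
    print(read_heterozygosity(example))
```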
  • Assuming that the inbreeding coefficients of the mother and the fetus are 0, that the sequencing error rate of the sequencing platform is also 0, and that the population allele frequency follows a uniform distribution, the joint probability model of the mother and fetal genotypes is as shown in Table 1.
  • In Table 1, MMFF represents the genotypes of the mother and the fetus; A and B represent the alleles at a SNP locus; the Prob column represents the probability of the corresponding mother and fetal genotype; and f_A represents the frequency of allele A in the sequencing data.
  • The model is then extended with the inbreeding coefficient F2 of the fetus, the inbreeding coefficient F1 of the mother, and the sequencing error rate e.
  • The inbreeding coefficient F directly affects the frequencies of the homozygous genotypes AA and BB and the heterozygous genotype AB, as follows:
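The patent's own expression is not recoverable from this extract. As a reference point, the standard population-genetics relation between an inbreeding coefficient F and genotype frequencies, for allele frequencies p and q = 1 - p, is given below; whether the patent uses exactly this form is an assumption.

```latex
\begin{aligned}
P(AA) &= p^{2} + pqF \\
P(AB) &= 2pq\,(1 - F) \\
P(BB) &= q^{2} + pqF
\end{aligned}
```

The three probabilities sum to 1 for any F, and setting F = 0 recovers the Hardy-Weinberg proportions. A larger F moves probability mass from the heterozygous genotype AB to the two homozygous genotypes.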
  • The joint probability distribution model is given in Table 2.
  • In Table 2, the MMFF column represents the genotypes of the mother and the fetus; A and B represent the two alleles at a SNP locus; the Prob column represents the joint probability of the genotypes of the mother and the fetus; p and q represent the population allele frequencies of alleles A and B, respectively; F1 represents the inbreeding coefficient of the mother; F2 represents the inbreeding coefficient of the fetus; e represents the sequencing error rate; the f_A column represents the frequency of allele A in the sequencing data of the sample; and h represents the fetal free DNA concentration.
  • This model can be used to solve for h.
  • Solving it requires knowing F1, F2, e, and the population allele frequency information.
  • the inbreeding coefficient F1 of the mother can be obtained by low-depth sequencing of white blood cells.
  • This model can solve h and F1 at the same time using the maximum likelihood method. This slightly sacrifices the accuracy of h, but saves the cost of sequencing maternal leukocytes.
  • the sequencing error rate of the platform can be obtained directly from the data.
  • the population allele frequency information can be directly obtained from East Asian population data of 1000genome and gnomAD, or it can be calculated by incorporating a large number of real NIPT samples.
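The estimation step can be illustrated with a deliberately simplified sketch. It is not the patent's Table 2 model: it assumes F1 = F2 = 0 (random mating), a single allele frequency `p` shared by all sites, a symmetric biallelic error rate `e`, and sites covered by exactly two reads, so that a site is read heterozygous with probability 2 f_A (1 - f_A). All function names are hypothetical, and h is found by a plain grid search over a binomial likelihood.

```python
import math
from typing import List, Tuple


def joint_genotypes(p: float) -> List[Tuple[int, int, float]]:
    """(mother A-dosage, fetus A-dosage, probability), assuming F1 = F2 = 0."""
    q = 1.0 - p
    mother = {2: p * p, 1: 2 * p * q, 0: q * q}
    table = []
    for m, prob_m in mother.items():
        t = m / 2.0  # probability that the mother transmits allele A
        # The paternal allele is drawn from the population (A with probability p).
        fetus = {2: t * p, 1: t * q + (1 - t) * p, 0: (1 - t) * q}
        for f, prob_f in fetus.items():
            if prob_f > 0:
                table.append((m, f, prob_m * prob_f))
    return table


def p_read_het(h: float, p: float, e: float = 0.0) -> float:
    """Probability that a site covered by two reads shows two different bases."""
    total = 0.0
    for m, f, prob in joint_genotypes(p):
        f_a = (1 - h) * m / 2 + h * f / 2        # allele A frequency in cfDNA
        f_a = f_a * (1 - e) + (1 - f_a) * e      # symmetric sequencing error
        total += prob * 2 * f_a * (1 - f_a)
    return total


def estimate_h(n_het: int, n_sites: int, p: float, e: float = 0.0,
               step: float = 0.001) -> float:
    """Grid-search maximum likelihood estimate of h on [0, 0.5]."""
    best_h, best_ll = 0.0, -math.inf
    for i in range(int(round(0.5 / step)) + 1):
        h = i * step
        ph = p_read_het(h, p, e)
        ll = n_het * math.log(ph) + (n_sites - n_het) * math.log(1 - ph)
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h


if __name__ == "__main__":
    # 27,250 read-heterozygous sites out of 100,000 two-read sites, p = 0.5
    print(estimate_h(27250, 100000, p=0.5))
```

Under these assumptions and with e = 0, the read-heterozygous probability reduces to pq(1 + h - h^2), which increases with h on [0, 0.5], so the observed read heterozygosity identifies h. Estimating F1 jointly with h, as the text describes, would add a second dimension to the search and require genotype probabilities that depend on F1.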
  • Low-depth sequencing in this application means that the coverage of the sample as a whole is 0.1x to 0.5x.
  • A coverage of 2 or 3 refers to the depth at some individual sites. For example, of the 3 billion sites in a sample, some have a depth of 0, some a depth of 1, and some a depth of 2, and other sites differ in depth in a similar way, but averaged over all sites the depth of the overall sample is 0.1x to 0.5x.
  • The technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a plurality of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
  • a terminal device which may be a mobile phone, a computer, a server, or a network device, etc.
  • the inbreeding coefficients of the mother and daughter were obtained from their respective whole-genome sequencing reads, the sequencing error rate was calculated from the reads of the mixed sample, the population allele frequency of each SNP site was obtained from the East Asian population data of 1000genome, and the percentage of heterozygous sites among all sites was obtained by counting the reads of the mixed sample; these parameters were then substituted into the aforementioned joint probability distribution model to obtain the fetal free DNA concentration h.
  • the inferred fetal concentration was compared with the expected value, and the comparison result is shown in Figure 2. It can be seen from Figure 2 that the fetal concentration obtained by the method of the present application is consistent with the expected fetal concentration (the proportion of mixed reads). For each gradient, the read mixing was repeated 100 times to calculate the mean of h (black dots in the figure) and its variance (the vertical lines represent plus or minus one variance).
  • the DNA from the mother and the fetus was mixed at different fetal concentrations (3%, 5%, 8%, and 12%) and then sequenced on the machine.
  • the sequencing was low-depth whole-genome sequencing, and the method proposed in this application was then used to infer the fetal concentration.
  • the specific sequencing depth is 0.1x
  • the sequencing error rate is 1/1000
  • the inbreeding coefficients of the mother and fetus are calculated from the respective DNA sequencing data.
  • the population allele frequency of each site is obtained from the East Asian 1000genome population data.
  • the percentage of heterozygous sites among all sites, for each mixed-sample concentration, was obtained from the sequencing data.
  • the inferred fetal concentration is compared with the mixed sample concentration, and the comparison result is shown in FIG. 3. It can be seen from Figure 3 that the fetal concentration obtained by this method is consistent with the mixed fetal concentration.
  • 69 NIPT clinical samples from pregnancies with a male fetus were selected, and the fetal concentrations were obtained by the method of this application.
  • the inferred fetal concentration was compared with the concentration inferred from the sex chromosomes.
  • the comparison results are shown in Fig. 4. It can be seen from Fig. 4 that the fetal concentrations obtained by this method and by the sex-chromosome-based inference method are highly consistent in 67 samples. There are two outliers (*).
  • for these two outliers, the fetal concentration obtained by the method of the present application is about twice the fetal concentration inferred from the sex chromosomes. These two samples are boy-girl twin pregnancies.
  • Figure 5 uses the same samples as Figure 4, except that Figure 5 jointly estimates the maternal inbreeding coefficient and the fetal concentration (no maternal leukocyte information is used). Figure 5 shows that the joint estimation can be very accurate; the boy-girl twin samples are marked with asterisks.
  • the method of the present application is based on whole-genome low-depth sequencing and can directly use existing NIPT sample data. Neither paired-end sequencing nor high-depth sequencing is required (the high-depth approach depends directly on small differences in sequencing depth between the two alleles at certain heterozygous SNP sites, so each heterozygous site must be analyzed quantitatively; this application instead counts the proportion of heterozygous SNP sites among all sites, so each site only needs a rough qualitative call as heterozygous or homozygous), and no additional sequencing cost is incurred.
  • this embodiment also provides a device for obtaining fetal free DNA concentration, which is used to implement the above-mentioned embodiments and preferred implementations; what has already been described is not repeated.
  • the term "module" may refer to a combination of software and/or hardware that implements a predetermined function.
  • although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 6 is a structural block diagram of a fetal free DNA concentration acquisition device according to Embodiment 6 of the present invention. As shown in FIG. 6, the device includes a first acquisition module 10, a model establishment module 30, and a concentration estimation module 50, where:
  • a first acquisition module 10 is configured to acquire sequencing data of a sample to be tested, where the sample to be tested is taken from a mother who is pregnant with a fetus;
  • a model establishment module 30 is used to establish a joint probability distribution model of maternal and fetal genotypes.
  • the joint probability distribution model includes one or more factors that affect the read heterozygosity, which is the ratio of the number of SNP loci covered by different bases in the sequencing data to the total number of SNP loci;
  • the concentration estimation module 50 is configured to substitute the value of one or more factors and the value of the read heterozygosity into the joint probability distribution model, and perform maximum likelihood estimation on the joint probability distribution model to obtain the fetal free DNA concentration.
  • quantification of the fetal free DNA concentration is achieved without any additional experimental or sequencing cost, and the method is low in cost, high in accuracy, and also suitable for detecting the fetal concentration of a female fetus.
  • FIG. 7 is a detailed block diagram of a fetal free DNA concentration acquisition device according to Embodiment 6 of the present invention.
  • the device includes a second acquisition module 70 in addition to all the modules shown in FIG. 6.
  • the second acquisition module is used to obtain the values of the one or more factors when the one or more factors include at least one of the following: the inbreeding coefficient of the mother, the inbreeding coefficient of the fetus, the sequencing error rate, and population allele frequency information.
  • the second acquisition module 70 includes: a first acquisition unit 20 configured to acquire the mother's inbreeding coefficient, if the one or more factors include it, through one of the following: low-depth sequencing of white blood cells; maximum likelihood estimation of the joint probability distribution model.
  • the second acquisition module 70 includes: a second acquisition unit 40 configured to acquire the inbreeding coefficient of the fetus, if the one or more factors include it, through one of the following: setting the inbreeding coefficient of the fetus to 0; obtaining the inbreeding coefficient of the fetus by white blood cell sequencing of the father of the fetus; using the mean inbreeding coefficient of the population as the inbreeding coefficient of the fetus.
  • the second acquisition module 70 includes: a third acquisition unit 60 configured to obtain the population allele frequency information, if the one or more factors include it, through one of the following: obtaining it from data of the population to which the mother belongs; calculating it from a predetermined number of included NIPT samples.
  • the first acquisition module 10 includes a sample sequencing module for extracting free DNA from the sample to be tested and performing sequencing to obtain original sequencing data; a processing module for processing the original sequencing data to obtain sequencing data, and the processing is used for The raw sequencing data is processed into sequencing data suitable for obtaining read heterozygosity.
  • the processing module includes: a deletion module for deleting low-quality reads; an alignment module for aligning the reads retained after deletion to a reference genome and taking the reads that satisfy the alignment strategy as the sequencing data.
  • the low-quality reads include at least one of the following: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, reads in which the average sequencing quality of 5 consecutive nucleotides is less than 20; and/or,
  • the alignment strategy includes one of the following: allowing at most one mismatch and retaining only reads on unique alignments.
  • the sample sequencing module includes a whole-genome low-depth sequencing module for extracting free DNA from the sample to be tested and performing whole-genome low-depth sequencing.
  • the joint probability distribution model is expressed by the following formula:
  • the MMFF column indicates the genotypes of the mother and fetus, and A and B indicate the two alleles at a SNP locus, respectively.
  • the Prob column shows the joint probability of the genotypes of the mother and fetus, and p and q represent the population allele frequencies of alleles A and B, respectively
  • F1 represents the maternal inbreeding coefficient
  • F2 represents the fetal inbreeding coefficient
  • e indicates the sequencing error rate
  • the f_A column indicates the frequency of allele A in the sequencing data of the sample
  • h represents the fetal free DNA concentration.
  • the modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, among others: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
  • An embodiment of the present invention further provides a storage medium.
  • the storage medium stores a computer program, and the computer program is configured to execute the steps in any one of the foregoing method embodiments when running.
  • the foregoing storage medium may be configured to store a computer program for performing the following steps:
  • the joint probability distribution model includes one or more factors that affect the read heterozygosity.
  • the read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of sites;
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.
  • An embodiment of the present invention further provides an electronic device including a memory and a processor.
  • the memory stores a computer program
  • the processor is configured to execute the steps in any one of the foregoing method embodiments by running the computer program.
  • the electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the processor, and the input-output device is connected to the processor.
  • the foregoing processor may be configured to execute the following steps by a computer program:
  • the joint probability distribution model includes one or more factors that affect the read heterozygosity.
  • the read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of sites;
  • the modules or steps of the present invention can be implemented by a general-purpose computing device, and they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices
  • they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from the one given here
  • alternatively, they may be made into individual integrated circuit modules, or several of the modules or steps may be made into a single integrated circuit module. As such, the invention is not limited to any particular combination of hardware and software.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Primary Health Care (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)

Abstract

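The present invention discloses a method for obtaining fetal free DNA concentration, an obtaining device, a storage medium, and an electronic device. The method includes: obtaining sequencing data of a sample to be tested, where the sample is taken from a mother carrying a fetus; establishing a joint probability distribution model of the maternal and fetal genotypes, where the model includes one or more factors that affect read heterozygosity, and read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of SNP sites; and substituting the values of the one or more factors and the obtained heterozygosity value into the joint probability distribution model and performing maximum likelihood estimation on the model to obtain the fetal free DNA concentration. The method solves the problem of the high cost of fetal concentration detection in the prior art.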

Description

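Method for Obtaining Fetal Free DNA Concentration, Obtaining Device, Storage Medium, and Electronic Device
Technical Field
The present invention relates to the field of biological detection, and in particular to a method for obtaining fetal free DNA concentration, an obtaining device, a storage medium, and an electronic device.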
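Background
Quantifying the concentration of fetal free nucleic acid is of great value in non-invasive prenatal testing (NIPT), because it determines whether NIPT can detect abnormalities effectively. Its importance shows in three ways. First, when the fetal concentration is known, a "no call" report can be issued for samples with an extremely low fetal concentration (for example, below 3%), and the pregnant woman can be advised to choose another prenatal test. This largely avoids false negatives in NIPT, since an excessively low fetal concentration is their main cause. Second, when the fetal concentration is known, the expected change in chromosome content is known, which greatly improves the statistical power of NIPT screening. Third, when the fetal concentration is known, NIPT for special samples such as sex chromosome abnormalities, twins, and mosaicism becomes simpler and more accurate. However, accurate quantification of the fetal concentration remains an unsolved problem.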
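The existing methods for quantifying fetal free DNA are as follows:
(1) Real-time quantitative PCR
In 1998, Dennis Lo et al. of the Chinese University of Hong Kong used real-time quantitative PCR to quantify fetal free DNA in maternal plasma and found that it could be detected as early as week 7 of pregnancy, with a concentration that increases with gestational age. Taking real-time fluorescent quantitative PCR as an example, primers are designed to amplify and detect the sex-determining region Y (SRY) gene in peripheral plasma samples from pregnant women. This type of method relies on the fact that SRY is a marker gene of a male fetus and is absent from maternal cfDNA. The copy number of the SRY gene per ml of sample is derived from a standard curve, from which the fetal concentration of a male fetus is inferred.
(2) Whole-genome NGS sequencing, inferring the fetal concentration from the sex chromosomes
With next-generation high-throughput sequencing, NIPT yields low-depth whole-genome sequencing data from maternal peripheral blood. The sequencing data are aligned to the reference genome and the alignment results are GC-corrected, among other steps, to estimate the content of each chromosome. This type of method relies on the fact that Y-chromosome fragments can come only from a male fetus, so the higher the fetal concentration, the higher the Y-chromosome content; likewise, a male fetus has one fewer X chromosome, so the higher the fetal concentration, the lower the X-chromosome content. The fetal concentration of a male fetus can therefore be inferred from the sex chromosome content.
(3) Whole-genome NGS sequencing (PE sequencing), inferring the fetal concentration from the length distribution of free DNA fragments
This type of method requires paired-end sequencing, so that the length of each cfDNA fragment can be inferred from the aligned positions of Read1 and Read2. It relies on the fact that the length distribution of fetal cfDNA differs from that of maternal cfDNA. Studies show that the dominant cfDNA length in plasma is 166 bp, with a periodic decrease in steps of 10 bp and a further clear peak at 143 bp. The higher the fetal concentration, the more the cfDNA peaking at 143 bp increases in maternal peripheral blood, while the cfDNA peaking at 166 bp decreases significantly. The fetal concentration can therefore be inferred from the length distribution of cfDNA fragments in maternal peripheral plasma.
(4) Deep targeted NGS sequencing of a number of SNP sites at high depth
This type of method uses deep targeted NGS sequencing to sequence a number of SNP sites across the genome in maternal peripheral blood at high depth. The cfDNA at each such site is treated as a composite genotype (AAAA, AAAB, ABAA, ABAB, where the first two letters of each group denote the maternal genotype and the last two the fetal genotype), and the fetal cfDNA concentration is estimated directly from the heterozygous ratio in the sequencing data.
(5) Methylation-marker-based methods
This type of method relies on the difference in methylation level between fetal DNA and maternal DNA; methylation sequencing is used to distinguish fetal from maternal cfDNA and thereby infer the fetal free nucleic acid concentration.
(6) Methods based on the non-uniform genome-wide coverage of fetal free DNA
This type of method relies on the observation that cfDNA from highly expressed genes is degraded more easily, and that fetal cfDNA originates from the placenta, which has a specific gene expression profile. A statistical prediction model is built on samples from male-fetus pregnancies to capture the correlation between the fetal concentration and genome-wide coverage data, and the model is then used to predict the fetal concentration.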
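Accurate quantification of the fetal concentration has long been technically difficult, for several reasons. The traditional sex-chromosome-based method cannot quantify the fetal concentration of a female fetus. The method based on the length difference between fetal and maternal cfDNA fragments requires paired-end sequencing, which increases the sequencing cost, and its accuracy is limited. The method based on allele frequencies at SNP sites requires high-depth sequencing, which the 0.1x low-depth sequencing currently used in NIPT cannot provide. Methylation-based quantification involves cumbersome experimental steps and a high sequencing cost. The method based on the non-uniform genome-wide coverage of fetal free DNA is not accurate enough.
The existing methods therefore all have shortcomings, mainly the following: they add extra experimental work; they require additional instruments and equipment; they are limited to male fetuses; their accuracy is unsatisfactory; and their cost is high.
No corresponding solution to these problems in the prior art has yet been proposed.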
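Summary of the Invention
Embodiments of the present invention provide a method for obtaining fetal free DNA concentration, an obtaining device, a storage medium, and an electronic device, to solve the problem of the high cost of fetal concentration detection in the prior art.
According to one embodiment of the present invention, a method for obtaining fetal free DNA concentration is provided. The method includes: obtaining sequencing data of a sample to be tested, where the sample is taken from a mother carrying a fetus; establishing a joint probability distribution model of the maternal and fetal genotypes, where the model includes one or more factors that affect read heterozygosity, and read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of sites; and substituting the values of the one or more factors and the obtained read heterozygosity value into the joint probability distribution model and performing maximum likelihood estimation on the parameters of the model to obtain the fetal free DNA concentration.
Further, where the one or more factors include at least one of the maternal inbreeding coefficient, the fetal inbreeding coefficient, the sequencing error rate, and population allele frequency information, the values of the one or more factors are obtained before the values of the one or more factors and the read heterozygosity value are substituted into the joint probability distribution model.
Further, where the one or more factors include the maternal inbreeding coefficient, the maternal inbreeding coefficient is obtained by low-depth sequencing of white blood cells, or is obtained together with the fetal free DNA concentration by maximum likelihood estimation of the joint probability distribution model.
Further, where the one or more factors include the fetal inbreeding coefficient, the fetal inbreeding coefficient is obtained in one of the following ways: setting the fetal inbreeding coefficient to 0; obtaining it by white blood cell sequencing of the father of the fetus; using the mean inbreeding coefficient of the population as the fetal inbreeding coefficient.
Further, where the one or more factors include population allele frequency information, that information is obtained in one of the following ways: from data of the population to which the mother belongs; by calculation from a predetermined number of included NIPT samples.
Further, obtaining the sequencing data of the sample to be tested includes: extracting free DNA from the sample and sequencing it to obtain raw sequencing data; and processing the raw sequencing data to obtain the sequencing data, where the processing turns the raw sequencing data into sequencing data suitable for deriving the read heterozygosity.
Further, processing the raw sequencing data to obtain the sequencing data includes: deleting low-quality reads; and aligning the reads retained after deletion to a reference genome and taking the reads that satisfy the alignment strategy as the sequencing data.
Further, the low-quality reads include at least one of the following: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20; and/or the alignment strategy includes one of the following: allowing at most one mismatch, and keeping only uniquely aligned reads.
Further, extracting free DNA from the sample to be tested and sequencing it includes: extracting free DNA from the sample and performing whole-genome low-depth sequencing.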
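Further, the joint probability distribution model is expressed by the following formula:
Figure PCTCN2019096367-appb-000001
Here, the MMFF column gives the genotypes of the mother and fetus; A and B denote the two alleles at a SNP site; the Prob column gives the joint probability of the maternal and fetal genotypes; p and q denote the population allele frequencies of alleles A and B, respectively; F1 denotes the maternal inbreeding coefficient; F2 denotes the fetal inbreeding coefficient; e denotes the sequencing error rate; the f_A column gives the frequency of allele A in the sequencing data of the sample; and h denotes the fetal free DNA concentration.
According to another embodiment of the present invention, a device for obtaining fetal free DNA concentration is further provided, where the device is used to store or run modules, or the modules are components of the device; the modules are software modules, of which there are one or more, and the software modules are used to execute any of the above methods.
According to yet another embodiment of the present invention, a storage medium is further provided. The storage medium stores a computer program, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.
According to yet another embodiment of the present invention, an electronic device is further provided, including a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.
In the embodiments of the present invention, the provided method obtains the fetal free DNA concentration by establishing a joint probability distribution model of the maternal and fetal genotypes and solving it using the values of the factors in the model and the value of the read heterozygosity that these factors affect. The method can use the routine low-depth NGS sequencing data of NIPT and, without any additional experimental or sequencing cost, not only quantifies the fetal concentration but is also low in cost, high in accuracy, and applicable to female fetuses.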
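Brief Description of the Drawings
The drawings described here are provided for further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for obtaining fetal free DNA concentration according to Embodiment 1 of the present invention;
Fig. 2 shows the fetal concentration obtained from simulated mixed-sample data compared with the expected value, according to Embodiment 2 of the present invention;
Fig. 3 shows the fetal concentration obtained from real mixed samples compared with the mixing concentration, according to Embodiment 3 of the present invention;
Fig. 4 shows the fetal concentration obtained from real male-fetus NIPT samples compared with the concentration inferred from the sex chromosomes, according to Embodiment 4 of the present invention;
Fig. 5 shows the joint inference of the maternal inbreeding coefficient and the fetal concentration compared with the concentration inferred from the sex chromosomes, according to Embodiment 5 of the present invention;
Fig. 6 is a structural block diagram of the device for obtaining fetal free DNA concentration according to Embodiment 6 of the present invention;
Fig. 7 is a detailed structural block diagram of the device for obtaining fetal free DNA concentration according to Embodiment 6 of the present invention.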
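Detailed Description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "include" and "have" and any variants of them in the description, the claims, and the above drawings are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should also be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence.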
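Embodiment 1
This embodiment provides a method for obtaining fetal free DNA concentration. Fig. 1 is a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain sequencing data of a sample to be tested, where the sample is taken from a mother carrying a fetus;
Step S104: establish a joint probability distribution model of the maternal and fetal genotypes, where the model includes one or more factors that affect read heterozygosity, and read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of sites;
Step S106: substitute the values of the one or more factors and the obtained read heterozygosity value into the joint probability distribution model, and perform maximum likelihood estimation on the model to obtain the fetal free DNA concentration.
The above method obtains the fetal free DNA concentration by establishing a joint probability distribution model of the maternal and fetal genotypes and performing maximum likelihood estimation using the values of the factors in the model and the value of the read heterozygosity that these factors affect. The method can use the routine low-depth NGS sequencing data of NIPT and, without any additional experimental or sequencing cost, not only quantifies the fetal concentration but is also low in cost, high in accuracy, and applicable to female fetuses.
Optionally, the above steps may be executed by a base station, a terminal, or the like, but are not limited to these.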
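In a preferred embodiment, where the one or more factors include at least one of the maternal inbreeding coefficient F1, the fetal inbreeding coefficient F2, the sequencing error rate e, and population allele frequency information, the method further includes, before the values of the one or more factors and the read heterozygosity value are substituted into the joint probability distribution model: obtaining the values of the one or more factors.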
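In practice, the number of factors that affect read heterozygosity, and their values, vary with the source of the sequencing data. For example, when the sequencing quality is high, the sequencing error rate e is usually around 0.001. Population allele frequency information differs between populations; for example, allele frequencies obtained from East Asian populations differ from those obtained from European and American populations. Both the maternal inbreeding coefficient F1 and the fetal inbreeding coefficient F2 affect the count of heterozygous SNP sites in the sequencing data. The higher the inbreeding coefficient, the lower the probability of a heterozygous site in the fetus; the lower the inbreeding coefficient, the higher that probability, since the heterozygote frequency is 2pq(1-F).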
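In a preferred embodiment, where the one or more factors include the maternal inbreeding coefficient F1, F1 can be obtained by low-depth (0.1x to 0.5x) sequencing of white blood cells. Specifically, a model similar to that of this application is built from the low-depth white blood cell sequencing data, and F1 is obtained by setting the fetal concentration h to 0. Alternatively, F1 can be obtained together with the fetal concentration by maximum likelihood estimation of the joint probability distribution model on low-depth cfDNA sequencing data.
In a preferred embodiment, where the one or more factors include the fetal inbreeding coefficient F2, F2 is obtained in one of the following ways: setting F2 to 0; obtaining F2 by white blood cell sequencing of the father of the fetus; using the mean inbreeding coefficient of the population as F2.
In theory the fetal inbreeding coefficient F2 is influenced by both the mother and the father, so it should be obtained by sequencing the father's white blood cells. However, the inventors of this application found that setting F2 to 0, or using the mean inbreeding coefficient of the population, is sufficient for obtaining the fetal free DNA concentration, because the fetal free DNA concentration is generally around 10%.
In a preferred embodiment, where the one or more factors include population allele frequency information, that information is obtained in one of the following ways: from data of the population to which the mother belongs; by calculation from a predetermined number of included NIPT samples.
For the first option, if the mother is East Asian, for example, the information can be obtained from the East Asian population data of 1000genome (the 1000 Genomes Project) and gnomAD. For the second option, the information can be calculated from a large number of real NIPT samples, for example several thousand or tens of thousands.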
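In the above method, the sequencing data of the sample to be tested can be obtained with existing procedures. In a preferred embodiment, obtaining the sequencing data includes: extracting free DNA from the sample and sequencing it to obtain raw sequencing data; and processing the raw sequencing data to obtain the sequencing data, where the processing turns the raw sequencing data into sequencing data suitable for deriving the read heterozygosity.
The processing is similar to existing processing of raw sequencing data and includes filtering the raw data to obtain the sequencing data, that is, going from raw data to clean data. In a preferred embodiment, processing the raw sequencing data includes: deleting low-quality reads; and aligning the reads retained after deletion to a reference genome and taking the reads that satisfy the alignment strategy as the sequencing data.
"Low quality" here has the same meaning as in conventional high-throughput sequencing: broadly, data that cannot be processed effectively or that clearly harm the processing result. In a preferred embodiment, the low-quality reads include at least one of the following: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20; and/or the alignment strategy includes one of the following: allowing at most one mismatch, and keeping only uniquely aligned reads.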
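As an illustration only, the two per-read quality rules above (N content and a 5-base quality window) might be sketched in Python as follows. The function names and the tuple layout of a read are hypothetical, PCR duplicate removal is assumed to have been done elsewhere, and "more than one base N" is read here as strictly more than one, although the original phrase could also mean one or more.

```python
def is_low_quality(seq, quals, max_n=1, window=5, min_mean_q=20.0):
    """Return True if a read fails either per-read rule (illustrative sketch).

    seq   : read sequence as a string
    quals : per-base Phred quality scores, same length as seq
    """
    # Rule 1: the read contains more than `max_n` uncalled bases (N).
    if seq.upper().count("N") > max_n:
        return True
    # Rule 2: some run of `window` consecutive bases has a mean quality
    # below `min_mean_q`; a sliding sum avoids re-adding each window.
    if len(quals) >= window:
        running = sum(quals[:window])
        if running / window < min_mean_q:
            return True
        for i in range(window, len(quals)):
            running += quals[i] - quals[i - window]
            if running / window < min_mean_q:
                return True
    return False


def drop_low_quality(reads):
    """Keep only the (seq, quals) pairs that pass both rules."""
    return [(s, q) for s, q in reads if not is_low_quality(s, q)]
```

In a real pipeline these rules would normally be applied by an existing read-trimming tool; the sketch only makes the stated thresholds concrete.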
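In the above preferred embodiment, base N denotes a base in the raw sequencing data that could not be called. Various existing software tools can assess the per-base sequencing quality, so reads in which the average sequencing quality of 5 consecutive nucleotides is below 20 are easy to filter out.
In the alignment strategy, allowing at most one mismatch keeps the quality of the sequencing data used downstream high, so that an observed base is more likely to be the true base type rather than a sequencing error, which in turn makes the fetal free DNA concentration more accurate. Keeping only uniquely aligned reads means that the data finally used for downstream analysis are reads that can be fully aligned to the reference genome, which ensures that the base types detected at each SNP site are real. The amount of data after alignment is not limited and can be set reasonably for the sample source. Preferably, the processed sequencing data contain at least 4M reads.
Free DNA extraction and sequencing of the sample can use existing routine sequencing. Neither high-depth nor paired-end sequencing is needed; the 0.1x to 0.5x low-depth sequencing currently used in NIPT is sufficient. High-depth sequencing, if performed, also meets the requirement. In a preferred embodiment, extracting free DNA from the sample and sequencing it includes: extracting free DNA from the sample and performing whole-genome low-depth sequencing. For low-depth sequencing here, a target coverage of 0.1x to 0.5x is sufficient.
In the above method, the theoretical basis for establishing the joint probability distribution model of the maternal and fetal genotypes is that, even for data sequenced at a depth as low as that of NIPT, enough 1000genome SNP sites are covered by more than one read, and the coverage of these 1000genome SNP sites follows a Poisson distribution.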
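To give a feel for the Poisson statement above, the following Python sketch computes the expected fraction of sites covered by at least two reads. It assumes, purely for illustration, that the per-site depth is Poisson with a rate equal to the mean sequencing depth; the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a site is covered by exactly k reads."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def fraction_with_depth_at_least(k_min, lam):
    """Probability that a site is covered by k_min or more reads."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(k_min))


def expected_sites(n_sites, lam, k_min=2):
    """Expected number of sites (out of n_sites) with depth >= k_min."""
    return n_sites * fraction_with_depth_at_least(k_min, lam)
```

Under this assumption a little under 0.5% of sites reach depth 2 or more at a mean depth of 0.1x, so a catalogue of millions of known SNP sites would still be expected to leave many informative sites.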
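Any SNP site with a coverage greater than 1 can be defined as either read-homozygous or read-heterozygous.
There is a functional relationship between the percentage of read-heterozygous sites among all sites and the fetal concentration h, because the fetus introduces paternal DNA, which turns some homozygous sites in the sample into heterozygous ones. Because the sequencing is at low depth, the probability that a heterozygous site is observed as such depends on the fetal concentration. For the same maternal background, the higher the fetal concentration, the higher the proportion of heterozygous sites measured. The percentage of heterozygous sites among all sites can therefore be used to infer the fetal concentration h.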
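The definition of read heterozygosity can be made concrete with a short Python sketch. It assumes a hypothetical input format, a mapping from SNP identifier to the string of bases observed in the reads covering that site, and it takes the denominator to be the sites covered by at least two reads, which is one possible reading of "total sites".

```python
from typing import Dict, Optional


def classify_site(bases: str) -> Optional[str]:
    """'het' if the covering reads disagree, 'hom' if they agree, None if depth < 2."""
    called = [b for b in bases.upper() if b in "ACGT"]  # ignore uncalled bases
    if len(called) < 2:
        return None
    return "het" if len(set(called)) > 1 else "hom"


def read_heterozygosity(pileups: Dict[str, str]) -> float:
    """Fraction of informative SNP sites whose reads carry different bases."""
    labels = [classify_site(b) for b in pileups.values()]
    informative = [x for x in labels if x is not None]
    if not informative:
        raise ValueError("no SNP site is covered by two or more reads")
    return informative.count("het") / len(informative)
```

For example, with four sites of which one has depth 1, one shows two different bases, and two show identical bases, the read heterozygosity is 1/3.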
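Under the most ideal conditions, assuming that the inbreeding coefficients of the mother and the fetus are both 0, that the sequencing error rate of the platform is also 0, and that population allele frequencies follow a uniform distribution, the joint probability model of the maternal and fetal genotypes is as shown in Table 1 below.
Table 1:
Figure PCTCN2019096367-appb-000002
In Table 1, MMFF denotes the genotypes of the mother and fetus, A and B denote the alleles at a SNP site, the Prob column gives the probability of the corresponding maternal and fetal genotypes, and f_A denotes the frequency of allele A in the sequencing data.
For sites with a coverage of 2 and a population allele frequency of p, the proportion of read-heterozygous sites among sites of that class is:
$P_H = (1 + h - h^2)\,p\,(1 - p)$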
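Because Table 1 is only available as an image, the following Python sketch rebuilds an ideal joint model from Mendelian inheritance under the stated assumptions (F1 = F2 = 0, e = 0, random mating) and checks that it reproduces the formula for P_H at depth 2. It is a reconstruction for illustration, not a transcription of Table 1; the function names and the encoding of genotypes by their count of A alleles are assumptions.

```python
def ideal_joint_model(p, h):
    """Yield (mother_A_count, fetus_A_count, prob, f_A) rows of an ideal model."""
    q = 1.0 - p
    mother_prob = {2: p * p, 1: 2 * p * q, 0: q * q}  # Hardy-Weinberg, F1 = 0
    for m in (2, 1, 0):
        # Mother transmits A with probability m/2, B otherwise.
        for mat, p_mat in ((1, m / 2), (0, 1 - m / 2)):
            # Father transmits A with the population frequency p (F2 = 0).
            for pat, p_pat in ((1, p), (0, q)):
                prob = mother_prob[m] * p_mat * p_pat
                if prob > 0:
                    f = mat + pat
                    # Fraction of A among reads: maternal share (1-h), fetal share h.
                    f_a = (1 - h) * m / 2 + h * f / 2
                    yield m, f, prob, f_a


def p_het_depth2(p, h):
    """Probability that two reads covering a site carry different alleles."""
    return sum(prob * 2 * f_a * (1 - f_a)
               for _, _, prob, f_a in ideal_joint_model(p, h))
```

Summing 2 f_A (1 - f_A) over all maternal and fetal genotype combinations gives (1 + h - h^2) p (1 - p), which agrees with the expression stated above.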
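Taking p ~ uniform(0,1) and integrating P_H over p, the percentage of heterozygous sites among all sites, across all allele frequencies in the sequencing data, is:
Figure PCTCN2019096367-appb-000003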
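The formula image above is not reproduced here. As a worked check under the stated assumption p ~ uniform(0,1), integrating the depth-2 expression (1 + h - h^2) p (1 - p) over p gives (1 + h - h^2)/6, because the integral of p(1 - p) over [0, 1] is 1/6. The Python sketch below encodes this closed form and its inverse; it is our own integration for illustration, and the function names are hypothetical.

```python
import math


def expected_read_heterozygosity(h):
    """Integral of (1 + h - h^2) * p * (1 - p) over p in [0, 1]."""
    return (1.0 + h - h * h) / 6.0


def h_from_heterozygosity(het):
    """Invert the closed form on 0 <= h <= 0.5 (where it is increasing)."""
    disc = 5.0 - 24.0 * het  # discriminant of h^2 - h + (6*het - 1) = 0
    if not -1e-9 <= disc <= 1.0 + 1e-9:
        raise ValueError("heterozygosity outside the range of the ideal model")
    disc = min(1.0, max(0.0, disc))  # absorb floating-point rounding
    return (1.0 - math.sqrt(disc)) / 2.0
```

In this idealized setting the heterozygous fraction rises from 1/6 at h = 0 to 5/24 at h = 0.5, so the signal is small and many sites are needed to estimate h.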
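In practice, however, three factors affect the degree of heterozygosity: the fetal inbreeding coefficient F2, the maternal inbreeding coefficient F1, and the sequencing error rate e.
For a biallelic SNP, the inbreeding coefficient F directly affects the frequencies of the homozygotes AA and BB and of the heterozygote AB, as follows:
$AA \sim p^2 + pqF,\quad AB \sim 2pq(1-F),\quad BB \sim q^2 + pqF.$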
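The genotype frequencies above translate directly into code. The short Python sketch below (function name hypothetical) evaluates them and makes it easy to see that the three frequencies always sum to 1 and that heterozygosity falls as F rises.

```python
def genotype_frequencies(p, F=0.0):
    """Frequencies of AA, AB and BB at a biallelic SNP with inbreeding coefficient F."""
    q = 1.0 - p
    return {
        "AA": p * p + p * q * F,
        "AB": 2 * p * q * (1 - F),
        "BB": q * q + p * q * F,
    }
```

With F = 0 this reduces to Hardy-Weinberg proportions, and with F = 1 no heterozygotes remain.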
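Therefore, in a preferred embodiment, the joint probability distribution model is as shown in Table 2 below.
Table 2:
Figure PCTCN2019096367-appb-000004
Here, the MMFF column gives the genotypes of the mother and fetus; A and B denote the two alleles at a SNP site; the Prob column gives the joint probability of the maternal and fetal genotypes; p and q denote the population allele frequencies of alleles A and B, respectively; F1 denotes the maternal inbreeding coefficient; F2 denotes the fetal inbreeding coefficient; e denotes the sequencing error rate; the f_A column gives the frequency of allele A in the sequencing data of the sample; and h denotes the fetal free DNA concentration.
This model can be solved for h by the maximum likelihood method. Solving it requires knowing F1, F2, e, and the population allele frequency information. The maternal inbreeding coefficient F1 can be obtained by low-depth sequencing of white blood cells; the model used for that purpose can be regarded as the special case of the general model with h = 0. Alternatively, the model can be solved for h and F1 simultaneously by maximum likelihood, which sacrifices a little accuracy in h but saves the cost of sequencing maternal white blood cells. The sequencing error rate of the platform can be obtained directly from the data. Although the fetal inbreeding coefficient F2 in theory requires sequencing the father's white blood cells, in practice setting F2 = 0 or using the mean inbreeding coefficient of the population is sufficient, because the fetal concentration is generally small, around 10%. The population allele frequency information can be taken directly from the East Asian population data of 1000genome and gnomAD, or calculated from a large number of real NIPT samples.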
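The maximum likelihood step can be illustrated with a deliberately simplified Python sketch. Because Table 2 is only available as an image, the sketch does not implement it: it fixes F1 = F2 = e = 0 and uses the ideal depth-2 probability (1 + h - h^2) p (1 - p) as the per-site chance of read heterozygosity. The input layout (sites grouped by allele frequency), the grid search, and all names are assumptions made for illustration.

```python
import math


def site_het_prob(p, h):
    """Ideal probability that a depth-2 site with allele frequency p is read-heterozygous."""
    return (1.0 + h - h * h) * p * (1.0 - p)


def log_likelihood(h, bins):
    """Binomial log-likelihood of h.

    bins: iterable of (p, n_sites, n_het) with 0 < p < 1, i.e. depth-2 sites
    grouped by population allele frequency p.
    """
    ll = 0.0
    for p, n, k in bins:
        ph = site_het_prob(p, h)
        ll += k * math.log(ph) + (n - k) * math.log(1.0 - ph)
    return ll


def estimate_h(bins, grid_size=501, h_max=0.5):
    """Grid-search maximum likelihood estimate of h on [0, h_max]."""
    grid = [h_max * i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda h: log_likelihood(h, bins))
```

A fuller implementation would replace `site_het_prob` with the probabilities implied by Table 2, include depth-3 sites, and could add F1 as a second grid dimension for the joint estimation described above.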
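Based on the aligned data, the heterozygous and homozygous status at a large number of SNP sites on the autosomes (with a depth of 2 or 3) is counted. Together with the mother's own inbreeding coefficient and the population frequencies of a large number of SNP sites obtained from 1000 Genomes data, these counts are substituted into the practical model to solve for the fetal free nucleic acid concentration h.
Low-depth sequencing in this application refers to an overall sample coverage of 0.1x to 0.5x, whereas a coverage of 2 or 3 refers to the depth at individual sites. For example, a sample has 3 billion sites; some sites have a depth of 0, some a depth of 1, some a depth of 2, and other sites may differ in depth in a similar way, but averaged over all sites the depth of the sample is 0.1x to 0.5x.
From the description of the above implementations, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is preferable. On this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.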
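The method is further described below with reference to optional embodiments.
Embodiment 2: Validation with simulated mixed-sample data
Whole-genome sequencing data of NA12892 (mother) and NA12878 (daughter) from 1000genome were selected, and reads were mixed at a gradient of fetal concentrations (2%, 4%, 6%, 8%, 10%, 12%, 14%, 16%, 18%, and 20%), up to a maximum coverage of 0.5x.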
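The read-mixing step could be simulated along the following lines. This Python sketch is an assumption about how such mixing might be done, not the procedure used in the embodiment: it draws reads without replacement from a maternal pool and a fetal (daughter) pool so that the fetal share equals the target concentration h, and all names are hypothetical.

```python
import random

# Fetal concentration gradient listed in the embodiment above.
FETAL_FRACTIONS = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20]


def mix_reads(maternal, fetal, h, total, seed=0):
    """Return `total` reads of which a fraction h comes from the fetal pool.

    maternal, fetal : lists of reads (any objects), each long enough to sample from
    seed            : fixed seed so that a given replicate is reproducible
    """
    rng = random.Random(seed)
    n_fetal = round(total * h)
    n_maternal = total - n_fetal
    mixed = rng.sample(fetal, n_fetal) + rng.sample(maternal, n_maternal)
    rng.shuffle(mixed)
    return mixed
```

Calling `mix_reads` with 100 different seeds per gradient step would give the replicates from which a mean and variance of h can be computed.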
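The inbreeding coefficients of the mother and daughter were obtained from their respective whole-genome sequencing reads, the sequencing error rate was calculated from the reads of the mixed sample, the population allele frequency of each SNP site was taken from the East Asian population data of 1000genome, and the percentage of heterozygous sites among all sites was obtained by counting the reads of the mixed sample. These parameters were then substituted into the joint probability distribution model described above and the model was solved to obtain the fetal free DNA concentration h.
The inferred fetal concentration was compared with the expected value; the result is shown in Fig. 2. As Fig. 2 shows, the fetal concentration obtained by the method of this application is consistent with the expected fetal concentration (the proportion of mixed reads). For each gradient, the read mixing was repeated 100 times to calculate the mean of h (black dots in the figure) and its variance (the vertical lines represent plus or minus one variance).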
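Embodiment 3: Real mixed samples
DNA from a mother and DNA from a fetus were mixed at different fetal concentrations (3%, 5%, 8%, and 12%) and sequenced on the machine by low-depth whole-genome sequencing, and the method proposed in this application was then used to infer the fetal concentration.
The sequencing depth was 0.1x and the sequencing error rate was 1/1000. The inbreeding coefficients of the mother and the fetus were calculated from their respective DNA sequencing data, the population allele frequency of each site was taken from the East Asian population data of 1000genome, and the percentage of heterozygous sites among all sites for each mixing concentration was obtained from the sequencing data.
The inferred fetal concentration was compared with the mixing concentration; the result is shown in Fig. 3. As Fig. 3 shows, the fetal concentration obtained by this method is consistent with the fetal concentration of the mixture.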
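Embodiment 4: Validation with real male-fetus NIPT samples
Sixty-nine NIPT clinical samples from pregnancies with a male fetus were selected, and the fetal concentration was obtained by the method of this application. The inferred fetal concentration was compared with the concentration inferred from the sex chromosomes; the result is shown in Fig. 4. As Fig. 4 shows, the fetal concentrations obtained by this method and by the sex-chromosome-based method are highly consistent in 67 samples. For the two outliers (*), the fetal concentration obtained by the method of this application is about twice the concentration inferred from the sex chromosomes. These two samples are boy-girl twin pregnancies.
Embodiment 5: Joint estimation of the maternal inbreeding coefficient and the fetal concentration
Fig. 5 uses the same samples as Fig. 4; the difference is that in Fig. 5 the maternal inbreeding coefficient and the fetal concentration are estimated jointly (no maternal white blood cell information is used). Fig. 5 shows that the joint estimation can be very accurate; the boy-girl twin samples are marked with asterisks.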
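As the above preferred embodiments show, the solution of this application has the following advantages:
1) High accuracy. Validated with more than 30,000 male-fetus NIPT samples, the fetal concentration obtained by this method is highly consistent with that obtained by the sex-chromosome-based method, with R^2 reaching 99%.
2) Applicable to female fetuses. It overcomes the difficulty of accurately quantifying the fetal concentration of a female fetus.
3) No dependence on additional experimental steps or instruments. No custom panel and no methylation sequencing are needed, no extra experimental work is added, and no additional instrument or platform is required.
4) Low cost and high value for clinical adoption. The method of this application is based on whole-genome low-depth sequencing and can directly use existing NIPT sample data. Neither paired-end sequencing nor high-depth sequencing is required (the high-depth approach depends directly on small differences in sequencing depth between the two alleles at certain heterozygous SNP sites, so each heterozygous site must be analyzed quantitatively; this application instead counts the proportion of heterozygous SNP sites among all sites, so each site only needs a rough qualitative call as heterozygous or homozygous), and no additional sequencing cost is incurred.
5) Direct integration into the NIPT workflow. Because it is based on NIPT data, the method can be conveniently integrated into the NIPT analysis pipeline and improves the statistical power of NIPT screening.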
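Embodiment 6
Corresponding to the above, this embodiment further provides a device for obtaining fetal free DNA concentration. The device is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a structural block diagram of the device for obtaining fetal free DNA concentration according to Embodiment 6 of the present invention. As shown in Fig. 6, the device includes a first acquisition module 10, a model establishment module 30, and a concentration estimation module 50, where:
the first acquisition module 10 is used to obtain sequencing data of a sample to be tested, where the sample is taken from a mother carrying a fetus;
the model establishment module 30 is used to establish a joint probability distribution model of the maternal and fetal genotypes, where the model includes one or more factors that affect read heterozygosity, and read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of SNP sites;
the concentration estimation module 50 is used to substitute the values of the one or more factors and the obtained read heterozygosity value into the joint probability distribution model and to perform maximum likelihood estimation on the model to obtain the fetal free DNA concentration.
With the above device, the fetal free DNA concentration is quantified without any additional experimental or sequencing cost; the approach is low in cost, high in accuracy, and applicable to female fetuses.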
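Fig. 7 is a detailed structural block diagram of the device for obtaining fetal free DNA concentration according to Embodiment 6 of the present invention. As shown in Fig. 7, in addition to all the modules shown in Fig. 6, the device includes a second acquisition module 70. The second acquisition module is used to obtain the values of the one or more factors where the one or more factors include at least one of the maternal inbreeding coefficient, the fetal inbreeding coefficient, the sequencing error rate, and population allele frequency information.
Optionally, the second acquisition module 70 includes a first acquisition unit 20, used, where the one or more factors include the maternal inbreeding coefficient, to obtain the maternal inbreeding coefficient in one of the following ways: low-depth sequencing of white blood cells; maximum likelihood estimation of the joint probability distribution model.
Optionally, the second acquisition module 70 includes a second acquisition unit 40, used, where the one or more factors include the fetal inbreeding coefficient, to obtain the fetal inbreeding coefficient in one of the following ways: setting the fetal inbreeding coefficient to 0; obtaining it by white blood cell sequencing of the father of the fetus; using the mean inbreeding coefficient of the population as the fetal inbreeding coefficient.
Optionally, the second acquisition module 70 includes a third acquisition unit 60, used, where the one or more factors include population allele frequency information, to obtain that information in one of the following ways: from data of the population to which the mother belongs; by calculation from a predetermined number of included NIPT samples.
Optionally, the first acquisition module 10 includes: a sample sequencing module, used to extract free DNA from the sample to be tested and sequence it to obtain raw sequencing data; and a processing module, used to process the raw sequencing data to obtain the sequencing data, where the processing turns the raw sequencing data into sequencing data suitable for deriving the read heterozygosity.
Optionally, the processing module includes: a deletion module, used to delete low-quality reads; and an alignment module, used to align the reads retained after deletion to a reference genome and take the reads that satisfy the alignment strategy as the sequencing data.
Specifically, the low-quality reads include at least one of the following: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20; and/or the alignment strategy includes one of the following: allowing at most one mismatch, and keeping only uniquely aligned reads.
Optionally, the sample sequencing module includes a whole-genome low-depth sequencing module, used to extract free DNA from the sample to be tested and perform whole-genome low-depth sequencing.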
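Optionally, the joint probability distribution model is expressed by the following formula:
Figure PCTCN2019096367-appb-000005
Here, the MMFF column gives the genotypes of the mother and fetus; A and B denote the two alleles at a SNP site; the Prob column gives the joint probability of the maternal and fetal genotypes; p and q denote the population allele frequencies of alleles A and B, respectively; F1 denotes the maternal inbreeding coefficient; F2 denotes the fetal inbreeding coefficient; e denotes the sequencing error rate; the f_A column gives the frequency of allele A in the sequencing data of the sample; and h denotes the fetal free DNA concentration.
It should be noted that the above modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, among others: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.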
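Embodiment 7
An embodiment of the present invention further provides a storage medium. The storage medium stores a computer program, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: obtain sequencing data of a sample to be tested, where the sample is taken from a mother carrying a fetus;
S2: establish a joint probability distribution model of the maternal and fetal genotypes, where the model includes one or more factors that affect read heterozygosity, and read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of sites;
S3: substitute the values of the one or more factors and the obtained read heterozygosity value into the joint probability distribution model, and perform maximum likelihood estimation on the model to obtain the fetal free DNA concentration.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.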
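Embodiment 8
An embodiment of the present invention further provides an electronic device including a memory and a processor. The memory stores a computer program, and the processor is configured to execute the steps of any of the above method embodiments by running the computer program.
Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by means of the computer program:
S1: obtain sequencing data of a sample to be tested, where the sample is taken from a mother carrying a fetus;
S2: establish a joint probability distribution model of the maternal and fetal genotypes, where the model includes one or more factors that affect read heterozygosity, and read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of sites;
S3: substitute the values of the one or more factors and the obtained read heterozygosity value into the joint probability distribution model, and perform maximum likelihood estimation on the model to obtain the fetal free DNA concentration.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Evidently, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be executed in an order different from the one given here. Alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any particular combination of hardware and software.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make further improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as within the scope of protection of the present invention.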

Claims (20)

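  1. A method for obtaining fetal free DNA concentration, characterized by comprising:
    obtaining sequencing data of a sample to be tested, wherein the sample to be tested is taken from a mother carrying a fetus;
    establishing a joint probability distribution model of maternal and fetal genotypes, wherein the joint probability distribution model includes one or more factors that affect read heterozygosity, and the read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of SNP sites;
    substituting the values of the one or more factors and the obtained value of the read heterozygosity into the joint probability distribution model, and performing maximum likelihood estimation on the joint probability distribution model to obtain the fetal free DNA concentration.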
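  2. The method according to claim 1, characterized in that, where the one or more factors include at least one of the maternal inbreeding coefficient, the fetal inbreeding coefficient, the sequencing error rate, and population allele frequency information, the values of the one or more factors are obtained before the values of the one or more factors and the value of the read heterozygosity are substituted into the joint probability distribution model.
  3. The method according to claim 2, characterized in that, where the one or more factors include the maternal inbreeding coefficient, the maternal inbreeding coefficient is obtained by low-depth sequencing of white blood cells; or is obtained together with the fetal free DNA concentration by performing maximum likelihood estimation on the joint probability distribution model.
  4. The method according to claim 2, characterized in that, where the one or more factors include the fetal inbreeding coefficient, the fetal inbreeding coefficient is obtained in one of the following ways:
    setting the fetal inbreeding coefficient to 0;
    obtaining the fetal inbreeding coefficient by white blood cell sequencing of the father of the fetus;
    using the mean inbreeding coefficient of the population as the fetal inbreeding coefficient.
  5. The method according to claim 2, characterized in that, where the one or more factors include the population allele frequency information, the population allele frequency information is obtained in one of the following ways:
    obtaining it from data of the population to which the mother belongs;
    calculating it from a predetermined number of included NIPT samples.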
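  6. The method according to any one of claims 1 to 5, characterized in that obtaining the sequencing data of the sample to be tested comprises:
    extracting free DNA from the sample to be tested and performing whole-genome low-depth sequencing to obtain raw sequencing data;
    processing the raw sequencing data to obtain the sequencing data, wherein the processing turns the raw sequencing data into sequencing data suitable for deriving the read heterozygosity.
  7. The method according to claim 6, characterized in that processing the raw sequencing data to obtain the sequencing data comprises:
    deleting low-quality reads;
    aligning the reads retained after deletion to a reference genome, and taking the reads that satisfy the alignment strategy as the sequencing data.
  8. The method according to claim 7, characterized in that
    the low-quality reads include at least one of the following: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20; and/or
    the alignment strategy includes one of the following: allowing at most one mismatch, and keeping only uniquely aligned reads.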
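  9. The method according to any one of claims 2 to 5, characterized in that the joint probability distribution model is expressed by the following formula:
    Figure PCTCN2019096367-appb-100001
    wherein the MMFF column gives the genotypes of the mother and fetus; A and B denote the two alleles at a SNP site; the Prob column gives the joint probability of the genotypes of the mother and fetus; p and q denote the population allele frequencies of alleles A and B, respectively; F1 denotes the maternal inbreeding coefficient; F2 denotes the fetal inbreeding coefficient; e denotes the sequencing error rate; the f_A column gives the frequency of allele A in the sequencing data of the sample; and h denotes the fetal free DNA concentration.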
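  10. A device for obtaining fetal free DNA concentration, characterized by comprising:
    a first acquisition module, used to obtain sequencing data of a sample to be tested, wherein the sample to be tested is taken from a mother carrying a fetus;
    a model establishment module, used to establish a joint probability distribution model of maternal and fetal genotypes, wherein the joint probability distribution model includes one or more factors that affect read heterozygosity, and the read heterozygosity is the ratio of the number of SNP sites covered by different bases in the sequencing data to the total number of SNP sites;
    a concentration estimation module, used to substitute the values of the one or more factors and the obtained value of the read heterozygosity into the joint probability distribution model, and to perform maximum likelihood estimation on the joint probability distribution model to obtain the fetal free DNA concentration.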
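  11. The device according to claim 10, characterized in that the device further comprises a second acquisition module, used to obtain the values of the one or more factors where the one or more factors include at least one of the maternal inbreeding coefficient, the fetal inbreeding coefficient, the sequencing error rate, and population allele frequency information.
  12. The device according to claim 11, characterized in that the second acquisition module comprises a first acquisition unit, used, where the one or more factors include the maternal inbreeding coefficient, to obtain the maternal inbreeding coefficient in one of the following ways:
    low-depth sequencing of white blood cells;
    maximum likelihood estimation of the joint probability distribution model.
  13. The device according to claim 11, characterized in that the second acquisition module comprises a second acquisition unit, used, where the one or more factors include the fetal inbreeding coefficient, to obtain the fetal inbreeding coefficient in one of the following ways:
    setting the fetal inbreeding coefficient to 0;
    obtaining the fetal inbreeding coefficient by white blood cell sequencing of the father of the fetus;
    using the mean inbreeding coefficient of the population as the fetal inbreeding coefficient.
  14. The device according to claim 11, characterized in that the second acquisition module comprises a third acquisition unit, used, where the one or more factors include the population allele frequency information, to obtain the population allele frequency information in one of the following ways:
    obtaining it from data of the population to which the mother belongs;
    calculating it from a predetermined number of included NIPT samples.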
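  15. The device according to any one of claims 10 to 14, characterized in that the first acquisition module comprises:
    a sample sequencing module, used to extract free DNA from the sample to be tested and perform whole-genome low-depth sequencing to obtain raw sequencing data;
    a processing module, used to process the raw sequencing data to obtain the sequencing data, wherein the processing turns the raw sequencing data into sequencing data suitable for deriving the read heterozygosity.
  16. The device according to claim 15, characterized in that the processing module comprises:
    a deletion module, used to delete low-quality reads;
    an alignment module, used to align the reads retained after deletion to a reference genome and take the reads that satisfy the alignment strategy as the sequencing data.
  17. The device according to claim 16, characterized in that
    the low-quality reads include at least one of the following: reads of duplicate fragments introduced by PCR amplification, reads containing more than one base N, and reads in which the average sequencing quality of 5 consecutive nucleotides is below 20; and/or
    the alignment strategy includes one of the following: allowing at most one mismatch, and keeping only uniquely aligned reads.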
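  18. The device according to any one of claims 11 to 14, characterized in that the joint probability distribution model is expressed by the following formula:
    Figure PCTCN2019096367-appb-100002
    wherein the MMFF column gives the genotypes of the mother and fetus; A and B denote the two alleles at a SNP site; the Prob column gives the joint probability of the genotypes of the mother and fetus; p and q denote the population allele frequencies of alleles A and B, respectively; F1 denotes the maternal inbreeding coefficient; F2 denotes the fetal inbreeding coefficient; e denotes the sequencing error rate; the f_A column gives the frequency of allele A in the sequencing data of the sample; and h denotes the fetal free DNA concentration.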
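  19. A storage medium storing a computer-executable program, characterized in that the program is configured to execute, when run, the method for obtaining fetal free DNA concentration according to any one of claims 1 to 9.
  20. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to run the computer program to execute the method for obtaining fetal free DNA concentration according to any one of claims 1 to 9.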
PCT/CN2019/096367 2018-09-30 2019-07-17 Method for obtaining fetal free DNA concentration, obtaining device, storage medium, and electronic device WO2020063052A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811162012.9 2018-09-30
CN201811162012.9A CN109461473B (zh) 2018-09-30 2018-09-30 Method and device for obtaining fetal free DNA concentration

Publications (1)

Publication Number Publication Date
WO2020063052A1 true WO2020063052A1 (zh) 2020-04-02

Family

ID=65607271

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096367 WO2020063052A1 (zh) 2018-09-30 2019-07-17 Method for obtaining fetal free DNA concentration, obtaining device, storage medium, and electronic device

Country Status (3)

Country Link
US (1) US20200048714A1 (zh)
CN (1) CN109461473B (zh)
WO (1) WO2020063052A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109461473B (zh) * 2018-09-30 2019-12-17 北京优迅医疗器械有限公司 Method and device for obtaining fetal free DNA concentration
EP3709302B1 (en) * 2019-03-14 2024-02-14 Ricoh Company, Ltd. Estimation method
CN113450871B (zh) * 2021-06-28 2024-06-11 广东博奥医学检验所有限公司 Method for identifying sample identity based on low-depth sequencing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192763A1 (en) * 2004-02-28 2005-09-01 Park Kyung-Hee Method of selecting optimized SNP marker sets from multiple SNP markers associated with a complex disease
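CN104846089A (zh) * 2015-05-06 2015-08-19 厦门万基生物科技有限公司 Method for quantifying the proportion of fetal cell-free DNA in the peripheral blood of pregnant women
CN106591451A (zh) * 2016-12-14 2017-04-26 北京贝瑞和康生物技术股份有限公司 Method for determining fetal cell-free DNA content and device for implementing the method
CN107133491A (zh) * 2017-03-08 2017-09-05 广州市达瑞生物技术股份有限公司 Method for obtaining fetal cell-free DNA concentration
CN109461473A (zh) * 2018-09-30 2019-03-12 北京优迅医疗器械有限公司 Method and device for obtaining fetal cell-free DNA concentration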

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9732390B2 (en) * 2012-09-20 2017-08-15 The Chinese University Of Hong Kong Non-invasive determination of methylome of fetus or tumor from plasma
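CN109971852A (zh) * 2014-04-21 2019-07-05 纳特拉公司 Detecting mutations and ploidy in chromosome segments
CN104232778B (zh) * 2014-09-19 2016-08-17 天津华大基因科技有限公司 Method and device for simultaneously determining fetal haplotype and chromosomal aneuploidy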

Also Published As

Publication number Publication date
CN109461473B (zh) 2019-12-17
CN109461473A (zh) 2019-03-12
US20200048714A1 (en) 2020-02-13

Similar Documents

Publication Publication Date Title
US20210174901A1 (en) METHOD FOR SIMULTANEOUS DETECTION OF GENOME-WIDE COPY NUMBER CHANGES, cnLOH, INDELS, AND GENE MUTATIONS
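KR20160022374A (ko) Methods and processes for non-invasive assessment of genetic variations
WO2020063052A1 (zh) Method and device for obtaining fetal cell-free DNA concentration, storage medium, and electronic device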
EP3564391B1 (en) Method, device and kit for detecting fetal genetic mutation
WO2019213811A1 (zh) Method, device and system for detecting chromosomal aneuploidy
Luo et al. Pilot study of a novel multi‐functional noninvasive prenatal test on fetus aneuploidy, copy number variation, and single‐gene disorder screening
Yuan et al. FF‐QuantSC: accurate quantification of fetal fraction by a neural network model
Wang et al. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing
CA3143723C (en) Systems and methods for determining pattern of inheritance in embryos
Malcher et al. Development of a comprehensive noninvasive prenatal test
Kang et al. An advanced model to precisely estimate the cell-free fetal DNA concentration in maternal plasma
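WO2019213810A1 (zh) Method, device and system for detecting chromosomal aneuploidy
JP7446343B2 (ja) Systems, computer programs and methods for determining genome ploidy
WO2022087839A1 (zh) Method and device for determining kinship based on non-invasive prenatal genetic testing data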
Söylev et al. CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data
Flickinger Detecting and Correcting Contamination in Genetic Data.
WO2024129354A1 (en) Methods for distinguishing aneuplodies in non-invasive prenatal testing
Yu Fetal CNAPS–DNA/RNA

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19867513; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19867513; Country of ref document: EP; Kind code of ref document: A1)