WO2022134807A1 - A method for detecting fetal genetic variation by sequencing polymorphic sites and target sites - Google Patents

A method for detecting fetal genetic variation by sequencing polymorphic sites and target sites

Info

Publication number
WO2022134807A1
WO2022134807A1 PCT/CN2021/125359 CN2021125359W WO2022134807A1 WO 2022134807 A1 WO2022134807 A1 WO 2022134807A1 CN 2021125359 W CN2021125359 W CN 2021125359W WO 2022134807 A1 WO2022134807 A1 WO 2022134807A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
allele
count
dna
genotype
Prior art date
Application number
PCT/CN2021/125359
Other languages
English (en)
French (fr)
Inventor
高嵩
Original Assignee
高嵩
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 高嵩
Priority to EP21908808.5A (EP4265732A1)
Priority to US18/268,459 (US20240047008A1)
Priority to CN202180080432.6A (CN116888274A)
Publication of WO2022134807A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6858 Allele-specific amplification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR

Definitions

  • the present invention relates to the field of genetic variation detection, in particular to the detection of aneuploidy at the chromosomal level, microdeletion/microduplication at the subchromosomal level, and insertion/deletion and single-nucleotide variation at the short-sequence level.
  • the purpose of the present invention is to provide a method for simultaneously detecting chromosomal aneuploidy genetic diseases, micro-deletion/micro-duplication genetic diseases at the subchromosomal level and single-gene genetic diseases caused by short sequence variation.
  • the present invention designs a method for genetic variation screening based on high-throughput sequencing technology, including obtaining a test sample and extracting DNA, selectively amplifying the target sites, performing high-throughput sequencing on the target sites, and analyzing the sequencing data to obtain the detection results.
  • the invention provides a method for detecting genetic variation, comprising the following steps:
  • the present invention detects aneuploidy at the chromosome level, microdeletion/microduplication at the subchromosomal level, and short-sequence-level variation in mixed samples by amplifying and sequencing specific target DNA sites, wherein at least one of said specific target DNA sites has more than one allele in the sample.
  • the target DNA site in the present invention refers to a specific DNA sequence whose bases may vary between individuals, and which can be amplified by techniques such as PCR or multiplex PCR, or enriched by techniques such as nucleic acid hybridization.
  • the terms "target DNA sequence" and "target DNA site" are used interchangeably, and the term "site" when referring to a target does not limit the length of the target, i.e. the target can range in length from a single nucleotide up to an entire chromosome.
  • the present invention detects aneuploidy at the chromosome level and microdeletion/microduplication at the subchromosomal level in a single-genome sample by amplifying and sequencing specific DNA sites (target sites), wherein at least one of the specific target DNA sites has more than one allele in the sample.
  • the biological sample in the present invention includes fetal and maternal nucleic acid (such as cell-free DNA in maternal plasma) from a biological sample of a pregnant female, or nucleic acid from a single-genome sample (such as embryonic nucleic acid obtained for preimplantation diagnosis).
  • fetal and maternal nucleic acid such as cell-free DNA in maternal plasma
  • a single genomic sample such as embryonic nucleic acid derived from preimplantation diagnosis
  • the enrichment or amplification of target DNA sites described in the present invention can be carried out by any method known in the art, including but not limited to PCR, multiplex PCR, whole genome amplification (WGA), multiple displacement amplification (MDA), rolling circle amplification (RCA), circular amplification (RCR), hybrid capture and other methods.
  • WGA whole genome amplification
  • MDA multiple displacement amplification
  • RCA rolling circle amplification
  • RCR circular amplification
  • a chromosome, region or locus presumed to be normal euploid is designated herein as a "reference chromosome or reference region or reference sequence or reference locus"; a chromosome, region or locus whose state of genetic variation is to be detected is designated herein as a "target chromosome or target region or target sequence or target site".
  • a set consisting of one or more reference chromosomes, reference regions, reference sequences or reference sites is referred to as a reference group.
  • a set consisting of one or more target chromosomes, target regions, target sequences or target sites is referred to as a target group.
  • counting the alleles means first mapping each amplified sequence to its chromosomal or genomic location, and then counting the number of sequences mapped to each chromosome or genomic region. If a chromosome or genomic region has different alleles, the number of sequences mapped to each allele in the region is counted as well.
  • Various computational methods are available for mapping individual sequence reads to chromosomal or genomic locations/regions.
  • Examples of computer algorithms that can be used to map sequences include, but are not limited to, searching for specific sequences, BLAST, BLITZ, FASTA, BOWTIE, BOWTIE 2, BWA, NOVOALIGN, GEM, ZOOM, ELAN, MAQ, MATCH, SOAP, STAR, SEGEMEHL, MOSAIK or SEQMAP, or variants or combinations thereof.
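The counting step described above can be sketched as follows. This is a minimal illustration, assuming the reads have already been mapped by one of the listed tools and reduced to (locus, allele) pairs; the function name `count_alleles` and the input format are hypothetical and do not come from the source.

```python
from collections import Counter, defaultdict

def count_alleles(mapped_reads):
    """Tally mapped reads per locus and per allele.

    mapped_reads: iterable of (locus_id, allele) pairs, one per mapped read
    (a hypothetical, already-parsed representation of the alignment output).
    Returns {locus_id: Counter({allele: read_count})}.
    """
    counts = defaultdict(Counter)
    for locus_id, allele in mapped_reads:
        counts[locus_id][allele] += 1
    return dict(counts)

reads = [("site1", "A"), ("site1", "A"), ("site1", "G"), ("site2", "C")]
allele_counts = count_alleles(reads)
```

The per-locus `Counter` keeps every observed allele, so later steps can sort the counts to obtain the largest values (R1, R2, ...) used in the text.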
  • a subchromosomal level microdeletion segment is considered as one chromosome
  • a subchromosomal level microduplication segment is considered as two chromosomes. Therefore, for single-genome samples, chromosomes with heterozygous microdeletions at the subchromosomal level are marked as monosomy, chromosomes with homozygous microdeletions are marked as absent, chromosomes with heterozygous microduplications are marked as trisomy, and chromosomes with homozygous microduplications are marked as tetrasomy.
  • the normal chromosomes of both mother and fetus are marked as disomy-disomy, a chromosome for which the mother is normal and the fetus contains a microdeletion is marked as disomy-monosomy,
  • and a chromosome for which the mother is normal and the fetus contains a microduplication is marked as disomy-trisomy.
  • the chromosomes and/or chromosome segments involved in the variation at the chromosomal level or the subchromosomal level are marked according to similar principles.
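As an illustration of this marking convention, the sketch below maps copy numbers to the labels used in the text. The function name and the copy-number encoding (0 to 4 copies of a chromosome or segment) are assumptions made for this example; the mother-fetus ordering of the combined label follows the disomy-trisomy examples in the text.

```python
# Labels follow the text: homozygous microdeletion -> absent (0 copies),
# heterozygous microdeletion -> monosomy (1), normal -> disomy (2),
# heterozygous microduplication -> trisomy (3),
# homozygous microduplication -> tetrasomy (4).
COPY_LABELS = {0: "absent", 1: "monosomy", 2: "disomy", 3: "trisomy", 4: "tetrasomy"}

def karyotype_label(copies, fetal_copies=None):
    """Label a single-genome sample, or a maternal-fetal mix as 'mother-fetus'."""
    if fetal_copies is None:
        return COPY_LABELS[copies]
    return f"{COPY_LABELS[copies]}-{COPY_LABELS[fetal_copies]}"
```

For example, `karyotype_label(2, 3)` gives the label for a normal mother carrying a fetus with one extra copy.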
  • microdeletion/microduplication at the subchromosomal level refers to chromosomal aberrations in which the deleted or duplicated segment is short and is therefore difficult to detect by traditional cytogenetic analysis.
  • Chromosomal microdeletion-microduplication syndrome is another major category of neonatal birth defects besides chromosomal aneuploidy.
  • in some passages, chromosomal microdeletion/microduplication variation is also referred to as copy number variation of chromosomal segments.
  • karyotype is used to refer to variation at the chromosomal or subchromosomal level
  • genotype is used to refer to variation at the short sequence level.
  • the present invention will mark the chromosome 21 karyotype in the sample as a disomy-trisomy karyotype.
  • the present invention will mark the karyotype of the 22q11 chromosome segment in the sample as a monosomic karyotype.
  • the present invention will mark the karyotype of the 22q11 chromosome segment in the sample as a trisomy-trisomy karyotype.
  • the present invention will mark the genotype of the 6th amino acid of the hemoglobin beta subunit in this sample as AS
  • wild type is used to refer to the genotype observed at the highest frequency at the target site in a population that is normally free of the disease phenotype. Alternatively, wild type refers to a genotype that does not contain a pathogenic or likely pathogenic variant at the target locus.
  • mutant type is used to refer to a genotype whose target site is different from wild type.
  • for some samples to be tested, the allele counts of each target site in the reference group are used to estimate the concentration of the smallest component DNA in the sample.
  • the concentration of the smallest component DNA in the sample to be detected can be estimated by any method that has been reported so far.
  • the relative ratio method of the allele counts of each target site in the reference group is used to estimate the concentration of the smallest component DNA in the sample to be tested; preferably, the mean and/or median of FC and TC is used to calculate the concentration of the least component DNA in the sample.
  • the relative ratio method of allele count is used to calculate the concentration of the minimum component DNA in the sample.
  • the least component DNA is fetal DNA
  • the largest component DNA is maternal DNA.
  • the fetus inherits one chromosome from the mother, so the genotype of each target site can only be one of the following five possible genotypes, namely AA
  • the fetal DNA concentration did not affect the relative counts of its individual alleles if the target site was the AA
  • AC can be used to estimate the fetal DNA-derived count (FC) in each target DNA locus.
  • the present invention provides a method for calculating the concentration of the minimum component DNA in a sample by using the relative ratio of the allele counts of each target site in the reference group, the method comprising:
  • setting the noise threshold of the sample in the above-mentioned step (a1) means setting the threshold for distinguishing the count signal of a real allele from the count signal of a non-real allele; preferably, the set noise threshold is any value not greater than 0.05; preferably, the set noise threshold is 0.05, 0.04, 0.03, 0.02, 0.01, 0.0075, 0.005, 0.0025 or 0.001.
  • TC Total Count
  • in the above-mentioned step (a2-ii), the genotype of the target DNA site is estimated using its allele counts, wherein the three largest allele counts are labeled R1, R2 and R3 in turn, including the following steps:
  • (a2-ii-1) Use the allele counts of the target DNA site to determine the number of alleles detected at the site that are above the noise threshold; if the result is 1, perform the following step (a2-ii-2); if the result is 2, perform the following step (a2-ii-3); if the result is greater than 2, perform the following step (a2-ii-4);
  • the allele counts of the target DNA site are used to determine the number of alleles detected at the site that are above the noise threshold, comprising the following steps in sequence:
  • the set noise threshold is any value not greater than 0.05; preferably, the predetermined noise threshold is 0.05, 0.04, 0.02, 0.01, 0.0075, 0.005, 0.0025 or 0.001.
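The sub-steps for counting alleles above the noise threshold are not reproduced in this excerpt. The sketch below is one plausible reading, assuming that an allele is treated as real when its share of the total count at the site exceeds the threshold; that rule and the function name are assumptions, not the patent's stated procedure.

```python
def alleles_above_noise(counts, noise_threshold=0.01):
    """Count alleles whose share of the total read count exceeds the threshold.

    counts: list of integer read counts, one per allele observed at the site.
    noise_threshold: fraction of the total count (assumed interpretation).
    """
    total = sum(counts)
    if total == 0:
        return 0
    return sum(1 for c in counts if c / total > noise_threshold)
```

With counts of 950, 45 and 5 reads and a threshold of 0.01, two alleles pass (shares 0.95 and 0.045), so the two-allele branch, step (a2-ii-3), would apply.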
  • when the number of alleles detected above the noise threshold is 2, the genotype of the target DNA site is estimated from its two largest allele counts, labeled R1 and R2 respectively, including the following steps:
  • (a2-ii-3-2) Judge whether the value of R1/(R1+R2) is less than 0.75, if the result is yes, then estimate the genotype of the target DNA locus as AB
  • in step (a2-ii-4), when the number of alleles detected above the noise threshold is greater than 2, the genotype of the target DNA site is estimated from at least its two largest allele counts, where the two largest allele counts are labeled R1 and R2 respectively, including the following steps:
  • (a2-ii-4-1) Determine whether R2/R1 is greater than or equal to 0.5, and/or whether R1/(R1+R2) is greater than or equal to 1/2 and less than or equal to 2/3, and/or whether R2/(R1+R2) is greater than or equal to 1/3 and less than or equal to 1/2; if the result is yes, the genotype of the target DNA site is estimated to be AB
  • (a2-ii-4-2) Mark the allele counts of the site as abnormal, then either estimate the genotype of the target site as NA and perform the following step (a2-ii-4-3); or set the number of alleles detected at the target DNA site above the noise threshold to 2, estimate the genotype of the target site as described in step (a2-ii-3), and perform the following step (a2-ii-4-3); (a2-ii-4-3) output the estimated genotype of the target site.
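The ratio checks in steps (a2-ii-3-2) and (a2-ii-4-1) can be sketched as below. The function names are hypothetical, and the genotype labels are left out because they are truncated in this excerpt. For counts ordered so that R1 >= R2 > 0, the three conditions of step (a2-ii-4-1) are algebraically equivalent (each reduces to R1 <= 2·R2), which is consistent with the text joining them by "and/or".

```python
from fractions import Fraction

def below_three_quarters(r1, r2):
    """Check of step (a2-ii-3-2): is R1/(R1+R2) less than 0.75?"""
    return Fraction(r1, r1 + r2) < Fraction(3, 4)

def ratio_checks(r1, r2):
    """The three checks of step (a2-ii-4-1) for integer counts r1 >= r2 > 0."""
    share1 = Fraction(r1, r1 + r2)
    share2 = Fraction(r2, r1 + r2)
    return (
        Fraction(r2, r1) >= Fraction(1, 2),
        Fraction(1, 2) <= share1 <= Fraction(2, 3),
        Fraction(1, 3) <= share2 <= Fraction(1, 2),
    )
```

Exact fractions are used so that boundary cases such as R1 = 2·R2 are not affected by floating-point rounding.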
  • genotype NA represents the genotype for which the target site cannot be estimated.
  • in step (a2-iii), the count (FC) derived from the least component DNA and the total count (TC) are estimated according to the estimated genotype of the target DNA site and its allele counts, where the three largest allele counts are labeled R1, R2 and R3 in sequence, including the following steps:
  • the estimated genotype of the target site is AB
  • the estimated counts (FC) derived from the least component DNA are R1-R2
  • the estimated total counts (TC) are R1+R2 or R1+R2+R3, then perform the following step (a2-iii-7);
  • the estimated count (FC) derived from the least component DNA is R1-R2+R3, or 2 times R3, or 2 times (R1-R2), the estimated total count (TC) is R1+R2+R3, and then perform the following step (a2-iii-7);
  • the estimated genotype of the target site is not one of the above-mentioned genotypes
  • the estimated count (FC) derived from the least component DNA is NA
  • the estimated total count (TC) is R1 or R1+R2 or R1+R2+R3, then perform the following step (a2-iii-7);
  • a count (FC) derived from the least component DNA that is estimated as NA means that the count (FC) derived from the least component DNA cannot be estimated.
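A small sketch of the FC and TC formulas listed above. Because the genotype labels are truncated in this excerpt, the cases are keyed by the hypothetical names "two_allele" and "three_allele" rather than by genotype; where the text offers alternatives, the first listed FC formula and the three-count TC are used.

```python
def estimate_fc_tc(case, r1, r2=0, r3=0):
    """Return (FC, TC) for one target site; FC is None when it is NA.

    case: hypothetical key standing in for the estimated genotype.
      "two_allele"   -> FC = R1 - R2,      TC = R1 + R2
      "three_allele" -> FC = R1 - R2 + R3, TC = R1 + R2 + R3
      anything else  -> FC = NA (None),    TC = R1 + R2 + R3
    """
    if case == "two_allele":
        return r1 - r2, r1 + r2
    if case == "three_allele":
        return r1 - r2 + r3, r1 + r2 + r3
    return None, r1 + r2 + r3
```

Sites whose FC is `None` would simply be left out when the concentration is later estimated from the remaining (FC, TC) pairs.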
  • the count (FC) derived from the least component DNA and the total count (TC) of each target site in the reference group are used to estimate the concentration of the least component DNA, wherein linear regression or robust linear regression is used to calculate the concentration of the least component DNA in the sample, and/or the mean or median of FC and TC is used to calculate it.
  • the count (FC) and total count (TC) of the minimum component DNA of each target site of the reference group are used to estimate the concentration of the minimum component DNA, wherein the minimum component is estimated by fitting a regression model. DNA concentration.
  • the concentration of the least component DNA is estimated by fitting a regression model, wherein the regression model is selected from: linear regression model, robust linear regression model, simple regression model, ordinary least squares regression model, multiple regression model, general multiple regression model, polynomial regression model, general linear model, generalized linear model, discrete choice regression model, logistic regression model, multinomial logit model, mixed logit model, probit model, multinomial probit model, ordered logit model, ordered probit model, Poisson model, multivariate response regression model, multilevel model, fixed effects model, random effects model, mixed model, nonlinear regression model, nonparametric model, semiparametric model, robust model, quantile model, isotonic model, principal component model, least angle model, local model, segmented model and errors-in-variables model.
  • the concentration of the least component DNA is estimated by fitting a regression model, wherein in the fitted model the total count (TC) of each target site in the reference group is the independent variable and the least component DNA count (FC) of each target site is the dependent variable.
  • the concentration of the least component DNA is estimated by fitting a regression model, wherein the concentration is estimated as the regression coefficient of the model parameter total count (TC).
  • the fitted regression model is a linear regression model; preferably, the fitted regression model is a robust linear regression model; preferably, the fitted regression model is a general linear model.
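The regression described above, with TC as the independent variable, FC as the dependent variable and the concentration read off as the coefficient of TC, can be sketched with ordinary least squares. This is a minimal sketch assuming a simple linear model with an intercept; the function name is hypothetical, and the robust and generalized variants mentioned in the text are not shown.

```python
def estimate_concentration_ols(tc, fc):
    """Slope of the least-squares line FC = a + b*TC; b estimates the concentration.

    tc, fc: equal-length sequences of per-site total counts and
    least-component counts (sites with FC = NA already removed).
    """
    n = len(tc)
    if n < 2 or n != len(fc):
        raise ValueError("need at least two paired (TC, FC) observations")
    mean_tc = sum(tc) / n
    mean_fc = sum(fc) / n
    sxx = sum((x - mean_tc) ** 2 for x in tc)
    if sxx == 0:
        raise ValueError("TC values must not all be identical")
    sxy = sum((x - mean_tc) * (y - mean_fc) for x, y in zip(tc, fc))
    return sxy / sxx

slope = estimate_concentration_ols([1000, 2000, 3000, 4000], [100, 200, 300, 400])
```

In this toy input every site has FC equal to one tenth of TC, so the fitted slope is 0.1 up to rounding.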
  • the present invention provides a method for calculating the concentration of the least component DNA in a sample by applying the iterative genotype-fitting method to the allele counts of each target site in the reference group, the method comprising:
  • setting the noise threshold of the sample in the above-mentioned step (b1) means setting the threshold for distinguishing the count signal of a true allele from the count signal of a non-true allele; preferably, the set noise threshold is any value not greater than 0.05; preferably, the set noise threshold is 0.05, 0.04, 0.03, 0.02, 0.01, 0.0075, 0.005, 0.0025 or 0.001.
  • setting the initial concentration estimate f0 means setting f0 to any possible value of the least component DNA concentration; preferably, f0 is less than 0.5; preferably, f0 is less than 0.5 and greater than the set noise threshold; preferably, f0 is 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01 or 0.005.
  • setting the iteration error precision value in the above step (b1) means setting a very small cut-off threshold for the iterative calculation; preferably, the precision value is less than 0.01; preferably, less than 0.001; preferably, less than 0.0001; preferably, the precision value is 0.01, 0.001, 0.0001 or 0.00001.
  • the genotype of each target DNA site is estimated by using its respective allele counts and the concentration value f0 of the least component DNA in the sample, including the following steps:
  • the goodness-of-fit test refers to one or more statistical tests that can be used to test the consistency between observed and theoretical numbers; preferably, the goodness-of-fit test is a chi-square test; preferably, a G test; preferably, Fisher's exact test; preferably, a binomial distribution test; preferably, the goodness-of-fit test is a chi-square test and/or G test and/or Fisher's exact test and/or binomial distribution test and/or variants thereof and/or combinations thereof; preferably, the goodness-of-fit test is performed using the G value calculated by the G test and/or AIC values and/or corrected G values and/or corrected AIC values and/or variants of G values or AIC values and/or combinations thereof.
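For reference, the G value of a G test compares observed counts O with expected counts E as G = 2·Σ O·ln(O/E). The sketch below computes it from observed allele counts and expected allele fractions; the function name and the input layout are assumptions, and the corrected-G and AIC variants mentioned in the text are not shown.

```python
import math

def g_statistic(observed, expected_fractions):
    """G = 2 * sum(O * ln(O / E)), with E = total * expected fraction.

    Terms with O = 0 contribute nothing; every expected fraction paired
    with a non-zero observed count must be greater than 0.
    """
    total = sum(observed)
    g = 0.0
    for obs, frac in zip(observed, expected_fractions):
        if obs > 0:
            g += obs * math.log(obs / (total * frac))
    return 2.0 * g
```

A G value near zero means the observed allele counts match the proportions predicted by the candidate genotype, so smaller values indicate a better fit.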
  • the count (FC) derived from the least component DNA and the total count (TC) are estimated according to the estimated genotype, wherein the four largest allele counts are labeled R1, R2, R3 and R4 in sequence, including the following steps:
  • the estimated genotype of the target site is AB
  • the estimated count (FC) derived from the least component DNA is R1-R2
  • the estimated total count (TC) is R1+R2 or R1+R2+R3 or R1+R2+R3+R4, then perform the following step (b3-11);
  • the count (FC) derived from the least component DNA is estimated to be R1-R2+R3, or 2 times R3, or 2 times (R1-R2), the estimated total count (TC) is R1+R2+R3 or R1+R2+R3+R4, and then perform the following step (b3-11);
  • the estimated genotype of the target site is AA
  • the estimated count (FC) derived from the least component DNA is R2
  • the estimated total count (TC) is R1+R2 or R1+R2+R3 or R1+R2+R3+R4, then perform the following step (b3-11);
  • the estimated genotype of the target site is not one of the above-mentioned genotypes
  • the estimated count (FC) derived from the least component DNA is NA
  • the estimated total count (TC) is R1 or R1+R2 or R1+R2+R3 or R1+R2+R3+R4, and then perform the following step (b3-11);
  • in step (b4), the concentration f of the least component DNA is estimated from the count (FC) derived from the least component DNA and the total count (TC) using the method described in step (a3).
  • the iterative genotype-fitting method is applied to the allele counts of each target site in the reference group to calculate the concentration of the least component DNA in the sample.
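The control flow of the iterative method (start from f0, re-estimate, stop once the change falls below the precision value) can be sketched as a generic loop. The `update` callable is a hypothetical stand-in for steps (b2) to (b4), which are only partly reproduced in this excerpt; the iteration cap `max_iter` is likewise an assumption added to guarantee termination.

```python
def iterate_concentration(update, f0, eps=1e-4, max_iter=100):
    """Repeat f <- update(f) until successive estimates differ by less than eps.

    update: callable taking the current concentration estimate and returning
            a new one (genotype fitting, FC/TC estimation, re-estimation).
    f0:     initial concentration estimate.
    eps:    iteration precision value.
    """
    f = f0
    for _ in range(max_iter):
        f_new = update(f)
        if abs(f_new - f) < eps:
            return f_new
        f = f_new
    raise RuntimeError("concentration estimate did not converge")

# Toy update with a fixed point at 0.1, used only to exercise the loop.
result = iterate_concentration(lambda f: 0.5 * f + 0.05, f0=0.25)
```

The toy update is not a model of the patent's procedure; it only shows that the loop stops near the fixed point once the step size drops below `eps`.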
  • This method can be used not only to estimate the concentration of the smallest component DNA in the mixed sample with biological relationship, but also to estimate the concentration of the smallest component DNA in the mixed sample with no biological relationship. Further, the method is applicable not only to calculating the concentration of fetal DNA in plasma DNA samples of pregnant women who are biological genetic mothers, but also to calculating the concentration of fetal DNA in plasma DNA of pregnant women who are legally permitted to accept egg donation. Further, the method can be used to estimate the concentration of minimal component DNA in two independent mixed DNA samples.
  • the method described above can be used to estimate the concentration of several components in a mixture of more than two samples.
  • a fetal DNA concentration value to be iterated can be set for each fetus; for example, for a twin pregnancy, the fetal DNA concentration values to be iterated can be set as f1 and f2; for a triplet pregnancy, they can be set as f1, f2 and f3; and so on.
  • to estimate the concentrations of multiple sample components, one can first set an initial value for the concentration of each component, then use the individual allele counts of each target DNA site and all possible genotypes of the site to estimate the count contributed by each sample component at that site, and then use the goodness-of-fit test to iteratively calculate the concentration of each component until the change in each calculated concentration is less than the set precision value.
  • the target of the sample to be detected in the present invention includes a single target DNA site, an entire chromosome containing one or more target DNA sites, and a subchromosomal segment containing one or more target DNA sites.
  • the present invention provides a method for determining the karyotype, genotype or wild-type/mutant status of a target to be detected in a sample by using a goodness-of-fit test of the allele counts of target DNA sites, the method comprising:
  • classifying each target DNA site as a reference site or a target site according to its location on the chromosome, wherein the reference sites form a reference group and the target sites form a target group;
  • a goodness-of-fit test method is adopted to estimate the genotype of the target to be detected in the sample, the method comprising:
  • a goodness-of-fit test method is adopted to estimate the karyotype of the target to be detected in the sample, the method comprising:
  • (c3-b1) analyze the sample to be tested and list all possible karyotypes of the target chromosome or subchromosomal segment to be detected;
  • a goodness-of-fit test method is adopted to estimate the wild-type/mutant status of the target to be detected in the sample, the method comprising:
  • the genotype is estimated by the goodness-of-fit test method according to the count of each allele and the concentration of the least component DNA in the sample;
  • a goodness-of-fit test method is adopted to estimate the genotype of the target to be detected in the sample.
  • the target group can be a target site or multiple independent repeats of a target site.
  • the independent repeats of a target site are obtained by independent PCR and/or multiplex PCR amplification reactions using the same primers; preferably, the independent repeats of a target site are obtained by independent PCR and/or multiplex PCR amplification reactions using different primers.
  • a goodness-of-fit test method is adopted to estimate the genotype or wild-type/mutant status of the target to be detected in the sample, wherein the goodness-of-fit test uses one or more statistical tests that can be used to test the consistency between observed and theoretical numbers; preferably, the goodness-of-fit test is a chi-square test; preferably, a G test; preferably, Fisher's exact test; preferably, a binomial distribution test; preferably, the goodness-of-fit test is the chi-square test and/or the G test and/or Fisher's exact test and/or the binomial distribution test and/or variants thereof and/or combinations thereof; preferably, the goodness-of-fit test is performed using the G test
  • a goodness-of-fit test method is adopted to estimate the genotype or wild-type/mutant status of the target to be detected in the sample, wherein the goodness-of-fit test is performed by the method described in steps (b2-i) to (b2-iv).
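Steps (b2-i) to (b2-iv) are not reproduced in this excerpt, so the following is only a sketch of how a genotype could be chosen by goodness of fit. It assumes a two-allele site, a disomic mother and fetus, a hypothetical "mother|fetus" genotype notation, and a small floor on expected fractions so that the G value stays finite; none of these details come from the source.

```python
import math

def g_statistic(observed, expected_fractions):
    """G = 2 * sum(O * ln(O / E)); terms with O = 0 contribute nothing."""
    total = sum(observed)
    g = 0.0
    for obs, frac in zip(observed, expected_fractions):
        if obs > 0:
            g += obs * math.log(obs / (total * frac))
    return 2.0 * g

def expected_b_fraction(genotype, f):
    """Expected share of allele B for a 'mother|fetus' genotype at fetal fraction f."""
    mother, fetus = genotype.split("|")
    maternal_b = mother.count("B") / len(mother)
    fetal_b = fetus.count("B") / len(fetus)
    return (1.0 - f) * maternal_b + f * fetal_b

def best_genotype(count_a, count_b, f, candidates, floor=0.001):
    """Return (best candidate, all scores); the smallest G value wins."""
    scores = {}
    for genotype in candidates:
        b = min(max(expected_b_fraction(genotype, f), floor), 1.0 - floor)
        scores[genotype] = g_statistic([count_a, count_b], [1.0 - b, b])
    return min(scores, key=scores.get), scores

CANDIDATES = ["AA|AA", "AA|AB", "AB|AA", "AB|AB"]  # illustrative subset only
best, scores = best_genotype(950, 50, 0.1, CANDIDATES)
```

With 950 A reads, 50 B reads and a fetal fraction of 0.1, the candidate in which only the fetus carries B predicts a B share of 0.05 and therefore fits best.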
  • the karyotype at the chromosome level refers to the euploidy or aneuploidy state of a certain chromosome in each mixed component in a mixed sample.
  • the karyotype when the mother is normal and the fetus is monosomic is disomy-monosomy
  • the karyotype when the mother is normal and the fetus is trisomic is disomy-trisomy
  • the karyotype when both the mother and the fetus are normal is disomy-disomy.
  • each subchromosomal-level segment is treated as a chromosome, so in maternal plasma samples the subchromosomal karyotype is nullisomy-nullisomy (absent in both) when both the mother and the fetus carry a homozygous microdeletion,
  • nullisomy-monosomy when the mother carries a homozygous microdeletion and the fetus a heterozygous microdeletion,
  • monosomy-disomy when the mother carries a heterozygous microdeletion and the fetus is normal,
  • monosomy-monosomy when both the mother and the fetus carry a heterozygous microdeletion, monosomy-nullisomy when the mother carries a heterozygous microdeletion and the fetus a homozygous microdeletion, and, for a normal fetus of a mother with a heterozygous microd
  • genotype refers to the combination of each genotype of a certain target DNA locus in each mixed component in a mixed sample, wherein 0 or 1 allele may be detected at this locus on each chromosome.
  • genotypes excluding genotypes in which the mother and/or fetus are mosaic
  • disomy-trisomy excluding genotypes in which the mother and/or fetus are mosaic and/or the fetus does not inherit at least one allele from the mother due to de novo mutations, etc.
  • A, B, C and D represent alleles that differ at the
  • the genotype of a locus in a mixed sample is all possible combinations of alleles for that locus on each chromosome in each sample.
  • at this locus, 0 (microdeletion), 1 (normal) or 2 (microduplication) alleles may be detected on each chromosome, so for a mixed sample all possible genotypes corresponding to a subchromosomal karyotype are all possible combinations of the alleles on each chromosome at the locus in the mixed sample.
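Enumerating "all possible combinations" can be sketched with the standard library. The sketch assumes that a genotype is an unordered multiset of alleles whose size equals the copy number, and writes mixed genotypes in a hypothetical "mother|fetus" notation. It does not apply the exclusions mentioned in the text (mosaicism, or a fetus that fails to inherit a maternal allele).

```python
from itertools import combinations_with_replacement, product

def genotypes(alleles, copies):
    """All unordered allele combinations for one genome with `copies` copies."""
    return ["".join(g) for g in combinations_with_replacement(sorted(alleles), copies)]

def mixed_genotypes(alleles, maternal_copies, fetal_copies):
    """All 'mother|fetus' pairs, before any inheritance-based exclusion."""
    return [
        f"{m}|{f}"
        for m, f in product(genotypes(alleles, maternal_copies),
                            genotypes(alleles, fetal_copies))
    ]

trisomy_genotypes = genotypes("AB", 3)
```

For two alleles and three copies this yields AAA, AAB, ABB and BBB; a filter implementing the inheritance constraint would then remove incompatible mother-fetus pairs.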
  • genotypes at the subchromosomal karyotype trisomy-trisomy (excluding genotypes in which the mother and/or the fetus are mosaic and/or in which the fetus, due to de novo mutation or the like, did not inherit at least one allele from the mother), namely AAA
  • the present invention provides a method for determining the karyotype or genotype or wild mutant type of a target to be detected in a sample by utilizing the relative distribution map of the allele counts of each target site, the method comprising:
  • dividing each target DNA site into a reference site or a target site according to its location on the chromosome, wherein the reference sites constitute a reference group and the target sites constitute a target group;
  • the method includes:
  • in step (d3), the allele count of each target DNA site of the target group and the concentration of the minimum component DNA in the sample are used, and the allele count relative distribution map method is adopted, to estimate the karyotype of the target to be detected in the sample; the method includes:
  • (d3-b1) analyze the sample to be tested and list all possible karyotypes of the target chromosome or subchromosomal segment to be detected;
  • (d3-b5) Infer the karyotype of the target to be tested according to the theoretical position distribution of each target DNA site in the target group in each karyotype and its actual position distribution in the relative allele count map.
  • in step (d3), the allele count of each target DNA site of the target group and the concentration of the minimum component DNA in the sample are used, and the allele count relative distribution map method is adopted, to estimate the wild mutant type of the target to be detected in the sample; the method comprises:
  • (d3-c3) For each target DNA site in the target group, calculate the relative count values of its wild-type allele and of the other non-wild-type alleles, plot the relative count of at least one non-wild-type allele against that of the wild-type allele to form the allele relative count map, and mark the actual position of the target DNA locus on that map;
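Step (d3-c3) can be sketched as computing one point per locus. This excerpt does not define "relative count", so the sketch assumes it is the allele count divided by the total count at the locus; the function name and the choice of the most abundant non-wild-type allele are likewise illustrative assumptions.

```python
def wild_vs_mutant_point(allele_counts, wild_type):
    """Return (wild-type relative count, largest non-wild-type relative count).

    allele_counts: dict mapping allele label -> read count at one locus.
    Assumption: relative count = count / total count at the locus.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no reads at this locus")
    wild = allele_counts.get(wild_type, 0) / total
    others = [c for a, c in allele_counts.items() if a != wild_type]
    mutant = max(others, default=0) / total
    return wild, mutant

point = wild_vs_mutant_point({"G": 90, "A": 10}, wild_type="G")
```

Each locus then contributes one (x, y) point, with the wild-type value on one axis and the non-wild-type value on the other.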
  • the present invention provides a method for determining the karyotype, genotype or wild mutant type of a target to be detected in a sample by using a goodness-of-fit test of allele counts of target DNA sites and/or a relative distribution map of allele counts.
  • in step (c2) or step (d2), the concentration of the minimum component DNA in the sample is calculated from the allele count of each target DNA site in the reference group using the method described in steps (a1) to (a3) and/or steps (b1) to (b5).
  • the present invention provides a method for determining the karyotype of a target to be detected in a single genome sample using a relative distribution map of allele counts, the method comprising:
  • the present invention can not only detect genetic alterations of each component in a mixed genome, for example by counting each allele of polymorphic loci in a DNA sample from a pregnant woman to detect variation of the mother and/or fetus at the single-locus, chromosomal and subchromosomal levels, but can also be applied to karyotype or genotype detection of single-genome samples, such as preimplantation diagnosis of genetic diseases in embryos.
  • This method can simultaneously detect the genetic changes of samples at the nucleotide level and the chromosomal or subchromosomal level, and has a good development and application prospect for the screening of fetal genetic diseases.
  • the present invention relates to the use of a mixture of maternal and fetal genetic material to detect whether a target to be tested has genetic abnormalities. Accordingly, in one aspect, the present invention provides methods for determining the presence or absence of fetal aneuploidy in a biological sample, the biological sample comprising fetal and maternal nucleic acids in the form of free-floating DNA from a biological sample of the mother; the target DNA loci are amplified in a PCR or multiplex PCR reaction (i.e., template DNA is amplified such that the amplified DNA reproduces the ratio of the original template DNA), and the presence or absence of the fetal aneuploidy is then determined from the relative count distribution of each allele at each amplified target DNA locus.
  • the present invention provides a method for determining the presence or absence of copy number variation of fetal chromosomal segments in a biological sample, said biological sample comprising fetal and maternal nucleic acids in the form of free-floating DNA from a biological sample of said mother; target DNA sites are amplified in a PCR or multiplex PCR reaction (i.e., template DNA is amplified such that the amplified DNA reproduces the ratio of the original template DNA), and the presence or absence of copy number variation in the fetal chromosomal segment is then determined from the relative count distribution of each allele at each amplified target DNA locus.
  • the present invention provides a method for determining the presence or absence of a variation in a causative locus of a fetal monogenic genetic disease in a biological sample, said biological sample comprising fetal and maternal nucleic acids in the form of free-floating DNA; target DNA sites are amplified in a PCR or multiplex PCR reaction (i.e., template DNA is amplified such that the amplified DNA reproduces the ratio of the original template DNA), and the presence or absence of the variation is then determined from the relative count distribution of each allele at each amplified target DNA locus (the causative gene locus of the monogenic genetic disease).
  • the present invention provides diagnostic kits for carrying out the methods of the present invention, comprising at least one set of primers to amplify a target DNA locus.
  • the at least one set of primers amplifies at least one reference set of target DNA sites and/or at least one target set of target DNA sites.
  • the target DNA loci of the target group are selected from chromosomes that may have chromosomal aneuploidy and/or chromosomal segments that may have copy number variation and/or loci that may carry pathogenic variants of monogenic genetic diseases.
  • the nucleic acid sequence of the target group target DNA site generally has polymorphisms in the population to be tested and/or the target group target DNA site is a possible pathogenic variant site of a single-gene genetic disease.
  • the reference group target DNA loci are selected from chromosomes generally free of chromosomal aneuploidy and/or chromosome segments generally free of copy number variation.
  • the nucleic acid sequence of the target DNA site in the reference group generally has polymorphisms in the population to be detected.
  • the present invention provides diagnostic kits for practicing the methods of the present invention.
  • the diagnostic kit includes primers for performing step (2) and/or step (3). Additional reagents that may optionally be included in the diagnostic kit are instructions for use, polymerases and buffers for performing PCR and/or multiplex PCR reactions, and reagents required for high-throughput sequencing library construction of amplified fragments.
  • the present invention provides a system for implementing the method of the present invention.
  • the system is used to implement one or more steps in a method for predicting the karyotype or genotype or wild mutant type of a target to be detected from a biological test sample, eg, one or more of steps (4) to (5).
  • the present invention provides an apparatus and/or computer program product and/or system and/or module for implementing the method of the present invention, the apparatus and/or computer program product and/or system and/or module being used to carry out any of the above-mentioned steps (1) to (5), steps (a1) to (a3), steps (b1) to (b5), steps (c1) to (c3), steps (d1) to (d3) and/or steps (e1) to (e3).
  • the methods of the invention are performed in vitro or ex vivo.
  • the samples of the present invention are in vitro or ex vivo samples.
  • the present invention relates to an apparatus for carrying out the method of the present invention.
  • the present invention relates to a device for detecting genetic variation in a sample, characterized by comprising:
  • a module configured to receive the biological sample to be tested and prepare nucleic acid
  • a statistical module which is configured to count the counts of each allele for each target DNA site
  • a determination module which is configured to determine the karyotype or genotype or wild mutant type of the target to be detected in the sample using the goodness-of-fit test of the allele counts of the target DNA sites and/or the relative distribution map of the allele counts.
  • the statistics module is configured to count each allele of each target DNA locus, and the counting sequentially includes the following steps: (4-1) map each amplified sequence to a chromosomal or genomic position; (4-2) count the number of sequences mapped to each chromosomal or genomic region; if a chromosomal or genomic region has different alleles, also count the number of sequences mapped to each allele in that region.
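The counting in step (4-2) can be sketched as a per-locus tally, assuming the mapping of step (4-1) has already produced one `(locus_id, allele)` pair per read. That input format and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def count_alleles(mapped_reads):
    """Tally reads per locus and per allele.

    mapped_reads: iterable of (locus_id, allele) pairs, one per mapped read.
    Returns a dict: locus_id -> Counter of allele -> read count.
    """
    counts = defaultdict(Counter)
    for locus_id, allele in mapped_reads:
        counts[locus_id][allele] += 1
    return counts

reads = [("locus1", "A"), ("locus1", "A"), ("locus1", "G"), ("locus2", "T")]
allele_counts = count_alleles(reads)
# Total reads mapped to a locus is the sum over its alleles.
locus1_total = sum(allele_counts["locus1"].values())
```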
  • each sequence read is mapped to a chromosomal or genomic location/region using any computer method.
  • the computer algorithm for mapping sequences in step (4-1) includes, but is not limited to, search for specific sequences, BLAST, BLITZ, FASTA, BOWTIE, BOWTIE 2, BWA, NOVOALIGN, GEM, ZOOM, ELAN, MAQ, MATCH, SOAP, STAR, SEGEMEHL, MOSAIK or SEQMAP or variants or combinations thereof.
  • specific sequences unique mapped sequences
  • specific sequences are extracted from the chromosomal or genomic sequence corresponding to each target DNA locus, and these specific sequences are then used to map reads to chromosomal or genomic locations/regions.
  • sequence reads can be aligned to sequences at chromosomal or genomic locations/regions. In some embodiments, sequence reads can be aligned with sequences of a chromosome or genome. In some embodiments, sequence reads may be obtained from and/or aligned with sequences in nucleic acid databases known in the art, including, for example, GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). BLAST or similar tools can be used to search for identical sequences in sequence databases. The search hits can then be used, for example, to sort identical sequences into the appropriate chromosomal or genomic locations/regions.
  • reads may map uniquely or non-uniquely to portions of a reference genome. If a read aligns to a single sequence in the genome, it is called a "unique mapping". If a read aligns to two or more sequences in the genome, it is called a "non-unique mapping". In some embodiments, reads that are not uniquely mapped are removed from further analysis (e.g., quantification).
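Removing non-uniquely mapped reads before quantification can be sketched as follows. The input is assumed to be a list of `(read_id, location)` tuples with one tuple per distinct alignment; this format and the function name are hypothetical and do not correspond to the output of any particular aligner.

```python
from collections import Counter

def keep_unique_mappings(alignments):
    """Keep only alignments of reads that align to exactly one location.

    alignments: list of (read_id, location) tuples, one per alignment.
    """
    hits_per_read = Counter(read_id for read_id, _ in alignments)
    return [(r, loc) for r, loc in alignments if hits_per_read[r] == 1]

alignments = [
    ("read1", "chr21:100"),
    ("read2", "chr21:100"),
    ("read2", "chr13:500"),  # read2 aligns twice: a non-unique mapping
    ("read3", "chr18:250"),
]
unique = keep_unique_mappings(alignments)
```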
  • the determination module is configured to determine the karyotype or genotype or wild mutant of the target to be detected in the sample using a goodness-of-fit test of allele counts of target DNA loci, the determination in turn comprising the following step:
  • the determination module is configured to determine the karyotype or genotype or wild mutant of the target to be detected in the sample using the relative distribution of allele counts at the target DNA loci, the determining in turn comprising the steps of:
  • one or more statistical goodness-of-fit tests are used to test the agreement between the observed and theoretical counts.
  • the goodness-of-fit test is a chi-square test.
  • the goodness-of-fit test is a G-test.
  • the goodness-of-fit test is Fisher's exact test.
  • the goodness-of-fit test is a binomial distribution test.
  • the goodness-of-fit test is a chi-square test, a G test, a Fisher's exact test, a binomial distribution test, a variant thereof, or a combination thereof.
  • the goodness-of-fit test is performed using a G value calculated by a G-test, an AIC value, a corrected G value, a corrected AIC value, a variant of the G value or AIC value, or a combination thereof.
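As an illustration of a G-test based goodness-of-fit comparison, the sketch below computes the standard statistic G = 2 Σ O·ln(O/E) for the observed allele counts under each candidate genotype and picks the smallest value. The candidate labels, their expected allele fractions and the "smallest G wins" rule are assumptions for illustration; the corrected G and AIC variants mentioned in the text are not shown.

```python
import math

def g_statistic(observed, expected_fractions):
    """G = 2 * sum(O * ln(O / E)), with E = total count * expected fraction."""
    total = sum(observed)
    g = 0.0
    for o, p in zip(observed, expected_fractions):
        if o == 0:
            continue  # a zero count contributes nothing to the sum
        if p == 0:
            return math.inf  # reads observed where the model expects none
        g += o * math.log(o / (total * p))
    return 2.0 * g

def best_fitting(observed, candidates):
    """Return the candidate label whose expected fractions give the smallest G."""
    return min(candidates, key=lambda name: g_statistic(observed, candidates[name]))

observed = [550, 450]
candidates = {
    "model_1": [0.55, 0.45],  # hypothetical expected allele fractions
    "model_2": [0.50, 0.50],
}
best = best_fitting(observed, candidates)
```

In practice the expected fractions for each genotype would be derived from the estimated concentration of the least component DNA.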
  • the determination module is configured to determine the karyotype of the target to be detected in the sample using the relative distribution of allele counts at the target DNA loci, wherein the sample to be detected is a single genomic sample, the determining in turn comprising the steps of :
  • the relative count of the second largest allele is plotted against the relative count of the largest allele to give distribution map A, or the relative count of the largest allele is plotted against the relative position of the target DNA locus on the chromosome or subchromosome to give distribution map B;
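The coordinates of one locus on distribution map A can be sketched as below, assuming the relative count of an allele is its count divided by the total count at the locus (an assumption; the excerpt does not define it). The function name is hypothetical.

```python
def map_a_point(allele_counts):
    """Return (largest relative count, second largest relative count) for one locus.

    allele_counts: list of read counts, one per allele detected at the locus.
    """
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("no reads at this locus")
    ranked = sorted(allele_counts, reverse=True) + [0]  # pad for single-allele loci
    return ranked[0] / total, ranked[1] / total

heterozygous_like = map_a_point([50, 50])
homozygous_like = map_a_point([100])
```

Plotting the second value against the first for every locus gives a scatter in which loci with similar allele balance cluster together.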
  • step (c2) or step (d2) the relative ratio method of allele count is adopted to calculate the concentration of the minimum component DNA in the sample, and the calculation includes the following steps in sequence:
  • step (c2) or step (d2) an iteratively fitting genotype method of allele counting is adopted to calculate the concentration of the minimum component DNA in the sample, and the calculation sequentially includes the following steps:
  • in step (c3), using the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample, a goodness-of-fit test method is adopted to estimate the genotype of the target to be detected in the sample; the estimation includes the following steps in sequence:
  • in step (c3), using the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample, a goodness-of-fit test method is adopted to estimate the karyotype of the target to be detected in the sample; the estimation includes the following steps in sequence:
  • (c3-b1) analyze the sample to be tested and list all possible karyotypes of the target chromosome or subchromosomal segment to be detected;
  • step (c3) using the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample, a goodness-of-fit test method is adopted to estimate the amount of the target to be detected in the sample.
  • the estimation sequentially includes the following steps:
  • step (c3) using the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample, a goodness-of-fit test method is adopted to estimate the amount of the target to be detected in the sample.
  • the estimation sequentially includes the following steps:
  • the genotype is estimated by the goodness-of-fit test method according to the count of each allele and the concentration of the least component DNA in the sample;
  • a goodness-of-fit test is performed using one or more statistical tests that can be used to test the agreement between the observed and theoretical numbers.
  • the goodness-of-fit test is a chi-square test.
  • the goodness-of-fit test is a G-test.
  • the goodness-of-fit test is Fisher's exact test.
  • the goodness-of-fit test is a binomial distribution test.
  • the goodness-of-fit test is a chi-square test and/or a G test and/or Fisher's exact test and/or a binomial distribution test.
  • the goodness-of-fit test is performed using a G value calculated by a G test and/or an AIC value and/or a corrected G value and/or a corrected AIC value and/or values derived from the G value or AIC value.
  • in step (d3), the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample are used, and the allele count relative distribution map method is adopted, to estimate the genotype of the target to be detected in the sample; the estimation sequentially includes the following steps:
  • in step (d3), the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample are used, and the allele count relative distribution map method is adopted, to estimate the karyotype of the target to be detected in the sample; the estimation sequentially includes the following steps:
  • (d3-b1) analyze the sample to be tested and list all possible karyotypes of the target chromosome or subchromosomal segment to be detected;
  • in step (d3), the allele count of each target DNA site in the target group and the concentration of the least component DNA in the sample are used, and the allele count relative distribution map method is adopted, to estimate the wild mutant type of the target to be detected in the sample; the estimation sequentially includes the following steps:
  • (d3-c3) For each target DNA site in the target group, calculate the relative count values of its wild-type allele and of the other non-wild-type alleles, plot the relative count of at least one non-wild-type allele against that of the wild-type allele to form the allele relative count map, and mark the actual position of the target DNA locus on that map;
  • the genotype is first estimated using the individual allele counts of the target locus, and the counts derived from the least component DNA (FC) and the total counts (TC) are then estimated from the estimated genotype; the estimation sequentially includes the following steps:
  • in step (a2-ii), the genotype of the target DNA locus is estimated using the individual allele counts of the target DNA locus, the estimation sequentially comprising the following steps:
  • (a2-ii-1) Use the allele counts of the target DNA site to determine the number of alleles detected at the target DNA site that are above the noise threshold; if the result is 1, perform step (a2-ii-2); if the result is 2, perform step (a2-ii-3); if the result is greater than 2, perform step (a2-ii-4);
  • in step (a2-ii-3), the genotype of the target DNA locus is estimated based on the number of detected alleles above the noise threshold being 2 and the two largest allele counts at the target DNA locus; the estimation sequentially includes the following steps:
  • (a2-ii-3-2) Judge whether the value of R1/(R1+R2) is less than 0.75, if the result is yes, then estimate the genotype of the target DNA locus as AB
  • in step (a2-ii-4), the genotype of the target DNA locus is estimated based on the number of detected alleles above the noise threshold being greater than 2 and at least the two largest allele counts of the target DNA locus; the estimation sequentially includes the following steps:
  • (a2-ii-4-1) Determine whether R2/R1 is greater than or equal to 0.5 and/or whether R1/(R1+R2) is greater than or equal to 1/2 and less than or equal to 2/3 and/or whether R2/(R1+R2) is greater than or equal to 1/3 and less than or equal to 1/2; if the result is yes, the genotype of the target DNA site is estimated to be AB
  • (a2-ii-4-2) Mark the allele count of the locus as abnormal; then either estimate the genotype of the target locus to be NA and perform step (a2-ii-4-3), or set the number of alleles detected at the target DNA locus above the noise threshold to 2, estimate the genotype of the target locus as described in step (a2-ii-3), and then perform step (a2-ii-4-3);
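The branching in steps (a2-ii-1), (a2-ii-3-2), (a2-ii-4-1) and (a2-ii-4-2) can be sketched as follows. This is a partial illustration under stated assumptions: the noise threshold is treated as an absolute read count, (a2-ii-4-2) is taken to be the fallback when (a2-ii-4-1) fails (using its "NA" option), and every branch not spelled out in this excerpt returns `None` rather than a guessed genotype.

```python
def estimate_genotype(counts, noise_threshold):
    """Partial genotype call for one locus from its allele read counts."""
    ranked = sorted(counts, reverse=True)
    n_alleles = sum(1 for c in ranked if c > noise_threshold)  # step (a2-ii-1)

    if n_alleles == 2:  # step (a2-ii-3)
        r1, r2 = ranked[0], ranked[1]
        if r1 / (r1 + r2) < 0.75:  # step (a2-ii-3-2)
            return "AB"
        return None  # remaining (a2-ii-3) branches are not given in the excerpt

    if n_alleles > 2:  # step (a2-ii-4)
        r1, r2 = ranked[0], ranked[1]
        if r2 / r1 >= 0.5:  # step (a2-ii-4-1); same as R1/(R1+R2) <= 2/3
            return "AB"
        return "NA"  # step (a2-ii-4-2): counts marked abnormal (assumed fallback)

    return None  # step (a2-ii-2) is not given in the excerpt

balanced = estimate_genotype([520, 480], noise_threshold=10)
three_alleles = estimate_genotype([500, 400, 100], noise_threshold=10)
```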
  • in step (a2-iii), the counts derived from the minimal component DNA (FC) and the total counts (TC) are estimated based on the estimated genotype of the target DNA locus and the individual allele counts at the target DNA locus, where the three largest allele counts are labeled R1, R2 and R3 in sequence; the estimation sequentially includes the following steps:
  • if the estimated genotype of the target site is AB, the estimated count (FC) derived from the least component DNA is R1-R2 and the estimated total count (TC) is R1+R2 or R1+R2+R3; then perform step (a2-iii-7);
  • the estimated count (FC) derived from the least component DNA is R1-R2+R3, or 2 times R3, or 2 times (R1-R2); the estimated total count (TC) is R1+R2+R3; then perform step (a2-iii-7);
  • if the estimated genotype of the target site is not one of the above-mentioned genotypes, the estimated count (FC) derived from the least component DNA is NA and the estimated total count (TC) is R1 or R1+R2 or R1+R2+R3; then perform step (a2-iii-7);
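Two of the cases listed for step (a2-iii) can be sketched directly: the case labeled AB, with FC = R1-R2 and TC = R1+R2, and the fallback case with FC = NA. The function name is hypothetical, `None` stands in for NA, and where the text offers several alternatives for TC the sketch simply picks one (R1+R2 for AB, R1+R2+R3 otherwise) as an assumption.

```python
def estimate_fc_tc(genotype, counts):
    """Return (FC, TC) for one locus; FC is None when it cannot be estimated."""
    ranked = sorted(counts, reverse=True) + [0, 0, 0]  # pad so R1..R3 always exist
    r1, r2, r3 = ranked[:3]
    if genotype == "AB":
        return r1 - r2, r1 + r2
    # Genotype not among the listed cases: FC is NA, TC is still reported.
    return None, r1 + r2 + r3

fc, tc = estimate_fc_tc("AB", [550, 450])
# With these counts FC/TC = 100/1000, i.e. an apparent concentration of 0.1.
```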
  • in step (b2), the genotype of each target DNA locus is estimated using its individual allele counts and the concentration value f0 of the least component DNA in the sample, the estimation in turn comprising the following steps:
  • in step (b3), for each target DNA locus, the counts derived from the least component DNA (FC) and the total counts (TC) are estimated based on its estimated genotype, wherein the four largest allele counts are labeled R1, R2, R3 and R4 in turn; the estimation includes the following steps:
  • if the estimated genotype of the target site is AB, the estimated count (FC) derived from the least component DNA is R1-R2 and the estimated total count (TC) is R1+R2 or R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • the count (FC) derived from the least component DNA is estimated to be R1-R2+R3, or 2 times R3, or 2 times (R1-R2); the estimated total count (TC) is R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • if the estimated genotype of the target site is AA, the estimated count (FC) derived from the least component DNA is R2 and the estimated total count (TC) is R1+R2 or R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • if the estimated genotype of the target site is not one of the above-mentioned genotypes, the estimated count (FC) derived from the least component DNA is NA and the estimated total count (TC) is R1 or R1+R2 or R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • the present invention relates to an apparatus for calculating the concentration of minimal component DNA in a sample, the apparatus comprising:
  • (a3) A calculation module that estimates the concentration of minimal component DNA using the count (FC) and total count (TC) of minimal component DNA for each target site of the reference group.
  • the genotype of each target locus is first estimated from the allele counts of the target locus, and the counts derived from the minimal component DNA (FC) and the total counts (TC) are then estimated from the estimated genotype, including the following steps:
  • the genotype of the target DNA locus is estimated using individual allele counts at the target DNA locus, wherein the three largest allele counts are labeled R1, R2, and R3 in order, comprising the steps of:
  • (a2-ii-1) Use the allele counts of the target DNA site to determine the number of alleles detected at the target DNA site that are above the noise threshold; if the result is 1, perform step (a2-ii-2); if the result is 2, perform step (a2-ii-3); if the result is greater than 2, perform step (a2-ii-4);
  • the genotype of the target DNA locus is estimated based on the number of detected alleles above the noise threshold being 2 and the two largest allele counts at the target DNA locus, wherein the two largest allele counts are labeled R1 and R2, respectively, and the estimation includes the following steps:
  • (a2-ii-3-2) Judge whether the value of R1/(R1+R2) is less than 0.75, if the result is yes, then estimate the genotype of the target DNA locus as AB
  • the genotype of the target DNA locus is estimated based on the number of detected alleles above the noise threshold being greater than 2 and at least the two largest allele counts of the target DNA locus, wherein the two largest allele counts are labeled R1 and R2, and the estimation includes the following steps:
  • (a2-ii-4-1) Determine whether R2/R1 is greater than or equal to 0.5 and/or whether R1/(R1+R2) is greater than or equal to 1/2 and less than or equal to 2/3 and/or whether R2/(R1+R2) is greater than or equal to 1/3 and less than or equal to 1/2; if the result is yes, the genotype of the target DNA site is estimated to be AB
  • (a2-ii-4-2) Mark the allele count of the locus as abnormal; then either estimate the genotype of the target locus to be NA and perform step (a2-ii-4-3), or set the number of alleles detected at the target DNA locus above the noise threshold to 2, estimate the genotype of the target locus as described in step (a2-ii-3), and then perform step (a2-ii-4-3);
  • the counts derived from the least component DNA (FC) and the total counts (TC) are estimated based on the estimated genotype of the target DNA locus and the individual allele counts at the target DNA locus, wherein the three largest allele counts are labeled R1, R2 and R3 in sequence, including the following steps:
  • if the estimated genotype of the target site is AB, the estimated count (FC) derived from the least component DNA is R1-R2 and the estimated total count (TC) is R1+R2 or R1+R2+R3; then perform step (a2-iii-7);
  • the estimated count (FC) derived from the least component DNA is R1-R2+R3, or 2 times R3, or 2 times (R1-R2); the estimated total count (TC) is R1+R2+R3; then perform step (a2-iii-7);
  • if the estimated genotype of the target site is not one of the above-mentioned genotypes, the estimated count (FC) derived from the least component DNA is NA and the estimated total count (TC) is R1 or R1+R2 or R1+R2+R3; then perform step (a2-iii-7);
  • the calculation module of step (a3) calculates the concentration of the least component DNA in the sample from the FC and TC counts, either using linear regression or robust linear regression, or using the mean or median of FC and TC.
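Two of the options named for step (a3) can be sketched in a few lines. Both functions rest on assumptions: the regression is taken to be an ordinary least-squares fit of FC on TC through the origin (the excerpt does not say whether an intercept is used), and the median variant is taken to be median(FC)/median(TC). Loci whose FC is NA (`None` here) are skipped.

```python
from statistics import median

def _valid_pairs(fc_values, tc_values):
    """Drop loci whose FC could not be estimated (None stands in for NA)."""
    return [(f, t) for f, t in zip(fc_values, tc_values) if f is not None]

def concentration_by_regression(fc_values, tc_values):
    """Least-squares slope of FC on TC for a line through the origin."""
    pairs = _valid_pairs(fc_values, tc_values)
    return sum(f * t for f, t in pairs) / sum(t * t for _, t in pairs)

def concentration_by_median(fc_values, tc_values):
    """Ratio of the median FC to the median TC."""
    pairs = _valid_pairs(fc_values, tc_values)
    return median(f for f, _ in pairs) / median(t for _, t in pairs)

fc_values = [100, 110, None, 90]
tc_values = [1000, 1000, 1200, 1000]
f_regression = concentration_by_regression(fc_values, tc_values)
f_median = concentration_by_median(fc_values, tc_values)
```

A robust regression would down-weight outlying loci; it is omitted here to keep the sketch dependency-free.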
  • the present invention relates to a device for calculating the concentration of minimal component DNA in a sample, the device comprising:
  • (b1) a setting module, which sets the noise threshold of the sample, the initial concentration estimate f0 and the iteration error precision value;
  • (b2) a module for estimating the genotype of each target DNA locus using its individual allele count and the concentration value f0 of the least component DNA in the sample;
  • (b3) an estimation module, which estimates, for each target DNA site, the count (FC) and total count (TC) derived from minimal component DNA according to its estimated genotype;
  • (b4) a module for estimating the concentration f of the minimum fraction DNA using the count (FC) and total count (TC) of the minimum fraction DNA;
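The interplay of modules (b1) to (b4) suggests a fixed-point iteration: start from f0, re-estimate genotypes and FC/TC, compute a new concentration f, and repeat until the change falls below the precision value. The sketch below shows only that loop. The stopping rule, the `max_iterations` guard and the `estimate_once` callback (standing in for steps (b2) to (b4)) are assumptions, since the convergence step is not included in this excerpt.

```python
def iterate_concentration(estimate_once, f0, precision, max_iterations=100):
    """Repeat f -> estimate_once(f) until successive values differ by < precision.

    estimate_once: callable taking the current concentration estimate and
    returning an updated estimate (hypothetical stand-in for steps b2-b4).
    """
    f_previous = f0
    for _ in range(max_iterations):
        f_new = estimate_once(f_previous)
        if abs(f_new - f_previous) < precision:
            return f_new
        f_previous = f_new
    return f_previous  # not converged within max_iterations

# Toy update rule whose fixed point is 0.1, used only to exercise the loop.
converged = iterate_concentration(lambda f: 0.5 * f + 0.05, f0=0.3, precision=1e-9)
```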
  • estimating the genotype of each target DNA locus using its allele count and the concentration value f0 of the least component DNA in the sample comprises the following steps:
  • the counts derived from the least component DNA (FC) and the total counts (TC) are estimated based on the estimated genotype of each target DNA locus, wherein the four largest allele counts are labeled R1, R2, R3 and R4 in order; the estimation includes the following steps:
  • if the estimated genotype of the target site is AB, the estimated count (FC) derived from the least component DNA is R1-R2 and the estimated total count (TC) is R1+R2 or R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • the count (FC) derived from the least component DNA is estimated to be R1-R2+R3, or 2 times R3, or 2 times (R1-R2); the estimated total count (TC) is R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • if the estimated genotype of the target site is AA, the estimated count (FC) derived from the least component DNA is R2 and the estimated total count (TC) is R1+R2 or R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • if the estimated genotype of the target site is not one of the above-mentioned genotypes, the estimated count (FC) derived from the least component DNA is NA and the estimated total count (TC) is R1 or R1+R2 or R1+R2+R3 or R1+R2+R3+R4; then perform step (b3-11);
  • the sample is a maternal plasma sample and the minimal component DNA is fetal DNA. In some embodiments, the sample is embryonic nucleic acid derived from preimplantation diagnosis.
  • the present invention provides diagnostic kits for practicing the methods of the present invention.
  • the diagnostic kit includes at least one set of primers to amplify a reference set of target DNA loci and/or a target set of target DNA loci.
  • the target DNA loci of the target group are selected from chromosomes that may have chromosomal aneuploidy and/or chromosomal segments that may have copy number variation and/or loci that may carry pathogenic variants of monogenic genetic diseases.
  • the nucleic acid sequence of the target DNA site of the target group generally has polymorphisms in the population to be tested and/or may be a site of pathogenic variants of a single-gene genetic disease.
  • the reference group target DNA loci are selected from chromosomes generally free of chromosomal aneuploidy and/or chromosome segments generally free of copy number variation.
  • the nucleic acid sequence of the target DNA site in the reference group generally has polymorphisms in the population to be detected.
  • the reference set of target DNA loci are selected from chromosomal regions in the sample that are believed to be free of chromosomal aneuploidy or chromosomal segment copy number variation.
  • the reference chromosome or reference chromosome region is selected from chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X and Y, and sometimes, the reference chromosome or reference chromosomal region is selected from autosomes (i.e., not X and Y).
  • the target DNA locus of interest is selected from a chromosomal region in the sample that is believed to be likely to have a chromosomal aneuploidy or copy number variation of a chromosomal segment.
  • the target DNA locus of interest is selected from nucleic acid regions in the sample that are believed to have and/or may have pathogenic variants in a single-gene genetic disease.
  • the chromosomal region of interest is selected from chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X and Y.
  • the target group target DNA sites are selected from chromosome 13 and/or chromosome 18 and/or chromosome 21 and/or chromosome X and/or chromosome Y.
  • the kit includes primers for amplifying target nucleic acids derived from chromosomes 13, 18, 21, X and/or Y.
  • the target DNA site of the target group is selected from the chromosomal regions associated with 1p36 deletion syndrome, cri-du-chat syndrome, peroneal muscular atrophy (Charcot-Marie-Tooth disease), DiGeorge syndrome, Duchenne muscular dystrophy, Williams-Beuren syndrome, Wolf-Hirschhorn syndrome, 15q13.3 microdeletion syndrome, Miller-Dieker syndrome, Smith-Magenis syndrome, Angelman syndrome and Langer-Giedion syndrome.
  • the kit includes tools for amplification derived from 1p36 deletion syndrome, cri-du-chat syndrome, peroneal muscular atrophy (Charcot-Marie-Tooth disease), DiGeorge syndrome, Duchenne muscular dystrophy, Williams-Beuren syndrome, Wolf-Hirschhorn syndrome, 15q13.
  • the chromosome or portion thereof comprising the target locus region is a euploid chromosome. Euploidy refers to the normal number of chromosomes. Additional reagents that may optionally be included in the diagnostic kit are instructions for use, polymerases and buffers for performing PCR and/or multiplex PCR reactions, and reagents required for high-throughput sequencing library construction of amplified fragments.
  • the present invention provides a system for performing one or more steps of the methods of the present invention for predicting the karyotype, genotype, or wild-type/mutant type of a target to be detected from a biological test sample, such as one or more of steps (4) to (5).
  • the present invention provides apparatus and/or computer program products and/or systems and/or modules for implementing the methods of the present invention, the apparatus and/or computer program products and/or systems and/or modules being for carrying out the above-mentioned steps (1)-(5), steps (a1)-(a3), steps (b1)-(b5), steps (c1)-(c3), steps (d1)-(d3) and/or any of the above-mentioned steps (e1)-(e3).
  • the present invention relates to the following embodiments:
  • in step (5), a goodness-of-fit test of the allele counts is used to determine the karyotype, genotype, or wild-type/mutant type of the target site to be detected in the sample, and the determining includes the following steps in sequence:
  • (A1) divide the target DNA sequence into a reference group sequence and a target group sequence according to its location on the chromosome;
  • the concentration of the least component DNA in the sample is calculated by the relative ratio method of allele count or the iterative fitting genotype method of allele count;
  • in step (5), the karyotype, genotype, or wild-type/mutant type of the target site to be detected in the sample is determined by using the relative distribution map of allele counts, and the determining includes the following steps in sequence:
  • step (5) the karyotype or genotype or wild mutant of the target site to be detected in the sample is determined by using a relative distribution map of allele counts, wherein the to-be-detected karyotype or genotype or wild mutant
  • the sample is a single genomic sample, and the determining includes the following steps in sequence:
  • (C2) For each target DNA sequence, plot the relative count of the second-largest allele against the relative count of the largest allele as distribution map A, or plot the relative count of the largest allele against the relative position of the target DNA sequence on the chromosome or subchromosome as distribution map B;
  • in step (A2) or step (B2), the allele count relative ratio method is used to calculate the concentration of the least component DNA in the sample, and the calculation includes the following steps in sequence:
  • in step (A2) or step (B2), the allele count iterative genotype-fitting method is used to calculate the concentration of the least component DNA in the sample, and the calculation includes the following steps in sequence:
  • in step (A3), the goodness-of-fit test of the allele counts of each target DNA sequence of the target group is used to estimate the genotype of the target site to be detected in the sample, and the estimation includes the following steps in sequence:
  • in step (A3), the goodness-of-fit test of the allele counts of each target DNA sequence of the target group is used to estimate the karyotype of the target site to be detected in the sample, and the estimation includes the following steps in sequence:
  • in step (A3), the goodness-of-fit test of the allele counts of each target DNA sequence of the target group is used to estimate the wild-type/mutant type of the target site to be detected in the sample, and the estimation includes the following steps in sequence:
  • in step (B3), the relative distribution map of allele counts of each target DNA sequence of the target group is used to estimate the genotype of the target site to be detected in the sample, and the estimation includes the following steps in sequence:
  • (B3.a3) Calculate the relative count of each allele of the target DNA sequence, and plot at least one non-maximal relative allele count against the maximal relative allele count to mark the actual position of the target DNA sequence in the relative allele count plot; (B3.a4) Infer the genotype of the target DNA sequence from the theoretical position distribution and the actual position in the relative allele count plot of the target DNA sequence.
  • in step (B3), the relative distribution map of allele counts of each target DNA sequence of the target group is used to estimate the karyotype of the target site to be detected in the sample, and the estimation includes the following steps in sequence:
  • (B3.b2) For each possible karyotype, list all possible genotypes of the target DNA sequences of the target group on the chromosome or subchromosome of that karyotype in the sample, and then, for each genotype, plot at least one non-maximal theoretical relative allele count against the maximal theoretical relative allele count to mark the theoretical position of the genotype;
  • (B3.b4) Infer the karyotype of the target chromosome or subchromosomal segment to be detected according to the theoretical position distribution and actual position distribution of all target DNA sequences in the relative allele count graph.
  • in step (B3), the relative distribution map of allele counts of each target DNA sequence of the target group is used to estimate the wild-type/mutant type of the target site to be detected in the sample, and the estimation includes the following steps in sequence:
  • (B3.c2) Calculate the theoretical relative counts of the wild-type allele and the other non-wild-type alleles in each possible genotype, and, for each genotype, plot at least one theoretical non-wild-type relative allele count against the theoretical wild-type relative allele count to mark the theoretical positions of all possible genotypes;
  • (B3.c3) Calculate the relative counts of the wild-type allele and the other non-wild-type alleles of the target DNA sequence, and plot at least one non-wild-type relative allele count against the wild-type relative allele count to mark the actual position of the target DNA sequence in the relative allele count plot;
  • step (a2) sequentially comprises the following steps:
  • step (ii) Judging the number of alleles detected in the target DNA sequence that are higher than the noise threshold; if the judgment result is 1, then estimate the genotype of the target DNA sequence to be AA
  • step (iii) Judging whether the value of R1/(R1+R2) is less than 0.5+ ⁇ , if the judgment result is yes, then the genotype of the target DNA sequence is estimated to be AB
  • step (iv) Judging whether the value of R1/(R1+R2) is less than 0.75, if the judgment result is yes, then the genotype of the target DNA sequence is estimated to be AB
  • step (v) Judging whether the value of R2/R1 is less than 0.5, if the judgment result is no, then estimate the genotype of the target DNA sequence to be AB
  • step (b2) sequentially comprises the following steps:
  • Figure 1 is a schematic diagram of a flowchart for estimating fetal DNA concentration using individual allele counts at multiple polymorphic loci in maternal plasma cfDNA samples.
  • Figure 2 is a schematic diagram of the process flow for estimating the minimum component DNA concentration in a mixed sample of two components using the individual allele counts of multiple polymorphic loci in the sample.
  • Figure 3 shows the estimation of fetal DNA concentration using polymorphic site sequencing in maternal plasma cfDNA samples.
  • the fetal DNA count (FC) and the total maternal and fetal DNA count (TC) were first estimated for each polymorphic locus from its individual allele counts, and an rlm robust regression through the origin was then fitted to the FC and TC values of all polymorphic loci; the fetal DNA concentration was estimated as the slope of the fitted line (model coefficient).
  • Figure 4 shows the estimation of the minimum component DNA concentration using polymorphic site sequencing in mixed component DNA samples.
  • the individual allele counts for each polymorphic locus were used to estimate the minimum component DNA count (FC) and the total component DNA count (TC) for that locus.
  • In Figure 4a, an rlm robust regression through the origin was performed using the FC and TC values of each polymorphic site, and the minimum component DNA concentration was estimated as the slope of the fitted line (model coefficient).
  • Figure 4b is the result of rlm robust regression estimation of minimum component DNA sample concentrations on multiple different samples or different biological replicates.
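The through-origin robust fit described for Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the `rlm` routine itself: it uses iteratively reweighted least squares with Huber weights and a MAD-based scale, and the function name, the tuning constant default, and the iteration limits are all hypothetical choices made for the example.

```python
from statistics import median

def robust_slope_through_origin(tc, fc, k=1.345, max_iter=50, tol=1e-10):
    """Estimate the slope b in fc ~ b * tc (no intercept) by IRLS with Huber weights."""
    # Ordinary least-squares start: b = sum(x*y) / sum(x*x)
    b = sum(x * y for x, y in zip(tc, fc)) / sum(x * x for x in tc)
    for _ in range(max_iter):
        residuals = [y - b * x for x, y in zip(tc, fc)]
        # Robust scale from the median absolute deviation (MAD) of the residuals
        scale = median(abs(r) for r in residuals) / 0.6745
        if scale == 0:
            break  # at least half of the points lie exactly on the line
        # Huber weights: 1 inside the band, shrinking for large residuals
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r)
                   for r in residuals]
        new_b = (sum(w * x * y for w, x, y in zip(weights, tc, fc))
                 / sum(w * x * x for w, x in zip(weights, tc)))
        converged = abs(new_b - b) < tol
        b = new_b
        if converged:
            break
    return b
```

Called with one TC value and one FC value per polymorphic locus, the returned slope plays the role of the estimated least component (for example fetal) DNA concentration; loci whose FC estimate is an outlier are down-weighted rather than discarded.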
  • Figure 5 is the detection of monosomy variation in fetal chromosomes using individual allele counts at polymorphic loci.
  • Figure 5a is the use of comprehensive goodness-of-fit test results to detect whether the disomic-disomic karyotype chromosomes in simulated maternal plasma cfDNA samples are fetal monosomy abnormalities.
  • Figure 5b is the use of comprehensive goodness-of-fit test results to detect whether the disomic-monosomic karyotype chromosomes in simulated maternal plasma cfDNA samples are fetal monosomy abnormalities.
  • the y-axis AIC value is the corrected AIC value obtained by dividing the AIC value of the G-test for that locus by the fetal concentration divided by the total count of each allele at that locus.
  • Fig. 6 is the detection of trisomy variation in fetal chromosomes using individual allele counts at polymorphic loci.
  • Figure 6a is the use of comprehensive goodness-of-fit test results to detect whether the disomic-disomic karyotype chromosome is fetal trisomy in simulated maternal plasma cfDNA samples.
  • Figure 6b is the use of comprehensive goodness-of-fit test results to detect whether the disomic-trisomic karyotype chromosomes in simulated maternal plasma cfDNA samples are fetal trisomy abnormalities.
  • Fig. 7 is the estimation of microdeletion variation at the subchromosomal level of the fetus to be tested using the counts of each allele of the polymorphic locus.
  • Figure 7a is the use of comprehensive goodness-of-fit test results to detect whether the monosomic-disomic karyotype chromosomes in simulated maternal plasma cfDNA samples are fetal chromosomal microdeletion abnormalities.
  • Fig. 7b is a partial enlargement of Fig. 7a.
  • Figure 7c is the use of comprehensive goodness-of-fit test results to detect whether the monosomy karyotype chromosomes in simulated maternal plasma cfDNA samples are fetal chromosomal microdeletion abnormalities.
  • Fig. 7d is a partial enlargement of Fig. 7c.
  • Figure 8 is the estimation of microduplication variation at the subchromosomal level of the fetus to be tested using the counts of each allele of the polymorphic locus.
  • Figure 8a is the use of comprehensive goodness-of-fit test results to detect whether trisomic-disomic karyotype chromosomes in simulated maternal plasma cfDNA samples are microduplication abnormalities of fetal chromosomes.
  • Fig. 8b is a partial enlargement of Fig. 8a.
  • Figure 8c is the use of comprehensive goodness-of-fit test results to detect whether trisomy-trisomy karyotype chromosomes in simulated maternal plasma cfDNA samples are microduplication abnormalities of fetal chromosomes.
  • Fig. 8d is a partial enlargement of Fig. 8c.
  • Figure 9 is the use of individual allele counts at polymorphic loci to detect wild mutants in the fetus at the short sequence level.
  • Figure 9a is the genotype of the locus of a short sequence of a simulated maternal heterozygous mutation and a normal fetus using the results of a goodness-of-fit test.
  • Fig. 9b is a partial enlargement of Fig. 9a. The results showed that the estimated genotype of this genetic locus was AB
  • the wild mutant at this locus was determined to be a heterozygous mutant for the mother and the fetus was normal (Aa
  • Figure 9c is the use of goodness-of-fit test results to detect the genotypes of the simulated short sequences whose mother and fetus are both heterozygous mutations and their loci.
  • Fig. 9d is a partial enlargement of Fig. 9c. The results showed that the estimated genotype of this genetic locus was AB
  • allele A is wild type and alleles B and C are mutant types, so it is determined that the wild mutant type of this locus is a heterozygous mutation in both mother and fetus (Aa
  • Figure 10 shows the estimated genotype of the target locus using the relative distribution of allele counts.
  • Figure 10a shows the theoretical distribution of the relative counts of each allele for polymorphic loci on chromosomes with normal disomic-disomic karyotypes.
  • Figure 10b is a distribution of the second largest relative allele count relative to the largest relative allele count for a polymorphic locus on a normal disomic-disomic karyotype chromosome.
  • Figure 11 shows the theoretical distribution of relative counts for each allele at each polymorphic locus on a chromosome with normal maternal karyotype in maternal plasma cfDNA samples.
  • Figure 11a shows the theoretical values of relative counts of all possible genotypes and their respective alleles for each polymorphic locus on a disomic-disomic karyotype or a disomic-monosomic karyotype chromosome.
  • Figure 11b is a theoretical distribution of the second-largest relative allele count relative to the largest relative allele count for each polymorphic locus on the disomic-disomic and disomic-monosomic karyotype chromosomes.
  • Figure 11c shows the theoretical values of relative counts of all possible genotypes and their respective alleles for each polymorphic locus on a disomic-disomic or disomic-trisomic karyotype chromosome.
  • Figure 11d is the theoretical distribution of the relative count of the second or fourth largest allele relative to the largest relative allele count for each polymorphic locus on the disomic-disomic and disomic-trisomic karyotype chromosomes.
  • Figure 12 shows the theoretical distribution of the relative counts of each allele at each polymorphic locus at the subchromosomal level of the target group in maternal plasma cfDNA samples.
  • Figure 12a shows the theoretical relative counts of all possible genotypes and their respective alleles for each polymorphic locus on the chromosome with or without the microdeletion karyotype of the mother or fetus.
  • Figure 12b is a theoretical distribution of the relative count of the second largest allele relative to the relative count of the largest allele for each polymorphic locus on the mother or fetus with or without a microdeletion karyotype.
  • Figure 12c shows the theoretical relative counts of all possible genotypes and their respective alleles for each polymorphic locus on the subchromosomal normal in the mother with or without the microduplication and the fetus.
  • Figure 12d is the theoretical distribution of the relative count of the second or third largest allele relative to the relative count of the largest allele for each polymorphic locus on the normal karyotype subchromosome of the fetus with or without microduplications.
  • Figure 13 shows the theoretical distribution of all possible genotypes and the relative counts of each allele at the tested locus on the normal disomic-disomic karyotype chromosomes in maternal plasma cfDNA samples.
  • Figure 13a is the theoretical value of all possible genotypes and the relative counts of each allele for the tested locus on a normal disomic-disomic karyotype chromosome.
  • Figure 13b is a theoretical distribution diagram of the maximum relative count of non-wild-type alleles relative to the relative count of wild-type alleles for each possible genotype at the tested locus on a normal disomic-disomic karyotype chromosome.
  • Figure 14 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect monosomy variation in fetal chromosomes.
  • Figure 14a is an estimate of the karyotype of normal disomic-disomic chromosomes in simulated maternal plasma cfDNA samples using a relative distribution of allele counts.
  • Figure 14b is an estimate of the karyotype of the disomic-monosomic chromosomes in a simulated maternal plasma cfDNA sample using a relative distribution of allele counts.
  • Figure 15 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect trisomy variation in fetal chromosomes.
  • Figure 15a is an estimation of the karyotype of normal disomic-disomic chromosomes in simulated maternal plasma cfDNA samples using a relative distribution of allele counts.
  • Figure 15b is an estimation of the karyotypes of disomic-trisomic chromosomes in simulated maternal plasma cfDNA samples using the relative distribution of allele counts.
  • Figure 16 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect microdeletion variants at the subchromosomal level of the fetus.
  • Figure 16a is an estimation of the microdeletion karyotype of the monosomic-disomic subchromosome in a simulated maternal plasma cfDNA sample using a relative distribution of allele counts.
  • Figure 16b is an estimation of monosomic-subchromosomal microdeletion karyotypes in simulated maternal plasma cfDNA samples using relative distribution of allele counts.
  • Figure 17 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect microduplication variants at the subchromosomal level of the fetus.
  • Figure 17a is an estimation of microduplication karyotypes of trisomy-disomic subchromosomes in simulated maternal plasma cfDNA samples using relative distribution of allele counts.
  • Figure 17b is an estimation of trisomy-trisomy subchromosomal microduplication karyotypes in simulated maternal plasma cfDNA samples using a relative distribution of allele counts.
  • Figure 18 is a graph of the relative distribution of allele counts at polymorphic loci to detect fetal wild-type mutants at the short-sequence level.
  • Figure 18a is an estimate of wild mutants at ab
  • Figure 18b is an estimate of the wild-type mutant at the Aa
  • Figure 19 shows the detection of karyotypes of target group chromosomes or subchromosomal segments in single genome samples using relative counts of each allele at polymorphic loci. For each polymorphic locus in the target group, plot the relative count of the second-largest allele against the relative count of the largest allele (relative count plot), or plot the relative count of the largest allele against the relative position of the locus on the simulated chromosome (relative count position plot). The karyotype of the target to be tested can be estimated according to the distribution characteristics of the polymorphic sites on the relative count plot or relative count position plot.
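The coordinates used in a relative count plot can be illustrated with a short sketch. It assumes that the relative count of an allele is its count divided by the total count of all alleles at the locus; the function name is hypothetical.

```python
def relative_count_point(allele_counts):
    """Return (largest, second-largest) relative allele counts for one locus."""
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("locus has no reads")
    ranked = sorted(allele_counts, reverse=True)
    largest = ranked[0] / total
    # A locus with a single detected allele has no second allele
    second = ranked[1] / total if len(ranked) > 1 else 0.0
    return largest, second
```

Under this assumption a heterozygous locus in a single genome sample would fall near (0.5, 0.5) and a homozygous locus near (1.0, 0.0); plotting one point per locus gives the relative count plot described for Figure 19.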
  • Example 1 Analysis and calculation of each allele count of each polymorphism locus in pregnant women's plasma DNA samples
  • the sequencing result file (Barrett, Xiong et al. 2017, PLoS One 12: e0186771) was obtained from the NIH SRA database (BioProject ID: PRJNA387652).
  • Sample collection: In this example, Barrett et al. collected 10-20 ml of peripheral blood from each pregnant woman and then extracted plasma DNA (cfDNA) using the QIAamp Circulating Nucleic Acid Kit (Qiagen) according to the manufacturer's protocol. In this example, we analyzed 157 plasma cfDNA samples collected using the methods described above.
  • (3.2) Prepare the positioning index of the polymorphic site: manually divide each amplification product reference sequence into three regions, the 5' region, the variation region and the 3' region, where the variation region is the region of the amplification product reference sequence whose nucleic acid sequence is affected by any allele of the polymorphic site, the 5' region is the nucleic acid sequence, in the 5' to 3' direction, from the start of the reference sequence to the start of the variation region, and the 3' region is the nucleic acid sequence, in the 5' to 3' direction, from the end of the variation region to the end of the reference sequence.
  • the amplification product of the polymorphic site is unique in the reference sequence, and the amplification product can be uniquely located to the specific polymorphic site by using this sequence.
  • Sequencing data analysis: For each sequencing read, first filter out low-quality reads, and then search each filtered read from beginning to end for the mapping index of each polymorphic site. If the mapping index of at least one polymorphic site is found, the read is mapped to that polymorphic site; otherwise the read is discarded. Finally, each read that maps to a specific polymorphic locus is searched from beginning to end for the allele count indexes of that locus. If at least one allele count index is found, one of them is selected and the read is marked as the corresponding allele of the polymorphic site; otherwise the read is discarded.
  • Counting the alleles of each polymorphic locus: for each sample, count the number of sequencing reads assigned to each allele of each polymorphic locus; this number is the count of each allele of that polymorphic locus.
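The two-stage index search and the per-allele tally described above can be sketched as follows. This is a minimal sketch, assuming that reads have already been quality-filtered and that each index is matched as an exact substring; the `loci` dictionary layout and the function name are hypothetical and are not taken from the source.

```python
from collections import Counter

def count_alleles(reads, loci):
    """Count reads per (locus, allele).

    `loci` maps a locus id to
    {"mapping_index": str, "alleles": {allele_name: allele_count_index}}.
    """
    counts = Counter()
    for read in reads:
        # Stage 1: assign the read to the first locus whose mapping index it contains
        locus_id = next((lid for lid, locus in loci.items()
                         if locus["mapping_index"] in read), None)
        if locus_id is None:
            continue  # no mapping index found: discard the read
        # Stage 2: assign the read to the first allele whose count index it contains
        allele = next((name for name, index in loci[locus_id]["alleles"].items()
                       if index in read), None)
        if allele is None:
            continue  # no allele count index found: discard the read
        counts[(locus_id, allele)] += 1
    return counts
```

Exact substring matching keeps the sketch short; it does not tolerate sequencing errors inside an index, so a real pipeline would need to decide how such reads are handled.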
  • the sequencing result file (Kim, Kim et al. 2019, Nat Commun 10:1047) was obtained from the NIH's SRA database (BioProject ID: PRJNA517742).
  • Sample collection: In this example, Kim et al. extracted genomic DNA from two independent blood samples, one of which was used as the primary component and the other as the secondary component. The genomic DNAs of the two samples were mixed in fixed proportions to obtain mixed samples in which the proportion of the minor component was 0.01, 0.02, 0.10 and 0.20, respectively.
  • Sequencing data analysis: For each sequencing read, first filter out low-quality reads, and then search each filtered read from beginning to end for the mapping index of each polymorphic site. If the mapping index of at least one polymorphic site is found, the read is mapped to that polymorphic site; otherwise the read is discarded. Finally, each read that maps to a specific polymorphic locus is searched from beginning to end for the allele count indexes of that locus. If at least one allele count index is found, one of them is selected and the read is marked as the corresponding allele of the polymorphic site; otherwise the read is discarded.
  • Counting the alleles of each polymorphic locus: for each sample, count the number of sequencing reads assigned to each allele of each polymorphic locus; this number is the count of each allele of that polymorphic locus.
  • Simulating polymorphic sites: first randomly generate a unique sequence 70 bp in length and divide it into three regions, the 5' region (30 bp), the variant region (10 bp) and the 3' region (30 bp). Then randomly introduce mutations (including insertions, deletions, point mutations, multiple-site variations and other nucleic acid sequence changes) into the 10 bp sequence of the variant region to obtain at least six different nucleic acid sequences, each at least 60 bp in length and comprising the 5' region, a variant region and the 3' region, and mark them as different alleles of the polymorphic site.
  • according to step (3.2) in Example 1, select at least one unique sequence 12 bp in length as the positioning index of the polymorphic site; according to step (3.3) in Example 1, select at least one unique sequence 12 bp in length containing the variant region as the allele count index for the polymorphic locus.
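The site simulation described above can be sketched as follows. This is an illustrative sketch only: the function name, the fixed random seed, and the restriction to single point mutations, insertions and deletions are assumptions made for the example, and uniqueness of the 70 bp sequence against a reference genome is not checked.

```python
import random

BASES = "ACGT"

def simulate_site(n_alleles=6, seed=0):
    """Return n_alleles distinct allele sequences that share 30 bp flanks."""
    rng = random.Random(seed)
    five_prime = "".join(rng.choice(BASES) for _ in range(30))
    variant = "".join(rng.choice(BASES) for _ in range(10))
    three_prime = "".join(rng.choice(BASES) for _ in range(30))
    variants = {variant}  # the unmutated variant region is one of the alleles
    while len(variants) < n_alleles:
        v = list(variant)
        kind = rng.choice(["point", "insertion", "deletion"])
        pos = rng.randrange(len(v))
        if kind == "point":
            v[pos] = rng.choice([b for b in BASES if b != v[pos]])
        elif kind == "insertion":
            v.insert(pos, rng.choice(BASES))
        else:
            del v[pos]
        variants.add("".join(v))  # duplicates are ignored by the set
    # Sorting makes the output order deterministic for a given seed
    return [five_prime + v + three_prime for v in sorted(variants)]
```

Each returned sequence is 69 to 71 bp long, so all alleles satisfy the minimum length of 60 bp mentioned in the text while differing only inside the variant region.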
  • the genotype of a polymorphic site on a chromosome in the plasma cfDNA of a pregnant woman whose karyotype is disomic-disomic may be AA
  • AA 200 copies of allele A were simulated; for genotype AA
  • simulating a polymorphic site on a chromosome in the plasma cfDNA of a pregnant woman with a karyotype of disomy-monomeric or simulating a polymorphic site in the plasma cfDNA of a pregnant woman with a karyotype of a chromosome fragment of a disomy-monomeric, Its genotype may be or Assuming that the concentration of fetal DNA in the sample is 10% and the simulated normal genome copy number is 200, the fetal genome is 20 copies and the mother's genome is 180 copies.
  • First select a polymorphic site and list its individual allele sequences, labeling them A, B, C, D, E, F, and so on.
  • genotype Simulates 190 copies of allele A; for genotypes Simulate 100 copies of allele A and 90 copies of allele B; for genotypes Simulates 180 copies of allele A and 10 copies of allele B; for genotypes Simulates 90 copies of allele A, 90 copies of allele B, and 10 copies of allele C.
  • the number of alleles at polymorphic sites on other different karyotype chromosomes or chromosome segments and the genome copy number of each allele can be simulated in a similar way.
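The copy-number arithmetic behind these simulations can be sketched as follows. The genotype labels did not survive in the text above, so the way genotypes are passed here (one allele letter per maternal chromosome copy and one per fetal chromosome copy) is an assumption, as is the function name; the defaults follow the stated fetal DNA concentration of 10% and normal genome copy number of 200.

```python
from collections import Counter

def simulated_copies(maternal_alleles, fetal_alleles,
                     fetal_fraction=0.10, genome_copies=200):
    """Copies of each allele in a simulated maternal plasma sample.

    One allele letter per chromosome copy, e.g. maternal "AB" with
    fetal "A" for a disomic mother and a monosomic fetus.
    """
    # A normal disomic genome carries two chromosome copies, so each single
    # chromosome copy contributes half of that individual's genome copies.
    per_maternal = genome_copies * (1 - fetal_fraction) / 2
    per_fetal = genome_copies * fetal_fraction / 2
    copies = Counter()
    for allele in maternal_alleles:
        copies[allele] += per_maternal
    for allele in fetal_alleles:
        copies[allele] += per_fetal
    return {allele: round(n) for allele, n in copies.items()}
```

With the defaults, a disomic mother and a monosomic fetus give 190 copies of A for ("AA", "A") and 100 copies of A plus 90 copies of B for ("AB", "A"), which agrees with the copy numbers listed above; other karyotypes can be explored by changing the number of letters on either side.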
  • each sample simulates a different number of chromosomes according to the experimental purpose, and each chromosome or chromosome segment simulates at least 100 polymorphic loci according to the above step 2, and each locus is based on the genotype.
  • Sequencing data analysis: For each sequencing read, first filter out low-quality reads, and then search each filtered read from beginning to end for the mapping index of each polymorphic site. If the mapping index of at least one polymorphic site is found, the read is mapped to that polymorphic site; otherwise the read is discarded. Finally, each read that maps to a specific polymorphic locus is searched from beginning to end for the allele count indexes of that locus. If at least one allele count index is found, one of them is selected and the read is marked as the corresponding allele of the polymorphic site; otherwise the read is discarded.
  • Counting the alleles of each polymorphic locus: for each sample, count the number of sequencing reads assigned to each allele of each polymorphic locus; this number is the count of each allele of that polymorphic locus.
  • Example 4 Estimating the number of alleles detected in a polymorphic locus above the noise threshold
  • a polymorphic site is selected, and the counts of its alleles are arranged in descending order and labeled R1, R2, R3, ..., Rn, respectively; the total count of the alleles is the sum of the counts of all alleles, labeled TC
  • the noise threshold of the sample is ⁇
  • the allele count is marked as noise
  • the number of alleles at the polymorphic locus that are not marked as noise is the number of alleles at the locus that are above the noise threshold.
  • the following steps are used to estimate the number of alleles detected at the polymorphic locus that are above the noise threshold:
  • C2 is greater than or equal to 0.01 and C3 is less than 0.01, the alleles above the noise threshold for this locus are R1 and R2, and the number of alleles above the noise threshold for this locus is 2.
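A simple version of this noise filter can be sketched as follows. The exact statistic compared with the threshold is not recoverable from the text above, so this sketch assumes that an allele counts as noise when its share of the total count at the locus falls below the threshold; the function name is hypothetical and the default of 0.01 follows the example value.

```python
def alleles_above_noise(allele_counts, threshold=0.01):
    """Return the allele counts, in descending order, that are not noise."""
    ranked = sorted(allele_counts, reverse=True)  # R1 >= R2 >= ... >= Rn
    total = sum(ranked)                           # TC
    if total == 0:
        return []
    # Assumption: an allele is noise when its relative count is below threshold
    return [r for r in ranked if r / total >= threshold]
```

The number of alleles above the noise threshold is then the length of the returned list.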
  • Example 5 Estimate the total count (TC) of each allele in a polymorphic locus
  • the total count (TC) of each allele at a polymorphic locus can be calculated according to any of the following methods:
  • the total count of each allele in the polymorphic locus is the sum of the k largest allele counts, i.e.
  • Example 6 Estimation of the possible genotype of a polymorphic locus in plasma cfDNA samples using allele counting
  • the genotype of each polymorphic locus on a chromosome for which both the mother and the fetus have a normal disomic karyotype in the plasma cfDNA can only be one of 5 genotypes (not considering cases where the mother and/or fetus is of a mosaic genotype and/or the fetus does not inherit the mother's genotype for various reasons).
  • For each polymorphic site, first calculate the number of alleles detected above the noise threshold according to the method described in Example 4, and then estimate the possible genotype of the polymorphic site according to the following steps:
  • Example 7 Estimation of fetal DNA-derived counts (FC) of polymorphic loci in plasma cfDNA samples of biological mothers
  • TC total count
  • FC fetal DNA-derived count
  • FC is estimated as NA
  • the FC is estimated as R1-R2+R3 or 2.0 × R3;
  • the FC is estimated as NA.
  • Example 8 Estimating the concentration of the least component DNA in the mixed sample (f)
  • Figure 1 is a flow chart for estimating the concentration of fetal DNA in biological maternal plasma cfDNA samples as described in Example 8.
  • Example 9 Estimating the expected count of each allele of a polymorphic site according to the sample concentration f of the least component in a mixture of two samples
  • the two samples here refer to maternal cfDNA and fetal cfDNA, respectively, where the least component is the fetal cfDNA component and the largest component is the maternal cfDNA component;
  • the least component refers to the DNA component of the sample present in the smaller proportion, and the largest component is the DNA component of the sample present in the larger proportion;
  • the least component is the fetal cfDNA component and the largest component is the maternal cfDNA component.
  • TC total count
  • the total count of each allele is marked as TC.
  • for a polymorphic site of genotype AA|AA, there are two chromosomal positions derived from the largest component sample (maternal DNA), namely A and A (marked before the vertical line), and two chromosomal positions derived from the least component sample (fetal DNA), namely A and A (marked after the vertical line).
  • Theoretical expected counts for other genotypes can be obtained in a similar manner.
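As a worked illustration of how such expected counts can be derived, the following block writes out one plausible form for a disomic-disomic locus. It is an assumption rather than the source's own formula: each of the two maternal chromosomal positions is taken to contribute a share (1 - f)/2 of the total count TC and each of the two fetal positions a share f/2, with m_a and n_a (symbols introduced here) denoting how many maternal and fetal positions carry allele a.

```latex
% Assumed model for a disomic-disomic locus (m_a, n_a \in \{0, 1, 2\})
E[R_a] = TC \left( \frac{(1-f)\, m_a}{2} + \frac{f\, n_a}{2} \right)

% Genotype AA|AA: m_A = 2, n_A = 2
E[R_A] = TC

% Genotype AB|AA: m_A = 1, m_B = 1, n_A = 2, n_B = 0
E[R_A] = TC \cdot \frac{1+f}{2}, \qquad E[R_B] = TC \cdot \frac{1-f}{2}
```

Under this assumed model the expected counts of all alleles at a locus sum to TC, which gives a quick consistency check for any genotype.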
  • the total count of each allele is marked as TC.
  • Theoretical expected counts for other genotypes can be obtained in a similar manner.
  • the goodness-of-fit test in the above step (3) may be performed using, but is not limited to, Fisher's exact test, the binomial distribution test, the chi-square test or the G-test.
  • the goodness of fit of the G-test can be calculated as:
  • the missing observed allele count is set to a small value, such as 0.1; if the number of expected allele counts is less than the number of observed allele counts, the expected value of the missing position is set to a small value or background noise value, such as 5 or TC ⁇.
  • the observed allele counts are tested for goodness of fit against the theoretical counts of each allele calculated for all possible genotypes at the polymorphic locus.
  • AC are as follows:
  • goodness-of-fit tests can also be performed using the same number of allele counts. Since there may be up to three alleles at this polymorphic locus, the three largest values are retained for both the observed allele counts and the expected allele counts, where missing observed allele counts can be padded with a small value and missing expected allele counts can be padded with a threshold value.
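The padded goodness-of-fit comparison can be sketched with the standard G statistic, G = 2 Σ O_i ln(O_i / E_i). This is a sketch under stated assumptions, not the source's exact procedure: observed and expected counts are paired by descending rank, the padding values 0.1 and 5 follow the examples in the text, and the expected counts are rescaled to the observed total after padding (a choice made here so that G is never negative). The function name and parameter names are hypothetical.

```python
import math

def g_statistic(observed, expected, n_alleles=3, obs_fill=0.1, exp_fill=5.0):
    """G = 2 * sum(O_i * ln(O_i / E_i)) after padding both lists to n_alleles."""
    obs = sorted(observed, reverse=True)[:n_alleles]
    exp = sorted(expected, reverse=True)[:n_alleles]
    # Pad missing positions, and replace zeros so the logarithm is defined
    obs = [o if o > 0 else obs_fill for o in obs] + [obs_fill] * (n_alleles - len(obs))
    exp = [e if e > 0 else exp_fill for e in exp] + [exp_fill] * (n_alleles - len(exp))
    # Assumption: rescale expected counts to the observed total
    scale = sum(obs) / sum(exp)
    exp = [e * scale for e in exp]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp))
```

Evaluating the statistic once per candidate genotype and keeping the genotype with the smallest value is one simple way to use it; pairing by rank ignores allele identity, so it is only appropriate when counts are compared in descending order as in the text.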
  • Example 11 Estimating the probable genotype of a polymorphic locus using the sample concentration f of the least component in the sample mixture and the allele counts of the polymorphic locus
  • Example 12 Estimating the count derived from the least component sample (FC) at a polymorphic locus using the sample concentration f of the least component in a mixture of two independent samples and the individual allele counts and genotype of the polymorphic locus
  • the concentration of the smallest component sample is f
  • the concentration of the largest component sample is 1-f
  • the individual allele counts are labeled R1, R2, R3 and R4
  • the count derived from the least component (FC) at the polymorphic site is then estimated according to the following steps:
  • FC is estimated as NA
  • the FC is estimated as R1-R2+R3 or 2.0 × R3;
  • the FC is estimated as R2+R3 or 2.0 × R2 or 2.0 × R3;
  • the FC is estimated as R3+R4 or 2.0 × R3 or 2.0 × R4;
  • the FC is estimated as NA.
  • Example 13 Estimating the concentration of the least-abundant component in a mixture of two samples from the allele counts of polymorphic loci
  • each polymorphic locus in the plasma DNA of a pregnant woman who has legally received an egg donation may be one of nine genotypes (not considering cases in which the mother and/or fetus has a chromosomal aneuploidy or a chromosomal segment copy number variation, the mother and/or fetus is mosaic, or the fetus has another genotype corresponding to a non-diploid karyotype for various reasons), and the concentration of fetal DNA can be estimated iteratively as described above.
  • each polymorphic locus in the plasma DNA of a biological mother may be one of five genotypes (not considering cases in which the mother and/or fetus has a chromosomal aneuploidy or a chromosomal segment copy number variation, the mother and/or fetus is mosaic, or the fetus does not inherit the mother's genotype for various reasons), and the concentration of fetal DNA can be estimated iteratively as described above.
  • FIG. 2 is a flow chart for estimating fetal DNA concentrations in plasma DNA samples from pregnant women who are legally permitted to receive egg donation as described in Example 13.
  • Example 14 Estimation of fetal DNA concentration using simulated sequencing of polymorphic loci in maternal plasma DNA samples
  • the polymorphic sites on the selected reference genome were marked as Id001-Id005, respectively. It is assumed that the results of each allele count of the 5 polymorphic sites simulated according to Example 3 are shown in Table 1.
  • the reference genome is considered to be a chromosomal region in which both the maternal and fetal karyotypes are normal, so each polymorphic locus theoretically contains up to 3 alleles. Here counts of up to five alleles are shown for each locus (some of these allele counts represent systematic noise from sample processing, sequencing, etc.). It should be understood that multiple alleles may be detected at each polymorphic site, and each allele should be counted separately.
  • Table 2 Ranked individual allele counts for the five hypothetical polymorphic loci.
  • FC fetal DNA
  • TC total counts
  • the number of alleles is estimated as one
  • the genotype is estimated as AA
  • FC NA
  • R2/(R1+R2) = 0.496 > 0.01
  • R3/(R1+R2+R3) = 0.009 < 0.01
  • the genotype was estimated as AB
  • FC NA
  • R2/(R1+R2) = 0.379 > 0.01
  • R3/(R1+R2+R3) = 0.003 < 0.01
  • R2/(R1+R2) = 0.430 > 0.01
  • R3/(R1+R2+R3) = 0.126 > 0.01
  • the genotype is estimated to be AB
  • FC = c(NA, 1154, NA, 2257, 1990)
  • Table 3 Estimation of fetal DNA concentration in samples using different methods.
  • Estimation method: estimated fetal DNA concentration
  • Linear regression (b): 0.2441
  • Robust regression (c): 0.2441
  • Median of ratios (d1): 0.2465
  • Mean of ratios (d3): 0.2450
  • Ratio of medians (d2): 0.2473
  • Ratio of means (d4): 0.2445
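The estimators listed in Table 3 can be sketched in Python as follows. This is a minimal illustration, not the method's actual implementation: the function name is hypothetical, loci whose FC is NA are represented as `None` and skipped, and the linear regression is assumed to pass through the origin (no intercept). Robust regression (c) is not covered here.

```python
from statistics import mean, median


def estimate_concentration(fc, tc):
    """Estimate the least-component concentration f from per-locus FC and TC.

    fc: fetal-derived counts per locus (None where FC is NA)
    tc: total counts per locus
    """
    pairs = [(x, t) for x, t in zip(fc, tc) if x is not None and t]
    xs = [x for x, _ in pairs]
    ts = [t for _, t in pairs]
    ratios = [x / t for x, t in pairs]
    return {
        # least-squares slope of FC on TC with no intercept (assumption)
        "linear_regression": sum(x * t for x, t in pairs) / sum(t * t for t in ts),
        "median_of_ratios": median(ratios),
        "mean_of_ratios": mean(ratios),
        "ratio_of_medians": median(xs) / median(ts),
        "ratio_of_means": mean(xs) / mean(ts),
    }


# Hypothetical counts, chosen so that every informative locus has FC/TC = 0.25.
example = estimate_concentration(
    fc=[None, 250, None, 500, 750],
    tc=[1000, 1000, 1000, 2000, 3000],
)
```

On real data the five estimates differ slightly, as in Table 3; the ratio-based estimators give each locus equal weight, whereas the regression slope gives more weight to loci with larger total counts.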
  • Example 15 Estimating fetal DNA concentration using simulated sequencing of polymorphic loci in plasma cfDNA samples from pregnant women who have legally received an egg donation
  • the polymorphic sites on the selected reference genome were marked as Id001-Id009, respectively. It is assumed that the results of each allele count of the 9 polymorphic sites simulated according to Example 3 are shown in Table 4.
  • in hypothetical plasma cfDNA from a pregnant woman who has legally received an egg donation, the reference genome is considered to be a chromosomal region with a normal disomic karyotype in both mother and fetus, so each polymorphic locus theoretically contains up to 4 alleles. Here counts of up to five alleles are shown for each locus. It should be understood that multiple alleles may be detected at each polymorphic site, and each allele should be counted separately.
  • Table 4 Allele counts for the hypothetical nine polymorphic loci.
  • FC fetal DNA Amplification counts
  • TC total counts
  • FC and TC values were estimated for the above 9 sites, respectively.
  • Step (b) Using the FC and TC values of each polymorphic site, the fetal DNA concentration f is estimated according to the method described in Example 8.
  • Example 16 Estimating the genotype of a locus to be analyzed from the fetal DNA concentration in the maternal plasma DNA sample and the allele counts of that locus
  • if both the mother and the fetus are normal disomic at loci A and B and no large insertion or deletion variants affect these loci, then loci A and B can each only be one of five genotypes, namely AA
  • the most probable genotypes of site A and site B are respectively estimated according to the method described in Example 11 by taking the above allele count results of site A and site B as an example.
  • Table 6 Genotypes of target sites estimated using goodness-of-fit tests.
  • site A has the best goodness-of-fit test results for genotype AA
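The genotype estimation in this example can be sketched as follows: for each candidate genotype, expected allele counts are derived from the fetal concentration f and the total count, and the candidate with the smallest G value is selected. The genotype labels (written maternal|fetal), the expected fractions, the fill values and all function names below are assumptions for illustration, since the genotype list is truncated in the text above.

```python
import math


def expected_fractions(f):
    """Assumed expected allele fractions for a normal disomic mother and fetus."""
    return {
        "AA|AA": [1.0],
        "AA|AB": [1 - f / 2, f / 2],
        "AB|AA": [(1 + f) / 2, (1 - f) / 2],
        "AB|AB": [0.5, 0.5],
        "AB|AC": [0.5, (1 - f) / 2, f / 2],
    }


def g_value(observed, expected, size=3, obs_fill=0.1, exp_fill=5.0):
    """G = 2 * sum(O * ln(O / E)) on ranked counts padded to `size` values."""
    obs = sorted(observed, reverse=True)[:size]
    exp = sorted(expected, reverse=True)[:size]
    obs += [obs_fill] * (size - len(obs))
    exp += [exp_fill] * (size - len(exp))
    return 2.0 * sum(
        max(o, obs_fill) * math.log(max(o, obs_fill) / e) for o, e in zip(obs, exp)
    )


def best_genotype(counts, f):
    """Return the best-fitting genotype label and the G value of every candidate."""
    total = sum(counts)  # TC for this locus
    scores = {
        genotype: g_value(counts, [p * total for p in fractions])
        for genotype, fractions in expected_fractions(f).items()
    }
    return min(scores, key=scores.get), scores
```

For example, with f = 0.2 and hypothetical counts of 600 and 400 reads, the candidate AB|AA has expected counts of exactly 600 and 400 and is therefore selected.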
  • Example 17 Estimating the karyotype of a target region from the concentration f of the least-abundant component in the sample mixture and the allele counts of a set of polymorphic sites in the target region
  • Example 18 Estimating chromosome-level aneuploidy or subchromosomal-level deletion-duplication variants in a region to be analyzed from the fetal DNA concentration f in the maternal plasma DNA sample and the allele counts of a set of polymorphic loci within that chromosomal or subchromosomal region
  • Table 7 Allele counts for a set of polymorphic loci on the target chromosome to be tested in two hypothetical samples.
  • a set of polymorphic loci in the target regions of samples 1 and 2 originate from chromosome 21, and our goal is to detect whether the fetuses in samples 1 and 2 have trisomy 21, that is, whether the karyotype of chromosome 21 in each of these two samples is disomy-disomy (both maternal and fetal chromosome 21 are normal disomic) or disomy-trisomy (a pregnant woman with normal disomy 21 carrying a fetus with trisomy 21).
  • all polymorphic loci can only be one of the following 5 genotypes, namely AA
  • all polymorphic loci can only be one of the following 10 genotypes, namely AA
  • Table 8 Karyotype goodness-of-fit test results for each allele count of each polymorphic site in the target region.
  • in sample 1, the individual allele counts for most polymorphic loci gave a better fit to the disomy-disomy genotypes than to the disomy-trisomy genotypes, so the karyotype of sample 1 is estimated to be disomy-disomy, that is, both mother and fetus are normal disomic.
  • in sample 2, the individual allele counts for all polymorphic loci gave a better fit to the disomy-trisomy genotypes than to the disomy-disomy genotypes, so the karyotype of sample 2 is estimated to be disomy-trisomy, that is, the mother is normal disomic and the fetus has an abnormal trisomy 21.
  • either the karyotype with the best fit for most polymorphic loci can be selected, or the judgment can be based on the G value, the AIC value, the modified G value and/or the modified AIC value.
  • the integrated G value, integrated AIC value, integrated AIC/total count value, and integrated AIC/total count/f value for the disomy-disomy genotypes were all smaller than the corresponding values for the disomy-trisomy genotypes. Therefore, these values, or values derived from them, can also be used to judge how well the allele counts of multiple polymorphic sites fit different karyotypes.
  • when testing for microdeletion-microduplication variants at the subchromosomal level, consideration should be given to the possibility that the mother carries homozygous or heterozygous subchromosomal microdeletions or microduplications, so for each affected polymorphic locus all possible genotypes need to be considered and tested using goodness-of-fit tests. For example, to detect microdeletion variants at the subchromosomal level, it is necessary to test all maternal-fetal genotype combinations in which the mother is a homozygous microdeletion carrier, a heterozygous microdeletion carrier or normal and the fetus is a homozygous microdeletion carrier, a heterozygous microdeletion carrier or normal.
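The two ways of combining per-locus results described in this example (majority vote across loci, or comparison of integrated values) can be sketched as follows. The data layout, the function names, and the definition of the integrated G value as the sum over loci of the best per-locus G value are assumptions made for this illustration.

```python
from collections import Counter


def integrated_fit(per_locus_g):
    """Sum, for each karyotype, the best (smallest) G value at every locus.

    per_locus_g: list of dicts mapping a karyotype name to the G values of
    all candidate genotypes under that karyotype at one locus.
    """
    totals = {}
    for locus in per_locus_g:
        for karyotype, g_values in locus.items():
            totals[karyotype] = totals.get(karyotype, 0.0) + min(g_values)
    return min(totals, key=totals.get), totals


def majority_vote(per_locus_g):
    """Pick the karyotype that gives the best fit at the largest number of loci."""
    votes = Counter(
        min(locus, key=lambda k: min(locus[k])) for locus in per_locus_g
    )
    return votes.most_common(1)[0][0]


# Hypothetical G values for two loci under two candidate karyotypes.
loci = [
    {"disomy-disomy": [1.2, 40.0], "disomy-trisomy": [15.0, 60.0]},
    {"disomy-disomy": [0.8, 33.0], "disomy-trisomy": [9.5, 12.0]},
]
best, totals = integrated_fit(loci)
```

The same structure applies when AIC or corrected AIC values are used in place of G values.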
  • Example 19 Estimating fetal DNA concentration from high-throughput sequencing results of a group of polymorphic loci in maternal plasma DNA samples
  • for each sample in the amplicon sequencing dataset of maternal plasma cfDNA indel markers (Barrett, Xiong et al. 2017, PLoS One 12: e0186771), each allele of each indel marker (polymorphic locus) was counted; then, for each polymorphic locus in each sample, the counts derived from fetal DNA (FC) and the total counts derived from maternal and fetal DNA (TC) were estimated as described in Example 8, and the fetal DNA concentration in each sample was estimated from the FC and TC values of its polymorphic loci.
  • FC fetal DNA-derived counts
  • TC total counts
  • FIG 3 shows the results of the analysis of a maternal plasma cfDNA sample in this dataset.
  • the counts (FC) derived from fetal DNA and the total counts (TC) derived from maternal and fetal DNA for each indel polymorphism locus in the sample are shown as a point in the graph.
  • Robust regression fitting was performed using the FC and TC values of each polymorphic site in the sample and the rlm function in the MASS library of the R software package (fitted model: FC ~ TC + 0), and the concentration of fetal DNA was estimated.
  • the result of the rlm robust regression fit is the straight line in the graph, and the fetal DNA concentration is estimated as the slope of the straight line (model coefficients for TC).
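For readers without R, the idea behind a robust no-intercept fit of FC on TC can be sketched in plain Python. This is not the rlm implementation; it is a simplified iteratively reweighted least squares loop with Huber weights, and the tuning constant 1.345, the MAD-based scale and the function name are assumptions chosen to resemble common defaults.

```python
from statistics import median


def huber_slope_through_origin(tc, fc, k=1.345, iterations=100, tol=1e-10):
    """Robust slope of fc on tc with no intercept, via Huber-weighted IRLS."""
    # Start from the ordinary least-squares slope through the origin.
    slope = sum(t * y for t, y in zip(tc, fc)) / sum(t * t for t in tc)
    for _ in range(iterations):
        residuals = [y - slope * t for t, y in zip(tc, fc)]
        scale = median(abs(r) for r in residuals) / 0.6745  # MAD-based scale
        if scale == 0:
            break  # at least half of the points are fitted exactly
        limit = k * scale
        weights = [1.0 if abs(r) <= limit else limit / abs(r) for r in residuals]
        new_slope = sum(w * t * y for w, t, y in zip(weights, tc, fc)) / sum(
            w * t * t for w, t in zip(weights, tc)
        )
        converged = abs(new_slope - slope) < tol
        slope = new_slope
        if converged:
            break
    return slope
```

Loci whose FC deviates strongly from the fitted line (for example a locus with a misassigned genotype) are down-weighted, so the estimated concentration is less sensitive to such outliers than an ordinary least-squares slope.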
  • Example 20 Estimating the DNA concentration of the least-abundant component in mixed DNA samples from high-throughput sequencing results of a group of polymorphic loci
  • for each sample in the mixed-sample amplicon sequencing dataset (Kim, Kim et al. 2019, Nat Commun 10: 1047), each allele of each polymorphic locus was counted. Then, for each polymorphic locus in each sample, the counts derived from the least-abundant component DNA (FC) and the total counts derived from all DNA (TC) were estimated as described in Example 8, and the concentration of the least-abundant component DNA in each sample was estimated from the FC and TC values of its polymorphic loci.
  • Figure 4a shows the result of the analysis of one mixed DNA sample in this dataset. For each polymorphic locus in the sample, the counts derived from the least-abundant component DNA (FC) and the total counts derived from all DNA (TC) are represented as a point on the graph. Robust regression with rlm (model: FC ~ TC + 0) was performed using the FC and TC values of each polymorphic site, and the concentration of the least-abundant component DNA in the sample was estimated. The result of the rlm robust regression is the straight line fitted on the graph, and the concentration of the least-abundant component DNA is estimated as the slope of the line (model coefficient for TC).
  • Figure 4b is the analysis result of all mixed DNA samples in this dataset.
  • Example 21 Computer simulation of chromosome-level, subchromosomal-level and short-sequence-level variation in plasma DNA samples of pregnant women
  • the simulated chromosome 1 is the reference chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 2 is a disomic-disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 3 is a disomic-monosomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the disomic-monosomic genotypes. Because one fetal chromosome is missing, the total count of all alleles at each polymorphic locus is 200-100f.
  • the ART simulation software (Huang, Li et al. 2012, Bioinformatics 28: 593-594) was used to simulate the high-throughput sequencing results, with the fold parameter of ART set to 50 or 100.
  • to simulate trisomy aneuploidy at the chromosomal level, we simulated maternal plasma DNA samples containing chromosomal trisomies; in each sample, three pairs of chromosomes were simulated for the mother and fetus, numbered 1 (Chr01), 2 (Chr02) and 3 (Chr03).
  • 100 polymorphic sites were simulated on chromosomes 1, 2 and 3 according to the method described in Example 3.
  • Each sample was randomly selected from the following concentrations (0.02, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45) as the simulated fetal DNA concentration.
  • the simulated chromosome 1 is the reference chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 2 is a disomic-disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 3 is a disomic-trisomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the disomic-trisomic genotypes. Because there is an extra fetal chromosome, the total count of all alleles at each polymorphic locus is 200+100f.
  • the ART simulation software was used to simulate the high-throughput sequencing results, where the fold parameter of the ART simulation software was set to 50 or 100.
  • each sample was randomly selected from the following concentrations (0.02, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45) as the simulated fetal DNA concentration.
  • each microdeletion region is treated as a whole chromosome, and the polymorphic loci are selected from the microdeletion region; within a single genome, one normal chromosome together with one chromosome containing the microdeletion is marked as monosomy, and two chromosomes containing the microdeletion are marked as nullisomy.
  • the simulated chromosome 1 is the reference chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 2 is a disomic-disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 3 is a disomic-monosomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the disomic-monosomic genotypes. Because one fetal chromosome contains a microdeletion, the total count of all alleles at each polymorphic locus is 200-100f.
  • the simulated chromosome 4 is a monosomic-disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the monosomic-disomic genotypes. Because one maternal chromosome contains a microdeletion, the total count of all alleles at each polymorphic locus is 100+100f.
  • the simulated chromosome 5 is a monosomic-monosomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the monosomic-monosomic genotypes. Because one maternal chromosome and one fetal chromosome contain microdeletions, the total count of all alleles at each polymorphic locus is 100.
  • the simulated chromosome 6 is a monosomic-nullisomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the monosomic-nullisomic genotypes. Because one maternal chromosome and both fetal chromosomes contain microdeletions, the total count of all alleles at each polymorphic locus is 100-100f.
  • the simulated chromosome 7 is a nullisomic-nullisomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the nullisomic-nullisomic genotypes. Because both maternal chromosomes and both fetal chromosomes contain microdeletions, the total count of all alleles at each polymorphic locus is 0; that is, either no specific amplified sequences are simulated, or only random sequences that cannot be mapped to any chromosome are simulated.
  • the ART simulation software was used to simulate the high-throughput sequencing results, where the fold parameter of the ART simulation software was set to 50 or 100.
  • each sample was randomly selected from the following concentrations (0.02, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45) as the simulated fetal DNA concentration.
  • each microduplication region is regarded as two chromosomes, and the polymorphic loci are selected from the microduplication region; within a single genome, one normal chromosome together with one chromosome containing the microduplication is marked as trisomy, and two chromosomes containing the microduplication are marked as tetrasomy.
  • the simulated chromosome 1 is the reference chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 2 is a disomic-disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 3 is a disomic-trisomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the disomic-trisomic genotypes. Because one fetal chromosome contains a microduplication, the total count of all alleles at each polymorphic locus is 200+100f.
  • the simulated chromosome 4 is a trisomic-disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the trisomic-disomic genotypes. Because one maternal chromosome contains a microduplication, the total count of all alleles at each polymorphic locus is 300-100f.
  • the simulated chromosome 5 is a trisomic-trisomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the trisomic-trisomic genotypes. Because one maternal chromosome and one fetal chromosome contain microduplications, the total count of all alleles at each polymorphic locus is 300.
  • the simulated chromosome 6 is a trisomic-tetrasomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the trisomic-tetrasomic genotypes. Because one maternal chromosome and both fetal chromosomes contain microduplications, the total count of all alleles at each polymorphic locus is 300+100f.
  • the simulated chromosome 7 is a tetrasomic-tetrasomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the tetrasomic-tetrasomic genotypes. Because both maternal chromosomes and both fetal chromosomes contain microduplications, the total count of all alleles at each polymorphic locus is 400.
  • the ART simulation software was used to simulate the high-throughput sequencing results, where the fold parameter of the ART simulation software was set to 50 or 100.
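The total counts quoted for the simulated chromosomes follow one pattern: if a disomic-disomic locus has a total count of 200, each chromosome copy contributes 100 counts, weighted by 1-f for maternal copies and by f for fetal copies. A minimal sketch of this arithmetic is given below; the function name and the `per_copy` parameter are hypothetical and are introduced only for illustration.

```python
def expected_total_count(maternal_copies, fetal_copies, f, per_copy=100.0):
    """Expected total allele count at a locus for given copy numbers.

    maternal_copies, fetal_copies: number of chromosome copies carrying the
    locus in the mother and in the fetus (0 to 4 in the simulations above).
    f: fetal DNA concentration in the plasma sample.
    """
    return per_copy * ((1.0 - f) * maternal_copies + f * fetal_copies)


f = 0.1
totals = {
    "disomic-disomic": expected_total_count(2, 2, f),        # 200
    "disomic-monosomic": expected_total_count(2, 1, f),      # 200 - 100f
    "disomic-trisomic": expected_total_count(2, 3, f),       # 200 + 100f
    "monosomic-disomic": expected_total_count(1, 2, f),      # 100 + 100f
    "monosomic-nullisomic": expected_total_count(1, 0, f),   # 100 - 100f
    "trisomic-disomic": expected_total_count(3, 2, f),       # 300 - 100f
    "trisomic-tetrasomic": expected_total_count(3, 4, f),    # 300 + 100f
    "tetrasomic-tetrasomic": expected_total_count(4, 4, f),  # 400
}
```

Writing the rule this way makes it straightforward to extend the simulation to other maternal-fetal copy-number combinations.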
  • the simulated chromosome 1 is the reference chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic-disomic genotypes, and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 2 was a disomic-disomic chromosome in the sample, and the total count of each allele simulated at each locus was 200.
  • each simulated site can only be one of the following 14 genotypes, namely AA
  • 100 loci to be detected were randomly simulated on chromosome 2, each locus was randomly assigned one of the 14 genotypes, and each of them was then simulated in proportion according to the set fetal DNA
  • the ART simulation software was used to simulate the high-throughput sequencing results, where the fold parameter of the ART simulation software was set to 50 or 100.
  • genomic DNA samples such as preimplantation embryo genomic DNA samples
  • for each sample, 100 polymorphic sites were simulated on chromosomes 1 to 5 according to the method described in Example 3.
  • normal chromosomes are marked as diploid
  • each microdeletion region is considered as a whole chromosome
  • each microduplication region is considered as two chromosomes
  • the polymorphic loci are selected from the microdeletion/microduplication regions.
  • one normal chromosome together with one chromosome containing a microdeletion is marked as monosomy
  • two chromosomes containing microdeletions are marked as nullisomy
  • one normal chromosome together with one chromosome containing a microduplication is marked as trisomy
  • two chromosomes containing microduplications are marked as tetrasomy.
  • the simulated chromosome 1 is a normal disomic chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the normal disomic genotypes (AA or AB), and the total count of all alleles at each polymorphic locus is 200.
  • the simulated chromosome 2 is a nullisomic or homozygous-microdeletion chromosome in the sample, and the genotype of each polymorphic locus is simulated as a nullisomic or homozygous-microdeletion genotype
  • the total count of all alleles at each polymorphic locus is 0, so either no specific amplified sequences are simulated or only random sequences that cannot be mapped to any chromosome are simulated.
  • the simulated chromosome 3 is a monosomy or heterozygous microdeletion chromosome in the sample, and the genotype of each polymorphic locus is simulated as a monosomy or heterozygous microdeletion genotype
  • the total count for each allele at each polymorphic locus is 100.
  • the simulated chromosome 4 is a trisomic or heterozygous-microduplication chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the trisomic or heterozygous-microduplication genotypes (AAA, AAB or ABC), and the total count of all alleles at each polymorphic locus is 300.
  • the simulated chromosome 5 is a tetrasomic or homozygous-microduplication chromosome in the sample; the genotype of each polymorphic locus is simulated as one of the tetrasomic or homozygous-microduplication genotypes (AAAA, AAAB, AABB, AABC or ABCD), and the total count of all alleles at each polymorphic locus is 400.
  • the ART simulation software was used to simulate the high-throughput sequencing results, where the fold parameter of the ART simulation software was set to 50 or 100.
  • Example 22 Detecting fetal chromosomal abnormalities using the fetal DNA concentration in the maternal plasma DNA sample and the allele counts of the loci to be analyzed
  • chromosomes 1, 2 and 3 were, respectively, the reference chromosome, a chromosome of normal disomic-disomic karyotype, and a chromosome of abnormal disomic-monosomic karyotype.
  • Figure 5 shows the detection of monosomy abnormalities in fetal chromosomes in simulated samples using a goodness-of-fit test.
  • Figure 5a shows the use of integrated goodness-of-fit test results to test for fetal monosomy on the chromosome of normal disomic-disomic karyotype in the simulated samples.
  • the y-axis AIC value is the corrected AIC value, obtained by dividing the AIC value of the G-test for that locus by the fetal concentration and by the total allele count at that locus.
  • Figure 5b shows the use of integrated goodness-of-fit test results to test for fetal monosomy on the chromosome of disomic-monosomic karyotype in the simulated samples.
  • the test result is that no chromosomal abnormality is found on fetal chromosome 2 and a chromosomal abnormality is found on fetal chromosome 3.
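The corrected AIC value plotted on the y-axis of these figures can be written as a one-line normalization. The sketch below assumes that the AIC value of the G-test has already been computed for the locus; the function name is hypothetical.

```python
def corrected_aic(aic, f, total_count):
    """Corrected AIC = AIC / total allele count / fetal concentration."""
    if f <= 0 or total_count <= 0:
        raise ValueError("f and total_count must be positive")
    return aic / total_count / f


# Hypothetical locus: AIC of 30 at a total count of 200 and f = 0.1.
value = corrected_aic(30.0, 0.1, 200)
```

Dividing by the total count and by f puts loci with different sequencing depths, and samples with different fetal concentrations, on a comparable scale.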
  • Example 23 Detecting fetal chromosomal trisomy using the fetal DNA concentration in the maternal plasma DNA sample and the allele counts of the loci to be analyzed
  • chromosomes 1, 2 and 3 were, respectively, the reference chromosome, a chromosome of normal disomic-disomic karyotype, and a chromosome of abnormal disomic-trisomic karyotype.
  • Figure 6 shows the detection of trisomy of fetal chromosomes in simulated samples using a goodness-of-fit test.
  • Figure 6a shows the use of integrated goodness-of-fit test results to test for fetal trisomy on the chromosome of normal disomic-disomic karyotype in the simulated samples.
  • the y-axis AIC value is the corrected AIC value, obtained by dividing the AIC value of the G-test for that locus by the fetal concentration and by the total allele count at that locus.
  • Figure 6b shows the use of integrated goodness-of-fit test results to test for fetal trisomy on the chromosome of disomic-trisomic karyotype in the simulated samples.
  • Example 24 Detecting fetal chromosomal microdeletion abnormalities using the fetal DNA concentration in the maternal plasma DNA sample and the allele counts of the loci to be analyzed
  • chromosomes 1 to 7 were, respectively: the reference chromosome; a chromosome that is normal in both mother and fetus (disomic-disomic karyotype); a chromosome that is normal in the mother and carries a microdeletion on one fetal copy (disomic-monosomic karyotype); a chromosome that carries a microdeletion on one maternal copy and is normal in the fetus (monosomic-disomic karyotype); a chromosome that carries a microdeletion on one copy in both mother and fetus (monosomic-monosomic karyotype); a chromosome that carries a microdeletion on one maternal copy and on both fetal copies (monosomic-nullisomic karyotype); and a chromosome that carries a microdeletion on both copies in both mother and fetus (nullisomic-nullisomic karyotype).
  • Figure 7 shows the detection of microdeletion abnormalities in fetal chromosomes in simulated samples using a goodness-of-fit test.
  • Figure 7a shows the use of integrated goodness-of-fit test results to test for fetal chromosomal microdeletion on the chromosome of monosomic-disomic karyotype (the mother carries a heterozygous microdeletion and the fetus is normal) in the simulated samples.
  • the y-axis AIC value is the corrected AIC value, obtained by dividing the AIC value of the G-test for that locus by the fetal concentration and by the total allele count at that locus.
  • Fig. 7b is a partial enlargement of Fig. 7a.
  • Figure 7c shows the use of integrated goodness-of-fit test results to test for fetal chromosomal microdeletion on the chromosome of monosomic-monosomic karyotype (both mother and fetus carry heterozygous microdeletions) in the simulated samples.
  • Fig. 7d is a partial enlargement of Fig. 7c.
  • Fig. 7a and Fig. 7b show that no microdeletion abnormality is found on this fetal chromosome, whereas Fig. 7c and Fig. 7d show that a microdeletion abnormality is found on this fetal chromosome.
  • Example 25 Detecting fetal chromosomal microduplication abnormalities using the fetal DNA concentration in the maternal plasma DNA sample and the allele counts of the loci to be analyzed
  • chromosomes 1 to 7 were, respectively: the reference chromosome; a chromosome that is normal in both mother and fetus (disomic-disomic karyotype); a chromosome that is normal in the mother and carries a microduplication on one fetal copy (disomic-trisomic karyotype); a chromosome that carries a microduplication on one maternal copy and is normal in the fetus (trisomic-disomic karyotype); a chromosome that carries a microduplication on one copy in both mother and fetus (trisomic-trisomic karyotype); a chromosome that carries a microduplication on one maternal copy and on both fetal copies (trisomic-tetrasomic karyotype); and a chromosome that carries a microduplication on both copies in both mother and fetus (tetrasomic-tetrasomic karyotype).
  • Figure 8 shows the detection of microduplication abnormalities in fetal chromosomes in simulated samples using a goodness-of-fit test.
  • Figure 8a shows the use of integrated goodness-of-fit test results to test for fetal chromosomal microduplication on the chromosome of trisomic-disomic karyotype (the mother carries a heterozygous microduplication and the fetus is normal) in the simulated samples.
  • the y-axis AIC value is the corrected AIC value, obtained by dividing the AIC value of the G-test for that locus by the fetal concentration and by the total allele count at that locus.
  • Fig. 8b is a partial enlargement of Fig. 8a.
  • Figure 8c shows the use of integrated goodness-of-fit test results to test for fetal chromosomal microduplication on the chromosome of trisomic-trisomic karyotype (both mother and fetus carry heterozygous microduplications) in the simulated samples.
  • Fig. 8d is a partial enlargement of Fig. 8c.
  • Fig. 8a and Fig. 8b show that no microduplication abnormality is found on this fetal chromosome, whereas Fig. 8c and Fig. 8d show that a microduplication abnormality is found on this fetal chromosome.
  • Example 26 Detecting the wild-type/mutant genotype of a locus to be analyzed using the fetal DNA concentration in the maternal plasma DNA sample and the allele counts of that locus
  • each polymorphic site on chromosome 1 is selected from a different chromosomal region, while the multiple polymorphic sites on chromosome 2 are all selected from the same specific site and represent the results of independent amplifications performed with the same and/or different primers; that is, the simulated polymorphic loci on chromosome 2 represent independent replicates of a particular locus.
  • (c) For each wild-type/mutant genotype, estimate the theoretical count of each allele based on the fetal DNA concentration f in the sample estimated in step (a) above.
  • (d) For each wild-type/mutant genotype, determine the actual count of each allele from its nucleic acid sequence.
  • (e) Perform goodness-of-fit tests for each wild-type/mutant genotype on the independent replicates of each locus.
  • (f) Comprehensively analyze the goodness-of-fit test results, and select the wild-type/mutant genotype with the best overall fit across all replicate sites as the estimated genotype for that specific site.
  • (g) Determine the wild-type/mutant status of each allele of the mother and/or fetus based on the estimated wild-type/mutant genotype.
  • the genotype of each replicate site to be detected is first estimated according to the method described in Example 11 without considering whether the sequence of each allele is a wild-type sequence; then, according to whether the sequence of each allele is the normal wild-type sequence, it is determined whether the locus carries a variant in the mother and the fetus.
  • Figure 9 shows the detection of wild-type/mutant genotypes of fetal short sequence loci in simulated samples using a goodness-of-fit test.
  • Figure 9a shows the genotype detection, using goodness-of-fit test results, of a simulated short sequence locus at which the mother carries a heterozygous mutation and the fetus is normal (different dots represent different independent replicates of the target locus to be determined).
  • the y-axis AIC value is the corrected AIC value, obtained by dividing the AIC value of the G-test for that locus by the fetal concentration and by the total allele count at that locus.
  • Fig. 9b is a partial enlargement of Fig. 9a.
  • Figure 9c shows goodness-of-fit test results used to determine the genotype of a short-sequence locus in a simulation where both the mother and the fetus carry heterozygous mutations.
  • Fig. 9d is a partial enlargement of Fig. 9c. The results showed that both mother and fetus had heterozygous genotypes (AB
  • allele A was wild-type and alleles B and C were mutants, so it was determined that both the mother and the fetus carry heterozygous mutations at this locus, and that the fetus either acquired a new mutation or inherited a paternally derived allelic mutation.
  • Example 27 Estimating the genotype at the locus to be analyzed using the concentration f of the least-abundant component in the sample mixture and a relative distribution map of the allele counts at that locus
  • Figure 10 shows the theoretical distribution of the polymorphic loci derived from normal karyotype chromosomes on the relative allele distribution map in maternal plasma DNA samples.
  • Figure 10a shows the theoretical values of all possible genotypes and the relative counts of each allele for polymorphic loci on a normal disomic-disomic karyotype chromosome.
  • Figure 10b is the distribution of the second largest relative allele count (RR2) relative to the largest relative allele count (RR1) for each polymorphic locus on a normal disomic-disomic karyotype chromosome. The results show that polymorphic loci fall at different positions on the relative distribution map of allele counts depending on their genotypes, so the genotype of a locus can be inferred from its position on the map.
  • Example 28 Estimating the karyotype of the target to be tested using the concentration f of the least-abundant component in the sample mixture and a relative distribution map of allele counts for a set of polymorphic sites in the target region
  • Figure 11 shows the theoretical distribution, on the relative allele distribution map, of each polymorphic locus on a chromosome that is normal in the mother but aneuploid in the fetus, in maternal plasma DNA samples.
  • Figure 11a shows the theoretical values of all possible genotypes and the relative counts of their respective alleles for polymorphic loci on chromosomes of the disomic-disomic and disomic-monosomic karyotypes.
  • Figure 11b shows the theoretical distribution of the second largest relative allele count (RR2) relative to the largest relative allele count (RR1).
  • Figure 11c shows the theoretical values of all possible genotypes and the relative counts of each allele for each polymorphic locus on the disomic-disomic and disomic-trisomic karyotype chromosomes.
  • Figure 11d is the theoretical distribution of the second or fourth largest relative allele count (RR2 or RR4) relative to the largest relative allele count (RR1) for each polymorphic locus on the disomic-disomic and disomic-trisomic karyotype chromosomes.
  • Figure 12 shows the theoretical distribution, on the relative allele distribution map, of each polymorphic site on a subchromosomal region carrying a maternal or fetal microdeletion or microduplication in a maternal plasma DNA sample.
  • Figure 12a shows the theoretical values of all possible genotypes and the relative counts of each allele for a polymorphic locus on a chromosome of a mother or fetus with a microdeletion karyotype.
  • Figure 12b is a theoretical distribution of the second largest relative allele count (RR2) relative to the largest relative allele count (RR1) for each polymorphic locus on a mother or fetus with a microdeletion karyotype.
  • Figure 12c shows the theoretical values of all possible genotypes and the relative counts of each allele for each polymorphic locus on a subchromosomal region where the mother carries a microduplication and the fetus is normal.
  • Figure 12d is the theoretical distribution of the second or third largest relative allele count (RR2 or RR3) relative to the largest relative allele count (RR1) for each polymorphic locus on a subchromosomal region where the mother carries a microduplication and the fetus is normal.
  • Example 29 Estimating the wild-type/mutant status at the locus to be analyzed using the concentration f of the least-abundant component in the sample mixture and the relative counts of the wild-type allele and each non-wild-type allele at that locus
  • Figure 13 is a graph showing the relative distribution of allele counts for all possible genotypes of the tested locus on a normal disomic-disomic chromosome in a maternal plasma DNA sample.
  • Figure 13a shows the theoretical values of all possible genotypes and the relative counts of each allele at the tested locus on a normal disomic-disomic chromosome.
  • Figure 13b is a theoretical distribution diagram of the maximum non-wild-type relative allele count (RR2) relative to the wild-type relative allele count (RR1) for the tested locus on a normal disomic-disomic chromosome.
  • RR2 maximum non-wild-type relative allele count
  • RR1 wild-type relative allele count
  • Example 30 Detecting fetal chromosomal monosomy abnormalities using the fetal DNA concentration in the maternal plasma DNA sample and the distribution map of allele counts at the loci to be analyzed
  • chromosomes 1, 2 and 3 were a reference chromosome, a chromosome of normal disomic-disomic karyotype, and a chromosome of abnormal disomic-monosomic karyotype, respectively.
  • To detect whether the fetus has a chromosomal abnormality on chromosome 2 or 3, we need to test whether chromosome 2 or 3 has a normal disomic-disomic karyotype (both mother and fetus are disomic) or an abnormal disomic-monosomic karyotype (the mother is normally disomic and the fetus is abnormally monosomic). We therefore first marked the theoretical positions of all disomic-disomic and disomic-monosomic genotypes on the relative distribution map of allele counts, and then determined the karyotype of the chromosome from where the allele counts of each polymorphic locus on the chromosome to be analyzed fall on that map.
  • Figure 14 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect monosomy variation in fetal chromosomes.
  • Figure 14a is a plot of relative allele counts for all polymorphic loci on a simulated normal disomic-disomic chromosome.
  • Figure 14b is a plot of relative allele counts for all polymorphic loci on a simulated disomic-monosomic chromosome. The results show that almost all the relative counts of polymorphic loci in Fig. 14a are distributed around the corresponding disomic-disomic genotype clusters, and almost none are distributed around the corresponding disomic-monosomic genotype clusters. In Fig.
  • the karyotype of the chromosome to be analyzed in Figure 14a is disomic-disomic, that is, the fetal chromosome is normal; and the karyotype of the chromosome to be analyzed in Figure 14b is disomic-monosomic, that is, the fetal chromosome is an abnormal monosomy.
  • Example 31 Detecting fetal chromosomal trisomies using the fetal DNA concentration in the maternal plasma DNA sample and the distribution map of allele counts at the loci to be analyzed
  • chromosomes 1, 2 and 3 were a reference chromosome, a chromosome of normal disomic-disomic karyotype, and a chromosome of abnormal disomic-trisomic karyotype, respectively.
  • chromosome 2 or 3 has a normal disomic-disomic karyotype (both mother and fetus are disomic) or an abnormal disomic-trisomic karyotype (normal disomy in the mother and abnormal trisomy in the fetus).
  • Figure 15 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect trisomy variation in fetal chromosomes.
  • Figure 15a is a plot of relative allele counts for all polymorphic loci on a simulated normal disomic-disomic chromosome.
  • Figure 15b is a plot of relative allele counts for all polymorphic loci on simulated disomic-trisomic chromosomes. The results show that almost all the relative counts of polymorphic loci in Fig. 15a are distributed around the corresponding disomic-disomic genotype clusters, and almost none are distributed around the corresponding disomic-trisomic genotype clusters. In Fig.
  • the karyotype of the chromosome to be analyzed in Fig. 15a is disomic-disomic, that is, the fetal chromosome is normal; while the karyotype of the chromosome to be analyzed in Fig. 15b is disomic-trisomic, that is, the fetal chromosome is an abnormal trisomy.
  • Example 32 Detecting fetal chromosomal microdeletion abnormalities using the fetal DNA concentration in the maternal plasma DNA sample and the distribution map of allele counts at the loci to be analyzed
  • chromosomes 1 to 7 were, respectively, a reference chromosome; a chromosome normal in both mother and fetus (normal disomic-disomic karyotype);
  • a chromosome on which the mother is normal and one fetal chromosome carries a microdeletion (disomic-monosomic karyotype); a chromosome on which one maternal chromosome carries a microdeletion and the fetus is normal (monosomic-disomic karyotype); a chromosome on which the mother and the fetus each carry a microdeletion on one chromosome (monosomic-monosomic karyotype); a chromosome on which the mother carries a microdeletion on one chromosome and the fetus on both (monosomic-nullisomic karyotype); and
  • a chromosome on which both chromosomes of both mother and fetus carry the microdeletion (nullisomic-nullisomic karyotype).
  • a fetus has a chromosomal microdeletion abnormality
  • Figure 16 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect microdeletion variants in fetal chromosomes.
  • Figure 16a is a plot of relative allele counts for all polymorphic loci on a simulated monosomic-disomic chromosome.
  • Figure 16b is a plot of relative allele counts for all polymorphic loci on a simulated monosomic-monosomic chromosome. The results show that almost all the relative counts of polymorphic loci in Figure 16a are distributed around the corresponding monosomic-disomic genotype clusters, and almost none around the genotype clusters of other karyotypes.
  • the karyotype of the chromosome to be analyzed in Fig. 16a is monosomic-disomic, that is, the fetal chromosome is normal and does not contain a microdeletion; while the karyotype of the chromosome to be analyzed in Fig. 16b is monosomic-monosomic, that is, the fetal chromosome carries a microdeletion variant.
  • Example 33 Detecting fetal chromosomal microduplication abnormalities using the fetal DNA concentration in the maternal plasma DNA sample and the distribution map of allele counts at the loci to be analyzed
  • chromosomes 1 to 7 were, respectively, a reference chromosome; a chromosome normal in both mother and fetus (normal disomic-disomic karyotype);
  • a chromosome on which the mother is normal and one fetal chromosome carries a microduplication (disomic-trisomic karyotype); a chromosome on which one maternal chromosome carries a microduplication and the fetus is normal (trisomic-disomic karyotype); a chromosome on which the mother and the fetus each carry a microduplication on one chromosome (trisomic-trisomic karyotype); a chromosome on which the mother carries a microduplication on one chromosome and the fetus on both (trisomic-tetrasomic karyotype); and a chromosome on which both chromosomes of both mother and fetus carry the microduplication (tetrasomic-tetrasomic karyotype).
  • To analyze the sequencing data of the simulated samples, the allele counts of each polymorphic site on reference chromosome 1 are first used to estimate the fetal DNA concentration f in the sample according to the method described in Example 8; the karyotype of each of chromosomes 2 to 7 is then estimated according to the method described in Example 28 from the fetal DNA concentration f and the allele counts of each polymorphic locus on chromosomes 2 to 7.
  • a fetal chromosome has a chromosomal microduplication abnormality
  • the karyotype of the chromosome is determined from where the allele counts fall on the relative distribution map. Because the total number of genotypes that may contain microduplications in the maternal and fetal chromosomes reaches dozens or hundreds, labeling all of these genotypes on the relative distribution map of allele counts for each polymorphic locus is impractical.
  • Figure 17 is a graph of the relative distribution of the counts of each allele at a polymorphic locus to detect microduplication variants in fetal chromosomes.
  • Figure 17a is a plot of relative allele counts for all polymorphic loci on a simulated trisomic-disomic chromosome.
  • Figure 17b is a plot of relative allele counts for all polymorphic loci on a simulated trisomic-trisomic chromosome. The results show that almost all the relative counts of polymorphic loci in Figure 17a are distributed around the genotype clusters corresponding to a normal fetus. In Figure 17b, however, the relative counts of the polymorphic loci form clear clusters, but not around the genotype clusters corresponding to a normal fetus.
  • the fetal chromosome is normal and does not contain a microduplication; while in the chromosome to be analyzed in Figure 17b, at least one of the fetal chromosomes contains a microduplication variation, or the chromosome carries another type of variation.
  • Example 34 Detecting the wild-type/mutant status of the locus to be analyzed using the fetal DNA concentration in the maternal plasma DNA sample and the distribution map of allele counts at that locus
  • each polymorphic site on chromosome 1 is selected from a different chromosomal region, while the multiple polymorphic sites on chromosome 2 are the same specific site amplified independently with the same and/or different primers; that is, the simulated polymorphic loci on chromosome 2 represent independent replicates of one particular locus.
  • each locus needs to consider all possible genotypes of the fetus and the mother (the wild-type allele is marked with the capital letter A, and variants are marked with lowercase letters a-c in descending order of allele count), including the case where all four gene copies of mother and fetus are non-wild-type variants (aa
  • Figure 18 is a graph of relative distribution of allele counts at polymorphic loci to detect fetal variation at the short-sequence level.
  • Figure 18a is a plot of relative allele counts for polymorphic loci in simulated ab
  • Figure 18b is a plot of relative allele counts at polymorphic loci in the simulated Aa
  • the genotype of the polymorphic locus is estimated to be Aa
  • Example 35 Detecting genetic variation in a single-genome sample using the allele counts and relative distribution map of the loci to be analyzed
  • chromosomes 1 to 5 are disomic, nullisomic (or homozygous microdeletion), monosomic (or heterozygous microdeletion), trisomic (or heterozygous microduplication), and tetrasomic (or homozygous microduplication), respectively.
  • In order to detect whether a single-genome sample has chromosomal or subchromosomal variation, the following five situations need to be considered: (1) both chromosomes are missing (nullisomy) or both chromosomes carry a microdeletion in the same region (homozygous microdeletion); (2) one chromosome is normal and the other is missing (monosomy) or carries a microdeletion (heterozygous microdeletion); (3) both chromosomes are normal; (4) three chromosomes (trisomy) or a microduplication in one chromosome (heterozygous microduplication); (5) four chromosomes (tetrasomy) or microduplications in the same region of both chromosomes (homozygous microduplication).
  • Figure 19 shows the detection of the karyotype of a target chromosome or subchromosome in a single-genome sample using the relative counts of each allele at polymorphic loci. For each polymorphic locus in the target region (chromosomal or subchromosomal region), the relative count of the second largest allele is plotted against the relative count of the largest allele (relative count map A), or the relative count of the largest allele is plotted against the relative position of the locus on the simulated chromosome (relative count position map B).
  • results show that the genotypes of different karyotype chromosomes have different characteristic distributions on the relative count map A or relative count position map B, and the karyotype (variation type) of the target chromosome or subchromosome can be detected according to these characteristic distributions.
  • each group member may be referred to or claimed alone or in any combination with other members of the group or other elements found herein.
  • One or more members of a group may be included in or deleted from a group for reasons of convenience and/or patentability.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Ecology (AREA)
  • Physiology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)

Abstract

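A method for non-invasive detection of fetal genetic variation is provided. First, polymorphic sites on a reference genome are subjected to targeted sequencing and the allele copies at each polymorphic site are counted, in order to estimate the percentage of fetal genetic material in a maternal plasma sample. Allele copies are then counted at polymorphic sites or target sites to be tested on the target genome, and a goodness-of-fit test or a relative allele-count distribution plot is used to determine whether the target under test in the sample carries a variation at the chromosomal level, the subchromosomal level, or the level of a single genetic locus. The method is suitable for simultaneously detecting chromosomal aneuploidy, subchromosomal microdeletion/microduplication variation, and short-sequence variation in maternal plasma samples, and has good prospects for development and application.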

Description

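A method for detecting fetal genetic variation by sequencing polymorphic sites and target sites
Technical Field
The present invention relates to the field of genetic variation detection, in particular to aneuploidy at the chromosomal level, microdeletion/microduplication variation at the subchromosomal level, and insertions/deletions and single-nucleotide variation at the short-sequence level.
Background Art
In 1997, cell-free DNA of fetal origin was found in the plasma of pregnant women (Lo, Corbetta et al. 1997, Lancet 350:485-487). Building on this discovery and on massively parallel sequencing, several research groups have developed methods based on sequencing maternal plasma cell-free DNA (cfDNA) to detect chromosomal aneuploidy, subchromosomal microdeletion/microduplication variation, or short insertions/deletions and single-nucleotide variation at the single-gene level (Advani, Barrett et al. 2017, Prenat Diagn 37:1067-1075; Breveglieri, D'Aversa et al. 2019, Mol Diagn Ther 23:291-299; Andari, Bussamra et al. 2020, Ceska Gynekol 85:41-48; Guseh 2020, Hum Genet 139:1141-1148).
At present, detection of chromosomal aneuploidy by next-generation sequencing is recognized and commercialized in many countries because of its high sensitivity and specificity (Chiu, Chan et al. 2008, Proc Natl Acad Sci U S A 105:20458-20463; Fan, Blumenfeld et al. 2008, Proc Natl Acad Sci U S A 105:16266-16271; Liao, Chan et al. 2012, PLoS One 7:e38154; Zimmermann, Hill et al. 2012, Prenat Diagn 32:1233-1241). For subchromosomal microdeletion/microduplication variation, however, the sensitivity and specificity of non-invasive detection methods are still limited, especially for small microdeletions/microduplications (Advani, Barrett et al. 2017, Prenat Diagn 37:1067-1075; Hu, Wang et al. 2019, Human Genomics 13:14; Srebniak, Knapen et al. 2020, Mol Genet Genomic Med 8:e1062). Although a variety of next-generation-sequencing-based non-invasive tests for monogenic disorders have been developed (Lun, Tsui et al. 2008, Proc Natl Acad Sci U S A 105:19920-19925; Lo, Chan et al. 2010, Sci Transl Med 2:61ra91; Lv, Wei et al. 2015, Clinical Chemistry 61:172-181; Vermeulen, Geeven et al. 2017, Am J Hum Genet 101:326-339; Allen, Young et al. 2018, Noninvasive Prenatal Testing (NIPT) 157-177; Yin, Du et al. 2018, J Hum Genet 63:1129-1137; Cutts, Vavoulis et al. 2019, Blood 134:1190-1193; Zhang, Li et al. 2019, Nat Med 25:439-447), these methods have not been widely adopted in clinical practice, mainly because they differ from the methods used to detect chromosomal or subchromosomal variation and therefore cannot be used to detect chromosomal and subchromosomal variation at the same time. In addition, the cost of testing for each monogenic disorder with these methods is high, which makes screening for low-prevalence monogenic disorders with them poorly cost-effective.
Therefore, a general method that uses maternal plasma DNA to simultaneously detect fetal variation at the chromosomal level, the subchromosomal level, and the single-gene short-sequence level would greatly benefit non-invasive detection of fetal genetic variation.
Summary of the Invention
The object of the present invention is to provide a method for simultaneously detecting chromosomal aneuploidy disorders, subchromosomal microdeletion/microduplication disorders, and monogenic disorders caused by short-sequence variation.
To achieve this object, the present invention provides a method for screening genetic variation based on high-throughput sequencing, comprising obtaining a test sample and extracting DNA, selectively amplifying target loci, performing high-throughput sequencing of the target loci, and analyzing the sequencing data to obtain the test result.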
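The present invention provides a method for detecting genetic variation, comprising the following steps:
(1) receiving a biological sample to be tested and preparing nucleic acid;
(2) enriching or amplifying target DNA loci, wherein at least one target DNA locus has more than one allele in the sample;
(3) sequencing the amplified target DNA loci;
(4) for each target DNA locus, counting each of its alleles;
(5) determining the karyotype, genotype, or wild-type/mutant status of the target to be tested in the sample using a goodness-of-fit test on the allele counts of the target DNA loci and/or a relative allele-count distribution plot.
By amplifying and sequencing specific target DNA loci, the present invention provides detection of chromosomal aneuploidy, subchromosomal microdeletion/microduplication, and short-sequence-level variation in mixed samples, wherein at least one of said specific target DNA loci has more than one allele in the sample.
A target DNA locus as used herein refers to a specific DNA sequence whose bases may vary between individuals and which can be amplified by techniques such as PCR or multiplex PCR, or enriched by techniques such as nucleic acid hybridization. In the present invention, the terms "target DNA sequence" and "target DNA locus" are used interchangeably, and the term "locus", when referring to a target, does not limit the length of the target; that is, the target may range in length from a single nucleotide to an entire chromosome.
In another aspect, by amplifying and sequencing specific DNA loci (target loci), the present invention provides detection of chromosomal aneuploidy and subchromosomal microdeletion/microduplication in single-genome samples, wherein at least one of said specific target DNA loci has more than one allele in the sample.
The biological sample of the present invention comprises fetal and maternal nucleic acids from a biological sample of a pregnant woman (such as cell-free DNA in maternal plasma), or is derived from a single-genome sample (such as embryonic nucleic acid from preimplantation diagnosis).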
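In the present invention, the target DNA loci may be enriched or amplified by any method known in the art, including but not limited to PCR, multiplex PCR, whole-genome amplification (WGA), multiple displacement amplification (MDA), rolling circle amplification (RCA), circular amplification (RCR), and hybridization capture. Among the enriched or amplified target DNA loci, some originate from regions of one or more chromosomes presumed to be normally euploid, and some originate from regions of one or more chromosomes under test that are suspected of carrying chromosomal, subchromosomal, or short-sequence-level variation. A chromosome, region, or locus presumed to be normally euploid is additionally designated herein as a "reference chromosome, reference region, reference sequence, or reference locus", and a chromosome, region, or locus whose genetic variation status is to be tested is additionally designated herein as a "target chromosome, target region, target sequence, or target locus". In the present invention, a set consisting of at least one reference chromosome, reference region, reference sequence, or reference locus is called the reference group, and a set consisting of at least one target chromosome, target region, target sequence, or target locus is called the target group.
In the present invention, counting each allele of each target DNA locus means that each amplified sequence is first mapped to a chromosomal or genomic position, and the number of sequences mapped to each chromosomal or genomic region is then counted. If a chromosomal or genomic region has different alleles, the number of sequences mapped to each allele of that region is counted as well. Various computational methods can be used to map sequence reads to chromosomal or genomic positions/regions. Non-limiting examples of computer algorithms that can be used to map sequences include specific-sequence search, BLAST, BLITZ, FASTA, BOWTIE, BOWTIE 2, BWA, NOVOALIGN, GEM, ZOOM, ELAN, MAQ, MATCH, SOAP, STAR, SEGEMEHL, MOSAIK, or SEQMAP, or variants or combinations thereof.
In the present invention, for ease of understanding, a subchromosomal microdeleted segment is treated as one chromosome, and a subchromosomal microduplicated segment is treated as two chromosomes. Thus, for a single-genome sample, a chromosome with a heterozygous subchromosomal microdeletion is labeled monosomic, one with a homozygous microdeletion is labeled nullisomic, one with a heterozygous microduplication is labeled trisomic, and one with a homozygous microduplication is labeled tetrasomic. Correspondingly, in a mixed sample such as a maternal plasma sample, a chromosome that is normal in both mother and fetus is labeled disomic-disomic, a chromosome that is normal in the mother but carries a microdeletion on one fetal chromosome is labeled disomic-monosomic, and a chromosome that is normal in the mother but carries a microduplication on one fetal chromosome is labeled disomic-trisomic. Chromosomes and/or chromosome segments involving chromosomal or subchromosomal variation are labeled according to similar principles throughout the present invention.
In the present invention, a subchromosomal microdeletion/microduplication refers to a chromosomal aberration in which the deleted or gained segment is not very long and is difficult to detect by conventional cytogenetic analysis. Chromosomal microdeletion/microduplication syndromes are another major class of birth defects besides chromosomal aneuploidy. In some parts of the present invention, copy number variation of a chromosome segment is also used to refer to chromosomal microdeletion/microduplication variation.
In the present invention, karyotype refers to variation at the chromosomal or subchromosomal level, and genotype refers to variation at the short-sequence level. For example, for a maternal plasma sample, if the mother's chromosome 21 is normally disomic and the fetus is trisomic for that chromosome, the karyotype of chromosome 21 in the sample is labeled disomic-trisomic. If one of the mother's chromosomes 22 carries a 22q11 microdeletion and the other does not, and one of the fetus's chromosomes 22 carries a 22q11 microdeletion and the other does not, the karyotype of the 22q11 segment in the sample is labeled monosomic-monosomic. If one of the mother's chromosomes 22 carries a 22q11 microduplication and the other does not, and one of the fetus's chromosomes 22 carries a 22q11 microduplication and the other does not, the karyotype of the 22q11 segment in the sample is labeled trisomic-trisomic. If the mother's alleles at amino acid 6 of the hemoglobin β subunit are A and S, and the fetus's alleles at that position are S and C, the genotype at amino acid 6 of the hemoglobin β subunit in the sample is labeled AS|SC, where the part before the vertical bar is the mother's genotype and the part after it is the fetus's genotype. In the present invention, wild-type refers to the genotype observed at the highest frequency at the target locus in a normal population without a disease phenotype. In another aspect, wild-type refers to a genotype in which the target locus carries no pathogenic or likely pathogenic variant. In the present invention, mutant refers to a genotype at the target locus that differs from the wild-type.
In the present invention, for some samples to be tested, the concentration of the minor-component DNA in the sample is estimated from the allele counts of the target loci in the reference group. The concentration of the minor-component DNA in the sample may be estimated by any method reported to date. Preferably, it is estimated by the relative-proportion method applied to the allele counts of the target loci in the reference group; preferably, it is estimated by iteratively fitting genotypes to the allele counts of the target loci in the reference group; preferably, it is calculated from the mean and/or median of the minor-component count (FC) and the total count (TC).
In the present invention, the concentration of the minor-component DNA in a sample is calculated by the relative-proportion method applied to allele counts. For a maternal plasma DNA sample, for example, the minor-component DNA is fetal DNA and the major-component DNA is maternal DNA. In a normal maternal plasma DNA sample, the fetus inherits one chromosome of each pair from the mother, so the genotype of each target locus can only be one of the following five: AA|AA, AA|AB, AB|AA, AB|AB, or AB|AC, where A, B, and C denote the alleles of the target locus. If the target locus has genotype AA|AA or AB|AB, the fetal DNA concentration does not affect the relative counts of its alleles, whereas if the target locus has genotype AA|AB, AB|AA, or AB|AC, its allele counts depend on the fetal DNA concentration. Genotypes AA|AB, AB|AA, and AB|AC can therefore be used to estimate the count derived from fetal DNA (FC) at each target DNA locus.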
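To make the dependence on fetal concentration concrete, the following is a minimal sketch of the expected allele fractions for a `mother|fetus` genotype. It assumes (this is not stated in the source) that each maternal chromosome copy contributes (1 - f)/2 and each fetal copy contributes f/2 to the read pool, with f defined on a disomic reference; the function name `expected_allele_fractions` is hypothetical.

```python
def expected_allele_fractions(genotype: str, f: float) -> dict:
    """Expected fraction of reads per allele for a 'mother|fetus' genotype.

    Assumption: every maternal copy weighs (1 - f) / 2 and every fetal
    copy weighs f / 2; the weights are normalised to sum to 1.
    """
    mother, fetus = genotype.split("|")
    weights = {}
    for allele in mother:
        weights[allele] = weights.get(allele, 0.0) + (1 - f) / 2
    for allele in fetus:
        weights[allele] = weights.get(allele, 0.0) + f / 2
    total = sum(weights.values())
    return {allele: w / total for allele, w in weights.items()}


if __name__ == "__main__":
    for g in ("AA|AA", "AB|AB", "AA|AB", "AB|AA", "AB|AC"):
        print(g, expected_allele_fractions(g, 0.10))
```

Under this model the fractions for AA|AA and AB|AB do not change with f, while for AB|AA the difference between the two allele fractions equals f, for AA|AB the minor allele fraction equals f/2, and for AB|AC the third allele fraction equals f/2, which is consistent with the FC formulas given in the steps that follow.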
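The present invention provides a method for calculating the concentration of the minor-component DNA in a sample from the relative proportions of the allele counts at the target loci of the reference group, the method comprising:
(a1) setting a noise threshold α for the sample;
(a2) for each target DNA locus, first estimating its genotype from its allele counts, and then estimating the count derived from the minor-component DNA (FC) and the total count (TC) from the estimated genotype;
(a3) estimating the concentration of the minor-component DNA from the minor-component DNA counts (FC) and total counts (TC) of the target loci.
Further, setting the noise threshold α in step (a1) means setting a threshold that separates true allele count signals from spurious allele count signals; preferably, the noise threshold α is any value not greater than 0.05; preferably, the noise threshold α is 0.05, 0.04, 0.03, 0.02, 0.01, 0.0075, 0.005, 0.0025, or 0.001.
Further, step (a2), in which for each target DNA locus the genotype is first estimated from its allele counts and the minor-component DNA count (FC) and total count (TC) are then estimated from the estimated genotype, comprises the following steps:
(a2-i) sorting the allele counts of the target DNA locus in descending order, the three largest allele counts being labeled R1, R2, and R3 in turn;
(a2-ii) estimating the genotype of the target DNA locus from its allele counts;
(a2-iii) estimating the minor-component DNA count (FC) and total count (TC) from the estimated genotype and the allele counts of the target DNA locus.
Further, step (a2-ii), in which the genotype of the target DNA locus is estimated from its allele counts, the three largest allele counts being labeled R1, R2, and R3 in turn, comprises the following steps:
(a2-ii-1) using the allele counts of the target DNA locus, determining the number of alleles detected above the noise threshold at the locus; if the result is 1, proceeding to step (a2-ii-2); if the result is 2, proceeding to step (a2-ii-3); if the result is greater than 2, proceeding to step (a2-ii-4);
(a2-ii-2) estimating the genotype of the target DNA locus as AA|AA, then proceeding to step (a2-ii-5);
(a2-ii-3) estimating the genotype of the target DNA locus given that 2 alleles were detected above the noise threshold, using the two largest allele counts of the locus, then proceeding to step (a2-ii-5);
(a2-ii-4) estimating the genotype of the target DNA locus given that more than 2 alleles were detected above the noise threshold, using at least the two largest allele counts of the locus, then proceeding to step (a2-ii-5);
(a2-ii-5) outputting the estimated genotype of the target locus.
Further, step (a2-ii-1), in which the number of alleles detected above the noise threshold is determined from the allele counts of the target DNA locus, comprises the following steps in order:
(a2-ii-1-1) calculating the relative count of each allele of the target DNA locus;
(a2-ii-1-2) determining whether the relative count of each allele is above the set noise threshold, and then counting the alleles that are above the set noise threshold.
Here, the relative count of an allele is the quotient of the count of that allele and the sum of the counts of all alleles at the target locus. Preferably, the noise threshold α is any value not greater than 0.05; preferably, the predetermined noise threshold is 0.05, 0.04, 0.02, 0.01, 0.0075, 0.005, 0.0025, or 0.001.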
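Steps (a2-ii-1-1) and (a2-ii-1-2) amount to a small filter on relative counts. A minimal sketch follows; the function name `count_alleles_above_noise` and the default α = 0.01 are illustrative choices, not taken from the source.

```python
def count_alleles_above_noise(counts, alpha=0.01):
    """Number of alleles whose relative count exceeds the noise threshold.

    The relative count of an allele is its count divided by the sum of all
    allele counts at the locus, as defined in the text above.
    """
    total = sum(counts)
    if total == 0:
        return 0  # no reads at this locus
    return sum(1 for c in counts if c / total > alpha)


if __name__ == "__main__":
    # 5 reads out of 1000 (0.5%) fall below alpha = 1% and are treated as noise.
    print(count_alleles_above_noise([950, 45, 5], alpha=0.01))
```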
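Further, step (a2-ii-3), in which the genotype of the target DNA locus is estimated given that 2 alleles were detected above the noise threshold, using the two largest allele counts labeled R1 and R2, comprises the following steps:
(a2-ii-3-1) determining whether R1/(R1+R2) is less than 0.5+α; if so, estimating the genotype of the target DNA locus as AB|AB and proceeding to step (a2-ii-3-3); if not, proceeding to step (a2-ii-3-2);
(a2-ii-3-2) determining whether R1/(R1+R2) is less than 0.75; if so, estimating the genotype of the target DNA locus as AB|AA and proceeding to step (a2-ii-3-3); if not, estimating the genotype of the target DNA locus as AA|AB and proceeding to step (a2-ii-3-3);
(a2-ii-3-3) outputting the estimated genotype of the target locus.
Further, step (a2-ii-4), in which the genotype of the target DNA locus is estimated given that more than 2 alleles were detected above the noise threshold, using at least the two largest allele counts, the two largest being labeled R1 and R2, comprises the following steps:
(a2-ii-4-1) determining whether R2/R1 is greater than or equal to 0.5, and/or whether R1/(R1+R2) is greater than or equal to 1/2 and less than or equal to 2/3, and/or whether R2/(R1+R2) is greater than or equal to 1/3 and less than or equal to 1/2; if so, estimating the genotype of the target DNA locus as AB|AC and proceeding to step (a2-ii-4-3); if not, proceeding to step (a2-ii-4-2);
(a2-ii-4-2) flagging the allele counts of the locus as abnormal, and then either estimating the genotype of the target locus as NA and proceeding to step (a2-ii-4-3), or setting the number of alleles detected above the noise threshold at the target DNA locus to 2, estimating the genotype of the target locus as described in step (a2-ii-3), and proceeding to step (a2-ii-4-3);
(a2-ii-4-3) outputting the estimated genotype of the target locus.
Here, genotype NA means that the genotype of the target locus cannot be estimated.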
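The decision rules of steps (a2-ii-1) through (a2-ii-4) can be sketched as one function. This is an illustrative reading under stated assumptions: for more than two alleles it uses only the R2/R1 ≥ 0.5 criterion and returns "NA" otherwise (the source also allows falling back to the two-allele rules); the function name `estimate_genotype` and the default α are hypothetical.

```python
def estimate_genotype(counts, alpha=0.01):
    """Estimate a 'mother|fetus' genotype from allele counts at one locus."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return "NA"
    # (a2-ii-1): number of alleles above the noise threshold
    n_alleles = sum(1 for c in ordered if c / total > alpha)
    if n_alleles == 0:
        return "NA"
    if n_alleles == 1:                      # (a2-ii-2)
        return "AA|AA"
    r1, r2 = ordered[0], ordered[1]
    if n_alleles == 2:                      # (a2-ii-3)
        ratio = r1 / (r1 + r2)
        if ratio < 0.5 + alpha:
            return "AB|AB"
        if ratio < 0.75:
            return "AB|AA"
        return "AA|AB"
    # (a2-ii-4): more than two alleles above noise
    if r2 / r1 >= 0.5:
        return "AB|AC"
    return "NA"                             # flagged as abnormal


if __name__ == "__main__":
    for locus in ([1000], [500, 495], [550, 450], [950, 50], [500, 450, 50]):
        print(locus, estimate_genotype(locus))
```

The 0.75 cut-off separates AB|AA (major allele fraction 0.5 + f/2, below 0.75 when f < 0.5) from AA|AB (major allele fraction 1 - f/2, above 0.75 when f < 0.5).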
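Further, step (a2-iii), in which the minor-component DNA count (FC) and total count (TC) are estimated from the estimated genotype and the allele counts of the target DNA locus, the three largest allele counts being labeled R1, R2, and R3 in turn, comprises the following steps:
(a2-iii-1) if the estimated genotype of the target locus is AA|AA, FC is estimated as NA and TC is estimated as R1, R1+R2, or R1+R2+R3, then proceed to step (a2-iii-7);
(a2-iii-2) if the estimated genotype of the target locus is AB|AB, FC is estimated as NA and TC is estimated as R1+R2 or R1+R2+R3, then proceed to step (a2-iii-7);
(a2-iii-3) if the estimated genotype of the target locus is AB|AA, FC is estimated as R1-R2 and TC is estimated as R1+R2 or R1+R2+R3, then proceed to step (a2-iii-7);
(a2-iii-4) if the estimated genotype of the target locus is AA|AB, FC is estimated as 2×R2 and TC is estimated as R1+R2 or R1+R2+R3, then proceed to step (a2-iii-7);
(a2-iii-5) if the estimated genotype of the target locus is AB|AC, FC is estimated as R1-R2+R3, 2×R3, or 2×(R1-R2), and TC is estimated as R1+R2+R3, then proceed to step (a2-iii-7);
(a2-iii-6) if the estimated genotype of the target locus is none of the genotypes above, FC is estimated as NA and TC is estimated as R1, R1+R2, or R1+R2+R3, then proceed to step (a2-iii-7);
(a2-iii-7) outputting the estimated minor-component DNA count (FC) and total count (TC).
Here, an FC estimate of NA means that the count derived from the minor-component DNA cannot be estimated.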
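The table in steps (a2-iii-1) through (a2-iii-7) maps directly onto a lookup. The sketch below picks one of the permitted alternatives where the source lists several (for example FC = R1-R2+R3 for AB|AC, and TC as the sum of the alleles present in the genotype); those picks and the function name `minor_and_total_counts` are illustrative assumptions. `None` stands for NA.

```python
def minor_and_total_counts(genotype, counts):
    """Return (FC, TC) for one locus; FC is None when it cannot be estimated."""
    r1, r2, r3 = (sorted(counts, reverse=True) + [0, 0, 0])[:3]
    if genotype == "AA|AA":
        return None, r1
    if genotype == "AB|AB":
        return None, r1 + r2
    if genotype == "AB|AA":
        return r1 - r2, r1 + r2            # excess of the shared allele
    if genotype == "AA|AB":
        return 2 * r2, r1 + r2             # fetal-only allele is half of fetal DNA
    if genotype == "AB|AC":
        return r1 - r2 + r3, r1 + r2 + r3
    return None, r1 + r2 + r3              # any other genotype, including NA


if __name__ == "__main__":
    print(minor_and_total_counts("AB|AA", [550, 450]))
    print(minor_and_total_counts("AA|AB", [950, 50]))
    print(minor_and_total_counts("AB|AC", [500, 450, 50]))
```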
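Further, in step (a3), the concentration of the minor-component DNA is estimated from the minor-component DNA counts (FC) and total counts (TC) of the target loci in the reference group, wherein the concentration is calculated by linear regression or robust linear regression, and/or from the mean or median of FC and TC.
Further, in step (a3), the concentration of the minor-component DNA is estimated from the minor-component DNA counts (FC) and total counts (TC) of the target loci in the reference group by fitting a regression model.
Further, in the above step of estimating the concentration of the minor-component DNA by fitting a regression model, the regression model is selected from: a linear regression model, robust linear regression model, simple regression model, ordinary least squares regression model, multiple regression model, general multiple regression model, polynomial regression model, general linear model, generalized linear model, discrete choice regression model, logistic regression model, multinomial logit model, mixed logit model, probit model, multinomial probit model, ordered logit model, ordered probit model, Poisson model, multivariate response regression model, multilevel model, fixed effects model, random effects model, mixed model, nonlinear regression model, nonparametric model, semiparametric model, robust model, quantile model, isotonic model, principal components model, least-angle model, local model, segmented model, and errors-in-variables model.
Further, in the fitted model, the total count (TC) of each target locus in the reference group is the independent variable and the minor-component DNA count (FC) of each target locus is the dependent variable.
Further, the concentration of the minor-component DNA is estimated as the regression coefficient of the model parameter total count (TC).
Preferably, the fitted regression model is a linear regression model; preferably, the fitted regression model is a robust linear regression model; preferably, the fitted regression model is a general linear model.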
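As an illustration of step (a3), the sketch below estimates the concentration as the slope of FC on TC. It assumes an ordinary least-squares line through the origin (the source does not say whether an intercept is included) and also reports the median of the per-locus FC/TC ratios as a simple robust alternative; the function name `estimate_minor_fraction` is hypothetical.

```python
from statistics import median


def estimate_minor_fraction(pairs):
    """Estimate the minor-component concentration from (FC, TC) pairs.

    Loci whose FC is None (NA) are skipped. Returns a tuple of
    (least-squares slope through the origin, median of FC/TC ratios).
    """
    usable = [(fc, tc) for fc, tc in pairs if fc is not None and tc > 0]
    if not usable:
        raise ValueError("no informative loci")
    slope = sum(fc * tc for fc, tc in usable) / sum(tc * tc for _, tc in usable)
    return slope, median(fc / tc for fc, tc in usable)


if __name__ == "__main__":
    loci = [(100, 1000), (200, 2000), (None, 1500), (50, 500)]
    print(estimate_minor_fraction(loci))
```

With TC as the independent variable and FC as the dependent variable, the slope is the regression coefficient of TC, which is the quantity the text identifies with the concentration.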
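The present invention provides a method for calculating the concentration of the minor-component DNA in a sample by iteratively fitting genotypes to the allele counts of the target loci in the reference group, the method comprising:
(b1) setting a noise threshold α for the sample, an initial concentration estimate f₀, and an iteration tolerance ε;
(b2) for each target DNA locus, estimating its genotype from its allele counts and the current minor-component DNA concentration f₀;
(b3) for each target DNA locus, estimating the minor-component DNA count (FC) and total count (TC) from its estimated genotype;
(b4) estimating the minor-component DNA concentration f from the minor-component DNA counts (FC) and total counts (TC);
(b5) determining whether the absolute value of f-f₀ is less than ε; if not, setting f₀=f and returning to step (b2); if so, the minor-component DNA concentration in the sample is estimated as f.
Further, setting the noise threshold α in step (b1) means setting a threshold that separates true allele count signals from spurious allele count signals; preferably, the noise threshold α is any value not greater than 0.05; preferably, the noise threshold α is 0.05, 0.04, 0.03, 0.02, 0.01, 0.0075, 0.005, 0.0025, or 0.001.
Further, setting the initial concentration estimate f₀ in step (b1) means setting f₀ to any possible value of the minor-component DNA concentration; preferably, f₀ is less than 0.5; preferably, f₀ is less than 0.5 and greater than the noise threshold α; preferably, f₀ is any value that is both less than 0.5 and greater than the noise threshold α; preferably, f₀ is 0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01, or 0.005.
Further, setting the iteration tolerance ε in step (b1) means setting ε to a small cut-off for terminating the iteration; preferably, ε is less than 0.01; preferably, ε is any value less than 0.01; preferably, ε is less than 0.001; preferably, ε is less than 0.0001; preferably, ε is 0.01, 0.001, 0.0001, or 0.00001.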
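Further, step (b2), in which the genotype of each target DNA locus is estimated from its allele counts and the minor-component DNA concentration f₀, comprises the following steps:
(b2-i) listing all possible genotypes of the target DNA locus according to the origin of the sample;
(b2-ii) for each possible genotype of the target DNA locus, calculating the theoretical count of each allele from the minor-component DNA concentration f₀ and the total allele count (TC) of the locus;
(b2-iii) for each possible genotype of the target DNA locus, performing a goodness-of-fit test between the observed allele counts and the theoretical allele counts;
(b2-iv) examining the goodness-of-fit results for all possible genotypes and selecting the genotype that best fits the allele counts of the target DNA locus as the estimated genotype.
In the present invention, a goodness-of-fit test refers to one or more statistical tests that can be used to assess agreement between observed and theoretical counts; preferably, the goodness-of-fit test is the chi-square test; preferably, it is the G-test; preferably, it is Fisher's exact test; preferably, it is the binomial test; preferably, it is the chi-square test and/or the G-test and/or Fisher's exact test and/or the binomial test and/or variants and/or combinations thereof; preferably, the goodness of fit is assessed using the G value of the G-test and/or the AIC value and/or the corrected G value and/or the corrected AIC value and/or variants of the G value or AIC value and/or combinations thereof.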
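Steps (b2-i) through (b2-iv) can be sketched with the G statistic, G = 2 Σ O ln(O/E). The sketch makes several assumptions that are not in the source: observed and expected values are matched by rank (largest with largest), alleles absent from a candidate genotype are given a small expected noise fraction so that E is never zero, each maternal copy weighs (1 - f)/2 and each fetal copy f/2, and the candidate list is the five related mother-fetus genotypes. All function names are hypothetical.

```python
import math

RELATED_GENOTYPES = ("AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC")


def expected_fractions(genotype, f):
    """Expected allele fractions, sorted in descending order."""
    mother, fetus = genotype.split("|")
    weights = {}
    for allele in mother:
        weights[allele] = weights.get(allele, 0.0) + (1 - f) / 2
    for allele in fetus:
        weights[allele] = weights.get(allele, 0.0) + f / 2
    total = sum(weights.values())
    return sorted((w / total for w in weights.values()), reverse=True)


def g_statistic(counts, genotype, f, noise=1e-3):
    """G statistic of the observed counts against one candidate genotype."""
    observed = sorted(counts, reverse=True)
    expected = expected_fractions(genotype, f)
    size = max(len(observed), len(expected))
    observed += [0] * (size - len(observed))        # alleles not observed
    expected += [noise] * (size - len(expected))    # alleles not in genotype
    scale = sum(observed) / sum(expected)           # fractions -> counts
    return 2 * sum(o * math.log(o / (e * scale))
                   for o, e in zip(observed, expected) if o > 0)


def best_genotype(counts, f, candidates=RELATED_GENOTYPES):
    """Candidate genotype with the smallest G statistic (best fit)."""
    return min(candidates, key=lambda g: g_statistic(counts, g, f))


if __name__ == "__main__":
    for locus in ([550, 450], [500, 450, 50], [1000]):
        print(locus, best_genotype(locus, f=0.10))
```

A library implementation of the G-test could be substituted for `g_statistic`; the hand-written version is used here only to keep the sketch dependency-free.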
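Further, step (b3), in which for each target DNA locus the minor-component DNA count (FC) and total count (TC) are estimated from its estimated genotype, the four largest allele counts being labeled R1, R2, R3, and R4 in turn, comprises the following steps:
(b3-1) if the estimated genotype of the target locus is AA|AA, FC is estimated as NA and TC is estimated as R1, R1+R2, R1+R2+R3, or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-2) if the estimated genotype of the target locus is AB|AB, FC is estimated as NA and TC is estimated as R1+R2, R1+R2+R3, or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-3) if the estimated genotype of the target locus is AB|AA, FC is estimated as R1-R2 and TC is estimated as R1+R2, R1+R2+R3, or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-4) if the estimated genotype of the target locus is AA|AB, FC is estimated as 2×R2 and TC is estimated as R1+R2, R1+R2+R3, or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-5) if the estimated genotype of the target locus is AB|AC, FC is estimated as R1-R2+R3, 2×R3, or 2×(R1-R2), and TC is estimated as R1+R2+R3 or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-6) if the estimated genotype of the target locus is AA|BB, FC is estimated as R2 and TC is estimated as R1+R2, R1+R2+R3, or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-7) if the estimated genotype of the target locus is AA|BC, FC is estimated as R2+R3, 2×R2, or 2×R3, and TC is estimated as R1+R2+R3 or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-8) if the estimated genotype of the target locus is AB|CC, determine whether the current estimate f₀ is greater than or equal to 1/3; if so, FC is estimated as R1 and TC is estimated as R1+R2+R3 or R1+R2+R3+R4, then proceed to step (b3-11); if not, FC is estimated as R3 and TC is estimated as R1+R2+R3 or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-9) if the estimated genotype of the target locus is AB|CD, FC is estimated as R3+R4, 2×R3, or 2×R4, and TC is estimated as R1+R2+R3+R4, then proceed to step (b3-11);
(b3-10) if the estimated genotype of the target locus is none of the genotypes above, FC is estimated as NA and TC is estimated as R1, R1+R2, R1+R2+R3, or R1+R2+R3+R4, then proceed to step (b3-11);
(b3-11) outputting the estimated minor-component DNA count (FC) and total count (TC).
Further, in step (b4), the minor-component DNA concentration f is estimated from the minor-component DNA counts (FC) and total counts (TC) using the method described in step (a3).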
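The iterative procedure (b1) through (b5) can be sketched end to end as a fixed-point loop. The sketch is deliberately simplified and rests on assumptions not found in the source: only the five related mother-fetus genotypes are considered, genotypes are chosen by the smallest rank-matched G statistic with a small noise floor for absent alleles, FC uses one of the permitted formulas per genotype, and f is re-estimated as a least-squares slope through the origin. All names, the default f₀ = 0.05, and the example counts are illustrative.

```python
import math

CANDIDATES = ("AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC")


def expected_fractions(genotype, f):
    mother, fetus = genotype.split("|")
    weights = {}
    for allele in mother:
        weights[allele] = weights.get(allele, 0.0) + (1 - f) / 2
    for allele in fetus:
        weights[allele] = weights.get(allele, 0.0) + f / 2
    total = sum(weights.values())
    return sorted((w / total for w in weights.values()), reverse=True)


def g_statistic(counts, genotype, f, noise=1e-3):
    observed = sorted(counts, reverse=True)
    expected = expected_fractions(genotype, f)
    size = max(len(observed), len(expected))
    observed += [0] * (size - len(observed))
    expected += [noise] * (size - len(expected))
    scale = sum(observed) / sum(expected)
    return 2 * sum(o * math.log(o / (e * scale))
                   for o, e in zip(observed, expected) if o > 0)


def minor_and_total(genotype, counts):
    r1, r2, r3 = (sorted(counts, reverse=True) + [0, 0, 0])[:3]
    if genotype == "AB|AA":
        return r1 - r2, r1 + r2
    if genotype == "AA|AB":
        return 2 * r2, r1 + r2
    if genotype == "AB|AC":
        return r1 - r2 + r3, r1 + r2 + r3
    return None, r1 + r2                    # AA|AA, AB|AB: uninformative


def iterate_fraction(loci, f0=0.05, eps=1e-4, max_iter=100):
    """Repeat (b2)-(b4) until |f - f0| < eps; f0 must be strictly positive."""
    for _ in range(max_iter):
        pairs = []
        for counts in loci:
            genotype = min(CANDIDATES,
                           key=lambda g: g_statistic(counts, g, f0))   # (b2)
            fc, tc = minor_and_total(genotype, counts)                 # (b3)
            if fc is not None:
                pairs.append((fc, tc))
        if not pairs:
            return None                     # no informative locus
        f = (sum(fc * tc for fc, tc in pairs)
             / sum(tc * tc for _, tc in pairs))                        # (b4)
        if abs(f - f0) < eps:                                          # (b5)
            return f
        f0 = f
    return f0


if __name__ == "__main__":
    reference_loci = [[550, 450], [950, 50], [500, 450, 50], [1000], [505, 495]]
    print(iterate_fraction(reference_loci))
```

The `max_iter` guard is an addition for safety; the source describes the loop as running until the tolerance is met.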
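In the present invention, the concentration of the minor-component DNA in a sample is calculated by iteratively fitting genotypes to the allele counts of the target loci in the reference group. This method can be used to estimate the concentration of the minor-component DNA both in mixed samples whose components are biologically related and in mixed samples whose components are not biologically related. Further, the method is suitable for calculating the fetal DNA concentration in plasma DNA samples both from pregnant women who are the genetic mother and from pregnant women who have lawfully received donated eggs. Further, the method can be used to estimate the concentration of the minor-component DNA in a mixture of two independent DNA samples. Further, the method described above can be used to estimate the concentrations of several components in a mixture of more than two samples. For multiple pregnancies, for example, one fetal DNA concentration to be iterated can be set for each fetus: for a twin pregnancy the fetal DNA concentrations to be iterated can be set as f1 and f2; for a triplet pregnancy, as f1, f2, and f3; and so on. To estimate the concentrations of several sample components, an initial value is first set for the concentration of each sample, the estimated count contributed by each sample component at each target locus is then estimated from the allele counts of that locus and all of its possible genotypes, and the concentrations of the sample components are then calculated iteratively using goodness-of-fit tests until the change in the calculated concentrations is smaller than the set tolerance.
In the present invention, the target to be tested in a sample includes a single target DNA locus, an entire chromosome containing one or more target DNA loci, and a subchromosomal segment containing one or more target DNA loci.
The present invention provides a method for determining the karyotype, genotype, or wild-type/mutant status of the target to be tested in a sample using a goodness-of-fit test on the allele counts of target DNA loci, the method comprising:
(c1) classifying each target DNA locus as a reference locus or a target locus according to its chromosomal location, wherein the reference loci form the reference group and the target loci form the target group;
(c2) calculating the concentration of the minor-component DNA in the sample from the allele counts of the target DNA loci in the reference group;
(c3) estimating the karyotype, genotype, or wild-type/mutant status of the target to be tested in the sample by a goodness-of-fit test, using the allele counts of the target DNA loci in the target group and the concentration of the minor-component DNA in the sample.
Further, where step (c3) is used to estimate the genotype of the target to be tested in the sample, the method comprises:
(c3-a1) for each target DNA locus in the target group, listing all of its possible genotypes;
(c3-a2) for each target DNA locus in the target group and each of its possible genotypes, calculating the theoretical count of each allele from the minor-component DNA concentration in the sample and the total allele count of the locus;
(c3-a3) for each target DNA locus in the target group and each of its possible genotypes, performing a goodness-of-fit test between the observed allele counts and their theoretical counts;
(c3-a4) for each target DNA locus in the target group, selecting the best-fitting genotype as the genotype of the locus according to the goodness-of-fit results for all of its possible genotypes.
Further, where step (c3) is used to estimate the karyotype of the target to be tested in the sample, the method comprises:
(c3-b1) analyzing the sample to be tested and listing all possible karyotypes of the target chromosome or subchromosomal segment to be tested;
(c3-b2) for each possible karyotype, listing all possible genotypes of each target DNA locus in the target group;
(c3-b3) for each target DNA locus in the target group, first performing goodness-of-fit tests on its allele counts for all of its possible genotypes, and then selecting, for each possible karyotype, the genotype that fits best under that karyotype;
(c3-b4) analyzing the goodness-of-fit results of all target DNA loci for each karyotype together, and selecting the karyotype with the best overall fit across all target DNA loci as the karyotype of the target chromosome or subchromosomal segment to be tested.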
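Steps (c3-b1) through (c3-b4) can be sketched as a comparison of two candidate karyotypes. The sketch uses the five disomic-disomic genotypes and the ten disomic-trisomic genotypes listed in this description, and makes assumptions that are not in the source: each maternal copy weighs (1 - f)/2 and each fetal copy f/2, observed and expected values are matched by rank, absent alleles receive a small noise floor, and the "overall fit" of a karyotype is taken as the sum over loci of the smallest G statistic. Names and example counts are illustrative.

```python
import math

KARYOTYPE_GENOTYPES = {
    "disomic-disomic": ["AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC"],
    "disomic-trisomic": ["AA|AAA", "AA|AAB", "AB|AAA", "AB|AAB", "AB|AAC",
                         "AB|ABC", "AA|ABB", "AA|ABC", "AB|ACC", "AB|ACD"],
}


def expected_fractions(genotype, f):
    mother, fetus = genotype.split("|")
    weights = {}
    for allele in mother:
        weights[allele] = weights.get(allele, 0.0) + (1 - f) / 2
    for allele in fetus:
        weights[allele] = weights.get(allele, 0.0) + f / 2
    total = sum(weights.values())
    return sorted((w / total for w in weights.values()), reverse=True)


def g_statistic(counts, genotype, f, noise=1e-3):
    observed = sorted(counts, reverse=True)
    expected = expected_fractions(genotype, f)
    size = max(len(observed), len(expected))
    observed += [0] * (size - len(observed))
    expected += [noise] * (size - len(expected))
    scale = sum(observed) / sum(expected)
    return 2 * sum(o * math.log(o / (e * scale))
                   for o, e in zip(observed, expected) if o > 0)


def best_karyotype(loci, f):
    """Return (best karyotype, {karyotype: summed best-fit G over loci})."""
    totals = {}
    for karyotype, genotypes in KARYOTYPE_GENOTYPES.items():
        totals[karyotype] = sum(
            min(g_statistic(counts, g, f) for g in genotypes)   # (c3-b3)
            for counts in loci)
    return min(totals, key=totals.get), totals                   # (c3-b4)


if __name__ == "__main__":
    trisomic_like = [[5238, 4762], [9524, 476], [5714, 4286]]
    normal_like = [[5500, 4500], [9500, 500], [5000, 5000]]
    print(best_karyotype(trisomic_like, f=0.10)[0])
    print(best_karyotype(normal_like, f=0.10)[0])
```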
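Further, where step (c3) is used to estimate the wild-type/mutant status of the target to be tested in the sample, the method comprises:
(c3-c1a) for each target DNA locus in the target group, listing all of its possible wild-type/mutant genotypes;
(c3-c2a) for each target DNA locus in the target group and each of its possible wild-type/mutant genotypes, calculating the theoretical count of each allele from the minor-component DNA concentration in the sample and the total allele count of the locus;
(c3-c3a) for each target DNA locus in the target group and each of its possible wild-type/mutant genotypes, performing a goodness-of-fit test between the observed allele counts and their theoretical counts;
(c3-c4a) analyzing all target DNA loci in the target group together, and selecting the wild-type/mutant genotype that best fits all target loci as the wild-type/mutant genotype of the target to be tested.
Further, where step (c3) is used to estimate the wild-type/mutant status of the target to be tested in the sample, the method alternatively comprises:
(c3-c1b) for each target DNA locus in the target group, estimating its genotype by a goodness-of-fit test from its allele counts and the concentration of the minor-component DNA in the sample;
(c3-c2b) determining the wild-type/mutant status of each allele of the target to be tested in each component of the sample from the genotype of each target DNA locus in the target group and the sequence of each of its alleles.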
进一步地,上述步骤(c3)中所述利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的基因型或野生突变型,其中目标组可以为一个靶位点,也可以为一个靶位点的多个独立重复。优选的,靶位点独立重复通过利用相同的引物和独立的PCR和/或多重PCR扩增反应得到;优选的,靶位点独立重复通过利用不同的引物和独立的PCR和/或多重PCR扩增反应得到。
进一步地,上述步骤(c3)中所述利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的基因型或野生突变型,其中所述拟合优度检验方法是采用一种或几种能用来检验观测数与理论数之间一致性的统计检验方法;优选的,拟合优度检验是卡方检验;优选的,拟合优度检验是G检验;优选的,拟合优度检验是费希尔精确检验;优选的,拟合优度检验是二项分布检验;优选的,拟合优度检验是卡方检验和/或G检验和/或费希尔精确检验和/或二项分布检验和/或其变体和/或其组合;优选的,拟合优度检验是利用G检验的计算值G值和/或AIC值和/或经校正的G值和/或经校正的AIC值和/或G值或AIC值的变体和/或其组合来进行拟合优度检验。
进一步地,上述步骤(c3)中所述利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的基因型或野生突变型,其中所述拟合优度检验方法是采用步骤(b2-i)-步骤(b2-iv)所述的方法进行拟合优度检验。
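In the present invention, a chromosome-level karyotype refers to the euploid or aneuploid state of a given chromosome in each component of a mixed sample. For example, in a maternal plasma sample, a chromosome for which the mother is normal and the fetus is monosomic has the karyotype disomy-monosomy; a chromosome for which the mother is normal and the fetus is trisomic has the karyotype disomy-trisomy; and a chromosome for which both mother and fetus are normal has the karyotype disomy-disomy.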
在本发明中,每个亚染色体水平的片段被认为是一条染色体,因此在孕妇血浆样本中,母亲和胎儿均纯合微缺失的亚染色体核型为缺体-缺体、母亲纯合微缺失胎儿杂合微缺失的亚染色体核型为缺体-单体、母亲杂合微缺失胎儿正常的亚染色体核型为单体-双体、母亲和胎儿均杂合微缺失的亚染色体核型为单体-单体、母亲杂合微缺失胎儿纯合微缺失的亚染色体核型为单体-缺体、母亲正常胎儿杂合微缺失的亚染色体核型为双体-单体、母亲和胎儿均正常的亚染色体核型为双体-双体、母亲和胎儿均纯合微重复的亚染色体核型为四体-四体、母亲纯合微重复胎儿杂合微重复的亚染色体核型为四体-三体、母亲杂合微重复胎儿正常的亚染色体核型为三体-双体、母亲和胎儿均杂合微重复的亚染色体核型为三体-三体、母亲杂合微重复胎儿纯合微重复的亚染色体核型为三体-四体、母亲正常胎儿杂合微重复的亚染色体核型为双体-三体。
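In the present invention, a genotype refers to the combination of the genotypes of a target DNA locus in each component of a mixed sample, where 0 or 1 allele may be detected at the locus on each chromosome. For example, in a maternal plasma sample, a locus with the disomy-monosomy karyotype has 4 possible genotypes (not counting genotypes in which the mother and/or fetus is mosaic); these are shown in formula images PCTCN2021125359-appb-000001 and PCTCN2021125359-appb-000002 (not reproduced here). The possible genotypes for disomy-trisomy (not counting genotypes in which the mother and/or fetus is mosaic, and/or genotypes in which the fetus, for example because of a de novo mutation, has not inherited at least one allele from the mother) are AA|AAA, AA|AAB, AB|AAA, AB|AAB, AB|AAC, AB|ABC, AA|ABB, AA|ABC, AB|ACC and AB|ACD, where A, B, C and D denote different alleles of the target DNA locus and the symbol shown in image PCTCN2021125359-appb-000003 (not reproduced here) denotes a deletion. In general, the genotype of a locus in a mixed sample is any possible combination of the alleles at that locus on each chromosome of each component. Similarly, for sub-chromosomal variation, 0 (microdeletion), 1 (normal) or 2 (microduplication) alleles may be detected at the locus on each chromosome, so the possible genotypes for a sub-chromosomal karyotype of a mixed sample are all possible combinations of all alleles of each locus on each chromosome. For example, in a maternal plasma sample, a locus with the sub-chromosomal karyotype trisomy-trisomy has 22 possible genotypes (not counting genotypes in which the mother and/or fetus is mosaic, and/or genotypes in which the fetus, for example because of a de novo mutation, has not inherited at least one allele from the mother): AAA|AAA, AAA|AAB, AAA|ABB, AAA|ABC, AAB|AAA, AAB|AAB, AAB|AAC, AAB|ABB, AAB|ABC, AAB|ACC, AAB|ACD, AAB|BBB, AAB|BBC, AAB|BCC, AAB|BCD, ABC|AAA, ABC|AAB, ABC|AAD, ABC|ABC, ABC|ABD, ABC|ADD and ABC|ADE, where A, B, C, D and E denote different alleles of the target DNA locus.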
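The present invention provides a method for determining the karyotype, genotype, or wild-type/mutant status of a target to be tested in a sample using a relative distribution plot of the allele counts of the target loci, the method comprising: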
(d1)将每一个靶DNA位点根据其在染色体上的定位分为参照位点或目标位点,其中各参照位点组成参照组,和各目标位点组成目标组;
(d2)利用参照组各个靶DNA位点的等位基因计数,计算样本中最少组分DNA的浓度;
(d3)利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的核型或基因型或野生突变型。
进一步地,上述步骤(d3)中所述利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的基因型,所述方法包括:
(d3-a1)对于目标组每一个靶DNA位点,列出其所有可能的基因型;
(d3-a2)对于目标组靶DNA位点每一个可能的基因型,首先根据样本中最少组分DNA的浓度计算其各个等位基因的相对计数理论值,然后选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记该基因型的理论位置;
(d3-a3)对于目标组每一个靶DNA位点,首先计算其各个等位基因的相对计数,然后选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该靶DNA位点在等位基因相对计数图上的实际位置;
(d3-a4)根据目标组各个靶DNA位点在等位基因相对计数图中的理论位置分布以及实际位置分布推断待测目标的基因型。
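The relative-count plot of steps (d3-a1) to (d3-a4) can be sketched numerically without drawing anything. The sketch assumes the second-largest relative count is the one plotted against the largest, and it replaces visual inspection of the plot with a nearest-point rule; both choices, and the function names, are assumptions made for illustration.

```python
from collections import Counter

def theoretical_position(genotype, f):
    """(largest, second-largest) theoretical relative counts of a maternal|fetal genotype."""
    maternal, fetal = genotype.split("|")
    m, c = Counter(maternal), Counter(fetal)
    total = len(maternal) * (1 - f) + len(fetal) * f
    fracs = sorted(((m[a] * (1 - f) + c[a] * f) / total
                    for a in set(maternal + fetal)), reverse=True)
    fracs.append(0.0)          # a homozygous genotype has no second allele
    return fracs[0], fracs[1]

def actual_position(counts):
    """(largest, second-largest) observed relative counts of one locus."""
    tc = sum(counts)
    rel = sorted((x / tc for x in counts), reverse=True) + [0.0]
    return rel[0], rel[1]

def nearest_genotype(counts, f, candidates):
    """Assign the locus to the genotype whose theoretical point is closest."""
    x, y = actual_position(counts)

    def squared_distance(genotype):
        tx, ty = theoretical_position(genotype, f)
        return (x - tx) ** 2 + (y - ty) ** 2

    return min(candidates, key=squared_distance)

CANDIDATES = ["AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC"]
call = nearest_genotype([110, 90], f=0.1, candidates=CANDIDATES)
```

At f = 0.1 the five theoretical points are (1, 0), (0.95, 0.05), (0.55, 0.45), (0.5, 0.5) and (0.5, 0.45), so a locus observed at (0.55, 0.45) falls on AB|AA.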
进一步地,上述步骤(d3)中所述利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的核型,所述方法包括:
(d3-b1)分析待测样本,列出待检测目标染色体或亚染色体片段的所有可能的核型;
(d3-b2)对于每一个可能的核型,列出目标组各个靶DNA位点所有可能的基因型;
(d3-b3)对于目标组靶DNA位点每一个可能的基因型,首先根据样本中最少组分DNA的浓度计算其各个等位基因的相对计数理论值,然后选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记该基因型的理论位置;
(d3-b4)对于目标组每一个靶DNA位点,首先计算其各个等位基因的相对计数,然后选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该靶DNA位点在等位基因相对计数图上的实际位置;
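(d3-b5) inferring the karyotype of the target to be tested from the theoretical positions of the target DNA loci of the target group under each karyotype and their actual positions in the allele relative count plot.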
进一步地,上述步骤(d3)中所述利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的野生突变型,所述方法包括:
(d3-c1)对于目标组每一个靶DNA位点,列出其野生型序列和所有可能的野生突变基因型;
(d3-c2)对于每一个可能的野生突变基因型,计算其野生型等位基因和其它非野生型各个等位基因的相对计数理论值,并选取至少一个非野生型等位基因相对计数理论值对野生型等位基因相对计数理论值作图来标记其野生突变基因型的理论位置;
(d3-c3)对于目标组每一个靶DNA位点,计算其野生型等位基因和其它非野生型各个等位基因的相对计数值,并选取至少一个非野生型等位基因相对计数对野生型等位基因相对计数作图来标记该靶DNA位点在等位基因相对计数图上的实际位置;
(d3-c4)根据目标组所有靶DNA位点在等位基因相对计数图中的理论位置分布以及实际位置分布推断其野生突变型。
本发明提供了利用靶DNA位点等位基因计数的拟合优度检验和/或等位基因计数相对分布图确定样本中待检测目标的核型或基因型或野生突变型的方法,其特征在于在步骤(c2)或步骤(d2)中利用参照组各个靶DNA位点的等位基因计数,计算样本中最少组分DNA的浓度,是采用步骤(a1)-步骤(a3)和/或步骤(b1)-步骤(b5)所述的方法计算样本中最少组分DNA的浓度。
本发明提供了利用等位基因计数相对分布图确定单一基因组样本中待检测目标的核型的方法,所述方法包括:
(e1)计算各个靶DNA位点的各个等位基因相对计数;
(e2)对每一个靶DNA位点,将其第二大的等位基因相对计数对其最大的等位基因相对计数作分布图A或将其最大的等位基因相对计数对该靶DNA位点在染色体或亚染色体上的相对位置作分布图B;
(e3)利用各个靶DNA位点的等位基因计数相对分布图A和/或分布图B,估计单一基因组样本中待检测目标的核型。
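For a single-genome sample, steps (e1) and (e2) reduce to computing two point sets. The sketch below assumes each locus is given as a hypothetical `(relative_position, allele_counts)` pair; interpreting the resulting clusters is left to step (e3).

```python
def relative_count_points(loci):
    """Return the points of plot A (second-largest vs largest relative count)
    and plot B (largest relative count vs position) for a list of loci."""
    plot_a, plot_b = [], []
    for position, counts in loci:
        tc = sum(counts)
        rel = sorted((c / tc for c in counts), reverse=True) + [0.0]
        plot_a.append((rel[0], rel[1]))
        plot_b.append((position, rel[0]))
    return plot_a, plot_b

loci = [(0.1, [500, 500]), (0.5, [1000]), (0.9, [600, 300])]
plot_a, plot_b = relative_count_points(loci)
```

As simple arithmetic, a heterozygous locus on a disomic chromosome sits near (1/2, 1/2) in plot A, while a two-allele locus on a trisomic chromosome (two copies of one allele, one of the other) sits near (2/3, 1/3).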
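The invention can detect genetic alterations in each component of a mixed genome, for example single-locus alterations or chromosome-level and sub-chromosome-level variations in the mother and/or fetus, by counting the alleles of polymorphic loci in a maternal plasma DNA sample. It can also be applied to karyotype or genotype detection in single-genome samples, for example preimplantation diagnosis of genetic disease in embryos. The method detects genetic alterations at the nucleotide level and at the chromosome or sub-chromosome level at the same time, and has good prospects for development and application in screening for fetal genetic disease.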
本发明涉及利用母亲和胎儿遗传物质的混合物检测待测目标是否有遗传学异常。因此,在一方面,本发明提供了用于确定生物样品中胎儿非整倍性存在或不存在的方法,所述生物样品包括来自所述母亲的生物样品的以游离漂浮的DNA形式存在的胎儿和母亲核酸,在PCR或多重PCR反应中扩增靶DNA位点(即,扩增模板DNA,使得扩增的DNA再现原始模板DNA的比),然后根据所扩增的待测目标每一个靶DNA位点各个等位基因的相对计数分布来确定所述胎儿非整倍性的存在或不存在。
在另一方面,本发明提供了用于确定生物样品中胎儿染色体片段的拷贝数变异存在或不存在的方法,所述生物样品包括来自所述母亲的生物样品的以游离漂浮的DNA形式存在的胎儿和母亲核酸,在PCR或多重PCR反应中扩增靶DNA位点(即,扩增模板DNA,使得扩增的DNA再现原始模板DNA的比),然后根据所扩增的待测目标每一个靶DNA位点各个等位基因的相对计数分布来确定所述胎儿染色体片段的拷贝数变异的存在或不存在。
在另一方面,本发明提供了用于确定生物样品中胎儿单基因遗传病致病基因位点的变异的存在或不存在的方法,所述生物样品包括来自所述母亲的生物样品的以游离漂浮的DNA形式存在的胎儿和母亲核酸,在PCR或多重PCR反应中扩增靶DNA位点(即,扩增模板DNA,使得扩增的DNA再现原始模板DNA的比),然后根据所扩增的待测目标靶DNA位点(单基因遗传病致病基因位点)各个等位基因的相对计数分布来确定所述胎儿单基因遗传病致病基因位点的变异的存在或不存在。
在另一方面,本发明提供了用于实施本发明方法的诊断试剂盒,其中包括至少一组引物以扩增靶DNA位点。所述至少一组引物扩增至少一个参照组靶DNA位点和/或至少一个目标组靶DNA位点。其中目标组靶DNA位点选自有染色体非整倍性异常的可能的染色体和/或有拷贝数变异可能的染色体片段和/或有可能是单基因遗传病的致病变异的位点。其中目标组靶DNA位点的核酸序列在待检测人群中一般具有多态性和/或目标组靶DNA位点是可能的单基因遗传病的致病性变异位点。其中参照组靶DNA位点选自通常没有染色体非整倍性异常的染色体和/或通常没有拷贝数变异的染色体片段。其中参照组靶DNA位点的核酸序列在待检测人群中一般具有多态性。
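In another aspect, the invention provides a diagnostic kit for carrying out the method of the invention. The kit includes primers for performing step (2) and/or step (3). Other reagents that may optionally be included in the kit are instructions for use, the polymerase and buffers for the PCR and/or multiplex PCR reactions, and the reagents required for constructing high-throughput sequencing libraries from the amplified fragments.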
在另一方面,本发明提供了用于实施本发明的方法的一种系统。该系统用于实施从生物测试样品预测待检测目标的核型或基因型或野生突变型的方法中的一个或多个步骤,例如步骤(4)至(5)中的一个或多个。在另一方面,本发明提供了用于实施本发明的方法的装置和/或计算机程序产品和/或系统和/或模块,该装置和/或计算机程序产品和/或系统和/或模块包括用于执行上述步骤(1)-步骤(5)、上述步骤(a1)-步骤(a3)、上述步骤(b1)-步骤(b5)、上述步骤(c1)-步骤(c3)、上述步骤(d1)-步骤(d3)和/或上述步骤(e1)-步骤(e3)中的任何步骤。
在一些实施方案中,本发明的方法在体外或离体进行。在一些实施方案中,本发明的样品为体外或离体样本。
在一方面,本发明涉及用于执行本发明方法的装置。例如,在一些实施方案中,本发明涉及一种检测样本遗传变异的装置,其特征在于包括:
(1)配置用于接收待测生物样品并制备核酸的模块;
(2)配置用于富集或扩增靶DNA位点的模块,其中至少有一个靶DNA位点在样本中有多于一个的等位基因;
(3)配置用于测序所扩增的靶DNA位点的模块;
(4)统计模块,其配置用于对每一个靶DNA位点,统计其各个等位基因的计数;
(5)确定模块,其配置用于利用靶DNA位点等位基因计数的拟合优度检验和/或等位基因计数相对分布图确定样本中待检测目标的核型或基因型或野生突变型。
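In some embodiments, the counting module is configured to count, for each target DNA locus, each of its alleles, the counting comprising the following steps in order: (4-1) mapping each amplified sequence to a chromosomal or genomic position; (4-2) counting the number of sequences mapped to each chromosomal or genomic region, wherein, if a region has different alleles, the number of sequences mapped to each allele of that region is also counted. In some embodiments, any computational method is used to map the sequence reads to chromosomal or genomic positions/regions. In some embodiments, the computer algorithms used to map sequences in step (4-1) include, but are not limited to, searching for specific sequences, BLAST, BLITZ, FASTA, BOWTIE, BOWTIE 2, BWA, NOVOALIGN, GEM, ZOOM, ELAN, MAQ, MATCH, SOAP, STAR, SEGEMEHL, MOSAIK or SEQMAP, or variants or combinations thereof. In some embodiments, specific sequences (uniquely mapping sequences) are extracted from the chromosomal or genomic sequence corresponding to each target DNA locus and are then used to map reads to chromosomal or genomic positions/regions. In some embodiments, sequence reads may be aligned with the sequence of a chromosomal or genomic position/region. In some embodiments, sequence reads may be aligned with chromosomal or genomic sequences. In some embodiments, sequence reads may be obtained from, and/or aligned with sequences in, nucleic acid databases known in the art, including for example GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). BLAST or similar tools can be used to search sequence databases for identical sequences. Search hits can then be used, for example, to sort identical sequences into the appropriate chromosomal or genomic position/region. In some embodiments, reads may map uniquely or non-uniquely to portions of the reference genome. A read that aligns with a single sequence in the genome is called "uniquely mapped". A read that aligns with two or more sequences in the genome is called "non-uniquely mapped". In some embodiments, non-uniquely mapped reads are removed from further analysis (for example, quantification).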
在一些实施方案中,确定模块经配置用于利用靶DNA位点等位基因计数的拟合优度检验确定样本中待检测目标的核型或基因型或野生突变型,所述确定依次包括如下步骤:
(c1)将每一个靶DNA位点根据其在染色体上的定位分为参照位点或目标位点;
(c2)利用参照组各个靶DNA位点的等位基因计数,计算样本中最少组分DNA的浓度;
(c3)利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的核型或基因型或野生突变型。
在一些实施方案中,确定模块经配置用于利用靶DNA位点等位基因计数的相对分布图确定样本中待检测目标的核型或基因型或野生突变型,所述确定依次包括如下步骤:
(d1)将每一个靶DNA位点根据其在染色体上的定位分为参照位点或目标位点;
(d2)利用参照组各个靶DNA位点的等位基因计数,计算样本中最少组分DNA的浓度;
(d3)利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的核型或基因型或野生突变型。
在一些实施方案中,利用一种或几种拟合优度检验统计检验方法来检验观测数与理论数之间的一致性。在一些实施方案中,拟合优度检验是卡方检验。在一些实施方案中,拟合优度检验是G检验。在一些实施方案中,拟合优度检验是费希尔精确检验。在一些实施方案中,拟合优度检验是二项分布检验。在一些实施方案中,拟合优度检验是卡方检验、G检验、费希尔精确检验、二项分布检验、其变体或其组合。在一些实施方案中,拟合优度检验是利用G检验的计算值G值、AIC值、经校正的G值、经校正的AIC值、G值或AIC值的变体、或其组合来进行拟合优度检验。
在一些实施方案中,确定模块经配置用于利用靶DNA位点等位基因计数相对分布图确定样本中待检测目标的核型,其中待检测样本是单一基因组样本,所述确定依次包括如下步骤:
(e1)计算目标组各个靶DNA位点的各个等位基因相对计数;
(e2)对目标组每一个靶DNA位点,将其第二大的等位基因相对计数对最大的等位基因相对计数作分布图A或将其最大的等位基因相对计数对该靶DNA位点在染色体或亚染色体上的相对位置作分布图B;
(e3)利用目标组各个靶DNA位点的等位基因计数相对分布图A和/或分布图B,估计单一基因组样本中待检测目标的核型。
在一些实施方案中,在步骤(c2)或步骤(d2)中采取等位基因计数相对比例法计算样本中最少组分DNA的浓度,所述计算依次包括如下步骤:
(a1)设定样本的噪声阈值α;
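(a2) for each target DNA locus, first estimating its genotype from the allele counts of that locus, then estimating from the estimated genotype the count derived from the minor-component DNA (FC) and the total count (TC);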
(a3)利用参照组各个靶位点的最少组分DNA的计数(FC)和总计数(TC),估算最少组分DNA的浓度。
在一些实施方案中,在步骤(c2)或步骤(d2)中采取等位基因计数迭代拟合基因型法计算样本中最少组分DNA的浓度,所述计算依次包括如下步骤:
(b1)设定样本的噪声阈值α、初始浓度估计值f 0和迭代误差精度值ε;
(b2)对每一个靶DNA位点利用其各个等位基因计数和样本中最少组分DNA的浓度值f 0估算其基因型;
(b3)对每一个靶DNA位点,根据其估算的基因型来估算来源于最少组分DNA的计数(FC)和总计数(TC);
(b4)利用最少组分DNA的计数(FC)和总计数(TC)估算最少组分DNA的浓度f;
(b5)判断f-f 0的绝对值是否小于ε,如果判断结果为否,则设定f 0=f,然后执行步骤(b2);如果判断结果为是,则样本中最少组分DNA浓度估算为f。
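The loop of steps (b1) to (b5) can be sketched as below. Several simplifications are assumptions made to keep the example short: only the five maternal|fetal genotypes AA|AA, AA|AB, AB|AA, AB|AB and AB|AC are considered, the fit in (b2) uses a rank-wise G statistic with a small floor `eps`, TC is the sum of all counts at a locus, the noise threshold α is not applied, and (b4) uses the ratio of summed FC to summed TC over informative loci. Function names are hypothetical.

```python
import math
from collections import Counter

CANDIDATES = ["AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC"]

def g_stat(counts, genotype, f, eps=1e-3):
    """Rank-wise G statistic of observed counts against a genotype at fraction f."""
    maternal, fetal = genotype.split("|")
    m, c = Counter(maternal), Counter(fetal)
    total = len(maternal) * (1 - f) + len(fetal) * f
    exp = sorted(((m[a] * (1 - f) + c[a] * f) / total
                  for a in set(maternal + fetal)), reverse=True)
    obs = sorted(counts, reverse=True)
    n = max(len(obs), len(exp))
    obs += [0] * (n - len(obs))
    exp = [max(x, eps) for x in exp + [0.0] * (n - len(exp))]
    tc, s = sum(obs), sum(exp)
    return 2 * sum(o * math.log(o * s / (tc * e)) for o, e in zip(obs, exp) if o > 0)

def fc_tc(counts, genotype):
    """Step (b3): minor-component count FC (None when NA) and total count TC."""
    r = sorted(counts, reverse=True) + [0, 0]
    tc = sum(counts)
    if genotype == "AB|AA":
        return r[0] - r[1], tc
    if genotype == "AA|AB":
        return 2 * r[1], tc
    if genotype == "AB|AC":
        return 2 * r[2], tc
    return None, tc

def estimate_fraction(loci, f0=0.05, tol=1e-4, max_iter=100):
    """Iterate genotype fitting and fraction estimation until |f - f0| < tol."""
    for _ in range(max_iter):
        fc_sum = tc_sum = 0
        for counts in loci:
            genotype = min(CANDIDATES, key=lambda g: g_stat(counts, g, f0))  # (b2)
            fc, tc = fc_tc(counts, genotype)                                 # (b3)
            if fc is not None:
                fc_sum += fc
                tc_sum += tc
        if tc_sum == 0:
            raise ValueError("no informative loci")
        f = fc_sum / tc_sum                                                  # (b4)
        if abs(f - f0) < tol:                                                # (b5)
            return f
        f0 = f
    return f0

loci = [[1100, 900], [1900, 100], [1000, 900, 100], [2000], [1000, 1000]]
f_hat = estimate_fraction(loci)
```

The example loci were constructed at a fraction of 0.1, and the loop recovers that value after two passes from a starting guess of 0.05.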
在一些实施方案中,在步骤(c3)中利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的基因型,所述估计依次包括如下步骤:
(c3-a1)对于目标组每一个靶DNA位点,列出其所有可能的基因型;
(c3-a2)对于目标组每一个靶DNA位点,对于其每一个可能的基因型,根据样本中最少组分DNA浓度和该位点各个等位基因的总计数,计算其各个等位基因的理论计数;
(c3-a3)对于目标组每一个靶DNA位点,对于其每一个可能的基因型,利用靶DNA位点各个等位基因计数和其理论计数进行拟合优度检验;
(c3-a4)对于目标组每一个靶DNA位点,根据对其所有可能基因型的拟合优度检验结果,选择最优拟合的基因型为该靶DNA位点的基因型。
在一些实施方案中,在步骤(c3)中利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的核型,所述估计依次包括如下步骤:
(c3-b1)分析待测样本,列出待检测目标染色体或亚染色体片段的所有可能的核型;
(c3-b2)对于每一个可能的核型,列出目标组各个靶DNA位点所有可能的基因型;
(c3-b3)对目标组每一个靶DNA位点,首先利用其各个等位基因计数对其所有可能的基因型进行拟合优度检验,然后对每一个可能的核型选择一个对该核型有最优拟合的基因型;
(c3-b4)综合分析所有靶DNA位点对每一个核型的拟合优度检验结果,选择对所有靶DNA位点综合拟合最好的核型作为待检测目标染色体或亚染色体片段的核型。
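In some embodiments, in step (c3), the wild-type/mutant status of the target to be tested in the sample is estimated by a goodness-of-fit test using the allele counts of the target DNA loci of the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order: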
(c3-c1a)对于目标组每一个靶DNA位点,列出其所有可能的野生突变基因型;
(c3-c2a)对于目标组每一个靶DNA位点,对于其每一个可能的野生突变基因型,根据样本中最少组分DNA浓度和该位点各个等位基因的总计数,计算其各个等位基因的理论计数;
(c3-c3a)对于目标组每一个靶DNA位点,对于其每一个可能的野生突变基因型,利用靶DNA位点各个等位基因计数和其理论计数进行拟合优度检验;
(c3-c4a)综合分析目标组所有靶DNA位点,选择对所有靶位点有最优拟合的野生突变基因型为待测目标的野生突变基因型。
在一些实施方案中,在步骤(c3)中利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法,估计样本中待检测目标的野生突变型,所述估计依次包括如下步骤:
(c3-c1b)对于目标组每一个靶DNA位点,根据其各个等位基因计数和样本中最少组分DNA的浓度,采取拟合优度检验方法估计其基因型;
(c3-c2b)根据目标组每一个靶DNA位点的基因型和其各个等位基因的序列,确定样本各个组分中待测目标各个等位基因的野生突变型。
在一些实施方案中,利用一种或几种能用来检验观测数与理论数之间一致性的统计检验方法来进行拟合优度检验。在一些实施方案中,拟合优度检验是卡方检验。在一些实施方案中,拟合优度检验是G检验。在一些实施方案中,拟合优度检验是费希尔精确检验。在一些实施方案中,拟合优度检验是二项分布检验。在一些实施方案中,拟合优度检验是卡方检验和/或G检验和/或费希尔精确检验和/或二项分布检验。在一些实施方案中,拟合优度检验是利用G检验的计算值G值和/或AIC值和/或经校正的G值和/或经校正的AIC值和/或由G值或AIC值的衍生的值来进行拟合优度检验。
在一些实施方案中,在步骤(d3)中利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的基因型,所述估计依次包括如下步骤:
(d3-a1)对于目标组每一个靶DNA位点,列出其所有可能的基因型;
(d3-a2)对于目标组靶DNA位点每一个可能的基因型,首先根据样本中最少组分DNA的浓度计算其各个等位基因的相对计数理论值,然后选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记该基因型的理论位置;
(d3-a3)对于目标组每一个靶DNA位点,首先计算其各个等位基因的相对计数,然后选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该靶DNA位点在等位基因相对计数图上的实际位置;
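(d3-a4) inferring the genotype of the target to be tested from the theoretical and actual positions of the target DNA loci of the target group in the allele relative count plot.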
在一些实施方案中,在步骤(d3)中利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的核型,所述估计依次包括如下步骤:
(d3-b1)分析待测样本,列出待检测目标染色体或亚染色体片段的所有可能的核型;
(d3-b2)对于每一个可能的核型,列出目标组各个靶DNA位点所有可能的基因型;
(d3-b3)对于目标组靶DNA位点每一个可能的基因型,首先根据样本中最少组分DNA的浓度计算其各个等位基因的相对计数理论值,然后选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记该基因型的理论位置;
(d3-b4)对于目标组每一个靶DNA位点,首先计算其各个等位基因的相对计数,然后选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该靶DNA位点在等位基因相对计数图上的实际位置;
(d3-b5)根据在等位基因相对计数图中目标组各个靶DNA位点在各个核型的理论位置分布以及其实际位置分布推断待测目标的核型。
在一些实施方案中,在步骤(d3)中利用目标组各个靶DNA位点的等位基因计数和样本中最少组分DNA的浓度,采取等位基因计数相对分布图方法,估计样本中待检测目标的野生突变型,所述估计依次包括如下步骤:
(d3-c1)对于目标组每一个靶DNA位点,列出其野生型序列和所有可能的野生突变基因型;
(d3-c2)对于每一个可能的野生突变基因型,计算其野生型等位基因和其它非野生型各个等位基因的相对计数理论值,并选取至少一个非野生型等位基因相对计数理论值对野生型等位基因相对计数理论值作图来标记其野生突变基因型的理论位置;
(d3-c3)对于目标组每一个靶DNA位点,计算其野生型等位基因和其它非野生型各个等位基因的相对计数值,并选取至少一个非野生型等位基因相对计数对野生型等位基因相对计数作图来标记该靶DNA位点在等位基因相对计数图上的实际位置;
(d3-c4)根据目标组所有靶DNA位点在等位基因相对计数图中的理论位置分布以及实际位置分布推断其野生突变型。
在一些实施方案中,在步骤(a2)中进行的对每一个靶DNA位点,首先利用该靶位点各个等位基因计数估算其基因型,然后根据估算的基因型估算来源于最少组分DNA的计数(FC)和总计数(TC),所述估算依次包括如下步骤:
(a2-i)对靶DNA位点的各个等位基因计数进行从大到小排序,其中最大的三个等位基因计数依次标记为R1、R2和R3;
(a2-ii)利用靶DNA位点的各个等位基因计数,估算该靶DNA位点的基因型;
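(a2-iii) estimating, from the estimated genotype of the target DNA locus and its allele counts, the count derived from the minor-component DNA (FC) and the total count (TC).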
在一些实施方案中,在步骤(a2-ii)中进行的利用靶DNA位点的各个等位基因计数,估算该靶DNA位点的基因型,所述估算依次包括如下步骤:
(a2-ii-1)利用靶DNA位点的各个等位基因计数,判断靶DNA位点中检测到的高于噪声阈值的等位基因数量;如果判断结果是1,则执行下述步骤(a2-ii-2);如果判断结果是2,则执行下述步骤(a2-ii-3);如果判断结果为大于2,则执行下述步骤(a2-ii-4);
(a2-ii-2)估算该靶DNA位点的基因型为AA|AA,然后执行下述步骤(a2-ii-5);
(a2-ii-3)根据检测到的高于噪声阈值的等位基因数量为2和靶DNA位点的最大的两个等位基因计数,估计靶DNA位点的基因型,然后执行下述步骤(a2-ii-5);
(a2-ii-4)根据检测到的高于噪声阈值的等位基因数量大于2和靶DNA位点的最大的至少两个的等位基因计数,估计靶DNA位点的基因型,然后执行下述步骤(a2-ii-5);
(a2-ii-5)输出估算的该靶位点的基因型。
在一些实施方案中,在步骤(a2-ii-3)中进行的根据检测到的高于噪声阈值的等位基因数量为2和靶DNA位点的最大的两个等位基因计数,估计靶DNA位点的基因型,所述估算依次包括如下步骤:
(a2-ii-3-1)判断R1/(R1+R2)的值是否小于0.5+α,如果判断结果为是,则估算该靶DNA位点的基因型为AB|AB,然后执行下述步骤(a2-ii-3-3);如果判断结果为否,则执行下述步骤(a2-ii-3-2);
(a2-ii-3-2)判断R1/(R1+R2)的值是否小于0.75,如果判断结果为是,则估算该靶DNA位点的基因型为AB|AA,然后执行下述步骤(a2-ii-3-3);如果判断结果为否,则估算该靶DNA位点的基因型为AA|AB,然后执行下述步骤(a2-ii-3-3);
(a2-ii-3-3)输出估算的该靶位点的基因型。
在一些实施方案中,在步骤(a2-ii-4)中进行的根据检测到的高于噪声阈值的等位基因数量大于2和靶DNA位点的最大的至少两个的等位基因计数,估计靶DNA位点的基因型,所述估算依次包括如下步骤:
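(a2-ii-4-1) determining whether R2/R1 is greater than or equal to 0.5, and/or whether R1/(R1+R2) lies between 1/2 and 2/3 inclusive, and/or whether R2/(R1+R2) lies between 1/3 and 1/2 inclusive; if so, estimating the genotype of the target DNA locus as AB|AC and then performing step (a2-ii-4-3) below; if not, performing step (a2-ii-4-2) below;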
(a2-ii-4-2)标记该位点的等位基因计数为异常,然后或者估算该靶位点的基因型为NA,并执行下述步骤(a2-ii-4-3);或者设定该靶DNA位点中检测到的高于噪声阈值的等位基因数量为2,然后按照步骤(a2-ii-3)所述估算该靶位点的基因型,并执行下述步骤(a2-ii-4-3);
(a2-ii-4-3)输出估算的该靶位点的基因型。
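The decision steps (a2-ii-1) to (a2-ii-5), including the two-allele branch (a2-ii-3) and the multi-allele branch (a2-ii-4), can be sketched as one function. It assumes an allele counts as noise when its count is below TC×α, uses the R2/R1 ≥ 0.5 form of the AB|AC test, and exposes the two alternatives of step (a2-ii-4-2) through a hypothetical `fallback_to_two` flag.

```python
def estimate_genotype(counts, alpha, fallback_to_two=False):
    """Estimate a maternal|fetal genotype from the allele counts of one locus."""
    tc = sum(counts)
    # Keep only alleles at or above the noise cutoff TC * alpha, largest first.
    r = sorted((c for c in counts if c >= tc * alpha), reverse=True)
    if not r:
        return "NA"
    if len(r) == 1:
        return "AA|AA"                      # (a2-ii-2)
    if len(r) > 2:                          # (a2-ii-4)
        if r[1] / r[0] >= 0.5:
            return "AB|AC"
        if not fallback_to_two:
            return "NA"                     # flagged as abnormal
        # otherwise treat the locus as having two alleles
    ratio = r[0] / (r[0] + r[1])            # (a2-ii-3)
    if ratio < 0.5 + alpha:
        return "AB|AB"
    if ratio < 0.75:
        return "AB|AA"
    return "AA|AB"

calls = [estimate_genotype(c, alpha=0.01)
         for c in ([2000, 3], [1000, 1000], [1100, 900], [1900, 100], [1000, 900, 100])]
```

With α = 0.01 the five example loci are called AA|AA, AB|AB, AB|AA, AA|AB and AB|AC respectively.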
在一些实施方案中,在步骤(a2-iii)中进行的根据估算的靶DNA位点的基因型和靶DNA位点的各个等位基因计数,估算来源于最少组分DNA的计数(FC)和总计数(TC),其中最大的三个等位基因计数依次标记为R1、R2和R3,所述估算依次包括如下步骤:
(a2-iii-1)如果靶位点估计的基因型是AA|AA,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-2)如果靶位点估计的基因型是AB|AB,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-3)如果靶位点估计的基因型是AB|AA,则估算来源于最少组分DNA的计数(FC)为R1-R2,估算总计数(TC)为R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-4)如果靶位点估计的基因型是AA|AB,则估算来源于最少组分DNA的计数(FC)为R2的2倍,估算总计数(TC)为R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-5)如果靶位点估计的基因型是AB|AC,则估算来源于最少组分DNA的计数(FC)为R1-R2+R3或R3的2倍或(R1-R2)的2倍,估算总计数(TC)为R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-6)如果靶位点估计的基因型不是上述所述基因型中的一种,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-7)输出估算的来源于最少组分DNA的计数(FC)和总计数(TC)。
在一些实施方案中,在步骤(b2)中进行的对每一个靶DNA位点利用其各个等位基因计数和样本中最少组分DNA的浓度值f 0估算其基因型,所述估算依次包括如下步骤:
(b2-i)根据样本来源,列出靶DNA位点所有可能的基因型;
(b2-ii)对靶DNA位点的每一个可能基因型,利用样本中最少组分DNA的浓度值f 0和靶DNA位点各个等位基因的总计数(TC),计算其各个等位基因的理论计数;
(b2-iii)对靶DNA位点的每一个可能基因型,利用靶DNA位点的各个等位基因计数及其各个等位基因理论计数进行拟合优度检验;
(b2-iv)分析靶DNA位点对所有可能的基因型的拟合优度检验结果,选择对靶DNA位点各个等位基因计数有最优拟合的基因型作为估算的靶DNA位点的基因型。
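In some embodiments, in step (b3), for each target DNA locus, the count derived from the minor-component DNA (FC) and the total count (TC) are estimated from its estimated genotype, with the four largest allele counts labeled R1, R2, R3 and R4 in order; the estimation comprises the following steps in order: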
(b3-1)如果靶位点估计的基因型是AA|AA,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-2)如果靶位点估计的基因型是AB|AB,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-3)如果靶位点估计的基因型是AB|AA,则估算来源于最少组分DNA的计数(FC)为R1-R2,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-4)如果靶位点估计的基因型是AA|AB,则估算来源于最少组分DNA的计数(FC)为R2的2倍,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-5)如果靶位点估计的基因型是AB|AC,则估算来源于最少组分DNA的计数(FC)为R1-R2+R3或R3的2倍或(R1-R2)的2倍,估算总计数(TC)为R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-6)如果靶位点估计的基因型是AA|BB,则估算来源于最少组分DNA的计数(FC)为R2,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-7)如果靶位点估计的基因型是AA|BC,则估算来源于最少组分DNA的计数(FC)为R2+R3或R2的2倍或R3的2倍,估算总计数(TC)为R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-8)如果靶位点估计的基因型是AB|CC,则判断当前估计值f 0是否大于等于1/3,如果判断结果为是,则估算来源于最少组分DNA的计数(FC)为R1,估算总计数(TC)为R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);如果判断结果为否,则估算来源于最少组分DNA的计数(FC)为R3,估算总计数(TC)为R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-9)如果靶位点估计的基因型是AB|CD,则估算来源于最少组分DNA的计数(FC)为R3+R4或R3的2倍或R4的2倍,估算总计数(TC)为R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-10)如果靶位点估计的基因型不是上述所述基因型中的一种,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-11)输出估算的来源于最少组分DNA的计数(FC)和总计数(TC)。
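Steps (b3-1) to (b3-11) amount to a lookup from genotype to an FC formula. The sketch below picks the first FC option listed for each genotype and always uses TC = R1+R2+R3+R4, which is one of the permitted TC choices in every case; `None` stands for NA. The function name is hypothetical.

```python
def fc_tc(genotype, counts, f0):
    """Return (FC, TC) for one locus given its estimated maternal|fetal genotype."""
    r = sorted(counts, reverse=True) + [0, 0, 0, 0]
    r1, r2, r3, r4 = r[:4]
    tc = r1 + r2 + r3 + r4
    if genotype == "AB|AA":
        fc = r1 - r2
    elif genotype == "AA|AB":
        fc = 2 * r2
    elif genotype == "AB|AC":
        fc = r1 - r2 + r3
    elif genotype == "AA|BB":
        fc = r2
    elif genotype == "AA|BC":
        fc = r2 + r3
    elif genotype == "AB|CC":
        # Allele C (fraction f) outranks A and B (fraction (1 - f) / 2 each)
        # once f >= 1/3, so the fetal allele is R1 rather than R3.
        fc = r1 if f0 >= 1 / 3 else r3
    elif genotype == "AB|CD":
        fc = r3 + r4
    else:
        fc = None   # AA|AA, AB|AB and unrecognized genotypes carry no FC
    return fc, tc

fc, tc = fc_tc("AB|AC", [1000, 900, 100], f0=0.1)   # FC/TC = 200/2000 = 0.1
```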
在一些实施方案中,本发明涉及一种用于计算样本中最少组分DNA的浓度的装置,所述装置包括:
(a1)用于设定样本的噪声阈值α的模块;
(a2)用于对每一个靶DNA位点,首先利用该靶位点各个等位基因计数估算其基因型,然后根据估算的基因型估算来源于最少组分DNA的计数(FC)和总计数(TC)的模块;
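(a3) a calculation module that estimates the concentration of the minor-component DNA using the minor-component DNA counts (FC) and total counts (TC) of the target loci of the reference group.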
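In some embodiments, the estimation performed in step (a2) of the invention, in which for each target DNA locus the genotype is first estimated from the allele counts of that locus and the count derived from the minor-component DNA (FC) and the total count (TC) are then estimated from the estimated genotype, comprises the following steps: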
(a2-i)对靶DNA位点的各个等位基因计数进行从大到小排序,其中最大的三个等位基因计数依次标记为R1、R2和R3;
(a2-ii)利用靶DNA位点的各个等位基因计数,估算该靶DNA位点的基因型;
(a2-iii)根据估算的靶DNA位点的基因型和靶DNA位点的各个等位基因计数,估算来源于最少组分DNA的计数(FC)和总计数(TC)。
在一些实施方案中,利用靶DNA位点的各个等位基因计数,其中最大的三个等位基因计数依次标记为R1、R2和R3,估算该靶DNA位点的基因型,包括如下步骤:
(a2-ii-1)利用靶DNA位点的各个等位基因计数,判断靶DNA位点中检测到的高于噪声阈值的等位基因数量;如果判断结果是1,则执行下述步骤(a2-ii-2);如果判断结果是2,则执行下述步骤(a2-ii-3);如果判断结果为大于2,则执行下述步骤(a2-ii-4);
(a2-ii-2)估算该靶DNA位点的基因型为AA|AA,然后执行下述步骤(a2-ii-5);
(a2-ii-3)根据检测到的高于噪声阈值的等位基因数量为2和靶DNA位点的最大的两个等位基因计数,估计靶DNA位点的基因型,然后执行下述步骤(a2-ii-5);
(a2-ii-4)根据检测到的高于噪声阈值的等位基因数量大于2和靶DNA位点的最大的至少两个的等位基因计数,估计靶DNA位点的基因型,然后执行下述步骤(a2-ii-5);
(a2-ii-5)输出估算的该靶位点的基因型。
在一些实施方案中,根据检测到的高于噪声阈值的等位基因数量为2和靶DNA位点的最大的两个等位基因计数,估计靶DNA位点的基因型,其中最大的两个等位基因计数分别标记为R1和R2,包括如下步骤:
(a2-ii-3-1)判断R1/(R1+R2)的值是否小于0.5+α,如果判断结果为是,则估算该靶DNA位点的基因型为AB|AB,然后执行下述步骤(a2-ii-3-3);如果判断结果为否,则执行下述步骤(a2-ii-3-2);
(a2-ii-3-2)判断R1/(R1+R2)的值是否小于0.75,如果判断结果为是,则估算该靶DNA位点的基因型为AB|AA,然后执行下述步骤(a2-ii-3-3);如果判断结果为否,则估算该靶DNA位点的基因型为AA|AB,然后执行下述步骤(a2-ii-3-3);
(a2-ii-3-3)输出估算的该靶位点的基因型。
在一些实施方案中,根据检测到的高于噪声阈值的等位基因数量大于2和靶DNA位点的最大的至少两个的等位基因计数,估计靶DNA位点的基因型,其中最大的两个等位基因计数分别标记为R1和R2,包括如下步骤:
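(a2-ii-4-1) determining whether R2/R1 is greater than or equal to 0.5, and/or whether R1/(R1+R2) lies between 1/2 and 2/3 inclusive, and/or whether R2/(R1+R2) lies between 1/3 and 1/2 inclusive; if so, estimating the genotype of the target DNA locus as AB|AC and then performing step (a2-ii-4-3) below; if not, performing step (a2-ii-4-2) below;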
(a2-ii-4-2)标记该位点的等位基因计数为异常,然后或者估算该靶位点的基因型为NA,并执行下述步骤(a2-ii-4-3);或者设定该靶DNA位点中检测到的高于噪声阈值的等位基因数量为2,然后按照步骤(a2-ii-3)所述估算该靶位点的基因型,并执行下述步骤(a2-ii-4-3);
(a2-ii-4-3)输出估算的该靶位点的基因型。
在一些实施方案中,根据估算的靶DNA位点的基因型和靶DNA位点的各个等位基因计数,估算来源于最少组分DNA的计数(FC)和总计数(TC),其中最大的三个等位基因计数依次标记为R1、R2和R3,包括如下步骤:
(a2-iii-1)如果靶位点估计的基因型是AA|AA,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-2)如果靶位点估计的基因型是AB|AB,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-3)如果靶位点估计的基因型是AB|AA,则估算来源于最少组分DNA的计数(FC)为R1-R2,估算总计数(TC)为R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-4)如果靶位点估计的基因型是AA|AB,则估算来源于最少组分DNA的计数(FC)为R2的2倍,估算总计数(TC)为R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-5)如果靶位点估计的基因型是AB|AC,则估算来源于最少组分DNA的计数(FC)为R1-R2+R3或R3的2倍或(R1-R2)的2倍,估算总计数(TC)为R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-6)如果靶位点估计的基因型不是上述所述基因型中的一种,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3,然后执行下述步骤(a2-iii-7);
(a2-iii-7)输出估算的来源于最少组分DNA的计数(FC)和总计数(TC)。
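In some embodiments, the calculation module of step (a3) calculates the concentration of the minor-component DNA in the sample from the FC and TC counts by linear regression or robust linear regression, or by using the mean or median of FC and TC.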
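Two of the estimators named for step (a3) can be sketched with the standard library alone. The first is an ordinary least-squares line through the origin (slope = ΣFC·TC / ΣTC²); it is a plain substitute for the robust regression mentioned in the text, not a robust fit. The second takes the median of per-locus FC/TC ratios, which is one possible reading of "median of FC and TC". Loci whose FC is NA are passed as `None` and skipped; function names are hypothetical.

```python
from statistics import median

def slope_through_origin(fc, tc):
    """Least-squares slope of FC on TC for a line forced through the origin."""
    pairs = [(y, x) for y, x in zip(fc, tc) if y is not None]
    return sum(y * x for y, x in pairs) / sum(x * x for _, x in pairs)

def median_ratio(fc, tc):
    """Median of the per-locus FC/TC ratios over informative loci."""
    return median(y / x for y, x in zip(fc, tc) if y is not None)

fc = [200, None, 190, 210]
tc = [2000, 2000, 1900, 2100]
f_regression = slope_through_origin(fc, tc)   # approximately 0.1
f_median = median_ratio(fc, tc)               # approximately 0.1
```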
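In some embodiments, the invention relates to a device for calculating the concentration of the minor-component DNA in a sample, the device comprising: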
(b1)设定模块,其设定样本的噪声阈值α、初始浓度估计值f 0和迭代误差精度值ε;
(b2)用于对每一个靶DNA位点利用其各个等位基因计数和样本中最少组分DNA的浓度值f 0估算其基因型的模块;
(b3)估算模块,其对每一个靶DNA位点,根据其估算的基因型来估算来源于最少组分DNA的计数(FC)和总计数(TC);
(b4)用于利用最少组分DNA的计数(FC)和总计数(TC)估算最少组分DNA的浓度f的模块;
(b5)判断模块,其判断f-f 0的绝对值是否小于ε,如果判断结果为否,则设定f 0=f,然后执行步骤(b2);如果判断结果为是,则样本中最少组分DNA浓度估算为f。
在一些实施方案中,对每一个靶DNA位点利用其等位基因计数和样本中最少组分DNA的浓度值f 0估算其基因型包括如下步骤:
(b2-i)根据样本来源,列出靶DNA位点所有可能的基因型;
(b2-ii)对靶DNA位点的每一个可能基因型,利用样本中最少组分DNA的浓度值f 0和靶DNA位点各个等位基因的总计数(TC),计算其各个等位基因的理论计数;
(b2-iii)对靶DNA位点的每一个可能基因型,利用靶DNA位点的各个等位基因计数及其各个等位基因理论计数进行拟合优度检验;
(b2-iv)分析靶DNA位点对所有可能的基因型的拟合优度检验结果,选择对靶DNA位点各个等位基因计数有最优拟合的基因型作为估算的靶DNA位点的基因型。
在一些实施方案中,对每一个靶DNA位点,根据其估算的基因型来估算来源于最少组分DNA的计数(FC)和总计数(TC),其中最大的四个等位基因计数依次标记为R1、R2、R3和R4,包括如下步骤:
(b3-1)如果靶位点估计的基因型是AA|AA,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-2)如果靶位点估计的基因型是AB|AB,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-3)如果靶位点估计的基因型是AB|AA,则估算来源于最少组分DNA的计数(FC)为R1-R2,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-4)如果靶位点估计的基因型是AA|AB,则估算来源于最少组分DNA的计数(FC)为R2的2倍,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-5)如果靶位点估计的基因型是AB|AC,则估算来源于最少组分DNA的计数(FC)为R1-R2+R3或R3的2倍或(R1-R2)的2倍,估算总计数(TC)为R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-6)如果靶位点估计的基因型是AA|BB,则估算来源于最少组分DNA的计数(FC)为R2,估算总计数(TC)为R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-7)如果靶位点估计的基因型是AA|BC,则估算来源于最少组分DNA的计数(FC)为R2+R3或R2的2倍或R3的2倍,估算总计数(TC)为R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
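(b3-8) if the estimated genotype of the target locus is AB|CC, determining whether the current estimate f0 is greater than or equal to 1/3; if so, estimating the count derived from the minor-component DNA (FC) as R1 and the total count (TC) as R1+R2+R3 or R1+R2+R3+R4, then performing step (b3-11) below; if not, estimating FC as R3 and TC as R1+R2+R3 or R1+R2+R3+R4, then performing step (b3-11) below;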
(b3-9)如果靶位点估计的基因型是AB|CD,则估算来源于最少组分DNA的计数(FC)为R3+R4或R3的2倍或R4的2倍,估算总计数(TC)为R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-10)如果靶位点估计的基因型不是上述所述基因型中的一种,则估算来源于最少组分DNA的计数(FC)为NA,估算总计数(TC)为R1或R1+R2或R1+R2+R3或R1+R2+R3+R4,然后执行下述步骤(b3-11);
(b3-11)输出估算的来源于最少组分DNA的计数(FC)和总计数(TC)。
在一些实施方案中,所述样本为母体血浆样本,以及所述最少组分DNA为胎儿DNA。在一些实施方案中,所述样本为来源于植入前诊断的胚胎核酸。
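In some embodiments, the invention provides a diagnostic kit for carrying out the method of the invention. The kit includes at least one set of primers for amplifying reference-group target DNA loci and/or target-group target DNA loci. The target-group loci are selected from chromosomes that may carry a chromosomal aneuploidy, and/or chromosomal segments that may carry a copy number variation, and/or loci that may carry a pathogenic variant of a monogenic disease. The nucleic acid sequences of the target-group loci are generally polymorphic in the population to be tested and/or are loci that may carry a pathogenic variant of a monogenic disease. The reference-group loci are selected from chromosomes that normally have no chromosomal aneuploidy and/or chromosomal segments that normally have no copy number variation. The nucleic acid sequences of the reference-group loci are generally polymorphic in the population to be tested. In some embodiments, the reference-group loci are selected from chromosomal regions considered free of chromosomal aneuploidy or segmental copy number variation in the sample. In some embodiments, the reference chromosome or reference chromosomal region is selected from chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X and Y, and sometimes from the autosomes (that is, not X or Y). In some embodiments, the target loci are selected from chromosomal regions in which a chromosomal aneuploidy or segmental copy number variation may be present in the sample. In some embodiments, the target loci are selected from nucleic acid regions in which a pathogenic variant of a monogenic disease is, or may be, present in the sample. In some embodiments, the target chromosomal region is selected from chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X and Y. Preferably, the target-group loci are selected from chromosome 13 and/or chromosome 18 and/or chromosome 21 and/or the X chromosome and/or the Y chromosome. Preferably, the kit includes primers for amplifying target nucleic acids from chromosomes 13, 18, 21, X and/or Y. Preferably, the target-group loci are selected from the chromosomal regions of 1p36 deletion syndrome, cri-du-chat syndrome, Charcot-Marie-Tooth disease, DiGeorge syndrome, Duchenne muscular dystrophy, Williams-Beuren syndrome, Wolf-Hirschhorn syndrome, 15q13.3 microdeletion syndrome, Miller-Dieker syndrome, Smith-Magenis syndrome, Angelman syndrome and Langer-Giedion syndrome. Preferably, the kit includes primers for amplifying target nucleic acids from the chromosomal regions of these same syndromes. It should be understood that the reference chromosome, or the part of it that contains the target locus region, is a euploid chromosome; euploid means having the normal number of chromosomes. Other reagents that may optionally be included in the kit are instructions for use, the polymerase and buffers for the PCR and/or multiplex PCR reactions, and the reagents required for constructing high-throughput sequencing libraries from the amplified fragments.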
在一些实施方案中,本发明提供了用于实施本发明的方法的诊断试剂盒。此诊断试剂盒包括用于执行步骤(2)和/或步骤(3)的引物。任选可被包括在诊断试剂盒中的其它试剂是使用说明、进行PCR和/或多重PCR反应的聚合酶和缓冲液和对扩增的片段进行高通量测序文库构建所需要的试剂。
在一些实施方案中,本发明提供了用于实施本发明的方法的一种系统,其用于实施从生物测试样品预测待检测目标的核型或基因型或野生突变型的方法中的一个或多个步骤,例如步骤(4)至步骤(5)中的一个或多个。在一些实施方案中,本发明提供了用于实施本发明的方法的装置和/或计算机程序产品和/或系统和/或模块,该装置和/或计算机程序产品和/或系统和/或模块用于执行上述步骤(1)-步骤(5)、上述步骤(a1)-步骤(a3)、上述步骤(b1)-步骤(b5)、上述步骤(c1)-步骤(c3)、上述步骤(d1)-步骤(d3)和/或上述步骤(e1)-步骤(e3)中的任何步骤。
在一方面,本发明涉及以下实施方案:
1.一种检测样本遗传变异的方法,其特征在于依次包括如下步骤:
(1)接收待测生物样品并制备核酸;
(2)富集或扩增靶DNA序列,其中至少有一个靶DNA序列在样本中有多于一个的等位基因;
(3)测序所扩增的靶DNA;
(4)对每一个靶DNA序列,统计其各个等位基因的计数;
(5)利用等位基因计数的拟合优度检验和/或等位基因计数相对分布图确定样本中待检测目标位点的核型或基因型或野生突变型。
2.如实施方案1所述的方法,其特征在于在步骤(5)中利用等位基因计数的拟合优度检验确定样本中待检测目标位点的核型或基因型或野生突变型,所述确定依次包括如下步骤:
(A1)将靶DNA序列根据其在染色体上的定位分为参照组序列和目标组序列;
(A2)利用参照组各个靶DNA序列的等位基因计数,采取等位基因计数相对比例法或等位基因计数迭代拟合基因型法计算样本中最少组分DNA的浓度;
(A3)利用目标组各个靶DNA序列的等位基因计数的拟合优度检验,估计样本中待检测目标位点的核型或基因型或野生突变型。
3.如实施方案1所述的方法,其特征在于在步骤(5)中利用等位基因计数相对分布图确定样本中待检测目标位点的核型或基因型或野生突变型,所述确定依次包括如下步骤:
(B1)将靶DNA序列根据其在染色体上的定位分为参照组序列和目标组序列;
(B2)利用参照组各个靶DNA序列的等位基因计数,采取等位基因计数相对比例法或等位基因计数迭代拟合基因型法计算样本中最少组分DNA的浓度;
(B3)利用目标组各个靶DNA序列的等位基因计数相对分布图,估计样本中待检测目标位点的核型或基因型或野生突变型。
4.如实施方案1所述的方法,其特征在于在步骤(5)中利用等位基因计数相对分布图确定样本中待检测目标位点的核型或基因型或野生突变型,其中待检测样本是单一基因组样本,所述确定依次包括如下步骤:
(C1)计算目标组每一个靶DNA序列的各个等位基因相对计数;
(C2)对每一个靶DNA序列,将第二大的等位基因相对计数对最大的等位基因相对计数作分布图A或将最大的等位基因相对计数对该靶DNA序列在染色体或亚染色体上的相对位置作分布图B;
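(C3) using relative distribution plot A and/or plot B of the allele counts of the target DNA sequences of the target group, estimating the karyotype of the target region to be tested in the single-genome sample.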
5.如实施方案2或3所述的方法,其特征在于在步骤(A2)或步骤(B2)中采取等位基因计数相对比例法计算样本中最少组分DNA的浓度,所述计算依次包括如下步骤:
(a1)设定样本的噪声背景值α;
(a2)对每一个靶DNA序列估算来源于最少组分DNA的计数(FC)和总计数(TC);
(a3)利用最少组分DNA的计数(FC)和总计数(TC)估算最少组分DNA的浓度。
6.如实施方案2或3所述的方法,其特征在于在步骤(A2)或步骤(B2)中采取等位基因计数迭代拟合基因型法计算样本中最少组分DNA的浓度,所述计算依次包括如下步骤:
(b1)设定样本的噪声背景值α、初始浓度估计值f 0和迭代误差精度值ε;
(b2)对每一个靶DNA序列利用其等位基因计数和f 0估算其基因型;
(b3)对每一个靶DNA序列,根据其估算的基因型来估算来源于最少组分DNA的计数(FC)和总计数(TC);
(b4)利用最少组分DNA的计数(FC)和总计数(TC)估算最少组分DNA的浓度f;
(b5)判断f-f 0的绝对值是否小于ε,如果判断结果为否,则设定f 0=f,然后执行步骤(b2);如果判断结果为是,则样本中最少组分DNA浓度估算为f。
7.如实施方案2所述的方法,其特征在于在步骤(A3)中利用目标组各个靶DNA序列的等位基因计数的拟合优度检验,估计样本中待检测目标位点的基因型,所述估计依次包括如下步骤:
(A3.a1)分析目标靶DNA序列位点,列出其所有可能的基因型;
(A3.a2)对每一个可能的基因型,根据估算的最少组分DNA浓度和靶DNA序列的总计数,计算其各个等位基因的理论计数,然后对靶DNA序列各个等位基因计数和其理论计数进行拟合优度检验;
(A3.a3)根据对靶DNA序列所有可能基因型的拟合优度检验结果,选择最优拟合的基因型为该靶DNA序列位点的基因型。
8.如实施方案2所述的方法,其特征在于在步骤(A3)中利用目标组各个靶DNA序列的等位基因计数的拟合优度检验,估计样本中待检测目标位点的核型,所述估计依次包括如下步骤:
(A3.b1)分析待测样本,列出样本在目标染色体或亚染色体片段上的所有可能的核型;
(A3.b2)对于每一个可能的核型,列出样本中该核型的染色体或亚染色体上目标组的靶DNA序列所有可能的基因型;
(A3.b3)对每一个目标组靶DNA序列,首先利用其各个等位基因计数对所有可能的基因型进行拟合优度检验,然后对每个核型选择一个对该核型有最优拟合的基因型;
(A3.b4)综合分析所有靶DNA序列对所有核型的拟合优度检验结果,选择对所有靶DNA序列综合拟合最好的核型作为待检测目标染色体或亚染色体片段的核型。
9.如实施方案2所述的方法,其特征在于在步骤(A3)中利用目标组各个靶DNA序列的等位基因计数的拟合优度检验,估计样本中待检测目标位点的野生突变型,所述估计依次包括如下步骤:
(A3.c1)利用靶DNA各个等位基因计数和拟合优度检验估计靶DNA序列的基因型;
(A3.c2)根据靶DNA序列的基因型和其各个等位基因的野生突变序列,确定样本各个组分中该靶DNA序列各个等位基因的野生突变型。
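10. The method of embodiment 3, characterized in that in step (B3) the genotype of the target locus to be tested in the sample is estimated using the relative distribution plot of the allele counts of the target DNA sequences of the target group, the estimation comprising the following steps in order: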
(B3.a1)分析待测样本,列出目标靶序列位点所有可能的基因型;
(B3.a2)计算每一个可能基因型中各个等位基因的相对计数理论值,并对每一个基因型选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记所有可能基因型的理论位置;
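(B3.a3) calculating the relative count of each allele of the target DNA sequence, and plotting at least one non-largest allele relative count against the largest allele relative count to mark the actual position of the allele relative counts of that target DNA sequence;
(B3.a4) inferring the genotype from the theoretical and actual positions of the target DNA sequence in the allele relative count plot.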
11.如实施方案3所述的方法,其特征在于在步骤(B3)中利用目标组各个靶DNA序列的等位基因计数相对分布图,估计样本中待检测目标位点的核型,所述估计依次包括如下步骤:
(B3.b1)分析待测样本,列出样本在目标染色体或亚染色体片段上的所有可能的核型;
(B3.b2)对于每一个可能的核型,列出样本中该核型的染色体或亚染色体上目标组的靶DNA序列所有可能的基因型,然后对于每一个基因型选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记该基因型的理论位置;
(B3.b3)对每一个目标组靶DNA序列,计算其各个等位基因的相对计数并选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该位点的实际位置;
(B3.b4)根据所有靶DNA序列在等位基因相对计数图形中的理论位置分布以及实际位置分布来推断待检测目标染色体或亚染色体片段的核型。
12.如实施方案3所述的方法,其特征在于在步骤(B3)中利用目标组各个靶DNA序列的等位基因计数相对分布图,估计样本中待检测目标位点的野生突变型,所述估计依次包括如下步骤:
(B3.c1)分析待测样本,列出目标靶序列位点的野生型序列和其所有可能的基因型;
(B3.c2)计算每一个可能基因型中野生型等位基因和其它非野生型各个等位基因的相对计数理论值,并对每一个基因型选取至少一个非野生型等位基因相对计数理论值对野生型等位基因相对计数理论值作图来标记所有可能基因型的理论位置;
(B3.c3)计算靶DNA序列的野生型等位基因和其它非野生型各个等位基因的相对计数,并选取至少一个非野生型等位基因相对计数对野生型等位基因相对计数作图来标记该靶DNA序列等位基因相对计数的实际位置;
(B3.c4)根据靶DNA序列在等位基因相对计数图形中的理论位置分布以及实际位置分布推断其野生突变型。
13.如实施方案5所述的方法,其特征在于在步骤(a2)中进行的所述估算依次包括如下步骤:
(i)对靶DNA序列的各个等位基因计数进行从大到小排序,其中最大的三个等位基因计数依次标记为R1、R2和R3;
(ii)判断靶DNA序列中检测到的高于噪声阈值的等位基因数量;如果判断结果是1,则估算该靶DNA序列基因型为AA|AA,然后执行下述步骤(vi);如果判断结果是2,则执行下述步骤(iii);如果判断结果为大于2,则执行下述步骤(v);
(iii)判断R1/(R1+R2)的值是否小于0.5+α,如果判断结果为是,则估算该靶DNA序列的基因型为AB|AB,然后执行下述步骤(vi);如果判断结果为否,则执行下述步骤(iv);
(iv)判断R1/(R1+R2)的值是否小于0.75,如果判断结果为是,则估算该靶DNA序列的基因型为AB|AA,然后执行下述步骤(vi);如果判断结果为否,则估算该靶DNA序列的基因型为AA|AB,然后执行下述步骤(vi);
(v)判断R2/R1的值是否小于0.5,如果判断结果为否,则估算该靶DNA序列的基因型为AB|AC,然后执行下述步骤(vi);如果判断结果为是,则标记该靶DNA序列为异常值,然后或者估算该靶DNA序列的基因型为NA,然后执行下述步骤(vi),或者执行上述步骤(iii);
(vi)根据估算的靶DNA序列基因型,估算来源于最少组分DNA的计数(FC)和总计数(TC)。
14.如实施方案6所述的方法,其特征在于在步骤(b2)中进行的所述估算依次包括如下步骤:
(i)根据样本来源,列出靶DNA序列的所有可能基因型;
(ii)对靶DNA序列的所有可能基因型,利用f 0和靶DNA序列各个等位基因的总计数(TC),计算其每一个等位基因的理论计数;
(iii)利用靶DNA序列的各个等位基因计数及其各个等位基因理论计数进行拟合优度检验;
(iv)分析靶DNA序列对所有可能的基因型的拟合优度检验结果,选择对靶DNA序列各个等位基因计数有最优拟合的基因型作为估算的靶DNA序列的基因型。
附图说明
图1是利用孕妇血浆cfDNA样本中的多个多态性位点各个等位基因计数估计胎儿DNA浓度的流程示意图。
图2是利用两个组分混合样本中的多个多态性位点各个等位基因计数估计其中最少组分DNA浓度的流程示意图。
图3是利用孕妇血浆cfDNA样本中的多态性位点测序估计胎儿DNA浓度。首先利用每一个多态性位点的各个等位基因计数估算其胎儿DNA计数(FC)和母亲与胎儿DNA总计数(TC),然后对所有的多态性位点的FC和TC计数进行过原点的rlm稳健回归拟合,而胎儿DNA浓度则估算为该拟合直线的斜率(模型系数)。
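Figure 4 shows estimation of the minor-component DNA concentration by sequencing polymorphic loci in a mixed-component DNA sample. The allele counts of each polymorphic locus are used to estimate its minor-component DNA count (FC) and the total DNA count of all components at that locus (TC). In Figure 4a, the FC and TC values of each polymorphic locus are fitted by rlm robust regression through the origin, and the minor-component DNA concentration is estimated as the slope of the line (the model coefficient). Figure 4b shows the results of estimating the minor-component DNA concentration by rlm robust regression for several different samples or biological replicates. Four mixed samples were replicated several times at the library preparation or sequencing level; the expected minor-component DNA concentrations were 0.01, 0.02, 0.10 or 0.20 (x axis), and the estimated minor-component DNA concentration of each sample is on the y axis. The dashed line marks the line y=x.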
图5是利用多态性位点各个等位基因计数检测胎儿染色体的单体变异。图5a是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中双体-双体核型染色体是否是胎儿单体异常。图5b是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中双体-单体核型染色体是否是胎儿单体异常。其中y轴AIC值是经校正的AIC值,由该位点的G检验的AIC值除以胎儿浓度再除以该位点各个等位基因的总计数得到。
图6是利用多态性位点各个等位基因计数检测胎儿染色体的三体变异。图6a是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中双体-双体核型染色体是否是胎儿三体异常。图6b是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中双体-三体核型染色体是否是胎儿三体异常。
图7是利用多态性位点各个等位基因计数估计待检测胎儿亚染色体水平的微缺失变异。图7a是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中单体-双体核型染色体是否是胎儿染色体的微缺失异常。图7b是图7a的局部放大。图7c是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中单体-单体核型染色体是否是胎儿染色体微缺失异常。图7d是图7c的局部放大。
图8是利用多态性位点各个等位基因计数估计待检测胎儿亚染色体水平的微重复变异。图8a是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中三体-双体核型染色体是否是胎儿染色体的微重复异常。图8b是图8a的局部放大。图8c是利用综合拟合优度检验结果来检测模拟的孕妇血浆cfDNA样本中三体-三体核型染色体是否是胎儿染色体的微重复异常。图8d是图8c的局部放大。
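Figure 9 shows detection of the fetal wild-type/mutant status at the short-sequence level using the allele counts of polymorphic loci. Figure 9a uses goodness-of-fit results to determine the genotype of a simulated short-sequence locus at which the mother is heterozygous for a mutation and the fetus is normal. Figure 9b is an enlarged view of part of Figure 9a. The results show that the estimated genotype of this locus is AB|AA, that is, the mother is heterozygous and the fetus is homozygous. Further analysis of the allele sequences shows that allele A is wild-type and allele B is mutant, so the wild-type/mutant status of the locus is: mother heterozygous mutant, fetus normal (Aa|AA). Figure 9c uses goodness-of-fit results to determine the genotype of a simulated short-sequence locus at which both mother and fetus are heterozygous for a mutation. Figure 9d is an enlarged view of part of Figure 9c. The results show that the estimated genotype of this locus is AB|AC, that is, both mother and fetus are heterozygous. Further analysis of the allele sequences shows that allele A is wild-type and alleles B and C are mutant, so the wild-type/mutant status of the locus is: both mother and fetus heterozygous mutant (Aa|Ab), and the fetus has either acquired a de novo mutation or inherited a paternal mutant allele.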
图10所示是利用等位基因计数相对分布图估计目标位点的基因型。图10a为正常双体-双体核型染色体上的多态性位点其各个等位基因相对计数的理论分布。图10b是正常双体-双体核型染色体上的多态性位点其第二大的等位基因相对计数相对于最大的等位基因相对计数的分布。
图11所示是孕妇血浆cfDNA样本中母亲核型正常的染色体上各个多态性位点各个等位基因相对计数的理论分布。图11a为双体-双体核型或双体-单体核型染色体上各个多态性位点的所有可能基因型及其各个等位基因的相对计数理论值。图11b是双体-双体核型和双体-单体核型染色体上的各个多态性位点其第二大的等位基因相对计数相对于最大的等位基因相对计数的理论分布。图11c为双体-双体核型或双体-三体核型染色体上各个多态性位点的所有可能基因型及其各个等位基因的相对计数理论值。图11d是双体-双体核型和双体-三体核型染色体上的各个多态性位点其第二大或第四大的等位基因相对计数相对于最大的等位基因相对计数的理论分布。
图12所示是孕妇血浆cfDNA样本中目标组亚染色体水平上各个多态性位点各个等位基因相对计数的理论分布。图12a为母亲或胎儿有或没有微缺失核型染色体上各个多态性位点的所有可能基因型及其各个等位基因的相对计数理论值。图12b是母亲或胎儿有或没有微缺失核型染色体上的各个多态性位点其第二大的等位基因相对计数相对于最大的等位基因相对计数的理论分布。图12c为母亲有或没有微重复而胎儿正常的亚染色体上各个多态性位点的所有可能基因型及其各个等位基因的相对计数理论值。图12d是母亲有或没有微重复而胎儿正常核型亚染色体上的各个多态性位点其第二大或第三大的等位基因相对计数相对于最大的等位基因相对计数的理论分布。
图13所示是孕妇血浆cfDNA样本中正常的双体-双体核型染色体上的待测位点所有可能的基因型及其各个等位基因相对计数的理论分布。图13a是正常的双体-双体核型染色体上的待测位点所有可能的基因型及其各个等位基因相对计数的理论值。图13b是正常的双体-双体核型染色体上的待测位点每一个可能的基因型其最大的非野生型等位基因相对计数相对于野生型等位基因相对计数的理论分布图。
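Figure 14 shows detection of fetal chromosomal monosomy using the relative distribution plot of the allele counts of polymorphic loci. Figure 14a uses the allele count relative distribution plot to estimate the karyotype of a normal disomy-disomy chromosome in a simulated maternal plasma cfDNA sample. Figure 14b uses the allele count relative distribution plot to estimate the karyotype of a disomy-monosomy chromosome in a simulated maternal plasma cfDNA sample.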
图15是利用多态性位点各个等位基因计数相对分布图检测胎儿染色体的三体变异。图15a是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中正常双体-双体染色体的核型。图15b是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中双体-三体染色体的核型。
图16是利用多态性位点各个等位基因计数相对分布图检测胎儿亚染色体水平的微缺失变异。图16a是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中单体-双体亚染色体的微缺失核型。图16b是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中单体-单体亚染色体的微缺失核型。
图17是利用多态性位点各个等位基因计数相对分布图检测胎儿亚染色体水平的微重复变异。图17a是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中三体-双体亚染色体的微重复核型。图17b是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中三体-三体亚染色体的微重复核型。
图18是利用多态性位点各个等位基因计数相对分布图检测胎儿在短序列水平的野生突变型。图18a是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中ab|Aa基因型位点的野生突变型。图18b是利用等位基因计数相对分布图估计模拟的孕妇血浆cfDNA样本中Aa|ab基因型位点的野生突变型。
图19所示是利用多态性位点各个等位基因相对计数检测单基因组样本中目标组染色体或亚染色体片段的核型。对目标组每个多态性位点,将其第二大的等位基因相对计数对其最大的等位基因相对计数作图(相对计数图)或者将其最大的等位基因相对计数对该位点在模拟的染色体上的相对位置作图(相对计数位置图)。根据各个多态性位点在相对计数图或相对计数位置图上的分布特征可以估计待测目标的核型。
具体实施方式
下面结合具体实施例,进一步阐述本发明。应当理解,这些实施例仅用于说明本发明而不用于限制本发明要求保护的范围。在不背离本发明精神和实质的情况下,对本发明方法、步骤或条件所作的修改或替换,均属于本发明的范围。
实施例1、分析计算孕妇血浆DNA样本中各个多态性位点的各个等位基因计数
在本实施例中,测序结果文件(Barrett,Xiong et al.2017,PLoS One 12:e0186771)来自于NIH的SRA数据库(BioProject ID:PRJNA387652)。
1.样本收集:在本实施例中,Barrett等从每一名孕妇收集外周血10-20毫升,然后使用QiaAmp Circulating Nucleic Acid kit(Qiagen)试剂盒根据厂商的方案提取血浆DNA(cfDNA)。在本实施例中,我们分析了157个用上述方法收集的血浆cfDNA样本。
2.多态性位点扩增与测序:Barrett等选择了44个在人群中有高最小等位基因频率(MAF>0.25)的多态性位点,然后设计45对扩增引物,其中包括44对序列特异性的多态性位点扩增引物以及一对ZFX/ZFY位点扩增引物。最后对每一个样本用45对引物和多重PCR方法进行扩增。扩增产物用TruSeq Nano DNA Sample Preparation kits(Illumina)试剂盒按照厂家说明书进行制备测序文库。然后按厂家说明书利用MiSeq测序仪进行测序。
3.准备数据分析索引:
(3.1)准备多态性位点的参考序列:利用Barrett等报道的45个扩增位点的正向引物、反向引物以及多态性位点在染色体的具体定位从人基因组序列数据库中提取每一个扩增产物的参考序列。
(3.2)准备多态性位点的定位索引:对于每一个扩增产物参考序列,手动分成三个区域,5’区、变异区和3’区,其中变异区是扩增产物参考序列中受该多态性位点的任何一个等位基因影响的核酸序列区域,而5’区是5’到3’方向从参考序列开始到变异区开始的核酸序列,3’区是5’到3’方向从变异区终止到参考序列终止的核酸序列。然后,对于每一个多态性位点,分别从5’区和3’区各选择不少于一组的独特序列作为该多态性位点的定位索引,其中独特序列是指该序列在所有多态性位点的扩增产物参考序列中是唯一的,而利用该序列能将扩增产物唯一定位到特定的多态性位点。
(3.3)准备多态性位点的等位基因计数索引:对于每一个多态性位点,首先从NCBI的dbSNP数据库下载其所有等位基因序列,然后对于每一个等位基因序列,选择一个独特的核酸序列作为该多态性位点的等位基因计数索引,其中独特的核酸序列是指在该多态性位点的所有可能的扩增产物参考序列中该独特的核酸序列是唯一的,并且对于该位点的同一个等位基因,该独特的核酸序列相同,而对于该位点的不同等位基因,该独特的核酸序列各不相同。
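4. Sequencing data analysis: Low-quality reads are filtered out first. Each remaining read is then scanned from start to end for the locus index of every polymorphic locus. If at least one locus index is found, the read is assigned to the corresponding polymorphic locus; otherwise the read is discarded. Finally, each read assigned to a polymorphic locus is scanned from start to end for the allele count indexes of that locus. If at least one allele count index is found, one of them is selected and the read is labeled as that allele of the locus; otherwise the read is discarded.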
5.统计每一个多态性位点的各个等位基因计数:对每一个样本,统计每一个多态性位点的每一个等位基因的测序序列数,即为每一个多态性位点的各个等位基因的计数。
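Steps 4 and 5 describe an index lookup by exact substring match. The sketch below assumes reads are plain strings that have already passed quality filtering, that `locus_index` maps each locus to its unique locating sequences, and that `allele_index` maps each locus to one unique sequence per allele; these data structures and the toy sequences are hypothetical.

```python
from collections import Counter

def count_alleles(reads, locus_index, allele_index):
    """Assign each read to a locus, then to an allele, and tally the counts."""
    counts = Counter()
    for read in reads:
        locus = next((name for name, seqs in locus_index.items()
                      if any(s in read for s in seqs)), None)
        if locus is None:
            continue                      # no locus index found: discard read
        allele = next((a for a, s in allele_index[locus].items() if s in read), None)
        if allele is None:
            continue                      # no allele index found: discard read
        counts[(locus, allele)] += 1
    return counts

locus_index = {"snp1": ["ACGTAC"]}
allele_index = {"snp1": {"A": "TTAGG", "B": "TTCGG"}}
reads = ["ACGTACTTAGGCC", "ACGTACTTCGGCC", "ACGTACTTAGGAA", "GGGGGGGG", "ACGTACTTTTT"]
counts = count_alleles(reads, locus_index, allele_index)
```

Here two reads are counted for allele A, one for allele B, and two are discarded (one matches no locus, one matches the locus but no allele).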
实施例2、分析计算两个独立基因组混合样本中各个多态性位点的各个等位基因计数
在本实施例中,测序结果文件(Kim,Kim et al.2019,Nat Commun 10:1047)来自于NIH的SRA数据库(BioProject ID:PRJNA517742)。
1.样本收集:在本实施例中,Kim等从两个独立的血液样本提取基因组DNA,其中一个作为主要成分,另一个作为次要成分。两个样本的基因组DNA按一定比例混合,分别得到次要成分占比分别为0.01、0.02、0.10和0.20的混合样本。
2.多态性位点扩增与测序:Kim等选择了645个在两个基因组样本中有多态性的位点,设计扩增引物。对每一个混合样本用扩增引物和多重PCR方法进行扩增。扩增产物分别按照厂家说明书所述方法制备测序文库,然后分别利用Ion Torrent或Illumina测序仪进行测序。在本实施例中,我们分析了测序结果中的Illumina测序数据集(ILA数据集)。
3.准备数据分析索引:我们利用Kim等报道的645个扩增位点在染色体的具体定位等信息从人基因组序列数据库中提取每一个扩增产物的参考序列,然后采用实施例1的步骤3中步骤(3.2)和步骤(3.3)中所述方法,制备每一个多态性位点的定位索引和其每一个等位基因的计数索引。
4.测序数据分析:对于每一个测序序列,首先过滤低质量序列,然后在经过过滤的测序序列中从头到尾寻找每一个多态性位点的定位索引。如果找到不少于一个的多态性位点的定位索引,则将该序列定位到特定的多态性位点,否则则丢弃该序列。最后对于每一个定位到特定的多态性位点的序列,从头到尾寻找该多态性位点的等位基因计数索引。如果找到不少于一个的等位基因计数索引,则选择其中的一个等位基因计数索引并将该序列标记为该多态性位点的该等位基因,否则则丢弃该序列。
5.统计每一个多态性位点的各个等位基因计数:对每一个样本,统计每一个多态性位点的每一个等位基因的测序序列数,即为每一个多态性位点的各个等位基因的计数。
实施例3、计算机模拟混合样本中各个多态性位点并对其各个等位基因进行计数
在本实施例中,我们按照以下步骤产生模拟的多态性位点的各个等位基因序列。
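1. Simulating polymorphic loci: First, a unique 70 bp sequence is generated at random and divided into three regions: a 5' region (30 bp), a variant region (10 bp) and a 3' region (30 bp). Mutations (insertions, deletions, point mutations, multi-site variants and other sequence changes) are then introduced at random into the 10 bp variant region, giving at least six different nucleic acid sequences, each at least 60 bp long and containing the 5' region, the variant region and the 3' region; these are labeled as the different alleles of the polymorphic locus. Finally, at least one unique 12 bp sequence is selected as the locus index of the polymorphic locus as described in step (3.2) of Example 1, and at least one unique 12 bp sequence containing the variant region is selected as the allele count index of the locus as described in step (3.3) of Example 1.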
2.模拟样本中特定的染色体或染色体片段:对于每一号染色体,按照上述步骤1模拟至少100个多态性位点,并且每一个位点根据模拟的基因型决定模拟的等位基因个数以及各个等位基因计数。
比如模拟核型为双体-双体的孕妇血浆cfDNA某号染色体上的多态性位点,其基因型可能为AA|AA、AA|AB、AB|AA、AB|AB和AB|AC。假设样本中胎儿DNA的浓度为10%,而模拟的基因组拷贝数为200个,则胎儿的基因组为20个拷贝而母体的基因组为180个拷贝。首先选择一个多态性位点,列出其各个等位基因序列并分别标记为A、B、C、D、E、F等等。然后对于基因型AA|AA,模拟200个拷贝的等位基因A;对于基因型AA|AB,模拟180个拷贝的母亲等位基因A和10个拷贝的胎儿等位基因A以及10个拷贝的胎儿等位基因B,即模拟190个拷贝的等位基因A和10个拷贝的等位基因B;对于基因型AB|AA,模拟110个拷贝的等位基因A和90个拷贝的等位基因B;对于基因型AB|AB,模拟100个拷贝的等位基因A和100个拷贝的等位基因B;对于基因型AB|AC,模拟100个拷贝的等位基因A、90个拷贝的等位基因B和10个拷贝的等位基因C。
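The copy numbers in this paragraph follow from giving each maternal allele copy a weight of n(1-f)/2 and each fetal allele copy a weight of nf/2, where n is the number of simulated genome copies. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter

def simulated_copies(genotype, f, n_copies=200):
    """Allele copy numbers for a 'maternal|fetal' genotype at fetal fraction f."""
    maternal, fetal = genotype.split("|")
    copies = Counter()
    for allele in maternal:
        copies[allele] += n_copies * (1 - f) / 2   # one maternal chromosome copy
    for allele in fetal:
        copies[allele] += n_copies * f / 2         # one fetal chromosome copy
    return {allele: round(value) for allele, value in copies.items()}

table = {g: simulated_copies(g, f=0.1)
         for g in ["AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC"]}
```

At f = 0.1 and 200 genome copies this reproduces the figures in the text, for example 110 copies of A and 90 of B for AB|AA, and 100, 90 and 10 copies of A, B and C for AB|AC.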
比如模拟核型为双体-单体的孕妇血浆cfDNA某号染色体上的多态性位点或者模拟某号染色体片段核型为双体-单体的孕妇血浆cfDNA上的多态性位点,其基因型可能为
Figure PCTCN2021125359-appb-000004
Figure PCTCN2021125359-appb-000005
Figure PCTCN2021125359-appb-000006
假设样本中胎儿DNA的浓度为10%,而模拟的正常基因组拷贝数为200个,则胎儿的基因组为20个拷贝而母亲的基因组为180个拷贝。首先选择一个多态性位点,列出其各个等位基因序列并分别标记为A、B、C、D、E、F等等。然后对于基因型
Figure PCTCN2021125359-appb-000007
模拟190个拷贝的等位基因A;对于基因型
Figure PCTCN2021125359-appb-000008
模拟100个拷贝的等位基因A和90个拷贝的等位基因B;对于基因型
Figure PCTCN2021125359-appb-000009
模拟180个拷贝的等位基因A和10个拷贝的等位基因B;对于基因型
Figure PCTCN2021125359-appb-000010
模拟90个拷贝的等位基因A、90个拷贝的等位基因B和10个拷贝的等位基因C。
其它不同核型染色体或染色体片段上的多态性位点的等位基因数量及其各个等位基因的基因组拷贝数可以按类似的方法模拟得到。
3.模拟特定的样本:每一个样本根据实验目的模拟不同的几号染色体,而每一号染 色体或染色体片段按上述步骤2模拟至少100个多态性位点,而每个位点根据基因型的不同模拟相应的不同等位基因的基因组拷贝数,其中每一个多态性位点的所有等位基因拷贝数对应于在正常双体-双体核型下的200个基因组拷贝。
4.模拟高通量测序结果:以每一个样本模拟的不同多态性位点的基因组拷贝序列为输入文件,利用ART模拟软件(Huang,Li et al.2012,Bioinformatics 28:593-594)模拟高通量测序结果。
5.测序数据分析:对于每一个测序序列,首先过滤低质量序列,然后在经过过滤的测序序列中从头到尾寻找每一个多态性位点的定位索引。如果找到不少于一个的多态性位点的定位索引,则将该序列定位到特定的多态性位点,否则则丢弃该序列。最后对于每一个定位到特定的多态性位点的序列,从头到尾寻找该多态性位点的等位基因计数索引。如果找到不少于一个的等位基因计数索引,则选择其中的一个等位基因计数索引并将该序列标记为该多态性位点的该等位基因,否则则丢弃该序列。
6.统计每一个多态性位点的各个等位基因计数:对每一个样本,统计每一个多态性位点的每一个等位基因的测序序列数,即为每一个多态性位点的各个等位基因的计数。
实施例4、估算一个多态性位点中检测到的高于噪声阈值的等位基因数量
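Select a polymorphic site and sort the counts of its alleles in descending order, labeling them R1, R2, R3, …, Rn (equivalently $R_1, R_2, R_3, \ldots, R_n$). The total count of the site, labeled TC, is the sum of the counts of all its alleles:

$$TC = \sum_{i=1}^{n} R_i$$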
假设样本的噪声阈值为α,对于某一个多态性位点如果其某一个等位基因的计数小于TC×α,则将该等位基因计数标记为噪声,而该多态性位点没有标记为噪声的等位基因数量为该位点的高于噪声阈值的等位基因数量。比如,多态性位点4个等位基因计数分别为27、3552、5809和11,则TC=27+3552+5809+11=9399,R1=5809,R2=3552,R3=27和R4=11。假如设定噪声阈值α=0.01,则截止阈值(Th)=TC×α=93.99。由于R1和R2均大于93.99而R3和R4均小于93.99,故该位点的高于噪声阈值的等位基因为R1和R2,而该位点的高于噪声阈值的等位基因数量为2。
优选的,将一个多态性位点各个等位基因的计数按从大到小排列并标记为R1、R2、…、Rn后,按照以下步骤估算该多态性位点中检测到的高于噪声阈值的等位基因数量:
(1)设定测序的噪声阈值为α;
(2) Compute, for each i = 1, …, n,

$$C_i = \frac{R_i}{\sum_{j=1}^{i} R_j}$$

(3) If $C_{i-1} \ge \alpha$ and $C_i < \alpha$, estimate that the polymorphic site has i−1 alleles.
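For example, for a polymorphic site with i = 3, if C2 = R2/(R1+R2) ≥ α and C3 = R3/(R1+R2+R3) < α, the site is estimated to have i−1 = 2 detected alleles above the noise threshold. For instance, if the four allele counts of a polymorphic site are 27, 3552, 5809 and 11, then TC = 27+3552+5809+11 = 9399, R1 = 5809, R2 = 3552, R3 = 27 and R4 = 11. With the noise threshold set to α = 0.01, the ratios are C1 = R1/R1 = 1.0, C2 = R2/(R1+R2) = 0.38, C3 = R3/(R1+R2+R3) = 0.003 and C4 = R4/(R1+R2+R3+R4) = 0.001. Because C2 is greater than or equal to 0.01 while C3 is less than 0.01, the alleles above the noise threshold are R1 and R2, and the number of alleles above the noise threshold at this site is 2.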
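Both ways of counting alleles above the noise threshold described in this example can be sketched in a few lines. The function name `alleles_above_noise` and the `cumulative` switch are illustrative assumptions; the arithmetic follows the text (cutoff TC×α for the simple rule, ratio Ri/(R1+…+Ri) for the preferred rule).

```python
def alleles_above_noise(counts, alpha=0.01, cumulative=True):
    """Return the number of alleles whose counts exceed the noise threshold."""
    r = sorted(counts, reverse=True)          # R1 >= R2 >= ... >= Rn
    if not cumulative:
        cutoff = sum(r) * alpha               # Th = TC x alpha
        return sum(1 for x in r if x >= cutoff)
    running = 0
    for i, x in enumerate(r):                 # i is 0-based, so C_(i+1) = x / running
        running += x
        if running == 0 or x / running < alpha:
            return i                          # first C below alpha: i alleles kept
    return len(r)
```

With the counts 27, 3552, 5809 and 11 from the text, both rules keep R1 and R2.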
实施例5、估算一个多态性位点中各个等位基因总计数(TC)
一个多态性位点中各个等位基因的总计数(TC)可以按照下列任何一种方法计算:
(1)一个多态性位点,将其各个等位基因的计数求和,得到该多态性位点中各个等位基因的总计数;
(2)一个多态性位点,首先按照实施例4所述的方法计算其检测到的高于噪声阈值的等位基因数量,则该多态性位点中各个等位基因的总计数为各个高于噪声阈值的等位基因计数的和;
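(3) Based on the characteristics of the sample, determine the maximum number of alleles the polymorphic site can have (denoted k); the total count of the site is then the sum of its k largest allele counts, that is,

$$TC = \sum_{i=1}^{k} R_i$$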
实施例6、利用等位基因计数估算血浆cfDNA样本中一个多态性位点的可能基因型
对于是胎儿亲生母亲的孕妇(亲生孕妇),其血浆cfDNA中母亲和胎儿均为正常双体核型的染色体上每一个多态性位点的基因型只能是5种基因型之一(不考虑母亲和/或胎儿是嵌合基因型和/或胎儿由于各种原因没有遗传母亲的基因型的情况)。对于每一个多态性位点,首先按照实施例4所述的方法计算其检测到的高于噪声阈值的等位基因数量,然后可以根据以下步骤估算该多态性位点的可能基因型:
(1)设定测序的噪声阈值为α;
(2)判断高于噪声阈值的等位基因数量,如果判断结果是1,则执行下述步骤(3);如果判断结果为2,则执行下述步骤(4);如果判断结果为大于2,则执行下述步骤(8);
(3)估算该多态性位点的基因型为AA|AA,然后执行下述步骤(11);
(4)判断R1/(R1+R2)的值,如果判断结果为小于0.5+α,则执行下述步骤(5);如果判断结果为大于等于0.5+α并且小于0.75,则执行下述步骤(6);如果判断结果为大于等于0.75,则执行下述步骤(7);
(5)估算该多态性位点的基因型为AB|AB,然后执行下述步骤(11);
(6)估算该多态性位点的基因型为AB|AA,然后执行下述步骤(11);
(7)估算该多态性位点的基因型为AA|AB,然后执行下述步骤(11);
(8)判断R2/R1的值,如果判断结果为小于0.5,则执行下述步骤(9);如果判断结果为大于等于0.5,则执行下述步骤(10);
(9)标记该多态性位点为异常值,然后或者估算该多态性位点的基因型为NA,并执行下述步骤(11);或者执行上述步骤(4);
(10)估算该多态性位点的基因型为AB|AC,然后执行下述步骤(11);
(11)输出估算的该多态性位点的基因型。
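The decision procedure of steps (1) to (11) can be written as a short function. This is a sketch under stated assumptions: the name `call_genotype_by_ratio` is hypothetical, the number of alleles above noise is obtained with the cumulative-ratio rule of Example 4, and for the outlier case of step (9) the sketch takes the first option in the text and returns "NA".

```python
def call_genotype_by_ratio(counts, alpha=0.01):
    """Estimate the genotype of a site in maternal plasma cfDNA (disomy-disomy)."""
    r = sorted(counts, reverse=True) + [0, 0]
    r1, r2, r3 = r[0], r[1], r[2]
    # Step (2): number of alleles above the noise threshold (1, 2, or more than 2).
    if r1 + r2 == 0 or r2 / (r1 + r2) < alpha:
        n = 1
    elif r3 / (r1 + r2 + r3) < alpha:
        n = 2
    else:
        n = 3
    if n == 1:
        return "AA|AA"                      # step (3)
    if n == 2:
        ratio = r1 / (r1 + r2)              # step (4)
        if ratio < 0.5 + alpha:
            return "AB|AB"                  # step (5)
        if ratio < 0.75:
            return "AB|AA"                  # step (6)
        return "AA|AB"                      # step (7)
    # Step (8): more than two alleles above noise.
    return "AB|AC" if r2 / r1 >= 0.5 else "NA"   # steps (10) and (9)
```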
实施例7、估算亲生孕妇血浆cfDNA样本中多态性位点来源于胎儿DNA的计数(FC)
选定一个多态性位点,首先按照实施例5所述的方法估计来源于孕妇和胎儿DNA的总计数(TC),然后按照实施例6所述的方法估算该多态性位点的可能基因型,并根据以下步骤估算该多态性位点来源于胎儿DNA的计数(FC):
(1)如果该多态性位点的基因型为AA|AA,则FC估算为NA;
(2)如果该多态性位点的基因型为AA|AB,则FC估算为2.0×R2;
(3)如果该多态性位点的基因型为AB|AA,则FC估算为R1-R2;
(4)如果该多态性位点的基因型为AB|AB,则FC估算为NA;
(5)如果该多态性位点的基因型为AB|AC,则FC估算为R1-R2+R3或2.0×R3;
(6)如果该多态性位点的基因型不是上述任何一种基因型,则FC估算为NA。
实施例8、估算混合样本中最少组分DNA的浓度(f)
选定多个多态性位点,然后根据以下步骤估算样本中最少组分DNA的浓度:
(1)按照实施例5所述的方法估计每一个多态性位点来源于所有样本DNA的总计数(TC);
(2)按照实施例6和实施例7所述的方法估算每一个多态性位点来源于最少组分DNA的计数(FC);
(3)根据对各个多态性位点的FC和TC计数,利用线性回归或稳健线性回归计算样本中最少组分DNA的浓度,或者利用FC和TC的平均数或中位数计算样本中最少组分DNA的浓度。
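Step (3) can be illustrated with plain Python. The sketch below computes the slope of a least-squares line through the origin (model FC ~ TC + 0) and the mean/median alternatives; robust regression is not reproduced here. The function name `estimate_fraction` and the use of `None` for sites whose FC is NA are assumptions of the sketch.

```python
from statistics import mean, median

def estimate_fraction(fc, tc):
    """Estimate the minor-component concentration from per-site FC and TC."""
    pairs = [(c, t) for c, t in zip(fc, tc) if c is not None]  # drop NA sites
    fcs = [c for c, _ in pairs]
    tcs = [t for _, t in pairs]
    ratios = [c / t for c, t in pairs]
    return {
        # Least-squares slope of a line through the origin: sum(FC*TC) / sum(TC^2).
        "regression": sum(c * t for c, t in pairs) / sum(t * t for t in tcs),
        "median_of_ratios": median(ratios),
        "mean_of_ratios": mean(ratios),
        "ratio_of_medians": median(fcs) / median(tcs),
        "ratio_of_means": mean(fcs) / mean(tcs),
    }
```

Applied to the FC and TC values of the five hypothetical sites in Example 14, the sketch gives the same linear-regression, mean and median figures as listed in Table 3.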
图1是按实施例8所述估计亲生孕妇血浆cfDNA样本中胎儿DNA的浓度流程图。
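Example 9. Estimating the expected count of each allele at a polymorphic site from the concentration f of the minor component in a mixture of two samples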
对于亲生孕妇血浆cfDNA样本,这里两个样本分别指的是母体cfDNA和胎儿cfDNA, 其中最少组分是胎儿cfDNA组分而最大组分是母体cfDNA组分;对于两个独立基因组样本混合物,最少组分是指占比较少的样本的DNA组分而最大组分是占比较大的DNA组分;对于经法律许可接受赠卵的孕妇血浆cfDNA样本,最少组分是胎儿cfDNA组分而最大组分是母体cfDNA组分。
选定一个多态性位点,首先按照实施例5所述的方法估计该多态性位点来源于两个样本DNA的总计数(TC)。如果最少组分的浓度为f,则另一个最大组分样本的浓度为1-f。对于任意一个多态性位点的基因型,根据以下步骤估算该多态性位点各个等位基因的理论期望计数:
(1)对于来源于最大组分样本中的每一个染色体位置的等位基因,标记其相对值为1-f;
(2)对于来源于最少组分样本中的每一个染色体位置的等位基因,标记其相对值为f;
(3)计算该多态性位点中每一个等位基因的相对总值和所有等位基因的相对总值;
(4)对每一个等位基因,计算其等位基因的相对总值相对于所有等位基因相对总值的比例,然后将该比例乘以TC得到该等位基因的理论期望计数。
比如对于亲生孕妇血浆DNA样本,假设胎儿DNA浓度为f,对于任意一个多态性位点,其各个等位基因的总计数标记为TC。则对于基因型AA|AA的多态性位点,其来源于最大组分样本(母亲DNA)中的染色体位置有两个,分别为A和A(竖线前标记),而其来源于最少组分样本(胎儿DNA)中的染色体位置有两个,分别为A和A(竖线后标记)。则等位基因A的相对总值为(1-f)+(1-f)+f+f=2,而所有等位基因的相对总值为(1-f)+(1-f)+f+f=2;比例为2/2=1,因此等位基因A的理论期望值为TC*1=TC。对于基因型AB|AC,所有等位基因的相对总值为(1-f)+(1-f)+f+f=2;等位基因A的相对总值为(1-f)+f=1,比例为1/2,则其理论期望值为1/2×TC=TC/2;等位基因B的相对总值为1-f,比例为(1-f)/2,则其理论期望值为(1-f)/2×TC;等位基因C的相对总值为f,比例为f/2,则其理论期望值为f/2×TC。对于核型为双体-三体染色体上的多态性位点基因型AB|AAB,所有等位基因的相对总值为(1-f)+(1-f)+f+f+f=2+f;等位基因A的相对总值为(1-f)+f+f=1+f,比例为(1+f)/(2+f),则其理论期望值为(1+f)/(2+f)×TC;等位基因B的相对总值为1-f+f=1,比例为1/(2+f),则其理论期望值为1/(2+f)×TC。其他基因型的理论期望计数可以用类似的方法得到。
比如对于经法律许可接受赠卵的孕妇血浆DNA样本,假设胎儿DNA浓度为f,对于任意一个多态性位点,其各个等位基因的总计数标记为TC。对于基因型AB|AC,所有等位基因的相对总值为(1-f)+(1-f)+f+f=2;等位基因A的相对总值为(1-f)+f=1,比例为1/2,则其理论期望值为1/2×TC=TC/2;等位基因B的相对总值为1-f,比例为(1-f)/2,则其理论期望 值为(1-f)/2×TC;等位基因C的相对总值为f,比例为f/2,则其理论期望值为f/2×TC。对于基因型AA|BC,所有等位基因的相对总值为(1-f)+(1-f)+f+f=2;等位基因A的相对总值为(1-f)+(1-f)=2-2f,比例为(2-2f)/2=1-f,则其理论期望值为(1-f)×TC;等位基因B的相对总值为f,比例为f/2,则其理论期望值为f/2×TC;等位基因C的相对总值为f,比例为f/2,则其理论期望值为f/2×TC。其他基因型的理论期望计数可以用类似的方法得到。
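The four-step rule for theoretical expected counts can be sketched as below. The genotype string convention follows the text (alleles of the major component before the vertical bar, alleles of the minor component after it); the function name `expected_counts` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

def expected_counts(genotype, f, tc):
    """Expected count of each allele for a genotype such as 'AB|AC' or 'AB|AAB'."""
    major, minor = genotype.split("|")
    rel = Counter()
    for allele in major:
        rel[allele] += 1 - f        # each major-component chromosome position
    for allele in minor:
        rel[allele] += f            # each minor-component chromosome position
    total = sum(rel.values())       # relative total over all alleles
    return {allele: value / total * tc for allele, value in rel.items()}
```

For instance, `expected_counts("AB|AAB", f, TC)` reproduces the disomy-trisomy case in the text, (1+f)/(2+f)×TC for allele A and 1/(2+f)×TC for allele B.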
实施例10、对多态性位点的等位基因计数进行拟合优度检验
选定一个多态性位点,根据以下步骤对该位点可能的基因型进行拟合优度检验:
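(1) Determine the count of each allele at the polymorphic site and label the counts, in descending order, as the observed counts $O_1, O_2, \ldots, O_m$;
(2) Calculate the expected count of each allele by the method of Example 9 and label them, in descending order, as $E_1, E_2, \ldots, E_n$;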
(3)利用各个等位基因的观察计数和期望计数,进行拟合优度检验。
上述步骤(3)中的拟合优度检验,可以但不限于用Fisher精确检验、二项分布检验、卡方检验或G检验来进行拟合优度检验。
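For example, if for a given genotype the observed allele counts are $O_1$, $O_2$ and $O_3$ and the expected counts are $E_1$, $E_2$ and $E_3$, the goodness of fit under the G test can be calculated as:

$$G = 2\sum_{i=1}^{3} O_i \ln\frac{O_i}{E_i}$$

or

$$AIC = G - 2 \times df$$

where df is the degrees of freedom.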
优选地,如果观察的等位基因计数个数小于期望的等位基因计数个数,则缺失的观察的等位基因计数设定为一个很小的数值,比如0.1;如果期望的等位基因计数个数小于观察到的等位基因个数,则缺失位置的期望值设定为一个很小的数值或背景噪音值,比如5或者TC×α。
比如,如果观察到某多态性位点的两个等位基因计数值分别为4105和577,胎儿DNA浓度f=0.25,而噪声阈值设定为α=0.01,则O 1=4105,O 2=577,TC=4105+577=4682。为了判断该多态性位点的各个等位基因计数对哪个基因型有最好的拟合,将观察到的各个等位基因计数对该多态性位点所有可能基因型各个等位基因的理论计数进行拟合优度检验。该多态性位点对基因型AA|AA、AA|AB和AB|AC的拟合优度检验结果示例如下:
基因型AA|AA:自由度df=1,期望的等位基因计数分别为E 1=TC×(1-α)=4682×(1-0.01)=4635.18,E 2=TC×α=46.82;则G=1901.045,AIC=G-2×df=1899.045。或者自由度df=0,期望的等位基因计数分别为E 1=TC=4682,E 2=0(舍去);则G=0.0,AIC=G-2×df=0.0。
基因型AA|AB:自由度df=1,期望的等位基因计数分别为E 1=TC×(2-f)/2=4682×(2-0.25)/2=4096.75,E 2=TC×f/2=4682×0.25/2=585.25;则G=0.1334,AIC=G-2×df=-1.8666。
基因型AB|AC:自由度df=2,由于期望有三个等位基因而只有两个等位基因的观察计数,则O 3设定为一个很小的值,比如设定O 3=0.1,而期望的等位基因计数分别为E 1=TC×1/2=4682×1/2=2341,E 2=TC×(1-f)/2=4682×(1-0.25)/2=1755.75,E 3=TC×f/2=4682×0.25/2=585.25则G=3325.046,AIC=G-2×df=3321.046。
另外,也可以全部用相同的等位基因计数个数来进行拟合优度检验。由于该多态性位点最多可能有三个等位基因,因此观察到的等位基因计数和期望的等位基因计数均保留最大的三个值,其中观察到的等位基因计数可以用小值补位,而期望的等位基因计数可以用阈值补位。比如对上述两个观察的等位基因值拟合基因型AA|AB,则设定O 3=0.1,E 3=TC×α=46.82,df=2,因此E 1=TC×(1-α)×(2-f)/2=4055.783,E 2=TC×(1-α)×f/2=579.398;G=94.24,AIC=G-2×df=90.24。
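The G statistic and the AIC value used in the worked numbers above can be checked with a short sketch. The function name `g_test` and the `pad_observed` parameter are illustrative; the AIC definition (G − 2×df, with df equal to the number of compared counts minus one) is taken from the text as given, and the expected counts are assumed to be supplied already padded where the text pads them.

```python
import math

def g_test(observed, expected, pad_observed=0.1):
    """Return (G, AIC) for observed vs. expected allele counts."""
    obs = sorted(observed, reverse=True)
    exp = sorted(expected, reverse=True)
    if len(exp) < len(obs):
        raise ValueError("pad the expected counts (e.g. with TC*alpha) first")
    # Missing observed counts are set to a small value, as in the text.
    obs += [pad_observed] * (len(exp) - len(obs))
    g = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp))
    df = len(exp) - 1
    return g, g - 2 * df
```

Calling `g_test([4105, 577], [4096.75, 585.25])` corresponds to the AA|AB fit at f = 0.25 discussed above.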
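Example 11. Estimating the likely genotype of a polymorphic site from the minor-component concentration f of the sample mixture and the allele counts of that site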
对于一个多态性位点,根据以下步骤估算该多态性位点的基因型:
(1)按照实施例10所述的方法用观测到的各个等位基因计数对每一种可能的基因型各个等位基因的理论计数分别进行拟合优度检验;
(2)选择对观测到的各个等位基因计数有最优拟合优度检验的基因型标记为该多态性位点的基因型。
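The two steps above amount to scoring every candidate genotype and keeping the best one. The sketch below is one possible reading, with several assumptions flagged: the helper names are hypothetical; the score is the AIC value (G − 2×df) as defined in Example 10; the number of compared counts is the larger of the genotype's allele number and the number of alleles above noise; missing observed counts are set to 0.1 and missing expected counts to TC×α, with the remaining expectations scaled by (1 − α) per padded position.

```python
import math
from collections import Counter

DISOMY_DISOMY = ("AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC")

def allele_fractions(genotype, f):
    """Expected allele fractions, largest first (rule of Example 9)."""
    major, minor = genotype.split("|")
    rel = Counter()
    for a in major:
        rel[a] += 1 - f
    for a in minor:
        rel[a] += f
    total = sum(rel.values())
    return sorted((v / total for v in rel.values()), reverse=True)

def best_genotype(counts, f, genotypes=DISOMY_DISOMY, alpha=0.01):
    """Return (best genotype, {genotype: AIC}) for one site."""
    r = sorted(counts, reverse=True)
    # Alleles above noise, using the cumulative-ratio rule of Example 4.
    n_detected, running = len(r), 0
    for i, x in enumerate(r):
        running += x
        if running == 0 or x / running < alpha:
            n_detected = i
            break
    aic = {}
    for g in genotypes:
        frac = allele_fractions(g, f)
        m = max(len(frac), n_detected)                    # counts to compare
        obs = [max(x, 0.1) for x in (r + [0.1] * m)[:m]]  # pad observed
        tc = sum(obs)
        pad = m - len(frac)                               # expected to pad
        exp = [p * tc * (1 - pad * alpha) for p in frac] + [alpha * tc] * pad
        g_stat = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp))
        aic[g] = g_stat - 2 * (m - 1)
    return min(aic, key=aic.get), aic
```

Other candidate lists (for example the nine genotypes of a donor-egg pregnancy, or trisomy genotypes such as "AB|AAB") can be passed through the `genotypes` argument.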
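Example 12. Estimating the count derived from the minor-component sample (FC) at a polymorphic site from the minor-component concentration f of a mixture of two independent samples, the allele counts of the site, and its genotype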
在两个独立样本的混合物中,最少组分样本的浓度为f,则最大组分样本的浓度为1-f,各个等位基因计数按由大到小排列分别标记为R1、R2、R3和R4,然后根据以下步骤估算该多态性位点来源于最少组分的计数(FC):
(1)如果该多态性位点的基因型为AA|AA,则FC估算为NA;
(2)如果该多态性位点的基因型为AA|AB,则FC估算为2.0×R2;
(3)如果该多态性位点的基因型为AB|AA,则FC估算为R1-R2;
(4)如果该多态性位点的基因型为AB|AB,则FC估算为NA;
(5)如果该多态性位点的基因型为AB|AC,则FC估算为R1-R2+R3或2.0×R3;
(6)如果该多态性位点的基因型为AA|BB,则FC估算为R2;
(7)如果该多态性位点的基因型为AA|BC,则FC估算为R2+R3或2.0×R2或2.0×R3;
(8)如果该多态性位点的基因型为AB|CC,则判断是否f>1/3,如果判断结果为是,则FC估算为R1;如果判断结果为否,则FC估算为R3;
(9)如果该多态性位点的基因型为AB|CD,则FC估算为R3+R4或2.0×R3或2.0×R4;
(10)如果该多态性位点的基因型不是上述任何一种基因型,则FC估算为NA。
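The rules (1) to (10) map directly onto a lookup. In the sketch below the function name `minor_component_count` is hypothetical, `None` stands for NA, and where the text offers alternatives (for example R1−R2+R3 or 2.0×R3) the sketch uses the first one listed.

```python
def minor_component_count(genotype, counts, f=None):
    """Count attributable to the minor component (FC); None means NA."""
    r = sorted(counts, reverse=True) + [0, 0, 0, 0]
    r1, r2, r3, r4 = r[:4]
    if genotype == "AA|AB":
        return 2.0 * r2
    if genotype == "AB|AA":
        return r1 - r2
    if genotype == "AB|AC":
        return r1 - r2 + r3
    if genotype == "AA|BB":
        return r2
    if genotype == "AA|BC":
        return r2 + r3
    if genotype == "AB|CC":
        if f is None:
            raise ValueError("AB|CC needs the current estimate of f")
        # Allele C is the largest count only when f > 1/3.
        return r1 if f > 1 / 3 else r3
    if genotype == "AB|CD":
        return r3 + r4
    return None  # AA|AA, AB|AB and any other genotype
```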
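Example 13. Estimating the concentration of the minor component in a mixture of two samples from the allele counts of polymorphic sites
Select multiple polymorphic sites, then estimate the concentration of the minor component in a mixture of two independent samples by the following steps:
(1) Set the background noise value α, the concentration precision ε and the initial concentration $f_0$;
(2) Estimate the genotype of each polymorphic site by the method of Example 11, using $f_0$ as the current concentration;
(3) Estimate, by the method of Example 7 or Example 12, the count derived from the minor-component sample DNA (FC) at each polymorphic site in the mixture;
(4) Estimate, by the method of Example 5, the total count derived from both samples (TC) at each polymorphic site;
(5) From the FC and TC counts of the polymorphic sites, estimate the concentration f of the minor-component DNA in the mixed sample by the method of Example 8;
(6) Test whether $|f - f_0|$ is smaller than ε. If so, the concentration of the minor component in the mixture is f; if not, set $f_0 = f$ and return to step (2).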
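The iteration in steps (1) to (6) is a fixed-point loop around the per-site estimators. The sketch below shows only the loop: the genotype caller, FC rule and TC rule are passed in as functions, so the names `call_genotype`, `minor_count` and `total_count` are placeholders for whatever implementations of Examples 11, 12 and 5 are used, and the regression through the origin is one of the options of Example 8. The `max_iter` guard is an added assumption.

```python
def estimate_minor_fraction(loci, call_genotype, minor_count, total_count,
                            f0=0.10, eps=0.001, max_iter=100):
    """Iterate genotype calling and concentration estimation until |f - f0| < eps."""
    for _ in range(max_iter):
        fc, tc = [], []
        for counts in loci:
            genotype = call_genotype(counts, f0)          # step (2)
            c = minor_count(genotype, counts, f0)         # step (3)
            if c is None:
                continue                                  # FC is NA: uninformative
            fc.append(c)
            tc.append(total_count(genotype, counts))      # step (4)
        if not tc:
            raise ValueError("no informative sites")
        # Step (5): slope of FC ~ TC + 0.
        f = sum(a * b for a, b in zip(fc, tc)) / sum(b * b for b in tc)
        if abs(f - f0) < eps:                             # step (6)
            return f
        f0 = f
    raise RuntimeError("concentration estimate did not converge")
```

Passing the estimators as arguments keeps the loop independent of which candidate genotype set (five genotypes for a biological mother, nine for a donor-egg pregnancy) is in use.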
对于经法律许可接受赠卵的孕妇血浆DNA样本,最少组分为胎儿DNA,最大组分为母亲DNA。由于胎儿没有遗传经法律许可接受赠卵的孕妇染色体上的遗传物质,经法律许可接受赠卵的孕妇血浆DNA中每一个多态性位点均可能是九种基因型中的一种(不考虑母亲和/或胎儿有染色体非整倍性或染色体片段拷贝数变异和/或母亲和/或胎儿是嵌合基因型和/或胎儿由于各种原因有其它非二倍体核型对应的基因型的情况),其中胎儿DNA的浓度可以按上述步骤通过迭代来估计。
对于亲生孕妇血浆DNA样本,最少组分为胎儿DNA,最大组分为母亲DNA。由于胎儿遗传了亲生母亲染色体上的遗传物质,亲生孕妇血浆DNA中每一个多态性位点均可能是五种基因型中的一种(不考虑母亲和/或胎儿有染色体非整倍性或染色体片段拷贝数变异和/或母 亲和/或胎儿是嵌合基因型和/或胎儿由于各种原因没有遗传母亲的基因型的情况),其中胎儿DNA的浓度可以按上述步骤通过迭代来估计。
图2是按实施例13所述估计经法律许可接受赠卵的孕妇血浆DNA样本中胎儿DNA浓度的流程图。
实施例14、利用模拟的对孕妇血浆DNA样本多态性位点的测序估计胎儿DNA浓度
下面以模拟的孕妇血浆cfDNA中的5个假想的多态性位点的各个等位基因计数为例,简要说明利用等位基因计数相对比例法估算该样本中胎儿DNA浓度的方法及步骤。
(1)模拟参照基因组上多个多态性位点的测序结果
选定参照基因组上的多态性位点,分别标记为Id001-Id005。假设按照实施例3所述模拟的5个多态性位点的各个等位基因计数结果如表1。在假想的孕妇血浆cfDNA中,参照基因组被认为是在母体和胎儿均为正常双体核型的染色体区域,因此每一个多态性位点理论上最多包含3个等位基因。这里每一个位点显示了最多五个等位基因的计数(其中一些等位基因计数代表样本处理、测序等过程中的系统噪声)。应当理解,每一个多态性位点可能检测到包含多个等位基因,对每一个等位基因均应该进行计数统计。
表1:假想的五个多态性位点的各个等位基因计数
Figure PCTCN2021125359-appb-000016
(2)按照实施例6和实施例7所述的方法估计每一个多态性位点计数中来源于胎儿DNA的计数
由于在孕妇血浆cfDNA中每一个多态性位点理论上最多只能有三个等位基因,因此对每一个多态性位点的等位基因计数按由大到小的顺序排序并分别标记其中最大的三个数为R1、R2和R3。结果见表2。
表2:假想的经排序的五个多态性位点的各个等位基因计数。
位点编号 R1 R2 R3
Id001 14127 35 0
Id002 4105 577 13
Id003 3148 3101 54
Id004 5809 3552 27
Id005 4007 3028 1011
设定测序的噪音阈值为α=0.01。计算每一个多态性位点中理论上来源于胎儿DNA的扩增计数(FC)和来源于母亲和胎儿DNA的总计数(TC)。
对于位点Id001,R2/(R1+R2)=35/(14127+35)=0.002<0.01,等位基因数量估计为一个,基因型估计为AA|AA,FC=NA,TC=R1=14127。
对于位点Id002,R2/(R1+R2)=577/(4105+577)=0.123≥0.01,R3/(R1+R2+R3)=13/(4105+577+13)=0.003<0.01,等位基因数量估计为两个。因为R1/(R1+R2)=0.877≥0.75,则基因型估计为AA|AB,FC=2×R2=1154,TC=R1+R2=4682。
对于位点Id003,R2/(R1+R2)=0.496≥0.01,R3/(R1+R2+R3)=0.009<0.01,等位基因数量估计为两个,因为R1/(R1+R2)=0.504<0.5+α,基因型估计为AB|AB,FC=NA,TC=R1+R2=6249。
对于位点Id004,R2/(R1+R2)=0.379≥0.01,R3/(R1+R2+R3)=0.003<0.01,等位基因数量估计为两个,因为0.5+α≤R1/(R1+R2)=0.621<0.75,基因型估计为AB|AA,FC=R1-R2=2257,TC=R1+R2=9361。
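For site Id005, R2/(R1+R2) = 0.430 ≥ 0.01 and R3/(R1+R2+R3) = 0.126 ≥ 0.01, so the number of alleles is estimated as three. Because R2/R1 = 0.756 ≥ 0.5, the genotype is estimated as AB|AC, FC = R1−R2+R3 = 1990 and TC = R1+R2+R3 = 8046.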
(3)估计胎儿DNA的浓度
利用R软件和线性回归或稳健线性回归计算样本中胎儿DNA的浓度,或者利用FC和TC的平均数或中位数计算样本中胎儿DNA的浓度。结果见表3。
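(a) Enter the FC and TC values
FC <- c(NA, 1154, NA, 2257, 1990)
TC <- c(14127, 4682, 6249, 9361, 8046)
(b) Calculate the fetal DNA concentration by linear regression through the origin
lmfit <- lm(FC ~ TC + 0)
f <- lmfit$coefficients["TC"]
(c) Calculate the fetal DNA concentration by robust regression
library(MASS)
rlmfit <- rlm(FC ~ TC + 0, maxit = 1000)
f <- rlmfit$coefficients["TC"]
(d) Calculate the fetal DNA concentration from the mean or median of FC and TC
(d1) f <- median(FC / TC, na.rm = TRUE)
(d2) f <- median(FC[c(2, 4, 5)]) / median(TC[c(2, 4, 5)])
(d3) f <- mean(FC / TC, na.rm = TRUE)
(d4) f <- mean(FC[c(2, 4, 5)]) / mean(TC[c(2, 4, 5)])
(e) The calculated fetal DNA concentrations are given in Table 3.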
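Table 3: Fetal DNA concentration in the sample estimated by different methods.
Estimation method          Estimated fetal DNA concentration
Linear regression (b)      0.2441
Robust regression (c)      0.2441
Median of ratios (d1)      0.2465
Mean of ratios (d3)        0.2450
Ratio of medians (d2)      0.2473
Ratio of means (d4)        0.2445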
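Example 15. Estimating fetal DNA concentration from simulated sequencing of polymorphic sites in a plasma cfDNA sample from a pregnant woman who has lawfully received a donated egg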
下面以模拟的经法律许可接受赠卵的孕妇血浆cfDNA中的9个假想的多态性位点的各个等位基因计数为例,简要说明利用等位基因计数迭代拟合基因型法估算该样本中胎儿DNA浓度的方法及步骤。
(1)模拟经法律许可接受赠卵的孕妇血浆cfDNA样本参照基因组上多个多态性位点各个等位基因计数的测序结果
选定参照基因组上的多态性位点,分别标记为Id001-Id009。假设按照实施例3所述模拟的9个多态性位点的各个等位基因计数结果如表4。在假想的经法律许可接受赠卵的孕妇血浆cfDNA中,参照基因组被认为是在母体和胎儿均为正常双体核型的染色体区域,因此每一个多态性位点理论上最多包含4个等位基因。这里每一个位点显示了最多五个等位基因的计数。应当理解,每一个多态性位点可能检测到包含多个等位基因,对每一个等位基因均应该进行计数统计。
表4:假想的九个多态性位点的等位基因计数。
Figure PCTCN2021125359-appb-000017
Figure PCTCN2021125359-appb-000018
(2)按实施例13所述的方法迭代估计胎儿DNA的浓度
由于在经法律许可接受赠卵的孕妇血浆中每一个多态性位点理论上最多只能有四个等位基因,因此对每一个多态性位点的等位基因计数按由大到小的顺序排序并分别标记其中最大的四个数为R1、R2、R3和R4;然后设定测序的噪音阈值为α=0.01,迭代精度值ε=0.001,胎儿浓度初始估计值f 0=0.10;最后按下述步骤计算胎儿DNA的浓度。
步骤(a)对每一个多态性位点,根据各个等位基因计数和f 0,按照实施例11和实施例12所述的方法估计该位点的基因型,以及理论上来源于胎儿DNA的扩增计数(FC)和来源于母亲和胎儿DNA的总计数(TC)。
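For example, for site Id006, R1 to R4 are 3322, 936, 36 and 28, so $O_1$ = 3322, $O_2$ = 936, $O_3$ = 36 and $O_4$ = 28. Because R2/(R1+R2) ≥ 0.01 and R3/(R1+R2+R3) < 0.01, two alleles above the noise threshold are detected at this site.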
对该位点的所有可能基因型进行拟合优度检验如下:
AA|AA:TC=R1+R2=4258,E 1=(1-α)×TC=4215.42,E 2=α×TC=42.58,
Figure PCTCN2021125359-appb-000019
Figure PCTCN2021125359-appb-000020
AA|AB:TC=R1+R2=4258,E 1=(1-f 0/2)×TC=4045.10,E 2=f 0/2×TC=212.90,
Figure PCTCN2021125359-appb-000021
Figure PCTCN2021125359-appb-000022
AB|AA:TC=R1+R2=4258,E 1=(1+f 0)/2×TC=2341.90,E 2=(1-f 0)/2×TC=1916.10,
Figure PCTCN2021125359-appb-000023
Figure PCTCN2021125359-appb-000024
AB|AB:TC=R1+R2=4258,E 1=1/2×TC=2129.00,E 2=1/2×TC=2129.00,
Figure PCTCN2021125359-appb-000025
Figure PCTCN2021125359-appb-000026
AB|AC:TC=R1+R2+R3=4294,E 1=1/2×TC=2147.00,E 2=(1-f 0)/2×TC=1932.30,E 3=f 0/2×TC=214.70,
Figure PCTCN2021125359-appb-000027
AA|BB:TC=R1+R2=4258,E 1=(1-f 0)×TC=3832.20,E 2=f 0×TC=425.80,
Figure PCTCN2021125359-appb-000028
Figure PCTCN2021125359-appb-000029
AA|BC:TC=R1+R2+R3=4294,E 1=(1-f 0)×TC=3864.60,E 2=f 0/2×TC=214.70, E 3=f 0/2×TC=214.70,
Figure PCTCN2021125359-appb-000030
AB|CC:TC=R1+R2+R3=4294,E 1=(1-f 0)/2×TC=1932.30,E 2=(1-f 0)/2×TC=1932.30,E 3=f 0×TC=429.40,
Figure PCTCN2021125359-appb-000031
AB|CD:TC=R1+R2+R3+R4=4322,E 1=(1-f 0)/2×TC=1944.90,E 2=(1-f 0)/2×TC=1944.90,E 3=f 0/2×TC=216.10,E 4=f 0/2×TC=216.10,G AB|CD=1944.34。
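Because $G_{AA|BB} < G_{AB|AA} < G_{AB|AC} < G_{AB|AB} < G_{AA|AB} < G_{AA|BC} < G_{AB|CD} < G_{AB|CC} < G_{AA|AA}$, the genotype of site Id006 is estimated as AA|BB. Then, by the method of Example 12, FC = R2 = 936 and TC = R1+R2 = 4258.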
按照相同的规则,对上述9个位点分别估计FC和TC值。
步骤(b)利用各个多态性位点的FC和TC值,按照实施例8所述的方法估算胎儿DNA浓度f。
步骤(c)判断f-f 0的绝对值是否小于ε,如果判断结果为是,则输出胎儿DNA浓度为f,计算结束;如果判断结果为否,则设定f 0=f,然后执行上述步骤(a)。
对上面示例迭代执行的结果见下表5。
Table 5: Iterative estimates of the fetal DNA concentration.
Iteration    Initial f0    Recomputed f    |f − f0|
1            0.1           0.2385          0.1385
2            0.2385        0.2436          0.0051
3            0.2436        0.2436          0
因此该示例中胎儿DNA的浓度估计为f=0.2436。
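Example 16. Estimating the genotype of a site under analysis from the fetal DNA concentration in a maternal plasma DNA sample and the allele counts of that site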
按照实施例3所述模拟孕妇血浆cfDNA样本中一组参照组多态性位点和2个目标组多态性位点。假设利用一组参照基因组多态性位点按照实施例14所述的方法估计胎儿DNA的浓度f=0.20,而目标组两个多态性位点各个等位基因计数分别为A:16994,1896,23;B:9146,7355,1892,58。如果母亲和胎儿的位点A和位点B所在的染色体均为正常的双体且没有影响到位点A和位点B的大片段插入或缺失变异,则位点A和位点B均只能是以下五种基因型的一种,即AA|AA、AA|AB、AB|AA、AB|AB和AB|AC。下面以位点A和位点B的上述等位基因计数结果为例按照实施例11所述的方法来分别估计它们最可能的基因型。
对位点A和位点B所有可能的基因型利用G检验进行拟合优度检验,结果见下表6。
表6:利用拟合优度检验估算靶位点的基因型。
Figure PCTCN2021125359-appb-000032
从表6结果可以看出,位点A对基因型AA|AB有最优的拟合优度检验结果而位点B对基因型AB|AC有最优的拟合优度检验结果,因此估计位点A的基因型为AA|AB,而位点B的基因型为AB|AC。
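Example 17. Estimating the karyotype of a test target from the minor-component concentration f of the sample mixture and the allele counts of a set of polymorphic sites within the target region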
利用目标区域内多个多态性位点和综合拟合优度检验估计待测目标染色体或亚染色体片段的核型,其主要步骤如下:
(1)分析待测样本,列出样本在目标区域内所有可能的核型;
(2)对于每一个可能的核型,列出目标区域内各个多态性位点所有可能的对应于该核型的基因型;
(3)对每一个目标区域内的多态性位点,按照实施例11所述的方法对每个核型选择一个有最优拟合的基因型;
(4)综合分析目标区域内的所有多态性位点对所有核型的拟合优度检验结果,将多态性位点综合拟合最好的核型作为待检测目标(染色体或亚染色体片段)的核型。
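Example 18. Estimating chromosome-level aneuploidy or subchromosome-level deletion/duplication in a region under analysis from the fetal DNA concentration f in a maternal plasma DNA sample and the allele counts of a set of polymorphic sites within that chromosomal or subchromosomal region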
按照实施例3所述模拟2个孕妇血浆cfDNA样本,其中每个样本模拟一组参照组多态 性位点和一组来源于特定染色体或亚染色体片段的目标区域内的多态性位点。假设利用一组参考基因组多态性位点按照实施例14所述的方法估计两个样本中胎儿DNA的浓度均为f=0.20,样本1和样本2中目标区域的一组多态性位点的各个等位基因计数如下表7。
表7:假想的两个样本中待测目标染色体上一组多态性位点的等位基因计数。
Figure PCTCN2021125359-appb-000033
假设样本1和样本2中目标区域的一组多态性位点来源于21号染色体,而我们的目标是检测样本1和样本2中胎儿是否是21三体,即这两个样本中21号染色体的核型是双体-双体(母体和胎儿的21号染色体均是正常双体)还是双体-三体(正常21号染色体双体的孕妇怀有一个21号染色体三体的胎儿)。对于双体-双体型,所有多态性位点均只能是下列5种基因型中的一种,即AA|AA、AA|AB、AB|AA、AB|AB或AB|AC。对于双体-三体型,所有多态性位点均只能是下列10种基因型中的一种,即AA|AAA、AA|AAB、AA|ABB、AA|ABC、AB|AAA、AB|AAB、AB|AAC、AB|ABC、AB|ACC或AB|ACD。对样本1和样本2中21号染色体目标区域的各一组多态性位点按照实施例17所述的方法分别按照双体-双体和双体-三体的核型利用G检验进行拟合优度检验,结果见下表8。
表8:目标区域各个多态性位点各个等位基因计数的分核型拟合优度检验结果。
Figure PCTCN2021125359-appb-000034
Figure PCTCN2021125359-appb-000035
对于样本1,大部分多态性位点的各个等位基因计数对双体-双体中的基因型比对双体-三体中的基因型有更好的拟合,因此样本1的核型估计为双体-双体,即母亲和胎儿均为正常双体。
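For sample 2, the allele counts of all polymorphic sites fit the disomy-trisomy genotypes better than the disomy-disomy genotypes, so the karyotype of sample 2 is estimated as disomy-trisomy: the mother is a normal disomy and the fetus has an abnormal trisomy 21.
When the fits of multiple polymorphic sites are considered together, one can either take the karyotype that gives the best fit for the majority of polymorphic sites, or decide using the G value, the AIC value, a modified G value and/or a modified AIC value.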
比如,对样本1拟合双体-双体核型,则:
综合G值为ΣG i=0.0+0.039+0.025+2.138+0.054=2.256
综合AIC值为ΣAIC i=0.0+(-1.961)+(-1.975)+0.138+(-3.946)=-7.744
综合AIC/总计数值为Σ(AIC i/TC i)=0.0/9565+(-1.961/6472)+(-1.975/11183)+0.138/15494+(-3.946/18915)=-0.00068
综合AIC/总计数/f值为Σ(AIC i/TC i/f)=0.0/9565/0.2+(-1.961/6472/0.2)+(-1.975/11183/0.2)+0.138/15494/0.2+(-3.946/18915/0.2)=-0.0034。
对样本1拟合双体-三体核型,则:
综合G值为ΣG i=319.73
综合AIC值为ΣAIC i=309.73
综合AIC/总计数为Σ(AIC i/TC i)=0.02017
综合AIC/总计数/f为Σ(AIC i/TC i/f)=0.10087。
对于样本1,综合G值、综合AIC值、综合AIC/总计数值、综合AIC/总计数/f值对双体-双体基因型的拟合均小于相应的对双体-三体基因型的拟合,因此也可以用这些值或由其衍生的值来判断多个多态性位点各个等位基因对不同核型的拟合优劣。
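The four combined scores used above are simple sums over the sites of the target region. The sketch below recomputes them from the per-site G, AIC and TC values quoted in the text for sample 1 under the disomy-disomy karyotype; the function name `combined_scores` is an illustrative assumption.

```python
def combined_scores(g_values, aic_values, total_counts, f):
    """Combine per-site goodness-of-fit results for one candidate karyotype."""
    return {
        "G": sum(g_values),
        "AIC": sum(aic_values),
        "AIC/TC": sum(a / t for a, t in zip(aic_values, total_counts)),
        "AIC/TC/f": sum(a / t / f for a, t in zip(aic_values, total_counts)),
    }

# Per-site values for sample 1, disomy-disomy fit, as quoted in the text.
sample1 = combined_scores(
    g_values=[0.0, 0.039, 0.025, 2.138, 0.054],
    aic_values=[0.0, -1.961, -1.975, 0.138, -3.946],
    total_counts=[9565, 6472, 11183, 15494, 18915],
    f=0.2,
)
```

The karyotype with the smaller combined score is taken as the better fit, so the same function would be called once per candidate karyotype and the results compared.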
当检测亚染色体水平的微缺失微重复变异时,应该考虑到母亲有可能携带纯合或杂合的亚染色体水平微缺失或微重复,因此对于受影响的每一个多态性位点所有可能的基因型都需要考虑到并利用拟合优度检验进行检测。比如,检测亚染色体水平的微缺失突变,需要检测母亲是纯合微缺失、杂合微缺失或正常而胎儿是纯合微缺失、杂合微缺失或正常的情况下所有的母亲和胎儿的可能的基因型组合。相应的,如果检测亚染色体水平的微重复突变,需要检测母亲是纯合微重复、杂合微重复或正常而胎儿是纯合微重复、杂合微重复或正常的情况下所有的母亲和胎儿的可能的基因型组合。
Example 19. Estimating fetal DNA concentration from high-throughput sequencing of a set of polymorphic sites in a maternal plasma DNA sample
按照实施例1所述的方法对孕妇血浆cfDNA插入缺失标记的扩增子测序数据集中(Barrett,Xiong et al.2017,PLoS One 12:e0186771)的每一个样本,统计每一个插入缺失标记(多态性位点)中各个等位基因的计数,然后按照实施例8所述的方法对于每一个样本的每一个多态性位点,估算其来源于胎儿DNA的计数(FC)和来源于孕妇和胎儿DNA的总计数(TC),并利用每一个样本中每一个多态性位点的FC和TC,估算每一个样本中胎儿DNA的浓度。
图3是对该数据集中一个孕妇血浆cfDNA样本的分析结果。样本中每一个插入缺失多态性位点来源于胎儿DNA的计数(FC)和来源于孕妇和胎儿DNA的总计数(TC)表现为图中的一个点。利用样本中每一个多态性位点的FC和TC值和R软件包MASS库中的rlm函数进行稳健回归拟合(拟合模型:FC~TC+0)并估算胎儿DNA的浓度。rlm稳健回归拟合的结果为图中的直线,而胎儿DNA浓度则估算为该直线的斜率(TC的模型系数)。
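Example 20. Estimating the DNA concentration of the minor component from high-throughput sequencing of a set of polymorphic sites in a mixed DNA sample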
按照实施例2所述的方法对混合样本扩增子测序数据集中(Kim,Kim et al.2019,Nat Commun 10:1047)的每一个样本,统计每一个多态性位点中各个等位基因的计数,然后按照实施例8所述的方法对于每一个样本的每一个多态性位点,估算来源于最少组分DNA的计数(FC)和来源于所有DNA的总计数(TC),并利用每一个样本中每一个多态性位点的FC和TC,估算每一个样本中最少组分DNA的浓度。
图4a是对该数据集中一个混合DNA样本进行分析的结果。样本中每一个多态性位点 来源于最少组分DNA的计数(FC)和来源于所有DNA的总计数(TC)表现为图中的一个点。利用每一个多态性位点的FC和TC值进行rlm稳健回归(模型:FC~TC+0)并估算样本中最少组分DNA的浓度。rlm稳健回归结果为图中拟合的直线,而最少组分DNA浓度则估算为该直线的斜率(TC的模型系数)。图4b是对该数据集所有混合DNA样本的分析结果。四个混合的样本在文库制备或测序水平进行了多个重复,期望的最少组分DNA浓度分别为0.01、0.02、0.10或0.20(x轴),而估计的每个样本最少组分DNA浓度为y轴。图中虚线表示直线y=x的位置。
Example 21. Computer simulation of chromosome-level, subchromosome-level and short-sequence-level variation in maternal plasma DNA samples
为了检测染色体水平、亚染色体水平或短序列水平的遗传变异,我们在染色体水平模拟了核型为双体-单体和双体-三体的变异,在亚染色体水平模拟了缺体-缺体、缺体-单体、单体-缺体、单体-单体、单体-双体、双体-单体、双体-双体、双体-三体、三体-双体、三体-三体、三体-四体、四体-三体和四体-四体的变异,在短序列水平模拟了任何一个多态性位点在正常双体-双体核型下的所有可能的基因型。对各个样本中不同多态性位点的具体模拟过程简述如下:
1.模拟含染色体单体的孕妇血浆DNA样本。
为了检测染色体水平的染色体单体非整倍性变异,我们模拟了含染色体单体的孕妇血浆DNA样本,其中每一个样本中母亲和胎儿均模拟了三对染色体,分别编号为1号(Chr01)、2号(Chr02)和3号(Chr03)。每一个样本中,1号、2号和3号染色体上均按实施例3所述的方法模拟100个多态性位点。每个样本从以下浓度中(0.02、0.05、0.10、0.15、0.20、0.25、0.30、0.35、0.40、0.45)随机选择一个浓度作为模拟的胎儿DNA浓度。
模拟的1号染色体在样本中为参照染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的2号染色体在样本中为双体-双体染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的3号染色体在样本中为双体-单体染色体,其中每一个多态性位点的基因型模拟为双体-单体基因型之一。由于缺乏一条胎儿染色体,故每个多态性位点各个等位基因的总计数为200-100f。
以每一个样本模拟的等位基因序列为输入文件,利用ART模拟软件(Huang,Li et al. 2012,Bioinformatics 28:593-594)模拟高通量测序结果,其中ART模拟软件的fold参数设定为50或100。
2.模拟含染色体三体的孕妇血浆DNA样本。
为了检测染色体水平的染色体三体非整倍性变异,我们模拟了含染色体三体的孕妇血浆DNA样本,其中每一个样本中母亲和胎儿均模拟了三对染色体,分别编号为1号(Chr01)、2号(Chr02)和3号(Chr03)。每一个样本中,1号、2号和3号染色体上均按实施例3所述的方法模拟100个多态性位点。每个样本从以下浓度中(0.02、0.05、0.10、0.15、0.20、0.25、0.30、0.35、0.40、0.45)随机选择一个浓度作为模拟的胎儿DNA浓度。
模拟的1号染色体在样本中为参照染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的2号染色体在样本中为双体-双体染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的3号染色体在样本中为双体-三体染色体,其中每一个多态性位点的基因型模拟为双体-三体基因型之一。由于多了一条胎儿染色体,故每个多态性位点各个等位基因的总计数为200+100f。
以每一个样本模拟的等位基因序列为输入文件,利用ART模拟软件模拟高通量测序结果,其中ART模拟软件的fold参数设定为50或100。
3.模拟含亚染色体微缺失的孕妇血浆DNA样本。
为了检测亚染色体水平的微缺失变异,我们模拟了含染色体微缺失的孕妇血浆DNA样本,其中每一个样本中母亲和胎儿均模拟了7对染色体,分别编号为1号(Chr01)、2号(Chr02)、3号(Chr03)、4号(Chr04)、5号(Chr05)、6号(Chr06)和7号(Chr07)。每一个样本中,1号-7号染色体上均按实施例3所述的方法模拟100个多态性位点。每个样本从以下浓度中(0.02、0.05、0.10、0.15、0.20、0.25、0.30、0.35、0.40、0.45)随机选择一个浓度作为模拟的胎儿DNA浓度。在这里,每一个微缺失区域被当成是一整条染色体,而多态性位点由该微缺失区域选取,其中在单一基因组中一条正常一条含微缺失的染色体标记为单体,而两条均含微缺失的染色体标记为缺体。
模拟的1号染色体在样本中为参照染色体,其中每一个多态性位点的基因型模拟为正 常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的2号染色体在样本中为双体-双体染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的3号染色体在样本中为双体-单体染色体,其中每一个多态性位点的基因型模拟为双体-单体基因型之一。由于一条胎儿染色体含有微缺失,故每个多态性位点各个等位基因的总计数为200-100f。
模拟的4号染色体在样本中为单体-双体染色体,其中每一个多态性位点的基因型模拟为单体-双体基因型之一。由于一条母体染色体含有微缺失,故每个多态性位点各个等位基因的总计数为100+100f。
模拟的5号染色体在样本中为单体-单体染色体,其中每一个多态性位点的基因型模拟为单体-单体基因型之一。由于一条母体染色体和一条胎儿染色体均含有微缺失,故每个多态性位点各个等位基因的总计数为100。
模拟的6号染色体在样本中为单体-缺体染色体,其中每一个多态性位点的基因型模拟为单体-缺体基因型之一。由于一条母体染色体和两条胎儿染色体均含有微缺失,故每个多态性位点各个等位基因的总计数为100-100f。
模拟的7号染色体在样本中为缺体-缺体染色体,其中每一个多态性位点的基因型模拟为缺体-缺体基因型之一。由于两条母体染色体和两条胎儿染色体均含有微缺失,故每个多态性位点各个等位基因的总计数为0,即不模拟产生特异性扩增序列或模拟产生一些随机但是不能定位到任何染色体上的序列。
以每一个样本模拟的等位基因序列为输入文件,利用ART模拟软件模拟高通量测序结果,其中ART模拟软件的fold参数设定为50或100。
4.模拟含亚染色体微重复的孕妇血浆DNA样本。
为了检测亚染色体水平的微重复变异,我们模拟了含亚染色体微重复的孕妇血浆DNA样本,其中每一个样本中母亲和胎儿均模拟了7对染色体,分别编号为1号(Chr01)、2号(Chr02)、3号(Chr03)、4号(Chr04)、5号(Chr05)、6号(Chr06)和7号(Chr07)。每一个样本中,1号-7号染色体上均按实施例3所述的方法模拟100个多态性位点。每个样本从以下浓度中(0.02、0.05、0.10、0.15、0.20、0.25、0.30、0.35、0.40、0.45)随机选择一个浓 度作为模拟的胎儿DNA浓度。在这里,每一个微重复区域被当成是两条染色体,而多态性位点由该微重复区域选取,因此在单一基因组中一条正常一条含微重复的染色体标记为三体,而两条均含微重复的染色体标记为四体。
模拟的1号染色体在样本中为参照染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的2号染色体在样本中为双体-双体染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的3号染色体在样本中为双体-三体染色体,其中每一个多态性位点的基因型模拟为双体-三体基因型之一。由于一条胎儿染色体含有微重复,故每个多态性位点各个等位基因的总计数为200+100f。
模拟的4号染色体在样本中为三体-双体染色体,其中每一个多态性位点的基因型模拟为三体-双体基因型之一。由于一条母体染色体含有微重复,故每个多态性位点各个等位基因的总计数为300-100f。
模拟的5号染色体在样本中为三体-三体染色体,其中每一个多态性位点的基因型模拟为三体-三体基因型之一。由于一条母体染色体和一条胎儿染色体均含有微重复,故每个多态性位点各个等位基因的总计数为300。
模拟的6号染色体在样本中为三体-四体染色体,其中每一个多态性位点的基因型模拟为三体-四体基因型之一。由于一条母体染色体和两条胎儿染色体均含有微重复,故每个多态性位点各个等位基因的总计数为300+100f。
模拟的7号染色体在样本中为四体-四体染色体,其中每一个多态性位点的基因型模拟为四体-四体基因型之一。由于两条母体染色体和两条胎儿染色体均含有微重复,故每个多态性位点各个等位基因的总计数为400。
以每一个样本模拟的等位基因序列为输入文件,利用ART模拟软件模拟高通量测序结果,其中ART模拟软件的fold参数设定为50或100。
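The per-site total counts quoted for each simulated karyotype in sections 1 to 4 (200, 200−100f, 100+100f, 300−100f, 300+100f, and so on) all follow from one expression: each maternal chromosome copy contributes 100×(1−f) and each fetal copy contributes 100×f. A minimal sketch, with the hypothetical helper name `simulated_total_count`:

```python
def simulated_total_count(maternal_copies, fetal_copies, f, per_chromosome=100):
    """Total simulated genome copies at a site for a maternal-fetal karyotype.

    maternal_copies / fetal_copies: 0 nullisomy, 1 monosomy, 2 disomy,
    3 trisomy, 4 tetrasomy.
    """
    return per_chromosome * (maternal_copies * (1 - f) + fetal_copies * f)
```

For example, `simulated_total_count(2, 1, f)` gives 200−100f for a disomy-monosomy site and `simulated_total_count(3, 4, f)` gives 300+100f for a trisomy-tetrasomy site.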
5.模拟含短序列水平变异的孕妇血浆DNA样本。
为了检测短序列水平的变异,我们模拟了含短序列水平变异位点的孕妇血浆DNA样本,其中每一个样本中母亲和胎儿均模拟了2对染色体,分别编号为1号(Chr01)和2号(Chr02)。 每一个样本中,1号和2号染色体上均按实施例3所述的方法模拟100个多态性位点。每个样本从以下浓度中(0.02、0.05、0.10、0.15、0.20、0.25、0.30、0.35、0.40、0.45)随机选择一个浓度作为模拟的胎儿DNA浓度。
模拟的1号染色体在样本中为参照染色体,其中每一个多态性位点的基因型模拟为正常双体-双体基因型之一,而每个多态性位点各个等位基因的总计数为200。
模拟的2号染色体在样本中为双体-双体染色体,每个位点模拟的各个等位基因的总计数为200。对于任意一个模拟的位点,选择其中一个等位基因标记为野生型(正常型,用大写字母A代表),其余等位基因标记为突变型(分别用小写字母a、b、c或d代表),则每一个模拟的位点只能是以下14种基因型之一,分别为AA|AA、AA|Aa、Aa|AA、Aa|Aa、Aa|Ab、Aa|aa、Aa|ab、aa|Aa、aa|aa、aa|ab、ab|Aa、ab|aa、ab|ab或ab|ac。随机模拟2号染色体上100个待检测位点,而每一个位点从14种基因型中随机选择一个,然后根据设定的胎儿DNA浓度以及实施例3所述的方法按比例模拟其各个等位基因的序列。
以每一个样本模拟的等位基因序列为输入文件,利用ART模拟软件模拟高通量测序结果,其中ART模拟软件的fold参数设定为50或100。
6.模拟单一基因组样本。
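To detect chromosome-level or subchromosome-level variation in a single genome, we simulated genomic DNA samples from non-pregnant sources (for example, genomic DNA from a pre-implantation embryo). Five chromosomes were simulated in each sample, numbered 1 (Chr01) to 5 (Chr05), and 100 polymorphic sites were simulated on each of chromosomes 1 to 5 by the method of Example 3. Here a normal chromosome pair is labeled disomy, each microdeletion region is treated as a whole chromosome, each microduplication region is treated as two chromosomes, and the polymorphic sites are chosen from within the microdeletion/microduplication region. Within a single genome, one normal chromosome plus one carrying the microdeletion is labeled monosomy, two chromosomes both carrying the microdeletion are labeled nullisomy, one normal chromosome plus one carrying the microduplication is labeled trisomy, and two chromosomes both carrying the microduplication are labeled tetrasomy.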
模拟的1号染色体在样本中为正常双体染色体,其中每一个多态性位点的基因型模拟为正常双体基因型(AA或AB)之一,而每个多态性位点各个等位基因的总计数为200。
模拟的2号染色体在样本中为缺体或纯合微缺失染色体,其中每一个多态性位点的基因型模拟为正常缺体或纯合微缺失基因型
Figure PCTCN2021125359-appb-000036
而每个多态性位点各个等位基因的总计数为0,因此不模拟产生特异性扩增序列或模拟产生一些随机但是不能定位到任何染色体上的序列。
模拟的3号染色体在样本中为单体或杂合微缺失染色体,其中每一个多态性位点的基因型模拟为单体或杂合微缺失基因型
Figure PCTCN2021125359-appb-000037
而每个多态性位点各个等位基因的总计数为100。
模拟的4号染色体在样本中为三体或杂合微重复染色体,其中每一个多态性位点的基因型模拟为三体或杂合微重复基因型(AAA、AAB或ABC)之一,而每个多态性位点各个等位基因的总计数为300。
模拟的5号染色体在样本中为四体或纯合微重复染色体,其中每一个多态性位点的基因型模拟为四体或纯合微重复基因型(AAAA、AAAB、AABB、AABC或ABCD)之一,而每个多态性位点各个等位基因的总计数为400。
以每一个样本模拟的等位基因序列为输入文件,利用ART模拟软件模拟高通量测序结果,其中ART模拟软件的fold参数设定为50或100。
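Example 22. Detecting fetal chromosomal monosomy from the fetal DNA concentration in a maternal plasma DNA sample and the allele counts of the sites under analysis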
按照实施例21所述的方法模拟含染色体单体的孕妇血浆DNA样本,其中1号、2号和3号染色体分别为参照染色体、正常双体-双体核型的染色体和异常双体-单体核型的染色体。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号或3号染色体上各个多态性位点的等位基因计数按照实施例17所述的方法分别估算2号或3号染色体的核型。为了检测胎儿2号或3号染色体是否有染色体单体异常,我们需要考虑2号或3号染色体上的各个多态性位点的各个等位基因计数是对核型为双体-双体的基因型还是对核型为双体-单体的基因型有更好的综合拟合优度检验结果。
图5所示是利用拟合优度检验检测模拟样本中胎儿染色体的单体异常。图5a是利用综合拟合优度检验结果来检测模拟的样本中正常双体-双体核型染色体的胎儿单体异常。其中y轴AIC值是经校正的AIC值,由该位点的G检验的AIC值除以胎儿浓度再除以该位点等位基因的总计数得到。图5b是利用综合拟合优度检验结果来检测模拟的样本中双体-单体核型染色体的胎儿单体异常。对于正常染色体(2号双体-双体核型染色体),几乎所有的多态性位点对双体-双体核型的基因型有很好的拟合,但是对双体-单体核型的基因型拟合不好。对于异常染色体(3号双体-单体核型染色体),几乎所有的多态性位点对双体-单体核型的基因型有很好的拟合,但是对双体-双体核型的基因型拟合不好。因此,检测结果为胎儿2号染色体 未发现染色体单体异常而胎儿3号染色体发现染色体单体异常。
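Example 23. Detecting fetal chromosomal trisomy from the fetal DNA concentration in a maternal plasma DNA sample and the allele counts of the sites under analysis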
按照实施例21所述的方法模拟含染色体三体的孕妇血浆DNA样本,其中1号、2号和3号染色体分别为参照染色体、正常双体-双体核型的染色体和异常双体-三体核型的染色体。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号或3号染色体上各个多态性位点的等位基因计数按照实施例17所述的方法分别估算2号或3号染色体的核型。为了检测胎儿2号或3号染色体是否有染色体三体异常,我们需要考虑2号或3号染色体上的各个多态性位点的各个等位基因计数是对核型为双体-双体的基因型还是对核型为双体-三体的基因型有更好的综合拟合优度检验结果。
图6所示是利用拟合优度检验检测模拟样本中胎儿染色体的三体异常。图6a是利用综合拟合优度检验结果来检测模拟的样本中正常双体-双体核型染色体的胎儿三体异常。其中y轴AIC值是经校正的AIC值,由该位点的G检验的AIC值除以胎儿浓度再除以该位点等位基因的总计数得到。图6b是利用综合拟合优度检验结果来检测模拟的样本中双体-三体核型染色体的胎儿三体异常。对于正常染色体(2号双体-双体核型染色体),几乎所有的多态性位点对双体-双体核型的基因型有很好的拟合,但是对双体-三体核型的基因型拟合不好。对于异常染色体(3号双体-三体核型染色体),几乎所有的多态性位点对双体-三体核型的基因型有很好的拟合,但是对双体-双体核型的基因型拟合不好。因此,检测结果为胎儿2号染色体未发现染色体三体异常而胎儿3号染色体发现染色体三体异常。
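Example 24. Detecting fetal chromosomal microdeletion from the fetal DNA concentration in a maternal plasma DNA sample and the allele counts of the sites under analysis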
按照实施例21所述的方法模拟含染色体微缺失的孕妇血浆DNA样本,其中1号至7号染色体分别为参照染色体、母亲和胎儿均正常的染色体(正常双体-双体核型的染色体)、母亲正常胎儿一条染色体含微缺失的染色体(双体-单体核型的染色体)、母亲一条染色体含微缺失胎儿正常的染色体(单体-双体核型的染色体)、母亲和胎儿均一条染色体含微缺失的染色体(单体-单体核型的染色体)、母亲一条染色体含微缺失而胎儿两条染色体均含微缺失的染色体(单体-缺体核型的染色体)和母亲和胎儿各两条染色体均含微缺失的染色体(缺体-缺体核型 的染色体)。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号至7号染色体上各个多态性位点的等位基因计数按照实施例17所述的方法分别估算2号至7号各染色体的核型。为了检测胎儿某号染色体是否有染色体微缺失异常,我们需要对每一个可能的母亲胎儿微缺失核型分别利用该染色体上的各个多态性位点的各个等位基因计数进行综合拟合优度检验,然后根据对所有多态性位点各个等位基因计数有最优综合拟合的核型判断胎儿染色体是否有微缺失异常。
图7所示是利用拟合优度检验检测模拟样本中胎儿染色体的微缺失异常。图7a是利用综合拟合优度检验结果来检测模拟的样本中单体-双体核型染色体(母亲为杂合微缺失胎儿为正常)的胎儿染色体微缺失异常。其中y轴AIC值是经校正的AIC值,由该位点的G检验的AIC值除以胎儿浓度再除以该位点等位基因的总计数得到。图7b是图7a的局部放大。图7c是利用综合拟合优度检验结果来检测模拟的样本中单体-单体核型染色体(母亲和胎儿均为杂合微缺失)的胎儿染色体微缺失异常。图7d是图7c的局部放大。对于胎儿正常的单体-双体核型的染色体,几乎所有的多态性位点对单体-双体核型的基因型有很好的拟合,但是对其它可能核型的基因型拟合不好。对于胎儿含微缺失的单体-单体核型的染色体,几乎所有的多态性位点对单体-单体核型的基因型有很好的拟合,但是对其它可能核型的基因型拟合不好。因此,图7a和图7b的检测结果为胎儿该号染色体未发现微缺失异常而图7c和图7d的检测结果为胎儿该号染色体发现微缺失异常。
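Example 25. Detecting fetal chromosomal microduplication from the fetal DNA concentration in a maternal plasma DNA sample and the allele counts of the sites under analysis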
按照实施例21所述的方法模拟含染色体微重复的孕妇血浆DNA样本,其中1号至7号染色体分别为参照染色体、母亲和胎儿均正常的染色体(正常双体-双体核型的染色体)、母亲正常胎儿一条染色体含微重复的染色体(双体-三体核型的染色体)、母亲一条染色体含微重复胎儿正常的染色体(三体-双体核型的染色体)、母亲和胎儿均一条染色体含微重复的染色体(三体-三体核型的染色体)、母亲一条染色体含微重复而胎儿两条染色体均含微重复的染色体(三体-四体核型的染色体)和母亲和胎儿各两条染色体均含微重复的染色体(四体-四体核型的染色体)。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号至7号染色体上各个多态性位点的等位基因计数按照实施例17所述的方法分别估算2号至7号各染色体的核型。为了检测胎儿某号染色体是否有染色体微重复异常,我们需要对每一个可能的母亲胎儿微重复核型分别利用该染色体上的各个多态性位点的各个等位基因计数进行综合拟合优度检验,然后根据对所有多态性位点各个等位基因计数有最优综合拟合的核型判断胎儿染色体是否有微重复异常。
图8所示是利用拟合优度检验检测模拟样本中胎儿染色体的微重复异常。图8a是利用综合拟合优度检验结果来检测模拟的样本中三体-双体核型染色体(母亲为杂合微重复胎儿为正常)的胎儿染色体微重复异常。其中y轴AIC值是经校正的AIC值,由该位点的G检验的AIC值除以胎儿浓度再除以该位点等位基因的总计数得到。图8b是图8a的局部放大。图8c是利用综合拟合优度检验结果来检测模拟的样本中三体-三体核型染色体(母亲和胎儿均为杂合微重复)的胎儿染色体微重复异常。图8d是图8c的局部放大。对于胎儿正常的三体-双体核型的染色体,几乎所有的多态性位点对三体-双体核型的基因型有很好的拟合,但是对其它可能核型的基因型拟合不好。对于胎儿含微重复的三体-三体核型的染色体,几乎所有的多态性位点对三体-三体核型的基因型有很好的拟合,但是对其它可能核型的基因型拟合不好。因此,图8a和图8b的检测结果为胎儿该号染色体未发现微重复异常而图8c和图8d的检测结果为胎儿该号染色体发现微重复异常。
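Example 26. Determining the wild-type/mutant status of a site under analysis from the fetal DNA concentration in a maternal plasma DNA sample and the allele counts of that site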
按照实施例21所述的方法模拟含特定短序列位点变异的孕妇血浆DNA样本,其中1号至2号染色体分别为参照染色体和含特定短序列位点变异的染色体。具体的,1号染色体中各个多态性位点选自不同的染色体区域,而2号染色体的多个多态性位点选自同一个特定的位点但是属于利用相同和/或不同的引物进行的独立扩增结果,也就是说,2号染色体上模拟的多态性位点代表了一个特定位点的不同独立重复。
为了检测特定位点的野生突变型,我们采用了两种方案:(1)直接对所有可能的野生突变基因型进行拟合优度检验并综合分析拟合优度检验结果;(2)首先估计待检测位点的不区分野生突变等位基因的情况下的基因型,然后再确定估计的基因型中各个等位基因的野生突变型,从而决定母亲和/或胎儿的各个等位基因的野生突变型。
(1)对所有可能的野生突变基因型进行拟合优度检验。(a)利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f。(b)列出2号染色体上该特定位点所有可能的野生突变基因型,即AA|AA、AA|Aa、Aa|AA、Aa|Aa、Aa|Ab、Aa|aa、Aa|ab、aa|Aa、aa|aa、aa|ab、ab|Aa、ab|aa、ab|ab和ab|ac,其中A代表野生型等位基因而a、b和c代表各个突变型等位基因。(c)对每一个野生突变基因型,根据上述步骤(a)中估算的样本中胎儿DNA的浓度f估计其各个等位基因的理论计数。(d)对每一个野生突变基因型,根据其各个等位基因的核酸序列确定其实际计数。(e)对每一个位点的独立重复,对每一个野生突变基因型进行拟合优度检验。(f)综合分析拟合优度检验的结果,选择对所有重复位点有综合最优拟合的野生突变型作为该特定位点的估计基因型。(g)根据估计的野生突变基因型确定母亲和/或胎儿的各个等位基因的野生突变型。
(2)先估计不区分野生突变等位基因的情况下的基因型,然后根据各个等位基因的野生突变核酸序列确定母亲和/或胎儿的各个等位基因的野生突变型。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号染色体上各个待检测特定短序列位点的等位基因计数按照实施例11所述的方法分别估算2号染色体上的各个特定短序列位点的基因型。为了检测胎儿是否有短序列水平的遗传学变异,比如导致某些单基因遗传病的点突变、短的插入缺失突变等,每一个待检测重复位点首先按照实施例11所述的方法估计在不考虑各个等位基因序列是否属于野生型序列的情况下的基因型,然后再根据各个等位基因的序列是否是正常野生型序列来确定该位点在母亲和胎儿是否有变异。
图9所示是利用拟合优度检验检测模拟样本中胎儿短序列位点的野生突变型。图9a是利用拟合优度检验结果来检测模拟的母亲杂合突变而胎儿正常的短序列位点的基因型(不同的点代表待测定目标靶位点的不同独立重复)。其中y轴AIC值是经校正的AIC值,由该位点的G检验的AIC值除以胎儿浓度再除以该位点等位基因的总计数得到。图9b是图9a的局部放大。结果表明,母亲为杂合而胎儿为纯合基因型(AB|AA)。进一步分析发现等位基因A为野生型而等位基因B为突变型,因此确定该位点母亲为杂合突变而胎儿为正常。图9c是利用拟合优度检验结果来检测模拟的母亲和胎儿均是杂合突变的短序列位点的基因型。图9d是图9c的局部放大。结果表明,母亲和胎儿均为杂合基因型(AB|AC)。进一步分析发现等位基因A为野生型而等位基因B和C均为突变型,因此确定该位点母亲和胎儿均为杂合突变,并且胎 儿或者产生了新发突变或者遗传了父源性的等位基因突变。
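Example 27. Estimating the genotype of a site under analysis from the minor-component concentration f of the sample mixture and the relative allele count distribution plot of that site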
对于一个待检测位点,根据以下步骤估算该位点的基因型:
(1)分析待测样本,列出目标靶位点所有可能的基因型;
(2)计算每一个可能基因型中各个等位基因的相对计数理论值,并对每一个基因型选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记所有可能基因型的理论位置;
(3)计算靶DNA位点的各个等位基因的相对计数,并选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该靶DNA位点等位基因相对计数的实际位置;
(4)根据靶DNA位点在等位基因相对计数图形中的理论位置分布以及实际位置分布推断其基因型。
图10所示是孕妇血浆DNA样本中来源于正常核型染色体上的多态性位点在等位基因相对分布图上的理论分布。图10a为正常双体-双体核型染色体上的多态性位点所有可能的基因型及其各个等位基因相对计数的理论值。图10b是正常双体-双体核型染色体上的各个多态性位点其第二大的等位基因相对计数(RR2)相对于最大的等位基因相对计数(RR1)的分布。结果表明,每一个多态性位点在等位基因计数相对分布图上由于基因型不同而分布在不同的位置,根据其特定的分布位置可以推断出其基因型。
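The idea of the relative distribution plot can be sketched numerically without drawing it: compute the theoretical (RR1, RR2) position of every candidate genotype, compute the observed position of the site, and take the closest theoretical position. The assumptions here are that a relative count is the allele count divided by the sum of all allele counts at the site, that closeness is plain Euclidean distance, and that the helper names are hypothetical.

```python
from collections import Counter

DISOMY_DISOMY = ("AA|AA", "AA|AB", "AB|AA", "AB|AB", "AB|AC")

def theoretical_position(genotype, f):
    """Theoretical (RR1, RR2): largest and second-largest relative allele counts."""
    major, minor = genotype.split("|")
    rel = Counter()
    for a in major:
        rel[a] += 1 - f
    for a in minor:
        rel[a] += f
    total = sum(rel.values())
    frac = sorted((v / total for v in rel.values()), reverse=True) + [0.0]
    return frac[0], frac[1]

def nearest_genotype(counts, f, genotypes=DISOMY_DISOMY):
    """Genotype whose theoretical position is closest to the observed one."""
    r = sorted(counts, reverse=True) + [0]
    total = sum(r)
    observed = (r[0] / total, r[1] / total)

    def squared_distance(g):
        x, y = theoretical_position(g, f)
        return (x - observed[0]) ** 2 + (y - observed[1]) ** 2

    return min(genotypes, key=squared_distance)
```

In practice the theoretical positions would be drawn as labeled points and the observed sites overlaid on the same axes; the nearest-point rule is only a numerical stand-in for reading the plot.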
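Example 28. Estimating the karyotype of a test target from the minor-component concentration f of the sample mixture and the relative allele count distribution plot of a set of polymorphic sites within the target region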
我们利用目标区域内各个多态性位点的等位基因计数相对分布图来检测待测目标在染色体水平的非整倍性或亚染色体水平的缺失重复变异,其主要步骤为:
(1)分析待测样本,列出样本在目标染色体或亚染色体片段上的所有可能的核型;
(2)对于每一个可能的核型,列出样本中该核型的染色体或亚染色体上目标组的靶DNA位点所有可能的基因型,然后对于每一个基因型选取至少一个非最大的等位基因相对计数理论值对最大的等位基因相对计数理论值作图来标记该基因型的理论位置;
(3)对每一个目标组靶DNA位点,计算其各个等位基因的相对计数并选取至少一个非最大的等位基因相对计数对最大的等位基因相对计数作图来标记该位点的实际位置;
(4)根据所有靶DNA位点在等位基因相对计数图形中的理论位置分布以及实际位置分布来推断待检测目标染色体或亚染色体片段的核型。
图11所示是孕妇血浆DNA样本中母亲正常而胎儿非整倍性变异的染色体上各个多态性位点在等位基因相对分布图上的理论分布。图11a为双体-双体核型和双体-单体核型染色体上的多态性位点所有可能的基因型及其各个等位基因相对计数的理论值。图11b是双体-双体核型和双体-单体核型染色体上的各个多态性位点其第二大的等位基因相对计数(RR2)相对于最大的等位基因相对计数(RR1)的理论分布。图11c为双体-双体核型和双体-三体核型染色体上各个多态性位点所有可能的基因型及其各个等位基因相对计数的理论值。图11d是双体-双体核型和双体-三体核型染色体上的各个多态性位点其第二大或第四大的等位基因相对计数(RR2或RR4)相对于最大的等位基因相对计数(RR1)的理论分布。
图12所示是孕妇血浆DNA样本中母亲或胎儿微缺失或微重复变异的亚染色体上各个多态性位点在等位基因相对分布图上的理论分布。图12a为母亲或胎儿有微缺失核型染色体上的多态性位点所有可能的基因型及其各个等位基因相对计数的理论值。图12b是母亲或胎儿有微缺失核型染色体上的各个多态性位点其第二大的等位基因相对计数(RR2)相对于最大的等位基因相对计数(RR1)的理论分布。图12c为母亲有微重复胎儿正常的亚染色体上各个多态性位点所有可能的基因型及其各个等位基因相对计数的理论值。图12d是母亲有微重复胎儿正常的亚染色体上各个多态性位点其第二或第三大的等位基因相对计数(RR2或RR3)相对于最大的等位基因相对计数(RR1)的理论分布。
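Example 29. Estimating the wild-type/mutant status of a site under analysis from the minor-component concentration f of the sample mixture and the relative counts of its wild-type and non-wild-type alleles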
我们利用待分析位点的野生型等位基因计数以及其各个非野生型等位基因计数来检测该位点的野生突变型,其主要步骤为:
(1)分析待测样本,列出目标靶位点的野生型序列和目标靶位点所有可能的基因型;
(2)计算每一个可能基因型中野生型等位基因和其它非野生型各个等位基因的相对计数理论值,并对每一个基因型选取至少一个非野生型等位基因相对计数理论值对野生型等位基因相对计数理论值作图来标记所有可能基因型的理论位置;
(3)计算样本中目标靶DNA位点的野生型等位基因和其它非野生型各个等位基因的相对计数,并选取至少一个非野生型等位基因相对计数对野生型等位基因相对计数作图来标记该靶DNA 位点基因型的实际位置;
(4)根据靶DNA位点在等位基因相对计数图形中的理论位置分布以及实际位置分布推断其野生突变型。
图13所示是孕妇血浆DNA样本中正常的双体-双体染色体上的待测位点所有可能基因型的各个等位基因计数相对分布图。图13a为正常的双体-双体染色体上的待测位点所有可能的基因型及其各个等位基因相对计数的理论值。图13b为正常的双体-双体染色体上的待测位点其最大的非野生型等位基因相对计数(RR2)相对于野生型的等位基因相对计数(RR1)的理论分布图。其中,A代表野生型等位基因,a、b或c代表非野生型(突变型)等位基因。
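Example 30. Detecting fetal chromosomal monosomy from the fetal DNA concentration in a maternal plasma DNA sample and the relative allele count distribution plot of the sites under analysis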
按照实施例21所述的方法模拟含染色体单体的孕妇血浆DNA样本,其中1号、2号和3号染色体分别为参照染色体、正常双体-双体核型的染色体和异常双体-单体核型的染色体。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号或3号染色体上各个多态性位点的等位基因计数按照实施例28所述的方法分别估算2号或3号染色体的核型。为了检测胎儿2号或3号染色体是否有染色体单体异常,我们需要检测2号或3号染色体是正常的双体-双体核型(母亲和胎儿均是双体)还是异常的双体-单体核型(母亲是正常的双体而胎儿是异常的单体)。因此,我们首先在等位基因计数相对分布图上分别标出所有双体-双体和双体-单体基因型的理论位置,然后根据待分析染色体上各个多态性位点在等位基因计数相对分布图上的分布来确定该染色体的核型。
图14是利用多态性位点各个等位基因计数相对分布图检测胎儿染色体的单体变异。图14a是对模拟的正常双体-双体染色体上所有多态性位点的等位基因相对计数进行作图。图14b是对模拟的双体-单体染色体上所有多态性位点的等位基因相对计数进行作图。结果表明,图14a中几乎所有的多态性位点相对计数均分布在对应的双体-双体的基因型簇周围,而在相应的双体-单体基因型簇周围几乎没有分布。而图14b中,几乎所有的多态性位点相对计数均分布在对应的双体-单体的基因型簇周围,而在相应的双体-双体基因型簇周围几乎没有分布。因此,图14a的待分析染色体核型为双体-双体型,即胎儿该染色体正常;而图14b的待分析染色体核型为双体-单体型,即胎儿该染色体为异常的单体。
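Example 31. Detecting fetal chromosomal trisomy from the fetal DNA concentration in a maternal plasma DNA sample and the relative allele count distribution plot of the sites under analysis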
按照实施例21所述的方法模拟含染色体三体的孕妇血浆DNA样本,其中1号、2号和3号染色体分别为参照染色体、正常双体-双体核型的染色体和异常双体-三体核型的染色体。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号或3号染色体上各个多态性位点的等位基因计数按照实施例28所述的方法分别估算2号或3号染色体的核型。为了检测胎儿2号或3号染色体是否有染色体三体异常,我们需要检测2号或3号染色体是正常的双体-双体核型(母亲和胎儿均是双体)还是异常的双体-三体核型(母亲是正常的双体而胎儿是异常的三体)。因此,我们首先在等位基因计数相对分布图上分别标出所有双体-双体和双体-三体基因型的理论位置,然后根据待分析染色体上各个多态性位点在等位基因计数相对分布图上的分布来确定该染色体的核型。
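Figure 15 shows the detection of fetal chromosomal trisomy using the relative allele count distribution plot of polymorphic sites. Figure 15a plots the relative allele counts of all polymorphic sites on a simulated normal disomy-disomy chromosome. Figure 15b plots the relative allele counts of all polymorphic sites on a simulated disomy-trisomy chromosome. In Figure 15a, the relative counts of almost all polymorphic sites lie around the corresponding disomy-disomy genotype clusters, with almost none around the disomy-trisomy genotype clusters. In Figure 15b, the relative counts of almost all polymorphic sites lie around the corresponding disomy-trisomy genotype clusters, with almost none around the disomy-disomy genotype clusters. Therefore, the karyotype of the chromosome analyzed in Figure 15a is disomy-disomy, that is, the fetal chromosome is normal, whereas the karyotype of the chromosome analyzed in Figure 15b is disomy-trisomy, that is, the fetal chromosome is an abnormal trisomy.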
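Example 32. Detecting fetal chromosomal microdeletion from the fetal DNA concentration in a maternal plasma DNA sample and the relative allele count distribution plot of the sites under analysis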
按照实施例21所述的方法模拟含染色体微缺失的孕妇血浆DNA样本,其中1号至7号染色体分别为参照染色体、母亲和胎儿均正常的染色体(正常双体-双体核型的染色体)、母亲正常胎儿一条染色体含微缺失的染色体(双体-单体核型的染色体)、母亲一条染色体含微缺失胎儿正常的染色体(单体-双体核型的染色体)、母亲和胎儿均一条染色体含微缺失的染色体(单体-单体核型的染色体)、母亲一条染色体含微缺失而胎儿两条染色体均含微缺失的染色体(单体-缺体核型的染色体)和母亲和胎儿各两条染色体均含微缺失的染色体(缺体-缺体核型 的染色体)。
分析模拟样本的测序数据,首先利用1号参照染色体上的各个多态性位点的等位基因计数按照实施例8所述的方法估算样本中胎儿DNA的浓度f;然后根据样本中胎儿DNA浓度f和2号至7号染色体上各个多态性位点的等位基因计数按照实施例28所述的方法分别估算2号至7号各染色体的核型。为了检测胎儿某号染色体是否有染色体微缺失异常,我们需要检测该染色体是否是正常的双体-双体核型(母亲和胎儿均是双体)还是异常的含微缺失的核型之一(母亲和/或胎儿该染色体上有微缺失)。因此,我们首先在等位基因计数相对分布图上分别标出母亲和胎儿染色体可能含有微缺失的情况下所有可能基因型的位置,然后根据待分析染色体上各个多态性位点在等位基因计数相对分布图上的分布来确定该染色体的核型。
图16是利用多态性位点各个等位基因计数相对分布图检测胎儿染色体的微缺失变异。图16a是对模拟的单体-双体染色体上所有多态性位点的等位基因相对计数进行作图。图16b是对模拟的单体-单体染色体上所有多态性位点的等位基因相对计数进行作图。结果表明,图16a中几乎所有的多态性位点相对计数均分布在对应的单体-双体的基因型簇周围,而在其它核型的基因型簇周围几乎没有分布。而图16b中,几乎所有的多态性位点相对计数均分布在对应的单体-单体的基因型簇周围,而在其它核型的基因型簇周围几乎没有分布。因此,图16a的待分析染色体核型为单体-双体型,即胎儿该染色体正常不含微缺失;而图16b的待分析染色体核型为单体-单体型,即胎儿的该染色体其中一条含微缺失变异。
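Example 33. Detecting fetal chromosomal microduplications using the fetal DNA concentration in a maternal plasma DNA sample and the relative distribution plot of allele counts at the sites under analysis
A maternal plasma DNA sample containing chromosomal microduplications was simulated according to the method described in Example 21, in which chromosomes 1 to 7 were, respectively: the reference chromosome; a chromosome normal in both mother and fetus (normal disomy-disomy karyotype); a chromosome normal in the mother with a microduplication on one fetal copy (disomy-trisomy karyotype); a chromosome with a microduplication on one maternal copy and normal in the fetus (trisomy-disomy karyotype); a chromosome with a microduplication on one copy in both mother and fetus (trisomy-trisomy karyotype); a chromosome with a microduplication on one maternal copy and on both fetal copies (trisomy-tetrasomy karyotype); and a chromosome with a microduplication on both copies in both mother and fetus (tetrasomy-tetrasomy karyotype).
To analyze the sequencing data of the simulated sample, the fetal DNA concentration f was first estimated from the allele counts of the polymorphic sites on reference chromosome 1 according to the method described in Example 8; the karyotype of each of chromosomes 2 to 7 was then estimated from f and the allele counts of the polymorphic sites on that chromosome according to the method described in Example 28. To determine whether a given fetal chromosome carries a microduplication, we need to test whether the chromosome has the normal disomy-disomy karyotype (both mother and fetus are disomic) or one of the abnormal microduplication karyotypes (the mother and/or the fetus carries a microduplication on that chromosome). In principle, we would first mark, on the relative allele count distribution plot, the positions of all genotypes that are possible when the maternal and fetal chromosomes may carry microduplications, and then determine the karyotype of the chromosome from how its polymorphic sites are distributed on that plot. However, the number of such genotypes runs from several dozen to over a hundred, and marking all of them on the plot would make it very hard to classify the relative allele counts of the individual polymorphic sites. Here we therefore mark only the genotypes in which the fetus is normal and carries no microduplication. If the relative allele counts of the polymorphic sites on the chromosome under test do not cluster at the positions of the fetal-normal genotypes but do cluster elsewhere, the chromosome in the sample carries a fetal microduplication or another type of variation.
Figure 17 shows the detection of fetal chromosomal microduplications using the relative distribution plot of allele counts at polymorphic sites. Figure 17a plots the relative allele counts of all polymorphic sites on a simulated trisomy-disomy chromosome. Figure 17b plots the relative allele counts of all polymorphic sites on a simulated trisomy-trisomy chromosome. In Figure 17a, nearly all polymorphic sites fall around the fetal-normal genotype clusters. In Figure 17b, the polymorphic sites clearly separate into several clusters, but these do not lie around the fetal-normal genotype clusters. In the chromosome analyzed in Figure 17a, the fetal chromosome is therefore normal and carries no microduplication, whereas in the chromosome analyzed in Figure 17b, either at least one fetal copy of the chromosome carries a microduplication or the chromosome carries some other type of variation.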
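Example 34. Determining the wild-type/mutant status of the sites under analysis using the fetal DNA concentration in a maternal plasma DNA sample and the relative distribution plot of allele counts at those sites
A maternal plasma DNA sample containing a variant at a specific short-sequence site was simulated according to the method described in Example 21, in which chromosomes 1 and 2 were, respectively, the reference chromosome and a chromosome carrying the specific short-sequence variant. Specifically, the polymorphic sites on chromosome 1 were chosen from different chromosomal regions, whereas the multiple polymorphic sites on chromosome 2 all correspond to one specific site, each being the result of an independent amplification with the same and/or different primers. In other words, the simulated polymorphic sites on chromosome 2 are independent replicates of a single specific site.
To analyze the sequencing data of the simulated sample, the fetal DNA concentration f was first estimated from the allele counts of the polymorphic sites on reference chromosome 1 according to the method described in Example 8; the wild-type/mutant status of each specific short-sequence site on chromosome 2 was then estimated from f and the allele counts of that site according to the method described in Example 29. To detect short fetal genetic variants, such as the point mutations and short insertion/deletion mutations that cause certain monogenic disorders, every possible combination of maternal and fetal genotypes must be considered at each site (the wild-type allele is denoted by the uppercase letter A, and variant alleles are denoted by the lowercase letters a to c in decreasing order of allele count). These are: all four gene copies of mother and fetus are non-wild-type variants (aa|aa, aa|ab, ab|aa, ab|ab or ab|ac); both maternal copies are non-wild-type variants and the fetus is a wild-type/mutant heterozygote (aa|Aa or ab|Aa); the mother is a wild-type/mutant heterozygote and the fetus is normal (Aa|AA); both mother and fetus are wild-type/mutant heterozygotes (Aa|Aa or Aa|Ab); the mother is a wild-type/mutant heterozygote and the fetus carries only non-wild-type variants (Aa|aa or Aa|ab); the mother is normal and the fetus is a wild-type/mutant heterozygote (AA|Aa); and both mother and fetus are normal wild type (AA|AA). Each simulated site under test was replicated 20 times at the sequencing level.
Figure 18 shows the detection of short-sequence fetal variants using the relative distribution plot of allele counts at polymorphic sites. Figure 18a plots the relative allele counts of the polymorphic site for the simulated ab|Aa genotype. Based on how the sequencing-level replicates of the site cluster on the relative count distribution plot, the genotype of the site is estimated as ab|Aa, i.e., the mother is heterozygous for two different mutant alleles and the fetus is a wild-type/mutant heterozygote. Figure 18b plots the relative allele counts of the polymorphic site for the simulated Aa|ab genotype. Based on how the sequencing-level replicates of the site cluster on the relative count distribution plot, the genotype of the site is estimated as Aa|ab, i.e., the mother is a wild-type/mutant heterozygote and the fetus is heterozygous for two different mutant alleles.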
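As an illustration of how replicate measurements of one site might be assigned to a wild-type/mutant genotype, the sketch below compares the mean relative allele counts of the replicates with the theoretical values of each candidate genotype and picks the nearest one. It is not the procedure of Example 29, which is not reproduced in this section. It assumes that both genomes are disomic, that the expected relative count of an allele is ((1 - f) × maternal copies + f × fetal copies) / 2, and that a nearest-centroid rule over the full allele vector stands in for visual inspection of the plot. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

ALLELES = ("A", "a", "b", "c")  # A = wild type; a, b, c = variant alleles

# Candidate maternal|fetal genotypes listed in this example.
CANDIDATES = [
    "aa|aa", "aa|ab", "ab|aa", "ab|ab", "ab|ac", "aa|Aa", "ab|Aa",
    "Aa|AA", "Aa|Aa", "Aa|Ab", "Aa|aa", "Aa|ab", "AA|Aa", "AA|AA",
]


def theoretical_fractions(genotype: str, f: float) -> Tuple[float, ...]:
    """Expected relative count of each allele for a maternal|fetal genotype."""
    maternal, fetal = genotype.split("|")
    m, k = Counter(maternal), Counter(fetal)
    return tuple(((1 - f) * m[x] + f * k[x]) / 2 for x in ALLELES)


def call_genotype(replicates: List[Dict[str, int]], f: float) -> str:
    """Assign the replicate mean to the nearest theoretical genotype."""
    n = len(replicates)
    mean = tuple(
        sum(r.get(x, 0) / sum(r.values()) for r in replicates) / n
        for x in ALLELES
    )
    return min(CANDIDATES,
               key=lambda g: math.dist(mean, theoretical_fractions(g, f)))


if __name__ == "__main__":
    reps = [{"A": 50, "a": 500, "b": 450}, {"A": 48, "a": 505, "b": 447}]
    print(call_genotype(reps, f=0.10))
```

A distance-based rule like this gives no measure of confidence; a goodness-of-fit test over the candidate genotypes would be needed for that.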
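Example 35. Detecting genetic variation in a single-genome sample using the allele counts and relative distribution plots of the sites under analysis
We use the relative distribution plot of allele counts at the polymorphic sites of a target region to detect chromosome-level aneuploidy or sub-chromosome-level deletions and duplications of the target in a single-genome sample. The main steps are:
(1) calculating the relative count of each allele at every target DNA site in the target group;
(2) for each target DNA site, plotting its second-largest relative allele count against its largest relative allele count to produce distribution plot A, or plotting its largest relative allele count against the relative position of the target DNA site on the chromosome or sub-chromosomal region to produce distribution plot B;
(3) estimating the karyotype of the target in the single-genome sample from distribution plot A and/or distribution plot B of the target DNA sites in the target group.
A single-genome sample was simulated according to the method described in Example 21, in which chromosomes 1 to 5 were, respectively, disomic, nullisomic (or homozygous microdeletion), monosomic (or heterozygous microdeletion), trisomic (or heterozygous microduplication), and tetrasomic (or homozygous microduplication).
To detect whether a single-genome sample carries a chromosome-level or sub-chromosome-level variation, the following five cases must be considered: (1) both chromosomes are missing (nullisomy), or both chromosomes carry a microdeletion of the same region (homozygous microdeletion); (2) one chromosome is normal and the other is missing (monosomy) or carries a microdeletion (heterozygous microdeletion); (3) both chromosomes are normal; (4) there are three chromosomes (trisomy), or one chromosome is normal and the other carries a microduplication (heterozygous microduplication); (5) there are four chromosomes (tetrasomy), or both chromosomes carry a microduplication of the same region (homozygous microduplication).
Figure 19 shows the detection of the karyotype of a target chromosome or sub-chromosomal region in a single-genome sample using the relative allele counts at polymorphic sites. For each polymorphic site in the target region (a chromosome or a sub-chromosomal region), its second-largest relative allele count is plotted against its largest relative allele count (relative count plot A), or its largest relative allele count is plotted against the relative position of the site on the simulated chromosome (relative count position plot B). The genotypes of chromosomes with different karyotypes show distinct characteristic distributions on relative count plot A and on relative count position plot B, and these characteristic distributions can be used to determine the karyotype (type of variation) of the target chromosome or sub-chromosomal region.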
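The coordinates used for plots A and B, and the positions at which sites would be expected on plot A for a given copy number, can be sketched as follows. This is an illustration rather than the patented procedure. It assumes noise-free counts and that the expected relative count of an allele equals its share of the chromosome copies, so the theoretical points come from enumerating the ways the copies can be split among alleles. A nullisomic region yields no reads and has no point. All function names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def plot_a_point(counts: Dict[str, int]) -> Tuple[float, float]:
    """(largest, second largest) relative allele count of one site."""
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True) + [0]
    return ranked[0] / total, ranked[1] / total


def plot_b_point(counts: Dict[str, int], position: int,
                 region_length: int) -> Tuple[float, float]:
    """(relative position in the region, largest relative allele count)."""
    return position / region_length, plot_a_point(counts)[0]


def partitions(n: int, largest: int = 0) -> Iterator[Tuple[int, ...]]:
    """All ways to split n chromosome copies among alleles (descending)."""
    largest = largest or n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def theoretical_plot_a_points(copy_number: int) -> List[Tuple[float, float]]:
    """Expected plot A positions for 1 (monosomy) up to 4 (tetrasomy) copies."""
    points = set()
    for p in partitions(copy_number):
        second = p[1] if len(p) > 1 else 0
        points.add((p[0] / copy_number, second / copy_number))
    return sorted(points)


if __name__ == "__main__":
    for n, name in [(1, "monosomy"), (2, "disomy"), (3, "trisomy"), (4, "tetrasomy")]:
        print(name, theoretical_plot_a_points(n))
```

For example, a heterozygous site on a trisomic chromosome is expected near (2/3, 1/3), a position that a disomic chromosome does not produce.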
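In addition, all methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and/or all examples and/or exemplary language provided in certain embodiments herein is intended merely to better illustrate the invention and does not limit the scope of the invention otherwise claimed. No language in the specification should be construed as indicating that any non-claimed element is essential to the practice of the invention.
Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member may be referred to or claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group may be included in, or deleted from, a group for reasons of convenience and/or patentability.
Although the present technology has been described in full detail with reference to one or more specific embodiments, those of ordinary skill in the art will recognize that changes may be made to the embodiments specifically disclosed in this application, and that such modifications and improvements are within the scope and spirit of the technology. Accordingly, the inventive subject matter is not to be restricted except within the scope of the appended claims. Moreover, in interpreting the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context.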

Claims (29)

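  1. A method for calculating the concentration of the minor-component DNA in a sample, characterized in that the method comprises the following steps:
    (a1) setting a noise threshold α for the sample;
    (a2) for each target DNA site, first estimating its genotype from its allele counts, and then estimating, based on the estimated genotype, the count originating from the minor-component DNA (FC) and the total count (TC); and
    (a3) estimating the concentration of the minor-component DNA from the minor-component DNA counts (FC) and total counts (TC) of the target DNA sites.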
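  2. The method of claim 1, characterized in that step (a2) comprises the following steps:
    (a2-i) sorting the allele counts of the target DNA site in descending order, wherein the three largest allele counts are labeled R1, R2 and R3 in that order;
    (a2-ii) estimating the genotype of the target DNA site from its allele counts; and
    (a2-iii) estimating the count originating from the minor-component DNA (FC) and the total count (TC) from the estimated genotype of the target DNA site and its allele counts.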
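  3. The method of claim 2, characterized in that step (a2-ii) comprises the following steps:
    (a2-ii-1) determining, from the allele counts of the target DNA site, the number of alleles detected above the noise threshold at the target DNA site; if the number is 1, performing step (a2-ii-2) below; if the number is 2, performing step (a2-ii-3) below; if the number is greater than 2, performing step (a2-ii-4) below;
    (a2-ii-2) estimating the genotype of the target DNA site as AA|AA, and then performing step (a2-ii-5) below;
    (a2-ii-3) estimating the genotype of the target DNA site based on the number of alleles detected above the noise threshold being 2 and on the two largest allele counts of the target DNA site, and then performing step (a2-ii-5) below;
    (a2-ii-4) estimating the genotype of the target DNA site based on the number of alleles detected above the noise threshold being greater than 2 and on at least the two largest allele counts of the target DNA site, and then performing step (a2-ii-5) below; and
    (a2-ii-5) outputting the estimated genotype of the target site.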
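  4. The method of claim 3, characterized in that step (a2-ii-3) comprises the following steps:
    (a2-ii-3-1) determining whether the value of R1/(R1+R2) is less than 0.5+α; if so, estimating the genotype of the target DNA site as AB|AB and then performing step (a2-ii-3-3) below; if not, performing step (a2-ii-3-2) below;
    (a2-ii-3-2) determining whether the value of R1/(R1+R2) is less than 0.75; if so, estimating the genotype of the target DNA site as AB|AA and then performing step (a2-ii-3-3) below; if not, estimating the genotype of the target DNA site as AA|AB and then performing step (a2-ii-3-3) below; and
    (a2-ii-3-3) outputting the estimated genotype of the target site.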
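  5. The method of claim 3, characterized in that step (a2-ii-4) comprises the following steps:
    (a2-ii-4-1) determining whether R2/R1 is greater than or equal to 0.5, and/or whether R1/(R1+R2) is greater than or equal to 1/2 and less than or equal to 2/3, and/or whether R2/(R1+R2) is greater than or equal to 1/3 and less than or equal to 1/2; if so, estimating the genotype of the target DNA site as AB|AC and then performing step (a2-ii-4-3) below; if not, performing step (a2-ii-4-2) below;
    (a2-ii-4-2) flagging the allele counts of the site as abnormal, and then either estimating the genotype of the target site as NA and performing step (a2-ii-4-3) below, or setting the number of alleles detected above the noise threshold at the target DNA site to 2, estimating the genotype of the target site as described in step (a2-ii-3), and performing step (a2-ii-4-3) below; and
    (a2-ii-4-3) outputting the estimated genotype of the target site.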
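  6. The method of claim 2, characterized in that step (a2-iii) comprises the following steps:
    (a2-iii-1) if the estimated genotype of the target site is AA|AA, estimating the count originating from the minor-component DNA (FC) as NA and the total count (TC) as R1, or R1+R2, or R1+R2+R3, and then performing step (a2-iii-7) below;
    (a2-iii-2) if the estimated genotype of the target site is AB|AB, estimating the count originating from the minor-component DNA (FC) as NA and the total count (TC) as R1+R2 or R1+R2+R3, and then performing step (a2-iii-7) below;
    (a2-iii-3) if the estimated genotype of the target site is AB|AA, estimating the count originating from the minor-component DNA (FC) as R1-R2 and the total count (TC) as R1+R2 or R1+R2+R3, and then performing step (a2-iii-7) below;
    (a2-iii-4) if the estimated genotype of the target site is AA|AB, estimating the count originating from the minor-component DNA (FC) as twice R2 and the total count (TC) as R1+R2 or R1+R2+R3, and then performing step (a2-iii-7) below;
    (a2-iii-5) if the estimated genotype of the target site is AB|AC, estimating the count originating from the minor-component DNA (FC) as R1-R2+R3, or twice R3, or twice (R1-R2), and the total count (TC) as R1+R2+R3, and then performing step (a2-iii-7) below;
    (a2-iii-6) if the estimated genotype of the target site is none of the above genotypes, estimating the count originating from the minor-component DNA (FC) as NA and the total count (TC) as R1, or R1+R2, or R1+R2+R3, and then performing step (a2-iii-7) below; and
    (a2-iii-7) outputting the estimated count originating from the minor-component DNA (FC) and the total count (TC).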
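  7. A method for calculating the concentration of the minor-component DNA in a sample, characterized in that the method comprises the following steps:
    (b1) setting a noise threshold α for the sample, an initial concentration estimate f_0, and an iteration error tolerance ε;
    (b2) for each target DNA site, estimating its genotype from its allele counts and the minor-component DNA concentration value f_0;
    (b3) for each target DNA site, estimating the count originating from the minor-component DNA (FC) and the total count (TC) based on its estimated genotype;
    (b4) estimating the minor-component DNA concentration f from the minor-component DNA counts (FC) and total counts (TC) of the target sites; and
    (b5) determining whether the absolute value of f-f_0 is less than ε; if not, setting f_0=f and then performing step (b2); if so, taking f as the estimated concentration of the minor-component DNA in the sample.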
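  8. The method of claim 7, characterized in that step (b2) comprises the following steps:
    (b2-i) listing all possible genotypes of the target DNA site according to the origin of the sample;
    (b2-ii) for each possible genotype of the target DNA site, calculating the theoretical count of each allele from the minor-component DNA concentration value f_0 and the total count (TC) over all alleles of the target DNA site;
    (b2-iii) for each possible genotype of the target DNA site, performing a goodness-of-fit test between the observed allele counts of the target DNA site and the theoretical allele counts; and
    (b2-iv) analyzing the goodness-of-fit test results of the target DNA site for all possible genotypes, and selecting the genotype that best fits the allele counts of the target DNA site as its estimated genotype.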
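  9. The method of claim 7, characterized in that in step (b3), for each target DNA site, the count originating from the minor-component DNA (FC) and the total count (TC) are estimated based on its estimated genotype, wherein the four largest allele counts are labeled R1, R2, R3 and R4 in descending order, comprising the following steps:
    (b3-1) if the estimated genotype of the target site is AA|AA, estimating the count originating from the minor-component DNA (FC) as NA and the total count (TC) as R1, or R1+R2, or R1+R2+R3, or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-2) if the estimated genotype of the target site is AB|AB, estimating the count originating from the minor-component DNA (FC) as NA and the total count (TC) as R1+R2, or R1+R2+R3, or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-3) if the estimated genotype of the target site is AB|AA, estimating the count originating from the minor-component DNA (FC) as R1-R2 and the total count (TC) as R1+R2, or R1+R2+R3, or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-4) if the estimated genotype of the target site is AA|AB, estimating the count originating from the minor-component DNA (FC) as twice R2 and the total count (TC) as R1+R2, or R1+R2+R3, or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-5) if the estimated genotype of the target site is AB|AC, estimating the count originating from the minor-component DNA (FC) as R1-R2+R3, or twice R3, or twice (R1-R2), and the total count (TC) as R1+R2+R3 or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-6) if the estimated genotype of the target site is AA|BB, estimating the count originating from the minor-component DNA (FC) as R2 and the total count (TC) as R1+R2, or R1+R2+R3, or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-7) if the estimated genotype of the target site is AA|BC, estimating the count originating from the minor-component DNA (FC) as R2+R3, or twice R2, or twice R3, and the total count (TC) as R1+R2+R3 or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-8) if the estimated genotype of the target site is AB|CC, determining whether the current estimate f_0 is greater than or equal to 1/3; if so, estimating the count originating from the minor-component DNA (FC) as R1 and the total count (TC) as R1+R2+R3 or R1+R2+R3+R4, and then performing step (b3-11) below; if not, estimating the count originating from the minor-component DNA (FC) as R3 and the total count (TC) as R1+R2+R3 or R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-9) if the estimated genotype of the target site is AB|CD, estimating the count originating from the minor-component DNA (FC) as R3+R4, or twice R3, or twice R4, and the total count (TC) as R1+R2+R3+R4, and then performing step (b3-11) below;
    (b3-10) if the estimated genotype of the target site is none of the above genotypes, estimating the count originating from the minor-component DNA (FC) as NA and the total count (TC) as R1, or R1+R2, or R1+R2+R3, or R1+R2+R3+R4, and then performing step (b3-11) below; and
    (b3-11) outputting the estimated count originating from the minor-component DNA (FC) and the total count (TC).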
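  10. The method of claim 1 or claim 7, characterized in that in step (a3) or step (b4), the concentration of the minor-component DNA is estimated by fitting a regression model.
  11. The method of claim 1 or claim 7, characterized in that in step (a3) or step (b4), the concentration of the minor-component DNA in the sample is calculated from the FC and TC counts using linear regression and/or robust linear regression and/or the means of FC and TC and/or the medians of FC and TC.
  12. The method of any one of claims 1-11, wherein the sample is a maternal plasma sample and the minor-component DNA is fetal DNA.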
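  13. A method for detecting genetic variation in a sample, characterized in that it comprises the following steps in order:
    (1) receiving a biological sample to be tested and preparing nucleic acid;
    (2) enriching or amplifying target DNA sites, wherein at least one target DNA site has more than one allele in the sample;
    (3) sequencing the amplified target DNA sites;
    (4) for each target DNA site, counting each of its alleles; and
    (5) determining the karyotype, genotype, or wild-type/mutant status of the target in the sample using a goodness-of-fit test on the allele counts of the target DNA sites and/or a relative distribution plot of the allele counts.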
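  14. The method of claim 13, characterized in that in step (5) the karyotype, genotype, or wild-type/mutant status of the target in the sample is determined using a goodness-of-fit test on the allele counts of the target DNA sites, the determination comprising the following steps in order:
    (c1) classifying each target DNA site as a reference site or a target site according to its location on the chromosome, wherein the reference sites form a reference group and the target sites form a target group;
    (c2) calculating the concentration of the minor-component DNA in the sample from the allele counts of the target DNA sites in the reference group; and
    (c3) estimating the karyotype, genotype, or wild-type/mutant status of the target in the sample by a goodness-of-fit test method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample.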
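  15. The method of claim 13, characterized in that in step (5) the karyotype, genotype, or wild-type/mutant status of the target in the sample is determined using a relative distribution plot of the allele counts of the target DNA sites, the determination comprising the following steps in order:
    (d1) classifying each target DNA site as a reference site or a target site according to its location on the chromosome, wherein the reference sites form a reference group and the target sites form a target group;
    (d2) calculating the concentration of the minor-component DNA in the sample from the allele counts of the target DNA sites in the reference group; and
    (d3) estimating the karyotype, genotype, or wild-type/mutant status of the target in the sample by the relative allele count distribution plot method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample.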
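  16. The method of claim 13, characterized in that in step (5) the karyotype of the target in the sample is determined using a relative distribution plot of the allele counts of the target DNA sites, wherein the sample to be tested is a single-genome sample, the determination comprising the following steps in order:
    (e1) calculating the relative count of each allele at each target DNA site;
    (e2) for each target DNA site, plotting its second-largest relative allele count against its largest relative allele count to produce distribution plot A, or plotting its largest relative allele count against the relative position of the target DNA site on the chromosome or sub-chromosomal region to produce distribution plot B; and
    (e3) estimating the karyotype of the target in the single-genome sample from distribution plot A and/or distribution plot B of the allele counts of the target DNA sites.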
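  17. The method of claim 14 or claim 15, characterized in that in step (c2) or step (d2) the concentration of the minor-component DNA in the sample is calculated using the method of any one of claims 1-12.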
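  18. The method of claim 14, characterized in that in step (c3) the genotype of the target in the sample is estimated by a goodness-of-fit test method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (c3-a1) for each target DNA site in the target group, listing all of its possible genotypes;
    (c3-a2) for each target DNA site in the target group and each of its possible genotypes, calculating the theoretical count of each allele from the concentration of the minor-component DNA in the sample and the total count over all alleles of the site;
    (c3-a3) for each target DNA site in the target group and each of its possible genotypes, performing a goodness-of-fit test between the observed allele counts of the target DNA site and their theoretical counts; and
    (c3-a4) for each target DNA site in the target group, selecting the best-fitting genotype as the genotype of the target DNA site, based on the goodness-of-fit test results for all of its possible genotypes.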
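  19. The method of claim 14, characterized in that in step (c3) the karyotype of the target in the sample is estimated by a goodness-of-fit test method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (c3-b1) analyzing the sample to be tested and listing all possible karyotypes of the target chromosome or sub-chromosomal segment;
    (c3-b2) for each possible karyotype, listing all possible genotypes of the target DNA sites in the target group;
    (c3-b3) for each target DNA site in the target group, first performing a goodness-of-fit test of its allele counts against all of its possible genotypes, and then selecting, for each possible karyotype, the genotype that fits best under that karyotype; and
    (c3-b4) jointly analyzing the goodness-of-fit test results of all target DNA sites for each karyotype, and selecting the karyotype with the best overall fit across all target DNA sites as the karyotype of the target chromosome or sub-chromosomal segment.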
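  20. The method of claim 14, characterized in that in step (c3) the wild-type/mutant status of the target in the sample is estimated by a goodness-of-fit test method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (c3-c1a) for each target DNA site in the target group, listing all of its possible wild-type/mutant genotypes;
    (c3-c2a) for each target DNA site in the target group and each of its possible wild-type/mutant genotypes, calculating the theoretical count of each allele from the concentration of the minor-component DNA in the sample and the total count over all alleles of the site;
    (c3-c3a) for each target DNA site in the target group and each of its possible wild-type/mutant genotypes, performing a goodness-of-fit test between the observed allele counts of the target DNA site and their theoretical counts; and
    (c3-c4a) jointly analyzing all target DNA sites in the target group, and selecting the wild-type/mutant genotype that best fits all target sites as the wild-type/mutant genotype of the target.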
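  21. The method of claim 14, characterized in that in step (c3) the wild-type/mutant status of the target in the sample is estimated by a goodness-of-fit test method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (c3-c1b) for each target DNA site in the target group, estimating its genotype by a goodness-of-fit test method from its allele counts and the concentration of the minor-component DNA in the sample; and
    (c3-c2b) determining the wild-type/mutant status of each allele of the target in each component of the sample, based on the genotype of each target DNA site in the target group and the sequences of its alleles.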
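  22. The method of claim 14, characterized in that in step (c3) the goodness-of-fit test method is performed using the chi-square test, the G-test, Fisher's exact test, the binomial test, a variant thereof, or a combination thereof.
  23. The method of claim 14, characterized in that in step (c3) the goodness-of-fit test method uses the G value calculated by the G-test, the AIC value, a corrected G value, a corrected AIC value, a variant of the G value or AIC value, or a combination thereof.
  24. The method of claim 14, characterized in that in step (c3) the goodness-of-fit test is performed using the method of claim 8.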
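  25. The method of claim 15, characterized in that in step (d3) the genotype of the target in the sample is estimated by the relative allele count distribution plot method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (d3-a1) for each target DNA site in the target group, listing all of its possible genotypes;
    (d3-a2) for each possible genotype of the target DNA sites in the target group, first calculating the theoretical relative count of each allele from the concentration of the minor-component DNA in the sample, and then plotting the theoretical relative count of at least one non-largest allele against the theoretical relative count of the largest allele to mark the theoretical position of that genotype;
    (d3-a3) for each target DNA site in the target group, first calculating the relative count of each allele, and then plotting the relative count of at least one non-largest allele against the relative count of the largest allele to mark the actual position of the target DNA site on the relative allele count plot; and
    (d3-a4) inferring the genotype of the target from the distribution of theoretical positions and the distribution of actual positions of the target DNA sites in the target group on the relative allele count plot.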
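  26. The method of claim 15, characterized in that in step (d3) the karyotype of the target in the sample is estimated by the relative allele count distribution plot method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (d3-b1) analyzing the sample to be tested and listing all possible karyotypes of the target chromosome or sub-chromosomal segment;
    (d3-b2) for each possible karyotype, listing all possible genotypes of the target DNA sites in the target group;
    (d3-b3) for each possible genotype of the target DNA sites in the target group, first calculating the theoretical relative count of each allele from the concentration of the minor-component DNA in the sample, and then plotting the theoretical relative count of at least one non-largest allele against the theoretical relative count of the largest allele to mark the theoretical position of that genotype;
    (d3-b4) for each target DNA site in the target group, first calculating the relative count of each allele, and then plotting the relative count of at least one non-largest allele against the relative count of the largest allele to mark the actual position of the target DNA site on the relative allele count plot; and
    (d3-b5) inferring the karyotype of the target from the distribution of theoretical positions under each karyotype and the distribution of actual positions of the target DNA sites in the target group on the relative allele count plot.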
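  27. The method of claim 15, characterized in that in step (d3) the wild-type/mutant status of the target in the sample is estimated by the relative allele count distribution plot method, using the allele counts of the target DNA sites in the target group and the concentration of the minor-component DNA in the sample, the estimation comprising the following steps in order:
    (d3-c1) for each target DNA site in the target group, listing its wild-type sequence and all possible wild-type/mutant genotypes;
    (d3-c2) for each possible wild-type/mutant genotype, calculating the theoretical relative counts of the wild-type allele and of each non-wild-type allele, and plotting the theoretical relative count of at least one non-wild-type allele against the theoretical relative count of the wild-type allele to mark the theoretical position of that wild-type/mutant genotype;
    (d3-c3) for each target DNA site in the target group, calculating the relative counts of the wild-type allele and of each non-wild-type allele, and plotting the relative count of at least one non-wild-type allele against the relative count of the wild-type allele to mark the actual position of the target DNA site on the relative allele count plot; and
    (d3-c4) inferring the wild-type/mutant status from the distribution of theoretical positions and the distribution of actual positions of all target DNA sites in the target group on the relative allele count plot.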
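  28. A system for detecting genetic variation in a sample, comprising an apparatus and/or a computer program product and/or a module for performing any step of the method of any one of claims 1 to 27.
  29. A kit for detecting genetic variation in a sample, the kit comprising primers for performing any step of the method of any one of claims 1 to 27.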
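For readers who want to follow the decision rules recited in claims 3 to 6, the sketch below walks through them for a handful of sites. It is an illustration, not a definitive implementation. It assumes that an allele counts as "above the noise threshold" when its relative count exceeds α, it picks one of the alternative FC and TC formulas the claims allow, and it estimates the concentration as the sum of FC divided by the sum of TC (the ratio of the means, one of the options in claim 11). The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def estimate_genotype(counts: Dict[str, int], alpha: float) -> Tuple[str, List[int]]:
    """Threshold-based genotype call; returns (genotype, [R1, R2, R3])."""
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    # Assumption: "detected" means relative count > alpha.
    detected = sum(1 for c in ranked if total > 0 and c / total > alpha)
    r = (ranked + [0, 0, 0])[:3]
    r1, r2, _ = r
    if detected == 1:
        return "AA|AA", r
    if detected == 2:
        ratio = r1 / (r1 + r2)
        if ratio < 0.5 + alpha:
            return "AB|AB", r
        if ratio < 0.75:
            return "AB|AA", r
        return "AA|AB", r
    if detected > 2 and r2 / r1 >= 0.5:
        return "AB|AC", r
    return "NA", r  # abnormal counts: left uncalled in this sketch


def fc_tc(genotype: str, r: List[int]) -> Tuple[Optional[int], int]:
    """Minor-component count FC (None stands for NA) and total count TC."""
    r1, r2, r3 = r
    if genotype == "AB|AA":
        return r1 - r2, r1 + r2
    if genotype == "AA|AB":
        return 2 * r2, r1 + r2
    if genotype == "AB|AC":
        return r1 - r2 + r3, r1 + r2 + r3
    if genotype == "AA|AA":
        return None, r1
    return None, r1 + r2  # AB|AB and NA carry no FC


def estimate_fraction(sites: List[Dict[str, int]], alpha: float) -> float:
    """Ratio-of-sums estimate over the informative sites."""
    pairs = [fc_tc(*estimate_genotype(s, alpha)) for s in sites]
    informative = [(fc, tc) for fc, tc in pairs if fc is not None]
    if not informative:
        raise ValueError("no informative sites")
    return sum(fc for fc, _ in informative) / sum(tc for _, tc in informative)


if __name__ == "__main__":
    sites = [
        {"A": 1000},
        {"A": 550, "B": 450},
        {"A": 950, "B": 50},
        {"A": 500, "B": 500},
        {"A": 500, "B": 450, "C": 50},
    ]
    print(estimate_fraction(sites, alpha=0.02))
```

The iterative method of claims 7 and 8 would replace `estimate_genotype` with a goodness-of-fit comparison against theoretical counts computed from the current estimate f_0, and repeat until the estimate changes by less than ε.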
PCT/CN2021/125359 2020-12-21 2021-10-21 Method for detecting fetal genetic variations by sequencing polymorphic sites and target sites WO2022134807A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21908808.5A EP4265732A1 (en) 2020-12-21 2021-10-21 Method for detecting fetal genetic variations by sequencing polymorphic sites and target sites
US18/268,459 US20240047008A1 (en) 2020-12-21 2021-10-21 Method for detecting fetal genetic variations by sequencing polymorphic sites and target sites
CN202180080432.6A CN116888274A (zh) 2020-12-21 2021-10-21 Method for detecting fetal genetic variations by sequencing polymorphic sites and target sites

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011514641.0A CN114645080A (zh) 2020-12-21 2020-12-21 Method for detecting fetal genetic variations by sequencing polymorphic sites and target sites
CN202011514641.0 2020-12-21

Publications (1)

Publication Number Publication Date
WO2022134807A1 (zh) 2022-06-30

Family

ID=81990364

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125359 WO2022134807A1 (zh) 2020-12-21 2021-10-21 Method for detecting fetal genetic variations by sequencing polymorphic sites and target sites

Country Status (4)

Country Link
US (1) US20240047008A1 (zh)
EP (1) EP4265732A1 (zh)
CN (2) CN114645080A (zh)
WO (1) WO2022134807A1 (zh)

Also Published As

Publication number Publication date
EP4265732A1 (en) 2023-10-25
CN116888274A (zh) 2023-10-13
CN114645080A (zh) 2022-06-21
US20240047008A1 (en) 2024-02-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908808

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180080432.6

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2021908808

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021908808

Country of ref document: EP

Effective date: 20230721