CN110373458B - Kit and analysis system for thalassemia detection - Google Patents

Publication number
CN110373458B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910565206.1A
Other languages
Chinese (zh)
Other versions
CN110373458A (en
Inventor
黄铨飞
彭春方
陈样宜
饶兴蔷
糜庆丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CapitalBio Genomics Co Ltd
Original Assignee
CapitalBio Genomics Co Ltd
Application filed by CapitalBio Genomics Co Ltd
Priority to CN201910565206.1A
Publication of CN110373458A
Application granted
Publication of CN110373458B
Legal status: Active

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Abstract

The invention discloses a kit and an analysis system for detecting thalassemia. Optimizing the reference probe set improves the accuracy of thalassemia mutation detection; using the specific UIDs retains more valid sequences and further improves the accuracy of the detection result; and using samples with known prenatal diagnosis results as a training set to construct multidimensional Bayesian prior probabilities allows the samples to be tested to be classified effectively, markedly improving classification accuracy.

Description

Kit and analysis system for thalassemia detection
Technical Field
The invention relates to a prenatal noninvasive detection technology, in particular to a kit and an analysis system for detecting thalassemia.
Background
Detection of fetal genomic abnormalities, that is, testing whether fetal DNA carries genetic abnormalities, is a genetic testing technology that is highly safe for both mother and infant. It currently comprises mainly the detection of fetal chromosomal abnormalities and the detection of fetal monogenic diseases.
Monogenic diseases are genetic diseases controlled by a single pair of alleles. According to the Online Mendelian Inheritance in Man (OMIM) database, more than 6000 monogenic diseases are currently known, and for more than 3700 of them the pathogenic and molecular mechanisms are reasonably well understood. Common monogenic diseases include G6PD deficiency (favism), thalassemia, pseudohypertrophic muscular dystrophy (DMD), deafness, cataract, etc.
Thalassemia (Mediterranean anemia) is classified by the globin gene affected; clinically it is mainly divided into α-thalassemia and β-thalassemia. Thalassemia is mainly distributed in Mediterranean coastal countries, Africa, the Middle East, Southeast Asia and the southern provinces of China. In southern China the thalassemia gene defect rate is 2.5-20%, and in Guangdong and Guangxi it reaches 10% and 20% respectively. In Guangxi the population carrier rates of α-thalassemia and β-thalassemia are 14.96% and 6.78% respectively; one in every 4 individuals carries a thalassemia gene defect, 1 in every 55 families is at risk of giving birth to an affected fetus, and 1 in every 230 fetuses is at risk of being born with thalassemia major.
The α-globin genes are located on chromosome 16. Each chromosome 16 carries 2 α-globin genes, so a pair of chromosomes 16 carries 4 α-globin genes. According to the number of α-globin genes deleted, α-thalassemia is genotyped as Hb Bart's hydrops fetalis syndrome, HbH disease, standard α-thalassemia, silent α-thalassemia, and normal, corresponding to deletion of 4, 3, 2, 1 and 0 α-globin genes respectively. A fetus with Hb Bart's hydrops fetalis dies in late gestation or within a few hours after delivery, with generalized edema and hepatosplenomegaly; Hb Bart's hydrops fetalis is therefore an important target of prenatal diagnosis.
Thalassemia is an autosomal recessive genetic disease. If both parents are carriers of a thalassemia gene, each child has a 25% chance of being normal, a 50% chance of being a carrier (thalassemia minor), and a 25% chance of having thalassemia major.
In 1997, Lo et al. [2] found that about 3% to 25% of the cell-free DNA in maternal peripheral blood is fetal cell-free DNA, which laid the theoretical foundation for noninvasive prenatal screening from maternal peripheral blood. In 2008, studies by Lo [3], Stephen Quake [4] and others showed that fetal chromosomal aneuploidy can be screened by detecting cell-free DNA in maternal peripheral blood, with an accuracy above 99%. Subsequently, researchers extended this noninvasive approach to the detection of fetal deletion/duplication syndromes [5] and fetal monogenic diseases [6]. However, the biggest challenge in noninvasive detection of fetal deletion/duplication syndromes and monogenic diseases is that the maternal background signal is too strong, with fetal cell-free DNA accounting for only 3%-30%. The accuracy of noninvasive detection of fetal deletion/duplication syndromes is far lower than that of NIPT, with a sensitivity of only 70%-90%, and the detection accuracy is affected by the fetal DNA concentration, the length of the microdeletion/microduplication region, and the amount of sequencing data.
The difficulty of noninvasive monogenic disease detection depends on the mode of inheritance. For de novo mutations or paternally inherited dominant mutations there is no interference from a maternal mutation background, so the fetal genotype can be determined simply by detecting the presence or absence of the mutation signal; this can be achieved with sensitive digital PCR or target-region capture sequencing. For recessive monogenic diseases, however, the mother herself contributes a 50% mutation signal, so identifying the fetal mutation type places very high demands on the sensitivity and specificity of both the experiment and the analysis. Therefore, noninvasive detection of recessive monogenic diseases (such as thalassemia) usually requires constructing the haplotype information of the father and mother in advance, expanding the amount of usable information with haplotype blocks [8], and measuring the dosage relationship between the two haplotypes from the father (one haplotype carries the mutation and the other does not) to infer whether the father's mutation-carrying haplotype was passed to the child, and thus to judge whether the fetus carries a homozygous mutation. A haplotype is a specific combination of alleles on a chromosome or chromosome segment; haplotypes constructed from SNPs of sufficient density can be used as a tool for association analysis between diseases and haplotypes. Because the SNP information of the parents alone is not enough to construct haplotypes, SNP information from a third party (the proband or the grandparents) must be borrowed, or a complex long-fragment single-molecule DNA detection technique must be used, to construct the parental haplotypes.
The third-party (proband or grandparent) haplotype construction method requires simultaneous detection of the SNP sites of the proband or grandparents; the parental haplotypes are then inferred from the informative SNP sites in the proband or grandparents [9]. The main limitations of this method are that DNA from the proband and grandparents is not easy to obtain, and that detecting the SNP sites requires a large amount of high-depth sequencing at high cost, so its clinical applicability is low. In addition, single-molecule long-fragment DNA sequencing methods, such as third-generation sequencing, are also used to construct parental haplotypes. The principle is to dilute the DNA until each reaction contains approximately one long DNA molecule, construct a library and sequence each reaction independently to obtain the SNP information of each single molecule, and finally connect the N long-fragment single molecules whose SNP signals are identical to construct the haplotypes.
For example, CN108642160A constructs the parental haplotypes of SNP sites near the thalassemia genes and, combining sequencing information from cell-free DNA and whole-blood genomic DNA libraries, analyzes paternal and maternal inheritance to determine the fetal genotype at the corresponding SNP sites. The main problems of this method are that constructing the parental haplotype information is difficult and that the experiments and bioinformatic analysis are complex.
Since one of the important current limitations of noninvasive detection of fetal microdeletions/microduplications and of noninvasive monogenic disease detection is the low fetal DNA concentration, CN107541561A proposes adding magnetic-bead selection or gel cutting to the library construction process to select short fragments and thereby increase the fetal DNA concentration. However, experimental enrichment of short cell-free DNA fragments has a large experimental error: DNA within a given length range cannot be selected precisely, a large amount of short-fragment DNA is easily removed by mistake when long fragments are removed, the number of plasma cell-free DNA molecules is significantly reduced, and the number of uniquely aligned effective reads drops, which makes it hard to obtain stable and accurate data.
References:
1. Nanal, R., P. Kyle, and P.W. Soothill, A classification of pregnancy losses after invasive prenatal diagnostic procedures: an approach to allow comparison of units with a different case mix. Prenat Diagn, 2003. 23(6): p. 488-92.
2. Lo, Y.M., et al., Presence of fetal DNA in maternal plasma and serum. Lancet, 1997. 350(9076): p. 485-7.
3. Chiu, R.W., et al., Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc Natl Acad Sci U S A, 2008. 105(51): p. 20458-63.
4. Fan, H.C., et al., Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc Natl Acad Sci U S A, 2008. 105(42): p. 16266-71.
5. Yin, A.H., et al., Noninvasive detection of fetal subchromosomal abnormalities by semiconductor sequencing of maternal plasma DNA. Proc Natl Acad Sci U S A, 2015. 112(47): p. 14670-5.
6. Lv, W., et al., Noninvasive prenatal testing for Wilson disease by use of circulating single-molecule amplification and resequencing technology (cSMART). Clin Chem, 2015. 61(1): p. 172-81.
7. Lam, K.W., et al., Noninvasive prenatal diagnosis of monogenic diseases by targeted massively parallel sequencing of maternal plasma: application to beta-thalassemia. Clin Chem, 2012. 58(10): p. 1467-75.
8. New, M.I., et al., Noninvasive prenatal diagnosis of congenital adrenal hyperplasia using cell-free fetal DNA in maternal plasma. J Clin Endocrinol Metab, 2014. 99(6): p. E1022-30.
9. Chen, S., et al., Haplotype-assisted accurate non-invasive fetal whole genome recovery through maternal plasma sequencing. Genome Med, 2013. 5(2): p. 18.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a kit and an analysis system for detecting the thalassemia with better classification effect.
The technical scheme adopted by the invention is as follows:
In a first aspect of the present invention, there is provided:
A reference probe set for improving the accuracy of thalassemia mutation detection, comprising detection probes for the following SNPs:
Figure BDA0002109368020000041
Figure BDA0002109368020000051
Figure BDA0002109368020000061
Figure BDA0002109368020000071
Figure BDA0002109368020000081
Figure BDA0002109368020000091
Figure BDA0002109368020000101
Figure BDA0002109368020000111
In a second aspect of the present invention, there is provided:
A nucleic acid sequence set for detecting thalassemia, comprising a UID linker sequence set and a reference probe set, wherein the detection probes of the reference probe set are as described in the first aspect of the invention, and the UID linker sequence set comprises the following nucleic acid sequences:
Figure BDA0002109368020000112
Figure BDA0002109368020000121
In a third aspect of the present invention, there is provided:
A modeling method for noninvasively detecting fetal thalassemia from maternal peripheral blood, comprising:
1) Determining the multidimensional Bayesian prior probabilities: take plasma cell-free DNA from pregnant women with known prenatal diagnosis results and divide the samples into three classes according to those results: thalassemia major, thalassemia minor, and normal. For each class, construct libraries, sequence, align, and enrich the fetal DNA concentration; then, based on a Bayesian probability model, compute the prior probability that samples of each class fall into each SEA_ratio interval, and construct the multidimensional Bayesian probability model;
2) Processing the sample to be tested: take plasma cell-free DNA from the pregnant woman to be tested, construct a library, sequence, align, and enrich the fetal DNA concentration; then calculate the posterior probabilities of the sample based on the multidimensional Bayesian probability model and determine its classification.
In some examples, sequencing and alignment using the reference probe set provided in the first aspect of the invention can further improve the accuracy of the detection result.
In some examples, the UID linker sequence set provided in the second aspect of the invention is used for library construction.
In some examples, when enriching the fetal DNA concentration, reads no longer than 158 bp are selected.
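The in silico size selection described here can be sketched as a simple length filter. This is a minimal illustration only: the function name, the record layout and the example fragments are hypothetical, and only the 158 bp cutoff comes from the text.

```python
def enrich_short_fragments(fragments, max_len=158):
    """Keep fragments whose length is at most max_len bp (in silico size selection)."""
    return [frag for frag in fragments if frag["length"] <= max_len]


# Hypothetical fragments; "length" is the sequenced fragment length in bp.
fragments = [
    {"name": "r1", "length": 143},
    {"name": "r2", "length": 166},
    {"name": "r3", "length": 158},
    {"name": "r4", "length": 171},
]
short = enrich_short_fragments(fragments)
print([frag["name"] for frag in short])
```

Because the filter runs after sequencing, no molecules are lost during library construction; only the analysis set changes.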
In some examples, computing the prior probability for each SEA_ratio interval includes the following operations:
Window division and counting: sort the probes by their position in the human genome, divide them into statistical windows using a sliding window of fixed size, and count the number of reads $\mathrm{Readnum}_{ij}$ in each statistical window; preferably, the sliding window length is 50 bp;
Sequencing data volume correction: normalize the raw read counts using the total number of reads aligned to all non-SEA regions to obtain the normalized $\mathrm{NReadnum}_{ij}$; after normalization, the mean read count over all statistical windows of each sample is 500. The calculation formula is:
Figure BDA0002109368020000131
wherein
Figure BDA0002109368020000132
where i denotes the i-th sample, j denotes the j-th statistical window, and $\mathrm{weight}_{i}$ denotes the mean read count (reads_number) over all statistical windows of sample i;
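The formula images are not reproduced in this text, so the following is only a sketch of the normalization as the surrounding prose describes it: each window count is divided by the sample's mean window count and rescaled to a mean of 500. Taking the mean over non-SEA windows only is an assumption based on the statement that non-SEA reads are used for normalization; the function name and the example counts are hypothetical.

```python
def normalize_counts(readnum, is_sea, target_mean=500.0):
    """Scale window counts so that the reference windows of a sample average target_mean.

    Assumption: weight_i is the mean count over non-SEA (reference) windows.
    """
    reference = [count for count, sea in zip(readnum, is_sea) if not sea]
    weight = sum(reference) / len(reference)
    return [count * target_mean / weight for count in readnum]


# Hypothetical raw counts for one sample: three reference windows, two SEA windows.
readnum = [200, 220, 180, 100, 110]
is_sea = [False, False, False, True, True]
nreadnum = normalize_counts(readnum, is_sea)
print(nreadnum)
```

After this step, samples sequenced to different depths are on the same scale, which is what makes the later ratio comparison meaningful.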
GC correction: calculate the GC% and $\mathrm{NReadnum}_{ij}$ of the j-th statistical window of each sample; for each 0.1% GC bin, calculate the median $M_{ik}$ of $\mathrm{NReadnum}_{ij}$, together with the median $M_{global}$ over all GC bins, giving the correction coefficient $W_{ik} = M_{ik} - M_{global}$; then subtract the correction coefficient from the original $\mathrm{NReadnum}_{ij}$ to obtain the corrected $\mathrm{CReadnum}_{ij}$. The calculation formula is:
$$\mathrm{CReadnum}_{ij} = \mathrm{NReadnum}_{ij} - W_{ik} = \mathrm{NReadnum}_{ij} - (M_{ik} - M_{global})$$
where i denotes the i-th sample, j denotes the j-th statistical window, and k denotes the k-th GC bin;
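The GC correction formula can be sketched for a single sample as follows. Two points are assumptions rather than statements of the text: windows are binned by rounding GC% to the nearest 0.1%, and M_global is taken as the median of the per-bin medians. The function name and the example data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(nreadnum, gc_percent, bin_width=0.1):
    """Subtract W_k = M_k - M_global from each window of one sample."""
    # Assumption: a window belongs to the GC bin nearest to its GC% value.
    keys = [round(gc / bin_width) for gc in gc_percent]
    bins = defaultdict(list)
    for key, count in zip(keys, nreadnum):
        bins[key].append(count)
    bin_median = {key: median(counts) for key, counts in bins.items()}
    # Assumption: M_global is the median of the per-bin medians.
    m_global = median(bin_median.values())
    return [count - (bin_median[key] - m_global) for key, count in zip(keys, nreadnum)]


# Hypothetical windows: a low-GC bin with lower counts and a high-GC bin with higher counts.
gc_percent = [40.0, 40.0, 40.0, 55.0, 55.0, 55.0]
nreadnum = [480, 500, 520, 580, 600, 620]
creadnum = gc_correct(nreadnum, gc_percent)
print(creadnum)
```

In this toy input the two GC bins differ by a constant offset, so after correction both bins share the same median and the GC-dependent shift is removed.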
Probe correction: select known thalassemia minor samples and use the mean of their $\mathrm{RReadnum}_{i}$ values to convert $\mathrm{RReadnum}_{ij}$ into a ratio, $\mathrm{Readratio}_{ij}$:
Figure BDA0002109368020000133
From the $\mathrm{Readratio}_{ij}$ of each statistical window in the SEA region, calculate $\mathrm{SEA\_ratio}_{i}$:
Figure BDA0002109368020000134
where i denotes the i-th sample and j denotes the j-th statistical window;
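The two formula images for the probe correction are not reproduced here, so the sketch below rests on stated assumptions: the per-window reference is the mean corrected count of the known thalassemia minor samples in that window, and SEA_ratio is the mean of the window ratios inside the SEA region. All names and numbers are hypothetical.

```python
def read_ratio(sample_counts, minor_reference_counts):
    """Divide each window count by the mean count of known thalassemia minor samples."""
    n_ref = len(minor_reference_counts)
    ratios = []
    for j, count in enumerate(sample_counts):
        ref_mean = sum(ref[j] for ref in minor_reference_counts) / n_ref
        ratios.append(count / ref_mean)
    return ratios


def sea_ratio(ratios, sea_windows):
    """Assumption: aggregate the SEA-region window ratios by their mean."""
    values = [ratios[j] for j in sea_windows]
    return sum(values) / len(values)


# Hypothetical corrected counts for two known thalassemia minor samples (three windows each).
minor_refs = [[400, 420, 380], [600, 580, 620]]
sample = [450, 500, 550]
ratios = read_ratio(sample, minor_refs)
sample_sea_ratio = sea_ratio(ratios, sea_windows=[0, 1])
print(ratios, sample_sea_ratio)
```

Under this reading, a sample whose SEA-region dosage matches the thalassemia minor reference gives a ratio near 1, with lower or higher values pointing toward fewer or more copies.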
Prediction of fetal DNA concentration: calculate the fetal DNA concentration from the SNP ratio values of the high-frequency heterozygous SNP sites covered by the probes;
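The text does not give the estimator, so the following is only one common way to turn SNP ratios into a fetal fraction, shown as an assumption: at sites where the mother is homozygous and the fetus is heterozygous, the fetal-specific allele makes up about half of the fetal fraction, so the fraction is estimated as twice the median minor-allele fraction of such sites. The thresholds used to pick informative sites are hypothetical.

```python
from statistics import median


def estimate_fetal_fraction(snp_counts, low=0.01, high=0.20):
    """Estimate fetal fraction as 2 x median minor-allele fraction of informative SNPs.

    Assumption: sites with a minor-allele fraction between low and high are
    maternal-homozygous / fetal-heterozygous; the thresholds are illustrative.
    """
    informative = []
    for ref_reads, alt_reads in snp_counts:
        total = ref_reads + alt_reads
        if total == 0:
            continue
        minor = min(ref_reads, alt_reads) / total
        if low <= minor <= high:
            informative.append(minor)
    if not informative:
        return None
    return 2 * median(informative)


# Hypothetical (ref, alt) read counts at five SNP sites.
snp_counts = [(950, 50), (60, 940), (500, 500), (1000, 0), (945, 55)]
fetal_fraction = estimate_fetal_fraction(snp_counts)
print(fetal_fraction)
```

The site at 500/500 (mother heterozygous) and the site at 1000/0 (no fetal-specific allele) are excluded by the thresholds, leaving three informative sites.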
Construction of the multidimensional Bayesian prior probabilities from the training set samples: for all known samples, collect the fetal cell-free DNA concentration, the sample classification (thalassemia major, thalassemia minor, normal), and the SEA_ratio value of each sample; then calculate the distribution of the SEA_ratio values of each class under the different fetal DNA concentration gradients to obtain the prior probability values.
In some examples, when constructing the multidimensional Bayesian prior probabilities from the training set samples, all samples are first pre-grouped by fetal concentration; then, within each fetal concentration interval $F_n$, the prior probabilities $P(\mathrm{Ratio}_m|A)$, $P(\mathrm{Ratio}_m|B)$ and $P(\mathrm{Ratio}_m|C)$ of the three classes are calculated separately:
Figure BDA0002109368020000141
(sample fetal concentration ∈ $F_n$)
Figure BDA0002109368020000142
(sample fetal concentration ∈ $F_n$)
Figure BDA0002109368020000143
(sample fetal concentration ∈ $F_n$).
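The pre-grouping step can be sketched as a table of empirical frequencies keyed by fetal concentration interval and class. The formula images are not reproduced, so this assumes P(Ratio_m | class) within F_n is the fraction of that class's training samples in F_n whose SEA_ratio falls into interval Ratio_m. The interval edges and the training tuples are hypothetical.

```python
from collections import Counter, defaultdict


def bin_index(value, edges):
    """Return n such that edges[n] <= value < edges[n + 1], or None if out of range."""
    for n in range(len(edges) - 1):
        if edges[n] <= value < edges[n + 1]:
            return n
    return None


def build_priors(training, ff_edges, ratio_edges):
    """Map (F_n index, class) to {Ratio_m index: empirical P(Ratio_m | class)}."""
    counts = defaultdict(Counter)
    for fetal_fraction, label, ratio in training:
        n = bin_index(fetal_fraction, ff_edges)
        m = bin_index(ratio, ratio_edges)
        if n is None or m is None:
            continue
        counts[(n, label)][m] += 1
    priors = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        priors[key] = {m: c / total for m, c in counter.items()}
    return priors


# Hypothetical training tuples: (fetal fraction, class, SEA_ratio).
training = [
    (0.08, "A", 0.93), (0.09, "A", 0.96), (0.08, "B", 1.00),
    (0.07, "B", 1.01), (0.09, "C", 1.04), (0.15, "A", 0.91),
]
ff_edges = [0.05, 0.10, 0.20]
ratio_edges = [0.90, 0.95, 1.00, 1.05]
priors = build_priors(training, ff_edges, ratio_edges)
print(priors)
```

Each fetal concentration interval gets its own table, which is what makes the model "multidimensional" in the sense used here.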
In some examples, determining the prior probabilities for the sample to be tested includes the following steps:
selecting the Bayesian probability model of the corresponding dimension according to the fetal DNA concentration of the sample to be tested;
calculating, from the SEA_ratio values of the training set samples and the known sample classification, the prior probability of each SEA_ratio interval ($\mathrm{Ratio}_m$) for each class, i.e. the probabilities $P(\mathrm{Ratio}_m|A)$, $P(\mathrm{Ratio}_m|B)$ and $P(\mathrm{Ratio}_m|C)$ that samples of the three classes, thalassemia major (class A), thalassemia minor (class B) and normal (class C), fall into each SEA_ratio interval ($\mathrm{Ratio}_m$).
In some examples, determining the posterior probabilities and the classification of the sample to be tested includes the following steps:
calculating the fetal DNA concentration of the sample, selecting the Bayesian probability model of the corresponding dimension, and determining the SEA_ratio value of the sample;
according to the interval $\mathrm{Ratio}_m$ into which the SEA_ratio value of the sample falls, calculating the posterior probabilities $P(A|\mathrm{Ratio}_m, F_n)$, $P(B|\mathrm{Ratio}_m, F_n)$ and $P(C|\mathrm{Ratio}_m, F_n)$ of the three classes;
comparing the posterior probabilities of the three classes and assigning the sample to the class with the highest posterior probability, where
Figure BDA0002109368020000144
(sample fetal concentration ∈ $F_n$)
Figure BDA0002109368020000145
(sample fetal concentration ∈ $F_n$)
Figure BDA0002109368020000146
(sample fetal concentration ∈ $F_n$)
Figure BDA0002109368020000147
are set to 0.25, 0.5 and 0.25, respectively, according to Mendel's laws of inheritance.
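The classification step can be sketched as follows, assuming the formula images express the standard Bayes rule with the Mendelian class probabilities 0.25, 0.5 and 0.25 for classes A, B and C. The likelihood values in the example are hypothetical and stand for P(Ratio_m | class) read from the table of the matching fetal concentration interval F_n.

```python
# Class probabilities set from Mendel's laws, as stated in the text.
CLASS_PRIOR = {"A": 0.25, "B": 0.50, "C": 0.25}


def classify(likelihoods, class_prior=CLASS_PRIOR):
    """Return (best class, posterior dict) using Bayes' rule within one F_n group."""
    joint = {c: likelihoods.get(c, 0.0) * class_prior[c] for c in class_prior}
    evidence = sum(joint.values())
    if evidence == 0:
        return None, {}
    posterior = {c: joint[c] / evidence for c in joint}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Hypothetical P(Ratio_m | class) for the interval the sample's SEA_ratio falls into.
likelihoods = {"A": 0.60, "B": 0.10, "C": 0.02}
best, posterior = classify(likelihoods)
print(best, posterior)
```

Returning `None` when no training sample of any class fell into the interval avoids a division by zero and signals that the sample cannot be classified from this table.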
In some examples, the fetal DNA concentration is calculated using SNP ratio values for SNP sites that are heterozygous at high frequency in the probe.
A system for noninvasively detecting fetal thalassemia from maternal peripheral blood, comprising:
a data storage unit, for storing reads of maternal plasma cell-free DNA from samples with known prenatal diagnosis results or from samples to be tested;
a data processing unit, for calculating the multidimensional Bayesian prior probabilities and the multidimensional Bayesian posterior probabilities of the sample to be tested, wherein calculating the multidimensional Bayesian prior probabilities comprises:
taking plasma cell-free DNA from pregnant women with known prenatal diagnosis results, dividing the samples into three classes (thalassemia major, thalassemia minor, and normal) according to those results, computing, based on a Bayesian probability model, the prior probability that samples of each class fall into each SEA_ratio interval, and constructing the multidimensional Bayesian probability model;
and calculating the multidimensional Bayesian posterior probabilities of the sample to be tested comprises:
taking the reads of the plasma cell-free DNA of the pregnant woman to be tested, calculating the posterior probabilities of the sample based on the multidimensional Bayesian probability model, and determining the classification of the sample.
In some examples, the prior and posterior probabilities of SEA_ratio values at different fetal DNA concentrations are calculated according to the fetal DNA concentration value.
In some examples, reads no longer than 158 bp in the cell-free DNA sequences are selected to enrich the fetal DNA concentration.
The invention has the following beneficial effects:
The reference probe set fully accounts for the relationship between the probe sequences and the features of the gene sequences to be detected, including but not limited to factors that directly affect the detection result, such as highly repetitive regions, high-GC regions and structurally complex regions, so that the accuracy of thalassemia mutation detection is effectively improved.
In the nucleic acid sequence set, unique identifiers (UIDs) are introduced to label the original molecules, so that reads originating from different cell-free DNA molecules are accurately distinguished. Compared with removing duplicates by read start position, more DNA fragments are retained, which further improves the accuracy of the detection result. Combined with the optimized probe design and analysis method, the thalassemia genotype can ultimately be detected accurately.
The method requires only cell-free DNA from maternal peripheral blood plasma, which is easy to sample. Parental haplotypes need not be constructed in advance, and samples from a proband or grandparents, which are difficult to obtain, are not needed. Only conventional library construction, capture and sequencing are required, without single-molecule sequencing or multiple library constructions, so experimental complexity is not increased and the cost is relatively low. Because high-depth sequencing of a target gene or genomic region requires retaining as many cell-free DNA sequences as possible during library construction and analysis, fetal DNA enrichment is performed by data processing after sequencing, which raises the fetal DNA concentration while retaining all cell-free DNA within the target length range.
The method provides a multidimensional Bayesian probability model: samples are pre-grouped according to their data characteristics, and different groups are fitted with different Bayesian models; the classification accuracy is higher than that of a one-dimensional Bayesian model without pre-grouping.
Drawings
FIG. 1 is a schematic representation of a synthetic linker structure;
FIG. 2 shows the difference between the distribution of free DNA before and after enrichment with short fragments of 158bp or less;
FIG. 3 is a prior probability distribution curve of a one-dimensional Bayesian probability model constructed directly without pre-grouping samples according to fetal DNA concentration gradients;
FIGS. 4-8 are prior probability distribution curves of a multidimensional Bayesian probability model constructed by pre-grouping samples according to different fetal DNA concentration intervals.
Detailed Description
The technical scheme of the invention is further explained by combining the examples.
It should be noted that the examples are not intended to limit the present invention, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Step 1, sample collection, plasma separation and DNA extraction
A total of 150 peripheral blood plasma samples from pregnant women were collected, of which 100 were used to construct the classifier and 50 were used to evaluate its accuracy. Thalassemia gene screening of both spouses in early pregnancy confirmed that the pregnant women carried Southeast Asian type thalassemia (SEA thalassemia), and the fetuses underwent prenatal diagnosis to confirm the α-thalassemia genotype carried. Plasma separation and plasma cell-free DNA extraction were carried out within 4 hours after collection for all samples.
Step 2, design of capture Probe
In this technical scheme, a set of probes is designed on the thalassemia-related genes, and reference probes in other chromosomal regions are added to control detection accuracy. Probe design depends on the sequence features of the genes: highly repetitive regions, high-GC regions and structurally complex regions directly affect the detection result. The probe set was determined by screening, alignment and testing with these factors taken into account. The designed probes comprise α-thalassemia gene detection probes and high-frequency heterozygous SNP probes for predicting fetal DNA concentration. The α-thalassemia gene probes cover the α-thalassemia-related genes (HBA1 and HBA2); the preferred region is the 115000-000 region of chr16, obtained by extending the α-thalassemia genes by 10 kb upstream and downstream. A sufficient number of high-frequency heterozygous SNP sites are also designed for predicting fetal DNA concentration. The SNP sites are selected from the dbSNP database, and their Ratio value in the Chinese population in the database is required to be between 0.4 and 0.5. The SNP sites selected by the inventors, with features such as highly repetitive regions of the human genome taken into account, are shown in Table 1.
TABLE 1 detection Probe numbering and SNP site sequence information
Figure BDA0002109368020000171
Figure BDA0002109368020000181
Figure BDA0002109368020000191
Figure BDA0002109368020000201
Figure BDA0002109368020000211
Figure BDA0002109368020000221
Figure BDA0002109368020000231
Figure BDA0002109368020000241
Adding probes for partial regions of other chromosomes provides a control for the sample and is optimized in particular for plasma samples, so that the thalassemia genotypes of both mother and fetus can be detected accurately from a single sample. First, because the number of reads obtained differs between samples, the read counts can be corrected using the probes on other chromosomes so that samples of different sequencing depths are comparable. Second, because capture and sequencing have a GC bias, a linear model can be built from the probes on other chromosomes to correct the GC bias, making the measured values of the probes in the target region more accurate.
Step 3, library construction and sequencing
A high-throughput sequencing library of plasma cell-free DNA is constructed and sequenced through end repair, linker ligation, PCR enrichment, library pooling, probe capture, on-machine sequencing and related steps. When the library is constructed, short UID molecular tags can be added at both ends or one end of the DNA; after sequencing, the reads of the library contain not only the sequence of the cell-free DNA but also the sequences of the UID molecular tags at the ends. The UID sequence set is applied to the linker sequences, and the linker set is used for linker ligation during library construction; that is, the UID sequence is introduced at the 3' end of the linker sequence to construct a library containing UID sequences. A schematic of the resulting linker structure is shown in FIG. 1.
PCR generates a large number of duplicate sequences, so reads must be de-duplicated to reduce the data bias caused by amplification and sequencing. There are two ways to remove duplicates. The first removes duplicates using the position at which a read aligns in the reference genome, such as its start site; its advantage is that it adds no experimental complexity or cost, and its disadvantage is lower accuracy, because reads aligned to the same position in the reference genome do not necessarily come from the same cell-free DNA molecule. The second adds a UID molecular tag to the DNA and identifies duplicates by combining the alignment position of the read in the reference genome with the UID sequence, which is more accurate. Because noninvasive monogenic disease detection has high requirements for data accuracy, removing duplicates with UIDs is preferred. The present technique therefore introduces unique identifiers (UIDs) to label the original molecules and resolve the duplication problem. Combined with the optimized probe design and analysis method, the thalassemia genotype can ultimately be detected accurately.
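The difference between the two de-duplication strategies can be sketched on a handful of reads. The read records and coordinates below are hypothetical; the UID strings are taken from Table 2 only for illustration.

```python
def count_unique_by_position(reads):
    """De-duplicate on alignment start only: reads sharing a start collapse to one."""
    return len({(r["chrom"], r["start"]) for r in reads})


def count_unique_by_position_and_uid(reads):
    """De-duplicate on alignment start plus UID: distinct molecules are kept apart."""
    return len({(r["chrom"], r["start"], r["uid"]) for r in reads})


# Hypothetical aligned reads.
reads = [
    {"chrom": "chr16", "start": 1000, "uid": "CTCATCGT"},
    {"chrom": "chr16", "start": 1000, "uid": "CTCATCGT"},  # PCR duplicate of the first read
    {"chrom": "chr16", "start": 1000, "uid": "TCTGACGT"},  # different molecule, same start
    {"chrom": "chr16", "start": 1050, "uid": "CTCATCGT"},
]
print(count_unique_by_position(reads))          # position-only de-duplication
print(count_unique_by_position_and_uid(reads))  # position + UID de-duplication
```

Position-only de-duplication discards the third read even though it comes from a different molecule, while the UID-aware key keeps it, which is the retention advantage the text describes.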
Through the inventive design, the two-way UID sequence linker group constructed by the inventors based on the library of thalassemia mutations targeted capture sequencing is shown in table 2:
TABLE 2 UID sequence group number and sequence information
No.  Plus-strand sequence  Reverse-strand sequence  No.  Plus-strand sequence  Reverse-strand sequence
1    CTCATCGT              ACGATGAG                 26   AGCTACTC              GAGTAGCT
2    TCTGACGT              ACGTCAGA                 27   GTAGTATC              GATACTAC
3    CGCATAGT              ACTATGCG                 28   GCACTATC              GATAGTGC
4    GACGATCT              AGATCGTC                 29   AGCAGATC              GATCTGCT
5    TATCAGCT              AGCTGATA                 30   TAGACATC              GATGTCTA
6    TGCTCACT              AGTGAGCA                 31   ACTAGTGC              GCACTAGT
7    GACTGTAT              ATACAGTC                 32   TACGACGC              GCGTCGTA
8    CGTCGTAT              ATACGACG                 33   ATGATGAC              GTCATCAT
9    GAGCTGAT              ATCAGCTC                 34   AGCGTCAC              GTGACGCT
10   ATGTCGAT              ATCGACAT                 35   GCTATGTA              TACATAGC
11   CGAGCGAT              ATCGCTCG                 36   CATGCGTA              TACGCATG
12   GCTGTCAT              ATGACAGC                 37   ACTACGTA              TACGTAGT
13   GTCTACAT              ATGTAGAC                 38   AGTCAGTA              TACTGACT
14   CTCGAGTG              CACTCGAG                 39   GCTCGCTA              TAGCGAGC
15   GCAGTCTG              CAGACTGC                 40   CGTAGATA              TATCTACG
16   GTACGCTG              CAGCGTAC                 41   GTGCACGA              TCGTGCAC
17   TATCTGCG              CGCAGATA                 42   TAGCATCA              TGATGCTA
18   CTCAGACG              CGTCTGAG                 43   ATGCTGCA              TGCAGCAT
19   CGTGCTAG              CTAGCACG                 44   TGATCGCA              TGCGATCA
20   ACAGTGAG              CTCACTGT                 45   GCGTGACA              TGTCACGC
21   ATACTGAG              CTCAGTAT                 46   ACGAGTCG              CGACTCGT
22   ATCATGAG              CTCATGAT                 47   GCTACTCA              TGAGTAGC
23   AGATGCAG              CTGCATCT                 48   TGTCGATA              TATCGACA
24   TCATACAG              CTGTATGA                 49   GCTAGATG              CATCTAGC
25   ATCACGTC              GACGTGAT                 50   TGATAGCT              AGCTATCA
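In the rows of Table 2, each reverse-strand sequence appears to be the reverse complement of its plus-strand sequence. The following minimal Python sketch checks this for a few rows copied from the table; the function names and the `UID_PAIRS` dictionary are hypothetical and are not part of the described kit.

```python
# Hypothetical helper names; the sequences are copied from Table 2.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# UID number -> (plus-strand sequence, reverse-strand sequence)
UID_PAIRS = {
    1: ("CTCATCGT", "ACGATGAG"),
    2: ("TCTGACGT", "ACGTCAGA"),
    26: ("AGCTACTC", "GAGTAGCT"),
    50: ("TGATAGCT", "AGCTATCA"),
}


def check_pairs(pairs):
    """Map each UID number to True if its two strands are reverse complements."""
    return {n: reverse_complement(plus) == rev for n, (plus, rev) in pairs.items()}
```

The same check can be run over all 50 rows to catch transcription errors before the adapters are synthesized.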
Using the UID sequence set shown in Table 2 brings the following advantages:
1) Point-mutation errors introduced by PCR during the sequencing workflow can be removed more effectively. Because polymerases differ in fidelity, single-base errors (usually at low frequency) can arise in the synthesized DNA during PCR; combining the UID with the genomic position and the number of supporting reads makes it possible to eliminate errors caused by the polymerase itself.
2) Duplicate fragments can be removed more reliably. Once UIDs are added, fragments that naturally share the same sequence can be told apart from duplicates of one sequence introduced during library construction: fragments carrying different UIDs are not true duplicates and can be recovered instead of being discarded as duplicates.
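Advantage 1) can be illustrated with a minimal Python sketch, assuming reads are grouped into families that share chromosome, start position and UID, and that a per-base majority vote within each family gives the consensus. The grouping key, the `min_family_size` parameter and the function name are assumptions made for illustration; the source does not specify how the number of supporting reads is used.

```python
from collections import Counter, defaultdict


def consensus_by_family(reads, min_family_size=2):
    """Collapse reads sharing (chrom, start, uid) into one consensus sequence.

    reads: iterable of (chrom, start, uid, sequence) tuples.
    Families smaller than min_family_size are skipped (hypothetical cutoff).
    """
    families = defaultdict(list)
    for chrom, start, uid, seq in reads:
        families[(chrom, start, uid)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        if len(seqs) < min_family_size:
            continue
        length = min(len(s) for s in seqs)
        # Majority vote at each position removes isolated PCR errors.
        consensus[key] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
        )
    return consensus
```

A single-base error present in only one member of a family is outvoted by the other members, whereas a true variant is present in every member and survives.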
Gene capture and sequencing method for thalassemia detection (plasma samples)
1) Sample collection and extraction: plasma from 6 clinically confirmed thalassemia carriers and patients, provided by the Maternal and Child Care Institute of Guangdong Province, was taken as an example, and plasma DNA was extracted from each sample with the MagPure Circulating DNA Kit (Magen);
2) End repair and adapter (linker) ligation: end repair and sequencing adapter ligation were performed on the DNA fragments with the Ion Xpress Plus Fragment Library Kit (Thermo Fisher Scientific); the adapters used in this step are the A adapter and P adapter containing the self-designed UID sequences (sequence information in Table 2); the product was purified with the Agencourt AMPure XP Kit;
3) Pre-hybridization amplification: the purified product of the previous step was enriched by PCR amplification. PCR system: adapter-ligated purified product 18.2 μL, 2× KAPA HiFi HotStart ReadyMix 20 μL, primer 1.8 μL. Reaction conditions: 72 ℃ for 5 min; pre-denaturation at 98 ℃ for 3 min; 13 cycles (denaturation at 98 ℃ for 20 s; annealing at 65 ℃ for 30 s; extension at 72 ℃ for 30 s); final extension at 72 ℃ for 1 min. The PCR amplification product was purified with the Agencourt AMPure XP Kit to obtain the DNA library for hybrid capture;
4) Probe capture: the pre-hybridization DNA library was blocked with the SureSelect TE Reagent Kit, PTN, 16 (Agilent) according to the manufacturer's instructions to obtain a blocked DNA library; the probes targeting the probe capture regions in Table 1 were prepared as a probe mixture; hybridization capture with the blocked DNA library was carried out for 20 hours;
5) Target region enrichment: the probe capture product was eluted with Dynabeads MyOne Streptavidin T1 (Invitrogen) to obtain the captured target sequence, which was amplified by PCR to obtain the enriched target sequence. PCR system: captured target 36.5 μL, 5× Herculase II Reaction Buffer 10 μL, dNTP mix (25 mM each) 0.5 μL, primer 2 μL, Herculase II Fusion DNA Polymerase 1 μL. Reaction conditions: pre-denaturation at 98 ℃ for 2 min; 10 cycles (denaturation at 98 ℃ for 30 s; annealing at 58 ℃ for 30 s; extension at 72 ℃ for 1 min); final extension at 72 ℃ for 10 min. The PCR product was purified with the Agencourt AMPure XP Kit to obtain the product to be sequenced;
6) Sequencing: the product to be sequenced was sequenced on the Ion Torrent platform, and the sequencing results were analyzed by bioinformatics.
In this operation, the total amount of plasma genomic DNA was quantified by Qubit and the DNA library concentration was quantified by qPCR; the results are shown in Table 3.
TABLE 3 Total amount of extracted genomic DNA from plasma and concentration of DNA library
Figure BDA0002109368020000261
Figure BDA0002109368020000271
The Qubit and qPCR quantification results for the product to be sequenced are shown in Table 4.
TABLE 4 Qubit quantitation and qPCR quantitation of sequencing products
Figure BDA0002109368020000272
Results of the determination of thalassemia for 6 plasma samples are shown in table 5.
TABLE 5 Thalassemia determination results for the 6 plasma samples
Figure BDA0002109368020000273
Gene capture and sequencing method for thalassemia detection (whole blood samples)
1) Sample collection and extraction: whole blood from 2 clinically confirmed thalassemia carriers and patients, provided by the Maternal and Child Care Institute of Guangdong Province, was taken as an example, and genomic DNA was extracted from each sample with the QIAamp DNA Blood Mini Kit;
2) DNA fragmentation: the genomic DNA was sheared by ultrasound into fragments of 100-500 bp; the sheared product was purified with the Agencourt AMPure XP Kit;
3) End repair and adapter (linker) ligation: end repair and sequencing adapter ligation were performed on the DNA fragments with the Ion Xpress Plus Fragment Library Kit (Thermo Fisher Scientific); the adapters used in this step are the A adapter and P adapter containing the self-designed UID sequences (sequence information in Table 2); the product was purified with the Agencourt AMPure XP Kit;
4) Pre-hybridization amplification: the purified product of the previous step was enriched by PCR amplification. PCR system: adapter-ligated purified product 18.2 μL, 2× KAPA HiFi HotStart ReadyMix 20 μL, primer 1.8 μL. Reaction conditions: 72 ℃ for 5 min; pre-denaturation at 98 ℃ for 3 min; 13 cycles (denaturation at 98 ℃ for 20 s; annealing at 65 ℃ for 30 s; extension at 72 ℃ for 30 s); final extension at 72 ℃ for 1 min. The PCR amplification product was purified with the Agencourt AMPure XP Kit to obtain the DNA library for hybrid capture;
5) Probe capture: the pre-hybridization DNA library was blocked with the SureSelect TE Reagent Kit, PTN, 16 (Agilent) according to the manufacturer's instructions to obtain a blocked DNA library; the probes of Example 1 targeting the probe capture regions were prepared as a probe mixture; hybridization capture with the blocked DNA library was carried out for 20 hours;
6) Target region enrichment: the probe capture product was eluted with Dynabeads MyOne Streptavidin T1 (Invitrogen) to obtain the captured target sequence, which was amplified by PCR to obtain the enriched target sequence. PCR system: captured target 36.5 μL, 5× Herculase II Reaction Buffer 10 μL, dNTP mix (25 mM each) 0.5 μL, primer 2 μL, Herculase II Fusion DNA Polymerase 1 μL. Reaction conditions: pre-denaturation at 98 ℃ for 2 min; 10 cycles (denaturation at 98 ℃ for 30 s; annealing at 58 ℃ for 30 s; extension at 72 ℃ for 1 min); final extension at 72 ℃ for 10 min. The PCR product was purified with the Agencourt AMPure XP Kit to obtain the product to be sequenced;
7) Sequencing: the product to be sequenced was sequenced on the Ion Torrent platform, and the sequencing results were analyzed by bioinformatics.
In this operation, the total amount of whole blood genomic DNA was quantified by Qubit and the DNA library concentration was quantified by qPCR; the results are shown in Table 6.
TABLE 6 Whole blood genomic DNA extraction concentration and DNA library concentration
Figure BDA0002109368020000281
The Qubit and qPCR quantification results for the product to be sequenced are shown in Table 7.
TABLE 7 Qubit quantitation and qPCR quantitation of sequencing products
Figure BDA0002109368020000282
The results of the measurement of thalassemia of 2 whole blood samples are shown in Table 8.
TABLE 8 results of thalassemia measurements of 2 whole blood samples
Figure BDA0002109368020000283
Step 4, removing duplicate reads
PCR produces a large number of duplicate reads, which must be removed to reduce the data bias caused by amplification and sequencing. There are two methods for removing duplicates. The first uses the position to which a read aligns in the reference genome, such as the start site; its advantage is that it adds no experimental complexity or cost, and its disadvantage is lower accuracy, because reads aligned to the same position in the reference genome do not necessarily come from the same free DNA fragment. The second adds a UID molecular tag to the DNA and identifies duplicates by combining the alignment position of the read in the reference genome with the UID sequence, so that more free DNA sequences are retained and the accuracy is higher. Because noninvasive monogenic detection requires high data accuracy, removing duplicates with UIDs is preferred.
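The difference between the two de-duplication strategies can be sketched in a few lines of Python. This is a simplified illustration, assuming each read is reduced to a (chromosome, start position, UID) tuple; the function name is hypothetical and real pipelines would also consider strand and fragment end.

```python
def count_unique_molecules(reads, use_uid):
    """Count distinct molecules among reads given as (chrom, start, uid) tuples.

    use_uid=False: de-duplicate on alignment position only.
    use_uid=True:  de-duplicate on alignment position plus UID.
    """
    keys = set()
    for chrom, start, uid in reads:
        keys.add((chrom, start, uid) if use_uid else (chrom, start))
    return len(keys)


example_reads = [
    ("chr16", 100, "CTCATCGT"),  # original molecule 1
    ("chr16", 100, "CTCATCGT"),  # PCR duplicate of molecule 1
    ("chr16", 100, "TCTGACGT"),  # different molecule at the same position
    ("chr16", 250, "CGCATAGT"),  # molecule at another position
]
```

With `example_reads`, position-only de-duplication keeps two molecules, while position plus UID keeps three, because the second molecule starting at position 100 carries a different UID and is therefore not a true duplicate.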
Step 5, enriching the fetal free DNA concentration
One important current limitation of noninvasive detection of fetal microdeletions/microduplications and of noninvasive monogenic disease detection is the low fetal DNA concentration. CN107541561A therefore proposes adding magnetic bead selection or gel excision to the library construction process to select short fragments and so raise the fetal DNA concentration. However, this approach markedly depletes the number of plasma free DNA molecules and reduces the number of effective, uniquely aligned reads. Our examples show that, when a target gene or genomic region is sequenced at high depth, all free DNA sequences should be retained as far as possible during library construction, and the fetal DNA concentration should instead be enriched by data processing after sequencing, which gives higher stability and accuracy. The invention unexpectedly finds that selecting reads with a length of 158 bp or less significantly raises the fetal free DNA concentration while retaining enough reads for subsequent data analysis. The difference in the distribution of fetal free DNA before and after enrichment for short fragments of 158 bp or less is shown in FIG. 2.
Because free DNA fragments from the fetus are significantly shorter than the mother's own free DNA, selecting short fragments can enrich the fetal free DNA concentration. For the thalassemia gene detection project, existing samples were analyzed: a series of length thresholds was set for filtering reads, and the fetal DNA concentration, the number of remaining reads, the proportion of remaining reads and the coefficient of variation of depth obtained at each threshold were tabulated, as shown in the following table.
Read length  Fetal DNA concentration  Reads after filtering  Proportion of reads after filtering  Coefficient of variation of depth
≤148bp       22.2%                    108647                 9.4%                                 0.329
≤158bp       19.7%                    197239                 17.0%                                0.256
≤168bp       16.5%                    374461                 32.1%                                0.199
≤178bp       14.6%                    567548                 48.7%                                0.172
all reads    12.6%                    1161720                100.0%                               0.152
>178bp       10.8%                    607650                 52.6%                                0.216
In theory, the smaller the length threshold used to select reads, the greater the enriched proportion of fetal free DNA, the higher the fetal DNA concentration, and the easier it is to detect a positive sample. However, a smaller threshold also leaves fewer effective reads and reduces the uniformity of sequencing depth across the sample, which increases noise. A balance therefore has to be found between the number of effective reads and the fetal DNA concentration. After evaluation, 158 bp was selected as the fragment screening threshold for the thalassemia gene detection project: it raises the mean fetal free DNA concentration to 19.7% while retaining 17% of the effective reads.
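The trade-off described above can be made concrete with a short Python sketch. The filter itself is a simple length cutoff; the numbers in `TABLE` and `ALL_READS` are copied from the table of length thresholds, while the function names and the summary metrics (fold enrichment, retained fraction, relative increase in the coefficient of variation) are illustrative assumptions rather than quantities defined in the source.

```python
def filter_short_reads(read_lengths, max_len=158):
    """Keep only reads whose length is at most max_len (in bp)."""
    return [n for n in read_lengths if n <= max_len]


# threshold (bp) -> (fetal DNA concentration, reads after filtering, depth CV)
TABLE = {
    148: (0.222, 108647, 0.329),
    158: (0.197, 197239, 0.256),
    168: (0.165, 374461, 0.199),
    178: (0.146, 567548, 0.172),
}
ALL_READS = (0.126, 1161720, 0.152)


def summarize(threshold):
    """Compare one threshold against the unfiltered data set."""
    conc, n_reads, cv = TABLE[threshold]
    return {
        "fold_enrichment": conc / ALL_READS[0],
        "retained": n_reads / ALL_READS[1],
        "cv_increase": cv / ALL_READS[2],
    }
```

Calling `summarize` for each threshold shows the pattern stated in the text: lowering the threshold increases the fold enrichment of fetal DNA but shrinks the retained fraction and raises the depth variation.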
Step 6, calculating the mean ratio of the SEA region
Window division and counting: the probes are sorted by position in the human genome, sliding windows of equal size are then taken as statistical windows for read counts, and the read count Readnum_ij of each statistical window is recorded. Preferably, one window is taken every 50 bp.
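The window counting step can be sketched as follows. This is a minimal illustration that assumes non-overlapping 50 bp windows over a single captured region and counts each read by its start position; the function name, the half-open coordinate convention and the use of start positions are assumptions, since the source does not state them.

```python
def count_reads_per_window(read_starts, region_start, region_end, window=50):
    """Count reads per fixed-size window over [region_start, region_end).

    read_starts: iterable of read start coordinates on the same chromosome.
    Returns a list where element j is Readnum for window j of one sample.
    """
    n_windows = (region_end - region_start + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // window] += 1
    return counts
```

For example, read starts at 100, 120, 149, 150 and 260 over the region 100 to 300 fall into windows 0, 0, 0, 1 and 3, respectively.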
Sequencing data volume correction: the initial read counts are normalized by the total number of reads aligned to all non-SEA regions to obtain the normalized NReadnum_ij; after normalization, the mean read count over all statistical windows of each sample is 500.
Figure BDA0002109368020000301
Wherein
Figure BDA0002109368020000302
i denotes the ith sample, j denotes the jth statistical window, and weight_i denotes the mean read count (reads_number) over all statistical windows of sample i;
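The normalization formula itself is given only as an image in the source. One reading that is consistent with the surrounding text (a per-sample weight equal to the mean window count, and a post-normalization mean of 500) is NReadnum_ij = 500 × Readnum_ij / weight_i. The Python sketch below implements that reading; the optional `reference_idx` argument, which restricts the weight to non-SEA windows, is an assumption based on the phrase about non-SEA regions.

```python
def normalize_counts(readnum, reference_idx=None, target_mean=500.0):
    """Scale one sample's window counts so the reference windows average target_mean.

    readnum: list of raw read counts per window for one sample.
    reference_idx: indices of windows used to compute the weight
                   (assumed here to be the non-SEA windows); all windows if None.
    """
    idx = list(range(len(readnum))) if reference_idx is None else list(reference_idx)
    weight = sum(readnum[j] for j in idx) / len(idx)
    if weight == 0:
        raise ValueError("no reads in the reference windows")
    return [count * target_mean / weight for count in readnum]
```

Scaling every sample to the same mean removes differences in total sequencing yield, so that window counts from different samples become comparable.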
GC correction:
The GC% and NReadnum_ij of the jth statistical window are calculated for each sample. For each 0.1% GC bin, the median M_ik of NReadnum_ij is calculated, together with the median M_global over all GC bins; the correction coefficient is W_ik = M_ik - M_global. The correction coefficient is then subtracted from the original NReadnum_ij to obtain the corrected CReadnum_ij:
$$CReadnum_{ij} = NReadnum_{ij} - W_{ik} = NReadnum_{ij} - (M_{ik} - M_{global})$$
where i denotes the ith sample, j denotes the jth statistical window, and k denotes the kth GC bin.
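The GC correction can be sketched in Python for a single sample. Two details are assumptions made for this illustration: windows are assigned to 0.1% GC bins by rounding, and M_global is taken as the median over all windows of the sample (the source wording, the median of all GC bins, could also mean the median of the per-bin medians). The function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(nreadnum, gc_percent, bin_width=0.1):
    """Subtract the GC-bin offset W_ik = M_ik - M_global from each window count.

    nreadnum: normalized counts per window for one sample.
    gc_percent: GC content (in percent) of each window, same order.
    """
    keys = [round(gc / bin_width) for gc in gc_percent]
    bins = defaultdict(list)
    for key, count in zip(keys, nreadnum):
        bins[key].append(count)

    m_global = median(nreadnum)  # assumed: median over all windows
    m_bin = {key: median(values) for key, values in bins.items()}
    return [count - (m_bin[key] - m_global) for key, count in zip(keys, nreadnum)]
```

After correction, every GC bin has the same median count, which removes the systematic dependence of coverage on GC content while preserving the deviations of individual windows within a bin.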
Probe correction:
Because capture efficiency differs between probes, the expected read count differs between probes, and a probe read-count correction is needed. Known mild thalassemia (SEA) samples are selected, and the mean RReadnum_i of these samples is used to convert RReadnum_ij into a ratio, Readratio_ij:
Figure BDA0002109368020000303
The Readratio_ij values of all 50 bp bins in the SEA region are collected and their mean, SEA_ratio_i, is calculated:
Figure BDA0002109368020000311
where i denotes the ith sample and j denotes the jth 50 bp bin.
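The two formulas of the probe correction are available only as images, so the following Python sketch rests on an assumed reading: the count of each window is divided by the mean count of the same window across the known reference samples, and SEA_ratio_i is the arithmetic mean of these ratios over the windows inside the SEA region. Function names and argument layout are hypothetical.

```python
def read_ratio(sample_counts, reference_samples):
    """Convert one sample's window counts into ratios against reference samples.

    sample_counts: corrected counts per window for the sample of interest.
    reference_samples: list of count lists (same windows) from reference samples.
    """
    n_ref = len(reference_samples)
    ref_mean = [
        sum(ref[j] for ref in reference_samples) / n_ref
        for j in range(len(sample_counts))
    ]
    return [count / mean for count, mean in zip(sample_counts, ref_mean)]


def sea_ratio(ratios, sea_windows):
    """Mean ratio over the windows that lie inside the SEA region."""
    return sum(ratios[j] for j in sea_windows) / len(sea_windows)
```

Dividing by a per-window reference cancels probe-specific capture efficiency, so a sample that matches the reference gives ratios near 1 and a relative loss of copies in the SEA region lowers SEA_ratio_i.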
Step 7, predicting fetal DNA concentration
The fetal DNA concentration is calculated from the SNP ratio values at high-frequency heterozygous SNP sites within the probe regions. Preferably, maximum likelihood estimation is used to predict the fetal DNA concentration. Other well-known methods can also be used.
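The source names maximum likelihood estimation but does not give the model. A common simplified model, used here purely as an assumption, considers SNP sites where the mother is homozygous and the fetus is heterozygous: the fetal-specific allele is then expected at a fraction f/2 of the reads, where f is the fetal DNA concentration, and the read counts follow a binomial distribution. The Python sketch below maximizes the resulting log-likelihood on a grid; the function names, the grid step and the site selection are all hypothetical.

```python
import math


def log_likelihood(f, sites):
    """Binomial log-likelihood (up to a constant) of fetal fraction f.

    sites: list of (fetal_specific_allele_reads, total_depth) at informative SNPs.
    """
    p = f / 2.0  # expected fraction of the fetal-specific allele
    return sum(a * math.log(p) + (d - a) * math.log(1.0 - p) for a, d in sites)


def estimate_fetal_fraction(sites, step=0.001):
    """Grid-search maximum likelihood estimate of the fetal fraction in (0, 1)."""
    grid = [i * step for i in range(1, int(round(1.0 / step)))]
    return max(grid, key=lambda f: log_likelihood(f, sites))
```

Under this simple model the estimate coincides, up to the grid resolution, with twice the pooled fraction of fetal-specific allele reads; a grid search is shown because it extends easily to richer likelihoods.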
Step 8, constructing the multidimensional Bayesian prior probability from training set samples
The fetal free DNA concentration, the sample classification information (severe thalassemia, mild thalassemia, normal) and the SEA_ratio_i value of every sample are collected; the SEA_ratio of each sample is evaluated under the different fetal DNA concentration gradients, and the prior probability values are obtained.
The prenatal diagnosis results of the fetuses for all training set samples are collected, and the fetuses are divided into three classes: A for severe thalassemia, B for mild thalassemia, and C for wild type (normal). The SEA_ratio value of each sample is obtained according to Step 6, and the fetal DNA concentration of each sample according to Step 7. Because the fetal DNA concentration directly affects the detection signal, all samples are pre-grouped by fetal concentration, e.g., [0.05-0.1), [0.1-0.15), [0.15-0.2), [0.2-0.25), [0.25-0.3), etc. Then, for each fetal concentration F_n, the prior probabilities P(Ratio_m|A), P(Ratio_m|B) and P(Ratio_m|C) of the three classes of samples at that concentration are calculated:
Figure BDA0002109368020000312
(sample fetal concentration ∈ F_n)
Figure BDA0002109368020000313
(sample fetal concentration ∈ F_n)
Figure BDA0002109368020000314
(sample fetal concentration ∈ F_n)
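Because the three formulas are given only as images, the following Python sketch assumes the most direct reading: within each fetal concentration group F_n, P(Ratio_m|class) is the fraction of training samples of that class whose SEA_ratio falls in interval Ratio_m. The fetal concentration edges follow the example grouping in the text; the `RATIO_EDGES` values and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

FETAL_EDGES = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]    # grouping from the text
RATIO_EDGES = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]  # hypothetical intervals


def interval_index(value, edges):
    """Index of the half-open interval [edges[i], edges[i+1]) containing value."""
    i = bisect_right(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        raise ValueError(f"{value} lies outside the configured intervals")
    return i


def build_priors(samples):
    """samples: iterable of (label, fetal_fraction, sea_ratio), label in 'A', 'B', 'C'.

    Returns {(n, label, m): P(Ratio_m | label) within fetal concentration group n}.
    """
    counts = defaultdict(Counter)
    for label, fetal_fraction, ratio in samples:
        n = interval_index(fetal_fraction, FETAL_EDGES)
        m = interval_index(ratio, RATIO_EDGES)
        counts[(n, label)][m] += 1

    priors = {}
    for (n, label), per_interval in counts.items():
        total = sum(per_interval.values())
        for m, k in per_interval.items():
            priors[(n, label, m)] = k / total
    return priors
```

Keeping a separate table per fetal concentration group is what makes the model multidimensional in the sense used in the text: the same SEA_ratio interval can carry different probabilities at different fetal DNA concentrations.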
If the samples are not pre-grouped by fetal DNA concentration and a one-dimensional Bayesian probability model is constructed directly, the distribution of the Bayesian prior probabilities is as shown in FIG. 3. As can be seen from FIG. 3, with the one-dimensional Bayesian probability model the three classes of samples are poorly separated, and whichever threshold (cutoff) is chosen for classification, a large number of misclassifications result.
Once the samples are pre-grouped by fetal DNA concentration, a multidimensional Bayesian probability model can be constructed in which each fetal DNA concentration gradient has its own set of Bayesian prior probabilities, as shown in FIGS. 4-8. As can be seen from FIGS. 4-8, the prior probabilities of the multidimensional model separate the classes to different degrees in different concentration intervals: the separation of the three classes of samples increases markedly as the fetal DNA concentration rises, and the different thalassemia types can be effectively distinguished when the fetal free DNA concentration is above 0.15. This suggests that, when maternal plasma is used to detect fetal chromosomal abnormalities or single-gene mutations, the fetal DNA concentration has a direct effect on the detection signal. Samples with different fetal DNA concentrations should therefore not be pooled; they should first be pre-grouped by fetal DNA concentration, and a classification model or classifier, such as a Bayesian model, should then be constructed separately within each group.
Step 9, calculating the multidimensional Bayesian posterior probability for test set samples
A Bayesian probability model of the corresponding dimension is selected according to the fetal DNA concentration of the sample to be tested. Then, according to the SEA_ratio value of the sample, the prior probabilities P(Ratio_m|A), P(Ratio_m|B) and P(Ratio_m|C) that severe thalassemia, mild thalassemia and normal samples fall into that ratio interval are extracted. Finally, the posterior probability of each class is calculated for the sample: given that the fetal DNA concentration of the sample falls in F_n and its SEA_ratio value falls in the interval Ratio_m, the posterior probabilities of the three classes are P(A|Ratio_m, F_n), P(B|Ratio_m, F_n) and P(C|Ratio_m, F_n). The three posterior probabilities are compared, and the class with the highest posterior probability is assigned to the sample.
Figure BDA0002109368020000321
(sample fetal concentration ∈ F_n)
Figure BDA0002109368020000322
(sample fetal concentration ∈ F_n)
Figure BDA0002109368020000323
(sample fetal concentration ∈ F_n)
where
Figure BDA0002109368020000324
are set to 0.25, 0.5 and 0.25, respectively, according to Mendel's laws of inheritance.
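The posterior formulas are given only as images. The Python sketch below assumes they are Bayes' rule applied within the selected fetal concentration group, with the values 0.25, 0.5 and 0.25 acting as class weights for A (severe), B (mild) and C (normal), respectively. The function names and the dictionary layout are hypothetical.

```python
# Class weights from the text (Mendelian expectation); mapping to A, B, C is assumed.
CLASS_PRIOR = {"A": 0.25, "B": 0.5, "C": 0.25}


def posterior(ratio_probs):
    """Bayes' rule for one sample.

    ratio_probs: {label: P(Ratio_m | label)} taken from the table of the
                 sample's fetal concentration group F_n.
    Returns {label: P(label | Ratio_m, F_n)}.
    """
    joint = {c: ratio_probs.get(c, 0.0) * CLASS_PRIOR[c] for c in CLASS_PRIOR}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("no training sample fell into this ratio interval")
    return {c: value / evidence for c, value in joint.items()}


def classify(ratio_probs):
    """Return the class with the highest posterior probability."""
    post = posterior(ratio_probs)
    return max(post, key=post.get)
```

For example, with P(Ratio_m|A) = 0.6, P(Ratio_m|B) = 0.2 and P(Ratio_m|C) = 0, the joint weights are 0.15, 0.1 and 0, so the posterior probabilities are 0.6, 0.4 and 0 and the sample is assigned to class A.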
In this method, both enrichment of the fetal DNA concentration and the multidimensional Bayesian model can improve accuracy. The results of 100 test samples were therefore analyzed to compare the strengths and weaknesses of three methods:
Method 1: no enrichment of fetal DNA concentration, conventional one-dimensional Bayesian model
Method 2: no enrichment of fetal DNA concentration, multidimensional Bayesian model
Method 3: enrichment of fetal DNA concentration, multidimensional Bayesian model
The results are shown in Table 9.
TABLE 9 influence of different methods on prediction accuracy
Figure BDA0002109368020000325
Figure BDA0002109368020000331
As can be seen from the data in the table, comparing Method 1 with Method 2 shows that multidimensional Bayesian processing effectively improves the accuracy of the prediction; comparing Methods 1, 2 and 3 shows that enriching the fetal DNA concentration in combination with the multidimensional Bayesian model unexpectedly improves the accuracy further. The specific prediction results for the 100 test samples are shown in Table 10.
Figure BDA0002109368020000341
Figure BDA0002109368020000351
Figure BDA0002109368020000361
Figure BDA0002109368020000371
Figure BDA0002109368020000381
Figure BDA0002109368020000391

Claims (2)

1. A system for non-invasively detecting fetal thalassemia in maternal peripheral blood, comprising:
a data storage unit, for storing reads of maternal plasma free DNA with known prenatal diagnosis results or to be tested;
a data processing unit, for calculating the multidimensional Bayesian prior probability and the multidimensional Bayesian posterior probability of a sample to be tested, wherein the calculation of the multidimensional Bayesian prior probability comprises the following steps:
taking plasma free DNA of pregnant women with known prenatal diagnosis results, dividing it into three classes of samples (severe thalassemia, mild thalassemia and normal) according to the prenatal diagnosis results, performing library construction, sequencing and alignment for each, selecting reads no longer than 158 bp among the free DNA sequences to enrich the fetal DNA concentration, counting, on the basis of a Bayesian probability model, the prior probabilities of the three classes of samples falling into each SEA_ratio value interval, and constructing a multidimensional Bayesian probability model;
the determination of the prior probability for the sample to be tested comprises the following steps:
taking plasma free DNA of the pregnant woman to be tested, performing library construction, sequencing and alignment, selecting reads no longer than 158 bp among the free DNA sequences to enrich the fetal DNA concentration, and selecting the Bayesian probability model of the corresponding dimension according to the fetal DNA concentration of the sample to be tested;
calculating, from the SEA_ratio values of the training set samples and the known sample classification information, the prior probability of each class of sample for each SEA_ratio interval Ratio_m, the prior probability denoting the probabilities P(Ratio_m|A), P(Ratio_m|B), P(Ratio_m|C) that class A (severe thalassemia), class B (mild thalassemia) and class C (normal) samples fall into each SEA_ratio interval Ratio_m;
when the multidimensional Bayesian prior probability is constructed from the training set samples, all samples are pre-grouped according to their fetal concentrations, and then, for each fetal concentration F_n, the prior probabilities P(Ratio_m|A), P(Ratio_m|B), P(Ratio_m|C) of the three classes of samples at that concentration are calculated respectively,
Figure DEST_PATH_IMAGE001
Fetal concentration in a sample
Figure 167517DEST_PATH_IMAGE002
Figure DEST_PATH_IMAGE003
Fetal concentration in a sample
Figure 965708DEST_PATH_IMAGE002
Figure 164608DEST_PATH_IMAGE004
Fetal concentration in a sample
Figure 505591DEST_PATH_IMAGE002
the counting of the prior probability for each SEA_ratio value comprises the following operations:
window division and counting: sorting the probes by position in the human genome, then taking sliding windows of equal size as statistical windows for read counts, and counting the read count Readnum_ij of each statistical window; the length of the sliding window is 50 bp;
sequencing data volume correction: normalizing the initial read counts by the total number of reads aligned to all non-SEA regions to obtain the normalized NReadnum_ij, such that after normalization the mean read count over all statistical windows of each sample is 500, calculated as follows:
Figure DEST_PATH_IMAGE005
wherein
Figure 905480DEST_PATH_IMAGE006
i denotes the ith sample, j denotes the jth statistical window, and weight_i denotes the mean read count (reads_number) over all statistical windows of sample i;
GC correction: calculating the GC% and NReadnum_ij of the jth statistical window for each sample; calculating the median M_ik of NReadnum_ij for each 0.1% GC bin and the median M_global over all GC bins, the correction coefficient being W_ik = M_ik - M_global; then subtracting the correction coefficient from the original NReadnum_ij to obtain the corrected CReadnum_ij, calculated as follows:
$$CReadnum_{ij} = NReadnum_{ij} - W_{ik} = NReadnum_{ij} - (M_{ik} - M_{global})$$
wherein i denotes the ith sample, j denotes the jth statistical window, and k denotes the kth GC bin;
probe correction: selecting known mild thalassemia samples, and using the mean RReadnum_i of the mild thalassemia samples to convert RReadnum_ij into a ratio, Readratio_ij:
Figure 456547DEST_PATH_IMAGE010
collecting the Readratio_ij values of the statistical windows in the SEA region and calculating their mean, SEA_ratio_i:
Figure DEST_PATH_IMAGE011
wherein i denotes the ith sample and j denotes the jth statistical window;
prediction of fetal DNA concentration: calculating the fetal DNA concentration from the SNP ratio values at high-frequency heterozygous SNP sites within the probe regions;
construction of the multidimensional Bayesian prior probability from the training set samples: collecting, for all known samples, the fetal free DNA concentration, the classification information (severe thalassemia, mild thalassemia, normal) and the SEA_ratio_i value of each sample, evaluating the SEA_ratio of each sample under the different fetal DNA concentration gradients, and obtaining the prior probability values;
the calculation of the multidimensional Bayesian posterior probability of the sample to be tested comprises the following steps:
taking the reads data of the plasma free DNA of the pregnant woman to be tested, calculating the posterior probability of the sample to be tested on the basis of the multidimensional Bayesian probability model, and determining the classification result of the sample to be tested; specifically:
calculating the fetal DNA concentration of the sample, selecting the Bayesian probability model of the corresponding dimension, and determining the SEA_ratio value of the sample;
according to the interval Ratio_m into which the SEA_ratio value of the sample falls, calculating the posterior probabilities of the three classes of samples,
P(A|Ratio_m, F_n), P(B|Ratio_m, F_n), P(C|Ratio_m, F_n);
comparing the posterior probabilities of the three classes of samples, and assigning the class with the highest posterior probability to the sample; wherein
Figure 193558DEST_PATH_IMAGE012
fetal concentration in a sample
Figure 123468DEST_PATH_IMAGE002
Figure DEST_PATH_IMAGE013
Fetal concentration in a sample
Figure 225417DEST_PATH_IMAGE002
Figure 467042DEST_PATH_IMAGE014
Fetal concentration in a sample
Figure 742166DEST_PATH_IMAGE002
Figure DEST_PATH_IMAGE015
are set to 0.25, 0.5 and 0.25, respectively, according to Mendel's laws of inheritance.
2. The system of claim 1, wherein the prior probability and the posterior probability of the SEA_ratio value at different fetal DNA concentrations are calculated according to the fetal DNA concentration value.
CN201910565206.1A 2019-06-27 2019-06-27 Kit and analysis system for thalassemia detection Active CN110373458B (en)

Publications (2)

Publication Number Publication Date
CN110373458A CN110373458A (en) 2019-10-25
CN110373458B true CN110373458B (en) 2020-05-19

Family

ID=68250769

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2852680T3 (en) * 2012-05-21 2020-03-16 Sequenom Inc Methods and processes for non-invasive evaluation of genetic variations
CN106591441B (en) * 2016-12-02 2021-07-09 王君文 Alpha and/or beta-thalassemia mutation detection probe, method and chip based on whole gene capture sequencing and application
CN108277267B (en) * 2016-12-29 2019-08-13 安诺优达基因科技(北京)有限公司 It detects the device of gene mutation and carries out the kit of parting for the genotype to pregnant woman and fetus
CN106834502B (en) * 2017-03-06 2018-06-26 明码(上海)生物科技有限公司 A kind of spinal muscular atrophy related gene copy number detection kit and method based on gene trap and two generation sequencing technologies
CN107541561B (en) * 2017-04-18 2018-09-07 东莞博奥木华基因科技有限公司 Improve the kit of fetus dissociative DNA concentration, device and method in maternal peripheral blood
CN108048541B (en) * 2018-01-25 2020-11-20 广州精科医学检验所有限公司 System for determining fetal alpha thalassemia gene haplotype
CN108597603B (en) * 2018-05-04 2021-04-20 吉林大学 Cancer recurrence prediction system based on multidimensional Gaussian distribution Bayesian classification
CN108642160B (en) * 2018-05-16 2022-03-11 广州市达瑞生物技术股份有限公司 Method and kit for detecting fetal thalassemia pathogenic gene
CN109584957B (en) * 2019-01-21 2020-04-17 明码(上海)生物科技有限公司 Detection kit for capturing α thalassemia related gene copy number

Also Published As

Publication number Publication date
CN110373458A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
US20240102101A1 (en) Systems and methods to detect rare mutations and copy number variation
CN107771221B (en) Mutation detection for cancer screening and fetal analysis
KR102028375B1 (en) Systems and methods to detect rare mutations and copy number variation
CN106834502B (en) Spinal muscular atrophy related gene copy number detection kit and method based on gene capture and second-generation sequencing technologies
CN103874767B (en) Method and system for genotyping predetermined regions in a nucleic acid sample
CN109637590B (en) Microsatellite instability detection system and method based on genome sequencing
KR101795124B1 (en) Method and system for detecting copy number variation
US20210065842A1 (en) Systems and methods for determining tumor fraction
EP3564391B1 (en) Method, device and kit for detecting fetal genetic mutation
EP3973080A1 (en) Systems and methods for determining whether a subject has a cancer condition using transfer learning
CN112126677B (en) Noninvasive deafness haplotype gene mutation detection method
CN110373458B (en) Kit and analysis system for thalassemia detection
EP4111455A1 (en) Systems and methods for calling variants using methylation sequencing data
WO2023142625A1 (en) Methylation sequencing data filtering method and application
WO2023246949A1 (en) Non-invasive method for determining parentage before birth by using microhaplotypes
CN105838720B (en) PTPRQ gene mutant and its application
CN106119406B (en) Genotyping diagnostic kit for multiple granulomatous vasculitis and arteriolositis and using method thereof
WO2021127565A1 (en) Systems and methods for estimating cell source fractions using methylation information
CN114023442B (en) Bioinformatics analysis method and model for osteosarcoma molecular typing based on multi-omics data
CN112695081B (en) New susceptibility gene of primary biliary cholangitis and application thereof
WO2023239866A1 (en) Methods for identifying cns cancer in a subject
CN115772563A (en) Non-diagnostic method for detecting PAH gene mutation and design method of probe

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant