US20240013859A1 - Fetal chromosomal abnormality detection method and system - Google Patents
Fetal chromosomal abnormality detection method and system
- Publication number
- US20240013859A1 (Application US18/254,842)
- Authority
- US
- United States
- Prior art keywords
- sequence
- layer
- feature vector
- module
- pregnant woman
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Definitions
- The present invention relates to the field of biotechnology and, more specifically, to a method and system for detecting fetal chromosomal abnormalities.
- Chromosomal aneuploidy disease refers to a class of serious genetic diseases in which the number of individual chromosomes in the fetus increases or decreases, thereby affecting normal gene expression. It mainly includes trisomy 21 syndrome, trisomy 18 syndrome, trisomy 13 syndrome, 5p-syndrome, etc. Chromosomal aneuploidy disease carries a high risk of death and disability, and there is no effective treatment. At present, prenatal screening and prenatal diagnosis are mainly used to reduce the birth rate of children with chromosomal aneuploidy.
- Traditional chromosomal aneuploidy detection mainly includes noninvasive prenatal screening based on ultrasonic diagnostic examination or serological screening, and prenatal diagnosis based on invasive sampling.
- The prenatal screening method based on ultrasonic diagnostic examination may be used to judge whether the fetal chromosomes are abnormal by checking the thickness of the nuchal translucency (NT) of the fetus at 10-14 gestational weeks. It is generally believed that the risk of fetal chromosomal aneuploidy is higher when NT is greater than 3 mm.
- The prenatal screening based on serology is performed at 13-16 gestational weeks by detecting the concentrations of alpha fetoprotein (AFP) and human chorionic gonadotrophin (HCG) in the maternal serum and calculating the risk factor of fetal chromosomal abnormalities in combination with the due date and age of the pregnant woman and the gestational week of blood sampling.
- The prenatal diagnostic method based on invasive sampling generally obtains fetal samples by amniocentesis, cordocentesis or direct chorionic sampling at 16-24 gestational weeks to detect whether the fetus has a chromosomal abnormality.
- The combined screening method based on ultrasonic diagnostic examination and serology does not directly detect the fetal chromosomes but estimates the risk of fetal disease, with a detection accuracy of 50%-95% and a false positive rate of 3%-7% [1,2].
- The method based on invasive sampling can directly and accurately diagnose fetal aneuploidy, and is the "gold standard" for the detection and diagnosis of fetal chromosomal abnormalities.
- However, this method leads to a certain abortion rate (0.5%-2%), and pregnant women suffering from infectious diseases such as hepatitis B are not suitable for invasive sampling (such as amniocentesis) because of the risk of infecting the fetus.
- Amniocentesis needs to be carried out under the guidance of B-scan ultrasonography, which takes a long time and places high technical demands on operators.
- NGS: Next Generation Sequencing
- NIPT technology takes maternal peripheral blood, sequences the cell-free DNA in it (including cell-free fetal DNA) by NGS technology, and combines the result with bioinformatics analysis to obtain fetal genetic information. It can detect whether a fetus suffers from chromosomal abnormality diseases such as trisomy 21 syndrome (Down syndrome), trisomy 18 syndrome (Edwards syndrome) and trisomy 13 syndrome (Patau syndrome).
- NIPT technology has high sensitivity and specificity (the sensitivity for each of T21, T18 and T13 is above 99%) and a low false positive rate (<0.1%), and it has been widely used in clinical practice [3-5].
- NIPT technology can reduce the false positive rate of serological screening and avoid the risk of fetal intrauterine infection and abortion caused by invasive prenatal diagnostic operations (such as amniocentesis and chorionic villi sampling). It is a noninvasive prenatal screening technology with high safety in early and middle pregnancy.
- the conventional NIPT based on NGS technology detects fetal chromosomal abnormalities by calculating the read count of sequencing and using Baseline Z-Test [6] .
- the principle is as follows: firstly, the maternal peripheral blood samples at 12-22 gestational weeks are taken, and the cell-free DNA in peripheral blood samples is sequenced using NGS technology, and the obtained sequencing read segments are aligned with the human reference genome sequence (and the GC content is corrected for the read count simultaneously); then the number of unique mapping reads of each chromosome is counted and its proportion to the total unique mapping read counts of chromosomes in the sample is calculated; further, the Z-score of a chromosome in the sample to be detected is obtained by subtracting the mean value of the corresponding proportion of unique mapping read counts of the chromosome in the control sample (i.e.
- The Z-score is compared with a given threshold: the sample is judged to be at high risk of trisomy syndrome if the Z-score is greater than the threshold; otherwise, it is judged to be at low risk.
- The mean value of the unique mapping read counts of each chromosome in the normal samples of the control group is the Baseline Value.
- The given threshold of the Z-score is generally 3, which is statistically defined, i.e., 99.9% deviation from normal expectation.
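The Baseline Z-Test described above can be illustrated with a minimal sketch. It assumes the usual Z-score form (sample proportion minus the control mean, divided by the control standard deviation); the function names, the read-count dictionaries and the toy numbers are hypothetical and are not taken from the cited reference.

```python
from statistics import mean, stdev

def chromosome_fraction(read_counts, chrom):
    """Proportion of one chromosome's unique mapping reads among all reads in a sample."""
    total = sum(read_counts.values())
    return read_counts[chrom] / total

def z_score(sample_counts, control_counts_list, chrom):
    """Z-score of one chromosome against a control group of normal samples."""
    control_fractions = [chromosome_fraction(c, chrom) for c in control_counts_list]
    baseline = mean(control_fractions)   # plays the role of the Baseline Value
    spread = stdev(control_fractions)    # assumption: sample standard deviation of the controls
    return (chromosome_fraction(sample_counts, chrom) - baseline) / spread

def classify(z, threshold=3.0):
    """Apply the threshold rule: above the threshold means high risk."""
    return "high risk" if z > threshold else "low risk"

# Toy data (hypothetical): one million unique mapping reads per sample.
def make_sample(chr21_reads, total=1_000_000):
    return {"chr21": chr21_reads, "other": total - chr21_reads}

controls = [make_sample(n) for n in (13000, 13100, 12900, 13050, 12950)]
elevated = make_sample(13500)
typical = make_sample(13020)

z_elevated = z_score(elevated, controls, "chr21")
z_typical = z_score(typical, controls, "chr21")
```

With these toy numbers the elevated sample lands well above the threshold of 3 while the typical sample stays near zero, which is the behaviour the threshold rule relies on.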
- Different statistical hypothesis tests can be selected according to different Baseline Values. For example, correlation analysis and T-Test are adopted in reference [7] , and the median of the read count of each chromosome in a fixed-size window in the sample is used as the Baseline Value, which represents the read count of this chromosome, and the median of the total read count of chromosomes in the sample is used to represent the read count of the sample; then the read count of each chromosome is divided by the read count of the sample to obtain the normalized read count of corresponding chromosome; finally, the normalized read count of each chromosome of all samples in the control group is used to calculate the confidence interval and the sample is considered to be abnormal when the score of the sample to be detected does not fall within the confidence interval.
- A reference chromosome with a GC content similar to that of the chromosome of interest (such as chromosome 21) is selected in samples of known karyotype, and the read count of the reference chromosome is used as the Baseline Value for the Z-test, which allows the detection accuracy for abnormalities of the chromosome of interest in samples of known karyotype to reach its maximum.
- The reference chromosome serving as the Baseline Value is the so-called Internal Chromosome.
- NIFTY: Noninvasive Fetal Trisomy
- In addition to comparing the read count of the chromosome with that of normal control samples, this method also considers the proportion of cell-free fetal DNA.
- A binary hypothesis test, logarithmic likelihood ratio and the FCAPS binary segmentation algorithm are used to judge detection results.
- NIFTY is a genome-wide approach. This method has been verified in a large population with high accuracy, but the process is relatively complicated.
- The aforementioned statistical hypothesis test (Z-Test or T-Test) method based on the read count is the core of current NIPT analysis.
- the current NIPT analysis method may lead to deviation in the distribution of sequencing read segments of an individual sample, which causes the Z-score calculation to fluctuate in different situations, thus affecting the final result judgment and related performance indicators;
- the current NIPT analysis method highly depends on the proportion of cell-free fetal DNA in maternal peripheral blood, and an excessively low proportion of cell-free fetal DNA (<4%) may increase the risk of false negative detection due to the large individual differences among pregnant women;
- the current NIPT analysis method performs well in the detection of trisomy 21 syndrome, but its accuracy in the detection of trisomy 18 syndrome and trisomy 13 syndrome is poor due to the individual differences among pregnant women and the deviation of GC content in different chromosomes;
- the current NIPT analysis method mainly detects the common trisomy syndrome represented by Down syndrome, and has limited clinical effect on the detection of chromosome micro
- A new technology that uses a machine learning model on NIPT sequencing results to detect chromosomal abnormalities has been proposed.
- A method of assisting NIPT decisions using a Support Vector Machine (SVM) has been proposed in reference [10].
- SVM: Support Vector Machine
- Six different Z-score results are obtained by calculating different Baseline Values, and clinical indications of the samples are also added for training the SVM model to judge chromosomal abnormalities.
- A Bayes method for judging chromosomal abnormalities has been designed in reference [11].
- This method utilizes the prior information of cell-free fetal DNA proportion, uses Hidden Markov Model (HMM) to eliminate the interference of population level and maternal CNV, and performs GC content correction, then calculates the Bayes factor by combining with a likelihood value of Z-Test and an inferred prior value of cell-free fetal DNA proportion from the sex chromosome content.
- HMM: Hidden Markov Model
- multiple risk factors such as the age of the pregnant woman are incorporated into the prior probability to correct the Bayes factor, and the Z-score and Bayes factor are integrated to evaluate whether the chromosome is abnormal.
- cell-free fetal DNA and cell-free maternal DNA may be first isolated from peripheral blood samples, and various single nucleotide variation (SNV) loci are amplified from the isolated free DNA, and the amplified products are sequenced to determine the genetic sequencing data or genetic array data of multiple SNV loci. Then, based on these genetic sequencing data or genetic array data, the artificial neural network model is trained to detect the ploidy state, tissue cancer state, or organ transplantation rejection state of the individual chromosomes.
- SNV: single nucleotide variation
- the aforementioned methods based on machine learning model using NIPT sequencing results to detect chromosomal abnormalities also have the following limitations: most of these methods calculate the desirable features for model training based on the read count of sequencing data; most of these methods rely on the calculation of Z-score; either the calculation is too complex (e.g., reference [11] ), or the model design is too simple (e.g., patent publication [12] ), or genetic sequencing data or genetic array data based on SNV loci are required (e.g., patent publication [13] ) which limit clinical application prospects, model scalability and detection accuracy; and the detection accuracy needs to be improved.
- the invention aims at least to further improve the detection accuracy of chromosomal abnormalities based on the deep hybrid model.
- the invention provides a method of detecting a fetal chromosomal abnormality, comprising:
- the cell-free nucleic acid fragments are derived from peripheral blood, liver, and/or placenta of the pregnant woman.
- the cell-free nucleic acid fragments are cell-free DNA.
- the sequencing data are derived from ultra-low depth sequencing; preferably, the sequencing depth of the ultra-low depth sequencing is 1×, 0.1×, or 0.01×.
- the read segments are aligned to the reference genome to obtain the unique mapping reads (preferably, GC content correction is performed); preferably, the subsequent steps are carried out with the unique mapping reads (preferably, the read segments are corrected by GC content).
- the GC content correction is performed as follows:
- c k ⁇ (f(k)) represents GC content of fragment k
- $F_i$ represents the number of sequencing read segments with GC content of $i$ and start site same as that of the fragment.
- $\lambda_i = \begin{cases} r \cdot \dfrac{F_i}{N_i}, & F_i > 0 \text{ and } N_i > 0 \\ 1, & \text{otherwise} \end{cases}$
- $r$ is the global scaling factor, which is defined as:
- $R_i$ represents the expected number of sequencing read segments with a corrected GC content of $i$.
- the phenotypic feature data of the pregnant woman are selected from one or combination of more of: age, gestational week, height, weight, BMI, biochemical test results of prenatal examination, ultrasonic diagnosis results, and cell-free fetal DNA concentration in plasma.
- the phenotypic feature data of the pregnant woman are subjected to outlier processing, missing value processing and/or null value processing.
- the phenotypic data of the pregnant woman sample will be judged as outliers if the following records appear:
- the missing values and null values are imputed using the missForest algorithm.
- the chromosome is chromosome 21, chromosome 18, chromosome 13 and/or a sex chromosome.
- the sequence feature matrix includes the number, the base quality and the mapping quality of read segments within the sliding windows.
- the base quality includes the mean, the standard deviation, the skewness, and/or the kurtosis of the base quality.
- the mapping quality includes the mean, the standard deviation, the skewness and/or the kurtosis of the mapping quality.
- the sequence feature matrix is:
- $h$ represents the number of sliding windows
- $w$ represents the number of sequence features within a single sliding window
- $x_{ij}$ represents the $j$-th sequence eigenvalue in the $i$-th sliding window.
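The matrix itself did not survive extraction. From the definitions of the number of sliding windows (h), the number of features per window (w) and the entries x_ij, it can plausibly be written as below. The symbol X for the whole matrix is an assumption, chosen to match the X used for matrix entries in the normalization formula elsewhere in this document.

```latex
X =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1w} \\
x_{21} & x_{22} & \cdots & x_{2w} \\
\vdots & \vdots & \ddots & \vdots \\
x_{h1} & x_{h2} & \cdots & x_{hw}
\end{pmatrix}
\in \mathbb{R}^{h \times w}
```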
- the sequence feature matrix is normalized.
- the sequence feature matrix is normalized by using formula (I):
- the trained machine learning model is a neural network model or an AutoEncoder model; preferably, the neural network model is a deep neural network model, and more preferably, the neural network model is a deep neural network model based on 1D convolution.
- the structure of the deep neural network model includes:
- the pre-module includes:
- the core module consists of one or more residual submodules with the same structure, wherein the output of each residual module is the input of the next residual module.
- the residual submodule includes:
- the SE module includes:
- the sSE module includes:
- the combined feature vector is obtained by combining the sequence feature vector and the phenotypic feature vector of the pregnant woman.
- the combined feature vector x is normalized by:
- $x_i' = \dfrac{x_i - \mu_i}{\sigma_i}$
- the classification detection model is an ensemble learning model.
- the ensemble learning model is an ensemble learning model based on Stacking or Majority Voting; preferably, the ensemble learning model is one or more of: support vector machine model, naive Bayes classifier, random forest classifier, XGBoost and logistic regression.
- the chromosomal abnormality includes at least one or more of: trisomy 21 syndrome, trisomy 18 syndrome, trisomy 13 syndrome, 5p-syndrome, chromosomal microdeletion and chromosomal microduplication.
- the invention provides a method of constructing a classification detection model for detecting a fetal chromosomal abnormality, comprising:
- the fetal chromosomal state of each of the pregnant women is one or more of: normal diploid, chromosomal aneuploidy, partial monosomy syndrome, chromosomal microdeletion and chromosomal microduplication.
- the chromosomal aneuploidy includes at least one or more of: trisomy 21 syndrome, trisomy 18 syndrome and trisomy 13 syndrome.
- the partial monosomy syndrome includes 5p-syndrome.
- the number of the pregnant women is greater than 10, and the ratio of the number of fetuses with normal diploid to that of fetuses with chromosomal aneuploidy is 1/2 to 2.
- the training data set is represented as:
- the trained machine learning model includes an output layer.
- the structure of the deep neural network model includes an output layer after the first global average pooling layer; the output layer is connected with the first global average pooling layer and is a fully connected layer with a single output neuron, which is used to output the chromosomal abnormality state.
- the invention provides a system of detecting a fetal chromosomal abnormality, comprising:
- the system further includes an alignment module for aligning the read segments of the sequencing data to a reference genome to obtain the unique mapping reads.
- the cell-free nucleic acid fragments are derived from the peripheral blood, liver, and/or placenta of the pregnant woman.
- the cell-free nucleic acid fragments are cell-free DNA.
- the sequencing data are derived from ultra-low depth sequencing; preferably, the sequencing depth of the ultra-low depth sequencing is 1×, 0.1×, or 0.01×.
- the read segments are aligned to the reference genome to obtain the unique mapping reads (preferably, GC content correction is performed); preferably, the subsequent steps are carried out with the unique mapping reads (preferably, the read segments are corrected by GC content).
- the GC content correction is performed as follows:
- the phenotypic feature data of the pregnant woman are selected from one or combination of more of: age, gestational week, height, weight, BMI, biochemical test results of prenatal examination, ultrasonic diagnosis results, and cell-free fetal DNA concentration in plasma.
- the phenotypic feature data of the pregnant woman are subjected to outlier processing, missing value processing and/or null value processing.
- the phenotypic data of the pregnant woman sample will be judged as outliers if the following records appear:
- the missing values and null values are padded by missForest algorithm.
- the chromosome is chromosome 21, chromosome 18, chromosome 13 and/or a sex chromosome.
- the sequence feature matrix includes the number, the base quality and the mapping quality of read segments within the sliding windows.
- the base quality includes the mean, the standard deviation, the skewness, and/or the kurtosis of the base quality.
- the mapping quality includes the mean, the standard deviation, the skewness and/or the kurtosis of the mapping quality.
- the sequence feature matrix is:
- the sequence feature matrix is normalized.
- the sequence feature matrix is normalized by formula (I):
- $Z_{i,j}^{(k)}$ is the normalized sequence feature matrix of sample $k$
- $X_{i,j}^{(k)}$ represents the $j$-th sequence eigenvalue in the $i$-th sliding window of sample $k$
- $\mu_{i,j}$ and $\sigma_{i,j}$ represent the mean and standard deviation of the $j$-th sequence eigenvalue in the $i$-th sliding window of all samples, respectively.
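Formula (I) itself is missing from the extracted text. Given that the normalized matrix is defined from each sample's entries together with the per-position mean and standard deviation over all samples, it is most plausibly the element-wise z-score below. This is a reconstruction from the surrounding definitions, not a verbatim copy of the original formula, and the symbols for the mean and standard deviation are assumed.

```latex
Z_{i,j}^{(k)} = \frac{X_{i,j}^{(k)} - \mu_{i,j}}{\sigma_{i,j}}
\qquad \text{(I)}
```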
- the trained machine learning model is a neural network model or an AutoEncoder model; preferably, the neural network model is a deep neural network model, and more preferably, the neural network model is a deep neural network model based on 1D convolution.
- the definition in the embodiments of the first aspect of the invention also applies.
- the combined feature vector is obtained by combining the sequence feature vector and the phenotypic feature vector of the pregnant woman.
- the combined feature vector x is normalized by:
- $x_i' = \dfrac{x_i - \mu_i}{\sigma_i}$
- the classification detection model is an ensemble learning model.
- the ensemble learning model is an ensemble learning model based on Stacking or Majority Voting; preferably, the ensemble learning model is one or more of: support vector machine model, naive Bayes classifier, random forest classifier, XGBoost and logistic regression.
- the invention provides a system of constructing a classification detection model for detecting a fetal chromosomal abnormality, comprising:
- the system further includes an alignment module for aligning the read segments of the sequencing data to a reference genome to obtain the unique mapping reads.
- the trained machine learning model includes an output layer.
- the structure of the deep neural network model includes an output layer after the first global average pooling layer; the output layer is connected with the first global average pooling layer and is a fully connected layer with a single output neuron, which is used to output the chromosomal abnormality state.
- the method and model of the invention rely on an innovative algorithm applied to the sequencing data rather than on the Z-test, and thereby avoid the clinical problem that a threshold-based call is difficult to make when the result score falls in the “grey area”.
- the hybrid model proposed by the invention can be automatically upgraded and optimized to improve the detection accuracy.
- FIG. 1 illustrates a flow chart of the method of detecting fetal chromosomal abnormalities based on a deep neural network hybrid model according to an embodiment of the present invention.
- FIG. 2 illustrates the calculation of a feature matrix of sequencing data according to an embodiment of the present invention.
- FIG. 3 illustrates the structure of a deep neural network according to an embodiment of the present invention.
- FIG. 4 illustrates a Squeeze-Excite module (SE module) according to an embodiment of the present invention.
- FIG. 5 illustrates a Spatial Squeeze-Excite module (sSE module) according to an embodiment of the present invention.
- FIG. 6 illustrates the missing value padding of the phenotypic data set according to an embodiment of the present invention.
- FIG. 7 illustrates the structure of an ensemble learning model based on Stacking according to an embodiment of the present invention.
- FIG. 8 illustrates the ROC curve of the 5-fold cross-validation training results of an ensemble learning model based on Stacking according to an embodiment of the present invention.
- FIG. 9 illustrates the ROC curve evaluated by the model based on the testing set according to an embodiment of the present invention.
- FIG. 10 illustrates the Precision-Recall curve evaluated by the model based on the testing set according to an embodiment of the present invention.
- FIG. 11 illustrates the confusion matrix diagram when the decision threshold is the default (i.e., 0.5) according to an embodiment of the present invention.
- FIG. 12 illustrates precision and recall as functions of the decision threshold according to an embodiment of the present invention.
- FIG. 13 illustrates the confusion matrix diagram when the minimum recall is 0.95 (i.e., limiting type II error) according to an embodiment of the present invention.
- the method of detecting fetal chromosomal abnormalities can be implemented by the system of detecting fetal chromosomal abnormalities; the method of constructing a classification detection model for detecting fetal chromosomal abnormalities can be implemented by the system of constructing a classification detection model for detecting fetal chromosomal abnormalities.
- the data acquisition module is used for obtaining the sequencing data of cell-free nucleic acid fragments and the clinical phenotypic feature data of a pregnant woman, wherein the sequencing data comprise a plurality of read segments, the fetal chromosomal state of the pregnant woman is known (training samples) or unknown (samples to be detected), and the clinical phenotypic feature data of the pregnant woman form a phenotypic feature vector of the pregnant woman.
- the data acquisition module can include a data receiving module for receiving the above data.
- the data acquisition module can further include a sequencer, which can obtain sequencing data by inputting the cell-free nucleic acid of a pregnant woman for sequencing.
- Sequencing can be high throughput sequencing, and can be ultra-low depth sequencing, with a sequencing depth of 1×, 0.1×, or 0.01×.
- the cell-free nucleic acid can be derived from the peripheral blood, liver, and/or placenta of pregnant woman.
- the clinical phenotypic feature of the pregnant woman and the fetal chromosomal state of the pregnant woman (training samples) can be available in the database, wherein the fetal chromosomal state of the pregnant woman can be chromosomal aneuploidy, microdeletion and/or microduplication.
- the alignment module is used for aligning the read segments to a reference genome to obtain the unique mapping reads.
- Application software that aligns the sequences to a reference genome can be available from an open-source developer, for example, from some online websites, or can be self-developed.
- the sequence feature matrix generation module is used for performing window division on at least part of a chromosome sequence of a reference genome to obtain sliding windows, counting the read segments falling within the sliding windows, and generating a sequence feature matrix of the chromosome sequence.
- This can be implemented by sliding fixed-length windows along the chromosome sequence; the window length may be 10 k, 100 k, 1M, or 10M, etc.
- Step size can be any length and is generally set as half of the length of sliding windows for convenient calculation.
- the length of chromosome sequence is only required to be greater than that of the sliding window, which can be 10 k, 100 k, 1M, 10M, or 100M . . . till the length of an entire chromosome.
- Chromosome can be the target chromosome, for example, Chromosome 21 corresponding to the detection of trisomy 21 syndrome, Chromosome 18 corresponding to the detection of trisomy 18 syndrome, Chromosome 13 corresponding to the detection of trisomy 13 syndrome, Chromosomes XY corresponding to the detection of sex chromosome abnormality, and all autosomes corresponding to the detection of chromosomal microdeletion/microduplication.
- within each sliding window, parameters including the number of reads, base quality (a measure of sequencing accuracy) and mapping quality (a measure of the accuracy of aligning the read segments to the reference genome; the higher the mapping quality, the more unique the alignment position of the read segment in the reference genome), etc. are counted, which can be done using computer software.
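The windowing and counting described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation, and it rests on several assumptions: reads are plain dictionaries with hypothetical keys `pos`, `base_quals` and `mapq`; a read is assigned to a window by its start position; moments are population moments; and kurtosis is the non-excess (Pearson) form. Each window yields 9 features: the read count, then mean, standard deviation, skewness and kurtosis for base quality and for mapping quality.

```python
import math


def moments(values):
    """Return (mean, std, skewness, kurtosis) from population moments.

    Empty or constant inputs give zeros for the undefined statistics
    (an assumption made for this sketch).
    """
    n = len(values)
    if n == 0:
        return (0.0, 0.0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return (float(mean), 0.0, 0.0, 0.0)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return (float(mean), math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2)


def sequence_feature_matrix(reads, region_length, window=1_000_000, step=None):
    """Build an h x 9 feature matrix over fixed-length sliding windows.

    The step defaults to half the window length, as the text suggests.
    """
    if step is None:
        step = window // 2
    matrix = []
    for w_start in range(0, region_length - window + 1, step):
        w_end = w_start + window
        in_window = [r for r in reads if w_start <= r["pos"] < w_end]
        base_q = [q for r in in_window for q in r["base_quals"]]
        map_q = [r["mapq"] for r in in_window]
        matrix.append([float(len(in_window)), *moments(base_q), *moments(map_q)])
    return matrix
```

Because the step is half the window length, consecutive windows overlap and a read may contribute to two rows. The list scan per window keeps the sketch short; a production version would bin sorted reads instead.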
- the sequence feature extraction module is used for extracting sequence features of a chromosome sequence.
- the sequence feature vector generation module uses the sequence feature matrix and the fetal chromosomal state of the pregnant woman to construct the training data set and train a machine learning model to extract the sequence feature vector of the chromosome sequence.
- the sequence feature vector generation module uses the sequence feature matrix to construct the testing data set and inputs it into the trained machine learning model, such as the deep neural network model, to extract the sequence feature vector of the chromosome sequence.
- the classification detection module such as the training module of the ensemble learning model, is used to train a classification detection model by the combined feature vector formed by the sequence feature vector and the phenotypic feature vector of the pregnant woman as well as the fetal chromosome state to obtain the trained classification detection model.
- the classification detection module is used to combine the sequence feature vector with the phenotypic feature vector of the pregnant woman to form a combined feature vector as an input, and utilize the trained classification detection model to detect chromosomal abnormality state.
- the present invention proposes a completely innovative method of detecting chromosomal abnormalities, such as aneuploidy, microdeletion or microduplication.
- the present invention does not detect aneuploidy directly based on the number of read segments and Z-score and does not require tedious work of data preprocessing and feature extraction selection.
- the invention designs a machine learning model that automatically extracts the sequence feature vector from the sequence feature matrix generated from the sequencing data, combines the sequence feature vector with the clinical phenotypic features of the pregnant woman, and uses the classification detection model for detection, finally obtaining the prediction of whether there is a genetic abnormality in the fetal chromosomes.
- the machine learning model is used to automatically extract the sequence feature vector from the sequencing data, which avoids the disadvantages of traditional manual extraction of NIPT whole-genome sequence features.
- the method of the present invention not only fully mines the sequencing data information, but also makes full use of the clinical phenotype information of the pregnant woman (the phenotype data that can be added into the model include maternal age, gestational week, height, weight, BMI (body mass index), biochemical test results of prenatal examination, and ultrasonic diagnosis results such as NT value, etc.), and combines the extracted sequence feature vector with the phenotypic feature vector of the pregnant woman, so as to fully mine the abundant feature information contained in the NIPT sequencing data and the clinical phenotypic results of the pregnant woman and to ensure high reliability and validity of the detection results.
- the method of the present invention not only can be used to detect the common trisomy syndrome, but also can be used to detect other chromosome defects, such as chromosomal copy number variation, chromosomal microdeletion, chromosomal microduplication, etc.
- extracting the sequence feature vector can also be carried out by using the deep neural network model based on Autoencoder network or Variational Autoencoder network, etc.
- an ensemble learning model based on Stacking or Majority Voting is trained to detect chromosomal abnormalities, and the findings of aneuploidy by different classifiers are fully utilized, greatly improving the accuracy of finding aneuploidy.
- the reference genome refers to the map of human genome with normal diploid chromosomes produced by, for example, the Human Genome Project, such as hg38, hg19, etc.
- the reference genome can be one chromosome or more chromosomes, or it can be part of a chromosome.
- Example 1 Example of Constructing a Detection Model
- High throughput sequencing platform BGIseq500 is used to sequence training samples (SE35 is adopted, with a sequencing depth of 0.1×), that is, the cell-free nucleic acid fragments of a pregnant woman.
- the fetal chromosomal state of the pregnant woman is known.
- the sequencing data are aligned to a reference genome and the repeated alignment sequences are filtered to obtain the unique mapping reads.
- Base quality is for quantitative description of the accuracy of sequencing results;
- the mean, standard deviation, skewness and kurtosis of base quality refer to the mean, standard deviation, skewness and kurtosis of all base quality in sequencing reads, respectively.
- Map quality refers to the reliability of the alignment of a given sequencing read segment to a reference genome sequence;
- the mean, standard deviation, skewness and kurtosis of map quality refer to the mean, standard deviation, skewness and kurtosis of map quality of given sequencing read segments, respectively.
- $Z^{(k)}$ is the normalized sequence feature matrix of sample $k$ (hereinafter referred to as the normalized sequence feature matrix), with $k \in [1, N]$, defined as:
- $X_{i,j}^{(k)}$ represents the $j$-th sequence feature vector in the $i$-th sliding window of sample $k$ in the training set
- $\mu_{i,j}$ is the mean of the $j$-th sequence feature vectors in the $i$-th sliding window in the training set
- $\sigma_{i,j}$ is the standard deviation of the $j$-th sequence feature vectors in the $i$-th sliding window in the training set
- $i$ is an integer $\geq 1$
- $j$ is an integer $\geq 1$;
- $y^{(k)} = \begin{cases} 1, & \text{abnormal fetal chromosomes of sample } k \\ 0, & \text{normal fetal chromosomes of sample } k \end{cases}$
- a deep neural network model is constructed, and its structure is shown in FIG. 3. All convolution layers in the deep neural network model perform 1D convolution operations. Unless otherwise specified, the 1D convolution kernels (i.e., 1D filters) share the same parameters: the number of 1D convolution kernels is $f$; the size of each 1D convolution kernel is $k$; the step size of the 1D convolution operation is $s$; the 1D convolution kernel uses L2 regularization with regularization factor $r_{L2}$; the initialization function of the 1D convolution kernel is $g$; the output feature map of the 1D convolution operation is set to have the same size as the input feature map; the size of the pooling kernel is $p$, and the pooling step size is $p_s$.
- all Dropout layers in the deep neural network model use the same dropout ratio, which is set as $d$.
- the structure of the deep neural network model includes:
- the input layer is used to receive the normalized sequence feature matrix $Z^{(k)}$ with a size of $h \times w$.
- the pre-module is connected with the input layer, and is used for performing the first convolution and activation operation of the input sequence feature matrix to obtain the abstract representation feature map.
- the pre-module includes: a 1D convolution layer, a batch normalization layer connected with the 1D convolution layer, and a ReLU activation layer connected with the batch normalization layer.
- the core module is connected with the pre-module, and is used for further abstraction and feature extraction of the feature map, and strengthening the expression ability of the neural network by effectively increasing the depth of the neural network model.
- the core module consists of three repeated operations of residual modules with the same structure, wherein the output of each residual module is the input of the next residual module.
- Each residual module includes:
- the post-module has the same structure as the pre-module; the only difference is that the number of 1D convolution kernels in the post-module is set as $n_{out}$. The post-module is used for feature abstraction representation of the feature map from the core module before output.
- the first global average pooling layer is connected with the post-module, and is used for vectorizing the feature map of the feature abstraction representation.
- the output layer is connected with the first global average pooling layer, and is a fully connected layer with a single output neuron and a sigmoid activation function, which is used to output the chromosomal abnormality state.
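The forward pass through the simpler layers above can be illustrated in plain Python. This is a hedged sketch, not the patent's model: it covers only a 1D convolution layer followed by batch normalization and ReLU (the pre-module pattern), global average pooling, and the sigmoid output neuron. The residual core module is omitted because its internal layers are not listed in this text. A stride of 1 with zero padding is assumed so the output length equals the input length, batch normalization uses fixed statistics as at inference time, and all function names are hypothetical.

```python
import math


def conv1d_same(x, kernels, biases):
    """1D convolution, stride 1, zero padded so output length equals input length.

    x: list of h positions, each a list of c_in channel values.
    kernels: list of f kernels; each kernel is a list of k taps,
             and each tap is a list of c_in weights.
    """
    h, k = len(x), len(kernels[0])
    pad_left = (k - 1) // 2
    out = []
    for i in range(h):
        row = []
        for kernel, bias in zip(kernels, biases):
            acc = float(bias)
            for t in range(k):
                src = i + t - pad_left
                if 0 <= src < h:  # positions outside the sequence count as zeros
                    acc += sum(w * v for w, v in zip(kernel[t], x[src]))
            row.append(acc)
        out.append(row)
    return out


def batch_norm_relu(x, mean, var, gamma, beta, eps=1e-5):
    """Per-channel batch normalization with fixed statistics, followed by ReLU."""
    return [
        [max(0.0, g * (v - m) / math.sqrt(s + eps) + b)
         for v, m, s, g, b in zip(row, mean, var, gamma, beta)]
        for row in x
    ]


def global_average_pool(x):
    """Average each channel over all positions, turning the feature map into a vector."""
    h = len(x)
    return [sum(row[c] for row in x) / h for c in range(len(x[0]))]


def sigmoid_output(vec, weights, bias):
    """Fully connected layer with one output neuron and sigmoid activation."""
    z = sum(w * v for w, v in zip(weights, vec)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)
```

In this sketch the vector returned by `global_average_pool` plays the role of the sequence feature vector, and `sigmoid_output` maps it to a value between 0 and 1 during training.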
- the training set is used to train the deep neural network model in step 4, and the sequence feature vector of the sample is calculated using the trained deep neural network model.
- the process is as follows:
- the phenotypic result of the corresponding pregnant woman sample is obtained, and the initial phenotypic feature vector $phe_{init}$, which includes 5 features, is constructed and defined as:
- the phenotypic data set of the pregnant woman is preprocessed, which includes outlier processing and missing value or null value processing.
- the phenotypic data of the pregnant woman sample are judged as outliers if the following records appear:
- the phenotypic data matrix P is constructed, which is defined as:
- MissForest algorithm is used for missing value padding, which is a non-parametric missing value padding algorithm based on random forest (see reference [18] for details). Its algorithm is as follows:
- BMI phenotypic results after the missing value padding
- the sequence feature vector described in step 5 is combined with the final feature vector described in step 7 to obtain a combined feature vector:
- the combined feature vector described in step 8 is normalized by:
- $x_i' = \dfrac{x_i - \mu_i}{\sigma_i}$
- $y^{(k)} = \begin{cases} 1, & \text{abnormal fetal chromosomes of sample } k \\ 0, & \text{normal fetal chromosomes of sample } k \end{cases}$
- the ensemble learning algorithm based on Stacking is used to predict aneuploidy.
- the algorithm is as follows (see reference [19] for details):
- the invention proposes a method of detecting a fetal chromosomal abnormality, which uses nucleic acid sequencing results of noninvasive prenatal testing (NIPT) and phenotypic data of a pregnant woman together to predict whether the genetic abnormality presents in fetal chromosomes.
- the process and steps of the method of detecting a fetal chromosomal abnormality are shown in FIG. 1, and the specific process is described below.
- High throughput sequencing platform BGIseq500 is used to sequence samples to be detected (SE35 is adopted, with a sequencing depth of 0.1×).
- the sequencing data are aligned to a reference genome and the repeated alignment sequences are filtered to obtain the unique mapping reads.
- Base quality is for quantitative description of the accuracy of sequencing results; the mean, standard deviation, skewness and kurtosis of base quality refer to the mean, standard deviation, skewness and kurtosis of all base quality in sequencing read segments, respectively.
- Map quality refers to the reliability of the alignment of a given sequence segment to a reference genome sequence; the mean, standard deviation, skewness and kurtosis of map quality refer to the mean, standard deviation, skewness and kurtosis of map quality of given sequencing read segments, respectively.
- Example 4 The trained deep neural network model in Example 1 is used to calculate the sequence feature vector of the sample, and the process is as follows:
- the phenotypic result corresponding to the pregnant woman sample to be detected is obtained, and the initial phenotypic feature vector $phe_{init}$, which includes 5 features, is constructed and defined as:
- the phenotypic data of the pregnant woman sample to be detected are judged as outliers if the following records appear:
- Step 7: the sequence feature vector described in step 4 is combined with the final feature vector described in step 6 to obtain a combined feature vector:
- the combined feature vector described in step 7 is normalized by:
- $x_i' = \dfrac{x_i - \mu_i}{\sigma_i}$
- This example uses 1205 samples with “trisomy 21 (T21)” as positive samples and 1600 samples with normal chromosome (diploid) as negative samples.
- Table 1 describes the number of training samples and testing samples.
Sample type | Total number of samples (N) | Number of samples in training set (90% × N) | Number of samples in testing set (10% × N) |
---|---|---|---|
Positive samples (T21) | 1205 | 1084 | 121 |
Negative samples (Normal) | 1600 | 1440 | 160 |
- the feature matrix of the corresponding sequencing data in the training set is used to train the deep neural network model.
- Table 2 lists the operations of each layer, the size of the output feature map, and the network connections in the deep neural network model based on the parameters described.
- As in step 7 of Example 1 above, the phenotypic features of all samples in the whole data set (including the training set and testing set) are obtained and the outliers of the phenotypic features are processed.
- As in step 7 of Example 1 above, the phenotypic features in the training set are subjected to the missing value padding, and the padding model of the missing values is saved.
- As in step 7 of Example 1 above, BMI is calculated for the phenotypic features in the training set after the missing value processing, as shown in FIG. 6.
- As in step 8 of Example 1 above, the sequence feature vector in the training set is combined with the phenotypic feature vector of the corresponding sample to obtain a combined feature vector.
- As in step 9 of Example 1 above, the combined feature vector of each sample in the training set is normalized to obtain the normalized feature vector, and the normalization model of the combined feature vector is saved.
- the saved padding model of the missing values is used for the missing value padding of the phenotypic features of each sample in the testing set, and the sequence feature vector of the testing set is then combined with the phenotypic feature vector of the corresponding sample to obtain the combined feature vector of the testing set, and then the saved normalization model of the combined feature vector is used to normalize the combined feature vector in the testing set.
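The point of saving the normalization model is that the mean and standard deviation are estimated on the training set only and then reused unchanged on the testing set. A minimal sketch of that fit-then-apply pattern follows. The function names are hypothetical, the population standard deviation is assumed, and mapping zero-variance features to 0 is a choice made for this sketch rather than something the text specifies.

```python
import math


def combine(sequence_vector, phenotype_vector):
    """Concatenate the sequence feature vector and the phenotypic feature vector."""
    return list(sequence_vector) + list(phenotype_vector)


def fit_normalizer(train_vectors):
    """Estimate per-feature mean and standard deviation on the training set only."""
    n = len(train_vectors)
    dims = len(train_vectors[0])
    mu = [sum(v[i] for v in train_vectors) / n for i in range(dims)]
    sigma = [
        math.sqrt(sum((v[i] - mu[i]) ** 2 for v in train_vectors) / n)
        for i in range(dims)
    ]
    return mu, sigma


def apply_normalizer(vector, mu, sigma):
    """Apply x' = (x - mu) / sigma using the saved training statistics."""
    return [
        (x - m) / s if s > 0 else 0.0  # assumption: constant features map to 0
        for x, m, s in zip(vector, mu, sigma)
    ]
```

The pair `(mu, sigma)` returned by `fit_normalizer` stands in for the saved normalization model; a testing-set vector is passed through `apply_normalizer` with those same values and is never used to recompute them.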
- the normalized feature vector of the training set obtained in above step 10 is used to train the ensemble learning model based on Stacking, as shown in FIG. 7 .
- Step 13: the trained ensemble learning model based on Stacking described in step 12 is verified using the testing set.
- the invention proposes using a machine learning model (such as a deep neural network) to extract the sequence feature vector of NIPT sequencing data (the features including but not limited to the read count, base quality and mapping quality), then combining the sequence feature vector with the phenotypic feature vector of the pregnant woman (the phenotypic features including but not limited to maternal age, gestational week, height, weight, BMI, biochemical test results of the prenatal examination, and ultrasonic diagnosis results such as NT value, etc.) to form a combined vector, and then using a classification model, such as an ensemble learning model based on Stacking, to obtain the final aneuploidy prediction.
- extracting the sequence feature vector is not limited to the method used herein; other models, including but not limited to an Autoencoder network or a Variational Autoencoder network, can also be used.
- the model structure proposed by the invention is a hybrid model, that is, the model comprises 2 stages. In the first stage, a machine learning model (such as a deep neural network) is used to calculate the sequence feature vector. In the second stage, a classification model (such as an ensemble learning model based on Stacking) is used to predict aneuploidy by using the combination of sequence feature vector and phenotypic feature vector. Other ensemble learning models, such as a model based on Majority Voting, can also be used.
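The two-stage flow can be sketched with the Majority Voting alternative, which is simpler to show than Stacking. This is an illustrative sketch under stated assumptions, not the patent's model: the feature extractor and the base classifiers are passed in as plain callables standing in for trained models, `hybrid_predict` is a hypothetical name, and an odd number of classifiers is assumed so that binary votes cannot tie.

```python
from collections import Counter


def hybrid_predict(extract_sequence_vector, classifiers, feature_matrix, phenotype_vector):
    """Two-stage prediction: feature extraction, then a majority vote.

    extract_sequence_vector: callable standing in for the trained stage-1 model.
    classifiers: callables mapping a combined feature vector to a label (0 or 1).
    """
    # Stage 1: turn the sequence feature matrix into a sequence feature vector.
    sequence_vector = extract_sequence_vector(feature_matrix)
    # Combine with the phenotypic feature vector of the pregnant woman.
    combined = list(sequence_vector) + list(phenotype_vector)
    # Stage 2: each base classifier votes; the most common label is returned.
    votes = [classify(combined) for classify in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label
```

A Stacking ensemble would differ in stage 2: instead of counting votes, a further model is trained on the base classifiers' outputs to produce the final prediction.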
- the verified deep neural network model used in the examples of the invention has the following network design and architecture features: it is based on 1D convolution; it is based on a residual network; and it incorporates the SE module of the Squeeze-Excite network.
- the neural network model used in the examples of the invention has more layers (see Example 3), effectively reduces the risk of vanishing gradients and overfitting during model training, and improves stability, thereby effectively improving the accuracy of the model's prediction results.
- the invention can be implemented as a computer-readable storage medium, on which a computer program is stored, and the steps to implement the method of the invention are executed when the computer program is executed by a processor.
- the computer program is distributed over several computer devices or processors coupled by network, so that the computer program is stored, accessed, and executed in a distributed manner by one or more computer devices or processors.
- a single step/operation, or two or more steps/operations can be executed by a single computer device or processor or by two or more computer devices or processors.
- One or more steps/operations can be executed by one or more computer devices or processors, and one or more other steps/operations can be executed by one or more other computer devices or processors.
- One or more computer devices or processors can execute a single step/operation, or two or more steps/operations.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/132331 WO2022110039A1 (zh) | 2020-11-27 | 2020-11-27 | Fetal chromosomal abnormality detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
US20240013859A1 (en) | 2024-01-11 |
Family
ID=81753821
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/254,842, Pending, US20240013859A1 (en) | 2020-11-27 | 2020-11-27 | Fetal chromosomal abnormality detection method and system |
Country Status (8)
Country | Link |
---|---|
US (1) | US20240013859A1 (en) |
EP (1) | EP4254418A4 (en) |
JP (1) | JP2024505780A (ja) |
KR (1) | KR20230110615A (ko) |
CN (1) | CN116648752A (zh) |
AU (1) | AU2020479407A1 (en) |
CA (1) | CA3200221A1 (en) |
WO (1) | WO2022110039A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
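CN114792548B (zh) * | 2022-06-14 | 2022-09-09 | 北京贝瑞和康生物技术有限公司 | Method, device and medium for correcting sequencing data and detecting copy number variation |
CN114841294B (zh) * | 2022-07-04 | 2022-10-28 | 杭州德适生物科技有限公司 | Method and apparatus for training a classifier model for detecting chromosomal structural abnormalities |
CN117095747B (zh) * | 2023-08-29 | 2024-04-30 | 广东省农业科学院水稻研究所 | Method for detecting population inversion or transposon endpoint genotypes based on a linear pan-genome and an artificial intelligence model |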
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1600265A (zh) * | 2004-09-27 | 2005-03-30 | 郑末晶 | Prenatal screening method for Down syndrome and neural tube defects |
WO2018064486A1 (en) * | 2016-09-29 | 2018-04-05 | Counsyl, Inc. | Noninvasive prenatal screening using dynamic iterative depth optimization |
WO2019055835A1 (en) * | 2017-09-15 | 2019-03-21 | The Regents Of The University Of California | DETECTION OF SOMATIC MONONUCLEOTIDE VARIANTS FROM ACELLULAR NUCLEIC ACID WITH APPLICATION TO MINIMUM RESIDUAL DISEASE SURVEILLANCE |
US11168356B2 (en) * | 2017-11-02 | 2021-11-09 | The Chinese University Of Hong Kong | Using nucleic acid size range for noninvasive cancer detection |
AU2019244115A1 (en) | 2018-03-30 | 2020-11-19 | Juno Diagnostics, Inc. | Deep learning-based methods, devices, and systems for prenatal testing |
WO2020018522A1 (en) | 2018-07-17 | 2020-01-23 | Natera, Inc. | Methods and systems for calling ploidy states using a neural network |
US20200365234A1 (en) * | 2019-05-13 | 2020-11-19 | Nvidia Corporation | Sequence variation detection using deep learning |
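CN111286529A (zh) * | 2019-07-22 | 2020-06-16 | 常州市妇幼保健院 | Kit for prenatal screening of false positives using fetal cell-free DNA in peripheral blood |
KR20220122596A (ko) * | 2019-12-31 | 2022-09-02 | BGI Clinical Laboratories (Shenzhen) Co., Ltd. | Method and device for determining chromosomal aneuploidy and constructing a classification model |
CN111292802B (zh) * | 2020-02-03 | 2021-03-16 | 至本医疗科技(上海)有限公司 | Method, electronic device and computer storage medium for detecting mutations |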
2020
- 2020-11-27: EP application EP20962929.4A, published as EP4254418A4 (en); status: active, Pending
- 2020-11-27: CA application CA3200221A, published as CA3200221A1 (en); status: active, Pending
- 2020-11-27: WO application PCT/CN2020/132331, published as WO2022110039A1 (zh); status: active, Application Filing
- 2020-11-27: AU application AU2020479407A, published as AU2020479407A1 (en); status: active, Pending
- 2020-11-27: JP application JP2023532353A, published as JP2024505780A (ja); status: active, Pending
- 2020-11-27: CN application CN202080107528.2A, published as CN116648752A (zh); status: active, Pending
- 2020-11-27: US application US18/254,842, published as US20240013859A1 (en); status: active, Pending
- 2020-11-27: KR application KR1020237021684A, published as KR20230110615A (ko); status: active, Search and Examination
Also Published As
Publication number | Publication date |
---|---|
AU2020479407A1 (en) | 2023-06-29 |
EP4254418A4 (en) | 2024-03-27 |
WO2022110039A1 (zh) | 2022-06-02 |
JP2024505780A (ja) | 2024-02-08 |
EP4254418A1 (en) | 2023-10-04 |
KR20230110615A (ko) | 2023-07-24 |
CA3200221A1 (en) | 2022-06-02 |
CN116648752A (zh) | 2023-08-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: BGI SHENZHEN, CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: BAI, YONG; GAO, YA; HUANG, SHUJIA; AND OTHERS; SIGNING DATES FROM 20230517 TO 20230518; REEL/FRAME: 063816/0350 |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |