IL307784A - Noninvasive fetal variant identification using fragmentomics-based classification

Noninvasive fetal variant identification using fragmentomics-based classification

Info

Publication number
IL307784A
Authority
IL
Israel
Prior art keywords
read
algorithms
sequencing
maternal
reads
Application number
IL307784A
Other languages
Hebrew (he)
Original Assignee
Identifai Genetics Ltd
Application filed by Identifai Genetics Ltd
Priority to IL307784A
Priority to PCT/IL2024/051015 (WO2025083690A1)
Publication of IL307784A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Description

NONINVASIVE FETAL VARIANT IDENTIFICATION USING FRAGMENTOMICS-BASED CLASSIFICATION
TECHNOLOGICAL FIELD
The present disclosure relates to the field of prenatal genetic analysis.
REFERENCES
Chiu and Lo (2021) Prenatal Diagnosis 41 (10): 1193–1201.
Ding and Lo (2022) Diagnostics (Basel, Switzerland) 12 (4): 978.
Jiang et al. (2020) Cancer Discovery 10 (5): 664–73.
Li and Durbin (2009) Bioinformatics 25 (14): 1754–60.
Liu (2022) British Journal of Cancer 126 (3): 379–90.
Moufarrej et al. (2023) Annual Review of Biomedical Data Science. https://doi.org/10.1146/annurev-biodatasci-020722-0941
Poplin et al. (2017) bioRxiv. https://doi.org/10.1101/2011
Rabinowitz et al. (2019) Genome Res. 29 (3), pp. 428-4
Sun et al. (2018) PNAS USA 115 (22): E5106–14.
Zhou et al. (2022) PNAS USA 119 (44): e2209852119.
BACKGROUND
Non-invasive prenatal testing (NIPT) is the process of assessing the health of an unborn fetus by determining the risk that the fetus will be born with genetic abnormalities. NIPT relies on the existence of cell-free fetal DNA (cffDNA) fragments as a portion of the cell-free DNA (cfDNA) circulating in maternal plasma. Current NIPT tests can detect large genetic aberrations on a whole-chromosome scale, or very large, specific copy number variations (CNVs). However, other types of genetic variants, such as single nucleotide variants (SNVs), short insertions and deletions, and CNVs of moderate size, can also have dramatic health effects and are not considered in any currently available NIPT product. A major challenge in fetal genotype prediction from cfDNA is identifying the informative DNA sequences (known as reads). Only a small fraction of the sequenced cfDNA reads originate from the fetus. Typically, the fetal fraction is 10-20%, depending on gestational age and other factors (Moufarrej et al. 2023), with the rest of the reads originating from the mother’s genome. Thus, determining which reads represent the fetal genome is crucial for accurately identifying variants present in the fetus. Fragments from fetal and maternal origin have different characteristics (Ding and Lo 2022; Chiu and Lo 2021; Liu 2022). 
These include the fragment lengths, composition of the DNA sequences at the ends of the fragments (known as end motifs), preferential genomic end points of fetal fragments, differences in the distribution of the fragments along the genome and more. The study of these features is known as fragmentomics. While these differences are statistically significant, they are subtle. No single fragmentomic feature known today can separate fetal and maternal reads with high confidence. Therefore, any approach that classifies a DNA fragment using hard thresholds is likely to be error-prone and to misclassify many reads. Using such an approach as a basis for predicting fetal genotypes will likely result in discarding relevant information (when fetal reads are misclassified as maternal), in using irrelevant and misleading information (when maternal reads are misclassified as fetal), or both. Rabinowitz et al. (2019) describe a different approach for genome-wide NIPT, termed noninvasive prenatal variant calling. Using Hoobari, the first noninvasive fetal variant caller, they were able to genotype all fetal positions, including biparental loci and indels (WO2021/0340601).
GENERAL DESCRIPTION
In a first aspect the present invention provides a method for genotyping a fetus, comprising:
a. receiving reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting the fetus;
b. identifying potential genomic sites where the fetus may have a variant;
c. extracting fragmentomic features for each cfDNA read identified as overlapping a potential genomic site where the fetus may have a variant;
d. introducing the extracted fragmentomic features to a machine learning model which determines for each read a probability value that the read originated from the fetus and a probability value that the read originated from the mother; and
e. using said probability value, determining whether the fetal genome comprises the variant;
thereby genotyping said fetus.
In one embodiment, one or both of the gDNA sequencing data and the cfDNA sequencing data are obtained by deep whole genome sequencing (WGS), whole exome sequencing (WES), targeted sequencing, panel sequencing, gene sequencing, long-read genome sequencing, paired-end sequencing, single read sequencing, or amplicon sequencing. In one embodiment, said WGS or WES data is obtained by deep sequencing. In one embodiment, identifying potential genomic sites where the fetus may have a variant is performed using one or more of:
a. analyzing the received maternal gDNA data and optionally also the received paternal gDNA data to identify sequence reads overlapping genomic sites that comprise a variant, e.g., using variant calling;
b. analyzing the received maternal cfDNA data to identify sequence reads overlapping genomic sites that comprise a variant;
c. analyzing the received data to identify haplotypes potentially associated with a variant; or
d. analyzing the received data to identify if the fetal genome comprises variants that have a high prevalence in the general population or in a relevant ethnic group.
In some embodiments, said fragmentomic features comprise one or more of read mapping quality, read base qualities, fragment length, short/long read ratio, end motifs, cleavage patterns around methylation sites, read endpoint preferred end, DNA accessibility/nucleosome positioning inference, distance to nearest nucleosome, transcription factor binding sites, regional fetal fraction, regional sequence composition, read sequence composition, and number of sequence errors in the read. In one embodiment, said machine learning model is a read classifier machine learning model.
In one embodiment, the machine learning model is developed by a method comprising: a. presenting the model with a first training set comprising reads of known maternal origin and a second training set comprising reads of known fetal origin; and b. training the model to identify fragmentomic features that are characteristic of, and discriminate between, reads of maternal and fetal origins. In some embodiments, said machine learning model is selected from a group consisting of clustering, association rule algorithms, feature evaluation algorithms, subset selection algorithms, support vector machines, classification rules, cost-sensitive classifiers, vote algorithms, stacking algorithms, Bayesian networks, decision trees, random forest algorithms, neural networks, convolutional neural networks, instance-based algorithms, linear modeling algorithms, k-nearest neighbors (KNN) analysis, ensemble learning algorithms, boosting algorithms, probabilistic models, graphical models, logistic regression methods (including multinomial logistic regression methods), gradient ascent methods, dimensionality reduction methods, singular value decomposition methods, principal component analysis, and a combination thereof.
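By way of non-limiting illustration only, the training step described above may be sketched as follows. The sketch assumes a logistic regression model fitted by batch gradient ascent (two of the options listed above) and a single feature, fragment length. The function names, the scaling constants and the toy read lengths are hypothetical and are not taken from the disclosure.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_read_classifier(features, labels, learning_rate=0.1, epochs=500):
    """Fit logistic regression by batch gradient ascent on the log-likelihood.

    features: list of equal-length numeric feature vectors, one per read.
    labels:   1 for reads of known fetal origin, 0 for known maternal origin.
    """
    n_reads, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - p  # gradient of the log-likelihood w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w + learning_rate * g / n_reads
                   for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b / n_reads
    return weights, bias

def predict_origin(weights, bias, x):
    """Return (P(fetal), P(maternal)) for one feature vector."""
    p_fetal = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return p_fetal, 1.0 - p_fetal

def scale_length(length_bp):
    """Hypothetical scaling: centre on 166 bp and divide by 20 bp."""
    return [(length_bp - 166) / 20.0]

# Toy, made-up training reads (not measured data).
fetal_lengths = [140, 145, 150, 143]
maternal_lengths = [166, 170, 175, 180]
X = [scale_length(n) for n in fetal_lengths + maternal_lengths]
y = [1] * len(fetal_lengths) + [0] * len(maternal_lengths)

weights, bias = train_read_classifier(X, y)
p_fetal, p_maternal = predict_origin(weights, bias, scale_length(142))
```

The classifier outputs a probability rather than a hard label, which is the property the disclosure relies on when it avoids hard thresholds on any single feature.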
In some embodiments, said determining whether the fetal genome comprises the variant in accordance with the invention comprises applying an algorithm selected from a group consisting of clustering, association rule algorithms, feature evaluation algorithms, subset selection algorithms, support vector machines, classification rules, cost-sensitive classifiers, vote algorithms, stacking algorithms, Bayesian networks, decision trees, random forest algorithms, neural networks, convolutional neural networks, instance-based algorithms, linear modeling algorithms, k-nearest neighbors (KNN) analysis, ensemble learning algorithms, boosting algorithms, probabilistic models, graphical models, logistic regression methods (including multinomial logistic regression methods), gradient ascent methods, dimensionality reduction methods, singular value decomposition methods, principal component analysis, and a combination thereof. The algorithms may be combined, for example, using majority voting. In one embodiment, said determining whether the fetal genome comprises the variant in accordance with the invention comprises applying a Bayesian procedure.
In one embodiment, said Bayesian procedure comprises prior probabilities calculated using sequencing data of at least one of said parents. In one embodiment, said determining whether the fetal genome comprises the variant is performed using the Hoobari algorithm. In another aspect, the present invention provides a computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a data processor, configure the data processor to (1) receive reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting a fetus, and to (2) execute the method of the invention. In another aspect, the present invention provides a system for genotyping a fetus, comprising: an input utility for receiving reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting a fetus; and a data processor configured for analyzing said data for executing the method of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
For better understanding the subject matter that is disclosed herein and to exemplify how it may be carried out in practice, embodiments will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:
Figure 1 is an outline of the training and use of the read classifier according to various exemplary embodiments of the present invention.
Figure 2 is a flowchart diagram of a method suitable for fetal genotyping, according to various exemplary embodiments of the present invention.
Figures 3A-C are outlines of methods for training the read classifier and using it to predict fetal genotypes according to various exemplary embodiments of the present invention.
Fig. 3A: Parental homozygous sites. In cases where both paternal and maternal gDNA is available, the read classifier can be trained on cfDNA reads from sites for which the parents are homozygous for different alleles. Hoobari with read classifier predictions is then used to predict fetal genotypes in other sites from the same family.
Fig. 3B: Direct fetal sequencing. WGS from DNA extracted directly from the fetus using an invasive test (amniocentesis/chorionic villi) is used alongside the DNA from the mother’s blood and cfDNA to build a set of fetal and maternal reads used for training the read classifier. The read classifier is then used to improve Hoobari predictions on other pregnancies for which invasive fetal DNA sequencing is not available.
Fig. 3C: Training with Hoobari. The basic Hoobari algorithm is first run on the cfDNA WGS and is used to identify sites with high confidence fetal genotypes. These are used to train the read classifier and then the improved Hoobari algorithm is used to predict fetal genotypes on more challenging sites.
DETAILED DESCRIPTION OF EMBODIMENTS
The present invention proposes that deep whole genome sequencing of cfDNA extracted from maternal plasma during pregnancy and its analysis using a variant calling approach, combined with fragmentomics, can improve the overall accuracy of genotype predictions (e.g., reduce mistaken and low confidence predictions), and allow the identification of various genetic variants in the fetal genome, including single nucleotide mutations. The present invention thus provides non-invasive fetal variant calling using fragmentomics-based DNA fragment classification models. In accordance with the invention, machine learning (ML) models that integrate complementary signals from multiple fragmentomic and other features were developed to predict more accurately, for each read in a cfDNA sample, whether it originated from the fetal placenta or from the mother’s cells. In accordance with the methods of the invention, the genetic data are obtained, for example, by deep whole genome sequencing (e.g., 300X) of cfDNA in maternal plasma during pregnancy. Maternal blood cells and optionally paternal blood cells are sequenced as well, for example in a WGS approach (30X). To develop the machine learning model, a first step of training was performed (see Figure 1 for illustration), followed by use of the model to analyze unknown fetal genotypes. In the first stage, the "training stage", the machine learning model (also referred to herein as the "read classifier") was trained to predict if a cfDNA read originated from the fetus or the mother. Reads whose maternal or fetal origin can be determined with high confidence are identified, and maternal and fetal read sets are constructed. Fragmentomic features are extracted for each read in the training sets and are used to train the read classifier. In the second stage, the read classifier is used for inferring unknown fetal genotypes in a second set of cfDNA reads. 
These can either be more challenging genomic sites from the sample that was used for training or reads from an unrelated pregnancy. Fragmentomic features are extracted for the new set of reads. The read classifier predicts two probabilities for each read: P(fetal), the probability that the read originated from the fetus, and P(maternal), the probability that the read originated from the mother. Finally, the read origin probabilities are used for predicting the fetal genotypes using the Hoobari algorithm, as described for example in Rabinowitz et al. 2019. An exemplary embodiment of the method is shown in Figure 2. As used herein, the term "fragment" refers to a single DNA molecule found in maternal plasma. During high throughput DNA sequencing, the sequence of each fragment is determined. The outcome of the sequencing process is a list of DNA sequences referred to as "reads", which are typically stored in a file format known as FASTQ. Depending on the sequencing technology, the sequence of a DNA fragment may be represented by a single read or by two reads. While strictly speaking, fragment and read are distinct terms, these terms are used interchangeably herein. Accordingly, in an aspect, the present invention provides a method for genotyping a fetus, comprising: a. receiving reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting the fetus; b. identifying potential genomic sites where the fetus may have a variant; c. extracting fragmentomic features for each cfDNA read identified as overlapping a potential genomic site where the fetus may have a variant; d. introducing the extracted fragmentomic features to a machine learning model which determines for each read a probability value that the read originated from the fetus and a probability value that the read originated from the mother; and e. 
using said probability value, determining whether the fetal genome comprises the variant; thereby genotyping said fetus. The maternal genomic DNA (gDNA) data, maternal cell-free DNA (cfDNA) data, and optionally the paternal gDNA data are obtained by performing one or more of deep whole genome sequencing (WGS), whole exome sequencing (WES), targeted sequencing, panel sequencing, gene sequencing, long-read genome sequencing, paired-end sequencing, single read sequencing, or amplicon sequencing.
Deep sequencing refers to sequencing a genomic region multiple times, sometimes hundreds or even thousands of times. As used herein the term "deep whole genome sequencing" refers to deep sequencing of the entire genome. In the context of the present invention cell-free DNA extracted from maternal blood plasma during pregnancy is subjected to deep whole genome sequencing. The maternal blood plasma samples may be obtained at any stage of the pregnancy, preferably between weeks 7-38 of the pregnancy. The sequencing may be repeated multiple times, for example, but not limited to, between 0.5 times (0.5X) and 1000 times (1000X), e.g., 0.5 times (0.5X), 1 time (1X), 2 times (2X), 10 times (10X), 20 times (20X), 30 times (30X), 50 times (50X), 100 times (100X), 200 times (200X), 300 times (300X), 500 times (500X), or 1000 times (1000X). In one non-limiting example, the cfDNA in maternal plasma is sequenced 300 times (300X). The number of times indicated above means that the number of reads obtained in the sequencing run is such that, on average, each position in the genome is covered by the indicated number of reads. For example, 0.5X means that the number of reads obtained in the sequencing run is such that, on average, each position in the genome is covered by 0.5 reads. In addition, genomic maternal and optionally paternal DNA is also subjected to whole genome sequencing. Such genomic DNA may be obtained from any cell type, for example from blood cells, e.g., leukocytes. In an embodiment, whole genome sequencing of maternal genomic DNA and optionally paternal genomic DNA is performed to a targeted depth of between about 20X and 40X, for example 30X. 
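By way of non-limiting illustration, the relationship between the number of reads and the average depth described above may be written out as simple arithmetic. The sketch assumes reads of uniform length; the function names and the example read length and genome size are hypothetical round numbers chosen for the illustration.

```python
import math

def mean_coverage(num_reads, read_length_bp, genome_size_bp):
    """Average number of reads covering each genomic position."""
    return num_reads * read_length_bp / genome_size_bp

def reads_needed(target_depth, read_length_bp, genome_size_bp):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)

# Illustrative round numbers only: a 3,000,000,000 bp genome and 150 bp reads.
reads_for_300x = reads_needed(300, 150, 3_000_000_000)
reads_for_30x = reads_needed(30, 150, 3_000_000_000)
```

Under these assumed numbers a 300X cfDNA library requires ten times as many reads as a 30X parental gDNA library, which is the practical meaning of the depth figures quoted above.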
Whole genome sequencing may be performed using any method known in the art, for example, the HiSeq X Ten System (Illumina), HiSeq 4000 (Illumina), NovaSeq 6000 (Illumina), NovaSeq X (Illumina), NovaSeq X Plus (Illumina), PacBio Revio (PacBio), PacBio Sequel (PacBio), UG100 (Ultima Genomics), MinION (Oxford Nanopore), GridION (Oxford Nanopore), PromethION (Oxford Nanopore), DNBSEQ-G400 (MGI), G4 (Singular Genomics), or AVITI (Element Biosciences). The sequencing generates "reads" which are sequences of DNA fragments of varying lengths. After sequencing, the reads are aligned to a human reference genome based on sequence similarities. Optionally, additional information (also referred to herein as "metadata") pertaining to one or both of the parents is also received. The received metadata optionally and preferably includes at least one, more preferably more than one, of the following features: mutation carrier status of the parents, ethnicity of the parents, body mass index (BMI), week of pregnancy and medical history of the mother. As used herein the term "variant" refers to any mutation or variation in the genomic sequence, including, but not limited to, genetic chromosomal aberrations, copy number variations (CNVs), structural variants (SVs, for example inversions and translocations), single nucleotide variants (SNVs), insertions and deletions. Identifying potential genomic sites where the fetus may have a variant is performed using one or more of: a. Analyzing the received maternal gDNA data and optionally also the received paternal gDNA data to identify sequence reads overlapping genomic sites that comprise a variant, e.g., using variant calling; b. Analyzing the received maternal cfDNA data to identify sequence reads overlapping genomic sites that comprise a variant; c. Analyzing the received data to identify haplotypes potentially associated with a variant; or d. 
Analyzing the received data to identify if the fetal genome comprises variants that have a high prevalence in the general population or in a relevant ethnic group. The identification of the maternal and paternal variants (i.e., variant sites or mutations) can be performed using a variant calling approach, which is generally based on alignment of the DNA sequencing data and the application of a commercially available variant caller. Sequence alignment techniques that can be used according to some embodiments of the present invention include, without limitation, Burrows Wheeler Aligner (BWA), ABA, ALE, AMAP, anon, BAli-Phy, Base-By-Base, BHAOS/DIALIGN, Bowtie, Bowtie 2, ClustalW, CodonCode Aligner, Comass, DECIPHER, DIALIGN-TX, DIALIGN-T, DNA Alignment, DNA Baser Sequence Assembler, EDNA, FSA, Geneious, Kalign, MAFFT, MARNA, MAVID, MSA, MSAProbs, MULTALIN, Multi-LAGEN, MUSCLE, Opal, Pecan, Phylo, Praline, PicXAA, POA, Probalign, ProbCons, PROMALS3D, PRRN/PRRP, PSAlign, RevTrans, SAGA, SAM, Se-AI, STAR, STAR-Fusion, StatAlign, Stemloc, T-Coffee, UGENE, VectorFriends, GLProbs, Dragmap and minimap2.
Exemplary variant callers suitable for the present embodiments include, without limitation, Genome Analysis Toolkit (GATK) and Freebayes. For example, Freebayes can comprise an alignment based on literal sequences of reads aligned to a particular target, not their precise alignment. GATK can comprise: (i) pre-processing; (ii) variant discovery; and (iii) callset refinement. Pre-processing can comprise starting from raw sequence data, e.g., in FASTQ or uBAM format, and producing analysis-ready BAM files; processing can include alignment to a reference genome as well as data cleanup operations to correct for technical biases and make the data suitable for analysis; variant discovery can comprise starting from analysis-ready BAM files and producing a callset in VCF format; processing can involve identifying sites where one or more individuals display possible genomic variation, and applying filtering methods appropriate to the experimental design; callset refinement can comprise starting and ending with a VCF callset; processing can involve using metadata to assess and improve genotyping accuracy, attach additional information and evaluate the overall quality of the callset. Also contemplated are variant callers such as, but not limited to, Platypus, VarScan, Bowtie analysis, MuTect and/or SAMtools. For example, Bowtie analysis can comprise implementing the Burrows-Wheeler transform for aligning. MuTect can comprise: (i) pre-processing; (ii) statistical analysis; and (iii) post-processing. Pre-processing can comprise an initial alignment of sequencing reads; statistical analysis can comprise using two Bayesian classifiers, one classifier can detect whether a SNP is non-reference at a given site and, for those sites that are found as non-reference, the other classifier can make sure that the normal does not carry the SNP; post-processing can comprise removal of artifacts of sequencing, short read alignments and hybrid capture. 
SAMtools can comprise storing, manipulating, and aligning sequencing reads stored as SAM files. In various exemplary embodiments of the invention the method comprises the determination of the probability, for each read, to be of fetal origin. In an embodiment, the determination of the probability of the read to be of fetal origin comprises extracting fragmentomic features for the cfDNA read, and introducing the extracted fragmentomic features to a machine learning model which determines for each read a probability value that the read originated from the fetus and a probability value that the read originated from the mother. Subsequently, said probability value is used for determining each variant site as being of fetal or maternal origin.
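By way of non-limiting illustration only, one possible way of combining per-read origin probabilities with parental priors in a Bayesian procedure is sketched below. This is not the Hoobari algorithm and is not taken from the disclosure: the mixture likelihood, the fixed sequencing error rate, the genotype encoding (number of ALT alleles) and all function names are assumptions made for the sketch.

```python
import math

# Genotypes are encoded as the number of ALT alleles: 0, 1 or 2.
ALT_FRACTION = {0: 0.0, 1: 0.5, 2: 1.0}

def mendelian_prior(maternal_genotype, paternal_genotype):
    """Prior over fetal genotypes from the parental genotypes."""
    p_m = ALT_FRACTION[maternal_genotype]  # chance the mother transmits ALT
    p_p = ALT_FRACTION[paternal_genotype]  # chance the father transmits ALT
    return {
        0: (1 - p_m) * (1 - p_p),
        1: p_m * (1 - p_p) + (1 - p_m) * p_p,
        2: p_m * p_p,
    }

def allele_prob(is_alt, genotype, error_rate):
    """Probability of observing this allele on a read from this genotype."""
    frac = ALT_FRACTION[genotype]
    p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
    return p_alt if is_alt else 1 - p_alt

def fetal_genotype_posterior(reads, maternal_genotype, prior, error_rate=0.01):
    """reads: list of (is_alt, p_fetal) pairs, p_fetal from the read classifier.

    Each read is modelled as fetal with probability p_fetal and maternal
    otherwise (an assumed mixture likelihood).
    """
    log_post = {}
    for genotype, p in prior.items():
        if p <= 0:
            continue  # genotype impossible under the parental prior
        total = math.log(p)
        for is_alt, p_fetal in reads:
            lik = (p_fetal * allele_prob(is_alt, genotype, error_rate)
                   + (1 - p_fetal) * allele_prob(is_alt, maternal_genotype, error_rate))
            total += math.log(lik)
        log_post[genotype] = total
    peak = max(log_post.values())
    unnorm = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(unnorm.values())
    return {g: unnorm.get(g, 0.0) / norm for g in prior}

# Toy site: mother homozygous REF, father heterozygous, made-up reads.
prior = mendelian_prior(0, 1)
reads = [(True, 0.9)] * 3 + [(False, 0.1)] * 20
posterior = fetal_genotype_posterior(reads, maternal_genotype=0, prior=prior)
```

In this toy example three ALT reads that the classifier considers likely fetal outweigh twenty REF reads that it considers likely maternal, which illustrates why soft read-origin probabilities can carry more information than a hard fetal/maternal split.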
As used herein the term "fragmentomic features" refers to molecular characteristics of DNA reads (for example, but not limited to, those listed below), as well as to genomic, epigenetic and alignment features of the DNA read.
Generating fragmentomic features for cfDNA reads: Each read from the cfDNA sequencing libraries is represented using a set of numeric or textual features. The fragmentomic features include, but are not limited to:
a. Read mapping quality.
b. Read base qualities.
c. Fragment length.
d. Short/long read ratio.
e. End motifs.
f. Cleavage patterns around methylation sites.
g. Read endpoint preferred end.
h. DNA accessibility/nucleosome positioning inference.
i. Distance to nearest nucleosome.
j. Transcription factor binding sites.
k. Regional fetal fraction.
l. Regional sequence composition.
m. Read sequence composition; and
n. Number of sequence errors in the read.
Read mapping quality: the read mapping quality score can be provided by alignment algorithms, e.g., the Burrows-Wheeler Aligner (BWA) (Li and Durbin 2009). A higher mapping quality score means higher confidence that the read was mapped to the genomic region from which the underlying DNA fragment originated.
Base quality score: the base quality score can be provided in the raw output of the sequencing machine, typically in FASTQ file format. A higher base quality means higher confidence that the sequencing machine correctly called the nucleotide. To aggregate the information about the base qualities of all bases in the read, summary statistics (mean, median and standard deviation) of the individual base qualities are calculated.
Fragment length: cfDNA fragments of fetal origin tend to be somewhat shorter than cfDNA fragments of maternal origin (Chiu and Lo 2021). The distance (in base pairs) between the genomic coordinates to which the two ends of each read are mapped is used to quantify the fragment length. 
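By way of non-limiting illustration, the base quality summary and the fragment length features described above may be computed as sketched below. The sketch assumes FASTQ quality strings in the common Phred+33 encoding and 0-based, half-open mapping coordinates; both conventions, and the function names, are assumptions of the sketch rather than requirements of the method.

```python
import statistics

def phred33_qualities(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]

def base_quality_summary(quality_string):
    """Mean, median and (population) standard deviation of base qualities."""
    qualities = phred33_qualities(quality_string)
    return {
        "mean": statistics.mean(qualities),
        "median": statistics.median(qualities),
        "stdev": statistics.pstdev(qualities),
    }

def fragment_length(leftmost_start, rightmost_end):
    """Fragment length in base pairs from the two mapped ends.

    Assumes 0-based, half-open coordinates on the same chromosome.
    """
    if rightmost_end < leftmost_start:
        raise ValueError("end coordinate precedes start coordinate")
    return rightmost_end - leftmost_start

summary = base_quality_summary("IIII")   # 'I' decodes to quality 40
length = fragment_length(1000, 1166)
```

With a real aligner output, the two mapped ends would typically be taken from the alignment records of a read pair; how they are retrieved depends on the toolkit in use and is left out of the sketch.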
Short/long read ratio: As cfDNA fragments from fetal origin tend to be shorter, the fact that a read is in a genomic region which is enriched in shorter reads may indicate that the read is of fetal origin, and vice versa. To build this feature, the reference genome (hg38) is divided into non-overlapping windows. For each window i, the counts of long (L) and short (S) reads are defined as
$$L = \left|\{\, r : r \text{ maps to window } i,\ l_{lower} \le \mathrm{length}(r) \le l_{upper} \,\}\right|$$
$$S = \left|\{\, r : r \text{ maps to window } i,\ s_{lower} \le \mathrm{length}(r) \le s_{upper} \,\}\right|$$
where $l_{lower}$, $l_{upper}$, $s_{lower}$ and $s_{upper}$ are hyperparameters whose values are subject to optimization.
End motifs: cfDNA fragments have short (typically 4 bases long) and characteristic sequence motifs at their ends. The composition of these motifs has been shown to differ between cfDNA fragments from fetal and maternal origins (Jiang et al. 2020).
Preferred ends: the genomic end points of cfDNA fragments are not distributed uniformly along the genome. Instead, they cluster together in specific positions called hotspots. Furthermore, reads of maternal and fetal origins form two distinct sets of such hotspots (Sun et al. 2018). This suggests that if the endpoint of a DNA fragment maps to a hotspot, there is a higher chance that the fragment comes from the origin (maternal or fetal) with which that hotspot is associated. To determine if a specific position in the genome is a hotspot, the number of DNA fragments whose endpoint (either the lowest or highest coordinate) maps to that position is assessed and compared to a background distribution.
Cleavage patterns around DNA methylation sites: Methylation is a chemical modification that occurs on certain genomic sites known as CpG sites. The pattern of methylation sites is distinct between different tissues and is particularly different between DNA from fetal and maternal cells. Methylated CpG sites tend to be cleaved at a significantly higher frequency than unmethylated CpG sites (Zhou et al. 2022). 
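By way of non-limiting illustration, the windowed short and long read counts and the end motif feature described earlier in this passage may be sketched as follows. Assigning a fragment to a window by its start coordinate, combining the two counts as S divided by L, and the length ranges used in the example are assumptions of the sketch; the text above leaves these choices open.

```python
from collections import Counter

def window_short_long_counts(fragments, window_size, short_range, long_range):
    """Count short and long fragments per non-overlapping window.

    fragments:   iterable of (start_coordinate, length_bp) pairs.
    short_range: (s_lower, s_upper), inclusive bounds.
    long_range:  (l_lower, l_upper), inclusive bounds.
    Returns two Counters keyed by window index.
    """
    short_counts, long_counts = Counter(), Counter()
    for start, length in fragments:
        window = start // window_size  # assumed: window of the start coordinate
        if short_range[0] <= length <= short_range[1]:
            short_counts[window] += 1
        if long_range[0] <= length <= long_range[1]:
            long_counts[window] += 1
    return short_counts, long_counts

def short_long_ratio(short_counts, long_counts, window):
    """Assumed definition: S / L, or None when the window has no long reads."""
    if long_counts[window] == 0:
        return None
    return short_counts[window] / long_counts[window]

def end_motif(fragment_sequence, k=4):
    """First k bases at the 5' end of the fragment sequence, upper-cased."""
    return fragment_sequence[:k].upper()

# Toy, made-up fragments: (start, length).
fragments = [(10, 120), (50, 170), (1200, 140), (1300, 400)]
short_counts, long_counts = window_short_long_counts(
    fragments, window_size=1000, short_range=(100, 150), long_range=(151, 220))
```

Each read can then be annotated with the counts, or the ratio, of the window it falls in, so that the regional enrichment in short fragments becomes a per-read feature.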
Several methods can be used to identify methylated CpG sites using fragmentomics analysis: Cleavage proportion: The number of reads whose endpoint falls at a given CpG site, divided by the sequencing depth at the same site.
Cleavage ratio: The number of reads whose endpoint falls at a given CpG site, divided by the number of reads whose endpoint falls within a window of k bases on each side of the CpG site.
DNA accessibility/nucleosome positioning: within the cell, the spatial organization of the DNA is mediated by protein complexes known as nucleosomes. DNA wrapped around nucleosomes is protected from cleavage by DNA-cutting enzymes. Cleavage thus occurs either in nucleosome-depleted regions, or in short inter-nucleosome regions known as linkers. Therefore, the dispersion of cfDNA reads along the genome is expected to mirror nucleosome positioning patterns. Notably, fetal and maternal cells manifest distinct nucleosome positioning and depletion patterns. Thus, if a read is mapped to a region known to be occupied by nucleosomes in placental cells but not in hematopoietic cells (which make up most of the maternal portion of the cfDNA), this may indicate that the read is of fetal origin, and vice versa. This can be inferred either using experimental methods such as DNase-seq, ATAC-seq, ChIP-seq and Hi-C or using computational methods.
Distance to nearest nucleosome: The distance from a fragment's end to the center of the nearest nucleosome is calculated.
Transcription factor (TF) binding sites: Transcription factors are proteins that bind DNA in a sequence-specific manner for the purpose of regulating gene expression. The following approaches can be used to test if a read contains or is near a transcription factor binding site (TFBS): i. Experimental data: data from ChIP-Seq experiments, a genome-wide assay to find genomic regions to which a specific transcription factor is bound, can be used. ii. Transcription factor binding motifs: These are short DNA sequences (30 base pairs or shorter) to which a specific type of TF is known to bind preferentially. 
Since many TFs can bind to many different DNA sequences, probabilistic representations called position weight matrices (PWMs) are used to represent TF binding motifs. A PWM is a table that shows the probability of each nucleotide (A, C, G, T) occurring at each position in the binding motif. The reads' sequences are then scanned and their similarity to the PWM is evaluated.
Regional fetal fraction: The genome is divided into bins of 1 million base pairs. For each such region the fetal fraction (distinct from the global fetal fraction) is calculated using the method described in Rabinowitz et al., 2019. Briefly, sites where the parents are homozygous for different alleles are identified, and the fetal fraction is estimated to be twice the fraction of reads supporting the paternal allele.
Regional sequence composition: The genome is divided into bins of 1 million base pairs. For each such bin the number of occurrences of each of the four nucleotides (A, C, G, T) in the reference genome sequence (GRCh38) is counted.
Read sequence composition: For each read, the number of occurrences of each of the four nucleotides (A, C, G, T) in the read sequence is counted.
Number of sequence errors in the read: For each read, the number of sequence positions in which the sequence of the read is different from the reference genome sequence at the corresponding position is counted.
Building a training set of cfDNA reads from maternal and fetal origin: The first step in the development of a read classifier ML model involves presenting the model with multiple examples of maternal and fetal reads. During this step, known as training, the model learns to identify features that separate these two populations of reads. The input to this stage is a set of reads whose origin (fetal or maternal) can be determined with high confidence. The following are methods suitable for identifying high-confidence reads for building the training set:
1. Parental homozygous sites (for illustration see Figure 3A): In cases where both paternal and maternal gDNA are available, sites where the parents are homozygous for different alleles are identified. For such sites, cfDNA reads supporting the paternal allele are considered fetal. Reads from these sites supporting the maternal allele can originate either from the mother or the fetus, but most of these reads originate from the mother and are therefore considered maternal. This method has been described in Rabinowitz et al., 2019.
2. Fetal sequencing (for illustration see Figure 3B): This method requires the use of WGS data from DNA taken directly from the fetus/newborn (e.g., from amniocentesis/chorionic villi sampling) in the training stage. To identify reads of maternal origin, variant calling is first performed, e.g., using the GATK HaplotypeCaller tool (Poplin et al. 2017), on the reads from the maternal blood cells as well as the reads from the direct fetal sample. These maternal and fetal genotypes are then used to identify genomic sites in which the mother has an allele not shared with the fetus. The origin of each read is then inferred according to the fetal and maternal genotypes (see Table 1). The use of REF and ALT alleles is merely exemplary.
Table 1: Informative combinations of maternal and fetal genotypes and the read origins that could be inferred in each case.
Maternal genotype | Fetal genotype | Origin of reads supporting the REF allele | Origin of reads supporting the ALT allele
0/0 | 0/1 | Shared* | Fetal
0/1 | 0/0 | Shared* | Maternal
0/1 | 1/1 | Maternal | Shared*
1/1 | 0/1 | Fetal | Shared*
Shared* - reads supporting this allele may be either maternal or fetal.
Notably, in a clinical NIPT setting, DNA extracted directly from the fetus will not be available, and therefore under this method, a new NIPT sample is processed using a read classifier model trained on data from previous families/pregnancies, in contrast to methods (1) above and (3) below, which allow the use of models trained on the sample currently being processed.
3. A basic Hoobari run is used to train a read classifier model on high-confidence sites; a refined algorithm is then used to infer fetal genotypes at more challenging sites (for illustration see Figure 3C): In this method, Hoobari is run in two stages. In the first stage, the basic Hoobari algorithm is run as described in Rabinowitz et al., 2019. Next, sites for which Hoobari gave fetal genotype predictions with high confidence are extracted from the callset. Sites in which the mother has a genotype not shared with the fetus are identified and used to infer read origins as in method (2) above. These reads are then used to train a read classifier model specific to the cfDNA sample being processed. This model can now be used to calculate read origin probabilities for reads from the sites for which the fetal genotype could not be determined with high confidence, and then Hoobari can be run with these improved read origin probabilities to predict the fetal genotypes for these sites.
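The labeling rules of methods (1) and (2) above can be illustrated with a minimal Python sketch. The function names, the "0/0"-style genotype strings and the "REF"/"ALT" allele labels are illustrative assumptions, not part of Hoobari or any other tool; the lookup simply encodes Table 1 and the parental homozygous-site rule.

```python
from typing import Optional

# Table 1 as a lookup keyed by (maternal genotype, fetal genotype).
# None stands for the "Shared" cells, where the origin cannot be inferred.
TABLE_1 = {
    ("0/0", "0/1"): {"REF": None, "ALT": "fetal"},
    ("0/1", "0/0"): {"REF": None, "ALT": "maternal"},
    ("0/1", "1/1"): {"REF": "maternal", "ALT": None},
    ("1/1", "0/1"): {"REF": "fetal", "ALT": None},
}


def origin_from_genotypes(maternal_gt: str, fetal_gt: str, allele: str) -> Optional[str]:
    """Method (2): label a read supporting `allele` ("REF" or "ALT") using Table 1."""
    row = TABLE_1.get((maternal_gt, fetal_gt))
    return None if row is None else row[allele]


def origin_from_parents(maternal_gt: str, paternal_gt: str, allele: str) -> Optional[str]:
    """Method (1): parents homozygous for different alleles.

    Reads supporting the paternal allele are labeled fetal; reads supporting
    the maternal allele are treated as maternal, as described in the text.
    """
    paternal_allele = {("0/0", "1/1"): "ALT", ("1/1", "0/0"): "REF"}.get(
        (maternal_gt, paternal_gt)
    )
    if paternal_allele is None:
        return None  # site is not informative under method (1)
    return "fetal" if allele == paternal_allele else "maternal"


# Hypothetical observations: (read id, maternal genotype, fetal genotype, allele).
observations = [
    ("read1", "0/0", "0/1", "ALT"),
    ("read2", "0/0", "0/1", "REF"),
    ("read3", "0/1", "1/1", "REF"),
]
# Keep only reads whose origin can be inferred; shared alleles are dropped.
training_labels = [
    (rid, origin)
    for rid, m, f, a in observations
    if (origin := origin_from_genotypes(m, f, a)) is not None
]
```

Reads falling in a "Shared" cell are left out of the training set in this sketch, since Table 1 does not assign them an origin.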
The methods can be used alone or in combination. For example, in accordance with one embodiment, an initial read classifier is trained using method (2) on a set of samples for which both cfDNA and direct fetal WGS are available, and then, when a new sample becomes available, the model is refined with a training set generated using methods (1) or (3) to optimize the model for the new sample. Once the read origins have been determined, a training set of reads which are labeled as either "maternal" or "fetal" is constructed.
Training ML models: As a next step, a machine learning model that identifies patterns that are characteristic of and distinct to either the maternal cfDNA reads or the fetal cfDNA reads is trained to discriminate between these two sets of reads. The model receives as input the read representations (i.e., the fragmentomic features) described above and the labels ("maternal"/"fetal") for the same reads generated as described above. Once the model is trained, the output is a set of weights, rules, or functions that, when applied to a feature representation of the type described above, estimate the probabilities that the read is of maternal origin or of fetal origin.
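The training step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three features (fragment length, read sequence composition, number of mismatches) are a small subset of the fragmentomic features described above, the scaling constants and toy reads are invented, the read is assumed to be aligned to the reference without indels, and logistic regression fitted by gradient ascent is just one of the many model types the description permits.

```python
import math


def read_features(read_seq, ref_seq, fragment_length):
    """Toy feature vector: scaled fragment length, base fractions, mismatch count."""
    n = len(read_seq)
    composition = [read_seq.count(base) / n for base in "ACGT"]
    mismatches = sum(1 for a, b in zip(read_seq, ref_seq) if a != b)
    # Centering and scaling constants are arbitrary choices for this sketch.
    return [(fragment_length - 150.0) / 25.0] + composition + [float(mismatches)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient ascent on the log-likelihood; y is 1 for fetal, 0 for maternal."""
    w = [0.0] * (len(X[0]) + 1)  # the last entry is the bias term
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])
            err = yi - p
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, x):
    """Return the two read class probabilities for one feature vector."""
    p_fetal = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])
    return {"fetal": p_fetal, "maternal": 1.0 - p_fetal}


# Invented toy data: the "fetal" reads come from shorter fragments.
SEQ = "ACGTACGTAC"
fetal_lengths = [138, 140, 143, 145]
maternal_lengths = [165, 166, 168, 170]
X = [read_features(SEQ, SEQ, n) for n in fetal_lengths + maternal_lengths]
y = [1] * len(fetal_lengths) + [0] * len(maternal_lengths)
weights = train_logistic(X, y)
probs_short = predict_proba(weights, read_features(SEQ, SEQ, 142))
probs_long = predict_proba(weights, read_features(SEQ, SEQ, 168))
```

The two probabilities returned by `predict_proba` play the role of the read class probabilities used in the genotyping step; a real model would be trained on many more reads and features.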
The types of ML models that can be used in accordance with the invention include but are not limited to: clustering, association rule algorithms, feature evaluation algorithms, subset selection algorithms, support vector machines, classification rules, cost-sensitive classifiers, vote algorithms, stacking algorithms, Bayesian networks, decision trees, random forest algorithms, neural networks, convolutional neural networks, instance-based algorithms, linear modeling algorithms, k-nearest neighbors (KNN) analysis, ensemble learning algorithms, boosting algorithms, probabilistic models, graphical models, logistic regression methods (including multinomial logistic regression methods), gradient ascent methods, dimensionality reduction methods, singular value decomposition methods and principal component analysis.
Use read class probabilities to improve fetal genotype prediction: Next, the ML models trained in accordance with the invention as described above are used to predict the origins of cfDNA reads. Such cfDNA reads may be the same reads on which the model was trained or different reads. For each read, the model outputs two numbers, representing the probabilities that the read is of fetal or maternal origin. These read class probabilities are then used to calculate the likelihood of a set of reads with the observed fragmentomic features under each possible fetal genotype. These likelihoods are then used to calculate the posterior probability of each possible fetal genotype given the reads' fragmentomic features and the parental genotypes. In an embodiment, calculating the probabilities comprises applying a Bayesian procedure. Optionally, said Bayesian procedure comprises prior probabilities calculated using sequencing data of at least one of said parents. In an embodiment, this procedure further comprises recalibration of the output of said Bayesian procedure using machine learning.
In a specific embodiment the determination of the probability, for each variant site, to be of fetal origin is performed as described in Rabinowitz et al., 2019 and US 2021/0340601.
The term "about" as used herein indicates values that may deviate up to 1%, more specifically 5%, more specifically 10%, more specifically 15%, and in some cases up to 20% higher or lower than the value referred to, the deviation range including integer values, and, if applicable, non-integer values as well, constituting a continuous range. Disclosed and described, it is to be understood that this invention is not limited to the specific examples, methods’ steps, and compositions disclosed herein as such methods’ steps and compositions may vary somewhat. It is also to be understood that the terminology applied herein is used for the purpose of describing specific embodiments only and not intended to be limiting since the scope of the present invention will be limited only by the appended claims and equivalents thereof. It must be noted that, as used in this specification and the appended claims, the singular forms "a", "an" and "the" include plural referents unless the content clearly dictates otherwise. Throughout this specification and the Examples and claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
EXAMPLES
Materials and Methods
Sample collection and DNA extraction
Samples from each family were collected during weeks 7-38 of the pregnancy with informed consent. DNA from chorionic villus sampling (CVS) was extracted using the DNA Tissue protocol for the MagNA Pure Compact Nucleic Acid Isolation Kit I - Large Volume (Roche Life Science). Peripheral maternal blood was collected using 2-Ethylene-diamine-tetra-acetic acid (EDTA) tubes. Plasma was separated from blood by centrifugation at 4°C for 10 minutes at 1600 x g. The plasma was then centrifuged again at 16,000 x g for 10 minutes at room temperature to remove any residual cells. Extraction of cfDNA was performed using the QIAamp Circulating Nucleic Acid Kit (Qiagen). Removal of excess salts resulting from cfDNA purification was conducted using Agencourt AMPure XP beads (Beckman Coulter, Inc.) at a 2X ratio to cfDNA volume. Pure maternal DNA was extracted from leukocytes in the maternal buffy coats, using a protocol that includes (i) buffy coat separation and (ii) DNA purification using the Gentra Puregene Blood Kit (Qiagen) according to the manufacturer's instructions. Pure paternal DNA was collected and purified similarly.
Library preparation and sequencing
Library preparation for samples that underwent WGS was performed using the TruSeq DNA PCR-Free Library Prep Kit (Illumina) according to the manufacturer's instructions. This was followed by sequencing using the HiSeq X Ten System (Illumina) with 151-bp paired-end reads. Cell-free DNA samples were not fragmented during library preparation and were sequenced to a requested coverage of 300x, using HiSeq 4000 (Illumina) with 151-bp paired-end reads.
Alignment to the genome
Reads were aligned to the Genome Reference Consortium Human Build (GRCh38/hg38) using Burrows-Wheeler v0.7.834 with default parameters. Duplicate reads, resulting from PCR clonality or optical duplicates, and reads mapping to multiple locations were excluded from downstream analysis.
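The exclusion of duplicate and multi-mapping reads can be sketched as a filter over SAM records. The text does not say how these reads were identified, so the criteria here are assumptions: duplicates are taken to be already marked with the SAM duplicate flag, and multi-mapping reads are approximated by secondary or supplementary alignments and by a mapping quality below a hypothetical threshold `min_mapq`.

```python
# Bit values defined by the SAM specification for the FLAG field.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_alignment(sam_line, min_mapq=1):
    """Return True if a SAM alignment line passes the (assumed) filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & (UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY):
        return False
    # Assumption: a mapping quality of 0 signals an ambiguous placement.
    return mapq >= min_mapq


def filter_sam(lines):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line):
            yield line
```

In practice this filtering is usually done with dedicated tools; the sketch only makes the criteria explicit.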
Variant calling of pure genomic sequencing data
Single-nucleotide substitutions and small insertions and deletions were identified using the GATK HaplotypeCaller software v4.2.4.0 applying default parameters. HaplotypeCaller was first run on the aligned sequencing data of both parents together, then on the aligned data of the CVS sample using the variant sites that were identified in the parental genomes. Reported variants were not filtered, so that all reported SNPs and indels were kept for downstream analysis.
Pre-processing of cell-free DNA data
HaplotypeCaller was run on the cfDNA sample. The program was set to consider all sites for which there was evidence for a possible variant in the cfDNA sample and also to consider (force-call) variant sites that were identified in the paternal genomes, regardless of evidence from the cfDNA. Using Hoobari, the allele that was observed by each read, together with the read insert-size, was saved in a separate database.
Noninvasive fetal variant calling
Hoobari was run using the parental variants and the cfDNA pre-processing results database as input. The output was a standard variant call format (VCF) file. The results were analyzed using several software tools dedicated to VCF manipulation, such as vcflib and vcftools.
Bayesian noninvasive genotyping
At each site of interest, a Bayesian calculation was applied. For each possible fetal genotype G:
$$P(G \mid \mathrm{data}) = \frac{P(\mathrm{data} \mid G)\,P(G)}{\sum_{i=1}^{n} P(\mathrm{data} \mid G_i)\,P(G_i)}$$
where data is the set of reads covering the site (see below) and G_i is the ith possible fetal genotype out of n possibilities. For bi-allelic variants, it would be either homozygous for the reference allele (AA), heterozygous (Aa), or homozygous for the alternate allele (aa).
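The Bayesian calculation can be sketched as follows, assuming bi-allelic sites with genotypes written as "AA", "Aa" and "aa". The function names are hypothetical, and the prior is taken to be a plain Mendelian prior derived from the two parental genotypes; the likelihood values passed in are placeholders.

```python
from collections import Counter
from itertools import product


def mendelian_prior(maternal_gt, paternal_gt):
    """P(G) for each fetal genotype, given parental genotypes such as "Aa"."""
    counts = Counter()
    for m_allele, p_allele in product(maternal_gt, paternal_gt):
        # Sorting puts "A" before "a", so the heterozygote is always "Aa".
        counts["".join(sorted((m_allele, p_allele)))] += 1
    total = sum(counts.values())
    return {genotype: c / total for genotype, c in counts.items()}


def posterior(likelihoods, priors):
    """P(G|data) = P(data|G) P(G) / sum_i P(data|G_i) P(G_i)."""
    joint = {g: likelihoods.get(g, 0.0) * p for g, p in priors.items()}
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("the data have zero likelihood under every genotype")
    return {g: v / evidence for g, v in joint.items()}


# Two heterozygous parents and placeholder likelihoods P(data|G).
priors = mendelian_prior("Aa", "Aa")
post = posterior({"AA": 0.2, "Aa": 0.1, "aa": 0.0}, priors)
```

Genotypes that are impossible under Mendelian inheritance never appear in the prior, so they receive no posterior mass.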
As used herein, the term "heterozygous" refers to the presence of different versions (alleles) of a genomic locus. The term "homozygous" refers to the presence of the same versions (alleles) of the genomic locus. P(G) is the prior probability for each genotype and was calculated by Mendelian laws. The data variable denotes the reads that cover the site, represented using their fragmentomic features, and P(data|G) denotes the likelihood function, which is defined in this Example as a product of the likelihood of each read:
$$P(\mathrm{data} \mid G_i) = \prod_{j} P(r_j \mid G_i)$$
The likelihood of a read r_j depends on the fetal genotype and is calculated using the maternal genotype and the fetal fraction:
$$P(r_j \mid G_i) = P(r_j \mid \mathrm{fet})\,P(\mathrm{fet}) + P(r_j \mid \mathrm{mat})\,P(\mathrm{mat})$$
P(r_j|fet) and P(r_j|mat) are the probabilities of a read-observation that supports a certain allele, given that the read is fetal or maternal, respectively. This depends on the tested fetal genotype G_i, the maternal genotype G_M and the observed allele. P(fet) and P(mat) are the probabilities that the read is fetal or maternal based on its fragmentomic features and regardless of the allele that it supports. These probabilities are calculated using the read classifier.
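A minimal sketch of the per-read likelihood, reading the four terms defined above as a mixture over read origin. Two simplifications are assumptions of this sketch rather than statements from the text: sequencing error is ignored, so P(r_j|fet) and P(r_j|mat) reduce to the fraction of matching alleles in the fetal and maternal genotypes, and P(mat) is taken as 1 - P(fet). All names are hypothetical.

```python
import math


def allele_prob(genotype, allele):
    """Probability of drawing `allele` from a diploid genotype such as "Aa"."""
    return genotype.count(allele) / len(genotype)


def read_likelihood(allele, fetal_gt, maternal_gt, p_fet):
    """P(r|G) = P(r|fet) P(fet) + P(r|mat) P(mat), with P(mat) = 1 - P(fet)."""
    p_mat = 1.0 - p_fet
    return allele_prob(fetal_gt, allele) * p_fet + allele_prob(maternal_gt, allele) * p_mat


def log_likelihood(reads, fetal_gt, maternal_gt):
    """Log of the product over reads; each read is (observed allele, P(fet))."""
    total = 0.0
    for allele, p_fet in reads:
        p = read_likelihood(allele, fetal_gt, maternal_gt, p_fet)
        if p == 0.0:
            return float("-inf")  # the read is impossible under this genotype
        total += math.log(p)
    return total


# Homozygous-reference mother; P(fet) would come from the read classifier.
reads = [("a", 0.9), ("A", 0.1)]
scores = {g: log_likelihood(reads, g, "AA") for g in ("AA", "Aa", "aa")}
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow at the read depths used here. Exponentiating the scores and combining them with the Mendelian priors gives the posterior over fetal genotypes.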

Claims (14)

CLAIMS:
1. A method for genotyping a fetus, comprising: a. receiving reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting the fetus; b. identifying potential genomic sites where the fetus may have a variant; c. extracting fragmentomic features for each cfDNA read identified as overlapping a potential genomic site where the fetus may have a variant; d. introducing the extracted fragmentomic features to a machine learning model which determines for each read a probability value that the read originated from the fetus and a probability value that the read originated from the mother; and e. using said probability value, determining whether the fetal genome comprises the variant; thereby genotyping said fetus.
2. The method of claim 1, wherein one or both gDNA sequencing data and the cfDNA sequencing data is obtained by deep whole genome sequencing (WGS), whole exome sequencing (WES), targeted sequencing, panel sequencing, gene sequencing, long-read genome sequencing, paired-end sequencing, single read sequencing, or amplicon sequencing.
3. The method of claim 2 wherein said WGS or WES data is obtained by deep sequencing.
4. The method of any one of the preceding claims wherein identifying potential genomic sites where the fetus may have a variant is performed using one or more of: a. Analyzing the received maternal gDNA data and optionally also the received paternal gDNA data to identify sequence reads overlapping genomic sites that comprise a variant, e.g., using variant calling; b. Analyzing the received maternal cfDNA data to identify sequence reads overlapping genomic sites that comprise a variant; c. Analyzing the received data to identify haplotypes potentially associated with a variant; or d. Analyzing the received data to identify if the fetal genome comprises variants that have a high prevalence in the general population or in a relevant ethnic group.
5. The method of any one of the preceding claims, wherein said fragmentomic features comprise one or more of read quality mapping, read base qualities, fragment length, short/long read ratio, end motifs, cleavage patterns around methylation sites, read endpoint preferred end, DNA accessibility/nucleosome positioning, distance to nearest nucleosome, transcription factor binding sites, regional fetal fraction, regional sequence composition, read sequence composition, and number of sequence errors in the read.
6. The method of any one of the preceding claims wherein said machine learning model is a read classifier machine learning model.
7. The method of any one of the preceding claims wherein the machine learning model is developed by a method comprising: a. presenting the model with a first training set comprising reads of known maternal origin and a second training set comprising reads of known fetal origin; b. training the model to identify fragmentomic features that are characteristic to and discriminate between reads of maternal and fetal origins.
8. The method of any one of the preceding claims wherein said machine learning model is selected from a group consisting of clustering, association rule algorithms, feature evaluation algorithms, subset selection algorithms, support vector machines, classification rules, cost-sensitive classifiers, vote algorithms, stacking algorithms, Bayesian networks, decision trees, random forest algorithms, neural networks, convolutional neural networks, instance-based algorithms, linear modeling algorithms, k-nearest neighbors (KNN) analysis, ensemble learning algorithms, boosting algorithms, probabilistic models, graphical models, logistic regression methods (including multinomial logistic regression methods), gradient ascent methods, dimensionality reduction methods, singular value decomposition methods, principal component analysis, and a combination thereof.
9. The method of any one of the preceding claims wherein said determining whether the fetal genome comprises the variant of step e in claim 1 comprises applying an algorithm selected from a group consisting of clustering, association rule algorithms, feature evaluation algorithms, subset selection algorithms, support vector machines, classification rules, cost-sensitive classifiers, vote algorithms, stacking algorithms, Bayesian networks, decision trees, random forest algorithms, neural networks, convolutional neural networks, instance-based algorithms, linear modeling algorithms, k-nearest neighbors (KNN) analysis, ensemble learning algorithms, boosting algorithms, probabilistic models, graphical models, logistic regression methods (including multinomial logistic regression methods), gradient ascent methods, dimensionality reduction methods, singular value decomposition methods, principal component analysis, and a combination thereof.
10. The method of any one of the preceding claims wherein said determining whether the fetal genome comprises the variant of step e in claim 1 comprises applying a Bayesian procedure.
11. The method of claim 10 wherein said Bayesian procedure comprises prior probabilities calculated using sequencing data of at least one of said parents.
12. The method of any one of the preceding claims wherein said determining whether the fetal genome comprises the variant is performed using the Hoobari algorithm.
13. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a data processor, configure the data processor to (1) receive reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting a fetus, and to (2) execute the method according to any one of claims 1-12.
14. A system for genotyping a fetus, comprising: an input utility for receiving reads of sequencing data of (i) maternal cell-free DNA (cfDNA), and (ii) maternal genomic DNA (gDNA) and optionally paternal genomic DNA (gDNA) from a pair parenting a fetus; and a data processor configured for analyzing said data for executing the method according to any one of claims 1-12.
IL307784A 2023-10-16 2023-10-16 Noninvasive fetal variant identification using fragmentomics-based classification IL307784A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
IL307784A IL307784A (en) 2023-10-16 2023-10-16 Noninvasive fetal variant identification using fragmentomics-based classification
PCT/IL2024/051015 WO2025083690A1 (en) 2023-10-16 2024-10-15 Noninvasive fetal variant identification using fragmentomics-based classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
IL307784A IL307784A (en) 2023-10-16 2023-10-16 Noninvasive fetal variant identification using fragmentomics-based classification

Publications (1)

Publication Number Publication Date
IL307784A true IL307784A (en) 2025-05-01

Family

ID=95448577

Family Applications (1)

Application Number Title Priority Date Filing Date
IL307784A IL307784A (en) 2023-10-16 2023-10-16 Noninvasive fetal variant identification using fragmentomics-based classification

Country Status (2)

Country Link
IL (1) IL307784A (en)
WO (1) WO2025083690A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
PT3967775T (en) * 2015-07-23 2023-10-10 Univ Hong Kong Chinese Analysis of fragmentation patterns of cell-free dna
US20210340601A1 (en) * 2018-09-03 2021-11-04 Ramot At Tel-Aviv University Ltd. Method and system for identifying gene disorder in maternal blood

Also Published As

Publication number Publication date
WO2025083690A1 (en) 2025-04-24
