WO2023164728A2 - Systems and methods for alignment-free variant calling - Google Patents

Systems and methods for alignment-free variant calling Download PDF

Info

Publication number
WO2023164728A2
WO2023164728A2 PCT/US2023/063411 US2023063411W WO2023164728A2 WO 2023164728 A2 WO2023164728 A2 WO 2023164728A2 US 2023063411 W US2023063411 W US 2023063411W WO 2023164728 A2 WO2023164728 A2 WO 2023164728A2
Authority
WO
WIPO (PCT)
Prior art keywords
mer
sequences
genetic
variant
neighbor
Prior art date
Application number
PCT/US2023/063411
Other languages
French (fr)
Other versions
WO2023164728A3 (en
Inventor
Foad NAZARI
Sneh PATEL
Giana J. SCHENA
Emma K. MURRAY
Alina SANSEVICH
Original Assignee
Rajant Health Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rajant Health Incorporated filed Critical Rajant Health Incorporated
Publication of WO2023164728A2 publication Critical patent/WO2023164728A2/en
Publication of WO2023164728A3 publication Critical patent/WO2023164728A3/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10Ploidy or copy number detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search

Definitions

  • the present invention is directed to the field of next-generation sequencing, and more specifically, to its use in identifying genetic variants.
  • Detection of human genome variants is usually done using alignment-based approaches, which are based on mapping sequenced reads to the reference genome. Those approaches generally deliver highly accurate results when the sequences are closely related and can be aligned reliably. However, when the sequences are divergent, a reliable alignment cannot be performed. The alignment-based processes are also computationally complex and timeconsuming, and so they are limited in application on large-scale sequence data.
  • a server receives a dataset comprised of a genetic sequence of control group RNA samples and experimental RNA samples and performs a count of unique k-mer sequences from the dataset. Then, the server sorts and filters the plurality of k-mer sequences based on density and applies a neighbor detection function to the plurality of k-mer sequences to identify one or more neighbor k-mer sequences to form one or more k-mer pair sequences.
  • the neighbor detection function performs a k-to-3 dimensionality reduction transformation on the k-mers to reduce the computation cost.
  • the server fdters the one or more k-mer pair sequences based on a predetermined edit distance and merges the one or more fdtered k-mer pair sequences into a plurality of genetic variant candidates.
  • the server then localizes the variant candidates in the reference genome to validate their existence and type, and also to check if there is any annotation associated with them in that specific location of the reference genome.
  • the server subsequently compares the plurality of genetic variant candidates against a pre-populated variant database to specify if each detected one or more sequence genetic variants is novel or has been already annotated in the literature for the targeted disease and outputs the one or more identified sequence genetic variants through a graphic user interface.
  • the genetic variants identified are one or more of single nucleotide polymorphism (SNP), multiple nucleotide polymorphism (MNP), and insertion/deletion (INDEL).
  • SNP single nucleotide polymorphism
  • MNP multiple nucleotide polymorphism
  • INDEL insertion/deletion
  • the dataset is comprised of FASTQ/A data for healthy individuals and unhealthy individuals.
  • trimming of the genomic sequences is performed to remove unwanted or low-quality regions.
  • sorting of the plurality of k-mer sequences is performed in descending order.
  • sequence k-mers are filtered based on the ratio of k-mer density in one group vs the other.
  • the server applies a T-test filter that performs an unequal variance T-test on the plurality of k-mer sequences.
  • the k-mers are filtered based on the level that their density difference in control and experiment groups is compensated by its neighbor k-mers.
  • the k-mer pairs which have overlap are merged together to make longer sequence pairs.
  • the output is in variant call format (VCF).
  • VCF variant call format
  • FIG. l is a diagram of an exemplary embodiment of the hardware of the system of the present invention.
  • FIG. 2A is a flowchart showing the software processes of an exemplary embodiment of the present invention.
  • FIG. 2B is a flowchart showing the software processes of an exemplary embodiment of the present invention.
  • FIG. 3 is a graph showing a 3D representation of k-mers nucleotide counts.
  • FIG. 4 shows charts that demonstrate the geometry MD values (up to 6) with different colors in a 9*9 MD matrix.
  • FIG. 1 is an exemplary embodiment of the health information system of the present invention.
  • one or more peripheral devices 110 are connected to one or more computers 120 through a network 130.
  • peripheral devices/locations 110 include smartphones, tablets, wearables devices, and any other electronic devices that collect and transmit data over a network that are known in the art.
  • the network 130 may be a wide-area network, like the Internet, or a local area network, like an intranet. Because of the network 130, the physical location of the peripheral devices 110 and the computers 120 has no effect on the functionality of the hardware and software of the invention. Both implementations are described herein, and unless specified, it is contemplated that the peripheral devices 110 and the computers 120 may be in the same or in different physical locations.
  • Communication between the hardware of the system may be accomplished in numerous known ways, for example using network connectivity components such as a modem or Ethernet adapter.
  • the peripheral devices/locations 110 and the computers 120 will both include or be attached to communication equipment. Communications are contemplated as occurring through industry-standard protocols such as HTTP or HTTPS.
  • Each computer 120 is comprised of a central processing unit 122, a storage medium 124, a user-input device 126, and a display 128.
  • Examples of computers that may be used are: commercially available personal computers, open source computing devices (e.g. Raspberry Pi), commercially available servers, and commercially available portable device (e g. smartphones, smartwatches, tablets).
  • each of the peripheral devices 110 and each of the computers 120 of the system may have software related to the system installed on it.
  • system data may be stored locally on the networked computers 120 or alternately, on one or more remote servers 140 that are accessible to any of the peripheral devices 110 or the networked computers 120 through a network 130.
  • the software runs as an application on the peripheral devices 110.
  • FIGs. 2A and 2B show a flow diagram of an alignment-free variant calling algorithm that may be used in accordance with the present invention to identify genomic variants using FASTQ/A data.
  • Accurate variant calling in next-generation sequencing (NGS) data is a major step upon which virtually all downstream analysis and interpretation processes depend.
  • the dataset that is used includes FASTQ/A data for individuals with healthy (control) and unhealthy (experimental) conditions, more specifically for one specific abnormality.
  • the Control Group Raw Sequence Data 202 and the Experimental Group Raw Sequence Data 204 are transmitted to the Trimming module 206.
  • the raw sequence data 202, 204 is trimmed, resulting in trimmed control group and experimental group raw sequence data.
  • Sequence trimming is the process of removing unwanted or low-quality regions from a nucleotide or protein sequence. It can be done based on various criteria, such as quality scores, length, or the presence of contaminants or adapters. The goal of trimming is to improve the accuracy of downstream analysis, such as alignments and functional predictions. Trimming of adapter sequences from FASTQ/A reads is a common preprocessing step during NGS data analysis. Adapter removal is necessary to remove the adapter sequences from the 3' end of the reads because those artificially added sequences (which are necessary to attach the DNA fragments to the flow cell and also barcoding) can interfere with the alignment of the reads to the genome.
  • the trimmed control group and experimental group raw sequence data are sent to the Count Unique k-mers module 208, at which the number of appearances of each unique k-mer in each sample is counted and recorded as its frequency.
  • the output of this block for each one of control and experiment groups would be a ⁇ k-mer, [k-mer_count] ⁇ dictionary in which the key is the k-mer sequence, and the value is the [k-mer_count] which is frequency of the k-mer at each sample.
  • the ⁇ k-mer, [k-mer_count] ⁇ dictionary of the control group and the experimental group are then transmitted to the Density Calculation module 210.
  • the density of k-mers is determined (for example, by determining the frequency of k-mers divided by total number of k- mers in that sample). Then, the mean of densities of each k-mer are calculated, for experimental and control groups, separately.
  • the k-mer densities and the mean of their densities are transmitted to the Density Filter module 212.
  • the k-mers in each group are sorted in descending order, based on their mean density values.
  • the k-mers having a mean density beyond a predetermined density threshold pass this fdter 212.
  • the k-mers from the control group that pass the Density Filter module 212 are transmitted to the Density Ratio Filter 214, which finds the corresponding mean density for each k-mer in the experiment group, calculates density _ratio control (kmer'), and in case the ratio is higher than a specific predetermined threshold, passes the k-mer through the filter 214.
  • density _ratio control kmer'
  • the Density T-Test module 216 is a filter which performs an unequal variance T-Test (Welch’s T-test) on the k-mers that have passed the previous filter, using their densities and the density of their corresponding k-mers in the other group as well as the density ratios calculated by the Density Ratio Filter 214. That is an independent T-Test which is used when the number of samples in each group is different, and/or the variance of the two data sets is different. Preferably, the level of significance is assumed to be 5%. An exemplary calculation is shown below.
  • Equation 3 are the number of records (samples) in control and experiment groups, respectively. Also, ; in which var is variance and a is standard deviation, where the outer bracket in Equation 3 is the floor function.
  • a T- distribution critical value table gives the corresponding critical T-value. Now, if the calculated T-value for any k-mer is greater than or equal to the critical value, it passes the filter of the Density T-Test module 216. The Density T-Test module 216 is applied to the control and experiment group k-mers, separately.
  • the Neighbor Detector module 218 For each filtered k-mer, the Neighbor Detector module 218 is used to find its neighbor k-mers (i.e., their hamming distance is less than a specific threshold) for further filtering downstream. Performing this process in k-dimensions is computationally expensive, because it needs to calculate the hamming distance of each filtered k-mer with all other k-mers.
  • the Neighbor Detector module 218 transforms a 4D analysis into 3D, in which the neighbor sequences for a desired distance to the original k-mer form a predetermined (complex) geometry and are therefore easy to find.
  • nt count vector [nA, nc, no, n?]. since the number of nucleotides in each k-mer is k, we have:
  • a 3D_nt_count vector which is [nA, nc, no] can be used instead of nt count since it has the same information for k-mers. With that, the dimension is reduced to 3.
  • the 3D matrix of 3D_nt_count coordinates is called 3D_static.
  • the way the Neighbor Detector module 218 works is that all experiment group k-mers locations in the 3D static matrix are specified.
  • We found the geometry of the isodistance cells with every Manhattan distance of MD md. For any desired md distance, we just need to go to isodistance cells that pick the experimental k-mers. They are experiment groups nt count neighbors of the original control group k-mer with MD distance.
  • FIG. 4 shows the geometry MD values (up to 6) with different colors in a 9*9 MD matrix.
  • the k-mer variant candidates identified by the Neighbor Detector module 218 are transmitted to the k-mer Density Compensation Filter module 220, which further filters them.
  • SNPs single nucleotide polymorphisms
  • Insertion For single insertion, the process is similar to SNP, the only difference is the insertion neighbor has a nucleotide inserted in the insertion point of the original sequence and so has one nucleotide less from one end.
  • Table 3 below shows an exemplary insertion and how it affects the dataset.
  • MNP nucleotide polymorphism
  • the pair (original & neighbor) k-mer sequences passes the Density Compensation Filter module 220.
  • the output original-neighbor k-mer pairs of the Density Compensation Filter module 220 are transmitted to the Max Edit Distance Filter 222, where the Needleman_Wunch approach is used to filter them further.
  • the original-neighbor k-mer pairs whose edit distance is less than a specific threshold pass this filter.
  • the edit distance between two sequences is a measure of the minimum number of operations (such as insertions, deletions, or substitutions) required to transform one sequence into the other.
  • the Needleman-Wunsch algorithm is a dynamic programming approach which is being used in bioinformatics to align two sequences.
  • the algorithm creates a similarity matrix between the two sequences, considering gaps and mismatches, and then applies this matrix to find the optimal global alignment with the highest similarity score.
  • the output of the Needleman-Wunsch algorithm is a pairwise sequence alignment with the highest possible similarity score, which reflects the evolutionary relationship between the two sequences.
  • the Density Compensation Filter module 220 may be applied after the Max Edit Distance Filter module 222, because the Max Edit Distance Filter module 222 is where original -neighbor k-mer pair potentially includes a variant is determined (i.e., the aforementioned SNP, MNP, Indel examples). However, since the Max Edit Distance Filter module 222 is computationally much more expensive than the Density Compensation Filter module 220, the latter module is typically applied first to reduce the load of the former (the number of pairs that go through the Density Compensation Filter module 220). Regardless of which filter is applied first, however, the same result is achieved.
  • the k-mer pair sequences that result from the Max Edit Distance Filter 222 are transmitted to the Merging module 224.
  • Many of the k-mer-pair sequences that have been identified and sent to the Merging module 224 may have overlap.
  • three k-mer pairs that are presented in Table 4 below can be merged together to create a bigger sequence, so that instead of sending three pairs of sequences to the downstream modules of the software, just one bigger sequence will be sent which will result in less computation. It will be computationally expensive to check each two pairs together if they are mergeable. So, there is a need to reduce the number of candidates for merging to each pair and then evaluate them, which is performed at the Merging module 224.
  • All the pairs are sent from the Merging module 224 to the Neighbor Pair Detector 226 and then to the Mergeable Pair Detector module 228 to find the merable pairs for each k-mer- pair. Those pairs which have a right side mergeable pair (as explained with regard to the Mergeable Pair Detector module 228) are merged, and those which do not, will remain not- merged.
  • the Neighbor Pair Detector 226 does a similar job to the Neighbor Detector module 218, the only difference being that for a k-mer pair to be a neighbor pair of a given pair, its experiment and control nt-vectors should be at the neighborhood of the nt-vectors of the experiment and control sequences of the given pair, respectively. So, to find which two pairs are mergeable together, we first create the nt-vector for each k-mer of each candidate pair.
  • any other pair that the Manhattan distance between the nt- vector of their control k-mers and their experiment k-mers are less than a specific threshold (here, threshold 2), (i.e., both the Manhattan distance between the control k-mers and the Manhattan distance between the experiment k-mers were smaller than a threshold), is considered as the nt-count neighbor-pair of the original pair.
  • threshold 2
  • Table 5 illustrates that assuming the Manhattan distance threshold of 2, (pairl, pair2), (pair2 , pairl), (pair2, pair3) and (pair3, pair2) are original-neighbor pairs.
  • both merged and not -merged pairs are transmitted from the Merging module 224 to the Variant Identifier module 230.
  • the Needleman- Wunch matrix of the experiment vs control sequence of each candidate pair is created. Based on that matrix, the location and type of variation is specified.
  • the result of the Variant Identifier module 230 is then validated at the Localization module 232, which validates and confirms the detected variants, exemplarily by localizing the variants in the Reference Genome module 234, using existing tools. Then, at the Annotation Check module 236, the variants are checked to confirm if there is any annotation associated with the detected variant in that specific location of the reference genome. This analysis is again performed by comparison to the data found in the Reference Genome module 236. The detected variants and their annotation are then transferred to the Check Novelty module 238.
  • the Check Novelty module 238 functions to check if the detected variants are already annotated for the targeted disease in its databases. A list of variants which are already identified and found in the system databases to have a causation or correlation relationship with the abnormality being analyzed are collected at the Variant DB module 240, in advance. The Check Novelty module 238 compares the detected merged variant candidates against that variant list of Variant DB 240 to see which variant is already known to be associated with the targeted disease or abnormality and which is novel and does not appear in the database. Based on that determination, each detected variant can be designated as either “new” or “existing” in candidate variants.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed is a methodology to find genetic sequence variations. In certain embodiments, a server receives a dataset comprised of a genetic sequence of control group RNA samples and experimental RNA samples and performs a count of unique k-mer sequences based on density values. Then, the server sorts the plurality of k-mer sequences based on their density values and applies a neighbor detection function to the plurality of k-mer sequences to identify one or more neighbor k-mer sequences to form one or more k-mer pair sequences. Then, the server filters the one or more k-mer pair sequences and merges the one or more filtered k-mer pair sequences into genetic variant candidates. The server then localizes the variant candidates in the reference genome to validate their existence and type and compares the plurality of genetic variant candidates against a variant database. The server then outputs the one or more identified sequence genetic variants.

Description

SYSTEMS AND METHODS FOR ALTGNMENT-FREE VARIANT CALLING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Prov. App. No. 63/314,734, filed February 28, 2022, and U.S. Prov. App. No. 63/431,957, filed December 12, 2022, the entire contents of both of which are incorporated herein by reference.
FIELD OF THE INVENTION
[0002] The present invention is directed to the field of next-generation sequencing, and more specifically, to its use in identifying genetic variants.
BACKGROUND OF THE INVENTION
[0003] Detection of human genome variants is usually done using alignment-based approaches, which are based on mapping sequenced reads to the reference genome. Those approaches generally deliver highly accurate results when the sequences are closely related and can be aligned reliably. However, when the sequences are divergent, a reliable alignment cannot be performed. The alignment-based processes are also computationally complex and timeconsuming, and so they are limited in application on large-scale sequence data.
[0004] With the recent advancements in Next Generation Sequencing (NGS), the amount of genomic data is being increased tremendously. That increase in the size of sequence data has caused challenges for alignment-based variant calling tools. Accordingly, there is a need for reliable, efficient variant calling methods to overcome the limitations of alignment-based approaches.
SUMMARY OF THE INVENTION
[0005] The present invention comprises techniques for next-generation sequencing and its use in genetic analysis. In certain embodiments, a server receives a dataset comprised of a genetic sequence of control group RNA samples and experimental RNA samples and performs a count of unique k-mer sequences from the dataset. Then, the server sorts and filters the plurality of k-mer sequences based on density and applies a neighbor detection function to the plurality of k-mer sequences to identify one or more neighbor k-mer sequences to form one or more k-mer pair sequences. The neighbor detection function performs a k-to-3 dimensionality reduction transformation on the k-mers to reduce the computation cost. Then, the server fdters the one or more k-mer pair sequences based on a predetermined edit distance and merges the one or more fdtered k-mer pair sequences into a plurality of genetic variant candidates. The server then localizes the variant candidates in the reference genome to validate their existence and type, and also to check if there is any annotation associated with them in that specific location of the reference genome. The server subsequently compares the plurality of genetic variant candidates against a pre-populated variant database to specify if each detected one or more sequence genetic variants is novel or has been already annotated in the literature for the targeted disease and outputs the one or more identified sequence genetic variants through a graphic user interface.
[0006] In certain embodiments, the genetic variants identified are one or more of single nucleotide polymorphism (SNP), multiple nucleotide polymorphism (MNP), and insertion/deletion (INDEL).
[0007] In other embodiments, the dataset is comprised of FASTQ/A data for healthy individuals and unhealthy individuals.
[0008] In certain other embodiments, trimming of the genomic sequences is performed to remove unwanted or low-quality regions.
[0009] In certain other embodiments, sorting of the plurality of k-mer sequences is performed in descending order.
[0010] In certain embodiments, the sequence k-mers are filtered based on the ratio of k-mer density in one group vs the other.
[0011] In yet other embodiments, the server applies a T-test filter that performs an unequal variance T-test on the plurality of k-mer sequences.
[0012] In certain other embodiments, the k-mers are filtered based on the level that their density difference in control and experiment groups is compensated by its neighbor k-mers.
[0013] In certain embodiments, the k-mer pairs which have overlap are merged together to make longer sequence pairs.
[0014] In other embodiments, the output is in variant call format (VCF). BRIEF DESCRIPTION OF THE DRAWINGS
[0015] A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
[0016] FIG. l is a diagram of an exemplary embodiment of the hardware of the system of the present invention;
[0017] FIG. 2A is a flowchart showing the software processes of an exemplary embodiment of the present invention;
[0018] FIG. 2B is a flowchart showing the software processes of an exemplary embodiment of the present invention;
[0019] FIG. 3 is a graph showing a 3D representation of k-mers nucleotide counts; and
[0020] FIG. 4 shows charts that demonstrate the geometry MD values (up to 6) with different colors in a 9*9 MD matrix.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0021] FIG. 1 is an exemplary embodiment of the health information system of the present invention. In the exemplary system 100, one or more peripheral devices 110 are connected to one or more computers 120 through a network 130. Examples of peripheral devices/locations 110 include smartphones, tablets, wearables devices, and any other electronic devices that collect and transmit data over a network that are known in the art. The network 130 may be a wide-area network, like the Internet, or a local area network, like an intranet. Because of the network 130, the physical location of the peripheral devices 110 and the computers 120 has no effect on the functionality of the hardware and software of the invention. Both implementations are described herein, and unless specified, it is contemplated that the peripheral devices 110 and the computers 120 may be in the same or in different physical locations. Communication between the hardware of the system may be accomplished in numerous known ways, for example using network connectivity components such as a modem or Ethernet adapter. The peripheral devices/locations 110 and the computers 120 will both include or be attached to communication equipment. Communications are contemplated as occurring through industry-standard protocols such as HTTP or HTTPS.
[0022] Each computer 120 is comprised of a central processing unit 122, a storage medium 124, a user-input device 126, and a display 128. Examples of computers that may be used are: commercially available personal computers, open source computing devices (e.g. Raspberry Pi), commercially available servers, and commercially available portable device (e g. smartphones, smartwatches, tablets). In one embodiment, each of the peripheral devices 110 and each of the computers 120 of the system may have software related to the system installed on it. In such an embodiment, system data may be stored locally on the networked computers 120 or alternately, on one or more remote servers 140 that are accessible to any of the peripheral devices 110 or the networked computers 120 through a network 130. In alternative embodiments, the software runs as an application on the peripheral devices 110.
[0023] FIGs. 2A and 2B show a flow diagram of an alignment-free variant calling algorithm that may be used in accordance with the present invention to identify genomic variants using FASTQ/A data. Accurate variant calling in next-generation sequencing (NGS) data is a major step upon which virtually all downstream analysis and interpretation processes depend. The dataset that is used includes FASTQ/A data for individuals with healthy (control) and unhealthy (experimental) conditions, more specifically for one specific abnormality. Initially, the Control Group Raw Sequence Data 202 and the Experimental Group Raw Sequence Data 204 are transmitted to the Trimming module 206. At the Trimming module 206, the raw sequence data 202, 204 is trimmed, resulting in trimmed control group and experimental group raw sequence data. Sequence trimming is the process of removing unwanted or low-quality regions from a nucleotide or protein sequence. It can be done based on various criteria, such as quality scores, length, or the presence of contaminants or adapters. The goal of trimming is to improve the accuracy of downstream analysis, such as alignments and functional predictions. Trimming of adapter sequences from FASTQ/A reads is a common preprocessing step during NGS data analysis. Adapter removal is necessary to remove the adapter sequences from the 3' end of the reads because those artificially added sequences (which are necessary to attach the DNA fragments to the flow cell and also barcoding) can interfere with the alignment of the reads to the genome. [0024] The trimmed control group and experimental group raw sequence data are sent to the Count Unique k-mers module 208, at which the number of appearances of each unique k-mer in each sample is counted and recorded as its frequency. As a result, the output of this block for each one of control and experiment groups would be a {k-mer, [k-mer_count] } dictionary in which the key is the k-mer sequence, and the value is the [k-mer_count] which is frequency of the k-mer at each sample.
[0025] The {k-mer, [k-mer_count]} dictionary of the control group and the experimental group are then transmitted to the Density Calculation module 210. For each sample at the control and experimental groups, at the Density Calculation module 210, the density of k-mers is determined (for example, by determining the frequency of k-mers divided by total number of k- mers in that sample). Then, the mean of densities of each k-mer are calculated, for experimental and control groups, separately.
[0026] For each group, the k-mer densities and the mean of their densities are transmitted to the Density Filter module 212. At the Density Filter module 212, the k-mers in each group are sorted in descending order, based on their mean density values. In each group, the k-mers having a mean density beyond a predetermined density threshold pass this fdter 212.
[0027] The k-mers from the control group that pass the Density Filter module 212 are transmitted to the Density Ratio Filter 214, which finds the corresponding mean density for each k-mer in the experiment group, calculates density _ratiocontrol(kmer'), and in case the ratio is higher than a specific predetermined threshold, passes the k-mer through the filter 214. An exemplary calculation is shown below.
Figure imgf000007_0001
[0028] A similar fdter is applied on the experiment group k-mers in which the k-mers from the experimental group that pass the Density Filter module 212 are transmitted to the Density Ratio Filter 214, which finds the corresponding mean density for each k-mer in the control group, calculates and in case the ratio is higher than a specific predetermined
Figure imgf000007_0002
threshold, passes the k-mer through the filter 214. An exemplary' calculation is shown below.
Figure imgf000008_0001
[0029] The Density T-Test module 216 is a filter which performs an unequal variance T-Test (Welch’s T-test) on the k-mers that have passed the previous filter, using their densities and the density of their corresponding k-mers in the other group as well as the density ratios calculated by the Density Ratio Filter 214. That is an independent T-Test which is used when the number of samples in each group is different, and/or the variance of the two data sets is different. Preferably, the level of significance is assumed to be 5%. An exemplary calculation is shown below.
Figure imgf000008_0002
[0030] In Equation 3,
Figure imgf000008_0003
and are the number of records (samples) in control
Figure imgf000008_0004
and experiment groups, respectively. Also,
Figure imgf000008_0005
; in which var is variance and a is standard deviation, where the outer bracket in Equation 3 is the floor function.
[0031] Then the absolute value of the T-value is determined, as exemplarily calculated by Equation 4 below.
Figure imgf000009_0001
[0032] Once the degrees of freedom (DOF) and significance level are known, a T- distribution critical value table gives the corresponding critical T-value. Now, if the calculated T-value for any k-mer is greater than or equal to the critical value, it passes the filter of the Density T-Test module 216. The Density T-Test module 216 is applied to the control and experiment group k-mers, separately.
[0033] For each filtered k-mer, the Neighbor Detector module 218 is used to find its neighbor k-mers (i.e., their hamming distance is less than a specific threshold) for further filtering downstream. Performing this process in k-dimensions is computationally expensive, because it needs to calculate the hamming distance of each filtered k-mer with all other k-mers. The Neighbor Detector module 218, for each k-mer sequence that has passed the previous filter, finds its nt count neighbors. The two sequences are considered to be nt count neighbors if the Manhattan distance between their nt count vectors is equal to or smaller than a specific threshold. However, finding the neighbors in 4-dimensions is still computationally expensive. The Neighbor Detector module 218 transforms a 4D analysis into 3D, in which the neighbor sequences for a desired distance to the original k-mer form a predetermined (complex) geometry and are therefore easy to find.
Table 1: Example - Manhattan distance between nt-count vectors
Figure imgf000009_0002
Figure imgf000010_0003
[0034] Further, where the nt count vector is: [nA, nc, no, n?]. since the number of nucleotides in each k-mer is k, we have:
And so,
Figure imgf000010_0002
[0035] Therefore, a 3D_nt_count vector which is [nA, nc, no] can be used instead of nt count since it has the same information for k-mers. With that, the dimension is reduced to 3.
[0036] The analysis then is subject to the following constraints:
Figure imgf000010_0001
[0037] These two constraints make the possibility space a tetrahedron, as shown in FIG. 3, which shows a 3D representation of k-mers nucleotide counts. Therefore, there will be no k-mer outside this tetrahedron. The difference vector of a first and a second nt count vectors can be presented as:
Figure imgf000011_0001
Therefore, we have:
And so:
Figure imgf000011_0002
The Manhattan distance between these two nt_count vector is:
Figure imgf000011_0003
With replacing d in this equation we will have:
Figure imgf000011_0004
[0038] The 3D matrix of 3D_nt_count coordinates is called 3D_static. The way the Neighbor Detector module 218 works is that all experiment group k-mers locations in the 3D static matrix are specified. We can find the position of any control group k-mer based on its 3D_nt_count coordinate. We found the geometry of the isodistance cells with every Manhattan distance of MD=md. For any desired md distance, we just need to go to isodistance cells that pick the experimental k-mers. They are experiment groups nt count neighbors of the original control group k-mer with MD distance. FIG. 4 shows the geometry MD values (up to 6) with different colors in a 9*9 MD matrix. A similar process is performed by switching the groups, to find the control group nt count neighbors of the original experiment group k-mers. [0039] The geometry for isodistance cells for an MD matrix is known and consistent across the 3D_static matrix and is subject to the tetrahedron surfaces constraints. It is proved that for any k value, the total number of cells in the 3D static matrix is:
Figure imgf000012_0001
For MD=md, the total number of cells in MD_matrix is:
Figure imgf000012_0002
For MD<= md, the total number of cells in MD matrix is:
Figure imgf000012_0003
[0040] For example, for k=18, the total number of cells in the 3D static matrix is 1793 but the number of cells that have the MD=<2 (i.e., SNP) is 13. That means that with this neighbor detection method, the k-dimension hamming distance evaluation in Density Compensation Filter 220 for each control k-mer is just applied on the experiment k-mers located at 13 cells instead of 1793 possible cells, and vice versa
[0041] The k-mer variant candidates identified by the Neighbor Detector module 218 are transmitted to the k-mer Density Compensation Filter module 220, which further filters them.
[0042] For single nucleotide polymorphisms (SNPs), it is assumed that if a SNP occurs in a specific k-mer in a dataset, the density of that k-mer sequence is reduced and the density of the sequence with the changed nucleotide is increased. Table 2 shows an example for a dataset with a total 100 k-mers. Table 2: SNP
Figure imgf000013_0004
[0043] Based on that assumption, in an ideal scenario, the sum of the density of original and neighbor k-mers at each of the control and experiment datasets should be the same. Based on that idea, when there is an SNP, we expect that the absolute value of the average of Jp of experiment and control groups would be smaller than the absolute value of Ap of each experiment and control groups, separately, in which
Figure imgf000013_0003
To do that, we define the i/> metric, which is related to the change in
Figure imgf000013_0002
mean density of the original and neighbor k-mers due to each kind of variant, as follows.
Figure imgf000013_0001
[0044] Insertion. For single insertion, the process is similar to SNP, the only difference is the insertion neighbor has a nucleotide inserted in the insertion point of the original sequence and so has one nucleotide less from one end. Table 3 below shows an exemplary insertion and how it affects the dataset.
Table 3: Insertion
Figure imgf000013_0005
[0045] Deletion'. For single deletion, the process is similar to SNP, the only difference is the deletion neighbor has a nucleotide deleted from the deletion point of the original sequence, and so has one nucleotide more at one end.
[0046] The same process applies for multiple nucleotide polymorphism (MNP), multiple insertion or multiple deletion, only the number of varied nucleotides is greater than 1.
Example: Larger Indels
Figure imgf000014_0001
Example: MNP
Figure imgf000014_0002
[0047] For the above, if ip is smaller than a specific predetermined threshold, the pair (original & neighbor) k-mer sequences passes the Density Compensation Filter module 220.
[0048] The output original-neighbor k-mer pairs of the Density Compensation Filter module 220 are transmitted to the Max Edit Distance Filter 222, where the Needleman_Wunch approach is used to filter them further. The original-neighbor k-mer pairs whose edit distance is less than a specific threshold pass this filter. The edit distance between two sequences is a measure of the minimum number of operations (such as insertions, deletions, or substitutions) required to transform one sequence into the other.
[0049] The Needleman-Wunsch algorithm is a dynamic programming approach which is being used in bioinformatics to align two sequences. The algorithm creates a similarity matrix between the two sequences, considering gaps and mismatches, and then applies this matrix to find the optimal global alignment with the highest similarity score. The output of the Needleman-Wunsch algorithm is a pairwise sequence alignment with the highest possible similarity score, which reflects the evolutionary relationship between the two sequences.
[0050] In certain embodiments, the Density Compensation Filter module 220 may be applied after the Max Edit Distance Filter module 222, because the Max Edit Distance Filter module 222 is where original -neighbor k-mer pair potentially includes a variant is determined (i.e., the aforementioned SNP, MNP, Indel examples). However, since the Max Edit Distance Filter module 222 is computationally much more expensive than the Density Compensation Filter module 220, the latter module is typically applied first to reduce the load of the former (the number of pairs that go through the Density Compensation Filter module 220). Regardless of which filter is applied first, however, the same result is achieved.
[0051] The k-mer pair sequences that result from the Max Edit Distance Filter 222 are transmitted to the Merging module 224. Many of the k-mer-pair sequences that have been identified and sent to the Merging module 224 may have overlap. For example, three k-mer pairs that are presented in Table 4 below can be merged together to create a bigger sequence, so that instead of sending three pairs of sequences to the downstream modules of the software, just one bigger sequence will be sent which will result in less computation. It will be computationally expensive to check each two pairs together if they are mergeable. So, there is a need to reduce the number of candidates for merging to each pair and then evaluate them, which is performed at the Merging module 224.
Table 4: Merging Example
Figure imgf000015_0001
Figure imgf000016_0001
[0052] All the pairs are sent from the Merging module 224 to the Neighbor Pair Detector 226 and then to the Mergeable Pair Detector module 228 to find the merable pairs for each k-mer- pair. Those pairs which have a right side mergeable pair (as explained with regard to the Mergeable Pair Detector module 228) are merged, and those which do not, will remain not- merged.
[0053] The Neighbor Pair Detector 226 does a similar job to the Neighbor Detector module 218, the only difference being that for a k-mer pair to be a neighbor pair of a given pair, its experiment and control nt-vectors should be at the neighborhood of the nt-vectors of the experiment and control sequences of the given pair, respectively. So, to find which two pairs are mergeable together, we first create the nt-vector for each k-mer of each candidate pair. For any given pair, called an original -pair, any other pair that the Manhattan distance between the nt- vector of their control k-mers and their experiment k-mers are less than a specific threshold (here, threshold=2), (i.e., both the Manhattan distance between the control k-mers and the Manhattan distance between the experiment k-mers were smaller than a threshold), is considered as the nt-count neighbor-pair of the original pair. The example shown in Table 5 below illustrates that assuming the Manhattan distance threshold of 2, (pairl, pair2), (pair2 , pairl), (pair2, pair3) and (pair3, pair2) are original-neighbor pairs.
Table 5: Neighbor Pair Detections
Figure imgf000016_0002
Figure imgf000017_0001
[0054] For all detected pairs, if the first k-1 nucleotides of the original-pair were the same as the last k-1 of the neighbor-pair, the latter is labeled as the right side mergeable pair (RSMP) of the former, and vice versa. That process is done for all pairs, as exemplarily shown in Table 6 below.
Table 6: Mergeable Pairs
Figure imgf000017_0002
Figure imgf000018_0001
[0055] Then, both merged and not -merged pairs are transmitted from the Merging module 224 to the Variant Identifier module 230. At the Variant Identifier module 230, the Needleman- Wunch matrix of the experiment vs control sequence of each candidate pair is created. Based on that matrix, the location and type of variation is specified.
[0056] The result of the Variant Identifier module 230 is then validated at the Localization module 232, which validates and confirms the detected variants, exemplarily by localizing the variants in the Reference Genome module 234, using existing tools. Then, at the Annotation Check module 236, the variants are checked to confirm if there is any annotation associated with the detected variant in that specific location of the reference genome. This analysis is again performed by comparison to the data found in the Reference Genome module 236. The detected variants and their annotation are then transferred to the Check Novelty module 238.
[0057] The Check Novelty module 238 functions to check if the detected variants are already annotated for the targeted disease in its databases. A list of variants which are already identified and found in the system databases to have a causation or correlation relationship with the abnormality being analyzed are collected at the Variant DB module 240, in advance. The Check Novelty module 238 compares the detected merged variant candidates against that variant list of Variant DB 240 to see which variant is already known to be associated with the targeted disease or abnormality and which is novel and does not appear in the database. Based on that determination, each detected variant can be designated as either “new” or “existing” in candidate variants.
[0058] Following the Check Novelty module 238, the result of the variant caller processes described is presented in the standard VCF format by the VCF Output module 242.
[0059] The foregoing description and drawings should be considered as illustrative only of the principles of the invention. The invention is not intended to be limited by the preferred embodiment and may be implemented in a variety of ways that will be clear to one of ordinary skill in the art. Numerous applications of the invention will readily occur to those skilled in the art. Therefore, it is not desired to limit the invention to the specific examples disclosed or the exact construction and operation shown and described. Rather, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.

Claims

1. A system for variant analysis comprised of a server that: receives a dataset comprising one or more control genetic sequence samples and one or more experimental genetic sequence samples; performs a count of unique k-mer sequences from the dataset; sorts the plurality of k-mer sequences based on density; applies a neighbor detection function to the plurality of k-mer sequences to identify one or more neighbor k-mer sequences to form one or more k-mer pair sequences; filters the one or more k-mer pair sequences based on a predetermined edit distance; merges the one or more filtered k-mer pair sequences into a plurality of genetic variant candidates; compares the plurality of genetic variant candidates against a pre-populated variant database to specify if each detected sequence genetic variant is novel or has been already annotated in a targeted disease; and outputs the one or more identified sequence genetic variants through a graphic user interface.
2. The system of claim 1, wherein the neighbor detection function comprises a dimensionality reduction transformation on the plurality of k-mer sequences.
3. The system of claim 1, wherein the genetic variants are one or more of single nucleotide polymorphism (SNP), multiple nucleotide polymorphism (MNP), and insertion/deletion (INDEL).
4. The system of claim 1, wherein the dataset is comprised of RNA data in a FASTQ/A format for healthy individuals and unhealthy individuals.
5. The system of claim 1, wherein the server further trims low quality regions from the control genetic sequence samples and the experimental genetic sequence samples of the dataset.
6. The system of claim 1, wherein the sorting of the plurality of k-mer sequences is performed based on their density values in descending order.
7. The system of claim 1, wherein the server further filters the plurality of k-mer sequences by calculating a ratio of k-mer density in one subset of the plurality of k-mer sequences as compared to a second subset of the plurality of k-mer sequences.
8. The system of claim 1, wherein the server further applies a T-test filter that performs an unequal variance T-test on the plurality of k-mer sequences.
9. The system of claim 1, wherein the filtering of the one or more k-mer pair sequences is based on the amount of density difference in the control genetic sequence samples and experimental genetic sequence samples as compensated by their neighbor k-mer sequences.
10. The system of claim 1, wherein the one or more filtered k-mer pair sequences are merged based on overlap.
11. The system of claim 1, wherein server further localizes the plurality of genetic variant candidates in a reference genome to validate their existence and type.
12. The system of claim 1, wherein the server further performs a check on the plurality of genetic variant candidates to determine whether an annotation is associated with said genetic variant candidates at a specific location on the reference genome.
13. The system of claim 1, wherein the output is in variant call format (VCF).
14. A computer-implemented method for variant analysis comprising: receiving a dataset comprising one or more control genetic sequence samples and one or more experimental genetic sequence samples; performing a count of unique k-mer sequences from the dataset; sorting the plurality of k-mer sequences based on density; applying a neighbor detection function to the plurality of k-mer sequences to identify one or more neighbor k-mer sequences to form one or more k-mer pair sequences; filtering the one or more k-mer pair sequences based on a predetermined edit distance; merging the one or more filtered k-mer pair sequences into a plurality of genetic variant candidates; comparing the plurality of genetic variant candidates against a pre-populated variant database to specify if each detected sequence genetic variant is novel or has been already annotated in a targeted disease; and outputting the one or more identified sequence genetic variants through a graphic user interface.
15. The method of claim 14, wherein the neighbor detection function comprises a dimensionality reduction transformation on the plurality of k-mer sequences.
16. The method of claim 14, wherein the genetic variants are one or more of single nucleotide polymorphism (SNP), multiple nucleotide polymorphism (MNP), and insertion/deletion (INDEL).
17. The method of claim 14, wherein the dataset is comprised of RNA data in a FASTQ/A format for healthy individuals and unhealthy individuals.
18. The method of claim 14, further comprising trimming low quality regions from the control genetic sequence samples and the experimental genetic sequence samples of the dataset.
19. The method of claim 14, wherein the sorting of the plurality of k-mer sequences is performed based on their density values in descending order.
20. The method of claim 14, further comprising filtering the plurality of k-mer sequences by calculating a ratio of k-mer density in one subset of the plurality of k-mer sequences as compared to a second subset of the plurality of k-mer sequences.
21. The method of claim 14, further comprising applying a T-test filter that performs an unequal variance T-test on the plurality of k-mer sequences.
22. The method of claim 14, wherein the filtering of the one or more k-mer pair sequences is based on the amount of density difference in the control genetic sequence samples and experimental genetic sequence samples as compensated by their neighbor k-mer sequences.
23. The method of claim 14, wherein the one or more filtered k-mer pair sequences are merged based on overlap.
24. The method of claim 14, wherein server further localizes the plurality of genetic variant candidates in a reference genome to validate their existence and type.
25. The method of claim 14, further comprising performing a check on the plurality of genetic variant candidates to determine whether an annotation is associated with said genetic variant candidates at a specific location on the reference genome.
26. The method of claim 14, wherein the output is in variant call format (VCF).
PCT/US2023/063411 2022-02-28 2023-02-28 Systems and methods for alignment-free variant calling WO2023164728A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263314734P 2022-02-28 2022-02-28
US63/314,734 2022-02-28
US202263431957P 2022-12-12 2022-12-12
US63/431,957 2022-12-12

Publications (2)

Publication Number Publication Date
WO2023164728A2 true WO2023164728A2 (en) 2023-08-31
WO2023164728A3 WO2023164728A3 (en) 2023-09-28

Family

ID=87766819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/063411 WO2023164728A2 (en) 2022-02-28 2023-02-28 Systems and methods for alignment-free variant calling

Country Status (2)

Country Link
US (1) US20230298693A1 (en)
WO (1) WO2023164728A2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014052909A2 (en) * 2012-09-27 2014-04-03 The Children's Mercy Hospital System for genome analysis and genetic disease diagnosis
ES2958715T3 (en) * 2013-11-26 2024-02-13 Illumina Inc Compositions and methods for polynucleotide sequencing
EP3267346A1 (en) * 2016-07-08 2018-01-10 Barcelona Supercomputing Center-Centro Nacional de Supercomputación A computer-implemented and reference-free method for identifying variants in nucleic acid sequences

Also Published As

Publication number Publication date
WO2023164728A3 (en) 2023-09-28
US20230298693A1 (en) 2023-09-21

Similar Documents

Publication Publication Date Title
US11756652B2 (en) Systems and methods for analyzing sequence data
Bush et al. Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines
US20180357363A1 (en) Protein design method and system
US20160140289A1 (en) Variant caller
US20170242958A1 (en) Systems and methods for genotyping with graph reference
Trappe et al. Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone
US8972406B2 (en) Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters
CN113555062A (en) Data analysis system and analysis method for genome base variation detection
Cleal et al. Dysgu: efficient structural variant calling using short or long reads
WO2014169377A1 (en) Aligning and clustering sequence patterns to reveal classificatory functionality of sequences
Glusman et al. Ultrafast comparison of personal genomes via precomputed genome fingerprints
Schelling et al. Evolutionary couplings and sequence variation effect predict protein binding sites
US20190267110A1 (en) System and method for sequence identification in reassembly variant calling
US20230298693A1 (en) Alignment-free variant calling
Siegel et al. Analysis of sequence-tagged-connector strategies for DNA sequencing
Mesa et al. Hidden Markov models for gene sequence classification: Classifying the VSG gene in the Trypanosoma brucei genome
US20190214110A1 (en) Detection of insufficient homology regions in a reference sequence
CN113611358A (en) Sample pathogenic bacteria typing method and system
US20220301655A1 (en) Systems and methods for generating graph references
Vetro et al. TIDE: Inter-chromosomal translocation and insertion detection using embeddings
Itan et al. Detecting gene duplications in the human lineage
Suzuki et al. A method of sequence analysis for high-throughput sequencer data based on shifted short read clustering
Braga et al. Family-Free Genome Comparison
Ie et al. BioSpike: Efficient search for homologous proteins by indexing patterns
Glusman et al. Ultrafast comparison of personal genomes