WO2023240183A1 - Method and system for assessing an impact of genetic changes on biological properties - Google Patents

Publication number
WO2023240183A1
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
sequence
acid sequences
sequences
assessment system
Application number
PCT/US2023/068124
Other languages
French (fr)
Inventor
Vladimir PEROVIC
Slobodan Paessler
Original Assignee
Biomed Protection Tx, Llc.
Application filed by Biomed Protection Tx, Llc.

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • aspects of the present disclosure relate to assessing the impact of genetic mutations on biological properties.
  • aspects of the present disclosure relate to a method and system that can generate training data for a machine learning model and train the machine learning model to identify genetic variants that are at risk of being associated with certain biological characteristics.
  • a method for assessing genetic changes includes receiving a plurality of amino acid sequences; determining electronic properties for each sequence of the plurality of amino acid sequences, the electronic properties comprising one or more amplitude values for one or more characteristic frequencies; constructing, using the electronic properties, a phylogenetic tree; assigning, using the phylogenetic tree, a label to each sequence of the plurality of amino acid sequences; and training, using training data and machine learning algorithms, a classification model, wherein the training data includes, for each sequence of the plurality of amino acid sequences, the one or more amplitude values and the label.
  • a method for assessing a biological impact of a genetic variant includes receiving an amino acid sequence; determining electronic properties of the amino acid sequence, the electronic properties including one or more amplitude values for one or more characteristic frequencies; and determining, using the electronic properties, whether the amino acid sequence is at risk of being associated with a biological characteristic; wherein determining, using the electronic properties, whether the amino acid sequence is at risk of being associated with a biological characteristic comprises applying a classification model to the one or more amplitude values.
  • a mutation assessment system comprising a processor and a memory.
  • the memory can store instructions, wherein the instructions, when executed by the processor, cause the mutation assessment system to: receive a plurality of amino acid sequences; determine electronic properties for each sequence of the plurality of amino acid sequences, the electronic properties comprising one or more amplitude values for one or more characteristic frequencies; construct, using the electronic properties, a phylogenetic tree; assign, using the phylogenetic tree, a label to each sequence of the plurality of amino acid sequences; and train, using training data and machine learning algorithms, a classification model, wherein the training data includes, for each sequence of the plurality of amino acid sequences, the one or more amplitude values, their mathematical transformation, and the label.
  • FIG. 1 illustrates an example network environment in which aspects of the present disclosure can be implemented.
  • FIG. 2 illustrates a schematic representation of an example mutation assessment system.
  • FIG. 3 is a flowchart of an example method for performing aspects of the present disclosure.
  • FIG. 4 is a flowchart of an example method for generating training data.
  • FIG. 5 illustrates a schematic example execution of aspects of the present disclosure.
  • FIG. 6 is a flowchart of an example method for training a classification model.
  • FIG. 7 is a flowchart of an example method for assessing an input query.
  • FIG. 8 illustrates a schematic example execution of aspects of the present disclosure.
  • FIG. 9 illustrates a block diagram of an example computing system.
  • aspects of the present disclosure relate to a system that can assess the impact of genetic mutations on biological properties.
  • the mutation assessment system can generate training data, train a machine learning model, and determine a likelihood that an input sequence is associated with a certain biological trait.
  • the mutation assessment system can receive a plurality of sequences, such as a plurality of amino acid sequences.
  • Each sequence of the plurality of amino acid sequences may, in some examples, be derived from a variant of a pathogen or a variant of a gene of any organism, including humans.
  • the mutation assessment system can, in some embodiments, convert each of the amino acid sequences into a sequence of electron-ion interaction potential (EIIP) values (or electron-ion interaction pseudo-potential values). Then the mutation assessment system can, in some embodiments, convert each of the sequences of EIIP values to a frequency domain.
  • the frequencies of the frequency domain, and their corresponding amplitudes, including mathematical transformations of the amplitudes, can relate to biological characteristics (e.g., an ability to escape immune system detection or drug treatment).
  • the mutation assessment system can determine, or can receive, one or more characteristic frequencies in the frequency domain, and the mutation assessment system can determine one or more amplitude values for the one or more characteristic frequencies. Having determined, for each sequence, the one or more amplitude values for the one or more characteristic frequencies, the mutation assessment system can, based on these values, and mathematical transformations of these values, group the sequences. To form the groups, the mutation assessment system can, in example aspects, generate a tree. To do so, the mutation assessment system can, in some embodiments, calculate a distance matrix based on amplitude values at characteristic frequencies. The distance matrix can include a distance between each pair of sequences in the plurality of sequences.
  • the mutation assessment system can then, in some embodiments, generate a phylogenetic tree that groups the sequences based on their distances from one another. Using the constructed phylogenetic tree, the mutation assessment system can, in some embodiments, cluster the sequences, assigning them, for example, a positive label or a negative label.
  • the mutation assessment system can train a machine learning model using training data.
  • the training data can include, for example, a plurality of training instances, each of which can have training attributes and a label.
  • each sequence can be a training instance, with the one or more amplitude values as training attributes and the positive or negative label assigned while clustering as the label.
  • the mutation assessment system can train the machine learning model using, for each sequence, the amplitude values and the label, thereby creating a model that can receive one or more amplitude values associated with one or more characteristic frequencies and that can output a positive or a negative label.
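The training data layout described above can be sketched as a small data structure. This is only an illustration: the names `TrainingInstance` and `build_training_data`, and the encoding of the positive and negative labels as 1 and 0, are assumptions made here and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrainingInstance:
    """One sequence, reduced to its training attributes and its label."""
    attributes: Tuple[float, ...]  # amplitude values at the characteristic frequencies
    label: int                     # assumed encoding: 1 = positive, 0 = negative


def build_training_data(
    amplitudes: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[TrainingInstance]:
    """Pair each sequence's amplitude values with the label assigned while clustering."""
    if len(amplitudes) != len(labels):
        raise ValueError("each sequence needs exactly one label")
    return [TrainingInstance(tuple(a), int(l)) for a, l in zip(amplitudes, labels)]
```

Each instance corresponds to one input sequence, so the list has the same length as the plurality of sequences used to build the tree.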
  • the mutation assessment system can determine whether a query input is positive or negative.
  • the input can be an amino acid sequence of a protein from an organism or virus.
  • the mutation assessment system can convert the amino acid sequence into an EIIP sequence, and then convert the EIIP sequence to the frequency domain. Then, the mutation assessment system can, in some embodiments, determine one or more amplitude values at the characteristic frequencies and apply the machine learning model to the one or more amplitudes, or to mathematical transformations of the amplitudes, thereby determining a predicted positive or negative classification for the input.
  • Whether the input receives a positive or negative label can, depending on the embodiment, indicate whether the organism or virus associated with the input sequence is at risk of being associated with a certain biological characteristic.
  • Certain embodiments of the present disclosure have certain technical features that make them particularly advantageous over existing tools. For example, certain embodiments of the present disclosure enable accurate and efficient assessments of whether a genetic mutation will be associated with a certain biological trait. For example, aspects of the present disclosure can help scientists and health officials determine whether a new genetic variant of a pathogen is likely to have a biological trait of interest, such as, for example, an ability to escape immune system recognition or an ability to infect humans. Moreover, aspects of the present disclosure are generalizable. Certain aspects can be used to determine an impact of genetic changes on biological characteristics for a variety of biological characteristics across a wide range of organisms and viruses.
  • aspects of the present disclosure allow scientists to create phylogenetic trees that are tailored to analyze relevant genetic changes.
  • aspects of the present disclosure can better detect a seemingly small genetic mutation if the genetic mutation is relevant to a biological characteristic and can keep two sequences grouped together, despite seemingly large genetic changes, if those genetic changes do not affect the relevant biological characteristic.
  • aspects of the present disclosure can be used to actively monitor changing pathogens and, based on accessible genetic or amino acid sequences, efficiently identify variants of the pathogen that may have harmful biological characteristics.
  • aspects of the present disclosure can be integrated into, and improve, a system for monitoring and responding to evolving pathogens. As will be apparent, these are only some of the advantages offered by aspects of the present disclosure.
  • Fig. 1 illustrates an example network 100 in which aspects of the present disclosure can be implemented.
  • the network 100 includes a mutation assessment system 102, a genetic information database 104, a user 106, an output system 108, a database 110, an input source 112, and the networks 120a-c.
  • the network 100 can include more or fewer elements than those displayed in and described in connection with Fig. 1.
  • Each network of networks 120a-c can be, for example, a wireless network, a wired network, a virtual network, the internet, or any other type of network.
  • each network of the networks 120a-c can be divided into subnetworks, and the subnetworks can be different types of networks.
  • the mutation assessment system 102 can be a computer system or program that can be configured to analyze genetic information. As is further described below, the mutation assessment system 102 can, for example, combine electronic biology techniques, such as using EIIP values to represent amino acid sequences, and artificial intelligence techniques, such as a binary classification machine learning model, to analyze and assess genetic information. As is further described below, the mutation assessment system 102 can, in some embodiments, generate training data from a plurality of nucleotide or amino acid sequences, train a machine learning model using the training data, and use the machine learning model to analyze an impact of a genetic mutation on a biological characteristic. An example architecture of the mutation assessment system 102 is described in connection with Fig. 2.
  • the genetic information database 104, which can be coupled with the mutation assessment system 102 via the network 120a, can be a database that stores genetic data, data related to a particular organism’s or virus’s genome, or data related to proteins of a virus or organism.
  • the genetic information database 104 can include nucleotide sequences and amino acid sequences.
  • the genetic information database 104 can include data related to a particular pathogen, proteins of the pathogen, or variants of the pathogen.
  • the genetic information database 104 can include data related to SARS-CoV-2, and the genetic information database 104 can include a plurality of samples of SARS-CoV-2, including samples of various proteins and variants of SARS-CoV-2.
  • the genetic information database 104 can be a plurality of databases. In some embodiments, the genetic information database 104 can be a database associated with GISAID or with the National Center for Biotechnology Information. As shown in the example of Fig. 1, the genetic information database 104 can transmit genetic data 114 to the mutation assessment system 102. In some embodiments, the genetic data 114 can be a plurality of amino acid sequences. As is further described below, the mutation assessment system 102 can use the genetic data 114 to, for example, generate training data.
  • the user 106, the output system 108, and the database 110 can be coupled with the mutation assessment system 102 via, for example, the network 120b.
  • the user 106 can, in some examples, be a researcher, scientist, public health official, or another individual that operates the mutation assessment system 102.
  • the user 106 can, in some embodiments, receive data generated by the mutation assessment system 102, and the user 106 can input information into the mutation assessment system 102.
  • the user 106 may, among other things, use a phylogenetic tree constructed by the mutation assessment system 102 to cluster sequences, as is further described below.
  • the output system 108 can receive data generated by the mutation assessment system 102 and use that data as part of another process or system, such as an analytics system or monitoring system.
  • the output system 108 can automatically act in response to receiving an output (e.g., an indication that an input sequence is sufficiently likely to be associated with a predefined biological characteristic). For example, the output system 108 may automatically cause further analysis of a sample, or the output system 108 may alert officials of the result.
  • the database 110 can be a database that is external to the mutation assessment system 102, and the database 110 can, in some embodiments, store data generated by the mutation assessment system 102 and provide the mutation assessment system 102 with information that may be required to generate training data, construct a machine learning model, assess an input sequence, or perform another operation.
  • the mutation assessment system 102 can generate an assessment 118 and transmit the assessment 118 to one or more of the user 106, the output system 108, or the database 110.
  • the assessment 118 can include, for example, results and analysis for one or more input queries.
  • the assessment 118 can indicate, for example, whether there exists, out of a set of amino acid sequences from a set of virus variants, a virus variant that may have a certain biological characteristic, such as being resistant to a vaccine.
  • the input source 112, which can be connected to the mutation assessment system 102 via, for example, the network 120c, can be a database or system that can produce or send one or more input sequences to the mutation assessment system 102.
  • the input source 112 may have an amino acid or nucleotide sequence associated with a novel protein, virus, or organism, and the input source 112 may send such information as an input query 116 to the mutation assessment system 102.
  • the mutation assessment system 102 can, in some examples, receive the input query 116 and can assess a likelihood of whether it is associated with a certain biological characteristic.
  • Fig. 2 illustrates a block diagram of an example implementation of the mutation assessment system 102.
  • the mutation assessment system 102 includes a plurality of components.
  • the mutation assessment system 102 can include a training data generator 200, a classification model 206, a mutation analyzer 208, a user interface 210, and a database 212.
  • the training data generator 200 can include a tree builder 202 and a cluster identifier 204.
  • the components of the mutation assessment system 102 are described as performing various aspects of the present disclosure; however, in some examples, the functions of the components can overlap, or the functions of the components can be performed by other components.
  • the components of the mutation assessment system 102 can, depending on the embodiment, be located on the same computing system or on different computing systems.
  • the mutation assessment system 102 can have more or fewer components than those shown in the example of Fig. 2.
  • the training data generator 200 can, in some embodiments, receive genetic data and make training data from the genetic data. To do so, the training data generator 200 can, in some embodiments, have subcomponents, such as the tree builder 202 and the cluster identifier 204, which can be programs or systems that are configured to perform certain aspects of the present disclosure, including operations performed by the training data generator 200. As is further described below (e.g., in connection with Figs. 3-5), the training data generator 200 can, in some embodiments, receive a plurality of amino acid sequences (or, in some embodiments, the training data generator 200 can convert sequences of other genetic data into sequences of amino acids).
  • the training data generator 200 can convert the amino acid sequences into training data to, for example, train the classification model. To do so, the training data generator 200 can, in some embodiments, convert the sequences of amino acids to sequences of EIIP values, convert the sequences of EIIP values to the frequency domain, cluster the EIIP sequences based, at least in part, on their amplitude values at characteristic frequencies, including mathematical transformations of their amplitude values at characteristic frequencies, and label the EIIP sequences based, at least in part, on the clusters. Examples of the training data generator 200 are further described below.
  • the classification model 206 can, in some embodiments, be a process or system that incorporates a machine learning model for assessing a biological impact of a genetic mutation.
  • the machine learning model can, in some embodiments, be a binary classification model that, in some embodiments, determines a probability and, based on the probability and a threshold value, classifies an input as positive or negative.
  • the machine learning model can include decision trees, k-nearest neighbor algorithms, neural networks, Bayes classifiers, or other machine learning algorithms that can perform classification tasks.
  • the machine learning model can use an ensemble of machine learning methods, including random forests, gradient boost methods, and deep learning.
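As a minimal sketch of a binary classifier that determines a probability and compares it with a threshold, the following uses a k-nearest neighbor vote, one of the algorithm families listed above. The function names, the default `k=3`, the default threshold of 0.5, and the use of Euclidean distance are all assumptions chosen for illustration; the disclosure does not fix any of them.

```python
import math
from typing import List, Sequence


def knn_positive_probability(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    query: Sequence[float],
    k: int = 3,
) -> float:
    """Fraction of the k nearest training instances that carry the positive label (1)."""
    # Rank training instances by Euclidean distance between amplitude vectors.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


def classify(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    query: Sequence[float],
    k: int = 3,
    threshold: float = 0.5,
) -> str:
    """Classify the query as positive when the estimated probability reaches the threshold."""
    probability = knn_positive_probability(train_x, train_y, query, k)
    return "positive" if probability >= threshold else "negative"
```

Raising the threshold makes a positive call more conservative, which may matter when a positive result triggers follow-up analysis.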
  • the mutation analyzer 208 can, in some embodiments, be a program or system that can receive one or more results from the classification model 206, or other genetic mutation analysis systems, and provide additional analysis for an input. Furthermore, in some embodiments, the mutation analyzer 208 can generate a report, such as the assessment 118 of Fig. 1, which may provide results and analysis for one or more input sequences.
  • the user interface 210 can, in some embodiments, be used by a user to access the mutation assessment system 102 and to input data into, or receive data from, the mutation assessment system 102.
  • the database 212 can be used to store information that is generated by the mutation assessment system 102 or data that can be used by the mutation assessment system 102.
  • Fig. 3 is a flowchart of an example method 300 that can be used, for example, by the mutation assessment system 102, or a user that is using the mutation assessment system 102.
  • the mutation assessment system 102 can generate training data (step 302), create a classification model (step 304), and assess an input query (step 306).
  • the illustrated steps of the method 300 are further described below in connection with Figs. 4-8.
  • an example of generating training data (step 302) is described in connection with Figs. 4-5; an example of creating a classification model (step 304) is described in connection with Fig. 6; and an example of assessing an input query (step 306) is described in connection with Figs. 7-8.
  • the method 300 can have more or fewer steps than those illustrated in the example of Fig. 3.
  • Fig. 4 is a flowchart of an example method 400 for generating training data.
  • the mutation assessment system 102, including subcomponents of the mutation assessment system 102 (e.g., described in connection with Fig. 2), can perform aspects of the method 400, or a user (e.g., the user 106 of Fig. 1) can use the mutation assessment system 102 to perform aspects of the method 400.
  • the mutation assessment system 102 can receive a plurality of amino acid sequences (step 402).
  • the mutation assessment system 102 can receive a plurality of amino acid sequences from the genetic information database 104.
  • the mutation assessment system 102 can receive other genetic data (e.g., nucleotide sequences), and convert that genetic data into amino acid sequences.
  • the amino acid sequences can be associated with a certain protein in a virus or organism.
  • the amino acid sequences may be for variants of the S1 spike protein of SARS-CoV-2.
  • the amino acid sequences can be for different proteins of different viruses or organisms.
  • the mutation assessment system 102 can normalize the length of the amino acid sequences by trimming or padding them to ensure that they are the same length. Similarly, in some embodiments, the mutation assessment system 102 can normalize the length of the plurality of sequences after converting the amino acid sequences to sequences of EIIP values, which is described below.
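Length normalization by trimming or padding can be sketched as follows. The function name, the choice of the longest sequence as the default target length, and the `-` pad symbol are assumptions for illustration; when normalizing after EIIP conversion, padding with zeros would be the analogous choice.

```python
from typing import List, Optional, Sequence


def normalize_length(
    sequences: Sequence[str],
    target_length: Optional[int] = None,
    pad_symbol: str = "-",
) -> List[str]:
    """Trim or right-pad every sequence so that all share one length."""
    if target_length is None:
        # Assumed default: pad everything up to the longest input sequence.
        target_length = max(len(s) for s in sequences)
    return [s[:target_length].ljust(target_length, pad_symbol) for s in sequences]
```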
  • Long-range molecular interactions can be caused by electronic properties of molecules.
  • the electronic properties of an amino acid sequence can include data, values, or information that are determined using methods related to electronic biology.
  • the electronic properties for an amino acid sequence can include electron-ion interaction potential values, amplitude values or frequency values related to electronic biology, mathematical transformations of values, and other properties.
  • the mutation assessment system 102 can convert the plurality of amino acid sequences to sequences of EIIP values (step 404).
  • An electron-ion interaction potential (EIIP) (or electron-ion interaction pseudo-potential) value can represent the main energy term of valence electrons.
  • the EIIP can be based on the number of delocalized electrons, represented by an average quasivalence number (AQVN).
  • the EIIP value can be calculated using the following equation (1):
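For orientation, the relation commonly quoted in the EIIP literature is given below. It is an assumption that equations (1) and (2) of the disclosure take this form. Here $W$ is the EIIP, $Z^{*}$ is the AQVN, $Z_i$ is the valence number of the $i$-th atomic component, $n_i$ is the number of atoms of that component, $m$ is the number of atomic components, and $N_a$ is the total number of atoms in the molecule (written $N_a$ to avoid confusion with the sequence counts used elsewhere).

```latex
W = 0.25\,\frac{Z^{*}\,\sin\!\left(1.04\,\pi Z^{*}\right)}{2\pi},
\qquad
Z^{*} = \frac{1}{N_a}\sum_{i=1}^{m} n_i Z_i
```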
  • the EIIP value for an amino acid can be calculated and expressed in Rydberg (Ry) units.
  • an amino acid sequence having N residues can, in some examples, be converted to an N-length sequence of EIIP values.
  • This EIIP signal can, for example, characterize electronic biology properties of the primary sequence of the protein that the amino acid sequence is drawn from.
  • the following table illustrates the EIIP value for twenty amino acids:
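The conversion of step 404 can be sketched as a table lookup. The values below are the per-residue EIIP values (in Ry) commonly published in the EIIP literature, keyed by one-letter amino acid code; they are an assumption here and should be checked against the table of the disclosure. The names `EIIP` and `to_eiip` are illustrative.

```python
from typing import List

# Commonly published EIIP values in Rydbergs, keyed by one-letter residue code
# (assumed; verify against the source table before relying on them).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def to_eiip(sequence: str) -> List[float]:
    """Map an N-residue amino acid sequence to an N-length list of EIIP values."""
    try:
        return [EIIP[residue] for residue in sequence.upper()]
    except KeyError as exc:
        raise ValueError(f"unknown residue {exc.args[0]!r}") from None
```

Ambiguous or gap symbols are rejected rather than silently mapped, since how to treat them is a design decision the lookup alone cannot make.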
  • the mutation assessment system 102, having converted the amino acid sequences to sequences of EIIP values, can convert the sequences of EIIP values to a frequency domain (step 406).
  • the mutation assessment system 102 can apply a discrete Fourier transform (e.g., by using Fast Fourier Transform, Wavelet Transform, or another algorithm) to convert each of the sequences of EIIP values to the frequency domain.
  • the discrete Fourier transform can be defined by the following equation (3):
  • the mutation assessment system 102 can, for each sequence of EIIP values, have a frequency domain representation, which can include amplitude, frequency, and phase information of sinusoids that represent the original EIIP sequence.
  • the mutation assessment system can calculate an energy density spectrum from the Fourier coefficients, which can, in some examples, be defined by the following equation (4):
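A minimal sketch of steps 406 onward, assuming the standard definitions: the discrete Fourier transform $X(k) = \sum_{m=0}^{L-1} x(m)\,e^{-j 2\pi k m / L}$ and the energy density spectrum $S(k) = X(k)\,X^{*}(k) = |X(k)|^{2}$. Whether equations (3) and (4) of the disclosure use exactly this indexing is an assumption. The direct summation below is quadratic in the sequence length and is written for clarity; a Fast Fourier Transform routine would normally be used instead.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Direct DFT of a real signal, returning coefficients for bins 0..L//2."""
    length = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * m / length) for m, x in enumerate(signal))
        for k in range(length // 2 + 1)
    ]


def energy_density_spectrum(signal: Sequence[float]) -> List[float]:
    """Squared magnitude of each Fourier coefficient, |X(k)|^2."""
    return [abs(coefficient) ** 2 for coefficient in dft(signal)]
```

Only bins up to half the sequence length are kept because the spectrum of a real-valued signal is symmetric about that point; bin `k` corresponds to a frequency of `k / L` cycles per residue.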
  • the mutation assessment system 102 can determine one or more values at one or more characteristic frequencies (step 408).
  • a characteristic frequency can be a value in the frequency domain that has been determined to be relevant to a biological characteristic.
  • the one or more characteristic frequencies can be determined, for example, by expert analysis or by other systems.
  • the characteristic frequencies may be determined by performing cross-spectrum analysis on a plurality of energy density spectra derived from amino acid sequences.
  • the amino acid sequences may be associated with proteins manifesting a certain biological characteristic. Thus, by performing a cross-spectrum analysis, it can be determined, in some examples, that there are one or more characteristic frequencies associated with that biological characteristic.
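One common formulation of cross-spectrum analysis multiplies the spectra of several sequences point by point, so that only frequencies prominent in every spectrum remain large. The sketch below assumes that formulation, assumes all spectra have equal length, and skips the zero-frequency bin when looking for the peak; none of these choices is stated in the disclosure.

```python
import math
from typing import List, Sequence


def cross_spectrum(spectra: Sequence[Sequence[float]]) -> List[float]:
    """Point-by-point product of equal-length energy density spectra."""
    if len({len(s) for s in spectra}) != 1:
        raise ValueError("all spectra must have the same length")
    return [math.prod(column) for column in zip(*spectra)]


def peak_bin(cross: Sequence[float], skip_zero_frequency: bool = True) -> int:
    """Index of the largest cross-spectrum value, a candidate characteristic frequency."""
    start = 1 if skip_zero_frequency else 0
    return max(range(start, len(cross)), key=lambda i: cross[i])
```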
  • the mutation assessment system 102 can, in some embodiments, determine the amplitude value in the energy density spectrum for the characteristic frequency.
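Reading the amplitude at a characteristic frequency can be sketched as picking the nearest bin of the energy density spectrum. The sketch assumes frequencies are expressed in cycles per residue (so they lie between 0 and 0.5) and that bin `k` of a length-`n_points` transform corresponds to frequency `k / n_points`; the function name is illustrative.

```python
from typing import Sequence


def amplitude_at(spectrum: Sequence[float], n_points: int, frequency: float) -> float:
    """Energy density value at the spectrum bin nearest to the given frequency."""
    index = round(frequency * n_points)
    if not 0 <= index < len(spectrum):
        raise ValueError("frequency falls outside the computed spectrum")
    return spectrum[index]
```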
  • the mutation assessment system 102 can calculate a distance matrix (step 410).
  • the mutation assessment system 102 can determine a distance matrix for the plurality of amino acid sequences, which can, as described above, be converted to a plurality of sequences of EIIP values. Calculating the distance matrix can, in some embodiments, include determining a distance between every two sequences of the plurality of sequences (step 411). To calculate a distance between a pair of sequences, the mutation assessment system 102 can, in some embodiments, use the amplitude values at the characteristic frequencies, including mathematical transformations of the amplitude values, as described above, for example in connection with step 408.
  • the mutation assessment system 102 can use one of a plurality of distance metrics to calculate a distance between a pair of sequences.
  • a distance metric can be a single frequency distance (d1), which can be the distance between amplitude values in the energy density spectrum at a characteristic frequency. For example, if X1 and X2 are two sequences, S1 and S2 are their corresponding energy density spectra, and F is a characteristic frequency, then the single frequency distance (d1) between X1 and X2 can be written as equation (5):
  • $$d_1(X_1, X_2) = \left| A_1(F) - A_2(F) \right| \qquad (5)$$
  • where A1(F) and A2(F) are the amplitudes at frequency F of spectra S1 and S2, respectively.
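A pairwise distance matrix over a set of sequences can be sketched as below, taking the single frequency distance to be the absolute difference between the two amplitudes, which is an assumption about its exact form. Each input value is the amplitude of one sequence at the characteristic frequency; the function names are illustrative.

```python
from typing import Callable, List, Sequence


def single_frequency_distance(a1: float, a2: float) -> float:
    """Assumed form of d1: absolute difference of amplitudes at one frequency."""
    return abs(a1 - a2)


def distance_matrix(
    amplitudes: Sequence[float],
    metric: Callable[[float, float], float] = single_frequency_distance,
) -> List[List[float]]:
    """Symmetric matrix holding the distance between every pair of sequences."""
    n = len(amplitudes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(amplitudes[i], amplitudes[j])
    return matrix
```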
  • the mutation assessment system 102 can, in some embodiments, use an amplitude ratio distance (d2).
  • the amplitude ratio distance can be used, in some examples, to infer information that corresponds to the transfer between two biological characteristics that relate to previously determined characteristic frequencies F1 and F2. For example, if X1 and X2 are two sequences, S1 and S2 are their corresponding spectra, and F1 and F2 are two characteristic frequencies, then the amplitude ratio distance (d2) between X1 and X2 can, in some embodiments, be defined by the following equation (6):
  • $$d_2(X_1, X_2) = \left| \frac{A_1(F_1)}{A_1(F_2)} - \frac{A_2(F_1)}{A_2(F_2)} \right| \qquad (6)$$
  • where A1(F1) and A1(F2) are amplitude values of spectrum S1 at characteristic frequencies F1 and F2, respectively, and A2(F1) and A2(F2) are amplitude values of spectrum S2 at characteristic frequencies F1 and F2, respectively.
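The amplitude ratio distance can be sketched as the absolute difference between each sequence's ratio of amplitudes at the two characteristic frequencies. Treating the distance as an absolute difference of ratios is an assumption, as is rejecting a zero amplitude at the second frequency rather than substituting a small constant.

```python
def amplitude_ratio_distance(
    a1_f1: float, a1_f2: float, a2_f1: float, a2_f2: float
) -> float:
    """Assumed form of d2: |A1(F1)/A1(F2) - A2(F1)/A2(F2)|."""
    if a1_f2 == 0 or a2_f2 == 0:
        raise ValueError("amplitude at the second frequency must be non-zero")
    return abs(a1_f1 / a1_f2 - a2_f1 / a2_f2)
```

Because it compares ratios, this distance is unchanged when one spectrum is scaled by a constant factor, unlike a distance on raw amplitudes.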
  • Each of the distances d1-d3 is a valid distance metric between sequences and, therefore, can provide useful information regarding relationships between sequences of amino acids and sequences of genetic data.
  • the distance metrics d1-d3, because they rely on electronic biology properties of amino acid sequences, can be more sensitive, in some instances, to the position of a mutation and the type of substituted residue, and they can be more sensitive to small, yet biologically significant, mutations or deletions.
  • the mutation assessment system 102 can use the distance matrix to construct a tree (step 412).
  • the tree can be a phylogenetic tree.
  • the phylogenetic tree constructed by the mutation assessment system 102 can group sequences based on electronic properties that may relate to one or more specific biological characteristics.
  • the mutation assessment system 102 can, in some embodiments, apply an agglomerative hierarchical clustering algorithm on the distance matrix.
  • the mutation assessment system 102 can apply an unweighted pair group method with arithmetic mean (UPGMA), a neighbor joining (NJ) method, a Fitch-Margoliash algorithm, or an ensemble of methods to construct the tree.
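Tree construction from a distance matrix can be sketched with a compact UPGMA routine. This is a plain illustration of the textbook algorithm, not the implementation of the disclosure: it returns only the tree topology as nested tuples, omits branch lengths, and breaks ties by insertion order. The function name and the tuple representation are assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def upgma(matrix: Sequence[Sequence[float]], names: Sequence[str]):
    """Build a tree (nested tuples of names) by repeatedly merging the closest clusters."""
    if len(matrix) != len(names):
        raise ValueError("matrix and names must have the same size")
    # cluster id -> (subtree, number of leaves in the subtree)
    clusters: Dict[int, Tuple[object, int]] = {i: (names[i], 1) for i in range(len(names))}
    # distances keyed by (smaller id, larger id)
    dist: Dict[Tuple[int, int], float] = {
        (i, j): matrix[i][j] for i in range(len(names)) for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # Size-weighted mean equals the arithmetic mean over all leaf pairs.
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree
```

For the three-sequence matrix `[[0, 3, 5], [3, 0, 2], [5, 2, 0]]` with names `A`, `B`, `C`, the routine first joins `B` and `C` (distance 2) and then attaches `A` at the averaged distance of 4.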
  • the process of building the tree can be efficient.
  • the process of converting amino acid sequences to sequences of EIIP values (step 404), converting sequences of EIIP values to the frequency domain (step 406), determining values at characteristic frequencies (step 408), calculating a distance matrix (step 410), and constructing a phylogenetic tree (step 412) can have a computational complexity of O(N(N + L log(L))) for the d1 and d2 distances, and O(NL(N + log(L))) for the d3 distance, where N is the number of sequences and L is the length of the longest sequence.
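  • The tree-construction step can be illustrated with a small UPGMA sketch that works directly on a distance matrix. The function name, the nested-tuple tree representation, and the tie-breaking behavior are assumptions made for illustration; a neighbor joining or Fitch-Margoliash implementation would differ, and this naive version is not tuned for the complexity stated above.

```python
def upgma(labels, dist):
    """Build a UPGMA tree from a symmetric distance matrix.

    labels: list of N sequence names; dist: N x N list of lists.
    Returns a nested tuple such as ("s3", ("s1", "s2")).
    """
    n = len(labels)
    # Each active cluster id maps to (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances between active clusters, keyed by (low id, high id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        a, b = min(d, key=d.get)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # UPGMA update: size-weighted average of the old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every distance that involves a merged cluster.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


matrix = [[0, 1, 4],
          [1, 0, 6],
          [4, 6, 0]]
print(upgma(["a", "b", "c"], matrix))
```

  • In this toy matrix the sequences a and b are closest, so they are joined first, and c attaches to that pair at the averaged distance of 5.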
  • the mutation assessment system 102 can cluster the sequences (step 414). For example, the mutation assessment system 102 can, based at least in part on the phylogenetic tree, cluster the sequences into two or more groups. To do so, two or more clusters can be identified in the phylogenetic tree. Identifying the two or more clusters in the tree can be performed, for example, by using a reference sequence, a known characteristic of one or more sequences, expert knowledge, or a combination of various techniques.
  • the EIIP sequences (and the amino acid sequences they represent) can, in some embodiments, be separated into clusters based on whether they are at risk of being associated with one or more predetermined biological characteristics, such as the biological characteristics that correspond with the one or more characteristic frequencies.
  • One cluster may include sequences that are at risk of being associated with the biological characteristic, and another cluster may include sequences that are not at risk of being associated with the biological characteristic.
  • the mutation assessment system 102 may cluster the sequences into more than two groups, including for example, a group of sequences for which it is more uncertain whether the sequences are associated with the predefined biological characteristic.
  • the relevant biological characteristic may be an ability to escape a vaccine-induced immune system response. Furthermore, it may be determined that this biological characteristic is associated with one or more characteristic frequencies. Having constructed a distance matrix using amplitude values at these characteristic frequencies, including, in some examples, mathematical transformations of these amplitude values, the mutation assessment system 102 may, in some embodiments, use the distance matrix to construct a tree and then, using the tree, cluster the sequences. One cluster may include sequences associated with a risk of having the ability to escape immune system recognition, and another cluster may include sequences that are not at risk of being associated with that characteristic.
  • the mutation assessment system 102 can, in some embodiments, label the sequences (step 416). For example, the mutation assessment system 102 can assign a positive label to sequences at risk of being associated with a biological trait, and the mutation assessment system 102 can, in some embodiments, assign a negative label to those sequences that are not at risk of being associated with the biological trait.
  • the mutation assessment system 102 can arrange the training data for training a machine learning model (step 418).
  • the training data can include a plurality of training instances and there can be, for each training instance, training attributes and a label.
  • each of the EIIP sequences (or, in some embodiments, amino acid sequences) can be a training instance.
  • the training attributes can be, for each sequence, the amplitude values in the energy density spectrum at, for example, the characteristic frequencies.
  • the label for each sequence can be, for example, the positive or negative label assigned to the sequence based, for example, on whether the sequence is at risk for being associated with a certain biological characteristic.
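  • One possible way to arrange the training data is sketched below: each sequence becomes an instance that carries its amplitude values as attributes and a binary label. The class name, field names, and the idea of passing the positive cluster as a set of identifiers are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


@dataclass
class TrainingInstance:
    sequence_id: str
    attributes: List[float]  # amplitude values at the characteristic frequencies
    label: int               # 1 = positive (at risk), 0 = negative


def arrange_training_data(amplitudes: Dict[str, Sequence[float]],
                          positive_ids: Set[str]) -> List[TrainingInstance]:
    """Pair each sequence's amplitude values with the label from clustering."""
    return [
        TrainingInstance(seq_id, list(values), 1 if seq_id in positive_ids else 0)
        for seq_id, values in amplitudes.items()
    ]


data = arrange_training_data(
    {"seq1": [0.8, 0.1], "seq2": [0.2, 0.3]},  # illustrative numbers only
    positive_ids={"seq1"},
)
print(data)
```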
  • Fig. 5 illustrates a schematized example execution 500 of generating training data.
  • Fig. 5 illustrates, for example, aspects of an example execution of the method 400.
  • the mutation assessment system 102 can perform aspects of the schematized example execution 500.
  • the characters and numbers in the example of Fig. 5 are for illustrative purposes.
  • the mutation assessment system 102 can receive a plurality of amino acid sequences 502. As shown, each of the plurality of amino acid sequences can be represented by a string of characters that represent amino acids. In the example shown, the plurality of amino acid sequences contains X number of amino acid sequences, starting with the amino acid sequences 1-3. In some embodiments, the mutation assessment system 102 can receive other data (e.g., nucleotide sequences) and convert that data into the plurality of amino acid sequences 502. In some embodiments, the plurality of amino acid sequences 502 can be non-redundant, and they can come from a protein of a virus or organism.
  • the mutation assessment system 102 can convert the plurality of amino acid sequences 502 into a plurality of sequences of EIIP values 504. To do so, the mutation assessment system 102 can, in some embodiments, convert each amino acid into a corresponding EIIP value, which can be a certain number that represents electronic biology properties of the amino acid.
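  • The conversion from amino acids to EIIP values can be sketched as a table lookup over one-letter residue codes. The EIIP numbers below are the values commonly published for the informational spectrum method and are not stated in this passage, so they are an assumption that should be checked against the table actually used; the function name is also hypothetical.

```python
# ASSUMPTION: commonly published EIIP values, keyed by one-letter amino acid code.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def to_eiip(sequence):
    """Convert an amino acid string into a list of EIIP values."""
    try:
        return [EIIP[residue] for residue in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]!r}") from None


print(to_eiip("NGLF"))
```

  • Ambiguous or non-standard residue codes raise an error in this sketch; a production pipeline would need an explicit policy for them.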
  • the mutation assessment system 102 can convert the plurality of EIIP sequences 504 to the frequency domain, as illustrated by the graphs 506 that represent the EIIP sequences in the frequency domain. To do so, the mutation assessment system 102 can apply a discrete Fourier transform to each of the sequences of EIIP values 504. As shown in the graphs 506, each of the sequences of EIIP values may, in some embodiments, have a different representation in the frequency domain. In the frequency domain, the x-axis of the graphs 506 can include frequencies.
  • the x-axis can include one or more characteristic frequencies, which can be, for example, predetermined frequency values that are relevant to a biological characteristic of interest of the virus or organism that the plurality of amino acid sequences 502 are drawn from.
  • the y-axis of the graphs can include amplitudes at those frequencies, including, in some examples, mathematical transformations of the amplitudes.
  • the mutation assessment system 102 can, in some embodiments, calculate the energy density spectrum as part of determining the amplitude values at the frequencies.
  • the mutation assessment system 102 can determine, for each of the sequences, one or more amplitude values at the one or more characteristic frequencies.
  • the amplitude values table 508 can include amplitude values, or mathematical transformations of amplitude values, at characteristic frequencies for each sequence.
  • the one or more amplitude values can, in some embodiments, be the one or more values on the y-axis for the one or more characteristic frequencies on the x-axis.
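  • The frequency-domain steps can be sketched with a direct discrete Fourier transform. The function names are hypothetical, and two conventions are assumed for illustration: a characteristic frequency is expressed in cycles per residue (0 to 0.5) and mapped to the nearest DFT bin, and the energy density is taken as the squared magnitude of each DFT coefficient.

```python
import cmath


def amplitude_spectrum(values):
    """Return |X(n)| for n = 0..N//2 of the DFT of a numeric sequence."""
    n_points = len(values)
    spectrum = []
    for n in range(n_points // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * n * k / n_points)
                    for k, v in enumerate(values))
        spectrum.append(abs(coeff))
    return spectrum


def energy_density_spectrum(values):
    """ASSUMPTION: energy density is the squared magnitude of each coefficient."""
    return [a * a for a in amplitude_spectrum(values)]


def amplitude_at(values, frequency):
    """Amplitude at the DFT bin nearest to a frequency in cycles per residue."""
    index = round(frequency * len(values))
    return amplitude_spectrum(values)[index]


signal = [1.0, 0.0, 1.0, 0.0]  # toy sequence with period 2, i.e. frequency 0.5
print(amplitude_at(signal, 0.5))
```

  • The direct sum costs O(L^2) per sequence; a fast Fourier transform would be needed to reach the L log(L) behavior mentioned elsewhere in the description.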
  • the mutation assessment system 102 can calculate a distance matrix 510.
  • the mutation assessment system 102 can, for each pair of sequences, calculate a distance between the one or more amplitude values of the sequences.
  • the mutation assessment system 102 can use, for example, the amplitude values table 508 to calculate a distance between each pair of the sequences 1-X, resulting in the distance matrix 510.
  • the mutation assessment system 102 can, depending on the situation, use one of a plurality of distance metrics to calculate the distance between each pair of sequences.
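  • Building the distance matrix from the amplitude values table can be sketched as a pairwise loop with a pluggable metric. The function names are hypothetical, and the example metric (an absolute difference of amplitudes at a single frequency) is an assumed stand-in rather than one of the disclosure's defined distances.

```python
def distance_matrix(amplitude_table, metric):
    """Symmetric matrix of pairwise distances.

    amplitude_table: one tuple of amplitude values per sequence.
    metric: callable taking two such tuples and returning a float.
    """
    n = len(amplitude_table)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = metric(amplitude_table[i], amplitude_table[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


def single_frequency_gap(a, b):
    # Hypothetical metric: absolute difference of amplitudes at one frequency.
    return abs(a[0] - b[0])


table = [(1.0,), (3.0,), (6.0,)]  # illustrative amplitudes for three sequences
print(distance_matrix(table, single_frequency_gap))
```

  • Passing the metric as an argument lets the same loop serve whichever distance is chosen for a given application.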
  • the mutation assessment system 102 can construct a phylogenetic tree 512.
  • the mutation assessment system 102 can use the distance matrix 510 to construct the phylogenetic tree 512.
  • the sequences can be grouped.
  • the mutation assessment system 102 can apply an agglomerative hierarchical clustering algorithm on the distance matrix 510.
  • the mutation assessment system 102 can apply an unweighted pair group method with arithmetic mean (UPGMA), a neighbor joining (NJ) method, a Fitch-Margoliash algorithm, or an ensemble of methods on the distance matrix 510 to construct the phylogenetic tree 512.
  • the mutation assessment system 102 can, in some embodiments, cluster the sequences.
  • the mutation assessment system 102, or a user of the mutation assessment system 102, may, using the phylogenetic tree 512 and, in some embodiments, a reference sequence, cluster the sequences into a positive group and a negative group.
  • the positive group may include sequences that are at risk of being associated with a certain biological characteristic.
  • the negative group may include sequences that are not at risk of being associated with the biological characteristic.
  • the mutation assessment system 102 can, in some embodiments, assign a positive or negative label to each sequence depending on how the sequences are clustered.
  • the mutation assessment system 102 can, in the example shown, arrange the training data 514.
  • the training data 514 can include a plurality of instances, which can be, for example, associated with the plurality of amino acid sequences 502.
  • each of the sequences can include, as training attributes, the one or more amplitude values at the one or more characteristic frequencies, including, in some examples, mathematical transformations of the amplitude values, and each of the sequences can include a label.
  • the training data 514 can be used, for example, to train a classification model, such as the classification model 516.
  • Fig. 6 is a flowchart of an example method 600 for creating a classification model.
  • the method 600 can be used, for example, by the mutation assessment system 102, by subcomponents of the mutation assessment system 102, or, in some examples, by a user of the mutation assessment system 102.
  • the mutation assessment system 102 can receive training data (step 602).
  • the mutation assessment system 102 can receive training data generated during the example method 400, described above in connection with Fig. 4.
  • the mutation assessment system 102 can receive other data for creating a classification algorithm.
  • the mutation assessment system 102 may combine data generated during the example method 400 with other data received from the genetic information database 104 or the user 106.
  • the mutation assessment system 102 can train a classification model (step 604).
  • the mutation assessment system 102 can train a machine learning algorithm to perform a binary classification task.
  • the machine learning model can be trained to receive one or more amplitude values, or mathematical transformations of amplitude values, and to output a positive or negative prediction.
  • the machine learning model can be trained with supervised learning using the training data.
  • the training data can include a plurality of sequences and, for each instance sequence, there can be instance amplitude values and an instance label.
  • training the machine learning model can include, in some embodiments, inputting training attributes (e.g., amplitude values or mathematical transformations of amplitude values) into the machine learning model, generating a prediction for whether the training attributes are associated with a positive or negative label, checking the prediction with the actual label, and adjusting parameters of the machine learning model accordingly. In some embodiments, this process can continue until the model converges or until the model performs sufficiently well.
  • the machine learning model can use a random forest.
  • the random forest can include, for example, a plurality of decision trees, each of which can be trained on a bootstrapped subset of the training data.
  • the machine learning model can use other techniques or models, such as a gradient boosting method, neural networks, or deep learning methods.
  • the machine learning model can include an ensemble of machine learning techniques and models.
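  • The idea of training many trees on bootstrapped subsets can be sketched with an ensemble of depth-one decision trees (stumps). This is a deliberately simplified stand-in for a random forest: it has no feature subsampling and no deeper trees, and every function name and default value below is an assumption made for illustration.

```python
import random


def fit_stump(rows, labels):
    """Find the (feature, threshold, polarity) split with the fewest errors."""
    best = None
    for f in range(len(rows[0])):
        # -inf allows a constant prediction when a sample has one class only.
        for t in [float("-inf")] + sorted({r[f] for r in rows}):
            for polarity in (1, 0):  # label predicted when value >= threshold
                errors = sum((polarity if r[f] >= t else 1 - polarity) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    f, t, polarity = stump
    return polarity if row[f] >= t else 1 - polarity


def fit_bagged_stumps(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_proba(forest, row):
    """Fraction of stumps that vote for the positive label."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


rows = [[0.10], [0.15], [0.20], [0.25], [0.75], [0.80], [0.85], [0.90]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_bagged_stumps(rows, labels)
print(predict_proba(forest, [0.95]), predict_proba(forest, [0.05]))
```

  • The vote fraction returned by predict_proba plays the role of the probability output that a threshold can later be applied to.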
  • the mutation assessment system 102 can validate the classification model (step 606).
  • the mutation assessment system 102 can, in some embodiments, apply a k-fold cross validation procedure to determine the performance of a trained machine learning model.
  • the mutation assessment system 102 can determine and evaluate a plurality of metrics, including, in some embodiments, an Area Under the Curve (AUC) performance, accuracy, precision, recall, F-score, specificity, and a Matthews correlation coefficient (MCC).
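  • Most of the listed metrics follow directly from the four confusion-matrix counts, as the sketch below shows. The function name and the convention of returning 0.0 when a denominator is zero are assumptions; AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute count-based metrics from true/false positives and negatives."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty cases

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_score": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```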
  • the k-fold validation procedure can be a 10-fold validation procedure.
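  • A k-fold split can be sketched as follows: the sample indices are shuffled once, dealt into k folds, and each fold serves in turn as the held-out set. The function name, the fixed seed, and the round-robin assignment of indices to folds are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

  • Every sample appears in exactly one held-out fold, so metrics averaged over the folds use each labeled sequence once for evaluation.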
  • the mutation assessment system 102 can generate a plurality of classification models using, for example, a variety of hyperparameters and combinations of hyperparameters, where the types of hyperparameters are determined by the underlying machine learning model.
  • the mutation assessment system 102 can select one of the plurality of classification models based, for example, on the results of the k-fold cross-validation procedure.
  • the mutation assessment system 102 can, in some embodiments, select a model having the best F-score.
  • the mutation assessment system 102 can define a threshold value for the classification model (step 608).
  • the output of the classification model can be a probability that an input corresponds with an amino acid sequence that is at risk of being associated with a certain biological trait.
  • the classification model can assign a positive label to the input if this probability is over a threshold value and a negative label to the input if this probability is below the threshold value, or vice versa.
  • the mutation assessment system 102 can select a threshold value that results in better performance by the classification model, as measured, for example, by performing a k-fold validation procedure or validating the classification model. In some embodiments, the mutation assessment system 102 can select, as the threshold value, the value that results in a maximum F-score.
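  • Selecting the threshold that maximizes the F-score can be sketched as a scan over candidate cut-offs. The function names are hypothetical, and two choices are assumed for illustration: candidates are the distinct predicted probabilities, and an input is labeled positive when its probability is at or above the threshold.

```python
def f_score(labels, predictions):
    """F-score (F1) from binary labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(labels, probabilities):
    """Return the candidate threshold with the highest F-score."""
    candidates = sorted(set(probabilities))
    return max(
        candidates,
        key=lambda t: f_score(labels, [1 if p >= t else 0 for p in probabilities]),
    )


labels = [0, 0, 1, 1, 1]
probabilities = [0.1, 0.6, 0.4, 0.7, 0.9]  # illustrative model outputs
print(best_threshold(labels, probabilities))
```

  • In this toy data a cut-off of 0.4 keeps all three positives while admitting one false positive, which gives the highest F-score (6/7) among the candidates.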
  • Fig. 7 is a flowchart of an example method 700 for assessing one or more input queries.
  • the method 700 can be used, for example, by the mutation assessment system 102, by subcomponents of the mutation assessment system 102, or, in some embodiments, by a user of the mutation assessment system 102.
  • the mutation assessment system 102 can receive an input query (step 702).
  • the input query can be a sequence or other data that can be converted into a sequence.
  • the input query can be an amino acid or nucleotide sequence.
  • the mutation assessment system 102 can receive the input query from the input query source 112 of Fig. 1.
  • the mutation assessment system 102 can receive a plurality of input queries.
  • the input queries can, in some embodiments, be a plurality of amino acid sequences, each of which is sampled from a protein of a different variant of a pathogen.
  • although the method 700 may be described with respect to one input query, the method 700 can, in some embodiments, be applied to a plurality of input queries.
  • the mutation assessment system 102 can convert an amino acid sequence associated with the input query into a sequence of EIIP values (step 704). If the mutation assessment system 102 received a plurality of input queries, then the mutation assessment system 102 can, in some embodiments, convert a plurality of amino acid sequences associated with the plurality of input queries into a plurality of sequences of EIIP values. To do the conversion, the mutation assessment system 102 can convert each amino acid into an associated EIIP value, as is described above in connection with Fig. 4.
  • In the example shown, the mutation assessment system 102 can convert one or more EIIP sequences to the frequency domain (step 706).
  • the mutation assessment system 102 can perform a discrete Fourier transform on the one or more sequences of EIIP values, as is described above in connection with Fig. 4. Furthermore, in some embodiments, the mutation assessment system 102 can, having converted one or more EIIP sequences to the frequency domain, convert the amplitudes to the energy density spectrum, as is described above in connection with step 404 of Fig. 4.
  • the mutation assessment system 102 can, for the input query converted to a sequence of EIIP values, determine one or more amplitude values at one or more characteristic frequencies (step 708). Furthermore, in some embodiments, the mutation assessment system 102 can determine a mathematical transformation of the one or more amplitude values. These one or more amplitude values may thereafter be used, for example, as input attributes in the machine learning model.
  • the characteristic frequencies may, in some embodiments, be values in the frequency domain that have been determined to be relevant to a biological characteristic (e.g., an ability to evade a vaccine-induced immune system response). Furthermore, the one or more amplitude values at these characteristic frequencies may include information related to whether a particular input query is at risk of being associated with the predetermined biological characteristic.
  • the mutation assessment system 102 can apply the classification model to the input query (step 710).
  • the mutation assessment system 102 can apply the trained machine learning model (e.g., trained in connection with Fig. 6) to the input attributes, which, as described above, can be one or more amplitude values, or, in some embodiments, a mathematical transformation of the one or more amplitude values, at one or more characteristic frequencies.
  • the mutation assessment system 102 can classify the input as positive or negative (step 712).
  • the classification model can, in some embodiments, output a probability, based on the input attributes, that the query input is at risk of being associated with a certain biological characteristic. Based on this probability, the mutation assessment system 102 can, in some embodiments, classify the input as positive or negative.
  • the mutation assessment system 102 can, for example, in response to classifying an input as at risk of being associated with a certain biological characteristic, flag the input for further review, automatically generate an assessment report, or transmit information to an output system or a user.
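  • The flow of method 700 can be sketched as a small function that chains the steps together. Everything named here is hypothetical: the feature extractor and the model are passed in as callables so the sketch stays independent of any particular implementation, and the stand-ins used in the example are toy functions rather than an EIIP pipeline or a trained classifier.

```python
def assess_query(sequence, featurize, predict_proba, threshold):
    """Chain steps 704-712 for one input query.

    featurize: callable mapping a sequence to input attributes.
    predict_proba: callable mapping attributes to a probability in [0, 1].
    """
    attributes = featurize(sequence)          # steps 704-708
    probability = predict_proba(attributes)   # step 710
    positive = probability >= threshold       # step 712
    return {
        "probability": probability,
        "label": "positive" if positive else "negative",
        "flag_for_review": positive,
    }


# Toy stand-ins, for illustration only.
toy_featurize = lambda seq: [seq.count("D") / len(seq)]
toy_model = lambda attrs: attrs[0]

print(assess_query("DDLL", toy_featurize, toy_model, threshold=0.4))
```

  • Returning a flag alongside the label leaves the follow-up action, such as review or report generation, to the caller.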
  • Fig. 8 illustrates a schematized example execution 800 of aspects of the present disclosure, including an example execution of the method 700 for assessing an input query.
  • the mutation assessment system 102 can perform aspects of the execution 800.
  • the characters and numbers in the example of Fig. 8 are for illustrative purposes.
  • the mutation assessment system 102 can receive an input query, which can be, for example, an input amino acid sequence 802.
  • the input amino acid sequence 802 can include a plurality of amino acids, such as asparagine, glycine, leucine, phenylalanine, and so on.
  • the amino acid sequence 802 can represent the primary structure of a protein from an organism or virus.
  • the mutation assessment system 102 can receive a plurality of non-redundant amino acid sequences.
  • the mutation assessment system 102 can convert the amino acid sequence 802 into a sequence of EIIP values 804. To do so, the mutation assessment system 102 can convert each amino acid into an electron-ion interaction potential (EIIP) value, as is described above.
  • the mutation assessment system 102 can convert the sequence of EIIP values 804 to the frequency domain 808. To do so, the mutation assessment system 102 can perform, for example, a discrete Fourier transform on the sequence of EIIP values 804, as is described above.
  • the x-axis can include frequency values, including the characteristic frequencies 806.
  • the y-axis values can be amplitudes at the frequencies.
  • the mutation assessment system 102 can calculate an energy density spectrum to determine the y-axis values, as is described above.
  • the mutation assessment system 102 can determine the one or more amplitude values 810-812 for the one or more frequency values 806 by using, for example, the data derived in the frequency domain 808.
  • the data discussed in connection with the elements 804-812 can, in some examples, be considered electronic properties of the amino acid sequence 802.
  • the example execution 800 can include a trained classification model 814, which may have been trained, for example, using one or more aspects of the present disclosure described in connection with Figs. 4-6.
  • the mutation assessment system 102 can, in some embodiments, apply the trained classification model 814 to the amplitude values 810-812.
  • the trained classification model 814 can then, in some embodiments, classify the instance associated with the amplitude values 810-812 as positive or negative.
  • a positive classification may indicate that the virus or organism associated with the amino acid sequence 802 is at risk of having a certain biological characteristic, and a negative classification may indicate that there is not such a risk.
  • the trained classification model 814 can output a probability that the virus or organism associated with the amino acid sequence 802 is at risk of being associated with the biological characteristic.
  • the mutation assessment system 102 may output data 816, which can include, for example, results from the trained classification model 814, analysis, or other information.
  • the mutation assessment system 102 can output the data 816 to a monitoring system that can automatically act in response to receiving certain results from the mutation assessment system (e.g., a determination that a virus or organism is at risk of having a certain biological characteristic or property).
  • aspects of the present disclosure can be used to detect an impact of a genetic change on a biological property for a wide range of viruses and organisms and a wide range of biological properties.
  • the following applications are example applications, and aspects of the present disclosure are not limited to these applications.
  • aspects of the present disclosure can be used to detect genetic mutations that may result in a pathogen being able to escape therapeutic effects of antiviral drugs or preventive effects of vaccines.
  • aspects of the present disclosure can be used to detect mutations in SARS-CoV-2, A/H3N2, or A/H5N1 that may result in decreased vaccine efficacy.
  • the mutation assessment system may, for example, receive a plurality of non-redundant amino acid sequences, each of which may relate to a spike protein of a different variant of SARS-CoV-2.
  • the mutation assessment system may, as is described above, convert the amino acid sequences to sequences of EIIP values and convert the sequences of EIIP values to the frequency domain.
  • the mutation assessment system may then construct a distance matrix by using an amplitude ratio distance, as described above in connection with equation (6), to determine a distance between each pair of sequences of EIIP values.
  • the mutation assessment system may then construct a phylogenetic tree by using the distance matrix.
  • the mutation assessment system can, for example, use the phylogenetic tree to group the sequences into two clusters, one for sequences that are vaccine resistant, which may be assigned a positive label, and another for sequences that are not vaccine resistant, which may be assigned a negative label. Then the mutation assessment system can, for example, train a machine learning model using an ensemble of a distributed random forest method and deep learning. The mutation assessment system may then evaluate the model using a 10-fold cross validation procedure and select a threshold that maximizes the F-score on a holdout set. The mutation assessment system can then, for example, apply the machine learning model to one or more query inputs, which may be amino acid sequences for variants of SARS-CoV-2 that were not used to generate training data. The mutation assessment system can then, in some embodiments, detect which of the one or more variants are sufficiently likely to be resistant to a vaccine.
  • the mutation assessment system received 2081 non-redundant SARS-CoV-2 protein sequences to generate training data and train a machine learning model.
  • the machine learning model had the following results when evaluated using a 10-fold cross validation procedure: AUC: 0.995; accuracy: 0.9914; precision: 0.9936; recall: 0.9959; F-score: 0.9948; specificity: 0.9692; and MCC: 0.9695.
  • the machine learning model correctly identified the mutations of H69del and V70del — which were mutations that were not in the training set — as variants that were potentially resistant to the vaccine.
  • the mutation assessment system can, for example, receive a plurality of sequences of amino acids for hemagglutinin proteins of variants of A/H3N2.
  • the mutation assessment system can create a distance matrix by using single frequency distances (e.g., as described in connection with equation (5)), construct a phylogenetic tree, label the sequences, and train a machine learning algorithm.
  • the mutation assessment system can, for example, receive a plurality of amino acid sequences for hemagglutinin proteins of variants of A/H5N1.
  • the mutation assessment system can create a distance matrix by using amplitude ratio distances (e.g., as described in connection with equation (6)), construct a phylogenetic tree, label the sequences, and train a machine learning model.
  • aspects of the present disclosure can be used to detect decreased enzyme activity. Specifically, in an example application, aspects of the present disclosure can be used to detect a decrease in enzyme lipoprotein lipase (LPL) activity and a risk for development of cardiovascular disease (CVD).
  • the mutation assessment system can construct a phylogenetic tree, label the sequences, and train a machine learning model.
  • aspects of the present disclosure can be used to detect mutations of epidermal growth factor receptors that may result in cancer.
  • the mutation assessment system can receive a plurality of amino acid sequences for epidermal growth factor receptors.
  • the mutation assessment system can create a distance matrix by using amplitude ratio distances (e.g., as described in connection with equation (6)), construct a phylogenetic tree, label the sequences, and train a machine learning model.
  • Fig. 9 illustrates an example system 900 with which disclosed systems and methods can be used.
  • the following can be implemented in one or more systems 900 or in one or more systems having one or more components of system 900: the mutation assessment system 102, the genetic information database 104, the user 106, the output system 108, the database 110, the input query source 112, the networks 120a-c, the training data generator 200, the tree builder 202, the cluster identifier 204, the classification model 206, the mutation analyzer 208, the user interface 210, the database 212, the classification model 516, the trained classification model 814, and other aspects of the present disclosure.
  • the system 900 can include a computing environment 902.
  • the computing environment 902 can be a physical computing environment, a virtualized computing environment, or a combination thereof.
  • the computing environment 902 can include memory 904, a communication medium 912, one or more processing units 914, a network interface 916, and an external component interface 918.
  • the memory 904 can include a computer readable storage medium.
  • the computer storage medium can be a device or article of manufacture that stores data and/or computer-executable instructions.
  • the memory 904 can include volatile and nonvolatile, transitory and non-transitory, removable and non-removable devices or articles of manufacture implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data.
  • computer storage media may include dynamic random access memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), reduced latency DRAM, DDR2 SDRAM, DDR3 SDRAM, solid state memory, read-only memory (ROM), electrically-erasable programmable ROM, optical discs (e.g., CD-ROMs, DVDs, etc.), magnetic disks (e.g., hard disks, floppy disks, etc.), magnetic tapes, and other types of devices and/or articles of manufacture that store data.
  • the memory 904 can store various types of data and software.
  • the memory 904 includes software application instructions 906, one or more databases 908, as well as other data 910.
  • the communication medium 912 can facilitate communication among the components of the computing environment 902.
  • the communication medium 912 can facilitate communication among the memory 904, the one or more processing units 914, the network interface 916, and the external component interface 918.
  • the communication medium 912 can be implemented in a variety of ways, including but not limited to a PCI bus, a PCI Express bus, an accelerated graphics port (AGP) bus, a serial Advanced Technology Attachment (ATA) interconnect, a parallel ATA interconnect, a Fiber Channel interconnect, a USB bus, a Small Computer System Interface (SCSI) interface, or another type of communication medium.
  • the one or more processing units 914 can include physical or virtual units that selectively execute software instructions, such as the software application instructions 906.
  • the one or more processing units 914 can be physical products comprising one or more integrated circuits.
  • the one or more processing units 914 can be implemented as one or more processing cores.
  • the one or more processing units 914 can be implemented as one or more separate microprocessors.
  • the one or more processing units 914 can include an application-specific integrated circuit (ASIC) that provides specific functionality.
  • the network interface 916 enables the computing environment 902 to send data to and receive data from a communication network.
  • the network interface 916 can be implemented as an Ethernet interface, a token-ring network interface, a fiber optic network interface, a wireless network interface (e.g., Wi-Fi), or another type of network interface.
  • the external component interface 918 enables the computing environment 902 to communicate with external devices.
  • the external component interface 918 can be a USB interface, a Thunderbolt interface, a Lightning interface, a serial port interface, a parallel port interface, a PS/2 interface, or another type of interface that enables the computing environment 902 to communicate with external devices.
  • the external component interface 918 enables the computing environment 902 to communicate with various external components, such as external storage devices, input devices, speakers, modems, media player docks, other computing devices, scanners, digital cameras, and fingerprint readers.
  • the components of the computing environment 902 can be spread across multiple computing environments 902.
  • one or more of instructions or data stored on the memory 904 may be stored partially or entirely in a separate computing environment 902 that is accessed over a network.
  • Aspects of the system 900 and the computing environment 902 can be protected using a robust security model.
  • users may be made to sign into the system using a directory service.
  • Connection and credential information can be externalized from jobs using an application programming interface.
  • Credentials can be stored in an encrypted repository in a secured operational data store database space.
  • Privileges can be assigned based on a collaboration team and mapped to a Lightweight Directory Access Protocol (LDAP) Group membership.
  • a self-service security model can be used to allow owners to assign others permissions on their objects (e.g., actions).
  • Each node may be configured to be capable of running the full system 900, such that the portal can run and schedule jobs and serve the portal user interface as long as a single node remains functional.
  • the environment 902 may include monitoring technology to determine when a node is not functioning so that an appropriate action can be taken.

Abstract

A method and system for analyzing genetic mutations is disclosed. Aspects of the present disclosure relate to a method and system for determining an impact of genetic changes on biological properties. For example, aspects of the present disclosure can receive a plurality of amino acid sequences, construct a phylogenetic tree based on electronic properties of the amino acid sequences, generate training data, train a machine learning model, and assess whether a query input is at risk of being associated with a certain biological characteristic.

Description

METHOD AND SYSTEM FOR ASSESSING AN IMPACT OF GENETIC CHANGES
ON BIOLOGICAL PROPERTIES
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is being filed on June 8, 2023, as a PCT International Patent Application that claims priority to and the benefit of U.S. Provisional Application No. 63/350,273, filed on June 8, 2022, which application is hereby incorporated by reference in its entirety.
BACKGROUND
[0002] As organisms and viruses evolve, their genetic makeup changes. Some genetic changes can lead to significant biological changes, while others may lead to insignificant or nonexistent biological changes or can neutralize previously present changes (compensatory mutations). Determining which genetic changes may result in biological changes can be useful, for example, as part of in silico analysis of genetic variants of organisms or viruses. For instance, such analysis may, among other things, focus limited resources and time into researching variants that may have particular biological characteristics.
[0003] However, given the vast genomes of organisms and viruses, and given the diverse, complex ways that genes are manifested in biological characteristics, it can be difficult to determine which genetic changes are biologically important and which are not. Although techniques exist for assessing genetic mutations, these techniques are limited. Techniques based on multiple sequence alignment (MSA) may be insensitive to the position of genetic mutations and may not accurately assess a functional significance of a genetic change. For example, variants with few — but significant — mutations may improperly be labeled to have similar function, whereas variants with many — but insignificant — mutations may be improperly marked as functionally different. Thus, there is a need for improved methods for assessing the impact of genetic changes on biological properties.
SUMMARY
[0004] Aspects of the present disclosure relate to assessing the impact of genetic mutations on biological properties. In particular, aspects of the present disclosure relate to a method and system that can generate training data for a machine learning model and train the machine learning model to identify genetic variants that are at risk of being associated with certain biological characteristics.
[0005] In an example aspect, a method for assessing genetic changes is disclosed. The method includes receiving a plurality of amino acid sequences; determining electronic properties for each sequence of the plurality of amino acid sequences, the electronic properties comprising one or more amplitude values for one or more characteristic frequencies; constructing, using the electronic properties, a phylogenetic tree; assigning, using the phylogenetic tree, a label to each sequence of the plurality of amino acid sequences; and training, using training data and machine learning algorithms, a classification model, wherein the training data includes, for each sequence of the plurality of amino acid sequences, the one or more amplitude values and the label.
[0006] In a second aspect, a method for assessing a biological impact of a genetic variant is disclosed. The method includes receiving an amino acid sequence; determining electronic properties of the amino acid sequence, the electronic properties including one or more amplitude values for one or more characteristic frequencies; and determining, using the electronic properties, whether the amino acid sequence is at risk of being associated with a biological characteristic; wherein determining, using the electronic properties, whether the amino acid sequence is at risk of being associated with a biological characteristic comprises applying a classification model to the one or more amplitude values.
[0007] In a third aspect, a mutation assessment system comprising a processor and a memory is disclosed. The memory can store instructions, wherein the instructions, when executed by the processor, cause the mutation assessment system to: receive a plurality of amino acid sequences; determine electronic properties for each sequence of the plurality of amino acid sequences, the electronic properties comprising one or more amplitude values for one or more characteristic frequencies; construct, using the electronic properties, a phylogenetic tree; assign, using the phylogenetic tree, a label to each sequence of the plurality of amino acid sequences; and train, using training data and machine learning algorithms, a classification model, wherein the training data includes, for each sequence of the plurality of amino acid sequences, the one or more amplitude values, their mathematical transformation, and the label.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Fig. 1 illustrates an example network environment in which aspects of the present disclosure can be implemented.
[0009] Fig. 2 illustrates a schematic representation of an example mutation assessment system.
[0010] Fig. 3 is a flowchart of an example method for performing aspects of the present disclosure.
[0011] Fig. 4 is a flowchart of an example method for generating training data.
[0012] Fig. 5 illustrates a schematic example execution of aspects of the present disclosure.
[0013] Fig. 6 is a flowchart of an example method for training a classification model.
[0014] Fig. 7 is a flowchart of an example method for assessing an input query.
[0015] Fig. 8 illustrates a schematic example execution of aspects of the present disclosure.
[0016] Fig. 9 illustrates a block diagram of an example computing system.
DETAILED DESCRIPTION
[0017] As briefly described above, aspects of the present disclosure relate to a system that can assess the impact of genetic mutations on biological properties. In example aspects, the mutation assessment system can generate training data, train a machine learning model, and determine a likelihood that an input sequence is associated with a certain biological trait.
[0018] In example aspects, the mutation assessment system can receive a plurality of sequences, such as a plurality of amino acid sequences. Each sequence of the plurality of amino acid sequences may, in some examples, be derived from a variant of a pathogen or a variant of a gene of any organism, including humans. The mutation assessment system can, in some embodiments, convert each of the amino acid sequences into a sequence of electron-ion interaction potential (EIIP) values (or electron-ion interaction pseudo-potential values). Then the mutation assessment system can, in some embodiments, convert each of the sequences of EIIP values to a frequency domain. As is further described below, the frequencies of the frequency domain, and their corresponding amplitudes, including mathematical transformations of the amplitudes, can relate to biological characteristics (e.g., an ability to escape immune system detection or drug treatment).
[0019] In example aspects, the mutation assessment system can determine, or can receive, one or more characteristic frequencies in the frequency domain, and the mutation assessment system can determine one or more amplitude values for the one or more characteristic frequencies. Having determined, for each sequence, the one or more amplitude values for the one or more characteristic frequencies, the mutation assessment system can, based on these values, and mathematical transformations of these values, group the sequences. To form the groups, the mutation assessment system can, in example aspects, generate a tree. To do so, the mutation assessment system can, in some embodiments, calculate a distance matrix based on amplitude values at characteristic frequencies. The distance matrix can include a distance between each pair of sequences in the plurality of sequences. Using the distance matrix, the mutation assessment system can then, in some embodiments, generate a phylogenetic tree that groups the sequences based on their distances from one another. Using the constructed phylogenetic tree, the mutation assessment system can, in some embodiments, cluster the sequences, assigning them, for example, a positive label or a negative label.
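The grouping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it uses one amplitude value per sequence, takes the absolute amplitude difference as the distance, and substitutes plain average-linkage agglomerative clustering (a UPGMA-style procedure) for the phylogenetic tree construction, whose algorithm the text does not specify. The function names, the sample amplitudes, and the rule for choosing which cluster is "negative" are all hypothetical.

```python
from itertools import combinations


def pairwise_distances(amplitudes):
    """Distance matrix from one amplitude value per sequence (absolute difference)."""
    n = len(amplitudes)
    return [[abs(amplitudes[i] - amplitudes[j]) for j in range(n)] for i in range(n)]


def average_linkage_two_clusters(dist):
    """Repeatedly merge the two closest clusters (average linkage) until two remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 2:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average distance between all members of cluster a and cluster b.
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            avg = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Hypothetical amplitude values at one characteristic frequency.
amplitudes = [0.10, 0.12, 0.11, 0.90, 0.95]
clusters = average_linkage_two_clusters(pairwise_distances(amplitudes))

# Assumption: the cluster containing sequence 0 is treated as "negative".
negative_cluster = next(c for c in clusters if 0 in c)
labels = ["negative" if i in negative_cluster else "positive"
          for i in range(len(amplitudes))]
```

In practice the cut point and the meaning of each cluster would come from the tree and from expert review rather than from a fixed rule like the one used here.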
[0020] In example aspects, the mutation assessment system can train a machine learning model using training data. The training data can include, for example, a plurality of training instances, each of which can have training attributes and a label. In some examples, each sequence can be a training instance, with the one or more amplitude values as training attributes and the positive or negative label assigned while clustering as the label. Thus, the mutation assessment system can train the machine learning model using, for each sequence, the amplitude values and the label, thereby creating a model that can receive one or more amplitude values associated with one or more characteristic frequencies and that can output a positive or a negative label.
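To make the shape of the training data concrete, the sketch below treats each sequence as a training instance with two amplitude values as attributes and a 0/1 label. It assumes logistic regression fitted by batch gradient descent purely as a simple stand-in for the machine learning model; the text does not prescribe this algorithm, and the function names, hyperparameters, and amplitude values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for k in range(n_features):
                grad_w[k] += error * x[k]
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x, threshold=0.5):
    """Return 1 (positive) if the predicted probability reaches the threshold."""
    prob = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
    return 1 if prob >= threshold else 0


# Hypothetical training instances: (A1, A2) amplitude pairs with 0/1 labels.
train_x = [(0.10, 0.80), (0.15, 0.75), (0.20, 0.90),
           (0.85, 0.20), (0.90, 0.10), (0.80, 0.15)]
train_y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(train_x, train_y)
```

Calling `predict(weights, bias, (a1, a2))` on a new pair of amplitude values then yields the positive or negative label for that input.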
[0021] In example aspects, the mutation assessment system can determine whether a query input is positive or negative. For example, the input can be an amino acid sequence of a protein from an organism or virus. In some embodiments, the mutation assessment system can convert the amino acid sequence into an EIIP sequence, and then convert the EIIP sequence to the frequency domain. Then, the mutation assessment system can, in some embodiments, determine one or more amplitude values at the characteristic frequencies and apply the machine learning model to the one or more amplitudes, or to mathematical transformations of the amplitudes, thereby determining a predicted positive or negative classification for the input. Whether the input receives a positive or negative label can, depending on the embodiment, indicate whether the organism or virus associated with the input sequence is at risk of being associated with a certain biological characteristic.
[0022] Certain embodiments of the present disclosure have certain technical features that make them particularly advantageous over existing tools. For example, certain embodiments of the present disclosure enable accurate and efficient assessments of whether a genetic mutation will be associated with a certain biological trait. For example, aspects of the present disclosure can help scientists and health officials determine whether a new genetic variant of a pathogen is likely to have a biological trait of interest, such as, for example, an ability to escape immune system recognition or an ability to infect humans. Yet still, aspects of the present disclosure are generalizable. Certain aspects can be used to determine an impact of genetic changes on biological characteristics for a variety of biological characteristics across a wide range of organisms and viruses.
[0023] Furthermore, by using EIIP values and characteristic frequencies, aspects of the present disclosure allow scientists to create phylogenetic trees that are tailored to analyze relevant genetic changes. Thus, aspects of the present disclosure can better detect a seemingly small genetic mutation if the genetic mutation is relevant to a biological characteristic and can keep two sequences grouped together, despite seemingly large genetic changes, if those genetic changes do not affect the relevant biological characteristic. Furthermore, aspects of the present disclosure can be used to actively monitor changing pathogens and, based on accessible genetic or amino acid sequences, efficiently identify variants of the pathogen that may have harmful biological characteristics. Thus, aspects of the present disclosure can be integrated into, and improve, a system for monitoring and responding to evolving pathogens. As will be apparent, these are only some of the advantages offered by aspects of the present disclosure.
[0024] Fig. 1 illustrates an example network 100 in which aspects of the present disclosure can be implemented. The network 100 includes a mutation assessment system 102, a genetic information database 104, a user 106, an output system 108, a database 110, an input source 112, and the networks 120a-c. In other embodiments, the network 100 can include more or fewer elements than those displayed in and described in connection with Fig. 1. Each network of networks 120a-c can be, for example, a wireless network, a wired network, a virtual network, the internet, or any other type of network. Furthermore, each network of the networks 120a-c can be divided into subnetworks, and the subnetworks can be different types of networks.
[0025] The mutation assessment system 102 can be a computer system or program that can be configured to analyze genetic information. As is further described below, the mutation assessment system 102 can, for example, combine electronic biology techniques, such as using EIIP values to represent amino acid sequences, and artificial intelligence techniques, such as a binary classification machine learning model, to analyze and assess genetic information. As is further described below, the mutation assessment system 102 can, in some embodiments, generate training data from a plurality of nucleotide or amino acid sequences, train a machine learning model using the training data, and use the machine learning model to analyze an impact of a genetic mutation on a biological characteristic. An example architecture of the mutation assessment system 102 is described in connection with Fig. 2.
[0026] The genetic information database 104, which can be coupled with the mutation assessment system 102 via the network 120a, can be a database that stores genetic data, data related to a particular organism’s or virus’s genome, or data related to proteins of a virus or organism. For example, the genetic information database 104 can include nucleotide sequences and amino acid sequences. In some embodiments, the genetic information database 104 can include data related to a particular pathogen, proteins of the pathogen, or variants of the pathogen. For example, the genetic information database 104 can include data related to SARS-CoV-2, and the genetic information database 104 can include a plurality of samples of SARS-CoV-2, including samples of various proteins and variants of SARS-CoV-2. In some embodiments, the genetic information database 104 can be a plurality of databases. In some embodiments, the genetic information database 104 can be a database associated with GISAID or with the National Center for Biotechnology Information. As shown in the example of Fig. 1, the genetic information database 104 can transmit genetic data 114 to the mutation assessment system 102. In some embodiments, the genetic data 114 can be a plurality of amino acid sequences. As is further described below, the mutation assessment system 102 can use the genetic data 114 to, for example, generate training data.
[0027] The user 106, the output system 108, and the database 110 can be coupled with the mutation assessment system 102 via, for example, the network 120b. The user 106 can, in some examples, be a researcher, scientist, public health official, or another individual that operates the mutation assessment system 102. The user 106 can, in some embodiments, receive data generated by the mutation assessment system 102, and the user 106 can input information into the mutation assessment system 102. For example, the user 106 may, among other things, use a phylogenetic tree constructed by the mutation assessment system 102 to cluster sequences, as is further described below. The output system 108 can receive data generated by the mutation assessment system 102 and use that data as part of another process or system, such as an analytics system or monitoring system. In some embodiments, the output system 108 can automatically act in response to receiving an output (e.g., an indication that an input sequence is sufficiently likely to be associated with a predefined biological characteristic). For example, the output system 108 may automatically cause further analysis of a sample, or the output system 108 may alert officials of the result.
[0028] The database 110 can be a database that is external to the mutation assessment system 102, and the database 110 can, in some embodiments, store data generated by the mutation assessment system 102 and provide the mutation assessment system 102 with information that may be required to generate training data, construct a machine learning model, assess an input sequence, or perform another operation. In some embodiments, the mutation assessment system 102 can generate an assessment 118 and transmit the assessment 118 to one or more of the user 106, the output system 108, or the database 110. The assessment 118 can include, for example, results and analysis for one or more input queries. In some embodiments, the assessment 118 can indicate, for example, whether there exists, out of a set of amino acid sequences from a set of virus variants, a virus variant that may have a certain biological characteristic, such as being resistant to a vaccine.
[0029] The input source 112, which can be connected to the mutation assessment system 102 via, for example, the network 120c, can be a database or system that can produce or send one or more input sequences to the mutation assessment system 102. For example, the input source 112 may have an amino acid or nucleotide sequence associated with a novel protein, virus, or organism, and the input source 112 may send such information as an input query 116 to the mutation assessment system 102. As is further described below, the mutation assessment system 102 can, in some examples, receive the input query 116 and can assess a likelihood of whether it is associated with a certain biological characteristic.
[0030] Fig. 2 illustrates a block diagram of an example implementation of the mutation assessment system 102. In the example of Fig. 2, the mutation assessment system 102 includes a plurality of components. For example, the mutation assessment system 102 can include a training data generator 200, a classification model 206, a mutation analyzer 208, a user interface 210, and a database 212. In the example shown, the training data generator 200 can include a tree builder 202 and a cluster identifier 204. The components of the mutation assessment system 102 are described as performing various aspects of the present disclosure; however, in some examples, the functions of the components can overlap, or the functions of the components can be performed by other components. Furthermore, the components of the mutation assessment system 102 can, depending on the embodiment, be located on the same computing system or on different computing systems. Furthermore, in some embodiments, the mutation assessment system 102 can have more or fewer components than those shown in the example of Fig. 2.
[0031] The training data generator 200 can, in some embodiments, receive genetic data and make training data from the genetic data. To do so, the training data generator 200 can, in some embodiments, have subcomponents, such as the tree builder 202 and the cluster identifier 204, which can be programs or systems that are configured to perform certain aspects of the present disclosure, including operations performed by the training data generator 200. As is further described below (e.g., in connection with Figs. 3-5), the training data generator 200 can, in some embodiments, receive a plurality of amino acid sequences (or, in some embodiments, the training data generator 200 can convert sequences of other genetic data into sequences of amino acids). In some embodiments, the training data generator 200 can convert the amino acid sequences into training data to, for example, train the classification model. To do so, the training data generator 200 can, in some embodiments, convert the sequences of amino acids to sequences of EIIP values, convert the sequences of EIIP values to the frequency domain, cluster the EIIP sequences based, at least in part, on their amplitude values at characteristic frequencies, including mathematical transformations of their amplitude values at characteristic frequencies, and label the EIIP sequences based, at least in part, on the clusters. Examples of the training data generator 200 are further described below.
[0032] The classification model 206 can, in some embodiments, be a process or system that incorporates a machine learning model for assessing a biological impact of a genetic mutation. As is further described below, the machine learning model can, in some embodiments, be a binary classification model that, in some embodiments, determines a probability and, based on the probability and a threshold value, classifies an input as positive or negative. In some examples, the machine learning model can include decision trees, k-nearest neighbor algorithms, neural networks, Bayes classifiers, or other machine learning algorithms that can perform classification tasks. In example embodiments, the machine learning model can use an ensemble of machine learning methods, including random forests, gradient boost methods, and deep learning. The mutation analyzer 208 can, in some embodiments, be a program or system that can receive one or more results from the classification model 206, or other genetic mutation analysis systems, and provide additional analysis for an input. Furthermore, in some embodiments, the mutation analyzer 208 can generate a report, such as the assessment 118 of Fig. 1, which may provide results and analysis for one or more input sequences.
[0033] The user interface 210 can, in some embodiments, be used by a user to access the mutation assessment system 102 and to input data into, or receive data from, the mutation assessment system 102. The database 212 can be used to store information that is generated by the mutation assessment system 102 or data that can be used by the mutation assessment system 102.
[0034] Fig. 3 is a flowchart of an example method 300 that can be used, for example, by the mutation assessment system 102, or a user that is using the mutation assessment system 102. In the example shown, the mutation assessment system 102 can generate training data (step 302), create a classification model (step 304), and assess an input query (step 306). The illustrated steps of the method 300 are further described below in connection with Figs. 4-8. For instance, an example of generating training data (step 302) is described in connection with Figs. 4-5; an example of creating a classification model (step 304) is described in connection with Fig. 6; and an example of assessing an input query (step 306) is described in connection with Figs. 7-8. In other examples, the method 300 can have more or fewer steps than those illustrated in the example of Fig. 3.
[0035] Fig. 4 is a flowchart of an example method 400 for generating training data. In some embodiments, the mutation assessment system 102, including subcomponents of the mutation assessment system 102 (e.g., described in connection with Fig. 2), can perform aspects of the method 400, or a user (e.g., the user 106 of Fig. 1) can use the mutation assessment system 102 to perform aspects of the method 400.
[0036] In the example shown, the mutation assessment system 102 can receive a plurality of amino acid sequences (step 402). For example, the mutation assessment system 102 can receive a plurality of amino acid sequences from the genetic information database 104. In some embodiments, the mutation assessment system 102 can receive other genetic data (e.g., nucleotide sequences), and convert that genetic data into amino acid sequences. In some examples, the amino acid sequences can be associated with a certain protein in a virus or organism. For example, the amino acid sequences may be for variants of the S1 spike protein of SARS-CoV-2. In some embodiments, the amino acid sequences can be for different proteins of different viruses or organisms. Furthermore, in some embodiments, the mutation assessment system 102 can normalize the length of the amino acid sequences by trimming or padding them to ensure that they are the same length. Similarly, in some embodiments, the mutation assessment system 102 can normalize the length of the plurality of sequences after converting the amino acid sequences to sequences of EIIP values, which is described below.
[0037] Long-range molecular interactions can be caused by electronic properties of molecules. The electronic properties of an amino acid sequence, for instance, can include data, values, or information that are determined using methods related to electronic biology. For example, as is further described below, the electronic properties for an amino acid sequence can include electron-ion interaction potential values, amplitude values or frequency values related to electronic biology, mathematical transformations of values, and other properties.
[0038] In the example shown, the mutation assessment system 102 can convert the plurality of amino acid sequences to sequences of EIIP values (step 404). An electron-ion interaction potential (EIIP) (or electron-ion interaction pseudo-potential) value can represent the main energy term of valence electrons. For organic molecules, such as amino acids, the EIIP can be based on the number of delocalized electrons, represented by an average quasivalence number (AQVN). For a particular molecule, the EIIP value can be calculated using the following equation (1):
$$W = 0.25\,\frac{Z^{*}\sin(1.04\,\pi Z^{*})}{2\pi} \qquad (1)$$
[0039] Where Z* is the AQVN of the molecule determined by the following equation (2):
$$Z^{*} = \frac{1}{N}\sum_{i=1}^{m} n_i Z_i \qquad (2)$$
[0040] Where m is the number of atomic components in the molecule, Z_i is the valence number of the i-th atomic component, n_i is the number of atoms of the i-th component, and N is the total number of atoms.
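As a rough illustration of equations (1) and (2), the sketch below computes the AQVN of a molecule from its atom counts and then the EIIP. It assumes the form of equation (1) commonly given in the EIIP literature, W = 0.25 Z* sin(1.04 π Z*) / (2π); the `VALENCE` table, the function names, and the C4H9 example are assumptions introduced for illustration only.

```python
import math

# Valence numbers for common elements in organic molecules (assumed values).
VALENCE = {"H": 1, "C": 4, "N": 5, "O": 6, "S": 6}


def aqvn(atom_counts):
    """Average quasivalence number Z* = (1/N) * sum(n_i * Z_i)."""
    total_atoms = sum(atom_counts.values())
    return sum(n * VALENCE[el] for el, n in atom_counts.items()) / total_atoms


def eiip(z_star):
    """EIIP in Rydbergs, assuming W = 0.25 * Z* * sin(1.04 * pi * Z*) / (2 * pi)."""
    return 0.25 * z_star * math.sin(1.04 * math.pi * z_star) / (2 * math.pi)


# Hypothetical example: a C4H9 group (4 carbon atoms, 9 hydrogen atoms).
z = aqvn({"C": 4, "H": 9})   # (4*4 + 9*1) / 13 = 25/13
w = eiip(z)                  # 1.04 * 25/13 = 2, so the sine term is ~0
```

For this group the argument of the sine is a multiple of 2π, so the computed EIIP is zero up to floating-point rounding.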
[0041] Using the above formulas (1) - (2), the EIIP value for an amino acid can be calculated and expressed in Rydberg (Ry) units. Thus, an amino acid sequence having N residues can, in some examples, be converted to an N-length sequence of EIIP values. This EIIP signal can, for example, characterize electronic biology properties of the primary sequence of the protein that the amino acid sequence is drawn from. The following table illustrates the EIIP value for twenty amino acids:
Figure imgf000013_0001
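The conversion from an amino acid sequence to an EIIP signal can be sketched as a table lookup. The values below are the ones commonly cited in the EIIP literature for the twenty standard amino acids and are an assumption here, since the table above is only available as an image; the function names and the zero fill value used for padding are likewise hypothetical.

```python
# Commonly cited EIIP values (in Rydbergs) for the twenty standard amino acids.
# Assumption: taken from the general EIIP literature, not from this document.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def to_eiip(sequence):
    """Convert a one-letter amino acid string to a list of EIIP values."""
    return [EIIP[residue] for residue in sequence.upper()]


def pad_or_trim(values, length, fill=0.0):
    """Normalize length by trimming or padding (the fill value is an assumption)."""
    return (values + [fill] * length)[:length]


# Hypothetical six-residue fragment, padded to a common length of 8.
signal = pad_or_trim(to_eiip("MFVFLV"), 8)
```

Residues outside the twenty standard letters raise a `KeyError` in this sketch; a real pipeline would need an explicit policy for ambiguous or unknown residues.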
[0042] In the example shown, the mutation assessment system 102, having converted sequences of amino acid sequences to sequences of EIIP values, can convert the sequences of EIIP values to a frequency domain (step 406). For example, the mutation assessment system 102 can apply a discrete Fourier transform (e.g., by using Fast Fourier Transform, Wavelet Transform, or another algorithm) to convert each of the sequences of EIIP values to the frequency domain. When applying a discrete Fourier transform, it can be assumed, in some examples, that the points in the EIIP sequences are equidistant with a distance of d = 1. Furthermore, in some embodiments, the maximal frequency can be F = 1/(2d) = 0.5. In some embodiments, the discrete Fourier transform can be defined by the following equation (3):
$$X(n) = \sum_{m=1}^{N} x(m)\, e^{-j 2\pi n m / N}, \qquad n = 1, 2, \ldots, N/2 \qquad (3)$$
[0043] Where x(m) is the m-th member of a given numerical series (e.g., a sequence of EIIP values), X(n) is the n-th coefficient of the discrete Fourier transformation, and N is the total number of points in the series. Thus, following the discrete Fourier transform, the mutation assessment system 102 can, for each sequence of EIIP values, have a frequency domain representation, which can include amplitude, frequency, and phase information of sinusoids that represent the original EIIP sequence. In example embodiments, the mutation assessment system can calculate an energy density spectrum from the Fourier coefficients, which can, in some examples, be defined by the following equation (4):
$$S(n) = X(n)\,X^{*}(n) = |X(n)|^{2}, \qquad n = 1, 2, \ldots, N/2 \qquad (4)$$
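The transform and the energy density spectrum can be sketched directly from their definitions. The code below is a plain O(N²) discrete Fourier transform written for clarity rather than speed, and it assumes the conventional forms X(n) = Σ x(m) e^(−j2πnm/N) and S(n) = |X(n)|² for n = 1..N/2; the function names and the test signal are hypothetical.

```python
import cmath


def dft(signal):
    """Coefficients X(n) for n = 1..N//2 of a real-valued series x(1..N)."""
    n_points = len(signal)
    coeffs = []
    for n in range(1, n_points // 2 + 1):
        coeffs.append(sum(
            x * cmath.exp(-2j * cmath.pi * n * m / n_points)
            for m, x in enumerate(signal, start=1)
        ))
    return coeffs


def energy_density(coeffs):
    """S(n) = X(n) * conj(X(n)) = |X(n)|^2."""
    return [abs(c) ** 2 for c in coeffs]


# Hypothetical signal repeating with period 4, so its energy peaks at f = 0.25.
signal = [0.1, 0.0, -0.1, 0.0] * 4          # N = 16
spectrum = energy_density(dft(signal))
frequencies = [n / len(signal) for n in range(1, len(signal) // 2 + 1)]
peak = frequencies[spectrum.index(max(spectrum))]
```

With point spacing d = 1, bin n corresponds to frequency n/N, so the highest bin sits at the maximal frequency 0.5. For sequences of realistic length, a Fast Fourier Transform routine would replace the double loop.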
[0044] In the example shown, the mutation assessment system 102 can determine one or more values at one or more characteristic frequencies (step 408). As described above, a characteristic frequency can be a value in the frequency domain that has been determined to be relevant to a biological characteristic. For a given set of amino acid sequences, the one or more characteristic frequencies can be determined, for example, by expert analysis or by other systems. For example, the characteristic frequencies may be determined by performing cross-spectrum analysis on a plurality of energy density spectra derived from amino acid sequences. In some examples, the amino acid sequences may be associated with proteins manifesting a certain biological characteristic. Thus, by performing a cross-spectrum analysis, it can be determined, in some examples, that there are one or more characteristic frequencies associated with that biological characteristic.
[0045] For a given characteristic frequency, in order to determine an amplitude value for that frequency, the mutation assessment system 102 can, in some embodiments, determine the amplitude value in the energy density spectrum for the characteristic frequency. For example, when analyzing genetic mutations for SARS-CoV-2, the characteristic frequencies can, in some examples, be F1 = 0.257 and F2 = 0.479. Thus, for an EIIP sequence transformed to the frequency domain and energy density spectrum, the amplitude values at these frequencies can be A1 = S(0.257) and A2 = S(0.479), where examples of determining S(n) are discussed above in connection with equations (3) - (4).
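Reading an amplitude at a characteristic frequency amounts to picking a bin from the energy density spectrum. The sketch below assumes that bin n corresponds to frequency n/N and that the nearest bin is used when the frequency does not fall exactly on one; the text does not state how off-bin frequencies are handled, and the function name and spectrum values are hypothetical.

```python
def amplitude_at(spectrum, frequency, n_points):
    """Amplitude at the spectrum bin nearest to `frequency`.

    Assumes spectrum[k] holds S(n) for n = k + 1 and that bin n corresponds
    to frequency n / n_points, with frequencies in (0, 0.5].
    """
    n = round(frequency * n_points)
    n = min(max(n, 1), len(spectrum))   # clamp to the available bins
    return spectrum[n - 1]


# Hypothetical spectrum for a sequence of N = 1000 points (500 bins).
n_points = 1000
spectrum = [0.0] * 500
spectrum[256] = 3.2    # bin n = 257, frequency 0.257
spectrum[478] = 1.6    # bin n = 479, frequency 0.479

F1, F2 = 0.257, 0.479
A1 = amplitude_at(spectrum, F1, n_points)
A2 = amplitude_at(spectrum, F2, n_points)
features = (A1, A2)
```

The resulting pair of amplitudes is the kind of per-sequence feature vector that the later distance and classification steps operate on.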
[0046] In the example shown, the mutation assessment system 102 can calculate a distance matrix (step 410). For example, the mutation assessment system 102 can determine a distance matrix for the plurality of amino acid sequences, which can, as described above, be converted to a plurality of sequences of EIIP values. Calculating the distance matrix can, in some embodiments, include determining a distance between every two sequences of the plurality of sequences (step 411). To calculate a distance between a pair of sequences, the mutation assessment system 102 can, in some embodiments, use the amplitude values at the characteristic frequencies, including mathematical transformations of the amplitude values, as described above, for example in connection with step 408.
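A distance matrix over all pairs of sequences can be sketched as below. The helper takes any symmetric metric as a callable; the absolute amplitude difference used in the example, the function name, and the sample values are assumptions chosen for illustration.

```python
def distance_matrix(items, metric):
    """Symmetric matrix of metric(items[i], items[j]) with a zero diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it across the diagonal.
            matrix[i][j] = matrix[j][i] = metric(items[i], items[j])
    return matrix


# Hypothetical per-sequence amplitudes at one characteristic frequency.
amplitudes = [1.0, 1.5, 4.0]
matrix = distance_matrix(amplitudes, lambda a, b: abs(a - b))
```

Because the metric is passed in, the same helper works when each item is a tuple of amplitudes at several characteristic frequencies.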
[0047] The mutation assessment system 102 can use one of a plurality of distance metrics to calculate a distance between a pair of sequences. One example of a distance metric can be a single frequency distance (d1), which can be the distance between amplitude values in the energy density spectrum at a characteristic frequency. For example, if X1 and X2 are two sequences (e.g., sequences of EIIP values or amino acid sequences), S1 and S2 are their corresponding energy density spectra, and F is a characteristic frequency, then the distance d1 can be defined by equation (5):
$$d_1(X_1, X_2) = \left| A_1(F) - A_2(F) \right| \tag{5}$$
[0048] Where A1(F) and A2(F) are the amplitudes at frequency F of spectra S1 and S2, respectively.
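The single frequency distance and the pairwise distance matrix of steps 410-411 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation; the function names and the amplitude values are hypothetical.

```python
def single_frequency_distance(a1, a2):
    """d1: absolute difference of two amplitude values at one frequency."""
    return abs(a1 - a2)


def distance_matrix(features, metric):
    """Symmetric matrix of pairwise distances between per-sequence features."""
    n = len(features)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(features[i], features[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Illustrative amplitude values A(F) for three sequences (made-up numbers).
amps = [4.0, 3.5, 1.0]
d1_matrix = distance_matrix(amps, single_frequency_distance)
```

Passing the metric as an argument lets the same matrix routine be reused with other distance metrics.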
[0049] As another distance metric, the mutation assessment system 102 can, in some embodiments, use an amplitude ratio distance (d2). The amplitude ratio distance can be used, in some examples, to infer information that corresponds to the transfer between two biological characteristics that relate to previously determined characteristic frequencies F1 and F2. For example, if X1 and X2 are two sequences, S1 and S2 are their corresponding spectra, and F1 and F2 are two characteristic frequencies, then the amplitude ratio distance (d2) between X1 and X2 can, in some embodiments, be defined by the following equation (6):
$$d_2(X_1, X_2) = \left| \frac{A_1(F_1)}{A_1(F_2)} - \frac{A_2(F_1)}{A_2(F_2)} \right| \tag{6}$$
[0050] Where A1(F1) and A1(F2) are amplitude values of spectrum S1 at characteristic frequencies F1 and F2, respectively, and A2(F1) and A2(F2) are amplitude values of spectrum S2 at characteristic frequencies F1 and F2, respectively.
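A minimal sketch of the amplitude ratio distance follows. It assumes that equation (6) is the absolute difference between the two amplitude ratios A(F1)/A(F2), consistent with the form of equation (5). The function name and the guard against a zero amplitude at F2 are illustrative additions.

```python
def amplitude_ratio_distance(seq1_amps, seq2_amps):
    """d2: |A1(F1)/A1(F2) - A2(F1)/A2(F2)| for (A(F1), A(F2)) pairs."""
    a1_f1, a1_f2 = seq1_amps
    a2_f1, a2_f2 = seq2_amps
    if a1_f2 == 0 or a2_f2 == 0:
        # The ratio is undefined when the amplitude at F2 is zero.
        raise ValueError("amplitude at F2 must be non-zero to form the ratio")
    return abs(a1_f1 / a1_f2 - a2_f1 / a2_f2)


# Made-up amplitude pairs (A(F1), A(F2)) for two sequences.
example = amplitude_ratio_distance((4.0, 2.0), (3.0, 1.0))
```

Because the metric depends only on the ratio of the two amplitudes, two spectra that differ by an overall scale factor have a distance of zero.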
[0051] As another distance metric, the mutation assessment system 102 can, in some embodiments, calculate a full spectrum distance (d3) between sequences. For example, if X1 and X2 are two sequences, S1 and S2 are their corresponding spectra, and M is a size of a subset of selected frequencies (where, if the frequency subset spans the whole spectrum, then M = N/2, where N is the length of the longest sequence), then the full spectrum distance (d3) can, in some embodiments, be defined by the following equation (7):
[Equation (7): full spectrum distance d3(X1, X2) over the M selected frequencies; published as image imgf000016_0001 and not reproduced in this text.]
[0052] Each of the distances d1-d3 is a valid distance metric between sequences and, therefore, can provide useful information regarding relationships between sequences of amino acids and sequences of genetic data. When analyzing a mutation, the distance metrics d1-d3, because they rely on electronic biology properties of amino acid sequences, can be more sensitive, in some instances, to the position of a mutation and the type of substituted residue, and they can be more sensitive to small, yet biologically significant, mutations or deletions.
[0053] In the example shown, the mutation assessment system 102 can use the distance matrix to construct a tree (step 412). In example embodiments, the tree can be a phylogenetic tree. In some embodiments, the phylogenetic tree constructed by the mutation assessment system 102 can group sequences based on electronic properties that may relate to one or more specific biological characteristics. For example, to construct the tree, the mutation assessment system 102 can, in some embodiments, apply an agglomerative hierarchical clustering algorithm on the distance matrix. For example, the mutation assessment system 102 can apply an unweighted pair group method with arithmetic mean (UPGMA), a neighbor joining (NJ) method, a Fitch-Margoliash algorithm, or an ensemble of methods to construct the tree.
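As one possible illustration of step 412, the sketch below builds a tree from a distance matrix using UPGMA, one of the methods named above. The nested-tuple tree representation, the function name, and the sample matrix are assumptions made for this example only.

```python
def upgma(names, dist):
    """Build a UPGMA tree; internal nodes are (left, right, height) tuples."""
    # Each active cluster: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {
        (i, j): dist[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Average-linkage update: size-weighted mean of the old distances.
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del d[(a, b)]
        # The merged node sits at half the distance between its children.
        clusters[next_id] = ((tree_a, tree_b, dab / 2), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Made-up distance matrix for three sequences.
matrix = [[0.0, 0.5, 3.0], [0.5, 0.0, 2.5], [3.0, 2.5, 0.0]]
tree = upgma(["s1", "s2", "s3"], matrix)
```

In this toy case the two closest sequences, s1 and s2, are joined first, and s3 attaches to that pair at a greater height, so the tree groups sequences with similar amplitude values.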
[0054] In some embodiments, the process of building the tree can be efficient. For example, the process of converting amino acid sequences to sequences of EIIP values (step 404), converting sequences of EIIP values to the frequency domain (step 406), determining values at characteristic frequencies (step 408), calculating a distance matrix (step 410), and constructing a phylogenetic tree (step 412) can have a computational complexity of O(N(N + L log(L))) for the d1 and d2 distances, and O(NL(N + log(L))) for the d3 distance, where N is the number of sequences and L is the length of the longest sequence.
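One way to see where these bounds can come from is sketched below. The derivation assumes that a fast Fourier transform is applied to each of the N sequences, that a d1 or d2 distance costs constant time per pair, that a d3 distance costs time proportional to L per pair, and that tree construction does not dominate the total. The last point is an assumption, since the cost of tree building depends on the algorithm chosen.

```latex
\begin{aligned}
T_{\text{transform}} &= O(N \cdot L \log L) \\
T_{d_1,\,d_2} &= O(N^2) + T_{\text{transform}}
             = O\big(N (N + L \log L)\big) \\
T_{d_3} &= O(N^2 \cdot L) + T_{\text{transform}}
        = O\big(N L (N + \log L)\big)
\end{aligned}
```

The conversion to EIIP values costs O(NL) and is absorbed by the transform term.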
[0055] In the example shown, the mutation assessment system 102 can cluster the sequences (step 414). For example, the mutation assessment system 102 can, based at least in part on the phylogenetic tree, cluster the sequences into two or more groups. To do so, two or more clusters can be identified in the phylogenetic tree. Identifying the two or more clusters in the tree can be performed, for example, by using a reference sequence, a known characteristic of one or more sequences, expert knowledge, or a combination of various techniques. The EIIP sequences (and the amino acid sequences they represent) can, in some embodiments, be separated into clusters based on whether they are at risk of being associated with one or more predetermined biological characteristics, such as the biological characteristics that correspond with the one or more characteristic frequencies. One cluster may include sequences that are at risk of being associated with the biological characteristic, and another cluster may include sequences that are not at risk of being associated with the biological characteristic. In some embodiments, moreover, the mutation assessment system 102 may cluster the sequences into more than two groups, including, for example, a group of sequences for which it is more uncertain whether the sequences are associated with the predefined biological characteristic.
[0056] For example, when analyzing mutations of SARS-CoV-2, the relevant biological characteristic may be an ability to escape a vaccine-induced immune system response. Furthermore, it may be determined that this biological characteristic is associated with one or more characteristic frequencies. Having constructed a distance matrix using amplitude values at these characteristic frequencies, including, in some examples, mathematical transformations of these amplitude values, the mutation assessment system 102 may, in some embodiments, use the distance matrix to construct a tree and then, using the tree, cluster the sequences. One cluster may include sequences associated with a risk of having the ability to escape immune system recognition, and another cluster may include sequences that are not at risk of being associated with that characteristic.
[0057] In the example shown, having clustered the sequences, the mutation assessment system 102 can, in some embodiments, label the sequences (step 416). For example, the mutation assessment system 102 can assign a positive label to sequences at risk of being associated with a biological trait, and the mutation assessment system 102 can, in some embodiments, assign a negative label to those sequences that are not at risk of being associated with the biological trait.
[0058] In the example shown, the mutation assessment system 102 can arrange the training data for training a machine learning model (step 418). As described above, the training data can include a plurality of training instances and there can be, for each training instance, training attributes and a label. In example embodiments, each of the EIIP sequences (or, in some embodiments, amino acid sequences) can be a training instance. Furthermore, in example embodiments, the training attributes can be, for each sequence, the amplitude values in the energy density spectrum at, for example, the characteristic frequencies. And the label for each sequence can be, for example, the positive or negative label assigned to the sequence based, for example, on whether the sequence is at risk for being associated with a certain biological characteristic.
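The arrangement of training data in step 418 can be pictured with the following sketch, in which each training instance pairs the amplitude values of a sequence with the label derived from its cluster. The class, the field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrainingInstance:
    sequence_id: str
    attributes: tuple  # amplitude values at the characteristic frequencies
    label: int         # 1 = positive (at risk), 0 = negative


def arrange_training_data(amplitudes, positive_ids):
    """Pair each sequence's amplitude values with its cluster-derived label."""
    return [
        TrainingInstance(seq_id, tuple(values), int(seq_id in positive_ids))
        for seq_id, values in amplitudes.items()
    ]


# Made-up amplitude values at two characteristic frequencies.
amplitudes = {"seq1": (4.0, 3.5), "seq2": (1.2, 3.9), "seq3": (3.8, 3.1)}
training_data = arrange_training_data(amplitudes, positive_ids={"seq1", "seq3"})
```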
[0059] Fig. 5 illustrates a schematized example execution 500 of generating training data. Fig. 5 illustrates, for example, aspects of an example execution of the method 400. In some embodiments, the mutation assessment system 102 can perform aspects of the schematized example execution 500. As will be understood, the characters and numbers in the example of Fig. 5 are for illustrative purposes.
[0060] In the example shown, the mutation assessment system 102 can receive a plurality of amino acid sequences 502. As shown, each of the plurality of amino acid sequences can be represented by a string of characters that represent amino acids. In the example shown, the plurality of amino acid sequences contains X number of amino acid sequences, starting with the amino acid sequences 1-3. In some embodiments, the mutation assessment system 102 can receive other data (e.g., nucleotide sequences) and convert that data into the plurality of amino acid sequences 502. In some embodiments, the plurality of amino acid sequences 502 can be non-redundant, and they can come from a protein of a virus or organism.
[0061] In the example shown, the mutation assessment system 102 can convert the plurality of amino acid sequences 502 into a plurality of sequences of EIIP values 504. To do so, the mutation assessment system 102 can, in some embodiments, convert each amino acid into a corresponding EIIP value, which can be a certain number that represents electronic biology properties of the amino acid.
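The conversion of an amino acid sequence into a sequence of EIIP values can be sketched as a table lookup. The values below are the commonly published EIIP values for the 20 standard amino acids and are assumed here for illustration; any conversion table given elsewhere in this disclosure takes precedence. The function name is hypothetical.

```python
# Commonly published EIIP values for the 20 standard amino acids.
# Assumption: these stand in for the conversion table of this disclosure.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def to_eiip(sequence):
    """Map a one-letter amino acid string to a list of EIIP values."""
    values = []
    for residue in sequence.upper():
        if residue not in EIIP:
            # Ambiguity codes, gaps, and stop symbols are rejected here.
            raise ValueError(f"no EIIP value for residue {residue!r}")
        values.append(EIIP[residue])
    return values


# Asparagine, glycine, leucine, phenylalanine.
eiip_values = to_eiip("NGLF")
```

How deletions, gaps, or ambiguous residues are handled is a design choice that this sketch does not settle; it simply rejects unknown symbols.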
[0062] In the example shown, the mutation assessment system 102 can convert the plurality of EIIP sequences 504 to the frequency domain, as illustrated by the graphs 506 that represent the EIIP sequences in the frequency domain. To do so, the mutation assessment system 102 can apply a discrete Fourier transform to each of the sequences of EIIP values 504. As shown in the graphs 506, each of the sequences of EIIP values may, in some embodiments, have a different representation in the frequency domain. In the frequency domain, the x-axis of the graphs 506 can include frequencies. The x-axis can include one or more characteristic frequencies, which can be, for example, predetermined frequency values that are relevant to a biological characteristic of interest of the virus or organism that the plurality of amino acid sequences 502 are drawn from. The y-axis of the graphs can include amplitudes at those frequencies, including, in some examples, mathematical transformations of the amplitudes. As described above, the mutation assessment system 102 can, in some embodiments, calculate the energy density spectrum as part of determining the amplitude values at the frequencies.
[0063] In the example shown, the mutation assessment system 102 can determine, for each of the sequences, one or more amplitude values at the one or more characteristic frequencies. For example, the amplitude values table 508 can include amplitude values, or mathematical transformations of amplitude values, at characteristic frequencies for each sequence. Graphically, the one or more amplitude values can, in some embodiments, be the one or more values on the y-axis for the one or more characteristic frequencies on the x-axis.
[0064] In the example shown, the mutation assessment system 102 can calculate a distance matrix 510. For example, the mutation assessment system 102 can, for each pair of sequences, calculate a distance between the one or more amplitude values of the sequences. Thus, the mutation assessment system 102 can use, for example, the amplitude values table 508 to calculate a distance between each pair of the sequences 1-X, resulting in the distance matrix 510. As described above in connection with steps 410-411 of Fig. 4, the mutation assessment system 102 can, depending on the situation, use one of a plurality of distance metrics to calculate the distance between each pair of sequences.
[0065] In the example shown, the mutation assessment system 102 can construct a phylogenetic tree 512. For example, the mutation assessment system 102 can use the distance matrix 510 to construct the phylogenetic tree 512. In the phylogenetic tree 512, the sequences can be grouped. In some embodiments, the mutation assessment system 102 can apply an agglomerative hierarchical clustering algorithm on the distance matrix 510. For example, the mutation assessment system 102 can apply an unweighted pair group method with arithmetic mean (UPGMA), a neighbor joining (NJ) method, a Fitch-Margoliash algorithm, or an ensemble of methods on the distance matrix 510 to construct the phylogenetic tree 512.
[0066] Having constructed the phylogenetic tree 512, the mutation assessment system 102 can, in some embodiments, cluster the sequences. For example, the mutation assessment system 102, or a user of the mutation assessment system 102, may, using the phylogenetic tree 512 and, in some embodiments, a reference sequence, cluster the sequences into a positive group and a negative group. In some embodiments, the positive group may include sequences that are at risk of being associated with a certain biological characteristic, and the negative group may include sequences that are not at risk of being associated with the biological characteristic. The mutation assessment system 102 can, in some embodiments, assign a positive or negative label to each sequence depending on how the sequences are clustered.
[0067] Having assigned a label to the sequences, the mutation assessment system 102 can, in the example shown, arrange the training data 514. In the example shown, the training data 514 can include a plurality of instances, which can be, for example, associated with the plurality of amino acid sequences 502. Furthermore, as shown in the training data 514, each of the sequences can include, as training attributes, the one or more amplitude values at the one or more characteristic frequencies, including, in some examples, mathematical transformations of the amplitude values, and each of the sequences can include a label. As described below, the training data 514 can be used, for example, to train a classification model, such as the classification model 516.
[0068] Fig. 6 is a flowchart of an example method 600 for creating a classification model. The method 600 can be used, for example by the mutation assessment system 102, by subcomponents of the mutation assessment system 102 or, in some examples, by a user of the mutation assessment system 102.
[0069] In the example shown, the mutation assessment system 102 can receive training data (step 602). For example, the mutation assessment system 102 can receive training data generated during the example method 400, described above in connection with Fig. 4. Furthermore, in some embodiments, the mutation assessment system 102 can receive other data for creating a classification algorithm. For example, the mutation assessment system 102 may combine data generated during the example method 400 with other data received from the genetic information database 104 or the user 106.
[0070] In the example shown, the mutation assessment system 102 can train a classification model (step 604). For example, the mutation assessment system 102 can train a machine learning algorithm to perform a binary classification task. In some embodiments, the machine learning model can be trained to receive one or more amplitude values, or mathematical transformations of amplitude values, and to output a positive or negative prediction. In some embodiments, the machine learning model can be trained with supervised learning using the training data. As described above (e.g., in connection with step 418 of Fig. 4), the training data can include a plurality of sequences and, for each instance sequence, there can be instance amplitude values and an instance label. For example, training the machine learning model can include, in some embodiments, inputting training attributes (e.g., amplitude values or mathematical transformations of amplitude values) into the machine learning model, generating a prediction for whether the training attributes are associated with a positive or negative label, checking the prediction with the actual label, and adjusting parameters of the machine learning model accordingly. In some embodiments, this process can continue until the model converges or until the model performs sufficiently well.
[0071] In some embodiments, the machine learning model can use a random forest. The random forest can include, for example, a plurality of decision trees, each of which can be trained on a bootstrapped subset of the training data. In other embodiments, the machine learning model can use other techniques or models, such as a gradient boosting method, neural networks, or deep learning methods. In some embodiments, the machine learning model can include an ensemble of machine learning techniques and models.
[0072] In the example shown, the mutation assessment system 102 can validate the classification model (step 606). For example, the mutation assessment system 102 can, in some embodiments, apply a k-fold cross validation procedure to determine the performance of a trained machine learning model. To evaluate the model, the mutation assessment system 102 can determine and evaluate a plurality of metrics, including, in some embodiments, an Area Under the Curve (AUC) performance, accuracy, precision, recall, F-score, specificity, and a Matthews correlation coefficient (MCC). Furthermore, in some embodiments, the k-fold validation procedure can be a 10-fold validation procedure. Furthermore, in some embodiments, the mutation assessment system 102 can generate a plurality of classification models using, for example, a variety of hyperparameters and combinations of hyperparameters, where the types of hyperparameters are determined by the underlying machine learning model. In the case where the mutation assessment system 102 generates a plurality of classification models, the mutation assessment system 102 can select one, for example, based on the results of the k-fold cross-validation procedure. For example, the mutation assessment system 102 can, in some embodiments, select a model having the best F-score.
[0073] In the example shown, the mutation assessment system 102 can define a threshold value for the classification model (step 608). In some embodiments, the output of the classification model can be a probability that an input corresponds with an amino acid sequence that is at risk of being associated with a certain biological trait. Thus, in some embodiments, the classification model can assign a positive label to the input if this probability is over a threshold value and a negative label to the input if this probability is below the threshold value, or vice versa. Furthermore, in some embodiments, the mutation assessment system 102 can select a threshold value that results in better performance by the classification model, as measured, for example, by performing a k-fold validation procedure or validating the classification model. In some embodiments, the mutation assessment system 102 can select, as the threshold value, the value that results in a maximum F-score.
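Several of the validation metrics named in step 606, and the F-score-maximizing threshold selection of step 608, can be sketched as follows. The sketch uses the standard definitions of precision, recall, specificity, F-score, and MCC; the function names, the candidate thresholds, and the predictions are made up for illustration.

```python
import math


def confusion_counts(probabilities, labels, threshold):
    """Count TP, FP, TN, FN when 'probability > threshold' means positive."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = p > threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, F-score, and MCC from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f_score": f_score, "mcc": mcc}


def best_threshold(probabilities, labels, candidates):
    """Return the candidate threshold with the highest F-score."""
    def f_score_at(t):
        return metrics(*confusion_counts(probabilities, labels, t))["f_score"]
    return max(candidates, key=f_score_at)


# Made-up held-out predictions and labels, for illustration only.
probs = [0.95, 0.80, 0.60, 0.40, 0.20]
truth = [1, 1, 0, 1, 0]
chosen = best_threshold(probs, truth, candidates=[0.3, 0.5, 0.7])
```

With these toy values, the lowest candidate threshold gives the highest F-score because it recovers every positive instance at the cost of a single false positive.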
[0074] Fig. 7 is a flowchart of an example method 700 for assessing one or more input queries. The method 700 can be used, for example, by the mutation assessment system 102, by subcomponents of the mutation assessment system 102, or, in some embodiments, by a user of the mutation assessment system 102.
[0075] In the example shown, the mutation assessment system 102 can receive an input query (step 702). In some embodiments, the input query can be a sequence or other data that can be converted into a sequence. For example, in some embodiments, the input query can be an amino acid or nucleotide sequence. In some embodiments, the mutation assessment system 102 can receive the input query from the input query source 112 of Fig. 1. Furthermore, in some embodiments, the mutation assessment system 102 can receive a plurality of input queries. For example, the input queries can, in some embodiments, be a plurality of amino acid sequences, each of which is sampled from a protein of a different variant of a pathogen. Thus, although aspects of the method 700 may be described with respect to one input query, the method 700 can, in some embodiments, be applied to a plurality of input queries.
[0076] In the example shown, the mutation assessment system 102 can convert an amino acid sequence associated with the input query into a sequence of EIIP values (step 704). If the mutation assessment system 102 received a plurality of input queries, then the mutation assessment system 102 can, in some embodiments, convert a plurality of amino acid sequences associated with the plurality of input queries into a plurality of sequences of EIIP values. To do the conversion, the mutation assessment system 102 can convert each amino acid into an associated EIIP value, as is described above in connection with Fig. 4.
[0077] In the example shown, the mutation assessment system 102 can convert one or more EIIP sequences to the frequency domain (step 706). To do so, the mutation assessment system 102 can perform a discrete Fourier transform on the one or more sequences of EIIP values, as is described above in connection with Fig. 4. Furthermore, in some embodiments, the mutation assessment system 102 can, having converted one or more EIIP sequences to the frequency domain, convert the amplitudes to the energy density spectrum, as is described above in connection with step 404 of Fig. 4.
[0078] In the example shown, the mutation assessment system 102 can, for the input query converted to a sequence of EIIP values, determine one or more amplitude values at one or more characteristic frequencies (step 708). Furthermore, in some embodiments, the mutation assessment system 102 can determine a mathematical transformation of the one or more amplitude values. These one or more amplitude values may thereafter be used, for example, as input attributes in the machine learning model. As described above, the characteristic frequencies may, in some embodiments, be values in the frequency domain that have been determined to be relevant to a biological characteristic (e.g., an ability to evade a vaccine-induced immune system response). Furthermore, the one or more amplitude values at these characteristic frequencies may include information related to whether a particular input query is at risk of being associated with the predetermined biological characteristic.
[0079] In the example shown, the mutation assessment system 102 can apply the classification model to the input query (step 710). For example, the mutation assessment system 102 can apply the trained machine learning model (e.g., trained in connection with Fig. 6) to the input attributes, which, as described above, can be one or more amplitude values, or, in some embodiments, a mathematical transformation of the one or more amplitude values, at one or more characteristic frequencies.
[0080] In the example shown, the mutation assessment system 102 can classify the input as positive or negative (step 712). For example, the classification model can, in some embodiments, output a probability, based on the input attributes, that the query input is at risk of being associated with a certain biological characteristic. Based on this probability, the mutation assessment system 102 can, in some embodiments, classify the input as positive or negative. Furthermore, in some embodiments, the mutation assessment system 102 can, for example, in response to classifying an input as at risk of being associated with a certain biological characteristic, flag the input for further review, automatically generate an assessment report, or transmit information to an output system or a user.
[0081] Fig. 8 illustrates a schematized example execution 800 of aspects of the present disclosure, including an example execution of the method 700 for assessing an input query. In some examples, the mutation assessment system 102 can perform aspects of the execution 800. As will be understood, the characters and numbers in the example of Fig. 8 are for illustrative purposes.
[0082] As illustrated in the example of Fig. 8, the mutation assessment system 102 can receive an input query, which can be, for example, an input amino acid sequence 802. As shown, the input amino acid sequence 802 can include a plurality of amino acids, such as asparagine, glycine, leucine, phenylalanine, and so on. In some examples, the amino acid sequence 802 can represent the primary structure of a protein from an organism or virus. Furthermore, in some examples, the mutation assessment system 102 can receive a plurality of non-redundant amino acid sequences.
[0083] In the example shown, the mutation assessment system 102 can convert the amino acid sequence 802 into a sequence of EIIP values 804. To do so, the mutation assessment system 102 can convert each amino acid into an electron-ion interaction potential (EIIP) value, as is described above.
[0084] In the example shown, the characteristic frequencies 806 can be F1 = 0.2 and F2 = 0.35. As described above, these characteristic frequencies 806 can, in some embodiments, be defined by a user. Furthermore, in some embodiments, the characteristic frequencies 806 can, as described above, be associated with a biological characteristic of interest in the organism or virus that the amino acid sequence 802 is drawn from.
[0085] In the example shown, the mutation assessment system 102 can convert the sequence of EIIP values 804 to the frequency domain 808. To do so, the mutation assessment system 102 can perform, for example, a discrete Fourier transform on the sequence of EIIP values 804, as is described above. Thus, in the frequency domain 808, the x-axis can include frequency values, including the characteristic frequencies 806. For the y-axis, the values can be amplitudes at the frequencies. Furthermore, in some embodiments, the mutation assessment system 102 can calculate an energy density spectrum to determine the y-axis values, as is described above.
[0086] In the example shown, the mutation assessment system 102 can determine the one or more amplitude values 810-812 for the one or more frequency values 806 by using, for example, the data derived in the frequency domain 808. In the example shown, the amplitude value for characteristic frequency F1 = 0.2 is 4.0, and the amplitude value for characteristic frequency F2 = 0.35 is 3.5. While generating training data and training a classification model, these amplitude values at the characteristic frequencies can, in some embodiments, be used, along with a plurality of other such values and sequences, by the mutation assessment system 102 to, for example, generate a phylogenetic tree, cluster sequences, and train the classification model, as described above. The data discussed in connection with the elements 804-812 can, in some examples, be considered electronic properties of the amino acid sequence 802.
[0087] In the example shown, there is a trained classification model 814, which may have been trained, for example, using one or more aspects of the present disclosure described, for example, in connection with Figs. 4-6. The mutation assessment system 102 can, in some embodiments, apply the trained classification model 814 to the amplitude values 810-812. The trained classification model 814 can then, in some embodiments, classify the instance associated with the amplitude values 810-812 as positive or negative. In some embodiments, a positive classification may indicate that the virus or organism associated with the amino acid sequence 802 is at risk of having a certain biological characteristic, and a negative classification may indicate that there is not such a risk. In some embodiments, the trained classification model 814 can output a probability that the virus or organism associated with the amino acid sequence 802 is at risk of being associated with the biological characteristic.
[0088] In the example shown, the mutation assessment system 102 may output data 816, which can include, for example, results from the trained classification model 814, analysis, or other information. In some examples, the mutation assessment system 102 can output the data 816 to a monitoring system that can automatically act in response to receiving certain results from the mutation assessment system (e.g., a determination that a virus or organism is at risk of having a certain biological characteristic or property).
[0089] Referring generally to Figs. 1-8 and their accompanying descriptions, aspects of the present disclosure can be used to detect an impact of a genetic change on a biological property for a wide range of viruses and organisms and a wide range of biological properties. Thus, the following applications are example applications, and aspects of the present disclosure are not limited to these applications.
[0090] In example applications, aspects of the present disclosure can be used to detect genetic mutations that may result in a pathogen being able to escape therapeutic effects of antiviral drugs or preventive effects of vaccines. Specifically, in example applications, aspects of the present disclosure can be used to detect mutations in SARS-CoV-2, A/H3N2, or A/H5N1 that may result in decreased vaccine efficacy.
[0091] In the case of detecting genetic mutations or deletions of SARS-CoV-2 that may result in a virus variant better able to avoid a vaccine-induced immune system response, the mutation assessment system may, for example, receive a plurality of non-redundant amino acid sequences, each of which may relate to a spike protein of a different variant of SARS-CoV-2. The mutation assessment system may, as is described above, convert the amino acid sequences to sequences of EIIP values and convert the sequences of EIIP values to the frequency domain. The mutation assessment system may determine one or more amplitude values in the energy density spectrum at characteristic frequencies F1 = 0.257 and F2 = 0.479. The mutation assessment system may then construct a distance matrix by using an amplitude ratio distance, as described above in connection with equation (6), to determine a distance between each pair of sequences of EIIP values. The mutation assessment system may then construct a phylogenetic tree by using the distance matrix.
[0092] Next, the mutation assessment system can, for example, use the phylogenetic tree to group the sequences into two clusters, one for sequences that are vaccine resistant, which may be assigned a positive label, and another for sequences that are not vaccine resistant, which may be assigned a negative label. Then the mutation assessment system can, for example, train a machine learning model using an ensemble of a distributed random forest method and deep learning. The mutation assessment system may then evaluate the model using a 10-fold cross validation procedure and select a threshold that maximizes the F-score on a holdout set. The mutation assessment system can then, for example, apply the machine learning model to one or more query inputs, which may be amino acid sequences for variants of SARS-CoV-2 that were not used to generate training data. The mutation assessment system can then, in some embodiments, detect which of the one or more variants are sufficiently likely to be resistant to a vaccine.
[0093] An example implementation of aspects of the present disclosure for detecting SARS-CoV-2 genetic mutations that resulted in vaccine resistance was shown to be effective. In the example implementation, the mutation assessment system received 2081 non-redundant SARS-CoV-2 protein sequences to generate training data and train a machine learning model. Using an F-score maximizing threshold value of 0.7111, the machine learning model had the following results when evaluated using a 10-fold cross validation procedure: AUC: 0.995; accuracy: 0.9914; precision: 0.9936; recall: 0.9959; F-score: 0.9948; specificity: 0.9692; and MCC: 0.9695. Furthermore, the machine learning model correctly identified the mutations of H69del and V70del, which were mutations that were not in the training set, as variants that were potentially resistant to the vaccine.
[0094] For influenza A virus subtype H3N2, the mutation assessment system can, for example, receive a plurality of sequences of amino acids for hemagglutinin proteins of variants of A/H3N2. Using aspects of the present disclosure, the mutation assessment system can cluster the sequences based on amplitude values at a characteristic frequency of F = 0.299. For example, the mutation assessment system can create a distance matrix by using single frequency distances (e.g., as described in connection with equation (5)), construct a phylogenetic tree, label the sequences, and train a machine learning algorithm. The mutation assessment system can then, for example, apply the machine learning algorithm to query amino acid sequences of H3N2 variants and detect variants that may, because of certain biological characteristics associated with the frequency F = 0.299, be resistant to a vaccine.
[0095] For influenza A virus subtype H5N1, the mutation assessment system can, for example, receive a plurality of amino acid sequences for hemagglutinin proteins of variants of A/H5N1. Using aspects of the present disclosure, the mutation assessment system can cluster the sequences based on amplitude values at characteristic frequencies F1 = 0.076 and F2 = 0.236. For example, the mutation assessment system can create a distance matrix by using amplitude ratio distances (e.g., as described in connection with equation (6)), construct a phylogenetic tree, label the sequences, and train a machine learning model. The mutation assessment system can then, for example, apply the machine learning model to query amino acid sequences of H5N1 variants and detect variants that may, because of biological characteristics associated with the frequencies F1 = 0.076 and F2 = 0.236, be resistant to a vaccine.
[0096] In another example, aspects of the present disclosure can be used to detect decreased enzyme activity. Specifically, in an example application, aspects of the present disclosure can be used to detect a decrease in enzyme lipoprotein lipase (LPL) activity and a risk for development of cardiovascular disease (CVD). For example, the mutation assessment system can receive a plurality of amino acid sequences for LPL mutations. Using aspects of the present disclosure, the mutation assessment system can cluster the sequences based on amplitude values at characteristic frequencies F1 = 0.033 and F2 = 0.168. To calculate a distance matrix, the mutation assessment system can calculate, for each pair of sequences, a full spectrum distance (e.g., as described in connection with equation (7)). For example, referring to equation (7), S1(1) and S2(1) are the amplitude values for a first and second sequence, respectively, at the frequency F1 = 0.033; S1(2) and S2(2) are the amplitude values for the first and second sequence, respectively, at the frequency F2 = 0.168; and M = 2.
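The construction of a pairwise distance matrix from per-sequence amplitude vectors of length M may be sketched as follows. Equation (7) is defined earlier in the disclosure and is not reproduced in this sketch; the Euclidean distance used below is only a stand-in, and any other distance function taking two amplitude vectors can be passed in its place. All names and numeric values are hypothetical.

```python
import math


def euclidean_distance(s1, s2):
    """Stand-in distance between two amplitude vectors of equal length M.
    This is NOT equation (7) of the disclosure; it is a placeholder."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))


def distance_matrix(amplitude_vectors, distance=euclidean_distance):
    """Symmetric matrix of distances between every pair of sequences,
    with zeros on the diagonal."""
    n = len(amplitude_vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(amplitude_vectors[i], amplitude_vectors[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Toy amplitude vectors with M = 2, i.e. (S(1), S(2)) for each sequence.
vectors = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = distance_matrix(vectors)
```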
[0097] Having created the distance matrix, the mutation assessment system can construct a phylogenetic tree, label the sequences, and train a machine learning model. The mutation assessment system can then, for example, apply the machine learning model to query amino acid sequences of enzyme LPL mutants and detect mutants that may, because of biological characteristics associated with the frequencies F1 = 0.033 and F2 = 0.168, be associated with decreased enzyme activity and a risk for development of cardiovascular disease.
[0098] In another example application, aspects of the present disclosure can be used to detect mutations of epidermal growth factor receptors that may result in cancer. For example, the mutation assessment system can receive a plurality of amino acid sequences for epidermal growth factor receptors. Using aspects of the present disclosure, the mutation assessment system can cluster the sequences based on amplitude values at characteristic frequencies F1 = 0.254 and F2 = 0.467. For example, the mutation assessment system can create a distance matrix by using amplitude ratio distances (e.g., as described in connection with equation (6)), construct a phylogenetic tree, label the sequences, and train a machine learning model. The mutation assessment system can then, for example, apply the machine learning model to query amino acid sequences of epidermal growth factor receptors and detect epidermal growth factor receptors that may, because of biological characteristics associated with the frequencies F1 = 0.254 and F2 = 0.467, increase a likelihood of cancer development.
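One possible way to group sequences from a distance matrix is agglomerative hierarchical clustering, sketched below. The average-linkage criterion and the choice to stop merging at two clusters are assumptions of this sketch; the disclosure does not require a particular linkage. Deciding which resulting cluster receives the positive label depends on external knowledge about the sequences and is not shown.

```python
def agglomerative_clusters(dist, n_clusters=2):
    """Average-linkage agglomerative clustering on a symmetric distance
    matrix. Returns `n_clusters` lists of row indices of `dist`."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise distance between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy distance matrix for four sequences forming two obvious groups.
toy = [
    [0.0, 1.0, 10.0, 11.0],
    [1.0, 0.0, 9.0, 10.0],
    [10.0, 9.0, 0.0, 1.0],
    [11.0, 10.0, 1.0, 0.0],
]
groups = agglomerative_clusters(toy, n_clusters=2)
```

The order in which clusters are merged defines the tree; recording each merge and its linkage distance would yield a dendrogram rather than only the final grouping.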
[0099] Fig. 9 illustrates an example system 900 with which disclosed systems and methods can be used. In an example, the following can be implemented in one or more systems 900 or in one or more systems having one or more components of system 900: the mutation assessment system 102, the genetic information database 104, the user 106, the output system 108, the database 110, the input query source 112, the networks 120a-c, the training data generator 200, the tree builder 202, the cluster identifier 204, the classification model 206, the mutation analyzer 208, the user interface 210, the database 212, the classification model 516, the trained classification model 814, and other aspects of the present disclosure.
[00100] In an example, the system 900 can include a computing environment 902. The computing environment 902 can be a physical computing environment, a virtualized computing environment, or a combination thereof. The computing environment 902 can include memory 904, a communication medium 912, one or more processing units 914, a network interface 916, and an external component interface 918.
[00101] The memory 904 can include a computer readable storage medium. The computer storage medium can be a device or article of manufacture that stores data and/or computer-executable instructions. The memory 904 can include volatile and nonvolatile, transitory and non-transitory, removable and non-removable devices or articles of manufacture implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. By way of example, and not limitation, computer storage media may include dynamic random access memory (DRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), reduced latency DRAM, DDR2 SDRAM, DDR3 SDRAM, solid state memory, read-only memory (ROM), electrically-erasable programmable ROM, optical discs (e.g., CD-ROMs, DVDs, etc.), magnetic disks (e.g., hard disks, floppy disks, etc.), magnetic tapes, and other types of devices and/or articles of manufacture that store data.
[00102] The memory 904 can store various types of data and software. For example, as illustrated, the memory 904 includes software application instructions 906, one or more databases 908, as well as other data 910. The communication medium 912 can facilitate communication among the components of the computing environment 902. In an example, the communication medium 912 can facilitate communication among the memory 904, the one or more processing units 914, the network interface 916, and the external component interface 918. The communication medium 912 can be implemented in a variety of ways, including but not limited to a PCI bus, a PCI Express bus, an accelerated graphics port (AGP) bus, a serial Advanced Technology Attachment (ATA) interconnect, a parallel ATA interconnect, a Fibre Channel interconnect, a USB bus, a Small Computer System Interface (SCSI) interface, or another type of communication medium.
[00103] The one or more processing units 914 can include physical or virtual units that selectively execute software instructions, such as the software application instructions 906. In an example, the one or more processing units 914 can be physical products comprising one or more integrated circuits. The one or more processing units 914 can be implemented as one or more processing cores. In another example, one or more processing units 914 are implemented as one or more separate microprocessors. In yet another example embodiment, the one or more processing units 914 can include an application-specific integrated circuit (ASIC) that provides specific functionality. In yet another example, the one or more processing units 914 provide specific functionality by using an ASIC and by executing computer-executable instructions.
[00104] The network interface 916 enables the computing environment 902 to send data to and receive data from a communication network. The network interface 916 can be implemented as an Ethernet interface, a token-ring network interface, a fiber optic network interface, a wireless network interface (e.g., Wi-Fi), or another type of network interface.
[00105] The external component interface 918 enables the computing environment 902 to communicate with external devices. For example, the external component interface 918 can be a USB interface, a Thunderbolt interface, a Lightning interface, a serial port interface, a parallel port interface, a PS/2 interface, or another type of interface that enables the computing environment 902 to communicate with external devices. In various embodiments, the external component interface 918 enables the computing environment 902 to communicate with various external components, such as external storage devices, input devices, speakers, modems, media player docks, other computing devices, scanners, digital cameras, and fingerprint readers.
[00106] Although illustrated as being components of a single computing environment 902, the components of the computing environment 902 can be spread across multiple computing environments 902. For example, one or more of instructions or data stored on the memory 904 may be stored partially or entirely in a separate computing environment 902 that is accessed over a network.
[00107] Depending on the size and scale of the computing environment 902, it may be advantageous to include one or more load balancers to balance traffic across multiple physical or virtual machine nodes.
[00108] Aspects of the system 900 and the computing environment 902 can be protected using a robust security model. In an example, users may be made to sign into the system using a directory service. Connection and credential information can be externalized from jobs using an application programming interface. Credentials can be stored in an encrypted repository in a secured operational data store database space. Privileges can be assigned based on a collaboration team and mapped to a Lightweight Directory Access Protocol (LDAP) Group membership. A self-service security model can be used to allow owners to assign others permissions on their objects (e.g., actions).
[00109] Each node may be configured to be capable of running the full system 900, such that the portal can run and schedule jobs and serve the portal user interface as long as a single node remains functional. The environment 902 may include monitoring technology to determine when a node is not functioning so an appropriate action can be taken.
[00110] While particular uses of the technology have been illustrated and discussed above, the disclosed technology can be used with a variety of data structures and processes in accordance with many examples of the technology. The above discussion is not meant to suggest that the disclosed technology is only suitable for implementation with the data structures shown and described above.
[00111] This disclosure described some aspects of the present technology with reference to the accompanying drawings, in which only some of the possible aspects were shown. Other aspects can, however, be embodied in many different forms and should not be construed as limited to the aspects set forth herein. Rather, these aspects were provided so that this disclosure was thorough and complete and fully conveyed the scope of the possible aspects to those skilled in the art.
[00112] As should be appreciated, the various aspects (e.g., operations, memory arrangements, etc.) described with respect to the figures herein are not intended to limit the technology to the particular aspects described. Accordingly, additional configurations can be used to practice the technology herein and/or some aspects described can be excluded without departing from the methods and systems disclosed herein.
[00113] Similarly, where operations of a process are disclosed, those operations are described for purposes of illustrating the present technology and are not intended to limit the disclosure to a particular sequence of operations. For example, the operations can be performed in differing order, two or more operations can be performed concurrently, additional operations can be performed, and disclosed operations can be excluded without departing from the present disclosure. Further, each operation can be accomplished via one or more sub-operations. The disclosed processes can be repeated.
[00114] Although specific aspects were described herein, the scope of the technology is not limited to those specific aspects. One skilled in the art will recognize other aspects or improvements that are within the scope of the present technology. Therefore, the specific structure, acts, or media are disclosed only as illustrative aspects. The scope of the technology is defined by the following claims and any equivalents therein.

Claims

1. A method for assessing genetic changes, the method comprising: receiving a plurality of amino acid sequences; determining electronic properties for each sequence of the plurality of amino acid sequences, the electronic properties comprising one or more amplitude values for one or more characteristic frequencies; constructing, using the electronic properties, a phylogenetic tree; assigning, using the phylogenetic tree, a label to each sequence of the plurality of amino acid sequences; and training, using training data, a classification model, wherein the training data includes, for each sequence of the plurality of amino acid sequences, the one or more amplitude values and the label.
2. The method of claim 1, further comprising classifying, using the classification model, an input amino acid sequence as positive or negative.
3. The method of claim 2, wherein classifying, using the classification model, the input amino acid sequence as positive or negative comprises: converting the input amino acid sequence into an input sequence of electron-ion interaction potential values; converting the input sequence of electron-ion interaction potential values to the frequency domain; determining, for the input sequence of electron-ion interaction potential values, one or more input amplitude values for the one or more characteristic frequencies; and applying the classification model to the one or more input amplitude values.
4. The method of claim 1, wherein determining electronic properties for each sequence of the plurality of amino acid sequences comprises: converting the plurality of amino acid sequences into a plurality of sequences of electron-ion interaction potential values; converting each sequence of the plurality of sequences of electron-ion interaction potential values to a frequency domain; and determining, for each sequence of electron-ion interaction potential values, the one or more amplitude values for the one or more characteristic frequencies in the frequency domain.
5. The method of claim 1, further comprising: determining the one or more characteristic frequencies by performing a cross-spectrum analysis.
6. The method of claim 1, wherein constructing the phylogenetic tree, using the electronic properties, comprises calculating a distance matrix for the plurality of amino acid sequences, the distance matrix comprising a distance between each pair of amino acid sequences of the plurality of amino acid sequences.
7. The method of claim 1, wherein assigning, using the phylogenetic tree, the label to each sequence of the plurality of amino acid sequences comprises: clustering, using the phylogenetic tree, the plurality of amino acid sequences into two or more groups, wherein one of the two or more groups includes one or more sequences of the plurality of the amino acid sequences that are sufficiently likely to be associated with a biological characteristic, and wherein one of the two or more groups includes one or more amino acid sequences of the plurality of the amino acid sequences that are not sufficiently likely to be associated with the biological characteristic; wherein the label is a positive label or a negative label.
8. The method of claim 1, wherein the plurality of amino acid sequences is associated with a pathogen; and wherein each sequence of the plurality of amino acid sequences is associated with a variant of the pathogen.
9. The method of claim 8, wherein the pathogen is SARS-CoV-2 or influenza.
10. The method of claim 1, wherein the classification model includes one or more machine learning algorithms, or an ensemble of machine learning algorithms, such as a random forest or a neural network.
11. The method of claim 1, wherein training, using the training data, the classification model comprises defining a threshold value maximizing an F-score of the classification model.
12. A method for assessing a biological impact of a genetic variant, the method comprising: receiving an amino acid sequence; determining electronic properties of the amino acid sequence, the electronic properties including one or more amplitude values for one or more characteristic frequencies; and determining, using the electronic properties, whether a mutation associated with the amino acid sequence is at risk of being associated with a biological characteristic; wherein determining, using the electronic properties, whether the amino acid sequence is at risk of being associated with a biological characteristic comprises applying a classification model to the one or more amplitude values.
13. The method of claim 12, wherein the amino acid sequence is associated with a protein of a variant of SARS-CoV-2 or of a variant of influenza; and wherein the biological characteristic is associated with a resistance to a vaccine-induced immune system response.
14. The method of claim 12, wherein the amino acid sequence is associated with a lipoprotein lipase enzyme; and wherein the biological characteristic is associated with a decrease in lipoprotein lipase activity.
15. The method of claim 12, wherein the amino acid sequence is associated with an epidermal growth factor receptor; and wherein the biological characteristic is associated with an increased likelihood of cancer development.
16. The method of claim 12, wherein determining the electronic properties of the amino acid sequence comprises: converting the amino acid sequence to a sequence of electron-ion interaction potential values; converting the electron-ion interaction potential values to a frequency domain; and determining the one or more amplitude values for the one or more characteristic frequencies in the frequency domain.
17. The method of claim 12, wherein the classification model is a machine learning model trained with training data; wherein training instances of the training data are associated with a plurality of amino acid sequences; and wherein the training instances include, for each sequence of the plurality of amino acid sequences, an instance label and instance electronic properties, the instance electronic properties including one or more instance amplitude values for the one or more characteristic frequencies.
18. The method of claim 12, further comprising: in response to determining, using the electronic properties, that the amino acid sequence is at risk of being associated with the biological characteristic, automatically transmitting data to a monitoring system, the data indicating that the amino acid sequence is at risk of being associated with the biological characteristic.
19. A mutation assessment system comprising: a processor; and a memory storing instructions, wherein the instructions, when executed by the processor, cause the mutation assessment system to: receive a plurality of amino acid sequences; determine electronic properties for each sequence of the plurality of amino acid sequences, the electronic properties comprising one or more amplitude values for one or more characteristic frequencies; construct, using the electronic properties, a phylogenetic tree; assign, using the phylogenetic tree, a label to each sequence of the plurality of amino acid sequences; and train, using training data, a classification model, wherein the training data includes, for each sequence of the plurality of amino acid sequences, the one or more amplitude values and the label.
20. The mutation assessment system of claim 19, wherein determining the electronic properties for each sequence of the plurality of amino acid sequences comprises: converting the plurality of amino acid sequences into a plurality of sequences of electron-ion interaction potential values; converting, using a discrete Fourier transform, each sequence of the plurality of sequences of electron-ion interaction potential values to a frequency domain; and determining, for each sequence of electron-ion interaction potential values, the one or more amplitude values for the one or more characteristic frequencies in the frequency domain; and wherein constructing, using the electronic properties, the phylogenetic tree comprises: calculating a distance matrix for the plurality of amino acid sequences, the distance matrix comprising a distance between each pair of amino acid sequences of the plurality of amino acid sequences; and applying an agglomerative hierarchical clustering method to the distance matrix.
PCT/US2023/068124 2022-06-08 2023-06-08 Method and system for assessing an impact of genetic changes on biological properties WO2023240183A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263350273P 2022-06-08 2022-06-08
US63/350,273 2022-06-08

Publications (1)

Publication Number Publication Date
WO2023240183A1 (en) 2023-12-14

Family

ID=89119065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/068124 WO2023240183A1 (en) 2022-06-08 2023-06-08 Method and system for assessing an impact of genetic changes on biological properties

Country Status (1)

Country Link
WO (1) WO2023240183A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120310863A1 (en) * 2011-05-12 2012-12-06 University Of Utah Gene-specific prediction
US20190351046A1 (en) * 2017-02-02 2019-11-21 The Board Of Regents Of The University Of Texas System Universal influenza vaccine targeting virus/host recognition
US20200279157A1 (en) * 2017-10-16 2020-09-03 Illumina, Inc. Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23820647

Country of ref document: EP

Kind code of ref document: A1