WO2008134261A2 - Method for determining protein structure, gene identification, mutational analysis, and protein design - Google Patents

Method for determining protein structure, gene identification, mutational analysis, and protein design

Info

Publication number
WO2008134261A2
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
acid sequence
acid sequences
protein
sequence
Prior art date
Application number
PCT/US2008/060786
Other languages
English (en)
Other versions
WO2008134261A3 (fr)
Inventor
Charles Michael Fortmann
Yeona Kang
David H. Coleman
Original Assignee
The Research Foundation Of State University Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/US2007/067639 (WO2007140061A2)
Application filed by The Research Foundation Of State University Of New York
Priority to US12/301,963 (US20100304983A1)
Publication of WO2008134261A2
Publication of WO2008134261A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the present invention relates to computational methods for protein structure determination and prediction, and utilization of the protein structure determination methods for gene identification, mutation analysis, and protein design.
  • NMR Nuclear Magnetic Resonance Spectroscopy
  • Examples of a sequence-based homology approach include using alignment tools such as BLAST or FASTA to directly assess similarity between primary amino acid sequences, and to predict the structure of a sequence based on its sequence similarity to a sequence with a known structure.
  • Examples of a structure-based homology approach include threading algorithms, which predict structure by superimposing an amino acid sequence onto a known structure and then assessing whether that conformation of amino acids would still retain the same equilibrium of hydrophobic/hydrophilic forces that allow for the structure to be stable.
  • a suitable template for homology analysis may be difficult to find, especially in the case of a novel artificial protein.
  • homology modeling is unable to predict the structure of those portions of the sequence that are unmatched by the template in addition to the fact that each amino acid sequence may have different degrees of divergence from the template sequence.
  • physics-based computational methods have been employed for protein structure determination, of which energy minimization is a typical example. Due to the enormous conformational space of a large size protein, energy minimization usually does not yield a folded structure with a global free energy minimum because of its lack of modes of motion to facilitate crossing of barriers of local energy minima.
  • Molecular dynamics is another example of a physics-based method. It explicitly employs Newtonian motion functions to track a temporal evolution of a molecular system. Because of this explicit exploitation of the laws of physics in the temporal dimension, the molecular dynamics method is extremely computationally expensive due to the small time steps needed for the evolution to be physically meaningful, and the complex mathematical integration
  • Such a computational method may utilize knowledge gained by a simulation of a physics-based model, and may also utilize knowledge derived by homology modeling.
  • a given DNA sequence potentially encodes for many possible amino acid sequences, not all of which are transcribed and translated into a protein under natural conditions.
  • Recent attempts to identify genes and establish their associated proteins have relied upon evolutionary, statistical, and/or experimental input.
  • These evolutionary and/or statistical based models fail to produce a link between a nucleic acid sequence and an amino acid sequence with a high degree of precision due to their inherent inadequacies, such as their inability to account for the potentially large impact of seemingly small variations in amino acid sequence on the secondary structure of a protein, and their inability to determine amino acid secondary structure in different environments such as different temperatures, pH levels, etc.
  • Variations of the sequence of a natural protein may be characterized as neutral mutations (mutations that do not lead to altered structure or function of the protein), structural mutations (mutations that lead to altered structure of the protein, but not necessarily altered function), or functional mutations (mutations that lead to loss or gain of function of the protein).
  • Such variations may result in differences within a population and/or result in diseases, and therefore would provide useful information for evolutionary analysis, and medical analysis of the root cause of the disease. Further, it would allow more accurate evolutionary analysis by identifying homologous and analogous mutations to both structure and function.
  • the present invention provides a computational method that upon being given an input of an amino acid sequence of a protein or polypeptide,
  • determines and/or predicts the presence or absence of secondary and/or tertiary structure(s) of a protein or polypeptide, as well as the size and position of the secondary and/or tertiary structure(s).
  • Protein structures, such as secondary and tertiary structures, include but are not limited to the types of structures described in established protein classification systems such as CATH and SCOP.
  • the secondary structures may be alpha helices or beta sheets. The method is hereinafter referred to as "the Fortmann-Kang-Coleman method," or "the FKC method.”
  • the FKC method is based on a molecular model described by physical parameters such as amino acid sequence and the net charge and hydrophobicity of each amino acid residue on the amino acid sequence, as well as environmental variables such as ionic strength, pH, temperature, pressure, etc.
  • the determination or prediction of folded regions of a protein or polypeptide is based on examining the given amino acid sequence, the physical parameters defining each amino acid residue and the environment of the amino acid sequence, without the need to carry out computer simulations or output graphics representing the folded structure. Therefore, the FKC method is highly computationally efficient and may be used for high-throughput analysis of genomes and proteins.
  • the FKC method extends and streamlines the computational method described in the International Patent Application PCT/US07/067639 by Fortmann and Kang, filed April 27, 2007, the disclosure of which is fully incorporated herein in its entirety.
  • a computational method (referred to hereinafter as “the Fortmann-Kang method", or “the FK method") was described which enables expedited determination and/or prediction of secondary and higher ordered structures (including, but not limited to, alpha helices and beta sheets) of a protein, a polypeptide, or an autonomous folding region thereof.
  • the present invention provides a method for selecting an amino acid sequence given a nucleic acid.
  • the method generates a probabilistic array of amino acid sequences that are potentially encoded by the nucleic acid, determines the folded region of each of the amino acid sequences generated using the FKC method, and selects the amino acid sequence having the most secondary structures.
  • the present invention provides a method for sorting an array of amino acid sequences potentially encoded by a given nucleic acid.
  • the method determines the folded region of each of the amino acid sequences, and sorts the amino acid sequences according to the type, order and/or number of the predicted folded regions that each amino acid sequence produces.
  • the present invention provides a method for determining the presence or absence of a gene in a nucleic acid sequence.
  • the present invention provides a method for determining the presence or absence of a structural mutation in an amino acid sequence.
  • the present invention provides a method for the design of a protein.
  • Fig. 1 is a functional diagram of an exemplary system and process in accordance with the present invention.
  • Fig. 2 is a process diagram of an exemplary process in accordance with the present invention.
  • Fig. 3 is a process diagram of an exemplary process in accordance with the present invention.
  • Fig. 4 is a process diagram of an exemplary process in accordance with the present invention.
  • Fig. 5 is a process diagram of an exemplary process in accordance with the present invention.
  • Fig. 6 is a process diagram of an exemplary process in accordance with the present invention.
  • Fig. 7 is a process diagram of an exemplary process in accordance with the present invention.
  • Fig. 8 is an illustrative graph depicting the predicted structure of an amino acid sequence.
  • Fig. 9 is an illustrative graph depicting the predicted structure of an amino acid sequence.
  • Fig. 10 is an illustrative graph depicting the predicted structure of an amino acid sequence.
  • system 100 includes an IBM-PC compatible computer platform and a memory 120.
  • Computer 110 is a PC, but may be any form of conventional computer.
  • Memory 120 is a hard drive, but may be any form of accessible data storage, while Processor 130 may be any type of conventional processor.
  • Data entry device 140 is a keyboard, but may also include other data entry devices such as a mouse, an optical character reader, or an internet connection for receiving electronic media.
  • the present invention provides a computational method that, upon being given an input of an amino acid sequence of a protein or polypeptide, determines and/or predicts the presence or absence of a folded region, such as secondary and/or tertiary structure(s), of the protein or polypeptide, as well as the size and position of the predicted secondary and/or tertiary structure(s).
  • the method is referred to hereinafter as the FKC method.
  • the FKC method is related to the computational method described in the International Patent Application PCT/US07/067639 by Fortmann and Kang, filed April 27, 2007, the disclosure of which is fully incorporated herein in its entirety.
  • the FK method is a computational method which enables expedited determination and/or prediction of secondary and higher ordered structures (including but not limited to alpha helices and beta sheets) of a protein, a polypeptide, or an autonomous folding region thereof.
  • the term "folding region" refers to a fragment or section of a protein or polypeptide sequence, wherein at least a portion of the amino acid residues within such fragment or section form at least one secondary or tertiary
  • the term "secondary structure” refers to a pattern of spatial arrangement by the local segments of a given protein or amino acid sequence, whereas the term “tertiary structure” refers to a longer-ranged structural pattern formed by the secondary structure(s) or individual amino acid residues on the sequence.
  • the FKC method is based on a molecular model described by physical parameters such as the amino acid sequence and the net charge and hydrophobicity of each amino acid residue on the amino acid sequence.
  • the FKC method can predict the secondary or higher structure of a protein by simply examining the input amino acid sequence and the physical parameters, without the need to conduct a computer simulation of the protein folding.
  • the FKC method can denote each amino acid residue on the given amino acid sequence as belonging to a secondary or tertiary structure and present the output in a tabular or graphic form instead of a three dimensional representation of the folded structure.
  • the FKC method is illustrated in Fig. 2.
  • an amino acid sequence is provided, along with input parameters for each of the amino acid residues in the sequence.
  • the input parameters include charge, hydrophobicity value, size, polarity, and other properties that describe an amino acid residue.
  • a first and a second amino acid residue on the amino acid sequence are identified.
  • the first and the second amino acid residues may be identified based on the input parameters and are generally two closely spaced (e.g.
  • non-polar residues are also termed the "boundary residues" in the present application.
  • the presence or absence of a folded structure between (and including) the two identified amino acid residues is predicted and/or denoted.
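
The boundary-residue step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `find_boundary_pairs`, the toy hydrophobicity values, and the choice to measure spacing as the difference in sequence index are all assumptions. A positive hydrophobicity value is taken to mark a non-polar residue, and 5 is used as the default spacing.

```python
from typing import List, Tuple


def find_boundary_pairs(hydrophobicity: List[float],
                        max_spacing: int = 5) -> List[Tuple[int, int]]:
    """Return index pairs of neighbouring hydrophobic residues that lie
    at most `max_spacing` positions apart along the sequence."""
    # Assumption: a positive value marks a hydrophobic (non-polar) residue.
    hydrophobic = [i for i, h in enumerate(hydrophobicity) if h > 0]
    pairs = []
    # Only consecutive hydrophobic residues are considered as boundary pairs.
    for left, right in zip(hydrophobic, hydrophobic[1:]):
        if right - left <= max_spacing:
            pairs.append((left, right))
    return pairs


# Hypothetical hydrophobicity values, one per residue (not taken from MOE or Copeland).
values = [1.8, -3.5, -0.4, 2.5, -3.9, -3.5, -0.7, -1.6, -3.2, 4.2]
print(find_boundary_pairs(values))
```

With these toy values, residues 0, 3 and 9 are hydrophobic, and only the first two are close enough to form a boundary pair at the default spacing.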
  • the charge values of an amino acid residue may be obtained from various software suites or from calculations based on first principles. For example, they can be obtained from the software MOLECULAR OPERATING ENVIRONMENT™ ("MOE", by Chemical Computing Group, Montreal, Quebec, Canada), or they can be obtained by using the tight-binding methods described by Walter A. Harrison, Electronic Structure and the Properties of Solids: The Physics of the Chemical Bond, W.H. Freeman, San Francisco, 1980.
  • Equation 1a may be modified as: where a is a number sufficiently close to zero to describe a region in which charge-related electric fields are sufficiently large so as to energetically prohibit the displacement of water thereby forming a beta sheet region.
  • When the condition of Equation 1 is not met for all amino acid residues between the two identified boundary residues, that is, when: the FKC method identifies the region between the boundary residues as an alpha helix. Equation 2a may be further modified for cases in which the partial or fractional electronic charge data is used:
  • an alpha helix may be predicted when there exists a predominance of charge neutrality by all but one of the amino acid residues in the region between the two boundary residues.
  • This one amino acid residue may attain a position external to the shared vapor core region connecting the two boundary residues whereby the shared water/vapor and water/non-polar residue surface can be minimized by rotating out of the forming alpha helix core, thereby allowing the formation of the alpha helix structure to proceed. Therefore, an alternative condition for alpha helix formation is: where a is a number sufficiently close to zero and y is one less than the number of the intervening residues between the two boundary residues under consideration.
  • in Equation 2c, the parameter a may be varied to reflect differences that may relate to specific examples of amino acid sequence and/or environmental considerations (e.g., those factors that alter the dielectric constant of the water and/or media in which the amino acid sequence is placed).
  • the number a may be set systematically, scanned through a range, and/or chosen through examination of a known homologous structure. Generally, a may be set as 0.01 to 0.5, more preferably 0.05 to 0.25.
  • the structures generated by the simulation for various a values may be compared to the known structure of the homolog to best fit the homolog structure. Further, the distance between the two boundary hydrophobic residues may be chosen in the range of 3-11 residues (on the contour of the amino acid sequence), preferably 4-8 residues, with 5 as a useful default value.
  • The repeating of a pattern of non-polar residues being spaced sufficiently close and having intervening amino acids conforming to Equation 1a or 1b is identified by the FKC method as an extended beta sheet region. Likewise, a region having repeat patterns described by Equation 2a, 2b or 2c is identified as an extended alpha helix region.
  • the FKC method identifies such an intervening region as an unstructured region.
  • the FKC method can output a graphical representation, or a tabular list of amino acid residues in the sequence with each residue annotated with an indicator of the structure to which it belongs: an alpha helix, a beta sheet, or an unstructured region.
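
The per-residue annotation described above can be illustrated with a small sketch. The equations themselves are not reproduced in this text, so the test used here is an assumption made only for illustration: a region between two boundary residues is labelled a helix when at most one intervening residue has an absolute charge above the threshold `a`, and a beta sheet otherwise. The function name `annotate`, the labels `H`, `E` and `-`, and the toy input values are likewise hypothetical.

```python
from typing import List


def annotate(charges: List[float], hydrophobicity: List[float],
             a: float = 0.1, max_spacing: int = 5) -> str:
    """Label each residue 'H' (alpha helix), 'E' (beta sheet) or '-' (unstructured)."""
    labels = ["-"] * len(charges)
    # Assumption: positive hydrophobicity marks a candidate boundary residue.
    hydrophobic = [i for i, h in enumerate(hydrophobicity) if h > 0]
    for left, right in zip(hydrophobic, hydrophobic[1:]):
        if right - left > max_spacing or right - left < 2:
            continue  # too far apart, or no intervening residues to test
        inner = [abs(q) for q in charges[left + 1:right]]
        # Assumed stand-in for the charge tests: count residues that are not near-neutral.
        charged = sum(1 for q in inner if q > a)
        label = "H" if charged <= 1 else "E"
        # The region includes both boundary residues; later regions overwrite a shared boundary.
        for k in range(left, right + 1):
            labels[k] = label
    return "".join(labels)


# Hypothetical per-residue charges and hydrophobicity values.
toy_charges = [0, 0, 0, 0, 1, -1, 1, 0, 0, 0]
toy_hydrophobicity = [1, -1, -1, 1, -1, -1, -1, 1, -1, -1]
annotation = annotate(toy_charges, toy_hydrophobicity)
for index, label in enumerate(annotation):
    print(index, label)
```

The output is a plain table of residue index and label, which mirrors the tabular form mentioned above rather than a three-dimensional model.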
  • the parameters (for example, the value a in Equations 1 - 2) used in the FKC method may be varied according to the temperature and/or other environmental factors.
  • Environmental effects include: local ion concentrations; environmentally induced chemical changes resulting in a new charge and/or new charge state on an atom and/or residue (e.g., but not limited to pH change, oxidation, ionizing radiation, and/or reduction); changes in the dielectric and/or charged species content, and/or surface energy of the media in which the protein under consideration is suspended; and, changes in temperature and/or thermal energy.
  • the relevant charge summation tests (i.e., Equations 1 - 2) for helix and beta sheet formations are carried out with the new environmentally-altered charge distributions.
  • the value of a should be decreased with increasing temperature to reflect an increased propensity for the alpha helix to denature upon heating.
  • Environmental changes that influence the strength and/or distance of the hydrophobic interactions between amino acid residues include but are not limited to: the dielectric strength of the media, the surface tension of the dielectric media, vapor bubble (including cavitations) size and distribution, and temperature.
  • Factors that decrease the interaction length e.g., decreased media dielectric strength
  • Factors that increase interaction length such as increased dielectric strength, increased vapor bubble size and/or density, and increased surface energy, may be modeled with an increase in the pre-determined distance.
  • Increasing temperature decreases the interactions between two hydrophobic residues, and the extent of such decrease depends upon the media and the temperature dependence of its related properties (including the dielectric strength, vapor bubble size and distributions, as well as surface energy). Accordingly, the predetermined distance for identifying the two boundary residues can be reduced. Selecting an appropriate parameter a or l may be conveniently accomplished by using homology and/or an observation of a known protein structure as a function of the environmental parameter in question, for example, by sweeping through a range of l values between, e.g., 3 and 11 residues and/or the parameter a, e.g., between 0 and 0.25 in steps of 0.05.
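
The parameter sweep described above amounts to a small grid search against a known structure. The sketch below assumes a caller-supplied predictor that returns one label per residue; `toy_predict` is a made-up stand-in used only so the example runs, and the names `accuracy` and `sweep` are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def accuracy(predicted: str, known: str) -> float:
    """Fraction of residues whose predicted label matches the known label."""
    matches = sum(p == k for p, k in zip(predicted, known))
    return matches / len(known)


def sweep(predict: Callable[[int, float], str], known: str,
          spacings: Iterable[int], thresholds: Iterable[float]) -> Tuple[float, int, float]:
    """Try every (spacing, threshold) pair and return (best accuracy, spacing, threshold)."""
    thresholds = list(thresholds)  # reused for every spacing value
    best = (-1.0, 0, 0.0)
    for spacing in spacings:
        for a in thresholds:
            score = accuracy(predict(spacing, a), known)
            if score > best[0]:  # keep the first parameter pair that reaches the best score
                best = (score, spacing, a)
    return best


def toy_predict(spacing: int, a: float) -> str:
    # Hypothetical stand-in for a structure predictor; NOT the FKC method.
    return ("H" * spacing).ljust(12, "-") if a >= 0.1 else "-" * 12


known_structure = "HHHHHH------"
spacing_values = range(3, 12)                                # 3 to 11 residues
threshold_values = [round(0.05 * k, 2) for k in range(6)]    # 0 to 0.25 in steps of 0.05
best = sweep(toy_predict, known_structure, spacing_values, threshold_values)
print(best)
```

In practice the predictor would be rerun on the homolog's sequence for each parameter pair, and the winning pair reused for the sequence of interest.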
  • the FKC method may utilize homology information to self-tune when the provided amino acid sequence has a sequence homologous to an amino acid sequence with experimentally determined secondary or tertiary structure(s). This feature can make use of natural tertiary interactions that modify parameters governing secondary structure generation. For example, homology may permit a more refined choice of parameter a and the number of intervening residues between the two boundary hydrophobic residues.
  • the FKC method may further be configured to predict and denote tertiary structure of a protein or polypeptide. This can be done by incorporating forces arising between the secondary structures, such as hydrogen bonding and van der Waals forces.
  • a modified version of the original program may be run after establishing the presence of secondary structure in which closely-spaced secondary structures are allowed to interact in accordance with a predetermined interaction template.
  • the FKC method may be first used to identify secondary structures in an amino acid sequence. The residues in these identified secondary structures are then combined into groups and treated as single residues. An appropriate value of charge and hydrophobicity is assigned to each of the groups. Using the parameters of the grouped structures, the FKC method is then utilized to determine the tertiary structure.
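
The grouping step above can be sketched as a coarse-graining pass over a per-residue annotation. The text does not say how the "appropriate value" of charge and hydrophobicity is chosen for each group, so the sketch assumes the summed charge and the mean hydrophobicity; the function name `group_segments` and the labels `H`, `E` and `-` are also assumptions.

```python
from itertools import groupby
from typing import List, Tuple


def group_segments(labels: str, charges: List[float],
                   hydrophobicity: List[float]) -> List[Tuple[str, float, float]]:
    """Collapse each helix or sheet run into one pseudo-residue (label, charge, hydrophobicity)."""
    grouped = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        span = slice(index, index + length)
        if label == "-":
            # Unstructured residues are kept as individual residues.
            grouped.extend((label, q, h)
                           for q, h in zip(charges[span], hydrophobicity[span]))
        else:
            # Assumption: net charge is the sum, hydrophobicity is the mean over the run.
            grouped.append((label,
                            sum(charges[span]),
                            sum(hydrophobicity[span]) / length))
        index += length
    return grouped


# Hypothetical annotation and per-residue values.
coarse = group_segments("HH-E", [1, -1, 0, 1], [2.0, 4.0, -1.0, 3.0])
print(coarse)
```

The resulting list can then be passed through the same kind of boundary and charge tests used for the original residues.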
  • Considerations for interaction may include, among other things, distance, net charge, charge distribution, and environmental charge.
  • alpha helices will more likely be found in regions with low electric field, and closely spaced beta sheets form stacks when net charge is opposite (attraction) and form beta barrels when net charges are similar (repulsion).
  • the modified program may allow a number (greater than three) of nearly spaced beta sheets (e.g., three residues
  • a given DNA sequence may be translated in six reading frames to produce all possible amino acid sequences that may be potentially encoded by the DNA sequence.
  • the resulting amino acid sequences may be run through a protein folding algorithm, such as the FK method or the FKC method, to determine the presence of a folding region.
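
The six-frame translation mentioned above can be sketched with the standard genetic code. The function names and the frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for the reverse complement) are illustrative choices, and unknown codons are rendered as `X` by assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return the translations of the three forward and three reverse reading frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for sign, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translated sequences can then be handed to a structure prediction step to check for folded regions.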
  • the FKC method is used to sort amino acid sequences based on the combination of predicted secondary structures present within the sequence.
  • a nucleic acid sequence is provided.
  • the nucleic acid sequence is translated to an array of amino acid sequences that may be potentially encoded by the nucleic acid sequence.
  • a protein structure prediction method such as the FKC method or the FK method, may be used to determine the folded regions of each of the translated amino acid sequences.
  • the predicted folded regions are sorted.
  • the predicted folded regions may be sorted by the type, order, and number of the secondary structures present to produce such general categories as all alpha helix, all beta sheet, or a mixture of alpha helix and beta sheet, or such specific categories for alpha helix (A) and beta sheet (B) as: A, B, AA, AB, BA, BB, AAA, etc.
  • the results may then be annotated using such structural databases as CATH and SCOP in order to assign potential function and potential higher structures.
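
The sorting by type, order, and number of secondary structures can be sketched by collapsing a per-residue annotation into a signature such as `A`, `AB` or `BAB`. The input format (a mapping from a candidate's name to a string of `H`, `E` and `-` labels), the function names, and the particular sort order are assumptions made for illustration.

```python
from itertools import groupby


def signature(labels: str) -> str:
    """Collapse per-residue labels into a string such as 'AB' (A = helix, B = sheet)."""
    runs = [key for key, _ in groupby(labels)]
    return "".join({"H": "A", "E": "B"}.get(run, "") for run in runs)


def sort_by_structure(predictions: dict) -> list:
    """Order candidates by number of secondary structures (most first), then by signature."""
    return sorted(predictions,
                  key=lambda name: (-len(signature(predictions[name])),
                                    signature(predictions[name])))


# Hypothetical predictions for four reading frames.
predictions = {
    "+1": "HHHH--EEEE",
    "+2": "----------",
    "-1": "--HHHH----",
    "-2": "EEE--HHH--EEE",
}
ranking = sort_by_structure(predictions)
print([(name, signature(predictions[name])) for name in ranking])
```

Grouping candidates by identical signature gives the general categories named above, and the signatures can serve as lookup keys when annotating against a structural database.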
  • Homology-based information may be employed in the above method to improve its accuracy. For example, if the provided nucleic acid sequence has a sequence homologous to a nucleic acid sequence known to encode a particular amino acid sequence, then this information may be used to prioritize which nucleic acid sequences will be analyzed first or if at all. Or, if one of the generated amino acid sequences is homologous to an amino acid sequence with an experimentally determined 3-D structure accessible in a public database, then this information may be used to prioritize which predicted fold will be selected first or if at all for further investigation.
  • the amino acid sequence of a polypeptide with a desired pre-determined secondary structure is determined.
  • an initial amino acid sequence(s) is chosen either randomly or from a list of pre-determined amino acid sequences known to fold into a structure that is similar to the desired structure or from a list of predetermined amino acid sequences known to encode for a protein that has a similar function to the desired protein. This sequence(s) is then randomly mutated to produce a new sequence(s).
  • a protein structure prediction method such as the FKC method or the FK method, is used to predict the secondary structures of the mutated new sequence(s).
  • the predicted secondary structures of each of the amino acid sequences are compared with the desired secondary structure.
  • the amino acid sequence having the predicted secondary structure that most closely fits with the pre-determined secondary structure provided is selected.
  • amino acid sequence of a gene product will necessarily fold into certain secondary structures, e.g., alpha helices or beta sheets. Accordingly, the present invention may also be used to determine whether there exists a potential gene in a given nucleic acid sequence.
  • a method of determining the presence or absence of a gene in a nucleic acid sequence is provided.
  • a nucleic acid sequence is given, and a corresponding amino acid sequence is obtained through the translation of the nucleic acid.
  • the presence or absence of a folded region in the amino acid sequence is determined using a protein structure prediction method, for example, the FKC method or the FK method.
  • the presence or absence of a gene in the nucleic acid sequence is determined based on the presence or absence of a folded region in the amino acid sequence.
  • the predicted folded regions may be alpha helices or beta sheets, while the nucleic acid sequences may be DNA, RNA or cDNA.
  • the pattern of the predicted amino acid sequence structures may be matched to known proteins or to known sequence motifs to further determine whether the gene identified may transcribe and translate into a protein, or will remain inactive.
  • the method above may utilize information gained from sequence homology in order to define transcription boundaries. This may be accomplished by establishing homology to known transcribed regions, such as those present in a cDNA library, and to known transcriptional motifs such as start and stop codons, splice sites, promoters, regulatory elements and structure elements, to aid the determination of the existence of a gene in the nucleic acid.
  • the correlation between a natural variation of an amino acid sequence and the structural mutation is determined.
  • a first amino acid sequence and a second amino acid sequence are provided, wherein the second amino acid sequence comprises at least one mutated amino acid residue compared to the first amino acid sequence.
  • the first amino acid sequence may be a naturally occurring protein.
  • the folded region(s) of the second amino acid sequence is determined by a protein structure prediction method, such as the FKC method or the FK method; the folded region(s) of the first amino acid sequence may be obtained either from available experimental data (by crystallography or NMR), or by using a protein structure prediction method, such as the FKC method or the FK method.
  • the predicted folded region of the second amino acid sequence is compared to the folded region in the first amino acid sequence to determine the effect of natural variation of the first amino acid sequence on the folded structure of the amino acid sequence.
  • the mutation may be a point mutation, a deletion mutation, an insertion mutation, an inversion mutation, or an alternate splice.
  • the predicted folded region may be either an alpha helix or a beta sheet.
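
The comparison of folded regions between the first and second sequences can be sketched as a position-by-position diff of their annotations. This sketch handles only substitutions, where both annotations have the same length; insertions and deletions would need an alignment step that is not shown. The function names and the label alphabet are assumptions.

```python
from typing import List, Tuple


def structural_changes(first_labels: str, second_labels: str) -> List[Tuple[int, str, str]]:
    """List (position, first label, second label) wherever the two annotations differ."""
    if len(first_labels) != len(second_labels):
        raise ValueError("this sketch only compares annotations of equal length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(first_labels, second_labels))
            if a != b]


def is_structural_mutation(first_labels: str, second_labels: str) -> bool:
    """True when the mutation changes the predicted structure of at least one residue."""
    return bool(structural_changes(first_labels, second_labels))


# Hypothetical annotations for a natural sequence and a point mutant.
natural = "HHHH--EEEE"
mutant = "HHHH--EE--"
print(structural_changes(natural, mutant))
print(is_structural_mutation(natural, mutant))
```

A mutation that leaves the annotation unchanged would be a candidate neutral mutation with respect to predicted structure, though its effect on function would still need separate assessment.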
  • a protein sequence is designed by generating and refining a set of one or more amino acid sequences, as illustrated in Fig. 7.
  • a first set of one or more amino acid sequences is provided.
  • the amino acid sequence(s) may be either naturally occurring or artificial.
  • this first set of amino acid sequence(s) is tested for the presence or absence of a folding region(s) using a method of protein structure prediction, such as the FK method or the FKC method.
  • the absence or presence of the folding region(s) may then be used to determine if the amino acid sequence will have a structure that will produce the desired function.
  • the first set of one or more sequences of amino acid sequences may be any one or more sequences of amino acid sequences.
  • the mutation may be random mutation or directed mutation.
  • the type of mutation may be a point mutation, a deletion mutation, an insertion mutation, an inversion mutation, or an alternate splice.
  • the considerations for selection of the mutation may be based on the preservation of structure where one wishes to modify a protein's characteristics without altering the overall "lock and key" functionality provided by the protein's overall conformation. For instance, the stability of the protein may be altered by introducing a disulfide bond which would allow the protein to retain its conformation at a high temperature.
  • the strength of a protein's binding site may be altered by changing the side chains present at the binding location, thereby allowing the protein to bind more or less strongly to its target without altering the type of target that the protein binds.
  • the method of protein structure prediction could be used to confirm that these alterations to the characteristics of the protein will not interfere with the overall structure of the protein, so that it retains the function of its conformation.
  • the researcher may wish to mutate the amino acid sequence such that the amino acid sequence either no longer functions or acquires new functions based on its interaction with other proteins. This may be desired for disabling the function of the protein, for instance, when the researcher wishes to disable a critical protein in the lifecycle of a pathogen, or to disable a growth factor responsible for a cancer. It may also be desired to look for new functions of the protein, for instance, the researcher may wish to retain the binding site, but to alter the conformation so that the protein interacts with new targets and proteins of either natural or artificial origins.
  • the amino acid sequences may be sorted by type, number and/or order of folded regions predicted, and a new set of amino acid sequences may then be generated from the first set by mutation of highly ranked candidates. This new set of amino acid sequences may then be further mutated to generate yet another new set for testing. This process may be repeated to obtain better candidates. Finally, the best candidate amino acid sequences may be selected according to the criteria of type, number, and/or order of secondary structures required by the researcher (740). The above process may be performed according to a genetic algorithm.
  • […] embodiment may be translated back to a nucleic acid sequence for use as a gene.
  • a collection of genes may be assembled into a synthetic genome.
  • the synthetic genome may be used as the basis for a synthetic organism.
  • amino acid sequences may be further tested and refined.
  • the FKC method is implemented as follows.
  • q_i denotes the charge of the i-th residue in the given amino acid sequence
  • X_i denotes the hydrophobicity value of the i-th residue (a positive X_i indicates that the residue is hydrophobic, and a negative X_i denotes a hydrophilic residue).
  • Both the charge and hydrophobicity value of an amino acid residue may be obtained by using the previously-mentioned MOE software or by data in Copeland (Robert A. Copeland, Methods for Protein Analysis: a Practical Guide to Laboratory Protocols, Chapman & Hall, NY 1994 p. 14), and/or by well-known experimental techniques or calculations.
  • the pre-determined distance between the two boundary residues is arbitrarily chosen as 5 residues in some of the examples.
  • Ubiquitin is a well-known protein having an experimentally determined structure.
  • the result showed that the secondary structures and unstructured regions of Ubiquitin predicted by the FKC method match the experimental structure (obtained from Protein Data Bank (PDB) with PDB# UBQ) with an accuracy of 68.4%.
  • Increasing the value of l, for example, to 5, 6, or 7, did not further improve the accuracy.
  • including amino acid molecular weights, and/or amino acid molecular size, and/or polar moment considerations did not produce improvement.
  • the result obtained by applying one embodiment of the FKC method on Ubiquitin compared with the experimentally obtained structure of Ubiquitin is illustrated in Fig. 9.
  • the horizontal axis represents the amino acid residue index number on the Ubiquitin sequence
  • the vertical axis represents the denotation of each amino acid residue on the Ubiquitin sequence as either belonging to an α helix (an assigned value of 1), a β sheet (value of 2) or an unstructured region (value of 0.1).
  • the predicted structure is shown above the horizontal axis and the experimentally determined structure is shown below the horizontal axis.
  • the value of 68.4% matching is obtained by dividing the number of matched residues by the total number of residues in Ubiquitin.
  • Group G Streptococcus is an important human pathogen and has an experimentally determined structure.
  • the predicted structure of Group G Streptococcus is shown above the horizontal axis, as compared to the experimentally obtained structure (obtained from PDB by PDB# 2NMQ) shown below the horizontal axis. The value of
  • Influenza A virus. In this example, one embodiment of the FKC method is applied to Influenza A virus, and the result is illustrated in Fig. 11.
  • the PDB structure of Influenza A virus (PDB#: 1AA7) has 11 α-helices.
  • the predicted structure of this virus using the FKC method shows an overall 67.3% matching and 73% matching of the α-helices.
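
The matching percentages quoted above are described as the number of matched residues divided by the total number of residues. The following is a minimal sketch of that calculation, not the patent's own code. It assumes the predicted and experimental assignments are available as equal-length strings using the hypothetical one-letter labels `H` (α helix), `E` (β sheet) and `U` (unstructured). The per-class function is one possible reading of a figure such as "73% matching of the α-helices"; the source does not define how that value was computed. The sequences are toy data and do not come from the patent.

```python
# Plot values used for the figures, as stated in the text:
# alpha helix = 1, beta sheet = 2, unstructured = 0.1.
# The labels "H", "E", "U" are hypothetical names chosen for this sketch.
PLOT_VALUE = {"H": 1, "E": 2, "U": 0.1}


def to_plot_values(assignment):
    """Convert a per-residue label string to the numeric values plotted per residue."""
    return [PLOT_VALUE[label] for label in assignment]


def match_fraction(predicted, experimental):
    """Number of residues with identical labels divided by the total number of residues."""
    if len(predicted) != len(experimental):
        raise ValueError("assignments must cover the same residues")
    if not predicted:
        raise ValueError("empty assignment")
    matched = sum(p == e for p, e in zip(predicted, experimental))
    return matched / len(predicted)


def class_match_fraction(predicted, experimental, label):
    """Fraction of residues experimentally labelled `label` that are also predicted as `label`.

    This per-class definition is an assumption made for illustration.
    """
    if len(predicted) != len(experimental):
        raise ValueError("assignments must cover the same residues")
    positions = [i for i, e in enumerate(experimental) if e == label]
    if not positions:
        raise ValueError("label does not occur in the experimental assignment")
    return sum(predicted[i] == label for i in positions) / len(positions)


# Toy assignments for a 10-residue sequence (illustrative only).
predicted = "UHHHHUUEEE"
experimental = "UHHHUUUEEU"

overall = match_fraction(predicted, experimental)
helix_only = class_match_fraction(predicted, experimental, "H")
print(f"overall matching: {overall:.1%}, helix matching: {helix_only:.1%}")
```

In practice the experimental string would be derived from the secondary-structure annotation of the corresponding PDB entry, and the predicted string from the output of the prediction method.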

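The iterative design loop described in the list above (rank the amino acid sequences by their predicted folded regions, mutate the highly ranked candidates, and repeat) may be performed according to a genetic algorithm. The sketch below illustrates such a loop under stated assumptions. It uses point mutations only, it keeps the best parents in every generation, and the function `toy_score` is a hypothetical stand-in for a score derived from the predicted type, number and/or order of folded regions. The FKC prediction itself is not reproduced here, and every name and parameter value is illustrative.

```python
import random

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def point_mutate(seq, rng):
    """Return a copy of `seq` with one residue replaced by a different amino acid."""
    pos = rng.randrange(len(seq))
    choices = [a for a in AMINO_ACIDS if a != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def evolve(initial, score, generations=20, keep=5, children_per_parent=4, seed=0):
    """Rank candidates by `score`, mutate the highest ranked, and repeat.

    Parents are carried over unchanged, so the best score never decreases.
    """
    rng = random.Random(seed)
    population = list(initial)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:keep]
        children = [
            point_mutate(parent, rng)
            for parent in parents
            for _ in range(children_per_parent)
        ]
        population = parents + children
    return sorted(population, key=score, reverse=True)[:keep]


def toy_score(seq):
    """Hypothetical placeholder score: counts A, L and E residues.

    A real score would be computed from the predicted folded regions.
    """
    return sum(seq.count(a) for a in "ALE")


start = ["MKTGSPWQ", "GGSPNDTK"]  # arbitrary starting sequences
best = evolve(start, toy_score)
print(best[0], toy_score(best[0]))
```

Replacing `toy_score` with a function that compares the predicted secondary structures of each candidate against the structures required by the researcher would give the selection criterion described in the text. Deletion, insertion and inversion mutations could be added as further mutation operators.
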
Abstract

The present invention relates to an efficient computational method and system for predicting the folding regions, and the associated secondary and tertiary structures, of a protein. The invention further relates to methods and systems for sorting amino acid sequences on the basis of the predicted structures, as well as methods and systems for determining the presence or absence of genes in nucleic acid sequences, or of structural mutations in amino acid sequences. The invention also relates to a method and system for designing a protein.
PCT/US2008/060786 2007-04-27 2008-04-18 Method for protein structure determination, gene identification, mutational analysis, and protein design WO2008134261A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/301,963 US20100304983A1 (en) 2007-04-27 2008-04-18 Method for protein structure determination, gene identification, mutational analysis, and protein design

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PCT/US2007/067639 WO2007140061A2 (fr) 2006-05-23 2007-04-27 Method for determining and predicting autonomous protein folding
USPCT/US2007/0067639 2007-04-27

Publications (2)

Publication Number Publication Date
WO2008134261A2 (fr) 2008-11-06
WO2008134261A3 (fr) 2009-12-30

Family

ID=39930731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/060786 WO2008134261A2 (fr) 2007-04-27 2008-04-18 Method for protein structure determination, gene identification, mutational analysis, and protein design

Country Status (1)

Country Link
WO (1) WO2008134261A2 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030130797A1 (en) * 1999-01-27 2003-07-10 Jeffrey Skolnick Protein modeling tools
US20050130224A1 (en) * 2002-05-31 2005-06-16 Celestar Lexico- Sciences, Inc. Interaction predicting device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AVBELJ ET AL.: 'Prediction of the Three-Dimensional Structure of Proteins Using the Electrostatic Screening Model and Hierarchic Condensation.' PROTEINS: STRUCTURE, FUNCTION, AND GENETICS vol. 31, 1998, pages 74 - 96 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011100395A1 (fr) * 2010-02-11 2011-08-18 The Research Foundation Of State University Of New York Computational methods for protein structure determination
US20130013215A1 (en) * 2010-02-11 2013-01-10 The Research Foundation Of State University Of New York Computational methods for protein structure determination
CN112480204A (zh) * 2020-04-13 2021-03-12 Nanjing University Protein/polypeptide sequencing method using an Aerolysin nanopore
CN116486906A (zh) * 2023-04-17 2023-07-25 深圳新锐基因科技有限公司 Method and device for improving the stability of protein molecules based on amino acid residue mutation
CN116486906B (zh) * 2023-04-17 2024-03-19 深圳新锐基因科技有限公司 Method and device for improving the stability of protein molecules based on amino acid residue mutation

Also Published As

Publication number Publication date
WO2008134261A3 (fr) 2009-12-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08746241

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08746241

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 12301963

Country of ref document: US