WO2001090715A2 - Maximum likelihood density modification by pattern recognition of structural motifs - Google Patents
- Publication number
- WO2001090715A2 (PCT/US2001/016001)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- electron density
- map
- likelihood
- probability
- solvent
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/69—Microscopic objects, e.g. biological cells or cellular parts
Definitions
- the present invention relates generally to the determination of crystal structure from the analysis of diffraction patterns, and, more particularly, to identification of protein crystal structure represented by electron density patterns.
- electron density modification has generally been carried out in a two-step procedure that is iterated until convergence occurs.
- an electron density map is obtained experimentally and then modified in real space in order to make it consistent with expectations.
- the modification can consist of, e.g., flattening solvent regions, averaging non-crystallographic symmetry-related regions, or histogram-matching.
- phases are calculated from the modified map and are combined with the experimental phases to form a new phase set.
- This formalism describes the contents of a crystal in terms of a collection of point atoms along with probabilities for their positions. From the positions of these atoms, crystallographic structure factors can be calculated, with a certainty depending on the certainties of the positions of the atoms. Extensions of the formalism are described in Bricogne (1988). The extended formalism specifically addresses the situation encountered in crystals of macromolecules in which defined solvent and macromolecule regions exist in the crystallographic unit cell, and formulas for calculating probabilities of structure factors based on the presence of "flat" solvent regions are presented (Bricogne, 1988). The implementation of this formalism is not straightforward according to Xiang et al., Acta Cryst. D49, pp.
- Somoza et al., Acta Cryst. A51, pp. 691-708 (1995) describe an algorithm for recovering crystallographic phase information that is related to the method of Bricogne (1988), but in which electron density is estimated by minimizing a combined target function consisting of the weighted sum of two terms.
- One term is the weighted sum of squares of differences between calculated and known electron density in the region where electron density is known.
- the other term is the weighted sum of squares of differences between calculated and observed amplitudes of structure factors.
- the electron density in a model description of the crystal is adjusted in order to minimize the combined target function.
- the use of the first term was shown by Somoza et al. (1995) to correspond to the solvent flattening procedures described above. This allowed solvent flattening and other related density modification procedures (such as non-crystallographic symmetry averaging) to be carried out without the iterative phase recombination steps required in previous methods.
- For treatment of solvent and macromolecule (protein) regions in a crystal, Bricogne (1988) develops statistical relationships among structure factors based on a model of the contents of the crystal in which point atoms are randomly located, but in which atoms in the protein region are sharply defined with low thermal parameters and atoms in the solvent region are diffuse, with high thermal parameters.
- no assumptions about the presence of atoms or possible values of thermal factors are used. Instead, it is assumed that values of electron density in the protein and solvent regions, respectively, are distributed in the same way in the crystal as in a model calculation of a crystal that may or may not be composed of discrete atoms.
- Bricogne (1988) applies a maximum-entropy formalism developed by Bricogne (1984) to find likely arrangements of atoms in the crystal, which in turn can be used to calculate the arrangement of electron density in the crystal.
- likely values of the structure factors are found by applying a likelihood-based approach based on a combination of experimental information and the likelihood of resulting electron density maps. These structure factors can be used to calculate an electron density map that is then, in turn, a likely arrangement of electron density in the crystal.
- the present invention also addresses much the same problem that earlier procedures by Somoza et al.
- the mathematical approaches for obtaining solutions in the two methods are different as well.
- the method of Somoza et al. (1995) calculates derivatives of its target function with respect to electron density, resulting in a linear system of equations to solve for the electron density at all points in the electron density map, while the present invention calculates derivatives of a likelihood-based target function with respect to structure factors in order to solve for crystallographic phases (or phases and amplitudes if amplitudes are not measured).
- the target function that is optimized in the method of Somoza et al. (1995) is a weighted sum of squared differences, while the target function in the present invention is a log-likelihood-based function.
- the target function in Somoza et al. (1995) simply restrains the electron density in the region where it is known to be similar to the known values.
- the present invention instead calculates the log-likelihood of the electron density map and maximizes it. Consequently the weighting schemes and the details of the target functions used are different.
- the method of Beran and Szoke (1995) is related to the present method in that their target function has the same form as a special case of the map log-likelihood function to be described below, in which the local map log-likelihood is zero for all points outside the target area and a constant for all points within it.
- the method of Beran and Szoke differs in several ways from the present invention.
- the target function is a weighted sum of squared differences, while the target function in the present case is a log-likelihood-based function.
- An electron density map for a crystallographic structure having protein regions and solvent regions is improved by maximizing the log-likelihood of a set of structure factors $\{F_h\}$ using a local log-likelihood function:
- FIGURE 1 is a flow sheet for a process to obtain characteristics from a model electron density map.
- FIGURE 2 is a flow sheet for a process to derive structure factors consistent with experimental results that result in an electron density map with expected characteristics.
- FIGURE 3 is a flow sheet for a process to identify patterns in an electron density map that match a template and to associate probabilities with the identified patterns.
- FIGURE 4 is an exemplary template of an electron density map.
- experimental phase information is combined with prior knowledge about expected electron density distribution in maps by maximizing a combined likelihood function.
- the fundamental idea is to express knowledge about the probability of a set of structure factors $\{F_h\}$ (each $F_h$ includes amplitude, $|F_h|$, and phase, $\phi$, factors) in terms of three quantities: (1) any prior knowledge available from other sources about these structure factors, (2) the likelihood of having measured the observed set of structure
- the likelihood-based density modification approach has a second very important advantage. This is that the derivatives of the likelihood functions with respect to individual structure factors can be readily calculated in reciprocal space by Fast Fourier Transform (FFT) based methods. As a consequence, density modification simply becomes an optimization of a combined likelihood function by adjustment of structure factors. This makes density modification a remarkably simple but powerful approach, requiring only that suitable likelihood functions be constructed for each aspect of prior knowledge that is to be incorporated.
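The FFT-based derivative can be illustrated with a minimal one-dimensional sketch. It assumes NumPy's transform convention, a toy local log-likelihood that simply favours flat density, and hypothetical helper names (`local_ll`, `ll_gradient`, `perturb`); none of these come from the patent text. The point is only that one forward FFT of the real-space derivative yields the derivative with respect to every structure factor at once.

```python
import numpy as np

def local_ll(rho):
    # Toy local log-likelihood (an assumption): density near zero is favoured.
    return -0.5 * rho ** 2

def local_ll_prime(rho):
    # Derivative of the toy local log-likelihood with respect to density.
    return -rho

def map_from_sf(F):
    # Real-space map from Hermitian-symmetric structure factors.
    return np.fft.ifft(F).real

def total_ll(F):
    # Sum of the local log-likelihood over all grid points.
    return local_ll(map_from_sf(F)).sum()

def ll_gradient(F):
    # Derivatives of total_ll with respect to the real and imaginary parts
    # of each F_h (with its Friedel mate moved consistently), from one FFT.
    # Valid for reflections that are not their own Friedel mate.
    rho = map_from_sf(F)
    G = np.fft.fft(local_ll_prime(rho))
    n = len(F)
    return 2.0 * G.real / n, 2.0 * G.imag / n

def perturb(F, h, delta):
    # Change F_h by delta and F_{-h} by conj(delta) so the map stays real.
    out = F.copy()
    out[h] += delta
    out[(-h) % len(F)] += np.conj(delta)
    return out
```

A finite-difference comparison on any single reflection is a convenient check that the sign and scale conventions of the chosen FFT library have been handled correctly.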
- the basic idea of the likelihood-based density modification procedure is that there are two key kinds of information about the structure factors for a crystal of a macromolecule.
- the first is the experimental phase and amplitude information, which can be expressed in terms of a likelihood (or a log-likelihood function $LL^{OBS}(F_h)$) for each structure factor $F_h$.
- the second kind of information about structure factors in this formulation is the likelihood of the map resulting from the factors. For example, for most macromolecular crystals, a set of structure factors $\{F_h\}$ that leads to a map with a flat region corresponding to solvent is more likely to be correct than one that leads to a map with uniform variation everywhere.
- This map likelihood function describes the probability that the map obtained from a set of structure factors is compatible with expectations (Eq. (2)). The two principal sources of information are then combined, along with any prior knowledge of the structure factors, to yield the likelihood of a particular set of structure factors:
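A form consistent with the three quantities named above, and with the per-reflection expression referred to as Eq. (16), would be a simple sum of log-likelihoods. The superscript labels and the layout are assumptions made for illustration, not a quotation of the original equation:

```latex
LL(\{F_h\}) \;=\; LL^{o}(\{F_h\}) \;+\; LL^{OBS}(\{F_h\}) \;+\; LL^{MAP}(\{F_h\})
```

Here the first term would carry the prior knowledge, the second the experimental observations, and the third the plausibility of the resulting map.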
- $LL^{o}(\{F_h\})$ includes any structure factor information that is known in advance, such as the distribution of intensities of structure factors.
- the change in the map likelihood function in response to changes in structure factors must be known.
- for the map likelihood function, $LL^{MAP}(\{F_h\})$, there are two linked relationships: the response of the likelihood function to changes in electron density, and the changes in electron density as a function of changes in structure factors.
- the likelihood of a particular map is a complicated function of the electron density over the entire map.
- the value of any structure factor affects the electron density everywhere in the map.
- a low-order approximation to the likelihood function for a map is used instead of attempting to evaluate the function precisely.
- Because Fourier transformation is a linear process, each reflection contributes independently to the electron density at a given point in the cell.
- Although the log-likelihood of the electron density might have any form, it is expected that, for sufficiently small changes in structure factors, a first-order approximation to the log-likelihood function would apply and each reflection would also contribute relatively independently to changes in the log-likelihood function.
- the log-likelihood for the whole electron density map is written as the sum of the log-likelihood of the densities at each point in the map, normalized to the volume of the unit cell and the number of reflections used to construct it: where $N_{REF}$ is the number of independent reflections and $V$ is the volume of the unit cell.
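On a sampled map, this normalization can be sketched as follows. The sketch assumes that the volume-normalized integral is approximated by an average over equally spaced grid points, scaled by the number of independent reflections; the function name `map_log_likelihood` and the callable `local_ll` are hypothetical.

```python
import numpy as np

def map_log_likelihood(rho, local_ll, n_ref):
    """Approximate the whole-map log-likelihood on a regular grid.

    rho      : array of electron density values on the grid
    local_ll : callable returning the local log-likelihood of each density value
    n_ref    : number of independent reflections (N_REF)
    """
    rho = np.asarray(rho, dtype=float)
    # (N_REF / V) * integral over the cell  ~  N_REF * mean over grid points
    return n_ref * float(np.mean(local_ll(rho)))
```

Averaging rather than summing keeps the result independent of how finely the map happens to be sampled.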
- Eq. (9) can be generalized to read, where the indices h' are all indices equivalent to h due to space-group symmetry.
- the probability distribution for an individual structure factor can be written as $\ln p(F_h) \approx LL^{o}(F_h) + LL^{OBS}(F_h) + \ldots$ (16)
- a key step in likelihood-based density modification is the decision as to the likelihood function for values of the electron density at a particular location in the map.
- an expression for the log-likelihood of the electron density at a particular location x in a map is needed that depends on whether the point satisfies any of a wide variety of conditions, such as being in the protein or solvent region of the crystal, being at a certain location in a known fragment of structure, being at a certain distance from some other feature of the map, or the like.
- Information can be incorporated on the environment of x by writing the log-likelihood function as the log of the sum of conditional probabilities dependent on the environment of x ,
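A minimal sketch of such a local log-likelihood is given below. It assumes each environment (for example protein or solvent) is described by a prior probability that the point x belongs to it and by a probability density for the electron density in that environment; the Gaussian densities and all names are illustrative assumptions.

```python
import math

def gaussian(rho, mean, sigma):
    # Normalized Gaussian probability density, used here only as a stand-in
    # for whatever density distribution describes an environment.
    z = (rho - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def local_log_likelihood(rho, environments):
    """Log of the sum of conditional probabilities for the density at x.

    environments : list of (prior_probability, density_pdf) pairs, where
                   density_pdf(rho) is p(rho | environment).
    """
    total = sum(prior * pdf(rho) for prior, pdf in environments)
    return math.log(total)
```

For example, a point that is probably in solvent would be given a large prior on a narrow distribution centred on the solvent density, so that density values far from that level are penalized strongly there and only weakly in probable protein regions.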
- the derivatives of the likelihood function for electron density were intended to represent how the likelihood function changed when small changes in one structure factor were made.
- the likelihood function that is most appropriate for the present invention is not a globally correct one. Instead, it is a likelihood function that represents how the overall likelihood function varies in response to small changes in one structure factor, keeping all others constant.
- Consider the electron density in the solvent region of a macromolecular crystal. In an idealized situation with all possible reflections included, the electron density might be exactly equal to a constant in this region.
- the goal in using Eq. (16) is to obtain the relative probabilities for each possible value of a particular unknown structure factor F h .
- the appropriate variance to use as a weighting factor in refinement includes the estimated model error as well as the error in measurement.
- the appropriate likelihood function for electron density for use in the present method is one in which the overall uncertainty in the electron density due to all reflections other than the one being considered is included in the variance.
- a likelihood function of this kind for the electron density can be developed using a model in which the electron density due to all reflections but one is treated as a random variable. See Terwilliger et al., Acta Cryst.
- the factor $\beta$ represents the expectation that the calculated value of $\rho$ will be smaller than the true value. This is true for two reasons. One is that such an estimate may be calculated using figure-of-merit weighted estimates of structure factors, which will be smaller than the correct ones. The other is that phase error in the structure factors systematically leads to a bias towards a smaller component of the structure factor along the direction of the true structure factor.
- the coefficients $c_k$, $\sigma_k$, and $w_k$ are obtained as follows.
- a model of a protein structure is used to calculate theoretical structure factors for a crystal of that protein structure.
- Exemplary structures may be obtained from the Protein Data Bank (H.M.Berman et al., The Protein Data Bank. Nucleic Acids Research 28, pp. 235-242, 2000), with data containing space group, cell dimensions and angles, and a list of coordinates, atom types, occupancies, and atomic displacement parameters.
- the model may be chosen to be similar in size, resolution of the data, and overall atomic displacement factors to the experimental protein structure to be analyzed, but this is not essential to the process.
- the resolution of the calculated data and the average atomic displacement parameter may be adjusted to match those of the protein structure to be analyzed. Alternatively, a standardized resolution such as 3 Angstrom units and unadjusted atomic displacement parameters may be used, as in the examples given below.
- the theoretical structure factors for the model are then used to calculate an electron density map.
- the electron density map is then divided into "protein" and "solvent" regions in the following way. All points in the map within a specified distance (typically 2.5 Angstrom units) of an atom in the model are designated "protein" and all others are designated "solvent". The next steps are carried out separately for "protein" and "solvent" regions of the electron density map.
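The distance-based segmentation can be sketched as below. This is a simplified illustration that works in Cartesian coordinates and ignores unit-cell periodicity and symmetry-related atoms, which a real implementation would need to handle; the function name `protein_mask` is hypothetical.

```python
import numpy as np

def protein_mask(grid_points, atom_positions, cutoff=2.5):
    """Flag grid points lying within `cutoff` Angstroms of any model atom.

    grid_points    : (n_points, 3) Cartesian coordinates of map grid points
    atom_positions : (n_atoms, 3) Cartesian coordinates of model atoms
    Returns a boolean array: True for "protein", False for "solvent".
    """
    grid_points = np.asarray(grid_points, dtype=float)
    atom_positions = np.asarray(atom_positions, dtype=float)
    # Pairwise distances between every grid point and every atom.
    diff = grid_points[:, None, :] - atom_positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist <= cutoff).any(axis=1)
```

The all-pairs distance array is fine for a small illustration; for a full map, a neighbour search restricted to grid points near each atom would be far cheaper.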
- a histogram of the numbers of points in the protein or solvent region of the electron density map that fall into each possible range of electron densities is calculated.
- the histogram is then normalized so that the sum of all histogram values is equal to unity.
- the coefficients $c_k$, $\sigma_k$, and $w_k$ are obtained by least-squares fitting of Equation (21) to the normalized histograms.
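The histogram and the quantity minimized in the fit can be sketched as follows. The sketch assumes that Equation (21) is a sum of Gaussians with weights $w_k$, centres $c_k$, and widths $\sigma_k$; that functional form, and every function name here, is an assumption for illustration. The minimization itself (for example with a general least-squares routine) is left out.

```python
import numpy as np

def normalized_histogram(densities, bin_edges):
    # Count density values per bin and scale so the bins sum to unity.
    counts, _ = np.histogram(densities, bins=bin_edges)
    total = counts.sum()
    if total == 0:
        raise ValueError("no density values fall inside the histogram range")
    return counts / total

def gaussian_sum(rho, w, c, sigma):
    # Assumed model: p(rho) = sum_k w_k * exp(-(rho - c_k)^2 / (2 sigma_k^2))
    rho = np.asarray(rho, dtype=float)[:, None]
    w, c, sigma = (np.asarray(a, dtype=float) for a in (w, c, sigma))
    return (w * np.exp(-(rho - c) ** 2 / (2.0 * sigma ** 2))).sum(axis=1)

def squared_misfit(hist, bin_centers, w, c, sigma):
    # Least-squares target: squared difference between model and histogram.
    model = gaussian_sum(bin_centers, w, c, sigma)
    return float(((model - np.asarray(hist, dtype=float)) ** 2).sum())
```

One such fit would be run on the histogram from the protein region and a second, independent fit on the histogram from the solvent region.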
- One set of coefficients is obtained for the "protein" region, another for the "solvent" region. If the values of $\beta$ and $\sigma_{MAP}$ are known for an experimental map with unknown errors, but identified solvent and protein regions, the probability distribution for electron density in each region of the map can be written approximately from Eq. (19) as,
- $\beta$ and $\sigma_{MAP}$ are estimated by a least-squares fitting of the probability distributions for protein and solvent regions given in Eq. (22) to the ones found in the protein and solvent regions in the experimental map.
- This fitting is carried out by first constructing separate histograms of values of electron density in the protein and solvent regions defined by the methods described in Wang, Methods Enzymol 115, pp. 90-112 (1985) and Leslie, Proceedings of the Study Weekend, organized by CCP4, pp. 25-32 (1988), incorporated by reference.
- the histograms are normalized so that the sum, over all values of electron density, of the values in each histogram is unity. In this way the histograms represent the probability that each value of electron density is observed.
- the values of $\beta$ and $\sigma_{MAP}$ in Eq. (22) are adjusted to minimize the squared difference between the values of the probabilities calculated from Eq. (22) and the observed values from the analysis of the histogram.
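A minimal sketch of this adjustment is a grid search over candidate values. It assumes that Eq. (22) scales each Gaussian centre by $\beta$ and broadens each variance to $\beta^2\sigma_k^2 + \sigma_{MAP}^2$, and it renormalizes the model over the histogram bins before comparison; both choices are assumptions, as are the function names.

```python
import numpy as np

def smeared_gaussian_sum(rho, w, c, sigma, beta, sigma_map):
    # Assumed form: centres scaled by beta, variances broadened by map error.
    rho = np.asarray(rho, dtype=float)[:, None]
    w, c, sigma = (np.asarray(a, dtype=float) for a in (w, c, sigma))
    var = beta ** 2 * sigma ** 2 + sigma_map ** 2
    return (w * np.exp(-(rho - beta * c) ** 2 / (2.0 * var))).sum(axis=1)

def fit_beta_sigma_map(hist, bin_centers, w, c, sigma, betas, sigma_maps):
    # Try every (beta, sigma_map) pair and keep the smallest squared misfit.
    hist = np.asarray(hist, dtype=float)
    best = None
    for beta in betas:
        for s in sigma_maps:
            model = smeared_gaussian_sum(bin_centers, w, c, sigma, beta, s)
            total = model.sum()
            if total <= 0.0:
                continue
            misfit = float(((model / total - hist) ** 2).sum())
            if best is None or misfit < best[0]:
                best = (misfit, beta, s)
    if best is None:
        raise ValueError("no candidate produced a usable model distribution")
    return best[1], best[2]
```

A grid search is used only because it is easy to follow; a gradient-based least-squares minimizer would serve the same purpose with fewer evaluations.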
- the local log-likelihood function for the map in Eq. (17) is based simply on probability distributions for the protein and solvent regions of the map.
- the same approach can be applied to information on the likely values of electron density at a particular point derived from any other source.
- suppose that the probability $p_H$ is known that there is a structural motif, e.g., a helix pattern, in a particular orientation, located at a particular place in the unit cell.
- the prior knowledge about the electron density distribution in the motif can be used in just the same way as the knowledge about the electron density in the solvent or protein regions of the unit cell.
- $p_H(x)$ refers to the probability that there is a structural motif at a known location, with a known orientation, somewhere near the point x, and $p(\rho \mid H)$ is the probability distribution for electron density at this point given that this motif actually is present.
- An exemplary structural motif is a helical structure. There is nothing special about helices (other than their relative regularity), and helices serve to illustrate the features of the present invention.
- the significance of Eq. (23) is that it provides a way to incorporate pattern recognition (the probability that there is a helix with this orientation at this point) into density modification. If the pattern to be detected involves a large part of the map, then it might be identifiable even when errors in the map are very large. Then, if the pattern is well-defined, the last term in Eq. (23) can potentially contribute very substantially to the local log-likelihood function and, therefore, to density modification.
- the formulation in Eq. (23) essentially segments the map into points within protein, within solvent, and within another pattern (helix) of electron density. Strictly speaking, these categories are not mutually exclusive as a point can be both within protein and within a helix. Furthermore, a particular point could be within more than one helix pattern, as the template used to identify a helix might be shorter than the actual helix and several overlapping patterns of helix might be recognized.
- the probability is needed that a particular pattern of electron density (e.g., one corresponding to a helix) is located at each possible position and with each possible orientation in the unit cell.
- this estimation is separated into three steps. First, a template is constructed that is an average of the patterns of electron density found in many instances where it occurs. Next, locations and orientations of a template (such as the electron density for a helix) that match the electron density in the map to some degree are identified. Then the probabilities of these possibilities are estimated.
- Although helices are relatively regular secondary structures, there is some variation from one to another in the precise locations of atoms and in their thermal factors. Even more importantly, the side chains in one helix may be completely different from those in another helix. Consequently construction of a template that has average features is useful for the purpose of pattern matching. Additionally, it is helpful to have a point-by-point estimate of the standard deviation of this density that can be used to identify regions within the template that have more or less variation. A simple method is used to generate a template and standard deviation of the template for helices.
- Residues 133-138 of myoglobin (PDB entry 1A6M; see Berman et al., The Protein Data Bank, Nucleic Acids Research 28:235-242 (2000)) were chosen as a model helical segment. Then 326 segments of 6 amino acids from the largely-helical protein phycoerythrin (PDB entry 1LIA) for which the N, C, Cα, and O atoms could be superimposed on the corresponding atoms in the myoglobin helix with an r.m.s. deviation of 0.5 Å or less were used to generate an average template for α-helices.
- the template was constructed by superimposing each 6-amino acid helical segment of phycoerythrin on the myoglobin helix and calculating an electron density map at a resolution of 3 Å based on all atoms of the phycoerythrin structure that fell inside a 20 Å cube with the helix at the center.
- the resulting electron density within 2.5 Å of an atom in the myoglobin helix was averaged to yield an exemplary helical template.
- the average density in the template region was adjusted to a value of zero, and all points outside the template region were set to values of zero. At the same time, the standard deviation of electron density at each of the same set of points was determined.
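The averaging step can be sketched as follows, assuming the superimposed maps have already been calculated on a common grid and that a boolean mask marks the template region. Setting the standard deviation to zero outside the mask is a choice made for this sketch, and the function name `build_template` is hypothetical.

```python
import numpy as np

def build_template(maps, mask):
    """Average superimposed density maps into a template.

    maps : (n_segments, nx, ny, nz) density maps on a common grid
    mask : (nx, ny, nz) boolean array, True inside the template region
    Returns (template, template_sigma).
    """
    maps = np.asarray(maps, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    mean = maps.mean(axis=0)   # point-by-point average density
    std = maps.std(axis=0)     # point-by-point variability
    # Shift so the average density inside the template region is zero,
    # then zero out everything outside the region.
    mean = mean - mean[mask].mean()
    mean[~mask] = 0.0
    std[~mask] = 0.0
    return mean, std
```

Points with a small standard deviation are those where the motif is consistent from instance to instance (typically main-chain density), while side-chain positions show larger values.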
- Fig. 4 shows the resulting helical template.
- An FFT-based convolution method was used to identify rotations and translations of the helix template that match the electron density in a map to some degree in an extension of earlier methods for pattern matching in electron density maps.
- the helix template was rotated in real-space and placed at the origin of a unit cell with dimensions identical to the map to be searched. Structure factors for the rotated template were calculated in space group P1 and the convolution of the template and the electron density map was calculated using an FFT.
- Each point in this convolution corresponds to a translation of the rotated template.
- the value of the convolution at each point is essentially the integral over the template region of the density in the rotated, translated template, multiplied by the density in the map. This product is expected to be high if the rotated, translated template has a high correspondence to the map and low otherwise.
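The translation search for one template orientation can be sketched with NumPy's FFT routines. Strictly, the quantity computed below is a periodic cross-correlation (the overlap integral described above), obtained by multiplying the transform of the map by the complex conjugate of the transform of the template; the function name `template_overlap` is hypothetical and the template is assumed to be already rotated and sampled on the same grid as the map.

```python
import numpy as np

def template_overlap(density_map, template):
    """Overlap of a template with a map for every possible translation.

    Both arrays share the same periodic grid. Element t of the result is
    sum over x of density_map[x + t] * template[x].
    """
    fm = np.fft.fftn(density_map)
    ft = np.fft.fftn(template)
    return np.fft.ifftn(fm * np.conj(ft)).real

# Usage: the highest value marks the best-matching translation.
rng = np.random.default_rng(0)
template = np.zeros((8, 8, 8))
template[:3, :3, :3] = rng.normal(size=(3, 3, 3))
density = np.roll(template, shift=(3, 5, 2), axis=(0, 1, 2))
overlap = template_overlap(density, template)
best = tuple(int(i) for i in np.unravel_index(np.argmax(overlap), overlap.shape))
```

In a full search this calculation would be repeated for each sampled rotation of the template, which is why evaluating all translations at once with an FFT matters.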
- the template is rotated in increments of 10° over three rotation axes.
- Because the α-helix template is essentially symmetric when rotated 100° about its axis and translated along its axis, the search only included 100° of rotation about the helix axis.
- a height cutoff was calculated such that in a random map only about one peak would be chosen every other rotation.
- the cutoff was estimated from the number of reflections (an estimate of the number of degrees of freedom in the map), the mean and standard deviation of the convolution function. Typically the cutoff was in the range of 3σ to 4σ, and typically about 200 to 2000 peaks were saved. In cases where there are templates with center-to-center distances of less than 2 Å, the one with the higher peak height was chosen.
- This residual error $\sigma_{RESID}$ is estimated from the r.m.s. difference $\sigma_{FIT}$ between the map and the template (after multiplying the template by a scale factor $\alpha$ and adding an adjustable offset) and the uncertainty in the template itself $\sigma_H$ (based on the variability in electron densities for model helices):
- a convolution-based search might show a large peak corresponding to overlap of a 6 amino acid-long template and these 3 amino acids, yet only part of the template pattern is really present. In this example, it might be reasonable to say that there is a 50% chance that any given point in the template is a good description of the true electron density in the map, but not to say that this chance is 100%. Additionally, it is well known that the convolution is not the best discriminator of the location of a pattern in an image. A combination of prior knowledge of the helical content of the protein in the crystal and the correlation coefficient of each match of template to map is used to estimate the probability that each match correctly identifies a region of the map with this pattern of electron density.
- the number of templates that are likely to be needed to describe all the helical regions in the unit cell is estimated. This is necessarily rather approximate both because the number of residues in helical conformation is not ordinarily known very accurately and because in the present method the templates describing a helix can overlap.
- the probability is then estimated that each template match, with correlation coefficient $CC_{OBS}$, is at least partially correct (that is, it does not arise by chance): where $p_0(H)$ and $p_0(notH)$ are the a priori probabilities that there is or is not a helix located at this position and orientation, and $p(CC_{OBS} \mid H)$ and $p(CC_{OBS} \mid notH)$ are the probabilities that this correlation coefficient would be found for correct and incorrect matches, respectively.
- $p_0(notH) \approx 1$.
- Eq. (25) is an expression for ... The only unknown term in Eq. (27) is $p_0(H)$, the a priori probability that there is a helix in this position and orientation.
- the term $p_0(H)$ is estimated by adjusting it so that the total number of templates is equal to $N_{Template}$ (Eqs. (25)-(27)):
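This adjustment can be sketched as a one-dimensional root search. The sketch assumes a Bayes' rule posterior for each match, takes the likelihoods of the observed correlation coefficient for correct and incorrect matches as given inputs, uses $1 - p_0(H)$ for the prior of no helix, and solves for the prior by bisection; the function names and the bisection itself are illustrative choices.

```python
def posterior_helix(p0_h, like_h, like_not_h):
    # Bayes' rule: probability that a match is at least partially correct.
    num = p0_h * like_h
    den = num + (1.0 - p0_h) * like_not_h
    return num / den if den > 0.0 else 0.0

def solve_prior(likes_h, likes_not_h, n_template, tol=1e-10):
    """Find p0(H) such that the posteriors sum to the expected template count.

    likes_h, likes_not_h : p(CC_OBS | H) and p(CC_OBS | notH) for each match
    n_template           : expected number of correct templates; must lie
                           between 0 and the number of matches
    """
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        total = sum(posterior_helix(mid, a, b)
                    for a, b in zip(likes_h, likes_not_h))
        # The summed posterior grows with the prior, so bisection applies.
        if total < n_template:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Once the prior is fixed, `posterior_helix` gives the probability attached to each individual match.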
- the fraction that matches the pattern ($f_{match}$) is estimated based on the ratio of the correlation coefficient for each match ($CC_{OBS}$) to the highest correlation coefficient for any match in the map ($CC_{MAX}$):
- Using Eq. (29) along with the average helix template and its standard deviation, the new terms in Eq. (23) can be evaluated.
- the probability $p_H(x)$ that there is a helix at a particular location and orientation that contributes some information about the electron density at point x is given by,
- the probability that this template match is at least partially correct is (Eq. (29))
- the estimated fraction of the template that is involved in the match is f match
- H refers to a template match that overlaps the point x.
- the probability distribution for electron density at x is given by Eq. (20), where the ideal electron density distribution $p(\rho_T)$ is based on the mean $\rho_{Template}$ and standard deviation $\sigma_{Template}$ of the rotated, translated template at the point x,
- the process discussed above is more particularly shown in Figures 1, 2, and 3.
- the basic process of maximum-likelihood density modification has two parts.
- the characteristics of model electron density map(s) are obtained ( Figures 1 and 3). These will typically be the same or similar for many different applications of the algorithm.
- In the second part (Figure 2), a particular set of structure factors has typically been obtained using experimental measurements on a crystal. This set of structure factors can be directly used to calculate an electron density map. Due to uncertainties in measurement, the electron density map is imperfect.
- a set of structure factors (phases and amplitudes) is found that is consistent with experimental measurements of those structure factors, and that, when used to calculate an electron density map, leads to an electron density that has characteristics similar to those obtained from the model electron density map(s).
- a likelihood-based approach is used to find this optimal set of structure factors.
- Figure 1 shows a process for obtaining characteristics from model electron density maps to use in the above equations.
- a model protein structure obtained by X-ray crystallography is chosen 10.
- the model is used to conventionally calculate an electron density map 12.
- the electron density map is segmented into "protein" and "solvent" regions 14, along with regions containing structural motifs, where the protein region contains all points within a selected proximity to an atom in the model. Histograms of electron density are obtained 16 for "protein" and "solvent" regions.
- coefficients for the Gaussian function formed by Eq. (21) are found so that Eq. (21) is optimally fitted 18 to the histogram for that region.
- Eq. (21), with the fitted coefficients, is output 22 as the analytical description of the electron density distribution in the protein or solvent region for this model structure.
- Figure 2 depicts the process for finding the optimal set of structure factors for a crystal consistent with experimental measurements and resulting in an electron density map having characteristics expected from the model structure and other known motifs, such as helices.
- the inputs are (1) the analytical descriptions of electron density distributions (Eq. (21)) for model solvent and protein regions output 22 from the process shown in Figure 1; (2) the fraction $f_{solvent}$ of the crystal that is in the "solvent" region; (3) the space group and cell parameters of the crystal; and (4) the experimental measurements of structure factors (phases and amplitudes) and their associated uncertainties.
- the overall process steps for estimating the probability that the electron density at each point in the map is correct are: (1) obtaining probability distributions for electron density for the protein and solvent regions of the current electron density map; (2) estimating the probability that the electron density at each point in the map is correct; (3) evaluating how the probabilities would change if the electron density at each point in the map changed; (4) using a Fourier Transform to evaluate how the overall likelihood of the electron density map would change if one crystallographic structure factor changed; (5) combining the likelihood of the map with the likelihood of having observed the experimental data, as a function of each crystallographic structure factor; and (6) deriving a new probability distribution for each crystallographic structure factor. Steps (1) through (6) are then iterated until no substantial further changes in structure factors are obtained.
- the process for finding structure factors that are consistent with experiments and that result in an electron density map with expected characteristics is shown in Figure 2.
- the current best estimates of structure factors are used to calculate 32 an electron density map. If there is uncertainty in amplitude or phase, the weighted mean structure factor is ordinarily used, where all possible amplitudes and phases are weighted by their relative probabilities.
- the electron density map is segmented into protein and solvent regions as described by Wang, Methods Enzymol. 115, pp. 90-112 (1985) and Leslie, Proceedings of the Study Weekend organized by CCP4, pp. 25-32 (1988), incorporated by reference.
- the analytical descriptions of electron density distributions for model protein and solvent regions are fitted by least-squares to the observed electron density distributions in the protein and solvent regions in
- An FFT is used to calculate 38, for each structure factor, how the overall log-likelihood of the map would change if that structure factor were changed. Then, the log-likelihood of the map as a function of all possible values of each structure factor is estimated 42 from a Taylor's series expansion of the log-likelihood of the map. This provides a log-likelihood estimate of any value of each structure factor as the sum of the log-likelihood of the resulting map with the log-likelihood of having observed the experimental data given that value.
- the new estimate 44 of the logarithm of the probability that a structure factor has a particular value is obtained by adding together the log-likelihood of the map for that value of the structure factor and the log-likelihood of observing the experimental value of the structure factor.
- the exponentiation of these values is the probability of each possible value of a structure factor and is used to obtain a new weighted estimate of the structure factor.
- the new estimate of the structure factor is then returned to step 32 to begin a new iteration with a revised electron density map.
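The combination and exponentiation steps can be sketched for one reflection as follows. The example assumes a fixed amplitude and a sampled phase grid, and uses made-up log-likelihood curves (a two-fold ambiguous experimental term and a unimodal map term); the function name and all numbers are hypothetical.

```python
import numpy as np

def update_structure_factor(amplitude, phases_deg, ll_map, ll_obs):
    """Add the two log-likelihoods, exponentiate, and form the weighted estimate."""
    log_p = np.asarray(ll_map, dtype=float) + np.asarray(ll_obs, dtype=float)
    log_p = log_p - log_p.max()           # subtract the maximum to avoid overflow
    p = np.exp(log_p)
    p /= p.sum()                          # probability of each sampled phase
    phases = np.deg2rad(np.asarray(phases_deg, dtype=float))
    f_new = np.sum(p * amplitude * np.exp(1j * phases))
    return p, f_new

phases_deg = np.arange(0, 360, 10)
phi = np.deg2rad(phases_deg)
# Hypothetical experimental term with equal peaks at 60 and 240 degrees
ll_obs = 3.0 * np.cos(2.0 * (phi - np.deg2rad(60.0)))
# Hypothetical map term that favours the 60-degree solution
ll_map = 2.0 * np.cos(phi - np.deg2rad(60.0))

p, f_new = update_structure_factor(100.0, phases_deg, ll_map, ll_obs)
new_phase_deg = np.degrees(np.angle(f_new))
```

Here the map term resolves the phase ambiguity left by the experimental term, and the resulting weighted structure factor would feed the next map calculation.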
- a structural motif is selected to further input known information about electron density distribution, as further shown in Figure 3.
- a structural motif appropriate to the structure being evaluated is selected 52.
- a template is formed 54 that is representative of the selected motif, preferably by averaging the motif structure over many instances where it occurs.
- the template is then used to search the initial electron density map to locate possible matches 56 with the template.
- the probability that a match has been found is estimated 58 to verify that the match did not arise just by chance.
- the most probable matches are selected and the probability of the electron density distribution at that location is then determined 62 for input to the density modification process.
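The template search can be pictured with the translation-only sketch below. It scores every position of a template against a periodic toy map using an FFT-based cross-correlation and converts the scores to z-scores as a crude stand-in for the match probability. Rotational search and the actual probability estimate are not shown, and the function name, the random template, and the toy map are assumptions.

```python
import numpy as np

def template_scores(rho, template):
    """z-scores of the circular cross-correlation of a map with a template."""
    padded = np.zeros_like(rho, dtype=float)
    region = tuple(slice(0, n) for n in template.shape)
    padded[region] = template - template.mean()      # zero-mean template
    # Correlation theorem: cc[s] = sum_x template[x] * rho[x + s]
    cc = np.fft.ifftn(np.fft.fftn(rho) * np.conj(np.fft.fftn(padded))).real
    return (cc - cc.mean()) / cc.std()

# Toy map: noise with one copy of the template placed at offset (10, 20)
rng = np.random.default_rng(2)
template = rng.normal(size=(5, 5))
rho = 0.3 * rng.normal(size=(32, 32))
rho[10:15, 20:25] += template

z = template_scores(rho, template)
best = np.unravel_index(np.argmax(z), z.shape)       # most probable match position
```

A high z-score at the best position indicates that the match is unlikely to be a chance alignment with noise; only such positions would be passed on to the density modification step.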
- the method was tested further by using just one of the 15 selenium atoms in the β-catenin structure for phasing.
- the starting map was exceptionally noisy and had a correlation coefficient to the model map of just 0.24.
- Real-space density modification with dm resulted in only a small improvement of the map, leading to a correlation coefficient of 0.30.
- some helices could be recognized and maximum-likelihood density modification with pattern recognition yielded a final map that was interpretable in many regions and had a correlation coefficient to the model map of 0.51.
- the modified map has some regions that are very clear and others that have very little density.
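For reference, a map correlation coefficient of the kind quoted above can be computed as a Pearson correlation over grid points. The sketch below assumes both maps are sampled on the same grid; the function name and the random toy maps are illustrative and do not reproduce the reported values.

```python
import numpy as np

def map_correlation(rho_a, rho_b):
    """Pearson correlation coefficient between two maps on the same grid."""
    a = rho_a.ravel() - rho_a.mean()
    b = rho_b.ravel() - rho_b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Toy example: a "model" map and a noisy version of it
rng = np.random.default_rng(3)
model_map = rng.normal(size=(8, 8, 8))
noisy_map = model_map + 2.0 * rng.normal(size=(8, 8, 8))

cc_self = map_correlation(model_map, model_map)
cc_noisy = map_correlation(model_map, noisy_map)
```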
- the density-modification procedures developed here contain two fundamental changes from methods in general use. One is the use of optimization of a likelihood function rather than phase recombination between experimental and modified maps. The second is the use of a log-likelihood function for a map.
- the optimization of a likelihood function (Eq. (3)) is important because it places density modification on a sound statistical foundation. In the present case, it also eliminates difficulties in weighting of experimental and modified phases. This optimization is made practical by the approaches that have been developed involving reciprocal-space calculations of derivatives of the likelihood function with respect to structure factors.
- Somoza et al. (1995) showed previously that optimization of a target function that includes the differences between model electron density and a target electron density in regions of the unit cell (such as solvent regions) where the electron density is known can accomplish the same function as conventional solvent flattening procedures.
- the present invention extends this by developing the concept of a map likelihood function, showing how a map likelihood function can be calculated, and showing how optimizing a combined likelihood function that consists of the map likelihood function, an experimental likelihood function, and any a priori information can be used to obtain crystallographic phase information.
- the present invention also carries out the optimization process with respect to crystallographic phases, rather than electron density, and through the use of derivatives of the log-likelihood function with respect to crystallographic structure factors, rather than through solving a linear system of equations as done by Somoza et al. (1995).
- the map likelihood function is a statement of the plausibility of an electron density map calculated from some set of structure factors.
- the plausibility can include any information about patterns of electron density that are expected and not expected.
- the implementation of the likelihood function for a map (Eq. 4) is a simplified version in which each point in the map is treated independently.
- the overall log-likelihood of the map is the integral over the unit cell of the local map log-likelihood function.
- the local log-likelihood function for a map can readily incorporate information about solvent and protein regions in the map if they are identified by some means. After taking into consideration the noise in the map (Eq. (20)), the electron density at a point known to be in the solvent region is plausible only if it has values within a narrow range expected in the solvent. Similarly, the density at a point in the protein region is plausible only if it has a value in the somewhat greater range expected in the protein region.
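A simplified version of such a local log-likelihood is sketched below. It assumes that the density in each region follows a single Gaussian, broadened by the map noise, and that each grid point carries a probability of belonging to the protein region; the source equations are not reproduced here, and every name and numerical value is hypothetical.

```python
import numpy as np

def gaussian(x, mean, sigma):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def local_log_likelihood(rho, p_protein, prot=(0.3, 0.4), solv=(0.0, 0.05), sigma_map=0.2):
    """ln[p_prot * p(rho | protein) + (1 - p_prot) * p(rho | solvent)].

    prot and solv are assumed (mean, sigma) pairs; sigma_map is the map noise,
    added in quadrature to each region's intrinsic spread.
    """
    s_prot = np.hypot(prot[1], sigma_map)
    s_solv = np.hypot(solv[1], sigma_map)
    p = (p_protein * gaussian(rho, prot[0], s_prot)
         + (1.0 - p_protein) * gaussian(rho, solv[0], s_solv))
    return np.log(p)

def map_log_likelihood(rho, p_protein, cell_volume=1.0):
    # Integral over the unit cell approximated by a sum over grid points
    local = local_log_likelihood(rho, p_protein)
    return local.sum() * cell_volume / rho.size

p_protein = np.array([1.0, 1.0, 0.0, 0.0])           # two protein, two solvent points
rho_flat = np.array([0.8, -0.1, 0.0, 0.02])          # solvent points near zero
rho_rough = np.array([0.8, -0.1, 0.5, -0.4])         # same protein, noisy solvent
ll_flat = map_log_likelihood(rho_flat, p_protein)
ll_rough = map_log_likelihood(rho_rough, p_protein)
```

The map with flat solvent scores higher, which is the sense in which the narrow solvent distribution makes only near-constant solvent density plausible.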
- the patterns of electron density that are included in the local log-likelihood function need not be as simple as the probability distribution for electron density in solvent or protein regions. They can include detailed information about the electron density in a region as well.
- Eq. (23) shows how to incorporate information on a pattern of density corresponding to a structural motif such as a fragment of α-helix. Any other fragment density information can be incorporated in a similar fashion.
- the difference can be best appreciated in an idealized case where only a small fragment of structure is missing from an otherwise perfect model, and a difference Fourier or similar calculation is carried out to identify the missing fragment.
- the difference density can be located anywhere in the map (though much will be in the correct region).
- in the map likelihood approach, the fact that the density is known exactly everywhere except in the region of the missing fragment is explicitly taken into account. Consequently, in this approach all the difference density would be located in the region where the missing fragment is located.
- α-helices are identified in a map and used to improve phases.
- the rotated, translated templates (or coordinates of atoms in a model helix) would be used to calculate model phases, and a σA-weighted combined phase map would be calculated.
- the uncertainties in electron density based on the model alone would be assumed to be distributed over the entire unit cell.
- uncertainties in electron density are relatively low in the entire region of each helical template (where the model electron density is relatively well known), and higher elsewhere in the protein region (where it is poorly known), and once again lower in the solvent region (where it is very precisely known). This point-by-point specification of uncertainty in the map allows a much more complete use of the available information about the partial model than the model phase method.
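The point-by-point specification of uncertainty can be sketched as an array of standard deviations, one per grid point, chosen by region. The example below assumes a Gaussian form for the local log-likelihood and uses made-up uncertainty values and masks; none of these names or numbers come from the source.

```python
import numpy as np

def sigma_map(shape, protein_mask, helix_mask,
              sigma_solvent=0.05, sigma_protein=0.5, sigma_helix=0.1):
    """Per-point uncertainty in expected density (hypothetical values)."""
    sigma = np.full(shape, sigma_solvent)    # solvent: very precisely known
    sigma[protein_mask] = sigma_protein      # generic protein: poorly known
    sigma[helix_mask] = sigma_helix          # inside a helical template: well known
    return sigma

def pointwise_log_likelihood(rho, expected, sigma):
    # Gaussian log-likelihood with a separate sigma at every grid point
    return -0.5 * ((rho - expected) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Toy 1-D cell: points 0-7 protein (2-4 covered by a helix template), 8-11 solvent
protein_mask = np.zeros(12, dtype=bool)
protein_mask[:8] = True
helix_mask = np.zeros(12, dtype=bool)
helix_mask[2:5] = True
sigma = sigma_map((12,), protein_mask, helix_mask)

expected = np.zeros(12)
# Log-likelihood cost of the same 0.2 deviation from the expected density at each point
penalty = (pointwise_log_likelihood(expected, expected, sigma)
           - pointwise_log_likelihood(expected + 0.2, expected, sigma))
```

The same deviation costs most in the solvent, less inside the helix template, and least in the generic protein region, which is how the local uncertainties weight the available information.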
- the key to the use of the local log-likelihood function for a map is the specification of a probability distribution for the electron density for some subset of points in the map. It doesn't matter if this specification says that all the points in a region have the same electron density, or whether the points in this region have a particular pattern of electron density such as a part of a helix. Much the same amount of information is conveyed in either case, and essentially the same amount of improvement in phases or structure factors can potentially be obtained in either case.
- the methods of the present invention provide a simple and practical way to incorporate prior knowledge of the electron density in a crystal structure into probability distributions for structure factors.
- the prior knowledge can range from the locations of solvent and protein regions to detailed information on a local pattern of electron density corresponding to a fragment of structure.
- Electron density information from one copy of a macromolecule in the asymmetric unit can be used in the present approach in the same way as other partial structure information.
- the ability to specify separate probability distributions for electron density at each point in the map will make it possible to take into account the different amounts of error in different parts of the partial model. In that way, the parts that are most similar can effectively be weighted more strongly and the parts that are more different be weighted less strongly, a property that is more difficult to achieve with current methods.
- the same approach could be used to combine information on electron density from more than one crystal form as well.
- a second is in the area of molecular replacement.
- a fourth and somewhat speculative possibility is the use of the present approach in ab initio phasing of macromolecular structures.
- the information content in the statement that a particular region of the unit cell is solvent is nearly the same as the statement that the region contains an α-helix. This is despite the fact that a model-phased map would be essentially noise in the solvent case and would contain significant information in the case of the helix.
- the importance is that the formulation of the present invention allows the use of any information about the local probability distribution for electron density, even distributions that are completely uniform. This observation leads to the possibility of guessing that a particular region of the unit cell is contained in the solvent, and using the resulting phase information as the starting point for iterative phase improvement using the methods described here.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2001274848A AU2001274848A1 (en) | 2000-05-22 | 2001-05-16 | Maximum likelihood density modification by pattern recognition of structural motifs |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US20651300P | 2000-05-22 | 2000-05-22 | |
US60/206,513 | 2000-05-22 | ||
US09/769,612 US6721664B1 (en) | 2000-02-25 | 2001-01-23 | Maximum likelihood density modification by pattern recognition of structural motifs |
US09/769,612 | 2001-01-23 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2001090715A2 (en) | 2001-11-29 |
WO2001090715A3 (en) | 2002-03-28 |
Family
ID=26901416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/016001 WO2001090715A2 (en) | 2000-05-22 | 2001-05-16 | Maximum likelihood density modification by pattern recognition of structural motifs |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2001274848A1 (en) |
WO (1) | WO2001090715A2 (en) |
-
2001
- 2001-05-16 AU AU2001274848A patent/AU2001274848A1/en not_active Abandoned
- 2001-05-16 WO PCT/US2001/016001 patent/WO2001090715A2/en active Application Filing
Non-Patent Citations (2)
Title |
---|
BRICOGNE G.: 'Bayesian statistical theory of the phase problem. 1. A multichannel maximum-entropy formalism for constructing generalized joint probability distribution of structure factors' ACTA CRYST. vol. A44, 1988, pages 517 - 545, XP002947085 * |
XIANG S.: 'Entropy maximization constrained by solvent flatness: A new method for macromolecular phase extension and map improvement' ACTA CRYST. vol. D49, 1993, pages 193 - 212, XP002947086 * |
Also Published As
Publication number | Publication date |
---|---|
WO2001090715A3 (en) | 2002-03-28 |
AU2001274848A1 (en) | 2001-12-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: JP |