WO2001086271A2

WO2001086271A2 - Structure factor determinations

Info

Publication number: WO2001086271A2
Application number: PCT/US2001/015003
Authority: WO
Inventors: Zeljko Dzakula
Original assignee: Molecular Simulations, Inc.
Priority date: 2000-05-08
Filing date: 2001-05-08
Publication date: 2001-11-15
Also published as: GB2373981A; WO2001086271A3; GB0200722D0; GB2373981B

Abstract

A method reduces the structure factor phase ambiguity corresponding to a selected reciprocal lattice vector. The method includes generating an original phase probability distribution corresponding to a selected structure factor phase of the selected reciprocal lattice vector. The original phase probability distribution includes a first structure factor phase ambiguity. The method further includes combining the original phase probability distribution with a plurality of phase probability distributions of a plurality of structure factor phases of other reciprocal lattice vectors using a phase equation or inequality. The method further includes producing a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector. The resultant phase probability distribution includes a second structure factor phase ambiguity which is smaller than the first structure factor phase ambiguity. In addition, a method uses linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data. The method includes expressing the first structure factor component as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms.

Description

STRUCTURE FACTOR DETERMINATIONS

Background of the Invention

Field of the Invention

The invention relates to x-ray crystallography.

Description of the Related Art

In x-ray diffraction crystallography, a crystalline form of the molecule under study is exposed to a beam of x-rays, and the intensity of diffracted radiation at a variety of angles from the angle of incidence is measured. The beam of x-rays is diffracted into a plurality of diffraction "reflections," each reflection representing a reciprocal lattice vector. From the diffraction intensities of the reflections, the magnitudes of a series of numbers, known as "structure factors," are determined. The structure factors in general are complex numbers, having a magnitude and a phase in the complex plane, and are defined by the electron distribution within the unit cell of the crystal. The magnitudes of the complex numbers are relatively easy to experimentally determine from measured diffraction intensities of the various reflections. However, a map of electron density and/or atomic position within the unit cell of the crystal cannot be generated without determining the phases of the structure factors as well. Thus, the central problem in x-ray diffraction crystallography is the determination of phases for structure factors whose amplitudes are already known.

In attempts to determine the structure of large biomolecules such as proteins, one of the most frequently used approaches to solve this problem is based on isomorphous replacement. In single isomorphous replacement (SIR) analysis, one or more heavy atoms are attached to the protein, creating a heavy atom derivative or isomorph of the protein. An analysis of the difference between the x-ray diffraction intensities from the native protein and from its heavy atom derivative can limit the phase of at least some structure factors to two plausible possibilities. For each structure factor, this SIR analysis results in a phase probability distribution curve which is typically substantially bimodal, with peaks positioned at the two most probable phases for that structure factor. To remove the ambiguity of which probability peak corresponds to the correct phase for each structure factor, a plurality of heavy atom derivatives can be used to generate a set of phase probability distribution curves for each structure factor. In this multiple isomorphous replacement (MIR) analysis, the probability distribution curves for a selected structure factor are mathematically combined such that the resulting phase value is consistent across all of the heavy atom derivatives for the selected structure factor. In essence, the resulting phase value common to the set of phase probability distribution curves corresponds to the correct phase of the structure factor. An alternative analysis, multiple anomalous diffraction (MAD), has mathematical formalisms which are similar to those of MIR analysis. Aspects of these two procedures are described in Section 8.4, pages 255-267, of An Introduction to X-Ray Crystallography by Michael M. Woolfson, Cambridge University Press (1970, 1997). The complete content of the Woolfson textbook is hereby incorporated by reference in its entirety.

The heavy atom derivative method is commonly used when the structure of the protein or other molecule(s) in the unit cell is wholly unknown. However, the preparation of heavy atom derivatives is slow and tedious, and the creation of a sufficient number of heavy atom isomorphs to sufficiently reduce the phase ambiguity is not always possible.

The structure factors used to calculate atomic coordinates from measured x-ray diffraction intensities are oscillatory functions of the indices of the reciprocal lattice vectors with an overall decaying envelope. One expression for these structure factors has the following form:

Equation 1: $F_{hkl} = \sum_j q_j T_j f_j \left[ \cos 2\pi (h x_j + k y_j + l z_j) + i \sin 2\pi (h x_j + k y_j + l z_j) \right]$,

where F_{hkl} is the structure factor for the reciprocal lattice vector with indices h, k, l; q_j are the occupancy populations of each site; T_j are the temperature factors which correspond to thermal motions; and f_j are the atomic scattering factors. While the populations q_j are constants, the temperature factors T_j and atomic scattering factors f_j decrease as the indices h, k, l increase.
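As a minimal sketch of how Equation 1 can be evaluated, the following Python fragment sums the per-site contributions for a given reflection. The function name `structure_factor`, the dictionary layout of each site, and the two-atom test cell are hypothetical and chosen only for illustration. As a simplifying assumption, the temperature factor and scattering factor of each site are held constant here, whereas in practice both decrease with increasing h, k, l as noted above. The coordinates are taken to be fractional coordinates of each site within the unit cell.

```python
import math

def structure_factor(hkl, sites):
    """Evaluate the sum of Equation 1 for one reflection.

    Each site is a dict with fractional coordinates "xyz", occupancy "q",
    temperature factor "T" and scattering factor "f" (the last two are
    held constant here, which is a simplification).
    """
    h, k, l = hkl
    total = 0j
    for site in sites:
        x, y, z = site["xyz"]
        angle = 2.0 * math.pi * (h * x + k * y + l * z)
        weight = site["q"] * site["T"] * site["f"]
        total += weight * complex(math.cos(angle), math.sin(angle))
    return total

# Hypothetical two-atom cell: identical atoms at the origin and body center.
sites = [
    {"xyz": (0.0, 0.0, 0.0), "q": 1.0, "T": 1.0, "f": 6.0},
    {"xyz": (0.5, 0.5, 0.5), "q": 1.0, "T": 1.0, "f": 6.0},
]

F_110 = structure_factor((1, 1, 0), sites)  # contributions add in phase
F_100 = structure_factor((1, 0, 0), sites)  # contributions cancel
```

In this toy cell the two atoms reinforce each other for the (1, 1, 0) reflection and cancel for the (1, 0, 0) reflection, which illustrates how both the magnitude and the phase of each structure factor depend on the atomic positions.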

Working from the magnitudes and phases of the structure factors, the electron density and/or atomic positions within the unit cell of the crystal can be determined. Structural determinations using x-ray diffraction data are described in An Introduction to X-Ray Crystallography by Michael M. Woolfson, Cambridge University Press (1970, 1997), which is hereby incorporated by reference in its entirety.

In principle, all of the x-ray diffraction reflections are capable of being known or measured (i.e., cognizable). However, due to various aspects of the systems used to experimentally measure the reflection intensities, the set of measured intensities may be incomplete, or may contain errors. First, some x-ray diffraction measurement systems do not provide a measurement of the (0, 0, 0) reflection, which can contain useful information regarding the contents of the crystal. Second, the range of reflections accessible by the x-ray measurement system can be constrained to some value, preventing the measurement of reflections corresponding to larger reciprocal lattice vectors. These larger reciprocal lattice vectors can contain high-resolution information (i.e., corresponding to shorter distances in direct space) regarding the crystal structure. Third, various other reflections may be partially or wholly occluded by various portions of the x-ray diffraction measurement system. Fourth, there may be other experimental factors, such as signal-to-noise, which reduce the confidence of a particular measurement by the x-ray measurement system.

Summary of the Invention

According to one aspect of the present invention, a method reduces the structure factor phase ambiguity corresponding to a selected reciprocal lattice vector. The method comprises generating an original phase probability distribution corresponding to a selected structure factor phase of the selected reciprocal lattice vector. The original phase probability distribution comprises a first structure factor phase ambiguity. The method further comprises combining the original phase probability distribution with a plurality of phase probability distributions of a plurality of structure factor phases of other reciprocal lattice vectors using a phase equation or inequality. The phase equation or inequality defines a mathematical relationship between the selected structure factor phase of the selected reciprocal lattice vector and the plurality of structure factor phases of other reciprocal lattice vectors. The method further comprises producing a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector. The resultant phase probability distribution comprises a second structure factor phase ambiguity which is smaller than the first structure factor phase ambiguity.

According to another aspect of the present invention, a method defines a structure factor phase for a reflection derived from x-ray crystallography data. The method comprises generating a first probability distribution for the structure factor phase of the reflection. The method further comprises generating two or more additional probability distributions for the structure factor phases of other reflections. The method further comprises calculating a composite probability distribution for the structure factor phase of the reflection. The composite probability distribution is derived from the first probability distribution of the reflection and the two or more additional probability distributions of the other reflections.

According to another aspect of the present invention, the methods described herein are implemented on a computer readable medium having instructions stored thereon which cause a general purpose computer system to perform the methods described herein. According to another aspect of the present invention, a computer-implemented x-ray crystallography analysis system is programmed to perform the methods described herein.

According to another aspect of the present invention, a computer-implemented x-ray crystallography analysis system comprises a means for retrieving a first phase probability distribution corresponding to a selected structure factor phase of a selected reciprocal lattice vector. The system further comprises a means for retrieving a plurality of second phase probability distributions corresponding to other structure factor phases of other reciprocal lattice vectors. The system further comprises a means for combining the first phase probability distribution and plurality of second phase probability distributions so as to produce a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector.

According to another aspect of the present invention, a method refines x-ray diffraction data. The method comprises combining structure factor phase probability distributions for different reciprocal lattice vectors so that the structure factor phase probability distribution for at least one of the reciprocal lattice vectors is more heavily weighted toward a phase value.

According to one aspect of the present invention, a method uses linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data. The x-ray crystallography data comprises a set of cognizable reflections. The method comprises expressing the first structure factor component as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms. Each term comprises a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection. The method further comprises calculating values for the linear prediction coefficients. The method further comprises substituting the values for the linear prediction coefficients into the first linear equation, thereby defining the first structure factor component for the first reflection. According to another aspect of the present invention, a method refines x-ray diffraction data. The method comprises deriving a value of a first structure factor from a linear combination of other structure factors.

According to another aspect of the present invention, a computer readable medium has instructions stored thereon which cause a general purpose computer to perform a method of using linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data. The x-ray crystallography data comprises a set of cognizable reflections. The method comprises expressing the first structure factor component as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms. Each term comprises a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection. The method further comprises calculating values for the linear prediction coefficients. The method further comprises substituting the values for the linear prediction coefficients into the first linear equation, thereby defining the first structure factor component for the first reflection.

According to another aspect of the present invention, a computer-implemented x-ray crystallography analysis system comprises a structure factor component generator for generating a first structure factor component for a first reflection from x-ray crystallography data using linear prediction analysis. The x-ray crystallography data comprises a set of cognizable reflections. The first structure factor component is expressed as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms. Each term comprises a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection. The system further comprises a calculating module for calculating values for the linear prediction coefficients. The system further comprises a resultant structure factor component definer for defining the first structure factor component for the first reflection by substituting the values for the linear prediction coefficients into the first linear equation.

According to another aspect of the present invention, a computer-implemented x-ray crystallography analysis system comprises a means for generating a first structure factor component for a first reflection from x-ray crystallography data using linear prediction analysis. The x-ray crystallography data comprises a set of cognizable reflections. The first structure factor component is expressed as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms. Each term comprises a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection. The system further comprises a means for calculating values for the linear prediction coefficients. The system further comprises a means for defining the first structure factor component for the first reflection by substituting the values for the linear prediction coefficients into the first linear equation.
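The linear prediction aspects summarized above can be illustrated with a small, idealized sketch. It assumes a hypothetical one-dimensional crystal of identical point atoms with no temperature factors, works with complex structure factors rather than separate components, and solves an exactly determined set of linear equations with a hand-written Gaussian elimination. The names `solve_linear`, `true_structure_factor`, and the atom positions are illustrative assumptions and are not taken from the specification; the embodiments described herein may calculate the coefficients differently.

```python
import cmath

def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small complex system."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    solution = [0j] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][j] * solution[j] for j in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def true_structure_factor(n, positions):
    """Structure factor of index n for identical point atoms in one dimension."""
    return sum(cmath.exp(2j * cmath.pi * n * x) for x in positions)

# Hypothetical fractional positions of three atoms.
positions = [0.10, 0.35, 0.80]
order = len(positions)

# "Cognizable" reflections F_0 .. F_5, treated here as measured.
known = [true_structure_factor(n, positions) for n in range(2 * order)]

# Each equation expresses F_n as a sum over separations m of c_m * F_(n-m).
rows = [[known[n - m] for m in range(1, order + 1)]
        for n in range(order, 2 * order)]
targets = [known[n] for n in range(order, 2 * order)]
coefficients = solve_linear(rows, targets)

# Substitute the coefficients to define a reflection outside the known set.
target_index = 2 * order
predicted = sum(coefficients[m - 1] * known[target_index - m]
                for m in range(1, order + 1))
actual = true_structure_factor(target_index, positions)
```

For this idealized model the prediction matches the true value to within rounding error, because a sum of a few complex exponentials obeys an exact linear recurrence. Real data with noise, thermal motion, and many atoms would generally call for more equations than unknowns and a least-squares fit.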

Brief Description of the Drawings

Figure 1 is a flowchart of one embodiment of a method of reducing structure factor phase ambiguity corresponding to a selected reciprocal lattice vector.

Figure 2 schematically illustrates an example of a substantially bimodal phase probability distribution p(Φ_k) for the phase Φ_k corresponding to a reciprocal lattice vector k.

Figures 3A-3C schematically illustrate phase probability distributions p(Φ_k), p(Φ_{-h}), and p(Φ_{k-h}) for reciprocal lattice vectors k, -h, and k - h, respectively.

Figure 3D schematically illustrates the resultant phase probability distribution P(Φ_k) for the structure factor phase corresponding to reciprocal lattice vector k, based on the three phase probability distributions shown in Figures 3A-3C.

Figures 4A-4C schematically illustrate phase probability distributions p(Φ_k), p(Φ_{-h}), and p(Φ_{k-h}) for reciprocal lattice vectors k, -h, and k - h, respectively.

Figure 4D schematically illustrates the resultant phase probability distribution P(Φ_k) for the structure factor phase corresponding to reciprocal lattice vector k, based on the three phase probability distributions shown in Figures 4A-4C.

Figure 5 is a flowchart of one embodiment of a method of defining a structure factor phase for a reflection derived from x-ray crystallography data.

Figures 6A-6D schematically illustrate an example of an embodiment of the present invention as applied to certain reflections of experimental data.

Figures 7A-7D schematically illustrate an example of an embodiment of the present invention as applied to certain reflections of experimental data.

Figure 8 schematically illustrates a "true" value of the phase obtained from density modification techniques corresponding to the reciprocal lattice vector k.

Figures 9A-9D schematically illustrate an example of an embodiment of the present invention as applied to certain reflections of experimental data.

Figure 9E schematically illustrates a "true" value of the phase obtained from density modification techniques corresponding to the reciprocal lattice vector k .

Figure 10A schematically illustrates an artificial one-dimensional electron distribution composed of ten randomly positioned atoms.

Figure 10B schematically illustrates the correlation between the "calculated" structure factor phases produced by one embodiment of the present invention and the "true" structure factor phases computed from the electron distribution of Figure 10A.

Figure 10C schematically illustrates the electron distribution calculated from the set of structure factor phases from one embodiment of the present invention.

Figure 10D schematically illustrates the electron distribution calculated from the structure factors with random phases.

Figure 11 is a flowchart of one embodiment of a method of using linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data.

Figure 12 is a flowchart of one embodiment of calculating values for the linear prediction coefficients.

Figure 13 is a flowchart of another embodiment of calculating values for the linear prediction coefficients.

Figure 14 is a flowchart of another embodiment of calculating values for the linear prediction coefficients.

Figure 15 is a flowchart of another embodiment of calculating values for the linear prediction coefficients.

Figure 16A schematically illustrates an electron distribution of a hypothetical one-dimensional system of ten atoms.

Figure 16B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 16A and the corresponding linear prediction estimates from an embodiment of the present invention.

Figure 17A schematically illustrates another electron distribution of a hypothetical one-dimensional system of ten atoms.

Figure 17B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 17A and the corresponding linear prediction estimates from an embodiment of the present invention.

Figure 18A schematically illustrates another electron distribution of a hypothetical one-dimensional system of ten atoms.

Figure 18B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 18A and the corresponding linear prediction estimates from an embodiment of the present invention.

Figure 19A schematically illustrates another electron distribution of a hypothetical one-dimensional system of thirty atoms.

Figure 19B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 19A and the corresponding linear prediction estimates from an embodiment of the present invention.

Figure 20A schematically illustrates another electron distribution of a hypothetical one-dimensional system of thirty atoms.

Figure 20B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 20A and the corresponding linear prediction estimates from an embodiment of the present invention.

Figure 21A schematically illustrates another electron distribution of a one-dimensional projection of a hypothetical three-dimensional system of 500 atoms.

Figure 21B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 21A and the corresponding linear prediction estimates from an embodiment of the present invention.

Figure 22A schematically illustrates another electron distribution of a one-dimensional projection of a hypothetical three-dimensional system of 500 atoms.

Figure 22B schematically illustrates the agreement between the true values for the structure factor components corresponding to the electron distribution of Figure 22A and the corresponding linear prediction estimates from an embodiment of the present invention.

Detailed Description of the Preferred Embodiment

In describing embodiments of the invention, the terminology used is not intended to be interpreted in any limited or restrictive manner, simply because it is being utilized in conjunction with a detailed description of certain specific embodiments of the invention. Furthermore, embodiments of the invention may include several novel features, no single one of which is solely responsible for its desirable attributes or which is essential to practicing the inventions herein described.

In many embodiments, the present invention is useful in computer-implemented x-ray crystallography analysis processes. In these processes, x-ray crystallography data is analyzed using software code running on general purpose computers, which can take a wide variety of forms, including, but not limited to, network servers, workstations, personal computers, mainframe computers, and the like. The code which configures the computer to perform these analyses is typically provided to the user on a computer readable medium, such as a CD-ROM. The code may also be downloaded by a user from a network server which is part of a local or wide-area network, such as the Internet.

The general purpose computer running the software will typically include one or more input devices such as a mouse and/or keyboard, a display, and computer readable memory media such as random access memory integrated circuits and a hard disk drive. It will be appreciated that one or more portions, or all of the code may be remote from the user and, for example, resident on a network resource such as a LAN server, Internet server, network storage device, etc. In typical embodiments, the software receives as an input a variety of information, such as the x-ray crystallographic data and any user-determined parameters for the analysis.

Figure 1 is a flowchart of one embodiment of a method 50 of reducing structure factor phase ambiguity corresponding to a selected reciprocal lattice vector. The method 50 comprises generating an original phase probability distribution in an operational block 60. The original phase probability distribution corresponds to a selected structure factor phase of the selected reciprocal lattice vector, and comprises a first structure factor phase ambiguity. The method 50 further comprises combining the original phase probability distribution with a plurality of phase probability distributions of a plurality of structure factor phases of other reciprocal lattice vectors using a phase equation or inequality in an operational block 70. The phase equation or inequality defines a mathematical relationship between the selected structure factor phase of the selected reciprocal lattice vector and the plurality of structure factor phases of other reciprocal lattice vectors. The method 50 further comprises producing a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector in an operational block 80. The resultant phase probability distribution comprises a second structure factor phase ambiguity which is smaller than the first structure factor phase ambiguity.

In the operational block 60, an original phase probability distribution is generated which corresponds to a selected structure factor phase of the selected reciprocal lattice vector. In certain embodiments, the original phase probability distribution is generated using single isomorphous replacement (SIR) analysis. Other examples of analyses which can generate the original phase probability distribution in other embodiments include, but are not limited to, single anomalous dispersion (SAD), multiple isomorphous replacement (MIR), and multiple anomalous dispersion (MAD).

As is known to those of skill in the art, the usual result of SIR analysis is a set of Hendrickson-Lattman coefficients a_k, b_k, c_k, d_k for each reciprocal lattice vector k. These coefficients define the original phase probability distribution p(Φ_k; a_k, b_k, c_k, d_k) for each corresponding structure factor according to the following standard formula:

Equation 1: $p(\Phi_k; a_k, b_k, c_k, d_k) = \exp\left[ a_k \cos(\Phi_k) + b_k \sin(\Phi_k) + c_k \cos(2\Phi_k) + d_k \sin(2\Phi_k) \right]$,

where Φ_k corresponds to the structure factor phase of a reciprocal lattice vector k, and a_k, b_k, c_k, d_k correspond to the Hendrickson-Lattman coefficients for the reciprocal lattice vector k. The normalization factor of Equation 1 has been omitted for simplicity.
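As an illustrative sketch, the Hendrickson-Lattman formula above can be tabulated on a uniform phase grid and normalized numerically. The function name `hl_distribution`, the one-degree grid, and the example coefficient values are assumptions made for this illustration and do not come from any measured data.

```python
import math

def hl_distribution(a, b, c, d, n_points=360):
    """Tabulate the Hendrickson-Lattman distribution on a uniform grid.

    Returns (phases in radians, probability densities), with the densities
    normalized so that they integrate to one over a full turn.
    """
    step = 2.0 * math.pi / n_points
    phases = [i * step for i in range(n_points)]
    exponents = [
        a * math.cos(p) + b * math.sin(p)
        + c * math.cos(2.0 * p) + d * math.sin(2.0 * p)
        for p in phases
    ]
    shift = max(exponents)  # subtract the maximum to avoid overflow in exp
    weights = [math.exp(e - shift) for e in exponents]
    norm = sum(weights) * step
    return phases, [w / norm for w in weights]

# Hypothetical coefficients: only the cos(2*phi) term is non-zero, which
# gives two equal modes at 0 and 180 degrees.
phases, probs = hl_distribution(0.0, 0.0, 2.0, 0.0)
```

With only the double-angle terms present the distribution is symmetric and bimodal, while non-zero a_k or b_k terms favor one mode over the other.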

As described above, the shapes of the phase probability distributions generated from SIR analysis are generally bimodal (i.e., the distribution has two prominent probability modes). In such a bimodal phase probability distribution, the phase has a significant likelihood of being in either mode of the distribution. An example of a substantially bimodal phase probability distribution p(Φ_k) is illustrated in Figure 2 for the phase Φ_k corresponding to a reciprocal lattice vector k. The phase probability distribution p(Φ_k) in Figure 2 has a mode centered at approximately 30 degrees and a second, approximately equal mode at approximately 170 degrees. The value of the phase Φ_k then has an approximately equal probability of being either approximately 30 degrees or approximately 170 degrees. The structure factor phase ambiguity of a phase probability distribution can be defined in terms of the relative weight of each mode of the bimodal distribution.

As illustrated in Figure 2, the two modes of the phase probability distribution p(Φ_k) have approximately equal weights, so it is equally likely that the phase Φ_k has a value in one mode as in the other mode. Therefore the phase probability distribution p(Φ_k) has a relatively high structure factor phase ambiguity. The ambiguity of a phase probability distribution can be quantified by calculating a centroid which represents the ensemble average value for the phase, and a "figure of merit" (FOM) which is a measure of the reliability of the centroid. A FOM value of zero represents complete ambiguity, and a FOM value of one represents total certainty (i.e., a sharp, single-peak phase probability distribution). The phase probability distribution schematically illustrated in Figure 2 has a centroid of 129 degrees and a FOM value of 0.19.

In the crystallographic analysis of large molecules such as proteins, there are thousands of reciprocal lattice vectors or reflections to be examined, and thus thousands of ambiguous phase determinations defined by phase probability distributions, such as the phase probability distribution p(Φ_k) illustrated in Figure 2, each comprising a structure factor phase ambiguity. As described above, MIR analysis can reduce the structure factor phase ambiguities from heavy atom derivatives by analyzing x-ray crystallography data obtained for multiple heavy atom derivatives of the molecule under study. However, the preparation of these additional heavy atom derivatives is slow and tedious, and the creation of a sufficient number of heavy atom isomorphs to sufficiently reduce the structure factor phase ambiguity is not always possible.
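The centroid and figure of merit described above can be sketched numerically. The fragment below assumes the conventional crystallographic definition, in which the probability-weighted mean of exp(iΦ) is formed, its argument is taken as the centroid, and its modulus is taken as the FOM; the specification does not spell out a formula, so this choice is an assumption. The function name `centroid_and_fom` and the idealized two-spike distribution are hypothetical, and the example is not intended to reproduce the values quoted for Figure 2, which depend on the actual shapes of the modes.

```python
import cmath
import math

def centroid_and_fom(phases_deg, probs):
    """Return (centroid in degrees, figure of merit) of a phase distribution.

    The mean of exp(i*phi) weighted by probability is computed; its angle is
    the centroid and its length, between 0 and 1, is the figure of merit.
    """
    total = sum(probs)
    mean_vector = sum(
        p * cmath.exp(1j * math.radians(phi))
        for phi, p in zip(phases_deg, probs)
    ) / total
    centroid = math.degrees(cmath.phase(mean_vector)) % 360.0
    return centroid, abs(mean_vector)

# Idealized bimodal case: two equally weighted sharp modes.
centroid_two, fom_two = centroid_and_fom([30.0, 170.0], [0.5, 0.5])

# Unambiguous case: a single sharp mode.
centroid_one, fom_one = centroid_and_fom([30.0], [1.0])
```

For the two equal spikes the centroid falls midway between the modes, at a phase that neither mode supports, and the FOM is well below one. The single spike gives a FOM of one, which corresponds to total certainty.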

The preparation of these additional heavy atom derivatives can be avoided by certain embodiments of the present invention. In such embodiments, the original phase probability distribution p(Φ_k ) is combined with a plurality of phase probability distributions of a plurality of structure factor phases of other reciprocal lattice vectors using a phase equation or inequality in the operational block 70 of Figure 1. The phase equation or inequality defines a mathematical relationship between the selected structure factor phase of the selected reciprocal lattice vector and the plurality of structure factor phases of other reciprocal lattice vectors.

Various mathematical relationships exist between the phases and/or the amplitudes of different structure factors. Such relationships have been used in various direct methods for solving crystal structures to find the most probable structure factor phases which are consistent with the measured reflections. To date, these direct methods have found application only to solving structures for relatively small molecules, where the crystal structure includes less than about 150 non-hydrogen atoms in the asymmetric unit cell. Several such methods are described in Sections 8.6, 8.7, and 8.8 of the Woolfson reference described above. Embodiments of the present invention differ from the direct methods by using experimentally determined phase probability distributions as inputs (e.g., from MIR, MAD, SIR, SAD analyses). The direct methods utilize only structure factor amplitudes as inputs.

In certain embodiments of the present invention, these mathematical relationships may be used to reduce the structure factor phase ambiguity present in the x-ray crystallography data for large molecules, such as proteins having hundreds or thousands of non-hydrogen atoms per unit cell. In certain embodiments, the phase equation or inequality can define a mathematical relationship known as the phase addition relationship:

Equation 2: $\Phi_k + \Phi_{-h} = \Phi_{k-h}$

where Φ_k is the structure factor phase for the reciprocal lattice vector k, Φ_{-h} is the structure factor phase for the reciprocal lattice vector -h, and Φ_{k-h} is the structure factor phase for the reciprocal lattice vector k - h. The phase addition relationship is based on two axioms: (1) the electron density is non-negative; and (2) the atoms are identical and discrete, with random positions in the unit cell. Certain other embodiments can utilize other phase equations or inequalities which define other mathematical relationships in accordance with the present invention. An example of another phase equation or inequality is described more fully below.

As applied to bimodal phase probability distributions, if three bimodal phase probability distributions for reciprocal lattice vectors k, -h, and k - h have been generated, the most probable phase for reciprocal lattice vector k is the one which adds to a likely correct phase from the phase probability distribution for reciprocal lattice vector -h to produce a likely correct phase from the phase probability distribution for reciprocal lattice vector k - h.

Figures 3A-3D schematically illustrate the combination of an original phase probability distribution p(Φ_k) with the phase addition relationship between a selected structure factor phase of a selected reciprocal lattice vector k and a set of structure factor phases of other reciprocal lattice vectors. Figures 3A-3C schematically illustrate three bimodal phase probability distributions for reciprocal lattice vectors k, -h, and k - h. The phase probability distributions of Figures 3A-3C have been generated synthetically to provide well-separated mode peaks which can be easily distinguished by visual inspection for illustration purposes. Such synthetically-generated functions can imitate the ambiguity found in x-ray crystallography data.

In Figure 3A, the phase probability distribution p(Φ_k) for reciprocal lattice vector k has two mode peaks, a peak 12 centered at 30 degrees, and an approximately equal peak 14 centered at 170 degrees. In Figure 3B, the phase probability distribution p(Φ_{-h}) for reciprocal lattice vector -h has two mode peaks, a peak 16 centered at 60 degrees, and a peak 18 centered at 330 degrees, and in Figure 3C, the phase probability distribution p(Φ_{k-h}) for reciprocal lattice vector k - h also has two mode peaks, a peak 20 centered at 90 degrees, and a peak 22 centered at 170 degrees.

The phase addition relationship implies that the true phase from reciprocal lattice vector k should add to the true phase of reciprocal lattice vector -h to produce the true phase of reciprocal lattice vector k - h. Examination of the peaks in Figures 3A-3C shows that the phase of peak 12 (30 degrees) for reciprocal lattice vector k plus the phase of peak 16 (60 degrees) for reciprocal lattice vector -h produces the phase of peak 20 (90 degrees) for reciprocal lattice vector k - h. Thus, consistency between the phases of these reciprocal lattice vectors selects peak 12 at about 30 degrees as the correct phase for reciprocal lattice vector k.
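The peak-matching argument for Figures 3A-3C can be sketched in a few lines of Python. This is only an illustration: the function names and the 10 degree tolerance are assumptions introduced here, and the peak positions are the ones quoted for peaks 12 through 22.

```python
from itertools import product


def circular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def consistent_modes(modes_k, modes_minus_h, modes_k_minus_h, tolerance=10.0):
    """Return every (phi_k, phi_-h, phi_k-h) combination of peak positions that
    satisfies phi_k + phi_-h = phi_k-h (mod 360) to within `tolerance` degrees.
    The tolerance is an illustrative assumption, not a value from the text."""
    matches = []
    for pk, ph, pkh in product(modes_k, modes_minus_h, modes_k_minus_h):
        if circular_difference(pk + ph, pkh) <= tolerance:
            matches.append((pk, ph, pkh))
    return matches


# Peak positions (degrees) quoted for Figures 3A, 3B and 3C.
matches = consistent_modes([30, 170], [60, 330], [90, 170])
print(matches)  # [(30, 60, 90)]
```

Of the eight possible peak combinations, only the one built on peak 12 satisfies the phase addition relationship within the assumed tolerance.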

In certain embodiments, the combination of the original phase probability distribution p(Φ_k) with the phase equation defining the phase addition relationship in the operational block 70 of Figure 1 is performed in a more mathematically robust and accurate manner by combining the phase addition relationship with the Hendrickson-Lattman formula as follows:

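Equation 3: $$P(\Phi_{\mathbf{k}}) = p(\Phi_{\mathbf{k}}, a_{\mathbf{k}}, b_{\mathbf{k}}, c_{\mathbf{k}}, d_{\mathbf{k}}) \int_{0}^{2\pi} d\Phi_{-\mathbf{h}} \, p(\Phi_{-\mathbf{h}}, a_{-\mathbf{h}}, b_{-\mathbf{h}}, c_{-\mathbf{h}}, d_{-\mathbf{h}}) \, p(\Phi_{\mathbf{k}} + \Phi_{-\mathbf{h}}, a_{\mathbf{k}-\mathbf{h}}, b_{\mathbf{k}-\mathbf{h}}, c_{\mathbf{k}-\mathbf{h}}, d_{\mathbf{k}-\mathbf{h}})$$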

where P(Φ_k) is a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector k. Equation 3 statistically combines the phase addition relationship with the original phase probability distribution for reciprocal lattice vector k to produce a resultant probability distribution P(Φ_k) for the structure factor phase corresponding to reciprocal lattice vector k. As described below, in other embodiments the resultant phase probability distribution can be a composite probability distribution expressed in alternative forms. In certain embodiments, in which the original phase probability distributions are of the form shown in Equation 1, producing a resultant phase probability distribution P(Φ_k) for the selected structure factor phase of the selected reciprocal lattice vector k in the operational block 80 comprises evaluating the integral of Equation 3 analytically. Such an analysis can yield an infinite series involving hypergeometric Bessel functions. In other embodiments, the resultant phase probability distribution P(Φ_k) is produced using numerical integration, in which the form of Equation 3 may be conveniently transformed into the standard form of Equation 2. In such embodiments, the resultant phase probability distribution P(Φ_k) for the selected structure factor phase of the selected reciprocal lattice vector k can be expressed in terms of a revised set of Hendrickson-Lattman coefficients.
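A minimal numerical sketch of the integration of Equation 3 is given below. It assumes a 1 degree grid and synthetic bimodal distributions built from von Mises-shaped peaks of equal height; the peak shape, the concentration parameter `kappa`, and all function names are illustrative assumptions rather than the distributions actually used for the figures.

```python
import math

N = 360  # one grid point per degree (assumed resolution)


def bimodal(centers_deg, kappa=30.0):
    """Synthetic phase distribution: an equal mixture of von Mises-shaped
    peaks at the given centers (an assumed shape, for illustration only)."""
    p = [sum(math.exp(kappa * (math.cos(math.radians(i - c)) - 1.0))
             for c in centers_deg) for i in range(N)]
    total = sum(p)
    return [v / total for v in p]


def combine(p_k, p_minus_h, p_k_minus_h):
    """Discretised Equation 3: P(phi_k) is proportional to
    p_k(phi_k) * sum over phi_-h of p_-h(phi_-h) * p_k-h(phi_k + phi_-h)."""
    P = []
    for i in range(N):
        integral = sum(p_minus_h[j] * p_k_minus_h[(i + j) % N] for j in range(N))
        P.append(p_k[i] * integral)
    total = sum(P)
    return [v / total for v in P]


# Peak positions quoted for Figures 3A-3C.
p_k = bimodal([30, 170])
p_minus_h = bimodal([60, 330])
p_k_minus_h = bimodal([90, 170])
P_k = combine(p_k, p_minus_h, p_k_minus_h)

mode = max(range(N), key=lambda i: P_k[i])
print("most probable phase (degrees):", mode)
print("P(170) / P(30):", P_k[170] / P_k[30])
```

With these assumed peak widths the mode near 170 degrees is strongly reduced relative to the mode near 30 degrees; the exact degree of suppression depends on the widths chosen.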

Figure 3D schematically illustrates the resultant phase probability distribution P(Φ_k) for the structure factor phase corresponding to reciprocal lattice vector k, based on the three phase probability distributions shown in Figures 3A-3C. The resultant phase probability distribution P(Φ_k) is substantially unimodal (i.e., the distribution has only one prominent probability mode).

As compared to the original phase probability distribution for the reciprocal lattice vector k, the resultant phase probability distribution P(Φ_k) has a peak 22 centered at 30 degrees, as does the original phase probability distribution p(Φ_k), but has only a small, almost completely suppressed peak 24 at approximately 170 degrees, which corresponds to the second peak 14 of the original phase probability distribution p(Φ_k). In addition, the peak 22 of the resultant phase probability distribution P(Φ_k) is narrowed as compared to the corresponding peak 12 of the original phase probability distribution p(Φ_k). The resultant phase probability distribution is weighted more heavily to a correct phase than is the original phase probability distribution. Because the resultant phase probability distribution P(Φ_k) has a larger fraction of its weight distributed among a smaller range of phases, the structure factor phase ambiguity of the resultant phase probability distribution P(Φ_k) is smaller than that of the original phase probability distribution p(Φ_k). The original phase probability distribution, as illustrated in Figure 3A, has its centroid at 100 degrees (far away from the true value of 30 degrees) and a FOM value of 0.23. However, the resultant phase probability distribution, as illustrated in Figure 3D, has its centroid at 28 degrees, and a FOM value of 0.92. Therefore, the resultant phase probability distribution has a smaller ambiguity than does the original phase probability distribution.

For embodiments in which the phase probability distributions p(Φ_k), p(Φ_{-h}), and p(Φ_{k-h}) consist of wider peaks, as schematically illustrated in Figures 4A-4C respectively, the resultant phase probability distribution P(Φ_k) is still bimodal, as schematically illustrated in Figure 4D. However, as compared to the original phase probability distribution p(Φ_k) of Figure 4A, the resultant phase probability distribution P(Φ_k) of Figure 4D emphasizes the correct peak mode over the incorrect peak, thereby reducing the structure factor phase ambiguity corresponding to the reciprocal lattice vector k.
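The centroid and FOM values quoted for these distributions can be illustrated with the sketch below. It assumes the usual definitions (the centroid is the argument, and the FOM the modulus, of the probability-weighted mean of exp(iΦ)); the synthetic von Mises-shaped peaks, their width, and the function names are assumptions made here for illustration, so the printed FOM values are not expected to reproduce those of the figures.

```python
import cmath
import math


def centroid_and_fom(p):
    """p: non-negative weights on a uniform grid spanning 0..360 degrees.
    Returns (centroid in degrees, figure of merit), assuming the usual
    definition: the mean of exp(i*phi) weighted by the distribution."""
    n = len(p)
    total = sum(p)
    m = sum(w * cmath.exp(2j * math.pi * i / n) for i, w in enumerate(p)) / total
    return math.degrees(cmath.phase(m)) % 360.0, abs(m)


def peaks(centers_deg, weights, kappa=8.0, n=360):
    """Weighted mixture of von Mises-shaped peaks (assumed shape and width)."""
    return [sum(w * math.exp(kappa * (math.cos(math.radians(i - c)) - 1.0))
                for c, w in zip(centers_deg, weights)) for i in range(n)]


# Two equal peaks at 30 and 170 degrees, and the same pair with the
# second peak mostly suppressed.
c_bi, fom_bi = centroid_and_fom(peaks([30, 170], [1.0, 1.0]))
c_uni, fom_uni = centroid_and_fom(peaks([30, 170], [1.0, 0.05]))
print(c_bi, fom_bi)
print(c_uni, fom_uni)
```

For two equal peaks at 30 and 170 degrees the centroid falls midway, at 100 degrees, which is far from either peak; suppressing the second peak moves the centroid toward 30 degrees and raises the FOM.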

Despite the wider peaks of the phase probability distributions of Figures 4A-4C, the resultant phase probability distribution of Figure 4D is weighted more heavily to a correct phase than is the original phase probability distribution of Figure 4A. The original phase probability distribution, as illustrated in Figure 4A, has its centroid at 100 degrees (far away from the true value of 30 degrees) and a FOM value of 0.28. However, the resultant phase probability distribution, as illustrated in Figure 4D, has its centroid at 89 degrees (approximately 11 degrees closer to the true value of 30 degrees).

For essentially complete suppression of the incorrect peak mode of a bimodal original phase probability distribution, the widths of the peaks in the original phase probability distributions should be less than approximately Φ_{k-h} - (Φ_k + Φ_{-h}), where Φ_k and Φ_{-h} represent the positions of the incorrect phase peak modes in the original phase probability distributions p(Φ_k), p(Φ_{-h}) for the reciprocal lattice vectors k and -h, respectively. Φ_{k-h} can be the position of either the correct or incorrect phase mode for the reciprocal lattice vector k - h. Although this condition may not always be met, as schematically illustrated by the original phase probability distributions of Figures 4A-4C, a typical x-ray crystallography data set contains enormous numbers of redundant reciprocal lattice vector triplets. In certain embodiments, these reciprocal lattice vector triplets can be combined using a phase equation or inequality to reduce the structure factor phase ambiguity corresponding to a single reciprocal lattice vector. Typically, where the reciprocal lattice vectors are related according to their Miller indices, the structure factors are also related. In such embodiments, the cumulative analysis of multiple reciprocal lattice vector triplets as outlined above can substantially minimize the structure factor phase ambiguity even when the original phase probability distributions are extremely wide. Using multiple redundant reciprocal lattice vector triplets can produce a resultant phase probability distribution which is analogous to that produced by analyzing multiple heavy atom isomorphs. Thus, the structure factor phase ambiguity can be reduced for all reciprocal lattice vectors by scanning the entire x-ray crystallography data set for reciprocal lattice vector triplets k, -h, and k - h. In certain embodiments, the procedure can be iterated until a self-consistent, converged solution is found. Furthermore, in embodiments in which multiple heavy atom derivatives are available, using the above procedures improves the efficiency and accuracy of the analysis because the accuracy of the resultant phase probability distributions produced in the initial SIR analysis can be improved.
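Scanning a data set for reciprocal lattice vector triplets can be sketched as follows. The function name and the small set of Miller indices are hypothetical; the indices are taken from the triplets discussed for Figures 7 and 9, and a real data set would contain many more reflections.

```python
def find_triplets(reflections, k):
    """Return the (minus_h, k_minus_h) pairs for which k, -h and k - h are
    all present in `reflections`, a set of integer (h, k, l) index tuples."""
    triplets = []
    for minus_h in sorted(reflections):
        if minus_h == (0, 0, 0):
            continue  # the origin term carries no phase information
        k_minus_h = tuple(a + b for a, b in zip(k, minus_h))
        if k_minus_h in reflections:
            triplets.append((minus_h, k_minus_h))
    return triplets


# Hypothetical miniature data set.
reflections = {(9, 3, 0), (-5, -1, 0), (4, 2, 0),
               (6, 4, 0), (-4, -2, 0), (2, 2, 0)}

triplets_930 = find_triplets(reflections, (9, 3, 0))
triplets_640 = find_triplets(reflections, (6, 4, 0))
print(triplets_930)  # [((-5, -1, 0), (4, 2, 0))]
print(triplets_640)  # [((-4, -2, 0), (2, 2, 0))]
```

Each triplet found in this way supplies one factor of the kind integrated in Equation 3, so the resultant distributions from many triplets can be multiplied together for a given k.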

Figure 5 is a flowchart of one embodiment of a method 200 of defining a structure factor phase for a reflection derived from x-ray crystallography data. The method 200 comprises generating a first probability distribution for the structure factor phase of the reflection in an operational block 210. The method 200 further comprises generating two or more additional probability distributions for the structure factor phases of other reflections in an operational block 220. The method 200 further comprises identifying a relationship between the structure factor phase for the reflection and the structure factor phases of the other reflections in an operational block 230. The method 200 further comprises calculating a composite probability distribution for the structure factor phase of the reflection in an operational block 240. The composite probability distribution is derived from the first probability distribution for the structure factor phase of the reflection and the two or more additional probability distributions for the structure factor phases of the other reflections.

In certain embodiments, generating the first probability distribution for the structure factor phase of the reflection of the operational block 210 is performed as described above. Similarly, generating two or more additional probability distributions for the structure factor phases of other reflections of the operational block 220 is performed as described above.

In certain embodiments, identifying the relationship between the structure factor phase for the reflection and the structure factor phases of the other reflections of the operational block 230 is performed by identifying a phase equation or inequality as described above. For example, the relationship can be identified to be the phase addition relationship expressed by Equation 2. Alternatively, in other embodiments, the relationship between structure factor phases can be expressed by the so-called tangent formula:

Equation 4: $$\operatorname{tg}(\Phi_{\mathbf{h}}) = \frac{\sum_{\mathbf{k}} |E_{\mathbf{k}} E_{\mathbf{h}-\mathbf{k}}| \sin(\Phi_{\mathbf{k}} + \Phi_{\mathbf{h}-\mathbf{k}})}{\sum_{\mathbf{k}} |E_{\mathbf{k}} E_{\mathbf{h}-\mathbf{k}}| \cos(\Phi_{\mathbf{k}} + \Phi_{\mathbf{h}-\mathbf{k}})}$$

where E_k represents the structure factor F_k in which the scattering factor has been set to one. Equation 4 is based on the assumption that $\sum_{\mathbf{k}} E_{-\mathbf{h}} E_{\mathbf{k}} E_{\mathbf{h}-\mathbf{k}}$ has vanishing phase, and that $\sum_{\mathbf{k}} |E_{-\mathbf{h}} E_{\mathbf{k}} E_{\mathbf{h}-\mathbf{k}}| \sin(\Phi_{-\mathbf{h}} + \Phi_{\mathbf{k}} + \Phi_{\mathbf{h}-\mathbf{k}}) = 0$.

In certain embodiments, calculating the composite probability distribution for the structure factor phase of the reflection of the operational block 240 is performed by combining the original phase probability distribution with a phase equation or inequality and producing a resultant phase probability distribution as described above. For example, the phase addition relationship of Equation 2 can be combined with the original phase probability distribution, thereby producing Equation 3 for the resultant phase probability distribution which can be solved. Alternatively, in other embodiments in which the relationship between structure factor phases is provided by the tangent formula of Equation 4, the composite probability distribution can be expressed in the following form:

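Equation 5: $$P(\Phi_{\mathbf{h}}) = \delta\!\left(\Phi_{\mathbf{h}} - \operatorname{arctg}\frac{\sum_{\mathbf{k}} |E_{\mathbf{k}} E_{\mathbf{h}-\mathbf{k}}| \sin(\Phi_{\mathbf{k}} + \Phi_{\mathbf{h}-\mathbf{k}})}{\sum_{\mathbf{k}} |E_{\mathbf{k}} E_{\mathbf{h}-\mathbf{k}}| \cos(\Phi_{\mathbf{k}} + \Phi_{\mathbf{h}-\mathbf{k}})}\right)$$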

where P(Φ_h) is the composite probability distribution and δ(x) is the delta function. In certain embodiments, the delta function can be replaced by a Gaussian function to account for experimental errors, errors in the model, and missing reflections.
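A small one-dimensional sketch of a tangent-formula phase estimate is shown below. It uses the standard form of the tangent formula with unit scattering factors, omits the terms that contain the unknown phase itself, and resolves the quadrant with `atan2`; the function names, the index range, and the atom positions are hypothetical choices made for illustration.

```python
import cmath
import math


def unit_structure_factors(positions, h_max):
    """One-dimensional structure factors with all scattering factors set to
    one, for indices -h_max..h_max. `positions` are fractional coordinates."""
    return {h: sum(cmath.exp(2j * math.pi * h * x) for x in positions)
            for h in range(-h_max, h_max + 1)}


def tangent_formula_phase(E, h):
    """Estimate the phase of reflection h (degrees, 0..360) from the phases
    of the other reflections, using the tangent formula."""
    num = den = 0.0
    for k, E_k in E.items():
        partner = h - k
        if k in (0, h) or partner not in E:
            continue  # skip terms involving E_h itself or missing reflections
        weight = abs(E_k) * abs(E[partner])
        angle = cmath.phase(E_k) + cmath.phase(E[partner])
        num += weight * math.sin(angle)
        den += weight * math.cos(angle)
    return math.degrees(math.atan2(num, den)) % 360.0


# Hypothetical three-atom structure; the estimate is compared with the phase
# computed directly from the atom positions.
E = unit_structure_factors([0.12, 0.35, 0.81], h_max=8)
estimated = tangent_formula_phase(E, 3)
true_phase = math.degrees(cmath.phase(E[3])) % 360.0
print(estimated, true_phase)
```

For a single atom the estimate is exact; for several atoms it is only approximate, which is one motivation for replacing the delta function of Equation 5 with a distribution of finite width.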

In certain embodiments, the composite probability distribution is calculated in the operational block 240 by minimizing a penalty function based on the tangent formula and the probability distributions for the structure factor phases. The penalty function of certain embodiments has the following form:

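Equation 6: $$\ldots - K_2 \sum_{\mathbf{h}} \left[ a_{\mathbf{h}} \cos(\Phi_{\mathbf{h}}) + b_{\mathbf{h}} \sin(\Phi_{\mathbf{h}}) + c_{\mathbf{h}} \cos(2\Phi_{\mathbf{h}}) + d_{\mathbf{h}} \sin(2\Phi_{\mathbf{h}}) \right]$$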

In certain embodiments, Monte Carlo techniques can be utilized to minimize the penalty function, starting from an initial guess for a set of structure factor phases. The Monte Carlo techniques are related to those used in simulations of annealing procedures, as described by Glykos and Kokkinidis in Acta Cryst., Vol. D56, page 169 (2000), which is incorporated by reference herein in its entirety. In other embodiments, other optimization techniques can be used.

Figures 6A-6D and 7A-7D schematically illustrate an example of an embodiment of the present invention as applied to experimental data from the Protein Data Bank, code entry 3APP, corresponding to x-ray diffraction data from penicillopepsin, as published by Sielecki and James in J. Mol. Biol., volume 163, page 299 (1983), which is incorporated by reference herein in its entirety. Figures 6A-6C schematically illustrate the phase probability distributions for the k = (9, 3, 0), -h = (-1, -1, 0), and k - h = (2, 2, 0) reciprocal lattice vectors, respectively. The original phase probability distribution for the reciprocal lattice vector k in Figure 6A is bimodal with a first peak mode centered at approximately 50 degrees and a second peak mode centered at approximately 210 degrees with an intensity approximately equal to that of the first peak. The probability distributions for the structure factor phases for the reciprocal lattice vectors -h and k - h in Figures 6B and 6C respectively are substantially unimodal. As can be seen in the resultant phase probability distribution for the reciprocal lattice vector k in Figure 6D, the second peak mode has nearly disappeared, and the first peak has been sharpened somewhat.

For the purposes of comparison, density modification techniques can be used as an alternative method for refining the phase probability distribution. Density modification techniques have several sub-categories, based on assumptions such as non-crystallographic symmetry, solvent flattening, non-negativity of electron distributions, etc. A description of density modification techniques is provided by "Principles of Protein X-Ray Crystallography" by Jan Drenth, Chapter 8, pages 183-198, Springer-Verlag, New York, 1999, which is incorporated in its entirety by reference herein. The original phase probability distribution, illustrated in Figure 6A, has a centroid at 129 degrees (far away from the value obtained from the density modification technique of 56 degrees) and a FOM value of 0.19. However, the resultant phase probability distribution, illustrated in Figure 6D, has a centroid at 76 degrees (closer to the density modification value of 56 degrees) and a FOM value of 0.80. Therefore, the resultant phase probability distribution for the reciprocal lattice vector k has a structure factor phase ambiguity which is smaller than that of the original phase probability distribution for the reciprocal lattice vector k. In addition, the centroid of the resultant phase probability distribution for k = (9, 3, 0) is in better agreement with that of the phase obtained from the density modification technique, which is schematically illustrated in Figure 8.

Similarly, Figures 7A-7C schematically illustrate the phase probability distributions for the k = (9, 3, 0), - h = (-5, -1, 0), and k - h =(4, 2, 0) reciprocal lattice vectors, respectively.

However, the phase probability distribution for the reciprocal lattice vector -h in Figure 7B is substantially bimodal, while the phase probability distribution for the reciprocal lattice vector k - h in Figure 7C is substantially unimodal but broad. As can be seen in the resultant phase probability distribution for the reciprocal lattice vector k in Figure 7D, the second peak mode still exists, but its intensity has been reduced as compared to the intensity of the first peak, and the first peak has been sharpened somewhat.

The original phase probability distribution, illustrated in Figure 7A, has a centroid at 129 degrees (far away from the value obtained from the density modification technique of 56 degrees) and a FOM value of 0.19. However, the resultant phase probability distribution, illustrated in Figure 7D, has a centroid at 98 degrees (closer to the density modification value of 56 degrees) and a FOM value of 0.43. Therefore, the resultant phase probability distribution for the reciprocal lattice vector k has a structure factor phase ambiguity which is smaller than that of the original phase probability distribution for the reciprocal lattice vector k. Again, the centroid of the resultant phase probability distribution for k = (9, 3, 0) is in better agreement with that of the phase obtained from the density modification technique, which is schematically illustrated in Figure 8.

Figures 9A-9C schematically illustrate the phase probability distributions for the k = (6, 4, 0), -h = (-4, -2, 0), and k - h = (2, 2, 0) reciprocal lattice vectors, respectively. The original phase probability distribution for the reciprocal lattice vector k in Figure 9A is bimodal with a first peak mode centered at approximately 150 degrees and a second peak mode centered at approximately 315 degrees with an intensity approximately equal to that of the first peak. The probability distributions for the structure factor phases for the reciprocal lattice vectors -h and k - h in Figures 9B and 9C respectively are substantially unimodal, but broad. As can be seen in the resultant phase probability distribution for the reciprocal lattice vector k in Figure 9D, the second peak mode has been eliminated, and the first peak has been sharpened somewhat. The original phase probability distribution, illustrated in Figure 9A, has a centroid at 220 degrees (far away from the value obtained from the density modification technique of 148 degrees) and a FOM value of 0.074. However, the resultant phase probability distribution, illustrated in Figure 9D, has a centroid at 136 degrees (closer to the density modification value of 148 degrees) and a FOM value of 0.88. Therefore, the resultant phase probability distribution for the reciprocal lattice vector k has a structure factor phase ambiguity which is smaller than that of the original phase probability distribution for the reciprocal lattice vector k. The centroid of the resultant phase probability distribution for k = (6, 4, 0) is in better agreement with that of the phase obtained from the density modification technique, as schematically illustrated in Figure 9E.

As a further example of an embodiment of the present invention, an artificial one-dimensional electron distribution composed of 10 randomly positioned atoms, as schematically illustrated in Figure 10A, was used to compute the corresponding structure factors, and then to back-compute the electron distribution from the structure factors. All scattering factors were set equal to one, as well as the temperature factors and occupancies. The structure factors were also used in conjunction with the tangent formula of Equation 4 for comparison. Figure 10B schematically illustrates the correlation between the "calculated" structure factor phases produced by the tangent formula used by an embodiment of the present invention and the "true" structure factor phases computed from the electron distribution. As can be seen from Figure 10B, the embodiment of the present invention yielded structure factor phases which had a correlation with the true phases of nearly one.

The subset of low-order structure factor phases from the embodiment of the present invention was then used to calculate the electron distribution, as schematically illustrated in Figure 10C. In calculating the electron distribution of Figure 10C, negative values for electron densities were excluded, which is a physical constraint. Since the electron distribution of Figure 10C was obtained from the truncated set of structure factors which are actually used in the Monte Carlo optimization, it has a reduced resolution as compared to Figure 10A. A comparison of the original electron distribution of Figure 10A and the resultant electron distribution of Figure 10C reveals some correlation. This correlation is highlighted by comparing the original electron distribution of Figure 10A with the calculated electron distribution of Figure 10D, which schematically illustrates the electron distribution calculated from the structure factors with phases set to random numbers between -180 degrees and 180 degrees. Figure 10D was also calculated by excluding negative values for electron densities. The reduction of correlation with the original electron distribution of Figure 10A when the phases resulting from the embodiment of the present invention are ignored provides further support for the validity of the structure factor phases produced by embodiments of the present invention.
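The one-dimensional test described above (structure factors computed from point atoms, then the electron distribution back-computed from a truncated set of structure factors with negative densities excluded) can be sketched as follows. The function names, the truncation limit `h_max`, the random seed, and the sign convention of the Fourier sum are assumptions made for this illustration.

```python
import cmath
import math
import random


def structure_factors(positions, h_max):
    """F_h = sum_j exp(2*pi*i*h*x_j) for h = -h_max..h_max, with unit
    scattering factors, temperature factors and occupancies."""
    return {h: sum(cmath.exp(2j * math.pi * h * x) for x in positions)
            for h in range(-h_max, h_max + 1)}


def electron_density(F, x):
    """Fourier synthesis rho(x) = sum_h F_h exp(-2*pi*i*h*x), with negative
    values clipped to zero as a physical constraint."""
    rho = sum(F_h * cmath.exp(-2j * math.pi * h * x) for h, F_h in F.items())
    return max(rho.real, 0.0)


# Ten atoms at reproducible pseudo-random fractional positions.
rng = random.Random(0)
atoms = sorted(rng.random() for _ in range(10))
F = structure_factors(atoms, h_max=20)

# Sample the back-computed density on a grid of 200 points.
profile = [electron_density(F, i / 200.0) for i in range(200)]
print(max(profile), min(profile))
```

Lowering `h_max` mimics the truncated set of low-order structure factors and visibly broadens the peaks, which corresponds to the reduced resolution noted for Figure 10C.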

As described above, an analysis of x-ray diffraction reflections from a crystal results in an indexed set of complex numbers, called structure factors, from which characteristics of the atomic configuration within the crystal can be derived. In three dimensions, the structure factors F_hkl are indexed by a triplet of integer indices h, k, l, which correspond to the three orthogonal directions in reciprocal space. Higher indices correspond to structure factors which provide information with better spatial resolution of the atomic configuration within the crystal.

The nature of the experimental process limits the maximum values of the h, k, and l indices for which structure factors can be accurately derived. In embodiments of the invention, resolution is improved despite experimental limitations by using experimentally determined structure factors to derive approximate values for the structure factors that cannot be or were not experimentally determined. In advantageous embodiments of the invention, the value of an unknown structure factor is derived from a linear combination of other structure factors having experimentally determined values. As described in further detail below, the coefficients of the linear formula used to derive unknown structure factor values are themselves derived from the experimentally determined structure factor values.

Figure 11 is a flowchart of one embodiment of a method 100 of using linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data. The x-ray crystallography data comprises a set of cognizable reflections. The method 100 comprises expressing the first structure factor component in an operational block 110 as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms. Each term comprises a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection. The method 100 further comprises calculating values for the linear prediction coefficients in an operational block 120. The method 100 further comprises substituting the values for the linear prediction coefficients into the first linear equation in an operational block 130, thereby defining the first structure factor component for the first reflection.

In the operational block 110, the first structure factor component is expressed by a first linear equation as equal to a sum of a first plurality of terms. In certain embodiments, the first structure factor component is real or imaginary. Alternatively, in still other embodiments, the first structure factor component is the magnitude or the phase of the corresponding structure factor.

In certain embodiments, the first structure factor component F_hkl is expressed as the first linear equation in the following form:

Equation 2: $$F_{hkl} = \sum_{s=1}^{N_{coef}} a_s F_{(h - s\Delta_h)(k - s\Delta_k)(l - s\Delta_l)},$$

where N_coef is the number of terms in the sum, and sΔ_h, sΔ_k, sΔ_l represent the separation along the axes a*, b*, and c* in reciprocal space between the first reflection and the cognizable reflection. To produce accurate values for non-experimentally determined structure factors, the value of N_coef is generally at least as large as the number of scatterers in the unit cell, which for protein x-ray crystallography is typically several hundred to several thousand. In the form of Equation 2 for the first structure factor component F_hkl, each term comprises the product of two elements. One element is a structure factor component F_{(h-sΔ_h)(k-sΔ_k)(l-sΔ_l)} for a cognizable reflection from the x-ray crystallography data which is separated in reciprocal space from the first reflection corresponding to the F_hkl structure factor component. In certain embodiments, the structure factor components of the sum correspond to adjacent reflections in reciprocal space (for example, Δ_h = 1, Δ_k = 0, Δ_l = 0). Reverse linear prediction corresponds to negative values for one or more of Δ_h, Δ_k, or Δ_l.

The other element is a linear prediction coefficient a_s corresponding to the separation between the cognizable reflection and the first reflection. As is described below, these linear prediction coefficients a_s are initially unknown, but can be solved for using various methods. In the form of Equation 2, the first structure factor component is expressed as a linear equation comprising a linear combination of other structure factor components with indices which are less than the indices for the first structure factor.

As an example of a first linear equation in accordance with embodiments of the present invention, the F_{Nkl} reflection (where k and l are constants) can be expressed as a linear combination of the structure factor components F_{(N-s)kl}:

Equation 3: $$F_{Nkl} = a_1 F_{(N-1)kl} + a_2 F_{(N-2)kl} + a_3 F_{(N-3)kl} + \cdots + a_{N_{coef}} F_{(N-N_{coef})kl}.$$

In this example, the structure factor components F_{(N-s)kl} are selected along a direction parallel to the a* axis in reciprocal space (i.e., Δ_h = 1, and Δ_k = Δ_l = 0), and s represents the number of steps along this direction. In certain embodiments, the structure factor components F_{(N-s)kl} are known, but the linear prediction coefficients a_s are not known. While in principle, structure factor components for cognizable reflections with all combinations of Δ_h, Δ_k, Δ_l can be used in the first linear equation, in certain embodiments, only a subset will be useful due to missing or erroneous experimental data corresponding to certain reflections.

As a simple one-dimensional example, structure factors F_1 through F_10 may be known, and it may be desired to predict the value of F_11. A series of linear equations may be formed as follows:

Equation 4:
$$F_5 = a_1 F_4 + a_2 F_3 + a_3 F_2$$
$$F_6 = a_1 F_5 + a_2 F_4 + a_3 F_3$$
$$F_7 = a_1 F_6 + a_2 F_5 + a_3 F_4$$
$$F_8 = a_1 F_7 + a_2 F_6 + a_3 F_5$$
$$F_9 = a_1 F_8 + a_2 F_7 + a_3 F_6$$
$$F_{10} = a_1 F_9 + a_2 F_8 + a_3 F_7.$$

As F_2 through F_10 are measured, known values, the three linear prediction coefficients a_1, a_2, and a_3 may be selected so as to force these six equations to be true with a minimum total error. Once these linear prediction coefficients a_s have been selected, a value for unknown F_11 is predicted with the formula:

Equation 5: $$F_{11} = a_1 F_{10} + a_2 F_9 + a_3 F_8.$$
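The one-dimensional example can be sketched numerically as follows. The sketch assumes complex structure factors generated from three hypothetical point atoms with unit scattering factors, and selects the coefficients by least squares through the normal equations; the helper names and atom positions are illustrative assumptions, and other least-squares solvers could be substituted.

```python
import cmath
import math


def structure_factor(h, positions):
    """F_h for point atoms with unit scattering factors (assumed model)."""
    return sum(cmath.exp(2j * math.pi * h * x) for x in positions)


def solve_linear(M, y):
    """Gaussian elimination with partial pivoting for a small complex system."""
    n = len(y)
    A = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / A[r][r]
    return x


def fit_coefficients(F, order, first, last):
    """Least-squares fit of a_1..a_order to the equations
    F[n] = sum_s a_s * F[n - s] for n = first..last."""
    rows = [[F[n - s] for s in range(1, order + 1)] for n in range(first, last + 1)]
    rhs = [F[n] for n in range(first, last + 1)]
    AhA = [[sum(r[i].conjugate() * r[j] for r in rows) for j in range(order)]
           for i in range(order)]
    Ahb = [sum(r[i].conjugate() * b for r, b in zip(rows, rhs)) for i in range(order)]
    return solve_linear(AhA, Ahb)


positions = [0.11, 0.37, 0.74]  # hypothetical atom positions
F = {h: structure_factor(h, positions) for h in range(1, 12)}

a = fit_coefficients(F, order=3, first=5, last=10)   # the six equations of Equation 4
predicted = sum(a[s - 1] * F[11 - s] for s in range(1, 4))  # Equation 5
print(abs(predicted - F[11]))
```

With three point atoms and three coefficients the recurrence is exact, so the predicted F_11 agrees with the directly computed value to within rounding error; this is consistent with the statement that N_coef should be at least as large as the number of scatterers.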

Several techniques for determining linear prediction coefficients are described in further detail below with reference to Figures 2, 3, 4, and 5. In certain embodiments which use Equation 2 to express the first structure factor component, the separation between the first reflection and each cognizable reflection has the same number of steps along each of the reciprocal space axes a*, b*, and c*, by virtue of using the single index s for all three components of the separation. In other embodiments, two or three indices are used in place of the single index of Equation 2 to include cognizable reflections in the first linear equation which have different numbers of steps along the three reciprocal space axes. Persons skilled in the art are able to express the first structure factor component as a first linear equation in accordance with these embodiments of the present invention.

It will be appreciated by those in the art that a variety of mathematical techniques for selecting a set of linear prediction coefficients a_s from already measured structure factor values have been developed and may be used in embodiments of the invention. In general, the techniques involve selecting a set of linear prediction coefficients that predicts, with the least total error, a set of the known structure factor values from other known structure factor values using a series of linear equations of the form of Equation 2. This set of linear prediction coefficients is then used in the linear formula of Equation 2 to predict the value of an unknown structure factor component from other known structure factor values. Such techniques have been applied in communication signal processing and analysis applications, but have never been utilized in the analysis of x-ray diffraction data.

In the operational block 120, values for the linear prediction coefficients are calculated. Figure 12 is a flowchart of one embodiment of the calculation corresponding to operational block 120. In the embodiment illustrated in Figure 12, calculating values for the linear prediction coefficients comprises expressing a plurality of second structure factor components for a plurality of second reflections from the set of cognizable reflections in an operational block 121 as a plurality of second linear equations. In the plurality of second linear equations, each second structure factor component is equal to a sum of a second plurality of terms. Each term comprises a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the second reflection, and (2) the linear prediction coefficient corresponding to the separation between the cognizable reflection and the second reflection. In certain embodiments, each of the second linear equations is similar in form to the first linear equation, using the linear prediction coefficients a_s corresponding to the separation between the cognizable reflection and the second reflection. Calculating values for the linear prediction coefficients further comprises solving the plurality of second linear equations for a set of values for the linear prediction coefficients in an operational block 122.

Continuing the example described above, the plurality of second structure factor components can have the following form:

Equation 6: $F_{(N-3)kl} = a_1 F_{(N-4)kl} + a_2 F_{(N-5)kl} + a_3 F_{(N-6)kl} + \ldots + a_{N_{coef}} F_{[N-(N_{coef}+3)]kl}$

The second reflections are from the set of cognizable reflections, so each second structure factor component is capable of being measured or known. In embodiments in which the second structure factor components are known, and are expressed as linear combinations of other known structure factor components, the only unknown parameters are the linear prediction coefficients a_s .

In the operational block 122, the plurality of second linear equations is solved for a set of values corresponding to the set of coefficients. In embodiments in which there are N_coef unknown linear prediction coefficients a_s, solving for a set of values utilizes at least N_coef independent second linear equations. Persons skilled in the art are able to solve the plurality of second linear equations for the set of values for the linear prediction coefficients a_s in accordance with embodiments of the present invention.

In the operational block 130, the set of values for the linear prediction coefficients a_s is substituted into the first linear equation, thereby defining the first structure factor component F_hkl for the first reflection. In this way, the first structure factor component F_hkl is then expressed solely in terms of known parameters.
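The sequence of operational blocks 121, 122, and 130 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: the function names `solve_linear_system`, `fit_coefficients`, and `predict_next` are hypothetical, the input is a synthetic one-dimensional sequence standing in for structure factor components along a line of constant k and l, and exactly N_coef second linear equations are solved by Gaussian elimination.

```python
def solve_linear_system(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_coefficients(known, n_coef):
    """Blocks 121-122: express the last n_coef known values as linear
    combinations of their n_coef predecessors and solve for a_1..a_n_coef."""
    rows, targets = [], []
    for n in range(len(known) - n_coef, len(known)):
        # known[n] = a_1*known[n-1] + ... + a_n_coef*known[n-n_coef]
        rows.append([known[n - s] for s in range(1, n_coef + 1)])
        targets.append(known[n])
    return solve_linear_system(rows, targets)


def predict_next(known, coefficients):
    """Block 130: substitute the coefficients into the first linear equation."""
    return sum(a * known[-s] for s, a in enumerate(coefficients, start=1))


# Synthetic data (an assumption) obeying F_n = (5/6) F_(n-1) - (1/6) F_(n-2).
known_values = [0.5 ** n + (1.0 / 3.0) ** n for n in range(1, 9)]
a_s = fit_coefficients(known_values, n_coef=2)
estimate = predict_next(known_values, a_s)
```

Because the synthetic sequence satisfies a two-term recurrence exactly, two coefficients suffice here; measured data would generally call for more equations than coefficients and a least-squares solution.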

By using the structure factor components for reflections related to the reflection of the structure factor component F_hkl to be defined, embodiments of the present invention utilize linear prediction to increase the number of observables used in the optimization of the molecular geometry. A relatively small extension (e.g., 20%) along all lines in reciprocal space will lead to large increases in the number of reflections because the number of reflections within a given volume of reciprocal space defined by a reciprocal lattice vector increases with the cube of the indices h, k, l of the reciprocal lattice vector. For example, an extension of the maximum reciprocal lattice vector index from 20 to 24 increases the number of reflections available for use in the optimization of the molecular geometry by (24³ − 20³)/20³ ≈ 70%.

Embodiments of the present invention can also extrapolate measured data to higher resolution. Reflections for reciprocal lattice vectors with larger indices h, k, l correspond to longer vectors in reciprocal space, which imply shorter distances in direct space. In this way, a significant improvement in resolution can be achieved. For example, when the length of the unit cell of a hypothetical one-dimensional crystal is a = 50 Å, the corresponding reciprocal unit cell edge is a* = 0.02 Å⁻¹, and the resolution for h = 20 is d = 2.5 Å. A 20% increase of h (i.e., from 20 to 24) improves the resolution to d = 2.08 Å.
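The two numerical examples above can be checked with a few lines of arithmetic. The sketch below assumes only what the text states: the reflection count scales with the cube of the maximum index, and the resolution of the one-dimensional crystal is d = a/h. The function names `reflection_gain` and `resolution` are hypothetical.

```python
def reflection_gain(h_old, h_new):
    """Fractional increase in reflection count when the maximum index grows,
    assuming the count scales with the cube of the maximum index."""
    return (h_new ** 3 - h_old ** 3) / h_old ** 3


def resolution(cell_edge, h_max):
    """Resolution d = a / h for a one-dimensional crystal with cell edge a."""
    return cell_edge / h_max


gain = reflection_gain(20, 24)      # about 0.728, i.e. roughly 70%
d_before = resolution(50.0, 20)     # 2.5 angstroms
d_after = resolution(50.0, 24)      # about 2.08 angstroms
```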

Embodiments of the present invention can also be used to complement incomplete or erroneous x-ray crystallography data sets. Embodiments of the present invention can provide a method to detect and replace "outlier" reflections, i.e., measured reflections which, for one reason or another, are aberrant or erroneous. In this way, hidden experimental errors can be identified and eliminated or corrected. Such utility is particularly important with regard to multiple isomorphous replacement (MIR) analysis and multiple anomalous diffraction (MAD) analysis. With regard to missing reflections, embodiments of the present invention can be used to interpolate to provide the missing reflections and to improve data completion within each resolution shell. This utility of embodiments of the present invention can be important when resolution shells contain too few data for cross-validation. Resolution shells are concentric spheres in reciprocal space, designed so that each shell contains an approximately equal number of reflections. Shells with smaller diameters correspond to lower resolution, while shells with larger diameters correspond to higher resolution. The division of the reciprocal space into resolution shells is equivalent to division of the resolution axis into subintervals. Embodiments of the present invention can also be used to evaluate the zeroth-order reflection |F_000| and enable subsequent absolute scaling of the set of measured reflections based on the known total number of electrons in the unit cell.
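The division of reciprocal space into resolution shells of approximately equal population can be illustrated with the following sketch. It is hypothetical in several respects: a cubic unit cell with edge `cell_edge` is assumed so that the reciprocal-space length is simply sqrt(h² + k² + l²)/a, the function name `resolution_shells` is illustrative, and the reflection list is synthetic.

```python
import math


def resolution_shells(reflections, cell_edge, n_shells):
    """Sort reflections by reciprocal-space length and cut the sorted list
    into n_shells consecutive groups of nearly equal size."""
    def length(hkl):
        h, k, l = hkl
        return math.sqrt(h * h + k * k + l * l) / cell_edge  # cubic cell assumed

    ranked = sorted(reflections, key=length)
    base, extra = divmod(len(ranked), n_shells)
    shells, start = [], 0
    for i in range(n_shells):
        size = base + (1 if i < extra else 0)
        shells.append(ranked[start:start + size])
        start += size
    return shells


# Synthetic index list (an assumption): all (h, k, l) with 0 <= h, k, l <= 3,
# excluding the zeroth-order reflection.
indices = [(h, k, l) for h in range(4) for k in range(4) for l in range(4)
           if (h, k, l) != (0, 0, 0)]
shells = resolution_shells(indices, cell_edge=50.0, n_shells=3)
```

The first shell returned holds the lowest-resolution reflections and the last shell the highest-resolution ones.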

Figure 13 is a flowchart of one embodiment of the calculation corresponding to operational block 120. In the embodiment illustrated in Figure 13, calculating values for the linear prediction coefficients comprises expressing a first subset of the cognizable structure factor components as vector elements of a first vector in an operational block 221. Calculating values for the linear prediction coefficients further comprises expressing a second subset of the cognizable structure factor components as vector elements of a second vector in an operational block 222. Calculating values for the linear prediction coefficients further comprises expressing the first vector in a matrix equation as being equal to the product of a matrix and the second vector in an operational block 223. The matrix comprises the linear prediction coefficients, with each linear prediction coefficient corresponding to a separation in reciprocal space between the cognizable reflection corresponding to one cognizable structure factor component from the first vector and the cognizable reflection corresponding to one cognizable structure factor component from the second vector. Calculating values for the linear prediction coefficients further comprises solving the matrix equation for values of the linear prediction coefficients in an operational block 224.

In certain embodiments, a first subset of the cognizable structure factor components are expressed as vector elements of a first vector in the operational block 221, and a second subset of the cognizable structure factor components are expressed as vector elements of a second vector in the operational block 222. For example, where k and l are constants such as in the example described above, the first vector can have the following form:

Equation 7: $|F_{nkl}\rangle = (F_{(N-1)kl}, F_{(N-2)kl}, F_{(N-3)kl}, \ldots)$,

and the second vector can have the following form:

Equation 8:

.

In certain embodiments, the first vector |F_nkl⟩ is expressed in the operational block 223 as a matrix equation in which the first vector |F_nkl⟩ is equal to the product of a matrix of the linear prediction coefficients and the second vector |F_mkl⟩. Continuing the example from above, the matrix equation can have the following form:

Equation 9:

Persons skilled in the art are able to solve the matrix equation for the values of the linear prediction coefficients and substitute the resulting values into the linear equation in accordance with embodiments of the present invention to define the first structure factor component F_hkl. Persons skilled in the art are also able to recognize the equivalence of the two embodiments of the example described above.

Figure 14 is a flowchart of one embodiment of the calculation corresponding to operational block 120. In the embodiment illustrated in Figure 14, calculating values for the linear prediction coefficients comprises expressing a first subset of the cognizable structure factor components as matrix elements of a first matrix in an operational block 321. Calculating values for the linear prediction coefficients further comprises expressing a second subset of the cognizable structure factor components as vector elements of a first vector in an operational block 322. Calculating values for the linear prediction coefficients further comprises generating a second matrix representing a generalized inverse of the first matrix in an operational block 323. Calculating values for the linear prediction coefficients further comprises expressing the linear prediction coefficients as vector elements of a second vector in an operational block 324. Calculating values for the linear prediction coefficients further comprises equating the second vector to the product of the second matrix and the first vector in an operational block 325, thereby generating values for the linear prediction coefficients.

In certain embodiments, in the operational block 321, a first subset of the cognizable structure factor components is expressed as matrix elements of a first matrix M_nm, in the operational block 322, a second subset of the cognizable structure factor components is expressed as vector elements of a first vector |F_nkl⟩, and in the operational block 324, the linear prediction coefficients a_s are expressed as vector elements of a second vector |a_s⟩. For example, where k and l are constants such as in the example described above, the first matrix M_nm can have the following form:

Equation 10: $M_{nm} = F_{(n+N_{coef}-m)kl}$, $n = 1, \ldots, (N_{max}-N_{coef})$, $m = 1, \ldots, N_{coef}$,

the first vector can have the following form:

Equation 11: $|F_{nkl}\rangle = (F_{(N_{coef}+1)kl}, \ldots, F_{N_{max}kl})$,

and the second vector can have the following form:

Equation 12: $|a_s\rangle = (a_1, a_2, \ldots, a_{N_{coef}})$,

where N_max is typically on the order of tens of thousands.

In certain embodiments, in the operational block 323, the second matrix $(M_{nm})^{-1}$ is generated as a generalized inverse of the first matrix M_nm. The values of the linear prediction coefficients a_s are then generated in the operational block 325 by equating the second vector |a_s⟩ to the product of the second matrix $(M_{nm})^{-1}$ and the first vector |F_nkl⟩:

Equation 13: $|a_s\rangle = (M_{nm})^{-1}\,|F_{nkl}\rangle$.
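The generalized-inverse calculation of operational blocks 321 through 325 can be sketched as follows. Several choices here are assumptions rather than features of any embodiment: the Moore-Penrose pseudoinverse (`numpy.linalg.pinv`) is used as the generalized inverse, the first matrix is laid out as M[n][m] = F_(n+N_coef−m) with k and l held constant, the function name `lp_coefficients_pinv` is hypothetical, and the data are synthetic.

```python
import numpy as np


def lp_coefficients_pinv(components, n_coef):
    """Blocks 321-325: build the first matrix M[n][m] = F_(n + n_coef - m),
    the first vector of targets F_(n_coef + n), and return pinv(M) @ targets."""
    f = np.asarray(components, dtype=float)      # f[i] holds F_(i+1)
    n_max = len(f)
    first_matrix = np.array([[f[n + n_coef - m - 1]
                              for m in range(1, n_coef + 1)]
                             for n in range(1, n_max - n_coef + 1)])
    first_vector = f[n_coef:]                    # F_(n_coef+1) ... F_(n_max)
    second_matrix = np.linalg.pinv(first_matrix) # Moore-Penrose inverse (assumed)
    return second_matrix @ first_vector


# Synthetic data (an assumption) obeying F_h = (5/6) F_(h-1) - (1/6) F_(h-2).
synthetic = [0.5 ** h + (1.0 / 3.0) ** h for h in range(1, 13)]
coefficients = lp_coefficients_pinv(synthetic, n_coef=2)
```

With more rows than coefficients, the pseudoinverse yields the least-squares solution, which is one way of selecting the coefficients that reproduce the known values with the least total error.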

By substituting the values of the coefficients into the linear equation, the first structure factor component for the first reflection can be defined.

Figure 15 is a flowchart of one embodiment of the calculation corresponding to operational block 120. In the embodiment illustrated in Figure 15, calculating values for the linear prediction coefficients comprises defining a matrix having matrix elements in an operational block 421. Each matrix element comprises an autocorrelation function between selected structure factor components. Calculating values for the linear prediction coefficients further comprises expressing the linear prediction coefficients as vector elements of a first vector in an operational block 422. Calculating values for the linear prediction coefficients further comprises solving a matrix equation for values for the linear prediction coefficients in an operational block 423. The matrix equation expresses the product of the matrix and the first vector as equal to a second vector with constant vector elements.

In certain embodiments, the autocorrelation functions of the matrix in the operational block 421 have the following form:

Equation 14: Φ .

Autocorrelation functions of this form represent autocorrelations between structure factor components along a selected line in reciprocal space.

In certain embodiments, the matrix in the operational block 421 has the following form: Equation 15: M_ιm

Such a matrix is a symmetric Toeplitz matrix (i.e., a matrix whose elements are constant along diagonals).

In certain embodiments, the linear prediction coefficients a_s are expressed in the operational block 422 as vector elements of the first vector $(1, a_1, a_2, \ldots, a_{N_{coef}})$, and in the operational block 423, a matrix equation of the following form is solved for values of the linear prediction coefficients:

Equation 16:

In Equation 16, a₀ is a dummy value, as described in "Numerical Recipes in C: The Art of Scientific Computing," by W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Cambridge University Press, Cambridge, 1989, pages 452-464, which is incorporated in its entirety by reference herein.
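A matrix equation of the kind described for operational blocks 421 through 423, in which a symmetric Toeplitz matrix of autocorrelations multiplies a first vector (1, a_1, ..., a_N) to give a second vector (a_0, 0, ..., 0), can be solved by the Levinson-Durbin recursion. The sketch below is one possible approach, not necessarily the one used in any embodiment. The biased autocorrelation estimator, the function names, and the synthetic data are assumptions, and the code solves only the matrix equation as stated, without assuming a sign convention for the resulting predictor.

```python
import math


def autocorrelation(values, max_lag):
    """Biased sample autocorrelation phi_0..phi_max_lag (an assumed estimator)."""
    n = len(values)
    return [sum(values[i] * values[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def solve_toeplitz_prediction(phi):
    """Levinson-Durbin recursion for T @ (1, a_1, ..., a_N) = (a_0, 0, ..., 0),
    where T[l][m] = phi[|l - m|] is a symmetric Toeplitz matrix."""
    order = len(phi) - 1
    a = [1.0] + [0.0] * order
    a0 = phi[0]
    for m in range(1, order + 1):
        acc = phi[m] + sum(a[j] * phi[m - j] for j in range(1, m))
        k = -acc / a0                       # reflection coefficient
        updated = a[:]
        for j in range(1, m):
            updated[j] = a[j] + k * a[m - j]
        updated[m] = k
        a = updated
        a0 *= 1.0 - k * k                   # updated dummy value a_0
    return a, a0


# Synthetic data (an assumption): a damped cosine sampled at 40 points.
data = [0.9 ** n * math.cos(0.7 * n) for n in range(40)]
phi = autocorrelation(data, max_lag=4)
vector, a0 = solve_toeplitz_prediction(phi)

# Multiply the Toeplitz matrix by the solution to confirm the stated form.
product = [sum(phi[abs(l - m)] * vector[m] for m in range(len(phi)))
           for l in range(len(phi))]
```

The recursion exploits the constant-diagonal structure of the matrix and needs on the order of N² operations rather than the N³ of general elimination.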

As with all recursive (i.e., infinite-impulse response) digital filters, solving the matrix equation described above is vulnerable to instabilities and divergences. In certain embodiments, solving the matrix equation of Equation 16 comprises limiting instabilities and divergences by calculating complex roots of a characteristic polynomial equation in a complex plane and forcing all complex roots into a unit circle in the complex plane. Stability is increased by calculating the complex roots of the following characteristic polynomial equation:

Equation 17: $z^{N_{coef}} - \sum_{j=1}^{N_{coef}} a_j z^{N_{coef}-j} = 0$

and forcing all the solutions into the unit circle in the complex plane z. This result is achieved by moving the roots of the characteristic polynomial onto the unit circle, or more preferably by reflecting them into the unit circle (i.e., by replacing z with 1/z*). The linear prediction analysis of embodiments of the present invention extrapolates from the known structure factor components using the characterization of the known structure factor components in terms of the poles in the complex plane, which differs from techniques such as the maximum entropy method.
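The root-reflection step can be sketched as follows, assuming the characteristic polynomial has the form z^N − Σ_j a_j z^(N−j) = 0. The function name `stabilize_coefficients` is hypothetical, NumPy's `roots` and `poly` routines are used for convenience, and only the reflection variant (replacing z with 1/z*) is shown, not the alternative of moving roots onto the unit circle.

```python
import numpy as np


def stabilize_coefficients(coefficients):
    """Reflect roots of z^N - sum_j a_j z^(N-j) = 0 that lie outside the unit
    circle to 1/conj(z), then rebuild the coefficients a_j."""
    a = np.asarray(coefficients, dtype=float)
    polynomial = np.concatenate(([1.0], -a))          # highest power first
    roots = np.roots(polynomial).astype(complex)
    outside = np.abs(roots) > 1.0
    roots[outside] = 1.0 / np.conj(roots[outside])    # reflect into the circle
    rebuilt = np.poly(roots)                          # monic, highest power first
    return -np.real(rebuilt[1:])


# z^2 - 2.5 z + 1 = 0 has roots 2 and 0.5; reflecting 2 gives a double root 0.5.
stable = stabilize_coefficients([2.5, -1.0])
```

In this example the stabilized polynomial is z² − z + 0.25, so the returned coefficients are a_1 = 1 and a_2 = −0.25.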

An example of this embodiment is provided by Figures 16A and 16B. Figure 16A schematically illustrates an electron distribution of a hypothetical one-dimensional system of ten atoms along a line segment of unit length. For simplicity, all atoms are assigned unit scattering factors and the temperature factors T_j have been set to facilitate visual inspection. The electron distribution schematically illustrated in Figure 16A is then used in an embodiment of the present invention to compute a set of 66 structure factor components corresponding to Miller indices h=1,...,66. Structure factor components h=42,...,66 were estimated by means of linear prediction, using the 40 data points h=1,...,40 and 20 poles. Figure 16B schematically illustrates the agreement between the true values for the structure factor components and the corresponding linear prediction estimates from this embodiment. The resulting agreement has a correlation coefficient of approximately 0.83. Similarly, in another example embodiment of the present invention, Figures 17A and 17B schematically illustrate another hypothetical one-dimensional electron distribution and the agreement between the true structure factor components h=42,...,66 and the same structure factor components estimated using linear prediction from structure factor components h=1,...,4 and 20 poles. The resulting agreement has a correlation coefficient of approximately 0.97.

Figure 18A schematically illustrates another example embodiment of a hypothetical one-dimensional electron distribution with ten atoms. In this embodiment, structure factor components h=27,...,30 were estimated using 25 data points (h=1,...,25) and 20 poles. The resulting agreement between true and estimated structure factor components schematically illustrated in Figure 18B has a correlation coefficient of approximately 0.98.

Figure 19A schematically illustrates another example embodiment of a hypothetical one-dimensional electron distribution with thirty atoms. In this embodiment, structure factor components h=92,...,100 were estimated using 90 data points (h=1,...,90) and 30 poles. The resulting agreement between true and estimated structure factor components schematically illustrated in Figure 19B has a correlation coefficient of approximately 0.78.

Figure 20A schematically illustrates another example embodiment of a hypothetical one-dimensional electron distribution with thirty atoms. In this embodiment, structure factor components h=92,...,100 were estimated using 90 data points (h=1,...,90) and 35 poles. The resulting agreement between true and estimated structure factor components schematically illustrated in Figure 20B has a correlation coefficient of approximately 0.78.

Figure 21A schematically illustrates an example embodiment of a one-dimensional projection of a hypothetical three-dimensional electron distribution with 500 atoms created in a cube with unit edges. For simplicity, all atoms are assigned unit scattering factors and the temperature factors T_j have been set to facilitate visual inspection. In this embodiment, structure factor components (h, k, l)=(17, 1, 2) and (18, 1, 2) were estimated using 15 data points (h=1,...,15; k=1; l=2) and 5 poles. The resulting agreement between true and estimated structure factor components is schematically illustrated in Figure 21B.

Figure 22A schematically illustrates another example embodiment of a one-dimensional projection of a hypothetical three-dimensional electron distribution with 500 atoms created in a cube with unit edges. For simplicity, all atoms are assigned unit scattering factors and B-scaling factors equal to 0.01. In this embodiment, structure factor components (h, k, l)=(18, 0, 0) and (19, 0, 0) were estimated using 16 data points (h=1,...,16; k=0; l=0) and 4 poles. The resulting agreement between true and estimated structure factor components is schematically illustrated in Figure 22B.

This invention may be embodied in other specific forms without departing from the essential characteristics as described herein. The embodiments described above are to be considered in all respects as illustrative only and not restrictive in any manner. The scope of the invention is indicated by the following claims rather than by the foregoing description. Any and all changes which come within the meaning and range of equivalency of the claims are to be considered within their scope.

Claims

WHAT IS CLAIMED IS:

1. A method of reducing structure factor phase ambiguity corresponding to a selected reciprocal lattice vector, the method comprising: generating an original phase probability distribution corresponding to a selected structure factor phase of the selected reciprocal lattice vector, the original phase probability distribution comprising a first structure factor phase ambiguity; combining the original phase probability distribution with a plurality of phase probability distributions of a plurality of structure factor phases of other reciprocal lattice vectors using a phase equation or inequality, the phase equation or inequality defining a mathematical relationship between the selected structure factor phase of the selected reciprocal lattice vector and the plurality of structure factor phases of other reciprocal lattice vectors; and producing a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector, the resultant phase probability distribution comprising a second structure factor phase ambiguity which is smaller than the first structure factor phase ambiguity.

2. The method of Claim 1, wherein the original phase probability distribution is substantially bimodal.

3. The method of Claim 1, wherein the resultant phase probability distribution is substantially unimodal.

4. The method of Claim 1, wherein the resultant phase probability distribution is weighted more strongly to a correct phase than is the original phase probability distribution.

5. The method of Claim 1, wherein the original phase probability distribution is generated by single isomorphous replacement, single anomalous dispersion, multiple isomorphous replacement, or multiple anomalous dispersion.

6. The method of Claim 1, wherein the phase equation or inequality is the phase addition equation.

7. A method of defining a structure factor phase for a reflection derived from x-ray crystallography data, the method comprising: generating a first probability distribution for the structure factor phase of the reflection; generating two or more additional probability distributions for the structure factor phases of other reflections; identifying a relationship between the structure factor phase for the reflection and the structure factor phases of the other reflections; and calculating a composite probability distribution for the structure factor phase of the reflection, whereby the composite probability distribution is derived from the first probability distribution for the structure factor phase of the reflection and the two or more additional probability distributions for the structure factor phases of the other reflections.

8. The method of Claim 7, wherein the first probability distribution is defined by a set of Hendrickson-Lattman coefficients.

9. The method of Claim 8, wherein the set of Hendrickson-Lattman coefficients is generated by single isomorphous replacement, single anomalous dispersion, multiple isomorphous replacement, or multiple anomalous dispersion.

10. The method of Claim 7, wherein the first probability distribution is substantially bimodal.

11. The method of Claim 7, wherein the composite probability distribution is substantially unimodal.

12. The method of Claim 7, wherein the relationship between the structure factor phase for the reflection and the structure factor phases for the other reflections is additive.

13. The method of Claim 12, wherein the relationship is given by the phase addition equation.

14. A computer readable medium having instructions stored thereon which cause a general purpose computer to perform a method of reducing structure factor phase ambiguity corresponding to a selected reciprocal lattice vector, the method comprising: generating an original phase probability distribution corresponding to a selected structure factor phase of the selected reciprocal lattice vector, the original phase probability distribution comprising a first structure factor phase ambiguity; combining the original phase probability distribution with a phase equation or inequality, the phase equation or inequality defining a mathematical relationship between the selected structure factor phase of the selected reciprocal lattice vector and a set of structure factor phases of other reciprocal lattice vectors; and producing a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector, the resultant phase probability distribution comprising a second structure factor phase ambiguity which is smaller than the first structure factor phase ambiguity.

15. A computer-implemented x-ray crystallography analysis system comprising: an original phase probability distribution generator for generating an original phase probability distribution corresponding to a selected structure factor phase of the selected reciprocal lattice vector, the original phase probability distribution comprising a first structure factor phase ambiguity; a combination module for combining the original phase probability distribution with a phase equation or inequality, the phase equation or inequality defining a mathematical relationship between the selected structure factor phase of the selected reciprocal lattice vector and a set of structure factor phases of other reciprocal lattice vectors; and a resultant phase probability distribution producer for producing a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector, the resultant phase probability distribution comprising a second structure factor phase ambiguity which is smaller than the first structure factor phase ambiguity.

16. A computer-implemented x-ray crystallography analysis system comprising: a means for retrieving a first phase probability distribution corresponding to a selected structure factor phase of a selected reciprocal lattice vector; a means for retrieving a plurality of second phase probability distributions corresponding to other structure factor phases of other reciprocal lattice vectors; and a means for combining the first phase probability distribution and plurality of second phase probability distributions so as to produce a resultant phase probability distribution for the selected structure factor phase of the selected reciprocal lattice vector.

17. A method of refining x-ray diffraction data, the method comprising combining structure factor phase probability distributions for different reciprocal lattice vectors so that the structure factor phase probability distribution for at least one of the reciprocal lattice vectors is more heavily weighted toward a phase value.

18. A method of using linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data, the x-ray crystallography data comprising a set of cognizable reflections, the method comprising: expressing the first structure factor component as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms, each term comprising a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection; calculating values for the linear prediction coefficients; and substituting the values for the linear prediction coefficients into the first linear equation, thereby defining the first structure factor component for the first reflection.

19. The method of Claim 18, wherein the first structure factor component is real.

20. The method of Claim 18, wherein the first structure factor component is imaginary.

21. The method of Claim 18, wherein the first structure factor component is a magnitude.

22. The method of Claim 18, wherein the first structure factor component is a phase.

23. The method of Claim 18, wherein calculating values for the linear prediction coefficients comprises: expressing a plurality of second structure factor components for a plurality of second reflections from the set of cognizable reflections as a plurality of second linear equations in which each second structure factor component is equal to a sum of a second plurality of terms, each term comprising a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the second reflection, and (2) the linear prediction coefficient corresponding to the separation between the cognizable reflection and the second reflection; and solving the plurality of second linear equations for the values for the linear prediction coefficients.

24. The method of Claim 18, wherein calculating values for the linear prediction coefficients comprises: expressing a first subset of the cognizable structure factor components as vector elements of a first vector; expressing a second subset of the cognizable structure factor components as vector elements of a second vector; expressing the first vector in a matrix equation as being equal to the product of a matrix and the second vector, wherein the matrix comprises matrix elements comprising the linear prediction coefficients, such that each matrix element comprises the linear prediction coefficient corresponding to a separation in reciprocal space between a corresponding cognizable reflection from the second vector and a corresponding cognizable reflection from the first vector; and solving the matrix equation for values of the linear prediction coefficients.

25. The method of Claim 18, wherein calculating values for the linear prediction coefficients comprises: expressing a first subset of the cognizable structure factor components as matrix elements of a first matrix; expressing a second subset of the cognizable structure factor components as vector elements of a first vector; generating a second matrix representing a generalized inverse of the first matrix; expressing the linear prediction coefficients as vector elements of a second vector; and equating the second vector to the product of the second matrix and the first vector, thereby generating the values for the linear prediction coefficients.

26. The method of Claim 18, wherein calculating values for the linear prediction coefficients comprises: defining a matrix having matrix elements, each matrix element comprising an autocorrelation function between selected structure factor components; expressing the linear prediction coefficients as vector elements of a first vector; solving a matrix equation for values for the linear prediction coefficients, the matrix equation expressing the product of the matrix and the first vector as equal to a second vector with constant vector elements.

27. The method of Claim 26, wherein the matrix elements are constant along diagonals of the matrix.

28. The method of Claim 26, wherein solving the matrix equation comprises limiting instabilities and divergences by calculating complex roots of a characteristic polynomial equation in a complex plane and forcing all complex roots into a unit circle in the complex plane.

29. A method of refining x-ray diffraction data comprising deriving a value of a first structure factor from a linear combination of other structure factors.

30. The method of Claim 29, wherein said other structure factors comprise a series of structure factors which are adjacent to said first structure factor in reciprocal space.

31. A computer readable medium having instructions stored thereon which cause a general purpose computer to perform a method of using linear prediction analysis to define a first structure factor component for a first reflection from x-ray crystallography data, the x-ray crystallography data comprising a set of cognizable reflections, the method comprising: expressing the first structure factor component as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms, each term comprising a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection; calculating values for the linear prediction coefficients; and substituting the values for the linear prediction coefficients into the first linear equation, thereby defining the first structure factor component for the first reflection.

32. A computer-implemented x-ray crystallography analysis system comprising: a structure factor component generator for generating a first structure factor component for a first reflection from x-ray crystallography data using linear prediction analysis, the x-ray crystallography data comprising a set of cognizable reflections, the first structure factor component expressed as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms, each term comprising a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection; a calculating module for calculating values for the linear prediction coefficients; and a resultant structure factor component definer for defining the first structure factor component for the first reflection by substituting the values for the linear prediction coefficients into the first linear equation.

33. A computer-implemented x-ray crystallography analysis system comprising: a means for generating a first structure factor component for a first reflection from x-ray crystallography data using linear prediction analysis, the x-ray crystallography data comprising a set of cognizable reflections, the first structure factor component expressed as a first linear equation in which the first structure factor component is equal to a sum of a first plurality of terms, each term comprising a product of (1) a structure factor component for a cognizable reflection from the x-ray crystallography data, wherein the cognizable reflection has a separation in reciprocal space from the first reflection, and (2) a linear prediction coefficient corresponding to the separation between the cognizable reflection and the first reflection; a means for calculating values for the linear prediction coefficients; and a means for defining the first structure factor component for the first reflection by substituting the values for the linear prediction coefficients into the first linear equation.