EP1307713A2

EP1307713A2 - Method for examining macromolecules

Info

Publication number: EP1307713A2
Application number: EP01945081A
Authority: EP
Inventors: Helmut Bloecker; Gerhard Kauer
Original assignee: Helmholtz Zentrum fuer Infektionsforschung HZI GmbH
Current assignee: Helmholtz Zentrum fuer Infektionsforschung HZI GmbH
Priority date: 2000-05-05
Filing date: 2001-05-03
Publication date: 2003-05-07
Also published as: IL152512A0; KR20030005318A; AU2001267403A1; US20040029126A1; WO2001086247A2; EE200200618A; CA2406694A1; WO2001086247A3; DE10021689A1

Abstract

The invention relates to a method for examining macromolecules which can be stored in frequency-based data patterns. The invention also relates to a device for carrying out said method and to different applications of both the method and device. The method itself is based on: the compilation of sequence data of molecular sequences of macromolecules; the conversion of the sequence data into frequency-modulated frequency data; the transformation of the frequency data into a Fourier space; the use of Fourier analyses for comparing, weighting, cataloging and/or typifying the frequency data; and, finally, the re-transformation of the weighted, cataloged and/or typified frequency data into sequence data provided in a weighted, cataloged and/or typified form.

Description

Procedure for the study of macromolecules

The invention relates to a method for examining macromolecules, to a device for carrying out the method by way of example, and to applications of the method and / or the device, according to the independent claims.

Enormous amounts of data have been collected in databases in the form of sequence-based data patterns for a wide variety of macromolecules. These data are used to address biological questions that arise from the information contained in macromolecular sequence data. Such questions can currently be dealt with only by computer-aided methods, and the enormous amounts of data demand considerable computing power, especially since the worldwide sequencing output of current and planned genome projects is growing to an unexpected extent. This raises the problem of applying the available algorithms efficiently to the question at hand without reaching the limits of the available computing power.

This problem is solved with the subject of the independent claims. Advantageous developments of the invention result from the subclaims.

The method according to the invention for solving the above problem when examining macromolecules thus comprises the following method steps:

a) creating sequence data of molecular sequences of macromolecules,

b) converting the sequence data into frequency-modulated frequency data,

c) transforming the frequency data into a Fourier space,

d) using Fourier analyses for comparison, for weighting, for cataloging and / or for typing the frequency data,

e) back-transforming the compared, weighted, cataloged and / or typed frequency data into sequence data in weighted, cataloged and / or typed form.

This method enables a completely new technology for the efficient analysis of enormous sequence-based amounts of data from macromolecules. The potential of this technology lies, on the one hand, in a significant increase in speed for the respective analyses of the macromolecules and, on the other hand, in the possibility of raising completely new questions of information retrieval.

In a preferred embodiment of the method, a method of filtering information from digital image analysis is used for comparison, weighting, cataloging and / or typing. This embodiment has the advantage that the similarity of two one-dimensional patterns under a mutual local shift by i data points can be measured, and that a signal can be searched for a predetermined signal curve. The image-analysis method yields a measure of similarity, from which conclusions about similarities among the macromolecules can be drawn. This similarity measure reaches its maximum when the shift produces a maximum match between the sequence of frequency data and the pattern. Via reverse transformation and demodulation, this shift also gives the unique position of the one-dimensional pattern in the frequency data sequence and hence the position of the pattern in the sequence.

The use of the Fourier transformation simplifies the detection filtering by means of the convolution and thus accelerates the examination to a considerable extent.

In a further embodiment of the method, a frequency analysis method is used for comparison, weighting, cataloging and / or typing. In this embodiment, the sequence data, which were first converted into frequency-modulated data, are prepared in such a way that each element of a sequence is assigned unique frequency information in correlation to its neighbor. Although the actual sequence thereby recedes into the background and, in the simplest case, is transformed into a one-dimensional frequency-modulated wave, the sequence information remains unaffected by this transformation and is merely converted into complex frequency information with the same information content. The advantage of this embodiment is that all mathematical methods of frequency analysis can be applied to this frequency-modulated wave. Spectral analysis of the information is of particular benefit in this context.

In a further embodiment of the method, stochastic information filtering in the Fourier space is used for comparison, weighting, cataloging and / or typing. In this embodiment, deviations from the ideal signal can be estimated stochastically, so that the expectation horizon can be tailored to the biological problem at hand.

In a further preferred embodiment of the method, the information units and / or structural information from multidimensional protein and / or DNA databases are encoded in corresponding sequence codes for creating sequence data. This has the advantage that, when analyzing macromolecules and biological questions concerning macromolecules, multidimensional protein and / or DNA databases can be drawn upon and then evaluated and analyzed using the method according to the invention, without exceeding the limits of the efficiency of the processes used or of the available computing power.

The method according to the invention can preferably be carried out with a device which has a multiplicity of electronic components for modeling frequency data that simulate molecular sequences, and a multiplicity of frequency filters for weighting, for cataloging and / or for typing the frequency data modeled by the multiplicity of electronic components. A significant advantage of the method according to the invention is that it is easily possible first to develop the necessary algorithms and filter systems on a computer and then to implement the methods found in electronic circuits, so that the relevant algorithms no longer need to be executed on a computer but can be carried out in a high-frequency circuit. With such a device it is thus possible to investigate very large sequence-based amounts of data, for example entire genomes, quickly and virtually without delay.

In a preferred embodiment of the device, the multiplicity of electronic components and the multiplicity of frequency filters are determined by means of computer-aided frequency analyses and are coupled to one another to form a hardware network which simulates the sequence of information units of macromolecules. In this context, the information units are bases of nucleic acids, amino acid residues of proteins and / or three-dimensional structural units of proteins and / or DNA, the sequence of which in a macromolecule is simulated by the hardware network. This embodiment of the device achieves not only a rapid comparison of large sequence-based data patterns; biological questions can also be processed directly, at the speed of light, by the hardware network simulating the macromolecules and answered at a correspondingly high speed.

The method and device of the invention are preferably used for the analysis of protein sequences. Applications in the context of the analysis of DNA sequences are also advantageously possible. The method and device can also be used for examining and sampling multidimensional protein databases. For this purpose, the information units of the databases are to be provided in corresponding sequence codes, which can themselves be multidimensional. It is therefore not necessary to restrict the spectral analyses to only one, two or three dimensions; rather, in the preferred applications, the invention can be used for a large number of information fragments.

In a preferred application of the invention, multidimensional DNA structure information is examined for recurring patterns. In particular, this invention makes it possible to investigate biological questions interactively and without delay for sequence-based amounts of data.

The invention will now be explained in more detail using exemplary embodiments.

In a first exemplary embodiment, the sequence data are first converted into frequency-modulated data. Each element of the sequence is thereby assigned unique frequency information in correlation to its neighbor. The actual sequence thus recedes into the background and, in the simplest case, is transformed into a one-dimensional frequency-modulated wave. The sequence information remains unaffected by this transformation and is merely converted into complex frequency information with the same information content.
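The conversion into a frequency-modulated wave might be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the mapping, so the DNA alphabet, the frequency table `BASE_FREQ`, the value of `SAMPLES_PER_SYMBOL` and the function name `modulate` are all hypothetical. The phase is carried over from one element to the next, so the waveform of each element depends on its predecessor.

```python
import math

# Hypothetical mapping (not from the source): one frequency per base,
# expressed in cycles per sample.
BASE_FREQ = {"A": 0.05, "C": 0.10, "G": 0.15, "T": 0.20}
SAMPLES_PER_SYMBOL = 32  # assumed resolution; the source gives no value


def modulate(sequence):
    """Return a one-dimensional frequency-modulated wave for `sequence`."""
    wave = []
    phase = 0.0
    for symbol in sequence:
        step = 2.0 * math.pi * BASE_FREQ[symbol]
        for _ in range(SAMPLES_PER_SYMBOL):
            wave.append(math.sin(phase))
            phase += step  # the phase continues into the next symbol
    return wave


wave = modulate("GATTACA")
print(len(wave))  # 7 symbols x 32 samples = 224
```

Because every symbol keeps its own frequency, the sequence can in principle be recovered from the wave, which is the sense in which the information content is preserved.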

The advantage of this method is that all mathematical methods for signal processing can now be applied to this frequency-modulated wave. Spectral analyses of the information in particular provide the greatest benefit in this context.

A fast Fourier transform (FFT) is then applied to the frequency-modulated wave, and appropriate filters are applied to the transformed data. After the back transformation, the so-called inverse Fourier transform (IFFT), and a demodulation of the frequency data back into sequence data, the correspondingly filtered information is obtained. Sequence patterns can thus be searched very efficiently in the power spectrum; for example, large genomic sections or entire genomes can be compared with one another or filtered out. Deviations from the ideal signal can be estimated stochastically, so that the horizon of expectation can be designed specifically for the biological problem. This results in the essential advantage of the method according to the invention: it is easily possible first to develop the necessary algorithms and filter systems on a computer and then to implement the methods found in electronic circuits. The algorithms in question then no longer have to run on a computer but can be processed in a high-frequency circuit. With this embodiment of the invention, it is thus possible to examine very large, sequence-based data sets, for example entire genomes, quickly and interactively.
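The chain of transform, filter, inverse transform and demodulation can be illustrated with a small sketch. Everything specific in it is an assumption made for the example: the tone table `CYCLES`, the block length `BLOCK`, the simple band-pass mask and the correlation-based demodulator are not taken from the source, and a naive discrete Fourier transform stands in for the FFT (it yields the same values, only more slowly).

```python
import cmath
import math

# Assumptions (not from the source): whole cycles per symbol block,
# so that the tones of different symbols are mutually orthogonal.
CYCLES = {"A": 2, "C": 4, "G": 6, "T": 8}
BLOCK = 32  # samples per symbol


def tone(symbol, k):
    return math.sin(2 * math.pi * CYCLES[symbol] * k / BLOCK)


def modulate(seq):
    return [tone(s, k) for s in seq for k in range(BLOCK)]


def dft(x, inverse=False):
    """Naive O(n^2) DFT; an FFT yields the same values faster."""
    n = len(x)
    sign = 1 if inverse else -1
    root = [cmath.exp(sign * 2j * math.pi * r / n) for r in range(n)]
    out = [sum(x[m] * root[(k * m) % n] for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def band_pass(spectrum, lo, hi):
    """Keep bins lo..hi and their mirror images; zero all others."""
    n = len(spectrum)
    return [v if (lo <= k <= hi or lo <= n - k <= hi) else 0.0
            for k, v in enumerate(spectrum)]


def demodulate(wave):
    """Per block, pick the symbol whose tone correlates best with the block."""
    out = []
    for start in range(0, len(wave), BLOCK):
        block = wave[start:start + BLOCK]
        score = lambda s: abs(sum(v * tone(s, k) for k, v in enumerate(block)))
        out.append(max(CYCLES, key=score))
    return "".join(out)


def block_energy(wave):
    return [sum(v * v for v in wave[i:i + BLOCK])
            for i in range(0, len(wave), BLOCK)]


seq = "ACTGACTA"
spectrum = dft(modulate(seq))

# Unfiltered round trip: transform, inverse transform, demodulation.
round_trip = demodulate([v.real for v in dft(spectrum, inverse=True)])

# Band-pass around the "G" tone (bin = cycles per block * number of symbols).
centre = CYCLES["G"] * len(seq)
half = len(seq) // 2
kept = band_pass(spectrum, centre - half, centre + half)
g_only = [v.real for v in dft(kept, inverse=True)]
energy = block_energy(g_only)
g_block = energy.index(max(energy))
print(round_trip, g_block)
```

In this toy setting the unfiltered round trip returns the original sequence, and after the band-pass the block with the highest energy is the one that held the `G`. A real filter design would depend on the biological question and is not specified in the source.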

However, the method according to the invention is not limited to the simplest case of a one-dimensional frequency-modulated wave. Rather, in a second exemplary embodiment of the invention, three-dimensional or higher-dimensional protein databases or multidimensional DNA structure information can also be examined for corresponding patterns in a very similar manner. To this end, the information units of the databases are to be converted into corresponding sequence codes. The method according to the invention can also be used for assembling a large number of n information fragments, as are present, for example, in “shotgun” organized databases. In their totality, these n pieces of information represent the information of a logical unit N. In this case, the sum of all elements of the fragments is substantially greater than the sum of the elements of the total information N: $n \gg N$; $\forall\,\{n \in N\}$

After the sequence information is frequency-modulated, it is transformed according to the present invention by means of a fast Fourier transform. In the simplest case, the correlation function $\varphi_{fg}$ of two one-dimensional signals, namely $f(m)$ and $g(m)$, is to be understood as a convolution of the signal $f(m)$ with the signal $g(-m)$.

$$\varphi_{fg}(i) = \sum_{m} f(m)\, g(m - i)$$

With this procedure, the similarity of two one-dimensional patterns under a mutual local shift by i pixels can be measured, and a signal $f(m)$ can be searched for a signal curve specified by $g(m)$. $\varphi_{fg}$ is the measure of the similarity. This measure reaches its maximum when the shift i produces a maximum correspondence between the wave $f(m)$ and the pattern $g(m)$. This shift then also gives the unique position of the one-dimensional "pattern" in the wave. The position of the pattern in the sequence can be determined unambiguously via the reverse transformation and demodulation. This detection filtering via the convolution is advantageously simplified by the FFT. The Fourier transforms $\Phi_{fg}$ and $F$ are calculated from $\varphi_{fg}$ and $f$ and satisfy the following relation:

$$\Phi_{fg}(k) = F(k)\, G^{*}(k)$$

where $G^{*}(k)$ is the complex conjugate of the Fourier transform of $g(m)$. Given the enormous amounts of data in sequence-based data patterns of macromolecules, operating in Fourier space is advantageous here, since extensive pattern functions are already available for the problem addressed. For $\varphi_{fg}$, the signal energy of $f(m)$ and $g(m)$ is exactly the same as $f(m)$ and $g(m)$.
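The relation between the correlation function and the product of the Fourier transforms can be checked numerically. The following sketch assumes circular (periodic) indexing and uses toy data invented for the illustration; the function names are hypothetical, and a naive discrete Fourier transform again stands in for the FFT.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) DFT; an FFT yields the same values faster."""
    n = len(x)
    sign = 1 if inverse else -1
    root = [cmath.exp(sign * 2j * math.pi * r / n) for r in range(n)]
    out = [sum(x[m] * root[(k * m) % n] for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def correlate_direct(f, g):
    """phi_fg(i) = sum_m f(m) g(m - i), with indices taken modulo len(f)."""
    n = len(f)
    return [sum(f[m] * g[(m - i) % n] for m in range(n)) for i in range(n)]


def correlate_fourier(f, g):
    """Same values via Phi_fg(k) = F(k) G*(k) and an inverse transform."""
    F, G = dft(f), dft(g)
    product = [a * b.conjugate() for a, b in zip(F, G)]
    return [v.real for v in dft(product, inverse=True)]


# Toy data (hypothetical): a short pattern placed at offset 5 in a longer wave.
pattern = [1.0, -2.0, 3.0, -1.0]
wave = [0.0] * 16
wave[5:9] = pattern
g = pattern + [0.0] * (len(wave) - len(pattern))  # zero-pad to the wave length

phi = correlate_fourier(wave, g)
shift = phi.index(max(phi))
print(shift)  # 5: the shift with maximum correspondence
```

The direct sum costs on the order of n² operations, whereas the route through the Fourier space needs only transforms and one pointwise product, which is where the FFT brings the speed advantage described above.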

As a third example, the two-dimensional relations are now listed:

$$\varphi_{fg}(i, j) = \sum_{m} \sum_{n} f(m, n)\, g(m - i,\, n - j), \quad \text{or} \quad \Phi_{fg}(k, l) = F(k, l)\, G^{*}(k, l)$$

In-depth analysis of the information-bearing biological macromolecules reveals that the pure sequence information is overlaid with considerable amounts of information that result from chemically related patterns of neighboring building blocks or that result, for example, in multi-dimensional location signals.

The methods for one-dimensional and two-dimensional relations described by way of example above can quickly determine such additional information contents by means of suitable stochastically acting filters in the frequency domain.

A suitable mapping of the relevant "similarity function" of the components or groups of components involved into the frequency domain automatically results in structures that can be detected using proven filters. For example, analyses with local power spectra, which deal with the spectral energies of the sections to be examined, can be used.
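A local power spectrum analysis of this kind might look as follows. The sketch is an illustration only: the window length `W`, the non-overlapping windows and the toy signal are assumptions, and the function names are hypothetical. It also checks numerically the known identity that the power spectrum equals the Fourier transform of the (circular) autocorrelation function.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT; an FFT yields the same values faster."""
    n = len(x)
    root = [cmath.exp(-2j * math.pi * r / n) for r in range(n)]
    return [sum(x[m] * root[(k * m) % n] for m in range(n)) for k in range(n)]


def power_spectrum(f):
    return [abs(v) ** 2 for v in dft(f)]


def autocorrelation(f):
    """Circular autocorrelation: sum_m f(m) f(m - i)."""
    n = len(f)
    return [sum(f[m] * f[(m - i) % n] for m in range(n)) for i in range(n)]


def local_power_spectra(f, window):
    """Power spectrum of each non-overlapping window of length `window`."""
    return [power_spectrum(f[s:s + window])
            for s in range(0, len(f) - window + 1, window)]


# Toy signal (hypothetical): 2 cycles in the first window, 5 in the second.
W = 16
signal = ([math.sin(2 * math.pi * 2 * k / W) for k in range(W)]
          + [math.sin(2 * math.pi * 5 * k / W) for k in range(W)])

lhs = power_spectrum(signal)
rhs = [v.real for v in dft(autocorrelation(signal))]

spectra = local_power_spectra(signal, W)
peaks = [s.index(max(s[:W // 2])) for s in spectra]
print(peaks)  # [2, 5]: dominant frequency bin in each window
```

The global spectrum of this signal contains both tones without saying where they occur; the windowed spectra recover the dominant frequency of each section separately.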

The power spectrum $|F(k)|^2$ is the Fourier transform of the autocorrelation function of the signal $f(m)$ and can therefore be used to measure the statistical dependencies between the values of neighboring data of $f(m)$. If the power spectra are calculated within local windows, even stationary patterns can be described in this way. A suitable weighting of the original function can be used to reduce interfering components in the power spectrum. In digital image analysis, for example, an inhibition function of the following type is applied for texture detection before the Fourier transformation:

$$h(m, n) = \prod_{i = m,\, n} \left( 0.54 - 0.46 \cos \frac{2 \pi i}{N} \right)$$
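The weighting function above has the form of a separable Hamming window. A minimal sketch of applying it before a transform is given below, assuming that the cosine argument is normalized by the window length in each dimension; that normalization, like the names `hamming`, `window_2d` and `apply_window`, is an assumption and not taken from the source.

```python
import math


def hamming(i, length):
    """One-dimensional factor 0.54 - 0.46 cos(2*pi*i/length), i = 0..length-1.

    Dividing by `length` is an assumption; some definitions use length - 1.
    """
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * i / length)


def window_2d(rows, cols):
    """h(m, n) as the product of the two one-dimensional factors."""
    return [[hamming(m, rows) * hamming(n, cols) for n in range(cols)]
            for m in range(rows)]


def apply_window(data):
    """Weight a two-dimensional array before it is Fourier transformed."""
    h = window_2d(len(data), len(data[0]))
    return [[data[m][n] * h[m][n] for n in range(len(data[0]))]
            for m in range(len(data))]


h = window_2d(8, 8)
weighted = apply_window([[1.0] * 8 for _ in range(8)])
print(round(h[0][0], 4), round(h[4][4], 4))
```

The weights fall from about 1 in the middle of the window to 0.08 × 0.08 at the corner, which damps the discontinuities at the window edges that would otherwise appear as interfering components in the power spectrum.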

Claims

1. Method for examining macromolecules, comprising the following method steps: a) creating sequence data of molecular sequences of macromolecules; b) converting the sequence data into frequency-modulated frequency data; c) transforming the frequency data into a Fourier space; d) using Fourier analyses for comparison, weighting, cataloging and / or typing of the frequency data; e) back-transforming the compared, weighted, cataloged and / or typed frequency data into sequence data in weighted, cataloged and / or typed form.

2. The method according to claim 1, characterized in that methods of filtering information from digital image analysis are used for comparing, weighting, cataloging and / or typing.

3. The method according to claim 1 or claim 2, characterized in that methods of frequency analysis are used for comparison, for weighting, for cataloging and / or for typing.

4. The method according to any one of the preceding claims, characterized in that a stochastic information filtering in the Fourier space is used for comparison, for weighting, for cataloging and / or for typing.

5. The method according to any one of the preceding claims, characterized in that information units and structural information from multidimensional protein and / or DNA databases are encoded in corresponding sequence codes for creating sequence data.

6. Apparatus for examining macromolecules with a multiplicity of electronic components for modeling frequency data that simulate molecular sequences, and with a multiplicity of frequency filters for comparing, weighting, cataloging and / or for typing the frequency data modeled by the multiplicity of electronic components.

7. The device according to claim 6, characterized in that the plurality of electronic components and the plurality of frequency filters are determined by means of computer-aided frequency analyses and are coupled to one another, with computer assistance, to form a hardware network which simulates the sequence of information units of macromolecules.

8. The device according to claim 7, characterized in that the information units are bases of nucleic acids, amino acid residues of proteins and / or three-dimensional structure units of proteins and / or DNA.

9. Application of the method according to one of claims 1 to 5 or the device according to claim 6, 7 or 8 for the analysis of protein sequences.

10. Use of the method according to one of claims 1 to 5 or the device according to claim 6, 7 or claim 8 for the analysis of DNA sequences.

11. Application of the method according to one of claims 1 to 5 or the device according to claim 6, 7 or 8 for the examination and sampling of three-dimensional protein databases.

12. Application of the method according to one of claims 1 to 5 or the device according to claim 5, 6 or 8 for examining three-dimensional DNA structure information for recurring patterns.

13. Application of the method according to any one of claims 1 to 5 or the device according to claim 5, 6 or 8 for the interactive, instantaneous examination of sequence-based data sets of differently structured macromolecules.