WO2008129458A1

WO2008129458A1 - A method for data mining dna frequency based spectra

Info

Publication number: WO2008129458A1
Application number: PCT/IB2008/051431
Authority: WO
Inventors: Evan E. Santo; Nevenka Dimitrova
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2007-04-18
Filing date: 2008-04-15
Publication date: 2008-10-30

Abstract

The present invention relates to a method for data mining spectra (20) generated from a DNA sequence. Firstly, a plurality of spectra based on the DNA sequence is provided by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients (Usk_X(k)), where each kind of Fourier coefficient constitutes a channel (X), X = A, T, C and G. A search template (ST) is defined in a frequency interval (K_i) with intervals for the Fourier coefficients (Usk_X(k)) with respect to one or more channels (X). Then the search template (ST) is applied on at least a portion of the plurality of spectra, and the portion of the plurality of spectra is marked or tagged according to a degree of similarity with the search template (ST). The invention is particularly advantageous for obtaining an improved data mining method for spectral DNA analysis. The invention is also advantageous in obtaining patterns both globally and locally in a relatively fast and efficient manner.

Description

A method for data mining DNA frequency based spectra

FIELD OF THE INVENTION

The present invention relates to a method for data mining DNA frequency based spectra (or a RNA sequence or an amino acid sequence), a corresponding computer program, a corresponding database, a corresponding system, and a corresponding signal.

BACKGROUND OF THE INVENTION

Sequence composition and harmonic properties are of emerging importance in the study of gene regulation. Islands of CpG dinucleotides are tightly associated with active sites of gene transcription as well as arbiters of gene repression via their DNA methylation status. In the field of genomic repeats, functionally important ones range from disease-causing triplet glutamine and alanine codon repeats to much larger repeat classes and transposable elements implicated in epigenetic and RNAi-mediated silencing. Therefore, as biological processes (especially epigenetic ones) continue to be explored, it will be increasingly important to have a tool that can systematically and comprehensively evaluate composition and harmonic properties of associated genomic sequences.

Spectral analysis is based on considering the occurrences of each nucleotide base in a DNA sequence as individual digital signals, cf. Benson et al. in Nucleic Acid Res. 18 (21), p. 6305-6310, and 18 (10), 3001-3006, 1990, for earlier references on this topic. Each of these four signals, i.e. A, T, C and G, is transformed into frequency space (k). The magnitude of a frequency component then reveals how strongly a certain pattern of the nucleotide base is repeated at that frequency. A larger value usually indicates a stronger presence of the repetition.

To improve the readability of the results, each nucleotide base is represented by a color and the frequency spectra of the four bases are combined and presented as a color spectrogram, cf. Anastassiou, "Frequency-domain analysis of biomolecular sequences," Bioinformatics, vol. 16, pp. 1073 - 1081, Dec. 2000. This results in a very powerful visualization tool for DNA analysis. The resultant pixel color thus shows the relative intensity of the four bases at a particular frequency. The representation of a DNA sequence as a color image allows patterns to be easily identified by visual inspection and automatic visual content analysis. An important advantage of performing DNA analysis in the spectral domain is that the N²-scaling of conventional sequence to sequence matching is avoided, N being the number of nucleotide bases in the sequence. US 6,287,773 discloses e.g. a frequency domain based comparison method which scales as N log (N), which may very significantly lower the computational time for long sequences, e.g. longer than 10,000 nucleotide bases.

Even with the advantages of spectral analysis for DNA analysis, there is still a need for even faster and/or more efficient analysis tools because of the huge amount of data. For example, the entire chromosome 1 of the human genome is 247 million nucleotides long, and accordingly viewing the DNA spectrograms as a so-called spectrovideo, as recently suggested by N. Dimitrova et al, "Analysis and visualization of DNA spectrograms: open possibilities for genome research," in ACM MM., Santa Barbara, CA, Oct. 2006, may also be a tedious task.

Moreover, despite efforts to date, a need remains for systems and methods that facilitate expeditious data mining for analysis of genomic information. Also there remains a need for tools that can identify structurally or compositionally similar patterns that exhibit similar spectral properties. Such tools are to be contrasted with conventional sequence alignment tools that seek to align sequences in linear order or by nucleotide appearance. Hence, an improved data mining method would be advantageous, and in particular a more efficient and/or reliable method would be advantageous.

SUMMARY OF THE INVENTION

Accordingly, the invention preferably seeks to mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination. In particular, it may be seen as an object of the present invention to provide a method that solves the above mentioned problems of the prior art with data mining spectral DNA data.

This object and several other objects are obtained in a first aspect of the invention by providing a method for data mining spectra generated from a DNA sequence, the method comprising:
- providing a DNA sequence,
- creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients, where each kind of Fourier coefficient constitutes a channel,
- defining a search template (ST) in a frequency interval (K_i) with intervals for the Fourier coefficients with respect to one or more channels,
- applying the search template (ST) on at least a portion of the plurality of spectra, and
- marking the portion of the plurality of spectra according to a degree of similarity with the search template (ST).

The invention is particularly, but not exclusively, advantageous for obtaining an improved data mining method for spectral DNA analysis. The invention is also advantageous in obtaining patterns - both globally and locally - in a relatively fast and efficient manner as compared to other techniques. Globally, the method according to the present invention may be applied for making tailored search templates that can quickly search the entire data set constituted by the plurality of spectra, so-called exhaustive mining. Locally, the method according to the present invention can quickly be used to narrow down a frequency limited search template (ST) and apply the template across a chosen portion of spectra. Such searches in DNA spectral analysis are, to the best knowledge of the applicant, hitherto not known in the art. This and other advantages will be further explained in detail below.

It is to be understood that in the context of the present invention, the term "data mining" is considered to comprise operations that segregate items or objects into groups according to a specified criterion and/or the grouping of items or objects by class or kind or value. Notice that "marking" does not necessarily imply that items or objects are moved - either physically, virtually, and/or visually. As a special case, data mining can include matching of objects or items with one another. For exact matching, the intervals of the Fourier coefficients can be set to zero or close to zero. Moreover, the present invention can be advantageously employed for alignment of different DNA sequences using spectral analysis. In particular, the present invention can be applied for alignment between two or more different genomes. Within the context of the present invention, the term "similarity" is taken to comprise similarity in a mathematical meaning of the term, in particular vector similarity in the multi-dimensional Fourier space i.e. k-space.

The application of the present invention is not limited to applications in connection with analysis of DNA sequences, but could also be applied on similar sequences of relevance within biochemistry e.g. RNA sequences and amino acid sequences. One may create a binary indicator representation for the amino acids (20 of them) and then apply STFT to convert the BIS sequences to Fourier domain space. Thus, one can concatenate and make one long vector consisting of all the 20 binary indicator sequences. Then the rest of the procedure for implementing the invention would be the same. Here is a list of the amino acids:

alanine - ala - A
arginine - arg - R
asparagine - asn - N
aspartic acid - asp - D
cysteine - cys - C
glutamine - gln - Q
glutamic acid - glu - E
glycine - gly - G
histidine - his - H
isoleucine - ile - I
leucine - leu - L
lysine - lys - K
methionine - met - M
phenylalanine - phe - F
proline - pro - P
serine - ser - S
threonine - thr - T
tryptophan - trp - W
tyrosine - tyr - Y
valine - val - V
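As a minimal sketch of the 20-channel binary indicator representation for an amino acid sequence, the following may be considered. The function name and the example peptide are hypothetical and serve only as illustration.

```python
# One-letter codes of the 20 amino acids; each one becomes a channel.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def protein_indicator_sequences(protein):
    """Return {residue: list of 0/1} with a 1 wherever that residue occurs."""
    protein = protein.upper()
    return {aa: [1 if ch == aa else 0 for ch in protein] for aa in AMINO_ACIDS}

# Hypothetical eight-residue peptide used only as toy input.
bis = protein_indicator_sequences("MKTAYIAK")
print(bis["K"])  # [0, 1, 0, 0, 0, 0, 0, 1]
print(bis["A"])  # [0, 0, 0, 1, 0, 0, 1, 0]
```

Each of the 20 sequences can then be transformed in the same way as the four nucleotide channels.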

The 20 different amino acids can be mapped to 20 different colors in Red-Green-Blue (RGB) space (or Hue-Saturation-Value (HSV) space). Either one of these spaces can be quantized into 20 colors - one for each of the amino acids. Thus, the teaching of the present invention is not limited to DNA analysis, but may be extended to RNA and/or amino acid analysis with the relevant modifications readily recognized by a skilled person in this field.

Beneficially, the search template (ST) can be applied on the entire plurality of spectra, thereby "mining" or sweeping all spectra. This can be possible for some or all channels.

Advantageously, a clustering process may be performed on the plurality of spectra before applying the search template (ST). Thus, a clustering can be made with a resulting regrouping of the spectra, making the spectra "non-linear" in the ordering. Beneficially, an outcome of the clustering process may be applied for choosing the portion of spectra for data mining from the plurality of spectra. Thus, for instance, hierarchical clustering can be performed and then frequency data mining on the part of the spectra that appears relevant. In the present context, it is to be understood that outcome can mean any mathematically, visually, and/or biologically significant outcome relevant for selecting which spectra to data mine. This choice of portion of spectra can be made both supervised (active) and unsupervised (passive) in relation to a user of the present invention. Advantageously, an outcome of the clustering process is applied for choosing the search template (ST). In the present context, it is to be understood that outcome can mean any mathematically, visually, and/or biologically significant outcome relevant for selecting which search template to choose for further analysis. This choice of search template can be made both supervised (active) and unsupervised (passive) in relation to a user of the present invention. In particular, the search template can be applied outside of the DNA sequence initially under investigation, for instance in a completely different genome.

In a particular embodiment, the similarity can be based on the degree of variation in a single channel, e.g. X = A, or the similarity can be based on the degree of variation in more than one channel, e.g. X = A and C, or the similarity can be based on the degree of variation of all the channels i.e. all of A, T, C, and G. If the present invention is applied for RNA studies, the available channels are then accordingly A, U, C, and G. If the invention is applied to protein studies, the channels can represent the 20 different amino acids: A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V.

In another embodiment, the frequency interval (K_i) of the search template (ST) may cover all frequencies in the spectra, corresponding to "full mining" of all frequencies. This can be helpful for statistical analysis, as will be explained below, even though such a search template is time consuming to implement.

Typically, the present invention can be implemented in an embodiment where the plurality of spectra are mapped into a color space, e.g. RGB, to facilitate easy viewing of the data. This can be particularly advantageous in combination with embodiments where a clustering is performed on the plurality of spectra in color space, e.g. clustering can be made on color spectrograms. Additionally, the marked portion of spectra can be displayed visually, optionally as a spectrovideo because this provides for (continuous) showing of the colored spectra. Additionally, the marked portion of spectra may be annotated using a database arranged therefore.

In a particular embodiment, an additional clustering process can be performed on the marked portion of spectra to facilitate improved in-depth analysis.

Beneficially, a statistical correlation test can be performed on the marked portion of spectra for further analysis, cf. below.

Advantageously, the marked portion of spectra can be applied for an alignment analysis. The inventors have made some tests that indicate that spectral analysis can - with some improvements - assist in alignment analysis, which is typically a heavy computational task to implement in direct space. The data mining method according to the present invention can also be applied for a plurality of DNA sequences originating from different locations in different genomes. Thus, the method can be used for comparison, matching, etc. between e.g. genomes of two different species. Because DNA sequencing and analysis is a tedious task to carry out, this can be particularly advantageous for comparison in order to expedite analysis of new DNA sequences.

In a second aspect, the invention relates to a computer program product being adapted to enable a computer system comprising at least one computer having data storage means associated therewith to implement a method according to the first aspect of the invention.

This aspect of the invention is particularly, but not exclusively, advantageous in that the present invention may be implemented by a computer program product enabling a computer system to perform the operations of the second aspect of the invention. Thus, it is contemplated that some known processor, e.g. a computer, may be changed to operate according to the present invention by installing a computer program product on a computer system controlling the said processor. Such a computer program product may be provided on any kind of computer readable medium, e.g. magnetically or optically based medium, or through a computer based network, e.g. the Internet.

In a third aspect, the invention relates to a processor for implementing one or more parts of a method for data mining spectra generated from a DNA sequence, the method comprising:
- providing a DNA sequence,
- creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients, where each kind of Fourier coefficient constitutes a channel,
- defining a search template (ST) in a frequency interval (K_i) with intervals for the Fourier coefficients with respect to one or more channels,
- applying the search template (ST) on at least a portion of the plurality of spectra, and
- marking the portion of the plurality of spectra according to a degree of similarity with the search template (ST).

Thus, the processor may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different interconnected units and sub-processors. In particular, the interconnected units can implement a parallel processing algorithm. Thus, a parallel implementation of this spectral data mining can be a way of implementing the invention. The clustering and visualization can e.g. be done on a set of distributed processors geographically in different positions.

In a fourth aspect, the invention relates to a signal for implementing one or more parts of a method for data mining spectra generated from a DNA sequence, the method comprising:
- providing a DNA sequence,
- creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients, where each kind of Fourier coefficient constitutes a channel,
- defining a search template (ST) in a frequency interval (K_i) with intervals for the Fourier coefficients with respect to one or more channels,
- applying the search template (ST) on at least a portion of the plurality of spectra, and
- marking the portion of the plurality of spectra according to a degree of similarity with the search template (ST).

The signal can be any kind of signal, wireless or wire-mediated. The signal can comprise intermediate or preliminary results for implementing the present invention, and/or the signal can comprise results arranged for displaying the present invention upon reception at a receiving end.

The first, second, third, and fourth aspect of the present invention may each be combined with any of the other aspects. These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be explained, by way of example only, with reference to the accompanying Figures, where

Figure 1 is an exemplary binary sequence (BIS) pattern,

Figure 2 is a plot of the corresponding BIS pattern from Figure 1 of the four nucleotide bases A, T, C, and G,

Figure 3 is the converted frequency spectrum of each base,

Figure 4 is similar to Figure 3 and it is indicated to the right that a superposition of the color mapping vectors weighted by the magnitude of the frequency component of the respective nucleotide base is obtained,

Figure 5 schematically illustrates the generation of a single, colored spectrum from short time Fourier transformation (STFT) of a part of a DNA sequence,

Figure 6 is similar to Figure 5, and illustrates the generation of a plurality of spectra by repeated STFT along the DNA sequence,

Figure 7 is a principle sketch of the method according to the present invention,

Figure 8 is a schematic drawing of the spectra at various frequencies according to the present invention,

Figure 9 is a drawing similar to Figure 8 showing a search template according to the present invention,

Figure 10 is a drawing similar to Figure 8 showing another search template according to the present invention,

Figure 11 is a drawing similar to Figure 8 schematically illustrating a clustering process according to the present invention,

Figure 12 is an example of an unclustered spectrogram,

Figure 13 is an example of a clustered spectrogram,

Figure 14 is an example of a selected portion of spectra chosen from Figure 13,

Figure 15 is the resulting output of the marked spectra from application of a search template according to the present invention,

Figure 16 is a graph of the invention applied for alignment analysis, and

Figure 17 is a flow chart of a method according to the invention.

DETAILED DESCRIPTION OF AN EMBODIMENT

DNA spectrograms can be generated in a conventional manner, as described in greater detail herein below with reference to Figures 1-6. For example, a conventional algorithm or technique for the generation of DNA spectrograms may be employed that entails the following five steps:

(i) Formation of binary indicator sequences (BISs) u_A[n], u_T[n], u_C[n] and u_G[n] for the four nucleotide bases. An exemplary BIS pattern, generated from a DNA sequence 10, is reproduced in Figure 1, and a plot of the BIS values is presented in Figure 2.

(ii) Discrete Fourier Transform (DFT) on BISs. The frequency spectrum of each base is obtained by computing the DFT of its corresponding BIS using Equation (1):

$$U_X[k] = \sum_{n=0}^{N-1} u_X[n]\, e^{-j\frac{2\pi}{N}kn}, \qquad k = 0, 1, \ldots, [N/2] + 1, \quad X = A, T, C \text{ or } G \tag{1}$$

As illustrated in Figure 3, the sequence U_X[k] provides a measure of the frequency content at frequency k, which is equivalent to an underlying period of N/k samples. N is the total number of nucleotide bases in the window W, cf. Figures 5 and 6. The number of bases can be maximum 300 nucleotide bases, preferably maximum 500 nucleotide bases, or even more preferably 700 nucleotide bases. Alternatively, the period can be maximum 3000 nucleotide bases, preferably maximum 5000 nucleotide bases, or even more preferably maximum 10000 nucleotide bases.

Alternatively, the wavelet transform can be applied on the binary indicator sequences. One implementation uses the discrete wavelet transform (DWT), e.g. the Haar wavelet transform. It is a discrete transform where the input is represented by 2^n numbers, and the function stores the difference of number pairs and passes the sum. This is repeated in a recursive manner, pairing up the sums to make the next scale and finally providing 2^n - 1 differences and a single final sum. The obtained wavelet coefficients for each BIS will constitute a single channel and the rest of the steps are the same as when using the DFT. There are two properties which make the wavelet transform an alternative to the DFT: a) the complexity of the DWT is in the order of O(n) operations and b) the DWT captures the notion of the frequency as well as the times at which these frequencies appear.
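The pairing scheme described for the Haar transform can be sketched as follows. This is an unnormalised illustration under the assumption that the input length is a power of two (2^n); the function name and the ordering of the stored differences (finest scale first) are choices made here, not taken from the text.

```python
def haar_sum_difference(values):
    """Unnormalised Haar transform: returns (final_sum, differences).

    Assumes len(values) is a power of two. Differences are listed from the
    finest scale to the coarsest.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    sums, differences = list(values), []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        differences.extend(a - b for a, b in pairs)  # stored detail values
        sums = [a + b for a, b in pairs]             # passed on to the next scale
    return sums[0], differences

# Binary indicator sequence of one base over a window of 2**3 = 8 positions.
total, diffs = haar_sum_difference([1, 0, 0, 1, 1, 1, 0, 0])
print(total, diffs)  # 4 [1, -1, 0, 0, 0, 2, 0]
```

For 8 input values the sketch yields 7 differences and one final sum, and the final sum equals the number of occurrences of the base in the window.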

(iii) Mapping of DFT values to RGB colors. The four DFT sequences are reduced to three sequences in the RGB space by a set of linear equations which are reproduced below:

$$\begin{aligned}
X_r[k] &= a_r U_A[k] + t_r U_T[k] + c_r U_C[k] + g_r U_G[k] \\
X_g[k] &= a_g U_A[k] + t_g U_T[k] + c_g U_C[k] + g_g U_G[k] \\
X_b[k] &= a_b U_A[k] + t_b U_T[k] + c_b U_C[k] + g_b U_G[k]
\end{aligned} \tag{2}$$

where (a_r, a_g, a_b), (t_r, t_g, t_b), (c_r, c_g, c_b) and (g_r, g_g, g_b) are the color mapping vectors for the nucleotide bases A, T, C and G, respectively. The resultant pixel color (X_r[k], X_g[k], X_b[k]) is thus a superposition of the color mapping vectors weighted by the magnitude of the frequency component of their respective nucleotide base, as indicated in the right side of Figure 4. Mapping of DFT values to colors is illustrated in Figure 5 for a single spectrum 20, and in Figure 6 for several spectra 20 i.e. a spectrogram 30. Both Figures 5 and 6 are reproduced in grey-tones here for illustrative purposes. Other color space mappings of the frequency domain based U values are also possible, e.g. to HSV space.

(iv) Normalizing the pixel values. Before rendering the colored spectrogram 30, the RGB values of each pixel are generally normalized so as to fall between 0 and 1. Numerous normalization procedures are readily available to the skilled person once the general principle of the invention is recognized.

(v) Short-time Fourier Transform (STFT). A plurality of DNA spectra 20 i.e. a spectrogram 30 is formed by a concatenation of individual DNA sequence spectra 20 ("strips"), where each strip or spectrum generally depicts the frequency spectrum of a local DNA segment as shown in Figure 6. The short term Fourier transform (STFT) has a window W that is shifted along the DNA sequence from 5' to 3' as shown in Figure 6.
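Steps (i) to (v) can be sketched end to end as below. This is an illustrative sketch under stated assumptions, not the implementation used by the inventors: the color mapping vectors are hypothetical (the text does not fix their values), a direct O(N²) DFT is used instead of an FFT for brevity, the frequency index runs over k = 0 .. N//2, and normalisation simply divides by the global maximum, which is only one of the many possible procedures.

```python
import cmath

BASES = "ATCG"
# Hypothetical colour-mapping vectors (r, g, b) for A, T, C and G.
COLOR_VECTORS = {"A": (0, 0, 1), "T": (1, 0, 0), "C": (0, 1, 0), "G": (1, 1, 0)}

def indicator_sequences(window):
    """Step (i): one 0/1 sequence per base."""
    return {b: [1 if ch == b else 0 for ch in window] for b in BASES}

def dft_magnitudes(u):
    """Step (ii): |U[k]| for k = 0 .. N//2, via a direct DFT."""
    n_len = len(u)
    return [abs(sum(u[n] * cmath.exp(-2j * cmath.pi * k * n / n_len)
                    for n in range(n_len)))
            for k in range(n_len // 2 + 1)]

def spectrum_rgb(window):
    """Steps (i)-(iii): unnormalised RGB triple per frequency for one window."""
    mags = {b: dft_magnitudes(u) for b, u in indicator_sequences(window).items()}
    n_freq = len(window) // 2 + 1
    return [tuple(sum(COLOR_VECTORS[b][c] * mags[b][k] for b in BASES)
                  for c in range(3))
            for k in range(n_freq)]

def spectrogram(sequence, win_size, step=1):
    """Steps (iv)-(v): slide the window, then scale all values into [0, 1]."""
    strips = [spectrum_rgb(sequence[p:p + win_size])
              for p in range(0, len(sequence) - win_size + 1, step)]
    peak = max(v for strip in strips for px in strip for v in px) or 1.0
    return [[tuple(v / peak for v in px) for px in strip] for strip in strips]

# Toy sequence with a period-3 repeat; 21 bases, window of 12, shift of 3.
image = spectrogram("ATGATGATGATGCCGCCGCCG", win_size=12, step=3)
print(len(image), len(image[0]))  # 4 7  (strips, frequencies per strip)
```

In the first window the base A recurs every third position, so its indicator sequence has a strong component at k = 4, which corresponds to a period of N/k = 12/4 = 3 bases.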

The spectrogram shown in Figure 6 has a length of 60 nucleotide bases and the window W is shifted one base at a time. On the vertical scale in the spectrogram 30, the frequency k is shown (increasing downwards), whereas the start position P_ini on the DNA sequence 10 is shown on the horizontal scale in the spectrogram 30.

The appearance of a spectrogram 30 is very much affected by the choice of the STFT window W size, the length of the overlapping sequence between adjacent windows W, and the color mapping vectors, cf. Equation (2). The window size determines the effective range of a pixel value in a spectrogram 30. A larger window results in a spectrogram that reveals statistics collected from longer DNA segments. In general, the window W size should be made several times larger than the length of the repetitive pattern of interest and smaller than the size of the region that contains the pattern of interest. For exploratory purposes, it is recommended to try a range of window sizes. The window overlap determines the length of the DNA segment common to two adjacent STFT windows. Therefore, the larger the overlap, the more gradual is the transition of the frequency spectrum from one STFT window to the next. A higher image resolution makes it easier to extract features by image processing or visual inspection.

Viewing large amounts of sequence data requires an efficient method for information analysis and visualization. In order to optimize the viewing of spectra derived from very large sequences or spectra containing many small windows, the spectra can be rendered as a video, as shown by the present inventor: N. Dimitrova, et al. "Analysis and visualization of DNA spectrograms: open possibilities for genome research," in ACM MM., Santa Barbara, CA, Oct. 2006, which is hereby incorporated by reference in its entirety.

Figure 7 is a principle sketch of the method according to the present invention. With reference to Figures 3 and 8 (cf. below), the four channels A, T, C, and G each define a three-dimensional space in reciprocal k-space by the coordinates frequency k, Fourier coefficient Usk_X(k), and the spectrum number s. The invention operates by defining a search template ST in a certain frequency interval K_i with intervals for Usk_X(k) with respect to e.g. one channel C (usually more than one channel is investigated). The search template ST is schematically indicated in Figure 7 by a rectangular box, and the three vectors U_1, U_2, and U_3 schematically indicate three different spectra that are investigated with the search template ST.

The search template ST defines values of Usk_X and intervals connected thereto, e.g. a tolerance within 10%, and according to the invention the template ST is applied in a data mining process on a selected portion of spectra. Thereby some spectra 20 are tagged or marked for later processing and/or analysis according to the chosen degree of similarity with the search template ST.
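A search template of this kind can be sketched as a set of target coefficient values with a relative tolerance. The data layout (one dictionary of per-channel magnitude lists per spectrum), the function names and the toy numbers are assumptions made for illustration only.

```python
def matches_template(spectrum, template, rel_tol=0.10):
    """True if every templated coefficient lies within value * (1 +/- rel_tol)."""
    return all(abs(spectrum[ch][k] - value) <= rel_tol * abs(value)
               for (ch, k), value in template.items())

def mark_spectra(spectra, template, rel_tol=0.10):
    """Return the indices s of the spectra that fall inside the template."""
    return [s for s, spec in enumerate(spectra)
            if matches_template(spec, template, rel_tol)]

# Toy data: three spectra, channel C only, magnitudes at k = 0..3.
spectra = [
    {"C": [5.0, 0.2, 3.0, 1.0]},
    {"C": [5.0, 0.1, 3.2, 1.05]},
    {"C": [4.0, 0.3, 1.0, 2.0]},
]
# Template over the frequency interval K_i = {2, 3} in channel C.
template = {("C", 2): 3.0, ("C", 3): 1.0}
print(mark_spectra(spectra, template))  # [0, 1]
```

Setting `rel_tol` to zero (or close to zero) corresponds to the exact-matching case, and adding keys for further channels extends the template to several channels.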

Figure 8 is a schematic drawing of the spectra at various frequencies according to the present invention, listing specifically the Fourier coefficients Usk_X(k) for the different spectra 20 that are consecutively numbered downwards in the left part of the Figure by the running index s. The frequency k is indicated in the top part of Figure 8. The frequency of the DFT runs from 1 to the maximum frequency km of the Fourier transform. As before, the four nucleotide bases A, T, C, and G constitute four channels i.e. X = A, T, C, and G. Usually more than one channel is investigated, and thereby the similarity with the search template can be based on the degree of variation in more than one channel, e.g. X = A and C, and particularly, the similarity can be based on the degree of variation of all the channels i.e. X = A, T, C, and G. To emphasize that each entry in Figure 8 comprises 4 different channels, the entry named U1k_X in the first row (s=1) has been blown up and all four channels written out explicitly in the upper part of Figure 8.

Figure 9 is a drawing similar to Figure 8 showing a search template STa according to the present invention, where a limited number of frequencies, here the frequency interval K_i is k=2 and k=3, and a limited number of spectra, here s=1, 2 and 3, are investigated by applying the search template STa on this subset of data. This embodiment of the invention is known as a so-called "frequency query".

In the first embodiment, the frequency interval K_i of the search template ST can cover maximum N/2 different frequencies, where N is the window size. For some embodiments, the frequency interval K_i of the search template ST can cover maximum 8 different frequencies, preferably maximum 4 different frequencies, or most preferably 2 different frequencies. Alternatively, the frequency interval K_i of the search template ST can cover maximum 10% of the frequencies in the spectrum, preferably maximum 5% of the frequencies in the spectrum, or maximum 2% of the frequencies. The spectra have a constant length or period W during analysis. The length of the period W is generally limited to N/2 because otherwise analysis can be too time-consuming. It should be emphasized that the frequency interval K_i can comprise several distinct frequency intervals i.e. with reference to Figure 9, K_i may comprise k=2, k=6 or k=2 and k=4. Thus, K_i can be any suitable subgroup or combination of subgroups within the interval from k=1 to k=km (the maximum frequency of the Fourier transform).

Figure 10 is a drawing similar to Figure 8 showing another search template STb according to the present invention, where at a single frequency, here k=2, all spectra 20 are investigated using the search template STb, which is a so-called "frequency mining" process.
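Frequency mining at a single frequency can be sketched as a vector comparison across all four channels. Euclidean distance is used here as one possible reading of "degree of similarity"; the distance bound, the function names and the toy data are hypothetical.

```python
import math

CHANNELS = "ATCG"

def similarity_at_frequency(spectra, template_vec, k):
    """Euclidean distance between each spectrum's (A, T, C, G) magnitudes at
    frequency k and the template vector; smaller means more similar."""
    return [math.dist([spec[ch][k] for ch in CHANNELS], template_vec)
            for spec in spectra]

def frequency_mining(spectra, template_vec, k, max_distance):
    """Mark every spectrum whose distance at frequency k is within the bound."""
    distances = similarity_at_frequency(spectra, template_vec, k)
    return [(s, d) for s, d in enumerate(distances) if d <= max_distance]

# Toy data: three spectra, four channels, magnitudes at k = 0..2.
spectra = [
    {"A": [3, 0, 1.0], "T": [3, 0, 1.0], "C": [3, 0, 2.0], "G": [3, 0, 2.0]},
    {"A": [3, 0, 1.0], "T": [3, 0, 1.0], "C": [3, 0, 2.0], "G": [3, 0, 2.5]},
    {"A": [3, 0, 0.0], "T": [3, 0, 0.0], "C": [3, 0, 0.0], "G": [3, 0, 0.0]},
]
marked = frequency_mining(spectra, [1.0, 1.0, 2.0, 2.0], k=2, max_distance=1.0)
print(marked)  # [(0, 0.0), (1, 0.5)]
```

Returning the distance alongside each index keeps the degree of similarity available for later ranking or statistics.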

This procedure can be repeated at each frequency, from k=1 to k=km, in order to provide an exhaustive data mining of the plurality of spectra 20. This can in particular be used to obtain statistical results about the "strength" of the pattern of each frequency with regard to some classification of the spectra. This may also be used as a pre-processing step to clustering. One may thereby be able to say whether a specific pattern is enriched in the given DNA spectrogram 30. This process can be thought of as "seeding" a pattern and searching according to the specified criteria. In the "frequency mining" approach, this data mining can be implemented systematically within each frequency domain of every spectrum in the dataset. Therefore, in the "frequency mining" approach, a truly exhaustive search in a dataset consisting of 1000 spectra and 50 frequency domains will perform 50,000 (1000 x 50) searches.

Figure 11 is a drawing similar to Figure 8 schematically illustrating a clustering process according to the present invention, where spectra s=1 and s=2 are interchanged according to a pre-defined clustering algorithm applied on the plurality of spectra 20. The present invention can advantageously be applied in connection with clustering in general, e.g. hierarchical clustering, k-means clustering, self-organizing maps, and various other unsupervised custom clustering methods available. The clustering can be performed before and/or after application of the search template ST according to the present invention.
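The regrouping of spectra by clustering can be illustrated with a deliberately naive agglomerative scheme. Single linkage with L2 distance is an assumption chosen for brevity, and the cubic-time loop is only suitable for toy sizes; a production system would use an established clustering library.

```python
import math

def cluster_order(vectors, n_clusters):
    """Naive single-linkage agglomerative clustering (toy sizes only).

    Returns the clusters as lists of original spectrum indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):  # smallest L2 distance between members of a and b
        return min(math.dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# Each row is one spectrum flattened to a vector of Fourier magnitudes.
vectors = [[4.0, 0.1], [0.2, 3.9], [4.1, 0.0], [0.0, 4.0]]
clusters = cluster_order(vectors, n_clusters=2)
new_order = [s for cluster in clusters for s in cluster]
print(clusters)   # [[0, 2], [1, 3]]
print(new_order)  # [0, 2, 1, 3]
```

In the resulting order the second and third spectra have changed places, so that similar spectra become adjacent and the ordering is no longer the linear order along the sequence.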

The hierarchical clustering of the spectra may be carried out in two ways, using either 1) the Fourier coefficients of the binary indicator sequences or 2) the DNA spectra in RGB space (or another suitable color space). Generally it is preferred to perform clustering on the Fourier-space data because all four dimensions are represented equally, whereas conversion to RGB space inevitably causes data loss.

In particular, the step of computing the distance matrix can be mentioned. Every vector is compared to every other vector exactly once and a distance metric is applied. Euclidean distance (L2) can be applied, but other kinds of distance, e.g. Manhattan distance (L1), Mahalanobis distance, Chi-square distance, etc., can perform equally well and there may be inherent advantages of some over others. The choice of distance metric is still being evaluated. The total number of comparisons performed will be N² / 2, where N is the number of windows. Therefore, a run performed with one contiguous sequence of size 46,944,323 bases (human chromosome 21) using a window size of 1500 and an overlap of 300 will yield 39,120 windows according to the formula (3):

N = [SeqLength / (WinSize - WinOverlap)] - 1 (3)
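Formula (3) can be checked with a few lines of Python. This is a minimal sketch; the function name `window_count` is hypothetical, and it assumes that the square brackets in formula (3) denote rounding up, which is the reading that reproduces the 39,120 windows quoted for chromosome 21.

```python
def window_count(seq_length: int, win_size: int, win_overlap: int) -> int:
    """Number of windows N according to formula (3).

    Assumption: the square brackets in formula (3) mean rounding up.
    """
    if not 0 <= win_overlap < win_size:
        raise ValueError("overlap must be non-negative and smaller than the window size")
    step = win_size - win_overlap
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-seq_length // step) - 1


# Chromosome 21 example from the text: window size 1500, overlap 300.
n_windows = window_count(46_944_323, 1500, 300)
print(n_windows)  # 39120
```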

Even with these modest parameters on the shortest human chromosome, there are 765,197,729 unique comparisons. This is both computationally expensive and memory intensive. Executing in Matlab, RAM utilized can run very near 16 gigabytes just for construction of this distance matrix. For the sake of scalability, a clear challenge will be to implement a more efficient clustering scheme since it is desirable to perform this operation across whole genomes at higher resolutions (smaller window sizes and greater overlap).
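The distance matrix step can be sketched as follows, assuming each window is represented by a flat vector of coefficients. The helper names (`euclidean`, `manhattan`, `condensed_distances`, `pair_count`) are illustrative and not part of the described implementation. Storing only one value per unordered pair (a "condensed" matrix) is one simple way to roughly halve the memory compared with a full square matrix.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """L2 distance between two equally long vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u: Vector, v: Vector) -> float:
    """L1 distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def condensed_distances(
    vectors: List[Vector],
    metric: Callable[[Vector, Vector], float] = euclidean,
) -> List[float]:
    """Compare every vector with every other vector exactly once.

    The result has N * (N - 1) / 2 entries, ordered (0,1), (0,2), ..., (N-2,N-1).
    """
    return [metric(u, v) for u, v in combinations(vectors, 2)]


def pair_count(n: int) -> int:
    """Exact number of unordered pairs among n windows."""
    return n * (n - 1) // 2
```

For N = 39,120 windows, `pair_count` gives 765,167,640 pairs, which is of the same magnitude as the figure quoted above. At 8 bytes per value this is roughly 6 GB in condensed form and roughly 12 GB as a full square matrix, which illustrates why the construction is memory intensive.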

Figure 12 is an example of an unclustered spectrogram 30. It is displayed as an annotated spectrovideo, where the horizontal axis represents frequencies k and the vertical axis represents spectra 20 of genomic sequence. On the left there are corresponding genomic annotations for each spectrum 20. The annotation can be provided from a database with corresponding DNA features and genomic annotation information, by associating a portion of the DNA sequence 10' with a DNA feature from the database, and displaying the plurality of spectra 20 in combination with the genomic annotation information ANN associated with each spectrum.

Figure 13 is an example of a clustered spectrogram 30 obtained by application of unsupervised clustering. In this embodiment, hierarchical clustering with the L2 distance measure is used. It is displayed as an annotated spectrovideo, where again the horizontal axis represents frequencies k and the vertical axis represents spectra 20 of genomic sequence in their respective clusters.

Figure 14 is an example of a selected portion of spectra chosen from Figure 13. The user has just observed some interesting frequency domains across some CpG island windows from chromosome 8 in the hierarchically clustered spectrovideo of Figure 13. The user would like to know what other windows contain something similar to the leftmost band (frequencies k=6 and 7). The user then supplies: a) which frames of the spectral video the pattern spans, b) which frequency domains are of interest (k=6 and 7), c) which spectra or window(s) the pattern spans, and d) what % tolerances will be allowed for the four spectral channels (A, T, C and G); here 20% each is used as an example. The query by frequency implementation, cf. Figure 9, of the search template ST is then applied. Each window is seeded in the frequency domains 6 and 7 and the entire dataset is searched as described in Figure 9.
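The query described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the names `build_template` and `apply_template`, the dictionary layout of a spectrum, and the rule that the allowed interval is the range spanned by the seed windows widened by the per-channel tolerance are all choices made for this example, not the implementation of Figure 9. Coefficient magnitudes are assumed to be non-negative, and list indices stand in for the frequencies k.

```python
from typing import Dict, Iterable, List, Tuple

CHANNELS = ("A", "T", "C", "G")
Spectrum = Dict[str, List[float]]  # channel -> magnitudes indexed by frequency
Template = Dict[Tuple[str, int], Tuple[float, float]]  # (channel, k) -> (low, high)


def build_template(
    seeds: List[Spectrum], freqs: Iterable[int], tol: Dict[str, float]
) -> Template:
    """For each channel and frequency of interest, allow the range spanned by
    the seed windows, widened by that channel's relative tolerance."""
    template: Template = {}
    for k in freqs:
        for ch in CHANNELS:
            values = [s[ch][k] for s in seeds]
            template[(ch, k)] = (
                min(values) * (1 - tol[ch]),
                max(values) * (1 + tol[ch]),
            )
    return template


def apply_template(spectra: List[Spectrum], template: Template) -> List[bool]:
    """Mark each spectrum True (foreground) if all constrained coefficients
    fall inside the template intervals, otherwise False (background)."""
    return [
        all(low <= s[ch][k] <= high for (ch, k), (low, high) in template.items())
        for s in spectra
    ]
```

With a tolerance of 0.2 for each of the four channels and frequencies 6 and 7, this corresponds to the 20% example given above. The spectra marked True would be collected as the foreground windows.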

Figure 15 is the result of the marked and selected spectrogram from application of the search template ST for Figure 14. The foreground windows are collected and a new image (or spectrovideo) is constructed. It is noted that most of the spectra have a common band at the frequencies k=6 and 7, as required by the search template ST applied on the plurality of spectra.

In the graph of Figure 16, an example of the present invention is given for alignment analysis. The ClustalW alignment score from 0 to 100 is given on the horizontal axis (0 is no similarity and 100 is perfect similarity). The vertical axis gives the spectral distance, where 0 means very similar (distance is 0) and 100 means very dissimilar.

The analysis is performed by the following steps:

1. Create a spectral analysis video for a given sequence and then obtain the sequences for given windows of interest.

2. Use ClustalW to perform multiple alignment of these input sequences.

3. Generate the distances using the same sequences that were used for ClustalW, using the same window size and overlap parameters that were used to generate these windows. For each pair of sequences, the distance in the spectral domain is needed together with the ClustalW score (same windows). If a different window size were used at this point, it would no longer be a direct comparison between the ClustalW scores and their associated distances.

4. Generate a list of all pairwise comparisons, each pair occurring only once, with both their distance scores and ClustalW scores. Perform a Pearson correlation test and produce a Pearson value and a p-value; e.g. Pearson= -0.842, p= 0.
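Step 4 above can be sketched in Python as follows. The function names and the dictionary inputs keyed by window pairs are hypothetical. Only the Pearson coefficient is computed here; the p-value is omitted, and a library routine such as SciPy's `scipy.stats.pearsonr` could be used to obtain both.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def paired_scores(
    names: Sequence[str],
    spectral_distance: Dict[Pair, float],
    clustalw_score: Dict[Pair, float],
) -> List[Tuple[str, str, float, float]]:
    """List every unordered window pair exactly once with both of its scores."""
    return [
        (a, b, spectral_distance[(a, b)], clustalw_score[(a, b)])
        for a, b in combinations(names, 2)
    ]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)
```

A strongly negative coefficient, as in the example value above, indicates that a high alignment score goes together with a small spectral distance.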

In this manner it is seen that when there is 100% alignment (same sequences) the distance in spectral space is 0. However, the range of similarity for spectral distance is much more forgiving: while the ClustalW alignment goes down to 0, the similarity in spectral space can still hold up to 40% (distance of 60%). There are two advantages of spectral alignment over symbolic (i.e. traditional sequence) alignment: 1) spectral alignment has the power to align sequences which are more dissimilar (i.e. it is more forgiving and more likely to find non-obvious similarities), and 2) it is fast because we are aligning not N but only N/2 values of the window (because the STFT is a symmetric transform).

Figure 17 is a flow chart of a method according to the invention. The method comprises:

51 providing a DNA sequence (or alternatively: RNA or amino acid sequence),

52 creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences BIS, and applying short term Fourier transform STFT on the binary indicator sequences, each spectrum comprising corresponding frequencies k and Fourier coefficients Usk_X(k), where each kind of Fourier coefficient constitutes a channel X,

53 defining a search template ST in a frequency interval K_i with intervals for the Fourier coefficients Usk_X(k) with respect to one or more channels X,

54 applying the search template ST on at least a portion of the plurality of spectra, and

55 marking the portion of the plurality of spectra according to a degree of similarity with the search template ST.
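The first two steps of the flow chart can be illustrated by the following Python sketch. It is a simplified illustration under several assumptions: a plain discrete Fourier transform over rectangular windows stands in for the STFT, only coefficient magnitudes are kept, only complete windows are used, and the function names are hypothetical. Since the binary indicator sequences are real-valued, only the frequencies from 0 up to half the window size are computed.

```python
import cmath
from typing import Dict, List

CHANNELS = ("A", "T", "C", "G")


def binary_indicators(seq: str) -> Dict[str, List[int]]:
    """One 0/1 sequence per nucleotide: 1 where that base occurs, else 0."""
    seq = seq.upper()
    return {ch: [1 if base == ch else 0 for base in seq] for ch in CHANNELS}


def dft_magnitudes(x: List[int]) -> List[float]:
    """Magnitudes of the DFT coefficients for k = 0 .. n//2.

    The input is real, so the coefficients above n/2 mirror these values."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectra(seq: str, win_size: int, win_overlap: int) -> List[Dict[str, List[float]]]:
    """One spectrum per window; each spectrum holds one magnitude list per channel."""
    if not 0 <= win_overlap < win_size:
        raise ValueError("overlap must be non-negative and smaller than the window size")
    step = win_size - win_overlap
    bis = binary_indicators(seq)
    result = []
    for start in range(0, len(seq) - win_size + 1, step):
        result.append(
            {ch: dft_magnitudes(bis[ch][start:start + win_size]) for ch in CHANNELS}
        )
    return result
```

For a sequence with a strict period of four bases, such as a repeated "ACGT", each channel shows peaks only at multiples of a quarter of the window size, which is the kind of band that a search template could then select.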

The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention or some features of the invention can be implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way.

Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different units and processors.

Although the present invention has been described in connection with the specified embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. In the claims, the term "comprising" does not exclude the presence of other elements or steps. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. In addition, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second" etc. do not preclude a plurality. Furthermore, reference signs in the claims shall not be construed as limiting the scope.

Claims

1. A method for data mining spectra (20) generated from a DNA sequence, the method comprising: providing a DNA sequence, creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients (Usk_X(k)), where each kind of Fourier coefficient constitutes a channel (X), defining a search template (ST) in a frequency interval (K_i) with intervals for the Fourier coefficients (Usk_X(k)) with respect to one or more channels (X), applying the search template (ST) on at least a portion of the plurality of spectra, and marking the portion of the plurality of spectra according to a degree of similarity with the search template (ST).

2. The method according to claim 1, wherein the search template (ST) is applied on the entire plurality of spectra.

3. The method according to claim 1, wherein a clustering process is performed on the plurality of spectra before applying the search template (ST).

4. The method according to claim 3, wherein an outcome of the clustering process is applied for choosing the portion of spectra for data mining from the plurality of spectra.

5. The method according to claim 3, wherein an outcome of the clustering process is applied for choosing the search template (ST).

6. The method according to claim 1, wherein the similarity is based on the degree of variation in a single channel.

7. The method according to claim 1, wherein the similarity is based on the degree of variation in more than one channel.

8. The method according to claim 7, wherein the similarity is based on the degree of variation of all the channels.

9. The method according to claim 1, wherein the frequency interval (K_i) of the search template (ST) covers all frequencies in the spectra.

10. The method according to claim 1, wherein the plurality of spectra are mapped into a color space.

11. The method according to claims 3 and 10, wherein the clustering is performed on the plurality of spectra in color space.

12. The method according to claim 10, wherein the marked portion of spectra are displayed, optionally as a spectra video.

13. The method according to claim 10, wherein the marked portion of spectra are annotated using a database arranged therefor.

14. The method according to claim 3, wherein an additional clustering process is performed on the marked portion of spectra.

15. The method according to claim 1, wherein a statistical correlation test is performed on the marked portion of spectra.

16. The method according to claim 1, wherein the marked portion of spectra is applied for a sequence alignment analysis.

17. The method according to claim 1 or claim 16, wherein the data mining method is applied for a plurality of DNA sequences originating from at least two different genomes.

18. The method according to claim 1, wherein the data mining method is applied on a RNA sequence or amino acid sequence.

19. The method according to claim 1, wherein wavelet transform is applied on the binary indicator sequences instead of short term Fourier transform.

20. A computer program product being adapted to enable a computer system comprising at least one computer having data storage means associated therewith to implement a method according to claim 1.

21. A processor for implementing one or more parts of a method for data mining spectra (20) generated from a DNA sequence, the method comprising: providing a DNA sequence, creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients (Usk_X(k)), where each kind of Fourier coefficient constitutes a channel (X), defining a search template (ST) in a frequency interval (K_i) with intervals for the Fourier coefficients (Usk_X(k)) with respect to one or more channels (X), applying the search template (ST) on at least a portion of the plurality of spectra, and marking the portion of the plurality of spectra according to a degree of similarity with the search template (ST).

22. A signal for implementing one or more parts of a method for data mining spectra (20) generated from a DNA sequence, the method comprising: providing a DNA sequence, creating a plurality of spectra based on the DNA sequence by converting the DNA sequence into a plurality of binary indicator sequences (BIS), and applying short term Fourier transform (STFT) on the binary indicator sequences, each spectrum comprising corresponding frequencies (k) and Fourier coefficients (Usk_X(k)), where each kind of Fourier coefficient constitutes a channel (X), defining a search template (ST) in a frequency interval (K_i) with intervals for the Fourier coefficients (Usk_X(k)) with respect to one or more channels (X), applying the search template (ST) on at least a portion of the plurality of spectra, and marking the portion of the plurality of spectra according to a degree of similarity with the search template (ST).