WO2008129459A2

WO2008129459A2 - A method for visualizing a DNA sequence

Info

Publication number: WO2008129459A2
Application number: PCT/IB2008/051434
Authority: WO
Inventors: Evan E. Santo; Nevenka Dimitrova
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2007-04-18
Filing date: 2008-04-15
Publication date: 2008-10-30
Also published as: WO2008129459A3

Abstract

The present invention relates to a method for visualizing a DNA sequence (10) (or RNA or amino acid sequence). The method involves providing a DNA sequence (10), creating a plurality of frequency domain spectra (20) based on the DNA sequence, providing a database (DB) with corresponding DNA features (100c) and genomic annotation information (ANN), associating a portion of the DNA sequence (10) with a DNA feature (100c) from the database, and displaying the plurality of spectra (20) in combination with the genomic annotation information (ANN) associated with each spectrum. The invention is advantageous for obtaining an improved visualization method for spectral DNA analysis. Moreover, genomic features such as known repeats and CpG islands, transcribed regions such as genes and microRNAs, and epigenomic features such as DNA methylation, histone modifications, and any other feature that can be assigned genomic coordinates can be annotated and displayed according to the invention.

Description

A method for visualizing a DNA sequence

FIELD OF THE INVENTION

The present invention relates to a method for visualizing a DNA sequence (or RNA or amino acid sequence), a corresponding computer program, a corresponding database, a corresponding system, and a corresponding signal.

BACKGROUND OF THE INVENTION

Sequence composition and harmonic properties are of emerging importance in the study of gene regulation. Islands of CpG dinucleotides are tightly associated with active sites of gene transcription and also act as arbiters of gene repression via their DNA methylation status. In the field of genomic repeats, functionally important ones range from disease-causing triplet glutamine and alanine codon repeats to much larger repeat classes and transposable elements implicated in epigenetic and RNAi-mediated silencing. Therefore, as biological processes (especially epigenetic ones) continue to be explored, it will be increasingly important to have a tool that can systematically and comprehensively evaluate composition and harmonic properties of associated genomic sequences.

Spectral analysis is based on considering the occurrences of each nucleotide base in a DNA sequence as individual digital signals, cf. Benson et al. in Nucleic Acid Res. 18 (21), p. 6305-6310, and 18 (10), 3001-3006, 1990, for earlier references on this topic. Each of these four signals, i.e. A, T, C and G, is transformed into frequency space (k). The magnitude of a frequency component then reveals how strongly a certain pattern of the nucleotide base is repeated at that frequency. A larger value usually indicates a stronger presence of the repetition.

To improve the readability of the results, each nucleotide base is represented by a color and the frequency spectra of the four bases are combined and presented as a color spectrogram, cf. Anastassiou, "Frequency-domain analysis of biomolecular sequences," Bioinformatics, vol. 16, pp. 1073 - 1081, Dec. 2000. This results in a very powerful visualization tool for DNA analysis. The resultant pixel color thus shows the relative intensity of the four bases at a particular frequency. The representation of a DNA sequence as a color image allows patterns to be easily identified by visual inspection and automatic visual content analysis. An important advantage of performing DNA analysis in the spectral domain is that the N²-scaling of conventional sequence-to-sequence matching is avoided, N being the number of nucleotide bases in the sequence. US 6,287,773 discloses e.g. a frequency domain based comparison method which scales as N log (N), which may very significantly lower the computational time for long sequences, e.g. longer than 10,000 nucleotide bases.

However, a disadvantage with current visualizing methods for spectral analysis is the lack of genomic context. Thus, no information is provided as to what features were contained within those sequences resulting in the color spectrograms. Moreover, despite efforts up till now, a need still remains for systems and methods that facilitate expeditious visualization of genomic information. Also there remains a need for tools that can identify structurally or compositionally similar patterns that exhibit similar spectral properties. Such tools are to be contrasted with conventional sequence alignment tools that seek to align sequences in linear order or by nucleotide appearance.

Hence, an improved visualizing method would be advantageous, and in particular a more efficient and/or reliable method would be advantageous.

SUMMARY OF THE INVENTION

Accordingly, the invention preferably seeks to mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination. In particular, it may be seen as an object of the present invention to provide a method that solves the above mentioned problems of the prior art with visualizing spectral DNA data.

This object and several other objects are obtained in a first aspect of the invention by providing a method for visualizing a DNA sequence, the method comprising:
- providing a DNA sequence,
- creating a plurality of frequency domain spectra based on the DNA sequence,
- providing a database (DB) with corresponding DNA features and genomic annotation information (ANN),
- associating a portion of the DNA sequence with a DNA feature from the database, and
- displaying the plurality of spectra in combination with the genomic annotation information (ANN) associated to each spectrum.

The invention is particularly, but not exclusively, advantageous for obtaining an improved visualization method for spectral DNA analysis. Moreover, genomic features such as known repeats and CpG islands, transcribed regions such as genes and microRNAs and epigenomic features such as DNA methylation, histone modifications and any other "feature" that can be assigned genomic coordinates can be annotated and displayed according to the invention. The invention greatly assists in identifying novel patterns within known classes of features. With its ability to so easily find known repetitive patterns that have relatively low periods the invention is especially suited for finding un-annotated, imperfect and higher-period patterns that are not identifiable by tools hitherto known in the art. This will be further explained in detail below.

It is to be understood that the database to be applied in the context of the present invention can have various different structures and various principles of operation. Thus, the database can be located on a single server unit, e.g. a suitable computer, or the database can have a distributed structure with data located in different physical locations, the locations being interconnected through suitable computer networks, e.g. the internet. Similarly, the database can pull information on demand from various interconnected sources, possibly other independent databases capable of cooperating with the database according to the present invention. The database can for instance be implemented in a flat-file, in a relational form or in object-oriented form.

The application of the present invention is not limited to applications in connection with analysis of DNA sequences, but could also be applied to similar sequences of relevance within biochemistry, e.g. RNA sequences. In this respect, the frequency based spectra are obtainable from binary indicator sequences from the channels of A, U, C and G. It is also conceivable to use a binary indicator representation for the amino acids (20 of them) and then apply STFT to convert the BIS sequences to Fourier domain space. One can then concatenate the results to make one long vector consisting of all the 20 binary indicator sequences. The rest of the procedure for working the invention would be the same. Here is a list of the amino acids:

alanine - ala - A
arginine - arg - R
asparagine - asn - N
aspartic acid - asp - D
cysteine - cys - C
glutamine - gln - Q
glutamic acid - glu - E
glycine - gly - G
histidine - his - H
isoleucine - ile - I
leucine - leu - L
lysine - lys - K
methionine - met - M
phenylalanine - phe - F
proline - pro - P
serine - ser - S
threonine - thr - T
tryptophan - trp - W
tyrosine - tyr - Y
valine - val - V

The 20 different amino acids can be mapped to 20 different colors in Red-Green-Blue (RGB) space (or Hue-Saturation-Value, HSV, space). Either one of these spaces can be quantized into 20 colors, one for each of the amino acids. Thus, the teaching of the present invention is not limited to DNA analysis, but may be extended to RNA and/or amino acid analysis with the relevant modifications readily recognized by a skilled person in this field.
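As a minimal sketch of such a quantization, the following assigns each amino acid one of 20 evenly spaced hues in HSV space and converts it to RGB. The even hue spacing, the ordering of the codes, and the function name `amino_acid_palette` are illustrative assumptions; the text only requires one distinct color per amino acid.

```python
import colorsys

# One-letter codes for the 20 amino acids, in the order of the list above.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def amino_acid_palette(saturation=1.0, value=1.0):
    """Quantize the HSV hue circle into 20 evenly spaced colors.

    Returns a dict mapping each one-letter code to an (r, g, b) tuple
    with components in [0, 1]. Even spacing is an assumption.
    """
    palette = {}
    for index, code in enumerate(AMINO_ACIDS):
        hue = index / len(AMINO_ACIDS)
        palette[code] = colorsys.hsv_to_rgb(hue, saturation, value)
    return palette


palette = amino_acid_palette()
```

Other quantizations (for example grouping chemically similar amino acids into neighboring hues) would fit the same description equally well.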

Beneficially, the DNA sequence may represent a genome, a chromosome or a portion thereof.

The creating of the plurality of spectra may comprise:
- converting the DNA sequence into a binary indicator sequence (BIS), and
- applying short term Fourier transform (STFT) to the said binary indicator sequence (BIS) resulting in a frequency domain vector.

In a particular embodiment, the database with corresponding DNA features and genomic annotation information may comprise fields for at least one of the following:
- the classification (A) of a genomic feature,
- the type (B) of a genomic feature,
- the tissue type or cell type (C) of a genomic feature,
- the orientation (D) of a genomic feature in a genomic sequence, and
- the coordinates (E) of a genomic feature.

Using these fields for spectral DNA analysis, the inventors have obtained quite useful annotated spectra in a quick and comprehensive manner hitherto not possible. Advantageously, the classification (A) may comprise indication of whether the genomic feature is genomic (G), epigenomic (E), transcriptional (T) or proteomic (P). It should be noted that such indications have hitherto not been in one and the same database, whereby the present invention provides a significant advance in spectral DNA analysis. Beneficially, the type (B) may comprise indication of whether the genomic feature is at least one type from the types of: Gene, methylated DNA, an Exon, an Intron, a 3pUTR, a 5pUTR, and Repeat. Other types of genomic features may be provided for implementation in connection with the type (B) field for the present invention.

Advantageously, the tissue type or cell type (C) may comprise indication of whether the genomic feature is specific to any cell line or tissue type, e.g. brain tissue etc.

Beneficially, the orientation (D) may comprise indication of whether the orientation of the genomic feature is five prime (5') to three prime (3') or opposite. All DNA features can be maintained in a sorted order. Sorting in ascending order by the start coordinate for each feature is particularly beneficial because then the coordinates are always increasing in the 5-prime to 3-prime direction. This sorting greatly speeds up feature retrieval upon generation of spectra.

Advantageously, the coordinates (E) may comprise indication of at least a start position and an end position for a genomic feature to facilitate fast retrieval.

The database may be applicable for a specific chromosome, a specific mitochondrial DNA sequence, or an equivalent DNA sequence. Thus, each chromosome can have its own database or table, and accordingly the invention can typically be realized with several databases at the same time.

To speed up processing, at least a substantial part of the DNA features in the database (DB) can be sorted in ascending order of start coordinate with respect to the orientation (D). Usually, the direction can be 5' -> 3', but computationally the opposite direction can be just as fast, the effect being faster feature retrieval upon displaying.
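The benefit of keeping features sorted by start coordinate can be sketched with a binary search: every feature starting beyond the end of a window can be skipped without inspection. This is only an illustration under stated assumptions. The tuple layout, the example features, and the function name `features_overlapping` are hypothetical and not part of the described database.

```python
from bisect import bisect_right

# Hypothetical (start, end, name) features, kept sorted by start coordinate.
features = [(100, 250, "repeat_1"), (300, 900, "gene_1"), (850, 1200, "cpg_1")]
starts = [feature[0] for feature in features]


def features_overlapping(window_start, window_end):
    """Return names of features overlapping [window_start, window_end]."""
    # Binary search: features at or after index `stop` start past the window.
    stop = bisect_right(starts, window_end)
    return [name for start, end, name in features[:stop] if end >= window_start]


hits = features_overlapping(800, 860)
```

The remaining filter on the end coordinate is still linear in the number of candidate features; an interval tree would avoid that, at the cost of a more complex structure.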

Typically, the displaying of the plurality of spectra may comprise mapping into a color space, e.g. RGB or HSV (hue-saturation-value).

Beneficially, the period of short term Fourier transform (STFT) may define the length of a shifting window (W) along the DNA sequence, the spectra being displayed in the same order as the shifting window (W) is positioned along the DNA sequence. This is a so-called "linear" display, which can be helpful in particular for initial analysis.

Advantageously, the genomic annotation information may be arranged substantially adjacent to the corresponding spectrum to ease the viewing of the annotations, but other alternatives such as mouse roll-over displaying etc. are also possible within the teaching of the present invention.

A clustering process may be performed on the plurality of spectra before displaying the plurality of spectra. The clustering can be hierarchical, k-means, self-organizing maps, etc. The clustering can be unsupervised or supervised.

In a second aspect, the invention relates to a computer program product being adapted to enable a computer system comprising at least one computer having data storage means associated therewith to implement a method according to the first aspect of the invention. This aspect of the invention is particularly, but not exclusively, advantageous in that the present invention may be implemented by a computer program product enabling a computer system to perform the operations of the first aspect of the invention. Thus, it is contemplated that some known processor, e.g. a computer, may be changed to operate according to the present invention by installing a computer program product on a computer system controlling the said processor. Such a computer program product may be provided on any kind of computer readable medium, e.g. magnetically or optically based medium, or through a computer based network, e.g. the Internet.

In a third aspect, the invention relates to a database with corresponding DNA features and genomic annotation information comprising fields for at least:
- the classification (A) of a genomic feature,
- the type (B) of a genomic feature,
- the tissue type or cell type (C) of a genomic feature,
- the orientation (D) of a genomic feature in genomic sequence, and
- the coordinates (E) of a genomic feature,
wherein the classification (A) comprises indication of whether the genomic feature is genomic (G), epigenomic (E) or transcriptional (T) or proteomic (P), the database (DB) being adapted for associating a portion of the DNA sequence with a DNA feature from the database, the genomic annotation information (ANN) being adapted for displaying with a frequency based spectrum from the portion of the DNA sequence.

The database may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different interconnected units and sub-databases.

In a fourth aspect, the invention relates to a processor for implementing one or more parts of a method for visualizing a DNA sequence, the method comprising:
- providing a DNA sequence,
- creating a plurality of frequency domain spectra based on the DNA sequence,
- providing a database (DB) with corresponding DNA features and genomic annotation information (ANN),
- associating a portion of the DNA sequence with a DNA feature from the database, and
- displaying the plurality of spectra in combination with the genomic annotation information (ANN) associated to each spectrum.

Thus, the processor may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different interconnected units and sub-processors. In particular, the interconnected units can implement a parallel processing algorithm.

In a fifth aspect, the invention relates to a signal for implementing one or more parts of a method for visualizing a DNA sequence, the method comprising:
- providing a DNA sequence,
- creating a plurality of frequency domain spectra based on the DNA sequence,
- providing a database (DB) with corresponding DNA features and genomic annotation information (ANN),
- associating a portion of the DNA sequence with a DNA feature from the database, and
- displaying the plurality of spectra in combination with the genomic annotation information (ANN) associated to each spectrum.

The signal can be any kind of signal, wireless or wire-mediated. The signal can comprise intermediate or preliminary results for implementing the present invention, and/or the signal can comprise results arranged for displaying according to the present invention upon reception at a receiving end.

The first, second, third, fourth, and fifth aspect of the present invention may each be combined with any of the other aspects. These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be explained, by way of example only, with reference to the accompanying Figures, where

Figure 1 is an exemplary binary sequence (BIS) pattern,

Figure 2 is a plot of the corresponding BIS pattern from Figure 1 of the four nucleotide bases A, T, C, and G,

Figure 3 is the converted frequency spectrum of each base,

Figure 4 is similar to Figure 3 and it is indicated to the right that a superposition of the color mapping vectors weighted by the magnitude of the frequency component of the respective nucleotide base is obtained,

Figure 5 schematically illustrates the generation of a single, colored spectrum from short time Fourier transformation (STFT) of a part of a DNA sequence,

Figure 6 is similar to Figure 5, and illustrates the generation of a plurality of spectra by repeated STFT along the DNA sequence,

Figure 7 is a schematic flow chart of the method according to the present invention,

Figure 8 is a schematic illustration of a database structure according to the present invention,

Figure 9 is a more detailed illustration of a database structure according to the present invention,

Figure 10a and 10b show spectrograms for illustrating a clustering process according to the present invention,

Figure 11 is an example of an annotated spectrogram according to the invention, and

Figure 12 is a flow chart of a method according to the invention.

DETAILED DESCRIPTION OF AN EMBODIMENT

DNA spectrograms can be generated in a conventional manner, as described in greater detail herein below with reference to Figures 1-6. For example, a conventional algorithm or technique for the generation of DNA spectrograms may be employed that entails the following five steps:

(i) Formation of binary indicator sequences (BISs) u_A[n], u_T[n], u_C[n] and u_G[n] for the four nucleotide bases. An exemplary BIS pattern is reproduced in Figure 1 generated from a DNA sequence 10 and a plot of the BIS values is presented in Figure 2.

(ii) Discrete Fourier Transform (DFT) on BISs. The frequency spectrum of each base is obtained by computing the DFT of its corresponding BIS using Equation (1):

$$U_X[k] = \sum_{n=0}^{N-1} u_X[n]\, e^{-j \frac{2\pi}{N} k n}, \quad k = 0, 1, \ldots, \lfloor N/2 \rfloor + 1, \quad X = A, T, C \text{ or } G \qquad (1)$$

As illustrated in Figure 3, the sequence U[k] provides a measure of the frequency content at frequency k, which is equivalent to an underlying period of N/k samples. N is the total number of nucleotide bases in the window W, cf. Figures 5 and 6. The number of bases can be maximum 300 nucleotide bases, preferably maximum 500 nucleotide bases, or even more preferably 700 nucleotide bases. Alternatively, the period can be maximum 3000 nucleotide bases, preferably maximum 5000 nucleotide bases, or even more preferably maximum 10000 nucleotide bases.

(iii) Mapping of DFT values to RGB colors. The four DFT sequences are reduced to three sequences in the RGB space by a set of linear equations which are reproduced below:

$$
\begin{aligned}
X_r[k] &= a_r U_A[k] + t_r U_T[k] + c_r U_C[k] + g_r U_G[k] \\
X_g[k] &= a_g U_A[k] + t_g U_T[k] + c_g U_C[k] + g_g U_G[k] \\
X_b[k] &= a_b U_A[k] + t_b U_T[k] + c_b U_C[k] + g_b U_G[k]
\end{aligned}
\qquad (2)
$$

where (a_r, a_g, a_b), (t_r, t_g, t_b), (c_r, c_g, c_b) and (g_r, g_g, g_b) are the color mapping vectors for the nucleotide bases A, T, C and G, respectively. The resultant pixel color (X_r[k], X_g[k], X_b[k]) is thus a superposition of the color mapping vectors weighted by the magnitude of the frequency component of their respective nucleotide base as indicated in the right side of Figure 4. Mapping of DFT values to colors is illustrated in Figure 5 for a single spectrum 20, and in Figure 6 for several spectra 20, i.e. a spectrogram 30. Both Figures 5 and 6 are reproduced in grey-tones here for illustrative purposes. Other color space mappings of the frequency domain based U values are also possible, e.g. to HSV space.

(iv) Normalizing the pixel values. Before rendering the colored spectrogram 30, the RGB values of each pixel are generally normalized so as to fall between 0 and 1. Numerous normalization procedures are readily available to the skilled person once the general principle of the invention is recognized.

(v) Short-time Fourier Transform (STFT). A plurality of DNA spectra 20 i.e. a spectrogram 30 is formed by a concatenation of individual DNA sequence spectra 20 ("strips"), where each strip or spectrum generally depicts the frequency spectrum of a local DNA segment as shown in Figure 6. The short term Fourier transform (STFT) has a window W that is shifted along the DNA sequence from 5' to 3' as shown in Figure 6.
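Steps (i) to (v) above can be sketched end to end in a few lines of plain Python. This is a minimal illustration, not the implementation used by the inventors. The color mapping vectors, the normalization by the global maximum, the direct O(N²) summation instead of a fast Fourier transform, and all function names are assumptions made for the example.

```python
import cmath

# Hypothetical color mapping vectors (r, g, b); the text leaves their values open.
COLOR_VECTORS = {"A": (0.0, 0.0, 1.0), "T": (1.0, 0.0, 0.0),
                 "C": (0.0, 1.0, 0.0), "G": (1.0, 1.0, 0.0)}


def binary_indicator_sequences(segment):
    """Step (i): one 0/1 sequence per base, marking where that base occurs."""
    return {b: [1 if s == b else 0 for s in segment.upper()] for b in "ATCG"}


def dft_magnitudes(u):
    """Step (ii): |U[k]| for k = 0 .. floor(N/2) + 1, by direct summation."""
    n_total = len(u)
    return [
        abs(sum(u[n] * cmath.exp(-2j * cmath.pi * k * n / n_total)
                for n in range(n_total)))
        for k in range(n_total // 2 + 2)
    ]


def spectrum_rgb(segment):
    """Step (iii): superpose the color vectors weighted by the DFT magnitudes."""
    bis = binary_indicator_sequences(segment)
    mags = {b: dft_magnitudes(bis[b]) for b in "ATCG"}
    return [
        tuple(sum(COLOR_VECTORS[b][c] * mags[b][k] for b in "ATCG")
              for c in range(3))
        for k in range(len(mags["A"]))
    ]


def normalize(strips):
    """Step (iv): divide by the global maximum (one of many possible choices)."""
    peak = max((v for strip in strips for pixel in strip for v in pixel),
               default=0.0)
    if peak == 0:
        return strips
    return [[tuple(v / peak for v in pixel) for pixel in strip]
            for strip in strips]


def spectrogram(sequence, window_size, step=1):
    """Step (v): one normalized strip per window position, moving 5' to 3'."""
    strips = [spectrum_rgb(sequence[i:i + window_size])
              for i in range(0, len(sequence) - window_size + 1, step)]
    return normalize(strips)


# A period-3 repeat: with N = 12 the peak appears at k = N / 3 = 4.
strips = spectrogram("ACGACGACGACGACGACG", window_size=12)
```

Each element of `strips` is one spectrum (a list of RGB pixels indexed by frequency k), and the list as a whole corresponds to the spectrogram. For windows of realistic size, a fast Fourier transform would replace the direct summation.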

The spectrogram shown in Figure 6 has a length of 60 nucleotide bases and the window W is shifted one base at a time. On the vertical scale in the spectrogram 30, the frequency k is shown (increasing downwards), whereas the start position P_ini on the DNA sequence 10 is shown on the horizontal scale in the spectrogram 30.

The appearance of a spectrogram 30 is very much affected by the choice of the STFT window W size, the length of the overlapping sequence between adjacent windows W, and the color mapping vectors, cf. Equation (2). The window size determines the effective range of a pixel value in a spectrogram 30. A larger window results in a spectrogram that reveals statistics collected from longer DNA segments. In general, the window W size should be made several times larger than the length of the repetitive pattern of interest and smaller than the size of the region that contains the pattern of interest. For exploratory purposes, it is recommended to try a range of window sizes. The window overlap determines the length of the DNA segment common to two adjacent STFT windows. Therefore, the larger the overlap, the more gradual is the transition of the frequency spectrum from one STFT window to the next. A higher image resolution makes it easier to extract features by image processing or visual inspection. Clustering of the spectrogram 30 and rendering of the spectrogram 30 as spectrovideo will be described in more detail below.
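The relation between window size, overlap and the resulting window positions can be sketched as follows. The function name `stft_windows` and the 1-based start positions are assumptions chosen for illustration; an overlap of `window_size - 1` corresponds to shifting the window one base at a time.

```python
def stft_windows(sequence, window_size, overlap):
    """Yield (start_position, segment) for each STFT window, 5' to 3'.

    start_position is 1-based. Adjacent windows share `overlap` bases,
    so the window advances by window_size - overlap bases per step.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    for offset in range(0, len(sequence) - window_size + 1, step):
        yield offset + 1, sequence[offset:offset + window_size]


windows = list(stft_windows("ACGTACGTAC", window_size=4, overlap=2))
```

A larger overlap yields more windows for the same sequence, which is what makes the transition between adjacent spectra more gradual.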

Figure 7 is a schematic flow chart of the method according to the present invention. Initially, there should be provided a DNA sequence 10 from which there is created a plurality of frequency domain spectra 20 based on the DNA sequence 10, each spectrum being created from a portion 10' of the DNA sequence 10.

A database DB is provided with schemes of corresponding DNA features and genomic annotation information ANN. The database DB is designed to allow for easy incorporation of publicly available, proprietary and other 3rd party genomic, transcriptional, proteomic, and epigenomic features. Available features for incorporating into the database DB can be gathered from sources such as UCSC Genome Bioinformatics (http://genome.ucsc.edu/), EMBL (http://www.ebi.ac.uk) and GenBank, but are not limited to any of these sources. A portion of the DNA sequence 10' is associated with a DNA feature from the database DB in order to obtain an annotation ANN of the DNA sequence 10'. The sequence 10' will typically have more than one associated annotation, cf. below.

Additionally, the plurality of spectra 20 are displayed in combination with the genomic annotation information ANN associated to each spectrum 20, cf. Figure 11 below. The invention makes it possible to simultaneously identify patterns that may be associated with existing features, and those patterns that are not currently associated with any known feature, which is clearly important for discovery within DNA analysis. Furthermore, the present invention enables statistical testing of the spectral patterns relative to the features they may represent. This may in particular provide speed in DNA analysis for large multiple sequence comparisons in spectral-space. As window W sizes get larger, this approach becomes faster.
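The association of each spectrum with its annotations can be sketched as an interval-overlap test between the window that produced the spectrum and the feature coordinates. The dictionary keys, the example features and the "unannotated" label are hypothetical; the label simply marks windows that match no known feature, which are the candidates for discovery mentioned above.

```python
def annotate_windows(window_starts, window_size, features):
    """Pair each window start with the types of all overlapping features.

    `features` is a list of dicts with the hypothetical keys 'type',
    'scoord' and 'ecoord' (inclusive, 1-based coordinates).
    """
    rows = []
    for start in window_starts:
        end = start + window_size - 1
        labels = [f["type"] for f in features
                  if f["scoord"] <= end and f["ecoord"] >= start]
        rows.append((start, labels or ["unannotated"]))
    return rows


example_features = [
    {"type": "Gene", "scoord": 5, "ecoord": 20},
    {"type": "Repeat", "scoord": 18, "ecoord": 30},
]
rows = annotate_windows([1, 11, 31], window_size=10, features=example_features)
```

Each row could then be rendered next to the corresponding spectrum, so that a window may carry several annotations at once.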

Notice that the opposite is true for multiple sequence alignment in sequence-space, such as with the ClustalW algorithm, cf. J. D. Thompson, D. G. Higgins, T. J. Gibson, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Res., vol. 22 pp. 4673 - 4680, Nov. 1994. Therefore, spectral analysis may have the ability to make sequence comparisons not possible by any available sequence-space means. It also provides information about arbitrarily large, possibly imperfect harmonic patterns in addition to sequence composition.

Figure 8 is a schematic illustration of an embodiment of a database structure according to the present invention. In Figure 8, it is symbolically shown how a chromosome 100b is obtained from an organism 100a. From the chromosome 100b, or any other portion of DNA sequence, a characteristic feature 100c is analyzed by the database DB. In this embodiment the database DB has corresponding DNA features 100c and genomic annotation information ANN comprising fields for at least one of the following: the classification (A) of a genomic feature, the type (B) of a genomic feature, the tissue type or cell type (C) of a genomic feature, the orientation (D) of a genomic feature in a genomic sequence, and the coordinates (E) of a genomic feature.

These fields can be considered to define a so-called primary key of the database DB. However, it is of course possible to introduce additional fields relevant to making annotation ANN of a DNA feature 100c (not shown in Figure 8). Likewise, it is possible to use fewer fields than the fields listed above. Thus, a smaller database could possibly operate only with a subset of the above-mentioned fields, e.g. fields B, C and E, or fields A, B, and E, etc. Additionally, all of the fields above could be used.

Figure 9 is a more detailed embodiment of a database structure according to the present invention. Below the embodiment is discussed in connection with each field also listed in Figure 8.

A: The 'Class' Field

Currently, a DNA feature 100c is considered to belong to one of four possible categories, Genomic (G), Transcriptional (T), Epigenomic (E) and Proteomic (P), although the invention is not limited to these four categories for the class field. In the database DB, they are designed to be the broadest classifications.

Genomic (G): The feature is related to some physical sequence feature of the genome. Examples are genomic repeats such as LINEs (long interspersed nuclear elements), SINEs (short interspersed nuclear elements), LTRs (long terminal repeats) or other regions defined by their actual sequence composition.

Transcriptional (T): The genomic feature is known to be transcribed into RNA by the biological process known as transcription. Genomic regions that give rise to expressed genes, small RNAs and RNA in any other form fall into this classification.

Proteomic (P): The genomic feature is known to be translated from RNA to protein by the biological process known as translation. Protein coding sequences which are translated into amino acid sequences (every codon is a specified triplet of nucleotides) fall into this classification.

Epigenomic (E): The feature described by the given coordinates contains features that may or may not have anything to do with the actual sequence in the described region. This may include, but is not limited to, regions where the CpG dinucleotides of the DNA sequence are modified by the addition of a methyl group on the cytosine, histone modifications, the binding of DNA-binding and other proteins, regions of inter-chromosomal contact or any other feature that can be described in terms of the genomic coordinate system.

Although not explicitly enforced, the class to which a feature belongs may influence the values assigned in the other fields that compose the primary key. For instance, it is widely held that a genomic feature will occur within every instance of the genome of an organism. This implies in every cell of every tissue. Therefore, genomic features are traditionally assigned the value "Mixed" in the Tissue/Cell Type field to denote their ubiquitous nature. Similarly, something classified as an epigenomic feature may be independent of the polarity of the DNA, whereas a transcriptional feature usually is not. Therefore, in the Orientation field (D) a value of "NO" may be assigned to denote that this feature is always the same, whether the sequence is being read in the 5'->3' or the 3'->5' orientation.
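These class-dependent conventions could be captured as optional defaults, as in the following sketch. The function name and the dictionary keys `tissue` and `orientation` are hypothetical, and the defaults are conventions rather than enforced rules, so a caller remains free to override them.

```python
def conventional_defaults(feature_class):
    """Return key values conventionally implied by a feature's class.

    Only the two conventions mentioned in the text are encoded; classes
    without a stated convention yield an empty dict.
    """
    defaults = {
        "G": {"tissue": "Mixed"},    # genomic features occur in every cell
        "E": {"orientation": "NO"},  # may be independent of DNA polarity
    }
    # Copy so that callers cannot mutate the shared table.
    return dict(defaults.get(feature_class, {}))


genomic_defaults = conventional_defaults("G")
```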

B: The 'Type' Field

This field is used to associate each feature with a more descriptive value than those found in the 'Class' field (A). For instance, consider features 100c where the status of DNA methylation was assessed. Two descriptions of the assessment are possible, either unmethylated or methylated. However, in the present context the string DNAMethStatus will be used as the value in this 'Type' field (B) to describe what was being assessed, not the outcome. The outcome may be specified in fields that are not part of the primary key, such as DisplayName or Description. Here the strings methylated or unmethylated may be assigned. Alternatively, a new field may be created to support this feature 'Type'. Other examples of 'Type' values are Gene, Exon, Intron, 3pUTR (3 prime untranslated region), 5pUTR (5 prime untranslated region), Repeat-LINE etc. Upon each build of the database DB, either a metadata field can be populated with all available types or a flat-file describing this data can be produced.

C: The 'Tissue/Cell Type' Field

Here it can be specified whether or not a feature 100c is specific to any cell line or tissue type.

This may include identifiers given to "disease" or immortal cell lines such as "HeLa" or clinical or non-clinical tissue names such as Brain or something more specific, say medulla oblongata.

It would be expected that if a feature exists in multiple tissue types, that this feature is defined as a corresponding number of independent features - since tissue/cell type is part of the primary key. As with the 'Type' field (B), a list or metadata field can be populated with all available tissue/cell types in the database DB.

D: The 'Orientation' Field

This field describes in which direction the feature 100c is oriented in the genomic sequence 10, the directions being either 5'->3' (symbolized by '+') or 3'->5' (symbolized by '-'). The values in this field usually apply to features that are in the class 'Transcriptional'. However, even here there is the possibility of a feature being bi-directionally transcribed, making the value in this field "NO". NO is a non-standard abbreviation for "no orientation." Almost all features in the classes 'Genomic' and 'Epigenomic' will have a value of "NO" in this field.

E: The 'Coordinates' Field

This field describes where a feature 100c starts and ends within the larger sequence (chromosome or otherwise) within which this feature resides. The 'SCoord' or start coordinate of a feature is always assumed to be less than or equal to the 'ECoord' or end coordinate. The 'SCoord' and 'ECoord' may not be less than 1, or greater than the end coordinate of the larger sequence within which the feature resides.
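A record carrying the fields A to E, together with the coordinate constraints just stated, might be sketched as follows. The class name, attribute names and the `is_valid` helper are hypothetical; only the field meanings and the constraints (1 <= SCoord <= ECoord <= length of the enclosing sequence) are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRecord:
    feature_class: str  # field A: 'G', 'E', 'T' or 'P'
    feature_type: str   # field B: e.g. 'Gene', 'Exon', 'DNAMethStatus'
    tissue: str         # field C: e.g. 'Mixed', 'HeLa', 'Brain'
    orientation: str    # field D: '+', '-' or 'NO'
    scoord: int         # field E: start coordinate
    ecoord: int         # field E: end coordinate

    def is_valid(self, sequence_length):
        """Check the class, orientation and coordinate constraints."""
        return (
            self.feature_class in {"G", "E", "T", "P"}
            and self.orientation in {"+", "-", "NO"}
            and 1 <= self.scoord <= self.ecoord <= sequence_length
        )

    def primary_key(self):
        """All fields together act as the compound primary key."""
        return (self.feature_class, self.feature_type, self.tissue,
                self.orientation, self.scoord, self.ecoord)


gene = FeatureRecord("T", "Gene", "Brain", "+", 990, 4500)
```

Because the dataclass is frozen, records are hashable and two records with identical key values compare equal, which matches the role of these fields as a key.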

Coordinates may overlap. For example, one DNA feature could have a range of 1024-5000 and another 990-4500; all kinds of overlap are possible. Two DNA features may even have exactly the same start and end coordinates, since the coordinates are only one of six components of the compound primary key. In the primary key of the database DB, the values in the 'SCoord' and 'ECoord' fields can be concatenated with a '-' to make a single string such as '990-4500'.

Viewing large amounts of sequence data requires an efficient method for information analysis and visualization. In order to optimize the viewing of spectra derived from very large sequences or spectra containing many small windows, the spectra can be rendered as a video as shown by the present inventor; N. Dimitrova, et al. "Analysis and visualization of DNA spectrograms: open possibilities for genome research," in ACMMM., Santa Barbara, CA, Oct. 2006, which is hereby incorporated by reference in its entirety.

Hierarchical clustering can also be applied to group similar spectral windows. This allows for much more efficient viewing and interpretation of the output than looking at numerous images in which the frequency and position axes are highly condensed. The hierarchical clustering of the spectra may be carried out in two ways, using either 1) Fourier coefficients of the binary indicator sequences or 2) DNA spectra in RGB space (or another suitable color space). Generally it is preferred to perform clustering on the Fourier-space data because all four dimensions are represented equally, whereas conversion to RGB space inevitably causes data loss. A possible clustering method will now be described in six steps, consecutively numbered with Roman numerals:

(I) Obtain Discrete Fourier Transform values for each of the four channels (A, T, C & G). This is accomplished as described in steps (i) and (ii) above in connection with generating DNA spectrograms.

(II) Concatenate the vectors containing the DFT values obtained for each channel. Prior to concatenation, the DFT vectors are transposed such that window position is on the y-axis and frequency is on the x-axis. The four DFT vectors for each window are subsequently horizontally concatenated in the order U_A[k], U_T[k], U_C[k], U_G[k], yielding a single vector representation for the original window. This is done for all windows generated from the input sequence.
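Steps (I) and (II) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used by the inventors: the function names are hypothetical, a direct (non-fast) DFT is used for brevity, and the magnitude of each DFT value is stored, whereas the text only speaks of "DFT values".

```python
import cmath


def indicator(window, base):
    """Binary indicator sequence: 1 where the window holds `base`, else 0."""
    return [1 if ch == base else 0 for ch in window]


def dft(x):
    """Direct discrete Fourier transform of a real sequence (O(n^2))."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def window_vector(window):
    """Concatenate the per-channel DFT magnitudes in the order A, T, C, G."""
    vec = []
    for base in "ATCG":
        # Assumption: magnitudes are kept; complex values could be kept instead.
        vec.extend(abs(c) for c in dft(indicator(window, base)))
    return vec


def feature_matrix(seq, win_size, overlap):
    """One row per window (y-axis), frequency along each row (x-axis)."""
    step = win_size - overlap
    return [
        window_vector(seq[start:start + win_size])
        for start in range(0, len(seq) - win_size + 1, step)
    ]
```

Each row of the resulting matrix has four times the window length, one block per channel, and is the vector that enters the distance computation of step (III).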

(III) Compute the distance matrix. Every vector is compared to every other vector exactly once and a distance metric is applied. Euclidean distance can be applied, but other distance metrics may perform equally well, and some may have inherent advantages over others. The choice of distance metric is still being evaluated. The total number of comparisons performed will be approximately N² / 2, where N is the number of windows. For example, a run performed with one contiguous sequence of size 46,944,323 bases (human chromosome 21) using a window size of 1500 and an overlap of 300 will yield 39,120 windows according to the formula (3):

N = [SeqLength / (WinSize - WinOverlap)] - 1 (3)
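As a worked check of formula (3), the following sketch evaluates the chromosome 21 example. It assumes that the square brackets denote rounding up, which reproduces the stated 39,120 windows; the function name is hypothetical. For comparison it also computes the exact number of unordered pairs, N(N - 1)/2, and N²/2 evaluated on the unrounded quotient, which reproduces the comparison count of 765,197,729 quoted for this example.

```python
def window_count(seq_length, win_size, win_overlap):
    """Formula (3), assuming the brackets mean rounding up to an integer."""
    step = win_size - win_overlap
    # -(-a // b) is ceiling division for positive integers.
    return -(-seq_length // step) - 1


SEQ_LENGTH = 46_944_323  # human chromosome 21, as stated in the text
WIN_SIZE = 1500
WIN_OVERLAP = 300

n_windows = window_count(SEQ_LENGTH, WIN_SIZE, WIN_OVERLAP)

# Exact number of unordered pairs when every window is compared once
# with every other window.
exact_pairs = n_windows * (n_windows - 1) // 2

# N^2 / 2 evaluated on the unrounded quotient SeqLength / (WinSize - WinOverlap),
# computed with integers and rounded down.
approx_pairs = SEQ_LENGTH ** 2 // (2 * (WIN_SIZE - WIN_OVERLAP) ** 2)

print(n_windows, exact_pairs, approx_pairs)
```

The exact pair count is slightly smaller than the N²/2 approximation; both are of the order of 7.65 × 10^8.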

Even with these modest parameters on the shortest human chromosome, there are 765,197,729 unique comparisons. This is both computationally expensive and memory intensive. Executing in Matlab, RAM usage can approach 16 gigabytes just for construction of this distance matrix. For the sake of scalability, a clear challenge will be to implement a more efficient clustering scheme since it is desirable to perform this operation across whole genomes at higher resolutions (smaller window sizes and greater overlap).

(IV) Build dendrogram. After computing the distance matrix, average linkage can be applied and the dendrogram constructed.

(V) Final image construction. In order to render the new clustered image, the RGB image data associated with each window is retrieved and arranged according to the ordering of the dendrogram. The arrangement for viewing is such that the left-most window in the dendrogram becomes the first window at the top of the y-axis in the new RGB image.
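Steps (III) to (V) can be sketched for a small number of windows as follows. This is a naive illustration with hypothetical function names, not the Matlab implementation referred to above, and it is far too slow for chromosome-scale input. It assumes Euclidean distance, merges clusters by average linkage, and records the left-to-right leaf order of the resulting dendrogram so that the RGB rows can be rearranged accordingly.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_order(vectors):
    """Return window indices in the left-to-right leaf order of the dendrogram."""
    n = len(vectors)
    if n == 0:
        return []
    # Step (III): each unordered pair of windows is compared exactly once.
    dist = {
        (i, j): euclidean(vectors[i], vectors[j])
        for i in range(n) for j in range(i + 1, n)
    }
    # Each cluster keeps its leaves in dendrogram order.
    clusters = {i: [i] for i in range(n)}

    def cluster_distance(a, b):
        # Average linkage: mean distance over all cross-cluster leaf pairs.
        total = sum(
            dist[(min(i, j), max(i, j))]
            for i in clusters[a] for j in clusters[b]
        )
        return total / (len(clusters[a]) * len(clusters[b]))

    # Step (IV): repeatedly merge the two closest clusters.
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        a, b = min(
            ((p, q) for pos, p in enumerate(ids) for q in ids[pos + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return next(iter(clusters.values()))


def reorder_rows(rgb_rows, order):
    """Step (V): the left-most leaf becomes the top row of the new image."""
    return [rgb_rows[i] for i in order]
```

With four windows whose vectors are [0, 0], [10, 10], [0, 1] and [10, 11], the first and third windows are grouped together, as are the second and fourth, and the rows of the image are rearranged in that order.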

Figure 10a depicts an unclustered image containing genomic repeats from chromosome 21; Figure 10b contains the same windows with the clustering applied. From Figure 10, it is apparent that the majority of genomic repeats annotated as "Other" separate well from the defined classes of genomic repeats, SINEs and LINEs. On the familial level of the repeat annotation hierarchy, these "Other" repeats predominantly fall into the familial category SVA. SVA repeats are among the very few active retrotransposon elements in the human genome. They have been found to cause various diseases upon activation by inserting themselves into certain genes, disrupting their function.

(VI) Rendering as a SpectroVideo. To obtain a spectrovideo, the single large RGB image constructed in step (V) can be segmented into frames according to user-specified parameters.
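The segmentation of step (VI) can be sketched as follows. The parameters `frame_height` (rows per frame) and `step` (rows advanced between frames) are hypothetical examples of user-specified parameters; the text does not name them.

```python
def segment_into_frames(rows, frame_height, step=None):
    """Split the rows of one tall image into consecutive video frames.

    rows         -- the image as a list of pixel rows, one row per window
    frame_height -- number of rows shown in each frame (assumed parameter)
    step         -- rows advanced per frame; defaults to frame_height,
                    giving non-overlapping frames (assumed parameter)
    """
    if frame_height < 1:
        raise ValueError("frame_height must be at least 1")
    step = frame_height if step is None else step
    if step < 1:
        raise ValueError("step must be at least 1")
    frames = []
    for start in range(0, len(rows), step):
        frames.append(rows[start:start + frame_height])
        # Stop once a frame has reached the bottom of the image.
        if start + frame_height >= len(rows):
            break
    return frames
```

A step smaller than the frame height gives a scrolling effect, since consecutive frames then share rows.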

Figure 11 is an example of an annotated spectrogram 30 according to the invention. On the right, annotations ANN found from the database DB are displayed. Figure 11 depicts the sharp contrast between the spectral features of one class of repeats, satellites, and another class, LINEs. Since the method so easily finds known repetitive patterns that have relatively low periods, of particular interest are the un-annotated, imperfect and higher-period patterns not identifiable by mainstream tools. Displaying associated annotation ANN at various scales of the window W size may require sorting, prioritizing and/or deleting of annotation. This may be done actively by a user or passively, i.e. automatically according to pre-set conditions and/or dynamic conditions. If for instance the size of the window W is very large (e.g. 10,000 nucleotides) then only a high level of information (organized in a hierarchical fashion) may be shown on the left side of the spectrogram 30.

For practical implementation a user interface can be provided where the annotated spectrogram shown in Figure 11 is shown in combination with e.g. a ClustalW interface or other similar application useful in connection with genetic analysis. Additionally, a user interface applicable in connection with various kinds of clustering is preferably implemented in connection with a user interface for the present invention.

Figure 12 is a flow chart of a method according to the invention. The method comprises:

S1 providing a DNA sequence 10,

S2 creating a plurality of frequency domain spectra 20 based on the DNA sequence,

S3 providing a database DB with corresponding DNA features 100c and genomic annotation information ANN,

S4 associating a portion of the DNA sequence 10' with a DNA feature 100c from the database, and

S5 displaying the plurality of spectra 20 in combination with the genomic annotation information ANN associated to each spectrum.
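The association of a portion of the sequence with DNA features can be sketched as an interval-overlap lookup. This is one possible reading, not a definitive implementation: the `Feature` record, its field names and the 1-based inclusive window coordinates are assumptions for illustration, and a linear scan is used where a real database DB would presumably use an index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature record mirroring fields A to E of the database."""
    feature_class: str   # A: e.g. 'G', 'E', 'T' or 'P'
    feature_type: str    # B: e.g. 'Gene', 'Repeat-LINE'
    tissue: str          # C: tissue or cell type
    orientation: str     # D: '+', '-' or 'NO'
    s_coord: int         # E: start coordinate
    e_coord: int         # E: end coordinate
    display_name: str = ""


def features_for_window(features, win_start, win_end):
    """Return the features whose coordinate range overlaps the window."""
    return [
        f for f in features
        if f.s_coord <= win_end and f.e_coord >= win_start
    ]


def annotate_windows(seq_length, win_size, overlap, features):
    """Pair each shifting-window position with its overlapping features."""
    step = win_size - overlap
    annotated = []
    # Window starts are 1-based; only full-length windows are produced.
    for start in range(1, seq_length - win_size + 2, step):
        end = start + win_size - 1
        annotated.append(((start, end), features_for_window(features, start, end)))
    return annotated
```

Each entry of the result pairs one window, and hence one spectrum, with the annotation that may be displayed adjacent to it.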

The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention or some features of the invention can be implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way.

Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit, or may be physically and functionally distributed between different units and processors.

Although the present invention has been described in connection with the specified embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. In the claims, the term "comprising" does not exclude the presence of other elements or steps. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. In addition, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second" etc. do not preclude a plurality. Furthermore, reference signs in the claims shall not be construed as limiting the scope.

Claims

1. A method for visualizing a DNA sequence (10), the method comprising: providing a DNA sequence (10), creating a plurality of frequency domain spectra (20) based on the DNA sequence, providing a database (DB) with corresponding DNA features (100c) and genomic annotation information (ANN), associating a portion of the DNA sequence (10') with a DNA feature (100c) from the database, and displaying the plurality of spectra (20) in combination with the genomic annotation information (ANN) associated to each spectrum.

2. The method according to claim 1, wherein the DNA sequence (10) represents a genome, a chromosome or a portion thereof.

3. The method according to claim 1, wherein creating of the plurality of spectra (20) comprises: converting the DNA sequence into a binary indicator sequence (BIS), and applying short term Fourier transform (STFT) to the said binary indicator sequence (BIS) resulting in a frequency domain vector (U).

4. The method according to claim 1, wherein the database with corresponding DNA features and genomic annotation information comprises fields for at least one of the following: the classification (A) of a genomic feature, the type (B) of a genomic feature, the tissue type or cell type (C) of a genomic feature, the orientation (D) of a genomic feature in a genomic sequence, and the coordinates (E) of a genomic feature.

5. The method according to claim 4, wherein the classification (A) comprises indication of whether the genomic feature is genomic (G), epigenomic (E), transcriptional (T) or proteomic (P).

6. The method according to claim 4, wherein the type (B) comprises indication of whether the genomic feature is at least one type from the types of: gene, methylated DNA, an Exon, an Intron, a 3pUTR, a 5pUTR, and Repeat-LINE.

7. The method according to claim 4, wherein the tissue type or cell type (C) comprises indication of whether the genomic feature is specific to any cell line or tissue type.

8. The method according to claim 4, wherein the orientation (D) comprises indication of whether the orientation of the genomic feature is five prime (5') to three prime (3') or opposite.

9. The method according to claim 4, wherein the coordinates (E) comprise indication of at least a start position and an end position for a genomic feature.

10. The method according to any of claims 4-9, wherein the database is applicable for a specific chromosome, a specific mitochondrial DNA sequence, or an equivalent DNA sequence.

11. The method according to any of claims 4-9, wherein at least a substantial part of the DNA features in the database (DB) is sorted in ascending order having a common start coordinate with respect to the orientation (D).

12. The method according to any of claims 1-3, wherein the displaying of the plurality of spectra comprises mapping into a color space.

13. The method according to claim 3, wherein the period of short term Fourier transform (STFT) defines the length of a shifting window (W) along the DNA sequence, the spectra being displayed in the same order as the shifting window (W) is positioned along the DNA sequence.

14. The method according to claims 1 or 13, wherein the genomic annotation information is arranged substantially adjacent to the corresponding spectrum.

15. The method according to claim 1, wherein a clustering process is performed on the plurality of spectra before displaying the plurality of spectra.

16. A computer program product being adapted to enable a computer system comprising at least one computer having data storage means associated therewith to implement a method according to claim 1.

17. A database with corresponding DNA features and genomic annotation information comprising fields for at least: the classification (A) of a genomic feature, the type (B) of a genomic feature, the tissue type or cell type (C) of a genomic feature, the orientation (D) of a genomic feature in genomic sequence, and the coordinates (E) of a genomic feature wherein the classification (A) comprises indication of whether the genomic feature is genomic (G), epigenomic (E), transcriptional (T) or proteomic (P), the database (DB) being adapted for associating a portion of the DNA sequence (10') with a DNA feature (100c) from the database, the genomic annotation information (ANN) being adapted for displaying with a frequency based spectrum (20) from the portion of the DNA sequence.

18. A processor for implementing one or more parts of a method for visualizing a DNA sequence (10), the method comprising: providing a DNA sequence (10), creating a plurality of frequency domain spectra (20) based on the DNA sequence, providing a database (DB) with corresponding DNA features (100c) and genomic annotation information (ANN), associating a portion of the DNA sequence (10') with a DNA feature (100c) from the database, and displaying the plurality of spectra (20) in combination with the genomic annotation information (ANN) associated to each spectrum.

19. A signal for implementing one or more parts of a method for visualizing a DNA sequence (10), the method comprising: providing a DNA sequence (10), creating a plurality of frequency domain spectra (20) based on the DNA sequence, providing a database (DB) with corresponding DNA features (100c) and genomic annotation information (ANN), associating a portion of the DNA sequence (10') with a DNA feature (100c) from the database, and displaying the plurality of spectra (20) in combination with the genomic annotation information (ANN) associated to each spectrum.