DE19950050C2

DE19950050C2 - Method for the functional assignment of unclassified DNA sequences

Info

Publication number: DE19950050C2
Application number: DE19950050A
Authority: DE
Inventors: Werner Mueller
Original assignee: Individual
Current assignee: Individual
Priority date: 1999-10-16
Filing date: 1999-10-16
Publication date: 2002-07-18
Anticipated expiration: 2019-10-17
Also published as: DE19950050A1

Description

The present invention relates to a method for the functional assignment of unclassified DNA sequences, in which an unclassified sequence can be assigned (aligned) to known reference sequences by means of simple steps.

Since PCR technology became commercially available, the functional assignment of the DNA information obtained in this way has been a fundamental problem in biotechnology.

In conventional methods, the unclassified sequence is therefore matched against sequences with known properties, either on the basis of functional peculiarities or by direct comparison.

The present application provides a method for the functional assignment of an unclassified DNA sequence, which comprises the following steps:

a) comparing the unclassified sequence (A) with reference sequences (B₁-B_n), while
b) creating gap patterns (C₁-C_m) for the reference sequences (B₁-B_n) and a consensus sequence (D) for the sequence (A),
c) splitting the gap patterns (C₁-C_m) into short consensus sequences (E₁-E_m) and gap information (F₁-F_m),
d) position-by-position comparison of the short consensus sequences (E₁-E_m), at iterated offsets, with the unclassified sequence (A), while determining the short consensus sequence with the highest agreement (E_max),
e) inserting the gap information (F_max) corresponding to the consensus sequence (E_max) into the sequence (A) to create an aligned sequence (G).
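Steps (a) to (e) can be sketched in Python as follows. This is only an illustrative reading of the steps, not the patented implementation: it assumes the reference sequences are already aligned to one another with `-` as the gap character, that the consensus is the most frequent non-gap character per column (as Fig. 2 suggests), that agreement is the fraction of identical positions, and that the query is no longer than the shortened consensus. All function names are hypothetical.

```python
from collections import Counter

GAP = "-"

def gap_pattern(aligned_ref):
    """Column indices at which an aligned reference sequence has a gap (C_k)."""
    return tuple(i for i, ch in enumerate(aligned_ref) if ch == GAP)

def consensus(aligned_refs):
    """Most frequent non-gap character of every alignment column (D)."""
    out = []
    for column in zip(*aligned_refs):
        residues = [ch for ch in column if ch != GAP]
        out.append(Counter(residues).most_common(1)[0][0] if residues else GAP)
    return "".join(out)

def shorten(cons, pattern):
    """Consensus with the columns of one gap pattern removed (E_k)."""
    skip = set(pattern)
    return "".join(ch for i, ch in enumerate(cons) if i not in skip)

def best_offset(short_cons, query):
    """Slide the query along the short consensus; return (score, offset)."""
    best = (-1.0, 0)
    for offset in range(len(short_cons) - len(query) + 1):
        window = short_cons[offset:offset + len(query)]
        score = sum(a == b for a, b in zip(window, query)) / len(query)
        if score > best[0]:
            best = (score, offset)
    return best

def align(query, aligned_refs):
    """Return (aligned sequence G, agreement of the best short consensus)."""
    cons = consensus(aligned_refs)
    # Identical gap patterns are kept only once, so m <= n.
    patterns = sorted(set(gap_pattern(ref) for ref in aligned_refs))
    best = None
    for pattern in patterns:
        score, offset = best_offset(shorten(cons, pattern), query)
        if best is None or score > best[0]:
            best = (score, offset, pattern)
    score, offset, pattern = best
    if score < 0:
        raise ValueError("query is longer than every shortened consensus")
    # Place the query at the winning offset, then re-insert the gaps (F_max).
    ungapped_len = len(cons) - len(pattern)
    padded = GAP * offset + query + GAP * (ungapped_len - offset - len(query))
    chars = iter(padded)
    skip = set(pattern)
    aligned = "".join(GAP if i in skip else next(chars) for i in range(len(cons)))
    return aligned, score
```

With the toy references `["ACGT--ACGT", "ACGTTTACGT", "AC--TTACGT"]`, `align("ACGTACGT", refs)` returns `("ACGT--ACGT", 1.0)`: the gap pattern of the first reference gives the best-matching short consensus, and its two gaps are inserted into the query.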

The reference sequences (B₁-B_n) on which the consensus sequence (D) is based should have a degree of agreement with one another that is greater than 60% and less than 80-90%. According to the present application, the information collected in step (b) (gap patterns (C₁-C_m) and consensus sequence (E₁-E_m)) can be cached and used directly for later comparisons.
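The requirement on the reference set can be checked with a pairwise comparison, sketched below. The sketch assumes pre-aligned references with `-` as the gap character, measures agreement as the fraction of identical columns, and takes 90% as the upper bound of the stated "80-90%" range; the function names and these choices are illustrative assumptions, not part of the patent text.

```python
from itertools import combinations

def identity(a, b, gap="-"):
    """Fraction of identical columns, ignoring columns where both have a gap."""
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)

def suitable_reference_set(aligned_refs, low=0.60, high=0.90):
    """True if every pair of references agrees more than `low` and less than `high`."""
    return all(
        low < identity(a, b) < high
        for a, b in combinations(aligned_refs, 2)
    )
```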

When the gap patterns are created, identical gap patterns are eliminated, so that for E₁-E_m the following applies: m ≦ n.
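A minimal sketch of this elimination, assuming pre-aligned references with `-` as the gap character: each distinct gap pattern is stored once, together with the indices of the references that share it. The function name and the returned structure are hypothetical.

```python
def unique_gap_patterns(aligned_refs, gap="-"):
    """Map each distinct gap pattern to the indices of the references showing it."""
    patterns = {}
    for index, ref in enumerate(aligned_refs):
        pattern = tuple(i for i, ch in enumerate(ref) if ch == gap)
        patterns.setdefault(pattern, []).append(index)
    return patterns
```

Because several references may share one pattern, the number of keys m can never exceed the number of references n.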

The critical step in the method according to the invention is the position-by-position comparison of the short consensus sequences (E₁-E_m), in which the highest possible degree of agreement is to be achieved. For a meaningful classification, this degree of agreement should be greater than 60%, preferably greater than 80%.
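The thresholded comparison can be illustrated as follows. The sketch assumes a non-empty query that is no longer than the short consensus, scores each offset by the fraction of identical positions, and rejects matches that do not exceed the 60% limit; the function name and the `min_agreement` parameter are assumptions made for illustration.

```python
def best_match(short_consensus, query, min_agreement=0.60):
    """Return (offset, score) of the best placement, or None if below the limit."""
    best_score, best_offset = 0.0, None
    for offset in range(len(short_consensus) - len(query) + 1):
        window = short_consensus[offset:offset + len(query)]
        score = sum(a == b for a, b in zip(window, query)) / len(query)
        if score > best_score:
            best_score, best_offset = score, offset
    if best_offset is None or best_score <= min_agreement:
        return None  # no placement is good enough for a meaningful classification
    return best_offset, best_score
```

Passing `min_agreement=0.80` applies the preferred, stricter limit.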

The method according to the invention can also be carried out in several cycles: after the aligned sequence (G) has been found, a best reference sequence (B_max) is determined and used to find a new set of reference sequences (H₁-H_n) belonging to the family of the reference sequence with the highest degree of agreement (B_max) from the first cycle.
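One way to picture the transition between cycles is sketched below. The patent does not say how sequence families are stored, so the `families` dictionary, the naming of references, and the agreement measure (identical positions over columns where neither sequence has a gap) are all assumptions made for illustration.

```python
def agreement(a, b, gap="-"):
    """Fraction of identical positions over columns where neither has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    return sum(x == y for x, y in columns) / len(columns) if columns else 0.0

def next_reference_set(aligned_query, refs, families):
    """Pick B_max and return it with the reference set for the next cycle.

    refs:     {name: aligned reference sequence} used in the current cycle
    families: {family name: {name: aligned sequence}} (hypothetical lookup)
    """
    b_max = max(refs, key=lambda name: agreement(aligned_query, refs[name]))
    for members in families.values():
        if b_max in members:
            return b_max, members
    return b_max, refs  # no family known: keep the current reference set
```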

The method offers the advantage of generating multiple sequence alignments very quickly, so that they are rapidly available for further sequence processing, such as sequence annotation, or for sequence comparisons. Sequence comparisons within correctly calculated multiple sequence alignments are particularly fast, because the sequences in the alignment no longer have to be shifted against each other for the comparison but can be compared directly, position by position.

The invention is explained in more detail with reference to the following figures:

Fig. 1 Reading in the sequences: first, the sequences to be processed and the reference sequences are read in.

Fig. 2 The gap patterns are extracted from the reference sequences, and a consensus sequence is determined from the sequence information.

Fig. 3a A list of consensus sequences is then created, from each of which a particular gap pattern has been removed.

Fig. 3b A specific example of how shortened ("shorted") consensus sequences are generated from the consensus sequence.

Fig. 4a An optimal set of aligner parameters is determined by position-by-position comparison of each of the "shorted consensus" sequences with the new sequence.

Fig. 4b A specific example of the position-by-position comparison.

Fig. 5a The new sequence is aligned by inserting the gaps and shifting the offset in accordance with the parameter set.

Fig. 5b Experimental example of inserting the gaps and shifting the offset.

Claims

1. A method for the functional assignment of an unclassified DNA sequence, comprising:

a) comparing the unclassified sequence (A) with reference sequences (B₁-B_n), while
b) creating gap patterns (C₁-C_m) for the reference sequences (B₁-B_n) and a consensus sequence (D) for the sequence (A),
c) splitting the gap patterns (C₁-C_m) into short consensus sequences (E₁-E_m) and gap information (F₁-F_m),
d) position-by-position comparison of the short consensus sequences (E₁-E_m), at iterated offsets, with the sequence (A), while determining the short consensus sequence with the highest agreement (E_max),
e) inserting the gap information (F_max) corresponding to the consensus sequence (E_max) into the sequence (A) to create an aligned sequence (G).

2. The method according to claim 1, wherein a degree of agreement of ≧ 60%, preferably ≧ 80%, is required in the position-by-position comparison.

3. The method according to claim 1 or 2, wherein several cycles of steps (a) to (e) are carried out.

4. The method according to claim 1, wherein the gap patterns (C₁-C_m) and the consensus sequence (D) created in step (b) are used for later comparisons.