WO2003056458A1

WO2003056458A1 - EST sequence mapping method and mapping program

Info

Publication number: WO2003056458A1
Application number: PCT/JP2002/013648
Authority: WO
Inventors: Shinichi Morishita; Jun Ogasawara
Original assignee: Center For Advanced Science And Technology Incubation, Ltd.
Priority date: 2001-12-27
Filing date: 2002-12-26
Publication date: 2003-07-10
Also published as: AU2002359917A1; JP2003223452A

Abstract

An EST sequence is aligned at high speed by an algorithm that inserts a small number of long gaps. First, by referring to a map table indicating the positions where primary keys of a predetermined length appear, a position (302) corresponding to a prefix (301) of the EST sequence is found in the genome sequence and the prefix is aligned. By referring to the map table again, a position (312) corresponding to a suffix (311) of the EST sequence is found in the genome sequence and the suffix is aligned. Next, a first subsequence lying between the prefix and the suffix in the EST sequence is aligned (arrows 311, 322) with a second subsequence lying in the genome sequence between the sequence corresponding to the aligned prefix and the sequence corresponding to the aligned suffix.

Description

Specification

TECHNICAL FIELD

The present invention relates to an EST sequence mapping method and a mapping program, that is, to the mapping of an EST (Expressed Sequence Tag) to a genome sequence.

Background art

It has become useful to map ESTs to genome sequences, and various sequence mapping algorithms have in fact been proposed. Recently, heuristic algorithms such as Fasta and Blast have been used because they are faster than dynamic programming techniques.

However, these algorithms still require a lot of processing time. In addition, because they are designed only to determine the similarity between two sequences, they cannot by themselves project an EST onto the known genome that encodes it.

When mapping 250,000 human ESTs (several hundred to 10,000 bases in length) to a genomic sequence consisting of about 3 billion bases, the problem is that each human EST must be mapped onto the genomic sequence while being divided into exons. The Blast, Fasta, and Smith-Waterman algorithms compare two sequences and output the one contiguous partial sequence with the highest homology. However, in order to clarify the exon-intron structure, it is necessary to compute multiple regions of highest homology (exons) while inserting several long gaps (introns), each of which may be tens of thousands of bases long.

It is not impossible to compute exons using Blast. For example, if a human EST is queried against a human genome sequence using Blast, multiple results are returned in which a partial sequence of the EST is aligned to a part of the human genome with a high match rate. Since each of these results is likely to represent an exon, it is not impossible to connect the exons carefully by hand. However, this is a very time-consuming task.

In addition, even Smith-Waterman, which has higher accuracy than Blast and Fasta, outputs an alignment containing a large number of short gaps if applied as it is. It does not output exons.

An object of the present invention is to provide a method capable of aligning an EST sequence at a high speed by an algorithm for inserting a small number of long gaps.

Disclosure of the invention

The object of the present invention is achieved by a method of mapping an EST sequence to a genomic sequence, characterized by comprising the steps of: generating, by referring to the genomic sequence, a map table indicating the positions where each primary key of a predetermined length appears in the genomic sequence; finding, by referring to the map table, a position in the genomic sequence corresponding to a prefix of the EST sequence, and aligning the prefix; finding, by referring to the map table, a position in the genomic sequence corresponding to a suffix of the EST sequence, and aligning the suffix; and aligning a first subsequence, lying between the prefix and the suffix in the EST sequence, with a second subsequence lying in the genomic sequence between the sequence corresponding to the aligned prefix and the sequence corresponding to the aligned suffix. So-called dynamic programming can be used for the alignment.

In a preferred embodiment, the step of aligning the first subsequence with the second subsequence includes a step of extending an exon, which is a region in the EST that has been aligned, and a step of skipping an intron. Extending exons and skipping introns allows for higher processing speed.

In a more preferred embodiment, the step of aligning the first subsequence with the second subsequence includes: a step of extending the prefix as long as the element in the EST sequence and the element in the genome sequence match; a step of, when the elements do not match, identifying the extended prefix as an exon, looking up in the map table, as a primary key, the sequence of the predetermined length that follows the prefix in the EST sequence, finding the position in the genome sequence corresponding to that primary key, and aligning the sequence of the predetermined length; and a step of extending the sequence of the predetermined length as long as the element in the EST sequence and the element in the genome sequence match. Alignment of the EST sequence with the genomic sequence is realized by repeating the step of aligning the subsequent sequence of the predetermined length and the step of extending it.

In another preferred embodiment, the step of aligning the prefix comprises a step of referring to the map table to specify the position where a sequence of a predetermined length, located at a predetermined position from the front of the EST sequence, is found in the genome sequence, and a step of aligning a first suspended portion, which is the subsequence located before the predetermined position, with the subsequence located before the specified position in the genome sequence. Similarly, the step of aligning the suffix includes a step of referring to the map table to specify the position where a sequence of a predetermined length, located at a predetermined position from the back of the EST sequence, is found in the genome sequence, and a step of aligning a second suspended portion, which is the subsequence located behind the predetermined position, with the subsequence located behind the specified position in the genome sequence.

According to the above-described embodiment, even when a so-called mismatch is included, appropriate mapping can be performed by aligning the suspended portions.

In a more preferred embodiment, the step of aligning the first subsequence with the second subsequence comprises: a step of extending the prefix as long as the element in the EST sequence and the element in the genome sequence match; a step of, when the elements do not match, identifying the extended prefix as an exon and identifying the position in the genome sequence where a sequence of a predetermined length, located at a predetermined position ahead of the prefix in the EST sequence, is found; and a step of aligning a third suspended portion, which is the subsequence located between the end of the prefix and the predetermined position, with the subsequence of the genome sequence located between the sequence corresponding to the prefix and the position thus found.

In another preferred embodiment, the method further comprises a step of examining the elements within a predetermined range from each end of the substring determined to be an intron, to find elements that comply with the motif rule of introns, and a step of modifying the intron so that it ends with those elements. Although this may increase the number of mismatches, it makes it possible to estimate as an intron a subsequence that follows the intron motif rule. For example, the step of finding the elements includes a step of shifting the intron back and forth, while maintaining the number of elements of the subsequence determined to be an intron, to specify a new intron candidate, and a step of determining whether the candidate obeys the intron motif rule.

In another preferred embodiment, the method further comprises a step of calculating a matching ratio of the EST sequence from the result of the alignment. This makes it possible to evaluate the accuracy of the EST sequence.

The object of the present invention is also achieved by an EST sequence mapping program, which is a computer-readable program for operating a computer to map an EST sequence to a genomic sequence, characterized by causing the computer to execute: a step of generating, by referring to the genomic sequence, a map table indicating the positions where each primary key of a predetermined length appears; a step of finding, by referring to the map table, a position in the genome sequence corresponding to a prefix of the EST sequence, and aligning the prefix; a step of finding, by referring to the map table, a position in the genomic sequence corresponding to a suffix of the EST sequence, and aligning the suffix; and a step of aligning a first subsequence, lying between the prefix and the suffix in the EST sequence, with a second subsequence lying in the genome sequence between the sequence corresponding to the aligned prefix and the sequence corresponding to the aligned suffix. Preferably, in the step of aligning the first subsequence with the second subsequence, the computer is caused to execute a step of extending an exon, which is a region in the EST that has been aligned, and a step of skipping an intron.

More specifically, in the step of aligning the first subsequence with the second subsequence, the computer is caused to execute: a step of extending the prefix as long as the element in the EST sequence matches the element in the genome sequence; a step of, when the elements do not match, identifying the extended prefix as an exon, looking up in the map table, as a primary key, the sequence of the predetermined length that follows the prefix in the EST sequence, finding the position in the genomic sequence corresponding to that primary key, and aligning the sequence of the predetermined length; and a step of extending the sequence of the predetermined length as long as the element in the EST sequence and the element in the genomic sequence match. The computer repeats the step of aligning the subsequent sequence of the predetermined length and the step of extending it, thereby achieving the alignment of the EST sequence with the genomic sequence.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a diagram showing an example of a map table according to the present invention.

FIG. 2 is a diagram for explaining the alignment of prefixes and suffixes according to the present invention.

FIG. 3 is a diagram schematically showing an alignment process according to the present invention.

FIG. 4 is a diagram schematically showing exon extension and detection of the next exon in the alignment according to the present invention.

FIG. 5 is a block diagram illustrating a schematic configuration of the mapping system according to the present embodiment.

FIG. 6 is a flowchart showing processing executed by the start/end point detection unit and the alignment execution unit according to the present embodiment.

FIG. 7 is a diagram showing an outline of alignment processing with further improvement according to the present invention.

FIG. 8 is a diagram illustrating the exon in an alignment with further improvements according to the present invention.

FIG. 9 is a diagram illustrating the detection of the next exon in an alignment with further improvements according to the present invention.

FIGS. 10A and 10B schematically show the results of further improved alignment according to the present invention, respectively.

FIG. 11 is a flowchart showing a process of estimating a start point and an end point in an alignment with further improvement according to the present invention.

FIG. 12 is a flowchart showing a process of extending an exon and detecting the next exon in the alignment with further improvement according to the present invention.

FIG. 13 is a diagram schematically illustrating another alignment method according to the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

[Principle of the present invention]

Mapping millions of ESTs to the human genomic sequence is challenging. ESTs are up to tens of thousands of bases long, while the genomic sequence is about 3 billion bases long. In order to reduce the computation time, the applicants pre-process the genomic sequence. First, a DNA sequence of length L is defined as a primary key. The applicants' idea is to generate a map table that stores the positions where each primary key appears in the genome sequence.

FIG. 1 is a diagram showing an example of a map table according to the present invention. Given a specific EST sequence, the positions of its prefix and suffix of length L are obtained by referring to the map table in main memory. Assuming that the four nucleotides appear randomly in the genome sequence, a position can be obtained from the map table with an average of $M / 4^L$ main-memory accesses (M is the length of the genome sequence).
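As a rough worked example of this estimate, the expected number of hits per primary key can be evaluated for illustrative values. The figures below are assumptions made only for the arithmetic: M is taken as roughly 3 × 10^9 bases (the approximate size of the human genome) and L as 14, the key length mentioned later in this description.

```latex
\[
\frac{M}{4^{L}} \;=\; \frac{3 \times 10^{9}}{4^{14}}
\;=\; \frac{3 \times 10^{9}}{268{,}435{,}456}
\;\approx\; 11.2
\]
```

Under the random-sequence assumption, a 14-mer would therefore be expected to occur on the order of ten times in the genome, which is why each lookup needs only a handful of memory accesses.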

[Generating Map Table]

A more detailed description of the map table and its generation is given below.

Before the map table is generated, the length of the primary key must be determined. The applicants set the primary key length to 14. Of course, the length is not limited to this. In the example of FIG. 1, the length of the primary key is set to 2 for ease of explanation.

Referring to the map table 100 in FIG. 1, it can be seen that, in the genome sequence 101, "TA" is located at the ninth position (see reference numeral 102) and "GC" is located at positions 3 and 12 (see reference numeral 103).
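The construction of such a table can be sketched as follows. This is a minimal illustration rather than the applicants' implementation: the function name `build_map_table` and the 13-base toy genome are hypothetical, and the toy genome is not the sequence 101 of FIG. 1; it was merely chosen so that "TA" and "GC" fall at the positions quoted above.

```python
from collections import defaultdict

def build_map_table(genome: str, key_len: int) -> dict[str, list[int]]:
    """Map every key_len-mer to the 1-based positions where it starts in genome."""
    table = defaultdict(list)
    for start in range(len(genome) - key_len + 1):
        # Positions are stored 1-based, matching the numbering used in the text.
        table[genome[start:start + key_len]].append(start + 1)
    return dict(table)

# Hypothetical toy genome (an assumption, not the sequence shown in FIG. 1).
genome = "AAGCAAAATACGC"
table = build_map_table(genome, 2)
print(table["TA"], table["GC"])
```

A dictionary of lists is used here for readability; for a 14-mer key over a full genome, a more compact layout (for example, sorted position arrays indexed by an integer encoding of the key) would probably be preferable.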

In the following description, the genomic sequence G means the sequence

$g_1, g_2, \ldots, g_M \quad (g_i \in \{A, T, G, C, N\})$

and the EST sequence E means the sequence

$e_1, e_2, \ldots, e_N \quad (e_i \in \{A, T, G, C, N\})$.

G[i, j] means the subsequence

$g_i, g_{i+1}, \ldots, g_j$

and E[i, j] likewise means the subsequence

$e_i, e_{i+1}, \ldots, e_j$.

The alignment of the i-th nucleotide of the EST sequence E in the genome sequence G is defined as the position f(i) in the genome sequence G. In other words, each e_i is associated with a unique position f(i) in the genome sequence G.

[First method using map table]

First, the first method using the map table will be explained.

In the first step, a start point and an end point are detected. Let L be the length of the primary key in the map table, and let the prefix of the EST sequence be E[1, L] and its suffix be E[N-L+1, N]. By accessing the map table, the prefix and suffix are aligned to the genome. For example, as shown in FIG. 2, taking the sequence of length L (in this case, L = 4) at the head of the EST sequence E (see reference numeral 202) as the prefix (see reference numeral 203), the position where the sequence 203 appears in the genome sequence G (reference numeral 20) is identified by referring to the map table. Likewise, taking the sequence of length L at the end of the EST sequence E as the suffix (see reference numeral 204), the position where the sequence 204 appears in the genome sequence G is identified by referring to the map table.

Then, in a second step, the alignment is realized by dynamic programming. Here, using dynamic programming, the not-yet-aligned EST subsequence E[L+1, N-L] is aligned with the genomic subsequence G[f(L)+1, f(N-L+1)-1]. As the dynamic programming, for example, that of Gotoh ("An improved algorithm for matching biological sequences", Journal of Molecular Biology, 162: 705-708, 1982) can be used.

The dynamic programming in the second step finds an optimal alignment of the EST that contains long introns.

In the first step, finding the start and end points of a given EST in the genome sequence is fast because it requires only main-memory accesses. However, the genomic interval G[f(L)+1, f(N-L+1)-1] is still too long for dynamic programming, so the second step requires much execution time.
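The first step of this method can be sketched as follows. This is a simplified illustration under stated assumptions: the names `locate_endpoints` and `middle_region` and the toy sequences are hypothetical, all positions are 1-based, and the dynamic-programming step itself is not shown.

```python
from collections import defaultdict

def build_map_table(genome, key_len):
    """Map each key_len-mer to the 1-based positions where it starts."""
    table = defaultdict(list)
    for start in range(len(genome) - key_len + 1):
        table[genome[start:start + key_len]].append(start + 1)
    return table

def locate_endpoints(est, genome, key_len):
    """Return candidate (f(1), f(N-L+1)) pairs: 1-based starts of prefix and suffix hits."""
    table = build_map_table(genome, key_len)
    prefix_hits = table.get(est[:key_len], [])
    suffix_hits = table.get(est[-key_len:], [])
    # Keep only pairs in which the suffix lies downstream of the whole prefix.
    return [(s, e) for s in prefix_hits for e in suffix_hits if e >= s + key_len]

def middle_region(genome, start, end, key_len):
    """Genomic interval G[f(L)+1, f(N-L+1)-1] that remains for dynamic programming."""
    return genome[start + key_len - 1:end - 1]

genome = "TTACGTAAGTCCCCAGCCATTT"  # hypothetical toy genome
est = "ACGTAACCAT"                 # hypothetical EST: prefix ACGT, suffix CCAT
pairs = locate_endpoints(est, genome, 4)
print(pairs, middle_region(genome, *pairs[0], 4))
```

In this toy case the remaining genomic interval is ten bases long while the unaligned EST part is two bases long; on a real genome the interval can span an entire gene, which illustrates why the second step is costly.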

[Second method using map table]

To accelerate the dynamic programming of the second step, f(i) is defined by extending exons and skipping long introns using the map table. First, the algorithm for aligning an EST that consists of a single exon is explained.

Here, f(i) is set to f(i-1) + 1 for each i from L+1 to N-L. FIG. 3 is a diagram schematically showing this step.

The above steps are called extension steps. One nucleotide is added at the end of the exon for each repetition.

To handle ESTs that span multiple exons, a step that skips an intron when exon extension fails is incorporated (see FIG. 4). As long as the i-th nucleotide in the EST sequence E matches the nucleotide at position f(i-1) + 1 in the genomic sequence G, set f(i) = f(i-1) + 1 and increment i (step 2-1 of the second step).

When step 2-1 indicates that the exon terminates at the i-th position in the EST sequence E, the position of the next exon is detected by referring to the map table with E[i, i+L-1] as the primary key. After the position of the next exon has been determined and f(j) (j = i, ..., i+L-1) has been set, i is incremented by L and the process returns to step 2-1 (step 2-2 of the second step).
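Steps 2-1 and 2-2 can be sketched as a single loop. The code below is a minimal sketch for the mismatch-free case only, with assumptions that are not in the description: the name `map_est_exact` is hypothetical, the first map-table hit is taken for the prefix, the nearest downstream hit is taken for each next exon, and the loop simply runs to the end of the EST instead of stopping at a separately aligned suffix.

```python
from collections import defaultdict

def build_map_table(genome, key_len):
    """Map each key_len-mer to the 1-based positions where it starts."""
    table = defaultdict(list)
    for start in range(len(genome) - key_len + 1):
        table[genome[start:start + key_len]].append(start + 1)
    return table

def map_est_exact(est, genome, key_len):
    """Return [f(1), ..., f(N)] as 1-based genome positions, or None on failure."""
    table = build_map_table(genome, key_len)
    hits = table.get(est[:key_len])
    if not hits:
        return None
    f = [0] * (len(est) + 1)          # f[i] is 1-based; f[0] is unused
    for j in range(key_len):          # align the prefix E[1, L]
        f[j + 1] = hits[0] + j
    i = key_len + 1
    while i <= len(est):
        nxt = f[i - 1] + 1
        if nxt <= len(genome) and est[i - 1] == genome[nxt - 1]:
            f[i] = nxt                # step 2-1: extend the exon by one base
            i += 1
            continue
        # Step 2-2: extension failed, so look up E[i, i+L-1] to find the next exon.
        key = est[i - 1:i - 1 + key_len]
        downstream = [p for p in table.get(key, []) if p > f[i - 1]]
        if len(key) < key_len or not downstream:
            return None
        for j in range(key_len):
            f[i + j] = downstream[0] + j
        i += key_len
    return f[1:]

genome = "TTACGTAAGTCCCCAGCCATGGTT"   # hypothetical: exon, 8-base intron, exon
est = "ACGTAACCATGG"                  # hypothetical two-exon EST
f = map_est_exact(est, genome, 4)
# Wherever f jumps, the skipped genomic interval is reported as an intron.
introns = [(a + 1, b - 1) for a, b in zip(f, f[1:]) if b != a + 1]
print(f, introns)
```

Apart from the table lookups, the loop touches each EST base once, so the work grows with the EST length rather than with the length of the skipped introns.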

[Example of system that implements the above method]

In the present embodiment, high-speed alignment of the EST sequence is realized by using the second method. FIG. 5 is a block diagram showing a schematic configuration of the mapping system according to the present embodiment. As shown in FIG. 5, the mapping system 10 includes a map table generation unit 12 for generating a map table, a map table storage unit 14 for storing the generated map table, a start/end point detection unit 16 for detecting a start point and an end point, an alignment execution unit 18 for executing the alignment, and a result storage unit 20 for storing the alignment result.

The mapping system 10 accesses the genome sequence DB 22 storing the genome sequence, and executes the necessary processing (map table generation and alignment) with reference to the genome sequence G. The genome sequence DB 22 may be stored on the hard disk of a computer (for example, a personal computer) that implements the mapping system 10, or may reside on another server that the mapping system 10 accesses via a network such as a LAN or the Internet.

The map table generation unit 12 refers to the genome sequence G in the genome sequence DB 22, generates a map table indicating at which positions in the genome sequence G each primary key of a required length (for example, a 14-mer) appears, and stores it in the map table storage unit 14.

The alignment using the map table generated by the map table generation unit 12 will be described in more detail below. FIG. 6 is a flowchart showing processing executed by the start/end point detection unit 16 and the alignment execution unit 18. As shown in FIG. 6, the start/end point detection unit 16 first takes the sequence of length L (the length of the primary key) at the beginning of the EST sequence E as the prefix, and the sequence of length L at the end of the EST sequence E as the suffix (step 601). Next, the start/end point detection unit 16 refers to the map table in the map table storage unit 14 and aligns the prefix and suffix on the genome sequence G (step 602). This is achieved by referring to the map table to specify the position of the sequence corresponding to the prefix and the position of the sequence corresponding to the suffix. As a result, in FIG. 3, the prefix of length L in the EST sequence E (see reference numeral 301) is aligned with a predetermined portion of the genome sequence G (see reference numeral 302), and the suffix (see reference numeral 311) is likewise aligned with a predetermined portion of the genome sequence G.

Next, the alignment execution unit 18 executes the above-described extension step and the step of skipping the intron and detecting the next exon. Initially, attention is paid to position L+1 in the EST sequence (i = L+1) (step 603). The alignment execution unit 18 determines whether the i-th nucleotide e_i in the EST sequence matches the corresponding nucleotide g_{f(i-1)+1} in the genomic sequence G (step 604). If the result of step 604 is Yes, f(i) = f(i-1) + 1 is set and i is incremented (steps 605, 606), and it is then determined whether the subsequent nucleotides match.

In FIG. 3, for example, the 5th nucleotide of the EST sequence E (5 = 1 + L, with L = 4) is compared with the f(5)-th nucleotide in the genome sequence G (see the corresponding arrow). Since these match, the exon is extended by one. Next, the sixth nucleotide is compared with the f(6)-th nucleotide in the genome sequence (f(6) = f(5) + 1) (see arrow 322). In this way, the exon is extended as long as the two agree.

When e_i and g_{f(i-1)+1} no longer match (No in step 604), the alignment execution unit 18 takes E[i, i+L-1] in the EST sequence E as the primary key and, by referring to the map table in the map table storage unit 14, finds the position of E[i, i+L-1] in the genome sequence G (step 607). This identifies the position of the next exon. The sequence lying between the end position in the genome sequence G aligned to the exon extended in the previous process and the head position in the genome sequence G aligned to the next exon is an intron. Once the position of the next exon has been specified, the process of extending the exon is repeated; at this time, the nucleotide position i is set to i + L (step 609).

For example, in FIG. 4, the first exon is extended (see reference numerals 401 and 403) to achieve alignment to the genomic sequence G (see reference numerals 402 and 404). Here, at the position indicated by the arrow 431, the nucleotide in the EST sequence E coincides with the nucleotide in the genomic sequence G, but the following nucleotide does not match (see arrow 432). In such a case, the next L-mer sequence "TGCC" is specified in the EST sequence E and is aligned with the genome sequence G by referring to the map table (see reference numeral 412). The sequence lying between the reference numerals 404 and 412 is therefore the intron 410. For the next exon 411, a similar process is performed to extend the exon as long as the nucleotides match.

Such processing is repeated until the suffix is reached (see step 608 in FIG. 6 and reference numerals 421 and 422 in FIG. 4).

As described above, according to the present embodiment, it is possible to realize the alignment at higher speed by repeating the step of extending the exon and the step of skipping the intron.

[Alignment to allow mismatch]

In practice, ESTs do not align with 100% identity, and mismatches occur during alignment. To cope with these mismatches, the applicants have improved the above algorithm.

In practice, it is not easy to determine the positions of the prefix and suffix of the EST sequence E in the first step. This is because the prefix E[1, L] or the suffix E[N-L+1, N] may contain a mismatch, in which case the start point and end point of the EST sequence in the genome sequence cannot be detected. To solve this, the EST sequence is scanned from its start until a subsequence of length L is found in the map table, and in a similar manner the EST sequence is scanned backward from its end. This will be described in more detail with reference to the drawings.

[Approximation (estimation) of the starting point of EST: step 1.1 of the first step]

Initialize i = 1, and increment i until the position of E[i, i+L-1] is found in the map table (step 1101 in FIG. 11). Thereafter, E[1, i-1], which remains to be aligned, is referred to as a suspended portion (dangling part) of the EST (see step 1102 in FIG. 11 and reference numeral 703 in FIG. 7). Then, the suspended portion E[1, i-1] of the EST sequence E is aligned with the subsequence G[f(i)-i+1, f(i)-1] of the genomic sequence (reference numeral 704) (arrow 721), and f(h) is found for each h = 1, ..., i-1 (step 1103 in FIG. 11). Dynamic programming may be used for this.

[Approximation (estimation) of the end point of EST: step 1.2]

Initialize j = N, and decrement j until the position of E[j-L+1, j] is found in the map table (step 1104 in FIG. 11). Thereafter, E[j+1, N], which remains to be aligned, is referred to as a suspended portion of the EST (see step 1105 in FIG. 11 and reference numeral 13 in FIG. 7). Next, the suspended portion E[j+1, N] of the EST sequence E is aligned with the subsequence of the genomic sequence G that immediately follows position f(j) (see reference numeral 714) (arrow 722), and f(h) is found for each h = j+1, ..., N (step 1106 in FIG. 11). Again, dynamic programming can be used.
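The two scans of steps 1.1 and 1.2 can be sketched as follows. This is a simplified illustration: the name `find_anchors` and the toy sequences are hypothetical, and a plain set of L-mers stands in for the key column of the map table. The alignment of the suspended portions themselves (for example by dynamic programming) is not shown.

```python
def find_anchors(est, kmers, key_len):
    """Scan inward from both ends of the EST for the first L-mers present in the genome.

    Returns (i, j), both 1-based: E[i, i+L-1] is the first L-mer found from the
    front and E[j-L+1, j] is the first L-mer found from the back. Either value
    is None if no L-mer of the EST occurs in the genome.
    """
    n = len(est)
    i = next((k for k in range(1, n - key_len + 2)
              if est[k - 1:k - 1 + key_len] in kmers), None)
    j = next((k for k in range(n, key_len - 1, -1)
              if est[k - key_len:k] in kmers), None)
    return i, j

L = 4
genome = "TTACGTAAGTCCCCAGCCATGGTT"   # hypothetical toy genome
kmers = {genome[p:p + L] for p in range(len(genome) - L + 1)}
est = "GCGTAACCATGC"                  # hypothetical EST with a misread first and last base
i, j = find_anchors(est, kmers, L)
leading, trailing = est[:i - 1], est[j:]   # suspended portions E[1, i-1] and E[j+1, N]
print(i, j, leading, trailing)
```

Here one misread base at each end pushes each anchor inward by a single position, leaving one-base suspended portions at both ends.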

Mismatches in the EST also make it difficult to extend exons and skip introns. Without mismatches, extension stops only when the exon ends; with mismatches, exon extension fails either when the end of the exon is reached or when a mismatch occurs in the alignment.

[Identifying One Exon: Step 2.1]

The processing executed subsequent to the first step will be described with reference to FIG. 12 and the like.

Initialize i to the smallest unaligned position in the EST (see step 1201 in FIG. 12 and position (i + 4) in EST sequence E in FIG. 8). As long as the i-th nucleotide in the EST sequence E matches nucleotide (f(i-1) + 1) in genome sequence G (Yes in step 1202 in FIG. 12), set f(i) = f(i-1) + 1 (step 1203 in FIG. 12) and increment i (see step 1204 in FIG. 12 and arrow 821 in FIG. 8). Then set x = i - 1 to record the position i - 1 in the EST at which the extension ended.

[Identify the next exon: Step 2.2]

Increment i until the position of E[i, i+L-1] is found in the map table (see step 1205 in FIG. 12). Thereafter, E[x+1, i-1], which remains to be aligned, is referred to as a suspended portion of the EST (see step 1206 in FIG. 12 and reference numeral 903 in FIG. 9). In FIG. 9, the sequence E[i, i+3] starting at the i-th position of the EST sequence E is aligned with G[f(i), f(i+3)] in the genome sequence G (see reference numerals 901 and 902). This E[i, i+3] 901 is the next exon.

[Alignment of the suspended part of EST: Step 2.3]

The suspended portion E[x+1, i-1] of the EST sequence E (reference numeral 903 in FIG. 9) is aligned with the subsequence G[f(x)+1, f(i)-1] of the genomic sequence G (reference numeral 904 in FIG. 9) (step 1207). Again, dynamic programming can be used. FIG. 10A shows a case where no intron was detected after dynamic programming, and FIG. 10B shows a case where an intron (see reference numeral 1011) was found.

Such processing is repeated until the suffix is reached (see steps 1208 and 1209 in FIG. 12).

[Faster speed]

In fact, introns can be thousands of base pairs in length, and are often many hundreds of base pairs in length. Applying rudimentary dynamic programming in step 2.3 to align the suspended portion E[x+1, i-1] of the EST sequence with G[f(x)+1, f(i)-1] requires an enormous amount of computation if G[f(x)+1, f(i)-1] contains an intron. To accelerate this step, the lengths of E[x+1, i-1] (see reference numeral 1301 in FIG. 13) and G[f(x)+1, f(i)-1] (see reference numerals 1302, 1304 and 1303 in FIG. 13) are compared before dynamic programming is applied, in order to determine whether the subsequence G[f(x)+1, f(i)-1] contains an intron. If G[f(x)+1, f(i)-1] is significantly longer than E[x+1, i-1] (judged, for example, by a predetermined ratio or by a threshold on the difference between the lengths of the subsequences), G[f(x)+1, f(i)-1] is taken to include an intron (reference numeral 1304). In this case, E[x+1, i-1] should align with G[f(x)+1, f(x)+(i-x)] (see reference numeral 1302) and G[f(i)-(i-x), f(i)-1] (see reference numeral 1303), so the suspended part E[x+1, i-1] of the EST sequence is aligned with the concatenation G[f(x)+1, f(x)+(i-x)] + G[f(i)-(i-x), f(i)-1].
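The length comparison described above can be sketched as follows. This is only an illustration: the function name `windows_for_suspended_part` and the default ratio of 10 are hypothetical choices (the description leaves the threshold open), and positions are 1-based inclusive.

```python
def windows_for_suspended_part(x, i, f_x, f_i, ratio=10.0):
    """Choose the genomic intervals against which E[x+1, i-1] is aligned.

    x and i are EST positions, f_x = f(x) and f_i = f(i) are genome positions.
    Returns a list of 1-based inclusive (start, end) intervals.
    """
    est_len = (i - 1) - x          # length of E[x+1, i-1]
    gen_len = (f_i - 1) - f_x      # length of G[f(x)+1, f(i)-1]
    if est_len == 0:
        return []                  # nothing left to align; the whole gap is intron
    if gen_len > ratio * est_len:
        # The gap is far longer than the EST part, so an intron is assumed inside it.
        # Only the two flanks of length (i - x) are kept and concatenated.
        width = i - x
        return [(f_x + 1, f_x + width), (f_i - width, f_i - 1)]
    return [(f_x + 1, f_i - 1)]    # short gap: align against the whole interval

print(windows_for_suspended_part(6, 9, 8, 1009))
print(windows_for_suspended_part(6, 9, 8, 12))
```

With this choice the dynamic-programming matrix is bounded by roughly (i - x) × 2(i - x) cells regardless of the intron length.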

[Intron motif]

In the above embodiment, the exon regions and the intron regions were estimated. However, deciding the exact exon/intron boundary requires further factors to be considered. For example, the boundaries may be placed as follows (pattern A):

GCC/T ... AC/GT

The boundaries can also be placed as follows (pattern B):

GC/CT ... A/CGT

In each of these, the matching rate is the same. Therefore, it is not possible to determine which pattern is correct.

The junction position between intron and exon can, however, be determined based on the idea of an intron motif. Introns are flanked by splice sites at the 5' and 3' ends that follow the so-called "GT-AG" rule: a typical intron starts with the dinucleotide "GT" and ends with the dinucleotide "AG". The probability that an arbitrarily selected site has these motifs at both ends is 1/256. Thus, if an intron motif is found in the genomic sequence when the frame of the intron is moved forward or backward (i.e., from the 5' end toward the 3' end, or from the 3' end toward the 5' end), the shifted intron is likely to be a genuine intron.

Therefore, also in the present embodiment, it is possible to make corrections to the alignment result using the intron motif rule, if necessary. The procedure is described below.

For example, while maintaining the length (number of bases) of the intron region determined by the alignment process, the position of the intron is shifted back and forth, and the bases at both ends of the intron candidate in the shifted state are examined. If the bases at both ends follow the intron motif, the position of the intron candidate is estimated to be the position of the new intron.
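This correction can be sketched as follows. The code is an illustrative sketch, not the applicants' procedure in detail: the name `shift_to_motif`, the search range `max_shift`, and the preference for the smallest shift are assumptions, and the sketch does not re-check how the shift changes the number of mismatches in the neighboring exons.

```python
def shift_to_motif(genome, intron_start, intron_end, max_shift=3):
    """Slide an intron of fixed length until it begins with GT and ends with AG.

    Coordinates are 1-based inclusive. Shifts are tried in order of increasing
    magnitude (0, -1, +1, -2, +2, ...). Returns the shifted (start, end), or
    None if no position within max_shift satisfies the GT-AG rule.
    """
    for d in sorted(range(-max_shift, max_shift + 1), key=abs):
        s, e = intron_start + d, intron_end + d
        if s < 1 or e > len(genome):
            continue
        candidate = genome[s - 1:e]
        if candidate.startswith("GT") and candidate.endswith("AG"):
            return s, e
    return None

# Hypothetical example: exon AAG, intron GTCCCCAG, exon GTT.
# Greedy extension of the EST "AAGGTT" would place the intron at (6, 13).
genome = "AAGGTCCCCAGGTT"
print(shift_to_motif(genome, 6, 13))
```

In this toy case all three candidate frames give the same matching rate, and only the frame shifted by two bases toward the 5' end satisfies the GT-AG rule.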

[Matching rate]

Once the above-mentioned intron regions have been determined, the matching rate between the genome sequence and the mapped EST region is examined. If the EST sequence has been read correctly, the matching rate between the genomic sequence and an EST sequence encoded in it is 99.9% (the remaining 0.1% reflects differences between individual human genomes). An EST sequence with a low matching rate is therefore considered not to be encoded in the genome sequence. In fact, EST sequences with low matching rates either contain a large number of misread nucleotides or are not encoded in the genome sequence. Based on the applicants' results, the lower bound of the matching rate was set to a value between 80% and 90%.
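The matching rate can be computed directly from the alignment f. The following is a minimal sketch under assumptions: the function name `matching_rate` is hypothetical, unaligned bases are represented by None and counted as non-matching, and the cut-off of 0.90 is just one value from the 80% to 90% range mentioned above.

```python
def matching_rate(est, genome, f):
    """Fraction of EST bases that equal the genome base they are aligned to.

    f[k] is the 1-based genome position aligned to EST base k+1, or None if
    that base could not be aligned (counted as a non-match).
    """
    matches = sum(1 for base, pos in zip(est, f)
                  if pos is not None and genome[pos - 1] == base)
    return matches / len(est)

genome = "TTACGTAAGTCCCCAGCCATGGTT"               # hypothetical toy genome
est = "ACGTAACCATGC"                              # last base misread (C instead of G)
f = [3, 4, 5, 6, 7, 8, 17, 18, 19, 20, 21, 22]    # alignment skipping an 8-base intron
rate = matching_rate(est, genome, f)
accepted = rate >= 0.90                           # hypothetical lower bound
print(round(rate, 3), accepted)
```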

[Solution for two or more alignments]

A solution for the case in which an EST aligns to two or more locations in the genomic sequence is described. This case arises because ESTs are often aligned to many different regions in a chromosome as a result of retro-transposition or gene duplication. To solve this problem, several pairs of start and end points are aligned.

In step 1.1 and step 1.2 (steps 1101 to 1106 in FIG. 11), the start and end points of the EST were estimated. If there are many candidates, however, the start and end points are not uniquely determined. In this case, the set of start point candidates found in the map table in step 1.1 (see steps 1101 to 1103 in FIG. 11) is taken as the start set, and the set of end point candidates found in step 1.2 (steps 1104 to 1106 in FIG. 11) is taken as the end set. The algorithm is described below.

[Step 1.3]

If the distance between "start" and "end" (start ∈ start set, end ∈ end set) is smaller than a predetermined size (for example, 1,000,000 bp), steps 2.1 to 2.3 (the processing of FIG. 12) are executed to align the part of the EST sequence lying between the two anchors with G[start, end]. Although this method appears to be brute force, it can properly align a given EST.
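The pairing of candidates in step 1.3 can be sketched as follows. This is an illustration only: the name `candidate_pairs` is hypothetical, the requirement that the end lie downstream of the start is an added assumption, and the default span simply reuses the 1,000,000 bp example given above.

```python
def candidate_pairs(start_set, end_set, max_span=1_000_000):
    """Pair every start candidate with every end candidate that lies downstream
    of it and closer than max_span; each pair is then aligned separately."""
    return [(s, e)
            for s in sorted(start_set)
            for e in sorted(end_set)
            if 0 < e - s < max_span]

# Hypothetical candidate positions for an EST that maps to two loci.
start_set = {100, 5_000_000}
end_set = {2_500, 5_002_000, 9_000_000}
pairs = candidate_pairs(start_set, end_set)
print(pairs)
```

Each surviving pair would then be passed to steps 2.1 to 2.3, and the matching rate of each result could be used to decide which alignments to keep.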

As described above, according to the present embodiment, by referring to the map table, first, the positions of the start and end of the EST array are specified, and then the position of the EST array sandwiched between the start and end is determined. Align the rest with the corresponding subsequence in the genome sequence. Is mentoring. In this alignment, referring to the map template, the exon is first extended from the length of the primary key, and the exon is extended from the position where the nucleotides of the EST sequence and the genomic sequence do not match until the next exon appears. Subsequences in nom array G can be skipped. This makes it possible to realize alignment at higher speed.

The present invention is not limited to the above embodiments. Various modifications are possible within the scope of the invention described in the claims, and needless to say, these are also included in the scope of the present invention.

For example, in the above embodiment, the alignment correction method using the intron motif shifts the intron back and forth while maintaining its length, and examines the bases at the ends of the intron candidate at each shifted position. However, the invention is not limited to such a method.
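The shifting correction described above can be sketched as follows. This is an illustration under stated assumptions: the function name `shift_intron_to_motif` is hypothetical, the motif rule is assumed to be the common GT...AG rule (intron begins with GT and ends with AG), the search window `max_shift` is an arbitrary value, and the check that the shifted exon bases still match the EST is omitted for brevity.

```python
def shift_intron_to_motif(genome: str, start: int, end: int,
                          max_shift: int = 5,
                          donor: str = "GT", acceptor: str = "AG"):
    """Shift the intron genome[start:end] without changing its length.

    Offsets are tried in order of increasing distance from the original
    position. Returns the first (start, end) whose two ends obey the
    motif rule, or None if no offset within max_shift does.
    """
    for d in sorted(range(-max_shift, max_shift + 1), key=abs):
        s, t = start + d, end + d
        if s < 0 or t > len(genome):
            continue  # shifted candidate falls outside the genome
        if genome[s:s + 2] == donor and genome[t - 2:t] == acceptor:
            return s, t
    return None


genome = "TT" + "ACGTACG" + "GTAAAAAG" + "CCGTTAGC" + "TT"
# The true intron is genome[9:17]; the candidate below is off by one base.
print(shift_intron_to_motif(genome, 10, 18))  # (9, 17)
```

Preferring the smallest shift keeps the corrected intron as close as possible to the boundary found by the alignment itself.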

Further, by using the present invention, it is also possible to evaluate how accurately the EST sequence has been read by examining the matching ratio.

Furthermore, in this specification, the function of one means may be realized by two or more physical means, or the functions of two or more means may be realized by one physical means.

ADVANTAGE OF THE INVENTION

According to this invention, it becomes possible to provide a technique that can align an EST sequence at high speed by an algorithm that inserts only a small number of long gaps.

Industrial applications

For example, by mapping cDNA, a more accurate gene map can be created. This makes it possible to know the number of human genes, the entire region of each promoter, and the intron-exon structure. In the field of medicine, there are many cases where genes related to diseases are known. Creating a complete genetic map allows such genes to be localized.

Claims

1. A method of mapping an EST sequence to a genome sequence, comprising:

Generating a map table indicating a position where each of the primary keys of a predetermined length appears in the genome sequence by referring to the genome sequence;

Referring to the map table, finding a position corresponding to a prefix of an EST sequence in the genome sequence, and aligning the prefix;

Referring to the map table, finding a position corresponding to a suffix of an EST sequence in the genomic sequence, and aligning the suffix;

Aligning a first subsequence, located between the prefix and the suffix in the EST sequence, with a second subsequence, located in the genomic sequence between the sequence corresponding to the aligned prefix and the sequence corresponding to the aligned suffix.

2. The method according to claim 1, wherein the step of aligning the second subsequence with respect to the first subsequence comprises:

Extending an exon, which is an aligned region in the EST; and

Skipping an intron.

3. The method according to claim 1 or 2, wherein the step of aligning the second subsequence with respect to the first subsequence comprises:

Extending the prefix as long as the elements in the EST sequence and the elements in the genome sequence match;

If the elements do not match, identifying the extended prefix as an exon, using a sequence of a predetermined length following the prefix in the EST sequence as a primary key, referring to the map table to find a position corresponding to the primary key in the genome sequence, and aligning the sequence of the predetermined length;

Extending the sequence of the predetermined length as long as the elements in the EST sequence and the elements in the genome sequence match; and

Repeating the step of extending the sequence of the predetermined length and the step of aligning the subsequent sequence of the predetermined length, thereby aligning the EST sequence with the genomic sequence.

4. The method according to claim 1, wherein the step of aligning the prefix comprises:

Referring to the map table to identify a position where a sequence of a predetermined length, located at a predetermined position from the front of the EST sequence, is found in the genome sequence; and

Aligning a first suspended portion, which is a subsequence positioned before the predetermined position, with a subsequence located before the identified position in the genome sequence.

5. The method according to claim 1, wherein the step of aligning the suffix comprises:

Referring to the map table to identify a position where a sequence of a predetermined length, located at a predetermined position from the back of the EST sequence, is found in the genome sequence; and

Aligning a second suspended portion, which is a subsequence positioned behind the predetermined position, with a subsequence located after the identified position in the genome sequence.

6. The method according to claim 3, further comprising:

Extending the prefix as long as the element in the EST sequence and the element in the genomic sequence match;

If the elements do not match, identifying the extended prefix as an exon, and identifying a position where a sequence of a predetermined length, located at a predetermined position in the EST sequence, is found in the genome sequence; and

Aligning a third suspended portion, which is a subsequence positioned between the end of the prefix and the predetermined position, with a subsequence located in the genome sequence between the position corresponding to the prefix and the identified position.

7. The method according to any one of claims 2 to 6, further comprising:

Referring to elements within a predetermined range from each end of each subsequence determined to be an intron, and finding elements that conform to the intron motif rule; and

Modifying the intron so that the found elements become its ends.

8. The method according to claim 7, wherein the step of finding the elements comprises:

Identifying a new intron candidate by shifting the intron back and forth while retaining the number of elements of the subsequence determined to be the intron; and

Determining whether each end of the new intron candidate obeys the intron motif rule.

9. The method according to any one of claims 1 to 8, further comprising a step of calculating an EST sequence matching ratio from the result of the alignment.

10. A computer-readable program for operating a computer to map an EST sequence to a genomic sequence, the program causing the computer to execute the steps of:

Generating a map table indicating a position where each of the primary keys of a predetermined length appears in the genome sequence by referring to the genome sequence;

Referring to the map table, finding a position corresponding to a prefix of an EST sequence in the genomic sequence, and aligning the prefix;

Referring to the map table, finding a position corresponding to a suffix of an EST sequence in the genome sequence, and aligning the suffix; and

Aligning a first subsequence, located between the prefix and the suffix in the EST sequence, with a second subsequence, located in the genomic sequence between the sequence corresponding to the aligned prefix and the sequence corresponding to the aligned suffix.

11. The program according to claim 10, wherein, in aligning the second subsequence with respect to the first subsequence, the computer is caused to execute the steps of:

Extending an exon, which is an aligned region in the EST; and

Skipping an intron.

12. The program according to claim 10 or 11, wherein, in aligning the second subsequence with respect to the first subsequence, the computer is caused to execute the steps of:

Extending the prefix as long as the element in the EST sequence and the element in the genome sequence match;

If the elements do not match, identifying the extended prefix as an exon, using a sequence of a predetermined length following the prefix in the EST sequence as a primary key, referring to the map table to find a position corresponding to the primary key in the genome sequence, and aligning the sequence of the predetermined length; and

Extending the sequence of the predetermined length and aligning the subsequent sequence of the predetermined length, these steps being executed repeatedly to realize the alignment of the EST sequence with the genomic sequence.

13. The program according to any one of the preceding claims, wherein, in the step of aligning the prefix, the computer is caused to execute the steps of:

Referring to the map table to identify a position where a sequence of a predetermined length, located at a predetermined position from the front of the EST sequence, is found in the genome sequence; and

Aligning a first suspended portion, which is a subsequence positioned before the predetermined position, with a subsequence located before the identified position in the genome sequence.

14. The program according to any one of the preceding claims, wherein, in the step of aligning the suffix, the computer is caused to execute the steps of:

Referring to the map table to identify a position where a sequence of a predetermined length, located at a predetermined position from the back of the EST sequence, is found in the genome sequence; and

Aligning a second suspended portion, which is a subsequence positioned behind the predetermined position, with a subsequence located after the identified position in the genome sequence.

15. The program according to claim 12 or 13, wherein, in aligning the first subsequence with the second subsequence, the computer is caused to execute the steps of:

If the elements do not match, identifying the extended prefix as an exon, and identifying a position where a sequence of a predetermined length, located at a predetermined position ahead of the prefix in the EST sequence, is found in the genome sequence; and

Aligning a third suspended portion, which is a subsequence positioned between the end of the prefix and the predetermined position, with a subsequence located in the genome sequence between the position corresponding to the prefix and the identified position.

16. The program according to any one of claims 11 to 15, further causing the computer to execute the steps of:

Referring to elements within a predetermined range from each end of the subsequence determined to be an intron, and finding an element that complies with the intron motif rule; and

Modifying the intron so that the found element becomes its end.

17. The program according to claim 16, wherein, in the step of finding the element, the computer is caused to execute the step of:

Determining whether each end of the new intron candidate complies with the intron motif rule.

18. The program according to any one of claims 10 to 17, further comprising causing the computer to execute a step of calculating a matching ratio of an EST sequence from the result of the alignment.