WO2010119783A1 - ペア文字列検索システム - Google Patents
ペア文字列検索システム Download PDFInfo
- Publication number
- WO2010119783A1 WO2010119783A1 PCT/JP2010/056146 JP2010056146W WO2010119783A1 WO 2010119783 A1 WO2010119783 A1 WO 2010119783A1 JP 2010056146 W JP2010056146 W JP 2010056146W WO 2010119783 A1 WO2010119783 A1 WO 2010119783A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- character string
- index
- pair
- processing unit
- sequence
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- the present invention relates to a method for quickly searching two character strings appearing in proximity from large-scale genome sequence data or a large amount of general document data using a computer.
- the present invention relates to a method for specifying a corresponding position on the genome at a high speed with respect to a large amount of DNA sequence data of a massively parallel DNA sequencer obtained by using the pair-end method.
- Non-patent Document 1 a massively parallel DNA sequencer based on a new principle, which is completely different from the conventional capillary DNA sequencer, has appeared.
- a single run of a massively parallel DNA sequencer can read a large number of sequences, reaching tens of millions at a time.
- the pair-end method is used to compensate for the disadvantages (Non-Patent Document 3).
- genomes are sequenced by sequencing several tens of bases from both ends of a large number of genomic sequence fragments that are aligned to a substantially constant length (a few hundred bases to several kilobases). Information on pairs of base sequences (pair-end sequences) that are separated by a substantially constant interval can be obtained.
- the genome mapping calculation for the reference genome sequence is performed on the paired-end sequence data obtained by analyzing the genomic DNA sample by the sequencer, that is, the sequence obtained by the sequencing is located at any position on the reference genome sequence.
- the expected interval such as insertion / deletion between the sample genome and the reference genome Structural polymorphism can be detected.
- the distance between mapping positions is larger than the expected distance, it is considered that there is a deletion between these sequence pairs on the sample / genome side, and conversely, the distance between mapping positions is expected. If it is smaller than the interval, the insertion between the sequence pairs is considered to occur on the sample genome side (Non-patent Document 3).
- mapping position is not uniquely determined in the genome mapping calculation, and many positions may be listed as candidates. Even in such a case, by using the mapping position interval of the paired arrays as a constraint, that is, by using the constraint that these mapping positions are close to each other, can the mapping position as the paired array be uniquely determined? Alternatively, it can be expected that the number of candidates can be narrowed down.
- introns may enter between mapping positions of paired sequences, and the mapping position interval may be increased by the length of the intron. is there. Even in such a case, the restriction that the interval between mapping positions does not exceed the length of the gene region on the genome can be used in mapping calculation as a pair sequence.
- Non-Patent Document 2 When performing genome mapping calculation of a large amount of sequence sequence data (query sequence data) on a large-scale reference genome sequence, the reference genome sequence data is usually indexed in advance. The search for the query sequence can be accelerated by using this index. As an indexing method, a suffix array (Non-Patent Document 2) can be used.
- MAQ is known as software capable of mapping a large amount of pair sequence data of a massively parallel DNA sequencer analyzed by the pair-end method onto a reference genome sequence with high accuracy and high speed (Non-patent Document 6).
- index information is created in advance for a large amount of pair sequence data, and mapping position candidates are searched by referring to the index information while scanning the reference genome sequence.
- the candidate for the position to be mapped on the plus strand of the genome sequence is assigned to each query sequence in the scanning process.
- mapping position is obtained at high speed using an index independently for each sequence
- a combination satisfying the space constraint may be selected from the combinations of the mapping positions.
- the sequence length is relatively short
- mapping positions are obtained independently for two pairs of sequences without using space constraints
- a large number of candidate positions are obtained first, after which A small number satisfying the space constraint is selected from among the above, and the processing speed may be reduced.
- an oversight may occur depending on their relative positional relationship.
- An object of the present invention is that when a large-scale reference genome sequence data and a large amount of pair-end sequence sequence data (query sequence data) are given, the mapping position interval between the pairs is equal to or less than a predetermined value. It is to obtain a combination of mapping positions of such pair arrangements at high speed.
- the entire genome sequence has information equivalent to that of a normal suffix array. Furthermore, even for partial regions of the genome restricted at various scales, index information restricted to that region (local index information) Index of the genome sequence data is performed so that Further, a method is used in which the index information for a certain region of the genome is divided into index information for each partial region when the region is divided into two. For paired sequences, first, the entire genome region is searched independently for each sequence to obtain index information, and then, by dividing into two until the length of the region becomes 1, those sequences are obtained. The position coordinates on the genome for the index are gradually determined in detail. In the process, the distance restriction between pairs is taken into consideration, and index information that does not satisfy the restriction is removed.
- a pair character string search system includes a storage device that stores a reference character string, a first character string and a second character string that form a pair, and a distance between the first character string and the second character string.
- An input device for inputting an upper limit value, a suffix array construction processing unit for constructing a suffix array of reference character strings, and an index in the suffix array independently of the first character string and the second character string
- An index area search processing unit that calculates a region, an LSA construction processing unit that constructs index information LSA that can be localized, an index region localization processing unit, a distance evaluation processing unit between index regions, and a distance within an upper limit value
- An output device that outputs a position of the paired first character string and the second character string on the reference character string, and the index area localization processing unit Distribution, and the results of evaluation by inter index range distance evaluation processing unit, removes the information of the index region which can not make a pair at a distance of more than the upper limit.
- the LSA construction processing unit uses the suffix array of the reference character string as an initial block, and divides the suffix array into two blocks corresponding to the first and second partial regions of the reference character string according to whether the most significant bit is 0 or 1 Then, the arrangement of the most significant bits referred to at that time is used as the first column of the LSA, and the obtained two blocks are further divided into two blocks according to whether the second bit is 0 or 1. As a result, the entire reference genome sequence is divided into four blocks corresponding to the partial region to be divided into four, the second-order bits referenced at that time are arranged as the second column of LSA, and so on. The process is repeated until the length of 1 becomes 1, and in the process, each column of the LSA is calculated to construct the LSA.
- the index area localization processing unit uses the index area in the suffix array for each of the first character string and the second character string as the index area in the initial block, and corresponds to the first half and second half partial areas of the reference character string.
- the index area in the initial block is divided into the index areas in each block using the information in the first column of the LSA, and the entire reference character string is divided into four
- the index area in the two blocks is divided into the index areas in the four blocks using the information in the second column of the LSA corresponding to the division into the four blocks corresponding to, and the same division process is repeated thereafter.
- a calculation is performed to localize the index area in the initial block to the index area in the partial area of the reference character string.
- the index area localization processing unit applies a rank function to the blocks in each column of the LSA, and calculates the position of the index area in the two blocks obtained by dividing the block.
- the inter-index-area-distance evaluation processing unit assumes that the position coordinates on the reference character string of the index area localized in the partial area leave an indeterminate range that is only the length of the partial area. The upper limit of the distance between the index areas of the converted first character string and second character string is evaluated.
- the high-speed operation is performed while considering the restriction of the mapping position interval.
- Index search mapping calculation
- Example 1 a method for obtaining a mapping position close to a specified interval or less when a pair of reference genome sequence data and a query sequence is given will be described.
- the mapping between the query sequence and the genome sequence is considered to be an exact match, but application to the incomplete match that allows a mismatch of several bases is possible by simple analogy.
- only the upper limit of the interval between the mapping positions of the pair array is considered, but simultaneously, the lower limit of the interval between the mapping positions of the pair array can be considered by simple analogy.
- FIG. 1 is a diagram illustrating a configuration example of a system that performs paired array mapping calculation according to the present embodiment.
- Reference numeral 101 denotes a computer in which this system is operating, which reads the reference genome sequence data 103 stored in the external storage device 102 and analyzes the DNA sample 115 using the DNA sequencer 116, and the pair sequence data obtained A (query sequence pair) 107 is read via the input device 106, and a parameter 114 (upper limit value D between pairs) specified by the user is read via the input device 118.
- the computer 101 uses the suffix array construction processing unit 105 to connect the base sequences in the reference genome sequence data 103 via the delimiter $, calculates the suffix array, and stores it in the main memory 104.
- the LSA construction processing unit 108 constructs index information LSA (localized suffix array) based on the suffix array.
- the index area search processing unit 109 searches the suffix array independently for each of the paired query sequences, and obtains index area information.
- the index area localization processing unit 110 localizes (subdivides) an index area that satisfies the restriction of the mapping position interval between pairs based on the index area information of the query sequence and the LSA of the reference genome sequence. Then, the position of the index region on the genome is refined from a state where it spreads throughout the initial genome to a state where it is finally determined in detail at the base level.
- mapping position information (search result 113) to the output device 112.
- the mapping position information (search result 113) of the pair sequence includes information on the chromosome name, its direction (+/- which strand) and coordinates indicating the appearance position.
- Non-patent Document 2 In order to construct a suffix array from a reference genome sequence, a known method such as Manber-Myers may be used (Non-patent Document 2). Further, in order to obtain the index information of the query sequence, that is, the index range in which the query sequence appears as a suffix prefix in the suffix array, a general binary search method may be used.
- each suffix is represented by an integer not less than 0 and less than 2 to the power of h.
- An empty string is the final suffix.
- the index of the suffix corresponding to the first half of the genome is represented by an integer with the most significant bit (h bit from the least significant bit) being 0 in binary notation, and the index of the suffix corresponding to the second half of the genome is 2 When expressed in hexadecimal, the most significant bit (hth bit from the least significant bit) is represented by an integer of 1.
- the suffix array B for the whole genome region is divided into two by the most significant bit being 0 or 1 while maintaining the internal relative order, the suffix array subsequence B0 corresponding to the first half of the genome, A subsequence B1 of the suffix array corresponding to the second half of the genome is obtained.
- the same sequence as the query sequence appears scattered on the first half 205 of the genome sequence and the second half 207 of the genome sequence as shown by 206 and 208, respectively.
- 210 and 212 appear together in B0 and B1, respectively.
- the same bisection can be performed by checking whether the value of the second bit from the top (h-1 bit from the bottom) of the index is 0 or 1. , Thereby dividing the whole genome into four parts. Subsequently, 8 divisions, 16 divisions, and the like are possible, and the division can be repeated recursively until the length of the area finally reaches 1.
- Reference numeral 301 represents the entire LSA
- 302 represents the most significant bit of the index in B that was referenced when the entire genome region was divided into the first half and the latter half in the first two divisions.
- the next four divisions that is, in the two divisions of the above two areas, the second bit from the higher order of the indexes referenced in B0 and B1 is arranged.
- 304 is a table in which the third bit from the top of the index referenced from each area after the four divisions is arranged vertically.
- the fourth and subsequent columns are similarly configured for 16 divisions and thereafter.
- FIG. 4 is a diagram for explaining a method for evaluating the distance on the genome between index regions included in different blocks in the inter-index region distance evaluation processing unit 117. As will be described later, this evaluation method is used when configuring the index region localization processing unit 110.
- 401 represents a block. Assume that index areas 402 and 403 for two query sequences p and q are included in different blocks 404 and 405. The position coordinates on the genome of these index regions are fixed, leaving uncertain ranges as indicated by 406 and 407. The block length is calculated based on the total length of the genome and the number of divisions. Therefore, the upper limit 409 and the lower limit 408 of the distance between these index regions can be calculated by counting the number of blocks between the block 404 and the block 405.
- FIG. 5 shows that when index regions in the whole genome for two query sequences p and q are given, the position coordinates on the genome sequence as a pair sequence recursively divide the genome region in the index region localization processing unit 110. It is a figure explaining how it decides while reducing uncertainty by repeating automatically.
- Reference numeral 501 denotes a suffix array for the entire genome region (initial block).
- Reference numeral 502 denotes a partial array of the suffix array corresponding to the first and second half blocks when the whole genome is divided into two.
- Reference numeral 503 denotes a partial array of suffix arrays corresponding to each block when the whole genome is divided into four.
- reference numerals 504 and 505 are obtained by vertically arranging the partial columns of the suffix array corresponding to each block when the whole genome is divided into 8 and 16, respectively.
- Reference numeral 506 denotes a case where the block length finally reaches 1. Solid arrows connecting between 501 to 506 indicate that the index area for the query sequence p is detailed by recursive localization processing (block segmentation processing).
- a broken-line arrow connecting between 501 to 506 indicates that the index area for the query sequence q is detailed by the localization process.
- processing that does not satisfy the constraint on the distance between paired sequences is aborted.
- those that do not satisfy the constraint on the distance between the paired sequences there are those in which the order of the paired sequences is reversed, such as 509.
- 510 and 511 remaining up to 506 give the mapping position of the pair sequence.
- Fig. 6 shows the LSA construction processing procedure.
- block B is set as a suffix array for the entire reference genome sequence calculated by the suffix array construction processing unit 105 and stored in the main memory 104.
- B is an array in which indexes (h-bit integers) are arranged to the power of 2h.
- Step 604 is the end determination process.
- the processing of steps 606 and 607 is performed for each block represented by Ba with a character string a of length (i-1) consisting of 0 and 1.
- step 606 for each index (h-bit integer) included in the block Ba, a bit string formed by arranging the i-th bit from the higher order is defined as La.
- each index included in Ba is classified into block Ba0 or Ba1, depending on whether those bits are 0 or 1. At that time, the relative order between the indexes is maintained.
- the value of i is incremented by 1 in step 608 and the process returns to immediately before step 604. If the end condition is satisfied in step 605, LSA LSA is constructed on the main memory 104 in step 609.
- FIG. 7 is an explanatory diagram of a method for constructing an LSA from La.
- the LSA 701 is an array of bits in 2 h rows and h columns. Its first column 702 is equal to L.
- the second column 703 is obtained by connecting L0 and L1.
- the third column 704 is obtained by connecting L00, L01, L10, and L11.
- the fourth column 705 is obtained by connecting L000, L001, L010, L011, L100, L101, L110, and L111. The same applies to the fifth and subsequent columns.
- Reference numeral 801 denotes a list (including an empty character string) of all suffixes of “abracad” and their indexes.
- Reference numeral 802 denotes a suffix array in which they are sorted in lexicographic order.
- Reference numeral 803 denotes a suffix array index value in binary notation, and B corresponding to the entire genome.
- Reference numeral 804 denotes an LSA constructed from B by the method described in FIG. 6, and corresponds to FIG.
- FIG. 9 is an explanatory diagram for explaining a process of constructing an LSA from B corresponding to the entire genome according to the method described in FIG. 6 when the reference genome sequence is “abracad”.
- 901 is B corresponding to the whole genome.
- Reference numeral 902 is a list in which B0 and B1 corresponding to the first half and the second half of the genome are separated by a broken line. When dividing from B to B0 and B1, the most significant bit (bold) of B is referred to. The arrangement of these bits is L0, which is the first column of the LSA indicated by 905.
- reference numeral 903 denotes B00, B01, B10, and B11 corresponding to the four divisions of the genome, which are separated by a broken line.
- L0 and L1 are a sequence of these bits, which is the second column of LSA indicated by 905.
- reference numeral 904 shows B000, B001, B010, B011, B100, B101, B110, and B111 corresponding to 8 divisions of the genome, which are separated by a broken line.
- L00, L01, L10, and L11 are a sequence of these bits, which is the third column of the LSA indicated by 905. At 904, the block length becomes 1, and the recursive division ends here. Therefore, the 8-by-3 bit matrix indicated by 905 is LSA.
- FIG. 10 is a diagram for explaining the internal processing flow of the index area localization process 110.
- the upper limit value D of the distance is read from the user via the input device 106.
- the appearance range (s p , t p ) of the query sequence p in the suffix array B of the entire genome sequence and the query sequence q The appearance range (s q , t q ) is read from the main memory 104.
- B has already been constructed by the suffix array construction processing unit 105, and the values of (s p , t p ) and (s q , t q ) have been calculated by the index area search unit 109, and these are stored in the main memory 104. Is held in.
- information List (p) indicating a candidate list of mapping positions of query sequence p
- information List (q) indicating a candidate list of mapping positions of query sequence q are initialized as shown in the following equation.
- Step 1004 is an end determination process of this iterative process. When the block length becomes 1, it is determined to end.
- step 1005 according to the method described below with reference to FIG. 11, each element belonging to the list List (p) is subdivided to update the list List (p), and similarly each element belonging to List (q) is subdivided. And list List (q) is updated.
- the mapping positions of the index areas of each query sequence are determined with the uncertainty of only the block length, and based on this, the distance between the mapping positions of the index areas belonging to different blocks is determined.
- the upper and lower limits can be evaluated.
- the elements of the lists List (p) and List (q) are evaluated in combination, and the lists List (p) and List (p) such that a pair whose upper limit of the distance between the pairs is not more than D cannot be formed. Delete the element of q).
- the lists List (p) and List (q) are returned to the main memory 104 as answers in step 1007.
- FIG. 11 is a diagram for explaining the processing procedure of the list segmentation processing 1005 for updating the information List_in indicating the candidate list of the mapping position of the query sequence and replying with a new list List_out.
- This processing procedure is the same as the method described with reference to FIG.
- a list List_in is input, and the list List_out is set as an empty list.
- List_in is List (p) or List (q) before being subdivided in step 1005.
- the following processing is performed on each element (Ba, s, t) of the list List_in.
- a represents a character string consisting of 0 and 1
- s and t are integers indicating the index area in the block Ba of the query sequence.
- integers0, t0, s1, and t1 indicating the index areas in the subdivided blocks Ba0 and Ba1 are calculated according to the following equation.
- rank (La, s, 0) rank (La, t, 0)
- t0 rank (La, t, 0)
- s1 rank (La, s, 1)
- t1 rank (La, t, 1)
- rank (La, s, 0) is a rank function and represents the number of 0 appearing before the position of the s-th row in the bit string La in the LSA
- rank (La, s, 1) is also a rank.
- the function represents the number of 1 appearing before the position of the s-th row in the bit string La in the LSA.
- the rank function can be efficiently calculated by a known method (Non-Patent Document 5). The same applies to rank (La, t, 0) and rank (La, t, 1).
- step 1104 depending on whether each bit in the corresponding bit string La in LSA is 0 or 1, (Ba, s, t) is divided into two blocks (Ba0, s0, t0) and (Ba1, s1, t1). And add them to the list List_out.
- step 1105 the obtained list List_out is returned in the main memory 104 as an answer. This becomes List (p) or List (q) after the subdivision in step 1005.
- mapping position candidates for the query sequence are compactly expressed as index areas (two integers s and t) in each block. Further, by repeatedly performing the division of the block and the accompanying division of the index area, the mapping position of the index area can be finally determined in detail up to the base level while gradually reducing the uncertainty range. At that time, the upper limit of the distance between the mapping positions leaving the indeterminate range is evaluated, and the index information for the mapping positions that cannot satisfy the mapping position interval restriction (distance D or less) specified between the pairs is obtained. Remove and avoid useless calculations. Also, by referring to the LSA, the index area division accompanying the block division can be calculated efficiently. A procedure for efficiently building LSA was also given.
- Example 2 In the above example, the method for calculating the genomic mapping of the query sequence obtained by the paired-end method was shown, but the query obtained by sequencing the fragment sequence of the cDNA sequence made from the spliced mRNA.
- the present invention can also be implemented for the problem of calculating genome mapping of sequences.
- FIG. 12 is a diagram illustrating a configuration example of a system that performs mapping calculation including splicing according to the present embodiment.
- Reference numeral 1201 denotes a computer in which this system is operated, and reads the reference genome sequence data 1203 stored in the external storage device 1202 and also obtains the query sequence obtained by analyzing the cDNA sample 1216 using the DNA sequencer 1217.
- the data 1207 is read via the input device 1206, and the parameter 1219 (intron length upper limit value D) designated by the user is read via the input device 1218.
- the computer 1201 uses the suffix array construction processing unit 1205 to connect the base sequences in the reference genome sequence data 1203 via the delimiter $, and calculates the suffix array and stores it in the main memory 1204. Similarly, calculation results of other processing units are also stored in the main memory 1204, and the contents are referred to by each processing unit as necessary.
- the LSA construction processing unit 1208 constructs new index information LSA (localized “suffix” array) based on the suffix array.
- the pair array generation processing unit 1209 divides the query array into two at various positions and creates various pair arrays.
- the index area search processing unit 1210 searches the suffix array independently for each pair array to obtain index information.
- coordinates corresponding to an index satisfying the constraint that the mapping position interval between pairs is equal to or less than the upper limit of the intron length based on the index information of the pair sequence and the LSA of the reference genome sequence. Is determined while gradually localizing from the entire genome. Thereby, the mapping position as a pair arrangement
- the consensus sequence inspection processing unit 1212 examines the genome sequence around the splice site and removes mapping results that do not satisfy the consensus.
- the output processing unit 1213 outputs the mapping position information (search result 1215) to the output device 1214.
- the mapping information (search result 1215) of the spliced sequence includes information on the chromosome name, its direction (+/ ⁇ which strand) and coordinates indicating two separated appearance positions.
- the processing contents of the suffix array construction processing unit 1205, the LSA construction processing unit 1208, the index region search processing unit 1210, the index region localization processing unit 1211, and the like are the same as those in the first embodiment.
- the pair array generation processing unit 1209 specifies in advance a lower limit value of the array length of the pair array obtained by dividing, and generates all pair arrays that satisfy the condition. They are processed as paired sequence data by the index region search processing unit 1210 and the index region localization processing unit 1211, but are interpreted and checked as a spliced mapping by the consensus sequence inspection processing unit 1212.
- the part sandwiched between the genome mapping positions as a pair sequence is interpreted as an intron sequence, and the condition that the intron sequence starts with GT and ends with AG is imposed. Mapping results of pair sequences that do not satisfy this condition are removed.
- mapping position interval between pairs is equal to or less than a predetermined value.
- Example 3 In the above embodiment, the embodiment for the base sequence data composed of A, C, G, T, and N has been described.
- the present invention is also applicable to general character string data that is not limited to base sequence data.
- amino acid fragment sequence data will be described.
- the amino acid sequence is a sequence composed of about 20 kinds of amino acids.
- FIG. 13 is a diagram showing a configuration example of a system that performs calculation for identifying a protein from the amino acid fragment sequence of this example.
- Reference numeral 1301 denotes a computer in which this system is operating.
- the computer 1301 reads the reference protein sequence data 1303 stored in the external storage device 1302 and analyzes the protein sample 1306 using the mass spectrometry system 1305.
- the query sequence data 1307 is read via the input device 1308, and the parameter 1318 (upper limit value D of the protein sequence length) designated by the user is read via the input device 1319.
- the computer 1301 uses the suffix array construction processing unit 1310 to connect the amino acid sequences in the reference protein sequence data 1303 via the delimiter $, calculate the suffix array, and store it in the main memory 1309.
- calculation results of other processing units are also stored in the main memory 1309, and the contents thereof are referred to by each processing unit as necessary.
- Processing contents in the suffix array construction processing unit 1310, the LSA construction processing unit 1311, the index region search processing unit 1312, the index region localization processing unit 1313, and the like are the same as those in the above embodiment.
- FIG. 14 is a diagram illustrating a configuration example of a system that performs calculation for identifying a document in which two search character strings appear close to each other in general document data according to the present embodiment.
- Reference numeral 1401 denotes a computer in which this system is operating, which reads the reference document data 1403 stored in the external storage device 1402, and reads a search character string pair (query) 1407 via the input device 1406. Also, the parameter 1411 (upper limit value D of the distance between pairs) designated by the user is read via the input device 1416.
- the reference document data 1403 is a collection of documents consisting of character string bodies with document names. However, even for general structured documents having a hierarchical structure, simple reference is made. The present invention can be applied.
- the calculator 1401 uses a suffix array construction processing unit 1405 to connect character strings in the document data 1403 via identifiable delimiters, and calculates the suffix array and stores it in the main memory 1404. Similarly, calculation results of other processing units are also stored in the main memory 1404, and the contents thereof are referred to by each processing unit as necessary.
- the LSA construction processing unit 1408 constructs index information LSA (localized suffix array) based on the suffix array.
- the index area search processing unit 1409 searches the suffix array independently for each search character string to obtain index information. In the index area localization processing unit 1410, based on the index information of the search character string and the LSA of the document data, the appearance position in the document with respect to the index that appears close to the document data 1403 is gradually increased from the entire document.
- the document identification processing unit 1412 checks whether two search character strings appear in the same document in the document data.
- the output processing unit 1413 outputs to the output device 1414 a search result 1415 including the document name of the document in which the two search character strings appear close to each other and information on the appearance position of each search character string.
- the occurrence positions of character string pairs such that the interval between the appearance positions between the pairs is equal to or less than a predetermined value shown in the above-described embodiment.
Abstract
Description
本実施例では、参照ゲノム配列データと、クエリー配列のペアが与えられた時、指定された間隔以下に近接したマッピング位置を求めるための方法を説明する。本実施例では、クエリー配列とゲノム配列の間のマッピングは完全一致の場合を考えるが、数塩基のミスマッチを許容するような不完全一致の場合への応用は、容易な類推により可能である。また、本実施例では、ペア配列のマッピング位置の間隔の上限のみの制約を考えるが、同時にペア配列のマッピング位置の間隔の下限の制約を考えることも、容易な類推により可能である。
s0=rank(L,s,0)
s1=rank(L,s,1)
となる。ランク関数は公知の方法で高速に計算することができる(非特許文献5)。同様に、
t0=rank(L,t,0)
t1=rank(L,t,1)
となる。従って、インデクスの最上位ビットによるゲノム全体領域の二分割に際して、クエリー配列に対応するインデクスの範囲(s,t)のゲノム前半とゲノム後半に対応する部分への分割(s0,t0),(s1,t1)は高速に計算される。なお、分割されたうちの一方は空になることもある。
(距離の上限)=((間にあるブロックの数)+2)×(ブロック長)-1
最終の分割でブロック長が1になると、これらの値は一致する。即ち、不確定さを残さずに、インデクスのゲノム上の位置座標が確定する。
List(q)={(B,sq,tq)}
これらのリストは、マッピング位置座標を詳細に決めていくために、以下の反復処理により更新される。ステップ1004は、この反復処理の終了判定処理であり、ブロック長が1になったら終了と判定する。ステップ1005では、図11を用いて以下に説明する方法に従い、リストList(p)に属する各要素を細分化してリストList(p)を更新し、同様にList(q)に属する各要素を細分化してリストList(q)を更新する。
t0=rank(La,t,0)
s1=rank(La,s,1)
t1=rank(La,t,1)
ここで、rank(La,s,0)はランク関数で、LSA内のビット列Laにおいて、s行目の位置より前に現れる0の個数を表し、また、rank(La,s,1)もランク関数で、LSA内のビット列Laにおいて、s行目の位置より前に現れる1の個数を表す。ランク関数は公知の方法で効率的に計算できる(非特許文献5)。rank(La,t,0)とrank(La,t,1)についても同様である。ステップ1104では、LSA内の対応するビット列La内の各ビットが0か1かに応じて、(Ba,s,t)を2つのブロック(Ba0,s0,t0)と(Ba1,s1,t1)に分割して、これらをリストList_outに追加する。ステップ1105では、得られたリストList_outを回答として、主記憶104内に返す。これは、ステップ1005の細分化された後のList(p)又はList(q)となる。
前記実施例では、ペア・エンド法で得られたクエリー配列のゲノム・マッピングの計算法を示したが、スプライシングを受けたmRNAから作られたcDNA配列の断片配列をシーケンシングして得られたクエリー配列のゲノム・マッピング計算する問題に対しても本発明を実施することができる。
前記実施例では、A,C,G,T,Nからなる塩基配列データに対する実施例を説明した。本発明は、塩基配列データとは限らない一般の文字列データに対しても適用可能である。ここでは、アミノ酸断片配列データに対する実施例を説明する。アミノ酸配列は、20種類程のアミノ酸からなる配列である。
前記実施例では、塩基配列やアミノ酸配列などの生物学的な配列データを扱ってきたが、本発明は、一般の文書データへの適用も可能である。
202 クエリー配列と同一の配列が参照ゲノム配列上に出現する位置
204 参照ゲノム全体に対するサフィックス・アレイ内で、クエリー配列に対応するインデクス領域
205 参照ゲノムの前半の配列
206 クエリー配列と同一の配列が参照ゲノムの前半の配列上に出現する位置
207 参照ゲノムの後半の配列
208 クエリー配列と同一の配列が参照ゲノムの後半の配列上に出現する位置
210 参照ゲノムの前半に対するサフィックス・アレイ内で、クエリー配列に対応するインデクス領域
212 参照ゲノムの後半に対するサフィックス・アレイ内で、クエリー配列に対応するインデクス領域
Claims (6)
- 参照文字列を記憶する記憶装置と、
ペアをなす第1の文字列と第2の文字列及び前記第1の文字列と第2の文字列間の距離の上限値を入力する入力装置と、
前記参照文字列のサフィックス・アレイを構築するサフィックス・アレイ構築処理部と、
前記第1の文字列と第2の文字列に対して独立に、前記サフィックス・アレイ内のインデクス領域を計算するインデクス領域検索処理部と、
前記サフィックス・アレイを初期ブロックとして、その最上位ビットが0か1かに従って前記サフィックス・アレイを前記参照文字列の前半と後半の部分領域に対応する2つのブロックに分割し、その際に参照した前記最上位ビットを並べたものをLSAの第一列とし、得られた前記2つのブロックに対して第二位ビットが0か1かに従って更に2つずつのブロックに分割することにより、前記参照ゲノム配列全体を4分割する部分領域に対応する4つのブロックに分割し、その際に参照した前記第二位ビットを並べたものをLSAの第二列とし、以下同様にして前記部分領域の長さが1となるまで処理を繰り返し、その過程でLSAの各列を計算して、局所化可能なインデクス情報LSAを構築するLSA構築処理部と、
前記第1の文字列と第2の文字列の各々に対する前記サフィックス・アレイ内のインデクス領域を前記初期ブロック内のインデクス領域として、前記参照文字列の前半と後半の部分領域に対応する2つのブロックへの分割に対応して、前記初期ブロック内のインデクス領域を前記LSAの第一列の情報を用いてそれぞれのブロック内のインデクス領域に分割し、前記参照文字列全体を4分割する部分領域に対応する4つのブロックへの分割に対応して前記2つのブロック内のインデクス領域を前記LSAの第二列の情報を用いて前記4つのブロック内のインデクス領域に分割し、以下同様な分割処理を繰り返し、前記初期ブロック内のインデクス領域を、前記参照文字列の部分領域内のインデクス領域に局所化する計算を行うインデクス領域局所化処理部と、
前記部分領域に局所化された前記インデクス領域の前記参照文字列上の位置座標は前記部分領域の長さだけの不確定範囲を残すものとして、前記参照文字列の部分領域に局所化された前記第1の文字列と第2の文字列の前記インデクス領域どうしの距離の上限を評価するインデクス領域間距離評価処理部と、
前記上限値以内の距離でペアをなす前記第1の文字列と第2の文字列の前記参照文字列上の位置を出力する出力装置とを有し、
前記インデクス領域局所化処理部は、前記計算の過程で、空となるインデクス領域の情報、及び、前記インデクス領域間距離評価処理部による評価の結果、前記上限値以下の距離でペアを作れないようなインデクス領域の情報を除去することを特徴とするペア文字列検索システム。 - 請求項1に記載のペア文字列検索システムにおいて、前記インデクス領域局所化処理部は、前記LSAの各列のブロックにランク関数を適用して、当該ブロックが分割された2つのブロックでのインデクス領域の位置を計算することを特徴とするペア文字列検索システム。
- 請求項1又は2に記載のペア文字列検索システムにおいて、
前記参照文字列は参照ゲノム配列データであり、
前記ペアをなす第1の文字列と第2の文字列はDNAシーケンサから得られたDNAの第1の塩基配列と第2の塩基配列であることを特徴とするペア文字列検索システム。 - 請求項1又は2に記載のペア文字列検索システムにおいて、
前記参照文字列は参照ゲノム配列データであり、
前記ペアをなす第1の文字列と第2の文字列は、DNAシーケンサから得られたcDNAの塩基配列を2分割して生成された第1の塩基配列と第2の塩基配列のペアであり、
前記上限値はイントロン長の上限値であり、
前記参照ゲノム配列データ内のスプライス・サイト周辺でコンセンサス配列の検査を行うコンセンサス検査処理部を有し、
前記コンセンサス検査処理部は、前記第1の塩基配列と第2の塩基配列に挟まれた部分をイントロン配列と解釈し、スプライス・サイト周辺にコンセンサス配列が現れないような検索結果を除去することを特徴とするペア文字列検索システム。 - 請求項1又は2に記載のペア文字列検索システムにおいて、
蛋白質同定処理部を有し、
前記参照文字列は複数の参照蛋白質配列データであり、
前記ペアをなす第1の文字列と第2の文字列は、質量分析システムから得られた断片化された蛋白質の2つのアミノ酸配列であり、
前記サフィックス・アレイ構築処理部は、前記複数の参照蛋白質配列データ内のアミノ酸配列を区切り文字を介して連結してそのサフィックス・アレイを計算し、
前記蛋白質同定処理部は前記ペアをなす2つのアミノ酸配列を含む蛋白質を同定し、
前記出力装置は、同定された蛋白質の名称と前記ペアをなす2つのアミノ酸配列の出現位置の情報を出力することを特徴とするペア文字列検索システム。 - 請求項1又は2に記載のペア文字列検索システムにおいて、
文書同定処理部を有し、
前記参照文字列は複数の文書データを識別可能な区切り文字を介して連結した参照文書データであり、
前記文書同定処理部は、前記ペアをなす第1の文字列と第2の文字列が同一の文書に出現しているかを検査し、
前記出力装置は、前記ペアをなす第1の文字列と第2の文字列が含まれる文書の文書名と前記第1の文字列及び第2の文字列の出現位置の情報を出力することを特徴とするペア文字列検索システム。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/264,037 US8788522B2 (en) | 2009-04-13 | 2010-04-05 | Pair character string retrieval system |
JP2011509261A JP5279897B2 (ja) | 2009-04-13 | 2010-04-05 | ペア文字列検索システム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009-097270 | 2009-04-13 | ||
JP2009097270 | 2009-04-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010119783A1 true WO2010119783A1 (ja) | 2010-10-21 |
Family
ID=42982445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/056146 WO2010119783A1 (ja) | 2009-04-13 | 2010-04-05 | ペア文字列検索システム |
Country Status (3)
Country | Link |
---|---|
US (1) | US8788522B2 (ja) |
JP (1) | JP5279897B2 (ja) |
WO (1) | WO2010119783A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012203456A (ja) * | 2011-03-23 | 2012-10-22 | Hitachi Ltd | 文書検索システム、文書検索方法、及びプログラム |
WO2014132497A1 (ja) * | 2013-02-28 | 2014-09-04 | 株式会社日立ハイテクノロジーズ | データ解析装置、及びその方法 |
JP2015219856A (ja) * | 2014-05-21 | 2015-12-07 | 株式会社日立製作所 | 解析装置、データベース作成方法、およびシステム |
CN108920483A (zh) * | 2018-04-28 | 2018-11-30 | 南京搜文信息技术有限公司 | 基于后缀数组的字符串快速匹配方法 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9201916B2 (en) * | 2012-06-13 | 2015-12-01 | Infosys Limited | Method, system, and computer-readable medium for providing a scalable bio-informatics sequence search on cloud |
WO2014145503A2 (en) * | 2013-03-15 | 2014-09-18 | Lieber Institute For Brain Development | Sequence alignment using divide and conquer maximum oligonucleotide mapping (dcmom), apparatus, system and method related thereto |
US11150871B2 (en) * | 2017-08-18 | 2021-10-19 | Colossio, Inc. | Information density of documents |
CN114943021B (zh) * | 2022-07-20 | 2022-11-08 | 之江实验室 | 一种tb级增量数据筛选方法和装置 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259646A (ja) * | 1999-03-05 | 2000-09-22 | Ricoh Co Ltd | 情報索引装置 |
JP2001243238A (ja) * | 2000-02-29 | 2001-09-07 | Ricoh Co Ltd | 索引作成装置および文字列照合装置、索引作成方法および文字列照合方法ならびに記憶媒体 |
-
2010
- 2010-04-05 WO PCT/JP2010/056146 patent/WO2010119783A1/ja active Application Filing
- 2010-04-05 JP JP2011509261A patent/JP5279897B2/ja not_active Expired - Fee Related
- 2010-04-05 US US13/264,037 patent/US8788522B2/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000259646A (ja) * | 1999-03-05 | 2000-09-22 | Ricoh Co Ltd | 情報索引装置 |
JP2001243238A (ja) * | 2000-02-29 | 2001-09-07 | Ricoh Co Ltd | 索引作成装置および文字列照合装置、索引作成方法および文字列照合方法ならびに記憶媒体 |
Non-Patent Citations (3)
Title |
---|
HENG LI ET AL.: "Mapping short DNA sequencing reads and calling variants using mapping quality scores", GENOME RESEARCH, vol. 18, November 2008 (2008-11-01), pages 1851 - 1858 * |
KOICHI KIMURA: "Cho Heiretsu DNA Sequencer Muke Kosoku?Koseido Genome Mapping Shuho", SHOROKU CD SEIKAGAKU, 4 December 2008 (2008-12-04) * |
KOUICHI KIMURA ET AL.: "LOCALIZED SUFFIX ARRAY AND ITS APPLICATION TO GENOME MAPPING PROBLEMS FOR PAIRED-END SHORT READS", GENOME INFORMATICS 2009, vol. 23, 16 December 2009 (2009-12-16), pages 60 - 71 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012203456A (ja) * | 2011-03-23 | 2012-10-22 | Hitachi Ltd | 文書検索システム、文書検索方法、及びプログラム |
WO2014132497A1 (ja) * | 2013-02-28 | 2014-09-04 | 株式会社日立ハイテクノロジーズ | データ解析装置、及びその方法 |
JP5985040B2 (ja) * | 2013-02-28 | 2016-09-06 | 株式会社日立ハイテクノロジーズ | データ解析装置、及びその方法 |
JP2015219856A (ja) * | 2014-05-21 | 2015-12-07 | 株式会社日立製作所 | 解析装置、データベース作成方法、およびシステム |
CN108920483A (zh) * | 2018-04-28 | 2018-11-30 | 南京搜文信息技术有限公司 | 基于后缀数组的字符串快速匹配方法 |
Also Published As
Publication number | Publication date |
---|---|
JP5279897B2 (ja) | 2013-09-04 |
JPWO2010119783A1 (ja) | 2012-10-22 |
US20120041977A1 (en) | 2012-02-16 |
US8788522B2 (en) | 2014-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5279897B2 (ja) | ペア文字列検索システム | |
Baeza-Yates and G. Navarro | Faster approximate string matching | |
Lam et al. | Compressed indexing and local alignment of DNA | |
Eskin et al. | Finding composite regulatory patterns in DNA sequences | |
WO2017120128A1 (en) | Systems and methods for adaptive local alignment for graph genomes | |
CA2424031C (en) | System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map | |
JP5183155B2 (ja) | 大量配列の一括検索方法及び検索システム | |
US20050227278A1 (en) | Recursive categorical sequence assembly | |
WO2006130947A1 (en) | A method of syntactic pattern recognition of sequences | |
US20060218138A1 (en) | System and method for improving search relevance | |
WO2015123269A1 (en) | System and methods for analyzing sequence data | |
US7467047B2 (en) | Method of discovering patterns in symbol sequences | |
EP2963575B1 (en) | Data analysis device and method therefor | |
Kumar et al. | Fast and memory efficient approach for mapping NGS reads to a reference genome | |
Sagot et al. | Identifying satellites and periodic repetitions in biological sequences | |
US7085651B2 (en) | Method and device for assembling nucleic acid base sequences | |
Goad | Computational analysis of genetic sequences | |
CN115662523A (zh) | 面向群体基因组索引表示与构建的方法及设备 | |
KR100538451B1 (ko) | 분산 컴퓨팅 환경에서의 유전자 및 단백질 유사서열 검색시스템 및 그 방법 | |
He | Using suffix tree to discover complex repetitive patterns in DNA sequences | |
Šrámek et al. | On-line Viterbi algorithm for analysis of long biological sequences | |
JP2003256433A (ja) | 遺伝子構造解析方法およびその装置 | |
Lecroq et al. | Sequence indexing | |
CN115168661B (zh) | 原生图数据处理方法、装置、设备及存储介质 | |
CN111916153B (zh) | 一种并行多重序列比对方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10764368 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011509261 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13264037 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 10764368 Country of ref document: EP Kind code of ref document: A1 |