WO2020049748A1 - Specifying method, specifying program, and information processing device - Google Patents
Specifying method, specifying program, and information processing device
- Publication number
- WO2020049748A1 (application PCT/JP2018/033329)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- codon
- sequence data
- mutation
- sequence
- type
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- the present invention relates to a specifying method and the like.
- FIG. 35 is a diagram showing the relationship between amino acids, bases, and codons. A sequence of three bases is called a “codon”. The base sequence determines each codon, and each codon in turn determines an amino acid.
- a plurality of codons are associated with one amino acid. Therefore, when the codon is determined, the amino acid is determined. However, even if the amino acid is determined, the codon is not uniquely specified.
- the amino acid “alanine (Ala)” is associated with the codons “GCU”, “GCC”, “GCA”, or "GCG”.
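- as a small illustration of this many-to-one relationship, the sketch below maps the four alanine codons named above to “Ala” and inverts the mapping. The names `CODON_TO_AMINO` and `codons_for` are hypothetical, and the table is deliberately partial: it holds only the codons quoted in the text.

```python
# Partial codon table: only the alanine codons quoted in the text.
CODON_TO_AMINO = {"GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala"}


def codons_for(amino):
    """Return every codon in the partial table that encodes `amino`."""
    return sorted(c for c, a in CODON_TO_AMINO.items() if a == amino)
```

- looking up a codon yields exactly one amino acid, while `codons_for("Ala")` yields four candidates, which is why the codon cannot be recovered from the amino acid alone.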
- FIG. 36 is a diagram showing a score matrix used in the homology search.
- FIG. 37 is a diagram showing an example of a conventional technique for determining a frame shift of a mutation.
- the local alignment is determined in base units using the Smith-Waterman algorithm to improve the accuracy.
- the Smith-Waterman algorithm utilizes equation (1).
- the maximum score F (i, j) of equation (1) is searched for in the matrix of FIG. 37, and traceback is performed from that cell until a cell with a score of 0 is reached.
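- equation (1) is not reproduced in this text. As a rough sketch of the conventional technique, the code below uses the textbook Smith-Waterman recurrence, F(i, j) = max(0, F(i-1, j-1) + s, F(i-1, j) - d, F(i, j-1) - d), followed by the traceback described above. The scoring values (match 2, mismatch -1, gap penalty 1) are assumptions chosen for illustration and stand in for the score matrix of FIG. 36.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=1):
    """Return (best_score, aligned_a, aligned_b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix and remember the cell with the maximum score.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] - gap,     # gap in b
                          F[i][j - 1] - gap)     # gap in a
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the maximum cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

- the matrix has (len(a) + 1) x (len(b) + 1) cells, so the cost grows with the product of the two sequence lengths, which is the running-time problem the following paragraphs refer to.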
- the conventional technique described above has a problem that it takes a long time to determine the frame shift of the mutation or to detect a gene mutation latent after the mutation point. Further, there is a problem that the base sequence must be divided in order to speed up the search (collation).
- in one aspect, an object of the present invention is to provide a specifying method, a specifying program, and an information processing apparatus that can reduce the time required for determining a frame shift of a mutation or detecting a potential gene mutation after a mutation point. Further, in one aspect, an object of the present invention is to provide a specifying method, a specifying program, and an information processing device that can speed up search and analysis without dividing a base sequence.
- the computer executes the following processing.
- the computer acquires reference codon sequence data and analysis target codon sequence data.
- the computer compares a codon included in the acquired reference codon sequence data with a codon included in the acquired codon sequence data to be analyzed for each codon sequence position.
- the computer specifies, based on the result of the comparison, codons located at a plurality of sequence positions subsequent to the sequence position where the codons do not match among the codons included in the codon sequence data to be analyzed.
- the computer refers to a storage unit that stores, in association with the type of mutation occurring in a certain codon of codon sequence data, the codons located at a plurality of sequence positions subsequent to the position of that codon, and thereby specifies the type of mutation associated with the codons located at each of the specified sequence positions.
- FIG. 1 is a diagram (1) illustrating a process performed by the information processing apparatus according to the first embodiment.
- FIG. 2 is a diagram (2) illustrating a process performed by the information processing apparatus according to the first embodiment.
- FIG. 3 is a diagram (3) illustrating a process performed by the information processing apparatus according to the first embodiment.
- FIG. 4 is a diagram (4) illustrating a process performed by the information processing apparatus according to the first embodiment.
- FIG. 5 is a functional block diagram illustrating the configuration of the information processing apparatus according to the first embodiment.
- FIG. 6 is a diagram illustrating an example of the data structure of the reference codon sequence data.
- FIG. 7 is a diagram illustrating an example of a data structure of codon sequence data to be analyzed.
- FIG. 8 is a diagram illustrating an example of the data structure of the code conversion table.
- FIG. 9 is a diagram illustrating an example of the data structure of the first array data.
- FIG. 10 is a diagram illustrating an example of the data structure of the second array data.
- FIG. 11 is a diagram illustrating an example of the data structure of the insertion transition table.
- FIG. 12A is a diagram showing the data structure of the transition table 50U of the insertion transition table.
- FIG. 12B is a diagram showing the data structure of the transition table 50C of the insertion transition table.
- FIG. 12C is a diagram showing the data structure of the transition table 50A of the insertion transition table.
- FIG. 12D is a diagram showing a data structure of a transition table 50G of the insertion transition table.
- FIG. 13 is a diagram illustrating an example of the data structure of the deletion transition table.
- FIG. 14A is a diagram showing the data structure of the transition table 55U of the deletion transition table.
- FIG. 14B is a diagram showing the data structure of the transition table 55C of the deletion transition table.
- FIG. 14C is a diagram showing the data structure of the transition table 55A of the deletion transition table.
- FIG. 14D is a diagram showing the data structure of the transition table 55G of the deletion transition table.
- FIG. 15 is a flowchart illustrating the processing procedure of the information processing apparatus according to the first embodiment.
- FIG. 16 is a diagram (1) illustrating a process performed by the information processing apparatus according to the second embodiment.
- FIG. 17 is a diagram (2) illustrating a process performed by the information processing apparatus according to the second embodiment.
- FIG. 18 is a diagram (3) illustrating a process performed by the information processing apparatus according to the second embodiment.
- FIG. 19 is a functional block diagram illustrating the configuration of the information processing apparatus according to the second embodiment.
- FIG. 20 is a flowchart (1) illustrating the processing procedure of the information processing apparatus according to the second embodiment.
- FIG. 21A is a diagram illustrating an example of a data structure of a codon / amino acid conversion table.
- FIG. 21B is a diagram for explaining another process of the information processing apparatus according to the second embodiment.
- FIG. 22 is a flowchart (2) illustrating the processing procedure of the information processing device according to the second embodiment.
- FIG. 23 is a diagram (1) illustrating a process performed by the information processing apparatus according to the third embodiment.
- FIG. 24 is a diagram (2) illustrating a process performed by the information processing apparatus according to the third embodiment.
- FIG. 25 is a functional block diagram illustrating the configuration of the information processing apparatus according to the third embodiment.
- FIG. 26 is a diagram illustrating an example of a process of hashing an inverted index.
- FIG. 27 is a diagram illustrating an example of a process of restoring an inverted index.
- FIG. 28 is a diagram for explaining the process of the specifying unit according to the third embodiment.
- FIG. 29 is a flowchart (1) illustrating the processing procedure of the information processing apparatus according to the third embodiment.
- FIG. 30 is a flowchart illustrating a process in which the specifying unit according to the third embodiment specifies an offset of a point mutation.
- FIG. 31 is a diagram for explaining another process of the information processing apparatus according to the third embodiment.
- FIG. 32 is a flowchart (2) illustrating the processing procedure of the information processing apparatus according to the third embodiment.
- FIG. 33 is a diagram illustrating an example of a hardware configuration of a computer that realizes the same functions as the information processing apparatuses according to the first and second embodiments.
- FIG. 34 is a diagram illustrating an example of a hardware configuration of a computer that realizes functions similar to those of the information processing apparatus according to the third embodiment.
- FIG. 35 shows the relationship between codons and amino acids.
- FIG. 36 is a diagram showing a score matrix used in the homology search.
- FIG. 37 is a diagram illustrating an example of a conventional technique for determining a frame shift of a mutation.
- FIGS. 1 to 4 are diagrams for explaining processing of the information processing apparatus according to the first embodiment.
- the information processing device specifies the point mutation occurring in the base sequence to be analyzed by performing the following processing. Point mutations include "base insertion", "base deletion", and "base substitution”.
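- information on a normal base sequence expressed in codon units is referred to as “reference codon sequence data”.
- information on the base sequence to be analyzed, expressed in codon units, is referred to as “analysis target codon sequence data” (or “codon sequence data to be analyzed”).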
- the information processing device compares the reference codon sequence data 20A and the analysis target codon sequence data 20B in order of codons from the beginning.
- when comparing the reference codon sequence data 20A with the analysis target codon sequence data 20B, the information processing device identifies that the codons differ from sequence position P20 onward. Thereby, the information processing device determines that a mutation exists in the analysis target codon sequence data 20B.
- in the following, when the reference codon sequence data and the analysis target codon sequence data are compared in order from the beginning, the sequence position at which the codons differ is referred to as the “mutation position”, the codon at that position in the reference codon sequence data is referred to as the “mutated codon”, and the codon at that position in the analysis target codon sequence data is referred to as the “mutation codon”.
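- the codon-by-codon comparison can be sketched as follows. The function name `find_mutation_position` and the sample data are hypothetical; only the codons “GUC”, “CAA”, “GUG”, and “AAG” are taken from the FIG. 2 discussion, and the remaining codons are made up. The naming of the two codons at the mutation position follows the later description of the comparing unit.

```python
def find_mutation_position(reference, target):
    """Return the index of the first codon that differs, or None if none does."""
    for position, (ref_codon, tgt_codon) in enumerate(zip(reference, target)):
        if ref_codon != tgt_codon:
            return position
    return None


# Illustrative data only; "AUG", "GCC" and "UGC" are made-up filler codons.
reference = ["AUG", "GCC", "AAG", "UGC"]
target = ["AUG", "GUC", "CAA", "GUG"]

position = find_mutation_position(reference, target)
mutated_codon = reference[position]   # codon in the reference data
mutation_codon = target[position]     # codon in the data to be analyzed
```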
- FIG. 2 will be described.
- when the information processing device determines that a mutation exists in the analysis target codon sequence data 20B, it identifies the mutation codon and the two subsequent codons from the codons included in the analysis target codon sequence data 20B.
- the two subsequent codons are referred to as the “mutation n codon” (n is an integer of 1 or more) and the “mutation n + 1 codon”. For example, in FIG. 2, when the mutation codon is “GUC”, the mutation 1 codon is “CAA” and the mutation 2 codon is “GUG”.
- the information processing apparatus specifies the codon that followed the mutated codon before the base insertion, based on the insertion transition table 140f and on the mutation n codon and the mutation n + 1 codon that follow the mutation codon.
- this codon next to the mutated codon is referred to as the “mutated n codon (base insertion)”.
- the insertion transition table 140f is a table that associates the two codons following the mutation codon with the one codon next to the mutated codon before base insertion. If the mutated n codon obtained from the insertion transition table 140f matches the codon next to the mutation position in the reference codon sequence data, the point mutation that occurred in the codon sequence data to be analyzed is “base insertion”.
- the mutated n codon corresponding to the mutation n codon “CAA” and the mutation n + 1 codon “GUG” following the mutation codon “GUC” is “AAG”.
- the information processing apparatus finds that the codon “AAG” next to the mutation position P20 in the reference codon sequence data 20A matches the mutated n codon (base insertion) “AAG”. For this reason, the information processing device determines that the point mutation that has occurred in the analysis target codon sequence data 20B is “base insertion”.
- if the two codons do not match, the point mutation that occurred in the codon sequence data to be analyzed is “base deletion” or “base substitution”.
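- the insertion check can be illustrated without a precomputed table. A single inserted base shifts every later base one place to the right, so the codon that originally followed the mutation position is the last two bases of the mutation n codon plus the first base of the mutation n + 1 codon. The sketch below derives that codon directly; this shortcut is an assumption standing in for a lookup in the insertion transition table 140f, and the function names and the filler codons in the sample data are hypothetical.

```python
def mutated_n_codon_before_insertion(mutation_n, mutation_n1):
    """Reconstruct the codon that followed the mutation position before insertion."""
    return mutation_n[1:] + mutation_n1[0]


def is_base_insertion(reference, target, position):
    """True if the reconstructed codon equals the next reference codon.

    Assumes at least two codons follow `position` in `target`.
    """
    restored = mutated_n_codon_before_insertion(target[position + 1],
                                                target[position + 2])
    return restored == reference[position + 1]
```

- with the FIG. 2 codons, `mutated_n_codon_before_insertion("CAA", "GUG")` gives `"AAG"`, the same codon the text reads from the table.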
- the information processing device compares the reference codon sequence data 30A and the analysis target codon sequence data 30B in order from the beginning in codon units.
- when comparing the reference codon sequence data 30A with the analysis target codon sequence data 30B, the information processing apparatus identifies that the codons differ from the mutation position P30 onward. Thereby, the information processing device determines that a mutation exists in the analysis target codon sequence data 30B.
- FIG. 4 will be described.
- when the information processing device determines that a mutation exists in the analysis target codon sequence data 30B, it identifies the mutation codon and the two subsequent codons from the codons included in the analysis target codon sequence data 30B.
- the mutation codon is “UCA”.
- the following two codons are "AGU" and "GCU”.
- the information processing device specifies the second codon following the mutated codon before base deletion based on the deletion transition table 140g and the two codons following the mutation codon.
- the subsequent second codon is referred to as “mutated n + 1 codon (base deletion)”.
- the deletion transition table 140g is a table that associates the mutation codon and the two subsequent codons with the second codon following the mutated codon before base deletion. If the mutated n + 1 codon obtained from the deletion transition table 140g matches the second codon following the mutation position in the reference codon sequence data, the point mutation that has occurred in the codon sequence data to be analyzed is “base deletion”.
- the mutated n + 1 codon before base deletion corresponding to the mutation codon “UCA” and the two following codons “AGU” and “GCU” is “UGC”.
- the information processing apparatus finds that this “UGC” matches the second codon following the mutation position P30 in the reference codon sequence data 30A. Therefore, the information processing device determines that the point mutation that has occurred in the analysis target codon sequence data 30B is “base deletion”.
- here, the deletion is determined using the mutated 2 codon “UGC”.
- the deletion can also be determined using the mutated 1 codon “AAG” and the deletion transition table 140g: from the mutation codon “UCA” and the mutation 1 codon “AGU”, the mutated 1 codon “AAG” can be looked up and compared (in this case, n is an integer of 0 or more).
- if the codons do not match, the point mutation occurring in the codon sequence data to be analyzed is “base insertion” or “base substitution”.
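- the deletion check mirrors the insertion check with the shift reversed. A single deleted base moves every later base one place to the left, so the pre-deletion codon is the last base of one codon plus the first two bases of the next. The sketch below computes this directly as an assumed stand-in for the deletion transition table 140g; the function names and the filler codons “AUG” and “UCC” are hypothetical.

```python
def mutated_codon_before_deletion(previous, following):
    """Reconstruct the pre-deletion codon from two consecutive observed codons."""
    return previous[-1] + following[:2]


def is_base_deletion(reference, target, position):
    """True if the reconstructed codon equals the second reference codon
    after the mutation position.

    Assumes at least two codons follow `position` in both sequences.
    """
    restored = mutated_codon_before_deletion(target[position + 1],
                                             target[position + 2])
    return restored == reference[position + 2]
```

- applied to “AGU” and “GCU” the helper returns “UGC”, and applied to “UCA” and “AGU” (the n = 0 case) it returns “AAG”, matching both lookups described above.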
- the information processing apparatus compares the reference codon sequence data with the codon sequence data to be analyzed on a codon unit basis, and specifies a mismatched codon. Then, based on the two codons following the mismatched mutation codon, the information processing apparatus obtains the codon next to the mutated codon from the insertion transition table 140f and the second codon following the mutated codon from the deletion transition table 140g, and compares them with the corresponding codons following the mutated codon in the reference codon sequence data to identify the type of point mutation. Because the mismatched codon is identified and the type of the mutation is determined in a single pass over the encoded codon units, the time required for determining the type of the mutation can be reduced.
- FIG. 5 is a functional block diagram illustrating the configuration of the information processing apparatus according to the first embodiment.
- the information processing device 100 includes a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 150.
- the communication unit 110 is a processing unit that executes data communication with an external device (not shown) via a network.
- the communication unit 110 is an example of a communication device.
- the information processing apparatus 100 may receive information such as the reference codon sequence data 140a and the analysis target codon sequence data 140b from an external device via a network.
- the input unit 120 is an input device for inputting various types of information to the information processing device 100.
- the input unit 120 corresponds to a keyboard, a mouse, a touch panel, and the like.
- the display unit 130 is a display device that displays various types of information output from the control unit 150.
- the display unit 130 corresponds to an organic EL (electro-luminescence) display, a liquid crystal display, a touch panel, and the like.
- the storage unit 140 has reference codon sequence data 140a, analysis target codon sequence data 140b, a code conversion table 140c, first sequence data 140d, and second sequence data 140e.
- the storage unit 140 has an insertion transition table 140f, a deletion transition table 140g, and a detection result table 140h.
- the storage unit 140 corresponds to a semiconductor memory device such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory (Flash Memory), and a storage device such as an HDD (Hard Disk Drive).
- the reference codon sequence data 140a is information on a normal base sequence indicated in codon units.
- FIG. 6 is a diagram illustrating an example of the data structure of the reference codon sequence data. As shown in FIG. 6, in the reference codon sequence data 140a, a plurality of codons are arranged from a start codon to a stop codon. For example, the start codon is "AUG”. The stop codon is "UGA”.
- the analysis target codon sequence data 140b is information on the base sequence to be analyzed, which is indicated in codon units.
- FIG. 7 is a diagram illustrating an example of a data structure of codon sequence data to be analyzed. As shown in FIG. 7, in the analysis target codon sequence data 140b, a plurality of codons are arranged from a start codon to a stop codon. For example, the start codon is “AUG”. The stop codon is “UGA”.
- the code conversion table 140c is a table that associates codons with codes.
- FIG. 8 is a diagram illustrating an example of the data structure of the code conversion table.
- the codon “UUU” is associated with the code “40h (01000000)”.
- the suffix “h” indicates a hexadecimal number.
- a code obtained by encoding the codon “UUU” is written as “UUU (40h)”.
- the codes of the other codons are likewise shown in parentheses.
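- FIG. 8 is not reproduced here, so the full code conversion table is unknown. The sketch below assumes a simple rule, 40h plus the base-4 value of the codon with bases ordered U, C, A, G. This rule is an inference: it reproduces the seven codes quoted in this text (40h, 5Ah, 73h, 6Bh, 6Ch, 74h, 4Dh), but the codes it gives for other codons are not confirmed by the source. All names are hypothetical.

```python
BASES = "UCAG"  # assumed base order, inferred from the codes quoted in the text


def encode_codon(codon):
    """Return the assumed one-byte code of a codon, e.g. 'UUU' -> 0x40."""
    b1, b2, b3 = (BASES.index(base) for base in codon)
    return 0x40 + 16 * b1 + 4 * b2 + b3


def decode_codon(code):
    """Invert encode_codon."""
    value = code - 0x40
    return BASES[value // 16] + BASES[value // 4 % 4] + BASES[value % 4]


def encode_sequence(codons):
    """Encode a list of codons into a bytes object, one byte per codon."""
    return bytes(encode_codon(codon) for codon in codons)
```

- one byte per codon means two encoded sequences can be compared byte by byte, which is what the comparing unit described later relies on.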
- the first sequence data 140d is sequence data obtained by encoding the reference codon sequence data 140a based on the code conversion table 140c.
- FIG. 9 is a diagram illustrating an example of the data structure of the first array data. As shown in FIG. 9, in the first sequence data 140d, a plurality of encoded codons are arranged from a start codon to a stop codon.
- the second sequence data 140e is sequence data obtained by encoding the analysis target codon sequence data 140b based on the code conversion table 140c.
- FIG. 10 is a diagram illustrating an example of the data structure of the second array data. As shown in FIG. 10, in the second sequence data 140e, a plurality of encoded codons are arranged from a start codon to a stop codon.
- the insertion transition table 140f is a table for associating a mutation n codon and a mutation n + 1 codon following a mutation codon with the mutated n codon before base insertion.
- FIG. 11 is a diagram illustrating an example of the data structure of the insertion transition table. As shown in FIG. 11, the insertion transition table 140f has transition tables 50U, 50C, 50A, and 50G.
- the transition table 50U associates each mutation n codon and each mutation n + 1 codon starting with U with the mutated n codon before base insertion.
- the relationship between the codons is defined using the encoded codons.
- FIG. 12A is a diagram showing the data structure of the transition table 50U of the insertion transition table. The entry at the i-th row and j-th column is the mutated n codon before base insertion that corresponds to the mutation n codon and the mutation n + 1 codon of the i-th row and j-th column.
- the transition table 50C associates each mutation n codon and each mutation n + 1 codon starting with C with the mutated n codon before base insertion.
- the relationship between the codons is defined using the encoded codons.
- FIG. 12B is a diagram showing the data structure of the transition table 50C of the insertion transition table. The entry at the i-th row and j-th column is the mutated n codon before base insertion that corresponds to the mutation n codon and the mutation n + 1 codon of the i-th row and j-th column.
- the transition table 50A associates each mutation n codon and each mutation n + 1 codon starting with A with the mutated n codon before base insertion.
- the relationship between the codons is defined using the encoded codons.
- FIG. 12C is a diagram showing the data structure of the transition table 50A of the insertion transition table. The entry at the i-th row and j-th column is the mutated n codon before base insertion that corresponds to the mutation n codon and the mutation n + 1 codon of the i-th row and j-th column.
- the transition table 50G associates each mutation n codon and each mutation n + 1 codon starting with G with the mutated n codon before base insertion.
- the relationship between the codons is defined using the encoded codons.
- FIG. 12D is a diagram showing the data structure of the transition table 50G of the insertion transition table. The entry at the i-th row and j-th column is the mutated n codon before base insertion that corresponds to the mutation n codon and the mutation n + 1 codon of the i-th row and j-th column.
- the codon corresponding to the mutation n codon “CAA (5Ah)” and the mutation n + 1 codon “GUG (73h)” in the eleventh row and the second column is the mutated n codon before base insertion in the eleventh row and the second column, “AAG (6Bh)”.
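- a possible in-memory form of the insertion transition table is sketched below. Four dictionaries, keyed by the first base of the mutation n + 1 codon, stand in for the transition tables 50U, 50C, 50A, and 50G. The row and column layout of FIGS. 12A to 12D is not reproduced; the dictionary layout, the function names, and the one-byte encoding (40h plus a base-4 value with bases ordered U, C, A, G) are assumptions that agree with the codes quoted in the text.

```python
from itertools import product

BASES = "UCAG"  # assumed base order
CODONS = ["".join(p) for p in product(BASES, repeat=3)]


def encode(codon):
    """Assumed one-byte code: 0x40 plus the base-4 value of the codon."""
    b1, b2, b3 = (BASES.index(base) for base in codon)
    return 0x40 + 16 * b1 + 4 * b2 + b3


def build_insertion_tables():
    """Map (mutation n code, mutation n+1 code) -> mutated n code before insertion.

    The outer key is the first base of the mutation n+1 codon, standing in
    for the split into tables 50U, 50C, 50A and 50G.
    """
    tables = {base: {} for base in BASES}
    for n, n1 in product(CODONS, repeat=2):
        before = n[1:] + n1[0]  # undo the one-base shift caused by insertion
        tables[n1[0]][(encode(n), encode(n1))] = encode(before)
    return tables


insertion_tables = build_insertion_tables()
```

- each of the four dictionaries holds 64 x 16 entries, and `insertion_tables["G"][(0x5A, 0x73)]` returns `0x6B`, the “AAG (6Bh)” entry of the example above.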
- the deletion transition table 140g associates the mutation n codons and each mutation n + 1 codon with the mutated n + 1 codon before base deletion.
- FIG. 13 is a diagram illustrating an example of the data structure of the deletion transition table. As shown in FIG. 13, the deletion transition table 140g has transition tables 55U, 55C, 55A, and 55G.
- the transition table 55U associates the mutation n codons ending in U and each mutation n + 1 codon with the mutated n + 1 codon before base deletion. The relationship between the codons is defined using the encoded codons.
- FIG. 14A is a diagram showing the data structure of the transition table 55U of the deletion transition table. The codon corresponding to a mutation n codon and a mutation n + 1 codon at the i-th row and j-th column shown in FIG. 14A is the mutated n + 1 codon before base deletion at the i-th row and j-th column.
- the codon corresponding to the mutation n codon “AGU (6Ch)” and the mutation n + 1 codon “GCU (74h)” in the fourth row and the fifth column is the mutated n + 1 codon “UGC (4Dh)” in the fifth row and the fourth column.
- the transition table 55C associates the mutation n codons ending in C and each mutation n + 1 codon with the mutated n + 1 codon before base deletion. The relationship between the codons is defined using the encoded codons.
- FIG. 14B is a diagram showing the data structure of the transition table 55C of the deletion transition table. The codon corresponding to a mutation n codon and a mutation n + 1 codon at the i-th row and j-th column shown in FIG. 14B is the mutated n + 1 codon before base deletion at the i-th row and j-th column.
- the transition table 55A associates the mutation n codons ending in A and each mutation n + 1 codon with the mutated n + 1 codon before base deletion. The relationship between the codons is defined using the encoded codons.
- FIG. 14C is a diagram showing the data structure of the transition table 55A of the deletion transition table. The codon corresponding to a mutation n codon and a mutation n + 1 codon at the i-th row and j-th column shown in FIG. 14C is the mutated n + 1 codon before base deletion at the i-th row and j-th column.
- the transition table 55G associates the mutation n codons ending in G and each mutation n + 1 codon with the mutated n + 1 codon before base deletion.
- the relationship between the codons is defined using the encoded codons.
- FIG. 14D is a diagram showing the data structure of the transition table 55G of the deletion transition table.
- the codon corresponding to a mutation n codon and a mutation n + 1 codon at the i-th row and j-th column shown in FIG. 14D is the mutated n + 1 codon before base deletion at the i-th row and j-th column.
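- the deletion transition table can be sketched in the same way, this time keyed by the last base of the mutation n codon to stand in for the transition tables 55U, 55C, 55A, and 55G. As before, the dictionary layout, the names, and the one-byte encoding (40h plus a base-4 value with bases ordered U, C, A, G) are assumptions; the row and column layout of FIGS. 14A to 14D is not reproduced.

```python
from itertools import product

BASES = "UCAG"  # assumed base order
CODONS = ["".join(p) for p in product(BASES, repeat=3)]


def encode(codon):
    """Assumed one-byte code: 0x40 plus the base-4 value of the codon."""
    b1, b2, b3 = (BASES.index(base) for base in codon)
    return 0x40 + 16 * b1 + 4 * b2 + b3


def build_deletion_tables():
    """Map (mutation n code, mutation n+1 code) -> mutated n+1 code before deletion.

    The outer key is the last base of the mutation n codon, standing in
    for the split into tables 55U, 55C, 55A and 55G.
    """
    tables = {base: {} for base in BASES}
    for n, n1 in product(CODONS, repeat=2):
        before = n[-1] + n1[:2]  # undo the one-base shift caused by deletion
        tables[n[-1]][(encode(n), encode(n1))] = encode(before)
    return tables


deletion_tables = build_deletion_tables()
```

- `deletion_tables["U"][(0x6C, 0x74)]` returns `0x4D`, that is, “AGU (6Ch)” and “GCU (74h)” map to “UGC (4Dh)” as in the example for the transition table 55U.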
- the detection result table 140h is a table that holds information on point mutations detected from the analysis target codon sequence data 140b.
- the control unit 150 includes a receiving unit 150a, an encoding unit 150b, a comparing unit 150c, and a specifying unit 150d.
- the control unit 150 can be realized by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like.
- the control unit 150 can also be realized by hard wired logic such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
- the receiving unit 150a is a processing unit that receives the reference codon sequence data 140a and the analysis target codon sequence data 140b from the input unit 120, an external device, or the like.
- the receiving unit 150a registers the reference codon sequence data 140a and the analysis target codon sequence data 140b in the storage unit 140.
- when receiving the insertion transition table 140f and the deletion transition table 140g from the input unit 120, an external device, or the like, the receiving unit 150a registers the insertion transition table 140f and the deletion transition table 140g in the storage unit 140.
- the encoding unit 150b is a processing unit that encodes the reference codon sequence data 140a and the analysis target codon sequence data 140b based on the code conversion table 140c.
- the encoding unit 150b generates the first sequence data 140d by comparing the reference codon sequence data 140a with the code conversion table 140c and encoding each codon.
- the encoding unit 150b generates the second sequence data 140e by comparing the analysis target codon sequence data 140b with the code conversion table 140c, and encoding each codon.
- the encoding unit 150b stores the first array data 140d and the second array data 140e in the storage unit 140.
- as shown in FIG. 8, a one-byte code is assigned to each codon according to the code conversion table 140c. For example, the codon “UUU” is converted to “40h (01000000)”. The encoded codon is referred to as “UUU (40h)”.
- the comparing unit 150c is a processing unit that compares the first sequence data 140d and the second sequence data 140e and specifies a mutation position at which encoded codons do not match. As described above, since a one-byte code is assigned to each codon, the comparing unit 150c repeatedly reads out the codes of the first sequence data 140d and the second sequence data 140e one byte at a time from the beginning and compares them.
- when the comparing unit 150c specifies a mutation position at which the codons do not match, it outputs a comparison result to the specifying unit 150d.
- the comparison result includes information on the mutation position, the first mutated codon, the second mutation codon, the mutation n codon, and the mutation n + 1 codon.
- the first mutated codon is the encoded codon at the mutation position in the first sequence data 140d.
- the second mutation codon is the encoded codon at the mutation position in the second sequence data 140e.
- the mutation n codon is the encoded codon next to the second mutation codon.
- the mutation n + 1 codon is the encoded codon two positions after the second mutation codon.
- when the first sequence data 140d and the second sequence data 140e match, the comparing unit 150c outputs information indicating the match to the specifying unit 150d as the comparison result.
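- the behaviour of the comparing unit can be sketched on `bytes` objects, one byte per encoded codon. The `ComparisonResult` class, its field names, and the use of `None` to signal a full match are hypothetical choices; the sample bytes use the assumed encoding of 40h plus a base-4 value with bases ordered U, C, A, G, and the filler codons are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComparisonResult:
    position: int               # mutation position (index of the codon)
    first_mutated: int          # code at that position in the first sequence data
    second_mutation: int        # code at that position in the second sequence data
    mutation_n: Optional[int]   # next code in the second sequence data, if any
    mutation_n1: Optional[int]  # code two positions later, if any


def compare(first: bytes, second: bytes) -> Optional[ComparisonResult]:
    """Compare one byte at a time; return None when no mismatch is found.

    Sequences of unequal length are compared only over their shared prefix.
    """
    for pos, (a, b) in enumerate(zip(first, second)):
        if a != b:
            n = second[pos + 1] if pos + 1 < len(second) else None
            n1 = second[pos + 2] if pos + 2 < len(second) else None
            return ComparisonResult(pos, a, b, n, n1)
    return None


# AUG GCC AAG UGC versus AUG GUC CAA GUG under the assumed encoding.
first = bytes([0x63, 0x75, 0x6B, 0x4D])
second = bytes([0x63, 0x71, 0x5A, 0x73])
result = compare(first, second)
```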
- the identification unit 150d is a processing unit that identifies the type of point mutation that has occurred at the mutation position based on the comparison result of the comparison unit 150c and the insertion transition table 140f and the deletion transition table 140g.
- the specifying unit 150d compares the mutation n codon and the mutation n + 1 codon with the insertion transition table 140f to obtain the mutated n codon before base insertion. If this codon matches the codon next to the first mutated codon, the specifying unit 150d specifies that the type of the point mutation that occurred at the mutation position is “base insertion”.
- for example, suppose the comparison result includes the first mutated n codon “AAG (6Bh)” (the codon next to the first mutated codon), the mutation n codon “CAA (5Ah)”, and the mutation n + 1 codon “GUG (73h)”.
- the mutated n codon before base insertion corresponding to the mutation n codon “CAA (5Ah)” and the mutation n + 1 codon “GUG (73h)” is “AAG (6Bh)”.
- since the mutated n codon “AAG (6Bh)” before base insertion matches the codon “AAG (6Bh)” next to the first mutated codon, the specifying unit 150d specifies that the type of the point mutation generated at the mutation position is “base insertion”.
- if the mutated n codon before base insertion obtained from the insertion transition table 140f does not match the codon next to the first mutated codon, the specifying unit 150d excludes “base insertion” from the candidate types of point mutation at the mutation position.
- the identification unit 150d includes a mutation n codon and a mutation n + 1 codon, a mutated n + 1 codon before base deletion identified by comparison with the deletion transition table 140g, and a codon next to the first mutated codon. If they match, the type of point mutation that occurred at the mutation position is "base deletion".
- the information included in the comparison result is the first mutated n + 1 codon “UGC (4Dh)”, the second mutated n codon “AGU (6Ch)”, and the mutated n + 1 codon “GCU (74h)”.
- the mutated n + 1 codon before base deletion corresponding to the mutated n codon “AGU (6Ch)” and the mutated n + 1 codon “GCU (74h)” is “UGC (4Dh)”.
- Since the pre-deletion mutated n + 1 codon “UGC (4Dh)” matches the first mutated n + 1 codon “UGC (4Dh)”, the identification unit 150d specifies the type of the point mutation that occurred at the mutation position as “base deletion”.
- When the pre-deletion mutated n + 1 codon, which is specified by comparing the mutated n codon and the mutated n + 1 codon with the deletion transition table 140g, does not match the first mutated n + 1 codon, the specifying unit 150d excludes “base deletion” from the candidate types of the point mutation that occurred at the mutation position.
- the identification unit 150d registers information in which the mutation position and the type of the point mutation are associated with each other in the detection result table 140h. If the comparison result includes the information indicating that there is a match, the specifying unit 150d registers the information indicating that there is no abnormality in the detection result table 140h.
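- The decision rule above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the contents of the insertion transition table 140f and the deletion transition table 140g are not reproduced in this section, so the sketch assumes that each table entry equals a one-base shift of the two mutated codons, which is consistent with the two worked examples in the text. All function names are hypothetical.

```python
def pre_insertion_n_codon(mut_n: str, mut_n1: str) -> str:
    """Recover the codon at position n as it was before a one-base insertion.

    Assumption: an upstream insertion pushes every later base one place to
    the right, so the original codon is the last two bases of the mutated
    n codon followed by the first base of the mutated n + 1 codon.
    """
    return mut_n[1:] + mut_n1[0]


def pre_deletion_n1_codon(mut_n: str, mut_n1: str) -> str:
    """Recover the codon at position n + 1 as it was before a one-base deletion.

    Assumption: an upstream deletion pulls every later base one place to the
    left, so the original codon is the last base of the mutated n codon
    followed by the first two bases of the mutated n + 1 codon.
    """
    return mut_n[2] + mut_n1[:2]


def classify_point_mutation(ref_n: str, ref_n1: str, mut_n: str, mut_n1: str) -> str:
    """Test for insertion first, then deletion, and otherwise report substitution."""
    if pre_insertion_n_codon(mut_n, mut_n1) == ref_n:
        return "base insertion"
    if pre_deletion_n1_codon(mut_n, mut_n1) == ref_n1:
        return "base deletion"
    return "base substitution"


# The two examples quoted in the text.
print(pre_insertion_n_codon("CAA", "GUG"))  # AAG
print(pre_deletion_n1_codon("AGU", "GCU"))  # UGC
```

- A precomputed table, as the document describes, would replace the string slicing with a lookup keyed on the pair of encoded codons; the slicing is used here only to keep the sketch short.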
- the information processing device 100 may notify the information of the detection result table 140h to an external device via a network, or may output the information to the display unit 130 for display.
- FIG. 15 is a flowchart illustrating the processing procedure of the information processing apparatus according to the first embodiment.
- the receiving unit 150a of the information processing device 100 receives the reference codon sequence data 140a and the analysis target codon sequence data 140b (Step S101).
- the encoding unit 150b of the information processing apparatus 100 encodes the reference codon sequence data 140a and the analysis target codon sequence data 140b to generate first sequence data 140d and second sequence data 140e (Step S102).
- The comparison unit 150c of the information processing apparatus 100 compares the first sequence data 140d and the second sequence data 140e in codon (1-byte) units and specifies the mutation position at which a mismatch occurs (Step S103). Based on the mutation position, the comparing unit 150c specifies the first mutated codon, the mutated n codon, and the mutated n + 1 codon of the first sequence data 140d, and the second mutated codon, the mutated n codon, and the mutated n + 1 codon of the second sequence data 140e (Step S104).
- The specifying unit 150d of the information processing apparatus 100 determines whether the pre-insertion mutated n codon, specified from the mutated n codon and the mutated n + 1 codon using the insertion transition table 140f, matches the codon next to the first mutated codon (Step S105). If they match (Step S105, Yes), the specifying unit 150d specifies the type of the point mutation as “base insertion” (Step S106). If they do not match (Step S105, No), the specifying unit 150d proceeds to Step S107.
- Step S107 will be described.
- The specifying unit 150d determines whether the pre-deletion mutated n + 1 codon, specified from the mutated n codon and the mutated n + 1 codon using the deletion transition table 140g, matches the first mutated n + 1 codon (Step S107). If they match (Step S107, Yes), the specifying unit 150d specifies the type of the point mutation as “base deletion” (Step S108).
- If they do not match (Step S107, No), the specifying unit 150d specifies the type of the point mutation as “base substitution” (Step S109).
- the specifying unit 150d registers the information on the type of the specified mutation in the detection result table 140h (Step S110).
- the information processing apparatus 100 outputs the detection result table 140h to the display unit 130 (Step S111).
- As described above, the information processing apparatus 100 compares the first sequence data 140d with the second sequence data 140e in units of 1-byte encoded codons and identifies the mismatched codon. The information processing apparatus 100 then takes the mismatched codon as the mutation position and compares the codons that follow it with the insertion transition table 140f and the deletion transition table 140g to specify the type of point mutation contained in the analysis target codon sequence data. Because both the identification of the mismatched codon and the determination of the mutation type are performed consistently in encoded codon units, the time required to determine the type of the mutation can be reduced.
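- The 1-byte comparison of Steps S102 and S103 can be illustrated as follows. The code conversion table 140c is defined elsewhere in the document, so the encoding formula below is an assumption: it assigns U=0, C=1, A=2, G=3 and adds 40h, which happens to reproduce several of the codes quoted in this section (UUU = 40h, AUG = 63h, GUC = 71h, CAA = 5Ah, AAG = 6Bh). The function names are hypothetical.

```python
from typing import Optional

# Assumed base values; chosen only because they reproduce codes quoted in the text.
BASE_VALUE = {"U": 0, "C": 1, "A": 2, "G": 3}


def encode_codon(codon: str) -> int:
    """Map a three-base codon to a single byte value (assumed scheme)."""
    b1, b2, b3 = (BASE_VALUE[base] for base in codon)
    return 0x40 + 16 * b1 + 4 * b2 + b3


def encode_sequence(bases: str) -> bytes:
    """Encode a base string codon by codon, ignoring an incomplete trailing codon."""
    usable = len(bases) - len(bases) % 3
    return bytes(encode_codon(bases[i:i + 3]) for i in range(0, usable, 3))


def first_mismatch_position(first: bytes, second: bytes) -> Optional[int]:
    """Return the index of the first differing byte (codon), or None if none differ."""
    for position, (a, b) in enumerate(zip(first, second)):
        if a != b:
            return position
    return None


reference = encode_sequence("AUGUUUGUCAAG")  # AUG UUU GUC AAG
target = encode_sequence("AUGUUUGUGCAA")     # AUG UUU GUG CAA
print(hex(encode_codon("AUG")))                    # 0x63
print(first_mismatch_position(reference, target))  # 2
```

- Because each codon occupies exactly one byte, the mismatch index is directly the codon position, with no need to realign on base boundaries.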
- FIG. 16 to FIG. 18 are diagrams for explaining the processing of the information processing apparatus according to the second embodiment.
- FIG. 16 illustrates a process performed when a point mutation “base insertion” is detected.
- The information processing apparatus according to the second embodiment compares the first sequence data 140d with the second sequence data 140e in the same manner as the information processing apparatus 100 according to the first embodiment to identify the mutation position P40.
- Based on the mutation position P40, the information processing apparatus compares the mutated n codon “CAA (5Ah)” and the mutated n + 1 codon “GUG (73h)”, which follow the mutated codon “GUC (71h)”, with the insertion transition table 140f and specifies the pre-insertion mutated n codon “AAG (6Bh)”.
- The information processing device performs correction by replacing the codon “CAA (5Ah)” following the mutated codon with the pre-insertion mutated n codon “AAG (6Bh)”.
- The information processing apparatus moves from the mutation position P40 to the sequence position of the next codon. Let the sequence position after the move be P41.
- At the sequence position P41, the information processing apparatus compares the mutated n codon “GUG (73h)” and the mutated n + 1 codon “CAU (48h)” with the insertion transition table 140f and specifies the pre-insertion mutated n codon “UGC (4Dh)”.
- The information processing device performs correction by replacing the next codon “GUG (73h)” with the pre-insertion mutated n codon “UGC (4Dh)”.
- the information processing apparatus generates the third sequence data 240e by repeatedly performing the process of replacing the mutated n-codon with the mutated n-codon before base insertion while moving the sequence position as described above.
- the information processing device compares the coded codon of the third sequence data 240e with the coded codon of the first sequence data 140d, and specifies a different codon.
- the information processing device identifies the different codon as a potential gene mutation.
- For example, the codon “UCG (47h)” at sequence position P42 and the codon “AAA (6Ah)” at sequence position P43 are specified as gene mutations.
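- The correction loop for “base insertion” can be sketched as below. This is an illustration under stated assumptions: codons are kept as plain strings rather than 1-byte codes, the lookup in the insertion transition table 140f is replaced by the one-base shift that the worked example follows (“CAA”, “GUG” gives “AAG”), and the last codon of the second sequence is dropped because it has no following codon to borrow a base from. The function names are hypothetical.

```python
def correct_insertion(second: list[str], mutation_pos: int) -> list[str]:
    """Build corrected ("third") sequence data for a one-base insertion.

    Codons up to and including the mutated codon are copied unchanged.
    Every later codon is restored from the last two bases of the current
    codon and the first base of the following codon (assumed shift rule).
    """
    third = list(second[:mutation_pos + 1])
    for pos in range(mutation_pos + 1, len(second) - 1):
        third.append(second[pos][1:] + second[pos + 1][0])
    return third


def differing_positions(first: list[str], third: list[str]) -> list[int]:
    """List positions where the reference and the corrected data disagree."""
    return [i for i, (a, b) in enumerate(zip(first, third)) if a != b]


# Hypothetical data: a base is inserted in codon 1, and codon 4 also carries
# an unrelated substitution (AUU became AAU in the analysis target).
first = ["AUG", "GCC", "AAG", "UGC", "AUU", "UCG"]
second = ["AUG", "GUC", "CAA", "GUG", "CAA", "UUC"]

third = correct_insertion(second, mutation_pos=1)
print(third)                              # ['AUG', 'GUC', 'AAG', 'UGC', 'AAU']
print(differing_positions(first, third))  # [1, 4]
```

- Position 1 is the point mutation itself; position 4 only becomes visible once the frame shift caused by the insertion has been undone, which is the kind of potential gene mutation the second embodiment is meant to expose.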
- FIG. 17 illustrates a process when a point mutation “base deletion” is detected.
- The information processing apparatus compares the first sequence data 140d with the second sequence data 140e in the same manner as the information processing apparatus 100 according to the first embodiment to identify the mutation position P50.
- Based on the mutation position P50, the information processing apparatus compares the mutated n codon “AGU (6Ch)” and the mutated n + 1 codon “GCU (74h)”, which follow the mutated codon “UCA (40h)”, with the deletion transition table 140g and specifies the pre-deletion mutated n + 1 codon “UGC (4Dh)”.
- The information processing device performs correction by replacing the mutated n + 1 codon “GCU (74h)” with the pre-deletion mutated n + 1 codon “UGC (4Dh)”.
- The information processing apparatus moves from the mutation position P50 to the sequence position of the next codon.
- the information processing device specifies the mutated n + 1 codon before base deletion by comparing the mutated n codon and the mutated n + 1 codon based on the new arrangement position with the deletion transition table 140g.
- the information processing device performs the correction by replacing the mutation n + 1 codon with the n + 1 codon before base deletion.
- The information processing apparatus generates the third sequence data 240e by repeatedly performing the process of replacing the mutated n + 1 codon with the pre-deletion mutated n + 1 codon while moving the sequence position as described above.
- the information processing device compares the coded codon of the third sequence data 240e with the coded codon of the first sequence data 140d, and specifies a different codon.
- the information processing device identifies the different codon as a potential gene mutation.
- For example, the codon “UCG (47h)” at sequence position P52 and the codon “AAA (6Ah)” at sequence position P53 are specified as gene mutations.
- FIG. 18 illustrates a process performed when a point mutation “base substitution” is detected.
- The information processing apparatus according to the second embodiment compares the first sequence data 140d with the second sequence data 140e in the same manner as the information processing apparatus 100 according to the first embodiment to identify the mutation position P60. It is assumed that the information processing apparatus determines that the point mutation is “base substitution” using the insertion transition table 140f and the deletion transition table 140g. In this case, the information processing apparatus generates the third sequence data 240e by copying, from the second sequence data 140e, the codons from sequence position P61 onward, where P61 is the position next to the mutated codon at the mutation position P60.
- the information processing device compares the coded codon of the third sequence data 240e with the coded codon of the first sequence data 140d, and specifies a different codon.
- the information processing device identifies the different codon as a potential gene mutation.
- For example, the codon “UCG (47h)” at sequence position P62 and the codon “AAA (6Ah)” at sequence position P63 are specified as gene mutations.
- As described above, after specifying the type of the point mutation, the information processing apparatus according to the second embodiment generates the third sequence data 240e by correcting the second sequence data 140e, and specifies the codons that differ between the first sequence data 140d and the third sequence data 240e. Thereby, a potential gene mutation can be detected.
- FIG. 19 is a functional block diagram illustrating the configuration of the information processing apparatus according to the second embodiment.
- the information processing device 200 includes a communication unit 110, an input unit 120, a display unit 130, a storage unit 240, and a control unit 250.
- the description regarding the communication unit 110, the input unit 120, and the display unit 130 is the same as the description regarding the communication unit 110, the input unit 120, and the display unit 130 described with reference to FIG.
- the storage unit 240 has reference codon sequence data 140a, analysis target codon sequence data 140b, a code conversion table 140c, first sequence data 140d, and second sequence data 140e.
- the storage unit 240 has an insertion transition table 140f, a deletion transition table 140g, third array data 240e, and a detection result table 240h.
- the storage unit 240 corresponds to a semiconductor memory device such as a RAM, a ROM, and a flash memory, and a storage device such as an HDD.
- the description of the reference codon sequence data 140a, the analysis target codon sequence data 140b, the code conversion table 140c, the first sequence data 140d, and the second sequence data 140e included in the storage unit 240 is the same as that described in the first embodiment.
- the description of the insertion transition table 140f and the deletion transition table 140g included in the storage unit 240 is the same as that described in the first embodiment.
- the third sequence data 240e is sequence data obtained by correcting codons including point mutations among the coded codons of the second sequence data 140e to normal codons.
- the detection result table 240h is a table that holds information on point mutations and gene mutations detected from the analysis target codon sequence data 140b.
- the control unit 250 includes a receiving unit 150a, an encoding unit 150b, a comparing unit 150c, and a specifying unit 250d.
- the control unit 250 can be realized by a CPU, an MPU, or the like. Further, the control unit 250 can also be realized by hard wired logic such as an ASIC or an FPGA.
- the receiving unit 150a is a processing unit that receives the reference codon sequence data 140a and the analysis target codon sequence data 140b from the input unit 120, an external device, or the like.
- the accepting unit 150a registers the reference codon sequence data 140a and the analysis target codon sequence data 140b in the storage unit 240.
- Other descriptions are the same as the processing of the reception unit 150a of the first embodiment.
- the encoding unit 150b is a processing unit that encodes the reference codon sequence data 140a and the analysis target codon sequence data 140b based on the code conversion table 140c. Other descriptions are the same as the processing of the encoding unit 150b of the first embodiment.
- the comparing unit 150c is a processing unit that compares the first sequence data 140d and the second sequence data 140e and specifies a mutation position at which encoded codons do not match.
- the comparing unit 150c outputs the comparison result to the specifying unit 250d.
- Other descriptions are the same as the processing of the comparison unit 150c of the first embodiment.
- the identification unit 250d identifies the type of the point mutation that has occurred at the mutation position based on the comparison result of the comparison unit 150c and the insertion transition table 140f and the deletion transition table 140g.
- When the type of the point mutation is specified, the specifying unit 250d generates the third sequence data 240e by correcting the second sequence data 140e.
- the identification unit 250d compares the first sequence data 140d with the third sequence data 240e to detect a gene mutation.
- the specifying unit 250d registers the mutation position, the type of the point mutation, and the information on the gene mutation in the detection result table 240h.
- the process in which the specifying unit 250d specifies the type of the point mutation is the same as the process of the specifying unit 150d described in the first embodiment.
- the processing of the specifying unit 250d will be described for cases where the point mutation is “base insertion”, “base deletion”, or “base substitution”.
- the processing of the specifying unit 250d when the point mutation is “base insertion” will be described.
- As described with reference to FIG. 16, based on the mutation position P40, the specifying unit 250d compares the mutated n codon “CAA (5Ah)” and the mutated n + 1 codon “GUG (73h)”, which follow the mutated codon “GUC (71h)”, with the insertion transition table 140f and specifies the pre-insertion mutated n codon “AAG (6Bh)”.
- The specifying unit 250d performs correction by replacing the codon “CAA (5Ah)” following the mutated codon with the pre-insertion mutated n codon “AAG (6Bh)”.
- The specifying unit 250d moves from the mutation position P40 to the next sequence position. Let the sequence position after the move be P41.
- At the sequence position P41, the specifying unit 250d compares the mutated n codon “GUG (73h)” and the mutated n + 1 codon “CAU (48h)” with the insertion transition table 140f and specifies the pre-insertion mutated n codon “UGC (4Dh)”.
- The specifying unit 250d performs correction by replacing the next codon “GUG (73h)” with the pre-insertion mutated n codon “UGC (4Dh)”.
- the identifying unit 250d generates the third sequence data 240e by repeatedly executing the process of replacing the mutated n-codon with the mutated n-codon before base insertion while moving the sequence position as described above.
- the specifying unit 250d compares the coded codon of the third sequence data 240e with the coded codon of the first sequence data 140d, and specifies a different codon.
- the specifying unit 250d specifies a different codon as a potential gene mutation.
- For example, the codon “UCG (47h)” at sequence position P42 and the codon “AAA (6Ah)” at sequence position P43 are specified as gene mutations.
- the identification unit 250d registers information on the type of base mutation “base insertion” and the mutation position, and information on the codon and sequence position identified as the gene mutation in the detection result table 240h.
- Next, the processing of the specifying unit 250d when the point mutation is “base deletion” will be described.
- As described with reference to FIG. 17, the specifying unit 250d compares the first sequence data 140d with the second sequence data 140e to identify the mutation position P50 at which a mismatch occurs.
- Based on the mutation position P50, the specifying unit 250d compares the mutated n codon “AGU (6Ch)” and the mutated n + 1 codon “GCU (74h)”, which follow the mutated codon “UCA (40h)”, with the deletion transition table 140g and specifies the pre-deletion mutated n + 1 codon “UGC (4Dh)”.
- The specifying unit 250d performs correction by replacing the mutated n + 1 codon “GCU (74h)” with the pre-deletion mutated n + 1 codon “UGC (4Dh)”.
- The specifying unit 250d moves from the mutation position P50 to the next sequence position.
- the specifying unit 250d specifies the mutated n + 1 codon before base deletion by comparing the mutated n codon and the mutated n + 1 codon based on the new sequence position with the deletion transition table 140g.
- the identification unit 250d performs the correction by replacing the mutation n + 1 codon with the mutated n + 1 codon before base deletion.
- the identification unit 250d generates the third sequence data 240e by repeatedly performing the process of replacing the mutation n + 1 codon with the mutated n + 1 codon before base deletion while moving the sequence position as described above.
- the specifying unit 250d compares the coded codon of the third sequence data 240e with the coded codon of the first sequence data 140d, and specifies different codons.
- the specifying unit 250d specifies a different codon as a potential gene mutation.
- For example, the specifying unit 250d specifies the codon “UCG (47h)” at sequence position P52 and the codon “AAA (6Ah)” at sequence position P53 as gene mutations.
- the identification unit 250d registers information on the type of base mutation “base deletion” and the mutation position, and information on the codon and sequence position identified as the gene mutation in the detection result table 240h.
- the processing of the specifying unit 250d when the point mutation is “base substitution” will be described.
- As described with reference to FIG. 18, the specifying unit 250d compares the first sequence data 140d with the second sequence data 140e to identify the mutation position P60 at which a mismatch occurs. It is assumed that the specifying unit 250d determines the point mutation to be “base substitution” using the insertion transition table 140f and the deletion transition table 140g. In this case, the specifying unit 250d generates the third sequence data 240e by copying, from the second sequence data 140e, the codons from sequence position P61 onward, where P61 is the position next to the mutated codon at the mutation position P60.
- the specifying unit 250d compares the coded codon of the third sequence data 240e with the coded codon of the first sequence data 140d, and specifies a different codon.
- the specifying unit 250d specifies a different codon as a potential gene mutation.
- For example, the specifying unit 250d specifies the codon “UCG (47h)” at sequence position P62 and the codon “AAA (6Ah)” at sequence position P63 as gene mutations.
- the identification unit 250d registers information on the type of base mutation “base substitution” and the mutation position, and information on the codon and sequence position identified as the gene mutation in the detection result table 240h.
- FIG. 20 is a flowchart (1) illustrating the processing procedure of the information processing apparatus according to the second embodiment.
- the receiving unit 150a of the information processing device 200 receives the reference codon sequence data 140a and the analysis target codon sequence data 140b (Step S201).
- the encoding unit 150b of the information processing device 200 encodes the reference codon sequence data 140a and the analysis target codon sequence data 140b to generate first sequence data 140d and second sequence data 140e (Step S202).
- the comparison unit 150c of the information processing device 200 compares the first sequence data 140d and the second sequence data 140e for each codon (1 byte), and specifies a mismatched mutation position (step S203).
- the specifying unit 250d of the information processing device 200 specifies the type of the point mutation (Step S204).
- the processing procedure for specifying the type of point mutation corresponds to the processing procedure described in steps S105 to S109 in FIG.
- the specifying unit 250d generates third array data 240e obtained by modifying the second array data 140e based on the type of the point mutation (step S205).
- the identification unit 250d identifies the gene mutation by comparing the first sequence data 140d with the third sequence data 240e (Step S206).
- The identification unit 250d registers the identified mutation type and gene mutation information in the detection result table 240h (Step S207).
- the information processing device 200 outputs the detection result table 240h to the display unit 130 (Step S208).
- As described above, when the type of the point mutation included in the second sequence data 140e has been specified, the information processing apparatus 200 generates the third sequence data 240e by correcting the second sequence data 140e and specifies the codons that differ between the first sequence data 140d and the third sequence data 240e. As a result, even after the type of the point mutation is determined, a potential gene mutation can be detected consistently by comparison in encoded codon units.
- Although the method in which the information processing apparatus 200 generates the third sequence data 240e and compares it with the first sequence data 140d has been described for convenience, the present invention is not limited to this.
- The information processing device 200 can also convert the second sequence data 140e byte by byte and compare the result with the first sequence data 140d in byte units without generating the third sequence data 240e.
- When the input search query is an amino acid sequence, the information processing apparatus 200 performs codon / amino acid conversion on the first sequence data 140d, which is obtained by encoding the reference codon sequence data 140a described in base symbols, and generates fourth sequence data (not shown). The information processing device 200 then compares the fourth sequence data with the amino acid sequence of the search query in units of amino acids and specifies the mutation position.
- FIG. 21A is a diagram showing an example of the data structure of the codon / amino acid conversion table.
- In the codon / amino acid conversion table 240i, encoded codons are associated with encoded amino acids.
- For example, the encoded codon “UUU (40h)” is associated with the encoded amino acid “Phe (50h)”.
- the codon / amino acid conversion table 240i is stored in the storage unit 240 of the information processing device 200.
- FIG. 21B is a diagram for explaining another process of the information processing apparatus according to the second embodiment.
- the information processing device 200 compares the first sequence data 140d with the codon / amino acid conversion table 240i to convert each encoded codon into an encoded amino acid.
- the codon “AUG (63h)” is converted to the amino acid “Met (4Dh)”.
- the fourth array data 240j is stored in the storage unit 240 of the information processing device 200.
- The information processing device 200 compares the fourth sequence data 240j with the second sequence data 140e and specifies the position at which a mismatch occurs. In the example shown in FIG. 21B, the information processing device 200 determines that the amino acids at sequence position P25 and later differ.
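- The amino-acid-level comparison can be sketched as follows. In the document, the codon / amino acid conversion table 240i maps 1-byte encoded codons to 1-byte encoded amino acids; the sketch below uses plain codon strings and three-letter amino acid names instead, and includes only a small subset of the standard genetic code. The table name and function names are hypothetical.

```python
from typing import Optional

# Subset of the standard genetic code, sufficient for the codons used here.
CODON_TO_AMINO_ACID = {
    "AUG": "Met",
    "UUU": "Phe",
    "UUC": "Phe",
    "GUC": "Val",
    "GUG": "Val",
    "AAG": "Lys",
    "AAA": "Lys",
    "UGC": "Cys",
    "UCG": "Ser",
}


def to_amino_acids(codons: list[str]) -> list[str]:
    """Convert reference codons into an amino acid sequence ("fourth" data)."""
    return [CODON_TO_AMINO_ACID[codon] for codon in codons]


def first_amino_acid_mismatch(fourth: list[str], query: list[str]) -> Optional[int]:
    """Return the first position where the two amino acid sequences differ."""
    for position, (a, b) in enumerate(zip(fourth, query)):
        if a != b:
            return position
    return None


reference_codons = ["AUG", "UUU", "GUC", "AAG"]
query_amino_acids = ["Met", "Phe", "Lys", "Lys"]

fourth = to_amino_acids(reference_codons)
print(fourth)                                                # ['Met', 'Phe', 'Val', 'Lys']
print(first_amino_acid_mismatch(fourth, query_amino_acids))  # 2
```

- Because several codons map to the same amino acid (UUU and UUC both give Phe), a comparison at this level does not report synonymous base changes, which is expected when the query itself is an amino acid sequence.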
- FIG. 22 is a flowchart (2) illustrating the processing procedure of the information processing device according to the second embodiment.
- the receiving unit 150a of the information processing device 200 receives the reference codon sequence data (Step S210).
- the encoding unit 150b of the information processing device 200 encodes the reference codon sequence data 140a to generate the first sequence data 140d (Step S211).
- the receiving unit 150a receives the amino acid sequence data to be analyzed (step S212).
- the encoding unit 150b encodes the amino acid sequence data to be analyzed and generates the second sequence data 140e (Step S213).
- the encoding unit 150b converts the analysis target amino acid sequence data into the second sequence data 140e based on the code conversion table 140c. Although a specific description is omitted, it is assumed that the code conversion table 140c holds information in which amino acids are associated with encoded amino acids.
- the comparison unit 150c of the information processing device 200 generates the fourth sequence data 240j from the first sequence data 140d based on the codon / amino acid conversion table 240i (Step S214).
- the comparing unit 150c compares the fourth sequence data 240j and the second sequence data 140e for each amino acid and specifies a mutation position (Step S215).
- the information processing device 200 registers the information on the mutation position specified by the comparing unit 150c in the detection result table 240h (Step S216).
- the information processing device 200 outputs the detection result table 240h to the display unit 130 (Step S217).
- As described above, when the input search query is an amino acid sequence, the information processing apparatus 200 performs codon / amino acid conversion on the first sequence data 140d, which is obtained by encoding the reference codon sequence data 140a described in base symbols, and compares the result with the search query. Thereby, even if the search query is an amino acid sequence, the amino acid in which the mutation has occurred can be specified.
- FIGS. 23 and 24 are diagrams for explaining the processing of the information processing apparatus according to the third embodiment.
- the information processing apparatus according to the third embodiment receives the reference codon sequence data 140a in the same manner as the information processing apparatus 100 according to the first embodiment, and performs encoding based on the code conversion table 140c. Then, the first array data 140d is generated, and an inverted index 340a is generated at the same time. Further, when receiving the analysis target codon sequence data 140b, the information processing device performs encoding based on the code conversion table 140c to generate the second sequence data 140e.
- The information processing apparatus generates the inverted index 340a at the same time as it generates the first sequence data 140d.
- The inverted index 340a is information that uses bitmaps to indicate the relationship between the type of each encoded codon of the first sequence data 140d and its sequence position (offset).
- The horizontal axis of the inverted index 340a corresponds to the offset.
- The vertical axis of the inverted index 340a corresponds to the type of the encoded codon.
- Each entry of the inverted index 340a is a bit, “0” or “1”, and all bits are set to “0” in the initial state.
- the offset is an offset from the first codon included in the sequence data.
- the offset of the first codon is set to “0”. For example, when the codon “AUG (63h)” is the seventh codon from the top of the first sequence data 140d, the offset of the codon “AUG (63h)” is “6”.
- The information processing apparatus scans the first sequence data 140d from the top, specifies the relationship between the type of each encoded codon and its offset, and sets “1” at the corresponding position of the inverted index 340a. For example, since the codon “AUG (63h)” exists at the offset “6”, “1” is set where the column of the offset “6” intersects the row of the codon type “AUG (63h)”.
- The information processing apparatus generates the inverted index 340a by repeatedly executing the above processing.
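- The index construction described above can be sketched as follows. The sketch stores one bitmap per codon type as a Python integer in which bit k stands for offset k; this representation, and the function names, are assumptions made for illustration and are not taken from the document.

```python
def build_inverted_index(codes: bytes) -> dict[int, int]:
    """Map each encoded codon to a bitmap whose bit k is 1 if it occurs at offset k."""
    index: dict[int, int] = {}
    for offset, code in enumerate(codes):
        # Codon types that never occur simply have no entry (an all-zero row).
        index[code] = index.get(code, 0) | (1 << offset)
    return index


def offsets_of(bitmap: int) -> list[int]:
    """List the offsets whose bit is set in a bitmap."""
    return [k for k in range(bitmap.bit_length()) if (bitmap >> k) & 1]


# Six copies of UUU (40h) followed by AUG (63h), so AUG sits at offset 6
# as in the example in the text.
first_sequence = bytes([0x40] * 6 + [0x63])
index = build_inverted_index(first_sequence)
print(offsets_of(index[0x63]))  # [6]
print(offsets_of(index[0x40]))  # [0, 1, 2, 3, 4, 5]
```

- Since the index is filled during a single pass over the encoded data, it can be produced in the same loop that writes the first sequence data, which matches the statement that both are generated at the same time.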
- The information processing device reads the encoded codons sequentially from the start codon of the second sequence data 140e and acquires the bitmap corresponding to the type of each read codon from the inverted index 340a.
- The start codon is “AUG (63h)”.
- The information processing apparatus acquires the bitmap b10 of the codon “AUG (63h)”, the bitmap b11 of the codon “UUU (40h)”, and the bitmap b12 of the codon “GUC (71h)” in this order from the inverted index 340a.
- The bitmap b10 is the bitmap corresponding to the row of the codon type “AUG (63h)” of the inverted index 340a.
- The bitmap b11 is the bitmap corresponding to the row of the codon type “UUU (40h)” of the inverted index 340a.
- The bitmap b12 is the bitmap corresponding to the row of the codon type “GUC (71h)” of the inverted index 340a.
- The information processing apparatus focuses on the positions of “1” in the bitmaps b10 to b12 and, as long as the “1” is shifted left by one from each bitmap to the next, determines that the codons of the first sequence data 140d and the second sequence data 140e match.
- The information processing device determines that the codons of the first sequence data 140d and the second sequence data 140e do not match at the stage where the “1” is no longer shifted left by one. In the example shown in FIG. 24, from the bitmap b11 to the bitmap b12, “1” moves from the offset “7” to the offset “20”, so the codon “GUC (71h)” at the offset (sequence position) “8” is determined to be a mismatch.
- As described above, the information processing apparatus generates the inverted index 340a based on the first sequence data 140d.
- The information processing apparatus acquires, from the inverted index 340a, the bitmaps corresponding to the codon types in order from the first codon of the second sequence data 140e, and identifies the mismatched codon based on the positions of the flag “1” in the acquired bitmaps. This makes it possible to quickly search for codons containing point mutations.
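- The shift-based check can be sketched with integer bitmaps. This is an illustration under assumptions: bit k of each bitmap stands for offset k, so advancing one offset is a left shift on a Python integer, and the routine reports the position within the second sequence at which no run of consecutive offsets can be continued. The function names are hypothetical.

```python
from typing import Optional


def build_inverted_index(codes: bytes) -> dict[int, int]:
    """Map each encoded codon to a bitmap whose bit k is 1 if it occurs at offset k."""
    index: dict[int, int] = {}
    for offset, code in enumerate(codes):
        index[code] = index.get(code, 0) | (1 << offset)
    return index


def first_unmatched_codon(index: dict[int, int], second: bytes) -> Optional[int]:
    """Return the position in `second` where the run of consecutive offsets breaks.

    `candidates` holds the offsets at which the codons read so far end.
    Shifting it left by one moves every candidate to the next offset, and
    the AND keeps only candidates where the next codon really occurs.
    """
    candidates: Optional[int] = None
    for position, code in enumerate(second):
        bitmap = index.get(code, 0)
        candidates = bitmap if candidates is None else (candidates << 1) & bitmap
        if candidates == 0:
            return position
    return None


# AUG UUU AAG GUC in the reference; the target reads AUG UUU GUC.
index = build_inverted_index(bytes([0x63, 0x40, 0x6B, 0x71]))
print(first_unmatched_codon(index, bytes([0x63, 0x40, 0x71])))  # 2
print(first_unmatched_codon(index, bytes([0x63, 0x40, 0x6B])))  # None
```

- In the first call GUC (71h) does occur in the reference, but at offset 3 rather than at the offset immediately after UUU, so the run breaks in the same way as the jump from offset “7” to offset “20” in the example of FIG. 24.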
- FIG. 25 is a functional block diagram illustrating the configuration of the information processing apparatus according to the third embodiment.
- the information processing device 300 includes a communication unit 110, an input unit 120, a display unit 130, a storage unit 340, and a control unit 350.
- the description regarding the communication unit 110, the input unit 120, and the display unit 130 is the same as the description regarding the communication unit 110, the input unit 120, and the display unit 130 described with reference to FIG.
- the storage unit 340 includes reference codon sequence data 140a, analysis target codon sequence data 140b, a code conversion table 140c, first sequence data 140d, an inverted index 340a, and second sequence data 140e.
- the storage unit 340 has an insertion transition table 140f, a deletion transition table 140g, third array data 240e, and a detection result table 240h.
- the storage unit 340 corresponds to a semiconductor memory device such as a RAM, a ROM, and a flash memory, and a storage device such as an HDD.
- the storage unit 340 may include a codon / amino acid conversion table 240i and fourth sequence data 240j.
- the description of the reference codon sequence data 140a, the analysis target codon sequence data 140b, the code conversion table 140c, the first sequence data 140d, and the second sequence data 140e included in the storage unit 340 is the same as that described in the first embodiment.
- the description of the insertion transition table 140f and the deletion transition table 140g included in the storage unit 340 is the same as that described in the first embodiment.
- the description regarding the third array data 240e and the detection result table 240h included in the storage unit 340 is the same as that described in the second embodiment.
- the transposed index 340a (also referred to in this description as the inverted index 340a) is information that indicates, by bitmaps, the relationship between the type of each encoded codon of the first array data 140d and its array position (offset). As described in FIG. 23, the horizontal axis of the transposed index 340a corresponds to the offset, and the vertical axis corresponds to the type of the encoded codon.
- the control unit 350 includes a reception unit 150a, an encoding unit 150b, a generation unit 350a, an acquisition unit 350b, and a specification unit 350c.
- the control unit 350 can be realized by a CPU, an MPU, or the like.
- the control unit 350 can also be realized by hard wired logic such as an ASIC or an FPGA.
- the receiving unit 150a is a processing unit that receives the reference codon sequence data 140a and the analysis target codon sequence data 140b from the input unit 120, an external device, or the like.
- the accepting unit 150a registers the reference codon sequence data 140a and the analysis target codon sequence data 140b in the storage unit 340.
- Other descriptions are the same as the processing of the reception unit 150a of the first embodiment.
- the encoding unit 150b is a processing unit that encodes the reference codon sequence data 140a and the analysis target codon sequence data 140b based on the code conversion table 140c. Other descriptions are the same as the processing of the encoding unit 150b of the first embodiment.
- the generation unit 350a is a processing unit that generates the transposed index 340a based on the first array data 140d.
- the generation unit 350a scans the first sequence data 140d from the beginning, specifies the relationship between the type of each encoded codon and its offset (sequence position), and sets “1” at the corresponding position of the transposed index 340a. For example, since the codon “AUG (63h)” exists at the offset “6”, the generation unit 350a sets “1” at the intersection of the column of the offset “6” and the row of the codon type “AUG (63h)”.
- the generation unit 350a generates the transposed index 340a by repeatedly executing the above processing.
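The scan described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `build_inverted_index`, the use of Python integers as bitmaps (bit i stands for offset i), and the sample sequence are all assumptions; only the code values 63h, 40h and 5Ah and the placement of AUG at offset 6 come from the text.

```python
from collections import defaultdict


def build_inverted_index(codes):
    """Return {code: bitmap}; bit i of a bitmap is 1 when the code occurs at offset i."""
    index = defaultdict(int)
    for offset, code in enumerate(codes):
        index[code] |= 1 << offset  # set the flag "1" for this offset
    return dict(index)


# Hypothetical first sequence data: placeholder code 0x00 at offsets 0-5,
# then AUG (63h) at offset 6 as in the text, followed by UUU (40h) and CAA (5Ah).
first_sequence = [0x00] * 6 + [0x63, 0x40, 0x5A]
index = build_inverted_index(first_sequence)
print(bin(index[0x63]))  # 0b1000000, i.e. the flag at offset 6
```

Each dictionary entry plays the role of one row of the index, and each bit position plays the role of one column.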
- FIG. 26 is a diagram illustrating an example of a process of hashing an inverted index.
- the bitmap of each row of the transposed index 340a is hashed based on the prime numbers (bases) “29” and “31”.
- the bitmap b1 is a bitmap obtained by extracting one row of an inverted index (for example, the inverted index 340a shown in FIG. 23).
- the hashed bitmap h11 is a bitmap hashed by the base “29”.
- the hashed bitmap h12 is a bitmap hashed by the base “31”.
- the generation unit 350a associates each bit position of the bitmap b1 with the position in the hashed bitmap given by the remainder obtained by dividing that bit position by the base. When “1” is set at a bit position of the bitmap b1, the generation unit 350a sets “1” at the associated position of the hashed bitmap.
- the generation unit 350a copies the information of the positions “0 to 28” of the bitmap b1 to the hashed bitmap h11. Subsequently, since the remainder obtained by dividing the bit position “35” of the bitmap b1 by the base “29” is “6”, the position “35” of the bitmap b1 is associated with the position “6” of the hashed bitmap h11. Since “1” is set at the position “35” of the bitmap b1, the generation unit 350a sets “1” at the position “6” of the hashed bitmap h11.
- since the remainder obtained by dividing the bit position “42” of the bitmap b1 by the base “29” is “13”, the position “42” of the bitmap b1 is associated with the position “13” of the hashed bitmap h11. Since “1” is set at the position “42” of the bitmap b1, the generation unit 350a sets “1” at the position “13” of the hashed bitmap h11.
- the generation unit 350a generates the hashed bitmap h11 by repeatedly executing the above-described processing for the positions “29” and above of the bitmap b1.
- the generation unit 350a copies the information of the positions “0 to 30” of the bitmap b1 to the hashed bitmap h12. Subsequently, since the remainder obtained by dividing the bit position “35” of the bitmap b1 by the base “31” is “4”, the position “35” of the bitmap b1 is associated with the position “4” of the hashed bitmap h12. Since “1” is set at the position “35” of the bitmap b1, the generation unit 350a sets “1” at the position “4” of the hashed bitmap h12.
- since the remainder obtained by dividing the bit position “42” of the bitmap b1 by the base “31” is “11”, the position “42” of the bitmap b1 is associated with the position “11” of the hashed bitmap h12. Since “1” is set at the position “42” of the bitmap b1, the generation unit 350a sets “1” at the position “11” of the hashed bitmap h12.
- the generation unit 350a generates the hashed bitmap h12 by repeatedly executing the above-described processing for the positions “31” and above of the bitmap b1.
- the generation unit 350a performs hashing on the transposed index by performing compression by the above-described folding technique on each row of the transposed index 340a.
- the hashed bitmaps with bases “29” and “31” are provided with information on the row (encoded codon type) of the source bitmap.
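The folding step can be illustrated with a short sketch. The function name `fold_bitmap` and the integer representation of a bitmap are assumptions, and the sample bitmap contains only the two flags (positions 35 and 42) that the text mentions; the rest of b1 in FIG. 26 is not reproduced here.

```python
def fold_bitmap(bitmap, base):
    """Fold a bitmap: a flag at position p is OR-ed into position p % base."""
    hashed = 0
    position = 0
    while bitmap:
        if bitmap & 1:
            hashed |= 1 << (position % base)
        bitmap >>= 1
        position += 1
    return hashed


# Only the two flags named in the text are set in this sample bitmap b1.
b1 = (1 << 35) | (1 << 42)
h11 = fold_bitmap(b1, 29)  # flags at positions 6 and 13
h12 = fold_bitmap(b1, 31)  # flags at positions 4 and 11
```

Positions below the base map to themselves, which corresponds to the copy of positions 0 to 28 (or 0 to 30) described above.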
- the acquisition unit 350b is a processing unit that sequentially acquires bitmaps corresponding to each encoded codon included in the second sequence data 140e from the transposed index 340a.
- the obtaining unit 350b outputs the obtained information of each bitmap to the specifying unit 350c. It is assumed that the information of the bitmap output to the specifying unit 350c is sorted in the reading order.
- the acquisition unit 350b reads encoded codons in order from the start codon of the second sequence data 140e, and acquires a bitmap corresponding to the type of the read codon from the transposed index 340a.
- the start codon is “AUG (63h)”
- the second sequence data 140e is as shown in FIG.
- the acquisition unit 350b reads the bitmap b10 of “AUG (63h)”, the bitmap b11 of “UUU (40h)”, the bitmap b12 of “GUC (71h)”, the bitmap (not shown) of “CAA (5Ah)”, and the bitmap of each subsequent codon.
- FIG. 27 is a diagram illustrating an example of a process of restoring an inverted index.
- a process in which the acquisition unit 350b restores the bitmap b1 based on the hashed bitmap h11 and the hashed bitmap h12 will be described.
- the acquisition unit 350b generates an intermediate bitmap h11 'from the hashed bitmap h11 having the base "29".
- the acquisition unit 350b copies the values at positions 0 to 28 of the hashed bitmap h11 to the positions 0 to 28 of the intermediate bitmap h11 ', respectively.
- for the positions 29 and above of the intermediate bitmap h11', the acquisition unit 350b repeatedly copies the values of the positions 0 to 28 of the hashed bitmap h11, once for every “29” positions.
- FIG. 27 shows an example in which the values of the positions 0 to 14 of the hashed bitmap h11 are copied to the positions 29 to 43 of the intermediate bitmap h11'.
- the acquisition unit 350b generates an intermediate bitmap h12 ′ from the hashed bitmap h12 having the base “31”.
- the acquisition unit 350b copies the values at positions 0 to 30 of the hashed bitmap h12 to the positions 0 to 30 of the intermediate bitmap h12 ', respectively.
- for the positions 31 and above of the intermediate bitmap h12', the acquisition unit 350b repeatedly copies the values of the positions 0 to 30 of the hashed bitmap h12, once for every “31” positions.
- FIG. 27 shows an example in which the values of the positions 0 to 12 of the hashed bitmap h12 are copied to the positions 31 to 43 of the intermediate bitmap h12'.
- after generating the intermediate bitmap h11' and the intermediate bitmap h12', the acquisition unit 350b restores the bitmap b1 as it was before hashing by performing an AND operation on the intermediate bitmap h11' and the intermediate bitmap h12'.
- the acquisition unit 350b can restore each bitmap corresponding to a codon (restoring the transposed index 340a) by repeatedly executing the same process for other hashed bitmaps.
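The restoration can be sketched in the same integer-bitmap style. The names `unfold_bitmap` and `restore_bitmap` are hypothetical, the length 44 follows the positions 0 to 43 shown for FIG. 27, and the hashed inputs are the two-flag example values rather than the full bitmaps of the figure.

```python
def unfold_bitmap(hashed, base, length):
    """Build an intermediate bitmap by repeating the hashed bitmap every `base` positions."""
    intermediate = 0
    for position in range(length):
        if (hashed >> (position % base)) & 1:
            intermediate |= 1 << position
    return intermediate


def restore_bitmap(h_a, base_a, h_b, base_b, length):
    """AND the two intermediate bitmaps to recover the bitmap before hashing."""
    return unfold_bitmap(h_a, base_a, length) & unfold_bitmap(h_b, base_b, length)


h11 = (1 << 6) | (1 << 13)   # hashed with base 29
h12 = (1 << 4) | (1 << 11)   # hashed with base 31
restored = restore_bitmap(h11, 29, h12, 31, 44)  # flags at positions 35 and 42
```

With several flags set, the AND can in principle also keep a position whose two remainders come from different original flags; in this example the nearest such position lies well beyond 43, so the two original flags are recovered exactly.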
- the specifying unit 350c performs a process of specifying a mutation position at which the first sequence data 140d and the second sequence data 140e do not match, a process of specifying a type of point mutation, and a process of specifying a gene mutation.
- FIG. 28 is a diagram for explaining the process of the specifying unit according to the third embodiment.
- the bitmaps b10, b11, and b12 shown in FIG. 28 are bitmaps received from the acquisition unit 350b.
- the identification unit 350c generates the bitmap b10-1 by shifting the bitmap b10 to the left (step S10).
- the specifying unit 350c calculates the bitmap b11-1 by performing an AND operation on the bitmap b10-1 and the bitmap b11 (step S11). Since the bit “1” is set at the offset “7” of the bitmap b11-1, the first array data 140d and the second array data 140e match from the offset “6” to the offset “7”.
- the specifying unit 350c calculates the bitmap b11-2 by shifting the bitmap b11-1 to the left (step S12).
- the specifying unit 350c calculates the bitmap b12-1 by performing an AND operation on the bitmap b11-2 and the bitmap b12 (step S13).
- the bit “1” is set at the offset “8” of the bitmap b11-2, but the bit at the offset “8” of the bitmap b12-1 is “0”. Accordingly, the specifying unit 350c determines that the first array data 140d and the second array data 140e do not match at the offset (array position) “8”.
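Steps S10 to S13 can be traced with integer bitmaps, where bit i stands for offset i and a left shift moves a flag to the next offset. As a simplifying assumption, each bitmap here holds only the single flag mentioned in the description (offset 6 for AUG, offset 7 for UUU, offset 20 for GUC); the real rows may contain further flags.

```python
b10 = 1 << 6    # AUG (63h): flag at offset 6
b11 = 1 << 7    # UUU (40h): flag at offset 7
b12 = 1 << 20   # GUC (71h): flag at offset 20

b10_1 = b10 << 1       # step S10: shift left, flag moves to offset 7
b11_1 = b10_1 & b11    # step S11: flag survives at offset 7, so offsets 6-7 match
b11_2 = b11_1 << 1     # step S12: shift left, flag moves to offset 8
b12_1 = b11_2 & b12    # step S13: no flag survives

mismatch_at_8 = ((b11_2 >> 8) & 1) == 1 and ((b12_1 >> 8) & 1) == 0
print(mismatch_at_8)   # True: the sequences diverge at offset 8
```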
- the specifying unit 350c specifies the type of the point mutation.
- the specifying unit 350c specifies the type of the point mutation that has occurred at the mutation position, based on the mismatched mutation position (offset), the insertion transition table 140f, and the deletion transition table 140g.
- the specifying unit 350c generates third array data 240e obtained by modifying the second array data 140e.
- the process of the specifying unit 350c specifying the type of the point mutation is the same as the process of the specifying unit 150d described in the first embodiment.
- the process in which the specifying unit 350c generates the third array data 240e by modifying the second array data 140e based on the type of the point mutation is the same as the process of the specifying unit 250d described in the second embodiment.
- the specifying unit 350c sequentially obtains bitmaps corresponding to each encoded codon type included in the third sequence data 240e from the transposed index 340a.
- the specifying unit 350c reads out the encoded codons in order from the start codon in the same manner as the acquisition unit 350b, and acquires the bitmap corresponding to the type of each read codon from the transposed index 340a.
- upon acquiring each bitmap, the specifying unit 350c repeatedly executes the process of performing an AND operation on the bitmap obtained by shifting a bitmap to the left and the next bitmap, in the same manner as the processing described with reference to FIG. 28.
- the specifying unit 350c determines that the first array data 140d and the third array data 240e do not match at the offset when the bit “1” is no longer included in the new bitmap.
- the identification unit 350c determines that the codon of the third sequence data 240e corresponding to the determined mismatched offset is a codon that causes a gene mutation.
- the specifying unit 350c executes the above-described processing, and registers, in the detection result table 240h, the information on the type and the mutation position (offset) of the point mutation and the information on the codon specified as the gene mutation and its sequence position (offset).
- FIG. 29 is a flowchart illustrating a processing procedure of the information processing apparatus according to the third embodiment.
- the receiving unit 150a of the information processing device 300 receives the reference codon sequence data 140a and the analysis target codon sequence data 140b (Step S301).
- the encoding unit 150b of the information processing device 300 encodes the reference codon sequence data 140a to generate the first sequence data 140d, and at the same time, generates the transposed index 340a (Step S302).
- the encoding unit 150b encodes the analysis target codon sequence data 140b to generate the second sequence data 140e (Step S303).
- the acquisition unit 350b of the information processing device 300 compares the encoded codon of the second sequence data 140e with the transposed index 340a, and sequentially acquires bitmaps corresponding to the codons (Step S304).
- the specifying unit 350c of the information processing apparatus 300 specifies a mutation position (offset) at which a mismatch occurs by executing a shift operation and an AND operation of each bitmap (step S305).
- the specifying unit 350c specifies the type of the point mutation (Step S306).
- the identification unit 350c generates third array data 240e obtained by modifying the second array data 140e based on the type of the point mutation (step S307).
- the specifying unit 350c compares the encoded codons of the third sequence data 240e with the transposed index 340a, and sequentially obtains the bitmaps corresponding to the codons (step S308).
- the specifying unit 350c specifies the mismatched mutation position (offset) and the gene mutation by executing a shift operation and an AND operation on each bitmap (step S309).
- the specifying unit 350c registers the type of the specified point mutation and the information on the gene mutation in the detection result table 240h (Step S310).
- the information processing device 300 outputs the detection result table 240h to the display unit 130 to display the same (step S311).
- FIG. 30 is a flowchart illustrating a process in which the specifying unit according to the third embodiment specifies an offset of a point mutation.
- the specifying unit 350c of the information processing device 300 sets the offset n to the offset of the start codon (Step S401).
- the acquisition unit 350b of the information processing device 300 acquires the first bitmap corresponding to the codon at the offset n of the second array data 140e from the transposed index 340a (Step S402).
- the specifying unit 350c shifts the first bitmap to the left (step S403).
- the specifying unit 350c increments the offset n by 1 (step S404).
- the acquiring unit 350b acquires the second bitmap corresponding to the codon at the offset n of the second array data from the transposed index 340a (Step S405).
- the identification unit 350c performs an AND operation on the first bitmap and the second bitmap to generate a third bitmap (step S406).
- the specifying unit 350c determines whether the bit at the offset n of the third bitmap is “1” (Step S407).
- when the bit at the offset n of the third bitmap is not “1” (No in step S408), the specifying unit 350c determines that there is a point mutation at the offset n of the second array data (Step S409).
- when the bit at the offset n of the third bitmap is “1” (Yes in step S408), the specifying unit 350c updates the first bitmap with the bitmap obtained by shifting the third bitmap to the left (step S410), and the process proceeds to step S404.
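The loop of FIG. 30 can be sketched as a single function. The name `find_point_mutation_offset`, the dictionary-of-integers index, and the sample data are assumptions for illustration. The flowchart as described gives no end condition, so stopping at the end of the query and returning `None` is an added assumption, as is the premise that the start codon itself matches the reference.

```python
def find_point_mutation_offset(index, query, start):
    """Return the first offset where `query` diverges from the indexed reference, else None."""
    n = start                                  # S401: begin at the start codon
    first = index.get(query[n], 0) << 1        # S402-S403: fetch the bitmap and shift left
    while n + 1 < len(query):                  # assumed end condition (not in the flowchart)
        n += 1                                 # S404
        second = index.get(query[n], 0)        # S405: bitmap of the next codon
        third = first & second                 # S406
        if not (third >> n) & 1:               # S407: is the bit at offset n set?
            return n                           # S409: point mutation at offset n
        first = third << 1                     # S410
    return None


# Hypothetical data: placeholder code 0x00 at offsets 0-5, AUG (63h) at offset 6.
reference = [0x00] * 6 + [0x63, 0x40, 0x5A]
index = {}
for offset, code in enumerate(reference):
    index[code] = index.get(code, 0) | (1 << offset)

query = [0x00] * 6 + [0x63, 0x40, 0x71]       # GUC (71h) replaces CAA (5Ah) at offset 8
print(find_point_mutation_offset(index, query, 6))  # 8
```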
- the information processing apparatus 300 according to the third embodiment acquires bitmaps corresponding to codon types from the transposed index 340a, in order from the first codon included in the second sequence data 140e, and identifies mismatched codons by applying shift operations and AND operations to the acquired bitmaps. As a result, it becomes possible to quickly search for codons including point mutations and gene mutations.
- although, for convenience, the information processing apparatus 300 has been described as generating the third array data 240e and comparing it with the first array data 140d, the present invention is not limited to this.
- the information processing device 300 can also convert the second array data 140e into byte units and compare it with the first array data 140d in byte units, without generating the third array data 240e.
- the information processing device 300 encodes the reference codon sequence data described by base symbols and generates an inverted index associated with the codon. Further, the information processing device 300 converts the codon sequence into an amino acid sequence, generates an inverted index associated with the amino acid, and specifies the mutation position using the generated inverted index.
- FIG. 31 is a diagram for explaining another process of the information processing apparatus according to the third embodiment.
- the information processing apparatus 300 generates the fourth sequence data 240j based on the first sequence data 140d and the codon / amino acid conversion table 240i shown in FIG. 21A, and at the same time generates the transposed index 340b.
- the transposed index 340b is information that indicates, by bitmaps, the relationship between the type of each code (encoded amino acid) of the fourth array data 240j and its array position (offset).
- the information processing device 300 performs a process of specifying a mutation position using the transposed index 340b corresponding to the amino acid sequence. For example, the information processing device 300 acquires the bitmaps corresponding to the amino acid types from the transposed index 340b, in order from the first amino acid included in the amino acid sequence data, and, based on the positions of the flags of the acquired bitmaps, specifies the sequence position at which an amino acid included in the amino acid sequence data does not match the fourth sequence data 240j.
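An index keyed by amino acid can be sketched as below. For readability the sketch works on codon strings rather than on the encoded values of the codon / amino acid conversion table 240i, and the table holds only five entries of the standard genetic code; the names `CODON_TO_AMINO` and `build_amino_index` and the sample sequence are hypothetical.

```python
# Excerpt of the standard genetic code (mRNA codons); a stand-in for table 240i.
CODON_TO_AMINO = {
    "AUG": "Met",
    "UUU": "Phe",
    "UUC": "Phe",
    "GUC": "Val",
    "CAA": "Gln",
}


def build_amino_index(codons):
    """Return {amino acid: bitmap}; bit i is 1 when that amino acid is encoded at offset i."""
    index = {}
    for offset, codon in enumerate(codons):
        amino = CODON_TO_AMINO[codon]          # conversion that yields the fourth sequence data
        index[amino] = index.get(amino, 0) | (1 << offset)
    return index


amino_index = build_amino_index(["AUG", "UUU", "GUC", "UUC", "CAA"])
print(bin(amino_index["Phe"]))  # 0b1010: Phe at offsets 1 and 3
```

Because UUU and UUC both map to Phe, their flags share one row, so a query given as an amino acid sequence can be matched with the same shift and AND operations.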
- FIG. 32 is a flowchart (2) illustrating the processing procedure of the information processing apparatus according to the third embodiment.
- the receiving unit 150a of the information processing device 300 receives the reference codon sequence data (Step S401).
- the encoding unit 150b of the information processing device 300 encodes the reference codon sequence data to generate the first sequence data 140d, and the generation unit 350a generates the transposed index 340a (Step S402).
- the receiving unit 150a receives the amino acid sequence data to be analyzed (step S403).
- the encoding unit 150b encodes the amino acid sequence data to be analyzed and generates the second sequence data 140e (Step S404).
- the generation unit 350a generates the fourth sequence data 240j from the first sequence data 140d based on the codon / amino acid conversion table 240i, and at the same time, generates an inverted index 340b associated with the amino acid (step S405).
- the specifying unit 350c of the information processing device 300 specifies a mutation position (offset) at which the bitmaps do not match by executing a shift operation and an AND operation on each bitmap (step S406).
- the specifying unit 350c registers the information on the specified mutation in the detection result table 240h (Step S407).
- the information processing device 300 outputs and displays the detection result table 240h on the display unit 130 (Step S408).
- when the input of the search query is an amino acid sequence, the information processing device 300 generates an inverted index 340b corresponding to the amino acids and compares it with the second sequence data 140e. Thereby, even if the input of the search query is an amino acid sequence, the mutated amino acid can be specified using the inverted index.
- FIG. 33 is a diagram illustrating an example of a hardware configuration of a computer that realizes the same functions as the information processing apparatuses according to the first and second embodiments.
- the computer 400 includes a CPU 401 for executing various arithmetic processing, an input device 402 for receiving input of data from a user, and a display 403.
- the computer 400 includes a reading device 404 that reads a program or the like from a storage medium, and an interface device 405 that exchanges data with an external device or the like via a wired or wireless network.
- the computer 400 includes a RAM 406 for temporarily storing various information, and a hard disk device 407.
- the devices 401 to 407 are connected to a bus 408.
- the hard disk device 407 has a reception program 407a, an encoding program 407b, a comparison program 407c, and a specifying program 407d.
- the CPU 401 reads out the reception program 407a, the encoding program 407b, the comparison program 407c, and the specifying program 407d and expands them in the RAM 406.
- the receiving program 407a functions as a receiving process 406a.
- the encoding program 407b functions as an encoding process 406b.
- the comparison program 407c functions as a comparison process 406c.
- the specifying program 407d functions as a specifying process 406d.
- the processing of the receiving process 406a corresponds to the processing of the receiving unit 150a.
- the processing of the encoding process 406b corresponds to the processing of the encoding unit 150b.
- the processing of the comparison process 406c corresponds to the processing of the comparison unit 150c.
- the processing of the specifying process 406d corresponds to the processing of the specifying units 150d and 250d.
- each program does not necessarily have to be stored in the hard disk device 407 from the beginning.
- for example, each program may be stored in a “portable physical medium” such as a flexible disk (FD), a CD-ROM, a DVD disk, a magneto-optical disk, or an IC card inserted into the computer 400. Then, the computer 400 may read out and execute each of the programs 407a to 407d.
- FIG. 34 is a diagram illustrating an example of a hardware configuration of a computer that realizes functions similar to those of the information processing apparatus according to the third embodiment.
- the computer 500 includes a CPU 501 for executing various arithmetic processing, an input device 502 for receiving input of data from a user, and a display 503.
- the computer 500 includes a reading device 504 that reads a program or the like from a storage medium, and an interface device 505 that exchanges data with an external device or the like via a wired or wireless network.
- the computer 500 includes a RAM 506 for temporarily storing various information, and a hard disk device 507.
- the devices 501 to 507 are connected to a bus 508.
- the hard disk device 507 has a reception program 507a, an encoding program 507b, a generation program 507c, an acquisition program 507d, and a specifying program 507e.
- the CPU 501 reads the reception program 507a, the encoding program 507b, the generation program 507c, the acquisition program 507d, and the specifying program 507e, and expands them on the RAM 506.
- the receiving program 507a functions as a receiving process 506a.
- the encoding program 507b functions as an encoding process 506b.
- the generation program 507c functions as a generation process 506c.
- the acquisition program 507d functions as an acquisition process 506d.
- the specifying program 507e functions as a specifying process 506e.
- the processing of the receiving process 506a corresponds to the processing of the receiving unit 150a.
- the processing of the encoding process 506b corresponds to the processing of the encoding unit 150b.
- the processing of the generation process 506c corresponds to the processing of the generation unit 350a.
- the processing of the acquisition process 506d corresponds to the processing of the acquisition unit 350b.
- the processing of the specifying process 506e corresponds to the processing of the specifying unit 350c.
- each program does not necessarily have to be stored in the hard disk device 507 from the beginning.
- for example, each program may be stored in a “portable physical medium” such as a flexible disk (FD), a CD-ROM, a DVD disk, a magneto-optical disk, or an IC card inserted into the computer 500.
- the computer 500 may read out and execute each of the programs 507a to 507e.
- Information processing device
- 110 Communication unit
- 120 Input unit
- 130 Display unit
- 140, 240, 340 Storage unit
- 140a Reference codon sequence data
- 140b Analysis target codon sequence data
- 140c Code conversion table
- 140d First sequence data
- 140e Second sequence data
- 140f Insertion transition table
- 140g Deletion transition table
- 140h, 240h Detection result table
- 150, 250, 350 Control unit
- 150a Reception unit
- 150b Encoding unit
- 150c Comparison unit
- 150d, 250d, 350c Specifying unit
- 240e Third sequence data
- 240i Codon / amino acid conversion table
- 240j Fourth sequence data
- 350a Generation unit
- 350b Acquisition unit
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Engineering & Computer Science (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (30)
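- A specifying method in which a computer executes a process comprising:
acquiring reference codon sequence data and analysis target codon sequence data;
comparing, for each codon sequence position, the codons included in the acquired reference codon sequence data with the codons included in the acquired analysis target codon sequence data;
specifying, based on a result of the comparison, the codons respectively located at a plurality of sequence positions following the sequence position at which the codons do not match, among the codons included in the analysis target codon sequence data; and
specifying the type of mutation associated with the codons respectively located at the specified plurality of sequence positions, by referring to a storage unit that stores the type of a mutation occurring in a certain codon included in certain codon sequence data in association with the codons that, as a result of the mutation occurring in the certain codon, are respectively located at a plurality of sequence positions following the sequence position of the certain codon.
- The specifying method according to claim 1, wherein the storage unit stores a mutated codon in association with the codons that, as a result of the mutation occurring in the certain codon, are respectively located at the plurality of sequence positions following the sequence position of the certain codon, and
the process further comprises specifying the mutated codon by comparing the storage unit with the specified type of mutation and the codons respectively located at the specified plurality of sequence positions.
- The specifying method according to claim 2, wherein the process further comprises correcting the analysis target codon sequence data based on the mutated codon, comparing the corrected codon sequence data with the reference codon sequence data, and specifying a mismatched codon.
- The specifying method according to claim 2, wherein the specifying of the type of mutation determines that the type of mutation is a base insertion when the mutated codon matches the codon that, in the reference codon sequence data, is located one position after the sequence position at which the codons do not match.
- The specifying method according to claim 4, wherein the specifying of the type of mutation determines that the type of mutation is a base deletion when the mutated codon matches the codon that, in the reference codon sequence data, is located two positions after the sequence position at which the codons do not match.
- The specifying method according to claim 5, wherein the specifying of the type of mutation determines that the type of mutation is a base substitution when the type of mutation is determined to be neither the base insertion nor the base deletion.
- A specifying method in which a computer executes a process comprising:
acquiring reference codon sequence data and analysis target codon sequence data;
generating first sequence data obtained by encoding the reference codon sequence data in codon units and second sequence data obtained by encoding the analysis target codon sequence data in codon units; and
comparing the first sequence data with the second sequence data in units of a code corresponding to one codon, and specifying the position of a mismatched code.
- The specifying method according to claim 7, wherein, when amino acid sequence data is acquired as the sequence data to be analyzed, the process further comprises converting the codon-unit codes of the first sequence data into amino-acid-unit codes, and the specifying specifies the position of a mismatched code based on the encoded amino acid sequence data and the converted first sequence data.
- A specifying method in which a computer executes a process comprising:
acquiring reference codon sequence data, and generating an inverted index in which the types of the codons included in the reference codon sequence data are associated with the sequence positions of the codons by bitmap flags;
acquiring, when analysis target codon sequence data is acquired, the bitmaps corresponding to the codon types from the inverted index, in order from the first codon included in the analysis target codon sequence data; and
specifying, based on the positions of the flags of the plurality of acquired bitmaps, a sequence position at which a codon included in the analysis target codon sequence data does not match the codon of the reference codon sequence data.
- The specifying method according to claim 9, wherein, when amino acid sequence data is acquired as the sequence data to be analyzed, the generating generates an amino acid inverted index in which the codon types included in the reference codon sequence data are converted into amino acid types, the acquiring acquires the bitmaps corresponding to the amino acid types from the amino acid inverted index, in order from the first amino acid included in the amino acid sequence data, and the specifying specifies, based on the positions of the flags of the plurality of acquired bitmaps, a sequence position at which an amino acid included in the amino acid sequence data does not match the reference codon sequence data.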
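- A specifying program that causes a computer to execute a process comprising:
acquiring reference codon sequence data and analysis target codon sequence data;
comparing, for each codon sequence position, the codons included in the acquired reference codon sequence data with the codons included in the acquired analysis target codon sequence data;
specifying, based on a result of the comparison, the codons respectively located at a plurality of sequence positions following the sequence position at which the codons do not match, among the codons included in the analysis target codon sequence data; and
specifying the type of mutation associated with the codons respectively located at the specified plurality of sequence positions, by referring to a storage unit that stores the type of a mutation occurring in a certain codon included in certain codon sequence data in association with the codons that, as a result of the mutation occurring in the certain codon, are respectively located at a plurality of sequence positions following the sequence position of the certain codon.
- The specifying program according to claim 11, wherein the storage unit stores a mutated codon in association with the codons that, as a result of the mutation occurring in the certain codon, are respectively located at the plurality of sequence positions following the sequence position of the certain codon, and
the process further comprises specifying the mutated codon by comparing the storage unit with the specified type of mutation and the codons respectively located at the specified plurality of sequence positions.
- The specifying program according to claim 12, wherein the process further comprises correcting the analysis target codon sequence data based on the mutated codon, comparing the corrected codon sequence data with the reference codon sequence data, and specifying a mismatched codon.
- The specifying program according to claim 12, wherein the specifying of the type of mutation determines that the type of mutation is a base insertion when the mutated codon matches the codon that, in the reference codon sequence data, is located one position after the sequence position at which the codons do not match.
- The specifying program according to claim 14, wherein the specifying of the type of mutation determines that the type of mutation is a base deletion when the mutated codon matches the codon that, in the reference codon sequence data, is located two positions after the sequence position at which the codons do not match.
- The specifying program according to claim 15, wherein the specifying of the type of mutation determines that the type of mutation is a base substitution when the type of mutation is determined to be neither the base insertion nor the base deletion.
- A specifying program that causes a computer to execute a process comprising:
acquiring reference codon sequence data and analysis target codon sequence data;
generating first sequence data obtained by encoding the reference codon sequence data in codon units and second sequence data obtained by encoding the analysis target codon sequence data in codon units; and
comparing the first sequence data with the second sequence data in units of a code corresponding to one codon, and specifying the position of a mismatched code.
- The specifying program according to claim 17, wherein, when amino acid sequence data is acquired as the sequence data to be analyzed, the process further comprises converting the codon-unit codes of the first sequence data into amino-acid-unit codes, and the specifying specifies the position of a mismatched code based on the encoded amino acid sequence data and the converted first sequence data.
- A specifying program that causes a computer to execute a process comprising:
acquiring reference codon sequence data, and generating an inverted index in which the types of the codons included in the reference codon sequence data are associated with the sequence positions of the codons by bitmap flags;
acquiring, when analysis target codon sequence data is acquired, the bitmaps corresponding to the codon types from the inverted index, in order from the first codon included in the analysis target codon sequence data; and
specifying, based on the positions of the flags of the plurality of acquired bitmaps, a sequence position at which a codon included in the analysis target codon sequence data does not match the codon of the reference codon sequence data.
- The specifying program according to claim 19, wherein, when amino acid sequence data is acquired as the sequence data to be analyzed, the generating generates an amino acid inverted index in which the codon types included in the reference codon sequence data are converted into amino acid types, the acquiring acquires the bitmaps corresponding to the amino acid types from the amino acid inverted index, in order from the first amino acid included in the amino acid sequence data, and the specifying specifies, based on the positions of the flags of the plurality of acquired bitmaps, a sequence position at which an amino acid included in the amino acid sequence data does not match the reference codon sequence data.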
- 基準コドン配列データと、分析対象コドン配列データとを取得し、取得した前記基準コドン配列データに含まれるコドンと、取得した前記分析対象コドン配列データに含まれるコドンとを、コドンの配列位置毎に比較する比較部と、
前記比較の結果に基づき、前記分析対象コドン配列データに含まれるコドンのうち、コドンが不一致となる配列位置に後続する複数の配列位置にそれぞれ位置するコドンを特定し、あるコドン配列データに含まれるあるコドンに生じた突然変異の種別を、前記あるコドンに前記突然変異が生じることで前記あるコドンの配列位置に後続する複数の配列位置にそれぞれ位置するコドンに対応付けて記憶する記憶部を参照して、特定した前記複数の配列位置にそれぞれ位置するコドンに対応づけられた突然変異の種別を特定する特定部と
を有することを特徴とする情報処理装置。 - 前記記憶部は、被変異コドンを、前記あるコドンに前記突然変異が生じることで前記あるコドンの配列位置に後続する複数の配列位置にそれぞれ位置するコドンに対応付けて記憶し、
特定部は、前記記憶部と、特定された突然変異の種別と、特定された前記複数の配列位置にそれぞれ位置するコドンとを比較して、前記被変異コドンを特定する処理を更に実行することを特徴とする請求項21に記載の情報処理装置。 - 特定部は、前記被変異コドンを基にして、前記分析対象コドン配列データを修正し、修正したコドン配列データと、前記基準コドン配列データとを比較して、不一致となるコドンを特定する処理を更に実行することを特徴とする請求項22に記載の情報処理装置。
- 特定部は、前記被変異コドンと、前記基準コドン配列データにおける、コドンが不一致となる配列位置よりも一つ後ろのコドンとが一致する場合に、前記突然変異の種別が塩基挿入であると判定することを特徴とする請求項22に記載の情報処理装置。
- 前記特定部は、前記被変異コドンが、前記基準コドン配列データにおける、コドンが不一致となる配列位置よりも二つ後ろのコドンと一致する場合に、前記突然変異の種別が塩基欠失であると判定することを特徴とする請求項24に記載の情報処理装置。
- 前記特定部は、前記突然変異の種別が前記塩基挿入および前記塩基欠失でないと判定した場合には、前記突然変異の種別が塩基置換であると判定することを特徴とする請求項25に記載の情報処理装置。
- 基準コドン配列データと、分析対象コドン配列データとを取得し、前記基準コドン配列データをコドン単位で符号化した第1配列データと、前記分析対象コドン配列データをコドン単位で符号化した第2配列データとを生成する符号化部と、
前記第1配列データと、前記第2配列データとを、一つのコドンに相当する符号の単位で比較し、不一致となる符号の位置を特定する特定部と
を有することを特徴とする情報処理装置。 - 前記符号化部は、分析対象の配列データとして、アミノ酸配列データを取得した場合に、前記第1配列データのコドン単位の符号をアミノ酸単位の符号に変換する処理を更に実行し、前記特定部は、符号化された前記アミノ酸配列データと、変換された前記第1配列データとを基にして、不一致となる符号の位置を特定することを特徴とする請求項27に記載の情報処理装置。
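- An information processing device comprising:
a generation unit that acquires reference codon sequence data and generates an inverted index that associates, by means of bitmap flags, each type of codon contained in the reference codon sequence data with the sequence positions of that codon; and
an identification unit that, when analysis-target codon sequence data is acquired, acquires from the inverted index, in order from the first codon contained in the analysis-target codon sequence data, the bitmap corresponding to each codon type, and identifies, based on the flag positions of the acquired bitmaps, the sequence positions at which codons contained in the analysis-target codon sequence data do not match the codons of the reference codon sequence data.
- The information processing device according to claim 29, wherein, when amino acid sequence data is acquired as the sequence data to be analyzed, the generation unit generates an amino acid inverted index in which the codon types contained in the reference codon sequence data are converted into amino acid types, and the identification unit acquires from the amino acid inverted index, in order from the first amino acid contained in the amino acid sequence data, the bitmap corresponding to each amino acid type, and identifies, based on the flag positions of the acquired bitmaps, the sequence positions at which amino acids contained in the amino acid sequence data do not match the reference codon sequence data.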
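The codon-unit comparison of claim 27 and the bitmap inverted index of claims 19 and 29 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the specification: the 6-bit base-4 codon code, the use of a Python integer as a bitmap, the 0-based sequence positions, and all function names (`split_codons`, `encode_codons`, `mismatch_positions`, `build_inverted_index`, `mismatch_positions_by_index`) are hypothetical choices made for the example.

```python
from typing import Dict, List

# Assumed 2-bit value per base; the patent's actual code table may differ.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def split_codons(bases: str) -> List[str]:
    """Split a base string into consecutive 3-base codons.

    A trailing partial codon (1 or 2 bases) is dropped.
    """
    usable = len(bases) - len(bases) % 3
    return [bases[i:i + 3] for i in range(0, usable, 3)]


def encode_codons(bases: str) -> bytes:
    """Encode each codon as one code in 0..63 (hypothetical encoding)."""
    return bytes(
        BASE_VALUE[c[0]] * 16 + BASE_VALUE[c[1]] * 4 + BASE_VALUE[c[2]]
        for c in split_codons(bases)
    )


def mismatch_positions(ref_codes: bytes, tgt_codes: bytes) -> List[int]:
    """Compare two encoded sequences one codon code at a time.

    Returns the 0-based positions where the codes differ; positions
    beyond the shorter sequence are not reported.
    """
    return [i for i, (r, t) in enumerate(zip(ref_codes, tgt_codes)) if r != t]


def build_inverted_index(ref_codes: bytes) -> Dict[int, int]:
    """Map each codon code to a bitmap of the positions where it occurs.

    Bit i of the bitmap is set when the reference has that codon at
    position i.
    """
    index: Dict[int, int] = {}
    for pos, code in enumerate(ref_codes):
        index[code] = index.get(code, 0) | (1 << pos)
    return index


def mismatch_positions_by_index(index: Dict[int, int], tgt_codes: bytes) -> List[int]:
    """Walk the target from its first codon and test the flag at each position.

    A position is a mismatch when the bitmap for the target's codon type
    has no flag set at that position.
    """
    result = []
    for pos, code in enumerate(tgt_codes):
        bitmap = index.get(code, 0)
        if not (bitmap >> pos) & 1:
            result.append(pos)
    return result


# Reference codons: ATG GCT GAA AAA TAA; target differs only in the third codon.
reference = encode_codons("ATGGCTGAAAAATAA")
target = encode_codons("ATGGCTGACAAATAA")

print(mismatch_positions(reference, target))  # [2]
print(mismatch_positions_by_index(build_inverted_index(reference), target))  # [2]
```

Both approaches report the same mismatching position here. The index-based variant builds the reference-side structure once, so it can be reused when many analysis-target sequences are checked against the same reference.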
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
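JP2020540993A JP7124877B2 (ja) | 2018-09-07 | 2018-09-07 | Identification method, identification program, and information processing device |
PCT/JP2018/033329 WO2020049748A1 (ja) | 2018-09-07 | 2018-09-07 | Identification method, identification program, and information processing device |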
AU2018440274A AU2018440274B2 (en) | 2018-09-07 | 2018-09-07 | Identification method, identification program, and information processing device |
EP18932448.6A EP3848935A4 (en) | 2018-09-07 | 2018-09-07 | SPECIFICATION PROCESS, SPECIFICATION PROGRAM, AND INFORMATION PROCESSING DEVICE |
US17/182,397 US20210183466A1 (en) | 2018-09-07 | 2021-02-23 | Identification method, information processing device, and recording medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2018/033329 WO2020049748A1 (ja) | 2018-09-07 | 2018-09-07 | Identification method, identification program, and information processing device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/182,397 Continuation US20210183466A1 (en) | 2018-09-07 | 2021-02-23 | Identification method, information processing device, and recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020049748A1 (ja) | 2020-03-12 |
Family
ID=69721989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2018/033329 WO2020049748A1 (ja) | Identification method, identification program, and information processing device | 2018-09-07 | 2018-09-07 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20210183466A1 (ja) |
EP (1) | EP3848935A4 (ja) |
JP (1) | JP7124877B2 (ja) |
AU (1) | AU2018440274B2 (ja) |
WO (1) | WO2020049748A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPWO2022009342A1 (ja) * | 2020-07-08 | 2022-01-13 | ||
WO2022244089A1 (ja) | 2021-05-18 | 2022-11-24 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7188573B2 (ja) * | 2019-05-13 | 2022-12-13 | Fujitsu Limited | Evaluation method, evaluation program, and evaluation device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002132781A (ja) | 2000-10-25 | 2002-05-10 | Hitachi Ltd | Amino acid frame display system, amino acid frame display method, and recording medium |
JP2003256433A (ja) * | 2002-02-27 | 2003-09-12 | Japan Science & Technology Corp | Gene structure analysis method and device therefor |
JP2004355522A (ja) | 2003-05-30 | 2004-12-16 | Keio Gijuku | Data processing method, data processing system, mRNA translation method, and mRNA translation system |
WO2008108297A1 (ja) | 2007-03-02 | 2008-09-12 | Research Organization Of Information And Systems | Homology search system |
WO2009013910A1 (ja) | 2007-07-24 | 2009-01-29 | Keio University | Encoding device, decoding device, and information recording medium |
US20130332081A1 (en) * | 2010-09-09 | 2013-12-12 | Omicia Inc | Variant annotation, analysis and selection tool |
US20150363546A1 (en) * | 2014-06-17 | 2015-12-17 | Genepeeks, Inc. | Evolutionary models of multiple sequence alignments to predict offspring fitness prior to conception |
JP2015536156A (ja) | 2012-12-10 | 2015-12-21 | Resolution Bioscience, Inc. | Methods for targeted genomic analysis |
WO2017201050A1 (en) * | 2016-05-19 | 2017-11-23 | Minati Ludovico | Comparing dna fragments with a reference genome |
2018
- 2018-09-07 EP EP18932448.6A patent/EP3848935A4/en active Pending
- 2018-09-07 AU AU2018440274A patent/AU2018440274B2/en active Active
- 2018-09-07 WO PCT/JP2018/033329 patent/WO2020049748A1/ja unknown
- 2018-09-07 JP JP2020540993A patent/JP7124877B2/ja active Active
2021
- 2021-02-23 US US17/182,397 patent/US20210183466A1/en active Pending
Non-Patent Citations (2)
Title |
---|
See also references of EP3848935A4 |
YAMASITA, TATUO; MATSUMOTO, YUJI: "Full Text Approximate String Search using Suffix Arrays", IPSJ SIG TECHNICAL REPORT, vol. 97, no. 85, 12 September 1997 (1997-09-12), pages 83 - 90, XP009526189 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPWO2022009342A1 (ja) * | 2020-07-08 | 2022-01-13 | ||
WO2022009342A1 (ja) * | 2020-07-08 | 2022-01-13 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
EP4181147A4 (en) * | 2020-07-08 | 2023-08-23 | Fujitsu Limited | INFORMATION PROCESSING PROGRAM, METHOD AND DEVICE |
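JP7548312B2 (ja) | 2020-07-08 | 2024-09-10 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
WO2022244089A1 (ja) | 2021-05-18 | 2022-11-24 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
JP7537609B2 (ja) | 2021-05-18 | 2024-08-21 | Fujitsu Limited | Information processing program, information processing method, and information processing device |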
Also Published As
Publication number | Publication date |
---|---|
EP3848935A4 (en) | 2021-09-01 |
US20210183466A1 (en) | 2021-06-17 |
JPWO2020049748A1 (ja) | 2021-08-12 |
JP7124877B2 (ja) | 2022-08-24 |
EP3848935A1 (en) | 2021-07-14 |
AU2018440274B2 (en) | 2023-02-16 |
AU2018440274A1 (en) | 2021-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Accurate de novo prediction of protein contact map by ultra-deep learning model | |
WO2020049748A1 (ja) | Identification method, identification program, and information processing device | |
Lyons et al. | Protein fold recognition using HMM–HMM alignment and dynamic programming | |
JP7552675B2 (ja) | Generation method and information processing device | |
JP5049965B2 (ja) | Data processing device and method | |
Täubig et al. | PAST: fast structure-based searching in the PDB | |
US20220068435A1 (en) | Evaluation method, storage medium, and evaluation device | |
Zhang et al. | A program plagiarism detection model based on information distance and clustering | |
JP7287005B2 (ja) | Identification method, identification program, and identification device | |
JP2019211959A (ja) | Search method, search program, and search device | |
US20240071568A1 (en) | Storage medium, information processing method, and information processing apparatus | |
EP4357937A1 (en) | Information processing program, information processing method, and information processing device | |
WO2021124535A1 (ja) | Information processing program, information processing method, and information processing device | |
Zheng | The use of a conformational alphabet for fast alignment of protein structures | |
Poleksic | Detecting non-trivial protein structure relationships | |
CN117609260A (zh) | Data verification method, apparatus, device, medium, and computer program product | |
Mendelowitz | Reducing Genome Assembly Complexity with Optical Maps |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2020540993; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2018440274; Country of ref document: AU; Date of ref document: 20180907; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2018932448; Country of ref document: EP; Effective date: 20210407 |