WO2022244089A1 - Information processing program, information processing method, and information processing device - Google Patents

Information processing program, information processing method, and information processing device

Info

Publication number
WO2022244089A1
WO2022244089A1 PCT/JP2021/018730 JP2021018730W WO2022244089A1 WO 2022244089 A1 WO2022244089 A1 WO 2022244089A1 JP 2021018730 W JP2021018730 W JP 2021018730W WO 2022244089 A1 WO2022244089 A1 WO 2022244089A1
Authority
WO
WIPO (PCT)
Prior art keywords
codon
amino acid
file
information processing
acid sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/018730
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
正弘 片岡
良平 永浦
薫 茂櫛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN202180098100.0A (publication CN117296100A, zh)
Priority to PCT/JP2021/018730 (publication WO2022244089A1, ja)
Priority to JP2023522033A (publication JP7537609B2, ja)
Priority to AU2021446660A (publication AU2021446660A1, en)
Priority to EP21940706.1A (publication EP4343769A4, en)
Publication of WO2022244089A1 (ja)
Priority to US18/502,405 (publication US20240071568A1, en)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices

Definitions

  • the present invention relates to an information processing program and the like.
  • a domain is a part of a protein's sequence or structure that evolves independently of the rest and has a function. Motifs are characterized by symmetrical codon sequences.
  • FIG. 17 is a diagram showing an example of a motif. As shown in FIG. 17, the motifs include β-hairpin 1a, Greek key 1b, β-barrel 1c (porin and lipocalin), and the like. Folding is the physical process by which a protein chain acquires its native three-dimensional structure, usually a biologically functional conformation, in a rapid and reproducible manner.
  • FIG. 18 is a diagram showing the relationship between amino acids, bases, and codons. A sequence of three bases is called a "codon". The sequence of bases determines the codon, and the codon in turn determines the amino acid.
  • multiple types of codons may be associated with one amino acid.
  • for example, the amino acid "alanine (Ala)" is mapped to the codons "GCU", "GCC", "GCA", and "GCG", so these four codons are essentially identical in terms of the amino acid they encode.
  • the conventional techniques cannot deal with such codon characteristics, and cannot efficiently search for codon sequences that are repeatedly expressed.
  • An object of one aspect is to provide an information processing program, an information processing method, and an information processing apparatus that can efficiently search for codon sequences that are repeatedly expressed.
  • the computer executes the following processing.
  • for a plurality of codons that have different base sequences but indicate the same amino acid, the computer calculates a second index indicating the positions of the amino acid in the codon file, based on a first index indicating the positions of the plurality of codons in the codon file.
  • based on the second index, the computer identifies each position of an amino acid sequence that is repeatedly expressed in the codon file.
  • the computer identifies the codon sequences corresponding to the positions of the repeatedly expressed amino acid sequence as codon sequences that are homologous to each other.
  • FIG. 1 is a diagram (1) for explaining the processing of the information processing apparatus according to the first embodiment.
  • FIG. 2 is a diagram (2) for explaining the processing of the information processing apparatus according to the first embodiment.
  • FIG. 3 is a functional block diagram showing the configuration of the information processing apparatus according to the first embodiment.
  • FIG. 4 is a diagram showing an example of the data structure of the score table.
  • FIG. 5 is a diagram showing an example of the data structure of a codon file.
  • FIG. 6 is a diagram showing an example of the data structure of a codon permutation index.
  • FIG. 7 is a diagram showing an example of the data structure of an amino acid permutation index.
  • FIG. 8 is a diagram (1) for explaining the processing of the specifying unit.
  • FIG. 9 is a diagram (2) for explaining the processing of the specifying unit.
  • FIG. 10 is a diagram (3) for explaining the processing of the specifying unit.
  • FIG. 11 is a diagram (4) for explaining the processing of the specifying unit.
  • FIG. 12 is a diagram showing an example of the data structure of search result information.
  • FIG. 13 is a flow chart showing the processing procedure of the information processing apparatus according to the first embodiment.
  • FIG. 14 is a diagram (1) for explaining the processing of the information processing apparatus according to the second embodiment.
  • FIG. 15 is a diagram (2) for explaining the processing of the information processing apparatus according to the second embodiment.
  • FIG. 16 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
  • FIG. 17 is a diagram showing an example of a motif.
  • FIG. 18 is a diagram showing the relationship between amino acids, bases, and codons.
  • FIGS. 1 and 2 are diagrams for explaining the processing of the information processing apparatus according to the first embodiment.
  • the information processing device scans a codon file 141 containing base sequence information on a codon-by-codon basis to generate a codon permutation index 142 .
  • the codon permutation index 142 has a bitmap for each codon type. Since there are 64 types of codons, 64 bitmaps are registered in the codon permutation index 142. Each bitmap of the codon permutation index 142 associates a codon type, an offset, and a flag. A flag "1" at an offset indicates that a codon of the corresponding type is located at that offset; every other offset in the bitmap holds "0".
  • the offset of the codon at the beginning of the codon file 141 is set to "0".
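  • a minimal sketch of how such an index could be built, assuming the codon file is a string over the RNA alphabet A, C, G, U whose reading frame starts at the first base. Each bitmap is held as a Python integer in which bit n stands for offset n; the function name `build_codon_index` and the integer representation are illustrative assumptions, not taken from the source.

```python
from itertools import product


def build_codon_index(bases: str) -> dict[str, int]:
    """Map each of the 64 codon types to a bitmap stored as a Python int.

    Bit n of a bitmap is 1 when the codon at offset n (counted in codons,
    starting from 0) is of that type. Assumes `bases` contains only A, C, G, U.
    """
    index = {"".join(c): 0 for c in product("ACGU", repeat=3)}
    for offset in range(len(bases) // 3):
        codon = bases[3 * offset: 3 * offset + 3]
        index[codon] |= 1 << offset
    return index


# Three codons: GCU at offset 0, AAA at offset 1, GCU at offset 2.
codon_index = build_codon_index("GCUAAAGCU")
```

  • in this sketch every codon type gets a bitmap even when it never occurs, which mirrors the 64 registered bitmaps described above.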
  • the information processing device generates an amino acid permutation index 143 based on the codon permutation index 142 and the definition table T1.
  • the definition table T1 is a table that defines the correspondence between amino acids and codons. As described with reference to FIG. 18, there are cases where multiple types of codons are associated with the same amino acid.
  • a bitmap corresponding to each amino acid is registered in the amino acid permutation index 143.
  • Each bitmap of the amino acid permutation index 143 associates an amino acid type, an offset, and a flag. A flag "1" at an offset indicates that an amino acid of the corresponding type is located at that offset; every other offset in the bitmap holds "0".
  • the information processing apparatus 100 identifies "GCU”, “GCC”, “GCA”, and “GCG” as codons corresponding to the amino acid "Ala” based on the definition table T1.
  • the information processing device obtains a bitmap 142-1 of the codon "GCU", a bitmap 142-2 of the codon "GCC", a bitmap 142-3 of the codon "GCA", and a bitmap 142-4 of the codon "GCG".
  • the information processing device generates a bitmap 143-1 of the amino acid "Ala" by executing an OR operation (logical sum) on the bitmaps 142-1 to 142-4.
  • the information processing device sets the flag at offset "n" of the bitmap 143-1 to "1" when any of the flags at offset "n" of the bitmaps 142-1 to 142-4 is "1".
  • when all of the flags at offset "n" of the bitmaps 142-1 to 142-4 are "0", the information processing device sets the flag at offset "n" of the bitmap 143-1 to "0". The information processing device repeatedly executes the above process for each offset.
  • the information processing device also generates bitmaps of other amino acids in the same manner as the bitmap 143 - 1 of the amino acid "Ala", and registers the bitmaps of each amino acid in the amino acid permutation index 143 .
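  • the OR step described above can be sketched as follows. This is an illustration under assumptions: bitmaps are Python integers (bit n stands for offset n), and `DEFINITION_TABLE` is a hypothetical two-entry excerpt standing in for the definition table T1.

```python
from functools import reduce
from operator import or_

# Hypothetical excerpt of definition table T1: amino acid -> synonymous codons.
DEFINITION_TABLE = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Gln": ["CAG", "CAA"],
}


def build_amino_index(codon_index: dict[str, int]) -> dict[str, int]:
    """OR together the bitmaps of all codons that encode each amino acid."""
    return {
        amino: reduce(or_, (codon_index.get(c, 0) for c in codons), 0)
        for amino, codons in DEFINITION_TABLE.items()
    }


# Codon bitmaps as ints: bit n set means the codon occurs at offset n.
codon_index = {"GCU": 0b00001, "GCA": 0b00100, "CAG": 0b00010, "CAA": 0b10000}
amino_index = build_amino_index(codon_index)
```

  • a single integer OR processes every offset at once, so the per-offset loop in the description collapses into one operation per codon.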
  • the information processing device identifies the relationship between the offsets of the codon file 141 and the types of amino acids based on the amino acid permutation index 143, and specifies the codon sequences corresponding to the positions of repeatedly expressed amino acid sequences as homologous codon sequences.
  • the amino acid sequence "Leu, Lys, Asp, Gln, Ala” is repeatedly expressed at offsets 10-14, 40-44, etc. of codon file 141.
  • the information processing apparatus identifies the codon sequence "CUG, AAA, GAU, CAG, GCA" contained at offsets 10 to 14 and the codon sequence "CUG, AAA, GAU, CAA, GCA" contained at offsets 40 to 44 as homologous codon sequences.
  • when the codon sequence "CUG, AAA, GAU, CAG, GCA" and the codon sequence "CUG, AAA, GAU, CAA, GCA" are compared at codon granularity, they differ between "CAG" and "CAA". However, since "CAG" and "CAA" correspond to the same amino acid "Gln", the two codon sequences can be said to be homologous codon sequences.
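  • the comparison above can be checked with a short sketch. The table `CODON_TO_AMINO` is a partial excerpt of the standard genetic code covering only the codons in this example; the helper name `translate` is an assumption for illustration.

```python
# Partial codon table (standard genetic code, RNA alphabet), example codons only.
CODON_TO_AMINO = {
    "CUG": "Leu", "AAA": "Lys", "GAU": "Asp",
    "CAG": "Gln", "CAA": "Gln", "GCA": "Ala",
}


def translate(codons: list[str]) -> list[str]:
    """Translate a codon sequence into its amino acid sequence."""
    return [CODON_TO_AMINO[c] for c in codons]


seq_a = ["CUG", "AAA", "GAU", "CAG", "GCA"]  # offsets 10 to 14
seq_b = ["CUG", "AAA", "GAU", "CAA", "GCA"]  # offsets 40 to 44

# Positions at which the two sequences differ at codon granularity.
differing = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
# True when both sequences encode the same amino acid sequence.
same_amino_acids = translate(seq_a) == translate(seq_b)
```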
  • the information processing device generates the amino acid permutation index 143 by generating a bitmap for each amino acid from the bitmaps of codons that have different base sequences but indicate the same amino acid.
  • the information processing device uses the generated amino acid permutation index 143 to identify the relationship between offsets and the types of amino acids in the codon file 141, and specifies the codon sequences corresponding to the positions of repeatedly expressed amino acid sequences as homologous codon sequences. This allows efficient search for codon sequences that are repeatedly expressed.
  • FIG. 3 is a functional block diagram showing the configuration of the information processing apparatus according to the first embodiment.
  • this information processing apparatus 100 has a communication section 110 , an input section 120 , a display section 130 , a storage section 140 and a control section 150 .
  • the communication unit 110 is connected to an external device or the like by wire or wirelessly, and transmits and receives information to and from the external device or the like.
  • the communication unit 110 is implemented by a NIC (Network Interface Card) or the like.
  • the communication unit 110 may be connected to a network (not shown).
  • the input unit 120 is an input device that inputs various types of information to the information processing device 100 .
  • the input unit 120 corresponds to a keyboard, mouse, touch panel, or the like.
  • the display unit 130 is a display device that displays information output from the control unit 150 .
  • the display unit 130 corresponds to a liquid crystal display, an organic EL (Electro Luminescence) display, a touch panel, or the like.
  • the storage unit 140 has a definition table T1, a score table T2, a codon file 141, a codon permutation index 142, an amino acid permutation index 143, and search result information 144.
  • the storage unit 140 is implemented by, for example, a semiconductor memory device such as RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or optical disc.
  • the definition table T1 is a table that defines the correspondence between amino acids and codons.
  • the relationship between amino acids and codons defined in definition table T1 is the same as the relationship between amino acids, bases, and codons described with reference to FIG. 18.
  • the score table T2 is a table that defines the degree of similarity between amino acids.
  • FIG. 4 is a diagram showing an example of the data structure of the score table.
  • the codes shown in the areas A1 and A2 of the score table T2 shown in FIG. 4 are codes that uniquely indicate the amino acids described with reference to FIG. 18.
  • the numerical value of region A3 is a score indicating the probability of amino acid substitution, and a higher score indicates a higher degree of similarity.
  • the score for alanine "A (Ala)” and threonine “T (Thr)” is "-4".
  • the score of alanine "A (Ala)” and tryptophan “W (Trp)” is "1". This indicates that the pair of alanine and tryptophan has a higher degree of similarity than the pair of alanine and threonine.
  • the codon file 141 has information on base sequences in which multiple bases are arranged.
  • FIG. 5 is a diagram showing an example of the data structure of a codon file. As shown in FIG. 5, the codon file 141 is information in which a plurality of base symbols are arranged. A set of three consecutive bases corresponds to one codon.
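  • a minimal sketch of reading such a file at codon granularity. It assumes the reading frame starts at the first base and that an incomplete trailing group of one or two bases is dropped; neither point is stated in the source, and the name `split_codons` is illustrative.

```python
def split_codons(bases: str) -> list[str]:
    """Split a base string into consecutive three-base codons.

    Assumes the reading frame starts at index 0; trailing bases that do not
    fill a whole codon are ignored.
    """
    usable = len(bases) - len(bases) % 3
    return [bases[i:i + 3] for i in range(0, usable, 3)]


codons = split_codons("CUGAAAGAUC")
```

  • the list position of each codon is the offset used by the indexes, so `codons[0]` is the codon at offset 0.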
  • the codon permutation index 142 is information that associates the offset from the beginning of the codon file 141 with the type of codon.
  • FIG. 6 is a diagram showing an example of the data structure of a codon permutation index.
  • the horizontal axis of the codon transposition index 142 is the axis corresponding to the offset.
  • the vertical axis of the codon transposition index 142 is the axis corresponding to the type of codon.
  • the amino acid transposition index 143 is information that associates the offset from the beginning of the codon file 141 with the type of amino acid.
  • FIG. 7 is a diagram showing an example of the data structure of an amino acid permutation index.
  • the horizontal axis of the amino acid transposition index 143 is the axis corresponding to the offset.
  • the vertical axis of the amino acid transposition index 143 is the axis corresponding to the type of amino acid.
  • the offset of the codon at the beginning of the codon file 141 (the codon corresponding to any amino acid) is set to "0". If any of the codons "GCU", "GCC", "GCA", and "GCG" corresponding to the amino acid "Ala" is located at the seventh position from the beginning of the codon file 141, the bit at the intersection of the column of offset "6" and the row of the amino acid "Ala" is "1".
  • the search result information 144 has information on amino acid sequences (codon sequences) that are repeatedly expressed in the codon file 141 .
  • the search result information 144 holds information on repeatedly expressed amino acid sequences and the positions of such amino acid sequences in association with each other.
  • the control unit 150 has a preprocessing unit 151 and an identification unit 152 .
  • the control unit 150 is realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). The control unit 150 may also be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • the preprocessing unit 151 generates a codon permutation index 142 and an amino acid permutation index 143 based on the codon file 141 and the definition table T1.
  • the preprocessing unit 151 selects the target codon type from the codon types included in the definition table T1.
  • the preprocessing unit 151 scans the codon file 141 from the beginning at codon granularity (in units of three bases) and sets the flag "1" at each offset where the selected codon type appears. By repeating this process, it generates a bitmap corresponding to the selected codon type.
  • the preprocessing unit 151 similarly generates bitmaps for other codon types.
  • the preprocessing unit 151 generates a codon-transposed index 142 by setting a bitmap corresponding to each codon type in the codon-transposed index 142 .
  • the preprocessing unit 151 identifies the types of codons corresponding to the same amino acid, and acquires a bitmap corresponding to the identified codon types from the codon permutation index 142 .
  • the preprocessing unit 151 generates an amino acid bitmap by performing an OR operation on the obtained bitmaps of each codon type.
  • for example, a case where the preprocessing unit 151 generates the bitmap of the amino acid "Ala" among the bitmaps of the amino acids in the amino acid permutation index 143 will be described. As described with reference to FIG. 1, the preprocessing unit 151 identifies "GCU", "GCC", "GCA", and "GCG" as codons corresponding to the amino acid "Ala" based on the definition table T1.
  • the preprocessing unit 151 acquires a bitmap 142-1 of the codon "GCU", a bitmap 142-2 of the codon "GCC", a bitmap 142-3 of the codon "GCA", and a bitmap 142-4 of the codon "GCG".
  • the preprocessing unit 151 generates a bitmap 143-1 of the amino acid "Ala” by executing an OR operation (logical sum) on the bitmaps 142-1 to 142-4.
  • the preprocessing unit 151 also generates bitmaps of the other amino acids in the same manner as the bitmap 143-1 of the amino acid "Ala", and generates the amino acid permutation index 143 by setting the bitmap of each amino acid in it.
  • the identifying unit 152 identifies positions (offsets) of amino acid sequences repeatedly expressed in the codon file 141 based on the amino acid transposition index 143 .
  • the specifying unit 152 specifies each codon sequence corresponding to the position (offset) of the amino acid sequence repeatedly expressed in the codon file 141 as a homologous codon sequence.
  • the identifying unit 152 executes a longest-match search of amino acid sequences to identify the longest matching amino acid sequence. If the number of times the longest matching amino acid sequence is expressed is equal to or greater than a preset number of times of expression, the specifying unit 152 treats that amino acid sequence as an "amino acid sequence candidate".
  • for example, assume that the amino acid sequence "Leu, Lys, Asp, Gln, Ala" is repeatedly expressed at offsets 10 to 14, 40 to 44, etc. of the codon file 141, and that the number of times of expression is equal to or greater than the predetermined number of times of expression.
  • in this case, the specifying unit 152 identifies the codon sequence "CUG, AAA, GAU, CAG, GCA" included at offsets 10 to 14 and the codon sequence "CUG, AAA, GAU, CAA, GCA" included at offsets 40 to 44 as homologous codon sequences.
  • the identifying unit 152 registers information on the identified codon sequences having homology in the search result information 144 .
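  • one way a longest-match search with an expression-count threshold might look is sketched below. The source does not spell out the search strategy, so the greedy extension used here, the helper names, and the parameter `min_count` are all assumptions. Bitmaps are Python integers in which bit n stands for offset n.

```python
def build_index(aminos: list[str]) -> dict[str, int]:
    """Bitmap per amino acid type; bit n is set when it occurs at offset n."""
    index: dict[str, int] = {}
    for offset, amino in enumerate(aminos):
        index[amino] = index.get(amino, 0) | (1 << offset)
    return index


def longest_repeat_from(aminos: list[str], index: dict[str, int],
                        start: int, min_count: int) -> list[str]:
    """Greedily extend the sequence beginning at `start` while it still
    occurs at least `min_count` times; return the longest such sequence,
    or an empty list when even the first amino acid is too rare."""
    hits = index[aminos[start]]
    if bin(hits).count("1") < min_count:
        return []
    length = 1
    while start + length < len(aminos):
        # Shift surviving matches by one offset and keep those followed
        # by the next amino acid of the sequence being extended.
        extended = (hits << 1) & index[aminos[start + length]]
        if bin(extended).count("1") < min_count:
            break
        hits = extended
        length += 1
    return aminos[start:start + length]


aminos = ["Leu", "Lys", "Asp", "Gly", "Leu", "Lys", "Asp", "Ala"]
index = build_index(aminos)
candidate = longest_repeat_from(aminos, index, start=0, min_count=2)
```

  • in this sketch the count includes overlapping occurrences; whether the patented method excludes them is not stated in the source.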
  • FIG. 8 is a diagram (1) for explaining the processing of the specifying unit. As an example, FIG. 8 describes a case of identifying whether or not the amino acid sequence "Leu, Lys, Asp, Gln" is included in the codon file 141.
  • the identifying unit 152 acquires the bitmap 50 of the amino acid "Leu” from the amino acid permutation index 143. In the bitmap 50, flags "1" are set at offsets "10" and “20". The specifying unit 152 generates a bitmap 50s by left-shifting the bitmap 50 . In the bitmap 50s, flags "1" are set at offsets "11" and "21".
  • the specifying unit 152 acquires the bitmap 51 of the amino acid "Lys" from the amino acid permutation index 143. In bitmap 51, flag "1" is set at offset "11". The specifying unit 152 generates the bitmap 52 by performing an AND operation on the bitmap 50 s and the bitmap 51 .
  • the specifying unit 152 generates a bitmap 52s by left-shifting the bitmap 52. In the bitmap 52s, flag "1" is set at offset "12".
  • the identifying unit 152 acquires the bitmap 53 of the amino acid "Asp” from the amino acid permutation index 143. In bitmap 53, flag "1" is set at offset "12". The specifying unit 152 generates the bitmap 54 by performing an AND operation on the bitmap 52 s and the bitmap 53 .
  • the specifying unit 152 generates a bitmap 54s by left-shifting the bitmap 54. In the bitmap 54s, flag "1" is set at offset "13".
  • the specifying unit 152 acquires the bitmap 55 of the amino acid "Gln” from the amino acid permutation index 143. In bitmap 55, flag "1" is set at offset "13". The specifying unit 152 generates a bitmap 56 by performing an AND operation on the bitmap 54 s and the bitmap 55 .
  • the identifying unit 152 identifies the amino acid sequence with the longest match by repeatedly executing the above process for each amino acid sequence, and identifies the amino acid sequence that is repeatedly expressed.
  • the identifying unit 152 may use other techniques to identify repetitively expressed amino acid sequences.
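  • the shift-and-AND chain of FIG. 8 can be sketched with the offsets quoted above. Bitmaps are modelled as Python integers in which bit n stands for offset n, so the "left shift" of the description (offset 10 becomes 11) is `<< 1`. The function name `find_sequence` is an illustrative assumption.

```python
def find_sequence(amino_index: dict[str, int], sequence: list[str]) -> list[int]:
    """Return the start offsets at which `sequence` occurs.

    `sequence` must be non-empty. Each step shifts the surviving matches by
    one offset and ANDs them with the bitmap of the next amino acid.
    """
    hits = amino_index.get(sequence[0], 0)
    for amino in sequence[1:]:
        hits = (hits << 1) & amino_index.get(amino, 0)
    # Each surviving bit marks the offset of the last amino acid of a match.
    last = len(sequence) - 1
    starts = []
    offset = 0
    while hits:
        if hits & 1:
            starts.append(offset - last)
        hits >>= 1
        offset += 1
    return starts


# Bitmaps 50, 51, 53, and 55 from the description of FIG. 8.
amino_index = {
    "Leu": (1 << 10) | (1 << 20),
    "Lys": 1 << 11,
    "Asp": 1 << 12,
    "Gln": 1 << 13,
}
starts = find_sequence(amino_index, ["Leu", "Lys", "Asp", "Gln"])
```

  • the "Leu" at offset 20 drops out at the first AND because no "Lys" follows it at offset 21, leaving only the match that starts at offset 10.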
  • FIG. 9 is a diagram (2) for explaining the processing of the specifying unit.
  • the amino acid sequence candidates 60a and 60b are used for explanation.
  • the amino acid sequence candidates 60a and 60b are "Leu, Lys, Asp, Gln, Ala".
  • the table in FIG. 18 corresponding to the definition table T1
  • the specifying unit 152 specifies the score of each amino acid based on the score table T2, and totals them to calculate the homology score.
  • the score between L (Leu) is “0” because it does not exist in the score table T2.
  • the score between K (Lys) is “-1” based on the score table T2.
  • the score between D (Asp) is “-1” based on the score table T2.
  • the score between Q(Gln) is "0” because it does not exist in the score table T2.
  • the score between A (Ala) is "5" based on the score table T2. Therefore, the specifying unit 152 calculates the cumulative score of "3" for the amino acid sequence candidates 60a and 60b.
  • the identifying unit 152 identifies the amino acid sequence candidate as an amino acid sequence having a homology relationship when the cumulative score of the amino acid sequence candidate is equal to or greater than the threshold.
  • the specifying unit 152 registers the specified result in the search result information 144 .
  • the threshold is preset by an administrator.
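  • the score accumulation and threshold check described above can be sketched as follows. `SCORE_TABLE` is a hypothetical excerpt holding only the pair scores quoted in the text, pairs missing from it score 0 as described, and the threshold value 3 is an arbitrary placeholder for the administrator's setting.

```python
# Hypothetical excerpt of score table T2 (one-letter amino acid codes).
SCORE_TABLE = {("K", "K"): -1, ("D", "D"): -1, ("A", "A"): 5}


def cumulative_score(seq_a: str, seq_b: str) -> int:
    """Sum the pairwise scores; pairs missing from the table score 0."""
    return sum(SCORE_TABLE.get((a, b), 0) for a, b in zip(seq_a, seq_b))


THRESHOLD = 3  # placeholder; the source only says it is preset by an administrator

# Candidates 60a and 60b are both "Leu, Lys, Asp, Gln, Ala" -> "LKDQA".
score = cumulative_score("LKDQA", "LKDQA")
is_homologous = score >= THRESHOLD
```

  • the sum works out as 0 + (-1) + (-1) + 0 + 5 = 3, matching the cumulative score given for the candidates 60a and 60b.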
  • the specifying unit 152 may further specify amino acid sequences that are expressed symmetrically with the specified amino acid sequences after specifying the amino acid sequences that have a homology relationship.
  • FIG. 10 is a diagram (3) for explaining the processing of the specifying unit.
  • based on the amino acid permutation index 143, the identifying unit 152 identifies the amino acid sequence "Ala, Gln, Asp, Lys, Leu", which is expressed symmetrically to the amino acid sequence "Leu, Lys, Asp, Gln, Ala" identified by the above process.
  • the identifying unit 152 identifies the amino acid sequence “Ala, Gln, Asp, Lys, Leu” present at offsets “30-34” of the codon file 141 .
  • FIG. 11 is a diagram (4) for explaining the processing of the specifying unit.
  • a case of specifying whether or not the codon file 141 includes the symmetrical amino acid sequence “Ala, Gln, Asp (Lys and Leu are omitted)” will be described.
  • the identifying unit 152 acquires the bitmap 60 of the amino acid "Ala” from the amino acid permutation index 143. In bitmap 60, flag “1" is set at offset "24". The identifying unit 152 right-shifts the bitmap 60 to generate a bitmap 60s. In bitmap 60s, flag "1" is set at offset "23".
  • the specifying unit 152 acquires the bitmap 61 of the amino acid "Gln” from the amino acid permutation index 143. In bitmap 61, flag "1" is set at offset "23". The identifying unit 152 generates a bitmap 62 by performing an AND operation on the bitmap 60s and the bitmap 61 .
  • the specifying unit 152 right-shifts the bitmap 62 to generate a bitmap 62s. In the bitmap 62s, flag "1" is set at offset "22".
  • the specifying unit 152 acquires the bitmap 63 of the amino acid "Asp" from the amino acid permutation index 143.
  • the bitmap 63 has a flag "1" set at offset "22".
  • the identifying unit 152 generates a bitmap 64 by performing an AND operation on the bitmap 62 s and the bitmap 63 .
  • the identifying unit 152 identifies a symmetrical amino acid sequence by executing the above process.
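  • the right-shift variant of FIG. 11 can be sketched with the offsets quoted above. Bitmaps are Python integers in which bit n stands for offset n, so the "right shift" of the description (offset 24 becomes 23) is `>> 1`. The function name `right_shift_chain` is an illustrative assumption.

```python
def right_shift_chain(amino_index: dict[str, int], sequence: list[str]) -> int:
    """Apply the right-shift-then-AND chain to a non-empty `sequence`.

    Bit n of the result is 1 when sequence[-1] is at offset n, sequence[-2]
    at offset n + 1, and so on up to sequence[0].
    """
    hits = amino_index.get(sequence[0], 0)
    for amino in sequence[1:]:
        hits = (hits >> 1) & amino_index.get(amino, 0)
    return hits


# Bitmaps 60, 61, and 63 from the description of FIG. 11.
amino_index = {"Ala": 1 << 24, "Gln": 1 << 23, "Asp": 1 << 22}
bitmap_64 = right_shift_chain(amino_index, ["Ala", "Gln", "Asp"])
```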
  • the specifying unit 152 registers the specified result in the search result information 144 .
  • the identification unit 152 may output the search result information 144 to the display unit 130 for display, or may transmit it to an external device via the communication unit 110 .
  • FIG. 12 is a diagram showing an example of the data structure of search result information.
  • this search result information 144 associates amino acid sequences, first offsets, second offsets, and cumulative scores.
  • the amino acid sequence is a homologous amino acid sequence specified by the specifying unit 152.
  • the first offset indicates the offset of codon file 141 where the codon sequence corresponding to the homologous amino acid sequence exists.
  • the second offset indicates the offset of codon file 141 where the codon sequence corresponding to the symmetrical amino acid sequence exists.
  • the cumulative score is the cumulative value of the scores described with reference to FIG. 9 .
  • the first offsets corresponding to the amino acid sequence "Leu, Lys, Asp, Gln, Ala” are “10-14" and "40-44". Therefore, the codon sequences corresponding to the offsets "10 to 14" and the codon sequences corresponding to the offsets "40 to 44" of the codon file 141 are homologous codon sequences.
  • the second offset of the amino acid sequence "Ala, Gln, Asp, Lys, Leu” which is symmetrical to the amino acid sequence "Leu, Lys, Asp, Gln, Ala” is "30-34". Therefore, the codon sequence corresponding to the offset "30-34" of the codon file 141 is a symmetrical codon sequence.
  • the portion between the homologous amino acid sequence in the search result information and the symmetrical amino acid sequence can be said to be the portion corresponding to the motif. That is, the portion between the first offset "10-14" and the second offset "30-34" corresponds to the motif portion.
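  • a sketch of one possible record layout for the search result information 144 and of the motif span derived from it. The field names, the use of inclusive (start, end) tuples, and the choice to exclude the flanking sequences from the span are assumptions; the source only says the motif lies between the two offsets.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    amino_acids: list[str]
    first_offsets: list[tuple[int, int]]   # inclusive ranges of homologous occurrences
    second_offsets: list[tuple[int, int]]  # inclusive ranges of symmetrical occurrences
    cumulative_score: int


def motif_span(first: tuple[int, int], second: tuple[int, int]) -> tuple[int, int]:
    """Offsets lying strictly between a homologous occurrence and a later
    symmetrical occurrence (assumed reading of "the portion between")."""
    return (first[1] + 1, second[0] - 1)


result = SearchResult(
    amino_acids=["Leu", "Lys", "Asp", "Gln", "Ala"],
    first_offsets=[(10, 14), (40, 44)],
    second_offsets=[(30, 34)],
    cumulative_score=3,
)
span = motif_span(result.first_offsets[0], result.second_offsets[0])
```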
  • FIG. 13 is a flow chart showing the processing procedure of the information processing apparatus according to the first embodiment.
  • the preprocessing unit 151 of the information processing apparatus 100 generates the codon permutation index 142 based on the codon file 141 and the definition table T1 (step S101).
  • the preprocessing unit 151 identifies multiple codons corresponding to the same amino acid based on the definition table T1 (step S102).
  • the preprocessing unit 151 performs an OR operation on the specified bitmaps of the plurality of codons, generates a bitmap of amino acids, and generates an amino acid permutation index 143 (step S103).
  • the specifying unit 152 of the information processing device 100 specifies a candidate amino acid sequence that is repeatedly expressed based on the amino acid transposition index 143 (step S104).
  • the identifying unit 152 calculates the cumulative score of the amino acid sequence candidate based on the score table T2 (step S105).
  • the identifying unit 152 identifies homologous amino acid sequences (homologous codon sequences) based on the cumulative score (step S106). The identifying unit 152 identifies an amino acid sequence that is symmetrical with the homologous amino acid sequence (step S107).
  • the specifying unit 152 registers the specified result in the search result information 144 (step S108).
  • the specifying unit 152 outputs the search result information 144 (step S109).
  • the information processing apparatus 100 generates an amino acid permutation index 143 by generating a bitmap for each amino acid from bitmaps of codons showing the same amino acid but different base sequences.
  • the information processing apparatus 100 uses the generated amino acid permutation index 143 to identify the relationship between offsets and the types of amino acids in the codon file 141, and specifies the codon sequences corresponding to the positions of repeatedly expressed amino acid sequences as homologous codon sequences. This allows efficient search for codon sequences that are repeatedly expressed.
  • the information processing device 100 evaluates whether or not the amino acid sequences repeatedly expressed in the codon file 141 are homologous amino acids based on the score table T2 that defines the degree of homology between amino acids. This makes it possible to assess the degree of homology between amino acid sequences, not just mere amino acid matching.
  • the information processing device 100 calculates a bitmap of one amino acid corresponding to a plurality of codons by performing a logical sum of the bitmaps of the codon permutation indices 142 corresponding to the plurality of codons. This makes it possible to easily generate a bitmap of amino acids corresponding to a plurality of codons and generate the amino acid permutation index 143 .
  • in the first embodiment, homologous amino acid sequences were identified at amino acid granularity, and homologous codon sequences were identified based on the offsets of the identified amino acid sequences. However, homologous codon sequences may also be specified at codon granularity. The second embodiment describes a process for specifying homologous codon sequences at codon granularity.
  • FIG. 14 is a diagram (1) for explaining the processing of the information processing apparatus according to the second embodiment.
  • the information processing device Based on the codon transposition index 142, the information processing device identifies the offset of the codon file 141 and the types of codons, and identifies codon sequences that are repeatedly expressed.
  • the description of the codon permutation index 142 is the same as the description of the codon permutation index 142 described in the first embodiment.
  • the codon sequence "CUG, AAA, GAU” is repeatedly expressed at offsets 10-12, 30-32, 40-42, etc. of the codon file 141.
  • the information processing device identifies codon sequences at offsets 10 to 12, 30 to 32, and 40 to 42 as homologous codon sequences.
  • the information processing apparatus may identify amino acid sequences having homology at the granularity of amino acids as described in the first embodiment.
  • FIG. 15 is a diagram (2) for explaining the processing of the information processing apparatus according to the second embodiment.
  • the information processing device may identify symmetrical codon sequences at codon granularity. For example, if the homologous codon sequence is “CUG, AAA, GAU”, the information processing device identifies the symmetrical codon sequence “GAU, AAA, CUG” from the codon file 141 . In the example shown in FIG. 2, the information processing device identifies that the symmetrical codon sequence “GAU, AAA, CUG” is expressed at offsets 23-25.
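  • the codon-granularity search can be sketched by running the same shift-and-AND chain over codon bitmaps instead of amino acid bitmaps. The sample codon list below is invented for illustration and does not reproduce the offsets in the figures; the helper names are likewise assumptions.

```python
def build_index(codons: list[str]) -> dict[str, int]:
    """Bitmap per codon type; bit n is set when the codon occurs at offset n."""
    index: dict[str, int] = {}
    for offset, codon in enumerate(codons):
        index[codon] = index.get(codon, 0) | (1 << offset)
    return index


def find_starts(index: dict[str, int], sequence: list[str]) -> list[int]:
    """Start offsets of every occurrence of a non-empty `sequence`."""
    hits = index.get(sequence[0], 0)
    for item in sequence[1:]:
        hits = (hits << 1) & index.get(item, 0)
    last = len(sequence) - 1
    return [n - last for n in range(hits.bit_length()) if (hits >> n) & 1]


# Invented sample data: the pattern at offsets 0 and 7, its mirror at offset 4.
codons = ["CUG", "AAA", "GAU", "GGG", "GAU", "AAA", "CUG", "CUG", "AAA", "GAU"]
index = build_index(codons)
pattern = ["CUG", "AAA", "GAU"]

forward = find_starts(index, pattern)         # repeated codon sequence
mirrored = find_starts(index, pattern[::-1])  # symmetrical codon sequence
```

  • unlike the amino acid search, this version treats synonymous codons such as "CAG" and "CAA" as different, so it finds only exact codon-level repeats.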
  • the process by which the information processing apparatus according to the second embodiment uses the codon permutation index 142 to specify codon sequences, for example by longest-match search, is the same as the process using the amino acid permutation index 143 described in the first embodiment, so its description is omitted.
  • The functional block diagram of the information processing apparatus according to the second embodiment corresponds to the functional block diagram of the information processing apparatus 100 shown in FIG. 3. The apparatus additionally executes the processes of the second embodiment described above.
  • The information processing apparatus 100 described above identifies homologous codon sequences and symmetrical codon sequences, and thereby identifies portions corresponding to motifs and the like.
  • Multiple alignment refers to arranging three or more DNA base sequences or protein amino acid sequences so that corresponding portions are aligned. It is usually assumed that aligned sequences have an evolutionary relationship. Molecular phylogenetic trees can be deduced based on the results of multiple alignments.
  • FIG. 16 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
  • The computer 300 has a CPU 301 that executes various arithmetic processes, an input device 302 that receives data input from the user, and a display 303.
  • The computer 300 also has a communication device 304 and an interface device 305 for exchanging data with an external device or the like via a wired or wireless network.
  • The computer 300 also has a RAM 306 that temporarily stores various information, and a hard disk device 307. The devices 301 to 307 are connected to a bus 308.
  • The hard disk device 307 has a preprocessing program 307a and an identification program 307b.
  • The CPU 301 reads the programs 307a and 307b and loads them into the RAM 306.
  • The preprocessing program 307a functions as a preprocessing process 306a.
  • The identification program 307b functions as an identification process 306b.
  • The processing of the preprocessing process 306a corresponds to the processing of the preprocessing unit 151.
  • The processing of the identification process 306b corresponds to the processing of the identification unit 152.
  • Each program does not necessarily have to be stored in the hard disk device 307 from the beginning.
  • For example, each program may be stored on a “portable physical medium” such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card that is inserted into the computer 300. The computer 300 may then read and execute the programs 307a and 307b.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Communication Control (AREA)
PCT/JP2021/018730 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus Ceased WO2022244089A1 (ja)

Priority Applications (6)

Application Number Priority Date Filing Date Title
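CN202180098100.0A CN117296100A (zh) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus
PCT/JP2021/018730 WO2022244089A1 (ja) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus
JP2023522033A JP7537609B2 (ja) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus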
AU2021446660A AU2021446660A1 (en) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus
EP21940706.1A EP4343769A4 (en) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing device
US18/502,405 US20240071568A1 (en) 2021-05-18 2023-11-06 Storage medium, information processing method, and information processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/018730 WO2022244089A1 (ja) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/502,405 Continuation US20240071568A1 (en) 2021-05-18 2023-11-06 Storage medium, information processing method, and information processing apparatus

Publications (1)

Publication Number Publication Date
WO2022244089A1 (ja) 2022-11-24

Family

ID=84141370

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018730 Ceased WO2022244089A1 (ja) 2021-05-18 2021-05-18 Information processing program, information processing method, and information processing apparatus

Country Status (6)

Country Link
US (1) US20240071568A1
EP (1) EP4343769A4
JP (1) JP7537609B2
CN (1) CN117296100A
AU (1) AU2021446660A1
WO (1) WO2022244089A1

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
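JPH0877177A (ja) * 1994-09-01 1996-03-22 Fujitsu Ltd List processing system and method therefor
WO2005096208A1 (ja) 2004-03-31 2005-10-13 Bio-Think Tank Co., Ltd. Base sequence search device and base sequence search method
WO2010086990A1 (ja) * 2009-01-29 2010-08-05 Spiber Inc. Method for constructing DNA tag
JP2014112307A (ja) 2012-12-05 2014-06-19 Sony Corp Motif search program, information processing device, and motif search method
JP2015524658A (ja) * 2012-07-16 2015-08-27 Dow AgroSciences LLC Method for designing diverged, codon-optimized large repeated DNA sequences
WO2020049748A1 (ja) 2018-09-07 2020-03-12 Fujitsu Limited Identification method, identification program, and information processing device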

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004234297A (ja) * 2003-01-30 2004-08-19 Biomatics Inc Biological sequence information processing device
US11516679B2 (en) * 2018-05-30 2022-11-29 Sony Corporation Communication control device, communication control method, and computer program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4343769A4

Also Published As

Publication number Publication date
JPWO2022244089A1 2022-11-24
EP4343769A1 (en) 2024-03-27
JP7537609B2 (ja) 2024-08-21
CN117296100A (zh) 2023-12-26
EP4343769A4 (en) 2024-10-30
AU2021446660A1 (en) 2023-11-30
US20240071568A1 (en) 2024-02-29

Similar Documents

Publication Publication Date Title
Zhu et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea
Morrison Multiple sequence alignment for phylogenetic purposes
Whelan et al. Molecular phylogenetics: state-of-the-art methods for looking into the past
Parisi et al. STRING: finding tandem repeats in DNA sequences
Zhang et al. DiscMLA: an efficient discriminative motif learning algorithm over high-throughput datasets
Oğul et al. A discriminative method for remote homology detection based on n-peptide compositions with reduced amino acid alphabets
AU2018440274B2 (en) Identification method, identification program, and information processing device
WO2022244089A1 (ja) Information processing program, information processing method, and information processing apparatus
Cordero et al. Large disclosing the nature of computational tools for the analysis of next generation sequencing data
Oğul et al. SVM-based detection of distant protein structural relationships using pairwise probabilistic suffix trees
Sgarbossa et al. Pairing interacting protein sequences using masked language modeling
Patra et al. Motif discovery in biological network using expansion tree
KR101398851B1 (ko) System and method for identifying complex patterns of amino acids
Haque et al. An efficient algorithm for local sequence alignment
JPWO2020230240A1 (ja) Evaluation method, evaluation program, and evaluation device
Böer Multiple alignment using hidden Markov models
Goswami et al. Algorithms for string comparison in DNA sequences
Hadian Dehkordi et al. gpaligner: A fast algorithm for global pairwise alignment of dna sequences
Tapinos et al. Alignment by numbers: sequence assembly using compressed numerical representations
Mohammadi et al. Fast motif discovery using a new motif extension algorithm
Schallmey Bioinformatic Methods for Enzyme Identification
Tan et al. A new encoding scheme for protein structure representation
Tapinos et al. Alignment by the numbers: sequence assembly using reduced dimensionality numerical representations
Shoyaib et al. Protein secondary structure prediction with high accuracy using support vector machine
Zaki et al. A comparative analysis of protein homology detection methods

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21940706; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2023522033; Country of ref document: JP)
WWE WIPO information: entry into national phase (Ref document number: 202180098100.0; Country of ref document: CN)
WWE WIPO information: entry into national phase (Ref document number: 2021446660; Country of ref document: AU. Ref document number: AU2021446660; Country of ref document: AU)
ENP Entry into the national phase (Ref document number: 2021446660; Country of ref document: AU; Date of ref document: 20210518; Kind code of ref document: A)
WWE WIPO information: entry into national phase (Ref document number: 2021940706; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2021940706; Country of ref document: EP; Effective date: 20231218)