WO2022224336A1 - Information processing program, information processing method, and information processing apparatus - Google Patents

Information processing program, information processing method, and information processing apparatus

Info

Publication number
WO2022224336A1
WO2022224336A1 PCT/JP2021/015983 JP2021015983W WO2022224336A1 WO 2022224336 A1 WO2022224336 A1 WO 2022224336A1 JP 2021015983 W JP2021015983 W JP 2021015983W WO 2022224336 A1 WO2022224336 A1 WO 2022224336A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
genome
vectors
information processing
subgenome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/015983
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
正弘 片岡
光人 和田
量 松村
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN202180095792.3A (CN117043868A)
Priority to EP21937833.8A (EP4328921A4)
Priority to PCT/JP2021/015983 (WO2022224336A1)
Priority to JP2023515916A (JP7619443B2)
Priority to AU2021441603A (AU2021441603A1)
Publication of WO2022224336A1
Priority to US18/468,023 (US20240006028A1)
Legal status: Ceased

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G16B15/00 ICT specially adapted for analysing two-dimensional [2D] or three-dimensional [3D] molecular structures, e.g. structural or functional relations or structure alignment
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to an information processing program and the like.
  • Due to advances in gene introduction technology and a deeper understanding of the immune system, genetic recombination is being performed using gene vectors. Depending on the size of the gene fragment to be inserted and the purpose of insertion, media with various added characteristics are used as gene vectors. Gene vectors derived from E. coli, yeast, host organisms, etc. are used for these operations.
  • CAR: chimeric antigen receptor
  • the present invention aims to provide an information processing program, an information processing method, and an information processing apparatus that can identify a genome that can replace a subgenome contained in a target genome.
  • the computer executes the following processing.
  • the computer executes learning of a learning model based on learning data that defines the relationship between vectors corresponding to the genome and vectors corresponding to a plurality of sub-genomes constituting the genome.
  • when the computer receives the genome to be analyzed, it inputs the genome to be analyzed into the learning model, thereby calculating vectors of a plurality of subgenomes corresponding to the genome to be analyzed.
  • FIG. 1 is a diagram for explaining the genome.
  • FIG. 2 is a diagram showing the relationship between amino acids, bases, and codons.
  • FIG. 3 is a diagram for explaining the primary structure, secondary structure, tertiary structure, and higher-order structure of proteins.
  • FIG. 4 is a diagram showing an example of a gene vector.
  • FIG. 5 is a diagram for explaining an example of learning phase processing of the information processing apparatus according to the present embodiment.
  • FIG. 6 is a diagram for explaining an example of analysis phase processing of the information processing apparatus according to the present embodiment.
  • FIG. 7 is a functional block diagram showing the configuration of the information processing apparatus according to the first embodiment.
  • FIG. 8 is a diagram showing an example of the data structure of a base file.
  • FIG. 9 is a diagram showing an example of the data structure of a conversion table.
  • FIG. 10 is a diagram showing an example of the data structure of a dictionary table.
  • FIG. 11 is a diagram showing an example of the data structure of a protein primary structure dictionary.
  • FIG. 12 is a diagram showing an example of the data structure of a secondary structure dictionary.
  • FIG. 13 is a diagram showing an example of the data structure of a tertiary structure dictionary.
  • FIG. 14 is a diagram showing an example of the data structure of a high-order structure dictionary.
  • FIG. 15 is a diagram illustrating an example of the data structure of a compressed file table.
  • FIG. 16 is a diagram showing an example of the data structure of a vector table.
  • FIG. 17 is a diagram showing an example of the data structure of a protein primary structure vector table.
  • FIG. 18 is a diagram showing an example of the data structure of a secondary structure vector table.
  • FIG. 19 is a diagram showing an example of the data structure of a tertiary structure vector table.
  • FIG. 20 is a diagram showing an example of the data structure of a high-order structure vector table.
  • FIG. 21 is a diagram illustrating an example of the data structure of an inverted index table.
  • FIG. 22 is a diagram showing an example of the data structure of a protein primary structure inverted index.
  • FIG. 23 is a diagram showing an example of the data structure of a secondary structure inverted index.
  • FIG. 24 is a diagram showing an example of the data structure of a tertiary structure inverted index.
  • FIG. 25 is a diagram illustrating an example of the data structure of a high-order structure inverted index.
  • FIG. 26 is a diagram showing an example of the data structure of a genome dictionary.
  • FIG. 27 is a flowchart (1) showing the processing procedure of the information processing apparatus according to the present embodiment.
  • FIG. 28 is a flowchart (2) showing the processing procedure of the information processing apparatus according to the embodiment.
  • FIG. 29 is a diagram for explaining an example of learning phase processing of the information processing apparatus according to the second embodiment.
  • FIG. 30 is a diagram for explaining processing of the information processing apparatus according to the second embodiment.
  • FIG. 31 is a functional block diagram showing the configuration of the information processing apparatus according to the second embodiment.
  • FIG. 32 is a flow chart showing the processing procedure of the information processing apparatus according to the second embodiment.
  • FIG. 33 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
  • FIG. 1 is a diagram for explaining the genome.
  • Genome 1 contains genetic information that defines the order in which a plurality of amino acids are linked. Here, each amino acid is determined by three consecutive bases, i.e., a codon.
  • Genome 1 also includes information on protein 1a.
  • Protein 1a is composed of a plurality of 20 types of amino acids linked together in a chain. The structure of protein 1a can be understood as a primary structure, secondary structure, tertiary structure, and higher (quaternary) structure of the protein.
  • FIG. 1b shows the conformation of protein 1a.
  • In the following description, the primary structure of a protein, the secondary structure of a protein, the tertiary structure of a protein, and the higher-order structure of a protein are referred to as primary structure, secondary structure, tertiary structure, and higher-order structure, respectively.
  • FIG. 2 is a diagram showing the relationship between amino acids, bases, and codons. A group of three bases is called a "codon". The sequence of bases determines the codon, and once the codon is determined, the amino acid is determined.
  • a protein is uniquely determined by its base sequence.
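  • The codon-to-amino-acid relationship described above can be sketched as a table lookup. The following is a minimal illustration, assuming a small subset of the standard genetic code; the names `CODON_TABLE` and `translate` are hypothetical and do not appear in the patent.

```python
# Partial standard genetic code (RNA codons); the full table has 64 entries.
CODON_TABLE = {
    "UUU": "Phe", "UUC": "Phe",
    "AUG": "Met",
    "UGG": "Trp",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(bases: str) -> list[str]:
    """Split a base sequence into codons and map each codon to an amino acid."""
    rna = bases.upper().replace("T", "U")  # treat DNA "T" as RNA "U"
    amino_acids = []
    # Trailing bases that do not form a full codon are ignored.
    for i in range(0, len(rna) - len(rna) % 3, 3):
        residue = CODON_TABLE[rna[i:i + 3]]  # KeyError for codons outside this subset
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids


print(translate("ATGTTTGGCTGGTAA"))  # ['Met', 'Phe', 'Gly', 'Trp']
```

  • The resulting list of amino acids is the primary structure in the sense used in this description.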
  • a protein's primary structure is a sequence of amino acids.
  • Secondary structures include α-helices and β-sheets, which are localized, symmetrical substructures.
  • Tertiary structure includes multiple secondary structures.
  • a higher-order structure also includes multiple tertiary structures.
  • FIG. 3 is a diagram for explaining the primary structure, secondary structure, tertiary structure, and higher-order structure of proteins.
  • the higher order structure Z 1 includes the tertiary structures Y 1 , Y 2 , Y 3 and so on.
  • Tertiary structure Y 1 includes secondary structures X 1 , X 2 , X 3 and the like.
  • Secondary structure X 1 includes primary structures W 1 , W 2 , W 3 and the like.
  • Primary structure W 1 includes amino acids A 1 , A 2 , A 3 and so on.
  • a genetic vector used in this example is a DNA or RNA molecule that is used to artificially transfer foreign genetic material to another cell.
  • Gene vectors include plasmids, cosmids, lambda phages, artificial chromosomes, and the like.
  • FIG. 4 is a diagram showing an example of a gene vector.
  • the gene vector shown in FIG. 4 is the pBR322 plasmid, widely used as a cloning vector.
  • the gene vector itself is a base sequence of DNA and RNA, and will be described as corresponding to the higher-order structure of the protein illustrated in FIG. 3, for example.
  • Gene vectors are generated by synthesizing multiple subvectors.
  • Subvectors are sequences of DNA and RNA, corresponding to, for example, the protein secondary structure illustrated in FIG.
  • Subvectors also include so-called E. coli vectors, which contain elements necessary for maintenance in E. coli, and vectors for maintenance in cell lines derived from yeast, plants, mammals, and the like.
  • Subvectors may be other vectors.
  • FIG. 5 is a diagram for explaining an example of processing in the learning phase of the information processing apparatus according to this embodiment.
  • the information processing device uses learning data 65 to perform machine learning of a learning model 70.
  • the learning model 70 corresponds to a CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), or the like.
  • The learning data 65 defines the relationship between the vector of the target genome (therapeutic drug) and the vectors of a plurality of subgenomes included in this target genome. For example, the vector of the target genome corresponds to the input data, and the vectors of the multiple subgenomes are the correct values of the output data.
  • The information processing device executes learning by error backpropagation so that the output obtained when the vector of the target genome is input to the learning model 70 approaches the vector of each subgenome.
  • The information processing device adjusts the parameters of the learning model 70 (machine learning) by repeatedly executing the above processing based on the relationships between target genome vectors and subgenome vectors included in the learning data 65.
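  • The learning phase can be illustrated with a deliberately simplified sketch. The patent names CNN or RNN models; the code below instead substitutes a single linear layer trained by gradient descent on squared error, which is the simplest case of error backpropagation. The toy data, dimensions, and all names (`LEARNING_DATA`, `train`, `predict`, `loss`) are assumptions made only for illustration.

```python
# Toy learning data: (target-genome vector, [subgenome vectors]).
# All numbers are made up for illustration.
LEARNING_DATA = [
    ([1.0, 0.0], [[0.5, 0.1], [0.2, 0.9]]),
    ([0.0, 1.0], [[0.3, 0.7], [0.8, 0.4]]),
]


def flatten(vectors):
    """Concatenate the subgenome vectors into one output vector."""
    return [x for v in vectors for x in v]


def predict(weights, target_vec):
    """One linear layer: each output is a weighted sum of the inputs."""
    return [sum(w * x for w, x in zip(row, target_vec)) for row in weights]


def loss(weights, data):
    """Sum of squared errors between outputs and correct subgenome vectors."""
    total = 0.0
    for target_vec, sub_vecs in data:
        out = predict(weights, target_vec)
        total += sum((o - y) ** 2 for o, y in zip(out, flatten(sub_vecs)))
    return total


def train(data, epochs=200, lr=0.1):
    in_dim = len(data[0][0])
    out_dim = len(flatten(data[0][1]))
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(epochs):
        for target_vec, sub_vecs in data:
            out = predict(weights, target_vec)
            expected = flatten(sub_vecs)
            for j in range(out_dim):
                error = out[j] - expected[j]  # error at output j
                for i in range(in_dim):
                    # Gradient of the squared error with respect to weights[j][i].
                    weights[j][i] -= lr * 2 * error * target_vec[i]
    return weights


weights = train(LEARNING_DATA)
print(loss(weights, LEARNING_DATA))
```

  • After training, `predict(weights, target_vec)` returns the concatenated subgenome vectors for a target genome vector; a real model would be far larger, but the input/output relationship is the same as in the learning data 65.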
  • FIG. 6 is a diagram for explaining an example of analysis phase processing of the information processing apparatus according to the present embodiment.
  • the information processing device uses the learning model 70 learned in the learning phase to perform the following processing.
  • When the information processing device receives an analysis query 80 specifying a target genome (therapeutic drug), it converts the target genome of the analysis query 80 into a vector Vob80.
  • The information processing device calculates a plurality of vectors (Vsb80-1, Vsb80-2, Vsb80-3, ... Vsb80-n), one for each subgenome, by inputting the vector Vob80 into the learning model 70, and stores them in the subgenome table T1.
  • The information processing device compares the vectors (Vt1, Vt2, Vt3, ...) of the alternative gene vectors stored in the alternative gene vector table T2 with the subgenome vectors (Vsb80-1, Vsb80-2, Vsb80-3, ... Vsb80-n) to identify the vectors of similar alternative gene vectors.
  • The information processing device associates the vector of the target genome, the vectors of the subgenomes, and the vectors of the similar alternative gene vectors with one another, and registers them in the alternative management table 85.
  • the information processing apparatus executes learning of the learning model 70 based on the learning data 65 that defines the relationship between the vector of the target genome and the vector of each subgenome.
  • the information processing device inputs the vector of the analysis query to the trained learning model 70 to calculate the vector of each subgenome corresponding to the target genome of the analysis query.
  • Based on the vector of each subgenome output from the learning model 70, it is possible to easily detect a gene vector that is similar to a subgenome contained in the target genome and that can be substituted for it.
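  • The similarity search of the analysis phase might look like the following sketch. The text does not state which similarity measure is used, so cosine similarity and the threshold value are assumptions, as are the function names and the toy table contents.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_alternatives(subgenome_vectors, alternative_table, threshold=0.9):
    """For each subgenome vector, list the alternative gene vectors whose
    similarity is at or above the threshold, most similar first."""
    result = {}
    for sub_name, sub_vec in subgenome_vectors.items():
        scored = [
            (cosine_similarity(sub_vec, alt_vec), alt_name)
            for alt_name, alt_vec in alternative_table.items()
        ]
        result[sub_name] = [
            name for score, name in sorted(scored, reverse=True)
            if score >= threshold
        ]
    return result


# Hypothetical contents of subgenome table T1 and alternative gene vector table T2.
t1 = {"Vsb80-1": [1.0, 0.0], "Vsb80-2": [0.0, 1.0]}
t2 = {"Vt1": [0.9, 0.1], "Vt2": [0.1, 0.9], "Vt3": [-1.0, 0.0]}
print(find_alternatives(t1, t2))  # {'Vsb80-1': ['Vt1'], 'Vsb80-2': ['Vt2']}
```

  • The returned mapping plays the role of the rows registered in the alternative management table 85: each subgenome vector is paired with the gene vectors judged similar enough to substitute for it.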
  • FIG. 7 is a functional block diagram showing the configuration of the information processing apparatus according to the first embodiment.
  • the information processing apparatus 100 has a communication section 110, an input section 120, a display section 130, a storage section 140, and a control section 150.
  • the communication unit 110 is connected to an external device or the like by wire or wirelessly, and transmits and receives information to and from the external device or the like.
  • the communication unit 110 is implemented by a NIC (Network Interface Card) or the like.
  • the communication unit 110 may be connected to a network (not shown).
  • the input unit 120 is an input device that inputs various types of information to the information processing device 100.
  • the input unit 120 corresponds to a keyboard, mouse, touch panel, or the like.
  • the display unit 130 is a display device that displays information output from the control unit 150.
  • the display unit 130 corresponds to a liquid crystal display, an organic EL (Electro Luminescence) display, a touch panel, or the like.
  • the storage unit 140 has a base file 50, a conversion table 51, a dictionary table 52, a compressed file table 53, a vector table 54, and an inverted index table 55.
  • the storage unit 140 also has a subgenome table T1, an alternative gene vector table T2, a genome dictionary D2, learning data 65, a learning model 70, an analysis query 80, and an alternative management table 85.
  • the storage unit 140 is implemented by, for example, a semiconductor memory device such as RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or optical disc.
  • the base file 50 is a file that holds information in which multiple bases are arranged.
  • FIG. 8 is a diagram showing an example of the data structure of a base file. As shown in FIG. 8, the base file 50 contains four types of bases, indicated by the symbols "A", "G", "C", and "T" or "U".
  • the conversion table 51 is a table that associates codons with codon codes.
  • a group of three bases is called a "codon".
  • FIG. 9 is a diagram showing an example of the data structure of a conversion table. As shown in FIG. 9, each codon is associated with a code. For example, the code for the codon "UUU" is "40h (01000000)". "h" indicates a hexadecimal number.
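  • Codon-unit compression with the conversion table can be sketched as follows. Only the pair "UUU" and 40h comes from the text; the other table entries and the function name `compress_codons` are placeholders assumed for illustration.

```python
# Conversion table: codon -> 1-byte code. Only "UUU" -> 40h is taken from the
# text; the remaining entries are placeholders for illustration.
CONVERSION_TABLE = {"UUU": 0x40, "UUC": 0x41, "UUA": 0x42, "UUG": 0x43}


def compress_codons(bases: str) -> bytes:
    """Encode a base sequence codon by codon into a byte string."""
    if len(bases) % 3 != 0:
        raise ValueError("base sequence length must be a multiple of 3")
    return bytes(
        CONVERSION_TABLE[bases[i:i + 3]] for i in range(0, len(bases), 3)
    )


compressed = compress_codons("UUUUUCUUU")
print(compressed.hex())  # 404140
```

  • Because there are only 64 codons, one byte per codon is enough, so three base symbols shrink to a single code.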
  • the dictionary table 52 is a table that holds various dictionaries.
  • FIG. 10 is a diagram showing an example of the data structure of a dictionary table. As shown in FIG. 10, this dictionary table 52 has a protein primary structure dictionary D1-1, a secondary structure dictionary D1-2, a tertiary structure dictionary D1-3, and a higher order structure dictionary D1-4.
  • the protein primary structure dictionary D1-1 is dictionary data that defines the relationship between the compression code of a protein and the sequence of codons that make up the protein.
  • FIG. 11 is a diagram showing an example of the data structure of a protein primary structure dictionary. As shown in FIG. 11, the protein primary structure dictionary D1-1 associates compression codes, names, and codon code sequences.
  • a compressed code is a compressed code sequence of codons (or a code sequence of amino acids). Name is the name of the protein.
  • a codon coding sequence is a sequence of compressed codes for codons. Instead of the codon coding sequence, the sequence of amino acid symbols may be associated with the compressed code of the protein primary structure.
  • the primary protein structure "type 1 collagen” is assigned the compression code "C0008000h”.
  • the codon code sequence corresponding to the compression code "C0008000h” is "02h63h78h...03h”.
  • the secondary structure dictionary D1-2 is dictionary data that defines the relationship between the sequence of compression codes for the primary protein structure and the compression code for the secondary structure.
  • FIG. 12 is a diagram showing an example of the data structure of a secondary structure dictionary. As shown in FIG. 12, the secondary structure dictionary D1-2 associates compression codes, names, and protein primary structure code sequences.
  • a compression code is a compression code assigned to a protein's secondary structure. The name is the name of the secondary structure.
  • a protein primary structure code sequence is a sequence of compressed codes of the protein primary structure corresponding to the secondary structure.
  • the compression code "D0000000h” is assigned to the secondary structure " ⁇ secondary structure".
  • the protein primary structure code sequence corresponding to the compression code "D0000000h” is "C0008001hC00".
  • the tertiary structure dictionary D1-3 is dictionary data that defines the relationship between the arrangement of the compression code of the secondary structure and the compression code of the tertiary structure.
  • FIG. 13 is a diagram showing an example of the data structure of a tertiary structure dictionary. As shown in FIG. 13, the tertiary structure dictionary D1-3 associates compression codes, names, and secondary structure code arrays.
  • the compression code is the compression code assigned to the tertiary structure.
  • the name is the name of the tertiary structure.
  • the secondary structure code array is an array of compressed codes of secondary structure corresponding to the tertiary structure.
  • the compression code "E0000000h” is assigned to the tertiary structure " ⁇ tertiary structure”.
  • the secondary structure code array corresponding to the compressed code "E0000000h” is "D0008031hD00".
  • the high-order structure dictionary D1-4 is dictionary data that defines the relationship between the arrangement of compression codes of the tertiary structure and the compression code of the high-order structure.
  • FIG. 14 is a diagram showing an example of the data structure of a high-order structure dictionary. As shown in FIG. 14, the high-order structure dictionary D1-4 associates compression codes, names, and tertiary structure code arrays.
  • a compression code is a compression code assigned to a higher-order structure.
  • the name is the name of the higher order structure.
  • the tertiary structure code array is an array of compression codes of tertiary structures corresponding to higher-order structures.
  • the compression code "F0000000h” is assigned to the high-order structure " ⁇ high-order structure”.
  • the tertiary structure code array corresponding to the compressed code "F0000000h” is "E0000031hE00".
  • the compressed file table 53 is a table that holds various compressed files.
  • FIG. 15 is a diagram illustrating an example of the data structure of a compressed file table. As shown in FIG. 15, this compressed file table 53 has a codon compressed file 53A, a protein primary structure compressed file 53B, a secondary structure compressed file 53C, a tertiary structure compressed file 53D, and a higher order structure compressed file 53E.
  • the codon compression file 53A is a file obtained by compressing the bases contained in the base file 50 in units of codons.
  • the compressed protein primary structure file 53B is a file in which the sequence of codon compression codes contained in the codon compressed file 53A is encoded in units of protein primary structure.
  • the compressed secondary structure file 53C is a file in which the sequence of compression codes for the primary protein structure contained in the compressed protein primary structure file 53B is encoded in units of secondary structures.
  • the compressed tertiary structure file 53D is a file obtained by encoding the array of compression codes of the secondary structure contained in the compressed secondary structure file 53C in units of tertiary structure.
  • the compressed high-order structure file 53E is a file obtained by encoding the arrangement of compression codes of the tertiary structure contained in the compressed tertiary structure file 53D in units of high-order structure.
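  • The layered encoding from codon codes up to higher-order structure codes can be sketched with one generic routine applied once per level. The greedy longest-match strategy, the function name `encode_level`, and the short toy code sequences are assumptions; the actual dictionary contents of FIG. 11 to FIG. 14 are not reproduced here.

```python
def encode_level(codes, dictionary):
    """Replace runs of lower-level codes with higher-level codes.

    `dictionary` maps a tuple of lower-level codes to one higher-level code.
    Matching is greedy (longest entry first); codes that match no entry are
    passed through unchanged.
    """
    max_len = max(len(key) for key in dictionary)
    out, i = [], 0
    while i < len(codes):
        for length in range(min(max_len, len(codes) - i), 0, -1):
            key = tuple(codes[i:i + length])
            if key in dictionary:
                out.append(dictionary[key])
                i += length
                break
        else:
            out.append(codes[i])  # no dictionary entry starts here
            i += 1
    return out


# Toy dictionaries; the code sequences are shortened placeholders.
primary_dict = {("01h", "02h", "03h"): "C0008000h", ("04h", "05h"): "C0008001h"}
secondary_dict = {("C0008000h", "C0008001h"): "D0000000h"}

codon_codes = ["01h", "02h", "03h", "04h", "05h"]
primary = encode_level(codon_codes, primary_dict)    # codon -> primary structure
secondary = encode_level(primary, secondary_dict)    # primary -> secondary structure
print(primary, secondary)
```

  • The same call could be repeated with tertiary and higher-order dictionaries to obtain the remaining compressed files.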
  • the vector table 54 is a table that holds vectors corresponding to protein primary structure, secondary structure, tertiary structure, and higher-order structure.
  • FIG. 16 is a diagram showing an example of the data structure of a vector table. As shown in FIG. 16, this vector table 54 has a protein primary structure vector table VT1-1, a secondary structure vector table VT1-2, a tertiary structure vector table VT1-3, and a higher order structure vector table VT1-4.
  • the protein primary structure vector table VT1-1 is a table that holds vectors corresponding to protein primary structures.
  • FIG. 17 is a diagram showing an example of the data structure of a protein primary structure vector table. As shown in FIG. 17, the protein primary structure vector table VT1-1 associates the compression code of the protein primary structure with the vector assigned to the compression code of the protein primary structure.
  • a vector of protein primary structure is calculated by Poincaré embedding. Poincaré embedding will be described later.
  • the secondary structure vector table VT1-2 is a table that holds vectors corresponding to secondary structures.
  • FIG. 18 is a diagram showing an example of the data structure of a secondary structure vector table. As shown in FIG. 18, the secondary structure vector table VT1-2 associates a secondary structure compression code with a vector assigned to this secondary structure compression code. The secondary structure vector is calculated by integrating the protein primary structure vectors included in the secondary structure.
  • the tertiary structure vector table VT1-3 is a table that holds vectors corresponding to tertiary structures.
  • FIG. 19 is a diagram showing an example of the data structure of a tertiary structure vector table. As shown in FIG. 19, the tertiary structure vector table VT1-3 associates each tertiary structure compression code with the vector assigned to it. The tertiary structure vector is calculated by integrating the secondary structure vectors contained in the tertiary structure.
  • the higher-order structure vector table VT1-4 is a table that holds vectors corresponding to higher-order structures.
  • FIG. 20 is a diagram showing an example of the data structure of a high-order structure vector table. As shown in FIG. 20, the high-order structure vector table VT1-4 associates each high-order structure compression code with the vector assigned to it. The higher-order structure vector is calculated by integrating the tertiary structure vectors included in the higher-order structure.
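  • How a parent-level vector could be derived from child-level vectors is sketched below. The text only says the vectors are "integrated"; interpreting this as an element-wise sum is an assumption, and the vector values and the name `integrate` are hypothetical.

```python
def integrate(vectors):
    """Combine child vectors into one parent vector by element-wise sum."""
    return [sum(components) for components in zip(*vectors)]


# Hypothetical protein primary structure vectors (as held in VT1-1).
primary_vectors = {"C0008000h": [1.0, 2.0], "C0008001h": [0.5, -1.0]}
# Which primary structures make up each secondary structure (as in D1-2).
secondary_dictionary = {"D0000000h": ["C0008000h", "C0008001h"]}

secondary_vectors = {
    code: integrate([primary_vectors[child] for child in children])
    for code, children in secondary_dictionary.items()
}
print(secondary_vectors)  # {'D0000000h': [1.5, 1.0]}
```

  • Applying the same step to secondary and then tertiary vectors would fill the tertiary and higher-order vector tables in turn.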
  • the inverted index table 55 is a table that holds various inverted indexes.
  • FIG. 21 is a diagram illustrating an example of the data structure of an inverted index table. As shown in FIG. 21, the inverted index table 55 has a protein primary structure inverted index In1-1, a secondary structure inverted index In1-2, a tertiary structure inverted index In1-3, and a higher-order structure inverted index In1-4.
  • FIG. 22 is a diagram showing an example of the data structure of a protein primary structure inverted index.
  • the horizontal axis of the protein primary structure inverted index In1-1 corresponds to the offset.
  • the vertical axis of the protein primary structure inverted index In1-1 corresponds to the compression code of the protein primary structure.
  • the protein primary structure inverted index In1-1 is a bitmap of "0" and "1" values, and all bits are set to "0" in the initial state.
  • the offset of the compression code of the protein primary structure at the beginning of the compressed protein primary structure file 53B is "0".
  • if the protein primary structure code "C0008000h (type 1 collagen)" is at the eighth position from the beginning of the compressed protein primary structure file 53B, the bit at the intersection of the offset "7" column and the row of the code "C0008000h (type 1 collagen)" in the protein primary structure inverted index In1-1 is "1".
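  • The bitmap index In1-1 described above can be sketched as one bit row per compression code and one column per offset. The file contents and the function names `build_inverted_index` and `offsets_of` are hypothetical; the example places "C0008000h" at the eighth position, as in the text.

```python
def build_inverted_index(compressed_file, codes):
    """Return {code: bitmap}, where bitmap[offset] is 1 if the code appears
    at that offset of the compressed file and 0 otherwise."""
    index = {code: [0] * len(compressed_file) for code in codes}  # all bits 0
    for offset, code in enumerate(compressed_file):
        if code in index:
            index[code][offset] = 1
    return index


def offsets_of(index, code):
    """List the offsets at which a code occurs."""
    return [offset for offset, bit in enumerate(index[code]) if bit]


# Hypothetical file contents: "C0008000h" sits at the eighth position (offset 7).
compressed_file = ["C0008001h"] * 7 + ["C0008000h"] + ["C0008001h"] * 2
index = build_inverted_index(compressed_file, ["C0008000h", "C0008001h"])
print(offsets_of(index, "C0008000h"))  # [7]
```

  • The secondary, tertiary, and higher-order indexes follow the same layout, with their own compressed files and code sets.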
  • FIG. 23 is a diagram showing an example of the data structure of the secondary structure inverted index.
  • the horizontal axis of the secondary structure inverted index In1-2 corresponds to the offset.
  • the vertical axis of the secondary structure inverted index In1-2 corresponds to the compression code of the secondary structure.
  • the secondary structure inverted index In1-2 is a bitmap of "0" and "1" values, and all bits are set to "0" in the initial state.
  • the offset of the compression code of the secondary structure at the beginning of the compressed secondary structure file 53C is "0".
  • if the secondary structure code "D0000000h ( ⁇ secondary structure)" is at the eighth position from the beginning of the compressed secondary structure file 53C, the bit at the intersection of the offset "7" column and the row of the compression code "D0000000h ( ⁇ secondary structure)" in the secondary structure inverted index In1-2 is "1".
  • FIG. 24 is a diagram showing an example of the data structure of a tertiary structure inverted index.
  • the horizontal axis of the tertiary structure inverted index In1-3 corresponds to the offset.
  • the vertical axis of the tertiary structure inverted index In1-3 corresponds to the compression code of the tertiary structure.
  • the tertiary structure inverted index In1-3 is a bitmap of "0" and "1" values, and all bits are set to "0" in the initial state.
  • the offset of the compression code of the tertiary structure at the beginning of the compressed tertiary structure file 53D is "0". If the tertiary structure code "E0000000h ( ⁇ tertiary structure)" is at the eleventh position from the beginning of the compressed tertiary structure file 53D, the bit at the intersection of the offset "10" column and the row of the compression code "E0000000h ( ⁇ tertiary structure)" in the tertiary structure inverted index In1-3 is "1".
  • FIG. 25 is a diagram showing an example of the data structure of a high-order structure inverted index.
  • the horizontal axis of the high-order structure inverted index In1-4 corresponds to the offset.
  • the vertical axis of the high-order structure inverted index In1-4 corresponds to the compression code of the high-order structure.
  • the high-order structure inverted index In1-4 is a bitmap of "0" and "1" values, and all bits are set to "0" in the initial state.
  • the offset of the compression code of the high-order structure at the beginning of the compressed high-order structure file 53E is "0". If the high-order structure code "F0000000h ( ⁇ high-order structure)" is at the eleventh position from the beginning of the compressed high-order structure file 53E, the bit at the intersection of the offset "10" column and the row of the compression code "F0000000h ( ⁇ high-order structure)" in the high-order structure inverted index In1-4 is "1".
  • the alternative gene vector table T2 holds vectors of a plurality of gene vectors. Gene vectors correspond to protein secondary structures.
  • vectors stored in the alternative gene vector table T2 may be vectors registered in the secondary structure vector table VT1-2.
  • the data structure of the alternative gene vector table T2 stores vectors of a plurality of alternative gene vectors, as described with reference to FIG.
  • the genome dictionary D2 defines the relationship between the name of the target genome and the names of the subgenomes included in this target genome.
  • FIG. 26 is a diagram showing an example of the data structure of a genome dictionary. As shown in FIG. 26, this genome dictionary D2 associates the name of each target genome with the names of a plurality of subgenomes.
  • the learning data 65 defines the relationship between the vector of the target genome and the vectors of a plurality of subgenomes included in this target genome.
  • the data structure of the learning data 65 corresponds to the data structure of the learning data described with reference to FIG.
  • the learning model 70 is a model corresponding to CNN, RNN, etc., and parameters are set.
  • the analysis query 80 includes information on the target genome (therapeutic drug) to be analyzed.
  • information on the target genome includes information on base sequences corresponding to higher-order structures.
  • the alternative management table 85 is a table that holds, in association with each other, the vectors of the subgenomes included in the target genome and the vectors of the substitutable gene vectors that are similar to these subgenomes.
  • the control unit 150 has a preprocessing unit 151, a learning unit 152, a calculation unit 153, and an analysis unit 154.
  • the control unit 150 is implemented by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). The control unit 150 may also be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • the preprocessing unit 151 performs the following processes to calculate the higher-order structure or tertiary structure vector corresponding to the target genome (therapeutic drug), the secondary structure vector corresponding to each subgenome, and the like.
  • the preprocessing unit 151 executes a process of generating the codon compressed file 53A, a process of generating the protein primary structure compressed file 53B, a process of generating the protein primary structure vector table VT1-1, and a process of generating the protein primary structure inverted index In1-1.
  • the preprocessing unit 151 compares the base file 50 and the conversion table 51, assigns compression codes to the base sequences of the base file 50 in units of codons, and generates a codon-compressed file 53A.
  • the preprocessing unit 151 compares the codon-compressed file 53A with the protein primary structure dictionary D1-1, assigns compression codes to the sequences of codon compression codes contained in the codon-compressed file 53A in units of protein primary structures, and generates a protein primary structure compressed file 53B.
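The two compression passes described above can be sketched as follows. This is an illustration under stated assumptions: the table contents, the code values, and the greedy longest-match strategy are hypothetical, since the text only says that codes are assigned in units of codons and then in units of protein primary structures.

```python
def compress_by_codon(bases, conversion_table):
    """Split a base sequence into 3-base codons and replace each with its code."""
    usable = len(bases) - len(bases) % 3  # ignore an incomplete trailing codon
    return [conversion_table[bases[i:i + 3]] for i in range(0, usable, 3)]

def compress_by_dictionary(codes, dictionary):
    """Replace runs of lower-level codes with higher-level codes (greedy longest match)."""
    out, i = [], 0
    max_len = max((len(key) for key in dictionary), default=0)
    while i < len(codes):
        for n in range(min(max_len, len(codes) - i), 0, -1):
            key = tuple(codes[i:i + n])
            if key in dictionary:
                out.append(dictionary[key])
                i += n
                break
        else:
            out.append(codes[i])  # no dictionary entry: keep the lower-level code
            i += 1
    return out

# Hypothetical stand-ins for the conversion table 51 and the dictionary D1-1.
conversion_table = {"ATG": 0x01, "GCT": 0x02, "TAA": 0x03}
protein_dictionary = {(0x01, 0x02): 0xA0}

codon_codes = compress_by_codon("ATGGCTTAA", conversion_table)
protein_codes = compress_by_dictionary(codon_codes, protein_dictionary)
```

The same `compress_by_dictionary` routine could in principle be reused for the secondary, tertiary, and higher-order dictionaries, since each level groups sequences of the codes from the level below.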
  • the preprocessing unit 151 embeds the compression code of the protein primary structure in the Poincare space to calculate the vector of the protein primary structure (the compression code of the protein primary structure).
  • the process of embedding in a Poincare space and calculating a vector is a technique called Poincare Embeddings.
  • for Poincaré embedding, the technique described in, for example, the non-patent document "Valentin Khrulkov et al., 'Hyperbolic Image Embeddings', Georgia University, 2019 April 3" may be used.
  • the preprocessing unit 151 refers to a protein primary structure similarity table that defines similar protein primary structures, embeds the compression code of each protein primary structure in the Poincare space, and calculates the vector of the compression code of each protein primary structure.
  • the preprocessing unit 151 may perform Poincare embedding in advance on the compression code of each protein primary structure defined in the protein primary structure dictionary D1-1.
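For orientation, Poincaré embeddings place items inside the unit ball and compare them with a hyperbolic distance. The sketch below shows the standard Poincaré-ball distance; the text does not state which distance or training procedure is used, so this is only an assumption about the geometry involved, and `poincare_distance` is a hypothetical helper name.

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball (Poincare ball model)."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v))))

# Points near the boundary of the ball are far apart even when close in Euclidean terms.
near_center = poincare_distance([0.0, 0.0], [0.5, 0.0])
near_edge = poincare_distance([0.9, 0.0], [0.9, 0.1])
```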
  • the preprocessing unit 151 generates a protein primary structure vector table VT1-1 by associating the protein primary structure (compressed code of the protein primary structure) with the protein primary structure vector.
  • the preprocessing unit 151 generates the protein primary structure inverted index In1-1 based on the relationship between the protein primary structure vector and the position of the protein primary structure (compression code of the protein primary structure) in the protein primary structure compressed file 53B.
  • the preprocessing unit 151 executes a process of generating the secondary structure compressed file 53C, a process of generating the secondary structure vector table VT1-2, and a process of generating the secondary structure inverted index In1-2.
  • the preprocessing unit 151 compares the protein primary structure compressed file 53B with the secondary structure dictionary D1-2, assigns compression codes to the sequences of protein primary structure compression codes contained in the protein primary structure compressed file 53B in units of secondary structures, and generates a secondary structure compressed file 53C.
  • the preprocessing unit 151 refers to the secondary structure dictionary D1-2 to identify a protein primary structure code sequence (an array of compression codes of the protein primary structure) corresponding to the compression code of the secondary structure.
  • the preprocessing unit 151 obtains the compression code vector of each identified protein primary structure from the protein primary structure vector table VT1-1, and adds the obtained vectors to calculate the compression code vector of the secondary structure.
  • the preprocessing unit 151 calculates the vector of each secondary structure by repeatedly executing the above process.
  • the preprocessing unit 151 generates a secondary structure vector table VT1-2 by associating the secondary structure (compression code of the secondary structure) with the vector of the secondary structure.
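The additive composition described above, where a secondary structure vector is the sum of the vectors of its constituent protein primary structures, can be sketched as follows. The dictionary layout, code names, and vector values are hypothetical.

```python
def compose_vector(child_codes, child_vectors):
    """Add up the vectors of the lower-level codes that make up one structure."""
    dim = len(next(iter(child_vectors.values())))
    total = [0.0] * dim
    for code in child_codes:
        total = [t + c for t, c in zip(total, child_vectors[code])]
    return total

def build_vector_table(dictionary, child_vectors):
    """Build a vector table for one level from its dictionary and the level below."""
    return {code: compose_vector(children, child_vectors)
            for code, children in dictionary.items()}

# Hypothetical stand-ins for VT1-1 (primary vectors) and D1-2 (secondary dictionary).
primary_vectors = {"P1": [1.0, 2.0], "P2": [3.0, -1.0]}
secondary_dictionary = {"S1": ["P1", "P2", "P1"]}

secondary_vectors = build_vector_table(secondary_dictionary, primary_vectors)
```

The same two functions apply unchanged one level up: passing a tertiary dictionary and `secondary_vectors` would yield tertiary structure vectors.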
  • the preprocessing unit 151 generates the secondary structure inverted index In1-2 based on the relationship between the secondary structure vector and the position of the secondary structure (compression code of the secondary structure) in the secondary structure compressed file 53C.
  • the preprocessing unit 151 executes a process of generating the tertiary structure compressed file 53D, a process of generating the tertiary structure vector table VT1-3, and a process of generating the tertiary structure inverted index In1-3.
  • the preprocessing unit 151 compares the secondary structure compressed file 53C with the tertiary structure dictionary D1-3, assigns compression codes to the sequences of secondary structure compression codes contained in the secondary structure compressed file 53C in units of tertiary structures, and generates a tertiary structure compressed file 53D.
  • the preprocessing unit 151 refers to the tertiary structure dictionary D1-3 to identify the secondary structure code array (array of secondary structure compression codes) corresponding to the compression code of the tertiary structure.
  • the preprocessing unit 151 acquires the compression code vector of each identified secondary structure from the secondary structure vector table VT1-2, and adds the acquired vectors to calculate the compression code vector of the tertiary structure.
  • the preprocessing unit 151 calculates the vector of each tertiary structure by repeatedly executing the above process.
  • the preprocessing unit 151 generates a tertiary structure vector table VT1-3 by associating the tertiary structure (compression code of the tertiary structure) with the vector of the tertiary structure.
  • the preprocessing unit 151 generates the tertiary structure inverted index In1-3 based on the relationship between the tertiary structure vector and the position of the tertiary structure (compression code of the tertiary structure) in the tertiary structure compressed file 53D.
  • the preprocessing unit 151 executes a process of generating the high-order structure compressed file 53E, a process of generating the high-order structure vector table VT1-4, and a process of generating the high-order structure inverted index In1-4.
  • the preprocessing unit 151 compares the tertiary structure compressed file 53D with the high-order structure dictionary D1-4, assigns compression codes to the sequences of tertiary structure compression codes contained in the tertiary structure compressed file 53D in units of high-order structures, and generates a high-order structure compressed file 53E.
  • the preprocessing unit 151 refers to the high-order structure dictionary D1-4 to identify a tertiary structure code array (an array of tertiary structure compression codes) corresponding to the high-order structure compression code.
  • the preprocessing unit 151 acquires the compression code vector of each identified tertiary structure from the tertiary structure vector table VT1-3, and adds the acquired vectors to calculate the compression code vector of the high-order structure.
  • the preprocessing unit 151 calculates the vector of each higher-order structure by repeatedly executing the above process.
  • the preprocessing unit 151 generates a high-order structure vector table VT1-4 by associating the high-order structure (compression code of the high-order structure) with the vector of the high-order structure.
  • the preprocessing unit 151 generates the high-order structure inverted index In1-4 based on the relationship between the high-order structure vector and the position of the high-order structure (compression code of the high-order structure) in the high-order structure compressed file 53E.
  • the preprocessing unit 151 sets the secondary structure vectors included in the secondary structure vector table VT1-2 in the alternative gene vector table T2 as they are.
  • when the preprocessing unit 151 receives designation of a vector via the input unit 120, the designated vector may be set in the alternative gene vector table T2.
  • Based on the genome dictionary D2, the preprocessing unit 151 identifies the relationship between the name of the target genome and the names of the subgenomes. The preprocessing unit 151 identifies the vector of the target genome based on the name of the target genome and either the high-order structure dictionary D1-4 and the high-order structure vector table VT1-4, or the tertiary structure dictionary D1-3 and the tertiary structure vector table VT1-3. The preprocessing unit 151 identifies the vector of each subgenome based on the secondary structure dictionary D1-2, the secondary structure vector table VT1-2, and the name of the subgenome. Through such processing, the preprocessing unit 151 identifies the relationship between the vector of the target genome and the vectors of the subgenomes and registers it in the learning data 65.
  • the preprocessing unit 151 generates learning data 65 by repeatedly executing the above process.
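The assembly of the learning data 65 from the genome dictionary and the vector tables can be sketched as follows. The record layout and all names and values are hypothetical; the text only states that each target genome vector is paired with the vectors of its subgenomes.

```python
def build_learning_data(genome_dictionary, genome_vectors, subgenome_vectors):
    """Pair each target genome vector with the vectors of its subgenomes."""
    records = []
    for genome_name, subgenome_names in genome_dictionary.items():
        records.append({
            "input": genome_vectors[genome_name],
            "targets": [subgenome_vectors[name] for name in subgenome_names],
        })
    return records

# Hypothetical stand-ins for the genome dictionary D2 and the vector tables.
genome_dictionary = {"genome_A": ["sub_1", "sub_2"]}
genome_vectors = {"genome_A": [0.7, 0.1]}
subgenome_vectors = {"sub_1": [0.5, 0.2], "sub_2": [0.2, -0.1]}

learning_data = build_learning_data(genome_dictionary, genome_vectors, subgenome_vectors)
```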
  • the information processing apparatus 100 may acquire already-created learning data 65 from an external device or the like and use it.
  • the learning unit 152 performs learning of the learning model 70 using the learning data 65 .
  • the processing of the learning unit 152 corresponds to the processing described with reference to FIG.
  • the learning unit 152 acquires from the learning data 65 a set of a target genome (therapeutic drug) vector and each subgenome vector corresponding to the target genome vector.
  • the learning unit 152 performs learning by backpropagation and adjusts the parameters of the learning model 70 so that the output value of the learning model 70 when the vector of the target genome is input to the learning model 70 approaches the value of the vector of each subgenome.
  • the learning unit 152 executes the learning of the learning model 70 by repeatedly executing the above-described processing for a set of the vector of the target genome of the learning data 65 and the vector of each subgenome.
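The training loop described above can be illustrated with a deliberately simplified stand-in. The text names CNN or RNN models; the sketch below instead uses a single linear layer trained by gradient descent on squared error, purely to show how a target genome vector is fitted to the concatenated vectors of its subgenomes. All names and values are hypothetical.

```python
def predict(weights, x):
    """Apply the linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def train_linear(pairs, in_dim, out_dim, lr=0.1, epochs=500):
    """Fit weights so that predict(weights, x) approaches y for each (x, y) pair."""
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(epochs):
        for x, y in pairs:
            error = [p - t for p, t in zip(predict(weights, x), y)]
            for r in range(out_dim):
                for c in range(in_dim):
                    # Gradient of the squared error with respect to weights[r][c].
                    weights[r][c] -= lr * 2.0 * error[r] * x[c]
    return weights

# Each target is two hypothetical 2-dimensional subgenome vectors, concatenated.
pairs = [
    ([1.0, 0.0], [0.5, 0.2, 0.1, 0.4]),
    ([0.0, 1.0], [0.3, 0.3, 0.9, 0.0]),
]
weights = train_linear(pairs, in_dim=2, out_dim=4)
output = predict(weights, [1.0, 0.0])
subgenome_vectors = [output[0:2], output[2:4]]  # split back into per-subgenome vectors
```

Concatenating the subgenome vectors into one output is one possible encoding of "multiple vectors per input"; the text does not specify how the model emits several vectors.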
  • upon receiving the specification of the analysis query 80, the calculation unit 153 calculates the vector of each subgenome included in the target genome of the analysis query 80 using the trained learning model 70.
  • the processing of the calculation unit 153 corresponds to the processing described with reference to FIG.
  • the calculation unit 153 may receive the analysis query 80 from the input unit 120 or from an external device via the communication unit 110 .
  • the calculation unit 153 acquires the base sequence of the target genome included in the analysis query 80.
  • the calculation unit 153 compares the base sequence of the target genome with the conversion table 51 to identify the codons contained in the base sequence of the target genome, and converts the base sequence of the target genome into compression codes in units of codons.
  • the calculation unit 153 also compares the codon code sequence compressed in units of codons with the protein primary structure dictionary D1-1, and converts the codon code sequences into compression codes in units of protein primary structures.
  • the calculation unit 153 compares the converted compression code of each protein primary structure with the protein primary structure vector table VT1-1 to identify the compression code vector of each protein primary structure.
  • the calculation unit 153 calculates a vector Vob80 corresponding to the target genome included in the analysis query 80 by accumulating the compression code vectors of the identified protein primary structures.
  • the calculation unit 153 performs the following processing.
  • the calculation unit 153 compares each secondary structure of the subgenomes of the target genome with the secondary structure dictionary D1-2 and the secondary structure vector table VT1-2, and identifies the vectors of the secondary structures of the subgenomes included in the target genome.
  • the calculation unit 153 calculates the vector of the target genome by integrating the vectors of the secondary structures of the identified subgenomes.
  • the calculation unit 153 calculates a plurality of vectors corresponding to each subgenome by inputting the vector Vob80 into the learning model 70.
  • the calculation unit 153 outputs the calculated vector of each subgenome to the analysis unit 154 .
  • the vector of each subgenome calculated by the calculation unit 153 is referred to as an "analysis vector”.
  • the calculation unit 153 stores the vector (analysis vector) of each subgenome in the subgenome table T1.
  • the analysis unit 154 searches for information on alternative gene vectors having vectors similar to the analysis vector. Based on the search results, the analysis unit 154 associates the vector of each subgenome included in the target genome with the vector of each similar alternative gene vector (the similar vector described below), and registers them in the substitution management table 85.
  • the analysis unit 154 calculates the distance between the analysis vector and each vector included in the alternative gene vector table T2, and identifies the vector whose distance from the analysis vector is less than the threshold.
  • a vector that is included in the alternative gene vector table T2 and whose distance from the analysis vector is less than a threshold is a "similar vector”.
  • the gene vector corresponding to this similar vector is an alternative gene vector.
  • Based on the secondary structure vector table VT1-2, the analysis unit 154 identifies the compression code of the gene vector corresponding to the similar vector. The analysis unit 154 may then identify the protein primary structures contained in the gene vector based on the compression code of the identified gene vector, the secondary structure dictionary D1-2, and the protein primary structure dictionary D1-1. By executing such processing, the analysis unit 154 searches for the characteristics of the substitutable gene vectors corresponding to the similar vectors and registers them in the substitution management table 85.
  • the characteristics of an alternative gene vector are the proteins contained in the gene vector and the primary structures of those proteins.
  • the analysis unit 154 may search for the characteristics of the gene vector corresponding to the similar vector for each analysis vector by repeatedly executing the above process, and register them in the substitution management table 85.
  • the analysis unit 154 may output the substitution management table 85 to the display unit 130 for display, or may transmit it to an external device connected to the network.
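The threshold-based similarity search described above can be sketched as follows. Euclidean distance is an assumption here, since the text does not name the distance measure, and the table contents and threshold are hypothetical.

```python
import math

def find_similar_vectors(analysis_vector, gene_vector_table, threshold):
    """Return (name, distance) for every table entry closer than the threshold."""
    hits = []
    for name, vector in gene_vector_table.items():
        distance = math.dist(analysis_vector, vector)  # Euclidean distance (assumed)
        if distance < threshold:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical stand-in for the alternative gene vector table T2.
gene_vector_table = {"g1": [0.0, 0.0], "g2": [3.0, 4.0], "g3": [0.1, 0.0]}
similar = find_similar_vectors([0.0, 0.0], gene_vector_table, threshold=1.0)
similar_names = [name for name, _ in similar]
```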
  • FIG. 27 is a flowchart (1) showing the processing procedure of the information processing apparatus according to the present embodiment.
  • the preprocessing unit 151 of the information processing apparatus 100 calculates a compression code vector of each protein by executing Poincare embedding (step S101).
  • the preprocessing unit 151 generates a compressed file table 53, a vector table 54, and an inverted index table 55 based on the base file 50, conversion table 51, and dictionary table 52 (step S102).
  • the preprocessing unit 151 generates learning data 65 (step S103).
  • the learning unit 152 of the information processing device 100 performs learning of the learning model 70 based on the learning data 65 (step S104).
  • FIG. 28 is a flowchart (2) showing the processing procedure of the information processing apparatus according to this embodiment.
  • the calculation unit 153 of the information processing device 100 receives the analysis query 80 (step S201).
  • the calculation unit 153 calculates the vector of the analysis query 80 (target genome) (step S202).
  • the calculation unit 153 calculates the vector of each subgenome by inputting the calculated vector of the analysis query 80 into the learned learning model 70 (step S203).
  • the analysis unit 154 of the information processing device 100 compares the vector of each subgenome with the vector of the alternative gene vector table T2 (step S204).
  • the analysis unit 154 searches for alternative gene vectors corresponding to each subgenome (step S205).
  • the analysis unit 154 registers the search result in the substitution management table 85 (step S206).
  • the information processing apparatus 100 executes learning of the learning model 70 based on the learning data 65 that defines the relationship between the vector of the target genome (therapeutic drug) and the vector of the subgenome.
  • the information processing apparatus 100 calculates the vector of each subgenome corresponding to the analysis query (target genome) by inputting the vector of the analysis query into the trained learning model 70 .
  • by using the vector of each subgenome output from the learning model 70, it is possible to easily detect an alternative gene vector similar to a subgenome included in the target genome.
  • for example, when a subgenome included in the target genome is rare, the processing of the information processing apparatus 100 makes it possible to easily search for an inexpensive gene vector that can replace the subgenome.
  • the information processing apparatus 100 may compare, at a finer granularity, the multiple primary structures constituting a subgenome to search for substitutable primary structures.
  • FIG. 29 is a diagram for explaining an example of learning phase processing of the information processing apparatus according to the second embodiment.
  • the information processing device uses learning data 90 to learn a learning model 91 .
  • the learning model 91 corresponds to CNN, RNN, and the like.
  • the learning data 90 defines the relationship between vectors of a plurality of subgenomes that synthesize the target genome (therapeutic drug) and vectors of common structure maintained by genetic recombination based on the gene vector. For example, a vector of subgenomes corresponds to the input data, and vectors of multiple common structures are correct values.
  • the information processing device executes learning by error backpropagation so that the output when subgenome vectors are input to the learning model 91 approaches the vectors of each common structure.
  • the information processing device adjusts the parameters of the learning model 91 (executes machine learning) by repeatedly executing the above processing based on the relationship between the subgenome vectors included in the learning data 90 and the vectors of the common structures.
  • FIG. 30 is a diagram for explaining the processing of the information processing apparatus according to the second embodiment.
  • the information processing apparatus according to the second embodiment may learn the learning model 70 in the same way as the information processing apparatus 100 according to the first embodiment. Also, the information processing apparatus learns a learning model 91 different from the learning model 70, as described with reference to FIG.
  • the learning model 91 outputs a common structure vector when a vector of an analysis query (subgenome) 92 is input.
  • when the information processing device receives an analysis query 92 specifying a subgenome, it uses the subgenome vector table T1 to convert the subgenome of the analysis query 92 into a vector Vsb92-1.
  • the information processing device inputs the subgenome vector Vsb92-1 to the learning model 91 to calculate the vector Vcm92-1 corresponding to the common structure.
  • the information processing device compares the subgenome vector Vsb92-1 with vectors of multiple gene vectors included in the alternative gene vector table T2.
  • the alternative gene vector table T2 corresponds to the alternative gene vector table T2 described in the first embodiment.
  • the information processing device identifies the vectors of gene vectors similar to the subgenome vector Vsb92-1.
  • let Vt92-1 be the vector of a gene vector similar to the subgenome vector Vsb92-1.
  • the vector Vcm92-1 output from the learning model 91 represents the common structure shared by the subgenome of the vector Vsb92-1 and the gene vector of the vector Vt92-1.
  • the result of subtracting the common structure vector Vcm92-1 from the vector Vt92-1 of the gene vector is the vector of the "gene recombination structure", which is the part that differs between the similar gene vector and the subgenome.
  • the information processing device registers the relationship between the vector of the common structure and the vector of the gene recombination structure in the common structure/gene recombination structure table 93 .
  • the information processing device generates a common structure/genetic recombination structure table 93 by repeatedly executing the above-described processing for vectors of each subgenome.
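The subtraction that yields the gene recombination structure vector, and its registration alongside the common structure vector, can be sketched as follows. The vector values and the list-of-dicts layout for the table 93 are hypothetical.

```python
def recombination_vector(similar_vector, common_vector):
    """Subtract the common structure vector from the similar gene vector's vector."""
    return [s - c for s, c in zip(similar_vector, common_vector)]

def register(table, common_vector, recombination):
    """Append one common-structure / recombination-structure pair to the table."""
    table.append({"common": common_vector, "recombination": recombination})
    return table

# Hypothetical values standing in for Vt92-1 and Vcm92-1.
vt = [4.0, 1.0]
vcm = [3.0, 0.5]

vrec = recombination_vector(vt, vcm)
table_93 = register([], vcm, vrec)

# Adding the two registered vectors recovers the vector of the gene vector.
restored = [c + r for c, r in zip(vcm, vrec)]
```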
  • the information processing apparatus inputs the vector of the analysis query 92 to the trained learning model 91, and calculates the vector of each common structure corresponding to the subgenome of the analysis query. Further, by subtracting the vector of the common structure from each vector of the gene vector similar to the subgenome, the vector of the genetic recombination structure that is different between the similar subgenome and the gene vector is calculated.
  • by using the common structure vectors and the gene recombination structure vectors, it is possible to easily analyze better gene vectors that can be used for the synthesis and production of the target genome.
  • FIG. 31 is a functional block diagram showing the configuration of the information processing apparatus according to the second embodiment.
  • this information processing apparatus 200 has a communication section 210, an input section 220, a display section 230, a storage section 240, and a control section 250.
  • the descriptions of the communication unit 210, the input unit 220, and the display unit 230 are the same as the descriptions of the communication unit 110, the input unit 120, and the display unit 130 described in the first embodiment.
  • the storage unit 240 has a base file 50, a conversion table 51, a dictionary table 52, a compressed file table 53, a vector table 54, and an inverted index table 55.
  • the storage unit 240 also has a subgenome table T1, an alternative gene vector table T2, a genome dictionary D2, learning data 90, a learning model 91, an analysis query 92, and a common structure/gene recombination structure table 93.
  • the storage unit 240 is implemented by, for example, a semiconductor memory device such as RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or optical disc.
  • the base file 50, the conversion table 51, the dictionary table 52, the compressed file table 53, the vector table 54, the inverted index table 55, the subgenome table T1, the alternative gene vector table T2, and the genome dictionary D2 are the same as those described in the first embodiment.
  • the learning data 90 is the same as the content explained in FIG. Descriptions of the learning model 91 and the analysis query 92 are the same as those described with reference to FIG.
  • the common structure/genetic recombination structure table 93 contains information on gene recombination structure vectors for gene recombination from gene vectors similar to the common structure vector to subgenomes.
  • the common structure/genetic recombination structure table 93 includes the gene recombination structure vector corresponding to Vcm92-1.
  • a vector obtained by adding the vector of the common structure and the vector of the gene recombination structure corresponds to the vector of the gene vector.
  • the control unit 250 has a preprocessing unit 251 , a learning unit 252 , a calculation unit 253 and an analysis unit 254 .
  • the control unit 250 is implemented by, for example, a CPU or MPU. The control unit 250 may also be implemented by an integrated circuit such as an ASIC or FPGA.
  • the description of the preprocessing unit 251 is the same as the description of the processing related to the preprocessing unit 151 described in the first embodiment.
  • the preprocessing unit 251 generates a base file 50, a conversion table 51, a dictionary table 52, a compressed file table 53, a vector table 54, an inverted index table 55, a subgenome table T1, and an alternative gene vector table T2.
  • the preprocessing unit 251 may acquire the learning data 90 from an external device, or may generate the learning data 90 itself.
  • the calculation unit 253 uses the learned learning model 91 to calculate the vector of each common structure to be genetically modified in the synthesis route of the subgenome of the analysis query 92 .
  • the calculation unit 253 outputs the calculated vector of each common structure to the analysis unit 254 .
  • the vector of each common structure calculated by the calculation unit 253 is referred to as a "common structure vector".
  • the analysis unit 254 generates the common structure/gene recombination structure table 93 based on the subgenome vector of the analysis query 92, the common structure vector, and the alternative gene vector table T2. An example of the processing of the analysis unit 254 will be described below.
  • the analysis unit 254 calculates the distance between the subgenome vector and each vector contained in the alternative gene vector table T2, and identifies the vector whose distance from the subgenome vector is less than the threshold.
  • a vector that is included in the alternative gene vector table T2 and whose distance from the subgenome vector is less than a threshold is referred to as a "similar vector”.
  • the analysis unit 254 calculates a gene recombination structure vector by subtracting the common structure vector from the similar vector, and identifies the correspondence relationship between the common structure vector and the gene recombination structure vector.
  • the analysis unit 254 registers the common structure vector and the gene recombination structure vector in the common structure/gene recombination structure table 93 .
  • the analysis unit 254 generates the common structure/gene recombination structure table 93 by repeatedly executing the above process.
  • the analysis unit 254 may output the common structure/gene recombination structure table 93 to the display unit 230 for display, or may transmit it to an external device connected to the network.
  • FIG. 32 is a flow chart showing the processing procedure of the information processing apparatus according to the second embodiment.
  • the calculation unit 253 of the information processing device 200 receives the analysis query 92 (step S301).
  • the calculation unit 253 converts the subgenome of the analysis query 92 into a vector based on the subgenome table T1 (step S302).
  • the calculation unit 253 calculates a common structure vector by inputting the subgenome vector to the learned learning model 91 (step S303).
  • the analysis unit 254 of the information processing device 200 identifies a similar vector based on the distance between the subgenome vector and each vector in the alternative gene vector table T2 (step S304).
  • the analysis unit 254 calculates a gene recombination structure vector by subtracting a common structure vector from each vector of gene vectors similar to the subgenome (step S305).
  • the analysis unit 254 registers the relationship between the common structure vector and the gene recombination structure vector in the common structure/gene recombination structure table 93 (step S306).
  • the analysis unit 254 outputs the information of the common structure/genetic recombination structure table (step S307).
  • the information processing apparatus 200 inputs the vector of the analysis query 92 to the trained learning model 91, and calculates the vector of each common structure corresponding to the subgenome of the analysis query. Further, by subtracting each common structure vector from the vector of the gene vector similar to the subgenome, the vector of the gene recombination structure, which is the part that differs between the subgenome and the similar gene vector, is calculated.
  • by using the common structure vectors and the gene recombination structure vectors, it is possible to easily analyze better gene vectors that can be used for gene recombination, resynthesis, and production of the target genome.
  • Subgenomes and gene vectors are secondary structures composed of multiple protein primary structures.
  • by using the distributed vector of a protein primary structure, it is possible to estimate the protein primary structures adjacent to a certain protein primary structure, and this can be applied to evaluating the degree of binding and the stability of each protein primary structure.
  • therefore, by performing machine learning on proven cases of genetic recombination from gene vectors to subgenomes, based on the distributed vectors of the multiple protein primary structures that make up the secondary structures of subgenomes and gene vectors, it is possible to improve the analytical accuracy of diversion from gene vectors, genetic recombination, and resynthesis.
  • FIG. 33 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
  • the computer 300 has a CPU 301 that executes various arithmetic processes, an input device 302 that receives data input from the user, and a display 303 .
  • the computer 300 also has a communication device 304 and an interface device 305 for exchanging data with an external device or the like via a wired or wireless network.
  • the computer 300 also has a RAM 306 that temporarily stores various information, and a hard disk device 307. The devices 301 to 307 are connected to a bus 308.
  • the hard disk device 307 has a preprocessing program 307a, a learning program 307b, a calculation program 307c, and an analysis program 307d.
  • the CPU 301 reads the programs 307a to 307d and loads them into the RAM 306.
  • the preprocessing program 307a functions as a preprocessing process 306a.
  • Learning program 307b functions as learning process 306b.
  • the calculation program 307c functions as a calculation process 306c.
  • Analysis program 307d functions as analysis process 306d.
  • the processing of the preprocessing process 306a corresponds to the processing of the preprocessing units 151 and 251.
  • the processing of the learning process 306 b corresponds to the processing of the learning units 152 and 252 .
  • Processing of the calculation process 306 c corresponds to processing of the calculation units 153 and 253 .
  • Processing of the analysis process 306 d corresponds to processing of the analysis unit 154 .
  • Each program does not necessarily have to be stored in the hard disk device 307 from the beginning.
  • For example, each program may be stored in a “portable physical medium” such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card that is inserted into the computer 300. The computer 300 may then read and execute each of the programs 307a to 307d.
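The correspondence described above between the programs on the hard disk device 307, the processes in the RAM 306, and the functional units can be summarized in a short sketch. This is only an illustrative model and not the implementation of the embodiment: the `Program` class, the `load_into_ram` function, and the dictionary layout are hypothetical names introduced here, while the reference numerals are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    """Hypothetical record for one program held on the hard disk device 307."""
    ref: str          # reference numeral of the program (e.g. "307a")
    name: str         # role of the program
    process_ref: str  # reference numeral of the process it functions as in RAM 306
    units: tuple      # functional units whose processing the process corresponds to


# Contents of the hard disk device 307 as listed in the description.
HARD_DISK_307 = [
    Program("307a", "preprocessing", "306a", ("151", "251")),
    Program("307b", "learning", "306b", ("152", "252")),
    Program("307c", "calculation", "306c", ("153", "253")),
    Program("307d", "analysis", "306d", ("154",)),
]


def load_into_ram(programs):
    """Model the CPU 301 reading each program and loading it into the RAM 306.

    Returns a mapping from process reference numeral to the loaded program.
    """
    return {program.process_ref: program for program in programs}


ram_306 = load_into_ram(HARD_DISK_307)
for process_ref, program in ram_306.items():
    units = ", ".join(program.units)
    print(f"{program.name} program {program.ref} -> process {process_ref} -> units {units}")
```

The mapping is keyed by process numeral so that a lookup such as `ram_306["306b"]` returns the learning program together with the learning units 152 and 252 it corresponds to.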

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biochemistry (AREA)
  • Library & Information Science (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Communication Control (AREA)
PCT/JP2021/015983 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置 Ceased WO2022224336A1 (ja)

Priority Applications (6)

Application Number Priority Date Filing Date Title
CN202180095792.3A CN117043868A (zh) 2021-04-20 2021-04-20 信息处理程序、信息处理方法以及信息处理装置
EP21937833.8A EP4328921A4 (en) 2021-04-20 2021-04-20 INFORMATION PROCESSING PROGRAM, INFORMATION PROCESSING METHOD AND INFORMATION PROCESSING DEVICE
PCT/JP2021/015983 WO2022224336A1 (ja) 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置
JP2023515916A JP7619443B2 (ja) 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置
AU2021441603A AU2021441603A1 (en) 2021-04-20 2021-04-20 Information processing program, information processing method, and information processing device
US18/468,023 US20240006028A1 (en) 2021-04-20 2023-09-15 Non-transitory computer-readable recording medium, information processing method, and information processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/015983 WO2022224336A1 (ja) 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/015983 Continuation WO2022224336A1 (ja) 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/JP2021/015983 Continuation WO2022224336A1 (ja) 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置
US18/468,023 Continuation US20240006028A1 (en) 2021-04-20 2023-09-15 Non-transitory computer-readable recording medium, information processing method, and information processing apparatus

Publications (1)

Publication Number Publication Date
WO2022224336A1 (ja) 2022-10-27

Family

ID=83723418

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/015983 Ceased WO2022224336A1 (ja) 2021-04-20 2021-04-20 情報処理プログラム、情報処理方法および情報処理装置

Country Status (6)

Country Link
US (1) US20240006028A1
EP (1) EP4328921A4
JP (1) JP7619443B2
CN (1) CN117043868A
AU (1) AU2021441603A1
WO (1) WO2022224336A1

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007102578A1 (ja) 2006-03-09 2007-09-13 Keio University 塩基配列設計方法
JP2017504913A (ja) * 2013-11-15 2017-02-09 インフィニットバイオInfinitebio 治療設計のためのコンピュータ支援モデル化
JP2020154442A (ja) * 2019-03-18 2020-09-24 株式会社日立製作所 生物反応情報処理システムおよび生物反応情報処理方法
JP2020530918A (ja) * 2017-10-16 2020-10-29 イルミナ インコーポレイテッド バリアントの分類のための深層畳み込みニューラルネットワーク
WO2020230240A1 (ja) 2019-05-13 2020-11-19 富士通株式会社 評価方法、評価プログラムおよび評価装置

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997036252A1 (en) * 1996-03-22 1997-10-02 University Of Guelph Computational method for designing chemical structures having common functional characteristics
US7047137B1 (en) * 2000-11-28 2006-05-16 Hewlett-Packard Development Company, L.P. Computer method and apparatus for uniform representation of genome sequences
EP1607898A3 (en) * 2004-05-18 2006-03-29 Neal E. Solomon A bioinformatics system for functional proteomics modelling
US20120115734A1 (en) * 2010-11-04 2012-05-10 Laura Potter In silico prediction of high expression gene combinations and other combinations of biological components
CN107025386B (zh) * 2017-03-22 2020-07-17 杭州电子科技大学 一种基于深度学习算法进行基因关联分析的方法
KR20200026878A (ko) * 2017-06-06 2020-03-11 지머젠 인코포레이티드 균류 균주를 개량하기 위한 htp 게놈 공학 플랫폼
JP7763588B2 (ja) * 2017-09-05 2025-11-04 グリットストーン バイオ インコーポレイテッド T細胞療法用の新生抗原の特定法
US20200342955A1 (en) * 2017-10-27 2020-10-29 Apostle, Inc. Predicting cancer-related pathogenic impact of somatic mutations using deep learning-based methods
CN119851752A (zh) * 2018-02-27 2025-04-18 磨石生物公司 利用泛等位基因模型进行的新抗原鉴别
JP2020181959A (ja) * 2019-04-26 2020-11-05 東京エレクトロン株式会社 学習方法、管理装置および管理プログラム

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Hyperbolic Image Embeddings", 3 April 2019, CORNELL UNIVERSITY
See also references of EP4328921A4

Also Published As

Publication number Publication date
JPWO2022224336A1 2022-10-27
EP4328921A1 (en) 2024-02-28
JP7619443B2 (ja) 2025-01-22
US20240006028A1 (en) 2024-01-04
CN117043868A (zh) 2023-11-10
AU2021441603A1 (en) 2023-09-28
EP4328921A4 (en) 2024-06-26

Similar Documents

Publication Publication Date Title
Darling et al. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement
Wu et al. A simple, fast, and accurate method of phylogenomic inference
Ovchinnikov et al. Large-scale determination of previously unsolved protein structures using evolutionary information
US20110295858A1 (en) Method and apparatus for searching nucleic acid sequence
EP3509018A1 (en) Method for biologically storing and restoring data
CN108363905B (zh) 一种用于植物外源基因改造的CodonPlant系统及其改造方法
CA2930597A1 (en) Methods for the graphical representation of genomic sequence data
Hallin et al. The genome BLASTatlas—a GeneWiz extension for visualization of whole-genome homology
Qi et al. Application of 2D graphic representation of protein sequence based on Huffman tree method
US20210317523A1 (en) Deepsimulator method and system for mimicking nanopore sequencing
McDonald et al. The evolutionary dynamics of tRNA-gene copy number and codon-use in E. coli.
JP2024542154A (ja) 生合成遺伝子クラスターに関連する遺伝子を同定するための方法およびシステム
US20110040488A1 (en) System and method for analysis of a dna sequence by converting the dna sequence to a number string and applications thereof in the field of accelerated drug design
He et al. Predicting the sequence specificities of DNA-binding proteins by DNA fine-tuned language model with decaying learning rates
CN103699819B (zh) 基于多步双向De Bruijn图的变长kmer查询的顶点扩展方法
US20240404620A1 (en) Framework for protein design from partial sequence with memory-efficient global attention method
Sammeth et al. Comparing tandem repeats with duplications and excisions of variable degree
WO2022224336A1 (ja) 情報処理プログラム、情報処理方法および情報処理装置
Bie et al. High-quality genome resource of mango bacterial black spot pathogen Xanthomonas citri pv. mangiferaeindicae GXG07 isolated from Guangxi, China
Liu et al. Chloroplast genome evolution of Hamamelidaceae at subfamily level
CN102841988A (zh) 一种对核酸序列信息进行匹配的系统和方法
Hallin et al. GeneWiz browser: an interactive tool for visualizing sequenced chromosomes
Harris et al. Whole-genome sequencing for rapid and accurate identification of bacterial transmission pathways
Sgarbossa et al. Pairing interacting protein sequences using masked language modeling
Röske et al. A versatile palindromic amphipathic repeat coding sequence horizontally distributed among diverse bacterial and eucaryotic microbes

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21937833; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2021441603; Country of ref document: AU)
WWE WIPO information: entry into national phase (Ref document number: 2023515916; Country of ref document: JP)
WWE WIPO information: entry into national phase (Ref document number: 202180095792.3; Country of ref document: CN)
ENP Entry into the national phase (Ref document number: 2021441603; Country of ref document: AU; Date of ref document: 20210420; Kind code of ref document: A)
WWE WIPO information: entry into national phase (Ref document number: 2021937833; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2021937833; Country of ref document: EP; Effective date: 20231120)