WO2019235567A1 - Protein interaction analysis device and analysis method - Google Patents

Protein interaction analysis device and analysis method

Info

Publication number
WO2019235567A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
protein
ligand
binding site
ligand binding
Prior art date
Application number
PCT/JP2019/022528
Other languages
French (fr)
Japanese (ja)
Inventor
洋一 西田
真知子 朝家
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所
Priority to JP2020523173A (patent JP6995990B2)
Publication of WO2019235567A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00 Apparatus for enzymology or microbiology
    • C12M1/34 Measuring or testing with condition measuring or sensing means, e.g. colony counters
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction

Definitions

  • the present invention relates to a protein interaction analysis apparatus and analysis method used when analyzing the interaction between a ligand and a protein.
  • the enzyme has a so-called substrate specificity that recognizes a substrate having a specific structure.
  • the receptor specifically binds to a physiologically active substance having a specific structure and expresses its action (for example, signal transduction activity or transcription promoting activity).
  • proteins such as enzymes and receptors function through specific binding with so-called ligands such as substrates and physiologically active substances.
  • as a result of research related to proteins, nucleotide sequence information on the genes encoding them, amino acid sequence information, and three-dimensional structure information are accumulated daily.
  • for sequence information, databases such as NCBI (National Center for Biotechnology Information) GenBank, the DNA Data Bank of Japan (DDBJ), and EMBL have been constructed.
  • for three-dimensional structure information, the Protein Data Bank (PDB), including Protein Data Bank Japan (PDBj), has been constructed.
  • KEGG has been constructed as a database that integrates information on intermolecular networks such as metabolism and signal transduction.
  • “synthetic biotechnology” that predicts metabolic pathways and gene sequences not possessed by natural microorganisms by computational science and artificially designs them is drawing attention.
  • in synthetic biotechnology, for example, in order to produce a target substance, a metabolic pathway for biosynthesizing the final target substance from a starting material is constructed using the various data described above, and a host organism is then produced by that technique.
  • the metabolic pathway can be designed by combining a plurality of enzyme reactions comprising a substrate and an enzyme.
  • the basic structure of a lead compound that can interact with the ligand binding site is designed based on the three-dimensional structure data of the ligand binding site in the target protein.
  • Patent Document 1 discloses a method of identifying the function of a protein whose function is unknown in consideration of protein-protein interaction.
  • teacher data is obtained from the three-dimensional structures of a plurality of receptors with known functions and the three-dimensional structures of a plurality of ligands.
  • the teacher data disclosed in Patent Document 1 includes, for each receptor, a plurality of shape complementarity evaluation values obtained when the receptor is docked with a plurality of ligands, and charge information including the total charge of the receptor, the total charge of its surface, and the difference between them.
  • in Patent Document 1, a plurality of shape complementarity evaluation values obtained when a protein of unknown function is docked with a plurality of ligands, together with the charge information of that protein, are input, and the function of the protein is identified by learning from the teacher data.
  • Non-Patent Document 1 analyzes the conformation formed between the adenylation domain (AdD) and the oligonucleotide binding domain (OBD) of DNA ligase based on the electrostatic potential distributions of these domains, and verifies its relationship with the enzymatic reaction.
  • Patent Document 2 discloses that a mutation was introduced into the domain by a site-directed mutagenesis method to identify amino acid residues that are deeply involved in the ligation reaction. From these Patent Documents 1 and 2, it can be understood that functional analysis of a protein can be performed by conformational analysis based on the electrostatic potential distribution on the surface of the protein.
  • Tanabe M., Ishino S., Yohda M., Morikawa K., Ishino Y., Nishida H. (2012) Structure-based mutational study of an archaeal DNA ligase towards improvement of ligation activity.
  • an objective of the present invention is to provide a protein interaction analysis apparatus and analysis method that output protein interaction data with which the specific interaction between a protein and a ligand can be analyzed accurately.
  • ligand binding site-related information including at least three-dimensional structure data on the ligand binding site based on protein-related information and electrostatic potential distribution in the ligand binding site.
  • the present invention includes the following. (1) A protein interaction analysis apparatus comprising: a data input unit for inputting information on an analysis target; a data storage unit that stores, in association with one another, surface shape data of the ligand binding site of a given protein, electrostatic potential distribution data of the ligand binding site, and three-dimensional structure data relating to a ligand that interacts with the ligand binding site, these data being generated based on the amino acid sequence data and three-dimensional structure data of the protein stored in an external storage unit and on the three-dimensional structure data of the ligand that specifically interacts with the protein; and a calculation processing unit that, using as teacher data the surface shape data of the predetermined ligand binding site, the electrostatic potential distribution data of the ligand binding site, and the three-dimensional structure data relating to the ligand that interacts with the ligand binding site stored in the data storage unit, generates by machine learning data relating to protein interaction for the analysis target input by the data input unit.
  • the information on the analysis target is information on the structure of the ligand
  • the information on the analysis target is information on the structure of the protein or ligand binding site,
  • the protein interaction analysis apparatus according to (1), wherein the calculation processing unit further comprises an evaluation value calculation unit that calculates, for the protein interaction data generated by machine learning, an evaluation value indicating the similarity between the analysis target input by the data input unit and the analysis target included in the generated data.
  • the protein interaction analysis apparatus according to (1), wherein the calculation processing unit further comprises a protein-ligand compatibility score calculation unit that calculates, for the protein interaction data generated by machine learning, a fitness score that quantitatively indicates the binding stability when the analysis target input by the data input unit interacts.
  • the data storage unit calculates the center coordinates of the ligand binding site based on the atomic coordinates of the atoms constituting the ligand binding site, and creates a three-dimensional grid space including the atoms within a predetermined distance from the center coordinates.
  • the protein interaction analysis apparatus according to (6), wherein the three-dimensional grid space has a plurality of lattice points defined by a grid set at a predetermined interval, and the surface shape data is data in which a specific character is given to the lattice point closest to each atom within the predetermined distance from the center coordinates and another character is given to the lattice points to which the specific character is not given.
  • the data storage unit stores positive electrostatic potential distribution data consisting of the positive values calculated for the lattice points of the three-dimensional grid space, and negative electrostatic potential distribution data consisting of the negative values calculated for the lattice points of the three-dimensional grid space.
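  • The separation into positive and negative electrostatic potential distribution data can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_potential` is hypothetical, and filling the lattice points that do not belong to a channel with 0.0 is an assumption, since the text does not say how such points are represented.

```python
def split_potential(grid_values):
    """Split per-lattice-point potentials into two channels.

    grid_values: flat list of potential values, one per lattice point.
    Returns (positive, negative); points outside a channel are set to 0.0
    (an assumption made for this sketch).
    """
    positive = [v if v > 0.0 else 0.0 for v in grid_values]
    negative = [v if v < 0.0 else 0.0 for v in grid_values]
    return positive, negative


if __name__ == "__main__":
    pos, neg = split_potential([1.5, -0.5, 0.0, 2.0])
    print(pos)
    print(neg)
```

    Keeping the two signs in separate arrays lets a learner treat them as independent input channels, which is one plausible reason for storing them separately.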
  • (11) A protein interaction analysis method comprising: a step of inputting information on an analysis target with an input device; a step in which an arithmetic device generates, based on the amino acid sequence data and three-dimensional structure data of a protein stored in an external storage unit and on the three-dimensional structure data of a ligand that specifically interacts with the protein, surface shape data of the ligand binding site of the predetermined protein, electrostatic potential distribution data of the ligand binding site, and three-dimensional structure data relating to the ligand that interacts with the ligand binding site, and stores these surface shape data, electrostatic potential distribution data, and ligand three-dimensional structure data in a storage device in association with one another; and a step in which the arithmetic device, using as teacher data the surface shape data of the predetermined ligand binding site, the electrostatic potential distribution data of the ligand binding site, and the three-dimensional structure data relating to the ligand that interacts with the ligand binding site stored in the storage device, generates by machine learning data relating to protein interaction for the analysis target input with the input device.
  • the information on the analysis target is information on the structure of the ligand
  • the protein interaction analysis method according to (11), wherein the information on the analysis target is information on the structure of the protein or ligand binding site, and the arithmetic device generates data on a compound or ligand that interacts with the protein or ligand binding site.
  • the arithmetic device calculates the center coordinates of the ligand binding site based on the atomic coordinates of the atoms constituting the ligand binding site, sets a three-dimensional grid space including the atoms within a predetermined distance from the center coordinates, and produces the surface shape data.
  • the three-dimensional grid space has a plurality of lattice points defined by a grid set at a predetermined interval, and a specific character is given to the lattice point closest to each atom within the predetermined distance from the center coordinates.
  • the protein interaction analysis method according to (11), wherein the arithmetic device calculates positive electrostatic potential distribution data consisting of the positive values calculated for the lattice points of the three-dimensional grid space and negative electrostatic potential distribution data consisting of the negative values calculated for those lattice points, and stores them in the data storage unit.
  • the protein interaction analysis apparatus and analysis method according to the present invention can accurately analyze the specific interaction between a protein and a ligand.
  • in particular, a protein or ligand that is highly likely to interact specifically with a ligand or protein specified by a user can be analyzed with high accuracy by machine learning.
  • the protein interaction analysis apparatus to which the present invention is applied generates, based on the amino acid sequence data and three-dimensional structure data of a protein to be analyzed, feature data usable for three-dimensional structure analysis of the site in the protein that interacts with a ligand (the ligand binding site) (hereinafter referred to as “ligand binding site surface property data”), and analyzes the interaction between the ligand and the protein through machine learning using that data.
  • the ligand binding site surface property data is data combining the surface shape data of the ligand binding site and the electrostatic potential distribution data of the ligand binding site.
  • the protein interaction analysis apparatus 1 shown in FIG. 1 is connected to an external storage unit 2 that stores amino acid sequences and three-dimensional structure data of proteins and data on ligands for those proteins, and includes a surface shape data generation unit 3 that generates surface shape data of a ligand binding site in a predetermined protein, an electrostatic potential distribution data generation unit 4 that generates the electrostatic potential distribution of the ligand binding site, and a ligand three-dimensional structure data generation unit 5 that generates three-dimensional structure data relating to a ligand for the protein.
  • the protein interaction analysis apparatus 1 also includes a data storage unit 6 that stores, as teacher data, the surface shape data and electrostatic potential distribution data (ligand binding site surface property data) for a predetermined ligand binding site generated by the surface shape data generation unit 3 and the electrostatic potential distribution data generation unit 4, together with the three-dimensional structure data relating to the ligand that interacts with the ligand binding site. Furthermore, the protein interaction analysis apparatus 1 includes a data input unit 7 through which the user inputs data to be analyzed.
  • the protein interaction analysis apparatus 1 further includes a calculation processing unit 8 that uses the teacher data stored in the data storage unit 6 to generate, by machine learning, data relating to protein interaction for the analysis target input by the data input unit 7, and an output unit 9 that outputs the result calculated by the calculation processing unit 8.
  • the calculation processing unit 8 analyzes the interaction between the ligand and the protein according to the user's designation.
  • the calculation processing unit 8 includes a machine learning unit 10 that performs machine learning using the teacher data stored in the data storage unit 6, an evaluation value calculation unit 11 that calculates an evaluation value indicating the similarity between the analysis target input by the data input unit 7 and a protein or ligand included in the teacher data, and a list generation unit 12 that generates a list combining the result of the machine learning performed by the machine learning unit 10 with the evaluation value calculated by the evaluation value calculation unit 11.
  • the protein interaction analysis apparatus 1 shown in FIG. 1 is configured to be connected to one external storage unit 2 storing the above-described data.
  • the protein interaction analysis apparatus 1 may be connected to a plurality of external storage units that store the above-described data in a distributed manner.
  • for example, the protein interaction analysis apparatus 1 may be connectable to one external storage unit that stores amino acid sequences and three-dimensional structure data of proteins and to another external storage unit that stores data on ligands for proteins.
  • the data stored in the external storage unit 2 is data relating to the amino acid sequence, three-dimensional structure data, and ligand of a predetermined protein.
  • the ligand broadly means a substance that specifically interacts with a protein, such as a substrate for an enzyme, a low-molecular compound that interacts with a receptor protein, a coenzyme, or a regulatory factor.
  • in a narrow sense, a ligand may be interpreted as being limited to a substance that binds to a receptor present on a cell membrane or to an intracellular receptor.
  • in this specification, the term “ligand” is used in the broad sense and covers substances that interact specifically with proteins, including substrates for enzymes, coenzymes, regulatory factors, and substances that bind to receptors.
  • the ligand may be either a low molecular compound or a high molecular compound, and may mean a partial region of the compound. That is, the molecular structure and atomic coordinates of the ligand may be the molecular structure and atomic coordinates of the entire compound that interacts with the protein, or may be the molecular structure and atomic coordinates of at least a partial region that interacts with the protein in the compound.
  • Protein means a polymer compound having an amino acid sequence as a primary structure, and may be any of a monomer, a homomultimer and a heteromultimer.
  • the protein may have post-translational chemical modifications such as sugar chain addition, functional group addition, and phosphorylation. Therefore, the three-dimensional structure data based on the atomic coordinates at the ligand binding site may be data obtained for the protein without such post-translational chemical modification, or data obtained for the protein having such modification.
  • alternatively, three-dimensional structure data based on corrected atomic coordinates may be used, in which the atomic coordinates obtained for the protein without the post-translational chemical modification are adjusted so that they become the atomic coordinates of the protein with a predetermined chemical modification.
  • the atomic coordinates mean data indicating the coordinates of atoms constituting the protein.
  • Atomic coordinates can be obtained for various proteins by either or both of an X-ray crystal structure analysis method mainly using a protein single crystal and a nuclear magnetic resonance method targeting a protein solution.
  • the atomic coordinates can also be obtained by a nuclear magnetic resonance technique using a stable isotope called a stereo-array isotope labeling method.
  • the format of the atomic coordinates is not particularly limited; for example, each atom constituting the protein can be expressed as a combination of x, y, and z coordinates.
  • the unit of each coordinate can be, for example, [Å].
  • the protein interaction analyzer 1 can be configured to be connected to the PDB as the external storage unit 2.
  • in the PDB, the atomic coordinates are listed as one line of data per atom number under a predetermined record name (ATOM for standard amino acids). Each line contains the atom name (main-chain amide nitrogen: N, α carbon: CA, β carbon: CB, and so on), the residue name (three-letter amino acid code), the chain ID, the residue number, the x coordinate [Å], the y coordinate [Å], the z coordinate [Å], the occupancy (the proportion of the atom present at that position in the analyzed sample, for example the crystal; usually 1.00), and the temperature factor B [Å²] (when determined by X-ray crystallography).
  • the data stored in the PDB also includes the type of protein molecule, the registered name and accession number (HEADER line), the title (TITLE line), protein molecule information (COMPND line), protein host information (SOURCE line), information on the three-dimensional structure analysis experiment (REMARK line), amino acid sequence information (SEQRES line), α-helix information (HELIX line), and positional information on intramolecular disulfide bonds (SSBOND line).
  • the data stored in the PDB includes data on the molecular structure and atomic coordinates of the ligand that interacts with the ligand binding site described above.
  • the PDB includes information on the molecular structure of the ligand (HETATM line) and information on ligand bonding (CONECT line).
  • the HETATM line includes information for identifying the atoms constituting the ligand and the coordinates of those atoms.
  • the HETATM line also includes information indicating the type of conformer when the ligand has conformers.
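  • Reading the ATOM and HETATM lines described above can be sketched as follows. The column positions follow the fixed-width layout of the PDB file format; the names `AtomRecord`, `parse_atom_line`, and `parse_pdb_atoms` are hypothetical, and defaulting a blank occupancy to 1.00 and a blank temperature factor to 0.0 is an assumption made for this sketch.

```python
from typing import List, NamedTuple


class AtomRecord(NamedTuple):
    record: str      # "ATOM" (protein) or "HETATM" (ligand and other hetero atoms)
    serial: int      # atom number
    name: str        # atom name, e.g. N, CA, CB
    res_name: str    # three-letter residue or ligand code
    chain_id: str
    res_seq: int     # residue number
    x: float         # coordinates in angstroms
    y: float
    z: float
    occupancy: float
    b_factor: float  # temperature factor in square angstroms


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one fixed-width ATOM/HETATM line (0-based slices of 1-based PDB columns)."""
    occ = line[54:60].strip()
    bfac = line[60:66].strip()
    return AtomRecord(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(occ) if occ else 1.0,   # assumed default
        b_factor=float(bfac) if bfac else 0.0,  # assumed default
    )


def parse_pdb_atoms(text: str) -> List[AtomRecord]:
    """Collect every ATOM and HETATM record from PDB-format text."""
    return [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]
```

    Filtering the result on `record == "HETATM"` separates ligand atoms from protein atoms, which corresponds to the protein and ligand data the text says are read from the external storage unit.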
  • the protein interaction analyzer 1 uses the data stored in the external storage unit 2 to generate surface shape data related to a predetermined protein including a ligand binding site in the surface shape data generation unit 3.
  • surface shape data of the entire protein may be generated, or surface shape data may be generated for a partial region including a ligand binding site in the protein.
  • the surface shape data generation unit 3 preferably generates surface shape data for a partial region including the entire ligand binding site in the protein.
  • the surface shape data can be data in which the surface of the protein to which the ligand binds is the xy plane and the unevenness in the xy plane is indicated by a value in the z-axis direction.
  • the xy plane in the surface shape data can be a plane in which the ligand occupies the maximum area in a state where the ligand specifically interacts with the protein. That is, it is preferable that the plane on which the ligand can be projected most greatly in the state where the ligand specifically interacts with the protein is the xy plane in the surface shape data.
  • the xy plane in the surface shape data can be a plane in which the ligand binding site in the protein occupies the maximum area. That is, in the protein three-dimensional structure, the plane where the ligand binding site is the largest is preferably the xy plane in the surface shape data.
  • the value in the z-axis direction on the xy plane can be obtained as a discrete value at each mesh point (intersection) when the xy plane is treated as mesh data.
  • for example, mesh data with an interval of 0.05 to 1.0 Å, preferably 0.1 to 0.5 Å, more preferably 0.2 Å, can be used.
  • alternatively, the value in the z-axis direction on the xy plane can be obtained as the average of the z-axis values calculated over the region within each mesh cell, using the xy plane as mesh data as described above.
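  • The per-cell averaging of z values can be sketched as follows. This assumes the surface is already available as a list of sampled (x, y, z) points expressed in the coordinate system of the chosen xy plane; how those surface points are obtained is outside this sketch, and the function name `mesh_heights` is hypothetical.

```python
import math
from collections import defaultdict


def mesh_heights(samples, spacing):
    """Average the z values of surface samples that fall in the same mesh cell.

    samples: iterable of (x, y, z) surface points.
    spacing: mesh interval, in the same length unit as the coordinates.
    Returns a dict mapping (i, j) cell indices to the mean z in that cell.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for x, y, z in samples:
        key = (math.floor(x / spacing), math.floor(y / spacing))
        sums[key][0] += z
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


if __name__ == "__main__":
    pts = [(0.05, 0.05, 1.0), (0.15, 0.10, 3.0), (0.25, 0.05, 5.0)]
    print(mesh_heights(pts, 0.2))
```

    Cells that receive no sample are simply absent from the result; a caller would need to decide how to fill them, which the text does not specify.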
  • the surface shape data generation unit 3 may generate a plurality of surface shape data for one ligand binding site.
  • the surface shape data generation unit 3 may generate a pair of surface shape data as a so-called stereogram as a plurality of surface shape data.
  • the surface shape data generation unit 3 can also set a plurality of planes by tilting the xy plane, in which the ligand occupies the maximum area while specifically interacting with the protein, by a predetermined angle around the x axis or the y axis, for example within the range of ±0.5 to 10 degrees, preferably ±1 to 5 degrees, and generate surface shape data for each of the planes.
  • for example, the surface shape data generation unit 3 can generate surface shape data for the xy plane in which the ligand occupies the maximum area while specifically interacting with the protein and for each of the two planes obtained by tilting that xy plane by ±5 degrees around the y axis (three sets of surface shape data in total). Alternatively, the surface shape data generation unit 3 can generate surface shape data for the xy plane in which the ligand binding site in the protein occupies the maximum area and for each of the two planes tilted by ±5 degrees around the y axis of that xy plane (again three sets of surface shape data in total).
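  • Generating the untilted view and the two views tilted by ±5 degrees around the y axis can be sketched with a standard rotation. The function names are hypothetical, and rotating the coordinates rather than the plane is a choice made for this sketch; because both signs are generated, the resulting set of three views is the same either way.

```python
import math


def rotate_about_y(point, degrees):
    """Rotate an (x, y, z) point around the y axis by the given angle in degrees."""
    t = math.radians(degrees)
    x, y, z = point
    return (x * math.cos(t) + z * math.sin(t),
            y,
            -x * math.sin(t) + z * math.cos(t))


def tilted_views(points, angle=5.0):
    """Return three coordinate sets keyed by tilt angle: -angle, 0, +angle."""
    return {a: [rotate_about_y(p, a) for p in points]
            for a in (-angle, 0.0, angle)}


if __name__ == "__main__":
    views = tilted_views([(1.0, 2.0, 3.0)])
    for tilt, coords in views.items():
        print(tilt, coords)
```

    Surface shape data would then be computed once per view, giving the three data sets mentioned in the text.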
  • the surface shape data generation unit 3 can generate, as the xy plane in which the ligand occupies the maximum area while specifically interacting with the protein, an xy plane whose side length is in the range of, for example, 30 to 100 Å, preferably 40 to 80 Å, more preferably 45 to 60 Å.
  • similarly, the surface shape data generation unit 3 can generate, as the xy plane in which the ligand binding site in the protein occupies the maximum area, an xy plane whose side length is in the range of, for example, 30 to 100 Å, preferably 40 to 80 Å, more preferably 45 to 60 Å.
  • the protein interaction analyzer 1 uses the data stored in the external storage unit 2 to generate electrostatic potential distribution data of a ligand binding site in a predetermined protein in the electrostatic potential distribution data generation unit 4.
  • the electrostatic potential distribution data generation unit 4 can calculate the electrostatic potential distribution of the ligand binding site by applying a known method for calculating the surface charge of the protein.
  • the electrostatic potential distribution data generation unit 4 can appropriately use a conventionally known method for calculating the electrostatic potential (surface charge) of a protein.
  • the electrostatic potential (surface charge) can be defined as the Coulomb energy experienced by a unit positive charge placed at an arbitrary point.
  • to calculate the electrostatic potential (surface charge) of a protein, first, the species and coordinates of the non-hydrogen atoms, such as carbon (C), oxygen (O), nitrogen (N), and sulfur (S), are read from the amino acid sequence and atomic coordinate data of the protein stored in the external storage unit 2, such as the PDB. Next, the hydrogen atoms bonded to each of the non-hydrogen atoms that were read, and their coordinates, are calculated. By using these coordinates together, information on all atoms in the molecule, that is, information on all electrons, is obtained. Then, using this information and assuming a constant dielectric constant, the electrostatic charge at any position inside or outside the protein molecule can be calculated.
  • the value calculated at the surface of the protein can then be used as the electrostatic potential (surface charge) of the protein.
  • specifically, it is desirable to extract, from the spatially continuous values, the values on the curved surface obtained by the surface shape data generation unit 3, and to store the (x, y, z) values of the surface shape data together with the electrostatic potential (surface charge) value c as four-dimensional data (x, y, z, c).
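  • The idea of evaluating a potential at surface points and storing (x, y, z, c) records can be sketched with a plain Coulomb sum under a constant dielectric. This is a simplification for illustration only, not the method of the tools cited in the text: the per-atom partial charges are assumed to be supplied by the caller, the constant 332.06 (kcal·Å/(mol·e²)) is an approximate conversion factor, and all function names are hypothetical.

```python
import math

# Approximate Coulomb constant in kcal*angstrom/(mol*e^2); an assumption of this sketch.
COULOMB_KCAL = 332.06


def coulomb_potential(point, charges, dielectric=1.0):
    """Potential at `point` from point charges, assuming a uniform dielectric.

    charges: iterable of ((x, y, z), q) with q in units of the elementary charge.
    """
    total = 0.0
    for position, q in charges:
        r = math.dist(point, position)
        if r < 1e-6:  # skip a charge sitting exactly on the evaluation point
            continue
        total += q / r
    return COULOMB_KCAL * total / dielectric


def surface_potential_records(surface_points, charges, dielectric=1.0):
    """Return four-dimensional records (x, y, z, c) for each surface point."""
    return [(x, y, z, coulomb_potential((x, y, z), charges, dielectric))
            for x, y, z in surface_points]


if __name__ == "__main__":
    atoms = [((0.0, 0.0, 0.0), 1.0), ((3.0, 0.0, 0.0), -1.0)]
    for record in surface_potential_records([(1.0, 0.0, 0.0)], atoms, dielectric=4.0):
        print(record)
```

    A production calculation would normally account for the solvent boundary and ionic strength, which a uniform-dielectric Coulomb sum ignores.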
  • examples of conventionally known methods for calculating the electrostatic potential (surface charge) of proteins include the method of Rocchia et al., Journal of Computational Chemistry, Vol. 23, No. 1, 128-137, 2002.
  • Examples of usable software for calculating the electrostatic potential (surface charge) include GRASP, Chimera, APBS, and QUANTA.
  • the electrostatic potential distribution data generation unit 4 may generate the electrostatic potential distribution for the entire region of the xy plane created by the surface shape data generation unit 3, or the electrostatic potential distribution for the partial region of the xy plane. It may be generated.
  • as the partial region of the xy plane created by the surface shape data generation unit 3, an electrostatic potential distribution may be generated for a region containing the ligand binding site included in the xy plane, for example a spatial region within 10 Å, preferably within 5 Å, of the ligand binding site.
  • when the surface shape data generation unit 3 creates a plurality of xy planes for one ligand binding site, the electrostatic potential distribution data generation unit 4 may generate electrostatic potential distributions for all of the xy planes or for only some of them.
  • the value of the electrostatic potential in the xy plane can also be obtained as a discrete value at each mesh point (intersection) when the xy plane is treated as mesh data.
  • for example, the xy plane can be treated as mesh data with an interval of 0.05 to 1.0 Å, preferably 0.1 to 0.5 Å, more preferably 0.2 Å, and the surface charge value can be obtained as a discrete value at each mesh point (intersection).
  • the value of the surface charge in the xy plane can also be obtained as the average of the electrostatic potentials calculated over the region within each mesh cell, using the xy plane as mesh data as described above.
  • in the above description, the surface shape data generation unit 3 sets the surface of the protein or the surface of the ligand binding site as the xy plane and generates data expressing the unevenness of that plane as values in the z-axis direction, and the electrostatic potential distribution data generation unit 4 generates the electrostatic potential distribution for the entire region or a partial region of the xy plane created by the surface shape data generation unit 3; however, the protein interaction analyzer 1 is not limited to this form. That is, in the protein interaction analyzer 1, the surface shape data generation unit 3 may define, for example, a three-dimensional grid space based on the atomic coordinates of the atoms constituting the protein and generate surface shape data for the entire protein, or generate surface shape data for a partial region including the ligand binding site in the protein. The electrostatic potential distribution data generation unit 4 may then generate the electrostatic potential distribution in that three-dimensional grid space.
  • data related to atomic coordinates other than the residue name is extracted from the data associated with each atomic number. That is, data relating to atomic coordinates relating to the whole protein or a ligand binding site in the protein is extracted from the data set stored in the PDB.
  • the method for calculating the central coordinates is not particularly limited; for example, from the atomic coordinates of the atoms constituting the whole protein extracted as described above, or from the atomic coordinates of the atoms constituting the ligand binding site, the arithmetic averages of the x coordinate [Å], the y coordinate [Å], and the z coordinate [Å] are each calculated, and the obtained average values can be used as the center coordinates.
  • to obtain the central coordinates of the ligand binding site, after the atomic coordinates for the entire protein are extracted, only the atomic coordinates of the atoms constituting the ligand binding site may be further extracted and their arithmetic average calculated as described above.
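The center-coordinate calculation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `center_coordinates` and the sample coordinates are hypothetical, and atoms are assumed to be given as (x, y, z) tuples in Å.

```python
from statistics import fmean


def center_coordinates(atoms):
    """Return the arithmetic mean (x, y, z) of a list of atomic coordinates in Å."""
    if not atoms:
        raise ValueError("at least one atom is required")
    xs, ys, zs = zip(*atoms)  # split into per-axis sequences
    return (fmean(xs), fmean(ys), fmean(zs))


# Hypothetical coordinates (Å) of three atoms that make up a ligand binding site.
site_atoms = [(1.0, 2.0, 3.0), (3.0, 2.0, 1.0), (2.0, 5.0, 2.0)]
center = center_coordinates(site_atoms)
print(center)
```

Passing only the binding-site atoms gives the center of the ligand binding site; passing every atom of the protein gives the center of the whole protein.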
  • the distance from the center coordinates, that is, the radius of the spherical surface, can be set arbitrarily; for example, it can be set in the range of 15 to 50 Å, preferably 20 to 40 Å, and more preferably 23 to 30 Å.
  • a cube that is inscribed or circumscribed with respect to a spherical surface having a predetermined radius from the center coordinates is given, and a three-dimensional grid space is given by dividing each side of the cube at a predetermined interval.
  • the predetermined interval is not particularly limited; for example, it can be set to 0.25 Å or 0.5 Å.
  • the coordinates of the lattice points of each partition in the three-dimensional grid space can be defined in a coordinate system common to the atomic coordinates of the atoms constituting the whole protein or the atoms constituting the ligand binding site.
  • the surface shape data generation unit 3 can generate the entire surface shape data of the protein by defining the three-dimensional grid generated based on the atomic coordinates of the atoms constituting the protein.
  • a three-dimensional grid space can be defined as shown in FIG.
  • a spherical surface 14 having a predetermined radius (for example, 10 to 20 Å)
  • the first partition 15 (solid line in FIG.; 0.5 Å)
  • a second partition 16 that further divides the first partition 15 into two
  • a predetermined grid number (grid positions count) 17 can be defined for each side of the cube inscribed in the spherical surface 14.
  • a three-dimensional grid space made up of cubes inscribed in the spherical surface 14 divided in a predetermined number of grids in this way is referred to as Voxel.
  • the cube inscribed in the spherical surface 14 is a three-dimensional grid space, Voxel.
  • the vicinity of eight corners of the cube is a space where no atomic coordinates exist.
  • atomic coordinates are included in all of the Voxel.
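As a rough sketch of the grid definition above, the following computes the origin and the number of lattice points per side for a cube inscribed in a sphere. The function name `inscribed_cube_grid` is hypothetical. The sketch assumes the cube's vertices lie on the sphere (so its side is 2r/√3) and that each side is divided into whole partitions of the given interval; the source does not specify how a fractional remainder is handled.

```python
import math


def inscribed_cube_grid(center, radius, spacing):
    """Return (origin, points_per_side) for a cube inscribed in a sphere.

    center  : (x, y, z) center coordinates in Å
    radius  : sphere radius in Å
    spacing : interval between lattice points in Å
    """
    side = 2.0 * radius / math.sqrt(3.0)  # the cube's space diagonal equals the sphere diameter
    cells = int(side // spacing)          # whole partitions that fit along one side (assumption)
    half = cells * spacing / 2.0          # keep the grid centered on the center coordinates
    origin = tuple(c - half for c in center)
    return origin, cells + 1              # lattice points = partitions + 1


origin, points_per_side = inscribed_cube_grid((0.0, 0.0, 0.0), radius=25.0, spacing=0.5)
print(origin, points_per_side)
```

Because the origin is expressed in the same coordinate system as the atomic coordinates, a lattice index (i, j, k) maps to the position origin + spacing × (i, j, k). For a circumscribed cube the side would simply be 2r.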
  • the electrostatic potential distribution data generation unit 4 uses the data stored in the external storage unit 2 to generate potential distribution data.
  • the electrostatic potential distribution data generation unit 4 reads information and coordinates of non-hydrogen atomic species such as carbon (C), oxygen (O), nitrogen (N), sulfur (S), and the like for each non-hydrogen atomic species.
  • Voxel which is a three-dimensional grid space.
  • a specific character such as “1” is given to the lattice point closest to the non-hydrogen atomic species among the lattice points in Voxel, and “1” is not given to the other lattice points.
  • Voxel data can be generated for each non-hydrogen atomic species such as data and Voxel “S” data related to sulfur atoms.
  • the data set obtained in this way is defined as three-dimensional convolution data (3D Convolution data).
  • each atom is assigned to the closest lattice point based on the atomic coordinate data (the value of the lattice point is 1).
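The assignment of atoms to lattice points can be sketched as below. This is one possible reading, with hypothetical names (`voxelize_by_element`, `value_at`): each element gets its own channel, stored sparsely as the set of lattice indices whose value is 1. Atoms falling outside the grid are skipped, which is an assumption of this sketch.

```python
def voxelize_by_element(atoms, origin, spacing, points_per_side):
    """Assign each atom to its closest lattice point, one channel per element.

    atoms : iterable of (element, x, y, z) with coordinates in Å
    Returns {element: set of (i, j, k)}; a listed index has value 1, all others 0.
    """
    channels = {}
    for element, x, y, z in atoms:
        # Nearest lattice index along each axis.
        index = tuple(round((c - o) / spacing) for c, o in zip((x, y, z), origin))
        if all(0 <= i < points_per_side for i in index):
            channels.setdefault(element, set()).add(index)
    return channels


def value_at(channels, element, index):
    """Return 1 if the lattice point is occupied in the element's channel, else 0."""
    return 1 if index in channels.get(element, set()) else 0


# Hypothetical atoms; the last one lies outside the 5 x 5 x 5 grid and is skipped.
atoms = [("C", 0.1, 0.6, 1.9), ("O", 1.2, 1.2, 0.3), ("C", 5.0, 0.0, 0.0)]
channels = voxelize_by_element(atoms, origin=(0.0, 0.0, 0.0), spacing=0.5, points_per_side=5)
print(channels)
```

A sparse set keeps the sketch small; a dense array per element would be the more usual input format for a 3D convolution layer.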
  • the electrostatic potential distribution data generation unit 4 can generate the electrostatic potential distribution by, for example, the method disclosed in Rocchia et al., Journal of Computational Chemistry, Vol. 23, No. 1, pp. 128-137, 2002, or by using commercially available software such as GRASP, Chimera, APBS, and QUANTA.
  • the electrostatic potential distribution data generation unit 4 can store the calculated electrostatic potential values having a positive value and those having a negative value as different data. That is, the electrostatic potential distribution data generation unit 4 can generate Voxel “positive” data and Voxel “negative” data based on the calculated electrostatic potential value.
  • FIG. 5B schematically shows storing as different data based on whether the electrostatic potential value is “positive” or “negative”. In FIG. 5B, the inside of the circle is expressed by shading according to the absolute values of the “positive” and “negative” values.
  • Voxel “positive” data and Voxel “negative” data, together with Voxel “O” data for oxygen atoms, Voxel “N” data for nitrogen atoms, Voxel “S” data for sulfur atoms, and the like, can be used as three-dimensional convolution data (3D Convolution data).
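Storing positive and negative electrostatic potential values as separate data can be sketched as follows. The function name `split_potential` is hypothetical, and storing the negative channel as magnitudes (absolute values) is an assumption of this sketch, suggested by the shading by absolute value described for FIG. 5B.

```python
def split_potential(potential):
    """Split a 3D grid of potential values into "positive" and "negative" channels.

    potential : nested lists indexed [i][j][k]
    Returns (positive, negative); each keeps the magnitude where the sign matches
    and 0.0 elsewhere.
    """
    positive = [[[v if v > 0 else 0.0 for v in row] for row in plane] for plane in potential]
    negative = [[[-v if v < 0 else 0.0 for v in row] for row in plane] for plane in potential]
    return positive, negative


# Hypothetical 1 x 2 x 2 grid of electrostatic potential values.
grid = [[[1.5, -0.5], [0.0, -2.0]]]
voxel_positive, voxel_negative = split_potential(grid)
print(voxel_positive, voxel_negative)
```

The two grids can then be stacked with the per-element occupancy grids as additional input channels.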
  • the protein interaction analysis apparatus 1 uses the data stored in the external storage unit 2 to generate the three-dimensional structure data related to the ligand in the ligand structure data generation unit 5.
  • the ligand structure data generation unit 5 can obtain the three-dimensional structure data of the ligand by appropriately using a conventionally known method.
  • when obtaining the three-dimensional structure data of the ligand, the ligand structure data generation unit 5 uses the data related to the ligand stored in the external storage unit 2, that is, the molecular formula, the structural formula, the compound name, and the information on the atoms interacting with the protein.
  • the three-dimensional structure of the ligand is extracted, and the three-dimensional structure data of the ligand is generated based on these data. For example, when using the PDB, the residue names are searched, using the _nonpolymer flag as an index, in the same file from which the atomic coordinate data used for generating the surface shape data described above were extracted.
  • a residue whose name is not an amino acid and/or whose name indicates an organic compound other than a protein can be determined to be a ligand molecule.
  • the compound name is associated with the _nonpolymer flag as a three-letter code. Therefore, for a compound determined to be a ligand molecule, the three-dimensional coordinates (x, y, z) of all or some of the atoms constituting the ligand can be extracted from the coordinate data section in the latter part of the file, using the three-letter code as a key.
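Extracting ligand coordinates by three-letter code can be sketched as below. This sketch does not use the _nonpolymer flag mentioned in the text; as a stand-in it assumes the legacy fixed-column PDB format, where non-polymer atoms appear in HETATM records. The names `ligand_coordinates` and `_pdb_line`, and the sample records, are hypothetical.

```python
def ligand_coordinates(pdb_lines, res_name):
    """Collect (element, x, y, z) for HETATM records whose residue name matches.

    Uses the fixed columns of the legacy PDB format: residue name in columns 18-20,
    x/y/z in columns 31-38, 39-46, 47-54, element symbol in columns 77-78.
    """
    coords = []
    for line in pdb_lines:
        if line.startswith("HETATM") and line[17:20].strip() == res_name:
            element = line[76:78].strip() or line[12:16].strip()
            x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])
            coords.append((element, x, y, z))
    return coords


def _pdb_line(record, serial, atom, res, chain, seq, x, y, z, element):
    """Build one fixed-column coordinate record (helper for this example only)."""
    return (f"{record:<6}{serial:>5} {atom:<4} {res:>3} {chain}{seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}")


# Hypothetical records: one protein atom, two ligand atoms, one water molecule.
lines = [
    _pdb_line("ATOM", 1, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0, "C"),
    _pdb_line("HETATM", 2, "PG", "ATP", "A", 501, 1.0, 2.0, 3.0, "P"),
    _pdb_line("HETATM", 3, "O1G", "ATP", "A", 501, 1.5, 2.5, 3.5, "O"),
    _pdb_line("HETATM", 4, "O", "HOH", "A", 601, 9.0, 9.0, 9.0, "O"),
]
ligand_atoms = ligand_coordinates(lines, "ATP")
print(ligand_atoms)
```

Filtering the returned atoms further, for example by distance to the binding site, would give coordinates for only part of the ligand.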
  • the three-dimensional structure data of the ligand generated by the ligand structure data generation unit 5 may be the three-dimensional structure data of the whole compound that interacts with the protein, or may be the three-dimensional structure data of the partial region of the compound that interacts with the protein.
  • the ligand structure data generation unit 5 may obtain two-dimensional graph structure data using a molecular compound structure description method when obtaining the three-dimensional structure data of the ligand.
  • examples of the molecular compound structure description method include the SMILES (simplified molecular-input line-entry system) notation, the SMARTS (SMILES arbitrary target specification) notation, and the InChI (International Chemical Identifier) notation.
  • the ligand structure data generation unit 5 can use two-dimensional graph structure data, obtained with a molecular compound structure description method such as the SMILES notation, as graph convolution data (Graph Convolution data) for machine learning.
  • the ligand structure data generation unit 5 generates an electrostatic potential distribution for the ligand in addition to the three-dimensional structure data of the ligand.
  • the electrostatic potential distribution regarding the ligand can be determined according to the method described in, for example, Rocchia et al., Journal of Computational Chemistry, Vol. 23, No. 1, pp. 128-137.
  • the electrostatic potential distribution regarding the ligand may be the electrostatic potential distribution of the entire compound that interacts with the protein, or may be the electrostatic potential distribution regarding the partial region that interacts with the protein in the compound.
  • for each combination of a ligand binding site in a predetermined protein and a ligand, the protein interaction analysis apparatus 1 stores, in the data storage unit 6 and in association with one another, the surface shape data of the ligand binding site generated by the surface shape data generation unit 3, the electrostatic potential distribution data of the ligand binding site generated by the electrostatic potential distribution data generation unit 4, and the three-dimensional structure data of the ligand generated by the ligand structure data generation unit 5.
  • the data storage unit 6 stores, for a plurality of combinations of a ligand binding site in a predetermined protein and a ligand, ligand binding site surface property data including “surface shape data”, “electrostatic potential distribution data”, and “three-dimensional structure data on the ligand”. In addition to these data, the data storage unit 6 may store the electrostatic potential distribution data on the ligand in association with them, or may store side by side the three-dimensional structure information of a plurality of conformational isomers (rotamers) of the ligand.
  • the protein interaction analysis device 1 generates data related to protein interaction desired by the user by machine learning using a plurality of ligand binding site surface property data stored in the data storage unit 6 as teacher data.
  • the user inputs information related to the analysis target to the data input unit 7 in the protein interaction analyzer 1.
  • the information on the analysis target is information on a predetermined compound when a protein or ligand binding site that interacts with that compound is to be analyzed, and is information on a predetermined protein or its ligand binding site when a compound or ligand that interacts with that protein or ligand binding site is to be analyzed.
  • Information on the compound input to the data input unit 7 includes information on the three-dimensional structure or molecular formula and three-dimensional structure of the compound, information on the three-dimensional structure or molecular formula and three-dimensional structure of the partial region of the compound, and the like.
  • a candidate protein or candidate ligand binding site that interacts with the compound can be analyzed by a process described in detail later.
  • as information on the compound serving as the substrate, information on the three-dimensional structural formula, or the molecular formula and three-dimensional structure, of the substrate compound, and information on the three-dimensional structural formula, or the molecular formula and three-dimensional structure, of the region of the substrate compound on which the enzyme acts, is input to the data input unit 7.
  • the type of enzyme reaction from the substrate to the product and the name of the enzyme involved in the enzyme reaction may also be input to the data input unit 7.
  • the information regarding the protein or ligand binding site input to the data input unit 7 includes the amino acid sequence, atomic coordinates, and three-dimensional structure of the protein or ligand binding site. Based on the information on the protein or ligand binding site, a candidate compound (candidate ligand) that interacts with the protein or ligand binding site can be analyzed by a process described in detail later.
  • a compound (ligand compound) that interacts with a predetermined protein (for example, a receptor protein)
  • as information about the protein, the amino acid sequence of the protein, the three-dimensional structure data of the protein, the amino acid sequence of the ligand binding site, or the three-dimensional structure data of the ligand binding site is input to the data input unit 7.
  • based on the information on the analysis target input through the data input unit 7, data relating to protein interaction for the analysis target is generated, including the results of analysis by machine learning that uses, as teacher data, a plurality of sets of the ligand binding site surface property data stored in the data storage unit 6 and the three-dimensional structure data on the ligand interacting with that ligand binding site.
  • when the information on the analysis target input through the data input unit 7 is information on a compound, the calculation processing unit 8 generates candidate proteins or candidate ligand binding sites that may interact with the compound. More specifically, when the information on the analysis target is information on a compound that is a substrate in a predetermined enzyme reaction, the calculation processing unit 8 generates candidate enzymes that may use the compound as a substrate. At this time, the calculation processing unit 8 may generate the single candidate protein, candidate ligand binding site, or candidate enzyme most likely to interact with the compound, or may generate a group of candidate proteins, candidate ligand binding sites, or candidate enzymes likely to interact with the compound.
  • when the information on the analysis target is information on a protein or ligand binding site, the calculation processing unit 8 generates candidate compounds or candidate ligands that may interact with the protein or ligand binding site. At this time, the calculation processing unit 8 may generate the single candidate compound or candidate ligand most likely to interact with the protein or ligand binding site, or may generate a group of candidate compounds or candidate ligands likely to interact with the protein or ligand binding site.
  • the machine learning unit 10 in the calculation processing unit 8 performs analysis by machine learning using, as teacher data, a plurality of data sets of the above-described ligand binding site surface property data and the three-dimensional structure data on the ligand interacting with the ligand binding site.
  • the evaluation value calculation unit 11 in the calculation processing unit 8 calculates, for the analysis target input through the data input unit 7, an evaluation value indicating similarity to the protein or ligand included in the teacher data.
  • the list generation unit 12 in the calculation processing unit 8 generates a list that combines the result of machine learning performed by the machine learning unit 10 and the evaluation value calculated by the evaluation value calculation unit 11.
  • the teacher data for machine learning processed in the machine learning unit 10 is not particularly limited, but preferably includes, for example, an evaluation value for evaluating the interaction between the protein or ligand binding site and the ligand molecule. An example of such an evaluation value is a three-dimensional shape unevenness homology evaluation value that, based on the “ligand binding site surface property data” and the “three-dimensional structure data on the ligand”, gives a higher score the shorter the distance between the ligand binding site in the protein and the ligand molecule.
  • the three-dimensional shape unevenness homology evaluation value can be an n-dimensional vector, where the number n is common to every data set composed of “ligand binding site surface property data” and “three-dimensional structure data on the ligand”.
  • the n-dimensional vector indicates a three-dimensional unevenness homology evaluation value at n predetermined sites in the ligand molecule.
  • n predetermined sites can be arbitrarily defined for each ligand molecule.
  • the ordering of the elements of the n-dimensional vector can follow the numbering of carbon atoms under the IUPAC (International Union of Pure and Applied Chemistry) nomenclature, with one element for each non-hydrogen atom (carbon, nitrogen, oxygen, sulfur, selenium, and the like) constituting the ligand molecule. Thereby, it is possible to determine the three-dimensional shape unevenness homology evaluation value at each site in the three-dimensional configuration of a predetermined ligand molecule.
  • the number n in the three-dimensional shape unevenness homology evaluation value may be a value common to every data set composed of “ligand binding site surface property data” and “three-dimensional structure data regarding the ligand”, or may differ from data set to data set.
  • the teacher data for machine learning processed in the machine learning unit 10 is not particularly limited, and may include, for example, an evaluation value for evaluating the binding energy of electrostatic binding between a protein or ligand binding site and a ligand molecule.
  • as an example of this evaluation value, a binding energy evaluation value that gives a higher score according to the magnitude of the binding energy (enthalpy change) can be used.
  • the binding energy evaluation value can be an m-dimensional vector, where the number m is common to every data set composed of “electrostatic potential distribution data” and “three-dimensional structure data on the ligand”.
  • the m-dimensional vector indicates a binding energy evaluation value at m predetermined sites in the ligand molecule. These m predetermined sites can be arbitrarily defined for each ligand molecule as in the above-described n-dimensional vector.
  • the m number in the binding energy evaluation value may be a different value for each set of protein or ligand binding site and ligand molecule, or may be a common value.
  • the number n in the three-dimensional shape unevenness homology evaluation value may be a value common to every data set composed of “electrostatic potential distribution data” and “three-dimensional structure data on the ligand”, or may be a value that differs for each data set.
  • n and m can be arbitrary.
  • for example, when machine learning is performed using part of the above-described data sets, the values can be set so that the accuracy of the answers for the other data sets, which were not used for machine learning, increases.
  • the evaluation value calculation unit 11 calculates an evaluation value for each candidate protein or candidate ligand binding site that was extracted as a result of machine learning in the machine learning unit 10 and that may interact with the compound or ligand. This evaluation value indicates the similarity between the compound or ligand input through the data input unit 7 and the compound or ligand associated with the extracted candidate protein or candidate ligand binding site.
  • the evaluation value calculation unit 11 performs maximum matching between the compound or ligand input through the data input unit 7 and the compound or ligand associated with the extracted candidate protein or candidate ligand binding site, and can give a high evaluation value to a molecule having a high degree of matching.
  • this evaluation value can be defined so that it is higher when, for example, a higher proportion of the atoms contained in the input compound or ligand lie within a predetermined distance (for example, 1 Å) of the corresponding atoms in the “ligand structure data” being compared.
  • this evaluation value can also be defined so that it is higher when the types and positions of oxygen and nitrogen atoms, which readily cause local electrostatic bias, agree more closely when the input compound or ligand is compared with the “ligand structure data”. By defining the evaluation value as described above, it is possible to evaluate more accurately the structural similarity between the input compound or ligand and the compound or ligand associated with the candidate protein or candidate ligand binding site.
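An evaluation value of the kind described above can be sketched as a weighted fraction of matched atoms. This is an illustration under stated assumptions, not the patent's matching procedure: the function name `proximity_score` is hypothetical, both structures are assumed to be already superposed in a common coordinate frame, "corresponding atom" is taken to mean any reference atom of the same element within the cutoff, and the optional extra weight for oxygen and nitrogen is an invented way to emphasize those atoms.

```python
import math


def proximity_score(query_atoms, reference_atoms, cutoff=1.0, weights=None):
    """Weighted fraction of query atoms with a same-element reference atom within cutoff (Å).

    query_atoms, reference_atoms : lists of (element, x, y, z)
    weights : optional {element: weight}; elements not listed weigh 1.0
    """
    weights = weights or {}
    total = matched = 0.0
    for element, *q in query_atoms:
        w = weights.get(element, 1.0)
        total += w
        if any(element == ref_element and math.dist(q, r) <= cutoff
               for ref_element, *r in reference_atoms):
            matched += w
    return matched / total if total else 0.0


# Hypothetical input ligand and stored "ligand structure data".
query = [("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.0, 0.0), ("N", 3.0, 0.0, 0.0)]
reference = [("C", 0.3, 0.0, 0.0), ("O", 1.0, 0.5, 0.0), ("N", 5.0, 0.0, 0.0)]

plain = proximity_score(query, reference)                               # 2 of 3 atoms match
weighted = proximity_score(query, reference, weights={"O": 2.0, "N": 2.0})
print(plain, weighted)
```

With the weights above, the unmatched nitrogen lowers the score more than an unmatched carbon would.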
  • the evaluation value calculation unit 11 preferably calculates the evaluation value for the similarity, between the predetermined compound or ligand input as the analysis target through the data input unit 7 and the compound or ligand associated with the candidate protein or candidate ligand binding site extracted by the machine learning unit 10, of the structure of the region sufficiently close to the ligand binding site.
  • examples of the sufficiently close region include a region within 5 Å of the ligand binding site in a state where the compound or the ligand interacts with the ligand binding site.
  • with such an evaluation value, the similarity of the regions involved in the interaction can be evaluated between the predetermined compound or ligand input as the analysis target through the data input unit 7 and the compound or ligand associated with the candidate protein or candidate ligand binding site extracted by the machine learning unit 10.
  • when the type of enzyme reaction from the substrate to the product, or the name of the enzyme involved in the enzyme reaction, has been input, the evaluation value calculation unit 11 can give the extracted candidate protein or candidate ligand binding site an evaluation value indicating the degree of coincidence or similarity with the input enzyme reaction or enzyme name.
  • the evaluation value calculation unit 11 calculates evaluation values for the candidate compounds or candidate ligands that were extracted as a result of machine learning in the machine learning unit 10 and that may interact with the protein or ligand binding site. This evaluation value indicates the similarity between the protein or ligand binding site input through the data input unit 7 and the protein or ligand binding site associated with the extracted candidate compound or candidate ligand.
  • the evaluation value calculation unit 11 can calculate the degree of coincidence of amino acid sequences between the protein or ligand binding site input through the data input unit 7 and the protein or ligand binding site associated with the extracted candidate compound or candidate ligand, and give a high evaluation value to a candidate compound or candidate ligand associated with a protein or ligand binding site having a high degree of coincidence. Further, the evaluation value calculation unit 11 can calculate the three-dimensional structural similarity between the protein or ligand binding site input through the data input unit 7 and the protein or ligand binding site associated with the extracted candidate compound or candidate ligand, and give a high evaluation value to a candidate compound or candidate ligand associated with a protein or ligand binding site having a high degree of similarity.
  • the evaluation value calculation unit 11 can also calculate the similarity of the electrostatic potential distribution between the protein or ligand binding site input through the data input unit 7 and the protein or ligand binding site associated with the extracted candidate compound or candidate ligand, and give a high evaluation value to a candidate compound or candidate ligand associated with a protein or ligand binding site having a high degree of similarity.
  • by defining the evaluation value as described above, the structural similarity and the electrostatic potential distribution similarity between the input protein or ligand binding site and the protein or ligand binding site associated with the candidate compound or candidate ligand can be evaluated more accurately.
  • the list generation unit 12 generates a list in which the data regarding the protein interaction generated by the machine learning unit 10 and the evaluation value calculated by the evaluation value calculation unit 11 are integrated.
  • a list that associates candidate protein or candidate ligand binding sites that may interact with the compound and evaluation values calculated for them is generated.
  • a list is generated that associates candidate compounds or candidate ligands that may interact with the protein or ligand binding site with the evaluation values calculated for them.
  • in addition to the machine learning unit 10, the evaluation value calculation unit 11, and the list generation unit 12, the calculation processing unit 8 may include a protein-ligand suitability score calculation unit 13.
  • when the analysis target input through the data input unit 7 is a predetermined compound or ligand, the protein-ligand suitability score calculation unit 13 calculates a suitability score for the binding stability between the compound or ligand to be analyzed and each candidate protein or candidate ligand binding site that was extracted by the machine learning unit 10 and may interact with it.
  • when the analysis target input through the data input unit 7 is a predetermined protein or ligand binding site, the protein-ligand suitability score calculation unit 13 calculates a suitability score for the binding stability between the protein or ligand binding site to be analyzed and each candidate compound or candidate ligand that was extracted by the machine learning unit 10 and may interact with it.
  • the suitability score can be a value calculated based on the binding enthalpy between the ligand and the ligand binding site.
  • the ligand alone is in a state where water molecules are coordinated, and the potential energy 1 of the ligand alone is calculated from the binding enthalpy with the water molecule.
  • the potential energy 2 is calculated by calculating the enthalpy amount in a state where the ligand binding site and the ligand are bound (ionic bond, hydrophobic bond, etc.).
  • when the difference between the potential energy 2 and the potential energy 1 is positive, it means that the ligand binds easily to the ligand binding site. Therefore, by calculating the suitability score in consideration of the difference between the potential energy 2 and the potential energy 1, the above-described binding stability can be quantitatively evaluated.
  • the list generation unit 12 generates a list integrating the data on protein interaction generated by the machine learning unit 10, the evaluation value calculated by the evaluation value calculation unit 11, and the suitability score calculated by the protein-ligand suitability score calculation unit 13 as described above.
  • a list is generated that associates candidate compounds or candidate ligands that may interact with the protein or ligand binding site with the evaluation values and suitability scores calculated for them.
  • the list generation unit 12 illustrated in FIG. 2 or 3 generates a list of the candidate proteins or candidate ligand binding sites, or the candidate compounds or candidate ligands, extracted by the machine learning unit 10. At this time, the list generation unit 12 may further narrow down the candidate proteins, candidate ligand binding sites, candidate compounds, or candidate ligands included in the list based on the above-described evaluation value and/or suitability score. That is, the list generation unit 12 may remove from the list those candidates extracted by the machine learning unit 10 whose evaluation value and/or suitability score is equal to or less than a predetermined value.
  • the machine learning unit 10 extracts candidate enzymes that may use the compound as a substrate.
  • the list generation unit 12 may exclude, from the list, candidate enzymes extracted by the machine learning unit 10 whose evaluation value and / or suitability score are equal to or less than a predetermined value. Further, in this case, the list generation unit 12 may exclude those candidate enzymes extracted by the machine learning unit 10 that are not related to the enzyme reaction input by the user from the list.
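The list generation and narrowing-down step can be sketched as follows. The `Candidate` record, the function name `build_list`, the sample values, and the threshold values are all hypothetical. The sketch removes a candidate when either score is at or below its threshold, which is one reading of "and/or" in the text, and sorts the remainder by evaluation value, which the source does not specify.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # candidate protein, binding site, compound, ligand, or enzyme
    evaluation: float    # evaluation value (similarity)
    suitability: float   # suitability score (binding stability)


def build_list(candidates, min_evaluation, min_suitability):
    """Drop candidates at or below either threshold, then sort by evaluation value."""
    kept = [c for c in candidates
            if c.evaluation > min_evaluation and c.suitability > min_suitability]
    return sorted(kept, key=lambda c: c.evaluation, reverse=True)


# Hypothetical candidates extracted by the machine learning step.
extracted = [
    Candidate("enzyme A", evaluation=0.62, suitability=0.80),
    Candidate("enzyme B", evaluation=0.91, suitability=0.40),
    Candidate("enzyme C", evaluation=0.30, suitability=0.90),
    Candidate("enzyme D", evaluation=0.75, suitability=0.10),
]
result = build_list(extracted, min_evaluation=0.5, min_suitability=0.3)
print([c.name for c in result])
```

A further filter on the enzyme reaction entered by the user could be added in the same list comprehension.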
  • the output unit 9 of the protein interaction analyzer 1 outputs the list generated by the list generation unit 12.
  • the list output by the output unit 9 may be the list generated by the list generation unit 12 as it is, or may be a list generated by adding information to the list generated by the list generation unit 12.
  • a list including candidate proteins or candidate ligand binding sites that may interact with the compound is generated in the list generation unit 12.
  • function information on the candidate proteins included in this list, on proteins containing the candidate ligand binding sites, and the like is added.
  • a process for analyzing engineering information based on the evaluation value described above may be performed before the output unit 9 outputs the list. For example, when a predetermined compound is input as the analysis target through the data input unit 7, the factor inhibiting the interaction between the candidate protein or candidate ligand binding site and the input compound is identified based on the evaluation value, and engineering information that facilitates the interaction with the compound is analyzed.
  • a region that contributes to a decrease in the evaluation value in the input compound is identified from a structural comparison between the input compound and the compound associated with the candidate protein or candidate ligand binding site.
  • the position where the region contributing to the decrease in the evaluation value of the compound interacts is specified.
  • mutations or modifications to be introduced into the candidate protein or candidate ligand binding site are identified so that it has a three-dimensional structure or electrostatic potential distribution with which the input compound can easily interact. The mutation or modification identified in this way can be generated as engineering information for the candidate protein or candidate ligand binding site.
  • a region that contributes to a decrease in the evaluation value in the input protein is specified from a structural comparison between the input protein and the protein associated with the candidate compound or candidate ligand.
  • a position where a region contributing to a decrease in the evaluation value of the protein interacts is specified.
  • structural modifications of the candidate compound or candidate ligand (removal, modification, or addition of functional groups) are identified so that it has a three-dimensional structure or electrostatic potential distribution with which the input protein can easily interact.
  • the structural modification identified in this way can be generated as engineering information for the candidate compound or candidate ligand.
  • the protein interaction analysis apparatus 1 includes an arithmetic device such as a CPU, a storage device such as a hard disk, a RAM, and a ROM, an input device such as a keyboard and a pointing device, and an output device such as a display and a printer.
  • the protein interaction analysis apparatus 1 may include a communication device for connecting an external storage device such as the external storage unit 2 via a network such as the Internet. In the protein interaction analyzer 1, this communication device functions as an input device for various data and an output device to the outside.
  • a storage device such as a hard disk, a RAM, and a ROM stores a program that causes the computer device to perform the various processes described above. That is, the protein interaction analyzer 1 can be realized by executing the program stored in the storage device with the hardware described above.
  • the protein interaction analysis apparatus 1 may be configured by a single computer apparatus, or may be configured by a plurality of computer apparatuses that are physically different but can communicate with each other.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Physics & Mathematics (AREA)
  • Immunology (AREA)
  • General Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • Biomedical Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Hematology (AREA)
  • Urology & Nephrology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Food Science & Technology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Cell Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Sustainable Development (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present invention accurately analyzes specific interactions between proteins and ligands. The present invention generates data pertaining to protein interactions related to an analysis target by machine learning that uses, as training data, surface shape data of a prescribed ligand binding site, electrostatic potential distribution data of the ligand binding site, and steric structure data pertaining to ligands that interact with the ligand binding site.

Description

Protein interaction analysis device and analysis method
The present invention relates to a protein interaction analysis device and analysis method used for analyzing interactions between ligands and proteins.
Enzymes exhibit substrate specificity: they recognize substrates having a specific structure. Receptors bind specifically to physiologically active substances having a specific structure and thereby exert their effects (for example, signal transduction activity or transcription-promoting activity). Thus, proteins such as enzymes and receptors function through specific binding to ligands such as substrates and physiologically active substances.
As a result of protein research, nucleotide sequence information on the genes encoding proteins, amino acid sequence information, and three-dimensional structure information accumulate daily. For sequence information, databases such as GenBank of the NCBI (National Center for Biotechnology Information), the DNA Data Bank of Japan (DDBJ), and EMBL have been established. For information on the three-dimensional structures of proteins, the Protein Data Bank, which includes Protein Data Bank Japan (PDBj), has been established. In addition, KEGG has been established as a database that integrates information on intermolecular networks such as metabolism and signal transduction.
Among the various efforts that use such data, "synthetic biotechnology", which uses computational science to predict and artificially design metabolic pathways and gene sequences that natural microorganisms do not possess, is attracting attention. In synthetic biotechnology, for example, to produce a target substance, a metabolic pathway that biosynthesizes the final target substance from a starting material is constructed using the data described above, and a host organism is created by techniques such as genome editing. Here, a metabolic pathway can be designed by combining a plurality of enzyme reactions, each consisting of a substrate and an enzyme.
In the field of drug discovery, methods have also been proposed that use the data described above to screen lead compounds against a target protein at high throughput. In such a method, for example, the basic structure of a lead compound that can interact with the ligand binding site is designed based on three-dimensional structure data of the ligand binding site in the target protein.
As described above, knowledge about specific interactions between proteins and ligands (between a substrate and an enzyme, or between a target protein and a lead compound) is highly useful and valuable data in the fields of synthetic biotechnology and drug discovery.
Patent Document 1 discloses a method for identifying the function of a protein of unknown function in consideration of protein-protein interactions. In the method disclosed in Patent Document 1, training data are obtained from the three-dimensional structures of a plurality of receptors of known function and the three-dimensional structures of a plurality of ligands. For each receptor, the training data include a plurality of shape complementarity evaluation values obtained when the receptor is docked with each of the plurality of ligands, and charge information including the difference between the total charge of the whole receptor and the total charge of its surface. In this method, the shape complementarity evaluation values obtained when the protein of unknown function is docked with each of the plurality of ligands, together with the charge information of that protein, are input, and the function of the protein is identified by learning from the training data.
Meanwhile, Non-Patent Document 1 analyzes the conformation formed between the adenylation domain (AdD) and the oligonucleotide binding domain (OBD) of DNA ligase based on the electrostatic potential distributions of these domains, and examines its relationship to the enzymatic reaction. Non-Patent Document 2 discloses that mutations were introduced into these domains by site-directed mutagenesis to identify amino acid residues deeply involved in the ligation reaction. From Non-Patent Documents 1 and 2, it can be understood that conformational analysis based on the electrostatic potential distribution on the surface of a protein makes functional analysis of the protein possible.
Patent Document 1: Japanese Patent No. 5170630
Non-Patent Document 1: Tanabe M., Ishino S., Yohda M., Morikawa K., Ishino Y., Nishida H. (2012) Structure-based mutational study of an archaeal DNA ligase towards improvement of ligation activity. ChemBioChem 13, 2575-2582.
Non-Patent Document 2: Tanabe M., Ishino Y., Nishida H. (2015) From structure-function analyses to protein engineering for practical applications of DNA ligase. Archaea ID 267570.
Problems to be Solved by the Invention
Even when knowledge about specific interactions between proteins and ligands is derived from the new protein-related information that accumulates daily, as described above, problems remain at present: in synthetic biotechnology the desired substance production is not achieved, and in drug discovery lead compounds with high binding activity cannot be designed.
In view of these circumstances, an object of the present invention is to provide a protein interaction analysis device and analysis method that output protein interaction related data with which specific interactions between proteins and ligands can be analyzed accurately.
As a result of intensive studies to achieve this object, the present inventors found that specific interactions between ligands and proteins can be analyzed accurately by using ligand binding site related data that are derived from protein-related information and include at least three-dimensional structure data on the ligand binding site and the electrostatic potential distribution at that site, and thereby completed the present invention.
The present invention includes the following.
(1) A protein interaction analysis device comprising: a data input unit for inputting information on an analysis target;
a data storage unit that stores, in association with one another, surface shape data of a ligand binding site of a given protein, electrostatic potential distribution data of the ligand binding site, and three-dimensional structure data on a ligand that interacts with the ligand binding site, these data being generated from amino acid sequence data and three-dimensional structure data of proteins stored in an external storage unit and from three-dimensional structure data of ligands that specifically interact with those proteins; and
a calculation processing unit that generates data on protein interactions related to the analysis target input through the data input unit, by machine learning that uses as training data the surface shape data of a given ligand binding site, the electrostatic potential distribution data of the ligand binding site, and the three-dimensional structure data on the ligand that interacts with the ligand binding site, all stored in the data storage unit.
(2) The protein interaction analysis device according to (1), wherein the information on the analysis target is information on the structure of a ligand, and
the calculation processing unit generates data on a protein or ligand binding site that interacts with the ligand.
(3) The protein interaction analysis device according to (1), wherein the information on the analysis target is information on the structure of a protein or ligand binding site, and
the calculation processing unit generates data on a compound or ligand that interacts with the protein or ligand binding site.
(4) The protein interaction analysis device according to (1), wherein the calculation processing unit includes an evaluation value calculation unit that calculates, for the data on protein interactions generated by machine learning, an evaluation value indicating the similarity between the analysis target input through the data input unit and the analysis target included in the generated data.
(5) The protein interaction analysis device according to (1), wherein the calculation processing unit includes a protein-ligand compatibility score calculation unit that calculates, for the data on protein interactions generated by machine learning, a compatibility score that quantitatively indicates the binding stability when the analysis target input through the data input unit interacts.
(6) The protein interaction analysis device according to (1), wherein the data storage unit stores surface shape data generated by calculating the center coordinates of the ligand binding site from the atomic coordinates of the atoms constituting the ligand binding site, setting a three-dimensional grid space that contains the atoms within a predetermined distance of the center coordinates, and generating the surface shape data based on the three-dimensional grid space.
(7) The protein interaction analysis device according to (6), wherein the three-dimensional grid space has a plurality of grid points defined by a grid set at predetermined intervals, and is data in which a specific character is assigned to the grid point nearest to each atom within the predetermined distance of the center coordinates and another character is assigned to the grid points to which the specific character was not assigned.
(8) The protein interaction analysis device according to (6), wherein the atoms within the predetermined distance of the center coordinates belong to a plurality of non-hydrogen atomic species.
(9) The protein interaction analysis device according to (6), wherein the data storage unit stores electrostatic potential distribution data calculated for the grid points of the three-dimensional grid space.
(10) The protein interaction analysis device according to (6), wherein the data storage unit stores positive electrostatic potential distribution data consisting of the positive values calculated for the grid points of the three-dimensional grid space, and negative electrostatic potential distribution data consisting of the negative values calculated for the grid points of the three-dimensional grid space.
(11) A protein interaction analysis method comprising: a step of inputting information on an analysis target through an input device;
a step in which an arithmetic device generates, from amino acid sequence data and three-dimensional structure data of proteins stored in an external storage unit and from three-dimensional structure data of ligands that specifically interact with those proteins, surface shape data of a ligand binding site of a given protein, electrostatic potential distribution data of the ligand binding site, and three-dimensional structure data on a ligand that interacts with the ligand binding site, and stores the surface shape data, the electrostatic potential distribution data, and the three-dimensional structure data on the ligand in a storage device in association with one another; and
a step in which the arithmetic device generates data on protein interactions related to the analysis target input through the input device, by machine learning that uses as training data the surface shape data of a given ligand binding site, the electrostatic potential distribution data of the ligand binding site, and the three-dimensional structure data on the ligand that interacts with the ligand binding site, all stored in the storage device.
(12) The protein interaction analysis method according to (11), wherein the information on the analysis target is information on the structure of a ligand, and
the arithmetic device generates data on a protein or ligand binding site that interacts with the ligand.
(13) The protein interaction analysis method according to (11), wherein the information on the analysis target is information on the structure of a protein or ligand binding site, and
the arithmetic device generates data on a compound or ligand that interacts with the protein or ligand binding site.
(14) The protein interaction analysis method according to (11), comprising a step in which the arithmetic device calculates, for the data on protein interactions generated by machine learning, an evaluation value indicating the similarity between the analysis target input through the input device and the analysis target included in the generated data.
(15) The protein interaction analysis method according to (11), comprising a step in which the arithmetic device calculates, for the data on protein interactions generated by machine learning, a compatibility score that quantitatively indicates the binding stability when the analysis target input through the input device interacts.
(16) The protein interaction analysis method according to (11), wherein the arithmetic device calculates the center coordinates of the ligand binding site from the atomic coordinates of the atoms constituting the ligand binding site, sets a three-dimensional grid space that contains the atoms within a predetermined distance of the center coordinates, and stores surface shape data generated based on the three-dimensional grid space in the data storage unit.
(17) The protein interaction analysis method according to (11), wherein the three-dimensional grid space has a plurality of grid points defined by a grid set at predetermined intervals, and is data in which a specific character is assigned to the grid point nearest to each atom within the predetermined distance of the center coordinates and another character is assigned to the grid points to which the specific character was not assigned.
(18) The protein interaction analysis method according to (11), wherein the atoms within the predetermined distance of the center coordinates belong to a plurality of non-hydrogen atomic species.
(19) The protein interaction analysis method according to (11), wherein the arithmetic device stores electrostatic potential distribution data calculated for the grid points of the three-dimensional grid space in the data storage unit.
(20) The protein interaction analysis method according to (11), wherein the arithmetic device stores, in the data storage unit, positive electrostatic potential distribution data consisting of the positive values calculated for the grid points of the three-dimensional grid space, and negative electrostatic potential distribution data consisting of the negative values calculated for the grid points of the three-dimensional grid space.
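The grid construction recited in items (6) to (8) and (16) to (18) can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices here are assumptions made only for illustration: the center is taken as the arithmetic mean of the atom coordinates, the cutoff of 10 Å and grid interval of 1 Å stand in for the "predetermined" values, the element symbol serves as the "specific character", and "." serves as the other character.

```python
from math import ceil, dist, floor

# Hypothetical defaults; the text above only says "predetermined".
RADIUS = 10.0   # cutoff distance from the site center, in angstroms
SPACING = 1.0   # grid interval, in angstroms
EMPTY = "."     # character for grid points to which no atom is assigned


def site_center(coords):
    """Arithmetic mean of the binding-site atom coordinates (an assumed definition)."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def voxelize(atoms, center, radius=RADIUS, spacing=SPACING):
    """Assign each non-hydrogen atom within `radius` of `center` to its nearest grid point.

    `atoms` is a list of (element, (x, y, z)) pairs. The result is a cubic grid
    stored as nested lists indexed [ix][iy][iz]; each cell holds an element
    symbol or EMPTY.
    """
    half = ceil(radius / spacing)            # grid points on each side of the center
    size = 2 * half + 1
    grid = [[[EMPTY] * size for _ in range(size)] for _ in range(size)]
    for element, xyz in atoms:
        if element == "H" or dist(xyz, center) > radius:
            continue                         # skip hydrogens and atoms outside the cutoff
        # Round the offset from the center to a whole number of grid steps.
        ix, iy, iz = (floor((xyz[i] - center[i]) / spacing + 0.5) + half for i in range(3))
        grid[ix][iy][iz] = element
    return grid
```

If two atoms round to the same grid point, the later one overwrites the earlier one in this sketch; a finer grid interval reduces such collisions.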
This specification incorporates the disclosure of Japanese Patent Application No. 2018-108362, on which the priority of the present application is based.
The protein interaction analysis device and analysis method according to the present invention make it possible to analyze specific interactions between proteins and ligands accurately. For example, a protein or ligand that is highly likely to interact specifically with a ligand or protein specified by the user can be analyzed with high accuracy by machine learning.
A block diagram showing an example of a protein interaction analysis device to which the present invention is applied.
A block diagram showing an example of the calculation processing unit in a protein interaction analysis device to which the present invention is applied.
A conceptual diagram of the method for extracting the ligand binding portion and the method for generating the voxels used as learning data in a protein interaction analysis device to which the present invention is applied.
A conceptual diagram of placing the coordinates and electrostatic potential value of each atom of a protein at the nearest grid point in the voxel in a protein interaction analysis device to which the present invention is applied.
A block diagram showing another example of the calculation processing unit in a protein interaction analysis device to which the present invention is applied.
Hereinafter, the present invention will be described in detail with reference to the drawings.
A protein interaction analysis device to which the present invention is applied generates, from amino acid sequence data, three-dimensional structure data, and related data on proteins, characteristic data usable for three-dimensional structure analysis of the site in the protein under analysis that interacts with a ligand (the ligand binding site). These data are hereinafter called ligand binding site surface property data. The device then analyzes interactions between ligands and proteins through machine learning that uses these data. The ligand binding site surface property data combine the surface shape data of the ligand binding site with the electrostatic potential distribution data of that site.
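As a rough illustration of how surface shape data and electrostatic potential distribution data might be combined into one surface property record, the following Python sketch splits signed potential values into the positive and negative distributions mentioned in items (10) and (20). The dictionary layout, the function names, and the zero fill for grid points outside a channel are all assumptions made for illustration; none of them is specified in the text.

```python
def split_potential(potential):
    """Split signed per-grid-point potentials into positive-only and negative-only maps.

    `potential` maps a grid index (ix, iy, iz) to a signed value. Grid points
    that do not belong to a channel are filled with 0.0 (an assumed convention).
    """
    positive = {idx: (v if v > 0 else 0.0) for idx, v in potential.items()}
    negative = {idx: (v if v < 0 else 0.0) for idx, v in potential.items()}
    return positive, negative


def surface_property_record(shape, potential):
    """Bundle the shape data and both potential channels for one binding site."""
    positive, negative = split_potential(potential)
    return {"shape": shape, "potential_pos": positive, "potential_neg": negative}
```

Keeping the two signs in separate channels lets a learning model treat positively and negatively charged regions as distinct features rather than as one signed quantity.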
As an example, the protein interaction analysis device 1 shown in FIG. 1 is connected to an external storage unit 2 that stores amino acid sequences and three-dimensional structure data of proteins together with data on ligands for those proteins. The device includes a surface shape data generation unit 3 that generates surface shape data of the ligand binding site of a given protein, an electrostatic potential distribution data generation unit 4 that generates the electrostatic potential distribution of the ligand binding site, and a ligand three-dimensional structure data generation unit 5 that generates three-dimensional structure data on ligands for the protein. The device also includes a data storage unit 6 that stores, as training data, the surface shape data and electrostatic potential distribution data for a given ligand binding site generated by the surface shape data generation unit 3 and the electrostatic potential distribution data generation unit 4 (the ligand binding site surface property data), together with the three-dimensional structure data on the ligand that interacts with that ligand binding site. The device further includes a data input unit 7 through which the user inputs the data to be analyzed.
In addition, the protein interaction analysis device 1 includes a calculation processing unit 8 that uses the training data stored in the data storage unit 6 to generate, by machine learning, data on protein interactions for the analysis target input through the data input unit 7, and an output unit 9 that outputs the results calculated by the calculation processing unit 8.
As described in detail later, the calculation processing unit 8 analyzes the interaction between a ligand and a protein as specified by the user. As an example, as shown in FIG. 2, the calculation processing unit 8 includes a machine learning unit 10 that performs machine learning using the training data stored in the data storage unit 6; an evaluation value calculation unit 11 that calculates, for the analysis target input through the data input unit 7, an evaluation value indicating its similarity to a protein or ligand included in the training data; and a list generation unit 12 that generates a list combining the machine learning results from the machine learning unit 10 with the evaluation values calculated by the evaluation value calculation unit 11.
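The division of labor among the machine learning unit 10, the evaluation value calculation unit 11, and the list generation unit 12 can be sketched in Python as follows. This passage does not state how the evaluation value is computed, so the sketch substitutes a Tanimoto (Jaccard) coefficient over sets of occupied grid points purely as a stand-in. The `Candidate` class, its fields, and the sort order are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # identifier of a candidate protein or ligand
    model_score: float   # output of the machine learning step (hypothetical scale)
    occupied: frozenset  # occupied grid indices of the associated training entry


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of occupied grid points."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_list(query_occupied, candidates):
    """Return rows of (name, model_score, evaluation_value), best model score first."""
    rows = [
        (c.name, c.model_score, tanimoto(query_occupied, c.occupied))
        for c in candidates
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Reporting the similarity next to the model score lets a reader of the list judge how close each candidate's training entry is to the input, which is one plausible reason for pairing the two values.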
The protein interaction analysis device 1 shown in FIG. 1 is configured to connect to a single external storage unit 2 that stores the data described above. Although not illustrated, the protein interaction analysis device 1 may instead connect to a plurality of external storage units among which these data are distributed. For example, it may connect separately to an external storage unit that stores amino acid sequences and three-dimensional structure data of proteins and to an external storage unit that stores data on ligands for proteins.
The data stored in the external storage unit 2 are, for a given protein, its amino acid sequence, its three-dimensional structure data, and data on its ligands. Here, "ligand" broadly means a substance that specifically interacts with a protein, such as a substrate for an enzyme, a low-molecular-weight compound that interacts with a receptor protein, a coenzyme, or a regulatory factor. The term is sometimes interpreted narrowly as a substance that binds to a receptor on the cell membrane or to an intracellular receptor. In this specification, however, "ligand" is used in the broad sense and covers substances that specifically interact with proteins, including enzyme substrates, coenzymes, regulatory factors, and substances that bind to receptors. Accordingly, a ligand may be either a low-molecular-weight compound or a high-molecular-weight compound, and may also refer to a partial region of a compound. That is, the molecular structure and atomic coordinates of a ligand may be those of the entire compound that interacts with the protein, or those of at least the partial region of the compound that interacts with the protein.
"Protein" means a polymer compound having an amino acid sequence as its primary structure, and may be a monomer, a homomultimer, or a heteromultimer. A protein may carry post-translational chemical modifications such as glycosylation, addition of functional groups, or phosphorylation. Accordingly, the three-dimensional structure data based on atomic coordinates at the ligand binding site may be based on atomic coordinates obtained for a protein without such post-translational modifications, or on atomic coordinates obtained for a protein with them. The data may also be based on atomic coordinates that were obtained for a protein without post-translational modifications and then corrected to represent the atomic coordinates of a protein carrying a given chemical modification.
Atomic coordinates mean data indicating the coordinates of the atoms that constitute a protein. Atomic coordinates can be obtained for a wide variety of proteins by X-ray crystallography, which mainly uses protein single crystals, by nuclear magnetic resonance (NMR), which is applied to protein solutions, or by both. Atomic coordinates can also be obtained by an NMR technique that uses stable isotopes, known as stereo-array isotope labeling.
The format of the atomic coordinates of a protein is not particularly limited; for example, each atom constituting the protein can be represented as a combination of an x coordinate, a y coordinate, and a z coordinate. The unit of each coordinate can be, for example, [Å].
An example of the external storage unit 2 storing the above data is the Protein Data Bank (hereinafter, PDB), including Protein Data Bank Japan (PDBj). That is, the protein interaction analysis device 1 can be configured to connect to the PDB as the external storage unit 2. In the PDB, atomic coordinates are given, for example, as one line of data per atom serial number under a given record name (ATOM for standard amino acids). As an example, the data for a given atom serial number can include the atom name (main-chain amide nitrogen: N, α carbon: CA, β carbon: CB), the residue name (three-letter amino acid code), the chain ID, the residue number, the x, y, and z coordinates of the atom [Å], the occupancy (the fraction of the analyzed sample, for example a crystal, in which the atom is present at that position; usually 1.00), and the temperature factor B [Å²] (when the structure was determined by X-ray crystallography).
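As a non-limiting illustration of the per-atom record described above, the following sketch reads one fixed-width ATOM line of a legacy PDB-format file. The column positions follow the published PDB file format; the names `AtomRecord` and `parse_atom_line` and the sample coordinate values are assumptions introduced for this example and do not appear in the specification.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """Fields of one ATOM/HETATM line (hypothetical container for this sketch)."""
    record: str
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_line(line: str) -> AtomRecord:
    """Slice a fixed-width legacy PDB coordinate line into typed fields."""
    return AtomRecord(
        record=line[0:6].strip(),      # columns 1-6: record name
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name (N, CA, CB, ...)
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain_id=line[21],             # column 22: chain ID
        res_seq=int(line[22:26]),      # columns 23-26: residue number
        x=float(line[30:38]),          # columns 31-38: x [Å]
        y=float(line[38:46]),          # columns 39-46: y [Å]
        z=float(line[46:54]),          # columns 47-54: z [Å]
        occupancy=float(line[54:60]),  # columns 55-60: occupancy
        b_factor=float(line[60:66]),   # columns 61-66: temperature factor [Å^2]
    )


# Made-up sample line, laid out in the fixed-width columns listed above.
example = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_line(example)
```

The same slicing applies to HETATM lines, which share the ATOM column layout.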
In addition to the atomic coordinate data described above, the data stored in the PDB include data on the type of protein molecule, its registered name, and its accession number (HEADER line), the title under which the entry is published in the PDB (TITLE line), information on the protein molecule (COMPND line), information on the host of the protein (SOURCE line), information on the experiment used for the structure determination (REMARK lines), amino acid sequence information (SEQRES lines), information on the amino acids that form α helices (HELIX lines), and positional information on intramolecular disulfide bonds (SSBOND lines).
In particular, the data stored in the PDB include data on the molecular structure and atomic coordinates of the ligand that interacts with the ligand binding site described above. Specifically, the PDB includes information on the molecular structure of the ligand (HETATM lines) and information on the bonds of the ligand (CONECT lines). Among the data stored in the PDB, the HETATM lines contain information identifying the atoms that constitute the ligand and the coordinates of those atoms. When the ligand has conformers, the HETATM lines also contain information indicating the type of conformer.
Using the data stored in the external storage unit 2, the protein interaction analysis device 1 generates, in the surface shape data generation unit 3, surface shape data for a given protein that includes a ligand binding site. The surface shape data generation unit 3 may generate surface shape data for the entire protein, or for a partial region of the protein that includes the ligand binding site. In particular, the surface shape data generation unit 3 preferably generates surface shape data for a partial region that includes the whole of the ligand binding site. The surface shape data can be data in which the surface of the protein to which the ligand binds is taken as the xy plane and the unevenness over that xy plane is expressed as values in the z-axis direction.
Here, the xy plane of the surface shape data can be the plane on which the ligand occupies the largest area when the ligand is specifically interacting with the protein. That is, the plane onto which the ligand projects with the largest area in the interacting state is preferably used as the xy plane of the surface shape data. Alternatively, the xy plane of the surface shape data can be the plane on which the ligand binding site of the protein occupies the largest area. That is, in the three-dimensional structure of the protein, the plane on which the ligand binding site appears largest is preferably used as the xy plane of the surface shape data.
The z-axis values over the xy plane that make up the surface shape data can be obtained, using the data stored in the external storage unit 2, as a function defined over the entire xy plane (z = f(x, y)). Alternatively, the xy plane can be treated as mesh data and the z-axis values obtained as discrete values at the mesh points (intersections). When the xy plane is treated as mesh data, the mesh spacing can be, for example, 0.05 to 1.0 Å, preferably 0.1 to 0.5 Å, and more preferably 0.2 Å. Furthermore, with the xy plane treated as mesh data in this way, the z-axis value can also be obtained as the average of the z-axis values calculated over the region inside each individual mesh cell.
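The mesh-cell averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions: the surface is assumed to be already available as a list of sampled (x, y, z) points in the chosen xy frame, and the function name `height_map` and the sparse dictionary representation are hypothetical choices for this example only.

```python
import math


def height_map(points, spacing=0.2):
    """Average the z values of surface points falling in each square xy mesh cell.

    points  : iterable of (x, y, z) tuples in Å, already expressed in the xy frame
    spacing : mesh spacing in Å (0.2 Å is the most preferred value in the text)
    returns : {(i, j): mean z} for every mesh cell that contains at least one point
    """
    sums = {}
    for x, y, z in points:
        # Index of the mesh cell containing (x, y).
        key = (math.floor(x / spacing), math.floor(y / spacing))
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + z, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample: two points share cell (0, 0), one point lies in cell (1, 0).
sample = [(0.05, 0.05, 1.0), (0.10, 0.10, 3.0), (0.30, 0.05, 2.0)]
cells = height_map(sample)
```

Taking the value at each mesh intersection instead of the cell average, which the text also permits, would replace the averaging step with a lookup of the surface height at the grid point.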
The surface shape data generation unit 3 may also generate a plurality of sets of surface shape data for a single ligand binding site. As the plurality of sets, it may generate a pair of surface shape data sets that form a so-called stereogram. The surface shape data generation unit 3 may take the xy plane on which the ligand occupies the largest area in the state of specific interaction with the protein, set a plurality of planes by tilting that plane about the x axis or the y axis by a given angle, for example within ±0.5 to 10 degrees and preferably within ±1 to 5 degrees, and generate surface shape data for each of these planes. More specifically, the surface shape data generation unit 3 can generate surface shape data for the xy plane on which the ligand occupies the largest area and for the two planes obtained by tilting that plane by ±5 degrees about its y axis (three sets of surface shape data in total). Alternatively, it can generate surface shape data for the xy plane on which the ligand binding site of the protein occupies the largest area and for the two planes obtained by tilting that plane by ±5 degrees about its y axis (three sets of surface shape data in total).
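One way to obtain the three viewing frames mentioned above (the reference plane plus the planes tilted by ±5 degrees about the y axis) is to rotate the coordinates rather than the plane. The sketch below is illustrative only; `rotate_about_y` and `tilted_views` are hypothetical names, and the sign convention of the rotation is an assumption.

```python
import math


def rotate_about_y(point, degrees):
    """Rotate one (x, y, z) point about the y axis by the given angle in degrees."""
    x, y, z = point
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return (c * x + s * z, y, -s * x + c * z)


def tilted_views(points, angle=5.0):
    """Return the coordinates seen from the reference frame and from two tilted frames.

    The result maps each tilt angle (-angle, 0, +angle) to the rotated point list,
    so that surface shape data can then be generated once per entry.
    """
    return {a: [rotate_about_y(p, a) for p in points] for a in (-angle, 0.0, angle)}


views = tilted_views([(1.0, 2.0, 3.0)])
```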
Furthermore, the surface shape data generation unit 3 can generate an xy plane on which the ligand occupies the largest area in the state of specific interaction with the protein and whose side length is, for example, in the range of 30 to 100 Å, preferably 40 to 80 Å, and more preferably 45 to 60 Å. Alternatively, the surface shape data generation unit 3 can generate an xy plane on which the ligand binding site of the protein occupies the largest area and whose side length is, for example, in the range of 30 to 100 Å, preferably 40 to 80 Å, and more preferably 45 to 60 Å.
Meanwhile, using the data stored in the external storage unit 2, the protein interaction analysis device 1 generates, in the electrostatic potential distribution data generation unit 4, electrostatic potential distribution data for the ligand binding site of the given protein. The electrostatic potential distribution data generation unit 4 can calculate the electrostatic potential distribution of the ligand binding site by applying a known method for calculating the surface charge of a protein.
The electrostatic potential distribution data generation unit 4 can use, as appropriate, any conventionally known method for calculating the electrostatic potential (surface charge) of a protein. Here, the electrostatic potential (surface charge) can be defined as the Coulomb energy experienced by a positive charge of unit magnitude placed at an arbitrary point.
To calculate the electrostatic potential (surface charge) of a protein, specifically, the non-hydrogen atom species such as carbon (C), oxygen (O), nitrogen (N), and sulfur (S) and their coordinates are first read from the amino acid sequence and atomic coordinate data of the protein stored in the external storage unit 2, such as the PDB. Next, the hydrogen atoms bonded to each non-hydrogen atom that was read, and their coordinates, are calculated. By using these coordinates together, information on all atoms located at the surface within the molecule, that is, information on all electrons, can be obtained. Using this information and assuming a constant dielectric constant, the electrostatic charge at any position inside or outside the protein molecule can then be calculated. In particular, the positive charge calculated within the surface of the protein can be taken as the electrostatic potential (surface charge) of the protein. The electrostatic potential distribution can also be obtained by normalizing the calculated electrostatic potential (surface charge) to, for example, the range +5 to -5. This electrostatic potential distribution can be calculated over the entire space, assuming a constant dielectric constant, as a function of the spatial coordinates (c = f(x, y, z), where c is the electrostatic potential value).
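As a rough illustration of the constant-dielectric calculation described above, the following sketch evaluates a plain Coulomb sum over point charges and rescales a list of values to the range -5 to +5. It is a simplified stand-in, not the known methods referred to in the text. Assumptions: each atom already carries a partial charge (the charge assignment scheme is not specified here), the dielectric constant of 4.0 is an arbitrary example value, the rescaling by the largest absolute value is one possible reading of "normalizing", and the names `coulomb_potential` and `normalize` are hypothetical.

```python
import math

# Coulomb constant in kcal*Å/(mol*e^2), so that charges in units of e and
# distances in Å give an energy per unit test charge in kcal/(mol*e).
COULOMB_KCAL = 332.0636


def coulomb_potential(point, charges, dielectric=4.0):
    """Potential at `point` from point charges in a uniform dielectric.

    point      : (x, y, z) in Å; must not coincide with any charge position
    charges    : iterable of (x, y, z, q) with q in units of the elementary charge
    dielectric : assumed constant relative permittivity (example value only)
    """
    total = 0.0
    for x, y, z, q in charges:
        r = math.dist(point, (x, y, z))
        total += q / r
    return COULOMB_KCAL * total / dielectric


def normalize(values, limit=5.0):
    """Linearly rescale values so that the largest magnitude maps to `limit`."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0 for _ in values]
    return [limit * v / peak for v in values]


phi = coulomb_potential((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0, 1.0)], dielectric=1.0)
scaled = normalize([-2.0, 1.0, 4.0])
```

Evaluating `coulomb_potential` at each point of the surface shape data would yield the value c that is paired with (x, y, z).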
From these spatially continuous values, the values on the curved surface obtained by the surface shape data generation unit 3 are then extracted, and the (x, y, z) values of the surface shape data are desirably combined with the electrostatic potential (surface charge) value c and stored as four-dimensional data, that is, as (x, y, z, c).
An example of a conventionally known method for calculating the electrostatic potential (surface charge) of a protein is the method of Rocchia et al., Journal of Computational Chemistry, Vol. 23, No. 1, 128-137, 2002. Software that can be used to calculate the electrostatic potential (surface charge) includes GRASP, Chimera, APBS, and QUANTA.
The electrostatic potential distribution data generation unit 4 may generate the electrostatic potential distribution for the entire region of the xy plane created by the surface shape data generation unit 3, or for a partial region of that xy plane. As the partial region of the xy plane, the electrostatic potential distribution may be generated from the surface charge for a region that includes the ligand binding site contained in the xy plane, for example the spatial region within 10 Å, preferably within 5 Å, of the ligand binding site.
When the surface shape data generation unit 3 has created a plurality of xy planes for a single ligand binding site, the electrostatic potential distribution data generation unit 4 may generate the electrostatic potential distribution for all of the xy planes or for only some of them.
The electrostatic potential values over the xy plane can also be obtained as discrete values at the mesh points (intersections), with the xy plane treated as mesh data. For example, the mesh spacing can be 0.05 to 1.0 Å, preferably 0.1 to 0.5 Å, and more preferably 0.2 Å, and the surface charge value can be obtained as a discrete value at each mesh point (intersection). Furthermore, with the xy plane treated as mesh data in this way, the surface charge value over the xy plane can also be obtained as the average of the electrostatic potential calculated over the region inside each individual mesh cell.
In the description above, the surface shape data generation unit 3 takes the surface of the protein or of the ligand binding site as the xy plane and generates data in which the unevenness over that plane is expressed as z-axis values, and the electrostatic potential distribution data generation unit 4 generates the electrostatic potential distribution for the whole or a partial region of the xy plane created by the surface shape data generation unit 3. The protein interaction analysis device 1 is not, however, limited to this form. That is, the surface shape data generation unit 3 may, for example, define a three-dimensional grid space generated on the basis of the atomic coordinates of the atoms constituting the protein, and generate surface shape data for the entire protein or for a partial region of the protein that includes the ligand binding site. The electrostatic potential distribution data generation unit 4 may then generate the electrostatic potential distribution for this surface shape data.
More specifically, from the data set stored in the PDB, the data relating to atomic coordinates, other than the residue name (three-letter amino acid code), are extracted from the data associated with each atom serial number. That is, data on the atomic coordinates of the entire protein, or of the ligand binding site of the protein, are extracted from the data set stored in the PDB.
Next, the center coordinates of the entire protein or of the ligand binding site are calculated. The method of calculating the center coordinates is not particularly limited. For example, the arithmetic means of the x coordinates [Å], y coordinates [Å], and z coordinates [Å] can each be calculated from the atomic coordinates of the atoms constituting the entire protein, or of the atoms constituting the ligand binding site, extracted as described above, and the resulting mean values used as the center coordinates. When calculating the center coordinates of the ligand binding site, the atomic coordinates may first be extracted for the entire protein, after which only the atomic coordinates of the atoms constituting the ligand binding site are further extracted and the arithmetic means calculated as described above.
Next, the atoms lying within a given distance of the calculated center coordinates of the entire protein, or of the ligand binding site, are extracted. In other words, a sphere of a given radius is placed around the calculated center coordinates, and all atoms located inside the sphere are extracted. The distance from the center coordinates, that is, the radius of the sphere, can be set arbitrarily, for example in the range of 15 to 50 Å, preferably 20 to 40 Å, and more preferably 23 to 30 Å.
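The two steps just described, taking the arithmetic mean of the coordinates as the center and keeping the atoms inside a sphere around it, can be sketched as follows. The function names `centroid` and `atoms_within` and the sample coordinates are assumptions made for illustration; whether atoms lying exactly on the sphere are kept is also an assumption.

```python
import math


def centroid(coords):
    """Arithmetic mean of the x, y, and z coordinates of a non-empty list of atoms."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def atoms_within(coords, center, radius):
    """Keep the atoms whose distance from `center` does not exceed `radius` (Å)."""
    return [c for c in coords if math.dist(c, center) <= radius]


# Made-up coordinates in Å: the third atom lies far from the other two.
site = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]
everything = site + [(40.0, 0.0, 0.0)]
center = centroid(site)
kept = atoms_within(everything, center, radius=25.0)
```

Here the center is computed from the binding-site atoms only, and the sphere is then applied to all atoms, which corresponds to the ligand-binding-site variant in the text.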
Next, a cube inscribed in or circumscribed about the sphere of the given radius around the center coordinates is defined, and a three-dimensional grid space is obtained by dividing each edge of the cube at a given interval. The interval is not particularly limited and can be, for example, 0.25 Å or 0.5 Å. The coordinates of the grid points in the three-dimensional grid space can be defined in the same coordinate system as the atomic coordinates of the atoms constituting the entire protein, or of the atoms constituting the ligand binding site, from which the center coordinates were calculated.
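The grid construction can be illustrated with the sketch below. It assumes the usual geometric meanings, namely that a cube inscribed in a sphere of radius r has edge length 2r/√3 and a cube circumscribed about it has edge length 2r, and it assumes that the grid is centered on the center coordinates. The names `cube_edge` and `grid_axis` are hypothetical.

```python
import math


def cube_edge(radius, mode="inscribed"):
    """Edge length (Å) of the cube inscribed in, or circumscribed about, a sphere."""
    if mode == "inscribed":
        return 2.0 * radius / math.sqrt(3.0)
    if mode == "circumscribed":
        return 2.0 * radius
    raise ValueError(f"unknown mode: {mode}")


def grid_axis(center_value, edge, spacing):
    """Grid point coordinates along one axis, symmetric about `center_value`.

    The number of points is the largest count that fits in `edge` at `spacing`.
    """
    n = math.floor(edge / spacing) + 1
    start = center_value - spacing * (n - 1) / 2.0
    return [start + i * spacing for i in range(n)]


# Example: sphere of radius 10 Å, circumscribed cube, 0.5 Å grid spacing.
edge = cube_edge(10.0, "circumscribed")
axis = grid_axis(0.0, edge, 0.5)
```

Because the axis values are offsets from the center coordinates, the resulting grid points share the coordinate system of the atomic coordinates; the full grid is the Cartesian product of three such axes.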
In this way, the surface shape data generation unit 3 can define a three-dimensional grid generated on the basis of the atomic coordinates of the atoms constituting the protein and generate surface shape data for the entire protein.
More specifically, the three-dimensional grid space can be defined as shown in FIG. 3. In the example shown in FIG. 3, a sphere 14 of a given radius (for example, 10 to 20 Å) is placed around the center coordinates, and a first division 15 (solid lines in FIG. 3; 0.5 Å) can be set on the cube inscribed in the sphere 14. To make the resolution finer, a second division 16 (broken lines in FIG. 3) that further halves the first division 15 can also be set. A given number of grid positions (grid positions counts) 17 can thereby be defined along each edge of the cube inscribed in the sphere 14. The three-dimensional grid space consisting of the cube inscribed in the sphere 14 and partitioned into this given number of grid positions is referred to as a Voxel.
In FIG. 3, the cube inscribed in the sphere 14 is used as the three-dimensional grid space (Voxel); in this case, the vicinity of the eight corners of the cube is space in which no atomic coordinates are present. Although not illustrated, when a cube circumscribed about the sphere 14 is used as the three-dimensional grid space, atomic coordinates are contained throughout the Voxel.
Next, for the surface shape data expressed as the three-dimensional grid space generated as described above, the electrostatic potential distribution data generation unit 4 generates electrostatic potential distribution data using the data stored in the external storage unit 2. The electrostatic potential distribution data generation unit 4 reads the non-hydrogen atom species such as carbon (C), oxygen (O), nitrogen (N), and sulfur (S) and their coordinates, and sets up a Voxel, that is, a three-dimensional grid space, for each non-hydrogen atom species. For a given non-hydrogen atom species, a specific character such as "1" is assigned to the grid point in the Voxel nearest to each atom of that species, and another character such as "0" is assigned to the grid points that did not receive "1". As an example, for carbon (C), each carbon atom located inside the sphere of the given radius around the center coordinates (sphere 14 in FIG. 3) is assigned, on the basis of its coordinate data, to the nearest grid point in the Voxel, which receives "1", and grid points with no nearby carbon atom receive "0". This processing generates Voxel "C" data for carbon atoms. Performing the same processing for all non-hydrogen atom species, such as oxygen (O), nitrogen (N), and sulfur (S), generates Voxel data for each non-hydrogen atom species: Voxel "O" data for oxygen atoms, Voxel "N" data for nitrogen atoms, Voxel "S" data for sulfur atoms, and so on. The data set obtained in this way is used as three-dimensional convolution data (3D Convolution data). FIG. 5A schematically shows, as an example, the assignment of each carbon (C), oxygen (O), nitrogen (N), and sulfur (S) atom to its nearest grid point on the basis of the atomic coordinate data (the value of that grid point is set to 1).
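The per-element assignment of atoms to their nearest grid points can be sketched as follows. This is an illustrative assumption about the data layout: each channel is kept as the set of grid indices whose value is 1, and every other grid point is implicitly 0. The name `voxelize`, the cubic grid with `n` points per edge, and the sample atoms are hypothetical.

```python
def voxelize(atoms, origin, spacing, n, elements=("C", "N", "O", "S")):
    """Build one occupancy channel per non-hydrogen element.

    atoms   : iterable of (element, x, y, z) with coordinates in Å
    origin  : (x, y, z) of the grid point with index (0, 0, 0)
    spacing : grid spacing in Å
    n       : number of grid points along each edge of the cube
    returns : {element: set of (i, j, k) grid indices set to 1}
    """
    channels = {element: set() for element in elements}
    for element, x, y, z in atoms:
        if element not in channels:
            continue  # e.g. hydrogen is not given a channel here
        # Nearest grid point along each axis.
        idx = tuple(round((v - o) / spacing) for v, o in zip((x, y, z), origin))
        if all(0 <= i < n for i in idx):
            channels[element].add(idx)
    return channels


sample_atoms = [("C", 0.1, 0.0, 0.0), ("O", 0.9, 1.0, 0.0), ("H", 0.0, 0.0, 0.0)]
channels = voxelize(sample_atoms, origin=(0.0, 0.0, 0.0), spacing=0.5, n=4)
```

A dense array with one 0/1 value per grid point can be derived from each set when a fixed-size input is needed for three-dimensional convolution.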
Next, the electrostatic potential distribution data generation unit 4 can generate the electrostatic potential distribution using, for example, the method disclosed in Rocchia et al., Journal of Computational Chemistry, Vol. 23, No. 1, 128-137, 2002, or commercially available software such as GRASP, Chimera, APBS, and QUANTA.
Next, the electrostatic potential distribution data generation unit 4 can store the calculated electrostatic potential values that are positive and those that are negative as separate data. That is, on the basis of the calculated electrostatic potential values, it can generate Voxel "positive" data and Voxel "negative" data. FIG. 5B schematically shows this storage as separate data according to whether the electrostatic potential value is positive or negative. In FIG. 5B, the shading inside each circle represents the absolute value of the positive or negative value. The Voxel "positive" data and Voxel "negative" data can be used as three-dimensional convolution data (3D Convolution data) together with the Voxel "O" data for oxygen atoms, the Voxel "N" data for nitrogen atoms, the Voxel "S" data for sulfur atoms, and the like described above.
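Splitting the calculated values into a "positive" channel and a "negative" channel can be sketched as below. The dictionary keyed by grid index, the name `split_by_sign`, and the choice to store the negative channel as magnitudes are assumptions for this illustration; the text only requires that the two signs be stored as separate data.

```python
def split_by_sign(potential):
    """Separate per-grid-point potential values into positive and negative channels.

    potential : {(i, j, k): value} at grid points of the Voxel
    returns   : (positive, negative), where `negative` holds absolute values
    """
    positive = {key: value for key, value in potential.items() if value > 0}
    negative = {key: -value for key, value in potential.items() if value < 0}
    return positive, negative


grid_values = {(0, 0, 0): 2.5, (0, 0, 1): -1.5, (0, 1, 0): 0.0}
pos_channel, neg_channel = split_by_sign(grid_values)
```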
Meanwhile, using the data stored in the external storage unit 2, the protein interaction analysis device 1 generates, in the ligand structure data generation unit 5, three-dimensional structure data for the ligand. The ligand structure data generation unit 5 can obtain the three-dimensional structure data of the ligand using, as appropriate, a conventionally known method.
To obtain the three-dimensional structure data of the ligand, the ligand structure data generation unit 5 extracts the three-dimensional structure of the ligand using the ligand data stored in the external storage unit 2, that is, the molecular formula, structural formula, compound name, and information on the atoms that interact with the protein, and generates the three-dimensional structure data of the ligand on that basis. When the PDB is used, for example, residue names are searched for, using the _nonpolymer flag as an index, in the same file from which the atomic coordinate data used to generate the surface shape data were extracted, and a molecule can be judged to be a ligand molecule because its residue name is not an amino acid and/or because its residue name denotes an organic compound other than a protein. In the PDB, compound names are associated with the _nonpolymer flag as three-letter codes. Therefore, for a compound judged to be a ligand molecule, the three-dimensional coordinates (x, y, z) of all or some of the atoms constituting the ligand can be extracted from the coordinate data section in the latter part of the file, using the three-letter code as a key. The three-dimensional structure data of the ligand generated by the ligand structure data generation unit 5 may be that of the entire compound that interacts with the protein, or that of the partial region of the compound that interacts with the protein.
When obtaining the three-dimensional structure data of the ligand, the ligand structure data generation unit 5 may instead produce two-dimensional graph structure data using a molecular structure notation. Examples of such notations include SMILES (simplified molecular input line entry system), SMARTS (SMILES Arbitrary Target Specification), and InChI (International Chemical Identifier). In particular, the structure data of the ligand is preferably generated using the SMILES notation. The ligand structure data generation unit 5 can use the two-dimensional graph structure data written in a molecular structure notation such as SMILES as graph convolution data (Graph Convolution data) for machine learning.
In addition to the three-dimensional structure data of the ligand, the ligand structure data generation unit 5 preferably generates an electrostatic potential distribution for the ligand. The electrostatic potential distribution of the ligand can be obtained, for example, by the method described in Rocchia et al., Journal of Computational Chemistry, Vol. 23, No. 1, pages 128-137. It may be the electrostatic potential distribution of the whole compound that interacts with the protein, or that of the partial region of the compound that interacts with the protein.
For each combination of a ligand binding site in a given protein and a ligand, the protein interaction analysis device 1 stores the following in the data storage unit 6 in association with one another: the surface shape data of the ligand binding site generated by the surface shape data generation unit 3, the electrostatic potential distribution data of the ligand binding site generated by the electrostatic potential distribution data generation unit 4, and the three-dimensional structure data of the ligand generated by the ligand structure data generation unit 5. That is, for a plurality of combinations of a ligand binding site in a given protein and a ligand, the data storage unit 6 holds ligand binding site surface property data comprising "surface shape data", "electrostatic potential distribution data", and "three-dimensional structure data of the ligand". Besides these data, the data storage unit 6 may also store electrostatic potential distribution data for the ligand in association with them, and may store side by side the three-dimensional structure information of the multiple conformational isomers (rotamers) of the ligand.
The protein interaction analysis device 1 generates the protein interaction data requested by the user by machine learning that uses the plurality of ligand binding site surface property data stored in the data storage unit 6 as training data. The user enters information on the analysis target into the data input unit 7 of the protein interaction analysis device 1. When proteins or ligand binding sites that interact with a given compound are to be analyzed, the information on the analysis target is information on that compound. When compounds or ligands that interact with a given protein or its ligand binding site are to be analyzed, it is information on that protein or ligand binding site.
Examples of the compound information entered into the data input unit 7 include the three-dimensional structural formula of the compound, or its molecular formula together with information on its three-dimensional structure, and the same kinds of information for a partial region of the compound. Based on this compound information, candidate proteins or candidate ligand binding sites that interact with the compound can be analyzed by the processing described in detail later.
For example, suppose the protein interaction analysis device 1 is used to analyze candidate proteins (candidate enzymes) involved in an enzyme reaction that synthesizes a target compound (product) from a given compound (substrate). In that case, the information entered into the data input unit 7 about the substrate compound is its three-dimensional structural formula, or its molecular formula together with information on its three-dimensional structure, or the same kinds of information for the region of the substrate compound on which the enzyme acts. In this example, the type of enzyme reaction from substrate to product and the name of an enzyme involved in that reaction may also be entered into the data input unit 7.
Examples of the protein or ligand binding site information entered into the data input unit 7 include the amino acid sequence, atomic coordinates, and three-dimensional structure of the protein or ligand binding site. Based on this information, candidate compounds (candidate ligands) that interact with the protein or ligand binding site can be analyzed by the processing described in detail later.
For example, when the protein interaction analysis device 1 is used to select compounds (ligand compounds) that interact with a given protein (for example, a receptor protein), the protein information entered into the data input unit 7 is the amino acid sequence of the protein, its three-dimensional structure data, the amino acid sequence of its ligand binding site, or the three-dimensional structure data of its ligand binding site.
Based on the information on the analysis target entered through the data input unit 7, the calculation processing unit 8 of the protein interaction analysis device 1 generates data on the protein interaction of the analysis target. These data include the results of analysis by machine learning that uses, as training data, a plurality of sets each pairing ligand binding site surface property data stored in the data storage unit 6 with three-dimensional structure data of the ligand that interacts with that ligand binding site.
For example, when the information on the analysis target entered through the data input unit 7 is information on a compound, the calculation processing unit 8 generates candidate proteins or candidate ligand binding sites that may interact with that compound. More specifically, when the entered information concerns a compound that serves as the substrate of a given enzyme reaction, the calculation processing unit 8 generates candidate enzymes that may use the compound as a substrate. The calculation processing unit 8 may generate the single candidate protein, candidate ligand binding site, or candidate enzyme most likely to interact with the compound, or a group of candidates that are likely to interact with it.
When the information on the analysis target entered through the data input unit 7 is information on a protein or ligand binding site, the calculation processing unit 8 generates candidate compounds or candidate ligands that may interact with that protein or ligand binding site. The calculation processing unit 8 may generate the single candidate compound or candidate ligand most likely to interact with the protein or ligand binding site, or a group of candidate compounds or candidate ligands that are likely to interact with it.
In the example shown in FIG. 2, the machine learning unit 10 of the calculation processing unit 8 performs analysis by machine learning, using as training data a plurality of data sets each pairing the ligand binding site surface property data described above with three-dimensional structure data of the ligand that interacts with that ligand binding site. The evaluation value calculation unit 11 of the calculation processing unit 8 calculates, for the analysis target entered through the data input unit 7, an evaluation value indicating its similarity to the proteins or ligands contained in the training data. The list generation unit 12 of the calculation processing unit 8 then generates a list that combines the machine learning results from the machine learning unit 10 with the evaluation values calculated by the evaluation value calculation unit 11.
The training data processed by the machine learning unit 10 of the calculation processing unit 8 are not particularly limited, but preferably include, for example, an evaluation value that rates the interaction between the protein or ligand binding site and the ligand molecule. One example is a three-dimensional shape concavo-convex homology evaluation value, derived from the "ligand binding site surface property data" and the "three-dimensional structure data of the ligand", which gives a higher score the shorter the distance between facing parts of the ligand binding site of the protein and the ligand molecule. This evaluation value can be an n-dimensional vector whose dimension n is common to all data sets, each consisting of "ligand binding site surface property data" and "three-dimensional structure data of the ligand".
Here, the n-dimensional vector represents the three-dimensional shape concavo-convex homology evaluation values at n predetermined sites of the ligand molecule. These n sites can be defined arbitrarily for each ligand molecule. As one example, the dimension and element order of the n-dimensional vector can be fixed by assigning a one-dimensional order to each non-hydrogen atom (carbon, nitrogen, oxygen, sulfur, selenium, and so on) of the ligand molecule, following the numbering of carbon atoms under IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In this way, the evaluation value at each site in the spatial arrangement can be determined for a given ligand molecule. The number n may be common to all data sets consisting of "ligand binding site surface property data" and "three-dimensional structure data of the ligand", or may differ from data set to data set.
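A per-site score vector of this kind could be sketched as below. The text only states that shorter distances receive higher scores, so the scoring function 1 / (1 + d) is an assumption, as are the function name `shape_score_vector` and the representation of the binding site as a list of surface points. The ligand atoms are assumed to be passed in the fixed non-hydrogen atom order.

```python
import math


def shape_score_vector(ligand_atoms, site_points):
    """Return one score per ligand heavy atom, in the order given.

    ligand_atoms: list of (x, y, z) tuples for non-hydrogen atoms.
    site_points:  list of (x, y, z) tuples sampled on the binding-site surface.
    Each score is 1 / (1 + d_min), where d_min is the distance from the atom
    to its nearest surface point, so closer contact yields a higher score.
    """
    scores = []
    for atom in ligand_atoms:
        d_min = min(math.dist(atom, p) for p in site_points)
        scores.append(1.0 / (1.0 + d_min))
    return scores
```

The length of the returned list plays the role of n. Padding or truncating to a common length would be needed if n is to be shared across data sets.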
Furthermore, the training data processed by the machine learning unit 10 are not particularly limited, but preferably include, for example, an evaluation value that rates the binding energy of the electrostatic binding between the protein or ligand binding site and the ligand molecule. One example is a binding energy evaluation value, derived from the "electrostatic potential distribution data" and the "three-dimensional structure data of the ligand", which gives a higher score according to the magnitude of the binding energy (enthalpy change) at each site of the ligand molecule in the electrostatic binding between the protein and the ligand molecule. The binding energy evaluation value can be an m-dimensional vector whose dimension m is common to all data sets, each consisting of "electrostatic potential distribution data" and "three-dimensional structure data of the ligand".
Here, the m-dimensional vector represents the binding energy evaluation values at m predetermined sites of the ligand molecule. As with the n-dimensional vector described above, these m sites can be defined arbitrarily for each ligand molecule. The number m in the binding energy evaluation value may differ for each set of a protein or ligand binding site and a ligand molecule, or may be a common value. The number n in the three-dimensional shape concavo-convex homology evaluation value may be common to all data sets consisting of "electrostatic potential distribution data" and "three-dimensional structure data of the ligand", or may differ from data set to data set.
The values of n and m can each be chosen freely. For example, machine learning can be performed on part of the data sets described above, and n and m can then be set so that the answers for the remaining data sets, which were not used for learning, become more accurate.
When a given compound or ligand is entered as the analysis target through the data input unit 7, the evaluation value calculation unit 11 calculates an evaluation value for each candidate protein or candidate ligand binding site that the machine learning unit 10 extracted as possibly interacting with that compound or ligand. This evaluation value indicates the similarity between the compound or ligand entered through the data input unit 7 and the compound or ligand associated with the extracted candidate protein or candidate ligand binding site.
Specifically, the evaluation value calculation unit 11 performs maximum matching between the compound or ligand entered through the data input unit 7 and the compound or ligand associated with the extracted candidate protein or candidate ligand binding site, and can give a high evaluation value to molecules with a high degree of matching. The evaluation value can be defined, for example, so that it is high when a large proportion of the atoms in the entered compound or ligand lie within a predetermined distance (for example, 1 Å) of the corresponding atoms in the "ligand structure data" being compared. It can further be defined so that it becomes higher still when, in that comparison, the types and positions of oxygen and nitrogen atoms, which are likely to cause local electrostatic bias, match well. Defining the evaluation value in this way allows the structural similarity between the entered compound or ligand and the compound or ligand associated with the candidate protein or candidate ligand binding site to be evaluated more accurately.
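One way such an evaluation value might be computed is sketched below. It is a simplification under stated assumptions: the two molecules are taken to be already superposed in a common coordinate frame, a plain nearest-atom check replaces the maximum matching named in the text, and matches on oxygen and nitrogen are given double weight through a hypothetical `hetero_weight` parameter. The function name `match_score` is also an assumption.

```python
import math


def match_score(query, reference, cutoff=1.0, hetero_weight=2.0):
    """Weighted fraction of query atoms that have a same-element reference
    atom within `cutoff` angstroms.

    query, reference: lists of (element, (x, y, z)) in a shared frame.
    Oxygen and nitrogen atoms count `hetero_weight` times as much as others.
    """
    total = 0.0
    matched = 0.0
    for element, pos in query:
        weight = hetero_weight if element in ("O", "N") else 1.0
        total += weight
        for ref_element, ref_pos in reference:
            if ref_element == element and math.dist(pos, ref_pos) <= cutoff:
                matched += weight
                break  # count each query atom at most once
    return matched / total if total else 0.0
```

The result lies between 0 and 1. A stricter variant would enforce one-to-one pairing between query and reference atoms, which is closer to a true maximum matching.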
The evaluation value calculation unit 11 preferably calculates the evaluation value for the similarity between the compound or ligand entered as the analysis target through the data input unit 7 and only that part of the structure of the compound or ligand associated with the candidate protein or candidate ligand binding site extracted by the machine learning unit 10 that lies sufficiently close to the ligand binding site. A sufficiently close region is, for example, the region within 5 Å of the ligand binding site when the compound or ligand is interacting with the ligand binding site. With this processing, the evaluation value rates the similarity between the entered compound or ligand and the region that actually takes part in the interaction in the compound or ligand associated with the extracted candidate protein or candidate ligand binding site.
Furthermore, when the type of enzyme reaction from substrate to product, or the name of an enzyme involved in that reaction, is entered through the data input unit 7 in addition to the compound or ligand to be analyzed, the evaluation value calculation unit 11 can give each extracted candidate protein or candidate ligand binding site an evaluation value indicating its degree of agreement or similarity with the entered enzyme reaction or enzyme name.
Alternatively, when a given protein or ligand binding site is entered as the analysis target through the data input unit 7, the evaluation value calculation unit 11 calculates an evaluation value for each candidate compound or candidate ligand that the machine learning unit 10 extracted as possibly interacting with that protein or ligand binding site. This evaluation value indicates the similarity between the protein or ligand binding site entered through the data input unit 7 and the protein or ligand binding site associated with the extracted candidate compound or candidate ligand.
Specifically, the evaluation value calculation unit 11 can calculate the degree of amino acid sequence identity between the protein or ligand binding site entered through the data input unit 7 and the protein or ligand binding site associated with each extracted candidate compound or candidate ligand, and give a high evaluation value to candidates associated with a protein or ligand binding site of high identity. The evaluation value calculation unit 11 can likewise calculate the similarity in three-dimensional structure between the entered protein or ligand binding site and the associated protein or ligand binding site, and give a high evaluation value to candidates for which this similarity is high. It can also calculate the similarity in electrostatic potential distribution between the two, and give a high evaluation value to candidates for which this similarity is high. Defining the evaluation value in this way allows the structural similarity and the similarity in electrostatic potential distribution between the entered protein or ligand binding site and the protein or ligand binding site associated with the candidate compound or candidate ligand to be evaluated more accurately.
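The sequence-identity part of this evaluation could be computed along the following lines. The sketch assumes the two amino acid sequences have already been aligned to equal length with `-` as the gap character. The alignment step itself, and the function name `sequence_identity`, are outside what the text specifies.

```python
def sequence_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of identical residues over aligned, gap-free columns.

    Both inputs must be pre-aligned to the same length; '-' marks a gap.
    Columns with a gap in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Structural and electrostatic similarity would need their own measures, which the text leaves open.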
As described above, the list generation unit 12 generates a list that integrates the protein interaction data generated by the machine learning unit 10 with the evaluation values calculated by the evaluation value calculation unit 11. When a given compound is entered as the analysis target through the data input unit 7, the list associates the candidate proteins or candidate ligand binding sites that may interact with the compound with the evaluation values calculated for them. When a given protein or ligand binding site is entered as the analysis target, the list associates the candidate compounds or candidate ligands that may interact with that protein or ligand binding site with the evaluation values calculated for them.
In the example shown in FIG. 2, the calculation processing unit 8 comprises the machine learning unit 10, the evaluation value calculation unit 11, and the list generation unit 12. As shown in FIG. 3, it may additionally comprise a protein-ligand compatibility score calculation unit 13. When the analysis target entered through the data input unit 7 is a given compound or ligand, the protein-ligand compatibility score calculation unit 13 calculates a compatibility score for the binding stability between the compound or ligand being analyzed and each candidate protein or candidate ligand binding site that the machine learning unit 10 extracted as possibly interacting with it. When the analysis target is a given protein or ligand binding site, the unit calculates a compatibility score for the binding stability between the protein or ligand binding site being analyzed and each candidate compound or candidate ligand that the machine learning unit 10 extracted as possibly interacting with it.
The compatibility score can be a value calculated from the binding enthalpy between the ligand and the ligand binding site. A ligand on its own is in a state where water molecules are coordinated to it, and potential energy 1, that of the ligand alone, is calculated from its binding enthalpy with the water molecules. Next, the enthalpy in the state where the ligand binding site and the ligand are bound (by ionic bonds, hydrophobic bonds, and so on) is calculated to obtain potential energy 2. A positive difference between potential energy 2 and potential energy 1 means that the ligand binds readily to the ligand binding site. Calculating a compatibility score that takes this difference into account therefore allows the binding stability described above to be evaluated quantitatively.
In the example shown in FIG. 3, the list generation unit 12 generates a list that integrates the protein interaction data generated by the machine learning unit 10, the evaluation values calculated by the evaluation value calculation unit 11, and the compatibility scores calculated by the protein-ligand compatibility score calculation unit 13. When a given compound or ligand is entered as the analysis target through the data input unit 7, the list associates the candidate proteins or candidate ligand binding sites that may interact with it with the evaluation values and compatibility scores calculated for them. When a given protein or ligand binding site is entered as the analysis target, the list associates the candidate compounds or candidate ligands that may interact with it with the evaluation values and compatibility scores calculated for them.
As described above, the list generation unit 12 shown in FIG. 2 or FIG. 3 generates a list of the candidate proteins or candidate ligand binding sites, or the candidate compounds or candidate ligands, extracted by the machine learning unit 10. The list generation unit 12 may narrow this list further on the basis of the evaluation values and/or compatibility scores described above. That is, it may remove from the list those candidates whose evaluation value and/or compatibility score is at or below a predetermined value.
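The narrowing step can be sketched as a simple threshold filter. The `Candidate` record, the function name `build_list`, and the final sort by evaluation value are assumptions made for illustration. The text only states that entries at or below a predetermined value may be removed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    evaluation_value: float
    compatibility_score: Optional[float] = None  # absent in the FIG. 2 configuration


def build_list(
    candidates: List[Candidate],
    min_evaluation: float,
    min_compatibility: Optional[float] = None,
) -> List[Candidate]:
    """Drop candidates whose evaluation value (and, if a threshold is given,
    compatibility score) is at or below the threshold, then sort the rest by
    evaluation value, highest first (the sort is an assumption)."""
    kept = []
    for c in candidates:
        if c.evaluation_value <= min_evaluation:
            continue
        if (
            min_compatibility is not None
            and c.compatibility_score is not None
            and c.compatibility_score <= min_compatibility
        ):
            continue
        kept.append(c)
    return sorted(kept, key=lambda c: c.evaluation_value, reverse=True)
```

Leaving `min_compatibility` unset corresponds to filtering on the evaluation value alone.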
Specifically, when the information on the analysis target entered through the data input unit 7 concerns a compound that serves as the substrate of a given enzyme reaction, the machine learning unit 10 extracts candidate enzymes that may use the compound as a substrate. In this case, the list generation unit 12 may remove from the list those extracted candidate enzymes whose evaluation value and/or compatibility score is at or below a predetermined value. It may also remove those candidate enzymes that are unrelated to the enzyme reaction entered by the user.
The output unit 9 of the protein interaction analysis device 1 then outputs the list generated by the list generation unit 12. The list may be output exactly as generated by the list generation unit 12, or with further information added to it.
For example, when a given compound is entered as the analysis target through the data input unit 7, the list generation unit 12 generates a list of candidate proteins or candidate ligand binding sites that may interact with the compound. The output can then be a list to which functional information and the like have been added for the candidate proteins, or for the proteins containing the candidate ligand binding sites, that appear in the list.
Although not shown in FIGS. 1 to 3, engineering information may be analyzed on the basis of the evaluation values described above before the output unit 9 outputs the list. For example, when a given compound is input as the analysis target through the data input unit 7, the cause that hinders the interaction between a candidate protein or candidate ligand binding site and the input compound is identified from the evaluation value, and engineering information that would make the compound interact more readily is derived.
Specifically, first, the region of the input compound that contributes to lowering the evaluation value is identified by structurally comparing the input compound with the compound associated with the candidate protein or candidate ligand binding site. Next, the position in the candidate protein or candidate ligand binding site with which that region of the compound interacts is identified. Then, based on the three-dimensional structure and electrostatic potential distribution at the identified position, the mutations or modifications to be introduced into the candidate protein or candidate ligand binding site are identified so that it takes on a three-dimensional structure and electrostatic potential distribution with which the input compound interacts readily. The mutations or modifications identified in this way can be generated as engineering information for the candidate protein or candidate ligand binding site.
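As a rough illustration of the last step only, the sketch below flags positions where the local electrostatic potential of the binding site and the partial charge of the facing ligand region have the same sign, and proposes the direction in which the local potential would need to shift. The specification does not give this algorithm. The `Contact` record, its fields, the sign-product test, and the wording of the hints are all assumptions made for illustration, and steric (three-dimensional shape) considerations are ignored.

```python
from typing import List, NamedTuple


class Contact(NamedTuple):
    """Hypothetical pairing of a binding-site position with the ligand region facing it."""
    position: str          # hypothetical label, e.g. a residue identifier
    site_potential: float  # electrostatic potential at the site position (arbitrary units)
    ligand_charge: float   # partial charge of the ligand region at that position


def suggest_changes(contacts: List[Contact]) -> List[str]:
    """Return one hint per contact whose electrostatics look unfavorable."""
    hints = []
    for c in contacts:
        # Same sign of potential and charge is taken here as a sign of repulsion.
        if c.site_potential * c.ligand_charge > 0:
            wanted = "negative" if c.ligand_charge > 0 else "positive"
            hints.append(f"{c.position}: shift local potential toward {wanted}")
    return hints
```

Each hint would still have to be turned into a concrete mutation or modification, a step this sketch does not attempt.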
Conversely, when a given protein is input as the analysis target through the data input unit 7, the cause that hinders the interaction between a candidate compound or candidate ligand and the input protein can be identified from the evaluation value, and engineering information that would make the protein interact more readily can be derived.
Specifically, first, the region of the input protein that contributes to lowering the evaluation value is identified by structurally comparing the input protein with the protein associated with the candidate compound or candidate ligand. Next, the position in the candidate compound or candidate ligand with which that region of the protein interacts is identified. Then, based on the three-dimensional structure and electrostatic potential distribution at the identified position, structural alterations to the candidate compound or candidate ligand (such as removal, change, or addition of functional groups) are identified so that it takes on a three-dimensional structure and electrostatic potential distribution with which the input protein interacts readily. The structural alterations identified in this way can be generated as engineering information for the candidate compound or candidate ligand.
The protein interaction analysis device 1 described above can also be implemented on a general-purpose computer. That is, the protein interaction analysis device 1 comprises an arithmetic unit such as a CPU; storage devices such as a hard disk, RAM, and ROM; input devices such as a keyboard and a pointing device; and output devices such as a display and a printer. The protein interaction analysis device 1 may also include a communication device for connecting, over a network such as the Internet, to an external storage device such as the external storage unit 2. In the protein interaction analysis device 1, this communication device serves as an input device for various data and as an output device to the outside. The storage devices such as the hard disk, RAM, and ROM store a program that causes the computer to carry out the various processes described above. In other words, executing the program stored in the storage devices on the hardware described above realizes the protein interaction analysis device 1. The protein interaction analysis device 1 may consist of a single computer or of multiple physically separate computers that can communicate with one another.
All publications, patents, and patent applications cited in this specification are incorporated herein by reference in their entirety.

Claims (20)

  1. A protein interaction analysis device comprising:
     a data input unit for inputting information on an analysis target;
     a data storage unit that stores, in association with one another, surface shape data of a ligand binding site of a given protein, electrostatic potential distribution data of the ligand binding site, and three-dimensional structure data of a ligand that interacts with the ligand binding site, these data having been generated from amino acid sequence data and three-dimensional structure data of proteins stored in an external storage unit and from three-dimensional structure data of ligands that interact specifically with those proteins; and
     a calculation processing unit that generates data on protein interactions related to the analysis target input through the data input unit, by machine learning that uses as training data the surface shape data of given ligand binding sites, the electrostatic potential distribution data of those ligand binding sites, and the three-dimensional structure data of ligands that interact with those ligand binding sites, all stored in the data storage unit.
  2. The protein interaction analysis device according to claim 1, wherein
     the information on the analysis target is information on the structure of a ligand, and
     the calculation processing unit generates data on a protein or ligand binding site that interacts with the ligand.
  3. The protein interaction analysis device according to claim 1, wherein
     the information on the analysis target is information on the structure of a protein or ligand binding site, and
     the calculation processing unit generates data on a compound or ligand that interacts with the protein or ligand binding site.
  4. The protein interaction analysis device according to claim 1, wherein the calculation processing unit comprises an evaluation value calculation unit that calculates, for the data on protein interactions generated by machine learning, an evaluation value indicating the similarity between the analysis target input through the data input unit and the analysis target contained in the generated data.
  5. The protein interaction analysis device according to claim 1, wherein the calculation processing unit comprises a protein-ligand fitness score calculation unit that calculates, for the data on protein interactions generated by machine learning, a fitness score that quantitatively indicates the binding stability when the analysis target input through the data input unit interacts.
  6. The protein interaction analysis device according to claim 1, wherein the data storage unit calculates the center coordinates of the ligand binding site from the atomic coordinates of the atoms constituting the ligand binding site, sets a three-dimensional grid space containing the atoms within a predetermined distance of the center coordinates, and stores surface shape data generated on the basis of the three-dimensional grid space.
  7. The protein interaction analysis device according to claim 6, wherein the three-dimensional grid space has a plurality of lattice points defined by a grid set at predetermined intervals, and is data in which a specific character is assigned to the lattice point nearest to each atom within the predetermined distance of the center coordinates and another character is assigned to the lattice points not given the specific character.
  8. The protein interaction analysis device according to claim 6, wherein the atoms within the predetermined distance of the center coordinates are of a plurality of non-hydrogen atomic species.
  9. The protein interaction analysis device according to claim 6, wherein the data storage unit stores electrostatic potential distribution data calculated for the lattice points of the three-dimensional grid space.
  10. The protein interaction analysis device according to claim 6, wherein the data storage unit stores positive electrostatic potential distribution data consisting of the positive values calculated for the lattice points of the three-dimensional grid space and negative electrostatic potential distribution data consisting of the negative values calculated for the lattice points of the three-dimensional grid space.
  11. A protein interaction analysis method comprising:
     a step of inputting information on an analysis target by an input device;
     a step in which an arithmetic unit generates, from amino acid sequence data and three-dimensional structure data of proteins stored in an external storage unit and from three-dimensional structure data of ligands that interact specifically with those proteins, surface shape data of a ligand binding site of a given protein, electrostatic potential distribution data of the ligand binding site, and three-dimensional structure data of a ligand that interacts with the ligand binding site, and stores the surface shape data, the electrostatic potential distribution data, and the three-dimensional structure data of the ligand in a storage device in association with one another; and
     a step in which the arithmetic unit generates data on protein interactions related to the analysis target input by the input device, by machine learning that uses as training data the surface shape data of given ligand binding sites, the electrostatic potential distribution data of those ligand binding sites, and the three-dimensional structure data of ligands that interact with those ligand binding sites, all stored in the storage device.
  12. The protein interaction analysis method according to claim 11, wherein
     the information on the analysis target is information on the structure of a ligand, and
     the arithmetic unit generates data on a protein or ligand binding site that interacts with the ligand.
  13. The protein interaction analysis method according to claim 11, wherein
     the information on the analysis target is information on the structure of a protein or ligand binding site, and
     the arithmetic unit generates data on a compound or ligand that interacts with the protein or ligand binding site.
  14. The protein interaction analysis method according to claim 11, further comprising a step in which the arithmetic unit calculates, for the data on protein interactions generated by machine learning, an evaluation value indicating the similarity between the analysis target input by the input device and the analysis target contained in the generated data.
  15. The protein interaction analysis method according to claim 11, further comprising a step in which the arithmetic unit calculates, for the data on protein interactions generated by machine learning, a fitness score that quantitatively indicates the binding stability when the analysis target input by the input device interacts.
  16. The protein interaction analysis method according to claim 11, wherein the arithmetic unit calculates the center coordinates of the ligand binding site from the atomic coordinates of the atoms constituting the ligand binding site, sets a three-dimensional grid space containing the atoms within a predetermined distance of the center coordinates, and stores surface shape data generated on the basis of the three-dimensional grid space in the data storage unit.
  17. The protein interaction analysis method according to claim 11, wherein the three-dimensional grid space has a plurality of lattice points defined by a grid set at predetermined intervals, and is data in which a specific character is assigned to the lattice point nearest to each atom within the predetermined distance of the center coordinates and another character is assigned to the lattice points not given the specific character.
  18. The protein interaction analysis method according to claim 11, wherein the atoms within the predetermined distance of the center coordinates are of a plurality of non-hydrogen atomic species.
  19. The protein interaction analysis method according to claim 11, wherein the arithmetic unit stores, in the data storage unit, electrostatic potential distribution data calculated for the lattice points of the three-dimensional grid space.
  20. The protein interaction analysis method according to claim 11, wherein the arithmetic unit stores, in the data storage unit, positive electrostatic potential distribution data consisting of the positive values calculated for the lattice points of the three-dimensional grid space and negative electrostatic potential distribution data consisting of the negative values calculated for the lattice points of the three-dimensional grid space.
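The grid encoding recited in claims 6 to 10 (and 16 to 20) can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the characters "1" and "0", the cubic grid extent, the use of the arithmetic mean as the center, and the zero-filling of the split potential maps are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Index = Tuple[int, int, int]


def center_of(coords: Sequence[Point]) -> Point:
    """Arithmetic mean of the coordinates (an assumed definition of the center)."""
    n = len(coords)
    return (sum(p[0] for p in coords) / n,
            sum(p[1] for p in coords) / n,
            sum(p[2] for p in coords) / n)


def voxelize(atoms: List[Tuple[str, Point]], radius: float, spacing: float,
             atom_char: str = "1", empty_char: str = "0") -> Dict[Index, str]:
    """Encode a binding site as characters on a cubic lattice around its center."""
    # Only non-hydrogen atoms are considered (cf. claim 8).
    heavy = [(el, xyz) for el, xyz in atoms if el != "H"]
    center = center_of([xyz for _, xyz in heavy])
    half = math.ceil(radius / spacing)  # lattice steps from the center to each face
    span = range(-half, half + 1)
    # Start with every lattice point carrying the "other" character.
    grid = {(i, j, k): empty_char for i in span for j in span for k in span}
    for _, xyz in heavy:
        if math.dist(xyz, center) > radius:
            continue  # atom lies outside the predetermined distance
        # Nearest lattice point to the atom receives the specific character.
        idx = tuple(round((a - c) / spacing) for a, c in zip(xyz, center))
        grid[idx] = atom_char
    return grid


def split_potential(potential: Dict[Index, float]):
    """Split a per-lattice-point potential into positive-only and negative-only maps."""
    positive = {idx: (v if v > 0 else 0.0) for idx, v in potential.items()}
    negative = {idx: (v if v < 0 else 0.0) for idx, v in potential.items()}
    return positive, negative
```

Keeping the positive and negative potentials as separate maps gives a learning model two non-overlapping input channels, which is one plausible reading of why claim 10 stores them separately.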
PCT/JP2019/022528 2018-06-06 2019-06-06 Protein interaction analysis device and analysis method WO2019235567A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020523173A JP6995990B2 (en) 2018-06-06 2019-06-06 Protein interaction analyzer and analysis method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-108362 2018-06-06
JP2018108362 2018-06-06

Publications (1)

Publication Number Publication Date
WO2019235567A1 (en) 2019-12-12

Family

ID=68770412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/022528 WO2019235567A1 (en) 2018-06-06 2019-06-06 Protein interaction analysis device and analysis method

Country Status (2)

Country Link
JP (1) JP6995990B2 (en)
WO (1) WO2019235567A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002057954A1 (en) * 2001-01-19 2002-07-25 Mitsubishi Chemical Corporation Method of constructing three dimensional structure of protein involving induced-fit and utilization thereof
US20040148265A1 (en) * 1998-06-19 2004-07-29 Schwartz Steven D. Neural network methods to predict enzyme inhibitor or receptor ligand potency
JP2009151406A (en) * 2007-12-19 2009-07-09 National Institute Of Advanced Industrial & Technology Protein function identification device
US20110066384A1 (en) * 2002-08-06 2011-03-17 Zauhar Randy J Computer Aided Ligand-Based and Receptor-Based Drug Design Utilizing Molecular Shape and Electrostatic Complementarity

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220010327A (en) * 2020-07-17 2022-01-25 주식회사 아론티어 Protein-ligand binding affinity prediction using ensemble of 3d convolutional neural network and system therefor
KR102576033B1 (en) * 2020-07-17 2023-09-11 주식회사 아론티어 Protein-ligand binding affinity prediction using ensemble of 3d convolutional neural network and system therefor
EP4102510A1 (en) 2021-06-09 2022-12-14 Fujitsu Limited Stable structure search system, stable structure search method, and stable structure search program

Also Published As

Publication number Publication date
JPWO2019235567A1 (en) 2021-07-01
JP6995990B2 (en) 2022-02-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19815277

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020523173

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19815277

Country of ref document: EP

Kind code of ref document: A1