WO2021131324A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program

Info

Publication number
WO2021131324A1
Authority
WO
WIPO (PCT)
Prior art keywords
structural formula
component
information
compound
information processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2020/040861
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
侑也 濱口
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujifilm Corp
Original Assignee
Fujifilm Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujifilm Corp
Priority to CN202080089203.6A (patent CN114868192B)
Priority to JP2021566876A (patent JP7449961B2)
Publication of WO2021131324A1
Priority to US17/844,033 (patent US12362045B2)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/40 Searching chemical structures or physicochemical data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • G06V30/422 Technical drawings; Geographical maps
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • The present invention relates to an information processing device, an information processing method, and a program, and more particularly to an information processing device, an information processing method, and a program that make the structural formula of a compound represented as an image searchable.
  • In Patent Document 1, character information in a chemical structure diagram (for example, the atoms constituting a chemical substance) is recognized by pattern recognition, and diagram information in the chemical structure diagram (for example, the bonds between atoms) is recognized by a predetermined algorithm.
  • In Patent Document 2, an image of the structural formula of a compound is read, a value indicating the atomic-symbol attribute is assigned to each pixel showing an atomic symbol in the image, and a value indicating the bond-symbol attribute is assigned to each pixel showing a bond symbol.
  • If each component in the structural formula is identified from an image showing the structural formula of a compound, the information about each identified component can be useful when subsequently searching for the compound.
  • The present invention has been made in view of the above circumstances and solves the above-mentioned problems of the prior art. Specifically, an object of the present invention is to provide an information processing apparatus, an information processing method, and a program that can identify each component of a structural formula from an image showing the structural formula, regardless of how the structural formula is written, and can use the identification result for a subsequent compound search.
  • The information processing apparatus of the present invention includes a processor. Using an identification model, the processor identifies the component indicated by each region in a target image showing the structural formula of a target compound, based on the feature amount of each region, and stores element information about the identified components in the structural formula of the target compound in association with the target compound. The identification model is constructed by machine learning using learning images that each show one component in the structural formula of a compound.
  • A discriminative model for deriving a common feature quantity from the plurality of learning images may be constructed by machine learning.
  • It is preferable that the processor acquire input information regarding a search compound and, based on the input information and the element information associated with the target compounds, search the target compounds for which element information is stored for the target compound corresponding to the search compound.
  • It is more preferable that the processor calculate the similarity between the search compound and each target compound based on the input information and the element information stored in association with that target compound, and retrieve, from the target compounds for which element information is stored, a target compound whose similarity satisfies the search condition.
  • It is preferable that the processor acquire input information regarding the components contained in the structural formula of the search compound.
  • the processor may identify the component indicated by each region in the target image by detecting the target image from the document including the target image and inputting the detected target image into the identification model.
  • The processor detects the target image from the document by using an object detection algorithm.
  • the element information may include information indicating the type of the component in the structural formula of the identified target compound. At this time, the element information may further include information indicating the arrangement position of the component in the structural formula of the identified target compound in the coordinate space set with respect to the target image.
  • the information indicating the type of the component may be the information indicating the type of the atom corresponding to the component or the bond between the atoms.
  • the information indicating the type of the component may be information indicating the chemical formula of the functional group corresponding to the component.
  • the information indicating the type of the component may be information consisting of a part of the molecular fingerprint indicating the presence or absence of the component in the structural formula of the target compound for each type of the component.
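  • The fingerprint-style representation mentioned above can be illustrated with a minimal sketch. The vocabulary `COMPONENT_TYPES` and the function name `fingerprint` are hypothetical, chosen only for illustration; the source does not specify which component types form the fingerprint or how the bits are ordered.

```python
# Hypothetical fixed vocabulary of component types; one bit per type.
# The actual set of types and their order are not given in the source.
COMPONENT_TYPES = ["Bend C", "O", "N", "Single", "Double"]


def fingerprint(identified_types, vocabulary=COMPONENT_TYPES):
    """Return a list of 0/1 flags: 1 if the component type occurs in the formula."""
    present = set(identified_types)
    return [1 if t in present else 0 for t in vocabulary]


# Component types identified in one structural formula (made-up example).
bits = fingerprint(["Bend C", "Bend C", "O", "Double"])
print(bits)
```

  • Each position records only presence or absence, so repeated components such as the two "Bend C" entries set the same bit once.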
  • The above object can also be achieved by an information processing method in which a processor performs a step of identifying, by an identification model, the component indicated by each region among the components included in the structural formula of a target compound, based on the feature amount of each region in a target image showing the structural formula of the target compound, and a step of storing element information about the identified components in the structural formula of the target compound in association with the target compound, wherein the identification model is constructed by machine learning using a learning image showing one component in the structural formula of a compound. Further, a program for causing a processor to perform each step of the above information processing method can also be realized.
  • each component of the structural formula can be identified from the image showing the structural formula regardless of how the structural formula is written, and the identification result can be used for the subsequent compound search.
  • An information processing device, an information processing method, and a program according to an embodiment of the present invention (hereinafter referred to as "the present embodiment") will be described below with reference to the accompanying drawings.
  • The following embodiment is merely an example for explaining the present invention in an easy-to-understand manner and does not limit the present invention. That is, the present invention is not limited to the following embodiment, and various improvements or modifications can be made without departing from the gist of the present invention. The present invention also, of course, includes equivalents thereof.
  • The terms "document" and "image" refer to electronic (data) documents and images, that is, information (data) that can be processed by a computer.
  • The information processing device of the present embodiment includes a processor, analyzes an image (target image) showing the structural formula of the target compound, and can identify each component in the structural formula.
  • the target compound is, for example, a compound whose structural formula is represented by an image in a document and whose components indicated by each region in the image are identified by an information processing apparatus.
  • the image showing the structural formula is an image of a diagram showing the structural formula.
  • Equivalent ways of describing a structural formula include omitting the single bonds to hydrogen atoms (H), omitting the notation of the skeletal carbon atoms (C), and abbreviating functional groups.
  • The diagram may also change depending on how it is drawn (for example, the thickness and length of the bond lines between atoms, the direction in which the bond lines extend, etc.).
  • The way the structural formula is written also includes the resolution of the image showing the structural formula.
  • the constituent elements in the structural formula mean the atoms constituting the structural formula, the bond lines between the atoms, or a combination thereof.
  • The individual atoms constituting the structural formula (for example, "Bend C" and "O" in FIG. 1) and the individual bond lines (for example, "Double" in FIG. 1) correspond to the components.
  • Each area of the image showing the structural formula is a part of the image showing the components in the structural formula, for example, a rectangular area surrounding the components (see the right figure in FIG. 1). In this embodiment, it is assumed that one component is included in each area. That is, in the image showing the structural formula, there are a number of regions corresponding to the components included in the structural formula.
  • the information processing device performs machine learning using one component in the structural formula of the compound (specifically, label information of the component) and a learning image showing one component as a learning data set.
  • This machine learning builds a discriminative model.
  • the discriminative model is a model that identifies the constituent elements indicated by each region among the constituent elements in the structural formula based on the feature amount of each region of the image showing the structural formula of the compound. The discriminative model will be described in detail in a later section.
  • the information processing device has a function of detecting an image (target image) from a document containing an image showing the structural formula of the compound.
  • the detected target image is input to the above discriminative model. Thereby, each component in the structural formula of the compound (target compound) shown in the target image is identified.
  • the information processing device acquires element information for each component in the identified target compound.
  • the element information includes information indicating the type of the identified component and information indicating the arrangement position of the component.
  • The information indicating the type of the component is information indicating the type of the atom, or of the bond between atoms, corresponding to the component; in the case of the compound shown in FIG. 1, "Bend C", "O", and "Double" apply.
  • The information indicating the arrangement position of the component indicates where the component is located in the coordinate space set for the target image (for example, the two-dimensional coordinate space in which the horizontal direction of the target image is the X direction and the vertical direction is the Y direction). With a reference position of the target image (for example, the upper left vertex position) as the origin, the arrangement position is given by the representative position and the size (for example, the length in each of the X and Y directions) of the rectangular area surrounding the component, both expressed in pixel units.
  • Element information is acquired for each of a plurality of constituent elements included in the structural formula of the target compound.
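  • As a rough illustration of what one piece of element information might look like in code, the sketch below pairs a component type with a rectangle in pixel coordinates. The class name `ElementInfo`, its field names, and the choice of the rectangle's upper left corner as the representative position are all assumptions made for this example; the source only says that a representative position and a size are recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementInfo:
    """One identified component of a structural formula (hypothetical record layout)."""
    component_type: str  # e.g. "Bend C", "O", "Double" as in FIG. 1
    x: int               # representative position (assumed: left edge), in pixels
    y: int               # representative position (assumed: top edge), in pixels
    width: int           # length of the surrounding rectangle in the X direction
    height: int          # length of the surrounding rectangle in the Y direction


def to_element_info(component_type, box):
    """Convert a (left, top, right, bottom) pixel box into an ElementInfo."""
    left, top, right, bottom = box
    if right < left or bottom < top:
        raise ValueError("box corners are not ordered")
    return ElementInfo(component_type, left, top, right - left, bottom - top)


# Made-up bounding box, measured from the upper left corner of the target image.
info = to_element_info("Double", (40, 12, 64, 30))
print(info)
```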
  • the acquired element information is stored in association with the target compound, and is stored, for example, in a state of being associated with a document or the like on which an image showing the structural formula of the target compound is posted, as shown in FIG.
  • the information indicating the type of the component is automatically acquired by identifying each component in the structural formula by the identification model. Further, among the element information, the information indicating the arrangement position of the component is automatically acquired by analyzing the image including the area indicating the component (that is, the target image).
  • The information processing device repeatedly executes the above series of processes (specifically, image detection from a document, identification of each component in a structural formula, and acquisition and storage of element information) for various target compounds. As a result, element information about each component in the structural formula of each target compound is accumulated, and a database in which element information is recorded for each target compound is constructed (see FIG. 2).
  • The information processing device has a function of searching for the desired target compound, that is, the target compound corresponding to the search compound, using the element information stored in the database as a search key.
  • the user performing the search inputs image information indicating the structural formula of the search compound.
  • The information processing device acquires the image information as input information and, based on the acquired input information and the element information stored in the database, searches the target compounds for which element information is stored for the target compound corresponding to the search compound.
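  • The source states that a similarity between the search compound and each target compound can be computed from element information, but it does not say how. The sketch below is one possible approach, assuming a Tanimoto-style overlap between the multisets of component types; the function names, the threshold value, and the "Single" and "N" labels are illustrative only.

```python
from collections import Counter


def tanimoto(query_types, target_types):
    """Multiset Tanimoto similarity between two lists of component types."""
    q, t = Counter(query_types), Counter(target_types)
    shared = sum((q & t).values())  # per-type minimum counts
    total = sum((q | t).values())   # per-type maximum counts
    return shared / total if total else 0.0


def search(query_types, database, threshold=0.5):
    """Return (compound_id, similarity) pairs meeting the threshold, best first."""
    scored = [(cid, tanimoto(query_types, types)) for cid, types in database.items()]
    hits = [(cid, s) for cid, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy database: compound id -> component types identified in its structural formula.
database = {
    "compound-A": ["Bend C", "Bend C", "O", "Double"],
    "compound-B": ["Bend C", "N", "Single"],
}
results = search(["Bend C", "Bend C", "O", "Double"], database)
print(results)
```

  • This version ignores the arrangement positions of the components; a fuller comparison could also take them into account.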
  • In this way, an image of the structural formula of a compound contained in a document such as a paper or a patent specification is detected, and information about each component in the structural formula shown by the image (element information) can be compiled into a database. By using the database, the target compound can be searched for easily. This makes it possible to easily find, for example, a document containing an image showing the structural formula of the target compound.
  • the discriminative model used in the present embodiment (hereinafter referred to as the discriminative model M1) will be described.
  • The discriminative model M1 is a model for identifying each component included in the structural formula from an image (target image) showing the structural formula of the target compound. As shown in FIG. 3, the discriminative model M1 of the present embodiment is composed of a feature amount derivation model Ma and a component output model Mb.
  • the feature amount derivation model Ma is a model that derives the feature amount of each region of the target image by inputting the target image.
  • the feature amount derivation model Ma is configured by, for example, a convolutional neural network (CNN) having a convolutional layer and a pooling layer in the intermediate layer.
  • Examples of CNN models include the 16-layer CNN (VGG16) of the Oxford Visual Geometry Group, the Inception model (GoogLeNet) of Google, and the 152-layer CNN (ResNet) of Kaiming He.
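  • To make the roles of the convolutional layer and the pooling layer concrete, here is a minimal pure-Python sketch of the two operations. It is not the feature amount derivation model Ma itself: the 5x5 image and the hand-written kernel are made up, whereas a real CNN learns its kernels during training.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, as used in CNN convolutional layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# 5x5 binary image with a vertical stroke (a stand-in for a bond line).
image = [[0, 0, 1, 0, 0] for _ in range(5)]
# Hand-written kernel that responds to vertical strokes in its centre column.
kernel = [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]]
features = max_pool(conv2d(image, kernel))
print(features)
```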
  • Each region in the target image is also specified. Specifically, each component included in the structural formula shown by the target image is detected, and a region surrounding each detected component is specified for each component. This region-specifying function is given to the feature amount derivation model Ma by the machine learning described later.
  • The feature amount of the image output from the feature amount derivation model Ma is a learned feature amount in the convolutional neural network (CNN), that is, a feature amount specified in the process of general image recognition (pattern recognition). The feature amount of each region derived by the feature amount derivation model Ma is then input to the component output model Mb for each region.
  • The component output model Mb is a model that receives, for each region, the feature amount derived by the feature amount derivation model Ma and outputs, for each region, the component corresponding to that feature amount (for example, the type of the component).
  • the component output model Mb is configured by, for example, a neural network (NN).
  • the component output model Mb specifies a plurality of candidates (component candidates) for each area when outputting the component corresponding to the feature amount of each area of the target image.
  • a softmax function (softmax) is applied to a plurality of candidates specified for each region, and an output probability is calculated for each candidate.
  • The output probability is a numerical value indicating, for each of the plurality of candidates, how likely that candidate is to be the component indicated by the region (its certainty).
  • The total of the n output probabilities (n is a natural number) to which the softmax function is applied is 1.0.
  • The component output model Mb outputs a candidate determined according to the output probability, for example, the candidate having the highest output probability among the plurality of candidates specified for each region, as the component indicated by that region.
  • In this way, each component in the structural formula shown by the target image is determined from the plurality of candidates specified based on the feature amount of each region of the target image, according to the output probability of each candidate.
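  • The candidate selection described above can be sketched as follows. The raw scores are made-up numbers standing in for whatever the component output model Mb computes before the softmax function is applied, and `pick_component` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Convert raw candidate scores into output probabilities that sum to 1.0."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_component(candidates, scores):
    """Return the candidate with the highest output probability, and that probability."""
    probs = softmax(scores)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs[best]


candidates = ["Bend C", "O", "Double"]
scores = [2.0, 0.5, 0.1]  # made-up scores for one region
label, prob = pick_component(candidates, scores)
print(label, round(prob, 3))
```

  • Keeping the probability alongside the chosen label matches the description of storing the output probability together with the component type.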
  • The discriminative model M1 described above (in other words, each of the two models Ma and Mb) is constructed by machine learning using a plurality of learning data sets, each consisting of a learning image showing one component in the structural formula of a compound and a label (correct label) of that component.
  • The number of learning data sets used for machine learning should be large, preferably 50,000 or more, from the viewpoint of improving the learning accuracy.
  • In the present embodiment, the machine learning is supervised learning and the method is deep learning (that is, a multi-layer neural network), but the method is not limited thereto.
  • the type (algorithm) of machine learning may be unsupervised learning, semi-supervised learning, reinforcement learning, or transduction.
  • the machine learning technique may be genetic programming, inductive logic programming, support vector machine, clustering, Bayesian network, extreme learning machine (ELM), or decision tree learning.
  • In the machine learning of the neural network, the gradient descent method or the error backpropagation method may be used to minimize the objective function (loss function).
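  • As a small illustration of minimizing a loss function by gradient descent, the sketch below fits a single logistic unit to toy data. It is not the training procedure of the discriminative model M1; the data, learning rate, and epoch count are arbitrary. The gradient is obtained with the chain rule, which is the same rule that error backpropagation applies layer by layer in a multi-layer network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, samples, labels):
    """Mean cross-entropy loss (the objective function being minimised)."""
    total = 0.0
    for x, y in zip(samples, labels):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(samples)


def train(samples, labels, lr=0.5, epochs=200):
    """Fit weight w and bias b of a one-input logistic unit by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(w * x + b) - y  # d(loss)/d(z) for cross-entropy loss
            grad_w += error * x             # chain rule: d(z)/d(w) = x
            grad_b += error                 # chain rule: d(z)/d(b) = 1
        w -= lr * grad_w / len(samples)     # step against the gradient
        b -= lr * grad_b / len(samples)
    return w, b


samples = [-2.0, -1.0, 1.0, 2.0]  # toy one-dimensional inputs
labels = [0, 0, 1, 1]             # toy correct labels
w, b = train(samples, labels)
print(loss(0.0, 0.0, samples, labels), loss(w, b, samples, labels))
```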
  • A plurality of learning images showing components that have the same chemical structure but different description styles may be used. For example, as shown in FIG. 4, when a certain component (a hexylene group in FIG. 4) can be described in equivalent description styles, machine learning may be performed using a learning image prepared for each description style. Alternatively, machine learning may be performed using a plurality of learning images showing components that have the same chemical structure but differ in the thickness, length, or orientation of the bond lines between atoms.
  • In this way, the discriminative model M1 (strictly speaking, the feature amount derivation model Ma) that derives a common feature amount from a plurality of learning images is constructed by machine learning. For example, supervised learning is performed with the same label (correct label) of "hexylene group" attached to each of the two learning images in FIG. 4 that show a hexylene group in different description styles. As a result, a discriminative model M1 is constructed that can derive a common feature amount from the learning images showing the two hexylene groups in different description styles and output the same component (hexylene group) for each image.
  • the information processing device 10 is a computer in which a processor 11, a memory 12, an external interface 13, an input device 14, an output device 15, and a storage 16 are electrically connected to each other.
  • the information processing device 10 is composed of one computer, but the information processing device 10 may be composed of a plurality of computers.
  • the processor 11 is configured to execute a program 21 described later and perform a process for exerting the functions of the information processing device 10 described above.
  • the processor 11 is composed of one or a plurality of CPUs (Central Processing Units) and a program 21 described later.
  • The hardware processor constituting the processor 11 is not limited to a CPU and may be an FPGA (Field Programmable Gate Array), DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), GPU (Graphics Processing Unit), MPU (Micro-Processing Unit), another IC (Integrated Circuit), or a combination thereof. Further, the processor 11 may be a single IC chip that exerts the functions of the entire information processing device 10, as represented by an SoC (System on Chip).
  • The hardware processor described above may be an electric circuit (circuitry) in which circuit elements such as semiconductor elements are combined.
  • The memory 12 is composed of semiconductor memories such as ROM (Read Only Memory) and RAM (Random Access Memory). It provides a work area to the processor 11 by temporarily storing programs and data, and also temporarily stores various data generated by the processing executed by the processor 11.
  • the memory 12 stores a program 21 for making the computer function as the information processing device 10 of the present embodiment.
  • The program 21 includes the following programs p1 to p5.
  • p1: Program for constructing the identification model M1 by machine learning
  • p2: Program for detecting the target image from the document on which the target image is posted
  • p3: Program for identifying each component in the structural formula indicated by the target image
  • p4: Program for storing element information about the identified component
  • p5: Program for searching for the target compound corresponding to the search compound from the target compounds in which the element information is stored
  • the program 21 may be acquired by reading it from a computer-readable recording medium, or may be acquired by receiving (downloading) it through a network such as the Internet or an intranet.
  • the external interface 13 is an interface for connecting to an external device.
  • the information processing device 10 communicates with an external device, for example, a scanner or another computer on the Internet via the external interface 13. Through such communication, the information processing apparatus 10 can acquire data for machine learning and also acquire a document on which a target image is posted.
  • the input device 14 includes, for example, a mouse and a keyboard, and accepts a user's input operation.
  • the information processing device 10 can acquire data for machine learning, for example, by a user drawing a component through the input device 14. Further, when searching for a target compound corresponding to the search compound, the user operates the input device 14 to input information about the search compound. As a result, the information processing apparatus 10 can acquire input information regarding the search compound.
  • the output device 15 is, for example, a device including a display, a speaker, or the like, for displaying a target compound searched based on input information (that is, a target compound corresponding to the search compound) or reproducing a voice. Further, the output device 15 can output the element information stored for each target compound in the database.
  • The storage 16 is composed of, for example, a flash memory, HDD (Hard Disc Drive), SSD (Solid State Drive), FD (Flexible Disc), MO disk (Magneto-Optical disc), CD (Compact Disc), DVD (Digital Versatile Disc), SD card (Secure Digital card), USB memory (Universal Serial Bus memory), or the like.
  • Various data including data for machine learning are stored in the storage 16. Further, the storage 16 also stores various models constructed by machine learning, including the identification model M1.
  • element information about each component in the structural formula of the target compound identified by the identification model M1 is stored in association with the target compound.
  • the element information database 22 shown in FIG. 2 is constructed in the storage 16.
  • the database 22 stores element information about each component included in the structural formula of the target compound, specifically, the type and arrangement position of the component for each target compound.
  • The type of the component stored in the database 22 is the type having the highest output probability calculated by the discriminative model M1, and that output probability (denoted "accuracy" in the figure) is stored together with it.
  • The arrangement position of the component stored in the database 22 is a position in the coordinate space whose origin is the reference position of the target image. For example, it is represented by the representative position of the rectangular area surrounding the component, its length in the X direction, and its length in the Y direction.
  • the element information for each component in the structural formula of the target compound is stored in association with the information related to the document in which the image (target image) showing the structural formula is posted.
  • Information about the document includes, for example, the title of the paper when the document is a paper, the issue number when the document is a gazette, the page on which the target image is posted in the document, and its placement position on that page.
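  • One way to picture the element information database 22 is as a single table keyed by compound and document. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table name, column names, and sample rows are assumptions for illustration and do not reproduce the layout of FIG. 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE element_info (
        compound_id TEXT NOT NULL,     -- hypothetical identifier of the target compound
        document    TEXT NOT NULL,     -- e.g. paper title or gazette issue number
        page        INTEGER,           -- page on which the target image appears
        component   TEXT NOT NULL,     -- type with the highest output probability
        probability REAL NOT NULL,     -- that output probability ("accuracy")
        x INTEGER, y INTEGER,          -- representative position in pixels
        width INTEGER, height INTEGER  -- lengths in the X and Y directions
    )
    """
)

# Made-up rows for two compounds.
rows = [
    ("compound-A", "Example paper title", 3, "Bend C", 0.97, 10, 20, 14, 14),
    ("compound-A", "Example paper title", 3, "Double", 0.91, 24, 22, 18, 6),
    ("compound-B", "Example gazette issue", 7, "O", 0.88, 5, 5, 12, 12),
]
conn.executemany("INSERT INTO element_info VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)


def documents_containing(component):
    """List (compound_id, document, page) for compounds that contain a component type."""
    cursor = conn.execute(
        "SELECT DISTINCT compound_id, document, page FROM element_info "
        "WHERE component = ? ORDER BY compound_id",
        (component,),
    )
    return cursor.fetchall()


hits = documents_containing("Double")
print(hits)
```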
  • the storage 16 is a device built in the information processing device 10, but the storage 16 is not limited to this, and the storage 16 is an external device connected to the information processing device 10. It may be included. Further, the storage 16 may include an external computer (for example, a server computer for a cloud service) that is communicably connected via a network. In this case, a part or all of the above-mentioned database 22 may be stored in an external computer constituting the storage 16.
  • the hardware configuration of the information processing device 10 is not limited to the above configuration, and constituent devices can be added, omitted, or replaced as appropriate according to a specific embodiment.
  • the information processing flow of the present embodiment proceeds in the order of the learning phase S001, the database construction phase S002, and the search phase S003. Each phase will be described below.
  • the learning phase S001 is a phase in which machine learning is performed in order to build a model required in the subsequent phases.
  • the first machine learning S011, the second machine learning S012, and the third machine learning S013 are carried out.
  • the first machine learning S011 is machine learning for constructing the discriminative model M1, and is carried out using a learning image showing one component of the structural formula of the compound as described above.
  • supervised learning is carried out as the first machine learning S011.
  • a learning image and a label (correct answer label) of one component indicated by the learning image are used.
  • the discriminative model M1 (strictly speaking, the feature amount derivation model Ma) for deriving a common feature amount from a plurality of learning images is constructed.
  • the second machine learning S012 is machine learning for constructing a model (hereinafter referred to as an image detection model) for detecting the image from a document in which an image showing the structural formula of the compound is posted.
  • the image detection model is a model for detecting an image of a structural formula from a document by using an object detection algorithm.
  • R-CNN (Region-based CNN)
  • Fast R-CNN
  • YOLO (You Only Look Once)
  • SSD (Single Shot MultiBox Detector)
  • the learning data (teacher data) used for the second machine learning S012 is created by applying an annotation tool to a learning image showing the structural formula of the compound.
  • the annotation tool is a tool that adds related information such as a correct label (tag) and coordinates of an object to the target data as annotations.
  • Learning data is created by starting the annotation tool, displaying the document containing the learning image, surrounding the area showing the structural formula of the compound with a bounding box, and annotating that area.
  • As the annotation tool, for example, labelImg by tzutalin or VoTT by Microsoft can be used.
  • an image detection model which is an object detection model in the YOLO format is constructed.
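As an illustration of the teacher data produced by annotation, the sketch below converts a bounding box drawn around a structural formula into one line of a YOLO-style label file. It assumes the widely used text convention of `class x_center y_center width height`, with all coordinates normalized by the image size; the function name and the pixel values are hypothetical and do not come from the description.

```python
def to_yolo_label(class_id, box, image_width, image_height):
    """Convert a pixel bounding box (x_min, y_min, x_max, y_max) into a
    YOLO-style label line with coordinates normalized to the range 0..1."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Class 0 is assumed here to mean "structural formula"; the page is 1000 x 1000 pixels.
label_line = to_yolo_label(0, (100, 200, 300, 400), 1000, 1000)
```

For the sample box, `label_line` is `0 0.200000 0.300000 0.200000 0.200000`.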
  • The third machine learning S013 is machine learning for constructing a model (hereinafter referred to as a search model) that searches, among a plurality of target compounds whose element information is stored in the database 22, for a target compound corresponding to a search compound.
  • The search model of the present embodiment is a model that retrieves, from the target compounds whose element information is stored in the database 22, a target compound whose structural formula is identical or similar to that of the search compound.
  • the input information is information about each component included in the structural formula of the search compound, and is, for example, image information indicating the structural formula of the search compound.
  • The input information may be other information as long as it can specify at least a part of the structural formula of the search compound (that is, information that can serve as a key when searching the database 22 for the search compound).
  • it may be image information showing some components in the structural formula of the search compound.
  • information corresponding to the element information (for example, information indicating the type of the component in the structural formula and the arrangement position of the component in the structural formula) may be used as the input information.
  • a part or all of the structural formula of the search compound may be drawn by known structural formula drawing software such as ChemDraw (registered trademark) and RDKit, and the drawing data may be used as input information.
  • the search model is composed of a search compound specific model and a similarity evaluation model.
  • the search compound specific model is a model that specifies the structural formula of the search compound indicated by the input information.
  • When image information is input to the search compound specific model as the input information, the model outputs information about each component in the structural formula shown by the image (for example, information indicating the type of each component and its arrangement position in the structural formula).
  • The above-mentioned discrimination model M1 may be reused as the search compound specific model, in which case transfer learning may be performed as the machine learning.
  • the similarity evaluation model evaluates the similarity between the structural formula of the search compound specified by the search compound specific model and the structural formula of the target compound in which the element information of each component is stored in the database 22.
  • the similarity is evaluated based on the element information about the constituent elements included in the structural formula of the search compound and the element information about the constituent elements included in the structural formula of the target compound.
  • The algorithm of the similarity evaluation model is not particularly limited; for example, a known algorithm for evaluating the similarity between images or the similarity between texts can be used.
  • an algorithm can be used that vectorizes the element information about the constituent elements included in the structural formula and calculates the similarity between the vectors by an index such as the Euclidean distance.
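A minimal sketch of the vector-based approach follows. It assumes a count-per-type vectorization over a small, hypothetical vocabulary of component types, and it maps the Euclidean distance to a similarity with `1 / (1 + distance)`; both choices are illustrative assumptions, since the description leaves the vectorization and the index open.

```python
import math
from collections import Counter

# Hypothetical vocabulary of component types; a real one would be much larger.
COMPONENT_TYPES = ["C", "N", "O", "single_bond", "double_bond"]


def vectorize(component_types):
    """Count how often each known component type occurs in a structural formula."""
    counts = Counter(component_types)
    return [float(counts[t]) for t in COMPONENT_TYPES]


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v):
    """Map a distance to a similarity in (0, 1]; identical vectors give 1.0."""
    return 1.0 / (1.0 + euclidean_distance(u, v))


query = vectorize(["C", "C", "O", "single_bond", "double_bond"])
target = vectorize(["C", "N", "single_bond"])
score = similarity(query, target)
```

This sketch ignores the arrangement positions stored with the element information; a fuller vectorization would also need to encode them.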
  • The similarity is preferably high among a plurality of structural formulas written in different description formats for the same compound. Such formulas differ in how each functional group is drawn (for example, the direction of the bond lines) and in the position of each atom, and the similarity evaluation should absorb these differences. For example, the same label (correct label) may be attached to each of a plurality of structural formulas that are recorded in the database 22 and written in different description formats for the same compound, and the similarity evaluation model may be constructed by machine learning using those labels.
  • the similarity evaluation method is not limited to the machine learning method.
  • Each component in the structural formula may be collated between the search compound and the target compound according to a predetermined collation rule, and the similarity may be evaluated based on the collation results.
  • the target compound in which the element information of each component is stored in the database 22 may be clustered based on the element information, and the similarity may be evaluated by specifying the cluster to which the search compound belongs.
  • the third machine learning S013 is carried out by using the element information about each component in the structural formula stored in the database 22 for each target compound and the learning information about the structural formula of the compound.
  • the learning information is, for example, information indicating the type and arrangement position of each component in the structural formula of the compound selected for the third machine learning S013. Then, by carrying out the third machine learning, the above-mentioned search model is constructed.
  • The database construction phase S002 is a phase in which the structural formula of the target compound indicated by the image (target image) included in the document is identified and the element information for each component in the structural formula is stored to construct the database 22.
  • the processor 11 of the information processing device 10 applies the above-mentioned image detection model to the document including the target image, and detects the target image in the document (S021). That is, in this step S021, the processor 11 detects the target image from the document by using the object detection algorithm (specifically, YOLO).
  • The processor 11 detects a plurality of target images (in FIG. 7, the images of the portions surrounded by broken lines) from the above document.
  • The processor 11 identifies each component in the structural formula of the target compound based on the feature amount of each region of the target image by the identification model M1 (S022). Specifically, the processor 11 inputs the target image detected in step S021 into the identification model M1. Within the identification model M1, the feature amount derivation model Ma in the preceding stage outputs the feature amount of each region of the target image. The component output model Mb in the subsequent stage outputs the component (strictly speaking, the type of component) based on the input feature amount of each region. At this time, a plurality of candidate components corresponding to each region are specified based on the feature amount of each region, and the output probability is calculated for each candidate.
  • the component output model Mb outputs the candidate having the highest output probability as the component indicated by each area.
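The selection of the most probable candidate can be sketched as below. The use of a softmax to turn raw scores into output probabilities is an assumption made for illustration; the description only states that an output probability is calculated for each candidate and that the highest one is output. The function names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1 (numerically stable)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_component(candidates, scores):
    """Return the candidate with the highest output probability and that probability."""
    probabilities = softmax(scores)
    best = max(range(len(candidates)), key=probabilities.__getitem__)
    return candidates[best], probabilities[best]


# Hypothetical candidates and raw scores for one region of the target image.
component, probability = pick_component(["C", "O", "N"], [2.0, 0.5, 0.1])
```

The returned probability is the value that would be stored with the component type as the "accuracy" in the element information database.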
  • the structural formula indicated by the target image (that is, the structural formula of the target compound)
  • When a plurality of target images are detected in step S021, the processor 11 inputs them into the identification model M1 one target image at a time. As a result, for each of the plurality of target images, each component in the structural formula of the target compound indicated by the target image is identified.
  • the processor 11 acquires element information for each component in the structural formula of the identified target compound, and stores the acquired element information (S023). At this time, the processor 11 stores the element information for each component in association with the target compound including each component in the structural formula.
  • The element information for each component is stored in association with the information of the document in which the image (target image) of the structural formula composed of each component is posted (see FIG. 2).
  • Step S023 is repeated each time each component in the structural formula of the new target compound is identified.
  • the element information for each component in the structural formula of the target compound is accumulated, and the element information database 22 is constructed.
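The flow of the database construction phase (detect the target images, identify the components, store the element information) can be sketched as a simple loop. The detector and the identifier are passed in as functions because the trained models themselves are outside the scope of this sketch; the stand-in functions below are purely hypothetical placeholders, not the image detection model or the identification model M1.

```python
def build_database(documents, detect_images, identify_components):
    """Accumulate element information per target image, keyed by document and image index."""
    database = {}
    for document_id, document in documents.items():
        # Detection step: find every target image in the document.
        for index, image in enumerate(detect_images(document)):
            # Identification step: derive the components shown by the image.
            elements = identify_components(image)
            # Storage step: keep the elements together with the source document.
            database[(document_id, index)] = elements
    return database


# Hypothetical stand-ins: a "document" is a string of images separated by "|",
# and each character of an "image" plays the role of one identified component.
def fake_detect(document):
    return document.split("|")


def fake_identify(image):
    return list(image)


database = build_database({"doc-1": "CO|CN"}, fake_detect, fake_identify)
```

Keying each entry by document and image index keeps the link between a target compound and the document in which its structural formula was found.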
  • the target compound whose element information is stored in the database 22 can be searched using the element information as a key in the later search phase S003.
  • the search phase S003 is a phase for searching for a target compound corresponding to the search compound from the target compounds whose element information is stored in the database 22.
  • a "search compound” is a compound that is a search target and information about a part or all of its structural formula is acquired as input information when a search is performed.
  • the processor 11 of the information processing device 10 acquires the input information regarding the search compound (S031).
  • the processor 11 acquires information about each component included in the structural formula of the search compound as input information. Examples of such information include image information showing the structural formula of the search compound.
  • After acquiring the input information, the processor 11 uses the above-mentioned search model to search the target compounds whose element information is stored in the database 22 for the target compound corresponding to the search compound (S032). Specifically, using the search model, the processor 11 calculates the similarity between the search compound and each target compound based on the acquired input information and the element information stored in association with the target compound in the database 22. In the present embodiment, the similarity of the structural formulas is calculated between the search compound indicated by the input information and each target compound whose element information is stored in the database 22.
  • The processor 11 then retrieves (selects), from the target compounds whose element information is stored in the database 22, a target compound whose calculated similarity satisfies the search condition, as the compound corresponding to the search compound.
  • the search condition is a condition determined in advance for selecting a target compound corresponding to the search compound based on the calculation result of the similarity.
  • a predetermined number of target compounds are searched as search compounds in descending order of similarity.
  • the present invention is not limited to this, and for example, only the target compound having the highest degree of similarity may be searched for as the search compound.
  • a target compound having a similarity equal to or higher than a reference value may be searched for as a search compound.
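The three search conditions mentioned above (a predetermined number in descending order, only the most similar compound, or every compound at or above a reference value) can be sketched as follows. The function names, compound identifiers, and similarity values are hypothetical.

```python
def top_n(similarities, n):
    """Return the n most similar target compounds in descending order of similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def best_only(similarities):
    """Return only the target compound with the highest similarity."""
    return top_n(similarities, 1)


def above_threshold(similarities, reference_value):
    """Return every target compound whose similarity is at or above the reference value."""
    ranked = top_n(similarities, len(similarities))
    return [(compound, score) for compound, score in ranked if score >= reference_value]


# Hypothetical similarities between the search compound and three target compounds.
similarities = {"A": 0.9, "B": 0.4, "C": 0.75}
```

With these sample values, `top_n(similarities, 2)` and `above_threshold(similarities, 0.5)` both return compounds A and C, while `best_only(similarities)` returns only A.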
  • the processor 11 outputs the information of the searched target compound by the output device 15, and displays the search result on the screen as shown in FIG. 8, for example.
  • Examples of the searched information on the target compound include a document and a page on which an image showing the structural formula of the target compound is posted. Further, as shown in FIG. 8, it is preferable to output the search result of the target compound and the similarity between the searched target compound and the search compound.
  • partial structure information indicating a part of the constituent elements included in the structural formula of the search compound (hereinafter, referred to as "partial structure" for convenience) is acquired.
  • the target compound containing the partial structure is searched as the search compound.
  • the degree of similarity between the partial structure included in the structural formula and the partial structure indicated by the input information is calculated.
  • a predetermined number of target compounds are searched as search compounds in descending order of similarity.
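One simple way to score a partial-structure match is to measure what fraction of the components in the partial structure also occur in the target compound. The sketch below does this with multiset intersection over component types. It is an assumption-laden simplification: it ignores arrangement positions and bonding order, and the function name and sample components are hypothetical.

```python
from collections import Counter


def partial_structure_similarity(partial_types, target_types):
    """Fraction of the partial structure's components that the target also contains."""
    partial = Counter(partial_types)
    target = Counter(target_types)
    if not partial:
        return 0.0
    # Multiset intersection keeps, per type, the smaller of the two counts.
    matched = sum((partial & target).values())
    return matched / sum(partial.values())


partial = ["C", "O", "double_bond"]
full_match = partial_structure_similarity(partial, ["C", "C", "O", "double_bond", "N"])
weak_match = partial_structure_similarity(partial, ["C", "N"])
```

A score of 1.0 means every component of the partial structure is present in the target compound; ranking target compounds by this score gives the descending order used by the search.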
  • The information processing device 10 of the present embodiment uses the discrimination model M1 constructed by the first machine learning S011 and can identify each component in the structural formula based on the feature amount of each region in the image (target image) showing the structural formula of the target compound. Further, the information processing device 10 of the present embodiment stores element information about the identified component in association with the target compound, and constructs the database 22. The element information stored in the database 22 can be used as a search key when searching for the target compound thereafter.
  • Each component in the structural formula can be identified from the feature amount of each region of the target image by using the identification model M1, which is the result of machine learning. That is, in the present embodiment, even if the way the structural formula is written changes, the feature amount of each region of the image showing the structural formula can be specified, and once the feature amount is specified, the component can be determined (identified) from it. Then, since the element information about the identified component is stored in association with the target compound in a database, the target compound can thereafter be searched for by using the element information as a search key.
  • each component in the structural formula can be satisfactorily identified even when the writing method of the structural formula of the target compound is changed. Then, the target compound can be appropriately searched by using the element information for each identified component as a search key.
  • The computer constituting the information processing device may be a server used for an ASP (Application Service Provider), SaaS (Software as a Service), PaaS (Platform as a Service), IaaS (Infrastructure as a Service), or the like.
  • a user who uses a service such as the ASP operates a terminal (not shown) to transmit input information regarding the search compound to the server.
  • When the server receives the input information, it searches, based on the input information, the target compounds whose element information is stored for the target compound corresponding to the search compound. Then, the server outputs (transmits) information about the search result (that is, the target compound corresponding to the search compound) to the user's terminal. On the user side, the information sent from the server (that is, the search result) is displayed or played back as audio.
  • a functional group (atomic group) containing a plurality of atoms may be a constituent element.
  • the information indicating the type of the component may be the information indicating the chemical formula of the functional group corresponding to the component.
  • a plurality of functional groups adjacent to each other may be used as a component, or each fragment when the structural formula is divided according to an arbitrary rule may be used as a component.
  • The information indicating the type of the component may be information consisting of a part of the molecular fingerprint of the structural formula of the target compound.
  • the molecular fingerprint is a binary multidimensional vector that indicates the presence or absence of a component in the structural formula for each type of component. For example, for the functional group shown on the left side of FIG. 9, the molecular fingerprint shown on the right side of FIG. 9 is set.
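A binary fingerprint of this kind can be sketched as a list with one bit per component type. The vocabulary of functional groups below is hypothetical, and the Tanimoto coefficient is shown only as one commonly used way to compare binary fingerprints; the description does not name a comparison measure.

```python
# Hypothetical vocabulary: one bit per functional-group type.
FRAGMENT_TYPES = ["OH", "COOH", "NH2", "C6H5"]


def fingerprint(present_fragments):
    """Binary vector with 1 where the fragment type occurs in the structural formula."""
    return [1 if fragment in present_fragments else 0 for fragment in FRAGMENT_TYPES]


def tanimoto(a, b):
    """Shared set bits divided by the bits set in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


phenol_like = fingerprint({"OH", "C6H5"})
acid_like = fingerprint({"OH", "COOH", "C6H5"})
overlap = tanimoto(phenol_like, acid_like)
```

Here `phenol_like` is `[1, 0, 0, 1]`, and a slice such as `phenol_like[:2]` corresponds to using only a part of the fingerprint as the type information.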
  • machine learning (first to third machine learning) for constructing various models is performed by the information processing device 10, but the present invention is not limited to this. Some or all machine learning may be performed by another device (computer) different from the information processing device 10.
  • The information processing device 10 acquires a model constructed by machine learning performed by another device. For example, when the first machine learning is performed by another device, the information processing device 10 acquires the identification model M1 from the other device and uses the acquired identification model M1 to identify each component of the structural formula shown by the target image.

PCT/JP2020/040861 2019-12-26 2020-10-30 Information processing apparatus, information processing method, and program Ceased WO2021131324A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
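CN202080089203.6A CN114868192B (zh) 2019-12-26 2020-10-30 Information processing apparatus, information processing method, and program
JP2021566876A JP7449961B2 (ja) 2019-12-26 2020-10-30 Information processing apparatus, information processing method, and program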
US17/844,033 US12362045B2 (en) 2019-12-26 2022-06-19 Information processing apparatus, information processing method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-236342 2019-12-26
JP2019236342 2019-12-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/844,033 Continuation US12362045B2 (en) 2019-12-26 2022-06-19 Information processing apparatus, information processing method, and program

Publications (1)

Publication Number Publication Date
WO2021131324A1 true WO2021131324A1 (ja) 2021-07-01

Family

ID=76574083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/040861 Ceased WO2021131324A1 (ja) 2019-12-26 2020-10-30 Information processing apparatus, information processing method, and program

Country Status (4)

Country Link
US (1) US12362045B2
JP (1) JP7449961B2
CN (1) CN114868192B
WO (1) WO2021131324A1

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
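CN114581924A (zh) * 2022-03-01 2022-06-03 苏州阿尔脉生物科技有限公司 Method and device for extracting elements from chemical reaction flowcharts
WO2023020210A1 (zh) * 2021-08-16 2023-02-23 中国科学院上海药物研究所 Chemical structural formula recognition method and apparatus, storage medium, and electronic device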

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114846508B (zh) * 2019-12-16 2025-06-27 富士胶片株式会社 Image analysis apparatus, image analysis method, and computer program product
CN116071554A (zh) * 2023-02-21 2023-05-05 北京英飞智药科技有限公司 Chemical structure recognition method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013061886A (ja) * 2011-09-14 2013-04-04 Kyushu Univ Chemical structure diagram recognition system and computer program for chemical structure diagram recognition system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
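JPH04114560A (ja) * 1990-09-04 1992-04-15 Sharp Corp Automatic document input device
ES2150926T3 (es) * 1993-06-30 2000-12-16 Ibm Method for image segmentation and classification of image elements for document processing
CN102436447A (zh) * 2010-09-29 2012-05-02 国际商业机器公司 Method and system for processing and matching chemical substance information, and storage system
JP5974838B2 (ja) * 2012-11-06 2016-08-23 富士通株式会社 Information providing method, information providing apparatus, and information providing program
JP6051988B2 (ja) 2013-03-19 2016-12-27 富士通株式会社 Information processing program, information processing method, and information processing apparatus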
US10372713B1 (en) * 2014-07-10 2019-08-06 Purdue Pharma L.P. Chemical formula extrapolation and query building to identify source documents referencing relevant chemical formula moieties
WO2018103642A1 (en) * 2016-12-05 2018-06-14 Patsnap Systems, apparatuses, and methods for searching and displaying information available in large databases according to the similarity of chemical structures discussed in them
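JPWO2019048965A1 (ja) * 2017-09-06 2020-10-22 株式会社半導体エネルギー研究所 Physical property prediction method and physical property prediction system
CN108062529B (zh) * 2017-12-22 2024-01-12 上海鹰谷信息科技有限公司 Intelligent recognition method for chemical structural formulas
CN108334839B (zh) * 2018-01-31 2021-09-14 青岛清原精准农业科技有限公司 Chemical information recognition method based on deep-learning image recognition technology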
US11093842B2 (en) 2018-02-13 2021-08-17 International Business Machines Corporation Combining chemical structure data with unstructured data for predictive analytics in a cognitive system
EP3540610B1 (en) 2018-03-13 2024-05-01 Ivalua Sas Standardized form recognition method, associated computer program product, processing and learning systems
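CN110265091A (zh) * 2019-06-26 2019-09-20 王乔健 Chemical information query method and apparatus, and electronic chemical dictionary
CN114846508B (zh) * 2019-12-16 2025-06-27 富士胶片株式会社 Image analysis apparatus, image analysis method, and computer program product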

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Estimating Molecular Formulae from Structural Formula Images in Rdkit", QUIITA, 31 January 2019 (2019-01-31), Retrieved from the Internet <URL:https://qiita.com/nishiha/items/f20f9942alc35elealfd> [retrieved on 20210125] *
ITO, HIDEO: "Image Searching in Patent Information Services", JAPAN PATENT OFFICE TECHNOLOGY FORUM, 30 January 2009 (2009-01-30), pages 66 - 70, XP055837960, Retrieved from the Internet <URL:http://www.tokugikon.jp/gikonshi/252tokusyu8.pdf> *

Also Published As

Publication number Publication date
US12362045B2 (en) 2025-07-15
CN114868192B (zh) 2025-06-27
JPWO2021131324A1 2021-07-01
CN114868192A (zh) 2022-08-05
JP7449961B2 (ja) 2024-03-14
US20220327158A1 (en) 2022-10-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20907733

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021566876

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20907733

Country of ref document: EP

Kind code of ref document: A1

WWG Wipo information: grant in national office

Ref document number: 202080089203.6

Country of ref document: CN