KR101801226B1 - Classification Algorithm for Chemical Compound Using InChI - Google Patents

Classification Algorithm for Chemical Compound Using InChI

Publication number
KR101801226B1
Authority
KR
South Korea
Prior art keywords
layer
compound
class
atomic
classified
Prior art date
Application number
KR1020150120964A
Other languages
Korean (ko)
Other versions
KR20170025071A (en)
Inventor
강정원
강성신
Original Assignee
고려대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 고려대학교 산학협력단
Priority to KR1020150120964A
Priority to PCT/KR2016/009273 (WO2017034280A1)
Publication of KR20170025071A
Application granted
Publication of KR101801226B1

Classifications

    • G06F19/707
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00 Subject matter not provided for in other main groups of this subclass
    • G06F19/709

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method of classifying a compound according to the present invention comprises the steps of: (a) inputting an InChI (International Chemical Identifier) identifier of a compound; (b) classifying the layers using the identifier to confirm whether the compound is an organic compound or an inorganic compound; (c) performing no classification in the case of an inorganic compound and, in the case of an organic compound, determining the atomic class by identifying the constituent atoms; (d) determining a main class by identifying a functional group; (e) determining a subclass by checking the structure; and (f) generating a classification string including the atomic class, the main class, and the subclass. By identifying the functional group and the structure using InChI among the compound identifiers, organic compounds in a database can be classified accurately in a short time, and the physical properties of a compound can be predicted easily.

Figure 112015083322110-pat00013

Description

(Classification Algorithm for Chemical Compound Using InChI)

The present invention relates to a method for classifying a compound using the molecular structure of the compound, and more particularly, to a method for classifying a compound by identifying its atomic composition, functional groups, and structure using an InChI (International Chemical Identifier).

As science and technology have developed, a large number of materials have come into use, and their physical and chemical properties, measured through experiments, have been entered into databases. Such databases accumulate experimental data for a large number of compounds and allow users to check the properties of a compound of interest before using it.

For this reason, analyzing, organizing, and storing the compounds in a database broadens the ways in which the data can be used and improves convenience for the user. Accordingly, a database of compounds generally stores, in addition to the name of each compound, an identifier that encodes its structure or the three-dimensional structure itself, which makes it easier to classify similar compounds and to distinguish one compound from another.

The best way to distinguish the compounds contained in a database is to convert the three-dimensional structure of each compound into a one-dimensional character string and compare the strings. The most commonly used notations are SMILES (Simplified Molecular-Input Line-Entry System) and InChI (International Chemical Identifier). The strings generated by these two notations have the advantage of reducing the size of the database compared with storing the three-dimensional structure.

SMILES represents the arrangement and bonding of atoms by writing the atoms contained in the molecule in a linear sequence. A structure written in SMILES is therefore generally readable, which is an advantage for showing simple structures. However, because SMILES does not take the direction and order of atoms into account, it has the disadvantage that complex structures are difficult to distinguish.

InChI is a one-dimensional string notation developed by IUPAC and NIST. It presents information such as the composition, arrangement, and bonding of the atoms contained in a molecule in separate layers. Although it is less readable than SMILES, it can distinguish complex structures and otherwise identical structures, and it can accurately express the hydrogen bonding and resonance structures that SMILES does not capture.

In actual research, in addition to searching for a single property, it is often necessary to retrieve the physical properties of compounds that share a specific characteristic. In particular, the physical properties and reactivity of an organic compound are strongly influenced by its functional groups and structure, so organic compounds need to be classified by functional group or structure. Conventionally, this classification is performed by analyzing the functional groups and structure expressed in the name or identifier of the compound. However, all of these methods must enumerate every possible case when analyzing the structure contained in names and identifiers, which takes a long time and makes the database inefficient.

Conventional techniques classify compounds by the name, structure, and identifier stored in the database. To classify a single structure, however, they search directly for every case that can appear in the name and the identifier. This approach is time-consuming, and if any case is omitted, the classification is not performed.

As a result of intensive efforts to solve the above problems, the present inventors found that the functional groups and structures of organic compounds can be identified using InChI among the compound identifiers, developed this finding into a program that classifies organic compounds in a database effectively and in a short time, and thereby completed the present invention.

It is an object of the present invention to provide a method of classifying compounds that offers high search speed and excellent accuracy and makes the physical properties of a compound easy to predict.

In order to accomplish the above object, the present invention provides a method for classifying a compound, comprising: (a) inputting an InChI (International Chemical Identifier) identifier of a compound; (b) classifying the layers using the identifier to identify the compound as an organic compound or an inorganic compound; (c) performing no classification in the case of an inorganic compound and, in the case of an organic compound, determining the atomic class by identifying the constituent atoms; (d) determining a main class by identifying a functional group; (e) determining a subclass by checking the structure; and (f) generating a classification string including the atomic class, the main class, and the subclass.

According to the present invention, the functional groups and structures of compounds can be classified more accurately, even for compounds that have similar structures but different functional groups. This allows a database user to identify the characteristics of a compound more easily and to analyze the physical properties that result from them.

Figure 1 is a view schematically showing the steps of classifying a compound using an InChI identifier according to the present invention.
Figure 2 is a flow chart illustrating steps for determining an atomic class in accordance with the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. In general, the nomenclature used herein is well known and commonly used in the art.

The present invention is intended to search for the functional groups and structures of organic compounds stored in a database that contains the IUPAC compound identifier InChI, and to classify those compounds according to the functional groups and structures found.

Accordingly, in one aspect, the present invention provides a method for classifying a compound, comprising: (a) inputting an InChI (International Chemical Identifier) identifier of a compound; (b) classifying the layers using the identifier to identify the compound as an organic compound or an inorganic compound; (c) performing no classification in the case of an inorganic compound and, in the case of an organic compound, determining the atomic class by identifying the constituent atoms; (d) determining a main class by identifying a functional group; (e) determining a subclass by checking the structure; and (f) generating a classification string comprising the atomic class, the main class, and the subclass.

Unlike conventional classification methods, the present invention is a method of classifying data through functional groups and structures of compounds using InChI, which is an identifier uniquely assigned to a compound.

In the present invention, classification is carried out into an atomic class, a main class, and a sub class using the InChI identifier of a compound. An atomic class is a classification of an atomic structure. A main class is a classification of a functional group, and a sub class is a classification of a structure.

Figure 1 shows the steps of classifying compounds using InChI identifiers. As shown in FIG. 1, the method of classifying a compound according to the present invention includes InChI input (100), InChI layer decomposition (200), inorganic compound / organic compound separation (300), atomic configuration confirmation (400), functional group identification (500), structure confirmation (600), and classification string generation (700).

The compound identifier used in the present invention is an InChI, which can be input as a character string from text or from a database via the InChI input (100).

The input InChI is decomposed into its layers by the InChI layer decomposition (200); the division criteria and descriptions are shown in Table 1. Of these layers, the main layer and the charge layer are used to classify compounds. The other layers distinguish optical and structural isomers and isotopes, and are not used for classification.

[Table 1]

Division     | Prefix | Name                     | Description
Main layer   | none   | Chemical formula         | Chemical formula of the compound
Main layer   | /c     | Atomic connection        | Connectivity of atoms except hydrogen
Main layer   | /h     | Hydrogen                 | Number of hydrogen atoms connected to each atom
Charge layer | /q     | Charge                   | Net charge
Charge layer | /p     | Positive charge (proton) | Net charge
Stereo layer | /b     | Cis-trans bond           | Classification of cis-trans isomers
Stereo layer | /t, /m | Parity                   | Isomerism of tetrahedral carbon
Stereo layer | /s     | Structural isomer        | Classification of stereoisomers
Extra layer  | /i     | Isotope                  | Isotope indication

In the layer decomposition (200), the main layer is divided according to the prefixes shown in Table 1. After the prefix is removed, each part is classified as the chemical formula layer, the atomic connection layer, or the hydrogen layer. Likewise, the charge layer is divided according to its prefix and, after the prefix is removed, is classified as the charge or the positive charge (proton).
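The layer decomposition described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `split_inchi_layers` and the dictionary keys are assumptions introduced here, and only the first occurrence of each prefix is kept.

```python
def split_inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI string into layers keyed by their prefix letter.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    rest = parts[1:]
    # The chemical formula layer is the only layer without a prefix letter;
    # it starts with an uppercase element symbol.
    if rest and rest[0] and rest[0][0].isupper():
        layers["formula"] = rest[0]
        rest = rest[1:]
    for part in rest:
        if part:
            # Remove the one-letter prefix (c, h, q, p, b, t, m, s, i)
            # and keep the first occurrence of each prefix.
            layers.setdefault(part[0], part[1:])
    return layers


ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"], ethanol["c"], ethanol["h"])
```

For classification, only the `formula`, `c`, `h`, `q`, and `p` entries would be consulted, in line with Table 1.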

The distinction (300) between an inorganic compound and an organic compound is made by examining the chemical formula layer. If there is no carbon in the chemical formula layer, or if no hydrogen layer is present, the compound is classified as an inorganic compound. Among organic compounds, those bonded to a metal are classified as organic salts. Compounds classified as organic at this stage receive an atomic class as their classification; when a compound has no distinctive structure or functional group, in most cases no further classification is assigned.
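The separation rule can be expressed in a few lines. The sketch below assumes the chemical formula layer and the hydrogen layer have already been extracted as strings; the function name and the `METALS` set are hypothetical, and the metal list is deliberately incomplete.

```python
import re
from typing import Optional

# Hypothetical, incomplete set of metal symbols used only for illustration.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}


def organic_or_inorganic(formula: str, hydrogen_layer: Optional[str]) -> str:
    """Apply the rule: no carbon or no hydrogen layer means inorganic."""
    # An element symbol is one uppercase letter plus an optional lowercase
    # letter, so "Cl" is never mistaken for carbon.
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    if "C" not in elements or not hydrogen_layer:
        return "inorganic"
    if elements & METALS:
        return "organic salt"
    return "organic"


print(organic_or_inorganic("C2H6O", "3H,2H2,1H3"))
print(organic_or_inorganic("CO2", None))
```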

Compounds classified as organic through the distinction (300) between inorganic and organic compounds are assigned an atomic class according to the atoms they contain. The atomic class is determined by which atoms other than carbon and hydrogen are present, and is obtained by analyzing the chemical formula layer using the procedure shown in Figure 2: (a1) loading the chemical formula layer (i = 0); (a2) if character (i + 1) of the chemical formula layer is a lowercase letter, identifying the element and setting i = i + 2, and otherwise identifying the element and setting i = i + 1; (a3) if character (i) is a digit, setting i = i + 1 and returning to the beginning of (a3); and (a4) terminating the procedure if the end of the string has been reached, and otherwise repeating from (a2).

The atomic classes are shown in Table 2, but they are not limited to the entries shown in Table 2.

In Fig. 2, the atomic composition is confirmed by reading the characters of the chemical formula layer. The layer is loaded (410) and read from the first character of the string. The element is identified by checking the element symbol (420). The first letter of an element symbol is always uppercase, so if the next character is a lowercase letter the symbol has two letters, and if it is an uppercase letter or a digit the symbol has one letter. The element is identified from the symbol obtained (430). The character immediately after the symbol is then checked (440); if it is a digit, the number of atoms is read. By repeating this procedure until the end of the string (460), the composition of all the atoms contained in the compound can be confirmed.
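Steps (a1) to (a4) amount to a single left-to-right scan of the chemical formula layer. The following sketch follows those steps under the assumption that the layer describes one component (no "." separators or leading multipliers); the names `parse_formula` and `guest_atoms` are introduced here for illustration only.

```python
def parse_formula(formula: str) -> dict[str, int]:
    """Count atoms in a single-component chemical formula layer."""
    counts: dict[str, int] = {}
    i = 0  # (a1) load the layer and start at the first character
    while i < len(formula):  # (a4) stop when the string ends
        if not formula[i].isupper():
            raise ValueError(f"unexpected character {formula[i]!r} at {i}")
        # (a2) a lowercase next character means a two-letter symbol
        if i + 1 < len(formula) and formula[i + 1].islower():
            symbol = formula[i:i + 2]
            i += 2
        else:
            symbol = formula[i]
            i += 1
        # (a3) consume the digits that follow the symbol
        start = i
        while i < len(formula) and formula[i].isdigit():
            i += 1
        number = int(formula[start:i]) if i > start else 1
        counts[symbol] = counts.get(symbol, 0) + number
    return counts


def guest_atoms(counts: dict[str, int]) -> list[str]:
    """Elements other than carbon and hydrogen, which decide the atomic class."""
    return sorted(el for el in counts if el not in ("C", "H"))


print(parse_formula("C2H5Cl"), guest_atoms(parse_formula("C2H5Cl")))
```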

The atomic class is identified through the atomic configuration confirmation (400), and the main class is determined through the functional group identification (500). The physicochemical properties of most organic compounds depend on their functional groups, and the functional groups in turn depend on the constituent elements of the compound. The classification into main classes therefore varies with the atomic class, but because elements of the same group have similar bonding tendencies, compounds can be classified by functional group using the atomic connection layer and the hydrogen layer. For elements of Group 14, Group 15, Group 16, and Group 17, the main class may be alkane, alkene, alkyne, naphthene, aromatic, amine, amide, nitro, nitrile, alcohol, ketone, ether, acetal, peroxide, epoxide, aldehyde, carboxylic acid, ester, anhydride, sulfide, sulfite, sulfone, sulfonic acid, sulfate, phosphine, silane, or siloxane.

Table 2 shows the classification criteria using nitrogen, oxygen, and chlorine as representatives of Groups 15, 16, and 17, respectively; other elements of the same group can be classified in the same way.

[Table 2]

Group | Class           | Formula              | # of guest atoms | # of hydrogens | Relative position (entries in order across columns -1, +1, +2, +3)
14    | Alkane          | C-C                  | 0                | 2              | C C
14    | Alkene          | C=C                  | 0                | 1              | C C
14    | Alkyne          | C≡C                  | 0                | 0              | C C
14    | Naphthene       | C-C (ring)           | 0                | 2              | C C
14    | Aromatic        | C-C (benzene ring)   | 0                | 1              | C C
15    | Amine           | C-N-H                | 1                | 1              | C
15    | Nitrile         | C≡N                  | 1                | 0              | C C
16    | Alcohol         | C-O-H                | 1                | 1              | C
16    | Ether           | C-O-C                | 1                | 0              | C C
16    | Acetal          | C-O-C-O-C            | 2                | 0              | C C O C
16    | Peroxide        | C-O-O-C              | 2                | 0              | C O C
16    | Epoxide         | C-O-C (ring)         | 1                | 0              | C C
16    | Ketone          | C=O                  | 1                | 0              | C C
16    | Aldehyde        | HC=O                 | 1                | 1              | C
16    | Carboxylic acid | C(=O)-O-H            | 2                | 0.5            | C O
16    | Ester           | C(=O)-O-C            | 2                | 0              | C O C
16    | Anhydride       | C(=O)-O-C(=O)        | 3                | 0              | C O C O
17    | Chloride        | C-Cl                 | 1                | 0              | C

In the main class determination step, the most important information is the connection position of the guest atoms relative to hydrogen and carbon, which can be confirmed by counting the atoms other than hydrogen in the chemical formula layer. The number of hydrogen atoms can also be determined from the hydrogen layer, which indicates how many hydrogen atoms are bonded to each atom.
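Reading the hydrogen count per atom from the hydrogen layer can be sketched as below. The sketch handles only the fixed-hydrogen entries of the standard InChI notation, such as "1-2H3" or "4-7,9H", and skips the parenthesized mobile-hydrogen groups; the function name `hydrogens_per_atom` is an assumption made for this example.

```python
import re


def hydrogens_per_atom(hydrogen_layer: str) -> dict[int, int]:
    """Map each atom number to the number of hydrogens fixed to it."""
    # Mobile-hydrogen groups such as "(H,3,4)" are not tied to one atom,
    # so this simplified sketch drops them.
    fixed = re.sub(r"\([^)]*\)", "", hydrogen_layer)
    # An entry is a comma-separated list of atoms or atom ranges,
    # followed by "H" and an optional count (absent means 1).
    entry = r"(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)H(\d*)"
    counts: dict[int, int] = {}
    for atoms, number in re.findall(entry, fixed):
        n_hydrogen = int(number) if number else 1
        for item in atoms.split(","):
            if "-" in item:
                first, last = map(int, item.split("-"))
                indices = range(first, last + 1)
            else:
                indices = [int(item)]
            for index in indices:
                counts[index] = n_hydrogen
    return counts


# Ethanol, C2H6O: atom 3 (oxygen) carries one hydrogen.
print(hydrogens_per_atom("3H,2H2,1H3"))
```

In the ethanol example the guest atom (atom 3) has one hydrogen, which matches the alcohol row of Table 2.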

The main class obtained through the functional group identification (500) represents only functional groups and does not distinguish structures. The structure of the compound is therefore determined through the structure confirmation (600). Structures are classified as shown in Table 3, which can be done by identifying branches in the atomic connection layer or unsaturated bonds through the hydrogen layer. In addition, if two or more functional groups are found for the main class, or if a ring structure or an aromatic structure can be distinguished in this process, the subclass can be expressed by attaching a prefix to the main class.

That is, the structure is identified from branches in the atomic connection layer or from the hydrogen layer, and is either expressed as a prefix of the main class (N-, Branched-, Unsaturated-, Poly-, Cyclic-, or Aromatic-) or classified as a subclass.

[Table 3]

Prefix      | Description
N-          | No special structure
Branched    | Branch structures exist
Unsaturated | Unsaturated bonds are present
Poly        | Several identical functional groups are present
Cyclic      | A ring structure is present
Aromatic    | An aromatic structure is present

The atomic class, the main class, and the subclass obtained through the atomic configuration confirmation (400), the functional group identification (500), and the structure confirmation (600) are output as the classification result by the classification string generation (700). The method can be implemented in code by computer programming, and the results can be recorded for any database that contains compound InChIs.
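As a rough illustration of the final step, the three results can be joined into one string. The source does not specify the format of the classification string, so the separator, the atomic class label, and the function name below are all assumptions; only the prefix names come from Table 3.

```python
# Structure prefixes listed in Table 3.
PREFIXES = {"N", "Branched", "Unsaturated", "Poly", "Cyclic", "Aromatic"}


def classification_string(atomic_class: str, main_class: str,
                          prefix: str = "N", sep: str = "/") -> str:
    """Join atomic class, main class, and subclass into one string.

    The layout "atomic/main/prefix-main" is hypothetical.
    """
    if prefix not in PREFIXES:
        raise ValueError(f"unknown structure prefix: {prefix}")
    # The subclass is written as the main class with a structure prefix.
    sub_class = f"{prefix}-{main_class}"
    return sep.join([atomic_class, main_class, sub_class])


# "CHO" is a made-up atomic class label for a compound of C, H, and O.
print(classification_string("CHO", "Alcohol", "Branched"))
```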

Hereinafter, the present invention will be described in more detail with reference to Examples. It is to be understood by those skilled in the art that these embodiments are for illustrative purposes only and that the scope of the present invention is not limited by these embodiments.

[Example]

Example 1

To aid in the description of the invention, illustrative compounds were classified according to the steps of the method described above. Table 4 shows the compounds and the result of each step in Example 1.

[Table 4]

Figure 112015083322110-pat00001

Figure 112015083322110-pat00002

Figure 112015083322110-pat00003

Figure 112015083322110-pat00004

Figure 112015083322110-pat00005

Figure 112015083322110-pat00006

Figure 112015083322110-pat00007

Figure 112015083322110-pat00008

Figure 112015083322110-pat00009

Figure 112015083322110-pat00010

Figure 112015083322110-pat00011

Figure 112015083322110-pat00012

As shown in Table 4, organic compounds can be effectively classified in a database in a short time by confirming functional groups and structures using InChI among the compound identifiers for organic compounds. It is possible to improve the retrieval speed and accuracy and also to easily predict the physical properties of the compound.

While the present invention has been particularly shown and described with reference to specific embodiments thereof, those skilled in the art will appreciate that such specific embodiments are merely preferred embodiments and that the scope of the present invention is not limited thereto. Accordingly, the actual scope of the present invention will be defined by the appended claims and their equivalents.

Claims (9)

1. A method of classifying a compound in a computer-based system, in which:
(a) a compound identifier input means provided in a computer inputs an International Chemical Identifier (InChI) identifier of the compound as a character string;
(b) a layer classifying means classifies the layers using the identifier to confirm whether the compound is an organic compound or an inorganic compound;
(c) an atomic class determining means performs no classification in the case of an inorganic compound and, in the case of an organic compound, determines the atomic class by identifying the constituent atoms through the steps of: (a1) loading the chemical formula layer (i = 0); (a2) if character (i + 1) of the chemical formula layer is a lowercase letter, identifying the element and setting i = i + 2, and otherwise identifying the element and setting i = i + 1; (a3) if character (i) is a digit, setting i = i + 1 and returning to the beginning of (a3); and (a4) terminating the procedure if the end of the string has been reached, and otherwise repeating from (a2);
(d) a main class determining means determines the main class by checking the functional group;
(e) a subclass determining means determines a subclass by checking the structure; and
(f) a classification string generating means generates a classification string including the atomic class, the main class, and the subclass.
2. The method of claim 1, wherein the layer classifying means separates and identifies a main layer and a charge layer.
3. The method of claim 2, wherein the main layer is divided according to a prefix and, after the prefix is removed, is classified into a chemical formula layer, an atomic connection layer, or a hydrogen layer.
4. The method of claim 2, wherein the charge layer is divided according to a prefix and, after the prefix is removed, is classified as a charge or a positive charge.
5. The method according to claim 3, wherein the chemical formula layer is examined to distinguish the inorganic compound from the organic compound, the compound is classified as an inorganic compound when no carbon is present in the chemical formula layer or no hydrogen layer is present, and an organic compound bonded to a metal is classified as an organic salt.
6. (Deleted)
7. The method of claim 1, wherein the functional groups are classified using an atomic connection layer and a hydrogen layer.
8. The method according to claim 7, wherein, for elements of Group 14, Group 15, Group 16, and Group 17, the compound is classified by functional group as alkane, alkene, alkyne, naphthene, aromatic, amine, amide, nitro, nitrile, alcohol, ketone, ether, acetal, peroxide, epoxide, aldehyde, carboxylic acid, ester, anhydride, sulfite, sulfone, sulfonic acid, sulfate, phosphine, silane, or siloxane.
9. The method of claim 1, wherein the structure is either expressed as a prefix of the main class, namely N-, Branched-, Unsaturated-, Poly-, Cyclic-, or Aromatic-, or classified as a subclass.
KR1020150120964A 2015-08-27 2015-08-27 Classification Algorithm for Chemical Compound Using InChI KR101801226B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020150120964A KR101801226B1 (en) 2015-08-27 2015-08-27 Classification Algorithm for Chemical Compound Using InChI
PCT/KR2016/009273 WO2017034280A1 (en) 2015-08-27 2016-08-23 Method of classifying compounds by using molecular structure of compounds

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150120964A KR101801226B1 (en) 2015-08-27 2015-08-27 Classification Algorithm for Chemical Compound Using InChI

Publications (2)

Publication Number Publication Date
KR20170025071A (en) 2017-03-08
KR101801226B1 (en) 2017-11-24

Family

ID=58101018

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150120964A KR101801226B1 (en) 2015-08-27 2015-08-27 Classification Algorithm for Chemical Compound Using InChI

Country Status (2)

Country Link
KR (1) KR101801226B1 (en)
WO (1) WO2017034280A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101236966B1 (en) * 2011-11-14 2013-02-26 숭실대학교산학협력단 Apparatus and method for string expression of compound distinguishing isomer, apparatus and method for searching compound using the same

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140267B2 (en) * 2006-06-30 2012-03-20 International Business Machines Corporation System and method for identifying similar molecules
US8468001B2 (en) * 2007-03-22 2013-06-18 Infosys Limited Ligand identification and matching software tools
KR101375672B1 (en) * 2011-10-27 2014-03-20 주식회사 켐에쎈 Method for Predicting a Property of Compound and System for Predicting a Property of Compound

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101236966B1 (en) * 2011-11-14 2013-02-26 숭실대학교산학협력단 Apparatus and method for string expression of compound distinguishing isomer, apparatus and method for searching compound using the same

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Sun, Bingjun, et al. "Extraction and search of chemical formulae in text documents on the web." Proceedings of the 16th international conference on World Wide Web. ACM, 2007.
Paper 1: CEUR Workshop
Website

Also Published As

Publication number Publication date
KR20170025071A (en) 2017-03-08
WO2017034280A1 (en) 2017-03-02


Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
GRNT Written decision to grant