CN110390997B - Chemical molecular formula splicing method - Google Patents
- Publication number
- CN110390997B
- Authority
- CN
- China
- Prior art keywords
- splicing
- atoms
- chemical
- sites
- heavy metal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/10—Analysis or design of chemical reactions, syntheses or processes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P20/00—Technologies relating to chemical industry
- Y02P20/50—Improvements relating to the production of bulk chemicals
- Y02P20/55—Design of synthesis routes, e.g. reducing the use of auxiliary or protecting groups
Landscapes
- Chemical & Material Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Crystallography & Structural Chemistry (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Pharmacology & Pharmacy (AREA)
- Medicinal Chemistry (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Heterocyclic Carbon Compounds Containing A Hetero Ring Having Oxygen Or Sulfur (AREA)
Abstract
The invention discloses a chemical molecular formula splicing method that splices two chemical molecular formulas bearing splicing sites into a single molecular formula, where each site-bearing formula is obtained by removing a fixed group from a reagent molecule. When compounds are synthesized from a large number of reagent molecules, molecular formula splicing is an indispensable step. The method performs the splicing on the graph structure of atoms and chemical bonds, which makes it flexible enough to handle a variety of cases, such as splicing into a ring structure: the sites are recorded with distinct labels before splicing, and the corresponding sites are then located and spliced in sequence.
Description
Technical Field
The invention belongs to the technical field of compound synthesis, and particularly relates to a chemical molecular formula splicing method.
Background
During compound synthesis, reagent molecules bearing protecting groups are deprotected and then reacted with one another to obtain a new compound. From a chemical point of view, both steps are chemical reactions. From a computational point of view, removing a protecting group is a "site cleavage" of the molecule, which exposes sites that can react (connect) with other molecules; the subsequent reaction is a "site splicing" between molecules, which connects the reactive sites and yields the new compound.
The current computational approach is to record the site-bearing molecules obtained in the first step and to label the sites by the reaction that produced them; for example, [R1] marks a site obtained from reaction type 1. A site sits at the position of the atoms that were removed from the molecule in that reaction. Molecules carrying the same site label are then spliced: if both molecules carry [R1], they are spliced at those sites.
Current computational chemistry tools such as Open Babel and RDKit work by simulating the actual chemical reaction, that is, the reactant compounds are input and the product compound is obtained. They do not directly support splicing of site-labelled formulas, and they are not flexible enough for complex cases, for example when the two site-bearing formulas each carry several sites, or when the sites are spliced into a ring structure. The invention implements a splicing method on top of Open Babel that handles these cases more flexibly.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a chemical molecular formula splicing method, that is, a method for splicing two chemical molecular formulas bearing splicing sites into a single molecular formula, where each site-bearing formula is obtained by removing a fixed group from a reagent molecule. When compounds are synthesized from a large number of reagent molecules, molecular formula splicing is an indispensable step.
The aim of the invention is achieved by the following technical scheme: a chemical formula splicing method for splicing two site-bearing chemical formulas smi1 and smi2. The positions to be spliced in a molecular formula, namely the sites, are denoted [Ri], i = 0, 1 … N, where each [Ri] corresponds to a reaction type and N is the total number of reaction types. The splicing process is as follows:
(1) Read in the chemical formulas smi1 and smi2 and convert them into graph data structures mol1 and mol2 representing the molecules. The site labels [Ri] receive special treatment during reading: for each site, record its atom index IDX in the molecule and its reaction type index i; define a heavy metal mapping table that maps each site [Ri] to a heavy metal atom that does not appear in the molecular formulas before or after splicing; and store the atom index IDX, the reaction type index i, and the heavy metal atom mapped to each [Ri] in the data structures of mol1 and mol2, respectively.
(2) Add the data structures of mol2 and mol1 to obtain mol3.
(3) In mol3, find a pair of sites with the same index i, denoted p and q, and find the atoms ATOMp and ATOMq to which they are attached. Add a new chemical bond between ATOMp and ATOMq, then delete p and q together with the chemical bonds attached to them. This splices smi1 and smi2.
(4) Return to step (3) and search the remaining atoms for site pairs with the same index i until no matching pair remains.
(5) Convert mol3 into the molecular formula smi3 in Canonical SMILES format and consult the heavy metal mapping table defined in step (1): if any heavy metal atom from the table is present in smi3, replace it with the corresponding [Ri].
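The label substitution used in steps (1) and (5) can be illustrated at the string level. The following is a minimal sketch, not the patented implementation: the table `PLACEHOLDERS` and both function names are hypothetical, and the choice of elements simply borrows symbols that appear in the worked example later in the description.

```python
import re

# Hypothetical mapping table: reaction-type index i -> placeholder element.
# The elements must not occur in any real molecule being processed.
PLACEHOLDERS = {0: "In", 1: "Cd", 2: "Sr", 3: "Kr", 4: "Sc"}
SYMBOL_TO_INDEX = {symbol: i for i, symbol in PLACEHOLDERS.items()}


def sites_to_placeholders(smiles: str) -> str:
    """Step (1): rewrite every [Ri] label as a bracketed placeholder atom."""
    return re.sub(
        r"\[R(\d+)\]",
        lambda m: "[" + PLACEHOLDERS[int(m.group(1))] + "]",
        smiles,
    )


def placeholders_to_sites(smiles: str) -> str:
    """Step (5): rewrite any remaining placeholder atom back to its [Ri] label."""
    pattern = r"\[(" + "|".join(SYMBOL_TO_INDEX) + r")\]"
    return re.sub(
        pattern,
        lambda m: "[R" + str(SYMBOL_TO_INDEX[m.group(1)]) + "]",
        smiles,
    )


labelled = "[R1]C([R0])=O"
encoded = sites_to_placeholders(labelled)
decoded = placeholders_to_sites(encoded)
```

Encoding the sites as ordinary atoms lets a general-purpose SMILES reader parse the formula without knowing about [Ri]; the reverse substitution restores the labels of any site that was not consumed by splicing.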
Further, in step (1), the molecular formulas are read in Canonical SMILES format; input in other formats must first be converted into that format.
Further, in step (1), the graph data structure is a graph formed by atoms connected through chemical bonds, and it contains the atoms, the chemical bonds and the chemical bond attribute information of the molecule.
Further, in step (2), adding the data structures of mol2 and mol1 means stacking the atoms, chemical bonds and chemical bond attribute information of the two: the atom indices IDX of the atoms in mol2, including those of the [Ri] sites in mol2, are shifted so that they follow the atom indices of mol1. The result, mol3, is the preliminary splicing result.
Further, in step (4), site pairs with the same index i are searched for among the remaining atoms until no matching pair remains, at which point all atoms in the mol3 data structure should be connected, with no isolated atoms; if any isolated atoms remain, the result is checked manually.
Further, in step (5), the presence of a heavy metal indicates that an unspliced site remains, and the result must still be spliced with other molecules carrying the same site to form the final compound.
Further, when splicing produces a new ring or double bond, the corresponding bond information must be updated in mol3 before it is output in Canonical SMILES format.
The beneficial effects of the invention are as follows: the splicing of molecular formulas is performed on the graph structure of atoms and chemical bonds, which makes the method flexible enough to handle a variety of cases, such as splicing into a ring structure: the sites are recorded with distinct labels before splicing, and the corresponding sites are then located and spliced in sequence.
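The connectivity condition attached to step (4), namely that no isolated atoms remain once all site pairs are consumed, can be checked with an ordinary graph traversal. The sketch below assumes a deliberately simple representation (integer atom indices and bonds as index pairs); the function name `is_connected` is hypothetical and not taken from the patent or from Open Babel.

```python
from collections import deque


def is_connected(atom_indices, bonds):
    """Return True if every atom is reachable from an arbitrary start atom.

    atom_indices: iterable of integer atom indices still present in the graph.
    bonds: iterable of (a, b) index pairs, both of which must be in atom_indices.
    """
    atoms = set(atom_indices)
    if not atoms:
        return True
    adjacency = {a: set() for a in atoms}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(atoms))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the bond graph
        current = queue.popleft()
        for nxt in adjacency[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == atoms


chain_ok = is_connected([0, 1, 2], [(0, 1), (1, 2)])
two_pieces = is_connected([0, 1, 2, 3], [(0, 1), (2, 3)])
```

A `False` result corresponds to the situation the text flags for manual checking: the stacked graph still consists of more than one fragment.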
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a splicing example of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and examples, in order to make its objects, technical solutions and advantages clearer. It should be understood that the specific embodiments described herein are for illustration only and are not intended to limit the scope of the invention.
The invention provides a chemical formula splicing method for splicing two site-bearing chemical formulas smi1 and smi2 (the input format is assumed to be Canonical SMILES; other formats can be converted into it). The specific implementation steps are shown in FIG. 1.
The positions to be spliced in a molecular formula (that is, the sites) are denoted [Ri], i = 0, 1 … N. Each [Ri] corresponds to a reaction type and is obtained by subjecting a reagent molecule to a chemical reaction of that type; N is the total number of reaction types. The splicing process is as follows:
1. Read in the molecules smi1 and smi2 in Canonical SMILES format and convert them into graph data structures mol1 and mol2 representing the molecules, that is, graphs formed by atoms and chemical bonds. The graph data structure mainly contains the atoms, the chemical bonds and the chemical bond attributes (such as the directionality of double bonds) of the molecule. The site labels [Ri] receive special treatment during reading: for each site, record its atom index IDX in the molecule (the site is the IDX-th atom of the molecule) and its reaction type index i (the i in [Ri]); define a heavy metal mapping table that maps each site [Ri] to a heavy metal atom that does not appear in the molecular formulas before or after splicing; and store the atom index IDX, the reaction type index i, and the heavy metal atom mapped to each [Ri] in the data structures of mol1 and mol2, respectively.
2. Add the data structures of mol2 and mol1, that is, stack the atoms, chemical bonds, chemical bond attributes and other information of the two, to obtain mol3. The atom indices IDX of the atoms in mol2, including those of the [Ri] sites in mol2, are shifted so that they follow the atom indices of mol1. This forms the preliminary splicing result mol3.
3. In mol3, find a pair of sites with the same index i, denoted p and q, and find the atoms ATOMp and ATOMq to which they are attached. Add a new chemical bond between ATOMp and ATOMq, then delete p and q together with the chemical bonds attached to them. This splices smi1 and smi2.
4. Return to step 3 and search the remaining atoms for site pairs with the same index i until no matching pair remains. At this point all atoms in the mol3 data structure should be connected, with no isolated atoms; if any remain, the result is checked manually.
5. Convert mol3 into the molecular formula smi3 in Canonical SMILES format and consult the heavy metal mapping table defined in step 1: if any heavy metal atom from the table is present in smi3, replace it with the corresponding [Ri].
In step 5, the presence of a heavy metal indicates that an unspliced site remains; the final compound is formed by splicing with other molecules that carry the same site. In practice it is quite common for one molecule to carry several sites, so that it must be spliced with several other molecules to form the complete compound.
When splicing produces special structures such as new rings or double bonds, the corresponding bond information must be updated in mol3 before it is output in Canonical SMILES format.
A complete example follows. As shown in FIG. 2, the first six molecular formulas are spliced into the last one, where In, Cd, Sr, Cd, Kr and Sc mark the site positions, corresponding to [R0], [R1] … [R5]:
The molecular formulas in the example are:
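Steps 2 to 4 above can be sketched on a small hand-built graph. This is an illustration under stated assumptions rather than the Open Babel based implementation the patent refers to: the `Mol` class, `stack`, `neighbor` and `splice` are hypothetical names, every site is assumed to have exactly one bond, deleted atoms are tracked in a set instead of being renumbered, and the ring and double-bond updates mentioned in the text are not handled. The reaction-type numbers assigned to Sc and Kr are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class Mol:
    atoms: list   # element symbols; list position is the atom index IDX
    bonds: dict   # {frozenset({a, b}): bond order}
    sites: dict   # {atom index of placeholder atom: reaction-type index i}


def stack(mol1, mol2):
    """Step 2: append mol2 to mol1, shifting mol2's indices past mol1's."""
    offset = len(mol1.atoms)
    bonds = dict(mol1.bonds)
    for pair, order in mol2.bonds.items():
        bonds[frozenset(a + offset for a in pair)] = order
    sites = dict(mol1.sites)
    sites.update({idx + offset: i for idx, i in mol2.sites.items()})
    return Mol(mol1.atoms + mol2.atoms, bonds, sites)


def neighbor(mol, idx):
    """Return (attached atom, bond key) for a site atom with a single bond."""
    for pair in mol.bonds:
        if idx in pair:
            (other,) = pair - {idx}
            return other, pair
    raise ValueError(f"site {idx} has no bond")


def splice(mol):
    """Steps 3 and 4: join site pairs with equal i until none remain."""
    removed = set()
    while True:
        first_seen, pair = {}, None
        for idx, i in sorted(mol.sites.items()):
            if i in first_seen:
                pair = (first_seen[i], idx)
                break
            first_seen[i] = idx
        if pair is None:
            return removed            # no matching pair left
        p, q = pair
        atom_p, bond_p = neighbor(mol, p)
        atom_q, bond_q = neighbor(mol, q)
        del mol.bonds[bond_p], mol.bonds[bond_q]
        mol.bonds[frozenset((atom_p, atom_q))] = 1   # new single bond
        del mol.sites[p], mol.sites[q]
        removed.update(pair)


def b(a, c):
    return frozenset((a, c))


# CNC([Sc])=O and C([Sc])N[Kr], with Sc as type 4 and Kr as type 3 (assumed).
mol1 = Mol(["C", "N", "C", "Sc", "O"],
           {b(0, 1): 1, b(1, 2): 1, b(2, 3): 1, b(2, 4): 2}, {3: 4})
mol2 = Mol(["C", "Sc", "N", "Kr"],
           {b(0, 1): 1, b(0, 2): 1, b(2, 3): 1}, {1: 4, 3: 3})
mol3 = stack(mol1, mol2)
removed = splice(mol3)
```

After the call, the two Sc placeholders are gone, the carbonyl carbon of the first fragment is bonded to the CH2 carbon of the second, and the Kr site is still recorded in `mol3.sites`, which is the "unspliced site" case that step 5 converts back to an [Ri] label.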
C([In])(=O)CC1=CC2=C(C=CC=C2)C=C1
[Cd]C([In])=O
C([Sr])N[Cd]
[Kr]C([Sr])=O
C([Sc])N[Kr]
CNC([Sc])=O
C(NC(=O)CNC(=O)C(=O)Cc1cc2c(cccc2)cc1)C(=O)NC
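A quick consistency check on the listing above is to count the placeholder atoms: if every placeholder occurs exactly twice across the six fragments, each site has exactly one partner, and none should survive in the product. The sketch below does only this string-level count; it does not parse the SMILES, and the variable names are illustrative.

```python
import re
from collections import Counter

fragments = [
    "C([In])(=O)CC1=CC2=C(C=CC=C2)C=C1",
    "[Cd]C([In])=O",
    "C([Sr])N[Cd]",
    "[Kr]C([Sr])=O",
    "C([Sc])N[Kr]",
    "CNC([Sc])=O",
]
product = "C(NC(=O)CNC(=O)C(=O)Cc1cc2c(cccc2)cc1)C(=O)NC"

# Bracketed placeholder atoms used in the listing.
PLACEHOLDER = re.compile(r"\[(In|Cd|Sr|Kr|Sc)\]")

# How often each placeholder occurs across all fragments.
counts = Counter(sym for smi in fragments for sym in PLACEHOLDER.findall(smi))

# Placeholders still present in the spliced product (none expected).
leftover = PLACEHOLDER.findall(product)
```

Here each of the five symbols is counted twice and `leftover` is empty, which matches a fully spliced linear chain of six fragments joined by five bonds.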
It should be noted that the disclosure and the specific embodiments are intended to demonstrate practical applications of the technical solution provided by the present invention and should not be construed as limiting its scope. Any modifications and changes made to the present invention within its spirit fall within the scope of the appended claims.
Claims (6)
1. A chemical formula splicing method, characterized in that it is used for splicing site-bearing chemical formulas smi1 and smi2, wherein the sites to be spliced in a molecular formula are denoted [Ri], i = 0, 1 … N, each [Ri] corresponds to a reaction type, N is the total number of reaction types, and the splicing process is as follows:
(1) reading in the chemical formulas smi1 and smi2 and converting them into graph data structures mol1 and mol2 representing the molecules, respectively; the site labels [Ri] receive special treatment during reading: recording the atom index IDX of each site in the molecule and its reaction type index i, defining a heavy metal mapping table, and mapping each site [Ri] to a heavy metal atom that does not appear in the molecular formulas before or after splicing; the atom index IDX, the reaction type index i and the heavy metal atom mapped to each [Ri] in the molecular formula are stored in the data structures of mol1 and mol2, respectively;
(2) adding the data structures of mol2 and mol1, that is, stacking the atoms, chemical bonds and chemical bond attribute information of the two to obtain mol3, wherein the atom indices IDX of the atoms in mol2, including those of the [Ri] sites in mol2, are shifted so that they follow the atom indices of mol1, forming the preliminary splicing result mol3;
(3) finding in mol3 a pair of sites with the same index i, denoted p and q, finding the atoms ATOMp and ATOMq to which they are attached, adding a new chemical bond between ATOMp and ATOMq, and deleting p and q together with the chemical bonds attached to them, thereby splicing smi1 and smi2;
(4) returning to step (3) and searching the remaining atoms for site pairs with the same index i until no matching pair remains;
(5) converting mol3 into the molecular formula smi3 in Canonical SMILES format, consulting the heavy metal mapping table defined in step (1), and, if any heavy metal atom from the table is present in smi3, replacing it with the corresponding [Ri].
2. The chemical formula splicing method of claim 1, wherein in step (1) the molecular formulas are read in Canonical SMILES format, and input in other formats is converted into that format.
3. The chemical formula splicing method of claim 1, wherein in step (1) the graph data structure is a graph formed by atoms connected through chemical bonds and contains the atoms, the chemical bonds and the chemical bond attribute information of the molecule.
4. The chemical formula splicing method of claim 1, wherein in step (4) site pairs with the same index i are searched for among the remaining atoms until no matching pair remains, at which point all atoms in the mol3 data structure are connected, with no isolated atoms; if any isolated atoms remain, a manual check is performed.
5. The chemical formula splicing method of claim 1, wherein in step (5) the presence of a heavy metal indicates that an unspliced site remains, and splicing with other molecules carrying the same site is required to form the final compound.
6. The chemical formula splicing method of claim 1, wherein, when splicing produces a new ring or double bond, the corresponding bond information is updated in mol3 before output in Canonical SMILES format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910646187.5A CN110390997B (en) | 2019-07-17 | 2019-07-17 | Chemical molecular formula splicing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110390997A (en) | 2019-10-29 |
CN110390997B (en) | 2023-05-30 |
Family
ID=68285020
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910646187.5A Active CN110390997B (en) | 2019-07-17 | 2019-07-17 | Chemical molecular formula splicing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390997B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110379467B (en) * | 2019-07-17 | 2022-08-19 | 成都火石创造科技有限公司 | Chemical molecular formula segmentation method |
CN111223532B (en) * | 2019-11-14 | 2023-06-20 | 腾讯科技(深圳)有限公司 | Method, device, apparatus, medium for determining a reactant of a target compound |
CN113140262B (en) * | 2021-04-25 | 2022-05-03 | 清华大学 | Chemical molecule synthesis simulation method and device |
CN113140261B (en) * | 2021-04-25 | 2022-05-06 | 清华大学 | Chemical molecule synthesis simulation method and device |
CN117133371B (en) * | 2023-10-25 | 2024-01-05 | 烟台国工智能科技有限公司 | Template-free single-step inverse synthesis method and system based on manual key breaking |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH06309385A (en) * | 1993-01-07 | 1994-11-04 | Akiko Itai | Constructing method for molecular structure for ligand having bioactivity |
DE19646624A1 (en) * | 1995-12-22 | 1997-07-03 | Ibm | Identification of test molecules |
WO2000060507A2 (en) * | 1999-04-02 | 2000-10-12 | Neogenesis, Inc. | Analyzing molecule and protein diversity |
WO2001050127A2 (en) * | 1999-12-30 | 2001-07-12 | 7Tm Pharma | Screening using biological target molecules with metal-ion binding sites |
WO2012083886A1 (en) * | 2010-12-24 | 2012-06-28 | 北大方正集团有限公司 | Method and device for constructing organic chemistry structural formula |
CN105985978A (en) * | 2015-03-06 | 2016-10-05 | 中国科学院上海生命科学研究院 | Construction and application of novel RNA cyclization expression vector |
CN108304691A (en) * | 2018-02-09 | 2018-07-20 | 北京矿冶科技集团有限公司 | Floating agent molecular design method based on segment |
CN109686413A (en) * | 2018-12-24 | 2019-04-26 | 杭州费尔斯通科技有限公司 | A kind of chemical molecular formula search method based on es inverted index |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7369946B2 (en) * | 2000-03-29 | 2008-05-06 | Abbott Gmbh & Co. Kg | Method of identifying inhibitors of Tie-2 |
EP2007934A4 (en) * | 2006-03-24 | 2010-06-30 | Richard D Cramer | Forward synthetic synthon generation and its use to identify molecules similar in 3 dimensional shape to pharmaceutical lead compounds |
US9665693B2 (en) * | 2012-05-30 | 2017-05-30 | Exxonmobil Research And Engineering Company | System and method to generate molecular formula distributions beyond a predetermined threshold for a petroleum stream |
US11114184B2 (en) * | 2017-02-21 | 2021-09-07 | Albert Einstein College Of Medicine | DNA methyltransferase 1 transition state structure and uses thereof |
Also Published As
Publication number | Publication date |
---|---|
CN110390997A (en) | 2019-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||