CN110390997B - Chemical molecular formula splicing method - Google Patents


Info

Publication number
CN110390997B
Authority
CN
China
Prior art keywords
splicing
atoms
chemical
sites
heavy metal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910646187.5A
Other languages
Chinese (zh)
Other versions
CN110390997A (en)
Inventor
金霞
韩瑞峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Firestone Creation Technology Co ltd
Original Assignee
Chengdu Firestone Creation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Firestone Creation Technology Co ltd
Priority to CN201910646187.5A
Publication of CN110390997A
Application granted
Publication of CN110390997B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10 Analysis or design of chemical reactions, syntheses or processes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P20/00 Technologies relating to chemical industry
    • Y02P20/50 Improvements relating to the production of bulk chemicals
    • Y02P20/55 Design of synthesis routes, e.g. reducing the use of auxiliary or protecting groups

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Heterocyclic Carbon Compounds Containing A Hetero Ring Having Oxygen Or Sulfur (AREA)

Abstract

The invention discloses a chemical molecular formula splicing method that splices two chemical molecular formulas bearing splicing sites into a single chemical molecular formula, where each formula with splicing sites is obtained by removing a fixed group from a reagent molecule. Molecular formula splicing is an indispensable step when compounds are synthesized from a large number of reagent molecules. The invention performs the splicing on the graph structure of atoms and chemical bonds, which makes the operation flexible and suitable for a variety of situations, such as splicing into a ring structure: the sites present before splicing are recorded with distinct labels, and the matching sites are then found and spliced in sequence.

Description

Chemical molecular formula splicing method
Technical Field
The invention belongs to the technical field of compound synthesis, and particularly relates to a chemical molecular formula splicing method.
Background
In compound synthesis, reagent molecules carrying protecting groups are first deprotected and then reacted with one another to obtain a new compound. Chemically, both steps are reactions. Computationally, removing a protecting group is a "site cleavage" of the molecule that exposes a site able to react (connect) with other molecules, and the subsequent reaction is a "site splicing" between molecules that connects those reactive sites to give the new compound.
The current computational approach is to record the site-bearing molecules obtained in the first step and to label each site by the reaction that produced it; for example, [R1] denotes a site obtained from reaction type 1. The position of a site is the position of the atoms that the reaction removed from the molecule. Molecules with the same site label are then spliced: if both molecules carry [R1], they are spliced at that site.
Current computational chemistry tools such as Open Babel and RDKit work by simulating the actual chemical reaction: the compounds to be reacted are supplied as input and the reacted compound is returned. They do not directly support splicing of formulas that carry site labels, and they are not flexible enough for complex situations, for example when the two site-bearing molecular formulas each have several sites, or when the sites are spliced into a ring structure. The invention implements a splicing method on top of Open Babel that handles these situations more flexibly.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a chemical molecular formula splicing method, that is, a method for splicing two chemical molecular formulas bearing splicing sites into a single chemical molecular formula, where each formula with splicing sites is obtained by removing a fixed group from a reagent molecule. Molecular formula splicing is an indispensable step when compounds are synthesized from a large number of reagent molecules.
The aim of the invention is achieved by the following technical scheme: a chemical molecular formula splicing method for splicing two chemical molecular formulas smi1 and smi2 that carry sites. The positions to be spliced in a molecular formula, namely the sites, are marked [Ri], i = 0, 1 … N, where each [Ri] corresponds to one reaction type and N is the total number of reaction types. The splicing process is as follows:
(1) Read in the chemical molecular formulas smi1 and smi2 and convert them into graph data structures mol1 and mol2 representing the molecules. The [Ri] site marks receive special treatment during reading: for each site mark, its atom index IDX within the molecule and its reaction type index i are recorded; a heavy metal mapping table is defined that maps each site [Ri] to a heavy metal atom that appears in none of the molecular formulas before or after splicing; and the atom index IDX, the reaction type index i and the heavy metal atom mapped from each [Ri] in the molecular formula are stored in the data structures of mol1 and mol2 respectively.
(2) Add the data structure of mol2 to that of mol1 to obtain mol3.
(3) Find a pair of sites in mol3 with the same index i, denoted p and q, and find the atoms ATOMp and ATOMq to which they are respectively bonded. Add a new chemical bond between ATOMp and ATOMq to connect them, then delete p and q together with the chemical bonds attached to p and q. This splices smi1 and smi2.
(4) Return to step (3) and search the remaining atoms for site pairs with the same index i, until no matching site pair remains.
(5) Convert mol3 into the molecular formula smi3 in Canonical SMILES format and look up the heavy metal mapping table defined in step (1); any heavy metal atom from the table that is present in smi3 is replaced with the corresponding [Ri].
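The mapping table used in steps (1) and (5) can be illustrated at the string level. The following is a minimal sketch in plain Python, not the Open Babel implementation the patent refers to. The table contents (`SITE_TO_ELEMENT`) and the function names `encode_sites` and `decode_sites` are assumptions made for illustration; the element symbols are borrowed from the worked example later in this document, and their assignment to particular indices i is hypothetical.

```python
import re

# Hypothetical mapping table: reaction type index i -> placeholder element.
# Each element is assumed never to occur in the real molecules.
SITE_TO_ELEMENT = {0: "In", 1: "Cd", 2: "Sr", 3: "Kr", 4: "Sc"}
ELEMENT_TO_SITE = {element: i for i, element in SITE_TO_ELEMENT.items()}

SITE_LABEL = re.compile(r"\[R(\d+)\]")
PLACEHOLDER = re.compile(r"\[(" + "|".join(ELEMENT_TO_SITE) + r")\]")


def encode_sites(smiles):
    """Step (1): replace every [Ri] mark with its placeholder atom."""
    return SITE_LABEL.sub(
        lambda m: "[" + SITE_TO_ELEMENT[int(m.group(1))] + "]", smiles)


def decode_sites(smiles):
    """Step (5): replace every leftover placeholder atom with its [Ri] mark."""
    return PLACEHOLDER.sub(
        lambda m: "[R" + str(ELEMENT_TO_SITE[m.group(1)]) + "]", smiles)


print(encode_sites("[R1]C([R0])=O"))
print(decode_sites("[Cd]C([In])=O"))
```

Encoding the sites as real element symbols is what lets an ordinary SMILES reader and writer carry them through the graph operations unchanged.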
Further, in step (1), the molecular formulas are read in Canonical SMILES format; input in any other format must first be converted into this format.
Further, in step (1), the graph data structure is a graph formed by atoms connected through chemical bonds, and contains the atoms, chemical bonds and chemical bond attribute information of the molecule.
Further, in step (2), adding the data structures of mol2 and mol1 means stacking the atom, chemical bond and chemical bond attribute information of the two to obtain mol3: the atom index IDX of every atom in mol2, including the atom index IDX of each [Ri] in mol2, is increased by the number of atoms in mol1. This gives mol3, the preliminary result of the splice.
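The index shift in step (2) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: the dictionary layout (`atoms`, `bonds`, `sites`), the function name `stack_molecules`, and the assignment of indices i to the placeholder elements are all hypothetical, and atom indices are assumed to run from 0 to n-1 in each input molecule.

```python
def stack_molecules(mol1, mol2):
    """Return mol3: mol1 plus mol2, with every mol2 index shifted.

    Hypothetical layout of each molecule:
      "atoms": {atom index IDX: element symbol}, indices assumed 0..n-1
      "bonds": {frozenset({a, b}): bond order}
      "sites": {atom index IDX of a placeholder atom: reaction type index i}
    """
    offset = len(mol1["atoms"])  # number of atoms in mol1
    atoms = dict(mol1["atoms"])
    bonds = dict(mol1["bonds"])
    sites = dict(mol1["sites"])
    for idx, element in mol2["atoms"].items():
        atoms[idx + offset] = element
    for bond, order in mol2["bonds"].items():
        bonds[frozenset(idx + offset for idx in bond)] = order
    for idx, i in mol2["sites"].items():
        sites[idx + offset] = i
    return {"atoms": atoms, "bonds": bonds, "sites": sites}


# [Cd]C([In])=O and C([Sr])N[Cd], entered by hand as graphs.
mol1 = {"atoms": {0: "Cd", 1: "C", 2: "In", 3: "O"},
        "bonds": {frozenset((0, 1)): 1, frozenset((1, 2)): 1,
                  frozenset((1, 3)): 2},
        "sites": {0: 1, 2: 0}}
mol2 = {"atoms": {0: "C", 1: "Sr", 2: "N", 3: "Cd"},
        "bonds": {frozenset((0, 1)): 1, frozenset((0, 2)): 1,
                  frozenset((2, 3)): 1},
        "sites": {1: 2, 3: 1}}

mol3 = stack_molecules(mol1, mol2)
print(mol3["sites"])
```

At this point mol3 is still two disconnected fragments; no bond between them exists until the site pairs are spliced.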
Further, in step (4), site pairs with the same index i are searched for among the remaining atoms until no matching site pair remains. At that point all atoms in the mol3 data structure should be connected, with no isolated atoms; if any remain, the result is checked manually.
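The connectivity condition above can be tested automatically before falling back to manual checking. The sketch below counts connected components with a breadth-first search; the function name `connected_components` and the graph layout (a dictionary of atoms and a dictionary of bonds keyed by `frozenset`) are assumptions for illustration, not part of the patent.

```python
from collections import deque


def connected_components(atoms, bonds):
    """Return the list of connected components as sets of atom indices."""
    adjacency = {idx: set() for idx in atoms}
    for bond in bonds:
        a, b = tuple(bond)
        adjacency[a].add(b)
        adjacency[b].add(a)
    unvisited, components = set(atoms), []
    while unvisited:
        start = unvisited.pop()
        component, queue = {start}, deque([start])
        while queue:
            for neighbour in adjacency[queue.popleft()]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        unvisited -= component
        components.append(component)
    return components


# A spliced graph: the bond between atoms 1 and 6 joins the two fragments.
atoms = {1: "C", 2: "In", 3: "O", 4: "C", 5: "Sr", 6: "N"}
spliced = {frozenset((1, 2)): 1, frozenset((1, 3)): 2, frozenset((4, 5)): 1,
           frozenset((4, 6)): 1, frozenset((1, 6)): 1}
unspliced = {b: o for b, o in spliced.items() if b != frozenset((1, 6))}

print(len(connected_components(atoms, spliced)))
print(len(connected_components(atoms, unspliced)))
```

A result with more than one component is the case the text flags for manual inspection.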
Further, in step (5), a heavy metal atom remaining in smi3 indicates an unspliced site, which still has to be spliced with another molecule bearing the same site to form the final compound.
Further, when splicing produces a new ring or double bond structure, the corresponding bond information is updated in mol3 before it is output in Canonical SMILES format.
The beneficial effects of the invention are as follows: the splicing of molecular formulas is performed on the graph structure of atoms and chemical bonds, which makes the operation flexible and suitable for a variety of situations, such as splicing into a ring structure: the sites present before splicing are recorded with distinct labels, and the matching sites are then found and spliced in sequence.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a splicing example of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a chemical molecular formula splicing method for splicing two chemical molecular formulas smi1 and smi2 that carry sites (the input format is assumed to be Canonical SMILES; other formats can be converted into it). The specific implementation steps are shown in FIG. 1.
The positions to be spliced in a molecular formula (i.e., the sites) are denoted [Ri], i = 0, 1 … N. Each [Ri] corresponds to one reaction type and is obtained by subjecting a reagent molecule to a chemical reaction of that type; N is the total number of reaction types. The splicing process is as follows:
1. Read in the molecules smi1 and smi2 in Canonical SMILES format and convert them into graph data structures mol1 and mol2 representing the molecules, that is, graphs formed by atoms and chemical bonds. The graph data structure mainly contains the atoms, the chemical bonds and the chemical bond attributes (such as the directionality of double bonds) of the molecule. The [Ri] site marks receive special treatment during reading: for each site mark, its atom index IDX within the molecule (the site is the IDX-th atom of the molecule) and its reaction type index i (the i in [Ri]) are recorded; a heavy metal mapping table is defined that maps each site [Ri] to a heavy metal atom that appears in none of the molecular formulas before or after splicing; and the atom index IDX, the reaction type index i and the heavy metal atom mapped from each [Ri] in the molecular formula are stored in the data structures of mol1 and mol2 respectively.
2. Add the data structures of mol2 and mol1, that is, stack the atoms, chemical bonds, chemical bond attributes and other information of the two, to obtain mol3: the atom index IDX of every atom in mol2, including the atom index IDX of each [Ri] in mol2, is increased by the number of atoms in mol1. This forms mol3, the preliminary result of the splice.
3. Find a pair of sites in mol3 with the same index i, denoted p and q, and find the atoms ATOMp and ATOMq to which they are respectively bonded. Add a new chemical bond between ATOMp and ATOMq to connect them, then delete p and q together with the chemical bonds attached to p and q. This splices smi1 and smi2.
4. Return to step 3 and search the remaining atoms for site pairs with the same index i, until no matching site pair remains. At that point all atoms in the mol3 data structure should be connected, with no isolated atoms; if any remain, the result is checked manually.
5. Convert mol3 into the molecular formula smi3 in Canonical SMILES format and look up the heavy metal mapping table defined in step 1; any heavy metal atom from the table that is present in smi3 is replaced with the corresponding [Ri].
In step 5, a remaining heavy metal atom indicates an unspliced site; the final compound is formed by splicing with other molecules that bear the same site. In practice it is quite common for one molecule to have several sites and to require splicing with several other molecules to form the complete compound.
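Steps 3 and 4 can be sketched on an already stacked graph as follows. This is an illustrative plain-Python version, not the Open Babel code the patent builds on. The function name `splice_sites`, the data layout, and two behaviours are assumptions: the first two sites found with the same index i are paired, and the new bond inherits the bond order of the deleted bond on the p side.

```python
def splice_sites(atoms, bonds, sites):
    """Join site pairs with the same index i until none is left.

    atoms: {atom index: element}, bonds: {frozenset({a, b}): order},
    sites: {atom index of a placeholder atom: reaction type index i}.
    Each site atom is assumed to have exactly one bond.
    """
    atoms, bonds, sites = dict(atoms), dict(bonds), dict(sites)

    def attached(site):
        bond = next(b for b in bonds if site in b)
        (other,) = bond - {site}
        return other, bond

    while True:
        first_seen, pair = {}, None
        for idx in sorted(sites):
            if sites[idx] in first_seen:
                pair = (first_seen[sites[idx]], idx)
                break
            first_seen[sites[idx]] = idx
        if pair is None:  # step 4: no matching pair remains
            return atoms, bonds, sites
        p, q = pair
        atom_p, bond_p = attached(p)
        atom_q, bond_q = attached(q)
        order = bonds[bond_p]  # assumption: reuse the order of the site bond
        for site, bond in ((p, bond_p), (q, bond_q)):
            del bonds[bond], atoms[site], sites[site]
        bonds[frozenset((atom_p, atom_q))] = order


# mol3 for [Cd]C([In])=O stacked with C([Sr])N[Cd]; indices i are hypothetical.
atoms3 = {0: "Cd", 1: "C", 2: "In", 3: "O", 4: "C", 5: "Sr", 6: "N", 7: "Cd"}
bonds3 = {frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((1, 3)): 2,
          frozenset((4, 5)): 1, frozenset((4, 6)): 1, frozenset((6, 7)): 1}
sites3 = {0: 1, 2: 0, 5: 2, 7: 1}

atoms_out, bonds_out, sites_out = splice_sites(atoms3, bonds3, sites3)
print(sorted(atoms_out.items()))
print(sites_out)
```

The two Cd placeholders are consumed and a C-N bond replaces them, while the In and Sr sites stay in the result for later splicing with other molecules.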
For special structures obtained after splicing, such as new rings or double bonds, the corresponding bond information is updated in mol3 before it is output in Canonical SMILES format.
A complete example is as follows. As shown in FIG. 2, the first six molecular formulas are spliced into the last molecular formula, where In, Cd, Sr, Cd, Kr, Sc mark the positions of the sites, corresponding to [R0], [R1] … [R5]:
The molecular formulas in the example are:
C([In])(=O)CC1=CC2=C(C=CC=C2)C=C1
[Cd]C([In])=O
C([Sr])N[Cd]
[Kr]C([Sr])=O
C([Sc])N[Kr]
CNC([Sc])=O
C(NC(=O)CNC(=O)C(=O)Cc1cc2c(cccc2)cc1)C(=O)NC
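The example can be replayed with a small self-contained script. The sketch below is an assumption-laden illustration rather than the patented implementation: `parse` is a hypothetical reader for only the SMILES subset used here (single-digit ring closures, no charges, hydrogens or stereo), `splice` pairs the first two atoms that share a placeholder element, and the final comparison checks element counts and bond count only, not full graph identity.

```python
import re
from collections import Counter

PLACEHOLDERS = {"In", "Cd", "Sr", "Kr", "Sc"}  # site atoms in the example
TOKEN = re.compile(
    r"\[([A-Z][a-z]?)\]|(Cl|Br|[BCNOPSFI]|[bcnops])|([-=#])|(\d)|([()])")


def parse(smiles):
    """Parse a small SMILES subset into ({index: element}, {bond: order})."""
    atoms, bonds, branch_stack, open_rings = {}, {}, [], {}
    prev, order, pos = None, 1, 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        pos = match.end()
        bracket, organic, bond, ring, paren = match.groups()
        if bracket or organic:
            idx = len(atoms)
            atoms[idx] = (bracket or organic).capitalize()  # aromatic c -> C
            if prev is not None:
                bonds[frozenset((prev, idx))] = order
            prev, order = idx, 1
        elif bond:
            order = {"-": 1, "=": 2, "#": 3}[bond]
        elif ring:
            if ring in open_rings:  # ring closure: bond back to the opener
                bonds[frozenset((open_rings.pop(ring), prev))] = order
            else:
                open_rings[ring] = prev
            order = 1
        elif paren == "(":
            branch_stack.append(prev)
        else:
            prev = branch_stack.pop()
    return atoms, bonds


def splice(mol_a, mol_b):
    """Stack mol_b onto mol_a, then join atoms sharing a placeholder element."""
    atoms, bonds = dict(mol_a[0]), dict(mol_a[1])
    offset = max(atoms) + 1  # indices may be sparse after earlier splices
    atoms.update({i + offset: el for i, el in mol_b[0].items()})
    bonds.update({frozenset(i + offset for i in b): o
                  for b, o in mol_b[1].items()})
    while True:
        first_seen, pair = {}, None
        for idx in sorted(atoms):
            element = atoms[idx]
            if element in PLACEHOLDERS:
                if element in first_seen:
                    pair = (first_seen[element], idx)
                    break
                first_seen[element] = idx
        if pair is None:
            return atoms, bonds
        ends = []
        for site in pair:
            site_bond = next(b for b in bonds if site in b)
            ends.append(next(i for i in site_bond if i != site))
            order = bonds.pop(site_bond)  # assumption: reuse the site bond order
            del atoms[site]
        bonds[frozenset(ends)] = order


FRAGMENTS = ["C([In])(=O)CC1=CC2=C(C=CC=C2)C=C1", "[Cd]C([In])=O",
             "C([Sr])N[Cd]", "[Kr]C([Sr])=O", "C([Sc])N[Kr]", "CNC([Sc])=O"]
PRODUCT = "C(NC(=O)CNC(=O)C(=O)Cc1cc2c(cccc2)cc1)C(=O)NC"


def summary(mol):
    """Element counts and number of bonds: a coarse fingerprint only."""
    return Counter(mol[0].values()), len(mol[1])


mol = parse(FRAGMENTS[0])
for fragment in FRAGMENTS[1:]:
    mol = splice(mol, parse(fragment))
print(summary(mol) == summary(parse(PRODUCT)))
```

Comparing Canonical SMILES strings, as the text describes, would be a stricter check; that requires a full toolkit such as Open Babel and is left out of this sketch.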
It should be noted that the disclosure and the specific embodiments are intended to demonstrate practical applications of the technical solution provided by the present disclosure, and should not be construed as limiting its scope. Any modifications and changes made to the present invention within the spirit of the invention and the scope of the appended claims fall within the protection scope of the invention.

Claims (6)

1. A chemical molecular formula splicing method, characterized in that it is used for splicing chemical molecular formulas smi1 and smi2 that carry sites, wherein the sites to be spliced in a molecular formula are denoted [Ri], i = 0, 1 … N, each [Ri] corresponds to one reaction type, N is the total number of reaction types, and the splicing process is as follows:
(1) reading in the chemical molecular formulas smi1 and smi2 and converting them into graph data structures mol1 and mol2 representing the molecules; the [Ri] site marks receive special treatment during reading: recording, for each site mark, its atom index IDX within the molecule and its reaction type index i; defining a heavy metal mapping table that maps each site [Ri] to a heavy metal atom that appears in none of the molecular formulas before or after splicing; and storing the atom index IDX, the reaction type index i and the heavy metal atom mapped from each [Ri] in the molecular formula in the data structures of mol1 and mol2 respectively;
(2) adding the data structures of mol2 and mol1, that is, stacking the atom, chemical bond and chemical bond attribute information of the two to obtain mol3, wherein the atom index IDX of every atom in mol2, including the atom index IDX of each [Ri] in mol2, is increased by the number of atoms in mol1, forming the preliminary spliced result mol3;
(3) finding a pair of sites in mol3 with the same index i, denoted p and q, finding the atoms ATOMp and ATOMq to which they are respectively bonded, adding a new chemical bond between ATOMp and ATOMq to connect them, and deleting p and q together with the chemical bonds attached to p and q, thereby splicing smi1 and smi2;
(4) returning to step (3) and searching the remaining atoms for site pairs with the same index i until no matching site pair remains;
(5) converting mol3 into the molecular formula smi3 in Canonical SMILES format, looking up the heavy metal mapping table defined in step (1), and replacing any heavy metal atom from the table that is present in smi3 with the corresponding [Ri].
2. The chemical molecular formula splicing method according to claim 1, wherein in step (1) the molecular formulas are read in Canonical SMILES format, and input in any other format is converted into this format.
3. The chemical molecular formula splicing method according to claim 1, wherein in step (1) the graph data structure is a graph formed by atoms connected through chemical bonds, and contains the atoms, chemical bonds and chemical bond attribute information of the molecule.
4. The chemical molecular formula splicing method according to claim 1, wherein in step (4) site pairs with the same index i are searched for among the remaining atoms until no matching site pair remains, at which point all atoms in the mol3 data structure are connected, with no isolated atoms; if any remain, a manual check is performed.
5. The chemical molecular formula splicing method according to claim 1, wherein in step (5) a remaining heavy metal atom indicates an unspliced site, which still has to be spliced with another molecule bearing the same site to form the final compound.
6. The chemical molecular formula splicing method according to claim 1, wherein when splicing produces a new ring or double bond structure, the corresponding bond information is updated in mol3 before it is output in Canonical SMILES format.
CN201910646187.5A 2019-07-17 2019-07-17 Chemical molecular formula splicing method Active CN110390997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910646187.5A CN110390997B (en) 2019-07-17 2019-07-17 Chemical molecular formula splicing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910646187.5A CN110390997B (en) 2019-07-17 2019-07-17 Chemical molecular formula splicing method

Publications (2)

Publication Number Publication Date
CN110390997A (en) 2019-10-29
CN110390997B (en) 2023-05-30

Family

ID=68285020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910646187.5A Active CN110390997B (en) 2019-07-17 2019-07-17 Chemical molecular formula splicing method

Country Status (1)

Country Link
CN (1) CN110390997B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379467B (en) * 2019-07-17 2022-08-19 成都火石创造科技有限公司 Chemical molecular formula segmentation method
CN111223532B (en) * 2019-11-14 2023-06-20 腾讯科技(深圳)有限公司 Method, device, apparatus, medium for determining a reactant of a target compound
CN113140262B (en) * 2021-04-25 2022-05-03 清华大学 Chemical molecule synthesis simulation method and device
CN113140261B (en) * 2021-04-25 2022-05-06 清华大学 Chemical molecule synthesis simulation method and device
CN117133371B (en) * 2023-10-25 2024-01-05 烟台国工智能科技有限公司 Template-free single-step inverse synthesis method and system based on manual key breaking

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06309385A (en) * 1993-01-07 1994-11-04 Akiko Itai Constructing method for molecular structure for ligand having bioactivity
DE19646624A1 (en) * 1995-12-22 1997-07-03 Ibm Identification of test molecules
WO2000060507A2 (en) * 1999-04-02 2000-10-12 Neogenesis, Inc. Analyzing molecule and protein diversity
WO2001050127A2 (en) * 1999-12-30 2001-07-12 7Tm Pharma Screening using biological target molecules with metal-ion binding sites
WO2012083886A1 (en) * 2010-12-24 2012-06-28 北大方正集团有限公司 Method and device for constructing organic chemistry structural formula
CN105985978A (en) * 2015-03-06 2016-10-05 中国科学院上海生命科学研究院 Construction and application of novel RNA cyclization expression vector
CN108304691A (en) * 2018-02-09 2018-07-20 北京矿冶科技集团有限公司 Floating agent molecular design method based on segment
CN109686413A (en) * 2018-12-24 2019-04-26 杭州费尔斯通科技有限公司 A kind of chemical molecular formula search method based on es inverted index

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7369946B2 (en) * 2000-03-29 2008-05-06 Abbott Gmbh & Co. Kg Method of identifying inhibitors of Tie-2
EP2007934A4 (en) * 2006-03-24 2010-06-30 Richard D Cramer Forward synthetic synthon generation and its use to identify molecules similar in 3 dimensional shape to pharmaceutical lead compounds
US9665693B2 (en) * 2012-05-30 2017-05-30 Exxonmobil Research And Engineering Company System and method to generate molecular formula distributions beyond a predetermined threshold for a petroleum stream
US11114184B2 (en) * 2017-02-21 2021-09-07 Albert Einstein College Of Medicine DNA methyltransferase 1 transition state structure and uses thereof

Also Published As

Publication number Publication date
CN110390997A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
CN110390997B (en) Chemical molecular formula splicing method
MacLeod et al. Deduction of probable events of lateral gene transfer through comparison of phylogenetic trees by recursive consolidation and rearrangement
Liu et al. rHAT: fast alignment of noisy long reads with regional hashing
CN109408528A (en) A kind of database script generation method, device, computing device and storage medium
CN111312295B (en) Holographic sound recording method and device and recording equipment
CN108595915A (en) A kind of three generations's data correcting method based on DNA variation detections
CN109472029B (en) Medicine name processing method and device
CN115269006A (en) Machine code instruction conversion method and device, electronic equipment and readable storage medium
CN113407565B (en) Cross-database data query method, device and equipment
CN110619128B (en) Construction method of digital factory
CN112037074B (en) Visualization-based data file analysis method and device
CN110379468B (en) Improved chemical molecular formula segmentation method
CN104536897A (en) Automatic testing method and system based on keyword
CN110379467B (en) Chemical molecular formula segmentation method
CN116489251A (en) Universal code stream analysis method, device, computer readable medium and terminal equipment
Blackwood et al. The Chemical Abstracts Service Chemical Registry System. III. Stereochemistry
CN103838845A (en) Universal Excel data importing implementing method
CN112613894B (en) Method and device for associating source code with product
CN109582692B (en) Carrier rocket test data interpretation method and system based on formal description
CN111414741A (en) Method, device, equipment and medium for making format template of publication
CN114510455A (en) Method for rapidly extracting outsourcing forming data
CN116610345B (en) Application program upgrading method and device based on execution record table
JP2011175454A (en) Device, method and program for estimation of compound reactivity, and storage medium recording the same, and competitive reaction database
CN117076515B (en) Metadata tracing method and device in medical management system, server and storage medium
WO2022130648A1 (en) Information processing program, information processing method, and information processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant