CN110390997B

CN110390997B - Chemical molecular formula splicing method

Info

Publication number: CN110390997B
Application number: CN201910646187.5A
Authority: CN
Inventors: 金霞; 韩瑞峰
Original assignee: Chengdu Firestone Creation Technology Co ltd
Current assignee: Chengdu Firestone Creation Technology Co ltd
Priority date: 2019-07-17
Filing date: 2019-07-17
Publication date: 2023-05-30
Anticipated expiration: 2039-07-17
Also published as: CN110390997A

Abstract

The invention discloses a chemical molecular formula splicing method that splices two chemical molecular formulas carrying splicing sites into a single molecular formula, where the formulas with splicing sites are obtained by removing fixed groups from reagent molecules. When compounds are synthesized from a large number of reagent molecules, molecular formula splicing is an indispensable step. The invention performs the splicing on the graph structure of atoms and chemical bonds, which makes the operation flexible and suitable for a variety of splicing cases, such as splicing into a ring structure, or recording the sites before splicing with different labels and then finding the corresponding sites and splicing them in sequence.

Description

Chemical molecular formula splicing method

Technical Field

The invention belongs to the technical field of compound synthesis, and particularly relates to a chemical molecular formula splicing method.

Background

In compound synthesis, reagent molecules carrying protecting groups are deprotected and then react with one another to give a new compound. From a chemical point of view, both steps are chemical reactions. From a computational point of view, removing a protecting group is a "site cleavage" of the molecule, which yields sites that can react (connect) with other molecules; the subsequent mutual reaction is a "site splicing" between molecules, in which the reactive sites are connected and a new compound is obtained.

The current computational approach is to record the molecules with sites obtained in the first step and to label each site by the reaction that produced it; for example, [R1] denotes a site obtained by reaction type 1. The position of a site is the position of the atoms that were removed from the molecule in the reaction. Molecules carrying the same site label are then spliced: if both molecules contain [R1], they are spliced.

Current computational chemistry tools such as Open Babel and RDKit work by simulating the actual chemical reaction: the compounds to be reacted are input and the reacted compound is obtained. They do not directly support splicing of formulas with sites, and they are not flexible enough for complex cases, such as two molecular formulas that each have several sites to react, or sites that are spliced into a ring structure. The invention implements a splicing method on top of Open Babel that handles these cases more flexibly.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a chemical molecular formula splicing method, that is, splicing two chemical molecular formulas with splicing sites into one chemical molecular formula, where the formulas with splicing sites are obtained by removing fixed groups from reagent molecules. When compounds are synthesized from a large number of reagent molecules, molecular formula splicing is an indispensable step.

The aim of the invention is achieved by the following technical scheme: a chemical formula splicing method for splicing two chemical formulas with sites, smi1 and smi2. The positions to be spliced in a molecular formula, namely the sites, are denoted [Ri], i = 0, 1 … N, where each [Ri] corresponds to a certain reaction type and N is the total number of reaction types. The splicing process is as follows:

(1) The chemical formulas smi1 and smi2 are read in and converted into graph data structures mol1 and mol2 representing the molecules. The site markers [Ri] are handled specially during reading: the atom index IDX of each site in the molecule and its reaction type index i are recorded, a heavy metal mapping table is defined, and each site [Ri] is mapped to a heavy metal atom that does not appear in the molecular formulas before or after splicing. The atom index IDX, the reaction type index i and the heavy metal atoms mapped from all [Ri] in the molecular formula are stored in the data structures of mol1 and mol2 respectively.

(2) The data structures of mol2 and mol1 are added to give mol3.

(3) A pair of sites with the same index i is found in mol3 and denoted p and q. The atoms connected to them, ATOMp and ATOMq, are found, and a new chemical bond is added between ATOMp and ATOMq to connect them. The sites p and q are deleted together with the chemical bonds attached to them, which completes the splicing of smi1 and smi2.

(4) Return to step (3) and search the remaining atoms for site pairs with the same index i, until no matching site pair remains.

(5) mol3 is converted into a molecular formula smi3 in Canonical SMILES format. The heavy metal mapping table defined in step (1) is then queried, and any heavy metal atom from the table that is present in smi3 is replaced with the corresponding [Ri].
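The heavy metal mapping table used in steps (1) and (5) can be sketched as plain string substitution. This is a minimal illustration and not the implementation described by the invention: the table contents `SITE_TO_METAL` and the helper names `encode_sites` and `decode_sites` are assumptions introduced here, and the pairing of each index with a particular element is chosen only for the example.

```python
import re

# Hypothetical mapping table: reaction-type index i -> placeholder element.
# The elements must not occur in any real molecule being spliced.
SITE_TO_METAL = {0: "In", 1: "Cd", 2: "Sr"}
METAL_TO_SITE = {metal: i for i, metal in SITE_TO_METAL.items()}


def encode_sites(smiles: str) -> str:
    """Replace every [Ri] marker with its bracketed placeholder atom."""
    return re.sub(
        r"\[R(\d+)\]",
        lambda m: f"[{SITE_TO_METAL[int(m.group(1))]}]",
        smiles,
    )


def decode_sites(smiles: str) -> str:
    """Replace every placeholder atom left in the output with its [Ri] marker."""
    pattern = r"\[(" + "|".join(METAL_TO_SITE) + r")\]"
    return re.sub(pattern, lambda m: f"[R{METAL_TO_SITE[m.group(1)]}]", smiles)


encoded = encode_sites("[R1]C([R0])=O")
print(encoded)                # [Cd]C([In])=O
print(decode_sites(encoded))  # [R1]C([R0])=O
```

Because the placeholders are ordinary element symbols, a general-purpose SMILES reader can parse the encoded formula without knowing anything about sites; the table is only needed again when the output is written.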

Further, in step (1), the molecular formulas are read in Canonical SMILES format; input in other formats must first be converted into this format.

Further, in step (1), the graph data structure is a graph of atoms connected by chemical bonds, and contains the atoms, the chemical bonds and the chemical bond attribute information of the molecule.

Further, in step (2), adding the data structures of mol2 and mol1 means stacking the atom, chemical bond and chemical bond attribute information of the two. The atom indices IDX of the atoms in mol2, including the indices of its [Ri] sites, are added onto the atom indices of mol1, which forms the preliminary splicing result mol3.
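The stacking in step (2) can be illustrated on a toy graph representation. This is a sketch under stated assumptions: the `Mol` dataclass, its fields and the `stack` function are hypothetical names and not Open Babel's API, and shifting every mol2 index by the number of atoms in mol1 is one reading of how the indices are added.

```python
from dataclasses import dataclass, field


@dataclass
class Mol:
    """Toy molecular graph (hypothetical, for illustration only)."""
    atoms: list[str]                                      # element symbol per atom index
    bonds: list[tuple[int, int, int]]                     # (atom a, atom b, bond order)
    sites: dict[int, int] = field(default_factory=dict)  # site atom index IDX -> i


def stack(mol1: Mol, mol2: Mol) -> Mol:
    """Append mol2 to mol1, shifting every mol2 index by len(mol1.atoms)."""
    offset = len(mol1.atoms)
    atoms = mol1.atoms + mol2.atoms
    bonds = mol1.bonds + [(a + offset, b + offset, o) for a, b, o in mol2.bonds]
    sites = dict(mol1.sites)
    sites.update({idx + offset: i for idx, i in mol2.sites.items()})
    return Mol(atoms, bonds, sites)


# [Cd]C([In])=O : site [R1] at index 0 and site [R0] at index 2
mol1 = Mol(["Cd", "C", "In", "O"], [(0, 1, 1), (1, 2, 1), (1, 3, 2)], {0: 1, 2: 0})
# C([Sr])N[Cd] : site [R2] at index 1 and site [R1] at index 3
mol2 = Mol(["C", "Sr", "N", "Cd"], [(0, 1, 1), (0, 2, 1), (2, 3, 1)], {1: 2, 3: 1})

mol3 = stack(mol1, mol2)
print(mol3.sites)  # {0: 1, 2: 0, 5: 2, 7: 1}
```

At this stage mol3 still consists of two disconnected fragments; no bond between them exists until the site pairs are joined in step (3).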

Further, in step (4), site pairs with the same index i are searched for among the remaining atoms until no matching site pair remains. At this point all atoms in the mol3 data structure should be connected, with no independent atoms; if any independent atoms exist, they are checked manually.
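The check that mol3 contains no independent atoms can be sketched as a connected-components pass over the bond list. The function name `connected_components` and the edge-list representation are assumptions made for this illustration.

```python
from collections import deque


def connected_components(n_atoms: int, bonds: list[tuple[int, int]]) -> list[set[int]]:
    """Return the sets of atom indices that are mutually reachable via bonds."""
    neighbours = {a: set() for a in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen: set[int] = set()
    components = []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Breadth-first search from the first atom not yet assigned.
        component, queue = {start}, deque([start])
        while queue:
            atom = queue.popleft()
            for nxt in neighbours[atom] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        components.append(component)
    return components


# Atoms 0-1-2 are bonded; atom 3 is left over and would need manual checking.
parts = connected_components(4, [(0, 1), (1, 2)])
print(len(parts) == 1)  # False: the structure is not fully connected
```

A result with more than one component flags the case that the text refers to manual checking.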

Further, in step (5), the presence of a heavy metal indicates that an unspliced site remains, and the result must still be spliced with other molecules having the same site to form the final compound.

Further, when splicing produces a new ring or double bond structure, the corresponding bond information must be updated in mol3 before the result is output in Canonical SMILES format.

The beneficial effects of the invention are as follows: the splicing of molecular formulas is performed on the graph structure of atoms and chemical bonds, which makes the operation flexible and suitable for a variety of splicing cases, such as splicing into a ring structure, or recording the sites before splicing with different labels and then finding the corresponding sites and splicing them in sequence.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a splicing example of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The invention provides a chemical formula splicing method for splicing two chemical formulas with sites, smi1 and smi2 (the input format is assumed to be Canonical SMILES; other formats can be converted into it). The specific implementation steps, shown in FIG. 1, are as follows:

The positions to be spliced in a molecular formula (i.e., the sites) are denoted [Ri], i = 0, 1 … N. Each [Ri] corresponds to a certain reaction type and is obtained by subjecting a reagent molecule to a chemical reaction of that type; N is the total number of reaction types. The splicing process is as follows:

1. The molecules smi1 and smi2 in Canonical SMILES format are read in and converted into graph data structures mol1 and mol2 representing the molecules, that is, graphs formed by atoms and chemical bonds. These graph data structures mainly contain the atoms, the chemical bonds and the chemical bond attributes (such as the directionality of double bonds) of the molecule. The site markers [Ri] are handled specially during reading: the atom index IDX of each site in the molecule (that is, the site is the IDX-th atom in the molecule) and its reaction type index i (that is, the i in [Ri]) are recorded, a heavy metal mapping table is defined, and each site [Ri] is mapped to a heavy metal atom that does not appear in the molecular formulas before or after splicing. The atom index IDX, the reaction type index i and the heavy metal atoms mapped from all [Ri] in the molecular formula are stored in the data structures of mol1 and mol2 respectively.

2. The data structures of mol2 and mol1 are added, that is, the atoms, chemical bonds, chemical bond attributes and other information of the two are stacked to obtain mol3. The atom indices IDX of the atoms in mol2, including the indices of its [Ri] sites, are added onto the atom indices of the atoms in mol1. This forms the preliminary splicing result mol3.

3. A pair of sites with the same index i is found in mol3 and denoted p and q. The atoms connected to them, ATOMp and ATOMq, are found, and a new chemical bond is added between ATOMp and ATOMq to connect them. The sites p and q are deleted together with the chemical bonds attached to them, which completes the splicing of smi1 and smi2.

4. Return to step 3 and search the remaining atoms for site pairs with the same index i, until no matching site pair remains. At this point all atoms in the mol3 data structure should be connected, with no independent atoms; if any independent atoms exist, they are checked manually.

5. mol3 is converted into a molecular formula smi3 in Canonical SMILES format. The heavy metal mapping table defined in step 1 is then queried, and any heavy metal atom from the table that is present in smi3 is replaced with the corresponding [Ri].
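Steps 3 and 4 can be sketched on a small stacked graph as a loop that joins one site pair per pass. This is a minimal sketch, not the Open Babel-based implementation: the function name `splice`, the dictionary representation, the assumption that every site atom has exactly one neighbour, and the choice to raise the order of an already existing bond (one possible way of recording a new double bond) are all introduced here for illustration.

```python
def splice(atoms: dict[int, str], bonds: dict[frozenset, int], sites: dict[int, int]) -> None:
    """Repeatedly join site pairs with the same index i, in place (steps 3 and 4)."""
    while True:
        # Step 3: look for two site atoms p, q carrying the same index i.
        by_index: dict[int, int] = {}
        pair = None
        for idx, i in sorted(sites.items()):
            if i in by_index:
                pair = (by_index[i], idx)
                break
            by_index[i] = idx
        if pair is None:
            return  # Step 4: no matching site pair is left.

        p, q = pair
        # Assumption: each site atom has exactly one neighbour.
        (bond_p,) = [b for b in bonds if p in b]
        (bond_q,) = [b for b in bonds if q in b]
        (atom_p,) = bond_p - {p}
        (atom_q,) = bond_q - {q}

        # Delete the site atoms together with their bonds ...
        for site, bond in ((p, bond_p), (q, bond_q)):
            del bonds[bond], atoms[site], sites[site]
        # ... and connect the two real atoms (ATOMp and ATOMq).
        new_bond = frozenset((atom_p, atom_q))
        bonds[new_bond] = bonds.get(new_bond, 0) + 1


# Stacked graph of [Cd]C([In])=O (indices 0-3) and C([Sr])N[Cd] (indices 4-7).
atoms = {0: "Cd", 1: "C", 2: "In", 3: "O", 4: "C", 5: "Sr", 6: "N", 7: "Cd"}
bonds = {
    frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((1, 3)): 2,
    frozenset((4, 5)): 1, frozenset((4, 6)): 1, frozenset((6, 7)): 1,
}
sites = {0: 1, 2: 0, 5: 2, 7: 1}  # site atom index -> reaction-type index i

splice(atoms, bonds, sites)
print(sorted(atoms))  # [1, 2, 3, 4, 5, 6]
print(sites)          # {2: 0, 5: 2}
```

In this run the two Cd sites are joined, which bonds the carbonyl carbon (index 1) to the nitrogen (index 6). The In and Sr sites have no partner and remain in the graph for later splicing with other molecules.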

In step 5, a heavy metal indicates that an unspliced site remains; the final compound is formed by splicing with other molecules having the same site. In practice it is quite common for one molecule to have multiple sites and to require splicing with several other molecules to form the complete compound.

For special structures obtained after splicing, such as new rings and double bonds, the corresponding bond information must be updated in mol3 before the result is output in Canonical SMILES format.

A complete example is shown in FIG. 2: the first six molecular formulas are spliced into the last molecular formula, where In, Cd, Sr, Cd, Kr and Sc mark the positions of the sites, corresponding to [R0], [R1] … [R5].

The molecular formulas in the example are:

C([In])(=O)CC1=CC2=C(C=CC=C2)C=C1

[Cd]C([In])=O

C([Sr])N[Cd]

[Kr]C([Sr])=O

C([Sc])N[Kr]

CNC([Sc])=O

C(NC(=O)CNC(=O)C(=O)Cc1cc2c(cccc2)cc1)C(=O)NC
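One way to sanity-check the example is to count the placeholder atoms: each placeholder should occur in exactly two input formulas (one site pair), and none should remain in the final formula. The sketch below uses string matching only and does not perform the splicing itself; the variable names are illustrative, and the formulas are written with the ASCII `=` bond symbol.

```python
import re
from collections import Counter

fragments = [
    "C([In])(=O)CC1=CC2=C(C=CC=C2)C=C1",
    "[Cd]C([In])=O",
    "C([Sr])N[Cd]",
    "[Kr]C([Sr])=O",
    "C([Sc])N[Kr]",
    "CNC([Sc])=O",
]
product = "C(NC(=O)CNC(=O)C(=O)Cc1cc2c(cccc2)cc1)C(=O)NC"

# In these formulas the only bracket atoms are the site placeholders.
placeholder = re.compile(r"\[([A-Z][a-z]?)\]")
counts = Counter(sym for smi in fragments for sym in placeholder.findall(smi))

print(sorted(counts))                         # ['Cd', 'In', 'Kr', 'Sc', 'Sr']
print(all(n == 2 for n in counts.values()))   # True: every site has a partner
print(placeholder.findall(product))           # []: no unspliced site remains
```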

It should be noted that the disclosure and the specific embodiments are intended to demonstrate practical applications of the technical solution provided by the present disclosure, and should not be construed as limiting its scope. Any modification or change made within the spirit of the invention and the scope of the appended claims falls within the scope of protection of the invention.

Claims

1. A chemical formula splicing method, characterized in that it is used for splicing chemical formulas smi1 and smi2 with sites, wherein the sites to be spliced in a molecular formula are denoted [Ri], i = 0, 1 … N, each [Ri] corresponds to a certain reaction type, N is the total number of reaction types, and the splicing process is as follows:

(1) Reading in the chemical formulas smi1 and smi2 and converting them respectively into graph data structures mol1 and mol2 representing the molecules; handling the site markers [Ri] specially during reading: recording the atom index IDX of the heavy metal in the molecule and the reaction type index i, defining a heavy metal mapping table, and mapping each site [Ri] to a heavy metal atom that does not appear in the molecular formulas before or after splicing; storing the atom index IDX, the reaction type index i and the heavy metal atoms mapped from all [Ri] in the molecular formula in the data structures of mol1 and mol2 respectively;

(2) Adding the data structures of mol2 and mol1, namely stacking the atom, chemical bond and chemical bond attribute information of the two to obtain mol3, wherein the atom indices IDX of the atoms in mol2, including the atom indices IDX of [Ri] in mol2, are added onto the atom indices of the atoms in mol1, to form a preliminary splicing result mol3;

(3) Finding in mol3 a pair of sites with the same index i, denoted p and q, finding the atoms ATOMp and ATOMq connected to them respectively, adding a new chemical bond between ATOMp and ATOMq to connect them, and deleting p and q together with the chemical bonds attached to p and q, thereby splicing smi1 and smi2;

(4) Returning to step (3) and searching the remaining atoms for site pairs with the same index i until no matching site pair remains;

2. The method of claim 1, wherein in step (1) the molecular formulas are read in Canonical SMILES format, and input in other formats is converted into this format.

3. The method according to claim 1, wherein in step (1) the graph data structure is a graph formed by atoms connected by chemical bonds, and the graph data structure includes the atoms, the chemical bonds and the chemical bond attribute information of the molecule.

4. The method according to claim 1, wherein in step (4) site pairs with the same index i are searched for among the remaining atoms until no matching site pair remains, at which point all atoms in the mol3 data structure are connected, with no independent atoms; if any independent atoms exist, they are checked manually.

5. The method according to claim 1, wherein in step (5) the presence of a heavy metal indicates that an unspliced site remains, and the result needs to be spliced with other molecules having the same site to form the final compound.

6. The method of claim 1, wherein when splicing produces a new ring or double bond structure, the corresponding bond information is updated in mol3 before output in Canonical SMILES format.