WO2021031549A1

WO2021031549A1 - Method for establishing molecular structure and activity database

Info

Publication number: WO2021031549A1
Application number: PCT/CN2020/077657
Authority: WO
Inventors: 牛春意; 方磊; 徐旻; 温晓明; 齐珍珍; 张佩宇; 马健; 温书豪; 赖力鹏
Original assignee: 深圳晶泰科技有限公司
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2021-02-25

Abstract

A method for establishing a molecular structure and activity database, comprising: searching a compound database to obtain all compounds related to selected targets, recording relevant information of the compounds, and converting external data into a standardized format according to requirements; checking the data to ensure the accuracy of the data; uploading the checked data to a MongoDB database via a stored temporary file; a user sending a retrieval request to a data retrieval module via an SDK, selecting a specific target according to requirements of the user, and extracting all data including the target; calling a structure-activity analysis module in Jupyter, and according to a core structure inputted by a user and a similarity requirement, performing sub-structure matching and similarity comparison calculations on the structure and structures in the database. The method is suitable for computer-aided drug design and drug screening such as virtual screening, and implements semi-automated data collection and data cleaning to generate a standardized database.

Description

Method of establishing molecular structure and activity database

Technical field

The invention belongs to the technical field of data processing, and specifically relates to a method for establishing a molecular structure and activity database. It is mainly applied in new drug research and development, and provides good data support for applications in computer-aided drug design and virtual screening.

Background art

Drug screening is the initial stage and a key step of drug discovery, and it occupies an important position in the discovery of new drugs. However, traditional screening experiments are often slow and costly. With the development of computer technology, virtual screening has therefore gradually emerged. Developing and optimizing virtual screening methods and applying them to real scenarios requires a large amount of high-quality data, including diverse compound structures and unified, accurate activity data. At present, the commonly used databases containing such data are mainly the public molecular database ChEMBL and paid databases. Meanwhile, in drug design, structure-activity analysis across different compounds acting on the same target is very important. For a single target, however, there are often many patents and literature reports containing compound structure and activity data. Analyzing and sorting these data is time-consuming and laborious, and no suitable analysis software on the market can analyze and interpret them quickly.

Existing databases often have the following drawbacks:

(1) The data in public databases is not updated in time, while the development of new drugs is a process of continuous change. A data delay of one or two years may therefore miss some very important information, which often affects the accuracy of calculations.

(2) Compared with the public database, the data of the paid database is updated more timely, but there are often too many parameters to be used directly, and further cleaning is required.

(3) The data formats of databases collected from different sources often differ, so merging them requires a great deal of data cleaning and sorting, which wastes considerable time and labor.

(4) A single database cannot verify the accuracy of the data, and it is difficult to ensure the accuracy of the data.

(5) Existing databases lack structure-activity relationship analysis for different drug molecules acting on the same target, which hinders later use of such data.

Summary of the invention

In view of the above technical problems, the present invention provides a method for establishing a molecular structure and activity database, which is applied to data collection and cleaning in the drug design process of new drug development. The method mainly includes collecting data from existing databases to construct the data source to be used, and then cleaning that data source with tool scripts to extract the useful data. On the basis of the established database, the data for a given target is extracted, and a simple structure-activity analysis is performed by calling Jupyter scripts with user input, providing analysis ideas for subsequent drug design work.

The technical solutions adopted are:

The method of establishing molecular structure and activity database includes the following steps:

(1) Data collection

Search the compound database to obtain all the compounds related to the selected target, and record the relevant information of each compound. Data is collected mainly through automatic collection and active upload, and the collected data is uploaded to a temporary file.

(1.1) Automatic collection is mainly from the open source database ChEMBL. First, the UniProt ID of the selected target is determined. The ID identifies the target accurately and uniquely, and Python web crawler technology is then used to automatically collect and generate the original data.

(1.2) Active upload is mainly for paid databases. Python web crawler technology cannot be used with this kind of database. The data can only be downloaded manually and then uploaded from the local machine.

(2) Data cleaning

Whether the data is automatically collected or actively uploaded, different data sources yield different data parameters. In addition, not all of the collected data is needed, and the data contains errors, so the data is cleaned to obtain unified, standardized data. The data cleaning module converts the external data into a standardized format as required. The main cleaning criteria are:

A. Different data cleaning modules are called according to the database from which the original data was obtained. The data cleaning module calls the corresponding interpreter according to the different data content and tag type.

B. The interpreters include a molecular structure data interpreter, a molecular experimental activity data interpreter, etc.

C. Use Jupyter to call the filter module to filter out some molecules that do not meet the criteria. The screening criteria mainly include the molecular activity test method (enzyme activity or cell activity), the molecular activity expression method (whether it is an accurate value), and the data source.

D. The interpreter matches the data one by one against the specified standardized format. If the match is successful, the data is stored in the corresponding data structure in memory.
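Cleaning criteria A to D can be illustrated with a minimal Python sketch. It is not the implementation described in this application: the raw field names (`canonical_smiles`, `IC50_uM`, and so on), the standardized record layout, the two interpreter functions, and the `INTERPRETERS` registry are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional

def interpret_public(raw: dict) -> Optional[dict]:
    """Interpreter for rows from a public-database export (assumed field names)."""
    try:
        return {
            "smiles": raw["canonical_smiles"],
            "assay": raw["assay_type"].lower(),    # "enzyme" or "cell"
            "relation": raw["standard_relation"],  # "=", ">", "<"
            "value_nM": float(raw["standard_value"]),
            "source": "public",
        }
    except (KeyError, TypeError, ValueError, AttributeError):
        return None  # the row does not match the standardized format

def interpret_paid(raw: dict) -> Optional[dict]:
    """Interpreter for a manually uploaded export with micromolar values (assumed)."""
    try:
        return {
            "smiles": raw["SMILES"],
            "assay": raw["Assay"].lower(),
            "relation": raw.get("Qualifier", "="),
            "value_nM": float(raw["IC50_uM"]) * 1000.0,  # convert uM to nM
            "source": "paid",
        }
    except (KeyError, TypeError, ValueError, AttributeError):
        return None

# Criterion A: the interpreter is chosen according to the data source.
INTERPRETERS: dict[str, Callable[[dict], Optional[dict]]] = {
    "public": interpret_public,
    "paid": interpret_paid,
}

def clean(raw_rows, source, allowed_assays=("enzyme", "cell")):
    """Return the standardized records that pass the screening criteria."""
    interpreter = INTERPRETERS[source]
    cleaned = []
    for raw in raw_rows:
        record = interpreter(raw)  # criterion D: match against the format
        if record is None:
            continue
        # Criterion C: keep accepted assay types and exact values only.
        if record["assay"] not in allowed_assays or record["relation"] != "=":
            continue
        cleaned.append(record)
    return cleaned
```

Keeping one interpreter per source means a new data source only needs a new entry in the registry, while the screening step stays unchanged.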

(3) Data verification

Most of the data in existing databases is obtained by capturing information from the literature through image or keyword recognition, so errors may arise during data generation and data storage. The accuracy of the data is therefore also ensured by cross-checking the data across different databases.

(3.1) After data cleaning, call the data verification module, and transfer the data to be verified from the cleaning module system to the data verification module.

(3.2) In the verification module, check the data one by one. First determine the data type, then read the verification rules that apply to that type. For the same molecule, there may be multiple pieces of data for the same activity test type. If the difference between them does not exceed the specified range, take the average value; if the difference exceeds the specified range, output a prompt and download the source literature of the data for manual inspection.

(3.3) Match the data to be verified one by one according to the verification rules. After the verification is completed, the data that passes the verification will be persisted by the module to the temporary file system.
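The averaging rule in step (3.2) can be sketched as below. The text does not state how the "difference" between values is measured, so this sketch assumes a fold-change ratio (`max_fold`) between the largest and smallest positive value; the record fields and the function name `verify` are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean

def verify(records, max_fold=3.0):
    """Merge repeated measurements of the same molecule and assay type.

    Values whose max/min ratio stays within `max_fold` (an assumed threshold)
    are averaged; otherwise the group is flagged for manual inspection.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["mol_id"], rec["assay"])].append(rec)

    passed, flagged = [], []
    for (mol_id, assay), recs in groups.items():
        values = [r["value_nM"] for r in recs]
        if max(values) <= max_fold * min(values):
            passed.append({"mol_id": mol_id, "assay": assay,
                           "value_nM": mean(values)})
        else:
            # Report the conflicting values together with their source
            # references so the original literature can be checked by hand.
            flagged.append({"mol_id": mol_id, "assay": assay,
                            "values_nM": values,
                            "references": sorted({r["reference"] for r in recs})})
    return passed, flagged
```

The `passed` list corresponds to the data that would be persisted to the temporary file system in step (3.3), while `flagged` holds the cases that need manual review.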

(4) Data retrieval

Upload the temporary files that passed verification to the MongoDB database for subsequent use. Users can send a retrieval request to the data retrieval module through the SDK; the request includes the data table, molecular structure, fields, and query conditions to be queried. The data retrieval module converts the request into a query statement that the database can recognize and accesses the database to get the result. The result is returned to the data retrieval module and then passed to the user SDK to complete the retrieval.
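As an illustration of converting a retrieval request into a statement the database can recognize, the sketch below maps a request to a MongoDB filter document and projection. The request layout (`table`, `fields`, `conditions`), the short operator names, and the collection and field names are assumptions made for this example; only the `$eq`, `$lt`, `$lte`, `$gt`, `$gte`, and `$in` operators are standard MongoDB. Structure-based queries are not covered here.

```python
def build_query(request: dict):
    """Translate an SDK-style request into (collection, filter, projection)."""
    operators = {"eq": "$eq", "lt": "$lt", "lte": "$lte",
                 "gt": "$gt", "gte": "$gte", "in": "$in"}
    mongo_filter = {}
    for field, op, value in request.get("conditions", []):
        # Several conditions on one field are merged into one sub-document.
        mongo_filter.setdefault(field, {})[operators[op]] = value
    projection = {name: 1 for name in request.get("fields", [])}
    return request["table"], mongo_filter, projection

# Hypothetical request: enzyme-assay data for one target with IC50 <= 100 nM.
request = {
    "table": "activities",
    "fields": ["mol_id", "smiles", "value_nM"],
    "conditions": [
        ("target", "eq", "O75874"),
        ("assay", "eq", "enzyme"),
        ("value_nM", "lte", 100),
    ],
}
collection, mongo_filter, projection = build_query(request)
# With pymongo this could be run as: db[collection].find(mongo_filter, projection)
```

Keeping the translation in one function means the SDK never exposes raw database syntax to the user, which matches the request and response flow described above.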

(5) Structure-activity analysis

According to the needs of the user, a specific target can be selected through the above data retrieval method and all the data for that target can be extracted. The structure-activity analysis module is then called in Jupyter, and substructure matching and similarity comparison calculations are performed between the input structure and the structures in the database, according to the core structure and similarity requirement input by the user.

(5.1) Perform substructure matching on the molecules in the database by calling the substructure matching module in RDKit, and find all molecules that contain the input structure.

(5.2) Convert the matched molecular structures into molecular fingerprints, and then calculate their Tanimoto similarity to select those that meet the user's requirement.

(5.3) For the compounds that meet the matching requirements, use the side-chain replacement module and the substituent conversion module of the RDKit chemistry toolkit to cut, convert, and classify the substituent groups and substitution sites. Finally, an SAR list is produced so that users can compare and analyze structure and activity.
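The similarity filter in step (5.2) relies on the Tanimoto coefficient, which for two fingerprints A and B is the number of bits set in both divided by the number of bits set in either. The sketch below works on plain Python sets of on-bit indices so that it runs without a chemistry toolkit; the function names and the example fingerprints are hypothetical.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

def filter_by_similarity(query_bits, candidates, threshold):
    """Return (mol_id, similarity) pairs meeting the threshold, best first.

    `candidates` maps a molecule ID to its set of on-bit indices.
    """
    scored = ((mol_id, tanimoto(query_bits, bits))
              for mol_id, bits in candidates.items())
    hits = [(mol_id, score) for mol_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])

query = {1, 2, 3, 4}
candidates = {"A": {1, 2, 3, 4}, "B": {1, 2, 5, 6}, "C": {7, 8}}
hits = filter_by_similarity(query, candidates, threshold=0.3)
```

When fingerprints are generated with RDKit, the same coefficient is available as `DataStructs.TanimotoSimilarity`, so the set-based function above would only be needed for illustration or testing.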

The method for establishing a molecular structure and activity database provided by the present invention has the following technical effects:

The present invention provides a complete, standardized method for establishing an activity database of small molecule inhibitors. It is suitable for drug screening fields such as computer-aided drug design and virtual screening, realizes semi-automatic data collection and data cleaning to generate a standardized database, and at the same time provides a rapid SAR summary for a large number of molecules acting on the same target, which accelerates the entire drug discovery process. It has the following technical advantages:

(1) It realizes the combination of active and automatic data collection. Compared with the existing database, it covers a wider range of documents and data and can provide more data resources.

(2) The automatic integration and mutual verification of multiple database information is realized, and further manual proofreading is added, so the accuracy of the data is higher than that of the existing database.

(3) For the first time, it is proposed to add a structure-activity relationship analysis module of compounds to the database, which can reduce the time for users to analyze large amounts of data.

Description of the drawings

Figure 1 is a flowchart of the present invention.

Detailed description

The technical solutions of the present invention will be further described in detail below through the accompanying drawings and embodiments.

Example 1

In this embodiment, the establishment of an activity database of small molecule inhibitors of isocitrate dehydrogenase 1 (IDH1) is taken as an example. IDH1 oxidizes isocitrate to oxalosuccinic acid and then converts it into α-ketoglutarate, thereby participating in the tricarboxylic acid cycle and regulating energy metabolism in the body. Studies have shown that IDH1 mutations are closely related to glioma, paraganglioma, and acute myeloid leukemia. Therefore, the development of small molecule inhibitors of IDH1 is essential for the treatment of these cancers. Following the process shown in Figure 1, the method of establishing the database mainly includes the following steps:

In step S01, the UniProt ID of IDH1 is determined to be O75874, and the original data of existing molecular structures and activities is collected from the open source database ChEMBL using Python web crawler technology. There are a total of 35948 molecules and 37932 activity data points.

In step S02, the data cleaning module is called to clean and classify the original data, and finally 31267 molecules are obtained. The cleaning process includes:

(2.1) Obtain the molecular fingerprint string through the molecular structure interpreter. The main steps are:

a. Read the molecular structure M (usually in SMILES form) and convert it into a mol object through the Chem.MolFromSmiles() function in RDKit.

b. Calculate the Morgan-type molecular fingerprint of the mol object through GetMorganFingerprint() in RDKit. This yields the molecular fingerprint string and the corresponding molecular ID.

(2.2) Use the molecular activity data interpreter to classify the data from different test methods (in this example, mainly the IC₅₀ of the molecule's inhibition of IDH1 in vitro enzyme activity and the IC₅₀ of growth inhibition of IDH1-mutant cell lines).

(2.3) Call the cleaning module to clean the data after it has been sorted and stored. This mainly involves removing data that does not meet the standard and removing duplicate data.

In step S03, the data verification module is called to further verify the data. Extract the molecular ID and the corresponding molecular fingerprint string, cross-verify the data between different databases, specify the data error threshold, report the data that exceeds the threshold, and manually check the original text to obtain the confirmed data. The final data is obtained by taking the average value.

In step S04, the cleaned and verified data is stored in the database from a temporary file. The storage method is:

By comparing the fingerprint strings of the molecular structures, the similarity of the molecular fingerprints is obtained. Molecules with higher similarity are placed in the same result set, their corresponding activity data is stored in subsets of that result set, and the result sets are then uploaded to the MongoDB database.
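One simple way to place similar molecules in the same result set is a greedy "leader" grouping, sketched below. The text does not specify the grouping algorithm or the similarity cutoff, so the leader strategy, the `threshold` value, and the document layout (`leader_bits`, `members`, `activities`) are all assumptions made for illustration.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def group_by_similarity(molecules, threshold=0.6):
    """Greedy leader grouping.

    A molecule joins the first result set whose leader is at least `threshold`
    similar; otherwise it starts a new result set and becomes its leader.
    """
    result_sets = []
    for mol in molecules:
        member = {"mol_id": mol["mol_id"], "activities": mol["activities"]}
        for rs in result_sets:
            if tanimoto(mol["bits"], rs["leader_bits"]) >= threshold:
                rs["members"].append(member)
                break
        else:
            # No existing result set was similar enough.
            result_sets.append({"leader_bits": mol["bits"], "members": [member]})
    # Sets cannot be stored directly in a document database, so the leader
    # fingerprint is written out as a sorted list of bit indices.
    return [{"leader_bits": sorted(rs["leader_bits"]), "members": rs["members"]}
            for rs in result_sets]
```

Each returned dictionary is a plain document, with activity data nested under its molecule, and could be written to a collection with pymongo's `insert_many`. The grouping depends on input order, which is a known limitation of the leader approach.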

In step S05, data retrieval, the retrieval request uploaded by the user through the SDK is converted into a language the database can recognize. After the database is searched, the required result is obtained and returned to the user.

For the identification of a molecular structure, the structure is converted into a molecular fingerprint string by the molecular structure interpreter, and the atom types and bond connectivity are then compared accordingly to obtain the result set in which the molecular structure is located and its unique molecular ID. Then, according to the user's needs, the enzyme activity, cell activity, and other properties corresponding to that ID are searched, and the respective results are output.

In step S06, structure-activity analysis is performed on the compounds. The analysis is carried out selectively according to user needs. By calling the structure-activity analysis module written in Jupyter, batch structure-activity analysis of the compounds is performed.

(6.1) The user enters a structure of interest, for example structure (1), expressed in SMILES as O=C1CCCN1. Substructure matching selects all compounds containing this structure, and a total of 398 molecules are found to contain it. For structure (2), expressed in SMILES as C1=CC=NN1, substructure matching finds a total of 2323 molecules containing the substructure.

(6.2) For compounds containing this substructure, use the Chem.ReplaceCore() and Chem.GetMolFrags() commands in the RDKit toolkit to cut, convert, and classify the substituents and substitution sites.

(6.3) Annotate each compound with its substitution site, substituent type, structure, activity, and other data, and finally generate an SAR analysis list.

(6.4) Once the preliminary list has been reviewed, the core structure can be further refined, that is, the above process is repeated to obtain a further refined SAR analysis list.

Example 2

This example takes the establishment of an activity database of small molecule inhibitors of poly(ADP-ribose) polymerase 1 (PARP1) as an example. PARP1 is a nuclear enzyme present in eukaryotic cells that catalyzes poly(ADP-ribosyl)ation, one of the important post-translational modifications of proteins. PARP1 accounts for more than 80% of the PARP activity in cells, is widely present in organisms, and plays an important role in physiological processes such as DNA damage repair, gene transcription and expression, and apoptosis. PARP inhibitors mainly block DNA replication through a synthetic lethality mechanism. Currently, they are mainly used in BRCA1/2 mutant tumors and platinum-sensitive recurrent tumors. Following the process shown in Figure 1, the method of establishing the database mainly includes the following steps:

Step S01: Determine the UniProt ID of PARP1 as P09874, and collect the original data of existing molecular structures and activities from the open source database ChEMBL using Python web crawler technology, giving a total of 3331 molecules and 4439 activity data points. A further 6784 molecules, with a total of 10283 activity data points, are obtained from the paid database.

In step S02, the data cleaning module is called to clean and classify the original data, and finally 4,324 molecules are obtained. The cleaning process includes:

(2.2) Use the molecular activity data interpreter to classify the data from different test methods (in this example, mainly the IC₅₀ of the molecule's inhibition of PARP1 in vitro enzyme activity and the IC₅₀ of growth inhibition of BRCA1/2 mutant tumor cell lines).

In step S03, the data verification module is called to further verify the data. Extract the molecular ID and the corresponding molecular fingerprint string, cross-verify the data from different database sources, specify the data error threshold, report the data that exceeds the threshold, and manually check the original text to obtain the confirmed data. For data within the threshold, the final value is obtained by taking the average.

(6.1) The user enters a structure of interest, for example structure (1), expressed in SMILES as O=C1NN=CC2=C1CCCC2. Substructure matching selects all compounds containing this structure, and a total of 623 molecules are found to contain it. For structure (2), expressed in SMILES as C12=C[N]N=C1C=CC=C2, substructure matching finds a total of 482 molecules containing the substructure.

Claims

The method for establishing a database of molecular structure and activity is characterized in that it comprises the following steps:

(1) Data collection

Search from the compound database to obtain all compounds related to the selected target, and record the relevant information of the compound, and upload the collected data to a temporary file;

(2) Data cleaning

The data cleaning module converts external data into a standardized format as required;

(3) Data verification

Ensure the accuracy of the data by verifying the data in different databases;

(4) Data retrieval

Upload the temporary files that passed verification to the MongoDB database for subsequent use;

The user sends a retrieval request to the data retrieval module through the SDK, which includes the data table, molecular structure, fields and query conditions to be queried;

The data retrieval module converts the request into a query statement that the database can recognize and accesses the database to get the result;

The result will be returned to the data retrieval module and then passed to the user SDK to complete the retrieval;

(5) Structure-activity analysis

According to the user's needs, select a specific target through the above-mentioned data retrieval method, and extract all the data for that target; then call the structure-activity analysis module in Jupyter and, according to the core structure and similarity requirement input by the user, perform substructure matching and similarity comparison calculations between the input structure and the structures in the database.

The method for establishing a molecular structure and activity database according to claim 1, characterized in that, in step (1), data is collected mainly through automatic collection and active upload:

(1.1) Automatic collection is mainly from the open source database ChEMBL. First, the UniProt ID of the selected target is determined. The ID identifies the target accurately and uniquely, and Python web crawler technology is then used to automatically collect and generate the original data;

(1.2) Active upload is mainly for paid databases. Python web crawler technology cannot be used with this kind of database. After manual download, the data is uploaded from the local machine.

The method for establishing a molecular structure and activity database according to claim 1, wherein the main cleaning criteria in step (2) are:

A. Different data cleaning modules are called according to the database from which the original data was obtained; the data cleaning module calls the corresponding interpreter according to the different data content and tag type;

B. The interpreters include a molecular structure data interpreter and a molecular experimental activity data interpreter;

C. Use Jupyter to call the screening module to filter out some molecules that do not meet the standards; the screening criteria mainly include the molecular activity test method, the molecular activity expression method and the data source standard;

D. The interpreter matches the data one by one against the specified standardized format. If the match is successful, the data is stored in the corresponding data structure in memory.

The method for establishing a molecular structure and activity database according to claim 1, wherein step (3) data verification mainly includes the following steps:

(3.1) After data cleaning, call the data verification module, and transfer the data to be verified from the cleaning module system to the data verification module;

(3.2) In the verification module, the data is verified one by one; first the data type is determined, and the verification rules that apply to that type are read; for the same molecule, there may be multiple pieces of data for the same activity test type; if the difference between them does not exceed the specified range, take the average value; if the difference exceeds the specified range, output a prompt and download and output the source literature of the data for manual inspection;

(3.3) Match the data to be verified one by one according to the verification rules. After the verification is completed, the data that passes the verification will be persisted by the module to the temporary file system.

The method for establishing a molecular structure and activity database according to claim 1, wherein step (5) mainly includes the following steps:

(5.1) Perform substructure matching on the molecules in the database by calling the substructure matching module in RDKit, and find all molecules that contain the input structure;

(5.2) Convert the matched molecular structures into molecular fingerprints, and then calculate their Tanimoto similarity to select those that meet the user's requirement;

(5.3) For the compounds that meet the matching requirements, use the side-chain replacement module and the substituent conversion module of the RDKit chemistry toolkit to cut, convert, and classify the substituent groups and substitution sites; finally, an SAR list is produced so that users can compare and analyze structure and activity.