CN117854628A

CN117854628A - Configuration method and system of drug development database

Info

Publication number: CN117854628A
Application number: CN202211205105.1A
Authority: CN
Inventors: 倪海洪; 罗子涵
Original assignee: Suzhou Yashen Intelligent Technology Co ltd
Current assignee: Suzhou Yashen Intelligent Technology Co ltd
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2024-04-09
Also published as: WO2024066489A1

Abstract

The invention provides a method for configuring a drug research and development database, comprising: acquiring relevant data from public databases; processing, associating and matching the relevant data; retrieving and displaying the data that a user queries; reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display; aligning the amino acid sequences of multiple targets to show the differences among the sequences intuitively; and processing multiple protein crystal structures so that the structural relationships among the proteins can be displayed visually. The method can improve the efficiency of drug research and development personnel and offer them more research and development ideas. It increases the degree of data integration across a large number of databases, lowers the research and development threshold, and improves research and development efficiency.

Description

Configuration method and system of drug development database

Technical Field

The invention relates to the field of data supervision, in particular to a configuration method and a configuration system of a drug research and development database.

Background

Drug development requires substantial manpower and material resources. In the initial drug design stage in particular, a large amount of data is needed to support the developers' design work. Because of the specialized nature of medicinal chemistry, the data relevant to drug design are scattered across many public databases, which makes them hard for researchers to search and use. In addition, the tools that developers use for drug design are usually separate client applications; these programs do not interoperate easily, and using them requires a certain level of technical skill.

Improving the degree of data integration across a large number of databases, lowering the research and development threshold, and improving research and development efficiency have therefore long been a research focus and a technical challenge in the field.

Disclosure of Invention

In view of the foregoing, the present invention has been made to provide a method and system for configuring a drug development database that overcomes or at least partially solves the foregoing problems.

According to an aspect of the present invention, there is provided a method of configuring a drug development database comprising:

acquiring relevant data from a public database;

processing, correlating and matching the related data;

retrieving and displaying the data that the user queries;

reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

aligning the amino acid sequences of multiple targets to show the differences among the sequences intuitively;

and processing multiple protein crystal structures so that the structural relationships among the proteins can be displayed visually.

Optionally, the related data specifically includes:

drug data, target data, protein crystal structure data, indication data, bioactivity data, lead compound data, and mutation data.

Optionally, the processing method of the drug data includes:

finding, in each drug data table, the data that defines the drug type, selecting the drugs labeled as small molecules, and storing the other drug types separately;

for the small-molecule drugs, finding the data that defines the drug structure, and using SMILES directly as the means of identification;

importing the SMILES into the open-source module RDKit, and using RDKit to convert the SMILES into a uniform RDkit_SMILES;

comparing all RDkit_SMILES directly, identifying drugs from different data sources that share the same RDkit_SMILES as the same drug, and merging their data;

and obtaining the DRUGBANK ID by matching against the data of the DRUGBANK database, and using the DRUGBANK ID as the primary key of the data table for association with other data tables.

Optionally, the processing method of the target data includes:

obtaining classification data of targets from a target data table;

classifying the targets according to a classification mode;

for targets lacking classification information, marking the classification as TBD, and waiting for other modes to confirm classification;

and merging the target data by Uniprot ID, and using the Uniprot ID as the unique primary key for association with other data tables.

Optionally, the processing method of the protein crystal structure data comprises the following steps:

extracting data from each protein three-dimensional data file to obtain basic information of the protein three-dimensional data files;

ignoring PDB files whose HEADER indicates that they do not describe a protein, and retaining only the three-dimensional data files that describe proteins;

acquiring Uniprot ID in the detailed information of each protein three-dimensional data file for correlation with target data;

and using the PDB ID of each protein three-dimensional data file as the primary key for association with other data tables.

Optionally, the method for processing indication data includes:

matching the indication data obtained from the database by synonym, and merging indications that refer to the same condition;

associating the indications with the drug data by DRUGBANK ID, and with the clinical trial information in the ClinicalTrials database by NCT Number.

Optionally, the method for processing the biological activity data comprises the following steps:

acquiring bioactivity data from a database, wherein the bioactivity data comprises compound data, target data, and the experimental result data between the compound and the target;

and associating the compound data and the target data with the drug data and the target data, respectively, through SMILES and Uniprot ID, to facilitate subsequent retrieval.

Optionally, the method for processing the lead compound data specifically includes:

acquiring all compound data from the bioactivity test data, screening the compound data, and selecting the records whose data type and data value meet the requirements;

identifying the SMILES of the selected records, and merging the records that belong to the same molecule;

and, after matching the molecules against the CHEMBL database, using the CHEMBL ID as the primary key for association with other data.

Optionally, the method for processing mutation data specifically includes:

obtaining mutation data from a database, and classifying it into disease-related mutations and ligand-related mutations;

for disease-related mutations, associating the Uniprot ID with the disease name in addition to the mutation site information;

for ligand-related mutations, associating the Uniprot ID with the ligand information;

and, after the data has been organized by Uniprot ID, associating it with the targets through the Uniprot ID.

The invention also provides a configuration system of the drug development database, which comprises:

the data acquisition module is used for acquiring related data from the public database;

the data processing module is used for processing, correlating and matching the related data;

the retrieval matching module is used for retrieving and displaying the data which the user needs to query;

the ligand display module is used for reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

the sequence alignment module is used for aligning the amino acid sequences of a plurality of targets and intuitively displaying the difference between the sequences;

and the structure alignment module is used for processing a plurality of protein crystal structures and enabling the structural relationship among the proteins to be visually displayed.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a method for configuring a drug development database according to an embodiment of the present invention;

FIG. 2 is a block diagram of a configuration system of a drug development database according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for processing drug data according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for processing target data according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for processing protein crystal structure data according to an embodiment of the present invention;

FIG. 6 is a flowchart of a method for processing indication data according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for processing bioactivity data according to an embodiment of the present invention;

FIG. 8 is a flow chart of a method for processing lead compound data according to an embodiment of the present invention;

FIG. 9 is a flowchart of a method for processing mutation data according to an embodiment of the present invention;

FIG. 10 is a flow chart of a drug search provided by an embodiment of the present invention;

FIG. 11 is a flowchart of target searching provided by an embodiment of the present invention;

FIG. 12 is a flow chart of an indication search provided by an embodiment of the present invention;

FIG. 13 is a flow chart of a lead compound search provided in an embodiment of the present invention;

FIG. 14 is a flow chart of a ligand display provided by an embodiment of the present invention;

FIG. 15 is a flow chart of sequence alignment provided by an embodiment of the present invention;

FIG. 16 is a flow chart of structure alignment provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The terms "comprising" and "having", and any variations thereof, in the description, the claims and the drawings of the invention are intended to cover a non-exclusive inclusion, for example the inclusion of a series of steps or elements.

The technical scheme of the invention is further described in detail below with reference to the accompanying drawings and the examples.

As shown in FIG. 1, a method for configuring a drug development database includes:

acquiring relevant data from a public database;

processing, associating and matching the relevant data;

retrieving and displaying the data that the user queries;

As shown in FIG. 2, a configuration system of a drug development database includes:

The invention covers the following types of data: drug data, target data, protein crystal structure data, indication data, bioactivity data, lead compound data and mutation data.

The data integration method of the invention is as follows:

As shown in FIG. 3, drug data: the data that defines the drug type (the column name is usually Drug Type) is found in each drug data table; drugs labeled as Small molecule are selected, and the other drug types are stored separately. For the small-molecule drugs, the data that defines the drug structure is found, and SMILES is used directly as the means of identification. The SMILES is imported into the open-source module RDKit, which converts it into a uniform RDkit_SMILES. All RDkit_SMILES are compared directly; drugs from different data sources that share the same RDkit_SMILES are identified as the same drug, and their data are merged. The DRUGBANK ID is obtained by matching against the data of the DRUGBANK database and is used as the primary key of the data table for association with other data tables.
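The merge step for drug data can be sketched as follows. This is a minimal illustration, not the implementation disclosed here: the record fields (`smiles`, `name`, `source`) and the function names are hypothetical, and the canonicalizer is passed in as a parameter so that the grouping logic can be exercised without RDKit installed. The default canonicalizer relies on the RDKit calls `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
from collections import defaultdict


def rdkit_canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    from rdkit import Chem  # third-party dependency, imported only when used

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def merge_drug_records(records, canonicalize=rdkit_canonical_smiles):
    """Group drug records from several sources by canonical SMILES.

    Each record is a dict with hypothetical keys 'smiles', 'name', 'source'.
    Records that canonicalize to the same string are treated as one drug.
    """
    merged = defaultdict(lambda: {"names": set(), "sources": set()})
    for rec in records:
        key = canonicalize(rec["smiles"])
        if key is None:
            continue  # unparsable structure; a real pipeline would log this
        merged[key]["names"].add(rec["name"])
        merged[key]["sources"].add(rec["source"])
    return dict(merged)
```

Keeping the canonicalizer separate from the grouping means the same merge routine could be reused if a different normalization were chosen later.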

As shown in FIG. 4, target data: the classification data of the targets is obtained from each target data table, and the targets are classified according to the classification scheme (Class A, Class B, Class C, Class D and Class F). For targets that lack classification information, the classification is labeled "TBD" until it can be confirmed by other means. All target data are merged by Uniprot ID, which serves as the unique primary key for association with other data tables.
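A minimal sketch of the target merge described above, assuming each input row is a dict with the hypothetical keys `uniprot_id`, `name` and an optional `class`. The "TBD" placeholder comes from the text; the rule that the first non-empty class wins is an assumption made for the example.

```python
def merge_targets(rows):
    """Merge target rows from several tables, keyed by Uniprot ID."""
    targets = {}
    for row in rows:
        uid = row["uniprot_id"]
        entry = targets.setdefault(
            uid, {"uniprot_id": uid, "class": "TBD", "names": set()}
        )
        entry["names"].add(row["name"])
        cls = row.get("class")
        # Assumption: keep the first classification seen; later rows do not override it.
        if cls and entry["class"] == "TBD":
            entry["class"] = cls
    return targets
```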

As shown in FIG. 5, protein crystal structure data: the protein crystal structure data is extracted from each PDB file to obtain the basic PDB information. PDB files whose HEADER indicates that they do not describe a protein are ignored, and only those that describe proteins are retained. The Uniprot ID is taken from the detailed information of each PDB file for association with the target data. The PDB ID of each PDB file serves as the primary key for association with other data tables.
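The extraction step can be sketched as below, under stated assumptions. The column positions follow the published PDB file format (HEADER classification in columns 11 to 50 and ID code in columns 63 to 66; DBREF records naming the `UNP` database followed by the accession). The text does not say how a non-protein HEADER is recognized, so the `NON_PROTEIN_CLASSES` set is a hypothetical rule for illustration only.

```python
# Assumption: classifications treated as "not a protein"; the source does not list them.
NON_PROTEIN_CLASSES = {"DNA", "RNA", "DNA-RNA HYBRID"}


def parse_pdb_summary(text):
    """Return basic info for a protein PDB file, or None if it should be ignored."""
    classification, pdb_id, uniprot_ids = None, None, []
    for line in text.splitlines():
        if line.startswith("HEADER"):
            classification = line[10:50].strip()  # columns 11-50
            pdb_id = line[62:66].strip()          # columns 63-66
        elif line.startswith("DBREF "):
            fields = line.split()
            if "UNP" in fields:
                idx = fields.index("UNP")
                # The accession follows the database name in a DBREF record.
                if idx + 1 < len(fields) and fields[idx + 1] not in uniprot_ids:
                    uniprot_ids.append(fields[idx + 1])
    if classification is None or classification.upper() in NON_PROTEIN_CLASSES:
        return None
    return {
        "pdb_id": pdb_id,
        "classification": classification,
        "uniprot_ids": uniprot_ids,
    }
```

The returned `pdb_id` would serve as the primary key, and `uniprot_ids` would provide the link to the target table.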

As shown in FIG. 6, indication data: the indication data obtained from the database are matched by synonym, and indications that refer to the same condition are merged. The indications are associated with the drug data by DRUGBANK ID, and with the clinical trial information in the ClinicalTrials database by NCT Number.

As shown in FIG. 7, bioactivity data: bioactivity data is obtained from a database and comprises the compound data, the target data, and the experimental result data between the compound and the target. The compound data and the target data are associated with the drug data and the target data, respectively, through SMILES and Uniprot ID, to facilitate subsequent retrieval.

As shown in FIG. 8, lead compound data: all compound data are obtained from the bioactivity test data and screened; only records whose data type is Ki, Kd, IC50 or EC50 and whose data value does not exceed 1000 nM are selected. The SMILES of the selected records are identified, and records that belong to the same molecule are merged. After the molecules are matched against the CHEMBL database, the CHEMBL ID is used as the primary key for association with other data.
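The screening rule for lead compounds can be sketched as follows. The allowed data types and the 1000 nM limit come from the text; the record keys (`type`, `value_nm`, `smiles`) are hypothetical, values are assumed to be already expressed in nM, and the identity canonicalizer is a stand-in for whatever SMILES normalization is applied.

```python
ALLOWED_TYPES = {"Ki", "Kd", "IC50", "EC50"}
MAX_VALUE_NM = 1000.0


def select_lead_compounds(activities, canonicalize=lambda s: s):
    """Keep activity records that pass the type and value filter, grouped by molecule."""
    leads = {}
    for act in activities:
        if act["type"] not in ALLOWED_TYPES:
            continue
        if act["value_nm"] > MAX_VALUE_NM:  # "not exceeding 1000 nM" keeps exactly 1000
            continue
        key = canonicalize(act["smiles"])
        leads.setdefault(key, []).append(act)
    return leads
```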

As shown in FIG. 9, mutation data: mutation data is obtained from the database and classified into disease-related mutations and ligand-related mutations. For disease-related mutations, the Uniprot ID must be associated with the disease name in addition to the mutation site information. For ligand-related mutations, the Uniprot ID must be associated with the ligand information. After the data has been organized by Uniprot ID, it is associated with the targets through the Uniprot ID.

The functional modules of the invention include a retrieval matching module, a ligand display module, a sequence alignment module and a structure alignment module.

Retrieval matching module:

As shown in FIG. 10, drug search: the user inputs a SMILES or a drug name to search the drug data; the background uses the DRUGBANK ID in the drug data to match the related target data (Uniprot ID), protein crystal structure data (PDB ID) and indication data, and all of the data are combined and displayed.

As shown in FIG. 11, target search: the user inputs a Uniprot ID or a target name to search the target data; the background uses the Uniprot ID in the target data to match the related drug data (DRUGBANK ID), protein crystal structure data (PDB ID) and mutation data (Uniprot ID), and finally displays them together.

As shown in FIG. 12, indication search: the user inputs an indication name, which is matched against the indication names in the database, and the drug data associated with the matching indication is then retrieved.

As shown in FIG. 13, lead compound search: the user inputs a SMILES or a CHEMBL ID to search the lead compound data; the CHEMBL ID in the lead compound data is matched to the Uniprot ID of the corresponding target data, the other data of that target is read, and the result is finally displayed.
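The searches above all follow the same pattern: locate a row by a user-supplied value, then follow the primary keys to the related tables. A minimal sketch of the drug search using Python's built-in sqlite3 module is given here. The table names, column names and demo rows are hypothetical placeholders, not the schema or data of the invention.

```python
import sqlite3

# Hypothetical schema: one table per data type, linked by the IDs named in the text.
SCHEMA = """
CREATE TABLE drug (drugbank_id TEXT PRIMARY KEY, name TEXT, smiles TEXT);
CREATE TABLE target (uniprot_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE crystal_structure (pdb_id TEXT PRIMARY KEY, uniprot_id TEXT);
CREATE TABLE drug_target (drugbank_id TEXT, uniprot_id TEXT);
"""


def demo_connection():
    """Build an in-memory database with placeholder rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO drug VALUES ('DB_DEMO', 'demo-drug', 'CCO')")
    con.execute("INSERT INTO target VALUES ('U_DEMO', 'demo-target')")
    con.executemany(
        "INSERT INTO crystal_structure VALUES (?, ?)",
        [("1XXX", "U_DEMO"), ("2XXX", "U_DEMO")],
    )
    con.execute("INSERT INTO drug_target VALUES ('DB_DEMO', 'U_DEMO')")
    return con


def search_drug(con, query):
    """Find a drug by name or SMILES and join its targets and crystal structures."""
    sql = """
        SELECT d.drugbank_id, dt.uniprot_id, s.pdb_id
        FROM drug AS d
        LEFT JOIN drug_target AS dt ON dt.drugbank_id = d.drugbank_id
        LEFT JOIN crystal_structure AS s ON s.uniprot_id = dt.uniprot_id
        WHERE d.name = ? OR d.smiles = ?
        ORDER BY dt.uniprot_id, s.pdb_id
    """
    return con.execute(sql, (query, query)).fetchall()
```

The target, indication and lead compound searches could be written the same way by starting the join from a different table and key.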

As shown in FIG. 14, the ligand display module: the user selects a protein crystal structure in the PDB display plug-in, and the system displays a list of the ligands present in that crystal structure. The user then selects a ligand and enters a radius. On receiving these three pieces of information (the crystal structure, the ligand and the radius), the system reads the protein crystal structure file from the database, performs a calculation based on the three parameters, and loads the resulting amino acid residues into the PDB display plug-in for highlighting.
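The text does not spell out the calculation. One plausible reading, sketched here as an assumption, is that the system selects every amino acid residue with at least one atom within the given radius of any ligand atom. The parser below reads the fixed-column ATOM and HETATM records of the PDB format; the function names are hypothetical.

```python
import math


def parse_atoms(pdb_text):
    """Read ATOM/HETATM records using the fixed PDB column layout."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "hetero": line.startswith("HETATM"),
                "res_name": line[17:20].strip(),  # columns 18-20
                "chain": line[21],                # column 22
                "res_seq": int(line[22:26]),      # columns 23-26
                "xyz": (
                    float(line[30:38]),           # columns 31-38
                    float(line[38:46]),           # columns 39-46
                    float(line[46:54]),           # columns 47-54
                ),
            })
    return atoms


def residues_near_ligand(pdb_text, ligand_name, radius):
    """Return sorted (chain, residue number, residue name) within radius of the ligand."""
    atoms = parse_atoms(pdb_text)
    ligand = [a["xyz"] for a in atoms if a["hetero"] and a["res_name"] == ligand_name]
    near = set()
    for atom in atoms:
        if atom["hetero"]:
            continue  # only protein residues are highlighted in this sketch
        if any(math.dist(atom["xyz"], p) <= radius for p in ligand):
            near.add((atom["chain"], atom["res_seq"], atom["res_name"]))
    return sorted(near)
```

The returned residue list is what would be handed to the display plug-in for highlighting.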

As shown in FIG. 15, the sequence alignment module: the user inputs the names of multiple targets; the system reads the sequence information of the corresponding targets from the database, performs the calculation, and returns a similarity result and the alignment.
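The text does not name the alignment algorithm. As an illustration only, the sketch below uses a textbook Needleman-Wunsch global alignment for a pair of sequences with a simple match, mismatch and gap score, and reports percent identity as the similarity result. The scoring values are assumptions; a production system would more likely use a substitution matrix and a multiple-sequence aligner.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Share of alignment columns in which both sequences carry the same residue."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

For example, aligning "ACDE" with "ACE" places a gap opposite the D, and the two gapped strings can then be rendered column by column to show where the sequences differ.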

As shown in FIG. 16, the structure alignment module: the user inputs the IDs of multiple protein crystal structures and specifies a Chain ID, a cut-off value and a number of cycles. The system reads the specified protein crystal structures from the database, performs the calculation with these parameters, reports an offset value, and loads the aligned protein crystal structures into the PDB display plug-in in file form.

Beneficial effects: used together, these modules can effectively improve the efficiency of drug research and development personnel and offer them more research and development ideas.

The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the invention is not limited to the particular embodiments disclosed, but is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the invention.

Claims

1. A method of configuring a drug development database, the method comprising:

acquiring relevant data from a public database;

processing, correlating and matching the related data;

searching and displaying the data which the user needs to inquire;

2. The method for configuring a drug development database according to claim 1, wherein the related data specifically comprises:

3. A method of configuring a drug development database according to claim 2, wherein the method of processing drug data comprises:

4. The method for configuring a drug development database according to claim 2, wherein the method for processing target data comprises:

obtaining classification data of targets from a target data table;

classifying the targets according to a classification mode;

5. The method for configuring a drug development database according to claim 2, wherein the method for processing the protein crystal structure data comprises:

6. A method of configuring a drug development database according to claim 2, wherein the method of processing the indication data comprises:

7. The method for configuring a drug development database according to claim 2, wherein the method for processing the bioactivity data comprises:

8. The method for configuring a drug development database according to claim 2, wherein the lead compound data specifically includes:

9. The method for configuring a drug development database according to claim 2, wherein the mutation data specifically comprises:

10. A system for configuring a drug development database, the system comprising: