WO2022077166A1 - Data processing method and system for drug research and development - Google Patents

Data processing method and system for drug research and development Download PDF

Info

Publication number
WO2022077166A1
WO2022077166A1 (PCT/CN2020/120425, CN2020120425W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
stored
warehouse
cleaning
Prior art date
Application number
PCT/CN2020/120425
Other languages
French (fr)
Chinese (zh)
Inventor
吴楚楠
徐旻
张佩宇
马健
温书豪
赖力鹏
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司 filed Critical 深圳晶泰科技有限公司
Priority to PCT/CN2020/120425 priority Critical patent/WO2022077166A1/en
Publication of WO2022077166A1 publication Critical patent/WO2022077166A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Definitions

  • The invention relates to auxiliary methods for drug research and development, and in particular to a data processing method and system for drug research and development.
  • Target-related information can be queried through Uniprot and other websites, and the protein crystal structure information corresponding to the target can be queried and obtained from the PDB database.
  • Drug patent information can be obtained from EPO, WIPO, Google Patents, etc.
  • Drug activity data can be obtained from public data sources such as ChEMBL, PubChem, etc.
  • The collection and organization of comprehensive, rich data and information is particularly important for decision-making, direction-setting, quality, and market success rate in the drug R&D process, and is an indispensable link in drug development.
  • Cleaning the aggregated data generally requires a series of data cleaning steps to obtain information ultimately useful for drug development, such as molecular deduplication, correction of charge and bond-order errors, and handling of chiral molecules. Each update or addition to these processing methods may require recalculating all data collected and cleaned in the past; the large scale and long runtime of such recalculation are the main problems in this part.
  • A data processing method for drug development includes:
  • Data integration: build a variety of data integrators, each using a data access method matched to its data source; acquire the data, serialize it into strings, and push them to the data collection pipeline. The pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Data processing: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored back in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Analysis: analyze the data in the data warehouse and store the analysis results in the knowledge base.
  • The data processing further includes: forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The data processing includes one or more of: plausibility checks on compound data from different sources, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values.
  • The data processing further comprises recalculation: if the processing rules change, the relevant historical raw data is retrieved by its unique identifiers, those identifiers are sent to the data cleaning pipeline through a trigger, the data is reprocessed under the changed rules, and the new clean data is stored in the data warehouse.
  • The data processing further comprises aggregation: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry.
  • The analysis includes: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.
  • The data processing further includes: organizing the processed compound information and its corresponding auxiliary information into a consistent data structure via CSV and storing it in the data warehouse.
  • The compound information includes one or more of: SMILES molecular formula, compound source information, and compound unique identifier. The auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's rule of five, and whether it is commercially available.
  • The data integrator includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
  • The data access method implemented by the API interface integrator is an HTTPS API: data is fetched and parsed according to the API documentation, and the returned content is written as a JSON- or CSV-formatted string and transmitted to the data collection pipeline.
  • The file object integrator implements file-based data access: it downloads data as files through a download interface, completes the download, verifies the integrity of the downloaded file, and sends it to the data collection pipeline.
  • The event object integrator implements event-based data access: it polls the data source for updates at a set interval, compares against the time of the last data acquisition, fetches newly released data via HTTPS API or file download, and sends it to the data collection pipeline.
  • The data stream object integrator implements stream-based data access for sources that offer incremental or paginated acquisition: it records the parameters of the last access and incrementally fetches the next batch.
  • Data integration module: builds a variety of data integrators, each using a data access method matched to its data source; acquires the data, serializes it into strings, and pushes them to the data collection pipeline, which stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Data processing module: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Analysis module: analyzes the data in the data warehouse and stores the analysis results in the knowledge base.
  • The data processing module further includes: performing one or more of plausibility checks, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values on compound data from different sources, forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The above data processing method and system for drug R&D simplify and standardize the connection and integration of multiple data sources and provide several different ways to collect data from sources in different situations; this makes data collection easy to extend and maintain, and also reduces the complexity of subsequent data cleaning, recalculation, and knowledge extraction. By mapping the fields of interest from different data sources to a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, saved, and queried. The data collection pipeline uses asynchronous batch processing to increase the number of data sources the system can handle simultaneously; the collection and cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, while the flexible subscription processing model improves fault tolerance and stability. Customized data cleaning processes, a software development kit, and workflow tools, connected to an interactive data analysis system, improve the system's flexibility for diverse processing needs and its scalability for massive data.
  • The invention is suited to the drug R&D process: it automates the collection, aggregation, cleaning, storage, and analysis of drug targets, drug molecular structures, market competitors, drug patent information, and experimental data, and builds a knowledge base system that assists drug development.
  • The system connects data from different sources; stores, cleans, and recalculates raw data through large-scale data processing and persistence techniques; and then builds a knowledge base for domain problems as needed.
  • Data analysis tools built on the knowledge base give drug R&D personnel convenient drug data aggregation and analysis capabilities, improving R&D efficiency and promoting the development and design of new drug R&D methods.
  • FIG. 1 is a flowchart of a data processing method for drug research and development according to an embodiment of the present invention
  • FIG. 2 is a flow chart of a data processing method for drug research and development according to a preferred embodiment of the present invention.
  • A data processing method for drug research and development includes:
  • Step S101, data integration: build a variety of data integrators, each using a data access method matched to its data source; acquire the data, serialize it into strings, and push them to the data collection pipeline, which stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Step S103, data processing: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Step S105, analysis: analyze the data in the data warehouse and store the analysis results in the knowledge base.
  • The data collection pipeline in this embodiment is implemented as an asynchronous data queue. It acts like a reservoir: even when upstream flow is heavy, data accumulates in the queue and is then processed progressively in batches by downstream subscribers. It can be implemented with a vendor-provided message queue service such as AWS SQS, or with an open-source message queue such as Apache Kafka; the subscription mechanism is provided by the message queue software itself.
  • Serialization in this embodiment refers to writing a data object (for example, structured tabular data containing different information) into a file in a data format such as CSV or TXT; storage is implemented through files. Identification means that when a compound is first stored in the system it is assigned a unique code, obtainable for example via UUID, so that the data has a unique identity as it flows through the system (see the sketch below).
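  • As a concrete illustration of the serialization and identification steps, the following minimal Python sketch writes one record to a CSV string and tags it with a UUID; the field names are assumptions, since the patent only specifies CSV/TXT serialization and UUID-based identity:

```python
import csv
import io
import uuid

def serialize_record(record: dict) -> tuple:
    """Serialize one data object to a CSV string and tag it with a UUID.

    Returns (record_id, csv_string). The example fields are illustrative.
    """
    record_id = str(uuid.uuid4())  # unique identity for in-system flow
    row = {"id": record_id, **record}
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return record_id, buf.getvalue()

rid, payload = serialize_record({"smiles": "CCO", "source": "ChEMBL"})
```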
  • The data integrator in this embodiment includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
  • The asynchronous data queue continuously receives data from the four access methods above. Because data acquisition speed differs from storage I/O speed, a message-queue architecture is used to reduce the load that highly concurrent acquisition places on storage and to improve the system's robustness against storage failures: the collected and summarized data is written into the data warehouse gradually, organized as files. Typical implementations include the open-source Kafka framework or the various message-queue frameworks offered by cloud vendors.
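  • A minimal producer sketch for such a pipeline is shown below, backed here by AWS SQS (Apache Kafka being the open-source alternative named above); the queue URL is a placeholder for illustration:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

def push_to_collection_pipeline(payload: str) -> None:
    # Push a serialized CSV/JSON string into the asynchronous queue;
    # downstream subscribers drain it in batches into the warehouse.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/data-collection",
        MessageBody=payload,
    )
```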
  • The data access method implemented by the API interface integrator in this embodiment is an HTTPS API: data is fetched and parsed according to the API documentation, and the returned content is written as a JSON- or CSV-formatted string and transmitted to the data collection pipeline.
  • For example, a data source website may provide an HTTPS API as its access method, together with the corresponding data query documentation and an API access key.
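  • A hedged sketch of such an API interface integrator follows; the endpoint path, parameter names, and authorization scheme are illustrative, since each data source documents its own:

```python
import json
import requests

def api_integrator(base_url: str, api_key: str, query: dict) -> str:
    """Fetch data over an HTTPS API and return a JSON string ready to be
    pushed to the data collection pipeline."""
    resp = requests.get(
        f"{base_url}/records",                       # illustrative endpoint
        params=query,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()         # parse only successful responses
    return json.dumps(resp.json())  # serialized string for the pipeline
```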
  • The file object integrator of this embodiment implements file-based data access, which generally downloads data as files through the data provider's download interface. The integrator completes the download, verifies the integrity of the downloaded file, and sends the data to the data collection pipeline.
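  • The sketch below illustrates one way a file object integrator might download and verify a file; the MD5-checksum convention is an assumption, as a provider could instead publish SHA-256 digests or file sizes:

```python
import hashlib
import requests

def file_integrator(download_url: str, expected_md5: str) -> bytes:
    """Download a data file and verify its integrity before forwarding it
    to the data collection pipeline."""
    resp = requests.get(download_url, timeout=300)
    resp.raise_for_status()
    data = resp.content
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise IOError(f"integrity check failed for {download_url}")
    return data  # forwarded to the collection pipeline on success
```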
  • The event object integrator of this embodiment implements event-based data access: it polls the data source for updates at a set interval, compares against the time of the last data acquisition, fetches newly released data via HTTPS API or file download, and sends it to the data collection pipeline.
  • Concretely, the integrator polls at a fixed period (for example, daily) for data updates from the source, typically exposed via an API, a data-update message subscription, or website page updates; after comparing with the last acquisition time, newly released data is fetched via HTTPS API or file download and sent to the collection pipeline.
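  • A minimal polling sketch for the event object integrator, assuming source-specific callables for checking and fetching updates:

```python
import time
from datetime import datetime, timezone

def event_integrator(check_latest, fetch_since, poll_seconds=86400):
    """Poll a data source on a fixed period (default: daily) and fetch only
    data newer than the last acquisition. `check_latest` returns the
    source's latest publish time; `fetch_since` fetches data after a given
    time via HTTPS API or file download."""
    last_acquired = datetime.min.replace(tzinfo=timezone.utc)
    while True:
        latest = check_latest()
        if latest > last_acquired:
            payload = fetch_since(last_acquired)
            # ...push payload to the data collection pipeline...
            last_acquired = latest
        time.sleep(poll_seconds)
```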
  • The data stream object integrator of this embodiment implements stream-based data access. Stream-based access extends API-based access: whereas a plain HTTPS API targets full-data downloads, stream processing requires the data provider to offer incremental or paginated acquisition. The integrator records the parameters of the last access so that the next access fetches data incrementally. For example, if the provider offers an HTTPS API with paging parameters (data fetched by page size and page number), the stream object integrator fixes the page size and increments the page number to obtain all available data.
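  • A sketch of that paginated case follows; the parameter names `page` and `size` are illustrative:

```python
import requests

def stream_integrator(base_url: str, page_size: int = 1000):
    """Pull a paginated HTTPS API incrementally: fix the page size, record
    the last page visited, and advance until the source is exhausted."""
    page = 1  # in practice persisted between runs as the last-access parameter
    while True:
        resp = requests.get(base_url, params={"page": page, "size": page_size},
                            timeout=60)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break          # no further data to stream
        yield batch        # each batch goes to the collection pipeline
        page += 1
```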
  • The data processing in this embodiment includes one or more of: plausibility checks on compound data from different sources; exclusion-rule checks (for example, compounds containing certain metal elements must be eliminated); chirality consistency checks for chiral molecules; data supplementation for tautomers; and supplementation of predicted pKa values.
  • The purpose of these steps is to form a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The exclusion-rule check filters input molecules by applying common druglikeness screening rules for small molecules, such as Lipinski's rule of five or the requirement that no special metal elements be present.
  • The chiral molecule check verifies that an input molecule, if chiral, has the correct chirality defined in its input SMILES. If multiple chirality assignments are possible, the molecule can be filtered out, or all possible chiral variants can be generated and saved as needed.
  • pKa prediction means the pKa value of an input molecule can be calculated and saved using common open-source or commercial software, such as ChemAxon.
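  • The sketch below combines several of these checks with the open-source RDKit (which the embodiment later names for exclusion-rule checks): SMILES validity, a forbidden-element exclusion list (an assumption), Lipinski's rule of five, and detection of unassigned chiral centers:

```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

FORBIDDEN_ELEMENTS = {"As", "Hg", "Pb"}  # illustrative exclusion list

def clean_molecule(smiles: str) -> Optional[dict]:
    """Validity check, element exclusion, Lipinski's rule of five, and
    detection of unassigned chiral centers for one input molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # plausibility check failed: unparseable SMILES
    if any(a.GetSymbol() in FORBIDDEN_ELEMENTS for a in mol.GetAtoms()):
        return None  # exclusion rule: contains a forbidden element
    ro5_ok = (Descriptors.MolWt(mol) <= 500
              and Descriptors.MolLogP(mol) <= 5
              and Lipinski.NumHDonors(mol) <= 5
              and Lipinski.NumHAcceptors(mol) <= 10)
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    ambiguous = any(tag == "?" for _, tag in centers)
    return {"smiles": Chem.MolToSmiles(mol),  # canonical form
            "lipinski_ro5": ro5_ok,
            "ambiguous_chirality": ambiguous}
```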
  • Data processing also includes recalculation: if the processing rules change, the relevant historical raw data is retrieved by its unique identifiers, those identifiers are sent to the data cleaning pipeline through a trigger, the data is reprocessed under the changed rules, and the new clean data is stored in the data warehouse.
  • Recalculation mainly refers to obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when cleaning rules change, historically collected raw data must be re-cleaned: the raw data needing cleaning is marked, the IDs of its data files are sent to the data cleaning pipeline through a trigger, and the subsequent process runs fully automatically, storing the newly cleaned data in the data warehouse.
  • Data processing also includes aggregation: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry.
  • Concretely, although data from different sources share the CSV format and the same field attributes after cleaning, many molecules from different sources are actually the same molecule. The first purpose of aggregation is therefore to remove duplicate molecules while preserving source information; the second is to merge inconsistencies in the same attribute field that may arise from information asymmetry across sources. For example, different sources may disagree on whether a compound can be purchased, but the final knowledge base should present consistent information.
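  • One possible aggregation sketch, deduplicating by RDKit canonical SMILES and merging a hypothetical `purchasable` flag with an any-source-says-yes rule; real merge policies would be field-specific:

```python
from rdkit import Chem

def aggregate(records: list) -> list:
    """Deduplicate molecules across sources by canonical SMILES while
    retaining every source, and merge the purchasable flag."""
    merged = {}
    for rec in records:
        key = Chem.MolToSmiles(Chem.MolFromSmiles(rec["smiles"]))  # canonicalize
        entry = merged.setdefault(
            key, {"smiles": key, "sources": [], "purchasable": False})
        entry["sources"].append(rec["source"])  # retain provenance
        entry["purchasable"] = entry["purchasable"] or rec.get("purchasable", False)
    return list(merged.values())
```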
  • The trigger mechanism for data processing in this embodiment closely resembles that of data collection, except that the data source is now a file in the data warehouse.
  • A trigger means that once a file is generated or updated, a message is written into the data cleaning queue; consumers of that queue then receive the message and process the corresponding raw data.
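  • The trigger-and-subscriber interaction could look like the following sketch; the queue class and the warehouse interface are hypothetical stand-ins for the message queue and data warehouse described above:

```python
import queue

class CleaningQueue:
    """In-memory stand-in for the SQS/Kafka-backed cleaning pipeline."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, msg):
        self._q.put(msg)
    def receive(self):
        while not self._q.empty():
            yield self._q.get()

def on_warehouse_write(record_id: str, cleaning_queue: CleaningQueue) -> None:
    # Trigger: a file was generated or updated, so publish its unique ID.
    cleaning_queue.send(record_id)

def run_cleaning_subscriber(cleaning_queue, warehouse, clean_fn) -> None:
    # Consumer: fetch raw content by ID, clean it, store under a new ID.
    # `warehouse` stands for any object with get()/put_new() methods.
    for record_id in cleaning_queue.receive():
        raw = warehouse.get(record_id)  # access raw data via its unique ID
        cleaned = clean_fn(raw)         # e.g., the clean_molecule sketch above
        if cleaned is not None:
            warehouse.put_new(cleaned)  # stored with a new identifier
```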
  • The cleaned compound information (such as SMILES molecular formula, compound source information, and compound unique identifier) and its corresponding auxiliary information (chirality, tautomers, whether it satisfies Lipinski's rule of five, whether it is commercially available, etc.) is organized via CSV into a consistent data structure and stored back into the data warehouse; this is the clean data.
  • The data processing in this embodiment thus organizes the processed compound information and its corresponding auxiliary information into a consistent data structure via CSV and stores it in the data warehouse.
  • The compound information includes one or more of: SMILES molecular formula, compound source information, and compound unique identifier. The auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's rule of five, and whether it is commercially available.
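  • An illustrative clean-data CSV row combining the compound and auxiliary fields above might be written as follows; the column names are assumptions:

```python
import csv

FIELDS = ["compound_id", "smiles", "source", "chirality",
          "tautomer", "lipinski_ro5", "purchasable"]

with open("clean_data.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({"compound_id": "8f14e45f-ceea-467f-a0da-5b51b1f0a6c2",
                     "smiles": "CCO", "source": "ChEMBL",
                     "chirality": "none", "tautomer": "none",
                     "lipinski_ro5": True, "purchasable": True})
```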
  • Clean data in this embodiment is data transformed from the raw data by rules adapted to a specific scenario's requirements. These can be the rules mentioned above (structural plausibility checks, exclusion-rule checks, and chirality consistency checks on compound data from different sources) or data cleaning rules defined by the system's users themselves.
  • The analysis in this embodiment includes data loading and data processing: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.
  • Concretely, data loading and sorting refers to predicting and calculating compound properties for the cleaned and aggregated data through a series of chemical-physics methods implemented by the interactive analysis tools and software development tools, and storing the results together with the compounds in the knowledge base.
  • The knowledge base is a database: part of each record's attribute fields comes from the aggregated information provided by the data sources, and the other part comes from the results of interactive analysis and software calculation. Together with the compound itself, the two parts form one record stored in the database.
  • The interactive analysis tools and software development kit can be used to predict biologically active conformations from the compound SMILES in the data warehouse and to store the possible active conformations in the structure library.
  • The chemical-physics calculations are generally customized to the user's needs; they are defined in the data cleaning process and can be understood as program calls. For example, the exclusion-rule check can be implemented with the open-source software RDKit.
  • The data analysis process and the development tool Jupyter Notebook invoked within it serve the same purpose: to aggregate the data in the data warehouse and store it in the knowledge base.
  • The software development kit, a Python SDK, defines a series of methods for accessing and writing to the knowledge base.
  • The SDK also includes computing tools such as a pKa calculation module, a conformation generation module, and a protonation-site discrimination module. The tools' workflows connect to the on-demand knowledge-extraction workflow, and Jupyter Notebook provides programmable interactive cells so users can chain the required steps together through the Python SDK, with the final data written to the knowledge base.
  • Given a compound SMILES as input, the software development kit outputs several biologically active conformations and stores them in the structure library for subsequent use.
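  • A hypothetical notebook session against such an SDK might look like the sketch below; the module `drug_kb` and its functions are invented for illustration, since the patent does not publish the SDK's actual API:

```python
# Hypothetical module and function names, invented for illustration.
import drug_kb

kb = drug_kb.connect("knowledge-base-endpoint")

# Query compounds similar to an input molecule (aspirin SMILES shown),
# generate bioactive conformations, predict pKa, and write results back.
for compound in kb.query(similar_to="CC(=O)Oc1ccccc1C(=O)O", cutoff=0.7):
    confs = drug_kb.generate_conformers(compound.smiles)  # conformation module
    pka = drug_kb.predict_pka(compound.smiles)            # pKa module
    kb.store(compound.id, conformers=confs, pka=pka)      # into the knowledge base
```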
  • The encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector.
  • Data integration module: builds a variety of data integrators, each using a data access method matched to its data source; acquires the data, serializes it into strings, and pushes them to the data collection pipeline, which stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Data processing module: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Analysis module: analyzes the data in the data warehouse and stores the analysis results in the knowledge base.
  • The data integrator in this embodiment includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
  • The data processing module of this embodiment also includes: performing one or more of plausibility checks, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values on compound data from different sources, forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The data processing module also includes a recalculation unit: if the processing rules change, the relevant historical raw data is retrieved by its unique identifiers, those identifiers are sent to the data cleaning pipeline through a trigger, the data is reprocessed under the changed rules, and the new clean data is stored in the data warehouse.
  • The recalculation unit mainly handles obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when cleaning rules change, historically collected raw data must be re-cleaned: the IDs of the relevant data files are sent to the data cleaning pipeline through a trigger, the subsequent process runs fully automatically, and the newly cleaned data is stored in the data warehouse.
  • The data processing module also includes an aggregation unit: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry. Although data from different sources share the CSV format and the same field attributes after cleaning, many molecules from different sources are actually the same molecule; the first purpose of aggregation is therefore to remove duplicate molecules while preserving source information, and the second is to merge inconsistencies in the same attribute field caused by information asymmetry across sources. For example, different sources may disagree on whether a compound can be purchased, but the final knowledge base should present consistent information.
  • The analysis module of this embodiment includes a data loading and data processing unit: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.
  • This system connects the data formats and access methods of various data sources by providing multiple data source integration methods. The collected raw data is first stored via the asynchronous data queue and persisted; the raw data is then cleaned, aggregated, and recalculated through the data extraction, transformation, and loading system to obtain clean data. On this basis, the system further provides an interactive analysis method based on Jupyter Notebook and related physico-chemical computation tools, and converts clean data into the knowledge base for access during drug research and development.
  • The data integration module of this embodiment abstracts the wide range of public, commercial, and customized data sources into four data integration methods: API-interface-based, file-object-based, event-based, and data-stream-based access. Four kinds of data integrators are built accordingly; each obtains data from its connected source through a valid and compliant access method, then serializes the obtained data into strings and pushes them to the data collection pipeline.
  • The collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a globally unique identifier to each stored record, so the record can be accessed and used during data cleaning and recalculation; the data stored at this point is the raw data.
  • The unique identifiers of the raw data stored in the data warehouse can be sent to the data cleaning pipeline through triggers.
  • Each data cleaning subscriber calls its own defined cleaning process to handle the task: obtaining the raw data and then performing data cleaning, transformation, loading, or recalculation.
  • The raw data content is accessed through its globally unique warehouse identifier, and the resulting data is stored in the data warehouse under a new globally unique identifier; the data stored at this point is the clean data.
  • The analysis module of this embodiment can access the raw and clean data in the data warehouse through the software development kit, and provides the interactive analysis tool Jupyter Notebook so that the methods, functions, and libraries in the SDK can be used to analyze the warehouse data. The SDK also provides a storage method corresponding to the knowledge base, and the analysis results are stored there.
  • The system simplifies and standardizes the connection and integration of multiple data sources and provides four different ways to collect data from sources in different situations, which not only makes data collection easy to extend and maintain but also reduces the complexity of subsequent data cleaning, recalculation, and knowledge extraction.
  • Simplification here means the system provides four data access forms that cover common data providers: rather than one access method per provider, only four integration modes need to be maintained.
  • Standardization addresses the inconsistent fields and differing information of different sources: by mapping the fields of interest from each data source to a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, saved, and queried.
  • The data collection pipeline adopts asynchronous batch processing to increase the number of data sources the system can handle simultaneously.
  • The data collection and data cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, while the flexible subscription processing model improves the system's fault tolerance and stability.
  • The knowledge base system for drug research and development is illustrated by the example of connecting to a public drug molecule activity database.
  • A drug development department plans to develop a new drug.
  • The R&D department's drug design team hopes to compare its newly designed drugs against the drug molecular structures, activities, and related information in the public activity database, selecting structurally similar molecules to study their activity and related information.
  • The team selected a public compound library that provides its data only as file downloads, so team members first downloaded the data as files from the public compound website according to their screening conditions.
  • The data integrator pushes the raw downloaded data to the data collection pipeline, from which it is ultimately stored in the data warehouse; each raw data record has its unique identifier there.
  • The team members define a method based on Lipinski's rule of five and other principles to screen the raw data for compounds more likely to become drug molecules. This process is defined in the data cleaning flow: the newly downloaded raw data is picked up through triggers, and the data cleaning pipeline yields the final molecular data of potentially druggable compounds, that is, the clean data.
  • Using the interactive analysis tool Jupyter Notebook and the software development kit provided by the system, the team inputs its designed drug molecules, and the system queries the just-cleaned data by a compound similarity comparison algorithm for activity and related information; the query results can be stored in the knowledge base through the SDK for use throughout the drug development process.
  • The invention is suited to the drug R&D process: it automates the collection, aggregation, cleaning, storage, and analysis of drug targets, drug molecular structures, market competitors, drug patent information, and experimental data, and builds a knowledge base system that assists drug development.
  • The system connects data from different sources; stores, cleans, and recalculates raw data through large-scale data processing and persistence techniques; and then builds a knowledge base for domain problems as needed.
  • Analysis tools built on the knowledge base give drug R&D personnel convenient drug data aggregation and analysis capabilities, improving R&D efficiency and promoting the development and design of new drug R&D methods.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method and system for drug research and development. The method comprises: constructing various data integrators, acquiring various data using various manners of data access that match the data, serializing the data, pushing same to a data acquisition pipeline, the data acquisition pipeline storing the acquired data in a data warehouse in batches and in an asynchronous manner and recording a unique identifier for each piece of data; a trigger sending the unique identifier of the data stored in the data warehouse to a data cleaning pipeline, a data cleaning subscriber processing the data in the data cleaning pipeline, processing the data, storing the processed data in the data warehouse, and adding a new identifier; and analyzing the data in the data warehouse, and storing an analysis result in a knowledge base. The data processing method and system for drug research and development can access data of different data sources, and store, clean, and recalculate original data by means of batch data processing and persistence techniques so as to construct a knowledge base for domain issues according to requirements.

Description

Data Processing Method and System for Drug Research and Development

Technical Field

The present invention relates to auxiliary methods for drug research and development, and in particular to a data processing method and system for drug research and development.

Background Art
In the existing drug development process, the collection, organization, and analysis of drug data are important steps running through the entire R&D workflow. Commonly collected drug R&D information generally falls into the following categories:

Data based on drug target information:

This includes the biological function of the target, indications related to clinical molecules, the epidemiology of those indications, unmet clinical needs, and so on. Commonly used public data sources include Pubmed, Google Scholar, CNKI, etc.

Data based on drug and protein structure information:

Target-related information can be queried through websites such as Uniprot, and the protein crystal structure corresponding to a target can be queried and obtained from the PDB database.

Competitor information on drugs of the same type:

This includes target-related drug information, patents, drug-related transactions, sales of marketed drugs, and so on, available from websites such as Cortellis, Yaodu, Reaxys, Clinical Trials, China's Center for Drug Evaluation, and the FDA.

Information on drug patents:

Drug patent information can be obtained from sources including EPO, WIPO, and Google Patents.

Information on drug activity:

Drug activity data can be obtained from public data sources such as ChEMBL and PubChem.
Overall, the comprehensive and rich collection and organization of data and information is particularly important for decision-making, direction-setting, quality, and market success rate in the drug R&D process, and is an indispensable link in drug development.

Drug information comes in many complex data types, including common public data sources, results generated by computer-aided drug design (CADD) software, and experimental data from the R&D workflow. Each has its own data structure, storage method, and access method, so collecting and organizing drug information depends heavily on the knowledge background, technical skills, and time investment of drug R&D personnel.

Moreover, going from data acquisition to a knowledge base usable for drug R&D decisions raises the following problems:

Problems of data collection, aggregation, and cleaning:

These include integrating the access methods of multiple data sources, collecting data efficiently, and updating, storing, and organizing data. Public data sources are large and noisy; extracting valuable information requires collection, transformation, and cleaning tools that scale to millions or even billions of records. Commercial or customized data sources offer higher quality and relatively standardized access, but their access protocols, interfaces, and data formats differ, and aggregating them for joint maintenance is difficult. At the same time, incremental data updates are an issue for public, commercial, and customized sources alike.

Problems of data recalculation:

Cleaning the aggregated data generally requires a series of data cleaning steps to obtain information ultimately useful for drug development, such as molecular deduplication, correction of charge and bond-order errors, and handling of chiral molecules. Each update or addition to these processing methods may require recalculating all data collected and cleaned in the past; the large scale and long runtime of such recalculation are the main problems in this part.

Problems in building the knowledge base from data:

Applying the data often requires extracting physico-chemical information, for example extracting from a molecular structure the number of rings it contains, the number of heavy atoms, or the number of possible hydrogen bonds. The results of such preprocessing, calibration, and computation, together with the aggregated, cleaned, and recalculated data, constitute the drug R&D knowledge base. This type of information extraction depends on computation, so the scale problems encountered in recalculation arise here as well.
SUMMARY OF THE INVENTION

Based on this, it is necessary to provide a data processing method for drug research and development that can improve R&D efficiency.

At the same time, a knowledge base system for drug research and development that can improve R&D efficiency is provided.
A data processing method for drug research and development, comprising:

Data integration: building a variety of data integrators, each using a data access method matched to its data source; acquiring data, serializing it into strings, and pushing them to the data collection pipeline; the data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record, the data stored at this point being the raw data;

Data processing: sending the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline through a trigger; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, the data stored at this point being the clean data;

Analysis: analyzing the data in the data warehouse and storing the analysis results in the knowledge base.
In a preferred embodiment, the data processing further includes: forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.

In a preferred embodiment, the data processing includes one or more of: plausibility checks on compound data from different sources, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values.

In a preferred embodiment, the data processing further includes recalculation: if the processing rules change, retrieving the relevant historical raw data by its unique identifiers, sending those identifiers to the data cleaning pipeline through a trigger, reprocessing under the changed rules, and storing the new clean data in the data warehouse.

In a preferred embodiment, the data processing further includes aggregation: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry.

In a preferred embodiment, the analysis includes: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.

In a preferred embodiment, the data processing further includes: organizing the processed compound information and its corresponding auxiliary information into a consistent data structure via CSV and storing it in the data warehouse.

In a preferred embodiment, the compound information includes one or more of: SMILES molecular formula, compound source information, and compound unique identifier; the auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's rule of five, and whether it is commercially available.

The data integrator includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.

The data access method implemented by the API interface integrator is an HTTPS API: data is fetched and parsed according to the API documentation, and the returned content is written as a JSON- or CSV-formatted string and transmitted to the data collection pipeline.

The file object integrator implements file-based data access: it downloads data as files through a download interface, completes the download, verifies the integrity of the downloaded file, and sends it to the data collection pipeline.

The event object integrator implements event-based data access: it polls the data source for updates at a set interval, compares against the time of the last data acquisition, fetches newly released data via HTTPS API or file download, and sends it to the data collection pipeline.

The data stream object integrator implements stream-based data access for sources offering incremental or paginated acquisition: it records the parameters of the last access and incrementally fetches the next batch.
A knowledge base system for drug research and development, comprising:

A data integration module: building a variety of data integrators, each using a data access method matched to its data source; acquiring data, serializing it into strings, and pushing them to the data collection pipeline; the data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record, the data stored at this point being the raw data;

A data processing module: sending the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline through a trigger; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, the data stored at this point being the clean data;

An analysis module: analyzing the data in the data warehouse and storing the analysis results in the knowledge base.

In a preferred embodiment, the data processing module further includes: performing one or more of plausibility checks, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values on compound data from different sources, forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
The above data processing method and system for drug R&D simplify and standardize the connection and integration of multiple data sources and provide several different ways to collect data from sources in different situations; this makes data collection easy to extend and maintain, and also reduces the complexity of subsequent data cleaning, recalculation, and knowledge extraction. By mapping the fields of interest from different data sources to a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, saved, and queried. The data collection pipeline uses asynchronous batch processing to increase the number of data sources the system can handle simultaneously; the collection and cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, while the flexible subscription processing model improves fault tolerance and stability. Customized data cleaning processes, a software development kit, and workflow tools, connected to an interactive data analysis system, improve the system's flexibility for diverse processing needs and its scalability for massive data.

The invention is suited to the drug R&D process: it automates the collection, aggregation, cleaning, storage, and analysis of drug targets, drug molecular structures, market competitors, drug patent information, and experimental data, and builds a knowledge base system that assists drug development. The system connects data from different sources; stores, cleans, and recalculates raw data through large-scale data processing and persistence techniques; and then builds a knowledge base for domain problems as needed. Data analysis tools built on the knowledge base give drug R&D personnel convenient drug data aggregation and analysis capabilities, improving R&D efficiency and promoting the development and design of new drug R&D methods.
Description of the Drawings

FIG. 1 is a flowchart of a data processing method for drug research and development according to an embodiment of the present invention;

FIG. 2 is a flowchart of a data processing method for drug research and development according to a preferred embodiment of the present invention.

Detailed Description
如图1及图2所示,本发明一实施例的面向药物研发的数据处理方法,包括:As shown in FIG. 1 and FIG. 2 , a data processing method for drug research and development according to an embodiment of the present invention includes:
Step S101, data integration: build multiple data integrators, each using the data access method matched to its data source; acquire the data, serialize it into strings, and push it to the data collection pipeline. The data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record. The data stored at this point is the raw data.
Step S103, data processing: a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline. Data cleaning subscribers consume the pipeline and process the data, accessing the content of the raw data through its unique identifier during cleaning. The processed data is stored in the data warehouse under a new identifier. The data stored at this point is the clean data.
Step S105, analysis: analyze the data in the data warehouse and store the analysis results in the knowledge base.
In this embodiment the data collection pipeline is implemented as an asynchronous data queue. The pipeline acts like a reservoir: even when upstream flow is heavy, data accumulates in the queue and is then processed incrementally, in batches, by downstream subscribers. It can be implemented with a vendor-provided message queue service such as AWS SQS, or with an open-source message queue such as Apache Kafka. The subscription mechanism is provided by the message queue software itself.
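By way of illustration, the following is a minimal Python sketch of pushing one collected record onto such a queue, assuming AWS SQS via boto3; the queue URL is a hypothetical placeholder, and push_to_pipeline() is an illustrative helper name reused in the later sketches.

```python
# Minimal sketch: the data collection pipeline as an asynchronous queue.
# Assumes AWS SQS via boto3; the queue URL below is a hypothetical placeholder.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-collection"  # hypothetical

def push_to_pipeline(record: dict) -> None:
    """Serialize one collected record and enqueue it; downstream subscribers
    drain the queue in batches and write the data to the warehouse."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record))
```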
Serialization in this embodiment refers to writing a data object (for example, structured tabular data containing different kinds of information) into a file in some data format (for example, CSV or TXT); storage is carried out through these files. Identification means that when a compound is stored in the system for the first time it is given a unique code, obtainable for example via a UUID, so that the data has a unique identity as it flows through the system.
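A minimal sketch of this serialization and identification step follows; the CSV layout and field handling are illustrative assumptions.

```python
# Sketch: serialize a structured record to a CSV string and assign a UUID
# so the record has a unique identity as it flows through the system.
import csv
import io
import uuid

def serialize_record(fields: dict) -> tuple[str, str]:
    record_id = str(uuid.uuid4())  # unique code assigned on first storage
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", *fields])
    writer.writeheader()
    writer.writerow({"id": record_id, **fields})
    return record_id, buf.getvalue()  # the CSV text is what gets stored as a file
```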
After being stored in the data warehouse, data from any source exists in one of two file formats, CSV or SDF, so there is little variation in format; different subscribers handle the different data cleaning processes.
Further, the data integrators of this embodiment include one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
All of the data exists in the medium of files, so the problem of connecting to different data sources becomes the problem of parsing different file formats to extract the data.
The various data sources do use different data formats, but these can essentially be grouped into a few common categories such as SDF and CSV; the main differences lie in the data fields and in the data access methods. The system abstracts the access methods of different data sources into the four most common ones: API requests (published by the data provider), file transfer, event triggers (published when the data provider releases an update), and data streams (published by the data provider). Four corresponding program modules are therefore provided to acquire data through these four access methods.
The asynchronous data queue addresses the fact that, when data is continuously acquired through the four access methods above, the acquisition speed differs from the storage I/O speed. To reduce the load that highly concurrent acquisition places on storage, and to improve the system's robustness against storage failures, a message queue architecture is used to write the collected and aggregated data into the data warehouse progressively, organized as files. Many typical implementation frameworks exist, such as the open-source Kafka framework or the various message queue frameworks offered by cloud vendors.
Further, the data access method implemented by the API interface integrator of this embodiment is an HTTPS API: the integrator obtains and parses the returned results according to the provider's documentation, writes the returned content as a JSON- or CSV-format string, and transmits it to the data collection pipeline.
For example, a data source website may offer an HTTPS API as its data access method, together with a data query document and an API access key. The API interface integrator requests the HTTPS API, obtains and parses the returned results as the documentation requires, writes the returned content as a JSON- or CSV-format string, and transmits it to the data collection pipeline.
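A minimal sketch of such an integrator, assuming the requests library; the endpoint, bearer-token auth scheme, and query parameters are assumptions about the provider, and push_to_pipeline() is the illustrative helper from the earlier sketch.

```python
# Sketch of the API interface integrator: request an HTTPS API with an access
# key, parse the response per the provider's documentation, and forward it as
# a JSON string to the data collection pipeline.
import json

import requests

def fetch_via_api(endpoint: str, api_key: str, query: dict) -> None:
    resp = requests.get(
        endpoint,
        params=query,
        headers={"Authorization": f"Bearer {api_key}"},  # auth scheme varies by provider
        timeout=60,
    )
    resp.raise_for_status()   # fail loudly on HTTP errors
    payload = resp.json()     # parse according to the provider's documentation
    push_to_pipeline({"source": endpoint, "body": json.dumps(payload)})
```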
Further, the file object integrator of this embodiment implements file-object-based data access: it downloads data in file form through a download interface, completes the download, verifies the integrity of the downloaded file, and sends the data to the data collection pipeline.
File-object-based data access generally means downloading data in file form through the data provider's download interface. Here the file object integrator completes the download, verifies the integrity of the downloaded file, and sends it to the data collection pipeline.
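A minimal sketch, assuming the provider publishes an MD5 digest for the integrity check (digest conventions vary by provider):

```python
# Sketch of the file object integrator: download a file and verify its
# integrity against a published checksum before handing it onward.
import hashlib

import requests

def fetch_file(url: str, expected_md5: str | None = None) -> bytes:
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    data = resp.content
    if expected_md5 is not None:
        digest = hashlib.md5(data).hexdigest()
        if digest != expected_md5:
            raise IOError(f"integrity check failed: {digest} != {expected_md5}")
    return data  # the caller pushes this content to the data collection pipeline
```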
Further, the event object integrator of this embodiment implements event-based data access: it polls the data source at a set interval for data and updates, compares against the time of the last data acquisition, fetches newly published data via the HTTPS API or file download, and sends it to the data collection pipeline.
In event-based data access, the event object integrator polls the data source's update status at a fixed period (for example, daily), generally via an API, a data update message subscription, website page updates, and so on. After comparing against the time of the last data acquisition, it fetches the newly published data via the HTTPS API or file download and sends it to the data collection pipeline.
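A minimal sketch of this polling loop; check_updates() and fetch_since() are assumed hooks standing in for whatever API, message subscription, or page check the provider offers.

```python
# Sketch of the event object integrator: poll the source on a fixed period,
# compare against the last successful acquisition time, and collect only the
# newly published data.
import time
from datetime import datetime, timezone

POLL_INTERVAL_SECONDS = 24 * 3600  # e.g. poll once per day

def poll_for_updates(last_fetch: datetime) -> None:
    while True:
        latest = check_updates()  # assumed hook: provider's last-update time
        if latest > last_fetch:
            for record in fetch_since(last_fetch):  # assumed hook: API or file download
                push_to_pipeline(record)
            last_fetch = datetime.now(timezone.utc)
        time.sleep(POLL_INTERVAL_SECONDS)
```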
Further, the data stream object integrator of this embodiment implements stream-object-based data access: it works with sources that offer incremental or paginated data acquisition, records the parameters of the previous access, and acquires the next batch of data incrementally.
Further, stream-object-based data access extends the API integration capability. Unlike the HTTPS API case, which downloads the full data set, stream object processing requires the data provider to offer incremental or paginated acquisition. The data stream object integrator records the parameters of the previous access so that the next access can be incremental. For example, if the provider offers an HTTPS API with paging parameters, so that data can be fetched by page size and page number, the stream object integrator fixes the page size and increments the page number until all available data has been acquired.
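A minimal sketch of this paginated walk; the page and page_size parameter names are assumptions about the provider's API.

```python
# Sketch of the data stream object integrator: fix the page size, walk the
# page number forward, and return the cursor so the next run resumes
# incrementally from where this one stopped.
import requests

PAGE_SIZE = 1000

def fetch_pages(endpoint: str, start_page: int = 0) -> int:
    page = start_page
    while True:
        resp = requests.get(
            endpoint, params={"page": page, "page_size": PAGE_SIZE}, timeout=60
        )
        resp.raise_for_status()
        rows = resp.json()
        if not rows:      # an empty page means no more data for now
            return page   # persist this as the cursor for the next incremental run
        for row in rows:
            push_to_pipeline(row)
        page += 1
```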
Further, the data processing of this embodiment includes one or more of the following on compound data from different sources: rationality checks; exclusion-rule checks, for example removing compounds containing certain metal elements; consistency checks on the chiral information of chiral molecules; data supplementation for tautomers; and supplementation of predicted pKa values. The purpose of these steps is to merge molecules from different data sources into a consistent information list through rules and data supplementation.
Rationality check in this embodiment: whether the input molecular format (for example, SMILES or MOL) can be read normally by commonly used chemistry software (for example, RDKit); this can generally be judged by whether reading the molecular file with such software raises an error.
Exclusion-rule check: filter the input molecules with common screening rules for druggable small molecules, for example "Lipinski's rule of five", or the requirement that no special metals be present.
Chiral molecule check: verify that, if the input molecule is chiral, the correct chirality is defined in its input SMILES. If multiple chiralities are possible, the molecule can either be filtered out or all possible stereoisomers can be generated and saved, as needed.
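By way of illustration, the following minimal Python sketch, using the open-source RDKit, covers the rationality check, the exclusion rules, and the unassigned-chirality check described above; the metal list and the filter-rather-than-enumerate policy are illustrative assumptions.

```python
# Sketch of three cleaning checks with RDKit: parseability (rationality),
# Lipinski-style exclusion plus a metal filter, and detection of molecules
# whose SMILES leaves stereocenters unassigned.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

EXCLUDED_METALS = {"Na", "K", "Fe", "Zn", "Hg", "Pb"}  # illustrative set

def clean_molecule(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # rationality check: the input cannot be read
    if any(atom.GetSymbol() in EXCLUDED_METALS for atom in mol.GetAtoms()):
        return None  # exclusion rule: contains an unwanted metal element
    if (Descriptors.MolWt(mol) > 500 or Descriptors.MolLogP(mol) > 5
            or Lipinski.NumHDonors(mol) > 5 or Lipinski.NumHAcceptors(mol) > 10):
        return None  # exclusion rule: violates Lipinski's rule of five
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    if any(tag == "?" for _, tag in centers):
        # chirality check: undefined stereocenters; depending on policy,
        # filter the molecule or enumerate and keep all stereoisomers
        return None
    return mol
```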
Data supplementation for tautomers: when the input molecule is given as SMILES and tautomers exist, the tautomers can be enumerated and the most common one selected and retained; this can generally be done with common chemistry software such as RDKit.
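The tautomer step can likewise be sketched with RDKit's standardization module; keeping RDKit's canonical tautomer, rather than the most common one, is a simplifying assumption in this sketch.

```python
# Sketch of tautomer handling: enumerate and keep one canonical tautomer as
# the retained, consistent form of the molecule.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def canonical_tautomer(smiles: str) -> str:
    enumerator = rdMolStandardize.TautomerEnumerator()
    mol = Chem.MolFromSmiles(smiles)
    canon = enumerator.Canonicalize(mol)  # one consistent tautomer is retained
    return Chem.MolToSmiles(canon)
```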
pKa prediction: the pKa value of the input molecule can be computed and saved with common open-source or commercial software, such as ChemAxon.
Data processing also includes recalculation: if a processing rule changes, the relevant historically collected raw data is retrieved by its unique identifiers, a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline, the data is reprocessed under the changed rule, and the newly processed clean data is stored in the data warehouse.
Specifically, recalculation refers to the process of obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when a cleaning rule changes, the relevant historically collected raw data must be re-cleaned. The raw data needing cleaning can be marked and the IDs of its data files sent to the data cleaning pipeline through a trigger; the subsequent process then runs fully automatically, and the newly cleaned data is stored in the data warehouse.
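A minimal sketch of this recalculation flow; the warehouse query interface, the schema with a rule_version column, and the queue object are all assumed stand-ins rather than components defined by this application.

```python
# Sketch of recalculation: when a cleaning rule changes, mark the affected
# raw records and re-enqueue their identifiers on the cleaning pipeline;
# the downstream cleaning then runs automatically.
def recompute(warehouse, cleaning_queue, new_rule_version: str) -> None:
    # Assumed schema: raw records carry the rule version they were cleaned under.
    stale_ids = warehouse.query(
        "SELECT id FROM raw_data WHERE rule_version < ?", (new_rule_version,)
    )
    for record_id in stale_ids:
        cleaning_queue.send(record_id)  # the rest of the flow is automatic
```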
Data processing also includes aggregation: deduplicating the same molecule across different data sources while retaining its source information, and merging data inconsistencies across data sources caused by information asymmetry.
The aggregation process in detail: after cleaning, data from different sources all takes CSV format with the same field attributes, but many molecules coming from different sources are in fact the same molecule, so the first purpose of aggregation is to remove duplicate molecules while preserving their provenance. The second is to merge inconsistencies in the same attribute field across sources caused by information asymmetry: for example, whether a compound is purchasable may differ between sources, but the final knowledge base should express this information consistently.
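A minimal sketch of this deduplicate-and-merge step, assuming records carry smiles, source, and purchasable fields (illustrative names) and an "any source says purchasable" merge policy, which is an assumed rule rather than one fixed by this application.

```python
# Sketch of aggregation: deduplicate by canonical SMILES while retaining all
# source labels, and merge a conflicting single-valued field (purchasability)
# into one consistent answer.
from rdkit import Chem

def aggregate(records: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for rec in records:
        key = Chem.MolToSmiles(Chem.MolFromSmiles(rec["smiles"]))  # canonical form
        entry = merged.setdefault(
            key, {"smiles": key, "sources": set(), "purchasable": False}
        )
        entry["sources"].add(rec["source"])  # provenance is preserved
        entry["purchasable"] = entry["purchasable"] or rec.get("purchasable", False)
    return list(merged.values())
```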
The trigger mechanism for data processing in this embodiment is quite similar to that of data collection, except that the data source is now a file in the data warehouse. A trigger means that whenever a file is created or updated, a message is written to the data cleaning queue; a consumer of the queue then receives the message and processes the corresponding raw data.
The cleaned compound information (for example, the SMILES formula, compound source information, and the compound's unique identifier) and its associated auxiliary information (chirality, tautomers, whether it satisfies Lipinski's Rule of Five, whether it is purchasable, and so on) are organized via CSV into a consistent data structure and stored back in the data warehouse; this is the clean data.
Further, the data processing of this embodiment also includes organizing the processed compound information and its associated auxiliary information via CSV into a consistent data structure and storing it in the data warehouse. The compound information includes one or more of: the SMILES formula, compound source information, and the compound's unique identifier; the auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's Rule of Five, and whether it is purchasable.
The clean data of this embodiment is defined relative to the raw data by transformation rules adapted to the needs of a specific scenario: it may be produced by the rules mentioned above (structural rationality checks, exclusion-rule checks, and chirality-consistency checks on compound data from different sources), or it may be data obtained under data cleaning rules defined by the system's users themselves.
Further, the analysis of this embodiment includes data loading and data processing: the cleaned and aggregated data, the results of predicting compound properties through chemical-physical computation, and the compound information are stored together in the knowledge base.
Specifically, data loading and data organization refers to the process in which, for the cleaned and aggregated data, the results of predicting compound properties with a series of chemical-physical computational methods, implemented via the interactive analysis tool and the software development kit, are stored in the knowledge base together with the compounds. For example, the knowledge base may be a database in which some attribute fields come from the aggregated information supplied by the data sources and others from the results of interactive analysis and software computation; the two parts, together with the compound itself, form one record in the database. For a structure library, for instance, the interactive analysis tool and the software development kit can take a compound's SMILES formula from the data warehouse, predict and generate its biologically active conformations, and store the possible active conformations in the structure library.
Chemical-physical computations are generally customized to user needs and defined within the data cleaning process; each can be understood as a program call. For example, an exclusion-rule check can be implemented with the open-source software RDKit.
The data analysis process, and the development tool Jupyter Notebook invoked within it, share the same purpose: to aggregate and organize the data in the data warehouse into the knowledge base. The software development kit defines a series of Python SDK interfaces for accessing and writing to the knowledge base, and it also contains computational tools such as a pKa calculation module, a conformation generation module, and a protonation site discrimination module; these tools can be chained into knowledge-extraction workflows assembled as needed. Jupyter Notebook offers a programmable interactive interface through which users can conveniently chain together the required steps via the Python SDK. The final data is written to the knowledge base. Taking the conformation library as an example, the software development kit takes a compound SMILES as input, outputs several biologically active conformations, and stores them in the structure library for later use.
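As an illustration of the conformation-generation step, the following sketch embeds and relaxes several 3D conformers with the open-source RDKit. Predicting truly biologically active conformations involves more than embedding, so these are only candidate conformations, and writing them to the structure library is left abstract here.

```python
# Sketch: generate candidate 3D conformers from a SMILES string with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformers(smiles: str, n_confs: int = 10) -> Chem.Mol:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # force-field relaxation per conformer
    return mol  # each conformer is a candidate to store in the structure library
```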
Further, the encoding vector of this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector.
A knowledge base system for drug research and development according to an embodiment of the present invention includes:
A data integration module: builds multiple data integrators, each using the data access method matched to its data source; acquires the data, serializes it into strings, and pushes it to the data collection pipeline. The data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record; the data stored at this point is the raw data.
A data processing module: a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline; data cleaning subscribers consume the pipeline and process the data, accessing the content of the raw data through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier; the data stored at this point is the clean data.
An analysis module: analyzes the data in the data warehouse and stores the analysis results in the knowledge base.
Further, the data integrators of this embodiment include one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
Further, the data processing module of this embodiment also performs one or more of the following on compound data from different sources: rationality checks, exclusion-rule checks, consistency checks on the chiral information of chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values, merging molecules from different data sources into a consistent information list through rules and data supplementation.
The data processing module also includes a recalculation unit: if a processing rule changes, the relevant historically collected raw data is retrieved by its unique identifiers, a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline, the data is reprocessed under the changed rule, and the newly processed clean data is stored in the data warehouse.
Specifically, the recalculation unit covers the process of obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when a cleaning rule changes, the relevant historically collected raw data must be re-cleaned; the raw data needing cleaning can be marked and the IDs of its data files sent through a trigger to the data cleaning pipeline, after which the process runs fully automatically and the newly cleaned data is stored in the data warehouse.
The data processing module also includes an aggregation unit: it deduplicates the same molecule across different data sources while retaining its source information, and merges data inconsistencies across data sources caused by information asymmetry.
Specifically, in the aggregation unit: after cleaning, data from different sources all takes CSV format with the same field attributes, but many molecules coming from different sources are in fact the same molecule, so the first purpose of aggregation is to remove duplicate molecules while preserving their provenance. The second is to merge inconsistencies in the same attribute field across sources caused by information asymmetry: for example, whether a compound is purchasable may differ between sources, but the final knowledge base should express this information consistently.
Further, the analysis module of this embodiment includes a data loading and data processing unit: the cleaned and aggregated data, the results of predicting compound properties through chemical-physical computation, and the compound information are stored together in the knowledge base.
The system connects to the data formats and access methods of multiple data sources by providing several data source integration methods; the collected raw data is first persisted through the asynchronous data queue, and a data extract-transform-load system then organizes, aggregates, and recalculates the raw data into clean data. On this basis, the system further provides an interactive analysis approach based on Jupyter Notebook and related physicochemical computation tools, converting the clean data and storing it in the knowledge base for drug R&D access.
The data integration module of this embodiment abstracts the many public, commercial, and customized data sources into the following four data integration methods, and builds a data integrator for each: access based on API interfaces, on file objects, on events, and on data streams. Each integrator acquires data from its connected source through a valid, compliant access method, then serializes the acquired data into strings and pushes it to the data collection pipeline. The data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns each stored record a globally unique identifier, so that the record can be accessed during data cleaning and recalculation; what is stored at this point is the raw data.
In the data processing module of this embodiment, since the raw data must undergo extract, transform, and load operations, a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline. The content of the pipeline is processed by data cleaning subscribers, each of which invokes its own defined cleaning process to fetch the raw data and carry out cleaning, transformation, loading, or recalculation. During cleaning, the process accesses the raw data content through the data warehouse's globally unique identifier, stores the resulting data in the data warehouse, and attaches a new globally unique identifier; what is stored at this point is clean data.
The analysis module of this embodiment can access the raw and clean data in the data warehouse through the software development kit, and provides the interactive analysis tool Jupyter Notebook, which uses the methods, functions, and libraries of the kit to analyze the data in the warehouse. The kit also provides the corresponding storage methods for the knowledge base, so that analysis results are stored in the knowledge base.
The system simplifies and standardizes the connection and integration of multiple data sources and provides four different ways of collecting data from sources in different situations. This not only makes data collection easy to extend and maintain, but also reduces, to a certain extent, the complexity of subsequent data cleaning, recalculation, and knowledge extraction. Simplification here means the system offers four data access forms that cover common data providers: rather than one access method per provider, only four integration modes need to be maintained. Standardization addresses the inconsistent fields and differing information across sources: by mapping the fields of interest from different data sources onto a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, stored, and queried.
The data collection pipeline's asynchronous batching increases the number of data sources the system can handle concurrently; the data collection and data cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, and the flexible subscription processing model improves the system's fault tolerance and stability.
Customizable data cleaning processes, a software development kit, and workflow tools, connected to an interactive data analysis system, improve the system's flexibility in handling diverse data processing needs and its scalability for massive data processing.
A preferred embodiment of the knowledge base system for drug research and development is illustrated by the example of connecting to a public database of drug molecule activities.
A drug R&D department plans to develop a new drug. The department's drug design team wants to compare its own newly designed drugs against the molecular structures, activities, and related information in a public activity database, and to select structurally similar molecules in order to study their activities and related information.
The team first selects a public compound library that only offers its data as file downloads. Team members therefore download the data as files from the public compound website according to their screening criteria, and the system's file-object-based data integrator pushes the raw downloaded data to the data collection pipeline. The data is ultimately stored in the data warehouse, and every raw data record has its unique identifier there.
Team members then define a screen, based on principles such as Lipinski's Rule of Five, for the compounds in the raw data that are more likely to become drug molecules. This screen is defined within the data cleaning process; the just-downloaded raw data is fetched through a trigger and passed through the data cleaning pipeline to yield the final molecular data of potentially druggable compounds, that is, the clean data.
Using the interactive analysis tool Jupyter Notebook and the software development kit provided by the system, the team inputs its designed drug molecules; the system then uses a compound similarity comparison algorithm to query, from the freshly cleaned data, the activities and related information of compounds similar to the team's designs. The query results can be stored in the knowledge base through the software development kit for use in the drug R&D process.
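A minimal sketch of such a similarity query, assuming Morgan fingerprints with Tanimoto similarity and an illustrative 0.7 cutoff; the application does not fix the comparison algorithm, so these choices are assumptions.

```python
# Sketch: compare a designed molecule against cleaned library molecules by
# Morgan-fingerprint Tanimoto similarity and return hits above a cutoff.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similar_compounds(query_smiles: str, library_smiles: list[str],
                      cutoff: float = 0.7) -> list[tuple[str, float]]:
    query_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(query_smiles), 2, nBits=2048)
    hits = []
    for smi in library_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unreadable entries
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        sim = DataStructs.TanimotoSimilarity(query_fp, fp)
        if sim >= cutoff:
            hits.append((smi, sim))
    return sorted(hits, key=lambda hit: -hit[1])  # most similar first
```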
The invention is applicable to the drug research and development process, building a knowledge base system that assists drug development through the automated collection, aggregation, cleaning, storage, and analysis of data on drug targets, drug molecular structures, competing products, drug patents, and experimental results. The system can connect to data from different sources; store, clean, and recalculate raw data using large-batch data processing and persistence technology; and build, as needed, a knowledge base oriented to domain problems. Data analysis tools built on top of the knowledge base give drug R&D personnel convenient capabilities for aggregating and analyzing drug data, improving R&D efficiency and promoting the development and design of new drug R&D methods.
In light of the ideal embodiments of the present application described above, relevant practitioners can make various changes and modifications without departing from the technical idea of the present application. The technical scope of the present application is not limited to the content of the specification; it must be determined according to the scope of the claims.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims (10)

  1. A data processing method for drug research and development, characterized by comprising:
    data integration: building multiple data integrators, each using the data access method matched to its data source, acquiring the data, serializing the acquired data into strings, and pushing them to a data collection pipeline, wherein the data collection pipeline stores the acquired data in a data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record, the data stored at this point being raw data;
    data processing: sending, through a trigger, the unique identifiers of the raw data stored in the data warehouse to a data cleaning pipeline, wherein data cleaning subscribers process the data in the data cleaning pipeline, accessing the content of the raw data through its unique identifier during cleaning, and the processed data is stored in the data warehouse under a new identifier, the data stored at this point being clean data;
    analysis: analyzing the data in the data warehouse and storing the analysis results in a knowledge base.
  2. The data processing method for drug research and development according to claim 1, characterized in that the data processing further comprises: merging molecules from different data sources into a consistent information list through rules and data supplementation.
  3. The data processing method for drug research and development according to claim 2, characterized in that the data processing comprises one or more of the following on compound data from different sources: a rationality check, an exclusion-rule check, a consistency check on the chiral information of chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values.
  4. The data processing method for drug research and development according to claim 3, characterized in that the data processing further comprises:
    recalculation: if a processing rule changes, retrieving the relevant historically collected raw data by its unique identifiers, sending, through a trigger, the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline, reprocessing according to the changed processing rule, and storing the newly processed clean data in the data warehouse.
  5. The data processing method for drug research and development according to claim 1, characterized in that the data processing further comprises:
    aggregation: deduplicating the same molecule across different data sources while retaining its source information, and merging data inconsistencies across data sources caused by information asymmetry.
  6. The data processing method for drug research and development according to claim 5, characterized in that the analysis comprises: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physical computation, and the compound information together in the knowledge base.
  7. The data processing method for drug research and development according to any one of claims 1 to 6, characterized in that the data processing further comprises: organizing the processed compound information and its associated auxiliary information via CSV into a consistent data structure and storing it in the data warehouse.
  8. The data processing method for drug research and development according to claim 7, characterized in that the compound information comprises one or more of: a SMILES formula, compound source information, and a compound unique identifier; and the auxiliary information comprises one or more of: chirality, tautomers, whether the compound satisfies Lipinski's Rule of Five, and whether it is purchasable;
    the data integrators comprise one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator;
    the data access method implemented by the API interface integrator is an HTTPS API, obtaining and parsing the returned results according to the provider's documentation, writing the returned content as a JSON- or CSV-format string, and transmitting it to the data collection pipeline;
    the file object integrator implements file-object-based data access, downloading data in file form through a download interface, completing the download, verifying the integrity of the downloaded file, and sending the data to the data collection pipeline;
    the event object integrator implements event-based data access, polling the data source's data and updates at a set interval, comparing against the time of the last data acquisition, fetching newly published data via the HTTPS API or file download, and sending it to the data collection pipeline;
    the data stream object integrator implements stream-object-based data access, acquiring data from sources that offer incremental or paginated acquisition, recording the parameters of the previous access, and acquiring the next batch of data incrementally.
  9. A knowledge base system for drug research and development, characterized by comprising:
    a data integration module: building multiple data integrators, each using the data access method matched to its data source, acquiring the data, serializing the acquired data into strings, and pushing them to a data collection pipeline, wherein the data collection pipeline stores the acquired data in a data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record, the data stored at this point being raw data;
    a data processing module: sending, through a trigger, the unique identifiers of the raw data stored in the data warehouse to a data cleaning pipeline, wherein data cleaning subscribers process the data in the data cleaning pipeline, accessing the content of the raw data through its unique identifier during cleaning, and the processed data is stored in the data warehouse under a new identifier, the data stored at this point being clean data;
    an analysis module: analyzing the data in the data warehouse and storing the analysis results in the knowledge base.
  10. The knowledge base system for drug research and development according to claim 9, characterized in that the data processing module further performs one or more of the following on compound data from different sources: a rationality check, an exclusion-rule check, a consistency check on the chiral information of chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values, merging molecules from different data sources into a consistent information list through rules and data supplementation.
PCT/CN2020/120425 2020-10-12 2020-10-12 Data processing method and system for drug research and development WO2022077166A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/120425 WO2022077166A1 (en) 2020-10-12 2020-10-12 Data processing method and system for drug research and development

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/120425 WO2022077166A1 (en) 2020-10-12 2020-10-12 Data processing method and system for drug research and development

Publications (1)

Publication Number Publication Date
WO2022077166A1 true WO2022077166A1 (en) 2022-04-21

Family

ID=81208797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120425 WO2022077166A1 (en) 2020-10-12 2020-10-12 Data processing method and system for drug research and development

Country Status (1)

Country Link
WO (1) WO2022077166A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113950A (en) * 2023-08-11 2023-11-24 广州标智未来科学技术有限公司 High-throughput experimental data processing method and device
CN117762954A (en) * 2023-11-17 2024-03-26 深圳市前海数据服务有限公司 Automatic data management method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153747A (en) * 2016-12-02 2018-06-12 航天星图科技(北京)有限公司 A kind of parallel data cleaning system
CN108846076A (en) * 2018-06-08 2018-11-20 山大地纬软件股份有限公司 The massive multi-source ETL process method and system of supporting interface adaptation
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data
CN110825721A (en) * 2019-11-06 2020-02-21 武汉大学 Hypertension knowledge base construction and system integration method under big data environment
US10701140B2 (en) * 2015-10-08 2020-06-30 International Business Machines Corporation Automated ETL resource provisioner


Similar Documents

Publication Publication Date Title
Rehman et al. Leveraging big data analytics in healthcare enhancement: trends, challenges and opportunities
Marin-Castro et al. Event log preprocessing for process mining: a review
EP2929467B1 (en) Integrating event processing with map-reduce
Ayers A second generation computer forensic analysis system
Rahul et al. Data life cycle management in big data analytics
US11403330B2 (en) Systems and methods for customized annotation of medical information
US7792783B2 (en) System and method for semantic normalization of healthcare data to support derivation conformed dimensions to support static and aggregate valuation across heterogeneous data sources
Pouchard Revisiting the data lifecycle with big data curation
US20090193054A1 (en) Tracking changes to a business object
WO2022077166A1 (en) Data processing method and system for drug research and development
Bahga et al. Healthcare data integration and informatics in the cloud
US20150134362A1 (en) Systems and methods for a medical coder marketplace
Chennamsetty et al. Predictive analytics on electronic health records (EHRs) using hadoop and hive
Wang et al. Large-scale multimodal mining for healthcare with mapreduce
US11538561B2 (en) Systems and methods for medical information data warehouse management
WO2018038745A1 (en) Clinical connector and analytical framework
US20210202111A1 (en) Method of classifying medical records
US20220114483A1 (en) Unified machine learning feature data pipeline
CN110659998A (en) Data processing method, data processing apparatus, computer apparatus, and storage medium
US20230113089A1 (en) Systems and methods for medical information data warehouse management
CN111445969A (en) Sales prediction method and system capable of flexibly adapting to noise
CN112164430B (en) Data processing method and system for drug development
US10346759B2 (en) Probabilistic inference engine based on synthetic events from measured data
Allen et al. Identifying and consolidating knowledge engineering requirements
US11971911B2 (en) Systems and methods for customized annotation of medical information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20956948

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 230823)

122 Ep: pct application non-entry in european phase

Ref document number: 20956948

Country of ref document: EP

Kind code of ref document: A1