WO2022077166A1 - Data processing method and system for drug research and development - Google Patents

Data processing method and system for drug research and development Download PDF

Info

Publication number
WO2022077166A1
WO2022077166A1 (PCT/CN2020/120425, CN2020120425W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
stored
warehouse
cleaning
Prior art date
Application number
PCT/CN2020/120425
Other languages
French (fr)
Chinese (zh)
Inventor
吴楚楠
徐旻
张佩宇
马健
温书豪
赖力鹏
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司 filed Critical 深圳晶泰科技有限公司
Priority to PCT/CN2020/120425 priority Critical patent/WO2022077166A1/en
Publication of WO2022077166A1 publication Critical patent/WO2022077166A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Definitions

  • The invention relates to auxiliary methods for drug research and development, and in particular to a data processing method and system for drug research and development.
  • Target-related information can be queried through Uniprot and other websites, and the protein crystal structure information corresponding to the target can be queried and obtained from the PDB database.
  • Drug patent information can be obtained from EPO, WIPO, Google Patents, etc.
  • Drug activity data can be obtained from public data sources such as ChEMBL, PubChem, etc.
  • The collection and organization of comprehensive, rich data and information is particularly important for decision-making, direction-setting, quality, and market success rate in the drug R&D process, and is an indispensable link in drug development.
  • Cleaning the aggregated data generally requires a series of data cleaning steps to obtain information ultimately useful for drug development, such as molecular deduplication, correction of charge and bond-order errors, and handling of chiral molecules. Each update or addition to these processing methods may require recalculating all data collected and cleaned in the past; the large scale and long runtime of such recalculation are the main problems in this part.
  • A data processing method for drug development includes:
  • Data integration: build a variety of data integrators, each using a data access method matched to its data source; acquire the data, serialize it into strings, and push them to the data collection pipeline. The pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Data processing: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored back in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Analysis: analyze the data in the data warehouse and store the analysis results in the knowledge base.
  • The data processing further includes: forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The data processing includes one or more of: plausibility checks on compound data from different sources, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values.
  • The data processing further comprises recalculation: if the processing rules change, the relevant historical raw data is retrieved by its unique identifiers, those identifiers are sent to the data cleaning pipeline through a trigger, the data is reprocessed under the changed rules, and the new clean data is stored in the data warehouse.
  • The data processing further comprises aggregation: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry.
  • The analysis includes: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.
  • The data processing further includes: organizing the processed compound information and its corresponding auxiliary information into a consistent data structure via CSV and storing it in the data warehouse.
  • The compound information includes one or more of: SMILES molecular formula, compound source information, and compound unique identifier. The auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's rule of five, and whether it is commercially available.
  • The data integrator includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
  • The data access method implemented by the API interface integrator is an HTTPS API: data is fetched and parsed according to the API documentation, and the returned content is written as a JSON- or CSV-formatted string and transmitted to the data collection pipeline.
  • The file object integrator implements file-based data access: it downloads data as files through a download interface, completes the download, verifies the integrity of the downloaded file, and sends it to the data collection pipeline.
  • The event object integrator implements event-based data access: it polls the data source for updates at a set interval, compares against the time of the last data acquisition, fetches newly released data via HTTPS API or file download, and sends it to the data collection pipeline.
  • The data stream object integrator implements stream-based data access for sources that offer incremental or paginated acquisition: it records the parameters of the last access and incrementally fetches the next batch.
  • Data integration module: builds a variety of data integrators, each using a data access method matched to its data source; acquires the data, serializes it into strings, and pushes them to the data collection pipeline, which stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Data processing module: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Analysis module: analyzes the data in the data warehouse and stores the analysis results in the knowledge base.
  • The data processing module further includes: performing one or more of plausibility checks, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values on compound data from different sources, forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The above data processing method and system for drug R&D simplify and standardize the connection and integration of multiple data sources and provide several different ways to collect data from sources in different situations; this makes data collection easy to extend and maintain, and also reduces the complexity of subsequent data cleaning, recalculation, and knowledge extraction. By mapping the fields of interest from different data sources to a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, saved, and queried. The data collection pipeline uses asynchronous batch processing to increase the number of data sources the system can handle simultaneously; the collection and cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, while the flexible subscription processing model improves fault tolerance and stability. Customized data cleaning processes, a software development kit, and workflow tools, connected to an interactive data analysis system, improve the system's flexibility for diverse processing needs and its scalability for massive data.
  • The invention is suited to the drug R&D process: it automates the collection, aggregation, cleaning, storage, and analysis of drug targets, drug molecular structures, market competitors, drug patent information, and experimental data, and builds a knowledge base system that assists drug development.
  • The system connects data from different sources; stores, cleans, and recalculates raw data through large-scale data processing and persistence techniques; and then builds a knowledge base for domain problems as needed.
  • Data analysis tools built on the knowledge base give drug R&D personnel convenient drug data aggregation and analysis capabilities, improving R&D efficiency and promoting the development and design of new drug R&D methods.
  • FIG. 1 is a flowchart of a data processing method for drug research and development according to an embodiment of the present invention
  • FIG. 2 is a flow chart of a data processing method for drug research and development according to a preferred embodiment of the present invention.
  • A data processing method for drug research and development includes:
  • Step S101, data integration: build a variety of data integrators, each using a data access method matched to its data source; acquire the data, serialize it into strings, and push them to the data collection pipeline, which stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Step S103, data processing: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Step S105, analysis: analyze the data in the data warehouse and store the analysis results in the knowledge base.
  • The data collection pipeline in this embodiment is implemented as an asynchronous data queue. It acts like a reservoir: even when upstream flow is heavy, data accumulates in the queue and is then processed progressively in batches by downstream subscribers. It can be implemented with a vendor-provided message queue service such as AWS SQS, or with an open-source message queue such as Apache Kafka; the subscription mechanism is provided by the message queue software itself.
  • Serialization in this embodiment refers to writing a data object (for example, structured tabular data containing different information) into a file in a data format such as CSV or TXT; storage is implemented through files. Identification means that when a compound is first stored in the system it is assigned a unique code, obtainable for example via UUID, so that the data has a unique identity as it flows through the system (see the sketch below).
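  • As a concrete illustration of the serialization and identification steps, the following minimal Python sketch writes one record to a CSV string and tags it with a UUID; the field names are assumptions, since the patent only specifies CSV/TXT serialization and UUID-based identity:

```python
import csv
import io
import uuid

def serialize_record(record: dict) -> tuple:
    """Serialize one data object to a CSV string and tag it with a UUID.

    Returns (record_id, csv_string). The example fields are illustrative.
    """
    record_id = str(uuid.uuid4())  # unique identity for in-system flow
    row = {"id": record_id, **record}
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return record_id, buf.getvalue()

rid, payload = serialize_record({"smiles": "CCO", "source": "ChEMBL"})
```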
  • The data integrator in this embodiment includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
  • The asynchronous data queue continuously receives data from the four access methods above. Because data acquisition speed differs from storage I/O speed, a message-queue architecture is used to reduce the load that highly concurrent acquisition places on storage and to improve the system's robustness against storage failures: the collected and summarized data is written into the data warehouse gradually, organized as files. Typical implementations include the open-source Kafka framework or the various message-queue frameworks offered by cloud vendors.
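  • A minimal producer sketch for such a pipeline is shown below, backed here by AWS SQS (Apache Kafka being the open-source alternative named above); the queue URL is a placeholder for illustration:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

def push_to_collection_pipeline(payload: str) -> None:
    # Push a serialized CSV/JSON string into the asynchronous queue;
    # downstream subscribers drain it in batches into the warehouse.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/data-collection",
        MessageBody=payload,
    )
```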
  • The data access method implemented by the API interface integrator in this embodiment is an HTTPS API: data is fetched and parsed according to the API documentation, and the returned content is written as a JSON- or CSV-formatted string and transmitted to the data collection pipeline.
  • For example, a data source website may provide an HTTPS API as its access method, together with the corresponding data query documentation and an API access key.
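  • A hedged sketch of such an API interface integrator follows; the endpoint path, parameter names, and authorization scheme are illustrative, since each data source documents its own:

```python
import json
import requests

def api_integrator(base_url: str, api_key: str, query: dict) -> str:
    """Fetch data over an HTTPS API and return a JSON string ready to be
    pushed to the data collection pipeline."""
    resp = requests.get(
        f"{base_url}/records",                       # illustrative endpoint
        params=query,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()         # parse only successful responses
    return json.dumps(resp.json())  # serialized string for the pipeline
```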
  • The file object integrator of this embodiment implements file-based data access, which generally downloads data as files through the data provider's download interface. The integrator completes the download, verifies the integrity of the downloaded file, and sends the data to the data collection pipeline.
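  • The sketch below illustrates one way a file object integrator might download and verify a file; the MD5-checksum convention is an assumption, as a provider could instead publish SHA-256 digests or file sizes:

```python
import hashlib
import requests

def file_integrator(download_url: str, expected_md5: str) -> bytes:
    """Download a data file and verify its integrity before forwarding it
    to the data collection pipeline."""
    resp = requests.get(download_url, timeout=300)
    resp.raise_for_status()
    data = resp.content
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise IOError(f"integrity check failed for {download_url}")
    return data  # forwarded to the collection pipeline on success
```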
  • The event object integrator of this embodiment implements event-based data access: it polls the data source for updates at a set interval, compares against the time of the last data acquisition, fetches newly released data via HTTPS API or file download, and sends it to the data collection pipeline.
  • Concretely, the integrator polls at a fixed period (for example, daily) for data updates from the source, typically exposed via an API, a data-update message subscription, or website page updates; after comparing with the last acquisition time, newly released data is fetched via HTTPS API or file download and sent to the collection pipeline.
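  • A minimal polling sketch for the event object integrator, assuming source-specific callables for checking and fetching updates:

```python
import time
from datetime import datetime, timezone

def event_integrator(check_latest, fetch_since, poll_seconds=86400):
    """Poll a data source on a fixed period (default: daily) and fetch only
    data newer than the last acquisition. `check_latest` returns the
    source's latest publish time; `fetch_since` fetches data after a given
    time via HTTPS API or file download."""
    last_acquired = datetime.min.replace(tzinfo=timezone.utc)
    while True:
        latest = check_latest()
        if latest > last_acquired:
            payload = fetch_since(last_acquired)
            # ...push payload to the data collection pipeline...
            last_acquired = latest
        time.sleep(poll_seconds)
```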
  • The data stream object integrator of this embodiment implements stream-based data access. Stream-based access extends API-based access: whereas a plain HTTPS API targets full-data downloads, stream processing requires the data provider to offer incremental or paginated acquisition. The integrator records the parameters of the last access so that the next access fetches data incrementally. For example, if the provider offers an HTTPS API with paging parameters (data fetched by page size and page number), the stream object integrator fixes the page size and increments the page number to obtain all available data.
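  • A sketch of that paginated case follows; the parameter names `page` and `size` are illustrative:

```python
import requests

def stream_integrator(base_url: str, page_size: int = 1000):
    """Pull a paginated HTTPS API incrementally: fix the page size, record
    the last page visited, and advance until the source is exhausted."""
    page = 1  # in practice persisted between runs as the last-access parameter
    while True:
        resp = requests.get(base_url, params={"page": page, "size": page_size},
                            timeout=60)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break          # no further data to stream
        yield batch        # each batch goes to the collection pipeline
        page += 1
```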
  • The data processing in this embodiment includes one or more of: plausibility checks on compound data from different sources; exclusion-rule checks (for example, compounds containing certain metal elements must be eliminated); chirality consistency checks for chiral molecules; data supplementation for tautomers; and supplementation of predicted pKa values.
  • The purpose of these steps is to form a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The exclusion-rule check filters input molecules by applying common druglikeness screening rules for small molecules, such as Lipinski's rule of five or the requirement that no special metal elements be present.
  • The chiral molecule check verifies that an input molecule, if chiral, has the correct chirality defined in its input SMILES. If multiple chirality assignments are possible, the molecule can be filtered out, or all possible chiral variants can be generated and saved as needed.
  • pKa prediction means the pKa value of an input molecule can be calculated and saved using common open-source or commercial software, such as ChemAxon.
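  • The sketch below combines several of these checks with the open-source RDKit (which the embodiment later names for exclusion-rule checks): SMILES validity, a forbidden-element exclusion list (an assumption), Lipinski's rule of five, and detection of unassigned chiral centers:

```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

FORBIDDEN_ELEMENTS = {"As", "Hg", "Pb"}  # illustrative exclusion list

def clean_molecule(smiles: str) -> Optional[dict]:
    """Validity check, element exclusion, Lipinski's rule of five, and
    detection of unassigned chiral centers for one input molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # plausibility check failed: unparseable SMILES
    if any(a.GetSymbol() in FORBIDDEN_ELEMENTS for a in mol.GetAtoms()):
        return None  # exclusion rule: contains a forbidden element
    ro5_ok = (Descriptors.MolWt(mol) <= 500
              and Descriptors.MolLogP(mol) <= 5
              and Lipinski.NumHDonors(mol) <= 5
              and Lipinski.NumHAcceptors(mol) <= 10)
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    ambiguous = any(tag == "?" for _, tag in centers)
    return {"smiles": Chem.MolToSmiles(mol),  # canonical form
            "lipinski_ro5": ro5_ok,
            "ambiguous_chirality": ambiguous}
```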
  • Data processing also includes recalculation: if the processing rules change, the relevant historical raw data is retrieved by its unique identifiers, those identifiers are sent to the data cleaning pipeline through a trigger, the data is reprocessed under the changed rules, and the new clean data is stored in the data warehouse.
  • Recalculation mainly refers to obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when cleaning rules change, historically collected raw data must be re-cleaned: the raw data needing cleaning is marked, the IDs of its data files are sent to the data cleaning pipeline through a trigger, and the subsequent process runs fully automatically, storing the newly cleaned data in the data warehouse.
  • Data processing also includes aggregation: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry.
  • Concretely, although data from different sources share the CSV format and the same field attributes after cleaning, many molecules from different sources are actually the same molecule. The first purpose of aggregation is therefore to remove duplicate molecules while preserving source information; the second is to merge inconsistencies in the same attribute field that may arise from information asymmetry across sources. For example, different sources may disagree on whether a compound can be purchased, but the final knowledge base should present consistent information.
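  • One possible aggregation sketch, deduplicating by RDKit canonical SMILES and merging a hypothetical `purchasable` flag with an any-source-says-yes rule; real merge policies would be field-specific:

```python
from rdkit import Chem

def aggregate(records: list) -> list:
    """Deduplicate molecules across sources by canonical SMILES while
    retaining every source, and merge the purchasable flag."""
    merged = {}
    for rec in records:
        key = Chem.MolToSmiles(Chem.MolFromSmiles(rec["smiles"]))  # canonicalize
        entry = merged.setdefault(
            key, {"smiles": key, "sources": [], "purchasable": False})
        entry["sources"].append(rec["source"])  # retain provenance
        entry["purchasable"] = entry["purchasable"] or rec.get("purchasable", False)
    return list(merged.values())
```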
  • The trigger mechanism for data processing in this embodiment closely resembles that of data collection, except that the data source is now a file in the data warehouse.
  • A trigger means that once a file is generated or updated, a message is written into the data cleaning queue; consumers of that queue then receive the message and process the corresponding raw data.
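  • The trigger-and-subscriber interaction could look like the following sketch; the queue class and the warehouse interface are hypothetical stand-ins for the message queue and data warehouse described above:

```python
import queue

class CleaningQueue:
    """In-memory stand-in for the SQS/Kafka-backed cleaning pipeline."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, msg):
        self._q.put(msg)
    def receive(self):
        while not self._q.empty():
            yield self._q.get()

def on_warehouse_write(record_id: str, cleaning_queue: CleaningQueue) -> None:
    # Trigger: a file was generated or updated, so publish its unique ID.
    cleaning_queue.send(record_id)

def run_cleaning_subscriber(cleaning_queue, warehouse, clean_fn) -> None:
    # Consumer: fetch raw content by ID, clean it, store under a new ID.
    # `warehouse` stands for any object with get()/put_new() methods.
    for record_id in cleaning_queue.receive():
        raw = warehouse.get(record_id)  # access raw data via its unique ID
        cleaned = clean_fn(raw)         # e.g., the clean_molecule sketch above
        if cleaned is not None:
            warehouse.put_new(cleaned)  # stored with a new identifier
```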
  • The cleaned compound information (such as SMILES molecular formula, compound source information, and compound unique identifier) and its corresponding auxiliary information (chirality, tautomers, whether it satisfies Lipinski's rule of five, whether it is commercially available, etc.) is organized via CSV into a consistent data structure and stored back into the data warehouse; this is the clean data.
  • The data processing in this embodiment thus organizes the processed compound information and its corresponding auxiliary information into a consistent data structure via CSV and stores it in the data warehouse.
  • The compound information includes one or more of: SMILES molecular formula, compound source information, and compound unique identifier. The auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's rule of five, and whether it is commercially available.
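  • An illustrative clean-data CSV row combining the compound and auxiliary fields above might be written as follows; the column names are assumptions:

```python
import csv

FIELDS = ["compound_id", "smiles", "source", "chirality",
          "tautomer", "lipinski_ro5", "purchasable"]

with open("clean_data.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({"compound_id": "8f14e45f-ceea-467f-a0da-5b51b1f0a6c2",
                     "smiles": "CCO", "source": "ChEMBL",
                     "chirality": "none", "tautomer": "none",
                     "lipinski_ro5": True, "purchasable": True})
```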
  • Clean data in this embodiment is data transformed from the raw data by rules adapted to a specific scenario's requirements. These can be the rules mentioned above (structural plausibility checks, exclusion-rule checks, and chirality consistency checks on compound data from different sources) or data cleaning rules defined by the system's users themselves.
  • The analysis in this embodiment includes data loading and data processing: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.
  • Concretely, data loading and sorting refers to predicting and calculating compound properties for the cleaned and aggregated data through a series of chemical-physics methods implemented by the interactive analysis tools and software development tools, and storing the results together with the compounds in the knowledge base.
  • The knowledge base is a database: part of each record's attribute fields comes from the aggregated information provided by the data sources, and the other part comes from the results of interactive analysis and software calculation. Together with the compound itself, the two parts form one record stored in the database.
  • The interactive analysis tools and software development kit can be used to predict biologically active conformations from the compound SMILES in the data warehouse and to store the possible active conformations in the structure library.
  • The chemical-physics calculations are generally customized to the user's needs; they are defined in the data cleaning process and can be understood as program calls. For example, the exclusion-rule check can be implemented with the open-source software RDKit.
  • The data analysis process and the development tool Jupyter Notebook invoked within it serve the same purpose: to aggregate the data in the data warehouse and store it in the knowledge base.
  • The software development kit, a Python SDK, defines a series of methods for accessing and writing to the knowledge base.
  • The SDK also includes computing tools such as a pKa calculation module, a conformation generation module, and a protonation-site discrimination module. The tools' workflows connect to the on-demand knowledge-extraction workflow, and Jupyter Notebook provides programmable interactive cells so users can chain the required steps together through the Python SDK, with the final data written to the knowledge base.
  • Given a compound SMILES as input, the software development kit outputs several biologically active conformations and stores them in the structure library for subsequent use.
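  • A hypothetical notebook session against such an SDK might look like the sketch below; the module `drug_kb` and its functions are invented for illustration, since the patent does not publish the SDK's actual API:

```python
# Hypothetical module and function names, invented for illustration.
import drug_kb

kb = drug_kb.connect("knowledge-base-endpoint")

# Query compounds similar to an input molecule (aspirin SMILES shown),
# generate bioactive conformations, predict pKa, and write results back.
for compound in kb.query(similar_to="CC(=O)Oc1ccccc1C(=O)O", cutoff=0.7):
    confs = drug_kb.generate_conformers(compound.smiles)  # conformation module
    pka = drug_kb.predict_pka(compound.smiles)            # pKa module
    kb.store(compound.id, conformers=confs, pka=pka)      # into the knowledge base
```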
  • The encoding vector in this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector.
  • Data integration module: builds a variety of data integrators, each using a data access method matched to its data source; acquires the data, serializes it into strings, and pushes them to the data collection pipeline, which stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record; the data stored at this point is the raw data.
  • Data processing module: a trigger sends the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, and the data stored at this point is the clean data.
  • Analysis module: analyzes the data in the data warehouse and stores the analysis results in the knowledge base.
  • The data integrator in this embodiment includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
  • The data processing module of this embodiment also includes: performing one or more of plausibility checks, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values on compound data from different sources, forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
  • The data processing module also includes a recalculation unit: if the processing rules change, the relevant historical raw data is retrieved by its unique identifiers, those identifiers are sent to the data cleaning pipeline through a trigger, the data is reprocessed under the changed rules, and the new clean data is stored in the data warehouse.
  • The recalculation unit mainly handles obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when cleaning rules change, historically collected raw data must be re-cleaned: the IDs of the relevant data files are sent to the data cleaning pipeline through a trigger, the subsequent process runs fully automatically, and the newly cleaned data is stored in the data warehouse.
  • The data processing module also includes an aggregation unit: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry. Although data from different sources share the CSV format and the same field attributes after cleaning, many molecules from different sources are actually the same molecule; the first purpose of aggregation is therefore to remove duplicate molecules while preserving source information, and the second is to merge inconsistencies in the same attribute field caused by information asymmetry across sources. For example, different sources may disagree on whether a compound can be purchased, but the final knowledge base should present consistent information.
  • The analysis module of this embodiment includes a data loading and data processing unit: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.
  • This system connects the data formats and access methods of various data sources by providing multiple data source integration methods. The collected raw data is first stored via the asynchronous data queue and persisted; the raw data is then cleaned, aggregated, and recalculated through the data extraction, transformation, and loading system to obtain clean data. On this basis, the system further provides an interactive analysis method based on Jupyter Notebook and related physico-chemical computation tools, and converts clean data into the knowledge base for access during drug research and development.
  • The data integration module of this embodiment abstracts the wide range of public, commercial, and customized data sources into four data integration methods: API-interface-based, file-object-based, event-based, and data-stream-based access. Four kinds of data integrators are built accordingly; each obtains data from its connected source through a valid and compliant access method, then serializes the obtained data into strings and pushes them to the data collection pipeline.
  • The collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a globally unique identifier to each stored record, so the record can be accessed and used during data cleaning and recalculation; the data stored at this point is the raw data.
  • The unique identifiers of the raw data stored in the data warehouse can be sent to the data cleaning pipeline through triggers.
  • Each data cleaning subscriber calls its own defined cleaning process to handle the task: obtaining the raw data and then performing data cleaning, transformation, loading, or recalculation.
  • The raw data content is accessed through its globally unique warehouse identifier, and the resulting data is stored in the data warehouse under a new globally unique identifier; the data stored at this point is the clean data.
  • The analysis module of this embodiment can access the raw and clean data in the data warehouse through the software development kit, and provides the interactive analysis tool Jupyter Notebook so that the methods, functions, and libraries in the SDK can be used to analyze the warehouse data. The SDK also provides a storage method corresponding to the knowledge base, and the analysis results are stored there.
  • The system simplifies and standardizes the connection and integration of multiple data sources and provides four different ways to collect data from sources in different situations, which not only makes data collection easy to extend and maintain but also reduces the complexity of subsequent data cleaning, recalculation, and knowledge extraction.
  • Simplification here means the system provides four data access forms that cover common data providers: rather than one access method per provider, only four integration modes need to be maintained.
  • Standardization addresses the inconsistent fields and differing information of different sources: by mapping the fields of interest from each data source to a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, saved, and queried.
  • The data collection pipeline adopts asynchronous batch processing to increase the number of data sources the system can handle simultaneously.
  • The data collection and data cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, while the flexible subscription processing model improves the system's fault tolerance and stability.
  • The knowledge base system for drug research and development is illustrated by the example of connecting to a public drug molecule activity database.
  • A drug development department plans to develop a new drug.
  • The R&D department's drug design team hopes to compare its newly designed drugs against the drug molecular structures, activities, and related information in the public activity database, selecting structurally similar molecules to study their activity and related information.
  • The team selected a public compound library that provides its data only as file downloads, so team members first downloaded the data as files from the public compound website according to their screening conditions.
  • The data integrator pushes the raw downloaded data to the data collection pipeline, from which it is ultimately stored in the data warehouse; each raw data record has its unique identifier there.
  • The team members define a method based on Lipinski's rule of five and other principles to screen the raw data for compounds more likely to become drug molecules. This process is defined in the data cleaning flow: the newly downloaded raw data is picked up through triggers, and the data cleaning pipeline yields the final molecular data of potentially druggable compounds, that is, the clean data.
  • Using the interactive analysis tool Jupyter Notebook and the software development kit provided by the system, the team inputs its designed drug molecules, and the system queries the just-cleaned data by a compound similarity comparison algorithm for activity and related information; the query results can be stored in the knowledge base through the SDK for use throughout the drug development process.
  • The invention is suited to the drug R&D process: it automates the collection, aggregation, cleaning, storage, and analysis of drug targets, drug molecular structures, market competitors, drug patent information, and experimental data, and builds a knowledge base system that assists drug development.
  • The system connects data from different sources; stores, cleans, and recalculates raw data through large-scale data processing and persistence techniques; and then builds a knowledge base for domain problems as needed.
  • Analysis tools built on the knowledge base give drug R&D personnel convenient drug data aggregation and analysis capabilities, improving R&D efficiency and promoting the development and design of new drug R&D methods.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method and system for drug research and development. The method comprises: constructing various data integrators, acquiring various data using various manners of data access that match the data, serializing the data, pushing same to a data acquisition pipeline, the data acquisition pipeline storing the acquired data in a data warehouse in batches and in an asynchronous manner and recording a unique identifier for each piece of data; a trigger sending the unique identifier of the data stored in the data warehouse to a data cleaning pipeline, a data cleaning subscriber processing the data in the data cleaning pipeline, processing the data, storing the processed data in the data warehouse, and adding a new identifier; and analyzing the data in the data warehouse, and storing an analysis result in a knowledge base. The data processing method and system for drug research and development can access data of different data sources, and store, clean, and recalculate original data by means of batch data processing and persistence techniques so as to construct a knowledge base for domain issues according to requirements.

Description

Data Processing Method and System for Drug Research and Development

Technical Field

The present invention relates to auxiliary methods for drug research and development, and in particular to a data processing method and system for drug research and development.

Background Art
In the existing drug development process, the collection, organization, and analysis of drug data are important steps running through the entire R&D workflow. Commonly collected drug R&D information generally falls into the following categories:

Data based on drug target information:

This includes the biological function of the target, indications related to clinical molecules, the epidemiology of those indications, unmet clinical needs, and so on. Commonly used public data sources include Pubmed, Google Scholar, CNKI, etc.

Data based on drug and protein structure information:

Target-related information can be queried through websites such as Uniprot, and the protein crystal structure corresponding to a target can be queried and obtained from the PDB database.

Competitor information on drugs of the same type:

This includes target-related drug information, patents, drug-related transactions, sales of marketed drugs, and so on, available from websites such as Cortellis, Yaodu, Reaxys, Clinical Trials, China's Center for Drug Evaluation, and the FDA.

Information on drug patents:

Drug patent information can be obtained from sources including EPO, WIPO, and Google Patents.

Information on drug activity:

Drug activity data can be obtained from public data sources such as ChEMBL and PubChem.
Overall, the comprehensive and rich collection and organization of data and information is particularly important for decision-making, direction-setting, quality, and market success rate in the drug R&D process, and is an indispensable link in drug development.

Drug information comes in many complex data types, including common public data sources, results generated by computer-aided drug design (CADD) software, and experimental data from the R&D workflow. Each has its own data structure, storage method, and access method, so collecting and organizing drug information depends heavily on the knowledge background, technical skills, and time investment of drug R&D personnel.

Moreover, going from data acquisition to a knowledge base usable for drug R&D decisions raises the following problems:

Problems of data collection, aggregation, and cleaning:

These include integrating the access methods of multiple data sources, collecting data efficiently, and updating, storing, and organizing data. Public data sources are large and noisy; extracting valuable information requires collection, transformation, and cleaning tools that scale to millions or even billions of records. Commercial or customized data sources offer higher quality and relatively standardized access, but their access protocols, interfaces, and data formats differ, and aggregating them for joint maintenance is difficult. At the same time, incremental data updates are an issue for public, commercial, and customized sources alike.

Problems of data recalculation:

Cleaning the aggregated data generally requires a series of data cleaning steps to obtain information ultimately useful for drug development, such as molecular deduplication, correction of charge and bond-order errors, and handling of chiral molecules. Each update or addition to these processing methods may require recalculating all data collected and cleaned in the past; the large scale and long runtime of such recalculation are the main problems in this part.

Problems in building the knowledge base from data:

Applying the data often requires extracting physico-chemical information, for example extracting from a molecular structure the number of rings it contains, the number of heavy atoms, or the number of possible hydrogen bonds. The results of such preprocessing, calibration, and computation, together with the aggregated, cleaned, and recalculated data, constitute the drug R&D knowledge base. This type of information extraction depends on computation, so the scale problems encountered in recalculation arise here as well.
SUMMARY OF THE INVENTION

Based on this, it is necessary to provide a data processing method for drug research and development that can improve R&D efficiency.

At the same time, a knowledge base system for drug research and development that can improve R&D efficiency is provided.
A data processing method for drug research and development, comprising:

Data integration: building a variety of data integrators, each using a data access method matched to its data source; acquiring data, serializing it into strings, and pushing them to the data collection pipeline; the data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record, the data stored at this point being the raw data;

Data processing: sending the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline through a trigger; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, the data stored at this point being the clean data;

Analysis: analyzing the data in the data warehouse and storing the analysis results in the knowledge base.
In a preferred embodiment, the data processing further includes: forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.

In a preferred embodiment, the data processing includes one or more of: plausibility checks on compound data from different sources, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values.

In a preferred embodiment, the data processing further includes recalculation: if the processing rules change, retrieving the relevant historical raw data by its unique identifiers, sending those identifiers to the data cleaning pipeline through a trigger, reprocessing under the changed rules, and storing the new clean data in the data warehouse.

In a preferred embodiment, the data processing further includes aggregation: deduplicating the same molecule found in different data sources while retaining its source information, and merging data inconsistencies across sources caused by information asymmetry.

In a preferred embodiment, the analysis includes: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physics calculations, and the compound information together in the knowledge base.

In a preferred embodiment, the data processing further includes: organizing the processed compound information and its corresponding auxiliary information into a consistent data structure via CSV and storing it in the data warehouse.

In a preferred embodiment, the compound information includes one or more of: SMILES molecular formula, compound source information, and compound unique identifier; the auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's rule of five, and whether it is commercially available.

The data integrator includes one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.

The data access method implemented by the API interface integrator is an HTTPS API: data is fetched and parsed according to the API documentation, and the returned content is written as a JSON- or CSV-formatted string and transmitted to the data collection pipeline.

The file object integrator implements file-based data access: it downloads data as files through a download interface, completes the download, verifies the integrity of the downloaded file, and sends it to the data collection pipeline.

The event object integrator implements event-based data access: it polls the data source for updates at a set interval, compares against the time of the last data acquisition, fetches newly released data via HTTPS API or file download, and sends it to the data collection pipeline.

The data stream object integrator implements stream-based data access for sources offering incremental or paginated acquisition: it records the parameters of the last access and incrementally fetches the next batch.
A knowledge base system for drug research and development, comprising:

A data integration module: building a variety of data integrators, each using a data access method matched to its data source; acquiring data, serializing it into strings, and pushing them to the data collection pipeline; the data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to each stored record, the data stored at this point being the raw data;

A data processing module: sending the unique identifier of the raw data stored in the data warehouse to the data cleaning pipeline through a trigger; a data cleaning subscriber processes the data in the pipeline, accessing the raw data content through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier, the data stored at this point being the clean data;

An analysis module: analyzing the data in the data warehouse and storing the analysis results in the knowledge base.

In a preferred embodiment, the data processing module further includes: performing one or more of plausibility checks, exclusion-rule checks, chirality consistency checks for chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values on compound data from different sources, forming a consistent information list for molecules from different data sources through regularity rules and data supplementation.
The above data processing method and system for drug R&D simplify and standardize the connection and integration of multiple data sources and provide several different ways to collect data from sources in different situations; this makes data collection easy to extend and maintain, and also reduces the complexity of subsequent data cleaning, recalculation, and knowledge extraction. By mapping the fields of interest from different data sources to a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, saved, and queried. The data collection pipeline uses asynchronous batch processing to increase the number of data sources the system can handle simultaneously; the collection and cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, while the flexible subscription processing model improves fault tolerance and stability. Customized data cleaning processes, a software development kit, and workflow tools, connected to an interactive data analysis system, improve the system's flexibility for diverse processing needs and its scalability for massive data.

The invention is suited to the drug R&D process: it automates the collection, aggregation, cleaning, storage, and analysis of drug targets, drug molecular structures, market competitors, drug patent information, and experimental data, and builds a knowledge base system that assists drug development. The system connects data from different sources; stores, cleans, and recalculates raw data through large-scale data processing and persistence techniques; and then builds a knowledge base for domain problems as needed. Data analysis tools built on the knowledge base give drug R&D personnel convenient drug data aggregation and analysis capabilities, improving R&D efficiency and promoting the development and design of new drug R&D methods.
Description of the Drawings

FIG. 1 is a flowchart of a data processing method for drug research and development according to an embodiment of the present invention;

FIG. 2 is a flowchart of a data processing method for drug research and development according to a preferred embodiment of the present invention.

Detailed Description
如图1及图2所示,本发明一实施例的面向药物研发的数据处理方法,包括:As shown in FIG. 1 and FIG. 2 , a data processing method for drug research and development according to an embodiment of the present invention includes:
Step S101, data integration: build multiple data integrators, each using the data access method matched to its data source; acquire the data, serialize it into strings, and push it to the data collection pipeline. The data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record. The data stored at this point is the raw data.
Step S103, data processing: a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline. Data cleaning subscribers consume the pipeline and process the data, accessing the content of the raw data through its unique identifier during cleaning. The processed data is stored in the data warehouse under a new identifier. The data stored at this point is the clean data.
Step S105, analysis: analyze the data in the data warehouse and store the analysis results in the knowledge base.
In this embodiment the data collection pipeline is implemented as an asynchronous data queue. The pipeline acts like a reservoir: even when upstream flow is heavy, data accumulates in the queue and is then processed incrementally, in batches, by downstream subscribers. It can be implemented with a vendor-provided message queue service such as AWS SQS, or with an open-source message queue such as Apache Kafka. The subscription mechanism is provided by the message queue software itself.
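By way of illustration, the following is a minimal Python sketch of pushing one collected record onto such a queue, assuming AWS SQS via boto3; the queue URL is a hypothetical placeholder, and push_to_pipeline() is an illustrative helper name reused in the later sketches.

```python
# Minimal sketch: the data collection pipeline as an asynchronous queue.
# Assumes AWS SQS via boto3; the queue URL below is a hypothetical placeholder.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-collection"  # hypothetical

def push_to_pipeline(record: dict) -> None:
    """Serialize one collected record and enqueue it; downstream subscribers
    drain the queue in batches and write the data to the warehouse."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record))
```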
Serialization in this embodiment refers to writing a data object (for example, structured tabular data containing different kinds of information) into a file in some data format (for example, CSV or TXT); storage is carried out through these files. Identification means that when a compound is stored in the system for the first time it is given a unique code, obtainable for example via a UUID, so that the data has a unique identity as it flows through the system.
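A minimal sketch of this serialization and identification step follows; the CSV layout and field handling are illustrative assumptions.

```python
# Sketch: serialize a structured record to a CSV string and assign a UUID
# so the record has a unique identity as it flows through the system.
import csv
import io
import uuid

def serialize_record(fields: dict) -> tuple[str, str]:
    record_id = str(uuid.uuid4())  # unique code assigned on first storage
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", *fields])
    writer.writeheader()
    writer.writerow({"id": record_id, **fields})
    return record_id, buf.getvalue()  # the CSV text is what gets stored as a file
```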
After being stored in the data warehouse, data from any source exists in one of two file formats, CSV or SDF, so there is little variation in format; different subscribers handle the different data cleaning processes.
Further, the data integrators of this embodiment include one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
All of the data exists in the medium of files, so the problem of connecting to different data sources becomes the problem of parsing different file formats to extract the data.
The various data sources do use different data formats, but these can essentially be grouped into a few common categories such as SDF and CSV; the main differences lie in the data fields and in the data access methods. The system abstracts the access methods of different data sources into the four most common ones: API requests (published by the data provider), file transfer, event triggers (published when the data provider releases an update), and data streams (published by the data provider). Four corresponding program modules are therefore provided to acquire data through these four access methods.
The asynchronous data queue addresses the fact that, when data is continuously acquired through the four access methods above, the acquisition speed differs from the storage I/O speed. To reduce the load that highly concurrent acquisition places on storage, and to improve the system's robustness against storage failures, a message queue architecture is used to write the collected and aggregated data into the data warehouse progressively, organized as files. Many typical implementation frameworks exist, such as the open-source Kafka framework or the various message queue frameworks offered by cloud vendors.
Further, the data access method implemented by the API interface integrator of this embodiment is an HTTPS API: the integrator obtains and parses the returned results according to the provider's documentation, writes the returned content as a JSON- or CSV-format string, and transmits it to the data collection pipeline.
For example, a data source website may offer an HTTPS API as its data access method, together with a data query document and an API access key. The API interface integrator requests the HTTPS API, obtains and parses the returned results as the documentation requires, writes the returned content as a JSON- or CSV-format string, and transmits it to the data collection pipeline.
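A minimal sketch of such an integrator, assuming the requests library; the endpoint, bearer-token auth scheme, and query parameters are assumptions about the provider, and push_to_pipeline() is the illustrative helper from the earlier sketch.

```python
# Sketch of the API interface integrator: request an HTTPS API with an access
# key, parse the response per the provider's documentation, and forward it as
# a JSON string to the data collection pipeline.
import json

import requests

def fetch_via_api(endpoint: str, api_key: str, query: dict) -> None:
    resp = requests.get(
        endpoint,
        params=query,
        headers={"Authorization": f"Bearer {api_key}"},  # auth scheme varies by provider
        timeout=60,
    )
    resp.raise_for_status()   # fail loudly on HTTP errors
    payload = resp.json()     # parse according to the provider's documentation
    push_to_pipeline({"source": endpoint, "body": json.dumps(payload)})
```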
Further, the file object integrator of this embodiment implements file-object-based data access: it downloads data in file form through a download interface, completes the download, verifies the integrity of the downloaded file, and sends the data to the data collection pipeline.
File-object-based data access generally means downloading data in file form through the data provider's download interface. Here the file object integrator completes the download, verifies the integrity of the downloaded file, and sends it to the data collection pipeline.
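A minimal sketch, assuming the provider publishes an MD5 digest for the integrity check (digest conventions vary by provider):

```python
# Sketch of the file object integrator: download a file and verify its
# integrity against a published checksum before handing it onward.
import hashlib

import requests

def fetch_file(url: str, expected_md5: str | None = None) -> bytes:
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    data = resp.content
    if expected_md5 is not None:
        digest = hashlib.md5(data).hexdigest()
        if digest != expected_md5:
            raise IOError(f"integrity check failed: {digest} != {expected_md5}")
    return data  # the caller pushes this content to the data collection pipeline
```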
Further, the event object integrator of this embodiment implements event-based data access: it polls the data source at a set interval for data and updates, compares against the time of the last data acquisition, fetches newly published data via the HTTPS API or file download, and sends it to the data collection pipeline.
In event-based data access, the event object integrator polls the data source's update status at a fixed period (for example, daily), generally via an API, a data update message subscription, website page updates, and so on. After comparing against the time of the last data acquisition, it fetches the newly published data via the HTTPS API or file download and sends it to the data collection pipeline.
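A minimal sketch of this polling loop; check_updates() and fetch_since() are assumed hooks standing in for whatever API, message subscription, or page check the provider offers.

```python
# Sketch of the event object integrator: poll the source on a fixed period,
# compare against the last successful acquisition time, and collect only the
# newly published data.
import time
from datetime import datetime, timezone

POLL_INTERVAL_SECONDS = 24 * 3600  # e.g. poll once per day

def poll_for_updates(last_fetch: datetime) -> None:
    while True:
        latest = check_updates()  # assumed hook: provider's last-update time
        if latest > last_fetch:
            for record in fetch_since(last_fetch):  # assumed hook: API or file download
                push_to_pipeline(record)
            last_fetch = datetime.now(timezone.utc)
        time.sleep(POLL_INTERVAL_SECONDS)
```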
Further, the data stream object integrator of this embodiment implements stream-object-based data access: it works with sources that offer incremental or paginated data acquisition, records the parameters of the previous access, and acquires the next batch of data incrementally.
Further, stream-object-based data access extends the API integration capability. Unlike the HTTPS API case, which downloads the full data set, stream object processing requires the data provider to offer incremental or paginated acquisition. The data stream object integrator records the parameters of the previous access so that the next access can be incremental. For example, if the provider offers an HTTPS API with paging parameters, so that data can be fetched by page size and page number, the stream object integrator fixes the page size and increments the page number until all available data has been acquired.
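A minimal sketch of this paginated walk; the page and page_size parameter names are assumptions about the provider's API.

```python
# Sketch of the data stream object integrator: fix the page size, walk the
# page number forward, and return the cursor so the next run resumes
# incrementally from where this one stopped.
import requests

PAGE_SIZE = 1000

def fetch_pages(endpoint: str, start_page: int = 0) -> int:
    page = start_page
    while True:
        resp = requests.get(
            endpoint, params={"page": page, "page_size": PAGE_SIZE}, timeout=60
        )
        resp.raise_for_status()
        rows = resp.json()
        if not rows:      # an empty page means no more data for now
            return page   # persist this as the cursor for the next incremental run
        for row in rows:
            push_to_pipeline(row)
        page += 1
```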
Further, the data processing of this embodiment includes one or more of the following on compound data from different sources: rationality checks; exclusion-rule checks, for example removing compounds containing certain metal elements; consistency checks on the chiral information of chiral molecules; data supplementation for tautomers; and supplementation of predicted pKa values. The purpose of these steps is to merge molecules from different data sources into a consistent information list through rules and data supplementation.
Rationality check in this embodiment: whether the input molecular format (for example, SMILES or MOL) can be read normally by commonly used chemistry software (for example, RDKit); this can generally be judged by whether reading the molecular file with such software raises an error.
Exclusion-rule check: filter the input molecules with common screening rules for druggable small molecules, for example "Lipinski's rule of five", or the requirement that no special metals be present.
Chiral molecule check: verify that, if the input molecule is chiral, the correct chirality is defined in its input SMILES. If multiple chiralities are possible, the molecule can either be filtered out or all possible stereoisomers can be generated and saved, as needed.
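By way of illustration, the following minimal Python sketch, using the open-source RDKit, covers the rationality check, the exclusion rules, and the unassigned-chirality check described above; the metal list and the filter-rather-than-enumerate policy are illustrative assumptions.

```python
# Sketch of three cleaning checks with RDKit: parseability (rationality),
# Lipinski-style exclusion plus a metal filter, and detection of molecules
# whose SMILES leaves stereocenters unassigned.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

EXCLUDED_METALS = {"Na", "K", "Fe", "Zn", "Hg", "Pb"}  # illustrative set

def clean_molecule(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # rationality check: the input cannot be read
    if any(atom.GetSymbol() in EXCLUDED_METALS for atom in mol.GetAtoms()):
        return None  # exclusion rule: contains an unwanted metal element
    if (Descriptors.MolWt(mol) > 500 or Descriptors.MolLogP(mol) > 5
            or Lipinski.NumHDonors(mol) > 5 or Lipinski.NumHAcceptors(mol) > 10):
        return None  # exclusion rule: violates Lipinski's rule of five
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    if any(tag == "?" for _, tag in centers):
        # chirality check: undefined stereocenters; depending on policy,
        # filter the molecule or enumerate and keep all stereoisomers
        return None
    return mol
```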
Data supplementation for tautomers: when the input molecule is given as SMILES and tautomers exist, the tautomers can be enumerated and the most common one selected and retained; this can generally be done with common chemistry software such as RDKit.
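The tautomer step can likewise be sketched with RDKit's standardization module; keeping RDKit's canonical tautomer, rather than the most common one, is a simplifying assumption in this sketch.

```python
# Sketch of tautomer handling: enumerate and keep one canonical tautomer as
# the retained, consistent form of the molecule.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def canonical_tautomer(smiles: str) -> str:
    enumerator = rdMolStandardize.TautomerEnumerator()
    mol = Chem.MolFromSmiles(smiles)
    canon = enumerator.Canonicalize(mol)  # one consistent tautomer is retained
    return Chem.MolToSmiles(canon)
```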
pKa prediction: the pKa value of the input molecule can be computed and saved with common open-source or commercial software, such as ChemAxon.
Data processing also includes recalculation: if a processing rule changes, the relevant historically collected raw data is retrieved by its unique identifiers, a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline, the data is reprocessed under the changed rule, and the newly processed clean data is stored in the data warehouse.
Specifically, recalculation refers to the process of obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when a cleaning rule changes, the relevant historically collected raw data must be re-cleaned. The raw data needing cleaning can be marked and the IDs of its data files sent to the data cleaning pipeline through a trigger; the subsequent process then runs fully automatically, and the newly cleaned data is stored in the data warehouse.
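A minimal sketch of this recalculation flow; the warehouse query interface, the schema with a rule_version column, and the queue object are all assumed stand-ins rather than components defined by this application.

```python
# Sketch of recalculation: when a cleaning rule changes, mark the affected
# raw records and re-enqueue their identifiers on the cleaning pipeline;
# the downstream cleaning then runs automatically.
def recompute(warehouse, cleaning_queue, new_rule_version: str) -> None:
    # Assumed schema: raw records carry the rule version they were cleaned under.
    stale_ids = warehouse.query(
        "SELECT id FROM raw_data WHERE rule_version < ?", (new_rule_version,)
    )
    for record_id in stale_ids:
        cleaning_queue.send(record_id)  # the rest of the flow is automatic
```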
Data processing also includes aggregation: deduplicating the same molecule across different data sources while retaining its source information, and merging data inconsistencies across data sources caused by information asymmetry.
The aggregation process in detail: after cleaning, data from different sources all takes CSV format with the same field attributes, but many molecules coming from different sources are in fact the same molecule, so the first purpose of aggregation is to remove duplicate molecules while preserving their provenance. The second is to merge inconsistencies in the same attribute field across sources caused by information asymmetry: for example, whether a compound is purchasable may differ between sources, but the final knowledge base should express this information consistently.
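A minimal sketch of this deduplicate-and-merge step, assuming records carry smiles, source, and purchasable fields (illustrative names) and an "any source says purchasable" merge policy, which is an assumed rule rather than one fixed by this application.

```python
# Sketch of aggregation: deduplicate by canonical SMILES while retaining all
# source labels, and merge a conflicting single-valued field (purchasability)
# into one consistent answer.
from rdkit import Chem

def aggregate(records: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for rec in records:
        key = Chem.MolToSmiles(Chem.MolFromSmiles(rec["smiles"]))  # canonical form
        entry = merged.setdefault(
            key, {"smiles": key, "sources": set(), "purchasable": False}
        )
        entry["sources"].add(rec["source"])  # provenance is preserved
        entry["purchasable"] = entry["purchasable"] or rec.get("purchasable", False)
    return list(merged.values())
```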
The trigger mechanism for data processing in this embodiment is quite similar to that of data collection, except that the data source is now a file in the data warehouse. A trigger means that whenever a file is created or updated, a message is written to the data cleaning queue; a consumer of the queue then receives the message and processes the corresponding raw data.
The cleaned compound information (for example, the SMILES formula, compound source information, and the compound's unique identifier) and its associated auxiliary information (chirality, tautomers, whether it satisfies Lipinski's Rule of Five, whether it is purchasable, and so on) are organized via CSV into a consistent data structure and stored back in the data warehouse; this is the clean data.
Further, the data processing of this embodiment also includes organizing the processed compound information and its associated auxiliary information via CSV into a consistent data structure and storing it in the data warehouse. The compound information includes one or more of: the SMILES formula, compound source information, and the compound's unique identifier; the auxiliary information includes one or more of: chirality, tautomers, whether the compound satisfies Lipinski's Rule of Five, and whether it is purchasable.
The clean data of this embodiment is defined relative to the raw data by transformation rules adapted to the needs of a specific scenario: it may be produced by the rules mentioned above (structural rationality checks, exclusion-rule checks, and chirality-consistency checks on compound data from different sources), or it may be data obtained under data cleaning rules defined by the system's users themselves.
Further, the analysis of this embodiment includes data loading and data processing: the cleaned and aggregated data, the results of predicting compound properties through chemical-physical computation, and the compound information are stored together in the knowledge base.
Specifically, data loading and data organization refers to the process in which, for the cleaned and aggregated data, the results of predicting compound properties with a series of chemical-physical computational methods, implemented via the interactive analysis tool and the software development kit, are stored in the knowledge base together with the compounds. For example, the knowledge base may be a database in which some attribute fields come from the aggregated information supplied by the data sources and others from the results of interactive analysis and software computation; the two parts, together with the compound itself, form one record in the database. For a structure library, for instance, the interactive analysis tool and the software development kit can take a compound's SMILES formula from the data warehouse, predict and generate its biologically active conformations, and store the possible active conformations in the structure library.
Chemical-physical computations are generally customized to user needs and defined within the data cleaning process; each can be understood as a program call. For example, an exclusion-rule check can be implemented with the open-source software RDKit.
The data analysis process, and the development tool Jupyter Notebook invoked within it, share the same purpose: to aggregate and organize the data in the data warehouse into the knowledge base. The software development kit defines a series of Python SDK interfaces for accessing and writing to the knowledge base, and it also contains computational tools such as a pKa calculation module, a conformation generation module, and a protonation site discrimination module; these tools can be chained into knowledge-extraction workflows assembled as needed. Jupyter Notebook offers a programmable interactive interface through which users can conveniently chain together the required steps via the Python SDK. The final data is written to the knowledge base. Taking the conformation library as an example, the software development kit takes a compound SMILES as input, outputs several biologically active conformations, and stores them in the structure library for later use.
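As an illustration of the conformation-generation step, the following sketch embeds and relaxes several 3D conformers with the open-source RDKit. Predicting truly biologically active conformations involves more than embedding, so these are only candidate conformations, and writing them to the structure library is left abstract here.

```python
# Sketch: generate candidate 3D conformers from a SMILES string with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformers(smiles: str, n_confs: int = 10) -> Chem.Mol:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # force-field relaxation per conformer
    return mol  # each conformer is a candidate to store in the structure library
```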
Further, the encoding vector of this embodiment is a SMILES (Simplified Molecular Input Line Entry Specification) encoding vector.
A knowledge base system for drug research and development according to an embodiment of the present invention includes:
A data integration module: builds multiple data integrators, each using the data access method matched to its data source; acquires the data, serializes it into strings, and pushes it to the data collection pipeline. The data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record; the data stored at this point is the raw data.
A data processing module: a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline; data cleaning subscribers consume the pipeline and process the data, accessing the content of the raw data through its unique identifier during cleaning; the processed data is stored in the data warehouse under a new identifier; the data stored at this point is the clean data.
An analysis module: analyzes the data in the data warehouse and stores the analysis results in the knowledge base.
Further, the data integrators of this embodiment include one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator.
Further, the data processing module of this embodiment also performs one or more of the following on compound data from different sources: rationality checks, exclusion-rule checks, consistency checks on the chiral information of chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values, merging molecules from different data sources into a consistent information list through rules and data supplementation.
The data processing module also includes a recalculation unit: if a processing rule changes, the relevant historically collected raw data is retrieved by its unique identifiers, a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline, the data is reprocessed under the changed rule, and the newly processed clean data is stored in the data warehouse.
Specifically, the recalculation unit covers the process of obtaining clean data from the collected raw data through the data cleaning pipeline. A typical scenario is that when a cleaning rule changes, the relevant historically collected raw data must be re-cleaned; the raw data needing cleaning can be marked and the IDs of its data files sent through a trigger to the data cleaning pipeline, after which the process runs fully automatically and the newly cleaned data is stored in the data warehouse.
The data processing module also includes an aggregation unit: it deduplicates the same molecule across different data sources while retaining its source information, and merges data inconsistencies across data sources caused by information asymmetry.
Specifically, in the aggregation unit: after cleaning, data from different sources all takes CSV format with the same field attributes, but many molecules coming from different sources are in fact the same molecule, so the first purpose of aggregation is to remove duplicate molecules while preserving their provenance. The second is to merge inconsistencies in the same attribute field across sources caused by information asymmetry: for example, whether a compound is purchasable may differ between sources, but the final knowledge base should express this information consistently.
Further, the analysis module of this embodiment includes a data loading and data processing unit: the cleaned and aggregated data, the results of predicting compound properties through chemical-physical computation, and the compound information are stored together in the knowledge base.
The system connects to the data formats and access methods of multiple data sources by providing several data source integration methods; the collected raw data is first persisted through the asynchronous data queue, and a data extract-transform-load system then organizes, aggregates, and recalculates the raw data into clean data. On this basis, the system further provides an interactive analysis approach based on Jupyter Notebook and related physicochemical computation tools, converting the clean data and storing it in the knowledge base for drug R&D access.
The data integration module of this embodiment abstracts the many public, commercial, and customized data sources into the following four data integration methods, and builds a data integrator for each: access based on API interfaces, on file objects, on events, and on data streams. Each integrator acquires data from its connected source through a valid, compliant access method, then serializes the acquired data into strings and pushes it to the data collection pipeline. The data collection pipeline stores the acquired data in the data warehouse in a batched, asynchronous manner and assigns each stored record a globally unique identifier, so that the record can be accessed during data cleaning and recalculation; what is stored at this point is the raw data.
In the data processing module of this embodiment, since the raw data must undergo extract, transform, and load operations, a trigger sends the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline. The content of the pipeline is processed by data cleaning subscribers, each of which invokes its own defined cleaning process to fetch the raw data and carry out cleaning, transformation, loading, or recalculation. During cleaning, the process accesses the raw data content through the data warehouse's globally unique identifier, stores the resulting data in the data warehouse, and attaches a new globally unique identifier; what is stored at this point is clean data.
The analysis module of this embodiment can access the raw and clean data in the data warehouse through the software development kit, and provides the interactive analysis tool Jupyter Notebook, which uses the methods, functions, and libraries of the kit to analyze the data in the warehouse. The kit also provides the corresponding storage methods for the knowledge base, so that analysis results are stored in the knowledge base.
The system simplifies and standardizes the connection and integration of multiple data sources and provides four different ways of collecting data from sources in different situations. This not only makes data collection easy to extend and maintain, but also reduces, to a certain extent, the complexity of subsequent data cleaning, recalculation, and knowledge extraction. Simplification here means the system offers four data access forms that cover common data providers: rather than one access method per provider, only four integration modes need to be maintained. Standardization addresses the inconsistent fields and differing information across sources: by mapping the fields of interest from different data sources onto a unified field design in the knowledge base, the same information carries the same field identifier and index, and differing information from different sources can be merged, stored, and queried.
The data collection pipeline's asynchronous batching increases the number of data sources the system can handle concurrently; the data collection and data cleaning pipelines can share the same framework design, reducing overall operation and maintenance complexity, and the flexible subscription processing model improves the system's fault tolerance and stability.
Customizable data cleaning processes, a software development kit, and workflow tools, connected to an interactive data analysis system, improve the system's flexibility in handling diverse data processing needs and its scalability for massive data processing.
A preferred embodiment of the knowledge base system for drug research and development is illustrated by the example of connecting to a public database of drug molecule activities.
A drug R&D department plans to develop a new drug. The department's drug design team wants to compare its own newly designed drugs against the molecular structures, activities, and related information in a public activity database, and to select structurally similar molecules in order to study their activities and related information.
The team first selects a public compound library that only offers its data as file downloads. Team members therefore download the data as files from the public compound website according to their screening criteria, and the system's file-object-based data integrator pushes the raw downloaded data to the data collection pipeline. The data is ultimately stored in the data warehouse, and every raw data record has its unique identifier there.
Team members then define a screen, based on principles such as Lipinski's Rule of Five, for the compounds in the raw data that are more likely to become drug molecules. This screen is defined within the data cleaning process; the just-downloaded raw data is fetched through a trigger and passed through the data cleaning pipeline to yield the final molecular data of potentially druggable compounds, that is, the clean data.
Using the interactive analysis tool Jupyter Notebook and the software development kit provided by the system, the team inputs its designed drug molecules; the system then uses a compound similarity comparison algorithm to query, from the freshly cleaned data, the activities and related information of compounds similar to the team's designs. The query results can be stored in the knowledge base through the software development kit for use in the drug R&D process.
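A minimal sketch of such a similarity query, assuming Morgan fingerprints with Tanimoto similarity and an illustrative 0.7 cutoff; the application does not fix the comparison algorithm, so these choices are assumptions.

```python
# Sketch: compare a designed molecule against cleaned library molecules by
# Morgan-fingerprint Tanimoto similarity and return hits above a cutoff.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similar_compounds(query_smiles: str, library_smiles: list[str],
                      cutoff: float = 0.7) -> list[tuple[str, float]]:
    query_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(query_smiles), 2, nBits=2048)
    hits = []
    for smi in library_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unreadable entries
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        sim = DataStructs.TanimotoSimilarity(query_fp, fp)
        if sim >= cutoff:
            hits.append((smi, sim))
    return sorted(hits, key=lambda hit: -hit[1])  # most similar first
```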
The invention is applicable to the drug research and development process, building a knowledge base system that assists drug development through the automated collection, aggregation, cleaning, storage, and analysis of data on drug targets, drug molecular structures, competing products, drug patents, and experimental results. The system can connect to data from different sources; store, clean, and recalculate raw data using large-batch data processing and persistence technology; and build, as needed, a knowledge base oriented to domain problems. Data analysis tools built on top of the knowledge base give drug R&D personnel convenient capabilities for aggregating and analyzing drug data, improving R&D efficiency and promoting the development and design of new drug R&D methods.
In light of the ideal embodiments of the present application described above, relevant practitioners can make various changes and modifications without departing from the technical idea of the present application. The technical scope of the present application is not limited to the content of the specification; it must be determined according to the scope of the claims.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Claims (10)

  1. A data processing method for drug research and development, characterized by comprising:
    data integration: building multiple data integrators, each using the data access method matched to its data source, acquiring the data, serializing the acquired data into strings, and pushing them to a data collection pipeline, wherein the data collection pipeline stores the acquired data in a data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record, the data stored at this point being raw data;
    data processing: sending, through a trigger, the unique identifiers of the raw data stored in the data warehouse to a data cleaning pipeline, wherein data cleaning subscribers process the data in the data cleaning pipeline, accessing the content of the raw data through its unique identifier during cleaning, and the processed data is stored in the data warehouse under a new identifier, the data stored at this point being clean data;
    analysis: analyzing the data in the data warehouse and storing the analysis results in a knowledge base.
  2. The data processing method for drug research and development according to claim 1, characterized in that the data processing further comprises: merging molecules from different data sources into a consistent information list through rules and data supplementation.
  3. The data processing method for drug research and development according to claim 2, characterized in that the data processing comprises one or more of the following on compound data from different sources: a rationality check, an exclusion-rule check, a consistency check on the chiral information of chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values.
  4. The data processing method for drug research and development according to claim 3, characterized in that the data processing further comprises:
    recalculation: if a processing rule changes, retrieving the relevant historically collected raw data by its unique identifiers, sending, through a trigger, the unique identifiers of the raw data stored in the data warehouse to the data cleaning pipeline, reprocessing according to the changed processing rule, and storing the newly processed clean data in the data warehouse.
  5. The data processing method for drug research and development according to claim 1, characterized in that the data processing further comprises:
    aggregation: deduplicating the same molecule across different data sources while retaining its source information, and merging data inconsistencies across data sources caused by information asymmetry.
  6. The data processing method for drug research and development according to claim 5, characterized in that the analysis comprises: storing the cleaned and aggregated data, the results of predicting compound properties through chemical-physical computation, and the compound information together in the knowledge base.
  7. The data processing method for drug research and development according to any one of claims 1 to 6, characterized in that the data processing further comprises: organizing the processed compound information and its associated auxiliary information via CSV into a consistent data structure and storing it in the data warehouse.
  8. The data processing method for drug research and development according to claim 7, characterized in that the compound information comprises one or more of: a SMILES formula, compound source information, and a compound unique identifier; and the auxiliary information comprises one or more of: chirality, tautomers, whether the compound satisfies Lipinski's Rule of Five, and whether it is purchasable;
    the data integrators comprise one or more of: an API interface integrator, a file object integrator, a data stream object integrator, and an event object integrator;
    the data access method implemented by the API interface integrator is an HTTPS API, obtaining and parsing the returned results according to the provider's documentation, writing the returned content as a JSON- or CSV-format string, and transmitting it to the data collection pipeline;
    the file object integrator implements file-object-based data access, downloading data in file form through a download interface, completing the download, verifying the integrity of the downloaded file, and sending the data to the data collection pipeline;
    the event object integrator implements event-based data access, polling the data source's data and updates at a set interval, comparing against the time of the last data acquisition, fetching newly published data via the HTTPS API or file download, and sending it to the data collection pipeline;
    the data stream object integrator implements stream-object-based data access, acquiring data from sources that offer incremental or paginated acquisition, recording the parameters of the previous access, and acquiring the next batch of data incrementally.
  9. A knowledge base system for drug research and development, characterized by comprising:
    a data integration module: building multiple data integrators, each using the data access method matched to its data source, acquiring the data, serializing the acquired data into strings, and pushing them to a data collection pipeline, wherein the data collection pipeline stores the acquired data in a data warehouse in a batched, asynchronous manner and assigns a unique identifier to every stored data record, the data stored at this point being raw data;
    a data processing module: sending, through a trigger, the unique identifiers of the raw data stored in the data warehouse to a data cleaning pipeline, wherein data cleaning subscribers process the data in the data cleaning pipeline, accessing the content of the raw data through its unique identifier during cleaning, and the processed data is stored in the data warehouse under a new identifier, the data stored at this point being clean data;
    an analysis module: analyzing the data in the data warehouse and storing the analysis results in the knowledge base.
  10. The knowledge base system for drug research and development according to claim 9, characterized in that the data processing module further performs one or more of the following on compound data from different sources: a rationality check, an exclusion-rule check, a consistency check on the chiral information of chiral molecules, data supplementation for tautomers, and supplementation of predicted pKa values, merging molecules from different data sources into a consistent information list through rules and data supplementation.
PCT/CN2020/120425 2020-10-12 2020-10-12 Data processing method and system for drug research and development WO2022077166A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/120425 WO2022077166A1 (en) 2020-10-12 2020-10-12 Data processing method and system for drug research and development

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/120425 WO2022077166A1 (en) 2020-10-12 2020-10-12 Data processing method and system for drug research and development

Publications (1)

Publication Number Publication Date
WO2022077166A1 true WO2022077166A1 (en) 2022-04-21

Family

ID=81208797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120425 WO2022077166A1 (en) 2020-10-12 2020-10-12 Data processing method and system for drug research and development

Country Status (1)

Country Link
WO (1) WO2022077166A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113950A (en) * 2023-08-11 2023-11-24 广州标智未来科学技术有限公司 High-throughput experimental data processing method and device
CN117762954A (en) * 2023-11-17 2024-03-26 深圳市前海数据服务有限公司 Automatic data management method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153747A (en) * 2016-12-02 2018-06-12 航天星图科技(北京)有限公司 A kind of parallel data cleaning system
CN108846076A (en) * 2018-06-08 2018-11-20 山大地纬软件股份有限公司 The massive multi-source ETL process method and system of supporting interface adaptation
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data
CN110825721A (en) * 2019-11-06 2020-02-21 武汉大学 Hypertension knowledge base construction and system integration method under big data environment
US10701140B2 (en) * 2015-10-08 2020-06-30 International Business Machines Corporation Automated ETL resource provisioner


Similar Documents

Publication Publication Date Title
Rehman et al. Leveraging big data analytics in healthcare enhancement: trends, challenges and opportunities
Marin-Castro et al. Event log preprocessing for process mining: a review
EP2929467B1 (en) Integrating event processing with map-reduce
Ayers A second generation computer forensic analysis system
Rahul et al. Data life cycle management in big data analytics
US11403330B2 (en) Systems and methods for customized annotation of medical information
US7792783B2 (en) System and method for semantic normalization of healthcare data to support derivation conformed dimensions to support static and aggregate valuation across heterogeneous data sources
Pouchard Revisiting the data lifecycle with big data curation
US20090193054A1 (en) Tracking changes to a business object
WO2022077166A1 (en) Data processing method and system for drug research and development
Bahga et al. Healthcare data integration and informatics in the cloud
US20150134362A1 (en) Systems and methods for a medical coder marketplace
Chennamsetty et al. Predictive analytics on electronic health records (EHRs) using hadoop and hive
Wang et al. Large-scale multimodal mining for healthcare with mapreduce
US11538561B2 (en) Systems and methods for medical information data warehouse management
WO2018038745A1 (en) Clinical connector and analytical framework
US20210202111A1 (en) Method of classifying medical records
US20220114483A1 (en) Unified machine learning feature data pipeline
CN110659998A (en) Data processing method, data processing apparatus, computer apparatus, and storage medium
US20230113089A1 (en) Systems and methods for medical information data warehouse management
CN111445969A (en) Sales prediction method and system capable of flexibly adapting to noise
CN112164430B (en) Data processing method and system for drug development
US10346759B2 (en) Probabilistic inference engine based on synthetic events from measured data
Allen et al. Identifying and consolidating knowledge engineering requirements
US11971911B2 (en) Systems and methods for customized annotation of medical information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20956948

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 230823)

122 Ep: pct application non-entry in european phase

Ref document number: 20956948

Country of ref document: EP

Kind code of ref document: A1