WO2023123287A1 - Molecular data storage method and device, application method and device - Google Patents

Molecular data storage method and device, application method and device

Info

Publication number
WO2023123287A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
molecular
calculation
incremental
database
Prior art date
Application number
PCT/CN2021/143388
Other languages
English (en)
French (fr)
Inventor
张楠
陈勇攀
吴宇亮
Original Assignee
深圳晶泰科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司
Priority to PCT/CN2021/143388 priority Critical patent/WO2023123287A1/zh
Publication of WO2023123287A1 publication Critical patent/WO2023123287A1/zh

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/40 Searching chemical structures or physicochemical data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • G16C20/64 Screening of libraries

Definitions

  • The present application relates to the field of data processing, in particular to a molecular data storage method and device and to a molecular data application method and device.
  • In the early screening stage of drug development, generation by artificial intelligence (AI) and virtual screening of large compound libraries typically produce a large amount of molecular data. These data generally include the physical and chemical properties of the molecule itself, for example basic attributes such as molecular mass and the molecular SMILES string (simplified molecular input line entry system), as well as files containing the three-dimensional coordinate structure of the molecule.
  • A large amount of structured molecular data is also generated, such as molecular properties, protein binding energies, and activity data, together with a large amount of semi-structured and unstructured molecular data, such as molecular structures, molecular synthesis reports, and molecular activity experiment reports. Molecule-related metadata is involved as well, such as docking protein information and patent documents on molecular skeletons.
  • Many algorithms, simple and complex, are run on molecular data. An algorithm either operates on individual molecules independently or screens molecules in batches.
  • This application provides a molecular data storage method and device and a molecular data application method and device that enable convenient and effective management of all data related to a molecule and make it easier for users to query and use that data.
  • The first aspect of the present application provides a method for storing molecular data, the method comprising: receiving molecular data to be processed; verifying the molecular data to obtain data that passes the verification; determining incremental auxiliary data of the data that passes the verification, the incremental auxiliary data including any one or more of structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data; saving the data that passes the verification and the structured incremental auxiliary data into a database; and saving the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data into files and adding the directory index of the files to the database to establish the association between the files and the database.
  • The second aspect of the present application provides a molecular data storage device, which includes a data receiving module, an analysis module, a data summary module, and a data management module.
  • The data receiving module is used to receive the molecular data to be processed.
  • The analysis module is used to verify the molecular data to obtain data that passes the verification.
  • The data summary module is used to determine the incremental auxiliary data of the data that passes the verification. The incremental auxiliary data includes any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data.
  • The data management module is used to save the data that passes the verification and the structured incremental auxiliary data into a database, to save the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data into files, and to add the directory index of the files to the database to establish the association between the files and the database.
  • The third aspect of the present application provides a molecular data application method, the method comprising: receiving a calculation method submitted by a user through an API; obtaining calculation data related to the calculation method, the calculation data including calculation data obtained from a database and/or files, where the database stores molecular data and its structured incremental auxiliary data, the files store the semi-structured and/or unstructured incremental auxiliary data of the molecular data, and the database contains the directory index of the files; performing calculation on the calculation data using the calculation method to obtain a calculation result; and saving the calculation result to the database and/or the files.
  • The fourth aspect of the present application provides a molecular data application device, which includes: an application interface module, used to receive the calculation method submitted by the user; a calculation data acquisition module, used to obtain calculation data related to the calculation method, the calculation data including calculation data obtained from a database and/or files, where the database stores molecular data and its structured incremental auxiliary data, the files store the semi-structured and/or unstructured incremental auxiliary data of the molecular data, and the database contains the directory index of the files; and a calculation processing module, used to perform calculation on the calculation data using the calculation method, obtain the calculation result, and save the calculation result in the database and/or the files.
  • A fifth aspect of the present application provides an electronic device, including a processor and a memory on which executable code is stored; when the executable code is executed by the processor, the processor is caused to execute the above method.
  • The sixth aspect of the present application also provides a computer-readable storage medium on which executable code is stored; when the executable code is executed by a processor of an electronic device, the processor is caused to execute the above method.
  • The seventh aspect of the present application further provides a computer program product including executable code; when the executable code is executed by a processor, the foregoing method is implemented.
  • The molecular data storage method and device provided in the embodiments of the present application verify the molecular data to be processed and determine the incremental auxiliary data of the data that passes the verification. The incremental auxiliary data may include one or more different types of data, and a different storage method is adopted for each type according to its characteristics: the data that passes the verification and its related structured molecular data are saved in the database, while the related semi-structured and unstructured molecular data are saved to files. The directory index of the files is added to the database to establish the association between the files and the database, which enables convenient and effective management of all data related to a molecule and makes it easier for users to query and use the data later.
  • The molecular data application method and device provided in the embodiments of the present application receive calculation methods submitted by users through the application programming interface (API), such as quantum chemistry algorithms, computational chemistry algorithms, and AI model algorithms. Based on the effective storage and association of molecular data and its different types of incremental auxiliary data described above, they obtain the calculation data related to the calculation method, perform the corresponding calculation to obtain calculation results, and save the calculation results in the database or files, so that the calculation results are also stored effectively.
  • The technical solution of the present application not only supports local computing but can also submit part or all of the data to a remote cluster server for computation, which greatly increases the available computing power and thereby meets the various computing needs of users.
  • The technical solution of the present application can also receive query information submitted by the user through the API. It can query according to different kinds of query information, such as molecular substructure, molecular similarity, and molecular attribute parameters, read the matching data from the database, and display the data that was read, which makes it easier for users to query and use molecular data.
  • FIG. 1 schematically shows an exemplary system architecture to which a molecular data storage method and device and an application method and device can be applied according to an embodiment of the present application;
  • FIG. 2 schematically shows a flowchart of a molecular data storage method according to an embodiment of the present application;
  • FIG. 3 schematically shows a molecular dimension model in the embodiment of the present application;
  • FIG. 4 schematically shows a flowchart of a molecular data application method according to an embodiment of the present application;
  • FIG. 5 schematically shows a structural block diagram of a molecular data storage device according to an embodiment of the present application;
  • FIG. 6 schematically shows a structural block diagram of a molecular data application device according to an embodiment of the present application;
  • FIG. 7 schematically shows another structural block diagram of a molecular data application device according to an embodiment of the present application;
  • FIG. 8 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.
  • Although the terms "first", "second", "third" and so on may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another.
  • For example, first information may also be called second information, and similarly, second information may also be called first information.
  • A feature defined as "first" or "second" may explicitly or implicitly include one or more of these features.
  • "Plurality" means two or more, unless otherwise specifically defined.
  • A data warehouse is a core component for storing structured data; it supports report compilation, the construction of data marts, and business intelligence.
  • JSON (JavaScript Object Notation) is a lightweight data-interchange format based on a subset of ECMAScript (the JavaScript specification). JSON is easy for people to read and write, easy for machines to parse and generate, and can improve network transmission efficiency.
  • The simplified molecular input line entry system (SMILES) is a specification that uses ASCII strings to describe molecular structures unambiguously.
  • A SMILES string can be imported by various molecular editing software and converted into a two-dimensional drawing or a three-dimensional model of the molecule.
  • Relational databases are mostly applied to OLTP (Online Transaction Processing) and support transaction processing. Data warehouses are applied to OLAP (Online Analytical Processing) and support data analysis, but they cannot store semi-structured and unstructured data and cannot be used to run molecular screening algorithms.
  • The embodiment of the present application provides a molecular data storage method and device.
  • The molecular data to be processed is first verified.
  • For the data that passes the verification, not only is the molecular data itself stored, but its incremental auxiliary data is also determined.
  • A molecular data storage method and device and an application method and device according to an embodiment of the present application will be described in detail below with reference to FIGS. 1 to 8.
  • FIG. 1 schematically shows an exemplary system architecture to which a molecular data storage method and device and an application method and device can be applied according to an embodiment of the present application.
  • FIG. 1 is only an example of a system architecture to which the embodiment of the present application can be applied, intended to help those skilled in the art understand the technical content of the present application. It does not mean that the embodiment of the present application cannot be used in other devices, systems, environments or scenarios.
  • A system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105.
  • The network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105.
  • The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
  • Users can use the terminal devices 101, 102, 103 to interact with other terminal devices and the server 105 through the network 104 to receive or send information, such as sending calculation methods and calculation data.
  • Various applications can be installed on the terminal devices 101, 102, and 103, for example drug development applications, material design applications, web browser applications, database applications, search applications, instant messaging tools, email clients, social platform software, and so on.
  • The terminal devices 101, 102, and 103 include but are not limited to smart desktop computers, tablet computers, laptop computers, and other electronic devices that support functions such as modeling, analysis and calculation, design, and Internet access.
  • The server 105 can receive calculation methods, calculation data, and the like, and can also send calculation results to the terminal devices 101, 102, and 103.
  • The server 105 may be a background management server, a server cluster, or the like.
  • The number of terminal devices shown is only illustrative. Depending on implementation needs, there can be any number of terminal devices, networks and clouds.
  • The embodiment of the present application also provides a molecular data application method and device. Using the above-mentioned database and files, users can conveniently use and query the molecular data, and calculation results can be saved in the database and files.
  • FIG. 2 shows a flowchart of a molecular data storage method according to an embodiment of the present application.
  • The molecular data storage method in this embodiment includes operations S210 to S240.
  • Molecular data to be processed is received.
  • The molecular data to be processed can be any molecular data involved in drug development, for example the molecular SMILES, molecular mass, and molecular activity data produced during molecular design, synthesis, experimental testing, and evaluation.
  • The molecular data is verified to obtain data that passes the verification.
  • Verification can be performed according to the dimension of the molecular data.
  • The dimension of the molecular data may be determined according to the molecular dimension model, and the molecular data may be verified according to the determined dimension to obtain data that passes the verification.
  • The molecular dimension model can be determined according to the type of molecular data and can include a molecular basic property dimension table and one or more molecular attribute dimension tables.
  • The molecular attribute dimension tables include a calculated attribute dimension table and an experimental attribute dimension table, and the molecular basic property dimension table includes the identification field of each molecular attribute dimension table.
  • FIG. 3 schematically shows the molecular dimension model in the embodiment of the present application.
  • Molecule is the molecular basic property dimension table. The basic properties of the molecule include smiles, the two-dimensional structural formula of the molecule, and inch-key, a label obtained by a hash operation based on the smiles. The other id fields are association identifiers, that is, the identification fields of each molecular attribute dimension table.
  • The corresponding molecular attribute dimension tables include FEPResults, Structure, Activity, ADMET Results, Synthesis Results, and so on.
  • In the FEPResults dimension table, pdb refers to the target protein used in drug research, dg is the binding free energy of the molecule to the protein, and fep_method refers to the calculation method used.
  • In the ADMET Results dimension table, clogp refers to the hydrophobicity of the molecule, caco-2 refers to its absorption in the Caco-2 cell line, and water_solubility refers to the water solubility of the molecule.
  • In the Structure dimension table, name refers to the name of the three-dimensional structure file of the molecule, and file_path is the file storage path.
  • In the Synthesis Results dimension table, weight and purity refer to the mass and purity of the synthesized molecule respectively, and report records the synthesis report information.
  • In the Activity dimension table, the values measured by different methods for different activity properties are recorded: name identifies the record with a custom name, property refers to the activity property tested, and method records the method of the activity experiment.
  • FIG. 3 is only a simple example of the molecular dimension model in the embodiment of the present application and is not intended to limit the specific structure of the molecular dimension model.
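  • As a rough illustration of the dimension model described for FIG. 3, the following minimal sketch creates the tables in an in-memory SQLite database. The snake_case table and column names, the column types, the `*_id` link columns, and the sample values are assumptions made for this sketch and are not taken from the application.

```python
import sqlite3

# Hypothetical schema modelled on the FIG. 3 description. Names and types are
# assumptions; only the field meanings come from the text.
SCHEMA = """
CREATE TABLE fep_results (id INTEGER PRIMARY KEY, pdb TEXT, dg REAL, fep_method TEXT);
CREATE TABLE structure (id INTEGER PRIMARY KEY, name TEXT, file_path TEXT);
CREATE TABLE activity (id INTEGER PRIMARY KEY, name TEXT, property TEXT, method TEXT, value REAL);
CREATE TABLE admet_results (id INTEGER PRIMARY KEY, clogp REAL, caco_2 REAL, water_solubility REAL);
CREATE TABLE synthesis_results (id INTEGER PRIMARY KEY, weight REAL, purity REAL, report TEXT);
-- The basic property table carries one identification field per attribute table.
CREATE TABLE molecule (
    id INTEGER PRIMARY KEY,
    smiles TEXT NOT NULL,
    inch_key TEXT,
    fep_results_id INTEGER REFERENCES fep_results(id),
    structure_id INTEGER REFERENCES structure(id),
    activity_id INTEGER REFERENCES activity(id),
    admet_results_id INTEGER REFERENCES admet_results(id),
    synthesis_results_id INTEGER REFERENCES synthesis_results(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder values, for illustration only.
fep_id = conn.execute(
    "INSERT INTO fep_results (pdb, dg, fep_method) VALUES (?, ?, ?)",
    ("placeholder_target", -8.2, "placeholder_method"),
).lastrowid
conn.execute(
    "INSERT INTO molecule (smiles, fep_results_id) VALUES (?, ?)", ("CCO", fep_id)
)

# Follow the identification field from the basic table to an attribute table.
row = conn.execute(
    "SELECT m.smiles, f.dg FROM molecule m "
    "JOIN fep_results f ON f.id = m.fep_results_id"
).fetchone()
```

  • A NULL identification field in `molecule` would then indicate that the molecule has no record of that kind yet, which is one possible way to represent the molecular state data mentioned in the text.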
  • Each field contained in the molecular data and its attribute information can be determined according to the dimension of the molecular data, and each field is verified against its attribute information to obtain data that passes the verification.
  • In this way, the molecular data is verified.
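  • Field-level verification of this kind can be sketched as follows. This is a minimal illustration, assuming that the attribute information of a field consists of an expected type, a required flag, and an optional value check; `FieldSpec`, `MOLECULE_FIELDS`, and the field names are hypothetical and not defined in the application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldSpec:
    """Assumed attribute information for one field of a dimension table."""
    type_: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None


# Hypothetical field definitions for the basic property dimension.
MOLECULE_FIELDS = {
    "smiles": FieldSpec(str, check=lambda s: len(s) > 0),
    "molecular_mass": FieldSpec(float, check=lambda m: m > 0),
    "name": FieldSpec(str, required=False),
}


def verify_record(record, specs):
    """Return a list of error messages; an empty list means the record passes."""
    errors = []
    for field, spec in specs.items():
        if record.get(field) is None:
            if spec.required:
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        if not isinstance(value, spec.type_):
            errors.append(f"{field}: expected {spec.type_.__name__}")
        elif spec.check is not None and not spec.check(value):
            errors.append(f"{field}: value {value!r} failed its check")
    return errors


def split_verified(records, specs):
    """Separate records that pass verification from those that fail."""
    passed, failed = [], []
    for record in records:
        errors = verify_record(record, specs)
        if errors:
            failed.append((record, errors))
        else:
            passed.append(record)
    return passed, failed


passed, failed = split_verified(
    [
        {"smiles": "CCO", "molecular_mass": 46.07},
        {"smiles": "", "molecular_mass": -1.0},
    ],
    MOLECULE_FIELDS,
)
```

  • The error lists collected in `failed` could then be shown to the operator so that the offending records can be corrected or deleted.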
  • The incremental auxiliary data of the data that passes the verification is determined.
  • Incremental auxiliary data refers to data related to the molecular data other than the basic data of the molecule itself, such as molecular state data (for example, whether the molecule has FEP data, whether it has experimental data, and its molecular fingerprint) and the specific information corresponding to each molecular state. It can be configured as required.
  • Incremental auxiliary data can generally be used to screen and filter molecules in batches, for example to select the molecules that have undergone a certain type of calculation or experiment.
  • The incremental auxiliary data of molecular data may include any one or more of the following types: structured molecular data, semi-structured molecular data, and unstructured molecular data.
  • For molecular data that fails the verification, an error can also be reported, for example by displaying an error message, so that the operator learns promptly whether the molecular data is correct and can handle it appropriately, for example by correcting or deleting the molecular data.
  • The molecular data storage method and device provided in the embodiments of the present application verify the molecular data to be processed and determine the incremental auxiliary data of the data that passes the verification. The incremental auxiliary data may include one or more different types of data, and a different storage method is adopted for each type according to its characteristics: the data that passes the verification and its related structured molecular data are saved in the database, while the related semi-structured and unstructured molecular data are saved to files. The directory index of the files is added to the database to establish the association between the files and the database, which enables convenient and effective management of all data related to a molecule and makes it easier for users to query and use the data later.
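  • The storage flow described above can be sketched roughly as follows: structured data goes into database rows, semi-structured and unstructured data go into files, and the file paths are recorded in the database as the directory index. The schema, the `store_molecule` function, the one-directory-per-molecule layout, and the sample payloads are all assumptions made for this sketch.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema used only for this illustration.
SCHEMA = """
CREATE TABLE molecule (id INTEGER PRIMARY KEY, smiles TEXT NOT NULL);
CREATE TABLE molecule_property (molecule_id INTEGER, name TEXT, value REAL);
CREATE TABLE file_index (molecule_id INTEGER, file_path TEXT);
"""


def store_molecule(conn, file_root, smiles, structured, documents):
    """Store one verified molecule.

    structured: mapping of property name to numeric value (saved in the database).
    documents:  mapping of file name to bytes (saved as files, indexed in the database).
    """
    mol_id = conn.execute(
        "INSERT INTO molecule (smiles) VALUES (?)", (smiles,)
    ).lastrowid
    for name, value in structured.items():
        conn.execute(
            "INSERT INTO molecule_property (molecule_id, name, value) VALUES (?, ?, ?)",
            (mol_id, name, value),
        )
    mol_dir = Path(file_root) / str(mol_id)
    mol_dir.mkdir(parents=True, exist_ok=True)
    for file_name, payload in documents.items():
        path = mol_dir / file_name
        path.write_bytes(payload)
        # The directory index ties the file back to the database record.
        conn.execute(
            "INSERT INTO file_index (molecule_id, file_path) VALUES (?, ?)",
            (mol_id, str(path)),
        )
    conn.commit()
    return mol_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
file_root = tempfile.mkdtemp()

# Placeholder payloads: a semi-structured JSON report and unstructured notes.
mol_id = store_molecule(
    conn,
    file_root,
    "CCO",
    {"molecular_mass": 46.07},
    {
        "synthesis_report.json": json.dumps({"purity": 0.98}).encode(),
        "notes.txt": b"free-text experiment notes",
    },
)
```

  • Looking up `file_index` by `molecule_id` then returns every file that belongs to a molecule, without the database having to hold the file contents.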
  • The embodiment of the present application also provides a molecular data application method, which gives users an effective way to use molecular data.
  • FIG. 4 schematically shows a flowchart of a molecular data application method according to an embodiment of the present application.
  • The calculation method may be, for example but not limited to, any of the following: a quantum chemistry algorithm, a computational chemistry algorithm, an AI model algorithm, and so on.
  • RESTful or GraphQL-style APIs can be used, which is not limited in this embodiment of the present application.
  • RESTful refers to the REST (Representational State Transfer) style, an architectural style for APIs that is typically built on HTTP (Hypertext Transfer Protocol). It is suitable for scenarios in which mobile Internet vendors expose business interfaces, allowing third-party OTT services to call mobile network resources. The action types are adding, changing, and deleting the called resources.
  • GraphQL is a query language (the QL in its name) that is particularly advantageous for querying graph data, hence the name GraphQL. It is related to SQL only in sharing the QL suffix. A GraphQL API can be backed by a NoSQL database, a SQL database, or other storage methods.
  • Calculation data related to the calculation method is obtained, and the calculation data includes calculation data obtained from the database and/or files.
  • The database and files here refer to the above-mentioned database and files storing molecular data and its incremental auxiliary data, where the database includes the directory index of the files.
  • The calculation data related to the calculation method may also include calculation data input by the user.
  • The calculation data needed by the calculation method submitted by the user can come partly from stored data, such as the molecular three-dimensional structure files stored in the file storage system or the molecular physical and chemical property data stored in the database, and partly from user input, such as parameter configuration information indicating which batch of molecular data is used for the calculation. Alternatively, all of the calculation data may come from the above-mentioned files and database, which is not limited in this embodiment of the present application.
  • Calculation is performed on the calculation data using the calculation method to obtain a calculation result.
  • Part or all of the calculation can be completed locally, or part or all of the calculation can be submitted to a remote cluster server, and the calculation results returned by the remote cluster server are then received.
  • For simple calculation tasks, the calculation can be completed locally and the results returned in real time; for complex, long-running calculation tasks, asynchronous calculation can be used, with the calculation progress and results continuously monitored.
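  • The split between real-time and asynchronous calculation can be sketched as below. This is an illustration only: a local thread pool stands in for the remote cluster server, and `run_calculation`, `total_mass`, and the `long_running` flag are hypothetical names introduced for this sketch.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for a remote cluster: a small local thread pool (an assumption).
_executor = ThreadPoolExecutor(max_workers=2)


def run_calculation(method, data, long_running=False):
    """Run `method(data)`.

    Short tasks run in the caller and return the result directly.
    Long tasks are submitted for asynchronous execution and return a Future
    that the caller can poll for progress and results.
    """
    if not long_running:
        return method(data)
    return _executor.submit(method, data)


def total_mass(masses):
    """Toy calculation method used for the example."""
    return sum(masses)


# Real-time path: the result is available immediately.
sync_result = run_calculation(total_mass, [1.0, 2.0])

# Asynchronous path: monitor the Future, then collect the result.
future = run_calculation(total_mass, [1.0, 2.0], long_running=True)
async_result = future.result(timeout=5)  # blocks until done, or raises on timeout
```

  • A caller that must not block can check `future.done()` periodically instead of calling `future.result()`, and `future.exception()` exposes a failure so that it can be written to a log.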
  • The calculation results are saved to the database and/or to files.
  • If an error occurs during the calculation, the error information can also be stored, for example in a log file.
  • The molecular data application method and device provided in the embodiments of the present application receive calculation methods submitted by users, such as quantum chemistry algorithms, computational chemistry algorithms, and AI model algorithms. Based on the effective storage and association of the above-mentioned molecular data and its different types of incremental auxiliary data, they obtain the calculation data related to the calculation method, perform the corresponding calculation to obtain calculation results, and save the calculation results in the database or files, so that the calculation results are also stored effectively.
  • The above-mentioned database and files can also be used to make it easier for users to query molecular data.
  • Query information submitted by the user is received, which may include but is not limited to any one or more of the following: molecular substructure, molecular similarity, molecular attribute parameters, and so on. Data is read from the database according to the query information, and the data that was read is displayed.
  • When performing a query, a query strategy can be generated according to the query information and data can be read from the database according to the query strategy; format conversion is then performed on the data that was read.
  • For example, each piece of molecular data has a smiles attribute, which describes the two-dimensional structure of the molecule and can be used for substructure matching searches.
  • A smiles string describing the substructure can be used as the query information, and an API (application programming interface) is called to submit the query information to the system.
  • The system runs the substructure matching search algorithm, traverses and screens the 1 million pieces of data, and returns the molecular data that match. If the submitted query information contains both a substructure and a condition such as "molecular mass greater than 100", the system automatically recognizes this and generates an optimized query strategy: it first finds the molecules with molecular mass greater than 100 and then runs the substructure search algorithm on those molecules to filter them.
  • The structure of a molecule can be described by smiles, or the sdf or mol file format can be used to give a three-dimensional structure described by coordinates. According to the data format specified in the query information, the system automatically converts smiles into the requested format.
  • The query process is a general capability that is handled according to the specific query fields.
  • The system automatically retrieves data for all molecules in the database based on the query strategy of the molecular substructure search algorithm.
  • The system supports format conversion of smiles for different rendering needs and also supports conversion to other molecular data formats such as sdf.
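  • The query strategy described above, applying the cheap attribute filter before the expensive substructure search, can be sketched as follows. The `Molecule` class, `run_query`, and the sample records are hypothetical. The matcher here is a plain substring test on the SMILES text, which is only a placeholder and not a real substructure search; an actual system would use graph-based matching from a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    smiles: str
    mass: float


def naive_substructure_match(smiles, fragment):
    """Placeholder matcher: substring test on SMILES text, NOT real graph matching."""
    return fragment in smiles


def run_query(molecules, substructure=None, min_mass=None,
              matcher=naive_substructure_match):
    """Filter by mass first, then run the costlier substructure matcher."""
    candidates = list(molecules)
    if min_mass is not None:
        candidates = [m for m in candidates if m.mass > min_mass]
    if substructure is not None:
        candidates = [m for m in candidates if matcher(m.smiles, substructure)]
    return candidates


# Small sample set: ethanol, phenol, benzoic acid, aspirin.
LIBRARY = [
    Molecule("CCO", 46.07),
    Molecule("c1ccccc1O", 94.11),
    Molecule("c1ccccc1C(=O)O", 122.12),
    Molecule("CC(=O)Oc1ccccc1C(=O)O", 180.16),
]

# Count matcher calls to show that the mass filter shrinks the search.
calls = []


def counting_matcher(smiles, fragment):
    calls.append(smiles)
    return naive_substructure_match(smiles, fragment)


hits = run_query(LIBRARY, substructure="c1ccccc1", min_mass=100,
                 matcher=counting_matcher)
```

  • With the mass condition applied first, the matcher is invoked only on the two molecules heavier than 100 rather than on all four, which is the effect the optimized query strategy aims for on a much larger library.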
  • The embodiment of the present application also provides a molecular data storage device. As shown in FIG. 5, in a non-limiting embodiment, the molecular data storage device 500 includes a data receiving module 510, an analysis module 520, a data summary module 530, and a data management module 540, wherein:
  • The data receiving module 510 is used to receive molecular data to be processed.
  • The analysis module 520 is used to verify the molecular data and obtain the data that passes the verification.
  • The data summary module 530 is used to determine the incremental auxiliary data of the data that passes the verification. The incremental auxiliary data includes any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data.
  • The data management module 540 is used to save the data that passes the verification and the structured incremental auxiliary data in the database, to save the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data in files, and to add the directory index of the files to the database to establish the association between the files and the database.
  • The analysis module 520 may include a data parsing unit and a data verification unit.
  • The data parsing unit is used to determine the dimension of the molecular data according to the molecular dimension model.
  • The data verification unit is used to verify the molecular data according to the determined dimension of the molecular data and obtain data that passes the verification.
  • The above-mentioned data verification unit may include a data detail determination unit and a field verification unit.
  • The data detail determination unit is used to determine each field contained in the molecular data and its attribute information according to the dimension of the molecular data.
  • The field verification unit is configured to verify each field according to its attribute information and obtain data that passes the verification.
  • The molecular dimension model may include a molecular basic property dimension table and one or more molecular attribute dimension tables.
  • The molecular attribute dimension tables may include a calculated attribute dimension table and an experimental attribute dimension table, and the molecular basic property dimension table may include the identification field of each molecular attribute dimension table.
  • The molecular data management device provided by the embodiment of the present application enables convenient and effective management of molecular data and its various types of incremental auxiliary data and makes it easier for users to query and use the data later. It is especially suitable for complex, long-running usage scenarios such as pharmaceutical R&D.
  • the embodiment of the present application also provides a molecular data application device, which can use the above-mentioned database and files storing molecular data and its incremental auxiliary data to provide users with calculation and query functions.
  • the molecular data application device 600 includes the following modules: an application interface module 610 , a calculation data acquisition module 620 and a calculation processing module 630 .
  • the application interface module 610 is used to receive the calculation method submitted by the user.
  • the calculation method includes but is not limited to any of the following: quantum chemical algorithm, computational chemical algorithm, AI model algorithm, etc.
  • Calculation data acquisition module 620 configured to acquire calculation data related to the calculation method.
  • the calculation processing module 630 is configured to use a calculation method to calculate the calculation data, obtain a calculation result, and save the calculation result in a database and/or a file.
  • calculation data may specifically include calculation data obtained from databases and/or files.
  • the database and files refer to the database and files established by the above-mentioned molecular data storage device.
  • the molecular data and its structured incremental auxiliary data are stored in the database, the semi-structured incremental auxiliary data and/or unstructured incremental auxiliary data of the molecular data are stored in the files, and the database contains a directory index of the files.
  • calculation data may also include calculation data input by the user, such as parameter configuration information used to indicate which batch of molecular data is used for calculation.
  • the above computing processing module 630 may include a local computing unit and/or a remote computing unit.
  • the local calculation unit is used to perform local calculation on part or all of the data by using a calculation method to obtain a calculation result.
  • the remote computing unit is used to submit part or all of the calculation method and calculation data to the remote cluster server for calculation, and receive the calculation result returned by the remote cluster server.
  • for example, lightweight calculations can be completed locally and the calculation results returned in real time.
  • complex, long-running calculation tasks can be run asynchronously, with the calculation progress and calculation results monitored continuously.
  • the molecular data application device 600 may further include: a query and export module 640 and a display module 650 .
  • the application interface module 610 is also configured to receive query information submitted by the user.
  • the query and export module 640 is used to read data from the database and/or files according to the query information.
  • the display module 650 is used to display the data read out by the query and export module 640 .
  • query information may include, but is not limited to, any one or more of the following: molecular substructure, molecular similarity, molecular attribute parameters, and the like.
  • the above query and export module 640 may specifically include: a filtering and sorting unit and a format converting unit.
  • the filtering and sorting unit is used for generating a query strategy according to the query information, and reading data from the database and/or the file according to the query strategy.
  • the format conversion unit is used for performing format conversion on the data read by the filtering and sorting unit.
  • the above-mentioned molecular data storage device 500 and molecular data application device 600 can also be integrated into one system, and the system is divided into a storage layer, a business layer, and a display layer.
  • the modules and units in the above-mentioned molecular data storage device 500 and molecular data application device 600 can be arranged on the business layer and the display layer, and the database and files are arranged on the storage layer. Different permissions can be set for the creation and use of data to ensure data security.
  • Another aspect of the present application also provides an electronic device, which can implement the molecular data storage method provided in the embodiment of the present application, or implement the molecular data application method provided in the embodiment of the present application.
  • FIG. 8 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.
  • an electronic device 800 includes a memory 810 and a processor 820 .
  • the processor 820 can be a central processing unit (Central Processing Unit, CPU), and can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory 810 may include various types of storage units, such as system memory, read only memory (ROM), and persistent storage. The ROM can store static data or instructions required by the processor 820 or other modules of the computer.
  • the persistent storage device may be a readable and writable storage device. Persistent storage may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off.
  • in some embodiments, the persistent storage device is a mass storage device (such as a magnetic or optical disk, or flash memory).
  • in other embodiments, the persistent storage device may be a removable storage device (such as a floppy disk or an optical drive).
  • System memory can be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory.
  • System memory can store some or all of the instructions and data that the processor needs at runtime.
  • the memory 810 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (such as DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic disks and/or optical disks may also be used.
  • in some embodiments, the memory 810 may include a readable and/or writable removable storage device such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super density disc, a flash memory card (such as an SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc.
  • computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or by wire.
  • executable code is stored in the memory 810, and when the executable code is executed by the processor 820, the processor 820 may carry out part or all of the methods mentioned in the foregoing embodiments.
  • the method according to the present application can also be implemented as a computer program or computer program product, the computer program or computer program product including computer program code instructions for executing part or all of the operations in the above method of the present application.
  • the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) on which executable code (or a computer program or computer instruction code) is stored; when the executable code (or computer program or computer instruction code) is executed by the processor of an electronic device (or server, etc.), the processor is made to perform part or all of the operations of the above method according to the present application.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

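A molecular data management method and device. The method includes: receiving molecular data to be processed; verifying the molecular data to obtain data that passes the verification; determining incremental auxiliary data of the verified data, the incremental auxiliary data including any one or more of the following: structured molecular data, semi-structured molecular data, and unstructured molecular data; saving the verified data and the structured molecular data to a database; and saving the semi-structured molecular data and the unstructured molecular data to files, and adding a directory index of the files to the database so as to associate the files with the database. With the technical solution of the present application, all data related to a molecule can be managed conveniently and effectively.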

Description

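Molecular data storage method and device, and application method and device
Technical Field
The present application relates to the field of data processing, in particular to a molecular data storage method and device, and also to a molecular data application method and device.
Background
In early-stage screening for drug discovery, large batches of molecular data are typically generated by artificial intelligence (AI) and by virtual screening of large compound libraries. Such molecular data generally includes the physicochemical attributes of the molecule itself, that is, basic attributes such as molecular mass and the molecular SMILES (simplified molecular input line entry system) string, together with three-dimensional coordinate structure files of the molecule. During the screening workflow, various computational chemistry, quantum chemistry, and AI model prediction methods are used, such as free energy perturbation (FEP) calculation, quantum chemistry (QM) calculation, molecular dynamics (MD) simulation, and ADMET (absorption, distribution, metabolism, excretion, and toxicity) prediction, to obtain energy information about the molecule and predicted information such as absorption, breakdown, and toxicity; some molecules are also synthesized and subjected to various bioactivity experiments to obtain their activity data.
Throughout the molecular screening workflow, a large amount of structured molecular data is produced, such as molecular properties, protein binding energies, and activity data, along with a large amount of semi-structured and unstructured molecular data, such as molecular structures, molecular synthesis reports, and molecular activity experiment reports; molecule-related metadata is also involved, such as docking protein information and patent documents on molecular scaffolds. While screening is carried out, many simple or complex algorithms run on the molecular data. An algorithm either operates on each molecule independently or screens molecules in batches.
Because of this large volume of data, building a screening workflow in the related art is very difficult: the data produced in the screening steps cannot be accessed efficiently, and the workflow stages are chained together by various non-standardized methods, so that comprehensive data analysis and retrospective review cannot be unified, and data analysis is laborious and inefficient.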
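Summary
To solve or partially solve the problems in the related art, the present application provides a molecular data storage method and device and a molecular data application method and device, which make it possible to manage all data related to a molecule conveniently and effectively and make it easier for users to query and use that data.
A first aspect of the present application provides a molecular data storage method, the method including: receiving molecular data to be processed; verifying the molecular data to obtain data that passes the verification; determining incremental auxiliary data of the verified data, the incremental auxiliary data including any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data; saving the verified data and the structured incremental auxiliary data to a database; and saving the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data to files, and adding a directory index of the files to the database so as to associate the files with the database.
A second aspect of the present application provides a molecular data storage device, the device including a data receiving module, an analysis module, a data summary module, and a data management module. The data receiving module is configured to receive molecular data to be processed; the analysis module is configured to verify the molecular data to obtain data that passes the verification; the data summary module is configured to determine incremental auxiliary data of the verified data, the incremental auxiliary data including any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data; the data management module is configured to save the verified data and the structured incremental auxiliary data to a database, to save the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data to files, and to add a directory index of the files to the database so as to associate the files with the database.
A third aspect of the present application provides a molecular data application method, the method including: receiving a calculation method submitted by a user through an API; obtaining calculation data related to the calculation method, the calculation data including calculation data obtained from a database and/or files, where the database stores molecular data and its structured incremental auxiliary data, the files store semi-structured incremental auxiliary data and/or unstructured incremental auxiliary data of the molecular data, and the database contains a directory index of the files; applying the calculation method to the calculation data to obtain a calculation result; and saving the calculation result to the database and/or the files.
A fourth aspect of the present application provides a molecular data application device, the device including: an application interface module configured to receive a calculation method submitted by a user; a calculation data acquisition module configured to obtain calculation data related to the calculation method, the calculation data including calculation data obtained from a database and/or files, where the database stores molecular data and its structured incremental auxiliary data, the files store semi-structured incremental auxiliary data and/or unstructured incremental auxiliary data of the molecular data, and the database contains a directory index of the files; and a calculation processing module configured to apply the calculation method to the calculation data to obtain a calculation result and to save the calculation result to the database and/or the files.
A fifth aspect of the present application provides an electronic device, including: a processor; and a memory on which executable code is stored, where the executable code, when executed by the processor, causes the processor to perform the above method.
A sixth aspect of the present application further provides a computer-readable storage medium on which executable code is stored, where the executable code, when executed by a processor of an electronic device, causes the processor to perform the above method.
A seventh aspect of the present application further provides a computer program product including executable code, where the executable code implements the above method when executed by a processor.
In the molecular data storage method and device provided by the embodiments of the present application, the molecular data to be processed is verified, and incremental auxiliary data is determined for the data that passes the verification. Because the incremental auxiliary data may include one or more different types of data, a different storage mode is used for each type: the verified data and its related structured molecular data are saved to a database, its related semi-structured and unstructured molecular data are saved to files, and a directory index of the files is added to the database to associate the files with the database. All data related to a molecule is thereby managed conveniently and effectively, and subsequent querying and use of the data by users is made easier.
In the molecular data application method and device provided by the embodiments of the present application, a calculation method submitted by a user, such as a quantum chemistry algorithm, a computational chemistry algorithm, or an AI model algorithm, is received through an application programming interface (API). Based on the effective storage of the molecular data and its different types of incremental auxiliary data described above, and on the associations among them, calculation data related to the calculation method is obtained, the corresponding calculation is performed to obtain a calculation result, and the calculation result is saved to the database or the files, so that the calculation result is also stored effectively.
Further, the technical solution of the present application supports not only local calculation but also submitting part or all of the data to a remote cluster server for calculation, which greatly increases the available computing capacity and meets a variety of user calculation needs.
Further, the technical solution of the present application can also receive query information submitted by a user through the API, run queries according to different query information, such as molecular substructure, molecular similarity, or molecular attribute parameters, read data from the database, and display the data read, making it easier for users to query and use molecular data.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.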
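Brief Description of the Drawings
The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 schematically shows an exemplary system architecture to which the molecular data storage method and device and the application method and device according to an embodiment of the present application can be applied;
FIG. 2 schematically shows a flowchart of the molecular data storage method according to an embodiment of the present application;
FIG. 3 schematically shows the molecular dimension model in an embodiment of the present application;
FIG. 4 schematically shows a flowchart of the molecular data application method according to an embodiment of the present application;
FIG. 5 schematically shows a structural block diagram of the molecular data storage device according to an embodiment of the present application;
FIG. 6 schematically shows a structural block diagram of the molecular data application device according to an embodiment of the present application;
FIG. 7 schematically shows another structural block diagram of the molecular data application device according to an embodiment of the present application;
FIG. 8 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.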
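Detailed Description
Embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show embodiments of the present application, it should be understood that the present application can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present application will be thorough and complete and will fully convey its scope to those skilled in the art.
The terms used in the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. The terms "include", "comprise", and the like, as used herein, indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art. It should be noted that the terms used here should be interpreted as having meanings consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.
It should be understood that although the terms "first", "second", "third", and so on may be used in the present application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly second information may also be called first information. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality of" means two or more, unless specifically defined otherwise.
Before the technical solution of the present application is described, some technical terms of the art that it involves are explained.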
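A data warehouse stores structured data and is a core component used to support report generation, the construction of data marts, and business intelligence.
JavaScript Object Notation (JSON) is a lightweight data interchange format. JSON is based on a subset of ECMAScript (the JavaScript specification formulated by the European Computer Manufacturers Association) and uses a text format that is completely independent of programming languages to store and represent data. JSON is easy for humans to read and write and easy for machines to parse and generate, and it effectively improves network transmission efficiency.
The simplified molecular input line entry system (SMILES) is a specification for unambiguously describing molecular structures with ASCII strings. A SMILES string can be imported by many molecular editing programs and converted into a two-dimensional drawing or a three-dimensional model of the molecule.
In early-stage screening for drug discovery, large batches of molecular data come from AI generation and from virtual screening of large compound libraries. This data is not only voluminous but also of many types. Without a molecular data storage system, building a screening workflow is difficult, the data produced in the screening steps cannot be accessed efficiently, and comprehensive data analysis and retrospective review are hard to unify, so data analysis is extremely inefficient. None of the relational databases, data warehouses, or file storage (object storage) systems of the related art can support complete storage of molecular data. Relational databases are mostly used for OLTP (online transaction processing) and support transaction processing; data warehouses are used for OLAP (online analytical processing) and support data analysis, but they cannot store semi-structured or unstructured data and cannot be used to run molecular screening algorithms.
For this reason, an embodiment of the present application provides a molecular data storage method and device. The molecular data to be processed is first verified. For data that passes the verification, not only is the molecular data itself stored, but its incremental auxiliary data is also determined. Because the incremental auxiliary data may include one or more different types of data, a different storage mode is used for each data type: the verified data and its related structured data are saved to a database, its related semi-structured and unstructured molecular data are saved to files, and a directory index of the files is added to the database to associate the files with the database. All types of data about a molecule can thus be stored effectively, which is especially suitable for complex, long-cycle usage scenarios such as drug discovery.
The molecular data storage method and device and the application method and device of the embodiments of the present application are described in detail below with reference to FIG. 1 to FIG. 8.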
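FIG. 1 schematically shows an exemplary system architecture to which the molecular data storage method and device and the application method and device according to an embodiment of the present application can be applied. It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present application can be applied, intended to help those skilled in the art understand the technical content of the present application; it does not mean that the embodiments cannot be used in other devices, systems, environments, or scenarios.
Referring to FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with other terminal devices and the server 105 through the network 104 in order to receive or send information, for example to send calculation methods and calculation data. Various applications may be installed on the terminal devices 101, 102, 103, such as drug development applications, materials design applications, web browsers, database applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 include, but are not limited to, electronic devices that support functions such as modeling, analysis and calculation, design, and Internet access, such as smart desktop computers, tablet computers, and laptop computers.
The server 105 may receive calculation methods, calculation data, and the like, and may send calculation results to the terminal devices 101, 102, 103. For example, the server 105 may be a back-end management server, a server cluster, or the like.
It should be noted that the numbers of terminal devices, networks, and servers are merely illustrative. There may be any number of terminal devices, networks, and cloud servers, as the implementation requires.
An embodiment of the present application provides a molecular data application method and device, which use the above database and files to make it easier for users to use and query molecular data, and which can save calculation results to the database and files.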
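FIG. 2 shows a flowchart of the molecular data storage method according to an embodiment of the present application.
The molecular data storage method of this embodiment includes operations 210 to 240.
In operation 210, molecular data to be processed is received.
It should be noted that the molecular data to be processed may be any molecular data involved in drug discovery, for example the SMILES strings, molecular masses, and molecular activity data produced by molecular design, synthesis, experimental testing, and evaluation during drug discovery.
In operation 220, the molecular data is verified to obtain data that passes the verification.
In the embodiments of the present application, verification may be performed according to the dimension of the molecular data. Specifically, the dimension of the molecular data may be determined according to a molecular dimension model, and the molecular data is verified according to the determined dimension to obtain data that passes the verification.
The molecular dimension model may be determined according to the type of the molecular data and may include a molecular basic property dimension table and one or more molecular attribute dimension tables. The molecular attribute dimension tables include calculated attribute dimension tables and experimental attribute dimension tables, and the molecular basic property dimension table contains the identification field of each molecular attribute dimension table.
FIG. 3 schematically shows the molecular dimension model in an embodiment of the present application. In this example, Molecule is the molecular basic property dimension table, in which the basic attribute of a molecule is smiles, its two-dimensional structural formula; inchi-key is a label obtained by a hash operation based on smiles; and the other id fields are association identifiers, that is, the identification fields of the molecular attribute dimension tables. The corresponding molecular attribute dimension tables include, for example, FEPResults, Structure, Activity, ADMETResults, and SynthesisResults. In the FEPResults dimension table, pdb refers to the target protein used in the drug research, dg is the binding free energy between the molecule and the protein, and fep_method is the calculation method used. In the ADMETResults dimension table, clogp is the hydrophobicity constant of the molecule, caco-2 is the absorption capacity in the Caco-2 cell line, and water_solubility is the water solubility of the molecule. In the Structure dimension table, name is the name of the molecule's three-dimensional structure file and file_path is the file storage path. In the SynthesisResults dimension table, weight and purity are the mass and purity of the synthesized molecule, and report records the synthesis report information. The Activity dimension table records the values of different activity properties measured by different methods, where name is a user-defined name identifying the record, property is the activity property tested, and method records the method of the activity experiment.
It should be noted that FIG. 3 is only a simple example of the molecular dimension model in the embodiments of the present application and is not intended to limit the specific structure of the molecular dimension model.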
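As a rough illustration of the dimension model described for FIG. 3, the following Python sketch builds a base table holding identification fields that point at two of the attribute dimension tables. It is a minimal sketch under stated assumptions: SQLite, the snake_case table names, the column types, and the sample row values are illustrative choices, not details taken from the application.

```python
import sqlite3

# Hypothetical schema loosely following the FIG. 3 example.
# Table names, column types, and sample values are assumptions.
SCHEMA = """
CREATE TABLE fep_results (
    id INTEGER PRIMARY KEY,
    pdb TEXT,           -- target protein
    dg REAL,            -- binding free energy
    fep_method TEXT     -- calculation method used
);
CREATE TABLE structure (
    id INTEGER PRIMARY KEY,
    name TEXT,          -- name of the 3D structure file
    file_path TEXT      -- where the file is stored
);
CREATE TABLE molecule (
    id INTEGER PRIMARY KEY,
    smiles TEXT NOT NULL,
    inchi_key TEXT,
    fep_results_id INTEGER REFERENCES fep_results(id),
    structure_id INTEGER REFERENCES structure(id)
);
"""


def build_demo_db():
    """Create an in-memory database with one molecule and its attribute rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO fep_results VALUES (1, '1ABC', -8.2, 'fep_demo')")
    conn.execute("INSERT INTO structure VALUES (1, 'mol1.sdf', '/data/mol1.sdf')")
    conn.execute("INSERT INTO molecule VALUES (1, 'CCO', 'demo-key', 1, 1)")
    return conn


def fetch_with_fep(conn, molecule_id):
    """Follow the identification field from the base table to the FEP table."""
    return conn.execute(
        "SELECT m.smiles, f.pdb, f.dg FROM molecule m "
        "JOIN fep_results f ON m.fep_results_id = f.id WHERE m.id = ?",
        (molecule_id,),
    ).fetchone()
```

Calling `fetch_with_fep(build_demo_db(), 1)` joins the base row to its FEP row through the identification field, which is the role the id fields play in the Molecule table.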
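When the molecular data is verified, the fields it contains and their attribute information may be determined according to the dimension of the molecular data, and each field is verified according to its attribute information to obtain data that passes the verification.
Specifically, different data detail tables may be indexed according to the dimension, and the molecular data is verified against the field attributes defined in the data detail table, such as whether a field is required and the field type (for example integer, floating-point number, or string).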
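The field-level check described above can be sketched as follows. This is only an illustration: the detail table `FEP_RESULTS_FIELDS`, the function name `validate_record`, and the error message wording are hypothetical, and the rule that unknown fields are rejected is an added assumption.

```python
# Hypothetical data detail table for one dimension:
# field name -> (required?, expected Python type).
FEP_RESULTS_FIELDS = {
    "pdb": (True, str),
    "dg": (True, float),
    "fep_method": (False, str),
}


def validate_record(record, field_spec):
    """Return a list of error messages; an empty list means the record passed."""
    errors = []
    for name, (required, expected_type) in field_spec.items():
        if name not in record or record[name] is None:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(record[name], expected_type):
            errors.append(f"field {name} should be {expected_type.__name__}")
    # Assumption: fields that the detail table does not define are rejected.
    for name in record:
        if name not in field_spec:
            errors.append(f"unknown field: {name}")
    return errors
```

A record that returns an empty list would be treated as data that passes the verification; a non-empty list can feed the error reporting mentioned later in the description.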
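In operation 230, incremental auxiliary data of the verified data is determined.
Incremental auxiliary data refers to data related to the molecular data other than the basic data of the molecule itself, such as molecular state data (for example, whether the molecule has FEP data, whether it has experimental data, and its molecular fingerprint) and the specific information data corresponding to each molecular state; it can be configured as needed.
Incremental auxiliary data can generally be used to screen and filter molecules in batches, for example to select molecules that have already undergone a certain type of calculation or experiment.
The incremental auxiliary data of the molecular data may be of any one or more of the following types: structured molecular data, semi-structured molecular data, and unstructured molecular data.
By determining the incremental auxiliary data of the molecular data, all data related to the molecular data can be obtained more richly and comprehensively.
In operation 240, the verified data and the structured incremental auxiliary data are saved to a database; the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data are saved to files, and a directory index of the files is added to the database to associate the files with the database.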
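Operation 240 can be pictured with the sketch below, which writes the structured part to a database, writes file-based data to disk, and records each file path in an index table so the files can be found from the database. Everything concrete here is an assumption made for illustration: SQLite as the database, the three table names, the one-directory-per-molecule layout, and the helper names.

```python
import json
import os
import sqlite3
import tempfile


def init_db():
    """Create hypothetical tables: molecules, structured data, and a file index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE molecule (id INTEGER PRIMARY KEY, smiles TEXT NOT NULL);
        CREATE TABLE structured_data (molecule_id INTEGER, payload TEXT);
        CREATE TABLE file_index (molecule_id INTEGER, file_path TEXT);
    """)
    return conn


def store_molecule(conn, file_root, molecule, structured, file_payloads):
    """Save verified data and structured data to the DB, other data to indexed files."""
    cur = conn.execute("INSERT INTO molecule (smiles) VALUES (?)", (molecule["smiles"],))
    mol_id = cur.lastrowid
    conn.execute(
        "INSERT INTO structured_data (molecule_id, payload) VALUES (?, ?)",
        (mol_id, json.dumps(structured)),
    )
    mol_dir = os.path.join(file_root, str(mol_id))
    os.makedirs(mol_dir, exist_ok=True)
    for filename, content in file_payloads.items():
        path = os.path.join(mol_dir, filename)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(content)
        # The index row is what associates the file with the database record.
        conn.execute(
            "INSERT INTO file_index (molecule_id, file_path) VALUES (?, ?)",
            (mol_id, path),
        )
    conn.commit()
    return mol_id


def demo():
    """Store one molecule, then read its report back through the file index."""
    conn = init_db()
    with tempfile.TemporaryDirectory() as root:
        mol_id = store_molecule(
            conn, root, {"smiles": "CCO"}, {"dg": -8.2},
            {"report.txt": "synthesis report"},
        )
        (path,) = conn.execute(
            "SELECT file_path FROM file_index WHERE molecule_id = ?", (mol_id,)
        ).fetchone()
        with open(path, encoding="utf-8") as fh:
            return mol_id, fh.read()
```

Keeping only the path in the database leaves large reports and structure files on the file system while still letting a database query locate them.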
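Further, if the molecular data fails verification, an error may be reported, for example by displaying an error message, so that the operator learns promptly whether the molecular data is correct and can handle it, for example by correcting or deleting it.
In the molecular data storage method and device provided by the embodiments of the present application, the molecular data to be processed is verified, and incremental auxiliary data is determined for the data that passes the verification. Because the incremental auxiliary data may include one or more different types of data, a different storage mode is used for each type: the verified data and its related structured molecular data are saved to a database, its related semi-structured and unstructured molecular data are saved to files, and a directory index of the files is added to the database to associate the files with the database. All data related to a molecule is thereby managed conveniently and effectively, and subsequent querying and use of the data by users is made easier.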
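Correspondingly, based on the above storage of molecular data and its incremental auxiliary data in a database and files, an embodiment of the present application further provides a molecular data application method, which gives users an effective way to use molecular data.
FIG. 4 schematically shows a flowchart of the molecular data application method according to an embodiment of the present application.
The molecular data application method of this embodiment includes the following operations:
In operation 410, a calculation method submitted by a user through an API is received.
The calculation method may be, for example but without limitation, any of the following: a quantum chemistry algorithm, a computational chemistry algorithm, an AI model algorithm, and the like.
In a specific application, an API in the RESTful style, the GraphQL style, or another style may be used; the embodiments of the present application place no limitation on this.
RESTful (Representational State Transfer) is a design style and development approach for network applications. It is based on the Hypertext Transfer Protocol (HTTP) and can be defined using XML or JSON. RESTful interfaces suit scenarios in which mobile Internet vendors expose service interfaces, enabling third-party OTT (over-the-top) services to invoke mobile network resources, with the action types being creation, modification, and deletion of the invoked resources.
GraphQL is so named because it is a query language (QL) that is particularly well suited to querying graph data. Its relationship to SQL is that the two share the QL suffix. GraphQL can be backed by NoSQL databases, SQL databases, or various other storage methods.
In operation 420, calculation data related to the calculation method is obtained; the calculation data includes calculation data obtained from the database and/or the files.
It should be noted that the database and files here are the above-described database and files storing the molecular data and its structured incremental auxiliary data, and the database contains the directory index of the files.
Further, the calculation data related to the calculation method may also include calculation data input by the user.
That is, the calculation data needed for the calculation method submitted by the user may come partly from data that is already stored, such as three-dimensional molecular structure files stored in the file storage system or molecular physicochemical property data stored in the database, and partly from calculation data input by the user, such as parameter configuration information indicating which batch of molecular data to use for the calculation. Of course, all of it may instead come from the above files and database; the embodiments of the present application place no limitation on this.
In operation 430, the calculation method is applied to the calculation data to obtain a calculation result.
It should be noted that, in a specific application, part or all of the calculation may be completed locally, or part or all of the calculation may be submitted to a remote cluster server, with the calculation result returned by the remote cluster server then received. For example, a lightweight calculation can be completed locally and its result returned in real time; a complex, long-running calculation task can be run asynchronously, with its progress and results monitored continuously.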
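The split between local real-time calculation and asynchronous calculation can be sketched as below. This is a minimal sketch, not the application's implementation: a thread pool stands in for the remote cluster server, and the size threshold `LIGHTWEIGHT_LIMIT` is a made-up rule, since the text does not say how a calculation is classified as lightweight.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical threshold; the source does not define what counts as lightweight.
LIGHTWEIGHT_LIMIT = 100


def run_calculation(method, data):
    """Apply the submitted calculation method to every item of calculation data."""
    return [method(x) for x in data]


class Dispatcher:
    def __init__(self):
        # Assumption: a local thread pool stands in for the remote cluster server.
        self._pool = ThreadPoolExecutor(max_workers=2)

    def submit(self, method, data):
        """Return ("local", result) at once, or ("remote", future) for heavy jobs."""
        if len(data) <= LIGHTWEIGHT_LIMIT:
            return "local", run_calculation(method, data)
        return "remote", self._pool.submit(run_calculation, method, data)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

For the asynchronous branch, the caller can poll `future.done()` to monitor progress and call `future.result()` once the job finishes, which mirrors the continuous monitoring mentioned above.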
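In operation 440, the calculation result is saved to the database and/or the files.
Further, error information from failed calculations may also be stored, for example in a log file.
In the molecular data application method and device provided by the embodiments of the present application, a calculation method submitted by a user, such as a quantum chemistry algorithm, a computational chemistry algorithm, or an AI model algorithm, is received. Based on the effective storage of the molecular data and its different types of incremental auxiliary data described above, and on the associations among them, calculation data related to the calculation method is obtained, the corresponding calculation is performed to obtain a calculation result, and the calculation result is saved to the database or the files, so that the calculation result is also stored effectively.
In another embodiment of the molecular data application method of the present application, the above database and files can also be used to make it easier for users to query molecular data. Specifically, query information submitted by a user through the API is received, where the query information may include, but is not limited to, any one or more of the following: molecular substructure, molecular similarity, molecular attribute parameters, and the like; data is read from the database according to the query information; and the data read is displayed.
When a query is run, a query strategy may be generated according to the query information, and data is read from the database according to the query strategy; the data read is then format-converted.
For example, suppose the database already holds one million molecular records, each with a smiles attribute that describes the two-dimensional structure of the molecule and can be used for substructure matching searches. To use the query function, a SMILES string describing the substructure can be taken as the query information and submitted to the system by calling the API (application programming interface). The system runs a substructure matching search algorithm, traverses and filters the one million records, and returns the molecular data that finally matches. If the submitted query information contains both a substructure and the condition "molecular mass greater than 100", the system recognizes this automatically and generates an optimized query strategy: it first finds the molecules whose molecular mass is greater than 100 and then runs the substructure search algorithm on that batch to filter it. The structure of a molecule can be described by SMILES, or a three-dimensional structure with coordinates can be generated for it in the sdf or mol file format. The system automatically converts the SMILES into the specific format according to the data format requirement specified in the query information.
It should be noted that querying is a general capability and is processed according to the specific query fields. For example, to query by molecular substructure, the substructure fragment can be filled into the query information and submitted through the API. The system then automatically searches all molecules in the database using a query strategy based on a molecular substructure search algorithm. Depending on the display configuration, the system supports format conversion of SMILES for different renderings and also supports conversion to other molecular data formats such as sdf.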
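The query optimization described above, applying the cheap mass filter before the expensive substructure search, can be sketched as follows. Note the loud assumption: `naive_substructure_match` is only a substring test on the SMILES text and is not real substructure matching; a real system would use a cheminformatics toolkit for that step. The `Molecule` class and the function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    smiles: str
    mass: float


def naive_substructure_match(smiles, fragment):
    # Placeholder only: a plain substring test, NOT real substructure matching.
    return fragment in smiles


def plan_query(min_mass=None, fragment=None):
    """Order predicates so the cheap attribute filter runs before the structure search."""
    steps = []
    if min_mass is not None:
        steps.append(lambda m: m.mass > min_mass)
    if fragment is not None:
        steps.append(lambda m: naive_substructure_match(m.smiles, fragment))
    return steps


def run_query(molecules, steps):
    """Apply each step in order, shrinking the candidate set before the next step."""
    result = list(molecules)
    for step in steps:
        result = [m for m in result if step(m)]
    return result
```

Because each step only sees the survivors of the previous one, the costly structure search runs on the molecules that already satisfy the mass condition rather than on the full set.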
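Correspondingly, an embodiment of the present application further provides a molecular data storage device. As shown in FIG. 5, in a non-limiting embodiment, the molecular data storage device 500 includes a data receiving module 510, an analysis module 520, a data summary module 530, and a data management module 540, where:
The data receiving module 510 is configured to receive molecular data to be processed.
The analysis module 520 is configured to verify the molecular data to obtain data that passes the verification.
The data summary module 530 is configured to determine incremental auxiliary data of the verified data, the incremental auxiliary data including any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data.
The data management module 540 is configured to save the verified data and the structured incremental auxiliary data to a database, to save the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data to files, and to add a directory index of the files to the database so as to associate the files with the database.
In a non-limiting embodiment, the analysis module 520 may include a data parsing unit and a data verification unit.
The data parsing unit is configured to determine the dimension of the molecular data according to the molecular dimension model.
The data verification unit is configured to verify the molecular data according to the determined dimension of the molecular data to obtain data that passes the verification.
The data verification unit may include a data detail determination unit and a field verification unit.
The data detail determination unit is configured to determine, according to the dimension of the molecular data, the fields contained in the molecular data and their attribute information.
The field verification unit is configured to verify each field according to its attribute information to obtain data that passes the verification.
In a specific application, the molecular dimension model may include a molecular basic property dimension table and one or more molecular attribute dimension tables.
The molecular attribute dimension tables may include calculated attribute dimension tables and experimental attribute dimension tables, and the molecular basic property dimension table may contain the identification field of each molecular attribute dimension table.
With the molecular data management device provided by the embodiments of the present application, molecular data and its various types of incremental auxiliary data can be managed conveniently and effectively, and subsequent querying and use of the data by users is made easier; the device is especially suitable for complex, long-cycle usage scenarios such as drug discovery.
Correspondingly, an embodiment of the present application further provides a molecular data application device, which can use the above database and files storing molecular data and its incremental auxiliary data to provide users with calculation and query functions.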
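As shown in FIG. 6, in a non-limiting embodiment, the molecular data application device 600 includes the following modules: an application interface module 610, a calculation data acquisition module 620, and a calculation processing module 630.
The application interface module 610 is configured to receive a calculation method submitted by a user; the calculation method includes, but is not limited to, any of the following: a quantum chemistry algorithm, a computational chemistry algorithm, an AI model algorithm, and the like.
The calculation data acquisition module 620 is configured to obtain calculation data related to the calculation method.
The calculation processing module 630 is configured to apply the calculation method to the calculation data to obtain a calculation result and to save the calculation result to the database and/or the files.
It should be noted that the calculation data may specifically include calculation data obtained from the database and/or the files. The database and files are those established by the above molecular data storage device: the database stores the molecular data and its structured incremental auxiliary data, the files store the semi-structured incremental auxiliary data and/or unstructured incremental auxiliary data of the molecular data, and the database contains the directory index of the files.
Further, the calculation data may also include calculation data input by the user, such as parameter configuration information indicating which batch of molecular data to use for the calculation.
In a specific application, the calculation processing module 630 may include a local calculation unit and/or a remote calculation unit.
The local calculation unit is configured to apply the calculation method locally to part or all of the data to obtain a calculation result.
The remote calculation unit is configured to submit part or all of the calculation method and the calculation data to a remote cluster server for calculation and to receive the calculation result returned by the remote cluster server.
For example, a lightweight calculation can be completed locally and its result returned in real time. A complex, long-running calculation task can be run asynchronously, with its progress and results monitored continuously.
As shown in FIG. 7, in another non-limiting embodiment, unlike the embodiment shown in FIG. 6, the molecular data application device 600 may further include a query and export module 640 and a display module 650.
In this embodiment, the application interface module 610 is further configured to receive query information submitted by the user.
Correspondingly, the query and export module 640 is configured to read data from the database and/or the files according to the query information. The display module 650 is configured to display the data read by the query and export module 640.
It should be noted that the query information may include, but is not limited to, any one or more of the following: molecular substructure, molecular similarity, molecular attribute parameters, and the like.
The query and export module 640 may specifically include a filtering and sorting unit and a format conversion unit.
The filtering and sorting unit is configured to generate a query strategy according to the query information and to read data from the database and/or the files according to the query strategy.
The format conversion unit is configured to perform format conversion on the data read by the filtering and sorting unit.
Regarding the molecular data storage device 500 and the molecular data application device 600 in the above embodiments, the specific manner in which each module and unit performs its operations has been described in detail in the embodiments of the corresponding methods and is not elaborated here.
It should be noted that, in a specific application, the molecular data storage device 500 and the molecular data application device 600 may also be integrated into one system, which is divided into a storage layer, a business layer, and a display layer. The modules and units of the molecular data storage device 500 and the molecular data application device 600 may be deployed in the business layer and the display layer, and the database and files are placed in the storage layer. Different permissions can be set for the creation and the use of data to ensure data security.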
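Another aspect of the present application further provides an electronic device, which can implement the molecular data storage method provided by the embodiments of the present application or the molecular data application method provided by the embodiments of the present application.
FIG. 8 schematically shows a block diagram of an electronic device implementing an embodiment of the present application.
Referring to FIG. 8, the electronic device 800 includes a memory 810 and a processor 820.
The processor 820 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 810 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions needed by the processor 820 or other modules of the computer. The persistent storage device may be a readable and writable storage device. The persistent storage device may be a non-volatile storage device that does not lose the stored instructions and data even when the computer is powered off. In some embodiments, the persistent storage device is a mass storage device (such as a magnetic or optical disk, or flash memory). In other embodiments, the persistent storage device may be a removable storage device (such as a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory may store some or all of the instructions and data that the processor needs at runtime. In addition, the memory 810 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (such as DRAM, SRAM, SDRAM, flash memory, and programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 810 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (for example DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, a super density disc, a flash memory card (such as an SD card, mini SD card, or Micro-SD card), a magnetic floppy disk, and the like. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or by wire.
Executable code is stored in the memory 810, and when the executable code is executed by the processor 820, the processor 820 can be caused to perform part or all of the methods described in the above embodiments.
In addition, the method according to the present application may also be implemented as a computer program or computer program product that includes computer program code instructions for performing part or all of the operations of the above method of the present application.
Alternatively, the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) on which executable code (or a computer program or computer instruction code) is stored; when the executable code (or computer program or computer instruction code) is executed by a processor of an electronic device (or server, etc.), the processor is caused to perform part or all of the operations of the above method according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their improvement over technology in the marketplace, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.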

Claims (25)

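  1. A molecular data storage method, characterized in that the method comprises:
    receiving molecular data to be processed;
    verifying the molecular data to obtain data that passes the verification;
    determining incremental auxiliary data of the verified data, the incremental auxiliary data comprising any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data;
    saving the verified data and the structured incremental auxiliary data to a database; saving the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data to files, and adding a directory index of the files to the database so as to associate the files with the database.
  2. The method according to claim 1, characterized in that verifying the molecular data to obtain data that passes the verification comprises:
    determining a dimension of the molecular data according to a molecular dimension model;
    verifying the molecular data according to the dimension to obtain data that passes the verification.
  3. The method according to claim 2, characterized in that verifying the molecular data according to the dimension to obtain data that passes the verification comprises:
    determining, according to the dimension, the fields contained in the molecular data and their attribute information;
    verifying each field according to its attribute information to obtain data that passes the verification.
  4. The method according to claim 2, characterized in that the molecular dimension model comprises a molecular basic property dimension table and one or more molecular attribute dimension tables, the molecular attribute dimension tables comprise calculated attribute dimension tables and experimental attribute dimension tables, and the molecular basic property dimension table contains the identification field of each molecular attribute dimension table.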
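  5. A molecular data application method, characterized in that the method comprises:
    receiving a calculation method submitted by a user through an API;
    obtaining calculation data related to the calculation method, the calculation data comprising calculation data obtained from a database and/or files, wherein the database stores molecular data and its structured incremental auxiliary data, the files store semi-structured incremental auxiliary data and/or unstructured incremental auxiliary data of the molecular data, and the database contains a directory index of the files;
    applying the calculation method to the calculation data to obtain a calculation result;
    saving the calculation result to the database and/or the files.
  6. The method according to claim 5, characterized in that the calculation method comprises any one or more of the following: a quantum chemistry algorithm, a computational chemistry algorithm, and an AI model algorithm.
  7. The method according to claim 5, characterized in that the calculation data further comprises calculation data input by the user.
  8. The method according to claim 5, characterized in that applying the calculation method to the calculation data to obtain a calculation result comprises:
    applying the calculation method locally to part or all of the data to obtain a calculation result; and/or
    submitting part or all of the calculation method and the data to a remote cluster server for calculation, and receiving the calculation result returned by the remote cluster server.
  9. The method according to any one of claims 5 to 8, characterized in that the method further comprises:
    receiving query information submitted by a user through the API;
    reading data from the database according to the query information;
    displaying the data read.
  10. The method according to claim 9, characterized in that the query information comprises any one or more of the following: molecular substructure, molecular similarity, and molecular attribute parameters.
  11. The method according to claim 9, characterized in that reading data from the database according to the query information comprises:
    generating a query strategy according to the query information, and reading data from the database according to the query strategy;
    performing format conversion on the data read.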
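  12. A molecular data storage device for a drug discovery pipeline, characterized in that the device comprises: a data receiving module, an analysis module, a summary module, a data management module, and a storage module;
    the data receiving module is configured to receive molecular data to be processed;
    the analysis module is configured to verify the molecular data to obtain data that passes the verification;
    the data summary module is configured to determine incremental auxiliary data of the verified data, the incremental auxiliary data comprising any one or more of the following: structured incremental auxiliary data, semi-structured incremental auxiliary data, and unstructured incremental auxiliary data;
    the data management module is configured to save the verified data and the structured incremental auxiliary data to a database, to save the semi-structured incremental auxiliary data and/or the unstructured incremental auxiliary data to files, and to add a directory index of the files to the database so as to associate the files with the database.
  13. The device according to claim 12, characterized in that the analysis module comprises:
    a data parsing unit configured to determine a dimension of the molecular data according to a molecular dimension model;
    a data verification unit configured to verify the molecular data according to the dimension to obtain data that passes the verification.
  14. The device according to claim 12, characterized in that the data verification unit comprises:
    a data detail determination unit configured to determine, according to the dimension, the fields contained in the molecular data and their attribute information;
    a field verification unit configured to verify each field according to its attribute information to obtain data that passes the verification.
  15. The device according to claim 12, characterized in that the molecular dimension model comprises a molecular basic property dimension table and one or more molecular attribute dimension tables, the molecular attribute dimension tables comprise calculated attribute dimension tables and experimental attribute dimension tables, and the molecular basic property dimension table contains the identification field of each molecular attribute dimension table.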
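  16. A molecular data application device, characterized in that the device comprises:
    an application interface module configured to receive a calculation method submitted by a user;
    a calculation data acquisition module configured to obtain, from a database and/or files, calculation data related to the calculation method, wherein the database stores molecular data and its structured incremental auxiliary data, the files store semi-structured incremental auxiliary data and/or unstructured incremental auxiliary data of the molecular data, and the database contains a directory index of the files;
    a calculation processing module configured to apply the calculation method to the calculation data to obtain a calculation result, and to save the calculation result to the database and/or the files.
  17. The device according to claim 16, characterized in that the calculation method comprises any one or more of the following: a quantum chemistry algorithm, a computational chemistry algorithm, and an AI model algorithm.
  18. The device according to claim 16, characterized in that
    the calculation data acquisition module is specifically configured to receive calculation data input by the user, and/or to extract calculation data from the files and/or the database according to the calculation method.
  19. The device according to claim 16, characterized in that the calculation processing module comprises a local calculation unit and/or a remote calculation unit;
    the local calculation unit is configured to apply the calculation method locally to part or all of the data to obtain a calculation result;
    the remote calculation unit is configured to submit part or all of the calculation method and the data to a remote cluster server for calculation, and to receive the calculation result returned by the remote cluster server.
  20. The device according to any one of claims 16 to 19, characterized in that the device further comprises a query and export module and a display module;
    the application interface module is further configured to receive query information submitted by the user;
    the query and export module is configured to read data from the database according to the query information;
    the display module is configured to display the data read by the query and export module.
  21. The device according to claim 20, characterized in that the query information comprises any one or more of the following: molecular substructure, molecular similarity, and molecular attribute parameters.
  22. The device according to claim 20, characterized in that the query and export module comprises:
    a filtering and sorting unit configured to generate a query strategy according to the query information and to read data from the database according to the query strategy;
    a format conversion unit configured to perform format conversion on the data read by the filtering and sorting unit.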
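  23. An electronic device, characterized by comprising:
    a processor; and
    a memory on which executable code is stored, wherein the executable code, when executed by the processor, causes the processor to perform the method according to any one of claims 1-4 or the method according to any one of claims 5-11.
  24. A computer-readable storage medium, characterized in that executable code is stored thereon, wherein the executable code, when executed by a processor of an electronic device, causes the processor to perform the method according to any one of claims 1-4 or the method according to any one of claims 5-11.
  25. A computer program product, characterized by comprising executable code, wherein the executable code, when executed by a processor, implements the method according to any one of claims 1-4 or the method according to any one of claims 5-11.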
PCT/CN2021/143388 2021-12-30 2021-12-30 Molecular data storage method and device, and application method and device WO2023123287A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/143388 WO2023123287A1 (zh) 2021-12-30 2021-12-30 Molecular data storage method and device, and application method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/143388 WO2023123287A1 (zh) 2021-12-30 2021-12-30 Molecular data storage method and device, and application method and device

Publications (1)

Publication Number Publication Date
WO2023123287A1 (zh) 2023-07-06

Family

ID=86997092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/143388 WO2023123287A1 (zh) 2021-12-30 2021-12-30 Molecular data storage method and device, and application method and device

Country Status (1)

Country Link
WO (1) WO2023123287A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561374A (zh) * 2023-07-11 2023-08-08 腾讯科技(深圳)有限公司 Resource determination method, apparatus, device, and medium based on semi-structured storage

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090187581A1 (en) * 2008-01-22 2009-07-23 Vincent Delisle Consolidation and association of structured and unstructured data on a computer file system
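CN103440282A (zh) * 2013-08-13 2013-12-11 西安航天动力试验技术研究所 Test data storage system and method
CN108509547A (zh) * 2018-03-20 2018-09-07 中国长城科技集团股份有限公司 Information management method, information management system, and electronic device
CN111753141A (zh) * 2019-03-26 2020-10-09 华为技术有限公司 Data management method and related device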

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
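CN116561374A (zh) * 2023-07-11 2023-08-08 腾讯科技(深圳)有限公司 Resource determination method, apparatus, device, and medium based on semi-structured storage
CN116561374B (zh) * 2023-07-11 2024-02-23 腾讯科技(深圳)有限公司 Resource determination method, apparatus, device, and medium based on semi-structured storage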

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21969644

Country of ref document: EP

Kind code of ref document: A1