WO2024063585A1

WO2024063585A1 - Cloud platform system and service method capable of distributed and parallel processing for large-scale workflows

Info

Publication number: WO2024063585A1
Application number: PCT/KR2023/014455
Authority: WO
Inventors: 정종선; 홍종희; 정종철; 김영조; 홍운영; 안준식; 원대희; 김진우
Original assignee: (주)신테카바이오
Priority date: 2022-09-21
Filing date: 2023-09-21
Publication date: 2024-03-28

Abstract

A new drug development process includes a step of discovering new effective substance candidates by analyzing information regarding binding between proteins that cause disease and compounds. Because there are billions of compounds available on the market, reducing the time and cost required to develop new drugs has emerged as an important problem to be solved. In particular, much research is being conducted to quickly discover new effective substance candidates by making predictions using deep learning methods through physical calculations, and a platform developed in the prior art increased the possibility of discovering new substances by searching for large-scale compounds and predicting a binding force on the basis of physics theory, and greatly improved prediction accuracy. In terms of functionality, the platform includes DMC-PRE, DMC-SCR, and DMC-MD, and these parts have the characteristic of generating large-scale workflows and performing calculations in parallel. The purpose of the present invention is to analyze platform characteristics of the prior art, introduce cloud technology and cloud-based workflow manager technology to efficiently use large-scale computing resources, and propose a service methodology that takes into account convenience of users and efficient management of cloud resources. In order to provide an efficient service, by analyzing the characteristics of computing resource use for a new drug candidate substance discovery platform, which constitutes the prior art, the performance was optimized by using cloud technology. By providing a portal using web technology, user convenience for using cloud technology has been improved, and internally complex systems are stably connected through various message transmission methods. In addition, the stability of service using a cloud platform was verified through monitoring and a daemon that can respond to various error situations.

Description

Cloud platform system and service method capable of distributed and parallel processing for large-scale workflows

The present invention relates to a cloud platform and service method for predicting and discovering new effective substance candidates using a deep learning method based on physical calculations. More specifically, the present invention relates to a cloud platform and service method in which advanced virtual screening (DMC-PRE), deep matcher screening (DMC-SCR), and molecular dynamics-based self-verification (DMC-MD) are applied, and in which the process is organized into a large-scale workflow capable of distributed and parallel processing, so that the process can be automated and high-performance computing resources can be utilized efficiently.

Recently, in new drug development, an analytical methodology has emerged as a trend in which a large number of new effective substance candidates are discovered by analyzing the interactions between disease-causing proteins and compounds with deep learning methods based on physical and chemical calculations.

However, when the number of compounds to be analyzed is in the billions, analysis through artificial intelligence can consume a great deal of time and cost, so solving this problem has emerged as an important factor.

In addition, when making predictions for large-scale compounds, the time required for prediction per compound is the biggest problem. To solve this, it is important to 1) upgrade the prediction algorithm and 2) develop a methodology for efficiently utilizing the high-performance computing resources that execute the algorithm.

Meanwhile, in order to utilize high-performance computing resources efficiently, connectivity between analysis algorithms and automation of resource utilization are essential, and to make this automation easier, cloud computing technology that can efficiently utilize high-performance computing resources has begun to emerge.

However, even if cloud computing technology is applied, the following problems still need to be resolved: 1) hardware and software dependency problems, 2) communication problems between heterogeneous platforms, 3) workload distribution management problems, and 4) user convenience problems relating to the user interface and user experience.

Meanwhile, according to non-patent literature [001], Docker container technology, a virtualization technology used in cloud environments, can create a completely independent task execution environment that solves software dependency problems. Unlike a virtual machine, which is a full-virtualization technology, it performs better because it shares the resources of the physical machine.

Platforms for discovering new drug development candidates are composed of various software. This software has dependencies on 1) the software version, 2) the operating system, 3) the compiler, 4) libraries, 5) applications, and so on.

To solve the dependency problem, Docker container technology provides a function to install, configure, and manage the above-mentioned parts by version in the form of an image.

Technologies that manage these Docker images include Docker Hub (see non-patent document 013) and the Docker private registry. Docker Hub is provided as cloud image storage and is intended for sharing images with the public, whereas a Docker private registry lets you build private image storage for use in a company's internal projects and personal projects.

Among the representative solutions for building a private image repository, Harbor (see non-patent document 010) can not only manage Docker images but also grant user permissions for each project.

Kubernetes (see non-patent document 002), Google's open source-based platform that automates these Docker containers and Docker images into cluster-based applications, allocates resources for tasks executed in Docker containers and automates their deployment, execution, recovery, and so on.

In particular, the self-healing function for applications ensures efficient management of applications and continuity of service.

The platform's work of discovering new drug development candidates is executed in the form of a workflow with its own pipeline. Argo (see non-patent document 006), a technology that manages such workflows in a cloud environment, allows a workflow to be executed on Kubernetes based on pre-written automation templates for pods, the smallest computing unit.

In the cloud, templates are written in YAML or JSON file format, and workflow automation is performed based on the workflow configuration information defined in that syntax.
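As a minimal sketch of what such a template might look like, the following Python builds a single-step Argo Workflow manifest as a plain dictionary and serializes it to JSON. The step name, image name, command, and resource figures are hypothetical placeholders and are not taken from the invention; only the top-level field names follow the Argo Workflows manifest layout.

```python
import json


def build_workflow(step_name: str, image: str, command: list[str],
                   cpu: str, memory: str) -> dict:
    """Return an Argo Workflow manifest containing one container template."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName lets the cluster append a unique suffix per run.
        "metadata": {"generateName": f"{step_name}-"},
        "spec": {
            "entrypoint": step_name,
            "templates": [{
                "name": step_name,
                "container": {
                    "image": image,
                    "command": command,
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                },
            }],
        },
    }


# Hypothetical values, for illustration only.
manifest = build_workflow(
    step_name="prescreen",
    image="registry.example.com/dmc/prescreen:1.0",
    command=["python", "run.py"],
    cpu="4",
    memory="8Gi",
)
manifest_json = json.dumps(manifest, indent=2)
```

Because the manifest is ordinary structured data, the same dictionary could be emitted as YAML instead of JSON without changing its content.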

For automation, various heterogeneous platforms exist, and it is very important to ensure smooth communication between them. In such a distributed cloud environment, Kafka (see non-patent document 009) is widely used. Kafka can resolve the complexity of asynchronous application programming interfaces (APIs) and of messaging between numerous pieces of middleware, and it has the advantage of supporting communication not only within a single cluster but also in a multi-cluster environment.

In the case of workflows parallelized over various input information, as in new drug development, the technology for efficiently distributing parallel analysis processes across high-performance computing resources is a core technology. If it is not applied well, it can place a serious load on the entire system and thereby affect the entire cloud service.

In order to distribute the load in terms of system resources, a cloud multi-cluster architecture (see non-patent documents 012 and 015) must be introduced. When introducing this architecture, an important task is to configure unit clusters according to the characteristics of the application to be executed and deploy the necessary resources for each cluster.

The present invention was created to solve the above problems. Its goal is to provide a cloud platform and service method that, for the large-scale workflow arising in the process of discovering new substances by searching large-scale compounds and predicting binding force based on physics theory, increases resource efficiency by distributing and parallel-processing the workflow over high-performance computing infrastructure resources in a cloud environment, and improves user convenience by implementing pipeline automation.

In other words, the present invention proposes a cloud architecture for a platform for discovering new drug development candidates. The cloud service is implemented through service flows and cloud system configurations that reflect the main requirements of new drug researchers, and distributed and parallel processing of large-scale workflows in the cloud is implemented through resource allocation and distribution using Argo templates.

At this time, workload management and self-healing for the high-performance computing infrastructure are implemented in a multi-cluster manner using Kubernetes, and images for the various versions are built and managed using Docker Hub to resolve software dependencies. Message communication between these heterogeneous platforms is connected through Kafka, a distributed event streaming solution, and user convenience is improved through a cloud portal.

According to the features of the present invention for achieving the above-mentioned object, the present invention, in a cloud platform service capable of distributed and parallel processing for a large-scale workflow for discovering new drug development candidates, is performed including: a step of managing member information and projects; a step of analyzing, in the cloud workflow manager, the discovery of new drug development candidates; a step of visualizing the analyzed results; a step of remotely or internally controlling functions between the portal service, cloud workflow manager, cloud manager, and report manager through a message broker; and a step of automatically responding to error situations in the cluster.

At this time, the cloud platform service is provided as a web service application in the form of a portal, and the management of the member information includes member registration and login services. The project management may be configured to include: management of the price (charge) for the analysis service and of the intermediate storage and documentation functions for the price (charge); management of the project registration, service application, service execution, and project progress monitoring functions; and management of the analysis result report service.

In addition, the analysis for discovering new drug development candidates may be executed by: a template that stores the information forming the basis of the programs and parameters required for each step of discovering new drug development candidates using deep learning; a template that allocates the optimal amount of computer resources to be used for each analysis step; a template that allocates each analysis step to high-performance computing resources in parallel; a template that enables visualization of the analyzed information; a template for backing up the large quantities of intermediate and final outputs produced by the analysis; and an administrator unit that manages the templates by connecting each template to a cluster for execution.

In addition, the new drug development candidate discovery analysis includes (I) a DMC-PRE step of first screening (pre-screening) candidate compounds based on the physical, chemical, and topology information of the compounds from the compound database; (Ⅱ) DMC-SCR step of secondary selection (deep screening) of the primary selected candidate compounds using a tool (Enva) learned through an artificial intelligence algorithm; And (Ⅲ) a DMC-MD step of verifying the secondary selected compounds through molecular dynamics molecular-motion simulation.

The DMC-PRE step may be performed including: (Ⅰ-1) a step of selecting compound structures and creating a database through structure and chemical property calculations based on large-scale compound data; (Ⅰ-2) a step of selecting compound structures by calculating the binding possibility and compatibility between the database (R-group) information established through the protein data bank (PDB) and the compound structures selected in step (Ⅰ-1); (Ⅰ-3) a step of analyzing the binding of proteins and compounds in virtual space through a molecular docking algorithm (molecular docking with deep learning), by binding the compound to a predefined binding site in the protein structure and analyzing the binding environment; and (Ⅰ-4) a step of first screening candidate compounds from the compound structures selected in step (Ⅰ-2) using the analysis results of step (Ⅰ-3).

In addition, the DMC-SCR step includes (II-1) calculating binding information by analyzing the binding of the compound initially selected in the DMC-PRE step to the target protein; (Ⅱ-2) generating a protein-compound binding structure file in which the positional information of the bound compound is changed using the structure bound in step (Ⅱ-1); (Ⅱ-3) storing each protein-compound binding information generated in step (Ⅱ-2) and calculating the suitability of the binding structure between each protein-compound; (Ⅱ-4) calculating the binding force of a preset number of protein-compound structures according to the compatibility calculated in step (Ⅱ-3); (Ⅱ-5) calculating the binding force for key residues in the protein required for protein-compound binding in step (Ⅱ-4); (Ⅱ-6) For the protein-compound binding structure of step (Ⅱ-4), calculating top protein-compound binding information according to the binding score for each compound through a prediction model; And (Ⅱ-7) the step of secondary screening of candidate compounds by calculating the score for the protein-compound binding stability calculated from step (Ⅱ-6) above.

In addition, in steps (II-1) and (II-2), the maximum work time for each step may be set to prevent delays in calculation time due to the structure of a specific compound.

Additionally, in step (II-3), each protein-compound binding information may be stored through a memory buffer.

The DMC-MD step may be performed including: (Ⅲ-1) a step of binding the secondary selected compounds to the protein structure and, using a molecular dynamics simulation program, optimizing the structure in which the protein and compound moving in virtual space are bound; (Ⅲ-2) a step of analyzing the binding information and structural characteristics of the protein-compound binding structure at preset time intervals to calculate a stabilized optimal binding form; and (Ⅲ-3) a step of comparing the optimal protein-compound binding structure obtained from the simulation results of step (Ⅲ-2) with the known optimal binding environment of the corresponding protein, selecting compounds with the optimal binding form for the target protein, and verifying the final candidate compounds.

Meanwhile, the visualization of the analyzed results for the discovery of new drug development candidates may be executed by: a database that stores the analysis information in a form suitable for visualization; a database that stores user requirements for documenting the analysis information; an application programming interface (API) in REST format for visualizing and documenting the information stored in the databases; a web user interface that visualizes and documents, on the web, the analyzed information on the new drug development candidates and the binding structure between proteins and compounds; and an application programming interface in REST format that can manage the servers running the web applications and their connection ports.

Internal control through the message broker may include: message broker and daemon services that deliver project execution information; a message broker service that exchanges information for understanding the state of the cluster; a message broker service for checking the execution status of the project; a message broker service that transfers the analyzed data to a database; message broker and daemon services for running the report system based on the analysis data stored in the database; a message broker service that delivers the report-creation-completed status to the portal; and a message broker service that delivers the resource usage of the servers making up the cluster to the portal.
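The division of labor among these broker services can be illustrated with a toy, in-process publish/subscribe sketch. It is not Kafka and does not use any Kafka client; the class name, topic names, and message fields are hypothetical and only show how each service listed above could map to its own topic.

```python
from collections import defaultdict
from typing import Callable


class InProcessBroker:
    """Toy stand-in for a message broker: each topic has subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = InProcessBroker()
portal_inbox: list[dict] = []

# Hypothetical topic names, one per service described in the text.
broker.subscribe("report.completed", portal_inbox.append)
broker.subscribe("cluster.resource-usage", portal_inbox.append)

delivered = broker.publish("report.completed", {"project_id": "P-001"})
undelivered = broker.publish("project.status", {"project_id": "P-001"})
```

In a real deployment the callbacks would be separate consumers on separate hosts; the sketch only fixes the idea that producers and consumers share nothing except topic names.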

In addition, coping with cluster error situations may include: securing session load distribution and server message processing by configuring alternative processing servers; ensuring service stability by configuring multiple master servers; and managing nodes automatically through Kubernetes' equipment health monitoring and node labeling functions.
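The idea of label-based automatic node management can be sketched as follows. This is a simplified model that does not call the Kubernetes API; the `Node` class, the `status` label key, and the label values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool  # result of a health check performed elsewhere
    labels: dict[str, str] = field(default_factory=dict)


def relabel(nodes: list[Node]) -> list[str]:
    """Label every node by health and return the names that may receive work."""
    schedulable = []
    for node in nodes:
        node.labels["status"] = "ready" if node.healthy else "quarantined"
        if node.healthy:
            schedulable.append(node.name)
    return schedulable


# Hypothetical node names.
nodes = [Node("gpu-01", healthy=True), Node("gpu-02", healthy=False)]
available = relabel(nodes)
```

A scheduler that selects nodes by label would then skip `gpu-02` until a later health check marks it ready again.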

Meanwhile, the present invention includes a cloud platform for executing, as a web service, the cloud platform service capable of distributed and parallel processing for large-scale workflows for discovering new drug development candidates, as described above.

The following effects can be expected from the cloud platform and service method capable of distributed and parallel processing for large-scale workflows of DMC-PRE, DMC-SCR, and DMC-MD according to the present invention as seen above.

In other words, the present invention can increase the possibility of discovering new substances by searching for large-scale compounds and predicting binding force based on physics theory.

In addition, the present invention provides a methodology for efficiently utilizing high-performance computing resources by applying distributed and parallel processing in a cloud environment to workflows that process billions of compounds, and thereby has the effect of enabling high-performance computing cloud technology to be applied in the bio field.

In addition, the present invention has the effect of improving user convenience of services in the field of new drug development by defining a cloud service methodology for a platform with greatly improved prediction accuracy.

Figure 1 is a cloud service flow chart for DMC-PRE, DMC-SCR, and DMC-MD according to the present invention.

Figure 2 is a cloud platform configuration diagram for DMC-PRE, DMC-SCR, and DMC-MD according to the present invention.

Figure 3 is a configuration diagram of a multi-cluster-based cloud system according to the present invention.

Figure 4 is an exemplary diagram of the AMQP middleware configuration environment according to the present invention.

Figure 5 is an example performance comparison of Kafka and RabbitMQ according to the present invention.

Figure 6 is an exemplary Kafka configuration environment according to the present invention.

Figure 7 is a diagram of the internal structure of Kafka according to the present invention.

Figure 8 is a distributed and parallel structure diagram of Kafka according to the present invention.

Figure 9 is a distributed message structure diagram according to the present invention.

Figure 10 is a message batch processing structure diagram according to the present invention.

Figure 11 is a portal service flow chart according to the present invention.

Figure 12 is a schematic diagram of project calculation nodes and storage allocation according to the present invention.

Figure 13 is an exemplary diagram of a pipeline automation method for DMC-PRE, DMC-SCR, and DMC-MD according to the present invention.

Figure 14 is an exemplary diagram of the PHscan pipeline of DMC-PRE according to the present invention.

Figure 15 is an exemplary diagram of the GAP0 pipeline of DMC-PRE according to the present invention.

Figure 16 is an exemplary diagram of the G1 pipeline of DMC-PRE according to the present invention.

Figure 17 is an exemplary diagram of a DMC-SCR pipeline according to the present invention.

Figure 18 is an exemplary diagram of a DMC-MD pipeline according to the present invention.

Figure 19 shows the PHscan cloud distribution and parallel processing structure of DMC-PRE according to the present invention.

Figure 20 shows the GAP0 cloud distribution and parallel processing structure of DMC-PRE according to the present invention.

Figure 21 shows the G1 cloud distribution and parallel processing structure of DMC-PRE according to the present invention.

Figure 22 is a cloud distribution and parallel processing structure of DMC-SCR according to the present invention.

Figure 23 is a cloud distribution and parallel processing structure of DMC-MD according to the present invention.

Figure 24 is a flow chart of report generation steps according to the present invention.

Figure 25 is a flowchart of project result backup according to the present invention.

In order to achieve this purpose, the present invention provides a cloud platform service capable of distributed and parallel processing for a large-scale workflow for discovering new drug development candidates, comprising the steps of: managing member information and projects; analyzing, in the cloud workflow manager, the discovery of new drug development candidates; visualizing the analyzed results; remotely or internally controlling functions between the portal service, cloud workflow manager, cloud manager, and report manager through a message broker; and automatically responding to cluster error situations. The cloud platform service is provided as a web service application in the form of a portal. The management of the member information includes member registration and login services. The project management consists of managing the price (charge) for the analysis service and the intermediate storage and documentation functions for the price (charge); managing the project registration, service application, service execution, and project progress monitoring functions; and managing the analysis result report service. The discovery analysis of new drug development candidates is executed by: a template that stores the information forming the basis of the programs and parameters required for each step of discovering new drug development candidates using deep learning; a template that allocates the optimal amount of computer resources to be used for each analysis step; a template that allocates each analysis step to high-performance computing resources in parallel; a template that enables visualization of the analyzed information; a template for backing up the large quantities of intermediate and final outputs produced by the analysis; and an administrator unit that manages the templates by connecting each template to a cluster for execution.

At this time, the new drug development candidate discovery analysis includes (I) a DMC-PRE step of first screening (pre-screening) candidate compounds based on the physical, chemical, and topology information of the compounds from the compound database; (Ⅱ) DMC-SCR step of secondary selection (deep screening) of the primary selected candidate compounds using a tool (Enva) learned through an artificial intelligence algorithm; And (Ⅲ) a DMC-MD step of verifying the secondary selected compounds through molecular dynamics molecular-motion simulation.

Hereinafter, we will look at the cloud platform and service method according to a specific embodiment of the present invention with reference to the attached drawings.

Prior to the description, the effects and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided merely so that the disclosure of the present invention is complete and so that those having ordinary knowledge in the technical field to which the present invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, if it is judged that a detailed description of a known function or configuration may unnecessarily obscure the gist of the present invention, the detailed description will be omitted. The terms described below are defined in consideration of their functions in the embodiments of the present invention and may vary depending on the intention or custom of the user or operator. Therefore, their definitions should be made based on the contents throughout this specification.

Combinations of the blocks in the attached block diagrams and of the steps in the flow charts may be performed by computer program instructions (an execution engine). Since these computer program instructions can be loaded onto a processor of a general-purpose computer, special-purpose computer, or other programmable data processing equipment, the instructions executed through the processor of the computer or other programmable data processing equipment create a means of performing the functions described in each block of the block diagram or each step of the flow chart.

These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in the computer-usable or computer-readable memory can also produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flow chart.

In addition, since computer program instructions can also be loaded onto a computer or other programmable data processing equipment, a series of operation steps can be performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that operate the computer or other programmable data processing equipment may also provide steps for executing the functions described in each block of the block diagram and each step of the flow chart.

Additionally, each block or each step may represent a module, segment, or portion of code containing one or more executable instructions for executing specified logical functions, and in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order.

In other words, it is possible for the two blocks or steps shown to be performed substantially simultaneously, and it is also possible for the blocks or steps to be performed in reverse order of the corresponding functions as needed.

As shown in Figure 1, the present invention is performed including (A) a project registration and application stage carried out by the user in the cloud environment, (B) a project approval and resource allocation stage carried out by the manager, (C) an automatic pipeline execution stage for DMC-PRE, DMC-SCR, and DMC-MD, (D) a report generation stage, and (E) a data backup template execution stage carried out at the user's request.
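The ordering of stages (A) through (E) can be sketched as a small state machine. The enum and function names are hypothetical; the only behavior modeled is the one stated above, namely that the stages run in order and that the backup stage runs only when the user requests it.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    REGISTERED = "A: project registered and applied for"
    APPROVED = "B: project approved, resources allocated"
    PIPELINE = "C: DMC-PRE / DMC-SCR / DMC-MD pipeline executed"
    REPORT = "D: report generated"
    BACKUP = "E: data backup executed"


ORDER = list(Stage)


def next_stage(current: Stage, backup_requested: bool = False) -> Optional[Stage]:
    """Return the stage that follows `current`, or None when the flow ends."""
    if current is Stage.REPORT:
        # Stage (E) is optional and runs only at the user's request.
        return Stage.BACKUP if backup_requested else None
    if current is Stage.BACKUP:
        return None
    return ORDER[ORDER.index(current) + 1]
```

For example, `next_stage(Stage.REGISTERED)` yields `Stage.APPROVED`, while `next_stage(Stage.REPORT)` ends the flow unless `backup_requested=True` is passed.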

At this time, when an account is first created in step (A), the user is granted guest permission and can use only the billing page. On this page, the user can check billing information and download documents for the use of cloud services. General users with a certified official user agreement can freely use the platform.

Regarding how users use the platform, a user can register a project by entering a simple title and description on the project creation page and then selecting the platform to use. The administrator confirms whether the project has been contracted and then approves the use of the project, and the user enters the pipeline parameters on the web to proceed with the analysis. The website is designed to run projects so that users who are not familiar with cloud environments can easily perform analysis there.

On a website that provides an existing cloud environment, the user must create and select the type and quantity of resources to use and a workflow suited to the purpose of the analysis before performing the analysis. In the cloud environment of the present invention, by contrast, predetermined services can be used by supplying only simple parameters.

Meanwhile, when analysis begins, the user can check the progress status and detailed progress steps of the project on the project page. When the analysis is completed, the user can move to the report page, check the analyzed data in the form of various visual materials, and download the documented analysis data in various formats.

In step (B), the equipment needed for the project is approved according to the platform type of the project registered by the user. This step includes a resource request for executing the pipeline contained in the project requested by the user, specifies the appropriate quantity of resources and working storage for the request, and processes the workflow dynamically so that each stage can be performed with optimal resources. Management functions for CPU, GPU, and storage resources are required, and for this purpose the present invention applies Argo's workflow management function and Kubernetes' application deployment and container orchestration functions.
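A per-stage resource template of the kind described for step (B) might be modeled as below. All figures are invented placeholders, and the aggregation rule (peak CPU and GPU because stages run one after another, summed storage because intermediate outputs are kept) is an assumption of this sketch rather than something stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceTemplate:
    cpu_cores: int
    gpus: int
    storage_gb: int


# Hypothetical figures, for illustration only.
TEMPLATES = {
    "DMC-PRE": ResourceTemplate(cpu_cores=64, gpus=0, storage_gb=500),
    "DMC-SCR": ResourceTemplate(cpu_cores=32, gpus=4, storage_gb=200),
    "DMC-MD": ResourceTemplate(cpu_cores=16, gpus=8, storage_gb=1000),
}


def project_request(stages: list[str]) -> ResourceTemplate:
    """Aggregate the resources a project needs across the selected stages."""
    chosen = [TEMPLATES[name] for name in stages]
    return ResourceTemplate(
        cpu_cores=max(t.cpu_cores for t in chosen),   # assumed: stages run sequentially
        gpus=max(t.gpus for t in chosen),             # assumed: stages run sequentially
        storage_gb=sum(t.storage_gb for t in chosen), # assumed: outputs are retained
    )


total = project_request(["DMC-PRE", "DMC-SCR", "DMC-MD"])
```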

The (C) stage for discovering candidate substances for new drug development can be divided into the (C-1) DMC-PRE, (C-2) DMC-SCR, and (C-3) DMC-MD stages. The (C-1) stage proceeds in the following order to select compound structures suitable as candidate substances from a database of hundreds of millions of compounds built in-house. Step (C-1-1) selects approximately 1 million compound structures through structure and chemical property calculations on large-scale compound data using the PHscan algorithm. Step (C-1-2) uses the GAP0 algorithm to calculate the binding possibility and compatibility between the R-group database information built in-house and the compound structures selected in step (C-1-1), and then selects 100,000 compound structures. Step (C-1-3) uses the G1 docking program, which binds proteins and compounds in virtual space and analyzes them, to bind compounds to predefined binding sites in the protein structure and provide the result of analyzing their binding environment. In the final detailed step (C-1-4), the 1,000 best compound structures are selected from the analysis results. As a result, 1,000 candidate substances are selected from a large database of hundreds of millions of compounds through the DMC-PRE step, and these candidate substances are passed on as input data for the next stage (C-2).
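The successive narrowing in stage (C-1) is, structurally, a chain of score-and-keep-the-best filters. The sketch below shows that structure at a much smaller scale (10,000 to 1,000 to 100 to 10). The scoring functions are arbitrary arithmetic placeholders; they do not represent PHscan, GAP0, or G1, whose internals are not described here.

```python
import heapq
from typing import Callable, Iterable


def prescreen_funnel(compounds: Iterable[int],
                     stages: list[tuple[Callable[[int], float], int]]) -> list[int]:
    """Apply each (score_fn, keep) stage in turn, keeping the top `keep` compounds."""
    survivors = list(compounds)
    for score_fn, keep in stages:
        survivors = heapq.nlargest(keep, survivors, key=score_fn)
    return survivors


# Placeholder scoring functions standing in for the three selection steps.
stages = [
    (lambda c: (c * 7919) % 10007, 1000),
    (lambda c: (c * 104729) % 10009, 100),
    (lambda c: (c * 1299709) % 10037, 10),
]
final_candidates = prescreen_funnel(range(10_000), stages)
```

Each stage only sees the survivors of the previous one, which is why the more expensive calculations can be reserved for the later, smaller sets.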

Meanwhile, the (C-2) DMC-SCR stage is divided into a total of 7 stages.

In step (C-2-1), docking refers to the task of binding the 1,000 ZINC compounds (ligands) analyzed in the DMC-PRE step to the target protein; the binding analysis between the target protein and each compound is performed in multiple ways to calculate the binding information. In addition, considering the time delay that the structure of a specific compound can cause, a maximum work time is set for this step so that exceptional cases are cut off.

Step (C-2-2) uses the structure bound in step (C-2-1) to create a protein-compound binding structure file in which the positional information of the bound compound is changed. As in step (C-2-1), a maximum execution time is set for this step in consideration of delays caused by the compound structure.
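A maximum work time per step, as described for steps (C-2-1) and (C-2-2), can be enforced with the `timeout` argument of `subprocess.run` from the Python standard library. This is a sketch under the assumption that each step is launched as a separate process; the child commands below are trivial placeholders, not the actual docking programs.

```python
import subprocess
import sys


def run_with_limit(cmd: list[str], max_seconds: float) -> str:
    """Run one step and report 'ok', 'failed', or 'timeout'."""
    try:
        result = subprocess.run(cmd, timeout=max_seconds, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a compound that
        # stalls the calculation cannot hold up the rest of the workflow.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


# Placeholder child processes: one finishes at once, one exceeds its limit.
quick = run_with_limit([sys.executable, "-c", "pass"], max_seconds=30)
slow = run_with_limit([sys.executable, "-c", "import time; time.sleep(5)"],
                      max_seconds=0.5)
```

A caller could log the compounds that return `"timeout"` and continue with the remainder.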

The (C-2-3) step gathers the various protein-compound binding information generated in the (C-2-2) step into one place to calculate the suitability of the binding structure between each protein and compound. Because the individual files are small, collecting them in one place caused delays. Using a memory buffer in this process improves the file copy time by more than 30% compared to before.
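
One way to picture the memory buffer is to pack the small files into a single in-memory archive and move that archive as one unit. The description does not say how the buffer is implemented, so the tar format and the function names below are assumptions chosen for illustration.

```python
import io
import tarfile
from pathlib import Path
from typing import Iterable


def pack_to_buffer(paths: Iterable[Path]) -> bytes:
    """Pack many small files into one tar archive held entirely in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for path in paths:
            archive.add(path, arcname=path.name)
    return buffer.getvalue()


def unpack_from_buffer(data: bytes, target_dir: Path) -> int:
    """Write the archived files under target_dir; return how many were written."""
    target_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as archive:
        for member in archive.getmembers():
            source = archive.extractfile(member)
            if source is None:  # not a regular file (for example a directory)
                continue
            # Keep only the base name so an entry cannot escape target_dir.
            (target_dir / Path(member.name).name).write_bytes(source.read())
            count += 1
    return count
```

The gain comes from replacing many small transfers with one large one; the actual improvement depends on the storage and network involved.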

In the (C-2-4) step, the ENVA operation is performed to calculate the binding force of the top 2,000 protein-compound structures ranked by the suitability calculated in the (C-2-3) step.

In the (C-2-5) step, the key residues in the protein required for binding between the protein and the compound are calculated. In the (C-2-6) step, a pre-generated prediction model is applied to the binding structures between the protein and the compounds to select, for each compound, the top one protein-compound binding information with a good binding score.
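
Selecting the top one binding result for each compound amounts to a group-by-compound maximum. The sketch below assumes that each result is a (compound, pose, score) tuple and that a higher score is better; both are assumptions, as are the names `Pose` and `best_pose_per_compound`.

```python
from typing import Dict, Iterable, Tuple

Pose = Tuple[str, str, float]  # (compound_id, pose_id, binding_score)


def best_pose_per_compound(poses: Iterable[Pose]) -> Dict[str, Pose]:
    """Keep only the highest-scoring pose for every compound."""
    best: Dict[str, Pose] = {}
    for pose in poses:
        compound_id, _, score = pose
        # Replace the stored pose only when the new one scores strictly higher.
        if compound_id not in best or score > best[compound_id][2]:
            best[compound_id] = pose
    return best
```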

Step (C-2-7) calculates the score for the binding stability of the 1,000 protein-compound pairs from step (C-2-6), extracts the top one, and provides it as the input information for stage (C-3).

In the (C-3) stage, the 1,000 candidate substances are bound to the protein structure again, and AMBER, a molecular dynamics simulation program, is used to optimize the structure of the proteins and compounds as they act in virtual space and time. In the protein-compound binding structure used in the simulation, the degree of atomic bonding changes over time, and because numerous atoms are affected by the binding, the overall shape of the protein also changes.

Therefore, the binding information and structural characteristics of the protein-compound binding structure are analyzed at regular time intervals to shape the structure into the most stable form it can have.

After simulation, tens to hundreds of compounds with the optimal binding form for the target protein are selected by comparing the optimized protein-compound binding structure with the experimentally known optimal binding environment of the protein, and these are judged to be the final candidate substances. Finally, additional post-processing work is performed to provide the analysis results of the finally selected candidate substances to the user in the form of a report.

In order to efficiently operate the platform for discovering new drug development candidates in stage (C) described above, solutions to the following items are required.

1) a distributed processing method that efficiently extracts the necessary information from a large-scale compound database (1 billion compounds), 2) the minimum computing resources required for each step, 3) the analysis time according to the allocated computing resources, 4) a response plan for abnormal operation and errors, and 5) a step-by-step plan for confirming normal operation.

In order to operate in the current cloud environment, an infrastructure that communicates on the same physical server resources and the same network has been established, and the Argo cloud-based workflow manager is adopted and used to efficiently utilize the infrastructure and cope with various errors.

In addition, various optimized conditions were calculated for the resources of the physical machines currently in service and for the resources consumed by the analysis at each stage, and the optimal resource consumption and distributed processing values for the model currently in service were calculated and set.

In the case of a Kubernetes-based cluster that connects hundreds or more calculation servers around a master server, it was found that cluster management and the stability of the master server deteriorated, because the master server is responsible for processing the information of hundreds of calculation servers. To solve this problem, a separate server (proxy) is added and configured to handle the connections between the master server and the calculation servers, thereby distributing the session load (SLB; Service Load Balancing). At the same time, high availability (HA; High Availability) of the service has been secured to ensure session stability in abnormal situations that may occur on any one master server.

In addition, by separating the necessary roles within clusters such as Argo and Calico, the master server was able to focus more on scheduling tasks through pods and managing the cluster.

And despite the self-healing function of Kubernetes, there were many cases of abnormal delays in the distribution of tasks due to physical server failures that occurred frequently in large cluster environments.

To solve this problem, an automatic node management function was introduced that 1) checks the equipment status information of Kubernetes and 2) uses the labeling function of Kubernetes to identify abnormal nodes and automatically remove them from the nodes managed by the cluster, thereby minimizing delays in task distribution.
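
The two steps of automatic node management can be sketched with plain data structures. This is not Kubernetes client code: nodes are modelled as simple records, the label key `example.com/health` is a hypothetical name, and the only Kubernetes convention relied on is that a node's Ready condition takes the values "True", "False", or "Unknown".

```python
from dataclasses import dataclass, field
from typing import Dict, List

HEALTH_LABEL = "example.com/health"  # hypothetical label key


@dataclass
class Node:
    name: str
    ready: str  # value of the Ready condition: "True", "False" or "Unknown"
    labels: Dict[str, str] = field(default_factory=dict)


def label_nodes(nodes: List[Node]) -> None:
    """Step 1 and 2: read each node's status and record it as a label."""
    for node in nodes:
        node.labels[HEALTH_LABEL] = "ok" if node.ready == "True" else "abnormal"


def schedulable(nodes: List[Node]) -> List[str]:
    """Names of nodes that task distribution may still use."""
    return [node.name for node in nodes if node.labels.get(HEALTH_LABEL) == "ok"]
```

Running `label_nodes` periodically and distributing tasks only to the nodes returned by `schedulable` keeps failed servers out of the pool without manual intervention.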

The report function of detailed step (D), which provides users with information about the analyzed data, was introduced in response to the need to improve accessibility so that users can easily check and understand the analysis results required in the process of discovering new effective substance candidates. Using Docker container technology, an independent virtualized web service environment is established for each project each time an analysis is completed. For the report web service, MongoDB was introduced for fast data access and large-capacity data processing, and the service is built as a single page application, which provides an environment similar to a native app so that users can access information efficiently. The website is organized so that it is easy to navigate and understand.

Lastly, step (E), the data backup step performed at the user's request, backs up the data so that the results generated by the analysis are safely stored and delivered, and also reclaims the storage resources once the work has been completed. In this process, technology for quickly backing up the results and policies for preventing loss and alteration of the results were applied.

Hereinafter, detailed technical contents of the cloud platform and service method according to the present invention will be described in accordance with the order of the attached drawings.

Figure 1 shows a cloud service flow chart for DMC-PRE, DMC-SCR, and DMC-MD. As shown in Figure 1, when cloud users register a project and apply for use, the manager (or management algorithm) reviews and approves the project and allocates resources to be used for analysis.

The resources include not only physical machine resources, but also various parameters for executing automated workflows for analysis. The moment the user executes the project analysis after setting the parameters, the platform is executed in order according to the designated template, and a report is generated when execution is completed.

And users can download the report results through the portal. After reviewing the analyzed results, the user can request a backup of the data from the administrator. If the administrator specifies a pre-written backup template, the data is automatically moved from the work storage to the backup storage.

Meanwhile, in the present invention, an administrator is a subject who manages the operation of the platform, and may be an operating personnel or an electronic system such as an operating program or an operating server.

Next, Figure 2 is a cloud platform configuration diagram for DMC-PRE, DMC-SCR, and DMC-MD.

The cloud platform according to the present invention is composed of three layers, which are divided into hardware, software, and portal layers.

The hardware layer includes the GPU, CPU, and storage resources. The software layer consists of the workload manager implemented in Kubernetes, the workflow and template manager implemented in Argo, the platform image manager configured in Harbor, the report manager implemented in Python Django, the distributed system Kafka, and the distributed and parallel processing workflow methods implemented with these components.

The cloud portal layer is divided into a user part and an administrator part. User functions include project application and report confirmation, and administrator functions include project management, resource management, database management, and user management in addition to the functions available to users.

Figure 3 shows the architecture of a multi-cluster environment.

Each cluster has a workflow manager, a workload manager, and physical storage, and the cluster configuration is designed to be scalable for each type of content. Information on all clusters is stored in an integrated database and communicated to a separate report manager.

These constructs deliver messages through Kafka. And the cloud portal receives Kafka's communication messages and provides information to users and administrators.

Figure 4 shows the AMQP middleware configuration environment according to the present invention, Figure 5 shows a performance comparison between Kafka and RabbitMQ according to the present invention, and Figure 6 shows the Kafka configuration environment according to the present invention. Figure 7 shows the internal structure of Kafka according to the present invention, Figure 8 shows the distributed and parallel structure of Kafka according to the present invention, Figure 9 shows the distributed message structure according to the present invention, and Figure 10 shows the message batch processing structure according to the present invention, and Figure 11 shows the portal service flow according to the present invention.

The MQ (Message Queue) protocol is a method of mediating information between various message services. The sender and receiver are fixed, so data is transmitted in a one-to-one manner, such as specific log data.

Once the recipient takes the message, the message disappears from the broker, making it impossible for another recipient to receive the same message.

In a multiple-client environment where senders and receivers are one-to-many, the number of queues at the broker must be increased and the sender must send messages to all queues. In a multi-cluster environment with multiple queues at each layer, this complexity becomes more severe, which is a structural problem.

In addition, it has performance limitations: real-time movement of large-capacity data is not possible, or non-standard technical tuning must also be performed to meet the data movement speed required by the service.

For example, the throughput of RabbitMQ, the most popular message broker, is about 5% of that of Kafka, and the gap widens further in an environment optimized for Kafka, for example with latency tuning and high-speed RAID storage (see Figure 5). To overcome the weaknesses and limitations of existing messaging intermediaries, Kafka applies distribution, batch, parallel, and cache technologies to the messaging model so as to process large amounts of traffic effectively and minimize infrastructure complexity in a multi-cluster environment (see Figures 6 and 7).

Meanwhile, in the present invention, since messages are stored based on a file system, read performance can be optimally expanded based on high-speed RAID and cache performance.

And as shown in Figures 8 and 9, one intermediary structurally allows connections between multiple senders and receivers to distribute and process messages in parallel.
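
The structural difference between a classic queue and a log-based intermediary can be shown with two toy brokers. Neither class is Kafka's or RabbitMQ's API; `QueueBroker` and `LogBroker` are hypothetical in-memory models that capture only one idea: a queue discards a message once it is taken, whereas a log keeps every message and tracks a separate read position for each receiver group.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class QueueBroker:
    """Classic queue: a message is gone once one receiver has taken it."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._queue.append(message)

    def receive(self) -> Optional[str]:
        return self._queue.popleft() if self._queue else None


class LogBroker:
    """Append-only log: each receiver group keeps its own read offset."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._offsets: Dict[str, int] = {}

    def send(self, message: str) -> None:
        self._log.append(message)

    def receive(self, group: str) -> Optional[str]:
        offset = self._offsets.get(group, 0)
        if offset >= len(self._log):
            return None  # this group has already read everything
        self._offsets[group] = offset + 1
        return self._log[offset]
```

With the log model, the portal and the report manager can both read the same message from one broker, so the sender does not have to write to one queue per receiver.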

Additionally, in the present invention, as shown in FIG. 10, throughput is improved by sending and receiving messages in batches.
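
Batch transmission can be sketched as a sender that buffers messages and hands them to the transport in groups. The class `BatchingSender` and its `transmit` callback are hypothetical; a real Kafka producer performs its own batching internally, and this sketch only illustrates the principle.

```python
from typing import Callable, List


class BatchingSender:
    """Collect messages and pass them on in batches of `batch_size`."""

    def __init__(self, transmit: Callable[[List[str]], None], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._transmit = transmit
        self._batch_size = batch_size
        self._buffer: List[str] = []

    def send(self, message: str) -> None:
        self._buffer.append(message)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, even if the batch is not yet full."""
        if self._buffer:
            self._transmit(list(self._buffer))  # pass a copy, then reset the buffer
            self._buffer.clear()
```

Each call to `transmit` carries a fixed per-call overhead, so sending seven messages in three calls rather than seven is where the throughput gain comes from.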

In addition, the present invention dramatically improves traffic handling and throughput by actively utilizing the file system and network cache technology provided by the operating system.

Meanwhile, as shown in FIG. 11, the portal service according to the present invention is a web page that can be used in a cloud environment. It provides a user interface through which users can carry out all the processes required for analysis: creating an account, issuing a contract according to use, registering a project, applying for a project, outputting a project execution report, and documenting and downloading the report.

And when the analysis of the project begins, the parameters and information required for the analysis are transmitted through the Kafka message broker, and Argo recognizes the information and the analysis proceeds.

Users can check the progress of the analysis, and administrators can check additional information on the resource utilization of each cluster through the monitoring web page.

Figure 12 is a flowchart illustrating the process in which calculation server resources such as CPU, GPU, and storage resources are allocated and managed after the administrator approves the task requested by the user in the present invention.

As shown, when a user's project application is received and approved by the manager, resource information appropriate for the requested size is passed as parameters to the Argo workflow template, and the pipeline is prepared for step-by-step execution.

And when a task is executed by a user, pods for task execution are distributed according to the template order. At this time, for efficient resource utilization, the calculation server's CPU and GPU resources are dynamically allocated according to the required size from the entire resource pool. It is configured so that the waiting pods can be automatically executed when the progress stage is completed and free resources are confirmed. This is implemented through the workload management function of Kubernetes.
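
The allocation behaviour described above, in which pods take CPU and GPU resources from a shared pool and waiting pods start as soon as resources are freed, can be modelled in a few lines. In the platform this is done by Kubernetes; the classes `PodRequest` and `ResourcePool` below are hypothetical and show only the bookkeeping.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class PodRequest:
    name: str
    cpus: int
    gpus: int


class ResourcePool:
    """Shared CPU/GPU pool with a waiting list for pods that do not fit yet."""

    def __init__(self, cpus: int, gpus: int) -> None:
        self.free_cpus = cpus
        self.free_gpus = gpus
        self.running: Dict[str, PodRequest] = {}
        self.waiting: Deque[PodRequest] = deque()

    def submit(self, pod: PodRequest) -> List[str]:
        """Queue a pod and return the names of pods started as a result."""
        self.waiting.append(pod)
        return self._start_waiting()

    def finish(self, name: str) -> List[str]:
        """Release a finished pod's resources and start any pods that now fit."""
        pod = self.running.pop(name)
        self.free_cpus += pod.cpus
        self.free_gpus += pod.gpus
        return self._start_waiting()

    def _start_waiting(self) -> List[str]:
        started: List[str] = []
        still_waiting: Deque[PodRequest] = deque()
        while self.waiting:
            pod = self.waiting.popleft()
            if pod.cpus <= self.free_cpus and pod.gpus <= self.free_gpus:
                self.free_cpus -= pod.cpus
                self.free_gpus -= pod.gpus
                self.running[pod.name] = pod
                started.append(pod.name)
            else:
                still_waiting.append(pod)  # keep its place in the queue
        self.waiting = still_waiting
        return started
```

A pod that does not fit does not block later pods that do, which keeps the pool busy while a large request waits for a GPU to be released.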

In addition, pre-designated storage is automatically allocated (mounted) to each pod when the work pod is executed according to each template, and automatically unmounted when the work of the pod ends, reducing the network load on the storage space.

Figure 13 shows the overall cloud pipeline for DMC-HIT (DeepMatcher-HIT) for discovery of new drug candidates.

DMC-HIT according to the present invention is largely composed of three stages: DMC-PRE, DMC-SCR, and DMC-MD. The roles common to these stages are generating (docking) a protein-compound binding structure and analyzing the binding force of the resulting binding structure.

As the pipeline progresses, compounds deemed valuable among the large-scale compound data are selected, and up to 200 compounds can be confirmed in the resulting report.

Meanwhile, Figures 14 to 18 show the logic constituting the detailed areas of the present invention in the correct order, and the results derived from performing each logic are expressed in such a way that they are used in the next logic.

And Figure 19 shows how to perform the PHscan step of DMC-PRE in a cloud environment. At this time, since some stages of the pipeline require calculation work based on massive data, 100 processes are created and processed in parallel. As a final result of this step, data on about 1 million compounds are passed on to the next step.

Figure 20 shows how to perform the GAP0 step of DMC-PRE. In order to process the large number of compounds, approximately 1 million, received from the PHscan step of DMC-PRE, the compounds are divided into batches, 100 parallel processes are run at a time, and this is repeated until all batches are finished. In the final step, approximately 100,000 compounds are selected.
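
The batch-and-repeat pattern can be sketched as follows. Two assumptions are made: a batch is taken to be as large as the degree of parallelism, so that each item in a batch gets its own worker, and threads stand in for the 100 processes (or pods) of the description so that the example stays self-contained. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batches(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(items: Sequence[T], work: Callable[[T], R], parallel: int = 100) -> List[R]:
    """Run `work` on every item, `parallel` items at a time, batch after batch."""
    results: List[R] = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        for batch in batches(items, parallel):
            # map() returns results in input order; extend() waits for the whole
            # batch to finish before the next batch is submitted.
            results.extend(pool.map(work, batch))
    return results
```

For about 1 million compounds and `parallel=100`, the loop would run roughly 10,000 batches in sequence.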

And Figure 21 shows how to perform the analysis step through the G1 program, which is the last part of DMC-PRE. In order to process about 100,000 compounds received from the GAP0 step of the DMC-PRE, the compounds are similarly divided into batches, and then 100 parallel processes are performed and repeated until the work on all batches is completed. The final 1,000 compounds are selected and passed to the next step, DMC-SCR.

Next, Figure 22 shows the overall configuration of the DMC-SCR step operating in the cloud. At this time, the area of parallel distributed processing consists of various detailed steps, and 1,000 compounds received from the previous step are processed in parallel, 100 at a time. The 1,000 protein-compound binding structures that are the result of this step are transferred to the next step, DMC-MD.

Figure 23 shows the DMC-MD step execution operation in the cloud. Analysis is performed based on a molecular dynamics simulation program with 1,000 protein-compound binding structures received from the DMC-SCR step.

At this time, the area processed in parallel consists of several detailed steps, and 100 of the 1,000 combined structures are processed in parallel at a time. After completion of this step, the DMC-POST step is performed to generate a DMC-HIT result report based on the protein-compound binding structure and analysis results, which are the output data of the molecular dynamics simulation program.

Meanwhile, Figure 24 is a flowchart illustrating the user's connection process from the creation of the report web service after completing the analysis of the project registered by the user.

When the analysis is completed normally, the analyzed data is uploaded to the database, and an additional database dedicated to report information is created. The database can quickly process the data needed for reports by creating various information into one document based on one central piece of information, like a view table in a relational database.

Next, a report server is selected, a port is specified within the server, and port forwarding is performed. Finally, the report server is built, the service is executed, and all service preparations are completed. Afterwards, users can access the report web service by accessing the cloud portal and clicking Download Report for the project. Through this, users can check the analysis results directly on the web or download them as documents in various formats.

Lastly, Figure 25 shows the process for backing up and externally storing the results obtained through project pipeline execution. As schematized in this figure, after the work pipeline has terminated, the backup workflow template is executed to perform the backup. In this process, several terabytes of results must be transferred to the primary storage, so a multi-node, multi-session backup method is used to maximize transfer capacity and write speed.

In addition, in order to prevent information loss or alteration that may occur during the backup step, a checksum is computed before and after the backup and the resulting hash values are compared.
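
The before-and-after comparison can be sketched for a single file as follows. The description does not name a hash algorithm, so SHA-256 is an assumption, as are the function names; a multi-node backup would apply the same check to each transferred file.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large results do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_verification(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm that the copy is bit-identical."""
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch while backing up {source}")
    return after
```

Storing the returned hash in the backup history would also allow the copy to be re-verified later.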

Backup copies stored in primary storage are simultaneously stored again in physically separate storage, and once two copies exist, the original is deleted and the work storage space is returned. The most important policy of the backup system is to maintain two copies at all times for a certain period upon user request, and each backup target and the backup history are recorded. Backup copies whose online storage period set by regulation has expired are copied to hard disk, after which the online copies are deleted in accordance with policy.

The rights of the present invention are not limited to the embodiments described above but are defined by the claims, and it is self-evident that those skilled in the art can make various changes and modifications within the scope of the claims.

Among existing commercially valuable clouds, few have been commercialized with full automation by linking large-scale genomes, proteomes, and compounds as our cloud does. In addition, although research-and-development pipelines may exist for partial stages of the overall workflow, our cloud is the only cloud system that applies a large-scale analysis pipeline, capable of analyzing the degree of binding between verified proteins and billions of compounds, together with high-performance computing infrastructure resources linked to that pipeline. It is expected to be used on a large scale for business and research purposes both domestically and overseas in the future. Furthermore, expansion to clouds equipped with content-specific pipelines that require large-scale calculations through cloud-based workflows, such as population genome biomarker analysis and neoantigen analysis, is expected to contribute to expanding business usability.

Claims

In the cloud platform service capable of distributed and parallel processing for large-scale workflow for discovering new drug development candidates,

Managing member information and projects;

In the cloud workflow manager, analyzing the discovery of new drug development candidates;

Visualizing the analyzed results;

Remotely or internally controlling functions between the portal service, cloud workflow manager, cloud manager, and report manager through a message broker; and

A cloud platform service method capable of distributed and parallel processing for large-scale workflows, including the step of automatically responding to error situations in the cluster.
According to claim 1,

The cloud platform service is provided as a web service application provided in the form of a portal,

Management of the above member information is:

Includes member registration and login services;

The project management is,

Management of prices (charges) for analysis services, intermediate storage and documentation functions for prices (charges);

Management of project registration, service application, service execution and project progress monitoring functions; and

A cloud platform service method capable of distributed and parallel processing for large-scale workflows, comprising management of analysis result report services.
According to claim 1,

The analysis of new drug development candidate substances is:

A template that stores information that forms the basis of programs and parameters required for each step to discover candidate substances for new drug development using deep learning;

a template that allocates the optimal amount of computer resources to be used for each analysis step;

A template that allocates each analysis step to high-performance computing resources in parallel;

A template that enables visualization of the analyzed information;

Templates for backup of the large quantities of intermediate and final outputs of the analysis; and

A cloud platform service method capable of distributed and parallel processing for large-scale workflows, characterized in that each template is connected to a cluster for execution and executed by an administrator unit that manages the template.
According to claim 3,

The analysis for discovering new drug development candidates is,

(Ⅰ) a DMC-PRE step of first screening (pre-screening) candidate compounds based on the physical, chemical, and topology information of the compounds from the compound database;

(Ⅱ) DMC-SCR step of secondary selection (deep screening) of the primary selected candidate compounds using a tool (Enva) learned through an artificial intelligence algorithm; and

(Ⅲ) A cloud platform service method capable of distributed and parallel processing for large-scale workflow, characterized in that it is performed including a DMC-MD step of verifying the secondary selected compounds through molecular dynamics molecule-motion simulation.
According to claim 4,

The DMC-PRE step is,

(Ⅰ-1) selecting compound structures and creating a database through calculation of structures and chemical properties based on large-scale compound data;

(Ⅰ-2) selecting compound structures by calculating the binding possibility and compatibility between the database (R-group) information established through the protein data bank (PDB) and the compound structures selected in step (Ⅰ-1);

(Ⅰ-3) analyzing the binding of proteins and compounds in virtual space through a molecular docking algorithm (molecular docking with deep learning), by binding the compound to a predefined binding site in the protein structure and analyzing the binding environment; and

(Ⅰ-4) A cloud platform service method capable of distributed and parallel processing for large-scale workflows, comprising the step of first screening candidate compounds from the compound structures selected in step (Ⅰ-2) based on the analysis results of step (Ⅰ-3).
According to claim 5,

The DMC-SCR step is,

(Ⅱ-1) calculating binding information by analyzing the binding of the compounds initially selected in the DMC-PRE step to the target protein;

(Ⅱ-2) generating a protein-compound binding structure file in which the positional information of the bound compound is changed using the structure bound in step (Ⅱ-1);

(Ⅱ-3) storing each protein-compound binding information generated in step (Ⅱ-2) and calculating the suitability of the binding structure between each protein-compound;

(Ⅱ-4) calculating the binding force of a preset number of protein-compound structures according to the compatibility calculated in step (Ⅱ-3);

(Ⅱ-5) calculating the binding force for key residues in the protein required for protein-compound binding in step (Ⅱ-4);

(Ⅱ-6) For the protein-compound binding structure of step (Ⅱ-4), calculating top protein-compound binding information according to the binding score for each compound through a prediction model; and

(Ⅱ-7) A cloud platform service method capable of distributed and parallel processing for large-scale workflows, comprising the step of secondary screening of candidate compounds by calculating a score for the protein-compound binding stability from the results of step (Ⅱ-6).
According to claim 6,

The steps (Ⅱ-1) and (Ⅱ-2) are,

A cloud platform service method capable of distributed and parallel processing for large-scale workflows, characterized by setting the maximum work time for each step to prevent delays in calculation time due to specific compound structures.
According to claim 6,

In step (Ⅱ-3),

A cloud platform service method capable of distributed and parallel processing for large-scale workflow, characterized in that the storage of each protein-compound binding information is performed through a memory buffer.
According to claim 6,

The DMC-MD step is,

(III-1) combining the secondary selected compounds with the protein structure and optimizing the structure of the protein and compound active in virtual space using a molecular dynamic simulation program;

(III-2) analyzing the binding information and structural characteristics of the protein-compound binding structure at preset time intervals to calculate a stabilized optimal binding form; and

(Ⅲ-3) A cloud platform service method capable of distributed and parallel processing for large-scale workflows, comprising the step of selecting compounds that have the optimal binding form with the target protein, by comparing the optimized protein-compound binding structure from the simulation results of step (Ⅲ-2) with the known optimal binding environment of the protein, and verifying them as final candidate compounds.
According to claim 9,

Visualization of the above analyzed results for discovery of new drug development candidates is as follows:

a database that stores analysis information in a form for visualization;

a database storing user requirements for documenting analysis information;

An application programming interface (API) in REST format for visualizing and documenting information stored in the database;

A web user interface that visualizes and documents the analyzed information on the analyzed new drug development candidates on the web and the binding structure between proteins and compounds; and

A cloud platform service method capable of distributed and parallel processing for large-scale workflows, characterized by being executed by an application programming interface that can manage servers running web applications and connection ports in REST format.
According to claim 9,

Internal control through message broker:

Message broker and daemon services that deliver project execution information;

A message broker service that exchanges information to understand the situation of the cluster;

a message broker service to check the execution status of the project;

a message broker service that transfers the analyzed data to a database;

message broker and daemon services to run a report system based on analysis data stored in a database;

Message broker service to deliver the report creation completed status to the portal; and

A message broker service that relays the resource usage of the servers that make up the cluster to the portal; a cloud platform service method capable of distributed and parallel processing for large-scale workflows, comprising the above services.
According to claim 9,

To deal with cluster error situations,

Securing high availability of session load distribution and server message processing through configuration of alternative processing servers;

Ensure service stability by configuring multiple master servers;

A cloud platform service method capable of distributed and parallel processing for large-scale workflows, including automatic node management through Kubernetes equipment status monitoring and node labeling functions.
A cloud platform for executing the method of any one of claims 1 to 12 as a web service.