CN116192885A

CN116192885A - High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system

Info

Publication number: CN116192885A
Application number: CN202211603530.6A
Authority: CN
Inventors: 贾子琪; 杨浩; 朱世冲; 古超; 周楚亚; 张强; 张腾飞; 陈连山
Original assignee: Nanyang Institute of Technology
Current assignee: Nanyang Institute of Technology
Priority date: 2022-12-13
Filing date: 2022-12-13
Publication date: 2023-05-30

Abstract

The application relates to cloud platform technology and provides a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. The artificial intelligence experiment cloud platform comprises a plurality of master nodes and a plurality of slave nodes. If a target slave node receives an experimental work task deployment instruction sent by the target master node, it creates a target container according to the experimental work task deployment instruction; if the target slave node receives an access request from a user terminal and the request passes verification, it connects the target container instance corresponding to the target container with the user terminal; the target slave node receives target operation data from the user terminal and stores the target operation data in a key-value database corresponding to the target container; and if the target slave node receives a container operation instruction, it creates or deletes a container according to the container operation instruction. Artificial intelligence experiment tasks can thus be processed in the cloud on the platform, nodes can be added to or removed from the cluster at any time, and the high availability and load capacity of the cluster are improved.

Description

High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system

Technical Field

The application relates to the technical field of cloud platforms, in particular to a high-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system.

Background

At present, when an enterprise or a university carries out artificial intelligence experiments, some adopt an experiment platform cluster, that is, the experiment data are placed on a cloud platform cluster and the experiment tasks are performed in the cloud. However, nodes cannot be added to or removed from current cloud platform clusters at any time, which limits the number of operators an experiment can serve and prevents the cluster from handling cloud experiment tasks for participant groups of different sizes. In addition, when an existing cloud platform cluster encounters an abnormal fault such as a power failure, experimental data cannot be saved automatically, which poses a significant data security risk.

Disclosure of Invention

The embodiments of the application provide a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. They aim to solve the problem in the prior art that nodes cannot be added to or removed from a cloud platform cluster used for artificial intelligence experiments at any time, so that the number of operators the experiments can serve is limited and only experiments involving a small number of operators can be carried out.

In a first aspect, an embodiment of the present application provides a data processing method for an artificial intelligence experiment cloud platform with a high availability cluster architecture, which is applied to an artificial intelligence experiment cloud platform, wherein the artificial intelligence cloud platform includes a plurality of master nodes and a plurality of slave nodes, and the plurality of master nodes and the plurality of slave nodes are all in communication connection; the method comprises the following steps:

if the target slave node receives the experimental work task deployment instruction sent by the target master node, a target container is correspondingly created according to the experimental work task deployment instruction; the target slave node is any slave node in the plurality of slave nodes, and the target master node is a master node in an active state in the plurality of master nodes;

if the target slave node receives an access request of a user terminal and passes verification, connecting a target container instance corresponding to the target container with the user terminal;

the target slave node receives target operation data of the user terminal, and stores the target operation data into a key value database corresponding to the target container;

the target master node sends a container operation instruction to the target slave node;

and if the target slave node receives the container operation instruction, correspondingly creating or deleting a container according to the container operation instruction.
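The steps above can be sketched as event handlers on a slave node. The following is a minimal in-memory illustration only, not the platform's implementation: the class and method names, the token-based verification, and the use of a plain dictionary in place of the key-value database are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical stand-in for a container plus its key-value database."""
    name: str
    image: str
    kv_store: dict = field(default_factory=dict)


class SlaveNode:
    """Illustrative slave node reacting to instructions from the active master."""

    def __init__(self, valid_tokens):
        self.containers = {}
        # Assumed verification scheme: a set of accepted access tokens.
        self.valid_tokens = set(valid_tokens)

    def on_deploy(self, name, image):
        # Deployment instruction received: create the target container.
        self.containers[name] = Container(name, image)
        return self.containers[name]

    def on_access(self, name, token):
        # Connect the user terminal only if verification passes.
        if token not in self.valid_tokens or name not in self.containers:
            return None
        return self.containers[name]

    def on_operation_data(self, name, key, value):
        # Store the terminal's operation data for this container.
        self.containers[name].kv_store[key] = value

    def on_container_op(self, op, name, image=""):
        # Container operation instruction: create or delete a container.
        if op == "create":
            self.on_deploy(name, image)
        elif op == "delete":
            self.containers.pop(name, None)
        else:
            raise ValueError(f"unsupported operation: {op}")
```

In this sketch a failed verification simply returns `None`; how the platform actually rejects a request is not specified in the text.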

In a second aspect, an embodiment of the present application provides a data processing system of an artificial intelligence experiment cloud platform with a high-availability cluster architecture, which operates on the artificial intelligence experiment cloud platform and includes a plurality of master nodes and a plurality of slave nodes, where the plurality of master nodes and the plurality of slave nodes are all communicatively connected; the target slave node is any slave node in the plurality of slave nodes, and the target master node is a master node in an active state in the plurality of master nodes;

the target slave node is used for correspondingly creating a target container according to the experimental work task deployment instruction if the experimental work task deployment instruction sent by the target master node is received; the target slave node is any slave node in the plurality of slave nodes, and the target master node is a master node in an active state in the plurality of master nodes;

the target slave node is further configured to connect a target container instance corresponding to the target container with the user terminal if an access request of the user terminal is received and verification is passed;

the target slave node is further configured to receive target operation data of the user terminal, and store the target operation data to a key value database corresponding to the target container;

The target master node is used for sending a container operation instruction to the target slave node;

and the target slave node is further used for correspondingly creating or deleting the container according to the container operation instruction if the container operation instruction sent by the target master node is received.

In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method of the first aspect when executing the computer program.

In a fourth aspect, embodiments of the present application further provide a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by a processor causes the processor to perform the high availability cluster architecture artificial intelligence experiment cloud platform data processing method of the first aspect.

The embodiments of the application provide a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture, wherein the artificial intelligence experiment cloud platform comprises a plurality of master nodes and a plurality of slave nodes, all of which are communicatively connected. The method comprises the following steps: if the target slave node receives the experimental work task deployment instruction sent by the target master node, a target container is created according to the experimental work task deployment instruction; if the target slave node receives an access request from a user terminal and the request passes verification, the target container instance corresponding to the target container is connected with the user terminal; the target slave node receives target operation data from the user terminal and stores the target operation data in a key-value database corresponding to the target container; the target master node sends a container operation instruction to the target slave node; and if the target slave node receives the container operation instruction, a container is created or deleted according to the container operation instruction. As a result, artificial intelligence experiment tasks can be processed in the cloud on the artificial intelligence experiment cloud platform, nodes can be added to or removed from the cluster at any time, and the high availability and load capacity of the cluster are improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of an application scenario of the data processing method of an artificial intelligence experiment cloud platform with a high-availability cluster architecture according to an embodiment of the present application;

Fig. 2 is a flow chart of the data processing method of an artificial intelligence experiment cloud platform with a high-availability cluster architecture according to an embodiment of the present application;

Fig. 3 is a schematic block diagram of the data processing system of an artificial intelligence experiment cloud platform with a high-availability cluster architecture according to an embodiment of the present application;

Fig. 4 is a schematic block diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of an application scenario of the data processing method of an artificial intelligence experiment cloud platform with a high-availability cluster architecture according to an embodiment of the present application; Fig. 2 is a flow chart of that method. The method is applied to an artificial intelligence experiment cloud platform which, as shown in Fig. 1, comprises a plurality of master nodes and a plurality of slave nodes, all of which are communicatively connected. The artificial intelligence experiment cloud platform can be regarded as a Kubernetes cluster comprising a plurality of master nodes and a plurality of slave nodes, that is, a distributed system that manages the orchestration and scheduling of the resources of a container cluster.

Each of the plurality of master nodes includes an API-Server module (which may be understood as an interface module), a Scheduler module (a scheduling module), a Controller-Manager module (a management control module), and a key-value database (the Etcd database). The API-Server module notifies the slave nodes to create, delete, stop, or otherwise operate on cluster resources according to the decisions of the master node. The Scheduler module schedules Pods according to the resource consumption of each slave node in the cluster corresponding to the artificial intelligence experiment cloud platform; a Pod is the smallest unit that can be created and managed in a Kubernetes (K8S) system, and the smallest resource object that a user creates or deploys. The Controller-Manager module checks whether each master node and each slave node in the cluster is healthy. The key-value database stores the important configuration information of the cluster and persists the data resources of the cluster. At any moment, only one master node is running in the active state; it is recorded as the Leader-Master-Node, and only the Leader-Master-Node provides services externally, while the other master nodes remain in an inactive standby state. If the working master node (i.e., the Leader-Master-Node) becomes abnormal, the cluster automatically selects one of the master nodes in the standby state, which immediately replaces the abnormal master node as the new running master node in the active state and continues the current work.
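The active/standby behavior described here can be illustrated with a small sketch. The selection rule used below (promote the first healthy standby in list order) is an assumption made for the example; the text does not state how the cluster chooses among the standby master nodes, and the names `MasterNode` and `elect_active` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MasterNode:
    name: str
    healthy: bool = True
    active: bool = False


def elect_active(masters):
    """Return the single active master, promoting a standby if needed."""
    # Keep the current leader as long as it is still healthy.
    chosen = next((m for m in masters if m.active and m.healthy), None)
    if chosen is None:
        # Assumed rule: promote the first healthy standby in list order.
        chosen = next((m for m in masters if m.healthy), None)
    if chosen is None:
        raise RuntimeError("no healthy master node available")
    # Exactly one master ends up active; all others are standby.
    for m in masters:
        m.active = m is chosen
    return chosen
```

Calling `elect_active` again after marking the leader unhealthy demotes it and promotes a standby, which mirrors the replacement step in the paragraph above.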

Each of the plurality of slave nodes can be regarded as a working node in the cluster corresponding to the artificial intelligence experiment cloud platform: it is the node that actually executes the artificial intelligence experiment tasks and hosts the containers that run the actual services and resources. In addition to providing the running environment for Pods, each slave node has infrastructure for management and communication. Specifically, each slave node exchanges data with each of the plurality of master nodes through a Kubelet component (a proxy component on the slave node). The Kubelet component periodically receives work tasks from the API-Server module of the master node in order to handle the whole life cycle of the Pods on its own node, and it periodically reports its working information to the master node through the API-Server module of the master node. Network proxy access between different slave nodes is performed through a Kube-proxy component (a network proxy component on the slave nodes of the Kubernetes cluster).

As shown in FIG. 2, the data processing method of the artificial intelligence experiment cloud platform with the high availability cluster architecture comprises steps S101 to S105.

S101, if a target slave node receives an experimental work task deployment instruction sent by a target master node, a target container is correspondingly created according to the experimental work task deployment instruction; the target slave node is any one of the plurality of slave nodes, and the target master node is the master node currently in an active state among the plurality of master nodes.

In this embodiment, once the cluster corresponding to the artificial intelligence experiment cloud platform has been formed from a plurality of master nodes and a plurality of slave nodes, the platform can be used as a cloud platform for artificial intelligence experiments. Specifically, a user of the target master node (for example, a platform administrator who has logged in to the master node with an administrator authority user account) operates a user interface to deploy an experiment task to the slave nodes, which triggers generation of an experimental work task deployment instruction. The instruction generated in the target master node is sent to each slave node, and the target slave node among them creates a target container based on the instruction. Because the target master node is the one master node currently in the active state, only one master node in the cluster is working and handling all data processing at any given time.

After the target slave node has created the target containers based on the experimental work task deployment instruction, the target image corresponding to the instruction is added to each target container, yielding the target container instances. Each target container instance can then be assigned to the experiment participant of a corresponding user terminal, so that each participant can use the user terminal to connect to the assigned target container instance and process the artificial intelligence experiment task. The container running environment, model code, and other data are deployed in each target container instance. The container engine of the target container instance provides the container running environment and builds images for different requirements; the private image registry integrates framework images required by experiments in the artificial intelligence field, such as TensorFlow, Caffe and PyTorch, and also supports models such as deep neural networks (DNN), convolutional neural networks (CNN), and the YOLOv1 to YOLOv5 object detection models.

It can be seen that the target container is created in the slave node instead of the master node, which ensures that the slave node acts as a cloud device for actually running the container in the artificial intelligence experiment cloud platform, and the master node acts as a cloud device for uniformly monitoring and managing the slave nodes. If the slave node fails, the artificial intelligent experiment cloud platform adopts high-availability clusters and high-availability deployment of the application to reduce the damage caused by the node failure problem and ensure the high reliability of the cloud platform.

In one embodiment, step S101 further includes:

an interface module in the target master node establishes a communication connection with a Kubelet proxy component of the target slave node.

In this embodiment, when the artificial intelligence experiment cloud platform with a high-availability cluster architecture is constructed, the plurality of master nodes and the plurality of slave nodes need to be communicatively connected. Specifically, each slave node establishes a communication connection with the interface module in the target master node through its Kubelet proxy component, so the target slave node, being one of the slave nodes, does the same. The Kubelet proxy component can be understood as the link for data interaction between the target master node and each slave node. The Kubelet component periodically receives work tasks from the API-Server module of the master node in order to handle the whole life cycle of the Pods on its own node, and it periodically reports its working information to the master node through the API-Server module of the master node. Moreover, each slave node included in the artificial intelligence experiment cloud platform can access the Internet through a Kube-proxy component and can be communicatively connected with a user terminal over the Internet.

In one embodiment, the interface module in the target master node establishes a communication connection with the Kubelet proxy component of the target slave node, including:

the Keepalived component of the target master node automatically configures a virtual IP address of the artificial intelligence experiment cloud platform through the Virtual Router Redundancy Protocol;

and the interface module of the target master node establishes a communication connection with the Kubelet proxy component of the target slave node based on the virtual IP address.

In this embodiment, each master node has a Keepalived component and an HAProxy component. The Keepalived component automatically configures a virtual IP address of the artificial intelligence experiment cloud platform through the Virtual Router Redundancy Protocol (VRRP), so that the platform presents a single unified virtual IP address externally. The HAProxy component provides load balancing services for the slave nodes.

When the Keepalived component of the target master node obtains the virtual IP address of the artificial intelligence experiment cloud platform, it automatically configures the target master node so that the target master node has the same virtual IP address as the platform. The interface module of the target master node then establishes a communication connection with the Kubelet proxy component of the target slave node based on the virtual IP address. When any other master node switches from the standby state to the active state, its interface module likewise establishes a communication connection with the Kubelet proxy component of the target slave node based on the same virtual IP address. This architecture ensures the high availability and high load capacity of the system.
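To illustrate which master node holds the virtual IP address at a given moment, the following sketch applies a simplified form of the VRRP election rule: among the reachable nodes, the one with the highest priority wins, and a tie is broken in favor of the higher primary IP address. It models only the outcome of the election, not the protocol's advertisements or timers, and the addresses, priorities, and the function name `vip_holder` are assumptions for the example.

```python
import ipaddress


def vip_holder(nodes):
    """Pick the node that should hold the virtual IP.

    nodes: iterable of (ip, priority, alive) tuples.
    Returns the winning node's IP as a string, or None if no node is alive.
    """
    candidates = [
        (priority, ipaddress.ip_address(ip))
        for ip, priority, alive in nodes
        if alive
    ]
    if not candidates:
        return None
    # Highest priority wins; equal priorities fall back to the higher address.
    _, winner = max(candidates)
    return str(winner)
```

Because the slave nodes connect to the virtual IP address rather than to a specific master node, a change of holder does not require the slave nodes to be reconfigured.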

In one embodiment, step S101 includes:

if the experimental work task deployment instruction is a unified experimental task deployment instruction, the target slave node acquires a first target mirror resource, a GPU resource and a data storage volume path corresponding to the unified experimental task deployment instruction, and the target slave node correspondingly creates a target container according to the first target mirror resource, the GPU resource and the data storage volume path corresponding to the unified experimental task deployment instruction;

if the experimental work task deployment instruction is a personalized container deployment instruction, the target slave node acquires a second target mirror resource corresponding to the personalized container deployment instruction, and the target slave node correspondingly creates a target container according to the second target mirror resource corresponding to the personalized container deployment instruction and a pre-stored data storage volume path.
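The two branches above can be sketched as a function that turns a deployment instruction into a container specification. The type names, the `kind` field, and the value of `DEFAULT_VOLUME_PATH` are hypothetical and stand in for whatever representation the platform actually uses; the personalized branch allocates no GPU here, matching the default case.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical pre-stored data storage volume path.
DEFAULT_VOLUME_PATH = "/data/experiments"


@dataclass
class DeployInstruction:
    kind: str                          # "unified" or "personalized"
    image: str                         # target image resource
    gpu_count: int = 0
    volume_path: Optional[str] = None


@dataclass
class ContainerSpec:
    image: str
    gpu_count: int
    volume_path: str


def build_spec(instr):
    """Derive the container specification for a deployment instruction."""
    if instr.kind == "unified":
        # Unified instruction carries image, GPU and volume path itself.
        if instr.volume_path is None:
            raise ValueError("unified instruction must carry a volume path")
        return ContainerSpec(instr.image, instr.gpu_count, instr.volume_path)
    if instr.kind == "personalized":
        # Personalized instruction carries only the image; the volume path
        # is the pre-stored one and no GPU is allocated by default.
        return ContainerSpec(instr.image, 0, DEFAULT_VOLUME_PATH)
    raise ValueError(f"unknown instruction kind: {instr.kind}")
```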

In this embodiment, user accounts with at least three types of authority are defined in advance in the artificial intelligence experiment cloud platform: an administrator authority user account, a first authority user account (such as a teacher authority user account), and a second authority user account (such as a student authority user account). The administrator authority user account has the authority to manage all data of the whole artificial intelligence experiment cloud platform; for example, creating a plurality of namespaces in the slave nodes of the platform is one such authority. The first authority user account has the authority to create a plurality of containers in its corresponding namespace for second authority user accounts to log in to and use. The second authority user account only has the authority to log in to its corresponding container in the corresponding namespace to process the artificial intelligence experiment task.
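The three authority levels can be summarized as a small permission table. The action names below are illustrative labels chosen for this sketch and do not appear in the application; the mapping itself follows the description above.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "administrator authority"
    TEACHER = "first authority"
    STUDENT = "second authority"


# Hypothetical action names; the administrator may perform every action.
PERMISSIONS = {
    Role.ADMIN: {"manage_platform", "create_namespace",
                 "create_container", "use_container"},
    Role.TEACHER: {"create_container"},
    Role.STUDENT: {"use_container"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS[role]
```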

The administrator authority user account can receive a list of user accounts to be created, provided by the user of a first authority user account. After logging in to a master node or a slave node of the artificial intelligence experiment cloud platform, the administrator combines the teacher name and the class name corresponding to the user account list into a combined name, and then creates a namespace with that combined name through a slave node of the platform. Intuitively, the administrator creates a class-specific namespace on the platform for the class taught by the teacher with that name. The administrator can then create second authority user accounts according to the student name list (or student number list) included in the user account list, so that the total number of student names in the list equals the total number of second authority user accounts.
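As an illustration of the naming step, the sketch below derives a namespace name from a teacher name and a class name and generates one account name per student. Kubernetes namespace names must be DNS labels (lowercase alphanumerics and hyphens, at most 63 characters), so the sketch normalizes the combined name accordingly; it assumes ASCII input, and the account naming scheme is hypothetical.

```python
import re


def namespace_name(teacher, class_name):
    """Combine teacher and class names into a DNS-label-style name."""
    raw = f"{teacher}-{class_name}".lower()
    # Replace every run of disallowed characters with a single hyphen.
    name = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    # Enforce the 63-character limit without leaving a trailing hyphen.
    name = name[:63].rstrip("-")
    if not name:
        raise ValueError("cannot derive a namespace name from the input")
    return name


def student_accounts(namespace, students):
    """Create one account name per student (assumed naming scheme)."""
    return [f"{namespace}-{i:03d}" for i, _ in enumerate(students, start=1)]
```

The number of generated accounts always equals the number of students in the list, which matches the one-to-one correspondence described above.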

Of course, when a plurality of containers are created in each namespace according to actual requirements, either the administrator of the administrator authority user account or the teacher of the first authority user account may create the corresponding number of containers according to the needs of the artificial intelligence experiment task and the student name list included in the user account list. The information about the namespaces in the artificial intelligence experiment cloud platform is stored in the master node, and the containers corresponding to each namespace are deployed in the slave nodes of the platform.

After the namespace of a class, such as class A, has been created in a slave node of the artificial intelligence experiment cloud platform, the teacher of the first authority user account can then generate a unified experiment task deployment instruction according to the requirements of the artificial intelligence experiment task; the instruction specifies a first target image resource, a GPU resource and a data storage volume path. The resource container layer then creates containers according to the first target image resource, the GPU resource and the data storage volume path specified in the unified experiment task deployment instruction.

The first target image resource may be one or more framework images, such as TensorFlow, Caffe, PyTorch or other frameworks required by experiments in the artificial intelligence field, or one or more neural network model images, such as a deep neural network (DNN), a convolutional neural network (CNN), or an object detection network such as YOLOv1 to YOLOv5. More specifically, the first target image resource may be a TensorFlow framework image with a convolutional neural network (CNN) deployed on the TensorFlow framework.

When a teacher corresponding to the first authority user account generates a unified experimental task deployment instruction according to the requirements of the artificial intelligent experimental task, and the creation of a plurality of containers corresponding to the namespaces is correspondingly completed in the resource container layer, the initial environment construction of the artificial intelligent experimental task is completed.

Of course, after the namespace of a class, such as class A, has been created in the target slave node of the artificial intelligence experiment cloud platform, the student of a second authority user account can also generate a personalized container deployment instruction according to the student's personalized requirements for the artificial intelligence experiment task. The slave node of the platform sends the personalized container deployment instruction to the user terminal used by the teacher of the first authority user account. When the teacher approves the instruction on the user terminal, the target slave node of the platform creates a container according to the second target image resource corresponding to the personalized container deployment instruction and a pre-stored data storage volume path. Similarly, the second target image resource may be one or more framework images, such as TensorFlow, Caffe, PyTorch or other frameworks required by experiments in the artificial intelligence field, or one or more neural network model images, such as a deep neural network (DNN), a convolutional neural network (CNN), or an object detection network such as YOLOv1 to YOLOv5. More specifically, the second target image resource may be a TensorFlow framework image with a YOLOv5 object detection network deployed on the TensorFlow framework.

Moreover, when the student of the second authority user account generates a personalized container deployment instruction, GPU resources are not allocated by default; that is, the containers created in the target slave node based on the personalized container deployment instruction are ordinary server containers rather than GPU server containers. If the container requirements in the personalized container deployment instruction include the use of a GPU server container, the instruction is sent to the user terminal used by the teacher of the first authority user account, and when the teacher approves the instruction on the user terminal, the target slave node of the artificial intelligence experiment cloud platform creates a container according to the second target image resource, the GPU resource and the pre-stored data storage volume path corresponding to the personalized container deployment instruction.
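The approval and GPU allocation logic for personalized instructions can be sketched as follows. The request type, its field names, and the default of one GPU per approved GPU request are assumptions introduced for illustration; the application does not state how many GPUs are granted.

```python
from dataclasses import dataclass


@dataclass
class PersonalizedRequest:
    student: str
    image: str
    wants_gpu: bool = False
    approved: bool = False   # set when the teacher approves on the terminal


def gpus_to_allocate(request, gpus_if_approved=1):
    """Return the GPU count for the container, or None while unapproved."""
    if not request.approved:
        # No container is created until the teacher approves the request.
        return None
    # Ordinary server container by default; GPU only when requested.
    return gpus_if_approved if request.wants_gpu else 0
```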

In an embodiment, the target slave node obtaining a first target image resource, a GPU resource and a data storage volume path corresponding to the unified experiment task deployment instruction, including:

If the target slave node detects the unified experiment task deployment instruction, it acquires the teaching progress information and the teacher teaching label set corresponding to the unified experiment task deployment instruction, and generates the first target image resource, the GPU resource and the data storage volume path corresponding to the unified experiment task deployment instruction according to the teaching progress information, the teacher teaching label set and a preset resource calling strategy.

In this embodiment, when the target slave node detects the unified experiment task deployment instruction, it may parse the instruction to determine whether the instruction includes teaching progress information (for example, studying chapter 1, section 5 of the AA artificial intelligence course) and a teacher teaching label set (for example, labels such as face recognition and convolutional neural network). If parsing the unified experiment task deployment instruction yields teaching progress information and a teacher teaching label set, the container information required for creating a container, such as the first target image resource, the GPU resource and the data storage volume path, is generated according to the teaching progress information, the teacher teaching label set and the preset resource calling strategy. The resource calling strategy can be understood as a preset mapping table whose entries each contain teaching progress information, a teacher teaching label set, a target image resource, a GPU resource and a data storage volume path. The corresponding target image resource, GPU resource and data storage volume path can be queried by using the teaching progress information and the teacher teaching label set as retrieval conditions.
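By way of a non-limiting sketch, the resource calling strategy may be represented as a mapping table keyed by the teaching progress information and the teacher teaching label set. The identifiers `ContainerResources`, `RESOURCE_STRATEGY` and `lookup_resources`, as well as the example image name and volume path, are assumptions made for illustration; the application does not specify the table layout.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Optional, Tuple


class ContainerResources(NamedTuple):
    """Container information retrieved from the strategy table."""
    image: str        # first target image resource
    gpu_count: int    # GPU resource
    volume_path: str  # data storage volume path


StrategyKey = Tuple[str, FrozenSet[str]]

# Hypothetical preset mapping table: (progress, label set) -> resources.
RESOURCE_STRATEGY: Dict[StrategyKey, ContainerResources] = {
    (
        "chapter-1/section-5",
        frozenset({"face recognition", "convolutional neural network"}),
    ): ContainerResources("tensorflow-cnn:example", 1, "/data/class-a"),
}


def lookup_resources(
    progress: str,
    labels: Iterable[str],
    strategy: Dict[StrategyKey, ContainerResources] = RESOURCE_STRATEGY,
) -> Optional[ContainerResources]:
    """Query the table using progress and labels as retrieval conditions."""
    # A frozenset makes the lookup independent of the order of the labels.
    return strategy.get((progress, frozenset(labels)))
```

Using a frozen set for the label part of the key means that the same labels listed in a different order retrieve the same entry; an instruction with no matching entry returns `None` and can be handled separately.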

S102, if the target slave node receives the access request of the user terminal and passes verification, connecting the target container instance corresponding to the target container with the user terminal.

In this embodiment, after the target slave node completes the deployment of the target container and the target container instance, users can be given access to process artificial intelligence related experiment tasks. When a user needs to access a target container in a slave node, the user terminal sends an access request carrying user account information to the target slave node. When the access request passes verification by the target slave node, a communication connection is established between the target container instance corresponding to the target container and the user terminal. The user terminal can then access the target slave node to process the artificial intelligence related experiment task.
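The verification step is not detailed in the application. Purely as an assumed example, the sketch below checks an account token against stored credentials and, on success, returns the identifier of the container instance to connect; `handle_access_request`, the credential table and the instance table are all hypothetical.

```python
import hmac
from typing import Dict, Optional


def handle_access_request(
    account: str,
    token: str,
    credentials: Dict[str, str],
    instances: Dict[str, str],
) -> Optional[str]:
    """Return the container instance for a verified account, else None."""
    expected = credentials.get(account)
    if expected is None:
        return None  # unknown account: verification fails
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return None
    # Verification passed: look up the instance bound to this account.
    return instances.get(account)
```

A `None` result means that no connection is established, either because verification failed or because no container instance has been deployed for the account.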

S103, the target slave node receives target operation data of the user terminal, and stores the target operation data into a key value database corresponding to the target container.

In this embodiment, after the user terminal is connected to the target container instance in the target slave node, the target container instance may receive the target operation data of the user terminal. To improve the security of the target operation data, the data may be stored in a key-value database (for example, an etcd database, which is one type of key-value database) corresponding to the target container in the target master node. In this way, even if the target slave node fails and stops running, the operation data of each container it hosts remains stored in the key-value database of the target master node. When the target slave node returns to normal after the fault is cleared, all data of the target slave node is retrieved from the key-value database of the target master node to perform breakpoint recovery.
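As a minimal sketch, and assuming an in-memory dictionary as a stand-in for the key-value database on the target master node, the storage and breakpoint recovery described above may look as follows. The class `MasterKeyValueStore`, the functions `record_operation` and `recover_node`, and the key layout `/nodes/<node>/containers/<container>/ops/<sequence>` are hypothetical and are not taken from the application or from the etcd API.

```python
from typing import Dict


class MasterKeyValueStore:
    """In-memory stand-in for the key-value database on the master node."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get_prefix(self, prefix: str) -> Dict[str, str]:
        """Return all entries whose key starts with the prefix, sorted by key."""
        return {k: v for k, v in sorted(self._data.items()) if k.startswith(prefix)}


def record_operation(
    store: MasterKeyValueStore, node: str, container: str, seq: int, payload: str
) -> None:
    """Store one piece of target operation data under the container's key."""
    # Zero-padded sequence numbers keep the keys in chronological order.
    store.put(f"/nodes/{node}/containers/{container}/ops/{seq:08d}", payload)


def recover_node(store: MasterKeyValueStore, node: str) -> Dict[str, str]:
    """Fetch every entry of a slave node for breakpoint recovery."""
    return store.get_prefix(f"/nodes/{node}/")
```

Because the data lives on the master node, `recover_node` still returns the recorded operations after the slave node that produced them has stopped and restarted.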

In one embodiment, step S103 further includes:

and if the target master node detects that the current working state is an abnormal state, randomly selecting one master node from the rest multiple master nodes to serve as the target master node.

In this embodiment, in general only one master node in the cluster is working and performing data processing at any given time, and the other master nodes are in an inactive standby state. If the working target master node (that is, the Leader-Master-Node) becomes abnormal, the cluster corresponding to the artificial intelligence experiment cloud platform automatically selects one master node from the master nodes in the standby state. The selected node immediately replaces the abnormal master node, becomes the new active master node (the new Leader-Master-Node), and continues the current work. Therefore, the failure of any single master node does not affect the operation of the cluster as a whole.
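The random selection of a replacement master node may be sketched as below. This follows only the selection rule stated in this embodiment and is not the election protocol of any particular cluster manager; the function name `elect_new_leader` and the node names are hypothetical.

```python
import random
from typing import Optional, Sequence


def elect_new_leader(
    masters: Sequence[str],
    failed_leader: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Randomly pick a standby master to replace the abnormal leader."""
    chooser = rng if rng is not None else random.Random()
    # Every master other than the abnormal one is a standby candidate.
    standby = [name for name in masters if name != failed_leader]
    if not standby:
        raise RuntimeError("no standby master node is available")
    return chooser.choice(standby)
```

Passing a seeded `random.Random` instance makes the choice reproducible in tests, while omitting it gives an unseeded random choice among the standby nodes.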

In one embodiment, step S103 further includes:

and if the target slave node detects that the current working state is an abnormal state, acquiring current node container data and storing the current node container data into a key value database corresponding to the target slave node in a target master node.

In this embodiment, when the current working state of the target slave node is an abnormal state, before the target slave node is restarted to clear the fault, the current node container data of the target slave node may be obtained and stored in the key-value database corresponding to the target slave node in the target master node as a data backup. A persistent volume claim (PVC) and a persistent volume (PV) are created for each user in the key-value database. Whether the user closes or exits the target container instance deliberately, or the target container instance stops because of a failure, the current node container data generated by the user's operations on the target container instance in the target slave node is automatically stored, before the close or exit takes effect, in the key-value database corresponding to the target slave node in the target master node as a data backup. When the user next re-enters the target container instance, the artificial intelligence experiment cloud platform automatically retrieves the current node container data from the key-value database and restores the target container instance to the state it was in when the user last exited, so that the user can continue operating on the target container instance.

Because all node container data of each container is persisted into a dynamically created persistent volume, the artificial intelligence experiment cloud platform automatically saves historical node container data into the persistent volume whether the user actively exits the container instance or passively exits it because of a fault. After re-entering the container, the user can retrieve the historical node container data and continue operating on the container instance.
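The save-on-exit behaviour described above, covering both active exit and passive exit caused by a fault, may be illustrated with the following sketch. It assumes a plain dictionary as a stand-in for the persistent volume and JSON as the serialization format; `container_session` is a hypothetical name and is not part of the application.

```python
import json
from contextlib import contextmanager
from typing import Any, Dict, Iterator


@contextmanager
def container_session(
    volume: Dict[str, str], user: str
) -> Iterator[Dict[str, Any]]:
    """Restore a user's container state on entry and persist it on exit."""
    # Restore the state saved when the user last left the instance, if any.
    state: Dict[str, Any] = json.loads(volume.get(user, "{}"))
    try:
        yield state
    finally:
        # Runs on a normal exit and also when a fault raises an exception.
        volume[user] = json.dumps(state)
```

Placing the save in the `finally` clause is what makes the two exit paths equivalent: the state is written back whether the `with` block ends normally or is interrupted by an exception.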

S104, the target master node sends a container operation instruction to the target slave node.

In this embodiment, in addition to operating on artificial intelligence related experiment data in a container in the target slave node, a user may access the target master node (for example, by logging in to the target master node with a user account that has administrator authority) and trigger the generation of a container operation instruction for an operation such as adding or deleting a container. After the container operation instruction for adding or deleting a container has been generated, the target master node sends the container operation instruction to the target slave node.

And S105, if the target slave node receives the container operation instruction, correspondingly creating or deleting a container according to the container operation instruction.

In this embodiment, cluster nodes can be added or removed at any time based on the container operation instruction in the artificial intelligence experiment cloud platform, so that Pods can be conveniently scaled out and in through the cluster controller. More specifically, when resources in the target slave node are strained, more slave nodes are added and containers are created in those slave nodes to expand capacity.
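A simplified sketch of the capacity expansion described above is given below. It assumes that resource strain is measured as a fixed number of containers per slave node, which is an assumption of this example rather than something stated in the application; `place_container` and the `slave-N` naming scheme are hypothetical.

```python
from typing import Dict


def place_container(load: Dict[str, int], capacity: int) -> str:
    """Place one container, adding a new slave node when all are full.

    `load` maps each slave node name to its current container count and
    is updated in place; the name of the chosen node is returned.
    """
    for name, used in load.items():
        if used < capacity:
            load[name] = used + 1
            return name
    # Every existing slave node is at capacity: expand the cluster.
    index = len(load) + 1
    while f"slave-{index}" in load:
        index += 1
    new_name = f"slave-{index}"
    load[new_name] = 1
    return new_name
```

Starting from `{"slave-1": 2}` with a capacity of 2, the first call adds `slave-2` and places the container there, and the next call reuses `slave-2` because it still has free capacity.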

The method realizes the processing of the related experimental tasks of the artificial intelligence based on the cloud in the artificial intelligence experimental cloud platform, and can add or delete nodes to the cluster at any time, thereby improving the high availability and the load capacity of the cluster.

The embodiment of the application also provides a high-availability cluster architecture artificial intelligence experiment cloud platform data processing system, which is used for executing any embodiment of the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method. In particular, referring to fig. 3, fig. 3 is a schematic block diagram of a high availability cluster architecture artificial intelligence experiment cloud platform data processing system 100 provided in an embodiment of the present application.

Wherein, as shown in fig. 3, the high availability cluster architecture artificial intelligence experiment cloud platform data processing system 100 comprises a plurality of master nodes 101 and a plurality of slave nodes 102. The target slave node is any slave node in the plurality of slave nodes 102, and the target master node is a master node in the plurality of master nodes 101 that is currently in an active state.

The target slave node is configured to create a target container according to the experimental work task deployment instruction if the experimental work task deployment instruction sent by the target master node is received; the target slave node is any one of the plurality of slave nodes, and the target master node is the master node that is currently in an active state among the plurality of master nodes.

In one embodiment, the target master node is further configured to establish a communication connection with the Kubelet proxy component of the target slave node through an interface module in the target master node.

In an embodiment, the target slave node is further configured to:

And the target slave node is further used for connecting the target container instance corresponding to the target container with the user terminal if the access request of the user terminal is received and passes verification.

And the target slave node is also used for receiving target operation data of the user terminal and storing the target operation data into a key value database corresponding to the target container.

In an embodiment, the target master node is further configured to randomly select one master node from the remaining multiple master nodes to be the target master node if the current working state is detected to be an abnormal state.

In an embodiment, the target slave node is further configured to, if the current working state is detected to be an abnormal state, obtain current node container data and store the current node container data to a key value database corresponding to the target slave node in the target master node.

And the target master node is used for sending the container operation instruction to the target slave node.

And the target slave node is further used for correspondingly creating or deleting the container according to the container operation instruction if the container operation instruction is received.

The system realizes the processing of the related experimental tasks of the artificial intelligence based on the cloud in the artificial intelligence experimental cloud platform, and can add or delete nodes to the cluster at any time, thereby improving the high availability and the load capacity of the cluster.

The high availability cluster architecture artificial intelligence experiment cloud platform data processing system described above may be implemented in the form of a computer program that can run on a computer device as shown in fig. 4.

Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 is a server or a server cluster. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.

Referring to fig. 4, the computer apparatus 500 includes a processor 502, a memory, and a network interface 505, which are connected by a device bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.

The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a high availability cluster architecture artificial intelligence experiment cloud platform data processing method.

The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.

The internal memory 504 provides an environment for the execution of a computer program 5032 in the storage medium 503, which computer program 5032, when executed by the processor 502, enables the processor 502 to perform the high availability cluster architecture artificial intelligence experiment cloud platform data processing method.

The network interface 505 is used for network communication, such as providing for transmission of data information, etc. Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device 500 to which the present application is applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

The processor 502 is configured to execute a computer program 5032 stored in a memory, so as to implement the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method disclosed in the embodiment of the application.

Those skilled in the art will appreciate that the embodiment of the computer device shown in fig. 4 is not limiting of the specific construction of the computer device, and in other embodiments, the computer device may include more or less components than those shown, or certain components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor, and in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in fig. 4, and will not be described again.

It should be appreciated that in embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

In another embodiment of the present application, a computer-readable storage medium is provided. The computer readable storage medium may be a nonvolatile computer readable storage medium or a volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program realizes the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method disclosed by the embodiment of the application when being executed by a processor.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus, device, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another apparatus, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present application.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a background server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program code.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will readily conceive of various equivalent changes and substitutions without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. The data processing method of the artificial intelligent experimental cloud platform with the high-availability cluster architecture is applied to the artificial intelligent experimental cloud platform and is characterized in that the artificial intelligent cloud platform comprises a plurality of master nodes and a plurality of slave nodes, and the plurality of master nodes and the plurality of slave nodes are all in communication connection; the method comprises the following steps:

2. The method for processing data of the artificial intelligence experiment cloud platform with the high availability cluster architecture according to claim 1, wherein the creating the target container according to the experimental work task deployment instruction comprises:

3. The method for processing data of an artificial intelligence experiment cloud platform with a high availability cluster architecture according to claim 1, wherein if the target slave node receives an experiment task deployment instruction sent by the target master node, before the target container is correspondingly created according to the experiment task deployment instruction, the method further comprises:

4. The method for processing data of the artificial intelligence experiment cloud platform with the high availability cluster architecture according to claim 3, wherein the communication connection between the interface module in the target master node and the Kubelet proxy component of the target slave node is established, comprising:

5. The method for processing high-availability cluster architecture artificial intelligence experiment cloud platform data according to claim 1, wherein the target slave node receives target operation data of the user terminal, and after storing the target operation data in a key value database corresponding to the target container, the method further comprises:

6. The method for processing high-availability cluster architecture artificial intelligence experiment cloud platform data according to claim 1, wherein the target slave node receives target operation data of the user terminal, and after storing the target operation data in a key value database corresponding to the target container, the method further comprises:

7. The method for processing data of the artificial intelligence experiment cloud platform with the high availability cluster architecture according to claim 2, wherein the target slave node obtains a first target mirror resource, a GPU resource and a data storage volume path corresponding to the unified experiment task deployment instruction, and the method comprises the following steps:

8. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing system operates on an artificial intelligence experiment cloud platform and is characterized by comprising a plurality of master nodes and a plurality of slave nodes, wherein the master nodes and the slave nodes are all in communication connection; the target slave node is any slave node in the plurality of slave nodes, and the target master node is a master node in an active state in the plurality of master nodes;

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the high availability cluster architecture artificial intelligence experiment cloud platform data processing method of any of claims 1 to 7 when the computer program is executed by the processor.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which when executed by a processor causes the processor to perform the high availability cluster architecture artificial intelligence experiment cloud platform data processing method according to any of claims 1 to 7.