WO2018224005A1

WO2018224005A1 - Package deployment method, electronic device and distributed system

Info

Publication number: WO2018224005A1
Application number: PCT/CN2018/090263
Authority: WO
Inventors: 周智强; 彭剑峰; 郑星; 叶挺群; 李鹏飞
Original assignee: 杭州海康威视数字技术股份有限公司
Priority date: 2017-06-08
Filing date: 2018-06-07
Publication date: 2018-12-13
Also published as: CN109032610A; CN109032610B

Abstract

A package deployment method, an electronic device, and a distributed system. The method is applied to a first computing node, and comprises: receiving training session information (S101), wherein the training session information comprises information on each computing node executing a training session; determining, according to the training session information, whether the first computing node is in a master state (S102); and if yes, obtaining a training package, and deploying the obtained training package to each computing node executing the training session (S103). As such, only a computing node in a master state obtains a training package, and the obtained training package is only deployed to the computing node executing a training session. Since it is not necessary for all computing nodes to obtain the training package from a management device, the method reduces pressure on network bandwidth.

Description

Package deployment method, electronic device and distributed system

This application claims priority to Chinese Patent Application No. 201710429234.1, filed with the Chinese Patent Office on June 8, 2017 and entitled "A Package Deployment Method, Electronic Equipment, and Distributed System", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of machine learning technology, and in particular, to a package deployment method, an electronic device, and a distributed system.

Background technique

Machine learning is an important technical means of realizing artificial intelligence. It works mainly by training on a large amount of data so that the machine acquires the ability of intelligent recognition. Because the learning and training process involves a large amount of data, distributed systems are usually used for data training.

Before data training in a distributed system, it is usually necessary to deploy the packages required for training in each computing node of the system. After the package is deployed, each computing node can coordinate training. Generally, a management device is set up, and the management device obtains a training package and sends the package to each computing node in the system.

That is to say, every computing node in the system obtains the package from the management device, so the network bandwidth pressure between the management device and the computing nodes is relatively high.

Summary of the invention

The purpose of the embodiments of the present application is to provide a package deployment method, an electronic device, and a distributed system to reduce network bandwidth pressure.

To achieve the above objective, the embodiment of the present application provides a package deployment method, which is applied to a first computing node in a distributed system, where the method includes:

Receiving training task information, where the training task information includes information about each computing node that performs the training task;

Determining, according to the training task information, whether a state of the first computing node is an active state;

If in the active state, the training package is obtained and the acquired training package is deployed to each of the computing nodes performing the training task.

Optionally, after receiving the training task information, the method may further include:

Parsing the training task information, obtaining a storage address of the training package, status information of each computing node performing the training task, and a device address;

The determining, according to the training task information, whether the state of the first computing node is an active state may include:

Searching, in the status information of each computing node, for the status information corresponding to the device address of the first computing node;

Determining whether the found status information indicates the active state;

The obtaining the training package may include:

Obtaining the training package according to a storage address of the training package;

The deploying the acquired training package to each computing node that performs the training task may include:

Deploying the training package to each computing node according to the device address of each computing node.

Optionally, the method may further include:

If the state of the first computing node is the active state, after detecting that each of the computing nodes performing the training task has successfully deployed the training package, generating a tag file and sending the tag file to each computing node.

Optionally, the method may further include:

If the state of the first computing node is the active state and a computing node that failed to deploy the training package is detected, outputting first prompt information for prompting the deployment failure.

Optionally, the method may further include:

If the state of the first computing node is not the active state, determining whether the tag file is received within a preset time period;

If not, outputting second prompt information for prompting the deployment failure.

Optionally, the method may further include:

After each of the computing nodes performing the training task successfully deploys the training package, the training package is run to perform data training.

Optionally, the deploying the acquired training package to each computing node that performs the training task may include:

The acquired training package is deployed to each computing node performing the training task through InfiniBand.

In order to achieve the above objective, an embodiment of the present application further provides an electronic device, including: a memory and a processor, where

The memory is configured to store a computer program;

The processor is configured to implement any of the above package deployment methods when executing the program stored in the memory.

To achieve the above objective, an embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, and the computer program, when executed by a processor, implements any one of the foregoing package deployment methods.

To achieve the above objective, the embodiment of the present application further provides a distributed system, including: at least two computing nodes;

The computing node is configured to receive training task information, where the training task information includes information about each computing node that performs the training task; determine, according to the training task information, whether its own state is a primary state; and, if so, obtain a training package and deploy the acquired training package to each computing node that performs the training task.

Optionally, the system further includes: a management node;

The management node is configured to acquire and store a training package; add a storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task;

The computing node may be specifically configured to:

Receive the training task information sent by the management node; parse the training task information to obtain the storage address of the training package, and the status information and device address of each computing node performing the training task; search, in the status information of each computing node, for the status information corresponding to its own device address; determine whether the found status information indicates the active state; and, if so, obtain the training package according to the storage address of the training package and deploy the training package to each computing node according to the device address of each computing node.

Optionally, the computing node is further configured to:

In the case where its own state is the primary state:

If it is detected that each of the computing nodes that perform the training task successfully deploys the training package, generate a tag file, and send the tag file to each computing node;

If it is detected that there is a computing node that fails to deploy the training package, the first prompt information for prompting the deployment failure is sent to the management node.

Optionally, the computing node is further configured to:

In the case where its own state is not the primary state, determine whether the tag file is received within a preset time period;

If not, send second prompt information for prompting the deployment failure to the management node.

Optionally, the computing node is further configured to:

Optionally, the computing nodes performing the training task are connected to one another via InfiniBand.

To achieve the above objective, an embodiment of the present application further discloses executable program code which, when run, performs any of the above package deployment methods.

With the embodiments of the present application, only the computing node in the primary state acquires the training package, and it deploys the acquired training package to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.

Of course, implementing any of the products or methods of the present application does not necessarily require that all of the advantages described above be achieved at the same time.

Drawings

In order to illustrate the embodiments of the present application and the technical solutions of the prior art more clearly, the drawings used in the embodiments and in the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.

FIG. 1 is a first schematic flowchart of a method for deploying a package according to an embodiment of the present application;

FIG. 2 is a second schematic flowchart of a method for deploying a package according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a first structure of a distributed system according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a second structure of a distributed system according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a third structure of a distributed system according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a fourth structure of a distributed system according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of an embodiment provided by an embodiment of the present application.

Detailed description

In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

In order to solve the above technical problem, the embodiments of the present application provide a package deployment method, an electronic device, and a distributed system. The distributed system provided by the embodiments of the present application may include at least two computing nodes (computing node 1, computing node 2, ... computing node n) as shown in FIG. 4; it may include multiple computing nodes and a management node as shown in FIG. 5; or it may include multiple computing nodes, a management node, and switches as shown in FIG. 6 and FIG. 7. The specific structure is not limited.

A method for deploying a package provided by the embodiment of the present application is described in detail below. The method can be applied to any computing node in the distributed system. For convenience of description, in the embodiment of FIGS. 1 and 2, the computing node performing the method is referred to as a first computing node.

FIG. 1 is a schematic flowchart of a method for deploying a package according to an embodiment of the present application, including:

S101: Receive training task information, where the training task information includes information about each computing node that performs the training task.

The training task can be used for learning and training a large amount of data in various machine learning processes, for example, deep learning based on artificial neural networks. Before the training task is performed, the training package can be deployed in the distributed system by using the solution. After the deployment is completed, the computing nodes in the system can perform the training task.

The training task information includes information about each computing node that performs this training task. The computing nodes performing this training task may be only some of the computing nodes in the system. As an implementation manner, each time machine learning is required (that is, a training task is to be executed), the user may designate some of the computing nodes in the system to perform the training task according to actual conditions. Alternatively, the computing nodes in the system may be grouped, with the computing nodes in the same group performing the same training task. Alternatively, all computing nodes in the system may be determined as the computing nodes performing the training task. There are many ways to determine the computing nodes that perform a training task, and the embodiments of the present application do not limit this.

In the system shown in FIG. 4, the user equipment may directly store the training task information on each computing node that performs the training task. Alternatively, a management device may be set up outside the system; the management device acquires the training task information, analyzes it to determine each computing node that performs the training task, and sends the training task information to each of those computing nodes.

As an implementation manner, the management device may obtain the training task information by way of the user equipment. For example, the training task information is stored in a web client, and the management device acquires the training task information from the web client. The computing node in the primary state then obtains the training package from the management device.

In the system shown in FIG. 5, the management node may acquire training task information, analyze the training task information, determine each computing node that performs the training task, and send the training task information to each computing node that performs the training task. As an implementation manner, the management node may obtain the training task information by using the user equipment, for example, the training task information is stored in the web client, and the management node obtains the training task information from the web client, which is not limited.

In the embodiments of the present application, to keep the two cases distinct, a management device set inside the system is referred to as a management node, and a management device disposed outside the system is referred to as a management device.

S102: Determine, according to the training task information, whether a state of the first computing node is an active state, and if yes, execute S103.

As an implementation manner, the training task information received by each computing node performing the training task may be different. That is, the user equipment may store, on each computing node, the training task information corresponding to that computing node, or the management device or the management node may send each computing node the training task information corresponding to that computing node.

For example, the training task information stored on computing node 1 by the user equipment may contain only the state information of computing node 1, the training task information stored on computing node 2 may contain only the state information of computing node 2, and so on. Likewise, the training task information sent by the management device or the management node to computing node 1 may include only the state information of computing node 1, the training task information sent to computing node 2 may include only the state information of computing node 2, and so on.

Or, as another implementation manner, the training task information received by each computing node that performs the training task is the same, and includes the state information of every computing node that performs the training task.

A computing node may be in either an active state (also called the primary or main state) or a dependent (slave) state. A computing node in the active state may be referred to as a master node, and a computing node in the dependent state may be referred to as a slave node. The computing node can determine whether its own state is the active state or the dependent state according to the training task information received in S101. If it is the active state, S103 is executed.

S103: Acquire a training package, and deploy the acquired training package to each computing node that performs the training task.

In the embodiment of the present application, the training package is acquired only when the state of the computing node is the active state. As an implementation manner, a training package may be stored in the web client, and the web client sends the training package to the management node or the management device. The computing node in the primary state then obtains the training package from the management node or the management device.

As an implementation manner, the computing nodes in the system are connected to one another via InfiniBand. Therefore, the master node (the computing node in the active state) can deploy the acquired training package to each computing node that performs the training task through InfiniBand.

Those skilled in the art can understand that the InfiniBand architecture is a switched-fabric ("conversion cable") technology that supports multiple concurrent links, and that a network based on the InfiniBand architecture has very high bandwidth. In this embodiment, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. On the one hand, this improves the efficiency of training package deployment; on the other hand, it makes full use of the bandwidth capability of InfiniBand.

Alternatively, the master node can also complete the deployment of the slave training package by other means, such as Ethernet. Ethernet and InfiniBand networks can exist at the same time. That is to say, in one embodiment, each computing node in the system can exchange data through Ethernet and InfiniBand network.

If every computing node in the system obtained the package from the management device (management node) when the package is deployed, network congestion would result. For example, suppose packages for several training tasks are deployed in one system: computing nodes 1 to 5 need to deploy the package for training task A, computing nodes 6 to 10 need to deploy the package for training task B, and computing nodes 11 to 15 need to deploy the package for training task C. If all 15 computing nodes had to obtain the package from the management device, the network bandwidth pressure between the management device and the computing nodes would be relatively high.

In the embodiment of the present application, one master node is specified for each training task, and only the master node obtains the package from the management device. That is, only three computing nodes obtain a package from the management device, which reduces the network bandwidth pressure between the management device and the computing nodes.

On the other hand, after the master node obtains the package, it deploys the package to the slave nodes performing the training task. The data interaction between the master node and the slave nodes is different from the data interaction between a computing node and the management device: data interaction between the master node and the slave nodes can use the InfiniBand network or another network internal to the system, which has high bandwidth and high speed and thus improves the efficiency of package deployment.

With the embodiment shown in FIG. 1 of the present application, only the computing node in the primary state acquires the training package, and it deploys the acquired training package to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
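The reduction in the example above can be checked with a short count. The following Python sketch is illustrative only: the function name `management_fetches` and the dictionary layout are assumptions introduced here, not part of the application.

```python
# Hypothetical illustration of the 15-node, 3-task example.
def management_fetches(task_to_nodes, master_only):
    """Count how many package downloads reach the management device."""
    if master_only:
        # One master node per training task fetches the package.
        return len(task_to_nodes)
    # Otherwise every computing node fetches its own copy.
    return sum(len(nodes) for nodes in task_to_nodes.values())


tasks = {
    "A": list(range(1, 6)),    # computing nodes 1 to 5
    "B": list(range(6, 11)),   # computing nodes 6 to 10
    "C": list(range(11, 16)),  # computing nodes 11 to 15
}

print(management_fetches(tasks, master_only=False))
print(management_fetches(tasks, master_only=True))
```

Under these assumptions the count drops from one download per computing node to one download per training task.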

FIG. 2 is a second schematic flowchart of a method for deploying a package according to an embodiment of the present disclosure, including:

S201: Receive training task information.

S202: Parse the training task information, obtain a storage address of the training package, status information of each computing node that performs the training task, and a device address.

As an implementation manner, the web client may store a training package and corresponding training task information, where the training task information includes status information and a device address of each computing node that performs the training task.

The management node or the management device obtains the training package and the training task information from the web client. The management node or management device stores the training package to a location and adds the storage address to the training task information. In addition, the management node or the management device may further parse the training task information, determine each computing node that performs the training task, and send the training task information with the added storage address to the determined computing nodes, that is, execute the training task. Each computing node.

In this way, the training task information received by the first computing node that executes the method includes the storage address of the training package, and the first computing node parses the received training task information to obtain the storage address.
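As a rough sketch of this exchange, the management node could attach the storage address to the training task information before sending it, and the computing node could parse it on receipt. The JSON encoding, the field names `package_address` and `nodes`, and the example address are all assumptions made for illustration; the application does not specify a message format.

```python
import json


def add_storage_address(task_info: dict, storage_address: str) -> str:
    """Management side: attach the package storage address and serialize."""
    enriched = dict(task_info)  # copy so the caller's dict is left untouched
    enriched["package_address"] = storage_address
    return json.dumps(enriched)


def parse_task_info(message: str):
    """Computing-node side (S202): recover the storage address and node list."""
    info = json.loads(message)
    return info["package_address"], info["nodes"]


# Hypothetical task information as it might arrive from the web client.
task_info = {
    "nodes": [
        {"status": "active", "address": "100.4.5.6"},
        {"status": "dependent", "address": "100.8.2.3"},
    ]
}
message = add_storage_address(task_info, "/shared/packages/task_a.tar")
address, nodes = parse_task_info(message)
```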

In this embodiment, the "information about each computing node that performs the training task" included in the training task information may contain the status information and the device address of each computing node. As described above, the state of a computing node may be either a main state or a slave state. A computing node in the main state may be referred to as a master node, and a computing node in the slave state may be referred to as a slave node. The device address may be, for example, the IP address or the MAC address of the device.

S203: In the status information of each computing node, search for status information corresponding to the device address of the first computing node.

S204: Determine whether the found status information is an active status, and if yes, execute S205.

There is a correspondence between the state information of the computing node and the device address that is obtained in S202. For example, as shown in Table 1:

Table 1

Status information of the computing node	IP address of the computing node
Main state	100.4.5.6
Dependent state	100.8.2.3
Dependent state	100.6.5.2
......	......

The above Table 1 is merely illustrative and is not intended to limit the embodiment.

Assuming that the IP address of the first computing node performing the method is 100.4.5.6, the first computing node looks up its own IP address in the parsing result of S202 (which includes Table 1 above) and finds that its corresponding state is the primary state. The determination result of S204 is therefore yes, and S205 is executed.
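The lookup in S203 and S204 can be sketched as follows. This is a minimal illustration using the rows of Table 1; the list-of-dicts layout and the names `find_status` and `is_master` are assumptions, not the application's own data structures.

```python
from typing import Optional

ACTIVE = "active"
DEPENDENT = "dependent"

# Parsed result of S202, mirroring Table 1 (field names are assumptions).
node_table = [
    {"status": ACTIVE, "ip": "100.4.5.6"},
    {"status": DEPENDENT, "ip": "100.8.2.3"},
    {"status": DEPENDENT, "ip": "100.6.5.2"},
]


def find_status(table, own_ip) -> Optional[str]:
    """S203: return the status recorded for own_ip, or None if it is absent."""
    for entry in table:
        if entry["ip"] == own_ip:
            return entry["status"]
    return None


def is_master(table, own_ip) -> bool:
    """S204: True only when the found status is the active state."""
    return find_status(table, own_ip) == ACTIVE
```

A node whose address does not appear in the table is treated here as not being the master; how such a case should really be handled is not stated in the text.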

S205: Acquire the training package according to a storage address of the training package.

The analysis result of S202 further includes a storage address of the training package, and according to the storage address, the training package can be obtained. It should be noted that, in this embodiment, only the computing node (master node) in the active state accesses the storage address to acquire the training package.

S206: Deploy the training package in each computing node according to the device address of each computing node.

The analysis result of S202 also includes the device address of each computing node that performs the training task, and the package acquired in S205 can be deployed to each computing node of the training task according to the device address.

As an implementation manner, the computing nodes in the system are connected to one another via InfiniBand. Therefore, the master node can deploy the acquired training package to each computing node that performs the training task through InfiniBand.

Those skilled in the art can understand that the InfiniBand architecture is a switched-fabric ("conversion cable") technology that supports multiple concurrent links, and that a network based on the InfiniBand architecture has very high bandwidth. In this embodiment, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. On the one hand, this improves the efficiency of training package deployment; on the other hand, it makes full use of the bandwidth capability of InfiniBand.

Alternatively, the master node can also deploy the training package to the slave nodes by other means, such as Ethernet. Optionally, the Ethernet and the InfiniBand network can exist at the same time; that is, in one embodiment, the computing nodes in the system can exchange data through both the Ethernet and the InfiniBand network.

With the embodiment shown in FIG. 2 of the present application, only the computing node in the primary state acquires the training package, and it deploys the acquired training package to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
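Steps S205 and S206 can be sketched as a single fetch followed by one copy per device address. The transport is deliberately left abstract: `fetch` and `copy_to` are hypothetical callables standing in for whatever mechanism (InfiniBand copy, Ethernet, shared storage) an implementation would use, and the addresses and failure reason are example values.

```python
def deploy_package(storage_address, device_addresses, fetch, copy_to):
    """S205-S206: fetch the package once, then copy it to every node.

    Returns {device address: None on success, or the failure reason}.
    """
    package = fetch(storage_address)  # only the master node does this
    results = {}
    for address in device_addresses:
        try:
            copy_to(address, package)
            results[address] = None
        except OSError as err:
            results[address] = str(err)  # keep the reason for later reporting
    return results


# Stand-in transports for demonstration only.
fetch_calls = []
received = {}


def fake_fetch(address):
    fetch_calls.append(address)
    return b"package-bytes"


def fake_copy(address, package):
    if address == "100.6.5.2":
        raise OSError("disk full")  # simulated failure on one node
    received[address] = package


results = deploy_package(
    "/shared/packages/task_a.tar",
    ["100.4.5.6", "100.8.2.3", "100.6.5.2"],
    fake_fetch,
    fake_copy,
)
```

Recording a per-node result rather than stopping at the first error is a design choice of this sketch; it lets the master node report every failed node at once.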

As an implementation manner, after S103 in FIG. 1 or after S206 in FIG. 2, the following solution may also be included:

If the state of the first computing node is the active state, then after the first computing node detects that each of the computing nodes performing the training task has successfully deployed the training package, it generates a tag file and sends the tag file to each of the computing nodes.

Those skilled in the art will understand that the embodiments shown in FIG. 1 and FIG. 2 only deploy the package, whereas this embodiment can further detect whether the deployment succeeded. In the foregoing embodiments, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. The master node can therefore determine whether each copy succeeded, and thereby check whether all packages of the training task were copied successfully.

After detecting that all the packages of the training task have been copied successfully, that is, after the training package has been deployed successfully, the master node may generate a tag file and send the tag file to each computing node performing the training task, so that the slave nodes learn that the deployment succeeded.
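A minimal sketch of the tag file step, assuming the master node has already recorded whether each copy succeeded. The file name suffix `.deployed`, the file contents, and the function name `write_tag_file` are hypothetical; the application does not describe the tag file's format, and sending the file to the other nodes is omitted here.

```python
import tempfile
from pathlib import Path


def write_tag_file(copy_ok: dict, directory: Path, task_id: str):
    """Create the tag file only if every node's copy succeeded; else return None."""
    if not copy_ok or not all(copy_ok.values()):
        return None
    tag_path = directory / f"{task_id}.deployed"
    tag_path.write_text("deployment succeeded\n")
    return tag_path


# Demonstration in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
tag = write_tag_file({"100.4.5.6": True, "100.8.2.3": True}, work_dir, "task_a")
no_tag = write_tag_file({"100.4.5.6": True, "100.6.5.2": False}, work_dir, "task_b")
```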

After each of the computing nodes performing the training task successfully deploys the training package, the training package is run to perform data training, that is, to perform a training task.

According to the above description, the training task can be a task of learning and training a large amount of data in various machine learning processes, for example, deep learning based on artificial neural network. Before the training task is performed, the training package can be deployed in the distributed system by using the solution. After the deployment is completed, the computing nodes in the system can perform data training and perform training tasks.

In the foregoing implementation in which the master node generates a tag file, the tag file may be used to determine whether the training package was deployed successfully. If the deployment succeeded, each computing node performing the training task may run the training package deployed on itself to perform data training, that is, start the training task.

As an embodiment, if the state of the first computing node is the active state, and detecting that there is a computing node that fails to deploy the training package, the first prompt information for prompting the deployment failure is output.

That is to say, if the master node fails to copy the training package to a slave node, the master node can output the first prompt information. For example, the first prompt information may be output to the management device or the management node to indicate that the package deployment failed, or the first prompt information may be displayed directly to the user to indicate that the package deployment failed. The first prompt information may include information about the slave node to which the copy failed, and may also include the reason for the copy failure, for example, that the slave node's memory is full or that there is a network fault, so as to facilitate subsequent handling by the relevant personnel.

In addition, in the foregoing implementation in which the master node generates a tag file, if the state of the first computing node is not the main state, it is determined whether the tag file is received within a preset time period; if not, second prompt information for prompting the deployment failure is output.

It can be understood that, for a slave node, if the tag file indicating a successful deployment is not received within a certain period of time, the deployment may have failed, and the slave node may also output prompt information. To distinguish the two, the prompt information output by the master node is referred to as first prompt information, and the prompt information output by the slave node is referred to as second prompt information.

For the slave node, the preset time period may be timed from the moment the training task information is received, from the moment the node determines that its own state is the slave state, or from the moment the master node starts deploying the training package to the node, and so on; this is not specifically limited.

For example, the second prompt information may be output to the management device or the management node to indicate that the package deployment failed, or the second prompt information may be displayed directly to the user to indicate that the package deployment failed. The second prompt information may include the reason for the copy failure, for example, that the memory is full or that there is a network fault, so as to facilitate subsequent handling by the relevant personnel.
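One possible shape for the first prompt information is sketched below: a message listing each failed slave node together with the recorded reason. The function name `build_first_prompt` and the message wording are assumptions made for illustration.

```python
def build_first_prompt(failures: dict) -> str:
    """Format first prompt information from {slave address: failure reason}."""
    lines = ["Package deployment failed on the following slave nodes:"]
    for address, reason in sorted(failures.items()):
        lines.append(f"  {address}: {reason}")
    return "\n".join(lines)


prompt = build_first_prompt(
    {"100.8.2.3": "memory full", "100.6.5.2": "network fault"}
)
print(prompt)
```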
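The slave-side check can be sketched as a bounded wait for the tag file. This assumes the tag file appears at a known local path and that polling is acceptable; the function names, the polling interval, and the message text are hypothetical.

```python
import time
from pathlib import Path


def wait_for_tag_file(tag_path: Path, timeout_s: float, poll_s: float = 0.01) -> bool:
    """Return True if tag_path appears before the preset time period elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if tag_path.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def second_prompt_if_needed(tag_path: Path, timeout_s: float):
    """Return second prompt information on timeout, or None on success."""
    if wait_for_tag_file(tag_path, timeout_s):
        return None
    return f"Deployment failed: no tag file at {tag_path} within {timeout_s} s"
```

`time.monotonic()` is used for the deadline so that adjustments to the system clock do not lengthen or shorten the wait.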

Corresponding to the foregoing method embodiments, the embodiment of the present application further provides an electronic device, which is a computing node in a distributed system.

As shown in FIG. 3, an electronic device provided by an embodiment of the present application includes: a processor 301, and a memory 302.

The memory 302 is configured to store a computer program;

The processor 301 is configured to perform the following steps when executing the program stored on the memory 302:

Receiving training task information, where the training task information includes information about each computing node that performs a training task;

Determining, according to the training task information, whether the state of the electronic device is the master state;

If so, acquiring a training package, and deploying the acquired training package to each computing node that performs the training task.

As an implementation manner, the processor 301 is further configured to implement the following steps when executing the program stored on the memory 302:

After receiving the training task information, parsing the training task information to obtain a storage address of the training package, and status information and a device address of each computing node performing the training task;

Searching, in the status information of each computing node, for the status information corresponding to the device address of the electronic device;

Determining whether the found status information indicates the master state;

If the state of the electronic device is the master state, after detecting that each computing node performing the training task has successfully deployed the training package, generating a tag file, and sending the tag file to each computing node.

If the state of the electronic device is the master state and a computing node that failed to deploy the training package is detected, outputting first prompt information indicating that the deployment has failed.

If the state of the electronic device is not the master state, determining whether the tag file is received within a preset time period; if not, outputting second prompt information indicating that the deployment has failed.
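The parsing and state-determination steps performed by the processor can be sketched as follows. The application does not specify a wire format for the training task information, so the JSON layout and the field names `package_address`, `nodes`, `address`, and `state` are assumptions made purely for illustration.

```python
import json
from typing import Dict, Tuple


def parse_task_info(raw: str) -> Tuple[str, Dict[str, str]]:
    """Return the package storage address and a device-address -> state map."""
    info = json.loads(raw)
    states = {node["address"]: node["state"] for node in info["nodes"]}
    return info["package_address"], states


def is_master(states: Dict[str, str], own_address: str) -> bool:
    """Look up this device's own address and check for the master state."""
    return states.get(own_address) == "master"
```

Each computing node runs the same check against its own device address, so exactly the node marked as master in the task information goes on to fetch the package.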

The memory mentioned in the above electronic device may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk storage. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Applying the embodiment shown in FIG. 3 of the present application, only the computing node in the master state acquires the training package and deploys the acquired training package to each computing node that performs the training task. That is, not every computing node obtains the package from the management device, which reduces the pressure on network bandwidth.

The embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:

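Receiving training task information, where the training task information includes information about each computing node that performs a training task;

Determining, according to the training task information, whether its own state is the master state;

If so, acquiring a training package, and deploying the acquired training package to each computing node that performs the training task.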

As an implementation manner, it is also used to implement the following steps:

Finding status information corresponding to the address of the device in the status information of each computing node;

Determining whether the found status information is the primary status;

As an implementation manner, it is also used to implement the following steps:

If its own state is the master state, after detecting that each computing node performing the training task has successfully deployed the training package, generating a tag file, and sending the tag file to each computing node.

As an implementation manner, it is also used to implement the following steps:

If its own state is the master state and a computing node that failed to deploy the training package is detected, outputting first prompt information indicating that the deployment has failed.

As an implementation manner, it is also used to implement the following steps:

If its own state is not the master state, determining whether the tag file is received within a preset time period; if not, outputting second prompt information indicating that the deployment has failed.

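As an implementation manner, it is also used to implement the following steps:

After each computing node performing the training task has successfully deployed the training package, running the training package to perform data training.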

The embodiment of the present application further provides a distributed system. As shown in FIG. 4, the system includes at least two computing nodes (computing node 1, computing node 2, ... computing node n), and each computing node is configured to:

Receive training task information, where the training task information includes information about each computing node that performs a training task; determine, according to the training task information, whether its own state is the master state; and if so, acquire a training package and deploy the acquired training package to each computing node that performs the training task.

As an implementation manner, the management device may acquire the training task information through user equipment; for example, the training task information is stored in a web client, and the management device acquires it from the web client. The computing node in the master state obtains the training package from the management device.

Alternatively, the distributed system may include multiple computing nodes and a management node, as shown in FIG. 5. The management node may acquire training task information, parse the training task information, determine each computing node that performs the training task, and send the training task information to each computing node that performs the training task.

As an implementation manner, the management node may obtain the training task information through user equipment; for example, the training task information is stored in a web client, and the management node obtains it from the web client, which is not limited. The computing node in the master state obtains the training package from the management node.

Alternatively, the distributed system may include multiple computing nodes, management nodes, and switches as shown in FIG. 6 and FIG. 7, which are not limited.

In the system shown in FIG. 5, FIG. 6, or FIG. 7, the management node is configured to: acquire and store a training package; add the storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task.
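The management node's part of this flow can be sketched as follows. This is only an illustration under stated assumptions: the training task information is modeled as a dictionary serialized to JSON, the field names `package_address`, `nodes`, and `address` are hypothetical, and the transport is abstracted as a caller-supplied `send` function rather than any particular network API.

```python
import json
from typing import Callable, Dict, List


def add_storage_address(task_info: Dict, storage_address: str) -> Dict:
    """Return a copy of the task information with the package address added."""
    updated = dict(task_info)
    updated["package_address"] = storage_address
    return updated


def send_to_task_nodes(task_info: Dict, send: Callable[[str, str], None]) -> List[str]:
    """Send the serialized task information to every node listed in it."""
    payload = json.dumps(task_info)
    recipients = [node["address"] for node in task_info["nodes"]]
    for address in recipients:
        send(address, payload)
    return recipients
```

Only the storage address travels to every computing node here; the package itself stays on the management node until a master node requests it.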

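The computing node is specifically configured to: receive the training task information sent by the management node; parse the training task information to obtain the storage address of the training package, and the status information and device address of each computing node performing the training task; search the status information of each computing node for the status information corresponding to its own device address; determine whether the found status information indicates the master state; and if so, obtain the training package according to the storage address of the training package, and deploy the training package to each computing node according to the device address of each computing node.

As an implementation manner, the computing node can also be configured to:

In the case where its own state is the master state: generate a tag file and send the tag file to each computing node if it detects that each computing node performing the training task has successfully deployed the training package; and send first prompt information indicating that the deployment has failed to the management node if it detects a computing node that failed to deploy the training package.

As an implementation manner, the computing node can also be configured to: in the case where its own state is not the master state, determine whether the tag file is received within a preset time period; and if not, send second prompt information indicating that the deployment has failed to the management node.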

As an implementation manner, the computing nodes performing a training task are communicatively connected to one another via InfiniBand.

As shown in Figure 7, data exchange can be performed between the compute nodes through the Infiniband network switch. The data exchange between the compute node and the management node can be performed through an Ethernet switch, such as a Gigabit Ethernet switch.

Alternatively, the computing nodes can exchange data through other networks, such as Ethernet. Ethernet and InfiniBand networks can also exist at the same time; that is, in one embodiment, the computing nodes in the system can exchange data through both Ethernet and the InfiniBand network.

A specific implementation is provided below, as shown in FIG.

1. The training package and the training task information are stored in the web client.

2. The management node obtains the training package and the training task information from the web client.

3. The management node stores the acquired training package to a certain location, and adds the storage address of the training package to the training task information.

4. The management node parses the training task information and determines each computing node that performs the training task.

5. Through the Ethernet switch, the management node sends the training task information, with the storage address added, to each computing node determined in step 4, that is, each computing node that performs the training task.

6. The computing node receives the training task information sent by the management node and parses it to obtain the storage address of the training package, and the status information and device address of each computing node that performs the training task.

7. The computing node searches the status information of each computing node obtained by the parsing for the status information corresponding to its own device address.

8. The computing node determines whether the found status information indicates the master state. A computing node in the master state is the master node, and a computing node in the slave state is a slave node.

9. If its state is the master state, the computing node obtains the training package according to the storage address of the training package obtained by the parsing.

10. According to the device address of each computing node obtained by the parsing, the master node deploys the acquired training package to each slave node that performs the training task over InfiniBand.

11. After the master node detects that each slave node that performs the training task has successfully copied the training package, it generates a tag file and sends the tag file to each slave node.

12. If the master node detects a slave node that failed to copy the training package, it outputs the first prompt information to the management node; or, if a slave node does not receive the tag file within the preset time period, it outputs the second prompt information to the management node.

13. If the situation in 12 does not occur, after each of the computing nodes performing the training task successfully deploys the training package, each computing node runs its own deployed training package for data training.
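The master-side portion of this procedure (copying the package to the slave nodes, then either sending the tag file or reporting the failure) can be sketched as follows. This is a hedged illustration only: the actual copy over InfiniBand is abstracted as a caller-supplied `copy_to` function that returns `None` on success or a failure reason, and `deploy_as_master`, `send_tag`, and the result fields are hypothetical names.

```python
from typing import Callable, Dict, List, Optional


def deploy_as_master(
    package: str,
    slave_addresses: List[str],
    copy_to: Callable[[str, str], Optional[str]],
    send_tag: Callable[[str], None],
) -> Dict:
    """Copy the package to every slave, then send the tag file or report failures."""
    failures: Dict[str, str] = {}
    for address in slave_addresses:
        reason = copy_to(package, address)  # None means the copy succeeded
        if reason is not None:
            failures[address] = reason
    if failures:
        # First prompt information: which slaves failed, and why.
        return {"type": "first_prompt", "failed_nodes": failures}
    for address in slave_addresses:
        send_tag(address)  # the tag file signals a successful deployment
    return {"type": "deployed", "nodes": list(slave_addresses)}
```

In this sketch no tag file is sent when any copy fails, so every slave node eventually hits its own timeout, which is consistent with the second prompt information path in step 12.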

If, when deploying packages, every computing node in the system obtained its package from the management node, network congestion would result. For example, suppose packages for several training tasks are deployed in one system: computing nodes 1 to 5 need to deploy the package for training task A, computing nodes 6 to 10 need to deploy the package for training task B, and computing nodes 11 to 15 need to deploy the package for training task C. If all 15 computing nodes had to obtain their packages from the management node, the network bandwidth pressure between the management node and the computing nodes would be relatively large.

In the embodiments of the present application, a master node is specified for each training task, and only the master node obtains the package from the management node. In the example above, only three computing nodes obtain packages from the management node, which reduces the network bandwidth pressure between the management node and the computing nodes.
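The arithmetic behind this comparison can be written out as a short sketch. The task-to-node assignment is taken from the example; the function name `management_node_fetches` is hypothetical.

```python
# Computing nodes assigned to each training task, as in the example.
tasks = {
    "A": list(range(1, 6)),    # computing nodes 1 to 5
    "B": list(range(6, 11)),   # computing nodes 6 to 10
    "C": list(range(11, 16)),  # computing nodes 11 to 15
}


def management_node_fetches(tasks: dict, use_master: bool) -> int:
    """Count how many nodes download a package from the management node."""
    if use_master:
        return len(tasks)  # one master node per training task
    return sum(len(nodes) for nodes in tasks.values())


without_master = management_node_fetches(tasks, use_master=False)  # 15
with_master = management_node_fetches(tasks, use_master=True)      # 3
```

The number of downloads from the management node thus scales with the number of training tasks rather than with the number of computing nodes.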

On the other hand, after the master node obtains the package, it deploys the package to the slave nodes performing the training task. Data interaction between the master node and the slave nodes differs from data interaction between the computing nodes and the management node: it can use the InfiniBand network or another internal network of the system, which offers high bandwidth and high speed and thus improves the efficiency of package deployment.

Embodiments of the present application also provide executable program code, which is executed to perform any of the above package deployment methods.

It should be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. An element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the electronic device embodiment shown in FIG. 3, the distributed system embodiments shown in FIGS. 4-7, the computer readable storage medium embodiment, and the executable program code embodiment are basically similar to the package deployment method embodiments shown in FIGS. 1-2, so they are described relatively briefly; for related details, refer to the description of the package deployment method embodiments shown in FIGS. 1-2.

The above are only preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims

A package deployment method is characterized by being applied to a first computing node in a distributed system, the method comprising:

Receiving training task information, where the training task information includes information about each computing node that performs the training task;

Determining, according to the training task information, whether the state of the first computing node is the master state;

If so, acquiring a training package, and deploying the acquired training package to each computing node that performs the training task.
The method according to claim 1, wherein after receiving the training task information, the method further comprises:

Parsing the training task information, obtaining a storage address of the training package, status information of each computing node performing the training task, and a device address;

The determining, according to the training task information, whether the state of the first computing node is the master state includes:

Searching, in the status information of each computing node, for the status information corresponding to the device address of the first computing node;

Determining whether the found status information indicates the master state;

The obtaining training package includes:

Obtaining the training package according to a storage address of the training package;

The deploying the acquired training package to each computing node that performs the training task includes:

And deploying the training package in each computing node according to the device address of each computing node.
The method of claim 1 further comprising:

If the state of the first computing node is the master state, after detecting that each computing node performing the training task has successfully deployed the training package, generating a tag file, and sending the tag file to each computing node.
The method of claim 1 further comprising:

If the state of the first computing node is the master state and a computing node that failed to deploy the training package is detected, outputting first prompt information indicating that the deployment has failed.
The method of claim 3, wherein the method further comprises:

If the state of the first computing node is not the master state, determining whether the tag file is received within a preset time period;

If not, outputting second prompt information indicating that the deployment has failed.
The method of claim 1 further comprising:

After each of the computing nodes performing the training task successfully deploys the training package, the training package is run to perform data training.
The method according to any one of claims 1-6, wherein the deploying the acquired training package to each computing node performing the training task comprises:

Deploying the acquired training package to each computing node performing the training task over InfiniBand.
An electronic device, comprising: a memory and a processor, wherein

a memory for storing a computer program;

The processor is configured to implement the method steps of any one of claims 1-7 when executing the program stored on the memory.
A computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program being executed by a processor to implement the method steps of any of claims 1-7.
A distributed system, comprising: at least two computing nodes;

The computing node is configured to: receive training task information, where the training task information includes information about each computing node that performs a training task; determine, according to the training task information, whether its own state is the master state; and if so, obtain a training package and deploy the obtained training package to each computing node that performs the training task.
The system of claim 10, wherein the system further comprises: a management node;

The management node is configured to acquire and store a training package; add a storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task;

The computing node is specifically configured to:

Receiving the training task information sent by the management node; parsing the training task information to obtain the storage address of the training package, and the status information and device address of each computing node performing the training task; searching the status information of each computing node for the status information corresponding to its own device address; determining whether the found status information indicates the master state; and if so, obtaining the training package according to the storage address of the training package, and deploying the training package to each computing node according to the device address of each computing node.
The system of claim 11 wherein said computing node is further configured to:

In the case where its own state is the primary state:

If it is detected that each of the computing nodes that perform the training task successfully deploys the training package, generate a tag file, and send the tag file to each computing node;

If it is detected that there is a computing node that fails to deploy the training package, the first prompt information for prompting the deployment failure is sent to the management node.
The system of claim 12, wherein the computing node is further configured to:

In a case where the self state is not in the main state, it is determined whether the tag file is received within a preset time period;

If not, sending the second prompt information for prompting the deployment failure to the management node.
The system of claim 10, wherein the computing node is further configured to:

After each of the computing nodes performing the training task successfully deploys the training package, the training package is run for data training.
The system according to any one of claims 10 to 14, wherein the computing nodes performing the training task are communicatively connected to one another via InfiniBand.