WO2018224005A1 - Package deployment method, electronic device and distributed system - Google Patents

Package deployment method, electronic device and distributed system

Info

Publication number
WO2018224005A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
package
computing node
node
training task
Prior art date
Application number
PCT/CN2018/090263
Other languages
French (fr)
Chinese (zh)
Inventor
周智强
彭剑峰
郑星
叶挺群
李鹏飞
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2018224005A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95 Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/35 Switches specially adapted for specific applications
    • H04L49/356 Switches specially adapted for specific applications for storage area networks
    • H04L49/358 Infiniband Switches

Definitions

  • The present application relates to the field of machine learning technology, and in particular to a package deployment method, an electronic device, and a distributed system.
  • Machine learning is an important technical means of realizing artificial intelligence.
  • Machine learning mainly trains on large amounts of data so that a machine acquires the capability of intelligent recognition. Because the amount of data involved in learning and training is large, distributed systems are usually used for data training.
  • Before data training in a distributed system, the packages required for training usually need to be deployed on each computing node of the system. After the packages are deployed, the computing nodes can train cooperatively. Generally, a management device is set up; the management device obtains a training package and sends it to each computing node in the system.
  • In the above solution, every computing node in the system obtains the package from the management device, so the network bandwidth pressure between the management device and the computing nodes is relatively high.
  • The purpose of the embodiments of the present application is to provide a package deployment method, an electronic device, and a distributed system that reduce network bandwidth pressure.
  • An embodiment of the present application provides a package deployment method, applied to a first computing node in a distributed system, where the method includes:
  • receiving training task information, where the training task information includes information about each computing node that performs the training task;
  • determining, according to the training task information, whether the state of the first computing node is the active state; and
  • if so, acquiring the training package and deploying the acquired training package to each of the computing nodes performing the training task.
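  • After the training task information is received, the method may further include:
  • parsing the training task information to obtain a storage address of the training package, and status information and a device address of each computing node performing the training task.
  • Determining whether the state of the first computing node is the active state may include:
  • searching the status information of each computing node for the status information corresponding to the device address of the first computing node.
  • Obtaining the training package may include:
  • acquiring the training package according to the storage address of the training package.
  • Deploying the acquired training package to each computing node that performs the training task may include:
  • deploying the training package to each computing node according to the device address of each computing node.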
  • The method may further include:
  • if the state of the first computing node is the active state, after detecting that every computing node performing the training task has successfully deployed the training package, generating a tag file and sending the tag file to each of the computing nodes.
  • The method may further include:
  • if the state of the first computing node is the active state and a computing node that failed to deploy the training package is detected, outputting first prompt information for prompting the deployment failure.
  • The method may further include:
  • if the state of the first computing node is not the active state, determining whether the tag file is received within a preset time period; if not, outputting second prompt information for prompting the deployment failure.
  • The method may further include:
  • after the training package is determined, according to the tag file, to have been deployed successfully, running the training package to perform data training.
  • Deploying the acquired training package to each computing node that performs the training task may include:
  • deploying the acquired training package to each computing node performing the training task over InfiniBand.
  • An embodiment of the present application further provides an electronic device, including a memory and a processor, where
  • the memory is configured to store a computer program; and
  • the processor is configured to implement any of the above package deployment methods when executing the program stored in the memory.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements any one of the foregoing package deployment methods.
  • An embodiment of the present application further provides a distributed system, including at least two computing nodes;
  • each computing node is configured to receive training task information, where the training task information includes information about each computing node that performs a training task; determine, according to the training task information, whether its own state is the active state; and, if so, obtain a training package and deploy the acquired training package to each computing node that performs the training task.
  • The system may further include a management node;
  • the management node is configured to acquire and store a training package, add the storage address of the training package to the training task information, and send the training task information to each computing node that performs the training task.
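  • The computing node may be specifically configured to:
  • parse the training task information to obtain the storage address of the training package and the status information and device address of each computing node performing the training task; search the status information for the entry corresponding to its own device address; acquire the training package according to the storage address; and deploy the training package to each computing node according to its device address.
  • The computing node may be further configured to:
  • if its own state is the active state, after detecting that every computing node performing the training task has successfully deployed the training package, generate a tag file and send the tag file to each computing node; and
  • if its own state is the active state and a computing node that failed to deploy the training package is detected, send first prompt information for prompting the deployment failure to the management node.
  • The computing node may be further configured to:
  • if its own state is not the active state, determine whether the tag file is received within a preset time period and, if not, output second prompt information for prompting the deployment failure.
  • The computing node may be further configured to:
  • run the training package to perform data training after the deployment is determined, according to the tag file, to have succeeded.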
  • The computing nodes performing the training task are communicatively connected through InfiniBand.
  • An embodiment of the present application further discloses executable program code that, when run, performs any of the above package deployment methods.
  • FIG. 1 is a first schematic flowchart of a package deployment method according to an embodiment of the present application.
  • FIG. 2 is a second schematic flowchart of a package deployment method according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a first structure of a distributed system according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a second structure of a distributed system according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a third structure of a distributed system according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a fourth structure of a distributed system according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of an embodiment provided by an embodiment of the present application.
  • the embodiment of the present application provides a package deployment method, an electronic device, and a distributed system.
  • The distributed system provided by the embodiment of the present application may include at least two computing nodes (computing node 1, computing node 2, ... computing node n) as shown in FIG. 4; or multiple computing nodes and a management node as shown in FIG. 5; or multiple computing nodes, a management node, and switches as shown in FIG. 6 and FIG. 7. This is not limited.
  • a method for deploying a package provided by the embodiment of the present application is described in detail below.
  • the method can be applied to any computing node in the distributed system.
  • the computing node performing the method is referred to as a first computing node.
  • FIG. 1 is a schematic flowchart of a method for deploying a package according to an embodiment of the present application, including:
  • S101 Receive training task information, where the training task information includes information about each computing node that performs the training task.
  • The training task may be a task of learning from and training on a large amount of data in any of various machine learning processes, for example, deep learning based on artificial neural networks.
  • With this solution, the training package can be deployed in the distributed system.
  • After deployment, the computing nodes in the system can perform the training task.
  • The training task information includes information about each computing node that performs this training task.
  • The computing nodes performing this training task may be a subset of the computing nodes in the system.
  • For example, the user may designate some of the computing nodes in the system to perform the training task according to actual conditions; or the computing nodes in the system may be grouped, with the computing nodes in one group performing the same training task; or all computing nodes in the system may be designated to perform the training task. There are many ways to determine the computing nodes that perform a training task, and the embodiments of the present application do not limit this.
  • In one implementation, the user equipment may directly store the training task information on each computing node that performs the training task. Alternatively, a management device may be set up outside the system; the management device acquires the training task information, parses it to determine each computing node that performs the training task, and sends the training task information to each of those computing nodes.
  • The management device may obtain the training task information through the user equipment; for example, the training task information is stored in a web client and the management device acquires it from the web client.
  • In this case, a computing node in the active state obtains the training package from the management device.
  • Alternatively, a management node may be set up within the system. The management node may acquire the training task information, parse it to determine each computing node that performs the training task, and send the training task information to each of those computing nodes.
  • The management node may obtain the training task information through the user equipment; for example, the training task information is stored in the web client and the management node obtains it from the web client. This is not limited.
  • For ease of distinction, a management device set up inside the system is referred to as a management node, and one set up outside the system is referred to as a management device.
  • S102 Determine, according to the training task information, whether the state of the first computing node is the active state; if so, execute S103.
  • In one implementation, the training task information received by each computing node performing the training task may differ. That is, the user equipment may store on each computing node the training task information corresponding to that node, or the management device or management node may send each computing node the training task information corresponding to it.
  • For example, the training task information stored on computing node 1 by the user equipment may contain only the state information of computing node 1, the training task information stored on computing node 2 may contain only the state information of computing node 2, and so on.
  • Likewise, the training task information sent by the management device or the management node to computing node 1 may include only the state information of computing node 1, the training task information sent to computing node 2 may include only the state information of computing node 2, and so on.
  • In another implementation, the training task information received by every computing node performing the training task is the same and includes the state information of each of those computing nodes.
  • The state of a computing node is either an active state or a slave state.
  • A computing node in the active state may be referred to as a master node, and a computing node in the slave state may be referred to as a slave node.
  • A computing node can determine whether its own state is the active state or the slave state according to the training task information received in S101. If it is the active state, S103 is executed.
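  • As an illustration of the determination in S102, the following minimal sketch looks up a node's own entry in the training task information. The dictionary layout and the field names `nodes`, `address`, and `state` are hypothetical; the application does not specify a format. The same lookup covers both implementations described above: task information listing every node, and task information containing only the receiving node's state.

```python
from typing import Dict, List

# Hypothetical layout of training task information; the field names
# ("nodes", "address", "state") are assumptions, not from the application.
TaskInfo = Dict[str, List[Dict[str, str]]]


def is_active(task_info: TaskInfo, own_address: str) -> bool:
    """Return True if the node with own_address is listed in the active state."""
    for node in task_info["nodes"]:
        if node["address"] == own_address:
            return node["state"] == "active"
    # The node is not listed, so it takes no part in this training task.
    return False


# Variant 1: every node receives the same information about all nodes.
shared_info = {"nodes": [
    {"address": "10.0.0.1", "state": "active"},
    {"address": "10.0.0.2", "state": "slave"},
]}
# Variant 2: a node receives only its own state information.
per_node_info = {"nodes": [{"address": "10.0.0.2", "state": "slave"}]}

print(is_active(shared_info, "10.0.0.1"))    # True
print(is_active(per_node_info, "10.0.0.2"))  # False
```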
  • S103 Acquire a training package, and deploy the acquired training package to each computing node that performs the training task.
  • In this embodiment, the training package is acquired only when the computing node is in the active state.
  • For example, a training package may be stored in the web client, and the web client sends the training package to the management node or the management device.
  • A computing node in the active state obtains the training package from the management node or the management device.
  • In one implementation, the computing nodes in the system are connected to one another through InfiniBand. The master node (the computing node in the active state) can therefore deploy the acquired training package over InfiniBand to each computing node that performs the training task.
  • The InfiniBand architecture is a switched-fabric ("conversion cable") technology that supports multiple concurrent links, and networks based on it offer very high bandwidth.
  • The master node deploys the training package to the slave nodes by copying it over the InfiniBand network.
  • This improves deployment efficiency on the one hand and makes full use of InfiniBand's bandwidth on the other.
  • The master node can also deploy the training package to the slave nodes by other means, such as Ethernet.
  • Ethernet and InfiniBand networks can coexist; that is, in one implementation, the computing nodes in the system can exchange data over both Ethernet and the InfiniBand network.
  • If every computing node in the system obtains its package from the management device (or management node) during deployment, network congestion results. For example, suppose packages for several training tasks are deployed in one system: computing nodes 1 to 5 need the package for training task A, computing nodes 6 to 10 need the package for training task B, and computing nodes 11 to 15 need the package for training task C. If all 15 computing nodes obtain their packages from the management device, the network bandwidth pressure between the management device and the computing nodes is high.
  • With this solution, a master node is specified for each training task and only the master node obtains the package from the management device. That is, only three computing nodes obtain a package from the management device, which reduces the network bandwidth pressure between the management device and the computing nodes.
  • After the master node obtains the package, it deploys the package to the slave nodes performing the training task. Data interaction between the master node and the slave nodes is separate from data interaction between the computing nodes and the management device; it can use the InfiniBand network or another internal network of the system, which has high bandwidth and high speed and thus improves the efficiency of package deployment.
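  • The arithmetic of the example above can be checked with a short sketch. The function name and the task-to-node mapping are illustrative only; the sketch merely counts how many computing nodes download from the management device with and without a master node per training task.

```python
from typing import Dict, List


def management_fetches(tasks: Dict[str, List[int]], use_master: bool) -> int:
    """Count how many nodes download a package from the management device."""
    if use_master:
        # Only one master node per training task fetches the package.
        return len(tasks)
    # Otherwise every node of every task fetches its own copy.
    return sum(len(nodes) for nodes in tasks.values())


tasks = {
    "A": list(range(1, 6)),    # computing nodes 1 to 5
    "B": list(range(6, 11)),   # computing nodes 6 to 10
    "C": list(range(11, 16)),  # computing nodes 11 to 15
}

print(management_fetches(tasks, use_master=False))  # 15
print(management_fetches(tasks, use_master=True))   # 3
```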
  • FIG. 2 is a second schematic flowchart of a method for deploying a package according to an embodiment of the present disclosure, including:
  • S202 Parse the training task information to obtain the storage address of the training package, and the status information and device address of each computing node that performs the training task.
  • In this embodiment, the web client may store a training package and the corresponding training task information, where the training task information includes the status information and device address of each computing node that performs the training task.
  • The management node or the management device obtains the training package and the training task information from the web client.
  • The management node or management device stores the training package at a storage location and adds the storage address to the training task information.
  • The management node or management device may further parse the training task information, determine each computing node that performs the training task, and send the training task information with the added storage address to the determined computing nodes, that is, to each computing node that performs the training task. This is not limited.
  • The training task information received by the first computing node executing the method therefore includes the storage address of the training package, and the first computing node parses the received training task information to obtain the storage address.
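  • The management-side steps described above (store the package, then add its storage address to the training task information) might look like the following sketch. The file name `training_package.bin`, the key `package_address`, and the use of a local directory as the storage location are all assumptions made for illustration.

```python
import copy
import os
import tempfile
from typing import Any, Dict


def store_and_annotate(
    package: bytes, task_info: Dict[str, Any], storage_dir: str
) -> Dict[str, Any]:
    """Store the package and return task info extended with its storage address."""
    path = os.path.join(storage_dir, "training_package.bin")  # hypothetical name
    with open(path, "wb") as fh:
        fh.write(package)
    annotated = copy.deepcopy(task_info)  # leave the caller's dict untouched
    annotated["package_address"] = path   # hypothetical key
    return annotated


with tempfile.TemporaryDirectory() as storage_dir:
    original = {"nodes": [{"address": "192.168.1.1", "state": "active"}]}
    annotated = store_and_annotate(b"package-bytes", original, storage_dir)
    # Read the package back through the address added to the task info.
    with open(annotated["package_address"], "rb") as fh:
        stored_bytes = fh.read()
```

  • The annotated training task information is what would then be sent to each computing node that performs the training task.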
  • The status information and device address of each computing node may be contained in the "information about each computing node that performs the training task" included in the training task information.
  • The state of a computing node is either an active state or a slave state.
  • A computing node in the active state may be referred to as a master node, and a computing node in the slave state may be referred to as a slave node.
  • The device address may be, for example, the IP address or the MAC address of the device.
  • S203 Search the status information of each computing node for the status information corresponding to the device address of the first computing node.
  • For example, if the first computing node finds in the parsing result of S202 (including Table 1 above) that the state corresponding to its own IP address is the active state, the determination result of S204 is yes, and S205 is executed.
  • S205 Acquire the training package according to the storage address of the training package.
  • The parsing result of S202 also includes the storage address of the training package, from which the training package can be obtained. Note that in this embodiment only the computing node in the active state (the master node) accesses the storage address to acquire the training package.
  • S206 Deploy the training package to each computing node according to the device address of each computing node.
  • The parsing result of S202 also includes the device address of each computing node that performs the training task, and the package acquired in S205 can be deployed to each of those computing nodes according to its device address.
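  • Steps S202 to S206 can be sketched end to end as follows. This is a minimal illustration, not the application's implementation: the JSON encoding, the keys `package_address`, `nodes`, `address`, and `state`, and the injected `fetch` and `copy_to` callables (standing in for storage access and for the InfiniBand or Ethernet copy) are all assumptions. The sketch also assumes the master already holds the package locally and therefore copies it only to the other nodes.

```python
import json
from typing import Callable, Dict, List


def deploy_if_active(
    raw_task_info: str,
    own_address: str,
    fetch: Callable[[str], bytes],
    copy_to: Callable[[str, bytes], None],
) -> List[str]:
    """Run S202-S206 and return the device addresses the package was copied to."""
    info = json.loads(raw_task_info)                         # S202: parse
    states: Dict[str, str] = {n["address"]: n["state"] for n in info["nodes"]}
    if states.get(own_address) != "active":                  # S203 / S204
        return []                                            # slave: nothing to do
    package = fetch(info["package_address"])                 # S205: acquire
    targets = [addr for addr in states if addr != own_address]
    for addr in targets:                                     # S206: deploy
        copy_to(addr, package)
    return targets


# In-memory stand-ins for the storage location and the slave nodes.
store = {"/packages/task_a.tar": b"package-bytes"}
received: Dict[str, bytes] = {}


def fetch(address: str) -> bytes:
    return store[address]


def copy_to(device_address: str, package: bytes) -> None:
    received[device_address] = package


raw = json.dumps({
    "package_address": "/packages/task_a.tar",
    "nodes": [
        {"address": "192.168.1.1", "state": "active"},
        {"address": "192.168.1.2", "state": "slave"},
        {"address": "192.168.1.3", "state": "slave"},
    ],
})
sent_by_master = deploy_if_active(raw, "192.168.1.1", fetch, copy_to)
sent_by_slave = deploy_if_active(raw, "192.168.1.2", fetch, copy_to)
```

  • Passing the transport as a callable keeps the sketch independent of whether the copy runs over InfiniBand or Ethernet.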
  • In one implementation, the computing nodes in the system are connected to one another through InfiniBand. The master node can therefore deploy the acquired training package over InfiniBand to each computing node that performs the training task.
  • The InfiniBand architecture is a switched-fabric ("conversion cable") technology that supports multiple concurrent links, and networks based on it offer very high bandwidth.
  • The master node deploys the training package to the slave nodes by copying it over the InfiniBand network.
  • This improves deployment efficiency on the one hand and makes full use of InfiniBand's bandwidth on the other.
  • The master node can also deploy the training package to the slave nodes by other means, such as Ethernet.
  • Ethernet and InfiniBand networks can coexist; that is, in one implementation, the computing nodes in the system can exchange data over both Ethernet and the InfiniBand network.
  • If the state of the first computing node is the active state, after the first computing node detects that every computing node performing the training task has successfully deployed the training package, it generates a tag file and sends the tag file to each of the computing nodes.
  • As described above, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. The master node can therefore determine whether each copy succeeded and thereby check whether all copies of the package for the training task succeeded.
  • If all copies succeeded, the master node may generate a tag file and send it to each computing node performing the training task, so that the slave nodes learn that the deployment succeeded.
  • After the deployment succeeds, the training package is run to perform data training, that is, to perform the training task.
  • The training task may be a task of learning from and training on a large amount of data in any of various machine learning processes, for example, deep learning based on artificial neural networks.
  • With this solution, the training package can be deployed in the distributed system.
  • After deployment, the computing nodes in the system can perform data training, that is, perform the training task.
  • For the master node and the slave nodes alike, the tag file may be used to determine whether the training package was deployed successfully. If so, each computing node performing the training task may run the training package deployed on itself to perform data training, that is, start the training task.
  • If the state of the first computing node is the active state and a computing node that failed to deploy the training package is detected, first prompt information for prompting the deployment failure is output.
  • That is, if any copy fails, the master node can output the first prompt information.
  • The first prompt information may be output to the management device or the management node to indicate that the package deployment failed, or it may be displayed directly to the user to indicate that the package deployment failed.
  • The first prompt information may include information about the slave node to which the copy failed, and may also include the reason for the copy failure, for example, that the slave node's memory is full or that there is a network fault, so as to facilitate subsequent handling by the relevant personnel.
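  • The master node's decision between generating a tag file and outputting the first prompt information can be sketched as below. The result encoding (`None` for a successful copy, a reason string for a failed one) and the return values `"tag"` and `"prompt"` are hypothetical conventions chosen for this illustration.

```python
from typing import Dict, Optional, Tuple


def summarize_deployment(
    copy_results: Dict[str, Optional[str]],
) -> Tuple[str, Dict[str, str]]:
    """Map per-slave copy results to ("tag", {}) or ("prompt", failures).

    copy_results maps a slave address to None on success, or to a failure
    reason such as "memory full" or "network fault".
    """
    failures = {addr: why for addr, why in copy_results.items() if why is not None}
    if failures:
        # At least one copy failed: report which slaves failed and why.
        return "prompt", failures
    # Every copy succeeded: the master may generate and send the tag file.
    return "tag", {}


all_ok = summarize_deployment({"192.168.1.2": None, "192.168.1.3": None})
one_bad = summarize_deployment({"192.168.1.2": None, "192.168.1.3": "memory full"})
print(all_ok)   # ('tag', {})
print(one_bad)  # ('prompt', {'192.168.1.3': 'memory full'})
```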
  • If the state of the first computing node is not the active state, it is determined whether the tag file is received within a preset time period; if not, second prompt information for prompting the deployment failure is output.
  • That is, if a slave node does not receive the tag file indicating successful deployment within a period of time, the deployment may have failed, and the slave node may also output prompt information.
  • For ease of distinction, the prompt information output by the master node is referred to as the first prompt information,
  • and the prompt information output by a slave node is referred to as the second prompt information.
  • The preset time period may start when the training task information is received, when the node determines that its own state is the slave state, or when the master node starts deploying the training package to the node; this is not specifically limited.
  • The second prompt information may be output to the management device or the management node to indicate that the package deployment failed, or it may be displayed directly to the user to indicate that the package deployment failed.
  • The second prompt information may include the reason for the copy failure, for example, that the memory is full or that there is a network fault, so as to facilitate subsequent handling by the relevant personnel.
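  • A slave node's wait for the tag file within the preset time period might be implemented as a simple poll with a deadline, as in the sketch below. It assumes the tag file arrives as a file on the slave's own file system; the file name `deploy.ok` and the polling interval are hypothetical.

```python
import os
import tempfile
import time


def wait_for_tag_file(path: str, timeout_s: float, poll_s: float = 0.05) -> bool:
    """Poll until the tag file exists or the preset time period elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    # One last check at the deadline before reporting a failure.
    return os.path.exists(path)


with tempfile.TemporaryDirectory() as workdir:
    tag_path = os.path.join(workdir, "deploy.ok")  # hypothetical file name
    # No tag file yet: the slave would output the second prompt information.
    missing = wait_for_tag_file(tag_path, timeout_s=0.2)
    # Simulate the master delivering the tag file, then wait again.
    open(tag_path, "w").close()
    present = wait_for_tag_file(tag_path, timeout_s=0.2)
```

  • `time.monotonic()` is used for the deadline so that adjustments to the system clock cannot shorten or lengthen the preset time period.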
  • the embodiment of the present application further provides an electronic device, which is a computing node in a distributed system.
  • an electronic device provided by an embodiment of the present application includes: a processor 301, and a memory 302.
  • a memory 302 for storing a computer program
  • The processor 301 is configured to implement the following steps when executing the program stored in the memory 302:
  • receiving training task information, where the training task information includes information about each computing node that performs the training task;
  • determining, according to the training task information, whether the state of the electronic device is the active state; and
  • if so, acquiring the training package and deploying the acquired training package to each computing node that performs the training task.
  • When executing the program stored in the memory 302, the processor 301 may further implement the following steps:
  • parsing the training task information to obtain the storage address of the training package and the status information and device address of each computing node performing the training task; searching the status information for the entry corresponding to the device's own address; acquiring the training package according to the storage address; and deploying the training package to each computing node according to its device address.
  • When executing the program stored in the memory 302, the processor 301 may further implement the following step:
  • if the state of the electronic device is the active state, after detecting that every computing node performing the training task has successfully deployed the training package, generating a tag file and sending the tag file to each of the computing nodes.
  • When executing the program stored in the memory 302, the processor 301 may further implement the following step:
  • if the state of the electronic device is the active state and a computing node that failed to deploy the training package is detected, outputting first prompt information for prompting the deployment failure.
  • When executing the program stored in the memory 302, the processor 301 may further implement the following step:
  • if the state of the electronic device is not the active state, determining whether the tag file is received within a preset time period and, if not, outputting second prompt information for prompting the deployment failure.
  • When executing the program stored in the memory 302, the processor 301 may further implement the following step:
  • running the training package to perform data training.
  • When executing the program stored in the memory 302, the processor 301 may further implement the following step:
  • deploying the acquired training package to each computing node performing the training task over InfiniBand.
  • the memory mentioned in the above electronic device may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk storage.
  • the memory may also be at least one storage device located away from the aforementioned processor.
  • The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the following steps are implemented:
  • receiving training task information, where the training task information includes information about each computing node that performs the training task;
  • determining, according to the training task information, whether the device's own state is the active state; and, if so, acquiring the training package and deploying the acquired training package to each computing node that performs the training task.
  • If the device's own state is the active state, after detecting that every computing node performing the training task has successfully deployed the training package, generating a tag file and sending the tag file to each computing node.
  • If the device's own state is the active state and a computing node that failed to deploy the training package is detected, outputting first prompt information for prompting the deployment failure.
  • If the device's own state is not the active state, determining whether the tag file is received within a preset time period and, if not, outputting second prompt information for prompting the deployment failure.
  • Running the training package to perform data training.
  • Deploying the acquired training package to each computing node performing the training task over InfiniBand.
  • An embodiment of the present application further provides a distributed system. As shown in FIG. 4, the system includes at least two computing nodes (computing node 1, computing node 2, ... computing node n), and each computing node is configured to:
  • receive training task information, where the training task information includes information about each computing node that performs a training task; determine, according to the training task information, whether its own state is the active state; and, if so, acquire a training package and deploy the acquired training package to each computing node that performs the training task.
  • In one implementation, the user equipment may directly store the training task information on each computing node that performs the training task. Alternatively, a management device may be set up outside the system; the management device acquires the training task information, parses it to determine each computing node that performs the training task, and sends the training task information to each of those computing nodes.
  • The management device may acquire the training task information through the user equipment; for example, the training task information is stored in a web client and the management device acquires it from the web client.
  • In this case, a computing node in the active state obtains the training package from the management device.
  • The distributed system may also include multiple computing nodes and a management node, as shown in FIG. 5.
  • The management node may acquire the training task information, parse it to determine each computing node that performs the training task, and send the training task information to each of those computing nodes.
  • The management node may obtain the training task information through the user equipment; for example, the training task information is stored in the web client and the management node obtains it from the web client. This is not limited.
  • In this case, a computing node in the active state obtains the training package from the management node.
  • For ease of distinction, a management device set up inside the system is referred to as a management node, and one set up outside the system is referred to as a management device.
  • The distributed system may also include multiple computing nodes, a management node, and switches, as shown in FIG. 6 and FIG. 7. This is not limited.
  • The management node is configured to: acquire and store the training package; add the storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task.
  • Compute node specifically for:
  • the computing node can also be used to:
  • if it detects that every computing node performing the training task has successfully deployed the training package, generate a marker file and send the marker file to each of those computing nodes;
  • if it detects a computing node that failed to deploy the training package, send first prompt information indicating a deployment failure to the management node.
  • the computing node can also be used to:
  • the computing node can also be used to:
  • the training package is run to perform data training.
  • The computing nodes performing the training task are connected to one another via InfiniBand.
  • data exchange can be performed between the compute nodes through the Infiniband network switch.
  • the data exchange between the compute node and the management node can be performed through an Ethernet switch, such as a Gigabit Ethernet switch.
  • The InfiniBand architecture is a "switched fabric" technology that supports multiple concurrent links, and networks based on it offer very high bandwidth.
  • The master node deploys the training package to the slave nodes by copying it over the InfiniBand network.
  • On the one hand, this improves training package deployment efficiency; on the other hand, it makes full use of InfiniBand's bandwidth.
  • Ethernet and InfiniBand networks can exist at the same time. That is to say, in one embodiment, each computing node in the system can exchange data through Ethernet and InfiniBand network.
  • The training package and the training task information are stored in the web client.
  • the management node obtains the training package and the training task information from the web client.
  • the management node stores the acquired training package to a certain location, and adds the storage address of the training package to the training task information.
  • The management node parses the training task information and determines each computing node that performs the training task.
  • Through the Ethernet switch, the management node sends the training task information, with the storage address added, to each computing node determined in the previous step, that is, each computing node that performs the training task.
  • Each computing node receives the training task information sent by the management node and parses it to obtain the storage address of the training package and the state information and device address of each computing node that performs the training task.
  • Each computing node looks up, in the parsed state information, the state information corresponding to its own device address.
  • A computing node in the master state is the master node, and a computing node in the slave state is a slave node.
  • The master node obtains the training package from the storage address obtained by parsing.
  • The master node deploys the acquired training package, via InfiniBand, to each slave node that performs the training task, using the device addresses obtained by parsing. After the master node detects that every slave node performing the training task has successfully copied the training package, it generates a marker file and sends the marker file to each slave node.
  • If the master node detects a slave node that failed to copy the training package, it outputs first prompt information to the management node; if a slave node does not receive the marker file within the preset time period, it outputs second prompt information to the management node.
  • Each computing node runs the training package deployed on it to perform data training.
  • If every computing node in the system obtains the training package from the management node during package deployment, network congestion results. For example, suppose packages for several training tasks are deployed in the system: computing nodes 1 to 5 need the package for training task A, computing nodes 6 to 10 need the package for training task B, and computing nodes 11 to 15 need the package for training task C. If all 15 computing nodes obtain their packages from the management node, the network bandwidth pressure between the management node and the computing nodes is high.
  • In this embodiment, one master node is designated for each training task and only the master node obtains the package from the management node. That is, only three computing nodes obtain packages from the management node, which reduces the network bandwidth pressure between the management node and the computing nodes.
  • After obtaining the package, the master node deploys it to the slave nodes performing the training task. Data exchange between the master node and the slave nodes is separate from data exchange between the computing nodes and the management node: it can use the InfiniBand network or another internal network of the system, which offers high bandwidth and high speed and thus improves package deployment efficiency.
  • Embodiments of the present application also provide executable program code which, when run, performs any of the above package deployment methods.
  • the various embodiments in the present specification are described in a related manner, and the same or similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
  • Because the electronic device embodiment shown in FIG. 3, the distributed system embodiments shown in FIGS. 4-7, the computer-readable storage medium embodiment, and the executable program code embodiment are basically similar to the package deployment method embodiments shown in FIGS. 1-2, they are described relatively briefly. For related details, refer to the description of the package deployment method embodiments shown in FIGS. 1-2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer And Data Communications (AREA)
  • Small-Scale Networks (AREA)

Abstract

A package deployment method, an electronic device, and a distributed system. The method is applied to a first computing node, and comprises: receiving training session information (S101), wherein the training session information comprises information on each computing node executing a training session; determining, according to the training session information, whether the first computing node is in a master state (S102); and if yes, obtaining a training package, and deploying the obtained training package to each computing node executing the training session (S103). As such, only a computing node in a master state obtains a training package, and the obtained training package is only deployed to the computing node executing a training session. Since it is not necessary for all computing nodes to obtain the training package from a management device, the method reduces pressure on network bandwidth.

Description

Package deployment method, electronic device and distributed system
This application claims priority to Chinese Patent Application No. 201710429234.1, entitled "Package Deployment Method, Electronic Device, and Distributed System", filed with the Chinese Patent Office on June 8, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of machine learning technology, and in particular to a package deployment method, an electronic device, and a distributed system.
Background
Machine learning is an important technical means of realizing artificial intelligence. It mainly works by learning from and training on large amounts of data so that a machine acquires intelligent recognition capabilities. Because the amount of data involved in learning and training is large, a distributed system is usually used for data training.
Before data training is performed in a distributed system, the packages required for training usually need to be deployed on each computing node of the system; only after package deployment is complete can the computing nodes train cooperatively. Generally, a management device is set up, which obtains the training package and delivers it to each computing node in the system.
In other words, every computing node in the system obtains the package from the management device, so the network bandwidth pressure between the management device and the computing nodes is high.
Summary
The purpose of the embodiments of the present application is to provide a package deployment method, an electronic device, and a distributed system that reduce network bandwidth pressure.
To achieve the above purpose, an embodiment of the present application provides a package deployment method applied to a first computing node in a distributed system, the method including:
receiving training task information, where the training task information includes information on each computing node that performs the training task;
determining, according to the training task information, whether the state of the first computing node is the master state;
if it is the master state, acquiring a training package and deploying the acquired training package to each computing node that performs the training task.
Optionally, after the training task information is received, the method may further include:
parsing the training task information to obtain the storage address of the training package, and the state information and device address of each computing node that performs the training task;
The determining, according to the training task information, whether the state of the first computing node is the master state may include:
looking up, in the state information of the computing nodes, the state information corresponding to the device address of the first computing node;
determining whether the state information found is the master state;
The acquiring a training package may include:
acquiring the training package according to the storage address of the training package;
The deploying the acquired training package to each computing node that performs the training task may include:
deploying the training package on each computing node according to the device addresses of the computing nodes.
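The optional steps above (parse, look up own state, fetch, deploy) can be illustrated with a minimal Python sketch. It is not the implementation described in the application: the dictionary layout of the training task information, the names `TaskInfo`, `parse_task_info`, and `handle_task`, and the `fetch` and `copy_to` callbacks are all hypothetical stand-ins for whatever storage and transport a real system uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MASTER = "master"  # assumed encoding of the master state
SLAVE = "slave"    # assumed encoding of the slave state


@dataclass
class TaskInfo:
    package_address: str         # storage address of the training package
    node_states: Dict[str, str]  # device address -> state


def parse_task_info(raw: dict) -> TaskInfo:
    """Extract the storage address and the per-node state table."""
    states = {node["address"]: node["state"] for node in raw["nodes"]}
    return TaskInfo(package_address=raw["package_address"], node_states=states)


def handle_task(
    raw: dict,
    own_address: str,
    fetch: Callable[[str], bytes],
    copy_to: Callable[[str, bytes], None],
) -> List[str]:
    """Run the parse / look-up / fetch / deploy steps on one node.

    Returns the device addresses the package was copied to
    (an empty list when this node is not in the master state).
    """
    info = parse_task_info(raw)
    # Look up the state recorded for this node's own device address.
    if info.node_states.get(own_address) != MASTER:
        return []
    # Only the master fetches the package from its storage address.
    package = fetch(info.package_address)
    # Deploy to every other node listed for this training task.
    targets = [addr for addr in info.node_states if addr != own_address]
    for addr in targets:
        copy_to(addr, package)
    return targets
```

In this sketch the master skips itself when copying, on the assumption that it already holds the package after fetching it.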
Optionally, the method may further include:
if the state of the first computing node is the master state, generating a marker file after detecting that every computing node performing the training task has successfully deployed the training package, and sending the marker file to each of those computing nodes.
Optionally, the method may further include:
if the state of the first computing node is the master state, outputting first prompt information indicating a deployment failure after detecting a computing node that failed to deploy the training package.
Optionally, the method may further include:
if the state of the first computing node is not the master state, determining whether the marker file is received within a preset time period;
if not, outputting second prompt information indicating a deployment failure.
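The marker-file handshake described above can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: "sending" the marker file is modelled as writing a file into a directory that stands in for each node, and the names `master_finish`, `slave_wait`, and `deploy.done` are hypothetical.

```python
import os
import tempfile
import time
from typing import Dict, List, Optional


def master_finish(
    results: Dict[str, bool],
    marker_dirs: List[str],
    marker_name: str = "deploy.done",
) -> Optional[str]:
    """Master side: results maps device address -> copy succeeded.

    Returns the first prompt text if any node failed; otherwise writes
    the marker file into every node directory and returns None.
    """
    failed = [addr for addr, ok in results.items() if not ok]
    if failed:
        return "deployment failed on: " + ", ".join(failed)
    for directory in marker_dirs:
        with open(os.path.join(directory, marker_name), "w") as fh:
            fh.write("ok")
    return None


def slave_wait(marker_path: str, timeout_s: float, poll_s: float = 0.05) -> bool:
    """Slave side: poll for the marker file until the preset period ends.

    Returns False on timeout, which is when a node would output the
    second prompt information.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(marker_path):
            return True
        time.sleep(poll_s)
    return os.path.exists(marker_path)
```

Polling with a deadline is only one way to realise "within a preset time period"; a real system might instead be notified by the transport layer.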
Optionally, the method may further include:
after every computing node performing the training task has successfully deployed the training package, running the training package to perform data training.
Optionally, the deploying the acquired training package to each computing node that performs the training task may include:
deploying the acquired training package to each computing node that performs the training task via InfiniBand.
To achieve the above purpose, an embodiment of the present application further provides an electronic device, including a memory and a processor, where
the memory is configured to store a computer program;
the processor is configured to implement any of the above package deployment methods when executing the program stored in the memory.
To achieve the above purpose, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the above package deployment methods.
To achieve the above purpose, an embodiment of the present application further provides a distributed system including at least two computing nodes;
The computing node is configured to: receive training task information, where the training task information includes information on each computing node that performs the training task; determine, according to the training task information, whether its own state is the master state; and, if so, acquire a training package and deploy the acquired training package to each computing node that performs the training task.
Optionally, the system further includes a management node;
The management node is configured to: acquire and store the training package; add the storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task;
The computing node may be specifically configured to:
receive the training task information sent by the management node; parse the training task information to obtain the storage address of the training package and the state information and device address of each computing node that performs the training task; look up, in the state information of the computing nodes, the state information corresponding to its own device address; determine whether the state information found is the master state; and, if it is, acquire the training package according to its storage address and deploy the training package on each computing node according to the device addresses of the computing nodes.
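The management node's part of this exchange (store the package, add its storage address to the training task information, send the information to each computing node of the task) might look like the sketch below. The in-memory `store` dictionary, the `send` callback, and the field names `package_address` and `nodes` are assumptions made for illustration only.

```python
import copy
from typing import Callable, Dict, List


def prepare_task_info(
    task_info: dict,
    package: bytes,
    store: Dict[str, bytes],
    address: str,
) -> dict:
    """Store the package and return task info carrying its storage address."""
    store[address] = package               # stand-in for real storage
    enriched = copy.deepcopy(task_info)    # leave the caller's dict untouched
    enriched["package_address"] = address
    return enriched


def dispatch(task_info: dict, send: Callable[[str, dict], None]) -> List[str]:
    """Send the task info to every computing node listed for the task."""
    recipients = [node["address"] for node in task_info["nodes"]]
    for addr in recipients:
        send(addr, task_info)
    return recipients
```

Note that only the small task-information message goes to every node; the package itself stays in storage until a master node fetches it.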
Optionally, the computing node may be further configured to:
when its own state is the master state:
if it detects that every computing node performing the training task has successfully deployed the training package, generate a marker file and send the marker file to each of those computing nodes;
if it detects a computing node that failed to deploy the training package, send first prompt information indicating a deployment failure to the management node.
Optionally, the computing node may be further configured to:
when its own state is not the master state, determine whether the marker file is received within a preset time period;
if not, send second prompt information indicating a deployment failure to the management node.
Optionally, the computing node may be further configured to:
after every computing node performing the training task has successfully deployed the training package, run the training package to perform data training.
Optionally, the computing nodes performing the training task are connected to one another via InfiniBand.
To achieve the above purpose, an embodiment of the present application further discloses executable program code which, when run, performs any of the above package deployment methods.
With the embodiments of the present application, only the computing node in the master state acquires the training package and deploys it to the computing nodes performing the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
Of course, a product or method implementing the present application does not necessarily need to achieve all of the above advantages at the same time.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application and of the prior art more clearly, the drawings needed for the embodiments and the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first schematic flowchart of a package deployment method provided by an embodiment of the present application;
FIG. 2 is a second schematic flowchart of a package deployment method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 4 is a first schematic structural diagram of a distributed system provided by an embodiment of the present application;
FIG. 5 is a second schematic structural diagram of a distributed system provided by an embodiment of the present application;
FIG. 6 is a third schematic structural diagram of a distributed system provided by an embodiment of the present application;
FIG. 7 is a fourth schematic structural diagram of a distributed system provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of an implementation provided by an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
To solve the above technical problem, the embodiments of the present application provide a package deployment method, an electronic device, and a distributed system. The distributed system provided by the embodiments may, as shown in FIG. 4, include at least two computing nodes (computing node 1, computing node 2, ... computing node n); or, as shown in FIG. 5, include multiple computing nodes and a management node; or, as shown in FIG. 6 and FIG. 7, include multiple computing nodes, a management node, and switches. No specific limitation is imposed.
The package deployment method provided by the embodiments of the present application is first described in detail below. The method can be applied to any computing node in the distributed system. For ease of description, in the embodiments of FIG. 1 and FIG. 2 the computing node that executes the method is referred to as the first computing node.
FIG. 1 is a schematic flowchart of a package deployment method provided by an embodiment of the present application, including:
S101: Receive training task information, where the training task information includes information on each computing node that performs the training task.
A training task may be any task in a machine learning process that learns from and trains on a large amount of data, for example deep learning based on artificial neural networks. Before the training task is executed, this solution can be used to deploy the training package in the distributed system; once deployment is complete, the computing nodes in the system can execute the training task.
The training task information includes information on each computing node that performs this training task. The computing nodes performing this training task may be a subset of the computing nodes in the system. As one implementation, each time machine learning is needed (a training task is to be executed), the user may designate some of the computing nodes in the system to perform the training task according to the actual situation; alternatively, the computing nodes in the system may be divided into groups, with nodes in the same group performing the same training task; alternatively, all computing nodes in the system may be designated to perform the training task. There are many ways to determine the computing nodes that perform a training task, and the embodiments of the present application do not limit this.
In the system shown in FIG. 4, the user equipment may directly store the training task information on each computing node that performs the training task; alternatively, a management device may be set up outside the system, which acquires the training task information, parses it, determines each computing node that performs the training task, and sends the training task information to each of those computing nodes.
As one implementation, the management device may acquire the training task information through the user equipment; for example, the training task information is stored in a web client and the management device acquires it from the web client. A computing node in the master state obtains the training package from the management device.
In the system shown in FIG. 5, the management node may acquire the training task information, parse it, determine each computing node that performs the training task, and send the training task information to each of those computing nodes. As one implementation, the management node may acquire the training task information through the user equipment; for example, the training task information is stored in a web client and the management node acquires it from the web client. No specific limitation is imposed.
In the embodiments of the present application, to keep the description clear, a management device located inside the system is referred to as a management node, and a management device located outside the system is referred to as a management device.
S102: Determine, according to the training task information, whether the state of the first computing node is the master state; if so, execute S103.
As one implementation, the training task information received by each computing node performing the training task may differ. That is, the user equipment may store on each computing node the training task information corresponding to that node, or the management device or management node may send each computing node the training task information corresponding to that node.
For example, the training task information that the user equipment stores on computing node 1 may contain only the state information of computing node 1, the training task information stored on computing node 2 may contain only the state information of computing node 2, and so on. Likewise, the training task information that the management device or management node sends to computing node 1 may contain only the state information of computing node 1, the information sent to computing node 2 may contain only the state information of computing node 2, and so on.
Alternatively, as another implementation, every computing node performing the training task receives the same training task information, which contains the state information of all computing nodes performing the training task.
The state of a computing node may be one of two states, the master state and the slave state. A computing node in the master state may be called the master node, and a computing node in the slave state may be called a slave node. Each computing node can determine, from the training task information received in S101, whether its own state is the master state or the slave state. If it is the master state, S103 is executed.
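The two ways of delivering state information described above (a per-node message carrying only that node's state, or one shared message carrying every node's state) can be handled by the same look-up. The sketch below is illustrative only; the keys `state` and `states` and the function names are hypothetical.

```python
from typing import Dict


def own_state(task_info: dict, own_address: str) -> str:
    """Return this node's state from either message form."""
    # Per-node form: the message carries only this node's state.
    if "state" in task_info:
        return task_info["state"]
    # Shared form: look up the entry for this node's device address.
    return task_info["states"][own_address]


def split_per_node(shared: dict) -> Dict[str, dict]:
    """Derive one per-node message per device address from the shared form."""
    return {addr: {"state": state} for addr, state in shared["states"].items()}
```

The per-node form keeps each message minimal, while the shared form additionally tells the master node which other nodes take part in the task.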
S103:获取训练程序包,并将所获取的训练程序包部署到所述执行训练任务的各台计算节点。S103: Acquire a training package, and deploy the acquired training package to each computing node that performs the training task.
在本申请实施例中,仅当计算节点的状态为主用状态时,才获取训练程序包。作为一种实施方式,web客户端中可以存储有训练程序包,web客户端将训练程序包发送给管理节点或管理设备。作为主用状态的计算节点,从该管理节点或管理设备中获取训练程序包。In the embodiment of the present application, the training package is acquired only when the state of the computing node is in the active state. As an implementation manner, a training package may be stored in the web client, and the web client sends the training package to the management node or the management device. As a computing node in the primary state, a training package is obtained from the management node or the management device.
作为一种实施方式,系统中各台计算节点之间通过无限带宽技术 Infiniband通信连接,因此,主节点(主用状态的计算节点)可以通过Infiniband,将所获取的训练程序包部署到执行训练任务的各台计算节点。As an implementation manner, each computing node in the system is connected by Infiniband communication through an infinite bandwidth technology. Therefore, the master node (the computing node in the active state) can deploy the acquired training package to perform the training task through Infiniband. Each computing node.
本领域技术人员可以理解,InfiniBand架构是一种支持多并发链接的“转换线缆”技术,基于InfiniBand架构的网络具有非常高的带宽。本实施方式中,主节点通过InfiniBand网络拷贝的方式,来完成从节点训练程序包的部署,一方面,可以提高训练程序包部署效率,另一方面,充分利用了Infiniband的带宽能力。Those skilled in the art can understand that the InfiniBand architecture is a "conversion cable" technology that supports multiple concurrent links, and the network based on the InfiniBand architecture has very high bandwidth. In this embodiment, the master node completes the deployment of the slave training package by means of the InfiniBand network copy. On the one hand, the training package deployment efficiency can be improved, and on the other hand, the bandwidth capability of the Infiniband is fully utilized.
或者,主节点也可以通过其他方式,比如以太网,来完成从节点训练程序包的部署。以太网与InfiniBand网络可以同时存在,也就是说,在一种实施方式中,系统中各台计算节点可以通过以太网及InfiniBand网络进行数据交互。Alternatively, the master node can also complete the deployment of the slave training package by other means, such as Ethernet. Ethernet and InfiniBand networks can exist at the same time. That is to say, in one embodiment, each computing node in the system can exchange data through Ethernet and InfiniBand network.
If every computing node in the system obtained the training package from the management device (or management node) during package deployment, the network would become congested. For example, suppose packages are deployed in the system for several training tasks: computing nodes 1 to 5 need the package for training task A, computing nodes 6 to 10 need the package for training task B, and computing nodes 11 to 15 need the package for training task C. If all 15 computing nodes had to obtain their packages from the management device, the bandwidth pressure on the network between the management device and the computing nodes would be high.
In the embodiments of the present application, by contrast, one master node is designated for each training task, and only the master node obtains the package from the management device. In the example above, only 3 computing nodes obtain a package from the management device, which reduces the bandwidth pressure on the network between the management device and the computing nodes.
Moreover, after the master node obtains the package, it deploys the package to the slave nodes that perform the training task. Data exchange between the master node and the slave nodes differs from data exchange between a computing node and the management device: it can use the InfiniBand network or another network internal to the system, which offers high bandwidth and high speed and thus improves the efficiency of package deployment.
With the embodiment shown in FIG. 1 of the present application, only the computing node in the active state acquires the training package and deploys it to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
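The saving in the three-task example above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the node names and the `tasks` layout are hypothetical stand-ins for computing nodes 1 to 15 and training tasks A to C, and the functions merely count transfers rather than perform them.

```python
# Hypothetical layout mirroring the example: 15 computing nodes, 3 training tasks.
tasks = {
    "A": [f"node{i}" for i in range(1, 6)],    # computing nodes 1 to 5
    "B": [f"node{i}" for i in range(6, 11)],   # computing nodes 6 to 10
    "C": [f"node{i}" for i in range(11, 16)],  # computing nodes 11 to 15
}


def downloads_without_master(tasks):
    """Every computing node fetches its package from the management device."""
    return sum(len(nodes) for nodes in tasks.values())


def downloads_with_master(tasks):
    """Only one master node per training task fetches from the management device."""
    return len(tasks)


def internal_copies(tasks):
    """Copies each master makes to its slave nodes over the internal network."""
    return sum(len(nodes) - 1 for nodes in tasks.values())


print(downloads_without_master(tasks))  # 15
print(downloads_with_master(tasks))     # 3
print(internal_copies(tasks))           # 12
```

The remaining 12 transfers do not disappear; they move from the link to the management device onto the internal network between computing nodes, for example InfiniBand.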
FIG. 2 is a second schematic flowchart of the package deployment method provided by an embodiment of the present application, including:
S201: Receive training task information.
S202: Parse the training task information to obtain the storage address of the training package, and the state information and device address of each computing node that performs the training task.
In one implementation, a web client may store the training package and its corresponding training task information, where the training task information contains the state information and device address of each computing node that performs the training task.
The management node or management device obtains the training package and the training task information from the web client. It stores the training package at some location and adds the storage address to the training task information. In addition, the management node or management device may parse the training task information, determine the computing nodes that perform the training task, and send the training task information, with the storage address added, to each of those computing nodes.
In this way, the training task information received by the first computing node, which executes the present method, contains the storage address of the training package, and the first computing node can obtain that storage address by parsing the received training task information.
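The handling on the management side described above can be sketched as follows. This is a minimal illustration under stated assumptions: the application does not specify the format of the training task information, so the dictionary layout, the field names `nodes`, `state`, `address` and `package_address`, and the `send` callback are all hypothetical.

```python
import copy

# Hypothetical training task information as obtained from the web client.
task_info = {
    "nodes": [
        {"state": "active", "address": "100.4.5.6"},
        {"state": "slave", "address": "100.8.2.3"},
        {"state": "slave", "address": "100.6.5.2"},
    ]
}


def prepare_task_info(task_info, storage_address):
    """Return a copy of the task information with the package storage address added."""
    enriched = copy.deepcopy(task_info)
    enriched["package_address"] = storage_address  # hypothetical field name
    return enriched


def dispatch(task_info, send):
    """Send the enriched task information to every computing node listed in it.

    `send(address, info)` is an assumed transport callback, e.g. a message
    sent over Ethernet; it is not an API defined by the application.
    """
    for node in task_info["nodes"]:
        send(node["address"], task_info)
```

Each receiving computing node can then read the storage address directly from the task information it is sent, as S202 requires.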
In this embodiment, the "information about each computing node that performs the training task" contained in the training task information may include the state information and device address of each computing node. As described above, a computing node is in one of two states, the active state or the slave state; a computing node in the active state may be called the master node, and a computing node in the slave state may be called a slave node. The device address may be any address through which the device can be reached, such as its IP address or MAC address, and is not specifically limited.
S203: In the state information of the computing nodes, look up the state information corresponding to the device address of the first computing node.
S204: Determine whether the state information found is the active state; if so, execute S205.
The state information and device addresses of the computing nodes obtained by parsing in S202 correspond to one another, for example as shown in Table 1:
Table 1
State information of the computing node    IP address of the computing node
Active state                               100.4.5.6
Slave state                                100.8.2.3
Slave state                                100.6.5.2
...                                        ...
Table 1 above is only an example and does not limit this embodiment.
Suppose the IP address of the first computing node executing this method is 100.4.5.6. The first computing node looks up its own IP address in the parsing result of S202 (which contains Table 1 above) and finds that its own state is the active state. The determination in S204 is therefore affirmative, and S205 is executed.
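The lookup in S203 and the determination in S204 can be sketched as follows, using the entries of Table 1. The list-of-dictionaries representation and the names `find_own_state` and `is_master` are illustrative assumptions; the application does not prescribe a data structure.

```python
ACTIVE = "active"
SLAVE = "slave"

# The parsing result of S202, holding the three example rows of Table 1.
table_1 = [
    {"state": ACTIVE, "address": "100.4.5.6"},
    {"state": SLAVE, "address": "100.8.2.3"},
    {"state": SLAVE, "address": "100.6.5.2"},
]


def find_own_state(nodes, own_address):
    """S203: look up the state recorded for this node's own device address."""
    for node in nodes:
        if node["address"] == own_address:
            return node["state"]
    return None  # this node does not take part in the training task


def is_master(nodes, own_address):
    """S204: the node proceeds to S205 only if its own state is the active state."""
    return find_own_state(nodes, own_address) == ACTIVE


print(is_master(table_1, "100.4.5.6"))  # True
print(is_master(table_1, "100.8.2.3"))  # False
```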
S205: Acquire the training package according to the storage address of the training package.
The parsing result of S202 also contains the storage address of the training package, from which the training package can be obtained. Note that in this embodiment only the computing node in the active state (the master node) accesses the storage address and acquires the training package.
S206: Deploy the training package in each computing node according to the device addresses of the computing nodes.
The parsing result of S202 also contains the device address of each computing node that performs the current training task. According to these device addresses, the package acquired in S205 can be deployed to each computing node of the current training task.
In one implementation, the computing nodes in the system are communicatively connected to one another through InfiniBand, so the master node can deploy the acquired training package over InfiniBand to each computing node that performs the training task.
Those skilled in the art will understand that the InfiniBand architecture is a "switched cable" technology that supports multiple concurrent links, and that networks based on the InfiniBand architecture have very high bandwidth. In this implementation, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. This improves the efficiency of training package deployment and makes full use of the bandwidth of InfiniBand.
Alternatively, the master node may deploy the training package to the slave nodes by other means, such as Ethernet. Optionally, Ethernet and an InfiniBand network may coexist; that is, in one implementation, the computing nodes in the system may exchange data over both Ethernet and the InfiniBand network.
With the embodiment shown in FIG. 2 of the present application, only the computing node in the active state acquires the training package and deploys it to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
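Steps S202 to S206 can be combined into one routine, sketched below. The field names `package_address`, `nodes`, `state` and `address`, and the callbacks `fetch` and `copy_to`, are hypothetical: `fetch` stands for reading the package from its storage address, and `copy_to` stands for a copy to another node over the internal network (for example InfiniBand). Neither is an API defined by the application.

```python
def handle_task_info(task_info, own_address, fetch, copy_to):
    """Run S202 to S206 on one computing node.

    Returns None on a slave node, otherwise a dict mapping each slave
    address to whether the copy to that node succeeded.
    """
    # S202: parse the task information.
    package_address = task_info["package_address"]
    nodes = task_info["nodes"]

    # S203: look up the state recorded for this node's own address.
    own_state = next(
        (n["state"] for n in nodes if n["address"] == own_address), None
    )

    # S204: only the node in the active state continues.
    if own_state != "active":
        return None

    # S205: the master node alone accesses the storage address.
    package = fetch(package_address)

    # S206: deploy to the other nodes of the task; the master already
    # holds the package locally, so it is skipped here.
    results = {}
    for node in nodes:
        if node["address"] == own_address:
            continue
        results[node["address"]] = copy_to(node["address"], package)
    return results
```

Passing the transport in as callbacks keeps the sketch independent of whether the copy runs over InfiniBand or Ethernet, matching the alternatives described above.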
In one implementation, the following may also be included after S103 of FIG. 1 or after S206 of FIG. 2:
If the state of the first computing node is the active state, then after the first computing node detects that every computing node performing the training task has successfully deployed the training package, it generates a marker file and sends the marker file to each of the computing nodes.
Those skilled in the art will understand that the embodiments shown in FIG. 1 and FIG. 2 only deploy the package, whereas this implementation can further detect whether the deployment succeeded. In the implementation described above, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. The master node can therefore determine whether each copy succeeded, and hence whether all copies of the package for the current training task succeeded.
After the master node detects that all copies of the package for the current training task have succeeded, that is, that the training package has been deployed successfully, it may generate a marker file and send it to each computing node that performs the current training task. In this way, the slave nodes also learn that the deployment succeeded.
After every computing node performing the training task has successfully deployed the training package, the training package is run to carry out data training, that is, to perform the training task.
As described above, the training task may be any task in which a large amount of data is learned from during a machine learning process, for example deep learning based on artificial neural networks. Before the training task is performed, this solution can be used to deploy the training package in the distributed system; once deployment is complete, the computing nodes in the system can carry out data training and perform the training task.
In the implementation in which the master node generates a marker file, the marker file can be used to determine whether the training package was deployed successfully. If it was, each computing node performing the current training task can run the training package deployed on it and carry out data training, that is, begin performing the training task.
In one implementation, if the state of the first computing node is the active state, then after it detects that some computing node has failed to deploy the training package, it outputs first prompt information indicating that the deployment failed.
That is, if the master node fails to copy the training package to a slave node, the master node can output the first prompt information. For example, the first prompt information may be output to the management device or management node to indicate that this package deployment failed, or it may be presented directly to the user. The first prompt information may identify the slave nodes to which the copy failed, and may also include the reason for the failure, such as the slave node's memory being full or a network fault, to help the relevant personnel with follow-up handling.
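The master node's decision between generating a marker file and outputting the first prompt information can be sketched as follows. The shape of `copy_results` and of the returned dictionaries is a hypothetical illustration; the application does not specify the content of the marker file or of the prompt information.

```python
def conclude_deployment(copy_results):
    """Decide what the master node emits after all copies were attempted.

    `copy_results` maps each slave address to a pair
    (succeeded, failure_reason_or_None).
    """
    failed = {
        address: reason
        for address, (ok, reason) in copy_results.items()
        if not ok
    }
    if failed:
        # First prompt information: names the failed slaves and the reasons.
        return {"kind": "first_prompt", "failed_nodes": failed}
    # All copies succeeded: a marker file is sent to every node of the task.
    return {"kind": "marker_file", "send_to": sorted(copy_results)}


print(conclude_deployment({
    "100.8.2.3": (False, "memory full"),
    "100.6.5.2": (True, None),
}))
# {'kind': 'first_prompt', 'failed_nodes': {'100.8.2.3': 'memory full'}}
```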
In addition, in the implementation in which the master node generates a marker file, if the state of the first computing node is not the active state, the first computing node determines whether the marker file is received within a preset time period; if not, it outputs second prompt information indicating that the deployment failed.
It can be understood that, for a slave node, if the marker file indicating successful deployment is not received within a certain period, the deployment can be considered to have failed, and the slave node may also output prompt information. To distinguish the two, the prompt information output by the master node is called the first prompt information, and the prompt information output by a slave node is called the second prompt information.
For a slave node, the preset time period may be counted from the moment the training task information is received, from the moment the node determines that its own state is the slave state, or from the moment the master node begins deploying the training package to it, among other options; this is not specifically limited.
For example, the second prompt information may be output to the management device or management node to indicate that this package deployment failed, or it may be presented directly to the user. The second prompt information may include the reason for the copy failure, such as memory being full or a network fault, to help the relevant personnel with follow-up handling.
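The check on the slave side can be sketched as a polling loop with a deadline. The callback `marker_received`, the polling interval, and the wording of the returned prompt are assumptions made for illustration; how the marker file arrives and how the prompt is delivered are left open by the application.

```python
import time


def wait_for_marker(marker_received, timeout_s, poll_s=0.01):
    """Return True if the marker file arrives within the preset time period.

    `marker_received` is an assumed zero-argument callback that reports
    whether the marker file from the master node is present.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker_received():
            return True
        time.sleep(poll_s)
    return marker_received()  # one final check once the period has elapsed


def slave_check(marker_received, timeout_s):
    """Return None on success, or the second prompt information on timeout."""
    if wait_for_marker(marker_received, timeout_s):
        return None
    return "deployment failed: marker file not received in time"
```

The point from which `timeout_s` is counted corresponds to whichever starting moment is chosen for the preset time period, for example receipt of the training task information.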
Corresponding to the foregoing method embodiments, an embodiment of the present application further provides an electronic device, which is a computing node in a distributed system.
As shown in FIG. 3, the electronic device provided by the embodiment of the present application includes a processor 301 and a memory 302;
the memory 302 is configured to store a computer program;
the processor 301 is configured to implement the following steps when executing the program stored in the memory 302:
receiving training task information, where the training task information contains information about each computing node that performs the training task;
determining, according to the training task information, whether the electronic device's own state is the active state;
if it is the active state, acquiring the training package and deploying the acquired training package to each computing node that performs the training task.
In one implementation, the processor 301 is further configured to implement the following steps when executing the program stored in the memory 302:
after receiving the training task information, parsing the training task information to obtain the storage address of the training package, and the state information and device address of each computing node that performs the training task;
looking up, in the state information of the computing nodes, the state information corresponding to the electronic device's own device address;
determining whether the state information found is the active state;
acquiring the training package according to the storage address of the training package;
deploying the training package in each computing node according to the device addresses of the computing nodes.
In one implementation, the processor 301 is further configured to implement the following step when executing the program stored in the memory 302:
if the electronic device's own state is the active state, then after detecting that every computing node performing the training task has successfully deployed the training package, generating a marker file and sending the marker file to each of the computing nodes.
In one implementation, the processor 301 is further configured to implement the following step when executing the program stored in the memory 302:
if the electronic device's own state is the active state, then after detecting that some computing node has failed to deploy the training package, outputting first prompt information indicating that the deployment failed.
In one implementation, the processor 301 is further configured to implement the following steps when executing the program stored in the memory 302:
if the electronic device's own state is not the active state, determining whether the marker file is received within a preset time period;
if not, outputting second prompt information indicating that the deployment failed.
In one implementation, the processor 301 is further configured to implement the following step when executing the program stored in the memory 302:
after every computing node performing the training task has successfully deployed the training package, running the training package to carry out data training.
In one implementation, the processor 301 is further configured to implement the following step when executing the program stored in the memory 302:
deploying the acquired training package over InfiniBand to each computing node that performs the training task.
The memory of the above electronic device may include random access memory (RAM), and may also include non-volatile memory, for example at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
With the embodiment shown in FIG. 3 of the present application, only the computing node in the active state acquires the training package and deploys it to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
receiving training task information, where the training task information contains information about each computing node that performs the training task;
determining, according to the training task information, whether the node's own state is the active state;
if it is the active state, acquiring the training package and deploying the acquired training package to each computing node that performs the training task.
In one implementation, the computer program, when executed, further implements the following steps:
after receiving the training task information, parsing the training task information to obtain the storage address of the training package, and the state information and device address of each computing node that performs the training task;
looking up, in the state information of the computing nodes, the state information corresponding to the node's own device address;
determining whether the state information found is the active state;
acquiring the training package according to the storage address of the training package;
deploying the training package in each computing node according to the device addresses of the computing nodes.
In one implementation, the computer program, when executed, further implements the following step:
if the node's own state is the active state, then after detecting that every computing node performing the training task has successfully deployed the training package, generating a marker file and sending the marker file to each of the computing nodes.
In one implementation, the computer program, when executed, further implements the following step:
if the node's own state is the active state, then after detecting that some computing node has failed to deploy the training package, outputting first prompt information indicating that the deployment failed.
In one implementation, the computer program, when executed, further implements the following steps:
if the node's own state is not the active state, determining whether the marker file is received within a preset time period;
if not, outputting second prompt information indicating that the deployment failed.
In one implementation, the computer program, when executed, further implements the following step:
after every computing node performing the training task has successfully deployed the training package, running the training package to carry out data training.
In one implementation, the computer program, when executed, further implements the following step:
deploying the acquired training package over InfiniBand to each computing node that performs the training task.
With the embodiments shown in the present application, only the computing node in the active state acquires the training package and deploys it to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
An embodiment of the present application further provides a distributed system which, as shown in FIG. 4, includes at least two computing nodes (computing node 1, computing node 2, ..., computing node n), each computing node being configured to:
receive training task information, where the training task information contains information about each computing node that performs the training task; determine, according to the training task information, whether its own state is the active state; and, if it is the active state, acquire the training package and deploy the acquired training package to each computing node that performs the training task.
In the system shown in FIG. 4, the user equipment may store the training task information directly on each computing node that performs the training task. Alternatively, a management device may be provided outside the system; the management device obtains the training task information, parses it, determines the computing nodes that perform the training task, and sends the training task information to each of those computing nodes.
In one implementation, the management device may obtain the training task information through the user equipment; for example, the training task information is stored in a web client, and the management device obtains it from the web client. The computing node in the active state obtains the training package from the management device.
Alternatively, as shown in FIG. 5, the distributed system may include multiple computing nodes and a management node. The management node may obtain the training task information, parse it, determine the computing nodes that perform the training task, and send the training task information to each of those computing nodes.
In one implementation, the management node may obtain the training task information through the user equipment; for example, the training task information is stored in a web client, and the management node obtains it from the web client. This is not specifically limited. The computing node in the active state obtains the training package from the management node.
In the embodiments shown in the present application, to distinguish the two in the description, a management device located inside the system is called a management node, and a management device located outside the system is called a management device.
Alternatively, as shown in FIG. 6 and FIG. 7, the distributed system may include multiple computing nodes, a management node, and switches; this is not specifically limited.
In the system shown in FIG. 5, FIG. 6, or FIG. 7, the management node is configured to: obtain and store the training package; add the storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task.
The computing node is specifically configured to:
receive the training task information sent by the management node; parse the training task information to obtain the storage address of the training package, and the state information and device address of each computing node that performs the training task; look up, in the state information of the computing nodes, the state information corresponding to its own device address; determine whether the state information found is the active state; and, if it is the active state, acquire the training package according to the storage address of the training package and deploy the training package in each computing node according to the device addresses of the computing nodes.
In one implementation, the computing node may be further configured to:
when its own state is the active state:
if it detects that every computing node performing the training task has successfully deployed the training package, generate a marker file and send the marker file to each of the computing nodes;
if it detects that some computing node has failed to deploy the training package, send first prompt information indicating that the deployment failed to the management node.
In one implementation, the computing node may be further configured to:
when its own state is not the active state, determine whether the marker file is received within a preset time period;
if not, send second prompt information indicating that the deployment failed to the management node.
In one implementation, the computing node may be further configured to:
after every computing node performing the training task has successfully deployed the training package, run the training package to carry out data training.
In one implementation, the computing nodes that perform the training task are communicatively connected to one another through InfiniBand.
As shown in FIG. 7, the computing nodes may exchange data with one another through an InfiniBand network switch, and the computing nodes may exchange data with the management node through an Ethernet switch, such as a Gigabit Ethernet switch.
Those skilled in the art will understand that the InfiniBand architecture is a "switched cable" technology that supports multiple concurrent links, and that networks based on the InfiniBand architecture have very high bandwidth. In this implementation, the master node deploys the training package to the slave nodes by copying it over the InfiniBand network. This improves the efficiency of training package deployment and makes full use of the bandwidth of InfiniBand.
Alternatively, the computing nodes may exchange data with one another over other networks, such as Ethernet. Ethernet and an InfiniBand network may coexist; that is, in one implementation, the computing nodes in the system may exchange data over both Ethernet and the InfiniBand network.
A specific implementation is described below, as shown in FIG. 8:
1. The web client stores the training package and the training task information.
2. The management node obtains the training package and the training task information from the web client.
3. The management node stores the obtained training package at some location and adds the storage address of the training package to the training task information.
4. The management node parses the training task information and determines the computing nodes that will perform the training task.
5. Through the Ethernet switch, the management node sends the training task information, now including the storage address, to the computing nodes determined in step 4, that is, the computing nodes that will perform the training task.
6. Each computing node receives the training task information sent by the management node and parses it to obtain the storage address of the training package and the status information and device address of each computing node that performs the training task.
7. The computing node looks up, in the parsed status information of the computing nodes, the status information corresponding to its own device address.
8. The computing node determines whether the status information found indicates the primary state. A computing node in the primary state is the master node, and a computing node in the slave state is a slave node.
9. If the state is the primary state, the computing node obtains the training package according to the parsed storage address of the training package.
10. According to the parsed device addresses of the computing nodes, the master node deploys the obtained training package over InfiniBand to each slave node that performs this training task.
11. After detecting that every slave node performing the training task has successfully copied the training package, the master node generates a marker file and sends the marker file to each slave node.
12. If the master node detects a slave node to which copying the training package failed, it outputs the first prompt information to the management node; alternatively, if a slave node does not receive the marker file within the preset time period, that slave node outputs the second prompt information to the management node.
13. If neither case in step 12 occurs, that is, once every computing node performing the training task has successfully deployed the training package, each computing node runs the training package deployed on it to perform data training.
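Steps 3 to 5, seen from the management node, can be sketched as follows. The dictionary layout of the task information, the `package_address` key, and the `send` callback are assumptions made for illustration; the application only says that the storage address is added to the task information and that the result is sent to the computing nodes performing the task.

```python
from typing import Callable, List


def build_task_message(task_info: dict, package_address: str) -> dict:
    """Return a copy of the task information with the storage address added (step 3)."""
    message = dict(task_info)
    message["package_address"] = package_address
    return message


def dispatch_task(task_info: dict, package_address: str,
                  send: Callable[[str, dict], None]) -> List[str]:
    """Determine the nodes performing the task and send them the message (steps 4-5)."""
    message = build_task_message(task_info, package_address)
    recipients = [node["address"] for node in message["nodes"]]
    for address in recipients:
        send(address, message)
    return recipients
```

Note that only the small task message travels from the management node to every computing node here; the package itself is fetched once, by whichever node finds itself marked as primary.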
If every computing node in the system obtained the training package from the management node during package deployment, network congestion would result. For example, suppose packages are deployed for several training tasks in the system: computing nodes 1 to 5 need the package for training task A, computing nodes 6 to 10 need the package for training task B, and computing nodes 11 to 15 need the package for training task C. If all 15 computing nodes had to obtain their packages from the management node, the network between the management node and the computing nodes would be under heavy bandwidth pressure.
In the embodiments of the present application, by contrast, one master node is designated for each training task, and only the master node obtains the package from the management node. In this example, only 3 computing nodes obtain packages from the management node, which reduces the bandwidth pressure on the network between the management node and the computing nodes.
In addition, after the master node obtains the package, it deploys the package to the slave nodes that perform the training task. Data exchange between the master node and the slave nodes differs from data exchange between the computing nodes and the management node: it can use the InfiniBand network or another network internal to the system, which offers high bandwidth and high speed and thus improves the efficiency of package deployment.
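The arithmetic behind the 15-node example can be checked with a few lines. The function and variable names are illustrative only, and the count of master-to-slave copies assumes one copy per slave node.

```python
from typing import Dict


def transfers_from_manager(nodes_per_task: Dict[str, int],
                           one_primary_per_task: bool) -> int:
    """Count package downloads that hit the management node."""
    if one_primary_per_task:
        return len(nodes_per_task)        # one download per training task
    return sum(nodes_per_task.values())   # one download per computing node


nodes_per_task = {"A": 5, "B": 5, "C": 5}
direct = transfers_from_manager(nodes_per_task, one_primary_per_task=False)      # 15
via_primary = transfers_from_manager(nodes_per_task, one_primary_per_task=True)  # 3
# Copies that move to the internal network (master to slaves) instead.
intra_cluster = sum(n - 1 for n in nodes_per_task.values())                      # 12
```

The total number of copies is unchanged (3 + 12 = 15); what changes is that 12 of them no longer cross the link between the management node and the computing nodes.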
With the embodiments shown in the present application, only the computing node in the primary state obtains the training package and deploys it to each computing node that performs the training task. In other words, not every computing node obtains the package from the management device, which reduces network bandwidth pressure.
Embodiments of the present application further provide executable program code, the executable program code being configured to be run to perform any of the package deployment methods described above.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the electronic device embodiment shown in FIG. 3, the distributed system embodiments shown in FIGS. 4-7, the computer-readable storage medium embodiment above, and the executable program code embodiment above are described relatively briefly because they are substantially similar to the package deployment method embodiments shown in FIGS. 1-2; for relevant details, refer to the description of those method embodiments.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (15)

  1. A package deployment method, characterized in that the method is applied to a first computing node in a distributed system and comprises:
    receiving training task information, the training task information containing information about each computing node that performs a training task;
    determining, according to the training task information, whether the state of the first computing node is a primary state;
    if the state is the primary state, obtaining a training package and deploying the obtained training package to each computing node that performs the training task.
  2. The method according to claim 1, characterized in that, after receiving the training task information, the method further comprises:
    parsing the training task information to obtain a storage address of the training package and the status information and device address of each computing node that performs the training task;
    wherein determining, according to the training task information, whether the state of the first computing node is the primary state comprises:
    looking up, in the status information of the computing nodes, the status information corresponding to the device address of the first computing node;
    determining whether the status information found indicates the primary state;
    wherein obtaining the training package comprises:
    obtaining the training package according to the storage address of the training package;
    wherein deploying the obtained training package to each computing node that performs the training task comprises:
    deploying the training package on each of the computing nodes according to the device addresses of the computing nodes.
  3. The method according to claim 1, characterized in that the method further comprises:
    if the state of the first computing node is the primary state, after detecting that every computing node performing the training task has successfully deployed the training package, generating a marker file and sending the marker file to each of the computing nodes.
  4. The method according to claim 1, characterized in that the method further comprises:
    if the state of the first computing node is the primary state, after detecting that a computing node has failed to deploy the training package, outputting first prompt information indicating the deployment failure.
  5. The method according to claim 3, characterized in that the method further comprises:
    if the state of the first computing node is not the primary state, determining whether the marker file is received within a preset time period;
    if not, outputting second prompt information indicating the deployment failure.
  6. The method according to claim 1, characterized in that the method further comprises:
    after every computing node performing the training task has successfully deployed the training package, running the training package to perform data training.
  7. The method according to any one of claims 1-6, characterized in that deploying the obtained training package to each computing node that performs the training task comprises:
    deploying the obtained training package over InfiniBand to each computing node that performs the training task.
  8. An electronic device, characterized by comprising a memory and a processor, wherein:
    the memory is configured to store a computer program;
    the processor is configured to implement the method steps of any one of claims 1-7 when executing the program stored in the memory.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method steps of any one of claims 1-7.
  10. A distributed system, characterized by comprising at least two computing nodes;
    each computing node being configured to: receive training task information, the training task information containing information about each computing node that performs a training task; determine, according to the training task information, whether its own state is a primary state; and, if the state is the primary state, obtain a training package and deploy the obtained training package to each computing node that performs the training task.
  11. The system according to claim 10, characterized in that the system further comprises a management node;
    the management node being configured to: obtain and store the training package; add the storage address of the training package to the training task information; and send the training task information to each computing node that performs the training task;
    the computing node being specifically configured to:
    receive the training task information sent by the management node; parse the training task information to obtain the storage address of the training package and the status information and device address of each computing node that performs the training task; look up, in the status information of the computing nodes, the status information corresponding to its own device address; determine whether the status information found indicates the primary state; and, if it does, obtain the training package according to its storage address and deploy the training package on each of the computing nodes according to their device addresses.
  12. The system according to claim 11, characterized in that the computing node is further configured to:
    when its own state is the primary state:
    if it detects that every computing node performing the training task has successfully deployed the training package, generate a marker file and send the marker file to each of the computing nodes;
    if it detects that a computing node has failed to deploy the training package, send first prompt information indicating the deployment failure to the management node.
  13. The system according to claim 12, characterized in that the computing node is further configured to:
    when its own state is not the primary state, determine whether the marker file is received within a preset time period;
    if not, send second prompt information indicating the deployment failure to the management node.
  14. The system according to claim 10, characterized in that the computing node is further configured to:
    after every computing node performing the training task has successfully deployed the training package, run the training package to perform data training.
  15. The system according to any one of claims 10-14, characterized in that the computing nodes performing the training task are communicatively connected to one another over InfiniBand.
PCT/CN2018/090263 2017-06-08 2018-06-07 Package deployment method, electronic device and distributed system WO2018224005A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710429234.1 2017-06-08
CN201710429234.1A CN109032610B (en) 2017-06-08 2017-06-08 Program package deployment method, electronic equipment and distributed system

Publications (1)

Publication Number Publication Date
WO2018224005A1 true WO2018224005A1 (en) 2018-12-13

Family

ID=64566889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/090263 WO2018224005A1 (en) 2017-06-08 2018-06-07 Package deployment method, electronic device and distributed system

Country Status (2)

Country Link
CN (1) CN109032610B (en)
WO (1) WO2018224005A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783968A (en) * 2020-06-30 2020-10-16 山东信通电子股份有限公司 Power transmission line monitoring method and system based on cloud edge cooperation
CN112506955A (en) * 2020-12-10 2021-03-16 星环信息科技(上海)股份有限公司 Query processing method, computer equipment and storage medium
CN114721804A (en) * 2022-04-15 2022-07-08 支付宝(杭州)信息技术有限公司 Task scheduling method and device and electronic equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723147B (en) * 2019-03-21 2023-07-25 杭州海康威视数字技术股份有限公司 Block chain-based data training method, device and equipment and storage medium
CN112148468B (en) * 2019-06-28 2023-10-10 杭州海康威视数字技术股份有限公司 Resource scheduling method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
CN103744899A (en) * 2013-12-25 2014-04-23 浪潮电子信息产业股份有限公司 Distributed environment based mass data rapid classification method
CN104753994A (en) * 2013-12-27 2015-07-01 杭州海康威视系统技术有限公司 Method and device for data synchronization based on cluster server system
CN105894087A (en) * 2015-01-26 2016-08-24 华为技术有限公司 System and method for training parameter set in neural network
CN106529673A (en) * 2016-11-17 2017-03-22 北京百度网讯科技有限公司 Deep learning network training method and device based on artificial intelligence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102486739B (en) * 2009-11-30 2015-03-25 国际商业机器公司 Method and system for distributing data in high-performance computer cluster
CN102404381A (en) * 2011-09-02 2012-04-04 西安交通大学 Software deployment system and deployment method based on workflow in cloud computing environment
CN102546782B (en) * 2011-12-28 2015-04-29 北京奇虎科技有限公司 Distribution system and data operation method thereof
CN103078941B (en) * 2012-12-31 2016-01-20 中金数据系统有限公司 A kind of method for scheduling task of distributed computing system
CN105187465B (en) * 2014-06-20 2019-03-01 中国科学院深圳先进技术研究院 A kind of sharing method of file, apparatus and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783968A (en) * 2020-06-30 2020-10-16 山东信通电子股份有限公司 Power transmission line monitoring method and system based on cloud edge cooperation
CN111783968B (en) * 2020-06-30 2024-05-31 山东信通电子股份有限公司 Power transmission line monitoring method and system based on cloud edge cooperation
CN112506955A (en) * 2020-12-10 2021-03-16 星环信息科技(上海)股份有限公司 Query processing method, computer equipment and storage medium
CN112506955B (en) * 2020-12-10 2021-09-21 星环信息科技(上海)股份有限公司 Query processing method, computer equipment and storage medium
CN114721804A (en) * 2022-04-15 2022-07-08 支付宝(杭州)信息技术有限公司 Task scheduling method and device and electronic equipment

Also Published As

Publication number Publication date
CN109032610A (en) 2018-12-18
CN109032610B (en) 2024-04-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18814307

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18814307

Country of ref document: EP

Kind code of ref document: A1