WO2016119429A1 - System and method for training a parameter set in a neural network - Google Patents

System and method for training a parameter set in a neural network

Info

Publication number
WO2016119429A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
parameter
node
master
nodes
Prior art date
Application number
PCT/CN2015/086011
Other languages
English (en)
French (fr)
Inventor
陈嘉
曾嘉
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP15879628.4A priority Critical patent/EP3196809A4/en
Publication of WO2016119429A1 publication Critical patent/WO2016119429A1/zh
Priority to US15/455,259 priority patent/US20170185895A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Definitions

  • the present invention relates to the field of data processing, and more particularly to systems and methods for training parameter sets in neural networks in the field of data processing.
  • the neural network is a mathematical model that simulates the synaptic structure of the brain for information processing. It is an abstraction, simplification and simulation of the human brain, which can reflect the basic characteristics of the human brain.
  • a neural network consists of a large number of nodes (also called neurons) and a weighted connection between each other. Each node represents a specific output function, called an excitation function; and the connection between every two nodes represents a weighting value for the signal passing through the connection, called a weight.
  • a neural network can be expressed mathematically as Y = f(X, W), where X is the network input, Y is the network output, and W is the network's parameter set; training the neural network means finding the parameter set W in this function.
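  • For supervised learning, a per-sample cost C_i(W) is defined for each training pair (X_i, Y_i) in the data set D, and training seeks the W that minimizes the total cost. The source gives the cost function only as an image, so the following is a minimal sketch assuming the common squared-error form:

        C_i(W) = \lVert f(X_i, W) - Y_i \rVert^{2}, \qquad
        W^{*} = \arg\min_{W} \sum_{i} C_i(W)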
  • Deep learning is one of the training methods for neural networks.
  • deep learning can be well used to solve practical application problems such as speech recognition, image recognition and text processing.
  • Neural networks usually need to be trained using large-scale training data to ensure that the neural network operation results achieve a certain degree of accuracy.
  • correspondingly, the larger the training data set, the greater the amount of computation and the longer the training time.
  • to speed up the training of neural networks, coprocessors such as Graphics Processing Units (GPUs) are widely used in deep learning training computations.
  • however, the memory of these coprocessors is relatively small and cannot accommodate the weight parameter set of a large neural network.
  • in the prior art solution, a master node sends a copy of the neural network to each computing node and instructs the computing nodes to perform training.
  • Each computing node is equipped with at least one GPU for arithmetic processing.
  • the master node periodically queries the state of the computing node when the computing node performs training, and updates the weighting parameters of the replica neural network on the master node and the computing node after the computing node reaches the stop state.
  • a large number of computing nodes are used to cooperatively train a large neural network, using a conventional synchronous update. In this way, all computing nodes in the system can only be trained based on the same parameter set W at the same time, and the overall performance of the system is limited by the slowest node and the system network bandwidth. When one or some nodes fail, it will have a serious impact on the entire training process.
  • therefore, the existing neural network training system has poor reliability: it supports only one master node, and
  • when the master node fails, the entire training fails.
  • moreover, the computing nodes of the existing training system can only train based on the same parameter set W at the same time, and the scale and overall performance of the system are limited by the memory size of the master node and the computing nodes.
  • Embodiments of the present invention provide a system and method for training a parameter set in a neural network, which can improve the reliability and training efficiency of the neural network training process.
  • a system for training a parameter set in a neural network, comprising:
  • a master node set, where the master node set includes M master nodes,
  • the master node set is configured to control the process of training the parameter set in the neural network and to store the data set and the parameter set used by that process,
  • the data set includes a plurality of data subsets,
  • the parameter set includes a plurality of parameter subsets,
  • the plurality of parameter subsets are respectively stored on different master nodes,
  • the collection of the parameter subsets stored by all the master nodes in the master node set is the parameter set,
  • the M master nodes are communicatively connected to each other, and at least one of the M master nodes is configured to back up the parameter set, where M is a positive integer greater than 1; and
  • N training node sets, each of the N training node sets being communicatively connected to the master node set, where a training node set comprises a plurality of training nodes, each training node being configured to receive the data subset and the parameter set delivered by the master node set, to train the parameter subset it is responsible for according to the received data subset and parameter set, and to send the training result to the master node storing that parameter subset,
  • where N is a positive integer greater than 1, the data subsets used for training by any two of the N training node sets are different, and
  • the collection of the parameter subsets trained by all the training nodes in each training node set is the parameter set.
  • the training result is the parameter change amount of the parameter subset the training node is responsible for, obtained by training that parameter subset according to the received data subset and the parameter set; the master node in the master node set is further used to:
  • receive the parameter change amount sent by the training node, and update the parameter subset stored on the master node according to the parameter change amount.
  • the master node set is specifically used to: divide the parameter set into a plurality of parameter subsets;
  • store the plurality of parameter subsets respectively on different master nodes, where the collection of the parameter subsets stored by all the master nodes in the master node set is the parameter set; and determine each training node in the N training node sets according to the sizes of the parameter subsets.
  • the master node is specifically used to: update the parameter subset stored on the master node at a first moment according to the parameter change amount sent by a first training node of a first training node set, and
  • update the parameter subset stored on the master node at a second moment according to the parameter change amount sent by a second training node of a second training node set.
  • the master node set is specifically used to:
  • determine, according to the accuracy of the training result, whether to stop the process of training the parameter set.
  • the training node is further used to: receive an instruction sent by the master node set and stop the process of training the parameter set.
  • a method for training a parameter set in a neural network, the method being performed by the master node set in a system for training a parameter set in a neural network according to any one of claims 1 to 7, the system further comprising N training node sets, where the master node set includes M master nodes, the M master nodes are communicatively connected to each other, M is a positive integer greater than 1, and N is a positive integer greater than 1; the method includes:
  • the master node set stores the data set and the parameter set used for training, where the data set includes a plurality of data subsets, the parameter set includes a plurality of parameter subsets, and the plurality of parameter subsets are respectively stored on different master nodes;
  • the collection of the parameter subsets stored by all the master nodes in the master node set is the parameter set, and at least one of the M master nodes is used to back up the parameter set;
  • a master node in the master node set sends a data subset and the parameter subset to the training node responsible for training the parameter subset stored by that master node;
  • the master node in the master node set receives the training result sent by the training node, where the training node belongs to a training node set, and the training node set is communicatively connected to the master node set; the training node
  • set includes a plurality of training nodes, and the training result is obtained by training according to the received data subset and parameter set delivered by the master node set.
  • the training result is the parameter change amount of the parameter subset that the training node is responsible for, obtained by training that parameter subset according to the data subset and the parameter set received from the master node set;
  • the method further includes:
  • the master node in the set of master nodes receives the parameter change amount sent by the training node;
  • the master node in the master node set updates the parameter subset stored in the master node according to the parameter change amount.
  • the master node set stores a data set and a parameter set used for training, including:
  • the set of master nodes divides the parameter set into a plurality of parameter subsets
  • the plurality of parameter subsets are respectively stored on different master control nodes, wherein a collection of parameter subsets stored by all the master control nodes in the master control node set is the parameter set;
  • the method further includes:
  • the set of master nodes determines each of the N training node sets according to a size of a plurality of the parameter subsets.
  • updating, by the master node in the master node set, the parameter subset stored on the master node according to the parameter change amount includes:
  • the master node in the set of master nodes updates the parameter subset stored in the master node according to the parameter change amount sent by the first training node of the first training node set at the first moment;
  • the master node in the master control node updates the parameter subset stored in the master node according to the parameter change amount sent by the second training node of the second training node set at the second moment.
  • the method further includes:
  • the set of master nodes determines whether to stop the process of the training parameter set according to the accuracy of the training result.
  • a parameter subset is stored and maintained by at least one master node and is correspondingly trained by at least two training nodes, where the at least two training nodes belong to different training node sets, the data subsets
  • used for training by any two of the plurality of training node sets are different, and the collection of the parameter subsets trained by all the training nodes in each training node set is the parameter set.
  • the training nodes in the same training node set are communicatively connected to each other.
  • the system and method for training a parameter set in a neural network control the training process through a master node set formed by a plurality of pairwise communicatively connected master nodes; avoiding the failure of the entire training caused by the failure of a single master node improves the reliability of the training process, and configuring multiple training node sets to train the parameter set in parallel improves training efficiency.
  • FIG. 1 is a schematic block diagram of a system for training a parameter set in a neural network, in accordance with an embodiment of the present invention.
  • FIG. 2 is a schematic block diagram of a computing device in accordance with an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a system workflow for a training parameter set in a neural network, in accordance with an embodiment of the present invention.
  • FIG. 4 is a schematic flow chart of a training process in accordance with an embodiment of the present invention.
  • FIG. 5 is a schematic flow chart of a method for training a parameter set in a neural network according to an embodiment of the present invention.
  • FIG. 1 shows a schematic block diagram of a system 100 for training a parameter set in a neural network in accordance with an embodiment of the present invention.
  • system 100 includes:
  • a master node set 110: the master node set 110 includes M master nodes and is used to control the process of training the parameter set in the neural network and to store
  • the data set and the parameter set used by that process; the data set includes a plurality of data subsets, the parameter set includes a plurality of parameter subsets, the plurality of parameter subsets are respectively stored on different master nodes, and
  • the collection of the parameter subsets stored by all the master nodes in the master node set 110 is the parameter set; the M master nodes are communicatively connected to each other, and at least one of the M master nodes is used to back up the parameter set, where M is a positive integer greater than 1; and
  • N training node sets 120: each of the N training node sets 120 is communicatively connected to the master node set 110, and a training node set 120 comprises a plurality of training nodes, each used to receive
  • the data subset and the parameter set delivered by the master node set 110, to train the parameter subset it is responsible for according to the received data subset and parameter set, and to send the training result to the master node storing that parameter subset,
  • where N is a positive integer greater than 1, the data subsets used by any two of the N training node sets 120 are different, and the collection of the parameter subsets trained by all the training nodes in each training node set 120 is the parameter set.
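  • A minimal sketch of this topology (illustrative only; the class and field names below are not from the patent): M master nodes each hosting a parameter subset, one master node holding the backup, and N training node sets whose nodes together cover the whole parameter set:

        from dataclasses import dataclass, field
        from typing import Dict
        import numpy as np

        @dataclass
        class MasterNode:
            """Hosts (stores, updates, maintains) one or more parameter subsets."""
            name: str
            subsets: Dict[str, np.ndarray] = field(default_factory=dict)

        @dataclass
        class TrainingNodeSet:
            """One of the N sets; its nodes together cover a complete copy of W."""
            name: str
            responsibilities: Dict[str, str]   # training node -> parameter subset it trains

        masters = [MasterNode("M1", {"W1": np.zeros(4)}),
                   MasterNode("M2", {"W2": np.zeros(4)}),
                   MasterNode("M3", {"backup": np.zeros(4)})]      # backup of the parameter set
        training_sets = [TrainingNodeSet(f"C{k}", {f"C{k}_1": "W1", f"C{k}_2": "W2"})
                         for k in range(1, 3)]                      # N = 2 sets train in parallel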
  • therefore, the system for training a parameter set in a neural network provided by this embodiment of the present invention controls the training process through a master node set formed by multiple pairwise communicatively connected master nodes, which avoids
  • failure of the entire training when a single master node fails and can improve the reliability of the training process;
  • by configuring multiple training node sets to train the parameter set in parallel, the training efficiency can be improved.
  • system 100 of training parameter sets includes a master node set 110 and at least two training node sets 120.
  • the master node set 110 includes at least two master nodes that are communicatively connected to each other, and at least one master node is used to back up the parameter set, which can improve the reliability of the training process.
  • the set of training nodes 120 may be divided by the set of master nodes 110 according to the size of the data processing and the performance of the training nodes used to form the training node set 120 (such as memory size, etc.).
  • the system 100 of the training parameter set of the embodiment of the present invention can be applied to the training process of the neural network.
  • the master node set 110 is used to control the training process, such as controlling the start or end of the training process, controlling the subset of data used by each set of training nodes, and determining each training node in the set of training nodes, and the like.
  • the master node set 110 is also used to store the data set D and parameter set W used by the training process.
  • the parameter set W includes a plurality of parameter subsets, and the plurality of parameter subsets are respectively stored on different master control nodes, and the collection of parameter subsets stored by all the master control nodes in the master control node set 110 is the parameter set W.
  • the training node in a training node set 120 is configured to receive the data subset and the current parameter set W delivered by the master node set 110, train the parameter subset it is responsible for according to the received data subset and the current parameter set W, and send the parameter change amount ΔW obtained from that training, which is used for updating, to the master node.
  • the data subsets used by any two training node sets 120 of the N training node sets 120 are different, and the collection of parameter subsets trained by all the training nodes in each training node set 120 is This parameter set. That is, multiple training node sets 120 process different data subsets in parallel. For the same parameter subset, multiple training nodes train at the same time, which can improve the efficiency of the training process.
  • therefore, the system for training a parameter set in a neural network controls the training process through a master node set formed by multiple pairwise communicatively connected master nodes, which avoids failure of the entire training when a single master node fails.
  • the reliability of the training process can be improved.
  • the training efficiency can be improved.
  • the data set includes a plurality of data subsets
  • the parameter set includes a plurality of parameter subsets
  • the data subsets used by any two of the N training node sets 120 are different.
  • system 100 of training parameter sets includes more than one training node set 120.
  • the data set stored by the master node set 110 includes a plurality of data subsets, and the training time master node set 110 sends different data subsets to different training node sets 120.
  • the parameter set stored by the master node set 110 includes a plurality of parameter subsets, and the master node in the master node set 110 stores and is responsible for maintaining different parameter subsets.
  • the training node in the training node set 120 responsible for a certain subset of parameters receives the subset of parameters stored and responsible for maintenance from the corresponding master node, and the collection of parameter subsets received from the plurality of master nodes is a parameter set.
  • the master node set 110 includes more than one master node
  • the system 100 includes at least two training node sets 120
  • the training node set 120 includes more than one training node.
  • therefore, the system for training a parameter set in a neural network provided by this embodiment of the present invention controls the training process through a master node set formed by multiple pairwise communicatively connected master nodes, which avoids
  • failure of the entire training when a single master node fails and can improve the reliability of the training process. In addition,
  • by configuring multiple training node sets to train the parameter set in parallel, the training efficiency can be improved.
  • the master node in the master node set 110 and the training nodes in the training node set 120 are all computing devices.
  • 2 shows a schematic block diagram of a computing device in accordance with an embodiment of the present invention.
  • the computing device may include a processing module, a storage module, a coprocessing module for computation (e.g., a Graphics Processing Unit (GPU), an Intel Many Integrated Core (Intel MIC) processor, a Field-Programmable Gate Array (FPGA), etc.), and a communication module for communicating between training nodes and master nodes or within the master node set 110.
  • the parameter set used by at least one of the N training node sets 120 is different from the parameter set currently stored in the master node set 110.
  • the master node is specifically configured to:
  • update, at a first moment, the parameter subset stored on the master node according to the parameter change amount sent by the first training node of the first training node set; and update, at a second moment, the parameter subset stored on the master node according to the parameter change amount sent by the second training node of the second training node set.
  • each training node set 120 in system 100 operates independently and in parallel without affecting each other.
  • the failure of any one of the training node sets 120 does not affect the overall system 100 to continue training.
  • at least one of the N training node sets 120 calculates that the parameter set used is different from the parameter set currently stored in the master node set 110.
  • the parameter set used by at least one of the N training node sets 120 is different from the parameter set used by the other training node set 120 training.
  • that is, the update of the parameter set W by the master node set 110 is asynchronous: at a first moment the master node updates the parameter subset it stores according to the parameter change amount sent by a first training node of a first training node set, and
  • at a second moment it updates that parameter subset according to the parameter change amount sent by a second training node of a second training node set.
  • the current parameter set W of the master node set 110 may have been different from the parameter set W used by the training node set 120 for training.
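  • A minimal sketch of this asynchronous behaviour (illustrative only: the HostNode class, the pull/push interface, and the additive learning-rate rule are assumptions, not the patent's prescribed update): the host master node applies each arriving change amount immediately, so copies pulled at different moments differ:

        import numpy as np

        class HostNode:
            """Stand-in for a master node that hosts one parameter subset W_i."""
            def __init__(self, w_init):
                self.w = np.asarray(w_init, dtype=float)
                self.version = 0                      # number of updates applied so far

            def pull(self):
                """A training node downloads the latest copy of the subset."""
                return self.w.copy(), self.version

            def push(self, delta_w, lr=0.1):
                """Apply a change amount as soon as it arrives (asynchronous update);
                the sender may have computed delta_w from an older copy of W."""
                self.w -= lr * np.asarray(delta_w, dtype=float)
                self.version += 1

        host = HostNode(np.zeros(3))
        w_c1, v1 = host.pull()                        # first moment: set C1 pulls W_i
        host.push([0.5, -0.2, 0.1])                   # C1's change amount is applied
        w_c2, v2 = host.pull()                        # second moment: set C2 pulls a newer W_i
        host.push([0.1, 0.1, 0.0])                    # C2's change amount arrives later
        print(v1, v2, host.version)                   # 0 1 2 -> the copies C1 and C2 used differ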
  • the master node set 110 may be specifically configured to:
  • the plurality of parameter subsets are respectively stored on different master control nodes, wherein a collection of parameter subsets stored by all the master control nodes in the master control node set 110 is the parameter set;
  • determine each training node in the N training node sets 120 according to the sizes of the parameter subsets.
  • the master node set 110 performs initialization work at the beginning of the training, for example, dividing the training node set 120, configuring the trained data set and parameter set, initializing the original model, and the like.
  • configuring the training parameter set W specifically means dividing the parameter set W into a plurality of parameter subsets W1, W2, ..., WK.
  • Each master node is responsible for maintaining one or more parameter subsets. If a parameter subset Wi is stored, updated, and maintained by master node Mj, then Mj is called the host node of Wi.
  • According to the size of the parameter set W and the memory size of each training node (or of the training node's coprocessor) used to form the training node sets 120, the master node set 110 partitions all the training nodes used to form the training node sets 120.
  • In general, the larger a parameter subset is, the more capable the training nodes assigned to it need to be. It is assumed that there are P training node sets 120 in total, denoted C1, C2, ..., CP.
  • Each training node is responsible for at least one parameter subset, and each training node set 120 collectively stores and processes a complete copy of the parameter set W.
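  • A minimal sketch of such a partitioning and assignment step (illustrative; the patent does not fix a particular partitioning algorithm). The parameter vector is split into K subsets, each subset has a host master node, and within a training node set the node with the most memory is given the largest subset:

        import numpy as np

        def partition_parameters(w, k):
            """Split the flat parameter vector W into K roughly equal parameter subsets."""
            return {f"W{i + 1}": part for i, part in enumerate(np.array_split(w, k))}

        def assign_training_nodes(subsets, node_memory):
            """Within one training node set, give larger subsets to more capable nodes.

            node_memory: training-node name -> available (co)processor memory (illustrative units).
            """
            by_size = sorted(subsets, key=lambda name: subsets[name].size, reverse=True)
            by_memory = sorted(node_memory, key=node_memory.get, reverse=True)
            return dict(zip(by_size, by_memory))

        w = np.arange(10.0)                                   # toy parameter set W
        subsets = partition_parameters(w, k=2)                # {"W1": ..., "W2": ...}
        host_of = {"W1": "M1", "W2": "M2"}                    # host master node of each subset
        assignment = assign_training_nodes(subsets, {"Ck_1": 16, "Ck_2": 8})
        print(host_of, assignment)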
  • the master node set 110 backs up the parameter set by using a disk array RAID0/1/5/6 or an erasure code.
  • specifically, to ensure the reliability of the system 100, the master node set 110 may back up the parameter set using a RAID0/1/5/6 or erasure-coding scheme; in this way, when some master nodes fail, the system 100 can recover the failed parameter subsets through the corresponding decoding operation and keep running normally.
  • It should be understood that other coding methods may also be used to ensure the reliability of the system 100, which is not limited in this embodiment of the present invention.
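  • A minimal sketch of the simplest such scheme, a RAID-like XOR parity over two parameter subsets (illustrative; the patent also allows RAID 5/6 and general erasure codes). The parity block lets the system rebuild either subset if its host master node fails:

        import numpy as np

        def xor_parity(w1, w2):
            """Bytewise XOR of two equally-shaped float arrays (a RAID-like parity block)."""
            b1, b2 = w1.astype(np.float64).tobytes(), w2.astype(np.float64).tobytes()
            return bytes(a ^ b for a, b in zip(b1, b2))

        def recover(parity, surviving, shape):
            """Rebuild the lost parameter subset from the parity block and the surviving subset."""
            bs = surviving.astype(np.float64).tobytes()
            lost = bytes(a ^ b for a, b in zip(parity, bs))
            return np.frombuffer(lost, dtype=np.float64).reshape(shape)

        w1 = np.array([0.1, 0.2, 0.3])                 # parameter subset hosted on M1
        w2 = np.array([1.0, -2.0, 3.0])                # parameter subset hosted on M2
        parity = xor_parity(w1, w2)                    # backup kept by the backup master node
        w1_rebuilt = recover(parity, w2, w1.shape)     # M1 has failed: rebuild W1 from the backup and W2
        assert np.allclose(w1_rebuilt, w1)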
  • the training node may be specifically configured to:
  • receive an instruction sent by the master node set 110 and stop the process of training the parameter set.
  • the training node for the training node set Ck needs to access the host node of the parameter subset it is responsible for and download a copy of the latest parameter subset.
  • the collection of all the latest parameter subsets acquired by all the training nodes of the training node set Ck through the communication network is the latest parameter set, denoted as Wk .
  • Different sets of training nodes may acquire the latest parameter set W from the set of master nodes 110 at different times, while the parameter set W is constantly changing. Therefore, at the same time, the copies of the parameter set W used by the different sets of training node sets may be different.
  • to perform training, a training node of the training node set Ck also needs to obtain part of the data set, that is, a data subset, from the master node set 110; training nodes in the same training node set obtain the same data subset. The training node then trains on the parameter set Wk and the data subset to obtain the parameter change amount ΔWik of the parameter subset Wi it is responsible for, and sends the trained parameter change amount ΔWik to the master node responsible for the corresponding parameter subset Wi, that is, its host node.
  • the collection of parameter change amounts ΔWik computed by all the training nodes of the training node set Ck is denoted ΔWk.
  • the manner in which the training node obtains the parameter subset and the data from the master node set 110 is not limited in the embodiment of the present invention.
  • during training, a training node repeatedly receives the parameter set and a data subset and performs training, until it receives the stop-training instruction sent by the master node set 110, at which point the training node stops the process of training the parameter set.
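  • A minimal sketch of this loop from the training node's point of view (illustrative only: the in-memory host dictionary, the learning rate, and the toy gradient stand in for whatever transport and training algorithm a real deployment uses):

        import numpy as np

        def train_step(w_k, x, y):
            """Stand-in for the real training computation on this node's parameter subset."""
            return x.T @ (x @ w_k - y) / len(y)        # toy gradient; any algorithm fits here

        def training_node_loop(host, batches, lr=0.1):
            """One training node: repeatedly pull the latest subset, train on a data subset,
            and push the change amount ΔW back to the host master node, until told to stop.

            host: in-memory stand-in for the host master node, e.g. {"w": ..., "stop": False}
            """
            for x, y in batches:
                if host["stop"]:                       # stop instruction from the master node set
                    break
                w_k = host["w"].copy()                 # download the latest copy of the subset
                delta_w = train_step(w_k, x, y)
                host["w"] -= lr * delta_w              # upload ΔW; the host applies it asynchronously

        rng = np.random.default_rng(0)
        host = {"w": np.zeros(3), "stop": False}
        batches = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(10)]
        training_node_loop(host, batches)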
  • if, during training, the parameters handled by the training nodes in a training node set are interdependent, the necessary data exchange needs to be performed between the training nodes;
  • in this case, the training nodes in the same training node set can be communicatively connected to each other.
  • the training result is the parameter change amount of the parameter subset the training node is responsible for, obtained by the training node by training that parameter subset according to the received data subset and parameter set.
  • the master node in the master node set 110 is also used to:
  • the parameter subset stored in the master node is updated according to the parameter change amount.
  • specifically, a master node in the master node set 110 receives from a training node of a training node set Ck the parameter change amount ΔWik that the training node obtained by training on the data subset and the parameter set, and updates the parameter subset Wi that this master node is responsible for. That is, after receiving the complete parameter-set change amount ΔWk from a certain training node set Ck, the master node set updates the parameter set W of the neural network.
  • the update of the parameter set W by the master node set is asynchronous; that is, at a given moment, the current parameter set W of the master node set may already differ from the parameter set Wk that the training node set Ck is using in its training.
  • This asynchronous update method can make full use of the training ability of all training node sets.
  • the embodiment of the present invention does not limit the specific update method of the parameter set W of the master node set.
  • the master node set is specifically used to determine, according to the accuracy of the training result, whether to stop the process of training the parameter set. For example, the master node set may determine to stop training when the change ΔWk of the parameter set W is smaller than a certain threshold, or when the updated parameter set W changes the result Y computed from Y = f(X, W) by less than a certain threshold; this is not limited in this embodiment of the present invention.
  • in the following example, the system 100 provided by this embodiment of the present invention is applied to an image classification system based on a deep convolutional neural network and is trained using an optimization algorithm based on mini-batch Stochastic Gradient Descent (SGD). The input X of the deep convolutional neural network is an image, the output Y is an image class, and the data set of the training process is D = {(Xi, Yi)}.
  • the parameter set of the convolutional neural network is W, and the parameters used for the system training include the mini-batch size m and the learning rate α.
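  • With these hyperparameters, one mini-batch SGD step on a training node set presumably takes the usual averaged-gradient form (a minimal sketch; the patent shows the exact formulas only as images, and C is the cost function defined for the network above):

        \Delta W^{k} = \frac{1}{m} \sum_{i=1}^{m} \nabla_{W} C\big(f(X_i, W^{k}), Y_i\big),
        \qquad W \leftarrow W - \alpha \, \Delta W^{k}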
  • FIG. 3 is a schematic diagram of a system workflow of data processing in accordance with an embodiment of the present invention.
  • the parameter set W of the deep convolutional neural network is divided into two parameter subsets W1 and W2.
  • the master node set includes three master nodes M1, M2, and M3.
  • the master node M1 is the host node of the parameter subset W1,
  • the master node M2 is the host node of the parameter subset W2, and
  • the master node M3 stores a backup of the two subsets, W1 ⊕ W2, where ⊕ denotes the exclusive-or operation in this embodiment.
  • Each training node set Ck includes two training nodes Ck1 and Ck2, which are responsible for training the parameter subsets W1 and W2, respectively.
  • the training process 200 includes:
  • in step 210, the system 100 includes P training node sets Ck (k = 1, 2, ..., P); the training nodes Ck1 and Ck2 download the latest parameter subsets W1 and W2 from the master nodes M1 and M2, denoted W1k and W2k (if M1 or M2 has failed, the missing subset can be recovered using the backup stored on M3).
  • in step 220, the training nodes Ck1 and Ck2 both receive the same batch of training data from the master node set and perform forward-pass training based on the parameter subsets W1k and W2k, respectively.
  • the training nodes C k 1 and C k 2 can communicate with each other during training to perform the necessary data exchange.
  • in step 230, for each training data item, the training nodes Ck1 and Ck2 respectively compute their corresponding errors,
  • and then perform backward-pass training for their respective parameter subsets using the Error Back Propagation (BP) algorithm.
  • the training nodes C k 1 and C k 2 can communicate with each other to perform the necessary data exchange.
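  • As a toy illustration of this split forward/backward pass with the necessary data exchange (a sketch only: a two-weight linear model and squared error stand in for the patent's deep convolutional network; all names and shapes are illustrative):

        import numpy as np

        def forward_backward_two_nodes(x, y, w1, w2):
            """One forward/backward pass split across training nodes C_k^1 and C_k^2.

            Each node holds one parameter subset of a toy linear model
            y_hat = X[:, :h] @ W1 + X[:, h:] @ W2; only the partial outputs and the
            shared error are exchanged between the two nodes.
            """
            h = len(w1)
            partial_1 = x[:, :h] @ w1              # forward pass on C_k^1
            partial_2 = x[:, h:] @ w2              # forward pass on C_k^2
            # -- data exchange: partial outputs are shared so both nodes know the error --
            err = (partial_1 + partial_2) - y
            delta_w1 = x[:, :h].T @ err / len(y)   # backward pass on C_k^1 -> ΔW1^k
            delta_w2 = x[:, h:].T @ err / len(y)   # backward pass on C_k^2 -> ΔW2^k
            return delta_w1, delta_w2

        rng = np.random.default_rng(0)
        x, y = rng.normal(size=(8, 4)), rng.normal(size=8)
        dw1, dw2 = forward_backward_two_nodes(x, y, np.zeros(2), np.zeros(2))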
  • in step 240, the training nodes Ck1 and Ck2 respectively determine the parameter change amounts ΔW1k and ΔW2k of their parameter subsets.
  • in step 250, the training nodes Ck1 and Ck2 upload ΔW1k and ΔW2k to the master nodes M1 and M2, respectively.
  • the training nodes Ck1 and Ck2 repeat steps 210 through 250 until an instruction to terminate the training is received from the master node set.
  • in step 260, which is performed in parallel with steps 210 to 250, the master nodes M1 and M2 of the master node set operate as follows.
  • The master nodes M1 and M2 respectively receive the parameter change amounts ΔW1k and ΔW2k from the training nodes Ck1 and Ck2 of a training node set.
  • According to the received parameter change amounts, the master nodes M1 and M2 update the parameter subsets W1 and W2, respectively.
  • The master nodes M1 and M2 then transmit the updated parameter subsets W1 and W2 to the master node M3,
  • and the master node M3 updates the backup W3 accordingly.
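  • The update formulas referenced in step 260 appear only as images in the source. A minimal sketch, assuming the standard mini-batch SGD rule with learning rate α and the exclusive-or backup described for M3 above:

        W_1 \leftarrow W_1 - \alpha \, \Delta W_1^{k}, \qquad
        W_2 \leftarrow W_2 - \alpha \, \Delta W_2^{k}, \qquad
        W_3 = W_1 \oplus W_2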
  • in step 270, the master node set determines, according to the accuracy of the training result, whether to stop the training process. If the training stop condition is not satisfied, steps 210 to 270 are repeated; if the training stop condition is satisfied, step 280 is performed.
  • in step 280, the master node set sends an instruction to terminate the training to the training node sets.
  • therefore, the system for training a parameter set in a neural network controls the training process through a master node set formed by multiple pairwise communicatively connected master nodes, which avoids
  • failure of the entire training when a single master node fails and so improves the reliability of the training process; configuring a plurality of training node sets to train the parameter set in parallel improves the training efficiency.
  • the method 300 for training a parameter set in a neural network according to an embodiment of the present invention will now be described in detail.
  • FIG. 5 illustrates a method 300 for training a parameter set in a neural network according to an embodiment of the present invention. The method 300 is performed by the master node set in the above system for training a parameter set in a neural network, the system further comprising N training node sets, where the master node set includes M master nodes, the M master nodes are communicatively connected to each other, M is a positive integer greater than 1, and N is a positive integer greater than 1. The method 300 includes:
  • S310: the master node set stores the data set and the parameter set used for training, where the data set includes a plurality of data subsets, the parameter set includes a plurality of parameter subsets, and the plurality of parameter subsets are respectively stored on different master nodes;
  • the collection of the parameter subsets stored by all the master nodes in the master node set is the parameter set, and at least one of the M master nodes is used to back up the parameter set;
  • S320: a master node in the master node set sends a data subset and the parameter subset to the training node responsible for training the parameter subset stored by that master node;
  • S330: the master node in the master node set receives the training result sent by the training node, where the training node belongs to a training node set, the training node set is communicatively connected to the master node set, the training node set includes a plurality of training nodes, and the training result is obtained by training according to the received data subset and parameter set delivered by the master node set.
  • therefore, the method for training a parameter set in a neural network provided by this embodiment controls the training process through a master node set formed by multiple pairwise communicatively connected master nodes, which
  • avoids failure of the entire training when a single master node fails and improves the reliability of the training process;
  • training the parameter set in parallel through multiple training node sets improves training efficiency.
  • optionally, the training result is the parameter change amount of the parameter subset that the training node is responsible for, obtained by training that parameter subset according to the received data subset and parameter set delivered by the master node set;
  • the method 300 then further includes:
  • the master node in the master node set receives the parameter change amount sent by the training node;
  • the master node in the master node set updates the parameter subset stored in the master node according to the parameter change amount.
  • the master node set stores a data set and a parameter set used by the training, including:
  • the master node set divides the parameter set into a plurality of parameter subsets
  • the plurality of parameter subsets are respectively stored on different master control nodes, wherein a collection of parameter subsets stored by all the master control nodes in the master control node set is the parameter set;
  • the method 300 also includes:
  • the master node set determines each training node in the N training node sets according to the sizes of the parameter subsets.
  • the master node in the master node set updates the parameter subset stored in the master node according to the parameter change amount, including:
  • the master node in the master node set updates the parameter subset stored in the master node according to the parameter change amount sent by the first training node of the first training node set at the first moment;
  • the master node in the master node set updates the parameter subset stored in the master node according to the parameter change amount sent by the second training node of the second training node set at the second moment.
  • the method 300 further includes:
  • the master node set determines whether to stop the training parameter set according to the accuracy of the training result.
  • a parameter subset is stored and maintained by at least one master node and is correspondingly trained by at least two training nodes, where the at least two training nodes belong to different training node sets, the data subsets used for training by any two of the plurality of training node sets are different, and the collection of the parameter subsets trained by all the training nodes in each training node set is the parameter set.
  • the training nodes in the same training node set are communicatively connected between the two.
  • therefore, the method for training a parameter set in a neural network provided by this embodiment controls the training process through a master node set formed by multiple pairwise communicatively connected master nodes, which avoids failure of the entire training when a single master node fails and
  • improves the reliability of the training process; configuring multiple training node sets to train in parallel improves training efficiency.
  • Y corresponding to X means that Y is associated with X, and Y can be determined according to X.
  • determining Y from X does not mean that Y is determined solely from X, and that Y may also be determined from X and/or other information.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • in actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, or an electrical, mechanical or other form of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • based on such an understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A system and method for training a parameter set in a neural network. The system comprises: a master node set (110), configured to control the training process and to store the data set and parameter set used for training, where the master node set includes M master nodes, the M master nodes are communicatively connected to each other, and at least one of the M master nodes is used to back up the parameter set; and N training node sets (120), where a training node set includes a plurality of training nodes, and a training node is configured to perform training according to the data set and parameter set delivered by the master node set and to send the training result to the corresponding master node. The system and method avoid failure of the entire training when a single master node fails, improving the reliability of the training process, and improve training efficiency by configuring multiple training node sets to train in parallel.

Description

用于神经网络中训练参数集的系统和方法 技术领域
本发明涉及数据处理领域,尤其涉及数据处理领域中的用于神经网络中训练参数集的系统和方法。
背景技术
神经网络是一种模拟大脑神经突触结构来进行信息处理的数学模型,是对人脑的抽象、简化和模拟,可以反映人脑的基本特性。神经网络由大量的节点(也称为神经元)和相互之间的加权连接构成。每个节点代表一种特定的输出函数,称为激励函数;而每两个节点间的连接都代表一个对于通过该连接信号的加权值,称为权重。神经网络用数学函数可以表示为Y=f(X,W),其中,X代表网络的输入,Y代表网络的输出,W代表网络的参数集。
下面以监督学习为例来简单描述神经网络的训练问题。神经网络的训练即是要寻找上述函数中的参数集W。神经网络的训练过程为:给定训练的数据集D={(X1,Y1),(X2,Y2),...,(XN,YN)},对每一个训练数据(Xi,Yi),定义其价值函数为
Figure PCTCN2015086011-appb-000001
确定W,使得
Figure PCTCN2015086011-appb-000002
的值最小。
深度学习是针对神经网络的训练方法之一。目前,深度学习已经可以很好的用于解决语音识别、图象识别及文本处理等实际应用问题。神经网络通常需要使用大规模的训练数据进行训练,以保证神经网络的运算结果达到一定的准确度。相应地,训练数据规模越大就会使得计算量越大,训练所需的时间也越长。为了加快神经网络的训练速度,图形处理单元(Graphic Processing Unit,GPU)等协处理器被广泛应用于深度学习训练计算中。但是这些协处理器的内存相对较小,无法容纳大型神经网络的加权参数集。
并且,现有技术方案通过主控节点将神经网络各副本发送给运算节点,并指示运算节点进行训练。每个运算节点至少配备一个GPU进行运算处理。主控节点在运算节点进行训练时定时查询运算节点状态,并在运算节点达到停止状态后更新主控节点以及运算节点上副本神经网络加权参数。现有技术中,使用众多计算节点协同训练一个大型神经网络,采用传统的同步更新的 方式,系统中所有的计算节点只能同时基于相同的参数集W进行训练,系统的整体性能会被最慢的节点以及系统网络带宽所限制。当某个或某些节点失效时,会对整个训练过程带来严重影响。
因此,现有的神经网络的训练系统可靠性较差,仅支持一个主控节点,当主控节点失效时会导致整个训练的失败。并且,现有的训练系统的运算节点只能同时基于相同的参数集W进行训练,系统的规模和整体性能受限于主控节点以及运算节点的内存大小。
发明内容
本发明实施例提供了一种用于神经网络中训练参数集的系统和方法,能够提高神经网络训练过程的可靠性和训练效率。
第一方面,提供了一种用于神经网络中训练参数集的系统,所述系统包括:
主控节点集合,所述主控节点集合包括M个主控节点,所述主控节点集合用于控制所述神经网络中训练参数集的过程,并用于存储所述训练参数集的过程所使用的数据集和参数集,所述数据集包括多个数据子集,所述参数集包括多个参数子集,所述多个参数子集分别存储于不同的主控节点上,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集,所述M个主控节点两两之间通信连接,所述M个主控节点中的至少一个主控节点用于备份所述参数集,其中,M为大于1的正整数;以及
N个训练节点集合,所述N个训练节点集合中的每一个训练节点集合与所述主控节点集合通信连接,所述训练节点集合包括多个训练节点,所述训练节点用于接收所述主控节点集合下发的数据子集和所述参数集,根据接收的所述数据子集和所述参数集,对自身负责的参数子集进行训练,并将训练结果发送给存储所述参数子集的主控节点,其中,N为大于1的正整数,所述N个训练节点集合中的任意两个训练节点集合训练所使用的数据子集不同,所述每个训练节点集合中的所有训练节点所训练的参数子集的合集为所述参数集。
结合第一方面,在第一方面的第一种可能的实现方式中,所述训练结果为训练节点根据接收的所述数据子集和所述参数集,对自身负责的参数子集进行训练得到的自身负责的参数子集的参数变化量,所述主控节点集合中的 主控节点还用于:
接收所述训练节点发送的所述参数变化量;
根据所述参数变化量,对所述主控节点中存储的参数子集进行更新。
结合第一方面或第一方面的第一种可能的实现方式,在第一方面的第二种可能的实现方式中,所述主控节点集合具体用于:
将所述参数集划分为多个参数子集;
将所述多个参数子集分别存储于不同的主控节点上,其中,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集;
根据多个所述参数子集的大小确定所述N个训练节点集合中的每个训练节点。
结合第一方面和第一方面的第一种至第二种可能的实现方式中的任一种可能的实现方式,在第一方面的第三种可能的实现方式中,所述主控节点具体用于:
在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新;
在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新。
结合第一方面和第一方面的第一种至第三种可能的实现方式中的任一种可能的实现方式,在第一方面的第四种可能的实现方式中,所述主控节点集合具体用于:
根据训练结果的准确性,确定是否停止所述训练参数集的过程。
结合第一方面和第一方面的第一种至第四种可能的实现方式中的任一种可能的实现方式,在第一方面的第五种可能的实现方式中,所述训练节点还用于:
接收所述主控节点集合发送的指令,停止所述训练参数集的过程。
结合第一方面和第一方面的第一种至第五种可能的实现方式中的任一种可能的实现方式,在第一方面的第六种可能的实现方式中,同一所述训练节点集合中的训练节点两两之间通信连接。
第二方面,提供了一种用于神经网络中训练参数集的方法,所述方法执行于权利要求1至7中任一项所述的用于神经网络中训练参数集的系统中的主控节点集合,所述系统还包括N个训练节点集合,其中,所述主控节点集 合包括M个主控节点,所述M个主控节点两两之间通信连接,其中,M为大于1的正整数,N为大于1的正整数,所述方法包括:
所述主控节点集合存储训练所使用的数据集和参数集,所述数据集包括多个数据子集,所述参数集包括多个参数子集,所述多个参数子集分别存储于不同的主控节点上,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集,所述M个主控节点中的至少一个主控节点用于备份所述参数集;
所述主控节点集合中的主控节点向负责训练自身存储的参数子集的训练节点下发数据子集和所述参数子集;
所述主控节点集合中的主控节点接收所述训练节点发送的训练结果,其中所述训练节点属于训练节点集合,所述训练节点集合与所述主控节点集合通信连接,所述训练节点集合包括多个训练节点,所述训练结果是根据接收的所述主控节点集合下发的数据子集和参数集进行训练得到的。
结合第二方面,在第二方面的第一种可能的实现方式中,所述训练结果为训练节点根据接收的所述主控节点集合下发的所述数据子集和所述参数集,对自身负责的参数子集进行训练得到的参数子集的参数变化量,所述方法还包括:
所述主控节点集合中的主控节点接收所述训练节点发送的所述参数变化量;
所述主控节点集合中的主控节点根据所述参数变化量,对所述主控节点中存储的参数子集进行更新。
结合第二方面或第二方面的第一种可能的实现方式,在第二方面的第二种可能的实现方式中,所述主控节点集合存储训练所使用的数据集和参数集,包括:
所述主控节点集合将所述参数集划分为多个参数子集;
将所述多个参数子集分别存储于不同的主控节点上,其中,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集;
所述方法还包括:
所述主控节点集合根据多个所述参数子集的大小确定所述N个训练节点集合中的每个训练节点。
结合第二方面和第二方面的第一种至第二种可能的实现方式中的任一 种可能的实现方式,在第二方面的第三种可能的实现方式中,所述主控节点集合中的主控节点根据所述参数变化量,对所述主控节点中存储的参数子集进行更新,包括:
所述主控节点集合中的主控节点在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新;
所述主控节点集合中的主控节点在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新。
结合第二方面和第二方面的第一种至第三种可能的实现方式中的任一种可能的实现方式,在第二方面的第四种可能的实现方式中,所述方法还包括:
所述主控节点集合根据训练结果的准确性,确定是否停止所述训练参数集的过程。
结合第二方面和第二方面的第一种至第四种可能的实现方式中的任一种可能的实现方式,在第二方面的第五种可能的实现方式中,一个所述参数子集由至少一个主控节点存储并负责,并对应的由至少两个训练节点负责,所述至少两个训练节点属于不同的训练节点集合,所述多个训练节点集合中的任意两个训练节点集合训练所使用的数据子集不同,所述每个训练节点集合中的所有训练节点所训练的参数子集的合集为所述参数集。
结合第二方面和第二方面的第一种至第五种可能的实现方式中的任一种可能的实现方式,在第二方面的第六种可能的实现方式中,同一所述训练节点集合中的训练节点两两之间通信连接。
基于上述技术方案,本发明实施例提供的用于神经网络中训练参数集的系统和方法,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性,通过配置多个训练节点集合并行地对参数集进行训练,可以提高训练效率。
附图说明
为了更清楚地说明本发明实施例的技术方案,下面将对本发明实施例中 所需要使用的附图作简单地介绍,显而易见地,下面所描述的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是根据本发明实施例的用于神经网络中训练参数集的系统的示意性框图。
图2是根据本发明实施例的计算设备的示意性框图。
图3是根据本发明实施例的用于神经网络中训练参数集的系统工作流程的示意图。
图4是根据本发明实施例的训练过程的示意性流程图。
图5是根据本发明实施例的用于神经网络中训练参数集的方法的示意性流程图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明的一部分实施例,而不是全部实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都应属于本发明保护的范围。
图1示出了根据本发明实施例的用于神经网络中训练参数集的系统100的示意性框图。如图1所示,系统100包括:
主控节点集合110,该主控节点集合110包括M个主控节点,该主控节点集合110用于控制该神经网络中训练参数集的过程,并用于存储该训练参数集的过程所使用的数据集和参数集,该数据集包括多个数据子集,该参数集包括多个参数子集,该多个参数子集分别存储于不同的主控节点上,该主控节点集合110中的所有主控节点存储的参数子集的合集为该参数集,该M个主控节点两两之间通信连接,该M个主控节点中的至少一个主控节点用于备份该参数集,其中,M为大于1的正整数;以及
N个训练节点集合120,该N个训练节点集合120中的每一个训练节点集合与该主控节点集合110通信连接,该训练节点集合120包括多个训练节点,该训练节点用于接收该主控节点集合110下发的数据子集和该参数集,根据接收的该数据子集和该参数集,对自身负责的参数子集进行训练,并将训练结果发送给存储该参数子集的主控节点,其中,N为大于1的正整数, 该N个训练节点集合120中的任意两个训练节点集合120训练所使用的数据子集不同,该每个训练节点集合120中的所有训练节点所训练的参数子集的合集为该参数集。
因此,本发明实施例提供的用于神经网络中训练参数集的系统,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性,通过配置多个训练节点集合并行地对参数集进行训练,可以提高训练效率。
具体而言,训练参数集的系统100包括一个主控节点集合110和至少两个训练节点集合120。主控节点集合110包括至少两个主控节点,主控节点两两之间通信连接,至少一个主控节点用于备份参数集,可以提高训练过程的可靠性。训练节点集合120可以是主控节点集合110根据数据处理的规模和用于形成训练节点集合120的训练节点的性能(如内存大小等)划分的。
本发明实施例的训练参数集的系统100可以应用于神经网络的训练过程。神经网络的训练过程的输入为神经网络函数Y=f(X,W)、初始的参数集和训练的数据集D,输出为训练后的神经网络的参数集W。主控节点集合110用于控制训练过程,例如控制训练过程的开始或结束,控制各训练节点集合使用的数据子集,以及确定训练节点集合中的每个训练节点等。主控节点集合110还用于存储训练过程的所使用的数据集D和参数集W。参数集W包括多个参数子集,多个参数子集分别存储于不同的主控节点上,该主控节点集合110中的所有主控节点存储的参数子集的合集为该参数集W。
训练节点集合120中的训练节点用于接收该主控节点集合110下发的数据子集和当前的参数集W,根据接收的数据子集和当前的参数集W对自身负责的参数子集进行训练,并将根据该数据子集和当前的参数集W训练可以得到的用于更新的参数变化量ΔW发送给主控节点。训练过程中,该N个训练节点集合120中的任意两个训练节点集合120训练所使用的数据子集不同,该每个训练节点集合120中的所有训练节点所训练的参数子集的合集为该参数集。即,多个训练节点集合120并行处理不同的数据子集,对同一参数子集而言,同一时刻有多个训练节点对其进行训练,可以提高训练过程的效率。
因此,本发明实施例提供的用于神经网络中训练参数集的系统,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可 以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性,通过配置多个训练节点集合并行地对参数集进行训练,可以提高训练效率。
在本发明实施例中,数据集包括多个数据子集,参数集包括多个参数子集,该N个训练节点集合120中的任意两个训练节点集合120训练所使用的数据子集不同,至少存在两个训练节点训练同一个参数子集,该两个训练节点属于不同的训练节点集合120。
具体而言,训练参数集的系统100包括多于一个训练节点集合120。此时,主控节点集合110存储的数据集包括多个数据子集,训练时主控节点集合110将不同的数据子集下发给不同的训练节点集合120。主控节点集合110存储的参数集包括多个参数子集,主控节点集合110中的主控节点分别存储和负责维护不同的参数子集。训练节点集合120中负责某一参数子集的训练节点从相应的主控节点接收其存储和负责维护的该参数子集,从多个主控节点接收的参数子集的合集为参数集。根据数据子集和参数集训练自身负责的参数子集。其中,至少存在两个训练节点训练同一个参数子集,这两个训练节点属于不同的训练节点集合120。即,当有多个训练节点集合120时,多个训练节点集合120并行处理不同的数据子集,对同一参数子集而言,同一时刻有多个训练节点对其进行训练,可以提高训练过程的效率。
应理解,图1中示出系统100中的主控节点集合110中主控节点的个数、训练节点集合120的个数以及训练节点集合120中训练节点的个数均为示意性的。主控节点集合110中包括多于1个主控节点,系统100中包括至少两个训练节点集合120,训练节点集合120中包括多于1个训练节点。
因此,本发明实施例提供的用于神经网络中训练参数集的系统,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性。并且,通过配置多个训练节点集合并行地对参数集进行训练,可以提高训练的效率。
主控节点集合110中的主控节点和训练节点集合120中的训练节点均为计算设备。图2示出了根据本发明实施例的计算设备的示意性框图。如图2所示,计算设备可以包含处理模块、存储模块、用于计算的协处理模块(例如,图形处理器(Graphic Processing Unit,GPU)、英特尔超多核心(Intel Many  Integrated Core,Intel MIC)处理器、现场可编程门阵列(Field-Programmable Gate Array,FPGA)等)和用于在训练节点和与主控节点进行通信或者在主控节点集合110内部通信的通信模块。
可选地,作为一个实施例,在同一时刻,该N个训练节点集合120中的至少一个训练节点集合120训练所使用的参数集与当前该主控节点集合110中存储的参数集不同。
或者,可选地,作为一个实施例,主控节点具体用于:
在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对该主控节点中存储的参数子集进行更新;
在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对还主控节点中存储的参数子集进行更新。
具体而言,系统100中的每个训练节点集合120均独立并行地运作,互不影响。任何一个训练节点集合120的失效,不影响整个系统100继续进行训练。在训练过程中的某一时刻,N个训练节点集合120中的至少一个训练节点集合120计算所使用的参数集与当前该主控节点集合110中存储的该参数集不同。或者说,在训练过程中的某一时刻,N个训练节点集合120中的至少一个训练节点集合120训练所使用的参数集与其它的训练节点集合120训练所使用的参数集不同。即,主控节点集合110对参数集W的更新是异步的,主控节点在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对该主控节点中存储的参数子集进行更新;在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对该主控节点中存储的参数子集进行更新。在某一时刻,主控节点集合110当前的参数集W可能已经和训练节点集合120正在训练所使用的参数集W不同了。
可选地,作为一个实施例,主控节点集合110具体可以用于:
将该参数集划分为多个参数子集;
将该多个参数子集分别存储于不同的主控节点上,其中,该主控节点集合110中的所有主控节点存储的参数子集的合集为该参数集;
根据多个该参数子集的大小确定该N个训练节点集合120中的每个训练节点。
具体而言,主控节点集合110在训练的最开始进行初始化工作,例如,划分训练节点集合120、配置训练的数据集和参数集、初始化原始模型等等。 其中配置训练的参数集W具体为,将参数集W划分为多个参数子集W1,W2,...,WK。每个主控节点负责维护一个或多个参数子集。如果参数子集Wi由主控节点Mj负责存储、更新和维护,则称Mj是Wi的宿主节点。
根据参数集W的大小以及每个用于形成训练节点集合120的训练节点的内存(或者训练节点的协处理器的内存)大小,主控节点集合110对所有的用于形成训练节点集合120的训练节点进行划分。通常而言,参数子集的大小越大,则需要为其分配的训练节点的能力应越强。假设共有P个训练节点集合120,记为C1,C2,...,CP。每个训练节点负责至少一个参数子集,每个计训练节点集合120协同存储和处理参数集W的一个完整副本。
可选地,作为一个实施例,主控节点集合110采用磁盘阵列RAID0/1/5/6或者纠删码对参数集进行备份。
具体而言,为了保证系统100的可靠性,主控节点集合110可以采用RAID0/1/5/6或者纠删码(Erasure Coding)的编码方法对参数集进行备份。这样,在某些主控节点失效的情况下,系统100可以通过相应的解码运算来恢复失效的参数子集而维持正常运作。应理解,还可以采用其它的编码方法来保证系统100的可靠性,本发明实施例对此不作限定。
可选地,作为一个实施例,训练节点具体可以用于:
接收该主控节点集合110发送的指令,停止该训练参数集的过程。
具体而言,对于训练节点集合Ck的训练节点需要访问其负责的参数子集的宿主节点,并下载最新的参数子集的副本。训练节点集合Ck的所有的训练节点通过通信网络获取的所有最新的参数子集的合集即为最新参数集,记作Wk。不同的训练节点集合可能会在不同的时刻从主控节点集合110获取最新的参数集W,而参数集W是不断变化的。因此,在同一时刻,不同的训练节点集合计算所使用的参数集W的副本可能是不同的。
要进行训练,训练节点集合Ck的训练节点还需从该主控节点集合110获取数据集的一部分数据,即数据子集,其中,同一训练节点集合中的训练节点所获取的数据子集相同。进而,训练节点根据该参数集Wk和该数据子集,进行训练,以获得自身负责的参数子集Wi对应的参数变化量ΔWi k。训练节点将训练得到的参数变化量ΔWi k发送给的负责对应的参数子集Wi的主控节点,即宿主节点。训练节点集合Ck的所有的训练节点计算得到的参数 变化量ΔWi k合集记为ΔWk。对于训练节点从主控节点集合110获取参数子集和数据的方式,本发明实施例不作限定。
在训练过程中,训练节点不断地重复接收参数集、接收数据子集的进行训练,直至从主控节点集合110接收到主控节点集合110发送的停止训练的指令,训练节点停止训练参数集的过程。
可选地,作为一个实施例,如果在训练的过程中,训练节点集合中的训练节点之间的参数是相互关联的,则训练节点之间需要进行必要的数据交换,此时,同一训练节点集合中的训练节点两两之间可以通信连接。
可选地,作为一个实施例,训练结果为训练节点根据接收的该数据子集和该参数集,对训练节点自身负责的参数子集进行训练得到的自身负责的参数子集的参数变化量,该主控节点集合110中的主控节点还用于:
接收该训练节点发送的该参数变化量;
根据该参数变化量,对该主控节点中存储的参数子集进行更新。
具体而言,主控节点集合110中的主控节点从某个训练节点集合Ck的训练节点接收该训练节点根据该数据集和该参数集训练得到的用于更新的参数变化量ΔWi k,从而对主控节点集合的该主控节点负责的参数子集Wi进行更新。亦即,主控节点集合从某个训练节点集合Ck接收到完整的参数集变化量ΔWk后,对神经网络的参数集W进行更新。主控节点集合对参数集W的更新是异步的,也就是说,在同一时刻,主控节点集合当前的参数集W可能已经和训练节点集合Ck在训练过程中使用的参数集Wk不同。这种异步的更新方式可以充分的利用所有训练节点集合的训练能力。此外,本发明实施例对主控节点集合对参数集W的具体更新方法不作限定。
可选地,作为一个实施例,主控节点集合具体用于:
根据训练结果的准确性,确定是否停止训练参数集的过程。
具体而言,主控节点集合110确根据训练结果是否准确,确定是否应停止当前的训练。例如,主控节点集合110可以在当参数集W的变化ΔWk小于一定的阈值时,确定停止训练过程;或者,当更新的参数集W使得根据参数集W和神经网络的数学函数Y=f(X,W)计算得到的结果Y的变化值小于一定的阈值时,确定停止训练过程,本发明实施例对此不作限定。
下面将结合具体的例子对本发明实施例提供的系统100的工作流程进行详细说明。
本发明实施例提供的系统100应用于基于深层卷积神经网络的图像分类系统,并使用基于小批量随机梯度下降(Mini-batch Stochastic Gradient Descent)的优化算法进行训练。该深层卷积神经网络的输入X为图像输出Y为图像类别,训练过程的数据集D={(Xi,Yi)}。卷积神经网络的参数集为W,系统训练的参数集包括的参数为小批量的大小m和学习率α。图3是根据本发明实施例的数据处理的系统工作流程的示意图。深层卷积神经网络的参数集为W被分成两个参数子集W1和W2。主控节点集合包括三个主控节点M1,M2,和M3。主控节点M1是参数子集W1的宿主节点,主控节点M2是参数子集W2的宿主节点,主控节点M3保存
Figure PCTCN2015086011-appb-000003
本发明实施例中
Figure PCTCN2015086011-appb-000004
表示异或训练。每个训练节点集合Ck包括两个训练节点Ck 1和Ck 2,分别用来负责参数子集W1和W2的训练。
图4是根据本发明实施例的训练过程200的示意性流程图。训练过程200包括:
210,系统100包括P个训练节点集合Ck(k=1,2,…,P)。训练节点Ck 1和Ck 2分别从主控节点M1和M2下载最新的参数子集W1和W2,记作W1 k和W2 k。如果主控节点M1失效,训练节点Ck 1可以从主控节点M2下载参数子集W2,从主控节点M3下载
Figure PCTCN2015086011-appb-000005
然后通过训练
Figure PCTCN2015086011-appb-000006
得到参数子集W1 k。如果主控节点M2失效,训练节点Ck 2可以从主控节点M1下载参数子集W1,从主控节点M3下载
Figure PCTCN2015086011-appb-000007
然后通过训练
Figure PCTCN2015086011-appb-000008
得到参数子集W2 k
220,训练节点Ck 1和Ck 2都从主控节点集合接收同一批训练数据
Figure PCTCN2015086011-appb-000009
分别基于参数子集W1 k和W2 k进行正向传递训练。训练过程中训练节点Ck 1和Ck 2可以相互通信,以进行必要的数据交换。
230,对于每一个训练数据
Figure PCTCN2015086011-appb-000010
训练节点Ck 1和Ck 2分别训练出其对应的误差
Figure PCTCN2015086011-appb-000011
然后通过误差反向传播(Error Back Propagation,BP)算法进行反向传递训练,分别训练出
Figure PCTCN2015086011-appb-000012
训练过程中,训练节点Ck 1和Ck 2可以相互通信,以进行必要的数据交换。
240,训练节点Ck 1和Ck 2分别求得参数变化量
Figure PCTCN2015086011-appb-000013
Figure PCTCN2015086011-appb-000014
250,训练节点Ck 1和Ck 2分别把ΔW1 k和ΔW2 k上传到主控节点M1和M2。训练节点Ck 1和Ck 2重复步骤210至250,直至从主控节点集合接收到终止训练的指令。
260,主控节点集合包括主控节点M1和M2,步骤260与步骤210至250并行进行。主控节点M1和M2分别从训练节点集合的训练节点Ck 1和Ck 2接收参数变化量
Figure PCTCN2015086011-appb-000015
Figure PCTCN2015086011-appb-000016
根据参数变化量
Figure PCTCN2015086011-appb-000017
Figure PCTCN2015086011-appb-000018
主控节点M1和M2根据以下公式分别对参数子集W1和W2进行更新:
Figure PCTCN2015086011-appb-000019
Figure PCTCN2015086011-appb-000020
主控节点M1和M2将更新过的参数子集W1和W2传输给主控节点M3。主控节点M3根据以下公式对W3进行更新:
Figure PCTCN2015086011-appb-000021
270,主控节点集合根据训练结果的准确性,确定是否停止训练过程。如果不满足训练停止条件,重复步骤210至270;如果满足训练停止条件,执行步骤280。
280,主控节点集合向训练节点集合发出终止训练的指令。
因此,本发明实施例提供的用于神经网络中训练参数集的系统,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性,并且,通过配置多个训练节点集合并行地对参数集进行训练,可以提高训练的效率。
下面对对应于本发明施例的用于神经网络中训练参数集的方法200进行详细的说明。
图5示出了根据本发明实施例的用于神经网络中训练参数集的方法300,该方法300执行于上述用于神经网络中训练参数集的系统中的主控节点集合,该系统还包括N个训练节点集合,其中,该主控节点集合包括M 个主控节点,该M个主控节点两两之间通信连接,其中,M为大于1的正整数,N为大于1的正整数,该方法300包括:
S310,该主控节点集合存储训练所使用的数据集和参数集,该数据集包括多个数据子集,该参数集包括多个参数子集,该多个参数子集分别存储于不同的主控节点上,该主控节点集合中的所有主控节点存储的参数子集的合集为该参数集,该M个主控节点中的至少一个主控节点用于备份该参数集;
S320,该主控节点集合中的主控节点向负责训练自身存储的参数子集的训练节点下发数据子集和该参数子集;
S330,该主控节点集合中的主控节点接收该训练节点发送的训练结果,其中该训练节点属于训练节点集合,该训练节点集合与该主控节点集合通信连接,该训练节点集合包括多个训练节点,该训练结果是根据接收的该主控节点集合下发的数据子集和参数集进行训练得到的。
因此,本发明实施例提供的用于神经网络中训练参数集的方法,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性,通过多个训练节点集合并行地对参数集进行训练,可以提高训练效率。
可选地,作为一个实施例,该训练结果为训练节点根据接收的该主控节点集合下发的该数据子集和该参数集,对自身负责的参数子集进行训练得到的参数子集的参数变化量,该方法300还包括:
该主控节点集合中的主控节点接收该训练节点发送的该参数变化量;
该主控节点集合中的主控节点根据该参数变化量,对该主控节点中存储的参数子集进行更新。
可选地,作为一个实施例,该主控节点集合存储训练所使用的数据集和参数集,包括:
该主控节点集合将该参数集划分为多个参数子集;
将该多个参数子集分别存储于不同的主控节点上,其中,该主控节点集合中的所有主控节点存储的参数子集的合集为该参数集;
该方法300还包括:
该主控节点集合根据多个该参数子集的大小确定该N个训练节点集合中的每个训练节点。
可选地,作为一个实施例,该主控节点集合中的主控节点根据该参数变化量,对该主控节点中存储的参数子集进行更新,包括:
该主控节点集合中的主控节点在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对该主控节点中存储的参数子集进行更新;
该主控节点集合中的主控节点在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对该主控节点中存储的参数子集进行更新。
可选地,作为一个实施例,该方法300还包括:
该主控节点集合根据训练结果的准确性,确定是否停止该训练参数集的过程。
可选地,作为一个实施例,一个该参数子集由至少一个主控节点存储并负责,并对应的由至少两个训练节点负责,该至少两个训练节点属于不同的训练节点集合,该多个训练节点集合中的任意两个训练节点集合训练所使用的数据子集不同,该每个训练节点集合中的所有训练节点所训练的参数子集的合集为该参数集。
可选地,作为一个实施例,同一该训练节点集合中的训练节点两两之间通信连接。
因此,本发明实施例提供的用于神经网络中训练参数集的方法,通过由多个两两之间通信连接的多个主控节点形成主控节点集合控制训练过程,可以避免当某一个主控节点失效时导致的整个训练失败情况,能够提高训练过程的可靠性,并且,通过配置多个训练节点集合并行进行训练,可以提高训练的效率。
应理解,在本发明实施例中,“与X相应的Y”表示Y与X相关联,根据X可以确定Y。但还应理解,根据X确定Y并不意味着仅仅根据X确定Y,还可以根据X和/或其它信息确定Y。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本发明实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。

Claims (14)

  1. 一种用于神经网络中训练参数集的系统,其特征在于,所述系统包括:
    主控节点集合,所述主控节点集合包括M个主控节点,所述主控节点集合用于控制所述神经网络中训练参数集的过程,并用于存储所述训练参数集的过程所使用的数据集和参数集,所述数据集包括多个数据子集,所述参数集包括多个参数子集,所述多个参数子集分别存储于不同的主控节点上,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集,所述M个主控节点两两之间通信连接,所述M个主控节点中的至少一个主控节点用于备份所述参数集,其中,M为大于1的正整数;以及
    N个训练节点集合,所述N个训练节点集合中的每一个训练节点集合与所述主控节点集合通信连接,所述训练节点集合包括多个训练节点,所述训练节点用于接收所述主控节点集合下发的数据子集和所述参数集,根据接收的所述数据子集和所述参数集,对自身负责的参数子集进行训练,并将训练结果发送给存储所述参数子集的主控节点,其中,N为大于1的正整数,所述N个训练节点集合中的任意两个训练节点集合训练所使用的数据子集不同,所述每个训练节点集合中的所有训练节点所训练的参数子集的合集为所述参数集。
  2. 根据权利要求1所述的系统,其特征在于,所述训练结果为训练节点根据接收的所述数据子集和所述参数集,对自身负责的参数子集进行训练得到的自身负责的参数子集的参数变化量,所述主控节点集合中的主控节点还用于:
    接收所述训练节点发送的所述参数变化量;
    根据所述参数变化量,对所述主控节点中存储的参数子集进行更新。
  3. 根据权利要求1或2所述的系统,其特征在于,所述主控节点集合具体用于:
    将所述参数集划分为多个参数子集;
    将所述多个参数子集分别存储于不同的主控节点上,其中,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集;
    根据多个所述参数子集的大小确定所述N个训练节点集合中的每个训练节点。
  4. 根据权利要求1至3中任一项所述的系统,其特征在于,所述主控节点具体用于:
    在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新;
    在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新。
  5. 根据权利要求1至4中任一项所述的系统,其特征在于,所述主控节点集合具体用于:
    根据训练结果的准确性,确定是否停止所述训练参数集的过程。
  6. 根据权利要求1至5中任一项所述的系统,其特征在于,所述训练节点还用于:
    接收所述主控节点集合发送的指令,停止所述训练参数集的过程。
  7. 根据权利要求1至6中任一项所述的系统,其特征在于,同一所述训练节点集合中的训练节点两两之间通信连接。
  8. 一种用于神经网络中训练参数集的方法,其特征在于,所述方法执行于权利要求1至7中任一项所述的用于神经网络中训练参数集的系统中的主控节点集合,所述系统还包括N个训练节点集合,其中,所述主控节点集合包括M个主控节点,所述M个主控节点两两之间通信连接,其中,M为大于1的正整数,N为大于1的正整数,所述方法包括:
    所述主控节点集合存储训练所使用的数据集和参数集,所述数据集包括多个数据子集,所述参数集包括多个参数子集,所述多个参数子集分别存储于不同的主控节点上,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集,所述M个主控节点中的至少一个主控节点用于备份所述参数集;
    所述主控节点集合中的主控节点向负责训练自身存储的参数子集的训练节点下发数据子集和所述参数子集;
    所述主控节点集合中的主控节点接收所述训练节点发送的训练结果,其中所述训练节点属于训练节点集合,所述训练节点集合与所述主控节点集合通信连接,所述训练节点集合包括多个训练节点,所述训练结果是根据接收的所述主控节点集合下发的数据子集和参数集进行训练得到的。
  9. 根据权利要求8所述的方法,其特征在于,所述训练结果为训练节 点根据接收的所述主控节点集合下发的所述数据子集和所述参数集,对自身负责的参数子集进行训练得到的参数子集的参数变化量,所述方法还包括:
    所述主控节点集合中的主控节点接收所述训练节点发送的所述参数变化量;
    所述主控节点集合中的主控节点根据所述参数变化量,对所述主控节点中存储的参数子集进行更新。
  10. 根据权利要求8或9所述的方法,其特征在于,所述主控节点集合存储训练所使用的数据集和参数集,包括:
    所述主控节点集合将所述参数集划分为多个参数子集;
    将所述多个参数子集分别存储于不同的主控节点上,其中,所述主控节点集合中的所有主控节点存储的参数子集的合集为所述参数集;
    所述方法还包括:
    所述主控节点集合根据多个所述参数子集的大小确定所述N个训练节点集合中的每个训练节点。
  11. 根据权利要求8至10中任一项所述的方法,其特征在于,所述主控节点集合中的主控节点根据所述参数变化量,对所述主控节点中存储的参数子集进行更新,包括:
    所述主控节点集合中的主控节点在第一时刻根据第一训练节点集合的第一训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新;
    所述主控节点集合中的主控节点在第二时刻根据第二训练节点集合的第二训练节点发送的参数变化量对所述主控节点中存储的参数子集进行更新。
  12. 根据权利要求8至11中任一项所述的方法,其特征在于,所述方法还包括:
    所述主控节点集合根据训练结果的准确性,确定是否停止所述训练参数集的过程。
  13. 根据权利要求8至12中任一项所述的方法,其特征在于,一个所述参数子集由至少一个主控节点存储并负责,并对应的由至少两个训练节点负责,所述至少两个训练节点属于不同的训练节点集合,所述多个训练节点集合中的任意两个训练节点集合训练所使用的数据子集不同,所述每个训练 节点集合中的所有训练节点所训练的参数子集的合集为所述参数集。
  14. The method according to any one of claims 8 to 13, wherein the training nodes in a same training node set are communicatively connected to one another in pairs.
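The following Python code is a minimal, single-process sketch of the data flow recited in claims 1 to 14: the parameter set is sharded across M master nodes, each of the N training node sets works on a different data subset, each training node trains the parameter subset it is responsible for and returns the parameter variation to the master node that stores that subset, and the master node applies variations from different training node sets at different moments. It is only a sketch under assumed simplifications, not the claimed implementation: the names MasterNode, TrainingNode, deliver, and apply_variation, and the toy least-squares objective, are hypothetical, and a real deployment would place these objects on separate machines and exchange the parameter variations over a network rather than through direct method calls.

import numpy as np


class MasterNode:
    """Stores one parameter subset and applies reported parameter variations to it."""

    def __init__(self, parameter_subset):
        self.parameters = parameter_subset

    def deliver(self):
        # Deliver a copy of the stored parameter subset to a training node.
        return self.parameters.copy()

    def apply_variation(self, variation, learning_rate=0.1):
        # Update the stored subset with a variation from any training node;
        # variations from different training node sets may arrive at different moments.
        self.parameters -= learning_rate * variation


class TrainingNode:
    """Trains one parameter subset on one data subset and reports the variation."""

    def __init__(self, subset_index, data_subset, targets):
        self.subset_index = subset_index   # which parameter subset this node is responsible for
        self.data = data_subset            # this training node set's data subset (one feature slice)
        self.targets = targets

    def train_step(self, parameter_subset):
        # Toy objective: least-squares fit of this feature slice to the shared targets.
        # The returned gradient plays the role of the "parameter variation".
        predictions = self.data @ parameter_subset
        return self.data.T @ (predictions - self.targets) / len(self.targets)


rng = np.random.default_rng(0)
full_parameters = rng.normal(size=8)

# M = 2 master nodes, each storing half of the parameter set (their union is the parameter set).
masters = [MasterNode(full_parameters[:4].copy()), MasterNode(full_parameters[4:].copy())]

# N = 2 training node sets; each set uses a different data subset, and within a set
# the nodes together cover all parameter subsets.
training_sets = []
for _ in range(2):
    data = rng.normal(size=(32, 8))
    targets = data @ np.arange(8.0)  # synthetic targets for the toy objective
    training_sets.append([
        TrainingNode(0, data[:, :4], targets),
        TrainingNode(1, data[:, 4:], targets),
    ])

for step in range(5):
    for node_set in training_sets:                 # the sets could run concurrently in practice
        for node in node_set:
            master = masters[node.subset_index]
            parameters = master.deliver()          # master delivers its parameter subset
            variation = node.train_step(parameters)  # node trains the subset it is responsible for
            master.apply_variation(variation)        # master updates its stored subset

print([master.parameters.round(3) for master in masters])

In this sketch the stopping rule is simply a fixed number of steps; under claims 5 and 12 the master node set would instead monitor the accuracy of the training results and instruct the training nodes to stop.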
PCT/CN2015/086011 2015-01-26 2015-08-04 System and method for training parameter set in neural network WO2016119429A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP15879628.4A EP3196809A4 (en) 2015-01-26 2015-08-04 System and method for training parameter set in neural network
US15/455,259 US20170185895A1 (en) 2015-01-26 2017-03-10 System and Method for Training Parameter Set in Neural Network

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510036813.0 2015-01-26
CN201510036813.0A CN105894087A (zh) 2015-01-26 2015-01-26 System and method for training parameter set in neural network

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/455,259 Continuation US20170185895A1 (en) 2015-01-26 2017-03-10 System and Method for Training Parameter Set in Neural Network

Publications (1)

Publication Number Publication Date
WO2016119429A1 true WO2016119429A1 (zh) 2016-08-04

Family

ID=56542304

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086011 WO2016119429A1 (zh) 2015-01-26 2015-08-04 System and method for training parameter set in neural network

Country Status (4)

Country Link
US (1) US20170185895A1 (zh)
EP (1) EP3196809A4 (zh)
CN (1) CN105894087A (zh)
WO (1) WO2016119429A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018023832A1 (en) * 2016-08-03 2018-02-08 Huawei Technologies Co., Ltd. Systems, methods and devices for neural network communications
EP3396528A1 (en) * 2017-04-24 2018-10-31 INTEL Corporation Dynamic distributed training of machine learning models
TWI675335B (zh) * 2017-06-09 2019-10-21 宏達國際電子股份有限公司 訓練任務優化系統、訓練任務優化方法及其非暫態電腦可讀媒體
US11373116B2 (en) * 2015-11-16 2022-06-28 Huawei Technologies Co., Ltd. Model parameter fusion method and apparatus
US11681914B2 (en) 2020-05-08 2023-06-20 International Business Machines Corporation Determining multivariate time series data dependencies

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110843B (zh) 2014-08-29 2020-09-25 谷歌有限责任公司 Method and system for processing images
CN107784364B * (zh) 2016-08-25 2021-06-15 微软技术许可有限责任公司 Asynchronous training of machine learning models
CN106169961B * (zh) 2016-09-07 2019-07-23 北京百度网讯科技有限公司 Artificial intelligence-based network parameter processing method and apparatus for a neural network
CN107885716B * (zh) 2016-09-29 2020-02-11 腾讯科技(深圳)有限公司 Text recognition method and apparatus
CN108229687B * (zh) 2016-12-14 2021-08-24 腾讯科技(深圳)有限公司 Data processing method, data processing apparatus, and electronic device
CN106815644B * (zh) 2017-01-26 2019-05-03 北京航空航天大学 Machine learning method and system
US11023803B2 (en) * 2017-04-10 2021-06-01 Intel Corporation Abstraction library to enable scalable distributed machine learning
US11288575B2 (en) * 2017-05-18 2022-03-29 Microsoft Technology Licensing, Llc Asynchronous neural network training
CN109032610B * (zh) 2017-06-08 2024-04-09 杭州海康威视数字技术股份有限公司 Program package deployment method, electronic device, and distributed system
CN107578094A (zh) 2017-10-25 2018-01-12 济南浪潮高新科技投资发展有限公司 Method for implementing distributed neural network training based on a parameter server and FPGA
CN108304924B * (zh) 2017-12-21 2021-10-12 内蒙古工业大学 Pipelined pre-training method for a deep belief network
US11151449B2 (en) 2018-01-24 2021-10-19 International Business Machines Corporation Adaptation of a trained neural network
JP7301801B2 * (ja) 2018-10-09 2023-07-03 株式会社Preferred Networks Hyperparameter tuning method, apparatus, and program
CN109889366B * (zh) 2019-01-04 2020-06-16 烽火通信科技股份有限公司 Network traffic incremental statistics and analysis method and system
CN110059813B * (zh) 2019-02-13 2021-04-06 创新先进技术有限公司 Method, apparatus and device for updating a convolutional neural network by using a GPU cluster
KR102391817B1 * (ko) 2019-02-18 2022-04-29 주식회사 아이도트 Deep learning system
CN113412494B * (zh) 2019-02-27 2023-03-17 华为技术有限公司 Method and apparatus for determining a transmission policy
CN110490316B * (zh) 2019-08-21 2023-01-06 腾讯科技(深圳)有限公司 Training processing method based on a neural network model training system, and training system
US11509667B2 (en) * 2019-10-19 2022-11-22 Microsoft Technology Licensing, Llc Predictive internet resource reputation assessment
US11669780B2 (en) 2019-11-06 2023-06-06 International Business Machines Corporation Asynchronous multiple scheme meta learning
US11431751B2 (en) 2020-03-31 2022-08-30 Microsoft Technology Licensing, Llc Live forensic browsing of URLs
CN111736904B * (zh) 2020-08-03 2020-12-08 北京灵汐科技有限公司 Multi-task parallel processing method and apparatus, computer device, and storage medium
CN113065666A (zh) 2021-05-11 2021-07-02 海南善沙网络科技有限公司 Distributed computing method for training a neural network machine learning model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747070B2 (en) * 2005-08-31 2010-06-29 Microsoft Corporation Training convolutional neural networks on graphics processing units
CN104036451B * (zh) 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 Model parallel processing method and apparatus based on multiple graphics processors

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001026026A2 (en) * 1999-10-04 2001-04-12 University Of Florida Local diagnostic and remote learning neural networks for medical diagnosis
US20040133531A1 (en) * 2003-01-06 2004-07-08 Dingding Chen Neural network training data selection using memory reduced cluster analysis for field model development
CN102735747A (zh) * 2012-04-10 2012-10-17 南京航空航天大学 Quantitative defect identification method for high-speed magnetic flux leakage inspection of high-speed railway rails
CN103077347A (zh) * 2012-12-21 2013-05-01 中国电力科学研究院 Composite intrusion detection method based on improved core vector machine data fusion
CN104035751A (zh) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Data parallel processing method and apparatus based on multiple graphics processors
CN104463322A (zh) * 2014-11-10 2015-03-25 浪潮(北京)电子信息产业有限公司 Parallel hybrid artificial bee colony method for heterogeneous systems
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolutional neural network parallel processing method based on a large-scale high-performance cluster

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3196809A4 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11373116B2 (en) * 2015-11-16 2022-06-28 Huawei Technologies Co., Ltd. Model parameter fusion method and apparatus
WO2018023832A1 (en) * 2016-08-03 2018-02-08 Huawei Technologies Co., Ltd. Systems, methods and devices for neural network communications
EP3396528A1 (en) * 2017-04-24 2018-10-31 INTEL Corporation Dynamic distributed training of machine learning models
TWI675335B (zh) * 2017-06-09 2019-10-21 宏達國際電子股份有限公司 訓練任務優化系統、訓練任務優化方法及其非暫態電腦可讀媒體
US11144828B2 (en) 2017-06-09 2021-10-12 Htc Corporation Training task optimization system, training task optimization method and non-transitory computer readable medium for operating the same
US11681914B2 (en) 2020-05-08 2023-06-20 International Business Machines Corporation Determining multivariate time series data dependencies

Also Published As

Publication number Publication date
EP3196809A1 (en) 2017-07-26
EP3196809A4 (en) 2017-11-22
CN105894087A (zh) 2016-08-24
US20170185895A1 (en) 2017-06-29

Similar Documents

Publication Publication Date Title
WO2016119429A1 (zh) System and method for training parameter set in neural network
US11204822B1 (en) Distributed storage network (DSN) configuration adaptation based on estimated future loading
US10733535B1 (en) Training a model using parameter server shards
CN106297774B (zh) Distributed parallel training method and system for a neural network acoustic model
CN107209872B (zh) System, method, and storage medium for training a reinforcement learning system
US11443194B2 (en) Anomaly detection using a dimensional-reduction model
CN103150596B (zh) Training system for a back-propagation neural network DNN
KR20180045635A (ko) Neural network simplification method and apparatus
US20180322383A1 (en) Storage controller accelaration for neural network training and inference
WO2018081563A9 (en) NEURONAL ARCHITECTURE RESEARCH
US20180253646A1 (en) Hybrid aggregation for deep learning neural networks
CN113515370A (zh) Distributed training method for large-scale deep neural networks
CN110889509B (zh) Joint learning method and apparatus based on gradient momentum acceleration
WO2022116424A1 (zh) Traffic flow prediction model training method and apparatus, electronic device, and storage medium
EP3688673A1 (en) Neural architecture search
CN115358487A (zh) Federated learning aggregation optimization system and method for electric power data sharing
US11307781B2 (en) Managing replicas of content in storage systems
CN113159287A (zh) Distributed deep learning method based on gradient sparsification
CN115131476A (zh) Skeleton binding migration method, apparatus, device, and storage medium for virtual objects
EP4025995A1 (en) Automatic probabilistic upgrade of tenant devices
TWI758223B (zh) Computing method with dynamic minimum batch size, and computing system and computer-readable storage medium for performing the method
WO2023169167A1 (zh) Model training method and apparatus, device, and storage medium
US11475311B2 (en) Neural network instruction streaming
US20200257967A1 (en) Selecting a disconnect from different types of channel disconnects by training a machine learning module
US20220327374A1 (en) Dynamic distributed training of machine learning models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879628

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2015879628

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015879628

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE