CN106991474A - Data exchange method and system for model-parallel fully connected layers of a deep neural network - Google Patents
Data exchange method and system for model-parallel fully connected layers of a deep neural network
- Publication number: CN106991474A
- Application number: CN201710191684.1A
- Authority: CN (China)
- Prior art keywords: sub, full, layer, connection layer, data
- Prior art date: 2017-03-28
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a data exchange method and system for model-parallel fully connected layers of a deep neural network. Each fully connected layer of the deep neural network is evenly divided over N training units according to its number of neurons, forming a model-parallel network of fully connected layers. During the forward propagation of a fully connected layer, a half-stop-and-wait forward propagation method is applied to the input data from the front layer, following a partial-arrival, partial-computation, whole-output and whole-propagation scheme. During the backward propagation of a fully connected layer, a fixed-stop-and-wait backward propagation method is applied to the residual data from the rear layer, following a quantitative-arrival, quantitative-computation and quantitative-propagation scheme. After one forward and one backward propagation, the weight data and threshold data of each layer are updated in parallel according to the computed weight gradients and threshold gradients. Data communication and data computation of the fully connected layers can thereby be overlapped, accelerating model convergence while ensuring accuracy.
Description
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a data exchange method and system for model-parallel fully connected layers in a deep neural network.
Background
A Deep Neural Network (DNN) is an Artificial Neural Network (ANN) composed of an input layer, multiple hidden layers and an output layer. Each layer consists of a number of neuron nodes, and the neuron nodes of a front layer are connected to the neuron nodes of the rear layer, as shown in Fig. 1. In Fig. 1 all layers reside on the same training unit; I denotes the input layer, H denotes the hidden layers (of which there are several), O denotes the output layer, thin lines denote connections between neurons, and thick lines denote connections between components (here, layers). In such a network model, a Fully-Connected Layer (denoted "FC") is a layer in which every node is connected to every node of the adjacent layers.
As training data sets grow, the training parameters of a fully connected layer (connection weight parameters and threshold parameters, the latter also called bias parameters) often exceed the memory of a single training unit during model training of a deep neural network (a training unit is an independent computing node, e.g. a GPU card or a server node). The fully connected layer therefore has to be split into N parts, each consisting of a subset of the neuron nodes and the training parameters attached to them. The N training units, distributed over one or more hosts, each hold their part of the training parameters and cooperate to complete training, as shown in Fig. 2; this constitutes the model-parallel training mode of the fully connected layers of a deep neural network.
Communication overhead arises when the input of a neuron comes from the output of a neuron on another training unit. In Fig. 2, when the input of a neuron on training unit GPU2 must come from the output of a neuron on training unit GPU1, that output has to be copied from GPU1 to GPU2, which incurs communication overhead. In the standard propagation method of a deep neural network, computation and communication are strictly serialized for both forward and backward propagation. Taking standard forward propagation as an example (standard backward propagation is analogous), its core idea is as follows:
(1) waiting for data: the training unit waits for the output data of all source training units (training units that generate data) to arrive;
(2) whole output: the training unit computes the output data of the current layer;
(3) whole propagation: the training unit propagates the output data to the target training units (training units that receive data) as their input data. A minimal sketch of this fully synchronous scheme is given after this list.
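For illustration only, the standard scheme described above can be sketched in Python roughly as follows; the queue-based transport and every name in it (recv_queues, send_to and so on) are assumptions of this sketch rather than anything defined in the patent, and NumPy stands in for the actual GPU kernels.

```python
# Illustrative sketch of the STANDARD forward propagation described above
# (wait for all sources, compute the whole output, then propagate).
import numpy as np

def standard_forward(recv_queues, weights, bias, targets, activation=np.tanh):
    """recv_queues: one blocking queue per source training unit, each yielding
    that source's output data (a batch x neurons array);
    weights: dict source_id -> weight block connecting that source to this unit."""
    # (1) waiting for data: block until EVERY source has delivered its output,
    #     so the slowest source gates all further work.
    arrived = {src: q.get() for src, q in recv_queues.items()}
    # (2) whole output: only now compute this unit's input and output data.
    layer_input = sum(arrived[src] @ weights[src] for src in arrived)
    layer_output = activation(layer_input + bias)
    # (3) whole propagation: send the output to every target training unit.
    for send_to in targets:
        send_to(layer_output)
    return layer_output
```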
However, this method has the following drawback: because the training units are distributed over one or more hosts, the output data of different source training units arrive at the target training unit at different rates. The target training unit must wait for the data of the slowest source training unit before it can continue with forward propagation, which increases the communication overhead.
Disclosure of Invention
Aiming at the above defects or improvement needs of the prior art, the invention provides a data exchange method and system for model-parallel fully connected layers of a deep neural network, which overlap the data communication and data computation of the fully connected layers and accelerate model convergence while ensuring accuracy, thereby solving the technical problem of high communication overhead in the prior art.
To achieve the above object, according to one aspect of the present invention, there is provided a data exchange method for model-parallel fully connected layers of a deep neural network, comprising:
(1) for each fully connected layer FC_l, l ∈ [1, L]: dividing FC_l into N equal parts according to its number of neurons to obtain N sub fully connected layers, and assigning the divided sub fully connected layers to N training units respectively, where L is the number of fully connected layers;
(2) in the forward propagation process of each sub fully connected layer, obtaining the output data of each sub fully connected layer in parallel using a half-stop-and-wait forward propagation method;
(3) in the backward propagation process of each sub fully connected layer, obtaining the weight gradient and threshold gradient of each sub fully connected layer in parallel using a fixed-stop-and-wait backward propagation method, based on the output data of each sub fully connected layer obtained by the half-stop-and-wait forward propagation method;
(4) after one forward propagation and one backward propagation are finished, updating the weight data and threshold data of each sub fully connected layer in parallel using the weight gradient and threshold gradient of each sub fully connected layer.
Preferably, the step (2) specifically comprises:
(2.1) for each sub fully connected layer FC_l^i: whenever the output data of any sub fully connected layer FC_{l-1}^j has arrived, computing the input data that FC_{l-1}^j generates for FC_l^i by the formula ID_l^{j→i} = OD_{l-1}^j · W_l^{ji}, where l indexes the fully connected layer, j and i index the sub fully connected layers, W_l^{ji} denotes the connection weight between FC_{l-1}^j and FC_l^i, OD_{l-1}^j denotes the output data of FC_{l-1}^j, and ID_l^{j→i} denotes the input data that FC_{l-1}^j generates for FC_l^i;
(2.2) for the sub fully connected layer FC_l^i, according to the result of step (2.1), computing its overall input data by the formula ID_l^i = Σ_{j=1}^{N} ID_l^{j→i}, where ID_l^i denotes the overall input data of FC_l^i;
(2.3) for the sub fully connected layer FC_l^i, according to the result of step (2.2), computing its output data by the formula OD_l^i = F(ID_l^i + B_l^i), where the function F denotes a nonlinear activation function and B_l^i denotes the threshold data of FC_l^i.
Preferably, step (3) specifically comprises:
(3.1) for each sub fully connected layer FC_l^i: each time after the output residual data generated for FC_l^i by the sub fully connected layers of the rear layer on Q training units have arrived, taking the Q pieces of output residual data as input residual data of FC_l^i, denoted IΔ_l^{q→i}, q = 1, …, Q;
(3.2) for the sub fully connected layer FC_l^i, accumulating the Q pieces of input residual data of step (3.1) by the formula IΔ_l^i += Σ_{q=1}^{Q} IΔ_l^{q→i};
(3.3) for the sub fully connected layer FC_l^i, according to the result of step (3.2), computing in parallel the output residual data from FC_l^i to FC_{l-1}^j, denoted OΔ_{l-1}^{i→j}, by the formula OΔ_{l-1}^{i→j} = IΔ_l^i · (W_l^{ji})^T;
(3.4) for the sub fully connected layer FC_l^i, according to the result of step (3.1), computing in parallel the weight gradient of FC_{l-1}^j to FC_l^i, denoted ∇W_l^{ji}, by the formula ∇W_l^{ji} += (OD_{l-1}^j)^T · Σ_{q=1}^{Q} IΔ_l^{q→i};
(3.5) for the sub fully connected layer FC_l^i, according to the result of step (3.2), computing the threshold gradient of FC_l^i, denoted ∇B_l^i, by the formula ∇B_l^i = V^T · IΔ_l^i, where V is a unit vector whose dimension equals the size of a batch block in training;
(3.6) for the sub fully connected layer FC_l^i, repeating steps (3.1) to (3.5), each time processing the output residual data generated for FC_l^i by Q sub fully connected layers of the rear layer, until all the output residual data of the rear layer have been processed.
Preferably, the step (4) specifically comprises:
(4.1) updating the weight data of each sub fully connected layer in parallel by the formula W_l^{ji} = W_l^{ji} − η · ∇W_l^{ji}, where η denotes the learning rate;
(4.2) updating the threshold data of each sub fully connected layer in parallel by the formula B_l^i = B_l^i − η · ∇B_l^i.
According to another aspect of the present invention, there is provided a data exchange system for model-parallel fully connected layers of a deep neural network, comprising:
a partitioning module, for dividing each fully connected layer FC_l, l ∈ [1, L], into N equal parts according to its number of neurons to obtain N sub fully connected layers, and assigning the divided sub fully connected layers to N training units respectively, where L is the number of fully connected layers;
a forward propagation module, for obtaining the output data of each sub fully connected layer in parallel using the half-stop-and-wait forward propagation method during the forward propagation of each sub fully connected layer;
a backward propagation module, for obtaining the weight gradient and threshold gradient of each sub fully connected layer in parallel using the fixed-stop-and-wait backward propagation method during the backward propagation of each sub fully connected layer, based on the output data of each sub fully connected layer obtained by the half-stop-and-wait forward propagation method;
and an updating module, for updating the weight data and threshold data of each sub fully connected layer in parallel, using the weight gradient and threshold gradient of each sub fully connected layer, after one forward propagation and one backward propagation are finished.
Generally, compared with the prior art, the above technical solutions conceived by the present invention mainly have the following technical advantages:
(1) the calculation parallelism is high: each training unit processes the current data in parallel;
(2) the communication overhead is small: both the half-stop-and-wait forward propagation method and the fixed-stop-and-wait backward propagation method maximize the overlap of computation and communication time during deep neural network training, reducing the communication overhead of training.
Drawings
FIG. 1 is a schematic diagram of a deep neural network architecture in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a fully-connected layer structure of model parallelism in a deep neural network in an embodiment of the present invention;
FIG. 3 is a schematic flow chart of the overall process in an embodiment of the invention;
FIG. 4 is a flow chart of the half-stop-and-wait forward propagation method in an embodiment of the present invention;
FIG. 5 is a flow chart of the fixed-stop-and-wait backward propagation method in an embodiment of the present invention;
fig. 6 is a flowchart illustrating a method for updating weight data and threshold data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention consists of two parts, a half-stop-and-wait forward propagation method and a fixed-stop-and-wait backward propagation method. The core idea of the half-stop-and-wait forward propagation method is as follows:
(1) partial computation: the training unit computes partial input data (ID) from the output data of the front layer that has already arrived;
(2) whole output: the training unit combines all partial results of step (1) into its overall input data and computes its output data (OD) from them;
(3) whole propagation: the training unit propagates the output data to the target training units as their input data. A minimal sketch of this idea is given after this list.
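A minimal Python sketch of the half-stop-and-wait idea is given below, assuming the same hypothetical queue-based transport as in the earlier sketch; the point is that the partial computation of step (1) overlaps with the communication of the sources that have not yet arrived.

```python
# Hedged sketch of the half-stop-and-wait forward propagation idea: partial
# computation starts as soon as each front-layer output arrives, instead of
# waiting for all of them. Names and the queue transport are assumptions.
import numpy as np

def half_stop_wait_forward(arrival_queue, weights, bias, targets,
                           n_sources, activation=np.tanh):
    """arrival_queue yields (source_id, output_data) pairs in arrival order;
    weights: dict source_id -> weight block connecting that source to this unit."""
    layer_input = None
    for _ in range(n_sources):
        src, od = arrival_queue.get()       # whichever source happens to arrive first
        partial = od @ weights[src]         # (1) partial computation, overlapped with the
                                            #     communication of the slower sources
        layer_input = partial if layer_input is None else layer_input + partial
    layer_output = activation(layer_input + bias)   # (2) whole output
    for send_to in targets:                          # (3) whole propagation
        send_to(layer_output)
    return layer_output
```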
The core idea of the fixed-stop-and-wait backward propagation method is as follows:
(1) quantitative computation: each time after Q pieces of the residual data output by the rear layer have arrived (Q ∈ [1, N], where Q is a constant set by the user and N is the number of training units), the training unit performs the corresponding computation on them (residual data are also called error data and are denoted Δ below);
(2) quantitative propagation: the training unit propagates the computation result of step (1) to the target training units as their input residual data (IΔ);
(3) repeating step (1) and step (2) until all N pieces of output residual data (OΔ) of the rear layer have been processed. A minimal sketch of this loop is given after this list.
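The following sketch shows one way the fixed-stop-and-wait loop could be organized, again assuming a blocking queue as transport; Q (q_size), the dictionary-based weight slices and the per-chunk downstream propagation are illustrative choices of this sketch, not a definitive reading of the patent.

```python
# Hedged sketch of the fixed-stop-and-wait loop: residual data from the rear
# layer are processed in groups of Q as they arrive, rather than in one whole
# batch at the end.
def fixed_stop_wait_loop(residual_queue, weights_t, targets, n_sources, q_size):
    """residual_queue yields output residual arrays (O-delta) from the N rear
    sub-layers; weights_t maps each front sub-layer j to W^T for this unit;
    targets maps j to a send function toward that front sub-layer."""
    processed = 0
    accumulated = None
    while processed < n_sources:                     # (3) repeat until all N parts handled
        q = min(q_size, n_sources - processed)
        chunk = sum(residual_queue.get() for _ in range(q))       # (1) quantitative computation
        accumulated = chunk if accumulated is None else accumulated + chunk
        for j, send_to in targets.items():                        # (2) quantitative propagation
            send_to(chunk @ weights_t[j])    # partial input residual for front sub-layer j
        processed += q
    return accumulated           # total input residual, later used for the gradients
```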
The overall idea of the invention is that, during model-parallel training of a deep neural network, the half-stop-and-wait forward propagation method replaces the standard forward propagation method and the fixed-stop-and-wait backward propagation method replaces the standard backward propagation method, so that computation and communication time during training overlap and the training communication overhead is reduced.
Fig. 3 is a schematic general flow chart of the method in the embodiment of the present invention, and the method shown in fig. 3 includes:
(1) for each fully connected layer FC_l, l ∈ [1, L]: dividing FC_l into N equal parts according to its number of neurons to obtain N sub fully connected layers, and assigning the divided sub fully connected layers to N training units respectively, where L is the number of fully connected layers;
A fully connected layer FC_l, l ∈ [1, L], is divided by its number of neurons into N equal parts, giving N sub fully connected layers denoted FC_l^1, FC_l^2, …, FC_l^N, which are assigned to N training units (a training unit is an independent computing node, e.g. a GPU card or a server node). The other fully connected layers are processed in the same way, forming a model-parallel network of fully connected layers in the deep neural network, as shown in Fig. 2: each fully connected layer is divided into N parts held by N training units respectively, where N is the number of training units (i.e. the number of parts into which each fully connected layer is divided) and L is the number of fully connected layers; thin lines denote connections between neurons, and thick lines denote connections between components (here, one part of one layer). A partitioning sketch under assumed layer sizes is given below.
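The neuron-wise split of step (1) can be illustrated by the following sketch, in which a plain dictionary stands in for a training unit and the layer sizes are hypothetical; each unit i ends up holding the weight and threshold slice of its sub fully connected layer FC_l^i.

```python
# Hedged sketch of the neuron-wise partitioning: an L-layer stack of fully
# connected layers is split into N equal column blocks, one block per training
# unit. Shapes and the dict-based "training unit" are assumptions.
import numpy as np

def partition_fc_layers(layer_sizes, n_units, rng=np.random.default_rng(0)):
    """layer_sizes: [d_0, d_1, ..., d_L]; each d_l must be divisible by n_units.
    Returns units[i][l] = {'W': weight slice, 'B': threshold slice} held by
    training unit i, i.e. the sub fully connected layer FC_l^i."""
    units = [dict() for _ in range(n_units)]
    for l in range(1, len(layer_sizes)):
        d_in, d_out = layer_sizes[l - 1], layer_sizes[l]
        cols = d_out // n_units                      # equal split by neuron count
        W = rng.standard_normal((d_in, d_out)) * 0.01
        B = np.zeros(d_out)
        for i in range(n_units):
            sl = slice(i * cols, (i + 1) * cols)
            units[i][l] = {"W": W[:, sl].copy(), "B": B[sl].copy()}
    return units

# e.g. three fully connected layers of 1024 neurons split over 4 training units:
# units = partition_fc_layers([1024, 1024, 1024, 1024], n_units=4)
```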
(2) In the forward propagation process of each sub fully connected layer, the output data of each sub fully connected layer is obtained in parallel using the half-stop-and-wait forward propagation method;
the half-stop equal forward propagation method shown in fig. 4 specifically includes:
(2.1) for the sub fully connected layer FC_l^i: whenever the output data of any sub fully connected layer FC_{l-1}^j has arrived, the input data that FC_{l-1}^j generates for FC_l^i is computed as
ID_l^{j→i} = OD_{l-1}^j · W_l^{ji},
where l indexes the fully connected layer and j and i index the sub fully connected layers, i.e. the training units; W_l^{ji} denotes the connection weight between FC_{l-1}^j and FC_l^i, OD_{l-1}^j denotes the output data of FC_{l-1}^j, and ID_l^{j→i} denotes the input data that FC_{l-1}^j generates for FC_l^i;
(2.2) for the sub fully connected layer FC_l^i, according to the result of step (2.1), the overall input data is computed as
ID_l^i = Σ_{j=1}^{N} ID_l^{j→i},
where ID_l^i denotes the overall input data of FC_l^i;
(2.3) for the sub fully connected layer FC_l^i, according to the result of step (2.2), the final output data is computed as
OD_l^i = F(ID_l^i + B_l^i),
where OD_l^i denotes the output data of FC_l^i, the function F is a nonlinear activation function (e.g. the ReLU function), and B_l^i is the threshold data of FC_l^i;
(2.4) the remaining sub fully connected layers FC_l^k (k ≠ i) are processed in the same way in parallel according to steps (2.1) to (2.3). A NumPy sketch of steps (2.1) to (2.3) is given below.
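For concreteness, steps (2.1) to (2.3) for a single sub fully connected layer can be written in a few NumPy lines; ReLU as the activation F and the dictionary layout of the weight blocks W_l^{ji} are assumptions of this sketch, since the formula images of the original publication are not reproduced in this text.

```python
# Steps (2.1)-(2.3) for a single sub fully connected layer FC_l^i, as a sketch.
import numpy as np

def sub_layer_forward(front_outputs, W, B):
    """front_outputs: dict j -> OD_{l-1}^j (batch x neurons_j), in arrival order;
    W: dict j -> W_l^{ji} (neurons_j x neurons_i); B: threshold data B_l^i."""
    batch = next(iter(front_outputs.values())).shape[0]
    ID = np.zeros((batch, B.shape[0]))
    for j, OD_prev in front_outputs.items():   # (2.1) partial input data as each OD arrives,
        ID += OD_prev @ W[j]                   # ID_l^{j->i} = OD_{l-1}^j . W_l^{ji}, summed (2.2)
    OD = np.maximum(ID + B, 0.0)               # (2.3) OD_l^i = F(ID_l^i + B_l^i), F = ReLU here
    return ID, OD                              # ID is kept for the backward pass
```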
(3) In the backward propagation process of each sub fully connected layer, the weight gradient and threshold gradient of each sub fully connected layer are obtained in parallel using the fixed-stop-and-wait backward propagation method, based on the output data of each sub fully connected layer obtained by the half-stop-and-wait forward propagation method;
the stop-and-go back propagation method shown in fig. 5 specifically includes:
(3.1) for the sub fully connected layer FC_l^i: each time after the output residual data generated for FC_l^i by the sub fully connected layers of the rear layer on Q training units have arrived (i.e. have been copied from the source training units to the target training unit), the Q pieces of output residual data are taken as input residual data of FC_l^i, denoted IΔ_l^{q→i}, q = 1, …, Q;
(3.2) for the sub fully connected layer FC_l^i, the Q pieces of input residual data of step (3.1) are accumulated into IΔ_l^i by the formula
IΔ_l^i += Σ_{q=1}^{Q} IΔ_l^{q→i};
(3.3) for the sub fully connected layer FC_l^i, according to the result of step (3.2), the output residual data from FC_l^i to FC_{l-1}^j is computed in parallel, denoted OΔ_{l-1}^{i→j}, by the formula
OΔ_{l-1}^{i→j} = IΔ_l^i · (W_l^{ji})^T;
(3.4) for the sub fully connected layer FC_l^i, according to the result of step (3.1), the weight gradient of FC_{l-1}^j to FC_l^i is computed in parallel, denoted ∇W_l^{ji}, by the formula
∇W_l^{ji} += (OD_{l-1}^j)^T · Σ_{q=1}^{Q} IΔ_l^{q→i};
(3.5) for the sub fully connected layer FC_l^i, according to the result of step (3.2), the threshold gradient of FC_l^i is computed, denoted ∇B_l^i, by the formula
∇B_l^i = V^T · IΔ_l^i,
where V is a unit vector whose dimension equals the size of the batch block in training;
(3.6) for the sub fully connected layer FC_l^i, steps (3.1) to (3.5) are repeated, each time processing the output residual data generated for FC_l^i by Q sub fully connected layers of the rear layer, until all the output residual data of the rear layer have been processed;
(3.7) the remaining sub fully connected layers are processed in the same way in parallel according to steps (3.1) to (3.6). A sketch of steps (3.1) to (3.6) is given below.
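A sketch of steps (3.1) to (3.6) for one sub fully connected layer follows; the explicit gradient formulas are the standard fully connected backpropagation expressions, reconstructed here because the original formula images are not reproduced, and propagating the per-chunk contribution downstream (rather than the running accumulation) is one consistent way to realize step (3.3).

```python
# Sketch of steps (3.1)-(3.6) for one sub fully connected layer FC_l^i.
import numpy as np

def sub_layer_backward(residual_queue, W, front_OD, send_residual, n_rear, Q):
    """residual_queue yields output residual chunks from the rear sub-layers;
    W: dict j -> W_l^{ji}; front_OD: dict j -> OD_{l-1}^j cached in the forward pass;
    send_residual(j, data) forwards the output residual toward FC_{l-1}^j."""
    I_delta = None                                   # accumulated input residual of FC_l^i
    grad_W = {j: np.zeros_like(W[j]) for j in W}
    done = 0
    while done < n_rear:
        q = min(Q, n_rear - done)
        chunk = sum(residual_queue.get() for _ in range(q))        # (3.1) Q arrived residual parts
        I_delta = chunk if I_delta is None else I_delta + chunk    # (3.2) accumulation
        for j in W:
            send_residual(j, chunk @ W[j].T)         # (3.3) output residual toward FC_{l-1}^j
            grad_W[j] += front_OD[j].T @ chunk       # (3.4) weight gradient contribution
        done += q                                    # (3.6) repeat until all N parts handled
    grad_B = I_delta.sum(axis=0)                     # (3.5) threshold gradient: V^T . I_delta
    return grad_W, grad_B
```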
(4) After one forward propagation and one backward propagation are finished, the weight data and threshold data of each sub fully connected layer are updated in parallel using the weight gradient and threshold gradient of each sub fully connected layer.
The method for updating the weight data and the threshold data shown in fig. 6 specifically includes:
(4.1) for all sub fully connected layers FC_l^i, the weight data are updated in parallel by the formula W_l^{ji} = W_l^{ji} − η · ∇W_l^{ji}, where η denotes the learning rate;
(4.2) for all sub fully connected layers FC_l^i, the threshold data are updated in parallel by the formula B_l^i = B_l^i − η · ∇B_l^i. A short sketch of this update follows.
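Steps (4.1) and (4.2) amount to a plain SGD update applied by each training unit to its own parameter slice, which is why they can run fully in parallel; a minimal sketch with assumed variable names is:

```python
# Sketch of steps (4.1)-(4.2): each training unit updates its own slice
# independently with plain SGD.
def sub_layer_update(W, B, grad_W, grad_B, lr):
    """W: dict j -> W_l^{ji}; B: B_l^i; lr is the learning rate eta."""
    for j in W:
        W[j] -= lr * grad_W[j]     # (4.1) W_l^{ji} = W_l^{ji} - eta * grad W_l^{ji}
    B -= lr * grad_B               # (4.2) B_l^i  = B_l^i  - eta * grad B_l^i
    return W, B
```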
In one embodiment of the invention, a data exchange system for model-parallel fully connected layers of a deep neural network is disclosed, comprising:
a partitioning module, for dividing each fully connected layer FC_l, l ∈ [1, L], into N equal parts according to its number of neurons to obtain N sub fully connected layers, and assigning the divided sub fully connected layers to N training units respectively, where L is the number of fully connected layers;
a forward propagation module, for obtaining the output data of each sub fully connected layer in parallel using the half-stop-and-wait forward propagation method during the forward propagation of each sub fully connected layer;
a backward propagation module, for obtaining the weight gradient and threshold gradient of each sub fully connected layer in parallel using the fixed-stop-and-wait backward propagation method during the backward propagation of each sub fully connected layer, based on the output data of each sub fully connected layer obtained by the half-stop-and-wait forward propagation method;
and an updating module, for updating the weight data and threshold data of each sub fully connected layer in parallel, using the weight gradient and threshold gradient of each sub fully connected layer, after one forward propagation and one backward propagation are finished.
The specific implementation of each module may refer to the description of the method embodiment, and the embodiment of the present invention will not be repeated.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (5)
1. A data exchange method for model-parallel fully connected layers of a deep neural network, characterized by comprising the following steps:
(1) for each fully connected layer FC_l, l ∈ [1, L]: dividing FC_l into N equal parts according to its number of neurons to obtain N sub fully connected layers, and assigning the divided sub fully connected layers to N training units respectively, where L is the number of fully connected layers;
(2) in the forward propagation process of each sub fully connected layer, obtaining the output data of each sub fully connected layer in parallel using a half-stop-and-wait forward propagation method;
(3) in the backward propagation process of each sub fully connected layer, obtaining the weight gradient and threshold gradient of each sub fully connected layer in parallel using a fixed-stop-and-wait backward propagation method, based on the output data of each sub fully connected layer obtained by the half-stop-and-wait forward propagation method;
(4) after one forward propagation and one backward propagation are finished, updating the weight data and threshold data of each sub fully connected layer in parallel using the weight gradient and threshold gradient of each sub fully connected layer.
2. The method according to claim 1, wherein step (2) comprises in particular:
(2.1) for each sub fully connected layer FC_l^i: whenever the output data of any sub fully connected layer FC_{l-1}^j has arrived, computing the input data that FC_{l-1}^j generates for FC_l^i by the formula ID_l^{j→i} = OD_{l-1}^j · W_l^{ji}, where l indexes the fully connected layer, j and i index the sub fully connected layers, W_l^{ji} denotes the connection weight between FC_{l-1}^j and FC_l^i, OD_{l-1}^j denotes the output data of FC_{l-1}^j, and ID_l^{j→i} denotes the input data that FC_{l-1}^j generates for FC_l^i;
(2.2) for the sub fully connected layer FC_l^i, according to the result of step (2.1), computing its overall input data by the formula ID_l^i = Σ_{j=1}^{N} ID_l^{j→i}, where ID_l^i denotes the overall input data of FC_l^i;
(2.3) for the sub fully connected layer FC_l^i, according to the result of step (2.2), computing its output data by the formula OD_l^i = F(ID_l^i + B_l^i), where the function F denotes a nonlinear activation function and B_l^i denotes the threshold data of FC_l^i.
3. The method according to claim 2, wherein step (3) comprises in particular:
(3.1) for each sub fully connected layer FC_l^i: each time after the output residual data generated for FC_l^i by the sub fully connected layers of the rear layer on Q training units have arrived, taking the Q pieces of output residual data as input residual data of FC_l^i, denoted IΔ_l^{q→i}, q = 1, …, Q;
(3.2) for the sub fully connected layer FC_l^i, accumulating the Q pieces of input residual data of step (3.1) by the formula IΔ_l^i += Σ_{q=1}^{Q} IΔ_l^{q→i};
(3.3) for the sub fully connected layer FC_l^i, according to the result of step (3.2), computing in parallel the output residual data from FC_l^i to FC_{l-1}^j, denoted OΔ_{l-1}^{i→j}, by the formula OΔ_{l-1}^{i→j} = IΔ_l^i · (W_l^{ji})^T;
(3.4) for the sub fully connected layer FC_l^i, according to the result of step (3.1), computing in parallel the weight gradient of FC_{l-1}^j to FC_l^i, denoted ∇W_l^{ji}, by the formula ∇W_l^{ji} += (OD_{l-1}^j)^T · Σ_{q=1}^{Q} IΔ_l^{q→i};
(3.5) for the sub fully connected layer FC_l^i, according to the result of step (3.2), computing the threshold gradient of FC_l^i, denoted ∇B_l^i, by the formula ∇B_l^i = V^T · IΔ_l^i, where V is a unit vector whose dimension equals the size of a batch block in training;
(3.6) for the sub fully connected layer FC_l^i, repeating steps (3.1) to (3.5), each time processing the output residual data generated for FC_l^i by Q sub fully connected layers of the rear layer, until all the output residual data of the rear layer have been processed.
4. The method according to claim 3, characterized in that step (4) comprises in particular:
(4.1) updating the weight data of each sub fully connected layer in parallel by the formula W_l^{ji} = W_l^{ji} − η · ∇W_l^{ji}, where η denotes the learning rate;
(4.2) updating the threshold data of each sub fully connected layer in parallel by the formula B_l^i = B_l^i − η · ∇B_l^i.
5. A data exchange system for model-parallel fully connected layers of a deep neural network, characterized by comprising:
a partitioning module, for dividing each fully connected layer FC_l, l ∈ [1, L], into N equal parts according to its number of neurons to obtain N sub fully connected layers, and assigning the divided sub fully connected layers to N training units respectively, where L is the number of fully connected layers;
a forward propagation module, for obtaining the output data of each sub fully connected layer in parallel using the half-stop-and-wait forward propagation method during the forward propagation of each sub fully connected layer;
a backward propagation module, for obtaining the weight gradient and threshold gradient of each sub fully connected layer in parallel using the fixed-stop-and-wait backward propagation method during the backward propagation of each sub fully connected layer, based on the output data of each sub fully connected layer obtained by the half-stop-and-wait forward propagation method;
and an updating module, for updating the weight data and threshold data of each sub fully connected layer in parallel, using the weight gradient and threshold gradient of each sub fully connected layer, after one forward propagation and one backward propagation are finished.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710191684.1A, CN106991474B (en) | 2017-03-28 | 2017-03-28 | Data exchange method and system for model-parallel fully connected layers of a deep neural network
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710191684.1A, CN106991474B (en) | 2017-03-28 | 2017-03-28 | Data exchange method and system for model-parallel fully connected layers of a deep neural network
Publications (2)
Publication Number | Publication Date |
---|---|
CN106991474A true CN106991474A (en) | 2017-07-28 |
CN106991474B CN106991474B (en) | 2019-09-24 |
Family ID: 59413391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710191684.1A (granted as CN106991474B, Active) | Data exchange method and system for model-parallel fully connected layers of a deep neural network | 2017-03-28 | 2017-03-28
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106991474B (en) |
- 2017-03-28: CN application CN201710191684.1A, granted as patent CN106991474B (en), status: Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150058268A1 (en) * | 2012-01-27 | 2015-02-26 | International Business Machines Corporation | Hierarchical scalable neuromorphic synaptronic system for synaptic and structural plasticity |
CN104035751A (en) * | 2014-06-20 | 2014-09-10 | 深圳市腾讯计算机系统有限公司 | Graphics processing unit based parallel data processing method and device |
CN104463324A (en) * | 2014-11-21 | 2015-03-25 | 长沙马沙电子科技有限公司 | Convolution neural network parallel processing method based on large-scale high-performance cluster |
US20160267380A1 (en) * | 2015-03-13 | 2016-09-15 | Nuance Communications, Inc. | Method and System for Training a Neural Network |
CN105630882A (en) * | 2015-12-18 | 2016-06-01 | 哈尔滨工业大学深圳研究生院 | Remote sensing data deep learning based offshore pollutant identifying and tracking method |
CN106228240A (en) * | 2016-07-30 | 2016-12-14 | 复旦大学 | Degree of depth convolutional neural networks implementation method based on FPGA |
Non-Patent Citations (2)
Title |
---|
CHEN X ET AL.: "Pipelined Back-Propagation for Context-Dependent Deep Neural Networks", INTERSPEECH *
王裕民 (WANG Yumin): "Parallel algorithms for convolutional neural networks in a multi-GPU environment", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11961001B2 (en) | 2017-12-15 | 2024-04-16 | Nvidia Corporation | Parallel forward and backward propagation |
CN109408175A (en) * | 2018-09-28 | 2019-03-01 | 北京赛博贝斯数据科技有限责任公司 | Real-time interaction method and system in general high-performance deep learning computing engines |
CN109711358A (en) * | 2018-12-28 | 2019-05-03 | 四川远鉴科技有限公司 | Neural network training method, face identification method and system and storage medium |
CN109976903A (en) * | 2019-02-22 | 2019-07-05 | 华中科技大学 | A kind of deep learning Heterogeneous Computing method and system based on slice width Memory Allocation |
US11568268B2 (en) | 2019-02-22 | 2023-01-31 | Huazhong University Of Science And Technology | Deep learning heterogeneous computing method based on layer-wide memory allocation and system thereof |
WO2022018548A1 (en) * | 2020-07-21 | 2022-01-27 | International Business Machines Corporation | Online training of neural networks |
GB2612504A (en) * | 2020-07-21 | 2023-05-03 | Ibm | Online training of neural networks |
CN112418168A (en) * | 2020-12-10 | 2021-02-26 | 深圳云天励飞技术股份有限公司 | Vehicle identification method, device, system, electronic equipment and storage medium |
CN112418168B (en) * | 2020-12-10 | 2024-04-02 | 深圳云天励飞技术股份有限公司 | Vehicle identification method, device, system, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106991474B (en) | 2019-09-24 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |