CN117501245A - Neural network model training method and device, and data processing method and device


Info

Publication number: CN117501245A
Application number: CN202180099427.XA
Authority: CN (China)
Prior art keywords: neural network, gradient, network model, computing node, gradient data
Other languages: Chinese (zh)
Inventors: 林嘉树, 朱思宇, 侯庆
Current and original assignee: Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks

Abstract

The application discloses a neural network model training method and apparatus, and a data processing method and apparatus, in the field of artificial intelligence. The training method comprises the following steps: while computing gradients of the parameters of an initial neural network model with the back propagation (BP) algorithm, a first computing node acquires the gradients, already computed by a second computing node, of the parameters of some layers of the initial neural network model. After its own gradient computation is complete, the first computing node can therefore immediately adjust the parameters of those layers according to the gradients it has already received. This shortens the idle time of the first computing node after gradient computation, improving training efficiency and training performance.

Description

Neural network model training method and device, and data processing method and device Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a neural network model training method and apparatus, and a data processing method and apparatus.
Background
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is the branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-machine interaction, recommendation and search, basic AI theory, and the like.
Training a neural network model in a distributed parallel manner, with a plurality of computing nodes training the model in parallel, can shorten the time required for training and improve training efficiency. However, each computing node experiences computation idle time during training, and this idle time lowers the overall training efficiency.
Disclosure of Invention
The application provides a neural network model training method and apparatus, and a data processing method and apparatus, which shorten the idle time of a computing node after it completes computation with the back propagation (BP) algorithm during neural network model training, improving training efficiency and training performance.
In a first aspect, a neural network model training method based on a computing system is provided, the computing system including a first computing node and a second computing node, the method comprising: the first computing node acquires a training data set, wherein the training data set comprises a plurality of training data and labeling information corresponding to each training data; the first computing node processes the training data by using an initial neural network model to obtain training information corresponding to the training data; the first computing node processes differences between training information corresponding to the training data and labeling information corresponding to the training data by using a Back Propagation (BP) algorithm to determine a first gradient data set, wherein the first gradient data set comprises first gradient data and second gradient data, the first gradient data is used for indicating gradients of parameters in a first layer set, the first layer set comprises one or more layers in the initial neural network model, the second gradient data is used for indicating gradients of parameters in a second layer set, and the second layer set comprises an input layer of the initial neural network model; the first computing node acquires a second gradient data set calculated by the second computing node, wherein the second gradient data set is used for indicating the gradient of the parameter of the initial neural network model, the second gradient data set comprises third gradient data, the third gradient data is used for indicating the gradient of the parameter in the first layer set, and the first computing node acquires the third gradient data in the process of processing the difference by the first computing node through the BP algorithm; the first computing node adjusts parameters of the initial neural network model according to the first gradient data set and the second gradient data set to obtain a trained neural network model, wherein the first computing node adjusts parameters of the first layer set and the second layer set after the first computing node determines the first gradient data set.
In a computing system that trains the neural network model in parallel, each computing node transmits to the other computing nodes, while it is still running the BP algorithm to determine the parameter gradients of the initial neural network model, the gradients of the parameters of those layers it has already finished computing. The other computing nodes can then, once their own BP computation finishes, adjust the parameters of those layers according to the gradients already received. This shortens the idle time of each computing node after BP computation during training, improving training efficiency and training performance.
It should be appreciated that the first computing node may adjust the parameters of the second layer set after adjusting the parameters of the first layer set.
With reference to the first aspect, in some possible implementations, the second gradient data set includes fourth gradient data, the fourth gradient data being used to indicate the gradients of the parameters in the second layer set, and the first computing node obtaining the second gradient data set calculated by the second computing node includes: the first computing node acquiring the fourth gradient data while it is adjusting the parameters of the first layer set.
The fourth gradient data is the last data the second computing node obtains in its BP computation. Since the first computing node can acquire it while adjusting the parameters of the first layer set, the first computing node can start adjusting the parameters of the second layer set immediately after finishing the first layer set. This can eliminate the idle time of the first computing node after its BP computation completes, further improving training efficiency and training performance.
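A minimal sketch of this overlap follows, using mpi4py's non-blocking Iallreduce as one possible way to exchange each layer's gradient as soon as it is computed; the Layer interface (.backward()/.apply()) is a hypothetical stand-in for illustration, not the patent's implementation.

```python
# Sketch: overlapping gradient exchange with the BP computation.
# Assumes mpi4py and numpy; the Layer interface is hypothetical.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def backward_with_overlap(layers, loss_grad):
    pending = []
    grad = loss_grad
    # BP runs from the output layer toward the input layer.
    for layer in reversed(layers):
        grad, param_grad = layer.backward(grad)   # hypothetical interface
        recv = np.empty_like(param_grad)
        # Non-blocking all-reduce: communication of this layer's gradient
        # overlaps with the BP computation of the earlier layers.
        req = comm.Iallreduce(param_grad, recv, op=MPI.SUM)
        pending.append((layer, recv, req))
    # When BP finishes, gradients of the later layers (the "first layer
    # set") have typically arrived already, so parameter adjustment can
    # start immediately, shortening the idle time.
    for layer, recv, req in pending:
        req.Wait()
        layer.apply(recv / comm.Get_size())       # average across nodes
```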
With reference to the first aspect, in some possible implementations, the training information is obtained by processing the training data with the initial neural network model after j adjustments, and the second gradient data set is obtained by the second computing node using the initial neural network model after j adjustments, where j is a positive integer.
Since the first computing node acquires the gradient data sets of the other computing nodes while it is itself computing, no separate time slot needs to be reserved for transmitting gradient data sets. Even if the computation time each node needs for its gradient data set differs somewhat, as long as the first computing node receives the other nodes' results for the second gradient data set before it finishes computing the first gradient data set, it can immediately adjust the corresponding parameters of the initial neural network model. Therefore, when gradient data is transmitted synchronously, the influence on training time of differences in how long each computing node takes to process its training data and labeling information can be reduced.
With reference to the first aspect, in some possible implementations, the first computing node adjusting the parameters of the initial neural network model according to the first gradient data set and the second gradient data set includes: the first computing node adjusting the parameters of the initial neural network model using a gradient clipping algorithm.
Gradient clipping can avoid vanishing and/or exploding gradients.
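For illustration, a minimal numpy sketch of one common form of gradient clipping, clipping by global L2 norm; the threshold is an arbitrary choice, and the patent does not specify which clipping variant is used.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm        # shrink every gradient uniformly
        grads = [g * scale for g in grads]
    return grads
```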
With reference to the first aspect, in some possible implementations, the adjusting, by the first computing node, parameters of the initial neural network model according to the first gradient data set and the second gradient data set includes: after the first computing node determines the first gradient data set, the first computing node adjusts parameters of the initial neural network model.
In this way, the idle time of the first computing node during training can be reduced as much as possible without having to determine how long the first computing node needs to acquire the fourth gradient data, which is simple and convenient.
In a second aspect, a data processing method is provided, the method comprising: acquiring data to be processed; and processing the data to be processed using a neural network model, where the neural network model is obtained by a first computing node adjusting parameters of an initial neural network model according to a first gradient data set and a second gradient data set, the parameters of a first layer set and a second layer set being adjusted after the first computing node determines the first gradient data set. The first layer set comprises one or more layers of the initial neural network model, and the second layer set comprises the input layer of the initial neural network model. The first gradient data set is determined by the first computing node processing, with the back propagation (BP) algorithm, differences between training information corresponding to training data and labeling information corresponding to the training data, where the training information is obtained by processing the training data with the initial neural network model. The first gradient data set comprises first gradient data indicating the gradients of the parameters in the first layer set and second gradient data indicating the gradients of the parameters in the second layer set. The second gradient data set is calculated by a second computing node and is used to indicate the gradients of the parameters of the initial neural network model; it comprises third gradient data indicating the gradients of the parameters in the first layer set, and the first computing node acquires the third gradient data while it is processing the differences with the BP algorithm.
With reference to the second aspect, in some possible implementations, the second gradient data set includes fourth gradient data indicating the gradients of the parameters in the second layer set, and the first computing node acquires the fourth gradient data while it is adjusting the parameters of the first layer set.
With reference to the second aspect, in some possible implementations, the training information is obtained by processing the training data with the initial neural network model after j adjustments, and the second gradient data set is obtained by the second computing node using the initial neural network model after j adjustments, where j is a positive integer.
With reference to the second aspect, in some possible implementations, the first computing node adjusts the parameters of the initial neural network model using a gradient clipping algorithm.
With reference to the second aspect, in some possible implementations, the adjusting of the parameters of the initial neural network model by the first computing node is performed after the first computing node determines the first gradient data set.
In a third aspect, a neural network model training apparatus is provided, the apparatus comprising an acquisition module and a processing module. The acquisition module is configured to acquire a training data set, where the training data set comprises a plurality of training data and labeling information corresponding to each training data. The processing module is configured to process the training data using an initial neural network model to obtain training information corresponding to the training data. The processing module is further configured to process, using the back propagation (BP) algorithm, differences between the training information corresponding to the training data and the labeling information corresponding to the training data to determine a first gradient data set, where the first gradient data set includes first gradient data indicating the gradients of the parameters in a first layer set comprising one or more layers of the initial neural network model, and second gradient data indicating the gradients of the parameters in a second layer set comprising the input layer of the initial neural network model. The acquisition module is further configured to acquire a second gradient data set calculated by a second computing node, where the second gradient data set is used to indicate the gradients of the parameters of the initial neural network model and includes third gradient data indicating the gradients of the parameters in the first layer set; the acquisition module is specifically configured to acquire the third gradient data while the processing module is processing the differences with the BP algorithm. The processing module is further configured to adjust the parameters of the initial neural network model according to the first gradient data set and the second gradient data set to obtain a trained neural network model, and is specifically configured to adjust the parameters of the first layer set and the second layer set after it determines the first gradient data set.
With reference to the third aspect, in some possible implementations, the second gradient data set includes fourth gradient data indicating the gradients of the parameters in the second layer set, and the acquisition module is specifically configured to acquire the fourth gradient data while the processing module is adjusting the parameters of the first layer set.
With reference to the third aspect, in some possible implementations, the training information is obtained by processing the training data with the initial neural network model after j adjustments, and the second gradient data set is obtained by the second computing node using the initial neural network model after j adjustments, where j is a positive integer.
With reference to the third aspect, in some possible implementations, the processing module is configured to adjust parameters of the initial neural network model using a gradient clipping algorithm.
With reference to the third aspect, in some possible implementations, the processing module is configured to adjust parameters of the initial neural network model after the processing module determines the first gradient data set.
In a fourth aspect, a data processing apparatus is provided, comprising an acquisition module and a processing module. The acquisition module is configured to acquire data to be processed. The processing module is configured to process the data to be processed using a neural network model, where the neural network model is obtained by a first computing node adjusting parameters of an initial neural network model according to a first gradient data set and a second gradient data set, the parameters of a first layer set and a second layer set being adjusted after the first computing node determines the first gradient data set. The first layer set comprises one or more layers of the initial neural network model, and the second layer set comprises the input layer of the initial neural network model. The first gradient data set is determined by the first computing node processing, with the back propagation (BP) algorithm, differences between training information corresponding to training data and labeling information corresponding to the training data, where the training information is obtained by processing the training data with the initial neural network model. The first gradient data set comprises first gradient data indicating the gradients of the parameters in the first layer set and second gradient data indicating the gradients of the parameters in the second layer set. The second gradient data set is calculated by a second computing node and is used to indicate the gradients of the parameters of the initial neural network model; it comprises third gradient data indicating the gradients of the parameters in the first layer set, and the first computing node acquires the third gradient data while it is processing the differences with the BP algorithm.
With reference to the fourth aspect, in some possible implementations, the second gradient data set includes fourth gradient data indicating the gradients of the parameters in the second layer set, and the first computing node acquires the fourth gradient data while it is adjusting the parameters of the first layer set.
With reference to the fourth aspect, in some possible implementations, the training information is obtained by processing the training data with the initial neural network model after j adjustments, and the second gradient data set is obtained by the second computing node using the initial neural network model after j adjustments, where j is a positive integer.
With reference to the fourth aspect, in some possible implementations, the first computing node adjusts the parameters of the initial neural network model using a gradient clipping algorithm.
With reference to the fourth aspect, in some possible implementations, the adjustment of the parameter of the initial neural network model by the first computing node is performed after the first computing node determines the first gradient data set.
In a fifth aspect, an electronic device is provided that includes a memory for storing program instructions and a processor; when the program instructions are executed by the processor, the processor performs the method of the first or second aspect.
The processor in the fifth aspect may include a central processing unit (central processing unit, CPU) or may include a combination of a CPU and a neural network arithmetic processor.
In a sixth aspect, there is provided a computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first or second aspects.
In a seventh aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first or second aspects described above.
In an eighth aspect, there is provided a chip comprising a processor and a data interface, where the processor reads, via the data interface, instructions stored in a memory to perform the method of the first or second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are configured to perform the method in the first aspect or any implementation manner of the first aspect.
The chip may be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
It should be understood that, in this application, references to the method of the first or second aspect include the method of that aspect and the method of any one of its implementations.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of another convolutional neural network according to an embodiment of the present application.
Fig. 4 is a schematic hardware structure of a chip according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application.
Fig. 6 is a schematic flow chart of a neural network model training method.
Fig. 7 is a schematic flow chart of a training method of a neural network model based on a computing system according to an embodiment of the present application.
Fig. 8 is a schematic flowchart of a training method of a neural network model according to an embodiment of the present application.
FIG. 9 is a schematic block diagram of a computing system provided by an embodiment of the present application.
Fig. 10 is a schematic flow chart of a data processing method provided in an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a neural network model training device provided in an embodiment of the present application.
Fig. 12 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application.
Fig. 13 is a schematic diagram of a hardware structure of a data processing apparatus according to an embodiment of the present application.
Fig. 14 is a schematic hardware configuration diagram of a neural network model training device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Since embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, related terms and concepts of the neural networks to which embodiments of the present application may relate are first described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network and convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be, for example, a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of that local receptive field, where the local receptive field may be an area composed of several neural units.
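To make the formula concrete, a small numpy sketch of a single neural unit with a sigmoid activation; the input and weight values are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    """Output f(sum_s W_s * x_s + b) for one neural unit."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
w = np.array([0.1, 0.4, -0.2])   # weights W_s
print(neural_unit(x, w, 0.3))    # scalar output of the unit
```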
(2) Deep neural network
Deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks with multiple hidden layers. Divided by the positions of the different layers, the layers of a DNN fall into three types: input layer, hidden layers, and output layer. Typically the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in layer i is connected to every neuron in layer i+1.
Although a DNN looks complex, the work of each layer is not complex; it is simply the following linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$, where the superscript 3 represents the layer in which the coefficient is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.

In summary, the coefficient from the kth neuron of layer L-1 to the jth neuron of layer L is defined as $W^L_{jk}$.
It should be noted that the input layer has no $W$ parameters. In a deep neural network, more hidden layers give the network a greater capability to characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and greater "capacity", meaning it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices; its final objective is to obtain the weight matrix of every layer of the trained deep neural network (the weight matrix formed by the vectors $W$ of the many layers).
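A minimal numpy sketch of the layer-by-layer computation $\vec{y} = \alpha(W\vec{x} + \vec{b})$; the layer sizes and the tanh activation are illustrative assumptions, Ws[l][j, k] plays the role of $W^{L}_{jk}$ above, and, as noted, there is no weight matrix for the input layer.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]   # input layer, two hidden layers, output layer
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    for W, b in zip(Ws, bs):
        x = np.tanh(W @ x + b)   # y = alpha(W x + b) at every layer
    return x

print(forward(rng.normal(size=4)))   # output vector of the last layer
```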
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons of adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as: the way data information is extracted is independent of location. The convolution kernel can be initialized as a matrix of random size, and reasonable weights can be obtained by learning during training of the convolutional neural network. In addition, a direct benefit of sharing weights is reducing the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
(4) Loss function
In training a deep neural network, it is desirable that the output of the network be as close as possible to the value actually expected. Therefore, the predicted value of the current network can be compared with the actually expected target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, i.e., pre-configuring parameters for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to predict lower, and the adjustment continues until the deep neural network can predict the actually expected target value or a value very close to it. Thus it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
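For illustration, a sketch of one common loss function, mean squared error; a larger value indicates a larger difference between predicted and target values. The example numbers are arbitrary.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error: larger value means the prediction is
    further from the desired target."""
    return float(np.mean((pred - target) ** 2))

print(mse_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0])))  # small loss
print(mse_loss(np.array([0.1, 0.9]), np.array([1.0, 0.0])))  # large loss
```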
(5) Forward propagation algorithm
The forward propagation algorithm is an algorithm that performs computation from front to back. Starting from the input layer, the forward propagation algorithm computes layer by layer until the output layer is reached and the output result is obtained. That is, the forward propagation algorithm obtains the output-layer result through layer-by-layer forward computation.
(6) Back propagation algorithm
During training, the neural network can use the back propagation (BP) algorithm to correct the parameters of the initial neural network model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters of the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward-propagating motion dominated by the error loss, and aims to obtain optimal parameters of the neural network model, such as the weight matrices.
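A textbook-style numpy sketch of back propagation for a network with one hidden layer and a squared-error loss; this illustrates the general algorithm only, not the patent's training procedure, and the sizes and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)
lr = 0.1  # learning rate

def step(x, y):
    global W1, b1, W2, b2
    # Forward propagation.
    h = np.tanh(W1 @ x + b1)
    pred = W2 @ h + b2
    # Back propagation of the error loss (squared error).
    d_pred = 2.0 * (pred - y)             # dLoss/dpred
    dW2 = np.outer(d_pred, h)
    d_h = (W2.T @ d_pred) * (1 - h ** 2)  # back through tanh
    dW1 = np.outer(d_h, x)
    # Update the parameters so the error loss converges.
    W2 -= lr * dW2; b2 -= lr * d_pred
    W1 -= lr * dW1; b1 -= lr * d_h

step(np.ones(4), np.array([0.5]))  # one training step on a toy sample
```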
(7) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(8) Distributed parallel training
Distributed parallel training may also be referred to as data parallel training: a copy of the entire neural network model is kept on each computing node, and each worker processes a different subset of the training data set. A large number of computing nodes train the neural network model, and the outputs of the computing nodes are consolidated to obtain the trained neural network model. Each computing node may be one of the computing nodes in a large-scale computer cluster network.
Each computing node may perform iterative processing on the training data set using the same computational flow. The computational flow includes processing of training data by the initial neural network model, back propagation, gradient aggregation, and optimizer calculations.
The processing of training data by the initial neural network model may be implemented in a forward propagation manner.
Back propagation is used to determine the gradients of the parameters of the initial neural network model from the error loss of the model. The gradient of a parameter is the partial derivative of the loss function with respect to that parameter.
Gradient aggregation sends the data obtained by back propagation on this computing node to the other computing nodes, and receives the back propagation data sent by the other computing nodes.
The optimizer adjusts the parameters of the initial neural network model according to the back propagation data of each computing node, so as to reduce the error loss.
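Putting the four steps together, a sketch of one sequential training iteration on a computing node, with mpi4py's blocking Allreduce standing in for the gradient-aggregation step; the model and optimizer objects are hypothetical. Note that the computing unit does no useful computation while step 3 runs, which is the idle time mentioned earlier.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def train_iteration(model, optimizer, batch, labels):
    # 1. Forward: process this node's subset of the training data.
    outputs = model.forward(batch)                 # hypothetical interface
    # 2. Back propagation: gradients of the loss w.r.t. every parameter.
    grads = model.backward(outputs, labels)
    # 3. Gradient aggregation: exchange gradients with all other nodes.
    #    The computing unit is idle while this blocking call runs.
    averaged = []
    for g in grads:
        buf = np.empty_like(g)
        comm.Allreduce(g, buf, op=MPI.SUM)
        averaged.append(buf / comm.Get_size())
    # 4. Optimizer: adjust the parameters to reduce the error loss.
    optimizer.apply(model, averaged)
```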
As shown in fig. 1, an embodiment of the present application provides a system architecture 100. In fig. 1, a data acquisition device 160 is used to acquire training data. For the data processing method of the embodiment of the application, the training data may include a plurality of training input data and a training identifier corresponding to each training input data.
After the training data is collected, the data collection device 160 stores the training data in the database 130 and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.
The training device 120 obtains the target model/rule 101 based on the training data: the training device 120 processes the input training input data and compares the output result with the training identifier corresponding to that training input data, until the difference between the output result of the training device 120 and the training identifier is smaller than a certain threshold, thereby completing the training of the target model/rule 101.
The above-described object model/rule 101 can be used to implement the data processing method of the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may be specifically a neural network. In practical applications, the training data maintained in the database 130 is not necessarily collected by the data collecting device 160, but may be received from other devices. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130 to perform training of the target model/rule 101, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 101 obtained by training with the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 1. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or the cloud. In fig. 1, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through a client device 140. In embodiments of the present application, the input data may include the data to be processed input by the client device.
The preprocessing module 113 and the preprocessing module 114 are used for preprocessing according to input data (such as data to be processed) received by the I/O interface 112, in this embodiment of the present application, the preprocessing module 113 and the preprocessing module 114 may be omitted (or only one of the preprocessing modules may be used), and the computing module 111 may be directly used for processing the input data.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results, such as the processing results of the data obtained as described above, to the client device 140, thereby providing the processing results to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 1, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided in the embodiments of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings is not limited in any way, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, the training device 120 trains to obtain the target model/rule 101, which in the embodiments of the present application may be the neural network of the present application; specifically, the neural network may be a CNN, a deep convolutional neural network (deep convolutional neural networks, DCNN), a recurrent neural network (recurrent neural network, RNN), or the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail with reference to fig. 2. As described in the basic concept introduction above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning architecture, in which multiple levels of learning are performed at different abstraction levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to data input thereto.
The structure of the neural network specifically adopted by the data processing method in the embodiment of the present application may be shown in fig. 2. In fig. 2, a Convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230. The input layer 210 may acquire data to be processed, and process the acquired data to be processed by the convolution layer/pooling layer 220 and the following neural network layer 230, so as to obtain a data processing result. The internal layer structure of the CNN 200 of fig. 2 is described in detail below.
Convolution layer/pooling layer 220:
convolution layer:
the convolution/pooling layer 220 as shown in fig. 2 may include layers as examples 221-226, for example: in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 are pooling layers, 224, 225 are convolutional layers, and 226 are pooling layers. I.e. the output of the convolution layer may be used as input to a subsequent pooling layer or as input to another convolution layer to continue the convolution operation.
The internal principle of operation of one convolution layer will be described below using the convolution layer 221 as an example.
The convolutional layer 221 may include many convolution operators, also called kernels. In data processing, a convolution operator acts as a filter that extracts specific information from the input data matrix. A convolution operator may essentially be a weight matrix, which is usually predefined.
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. Each weight matrix formed by the trained weight values can be used to extract information from the input data, enabling the convolutional neural network 200 to make correct predictions.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 200 increases, features extracted by the later convolutional layers (e.g., 226) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after the convolutional layers: either one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers, as illustrated by layers 221-226 of block 220 in fig. 2. During data processing, the only purpose of the pooling layer is to reduce the spatial size of the data.
Neural network layer 230:
After the processing of the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not sufficient to output the required output information. As described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input data. However, to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the neural network layer 230 to generate the output of one class or a set of classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in fig. 2) and an output layer 240, where the parameters contained in the multiple hidden layers may be pre-trained on relevant training data of a specific task type; the task type may include, for example, recognition, classification, and so on.
After the hidden layers of the neural network layer 230, the final layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy, specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (e.g., the propagation from 210 to 240 in fig. 2) is completed, the back propagation (e.g., the propagation from 240 to 210 in fig. 2) begins to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
The structure of the neural network specifically adopted by the data processing method in the embodiment of the present application may be shown in fig. 3. In fig. 3, convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230. In contrast to fig. 2, the plurality of convolution layers/pooling layers 220 in fig. 3 are parallel, and the features extracted respectively are input to the neural network layer 230 for processing.
It should be noted that the convolutional neural network shown in fig. 2 and fig. 3 is only an example of two possible convolutional neural networks used in the data processing method of the embodiment of the present application, and in a specific application, the convolutional neural network used in the data processing method of the embodiment of the present application may also exist in the form of other network models.
Fig. 4 is a hardware structure of a chip according to an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in an execution device 110 as shown in fig. 1 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 1 to complete the training work of the training device 120 and output the target model/rule 101. The algorithms of the layers in the convolutional neural network as shown in fig. 2 and 3 can be implemented in the chip as shown in fig. 4.
The neural network processor (NPU) 50 is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU assigns tasks. The core of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to extract data from memory (weight memory or input memory) and perform operations.
In some implementations, the arithmetic circuitry 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 501 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 508.
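A numpy sketch of the computation described: the product of input matrix A and weight matrix B is built up from partial results that are accumulated, loosely mirroring the role of accumulator 508. The rank-1 decomposition is an illustrative assumption about scheduling, not the NPU's actual dataflow.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))   # input matrix (from input memory 501)
B = rng.normal(size=(6, 5))   # weight matrix (from weight memory 502)

# Accumulate partial products: each step adds one rank-1 partial result
# until the full matrix product C = A @ B is formed.
C = np.zeros((4, 5))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)  # accumulated result matches the product
```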
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 507 can store the vector of processed outputs to the unified buffer 506. For example, the vector calculation unit 507 may apply a nonlinear function to an output of the operation circuit 503, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 507 generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 503, for example for use in subsequent layers in a neural network.
The unified memory 506 is used for storing input data and output data.
A direct memory access controller (DMAC) 505 transfers the input data in external memory to the input memory 501 and/or the unified memory 506, stores the weight data in external memory into the weight memory 502, and stores the data in the unified memory 506 into external memory.
A bus interface unit (BIU) 510 is used for interaction between the host CPU, the DMAC and the instruction fetch memory 509 via a bus.
An instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 for storing instructions used by the controller 504;
and a controller 504 for calling the instruction cached in the instruction memory 509 to control the operation of the operation accelerator.
Typically, the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are on-chip memories, while the external memory is memory outside the NPU; the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
The operations of the respective layers in the convolutional neural network shown in fig. 2 and 3 may be performed by the operation circuit 503 or the vector calculation unit 507.
The execution device 110 in fig. 1 described above is capable of executing the steps of the data processing method of the embodiment of the present application, and the CNN model shown in fig. 2 and 3 and the chip shown in fig. 4 may also be used to execute the steps of the data processing method of the embodiment of the present application. The following describes the neural network training method and the data processing method according to the embodiments of the present application in detail with reference to the accompanying drawings.
As shown in fig. 5, an embodiment of the present application provides a system architecture 300. The system architecture comprises a local device 301, a local device 302, and an executing device 110 and a data storage system 150, wherein the local device 301 and the local device 302 are connected to the executing device 110 through a communication network.
The execution device 110 may be implemented by one or more servers. Alternatively, the execution device 110 may be used with other computing devices, such as: data storage, routers, load balancers, etc. The execution device 110 may be disposed on one physical site or distributed across multiple physical sites. The execution device 110 may use data in the data storage system 150 or call program code in the data storage system 150 to implement the methods of data processing of embodiments of the present application.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 110. Each local device may represent any computing device, such as a personal computer, computer workstation, smart phone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set top box, game console, etc.
The local device of each user may interact with the performing device 110 through a communication network of any communication mechanism/communication standard, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
In one implementation, the local devices 301 and 302 obtain the relevant parameters of the target neural network from the execution device 110, deploy the target neural network on the local devices 301 and 302, and use the target neural network for data classification, identification, and so on.
In another implementation, the target neural network may be deployed directly on the execution device 110, where the execution device 110 obtains the data to be processed from the local device 301 and the local device 302, and classifies the data to be processed according to the target neural network or performs other types of data processing.
The executing device 110 may also be a cloud device, where the executing device 110 may be deployed at the cloud; alternatively, the executing device 110 may be a terminal device, and in this case, the executing device 110 may be disposed on the user terminal side, which is not limited in the embodiment of the present application.
At present, the neural network model is widely applied to a plurality of fields such as images, videos, voices and the like, and shows the capability exceeding the capability of the traditional method.
In training neural network models, the number of samples that need to be processed increases as the complexity of the model increases. In order to reduce the training time of the neural network model, the training of the neural network model can be performed in a distributed parallel training mode.
Fig. 6 is a schematic flow chart of a neural network model training method.
In the training process of a neural network model using a distributed parallel training mode, a plurality of computing nodes perform data processing in parallel. Fig. 6 shows the structure of a computing node 600. It should be appreciated that the various computing nodes may have the same structure.
Each computing node among the plurality of computing nodes may obtain a subset of the training data set. The subset acquired by each computing node comprises a plurality of training data and labeling information corresponding to each training data.
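For illustration only, the per-node subsets could be carved out as in the following minimal sketch (the function name and the contiguous-shard rule are our own assumptions, not something specified by the embodiment):

```python
import numpy as np

def shard_dataset(samples, labels, num_nodes, rank):
    """Return the subset of training data and labeling information
    assigned to the computing node with index `rank`."""
    idx = np.array_split(np.arange(len(samples)), num_nodes)[rank]
    return samples[idx], labels[idx]  # disjoint shard for this node
```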
The computing node 600 includes a computing unit and a communication unit. The computing unit includes an initial neural network model 610, a BP module 620, and an optimizer 640. The computing unit implements the computational functions of the computing node 600 and is the computational core of deep learning training. The performance of the computing unit has an important impact on the training time of the neural network model.
The communication unit includes an Aggregation (AR) module 630. The communication unit is used for realizing information transmission among all the computing nodes.
The training data is input into the initial neural network model 610 to obtain training information. The initial neural network model 610 is used to perform forward propagation and other calculations on training data to obtain training information.
Training information and labeling information are input to BP 620. BP 620 may perform operations using a back-propagation algorithm to obtain the gradients of the parameters in the initial neural network model.
Specifically, BP 620 may calculate the error generated at each layer based on the training information, the labeling information, and the parameter values of each layer in the initial neural network model 610. BP 620 may also calculate the partial derivative of the loss function with respect to the parameters of each layer based on the error generated at that layer and the output of that layer during the processing of the training data by the initial neural network model. The partial derivative of the loss function with respect to a parameter is the gradient of that parameter.
The parameters of each layer may include weights and biases. Thus, the partial derivatives of the loss function with respect to the parameters of each layer include the partial derivatives with respect to the weights and the partial derivatives with respect to the biases.
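As a concrete illustration of these partial derivatives, the following sketch computes the gradients of the weights and biases of a single fully connected layer under a squared-error loss (the layer sizes and the choice of loss are our own assumptions):

```python
import numpy as np

# One fully connected layer: y = W @ x + b, with loss L = 0.5 * ||y - t||^2
x = np.random.randn(4)      # input to the layer
W = np.random.randn(3, 4)   # weights
b = np.random.randn(3)      # biases
t = np.random.randn(3)      # labeling information (target)

y = W @ x + b               # forward propagation
err = y - t                 # error of this layer: dL/dy

grad_W = np.outer(err, x)   # partial derivative of L with respect to the weights
grad_b = err                # partial derivative of L with respect to the biases
```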
The AR 630 is configured to transmit, after the operation of BP 620 is completed, the gradients of the parameters in the initial neural network model calculated by each computing node. The AR 630 may send the gradient data calculated by BP 620 to the other computing nodes, and may receive the gradient data calculated by the other computing nodes. The gradient data sent by AR 630 includes the gradients of the various parameters in the initial neural network model 610.
The optimizer 640 is configured to, after the AR 630 receives the gradient data sent by each computing node other than the computing node 600, adjust the parameters in the initial neural network model 610 according to the gradient data determined by the respective computing nodes, so as to obtain an adjusted neural network model.
The adjusted neural network model can be used as an initial neural network model to process other training data.
The trained neural network model can be obtained through the iterative processing of each computing node in the plurality of computing nodes on each training data and the corresponding labeling information.
While the communication unit transmits the calculation results of the computing unit between the computing nodes, the computing unit is in an idle state, so the overall efficiency of the training process is low.
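The serial schedule of Fig. 6 can be summarized by the following sketch, in which the computing unit has nothing to do while the blocking gradient exchange runs (the function parameters and the simulated delay are illustrative assumptions, not an interface defined by the embodiment):

```python
import time

def all_reduce_blocking(grads, comm_seconds=0.5):
    """Stand-in for AR 630: exchanges and averages gradients across nodes.
    The sleep models the network transfer, during which the computing
    unit of this node sits idle."""
    time.sleep(comm_seconds)
    return grads  # single-process stand-in: nothing to average

def train_step_serial(forward, backward, update, batch, labels):
    info = forward(batch)             # initial neural network model 610
    grads = backward(info, labels)    # BP 620: all gradients computed first
    agg = all_reduce_blocking(grads)  # AR 630: computing unit idle here
    update(agg)                       # optimizer 640
```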
In order to solve the above problems, the embodiments of the present application provide a training method of a neural network model.
Fig. 7 is a schematic flow chart of a training method of a neural network model based on a computing system according to an embodiment of the present application.
Training method 700 may be performed by training device 120. The training method 700 includes S710 to S750.
The computing system includes a first computing node and at least one second computing node.
At S710, the first computing node obtains a training data set, where the training data set includes a plurality of training data and labeling information corresponding to each training data.
Under the condition that each computing node carries out parallel operation, the training time of the neural network model can be shortened, and the training efficiency is improved. Thus, each of the first computing node and the at least one second computing node may obtain a training data set. The set of training data acquired by each computing node may be different.
At S720, the first computing node processes the training data using the initial neural network model to obtain training information corresponding to the training data.
The initial neural network model in each computing node may have the same parameters.
At S730, the first computing node processes, using a back propagation BP algorithm, a difference between training information corresponding to the training data and labeling information corresponding to the training data to determine a first set of gradient data, the first set of gradient data including first gradient data for indicating gradients of parameters in a first set of layers including one or more layers in the initial neural network model and second gradient data for indicating gradients of parameters in a second set of layers including an input layer of the initial neural network model.
The difference between the training information corresponding to the training data and the labeling information can be represented by a loss value.
Different layers of the initial neural network model may be included in different layer sets. That is, the first layer set does not include the input layer of the initial neural network model.
According to the operation principle of the BP algorithm, the parameter gradients of the layers are calculated in the reverse order of the forward operation of the initial neural network model. In the first gradient data set determined using the back propagation BP algorithm, the second gradient data indicates the gradients of the parameters in the second layer set, and the second layer set includes the input layer of the neural network model. The second layer set may further include a preset number of layers located after the input layer. Thus, among the gradient data in the first gradient data set, the second gradient data is calculated last.
At S740, the first computing node acquires a second gradient data set calculated by the second computing node, the second gradient data set being used for indicating a gradient of a parameter of the initial neural network model, the second gradient data set including third gradient data, the third gradient data being used for indicating a gradient of a parameter in the first layer set, the acquisition of the third gradient data by the first computing node being performed during the processing of the difference by the first computing node using the BP algorithm.
That is, in S740, the first computing node acquires the third gradient data in the process that the first computing node processes the difference using the BP algorithm. It is understood that the first computing node starts acquiring the third gradient data during processing of the difference using the BP algorithm, and acquires the third gradient data before the processing of the difference using the BP algorithm ends.
The first computing node may receive the second gradient data set sent by the second computing node, or may send the first gradient data set to the second computing node for adjustment of the initial neural network model by the second computing node.
Specifically, in the process that the first computing node processes the difference by using the BP algorithm, the first computing node may receive the third gradient data sent by the second computing node, or may send the first gradient data to the second computing node.
At S750, the first computing node adjusts parameters of the initial neural network model according to the first gradient data set and the second gradient data set to obtain a trained neural network model, where the adjustment of the parameters of the first layer set and the second layer set by the first computing node is performed after the first computing node determines the first gradient data set.
That is, at S750, after the first computing node determines the first gradient data set, the first computing node adjusts parameters of the first layer set and the second layer set.
It should be appreciated that where the computing system includes a plurality of second computing nodes, the first computing node may acquire a second gradient data set for each second computing node at S740, and the first computing node may adjust the parameters of the initial neural network model based on the first gradient data set and the respective second gradient data sets at S750.
The first computing node may acquire, in the same time period, the gradient data corresponding to the same layer set in each of the second gradient data sets.
Each computing node in the computing system may act as a first computing node, that is, each computing node may perform method 700.
Through S710-S750, in a computing system that trains the neural network model in parallel, each computing node sends the gradients of the parameters of the partial layers of the initial neural network model that it has finished calculating to the other computing nodes while it is still using the BP algorithm to determine the parameter gradients of the initial neural network model. In this way, after completing the gradient calculation with the BP algorithm, the other computing nodes can adjust the parameters of those partial layers according to the gradients already received, thereby shortening the operation idle time of a computing node after its BP computation is completed, improving training efficiency, and improving training performance.
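A minimal single-process sketch of this overlap, using a background thread as a stand-in for the communication unit (the layer-set count, the timings, and all names are our own illustrative assumptions):

```python
import threading
import time

def train_step_overlapped(num_layer_sets=4, bp_time=0.1, comm_time=0.1):
    """Layer sets are numbered from the output side, so layer set
    num_layer_sets - 1 contains the input layer (the second layer set)."""
    received = [threading.Event() for _ in range(num_layer_sets)]

    def communicate(i):
        time.sleep(comm_time)  # stand-in for sending our gradient data and
        received[i].set()      # receiving the peers' third gradient data

    # S730: back propagation yields gradients in reverse layer order; each
    # finished layer set's exchange (S740) starts while BP keeps computing.
    for i in range(num_layer_sets):
        time.sleep(bp_time)    # compute the gradients of layer set i
        threading.Thread(target=communicate, args=(i,)).start()

    # S750: adjust parameters; the gradients of the earlier layer sets have
    # typically already arrived, so the computing unit barely waits.
    for i in range(num_layer_sets):
        received[i].wait()
        # the optimizer step for layer set i would be applied here
```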
Further, the second gradient data set may include fourth gradient data. The fourth gradient data is used to indicate the gradients of the parameters in the second layer set. The first computing node obtaining the second gradient data set calculated by the second computing node may include: the first computing node acquiring the fourth gradient data during the process in which the first computing node adjusts the parameters of the first layer set.
That is, the first computing node obtains the fourth gradient data of each second computing node while adjusting the parameters of the first layer set.
Because the fourth gradient data is obtained last by the second computing node in its calculation using the BP algorithm, and the first computing node can acquire it while adjusting the parameters of the first layer set, the first computing node can start adjusting the parameters of the second layer set immediately after adjusting the parameters of the first layer set. This can eliminate the operation idle time of the first computing node after its BP computation is completed during neural network model training, further improving training efficiency and training performance.
The first set of gradient data may include a plurality of gradient data, each of which may be used to indicate a gradient of a parameter in a set of layers, each set of layers including one or more layers in the initial neural network model. That is, the first gradient data set may further include gradient data other than the first gradient data and the second gradient data.
The gradient data in the first gradient data set and in the second gradient data set may be in one-to-one correspondence. Corresponding gradient data in the first gradient data set and the second gradient data set indicate the gradients of the parameters in the same layer set of the initial neural network model.
The first layer set may be set according to the time required for the first computing node to acquire the fourth gradient data, such that the time required for the first computing node to adjust the parameters of the first layer set is greater than or equal to the time required for the first computing node to acquire the fourth gradient data.
According to the operation principle of the BP algorithm, the parameter gradients of the layers are calculated in the reverse order of the forward operation of the initial neural network model. Neural network models typically have a large number of layers. The second layer set may be configured to include 1/2, 1/3, 1/5, 1/10, 1/20, 1/50, or even less of the total number of layers of the neural network model, thereby avoiding the situation in which acquiring the fourth gradient data requires a significant amount of time, and avoiding the situation in which the time required for the first computing node to adjust the parameters of the layer sets of the initial neural network model other than the second layer set is less than the time required for the first computing node to acquire the fourth gradient data.
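In our own notation (not the embodiment's), the sizing condition of the preceding two paragraphs can be stated as the inequality T_update(first layer set) >= T_acquire(fourth gradient data): the time the first computing node spends adjusting the parameters of the first layer set should cover the time it needs to acquire the fourth gradient data. Keeping the second layer set small keeps the fourth gradient data small, which makes the inequality easy to satisfy.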
In some embodiments, the first computing node may adjust parameters of the initial neural network model after the first computing node determines the first gradient data set.
In this way, the operation idle time of the first computing node during neural network model training can be reduced as much as possible without having to determine the time required for the first computing node to acquire the fourth gradient data, which is simple and convenient.
Typically, the difference between the computing power of the individual computing nodes in the computing system is small. Each computing node may be used as a first computing node to train the initial neural network model using method 700.
After S710 to S750, the adjusted initial neural network model may be used as an initial neural network model, S720 to S750 may be performed again, and other training data may be processed, so as to complete adjustment of parameters of the initial neural network model, and obtain a trained neural network model.
For example, in the case where the number of times of performing S720-S750 (i.e., the number of iterations) reaches a preset value, or in the case where the difference between the training information obtained by processing the training data by the initial neural network model and the labeling information corresponding to the training data gradually converges, S720-S750 may not be performed any more.
It should be appreciated that the various computing nodes may adjust the parameters of the initial neural network model in the same manner of adjustment.
In the process of multiple iterations, each computing node can adopt a synchronous or asynchronous mode to transmit gradient data.
In the asynchronous mode, during the (j+1)-th iteration, the first computing node performs processing using the initial neural network model that has undergone j adjustments to obtain training information. The second gradient data set acquired by the first computing node is the latest gradient data set calculated by the second computing node using the BP algorithm.
In the synchronous mode, the training information is obtained by processing with the initial neural network model after j adjustments, and each second gradient data set is also obtained by processing with the initial neural network model after j adjustments, where j is a positive integer. That is, during the (j+1)-th iteration of the first computing node, each acquired second gradient data set is obtained by the corresponding second computing node processing training data with the initial neural network model that has undergone j adjustments in j iterations.
The initial neural network model is trained in parallel in a synchronous mode, so that training efficiency can be improved, and training time can be shortened.
Since the first computing node acquires the gradient data sets of the other computing nodes while it is performing its own operations, no separate time needs to be set aside for the transmission of the gradient data sets. Therefore, even if the computation times required by the computing nodes to calculate their gradient data sets differ somewhat, S740 can be performed as long as partial calculation results of the second gradient data sets of the other computing nodes can be received before the first computing node completes the calculation of the first gradient data set in S730.
Therefore, through S710-S750, in the case of performing gradient data transmission in a synchronous manner, the influence of the difference in time required for processing different training data and the labeling information thereof by each computing node on the training time can be reduced.
In some embodiments, the first computing node may utilize a gradient pruning algorithm to adjust parameters of the initial neural network model.
The first computing node may perform gradient pruning on each gradient in the first gradient data set using a gradient pruning algorithm, and may acquire a second gradient data set on which gradient pruning has already been performed. Alternatively, the first computing node may perform gradient pruning on each gradient in both the first gradient data set and the second gradient data set using a gradient pruning algorithm.
Then, the first computing node can determine the aggregate gradient of each parameter in the initial neural network model according to the pruned first gradient data set and second gradient data set, and adjust the parameters in the initial neural network model according to the aggregate gradients.
Alternatively, the first computing node may first determine the aggregate gradient of each parameter according to the gradients of the parameters in the first gradient data set and the second gradient data set, then prune the aggregate gradients using a gradient pruning algorithm, and adjust the parameters of the initial neural network model using the pruned aggregate gradients.
Gradient pruning can avoid the occurrence of gradient vanishing and/or gradient explosion.
Fig. 8 is a schematic flowchart of a training method of a neural network model according to an embodiment of the present application.
The training method of the neural network model is performed by each computing node in the computing system. Each computing node includes a computing unit and a communication unit. The computing unit is used for carrying out operation and processing. The communication unit is used for communicating with other computing nodes so as to realize information transmission among the computing nodes. Fig. 8 illustrates a method of training a neural network model, using processing system 800 utilized by one of the computing nodes in the computing system as an example.
The processing system 800 includes an initial neural network model, modules BP 821 to BP 824, modules AR 831 to AR 832, and optimizers 841 to 842. The initial neural network model is used for implementing the functions of the initial neural network model 610, the modules BP 821 to BP 824 are used for implementing the functions of the BP 620, the modules AR 831 to AR 832 are used for implementing the functions of the AR 630, and the optimizers 841 to 842 are used for implementing the functions of the optimizers 640.
Before time t0, the computing node may obtain a training data set, where the training data set includes training data and labeling information corresponding to each training data.
The labeling information may also be referred to as a label. Each training data can be marked manually to obtain a label corresponding to each training data.
At time t0 to t1, the computing unit processes the training data by using the initial neural network model to obtain training information corresponding to the training data.
The processing of the training data by the initial neural network model may be performed using a forward propagation algorithm.
The initial neural network model may include a plurality of layer sets, each layer set including one or more layers. Each layer set can be understood as one FP module. Fig. 8 illustrates an example of an initial neural network model including modules FP 811 and FP 812.
At the time t1 to t2, the computing unit utilizes a back propagation algorithm to compute the gradient of the parameters of each layer in the initial neural network model according to the training information and the labeling information corresponding to the training data.
The computing unit may perform the back-propagation operation using a plurality of BP modules. The modules BP 822 and BP 824 are used to calculate the gradients of the parameters. For example, module BP 821 may be used to calculate the errors generated at the various layers in FP 812. Module BP 822 may be configured to calculate the gradients of the parameters of the various layers in FP 812 based on the calculation results of module BP 821. Module BP 823 may be used to calculate the errors generated at the various layers in FP 811. Module BP 824 may be configured to calculate the gradients of the parameters of the various layers in FP 811 based on the calculation results of module BP 823.
After the operation of module BP 822 ends, the communication unit, having obtained the gradients of the parameters of each layer in FP 812, may send them to the other computing nodes. Typically, the computing power of each computing node is similar, and the time required for each computing node to compute the gradients of the parameters of each layer in FP 812 is substantially the same. The other computing nodes likewise send the gradients of the parameters of each layer in FP 812 after completing that computation. Thus, the communication unit in the computing node may receive the gradients of the parameters of each layer in FP 812 sent by the other computing nodes while module BP 823 and/or module BP 824 perform their computations. The communication unit may be in an idle state during the computations of modules BP 821 and BP 822.
The communication unit may use the AR module to receive the parameter gradients transmitted by the other computing nodes and to transmit parameter gradients to the other computing nodes. That is, the AR module is configured to aggregate the parameter gradients calculated by the respective computing nodes. The communication unit may use module AR 831 to aggregate the parameter gradients in FP 812.
That is, at time t1 to t2, the communication unit transmits the gradients of the parameters of the layers in FP 812 between the computing nodes. At time t1 to t2, the parameter gradients in FP 812 calculated by each computing node are synchronized among the computing nodes, and each computing node obtains the parameter gradients in FP 812 calculated by the other computing nodes.
At time t2 to t3, the computing unit adjusts the parameters of the initial neural network model.
The calculation unit may adjust the parameters of FP 812 using optimizer 841.
The optimizer can calculate the average value of the gradient of a given parameter in the initial neural network model according to the gradients of that parameter calculated by the respective computing nodes, and adjust the value of the parameter in the initial neural network model according to the gradient average value.
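For instance, an averaged-gradient descent step might look like the following sketch (the learning rate and the in-memory layout of the per-node gradients are assumptions):

```python
import numpy as np

def optimizer_step(param, per_node_grads, lr=0.01):
    """Average one parameter's gradients from all computing nodes and
    adjust the parameter value with the averaged gradient."""
    avg_grad = np.mean(per_node_grads, axis=0)  # aggregation across nodes
    return param - lr * avg_grad                # gradient descent update
```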
While the computing unit adjusts the parameters of FP 812 using optimizer 841, the communication unit may send the parameter gradients in FP 811 calculated by module BP 824 to the other computing nodes, and may receive the parameter gradients in FP 811 calculated by the other computing nodes. The communication unit may use module AR 832 to aggregate the parameter gradients in FP 811.
Thus, the calculation unit may adjust the parameters of FP 811 with optimizer 842 after adjusting the parameters of FP 812 with optimizer 841.
It should be appreciated that a dependency of each optimizer on BP 824, the last module of the back-propagation operation, may be added; that is, the operation of each optimizer starts to execute only after the operation of module BP 824 is completed.
At time t0 to t3, the computing unit performs the various operations and processing of the FP modules, the BP modules, and the optimizers serially. That is, the operations of these modules are performed in series. The computing node completes one iteration at time t0 to t3, takes the adjusted initial neural network model as the initial neural network model, and repeats the processing of time t0 to t3 to complete the training of the neural network model.
In each computing node of the computing system, during the back-propagation operation, the computing unit outputs the gradients of the parameters of the layers of the initial neural network model in the reverse order of the layers. While the computing unit performs the back-propagation operation, the communication unit transmits the already-calculated parameter gradients to the other computing nodes and receives the calculated parameter gradients sent by the other computing nodes. After the computing unit finishes the back-propagation operation, it can adjust the parameters of the neural network model according to the received parameter gradients calculated by the respective computing nodes, while the communication unit continues to transmit parameter gradients. Therefore, the idle time of the computing unit in each iteration can be reduced or even eliminated, the utilization of the computing unit is improved, the time required for training the neural network model is reduced, and training efficiency is improved.
It should be appreciated that the length of time required for each module to process may be the same or different, depending on the length of time required for the actual process.
Compared with the time required for one iteration of the computing node 600 (the sum of the times required by the initial neural network model 610, BP 620, AR 630, and the optimizer 640), training the neural network model with the method provided in the embodiments of the present application transmits gradient data while the computing unit performs its operations, so the idle time of the computing unit in each iteration can be reduced or even eliminated without increasing the amount of computation.
In some embodiments, the optimizer or a BP module may perform gradient pruning (gradient clipping).
If the gradient of a parameter is infinitely close to 0, i.e., the gradient vanishes, the initial neural network model cannot be updated effectively. If the gradient of a parameter is excessively large, i.e., the gradient explodes, the initial neural network model may skip over the optimal solution and fail to converge.

Gradient pruning may be performed in order to avoid gradient vanishing and/or gradient explosion.

Gradient pruning may also be referred to as gradient clipping. A preset range may be set; when a gradient exceeds the preset range, the boundary value of the preset range is used as the gradient to adjust the parameters of the neural network model. Alternatively, a gradient may be pruned according to its norm. A preset range for the mean or for the variance of the gradients of a plurality of parameters may also be set; when the gradients of the plurality of parameters do not satisfy the preset mean range or the preset variance range, the gradients of the plurality of parameters are pruned so that the pruned gradients satisfy the preset range.
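The two forms just described, clipping to a preset range and pruning by norm, could be sketched as follows (the threshold values are illustrative assumptions):

```python
import numpy as np

def clip_by_value(grad, lo=-1.0, hi=1.0):
    """When a gradient exceeds the preset range [lo, hi], the boundary
    value of the range is used as the gradient."""
    return np.clip(grad, lo, hi)

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```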
It should be appreciated that a BP module may perform gradient pruning on the gradients it has calculated. The optimizer may average the gradients of a parameter calculated by the respective computing nodes, and may perform gradient pruning on the averaged gradient value.
FIG. 9 is a schematic block diagram of a computing system provided by an embodiment of the present application.
The computing system comprises a plurality of computing nodes, and each computing node can perform the training method of the neural network model shown in fig. 7 or fig. 8, so as to realize distributed parallel training.
The plurality of computing nodes may be located in the same or different computing devices.
Each computing node may be directly connected to only some of the other computing nodes. A direct connection between computing nodes means that information can be transmitted through the communication interface between them without being forwarded by other computing nodes. Information between computing nodes that are not directly connected may be forwarded by other computing nodes.
For example, information transmission between the computing nodes may be implemented by means of ring aggregation (ring AR). In ring aggregation, the computing nodes form a ring, and each computing node is directly connected to only its two adjacent computing nodes; information transmission among all the computing nodes is achieved through the forwarding of information by the computing nodes.
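The arithmetic of ring aggregation can be reproduced in a single process as in the sketch below: over 2*(N-1) steps, each chunk of the gradient travels around the ring once to be summed (reduce-scatter) and once more to be distributed (all-gather). The chunking scheme and all names are our own assumptions:

```python
import numpy as np

def ring_all_reduce(node_grads):
    """node_grads: one gradient vector per computing node. Returns each
    node's copy after ring aggregation (the element-wise sum)."""
    n = len(node_grads)
    chunks = [np.array_split(g.astype(float), n) for g in node_grads]

    # Reduce-scatter: in step s, node i passes chunk (i - s) % n to its
    # ring neighbour (i + 1) % n, which adds it to its own copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # All-gather: the completed chunks travel around the ring once more,
    # overwriting each node's stale copies.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]

grads = [np.arange(8.0) * (k + 1) for k in range(4)]  # 4 nodes' gradients
out = ring_all_reduce(grads)
assert all(np.allclose(o, sum(grads)) for o in out)   # every node holds the sum
```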
Alternatively, direct connections may be made entirely between computing nodes.
One-way or two-way information transmission can be performed through the communication interfaces of the two computing nodes.
Fig. 10 is a schematic flow chart of a data processing method provided in an embodiment of the present application.
The data processing method 1000 includes S1010 to S1020.
At S1010, data to be processed is acquired.
At S1020, the data to be processed is processed by using a neural network model, where the neural network model is obtained by adjusting parameters of an initial neural network model by a first computing node according to a first gradient data set and a second gradient data set, where the adjustment of parameters of a first layer set and a second layer set is performed after the first computing node determines the first gradient data set, the first layer set includes one or more layers in the initial neural network model, and the second layer set includes an input layer of the neural network model.
The first gradient data set is obtained by processing differences between training information corresponding to training data and labeling information corresponding to the training data by the first computing node through a back propagation BP algorithm, the training information corresponding to the training data is obtained by processing the training data by the first computing node through the initial neural network model, the first gradient data set comprises first gradient data and second gradient data, the first gradient data is used for indicating gradients of parameters in the first layer set, and the second gradient data is used for indicating gradients of parameters in the second layer set.
The second gradient data set is calculated by a second computing node, the second gradient data set is used for indicating the gradient of the parameter of the initial neural network model, the second gradient data set comprises third gradient data, the third gradient data is used for indicating the gradient of the parameter in the first layer set, and the first computing node acquires the third gradient data in the process that the first computing node processes the difference by using the BP algorithm.
Optionally, the second gradient data set includes fourth gradient data for indicating the gradients of the parameters in the second layer set, and the first computing node acquires the fourth gradient data during the process in which the first computing node adjusts the parameters of the first layer set.
Optionally, the training information is obtained by processing the initial neural network model after j times of adjustment, and the second gradient data set is obtained by processing the second computing node by using the initial neural network model after j times of adjustment, where j is a positive integer.
Optionally, the first computing node adjusts the parameters of the initial neural network model using a gradient pruning algorithm.
Optionally, the adjusting of the parameters of the initial neural network model by the first computing node is performed after the first computing node determines the first gradient data set.
That is, the neural network model for processing the data to be processed may be trained by the training method of the neural network model shown in fig. 7 or fig. 8.
The data processing system, the neural network model training method, and the data processing method provided in the embodiments of the present application are described above with reference to fig. 1 to 10; the apparatus embodiments of the present application are described below with reference to fig. 11 to 15. It should be understood that the descriptions of the neural network model training method and the data processing method correspond to the descriptions of the apparatus embodiments; therefore, portions not described in detail may refer to the foregoing description.
Fig. 11 is a schematic structural diagram of a neural network model training device provided in an embodiment of the present application. The neural network model training apparatus 3000 may be located in the training device 120 shown in fig. 1 or other devices. The neural network model training apparatus 3000 includes an acquisition module 3010 and a processing module 3020.
The obtaining module 3010 is configured to obtain a training data set, where the training data set includes a plurality of training data and labeling information corresponding to each training data.
The processing module 3020 is configured to process the training data by using an initial neural network model to obtain training information corresponding to the training data.
The processing module 3020 is configured to process, by using a back propagation BP algorithm, a difference between training information corresponding to the training data and labeling information corresponding to the training data, to determine a first gradient data set, where the first gradient data set includes first gradient data and second gradient data, the first gradient data is used to indicate a gradient of a parameter in a first layer set, the first layer set includes one or more layers in the initial neural network model, and the second gradient data is used to indicate a gradient of a parameter in a second layer set, where the second layer set includes an input layer of the initial neural network model.
The obtaining module 3010 is configured to obtain a second gradient data set calculated by the second computing node, where the second gradient data set is used to indicate a gradient of a parameter of the initial neural network model, and the second gradient data set includes third gradient data, where the third gradient data is used to indicate a gradient of a parameter in the first layer set.
The obtaining module 3010 is specifically configured to obtain the third gradient data in a process that the processing module processes the difference by using the BP algorithm.
The processing module 3020 is configured to adjust parameters of the initial neural network model according to the first gradient data set and the second gradient data set, so as to obtain a trained neural network model.
The processing module 3020 is specifically configured to adjust parameters of the first layer set and the second layer set after the processing module determines the first gradient data set.
Optionally, each second gradient data set comprises fourth gradient data for indicating the gradients of the parameters in the second layer set.
The obtaining module 3010 is specifically configured to obtain the fourth gradient data in a process of adjusting the parameters of the first layer set by the processing module.
Optionally, the training information is obtained by processing the initial neural network model after j times of adjustment, and the second gradient data set is obtained by processing the second computing node by using the initial neural network model after j times of adjustment, where j is a positive integer.
Optionally, the processing module 3020 is configured to adjust parameters of the initial neural network model using a gradient clipping algorithm.
Optionally, the processing module 3020 is configured to adjust the parameter of the initial neural network model after the processing module determines the first gradient data set.
Fig. 12 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application. The data processing apparatus 2000 may be located in the executing device 110 shown in fig. 1 or other devices. The data processing apparatus 2000 includes an acquisition module 2010 and a processing module 2020.
The acquiring module 2010 is configured to acquire data to be processed.
The processing module 2020 is configured to process the data to be processed by using a neural network model, where the neural network model is obtained by adjusting parameters of an initial neural network model by a first computing node according to a first gradient data set and a second gradient data set, where the adjustment of parameters of a first layer set and a second layer set is performed after the first computing node determines the first gradient data set, the first layer set includes one or more layers in the initial neural network model, and the second layer set includes an input layer of the neural network model.
The first gradient data set is obtained by processing differences between training information corresponding to training data and labeling information corresponding to the training data by the first computing node through a back propagation BP algorithm, the training information corresponding to the training data is obtained by processing the training data by the first computing node through the initial neural network model, the first gradient data set comprises first gradient data and second gradient data, the first gradient data is used for indicating gradients of parameters in the first layer set, and the second gradient data is used for indicating gradients of parameters in the second layer set.
The second gradient data set is calculated by a second computing node, the second gradient data set is used for indicating the gradient of the parameter of the initial neural network model, the second gradient data set comprises third gradient data, the third gradient data is used for indicating the gradient of the parameter in the first layer set, and the first computing node acquires the third gradient data in the process that the first computing node processes the difference by using the BP algorithm.
Optionally, the second gradient data set includes fourth gradient data for indicating the gradients of the parameters in the second layer set, and the first computing node acquires the fourth gradient data while the first computing node adjusts the parameters of the first layer set.
Optionally, the training information is obtained by processing the initial neural network model after j times of adjustment, and the second gradient data set is obtained by processing the second computing node by using the initial neural network model after j times of adjustment, where j is a positive integer.
Optionally, the first computing node adjusts the parameters of the initial neural network model using a gradient pruning algorithm.
Optionally, the adjusting of the parameters of the initial neural network model by the first computing node is performed after the first computing node determines the first gradient data set.
Fig. 13 is a schematic diagram of a hardware structure of a data processing apparatus according to an embodiment of the present application. The data processing apparatus 4000 shown in fig. 13 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004. The memory 4001, the processor 4002 and the communication interface 4003 are connected to each other by a bus 4004.
The memory 4001 may be a ROM, a static storage device, or a RAM. The memory 4001 may store a program; when the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 and the communication interface 4003 are used to perform the steps of the data processing method of the embodiments of the present application.
The processor 4002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions required by the units in the data processing apparatus of the embodiments of the present application or to perform the data processing method of the method embodiments of the present application.
The processor 4002 may also be an integrated circuit chip having signal processing capabilities, for example, the chip shown in fig. 4. In implementation, various steps of the data processing method of the embodiments of the present application may be completed by instructions in the form of integrated logic circuits of hardware or software in the processor 4002.
The processor 4002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 4001; the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, implements the functions required by the units included in the data processing apparatus of the embodiments of the present application, or performs the data processing method of the method embodiments of the present application.
The communication interface 4003 enables communication between the apparatus 4000 and other devices or communication networks using a transceiving apparatus such as, but not limited to, a transceiver. For example, the data to be processed, such as an image, can be acquired through the communication interface 4003.
Bus 4004 may include a path for transferring information between various components of device 4000 (e.g., memory 4001, processor 4002, communication interface 4003).
Fig. 14 is a schematic hardware configuration diagram of a neural network model training device according to an embodiment of the present application. Similar to the apparatus 4000 described above, the neural network model training apparatus 5000 shown in fig. 14 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. The memory 5001, the processor 5002, and the communication interface 5003 are communicatively connected to each other via a bus 5004.
The neural network model training device 5000 shown in fig. 14 is used to train the initial neural network model, and the neural network model obtained by training can be used to execute the data processing method according to the embodiment of the present application.
Specifically, the apparatus shown in fig. 14 can acquire a training data set required for training and an initial neural network model from the outside through the communication interface 5003, and then perform training of the neural network model by the processor based on the training data set and the initial neural network model.
It should be noted that although the above described apparatus 4000 and apparatus 5000 only show memory, processors, communication interfaces, in a specific implementation, it will be appreciated by those skilled in the art that the apparatus 4000 and apparatus 5000 may also include other devices necessary to achieve proper operation. Also, those skilled in the art will appreciate that the apparatus 4000 and the apparatus 5000 may also include hardware devices that implement other additional functions, as desired. Furthermore, those skilled in the art will appreciate that the apparatus 4000 and the apparatus 5000 may also include only the devices necessary to implement the embodiments of the present application, and not necessarily all of the devices shown in fig. 13 and 14.
It should be appreciated that the processor in embodiments of the present application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example but not limitation, many forms of random access memory (random access memory, RAM) are available, such as Static RAM (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

  1. A neural network model training method based on a computing system, wherein the computing system comprises a first computing node and a second computing node, the method comprising:
    the first computing node acquires a training data set, wherein the training data set comprises a plurality of training data and labeling information corresponding to each training data;
    the first computing node processes the training data by using an initial neural network model to obtain training information corresponding to the training data;
    the first computing node processes differences between training information corresponding to the training data and labeling information corresponding to the training data by using a Back Propagation (BP) algorithm to determine a first gradient data set, wherein the first gradient data set comprises first gradient data and second gradient data, the first gradient data is used for indicating gradients of parameters in a first layer set, the first layer set comprises one or more layers in the initial neural network model, the second gradient data is used for indicating gradients of parameters in a second layer set, and the second layer set comprises an input layer of the initial neural network model;
    The first computing node acquires a second gradient data set calculated by the second computing node, wherein the second gradient data set is used for indicating the gradient of the parameter of the initial neural network model, the second gradient data set comprises third gradient data, the third gradient data is used for indicating the gradient of the parameter in the first layer set, and the first computing node acquires the third gradient data in the process of processing the difference by the first computing node through the BP algorithm;
    the first computing node adjusts parameters of the initial neural network model according to the first gradient data set and the second gradient data set to obtain a trained neural network model, wherein the first computing node adjusts parameters of the first layer set and the second layer set after the first computing node determines the first gradient data set.
  2. The method of claim 1, wherein the second gradient data set includes fourth gradient data for indicating gradients of parameters in the second layer set,
    The first computing node obtaining a second gradient data set calculated by the second computing node, including: and in the process of adjusting the parameters of the first layer set by the first computing node, the first computing node acquires the fourth gradient data.
  3. The method according to claim 1 or 2, wherein the training information is obtained by processing the initial neural network model after j times of adjustment, and the second gradient data set is obtained by processing the second computing node by using the initial neural network model after j times of adjustment, where j is a positive integer.
  4. A method according to any of claims 1-3, wherein the first computing node adjusting parameters of the initial neural network model according to the first gradient data set and the second gradient data set, comprising:
    and the first computing node adjusts parameters of the initial neural network model by using a gradient pruning algorithm.
  5. The method of any of claims 1-4, wherein the first computing node adjusting the parameters of the initial neural network model according to the first gradient data set and the second gradient data set comprises:
    adjusting, by the first computing node, the parameters of the initial neural network model after the first computing node determines the first gradient data set.
  6. A method of data processing, the method comprising:
    acquiring data to be processed;
    processing the data to be processed by using a neural network model, wherein the neural network model is obtained by a first computing node adjusting parameters of an initial neural network model according to a first gradient data set and a second gradient data set, the adjustment of the parameters of a first layer set and a second layer set is performed after the first computing node determines the first gradient data set, the first layer set comprises one or more layers in the initial neural network model, and the second layer set comprises an input layer of the initial neural network model,
    the first gradient data set is obtained by processing, by the first computing node, a difference between training information corresponding to training data and labeling information corresponding to the training data by using a back propagation BP algorithm, the training information corresponding to the training data is obtained by processing, by the first computing node, the training data by using the initial neural network model, the first gradient data set includes first gradient data and second gradient data, the first gradient data is used for indicating gradients of parameters in the first layer set, the second gradient data is used for indicating gradients of parameters in the second layer set,
    the second gradient data set is calculated by a second computing node, the second gradient data set is used for indicating gradients of the parameters of the initial neural network model, the second gradient data set comprises third gradient data, the third gradient data is used for indicating gradients of the parameters in the first layer set, and the first computing node acquires the third gradient data in a process in which the first computing node processes the difference by using the BP algorithm.
  7. The method of claim 6, wherein the second gradient data set comprises fourth gradient data, the fourth gradient data is used for indicating gradients of the parameters in the second layer set, and the first computing node acquires the fourth gradient data in a process in which the first computing node adjusts the parameters of the first layer set.
  8. The method of claim 6 or 7, wherein the training information is obtained through processing by using the initial neural network model that has been adjusted j times, the second gradient data set is obtained by the second computing node through processing by using the initial neural network model that has been adjusted j times, and j is a positive integer.
  9. The method of any of claims 6-8, wherein the adjustment of the parameters of the initial neural network model by the first computing node is performed using a gradient pruning algorithm.
  10. The method of any of claims 6-9, wherein the adjustment of parameters of the initial neural network model by the first computing node is performed after the first computing node determines the first gradient data set.
  11. A neural network model training apparatus, wherein the apparatus comprises an acquisition module and a processing module;
    the acquisition module is used for acquiring a training data set, wherein the training data set comprises a plurality of training data and labeling information corresponding to each training data;
    the processing module is used for processing the training data by using an initial neural network model so as to obtain training information corresponding to the training data;
    the processing module is further configured to process, by using a back propagation BP algorithm, a difference between training information corresponding to the training data and labeling information corresponding to the training data, to determine a first gradient data set, where the first gradient data set includes first gradient data for indicating gradients of parameters in a first layer set including one or more layers in the initial neural network model, and second gradient data for indicating gradients of parameters in a second layer set including an input layer of the initial neural network model;
    the acquisition module is further configured to acquire a second gradient data set calculated by a second computing node, wherein the second gradient data set is used for indicating gradients of parameters of the initial neural network model, the second gradient data set comprises third gradient data, and the third gradient data is used for indicating gradients of parameters in the first layer set,
    wherein the acquisition module is specifically configured to acquire the third gradient data in a process in which the processing module processes the difference by using the BP algorithm;
    the processing module is further configured to adjust parameters of the initial neural network model according to the first gradient data set and the second gradient data set to obtain a trained neural network model,
    the processing module is specifically configured to adjust parameters of the first layer set and the second layer set after the processing module determines the first gradient data set.
  12. The apparatus of claim 11, wherein the second gradient data set comprises fourth gradient data, and the fourth gradient data is used for indicating gradients of parameters in the second layer set,
    wherein the acquisition module is specifically configured to acquire the fourth gradient data in a process in which the processing module adjusts the parameters of the first layer set.
  13. The apparatus of claim 11 or 12, wherein the training information is obtained through processing by using the initial neural network model that has been adjusted j times, the second gradient data set is obtained by the second computing node through processing by using the initial neural network model that has been adjusted j times, and j is a positive integer.
  14. The apparatus of any of claims 11-13, wherein the processing module is configured to adjust the parameters of the initial neural network model by using a gradient pruning algorithm.
  15. The apparatus of any of claims 11-14, wherein the processing module is configured to adjust parameters of the initial neural network model after the processing module determines the first gradient data set.
  16. A data processing apparatus, wherein the apparatus comprises an acquisition module and a processing module;
    the acquisition module is used for acquiring data to be processed;
    the processing module is configured to process the data to be processed by using a neural network model, wherein the neural network model is obtained by a first computing node adjusting parameters of an initial neural network model according to a first gradient data set and a second gradient data set, the adjustment of the parameters of a first layer set and a second layer set is performed after the first computing node determines the first gradient data set, the first layer set comprises one or more layers in the initial neural network model, and the second layer set comprises an input layer of the initial neural network model,
    the first gradient data set is obtained by processing, by the first computing node, a difference between training information corresponding to training data and labeling information corresponding to the training data by using a back propagation BP algorithm, the training information corresponding to the training data is obtained by processing, by the first computing node, the training data by using the initial neural network model, the first gradient data set includes first gradient data and second gradient data, the first gradient data is used for indicating gradients of parameters in the first layer set, the second gradient data is used for indicating gradients of parameters in the second layer set,
    the second gradient data set is calculated by a second computing node, the second gradient data set is used for indicating gradients of the parameters of the initial neural network model, the second gradient data set comprises third gradient data, the third gradient data is used for indicating gradients of the parameters in the first layer set, and the first computing node acquires the third gradient data in a process in which the first computing node processes the difference by using the BP algorithm.
  17. The apparatus of claim 16, wherein the second gradient data set comprises fourth gradient data, the fourth gradient data is used for indicating gradients of the parameters in the second layer set, and the first computing node acquires the fourth gradient data in a process in which the first computing node adjusts the parameters of the first layer set.
  18. The apparatus of claim 16 or 17, wherein the training information is obtained through processing by using the initial neural network model that has been adjusted j times, the second gradient data set is obtained by the second computing node through processing by using the initial neural network model that has been adjusted j times, and j is a positive integer.
  19. The apparatus of any of claims 16-18, wherein the adjustment of the parameters of the initial neural network model by the first computing node is performed using a gradient pruning algorithm.
  20. The apparatus of any of claims 16-19, wherein the adjustment of parameters of the initial neural network model by the first computing node is performed after the first computing node determines the first gradient data set.
  21. An electronic device, wherein the device comprises a memory and a processor, the memory is configured to store a program, and the processor is configured to perform the method of any of claims 1-10 when the program is executed on the processor.
  22. A computer readable storage medium, wherein the computer readable storage medium stores program code for execution by a device, and when the program code is executed by the device, the device is caused to perform the method of any of claims 1-10.
  23. A chip, wherein the chip comprises a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method of any of claims 1-10.
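
To make the mechanism recited in claims 1 and 2 above easier to follow, the following is a minimal, self-contained Python sketch of overlapping gradient communication with back propagation: each computing node publishes a layer's gradient as soon as the BP pass produces it, and concurrently picks up whatever gradients its peer has already finished, instead of waiting for the whole backward pass. Everything in the sketch (the two-node queue exchange, run_node, the stand-in gradient values) is an illustrative assumption, not the claimed implementation.

    import queue
    import threading

    LAYERS = ["output", "hidden", "input"]  # backward order: output layer first

    def run_node(name, inbox, outbox, results):
        """One computing node: a layer-by-layer backward pass that exchanges
        per-layer gradients with the peer node while still computing."""
        peer_grads = {}
        for i, layer in enumerate(LAYERS):   # back propagation, layer by layer
            local_grad = 0.1 * (i + 1)       # stand-in for dLoss/dParams(layer)
            outbox.put((layer, local_grad))  # publish as soon as it is ready
            while True:                      # drain whatever the peer sent so far
                try:
                    peer_layer, g = inbox.get_nowait()
                except queue.Empty:
                    break
                peer_grads[peer_layer] = g
        while len(peer_grads) < len(LAYERS): # stragglers after local BP finishes
            peer_layer, g = inbox.get()
            peer_grads[peer_layer] = g
        results[name] = peer_grads           # peer gradients, ready for averaging

    a_to_b, b_to_a, results = queue.Queue(), queue.Queue(), {}
    node_a = threading.Thread(target=run_node, args=("A", b_to_a, a_to_b, results))
    node_b = threading.Thread(target=run_node, args=("B", a_to_b, b_to_a, results))
    node_a.start(); node_b.start(); node_a.join(); node_b.join()
    print(results)  # each node ends up holding the other node's per-layer gradients

The point of the structure is timing: by the time a node finishes its own backward pass (the first gradient data set), much of the peer's second gradient data set has already arrived, which shrinks the idle window before the parameter adjustment.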
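
Claims 1 and 5 further recite that the first computing node adjusts the parameters according to both gradient data sets once its own set is determined. In two-node data-parallel training this is conventionally done by averaging the two gradients before a descent step; the sketch below assumes plain SGD and that averaging convention, neither of which is fixed by the claims.

    def sgd_step(params, local_grads, peer_grads, lr=0.01):
        # Average this node's gradient with the peer's for each parameter,
        # then take one plain gradient-descent step: w <- w - lr * g_avg.
        return {name: w - lr * (local_grads[name] + peer_grads[name]) / 2.0
                for name, w in params.items()}

    params = {"hidden": 0.5, "output": -0.3}
    local = {"hidden": 0.2, "output": 0.1}
    peer = {"hidden": 0.4, "output": -0.1}
    print(sgd_step(params, local, peer))  # {'hidden': 0.497, 'output': -0.3}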
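
Claims 4, 9, 14 and 19 recite a gradient pruning algorithm without fixing a formula. One common reading is norm-based gradient clipping, sketched here purely under that assumption; the threshold max_norm and the use of the L2 norm are illustrative choices, not terms of the claims.

    import math

    def clip_gradients(grads, max_norm=1.0):
        """Rescale the gradient vector if its L2 norm exceeds max_norm,
        leaving its direction unchanged."""
        total_norm = math.sqrt(sum(g * g for g in grads))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            grads = [g * scale for g in grads]
        return grads

    print(clip_gradients([3.0, 4.0]))  # [0.6, 0.8]: norm 5.0 clipped to 1.0
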
CN202180099427.XA 2021-06-25 2021-06-25 Neural network model training method and device, and data processing method and device Pending CN117501245A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/102480 WO2022267036A1 (en) 2021-06-25 2021-06-25 Neural network model training method and apparatus and data processing method and apparatus

Publications (1)

Publication Number Publication Date
CN117501245A 2024-02-02

Family

ID=84545142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180099427.XA Pending CN117501245A (en) 2021-06-25 2021-06-25 Neural network model training method and device, and data processing method and device

Country Status (2)

Country Link
CN (1) CN117501245A (en)
WO (1) WO2022267036A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561787A (en) * 2023-07-04 2023-08-08 北京数牍科技有限公司 Training method and device for visual image classification model and electronic equipment
CN116955365B (en) * 2023-09-21 2024-02-09 浪潮电子信息产业股份有限公司 Gradient data synchronization method, model training method, system, equipment and medium
CN117093871B (en) * 2023-10-16 2024-02-13 之江实验室 Deep learning-oriented distributed training evaluation method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103996069B (en) * 2013-02-20 2018-04-03 百度在线网络技术(北京)有限公司 A kind of BPNN training methods and device based on more GPU
CN103150596B (en) * 2013-02-22 2015-12-23 百度在线网络技术(北京)有限公司 The training system of a kind of reverse transmittance nerve network DNN
EP4202782A1 (en) * 2015-11-09 2023-06-28 Google LLC Training neural networks represented as computational graphs
CN109409505A (en) * 2018-10-18 2019-03-01 中山大学 A method of the compression gradient for distributed deep learning
CN110263707B (en) * 2019-06-19 2021-09-10 湖南大学 Image recognition method and device, computer equipment and storage medium
CN112631775A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Model training method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2022267036A1 (en) 2022-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination