
Neural network model training method and device and computing equipment

Info

Publication number: CN113570053A
Application number: CN202010353931.5A
Authority: CN (China)
Prior art keywords: floating point number, value, block, network layer
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘强, 孟浩, 韩亮, 焦阳
Current assignee: Pingtouge Shanghai Semiconductor Co Ltd
Original assignee: Alibaba Group Holding Ltd
Events: application filed by Alibaba Group Holding Ltd; priority to CN202010353931.5A; publication of CN113570053A; legal status pending

Classifications

    • G06N3/00 Computing arrangements based on biological models (G: Physics; G06: Computing, calculating or counting; G06N: Computing arrangements based on specific computational models)
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a training method and device of a neural network model and computing equipment. The method comprises a forward propagation step, a backward propagation step and a parameter updating step. In the parameter updating step, a parameter update value of the current network layer is calculated based on the parameter gradient, and a fourth block floating point number corresponding to the parameter update value is generated, the bit width of the fourth block floating point number being a third predetermined value; the parameters of the current network layer are then updated based on the first block floating point number and the fourth block floating point number, and the first block floating point number corresponding to the updated parameters is generated.

Description

Neural network model training method and device and computing equipment
Technical Field
The invention relates to the technical field of deep learning, in particular to a training method and device of a neural network model and computing equipment.
Background
As artificial intelligence has developed rapidly in recent years, artificial intelligence, and deep learning algorithms in particular, have been widely used in fields such as visual images, speech recognition and natural language processing. With the popularization of on-device intelligence, more and more artificial intelligence computation needs to be performed on edge devices to bring new interactive experiences. However, edge devices have less computing power and memory than servers. Therefore, how to optimize deep learning algorithms with high computation and storage consumption so that they can be deployed on edge devices is a hot topic in the field of artificial intelligence.
One method of optimizing deep learning algorithms, and neural network models in particular, is to quantize the model parameters, i.e., to convert parameters such as single-precision floating point numbers into an integer format. Since the processor in an end device, whether a GPU or another processor, can process integer arithmetic faster than floating point arithmetic, a deep learning model deployed on an end device can be accelerated by parameter quantization, thereby speeding up the inference process.
However, neural network model training is still mainly based on high-precision floating point computing platforms. With the increasing network scale and training data sets of neural network models, huge computing power, storage space and power consumption are required to train them. Research on quantization compression technology shows that it offers the performance advantages of low storage, low power consumption and high operation speed. Neural network model training comprises three parts: forward propagation, backward propagation and parameter updating. Existing research on low-precision quantization compression mainly focuses on the forward inference part of the neural network model, and there is less research on low-precision back propagation and parameter updating. Since back propagation is twice as computationally intensive as forward propagation, it is important to quantize the backward gradient propagation.
In the training process of a low-precision neural network model, the parameter update value cannot be directly accumulated into the low-precision parameter because the parameter update value is too small compared with the parameter value (which is usually a single-precision floating point number). To solve this problem, low-precision parameter values are generally used in forward propagation and backward propagation, and corresponding high-precision floating point values are additionally stored for the low-precision parameter values in order to update the parameters. This adds extra storage resources on the one hand, and requires frequent conversion between low-precision and high-precision data types on the other, which results in low training efficiency.
Disclosure of Invention
In view of the above, the present invention has been made to provide a training method, apparatus and computing device of a neural network model that overcomes or at least partially solves the above problems.
According to an aspect of the present invention, there is provided a training method of a neural network model, executed in a server, the neural network model including a plurality of network layers, the method including: a forward propagation step: acquiring a first block floating point number corresponding to the parameters of the current network layer; quantizing the activation value output by the upper network layer into a second block floating point number, wherein the bit widths of the first block floating point number and the second block floating point number are a first predetermined value; calculating the activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer; a back propagation step: quantizing the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second predetermined value; calculating the activation value gradient of the upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer; calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number; and a parameter updating step: calculating a parameter update value of the current network layer based on the parameter gradient, and generating a fourth block floating point number corresponding to the parameter update value, wherein the bit width of the fourth block floating point number is a third predetermined value; and updating the parameters of the current network layer based on the first block floating point number and the fourth block floating point number, and generating the first block floating point number corresponding to the updated parameters.
Optionally, in the training method of a neural network model of the present invention, the current network layer is a fully-connected layer, and the fully-connected layer includes a linear processing unit and an activation function unit; the calculating an activation value of the current network layer based on the first block floating point number and the second block floating point number includes: inputting the first block floating point number and the second block floating point number into the linear processing unit for processing, and outputting a linear value with a bit width of the third predetermined value; quantizing the linear value into a quantized linear value with a bit width of the first predetermined value; and inputting the quantized linear value into the activation function unit for processing, and outputting an activation value with a bit width of the third predetermined value.
Optionally, in the training method of a neural network model of the present invention, the calculating an activation value gradient of the upper network layer based on the third block floating point number and the first block floating point number includes: performing reverse derivation on the third block floating point number based on the activation function adopted by the activation function unit to obtain a linear value gradient with a bit width of the third predetermined value; quantizing the linear value gradient into a quantized linear value gradient with a bit width of the second predetermined value; and performing reverse derivation on the quantized linear value gradient based on the first block floating point number to obtain the activation value gradient of the upper network layer. The calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number includes: performing reverse derivation on the linear value gradient based on the second block floating point number to obtain the parameter gradient of the current network layer.
Optionally, in the training method of a neural network model of the present invention, the current network layer is a convolutional layer, and the convolutional layer includes a convolution unit, a pooling unit and an activation function unit; the calculating an activation value of the current network layer based on the first block floating point number and the second block floating point number includes: inputting the first block floating point number and the second block floating point number into the convolution unit for processing, and outputting a linear value with a bit width of the third predetermined value; quantizing the linear value into a quantized linear value with a bit width of the first predetermined value; inputting the quantized linear value into the pooling unit for processing, and outputting a pooling value; and inputting the pooling value into the activation function unit for processing, and outputting an activation value with a bit width of the third predetermined value.
Optionally, in the training method of a neural network model of the present invention, the calculating an activation value gradient of the upper network layer based on the third block floating point number and the first block floating point number includes: performing reverse derivation on the third block floating point number based on the activation function adopted by the activation function unit to obtain a pooling value gradient with a bit width of the third predetermined value; performing reverse derivation on the pooling value gradient based on the pooling template adopted by the pooling unit to obtain a linear value gradient; quantizing the linear value gradient into a quantized linear value gradient with a bit width of the second predetermined value; and performing reverse derivation on the quantized linear value gradient based on the first block floating point number to obtain the activation value gradient of the upper network layer. The calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number includes: performing reverse derivation on the linear value gradient based on the second block floating point number to obtain the parameter gradient of the current network layer.
Optionally, in the training method of a neural network model of the present invention, the first predetermined value is smaller than the second predetermined value, and the second predetermined value is smaller than the third predetermined value.
Optionally, in the training method of a neural network model of the present invention, the first predetermined value is 8, the second predetermined value is 16, and the third predetermined value is 32.
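Purely by way of illustration and not as part of the claimed method, the three steps above can be sketched for a single fully-connected layer as follows. The function names, the plain gradient-descent update rule, and the use of simulated (quantize-then-dequantize) block floating point arithmetic are assumptions made for readability; the delayed update of the parameters is omitted here and sketched separately below.

    import numpy as np

    def quantize_bfp(x, bits):
        """Quantize a tensor to a block floating point number (shared exponent + integer mantissas)."""
        max_abs = float(np.max(np.abs(x)))
        if max_abs == 0.0:
            return np.zeros_like(x), 0
        shared_exp = int(np.ceil(np.log2(max_abs))) - (bits - 1)
        mantissa = np.clip(np.round(x / 2.0 ** shared_exp),
                           -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return mantissa, shared_exp

    def dequantize_bfp(mantissa, shared_exp):
        return mantissa * 2.0 ** shared_exp

    def train_step_fc_layer(weights, act_in, grad_out, lr=0.01):
        # Forward propagation: 8-bit block floats for parameters and input activations.
        w_q = dequantize_bfp(*quantize_bfp(weights, 8))    # first block floating point number
        a_q = dequantize_bfp(*quantize_bfp(act_in, 8))     # second block floating point number
        act_out = a_q @ w_q                                # linear value (32-bit intermediate)

        # Back propagation: 16-bit block float for the activation value gradient.
        g_q = dequantize_bfp(*quantize_bfp(grad_out, 16))  # third block floating point number
        grad_act_in = g_q @ w_q.T                          # activation value gradient for the upper layer
        grad_w = a_q.T @ g_q                               # parameter gradient of the current layer

        # Parameter update: 32-bit update value (fourth block floating point number),
        # folded back into an 8-bit parameter (delayed update omitted in this sketch).
        update = dequantize_bfp(*quantize_bfp(-lr * grad_w, 32))
        new_w = dequantize_bfp(*quantize_bfp(weights + update, 8))
        return new_w, act_out, grad_act_in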
Optionally, in the training method of a neural network model of the present invention, the updating the parameters of the current network layer based on the first block floating point number and the fourth block floating point number and generating the first block floating point number corresponding to the updated parameters includes: acquiring a fifth block floating point number corresponding to the delayed update value after the previous iteration, wherein the delayed update value is the part of the parameter update values that has not been applied to the parameters, and the bit width of the fifth block floating point number is the second predetermined value; shifting the mantissa of the fourth block floating point number to the right by a first predetermined number of bits, and accumulating it into the fifth block floating point number; calculating the difference between the first block floating point number and the fifth block floating point number whose mantissa has been shifted to the right by a second predetermined number of bits, to obtain a first difference value; calculating the sum of the first block floating point number and the fifth block floating point number and then subtracting the first difference value, to serve as the fifth block floating point number corresponding to the delayed update value after the current iteration; and updating the first block floating point number to the first difference value.
Optionally, in the training method of the neural network model of the present invention, the first predetermined number of bits is the difference between the exponent of the fifth block floating point number and the exponent of the fourth block floating point number, and the second predetermined number of bits is the bit width of the fifth block floating point number minus 1.
Optionally, in the training method of a neural network model of the present invention, when the mantissa shift operation is performed, the truncated part is added to the retained part by rounding.
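The delayed-update procedure of the three preceding paragraphs can be illustrated with integer mantissas and bit shifts. The sketch below only illustrates the accumulate-then-fold idea: the assumed exponent relationship (the parameter exponent exceeds the delayed-value exponent by the bit width of the fifth block floating point number minus 1), the sign convention and the rounding details are assumptions, and the exact claimed bit-level procedure may differ.

    def lazy_update(param_m, delay_m, upd_m, upd_shift, delay_bits=16):
        """param_m: 8-bit parameter mantissa; delay_m: 16-bit delayed-update mantissa;
        upd_m: 32-bit update mantissa; upd_shift: exponent gap between the delayed update
        value and the update value (assumed >= 1)."""
        # Align the update value to the delayed value's exponent and accumulate,
        # rounding the truncated part into the retained part as described above.
        delay_m += (upd_m + (1 << (upd_shift - 1))) >> upd_shift

        # Fold the part of the accumulator that is representable at the parameter's
        # precision into the parameter (assuming E_param = E_delay + delay_bits - 1).
        carry = (delay_m + (1 << (delay_bits - 2))) >> (delay_bits - 1)
        param_m += carry                          # the parameter moves in whole-LSB steps
        delay_m -= carry << (delay_bits - 1)      # keep only the still-unapplied remainder
        return param_m, delay_m

    # Tiny updates accumulate in the delayed value until they are large enough
    # to change the 8-bit parameter; the total value is preserved across iterations.
    p, d = 100, 0
    for _ in range(40):
        p, d = lazy_update(p, d, upd_m=9000, upd_shift=4)
    print(p, d)   # the parameter has advanced by one step; d holds the signed remainder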
Optionally, in the training method of the neural network model of the present invention, the server includes an acceleration unit therein, and the training method is adapted to be executed by the acceleration unit.
Optionally, in the training method of a neural network model of the present invention, the acceleration unit is a neural network processing unit NPU or a graphics processing unit GPU.
Optionally, in the training method of a neural network model of the present invention, the server is deployed in a data center.
According to another aspect of the present invention, there is provided a training method of a neural network model, executed in a terminal device, the neural network model including a plurality of network layers, the method including: a forward propagation step: acquiring a first block floating point number corresponding to the parameters of the current network layer; quantizing the activation value output by the upper network layer into a second block floating point number, wherein the bit widths of the first block floating point number and the second block floating point number are a first predetermined value; calculating the activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer; a back propagation step: quantizing the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second predetermined value; calculating the activation value gradient of the upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer; calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number; and a parameter updating step: calculating a parameter update value of the current network layer based on the parameter gradient, and generating a fourth block floating point number corresponding to the parameter update value, wherein the bit width of the fourth block floating point number is a third predetermined value; and updating the parameters of the current network layer based on the first block floating point number and the fourth block floating point number, and generating the first block floating point number corresponding to the updated parameters.
Optionally, in the training method of the neural network model of the present invention, the terminal device is a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a speaker computing device, a computing device of a vehicle, a wearable apparatus including a computing device, or a home appliance including a computing device.
Optionally, in the training method of the neural network model of the present invention, an acceleration unit is included in the terminal device, and the training method is adapted to be executed by the acceleration unit.
Optionally, in the training method of a neural network model of the present invention, the acceleration unit is a neural network processing unit NPU or a graphics processing unit GPU.
According to another aspect of the present invention, there is provided a training apparatus for a neural network model, including: a forward propagation module adapted to: acquire a first block floating point number corresponding to the parameters of the current network layer; quantize the activation value output by the upper network layer into a second block floating point number, wherein the bit widths of the first block floating point number and the second block floating point number are a first predetermined value; and calculate the activation value of the current network layer based on the first block floating point number and the second block floating point number, and output the activation value to the next network layer; a back propagation module adapted to: quantize the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second predetermined value; calculate the activation value gradient of the upper network layer based on the third block floating point number and the first block floating point number, and output the activation value gradient to the upper network layer; and calculate the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number; and a parameter update module adapted to: calculate a parameter update value of the current network layer based on the parameter gradient, and generate a fourth block floating point number corresponding to the parameter update value, wherein the bit width of the fourth block floating point number is a third predetermined value; and update the parameters of the current network layer based on the first block floating point number and the fourth block floating point number, and generate the first block floating point number corresponding to the updated parameters.
According to yet another aspect of the invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the above-described method.
According to yet another aspect of the present invention, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the above-described method.
According to the training scheme of the neural network model of the present invention, based on the combination of block floating point quantization and delayed updating, the shared exponent of the block floating point number keeps the parameter value, the delayed update value and the parameter update value aligned to the same decimal point position; the part of the parameter update value that is effective relative to the delayed update value (the part of the parameter update value whose mantissa overlaps with that of the delayed update value) is added into the delayed update value, and when the delayed update value is large enough, the parameter value is updated according to the delayed update value. Therefore, low-precision parameter values can be used in the forward propagation, backward propagation and parameter updating processes, there is no need to additionally store high-precision floating point values corresponding to the low-precision parameter values, and there is no need to frequently convert between high-precision and low-precision data types, which saves storage resources while improving the training speed of the neural network model.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a block diagram of a data center utilized in accordance with one embodiment of the present invention;
FIG. 2 illustrates an internal block diagram of a server in a data center according to one embodiment of the invention;
FIG. 3 is a diagram illustrating the connection between a dispatch unit and an acceleration unit within a server, according to one embodiment of the present invention;
FIG. 4 is an internal block diagram of an accelerator core according to one embodiment of the invention;
FIG. 5 is a diagram illustrating a representation of floating point numbers of blocks in an embodiment of the invention;
FIG. 6 illustrates a flow diagram of a method 600 of training a neural network model in accordance with one embodiment of the present invention;
FIG. 7 shows a schematic diagram of a training process for a convolutional layer in an embodiment of the present invention;
FIG. 8 is a diagram illustrating a parameter update process according to an embodiment of the present invention;
FIG. 9 shows a schematic diagram of a training apparatus 900 for a neural network model according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
First, an implementation environment of the neural network model training method according to the embodiment of the present invention is described.
Data center
A data center is a globally collaborative network of devices used to transmit, accelerate, present, compute and store data over the Internet infrastructure. In future development, data centers will also become an asset for enterprise competition. With the popularization of data center applications, artificial intelligence and the like are increasingly applied in data centers. As an important technology of artificial intelligence, neural networks are widely applied to the big data analysis and operations of data centers.
In a conventional large data center, the network structure is generally as shown in fig. 1, i.e., a hierarchical inter-networking model (internetworking model). This model contains the following parts:
Server 140: each server 140 is a processing and storage entity of the data center; the processing and storage of large amounts of data in the data center are performed by the servers 140.
Access switch 130: the access switch 130 is the switch used to connect the servers 140 to the data center. One access switch 130 connects multiple servers 140. The access switches 130 are typically located at the top of the rack, so they are also called Top-of-Rack switches, and they physically connect the servers.
Aggregation switch 120: each aggregation switch 120 connects multiple access switches 130 while providing other services such as firewalls, intrusion detection, network analysis, and the like.
The core switch 110: core switches 110 provide high-speed forwarding of packets to and from the data center and connectivity for aggregation switches 120. The entire data center network is divided into an L3 layer routing network and an L2 layer routing network, and the core switch 110 provides a flexible L3 layer routing network for the entire data center network.
Typically, the aggregation switch 120 is the demarcation point between the L2 and L3 layer routing networks, with L2 below the aggregation switch 120 and L3 above it. Each group of aggregation switches manages a point of delivery (POD), and each POD is a separate VLAN network. Server migration within a POD does not require modifying the IP address or default gateway, because one POD corresponds to one L2 broadcast domain.
The Spanning Tree Protocol (STP) is typically used between the aggregation switches 120 and the access switches 130. STP makes only one aggregation switch 120 available for a VLAN network; the other aggregation switches 120 are used only in the event of a failure. That is, at the level of the aggregation switches 120 there is no horizontal scaling, since only one is working even if multiple aggregation switches 120 are added.
Server
Since the server 140 is the real processing device of the data center, fig. 2 shows a structural block diagram of the inside of the server 140. The server 140 includes a memory 210, a scheduling unit cluster 270 and an acceleration unit cluster 280 connected by a bus. The scheduling unit cluster 270 includes a plurality of scheduling units 220. The acceleration unit cluster 280 includes a plurality of acceleration units 230. In the embodiment of the present disclosure, an acceleration unit is a special processing unit designed to accelerate the operation processing speed of the neural network model, and may be embodied as a neural network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or the like specially designed for neural network operation processing. The scheduling unit is a processing unit that schedules the acceleration units and allocates to each acceleration unit the instruction sequences to be executed; it may take various forms such as a central processing unit (CPU), an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
In the traditional architecture design of a central processing unit, the control unit and the storage unit occupy a large part of the space in the architecture, while the space occupied by the computing unit is insufficient; the traditional architecture is therefore very effective for logic control but not efficient for large-scale parallel computing. For this reason, various special acceleration units have been developed to provide more efficient processing and higher operation speed for computations in different functions and different fields. The acceleration unit proposed by the invention is a processing unit dedicated to accelerating the operation processing speed of the neural network model. It is a processing unit that adopts a data-driven parallel computing architecture to process the large number of operations (such as convolution and pooling) of each neural network node. Because the data and intermediate results of these operations are closely related and frequently used throughout the calculation process, and the in-core memory capacity of a conventional central processing unit is small, the conventional central processing unit architecture has to access off-core storage frequently, which results in low processing efficiency. With an acceleration unit dedicated to accelerating the operation processing speed of the neural network model, because each of its cores has an on-chip memory with a storage capacity suitable for neural network computation, frequent access to memory outside the core is avoided, so that the processing efficiency and computing performance can be greatly improved.
The acceleration unit 230 accepts the scheduling of the scheduling unit 220. As shown in fig. 2, various neural network models are stored in the memory 210, including the nodes of the models and the weight and bias data of the nodes. When needed, these neural network models are deployed by a scheduling unit 220 to an acceleration unit 230 in fig. 2. That is, the scheduling unit 220 may send the addresses in the memory 210 of the parameters in the model (such as the weights and biases of the nodes) to the acceleration unit 230 in the form of instructions. When the acceleration unit 230 actually uses the neural network model for calculation, it addresses these parameters directly in the memory 210 according to their addresses and temporarily stores them in its on-chip memory. When the acceleration unit 230 actually uses the neural network model for calculation, the scheduling unit 220 also sends the input parameters of the model to the acceleration unit 230 in the form of instructions, and they are temporarily stored in the on-chip memory of the acceleration unit 230. The acceleration unit 230 can then perform inference calculations based on these input parameters and the parameters of the model (such as the weights and biases).
Internal structure of dispatching unit and accelerating unit
How the scheduling unit 220 schedules the acceleration unit 230 to operate will be described in detail below with reference to the internal structure diagrams of the scheduling unit 220 and the acceleration unit 230 of fig. 3.
As shown in fig. 3, the scheduling unit 220 includes a plurality of processor cores 222 and a cache 221 shared by the plurality of processor cores 222. Each processor core 222 includes an instruction fetch unit 223, an instruction decode unit 224, an instruction issue unit 225 and an instruction execution unit 226.
Instruction fetch unit 223 is configured to move an instruction to be executed from memory 210 into an instruction register (which may be one of register files 229 shown in fig. 3 for storing instructions) and receive or compute a next instruction fetch address according to an instruction fetch algorithm, which includes, for example: the address is incremented or decremented according to the instruction length.
After fetching the instruction, dispatch unit 220 enters an instruction decode stage where instruction decode unit 224 decodes the fetched instruction according to a predetermined instruction format to obtain operand fetch information needed by the fetched instruction in preparation for operation by instruction execution unit 226. The operand fetch information points, for example, to an immediate, register, or other software/hardware capable of providing source operands.
An instruction issue unit 225 is located between the instruction decode unit 224 and the instruction execution unit 226 for scheduling and control of instructions to efficiently allocate individual instructions to different instruction execution units 226, enabling parallel operation of multiple instructions.
After instruction issue unit 225 issues an instruction to instruction execution unit 226, instruction execution unit 226 begins executing the instruction. But if the instruction execution unit 226 determines that the instruction should be executed by an acceleration unit, it is forwarded to the corresponding acceleration unit for execution. For example, if the instruction is a neural network inference (inference) instruction, instruction execution unit 226 no longer executes the instruction, but rather sends the instruction over the bus to acceleration unit 230 for execution by acceleration unit 230.
The acceleration unit 230 internally includes a plurality of cores 236 (4 cores are shown in fig. 3, but those skilled in the art will understand that the acceleration unit 230 may include other numbers of cores 236), a command processor 237, a direct memory access mechanism 235, and a bus channel 231.
Bus channel 231 is a channel for instructions to pass from the bus to and from acceleration unit 230.
The direct memory access (DMA) mechanism 235 is a function provided by some computer bus architectures that enables data to be written from an attached device directly into the memory of the computer motherboard. Compared with having all data transmission between devices pass through the scheduling unit, this greatly improves the efficiency of data access. Because of this mechanism, the cores of the acceleration unit 230 can directly access the memory 210 and read the parameters in the neural network model (such as the weights and biases of each node), which greatly improves data access efficiency.
The command processor 237 distributes the instructions sent by the scheduling unit 220 to the acceleration unit 230 for execution by the cores 236. The instruction execution unit 226 sends to the acceleration unit 230 the instruction sequences that require execution by the acceleration unit 230. After entering from the bus channel 231, the instruction sequences to be executed are buffered in the command processor 237, and the command processor 237 selects a core 236 and allocates the instruction sequences to it for execution. In addition, the command processor 237 is also responsible for synchronizing operations between the cores 236.
Accelerating unit core
FIG. 4 is an internal block diagram of an accelerator unit core 236, according to one embodiment of the invention.
In one embodiment, as shown in fig. 4, the accelerator core 236 includes a tensor engine 310, a pooling engine 320, a memory copy engine 330, a sequencer 350, an instruction buffer 340, an on-chip memory 360, and a constant buffer 370.
The instruction sequence assigned by the command processor 237 to the accelerator unit core 236 first enters the instruction buffer 340 for buffering. The sequencer 350 then fetches instructions from the instruction buffer 340 in a first-in-first-out order, and assigns them to the tensor engine 310, pooling engine 320, or memory copy engine 330 for execution based on the nature of the instructions. The tensor engine 310 is responsible for handling related operations such as convolution and matrix multiplication in the neural network model. The pooling engine 320 is responsible for handling pooling operations in the neural network model. The memory copy engine 330 is responsible for copying operands stored by the on-chip memory 360 within the cores 236 to memory shared between the cores 236, or to the on-chip memory 360 within other cores 236. The sequencer 350 determines whether to assign an instruction to the tensor engine 310, the pooling engine 320, or the memory copy engine 330, depending on the nature of the operation, such as convolution, matrix multiplication, pooling, or operand copying, of the fetched instruction.
The on-chip memory 360 is an in-core memory that stores the model parameters of the neural network model, as well as the input parameters and various intermediate results when the neural network model is actually used. The constant buffer 370 is a buffer that stores constant parameters other than the weight parameters of the neural network model (for example, hyper-parameters of the neural network model). As described above, in the process in which the scheduling unit 220 pre-configures the neural network model in the acceleration unit 230, the scheduling unit 220 sends the addresses in the memory 210 of the parameters in the model to the acceleration unit 230 in the form of instructions. These parameters include the weights of the nodes and other parameters (such as hyper-parameters). For the weights, during actual neural network model operation the acceleration unit 230 fetches them from the corresponding locations in the memory 210 and places them in the on-chip memory 360. For the other parameters, during actual neural network model operation the acceleration unit 230 fetches them from the corresponding locations in the memory 210 and places them in the constant buffer 370. In addition, when an instruction to actually start inference is assigned by the command processor 237 to a core 236 for execution, the input parameters in the instruction (the inputs to the neural network model) are also stored in the on-chip memory 360. In addition, after the tensor engine 310 and the pooling engine 320 perform convolution or pooling operations, the various intermediate results obtained are also stored in the on-chip memory 360.
The training method of the neural network model according to the embodiment of the present invention may be executed in a server 140 of the data center. Depending on the scale of the neural network model and the scale of the training data set, the training method may be performed by one acceleration unit 230 in one server 140, or by a plurality of acceleration units 230 in one or more servers 140 performing distributed training of the neural network model. The acceleration unit 230 may retrieve the neural network model and the training data from the memory 210 and train the neural network model based on the retrieved training data.
Neural network model training comprises three parts: forward propagation, backward propagation and parameter updating. Existing research on low-precision quantization compression mainly focuses on the forward inference part of the neural network model, and there is less research on low-precision back propagation and parameter updating. Since back propagation is twice as computationally intensive as forward propagation, it is important to quantize the backward gradient propagation.
However, in the training process of a low-precision neural network model, since the parameter update value is too small compared with the low-precision parameter value, it cannot be directly accumulated into the low-precision parameter. To solve this problem, low-precision parameter values are generally used in forward propagation and backward propagation, and corresponding high-precision floating point values (generally single-precision floating point numbers) are additionally stored for the low-precision parameter values in order to update the parameters. This adds extra storage resources on the one hand, and requires frequent conversion between low-precision and high-precision data types on the other, which results in low training efficiency.
To this end, the embodiment of the present invention proposes a neural network model training scheme based on the combination of Block Floating Point (BFP) quantization and delayed update (lazy update). In this scheme, the shared exponent of the block floating point number keeps the parameter value, the delayed update value and the parameter update value aligned to the same decimal point position; the part of the parameter update value that is effective relative to the delayed update value (for example, the part whose mantissa overlaps with that of the delayed update value) is added into the delayed update value, and when the delayed update value is large enough, the parameter value is updated according to the delayed update value. The parameter update value is the value, calculated from the parameter gradient by an optimization algorithm (such as stochastic gradient descent or momentum gradient descent), that is to be applied to the parameter value in one iteration; the delayed update value is the part of the parameter update values that has not been applied to the parameter value from the first iteration to the current iteration. For example, if n iterations are performed in total, x_i is the parameter update value at the i-th iteration, and y_i is the part of x_i that is actually applied to the parameter value, then the delayed update value is Σ_{i=1}^{n} (x_i - y_i).
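As a simple numeric illustration (the values are invented for this example and are not from the embodiment): suppose the quantization step of the parameter value is 0.01 and each of n = 3 iterations produces a parameter update value x_i = 0.004. The first two updates are too small to change the parameter, so y_1 = y_2 = 0 and they accumulate in the delayed update value; at the third iteration the accumulated value reaches 0.012, the parameter is changed by y_3 = 0.01, and the delayed update value that remains is Σ_{i=1}^{3} (x_i - y_i) = 0.012 - 0.01 = 0.002.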
Therefore, low-precision parameter values can be adopted in the forward propagation process, the backward propagation process and the parameter updating process, high-precision floating point values corresponding to the low-precision parameter values do not need to be additionally stored, the high-precision and low-precision type data do not need to be frequently converted, and the training speed of the neural network model is improved while the storage resources are saved.
The principle of quantizing a plurality of floating point data into one block floating point number is described below; the delayed update process is described in detail in the embodiment of the training method of the model.
Block floating point is a data format in which a whole block of integer data shares the same exponent. That is, a plurality of floating point data are quantized by forming a shared exponent (integer data) with reference to the largest of the floating point data, and then generating a mantissa (integer data) for each floating point datum with reference to the shared exponent. In other words, a plurality of floating point data form one data block, which can be represented as a block floating point number.
FIG. 5 is a diagram illustrating a representation of floating point numbers of blocks according to an embodiment of the present invention. The floating point number of the block shown in fig. 5 is a floating point number of a block having a bit width of 8 bits, and has a mantissa portion of 8 bits (including a sign bit) and an exponent portion of 8 bits (including a sign bit), and corresponds to 3 floating point data.
It should be noted that, in the embodiment of the present invention, the bit width of a block floating point number refers to the bit width of its mantissa portion (i.e., how many bits the mantissa portion has); the exponent portions of block floating point numbers with different bit widths have the same bit width, for example, 8 bits. Thus, for an 8-bit block floating point number, both the exponent portion and the mantissa portion are 8 bits; a 16-bit block floating point number has a mantissa portion of 16 bits and an exponent portion of 8 bits; a 32-bit block floating point number has a mantissa portion of 32 bits and an exponent portion of 8 bits; in general, an n-bit block floating point number has a mantissa portion of n bits and an exponent portion of 8 bits.
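By way of a sketch only (the class and field names below are assumptions and not the patent's data layout), such a block can be represented as a shared integer exponent together with an integer mantissa per element:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class BlockFloat:
        """One data block sharing a single exponent: value[i] = mantissa[i] * 2 ** shared_exp."""
        mantissa: np.ndarray   # integer mantissas, e.g. int8 for an 8-bit block floating point number
        shared_exp: int        # shared exponent, itself stored as an 8-bit integer

        def to_float(self) -> np.ndarray:
            return self.mantissa.astype(np.float32) * (2.0 ** self.shared_exp)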
Floating point numbers are quantized to integers, usually by affine quantization, and the formula is as follows:
r=s(q-z)
where s (the scaling factor) and z (the zero point) are the quantization parameters, r is the floating point value, and q is the quantized value. Typically, s is represented as a 32-bit single-precision floating point number or a fixed-point number, and z corresponds to the true zero of the quantized value; both z and q are integer data.
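For reference, a minimal sketch of this affine scheme is given below (the helper names and the signed-range clipping are assumptions); the embodiment described next uses the symmetric power-of-two special case instead.

    import numpy as np

    def affine_quantize(r, s, z, bits=8):
        """q = round(r / s) + z, clipped to the signed integer range of the given bit width."""
        q = np.round(r / s).astype(np.int64) + z
        return np.clip(q, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)

    def affine_dequantize(q, s, z):
        return s * (q - z)   # r = s(q - z)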
To improve implementation efficiency on customized hardware, symmetric quantization is used: the scaling factor is constrained to a power of 2 and z is set to 0. With the quantization mode of the embodiment of the invention, the high-complexity multiplication between fixed point numbers and floating point numbers can be replaced by bit shifting.
Thus, the quantization formula can be expressed as:
r = 2^Es · q
where 2^Es is the scaling factor, Es is the shared exponent, and q is the mantissa. For a plurality of floating point numbers r whose value range is r ∈ [r_min, r_max], the shared exponent of the corresponding block floating point number can be expressed as:
Es = ceil(log2(max(|r_min|, |r_max|))) - (b - 1)
where r_min is the minimum of the plurality of floating point numbers, r_max is the maximum of the plurality of floating point numbers, ceil() rounds up, and b represents the bit width of the floating point number, e.g., for a 32-bit single-precision floating point number, b equals 32.
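Combining the two formulas, one possible way (an assumption for illustration, not the patent's reference implementation) to quantize a tensor into a block floating point number is:

    import numpy as np

    def quantize_to_block_float(r, bits=8):
        """Return (shared exponent, integer mantissas) for the tensor r, following
        Es = ceil(log2(max(|r_min|, |r_max|))) - (b - 1); here b is taken to be the
        mantissa bit width of the target block floating point number."""
        max_abs = float(np.max(np.abs(r)))
        if max_abs == 0.0:
            return 0, np.zeros(np.shape(r), dtype=np.int64)
        shared_exp = int(np.ceil(np.log2(max_abs))) - (bits - 1)
        mantissa = np.round(np.asarray(r) / 2.0 ** shared_exp).astype(np.int64)
        mantissa = np.clip(mantissa, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return shared_exp, mantissa

    # Each original value is recovered approximately as mantissa * 2.0 ** shared_exp.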
The multiplication of two block floating point numbers is:
i_{a·b} = i_a · i_b,  E_{a·b} = E_a + E_b
where i_a and i_b are the mantissa portions of the two block floating point numbers, E_a and E_b are their exponent portions, i_{a·b} is the mantissa portion of the product, and E_{a·b} is the exponent portion of the product. In the embodiment of the invention, the intermediate calculation result is usually temporarily stored as a 32-bit block floating point number, mainly to prevent overflow during the multiply-accumulate operations in the subsequent convolution, so a new shared exponent is obtained based on the above formula.
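A minimal sketch of this multiplication rule is given below (the function name is an assumption); widening the mantissas mirrors the 32-bit intermediate format mentioned above so that subsequent multiply-accumulate operations do not overflow.

    import numpy as np

    def block_float_multiply(mant_a, exp_a, mant_b, exp_b):
        """Elementwise product of two block floating point tensors:
        i_(a*b) = i_a * i_b and E_(a*b) = E_a + E_b."""
        prod_mant = mant_a.astype(np.int64) * mant_b.astype(np.int64)  # widened mantissas
        return prod_mant, exp_a + exp_b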
The above description provides one implementation for quantizing a plurality of floating point numbers into a block floating point number. In the training of the neural network model according to the embodiment of the present invention, floating point numbers may be quantized into block floating point numbers in this way, or other prior-art implementations for quantizing floating point numbers into block floating point numbers may be adopted.
FIG. 6 shows a flow diagram of a method 600 of training a neural network model according to one embodiment of the invention. The method 600 may be performed in the aforementioned server 140, for example by one acceleration unit 230 in one server 140, or by a plurality of acceleration units 230 in one or more servers 140 performing distributed training of the neural network model. The acceleration unit 230 may retrieve the neural network model and the training data from the memory 210 and train the neural network model based on the retrieved training data. The method 600 may be applied to various scenarios such as images, speech, video and machine translation; for example, in an image scenario, the corresponding neural network model may be an image classification model or a target detection model, and in a machine translation scenario, the corresponding neural network model may be a neural network machine translation model. As shown in fig. 6, the method 600 includes a forward propagation step S602, a backward propagation step S604, and a parameter update step S606.
In step S602, the sample characteristics of the training sample are input into the neural network model, the model output of the neural network model is obtained through processing of each network layer of the neural network model, and the loss function is calculated based on the model output and the sample label of the training sample.
The type of training data may be: image samples, speech samples, natural language processing samples. For example, when the neural network model to be trained is a neural network machine translation model, each piece of training data is a text pair, the text pair is a corresponding relationship between a first language text and a second language text, the first language text is a sample feature and is used as an input of the model, and the second language text is a sample label.
The first network layer of the neural network model takes the sample characteristics as input, calculates an activation value based on the input and the parameters (weight and bias) of the first network layer, then the second network layer takes the activation value output by the first network layer as input, calculates the activation value based on the input and the parameters of the second network layer, and so on, finally takes the activation value output by the output layer of the neural network model as the output of the model, and calculates a loss function (loss) based on the model output and the sample label of the training sample.
In the embodiment of the present invention, when the forward propagation process is executed, the model input, the activation values of each network layer and the parameters of each network layer are quantized into block floating point numbers; the specific quantization method may be the one described above, or another block floating point quantization method in the prior art. For convenience of description, the input of the first network layer is also referred to as an activation value, the block floating point number corresponding to the parameters of each network layer is referred to as the first block floating point number, and the block floating point number corresponding to the activation value is referred to as the second block floating point number. The first and second block floating point numbers have the same bit width, e.g., both are 8-bit block floating point numbers.
Thus, in the forward propagation, the processing performed by the current network layer is: acquiring the first block floating point number corresponding to the parameters of the current network layer; quantizing the activation value output by the upper network layer into the second block floating point number; and calculating the activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer.
The network layer usually calculates activation values in units of tensors, that is, the activation value is calculated based on an input activation value tensor, which is composed of all activation values of one network layer (or of the network elements in the network layer), and a parameter tensor, which is composed of all parameters of one network layer (or of the network elements in the network layer). Correspondingly, block floating point numbers are also formed at tensor granularity: the first block floating point number is the block floating point number obtained by quantizing the data block corresponding to the parameter tensor, and the second block floating point number is the block floating point number obtained by quantizing the data block corresponding to the activation value tensor.
The neural network model typically includes a plurality of convolutional layers including convolution units, pooling units, and activation function units, and one or more fully-connected layers including linear processing units and activation function units. Wherein the pooling unit has no parameters and parameter gradients.
Fig. 7 shows a schematic diagram of a training process of a convolutional layer in an embodiment of the present invention. As shown in fig. 7, the forward propagation process performed in the convolutional layer includes the following flows:
1) acquiring, by direct memory access (DMA), the first block floating point number corresponding to the parameters of the network layer and the activation value output by the previous network layer from the memory (DDR), quantizing the activation value into the second block floating point number, inputting the first block floating point number and the second block floating point number into the convolution unit for processing, and outputting the resulting linear value. In addition, the first block floating point number and the second block floating point number are temporarily stored in the memory so that they can be accessed during back propagation. In this embodiment, the first and second block floating point numbers are both 8-bit block floating point numbers (8-bit BFP), and the output linear value is a 32-bit block floating point number (32-bit BFP);
assuming that the current convolutional layer is the l-th layer, the processing performed in the convolutional unit is:
I_l = I_{l-1} * W_l
where * represents the convolution operation, I_{l-1} is the second block floating point number corresponding to the activation value output by the upper network layer, I_l is the linear value obtained by the convolution, and W_l is the first block floating point number corresponding to the parameters of the convolution unit.
2) Quantizing the output linear value into a quantized linear value (8-bit BFP) with a bit width of 8;
3) inputting the quantized linear value into a pooling unit for processing, and outputting a pooling value (8-bit BFP);
4) inputting the pooled value into an activation function unit for processing, outputting an activation value (32-bit BFP), and finishing the forward propagation of the current convolutional layer.
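The four forward steps above can be sketched schematically as follows. This is not the accelerator data path: quantization is simulated on ordinary floating point tensors, and the convolution, pooling and activation primitives are passed in as callables because their concrete implementations are not specified here.

    import numpy as np

    def bfp_round_trip(x, bits):
        """Simulated block floating point quantization (quantize, then dequantize)."""
        max_abs = float(np.max(np.abs(x)))
        if max_abs == 0.0:
            return x
        e = int(np.ceil(np.log2(max_abs))) - (bits - 1)
        q = np.clip(np.round(x / 2.0 ** e), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q * 2.0 ** e

    def conv_layer_forward(w_q, act_prev, conv2d, max_pool, relu):
        """w_q: the 8-bit first block floating point number (layer parameters); conv2d,
        max_pool and relu stand for whatever convolution, pooling and activation
        primitives the layer uses."""
        a_q = bfp_round_trip(act_prev, 8)      # 1) second block floating point number (8-bit BFP)
        linear = conv2d(a_q, w_q)              #    linear value (32-bit BFP intermediate)
        linear_q = bfp_round_trip(linear, 8)   # 2) quantized linear value (8-bit BFP)
        pooled = max_pool(linear_q)            # 3) pooling value (8-bit BFP)
        return relu(pooled)                    # 4) activation value (32-bit BFP)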
Similarly, the forward propagation processing flow executed in the full connection layer includes:
1) inputting the first block floating point number and the second block floating point number into a linear processing unit for processing, and outputting a processed linear value (32-bit BFP);
2) quantizing the output linear value into a quantized linear value (8-bit BFP) with a bit width of 8;
3) and inputting the quantized linear value into an activation function unit for processing, outputting an activation value (32-bit BFP), and finishing the forward propagation of the current full-link layer.
After the forward propagation is complete, the method 600 proceeds to step S604. In step S604, an activation value gradient of the output layer is calculated based on the loss function, and a parameter gradient of the current network layer and an activation value gradient of the next network layer are calculated layer by layer from the output layer.
In the embodiment of the present invention, when the back propagation process is executed, the activation value gradient is quantized into a block floating point number; the specific quantization method may be the one described above, or another block floating point quantization method in the prior art. For convenience of description, the block floating point number corresponding to the activation value gradient is referred to as the third block floating point number. The bit width of the third block floating point number is greater than the bit widths of the first and second block floating point numbers; it is, for example, a 16-bit block floating point number (16-bit BFP).
Thus, in the back propagation, the processing performed by the current network layer is: quantizing the activation value gradient output by the next network layer into a third block floating point number; calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer; and calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number.
With continued reference to fig. 7, the back propagation process performed in the convolutional layer of the neural network model includes the following flow:
1) quantizing the activation value gradient output by the next network layer into a third block floating point number (16-bit BFP);
the applicant found that to ensure that the training accuracy of the neural network model is not compromised, the backpropagated gradients need to be wider than the parameters and activation values, and therefore, in an embodiment of the present invention, the bit widths of the various gradients in the backpropagation are set to 16 bits.
2) Performing reverse derivation (activation derivation) on the third block floating point number based on the activation function adopted by the activation function unit to obtain the pooling value gradient (32-bit BFP);
The reverse derivation is a derivation performed according to the chain rule: given the activation value gradient, the activation function is differentiated and the result is multiplied by the activation value gradient, which yields the derivative with respect to the input of the activation function unit (i.e., the output of the pooling unit), referred to in the present invention as the pooling value gradient.
In this step and subsequent steps, regarding the principle and process of derivation by using the chain rule in the backward propagation, reference may be made to the related prior art, which is not described herein again.
3) Performing reverse derivation (pooling derivation) on the pooling value gradient based on a pooling template adopted by the pooling unit to obtain a linear value gradient (32-bit BFP);
4) Quantizing the linear value gradient into a quantized linear value gradient (16-bit BFP);
5) Performing reverse derivation (convolution derivation 1) on the quantized linear value gradient based on the first block floating point number (namely the parameters of the convolution unit) to obtain the activation value gradient (32-bit BFP) of the previous network layer;
6) Performing reverse derivation (convolution derivation 2) on the linear value gradient based on the second block floating point number (namely the activation value input to the convolution unit) to obtain the parameter gradient (32-bit BFP) of the current network layer.
Assuming that the current convolutional layer is the l-th layer, the back propagation process corresponding to the convolutional unit is:
g_{l-1} = g_l * rot180(W_l)
g_w = I_{l-1} * g_l
where * represents the convolution operation, W_l is the first block floating point number corresponding to the parameters of the convolution unit, g_l is the linear value gradient obtained by the pooling derivation, rot180() represents flipping the matrix by 180 degrees, g_{l-1} is the activation value gradient of the previous network layer, I_{l-1} is the second block floating point number corresponding to the activation value input to the current network layer, and g_w is the parameter gradient of the current network layer.
Since g_{l-1} is computed along the entire back propagation data stream, while the computation of g_w can proceed in parallel with that of g_{l-1} without blocking the propagation of the data stream to lower layers, the activation value gradient can be expressed with two different data precisions in the back propagation phase: for computing g_{l-1} passed to the next layer it is quantized into the 16-bit BFP format, while for computing g_w the 32-bit BFP format is maintained.
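A single-channel, unquantized SciPy sketch of the two convolution derivations is given below. It assumes the convolution unit computes a 'valid' cross-correlation in the forward pass (the usual deep-learning convention); under that assumption the two gradients take the forms shown, mirroring the structure of g_{l-1} = g_l * rot180(W_l) and g_w = I_{l-1} * g_l. Tensor sizes and random data are illustrative.

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    X  = np.random.randn(8, 8).astype(np.float32)       # I_{l-1}: activation input to the layer
    W  = np.random.randn(3, 3).astype(np.float32)       # W_l: parameters of the convolution unit
    Y  = correlate2d(X, W, mode='valid')                 # I_l: linear value of the forward pass

    gY = np.random.randn(*Y.shape).astype(np.float32)    # g_l: linear value gradient from above

    # convolution derivation 1: gradient passed to the previous layer
    # (full convolution with W, i.e. full correlation with the 180-degree-rotated kernel)
    gX = convolve2d(gY, W, mode='full')

    # convolution derivation 2: parameter gradient of the current layer
    # (valid correlation of the layer input with the output gradient)
    gW = correlate2d(X, gY, mode='valid')

    assert gX.shape == X.shape and gW.shape == W.shape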
Similarly, the back propagation processing flow executed in the full connection layer includes:
1) quantizing the activation value gradient output by the next network layer into a third block floating point number (16-bit BFP);
2) Performing reverse derivation on the third block floating point number based on the activation function adopted by the activation function unit to obtain the linear value gradient (32-bit BFP);
3) quantizing the linear value gradient to a quantized linear value gradient (16-bit BFP);
4) Performing reverse derivation on the quantized linear value gradient based on the first block floating point number (namely the parameters of the linear processing unit) to obtain the activation value gradient (32-bit BFP) of the previous network layer;
5) Performing reverse derivation on the linear value gradient based on the second block floating point number (namely the activation value input to the linear processing unit) to obtain the parameter gradient (32-bit BFP) of the current network layer.
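For the fully connected case the same chain of derivations reduces to matrix products. A minimal unquantized sketch, assuming the linear processing unit computes Y = X @ W and the activation function is a ReLU (both assumptions made for the example):

    import numpy as np

    X  = np.random.randn(4, 64).astype(np.float32)       # activations input to the linear unit
    W  = np.random.randn(64, 32).astype(np.float32)      # parameters of the linear unit
    Z  = X @ W                                            # linear value of the forward pass

    gA = np.random.randn(4, 32).astype(np.float32)        # activation value gradient from the next layer
    gZ = gA * (Z > 0)                                      # reverse derivation through the ReLU
    gX = gZ @ W.T                                          # activation value gradient for the previous layer
    gW = X.T @ gZ                                          # parameter gradient of the current layer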
After the back propagation is complete, the method 600 proceeds to step S606. In step S606, the parameters of the neural network model are updated. The parameters of each network layer may be updated sequentially or in parallel: the parameter update value of the current network layer is calculated based on the parameter gradient of the current network layer, a fourth block floating point number (32-bit BFP) corresponding to the parameter update value is generated, the parameters of the current network layer are then updated based on the first block floating point number and the fourth block floating point number, and the first block floating point number corresponding to the updated parameters is generated.
Since the parameter update value is too small relative to the low-precision parameter value, it cannot be accumulated directly into the low-precision parameter; therefore, in this step a delayed update (lazy update) method is used to update the parameters. The shared exponent of the block floating point numbers keeps the parameter value, the delayed update value, and the parameter update value aligned to the same decimal point position; the part of the parameter update value that is effective relative to the delayed update value (i.e., the part of its mantissa that overlaps the mantissa of the delayed update value) is added to the delayed update value, and when the delayed update value is large enough, the parameter value is updated based on the delayed update value. Here, the parameter update value is the amount by which the parameter value is to be updated in one iteration, calculated from the parameter gradient using an optimization algorithm (for example, stochastic gradient descent or momentum gradient descent); the delayed update value is the part of the parameter update values accumulated from the first iteration to the current iteration that has not yet been applied to the parameter value.
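As a rough illustration of this idea (not the bit-level procedure of Fig. 8, which follows), the sketch below keeps a higher-precision accumulator beside each low-precision parameter and only folds the part the parameter format can represent into the parameter; the LSB-based threshold and float32 types are assumptions made for the example.

    import numpy as np

    def lazy_update_concept(param_q, accum, update, lsb):
        # Accumulate updates that are too small for the low-precision parameter format,
        # and apply only the part that is a multiple of the parameter's LSB weight.
        accum = accum + update
        applied = np.round(accum / lsb) * lsb
        return param_q + applied, accum - applied

    w, acc = np.float32(1.0), np.float32(0.0)
    for _ in range(100):
        w, acc = lazy_update_concept(w, acc, np.float32(-0.004), lsb=np.float32(1.0 / 64))
    # each single update (-0.004) is smaller than the parameter LSB (1/64 ~ 0.0156),
    # yet after 100 steps w has decreased by roughly 0.4 instead of being lost to rounding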
Fig. 8 is a schematic diagram illustrating a parameter updating process in the embodiment of the present invention. As shown in fig. 8, the process of delaying updating the parameters includes:
1) Acquiring the fifth block floating point number (16-bit BFP) corresponding to the delayed update value after the previous iteration;
2) Shifting the mantissa of the fourth block floating point number to the right by a first predetermined number of bits and accumulating it into the fifth block floating point number, where the first predetermined number is the difference between the exponent of the fifth block floating point number and the exponent of the fourth block floating point number. When the mantissa is shifted, the truncated portion is added to the retained portion in a rounded manner.
3) Calculating the difference between the first block floating point number and the fifth block floating point number whose mantissa has been shifted to the right by a second predetermined number of bits, obtaining a first difference value. In one implementation, the second predetermined number is the bit width of the fifth block floating point number minus 1 (15 in the figure); likewise, when the mantissa is shifted, the truncated portion is added to the retained portion in a rounded manner, which is equivalent to updating the most significant bits of the delayed update value into the parameter with rounding. In another implementation, the second predetermined number may take other values such that the shifted delayed update value overlaps the mantissa of the parameter value, and the overlapping part of the delayed update value is updated into the parameter.
4) Calculating the sum of the first block floating point number and the fifth block floating point number and subtracting the first difference value, the result being the fifth block floating point number corresponding to the delayed update value after the current iteration;
5) Updating the first block floating point number to the first difference value.
The above-mentioned delayed update procedure is represented algorithmically as follows:
w_acc = w_acc + (w_ch >> (E_acc - E_ch)).round()
w_wu = w - (w_acc >> (b - 1)).round()
w_acc ← (w + w_acc) - w_wu
w ← w_wu
where w is the mantissa part of the parameter value, w_ch is the mantissa part of the parameter update value, E_ch is the exponent part of the parameter update value, w_acc is the mantissa part of the delayed update value, E_acc is the exponent part of the delayed update value, w_wu is an intermediate variable, b is the bit width of the delayed update value (16 in this embodiment), .round() denotes rounding, and ← denotes assigning the value on the right to the variable on the left.
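A literal transcription of these four rules into Python is sketched below. It operates on integer mantissas, assumes E_acc >= E_ch, and uses round-half-up for the rounded shift; these choices, and the omission of the exponent bookkeeping for w and w_acc, are assumptions of the sketch rather than details fixed by the embodiment.

    def round_rshift(x, n):
        # Arithmetic right shift by n bits, adding the truncated part back in a rounded manner.
        if n <= 0:
            return x
        return (x + (1 << (n - 1))) >> n

    def delayed_update(w, w_acc, w_ch, E_acc, E_ch, b=16):
        # Literal transcription of the update rules above, on integer mantissas.
        w_acc = w_acc + round_rshift(w_ch, E_acc - E_ch)   # accumulate the aligned update value
        w_wu  = w - round_rshift(w_acc, b - 1)             # overlap of the delayed update with the parameter
        w_acc = (w + w_acc) - w_wu                         # delayed update value after this iteration
        w     = w_wu                                       # updated parameter mantissa
        return w, w_acc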
After steps S602 to S606 have been executed, one iteration is completed. Steps S602 to S606 may be performed repeatedly, and the training is stopped when the neural network model converges or when the number of parameter iterations reaches a preset number, so as to obtain the trained neural network model.
In one embodiment, the neural network model to be trained is a neural network machine translation model. The final parameters are applied to the model to obtain the trained neural network machine translation model; the information to be translated is then translated by the trained model to obtain a translation result, and the translation result is output.
According to the training scheme for the neural network model of the embodiments of the present invention, block floating point quantization is combined with delayed updating: the shared exponent of the block floating point numbers keeps the parameter value, the delayed update value, and the parameter update value at the same decimal point position, the part of the parameter update value that is effective relative to the delayed update value is added to the delayed update value, and when the delayed update value is large enough, the parameter value is updated according to the delayed update value. Therefore, low-precision parameter values can be used in the forward propagation, back propagation, and parameter update processes, without additionally storing high-precision floating point values corresponding to the low-precision parameter values and without frequent conversion between high-precision and low-precision data types, which saves storage resources while improving the training speed of the neural network model.
Furthermore, in some application scenarios, the training method may also be performed by a terminal device in which an acceleration unit, such as a neural Network Processing Unit (NPU) or a Graphics Processing Unit (GPU), is deployed. The terminal device may be a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a speaker computing device, a computing device of a vehicle (e.g., an in-vehicle communication system, an in-vehicle entertainment system, an in-vehicle navigation system), a wearable apparatus including a computing device (e.g., a watch with a computing device, glasses with a computing device), or a household apparatus including a computing device (e.g., a speaker with a computing device, a television with a computing device, a washing machine with a computing device).
FIG. 9 shows a schematic diagram of a training apparatus 900 for a neural network model according to an embodiment of the present invention. Referring to fig. 9, the apparatus 900 includes a forward propagation module 910, a backward propagation module 920, and a parameter update module 930, wherein:
the forward propagation module 910 is adapted to:
acquiring a first block of floating point number corresponding to a parameter of a current network layer;
quantizing the activation value output by the upper network layer into a second floating point number, wherein the bit width of the first floating point number and the second floating point number is a first preset value;
calculating an activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer;
the back propagation module 920 is adapted to:
quantizing the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second preset value;
calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer;
calculating a parameter gradient of the current network layer based on the third block floating point number and the second block floating point number;
the parameter update module 930 is adapted to:
calculating a parameter updating value of the current network layer based on the parameter gradient, and generating a fourth floating point number corresponding to the parameter updating value, wherein the bit width of the fourth floating point number is a third predetermined value;
and updating the parameters of the current network layer based on the first floating point block and the fourth floating point block, and generating the first floating point block corresponding to the updated parameters.
The specific processing performed by the forward propagation module 910, the backward propagation module 920 and the parameter updating module 930 may refer to the method 600, which is not described herein again.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of the invention according to instructions in said program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose preferred embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense with respect to the scope of the invention, as defined in the appended claims.

Claims (20)

1. A method of training a neural network model, performed in a server, the neural network model comprising a plurality of network layers, the method comprising:
a forward propagation step:
acquiring a first block of floating point number corresponding to a parameter of a current network layer;
quantizing the activation value output by the upper network layer into a second floating point number, wherein the bit width of the first floating point number and the second floating point number is a first preset value;
calculating an activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer;
and (3) a back propagation step:
quantizing the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second preset value;
calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer;
calculating a parameter gradient of the current network layer based on the third block floating point number and the second block floating point number;
and (3) updating parameters:
calculating a parameter updating value of the current network layer based on the parameter gradient, and generating a fourth floating point number corresponding to the parameter updating value, wherein the bit width of the fourth floating point number is a third predetermined value;
and updating the parameters of the current network layer based on the first floating point block and the fourth floating point block, and generating the first floating point block corresponding to the updated parameters.
2. The method of claim 1, wherein the current network layer is a fully-connected layer comprising a linear processing unit and an activation function unit;
said calculating an activation value for a current network layer based on a first block floating point number and a second block floating point number comprising:
inputting the first block floating point number and the second block floating point number into a linear processing unit for processing, and outputting a linear value with a bit width of a third preset value;
quantizing the linear value into a quantized linear value having a bit width of a first predetermined value;
and inputting the quantized linear value into an activation function unit for processing, and outputting an activation value with the bit width being a third preset value.
3. The method of claim 2, wherein the calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number comprises:
performing reverse derivation on the floating point number of the third block based on the activation function adopted by the activation function unit to obtain a linear value gradient with the bit width being a third preset value;
quantizing the linear value gradient into a quantized linear value gradient with a bit width of a second preset value;
carrying out reverse derivation on the quantized linear value gradient based on the first block of floating point number to obtain an activation value gradient of an upper network layer;
the calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number includes:
and carrying out reverse derivation on the linear value gradient based on the second block floating point number to obtain the parameter gradient of the current network layer.
4. The method of claim 1, wherein the current network layer is a convolutional layer comprising a convolutional unit, a pooling unit, and an activation function unit;
said calculating an activation value for a current network layer based on a first block floating point number and a second block floating point number comprising:
inputting the first block floating point number and the second block floating point number into a convolution unit for processing, and outputting a linear value with a bit width of a third preset value;
quantizing the linear value into a quantized linear value having a bit width of a first predetermined value;
inputting the quantized linear value into a pooling unit for processing, and outputting a pooling value;
and inputting the pooled value into an activation function unit for processing, and outputting an activation value with the bit width being a third preset value.
5. The method of claim 4, wherein the calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number comprises:
performing reverse derivation on the floating point number of the third block based on the activation function adopted by the activation function unit to obtain a pooled value gradient with the bit width being a third preset value;
carrying out reverse derivation on the pooling value gradient based on a pooling template adopted by a pooling unit to obtain a linear value gradient;
quantizing the linear value gradient into a quantized linear value gradient with a bit width of a second preset value;
carrying out reverse derivation on the quantized linear value gradient based on the first block of floating point number to obtain an activation value gradient of an upper network layer;
the calculating the parameter gradient of the current network layer based on the third block floating point number and the second block floating point number includes:
and carrying out reverse derivation on the linear value gradient based on the second block floating point number to obtain the parameter gradient of the current network layer.
6. The method of any one of claims 1 to 5, wherein: the first predetermined value is less than the second predetermined value, which is less than the third predetermined value.
7. The method of claim 6, wherein: the first predetermined value is 8, the second predetermined value is 16, and the third predetermined value is 32.
8. The method of any one of claims 1 to 7, wherein the updating the parameter of the current network layer based on the first block floating point number and the fourth block floating point number to generate the first block floating point number corresponding to the updated parameter comprises:
acquiring a fifth floating point number corresponding to the delay updating value after the last iteration, wherein the delay updating value is a part which is not updated into the parameter in the parameter updating value, and the bit width of the fifth floating point number is a second preset value;
shifting the mantissa of the fourth floating point number by a first predetermined numerical bit to the right, and accumulating the mantissa of the fourth floating point number into a fifth floating point number;
calculating the difference between the first floating point number and a fifth floating point number of the mantissa which is moved to the right by a second preset numerical digit to obtain a first difference value;
calculating the sum of the first floating point number and the fifth floating point number, and then subtracting the first difference value to be used as the fifth floating point number corresponding to the delay updating value after the iteration;
updating the first block of floating point numbers to the first difference.
9. The method of claim 8, wherein the first predetermined numerical bit is the difference between the exponent of the fifth floating point number and the exponent of the fourth floating point number, and the second predetermined numerical digit is the bit width of the fifth floating point number minus 1.
10. The method of claim 8 or 9, wherein the truncated portion is added to the non-truncated portion in a rounded manner when the shift operation of the mantissa is performed.
11. Training method according to any one of claims 1 to 10, wherein an acceleration unit is comprised in the server, the training method being adapted to be executed by the acceleration unit.
12. The training method of claim 11, wherein the acceleration unit is a neural Network Processing Unit (NPU) or a Graphics Processing Unit (GPU).
13. The training method of claim 11 or 12, wherein the server is deployed in a data center.
14. A method of training a neural network model, performed in a terminal device, the neural network model comprising a plurality of network layers, the method comprising:
a forward propagation step:
acquiring a first block of floating point number corresponding to a parameter of a current network layer;
quantizing the activation value output by the upper network layer into a second floating point number, wherein the bit width of the first floating point number and the second floating point number is a first preset value;
calculating an activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer;
and (3) a back propagation step:
quantizing the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second preset value;
calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer;
calculating a parameter gradient of the current network layer based on the third block floating point number and the second block floating point number;
and (3) updating parameters:
calculating a parameter updating value of the current network layer based on the parameter gradient, and generating a fourth floating point number corresponding to the parameter updating value, wherein the bit width of the fourth floating point number is a third predetermined value;
and updating the parameters of the current network layer based on the first floating point block and the fourth floating point block, and generating the first floating point block corresponding to the updated parameters.
15. The training method of claim 14, wherein the terminal device is a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a speaker computing device, a computing device of a vehicle, a wearable apparatus including a computing device, or a household apparatus including a computing device.
16. Training method according to claim 14 or 15, wherein an acceleration unit is comprised in the terminal device, the training method being adapted to be performed by the acceleration unit.
17. The training method of claim 16, wherein the acceleration unit is a neural Network Processing Unit (NPU) or a Graphics Processing Unit (GPU).
18. An apparatus for training a neural network model, comprising:
a forward propagation module adapted to:
acquiring a first block of floating point number corresponding to a parameter of a current network layer;
quantizing the activation value output by the upper network layer into a second floating point number, wherein the bit width of the first floating point number and the second floating point number is a first preset value;
calculating an activation value of the current network layer based on the first block floating point number and the second block floating point number, and outputting the activation value to the next network layer;
a counter-propagation module adapted to:
quantizing the activation value gradient output by the next network layer into a third block floating point number, wherein the bit width of the third block floating point number is a second preset value;
calculating an activation value gradient of an upper network layer based on the third block floating point number and the first block floating point number, and outputting the activation value gradient to the upper network layer;
calculating a parameter gradient of the current network layer based on the third block floating point number and the second block floating point number;
a parameter update module adapted to:
calculating a parameter updating value of the current network layer based on the parameter gradient, and generating a fourth floating point number corresponding to the parameter updating value, wherein the bit width of the fourth floating point number is a third predetermined value;
and updating the parameters of the current network layer based on the first floating point block and the fourth floating point block, and generating the first floating point block corresponding to the updated parameters.
19. A computing device, comprising:
at least one processor; and
a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-17.
20. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-17.
CN202010353931.5A 2020-04-29 2020-04-29 Neural network model training method and device and computing equipment Pending CN113570053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010353931.5A CN113570053A (en) 2020-04-29 2020-04-29 Neural network model training method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010353931.5A CN113570053A (en) 2020-04-29 2020-04-29 Neural network model training method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN113570053A true CN113570053A (en) 2021-10-29

Family

ID=78158312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010353931.5A Pending CN113570053A (en) 2020-04-29 2020-04-29 Neural network model training method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN113570053A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968170A (en) * 2022-06-24 2022-08-30 北京百度网讯科技有限公司 Method for generating fixed sum of floating point number, related device and computer program product
CN114968170B (en) * 2022-06-24 2024-05-14 北京百度网讯科技有限公司 Floating point number determination and generation method, related device and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240222

Address after: 5th Floor, No. 2, Lane 55, Chuanhe Road, No. 366 Shangke Road, Pudong New Area Free Trade Pilot Zone, Shanghai

Applicant after: Pingtouge (Shanghai) semiconductor technology Co.,Ltd.

Country or region after: China

Address before: Fourth floor, Capital Building, P.O. Box 847, Grand Cayman

Applicant before: ALIBABA GROUP HOLDING Ltd.

Country or region before: Cayman Islands

TA01 Transfer of patent application right