CN111860814B - Apparatus and method for performing batch normalization operations - Google Patents

Apparatus and method for performing batch normalization operations

Info

Publication number
CN111860814B
Authority
CN
China
Prior art keywords
neuron
operation module
input neuron
vector
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010617696.8A
Other languages
Chinese (zh)
Other versions
CN111860814A (en)
Inventor
刘少礼
于涌
陈云霁
陈天石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202010617696.8A
Publication of CN111860814A
Application granted
Publication of CN111860814B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an apparatus for performing batch normalization operations, comprising an operation module. The apparatus can implement the batch normalization operation in a multilayer artificial neural network. In the forward process, the input minus the mean is divided by the square root of the sum of the variance and a small constant eps, then multiplied by the learning parameter alpha and added to the learning parameter beta to obtain the output of the layer. In the reverse training process, the output gradient vector of the layer is obtained by subtracting from the input gradient vector both its mean and the product of the forward output with the mean of the product of the gradient vector and the forward output, and then multiplying the difference by alpha divided by the square root of the sum of the variance and the small constant used in the forward pass. The apparatus effectively improves support for the forward and reverse operations of batch normalization in artificial neural networks.

Description

Apparatus and method for performing batch normalization operations
Technical Field
The present disclosure relates to artificial neural network technology, and in particular, to an apparatus and method for performing batch normalization forward and reverse operations in an artificial neural network.
Background
Multilayer artificial neural networks are widely used in pattern recognition, image processing, function approximation, optimization computation, and other fields. In recent years they have received increasing attention from academia and industry owing to their high recognition accuracy and good parallelizability, and the batch normalization operation, which accelerates neural network training and improves recognition accuracy, is consequently applied more and more widely in multilayer neural networks.
One known method of supporting batch normalization operations is to use a general-purpose processor, which executes general instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this method is that the operation performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multilayer artificial neural network operations. When multiple general-purpose processors execute in parallel, inter-processor communication becomes a performance bottleneck. In addition, the general-purpose processor must decode the forward operation of the multilayer artificial neural network into a long sequence of arithmetic and memory-access instructions, and the processor front-end decoding incurs a large power overhead.
Another known method of supporting batch normalization is to use a graphics processing unit (GPU), which executes general-purpose SIMD instructions through a general-purpose register file and general-purpose stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, without special support for multilayer artificial neural network batch normalization operations, a large amount of front-end decoding work is still required to perform these operations, introducing substantial overhead. In addition, the GPU has only a small on-chip cache, so the model data of multilayer artificial neural network batch normalization must be repeatedly transferred from off-chip, and off-chip bandwidth becomes the main performance bottleneck. Moreover, batch normalization involves many normalization operations, such as summation, for which the parallel architecture of the GPU is not well suited.
Disclosure of Invention
One aspect of the present disclosure provides an apparatus for performing an artificial neural network batch normalization operation, comprising an instruction storage unit, a controller unit, a data access unit, and an operation module, wherein: the instruction storage unit reads instructions through the data access unit and caches the read instructions; the controller unit reads instructions from the instruction storage unit and decodes them into microinstructions that control the operation module; the data access unit writes data from the external address space into the corresponding data cache units of the operation module, or reads data from those cache units back to the external address space; and the operation module performs the actual computation on the data.
Another aspect of the present disclosure provides a method of performing the batch normalization forward operation using the above apparatus. During use of the network, let x be each input neuron element and y the corresponding output element. The learning parameters alpha and beta, the small constant eps, the mean E[x], and the variance var(x) are constants obtained during training, and the apparatus completes the batch normalization forward computation y = f(x) = alpha * (x - E[x]) / sqrt(var(x) + eps) + beta in parallel to obtain the output neurons. During training, by contrast, the forward operation must compute the mean E[x] and variance var(x) dynamically: the operation module of the apparatus completes the accumulation (normalization) operations of the mean and variance computation, so that the mean and variance of each iteration of the training process are obtained.
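For illustration, the forward computation described above can be sketched in a few lines of numpy. This is a minimal sketch, not the disclosed hardware: the function name batch_norm_forward and its keyword arguments are hypothetical, and the per-element parallelism of the operation module is modeled here by vectorized array operations.

```python
import numpy as np

def batch_norm_forward(x, alpha, beta, eps=1e-5, mean=None, var=None):
    """Forward batch normalization y = alpha*(x - E[x])/sqrt(var(x) + eps) + beta.

    If mean/var are given (use-time / inference), they are applied as constants;
    otherwise (training) they are computed from the current mini-batch.
    """
    if mean is None:                 # training: compute statistics dynamically
        mean = x.mean(axis=0)
    if var is None:
        var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized intermediate value
    return alpha * x_hat + beta

# usage: a mini-batch of 4 samples with 3 features each
x = np.random.randn(4, 3).astype(np.float32)
y = batch_norm_forward(x, alpha=np.ones(3), beta=np.zeros(3))
```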
Another aspect of the present disclosure provides a method of performing the batch normalization reverse operation using the above apparatus. Assuming that the input gradient of one element is dl/dY and the corresponding forward output is Y, the gradient propagated backward through batch normalization is dl/dx = (alpha / sqrt(var(x) + eps)) * (dl/dY - mean(dl/dY) - mean(dl/dY * Y) * Y). The gradient of the learning parameter alpha is dl/dalpha = Σ(dl/dY * Y), and the gradient of the learning parameter beta is dl/dbeta = Σ(dl/dY). The reverse process of batch normalization completes the normalization operations on the neurons, such as taking the mean and the variance, in parallel through the operation unit.
The present disclosure may be applied in, but is not limited to, the following scenarios: electronic products such as data processing apparatuses, robots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage devices, and wearable devices; vehicles such as airplanes, ships, and automobiles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and medical equipment including nuclear magnetic resonance apparatuses, B-mode ultrasound scanners, and electrocardiographs. By employing an apparatus and instruction set dedicated to batch normalization operations, the present disclosure solves the problems of insufficient CPU and GPU operation performance and high front-end decoding overhead, and effectively improves support for the forward and reverse operations of batch normalization.
By adopting a dedicated on-chip cache for the batch normalization operation, the method and apparatus fully exploit the reusability of the input neurons and the intermediate data, avoid repeatedly reading the data from memory, reduce the memory access bandwidth, and prevent the memory bandwidth from becoming a bottleneck for the forward operation performance of the multilayer artificial neural network.
By employing a dedicated operation unit for the batch normalization operation, the present disclosure better balances the relationship between parallel and serial execution. It avoids the weaknesses of the CPU architecture, which executes only serially and slows down as the data scale grows, and of the GPU architecture, which executes only in parallel and handles normalization operations poorly. The data storage units and the operation unit cooperate, so that the serial and parallel parts of the normalization operation are well balanced.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an example block diagram of the overall structure of an apparatus for performing batch normalization operations according to an embodiment of the disclosure.
Fig. 2 shows an example block diagram of the structure of the operation module in an apparatus for performing batch normalization operations according to an embodiment of the present disclosure.
Fig. 3 shows an example block diagram of the batch normalization operation process according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of the batch normalization operation according to an embodiment of the present disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
The batch normalization operation includes two parts, forward and reverse. Both the forward and reverse parts are needed during artificial neural network training, whereas only the forward process is performed during artificial neural network use. Parameters obtained during training, such as the mean and variance in the batch normalization operation, are reused during application of the artificial neural network and need not be recomputed.
Fig. 1 shows an overall block diagram of an apparatus for performing artificial neural network batch normalization operations according to the present disclosure. As shown in Fig. 1, the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, and an operation module 4, all of which may be implemented by hardware circuits (including, but not limited to, FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 reads in instructions through the data access unit 3 and caches the read-in instructions. The instruction storage unit may be implemented by various different memory devices (SRAM, eDRAM, DRAM, memristors, 3D-DRAM or nonvolatile memory, etc.).
The controller unit 2 reads instructions from the instruction storage unit 1, decodes the instructions into micro-instructions that control the behavior of other units or modules, such as the data access unit 3, the arithmetic module 4, etc., and then distributes the respective micro-instructions to the respective units or modules.
The data access unit 3 can access the external address space, and directly read and write data to each cache unit in the device to complete the loading and storage of the data.
Fig. 2 shows an example block diagram of the structure of the operation module 4 in the apparatus for performing the artificial neural network batch normalization operation according to an embodiment of the present disclosure. As shown in Fig. 2, the operation module 4 includes an operation unit 41, a data dependency determination unit 42, a neuron cache unit 43, and an intermediate value cache unit 44.
The operation unit 41 receives microinstructions issued by the controller unit 2 and performs arithmetic and logic operations.
The data dependency determination unit 42 is responsible for read and write operations on the neuron cache unit during the computation process. Before performing a read or write, the data dependency determination unit 42 first ensures that there is no read-write consistency conflict among the data used by the instructions. For example, all microinstructions directed to the data dependency unit 42 are stored in an instruction queue inside the unit; if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
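As an illustration of the read-write consistency check just described, the following minimal sketch models the instruction queue and the conflict test. It assumes address ranges are half-open intervals; the names PendingWrite, ranges_overlap, and can_issue_read are hypothetical and chosen only for readability.

```python
from dataclasses import dataclass

@dataclass
class PendingWrite:
    start: int   # first address written
    end: int     # one past the last address written

def ranges_overlap(a_start, a_end, b_start, b_end):
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end

def can_issue_read(read_start, read_end, queue):
    """A read may issue only if it conflicts with no earlier pending write."""
    return not any(
        ranges_overlap(read_start, read_end, w.start, w.end) for w in queue
    )

# usage: a read of [100, 132) must wait behind a pending write to [96, 128)
queue = [PendingWrite(96, 128)]
assert not can_issue_read(100, 132, queue)
```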
The neuron cache unit 43 caches the input neuron vector data and the output neuron value data of the operation module 4.
The intermediate value cache unit 44 caches intermediate data required by the operation module 4 during computation, such as partial sums and partial sums of squares produced during the operation. For each operation module 4, the intermediate value cache unit 44 stores the batch normalization intermediate data of the operation process. Consider, for example, the forward batch normalization operation during use of the artificial neural network. Let x be each input neuron datum and y the corresponding output neuron datum. The learning parameters alpha and beta are continuously updated during reverse training and are afterwards used in the formula that computes the output neuron data y. The small constant eps represents a tiny quantity, typically on the order of 10^-5, and may be set to 0 in practical use. The mean E[x] is the mean of the input data computed with the batch size as the total count, and var(x) is the corresponding variance. In artificial neural network algorithms, input neuron data typically has four dimensions: the batch size (also called number), the number of input channels (channel), the input height, and the input width; these four dimensions determine the total number of input data x, and E[x] and var(x) are the mean and variance computed over the batch dimension for each position in the remaining three dimensions. The operation unit 41 can complete the computation y = f(x) = alpha * (x - E[x]) / sqrt(var(x) + eps) + beta in parallel, where sqrt denotes the square-root operation; the constant data of this process may be stored in the intermediate value cache unit, and the result is returned through the data access unit to obtain the output neurons. In addition, because the apparatus stores data along the channel, height, and width dimensions, it can perform the summation, mean, and variance operations sequentially after reading the input neuron data x.
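A minimal numpy sketch of the data layout and statistics described above, assuming input of shape (batch, channel, height, width) and statistics taken over the batch dimension for each position in the remaining three dimensions; all names here are illustrative assumptions, not the disclosed hardware.

```python
import numpy as np

# a hypothetical input: batch=8, channel=3, height=4, width=4
x = np.random.randn(8, 3, 4, 4).astype(np.float32)

# E[x] and var(x) over the batch dimension, one value per (channel, height, width)
mean = x.mean(axis=0)            # shape (3, 4, 4)
var = x.var(axis=0)              # shape (3, 4, 4)

eps = 1e-5                       # the small constant eps from the formula
alpha = np.ones_like(mean)       # learning parameters, updated during training
beta = np.zeros_like(mean)

# the forward formula, broadcast over the batch dimension
y = alpha * (x - mean) / np.sqrt(var + eps) + beta
```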
For the forward operation of batch normalization, the apparatus may either use the already-computed mean E(x) and variance var(x), storing and operating on these parameters as constants, or compute the mean and variance from the input data within the forward process itself. In the latter case the operation unit computes the mean and variance data each time: during every training iteration, the input neurons pass through the operation unit to compute the mean and variance, and this portion of data is placed in the intermediate value cache unit 44 for the subsequent f(x) computation of that iteration.
The present disclosure also provides an instruction set for performing artificial neural network batch normalization operations on the foregoing apparatus. The instruction set includes the CONFIG, COMPUTE, IO, NOP, JUMP, and MOVE instructions, wherein:
the CONFIG instruction configures the various constants required by the current layer's computation before the batch normalization computation begins;
the COMPUTE instruction completes the arithmetic and logic computation of the batch normalization process;
the IO instruction reads the input data required by a computation from the external address space, and stores the data back to the external space after the computation completes;
the NOP instruction clears the microinstructions currently held in all internal microinstruction storage queues of the device, ensuring that every instruction before the NOP instruction has finished executing; the NOP instruction itself contains no operation;
the JUMP instruction controls the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow;
the MOVE instruction moves data at one address in the device's internal address space to another address in that space; this process is independent of the operation unit and occupies no operation-unit resources during execution.
Fig. 3 illustrates an example block diagram of the artificial neural network batch normalization forward and reverse operations according to an embodiment of this disclosure. For the formula out = (in - middle) / middle, in is the input neuron data and out the output neuron data; middle denotes intermediate values in the operation process, namely the intermediate results of the normalization operation such as the mean and the variance. The operation module 4 computes a portion of the intermediate values [middle1, ..., middleN] of the normalization process in parallel and stores them in the intermediate value cache unit 44. The operation module 4 then uses the intermediate value middle to compute each output neuron datum out from each input neuron datum in, in parallel, and obtains the final output vector.
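The intermediate values can be produced from the cached partial sums mentioned earlier (the partial sum and the partial sum of squares), which a single pass over the batch can accumulate before the mean and variance are derived. The following minimal sketch illustrates this under the assumption that var(x) = E[x^2] - E[x]^2 is an acceptable formulation; the function name and single-pass structure are illustrative, not the disclosed hardware schedule.

```python
import numpy as np

def normalization_intermediates(x):
    """Accumulate partial sums in one pass, then derive mean and variance.

    Mirrors the role of the intermediate value cache unit: the partial sum
    and partial sum of squares are the cached intermediates 'middle'.
    """
    n = x.shape[0]
    partial_sum = np.zeros(x.shape[1:], dtype=np.float64)
    partial_sq = np.zeros(x.shape[1:], dtype=np.float64)
    for sample in x:                      # one accumulation step per batch element
        partial_sum += sample
        partial_sq += sample * sample
    mean = partial_sum / n
    var = partial_sq / n - mean * mean    # E[x^2] - E[x]^2
    return mean, var

x = np.random.randn(8, 3, 4, 4)
mean, var = normalization_intermediates(x)
assert np.allclose(mean, x.mean(axis=0)) and np.allclose(var, x.var(axis=0))
```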
Fig. 4 illustrates a flowchart of the batch normalization forward operation in the training process according to one embodiment. This flowchart describes the process of implementing the forward batch normalization operation shown in Fig. 3 using the apparatus and instruction set of the present disclosure.
In step S1, an IO instruction is stored in advance at the head address of the instruction storage unit 1.
In step S2, the operation starts: the controller unit 2 reads the IO instruction from the head address of the instruction storage unit 1, and according to the decoded microinstruction, the data access unit 3 reads all the corresponding batch normalization forward operation instructions from the external address space and caches them in the instruction storage unit 1.
In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit, and according to the decoded microinstruction, the data access unit 3 reads all the data required by the operation module 4 (including, for example, the input neuron vector, the batch size, the learning parameters alpha and beta, the small constant eps, the mean, and the variance) from the external address space into the neuron cache unit 43 of the operation module 4.
In step S4, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit and, according to the decoded microinstruction, configures the batch normalization operation of the device, for example whether the current forward operation uses an already-computed mean and variance or computes them from the input.
In step S5, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded microinstruction, the operation module 4 reads the input neuron vector from the neuron cache unit, computes the mean and variance of the input neurons, and stores them in the intermediate value cache unit.
In step S6, according to the microinstruction decoded from the COMPUTE instruction, the operation module 4 subtracts the mean from the data in the input neuron cache unit, divides the difference by the square root of the sum of the variance and the small constant eps, and stores the result back in the intermediate value cache unit.
In step S7, according to the microinstruction decoded from the COMPUTE instruction, the operation module 4 reads the learning parameter alpha from the neuron cache unit 43, multiplies it by the intermediate value, adds the learning parameter beta to the result, and returns the result to the neuron cache.
In step S8, the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded microinstruction, the data access unit 3 stores the output neuron vector in the neuron cache unit 43 to the designated address in the external address space, and the operation ends.
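For illustration, the training-time forward flow of steps S1-S8 can be written as an instruction stream for the instruction set above. The following minimal sketch models the stream as a Python list; the mnemonic arguments and operand names (neuron_cache, intermediate_cache, ext_space) are invented for readability and are not a literal encoding of the disclosed instructions.

```python
# A hypothetical instruction stream mirroring steps S1-S8; operand names
# are illustrative assumptions only.
program = [
    ("IO",      "load_instructions: ext_space -> instruction_storage"),   # S1/S2
    ("IO",      "load_data: ext_space -> neuron_cache"),                  # S3: x, batch size, alpha, beta, eps, mean, var
    ("CONFIG",  "stats_mode = dynamic"),                                  # S4: compute mean/var this iteration
    ("COMPUTE", "mean_var: neuron_cache -> intermediate_cache"),          # S5
    ("COMPUTE", "normalize: (x - mean) / sqrt(var + eps)"),               # S6
    ("COMPUTE", "scale_shift: alpha * x_hat + beta -> neuron_cache"),     # S7
    ("IO",      "store_output: neuron_cache -> ext_space"),               # S8
]
for opcode, operands in program:
    print(f"{opcode:8s} {operands}")
```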
The forward process of the batch normalization operation during use differs from the forward process during training in that the configuration in step S4 uses a constant mean and variance rather than computing them dynamically each time; that is, step S5 is eliminated. Everything else is the same as in Fig. 4.
The reverse procedure of the batch normalization operation is similar to the forward procedure described above; the difference lies in the data operated on. Assuming that the input gradient of one element is dl/dY, the gradient propagated backward is dl/dx, the forward process output is Y, and the remaining parameters have the same meaning as in the forward process, the gradient propagated backward through batch normalization is dl/dx = (alpha / sqrt(var(x) + eps)) * (dl/dY - mean(dl/dY) - mean(dl/dY * Y) * Y), where mean denotes the averaging operation. The gradient of the learning parameter alpha is dl/dalpha = Σ(dl/dY * Y), and the gradient of the learning parameter beta is dl/dbeta = Σ(dl/dY); the values of the learning parameters are updated by these two gradients. The reverse process of batch normalization normalizes the gradient data through the operation unit, for example taking the mean and the variance; the operation unit then completes the remaining operations of the formula in parallel.
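For illustration, a minimal numpy sketch of the reverse formulas above, assuming Y denotes the normalized forward output (x - E[x]) / sqrt(var(x) + eps); the function name batch_norm_backward is hypothetical.

```python
import numpy as np

def batch_norm_backward(dy, y_hat, alpha, var, eps=1e-5):
    """Backward batch normalization as stated in the text.

    dy    : input gradient dl/dY
    y_hat : normalized forward output Y
    """
    # dl/dx = (alpha / sqrt(var + eps)) * (dl/dY - mean(dl/dY) - mean(dl/dY * Y) * Y)
    dx = (alpha / np.sqrt(var + eps)) * (
        dy - dy.mean(axis=0) - (dy * y_hat).mean(axis=0) * y_hat
    )
    dalpha = (dy * y_hat).sum(axis=0)   # dl/dalpha = sum(dl/dY * Y)
    dbeta = dy.sum(axis=0)              # dl/dbeta  = sum(dl/dY)
    return dx, dalpha, dbeta

# usage with a batch of 8 samples and 3 features
x = np.random.randn(8, 3)
mean, var = x.mean(axis=0), x.var(axis=0)
y_hat = (x - mean) / np.sqrt(var + 1e-5)
dx, dalpha, dbeta = batch_norm_backward(np.random.randn(8, 3), y_hat, np.ones(3), var)
```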
By adopting the apparatus and instruction set for performing batch normalization operations, the problems of insufficient CPU and GPU operation performance and high front-end decoding overhead are solved, and support for the forward and reverse operations of batch normalization is effectively improved.
By adopting a dedicated on-chip cache for the batch normalization operation, the reusability of the input neurons and the intermediate data is fully exploited, repeated reading of the data from memory is avoided, the memory access bandwidth is reduced, and the memory bandwidth is prevented from becoming a bottleneck for the forward operation performance of the multilayer artificial neural network.
By employing a dedicated operation unit for the batch normalization operation, the relationship between parallel and serial execution is well balanced. This avoids the weaknesses of the CPU architecture, which executes only serially and slows down as the data scale grows, and of the GPU architecture, which executes only in parallel and handles normalization operations poorly. The data storage units and the operation unit cooperate, so that the serial and parallel parts of the normalization operation are well balanced.
While the foregoing describes embodiments of the present disclosure, the description is merely exemplary and is not intended to limit the scope of the disclosure; modifications, equivalents, and improvements made without departing from the spirit and principles of the present disclosure fall within its scope.

Claims (12)

1. A method for performing batch normalization operations in a neural network training process, the method being applied in an operation device comprising a controller unit and an operation module; the operation module comprises a neuron cache unit and an intermediate value cache unit;
the controller unit reads the instruction and decodes the instruction into a micro instruction;
the controller unit sends the micro instruction to the operation module;
the operation module executes batch normalization forward operation and backward operation according to the micro instruction;
wherein the operation module performs a forward operation of batch normalization operation according to the microinstruction, which includes:
the operation module reads an input neuron vector from the neuron cache unit according to the microinstruction, and calculates the mean value of the input neuron vector and the variance of the input neuron vector;
the operation module stores the mean value of the input neuron vector and the variance of the input neuron vector into the intermediate value buffer unit;
the operation module reads the learning parameter and the input neuron vector from the neuron caching unit, reads the mean value of the input neuron vector and the variance of the input neuron vector from the intermediate value caching unit, and calculates to obtain an output neuron according to the learning parameter, the input neuron vector, the mean value of the input neuron vector and the variance of the input neuron vector;
the operation module stores the output neuron into the neuron cache unit.
2. The method of claim 1, wherein the operation module reads a learning parameter and an input neuron vector from the neuron cache unit, reads a mean value of the input neuron vector and a variance of the input neuron vector from an intermediate value cache unit, and calculates an output neuron based on the learning parameter, the input neuron vector, the mean value of the input neuron vector, and the variance of the input neuron vector, comprising:
the operation module reads an input neuron vector from the neuron caching unit, reads the mean value of the input neuron vector and the variance of the input neuron vector from the intermediate value caching unit, and calculates in parallel to obtain an intermediate value of the normalization process according to the input neuron vector, the mean value of the input neuron vector and the variance of the input neuron vector;
the operation module stores the intermediate value of the normalization process into an intermediate value buffer unit;
the operation module reads the learning parameter from the neuron cache unit, reads the intermediate value of the normalization process from the intermediate value cache unit, and obtains the output neuron according to the learning parameter and the intermediate value.
3. The method of claim 1, wherein the operation module calculating an output neuron according to the learning parameter, the input neuron vector, the mean of the input neuron vector, and the variance of the input neuron vector comprises:
the operation module completing the calculation y = alpha * (x - E[x]) / sqrt(var(x) + eps) + beta to obtain the output neuron;
wherein x is the input neuron vector, y is the output neuron, alpha and beta are learning parameters that are continuously updated during training and are used to calculate new output neuron data, eps is a very small constant, E[x] is the mean, and var(x) is the variance.
4. The method of claim 1, the operation module further comprising a data dependency determination unit;
the operation module reads an input neuron vector from the neuron cache unit according to the microinstruction, and comprises:
the data dependency relationship judging unit judges whether read-write consistency conflict exists among the micro instructions;
and if the read-write consistency conflict does not exist, the operation module reads the input neuron vector from the neuron cache unit according to the micro instruction.
5. The method of claim 1, the operation module further comprising a data dependency determination unit;
the operation module stores the output neuron into the neuron cache unit, and the operation module comprises:
the data dependency relationship judging unit judges whether read-write consistency conflict exists among the micro instructions;
and if the read-write consistency conflict does not exist, the operation module stores the output neuron into the neuron cache unit.
6. The method of claim 1, wherein the operation module performing the reverse operation of the batch normalization operation according to the microinstruction comprises:
the operation module reads the input neuron gradient from the neuron cache unit according to the microinstruction, and calculates the mean value of the input neuron gradient and the variance of the input neuron gradient;
the operation module stores the mean value of the input neuron gradient and the variance of the input neuron gradient into the intermediate value buffer unit;
the operation module reads the learning parameter gradient and the input neuron gradient from the neuron cache unit, reads the mean value of the input neuron gradient and the variance of the input neuron gradient from the intermediate value cache unit, and calculates to obtain the output neuron gradient according to the learning parameter gradient, the input neuron gradient, the mean value of the input neuron gradient and the variance of the input neuron gradient;
and the operation module stores the output neuron gradient into the neuron cache unit.
7. The method of claim 3, wherein the operation module reads a learning parameter gradient and an input neuron gradient from the neuron cache unit, reads a mean value of the input neuron gradient and a variance of the input neuron gradient from an intermediate value cache unit, and calculates an output neuron gradient from the learning parameter gradient, the input neuron gradient, the mean value of the input neuron gradient, and the variance of the input neuron gradient, comprising:
dl/dx = (alpha / sqrt(var(x) + eps)) * (dl/dy - mean(dl/dy) - mean(dl/dy * y) * y), where the input neuron gradient is dl/dy, the output neuron gradient is dl/dx, mean denotes the averaging operation, eps is a very small constant, and alpha is the learning parameter.
8. A method for performing batch normalization operations during use of a neural network, the method being applied in an operation device comprising a controller unit and an operation module; the operation module comprises a neuron cache unit and an intermediate value cache unit;
the controller unit reads the instruction and decodes the instruction into a micro instruction;
the controller unit sends the micro instruction to the operation module;
the operation module executes batch normalization operation according to the micro instruction;
the operation module stores the mean value of the input neuron vector and the variance of the input neuron vector into the intermediate value buffer unit;
wherein the operation module performs batch normalization operations according to the microinstructions, including:
the operation module reads input neuron vectors and learning parameters from the neuron cache unit according to the microinstruction;
the operation module configures a mean constant of the input neuron vector and a variance constant of the input neuron vector according to the microinstruction;
the operation module calculates and obtains an output neuron according to the learning parameter, the input neuron vector, the mean constant and the variance constant;
the operation module stores the output neuron into the neuron cache unit.
9. The method of claim 8, wherein the computing module calculating an output neuron from the learning parameter, the input neuron vector, the mean constant, and the variance constant comprises:
the operation module reads an input neuron vector from the neuron cache unit, and calculates in parallel to obtain an intermediate value of the normalization process according to the input neuron vector, the mean constant and the variance constant;
the operation module stores the intermediate value of the normalization process into an intermediate value buffer unit;
the operation module reads the learning parameter from the neuron cache unit, reads the intermediate value of the normalization process from the intermediate value cache unit, and obtains the output neuron according to the learning parameter and the intermediate value.
10. The method of claim 8, the operation module further comprising a data dependency determination unit;
the operation module reads an input neuron vector from the neuron cache unit according to the microinstruction, and comprises:
the data dependency relationship judging unit judges whether read-write consistency conflict exists among the micro instructions;
and if the read-write consistency conflict does not exist, the operation module reads the input neuron vector from the neuron cache unit according to the micro instruction.
11. The method of claim 8, the operation module further comprising a data dependency determination unit;
the operation module stores the output neuron into the neuron cache unit, and the operation module comprises:
the data dependency relationship judging unit judges whether read-write consistency conflict exists among the micro instructions;
and if the read-write consistency conflict does not exist, the operation module stores the output neuron into the neuron cache unit.
12. An apparatus for implementing the method of any one of claims 1-11, the apparatus comprising at least one of: the mobile phone comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet personal computer, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone and mobile storage; aircraft, ship, vehicle; television, air conditioner, microwave oven, refrigerator, electric cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; nuclear magnetic resonance apparatus, B-mode ultrasound apparatus and electrocardiograph apparatus.
CN202010617696.8A 2016-04-29 2016-04-29 Apparatus and method for performing batch normalization operations Active CN111860814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010617696.8A CN111860814B (en) 2016-04-29 2016-04-29 Apparatus and method for performing batch normalization operations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610282550.6A CN107341546B (en) 2016-04-29 2016-04-29 Device and method for executing batch normalization operation
CN202010617696.8A CN111860814B (en) 2016-04-29 2016-04-29 Apparatus and method for performing batch normalization operations

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610282550.6A Division CN107341546B (en) 2016-04-29 2016-04-29 Device and method for executing batch normalization operation

Publications (2)

Publication Number Publication Date
CN111860814A CN111860814A (en) 2020-10-30
CN111860814B true CN111860814B (en) 2024-01-16

Family

ID=60221813

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010617696.8A Active CN111860814B (en) 2016-04-29 2016-04-29 Apparatus and method for performing batch normalization operations
CN201610282550.6A Active CN107341546B (en) 2016-04-29 2016-04-29 Device and method for executing batch normalization operation

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610282550.6A Active CN107341546B (en) 2016-04-29 2016-04-29 Device and method for executing batch normalization operation

Country Status (1)

Country Link
CN (2) CN111860814B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978157B (en) * 2017-12-28 2020-06-02 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
KR102148110B1 (en) * 2018-02-13 2020-08-25 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 Computing device and method
CN109918999A (en) * 2019-01-22 2019-06-21 西安交通大学 Based on the mechanical equipment fault intelligent diagnosing method for generating model under a kind of Small Sample Database

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809426A (en) * 2014-01-27 2015-07-29 日本电气株式会社 Convolutional neural network training method and target identification method and device
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection
CN105512725A (en) * 2015-12-14 2016-04-20 杭州朗和科技有限公司 Neural network training method and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020740A (en) * 2012-12-25 2013-04-03 临安市供电局 Micrometeorological data based electric power circuit icing thickness prediction method
CN104978601B (en) * 2015-06-26 2017-08-25 深圳市腾讯计算机系统有限公司 neural network model training system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809426A (en) * 2014-01-27 2015-07-29 日本电气株式会社 Convolutional neural network training method and target identification method and device
CN105512725A (en) * 2015-12-14 2016-04-20 杭州朗和科技有限公司 Neural network training method and equipment
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DaDianNao: A Machine-Learning Supercomputer; Yunji Chen et al.; 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture; pp. 609-622 *
Understanding the backward pass through Batch Normalization Layer; kratzert; GitHub; pp. 1-17 *

Also Published As

Publication number Publication date
CN111860814A (en) 2020-10-30
CN107341546B (en) 2021-06-08
CN107341546A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
US11922132B2 (en) Information processing method and terminal device
CN111310904B (en) Apparatus and method for performing convolutional neural network training
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
CN109240746B (en) Apparatus and method for performing matrix multiplication operation
CN109284825B (en) Apparatus and method for performing LSTM operations
KR102402111B1 (en) Apparatus and method for performing forward computation in convolutional neural networks
KR102470264B1 (en) Apparatus and method for performing reverse training of a fully-connected layer neural network
KR102486030B1 (en) Apparatus and method for executing forward operation of fully-connected layer neural network
CN110298443B (en) Neural network operation device and method
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
US10853722B2 (en) Apparatus for executing LSTM neural network operation, and operational method
WO2018082229A1 (en) Slam operation apparatus and method
WO2017185336A1 (en) Apparatus and method for executing pooling operation
CN111857820A (en) Apparatus and method for performing matrix addition/subtraction operation
CN111860814B (en) Apparatus and method for performing batch normalization operations
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN111860772B (en) Device and method for executing artificial neural network mapping operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant