WO2018076331A1 - Neural network training method and apparatus

Neural network training method and apparatus

Info

Publication number
WO2018076331A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
neural network
nonlinear
updated
bit width
Prior art date
Application number
PCT/CN2016/103979
Other languages
French (fr)
Chinese (zh)
Inventor
陈云霁
庄毅敏
郭崎
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/103979
Publication of WO2018076331A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

A neural network training apparatus and method for training parameters in a neural network. The method first applies a nonlinear function to the parameters to obtain transform parameters (S1); it then performs low bit width conversion on the transform parameters to obtain low bit width transform parameters (S2); next, it acquires the to-be-updated gradient values of the low bit width transform parameters through the backward pass of the neural network, and obtains the to-be-updated gradient values of the parameters prior to the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters (S3); finally, it updates the parameters according to their to-be-updated gradient values (S4). The method yields trained parameters with a lower bit width and little loss of precision.

Description

Neural Network Training Method and Device

Technical Field
The invention belongs to the technical field of neural networks, and in particular relates to a neural network training method and device.
Background Art
Multi-layer neural networks have attracted increasing attention in academia and industry owing to their high recognition accuracy and good parallelism, and they are finding more and more applications in fields such as pattern recognition, image processing, and natural language processing.
Because of their large volume of model parameter data, neural networks are difficult to deploy on embedded systems. Researchers have used a variety of approaches to reduce the storage space required for these model parameters, the most common being low-precision data representations, such as 16-bit floating-point, 16-bit fixed-point, or even 1-bit binary representations. Compared with the original full-precision floating-point representation, a low-precision representation reduces data storage space; however, since neural network model parameters span a very wide range of values, a low-precision representation incurs a large loss of precision and degrades the performance of the neural network.
Summary of the Invention
(1) Technical Problem to Be Solved
It is an object of the present invention to provide a neural network training method and apparatus for training the parameters of a neural network so that the trained parameters have a low bit width with little loss of precision.
(2) Technical Solution
The invention provides a neural network training method for training parameters in a neural network, the method comprising:
S1, nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
S3, obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
S4, updating the parameters according to their to-be-updated gradient values.
Further, in step S3, the to-be-updated gradient value Δy of the low bit width transform parameters is:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network, E is the loss function of the backward pass of the neural network, the nonlinear transformation function is y = f(x), x is a parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
the to-be-updated gradient value of the parameter before the nonlinear transformation is:
Δx = f′(x)Δy;
in step S4, the expression for updating the parameter is:
x_new = x - Δx.
Further, the nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
Further, the parameters include the weights and biases of the neural network.
Further, steps S1 to S4 are repeated on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
The invention also provides a neural network training apparatus for training parameters in a neural network, the apparatus comprising:
a nonlinear transformation module for nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
a reverse gradient transformation module for obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
an update module for updating the parameters according to their to-be-updated gradient values.
Further, in the reverse gradient transformation module, the to-be-updated gradient value Δy of the low bit width transform parameters is:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network, E is the loss function of the backward pass of the neural network, the nonlinear transformation function is y = f(x), x is a parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
the to-be-updated gradient value of the parameter before the nonlinear transformation is:
Δx = f′(x)Δy;
the expression by which the update module updates the parameter is:
x_new = x - Δx.
Further, the nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
Further, the parameters include the weights and biases of the neural network.
Further, the nonlinear transformation module, the low bit width conversion module, the reverse gradient transformation module, and the update module repeatedly train the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
(3) Beneficial Effects
The invention has the following advantages:
1. Since the nonlinear transformation function makes the data range and data precision of the parameters controllable, the original precision of the data is well preserved in the subsequent low bit width conversion, which maintains the performance of the neural network.
2. Since the parameters obtained by training have a lower bit width, the storage space required for the parameters is greatly reduced.
3. The parameters obtained by training can be used in a dedicated neural network accelerator. Because lower-precision parameters are used, the accelerator's transmission bandwidth requirements are reduced, and low-precision data reduces hardware area overhead, for example by shrinking the arithmetic units, thereby optimizing the hardware's area and power efficiency.
Brief Description of the Drawings
FIG. 1 is a flowchart of the neural network training method provided by the present invention.
FIG. 2 is a schematic structural diagram of the neural network training apparatus provided by the present invention.
Detailed Description
The invention provides a neural network training apparatus and method for training parameters in a neural network. The method first applies a nonlinear function to the parameters to obtain transform parameters; it then performs low bit width conversion on the transform parameters to obtain low bit width transform parameters; next, it acquires the to-be-updated gradient values of the low bit width transform parameters through the backward pass of the neural network, and obtains the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters; finally, it updates the parameters according to their to-be-updated gradient values. The invention yields trained parameters with a lower bit width and little loss of precision.
FIG. 1 is a flowchart of the neural network training method provided by the present invention. As shown in FIG. 1, the method includes:
S1, nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters.
The nonlinear transformation function used in this step is not unique; different nonlinear functions may be selected according to actual requirements, for example a hyperbolic tangent family function or a sigmoid family function.
For example, consider a neural network that uses an 8-bit floating-point representation. The range of values that 8-bit floating point can represent is small, whereas the parameters of a neural network that uses a high-precision representation typically span a large range. If the original full-precision data were converted directly to low-precision data by bit width conversion, the resulting loss of precision would degrade the performance of the neural network. A hyperbolic tangent family function can therefore be used for the nonlinear transformation.
The hyperbolic tangent family function may take the following form:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters. A controls the range of the transformed data, and B controls the distribution of the transformed data. The tuning principle is as follows: since the tanh(·) transformation scales the data into the range [-1, 1], adjusting A scales the transformed data into the range [-A, A], and adjusting B places the data in different segments of the tanh function. When B is very small, most of the data to be converted falls in the linear segment of the tanh function; when B is very large, most of it falls in the saturated segment; when B takes a moderate value, most of the data lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the values. The purpose of the nonlinear transformation in the present invention is therefore to make the range and the distribution of the data controllable.
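For illustration only, the following Python sketch implements the reconstructed form p_w = A·tanh(B·w/Max_w) and shows how A and B control the range and distribution of one layer's weights (the sample data, default values, and all names are our own, not from the patent):

```python
import numpy as np

def nonlinear_transform(w, A=1.0, B=3.0):
    """Hyperbolic tangent family transform: p_w = A * tanh(B * w / Max_w).

    A controls the output range ([-A, A]); B selects which segment of
    tanh the data falls in (small B: linear, large B: saturated)."""
    max_w = np.max(np.abs(w))  # Max_w: largest weight magnitude in the layer
    return A * np.tanh(B * w / max_w)

# Toy layer weights: concentrated near 0, with a few large outliers.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0.0, 0.01, 1000), [0.5, -0.5]])
for B in (0.1, 3.0, 30.0):
    p_w = nonlinear_transform(w, B=B)
    print(f"B={B:5.1f}  range=[{p_w.min():+.3f}, {p_w.max():+.3f}]  std={p_w.std():.4f}")
```

Running this shows the spread of the transformed weights growing with B while the range stays within [-A, A], which is the controllability the text describes.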
In other scenarios, other nonlinear functions or function parameters may be used as needed. For example, for a neural network using a 16-bit floating-point representation, the representable data range is relatively large, so the precision loss of most concern is that incurred by individual values falling outside the 16-bit floating-point range; a nonlinear function can therefore be designed so that this portion of the data falls in the saturated segment while the remaining data lies in the nonlinear or linear segments.
In addition, the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the network.
S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters.
Generally, the parameters of a neural network are floating-point numbers. In this step, the original full-precision floating-point numbers, after the nonlinear transformation, are converted into low-precision, low bit width floating-point numbers. Here, "low bit" in "low bit width floating-point number" means that the number of bits required to store one value is smaller than the number of bits required for full-precision floating point, and "wide" means that, in a statistical sense, the data is rather uniformly distributed over the exponent range that the representation can express. For example, suppose a set of data lies in [-5, 5] but is concentrated in [-1.5, 1.5]. Applying a nonlinear transformation such as tanh(x), the data in [-1.5, 1.5] lies in the steep, nearly linear segment of the transformation, which spreads the dense data apart, while the data in [-5, -1.5] and [1.5, 5] lies in the strongly curved segment, which makes the originally sparse data dense; the distribution of the data therefore becomes more uniform. Moreover, the transformation scales the data range into the region [-1, 1].
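As a minimal numerical sketch of this "low and wide" idea (the data, the tanh(x) choice, and the density measure below are our own illustration, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Data lying in [-5, 5] but concentrated in [-1.5, 1.5], as in the example.
x = np.clip(rng.normal(0.0, 0.8, 10_000), -5.0, 5.0)

y = np.tanh(x)  # the transform scales the range into [-1, 1]

def peak_density(a):
    """Fraction of values in the fullest of 10 equal-width bins:
    lower means the data is spread more uniformly over its range."""
    hist, _ = np.histogram(a, bins=10)
    return hist.max() / hist.sum()

print("before: range=[%.2f, %.2f], peak density=%.2f"
      % (x.min(), x.max(), peak_density(x)))
print("after : range=[%.2f, %.2f], peak density=%.2f"
      % (y.min(), y.max(), peak_density(y)))
```

On this synthetic data the peak density drops sharply after the transform, i.e., the values occupy their range far more evenly, which is what makes the subsequent low bit width conversion less lossy.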
S3, obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters.
In this step, let the nonlinear transformation function be y = f(x), where x is a weight or bias parameter before the transformation and y is the transformed weight or bias parameter. Let E denote the loss function of the backward pass; the gradient computed by the backward pass is then ∂E/∂y, which is the gradient of y. By the chain rule, ∂E/∂x = (∂E/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters from before the nonlinear transformation. Specifically, the backward pass of the neural network yields the to-be-updated gradient of y:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network; the reverse gradient transformation module then obtains the to-be-updated gradient value of x by computing Δx = f′(x)Δy.
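A sketch of this backward step under the tanh-family transform from step S1 (η, A, B, and the sample values below are assumptions for illustration):

```python
import numpy as np

def backward_through_transform(x, grad_y, eta, A=1.0, B=3.0):
    """Form Delta_y = eta * dE/dy and Delta_x = f'(x) * Delta_y for the
    transform y = f(x) = A * tanh(B * x / Max_x)."""
    max_x = np.max(np.abs(x))
    delta_y = eta * grad_y
    # f'(x) = A * (B / Max_x) * (1 - tanh(B * x / Max_x)**2)
    f_prime = A * (B / max_x) * (1.0 - np.tanh(B * x / max_x) ** 2)
    delta_x = f_prime * delta_y
    return delta_y, delta_x

x = np.array([0.02, -0.40, 0.75])       # pre-transformation weights
grad_y = np.array([0.10, -0.05, 0.20])  # dE/dy from the backward pass
delta_y, delta_x = backward_through_transform(x, grad_y, eta=0.01)
x_new = x - delta_x                     # the S4 update
```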
S4, updating the parameters according to their to-be-updated gradient values.
In this step, the expression for updating a parameter is:
x_new = x - Δx.
Then, steps S1 to S4 are repeated on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
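Putting S1 through S4 together, one training iteration might be sketched as follows (the forward/backward routine is a stand-in, float16 is one possible low bit width format, and stopping on the update magnitude is one reading of the threshold condition):

```python
import numpy as np

def train_step(x, inputs, forward_backward, eta=0.01, A=1.0, B=3.0):
    """One S1-S4 iteration for a single parameter tensor x.

    forward_backward(y_low, inputs) is assumed to run the network's
    forward and backward passes on the low bit width parameters and
    return dE/dy, the loss gradient w.r.t. the transformed parameters."""
    max_x = np.max(np.abs(x))
    y = A * np.tanh(B * x / max_x)            # S1: nonlinear transform
    y_low = y.astype(np.float16)              # S2: low bit width conversion
    grad_y = forward_backward(y_low, inputs)  # S3: backward pass gives dE/dy
    delta_y = eta * grad_y
    f_prime = A * (B / max_x) * (1.0 - np.tanh(B * x / max_x) ** 2)
    delta_x = f_prime * delta_y               # S3: chain rule back to x
    return x - delta_x                        # S4: update

# Stand-in forward/backward pass, for demonstration only.
fake_fb = lambda y_low, inputs: 0.1 * np.asarray(y_low, dtype=np.float64)

x = np.random.default_rng(0).normal(0.0, 0.01, size=128)
for _ in range(1000):
    x_next = train_step(x, None, fake_fb)
    if np.max(np.abs(x_next - x)) < 1e-8:  # predetermined threshold
        break
    x = x_next
```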
FIG. 2 is a schematic structural diagram of the neural network training apparatus provided by the present invention. As shown in FIG. 2, the training apparatus includes:
a nonlinear transformation module for nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
The nonlinear transformation function used by the nonlinear transformation module is not unique; different nonlinear functions may be selected according to actual requirements, for example a hyperbolic tangent family function or a sigmoid family function.
For example, consider a neural network that uses an 8-bit floating-point representation. The range of values that 8-bit floating point can represent is small, whereas the parameters of a neural network that uses a high-precision representation typically span a large range. If the original full-precision data were converted directly to low-precision data by bit width conversion, the resulting loss of precision would degrade the performance of the neural network. A hyperbolic tangent family function can therefore be used for the nonlinear transformation.
The hyperbolic tangent family function may take the following form:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters. A controls the range of the transformed data, and B controls the distribution of the transformed data. The tuning principle is as follows: since the tanh(·) transformation scales the data into the range [-1, 1], adjusting A scales the transformed data into the range [-A, A], and adjusting B places the data in different segments of the tanh function. When B is very small, most of the data to be converted falls in the linear segment of the tanh function; when B is very large, most of it falls in the saturated segment; when B takes a moderate value, most of the data lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the values. The purpose of the nonlinear transformation in the present invention is therefore to make the range and the distribution of the data controllable.
In other scenarios, other nonlinear functions or function parameters may be used as needed. For example, for a neural network using a 16-bit floating-point representation, the representable data range is relatively large, so the precision loss of most concern is that incurred by individual values falling outside the 16-bit floating-point range; a nonlinear function can therefore be designed so that this portion of the data falls in the saturated segment while the remaining data lies in the nonlinear or linear segments.
In addition, the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the network.
a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
Generally, the parameters of a neural network are floating-point numbers, and the function of the low bit width conversion module is to convert the original full-precision floating-point numbers, after the nonlinear transformation, into low-precision, low bit width floating-point numbers. Here, "low bit" in "low bit width floating-point number" means that the number of bits required to store one value is smaller than the number of bits required for full-precision floating point, and "wide" means that, in a statistical sense, the data is rather uniformly distributed over the exponent range that the representation can express. For example, suppose a set of data lies in [-5, 5] but is concentrated in [-1.5, 1.5]. Applying a nonlinear transformation such as tanh(x), the data in [-1.5, 1.5] lies in the steep, nearly linear segment of the transformation, which spreads the dense data apart, while the data in [-5, -1.5] and [1.5, 5] lies in the strongly curved segment, which makes the originally sparse data dense; the distribution of the data therefore becomes more uniform. Moreover, the transformation scales the data range into the region [-1, 1].
a reverse gradient transformation module for obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
In the reverse gradient transformation module, let the nonlinear transformation function be y = f(x), where x is a weight or bias parameter before the transformation and y is the transformed weight or bias parameter. Let E denote the loss function of the backward pass; the gradient computed by the backward pass is then ∂E/∂y, which is the gradient of y. By the chain rule, ∂E/∂x = (∂E/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters from before the nonlinear transformation. Specifically, the backward pass of the neural network yields the to-be-updated gradient of y:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network; the reverse gradient transformation module then obtains the to-be-updated gradient value of x by computing Δx = f′(x)Δy.
an update module for updating the parameters according to their to-be-updated gradient values.
The expression by which the update module updates a parameter is:
x_new = x - Δx.
The neural network training apparatus provided by the present invention repeats steps S1 to S4 on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
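Mirroring FIG. 2, the four modules of the apparatus could be organized as follows (a structural sketch only; the module boundaries follow the text, while dtype, η, A, and B are assumptions):

```python
import numpy as np

class NeuralNetworkTrainer:
    """The four modules of FIG. 2 wired together for one parameter tensor."""

    def __init__(self, eta=0.01, A=1.0, B=3.0, low_dtype=np.float16):
        self.eta, self.A, self.B, self.low_dtype = eta, A, B, low_dtype

    # Nonlinear transformation module.
    def nonlinear_transform(self, x):
        self.max_x = np.max(np.abs(x))
        return self.A * np.tanh(self.B * x / self.max_x)

    # Low bit width conversion module.
    def low_bit_width_convert(self, y):
        return y.astype(self.low_dtype)

    # Reverse gradient transformation module: Delta_x = f'(x) * eta * dE/dy.
    def reverse_gradient_transform(self, x, grad_y):
        delta_y = self.eta * grad_y
        f_prime = self.A * (self.B / self.max_x) * (
            1.0 - np.tanh(self.B * x / self.max_x) ** 2)
        return f_prime * delta_y

    # Update module: x_new = x - Delta_x.
    def update(self, x, delta_x):
        return x - delta_x
```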
Embodiment 1
The neural network of this embodiment supports a 16-bit floating-point data representation, and this embodiment uses a hyperbolic tangent family function as the nonlinear transformation:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in the layer, and A and B are constant parameters; A controls the range of the transformed data, and B controls the distribution of the transformed data. By adjusting A and B, unevenly distributed data can be distributed more uniformly over the exponent domain and mapped into a specified range.
Specifically, when B = 3 (with A set accordingly), the nonlinear function becomes p_w = A · tanh(3w / Max_w). Since the weight or bias data is small relative to its maximum value, this function maps the data into the range [-1, 1], with most of the data concentrated near 0; this functional form makes full use of the nonlinear segment of the function to compress the intervals where little data lies, densifying the data, while the linear segment of the function stretches the interval near 0 where most of the data lies, spreading the data apart.
The weights and biases of each layer of the neural network are nonlinearly transformed by the nonlinear transformation module to obtain the transformed weights and biases, i.e., y = A · tanh(B · x / Max_x), where y is the transformed weight or bias, x is the weight or bias before the transformation, and Max_x is the maximum of all x. The weight and bias data from before the transformation are also retained.
The transformed weight and bias data and the input data of each layer are converted into the required 16-bit floating-point data by the low bit width floating-point data conversion module. The low-precision floating-point conversion may use direct truncation: for the original full-precision data, the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
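One way to realize this direct truncation is to drop the low-order fraction bits, as in the sketch below (assuming an IEEE-754-style 16-bit target with a 10-bit fraction; note that numpy's astype(np.float16) would round to nearest instead of truncating, and exponent-range clamping is omitted here):

```python
import numpy as np

def truncate_precision(x, keep_bits=10):
    """Keep only the top `keep_bits` fraction bits of float32 values and
    discard the rest, emulating direct truncation to a 16-bit significand
    (exponent-range clamping to the 16-bit format is omitted)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

w = np.array([0.1234567, -0.7654321], dtype=np.float32)
print(truncate_precision(w))  # low-order fraction bits simply dropped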
The converted 16-bit floating-point data is used for neural network training, and the to-be-updated gradient value Δy is obtained through the backward pass of the neural network.
From the gradient value Δy, the reverse gradient transformation module computes the to-be-updated gradient values of the weights and biases from before the nonlinear transformation, i.e., Δx = f′(x)Δy, and this gradient value is used to complete the update of the pre-transformation weights and biases, i.e., x_new = x - Δx.
The above steps are repeated until training is complete.
Embodiment 2
Embodiment 2 differs from Embodiment 1 in that this embodiment uses a sigmoid family nonlinear transformation function:
p_w = A · sigmoid(B · w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, and A and B are constant parameters; A controls the range of the transformed data, and B controls the distribution of the transformed data. By adjusting A and B, unevenly distributed data can be distributed more uniformly over the exponent domain and mapped into a specified range. Assuming that the weight parameter data of this network is generally on the order of 10^-4, a sigmoid nonlinear function with a correspondingly large B can be used; this functional form makes full use of the nonlinear and linear segments of the function so that the data range is distributed more evenly.
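A sketch of this variant (B = 10^4 is our assumption, chosen so that weights on the order of 10^-4 produce arguments of order 1 and thus occupy both the linear and nonlinear segments of the sigmoid):

```python
import numpy as np

def sigmoid_transform(w, A=1.0, B=1.0e4):
    """Sigmoid family transform: p_w = A * sigmoid(B * w).

    With weights of order 1e-4 and B ~ 1e4, B*w is of order 1, so the
    data occupies both the linear and the nonlinear segments."""
    return A / (1.0 + np.exp(-B * w))

w = np.random.default_rng(0).normal(0.0, 1e-4, size=1000)
p_w = sigmoid_transform(w)
print(p_w.min(), p_w.max())  # spread across (0, 1) instead of bunched at 0.5
```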
The weights and biases of each layer of the neural network are nonlinearly transformed by the nonlinear transformation module to obtain the transformed weights and biases, i.e., y = f(x) with the sigmoid family function above, where y is the transformed weight or bias and x is the weight or bias before the transformation. The weight and bias data from before the transformation are also retained.
The transformed weight and bias data and the input data of each layer are converted by low bit width conversion into 16-bit floating-point data. The low-precision floating-point conversion may use direct truncation: for the original full-precision data, the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
The converted 16-bit floating-point data is used for neural network training, and the to-be-updated gradient value Δy is obtained through the backward pass of the neural network.
From the gradient value Δy, the reverse gradient transformation module computes the to-be-updated gradient values of the weights and biases from before the nonlinear transformation, i.e., Δx = f′(x)Δy, and this gradient value is used to complete the update of the pre-transformation weights and biases, i.e., x_new = x - Δx.
The above steps are repeated until training is complete.
The trained neural network of the present invention can be used in neural network accelerators, speech recognition devices, image recognition devices, automatic navigation devices, and the like.
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A neural network training method for training parameters in a neural network, characterized in that the method comprises:
    S1, nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
    S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
    S3, obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
    S4, updating the parameters according to their to-be-updated gradient values.
  2. The neural network training method according to claim 1, characterized in that, in step S3, the to-be-updated gradient value Δy of the low bit width transform parameters is:
    Δy = η · ∂E/∂y
    where η is the learning rate of the neural network, E is the loss function of the backward pass of the neural network, the nonlinear transformation function is y = f(x), x is a parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
    the to-be-updated gradient value of the parameter before the nonlinear transformation is:
    Δx = f′(x)Δy;
    in step S4, the expression for updating the parameter is:
    x_new = x - Δx.
  3. The neural network training method according to claim 1, characterized in that the nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
  4. The neural network training method according to claim 1, wherein the parameters comprise the weights and biases of the neural network.
  5. The neural network training method according to claim 1, wherein steps S1 to S4 are repeated for the updated parameter until the parameter is less than a predetermined threshold, whereupon training is complete.
  6. A neural network training device for training parameters in a neural network, characterized in that the device comprises:
    a nonlinear transformation module, configured to perform a nonlinear transformation on the parameter using a nonlinear function to obtain a transformed parameter;
    a low bit width conversion module, configured to perform a low bit width conversion on the transformed parameter to obtain a low bit width transformed parameter;
    a reverse gradient transform module, configured to obtain, through the backward process of the neural network, the to-be-updated gradient value of the low bit width transformed parameter, and to obtain the to-be-updated gradient value of the parameter before the nonlinear transformation according to the nonlinear function and the to-be-updated gradient value of the low bit width transformed parameter;
    an update module, configured to update the parameter according to its to-be-updated gradient value.
  7. The neural network training device according to claim 6, wherein in the reverse gradient transform module the to-be-updated gradient value Δy of the low bit width transformed parameter is:
    Δy = η·∂L/∂y,
    where η is the learning rate of the neural network, L is the loss function of the backward process of the neural network, the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
    the to-be-updated gradient value of the parameter before the nonlinear transformation is:
    Δx = f'(x)·Δy;
    the expression by which the update module updates the parameter is:
    x_new = x - Δx.
  8. The neural network training device according to claim 6, wherein the nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
  9. The neural network training device according to claim 6, wherein the parameters comprise the weights and biases of the neural network.
  10. The neural network training device according to claim 6, wherein the nonlinear transformation module, the low bit width conversion module, the reverse gradient transform module, and the update module repeatedly train the updated parameter until the parameter is less than a predetermined threshold, whereupon training is complete.
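By way of illustration only, the four modules recited in claims 6 to 10 could be composed as in the following sketch; the class structure, tanh, and IEEE float16 are assumptions rather than features of the claims:

```python
import numpy as np

class NeuralNetworkTrainer:
    """Schematic composition of the four claimed modules (illustrative only)."""

    def __init__(self, learning_rate=0.01, threshold=1e-3):
        self.lr = learning_rate       # learning rate η
        self.threshold = threshold    # predetermined stopping threshold

    def nonlinear_transform(self, x):                 # nonlinear transformation module
        return np.tanh(x)

    def low_bit_width_convert(self, y):               # low bit width conversion module
        return y.astype(np.float16)

    def reverse_gradient_transform(self, x, grad_y):  # reverse gradient transform module
        delta_y = self.lr * grad_y                    # Δy = η·∂L/∂y
        return (1.0 - np.tanh(x) ** 2) * delta_y      # Δx = f'(x)·Δy

    def update(self, x, delta_x):                     # update module
        return x - delta_x                            # x_new = x - Δx
```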
PCT/CN2016/103979 2016-10-31 2016-10-31 Neural network training method and apparatus WO2018076331A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (en) 2016-10-31 2016-10-31 Neural network training method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (en) 2016-10-31 2016-10-31 Neural network training method and apparatus

Publications (1)

Publication Number Publication Date
WO2018076331A1 true WO2018076331A1 (en) 2018-05-03

Family

ID=62024220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103979 WO2018076331A1 (en) 2016-10-31 2016-10-31 Neural network training method and apparatus

Country Status (1)

Country Link
WO (1) WO2018076331A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5280564A (en) * 1991-02-20 1994-01-18 Honda Giken Kogyo Kabushiki Kaisha Neural network having an optimized transfer function for each neuron
CN1846218A (en) * 2003-09-09 2006-10-11 西麦恩公司 An artificial neural network
CN105550748A (en) * 2015-12-09 2016-05-04 四川长虹电器股份有限公司 Method for constructing novel neural network based on hyperbolic tangent function
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
CN105976027A (en) * 2016-04-29 2016-09-28 北京比特大陆科技有限公司 Data processing method and device, chip

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555508A (en) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Artificial neural network adjusting method and device
WO2020019236A1 (en) * 2018-07-26 2020-01-30 Intel Corporation Loss-error-aware quantization of a low-bit neural network
CN111198714A (en) * 2018-11-16 2020-05-26 上海寒武纪信息科技有限公司 Retraining method and related product
CN111198714B (en) * 2018-11-16 2022-11-18 寒武纪(西安)集成电路有限公司 Retraining method and related product
CN112114874A (en) * 2020-08-20 2020-12-22 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2017185412A1 (en) Neural network operation device and method supporting few-bit fixed-point number
WO2018076331A1 (en) Neural network training method and apparatus
CN107340993B (en) Arithmetic device and method
CN110969250B (en) Neural network training method and device
US10671922B2 (en) Batch renormalization layers
CN113052868B (en) Method and device for training matting model and image matting
CN109002889A (en) Adaptive iteration formula convolutional neural networks model compression method
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
DE112020003600T5 (en) MACHINE LEARNING HARDWARE WITH PARAMETER COMPONENTS WITH REDUCED ACCURACY FOR EFFICIENT PARAMETER UPDATE
CN109284761B (en) Image feature extraction method, device and equipment and readable storage medium
CN104504015A (en) Learning algorithm based on dynamic incremental dictionary update
CN111311530B (en) Multi-focus image fusion method based on directional filter and deconvolution neural network
CN109389222A (en) A kind of quick adaptive neural network optimization method
CN108460783A (en) A kind of cerebral magnetic resonance image organizational dividing method
EP3648015A3 (en) A method for training a neural network
WO2023020456A1 (en) Network model quantification method and apparatus, device, and storage medium
CN112561050B (en) Neural network model training method and device
WO2019037409A1 (en) Neural network training system and method, and computer readable storage medium
WO2020118553A1 (en) Method and device for quantizing convolutional neural network, and electronic device
WO2020253692A1 (en) Quantification method for deep learning network parameters
CN108710944A (en) One kind can train piece-wise linear activation primitive generation method
CN112257466A (en) Model compression method applied to small machine translation equipment
CN110837885B (en) Sigmoid function fitting method based on probability distribution
CN111382854B (en) Convolutional neural network processing method, device, equipment and storage medium
CN110633787A (en) Deep neural network compression method based on multi-bit neural network nonlinear quantization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1
