WO2018076331A1 - Neural network training method and apparatus

Neural network training method and apparatus

Info

Publication number
WO2018076331A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
neural network
nonlinear
updated
bit width
Prior art date
Application number
PCT/CN2016/103979
Other languages
French (fr)
Chinese (zh)
Inventor
陈云霁
庄毅敏
郭崎
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/103979
Publication of WO2018076331A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

A neural network training apparatus and method for training parameters in a neural network. The method first applies a nonlinear function to the parameters to obtain transform parameters (S1); it then performs low bit width conversion on the transform parameters to obtain low bit width transform parameters (S2); next, it acquires the to-be-updated gradient values of the low bit width transform parameters through the backward pass of the neural network, and obtains the to-be-updated gradient values of the parameters prior to the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters (S3); finally, it updates the parameters according to their to-be-updated gradient values (S4). The method yields trained parameters with a lower bit width and little loss of precision.

Description

Neural Network Training Method and Device

Technical Field
The invention belongs to the technical field of neural networks, and in particular relates to a neural network training method and device.
Background Art
Multi-layer neural networks have attracted increasing attention in academia and industry owing to their high recognition accuracy and good parallelism, and they are finding more and more applications in fields such as pattern recognition, image processing, and natural language processing.
Because of their large volume of model parameter data, neural networks are difficult to deploy on embedded systems. Researchers have used a variety of approaches to reduce the storage space required for these model parameters, the most common being low-precision data representations, such as 16-bit floating-point, 16-bit fixed-point, or even 1-bit binary representations. Compared with the original full-precision floating-point representation, a low-precision representation reduces data storage space; however, since neural network model parameters span a very wide range of values, a low-precision representation incurs a large loss of precision and degrades the performance of the neural network.
Summary of the Invention
(1) Technical Problem to Be Solved
It is an object of the present invention to provide a neural network training method and apparatus for training the parameters of a neural network so that the trained parameters have a low bit width with little loss of precision.
(2) Technical Solution
The invention provides a neural network training method for training parameters in a neural network, the method comprising:
S1, nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
S3, obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
S4, updating the parameters according to their to-be-updated gradient values.
Further, in step S3, the to-be-updated gradient value Δy of the low bit width transform parameters is:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network, E is the loss function of the backward pass of the neural network, the nonlinear transformation function is y = f(x), x is a parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
the to-be-updated gradient value of the parameter before the nonlinear transformation is:
Δx = f′(x)Δy;
in step S4, the expression for updating the parameter is:
x_new = x - Δx.
Further, the nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
Further, the parameters include the weights and biases of the neural network.
Further, steps S1 to S4 are repeated on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
The invention also provides a neural network training apparatus for training parameters in a neural network, the apparatus comprising:
a nonlinear transformation module for nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
a reverse gradient transformation module for obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
an update module for updating the parameters according to their to-be-updated gradient values.
Further, in the reverse gradient transformation module, the to-be-updated gradient value Δy of the low bit width transform parameters is:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network, E is the loss function of the backward pass of the neural network, the nonlinear transformation function is y = f(x), x is a parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
the to-be-updated gradient value of the parameter before the nonlinear transformation is:
Δx = f′(x)Δy;
the expression by which the update module updates the parameter is:
x_new = x - Δx.
Further, the nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
Further, the parameters include the weights and biases of the neural network.
Further, the nonlinear transformation module, the low bit width conversion module, the reverse gradient transformation module, and the update module repeatedly train the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
(3) Beneficial Effects
The invention has the following advantages:
1. Since the nonlinear transformation function makes the data range and data precision of the parameters controllable, the original precision of the data is well preserved in the subsequent low bit width conversion, which maintains the performance of the neural network.
2. Since the parameters obtained by training have a lower bit width, the storage space required for the parameters is greatly reduced.
3. The parameters obtained by training can be used in a dedicated neural network accelerator. Because lower-precision parameters are used, the accelerator's transmission bandwidth requirements are reduced, and low-precision data reduces hardware area overhead, for example by shrinking the arithmetic units, thereby optimizing the hardware's area and power efficiency.
Brief Description of the Drawings
FIG. 1 is a flowchart of the neural network training method provided by the present invention.
FIG. 2 is a schematic structural diagram of the neural network training apparatus provided by the present invention.
Detailed Description
The invention provides a neural network training apparatus and method for training parameters in a neural network. The method first applies a nonlinear function to the parameters to obtain transform parameters; it then performs low bit width conversion on the transform parameters to obtain low bit width transform parameters; next, it acquires the to-be-updated gradient values of the low bit width transform parameters through the backward pass of the neural network, and obtains the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters; finally, it updates the parameters according to their to-be-updated gradient values. The invention yields trained parameters with a lower bit width and little loss of precision.
FIG. 1 is a flowchart of the neural network training method provided by the present invention. As shown in FIG. 1, the method includes:
S1, nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters.
The nonlinear transformation function used in this step is not unique; different nonlinear functions may be selected according to actual requirements, for example a hyperbolic tangent family function or a sigmoid family function.
For example, consider a neural network that uses an 8-bit floating-point representation. The range of values that 8-bit floating point can represent is small, whereas the parameters of a neural network that uses a high-precision representation typically span a large range. If the original full-precision data were converted directly to low-precision data by bit width conversion, the resulting loss of precision would degrade the performance of the neural network. A hyperbolic tangent family function can therefore be used for the nonlinear transformation.
The hyperbolic tangent family function may take the following form:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters. A controls the range of the transformed data, and B controls the distribution of the transformed data. The tuning principle is as follows: since the tanh(·) transformation scales the data into the range [-1, 1], adjusting A scales the transformed data into the range [-A, A], and adjusting B places the data in different segments of the tanh function. When B is very small, most of the data to be converted falls in the linear segment of the tanh function; when B is very large, most of it falls in the saturated segment; when B takes a moderate value, most of the data lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the values. The purpose of the nonlinear transformation in the present invention is therefore to make the range and the distribution of the data controllable.
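For illustration only, the following Python sketch implements the reconstructed form p_w = A·tanh(B·w/Max_w) and shows how A and B control the range and distribution of one layer's weights (the sample data, default values, and all names are our own, not from the patent):

```python
import numpy as np

def nonlinear_transform(w, A=1.0, B=3.0):
    """Hyperbolic tangent family transform: p_w = A * tanh(B * w / Max_w).

    A controls the output range ([-A, A]); B selects which segment of
    tanh the data falls in (small B: linear, large B: saturated)."""
    max_w = np.max(np.abs(w))  # Max_w: largest weight magnitude in the layer
    return A * np.tanh(B * w / max_w)

# Toy layer weights: concentrated near 0, with a few large outliers.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0.0, 0.01, 1000), [0.5, -0.5]])
for B in (0.1, 3.0, 30.0):
    p_w = nonlinear_transform(w, B=B)
    print(f"B={B:5.1f}  range=[{p_w.min():+.3f}, {p_w.max():+.3f}]  std={p_w.std():.4f}")
```

Running this shows the spread of the transformed weights growing with B while the range stays within [-A, A], which is the controllability the text describes.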
In other scenarios, other nonlinear functions or function parameters may be used as needed. For example, for a neural network using a 16-bit floating-point representation, the representable data range is relatively large, so the precision loss of most concern is that incurred by individual values falling outside the 16-bit floating-point range; a nonlinear function can therefore be designed so that this portion of the data falls in the saturated segment while the remaining data lies in the nonlinear or linear segments.
In addition, the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the network.
S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters.
Generally, the parameters of a neural network are floating-point numbers. In this step, the original full-precision floating-point numbers, after the nonlinear transformation, are converted into low-precision, low bit width floating-point numbers. Here, "low bit" in "low bit width floating-point number" means that the number of bits required to store one value is smaller than the number of bits required for full-precision floating point, and "wide" means that, in a statistical sense, the data is rather uniformly distributed over the exponent range that the representation can express. For example, suppose a set of data lies in [-5, 5] but is concentrated in [-1.5, 1.5]. Applying a nonlinear transformation such as tanh(x), the data in [-1.5, 1.5] lies in the steep, nearly linear segment of the transformation, which spreads the dense data apart, while the data in [-5, -1.5] and [1.5, 5] lies in the strongly curved segment, which makes the originally sparse data dense; the distribution of the data therefore becomes more uniform. Moreover, the transformation scales the data range into the region [-1, 1].
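As a minimal numerical sketch of this "low and wide" idea (the data, the tanh(x) choice, and the density measure below are our own illustration, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Data lying in [-5, 5] but concentrated in [-1.5, 1.5], as in the example.
x = np.clip(rng.normal(0.0, 0.8, 10_000), -5.0, 5.0)

y = np.tanh(x)  # the transform scales the range into [-1, 1]

def peak_density(a):
    """Fraction of values in the fullest of 10 equal-width bins:
    lower means the data is spread more uniformly over its range."""
    hist, _ = np.histogram(a, bins=10)
    return hist.max() / hist.sum()

print("before: range=[%.2f, %.2f], peak density=%.2f"
      % (x.min(), x.max(), peak_density(x)))
print("after : range=[%.2f, %.2f], peak density=%.2f"
      % (y.min(), y.max(), peak_density(y)))
```

On this synthetic data the peak density drops sharply after the transform, i.e., the values occupy their range far more evenly, which is what makes the subsequent low bit width conversion less lossy.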
S3, obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters.
In this step, let the nonlinear transformation function be y = f(x), where x is a weight or bias parameter before the transformation and y is the transformed weight or bias parameter. Let E denote the loss function of the backward pass; the gradient computed by the backward pass is then ∂E/∂y, which is the gradient of y. By the chain rule, ∂E/∂x = (∂E/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters from before the nonlinear transformation. Specifically, the backward pass of the neural network yields the to-be-updated gradient of y:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network; the reverse gradient transformation module then obtains the to-be-updated gradient value of x by computing Δx = f′(x)Δy.
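A sketch of this backward step under the tanh-family transform from step S1 (η, A, B, and the sample values below are assumptions for illustration):

```python
import numpy as np

def backward_through_transform(x, grad_y, eta, A=1.0, B=3.0):
    """Form Delta_y = eta * dE/dy and Delta_x = f'(x) * Delta_y for the
    transform y = f(x) = A * tanh(B * x / Max_x)."""
    max_x = np.max(np.abs(x))
    delta_y = eta * grad_y
    # f'(x) = A * (B / Max_x) * (1 - tanh(B * x / Max_x)**2)
    f_prime = A * (B / max_x) * (1.0 - np.tanh(B * x / max_x) ** 2)
    delta_x = f_prime * delta_y
    return delta_y, delta_x

x = np.array([0.02, -0.40, 0.75])       # pre-transformation weights
grad_y = np.array([0.10, -0.05, 0.20])  # dE/dy from the backward pass
delta_y, delta_x = backward_through_transform(x, grad_y, eta=0.01)
x_new = x - delta_x                     # the S4 update
```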
S4, updating the parameters according to their to-be-updated gradient values.
In this step, the expression for updating a parameter is:
x_new = x - Δx.
Then, steps S1 to S4 are repeated on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
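Putting S1 through S4 together, one training iteration might be sketched as follows (the forward/backward routine is a stand-in, float16 is one possible low bit width format, and stopping on the update magnitude is one reading of the threshold condition):

```python
import numpy as np

def train_step(x, inputs, forward_backward, eta=0.01, A=1.0, B=3.0):
    """One S1-S4 iteration for a single parameter tensor x.

    forward_backward(y_low, inputs) is assumed to run the network's
    forward and backward passes on the low bit width parameters and
    return dE/dy, the loss gradient w.r.t. the transformed parameters."""
    max_x = np.max(np.abs(x))
    y = A * np.tanh(B * x / max_x)            # S1: nonlinear transform
    y_low = y.astype(np.float16)              # S2: low bit width conversion
    grad_y = forward_backward(y_low, inputs)  # S3: backward pass gives dE/dy
    delta_y = eta * grad_y
    f_prime = A * (B / max_x) * (1.0 - np.tanh(B * x / max_x) ** 2)
    delta_x = f_prime * delta_y               # S3: chain rule back to x
    return x - delta_x                        # S4: update

# Stand-in forward/backward pass, for demonstration only.
fake_fb = lambda y_low, inputs: 0.1 * np.asarray(y_low, dtype=np.float64)

x = np.random.default_rng(0).normal(0.0, 0.01, size=128)
for _ in range(1000):
    x_next = train_step(x, None, fake_fb)
    if np.max(np.abs(x_next - x)) < 1e-8:  # predetermined threshold
        break
    x = x_next
```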
FIG. 2 is a schematic structural diagram of the neural network training apparatus provided by the present invention. As shown in FIG. 2, the training apparatus includes:
a nonlinear transformation module for nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
The nonlinear transformation function used by the nonlinear transformation module is not unique; different nonlinear functions may be selected according to actual requirements, for example a hyperbolic tangent family function or a sigmoid family function.
For example, consider a neural network that uses an 8-bit floating-point representation. The range of values that 8-bit floating point can represent is small, whereas the parameters of a neural network that uses a high-precision representation typically span a large range. If the original full-precision data were converted directly to low-precision data by bit width conversion, the resulting loss of precision would degrade the performance of the neural network. A hyperbolic tangent family function can therefore be used for the nonlinear transformation.
The hyperbolic tangent family function may take the following form:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters. A controls the range of the transformed data, and B controls the distribution of the transformed data. The tuning principle is as follows: since the tanh(·) transformation scales the data into the range [-1, 1], adjusting A scales the transformed data into the range [-A, A], and adjusting B places the data in different segments of the tanh function. When B is very small, most of the data to be converted falls in the linear segment of the tanh function; when B is very large, most of it falls in the saturated segment; when B takes a moderate value, most of the data lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the values. The purpose of the nonlinear transformation in the present invention is therefore to make the range and the distribution of the data controllable.
In other scenarios, other nonlinear functions or function parameters may be used as needed. For example, for a neural network using a 16-bit floating-point representation, the representable data range is relatively large, so the precision loss of most concern is that incurred by individual values falling outside the 16-bit floating-point range; a nonlinear function can therefore be designed so that this portion of the data falls in the saturated segment while the remaining data lies in the nonlinear or linear segments.
In addition, the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the network.
a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
Generally, the parameters of a neural network are floating-point numbers, and the function of the low bit width conversion module is to convert the original full-precision floating-point numbers, after the nonlinear transformation, into low-precision, low bit width floating-point numbers. Here, "low bit" in "low bit width floating-point number" means that the number of bits required to store one value is smaller than the number of bits required for full-precision floating point, and "wide" means that, in a statistical sense, the data is rather uniformly distributed over the exponent range that the representation can express. For example, suppose a set of data lies in [-5, 5] but is concentrated in [-1.5, 1.5]. Applying a nonlinear transformation such as tanh(x), the data in [-1.5, 1.5] lies in the steep, nearly linear segment of the transformation, which spreads the dense data apart, while the data in [-5, -1.5] and [1.5, 5] lies in the strongly curved segment, which makes the originally sparse data dense; the distribution of the data therefore becomes more uniform. Moreover, the transformation scales the data range into the region [-1, 1].
a reverse gradient transformation module for obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
In the reverse gradient transformation module, let the nonlinear transformation function be y = f(x), where x is a weight or bias parameter before the transformation and y is the transformed weight or bias parameter. Let E denote the loss function of the backward pass; the gradient computed by the backward pass is then ∂E/∂y, which is the gradient of y. By the chain rule, ∂E/∂x = (∂E/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters from before the nonlinear transformation. Specifically, the backward pass of the neural network yields the to-be-updated gradient of y:
Δy = η · ∂E/∂y
where η is the learning rate of the neural network; the reverse gradient transformation module then obtains the to-be-updated gradient value of x by computing Δx = f′(x)Δy.
an update module for updating the parameters according to their to-be-updated gradient values.
The expression by which the update module updates a parameter is:
x_new = x - Δx.
The neural network training apparatus provided by the present invention repeats steps S1 to S4 on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
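Mirroring FIG. 2, the four modules of the apparatus could be organized as follows (a structural sketch only; the module boundaries follow the text, while dtype, η, A, and B are assumptions):

```python
import numpy as np

class NeuralNetworkTrainer:
    """The four modules of FIG. 2 wired together for one parameter tensor."""

    def __init__(self, eta=0.01, A=1.0, B=3.0, low_dtype=np.float16):
        self.eta, self.A, self.B, self.low_dtype = eta, A, B, low_dtype

    # Nonlinear transformation module.
    def nonlinear_transform(self, x):
        self.max_x = np.max(np.abs(x))
        return self.A * np.tanh(self.B * x / self.max_x)

    # Low bit width conversion module.
    def low_bit_width_convert(self, y):
        return y.astype(self.low_dtype)

    # Reverse gradient transformation module: Delta_x = f'(x) * eta * dE/dy.
    def reverse_gradient_transform(self, x, grad_y):
        delta_y = self.eta * grad_y
        f_prime = self.A * (self.B / self.max_x) * (
            1.0 - np.tanh(self.B * x / self.max_x) ** 2)
        return f_prime * delta_y

    # Update module: x_new = x - Delta_x.
    def update(self, x, delta_x):
        return x - delta_x
```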
Embodiment 1
The neural network of this embodiment supports a 16-bit floating-point data representation, and this embodiment uses a hyperbolic tangent family function as the nonlinear transformation:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in the layer, and A and B are constant parameters; A controls the range of the transformed data, and B controls the distribution of the transformed data. By adjusting A and B, unevenly distributed data can be distributed more uniformly over the exponent domain and mapped into a specified range.
Specifically, when B = 3 (with A set accordingly), the nonlinear function becomes p_w = A · tanh(3w / Max_w). Since the weight or bias data is small relative to its maximum value, this function maps the data into the range [-1, 1], with most of the data concentrated near 0; this functional form makes full use of the nonlinear segment of the function to compress the intervals where little data lies, densifying the data, while the linear segment of the function stretches the interval near 0 where most of the data lies, spreading the data apart.
The weights and biases of each layer of the neural network are nonlinearly transformed by the nonlinear transformation module to obtain the transformed weights and biases, i.e., y = A · tanh(B · x / Max_x), where y is the transformed weight or bias, x is the weight or bias before the transformation, and Max_x is the maximum of all x. The weight and bias data from before the transformation are also retained.
The transformed weight and bias data and the input data of each layer are converted into the required 16-bit floating-point data by the low bit width floating-point data conversion module. The low-precision floating-point conversion may use direct truncation: for the original full-precision data, the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
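One way to realize this direct truncation is to drop the low-order fraction bits, as in the sketch below (assuming an IEEE-754-style 16-bit target with a 10-bit fraction; note that numpy's astype(np.float16) would round to nearest instead of truncating, and exponent-range clamping is omitted here):

```python
import numpy as np

def truncate_precision(x, keep_bits=10):
    """Keep only the top `keep_bits` fraction bits of float32 values and
    discard the rest, emulating direct truncation to a 16-bit significand
    (exponent-range clamping to the 16-bit format is omitted)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

w = np.array([0.1234567, -0.7654321], dtype=np.float32)
print(truncate_precision(w))  # low-order fraction bits simply dropped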
The converted 16-bit floating-point data is used for neural network training, and the to-be-updated gradient value Δy is obtained through the backward pass of the neural network.
From the gradient value Δy, the reverse gradient transformation module computes the to-be-updated gradient values of the weights and biases from before the nonlinear transformation, i.e., Δx = f′(x)Δy, and this gradient value is used to complete the update of the pre-transformation weights and biases, i.e., x_new = x - Δx.
The above steps are repeated until training is complete.
Embodiment 2
Embodiment 2 differs from Embodiment 1 in that this embodiment uses a sigmoid family nonlinear transformation function:
p_w = A · sigmoid(B · w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, and A and B are constant parameters; A controls the range of the transformed data, and B controls the distribution of the transformed data. By adjusting A and B, unevenly distributed data can be distributed more uniformly over the exponent domain and mapped into a specified range. Assuming that the weight parameter data of this network is generally on the order of 10^-4, a sigmoid nonlinear function with a correspondingly large B can be used; this functional form makes full use of the nonlinear and linear segments of the function so that the data range is distributed more evenly.
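A sketch of this variant (B = 10^4 is our assumption, chosen so that weights on the order of 10^-4 produce arguments of order 1 and thus occupy both the linear and nonlinear segments of the sigmoid):

```python
import numpy as np

def sigmoid_transform(w, A=1.0, B=1.0e4):
    """Sigmoid family transform: p_w = A * sigmoid(B * w).

    With weights of order 1e-4 and B ~ 1e4, B*w is of order 1, so the
    data occupies both the linear and the nonlinear segments."""
    return A / (1.0 + np.exp(-B * w))

w = np.random.default_rng(0).normal(0.0, 1e-4, size=1000)
p_w = sigmoid_transform(w)
print(p_w.min(), p_w.max())  # spread across (0, 1) instead of bunched at 0.5
```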
The weights and biases of each layer of the neural network are nonlinearly transformed by the nonlinear transformation module to obtain the transformed weights and biases, i.e., y = f(x) with the sigmoid family function above, where y is the transformed weight or bias and x is the weight or bias before the transformation. The weight and bias data from before the transformation are also retained.
The transformed weight and bias data and the input data of each layer are converted by low bit width conversion into 16-bit floating-point data. The low-precision floating-point conversion may use direct truncation: for the original full-precision data, the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
The converted 16-bit floating-point data is used for neural network training, and the to-be-updated gradient value Δy is obtained through the backward pass of the neural network.
From the gradient value Δy, the reverse gradient transformation module computes the to-be-updated gradient values of the weights and biases from before the nonlinear transformation, i.e., Δx = f′(x)Δy, and this gradient value is used to complete the update of the pre-transformation weights and biases, i.e., x_new = x - Δx.
The above steps are repeated until training is complete.
The trained neural network of the present invention can be used in neural network accelerators, speech recognition devices, image recognition devices, automatic navigation devices, and the like.
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A neural network training method for training parameters in a neural network, characterized in that the method comprises:
    S1, nonlinearly transforming the parameters using a nonlinear function to obtain transform parameters;
    S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
    S3, obtaining, through the backward pass of the neural network, the to-be-updated gradient values of the low bit width transform parameters, and obtaining the to-be-updated gradient values of the parameters before the nonlinear transformation from the nonlinear function and the to-be-updated gradient values of the low bit width transform parameters;
    S4, updating the parameters according to their to-be-updated gradient values.
  2. The neural network training method according to claim 1, characterized in that, in step S3, the to-be-updated gradient value Δy of the low bit width transform parameters is:
    Δy = η · ∂E/∂y
    where η is the learning rate of the neural network, E is the loss function of the backward pass of the neural network, the nonlinear transformation function is y = f(x), x is a parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
    the to-be-updated gradient value of the parameter before the nonlinear transformation is:
    Δx = f′(x)Δy;
    in step S4, the expression for updating the parameter is:
    x_new = x - Δx.
  3. The neural network training method according to claim 1, characterized in that the nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
  4. The neural network training method according to claim 1, wherein the parameters comprise the weights and biases of the neural network.
  5. The neural network training method according to claim 1, wherein steps S1 to S4 are repeated for the updated parameter until the parameter is less than a predetermined threshold, whereupon training is complete.
  6. A neural network training device for training parameters in a neural network, characterized in that the device comprises:
    a nonlinear transformation module, configured to perform a nonlinear transformation on the parameter using a nonlinear function to obtain a transformed parameter;
    a low bit width conversion module, configured to perform a low bit width conversion on the transformed parameter to obtain a low bit width transformed parameter;
    a reverse gradient transform module, configured to obtain, through the backward process of the neural network, the to-be-updated gradient value of the low bit width transformed parameter, and to obtain the to-be-updated gradient value of the parameter before the nonlinear transformation according to the nonlinear function and the to-be-updated gradient value of the low bit width transformed parameter;
    an update module, configured to update the parameter according to its to-be-updated gradient value.
  7. The neural network training device according to claim 6, wherein in the reverse gradient transform module the to-be-updated gradient value Δy of the low bit width transformed parameter is:
    Δy = η·∂L/∂y,
    where η is the learning rate of the neural network, L is the loss function of the backward process of the neural network, the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
    the to-be-updated gradient value of the parameter before the nonlinear transformation is:
    Δx = f'(x)·Δy;
    the expression by which the update module updates the parameter is:
    x_new = x - Δx.
  8. The neural network training device according to claim 6, wherein the nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
  9. The neural network training device according to claim 6, wherein the parameters comprise the weights and biases of the neural network.
  10. The neural network training device according to claim 6, wherein the nonlinear transformation module, the low bit width conversion module, the reverse gradient transform module, and the update module repeatedly train the updated parameter until the parameter is less than a predetermined threshold, whereupon training is complete.
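By way of illustration only, the four modules recited in claims 6 to 10 could be composed as in the following sketch; the class structure, tanh, and IEEE float16 are assumptions rather than features of the claims:

```python
import numpy as np

class NeuralNetworkTrainer:
    """Schematic composition of the four claimed modules (illustrative only)."""

    def __init__(self, learning_rate=0.01, threshold=1e-3):
        self.lr = learning_rate       # learning rate η
        self.threshold = threshold    # predetermined stopping threshold

    def nonlinear_transform(self, x):                 # nonlinear transformation module
        return np.tanh(x)

    def low_bit_width_convert(self, y):               # low bit width conversion module
        return y.astype(np.float16)

    def reverse_gradient_transform(self, x, grad_y):  # reverse gradient transform module
        delta_y = self.lr * grad_y                    # Δy = η·∂L/∂y
        return (1.0 - np.tanh(x) ** 2) * delta_y      # Δx = f'(x)·Δy

    def update(self, x, delta_x):                     # update module
        return x - delta_x                            # x_new = x - Δx
```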
PCT/CN2016/103979 2016-10-31 2016-10-31 Neural network training method and apparatus WO2018076331A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (en) 2016-10-31 2016-10-31 Neural network training method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (en) 2016-10-31 2016-10-31 Neural network training method and apparatus

Publications (1)

Publication Number Publication Date
WO2018076331A1 true WO2018076331A1 (en) 2018-05-03

Family

ID=62024220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103979 WO2018076331A1 (en) 2016-10-31 2016-10-31 Neural network training method and apparatus

Country Status (1)

Country Link
WO (1) WO2018076331A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5280564A (en) * 1991-02-20 1994-01-18 Honda Giken Kogyo Kabushiki Kaisha Neural network having an optimized transfer function for each neuron
CN1846218A (en) * 2003-09-09 2006-10-11 西麦恩公司 An artificial neural network
CN105550748A (en) * 2015-12-09 2016-05-04 四川长虹电器股份有限公司 Method for constructing novel neural network based on hyperbolic tangent function
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
CN105976027A (en) * 2016-04-29 2016-09-28 北京比特大陆科技有限公司 Data processing method and device, chip

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555508A (en) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Artificial neural network adjusting method and device
WO2020019236A1 (en) * 2018-07-26 2020-01-30 Intel Corporation Loss-error-aware quantization of a low-bit neural network
CN111198714A (en) * 2018-11-16 2020-05-26 上海寒武纪信息科技有限公司 Retraining method and related product
CN111198714B (en) * 2018-11-16 2022-11-18 寒武纪(西安)集成电路有限公司 Retraining method and related product
CN112114874A (en) * 2020-08-20 2020-12-22 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2017185412A1 (en) Neural network operation device and method supporting few-bit fixed-point number
WO2018076331A1 (en) Neural network training method and apparatus
CN107340993B (en) Arithmetic device and method
CN110969250B (en) Neural network training method and device
US10671922B2 (en) Batch renormalization layers
CN113052868B (en) Method and device for training matting model and image matting
CN109002889A (en) Adaptive iteration formula convolutional neural networks model compression method
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
DE112020003600T5 (en) MACHINE LEARNING HARDWARE WITH PARAMETER COMPONENTS WITH REDUCED ACCURACY FOR EFFICIENT PARAMETER UPDATE
CN109284761B (en) Image feature extraction method, device and equipment and readable storage medium
CN104504015A (en) Learning algorithm based on dynamic incremental dictionary update
CN111311530B (en) Multi-focus image fusion method based on directional filter and deconvolution neural network
CN109389222A (en) A kind of quick adaptive neural network optimization method
CN108460783A (en) A kind of cerebral magnetic resonance image organizational dividing method
EP3648015A3 (en) A method for training a neural network
WO2023020456A1 (en) Network model quantification method and apparatus, device, and storage medium
CN112561050B (en) Neural network model training method and device
WO2019037409A1 (en) Neural network training system and method, and computer readable storage medium
WO2020118553A1 (en) Method and device for quantizing convolutional neural network, and electronic device
WO2020253692A1 (en) Quantification method for deep learning network parameters
CN108710944A (en) One kind can train piece-wise linear activation primitive generation method
CN112257466A (en) Model compression method applied to small machine translation equipment
CN110837885B (en) Sigmoid function fitting method based on probability distribution
CN111382854B (en) Convolutional neural network processing method, device, equipment and storage medium
CN110633787A (en) Deep neural network compression method based on multi-bit neural network nonlinear quantization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1
