WO2018076331A1 - Neural network training method and apparatus - Google Patents

Neural network training method and apparatus

Info

Publication number
WO2018076331A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
neural network
nonlinear
updated
bit width
Prior art date
Application number
PCT/CN2016/103979
Other languages
English (en)
Chinese (zh)
Inventor
陈云霁
庄毅敏
郭崎
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/103979
Publication of WO2018076331A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • The invention belongs to the technical field of neural networks, and in particular relates to a neural network training method and device.
  • Multi-layer neural networks have attracted growing attention in academia and industry because of their high recognition accuracy and good parallelism, and they are increasingly applied in fields such as pattern recognition, image processing, and natural language processing.
  • The storage required by the model parameters of neural networks makes them difficult to apply to embedded systems.
  • Researchers use a variety of ways to reduce the storage space required for these model parameters.
  • The most common method is to store data in a low-precision representation, for example a 16-bit floating-point, 16-bit fixed-point, or 1-bit binary representation.
  • A low-precision data representation reduces the data storage space.
  • However, because neural network model parameters span a wide range of values, a low-precision representation can cause a large loss of precision, which degrades the performance of the neural network.
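  • The invention provides a neural network training method for training parameters in a neural network, the method comprising:
  • S1: nonlinearly transforming the parameters using a nonlinear function to obtain transformation parameters;
  • S2: performing low-bit-width conversion on the transformation parameters to obtain low-bit-width transformation parameters;
  • S3: obtaining the gradient value to be updated of the low-bit-width transformation parameters through the reverse process of the neural network, and obtaining the gradient value to be updated of the parameters before the nonlinear transformation from the nonlinear function and that gradient value;
  • S4: updating the parameters according to the gradient value to be updated of the parameters.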
  • In step S3, the gradient value ∇y of the low-bit-width transformation parameter to be updated is expressed in terms of the learning rate of the neural network, where x is the parameter before the nonlinear transformation and y is the weight or offset parameter after the nonlinear transformation.
  • The gradient value to be updated of the parameter before the nonlinear transformation is then obtained from ∇y and the nonlinear function.
  • In step S4, the parameter is updated using this gradient value.
  • The nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
  • The parameters include the weights and offsets of the neural network.
  • Steps S1 to S4 are repeated on the updated parameters until the parameter is less than a predetermined threshold, at which point training is complete.
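  • The equations for steps S3 and S4 are not reproduced in this text. One form consistent with the surrounding definitions is sketched below. The symbols η (learning rate), E (loss function of the reverse process), and f (the nonlinear function, with y = f(x)) are assumed names, and placing η inside ∇y is an inference from the learning rate being listed among the terms that define ∇y.

```latex
\nabla y = \eta \, \frac{\partial E}{\partial y},
\qquad
\nabla x = f'(x) \, \nabla y,
\qquad
x_{\text{new}} = x - \nabla x
```

  • The middle relation is the chain rule applied to y = f(x); the last one is an ordinary gradient-descent step on the full-precision parameter.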
  • The invention also provides a neural network training device for training parameters in a neural network, the device comprising:
  • a nonlinear transformation module for nonlinearly transforming the parameters using a nonlinear function to obtain transformation parameters;
  • a low-bit-width conversion module for performing low-bit-width conversion on the transformation parameters to obtain low-bit-width transformation parameters;
  • a reverse gradient transformation module for obtaining the gradient value to be updated of the low-bit-width transformation parameters through the reverse process of the neural network, and obtaining the gradient value to be updated of the parameters before the nonlinear transformation from the nonlinear function and that gradient value;
  • an update module for updating the parameters according to the gradient value to be updated of the parameters.
  • The gradient value ∇y of the low-bit-width transformation parameter to be updated is expressed in terms of the learning rate of the neural network, where x is the parameter before the nonlinear transformation and y is the weight or offset parameter after the nonlinear transformation.
  • The gradient value to be updated of the parameter before the nonlinear transformation is then obtained from ∇y and the nonlinear function.
  • The nonlinear function is a hyperbolic tangent family function or a sigmoid family function.
  • The parameters include the weights and offsets of the neural network.
  • The nonlinear transformation module, the low-bit-width conversion module, the reverse gradient transformation module, and the update module repeatedly train the updated parameters until the parameter is less than a predetermined threshold, at which point training is complete.
  • Because the nonlinear transformation function can control the data range and data precision of the parameters, the accuracy of the original data is well preserved in the subsequent low-bit-width conversion, which safeguards the performance of the neural network.
  • The trained parameters can be used in a dedicated neural network accelerator. Because the parameters have lower precision, the transmission bandwidth requirement of the accelerator is reduced, and low-precision data reduces the hardware area overhead, for example by shrinking the arithmetic units, which optimizes the area-to-power ratio of the hardware.
  • FIG. 1 is a flowchart of a neural network training method provided by the present invention.
  • FIG. 2 is a schematic structural view of a neural network training device provided by the present invention.
  • The invention provides a neural network training device and method for training parameters in a neural network. First, a nonlinear function is used to nonlinearly transform the parameters to obtain transformation parameters. The transformation parameters then undergo bit-width conversion to obtain low-bit-width transformation parameters. Next, the gradient value to be updated of the low-bit-width transformation parameters is obtained through the reverse process of the neural network, and the gradient value to be updated of the parameters before the nonlinear transformation is obtained from the nonlinear function and that gradient value. Finally, the parameters are updated according to the gradient value to be updated of the parameters.
  • The present invention allows the trained parameters to have a lower bit width with less loss of precision.
  • FIG. 1 is a flowchart of a neural network training method provided by the present invention. As shown in FIG. 1, the method includes:
  • The nonlinear function used in this step is not unique; different nonlinear functions can be chosen according to actual requirements, such as a hyperbolic tangent family function or a sigmoid family function.
  • Neural network parameters stored in a high-precision representation generally span a relatively large range. Converting the original full-precision data directly into low-precision, low-bit-width data loses precision in the data itself and can therefore degrade the performance of the neural network. A hyperbolic tangent family function can be used for the nonlinear transformation to address this.
  • The hyperbolic tangent family function can take a form with two constant parameters. Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the weight data after the transformation, Max_w is the maximum value of all weight data in the layer, and A and B are constant parameters. A controls the range of the transformed data, and B controls the distribution of the transformed data. The adjustment works as follows. The transformation scales the data into the range [-1, 1]; adjusting A rescales the transformed data into the range [-A, A]; adjusting B places the data on different segments of the tanh(x) function. When B is very small, most of the data to be converted falls in the linear segment of tanh(x). When B is very large, most of it falls in the saturated segment. When B takes a moderate value, most of it falls in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the values. The purpose of the nonlinear transformation in the present invention is therefore to make the range and the distribution of the data controllable.
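  • The effect of A and B can be illustrated with a small NumPy sketch. The exact formula is not reproduced in this text, so the form `a * tanh(b * w / max|w|)` used here is an assumption inferred from the description, as is taking the "maximum" to be the maximum absolute value; the function name `tanh_transform` is hypothetical.

```python
import numpy as np

def tanh_transform(w, a=1.0, b=1.0):
    """Map weights into (-a, a) using an assumed form a * tanh(b * w / max|w|)."""
    w = np.asarray(w, dtype=np.float32)
    max_w = np.max(np.abs(w))  # assumption: maximum taken over absolute values
    return a * np.tanh(b * w / max_w)

# Sorted sample weights, so order preservation is easy to check.
weights = np.array([-2.0, -0.5, -0.01, 0.0, 0.02, 0.4, 1.5], dtype=np.float32)

small_b = tanh_transform(weights, a=1.0, b=0.01)     # data stays in the linear segment
moderate_b = tanh_transform(weights, a=1.0, b=2.0)   # data spreads over the nonlinear segment
large_b = tanh_transform(weights, a=1.0, b=50.0)     # the larger values saturate near +/-1

for name, out in [("B=0.01", small_b), ("B=2", moderate_b), ("B=50", large_b)]:
    print(name, np.round(out, 4))
```

  • Because tanh is strictly increasing, the relative order of the weights is unchanged for any positive B; only their spacing changes.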
  • Other nonlinear functions or function parameters may be used as needed.
  • For example, because 16-bit floating point can represent a relatively large data range, the precision loss that most needs attention is that of individual values lying outside the 16-bit floating-point representable range. A nonlinear function can therefore be designed so that these values fall in the saturated segment while the remaining data falls in the nonlinear or linear segment.
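  • The out-of-range case can be seen directly in NumPy. The sketch below compares a direct cast to 16-bit floating point with a cast after a squashing function; the squashing function and its `scale` constant are hypothetical choices made for illustration, not values taken from the patent.

```python
import numpy as np

# Two values lie outside the float16 range (max finite value 65504), two inside.
values = np.array([1.0e5, -2.0e5, 3.0, 0.5], dtype=np.float32)

with np.errstate(over="ignore"):
    direct = values.astype(np.float16)  # out-of-range values become +/-inf

# Hypothetical squashing: large values land in the saturated segment of tanh,
# small values stay in its near-linear segment.
scale = np.float32(3.0e4)
squashed = (scale * np.tanh(values / scale)).astype(np.float16)

print(direct)
print(squashed)
```

  • With this choice the two large values are pulled just under `scale`, while 3.0 and 0.5 pass through almost unchanged.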
  • The parameters referred to in the present invention are the weights and offsets of each layer of the neural network.
  • The parameters of the neural network are floating-point numbers. In this step, the original full-precision floating-point numbers, after the nonlinear transformation, are converted into low-precision, low-bit-width floating-point numbers.
  • In "low-bit-width floating-point number", "low bit" means that fewer bits are needed to store one value than a full-precision floating-point number requires, and "width" means that the data are statistically evenly distributed across the exponent range that the data representation can express.
  • Given the loss function of the reverse process, the gradient computed by the reverse process is the gradient with respect to y. By the chain rule, the gradient with respect to x is obtained from it, and the gradient with respect to x is used to update the weight and offset parameters before the nonlinear transformation.
  • The parameters are updated according to the gradient value to be updated of the parameters.
  • Steps S1 to S4 are repeated on the updated parameters until the parameters are less than a predetermined threshold, at which point training is complete.
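  • Steps S1 to S4 can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the tanh form of the nonlinear function, the constants `A`, `B`, and `ETA`, a single linear layer with a mean-squared-error loss, a normalization constant computed once before training, and a fixed number of steps in place of the threshold test.

```python
import numpy as np

A, B = 1.0, 1.0   # assumed constants of the nonlinear function
ETA = 1e-4        # assumed learning rate

def f(x, m):
    """S1: assumed nonlinear transformation y = A * tanh(B * x / m)."""
    return A * np.tanh(B * x / m)

def f_prime(x, m):
    """Derivative of f with respect to x, with m treated as a constant."""
    return A * B / m * (1.0 - np.tanh(B * x / m) ** 2)

def loss_and_grad(y, inputs, targets):
    """Toy reverse process: mean squared error of a linear layer and dE/dy."""
    err = inputs @ y - targets
    return float(np.mean(err ** 2)), 2.0 * inputs.T @ err / len(targets)

def train(x, inputs, targets, steps=500):
    x = x.astype(np.float32).copy()     # full-precision parameters are retained
    m = float(np.max(np.abs(x)))        # simplification: fixed once, not per step
    losses = []
    for _ in range(steps):
        y16 = f(x, m).astype(np.float16)                   # S1 + S2
        loss, dE_dy = loss_and_grad(y16.astype(np.float32), inputs, targets)
        grad_y = ETA * dE_dy                               # S3: gradient of y
        grad_x = f_prime(x, m) * grad_y                    # S3: chain rule
        x = x - grad_x.astype(np.float32)                  # S4: update
        losses.append(loss)
    return x, f(x, m).astype(np.float16), losses

rng = np.random.default_rng(0)
inputs = rng.normal(size=(64, 3)).astype(np.float32)
targets = inputs @ np.array([0.5, -0.3, 0.1], dtype=np.float32)
x0 = np.array([0.02, -0.01, 0.03], dtype=np.float32)
x_final, y16_final, losses = train(x0, inputs, targets)
print(losses[0], losses[-1], y16_final)
```

  • The network only ever sees the 16-bit transformed parameters, while the update is applied to the retained full-precision parameters, which is the division of roles the steps describe.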
  • The neural network training device includes:
  • a nonlinear transformation module for nonlinearly transforming the parameters using a nonlinear function to obtain transformation parameters.
  • The nonlinear function adopted by the nonlinear transformation module is not unique; different nonlinear functions can be chosen according to actual requirements, such as a hyperbolic tangent family function or a sigmoid family function.
  • Neural network parameters stored in a high-precision representation generally span a relatively large range. Converting the original full-precision data directly into low-precision, low-bit-width data loses precision in the data itself and can therefore degrade the performance of the neural network. A hyperbolic tangent family function can be used for the nonlinear transformation to address this.
  • The hyperbolic tangent family function can take a form with two constant parameters. Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the weight data after the transformation, Max_w is the maximum value of all weight data in the layer, and A and B are constant parameters. A controls the range of the transformed data, and B controls the distribution of the transformed data. The adjustment works as follows. The transformation scales the data into the range [-1, 1]; adjusting A rescales the transformed data into the range [-A, A]; adjusting B places the data on different segments of the tanh(x) function. When B is very small, most of the data to be converted falls in the linear segment of tanh(x). When B is very large, most of it falls in the saturated segment. When B takes a moderate value, most of it falls in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the values. The purpose of the nonlinear transformation in the present invention is therefore to make the range and the distribution of the data controllable.
  • Other nonlinear functions or function parameters may be used as needed.
  • For example, because 16-bit floating point can represent a relatively large data range, the precision loss that most needs attention is that of individual values lying outside the 16-bit floating-point representable range. A nonlinear function can therefore be designed so that these values fall in the saturated segment while the remaining data falls in the nonlinear or linear segment.
  • The parameters referred to in the present invention are the weights and offsets of each layer of the neural network.
  • The low-bit-width conversion module performs low-bit-width conversion on the transformation parameters to obtain low-bit-width transformation parameters.
  • The parameters of the neural network are floating-point numbers. The function of the low-bit-width conversion module is to convert the original full-precision floating-point numbers, after the nonlinear transformation, into low-precision, low-bit-width floating-point numbers.
  • In "low-bit-width floating-point number", "low bit" means that fewer bits are needed to store one value than a full-precision floating-point number requires, and "width" means that the data are statistically evenly distributed across the exponent range that the data representation can express.
  • The reverse gradient transformation module obtains the gradient value to be updated of the low-bit-width transformation parameters through the reverse process of the neural network, and obtains the gradient value to be updated of the parameters before the nonlinear transformation from the nonlinear function and that gradient value.
  • Given the loss function of the reverse process, the gradient computed by the reverse process is the gradient with respect to y. By the chain rule, the gradient with respect to x is obtained from it, and the gradient with respect to x is used to update the weight and offset parameters before the nonlinear transformation.
  • The gradient of y to be updated is obtained by the reverse process of the neural network and depends on the learning rate of the neural network.
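  • The chain-rule step performed by this module can be checked numerically. The sketch below assumes a tanh-type nonlinear function with hypothetical constants `a`, `b`, and `m`, and compares its analytic derivative with a central finite difference; the helper `grad_x_from_grad_y` is an illustrative name, not part of the patent.

```python
import numpy as np

def f(x, a=1.0, b=1.0, m=1.0):
    """Assumed nonlinear function y = a * tanh(b * x / m)."""
    return a * np.tanh(b * x / m)

def f_prime(x, a=1.0, b=1.0, m=1.0):
    """Analytic derivative dy/dx of the assumed function."""
    return a * b / m * (1.0 - np.tanh(b * x / m) ** 2)

def grad_x_from_grad_y(x, grad_y, **constants):
    """Chain rule: gradient of x equals f'(x) times the gradient of y."""
    return f_prime(x, **constants) * grad_y

constants = {"a": 2.0, "b": 3.0, "m": 0.9}
x = np.array([-0.8, -0.1, 0.0, 0.3, 0.9])
eps = 1e-6

# Central finite difference as an independent check of f_prime.
numeric = (f(x + eps, **constants) - f(x - eps, **constants)) / (2 * eps)
analytic = f_prime(x, **constants)
print(np.max(np.abs(numeric - analytic)))

# Example: map a gradient of y back to a gradient of x.
grad_y = np.full_like(x, 0.01)
grad_x = grad_x_from_grad_y(x, grad_y, **constants)
```

  • In the saturated segment the derivative is close to zero, so parameters mapped there receive very small updates; this is a property of the assumed tanh form worth keeping in mind when choosing the constants.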
  • The update module updates the parameters according to the gradient value to be updated of the parameters.
  • The neural network training device provided by the present invention repeatedly performs steps S1 to S4 on the updated parameters until the parameter is less than a predetermined threshold, at which point training is complete.
  • The neural network of this embodiment supports a 16-bit floating-point data representation, and the nonlinear function used in this embodiment is a hyperbolic tangent family function, in which:
  • w is the weight data before the transformation;
  • p_w is the weight data after the transformation;
  • Max_w is the maximum value of all weight data in the layer;
  • A and B are constant parameters;
  • A controls the range of the transformed data;
  • B controls the distribution of the transformed data.
  • Because most weight or offset data is small relative to its maximum value, the data is mapped into the range [-1, 1] with most values concentrated near 0.
  • This form of function makes full use of the nonlinear segment to compress the intervals where data is sparse, making the data denser there, while the linear segment stretches the interval near 0 where the data is concentrated, spreading the data out.
  • The weights and offsets of each layer of the neural network are nonlinearly transformed by the nonlinear transformation module to obtain the transformed weights and offsets, where y is the transformed weight or offset, x is the weight or offset before the transformation, and Max_x is the maximum of all x. The weight and offset data before the transformation are retained.
  • The transformed weight and offset data and the input data of each layer are converted into the required 16-bit floating-point data by the low-bit-width floating-point conversion module. The conversion can use direct truncation: from the original full-precision data, the part of the precision that the 16-bit floating-point representation can express is kept, and the part beyond that precision is discarded.
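  • One possible reading of direct truncation is to clear the mantissa bits that 16-bit floating point cannot hold before casting. The sketch below implements that reading in NumPy; the function name `truncate_to_fp16` is hypothetical, and the approach assumes inputs within the float16 normal range, which is plausible for parameters already mapped into a bounded interval.

```python
import numpy as np

def truncate_to_fp16(values):
    """Keep only the mantissa bits float16 can hold and discard the rest."""
    as_f32 = np.ascontiguousarray(values, dtype=np.float32)
    bits = as_f32.view(np.uint32)
    # float32 has 23 mantissa bits and float16 has 10, so clear the low 13 bits.
    truncated = (bits & np.uint32(0xFFFFE000)).view(np.float32)
    # The cast is exact for values inside the float16 normal range (assumed here).
    return truncated.astype(np.float16)

data = np.array([0.3, -0.3, 0.1, 0.75], dtype=np.float32)
print(truncate_to_fp16(data))    # truncation, always toward zero
print(data.astype(np.float16))   # NumPy's default cast rounds to nearest
```

  • Truncation never increases the magnitude of a value, whereas the default cast rounds to the nearest representable value, so the two can differ in the last bit, as they do for 0.3.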
  • The converted 16-bit floating-point data is used for neural network training, and the gradient value ∇y to be updated is obtained by the reverse process of the neural network.
  • w is the weight data before the transformation, p_w is the weight data after the transformation, and A and B are constant parameters: A controls the range of the transformed data, and B controls the distribution of the transformed data.
  • The weights and offsets of each layer of the neural network are nonlinearly transformed by the nonlinear transformation module to obtain the transformed weights and offsets, where y is the transformed weight or offset and x is the weight or offset before the transformation. The weight and offset data before the transformation are retained.
  • The transformed weight and offset data and the input data of each layer are converted by low-bit-width conversion into 16-bit floating-point data. The conversion can use direct truncation: from the original full-precision data, the part of the precision that the 16-bit floating-point representation can express is kept, and the part beyond that precision is discarded.
  • The converted 16-bit floating-point data is used for neural network training, and the gradient value ∇y to be updated is obtained by the reverse process of the neural network.
  • the trained neural network of the present invention can be used for a neural network accelerator, a voice recognition device, an image recognition device, an automatic navigation device, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a neural network training apparatus and method for training parameters in a neural network. The method comprises: first using a nonlinear function to apply a nonlinear transformation to the parameters to obtain transformation parameters (S1); then performing bit-width conversion on the transformation parameters to obtain low-bit-width transformation parameters (S2); then obtaining the gradient value to be updated of the low-bit-width transformation parameters through the reverse process of the neural network, and obtaining the gradient value to be updated of the parameters before the nonlinear transformation from the nonlinear function and said gradient value to be updated of the low-bit-width transformation parameters (S3); and finally updating the parameters according to the gradient value to be updated of the parameters (S4). With this method, the parameters have a lower bit width after training while the loss of precision is smaller.
PCT/CN2016/103979 2016-10-31 2016-10-31 Neural network training method and apparatus WO2018076331A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (fr) 2016-10-31 2016-10-31 Neural network training method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (fr) 2016-10-31 2016-10-31 Neural network training method and apparatus

Publications (1)

Publication Number Publication Date
WO2018076331A1 (fr) 2018-05-03

Family

ID=62024220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103979 WO2018076331A1 (fr) 2016-10-31 2016-10-31 Neural network training method and apparatus

Country Status (1)

Country Link
WO (1) WO2018076331A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5280564A (en) * 1991-02-20 1994-01-18 Honda Giken Kogyo Kabushiki Kaisha Neural network having an optimized transfer function for each neuron
CN1846218A (zh) * 2003-09-09 2006-10-11 西麦恩公司 人工神经网络
CN105550748A (zh) * 2015-12-09 2016-05-04 四川长虹电器股份有限公司 基于双曲正切函数的新型神经网络的构造方法
CN105787439A (zh) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 一种基于卷积神经网络的深度图像人体关节定位方法
CN105976027A (zh) * 2016-04-29 2016-09-28 北京比特大陆科技有限公司 数据处理方法和装置、芯片

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555508A (zh) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 人工神经网络调整方法和装置
WO2020019236A1 (fr) * 2018-07-26 2020-01-30 Intel Corporation Quantification sensible aux erreurs de perte d'un réseau neuronal à faible bit
US12112256B2 (en) 2018-07-26 2024-10-08 Intel Corporation Loss-error-aware quantization of a low-bit neural network
CN111198714A (zh) * 2018-11-16 2020-05-26 上海寒武纪信息科技有限公司 重训练方法及相关产品
CN111198714B (zh) * 2018-11-16 2022-11-18 寒武纪(西安)集成电路有限公司 重训练方法及相关产品
CN112114874A (zh) * 2020-08-20 2020-12-22 北京百度网讯科技有限公司 数据处理方法、装置、电子设备和存储介质

Similar Documents

Publication Publication Date Title
WO2018076331A1 (fr) Neural network training method and apparatus
CN107340993B (zh) 运算装置和方法
KR102476343B1 (ko) 자리수가 비교적 적은 고정 소수점 수치의 신경망 연산에 대한 지원 장치와 방법
CN110969250B (zh) 一种神经网络训练方法及装置
CN109002889B (zh) 自适应迭代式卷积神经网络模型压缩方法
US10671922B2 (en) Batch renormalization layers
CN111127360B (zh) 一种基于自动编码器的灰度图像迁移学习方法
CN111985523A (zh) 基于知识蒸馏训练的2指数幂深度神经网络量化方法
CN105118067A (zh) 一种基于高斯平滑滤波的图像分割方法
CN104504015A (zh) 一种基于动态增量式字典更新的学习算法
CN111311530B (zh) 基于方向滤波器及反卷积神经网络的多聚焦图像融合方法
CN109389222A (zh) 一种快速的自适应神经网络优化方法
CN108460783A (zh) 一种脑部核磁共振图像组织分割方法
WO2023020456A1 (fr) Procédé et appareil de quantification de modèle de réseau, dispositif et support de stockage
WO2020118553A1 (fr) Procédé et dispositif de quantification de réseau de neurones à convolution, et dispositif électronique
CN110738660A (zh) 基于改进U-net的脊椎CT图像分割方法及装置
CN112561050B (zh) 一种神经网络模型训练方法及装置
CN116578945A (zh) 一种基于飞行器多源数据融合方法、电子设备及存储介质
WO2019037409A1 (fr) Système et procédé d'apprentissage de réseau neuronal et support de stockage lisible par ordinateur
CN110633787A (zh) 基于多比特神经网络非线性量化的深度神经网络压缩方法
CN112257466B (zh) 一种应用于小型机器翻译设备的模型压缩方法
WO2020253692A1 (fr) Procédé de quantification pour paramètres de réseau d'apprentissage profond
CN110837885B (zh) 一种基于概率分布的Sigmoid函数拟合方法
CN105846826B (zh) 基于近似平滑l0范数的压缩感知信号重构方法
CN111382854B (zh) 一种卷积神经网络处理方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1
