WO2017185335A1 - Apparatus and method for performing a batch normalization operation - Google Patents

Apparatus and method for performing a batch normalization operation

Info

Publication number
WO2017185335A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
unit
instruction
batch normalization
batch
Prior art date
Application number
PCT/CN2016/080695
Other languages
English (en)
Chinese (zh)
Inventor
刘少礼
于涌
陈云霁
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/080695
Publication of WO2017185335A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • The present invention relates to artificial neural network technology, and in particular to an apparatus and method for performing the forward and backward batch normalization operations of an artificial neural network.
  • Multi-layer artificial neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization.
  • In recent years, multi-layer artificial neural networks have attracted increasing attention from both academia and industry due to their high recognition accuracy and good parallelizability, and the batch normalization operation is applied more and more in multi-layer neural networks because it accelerates training and improves recognition accuracy.
  • One known method of supporting batch normalization operations is to use a general-purpose processor.
  • This method supports the above algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units.
  • One disadvantage of this approach is that the performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations.
  • When multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
  • In addition, the general-purpose processor must decode the multi-layer artificial neural network operations into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power overhead.
  • Another known method of supporting batch normalization is to use a graphics processing unit (GPU).
  • This method supports the above algorithm by executing generic SIMD instructions through a general-purpose register file and generic stream processing units.
  • Because the GPU is a device dedicated to graphics operations and scientific computing, with no special support for multi-layer artificial neural network batch normalization operations, a large amount of front-end decoding work is still required to perform multi-layer artificial neural network operations, bringing substantial extra overhead.
  • Moreover, the GPU has only a small on-chip cache, so the model data of multi-layer artificial neural network batch normalization must be repeatedly transferred from off-chip, and the off-chip bandwidth becomes the main performance bottleneck.
  • Furthermore, the batch normalization operation contains many reduction operations such as summation, for which the GPU's parallel architecture is poorly suited.
  • An aspect of the present invention provides an apparatus for performing an artificial neural network batch normalization operation, comprising an instruction storage unit, a controller unit, a data access unit, and an operation module, wherein: the instruction storage unit reads in instructions through the data access unit and caches them; the controller unit reads instructions from the instruction storage unit and decodes them into microinstructions that control the operation module; the data access unit writes data from the external address space into the data cache units of the operation module, or reads data from those cache units back to the external address space; and the operation module performs the actual computation on the data.
  • Another aspect of the present invention provides a method of performing a batch normalization forward operation using the above apparatus.
  • In the forward operation, the output is computed as y = f(x) = alpha * (x - E[x]) / sqrt(var[x] + eps) + beta, where x is each input neuron element and y is the corresponding output element.
  • During training, the forward operation requires dynamic calculation of the mean E[x] and the variance var[x].
  • The summation and reduction operations in the mean and variance calculation are performed by the arithmetic module of the device, so that the mean and variance of each iteration of the training process can be computed.
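  • As a non-limiting illustration, this forward computation can be sketched in Python as follows (the function and parameter names are illustrative and not part of the patent):

    import math

    def batch_norm_forward(xs, alpha, beta, eps=1e-5):
        # E[x] and var[x] are computed dynamically over the batch, as in training.
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        # y = alpha * (x - E[x]) / sqrt(var[x] + eps) + beta
        return [alpha * (x - mean) * inv_std + beta for x in xs]

    # Example: one feature across a batch of 4 samples.
    print(batch_norm_forward([1.0, 2.0, 3.0, 4.0], alpha=1.0, beta=0.0))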
  • Another aspect of the present invention provides a method of performing the batch normalization backward operation using the above apparatus.
  • The gradient passed in for one pixel is dl/dY.
  • The output of the forward pass is Y.
  • In the backward pass of batch normalization, the operation unit completes the reduction operations over the neurons in parallel, for example computing the mean, the variance, and the like.
  • The invention can be applied in the following (non-exhaustive) scenarios: data processing; robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, camcorders, projectors, watches, earphones, mobile storage, wearable devices, and other electronic products; aircraft, ships, vehicles, and other means of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods, and other household appliances; as well as nuclear magnetic resonance instruments, B-mode ultrasound scanners, electrocardiographs, and other medical equipment.
  • By adopting a device and an instruction set dedicated to the batch normalization operation, the invention solves the problems of insufficient CPU and GPU computing performance and high front-end decoding overhead, effectively improving support for the forward and backward batch normalization operations.
  • By adopting a dedicated on-chip buffer for the batch normalization operation, the invention fully exploits the reusability of input neurons and intermediate data, avoids repeatedly reading the same data from memory, reduces memory access bandwidth, and prevents memory bandwidth from becoming the bottleneck of multi-layer artificial neural network forward computation.
  • By employing a dedicated arithmetic unit for the batch normalization operation, the invention better balances parallel and serial execution, avoiding both the weakness of CPU architectures (purely serial, and slow when the data size is large) and of GPU architectures (purely parallel, and poorly suited to the many reduction operations of normalization).
  • The data storage units and the arithmetic unit cooperate to better balance the serial and parallel parts of the normalization operation.
  • FIG. 1 shows an example block diagram of an overall structure of an apparatus for performing a batch normalization operation according to an embodiment of the present invention.
  • FIG. 2 shows an example block diagram of the structure of an arithmetic module in an apparatus for performing a batch normalization operation in accordance with an embodiment of the present invention.
  • FIG. 3 shows an example block diagram of a batch normalization operation process in accordance with an embodiment of the present invention.
  • FIG. 4 shows a flow chart of a batch normalization operation in accordance with an embodiment of the present invention.
  • The batch normalization operation consists of two parts: forward and backward.
  • During training of the artificial neural network, both the forward and backward parts of the batch normalization operation must be applied, whereas during use (inference) of the artificial neural network only the forward part is performed.
  • Parameters obtained from the training process are reused during inference; for example, the mean and variance of the batch normalization operation do not need to be recomputed.
  • the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, and an arithmetic module 4.
  • the instruction storage unit 1, the controller unit 2, the data access unit 3, and the arithmetic module 4 can all be implemented by hardware circuits (including, but not limited to, FPGA, CGRA, application specific integrated circuit ASIC, analog circuit, memristor, etc.).
  • the instruction storage unit 1 reads in an instruction through the data access unit 3 and caches the read instruction.
  • The instruction storage unit can be implemented by a variety of memory devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, etc.).
  • The controller unit 2 reads instructions from the instruction storage unit 1, decodes them into microinstructions that control the behavior of the other units or modules, and then distributes the microinstructions to those units or modules, such as the data access unit 3, the arithmetic module 4, and the like.
  • the data access unit 3 can access the external address space, directly read and write data to each cache unit inside the device, and complete data loading and storage.
  • the arithmetic module 4 includes an arithmetic unit 41, a data dependency determining unit 42, a neuron buffer unit 43, and an intermediate value buffer unit 44.
  • the arithmetic unit 41 receives the microinstructions issued by the controller unit 2 and performs an arithmetic logic operation.
  • The data dependency judging unit 42 is responsible for the read and write operations on the neuron cache unit during computation. Before performing a read or write, it first ensures that there is no read/write consistency conflict among the data used by different instructions. For example, all microinstructions sent to the data dependency judging unit 42 are stored in an instruction queue inside the unit; if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
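  • As a non-limiting illustration, the conflict check of the data dependency judging unit 42 can be sketched as follows (the queue model and all names are illustrative):

    from collections import deque

    def overlaps(a_start, a_len, b_start, b_len):
        # Two address ranges conflict if they intersect.
        return a_start < b_start + b_len and b_start < a_start + a_len

    class DependencyQueue:
        def __init__(self):
            self.pending_writes = deque()  # (start, length) of unfinished writes

        def issue_write(self, start, length):
            self.pending_writes.append((start, length))

        def retire_write(self):
            self.pending_writes.popleft()  # oldest write has completed

        def can_read(self, start, length):
            # A read must wait while any earlier write range conflicts with it.
            return not any(overlaps(start, length, ws, wl)
                           for ws, wl in self.pending_writes)

    q = DependencyQueue()
    q.issue_write(0, 64)
    print(q.can_read(32, 16))   # False: overlaps the pending write
    q.retire_write()
    print(q.can_read(32, 16))   # True: the dependency has cleared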
  • the neuron buffer unit 43 buffers the input neuron vector data and the output neuron value data of the arithmetic module 4.
  • The intermediate value buffer unit 44 buffers the intermediate data required by the arithmetic module 4 during computation, for example the partial sums and partial square sums produced in the course of the operation. For each arithmetic module 4, the intermediate value buffer unit 44 stores the intermediate data of the batch normalization process. For example, the forward batch normalization operation is used during inference of the artificial neural network. Let x be each input neuron datum and y the corresponding output neuron datum; the learning parameters alpha and beta, which are continually updated during backward training, are used in the formula for computing the output neuron data: y = alpha * (x - E[x]) / sqrt(var[x] + eps) + beta.
  • The constant eps denotes a very small quantity, typically on the order of 10^-5; in practice it can also be set to 0.
  • The mean E[x] is the mean of the input neuron data x, taken with the batch size as the total count.
  • Likewise, var[x] is the variance of the corresponding input neuron data x, taken with the batch size as the total count.
  • The input neuron data usually has four dimensions: the batch (also called number) size, the number of channels, the height, and the width. These four dimensions determine the total number of input data x; E[x] and var[x] are the mean and variance computed, for each channel, over the data in the other three dimensions (batch, height, width), with the batch as the total count.
  • The result is returned through the data access unit to produce the output neurons.
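  • As a non-limiting illustration, the per-channel computation of E[x] and var[x] over the batch, height, and width dimensions can be sketched as follows (pure Python; shapes and names are illustrative):

    def channel_stats(x):
        # x[n][c][h][w] -> (means[c], variances[c]), reducing over n, h, w.
        batch, channels = len(x), len(x[0])
        height, width = len(x[0][0]), len(x[0][0][0])
        count = batch * height * width
        means, variances = [], []
        for c in range(channels):
            vals = [x[n][c][h][w]
                    for n in range(batch)
                    for h in range(height)
                    for w in range(width)]
            m = sum(vals) / count
            means.append(m)
            variances.append(sum((v - m) ** 2 for v in vals) / count)
        return means, variances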
  • Since the data in the device is stored according to the three dimensions channel, height, and width, the device can, after reading the input neuron data x, directly perform the summation, mean, and variance operations.
  • The mean and variance used in the batch normalization operation can be precomputed values E[x] and var[x], stored and used as constants in the operation.
  • Alternatively, the mean and variance can be computed from the input data during the forward pass itself.
  • In that case, the arithmetic unit computes the mean and variance data in each iteration.
  • The input neurons pass through the arithmetic unit to compute the mean and variance, and the partial data are placed in the intermediate value buffer unit 44 for the subsequent computation of f(x) in that iteration.
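  • As a non-limiting illustration, the use of partial sums and partial square sums held in an intermediate buffer can be sketched as follows, recovering the variance as E[x^2] - E[x]^2 (all names are illustrative):

    class IntermediateBuffer:
        def __init__(self):
            self.count = 0
            self.partial_sum = 0.0
            self.partial_sq_sum = 0.0

        def accumulate(self, chunk):
            # Accumulate partial sum and partial square sum for one chunk.
            self.count += len(chunk)
            self.partial_sum += sum(chunk)
            self.partial_sq_sum += sum(v * v for v in chunk)

        def stats(self):
            mean = self.partial_sum / self.count
            var = self.partial_sq_sum / self.count - mean * mean
            return mean, var

    buf = IntermediateBuffer()
    for chunk in ([1.0, 2.0], [3.0, 4.0]):   # data arriving in pieces
        buf.accumulate(chunk)
    print(buf.stats())   # (2.5, 1.25)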
  • the present invention also provides an instruction set for performing an artificial neural network batch normalization operation on the aforementioned apparatus.
  • the instruction set includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:
  • the CONFIG instruction configures the various constants required by the current layer before the batch normalization computation begins;
  • the COMPUTE instruction performs the arithmetic and logic computation of the batch normalization process;
  • the IO instruction realizes reading input data required for calculation from the external address space and storing the data back to the external space after the calculation is completed;
  • the NOP instruction flushes the microinstructions currently held in all internal microinstruction queues of the device, ensuring that all instructions preceding the NOP have completed.
  • the NOP instruction itself does not contain any operations;
  • the JUMP instruction is responsible for controlling the jump of the next instruction address to be read from the instruction storage unit, and is used to implement the jump of the control flow;
  • the MOVE instruction is responsible for carrying data of an address in the internal address space of the device to another address in the internal address space of the device.
  • the process is independent of the operation unit and does not occupy the resources of the operation unit during execution.
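  • As a non-limiting illustration, a controller dispatching the six instruction types above could be sketched as follows (the encoding and handler behavior are invented for illustration; the patent specifies no binary format):

    from enum import Enum, auto

    class Op(Enum):
        CONFIG = auto(); COMPUTE = auto(); IO = auto()
        NOP = auto(); JUMP = auto(); MOVE = auto()

    def run(program):
        pc = 0
        while pc < len(program):
            op, arg = program[pc]
            if op is Op.JUMP:
                pc = arg                 # control-flow transfer to a new address
                continue
            if op is Op.NOP:
                pass                     # barrier: performs no operation itself
            else:
                print(f"execute {op.name} with {arg}")
            pc += 1

    run([(Op.IO, "load inputs"), (Op.CONFIG, "layer constants"),
         (Op.COMPUTE, "batch norm"), (Op.IO, "store outputs")])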
  • FIG. 3 illustrates an example block diagram of the forward and backward batch normalization operations of an artificial neural network in accordance with an embodiment of the present invention.
  • out = (in - middle) / middle
  • in is the input neuron data
  • out is the output neuron data.
  • Middle is the intermediate value in the operation process.
  • Here, an intermediate value is an intermediate result, such as a mean or a variance, that itself requires a reduction over the data being normalized.
  • The intermediate values [middle1, ..., middleN] of the normalization process are computed in parallel by the operation module 4 and stored in the intermediate value buffer unit 44.
  • The operation module 4 then uses the intermediate values middle to compute, in parallel, each output neuron datum out from the corresponding input neuron datum in, obtaining the final output vector.
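  • As a non-limiting illustration, this two-phase pattern (parallel computation of the intermediate values, then parallel elementwise normalization) can be sketched as follows, with a thread pool standing in for the parallel arithmetic module (all names are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    def normalize_groups(groups):
        with ThreadPoolExecutor() as pool:
            # Phase 1: one intermediate value per group (here, the mean),
            # computed in parallel.
            middles = list(pool.map(lambda g: sum(g) / len(g), groups))
            # Phase 2: elementwise out = (in - middle) / middle, in parallel.
            outs = list(pool.map(
                lambda gm: [(v - gm[1]) / gm[1] for v in gm[0]],
                zip(groups, middles)))
        return outs

    print(normalize_groups([[2.0, 4.0], [10.0, 30.0]]))
    # [[-0.333..., 0.333...], [-0.5, 0.5]]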
  • FIG. 4 illustrates a flow chart of the batch normalization forward operation in a training process in accordance with one embodiment.
  • the flowchart depicts the process of implementing the forward operation of the batch normalization operation shown in FIG. 3 using the apparatus and instruction set of the present invention.
  • In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.
  • In step S2, the operation starts: the controller unit 2 reads the IO instruction from the first address of the instruction storage unit 1, and according to the decoded microinstruction, the data access unit 3 reads all instructions related to the batch normalization forward operation from the external address space and caches them in the instruction storage unit 1.
  • In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit, and according to the decoded microinstruction, the data access unit 3 reads all data required by the operation module 4 (for example, the input neuron vector, the batch size, the learning parameters alpha and beta, the minimum quantity eps, the mean, the variance, etc.) from the external address space into the neuron buffer unit 43 of the operation module 4.
  • In step S4, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded microinstruction, the device configures the batch normalization operation, for example whether the forward computation uses precomputed mean and variance or computes them from the input.
  • In step S5, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit, and according to the decoded microinstruction, the operation module 4 reads the input neuron vector from the neuron buffer unit, computes the mean and variance of the input neurons, and stores them in the intermediate value buffer unit.
  • In step S6, according to the microinstructions decoded from the COMPUTE instruction, the arithmetic module 4 subtracts the mean from the data in the input neuron buffer unit, divides by the square root of the variance plus the minimum quantity eps, and stores the result back in the intermediate value buffer unit.
  • In step S7, according to the microinstruction decoded from the COMPUTE instruction, the arithmetic module 4 reads the learning parameter alpha from the neuron buffer unit 43, multiplies it by the intermediate value, adds the learning parameter beta, and writes the result back to the neuron buffer unit.
  • In step S8, the controller unit then reads the next IO instruction from the instruction storage unit, and according to the decoded microinstruction, the data access unit 3 stores the output neuron vector from the neuron buffer unit 43 to the designated address in the external address space, and the operation ends.
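  • As a non-limiting illustration, the data flow of steps S1 to S8 can be summarized in the following sketch, where loading and storing are represented by the function arguments and return value (all names are illustrative):

    import math

    def forward_pass(inputs, alpha, beta, eps=1e-5):
        # S1-S3: instructions and data are assumed already fetched; `inputs`
        # plays the role of the neuron buffer contents.
        # S4: configuration; here, statistics are computed dynamically.
        # S5: mean and variance go to the intermediate value buffer.
        mean = sum(inputs) / len(inputs)
        var = sum((x - mean) ** 2 for x in inputs) / len(inputs)
        # S6: subtract the mean and divide by sqrt(var + eps).
        normalized = [(x - mean) / math.sqrt(var + eps) for x in inputs]
        # S7: multiply by alpha and add beta, back to the neuron buffer.
        outputs = [alpha * v + beta for v in normalized]
        # S8: the outputs would now be written to the external address space.
        return outputs

    print(forward_pass([1.0, 2.0, 3.0, 4.0], alpha=2.0, beta=0.5))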
  • For the forward batch normalization operation during inference of the artificial neural network, the constant mean and variance configured in step S4 are used, so no dynamic calculation is needed each time; that is, step S5 is omitted, and everything else is the same as in FIG. 4.
  • The backward pass of the batch normalization operation is similar to the forward pass described above; the difference lies in the data being operated on.
  • Let the gradient passed in for one pixel be dl/dY, the gradient propagated backward be dl/dx, and the output of the forward pass be Y; the other parameters have the same meaning as in the forward pass.
  • The gradient propagated backward through batch normalization is dl/dx = (alpha / sqrt(var(x) + eps)) * (dl/dY - mean(dl/dY) - mean(dl/dY * Y) * Y), where mean denotes the averaging operation over the batch.
  • In the backward pass of batch normalization, the arithmetic unit first performs the reduction operations on the gradient data in parallel, for example computing the mean and the variance, and then completes the remaining parts of the formula in parallel.
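  • As a non-limiting illustration, the backward formula above transcribes directly into the following sketch (all names are illustrative):

    import math

    # dl/dx = (alpha / sqrt(var(x) + eps))
    #         * (dl/dY - mean(dl/dY) - mean(dl/dY * Y) * Y)
    def batch_norm_backward(dl_dY, Y, alpha, var_x, eps=1e-5):
        n = len(dl_dY)
        mean_dy = sum(dl_dY) / n                              # mean(dl/dY)
        mean_dy_y = sum(d * y for d, y in zip(dl_dY, Y)) / n  # mean(dl/dY * Y)
        scale = alpha / math.sqrt(var_x + eps)
        return [scale * (d - mean_dy - mean_dy_y * y)
                for d, y in zip(dl_dY, Y)]

    # Example with a batch of 3 elements.
    print(batch_norm_backward([0.1, -0.2, 0.3], [1.0, 0.0, -1.0],
                              alpha=1.0, var_x=1.0))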
  • In summary, by using a dedicated arithmetic unit for the batch normalization operation, the device better balances parallel and serial execution, avoiding both the weakness of CPU architectures (purely serial, and slow when the data size is large) and of GPU architectures (purely parallel, and poorly suited to the many reduction operations of normalization).
  • The data storage units and the arithmetic unit cooperate to better balance the serial and parallel parts of the normalization operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed are an apparatus and method for performing a batch normalization operation. The apparatus comprises an instruction storage unit (1), a controller unit (2), a data access unit (3), and an operation module (4), and implements the batch normalization operation in a multi-layer artificial neural network. The instruction storage unit (1) reads in instructions through the data access unit (3) and temporarily stores the instructions that were read. The controller unit (2) reads instructions from the instruction storage unit (1), decodes them into microinstructions controlling the behavior of the other units or modules, and distributes the microinstructions to those units or modules. The data access unit (3) accesses an external address space and loads and stores data. The operation module (4) performs the forward or backward pass of the batch normalization operation. The apparatus effectively improves support for forward and backward batch normalization operations in artificial neural networks.
PCT/CN2016/080695 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation WO2017185335A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080695 WO2017185335A1 (fr) 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080695 WO2017185335A1 (fr) 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation

Publications (1)

Publication Number Publication Date
WO2017185335A1 (fr) 2017-11-02

Family

ID=60160652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080695 WO2017185335A1 (fr) 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation

Country Status (1)

Country Link
WO (1) WO2017185335A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 Execution method of a convolution extension instruction and related products
CN110097181A (zh) * 2018-01-30 2019-08-06 上海寒武纪信息科技有限公司 Apparatus and method for performing the forward operation of an artificial neural network
CN111222632A (zh) * 2018-11-27 2020-06-02 中科寒武纪科技股份有限公司 Computing device, computing method, and related products
CN112789627A (zh) * 2018-09-30 2021-05-11 华为技术有限公司 Neural network processor, data processing method, and related device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-level parallel single-instruction multiple-data array processing system
CN105528191A (zh) * 2015-12-01 2016-04-27 中国科学院计算技术研究所 Data accumulation apparatus and method, and digital signal processing device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-level parallel single-instruction multiple-data array processing system
CN105528191A (zh) * 2015-12-01 2016-04-27 中国科学院计算技术研究所 Data accumulation apparatus and method, and digital signal processing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IOFFE, S ET AL.: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", COMPUTER SCIENCE, 2 March 2015 (2015-03-02), XP055266268 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 Execution method of a convolution extension instruction and related products
CN109754062B (zh) * 2017-11-07 2024-05-14 上海寒武纪信息科技有限公司 Execution method of a convolution extension instruction and related products
CN110097181A (zh) * 2018-01-30 2019-08-06 上海寒武纪信息科技有限公司 Apparatus and method for performing the forward operation of an artificial neural network
CN112789627A (zh) * 2018-09-30 2021-05-11 华为技术有限公司 Neural network processor, data processing method, and related device
CN112789627B (zh) * 2018-09-30 2023-08-22 华为技术有限公司 Neural network processor, data processing method, and related device
CN111222632A (zh) * 2018-11-27 2020-06-02 中科寒武纪科技股份有限公司 Computing device, computing method, and related products

Similar Documents

Publication Publication Date Title
  • KR102470264B1 (ko) Apparatus and method for executing backward training of a fully connected layer neural network
  • WO2017185391A1 (fr) Device and method for performing training of a convolutional neural network
  • CN109284825B (zh) Apparatus and method for performing LSTM operations
  • KR102486030B1 (ko) Apparatus and method for executing the forward operation of a fully connected layer neural network
  • CN107316078B (zh) Apparatus and method for performing artificial neural network self-learning operations
  • KR102175044B1 (ko) Apparatus and method for executing backward training of an artificial neural network
  • KR102402111B1 (ko) Apparatus and method for executing the forward operation of a convolutional neural network
  • WO2017185336A1 (fr) Apparatus and method for executing a pooling operation
  • WO2017185347A1 (fr) Apparatus and method for executing recurrent neural network and LSTM computations
  • CN106991476B (zh) Apparatus and method for performing the forward operation of an artificial neural network
  • WO2017185396A1 (fr) Device and method for use in executing matrix addition/subtraction operations
  • CN107886166B (zh) Apparatus and method for performing artificial neural network operations
  • WO2017185393A1 (fr) Apparatus and method for executing a vector inner product operation
  • WO2017185335A1 (fr) Apparatus and method for executing a batch normalization operation
  • CN107315568B (zh) Apparatus for performing vector logic operations
  • WO2017185392A1 (fr) Device and method for performing four fundamental vector arithmetic operations
  • WO2017185248A1 (fr) Apparatus and method for performing an artificial neural network machine learning operation
  • CN111651206A (zh) Apparatus and method for performing vector outer product operations
  • WO2018058452A1 (fr) Apparatus and method for performing an artificial neural network operation
  • CN107341546B (zh) Apparatus and method for performing batch normalization operations
  • WO2017185419A1 (fr) Apparatus and method for executing operations for the maximum and minimum values of vectors
  • CN111860772B (zh) Apparatus and method for performing an artificial neural network pooling operation

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899847

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899847

Country of ref document: EP

Kind code of ref document: A1