WO2018058452A1 - Apparatus and method for performing an artificial neural network operation - Google Patents

Apparatus and method for performing an artificial neural network operation

Info

Publication number
WO2018058452A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
unit
neural network
artificial neural
slave
Prior art date
Application number
PCT/CN2016/100870
Other languages
English (en)
Chinese (zh)
Inventor
陈天石
刘少礼
郭崎
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司 filed Critical 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/100870 priority Critical patent/WO2018058452A1/fr
Publication of WO2018058452A1 publication Critical patent/WO2018058452A1/fr

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention generally relates to an artificial neural network, and in particular to an apparatus and method for performing an artificial neural network operation, which can solve the problem of high power consumption of an artificial neural network.
  • Multi-layer artificial neural networks are widely used in the fields of pattern recognition, image processing, function approximation, and optimization computation. In recent years, they have received increasing attention from academia and industry due to their high recognition accuracy and good parallelism.
  • One known method of supporting multi-layer artificial neural network operations is to use a general purpose processor.
  • The method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • One disadvantage of this approach is that the performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations; when multiple general-purpose processors execute in parallel, communication between the processors becomes a performance bottleneck.
  • In addition, the general-purpose processor must decode multi-layer artificial neural network operations into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power consumption overhead.
  • Another known method of supporting multi-layer artificial neural network operations is to use a graphics processing unit (GPU).
  • The method supports the above algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units.
  • Since the GPU is a device dedicated to graphics and image operations and scientific computation, it has no special support for multi-layer artificial neural network operations; a large amount of front-end decoding work is still required to perform such operations, which brings a large additional overhead.
  • In addition, the GPU has only a small on-chip cache, so the model data (weights) of the multi-layer artificial neural network must be repeatedly transferred from off-chip; the off-chip bandwidth becomes the main performance bottleneck and brings a huge power consumption overhead.
  • Another known method of supporting multi-layer artificial neural network operations is the traditional neural network accelerator.
  • The method supports the above algorithm with an application-specific integrated circuit, using a dedicated register file and dedicated stream processing units.
  • However, this approach suffers from the high temperature and high power consumption of dedicated circuits, as well as unstable circuit operation caused by frequent changes in circuit current and voltage.
  • the present invention provides an apparatus and method for performing an artificial neural network operation, which solves the problems of excessive power consumption and unstable operation existing in the prior art.
  • An apparatus for performing an artificial neural network operation comprises a controller unit, a controlled module group, and a clock gating unit. The controlled module group is connected to the clock gating unit; under the control of the clock gating unit, the modules of the controlled module group that participate in the artificial neural network operation are turned on, and the modules that do not participate are turned off, reducing the power consumption of the artificial neural network operation.
  • The clock signals of the controlled module group are routed to the controlled module group via the clock gating unit, and the clock gating unit turns the controlled modules on and off by controlling their clock signals.
  • the controlled module group includes: a direct memory access unit, an interconnection module, a main operation module, and a plurality of slave operation modules.
  • The input clock of the direct memory access unit, the input clock of the interconnect module, the input clock of the main operation module, and the input clocks of the slave operation modules are respectively fed into the clock gating unit.
  • The output clock of the direct memory access unit produced by the clock gating unit is connected to the direct memory access unit, the output clock of the interconnect module is connected to the interconnect module, the output clock of the main operation module is connected to the main operation module, the output clocks of the slave operation modules are connected to the slave operation modules, and the control signal of the controller unit is connected to the clock gating unit.
  • Under the control of the control signal, the clock gating unit sets an output clock to 0 to turn off the module corresponding to that output clock, and enables an output clock to turn on the module corresponding to that output clock.
  • the main operation module includes: a main operation module operation unit, a main operation module data dependency determination unit, and a main operation module neuron cache unit.
  • The slave operation module includes: a slave operation module operation unit, a slave operation module data dependency determination unit, a slave operation module neuron cache unit, and a weight cache unit.
  • A method of performing an artificial neural network operation comprises: an artificial neural network initialization step; an artificial neural network calculation step; and a step of outputting the artificial neural network calculation result. In at least one of the steps, the modules participating in the artificial neural network operation are turned on and the modules not participating are turned off, reducing the power consumption of the artificial neural network operation.
  • The artificial neural network initialization step comprises: turning on the direct memory access unit and the main operation module and turning off the interconnect module and the slave operation modules, while the direct memory access unit reads the operation data of the main operation module from the external address space.
  • The artificial neural network calculation step comprises: turning on the interconnect module, the main operation module, and the slave operation modules while turning off the direct memory access unit; the main operation module sends the input neuron vector to the slave operation modules through the interconnect module;
  • turning on the interconnect module and the slave operation modules while turning off the direct memory access unit and the main operation module; each slave operation module obtains an intermediate result from the input neuron vector and the weight vector and returns the intermediate result to the interconnect module;
  • turning on the interconnect module while turning off the direct memory access unit, the main operation module, and the slave operation modules; the interconnect module assembles the intermediate results returned by the slave operation modules, stage by stage, into a complete intermediate result vector;
  • turning on the interconnect module and the main operation module while turning off the direct memory access unit and the slave operation modules; the interconnect module returns the intermediate result vector to the main operation module, and the main operation module obtains the output neuron vector from the intermediate result vector.
  • The step of outputting the artificial neural network calculation result comprises: turning on the direct memory access unit and the main operation module and turning off the interconnect module and the slave operation modules, while the direct memory access unit stores the output neuron vector of the main operation module to the external address space.
  • In each step, the modules that do not participate in the operation can be turned off and only the modules that participate are turned on, so that the modules of the device are not kept permanently in the on state, thereby reducing the power consumption of the device and achieving low-power artificial neural network operation, as illustrated by the sketch below.
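  • The following Python sketch is illustrative only and not part of the patent disclosure; the module and phase names are our own shorthand for the steps described above. It models each step as a set of active modules and gates every other module's clock off.

```python
# Illustrative sketch (not the patent's implementation): each step of the
# method enables only the modules it needs; all other clocks are gated off.
MODULES = {"dma", "interconnect", "master", "slaves"}

# Active-module sets taken from the initialization / calculation / output
# steps described above.
PHASES = {
    "init":            {"dma", "master"},
    "broadcast_input": {"interconnect", "master", "slaves"},
    "dot_products":    {"interconnect", "slaves"},
    "combine":         {"interconnect"},
    "activate":        {"interconnect", "master"},
    "write_output":    {"dma", "master"},
}

def clock_enables(phase: str) -> dict:
    """Return one clock-enable bit per module for the given phase."""
    active = PHASES[phase]
    return {m: (m in active) for m in MODULES}

for phase in PHASES:
    off = sorted(m for m, on in clock_enables(phase).items() if not on)
    print(f"{phase}: gated off -> {off}")
```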
  • FIG. 1 shows an example block diagram of an overall structure of an apparatus for performing an artificial neural network operation according to an embodiment of the present invention
  • FIG. 2 shows the structure of an interconnection module of an apparatus for performing an artificial neural network operation according to an embodiment of the present invention
  • FIG. 3 is a block diagram showing an example of a main operation module structure of an apparatus for performing an artificial neural network operation according to an embodiment of the present invention
  • FIG. 4 is a block diagram showing an example of a slave operation module structure of an apparatus for performing an artificial neural network operation according to an embodiment of the present invention
  • FIG. 5 is a block diagram showing the structure of a clock gating unit of an apparatus for performing an artificial neural network operation according to an embodiment of the present invention
  • FIG. 6 shows an example block diagram of an artificial neural network operation process in accordance with an embodiment of the present invention
  • Figure 7 shows a flow chart of a single layer artificial neural network operation in accordance with an embodiment of the present invention.
  • 61 - slave operation module operation unit; 62 - slave operation module data dependency determination unit; 63 - slave operation module neuron cache unit; 64 - weight cache unit.
  • the apparatus and method for performing artificial neural network operation of the present invention can perform operations on a single-layer or multi-layer artificial neural network, and can perform a forward process and a reverse process of an artificial neural network operation.
  • the unit that does not participate in the operation can be turned off by the Clock Gating unit to achieve the purpose of reducing power consumption.
  • Taking the operation of a single layer as an example, the operation can be divided into two parts.
  • First, a dot product of the input neuron vector and the weight vector is performed in each slave operation module.
  • Second, the main operation module applies the activation function to the calculation results of the slave operation modules to obtain the output neuron vector.
  • According to the received instruction, the device turns off the clock signals of the slave operation modules through the Clock Gating unit while the main operation module's part of the operation is performed, and turns off the clock signal of the main operation module while the slave operation modules' part is performed; by controlling the clock signal of each module of the device through the Clock Gating unit, dynamic switching of each module of the device is realized.
  • the device includes: an instruction cache unit 1, a controller unit 2, a controlled module group, and a Clock Gating unit 7.
  • the controlled module group includes: a direct memory access unit 3, an interconnect module 4, a main arithmetic module 5, and N slave arithmetic modules 6.
  • the instruction cache unit 1, the controller unit 2, the direct memory access unit 3, the interconnection module 4, the main operation module 5, the slave operation module 6 and the Clock Gating unit 7 can all be implemented by hardware circuits such as, but not limited to, an application specific integrated circuit ASIC.
  • The instruction cache unit 1, the controller unit 2, the direct memory access unit 3, the interconnect module 4, the main operation module 5, the slave operation modules 6, and the Clock Gating unit 7 are integrated in a single chip, which distinguishes the device from computing devices based on a CPU or a GPU.
  • The Clock Gating unit 7 of the device of the present invention can dynamically turn each module in the controlled module group on and off. Specifically, the modules of the controlled module group that participate in the operation are turned on and the modules that do not participate are turned off, reducing the power consumption of artificial neural network operations.
  • the instruction cache unit 1 reads in an instruction through the direct memory access unit 3 and caches the read instruction.
  • the controller unit 2 reads the instructions from the instruction cache unit 1 and translates the instructions into micro-instructions that control the behavior of the controlled module group and the Clock Gating unit 7.
  • The direct memory access unit 3 can access the external address space, writing data from memory directly to the instruction cache unit, the main operation module, and the corresponding data cache units of the slave operation modules, or reading data from those cache units back to memory, thereby completing the loading and storing of data.
  • The Clock Gating unit 7 is connected to each module of the controlled module group: the clock signal of the direct memory access unit 3, the clock signal of the interconnect module 4, the clock signal of the main operation module 5, and the clock signals of the slave operation modules 6 all pass through the Clock Gating unit 7.
  • The Clock Gating unit controls the turning on and off of the clock signals of the respective modules of the controlled module group according to the microinstructions issued by the controller unit.
  • The interconnect module 4 connects the main operation module and the slave operation modules, and can be implemented as different interconnection topologies, such as a tree structure, a ring structure, a mesh structure, a hierarchical interconnect, a bus structure, and the like.
  • FIG. 2 schematically shows an embodiment of an interconnection module 4: an H-tree structure.
  • the interconnection module 4 constitutes a data path between the main operation module 5 and the plurality of slave operation modules 6, and has an H-tree structure.
  • At the beginning of each layer's artificial neural network operation, the H-tree module transfers the input data of the layer from the main operation module to all the slave operation modules; after the computation in the slave operation modules is completed, the H-tree module adds the output partial sums of the respective slave operation modules pairwise, stage by stage, to obtain the output of the layer.
  • the H-tree is a binary tree path composed of multiple nodes.
  • Each node sends the upstream data to the downstream two nodes in the same way, and the data returned by the two downstream nodes are combined and returned to the upstream node.
  • In the initial calculation stage of each layer of the artificial neural network, the input neuron vector in the main operation module 5 is sent to each slave operation module 6 through the H-tree module 4; when the calculation of the slave operation modules 6 is completed, the output neuron values of the slave operation modules are assembled stage by stage in the H-tree module into a complete vector of output neurons, which serves as the intermediate result vector.
  • The intermediate result vector is divided into segments of N elements each, and the i-th slave operation module calculates the i-th element of each segment.
  • the N elements are assembled into a vector of length N through the H-tree module and returned to the main arithmetic module. If the artificial neural network has only N output neurons, each slave arithmetic module only needs to output a single neuron value. If the artificial neural network has m*N output neurons, each slave arithmetic module needs to output m neurons. value.
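  • As an illustrative aid (not from the patent), the stage-by-stage pairwise combination performed by such a binary-tree interconnect can be sketched as follows, assuming the number of slave modules is a power of two; the combine operation is addition when partial sums are accumulated, and concatenation when an intermediate result vector is assembled.

```python
# Sketch of the H-tree's stage-by-stage pairwise combination of the values
# returned by the slave modules (assumes a power-of-two number of leaves).
def h_tree_reduce(values, combine=lambda a, b: a + b):
    level = list(values)
    while len(level) > 1:
        # each internal node merges the data returned by its two children
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

partial_sums = [1.0, 2.0, 3.0, 4.0]     # one partial sum per slave module
print(h_tree_reduce(partial_sums))       # 10.0

segments = [[1], [2], [3], [4]]          # per-module output neuron values
print(h_tree_reduce(segments))           # [1, 2, 3, 4] (list + concatenates)
```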
  • FIG. 3 shows an example block diagram of the structure of a main arithmetic module 5 of an apparatus for performing an artificial neural network operation according to an embodiment of the present invention.
  • The main operation module 5 performs the subsequent calculations of each layer, such as activation and bias addition, using the intermediate output vector of the layer. It includes: a main operation module operation unit 51, a main operation module data dependency determination unit 52, and a main operation module neuron cache unit 53.
  • The main operation module neuron cache unit 53 caches the input data and output data used by the main operation module 5 during the operation; the main operation module operation unit 51 implements the various calculation functions of the main operation module 5.
  • The main operation module data dependency determination unit 52 is the port through which the main operation module operation unit 51 reads and writes the main operation module neuron cache unit 53, and it ensures the read/write consistency of the data in the main operation module neuron cache unit.
  • The main operation module data dependency determination unit 52 is also responsible for sending the read data to the slave operation modules 6 through the interconnect module 4, while the output data of the slave operation modules 6 is sent directly to the main operation module operation unit 51 through the interconnect module 4.
  • the microinstruction outputted by the controller unit 2 is sent to the main operation module operation unit 51 and the main operation module data dependency judgment unit 52 to control the behavior thereof.
  • Each slave operation module 6 uses the same input and its own weight data to compute the corresponding output partial sum in parallel. It includes: a slave operation module operation unit 61, a slave operation module data dependency determination unit 62, a slave operation module neuron cache unit 63, and a weight cache unit 64.
  • The slave operation module operation unit 61 receives the microinstructions issued by the controller unit 2 and performs arithmetic and logic operations.
  • The slave operation module data dependency determination unit 62 performs the read and write operations on the slave operation module neuron cache unit 63 during the operation.
  • Before performing read and write operations, the slave operation module data dependency determination unit 62 first ensures that there is no read/write consistency conflict between the data used by the instructions. For example, all microinstructions destined for the data dependency determination unit 62 are stored in an instruction queue inside the unit; if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has executed, as in the sketch below.
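  • The following Python fragment is an illustrative sketch of this rule only (the queue layout and address-range encoding are our assumptions, not the patent's format): a read may not issue while an earlier pending write overlaps its address range.

```python
# Sketch of the read-after-write consistency check described above.
from collections import deque

class DependencyQueue:
    def __init__(self):
        self.pending_writes = deque()        # queued (start, end) write ranges

    def push_write(self, start, end):
        self.pending_writes.append((start, end))

    def retire_write(self):
        self.pending_writes.popleft()        # the oldest write has executed

    def read_may_issue(self, start, end):
        """A read must wait while any pending write range overlaps it."""
        return not any(ws < end and start < we
                       for ws, we in self.pending_writes)

q = DependencyQueue()
q.push_write(0, 64)
print(q.read_may_issue(32, 48))   # False: overlaps the pending write
q.retire_write()
print(q.read_may_issue(32, 48))   # True: the dependency has been resolved
```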
  • The slave operation module neuron cache unit 63 caches the input neuron vector and the output neuron value of its slave operation module 6.
  • The weight cache unit 64 caches the weight data required by its slave operation module 6 during the operation. For each slave operation module 6, the weight cache unit stores only the weight vectors between all input neurons and the portion of the output neurons assigned to that module. Taking the fully connected layer as an example, the output neurons are segmented according to the number N of slave operation modules, and the weight vector corresponding to the n-th output neuron of each segment is stored in the n-th slave operation module (see the sketch below).
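  • A minimal sketch of this weight partitioning, assuming the weight matrix is stored with one row per output neuron (the function name and shapes are illustrative, not the patent's layout):

```python
# Slave module n holds only the weight rows that produce the n-th output
# neuron of every segment of N output neurons, i.e. rows n, n+N, n+2N, ...
import numpy as np

def partition_weights(w, n_slaves):
    """w has shape (num_outputs, num_inputs); returns one block per slave."""
    return [w[n::n_slaves, :] for n in range(n_slaves)]

w = np.arange(8 * 3).reshape(8, 3)     # 8 output neurons, 3 input neurons
parts = partition_weights(w, n_slaves=4)
print(parts[1])                         # rows 1 and 5: slave module 1's weights
```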
  • The slave operation modules 6 can execute the first half of each layer's artificial neural network operation in parallel.
  • Each produces a partial sum to be accumulated, and the partial sums are added together stage by stage in the interconnect module 4 to obtain the final result.
  • Each slave operation module 6 calculates an output neuron value, and all the output neuron values of the slave operation modules are assembled in the interconnect module 4 into an intermediate result vector.
  • Each slave operation module 6 only needs to calculate the output neuron values corresponding to that module within the intermediate result vector y.
  • The interconnect module 4 combines the output neuron values of all slave operation modules 6 to obtain the final intermediate result vector y.
  • The main operation module 5 performs the second half of the operation based on the intermediate result vector y, for example adding the bias, pooling (such as MAXPOOLING or AVGPOOLING), activation, and sampling.
  • For the reverse operation of a layer, the output gradient is computed as out_gradient = w * in_gradient, where out_gradient and in_gradient are column vectors; each slave operation module computes only the product of the corresponding partial scalar elements of in_gradient with the corresponding columns of the weight matrix w.
  • Each output vector obtained in this way is a partial sum of the final result, and these partial sums are added together stage by stage in the interconnect module to obtain the final result. The computation thus becomes a process of parallel partial-sum computation followed by accumulation.
  • Each slave operation module 6 calculates a partial sum of the output gradient vector, and the summation is completed in the interconnect module 4 to obtain the final output gradient vector.
  • At the same time, each slave operation module 6 multiplies the input gradient vector by the output values of each layer from the forward operation to calculate the gradient of the weights, so as to update the weights stored in that slave operation module 6.
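  • As an illustration only (the column slicing and shapes are our assumptions), the partial-sum decomposition of out_gradient = w * in_gradient can be sketched as:

```python
# Each slave module multiplies its slice of in_gradient by the matching
# columns of w; the interconnect accumulates the partial output gradients.
import numpy as np

def backward_partial_sums(w, in_gradient, n_slaves):
    col_groups = np.array_split(np.arange(w.shape[1]), n_slaves)
    partials = [w[:, cols] @ in_gradient[cols] for cols in col_groups]
    return sum(partials)            # added stage by stage in the interconnect

w = np.random.rand(4, 6)
in_grad = np.random.rand(6)
assert np.allclose(backward_partial_sums(w, in_grad, 3), w @ in_grad)
```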
  • Forward operation and reverse training are the two main processes of neural network algorithms. To train (update) the weights in the network, the network first computes the forward output of the input vector under the current weights; this is the forward process. Then the weights of each layer are trained (updated) layer by layer according to the difference between the output value and the label value of the input vector itself; this is the reverse process.
  • During the forward operation, the output vector of each layer and the derivative value of the activation function are saved; these data are required by the reverse training process, so they are guaranteed to exist when reverse training begins.
  • The output value of each layer in the forward operation is data that exists at the beginning of the reverse operation; it can be cached in the main operation module through the direct memory access unit and sent to the slave operation modules through the interconnect.
  • The main operation module 5 performs subsequent calculations based on the output gradient vector, for example multiplying the output gradient vector by the derivative of the activation function from the forward operation to obtain the input gradient value of the next layer.
  • The derivative of the activation function from the forward operation is data that exists at the beginning of the reverse operation, and can be cached in the main operation module through the direct memory access unit.
  • FIG. 5 shows an example block diagram of the structure of a Clock Gating unit 7 in an apparatus for performing an artificial neural network operation according to an embodiment of the present invention.
  • The Clock Gating unit sets an output clock to 0 according to the control signal of the controller unit to turn off the unit corresponding to that output clock, or leaves the output clock unchanged to keep the corresponding unit on. Depending on the control signals, the modules of the controlled module group are dynamically turned on and off during the artificial neural network operation.
  • The input clock of the direct memory access unit 3, the input clock of the interconnect module 4, the input clock of the main operation module 5, and the input clocks of the slave operation modules 6 are respectively connected to the Clock Gating unit 7. The output clock of the direct memory access unit 3 produced by the Clock Gating unit 7 is connected to the direct memory access unit 3, the output clock of the interconnect module 4 is connected to the interconnect module 4, the output clock of the main operation module 5 is connected to the main operation module 5, and the output clocks of the slave operation modules 6 are connected to the slave operation modules 6. The control signal of the controller unit 2 is connected to the Clock Gating unit 7.
  • Under the control of the control signal, the Clock Gating unit 7 sets the output clocks of the modules in the controlled module group that do not participate in the operation to 0, turning those modules off, and keeps the output clocks of the participating modules running, keeping those modules on.
  • For example, when the main operation module 5 participates in an operation and the slave operation modules 6 do not, the output clocks of the slave operation modules 6 are set to 0, turning the slave operation modules 6 off, while the output clock of the main operation module 5 remains unchanged, keeping the main operation module 5 on. In this way, the modules of the controlled module group are not kept permanently in the on state during the operation, thereby reducing the power consumption of the device and enabling low-power artificial neural network operation.
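  • In real hardware such a gate is typically a latch-plus-AND cell; the following behavioral sketch (ours, not the patent's circuit) captures only the rule stated above: the output clock follows the input clock while enabled and is held at 0 otherwise.

```python
# Behavioral sketch of one clock gate of the Clock Gating unit.
class ClockGate:
    def __init__(self):
        self.enable = True

    def set_enable(self, enable: bool):
        self.enable = enable            # driven by a controller microinstruction

    def output_clock(self, input_clock: int) -> int:
        return input_clock if self.enable else 0

gate = ClockGate()
gate.set_enable(False)                  # the module does not participate
print([gate.output_clock(c) for c in (0, 1, 0, 1)])   # [0, 0, 0, 0]
gate.set_enable(True)                   # the module participates again
print([gate.output_clock(c) for c in (0, 1, 0, 1)])   # [0, 1, 0, 1]
```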
  • the apparatus can also perform an artificial neural network operation using an instruction set.
  • the instruction set includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, the MOVE instruction, and the CLOCKGATING instruction, where:
  • the CONFIG command configures various constants required for the current layer artificial neural network operation before each layer of artificial neural network operation begins;
  • the COMPUTE instruction completes the arithmetic and logic calculations of each layer of the artificial neural network;
  • the IO instruction implements input data required for reading from the external address space and stores the data back to the external address space after the operation is completed;
  • the NOP instruction clears the microinstructions currently loaded into all internal microinstruction cache queues of the device, ensuring that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operations;
  • the JUMP instruction implements a jump of the next instruction address that the controller will read from the instruction cache unit to implement a jump of the control flow;
  • the MOVE instruction moves data at one address in the device's internal address space to another address in the internal address space; this process is independent of the main operation module and the slave operation modules and does not occupy their resources during execution;
  • the CLOCKGATING instruction turns units on and off; through this instruction the device can turn each unit on or off, and the units to be turned on or off can also be selected automatically according to the dependencies of the instructions, thereby realizing automatic opening and closing of the units.
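  • A plausible decode loop for this instruction set is sketched below for illustration only; the opcode names come from the text, while the operand encoding (an enable map carried by CLOCKGATING) is our assumption, not the patent's instruction format.

```python
# Illustrative sketch of dispatching the seven-instruction set listed above.
from enum import Enum, auto

class Op(Enum):
    CONFIG = auto(); COMPUTE = auto(); IO = auto()
    NOP = auto(); JUMP = auto(); MOVE = auto(); CLOCKGATING = auto()

def run(program, enables):
    """enables: module name -> bool, mutated by CLOCKGATING instructions."""
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op is Op.CLOCKGATING:
            enables.update(arg)          # e.g. {"slaves": False}
        elif op is Op.JUMP:
            pc = arg                     # jump target address
            continue
        # CONFIG / COMPUTE / IO / NOP / MOVE would be issued to modules here
        pc += 1

enables = {"dma": True, "interconnect": True, "master": True, "slaves": True}
run([(Op.CLOCKGATING, {"slaves": False}), (Op.NOP, None)], enables)
print(enables["slaves"])                 # False: slave clocks are gated off
```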
  • FIG. 6 shows an example block diagram of an artificial neural network forward operation process in accordance with an embodiment of the present invention.
  • In each slave operation module 6, the input neuron vector undergoes a dot product operation with the weight vector of that slave operation module 6 to obtain the corresponding output neuron value, and all of the output neuron values together constitute an intermediate result vector. While the device performs this part of the operation, the output clocks of the other modules in the controlled module group are set to 0 by the Clock Gating unit, and those modules are turned off.
  • The Clock Gating unit then sets the output clocks of the slave operation modules to 0, closing the slave operation modules, and enables the output clocks of the main operation module and the interconnect module, opening them; the intermediate result vector is added to the bias vector and passed through the activation function to obtain the final output neuron vector, out = f(w * in + b),
  • where w is the weight matrix, b is the bias vector, and f is the activation function.
  • the weight vector of each slave arithmetic module 6 is a column vector corresponding to the slave arithmetic module 6 in the weight matrix.
  • The interconnect module sends the input neuron vector [in0, ..., inN] to all slave operation modules, where it is temporarily stored in the slave operation module neuron cache unit. The i-th slave operation module computes the dot product of its corresponding weight vector [w_i0, ..., w_iN] with the input neuron vector. The outputs of the slave operation modules are combined into the complete output neuron vector through the interconnect module and returned to the main operation module, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN]. Throughout this process, only the modules of the controlled module group that participate in the operation are kept open by the Clock Gating unit 7, and the other modules that do not participate are closed, realizing low-power execution of the artificial neural network operation. A sketch of this dataflow follows.
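  • The following sketch is illustrative only: it assumes each slave module holds one row of the weight matrix (the patent describes each module's weights as the corresponding column of w, i.e. the same data with w stored transposed), and np.tanh stands in for an unspecified activation function f.

```python
# Sketch of the FIG. 6 forward dataflow: broadcast the input neuron vector,
# compute one dot product per slave module, assemble the intermediate result
# vector, then add the bias and activate in the main operation module.
import numpy as np

def forward_layer(w, bias, in_vec, activation=np.tanh):
    # interconnect: broadcast in_vec to every slave module
    intermediate = np.array([w_i @ in_vec for w_i in w])  # one dot product each
    # main operation module: add the bias vector, then apply the activation
    return activation(intermediate + bias)

w = np.random.rand(4, 3)        # row i = weight vector of slave module i
bias = np.zeros(4)
in_vec = np.random.rand(3)
out = forward_layer(w, bias, in_vec)
assert np.allclose(out, np.tanh(w @ in_vec + bias))       # out = f(w*in + b)
```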
  • FIG. 7 is a flow chart showing a forward operation of a low power single layer artificial neural network in accordance with one embodiment.
  • the flowchart depicts the process of implementing the forward operation of a single layer artificial neural network as shown in FIG. 6 using the apparatus and instruction set of the present invention.
  • In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 1.
  • In step S2, the operation starts: the controller unit 2 reads the IO instruction from the first address of the instruction cache unit 1, and according to the translated microinstruction, the direct memory access unit 3 reads all of the corresponding artificial neural network operation instructions from the external address space and caches them in the instruction cache unit 1.
  • In step S3, the Clock Gating module 7 maintains the output clocks of the direct memory access unit 3 and the main operation module 5, opening the direct memory access unit 3 and the main operation module 5, and sets the output clocks of the interconnect module 4 and the slave operation modules 6 to 0, closing the interconnect module 4 and the slave operation modules 6;
  • the controller unit 2 then reads the next IO instruction from the instruction cache unit 1, and according to the translated microinstruction, the direct memory access unit 3 reads the operation data required by the main operation module 5 from the external address space, including the input neuron vector, interpolation table, constant table, offset vector, and the like, and stores the operation data in the main operation module neuron cache unit 53 of the main operation module 5;
  • In step S4, the Clock Gating module 7 maintains the output clock of the direct memory access unit 3 and turns on the output clocks of the slave operation modules 6, opening the direct memory access unit 3 and the slave operation modules 6, and sets the output clocks of the interconnect module 4 and the main operation module 5 to 0, closing the interconnect module 4 and the main operation module 5;
  • the controller unit 2 then reads the next IO instruction from the instruction cache unit; according to the translated microinstruction, the direct memory access unit 3 reads the weight matrix data required by the slave operation modules 6 from the external address space;
  • In step S5, the Clock Gating module 7 turns on the output clock of the main operation module 5 and maintains the output clocks of the slave operation modules 6, opening the main operation module 5 and the slave operation modules 6, and sets the output clocks of the direct memory access unit 3 and the interconnect module 4 to 0, closing the direct memory access unit 3 and the interconnect module 4;
  • the controller unit 2 then reads the next CONFIG instruction from the instruction cache unit and, according to the translated microinstruction, configures the various constants required by this layer's artificial neural network operation;
  • for example, the main operation module operation unit 51 and the slave operation module operation unit 61 configure the registers inside the main operation module and the slave operation modules (for example, the main operation module neuron cache unit 53, the slave operation module neuron cache unit 63, and the weight cache unit 64) according to the parameters in the microinstruction, including the precision setting of this layer's artificial neural network operation and the data of the activation function (for example, the precision bits of this layer's operation, the rang parameter of the LRN layer algorithm, the reciprocal of the window size of the AveragePooling layer algorithm, and so on).
  • In step S6, the Clock Gating module 7 maintains the output clock of the main operation module 5 and turns on the output clocks of the interconnect module 4 and the slave operation modules 6, opening the interconnect module 4, the main operation module 5, and the slave operation modules 6, and sets the output clock of the direct memory access unit 3 to 0, closing the direct memory access unit 3;
  • the controller unit 2 then reads the next COMPUTE instruction from the instruction cache unit 1; according to the translated microinstruction, the main operation module 5 sends the input neuron vector to each slave operation module 6 through the interconnect module 4, where it is saved in the slave operation module neuron cache unit 63 of each slave operation module 6;
  • In step S7, the Clock Gating module 7 maintains the output clocks of the interconnect module 4 and the slave operation modules 6, opening the interconnect module 4 and the slave operation modules 6, and sets the output clocks of the direct memory access unit 3 and the main operation module 5 to 0, closing the direct memory access unit 3 and the main operation module 5;
  • the operation unit 61 of each slave operation module 6 reads the weight vector (the column vector corresponding to that slave operation module 6 in the weight matrix) from the weight cache unit 64, reads the input neuron vector from the slave operation module neuron cache unit, completes the dot product operation of the weight vector and the input neuron vector to obtain an intermediate result, and returns the intermediate result to the interconnect module 4.
  • In step S8, the Clock Gating module 7 maintains the output clock of the interconnect module 4, opening the interconnect module 4, and sets the output clocks of the direct memory access unit 3, the main operation module 5, and the slave operation modules 6 to 0, closing the direct memory access unit 3, the main operation module 5, and the slave operation modules 6;
  • the interconnect module 4 combines the intermediate results returned by the slave operation modules 6, stage by stage, into a complete intermediate result vector.
  • In step S9, the Clock Gating module 7 maintains the output clock of the interconnect module 4 and turns on the output clock of the main operation module 5, opening the interconnect module 4 and the main operation module 5, and sets the output clocks of the direct memory access unit 3 and the slave operation modules 6 to 0, closing the direct memory access unit 3 and the slave operation modules 6;
  • the interconnect module 4 returns the intermediate result vector to the main operation module 5; according to the microinstruction decoded from the COMPUTE instruction, the main operation module 5 reads the offset vector from the main operation module neuron cache unit 53, adds it to the intermediate result vector returned by the interconnect module 4, applies the activation function to the sum to obtain the final output neuron vector, and writes the final output neuron vector back to the main operation module neuron cache unit 53;
  • In step S10, the Clock Gating module 7 turns on the output clock of the direct memory access unit 3 and maintains the output clock of the main operation module 5, opening the direct memory access unit 3 and the main operation module 5, and sets the output clocks of the interconnect module 4 and the slave operation modules 6 to 0, closing the interconnect module 4 and the slave operation modules 6;
  • the controller unit then reads the next IO instruction from the instruction cache unit; according to the translated microinstruction, the direct memory access unit 3 stores the output neuron vector in the main operation module neuron cache unit 53 to the designated address in the external address space, and the operation ends. The overall gating pattern of these steps is summarized in the sketch below.
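  • For illustration only, the on/off schedule of steps S3 through S10 can be replayed as data (the module names and step labels are our shorthand, not the patent's):

```python
# The S3-S10 gating schedule above as data: 1 = clock on, 0 = clock gated off.
# Column order: (dma, interconnect, master, slaves).
SCHEDULE = {
    "S3_load_master_data": (1, 0, 1, 0),
    "S4_load_weights":     (1, 0, 0, 1),
    "S5_config":           (0, 0, 1, 1),
    "S6_broadcast_input":  (0, 1, 1, 1),
    "S7_dot_products":     (0, 1, 0, 1),
    "S8_combine_results":  (0, 1, 0, 0),
    "S9_bias_activation":  (0, 1, 1, 0),
    "S10_store_output":    (1, 0, 1, 0),
}

for step, (dma, icn, master, slaves) in SCHEDULE.items():
    print(f"{step:20s} dma={dma} interconnect={icn} "
          f"master={master} slaves={slaves}")
```
At no point are all four controlled modules clocked simultaneously, which is the source of the power saving described above.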
  • For a multi-layer artificial neural network, the implementation process is similar to that of a single-layer artificial neural network.
  • After the previous layer finishes executing, the operation instruction of the next layer uses the output neuron vector address of the previous layer, stored in the main operation module, as the input neuron vector address of this layer.
  • Similarly, the weight matrix address and the offset vector address in the instruction are changed to the addresses corresponding to this layer.
  • The device for performing artificial neural network operations of the present invention can be integrated into a circuit board in the form of a chip or through IP core licensing, and can be applied to the following fields (including but not limited to): data processing; electronic products such as robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, camcorders, projectors, watches, headsets, mobile storage, and wearable devices; various types of transportation such as aircraft, ships, and vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound machines, electrocardiographs, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

An apparatus and method for performing an artificial neural network operation, the apparatus comprising a clock gating unit (7), an instruction cache unit (1), a controller unit (2), a direct memory access unit (3), an interconnect module (4), a main operation module (5), and a plurality of slave operation modules (6). The apparatus can perform multi-layer artificial neural network operations with low power consumption. During the artificial neural network operation, the clock gating unit (7) controls, according to an instruction, the clock signals of the instruction cache unit (1), the controller unit (2), the direct memory access unit (3), the interconnect module (4), the main operation module (5), and the plurality of slave operation modules (6) so that each is either kept active or set to 0; the clock signals of the units relevant to the current operation are thus preserved while the clock signals of the irrelevant units are set to 0, which reduces the number of modules involved in the operation and allows the artificial neural network to operate with low power consumption.
PCT/CN2016/100870 2016-09-29 2016-09-29 Apparatus and method for performing an artificial neural network operation WO2018058452A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/100870 WO2018058452A1 (fr) 2016-09-29 2016-09-29 Apparatus and method for performing an artificial neural network operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/100870 WO2018058452A1 (fr) 2016-09-29 2016-09-29 Apparatus and method for performing an artificial neural network operation

Publications (1)

Publication Number Publication Date
WO2018058452A1 (fr) 2018-04-05

Family

ID=61762295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/100870 WO2018058452A1 (fr) 2016-09-29 2016-09-29 Apparatus and method for performing an artificial neural network operation

Country Status (1)

Country Link
WO (1) WO2018058452A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101184092A (zh) * 2007-12-10 2008-05-21 华中科技大学 Environment-aware reconfigurable mobile terminal communication processor
CN104021420A (zh) * 2014-05-23 2014-09-03 电子科技大学 Programmable discrete Hopfield network circuit
CN105184366A (zh) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexed general-purpose neural network processor
CN105930902A (zh) * 2016-04-18 2016-09-07 中国科学院计算技术研究所 Neural network processing method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205118B2 (en) * 2017-04-17 2021-12-21 Microsoft Technology Licensing, Llc Power-efficient deep neural network module configured for parallel kernel and parallel input processing
CN111062469A (zh) * 2018-10-17 2020-04-24 上海寒武纪信息科技有限公司 Computing device and related product
CN111062469B (zh) * 2018-10-17 2024-03-05 上海寒武纪信息科技有限公司 Computing device and related product
CN109670581A (zh) * 2018-12-21 2019-04-23 北京中科寒武纪科技有限公司 Computing device and board
CN110347506A (zh) * 2019-06-28 2019-10-18 Oppo广东移动通信有限公司 LSTM-based data processing method and apparatus, storage medium, and electronic device
CN110347506B (zh) * 2019-06-28 2023-01-06 Oppo广东移动通信有限公司 LSTM-based data processing method and apparatus, storage medium, and electronic device

Similar Documents

Publication Publication Date Title
WO2017185387A1 (fr) Method and device for executing a forward operation of a fully connected layer neural network
CN109284825B (zh) Apparatus and method for performing LSTM operations
CN107316078B (zh) Apparatus and method for performing artificial neural network self-learning operations
WO2017185347A1 (fr) Apparatus and method for executing recurrent neural network and LSTM computations
WO2017124641A1 (fr) Device and method for executing reverse training of an artificial neural network
KR102470264B1 (ko) Apparatus and method for executing reverse training of a fully connected layer neural network
WO2017124642A1 (fr) Device and method for executing a forward computation of an artificial neural network
CN107886166B (zh) Apparatus and method for performing artificial neural network operations
WO2017185391A1 (fr) Device and method for performing training of a convolutional neural network
CN111260025B (zh) Apparatus and operation method for performing LSTM neural network operations
CN111860813B (zh) Apparatus and method for performing a convolutional neural network forward operation
WO2019218896A1 (fr) Computing method and related product
CN111291880A (zh) Computing device and computing method
WO2018120016A1 (fr) Apparatus for executing LSTM neural network operation, and operational method
WO2018058452A1 (fr) Apparatus and method for performing an artificial neural network operation
WO2017185336A1 (fr) Apparatus and method for executing a pooling operation
WO2017185248A1 (fr) Apparatus and method for performing a self-learning operation of an artificial neural network
WO2017185335A1 (fr) Apparatus and method for executing a batch normalization operation
WO2017177446A1 (fr) Discrete data representation-supporting apparatus and method for reverse training of an artificial neural network
WO2017181336A1 (fr) Apparatus and method for a "maxout" layer operation
CN111178492B (zh) Computing device, related product, and computing method for executing an artificial neural network model
CN111368967B (zh) Neural network computing device and method
CN111860814A (zh) Apparatus and method for performing batch normalization operations
CN111368990B (zh) Neural network computing device and method
CN111368986B (zh) Neural network computing device and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16917206

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16917206

Country of ref document: EP

Kind code of ref document: A1