CN113298245B - Multi-precision neural network computing device and method based on data flow architecture - Google Patents

Multi-precision neural network computing device and method based on data flow architecture

Info

Publication number
CN113298245B (application CN202110631644.0A)
Authority
CN (China)
Prior art keywords
precision
instruction
calculation
data
low
Legal status
Active
Application number
CN202110631644.0A
Other languages
Chinese (zh)
Other versions
CN113298245A (en)
Inventor
吴欣欣
范志华
欧焱
李文明
叶笑春
范东睿
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN202110631644.0A
Publication of CN113298245A
Application granted
Publication of CN113298245B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The embodiment of the invention provides a multi-precision neural network computing device based on a dataflow architecture, comprising: a microcontroller and a PE array connected to it, wherein each PE of the PE array is provided with computing units of the original precision and of a plurality of precisions lower than the original precision; the lower-precision computing units contain more parallel multiply accumulators so as to fully utilize the bandwidth of the on-chip network, and each low-precision computing unit in each PE is provided with sufficient registers to avoid data overflow; the microcontroller is configured to: in response to an acceleration request for a specific convolutional neural network, control the original-precision or low-precision computing units in the PE array that match the precision of that convolutional neural network to execute the corresponding operations of the convolution operations and store intermediate computation results to the corresponding registers. In this way, convolutional neural networks of different precisions can be accelerated, computation latency and energy consumption are reduced, and the user experience is improved.

Description

Multi-precision neural network computing device and method based on data flow architecture
Technical Field
The present invention relates to the field of computers, in particular to acceleration devices or accelerators for neural network model computation, and more particularly to a multi-precision neural network computing device and method based on a dataflow architecture.
Background
Deep Neural Networks (DNNs) have shown significant advantages in many application areas, from computer vision to natural language processing and speech recognition, and the support of powerful hardware has made it easier to train DNN models. With the diversification and sophistication of applications, DNN models have also become more complex, with increasingly deep structures and increasingly large numbers of parameters. Complex DNN models are more expressive for capturing features and non-linear input-output relationships, and thus achieve excellent result accuracy.
Although these neural networks are very powerful, the large number of weights occupies a large amount of storage and memory bandwidth, and the DNN model also produces a large amount of intermediate data during operation; moving this data between the computation units and the buffers is limited by internal bandwidth and dissipates considerable energy. It therefore becomes increasingly difficult to deploy deep neural networks on a system.
The explosive growth of applications of convolutional neural networks with millions or even billions of parameters poses a huge challenge to terminal devices, so customized DNN models must be obtained for the limited resources of mobile devices. More and more researchers have therefore turned to studying how to reduce this large number of parameters; it has been shown that the parameterization of several deep learning models has significant redundancy, and this redundancy of the network can be exploited to reduce the number of parameters. In the compression and acceleration of DNN models, low precision and quantization are an effective method of compressing the model. Reducing the precision can reduce memory usage and the required external memory bandwidth, and can also reduce energy consumption, allowing more multiply-accumulate (MAC) units to be placed on chip and greater energy efficiency to be achieved. For example, a 16-bit fixed-point multiplication consumes about 6.2 times less energy than a 32-bit floating-point multiplication under 45 nm process technology.
The operation of a neural network mainly consists of multiply-add operations on the input feature map (also called activation data) and the weight parameters (known parameters provided by the network model). A low-precision network compresses the original network by reducing the number of bits required to represent each weight or activation value, e.g., using 8, 4, 3, or dynamic bit widths to represent the weights and activations.
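As a rough illustration of this idea (not part of the patent; the function name, symmetric scaling scheme, and data are assumptions), a weight or activation tensor can be mapped onto a low-bit integer grid as follows:
```python
import numpy as np

def quantize(x: np.ndarray, bits: int):
    """Uniform symmetric quantization of float data to a signed `bits`-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit, 7 for 4-bit
    scale = float(np.max(np.abs(x))) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale                            # keep the scale to dequantize results later

weights = np.random.randn(3, 3).astype(np.float32)
w8, s8 = quantize(weights, 8)                  # 8-bit representation of the weights
w4, s4 = quantize(weights, 4)                  # 4-bit representation of the weights
```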
The dataflow architecture is widely used in big data processing, scientific computing, and similar areas, and its decoupling of algorithm and structure gives it good generality and flexibility. The natural parallelism of the dataflow architecture matches the parallel nature of neural network algorithms well. Compared with a dedicated neural network acceleration device, a dataflow-based acceleration device has a degree of generality, and compared with a GPU it has an advantage in energy efficiency. Based on a dataflow architecture, a neural network algorithm is mapped onto an architecture formed by a computing array (PE array) in the form of a dataflow graph; the dataflow graph comprises a number of nodes, each node comprises a number of instructions, and the directed edges of the graph represent the dependencies between nodes. The PE array executes the mapped instructions to realize the operation of the neural network.
The main operation in a neural network is the multiply-add operation, and a dataflow-based accelerator designs corresponding multiply-add instructions according to these operation rules to realize the operations; meanwhile, in order to achieve high data transfer rates and fast data operations, some prior art designs use a high data bandwidth together with single instruction multiple data (SIMD) operation to support network operation. SIMD is single instruction multiple data stream operation, i.e. multiple data items perform the same operation simultaneously under one instruction. This provides a fast way of executing neural networks. In such accelerators, the designed arithmetic units are mainly floating-point multiply-add units or 32-bit/16-bit fixed-point multiply-add units, supporting a relatively high data bit width to guarantee the result accuracy of the neural network.
To compress the neural network model, the neural network operation may be performed using lower-precision data, and to preserve result accuracy, the compressed network model may be retrained to reach the original result accuracy. In this case, the compressed and trained neural network model can be used as the final neural network to be executed on the acceleration device. However, a conventional accelerator device has no computation units that support low-precision calculation and no corresponding instruction support, so it cannot support these low-precision calculations; a low-precision network therefore cannot be executed to reduce the accelerator's power consumption. If low-precision operations are executed on the original-precision operation units, power is wasted without any performance gain. In addition, the result of a low-precision operation is usually stored with a wider bit width to avoid data overflow, and the SIMD operation used in conventional acceleration devices has a high data bandwidth (on-chip network bandwidth); when a low-precision unit is not designed reasonably, not only is the operation bandwidth wasted, but data overflow may also occur, which makes the design of various low-precision units challenging.
Disclosure of Invention
Therefore, an object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a data flow architecture-based multi-precision neural network computing device and method.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided a multi-precision neural network computing device based on a dataflow architecture, comprising: a microcontroller and a PE array connected to it and composed of a plurality of PEs, wherein each PE in the PE array is provided with a computing unit for the original precision and one or more low-precision computing units, each for a precision lower than the original precision; the computing units of different precisions are provided with a corresponding number of parallel multiply accumulators so as to fully utilize the bandwidth of the on-chip network, and each computing unit of each precision is provided with a corresponding number of registers to avoid data overflow; the microcontroller is configured to: in response to a computation request for a convolutional neural network and a computation precision indication, control the computation units of the PE array corresponding to that computation precision to perform the operations of the corresponding convolution operations of the convolutional neural network and store intermediate computation results to the corresponding registers.
In some embodiments of the present invention, the number of the multiply accumulators in the computation units of different precisions in the same PE is different, wherein the number of the multiply accumulators in the computation unit of each precision is equal to or substantially equal to the bandwidth of the network on chip in the processing array divided by the number of data bits corresponding to that precision.
In some embodiments of the present invention, a data bit width of a register configured for each low-precision multiply accumulator in each low-precision computation unit is greater than a precision bit number of the low-precision multiply accumulator.
In some embodiments of the present invention, each PE may further include: a router, for constructing a Mesh network among the PEs in the PE array; an instruction cache, for caching instructions; an instruction status register, for indicating the status of the instructions cached in the instruction cache, where the status of an instruction includes running, ready, and initialized; a data cache, for caching the data required by the corresponding instructions; an execution component, comprising: a load unit for loading data from memory into the data cache, a store unit for transferring data from the data cache to memory, and a data flow unit for transmitting data in this PE to other PEs through the Mesh network; and a controller, for controlling the execution component, the instruction cache, the instruction status register, the data cache, and the router, the controller comprising a decoding unit for decoding calculation instructions, identifying the calculation instruction of the precision required for the neural network calculation, and controlling the computation unit of the corresponding precision to execute according to the calculation instruction of that precision, so as to complete at least part of the calculation of the feature map based on the input data and weight data of the neural network.
In some embodiments of the invention, each PE may include: an original-precision computing unit, a first low-precision computing unit, and a second low-precision computing unit, wherein the original-precision computing unit comprises multiply accumulators for fixed-point computation with a precision of the number of bits corresponding to the original precision, and is configured with a register of the original data bit width for storing intermediate computation results; the first low-precision computing unit comprises multiply accumulators for fixed-point computation with a precision of a first preset number of bits, and is configured with a register of the original data bit width for storing intermediate computation results; the second low-precision computing unit comprises multiply accumulators for fixed-point computation with a precision of a second preset number of bits, and is configured with a register of the original data bit width for storing intermediate computation results. Preferably, the original precision is greater than the first low precision and greater than the second low precision, and the number of bits corresponding to the second low precision is greater than or equal to 8.
In some embodiments of the invention, the original precision is 32-bit precision, the first low precision is 16-bit precision, and the second low precision is 8-bit precision.
In some embodiments of the invention, each PE may further include: a third low-precision computing component. Preferably, the third low-precision computing unit comprises a multiply accumulator for fixed-point computation with precision of a third predetermined number of bits, and a register of 4 times the original data bit width is configured for holding intermediate computation results.
In some embodiments of the present invention, each PE may further include: a fourth low precision computing component. Preferably, the fourth low-precision computing unit may include a multiply accumulator for fixed-point computation with precision of a fourth predetermined number of bits, and a register 4 times the bit width of the original data is configured to hold intermediate computation results.
In some embodiments of the present invention, each PE may further include: a fifth low-precision computing component. Preferably, the fifth low-precision computing unit may include a multiply accumulator for fixed-point computation with precision of a fifth predetermined number of bits, and a register 4 times the bit width of the original data is configured to hold intermediate computation results.
Preferably, the third low precision is 4-bit precision. The fourth low precision is a 2-bit precision. The fifth low precision is 1-bit precision.
According to a second aspect of the present invention, there is provided a method for the data flow architecture-based multi-precision neural network computing device of the first aspect, comprising: acquiring a convolutional neural network to be calculated and a calculation instruction, dividing the convolution operation of the convolutional neural network into a plurality of parallel calculation tasks, distributing the calculation tasks to different PEs of the PE array, and executing the calculation tasks, wherein the calculation instruction comprises a calculation precision indication and the operation objects used for the convolutional neural network; and, in each PE, performing parallel calculation on the multiple feature maps of the convolutional neural network using the computation unit of the corresponding precision, according to the calculation precision indication and the operation objects in the calculation instruction.
In some embodiments of the invention, the calculation instruction comprises a plurality of first single instruction multiple data instructions, a plurality of second single instruction multiple data instructions, a third instruction, and a fourth instruction, and the method further comprises: inputting a plurality of operands of a plurality of feature maps and a plurality of operands of a filter, by the plurality of first single instruction multiple data instructions, into the computation unit of the corresponding calculation precision as the inputs of its multiply accumulators; performing multiply-accumulate operations on the plurality of operands of the plurality of feature maps and the plurality of operands of the filter, respectively, by the plurality of second single instruction multiple data instructions, using the plurality of multiply accumulators of the computation unit of the corresponding calculation precision; and obtaining the convolution results of the plurality of feature maps from the register through the third instruction and storing them in the memory.
In some embodiments of the present invention, the second single instruction multiple data instructions are configured to perform multiply-accumulate operations, where a second single instruction multiple data instruction executed earlier in the current instruction set produces an intermediate calculation result that serves as the accumulated value for a second single instruction multiple data instruction executed later; the intermediate calculation result is stored into the register through the third instruction of the instruction set, and after the last second single instruction multiple data instruction in the current instruction set has been executed, the convolution results of the multiple feature maps are obtained from the register through the fourth instruction of the instruction set and stored in the memory.
Compared with the prior art, the invention has the advantages that:
the invention constructs a multi-precision neural network computing device based on a dataflow architecture, comprising: a microcontroller and a PE array connected to it, wherein each PE of the PE array is provided with computing units of the original precision and of a plurality of precisions lower than the original precision; the lower-precision computing units contain more parallel multiply accumulators so as to fully utilize the bandwidth of the on-chip network, and each low-precision computing unit in each PE is provided with sufficient registers to avoid data overflow; the microcontroller is configured to: in response to an acceleration request for a specific convolutional neural network, control the original-precision or low-precision computing units in the PE array that match the precision of that convolutional neural network to execute the corresponding operations of the convolution operations and store intermediate computation results to the corresponding registers. Therefore, convolutional neural networks of different precisions can be accelerated, the computation latency is reduced, and the user experience is improved. When low-precision convolution operations are adopted, the energy consumption of the convolutional neural network operation can also be reduced.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a multi-precision neural network acceleration device based on a dataflow architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-precision computing component deployed in a computing unit in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of an exemplary implementation of a multi-precision neural network acceleration device based on a dataflow architecture, according to an embodiment of the invention;
FIG. 4 is a diagram illustrating a convolution operation of raw precision according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a convolution operation with low precision according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the background section, the SIMD operation used in existing computing devices has a high data bandwidth (on-chip network bandwidth); when the low-precision units are not designed reasonably, this not only wastes operation bandwidth but also causes data overflow. The present application therefore builds a multi-precision neural network computing device based on a dataflow architecture, comprising: a microcontroller and a PE array connected to it, wherein each PE of the PE array is provided with computing units of the original precision and of a plurality of precisions lower than the original precision; the lower-precision computing units contain more parallel multiply accumulators so as to fully utilize the bandwidth of the on-chip network, and each low-precision computing unit in each PE is provided with sufficient registers to avoid data overflow; the microcontroller is configured to: in response to an acceleration request for a specific convolutional neural network, control the original-precision or low-precision computing units in the PE array that match the precision of that convolutional neural network to execute the corresponding operations of the convolution operations and store intermediate computation results to the corresponding registers. In this way, convolutional neural networks of different precisions can be accelerated, the computation latency is reduced, and the user experience is improved. When low-precision convolution operations are adopted, the energy consumption of the convolutional neural network operation can also be reduced.
Before embodiments of the invention are explained in detail, some terms used therein will be explained:
the low-precision computation unit is a unit that performs calculations using data of fewer than 32 bits. For example, a neural network operation generally uses 32-bit floating-point or fixed-point operations; when 32-bit data operations are used, they are regarded as original-precision operations, and when data operations of fewer than 32 bits are used, they are regarded as low-precision operations.
A PE array, i.e. a Processing Element array or parallel processor, is built by repeatedly providing a number of identical Processing Elements (PEs), interconnecting them as an array, and connecting them to a microcontroller in a certain manner; under the control of the microcontroller, the PEs perform the operations specified by the same set of instructions in parallel on the different data allocated to each of them.
SIMD, Single Instruction Multiple Data, refers to single instruction multiple data stream operation, i.e. the same operation is performed on multiple data items at the same time using one instruction.
Aiming at the problem that existing computing devices cannot execute low-precision neural network operations, the invention designs various low-precision computation units and corresponding instructions to support various low-precision operations; at the same time, in order to fully utilize the high bandwidth of the network, the different low-precision computation units are designed appropriately with respect to the existing network bandwidth. According to an embodiment of the invention, a multi-precision neural network computing device based on a dataflow architecture may include: a microcontroller and a PE array connected to it and composed of a plurality of PEs, wherein each PE of the PE array is provided with a computing unit for the original precision and one or more low-precision computing units, each for a precision lower than the original precision; the computing units of different precisions are provided with a corresponding number of parallel multiply accumulators so as to fully utilize the bandwidth of the on-chip network, and each computing unit of each precision is provided with a corresponding number of registers to avoid data overflow. Preferably, the microcontroller may be configured to: in response to a computation request for a convolutional neural network and a computation precision indication, control the computation units of the PE array corresponding to that computation precision to perform the operations of the corresponding convolution operations of the convolutional neural network and store intermediate computation results to the corresponding registers.
The invention is oriented to scenarios requiring low-precision operation, such as terminal devices including mobile phones, computers, and smart watches. The multi-precision neural network computing device based on the dataflow architecture can be implemented on an FPGA or an ASIC. For example, when it recognizes a convolutional neural network computation task that needs to be accelerated, the general-purpose processor of the terminal device may send a computation request and a computation precision indication to the dataflow-architecture-based multi-precision neural network computing device, which is either integrated within the terminal device or placed outside it and connected through a communication line; the computing device then completes the computation task of the convolutional neural network, and the processor obtains the computation result.
The original network-on-chip bandwidth is set for the original-precision computation unit; the network-on-chip bandwidth equals the number of bits of the original precision multiplied by the number of parallel multiply accumulators in the original-precision computation unit. In order to perform low-precision network operations and reduce computation power consumption, according to an embodiment of the present invention, on the basis of the foregoing embodiments, at least two of the first, second, third, fourth, and fifth low-precision computation units may be configured in each PE of the dataflow-architecture-based multi-precision neural network computing device. Referring to fig. 1, a schematic diagram of a PE is shown in which the original-precision computation unit and 5 low-precision computation units are configured in one PE. In terms of the number of bits of the computed data, the original precision is greater than the first low precision, which is greater than the second low precision, which is greater than the third low precision, which is greater than the fourth low precision, which is greater than the fifth low precision. The number of bits corresponding to the second low precision is 8 or more. When a low-precision computing unit is used, since the number of bits of its operands (input data) is lower, if the number of multiply accumulators in it were set the same as in the original-precision computing unit, the on-chip network bandwidth might not be fully utilized. Therefore, in order to fully utilize the original network-on-chip bandwidth, the number of multiply accumulators in each computation unit needs to be set appropriately. According to an embodiment of the present invention, the numbers of multiply accumulators in the computation units of different precisions in the same PE are different, where the number of multiply accumulators in the computation unit of each precision is equal to or substantially equal to the bandwidth of the network on chip divided by the number of data bits corresponding to that precision. To avoid data overflow, the data bit width of the register configured for each low-precision multiply accumulator in each low-precision computation unit is greater than the precision bit number of that multiply accumulator.
According to an embodiment of the present invention, referring to fig. 2, the original-precision computing unit a can be configured in the PE, and the original-precision computing unit is assumed to support 32-bit data operations. It supports SIMD32 operation, i.e. it contains 32 multiply accumulators, and the on-chip network bandwidth is 32 × 32 = 1024 bits. A register of the original data bit width is configured for the original-precision computing unit to store intermediate calculation results, where the original data bit width can be 1024 bits.
The first low-precision computing element b may comprise multiply accumulators for fixed-point computation with a precision of a first preset number of bits, with registers of twice the original data bit width configured to hold intermediate computation results. For example, if the first preset number of bits is 16 bits, the number of multiply accumulators in the first low-precision computation unit is 1024/16 = 64, named SIMD64, and 2 × 1024 = 2048 bits of registers are configured to hold intermediate computation results, i.e. Reg1 and Reg2.
The second low-precision computing element c may comprise multiply accumulators for fixed-point computation with a precision of a second preset number of bits, with registers of 4 times the original data bit width configured to hold intermediate computation results. For example, if the second preset number of bits is 8 bits, the number of multiply accumulators in the second low-precision computation unit is 1024/8 = 128, named SIMD128, and 4 × 1024 = 4096 bits of registers are configured to hold intermediate computation results, i.e. Reg1, Reg2, Reg3, Reg4.
The third low-precision computing element d may likewise comprise multiply accumulators for fixed-point computation with a precision of a third preset number of bits, with registers of 4 times the original data bit width configured to hold intermediate computation results. For example, if the third preset number of bits is 4 bits, the number of multiply accumulators in the third low-precision computation unit is 1024/4 = 256, named SIMD256, and 4 × 1024 = 4096 bits of registers are configured to hold intermediate computation results, i.e. Reg1, Reg2, Reg3, Reg4.
The fourth low-precision computing element e may comprise multiply accumulators for fixed-point computation with a precision of a fourth preset number of bits, with registers of 4 times the original data bit width configured to hold intermediate computation results. For example, if the fourth preset number of bits is 2 bits, the number of multiply accumulators in the fourth low-precision computation unit is 1024/2 = 512, named SIMD512, and 4 × 1024 = 4096 bits of registers are configured to hold intermediate computation results, i.e. Reg1, Reg2, Reg3, Reg4.
The fifth low-precision computing element f may comprise multiply accumulators for fixed-point computation with a precision of a fifth preset number of bits, with registers of 4 times the original data bit width configured to hold intermediate computation results. For example, if the fifth preset number of bits is 1 bit, the number of multiply accumulators in the fifth low-precision computation unit is 1024/1 = 1024, named SIMD1024, and 4 × 1024 = 4096 bits of registers are configured to hold intermediate computation results, i.e. Reg1, Reg2, Reg3, Reg4.
In summary, for the 16-bit computation unit, since the original data bit width is 1024 bits, 1024/16 = 64 multiply accumulators should be provided; similarly, for the 8-bit, 4-bit, 2-bit, and 1-bit operations, 128, 256, 512, and 1024 multiply accumulators should be provided, so that the corresponding SIMD widths are SIMD64, SIMD128, SIMD256, SIMD512, and SIMD1024, respectively. b, c, d, e, f in fig. 2 are the designed low-precision computing units. When multiplications are executed with 16-bit, 8-bit, 4-bit, 2-bit, and 1-bit operands, the multiplications are performed with the respective bit widths, and the results are held as 32-bit, 32-bit, 16-bit, 8-bit, and 4-bit values respectively (to guarantee the precision of the intermediate calculation results and prevent overflow of the calculation results, the bit width of the intermediate calculation result is higher than the bit width of the data: the intermediate calculation results of the 8-bit, 4-bit, 2-bit, and 1-bit operations occupy 4 times the original data bit width in total, and the intermediate calculation results of the 16-bit operation occupy 2 times the original bit width), after which the additions are performed with these 32-bit, 32-bit, 16-bit, 8-bit, and 4-bit values respectively. The total number of bits of the final results is 2048 bits, 4096 bits, 4096 bits, 4096 bits, and 4096 bits respectively, which exceeds the 1024-bit original data width, so the results need to be held temporarily in registers, and to avoid data overflow, more registers are configured for the low-precision computation units than for the original-precision computation unit. As shown in fig. 2, corresponding registers are configured for the different low-precision computation units to store the results of the SIMD operations.
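The relationship between the 1024-bit on-chip network bandwidth, the SIMD width of each computation unit, and the number of 1024-bit result registers described above can be checked with a short sketch (illustrative only; the intermediate-result widths follow the figures given in this description):
```python
NOC_BANDWIDTH = 1024   # bits: original data bit width, set by the 32-bit SIMD32 design

# operand precision (bits) -> bit width used to hold each intermediate result, as stated above
RESULT_BITS = {32: 32, 16: 32, 8: 32, 4: 16, 2: 8, 1: 4}

for bits in (32, 16, 8, 4, 2, 1):
    lanes = NOC_BANDWIDTH // bits                      # parallel multiply accumulators (SIMD width)
    total = lanes * RESULT_BITS[bits]                  # bits produced by one multiply-add instruction
    regs = -(-total // NOC_BANDWIDTH)                  # 1024-bit registers needed (ceiling division)
    print(f"{bits:2d}-bit operands: SIMD{lanes}, {total}-bit result, {regs} register(s)")
```
Running this reproduces the SIMD64/SIMD128/SIMD256/SIMD512/SIMD1024 widths and the 2 or 4 registers per low-precision unit mentioned above.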
As for the registers, the registers required by the different low-precision computation units in the same PE may be provided independently of each other or shared. If they are provided independently, the same PE can execute acceleration tasks for neural network models of different precisions at the same time. If they are shared, the total number of registers can be reduced and their utilization improved.
For the design of low precision instructions, see the following table:
[Table: the low-precision instruction set, comprising the multiply-add instructions MADD16, MADD8, MADD4, MADD2 and MADD1 and the register-access instructions RXIN and RXOUT]
In order to support the different low-precision operations, corresponding instructions need to be designed so that, after decoding, they trigger the corresponding operations. MADD16, MADD8, MADD4, MADD2, and MADD1 support the multiply-add operations of 16 bits, 8 bits, 4 bits, 2 bits, and 1 bit respectively. The RXIN instruction is used to place data into the register with the specified number, and RXOUT is used to obtain the value in the specified register. The PE can determine the calculation precision from the instruction corresponding to the multiply-accumulate operation; for example, if the invoked instruction is MADD16, the computation unit corresponding to 16-bit precision needs to be invoked for the calculation. According to one embodiment of the invention, the instruction formats are as follows:
Instruction format of MADDx (x denotes a multiply-add operation on x-bit data): [instruction type, source operand0, source operand1, destination operand0]. Taking MADD16 in the above table as an example, MADD16 indicates a 16-bit multiply-add instruction, src_operand0 indicates source operand0, src_operand1 indicates source operand1, and dst_operand0 indicates destination operand0. The corresponding calculation is: destination operand0 = source operand0 × source operand1 + destination operand0.
RXIN instruction format: [instruction type, source operand0, register id]. RXIN indicates that the instruction type is an RXIN instruction, src_operand0 indicates source operand0, and reg_id indicates the specified register number. Source operand0 of the RXIN instruction corresponds to the result of a multiply-add (MADDx) instruction. For example, for an instruction set of 16-bit precision, the RXIN instruction stores the result produced by each of the multiply-add instructions Inst19 to Inst27 into a register; that is, source operand0 of the RXIN instruction = destination operand0 as updated by each execution of a multiply-add instruction. The results calculated by instructions Inst19 to Inst26 are intermediate calculation results, each stored into the register with the corresponding number through RXIN after it is computed, while the result calculated by instruction Inst27 corresponds to the result of the convolution of the filter with one local region of the feature map.
RXOUT instruction format: [instruction type, register id, destination operand0]. RXOUT indicates that the instruction type is an RXOUT instruction, reg_id indicates the specified register number, and dst_operand0 indicates destination operand0. The RXOUT instruction can be used to fetch the result calculated by instruction Inst27 from the register; a store instruction then stores the result of Inst27 to memory.
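A minimal functional sketch of these three instruction formats, modelling each instruction as a tuple and a PE-local data cache and register file as dictionaries (the tuple encoding and helper names are assumptions for illustration, not the patent's actual encoding):
```python
# MADDx:  ("MADDx", src_operand0, src_operand1, dst_operand0)  ->  dst = src0 * src1 + dst
# RXIN:   ("RXIN",  src_operand0, reg_id)                      ->  put a value into register reg_id
# RXOUT:  ("RXOUT", reg_id, dst_operand0)                      ->  read register reg_id back out

MADD_PRECISION = {"MADD16": 16, "MADD8": 8, "MADD4": 4, "MADD2": 2, "MADD1": 1}

def execute(inst, regs, mem):
    """Execute one decoded instruction against a toy data cache `mem` and register file `regs`."""
    op = inst[0]
    if op in MADD_PRECISION:                 # MADD_PRECISION[op] selects the computation unit width
        _, src0, src1, dst = inst
        mem[dst] = mem[src0] * mem[src1] + mem[dst]
    elif op == "RXIN":
        _, src0, reg_id = inst
        regs[reg_id] = mem[src0]
    elif op == "RXOUT":
        _, reg_id, dst = inst
        mem[dst] = regs[reg_id]

mem = {"I0": 3, "W0": 2, "O0": 0}
regs = {}
execute(("MADD16", "I0", "W0", "O0"), regs, mem)   # O0 = I0 * W0 + O0 = 6
execute(("RXIN", "O0", 1), regs, mem)              # park the intermediate result in register 1
```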
According to one embodiment of the invention, a multi-precision neural network computing device based on a data flow architecture comprises: the system comprises a microcontroller and a PE array connected with the microcontroller, wherein each PE of the PE array is provided with an original precision and a plurality of low-precision computing units with lower precision than the original precision, more parallel multiplication accumulators are arranged in the computing units with lower precision to fully utilize the bandwidth of a network on chip, and each low-precision computing unit in each PE is provided with sufficient registers to avoid data overflow; the microcontroller is configured to: and in response to an acceleration request for a specific convolutional neural network, controlling original-precision or low-precision computing components, matched with the precision of the specific convolutional neural network, in the PE array to execute corresponding operation in convolution operation and storing an intermediate computing result to a corresponding register. According to one embodiment of the invention, each PE comprises: the router is used for constructing a Mesh network among the PEs in the PE array; an instruction cache for caching instructions; an instruction status register for indicating the status of instructions cached in the instruction cache, including running, ready, and initialized; the data cache is used for caching data required by the corresponding instruction; an execution component comprising: the system comprises a loading unit, a storage unit, a calculation unit and a data flow unit, wherein the loading unit is used for loading data from a memory to the data cache, the storage unit is used for transmitting the data from the data cache to the memory, the calculation unit comprises a plurality of low-precision calculation components with original precision and precision lower than the original precision, and the data flow unit is used for transmitting the data in one PE to other PEs through a Mesh network; the controller is used for controlling the execution unit, the instruction cache, the instruction state register, the data cache and the router, and comprises a decoding unit which is used for decoding the calculation instruction, analyzing the calculation instruction with the required precision for calculating the neural network, and controlling the calculation unit with the corresponding precision to execute according to the calculation instruction with the corresponding precision so as to complete at least part of calculation of the characteristic diagram according to the input data and the weight data of the neural network. For example, the ready instruction is decoded by the decoding component, the calculation instruction with the precision required by the neural network is analyzed, and the calculation instruction with the corresponding precision is sent to the calculation component with the corresponding precision for execution.
The device of the invention comprises a low-precision computing component and a corresponding low-precision computing instruction. When different low-precision networks are executed, different operation parts are selected according to different precisions to execute operations, the original data bandwidth is fully utilized, and the performance of low-precision operations is improved. In a multi-precision neural network computing device based on a data flow architecture, designed multi-precision computing components are contained in PE, and when a certain precision operation is needed, corresponding calculation is carried out by using the computing components with corresponding precision.
An overall structure of an exemplary dataflow-architecture-based multi-precision neural network computing device is shown in fig. 3. The microcontroller manages the execution process of the PE array: the microcontroller starts the PE array, and the PE array completes initialization, i.e. the instructions in memory are loaded into the instruction cache of each PE, for example the instruction Inst0 shown in fig. 3. During execution, the PE array selects a Ready instruction according to the instruction status register (the instruction status register records the state of each instruction: Running, corresponding to the number B0, indicates an instruction in execution; Ready, corresponding to the number B1, indicates a ready instruction; and Init, corresponding to the number B2, indicates an initialized instruction), obtains the selected instruction from the instruction buffer, and, after it has passed through the decoding unit in the controller, sends the instruction to an execution unit for execution. When the instruction is decoded, the calculation instruction of a specific precision can be recognized, and the corresponding instruction can be sent to the computation unit of the corresponding precision for execution. The PEs are connected to each other through a 2D Mesh network and communicate data through a data Flow Unit. Inside each PE there are a Load Unit, a Store Unit, a compute unit (Cal Unit), and a data Flow Unit. The load unit is responsible for loading data from memory into the data cache of the PE, such as Data0 required by instruction Inst0 shown in fig. 3, and the store unit is responsible for transferring data from the data cache to memory. The function of the data flow unit is to transmit data in one PE to another PE through the Mesh network constructed by the routers.
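A highly simplified model of this per-PE control flow, with hypothetical class and state names (the patent does not specify the microarchitecture at this level of detail), might look like:
```python
RUNNING, READY, INIT = "B0", "B1", "B2"            # instruction states recorded in the status register

class PE:
    """Toy model of one processing element's instruction selection."""
    def __init__(self, instructions):
        self.inst_cache = list(instructions)        # loaded from memory when the array is initialized
        self.status = [INIT] * len(instructions)    # instruction status register, one entry per instruction
        self.data_cache = {}                        # filled by the Load unit, drained by the Store unit

    def select_ready(self):
        """Pick a Ready instruction, mark it Running, and hand it to the decode/execute path."""
        for i, state in enumerate(self.status):
            if state == READY:
                self.status[i] = RUNNING
                return self.inst_cache[i]
        return None                                 # nothing ready yet
```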
The following describes the required instructions and execution process by taking a convolution operation in a PE as an example. The convolution operation and the required instructions are shown in fig. 4:
in the convolution operation of SIMD32 (one instruction performs an operation on 32 data), an instruction set composed of instructions Inst1 to Inst27 is involved. The 9 data I0 to I8 of the input feature map Ifmap of the convolutional neural network (9 data in the input feature map, all 32-bit) are respectively multiplied by W0 to W8 of the Filter (9 data in the filter, all 32-bit) and accumulated to obtain the data O0 of the output feature map Ofmap (32-bit). In this process, Ifmap requires instructions Inst1 to Inst9 to load the data I0 to I8, each instruction loading 32 values (of Ifmap1 to Ifmap32), i.e. 32 components (the simd1 to simd32 components); instructions Inst10 to Inst18 are used to load the weight parameters W0 to W8 of the Filter, each instruction likewise loading 32 values, i.e. 32 components (the simd1 to simd32 components); and instructions Inst19 to Inst27 then perform the multiply-add operations, each multiply-add instruction performing the multiply-add of 32 components (simd1 to simd32), completing the convolution operation. In this operation, the O0 values of 32 components are calculated.
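The effect of this SIMD32 multiply-add sequence can be sketched numerically as follows (the array shapes and random data are illustrative assumptions; none of these Python names come from the patent):
```python
import numpy as np

LANES = 32                                           # SIMD32: one instruction operates on 32 components
ifmap = np.random.randint(-8, 8, size=(9, LANES))    # I0..I8, one value per component (Ifmap1..Ifmap32)
weights = np.random.randint(-8, 8, size=(9, LANES))  # W0..W8, loaded for the same 32 components

O0 = np.zeros(LANES, dtype=np.int64)                 # output value of the current region, per component
for k in range(9):                                   # corresponds to the multiply-add chain Inst19..Inst27
    O0 = ifmap[k] * weights[k] + O0                  # dst = src0 * src1 + dst across all 32 lanes
```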
For low-precision operations, the RXIN instruction is additionally needed to place the intermediate calculation result O0' into a register. See fig. 5, which likewise involves an instruction set composed of instructions Inst1 to Inst27. It should be understood that, for simplicity, MADD16, MADD8, MADD4, MADD2, and MADD1 are combined in one figure and denoted MADD16/8/4/2/1; in practice each of them should be an instruction set of the corresponding precision constructed according to the structure shown in fig. 4.
For a 16-bit low-precision operation, i.e. a SIMD64 convolution operation, there are 64 components, and the 32-bit intermediate calculation result O0' of each component is placed into a register using the RXIN instruction; because the results total 2048 bits while the data bit width is 1024 bits, the 2048 bits of O0 are placed into two registers using two RXIN instructions. For example, when calculating the O0 value in the simd1 component, the loaded I0 value and W0 value are fed to the two inputs of a multiply accumulator, and the MADD instruction calculates I0 × W0 to obtain the intermediate calculation result O0', which is stored by RXIN into the register with the corresponding number; then the I1 value and W1 value of the same feature map are used, and the same multiply accumulator calculates O0' = I1 × W1 + O0'. By analogy, the intermediate calculation result O0' is updated continuously until O0 = I8 × W8 + O0' is calculated, yielding the O0 value. Using two RXIN instructions, the 2048-bit O0 is placed into two 1024-bit registers. The calculated O0 value is obtained via the RXOUT instruction, and the O0 value is then stored to memory using a Store instruction. The next convolution operation is then carried out in the same manner to calculate the output value of the next region, until the convolution operation of the whole feature map is completed.
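A corresponding sketch of the 16-bit SIMD64 case, including the split of the 2048-bit intermediate result across two 1024-bit registers by two RXIN operations (small illustrative values; the register numbering is an assumption):
```python
import numpy as np

LANES, REG_BITS, RESULT_BITS = 64, 1024, 32                 # SIMD64, 1024-bit registers, 32-bit O0'

ifmap = np.random.randint(-100, 100, size=(9, LANES)).astype(np.int32)    # 16-bit I0..I8 per component
weights = np.random.randint(-100, 100, size=(9, LANES)).astype(np.int32)  # 16-bit W0..W8 per component

reg_file = {}
O0 = np.zeros(LANES, dtype=np.int32)                        # 64 x 32 bits = 2048 bits of intermediates
half = REG_BITS // RESULT_BITS                              # 32 of the 64 results fit in one register
for k in range(9):                                          # MADD16: O0' = Ik * Wk + O0'
    O0 = ifmap[k] * weights[k] + O0
    reg_file[1], reg_file[2] = O0[:half], O0[half:]         # two RXIN operations after each update

result = np.concatenate([reg_file[1], reg_file[2]])         # RXOUT + Store: write the final O0 to memory
```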
For an 8-bit low-precision operation, i.e. the convolution operation of SIMD128, at most 128 components corresponding to 128 feature maps can be calculated in parallel; the calculation process of each component is similar to the 16-bit low-precision operation and is not described again. The calculated 32-bit O0 of each component is placed into a register with the RXIN instruction; because there are 4096 bits in total, four RXIN instructions are used to place O0 into four 1024-bit registers. The calculated O0 value is obtained via the RXOUT instruction, and the O0 value is then stored to memory using a Store instruction. The next convolution operation is then carried out in the same manner to calculate the output value of the next region, until the convolution operation of the whole feature map is completed.
For a 4-bit low-precision operation, i.e. the convolution operation of SIMD256, at most 256 components corresponding to 256 feature maps can be calculated in parallel; the calculation process of each component is similar to the 16-bit low-precision operation and is not described again. The calculated 16-bit O0 of each component is placed into a register with the RXIN instruction; because there are 4096 bits in total, four RXIN instructions are used to place O0 into four 1024-bit registers. The calculated O0 value is obtained via the RXOUT instruction, and the O0 value is then stored to memory using a Store instruction. The next convolution operation is then carried out in the same manner to calculate the output value of the next region, until the convolution operation of the whole feature map is completed.
For a 2-bit low-precision operation, i.e. the convolution operation of SIMD512, there are first 512 components, and the 8-bit O0 of each component is placed into a register with the RXIN instruction; because there are 4096 bits in total, four RXIN instructions are used to place O0 into four 1024-bit registers. The calculated O0 value is obtained via the RXOUT instruction, and the O0 value is then stored to memory using a Store instruction. The next convolution operation is then carried out in the same manner to calculate the output value of the next region, until the convolution operation of the whole feature map is completed.
For a 1-bit low-precision operation, i.e. the convolution operation of SIMD1024, there are first 1024 components, and the 4-bit O0 of each component is placed into a register with the RXIN instruction; because there are 4096 bits in total, four RXIN instructions are used to place O0 into four 1024-bit registers. The calculated O0 value is obtained via the RXOUT instruction, and the O0 value is then stored to memory using a Store instruction. The next convolution operation is then carried out in the same manner to calculate the output value of the next region, until the convolution operation of the whole feature map is completed.
The invention also provides a method for the above multi-precision neural network computing device based on the dataflow architecture, which may comprise the following steps: acquiring a convolutional neural network to be calculated and a calculation instruction, and dividing the convolution operation of the convolutional neural network into a plurality of parallel calculation tasks distributed to different PEs of the PE array for execution, wherein the calculation instruction comprises a calculation precision indication and the operation objects used for the convolutional neural network; and, in each PE, performing parallel calculation on the multiple feature maps of the convolutional neural network using the computation unit of the corresponding precision, according to the calculation precision indication and the operation objects in the calculation instruction. According to one embodiment of the present invention, if the precision of the convolutional neural network to be computed is 32-bit precision, the convolution operation is performed in the original-precision computation unit of the PE; if the precision of the convolutional neural network to be computed is 8-bit precision, the convolution operation is performed in the 8-bit low-precision computation unit of the PE (the second low-precision computation unit of the foregoing embodiments). The operation objects may include the data objects of an instruction, such as source operands and destination operands, and may include a register number for storing data.
According to one embodiment of the invention, performing parallel computation on a plurality of feature maps of the convolutional neural network using the identified instruction set of original precision or corresponding low precision comprises: inputting a plurality of operands of a plurality of feature maps and a plurality of operands of a filter, by a plurality of first single instruction multiple data instructions in the identified instruction set of original precision or corresponding low precision, into the computation unit of the corresponding calculation precision as the inputs of its multiply accumulators; performing multiply-accumulate operations on the plurality of operands of the plurality of feature maps and the plurality of operands of the filter, respectively, by a plurality of second single instruction multiple data instructions in the identified instruction set of original precision or corresponding low precision, using the plurality of multiply accumulators of the computation unit of the corresponding calculation precision; and obtaining the convolution results of the plurality of feature maps from the register through a third instruction and storing them in the memory. Preferably, the second single instruction multiple data instructions may be configured to perform multiply-accumulate operations, where a second single instruction multiple data instruction executed earlier in the current instruction set produces an intermediate calculation result that serves as the accumulated value for a second single instruction multiple data instruction executed later; the intermediate calculation result is stored into the register through a third instruction of the instruction set, and after the last second single instruction multiple data instruction in the current instruction set has been executed, the convolution results of the multiple feature maps are obtained from the register through a fourth instruction of the instruction set and stored in the memory. According to one embodiment of the invention, the first single instruction multiple data instructions are, for example, the instructions Inst1 to Inst8 for loading the feature map and the instructions Inst10 to Inst18 for loading the weight parameters of the filter. The second single instruction multiple data instructions are, for example, Inst19 to Inst27 for performing the multiply-accumulate operations. The third instruction is, for example, the RXIN instruction. The fourth instruction is, for example, the RXOUT instruction.
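To make the roles of the first/second/third/fourth instructions concrete, the following sketch assembles one such instruction sequence for a chosen precision (the LOAD/STORE mnemonics, the 32-bit "MADD" name, and the register numbering are placeholders, not instruction names defined by the patent):
```python
PRECISION_TO_MADD = {32: "MADD", 16: "MADD16", 8: "MADD8", 4: "MADD4", 2: "MADD2", 1: "MADD1"}

def build_instruction_set(precision_bits, n_taps=9):
    """Build the load / multiply-add / RXIN / RXOUT / store sequence for one output region."""
    madd = PRECISION_TO_MADD[precision_bits]
    insts = [("LOAD_IFMAP", k) for k in range(n_taps)]     # first SIMD instructions: load I0..I8
    insts += [("LOAD_WEIGHT", k) for k in range(n_taps)]   # first SIMD instructions: load W0..W8
    for k in range(n_taps):                                # second SIMD instructions: Inst19..Inst27
        insts.append((madd, f"I{k}", f"W{k}", "O0"))       # O0 = Ik * Wk + O0
        insts.append(("RXIN", "O0", "reg1"))               # third instruction: park the intermediate result
    insts.append(("RXOUT", "reg1", "O0"))                  # fourth instruction: fetch the final result
    insts.append(("STORE", "O0"))                          # write it back to memory
    return insts

insts_8bit = build_instruction_set(8)                      # sequence for an 8-bit quantized network
```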
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as a punch card or an in-groove protruding structure with instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A multi-precision neural network computing device based on a dataflow architecture, comprising: a microcontroller and a PE array connected thereto and composed of a plurality of PEs, wherein,
each PE of the PE array is provided with a computing unit of original precision and one or more low-precision computing units of precision lower than the original precision, wherein the computing units of different precisions are provided with corresponding numbers of parallel multiply accumulators so as to fully utilize the bandwidth of the network on chip, and the computing unit of each precision is provided with a corresponding number of registers so as to avoid data overflow;
the microcontroller is configured to:
in response to a convolutional neural network to be calculated and a calculation instruction, dividing the convolution operation of the convolutional neural network into a plurality of parallel calculation tasks, distributing the tasks to different PEs of the PE array for execution, and controlling the PE array to execute the corresponding operations of the convolution operation of the convolutional neural network with the calculation components of the corresponding calculation precision and to store intermediate calculation results to the corresponding registers, wherein the calculation instruction comprises a calculation precision indication and an operation object for the convolutional neural network;
and in each PE, according to the calculation precision indication and the operation object in the calculation instruction, performing parallel calculation on a plurality of feature maps of the convolutional neural network by using the calculation component of the corresponding precision.
2. The data flow architecture based multi-precision neural network computing device of claim 1, wherein the number of the multiply accumulators in the computing units of different precision in the same PE is different, wherein the number of the multiply accumulators in the computing unit of each precision is equal to or substantially equal to the bandwidth of the network on chip in the processing array divided by the number of data bits corresponding to the precision.
3. The data-flow-architecture-based multi-precision neural network computing device of claim 2, wherein the data bit width of the register configured for each low-precision multiply accumulator in each low-precision computing unit is greater than the precision bit number of the low-precision multiply accumulator.
4. The dataflow architecture-based multi-precision neural network computing device of claim 3, wherein each PE further comprises:
the router is used for constructing a Mesh network among the PEs in the PE array;
an instruction cache for caching instructions;
an instruction status register to indicate a status of instructions cached in the instruction cache, wherein the status of instructions includes running, ready, and initialized;
the data cache is used for caching data required by the corresponding instruction;
an execution component comprising:
a loading unit for loading data from the memory to the data cache,
a storage unit for transmitting data from the data cache to the memory,
the data flow unit is used for transmitting the data in the PE to other PEs through a Mesh network;
the controller is used for controlling the execution component, the instruction cache, the instruction status register, the data cache and the router, and comprises a decoding unit for decoding the calculation instruction, resolving the calculation instruction of the precision required for calculating the convolutional neural network, and controlling the calculation unit of the corresponding precision to execute according to the calculation instruction of the corresponding precision, so as to complete at least part of the calculation of the feature map according to the input data and the weight data of the neural network.
5. The dataflow architecture-based multi-precision neural network computing apparatus according to any one of claims 1 to 4, wherein each PE includes:
a raw-precision computing component, a first low-precision computing component, and a second low-precision computing component, wherein,
the calculation component of the original precision comprises a multiply accumulator for fixed-point calculation with a precision of the number of bits corresponding to the original precision, and is configured with a register of the original data bit width for storing an intermediate calculation result;
the first low-precision computing unit comprises a multiply accumulator for fixed-point calculation with a precision of a first preset number of bits, and is configured with a register of the original data bit width for storing an intermediate calculation result;
the second low-precision computing unit comprises a multiply accumulator for fixed-point calculation with a precision of a second preset number of bits, and is configured with a register of the original data bit width for storing an intermediate calculation result;
the original precision is larger than the first low precision and larger than the second low precision, and the bit number corresponding to the second low precision is larger than or equal to 8.
6. The data-stream-architecture-based multi-precision neural network computing device of claim 5, wherein the original precision is a 32-bit precision, the first low precision is a 16-bit precision, and the second low precision is an 8-bit precision.
7. The dataflow architecture-based multi-precision neural network computing device of claim 5, wherein each PE further includes:
a third low-precision computing component, a fourth low-precision computing component, and a fifth low-precision computing component, wherein,
the third low-precision computing unit comprises a multiply accumulator for fixed-point calculation with a precision of a third preset number of bits, and is configured with a register of 4 times the original data bit width for storing an intermediate calculation result;
the fourth low-precision computing unit comprises a multiply accumulator for fixed-point calculation with a precision of a fourth preset number of bits, and is configured with a register of 4 times the original data bit width for storing an intermediate calculation result;
the fifth low-precision computing unit comprises a multiply accumulator for fixed-point calculation with a precision of a fifth preset number of bits, and is configured with a register of 4 times the original data bit width for storing an intermediate calculation result.
8. The data-flow-architecture-based multi-precision neural network computing device of claim 7, wherein the third low precision is a 4-bit precision, the fourth low precision is a 2-bit precision, and the fifth low precision is a 1-bit precision.
9. A method for the data flow architecture based multi-precision neural network computing device of any one of claims 1-8, comprising:
acquiring a convolutional neural network to be calculated and a calculation instruction, dividing the convolutional operation of the convolutional neural network into a plurality of parallel calculation tasks to be distributed to different PEs of a PE array for execution, wherein the calculation instruction comprises a calculation precision indication and an operation object used for the convolutional neural network;
and in each PE, according to the calculation precision indication and the operation object in the calculation instruction, performing parallel calculation on a plurality of feature maps of the convolutional neural network by using the calculation component of the corresponding precision.
10. The method of claim 9, wherein the calculation instruction comprises a plurality of first single-instruction-multiple-data instructions, a plurality of second single-instruction-multiple-data instructions, a third instruction, and a fourth instruction, the method further comprising:
inputting, by the plurality of first single-instruction-multiple-data instructions, a plurality of operands of the plurality of feature maps and a plurality of operands of the filter to the computing unit of the corresponding computation precision as inputs to its multiply accumulators;
performing, by the plurality of second single-instruction-multiple-data instructions, multiply-accumulate operations on the plurality of operands of the plurality of feature maps and the plurality of operands of the filter, respectively, using the plurality of multiply accumulators of the computing unit of the corresponding computation precision;
and acquiring the convolution results of the plurality of feature maps from the registers through the third instruction and storing them in the memory.
11. The method of claim 10, wherein the second single-instruction-multiple-data instructions are configured to perform multiply-accumulate operations, wherein a second single-instruction-multiple-data instruction executed earlier in the current instruction set obtains an intermediate computation result that serves as the accumulated value for the accumulate operation of a second single-instruction-multiple-data instruction executed later, wherein the intermediate computation result is stored in the register via the third instruction in the instruction set, and wherein, after the last second single-instruction-multiple-data instruction in the current instruction set completes execution, the convolution results of the multiple feature maps are obtained from the register via the fourth instruction in the instruction set and stored in the memory.
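By way of illustration only (not part of the claims), the relation in claims 2-3 between on-chip network bandwidth, multiply-accumulator count and register width, together with the precision tiers of claims 5-8, can be checked numerically with a short sketch. The 256-bit network-on-chip bandwidth and 32-bit original data width used below are assumed example values, not values fixed by the claims.

    # Numeric sketch of claims 2-3 and the precision tiers of claims 5-8.
    # NOC_BANDWIDTH_BITS and ORIGINAL_WIDTH_BITS are assumed example values.
    NOC_BANDWIDTH_BITS = 256      # assumed network-on-chip bandwidth
    ORIGINAL_WIDTH_BITS = 32      # assumed original data bit width

    # precision tiers: original, first to fifth low precision (claims 6 and 8)
    precisions = [32, 16, 8, 4, 2, 1]

    for p in precisions:
        # claim 2: parallel MAC count is about NoC bandwidth / data bits of the precision
        macs = NOC_BANDWIDTH_BITS // p
        # claims 5 and 7: 32/16/8-bit units use registers of the original data width,
        # while 4/2/1-bit units use registers of 4 times the original data width
        reg_bits = ORIGINAL_WIDTH_BITS if p >= 8 else 4 * ORIGINAL_WIDTH_BITS
        if p < ORIGINAL_WIDTH_BITS:
            # claim 3: each low-precision register is wider than its MAC precision
            assert reg_bits > p
        print(f"{p:>2}-bit unit: {macs:>3} parallel MACs, {reg_bits}-bit accumulator registers")

Under these assumed values the sketch prints 8 parallel MACs for the 32-bit unit, 16 for the 16-bit unit, 32 for the 8-bit unit, and 64/128/256 for the 4/2/1-bit units, which is how a lower precision keeps the on-chip network bandwidth fully utilized.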
CN202110631644.0A 2021-06-07 2021-06-07 Multi-precision neural network computing device and method based on data flow architecture Active CN113298245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110631644.0A CN113298245B (en) 2021-06-07 2021-06-07 Multi-precision neural network computing device and method based on data flow architecture

Publications (2)

Publication Number Publication Date
CN113298245A CN113298245A (en) 2021-08-24
CN113298245B true CN113298245B (en) 2022-11-29

Family

ID=77327488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110631644.0A Active CN113298245B (en) 2021-06-07 2021-06-07 Multi-precision neural network computing device and method based on data flow architecture

Country Status (1)

Country Link
CN (1) CN113298245B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743600B (en) * 2021-08-26 2022-11-11 南方科技大学 Storage and calculation integrated architecture pulse array design method suitable for multi-precision neural network
CN115994561B (en) * 2023-03-22 2023-06-16 山东云海国创云计算装备产业创新中心有限公司 Convolutional neural network acceleration method, system, storage medium, device and equipment
CN116400982B (en) * 2023-05-26 2023-08-08 摩尔线程智能科技(北京)有限责任公司 Method and apparatus for configuring relay register module, computing device and readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011063824A1 (en) * 2009-11-30 2011-06-03 Martin Raubuch Microprocessor and method for enhanced precision sum-of-products calculation on a microprocessor
WO2014085975A1 (en) * 2012-12-04 2014-06-12 中国科学院半导体研究所 Dynamically reconfigurable multistage parallel single-instruction multi-data array processing system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108564168B (en) * 2018-04-03 2021-03-09 中国科学院计算技术研究所 Design method for neural network processor supporting multi-precision convolution
CN111626414B (en) * 2020-07-30 2020-10-27 电子科技大学 Dynamic multi-precision neural network acceleration unit
CN112257844B (en) * 2020-09-29 2022-04-26 浙江大学 Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof


Similar Documents

Publication Publication Date Title
CN113298245B (en) Multi-precision neural network computing device and method based on data flow architecture
Gokhale et al. Snowflake: An efficient hardware accelerator for convolutional neural networks
Yu et al. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks
CN107862374B (en) Neural network processing system and processing method based on assembly line
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
Ryu et al. Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation
WO2019127731A1 (en) Convolutional neural network hardware acceleration device, convolutional calculation method and storage medium
US20200050918A1 (en) Processing apparatus and processing method
US20180197084A1 (en) Convolutional neural network system having binary parameter and operation method thereof
US20180046901A1 (en) Hardware accelerator for compressed GRU on FPGA
CN113313243B (en) Neural network accelerator determining method, device, equipment and storage medium
CN109446996B (en) Face recognition data processing device and method based on FPGA
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
CN115081598B (en) Operator processing method and device, electronic equipment and computer readable storage medium
CN109325590B (en) Device for realizing neural network processor with variable calculation precision
CN114780481A (en) Reconfigurable processing unit for deep learning
CN114626516A (en) Neural network acceleration system based on floating point quantization of logarithmic block
Wang et al. High-performance mixed-low-precision CNN inference accelerator on FPGA
TWI743648B (en) Systems and methods for accelerating nonlinear mathematical computing
Devic et al. Highly-adaptive mixed-precision MAC unit for smart and low-power edge computing
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN112712168A (en) Method and system for realizing high-efficiency calculation of neural network
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN109102074B (en) Training device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant