CN111091181B - Convolution processing unit, neural network processor, electronic device and convolution operation method - Google Patents

Convolution processing unit, neural network processor, electronic device and convolution operation method

Info

Publication number
CN111091181B
CN111091181B (application CN201911253109.5A)
Authority
CN
China
Prior art keywords
data
convolution
processing unit
module
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911253109.5A
Other languages
Chinese (zh)
Other versions
CN111091181A (en)
Inventor
方攀
陈岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911253109.5A
Publication of CN111091181A
Application granted
Publication of CN111091181B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

An embodiment of the application provides a convolution processing unit, a neural network processor, an electronic device and a convolution operation method. The convolution processing unit is configured to: perform one windowing operation on input data according to a convolution kernel to obtain a first window region, the first window region comprising a first number of layers of first depth data along the depth direction; acquire a plurality of convolution kernels, the plurality of convolution kernels comprising a first number of layers of second depth data along the depth direction; and perform a multiply-accumulate operation on one layer of the first depth data and the same layer of the second depth data of the plurality of convolution kernels to obtain first operation data. The target operation data obtained in this way has the same format as the input data and needs no reshaping, so it can be used directly as the input of the next operation layer (such as a convolution layer or a pooling layer); the format-conversion step is omitted and the overall efficiency of the convolution operation is improved.

Description

Convolution processing unit, neural network processor, electronic device and convolution operation method
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a convolution processing unit, a neural network processor, an electronic device, and a convolution operation method.
Background
Convolutional neural networks (Convolutional Neural Networks, CNN) are a class of feedforward neural networks (Feedforward Neural Networks) that contain convolution calculations and have a deep structure, and are among the representative algorithms of deep learning. The convolution operation is the main computation in a convolutional neural network, and its efficiency directly affects the efficiency of the whole network. In the related art, the efficiency of the convolution operation in convolutional neural networks is not high enough.
Disclosure of Invention
Embodiments of the application provide a convolution processing unit, a neural network processor, an electronic device and a convolution operation method, which can improve the efficiency of convolution operations in a convolutional neural network.
An embodiment of the application discloses a convolution processing unit, configured to:
perform one windowing operation on input data according to a convolution kernel to obtain a first window region, the first window region comprising a first number of layers of first depth data along the depth direction;
acquire a plurality of convolution kernels, the plurality of convolution kernels comprising a first number of layers of second depth data along the depth direction; and
perform a multiply-accumulate operation on one layer of the first depth data and the same layer of the second depth data of the plurality of convolution kernels to obtain first operation data.
The embodiment of the application also discloses a neural network processor, which comprises:
a data caching unit for storing input data; and
a convolution processing unit that acquires the input data through the data caching unit, the convolution processing unit being the convolution processing unit described above.
The embodiment of the application also discloses an electronic device, which comprises:
a system bus; and
a neural network processor, the neural network processor being the neural network processor described above and being connected to the system bus.
The embodiment of the application also discloses a convolution operation method, which comprises the following steps:
performing one windowing operation on input data according to a convolution kernel to obtain a first window region, the first window region comprising a first number of layers of first depth data along the depth direction;
acquiring a plurality of convolution kernels, the plurality of convolution kernels comprising a first number of layers of second depth data along the depth direction;
performing a multiply-accumulate operation on one layer of the first depth data and the same layer of the second depth data of the plurality of convolution kernels to obtain first operation data; and
accumulating the plurality of first operation data corresponding to the plurality of layers of first depth data to obtain target operation data.
In the embodiments of the application, during the convolution operation on input data, the convolution processing unit can multiply-accumulate a first window region of the input data with a plurality of convolution kernels. Specifically, one layer of first depth data of the first window region is multiply-accumulated with the same layer of second depth data of the plurality of convolution kernels to obtain first operation data, and the first operation data of the layers are accumulated into target operation data. The resulting target operation data has the same format as the input data and needs no reshaping, so it can be used directly as the input of the next operation layer (such as a convolution layer or a pooling layer); the format-conversion step is omitted and the overall efficiency of the convolution operation is improved.
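For illustration only, the following NumPy sketch models the layer-by-layer multiply-accumulate scheme summarized above. The depth-last data layout, the shapes and all names are assumptions made for this example, not details taken from the application.

```python
# Minimal sketch, assuming a depth-last (H, W, C) layout for the input data.
import numpy as np

H, W, C = 8, 8, 16          # input height, width, depth (the "first number of layers")
K, N = 3, 32                # kernel size and number of convolution kernels

input_data = np.random.rand(H, W, C).astype(np.float32)   # input data, depth-last
kernels = np.random.rand(N, K, K, C).astype(np.float32)   # N convolution kernels, same depth C

def window_output(y, x):
    """Target operation data for one windowing position: a length-N vector."""
    window = input_data[y:y + K, x:x + K, :]               # first window region
    target = np.zeros(N, dtype=np.float32)
    for c in range(C):                                     # one depth layer at a time
        first_depth = window[:, :, c].ravel()              # first depth data of one layer
        second_depth = kernels[:, :, :, c].reshape(N, -1)  # the same layer of all N kernels
        target += second_depth @ first_depth               # first operation data, accumulated
    return target

out = window_output(0, 0)
print(out.shape)   # (32,): one output position with N channels, depth-last like the input
```

Because each window position directly yields all N output channels in depth order, the output naturally keeps the same depth-last format as the input, which is the point made above.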
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the description of the embodiments will be briefly described below.
Fig. 1 is a schematic diagram of a first architecture of a neural network processor according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a second structure of a neural network processor according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a third structure of a neural network processor according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a fourth structure of a neural network processor according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a fifth structure of a neural network processor according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a sixth structure of a neural network processor according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a seventh structure of a neural network processor according to an embodiment of the present application.
Fig. 8 is a schematic diagram of an eighth structure of a neural network processor according to an embodiment of the present application.
Fig. 9 is a schematic diagram of a ninth structure of a neural network processor according to an embodiment of the present application.
Fig. 10 is a schematic diagram of a tenth structure of a neural network processor according to an embodiment of the present application.
Fig. 11 is a schematic diagram of an eleventh architecture of a neural network processor according to an embodiment of the present application.
Fig. 12 is a schematic diagram of a twelfth structure of a neural network processor according to an embodiment of the present application.
Fig. 13 is a schematic diagram of a thirteenth architecture of a neural network processor according to an embodiment of the present application.
Fig. 14 is a schematic diagram of a fourteenth structure of a neural network processor according to an embodiment of the present application.
Fig. 15 is a schematic diagram of a fifteenth structure of a neural network processor according to an embodiment of the present application.
Fig. 16 is a schematic diagram of a sixteenth structure of a neural network processor according to an embodiment of the present application.
Fig. 17 is a schematic diagram of input data of a convolution processing unit in a neural network processor according to an embodiment of the present application.
Fig. 18 is a schematic diagram of weight data of a convolution processing unit in a neural network processor according to an embodiment of the present application.
Fig. 19 is a schematic diagram of convolution operation of a convolution processing unit in a neural network processor according to an embodiment of the present application.
Fig. 20 is a schematic diagram of another convolution operation of a convolution processing unit in a neural network processor according to an embodiment of the present disclosure.
Fig. 21 is a schematic structural diagram of a chip according to an embodiment of the present application.
Fig. 22 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 23 is a schematic flow chart of a convolution operation method according to an embodiment of the present application.
Fig. 24 is a second flowchart of a convolution operation method according to an embodiment of the present disclosure.
Fig. 25 is a third flowchart of a convolution operation method according to an embodiment of the present disclosure.
Detailed Description
The technical solutions provided by the embodiments of the application can be applied to various scenarios in which an input image needs to be processed to obtain a corresponding output image; the embodiments of the application do not limit these scenarios. For example, the technical solutions provided by the embodiments of the application can be applied to various scenarios in the field of computer vision, such as face recognition, image classification, object detection and semantic segmentation.
Referring to fig. 1, fig. 1 is a schematic diagram of a first architecture of a neural network processor according to an embodiment of the application. The neural network processor (Neural Network Process Unit, NPU) 200 may include a first processing module 210 and an instruction distribution module 220.
The first processing module 210 may include one or more processing units; for example, the first processing module 210 includes a convolution processing unit 212 and a vector processing unit 214. The processing units included in the first processing module 210 in the embodiments of the application can process vector data. It should be noted that the embodiments of the application do not limit the type of data processed by the first processing module 210.
The convolution processing unit 212 may also be referred to as a convolution operation unit or a convolution calculation engine. The convolution processing unit 212 may include a plurality of multiply-add units (Multiplication Add Cell, MAC); the number of multiply-add units may be in the thousands. For example, the convolution processing unit 212 may include 4096 multiply-add units, which may be divided into 16 cells, where each cell can compute a vector inner product of at most 256 elements.
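As a rough sketch of this partitioning, the model below organizes 4096 multiply-add units as 16 cells that each produce one inner product of up to 256 elements per step. The per-cycle behavior and the names are illustrative assumptions, not a description of the actual hardware.

```python
import numpy as np

NUM_CELLS, CELL_WIDTH = 16, 256      # 16 * 256 = 4096 multiply-add units

def mac_array_step(a_vectors, b_vectors):
    """One step: each cell produces one inner product of up to 256 elements."""
    results = []
    for cell in range(NUM_CELLS):
        a = a_vectors[cell][:CELL_WIDTH]
        b = b_vectors[cell][:CELL_WIDTH]
        results.append(float(np.dot(a, b)))   # 256 multiply-adds performed by one cell
    return results

a = [np.random.rand(CELL_WIDTH).astype(np.float32) for _ in range(NUM_CELLS)]
b = [np.random.rand(CELL_WIDTH).astype(np.float32) for _ in range(NUM_CELLS)]
print(len(mac_array_step(a, b)))    # 16 inner products per step
```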
The vector processing unit 214 may also be referred to as a vector calculation unit or a single instruction multiple data (Single Instruction Multiple Data, SIMD) processing unit. The vector processing unit 214 is an element-wise vector calculation engine that can handle conventional arithmetic operations between vectors, such as addition, subtraction, multiplication and division, as well as bit-level logical operations such as AND, OR, NOT and XOR. It should be noted that the vector processing unit 214 of the embodiments of the application may also support common activation functions such as the linear rectification function (Rectified Linear Unit, ReLU) and PReLU. It should further be noted that the vector processing unit 214 may also support the nonlinear activation functions Sigmoid and Tanh through a table-lookup method.
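The following is a minimal sketch of the table-lookup approach mentioned above for a nonlinear activation such as Sigmoid. The table size, input range and indexing scheme are assumptions made for illustration; a real design would fix these parameters in hardware.

```python
import numpy as np

TABLE_SIZE, X_MIN, X_MAX = 256, -8.0, 8.0
xs = np.linspace(X_MIN, X_MAX, TABLE_SIZE)
sigmoid_table = 1.0 / (1.0 + np.exp(-xs))        # table precomputed once

def sigmoid_lut(x):
    """Approximate Sigmoid by indexing into the precomputed table."""
    idx = np.clip(((x - X_MIN) / (X_MAX - X_MIN) * (TABLE_SIZE - 1)).astype(int),
                  0, TABLE_SIZE - 1)
    return sigmoid_table[idx]

x = np.array([-3.0, 0.0, 2.5], dtype=np.float32)
print(sigmoid_lut(x))        # close to 1/(1+exp(-x)) within the table resolution
```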
The instruction dispatch module 220 may also be referred to as an instruction preprocessing module. The instruction distribution module 220 is coupled to the first processing module 210, and the instruction distribution module 220 may be coupled to each processing unit in the first processing module 210, such as the instruction distribution module 220 being coupled to the convolution processing unit 212 and the vector processing unit 214 in the first processing module 210. The instruction dispatch module 220 may issue instructions to the first processing module 210, i.e., the instruction dispatch module 220 may issue instructions to a processing unit of the first processing module 210.
In some embodiments, the instruction dispatch module 220 may issue multiple instructions in parallel to the first processing module 210; for example, it may issue multiple instructions in parallel to the convolution processing unit 212 and the vector processing unit 214, such as within one clock cycle. Thus, the embodiments of the application can support multi-issue instruction operation and execute multiple instructions simultaneously and efficiently; for example, the convolution processing unit 212 and the vector processing unit 214 may execute a convolution calculation instruction and a vector calculation instruction respectively. After the convolution processing unit 212 and the vector processing unit 214 receive their instructions, they process the data they receive according to those instructions to obtain processing results. Therefore, the embodiments of the application can improve the calculation efficiency, or data processing efficiency, of the NPU.
It can be appreciated that the multiple instructions issued by the instruction issue module 220 in parallel correspond to processing units that do not have resource conflicts during execution.
The instructions transmitted by the instruction distribution module 220 may include fine-grained instructions. The instruction distribution module 220 may transmit a fine-grained instruction to the convolution processing unit 212, and after receiving it, the convolution processing unit 212 may perform a vector inner product operation on the data it receives according to that fine-grained instruction.
It should be appreciated that the fine-grained instructions issued by the instruction dispatch module 220 are not limited to the convolution processing unit 212, and that the instruction dispatch module 220 may issue fine-grained instructions into the vector processing unit 214 or other processing units of the first processing module 210.
It should also be appreciated that the instructions that the instruction dispatch module 220 may issue in embodiments of the present application are not limited to fine-grained instructions. The embodiments of the present application are not limited to the instructions transmitted by the instruction distribution module 220. It should be noted that the instructions may include different types, such as a calculation type instruction, a control type instruction, and the like, where the calculation type instruction may include a first calculation instruction, a second calculation instruction, a third calculation instruction, and the like.
The operation corresponding to a fine-grained instruction can be completed precisely within one clock cycle, unlike a coarse-grained processor, where one instruction requires multiple clock cycles to complete. Fine-grained instructions are also reflected in the finer computation granularity of the processing unit: the convolution processing unit 212, for example, may perform a basic vector inner product operation based on one fine-grained instruction, whereas a coarse-grained processor may perform a matrix multiplication based on one instruction; a matrix multiplication can be understood as consisting of multiple vector inner product operations. Therefore, the embodiments of the application can support multi-issue fine-grained instruction operation, which improves programming flexibility and offers better generality.
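The small sketch below illustrates the point just made: a coarse-grained matrix multiplication can be decomposed into many fine-grained vector inner product operations, which is the granularity attributed to the convolution processing unit above. The shapes are arbitrary examples.

```python
import numpy as np

A = np.random.rand(4, 8).astype(np.float32)
B = np.random.rand(8, 5).astype(np.float32)

C = np.zeros((4, 5), dtype=np.float32)
for i in range(4):
    for j in range(5):
        C[i, j] = np.dot(A[i, :], B[:, j])   # one fine-grained inner product per element

assert np.allclose(C, A @ B)                 # equivalent to the coarse-grained matrix multiply
```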
The instruction dispatch module 220 of embodiments of the present application may issue a first computational instruction to the convolution processing unit 212 and a second computational instruction to the vector processing unit 214 in parallel. Such as instruction dispatch module 220 transmitting a first computational instruction to convolution processing unit 212 and a second computational instruction to vector processing unit 214 in one clock cycle. Convolution processing unit 212 may perform a vector inner product operation on the data it receives according to a first computational instruction transmitted by instruction dispatch module 220. And vector processing unit 214 performs vector calculation operations on the data it receives according to the second calculation instructions transmitted by instruction dispatch module 220.
It should be noted that the processing units in the first processing module 210 are not limited to the convolution processing unit 212 and the vector processing unit 214; the first processing module 210 may further include other processing units, for example a shaping processing unit.
Referring to fig. 2, fig. 2 is a schematic diagram of a second structure of a neural network processor according to an embodiment of the application. The first processing module 210 of the neural network processor 200 provided in this embodiment may include a convolution processing unit 212, a vector processing unit 214 and a shaping processing unit 216. For the convolution processing unit 212 and the vector processing unit 214, reference may be made to the description of fig. 1, which is not repeated here. The shaping processing unit may also be referred to as a shaping engine.
The shaping processing unit 216 is connected to the instruction distribution module 220, and the instruction distribution module 220 may also transmit multiple instructions in parallel to the convolution processing unit 212, the vector processing unit 214 and the shaping processing unit 216, for example within one clock cycle. The shaping processing unit 216 processes the data it receives according to the instructions transmitted by the instruction distribution module 220, such as a third calculation instruction. The shaping processing unit 216 may support common Tensor reshape operations, such as dimension transposition, slicing along one dimension, and data filling (Padding).
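As a NumPy sketch of the kinds of reshape operations just listed (dimension transposition, slicing along one dimension, padding); the concrete shapes are arbitrary examples and not taken from the application.

```python
import numpy as np

t = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)

transposed = np.transpose(t, (0, 2, 1))          # dimension transposition: (2, 3, 4) -> (2, 4, 3)
sliced = t[:, 1, :]                              # slicing according to one dimension
padded = np.pad(t, ((0, 0), (1, 1), (2, 2)))     # data filling (Padding) on two axes

print(transposed.shape, sliced.shape, padded.shape)   # (2, 4, 3) (2, 4) (2, 5, 8)
```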
It should be noted that, the instruction transmission of the instruction distribution module 220 is not limited to the first processing module 210. In other embodiments, the instruction dispatch module 220 may also transmit instructions to other processing modules.
Referring to fig. 3, fig. 3 is a schematic diagram of a third structure of a neural network processor according to an embodiment of the application. The neural network processor 200 provided by the embodiment of the present application may include a first processing module 210, a second processing module 230, and an instruction distribution module 220. The first processing module 210 includes at least a convolution processing unit 212, although the first processing module 210 may also include other processing units such as a vector processing unit 214 and a shaping processing unit 216. The convolution processing unit 212 may perform a vector inner product operation on the received data, which may be referred to above and will not be described herein. The vector processing unit 214 may refer to the above, and is not described herein. The shaping processing unit 216 may refer to the above, and is not described herein.
The second processing module 230 may process scalar data and includes at least a scalar processing unit 232 (Scalar Process Unit, SPU). The scalar processing unit 232 may be a processing unit compatible with the RISC-V instruction set. The scalar processing unit 232 may comprise a scalar register file (Scalar Register File, SRF), i.e. a plurality of scalar registers.
The instruction distribution module 220 is connected to the first processing module 210 and the second processing module 230, and the instruction distribution module 220 may transmit a plurality of instructions to the first processing module 210 and the second processing module 230 in parallel. Such as instruction dispatch module 220 may issue multiple instructions in parallel to convolution processing unit 212 and scalar processing unit 232 in one clock cycle.
It should be appreciated that when the first processing module 210 further includes other processing units, the instruction dispatch module 220 may also issue multiple instructions to the other processing units in parallel in one clock cycle. Such as instruction dispatch module 220 emitting multiple instructions to convolution processing unit 212, vector processing unit 214, and scalar processing unit 232 in parallel in one clock cycle, such as instruction dispatch module 220 emitting multiple instructions to convolution processing unit 212, shaping processing unit 216, and scalar processing unit 232 in parallel in one clock cycle, and such as instruction dispatch module 220 emitting multiple instructions to convolution processing unit 212, vector processing unit 214, shaping processing unit 216, and scalar processing unit 232 in parallel in one clock cycle.
It should also be understood that, in actual processes, the instructions transmitted by the instruction distribution module 220 are not limited thereto, and the instruction distribution module 220 may transmit different instructions to multiple processing units in the same processing module in parallel, or transmit different instructions to processing units in different processing modules in parallel, according to the requirements of the neural network processor 200 for processing data. The above are only examples of the technical solutions provided by the embodiments of the present application, in which the instruction distributing unit 220 transmits a plurality of instructions in parallel. The manner in which the instruction issuing unit 220 issues the instruction is not limited to this embodiment. Such as: instruction dispatch unit 220 transmits multiple instructions in parallel to scalar processing unit 232 and vector processing unit 214. For another example: instruction dispatch unit 220 transmits multiple instructions in parallel to shaping processing unit 216 and vector processing unit 214.
The scalar processing unit 232 processes the data it receives according to instructions, such as control instructions, issued by the instruction distribution module 220. The instructions received by the scalar processing unit 232 may be scalar instructions, such as control instructions; the scalar processing unit 232 is mainly responsible for the scalar operations of the neural network processor 200.
Note that, the scalar processing unit 232 may not only receive instructions from the instruction issue module 220, but may also transmit the value of a new Program Counter (PC) to the instruction issue module 220.
Referring to fig. 4, fig. 4 is a schematic diagram of a fourth structure of a neural network processor according to an embodiment of the application. Scalar processing unit 232 may not only receive instructions from instruction dispatch module 220, but may also transmit the value of the new Program Counter (PC) to instruction dispatch module 220. Scalar processing unit 232 may execute scalar calculation instructions (add, subtract, multiply, divide, logic operations), branch instructions (conditional operations), jump instructions (function calls). When processing branch and jump instructions, scalar processing unit 232 returns a new PC value to instruction dispatch module 220 to cause instruction dispatch module 220 to fetch from the new PC the next time the instruction is dispatched.
Referring to fig. 5, fig. 5 is a schematic diagram of a fifth configuration of a neural network processor according to an embodiment of the application. The neural network processor 200 provided in the embodiment of the present application further includes a data storage module (BUF) 240, and the data storage module 240 may store data, such as image data, weight data, and the like.
The data storage module 240 may be connected to the first processing module 210 and the second processing module 230; for example, the data storage module 240 is coupled to the scalar processing unit 232, the convolution processing unit 212, the vector processing unit 214 and the shaping processing unit 216. Data can be transmitted between the data storage module 240 and each of these processing units, for example directly between the data storage module 240 and the convolution processing unit 212, the vector processing unit 214 and the shaping processing unit 216. Thus, the embodiments of the application can implement direct data transmission between the data storage module 240 and each processing unit, such as the convolution processing unit 212 and the vector processing unit 214, which improves the performance of the NPU 200.
The first processing module 210 may process data as follows: upon receiving the instructions transmitted in parallel by the instruction distribution module 220, such as a first calculation instruction and a second calculation instruction, the convolution processing unit 212 and the vector processing unit 214 read the data they need to process, i.e. the data to be processed, from the data storage module 240. The convolution processing unit 212 and the vector processing unit 214 perform processing operations on the data to be processed to obtain a processing result, and store the processing result back into the data storage module 240.
The convolution processing unit 212 and the vector processing unit 214 may process data as follows: upon receiving an instruction such as a first calculation instruction transmitted by the instruction distribution module 220, the convolution processing unit 212 reads the data to be processed from the data storage module 240 according to the first calculation instruction and performs the corresponding operation, such as a vector inner product calculation, to obtain an intermediate calculation result. The convolution processing unit 212 may store the intermediate calculation result in the data storage module 240. The vector processing unit 214 may then acquire the intermediate calculation result from the data storage module 240 and perform a second calculation process on it, such as a pooling operation, to obtain a processing result, which is stored back into the data storage module 240.
The data stored in the data storage module 240 may be raw data and weight data, such as data to be processed, or the data stored in the data storage module 240 may be data requiring at least one processing unit to perform processing such as arithmetic processing. The data stored in the data storage module 240 may be a processing result, or the data stored in the data storage module 240 is data processed by at least one processing unit. It should be noted that, the data actually stored in the data storage module 240 is not limited thereto, and the data storage module 240 may also store other data.
It should be noted that the connection between the convolution processing unit 212 and the vector processing unit 214 is not limited to this; the convolution processing unit 212 and the vector processing unit 214 may also be directly connected through a signal line.
The convolution processing unit 212 and the vector processing unit 214 may also process data as follows: upon receiving an instruction such as a first calculation instruction transmitted by the instruction distribution module 220, the convolution processing unit 212 reads the data to be processed from the data storage module 240 and performs the corresponding operation, such as a vector inner product calculation, to obtain an intermediate calculation result. The convolution processing unit 212 may transmit the intermediate calculation result directly to the vector processing unit 214. The vector processing unit 214 then performs a second calculation process, such as pooling, a subsequent activation or quantization operation, or a fusion with the operation of the next layer, so that the operations of two layers of operators are processed together to obtain a processing result, which is stored into the data storage module 240.
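The sketch below models this fused flow in software: the convolution stage hands its result straight to the vector stage (ReLU plus a simple quantization here) without a round trip through the data storage module. The function names, shapes and the int8 scale are assumptions made for illustration.

```python
import numpy as np

def conv_stage(window, kernels):
    """Vector inner products of one window against all kernels (intermediate result)."""
    return kernels.reshape(kernels.shape[0], -1) @ window.ravel()

def vector_stage(x, scale=0.05):
    """Second calculation process: ReLU activation followed by int8 quantization."""
    x = np.maximum(x, 0.0)
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

window = np.random.rand(3, 3, 16).astype(np.float32)
kernels = np.random.rand(32, 3, 3, 16).astype(np.float32)

result = vector_stage(conv_stage(window, kernels))   # fused: no intermediate store
print(result.dtype, result.shape)                    # int8 (32,)
```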
It should be noted that the convolution processing unit 212 may also be connected to other processing units of the first processing module 210, such as the shaping processing unit 216, through a signal line. The first processing module 210 may transmit the intermediate calculation result obtained by the convolution processing unit 212 directly to the shaping processing unit 216 or another processing unit in the first processing module 210 for further calculation operations. Alternatively, the first processing module 210 may store the intermediate calculation result obtained by the convolution processing unit 212 in the data storage module 240, after which the shaping processing unit 216 or another processing unit in the first processing module 210 acquires the intermediate calculation result from the data storage module 240 and performs further processing operations on it, such as shaping operations, to obtain a processing result, which is then stored into the data storage module 240.
It should be understood that when data is transmitted directly between the processing units of the first processing module 210, the intermediate calculation result need not be stored in the data storage module 240; the data storage module 240 may store only the original data and the weights rather than the intermediate calculation result. This not only saves storage space in the data storage module 240, but also reduces accesses to the data storage module 240, lowers power consumption, and improves the performance of the neural network processor 200.
It should also be appreciated that the manner in which data is processed between the other processing units of the first processing module 210 can be analogous to the manner in which the convolution processing unit 212 and the vector processing unit 214 in the first processing module 210 are described above. The manner in which the data is processed between the other processing units of the first processing module 210 according to the embodiment of the present application is not illustrated here.
The data storage module 240 of the embodiments of the application may store the calculation results. During the operation of the multiple processing units, no fallback to external memory is needed: the calculation result of the previous operator does not have to be written back to external memory, so the bandwidth requirement on the SoC is low. This saves system bandwidth and reduces the calculation delay between operators.
In some embodiments, the data storage module 240 may be a shared storage module. The data storage module 240 may have multiple banks that are accessed in parallel, such as three or four; the division can be made flexibly according to actual needs.
Referring to fig. 6, fig. 6 is a schematic diagram of a sixth structure of a neural network processor according to an embodiment of the application. The neural network processor shown in fig. 6 differs from the neural network processor shown in fig. 5 in that: in fig. 6 the second processing module 230, such as the scalar processing unit 232, is coupled to the instruction distribution module 220 but not to the data storage module 240, whereas in fig. 5 the second processing module 230, such as the scalar processing unit 232, is coupled both to the instruction distribution module 220 and to the data storage module 240. The data that the second processing module 230 in fig. 6, such as the scalar processing unit 232, needs to process may be carried by the instructions it receives, i.e. by the instructions distributed by the instruction distribution module 220. Embodiments of the application may also provide a separate data storage module for the second processing module 230, such as the scalar processing unit 232.
It should be noted that, the data storage module 240 may also be connected to the instruction distribution module 220, where the instruction distribution module 220 determines whether to transmit an instruction according to whether the data storage module 240 stores data to be processed.
Referring to fig. 7, fig. 7 is a schematic diagram of a seventh structure of a neural network processor according to an embodiment of the application. The instruction distribution module 220 is connected to the data storage module 240; the instruction distribution module 220 may send an index to the data storage module 240, and the data storage module 240 returns a signal according to the index sent by the instruction distribution module 220. When data to be processed is stored in the data storage module 240, the data storage module 240 returns a signal indicating that data to be processed is stored, such as "1", to the instruction distribution module 220. When no data to be processed is stored in the data storage module 240, the data storage module 240 returns a signal indicating that no data to be processed is stored, such as "0", to the instruction distribution module 220.
The instruction dispatch module 220 acts differently depending on the different return signals it receives. Such as if the instruction dispatch module 220 receives a "1", the instruction dispatch module 220 determines that the data storage module 240 stores data to be processed, and the instruction dispatch module 220 then transmits the plurality of instructions in parallel. If the instruction distribution module 220 receives "0", the instruction distribution module 220 determines that the data storage module 240 does not store the data to be processed, and the instruction distribution module 220 does not transmit an instruction. Thus, unnecessary instruction distribution can be avoided, and power consumption can be saved.
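Purely as an illustration of this gating behavior, the toy sketch below issues instructions only when the return signal is "1"; the function and instruction names are invented for the example and are not part of the application.

```python
def dispatch(return_signal, instructions):
    """Issue the pending instructions only when the data storage module reports data ("1")."""
    if return_signal == 1:
        return list(instructions)   # data to be processed is stored: issue in parallel
    return []                       # "0": nothing to process, skip issuing and save power

print(dispatch(1, ["conv_mac", "vector_pool"]))   # ['conv_mac', 'vector_pool']
print(dispatch(0, ["conv_mac", "vector_pool"]))   # []
```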
Referring to fig. 8, fig. 8 is a schematic diagram of an eighth structure of a neural network processor according to an embodiment of the application. The neural network processor 200 provided in this embodiment may further include an instruction storage module 250, which may also be referred to as an instruction cache (Instruction Cache, ICache). The instruction storage module 250 may store some fine-grained instructions, such as calculation instructions and control instructions; in other words, the instruction storage module 250 is used to store instructions for the NPU. It should be noted that the instructions stored in the instruction storage module 250 may also be other instructions. The instruction storage module 250 is connected to the instruction distribution module 220 and may send the instructions stored in it to the instruction distribution module 220; alternatively, the instruction distribution module 220 may retrieve multiple instructions from the instruction storage module 250.
The process by which the instruction distribution module 220 retrieves instructions from the instruction storage module 250 may be as follows: the instruction distribution module 220 sends an instruction fetch request to the instruction storage module 250. When the instruction corresponding to the request is found in the instruction storage module 250, the instruction storage module 250 responds to the request by sending that instruction to the instruction distribution module 220, which is an instruction hit. When the instruction corresponding to the request is not found in the instruction storage module 250, the instruction storage module 250 suspends (holds) its response, sends out a fetch request of its own and waits for the instruction to be returned to it; only then does the instruction storage module 250 respond to the original request by sending the instruction to the instruction distribution module 220.
This can also be understood as follows: when the instructions required by the instruction distribution module 220 are already stored in the instruction storage module 250, the instruction distribution module 220 can retrieve them directly from the instruction storage module 250. When at least one instruction required by the instruction distribution module 220 is not in the instruction storage module 250, the required instruction has to be read from another location, such as an external memory, through the instruction storage module 250 and then returned to the instruction distribution module 220.
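The following dict-based model is a simplified sketch of this hit/miss flow, not a description of the actual cache organization: on a hit the instruction comes straight from the instruction storage module, and on a miss it is first fetched from external memory and filled in before the response is returned.

```python
class InstructionCache:
    """Toy model: self.lines plays the role of the instruction storage module."""
    def __init__(self, external_memory):
        self.lines = {}                      # pc -> instruction already cached
        self.external_memory = external_memory

    def fetch(self, pc):
        if pc in self.lines:                 # instruction hit: respond immediately
            return self.lines[pc]
        instr = self.external_memory[pc]     # miss: hold, fetch from external memory
        self.lines[pc] = instr               # fill the cache, then respond
        return instr

icache = InstructionCache({0: "conv_mac", 4: "vector_add"})
print(icache.fetch(0))   # miss: fetched from external memory, then cached
print(icache.fetch(0))   # hit: returned directly from the instruction storage module
```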
It should be noted that, in the embodiment of the present application, the instruction distribution module 220 and the instruction storage module 250 may be two separate parts. Of course, the instruction distribution module 220 and the instruction storage module 250 may also form an instruction preprocessing module, or the instruction distribution module 220 and the instruction storage module 250 may be two parts of the instruction preprocessing module.
It should be further noted that each instruction stored in the instruction storage module 250 has a corresponding type, and the instruction distribution module 220 may transmit instructions based on their type; for example, the instruction distribution module 220 transmits a first type of instruction to the convolution processing unit 212 and a second type of instruction to the scalar processing unit 232. Instruction types include, for example, jump instructions, branch instructions, convolution calculation instructions, vector calculation instructions and shaping calculation instructions.
The instruction storage module 250 of the embodiments of the application is not limited to storing only a portion of the instructions of the NPU 200. The instruction storage module 250 may also store all instructions of the NPU 200, in which case it may be referred to as an instruction memory (IRAM) or a program memory. Upper-layer software, such as an external processor, may write a program directly into the IRAM.
Referring to fig. 9, fig. 9 is a schematic diagram of a ninth structure of a neural network processor according to an embodiment of the application. The neural network processor 200 may further include a data movement module 260, an instruction movement module 270 and a system bus interface 280.
The system bus interface 280 is connected to a system bus, which may be the system bus of an electronic device such as a smartphone. The system bus interface 280 is connected to the system bus to enable data transmission with other processors and external memory, and it can convert internal read and write requests into bus read and write requests conforming to a bus interface protocol, such as the Advanced eXtensible Interface (AXI) protocol.
The data movement module 260 is connected to the system bus interface 280 and the data storage module 240 and is used for moving data: external data can be moved into the data storage module 240, and data in the data storage module 240 can be moved out. For example, the data movement module 260 reads data from the system bus through the system bus interface 280 and writes the data it reads into the data storage module 240. The data movement module 260 may also transfer the data or processing results stored in the data storage module 240 to external memory, for example transferring the processing results of the processing units in the first processing module 210 to external memory. In other words, the data movement module 260 can move data between internal storage and external storage via the system bus interface 280.
The data movement module 260 may be a direct memory access (Direct Memory Access, DMA) unit, which moves data from one address space to another; the address spaces involved may be internal memory or a peripheral interface. Descriptors that control a DMA data move are usually stored in RAM in advance and include information such as the source address space, the destination address space and the data length. After the software initializes the DMA, the data move starts and proceeds independently without involving the NPU, which improves the efficiency of the NPU and reduces its burden.
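As a sketch of the kind of descriptor just described, the structure below carries a source address, destination address and data length, plus a link to the next descriptor; the field names, widths and the software copy model are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DmaDescriptor:
    src_addr: int        # source address space
    dst_addr: int        # destination address space
    length: int          # data length in bytes
    next_desc: int = 0   # address of the next descriptor, 0 if this is the last one

def run_dma(memory, desc):
    """Software model of one move: copy `length` bytes from the source to the destination."""
    data = memory[desc.src_addr:desc.src_addr + desc.length]
    memory[desc.dst_addr:desc.dst_addr + desc.length] = data

mem = bytearray(range(64)) + bytearray(64)
run_dma(mem, DmaDescriptor(src_addr=0, dst_addr=64, length=16))
print(mem[64:80] == mem[0:16])   # True: the data was moved without involving the NPU
```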
The instruction movement module 270 is connected to the system bus interface 280 and the instruction storage module 250 and is used for moving, or reading, instructions, so that external instructions can be moved into the instruction storage module 250. For example, the instruction movement module 270 reads instructions from the system bus through the system bus interface 280 and stores the instructions it reads into the instruction storage module 250. When an instruction is missing from the instruction storage module 250, the instruction storage module 250 requests the instruction movement module 270 to send a read request to the system bus interface 280, so that the corresponding instruction is read and stored into the instruction storage module 250. The instruction movement module 270 may be a direct memory access unit. Of course, all instructions may also be written directly into the instruction storage module 250 through the instruction movement module 270.
Referring to fig. 10, fig. 10 is a schematic diagram of a tenth structure of a neural network processor according to an embodiment of the present application, fig. 10 shows that an instruction storage module 250 is connected to a system bus interface 280, and an external memory may directly store a program or an instruction required by the neural network processor 200 to the instruction storage module 250.
It should be noted that when the instruction storage module 250 is an IRAM, the embodiments of the application may also connect the instruction storage module 250 to external memory through other interfaces, so that an external processor can write instructions or programs directly into the instruction storage module 250.
Thus, in this arrangement the data movement module 260 and the instruction movement module 270 are two separate unit modules that implement the movement of data and of instructions respectively; in other words, two DMAs need to be provided to move data and instructions. The data movement module 260 needs one or more logical channels, and the instruction movement module 270 needs one or more physical channels. The data movement module 260 is described below as an example.
For example, the data mover module 260 of an embodiment of the present application may be a single DMA, defined herein as DMA1; instruction move module 270 may be a separate DMA, defined herein as DMA2. That is, DMA1 can move data and DMA2 can move instructions.
Referring to fig. 11, fig. 11 is a schematic diagram of a first structure of a direct memory access unit in a neural network processor according to an embodiment of the application. The DMA 260a shown in fig. 11 corresponds to a partial schematic structure of the data movement module 260. The DMA 260a includes a plurality of logical channels 262a and an arbitration unit 264a. The logical channels 262a are each coupled to the arbitration unit 264a, and the arbitration unit 264a may be coupled to the system bus via a system bus interface. It should be noted that the arbitration unit 264a may also be connected to at least one of a peripheral device and a memory through other interfaces.
The number of logical channels 262a may be h, where h is a natural number greater than 1; that is, there are at least two logical channels 262a. Each logical channel 262a may receive data movement requests, such as request 1, request 2, ..., request f, and perform data movement operations based on those requests.
Each logical channel 262a of the DMA 260a may perform descriptor generation, parsing, control and so on, depending on the composition of the command request. When several logical channels 262a receive data movement requests at the same time, one request can be selected by the arbitration unit 264a; the selected requests enter the read request queue 266a and the write request queue 268a to wait for the data movement.
A logical channel 262a requires software intervention: descriptors or registers must be configured in advance by software to complete the initialization for data movement. All logical channels 262a of the DMA 260a are visible to, and scheduled by, the software. However, some business scenarios, such as autonomous data movement by an internal engine like the instruction distribution module (or instruction preprocessing module), do not involve software scheduling, and the logical channels 262a of such a DMA 260a cannot serve them. This arrangement is therefore inconvenient to port flexibly according to business requirements and is too dependent on software scheduling.
Based on this, the embodiment of the application also provides a DMA to realize different moving requirements.
Referring to fig. 12, fig. 12 is a schematic diagram of a second structure of a direct memory access unit in a neural network processor according to an embodiment of the application. The direct memory access unit 260b shown in fig. 12 is functionally equivalent to the instruction movement module 270 and the data movement module 260 combined; in other words, it merges the functions of the instruction movement module 270 and the data movement module 260. The direct memory access unit 260b may include at least one logical channel 261b and at least one physical channel 262b arranged in parallel, which can also be understood as the at least one logical channel 261b and the at least one physical channel 262b being connected to the same interface, so that the physical channels 262b and the logical channels 261b can move instructions and data in parallel. Because a request to move instructions over a physical channel 262b is initiated independently by an internal engine such as the instruction distribution module, it does not need to be scheduled by upper-layer software, so the DMA 260b as a whole does not have to rely entirely on software scheduling; this makes data movement more convenient and easier to adapt flexibly to business requirements. It can be appreciated that the embodiments of the application can implement the movement of both instructions and data with a single DMA 260b, which also saves on the number of unit modules.
Wherein the logical channel 261b may perform data movement in response to a movement request scheduled by the upper layer software. The upper level software may be a programmable unit such as a Central Processing Unit (CPU).
The number of the logic channels 261b may be n, and n may be a natural number greater than or equal to 1. Such as one, two, three, etc. logical channels 261 b. It should be noted that the actual number of the logic channels 261b may be set according to the actual product requirement.
The physical channel 262b may perform data movement in response to a movement request from an internal engine, which may be the instruction distribution module, or instruction preprocessing module, of the NPU.
Wherein the number of physical channels 262b may be m, and m may be a natural number greater than or equal to 1. Such as one, two, three, etc. physical channels 262 b. It should be noted that the actual number of physical channels 262b may be set according to the actual product requirement. In some embodiments, the number of logical channels 261b may be two and the number of physical channels 262b may be one.
With continued reference to fig. 12, the DMA 260b may further include a first arbitration unit 263b, which interfaces with the system bus.
Fig. 13 is a schematic diagram of an eleventh architecture of a neural network processor according to an embodiment of the application. The first arbitration unit 263b is connected to a system bus interface 264b; it can be understood that the system bus interface 264b may be equivalent to the system bus interface 280. The first arbitration unit 263b is connected to the system bus via the system bus interface 264b and is also connected to all of the physical channels 262b and all of the logical channels 261b, so that the logical channels 261b and the physical channels 262b can move data and instructions from the system bus. When multiple channels initiate read/write requests at the same time, the first arbitration unit 263b arbitrates so that one read/write request is sent to the system bus interface 264b; for example, when one logical channel 261b and one physical channel 262b both initiate requests, the first arbitration unit 263b may arbitrate that the read/write request of the physical channel 262b, or that of the logical channel 261b, is sent to the system bus interface 264b.
The system bus interface 264b may be disposed outside the DMA 260b. It should be noted that the system bus interface 264b may also be disposed inside the DMA 260b, i.e., the system bus interface 264b may be a part of the DMA 260b.
In some embodiments, the first arbitration unit 263b may reallocate the bandwidth of the at least one physical channel 262b and the at least one logical channel 261 b.
In some embodiments, the logical channel 261b may include a logical channel interface 2612b, a descriptor control module 2614b, and a data transfer module 2616b. The logical channel interface 2612b may be connected to a data storage module such as the data storage module 240 shown in fig. 5, and the logical channel interface 2612b, the descriptor control module 2614b, and the data transmission module 2616b are sequentially connected, and the data transmission module 2616b is also connected to the first arbitration unit 263b to connect to the system bus through the system bus interface 264 b.
The logical channel interface 2612b may be determined by the format in which the upper layer software issues commands, and the command received through the logical channel interface 2612b may carry the address of a descriptor. The descriptor control module 2614b indexes the descriptor according to the command issued by the upper layer software, parses out information such as the data source address, the destination address and the data length, and initiates a read/write data command to the data transmission module 2616b of the DMA 260b. The data transmission module 2616b receives the read/write data command from the previous stage (the descriptor control module 2614b), converts it into the required signals following the read-before-write principle, completes the data movement, and returns a response to the descriptor control module 2614b.
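The descriptor handling described above can be sketched as follows; the Descriptor fields, the descriptor table and the function names are illustrative assumptions, with only the source address, destination address and data length taken from the description.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    src_addr: int   # data source address
    dst_addr: int   # data destination address
    length: int     # data length

# hypothetical descriptor table in memory, keyed by descriptor address
DESCRIPTOR_TABLE = {0x100: Descriptor(src_addr=0x8000_0000, dst_addr=0x1000, length=4096)}

def parse_command(command):
    # the command carries either the address of a descriptor or the descriptor itself
    if "descriptor_addr" in command:
        desc = DESCRIPTOR_TABLE[command["descriptor_addr"]]   # index the descriptor
    else:
        desc = command["descriptor"]                          # parse it directly
    # read-before-write: the data transmission module reads the source first
    return [("read", desc.src_addr, desc.length), ("write", desc.dst_addr, desc.length)]

print(parse_command({"descriptor_addr": 0x100}))
```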
The specific process of moving data by the logic channel 261b is as follows:
First, the control status register (Control Status Register, CSR) 269b of the DMA 260b is configured. It should be noted that the DMA 260b needs several conditions to be satisfied: where the data is to be transmitted from (the source address), where the data is to be transmitted to (the destination address), and when the data is to be transmitted (the trigger source, or trigger signal). The various parameters and conditions of the DMA 260b need to be fully configured before the movement of data can be carried out. The source address, destination address and trigger source may be set by the upper layer software.
In practice, the various parameters and conditions of the DMA 260b may be set in the control status register 269b; in other words, configuration information and parameters of the DMA 260b, such as the operating mode, arbitration priority and interface information, may be set in the control status register 269b. In some embodiments, the control status register 269b may be set with, for example, the address of the peripheral registers, the address of the data memory, the amount of data that needs to be transferred, the priority between the channels, the direction of data transfer, the round-robin mode, the increment mode of peripherals and memory, the data width of peripherals and memory, and so on.
The upper layer software then issues a data movement command (which may also be called a data movement request) for the logical channel 261b of the DMA 260b to the logical channel interface 2612b. When issuing the data movement command to the logical channel 261b of the DMA 260b, the programmable unit carries either the address of the descriptor or the descriptor itself, and the address of the descriptor or the descriptor is transmitted to the descriptor control module 2614b through the logical channel interface 2612b.
If the descriptor control module 2614b receives the address of the descriptor, the descriptor control module 2614b reads the descriptor according to that address, i.e., it indexes the descriptor, and then parses it to generate the information required for the data movement, such as the source address space, the destination address space and the data length. If the descriptor control module 2614b receives the descriptor itself, the descriptor control module 2614b parses the descriptor directly.
After the descriptor control module 2614b has parsed the descriptor, the data transmission module 2616b, following the read-before-write principle, converts the information generated by the descriptor control module 2614b from parsing the descriptor into the signals required by the system bus interface 264b and transmits them to the first arbitration unit 263b.
The first arbitration unit 263b may arbitrate one of the read/write requests sent to the system bus interface 264b when receiving the read/write requests simultaneously initiated by the plurality of logic channels 261b.
When the first arbitration unit 263b receives the read/write request from the logical channel 261b and the read/write request from the physical channel 262b at the same time, the first arbitration unit 263b may also arbitrate a signal sent to the system bus interface 264b and transmit the signal to the system bus through the system bus interface 264b.
After the read/write request of the DMA 260b is transferred to the system bus, the system bus completes the read/write command and the data in the source address space is written into the destination address space, thereby completing the data movement.
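As a compact illustration of the flow just described (configure the control status register, issue a command, parse the descriptor, then read before write), the following sketch models the memory as a byte array; all names and the in-memory "bus" are assumptions made for the sketch.

```python
class ControlStatusRegister:
    def __init__(self):
        self.operating_mode = "normal"   # example configuration fields
        self.priority = 0
        self.trigger = "software"

def logical_channel_move(csr, src, dst, length, memory):
    # the CSR is assumed to have been configured by upper layer software;
    # source, destination and length come from the parsed descriptor
    assert csr.trigger == "software"
    data = memory[src:src + length]          # read before write
    memory[dst:dst + length] = data          # the "bus" completes the write
    return memory

mem = bytearray(range(16)) + bytearray(16)   # 16 bytes of data + empty destination
logical_channel_move(ControlStatusRegister(), 0, 16, 16, mem)
print(list(mem[16:]))                        # the source bytes now sit at the destination
```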
Wherein physical channel 262b may be connected to an internal engine, such as an instruction dispatch module, through an interface that may contain configuration and parameters for instruction movement. Of course, the configuration and parameters of the physical channel 262b for instruction movement may also be configured by the control status register 269 b.
It should be noted that, the DMA260b may also be connected to other components through other structures to implement data movement.
With continued reference to fig. 14 and 15, fig. 14 is a third structural diagram of the direct memory access in a neural network processor according to an embodiment of the present application, and fig. 15 is a twelfth structural diagram of the neural network processor according to an embodiment of the present application. The DMA 260b may also include a second arbitration unit 265b, and the second arbitration unit 265b may be connected to a storage interface 266b. The storage interface 266b may be connected to a storage module (or BUF), which may or may not be located in the same NPU as the DMA 260b; for example, with the DMA 260b located in the NPU, the storage module may also be located in the NPU, or the storage module may be located in another device. The second arbitration unit 265b may be connected to each of the logical channels 261b, and when the first arbitration unit 263b and the second arbitration unit 265b are connected to the same logical channel 261b, they may be connected to that same logical channel 261b through a selector. The storage interface 266b may be provided outside the DMA 260b or inside the DMA 260b.
With continued reference to fig. 14 and 15, the DMA 260b may further include a third arbitration unit 267b, and the third arbitration unit 267b may be connected to a peripheral interface 268b. The peripheral interface 268b may connect to an external device that is located in a different device than the DMA 260b; for example, where the DMA 260b is located in an NPU, the external device may be a CPU or the like. The third arbitration unit 267b may be connected to each logical channel 261b, and when the first arbitration unit 263b and the third arbitration unit 267b are connected to the same logical channel 261b, a selector may be used to connect to that same logical channel 261b. The peripheral interface 268b may be provided outside the DMA 260b or inside the DMA 260b.
With continued reference to fig. 14 and 15, the DMA 260b according to the embodiment of the present application may include a first arbitration unit 263b, a second arbitration unit 265b and a third arbitration unit 267b at the same time. The first arbitration unit 263b is connected to the system bus interface 264b, the second arbitration unit 265b is connected to the storage interface 266b, and the third arbitration unit 267b is connected to the peripheral interface 268b; the first arbitration unit 263b, the second arbitration unit 265b and the third arbitration unit 267b are all connected to the logical channels 261b, and when the first arbitration unit 263b, the second arbitration unit 265b and the third arbitration unit 267b are all connected to one logical channel 261b, a selector is connected between that logical channel 261b and the three arbitration units.
It should be noted that other arbitration units may also be provided in the embodiments of the present application to connect other elements through other interfaces. The number of arbitration units shown in fig. 14 and fig. 15 may actually be one, two or three. That is, when one arbitration unit is provided as shown in fig. 14 and 15, it may be the first arbitration unit 263b, as in fig. 12 and 13. When two arbitration units are provided as shown in fig. 14 and 15, they may be the first arbitration unit 263b and the second arbitration unit 265b, or the first arbitration unit 263b and the third arbitration unit 267b. When three arbitration units are provided as shown in fig. 14 and 15, they may be the first arbitration unit 263b, the second arbitration unit 265b and the third arbitration unit 267b.
Referring to fig. 11, fig. 11 is an eleventh structural diagram of a neural network processor according to an embodiment of the application. Fig. 11 illustrates one connection of the direct memory access 260b of fig. 12 or 14 to other elements of the neural network processor 200. The direct memory access 260b is connected to the system bus interface 280, the instruction memory module 250 and the data memory module 240, the direct memory access 260b can move data to the data memory module 240 through the system bus interface 280, the direct memory access 260b can move instructions to the instruction memory module 250 through the system bus interface 280, and the direct memory access 260b can also move data stored in the data memory module 240 to an external memory through the system bus interface 280.
In the neural network processor 200 according to the embodiment of the present application, the data of the first processing module 210 may be directly stored in the data storage module 240, and the data of the data storage module 240 may also be loaded into the first processing module 210, so that the program is relatively simplified. However, in order to increase the data access speed, the embodiment of the present application may further add a general register between the data storage module 240 and the first processing module 210. A neural network processor having general purpose registers is described in detail below with reference to the accompanying drawings.
Referring to fig. 12, fig. 12 is a schematic diagram of a twelfth structure of a neural network processor according to an embodiment of the application. The neural network processor 200 may also include general registers 290 and a load store module 202.
The general purpose register 290 is connected to the first processing module 210, and the general purpose register 290 may be connected to all processing units in the first processing module 210; for example, the general purpose register 290 is connected to the convolution processing unit 212 and the vector processing unit 214 of the first processing module 210. Both the convolution processing unit 212 and the vector processing unit 214 may obtain the required data from the general purpose register 290, and each may also store its processing result to the general purpose register 290. It should be noted that the processing units in the first processing module 210 are not limited to the number shown in fig. 12; for example, the first processing module 210 may further include a shaping processing unit.
The general purpose register 290 may include a plurality of registers; for example, the general purpose register 290 includes a plurality of vector registers 292, or the general purpose register 290 includes a plurality of prediction registers 294, or the general purpose register 290 includes both a plurality of vector registers 292 and a plurality of prediction registers 294. The plurality of vector registers 292 may be referred to simply as a vector register file (Vector Register File, VRF), and the plurality of prediction registers 294 may be referred to simply as a prediction register file (Predicate Register File, PRF); the prediction registers may also be called predicate registers. The types and numbers of the registers in the general purpose register 290 may be set according to actual requirements, so as to increase the flexibility of software programming.
The convolution processing unit 212 may have dedicated registers 2122 that can store data; for example, the convolution processing unit 212 may have two dedicated registers 2122, of which a first dedicated register stores image data and a second dedicated register stores weights.
The load store module (Load Store Unit, LSU) 202 is coupled to the general purpose register 290, and the load store module 202 can load data into the general purpose register 290 so that the processing units of the first processing module 210 can obtain data from the general purpose register 290. The load store module 202 may also be coupled to a dedicated register 2122 of the convolution processing unit 212, and the load store module 202 may load data directly into that dedicated register 2122 to facilitate processing of the data, such as convolution processing, by the convolution processing unit 212, thereby increasing the speed of loading data.
It should be noted that, fig. 12 only shows a part of elements of the neural network processor 200, and other elements of the neural network processor 200 shown in fig. 12 may refer to fig. 1 to 11, and in order to describe the relationship between the load storage module 202 and the general register 290 and other elements of the neural network processor 200 in detail, the following description will refer to fig. 13.
Referring to fig. 13, fig. 13 is a schematic diagram of a thirteenth configuration of a neural network processor according to an embodiment of the application. The load store module (LSU) 202 connects the general purpose register 290 to the data storage module 240. The load store module 202 may load the data of the data storage module 240 into the general purpose register 290, and the processing units of the first processing module 210, such as the convolution processing unit 212, the vector processing unit 214 and the shaping processing unit 216, may retrieve the data to be processed from the general purpose register 290 according to their instructions. The general purpose register 290 may be connected with a plurality of processing units; for example, the general purpose register 290 is connected with the convolution processing unit 212, and the general purpose register 290 is also connected with at least one of the vector processing unit 214 and the shaping processing unit 216. Thus, the convolution processing unit 212, the vector processing unit 214 and the shaping processing unit 216 may each obtain the data to be processed from the general purpose register 290.
The convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216 may also each store the respective processing results to the general purpose register 290. The load store module 202 may then store the processing results in the general purpose registers 290 to the data store module 240, and the data store module 240 may transfer the processing results to an external memory via a direct memory access or data transfer module 260.
It should be noted that, in the embodiment of the present application, the second processing module 230, such as the scalar processing unit 232, is not connected to the general purpose register 290, and as described above, the data that the scalar processing unit 232 needs to process may be carried by the instruction received by the second processing module. Scalar processing unit 232 may also be coupled to data storage module 240 to obtain the data from data storage module 240 for processing in accordance with embodiments of the present application.
The load store module 202 of the present embodiment may load the data of the data storage module 240 not only into the general purpose register 290 but also to other locations. For example, the load store module 202 may also be directly connected to the convolution processing unit 212; a direct connection here means that no general purpose register 290 is connected between the load store module 202 and the convolution processing unit 212 as described above. This connection may be understood as a connection between the load store module 202 and a dedicated register 2122 of the convolution processing unit 212; for example, the load store module 202 is connected to one of the dedicated registers 2122 of the convolution processing unit 212, and the load store module 202 may directly load data of the data storage module 240, such as weights, into that dedicated register 2122. It will be appreciated that the load store module 202 may also load other data, such as image data, directly into one of the dedicated registers 2122 of the convolution processing unit 212.
Thus, the load store module 202 may load the data of the data store module 240 directly to the convolution processing unit 212, the load store module 202 may also store the data of the data store module 240 to the general purpose register 290, and the processing unit of the first processing module 210, such as the convolution processing unit 212, may obtain corresponding data from the general purpose register 290 based on the instruction received by the processing unit. Such as the load store module 202 may load the first data directly to the convolution processing unit 212, the load store module 202 may store the second data to the general purpose registers 290, and the convolution processing unit 212 may retrieve the second data from the general purpose registers 290. The types of the first data and the second data may be different, such as the first data being a weight and the second data being image data. Therefore, the convolution processing unit 212 of the embodiment of the application can receive the data to be processed from different paths, and compared with the convolution processing unit 212 receiving the data to be processed from the same path, the convolution processing unit 212 can increase the data loading speed, and further can increase the operation speed of the neural network processor 200. Moreover, embodiments of the present application may simplify the instruction set such that it is easy to implement. Meanwhile, the embodiment of the application is easier to optimize the compiler.
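A minimal sketch of the two load paths described above is given below, with hypothetical class and method names; the first data (e.g. weights) goes straight to a dedicated register of the convolution processing unit, while the second data (e.g. image data) is staged in the general purpose registers.

```python
class ConvolutionUnit:
    def __init__(self):
        self.weight_reg = None   # dedicated register, e.g. for weights
        self.image_reg = None    # dedicated register, e.g. for image data

class LoadStoreModule:
    def __init__(self, conv_unit, general_registers):
        self.conv = conv_unit
        self.grf = general_registers        # general purpose register file (a dict here)

    def load_direct(self, first_data):
        # first path: straight into the convolution unit's dedicated register
        self.conv.weight_reg = first_data

    def load_to_general_register(self, name, second_data):
        # second path: stage the data in a general purpose register
        self.grf[name] = second_data

conv, grf = ConvolutionUnit(), {}
lsu = LoadStoreModule(conv, grf)
lsu.load_direct([0.5, -0.25])               # e.g. weights, loaded directly
lsu.load_to_general_register("v0", [1, 2])  # e.g. image data, via the register file
conv.image_reg = grf["v0"]                  # the convolution unit fetches it by instruction
```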
It should be noted that, after the load store module 202 loads the first data directly into the convolution processing unit 212 and the load store module 202 loads the second data into the general purpose register 290, the second data may also be obtained from the general purpose register 290 by other processing units of the first processing module 210, such as the vector processing unit 214.
It should also be noted that the load store module 202 may also load other data, such as third data, into the general purpose register 290, which may be retrieved from the general purpose register 290 by one or more processing units of the first processing module 210, such as the shaping processing unit 216. The third data may be of a different type than both the first data and the second data.
The load store module 202 is further connected to the instruction dispatch module 220; the load store module 202 may receive instructions issued by the instruction dispatch module 220, and may store data of the data storage module 240 to the general purpose register 290 or/and load it to the convolution processing unit 212 according to the instructions issued by the instruction dispatch module 220. The load store module 202 may also store the processing results held in the general purpose register 290, such as the result of the vector processing unit 214, back to the data storage module 240 according to instructions issued by the instruction dispatch module 220.
It should be noted that, the instruction issue module 220 may issue multiple instructions to the first processing module 210, the second processing module 230, and the load store module 202 in parallel in one clock cycle. Such as instruction dispatch module 220 may issue multiple instructions in parallel to scalar processing unit 232, convolution processing unit 212, vector processing unit 214, and load store module 202 in one clock cycle.
Wherein the load store module 202 and the data store module 240 may be integrated together as two parts of one module. Of course, the load store module 202 and the data store module 240 may be provided separately, or the load store module 202 and the data store module 240 may not be integrated into one module.
Referring to fig. 14, fig. 14 is a schematic diagram illustrating a fourteenth structure of a neural network processor according to an embodiment of the application. The neural network processor 200 may also include a data movement engine 204. The data movement engine 204 may also be referred to as a register file data mover (MOVE). The data movement engine 204 can move data between different registers, so that the processing units of the first processing module 210, such as the convolution processing unit 212, and the processing units of the second processing module 230, such as the scalar processing unit 232, can obtain the required data from within the NPU 200 for processing, without needing to transfer the data out of the NPU 200 to be processed by upper layer software and then returned to the NPU 200. In other words, the data movement engine 204 implements data interaction between different registers, which saves the process of transmitting some data of the NPU 200 to the outside, reduces the interaction between the NPU 200 and upper layer software such as a CPU, and improves the efficiency with which the NPU 200 processes data. At the same time, the workload of the external CPU can be reduced.
The data movement engine 204 is connected to the general register 290 and to the scalar processing unit 232 of the second processing module 230; the scalar processing unit 232 has been described above and that description applies here. The scalar processing unit 232 includes a plurality of scalar registers 2322, referred to as a scalar register file for short, and the scalar processing unit 232 is connected to the data movement engine 204 via the scalar registers 2322. The general purpose register 290 has a plurality of registers, referred to as a register file for short, and the general purpose register 290 is connected to the data movement engine 204 via that register file. All of the registers of the general purpose register 290 may be connected to the data movement engine 204, or only some of them may be connected to the data movement engine 204.
Referring to fig. 15, fig. 15 is a schematic diagram of a fifteenth structure of a neural network processor according to an embodiment of the application. The general purpose register 290 in the neural network processor 200 may include a plurality of vector registers 292, referred to as a vector register file for short. In the embodiments of the present application, all of the vector registers 292 may be connected to the data movement engine 204, or only a portion of the vector registers 292 may be connected to the data movement engine 204, where a portion is understood to mean at least one but not all of the vector registers.
The general purpose register 290 in the neural network processor 200 may include a plurality of prediction registers 294, which may be referred to as a prediction register file or a predicate register file. All of the prediction registers 294 may be connected to the data movement engine 204, or only a portion of the prediction registers 294 may be connected to the data movement engine 204.
Note that when the general purpose register 290 includes a plurality of types of registers, the general purpose register 290 may be connected to the data movement engine 204 through all of those types of registers, or only through some of them; for example, when the general purpose register 290 of the neural network processor 200 includes a plurality of vector registers 292 and a plurality of prediction registers 294, the general purpose register 290 may be connected to the data movement engine 204 only through the plurality of vector registers 292.
It should be noted that fig. 14 and fig. 15 only show some elements of the neural network processor 200, and the other elements of the neural network processor 200 shown in fig. 14 and fig. 15 may refer to fig. 1 to fig. 13. In order to describe in detail the relationship between the data movement engine 204 and the other elements, and how the data movement engine 204 specifically implements the movement of data, the following description is made with reference to fig. 16.
Referring to fig. 16, fig. 16 is a schematic diagram of a sixteenth structure of a neural network processor according to an embodiment of the application. Some data of the neural network processor 200, such as the data processed by the convolution processing unit 212, the vector processing unit 214, or the shaping processing unit 216 of the first processing module 210, may be stored in the general register 290 when the data needs to be scalar calculated, and the data moving engine 204 may move the data to the scalar processing unit 232, where the scalar processing unit 232 performs scalar calculation on the data. When the scalar processing unit 232 finishes calculating the data to obtain a calculation result, the data movement engine 204 may move the calculation result to the general register 290, and the corresponding processing unit in the first processing module 210 may obtain the calculation result from the general register 290. Therefore, in the NPU200 of the embodiment of the present application, the data is moved inside the NPU200, and compared with the NPU200 that transmits the data to the outside, the data is returned to the NPU200 after being processed by external upper software such as a CPU, so that interaction between the NPU200 and the outside can be reduced, and the efficiency of the NPU200 in processing the data can be improved.
For example, when data processed by the convolution processing unit 212, the vector processing unit 214 or the shaping processing unit 216 of the first processing module 210 requires a scalar calculation, such as when an intermediate result obtained by the convolution processing unit 212, the vector processing unit 214 or the shaping processing unit 216 requires a judgment operation, this judgment may be performed by the scalar processing unit 232. In other words, when data stored in the general register 290 needs to be judged, the data movement engine 204 moves the data to be judged to the scalar registers 2322 of the scalar processing unit 232, where the judgment is performed.
When some data of the neural network processor 200, such as scalar data of the scalar processing unit 232, needs to be converted into vector data, the data movement engine 204 may move the scalar data to the general purpose register 290, and a corresponding processing unit in the first processing module 210, such as the vector processing unit 214, may obtain the scalar data from the general purpose register 290 and convert it into vector data. It should be noted that converting scalar data into vector data may also be referred to as expanding scalar data into vector data; for example, one 32-bit datum is duplicated into 16 identical data to form a 512-bit vector.
In practical applications, the instruction distribution module 220 is connected to the data movement engine 204; the instruction distribution module 220 may transmit an instruction to the data movement engine 204, and the data movement engine 204 performs the corresponding data movement operation according to the instruction it receives from the instruction distribution module 220. For example, the instruction distribution module 220 transmits a first instruction to the data movement engine 204, and the data movement engine 204 moves the data of the general purpose register 290 to the scalar registers 2322 of the scalar processing unit 232 in accordance with the first instruction. For another example, the instruction distribution module 220 transmits a second instruction to the data movement engine 204, and the data movement engine 204 moves the data of the scalar registers 2322 to the general purpose register 290 in accordance with the second instruction.
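The two move directions triggered by the first and second instructions can be sketched as follows; the instruction encoding and register names are assumptions for illustration only.

```python
def execute_move(instruction, general_regs, scalar_regs):
    op, src, dst = instruction
    if op == "grf_to_srf":      # first instruction: general register -> scalar register
        scalar_regs[dst] = general_regs[src]
    elif op == "srf_to_grf":    # second instruction: scalar register -> general register
        general_regs[dst] = scalar_regs[src]
    return general_regs, scalar_regs

grf, srf = {"v0": 7}, {}
execute_move(("grf_to_srf", "v0", "s0"), grf, srf)  # data sent for a scalar judgement
execute_move(("srf_to_grf", "s0", "v1"), grf, srf)  # scalar result sent back for expansion
print(grf, srf)
```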
It should be noted that, the instruction distribution module 220 may transmit multiple instructions to the first processing module 210, the second processing module 230, the load store module 202, and the data movement engine 204 in parallel in one clock cycle. Such as instruction dispatch module 220 may issue multiple instructions in parallel to convolution processing unit 212, vector processing unit 214, scalar processing unit 232, load store module 202, and data mover engine 204 in one clock cycle.
The neural network processor 200 may perform convolutional neural network operations, cyclic neural network operations, and the like, and taking convolutional neural network operations as an example, the neural network processor 200 may obtain data to be processed (such as image data) from the outside, and the convolutional processing unit 212 in the neural network processor 200 may perform convolutional processing on the data to be processed. The input of the convolution layer in the convolution neural network comprises input data (such as data to be processed obtained from the outside) and weight data, and the main calculation flow of the convolution layer is to carry out convolution operation on the input data and the weight data to obtain output data. The main body of the convolution operation is a convolution processing unit, and it can be understood that the convolution processing unit of the neural network processor performs convolution operation on the input data and the weight data to obtain output data. It should be noted that the weight data may be understood as one or more convolution kernels in some cases. The convolution operation is described in detail below.
Referring to fig. 17 and fig. 18, fig. 17 is a schematic diagram of input data of a convolution processing unit in a neural network processor according to an embodiment of the present application, and fig. 18 is a schematic diagram of weight data of the convolution processing unit in the neural network processor according to an embodiment of the present application. The size of the input data is H×W×C1 and the size of the weight data is K×R×S×C2, where H is the height of the input data, W is the width of the input data, C1 is the depth of the input data, K is the number of outputs of the weight data, i.e. the number of convolution kernels, R is the height of the weight data, i.e. the height of a convolution kernel, S is the width of the weight data, i.e. the width of a convolution kernel, and C2 is the depth of the weight data, i.e. the depth of a convolution kernel. C2 of the weight data and C1 of the input data are equal, because both are the corresponding depth values; for ease of understanding, C2 and C1 are both replaced with C below, i.e. C2 = C1 = C. The input data size may also be N×H×W×C, where N is the number of batches of input data.
Referring to fig. 19, fig. 19 is a schematic diagram illustrating the convolution operation of a convolution processing unit in a neural network processor according to an embodiment of the present application. The convolution processing unit first performs windowing on the input data according to the size of a convolution kernel, performs a multiply-accumulate operation on the window area obtained by the windowing and one convolution kernel in the weight data to obtain one piece of data, then slides the window in the W direction and the H direction respectively and performs multiply-accumulate operations to obtain H'×W' data, and finally traverses the K convolution kernels to obtain K×H'×W' data. The specific operation steps may be as follows (these may also be understood as the specific steps by which the convolution processing unit performs the convolution operation):
1. Windowing input data according to the size of a convolution kernel from a starting point (W=0, H=0) to obtain a window area;
2. selecting one uncomputed convolution kernel from the K convolution kernels;
3. performing dot multiplication on the window area after window taking and the convolution kernel, and then accumulating to obtain data;
4. sliding the window in the W direction to obtain a new window (the size of the window is unchanged);
5. sequentially repeating the steps 3 and 4 until the boundary in the W direction, so as to obtain W' data;
6. returning to the starting point of the W direction, and sliding the window in the H direction according to a step length to obtain a new window (the size of the window is unchanged);
7. repeating steps 3-6 until the boundary in the H direction is reached, so as to obtain H'×W' data, where steps 3-5 still need to be repeated after the last slide in the H direction reaches the boundary;
8. repeating steps 2-7 to traverse the K convolution kernels, thereby calculating K×H'×W' data.
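Steps 1-8 above can be written out, purely for illustration, as a plain nested loop; stride 1 and no padding are assumed, and numpy is used only for the elementwise multiply and sum.

```python
import numpy as np

def conv_per_kernel(inp, kernels, stride=1):
    # inp: H x W x C input data; kernels: K x R x S x C weight data
    H, W, C = inp.shape
    K, R, S, _ = kernels.shape
    H_out = (H - R) // stride + 1
    W_out = (W - S) // stride + 1
    out = np.zeros((K, H_out, W_out))
    for k in range(K):                       # step 2: pick one uncomputed kernel
        for h in range(H_out):               # step 6: slide the window along H
            for w in range(W_out):           # step 4: slide the window along W
                window = inp[h*stride:h*stride+R, w*stride:w*stride+S, :]
                out[k, h, w] = np.sum(window * kernels[k])   # step 3: dot product + accumulate
    return out

out = conv_per_kernel(np.ones((5, 7, 4)), np.ones((2, 3, 3, 4)))
print(out.shape)   # (2, 3, 5): K x H' x W'
```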
The size (L×M) of the multiply-accumulate array (MAC Array) used for the convolution operation in the convolution processing unit is fixed, where L is the length of the multiply-accumulate operation and M is the number of units performing multiply-accumulate operations in parallel; it can be understood that M multiply-accumulate operations of length L can be performed in one cycle. The steps for assigning the multiply-accumulate operations in the above convolution operation procedure (e.g., steps 3-4 above) to the convolution processing unit for parallel computation are as follows (which can also be understood as the specific steps by which the convolution processing unit performs multiply-accumulate operations using the multiply-accumulate array):
1. Windowing input data according to the size of a convolution kernel on a HW plane, and dividing the input data into C/L data segments with the length of L in the depth direction;
2. sliding the window body along the W direction, dividing the input data into C/L data segments with the length of L in the depth direction, and continuously sliding the window body along the W direction for M-2 times to obtain M groups of input data, wherein each group has C/L data segments;
3. dividing the convolution kernel into C/L data segments with the length of L in the depth direction, and performing the operation on K convolution kernels in the weight data to obtain K groups of weight data, wherein each group has C/L data segments;
4. taking the ith (i=1, 2, …, C/L) data segment of the M groups of input data to obtain M input data segments;
5. taking the ith (i=1, 2, …, C/L) data segment in the f (f=1, 2, …, K) data of the K sets of weight data to obtain a weight data segment;
6. performing multiply-accumulate operation on M input data segments (depth L) and 1 weight data segment (weight data broadcast multiplexing) by using a MAC array (L×M) to obtain partial results of M outputs;
7. incrementing i and repeating steps 4, 5 and 6; the M output data are added to the M data calculated previously so as to obtain M complete output results, where i is incremented from 1 to C/L.
The order of the steps may be adjusted as needed. For example, the order of steps 2 and 3 may be reversed. For another example, the steps of steps 4 and 5 may be reversed.
In this embodiment, by dividing the input data and the weight data, the MAC array can perform a multiply-accumulate operation on the data of M window regions and one convolution kernel at a time, so the MAC array can be fully utilized to complete the convolution operation quickly. In this embodiment, C is greater than L, K is greater than L and W is greater than M; when one or more of C/L, K/L and W/M does not divide evenly, the quotient of the non-dividing portion needs to be rounded up, specifically by taking its integer part and adding 1.
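To make the broadcasting of one weight segment across M window positions concrete, the sketch below computes the partial results of steps 4-6 above for one row of M windows; it assumes stride 1, that C is a multiple of L and that M windows fit along W, and it collapses the per-tap sequencing of a real MAC array into one numpy sum per depth segment.

```python
import numpy as np

def mac_tile(inp, kernel, h, w0, L, M, R, S):
    # inp: H x W x C, kernel: R x S x C; returns partial outputs for the M
    # windows whose top-left corners are at columns w0 .. w0+M-1 on row h
    C = inp.shape[2]
    acc = np.zeros(M)
    for i in range(C // L):                          # loop over the C/L depth segments
        k_seg = kernel[:, :, i*L:(i+1)*L]            # 1 weight segment, broadcast to M windows
        for m in range(M):                           # the M parallel MAC units
            window_seg = inp[h:h+R, w0+m:w0+m+S, i*L:(i+1)*L]
            acc[m] += np.sum(window_seg * k_seg)     # length-L multiply-accumulates
    return acc

inp = np.ones((5, 10, 8))
kernel = np.ones((3, 3, 8))
print(mac_tile(inp, kernel, h=0, w0=0, L=4, M=4, R=3, S=3))   # four partial outputs
```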
Of course, the convolution processing unit may also adopt other convolution operation modes. Another embodiment of the convolution operation is described in detail below. Referring to fig. 20, fig. 20 is another convolution operation schematic diagram of a convolution processing unit in the neural network processor according to the embodiment of the present disclosure. The input data size is still H×W×C and the weight data (one or more convolution kernels) size is still K×R×S×C. Of course, the input data size may also be N×H×W×C, where N is the number of batches of input data.
The convolution processing unit first performs windowing on the input data according to the size of the convolution kernels, performs a multiply-accumulate operation on the first window area obtained by the windowing and all the convolution kernels in the weight data to obtain data, and then slides the window in the W direction and the H direction respectively and performs multiply-accumulate operations to obtain H'×W'×K data. The specific operation steps are as follows (these may also be understood as the specific steps by which the convolution processing unit performs the convolution operation):
1. windowing input data according to the size (R x S) of a convolution kernel from a starting point (W=0, H=0) to obtain a first window region (R x S x C);
2. multiplying and accumulating the windowed first window region and the K convolution kernels respectively to obtain K data;
3. sliding to obtain a new first window area according to a first sliding step length in the W direction (the size of the first window area is unchanged), wherein the first sliding step length can be set according to the requirement;
4. steps 2 and 3 are repeated in sequence until the boundary in the W direction, thus obtaining W'×K data, where W' = (W-S)/first sliding step + 1. For example, if W=7, S=3 and the first sliding step is 2, then W'=3. For another example, if W=7, S=3 and the first sliding step is 1, then W'=5;
5. Returning to the starting point of the W direction, sliding the window in the H direction according to a second sliding step length, where the second sliding step length in the H direction may be set as required, so as to obtain a new first window area (the size of the first window area is unchanged); for example, after sliding the window in the H direction by one second sliding step length (with the second sliding step length in the H direction being 1), the coordinates may be (W=0, H=1).
6. Steps 2-5 are repeated until the boundary in the H direction, thus obtaining H'×W'×K data. It should be noted that the window is slid along the W direction until the W direction boundary each time; after the window is slid in the H direction for the last time and the boundary is reached, the window is still slid in the W direction until the W direction boundary (i.e. repeating steps 2-4).
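For illustration, the second mode of steps 1-6 above can be sketched as the loop below; note that the output is laid out depth-last as H'×W'×K, matching the input layout. The stride handling, the absence of padding and the use of numpy are assumptions of the sketch.

```python
import numpy as np

def conv_per_window(inp, kernels, stride_w=1, stride_h=1):
    # inp: H x W x C input data; kernels: K x R x S x C weight data
    H, W, C = inp.shape
    K, R, S, _ = kernels.shape
    H_out = (H - R) // stride_h + 1
    W_out = (W - S) // stride_w + 1          # W' = (W - S) / first sliding step + 1
    out = np.zeros((H_out, W_out, K))        # depth-last layout: H' x W' x K
    for h in range(H_out):                   # step 5: slide along H
        for w in range(W_out):               # step 3: slide along W
            window = inp[h*stride_h:h*stride_h+R, w*stride_w:w*stride_w+S, :]
            for k in range(K):               # step 2: all K kernels against this window
                out[h, w, k] = np.sum(window * kernels[k])
    return out

print(conv_per_window(np.ones((5, 7, 4)), np.ones((6, 3, 3, 4)), stride_w=2).shape)  # (3, 3, 6)
```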
The convolution operation unit comprises a multiply-accumulate array (MAC Array) for the convolution operation. The size (L×M) of the multiply-accumulate array is fixed, where L is the length of the multiply-accumulate operation and M is the number of units performing multiply-accumulate operations in parallel; it can be understood that M multiply-accumulate operations of length L can be performed in one cycle. The steps for assigning the multiply-accumulate operation in the above convolution operation process (i.e. step 2 above) to the convolution operation unit for parallel operation are as follows (which can also be understood as the specific steps by which the convolution processing unit performs multiply-accumulate operations using the multiply-accumulate array):
1. Windowing the input data according to the convolution kernel size (R×S) on the HW plane from the starting point (W=0, H=0) to obtain a first window area, and dividing the first window area into C/L data segments of length L in the depth direction; it should be noted that, after the first window area is obtained, the first window area may be divided into C/L data segments of length L, or the input data may first be divided into C/L data segments of length L and the first window area obtained afterwards, in which case the first window area already includes C/L data segments of length L; it can be understood that the first window area may include first depth data of C/L layers in the depth direction;
2. dividing the convolution kernel into C/L data segments with the length of L in the depth direction, and performing the operation on K convolution kernels in the weight data to obtain K groups of weight data, wherein each group has C/L weight data segments; it is understood that each convolution kernel includes C/L weight data segments of length L in the depth direction; the K convolution kernels can be further divided into K/M convolution kernel groups, and each convolution kernel group comprises weight data of M convolution kernels;
3. taking first depth data of an ith (i=1, 2, …, C/L) layer of a first window area of input data to obtain 1 first depth data;
4. Taking the second depth data of the ith (i=1, 2, …, C/L) layer of the f (f=1, 2, …, K/M) group convolution kernel group to obtain M second depth data;
5. performing multiply-accumulate operation on the 1 first depth data and the M second depth data (weight data broadcast multiplexing) by using the MAC array to obtain M first operation data; the M weight data segments are weight data segments of M convolution kernels;
6. increasing i, and repeating the step 3-5, wherein the output M pieces of first operation data are added to the M pieces of first operation data calculated before, so as to obtain M pieces of target operation data; wherein i starts from 1 and increases to C/L;
7. incrementing f and repeating steps 3-6; after K/M calculations are completed, K outputs are obtained, where f starts from 1 and increases to K/M.
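The tiling of steps 1-7 above can be sketched for a single first window area as follows; it assumes that C is a multiple of L and K is a multiple of M, and it again collapses the per-tap sequencing of the MAC array into a numpy sum per depth segment.

```python
import numpy as np

def mac_tile_per_window(window, kernels, L, M):
    # window: R x S x C (one first window region); kernels: K x R x S x C
    C = window.shape[2]
    K = kernels.shape[0]
    out = np.zeros(K)
    for f in range(K // M):                              # step 7: loop over kernel groups
        acc = np.zeros(M)
        for i in range(C // L):                          # step 6: loop over depth segments
            first_depth = window[:, :, i*L:(i+1)*L]      # 1 input segment (broadcast M times)
            for m in range(M):                           # the M parallel MAC units
                second_depth = kernels[f*M + m, :, :, i*L:(i+1)*L]
                acc[m] += np.sum(first_depth * second_depth)   # step 5
        out[f*M:(f+1)*M] = acc                           # M target data per kernel group
    return out

window = np.ones((3, 3, 8))
kernels = np.ones((8, 3, 3, 8))
print(mac_tile_per_window(window, kernels, L=4, M=4))    # K = 8 outputs for this window
```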
The height H, width W and depth C of the input data are arbitrary, i.e. the size of the input data can take a very large number of formats. For example, the width W of the input data is not fixed, so dividing W by the number M of units of the multiply-accumulate array that perform multiply-accumulate operations in parallel does not, in most cases, yield an integer, which wastes part of the multiply-accumulate units during the operation. In this embodiment, the number K of convolution kernels is divided by the number M of parallel multiply-accumulate units instead. The number K of convolution kernels is generally a fixed number and is a power of two (i.e. 2^n), or one of a limited set of values (e.g. K is one of 32, 64, 128, 256), so when the multiply-accumulate units are provided, the number M of units may be set to be equal to K or such that K is an integral multiple of M, e.g. M is one of 32, 64, 128, etc. This embodiment can therefore make full use of the multiply-accumulate units, reduce their waste and improve convolution operation efficiency. Moreover, in this embodiment, mapping the number M of multiply-accumulate units to the number K of convolution kernels is a division along a single dimension; if the number M of multiply-accumulate units were instead mapped to a sliding window area, the mapping would involve not only the width W dimension but also the H dimension, which is unfavorable for folding.
In addition, the output target operation data in this embodiment has the format H'×W'×K, which is the same as the format of the input data, and can be directly used as the input data of the next operation layer (such as the next convolution layer or the next pooling layer) without reshaping the input data. The target operation data is continuous in the depth direction, so it can be stored as continuous data; subsequent reads of it remain continuous, and the hardware does not need to compute addresses multiple times when loading, which optimizes calculation efficiency.
In this embodiment, C is greater than L and K is greater than M; when one or both of C/L and K/M does not divide evenly, the quotient of the non-dividing portion needs to be rounded up, specifically by taking its integer part and adding 1. Illustratively, L and M in the multiply-accumulate array (MAC Array) take the same value, e.g. 64. The input data is padded in the depth direction at a granularity of 64: it is divided into 1×1×64 data blocks in the depth direction, and when the depth is less than 64 it is padded to 64, so that the data organization is N×H×W×(c×C'), where c = 64 and C' is C divided by c, rounded up. The weight data is likewise padded in the depth direction at a granularity of 64: it is divided into 1×1×64 data blocks along the depth direction, padded to 64 when the depth is less than 64, and divided into groups at a granularity of 64 when the number of convolution kernels is greater than 64. The adjusted data organization is R×S×(c×C')×(k×K'), where c = 64, C' is C divided by c rounded up, k = 64 and K' is K divided by k rounded up.
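The rounding and padding described above amount to the following small calculation; the function name and the printed example values are illustrative only.

```python
import math

def padded_organization(N, H, W, C, R, S, K, c=64, k=64):
    C_prime = math.ceil(C / c)                        # C divided by c, rounded up
    K_prime = math.ceil(K / k)                        # K divided by k, rounded up
    input_shape = (N, H, W, c * C_prime)              # N x H x W x (c x C')
    weight_shape = (R, S, c * C_prime, k * K_prime)   # R x S x (c x C') x (k x K')
    return input_shape, weight_shape

print(padded_organization(N=1, H=28, W=28, C=50, R=3, S=3, K=100))
# depth 50 is padded to 64, and 100 kernels are padded to 2 groups of 64
```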
In the convolution operation process of this embodiment, the convolution processing unit may be further configured to transmit the K pieces of target operation data corresponding to one window area to the next layer for use in its operation, or to transmit the N×K pieces of target operation data corresponding to N first window areas to the next layer for its operation, where N is smaller than the total number of first window areas of the input data.
Because a complete operation is performed on each first window area, that is, all data of each first window area (including the depth direction) is multiplied and accumulated with all convolution kernels (including the depth direction), the resulting target operation data is complete. The target operation data corresponding to one or more first window areas can therefore be transmitted to the next layer first, without waiting for the operation on all the input data to finish. As soon as the partial target operation data transmitted to the next layer can serve as the minimum unit of the next layer's operation (for example, as the data contained in one window area of the next layer's input data), the next layer can start its operation without waiting for the entire operation result of the previous layer, thereby improving the efficiency of the convolution operation and shortening its duration. In addition, because the internal cache of the NPU where the convolution operation unit is located is generally small, it cannot hold larger intermediate results. If the data format of the convolution operation were K×H'×W', the result of this layer would need to be fully calculated before the calculation of the next layer could begin, and the output data, being larger, would need to be cached in an external memory (i.e. a memory outside the NPU). The convolution operation of this embodiment is completed in the format H'×W'×K, so after a partial result is calculated on the H'×W' plane, it can directly serve as input data for the next layer's calculation; the smaller internal cache of the NPU only needs to store 1×W'×K, or N1×N2×K, where N1 can be far smaller than H' and N2 far smaller than W'. The output result does not need to be cached to the external memory and then read back from it for the next layer's operation, which greatly relieves bandwidth pressure and improves operation efficiency. In addition, pipelining can conveniently be performed in a fusion layer (Fusion Layer) scenario.
When the target operation data to be transmitted to the next layer overlaps with the target operation data transmitted last time, the repeated data may be removed to obtain the target data, and the target data is then transmitted to the next layer. This optimizes the transmission and storage of the data; alternatively, the full target operation data may be transmitted each time, overwriting the repeated data.
The length L of the multiply-accumulate operation performed by the multiply-accumulate Array (MAC Array) may be equal to the number M of units performing the multiply-accumulate operation in parallel, because the L and M of the multiply-accumulate Array are equal, the values of the multiply-accumulate result in two directions are equal, and the calculated result may be adjusted conveniently. Of course, in other embodiments, L and M of the multiply-accumulate array may not be equal to facilitate the arrangement of the multiply-accumulate array.
It will be appreciated that in some embodiments, the number of convolution kernels K need not be partitioned when the number K is equal to or less than the number M of units of the multiply-accumulate array that are calculated in parallel. For example, the multiply-accumulate array in the present embodiment may set the number of units M calculated in parallel to a larger value, or the number of convolution kernels is smaller. At this time, the convolution processing unit may be configured to:
Dividing the input data into C/L layer first depth data in the depth direction, and dividing the plurality of convolution kernels into C/L layer second depth data in the depth direction;
performing multiply-accumulate operation on the ith layer first depth data and the ith layer second depth data of the K convolution kernels to obtain K first intermediate data; and
incrementing i to obtain new K pieces of first intermediate data and accumulating them with the K pieces of first intermediate data obtained before, until i increases from 1 to C/L, so as to obtain K pieces of target operation data.
In other embodiments, the depth C of the convolution kernel need not be partitioned when it is equal to or less than the length L of the multiply-accumulate array for multiply-accumulate. For example, the multiply-accumulate array in this embodiment may set the length L of multiply-accumulate to a larger value, or the depth C of the input data and the convolution kernel is smaller. At this time, the convolution processing unit may be configured to:
dividing the plurality of convolution kernels into K/M convolution kernel groups;
performing multiply-accumulate operation on the first depth data of the ith layer and the second depth data of the ith layer of all convolution kernels in the f group to obtain M pieces of first intermediate data;
increasing i to obtain new M pieces of first intermediate data, accumulating the M pieces of first intermediate data obtained before, and obtaining M pieces of second intermediate data, wherein i is increased from 1 to C; and
And f is increased, so that new M pieces of second intermediate data are obtained, wherein f is increased from 1 to K/M, and K pieces of target operation data are obtained.
In some embodiments, a single layer operation of a convolution processing unit may be described, and in particular, the convolution processing unit may be configured to:
performing a windowing operation on the input data according to a convolution kernel to obtain a first window region, wherein the first window region comprises first depth data of a first number of layers along the depth direction;
acquiring a plurality of convolution kernels, wherein the plurality of convolution kernels comprise second depth data of a first number of layers along a depth direction;
and performing multiply-accumulate operation on the first depth data of one layer and the second depth data of the same layer of the convolution kernels to obtain first operation data.
The convolution processing unit may further perform an operation on the multiple layers, and specifically, the convolution processing unit is further configured to accumulate multiple first operation data corresponding to the first depth data of the multiple layers to obtain target operation data. That is, based on the single-layer operation in the above embodiment, multiply-accumulate operation is performed on the first depth data of the plurality of layers and the second depth data of the plurality of convolution kernels, so as to obtain the target operation data after accumulating the plurality of first operation data.
In the convolution operation process, bias (deviation) data may also be added: the convolution layer performs the convolution operation on the input data and the weight data, and the calculated result is then added to the bias data to obtain the output result.
The convolution processing unit may store the operation result thereof in the data storage module, or may transmit the operation result to the vector processing unit or the shaping processing unit for further calculation operation.
The neural network processor 200 provided in the embodiment of the present application may be integrated into one chip.
Referring to fig. 21, fig. 21 is a schematic structural diagram of a chip according to an embodiment of the application. The chip 20 includes a neural network processor 200, and the neural network processor 200 has the above-mentioned contents, which are not described herein. The chip 20 may be applied to an electronic device.
It should be noted that the neural network processor 200 according to the embodiment of the present application may be integrated with other processors, memories, etc. in one chip.
To further illustrate the overall operation of the neural network processor 200 of embodiments of the present application, the following description is provided in connection with other processors and memories.
Referring to fig. 22, fig. 22 is a schematic structural diagram of an electronic device according to an embodiment of the application. The electronic device 20 may include a neural network processor 200, a system bus 400, an external memory 600, and a central processor 800. The neural network processor 200, the external memory 600 and the central processor 800 are all connected to the system bus 400, so that the neural network processor 200 and the external memory 600 can realize data transmission.
The system bus 400 is coupled to the neural network processor 200 via the system bus interface 280. The system bus 400 may be connected to the central processor 800 and the external memory 600 through other system bus interfaces.
The neural network processor 200 is controlled by the central processing unit 800 to acquire the data to be processed from the external memory 600, process the data to be processed to obtain a processing result, and feed back the processing result to the external memory 600.
When the neural network processor 200 is required to perform data processing, upper layer driver software of the electronic device 20, such as the central processing unit 800, writes the configuration of the currently required execution program into a corresponding register, such as: an operation mode, an initial value of a Program Counter (PC), configuration parameters, and the like. Then, the data moving module 260 reads the data to be processed such as image data, weight data from the external memory 600 through the system bus interface 280 and writes it to the data storage module 240. Instruction dispatch module 220 begins fetching instructions as per the initial PC. When an instruction is fetched, the instruction dispatch module 220 issues the instruction to the corresponding processing unit according to the type of instruction. The various processing units perform different operations according to specific instructions and then write the results to the data storage module 240.
The register is a configuration status register of the neural network processor 200, or referred to as a control status register, which can set an operation mode of the neural network processor 200, such as a bit width of input data, a position of a program initial PC, and the like.
It should be noted that the neural network processor shown in fig. 22 may be replaced by other neural network processors shown in the drawings.
Referring to fig. 23, fig. 23 is a schematic flow chart of a convolution operation method according to an embodiment of the present application. The embodiment also provides a convolution operation method, which comprises the following steps:
4001, performing a windowing operation on the input data according to the convolution kernel to obtain a first window region, where the first window region includes first depth data of a first number of layers in a depth direction.
4002, acquiring a plurality of convolution kernels, the plurality of convolution kernels including a first number of layers of second depth data in the depth direction.
4003, performing multiply-accumulate operation on the first depth data of one layer and the second depth data of the same layer of the plurality of convolution kernels to obtain first operation data.
4004, accumulating the plurality of first operation data corresponding to the plurality of layers of first depth data to obtain target operation data.
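By way of illustration only, steps 4001-4004 may be sketched in Python (with numpy) as follows; the array shapes are assumptions introduced for this sketch, and each depth layer is taken here to be a single channel.

    import numpy as np

    # Shapes are assumptions for illustration: input H x W x C, K kernels of size R x S x C.
    H, W, C, R, S, K = 8, 8, 16, 3, 3, 4
    inp = np.random.rand(H, W, C).astype(np.float32)
    kernels = np.random.rand(K, R, S, C).astype(np.float32)

    # 4001: one windowing operation at the starting point; the first window region is
    #       R x S x C, viewed as C layers of first depth data along the depth direction.
    window = inp[0:R, 0:S, :]

    # 4002-4004: for each layer, multiply-accumulate the first depth data with the second
    #            depth data of the same layer of every kernel, then accumulate over layers.
    target = np.zeros(K, dtype=np.float32)
    for i in range(C):                        # one depth layer at a time
        first_depth = window[:, :, i]         # first depth data of layer i
        for k in range(K):
            second_depth = kernels[k, :, :, i]                # same layer of kernel k
            target[k] += np.sum(first_depth * second_depth)   # first operation data, accumulated

    # 'target' now holds the K target operation data for this first window region.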
In this convolution operation method, the convolution processing unit first takes a window of the input data according to the size of the convolution kernel, performs a multiply-accumulate operation between the windowed first window area and all the convolution kernels in the weight data to obtain K data, then slides the window in the W direction and the H direction respectively and repeats the multiply-accumulate operation, finally obtaining H'×W'×K data. Referring to fig. 24, fig. 24 is a second schematic flow chart of a convolution operation method according to an embodiment of the present application. The specific convolution operation method is as follows (these may also be understood as the specific steps of the convolution operation performed by the convolution processing unit):
5001. windowing input data according to the size (R x S) of a convolution kernel from a starting point (W=0, H=0) to obtain a first window region (R x S x C);
5002. multiplying and accumulating the windowed first window region and the K convolution kernels respectively to obtain K data;
5003. sliding to obtain a new first window area according to a first sliding step length in the W direction (the size of the first window area is unchanged), wherein the first sliding step length can be set according to the requirement;
5004. Steps 5002 and 5003 are repeated in sequence until the W-direction boundary is reached, thus obtaining W'×K data, where W' = (W - S)/first sliding step + 1. For example, if W = 7, S = 3 and the first sliding step is 2, then W' = 3; for another example, if W = 7, S = 3 and the first sliding step is 1, then W' = 5;
5005. Returning to the starting point of the W direction, the window is slid in the H direction by a second sliding step (which may be set as required) to obtain a new first window area (the size of the first window area is unchanged); for example, after one slide in the H direction with a second sliding step of 1, the window coordinates become (W=0, H=1);
5006. Steps 5002-5005 are repeated until the H-direction boundary is reached, thus obtaining H'×W'×K data. It should be noted that after each slide in the H direction, the window is again slid along the W direction until the W-direction boundary is reached (i.e., steps 5002-5004 are repeated).
Step 5003 may also be understood as sliding the window along the first direction of the input data, and obtaining a plurality of first window areas until the window is slid in the first direction to reach the boundary. Step 5005 may also be understood as returning to the starting point of the first direction of the input data, and sliding the window in the second direction according to the sliding step length until the window is slid in the second direction to reach the boundary, where after sliding in the second direction by one sliding step length, the window is slid in the first direction until the window is slid in the first direction to reach the boundary.
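By way of illustration only, the traversal of steps 5001-5006 may be sketched as follows; the H×W×C input layout, the K×R×S×C kernel layout and the function name are assumptions introduced for this sketch.

    import numpy as np

    def conv_sliding(inp, kernels, stride_w=1, stride_h=1):
        """Slide the R x S x C window over the HW plane and multiply-accumulate it with
        all K kernels, producing output in H' x W' x K layout (steps 5001-5006)."""
        H, W, C = inp.shape
        K, R, S, _ = kernels.shape
        W_out = (W - S) // stride_w + 1    # W' = (W - S) / first sliding step + 1
        H_out = (H - R) // stride_h + 1    # H' = (H - R) / second sliding step + 1
        out = np.zeros((H_out, W_out, K), dtype=inp.dtype)
        for ho in range(H_out):                     # 5005: slide in the H direction
            h = ho * stride_h
            for wo in range(W_out):                 # 5003: slide in the W direction
                w = wo * stride_w
                window = inp[h:h + R, w:w + S, :]   # 5001: first window region (R x S x C)
                for k in range(K):                  # 5002: MAC with each of the K kernels
                    out[ho, wo, k] = np.sum(window * kernels[k])
        return out

    out = conv_sliding(np.random.rand(7, 7, 8).astype(np.float32),
                       np.random.rand(4, 3, 3, 8).astype(np.float32),
                       stride_w=2, stride_h=1)
    # With W = 7, S = 3 and a first sliding step of 2, W' = 3, matching the example in step 5004.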
Referring to fig. 25, fig. 25 is a third schematic flow chart of a convolution operation method according to an embodiment of the present application. The convolution operation unit includes a multiply-accumulate array (MAC array) for the convolution operation. The size (L×M) of the multiply-accumulate array is fixed, where L is the length of a multiply-accumulate operation and M is the number of units that perform multiply-accumulate operations in parallel; in other words, M multiply-accumulate operations of length L can be performed in one cycle. The multiply-accumulate operation in the above convolution process (i.e., step 5002) is assigned to the convolution operation unit for parallel execution as follows (these may also be understood as the specific steps by which the convolution processing unit performs the multiply-accumulate operation using the multiply-accumulate array):
5021. Windowing the input data on the HW plane according to the convolution kernel size (R×S) from the starting point (W=0, H=0) to obtain a first window area, and dividing the first window area into C/L data segments of length L in the depth direction. It should be noted that the first window area may be obtained first and then divided into C/L data segments of length L, or the input data may first be divided into C/L data segments of length L and the first window area obtained afterwards; either way, the first window area includes C/L data segments of length L. In other words, the first window area may be understood as including first depth data of C/L layers in the depth direction;
5022. Dividing the convolution kernel into C/L data segments with the length of L in the depth direction, and performing the operation on K convolution kernels in the weight data to obtain K groups of weight data, wherein each group has C/L weight data segments; it is understood that each convolution kernel includes C/L weight data segments of length L in the depth direction; the K convolution kernels can be further divided into K/M convolution kernel groups, and each convolution kernel group comprises weight data of M convolution kernels;
5023. taking first depth data of an ith (i=1, 2, …, C/L) layer of a first window area of input data to obtain 1 first depth data;
5024. taking the second depth data of the ith (i=1, 2, …, C/L) layer of the f (f=1, 2, …, K/M) group convolution kernel group to obtain M second depth data;
5025. performing a multiply-accumulate operation on the 1 first depth data and the M second depth data (weight data broadcast multiplexing) by using the MAC array to obtain M first operation data; the M second depth data here are the weight data segments of the M convolution kernels in the group;
5026. incrementing i, repeating steps 5023-5025, and accumulating the output M first operation data onto the M first operation data calculated before, so as to obtain M target operation data; wherein i starts from 1 and increases to C/L;
5027. Incrementing f and repeating steps 5023-5026; after K/M such calculations, K target operation data are obtained, where f starts from 1 and increases to K/M (a sketch of this tiling scheme is given below).
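By way of illustration only, the following sketch mirrors steps 5021-5027 for a single first window region. The array layouts, the function names, and in particular the per-spatial-position scheduling of the length-L multiply-accumulate are assumptions introduced for this sketch; the actual hardware scheduling of the MAC array may differ.

    import numpy as np

    def mac_array(first_seg, weight_segs):
        """One pass of an L x M multiply-accumulate array: a single length-L first depth
        segment is broadcast against M length-L weight segments (step 5025)."""
        return weight_segs @ first_seg              # M partial sums

    def conv_window_tiled(window, kernels, L, M):
        """Steps 5021-5027 for one first window region.
        window: R x S x C first window region; kernels: K x R x S x C."""
        R, S, C = window.shape
        K = kernels.shape[0]
        assert C % L == 0 and K % M == 0            # assumed here for simplicity
        # 5021/5022: split the depth direction into C/L segments of length L.
        win_segs = window.reshape(R * S, C // L, L)         # (R*S, C/L, L)
        ker_segs = kernels.reshape(K, R * S, C // L, L)     # (K, R*S, C/L, L)
        target = np.zeros(K, dtype=window.dtype)
        for f in range(K // M):                     # 5027: loop over the K/M kernel groups
            group = ker_segs[f * M:(f + 1) * M]     # weight data of M convolution kernels
            acc = np.zeros(M, dtype=window.dtype)
            for i in range(C // L):                 # 5026: loop over the C/L depth segments
                for p in range(R * S):              # each spatial position of the window
                    first = win_segs[p, i]                   # 5023: 1 first depth segment
                    weights = group[:, p, i, :]              # 5024: M second depth segments
                    acc += mac_array(first, weights)         # 5025: broadcast multiplexing
            target[f * M:(f + 1) * M] = acc
        return target

    win = np.random.rand(3, 3, 16).astype(np.float32)
    kers = np.random.rand(8, 3, 3, 16).astype(np.float32)
    print(conv_window_tiled(win, kers, L=4, M=4))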
The height H, width W and depth C of the input data are arbitrary, i.e. the input data may come in a very large number of sizes. For example, the width W of the input data is not fixed, so if W were divided by the number M of units of the multiply-accumulate array that operate in parallel, the result would in most cases not be an integer, and part of the multiply-accumulate units would be wasted during the operation. In this embodiment, it is the number K of convolution kernels that is divided by the number M of parallel multiply-accumulate units. K is generally a fixed number and a power of two (i.e. 2^n), or one of a limited set of values (e.g. K is one of 32, 64, 128, 256), so the number M of parallel multiply-accumulate units can be set equal to K or such that K is an integral multiple of M, e.g. M is one of 32, 64, 128. This embodiment can therefore make full use of the multiply-accumulate units, reduce their waste, and improve convolution operation efficiency. Moreover, mapping the number M of multiply-accumulate units to the number K of convolution kernels is a division along a single dimension; if M were instead mapped to the sliding window area, it would span not only the width W dimension but also the H dimension, which is unfavorable for folding.
In addition, the output target operation data in this embodiment has the format H'×W'×K, which is the same layout as the input data (H×W×C) and can therefore be used directly as the input data of the next operation layer (such as the next convolution layer or pooling layer) without reshaping. The target operation data are contiguous in the depth direction, so they can be stored contiguously and read back contiguously later; hardware loading does not need to recompute addresses repeatedly, which improves calculation efficiency.
In some embodiments, steps 5021-5027 can be replaced by other steps, which specifically can include:
dividing the input data into C/L layer first depth data in the depth direction, and dividing the plurality of convolution kernels into C/L layer second depth data in the depth direction;
performing multiply-accumulate operation on the ith layer first depth data and the ith layer second depth data of the K convolution kernels to obtain K first intermediate data; and
increasing i to obtain new K pieces of first intermediate data, accumulating them with the K pieces of first intermediate data obtained before, and obtaining K pieces of target operation data, wherein i is increased from 1 to C/L.
In some embodiments, steps 5021-5027 can be replaced by other steps, which specifically can include:
Dividing the plurality of convolution kernels into K/M convolution kernel groups;
performing multiply-accumulate operation on the ith layer first depth data and the ith layer second depth data of all convolution kernels in the F group to obtain M pieces of first intermediate data;
increasing i to obtain new M pieces of first intermediate data, accumulating the M pieces of first intermediate data obtained before, and obtaining M pieces of second intermediate data, wherein i is increased from 1 to C; and
and f is increased, so that new M pieces of second intermediate data are obtained, wherein f is increased from 1 to K/M, and K pieces of target operation data are obtained.
In the above embodiment, L of the multiply-accumulate array may be equal to M, and of course, L of the multiply-accumulate array may not be equal to M.
After obtaining the K pieces of target operation data, the convolution operation method may further include: transmitting K target operation data corresponding to one window area to the next layer and using the K target operation data for operation; or transmitting N×K target operation data corresponding to the N first window areas to the next layer for operation, wherein N is smaller than the total number of the first window areas of the input data.
After the K or N×K target operation data are processed, the convolution operation method may further include:
when the target operation data to be transmitted to the next layer and the target operation data transmitted last time have repeated data, removing the repeated data to obtain target data; and transmitting the target data to the next layer.
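By way of illustration only, the removal of repeated data before transmission to the next layer may be sketched as follows; representing the transmitted blocks as flat lists and detecting the repetition by a suffix/prefix match are assumptions introduced for this sketch.

    def strip_repeated(prev_sent, to_send):
        """Remove data already transmitted last time from the block about to be sent.
        Both arguments are flat lists of target operation data; the suffix/prefix overlap
        check is only one possible way to detect the repeated data."""
        max_overlap = min(len(prev_sent), len(to_send))
        for n in range(max_overlap, 0, -1):
            if prev_sent[-n:] == to_send[:n]:
                return to_send[n:]          # transmit only the non-repeated target data
        return to_send

    print(strip_repeated([1, 2, 3, 4], [3, 4, 5, 6]))   # -> [5, 6]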
The convolution processing unit, the neural network processor, the electronic device and the convolution operation method provided by the embodiments of the application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application and to aid in understanding it. Meanwhile, since those skilled in the art may make changes to the specific embodiments and the application scope in light of the ideas of the present application, the content of this description should not be construed as limiting the present application.

Claims (8)

1. A convolution processing unit, wherein the convolution processing unit is configured to:
performing a one-time windowing operation on input data according to a convolution kernel to obtain a first window region, wherein the first window region comprises first depth data of a first number of layers along the depth direction;
acquiring a plurality of convolution kernels, the plurality of convolution kernels comprising a first number of layers of second depth data in a depth direction;
performing multiply-accumulate operation on the first depth data of one layer and the second depth data of the same layer of the convolution kernels to obtain first operation data; and
accumulating a plurality of first operation data corresponding to the first depth data of multiple layers to obtain target operation data;
The convolution processing unit comprises a multiply-accumulate array, wherein the size of the multiply-accumulate array is L×M, L is the length of a multiply-accumulate operation, and M is the number of units performing multiply-accumulate operations in parallel;
the convolution processing unit is further configured to:
dividing the input data into C/L layer first depth data in the depth direction, and dividing a plurality of convolution kernels into C/L layer second depth data in the depth direction;
performing multiply-accumulate operation on the first depth data of the ith layer and the second depth data of the ith layer of the K convolution kernels to obtain K first intermediate data; and
increasing i to obtain new K first intermediate data, accumulating the K first intermediate data obtained before, and obtaining K target operation data, wherein i is increased from 1 to C/L;
alternatively, the convolution processing unit is further configured to:
dividing a plurality of the convolution kernels into K/M convolution kernel groups;
performing multiply-accumulate operation on the first depth data of the ith layer and the second depth data of the ith layer of all convolution kernels in the f-th group to obtain M pieces of first intermediate data;
Increasing i to obtain new M pieces of first intermediate data, accumulating the M pieces of first intermediate data obtained before, and obtaining M pieces of second intermediate data, wherein i is increased from 1 to C; and
increasing f to obtain new M pieces of second intermediate data, wherein f is increased from 1 to K/M to obtain K pieces of target operation data;
alternatively, the convolution processing unit is further configured to:
dividing the input data into C/L layer first depth data in the depth direction;
dividing a plurality of the convolution kernels into C/L layer second depth data in a depth direction;
dividing a plurality of the convolution kernels into K/M convolution kernel groups;
performing multiply-accumulate operation on the first depth data of the ith layer and the second depth data of the ith layer of all convolution kernels in the f-th group to obtain M pieces of first intermediate data;
increasing i to obtain new M pieces of first intermediate data, and accumulating the M pieces of first intermediate data obtained before, wherein i is increased from 1 to C/L, and M pieces of second intermediate data are obtained; and
and f is increased, so that new M pieces of second intermediate data are obtained, wherein f is increased from 1 to K/M, and K pieces of target operation data are obtained.
2. The convolution processing unit according to claim 1, wherein L of said multiply-accumulate array is equal to M.
3. The convolution processing unit according to claim 1 or 2, further adapted to:
sliding window taking along the first direction of the input data, and obtaining a plurality of first window areas until the window taking along the first direction reaches a boundary;
and returning to the starting point of the first direction of the input data, and sliding the window in the second direction according to a second sliding step length until the window in the second direction reaches the boundary, wherein after sliding one second sliding step length in the second direction each time, the window is slid in the first direction until the window in the first direction reaches the boundary.
4. The convolution processing unit according to claim 3, further adapted to:
transmitting K target operation data corresponding to one window area to the next layer and using the K target operation data for operation; or
transmitting N×K target operation data corresponding to the N first window areas to the next layer for operation, wherein N is smaller than the total number of the first window areas of the input data.
5. The convolution processing unit according to claim 4, further configured to:
When the target operation data to be transmitted to the next layer and the target operation data transmitted last time have repeated data, removing the repeated data to obtain target data; and
the target data is transferred to the next layer.
6. A neural network processor, comprising:
a data buffer unit for storing input data;
a convolution processing unit, wherein the convolution processing unit obtains the input data through the data buffer unit, the convolution processing unit being the convolution processing unit according to any one of claims 1-5.
7. An electronic device, comprising:
a system bus; and
a neural network processor, the neural network processor being the neural network processor of claim 6, the neural network processor being connected to the system bus.
8. A convolution operation method, applied to the convolution processing unit of any one of claims 1 to 5, comprising:
performing a one-time windowing operation on input data according to a convolution kernel to obtain a first window region, wherein the first window region comprises first depth data of a first number of layers along the depth direction;
acquiring a plurality of convolution kernels, the plurality of convolution kernels comprising a first number of layers of second depth data in a depth direction;
Performing multiply-accumulate operation on the first depth data of one layer and the second depth data of the same layer of the convolution kernels to obtain first operation data;
and accumulating the plurality of first operation data corresponding to the plurality of layers of first depth data to obtain target operation data.
CN201911253109.5A 2019-12-09 2019-12-09 Convolution processing unit, neural network processor, electronic device and convolution operation method Active CN111091181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911253109.5A CN111091181B (en) 2019-12-09 2019-12-09 Convolution processing unit, neural network processor, electronic device and convolution operation method

Publications (2)

Publication Number Publication Date
CN111091181A CN111091181A (en) 2020-05-01
CN111091181B true CN111091181B (en) 2023-09-05

Family

ID=70394767

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608715A (en) * 2017-07-20 2018-01-19 上海寒武纪信息科技有限公司 For performing the device and method of artificial neural network forward operation
CN107844828A (en) * 2017-12-18 2018-03-27 北京地平线信息技术有限公司 Convolutional calculation method and electronic equipment in neutral net
CN110147251A (en) * 2019-01-28 2019-08-20 腾讯科技(深圳)有限公司 For calculating the framework, chip and calculation method of neural network model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102415508B1 (en) * 2017-03-28 2022-07-01 삼성전자주식회사 Convolutional neural network processing method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant