WO2021115149A1 - Neural network processor, chip and electronic device - Google Patents

Neural network processor, chip and electronic device

Info

Publication number
WO2021115149A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
module
processing unit
neural network
instruction
Prior art date
Application number
PCT/CN2020/132792
Other languages
English (en)
French (fr)
Inventor
袁生光
Original Assignee
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Publication of WO2021115149A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements

Definitions

  • This application relates to the field of electronic technology, in particular to a neural network processor, chip and electronic equipment.
  • the processing unit in a neural network processor often exchanges data with a data store, and the transmission speed during such data transfers is slow.
  • the embodiments of the present application provide a neural network processor, a chip, and an electronic device, which can increase the speed at which the neural network processor loads data.
  • the embodiment of the application discloses a neural network processor, including:
  • a first processing module including a convolution processing unit with a dedicated register
  • a general-purpose register connected to the convolution processing unit
  • a load storage module connected to the general-purpose register, the load storage module also being connected to the convolution processing unit through the dedicated register;
  • wherein the load storage module is used to load data into at least one of the general-purpose register and the dedicated register of the convolution processing unit.
  • the embodiment of the present application also discloses a chip, which includes a neural network processor, and the neural network processor is the neural network processor as described above.
  • the embodiment of the present application also discloses an electronic device, which includes:
  • a neural network processor, which is the neural network processor described above;
  • wherein the neural network processor is connected to an external memory and a central processing unit through a system bus, and the neural network processor is controlled by the central processing unit to obtain data to be processed from the external memory, process the data to be processed to obtain a processing result, and feed the processing result back to the external memory.
  • FIG. 1 is a schematic diagram of the first structure of a neural network processor provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of a second structure of a neural network processor provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of a third structure of a neural network processor provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of a fourth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of a fifth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a data storage module provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of a sixth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 8 is a schematic diagram of a seventh structure of a neural network processor provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of an eighth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 10 is a schematic diagram of a ninth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of a tenth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 12 is a schematic diagram of the first structure of direct storage access in the neural network processor provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of the second structure of direct storage access in the neural network processor provided by an embodiment of the application.
  • FIG. 14 is a schematic diagram of the eleventh structure of a neural network processor provided by an embodiment of this application.
  • FIG. 15 is a schematic diagram of the third structure of direct storage access in the neural network processor provided by an embodiment of the application.
  • FIG. 16 is a schematic diagram of a twelfth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 17 is a schematic diagram of the thirteenth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 18 is a schematic diagram of the fourteenth structure of a neural network processor provided by an embodiment of this application.
  • FIG. 19 is a schematic diagram of the fifteenth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 20 is a schematic diagram of a sixteenth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 21 is a schematic diagram of the seventeenth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 22 is a schematic diagram of the eighteenth structure of a neural network processor provided by an embodiment of the application.
  • FIG. 23 is a schematic diagram of convolution operation of a convolution processing unit in a neural network processor provided by an embodiment of the present application.
  • FIG. 24 is a schematic structural diagram of a chip provided by an embodiment of the application.
  • FIG. 25 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 26 is a schematic flowchart of a data processing method provided by an embodiment of the application.
  • FIG. 27 is a schematic flowchart of another data processing method provided by an embodiment of the application.
  • FIG. 28 is a schematic flowchart of a data loading method provided by an embodiment of the application.
  • the technical solutions provided in the embodiments of the present application can be applied to various scenarios that require image processing of input images to obtain corresponding output images, which is not limited in the embodiments of the present application.
  • the technical solutions provided by the embodiments of the present application can be applied to various scenarios in the fields of computer vision, such as face recognition, image classification, target detection, and semantic segmentation.
  • FIG. 1 is a schematic diagram of the first structure of a neural network processor provided by an embodiment of the application.
  • a neural network processor (Neural Network Process Unit, NPU) 200 may include a first processing module 210 and an instruction distribution module 220.
  • the first processing module 210 may include one or more processing units, such as the first processing module 210 including a convolution processing unit 212 and a vector processing unit 214.
  • the multiple processing units included in the first processing module 210 in the embodiment of the present application can all process vectors. It should be noted that the embodiment of the present application does not limit the type of data processed by the first processing module 210.
  • the convolution processing unit 212 may also be referred to as a convolution operation unit, and the convolution processing unit 212 may also be referred to as a convolution calculation engine.
  • the convolution processing unit 212 may include multiple multiplication and addition units (Multiplication Add Cell, MAC), and the number of the multiplication and addition units may be several thousand.
  • for example, the convolution processing unit 212 may include 4096 multiplication and addition units, which can be divided into 16 cells, and each cell can compute a vector inner product with a maximum of 256 elements.
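  • As an illustration of the kind of computation such a cell performs, the following C sketch models one cell accumulating a 256-element vector inner product with multiply-accumulate operations. The 8-bit operand width and the function name are assumptions made for the example, not details taken from the application.

```c
/* Illustrative sketch (not the patented hardware): one "cell" of the
 * convolution engine computing a vector inner product of up to 256
 * elements, the way an array of multiply-accumulate (MAC) units would.
 * Operand widths are assumptions for the example. */
#include <stdint.h>
#include <stdio.h>

#define CELL_WIDTH 256   /* max elements per inner product, per the text */

/* One cell: accumulate a[i] * w[i] over up to CELL_WIDTH elements. */
static int32_t cell_inner_product(const int8_t *a, const int8_t *w, int len)
{
    int32_t acc = 0;
    if (len > CELL_WIDTH)
        len = CELL_WIDTH;
    for (int i = 0; i < len; i++)
        acc += (int32_t)a[i] * (int32_t)w[i];   /* one MAC per element */
    return acc;
}

int main(void)
{
    int8_t act[CELL_WIDTH], wgt[CELL_WIDTH];
    for (int i = 0; i < CELL_WIDTH; i++) {
        act[i] = (int8_t)(i % 7);
        wgt[i] = (int8_t)(i % 5);
    }
    /* 16 such cells working in parallel would give 16 x 256 = 4096 MACs. */
    printf("inner product = %d\n", cell_inner_product(act, wgt, CELL_WIDTH));
    return 0;
}
```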
  • the vector processing unit 214 may also be referred to as a vector calculation unit, and may also be referred to as a single instruction multiple data (SIMD) processing unit.
  • the vector processing unit 214 is an element-wise vector calculation engine that can handle conventional arithmetic operations between vectors such as addition, subtraction, multiplication, and division, as well as bit-level logical operations such as AND, OR, NOT, and XOR. The vector processing unit 214 of the embodiment of the present application may also support common activation function operations such as the Rectified Linear Unit (ReLU) and PReLU, and can further support the non-linear activation functions Sigmoid and Tanh through a look-up table method.
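  • The following C sketch is a behavioral illustration (not the actual hardware) of the element-wise operations and look-up-table activation described above; the table size and input range are assumptions made for the example.

```c
/* Behavioral sketch of the kinds of element-wise operations the vector
 * unit is described as supporting: vector add, ReLU, and a sigmoid
 * approximated by a lookup table instead of a transcendental function. */
#include <math.h>
#include <stdio.h>

#define LUT_SIZE 256
#define LUT_MIN  (-8.0f)
#define LUT_MAX  (8.0f)

static float sigmoid_lut[LUT_SIZE];

static void build_sigmoid_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = LUT_MIN + (LUT_MAX - LUT_MIN) * i / (LUT_SIZE - 1);
        sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

static float sigmoid_lookup(float x)   /* table lookup instead of expf() */
{
    if (x <= LUT_MIN) return sigmoid_lut[0];
    if (x >= LUT_MAX) return sigmoid_lut[LUT_SIZE - 1];
    int idx = (int)((x - LUT_MIN) / (LUT_MAX - LUT_MIN) * (LUT_SIZE - 1));
    return sigmoid_lut[idx];
}

static void vec_add (const float *a, const float *b, float *o, int n)
{ for (int i = 0; i < n; i++) o[i] = a[i] + b[i]; }

static void vec_relu(const float *a, float *o, int n)
{ for (int i = 0; i < n; i++) o[i] = a[i] > 0.0f ? a[i] : 0.0f; }

int main(void)
{
    build_sigmoid_lut();
    float a[4] = { -2.0f, -0.5f, 0.5f, 2.0f }, b[4] = { 1, 1, 1, 1 }, o[4];
    vec_add(a, b, o, 4);
    vec_relu(o, o, 4);
    for (int i = 0; i < 4; i++)
        printf("relu=%.2f sigmoid~%.3f\n", o[i], sigmoid_lookup(a[i]));
    return 0;
}
```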
  • the instruction distribution module 220 may also be referred to as an instruction preprocessing module.
  • the instruction distribution module 220 is connected to the first processing module 210, and the instruction distribution module 220 can be connected to each processing unit in the first processing module 210; for example, the instruction distribution module 220 is connected to the convolution processing unit 212 and the vector processing unit 214 of the first processing module 210.
  • the instruction distribution module 220 may transmit instructions to the first processing module 210, that is, the instruction distribution module 220 may transmit instructions to the processing unit of the first processing module 210.
  • the instruction distribution module 220 may transmit multiple instructions to the first processing module 210 in parallel, for example, the instruction distribution module 220 may transmit multiple instructions to the convolution processing unit 212 and the vector processing unit 214 in parallel.
  • the instruction distribution module 220 may issue multiple instructions to the convolution processing unit 212 and the vector processing unit 214 in parallel within one clock cycle. Therefore, the embodiment of the present application can support multi-issue instruction operations and can execute multiple instructions efficiently at the same time.
  • the convolution processing unit 212 and the vector processing unit 214 can execute convolution calculation instructions and vector calculation instructions, respectively.
  • after the convolution processing unit 212 and the vector processing unit 214 receive their instructions, they process the received data according to the instructions to obtain the processing result. Therefore, the embodiment of the present application can improve calculation efficiency, or in other words, the efficiency with which the NPU processes data.
  • processing units corresponding to the multiple instructions issued in parallel by the instruction distribution module 220 have no resource conflicts during the execution process.
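  • The following C sketch is an assumption-laden behavioral model of such multi-issue dispatch: in one cycle the dispatcher hands at most one instruction to each free processing unit and holds back any instruction whose target unit would conflict. All names and the queue format are illustrative only.

```c
/* Behavioral sketch (a model, not RTL) of multi-issue dispatch: in one
 * "clock cycle" the dispatcher issues at most one instruction to each
 * processing unit and skips issue when the target unit is busy or
 * already claimed, i.e. when there would be a resource conflict. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { UNIT_CONV, UNIT_VECTOR, UNIT_SHAPE, UNIT_SCALAR, UNIT_COUNT } unit_t;

typedef struct { unit_t target; const char *text; } instr_t;

static bool unit_busy[UNIT_COUNT];   /* set while a unit executes */

/* Issue as many queued instructions as possible this cycle, one per unit. */
static int dispatch_one_cycle(const instr_t *queue, int n)
{
    int issued = 0;
    bool claimed[UNIT_COUNT] = { false };
    for (int i = 0; i < n; i++) {
        unit_t u = queue[i].target;
        if (unit_busy[u] || claimed[u])
            continue;                 /* resource conflict: hold the instruction */
        claimed[u] = true;
        printf("issue to unit %d: %s\n", u, queue[i].text);
        issued++;
    }
    return issued;
}

int main(void)
{
    instr_t q[] = {
        { UNIT_CONV,   "conv inner-product instruction" },
        { UNIT_VECTOR, "vector calculation instruction" },
        { UNIT_SCALAR, "scalar control instruction" },
        { UNIT_CONV,   "second conv instruction (conflicts, waits)" },
    };
    printf("issued %d instructions this cycle\n", dispatch_one_cycle(q, 4));
    return 0;
}
```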
  • the multiple instructions transmitted by the instruction distribution module 220 may include fine-grained instructions.
  • the instruction distribution module 220 may transmit fine-grained instructions to the convolution processing unit 212. After the convolution processing unit 212 receives a fine-grained instruction, it can perform a vector inner product operation on the received data according to that fine-grained instruction.
  • the fine-grained instructions issued by the instruction distribution module 220 are not limited to the convolution processing unit 212; the instruction distribution module 220 may also transmit fine-grained instructions to the vector processing unit 214 or to other processing units of the first processing module 210.
  • the instructions that can be issued by the instruction distribution module 220 of the embodiment of the present application are not limited to fine-grained instructions.
  • the embodiment of the present application does not limit the instructions issued by the instruction distribution module 220.
  • the instructions may include different types, such as calculation-type instructions, control-type instructions, etc., where the calculation-type instructions may include a first calculation instruction, a second calculation instruction, a third calculation instruction, and so on.
  • the operation corresponding to a fine-grained instruction is accurate to a single clock cycle, which is different from a coarse-grained processor, that is, different from an instruction that requires the processor to execute multiple clock cycles to complete.
  • fine-grained instructions are also reflected in the finer granularity of the computation performed by the processing unit.
  • the convolution processing unit 212 can complete a basic vector inner product operation based on a fine-grained instruction.
  • the coarse-grained processor can complete matrix multiplication based on one instruction. It is understandable that matrix multiplication can consist of multiple vector inner product operations. It can be seen that the embodiment of the present application can support multi-issue fine-grained instruction operations, and the embodiment of the present application can improve the flexibility of programming and has better versatility.
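  • The decomposition mentioned here can be illustrated with a short C sketch: an M x K by K x N matrix multiplication reduces to M*N vector inner products, each of which a fine-grained instruction could drive individually. The code is illustrative only and is not taken from the application.

```c
/* Sketch of the decomposition described in the text: a matrix multiply
 * is just many vector inner products, one per output element. */
#include <stdio.h>

static float inner_product(const float *a, const float *b, int k)
{
    float acc = 0.0f;
    for (int i = 0; i < k; i++)
        acc += a[i] * b[i];
    return acc;
}

/* b is stored column-major here so each column is a contiguous vector. */
static void matmul(const float *a, const float *b_cols, float *c,
                   int m, int k, int n)
{
    for (int row = 0; row < m; row++)
        for (int col = 0; col < n; col++)
            c[row * n + col] =
                inner_product(&a[row * k], &b_cols[col * k], k);
}

int main(void)
{
    float a[2 * 3]      = { 1, 2, 3,  4, 5, 6 };        /* 2 x 3 */
    float b_cols[3 * 2] = { 7, 9, 11,  8, 10, 12 };     /* 3 x 2, column-major */
    float c[2 * 2];
    matmul(a, b_cols, c, 2, 3, 2);
    printf("%g %g\n%g %g\n", c[0], c[1], c[2], c[3]);   /* 58 64 / 139 154 */
    return 0;
}
```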
  • the instruction distribution module 220 of the embodiment of the present application may transmit the first calculation instruction to the convolution processing unit 212 and transmit the second calculation instruction to the vector processing unit 214 in parallel.
  • the instruction distribution module 220 transmits the first calculation instruction to the convolution processing unit 212 and the second calculation instruction to the vector processing unit 214 within one clock cycle.
  • the convolution processing unit 212 may perform a vector inner product operation on the received data according to the first calculation instruction issued by the instruction distribution module 220.
  • the vector processing unit 214 performs a vector calculation operation on the received data according to the second calculation instruction issued by the instruction distribution module 220.
  • the processing units in the first processing module 210 are not limited to the convolution processing unit 212 and the vector processing unit 214; the first processing module 210 may also include other processing units.
  • for example, the first processing module 210 may also include a shaping processing unit.
  • FIG. 2 is a schematic diagram of a second structure of a neural network processor provided by an embodiment of the application.
  • the first processing module 210 of the neural network processor 200 provided by the embodiment of the present application may include a convolution processing unit 212, a vector processing unit 214, and a shaping processing unit 216.
  • for the convolution processing unit 212 and the vector processing unit 214, reference can be made to the convolution processing unit 212 and the vector processing unit 214 shown in FIG. 1, and details are not repeated here.
  • the shaping processing unit may also be referred to as a shaping engine.
  • the shaping processing unit 216 is connected to the instruction distribution module 220, and the instruction distribution module 220 can also transmit multiple instructions to the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216 in parallel.
  • the instruction distribution module 220 may also issue multiple instructions to the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216 in one clock cycle in parallel.
  • the shaping processing unit 216 processes the received data according to the instruction issued by the instruction distribution module 220, such as the third calculation instruction.
  • the shaping processing unit 216 can support common tensor reshape operations, such as dimension transposition, splitting along one dimension, and data padding.
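  • The following C sketch illustrates, in software form, the three reshape-type operations listed above (transposition, splitting along one dimension, and zero padding); the shapes used are arbitrary examples, not hardware parameters.

```c
/* Small sketches of the reshape-type operations the text lists:
 * dimension transposition, splitting along one dimension, zero padding. */
#include <stdio.h>
#include <string.h>

/* Transpose an r x c matrix stored row-major into a c x r matrix. */
static void transpose(const float *in, float *out, int r, int c)
{
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            out[j * r + i] = in[i * c + j];
}

/* Split an r x c matrix into two halves along the row dimension. */
static void split_rows(const float *in, float *top, float *bottom, int r, int c)
{
    memcpy(top,    in,               (size_t)(r / 2) * c * sizeof(float));
    memcpy(bottom, in + (r / 2) * c, (size_t)(r - r / 2) * c * sizeof(float));
}

/* Zero-pad an r x c matrix by `pad` elements on every side. */
static void pad2d(const float *in, float *out, int r, int c, int pad)
{
    int oc = c + 2 * pad;
    memset(out, 0, (size_t)(r + 2 * pad) * oc * sizeof(float));
    for (int i = 0; i < r; i++)
        memcpy(&out[(i + pad) * oc + pad], &in[i * c], (size_t)c * sizeof(float));
}

int main(void)
{
    float m[2 * 3] = { 1, 2, 3, 4, 5, 6 }, t[3 * 2], top[1 * 3], bot[1 * 3], p[4 * 5];
    transpose(m, t, 2, 3);
    split_rows(m, top, bot, 2, 3);
    pad2d(m, p, 2, 3, 1);
    printf("t[1]=%g top[0]=%g bot[0]=%g p[6]=%g\n", t[1], top[0], bot[0], p[6]);
    return 0;
}
```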
  • the instruction issuance of the instruction distribution module 220 is not limited to the first processing module 210. In some other embodiments, the instruction distribution module 220 may also transmit instructions to other processing modules.
  • FIG. 3 is a schematic diagram of a third structure of a neural network processor provided by an embodiment of this application.
  • the neural network processor 200 provided in the embodiment of the present application may include a first processing module 210, a second processing module 230, and an instruction distribution module 220.
  • the first processing module 210 includes at least a convolution processing unit 212.
  • the first processing module 210 may also include other processing units such as a vector processing unit 214 and a shaping processing unit 216.
  • the convolution processing unit 212 can perform a vector inner product operation on the data it receives.
  • the vector processing unit 214 can refer to the above content for details, which will not be repeated here.
  • the shaping processing unit 216 refer to the above content, which will not be repeated here.
  • the second processing module 230 may process scalar data, and the second processing module 230 includes at least a scalar processing unit 232 (Scalar Process Unit, SPU).
  • the scalar processing unit 232 may be a processing unit compatible with the RISC-V instruction set.
  • the scalar processing unit 232 may include a scalar register file (Scalar Register File, SRF), that is, the scalar processing unit 232 may include a plurality of scalar registers.
  • the instruction distribution module 220 connects the first processing module 210 and the second processing module 230, and the instruction distribution module 220 can transmit multiple instructions to the first processing module 210 and the second processing module 230 in parallel.
  • the instruction distribution module 220 may issue multiple instructions to the convolution processing unit 212 and the scalar processing unit 232 in parallel within one clock cycle.
  • the instruction distribution module 220 may also issue multiple instructions to other processing units in parallel within one clock cycle.
  • for example, the instruction distribution module 220 transmits multiple instructions in parallel to the convolution processing unit 212, the vector processing unit 214, and the scalar processing unit 232 within one clock cycle; or the instruction distribution module 220 transmits multiple instructions in parallel to the convolution processing unit 212, the vector processing unit 214, the shaping processing unit 216, and the scalar processing unit 232 within one clock cycle.
  • the instructions emitted by the instruction distribution module 220 are not limited to this.
  • the instruction distribution module 220 can, according to the data processing needs of the neural network processor 200, issue different instructions in parallel to multiple processing units in the same processing module, or issue different instructions in parallel to processing units in different processing modules.
  • the above are only a few examples of how the instruction distribution unit 220 transmits multiple instructions in parallel in the technical solution provided by the embodiments of the present application.
  • the manner in which the instruction distributing unit 220 of the technical solution provided in the embodiment of the present application transmits instructions is not limited to this.
  • the instruction distribution unit 220 transmits multiple instructions to the scalar processing unit 232 and the vector processing unit 214 in parallel.
  • the instruction distribution unit 220 transmits multiple instructions to the shaping processing unit 216 and the vector processing unit 214 in parallel.
  • the scalar processing unit 232 processes the received data according to the instructions distributed by the instruction distribution module 220, such as control instructions.
  • the scalar processing unit 232 may receive scalar instructions, such as control instructions, and the scalar processing unit 232 is mainly responsible for the scalar operations of the neural network processor 200.
  • the scalar processing unit 232 can not only receive instructions from the instruction distribution module 220, but also transmit the value of a new program counter (PC) to the instruction distribution module 220.
  • FIG. 4 is a schematic diagram of a fourth structure of a neural network processor provided by an embodiment of the application.
  • the scalar processing unit 232 may not only receive instructions from the instruction distribution module 220, but may also transmit the value of a new program counter (PC) to the instruction distribution module 220.
  • the scalar processing unit 232 can execute scalar calculation instructions (addition, subtraction, multiplication, and division, logical operations), branch instructions (conditional judgment operations), and jump instructions (function calls).
  • for branch instructions and jump instructions, the scalar processing unit 232 returns the new PC value to the instruction distribution module 220, so that the instruction distribution module 220 fetches instructions from the new PC the next time it distributes instructions.
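  • A toy C model of this PC hand-back is sketched below; the instruction encoding and the 4-byte PC step are assumptions made for illustration, not the processor's actual instruction format.

```c
/* Toy model of the PC hand-back described above: when the scalar unit
 * resolves a branch or jump, it returns a new program-counter value,
 * and the dispatcher fetches from that PC next. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool     is_branch;    /* conditional branch */
    bool     taken;        /* branch condition result */
    uint32_t target;       /* branch/jump target address (0 = none) */
} scalar_instr_t;

/* Scalar unit: returns the PC the dispatcher should fetch from next. */
static uint32_t scalar_execute(uint32_t pc, const scalar_instr_t *ins)
{
    if (ins->is_branch && !ins->taken)
        return pc + 4;          /* branch not taken: fall through */
    if (ins->is_branch || ins->target != 0)
        return ins->target;     /* taken branch or jump: new PC */
    return pc + 4;              /* plain scalar arithmetic */
}

int main(void)
{
    uint32_t pc = 0x100;
    scalar_instr_t jump = { .is_branch = false, .taken = false, .target = 0x200 };
    pc = scalar_execute(pc, &jump);       /* dispatcher now fetches at 0x200 */
    printf("next fetch PC = 0x%x\n", pc);
    return 0;
}
```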
  • FIG. 5 is a schematic diagram of a fifth structure of a neural network processor provided by an embodiment of the application.
  • the neural network processor 200 provided by the embodiment of the present application further includes a data storage module (Buffer, BUF) 240, and the data storage module 240 can store data, such as image data, weight data, and the like.
  • the data storage module 240 may be connected to the first processing module 210 and the second processing module 230.
  • the data storage module 240 is connected to the scalar processing unit 232, the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216.
  • the data storage module 240 can transmit data with each of the scalar processing unit 232, the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216; for example, the data storage module 240 can directly transfer data with the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216. Therefore, in the embodiment of the present application, direct data transmission can be realized between the data storage module 240 and the various processing units such as the convolution processing unit 212 and the vector processing unit 214, which improves the performance of the NPU 200.
  • the processing of data by the first processing module 210 may be as follows: when the convolution processing unit 212 and the vector processing unit 214 receive the instructions issued in parallel by the instruction distribution module 220, such as the first calculation instruction and the second calculation instruction, the convolution processing unit 212 and the vector processing unit 214 can read the data to be processed from the data storage module 240.
  • the convolution processing unit 212 and the vector processing unit 214 perform processing operations on the data to be processed to obtain a processing result, and store the processing result in the data storage module 240.
  • the processing of data by the convolution processing unit 212 and the vector processing unit 214 may be as follows: when the convolution processing unit 212 receives an instruction issued by the instruction distribution module 220, such as the first calculation instruction, the convolution processing unit 212 reads the data to be processed from the data storage module 240 according to the first calculation instruction.
  • after the convolution processing unit 212 reads the data to be processed from the data storage module 240, the convolution processing unit 212 performs corresponding operations such as vector inner product calculations according to the first calculation instruction to obtain an intermediate calculation result.
  • the convolution processing unit 212 may store the intermediate calculation result in the data storage module 240.
  • the vector processing unit 214 may obtain the intermediate calculation result from the data storage module 240, and perform a second calculation process such as a pooling operation on the intermediate calculation result to obtain the processing result, and store the processing result in the data storage module 240 .
  • the data stored in the data storage module 240 may be raw data and weight data, such as data to be processed; in other words, the data stored in the data storage module 240 may be data that requires processing, such as arithmetic processing, by at least one processing unit.
  • the data stored in the data storage module 240 may also be a processing result, or in other words, the data stored in the data storage module 240 is data after the data to be processed is processed by at least one processing unit. It should be noted that the data actually stored by the data storage module 240 is not limited to this, and the data storage module 240 may also store other data.
  • processing of data by the convolution processing unit 212 and the vector processing unit 214 is not limited to this, and the convolution processing unit 212 and the vector processing unit 214 may also be directly connected through a signal line.
  • the processing of data by the convolution processing unit 212 and the vector processing unit 214 may also be as follows: when the convolution processing unit 212 receives an instruction issued by the instruction distribution module 220, such as the first calculation instruction, the convolution processing unit 212 reads the data to be processed from the data storage module 240 according to the first calculation instruction. After the convolution processing unit 212 reads the data to be processed from the data storage module 240, the convolution processing unit 212 performs corresponding operations such as vector inner product calculations according to the first calculation instruction to obtain an intermediate calculation result.
  • the convolution processing unit 212 may transmit the intermediate calculation result to the vector processing unit 214.
  • the vector processing unit 214 performs a second calculation process on the intermediate calculation result, such as pooling, subsequent activation, or quantization, or fuses it with the operation of the next layer, processing the operations of two layers of operators at the same time, to obtain the processing result, and stores the processing result in the data storage module 240.
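  • The following C sketch illustrates the idea of such operator fusion: the convolution result is handed directly to a second stage (here ReLU plus a simple requantization, both chosen for illustration) before anything is written back, so no intermediate tensor needs to be stored. The scaling scheme is an assumption for the example.

```c
/* Sketch of the fusion described above: the convolution result flows
 * straight into a fused vector stage instead of being stored as an
 * intermediate tensor; only the final result is written out. */
#include <stdint.h>
#include <stdio.h>

static int32_t conv_inner_product(const int8_t *x, const int8_t *w, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)w[i];
    return acc;
}

/* Fused second stage: activation + requantize, applied before storing. */
static int8_t relu_requant(int32_t v, int shift)
{
    if (v < 0) v = 0;                   /* ReLU */
    v >>= shift;                        /* crude requantization */
    return (int8_t)(v > 127 ? 127 : v); /* saturate to int8 */
}

int main(void)
{
    int8_t x[4] = { 10, -3, 7, 2 }, w[4] = { 3, 1, -2, 5 };
    /* conv output flows directly into the fused vector stage */
    int8_t out = relu_requant(conv_inner_product(x, w, 4), 2);
    printf("fused output = %d\n", out);   /* only this value is stored */
    return 0;
}
```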
  • the convolution processing unit 212 may also be connected to other processing units of the first processing module 210 such as the shaping processing unit 216 through a signal line.
  • the first processing module 210 can also directly transmit the intermediate calculation result computed by the convolution processing unit 212 to the shaping processing unit 216 or to other processing units in the first processing module 210 to perform further calculation operations.
  • the first processing module 210 may also have the convolution processing unit 212 process the data and store the intermediate calculation result it computes in the data storage module 240, after which the shaping processing unit 216 or another processing unit in the first processing module 210 obtains the intermediate calculation result from the data storage module 240 and performs further processing operations on it, such as a shaping operation, to obtain the processing result.
  • the shaping processing unit 216 or other processing units in the first processing module 210 store the processing result in the data storage module 240.
  • alternatively, the intermediate calculation results may not be stored in the data storage module 240; the data storage module 240 may store only the original data and weights, without storing intermediate calculation results. This not only saves storage space in the data storage module 240, but also reduces accesses to the data storage module 240, reduces power consumption, and improves the performance of the neural network processor 200.
  • the method of processing data among other processing units of the first processing module 210 in the embodiment of the present application can be analogous to the method of the convolution processing unit 212 and the vector processing unit 214 in the first processing module 210 above.
  • the manners of processing data among other processing units of the first processing module 210 in the embodiment of the present application will not be illustrated one by one here.
  • the data storage module 240 of the embodiment of the present application may store the calculation results. During the operation of multiple processing units, zero write-back to the external memory can be achieved, that is, there is no need to write the calculation result of the previous operator back to external storage.
  • the bandwidth demand on the SoC is therefore relatively low, which saves system bandwidth and reduces the calculation delay between operators.
  • the data storage module 240 may be a shared storage module.
  • the data storage module 240 may have multiple banks accessed in parallel, such as three, four, and so on, and can be divided flexibly according to actual needs.
  • FIG. 6 is a schematic structural diagram of a data storage module provided by an embodiment of the present application.
  • the data storage module 240 includes at least two data storage units 241 and at least two address decoding units 242.
  • the number of address decoding units 242 is not greater than the number of data storage units 241.
  • the number of data storage units 241 is four, and the number of address decoding units 242 is four.
  • Each address decoding unit 242 includes four output ports, and each output port corresponds to one data storage unit 241.
  • For example, there are four data storage units 241: data storage unit a, data storage unit b, data storage unit c, and data storage unit d; and four address decoding units 242: address decoding unit a, address decoding unit b, address decoding unit c, and address decoding unit d.
  • Each of the four address decoding units 242 is connected to every data storage unit 241, and each address decoding unit 242 includes four output ports.
  • The number of output ports of one address decoding unit 242 is equal to the number of data storage units 241 in the data storage module 240, that is, each output port of an address decoding unit 242 corresponds to one data storage unit 241.
  • for example, the first output port of each address decoding unit corresponds to data storage unit a, the second output port corresponds to data storage unit b, the third output port corresponds to data storage unit c, and the fourth output port corresponds to data storage unit d.
  • the data output by an output port is stored in the data storage unit corresponding to that output port.
  • for example, the data output by the first output port of address decoding unit a, the data output by the first output port of address decoding unit b, the data output by the first output port of address decoding unit c, and the data output by the first output port of address decoding unit d are all stored in data storage unit a. Therefore, the data passing through each address decoding unit can be stored in any data storage unit 241, so that the data storage units 241 can be shared.
  • One output port is used to output one type of data.
  • the four output ports of the same address decoding unit 242 correspond to different data types.
  • for example, the first output port of an address decoding unit 242 is used to output feature maps, and the second output port is used to output feature parameters.
  • Each address decoding unit 242 also includes three input ports, which are respectively used to receive the signal, data, and address information transmitted by an external port. Each address decoding unit 242 decodes the received signal, data, and address information into the data for its four output ports.
  • the number of address decoding units 242 is the same as the number of external ports. For example, when the number of external ports is four, the number of corresponding address decoding units 242 is four, and the data transmitted by an external port can be stored, through its address decoding unit 242, in any data storage unit 241, realizing resource sharing within the data storage module 240.
  • the external port may be the port of a processing unit or the port of the data bus. Any port that can store data to and read data from the data storage unit falls within the protection scope of the embodiments of the present application.
  • the data storage module 240 also includes at least two data merging units 243, for example four. Each data merging unit 243 includes at least two data input terminals and one data output terminal. Each data merging unit 243 receives, through its at least two data input terminals, all the data corresponding to one data storage unit 241, processes that data, and stores it in the corresponding data storage unit 241, so that the data storage module 240 can process data in an orderly manner, which improves the efficiency of data processing and avoids data storage confusion.
  • Each data merging unit 243 corresponds to one data storage unit 241, and the data input terminals of each data merging unit 243 are connected to the output ports of all the address decoding units 242 that correspond to that data storage unit 241; that is, one data merging unit 243 is connected to all the address decoding units 242, and one data merging unit 243 processes the data of multiple address decoding units 242, which improves the efficiency of data storage.
  • the data merging unit 243 uses a bitwise OR operation to combine data. The bitwise OR is a two-operand operation: as long as one of the two corresponding binary bits is 1, the result bit is 1.
  • the logic of the bitwise OR operation is relatively simple and its operation speed is fast, which improves the processing efficiency of the data merging unit 243 and thus the storage efficiency of the data storage module 240.
  • One data merging unit 243 corresponds to one data storage unit 241.
  • data merging unit a corresponds to data storage unit a
  • data merging unit b corresponds to data storage unit b
  • for example, a piece of data decoded by address decoding unit a is transmitted to the data merging unit a corresponding to data storage unit a for processing, and the processed data can then be transmitted to data storage unit a for storage.
  • the data storage module 240 can store data quickly and efficiently.
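  • The decode-and-merge path described above can be modeled roughly as follows; the four-bank layout follows the example in the text, while the address-to-bank interleaving and data widths are assumptions made for illustration.

```c
/* Simplified model of the decode-and-merge path: each address decoding
 * unit drives one output per storage bank, and the merge unit for a
 * bank ORs the contributions from all decoders (at most one of which
 * is active for that bank in this example) before storing. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_BANKS    4
#define NUM_DECODERS 4
#define BANK_DEPTH   16

static uint32_t bank_mem[NUM_BANKS][BANK_DEPTH];

typedef struct { uint32_t data[NUM_BANKS]; int offset[NUM_BANKS]; } decoder_out_t;

/* Address decoder: route the write to the bank selected by the address. */
static decoder_out_t decode(uint32_t addr, uint32_t data)
{
    decoder_out_t o;
    memset(&o, 0, sizeof(o));
    int bank = (int)(addr % NUM_BANKS);     /* assumed interleaving */
    o.data[bank]   = data;                  /* only one output is non-zero */
    o.offset[bank] = (int)(addr / NUM_BANKS) % BANK_DEPTH;
    return o;
}

/* Merge unit for one bank: bitwise-OR all decoder outputs for that bank. */
static void merge_and_store(const decoder_out_t *outs, int n, int bank)
{
    uint32_t merged = 0;
    int offset = 0;
    for (int i = 0; i < n; i++) {
        merged |= outs[i].data[bank];        /* bitwise OR combine */
        if (outs[i].data[bank])
            offset = outs[i].offset[bank];
    }
    bank_mem[bank][offset] = merged;
}

int main(void)
{
    decoder_out_t outs[NUM_DECODERS];
    outs[0] = decode(0x05, 0xAABBCCDD);      /* port 0 writes bank 1 */
    outs[1] = decode(0x0A, 0x11223344);      /* port 1 writes bank 2 */
    outs[2] = decode(0x00, 0);
    outs[3] = decode(0x00, 0);
    for (int b = 0; b < NUM_BANKS; b++)
        merge_and_store(outs, NUM_DECODERS, b);
    printf("bank1[1]=0x%08x bank2[2]=0x%08x\n", bank_mem[1][1], bank_mem[2][2]);
    return 0;
}
```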
  • the data that the second processing module 230 needs to process may not be obtained from the data storage module 240; the data that the scalar processing unit 232 needs to process may be carried in the received instructions or transmitted in other ways.
  • FIG. 7 is a schematic diagram of a sixth structure of a neural network processor provided by an embodiment of this application.
  • the difference between the neural network processor shown in FIG. 7 and the neural network processor shown in FIG. 5 is that in FIG. 7 the second processing module 230, such as the scalar processing unit 232, is connected to the instruction distribution module 220 but not to the data storage module 240, whereas in FIG. 5 the second processing module 230, such as the scalar processing unit 232, is connected to both the instruction distribution module 220 and the data storage module 240.
  • the data that the second processing module 230, such as the scalar processing unit 232, in FIG. 7 needs to process can be carried in the instructions it receives, that is, carried in the instructions distributed by the instruction distribution module 220.
  • a separate data storage module may also be provided for the second processing module 230, such as the scalar processing unit 232.
  • the data storage module 240 may also be connected to the instruction distribution module 220, and the instruction distribution module 220 determines whether to transmit the instruction according to whether the data storage module 240 stores to-be-processed data.
  • FIG. 8 is a schematic diagram of a seventh structure of a neural network processor provided by an embodiment of this application.
  • the instruction distribution module 220 is connected to the data storage module 240.
  • the instruction distribution module 220 can send an index to the data storage module 240, and the data storage module 240 returns a signal according to the index sent by the instruction distribution module 220.
  • the data storage module 240 returns a signal that the data to be processed is stored to the instruction distribution module 220, such as "1".
  • the data storage module 240 returns a signal that no data to be processed is stored to the instruction distribution module 220, such as "0".
  • the instruction distribution module 220 takes different actions according to the different return signals it receives. For example, if the instruction distribution module 220 receives "1", the instruction distribution module 220 determines that the data storage module 240 stores data to be processed, and then the instruction distribution module 220 transmits multiple instructions in parallel. If the instruction distribution module 220 receives "0", the instruction distribution module 220 determines that the data storage module 240 does not store the data to be processed, and the instruction distribution module 220 does not issue instructions to the data storage module 240 at this time. Therefore, unnecessary instruction distribution can be avoided, and power consumption can be saved.
  • FIG. 9 is a schematic diagram of the eighth structure of the neural network processor provided by an embodiment of the application.
  • the neural network processor 200 provided in the embodiment of the present application may further include an instruction storage module 250, and the instruction storage module 250 may also be referred to as an instruction cache (Instruction Cache, ICache).
  • the instruction storage module 250 may store some fine-grained instructions, such as calculation instructions, control instructions, and so on. In other words, the instruction storage module 250 is used to store instructions of the NPU. It should be noted that the instructions stored in the instruction storage module 250 may also be other instructions.
  • the instruction storage module 250 is connected to the instruction distribution module 220, and the instruction storage module 250 can send the stored instructions to the instruction distribution module 220. In other words, the instruction distribution module 220 may obtain multiple instructions from the instruction storage module 250.
  • the process by which the instruction distribution module 220 obtains instructions from the instruction storage module 250 may be as follows: the instruction distribution module 220 sends an instruction fetch request to the instruction storage module 250; when the instruction corresponding to the fetch request is found in the instruction storage module 250, that is, on an instruction hit, the instruction storage module 250 responds to the fetch request by sending the corresponding instruction to the instruction distribution module 220.
  • when the instruction corresponding to the fetch request is not found in the instruction storage module 250, which is called an instruction miss, the instruction storage module 250 suspends (holds) its response to the fetch request and at the same time sends the fetch request onward; after the instruction is returned to the instruction storage module 250, the instruction storage module 250 responds to the fetch request by sending the corresponding instruction to the instruction distribution module 220.
  • in other words, when the instructions required by the instruction distribution module 220 are already stored in the instruction storage module 250, the instruction distribution module 220 can obtain them directly from the instruction storage module 250; when at least one instruction required by the instruction distribution module 220 is not in the instruction storage module 250, the instruction storage module 250 needs to read that instruction from another location, such as an external memory, and return it to the instruction distribution module 220.
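  • A behavioral C sketch of this hit/miss handling is given below; the cache size, the line indexing scheme, and the stand-in for the external fetch path are all assumptions made for the example.

```c
/* Behavioral sketch of the fetch path described above: a hit returns
 * the cached instruction immediately; a miss holds the response,
 * fetches from "external memory", fills the line, then answers. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ICACHE_LINES 64

typedef struct { bool valid; uint32_t addr; uint32_t instr; } icache_line_t;

static icache_line_t icache[ICACHE_LINES];

/* Stand-in for the instruction-transfer path to external memory. */
static uint32_t external_fetch(uint32_t addr) { return 0xDEAD0000u | (addr & 0xFFFF); }

static uint32_t icache_fetch(uint32_t addr)
{
    icache_line_t *line = &icache[(addr >> 2) % ICACHE_LINES];
    if (line->valid && line->addr == addr)
        return line->instr;                  /* instruction hit */

    /* instruction miss: hold the response and fetch from outside */
    uint32_t instr = external_fetch(addr);
    line->valid = true;
    line->addr  = addr;
    line->instr = instr;
    return instr;                            /* now answer the request */
}

int main(void)
{
    printf("miss then fill: 0x%08x\n", icache_fetch(0x1000));
    printf("hit:            0x%08x\n", icache_fetch(0x1000));
    return 0;
}
```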
  • instruction distribution module 220 and the instruction storage module 250 of the embodiment of the present application may be two separate parts.
  • the instruction distribution module 220 and the instruction storage module 250 may also form an instruction preprocessing module, or the instruction distribution module 220 and the instruction storage module 250 may be two parts of the instruction preprocessing module.
  • each instruction stored in the instruction storage module 250 has a corresponding type, and the instruction distribution module 220 may issue multiple instructions based on the type of the instruction.
  • the instruction distribution module 220 transmits the first type of instructions to the convolution processing unit 212, and the instruction distribution module 220 transmits the second type of instructions to the scalar processing unit 232.
  • the types of instructions are, for example, jump instructions, branch instructions, convolution calculation instructions, vector calculation instructions, and shaping calculation instructions.
  • the instruction storage module 250 of the embodiment of the present application is not limited to storing only a part of instructions of the NPU 200.
  • the instruction storage module 250 of the embodiment of the present application may also store all instructions of the NPU 200, and the instruction storage module 250 may be referred to as an instruction memory (Instruction RAM, IRAM), or as a program memory.
  • Upper-level software such as an external processor can directly write programs into IRAM.
  • the neural network processing unit 200 may further include a data transfer module 260, an instruction transfer module 270, and a system bus interface 280.
  • the system bus interface 280 is connected to a system bus, which may be a system bus of an electronic device such as a smart phone.
  • the system bus interface 280 is connected to the system bus to realize data transmission with other processors and external memories.
  • the system bus interface 280 can convert internal read and write requests into bus read and write requests that comply with a bus interface protocol, such as the Advanced Extensible Interface (AXI) protocol.
  • the data moving module 260 is connected to the system bus interface 280 and the data storage module 240.
  • the data moving module 260 is used to move data. It can move external data to the data storage module 240 or move the data of the data storage module 240 to the outside.
  • the data transfer module 260 reads data from the system bus through the system bus interface 280 and writes the read data to the data storage module 240.
  • the data moving module 260 can also transfer the data or processing results stored in the data storage module 240 to the external memory.
  • for example, the data moving module 260 transfers the processing results of the processing units in the first processing module 210 to the external memory; that is, the data moving module 260 can implement data transfer between the internal storage and the external memory through the system bus interface 280.
  • the data moving module 260 may be direct memory access (DMA), and the DMA may move data from one address space to another address space.
  • the address space for data movement can be internal memory or peripheral interface.
  • the descriptors that control DMA data movement are usually stored in RAM in advance.
  • the descriptors include information such as source address space, destination address space, and data length.
  • the software initializes the DMA and the data starts to move. This moving process can be carried out independently from the NPU, which improves the efficiency of the NPU and reduces the burden on the NPU.
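  • A minimal C sketch of such a descriptor-driven move is shown below. The source address, destination address, and data length fields come from the text; the struct layout and the memcpy standing in for the hardware engine are assumptions made for illustration.

```c
/* Minimal sketch of a descriptor-driven DMA move: software fills in a
 * descriptor (source, destination, length) and the "engine" performs
 * the copy; in silicon this runs independently of the NPU core. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uintptr_t src;     /* source address space      */
    uintptr_t dst;     /* destination address space */
    size_t    len;     /* data length in bytes      */
} dma_descriptor_t;

/* Stand-in for the hardware move engine. */
static void dma_run(const dma_descriptor_t *d)
{
    memcpy((void *)d->dst, (const void *)d->src, d->len);
}

int main(void)
{
    char external_mem[16] = "weights0123456";
    char npu_buffer[16]   = { 0 };

    /* software "initializes the DMA" by filling in a descriptor */
    dma_descriptor_t d = {
        .src = (uintptr_t)external_mem,
        .dst = (uintptr_t)npu_buffer,
        .len = sizeof(external_mem),
    };
    dma_run(&d);
    printf("moved: %s\n", npu_buffer);
    return 0;
}
```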
  • the instruction moving module 270 is connected to the system bus interface 280 and the instruction storage module 250.
  • the instruction moving module 270 is used to move instructions, or the instruction moving module 270 is used to read instructions to move external instructions to the instruction storage module 250.
  • the instruction transfer module 270 reads instructions from the system bus through the system bus interface 280, and stores the read instructions in the instruction storage module 250.
  • the instruction storage module 250 requests the instruction transfer module 270 to send a read instruction request to the system bus interface 280 to read the corresponding instruction and store it in the instruction storage module 250.
  • the instruction moving module 270 may be a direct memory access (DMA).
  • all instructions can also be written directly into the instruction storage module 250 through the instruction moving module 270.
  • FIG. 11 is a schematic diagram of the tenth structure of the neural network processor provided by the embodiment of the application.
  • FIG. 11 shows that the instruction storage module 250 is connected to the system bus interface 280.
  • the external memory can directly store the program, or the instructions required by the neural network processor 200, into the instruction storage module 250.
  • the embodiment of the present application may also connect the instruction storage module 250 to an external memory through another interface, so that an external processor can directly write instructions or programs into the instruction storage module 250, that is, perform instruction initialization.
  • the data transfer module 260 and the instruction transfer module 270 in the embodiment of the present application are two separate unit modules, which implement the transmission, or transfer, of data and instructions, respectively.
  • the embodiment of the present application needs to set up two DMAs to realize the movement of data and instructions.
  • the data moving module 260 needs to be provided with one or more logical channels;
  • the instruction moving module 270 needs to be provided with one or more physical channels.
  • the instruction moving module 270 is taken as an example for description.
  • the data transfer module 260 in the embodiment of the present application may be a separate DMA, which is defined as DMA1 here; the instruction transfer module 270 may be a separate DMA, which is defined as DMA2 here. That is, DMA1 can move data, and DMA2 can move instructions.
  • FIG. 12 is a schematic diagram of the first structure of direct storage access in the neural network processor provided by an embodiment of the application.
  • the DMA 260a shown in FIG. 12 is equivalent to a partial structural diagram of the data transfer module 260.
  • the DMA 260a includes a plurality of logical channels 262a and an arbitration unit 264a.
  • the multiple logical channels 262a are all connected to the arbitration unit 264a, and the arbitration unit 264a can be connected to the system bus through the system bus interface.
  • the arbitration unit 264a may also be connected to at least one of the peripheral and the storage through other interfaces.
  • the number of logic channels 262a may be h, and h is a natural number greater than 1, that is, there may be at least two logic channels 262a.
  • Each logical channel 262a can receive data movement requests such as request 1, request 2, and request f, and perform data movement operations based on the data movement request.
  • the logical channel 262a of each DMA 260a can complete functions such as descriptor generation, parsing, and control, and the specific conditions are determined according to the composition of the command request (request).
  • the arbitration unit 264a can select a request and place it into the read request queue 266a or the write request queue 268a to wait for data transfer.
  • the logical channel 262a requires software intervention: the software configures the descriptors or registers in advance and completes initialization before data can be moved. All logical channels 262a of the DMA 260a are visible to the software and are scheduled by the software. In some business scenarios, for example when an internal engine such as the instruction distribution module (or the instruction preprocessing module) autonomously performs data movement, scheduling by software is not needed, and the logical channels 262a of such a DMA 260a cannot be used for this. It is therefore inconvenient to port flexibly according to business requirements, and the scheme relies too heavily on software scheduling.
  • the embodiment of the present application also provides a DMA to achieve different transfer requirements.
  • FIG. 13 is a schematic diagram of the second structure of direct storage access in the neural network processor provided by an embodiment of the application.
  • the direct storage access 260b shown in FIG. 13 is functionally equivalent to the instruction movement module 270 and the data movement module 260, or the direct storage access 260b shown in FIG. 13 combines the functions of the instruction movement module 270 and the data movement module 260.
  • the DMA 260b may include at least one logical channel 261b and at least one physical channel 262b. The at least one logical channel 261b and the at least one physical channel 262b are arranged in parallel; it can also be understood that the at least one logical channel 261b and the at least one physical channel 262b are connected to the same interface.
  • At least one physical channel 262b and at least one logical channel 261b can move instructions and data in parallel. Since the physical channel 262b moves instructions at the autonomous request of an internal engine such as the instruction distribution module, it does not need to be scheduled by upper-layer software, so the DMA 260b as a whole no longer depends entirely on software scheduling, which makes it more convenient to move data and to move data flexibly according to business needs. It is understandable that, in the embodiment of the present application, one DMA 260b can realize the movement of both instructions and data, which also saves the number of unit modules.
  • the logical channel 261b can perform data transfer in response to a transfer request scheduled by upper-layer software.
  • the upper layer software may be a programmable unit, such as a central processing unit (CPU).
  • the number of logical channels 261b can be n, and n can be a natural number greater than or equal to 1.
  • the number of logical channels 261b is one, two, three, and so on. It should be noted that the actual number of logical channels 261b can be set according to actual product requirements.
  • the physical channel 262b may perform data transfer in response to a transfer request of an internal engine, and the internal engine may be an instruction distribution module of the NPU, or an instruction preprocessing module.
  • the number of physical channels 262b may be m, and m may be a natural number greater than or equal to 1.
  • the number of physical channels 262b is one, two, three, and so on. It should be noted that the actual number of physical channels 262b can be set according to actual product requirements.
  • the number of logical channels 261b may be two, and the number of physical channels 262b may be one.
  • the DMA 260b may further include a first arbitration unit 263b, and the first arbitration unit 263b is connected to the system bus interface.
  • the first arbitration unit 263b is connected to the system bus interface 264b. It can be understood that the system bus interface 264b may be equivalent to the system bus interface 280.
  • the first arbitration unit 263b can be connected to the system bus through the system bus interface 264b.
  • the first arbitration unit 263b is also connected to all the physical channels 262b and all the logical channels 261b, so that the logical channels 261b and the physical channels 262b can move data and instructions over the system bus. When multiple channels initiate read/write requests simultaneously, the first arbitration unit 263b arbitrates one read/write request and sends it to the system bus interface 264b.
  • for example, the first arbitration unit 263b may select the read/write request of a physical channel 262b and send it to the system bus interface 264b, or it may select the read/write request of a logical channel 261b and send it to the system bus interface 264b.
  • the system bus interface 264b can be arranged outside the DMA 260b. It should be noted that the system bus interface 264b may also be provided inside the DMA 260b, that is, the system bus interface 264b may be a part of the DMA 260b.
  • the first arbitration unit 263b may reallocate the bandwidth of the at least one physical channel 262b and the at least one logical channel 261b.
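The text does not spell out the arbitration policy used by the first arbitration unit 263b; the Python sketch below assumes a simple round-robin policy purely for illustration, and the `Request`/`Arbiter` classes and channel names are invented for the example. It only shows the idea that several logical and physical channels post read/write requests and that one request per cycle is forwarded to the system bus interface.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    channel: str      # e.g. "logical0", "physical0"
    kind: str         # "read" or "write"
    address: int
    length: int

class Arbiter:
    """Round-robin arbiter: one pending request per cycle is forwarded to the bus interface."""
    def __init__(self, channel_names):
        self.order = deque(channel_names)
        self.queues = {name: deque() for name in channel_names}

    def post(self, req: Request):
        self.queues[req.channel].append(req)

    def grant(self):
        # Rotate through the channels and forward the first pending request found.
        for _ in range(len(self.order)):
            name = self.order[0]
            self.order.rotate(-1)
            if self.queues[name]:
                return self.queues[name].popleft()
        return None

arbiter = Arbiter(["logical0", "logical1", "physical0"])
arbiter.post(Request("logical0", "read", 0x1000, 64))
arbiter.post(Request("physical0", "read", 0x2000, 32))
print(arbiter.grant().channel)  # one request per cycle reaches the system bus interface
```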
  • the logical channel 261b may include a logical channel interface 2612b, a descriptor control module 2614b, and a data transmission module 2616b.
  • the logical channel interface 2612b can be connected to a data storage module such as the data storage module 240 shown in FIG. 5.
  • the logical channel interface 2612b, the descriptor control module 2614b, and the data transmission module 2616b are connected in sequence, and the data transmission module 2616b is also connected to the first arbitration unit 263b, which connects to the system bus through the system bus interface 264b.
  • the format of the logical channel interface 2612b can be determined by the format of the command issued by the upper-layer software, and the command received at the logical channel interface 2612b can contain the address of a descriptor.
  • the descriptor control module 2614b indexes the descriptors according to the commands issued by the upper-layer software, parses out the data source address, destination address, data length and other information, and initiates read and write data commands to the data transmission module 2616b of the DMA 260b.
  • the data transmission module 2616b receives the read and write commands from the upper level (the descriptor control module 2614b), converts them into the signals required by the bus, performs the reads and writes to complete the data movement, and returns a response to the descriptor control module 2614b.
  • to move data, the DMA 260b must know several conditions: where the data is transferred from (the source address), where the data is transferred to (the destination address), and when the data is transferred (the trigger source, or trigger signal). The various parameters and conditions of the DMA 260b must be configured before a transfer can be carried out; this configuration is held in a control status register (CSR).
  • the source address, destination address, and trigger source can be set by upper-level software.
  • the various parameters and conditions of the DMA 260b, that is, its configuration information, such as the working mode, arbitration priority, and interface information, can be set in the control status register 269b. For example, the address of the peripheral register, the address of the data memory, the amount of data to be transmitted, the priority between channels, the direction of data transmission, and the cyclic mode can all be set in the control status register 269b.
  • the upper-layer software (for example a programmable unit such as the CPU) issues a data transfer command or request for the logical channel 261b of the DMA 260b to the logical channel interface 2612b; the command also carries the address of a descriptor, or carries the descriptor directly.
  • the address or descriptor of the descriptor is transmitted to the descriptor control module 2614b through the logical channel interface 2612b.
  • if the descriptor control module 2614b receives the address of a descriptor, it reads (indexes) the descriptor according to that address and then parses it, generating the information required for the data movement, such as the source address space, the destination address space, and the data length. If the descriptor control module 2614b receives the descriptor itself, it parses the descriptor directly.
  • the data transmission module 2616b can follow the principle of reading first and writing later, converting the information produced by the descriptor parsing in the descriptor control module 2614b into the signals that the system bus interface 264b needs, and transmitting them to the first arbitration unit 263b.
  • when the first arbitration unit 263b receives read/write requests initiated simultaneously by multiple logical channels 261b, it arbitrates one of them and sends it to the system bus interface 264b.
  • when the first arbitration unit 263b simultaneously receives a read/write request initiated by a logical channel 261b and a read/write request initiated by a physical channel 262b, it likewise arbitrates one of them and sends it to the system bus interface 264b, which transmits it to the system bus.
  • after the read/write request of the DMA 260b is transmitted to the system bus, the system bus completes the read and write commands, and the data in the source address space is written into the destination address space, completing the data movement.
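As a rough illustration of the descriptor-driven flow just described (command, then descriptor indexing and parsing, then read source / write destination, then response), here is a minimal Python sketch. The `Descriptor` fields, the in-memory descriptor table, and the `Memory` helper are assumptions made for the example; actual descriptor formats and bus signalling are not specified at this level in the text.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    src_addr: int     # source address space
    dst_addr: int     # destination address space
    length: int       # number of bytes to move

class Memory:
    def __init__(self, size):
        self.bytes = bytearray(size)
    def read(self, addr, length):
        return self.bytes[addr:addr + length]
    def write(self, addr, data):
        self.bytes[addr:addr + len(data)] = data

def logical_channel_transfer(descriptor_table, desc_addr, memory):
    """Index the descriptor, then read the source range and write it to the destination
    (read first, write later), mimicking the descriptor control + data transmission modules."""
    desc = descriptor_table[desc_addr]               # descriptor control module: index + parse
    data = memory.read(desc.src_addr, desc.length)   # data transmission module: read
    memory.write(desc.dst_addr, data)                # data transmission module: write
    return "done"                                    # response back to the descriptor control module

mem = Memory(0x10000)
mem.write(0x1000, b"feature-map-tile")
table = {0x40: Descriptor(src_addr=0x1000, dst_addr=0x8000, length=16)}
print(logical_channel_transfer(table, 0x40, mem))
print(bytes(mem.read(0x8000, 16)))
```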
  • the physical channel 262b may be connected to an internal engine such as an instruction distribution module through an interface, and the interface may include configuration and parameters for instruction transfer.
  • the configuration and parameters of the physical channel 262b for instruction transfer can also be configured by the control status register 269b.
  • DMA260b can also be connected with other components through other structures to realize data transfer.
  • FIG. 15 is a schematic diagram of the third structure of direct storage access in the neural network processor provided by the embodiment of the application
  • FIG. 16 is the twelfth structure of the neural network processor provided by the embodiment of the application.
  • the DMA 260b may further include a second arbitration unit 265b, and the second arbitration unit 265b may be connected to a storage interface 266b.
  • the storage interface 266b may be connected to a storage module (memory, or BUF).
  • the storage module and the DMA 260b may be located in the same NPU, or the storage module and the DMA 260b may not be located in the same NPU.
  • the DMA260b is located in the NPU, the storage module can be located in the NPU, and the storage module can also be located in other devices.
  • the second arbitration unit 265b can be connected to each logical channel 261b; when the first arbitration unit 263b and the second arbitration unit 265b are connected to the same logical channel 261b, a selector can be used to connect that logical channel 261b to the two arbitration units.
  • the storage interface 266b may be arranged outside the DMA260b or inside the DMA260b.
  • the DMA 260b may also include a third arbitration unit 267b and a peripheral interface 268b, the third arbitration unit 267b being connected to the peripheral interface 268b.
  • the peripheral interface 268b can be connected to an external device that is located in a different device from the DMA 260b; for example, the DMA 260b is located in the NPU while the external device is the CPU.
  • the third arbitration unit 267b can be connected to each logical channel 261b; when the first arbitration unit 263b and the third arbitration unit 267b are connected to the same logical channel 261b, a selector can be used to connect that logical channel 261b to them.
  • the peripheral interface 268b may be set outside the DMA260b, or may be set inside the DMA260b.
  • the DMA 260b in the embodiment of the present application may also include a first arbitration unit 263b, a second arbitration unit 265b, and a third arbitration unit 267b at the same time.
  • the first arbitration unit 263b is connected to the system bus interface 264b
  • the second arbitration unit 265b is connected to the storage interface 266b
  • the third arbitration unit 267b is connected to the peripheral interface 268b.
  • the first arbitration unit 263b, the second arbitration unit 265b, and the third arbitration unit 267b can all be connected to the logical channels 261b. When the first arbitration unit 263b, the second arbitration unit 265b, and the third arbitration unit 267b are connected to the same logical channel 261b, a selector can be used to connect that logical channel 261b to the three arbitration units.
  • further arbitration units may also be provided in the embodiment of the present application to connect other components through other interfaces.
  • FIG. 17 is a schematic diagram of the thirteenth structure of the neural network processor provided by an embodiment of the application.
  • FIG. 17 shows a connection relationship between the direct storage access 260b of FIG. 13 or FIG. 15 and other elements of the neural network processor 200.
  • the direct storage access 260b is connected to the system bus interface 280, the instruction storage module 250, and the data storage module 240.
  • the direct storage access 260b can move data to the data storage module 240 through the system bus interface 280, move instructions to the instruction storage module 250 through the system bus interface 280, and also move the data stored in the data storage module 240 out to the external memory through the system bus interface 280.
  • the data of the first processing module 210 in the neural network processor 200 in the embodiment of the present application can be directly stored in the data storage module 240, and the data of the data storage module 240 can also be loaded into the first processing module 210, so that the program is relatively streamlined.
  • the embodiment of the present application may also add a general register between the data storage module 240 and the first processing module 210.
  • the neural network processor with general registers will be described in detail below with reference to the accompanying drawings.
  • FIG. 18 is a schematic diagram of the fourteenth structure of the neural network processor provided by an embodiment of the application.
  • the neural network processor 200 may also include a general register 290 and a load storage module 202.
  • the general register 290 is connected to the first processing module 210, and the general register 290 can be connected to all processing units in the first processing module 210.
  • the general register 290 is connected to the convolution processing unit 212 and the vector processing unit 214 of the first processing module 210; both can obtain the data they need from the general register 290, and both can also store their respective processing results in the general register 290. It should be noted that the number of processing units in the first processing module 210 is not limited to what is shown in the figure.
  • the first processing module 210 further includes a shaping processing unit.
  • the general register 290 may include a plurality of registers, for example, the general register 290 includes a plurality of vector registers 292.
  • the general register 290 includes a plurality of prediction registers 294.
  • the general register 290 includes a plurality of vector registers 292 and a plurality of prediction registers 294.
  • the multiple vector registers 292 may be referred to as a vector register file (Vector Register File, VRF) for short.
  • the multiple prediction registers 294 may be referred to as a predicate register file (PRF) for short; the prediction registers may also be called predicate registers.
  • the type and number of the registers in the general register 290 can be set according to actual requirements, to improve the flexibility of software programming.
  • the convolution processing unit 212 may have dedicated registers 2122 that can store data, for example two dedicated registers: a first dedicated register that can store image data and a second dedicated register that can store weights.
  • a load store unit (LSU) 202 is connected to the general register 290, and the load store module 202 can load data into the general register 290, so that each processing unit of the first processing module 210 can obtain data from the general register 290.
  • the load storage module 202 can also be connected to the dedicated register 2122 of the convolution processing unit 212 and can directly load data into that dedicated register, so that the convolution processing unit 212 can process the data, for example by convolution. This can increase the speed of loading data.
  • FIG. 18 only shows part of the components of the neural network processor 200.
  • for the other components, reference may be made to FIGS. 1 to 17 and the corresponding description of the embodiments of the present application. The relationship of the load storage module 202 with the general register 290 and the other components of the neural network processor 200 will be described in detail below with reference to FIG. 19.
  • FIG. 19 is a schematic diagram of a fifteenth structure of a neural network processor provided by an embodiment of the application.
  • the load store unit (LSU) 202 is connected to the general register 290 and the data storage module 240.
  • the load storage module 202 can load the data of the data storage module 240 into the general register 290.
  • the processing units of the first processing module 210, such as the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216, can obtain the data they need to process from the general register 290 according to instructions.
  • the general register 290 may be connected to a plurality of processing units, such as the general register 290 is connected to the convolution processing unit 212, and the general register 290 is also connected to at least one of the vector processing unit 214 and the shaping processing unit 216. Therefore, the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216 can all obtain the data to be processed from the general register 290.
  • the convolution processing unit 212, the vector processing unit 214, and the shaping processing unit 216 can also store their respective processing results in the general register 290. Furthermore, the load storage module 202 can store the processing result in the general register 290 to the data storage module 240, and the data storage module 240 can transfer the processing result to the external memory through the direct storage access or the data transfer module 260.
  • the second processing module 230, for example the scalar processing unit 232, in the embodiment of the present application is not connected to the general register 290; the data to be processed by the scalar processing unit 232 can be carried by the instructions it receives.
  • the scalar processing unit 232 in the embodiment of the present application may also be connected to the data storage module 240 to obtain the data to be processed from the data storage module 240.
  • the load storage module 202 of the embodiment of the present application can not only store the data of the data storage module 240 into the general register 290, but can also load data to other locations.
  • the load storage module 202 is also directly connected to the convolution processing unit 212; "directly connected" can be understood as a connection that does not pass through the general register 290. The connection between the load storage module 202 and the convolution processing unit 212 can be understood as a connection between the load storage module 202 and the dedicated register 2122 of the convolution processing unit 212, for example a connection between the load storage module 202 and one of the dedicated registers 2122 in the convolution processing unit 212.
  • the load storage module 202 can directly load the data of the data storage module 240, such as weights, to one of the special registers 2122 of the convolution processing unit 212. It can be understood that the load storage module 202 can also directly load other data, such as image data, to one of the special registers 2122 of the convolution processing unit 212.
  • the load storage module 202 of the embodiment of the present application can directly load the data of the data storage module 240 to the convolution processing unit 212, and the load storage module 202 can also store the data of the data storage module 240 in the general register 290.
  • the processing units of the first processing module 210, such as the convolution processing unit 212, may obtain the corresponding data from the general register 290 based on the instructions they receive.
  • the load storage module 202 can directly load first data to the convolution processing unit 212, and can store second data in the general register 290, from which the convolution processing unit 212 can obtain the second data.
  • the types of the first data and the second data may be different; for example, the first data is weight data and the second data is image data. The convolution processing unit 212 in the embodiment of the present application can therefore receive the data to be processed over different channels. Compared with receiving all data over the same channel, this increases the data loading speed and thus the computation rate of the neural network processor 200. Moreover, the embodiments of the present application can simplify the instruction set, making it easy to implement, and also make the compiler easier to optimize.
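A minimal sketch of this dual-path loading, using plain dictionaries as stand-in "registers"; the helper names (`load_weights_direct`, `load_via_general_register`) are hypothetical and not taken from the text. It only shows weights going straight into the convolution unit's dedicated register while image data is staged through the general register file.

```python
# Hypothetical register containers; names are illustrative only.
dedicated_weight_register = {}   # dedicated register 2122 inside the convolution unit
general_register_file = {}       # general register 290 shared by the processing units

def load_weights_direct(data_storage, tag):
    """First path: weights go straight from the data storage module to the dedicated register."""
    dedicated_weight_register[tag] = data_storage[tag]

def load_via_general_register(data_storage, tag):
    """Second path: image data is staged in the general register file first."""
    general_register_file[tag] = data_storage[tag]

def convolution_unit_fetch(tag_weights, tag_image):
    # The convolution unit reads weights from its dedicated register and image data
    # from the general register file, i.e. over two independent channels.
    return dedicated_weight_register[tag_weights], general_register_file[tag_image]

data_storage_module = {"w0": [1, 2, 3], "img0": [4, 5, 6]}
load_weights_direct(data_storage_module, "w0")
load_via_general_register(data_storage_module, "img0")
print(convolution_unit_fetch("w0", "img0"))
```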
  • after the load storage module 202 directly loads the first data into the convolution processing unit 212 and loads the second data into the general register 290, other processing units of the first processing module 210 may also use the second data; for example, the vector processing unit 214 obtains the second data from the general register 290.
  • the load storage module 202 can also load other data, such as third data, into the general register 290, which can be obtained from the general register 290 by one or more processing units of the first processing module 210, such as the shaping processing unit 216.
  • the third data can be different from the types of the first data and the second data.
  • the load storage module 202 is also connected to the instruction distribution module 220.
  • the load storage module 202 can receive the instructions transmitted by the instruction distribution module 220.
  • the load storage module 202 can load the data of the data storage module 240 according to the instructions issued by the instruction distribution module 220, and can also store the processing results held in the general register 290 back to the data storage module 240 according to the instructions issued by the instruction distribution module 220.
  • the processing result is, for example, the processing result of the vector processing unit 214.
  • the instruction distribution module 220 can transmit multiple instructions to the first processing module 210, the second processing module 230, and the load storage module 202 in parallel within one clock cycle.
  • the instruction distribution module 220 can issue multiple instructions to the scalar processing unit 232, the convolution processing unit 212, the vector processing unit 214, and the load storage module 202 in parallel within one clock cycle.
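The following toy dispatcher illustrates the multi-issue behaviour described above: at most one instruction per execution unit is issued per clock cycle, and instructions that would conflict on the same unit wait for the next cycle. The instruction encoding (dictionaries with `unit`/`op` keys) is an assumption made purely for illustration.

```python
# A toy single-cycle multi-issue dispatcher. The unit names follow the modules named in the
# text; the instruction encoding is an assumption for illustration only.
EXECUTION_UNITS = {
    "conv":   "convolution processing unit 212",
    "vector": "vector processing unit 214",
    "scalar": "scalar processing unit 232",
    "lsu":    "load storage module 202",
}

def dispatch_one_cycle(instruction_queue):
    """Issue at most one instruction per unit in a single clock cycle,
    provided the instructions target different units (no resource conflict)."""
    issued, busy, remaining = [], set(), []
    for instr in instruction_queue:
        unit = instr["unit"]
        if unit in EXECUTION_UNITS and unit not in busy:
            busy.add(unit)
            issued.append((EXECUTION_UNITS[unit], instr["op"]))
        else:
            remaining.append(instr)          # conflicting instructions wait for the next cycle
    return issued, remaining

queue = [
    {"unit": "conv",   "op": "vector_inner_product"},
    {"unit": "vector", "op": "relu"},
    {"unit": "lsu",    "op": "load_weights"},
    {"unit": "conv",   "op": "vector_inner_product"},  # same unit: deferred
]
issued, remaining = dispatch_one_cycle(queue)
print(issued)
print(remaining)
```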
  • the load storage module 202 and the data storage module 240 can be integrated together, and serve as two parts of one module.
  • the load and storage module 202 and the data storage module 240 may also be provided separately, or in other words, the load and storage module 202 and the data storage module 240 are not integrated into one module.
  • FIG. 20 is a schematic diagram of the sixteenth structure of the neural network processor provided by an embodiment of the application.
  • the neural network processor 200 may also include a data movement engine 204.
  • the data movement engine 204 may also be referred to as a register file data movement engine (MOVE).
  • the data movement engine 204 can move data between different registers, so that a processing unit of the first processing module 210, such as the convolution processing unit 212, and a processing unit of the second processing module 230, such as the scalar processing unit 232, can obtain the data they need inside the NPU 200, without transmitting the data outside the NPU 200 to be processed by upper-layer software and then returned. In other words, the data movement engine 204 enables data interaction between different registers, saving some data in the NPU 200 from being sent outside, reducing the interaction between the NPU 200 and upper-layer software such as the CPU, improving the efficiency with which the NPU 200 processes data, and also reducing the workload of the external CPU.
  • the data movement engine 204 is connected to the general register 290 and the scalar processing unit 232 of the second processing module 230.
  • the scalar processing unit 232 can refer to the above content and will not be repeated here.
  • the scalar processing unit 232 includes a plurality of scalar registers 2322, referred to as a scalar register file for short, and the scalar processing unit 232 is connected to the data movement engine 204 through the scalar register 2322.
  • the general register 290 has a plurality of registers, referred to as a register file, and is connected to the data movement engine 204 through these registers. All of the registers of the general register 290 may be connected to the data movement engine 204, or only a part of them may be.
  • FIG. 21 is a schematic diagram of a seventeenth structure of a neural network processor provided by an embodiment of the application.
  • the general register 290 in the neural network processor 200 may include a plurality of vector registers 292, referred to as vector register files.
  • the vector registers 292 in the embodiment of the present application may all be connected to the data movement engine 204, or only some of them may be connected to the data movement engine 204; "some" can be understood as at least one vector register but not all of the vector registers.
  • the general register 290 in the neural network processor 200 may include a plurality of prediction registers 294, referred to as prediction register file, or predicate register file.
  • the prediction registers 294 in the embodiment of the present application may all be connected to the data movement engine 204, or only some of the prediction registers 294 may be connected to the data movement engine 204.
  • when the general register 290 includes multiple types of registers, it can be connected to the data movement engine 204 through all types of registers, or only through some types; for example, when the general register 290 of the neural network processor 200 includes multiple vector registers 292 and multiple prediction registers 294, the general register 290 may be connected to the data movement engine 204 only through the vector registers 292.
  • FIG. 20 and FIG. 21 only show part of the components of the neural network processor 200.
  • for the other components, please refer to FIGS. 1 to 19 and the corresponding description of the present application. The relationship between the data movement engine 204 and the other components, and the specific way in which the data movement engine 204 moves data, are described in detail below with reference to FIG. 22.
  • FIG. 22 is a schematic diagram of an eighteenth structure of a neural network processor provided by an embodiment of the application.
  • when some data of the neural network processor 200, such as data processed by the convolution processing unit 212, the vector processing unit 214, or the shaping processing unit 216 of the first processing module 210, requires scalar calculation, that data can be stored in the general register 290; the data movement engine 204 then moves it to the scalar processing unit 232, which performs the scalar calculation. After the scalar processing unit 232 finishes, the data movement engine 204 can move the calculation result back to the general register 290, where the corresponding processing unit in the first processing module 210 can obtain it. The data transfer therefore stays entirely inside the NPU 200; compared with sending the data outside to be processed by upper-layer software such as the CPU and then returned, this reduces the interaction between the NPU 200 and the outside and improves the efficiency with which the NPU 200 processes data.
  • the data processed by the convolution processing unit 212, the vector processing unit 214, or the shaping processing unit 216 of the first processing module 210 requires scalar calculation when, for example, an intermediate result produced by one of these units requires a judgment operation, which can be completed by the scalar processing unit 232. In this case the data stored in the general register 290 is data to be judged, and the data movement engine 204 moves the data to be judged to the scalar register 2322 of the scalar processing unit 232 so that the judgment operation can be performed.
  • conversely, when scalar data needs to be transformed into vector data, the data movement engine 204 can move the scalar data to the general register 290, and the corresponding processing unit in the first processing module 210, such as the vector processing unit 214, can obtain the scalar data from the general register 290 and transform it into vector data. Transforming scalar data into vector data can also be called expanding scalar data into vector data; for example, one 32-bit value is copied into 16 identical values to form a 512-bit vector.
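The 32-bit-to-512-bit expansion mentioned above can be pictured with a one-line NumPy broadcast; the function name below is hypothetical, and the sketch simply checks the 16 × 32 bits = 512 bits arithmetic from the text.

```python
import numpy as np

def expand_scalar_to_vector(scalar_value, lanes=16, lane_bits=32):
    """Broadcast one 32-bit scalar into a vector register image:
    16 identical 32-bit lanes = 512 bits, matching the example in the text."""
    vector = np.full(lanes, scalar_value, dtype=np.int32)
    assert vector.size * lane_bits == 512
    return vector

print(expand_scalar_to_vector(7))
# [7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]
```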
  • the instruction distribution module 220 is connected to the data movement engine 204, and the instruction distribution module 220 can transmit instructions to the data movement engine 204, and the data movement engine 204 can perform data movement operations according to the instructions it receives.
  • the instruction distribution module 220 transmits a first instruction to the data movement engine 204, and the data movement engine 204 moves the data of the general register 290 to the scalar register 2322 of the scalar processing unit 232 according to the first instruction.
  • the instruction distribution module 220 transmits a second instruction to the data movement engine 204, and the data movement engine 204 moves the data of the scalar register 2322 to the general register 290 according to the second instruction.
  • the instruction distribution module 220 can transmit multiple instructions to the first processing module 210, the second processing module 230, the load storage module 202, and the data movement engine 204 in parallel within one clock cycle.
  • the instruction distribution module 220 can issue multiple instructions to the convolution processing unit 212, the vector processing unit 214, the scalar processing unit 232, the load storage module 202, and the data movement engine 204 in parallel within one clock cycle.
  • the neural network processor 200 can perform convolutional neural network operations, cyclic neural network operations, etc. The following takes convolutional neural network operations as an example.
  • the neural network processor 200 obtains data to be processed (such as image data) from the outside, and the convolution processing unit 212 in the neural network processor 200 may perform convolution processing on the data to be processed.
  • the input of the convolutional layer in the convolutional neural network includes input data (such as data to be processed from the outside) and weight data.
  • the main calculation process of the convolutional layer is to perform convolution operations on the input data and weight data to obtain output data.
  • the main body that performs the convolution operation is the convolution processing unit, which can also be understood as the convolution processing unit of the neural network processor performing the convolution operation on the input data and the weight data to obtain the output data.
  • the weight data can be understood as one or more convolution kernels in some cases. The convolution operation will be described in detail below.
  • assume the size of the input data is H×W×C1 and the size of the weight data is K×R×S×C2, where H is the height of the input data, W is the width of the input data, C1 is the depth of the input data, K is the number of outputs of the weight data, that is, the number of convolution kernels, R is the height of the weight data (the height of a convolution kernel), S is the width of the weight data (the width of a convolution kernel), and C2 is the depth of the weight data (the depth of a convolution kernel). C2 of the weight data is equal to C1 of the input data, because C2 and C1 are corresponding depth values. The input data size can also be N×H×W×C, where N is the number of batches of input data.
  • in one mode, the convolution processing unit first takes a window of the input data according to the size of the convolution kernel; the window area is multiplied and accumulated with one convolution kernel of the weight data to obtain one output value. The window then slides in the W direction and the H direction, with a multiply-accumulate operation at each position, to obtain H'×W' values; finally, traversing the K convolution kernels gives K×H'×W' values.
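A naive reference implementation of this first mode, producing a K×H'×W' output, might look like the following sketch; it uses plain nested loops with no MAC-array tiling, and the stride handling is an assumption since the text does not state it.

```python
import numpy as np

def conv_khw(inputs, kernels, stride=1):
    """Naive convolution in the first mode described above.
    inputs: (H, W, C) feature map; kernels: (K, R, S, C); output: (K, H', W')."""
    H, W, C = inputs.shape
    K, R, S, C2 = kernels.shape
    assert C == C2, "kernel depth must match input depth"
    Hp = (H - R) // stride + 1
    Wp = (W - S) // stride + 1
    out = np.zeros((K, Hp, Wp), dtype=inputs.dtype)
    for k in range(K):                       # traverse the K convolution kernels last
        for i in range(Hp):                  # slide in the H direction
            for j in range(Wp):              # slide in the W direction
                window = inputs[i*stride:i*stride+R, j*stride:j*stride+S, :]
                out[k, i, j] = np.sum(window * kernels[k])   # one multiply-accumulate
    return out

x = np.random.rand(8, 8, 3).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv_khw(x, w).shape)   # (4, 6, 6), i.e. K x H' x W'
```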
  • the convolution processing unit may also adopt other convolution operation modes.
  • the convolution operation in another mode will be described in detail below.
  • FIG. 23 is a schematic diagram of the convolution operation of the convolution processing unit in the neural network processor provided by the embodiment of the present application.
  • in this mode the size of the input data is still H×W×C and the size of the weight data is still K×R×S×C; the input data size can also be N×H×W×C, where N is the number of batches of input data.
  • the convolution processing unit first takes a window of the input data according to the size of the convolution kernel; the first window area after windowing is multiplied and accumulated with all the convolution kernels in the weight data to obtain the corresponding output data, and the window then slides in the W direction and the H direction with multiply-accumulate operations at each position to obtain H'×W'×K data.
  • the specific operation steps are as follows (can also be understood as the specific steps of the convolution processing unit to perform the convolution operation as follows):
  • the convolution operation unit includes a multiply-accumulate array (MAC Array) used for convolution operations.
  • the size of the multiply-accumulate array (L×M) is fixed, where L is the length of one multiply-accumulate operation and M is the number of multiply-accumulate units operating in parallel; in other words, M multiply-accumulate operations of length L can be performed in one cycle.
  • the steps of assigning the multiply-accumulate operations in the above convolution process to the convolution operation unit for parallel operation are as follows (they can also be understood as the specific steps by which the convolution processing unit performs the multiply-accumulate operations using the multiply-accumulate array):
  • the first window area is divided into C/L data segments of length L along the depth direction; it should be noted that the first window area can be divided into C/L data segments of length L after it is obtained, or the input data can first be divided into C/L data segments of length L and the first window area obtained afterwards. In either case the first window area includes C/L data segments of length L, that is, it can be regarded as including C/L layers of first depth data along the depth direction. Each convolution kernel is likewise divided into C/L data segments of length L, and this is done for the K convolution kernels in the weight data to obtain K groups of weight data, each group containing C/L weight data segments; in other words, each convolution kernel includes C/L weight data segments of length L along the depth direction. The K convolution kernels can also be divided into K/M convolution kernel groups, each group containing the weight data of M convolution kernels. In the i-th step, the i-th data segment of the first window area is multiplied and accumulated with the i-th weight data segments of M convolution kernels (M weight data segments, one from each of the M convolution kernels) to obtain M first operation data, which are accumulated onto the M first operation data calculated before; i runs from 1 to C/L, after which M target operation data are obtained.
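Here is a minimal sketch of the tiling just described, assuming small illustrative values L=4 and M=2: the depth of one window area is split into C/L segments of length L, each segment is multiplied and accumulated with the matching segments of a group of M kernels, and the partial results are accumulated into M target operation data. The final check against `numpy.tensordot` is only there to show that the tiling reproduces the plain inner product.

```python
import numpy as np

def mac_array_window(window, kernels, L=4, M=2):
    """Compute M target outputs for one window area with an L x M MAC array.
    window: (R, S, C) first window area; kernels: (M, R, S, C) one kernel group.
    The depth C is split into C//L segments of length L; each step performs M
    multiply-accumulates of length L and accumulates onto the running results."""
    R, S, C = window.shape
    assert C % L == 0, "depth is assumed padded to a multiple of L"
    win = window.reshape(R * S, C)            # flatten spatial positions
    ker = kernels.reshape(M, R * S, C)
    acc = np.zeros(M, dtype=np.float32)       # M target operation data
    for pos in range(R * S):
        for i in range(C // L):               # i-th depth segment of length L
            seg = win[pos, i*L:(i+1)*L]
            for m in range(M):                # M parallel MAC units
                acc[m] += np.dot(seg, ker[m, pos, i*L:(i+1)*L])
    return acc

win = np.random.rand(3, 3, 8).astype(np.float32)
grp = np.random.rand(2, 3, 3, 8).astype(np.float32)
print(np.allclose(mac_array_window(win, grp),
                  np.tensordot(grp, win, axes=([1, 2, 3], [0, 1, 2]))))  # True
```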
  • the height H, width W, and depth C of the input data are arbitrary, that is, the input data can have many sizes; for example, the width W of the input data is uncertain, so in most cases W is not evenly divisible by the number M of multiply-accumulate units operating in parallel, and some of the multiply-accumulate units would be wasted if the parallelism were mapped to W.
  • in this embodiment, however, it is the number K of convolution kernels that is divided by the number M of parallel multiply-accumulate units in the multiply-accumulate array. The number K of convolution kernels is generally a fixed number and a power of 2 (that is, 2^n), or one of a limited set of values (for example, K is one of 32, 64, 128, 256). Therefore, when the multiply-accumulate units are configured, the number M can be set to be equal to K or such that K is an integer multiple of M, for example M is one of 32, 64, 128, and so on.
  • This embodiment can make full use of the multiply-accumulate operation unit, reduce the waste of the multiply-accumulate operation unit, and improve the efficiency of the convolution operation.
  • mapping the number M of multiply-accumulate units to the number K of convolution kernels is a division along a single dimension. If M were mapped to the sliding window area instead, the correspondence would involve not only the width W dimension but also the H dimension, and a correspondence across two dimensions is not conducive to folding.
  • the format of the output target operation data in this embodiment is H'×W'×K, which is the same as the input data format, so it can be used directly as the input data of the next calculation layer (the next convolution layer or the next pooling layer, etc.).
  • the target operation data is continuous in the depth direction, so it can be stored as continuous data and subsequently read continuously; when the hardware loads it, the address does not need to be calculated multiple times, which improves calculation efficiency.
  • in one example, C is greater than L and K is greater than M, and L and M of the MAC array have the same value, for example both are 64. The input data is padded in the depth direction at a granularity of 64 and divided into 1×1×64 data blocks along the depth direction; when the depth is less than 64, it is padded up to 64. The weight data is likewise padded in the depth direction at a granularity of 64 and divided into 1×1×64 data blocks along the depth direction; when the depth is less than 64 it is padded up to 64, and when the number of convolution kernels is greater than 64 they are divided into multiple groups at a granularity of 64.
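A small helper showing the depth padding at a granularity of 64 described in the example above; the function name and the ceiling-division idiom are purely illustrative.

```python
import numpy as np

def pad_depth(array, granularity=64):
    """Pad the last (depth) axis up to a multiple of `granularity` with zeros,
    so the depth can be split evenly into length-64 segments for the MAC array."""
    depth = array.shape[-1]
    padded_depth = -(-depth // granularity) * granularity   # ceil to a multiple of 64
    pad = [(0, 0)] * (array.ndim - 1) + [(0, padded_depth - depth)]
    return np.pad(array, pad)

inputs = np.random.rand(16, 16, 50).astype(np.float32)     # depth 50 < 64
kernels = np.random.rand(96, 3, 3, 50).astype(np.float32)  # 96 kernels > 64
print(pad_depth(inputs).shape)    # (16, 16, 64)
print(pad_depth(kernels).shape)   # (96, 3, 3, 64); the 96 kernels would then be
                                  # split into groups at a granularity of 64
```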
  • the convolution processing unit may also transfer the K target operation data corresponding to one window area to the next layer for use in its operation, or transfer the N×K target operation data corresponding to N first window areas to the next layer, where N is less than the total number of first window areas of the output data.
  • because each first window area is fully calculated, that is, all the data in each first window area (including the depth direction) is multiplied and accumulated with all the convolution kernels (including the depth direction), the target operation data obtained for that window area is complete. One or more target operation data corresponding to the first window area can therefore be transmitted to the next layer first, without waiting for all the input data to be calculated. Once the partial target operation data transmitted to the next layer is enough to form the smallest unit of the next layer's calculation (for example, enough to form one window area of the next layer's input data), the next layer can start its calculation without waiting for all the operation results of the upper layer, which improves the efficiency of the convolution operation and shortens the convolution time.
  • because the internal buffer of the NPU where the convolution operation unit is located is generally small, it cannot store large intermediate results. If the data produced by the convolution operation were in the K×H'×W' format, the whole result of this layer would have to be calculated before the next layer could start, and the output data would have to be cached in external memory (that is, memory outside the NPU). The result of the convolution operation in this embodiment, however, is in the H'×W'×K format; after part of the result on the H'×W' plane has been calculated, the next layer can be computed on it directly, and the smaller internal cache of the NPU only needs to store 1×W'×K, or N1×W'×K, or N1×N2×K data, where N1 can be much smaller than H' and N2 can be much smaller than W'.
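A sketch of the streaming behaviour this H'×W'×K layout enables, assuming (hypothetically) that the next layer needs three complete output rows before it can window over them: only a few rows ever sit in the internal buffer, and next-layer tiles are produced before the current layer finishes.

```python
import numpy as np

def stream_rows_to_next_layer(conv_rows, next_layer_kernel_rows=3):
    """Illustrative streaming of H' x W' x K results row by row.
    `conv_rows` yields one (W', K) output row at a time; as soon as enough rows are
    buffered to cover the next layer's kernel height, a next-layer tile is produced.
    Only a few rows (N1 x W' x K) ever sit in the internal buffer."""
    buffer = []
    for row in conv_rows:                     # each row: shape (W', K)
        buffer.append(row)
        if len(buffer) >= next_layer_kernel_rows:
            tile = np.stack(buffer[-next_layer_kernel_rows:])  # (3, W', K)
            yield tile                        # smallest unit the next layer can window over
            buffer.pop(0)                     # keep the buffer at N1 rows

def fake_conv_rows(H_out=6, W_out=6, K=4):
    for _ in range(H_out):
        yield np.random.rand(W_out, K).astype(np.float32)

for i, tile in enumerate(stream_rows_to_next_layer(fake_conv_rows())):
    print("next-layer tile", i, tile.shape)   # (3, 6, 4) tiles, produced before the layer finishes
```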
  • when the target operation data to be transferred to the next layer overlaps with the target operation data transferred last time, the duplicate data can be removed to obtain the target data, and only the target data is transferred to the next layer; this optimizes data transmission and storage. Of course, the full target operation data can also be transmitted each time, overwriting the repeated data.
  • the length L of the multiply-accumulate operation of the MAC array can be equal to the number M of parallel multiply-accumulate units; when L and M of the multiply-accumulate array are equal, the amount of data handled in the two directions is the same, which makes it easy to rearrange the calculated results.
  • in other embodiments, L and M of the multiply-accumulate array may also be unequal, to facilitate the setting of the multiply-accumulate array.
  • in some embodiments, the convolution processing unit may be configured to: perform a windowing operation on the input data according to the convolution kernel to obtain a first window area, which includes a first number of layers of first depth data along the depth direction; obtain multiple convolution kernels; and multiply and accumulate the first depth data of one layer with the second depth data of the same layer of the multiple convolution kernels to obtain one first operation data for that layer. The convolution processing unit may also operate on multiple layers, and is further configured to accumulate the multiple first operation data corresponding to the first depth data of the multiple layers to obtain the target operation data; that is, on the basis of the single-layer operation above, the first depth data of multiple layers is multiplied and accumulated with the second depth data of the multiple convolution kernels, and the multiple first operation data are accumulated to obtain the target operation data.
  • the convolution processing unit can store its operation result in the data storage module, and can also transmit the operation result to the vector processing unit or the shaping processing unit for further calculation operations.
  • the neural network processor 200 provided in the embodiment of the present application can be integrated into one chip.
  • FIG. 24 is a schematic structural diagram of a chip provided by an embodiment of the application.
  • the chip 20 includes a neural network processor 200, and the neural network processor 200 can refer to the above content, which will not be repeated here.
  • the chip 20 can be applied to electronic equipment.
  • neural network processor 200 of the embodiment of the present application may also be integrated with other processors, memories, etc. in a chip.
  • the electronic device 20 may include a neural network processor 200, a system bus 400, an external memory 600, and a central processing unit 800.
  • the neural network processor 200, the external memory 600, and the central processing unit 800 are all connected to the system bus 400, so that the neural network processor 200 and the external memory 600 can realize data transmission.
  • the system bus 400 is connected to the neural network processor 200 through the system bus interface 280.
  • the system bus 400 may be connected to the central processing unit 800 and the external memory 600 through other system bus interfaces.
  • the neural network processor 200 is controlled by the central processing unit 800 to obtain the data to be processed from the external memory 600, process the data to be processed to obtain processing results, and feed back the processing results to the external memory 600
  • the upper-level driver software of the electronic device 20, such as software running on the central processing unit 800, writes the configuration of the program to be executed into the corresponding registers, for example the working mode, the initial value of the program counter (PC), and configuration parameters.
  • the data movement module 260 reads the data to be processed, such as image data and weight data, from the external memory 600 through the system bus interface 280, and writes the data to the data storage module 240.
  • the instruction distribution module 220 starts to fetch instructions according to the initial PC. After the instruction is fetched, the instruction distribution module 220 transmits the instruction to the corresponding processing unit according to the type of the instruction. Each processing unit performs different operations according to specific instructions, and then writes the results to the data storage module 240.
  • the register is the configuration status register of the neural network processor 200, or is called the control status register, which can set the working mode of the neural network processor 200, such as the bit width of the input data, the position of the initial PC of the program, and so on.
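Seen from the host side, the start-up sequence described above (the driver writes the control status registers, data and weights are staged by the data movement module, and instruction fetch starts from the initial PC) could be sketched as follows; the register names and the `write_reg`/`dma_copy` helpers are hypothetical and do not correspond to an actual driver API.

```python
# A host-side view of bringing up the NPU, following the sequence in the text.
CSR = {}

def write_reg(name, value):
    CSR[name] = value

def dma_copy(src, dst, length):
    print(f"DMA: {length} bytes {src} -> {dst}")

def launch_npu_program(weights_addr, image_addr, program_addr):
    # 1. Driver software (e.g. on the CPU) configures the NPU through its control status registers.
    write_reg("working_mode", "int8")
    write_reg("initial_pc", program_addr)
    # 2. The data movement module reads the data to be processed and the weights from external
    #    memory through the system bus interface and writes them to the data storage module.
    dma_copy(image_addr, "data_storage_module", 64 * 1024)
    dma_copy(weights_addr, "data_storage_module", 32 * 1024)
    # 3. The instruction distribution module starts fetching from the initial PC; each
    #    processing unit executes its instructions and writes results back to data storage.
    write_reg("start", 1)

launch_npu_program(weights_addr=0x8000_0000, image_addr=0x8010_0000, program_addr=0x0)
print(CSR)
```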
  • the neural network processor shown in FIG. 25 can also be replaced with any of the other neural network processors described above.
  • FIG. 26 is a schematic flowchart of a data processing method provided by an embodiment of this application.
  • the data processing method is based on the above-mentioned neural network processor to process data.
  • the data processing method includes:
  • the data to be processed may be image data and weight data that need to be processed by a neural network processor.
  • the data transfer module 260 may be used to read the data to be processed from the external memory 600 through the system bus interface 280.
  • the DMA 260b can also be used to move the data to be processed from the external memory through the system bus interface 264b.
  • the to-be-processed data can be loaded into the data storage module 240.
  • the multiple instructions may be calculation instructions or control instructions.
  • the instruction transfer module 270 can be used to read the required instructions from the outside through the system bus interface 280.
  • the DMA 260b can also be used to transfer the required instructions from the outside through the system bus interface 264b, and it is also possible to write instructions into the NPU 200 directly from the outside. After the multiple instructions are received, they can be loaded into the instruction storage module 250.
  • according to the multiple instructions received, the instruction distribution module 220 of the neural network processor 200 can transmit the multiple instructions to the respective processing units within one clock cycle, so that each processing unit processes the data to be processed according to its instructions.
  • the instruction distribution module 220 can transmit multiple instructions to at least two processing units of the first processing module 210 in one clock cycle.
  • the instruction distribution module 220 may transmit multiple instructions to at least one processing unit of the scalar processing unit 232 and the first processing module 210 within one clock cycle.
  • each processing unit processes the data according to the instructions.
  • before transmitting the instructions, the instruction distribution module 220 first sends a judgment signal to the data storage module 240; when a signal is returned from the data storage module 240, the instruction distribution module 220 determines from the returned signal whether the data storage module 240 has buffered data to be processed. If the instruction distribution module 220 determines that the data storage module 240 does not store data to be processed, it does not transmit the instructions to the processing units; only when it determines that the data storage module 240 stores data to be processed does it transmit the instructions to the multiple processing units.
  • the multiple processing units process the to-be-processed data according to the multiple instructions to obtain a processing result.
  • each processing unit obtains a processing result after processing the data to be processed, and the multiple processing units may write their processing results to the data storage module 240.
  • the data transfer module 260 and the system bus interface 280 can transmit the processing result to the external memory 600.
  • when the instruction distribution module 220 of the neural network processor in the embodiment of the present application receives an end-identification instruction, it considers that the program has finished executing and issues an interrupt to the upper-layer software to end the work of the NPU 200; if the program is not finished, the flow returns to step 1002 and continues to fetch and issue instructions until the program has been executed.
  • FIG. 27 is a schematic flowchart of a data processing method provided by an embodiment of the application.
  • the data processing method is based on the above-mentioned neural network processor to process data.
  • the data processing method includes:
  • the data of the general register is moved to the scalar register according to the first condition.
  • the first condition can be the first instruction.
  • the data movement engine 204 can move the data of the general register 290 to the scalar register 2322 according to the first instruction.
  • the specific content please refer to the above content, which will not be repeated here.
  • the data of the scalar register is moved to the general register according to the second condition.
  • the second condition can be the second instruction.
  • the data movement engine 204 can move the data of the scalar register 2322 to the general register 290 according to the second instruction.
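A toy model of the two move directions handled by the data movement engine 204: the "first" instruction moves a value from the general register file to the scalar register file, and the "second" instruction moves a value back. The dictionary registers and the `move_engine` function are illustrative only.

```python
# Toy model of the register-file data movement engine (MOVE).
general_register_file = {"v0": 3.14}
scalar_register_file = {}

def move_engine(instruction, src_name, dst_name):
    """First instruction: general register -> scalar register (e.g. a value to be judged).
    Second instruction: scalar register -> general register (e.g. a scalar to be expanded)."""
    if instruction == "first":
        scalar_register_file[dst_name] = general_register_file[src_name]
    elif instruction == "second":
        general_register_file[dst_name] = scalar_register_file[src_name]
    else:
        raise ValueError("unknown move instruction")

move_engine("first", "v0", "s0")    # the scalar unit can now run a judgment on s0
scalar_register_file["s1"] = 42.0   # result produced by the scalar unit
move_engine("second", "s1", "v1")   # the vector unit can fetch v1 and expand it to a vector
print(general_register_file, scalar_register_file)
```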
  • FIG. 28 is a schematic flowchart of a data loading method provided by an embodiment of the application.
  • the data loading method is based on the above neural network processor 200 loading data, and the data loading method includes:
  • the convolution processing unit 212 with the dedicated register 2122 can refer to the above content, and will not be repeated here.
  • the general register 290 can refer to the above content, which will not be repeated here.
  • the load storage module (LSU) 202 may be used to implement the data loading or transmission.


Abstract

A neural network processor, a chip and an electronic device. The neural network processor includes: a first processing module (210) including a convolution processing unit (212) having a dedicated register (2122); a general register (290) connected to the convolution processing unit (212); and a load storage module (202) connected to the general register (290), the load storage module (202) also being connected to the convolution processing unit (212) through the dedicated register (2122). The load storage module (202) is used for at least one of loading data into the general register (290) and loading data into the dedicated register (2122) of the convolution processing unit (212). The processor can increase the speed at which a neural network processor loads data.

Description

Neural network processor, chip and electronic device

This application claims priority to the Chinese patent application No. 201911253030.2, entitled "Neural network processor, chip and electronic device", filed with the China National Intellectual Property Administration on December 9, 2019, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

This application relates to the field of electronic technology, and in particular to a neural network processor, a chip, and an electronic device.

BACKGROUND

Artificial neural networks (ANN) abstract neuron networks from the perspective of information processing, establish certain simple models, and form different networks according to different connection methods. Such research is usually referred to by terms such as deep learning, computer learning, and the like.

In the related art, the processing units in a neural network processor often interact with a data store, and the transmission speed during data transfer is relatively slow.

SUMMARY

Embodiments of the present application provide a neural network processor, a chip, and an electronic device, which can increase the speed at which the neural network processor loads data.

An embodiment of the present application discloses a neural network processor, including:

a first processing module, the first processing module including a convolution processing unit having a dedicated register;

a general register connected to the convolution processing unit; and

a load storage module connected to the general register, the load storage module also being connected to the convolution processing unit through the dedicated register;

the load storage module being used for at least one of loading data into the general register and loading data into the dedicated register of the convolution processing unit.

An embodiment of the present application further discloses a chip including a neural network processor, the neural network processor being the neural network processor described above.

An embodiment of the present application further discloses an electronic device, including:

a system bus;

an external memory;

a central processing unit; and

a neural network processor, the neural network processor being the neural network processor described above;

wherein the neural network processor is connected to the external memory and the central processing unit through the system bus, and the neural network processor, under the control of the central processing unit, obtains data to be processed from the external memory, processes the data to be processed to obtain a processing result, and feeds the processing result back to the external memory.
BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below.

FIG. 1 is a schematic diagram of the first structure of the neural network processor provided by an embodiment of the present application.

FIG. 2 is a schematic diagram of the second structure of the neural network processor provided by an embodiment of the present application.

FIG. 3 is a schematic diagram of the third structure of the neural network processor provided by an embodiment of the present application.

FIG. 4 is a schematic diagram of the fourth structure of the neural network processor provided by an embodiment of the present application.

FIG. 5 is a schematic diagram of the fifth structure of the neural network processor provided by an embodiment of the present application.

FIG. 6 is a schematic structural diagram of the data storage module provided by an embodiment of the present application.

FIG. 7 is a schematic diagram of the sixth structure of the neural network processor provided by an embodiment of the present application.

FIG. 8 is a schematic diagram of the seventh structure of the neural network processor provided by an embodiment of the present application.

FIG. 9 is a schematic diagram of the eighth structure of the neural network processor provided by an embodiment of the present application.

FIG. 10 is a schematic diagram of the ninth structure of the neural network processor provided by an embodiment of the present application.

FIG. 11 is a schematic diagram of the tenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 12 is a schematic diagram of the first structure of the direct storage access in the neural network processor provided by an embodiment of the present application.

FIG. 13 is a schematic diagram of the second structure of the direct storage access in the neural network processor provided by an embodiment of the present application.

FIG. 14 is a schematic diagram of the eleventh structure of the neural network processor provided by an embodiment of the present application.

FIG. 15 is a schematic diagram of the third structure of the direct storage access in the neural network processor provided by an embodiment of the present application.

FIG. 16 is a schematic diagram of the twelfth structure of the neural network processor provided by an embodiment of the present application.

FIG. 17 is a schematic diagram of the thirteenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 18 is a schematic diagram of the fourteenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 19 is a schematic diagram of the fifteenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 20 is a schematic diagram of the sixteenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 21 is a schematic diagram of the seventeenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 22 is a schematic diagram of the eighteenth structure of the neural network processor provided by an embodiment of the present application.

FIG. 23 is a schematic diagram of the convolution operation of the convolution processing unit in the neural network processor provided by an embodiment of the present application.

FIG. 24 is a schematic structural diagram of the chip provided by an embodiment of the present application.

FIG. 25 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.

FIG. 26 is a schematic flowchart of a data processing method provided by an embodiment of the present application.

FIG. 27 is another schematic flowchart of a data processing method provided by an embodiment of the present application.

FIG. 28 is a schematic flowchart of a data loading method provided by an embodiment of the present application.
DETAILED DESCRIPTION

The technical solutions provided by the embodiments of the present application can be applied to various scenarios in which an input image needs to be processed to obtain a corresponding output image, and the embodiments of the present application are not limited in this respect. For example, the technical solutions provided by the embodiments of the present application can be applied to various scenarios in fields such as computer vision, for example face recognition, image classification, object detection, and semantic segmentation.

Please refer to FIG. 1, which is a schematic diagram of the first structure of the neural network processor provided by an embodiment of the present application. The neural network processor (neural network process unit, NPU) 200 may include a first processing module 210 and an instruction distribution module 220.

The first processing module 210 may include one or more processing units; for example, the first processing module 210 includes a convolution processing unit 212 and a vector processing unit 214. The multiple processing units included in the first processing module 210 of the embodiment of the present application can all process vectors. It should be noted that the embodiment of the present application does not limit the type of data processed by the first processing module 210.

The convolution processing unit 212 may also be called a convolution operation unit or a convolution calculation engine. The convolution processing unit 212 may internally contain multiple multiply-add cells (multiplication add cell, MAC); the number of multiply-add cells may be several thousand. For example, the convolution processing unit 212 may contain 4096 multiply-add cells, which can be divided into 16 cells, each cell being able to compute a vector inner product with a maximum of 256 elements.

The vector processing unit 214 may also be called a vector calculation unit or a single instruction multiple data (SIMD) processing unit. The vector processing unit 214 is an element-level vector calculation engine that can handle conventional arithmetic operations between vectors such as addition, subtraction, multiplication, and division, as well as bit-level logical operations such as AND, OR, NOT, and XOR. It should be noted that the vector processing unit 214 of the embodiment of the present application can also support common activation function operations such as the rectified linear unit (ReLU) and PReLU, and can also support the nonlinear activation functions Sigmoid and Tanh through a table lookup method.

The instruction distribution module 220 may also be called an instruction preprocessing module. The instruction distribution module 220 is connected to the first processing module 210 and may be connected to each processing unit in the first processing module 210, for example to the convolution processing unit 212 and the vector processing unit 214 of the first processing module 210. The instruction distribution module 220 can issue instructions to the first processing module 210, that is, to the processing units of the first processing module 210.

In some embodiments, the instruction distribution module 220 can issue multiple instructions to the first processing module 210 in parallel, for example to the convolution processing unit 212 and the vector processing unit 214; for instance, it can issue multiple instructions to the convolution processing unit 212 and the vector processing unit 214 in parallel within one clock cycle. The embodiment of the present application can thus support multi-issue instruction operation and execute multiple instructions efficiently at the same time, for example with the convolution processing unit 212 and the vector processing unit 214 executing a convolution calculation instruction and a vector calculation instruction respectively. After receiving the instructions, the convolution processing unit 212 and the vector processing unit 214 process the data they each receive according to the instructions to obtain processing results. The embodiment of the present application can thereby improve calculation efficiency, that is, the efficiency with which the NPU processes data.

It can be understood that the processing units corresponding to the multiple instructions issued in parallel by the instruction distribution module 220 have no resource conflicts during execution.

The multiple instructions issued by the instruction distribution module 220 may include fine-grained instructions. The instruction distribution module 220 may issue a fine-grained instruction to the convolution processing unit 212, and after receiving the fine-grained instruction the convolution processing unit 212 can perform one vector inner product operation on the data it has received according to the fine-grained instruction.

It should be understood that the fine-grained instructions issued by the instruction distribution module 220 are not limited to the convolution processing unit 212; the instruction distribution module 220 may also issue fine-grained instructions to the vector processing unit 214 or other processing units of the first processing module 210.

It should also be understood that the instructions that the instruction distribution module 220 can issue are not limited to fine-grained instructions; the embodiment of the present application does not limit the instructions issued by the instruction distribution module 220. It should be noted that the instructions may be of different types, such as calculation-type instructions and control-type instructions, where the calculation-type instructions may include a first calculation instruction, a second calculation instruction, a third calculation instruction, and so on.

The operation corresponding to a fine-grained instruction can be accurate to each clock cycle, unlike a coarse-grained processor where one instruction requires the processor to execute many clock cycles to complete. It can also be understood that fine-grained instructions reflect a finer calculation granularity of the processing units: the convolution processing unit 212 can complete one basic vector inner product operation based on one fine-grained instruction, whereas a coarse-grained processor can complete a matrix multiplication based on one instruction, and a matrix multiplication can be composed of multiple vector inner product operations. It can thus be seen that the embodiment of the present application can support multi-issue fine-grained instruction operation, which improves programming flexibility and generality.

In the embodiment of the present application, the instruction distribution module 220 can issue a first calculation instruction to the convolution processing unit 212 and a second calculation instruction to the vector processing unit 214 in parallel, for example within one clock cycle. The convolution processing unit 212 can perform a vector inner product operation on the data it receives according to the first calculation instruction issued by the instruction distribution module 220, and the vector processing unit 214 performs a vector calculation operation on the data it receives according to the second calculation instruction issued by the instruction distribution module 220.
需要说明的是,第一处理模块210中处理单元并不限于卷积处理单元212和向量处理单元214,或者说第一处理模块210还可以包括其他处理单元。诸如第一处理模块210还整形处理单元等。
请参阅图2,图2为本申请实施例提供的神经网络处理器的第二种结构示意图。本申请实施例提供的神经网络处理器200的第一处理模块210可以包括卷积处理单元214、向量处理单元214和整形处理单元216,卷积处理单元212和向量处理单元214可以参阅图1所示的卷积处理单元212和向量处理单元214,在此不再赘述。整形处理单元也可以称为整形引擎。
整形处理单元216与指令分发模块220连接,指令分发模块220还可以将多条指令并行发射到卷积处理单元212、向量处理单元214以及整形处理单元216。诸如指令分发模块220还可以将多条指令在一个时钟周期内并行发射到卷积处理单元212、向量处理单元214以及整形处理单元216。整形处理单元216根据指令分发模块220所发射的指令诸如第三计算指令对其接收到的数据进行处理。整形处理单元216可以支持常见的Tensor的Reshape操作,比如维度转置,按照一个维度进行切分,数据填充Padding等。
需要说明的是,指令分发模块220的指令发射并不限于第一处理模块210。其他一些实施例中,指令分发模块220还可以将指令发射到其他处理模块。
请参阅图3,图3为本申请实施例提供的神经网络处理器的第三种结构示意图。本申请实施例提供的神经网络处理器200可以包括第一处理模块210、第二处理模块230和指令分发模块220。该第一处理模块210至少包括卷积处理单元212,当然该第一处理模块210也可以包括其他处理单元诸如向量处理单元214、整形处理单元216。卷积处理单元212可以其接收到的数据进行向量内积运算,具有可以参阅以上内容,在此不再赘述。向量处理单元214具体可以参阅以上内容,在此不再赘述。整形处理单元216具体可以参阅以上内容,在此不再赘述。
第二处理模块230可以对标量数据进行处理,第二处理模块230至少包括标量处理单元232(Scalar Process Unit,SPU)。标量处理单元232可以是一个兼容RISC-V指令集的处理单元。其中,标量处理单元232可以包括标量寄存器堆(Scalar Register File,SRF),即标量处理单元232可以包括多个标量寄存器。
指令分发模块220连接第一处理模块210和第二处理模块230,指令分发模块220可以将多条指令并行发射到第一处理模块210和第二处理模块230。诸如指令分发模块220可以将多条指令在一个时钟周期内并行发射到卷积处理单元212和标量处理单元232。
应理解,第一处理模块210还包括其他处理单元时,指令分发模块220还可以将多条指令在一个时钟周期内并行发射到其他处理单元。诸如指令分发模块220将多条指令在一个时钟周期内并行发射到卷积处理单元212、向量处理单元214及标量处理单元232,还比如指令分发模块220将多条指令在一个时钟周期内并行发射到卷积处理单元212、整形处理单元216及标量处理单元232,再比如指令分发模块220将多条指令在一个时钟周期内并行发射到卷积处理单元212、向量处理单元214、整形处理单元216及标量处理单元232。
还应理解,在实际过程中,指令分发模块220所发射的指令并不限于此,指令分发模块220根据神经网络处理器200处理数据的需求可以将不同的指令并行发射到同一个处理模块中的多个处理单元,或者将不同的指令并行发射到不同处理模块中的处理单元。以上仅是本申请实施例所提供的技术方案中指令分发模块220并行发射多条指令的几种举例说明。本申请实施例所提供的技术方案中指令分发模块220发射指令的方式并不限于此。比如:指令分发模块220将多条指令并行发射到标量处理单元232和向量处理单元214。再比如:指令分发模块220将多条指令并行发射到整形处理单元216和向量处理单元214。
标量处理单元232根据指令分发模块220所分发的指令诸如控制指令对其接收到的数据进行处理。标量处理单元232可以接收标量指令,诸如控制指令,标量处理单元232主要负责神经网络处理器200的标量运算。
需要说明的是,标量处理单元232不仅可以从指令分发模块220接收指令,还可以将新的程序计数器(PC)的值传输到指令分发模块220。
请参阅图4,图4为本申请实施例提供的神经网络处理器的第四种结构示意图。标量处理单元232可以执行标量计算指令(加减乘除、逻辑操作)、分支指令(条件判断操作)、跳转指令(函数调用)。当处理分支指令和跳转指令时,标量处理单元232会将新的PC值返回给指令分发模块220,以使得指令分发模块220下一次分发指令时从新的PC来取指。
请参阅图5,图5为本申请实施例提供的神经网络处理器的第五种结构示意图。本申请实施例提供的神经网络处理器200还包括数据存储模块(Buffer,BUF)240,数据存储模块240可以存储数据,诸如图像数据、权重数据等。
数据存储模块240可以与第一处理模块210和第二处理模块230连接。诸如数据存储模块240与标量处理单元232、卷积处理单元212、向量处理单元214及整形处理单元216连接。数据存储模块240与标量处理单元232、卷积处理单元212、向量处理单元214及整形处理单元216均可以传输数据,诸如数据存储模块240与卷积处理单元212、向量处理单元214及整形处理单元216直接传输数据。由此,本申请实施例在数据存储模块240和各个处理单元诸如卷积处理单元212、向量处理单元214之间可以实现数据的直接传输,可以提升NPU200的性能。
第一处理模块210对数据的处理可以是:卷积处理单元212和向量处理单元214在接收到指令分发模块220所并行发射的指令诸如第一计算指令、第二计算指令时,卷积处理单元212和向量处理单元214可以从数据存储模块240读取其所需要处理的数据诸如待处理数据。卷积处理单元212和向量处理单元214对该待处理数据进行处理操作,以得到处理结果,并将该处理结果存储到数据存储模块240。
卷积处理单元212和向量处理单元214处理数据可以是:卷积处理单元212在接收到指令分发模块220所发射的指令诸如第一计算指令时,卷积处理单元212会根据该第一计算指令从数据存储模块240读取其所需要处理的数据诸如待处理数据。卷积处理单元212从数据存储模块240读取到其所需要处理的数据后,卷积处理单元212会根据该第一计算指令执行相应的操作诸如向量内积计算,以得到中间计算结果。卷积处理单元212可以将该中间计算结果存储到数据存储模块240中。向量处理单元214可以从数据存储模块240中获取该中间计算结果、并对该中间计算结果进行第二次计算处理诸如池化操作,以得到处理结果,并将该处理结果存储到数据存储模块240。
数据存储模块240所存储的数据可以是原始数据和权重数据,诸如待处理数据,或者说数据存储模块240所存储的数据是需要至少一个处理单元进行处理诸如运算处理的数据。数据存储模块240所存储的数据也可以是处理结果,或者说数据存储模块240所存储的数据是经过至少一个处理单元对待处理数据进行处理后的数据。需要说明的是,数据存储模块240实际所存储的数据并不限于此,数据存储模块240还可以存储其他数据。
需要说明的是,卷积处理单元212和向量处理单元214处理数据并不限于此,卷积处理单元212和向量处理单元214还可以通过信号线直接相连。
卷积处理单元212和向量处理单元214处理数据还可以是:卷积处理单元212在接收到指令分发模块220所发射的指令诸如第一计算指令时,卷积处理单元212会根据该第一计算指令从数据存储模块240读取其所需要处理的数据诸如待处理数据。卷积处理单元212从数据存储模块240读取到其所需要处理的数据后,卷积处理单元212会根据该第一计算指令执行相应的操作诸如向量内积计算,以得到中间计算结果。卷积处理单元212可以将该中间计算结果传输到向量处理单元214。向量处理单元214对该中间计算结果进行第二次计算处理,诸如池化处理、后续的激活、量化操作,或者是和下一层的操作进行融合、同时处理两层算子的操作,以得到处理结果,并将该处理结果存储到数据存储模块240。
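结合上文"卷积处理单元212将中间计算结果直接传给向量处理单元214"的方式,下面给出一段示意性的Python代码草图,说明卷积结果不回写数据存储模块、直接交给向量单元做ReLU与量化的流水方式;其中的一维卷积形式与量化参数scale、zero_point均为便于说明而假设的简化。

```python
def conv1d_valid(x, w):
    # 卷积处理单元:一维 valid 卷积,得到中间计算结果
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def vector_post(mid, scale=0.5, zero_point=0):
    # 向量处理单元:对中间结果先做 ReLU,再做简单线性量化(scale/zero_point 为假设参数)
    relu = [v if v > 0 else 0 for v in mid]
    return [int(round(v * scale)) + zero_point for v in relu]

x = [1, -2, 3, 4, -1, 2]
w = [1, 0, -1]
mid = conv1d_valid(x, w)   # 中间结果直接传给向量单元,不经过数据存储模块
out = vector_post(mid)     # 处理结果再写回数据存储模块(此处以返回值示意)
print(mid, out)            # [-2, -6, 4, 2] [0, 0, 2, 1]
```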
需要说明的是,卷积处理单元212还可以与第一处理模块210的其他处理单元诸如整形处理单元216通过信号线连接。第一处理模块210对数据的处理还可以是卷积处理单元212处理完成后将其计算得到的中间计算结果直接传输到整形处理单元216或第一处理模块210中的其他处理单元以进行其他计算操作。或者第一处理模块210对数据的处理还可以是卷积处理单元212处理完成后将其计算得到的中间计算结果存储到数据存储模块240,再由整形处理单元216或第一处理模块210中的其他处理单元从数据存储模块240获取该中间计算结果、并对该中间计算结果做进一步的处理操作诸如整形操作,以得到处理结果。整形处理单元216或第一处理模块210中的其他处理单元将该处理结果存储到数据存储模块240。
应理解,在第一处理模块210的各个处理单元在相互传输数据进行处理过程中,可以不将中间计算结果存储到数据存储模块240,数据存储模块240可以存储原始数据和权重,而不存储中间计算结果。不仅可以节省数据存储模块240的存储空间,还可以减少数据存储模块240的访问,降低功耗,提升神经网络处理器200的性能。
还应理解的是,本申请实施例第一处理模块210其他处理单元之间对数据处理的方式可以类比以上第一处理模块210中卷积处理单元212和向量处理单元214的方式。本申请实施例第一处理模块210其他处理单元之间对数据处理的方式在此不再一一举例说明。
本申请实施例数据存储模块240可以存储计算结果。在多个处理单元的运算过程中,可以做到零fallback到外部存储器,即可以不需要将上一个算子的计算结果fallback到外部存储,对SoC的带宽需求比较低,从而节省了系统带宽,减少了算子间的计算延迟。
在一些实施例中,该数据存储模块240可以是共享的存储模块。该数据存储模块240可以具有多个并行访问的Bank,诸如三个、四个等,可以根据实际需要对其进行灵活的划分。
请参阅图6,图6为本申请实施例提供的数据存储模块的结构示意图,数据存储模块240包括至少两个数据存储单元241和至少两个地址译码单元242。其中,地址译码单元242的数量不大于数据存储单元241的数量,例如,数据存储单元241的数量为四个,地址译码单元242的数量为四个。每一个地址译码单元242包括四个输出端口,每一个输出端口对应一个数据存储单元241。四个数据存储单元241诸如:数据存储单元a、数据存储单元b、数据存储单元c和数据存储单元d;四个地址译码单元242诸如:地址译码单元a、地址译码单元b、地址译码单元c和地址译码单元d。
每一个地址译码单元242均与四个数据存储单元241连接,一个地址译码单元242包括四个输出端口,一个地址译码单元242的输出端口的数量与数据存储模块240内的数据存储单元241的数量相等,也即,一个地址译码单元242的每一个输出端口对应一个数据存储单元241,例如,地址译码单元a的第一输出端口与数据存储单元a对应,第二输出端口与数据存储单元b对应,第三输出端口与数据存储单元c对应,第四输出端口与数据存储单元d对应。
一个输出端口所输出的数据可以用于存储到与输出端口对应的一个数据存储单元内。例如:将地址译码单元a与数据存储单元a对应的第一输出端口输出的数据,地址译码单元b与数据存储单元a对应的第一输出端口输出的数据,地址译码单元c与数据存储单元a对应的第一输出端口输出的数据,地址译码单元d与数据存储单元a对应的第一输出端口输出的数据均存储到数据存储单元a中,因此,可以实现每一个地址译码单元内的数据可以存储到任意一个数据存储单元241中,从而可以实现数据存储单元241之间的共享。
一个输出端口用于输出一类数据,同一个地址译码单元242的四个输出端口对应的数据类型不同,例如,一个地址译码单元242的第一输出端口用于输出特征图,第二输出端口用于输出特征参数。
每一个地址译码单元242还包括三个输入端口,三个输入端口分别用于接收外部端口(port)传输的信号、数据和地址信息。每一个地址译码单元242根据接收到的信号、数据以及地址信息编译形成四个数据。
地址译码单元242的数量与外部端口的数量一致,例如,当外部端口的数量为四个,对应的地址译码单元242的数量为四个,外部端口传输的数据可以通过地址译码单元242存储至任意一个数据存储单元241内,实现数据存储模块240内的资源共享。外部端口可以为处理单元的端口,也可以为数据总线的端口,只要可以实现向数据存储单元存储数据以及从数据存储单元读取数据的端口,均是本申请实施例保护的范围。
数据存储模块240还包括至少两个数据合并单元243,诸如四个。每一个数据合并单元243包括至少两个数据输入端和一个数据输出端,每一数据合并单元243通过至少两个数据输入端接收与一数据存储单元241对应的所有数据,并将所有数据处理后存储到与该数据对应的数据存储单元241内,可以实现数据存储模块240有规律的处理数据,可以提高数据处理的效率,同时,也可以避免数据存储混乱的现象发生。
每一个数据合并单元243对应一个数据存储单元241,每一数据合并单元243的一个数据输入端连接与一数据存储单元241对应的所有地址译码单元242的输出端口,也即,一个数据合并单元243连接所有的地址译码单元242,通过一个数据合并单元243处理多个地址译码单元242的数据,可以提高数据存储的效率。
数据合并单元243采用按位或操作来合并数据,按位或是双目运算,只要对应的两个二进制位有一个为1时,结果位就为1。按位或运算逻辑比较简单,运算速度较快,可以提高数据合并单元243的处理效率,进而可以提高数据存储模块240的存储效率。
一个数据合并单元243与一个数据存储单元241对应,例如,数据合并单元a与数据存储单元a对应,数据合并单元b与数据存储单元b对应,地址译码单元a译码形成的一个数据传输至与数据存储单元a对应的数据合并单元a进行处理,处理后的数据可以传输至数据存储单元a进行存储。可以实现数据存储模块240快速、高效的存储数据。
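为说明地址译码单元242与数据合并单元243如何配合,使得多个外部端口的数据可以写入任意一个数据存储单元241,下面给出一段示意性的Python代码草图:每个译码单元按地址选择目标存储单元,对应的合并单元把送往同一目标的多路数据按位或后写入;存储单元个数、深度与地址划分方式均为假设的示例,并非具体硬件实现。

```python
NUM_BANKS = 4          # 假设 4 个数据存储单元(bank)
BANK_DEPTH = 16        # 每个存储单元的深度(假设值)
banks = [[0] * BANK_DEPTH for _ in range(NUM_BANKS)]

def address_decode(addr, data):
    # 地址译码单元:根据地址选择目标存储单元,产生"每个 bank 一路"的输出
    bank_id, offset = addr // BANK_DEPTH, addr % BANK_DEPTH
    outputs = [0] * NUM_BANKS          # 未命中的端口输出 0,便于后级按位或
    outputs[bank_id] = data
    return outputs, offset

def merge_and_store(decoded_list, offset, bank_id):
    # 数据合并单元:把所有译码单元送往同一 bank 的数据按位或后写入该 bank
    merged = 0
    for outputs in decoded_list:
        merged |= outputs[bank_id]
    banks[bank_id][offset] = merged

# 两个外部端口经各自的译码单元向 bank1 的同一地址写入(示例)
d0, off = address_decode(17, 0b0011)
d1, _   = address_decode(17, 0b0100)
merge_and_store([d0, d1], off, bank_id=1)
print(bin(banks[1][off]))   # 0b111
```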
还需要说明的是,第二处理模块230诸如标量处理单元232所需要处理的数据可以不从数据存储模块240中获取,标量处理单元232所需要处理的数据可以由其接收到的指令携带或者由其他方式传输。
请参阅图7,图7为本申请实施例提供的神经网络处理器的第六种结构示意图。图7所示的神经网络处理器与图5所示的神经网络处理器的区别在于:图7中的第二处理模块230诸如标量处理单元232与指令分发模块220连接、且图7中的第二处理模块230诸如标量处理单元232与数据存储模块240不连接。而图5中的第二处理模块230诸如标量处理单元232与指令分发模块220连接、且图5中的第二处理模块230诸如标量处理单元232与数据存储模块240连接。图7中的第二处理模块230诸如标量处理单元232所需要处理的数据可以由其接收到的指令携带,或者说图7中的第二处理模块230诸如标量处理单元232所需要处理的数据可以由指令分发模块220所分发的指令携带。本申请实施例还可以给第二处理模块230诸如标量处理单元232设置一个单独的数据存储模块。
需要说明的是,数据存储模块240还可以与指令分发模块220连接,指令分发模块220根据数据存储模块240是否存储有待处理数据确定是否发射指令。
请参阅图8,图8为本申请实施例提供的神经网络处理器的第七种结构示意图。指令分发模块220与数据存储模块240连接,指令分发模块220可以向数据存储模块240发送一个索引,数据存储模块240根据指令分发模块220所发送的索引返回一个信号。当数据存储模块240中存储有待处理数据时,数据存储模块240向指令分发模块220返回存储有待处理数据的信号,诸如“1”。当数据存储模块240中未存储有待处理数据时,数据存储模块240向指令分发模块220返回未存储有待处理数据的信号,诸如“0”。
指令分发模块220根据其接收到的不同的返回信号做出不同的动作。诸如若指令分发模块220接收到“1”时,指令分发模块220确定出数据存储模块240存储有待处理数据,进而指令分发模块220将多条指令并行发射出去。若指令分发模块220接收到“0”时,指令分发模块220确定出数据存储模块240未存储有待处理数据,此时指令分发模块220不发射指令。从而可以避免不必要的指令分发,节省功耗。
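下面用一段示意性的Python代码草图描述"先查询数据存储模块是否有待处理数据、再决定是否并行发射指令"的门控逻辑;索引与返回信号的编码("1"/"0")按上文约定,类名与接口均为便于说明而假设。

```python
class DataBuffer:
    def __init__(self):
        self.pending = {}                 # index -> 待处理数据

    def query(self, index):
        # 收到指令分发模块发送的索引后,返回 1(有待处理数据)或 0(没有)
        return 1 if index in self.pending else 0

class Dispatcher:
    def __init__(self, buf):
        self.buf = buf

    def try_issue(self, index, instructions):
        if self.buf.query(index) == 1:
            # 有待处理数据:在同一个周期内把多条指令并行发射出去(此处以列表返回示意)
            return instructions
        return []                         # 无待处理数据:不发射,避免无效功耗

buf = DataBuffer()
buf.pending[3] = "image tile"
disp = Dispatcher(buf)
print(disp.try_issue(3, ["CONV", "VEC"]))  # ['CONV', 'VEC']
print(disp.try_issue(5, ["CONV", "VEC"]))  # []
```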
请参阅图9,图9为本申请实施例提供的神经网络处理器的第八种结构示意图。本申请实施例提供的神经网络处理器200还可以包括指令存储模块250,指令存储模块250也可以称为指令缓存(Instruction Cache,ICache)。指令存储模块250可以存储一些细粒度的指令,诸如计算指令、控制指令等。或者说指令存储模块250用于存储NPU的指令。需要说明的是,指令存储模块250所存储的指令还可以为其他指令。指令存储模块250与指令分发模块220连接,指令存储模块250可以将其所存储的指令发送到指令分发模块220。或者说指令分发模块220可以从指令存储模块250获取多条指令。
指令分发模块220从指令存储模块250获取指令的过程可以是:指令分发模块220向指令存储模块250发送取指请求,当在指令存储模块250中找到与该取指请求相对应的指令时,即为指令命中,指令存储模块250响应该取指请求将与该取指请求相对应的指令发送至指令分发模块220。而当在指令存储模块250中未找到与该取指请求相对应的指令时,称为指令缺失,指令存储模块250会暂停(Hold)响应取指请求,同时指令存储模块250会发送指令获取请求以等待指令返回到指令存储模块250,然后指令存储模块250再响应取指请求将与取指请求相对应的指令发送至指令分发模块220。
指令分发模块220从指令存储模块250获取指令的过程可以理解为:当指令分发模块220所需要的指令已经存储在指令存储模块250时,指令分发模块220可以直接从指令存储模块250获取。而当指令分发模块220所需要的指令至少有一条不在指令存储模块250时,需要通过指令存储模块250从其他位置诸如外部存储器读取指令分发模块220所需要的指令、并将该指令返回给指令分发模块220。
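结合上文的取指过程,下面给出一段示意性的Python代码草图,描述指令命中与指令缺失两种情况;外部存储器的读取以回调函数代替,属于假设的简化接口,并非具体实现。

```python
class InstructionCache:
    def __init__(self, backing_fetch):
        self.lines = {}                    # pc -> 指令
        self.backing_fetch = backing_fetch # 缺失时从外部存储器取指的回调(假设接口)

    def fetch(self, pc):
        if pc in self.lines:               # 指令命中:直接返回给指令分发模块
            return self.lines[pc]
        # 指令缺失:先向外部发出指令获取请求,等指令返回后再响应取指请求
        inst = self.backing_fetch(pc)
        self.lines[pc] = inst
        return inst

external_memory = {0: "CONV r1, r2", 4: "VADD v1, v2"}
icache = InstructionCache(lambda pc: external_memory[pc])
print(icache.fetch(0))    # 第一次缺失,从外部读取
print(icache.fetch(0))    # 第二次命中,直接返回
```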
需要说明的是,本申请实施例指令分发模块220和指令存储模块250可以是相互分离的两个部分。当然,指令分发模块220和指令存储模块250也可以组成一个指令预处理模块,或者说指令分发模块220可以和指令存储模块250为指令预处理模块中的两个部分。
还需要说明的是,指令存储模块250所存储的每一条指令都具有相应的类型,指令分发模块220可以基于指令的类型对多条指令进行发射。诸如指令分发模块220将第一类型的指令发射到卷积处理单元212,指令分发模块220将第二类型的指令发射到标量处理单元232。指令的类型诸如为跳转指令、分支指令、卷积计算指令、向量计算指令、整形计算指令等。
本申请实施例指令存储模块250并不限于仅存储NPU200的一部分指令。本申请实施例的指令存储模块250还可以存储NPU200的所有指令,指令存储模块250可以称为指令存储器(Instruction RAM,IRAM),或者称为程序存储器。上层软件诸如外部处理器可以直接将程序写入到IRAM中。
请参阅图10,图10为本申请实施例提供的神经网络处理器的第九种结构示意图。神经网络处理单元200还可以包括数据搬移模块260、指令搬移模块270和系统总线接口280。
系统总线接口280与系统总线连接,该系统总线可以是电子设备诸如智能手机的系统总线。系统总线接口280与系统总线连接可以实现与其他处理器、外部存储器之间进行数据传输。系统总线接口280可以将内部读写请求转换成符合总线接口协议诸如先进可拓展接口(Advanced extensible interface,AXI)协议的总线读写请求。
数据搬移模块260连接系统总线接口280和数据存储模块240,数据搬移模块260用于搬移数据,可以将外部数据搬移到数据存储模块240,也可以将数据存储模块240的数据搬移到外部。诸如数据搬移模块260通过系统总线接口280从系统总线读取数据、并将其所读取的数据写入到数据存储模块240。数据搬移模块260还可以将数据存储模块240所存储的数据或者说是处理结果传输到外部存储器,诸如数据搬移模块260将第一处理模块210中各个处理单元的处理结果传输外部存储器。即数据搬移模块260通过系统总线接口280可以实现内部数据和外部存储之间进行数据搬移。
其中,数据搬移模块260可以为直接存储访问(Direct memory access,DMA),DMA可将数据从一个地址空间搬移到另一个地址空间。数据搬移的地址空间可以是内部存储器,也可以是外设接口。通常提前在RAM上存储控制DMA数据搬移的描述符,描述符包括源端地址空间、目的端地址空间、数据长度等信息。软件对DMA进行初始化,数据开始搬移,这个搬移的过程可以脱离NPU独立进行,提高NPU的效率,减小NPU负担。
指令搬移模块270连接系统总线接口280和指令存储模块250,指令搬移模块270用于搬移指令,或者说指令搬移模块270用于读取指令,以将外部的指令搬移到指令存储模块250。诸如指令搬移模块270通过系统总线接口280从系统总线读取指令、并将其所读取的指令存储到指令存储模块250。当指令存储模块250的指令缺失时,指令存储模块250会请求指令搬移模块270向系统总线接口280发出读指令请求,以读取相应的指令并存储到指令存储模块250。指令搬移模块270可以为直接存储访问。当然,指令存储模块250也可以通过指令搬移模块270直接将所有的指令写入到指令存储模块250。
请参阅图11,图11为本申请实施例提供的神经网络处理器的第十种结构示意图,图11示出了指令存储模块250与系统总线接口280连接,外部存储器可以直接将程序或者说神经网络处理器200所需的指令存储到指令存储模块250。
需要说明的是,当指令存储模块250为IRAM时,本申请实施例还可以将指令存储模块250通过其他接口与外部存储器连接。以便于外部处理器可以将指令或者说程序直接写入到指令存储模块250,或者说是指令的初始化。
因此,本申请实施例数据搬移模块260和指令搬移模块270是两个相互分开的单元模块,分别实现数据和指令的传输,或者说搬移。或者说本申请实施例需要设置两个DMA来实现数据和指令的搬移。数据搬移模块260需要设置一个或多个逻辑通道,指令搬移模块270需要设置一个或多个物理通道。在此以指令搬移模块270为例进行说明。
诸如本申请实施例数据搬移模块260可以为一个单独的DMA,在此定义为DMA1;指令搬移模块270可以为一个单独的DMA,在此定义为DMA2。即DMA1可以进行数据的搬移,DMA2可以进行指令的搬移。
请参阅图12,图12为本申请实施例提供的神经网络处理器中直接存储访问的第一种结构示意图。图12所示的DMA260a相当于为数据搬移模块260的部分结构示意图。DMA260a包括多个逻辑通道262a和一个仲裁单元264a。多个逻辑通道262a均与仲裁单元264a连接,仲裁单元264a可通过系统总线接口连接系统总线。需要说明的是,仲裁单元264a也可以通过其他接口连接外设和存储中的至少一者。
其中,逻辑通道262a的个数可以为h个,h为大于1的自然数,即逻辑通道262a可以至少为两个。每个逻辑通道262a都可以接收数据搬移请求诸如请求1、请求2、请求f,并基于数据搬移请求进行数据搬移操作。
每个DMA260a的逻辑通道262a可以完成描述符生成、解析、控制等功能,具体情况根据命令请求(request)的构成来确定。当多个逻辑通道262a同时接收到数据搬移的请求时,通过仲裁单元264a可以选出一个请求,进入读请求队列266a和写请求队列268a,等待数据搬移。
逻辑通道262a需要软件介入,由软件提前配置描述符或者寄存器,完成初始化来进行数据搬移。DMA260a所有逻辑通道262a对于软件都是可见的,由软件来进行调度。而有些业务场景,如内部引擎诸如指令分发模块(或者说指令预处理模块)自主进行数据搬移时,不需要经过软件来调度,则不能使用此类DMA260a的逻辑通道262a。因此不方便根据业务需求灵活地移植,过于依赖软件调度。
基于此,本申请实施例还提供一种DMA以实现不同搬移的需求。
请参阅图13,图13为本申请实施例提供的神经网络处理器中直接存储访问的第二种结构示意图。图13所示的直接存储访问260b在功能上相当于指令搬移模块270和数据搬移模块260,或者说图13所示的直接存储访问260b将指令搬移模块270和数据搬移模块260的功能合并在一起。直接存储访问260b可包括至少一个逻辑通道261b和至少一个物理通道262b,至少一个逻辑通道261b和至少一个物理通道262b并行,也可以理解为至少一个逻辑通道261b和至少一个物理通道262b共同连接到同一个接口。从而至少一个物理通道262b和至少一个逻辑通道261b可以并行地搬移指令和数据。由于物理通道262b对指令的搬移由内部引擎诸如指令分发模块自主发出请求,不需要经过上层软件的调度,从而可以降低整个DMA260b对软件调度的依赖,更加方便搬移数据,进而更加方便根据业务需求灵活地搬移数据。可以理解的是,本申请实施例采用一个DMA260b即可实现指令和数据的搬移,还可以节省单元模块的个数。
其中逻辑通道261b可以响应于上层软件调度的搬移请求进行数据搬移。该上层软件可以是可编程单元,诸如中央处理器(CPU)。
其中逻辑通道261b的个数可以为n个,n可以为大于或等于1的自然数。诸如逻辑通道261b的个数为一个、两个、三个等。需要说明的是,逻辑通道261b的实际个数可根据实际产品的需求进行设置。
其中物理通道262b可以响应于内部引擎的搬移请求进行数据搬移,该内部引擎可以是NPU的指令分发模块,或者说是指令预处理模块。
其中物理通道262b的个数可以为m个,m可以为大于或等于1的自然数。诸如物理通道262b的个数为一个、两个、三个等。需要说明的是,物理通道262b的实际个数可根据实际产品的需求进行设置。在一些实施例中,逻辑通道261b的个数可以为两个,物理通道262b的个数可以为一个。
请继续参阅图13,DMA260b还可以包括第一仲裁单元263b,该第一仲裁单元263b与系统总线接口连接。
请结合图14,图14为本申请实施例提供的神经网络处理器的第十一种结构示意图。第一仲裁单元263b与系统总线接口264b连接,可以理解的是,该系统总线接口264b可等同于系统总线接口280。第一仲裁单元263b可通过该系统总线接口264b与系统总线连接,该第一仲裁单元263b还分别与所有的物理通道262b以及所有的逻辑通道261b连接,以便于逻辑通道261b、物理通道262b从系统总线搬移数据和指令。当多个通道同时发起读/写请求时,第一仲裁单元263b可以仲裁出一个读/写请求发送给系统总线接口264b。诸如一个逻辑通道261b和一个物理通道262b同时发起读/写请求时,第一仲裁单元263b可以仲裁出一个物理通道262b的读/写请求发送给系统总线接口264b,或者第一仲裁单元263b可以仲裁出一个逻辑通道261b的读/写请求发送给系统总线接口264b。
其中,系统总线接口264b可以设置在DMA260b外。需要说明的是,系统总线接口264b也可以设置在DMA260b内,即系统总线接口264b可以是DMA260b的一部分。
在一些实施例中,第一仲裁单元263b可以对至少一个物理通道262b和至少一个逻辑通道261b的带宽进行重新分配。
在一些实施例中,逻辑通道261b可以包括逻辑通道接口2612b、描述符控制模块2614b和数据传输模块2616b。逻辑通道接口2612b可以与数据存储模块诸如图5所示的数据存储模块240连接,逻辑通道接口2612b、描述符控制模块2614b及数据传输模块2616b依次连接,数据传输模块2616b还与第一仲裁单元263b连接,以通过系统总线接口264b与系统总线连接。
逻辑通道接口2612b可以由上层软件下发命令的格式所确定,逻辑通道接口2612b可以含有描述符的地址。描述符控制模块2614b根据上层软件下发的命令来索引描述符,解析数据源端地址、目的端地址、数据长度等信息,对DMA260b的传输数据模块2616b发起读写数据命令。数据传输模块2616b接收上一级(描述符控制模块2614b)的读写数据命令,将该读写数据命令转换成所需信号,可以先读后写,完成数据搬移,返回响应给描述符控制模块2614b。
逻辑通道261b搬移数据的具体过程如下:
配置DMA260b的控制状态寄存器(Control Status Register,CSR)269b。需要说明的是,DMA260b搬移数据需要满足几个条件:数据从哪里传(源地址),数据传到哪里去(目的地址),数据在什么时间传输(触发源,或者说触发信号)。需要将DMA260b的各种参数、条件配置完成才能够实现数据的搬移。可以采用上层软件设置源地址、目的地址以及触发源。
实际应用中,可以将DMA260b的各种参数及条件设置在控制状态寄存器269b中,或者说将DMA260b的配置信息以及参数,如工作模式、仲裁优先级、接口信息等设置在控制状态寄存器269b中。在一些实施例中,比如在控制状态寄存器269b中设置外设寄存器的地址、数据存储器的地址、需要传输数据的数据量、各个通道之间的优先级、数据传输的方向、循环模式、外设和存储器的增量模式、外设和存储器的数据宽度等。
上层软件对DMA260b的逻辑通道261b下发数据搬移命令至逻辑通道接口2612b,或者说上层软件对DMA260b的逻辑通道261b下发数据搬移请求至逻辑通道接口2612b,可编程单元对DMA260b的逻辑通道261b下发数据搬移命令时一并携带有描述符的地址,或者直接携带有描述符。并通过逻辑通道接口2612b将该描述符的地址或者描述符传输到描述符控制模块2614b。
描述符控制单元2614b若接收到的是描述符的地址,则描述符控制单元2614b根据该描述符的地址读取描述符。即索引描述符。然后基于该描述符进行解析,即生成数据搬移所需信息,如数据源端地址空间、目的端地址空间、数据长度等。而当描述符控制单元2614b接收到的是描述符,则描述符控制单元2614b直接解析描述符。
描述符控制单元2614b解析完成描述符后,数据传输模块2616b可遵循先读后写的原则,将描述符控制单元2614b解析描述符所生成的信息转换成系统总线接口264b所需传输的信号,并传输至第一仲裁单元263b。
第一仲裁单元263b在接收到多个逻辑通道261b同时发起的读/写请求时,可以仲裁出一个发送到系统总线接口264b。
而当第一仲裁单元263b同时接收到有来自逻辑通道261b发起的读/写请求以及来自物理通道262b发起的读/写请求时,第一仲裁单元263b同样可以仲裁出一个发送到系统总线接口264b,并通过系统总线接口264b传输到系统总线。
DMA260b的读/写请求传输到系统总线后,系统总线完成读写命令,源端地址空间的数据写入目的端地址空间。从而完成数据搬移。
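结合上文逻辑通道261b搬移数据的过程,下面给出一段示意性的Python代码草图:上层软件下发携带描述符地址的命令,描述符控制模块索引并解析出源端地址、目的端地址与数据长度,数据传输模块按"先读后写"完成搬移;地址空间用列表模拟,描述符字段名称均为便于说明而假设,并非具体寄存器定义。

```python
# 用一段连续的列表模拟系统地址空间(假设共 64 个字)
memory = list(range(64))

# 描述符:源端地址、目的端地址、数据长度(字段为示意性假设)
descriptors = {0x20: {"src": 0, "dst": 32, "len": 8}}

def parse_descriptor(desc_addr):
    # 描述符控制模块:根据描述符地址索引并解析描述符
    return descriptors[desc_addr]

def data_transfer(desc):
    # 数据传输模块:遵循先读后写,把源端数据搬到目的端
    for i in range(desc["len"]):
        word = memory[desc["src"] + i]     # 先读
        memory[desc["dst"] + i] = word     # 后写

def logical_channel_move(desc_addr):
    desc = parse_descriptor(desc_addr)
    data_transfer(desc)
    return "done"                          # 返回响应给描述符控制模块

logical_channel_move(0x20)
print(memory[32:40])    # [0, 1, 2, 3, 4, 5, 6, 7]
```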
其中,物理通道262b可以通过接口与内部引擎诸如指令分发模块连接,该接口可以包含进行指令搬移的配置和参数。当然该物理通道262b进行指令搬移的配置和参数也可以由控制状态寄存器269b来配置。
需要说明的是,DMA260b还可以通过其他结构与其他部件实现连接,以实现数据的搬移。
请继续参阅图15和图16,图15为本申请实施例提供的神经网络处理器中直接存储访问的第三种结构示意图,图16为本申请实施例提供的神经网络处理器的第十二种结构示意图。DMA260b还可以包括第二仲裁单元265b,第二仲裁单元265b可以与存储接口266b连接。该存储接口266b可以连接存储模块(memory,或者为BUF),该存储模块与DMA260b可以位于同一个NPU中,该存储模块与DMA260b也可以不位于同一个NPU中。诸如DMA260b位于NPU中,存储模块可以位于NPU中,存储模块也可以位于其它器件中。第二仲裁单元265b可以与每一个逻辑通道261b连接,且第一仲裁单元263b和第二仲裁单元265b连接同一个逻辑通道261b时,可以由一选择器连接到同一个逻辑通道261b。存储接口266b可以设置在DMA260b外,也可以设置在DMA260b内。
请继续参阅图15和图16,DMA260b还可以包括第三仲裁单元267b和外设接口268b,第三仲裁单元267b与外设接口268b连接,外设接口268b可以连接外部设备,该外部设备与DMA260b位于不同的器件中,诸如DMA260b位于NPU中,外部设备为CPU等。第三仲裁单元267b可以与每一个逻辑通道261b连接,且第一仲裁单元263b和第三仲裁单元267b连接同一个逻辑通道261b时,可以由一选择器连接到同一个逻辑通道261b。其中,外设接口268b可以设置在DMA260b外,也可以设置在DMA260b内。
请继续参阅图15和图16,本申请实施例DMA260b还可以同时包括第一仲裁单元263b、第二仲裁单元265b及第三仲裁单元267b。第一仲裁单元263b连接系统总线接口264b,第二仲裁单元265b连接存储接口266b,第三仲裁单元267b连接外设接口268b,第一仲裁单元263b、第二仲裁单元265b及第三仲裁单元267b均可以与逻辑通道261b连接,当第一仲裁单元263b、第二仲裁单元265b及第三仲裁单元267b连接有一个逻辑通道261b时,可以由一个选择器连接在一个逻辑通道261b和三个仲裁单元之间。
需要说明的是,本申请实施例还可以设置其他仲裁单元以通过其他接口连接其他元件。
请参阅图17,图17为本申请实施例提供的神经网络处理器的第十三种结构示意图。图17示出了图13或图15的直接存储访问260b与神经网络处理器200其他元件的一种连接关系。该直接存储访问260b连接系统总线接口280、指令存储模块250及数据存储模块240,直接存储访问260b可以通过系统总线接口280将数据搬移到数据存储模块240,直接存储访问260b可以通过系统总线接口280将指令搬移到指令存储模块250,直接存储访问260b还可以通过系统总线接口280将数据存储模块240所存储的数据搬移到外部存储器。
本申请实施例神经网络处理器200中第一处理模块210的数据可以直接存储到数据存储模块240,数据存储模块240的数据也可以加载到第一处理模块210,从而使得程序比较精简。然而,为了加快数据的存取速度,本申请实施例还可以在数据存储模块240和第一处理模块210之间增加通用寄存器。下面结合附图对具有通用寄存器的神经网络处理器进行详细说明。
请参阅图18,图18为本申请实施例提供的神经网络处理器的第十四种结构示意图。神经网络处理器200还可以包括通用寄存器290和加载存储模块202。
通用寄存器290与第一处理模块210连接,通用寄存器290可以与第一处理模块210中的所有的处理单元连接。诸如通用寄存器290与第一处理模块210的卷积处理单元212、向量处理单元214连接。卷积处理单元212和向量处理单元214均可以从通用寄存器290中获取所需的数据。当然,卷积处理单元212和向量处理单元214也均可以将各自的处理结果存储到通用寄存器290。需要说明的是,图18所示出的第一处理模块210中处理单元的个数并不限于此,诸如第一处理模块210还包括整形处理单元。
其中,通用寄存器290可以包括多个寄存器,诸如通用寄存器290包括多个向量寄存器292。再比如通用寄存器290包括多个预测寄存器294。还比如通用寄存器290包括多个向量寄存器292和多个预测寄存器294。其中,多个向量寄存器292可以简称为向量寄存器堆(Vector Register File,VRF)。其中,多个预测寄存器294可以简称为预测寄存器堆(Predicate Register File,PRF),预测寄存器也可以称为谓词寄存器。通用寄存器290中各个寄存器的类型及个数可以根据实际需求设置。以提高软件编程的灵活性。
其中,卷积处理单元212可以具有专用寄存器2122,该专用寄存器2122可以存储数据,诸如卷积处理单元212的专用寄存器2122为两个,分别为第一专用寄存器和第二专用寄存器,第一专用寄存器可以存储图像数据,第二专用寄存器可以存储权重。
加载存储模块(Load Store Unit,LSU)202与通用寄存器290连接,加载存储模块202可以将数据加载到通用寄存器290,便于第一处理模块210的各个处理单元从通用寄存器290获取数据。加载存储模块202还可以与卷积处理单元212的专用寄存器2122连接,加载存储模块202还可以将数据直接加载到卷积处理单元212的专用寄存器2122,以便于卷积处理单元212对数据进行处理,诸如卷积处理。从而可以提高加载数据的速度。
需要说明的是,图18仅示出神经网络处理器200的部分元件,图18所示神经网络处理器200的其他元件可以参阅图1至图17,为了详细说明本申请实施例加载存储模块202和通用寄存器290与神经网络处理器200其他元件的关系,下面结合图19进行详细说明。
请参阅图19,图19为本申请实施例提供的神经网络处理器的第十五种结构示意图。加载存储模块(Load Store Unit,LSU)202连接通用寄存器290和数据存储模块240。加载存储模块202可以将数据存储模块240的数据加载到通用寄存器290,第一处理模块210的处理单元诸如卷积处理单元212、向量处理单元214、整形处理单元216可以根据指令从通用寄存器290中获取所需要处理的数据。通用寄存器290可以与多个处理单元连接,诸如通用寄存器290与卷积处理单元212连接,且通用寄存器290还与向量处理单元214和整形处理单元216中的至少一者连接。从而,卷积处理单元212、向量处理单元214、整形处理单元216均可以从通用寄存器290获取所需要处理的数据。
卷积处理单元212、向量处理单元214、整形处理单元216也均可以将各自的处理结果存储到通用寄存器290。进而加载存储模块202可以将通用寄存器290中的处理结果存储到数据存储模块240,数据存储模块240可以通过直接存储访问或数据搬移模块260将处理结果传输到外部存储器中。
需要说明的是,本申请实施例第二处理模块230诸如标量处理单元232不与通用寄存器290连接,如上所述,本申请实施例标量处理单元232所需要处理的数据可以由其所接收的指令携带。本申请实施例标量处理单元232也可以与数据存储模块240连接,以从数据存储模块240获取所需处理的数据。
本申请实施例加载存储模块202不仅可以将数据存储模块240的数据存储到通用寄存器290,还可以将数据加载到其它位置。比如加载存储模块202还直接与卷积处理单元212连接,该加载存储模块202直接与卷积处理单元212连接可以理解为加载存储模块202和卷积处理单元212之间未连接有如上所述的通用寄存器290。加载存储模块202与卷积处理单元212连接可以理解为加载存储模块202与卷积处理单元212的专用寄存器2122连接,诸如加载存储模块202与卷积处理单元212中的其中一个专用寄存器2122连接,加载存储模块202可以直接将数据存储模块240的数据诸如权重加载到卷积处理单元212的其中一个专用寄存器2122。可以理解的是,加载存储模块202也可以直接将其他数据诸如图像数据加载到卷积处理单元212的其中一个专用寄存器2122。
由此本申请实施例加载存储模块202可以将数据存储模块240的数据直接加载到卷积处理单元212,加载存储模块202还可以将数据存储模块240的数据存储到通用寄存器290,第一处理模块210的处理单元诸如卷积处理单元212可以基于其接收到的指令从通用寄存器290获取相应的数据。诸如加载存储模块202可以将第一数据直接加载到卷积处理单元212,加载存储模块202可以将第二数据存储到通用寄存器290,卷积处理单元212可以从通用寄存器290中获取到该第二数据。第一数据和第二数据的类型可以不同,诸如第一数据为权重,第二数据为图像数据。从而本申请实施例卷积处理单元212可以从不同的通路接收到所需要处理的数据,相比卷积处理单元212从同一通路接收到所需处理的数据可以加快数据加载的速度,进而可以提高神经网络处理器200的运算速率。而且,本申请实施例可以简化指令集,使得其易于实现。同时,本申请实施例还更加容易优化编译器。
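为说明上述两条加载通路——第一数据(如权重)直接加载到卷积处理单元212的专用寄存器2122、第二数据(如图像数据)先加载到通用寄存器290再由卷积处理单元读取——下面给出一段示意性的Python代码草图;寄存器以字典模拟,寄存器名称均为便于说明而假设。

```python
data_buffer = {"weight": [1, 2, 3], "image": [4, 5, 6]}   # 数据存储模块中的数据(示例)

dedicated_regs = {}    # 卷积处理单元的专用寄存器(模拟)
general_regs = {}      # 通用寄存器(模拟)

def lsu_load(first_key, second_key):
    # 通路一:第一数据(如权重)直接加载到专用寄存器
    dedicated_regs["weight_reg"] = data_buffer[first_key]
    # 通路二:第二数据(如图像数据)先加载到通用寄存器
    general_regs["v0"] = data_buffer[second_key]

def conv_unit_compute():
    # 卷积处理单元分别从两条通路取数,做一次向量内积
    w = dedicated_regs["weight_reg"]
    x = general_regs["v0"]
    return sum(a * b for a, b in zip(w, x))

lsu_load("weight", "image")
print(conv_unit_compute())   # 1*4 + 2*5 + 3*6 = 32
```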
需要说明的是,加载存储模块202将第一数据直接加载到卷积处理单元212,以及加载存储模块202将第二数据加载到通用寄存器290后,也可以由第一处理模块210的其他处理单元诸如向量处理单元214从通用寄存器290中获取到该第二数据。
还需要说明的是,加载存储模块202还可以将其他数据诸如第三数据加载到通用寄存器290,可以由第一处理模块210的一个或多个处理单元诸如整形处理单元216从通用寄存器290中获取该第三数据。该第三数据的类型可以与第一数据、第二数据的类型均不同。
加载存储模块202还与指令分发模块220连接,加载存储模块202可以接收指令分发模块220所发射的指令,加载存储模块202可以根据指令分发模块220所发射的指令将数据存储模块240的数据存储到通用寄存器290或/和加载到卷积处理单元212。加载存储模块202还可以根据指令分发模块220所发射的指令将存储在通用寄存器290的处理结果存储到数据存储模块240。该处理结果诸如为向量处理单元214的处理结果。
需要说明的是,指令分发模块220可以在一个时钟周期内将多条指令并行发射到第一处理模块210、第二处理模块230及加载存储模块202。诸如指令分发模块220可以在一个时钟周期内将多条指令并行发射到标量处理单元232、卷积处理单元212、向量处理单元214和加载存储模块202。
其中加载存储模块202与数据存储模块240可以集成在一起,而作为一个模块的两部分。当然,加载存储模块202与数据存储模块240也可以分开设置,或者说加载存储模块202与数据存储模块240不集成在一个模块中。
请参阅图20,图20为本申请实施例提供的神经网络处理器的第十六种结构示意图。神经网络处理器200还可以包括数据搬移引擎204。该数据搬移引擎204也可以称为寄存器堆数据搬移引擎(MOVE)。该数据搬移引擎204可以实现不同寄存器之间数据的搬移,以便于第一处理模块210的处理单元诸如卷积处理单元212和第二处理模块230的处理单元诸如标量处理单元232从NPU200内部获取所需的数据进行处理,而无需将数据传输到NPU200外部经由上层软件处理后再返还到NPU200。或者说数据搬移引擎204可以实现不同寄存器之间的数据交互,从而就可以节省NPU200中的一些数据从NPU200向外部传输的过程,减少了NPU200与上层软件诸如CPU的交互,提高了NPU200处理数据的效率。同时,还可以减少外部CPU的工作负载。
数据搬移引擎204连接通用寄存器290和第二处理模块230的标量处理单元232,该标量处理单元232具体可以参阅以上内容,在此不再赘述。该标量处理单元232包括有多个标量寄存器2322,简称为标量寄存器堆,标量处理单元232通过标量寄存器2322与数据搬移引擎204连接。该通用寄存器290中具有多个寄存器,简称寄存器堆,通用寄存器290通过其内的寄存器堆与数据搬移引擎204连接。需要说明的是,通用寄存器290的多个寄存器可以全部与数据搬移引擎204连接,也可以是其中一部分与数据搬移引擎204连接。
请参阅图21,图21为本申请实施例提供的神经网络处理器的第十七种结构示意图。神经网络处理器200中的通用寄存器290可以包括多个向量寄存器292,简称向量寄存器堆,本申请实施例向量寄存器292可以全部与数据搬移引擎204连接,本申请实施例向量寄存器292也可以一部分与数据搬移引擎204连接,该一部分可以理解为至少为一个向量寄存器,且不是全部向量寄存器。
神经网络处理器200中的通用寄存器290可以包括多个预测寄存器294,简称预测寄存器堆,也可以称为谓词寄存器堆,本申请实施例预测寄存器294可以全部与数据搬移引擎204连接,本申请实施例预测寄存器294也可以一部分与数据搬移引擎204连接。
需要说明的是,通用寄存器290包括多种类型的寄存器时,通用寄存器290可以通过所有类型的寄存器与数据搬移引擎204连接。通用寄存器290也可以通过其中部分类型的寄存器与数据搬移引擎204连接,诸如神经网络处理器200的通用寄存器290包括多个向量寄存器292和多个预测寄存器294时,通用寄存器290仅通过多个向量寄存器292与数据搬移引擎204连接。
需要说明的是,图20和图21仅示出神经网络处理器200的部分元件,图20和图21所示神经网络处理器200的其他元件可以参阅图1至图19,为了详细说明本申请实施例数据搬移引擎204与其他元件的关系,以及数据搬移引擎204具体实现数据的搬移,下面结合图22进行详细说明。
请参阅图22,图22为本申请实施例提供的神经网络处理器的第十八种结构示意图。本申请实施例神经网络处理器200的一些数据诸如第一处理模块210的卷积处理单元212、向量处理单元214或整形处理单元216所处理的数据需要进行标量计算时,可以将该数据存储到通用寄存器290中,数据搬移引擎204可以将该数据搬移到标量处理单元232,由标量处理单元232对该数据进行标量计算。待标量处理单元232对该数据计算完成,得到计算结果时,可以由该数据搬移引擎204将该计算结果搬移到通用寄存器290,第一处理模块210中相应的处理单元可以从通用寄存器290中获取该计算结果。从而本申请实施例NPU200中的数据搬移均在NPU200内部,相比NPU200将数据传输到外部,经外部上层软件诸如CPU处理完成后再返还到NPU200可以减少NPU200与外部的交互,提升NPU200处理数据的效率。
其中,第一处理模块210的卷积处理单元212、向量处理单元214或整形处理单元216所处理的数据需要进行标量计算,诸如其处理得到的中间结果需要进行判断操作。该判断操作可以由标量处理单元232完成。或者说通用寄存器290所存储的数据为待判断数据,该待判断数据需要进行判断操作,数据搬移引擎204将该待判断数据搬移到标量处理单元232的标量寄存器2322,以进行判断操作。
本申请实施例神经网络处理器200的一些数据诸如标量处理单元232的标量数据需要变换成向量数据时,数据搬移引擎204可以将该标量数据搬移到通用寄存器290,第一处理模块210中相应的处理单元诸如向量处理单元214可以从通用寄存器290中获取该标量数据,以将其变换成向量数据。需要说明的是,标量数据需要变换成向量数据也可以称为标量数据需要拓展成向量数据。比如一个32位数据复制成16个一样的数据组成一个512bit的向量。
在实际应用中,指令分发模块220与数据搬移引擎204连接,指令分发模块220可以将指令发射到数据搬移引擎204,数据搬移引擎204可以根据其接收到的指令来执行数据搬移的操作。诸如指令分发模块220向数据搬移引擎204发射第一指令,数据搬移引擎204根据第一指令将通用寄存器290的数据搬移到标量处理单元232的标量寄存器2322。再比如指令分发模块220向数据搬移引擎204发射第二指令,数据搬移引擎204根据第二指令将标量寄存器2322的数据搬移到通用寄存器290。
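结合上文第一指令、第二指令的搬移方向,下面给出一段示意性的Python代码草图:数据搬移引擎根据第一指令把通用寄存器的数据搬到标量寄存器供判断,根据第二指令把标量寄存器的数据搬回通用寄存器并扩展成向量;寄存器编号与指令格式均为便于说明而假设。

```python
vector_regs = {"v0": [7, -3, 5, 9]}   # 通用寄存器中的向量寄存器(模拟)
scalar_regs = {}                       # 标量处理单元的标量寄存器(模拟)

def move_engine(instr):
    # 假设的指令格式:(操作码, 源寄存器, 目的寄存器)
    op, src, dst = instr
    if op == "MOVE_V2S":               # 第一指令:通用寄存器 -> 标量寄存器
        scalar_regs[dst] = vector_regs[src][0]   # 搬移需要标量判断的数据(取首元素示意)
    elif op == "MOVE_S2V":             # 第二指令:标量寄存器 -> 通用寄存器(标量扩展成向量)
        vector_regs[dst] = [scalar_regs[src]] * len(vector_regs["v0"])

move_engine(("MOVE_V2S", "v0", "s0"))
print(scalar_regs["s0"] > 0)           # 标量处理单元据此完成判断操作 -> True
move_engine(("MOVE_S2V", "s0", "v1"))
print(vector_regs["v1"])               # [7, 7, 7, 7]
```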
需要说明的是,指令分发模块220可以在一个时钟周期内将多条指令并行发射到第一处理模块210、第二处理模块230、加载存储模块202和数据搬移引擎204。诸如指令分发模块220可以在一个时钟周期内将多条指令并行发射到卷积处理单元212、向量处理单元214、标量处理单元232、加载存储模块202和数据搬移引擎204。
神经网络处理器200可以进行卷积神经网络运算、循环神经网络运算等,下面以卷积神经网络运算为例,神经网络处理器200从外部获取待处理数据(如图像数据),神经网络处理器200内的卷积处理单元212可以对待处理数据进行卷积处理。卷积神经网络中的卷积层的输入包括输入数据(如从外部获取的待处理数据)和权重数据,卷积层的主要计算流程是对输入数据和权重数据进行卷积运算得到输出数据。其中,进行卷积运算的主体为卷积处理单元,也可以理解为,神经网络处理器的卷积处理单元对输入数据和权重数据进行卷积运算得到输出数据。需要说明的是,权重数据在一些情况下可以理解为一个或多个卷积核。下面针对卷积运算进行详细说明。
输入数据的大小为H×W×C1,权重数据的大小为K×R×S×C2,其中,H为输入数据的高度,W为输入数据的宽度,C1为输入数据的深度,K为权重数据的输出数,即卷积核的个数,R为权重数据的高度,即卷积核的高度,S为权重数据的宽度,即卷积核的宽度,C2为权重数据的深度,即卷积核的深度,其中,权重数据的C2和输入数据的C1相等,因为C2和C1均为对应的深度数值并且相等,为了方便理解,下面的C2和C1都用C替代,也可以理解为C2=C1=C。输入数据大小还可以为N×H×W×C,N为输入数据的批数。
卷积处理单元先对输入数据按卷积核的大小进行取窗,取窗后的窗体区域与权重数据中的一个卷积核进行乘累加运算得到一个数据,随后分别在W方向和H方向滑动窗体再进行乘累加运算得到H’×W’个数据,最后遍历K个卷积核得到K×H’×W’个数据。
当然,卷积处理单元还可以采用其他的卷积运算方式。下面对另一种方式的卷积运算进行详细说明。请参阅图23,图23为本申请实施例提供的神经网络处理器中卷积处理单元卷积运算示意图。其中,输入数据大小仍然为H×W×C,权重数据(一个或多个卷积核)大小仍然为K×R×S×C。当然,输入数据大小还可以为N×H×W×C,N为数据输入的批数。
卷积处理单元先对输入数据按卷积核的大小进行取窗,取窗后的第一窗体区域与权重数据中的所有卷积核进行乘累加运算得到K个数据,随后分别在W方向和H方向滑动窗体再进行乘累加运算得到H’×W’×K个数据。具体的运算步骤如下(也可以理解为卷积处理单元进行卷积运算的具体步骤如下,步骤列表之后给出一段示意性代码草图):
1、从起始点(W=0,H=0)对输入数据按卷积核的大小(R×S)进行取窗,得到第一窗体区域(R×S×C);
2、将取窗后的第一窗体区域与K个卷积核分别进行乘累加得到K个数据;
3、在W方向按照一个第一滑动步长滑动取窗,得到新的第一窗体区域(第一窗体区域的大小不变),其中第一滑动步长可以根据需要设置;
4、依次重复步骤2、3,直到W方向边界,这样得到W’×K个数据,其中,W’=(W-S)/第一滑动步长+1。例如,若W=7,S=3,第一滑动步长=2,则W’=3。又例如,若W=7,S=3,第一滑动步长=1,则W’=5;
5、回到W方向起始点,在H方向按照一个第二滑动步长滑动窗体,其中H方向的第二滑动步长可以根据需要设置,得到新的第一窗体区域(第一窗体区域的大小不变),例如,在H方向按照一个第二滑动步长(H方向的第二滑动步长为1)滑动窗体后,坐标可以为(W=0,H=1)。
6、重复步骤2-5,直到H方向边界,这样得到H’×W’×K个数据。需要说明的是,每次沿W方向滑动窗体都直到W方向边界,最后一次在H方向滑动取窗到达边界后,仍然在W方向滑动取窗直至W方向边界(即重复步骤2-4)。
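按上述步骤1至6,下面给出一段示意性的Python代码草图,直接用多重循环实现"取窗—与K个卷积核乘累加—按步长滑动"的过程,输出排布为H’×W’×K;代码仅说明计算顺序,未涉及任何硬件层面的并行与对齐处理,输入输出均以嵌套列表表示。

```python
def conv_hwc(inp, kernels, stride_w=1, stride_h=1):
    # inp: H x W x C 的输入数据;kernels: K 个 R x S x C 的卷积核(嵌套列表)
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    K = len(kernels)
    R, S = len(kernels[0]), len(kernels[0][0])
    out = []                                     # 输出排布为 H' x W' x K
    for h in range(0, H - R + 1, stride_h):      # H 方向按第二滑动步长滑动
        row_out = []
        for w in range(0, W - S + 1, stride_w):  # W 方向按第一滑动步长滑动
            # 当前第一窗体区域与 K 个卷积核分别做乘累加,得到 K 个数据
            vals = []
            for k in range(K):
                acc = 0
                for r in range(R):
                    for s in range(S):
                        for c in range(C):
                            acc += inp[h + r][w + s][c] * kernels[k][r][s][c]
                vals.append(acc)
            row_out.append(vals)
        out.append(row_out)
    return out

# 最小用例:2x2x1 的输入、1 个 1x1x1 的卷积核
inp = [[[1], [2]], [[3], [4]]]
kernels = [[[[1]]]]
print(conv_hwc(inp, kernels))    # [[[1], [2]], [[3], [4]]]
```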
卷积运算单元包括用于卷积运算的乘累加阵列(MAC Array),乘累加阵列的大小(L×M)是固定的,其中,L为进行乘累加运算的长度,M为并行进行乘累加运算的单元数,也可以理解为一个周期可以进行M个长度为L的乘累加运算。将上面卷积运算过程中的乘累加运算(即上面的步骤2)分配到卷积运算单元上进行并行运算的步骤如下(也可以理解为卷积处理单元利用乘累加阵列进行乘累加运算的具体步骤如下,步骤列表之后给出一段示意性代码草图):
1、从起始点(W=0,H=0)对输入数据在HW平面上按卷积核大小(R×S)进行取窗,得到第一窗体区域,并在深度方向上将第一窗体区域分割成C/L个长度为L的数据段;需要说明的是,可以得到第一窗体区域后对第一窗体区域分割成C/L个长度为L的数据段,也可以先将输入数据分割成C/L个长度为L的数据段后,再得到第一窗体区域,第一窗体区域包括C/L个长度为L的数据段;可以理解为,第一窗体区域沿深度方向可以包括C/L层的第一深度数据;
2、在深度方向上将卷积核分割成C/L个长度为L的数据段,对权重数据中K个卷积核均进行该操作,得到K组权重数据,每组有C/L个权重数据段;可以理解为,每个卷积核沿深度方向包括C/L个长度为L的权重数据段;还可以将K个卷积核分割成K/M个卷积核组,每组卷积核组都包括M个卷积核的权重数据;
3、取输入数据的第一窗体区域的第i(i=1,2,…,C/L)层第一深度数据,得到1个第一深度数据;
4、取第f(f=1,2,…,K/M)组卷积核组的第i(i=1,2,…,C/L)层第二深度数据,得到M个第二深度数据;
5、使用MAC阵列对1个第一深度数据和M个第二深度数据(权重数据广播复用)进行乘累加运算,得到M个第一运算数据;M个权重数据段为M个卷积核的权重数据段;
6、递增i,并重复步骤3-5,输出的M个第一运算数据累加到之前计算的M个第一运算数据之上,至此得到M个目标运算数据;其中,i从1开始并递增到C/L;
7、递增f,并重复步骤3-6,完成K/M次计算后得到K个输出。其中,f从1开始并递增到K/M。
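对应上述步骤1至7,下面给出一段示意性的Python代码草图:沿深度方向按长度L切段、卷积核按M个一组分组,内层的一次乘累加对应MAC阵列的一拍运算,外层对i(深度段)与f(卷积核组)循环累加;为便于说明,假设C能被L整除、K能被M整除,且把窗体区域展平成一维列表,并非具体硬件实现。

```python
def mac_array_window(window, kernels, L, M):
    # window: 第一窗体区域展平成长度为 C 的列表(此处把 R*S 也并入,便于示意)
    # kernels: K 个与 window 等长的权重列表;假设 C % L == 0 且 K % M == 0
    C, K = len(window), len(kernels)
    outputs = [0] * K
    for f in range(K // M):                      # 步骤 7:遍历 K/M 个卷积核组
        group = kernels[f * M:(f + 1) * M]
        acc = [0] * M
        for i in range(C // L):                  # 步骤 6:遍历 C/L 个深度段
            a = window[i * L:(i + 1) * L]        # 步骤 3:1 个第一深度数据
            for m in range(M):                   # 步骤 4、5:M 路并行乘累加(输入广播复用)
                b = group[m][i * L:(i + 1) * L]
                acc[m] += sum(x * y for x, y in zip(a, b))
        outputs[f * M:(f + 1) * M] = acc
    return outputs                               # K 个输出

window = [1, 2, 3, 4]                            # C = 4
kernels = [[1, 0, 0, 0], [0, 1, 0, 0]]           # K = 2
print(mac_array_window(window, kernels, L=2, M=2))   # [1, 2]
```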
其中,输入数据的高度H、宽度W和深度C都是随机的,即,输入数据的大小可以有非常多的格式,如输入数据的宽度W是不确定的,输入数据的宽度W除以乘累加阵列并行进行乘累加运算的单元数M,大多数情况下无法得到整数,这样在乘累加运算过程中,就会浪费部分乘累加运算单元。本实施例中,利用卷积核的个数K除以乘累加阵列并行进行乘累加运算的单元数M,卷积核的个数K一般都采用固定的数并且为2的n次方数(即2^n),或者为有限的几个数中一个(如K为32、64、128、256中的一个),如此,设置乘累加运算单元时,可以将乘累加运算的单元数M设置为与K的数量相同或整倍数,如M为32、64、128等中的一个。本实施例可以充分利用乘累加运算单元,减少乘累加运算单元的浪费,提高了卷积运算的效率。本实施例中的进行乘累加运算的单元数M对应的卷积核的个数K,是一个维度方向的划分,若进行乘累加运算的单元数M对应的是滑动的窗体区域,对应的不仅包括宽度W维度还包括H维度,两个维度的对应不利于折叠。
此外,本实施例中的输出的目标运算数据的格式为H’×W’×K,其与输入数据的格式相同,不需要再对其进行形变,就可以直接作为下一运算层(如下一层卷积层或下一层池化层等)的输入数据。而且目标运算数据是深度方向连续的数据,在存储时可以存储连续的数据,后续再读取目标运算数据时也是连续的,硬件加载时,不需要多次计算地址,优化计算效率。
需要说明的是,本实施例中,C大于L,K大于M,当C/L、K/M中的一个或两个不整除时,需要对不整除的数取整并加1,具体的为获取其整数部分后再加1。示例性地,乘累加阵列(MAC Array)中L和M采用相同的数值,如均为64。对输入数据在深度方向上按64长度粒度进行补齐,沿深度方向分割成1×1×64的数据块,深度不足64时,补齐到64,数据组织方式为N×H×W×(c×C’),其中c=64,C’为C除c向上取整。对权重数据在深度方向上按64长度粒度进行补齐,权重数据沿深度方向分割成1×1×64的数据块,深度不足64时,补齐到64,卷积核个数大于64时,按64粒度分割成多组。调整后数据组织方式为R×S×(c×C’)×(k×K’),其中c=64,C’为C除c向上取整,k=64,K’为K除k向上取整。
本实施例在卷积运算过程中,卷积处理单元还可以用于将一个窗体区域对应的K个目标运算数据传输到下一层并用于进行运算;或者将N个第一窗体区域对应的N×K个目标运算数据传输到下一层并用于进行运算,其中,N小于输出数据的第一窗体区域的总数量。
因为对每一个第一窗体区域都进行了完整的运算,即每一个第一窗体区域(包括深度方向)的所有数据都与所有的卷积核(包括深度方向)都进行了乘累加运算,得到的目标运算数据是完整的,那么可以将一个或多个第一窗体区域对应的目标运算数据先传输到下一层,而不需要等待所有的输入数据都运算完成再传输,当传输到下一层的部分目标运算数据可以作为下一层运算的最小单元时(如部分目标运算数据可以作为下一层输入数据的一个窗体区域包括的数据),下一层可以开始运算,不需要等待上一层的全部运算结果,提高了卷积运算的效率,缩短了卷积运算的时长。此外,因为卷积运算单元所在的NPU内部缓存一般很小,无法存放较大的中间结果。若卷积运算完成的数据的格式是K×H’×W’的,这样需要计算完这一层的结果才能进行下一层的计算,并且其输出的数据较大需要缓存到外部内存(即NPU外的内存)。而本实施例卷积运算完成的结果是H’×W’×K格式的,则可以在H’×W’平面上计算出部分结果后,就可以直接将其作为下一层计算的输入数据,较小的NPU内部缓存只需存储1×W’×K或N1×W’×K或N1×N2×K,其中N1可以远小于H’,N2可以远小于W’,不需要再将输出结果缓存到外部内存,再从外部内存读取进行下一层的运算,这样可以很大程度上缓解带宽压力,同时提高了运算效率。另外,在融合层(Fusion Layer)场景下可以很方便地进行流水作业。
其中,当待传输到下一层的目标运算数据与上一次传输的目标运算数据有重复数据时,去除重复数据得到目标数据;以及将目标数据传输到下一层。可以优化数据的传输和存储,当然也可以每次都将目标运算数据传输出去,将其覆盖重复的数据。
乘累加阵列(MAC Array)进行乘累加运算的长度L可以等于并行进行乘累加运算的单元数M,因为乘累加阵列的L和M相等,乘累加运算出来的结果的数据两个方向的值相等,可以方便的对运算出来的结果进行调整。当然,在其他一些实施例中,乘累加阵列的L和M可以不相等,以利于乘累加阵列的设置。
卷积处理单元可以用于:根据卷积核对输入数据进行一次取窗操作,得到第一窗体区域,第一窗体区域沿深度方向包括第一数量层的第一深度数据;获取多个卷积核,多个卷积核沿深度方向包括第一数量层的第二深度数据;将一层的第一深度数据与多个卷积核同一层的第二深度数据进行乘累加运算,得到第一运算数据。
卷积处理单元还可以对多层进行运算,卷积处理单元还用于将多层的第一深度数据对应的多个第一运算数据累加得到目标运算数据。即,基于上述实施例中单层运算的方式,将多层的第一深度数据与多个卷积核多层的第二深度数据进行乘累加运算,得到多个第一运算数据累加后得到目标运算数据。
卷积处理单元可以将其运算结果存储到数据存储模块,也可以将运算结果传输到向量处理单元或整形处理单元以进行进一步的计算操作。
本申请实施例所提供的神经网络处理器200可以集成为一个芯片。
请参阅图24,图24为本申请实施例提供的芯片的结构示意图。芯片20包括神经网络处理器200,该神经网络处理器200具体可以参阅以上内容,在此不再赘述。该芯片20可以应用到电子设备中。
需要说明的是,本申请实施例的神经网络处理器200也可以与其他处理器、存储器等集成在一个芯片中。
为了进一步说明本申请实施例神经网络处理器200的整体运作过程,下面结合其他处理器、存储器进行描述。
请参阅图25,图25为本申请实施例提供的电子设备的结构示意图。电子设备20可以包括神经网络处理器200、系统总线400、外部存储器600和中央处理器800。神经网络处理器200、外部存储器600及中央处理器800均与系统总线400连接,以使得神经网络处理器200与外部存储器600可以实现数据的传输。
系统总线400通过系统总线接口280与神经网络处理器200实现连接。系统总线400可以通过其他系统总线接口与中央处理器800及外部存储器600连接。
神经网络处理器200受控于所述中央处理器800从外部存储器600中获取待处理数据、及对待处理数据进行处理以得到处理结果,并将处理结果反馈到外部存储器600。
当需要采用神经网络处理器200进行数据处理时,电子设备20的上层驱动软件诸如中央处理器800将当前需要执行程序的配置写入到对应的寄存器中,比如:工作模式、程序计数器(Program Counter,PC)的初始值、配置参数等。然后,数据搬移模块260将待处理的数据诸如图像数据、权重数据通过系统总线接口280从外部存储器600中读取过来,并写入到数据存储模块240。指令分发模块220按照初始PC开始取指令。当取到指令后,指令分发模块220根据指令的类型将指令发射到对应的处理单元。各个处理单元根据具体的指令来执行不同的操作,然后将结果写入到数据存储模块240。
其中,该寄存器为神经网络处理器200的配置状态寄存器,或者称为控制状态寄存器,其可以设定神经网络处理器200的工作模式,比如输入数据的位宽,程序初始PC的位置等。
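结合上文,神经网络处理器200的一次典型运行可以概括为:上层驱动软件写配置状态寄存器、数据搬移模块搬入待处理数据、指令分发模块按初始PC取指并发射、处理单元执行后写回结果。下面用一段示意性的Python代码草图串起这一流程;其中的寄存器字段名、指令助记符与"执行即对数据求和"等处理均为便于说明而假设的简化,并非本申请实施例的具体实现。

```python
class NPU:
    def __init__(self):
        self.csr = {}            # 配置状态寄存器:工作模式、程序计数器初始值等
        self.data_buffer = []    # 数据存储模块
        self.program = {}        # 指令存储模块:PC -> 指令

    def configure(self, mode, init_pc):
        # 上层驱动软件(CPU)写入工作模式与程序计数器初始值
        self.csr["mode"], self.csr["pc"] = mode, init_pc

    def dma_in(self, external_data):
        # 数据搬移模块把外部存储器中的待处理数据写入数据存储模块
        self.data_buffer = list(external_data)

    def run(self):
        pc, results = self.csr["pc"], []
        while pc in self.program:
            inst = self.program[pc]          # 按 PC 取指并发射
            if inst == "END":
                break                        # 收到结束标识,向上层软件发中断(此处省略)
            results.append((inst, sum(self.data_buffer)))   # 处理单元执行并写回(示意)
            pc += 1
        return results

npu = NPU()
npu.configure(mode="int8", init_pc=0)
npu.dma_in([1, 2, 3])
npu.program = {0: "CONV", 1: "VADD", 2: "END"}
print(npu.run())
```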
需要说明的是,图25所示的神经网络处理器还可以替换成其他图示的神经网络处理器。
下面从本申请实施例通过神经网络处理器对数据进行处理的方法步骤和对数据进行加载的方法步骤的角度进行描述。
请参阅图26,图26为本申请实施例提供的数据处理方法的流程示意图。该数据处理方法是基于以上所述神经网络处理器对数据进行处理。该数据处理方法包括:
1001,接收待处理数据和多条指令。该待处理数据可以是需要采用神经网络处理器进行处理的图像数据和权重数据。可以采用数据搬移模块260通过系统总线接口280从外部存储器600中读取需要进行处理的数据。当然也可以采用DMA260b通过系统总线接口264b从外部存储器中搬移需要进行处理的数据。接收到待处理数据后可以将所述待处理数据加载到所述数据存储模块240。
该多条指令可以是计算指令,也可以是控制指令。可以采用指令搬移模块270通过系统总线接口280从外部读取需要的指令。当然也可以采用DMA260b通过系统总线接口264b从外部搬移所需指令。还可以是外部直接将指令写入到NPU200。接收到多条指令后,可以将该多条指令加载到所述指令存储模块250。
1002,将所述多条指令并行发射到所述多个处理单元。神经网络处理器200的指令分发模块220可以根据接收到的多条指令,将该多条指令在一个时钟周期内发射到各自的处理单元,以使得各个处理单元依据该指令来实现对待处理数据的处理。其中,该指令分发模块220可以将多条指令在一个时钟周期内发射到第一处理模块210的至少两个处理单元中。该指令分发模块220可以将多条指令在一个时钟周期内发射到标量处理单元232和第一处理模块210的至少一个处理单元中。
需要说明的是,为确保指令分发模块220所发射下去的指令均是有用的,或者说指令分发模块220将指令发射下去后各个处理单元有根据该指令对数据进行处理,本申请实施例在指令分发模块220发射指令之前,指令分发模块220先向数据存储模块240发送一个判断信号,待从数据存储模块240返回信号时,指令分发模块220可以基于该返回信号确定数据存储模块240是否缓存有待处理数据。如果指令分发模块220确定出数据存储模块240未存储有待处理数据,则指令分发模块220不会将指令发射到各个处理单元。而只有在指令分发模块220确定出数据存储模块240存储有待处理数据时,指令分发模块220才会将指令发射到多个处理单元。
1003,所述多个处理单元根据所述多条指令对所述待处理数据进行处理,以得到处理结果。各个处理单元对待处理数据处理完成后,会得到处理结果。本申请实施例多个处理单元还可以将所述处理结果写入到所述数据存储模块240。进而采用数据搬移模块260及系统总线接口280可以将该处理结果传输到外部存储器600。
数据处理完成后,本申请实施例神经网络处理器的指令分发模块220如果收到结束标识指令,则认为程序已经执行完毕,发出中断给上层软件,结束NPU200的工作。如果没有结束则返回到1002,继续取指执行指令发射,一直到程序执行完毕。
请参阅图27,图27为本申请实施例提供的数据处理方法的流程示意图。该数据处理方法是基于以上所述神经网络处理器对数据进行处理。该数据处理方法包括:
2001,根据第一条件将所述通用寄存器的数据搬移到所述标量寄存器。其中第一条件可以为第一指令。数据搬移引擎204可以根据第一指令将通用寄存器290的数据搬移到标量寄存器2322,具体内容可以参阅以上内容,在此不再赘述。
2002,根据第二条件将所述标量寄存器的数据搬移到所述通用寄存器。其中第二条件可以为第二指令。数据搬移引擎204可以根据第二指令将标量寄存器2322的数据搬移到通用寄存器290,具体内容可以参阅以上内容,在此不再赘述。
请参阅图28,图28为本申请实施例提供的数据加载方法的流程示意图。该数据加载方法基于以上神经网络处理器200加载数据,该数据加载方法包括:
3001,将第一数据加载到具有专用寄存器的卷积处理单元。其中,具有专用寄存器2122的卷积处理单元212可以参阅以上内容,在此不再赘述。
3002,将第二数据存储到通用寄存器,所述第一数据和所述第二数据的类型不同。其中,通用寄存器290可以参阅以上内容,在此不再赘述。本申请实施例可以通过加载存储模块202实现数据的加载或传输,具体加载或传输方式可以参阅以上内容,在此不再赘述。其中第一数据和第二数据具体可以参阅以上内容,在此不再赘述。
以上对本申请实施例提供的神经网络处理器、芯片和电子设备进行了详细介绍。本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请。同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种神经网络处理器,其包括:
    第一处理模块,所述第一处理模块包括具有专用寄存器的卷积处理单元;
    通用寄存器,所述通用寄存器与所述卷积处理单元连接;和
    加载存储模块,所述加载存储模块与所述通用寄存器连接,所述加载存储模块还通过所述专用寄存器与所述卷积处理单元连接;
    所述加载存储模块用于加载数据到所述通用寄存器和加载数据到所述卷积处理单元的专用寄存器中的至少一个。
  2. 根据权利要求1所述的神经网络处理器,其中,所述神经网络处理器还包括:
    数据存储模块,所述数据存储模块用于存储数据,所述数据存储模块与所述加载存储模块连接;
    所述加载存储模块用于将所述数据存储模块中的数据加载到所述通用寄存器和所述卷积处理单元的专用寄存器中的至少一个;
    所述加载存储模块还用于将所述通用寄存器的数据存储到所述数据存储模块。
  3. 根据权利要求2所述的神经网络处理器,其中,所述数据存储模块所存储的数据包括第一数据和第二数据,所述第一数据和所述第二数据的一者为图像数据,另一者为权重数据;
    所述加载存储模块还用于将所述第一数据加载到所述卷积处理单元的专用寄存器;
    所述加载存储模块还用于将所述第二数据加载到所述通用寄存器;
    所述卷积处理单元还用于从所述通用寄存器获取所述第二数据。
  4. 根据权利要求2所述的神经网络处理器,其中,所述加载存储模块和所述数据存储模块集成在一起;或
    所述加载存储模块和所述数据存储模块分开设置。
  5. 根据权利要求1所述的神经网络处理器,其中,所述通用寄存器包括多个向量寄存器和多个预测寄存器。
  6. 根据权利要求5所述的神经网络处理器,其中,所述神经网络处理器还包括:
    指令分发模块,所述指令分发模块与所述卷积处理单元连接,所述指令分发模块用于并行发射多条指令。
  7. 根据权利要求6所述的神经网络处理器,其中,所述指令分发模块还用于在一个时钟周期内并行发射多条指令。
  8. 根据权利要求6所述的神经网络处理器,其中,所述指令分发模块还用于根据指令的类型并行发射多条指令。
  9. 根据权利要求6所述的神经网络处理器,其中,所述指令分发模块所发射的指令包括细粒度指令,所述指令分发模块用于将所述细粒度指令发射到所述卷积处理单元,所述卷积处理单元用于根据一条细粒度指令对其所接收到的数据进行一次向量内积运算。
  10. 根据权利要求6所述的神经网络处理器,其中,所述第一处理模块还包括与所述指令分发模块连接的向量处理单元,所述指令分发模块用于将所述多条指令并行发射到所述卷积处理单元和所述向量处理单元。
  11. 根据权利要求10所述的神经网络处理器,其中,所述第一处理模块还包括与所述指令分发模块连接的整形处理单元,所述指令分发模块用于将所述多条指令并行发射到所述卷积处理单元、所述向量处理单元和所述整形处理单元。
  12. 根据权利要求6所述的神经网络处理器,其中,所述神经网络处理器还包括第二处理模块,所述第二处理模块包括与所述指令分发模块连接的标量处理单元,所述指令分发模块用于将所述多条指令并行发射到所述卷积处理单元和所述标量处理单元。
  13. 根据权利要求10或12所述神经网络处理器,其中,所述神经网络处理器还包括用于存储数据的数据存储模块,所述数据存储模块与所述卷积处理单元连接。
  14. 根据权利要求13所述神经网络处理器,其中,所述数据存储模块还与所述指令分发模块连接;
    所述指令分发模块还用于:
    根据所述数据存储模块存储有待处理的数据并行发射多条指令;
    根据所述数据存储模块未存储待处理的数据不发射指令。
  15. 根据权利要求13所述神经网络处理器,其中,所述数据存储模块还与所述标量处理单元连接。
  16. 根据权利要求14所述的神经网络处理器,其中,所述神经网络处理器还包括:
    系统总线接口,所述系统总线接口用于与系统总线连接;
    数据搬移模块,所述数据搬移模块连接所述数据存储模块和所述系统总线接口,所述数据搬移模块用于搬移数据;
    指令存储模块,所述指令存储模块与所述指令分发模块连接,所述指令存储模块用于存储所述指令分发模块所需要发射的部分指令或全部指令;和
    指令搬移模块,所述指令搬移模块连接所述指令存储模块和所述系统总线接口,所述指令搬移模块用于搬移指令。
  17. 根据权利要求14所述的神经网络处理器,其中,所述神经网络处理器还包括:
    系统总线接口,所述系统总线接口用于与系统总线连接;
    数据搬移模块,所述数据搬移模块连接所述数据存储模块和所述系统总线接口,所述数据搬移模块用于搬移数据;和
    指令存储模块,所述指令存储模块连接所述指令分发模块和所述系统总线接口,所述指令存储模块用于存储所述指令分发模块所需要发射的全部指令。
  18. 根据权利要求14所述的神经网络处理器,其中,所述神经网络处理器还包括:
    系统总线接口,所述系统总线接口用于与系统总线连接;和
    直接存储访问,所述直接存储访问包括至少一条物理通道、至少一条逻辑通道和第一仲裁单元,所述至少一条物理通道和至少一条逻辑通道通过所述第一仲裁单元连接所述系统总线接口,所述至少一个物理通道与所述指令存储模块连接,所述至少一个逻辑通道与所述数据存储模块连接。
  19. 一种芯片,其包括神经网络处理器,所述神经网络处理器为如权利要求1至18任一项所述的神经网络处理器。
  20. 一种电子设备,其包括:
    系统总线;
    外部存储器;
    中央处理器;和
    神经网络处理器,所述神经网络处理器为如权利要求1至18任一项所述的神经网络处理器;
    其中,所述神经网络处理器通过所述系统总线连接所述外部存储器和所述中央处理器,所述神经网络处理器受控于所述中央处理器从所述外部存储器中获取待处理数据、及对所述待处理数据进行处理以得到处理结果,并将所述处理结果反馈到所述外部存储器。
PCT/CN2020/132792 2019-12-09 2020-11-30 神经网络处理器、芯片和电子设备 WO2021115149A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911253030.2A CN111047035B (zh) 2019-12-09 2019-12-09 神经网络处理器、芯片和电子设备
CN201911253030.2 2019-12-09

Publications (1)

Publication Number Publication Date
WO2021115149A1 true WO2021115149A1 (zh) 2021-06-17

Family

ID=70235304

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132792 WO2021115149A1 (zh) 2019-12-09 2020-11-30 神经网络处理器、芯片和电子设备

Country Status (2)

Country Link
CN (1) CN111047035B (zh)
WO (1) WO2021115149A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047035B (zh) * 2019-12-09 2024-04-19 Oppo广东移动通信有限公司 神经网络处理器、芯片和电子设备
CN112130901A (zh) * 2020-09-11 2020-12-25 山东云海国创云计算装备产业创新中心有限公司 基于risc-v的协处理器、数据处理方法及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477454A (zh) * 2009-01-22 2009-07-08 浙江大学 嵌入式处理器的乱序执行控制装置
CN101916428A (zh) * 2010-08-18 2010-12-15 中国科学院光电技术研究所 一种图像数据的图像处理装置
US20180307974A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with mutiple instruction units
CN109214506A (zh) * 2018-09-13 2019-01-15 深思考人工智能机器人科技(北京)有限公司 一种卷积神经网络的建立装置及方法
CN111047035A (zh) * 2019-12-09 2020-04-21 Oppo广东移动通信有限公司 神经网络处理器、芯片和电子设备

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679621B (zh) * 2017-04-19 2020-12-08 赛灵思公司 人工神经网络处理装置
US11347964B2 (en) * 2017-08-07 2022-05-31 Renesas Electronics Corporation Hardware circuit
CN107590535A (zh) * 2017-09-08 2018-01-16 西安电子科技大学 可编程神经网络处理器
US10482337B2 (en) * 2017-09-29 2019-11-19 Infineon Technologies Ag Accelerating convolutional neural network computation throughput
CN109034373B (zh) * 2018-07-02 2021-12-21 鼎视智慧(北京)科技有限公司 卷积神经网络的并行处理器及处理方法
CN110097174B (zh) * 2019-04-22 2021-04-20 西安交通大学 基于fpga和行输出优先的卷积神经网络实现方法、系统及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477454A (zh) * 2009-01-22 2009-07-08 浙江大学 嵌入式处理器的乱序执行控制装置
CN101916428A (zh) * 2010-08-18 2010-12-15 中国科学院光电技术研究所 一种图像数据的图像处理装置
US20180307974A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with mutiple instruction units
CN109214506A (zh) * 2018-09-13 2019-01-15 深思考人工智能机器人科技(北京)有限公司 一种卷积神经网络的建立装置及方法
CN111047035A (zh) * 2019-12-09 2020-04-21 Oppo广东移动通信有限公司 神经网络处理器、芯片和电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NING, XI ET AL.: "A DMA Controller Supporting Multi-bus Arbitration and the Matrix Data Transfer", THE 15TH NATIONAL CONFERENCE ON COMPUTER ENGINEERING AND TECHNOLOGY AND THE 1ST MICROPROCESSOR FORUM, 12 August 2011 (2011-08-12), pages 353 - 357, XP055819803 *

Also Published As

Publication number Publication date
CN111047035A (zh) 2020-04-21
CN111047035B (zh) 2024-04-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900157

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20900157

Country of ref document: EP

Kind code of ref document: A1