CN110058882B - OPU instruction set definition method for CNN acceleration - Google Patents

OPU instruction set definition method for CNN acceleration

Info

Publication number
CN110058882B
CN110058882B (application CN201910192455.0A)
Authority
CN
China
Prior art keywords
instruction
mode
reading
data
read
Prior art date
Legal status
Active
Application number
CN201910192455.0A
Other languages
Chinese (zh)
Other versions
CN110058882A (en)
Inventor
喻韵璇 (Yu Yunxuan)
王铭宇 (Wang Mingyu)
Current Assignee
Shenzhen Biong Core Technology Co ltd
Original Assignee
Shenzhen Biong Core Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Biong Core Technology Co ltd filed Critical Shenzhen Biong Core Technology Co ltd
Priority to CN201910192455.0A priority Critical patent/CN110058882B/en
Publication of CN110058882A publication Critical patent/CN110058882A/en
Application granted granted Critical
Publication of CN110058882B publication Critical patent/CN110058882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The invention discloses an OPU instruction set definition method for CNN acceleration, relating to the field of instruction sets for CNN acceleration processors. Unconditional instructions provide configuration parameters for conditional instructions; a conditional instruction sets a trigger condition that is hard-wired in hardware, together with a corresponding trigger-condition register, and is executed only after the trigger condition is met, whereas an unconditional instruction is executed directly after being read and replaces the contents of the parameter registers. The invention selects a computation mode with parallel input and output channels according to the CNN network and the acceleration requirement, and sets the instruction granularity accordingly. The instruction set avoids unpredictable instruction ordering caused by the large uncertainty of operation cycles. The instruction set and its corresponding OPU processor can be implemented on an FPGA or an ASIC; the OPU can accelerate different target CNN networks without hardware reconfiguration.

Description

OPU instruction set definition method for CNN acceleration
Technical Field
The invention relates to the field of CNN accelerator instruction set definition methods, in particular to an OPU instruction set definition method for CNN acceleration.
Background
Deep convolutional neural networks (CNNs) achieve high accuracy in applications such as visual object recognition, speech recognition, and object detection. However, this breakthrough in accuracy comes at the cost of high computational load, which must be carried by compute clusters, GPUs, or FPGAs. FPGA accelerators offer high energy efficiency, good flexibility, and strong computing power, and are particularly well suited to CNN applications on edge devices such as smartphones, for tasks like speech recognition and visual object recognition. Building such an accelerator generally involves architecture exploration and optimization, RTL programming, hardware implementation, and software-hardware interface development. With deepening research into automatic compilers for FPGA CNN (convolutional neural network) acceleration, configurable platforms providing rich parallel computing resources and high energy efficiency have become an ideal choice for CNN acceleration at the edge and in data centers. However, as DNN (deep neural network) algorithms evolve toward more complex computer vision tasks such as face recognition, license plate recognition, and gesture recognition, cascades of multiple DNNs are widely applied to obtain better performance. These new application scenarios require different networks to be executed sequentially, so the FPGA device must be reconfigured repeatedly, which is time-consuming; moreover, each update of the client network architecture triggers regeneration of the RTL code and the entire implementation flow, which takes even longer.
In recent years, automatic accelerator generators that can rapidly deploy CNNs to FPGAs have become another focus. Researchers in the prior art have developed Deep Weaver, which maps CNN algorithms onto manually optimized design templates according to the resource allocation and hardware organization produced by a design planner; a compiler based on an RTL library has also been proposed, consisting of multiple optimized, hand-coded Verilog templates that describe the computation and data flow of different layer types. Both of these efforts achieve performance comparable to custom-designed accelerators. Researchers have also provided an HLS-based compiler focused mainly on bandwidth optimization through memory access reorganization, and a systolic array architecture has been proposed to achieve higher FPGA operating frequencies. However, existing FPGA acceleration work aims at generating a specific, individual accelerator for each CNN; this guarantees reasonably high performance for RTL-based or HLS-RTL-based templates, but incurs high hardware upgrade complexity whenever the target network changes. Therefore, to avoid generating dedicated hardware description code for each individual network and re-burning the FPGA, the entire deployment flow should be completed through instruction configuration: different target networks are configured by instructions without reconstructing the FPGA accelerator. A brand-new CNN acceleration system is accordingly provided: an OPU (Overlay Processor Unit) instruction set is defined, a compiler compiles the defined instruction set into an instruction sequence, and the OPU executes the compiled instructions to realize CNN acceleration. To achieve this, the instructions must be defined so that networks of different structures can be mapped and reorganized onto one specific structure, giving the instruction-controlled processor good universality. On the other hand, when external memory is used, cycle-accurate simulation of memory reads and writes is difficult, because additional refresh time and other overhead may occur; if instructions are executed immediately after decoding, the order of operations can only be controlled by the order of the instruction sequence, and controlling the starting point of operations executed in parallel becomes tricky when operation cycles cannot be accurately simulated. Meanwhile, the starting condition of a main operation is usually triggered only after preceding steps reach a certain state, so instruction execution times carry large uncertainty. An instruction set definition method is therefore needed that overcomes the above problems: an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure, optimizes the universality of the instruction-controlled processor, completes the configuration of different target networks from instructions, and realizes general CNN acceleration through the OPU.
Disclosure of Invention
The invention aims to provide an OPU instruction set definition method for CNN acceleration that maps and reorganizes networks of different structures onto a specific structure, optimizes the universality of the instruction-controlled processor, and realizes different networks without reconfiguring the FPGA.
The technical scheme adopted by the invention is as follows:
an OPU instruction set definition method for CNN acceleration comprises the following steps:
the method comprises the steps of defining a conditional instruction, defining an unconditional instruction and setting instruction granularity;
defining the conditional instruction includes the steps of:
constructing conditional instructions, wherein the conditional instructions comprise a read storage instruction, a write storage instruction, a data fetch instruction, a data post-processing instruction and a calculation instruction;
setting registers and an execution mode for the conditional instructions, wherein a conditional instruction is executed after its hardware-written trigger condition is met, and the registers comprise parameter registers and trigger-condition registers;
setting the parameter configuration mode of the conditional instruction, wherein parameters are configured according to the unconditional instruction;
defining unconditional instructions includes the steps of:
defining parameters of the unconditional instruction;
defining the execution mode of the unconditional instruction, wherein the instruction is executed directly after being read;
the step of setting the instruction granularity comprises the following steps:
gathering statistics on CNN networks and acceleration requirements;
and determining a computation mode with the selected parallel input and output channels according to the statistical results, and setting the instruction granularity.
Preferably, the read-store instruction includes performing a read-store operation in mode A1 or mode A2; its granularity is n numbers read in per operation, where n > 1;
mode A1: read n numbers contiguously starting from the designated address;
mode A2: read n numbers according to an address stream whose addresses are discontinuous, with three post-read processing options: 1, no operation after reading; 2, splicing the read data to a specified length; 3, splitting the read data into specified lengths; and four on-chip storage locations after reading: the feature map storage module, the inner product parameter storage module, the bias parameter storage module, and the instruction storage module;
the configurable parameters of the read-store operation instruction comprise the start address, the number of operands, the post-read processing mode, and the on-chip storage location.
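As a purely illustrative aid (not part of the patented design), the following Python sketch models the two read modes and the configurable fields described above; the names ReadStore, post_op and target are hypothetical stand-ins for the instruction fields:

```python
# Hypothetical model of the read-store instruction fields described above;
# ReadStore, post_op and target are illustrative names, not the patent's.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ReadStore:
    mode: str                                 # "A1" (contiguous) or "A2" (address stream)
    n: int                                    # granularity: n > 1 numbers per operation
    start_addr: int = 0                       # starting address (mode A1)
    addr_stream: Optional[List[int]] = None   # discontinuous addresses (mode A2)
    post_op: str = "none"                     # "none", "splice" or "split"
    target: str = "feature_map"               # or "inner_product", "bias", "instruction"

def execute_read_store(ins: ReadStore, memory: List[int]) -> List[int]:
    if ins.mode == "A1":
        # mode A1: read n numbers contiguously starting from the designated address
        return memory[ins.start_addr : ins.start_addr + ins.n]
    # mode A2: read n numbers following a discontinuous address stream
    return [memory[a] for a in ins.addr_stream[: ins.n]]
```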
Preferably, the write-store instruction includes performing a write-store operation in mode B1 or mode B2; its granularity is n numbers written out per operation, where n > 1;
mode B1: write n numbers contiguously starting from the designated address;
mode B2: write n numbers according to a target address stream whose addresses are discontinuous;
the configurable parameters of the write-store operation instruction comprise the start address and the number of operands.
Preferably, the data fetch instruction includes reading data from the on-chip feature map memory and the inner product parameter memory according to different data-read modes and data reorganization/arrangement modes, and performing the reorganization and arrangement on the read data; its granularity is 64 input data operated on simultaneously. The configurable parameters of the data fetch instruction cover reading the feature map memory, namely a read-address constraint (minimum and maximum address), a read stride, and a rearrangement mode; and reading the inner product parameter memory, namely a read-address constraint and a read mode.
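A minimal sketch of such a fetch step, under assumed names (fetch_64, rearrange) and an assumed identity default for the rearrangement mode:

```python
# Illustrative data-fetch step: read 64 inputs from the on-chip feature-map
# memory under a read-address constraint and stride, then rearrange them.
def fetch_64(feature_map, min_addr, max_addr, stride, rearrange=lambda w: w):
    addrs = list(range(min_addr, max_addr, stride))[:64]  # respect address bounds
    window = [feature_map[a] for a in addrs]              # 64 inputs at once
    return rearrange(window)                              # instruction-selected mode
```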
Preferably, the data post-processing instruction includes one or more of pooling, activation, fixed-point cutting, rounding, and vector element-wise addition; its granularity is a multiple of 64 data per operation. The configurable parameters of the data post-processing operation instruction comprise the pooling type, pooling size, activation category, and fixed-point cutting position.
Preferably, the computation instruction includes performing vector inner product operations under different vector-length allocations; its granularity is 32, the basic computation unit being a pair of vector inner product modules of length 32; the configurable parameter of the computation operation instruction is the number of output results.
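To make the granularity concrete, here is a hedged sketch of how a longer inner product could be decomposed into the length-32 basic units described above; the splitting strategy is an assumption for illustration:

```python
# Sketch of the length-32 basic unit: longer vectors (multiples of 32 in common
# CNN structures) are split into 32-element segments whose partial inner
# products are accumulated.
def inner_product_32(a, b):
    assert len(a) == len(b) == 32
    return sum(x * y for x, y in zip(a, b))

def inner_product(vec_a, vec_b):
    total = 0
    for i in range(0, len(vec_a), 32):
        total += inner_product_32(vec_a[i:i + 32], vec_b[i:i + 32])
    return total
```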
Preferably, the unconditional instruction provides parameter updates, the parameters comprising: the length, width, and channel number of the on-chip feature map storage module; the input length, width, input channel number, and output channel number of the current layer; the read-storage operation start address and read mode selection; the write-storage operation start address and write mode selection; the data fetch mode and constraints; the computation mode setting; pooling operation parameters; activation operation parameters; and data shift, cutting, and rounding operations.
Preferably, the method further comprises setting an instruction sequence definition mode, specifically: if the instruction sequence contains multiple consecutive repeated instructions, a single instruction is set and executed repeatedly until the contents of the trigger-condition register and the parameter register are updated.
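A small illustrative sketch of this definition mode, under the assumption that repeated instructions compare equal; the hardware analogue re-executes the held instruction rather than storing copies:

```python
# Illustrative compression of consecutive repeated instructions: only the first
# of each run is kept; the hardware re-executes the held instruction until the
# trigger-condition and parameter registers are updated.
def compress_sequence(instructions):
    compressed = []
    for ins in instructions:
        if not compressed or compressed[-1] != ins:
            compressed.append(ins)
    return compressed

# e.g. compress_sequence(["r", "r", "r", "c", "c", "w"]) == ["r", "c", "w"]
```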
Preferably, the method further comprises defining the instruction length, wherein the instruction length is a uniform length.
Preferably, the minimum unit of the parallel input and output channels corresponding to the computation mode is a vector inner product of length 32.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. the method comprises defining conditional instructions, defining unconditional instructions, and setting the instruction granularity. Unconditional instructions provide configuration parameters for conditional instructions; each conditional instruction sets a trigger condition that is hard-wired in hardware, together with a corresponding register, and is executed once the trigger condition is met, while unconditional instructions are executed directly after being read and replace the contents of the parameter registers. This avoids the problem that instruction ordering cannot be predicted when operation cycles carry large uncertainty, so the instruction order can be predicted accurately. The computation mode is determined from the CNN network, the acceleration requirement, and the selected parallel input and output channels, and the instruction granularity is set accordingly, so that networks of different structures are mapped and reorganized onto one specific structure; the parallel computation mode adapts to the kernel sizes of networks of different scales, solving the universality of the processor corresponding to the instruction set: the CNN acceleration processor completes the configuration of different target networks from instructions, providing an applicable OPU instruction set for accelerating general CNNs;
2. the conditional instructions of the invention set trigger conditions, avoiding the drawback of instruction sequences that execute entirely in a fixed preset order over a long time; memory reads operating continuously in the same mode need not be issued at fixed intervals, which greatly shortens the instruction sequence and facilitates accelerating different target networks through rapid instruction configuration;
3. the computation mode is determined from the statistical results and the selected parallel input and output channels, and the instruction granularity is set; the parallel input-channel portion can be adjusted by parameters to compute more output channels simultaneously, or more input channels in parallel to reduce the number of computation rounds; since input and output channel counts are multiples of 32 in common CNN structures, choosing 32 as the basic unit effectively guarantees maximum utilization of the computation units;
4. the unconditional instructions provide parameter updates, and parameters with synchronized update frequencies are grouped into the same unconditional instruction, making full use of all bit positions of the instruction and reducing the total number of instructions issued;
5. when multiple consecutive repeated instructions occur, only one instruction is set and executed repeatedly, the trigger-condition register and parameter register holding their contents until updated, so that different target networks can be accelerated through rapid instruction configuration.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of an instruction set definition method of the present invention;
FIG. 2 is a schematic diagram of conditional instruction triggered operation according to the present invention;
FIG. 3 is a diagram illustrating a parallel computing scheme according to the present invention;
FIG. 4 is a schematic diagram of an instruction set according to the present invention;
FIG. 5 is a schematic diagram of an instruction set-based OPU according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises that element.
The features and properties of the present invention are described in further detail below with reference to examples.
Example 1
An OPU instruction set definition method for CNN acceleration comprises defining conditional instructions, defining unconditional instructions and setting instruction granularity;
when the defined instruction set is used for CNN acceleration, the instruction type of the instruction, the operation corresponding to each instruction, the conventional parameter definition and the instruction granularity need to be defined, wherein the conventional parameter definition comprises the instruction length and the instruction sequence. The OPU instruction comprises the following steps: reading an instruction block (the instruction set is a set list of all instructions; the instruction block is an instruction of a group of continuous instructions, and the instruction for executing a network comprises a plurality of instruction blocks); step 2: acquiring unconditional instructions in the instruction block, directly executing the unconditional instructions, decoding parameters contained in the unconditional instructions, and writing the parameters into corresponding registers; acquiring a conditional instruction in the instruction block, setting a trigger condition according to the conditional instruction, and then jumping to the step 3; and step 3: judging whether the trigger condition is met, if so, executing a conditional instruction; if not, the instruction is not executed; and 4, step 4: judging whether a reading instruction of the next instruction block contained in the instruction meets a trigger condition, if so, returning to the step 1 to continue executing the instruction; otherwise, the register parameters and the trigger conditions set by the current condition instruction are kept unchanged until the trigger conditions are met.
The instruction set defined by the present application is used in an OPU-based CNN acceleration system; the structural schematic of the OPU is shown in FIG. 5. The OPU is implemented on an FPGA or ASIC; final operation instructions are generated from the defined instructions, and the OPU operation instructions can accelerate different target CNN networks. The technical means adopted are: defining conditional instructions, defining unconditional instructions, and setting the instruction granularity, following the flow chart of FIG. 1. Defining the conditional instructions covers constructing their composition, namely the instruction types shown in Table 1; setting their registers and execution mode, where a conditional instruction is executed after its hardware-written trigger condition is met and the registers comprise parameter registers and trigger-condition registers; and setting their parameter configuration mode, namely configuring parameters according to the unconditional instructions. Defining the unconditional instructions covers defining their parameters and their execution mode, namely direct execution. The instruction length is defined as a uniform length, and the structure of the instruction set is shown in FIG. 4. Setting the instruction granularity covers gathering statistics on CNN networks and acceleration requirements, then determining the computation mode with the selected parallel input and output channels according to the statistical results and setting the instruction granularity. The OPU instruction set includes conditional (C-type) instructions and unconditional (U-type) instructions; the resulting instruction sequence is shown in FIG. 4;
wherein the conditional instruction comprises the composition shown in table 1:
TABLE 1
Name of instruction Instruction function
r Read store instruction
w Write store instruction
f Data fetching instruction
c Computing instructions
p Data post-processing instructions
The instruction granularity of each instruction type is set according to the CNN network structure and the acceleration requirement: the read-store instruction's granularity is set to n numbers read in per operation, where n > 1, according to the CNN acceleration characteristics; the write-store instruction's granularity is set to n numbers written out per operation, where n > 1; the data fetch instruction's granularity is a multiple of 64 according to the CNN network structure, i.e. 64 input data are operated on simultaneously; the data post-processing instruction's granularity is a multiple of 64 data per operation; and the computation instruction's granularity is 32, because the network input and output channel counts are multiples of 32.
The parameters defined by the unconditional instruction are shown in Table 2:
TABLE 2
(In the original publication, Table 2 is reproduced as an image. As enumerated in the description, its parameters are: the length, width and channel number of the on-chip feature map storage module; the input length, width, input channel number and output channel number of the current layer; the read-storage operation start address and read mode selection; the write-storage operation start address and write mode selection; the data fetch mode and constraints; the computation mode setting; pooling operation parameters; activation operation parameters; and data shift, cutting and rounding operations.)
The computation mode uses parallel input and output channels: by adjusting the parallel input-channel portion through parameters, more output channels can be computed simultaneously, or more input channels can run in parallel to reduce the number of computation rounds; in common CNN structures the input and output channel counts are multiples of 32. When the instruction set is used for CNN acceleration, the parallel input/output-channel computation mode is shown schematically in FIG. 3: in each clock cycle, a segment of size 1 × 1 with a depth of ICS input channels, together with the corresponding kernel elements, is read; these elements follow the natural data storage pattern and require only very small bandwidth. Parallelism is achieved across the input channels (ICS) and the output channels (OCS, the number of kernel sets involved). FIG. 3(c) further illustrates the computation process: in cycle 0 of round 0, the input-channel slice at location (0, 0) is read; in the next cycle we skip by stride x and read location (0, 2); reading continues until all pixels corresponding to kernel location (0, 0) have been computed. We then move to round 1 and start reading all pixels from position (0, 1), corresponding to kernel location (0, 1). To compute a data block of size IN × IM × IC with OC kernel sets, Kx × Ky × (IC/ICS) × (OC/OCS) rounds are needed. This computation mode unifies data of any kernel size or stride, greatly simplifies data management before computation, achieves higher efficiency with less resource consumption, and adapts to the kernel sizes of networks of different scales.
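As a worked illustration of the round count derived above (Kx × Ky × (IC/ICS) × (OC/OCS)), assuming the slice sizes ICS and OCS divide IC and OC evenly:

```python
# Worked round count for the parallel input/output-channel mode
# (names follow FIG. 3; divisibility of IC by ICS and OC by OCS is assumed).
def num_rounds(Kx, Ky, IC, OC, ICS, OCS):
    return Kx * Ky * (IC // ICS) * (OC // OCS)

# Example: a 3x3 kernel, IC=64, OC=128 with 32-channel slices needs
# 3 * 3 * (64/32) * (128/32) = 72 rounds.
assert num_rounds(3, 3, 64, 128, 32, 32) == 72
```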
In summary, because existing FPGA acceleration work aims at generating a specific, individual accelerator for each CNN, the present application provides an acceleration processor that realizes different networks without reconfiguring the FPGA, controlled by the instructions defined herein; the prior art offers no technical suggestion toward the instructions of the OPU instruction set defined in this application, since they differ from the hardware, system, and coverage of prior FPGA acceleration systems. The application determines the computation mode and sets the instruction granularity according to the CNN network, the acceleration requirement, and the selected parallel input and output channels, so that networks of different structures are mapped and reorganized onto one specific structure, adapting to the kernel sizes of networks of different scales and solving the universality of the processor corresponding to the instruction set. The OPU instruction set and its corresponding OPU processor can be implemented on an FPGA or ASIC; the universality of the instruction-controlled processor is improved, and the OPU can accelerate different target CNN networks while avoiding hardware reconfiguration.
Example 2
Based on embodiment 1, the conditional instruction types of the present application comprise: a read storage instruction, a write storage instruction, a data fetch instruction, a data post-processing instruction, and a computation instruction. A conditional instruction is executed after its hardware-written trigger condition is met; the conditional instruction registers comprise parameter registers and trigger-condition registers; and conditional instructions are parameter-configured according to the unconditional instructions.
The read-store instruction includes performing a read-store operation in mode A1 or mode A2; the configurable parameters of the read-store operation instruction comprise the start address, the number of operands, the post-read processing mode, and the on-chip storage location.
Mode A1: read n numbers contiguously starting from the designated address, where n is a positive integer;
mode A2: read n numbers according to an address stream whose addresses are discontinuous, with three post-read processing options: 1, no operation after reading; 2, splicing the read data to a specified length; 3, splitting the read data into specified lengths; and four on-chip storage locations after reading: the feature map storage module, the inner product parameter storage module, the bias parameter storage module, and the instruction storage module;
the write-store instruction comprises performing write-store operation according to a mode B1 and performing write-store operation according to a mode B2; the write store operation instruction configurable parameters include a starting address and a number of operands.
Mode B1: writing n numbers backwards from the designated address;
mode B2: writing n numbers according to a target address stream, wherein addresses in the address stream are discontinuous;
the data grabbing instruction comprises the operations of reading data from the on-chip characteristic diagram memory and the inner product parameter memory according to different data reading modes and data recombination arrangement modes and performing recombination arrangement operation on the read data; the data grabbing and recombining operation instruction configurable parameters comprise a characteristic graph reading memory and an inner product reading parameter memory, wherein the characteristic graph reading memory comprises reading address constraints, namely a minimum address and a maximum address, a reading step size and a rearrangement mode; the read inner product parameter memory comprises a read address constraint and a read mode.
The data post-processing instruction includes one or more of pooling, activation, fixed-point cutting, rounding, and vector element-wise addition; its configurable parameters comprise the pooling type, pooling size, activation category, and fixed-point cutting position.
The computation instruction includes performing vector inner product operations under different vector-length allocations; the basic computation unit adopted for the vector inner product operation is a pair of vector inner product modules of length 32 (a vector of length 32 comprises 32 pieces of 8-bit data), and the configurable parameter of the computation operation instruction is the number of output results.
A conditional instruction sets a trigger condition that is hard-wired in hardware, together with a corresponding register; read-storage, write-storage, data fetch, data post-processing, and computation are executed after the trigger condition is met. After the instructions are compiled, the OPU reads them upon a start signal sent from a GUI (graphical user interface), then operates according to the parallel computation mode defined by the instructions, completing the acceleration of different target networks. The trigger conditions are hard-wired in hardware; for example, the memory-read module instruction has 6 trigger conditions, including: 1. trigger when the last memory read is complete and the last data fetch and reorganization is complete; 2. trigger when the last data write-store operation is complete; 3. trigger when the last data post-processing operation is complete; and so on. Conditional instructions with trigger conditions avoid the drawback of instruction sequences that execute entirely in a preset order over a long time: memory reads operating continuously in the same mode need not be issued at fixed intervals, which greatly shortens the instruction sequence, further speeds up instruction execution, and facilitates accelerating different target networks through rapid instruction configuration. As shown in FIG. 2, for the two operations read and write, an initial TCI (trigger condition instruction) is set at t0 and the memory read triggers at t1, executing from t1 to t5; the TCI for the next trigger condition can be updated at any point between t1 and t5, the current TCI being held until a new instruction replaces it. In this case, when memory reads operate continuously in the same mode, no new instruction is needed (operations are triggered by the same TCI at times t6 and t12), which shortens the instruction sequence by more than 10×. Meanwhile, a conditional instruction is executed only after its trigger condition is met, and its configuration parameters are provided by the unconditional parameter instructions, so instruction execution is precise, avoiding the instruction stalls caused by the high uncertainty of the prior art.
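The TCI holding behaviour of FIG. 2 can be sketched as follows; the class and method names are illustrative assumptions only:

```python
# Illustrative model of the TCI holding behaviour in FIG. 2: the current TCI is
# held and may fire repeatedly (e.g. at t6 and t12) until a new instruction
# replaces it.
class TriggerRegister:
    def __init__(self, tci):
        self.tci = tci                     # current trigger condition (a predicate)

    def update(self, new_tci):
        self.tci = new_tci                 # may occur any time before the next trigger

    def maybe_fire(self, state, operation):
        if self.tci(state):                # same TCI re-fires for repeated reads
            operation()
            return True
        return False
```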
Example 3
Based on embodiment 1, when accelerating a CNN there are multiple consecutive repeated instructions in the instruction sequence, so the instruction-sequence definition mode is specified when defining the instruction set: if the instruction sequence contains multiple consecutive repeated instructions, only one instruction is set and executed repeatedly until the contents of the trigger-condition register and parameter register are updated. When multiple consecutive repeated instructions occur, only the first is defined; the trigger-condition register and parameter register hold their contents until updated, and different target networks are accelerated through rapid instruction configuration.
An unconditional instruction must define many parameters, making the corresponding instruction long. To reduce the instruction length, a unified scheme for unconditional-instruction parameters is defined: parameters with synchronized update frequencies are grouped into the same unconditional instruction, making full use of all bits of the instruction, reducing the total number of instructions issued, greatly shortening the instruction length, and facilitating the acceleration of different target networks through rapid instruction configuration.
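A hedged sketch of this grouping rule; the (name, update-frequency, value) representation is an assumption used only to illustrate packing same-frequency parameters into one U-type instruction, not the patent's actual field layout:

```python
# Illustrative packing rule: parameters with the same update frequency are
# grouped into one unconditional (U-type) instruction.
from collections import defaultdict

def pack_unconditional(params):
    groups = defaultdict(dict)
    for name, freq, value in params:
        groups[freq][name] = value         # same frequency -> same instruction
    # one unconditional instruction per update-frequency class
    return list(groups.values())
```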
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (4)

1. A method for defining an OPU instruction set for CNN acceleration, comprising: the method comprises the steps of defining conditional instructions, defining unconditional instructions and setting instruction granularity;
defining the conditional instruction includes the steps of:
constructing conditional instructions, wherein the conditional instructions comprise a read storage instruction, a write storage instruction, a data fetch instruction, a data post-processing instruction and a calculation instruction;
setting registers and an execution mode for the conditional instructions, wherein a conditional instruction is executed after its hardware-written trigger condition is met, and the registers comprise parameter registers and trigger-condition registers;
setting the parameter configuration mode of the conditional instruction, wherein parameters are configured according to the unconditional instruction;
defining unconditional instructions includes the steps of:
defining parameters of the unconditional instruction;
defining the execution mode of the unconditional instruction, wherein the instruction is executed directly after being read;
the step of setting the instruction granularity comprises the following steps:
gathering statistics on CNN networks and acceleration requirements;
determining a computation mode with the selected parallel input and output channels according to the statistical results, and setting the instruction granularity;
the read-store instruction comprises performing a read-store operation in mode A1 or mode A2, its granularity being n numbers read in per operation, where n > 1;
mode A1: reading n numbers contiguously starting from the designated address;
mode A2: reading n numbers according to an address stream whose addresses are discontinuous, with three post-read processing options: 1, no operation after reading; 2, splicing the read data to a specified length; 3, splitting the read data into specified lengths; and four on-chip storage locations after reading: the feature map storage module, the inner product parameter storage module, the bias parameter storage module and the instruction storage module;
the configurable parameters of the read-store operation instruction comprise a start address, the number of operands, the post-read processing mode and the on-chip storage location;
the write-store instruction comprises performing a write-store operation in mode B1 or mode B2, its granularity being n numbers written out per operation, where n > 1;
mode B1: writing n numbers contiguously starting from the designated address;
mode B2: writing n numbers according to a target address stream whose addresses are discontinuous;
the configurable parameters of the write-store operation instruction comprise a start address and the number of operands;
the data fetch instruction comprises reading data from the on-chip feature map memory and the inner product parameter memory according to different data-read modes and data reorganization/arrangement modes, and performing reorganization and arrangement on the read data, its granularity being 64 input data operated on simultaneously; the configurable parameters of the data fetch instruction comprise reading the feature map memory, namely a read-address constraint (minimum and maximum address), a read stride and a rearrangement mode, and reading the inner product parameter memory, namely a read-address constraint and a read mode;
the data post-processing instruction comprises one or more of pooling, activation, fixed-point cutting, rounding and vector element-wise addition, its granularity being a multiple of 64 data per operation; the configurable parameters of the data post-processing operation instruction comprise a pooling type, a pooling size, an activation category and a fixed-point cutting position;
the computation instruction comprises performing vector inner product operations under different vector-length allocations, its granularity being 32; the basic computation unit adopted for the vector inner product operation is a pair of vector inner product modules of length 32, and the configurable parameter of the computation operation instruction is the number of output results;
the unconditional instruction provides parameter updates, the parameters comprising the length, width and channel number of the on-chip feature map storage module, the input length, width, input channel number and output channel number of the current layer, the read-storage operation start address and read mode selection, the write-storage operation start address and write mode selection, the data fetch mode and constraints, the computation mode setting, pooling operation parameters, activation operation parameters, data shift setting, and cutting and rounding operations.
2. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the method further comprises setting an instruction sequence definition mode, specifically: if the instruction sequence is a plurality of continuous repeated instructions, a single instruction is set, and the instruction is repeatedly executed until the contents of the trigger condition register and the parameter register are updated.
3. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the method also comprises the step of defining the instruction length, wherein the instruction length is a uniform length.
4. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the minimum unit of the parallel input and output channels corresponding to the computation mode is a vector inner product of length 32.
CN201910192455.0A 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration Active CN110058882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910192455.0A CN110058882B (en) 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910192455.0A CN110058882B (en) 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration

Publications (2)

Publication Number Publication Date
CN110058882A CN110058882A (en) 2019-07-26
CN110058882B true CN110058882B (en) 2023-01-06

Family

ID=67316909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910192455.0A Active CN110058882B (en) 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration

Country Status (1)

Country Link
CN (1) CN110058882B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516790B (en) * 2019-08-16 2023-08-22 浪潮电子信息产业股份有限公司 Convolutional network acceleration method, device and system
CN111563579B (en) * 2020-04-28 2023-09-22 深圳市易成自动驾驶技术有限公司 CNN acceleration method, device, equipment and storage medium based on data stream
CN111651207B (en) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 Neural network model operation chip, method, device, equipment and medium
CN111932436B (en) * 2020-08-25 2024-04-19 成都恒创新星科技有限公司 Deep learning processor architecture for intelligent parking
CN112257843B (en) * 2020-09-23 2022-06-28 浙江大学 System for expanding instruction set based on MobileNet V1 network inference task

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6088783A (en) * 1996-02-16 2000-07-11 Morton; Steven G DPS having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
CA2725130A1 (en) * 2008-05-29 2009-12-03 Axis Semiconductor Inc. Method & apparatus for real-time data processing
CN104834503A (en) * 2014-02-12 2015-08-12 想象技术有限公司 Processor with granular add immediates capability & methods
WO2016171846A1 (en) * 2015-04-23 2016-10-27 Google Inc. Compiler for translating between a virtual image processor instruction set architecture (isa) and target hardware having a two-dimensional shift array structure
CN107533750A (en) * 2015-04-23 2018-01-02 谷歌公司 Virtual Image Processor instruction set architecture(ISA)With memory model and the exemplary goal hardware with two-dimensional shift array structure
CN107679620A (en) * 2017-04-19 2018-02-09 北京深鉴科技有限公司 Artificial neural network processing unit
CN107704922A (en) * 2017-04-19 2018-02-16 北京深鉴科技有限公司 Artificial neural network processing unit
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration; Mohamed S. Abdelfattah et al.; arXiv; 20180713; full text *
Design and Implementation of a Microprocessor with Convolutional Neural Network Extension Instructions (具有卷积神经网络扩展指令的微处理器的设计与实现); Ma Ke (马珂); China Master's Theses Full-text Database; 20181215; full text *
Research on an FPGA-based Parallel Acceleration Architecture for Convolutional Neural Networks (基于FPGA的卷积神经网络并行加速体系架构的研究); Yin Wei (殷伟); China Master's Theses Full-text Database; 20190215; full text *

Also Published As

Publication number Publication date
CN110058882A (en) 2019-07-26

Similar Documents

Publication Publication Date Title
CN110058882B (en) OPU instruction set definition method for CNN acceleration
CN110058883B (en) CNN acceleration method and system based on OPU
CN109993299B (en) Data training method and device, storage medium and electronic device
US10740674B2 (en) Layer-based operations scheduling to optimise memory for CNN applications
US10783436B2 (en) Deep learning application distribution
Heo et al. Real-time object detection system with multi-path neural networks
US20170169326A1 (en) Systems and methods for a multi-core optimized recurrent neural network
CN110717574B (en) Neural network operation method and device and heterogeneous intelligent chip
CN110187965B (en) Operation optimization and data processing method and device of neural network and storage medium
CN108304925B (en) Pooling computing device and method
CN109189572B (en) Resource estimation method and system, electronic equipment and storage medium
US20130268941A1 (en) Determining an allocation of resources to assign to jobs of a program
CN113515382B (en) Cloud resource allocation method and device, electronic equipment and storage medium
KR20220145848A (en) Intelligent buffer tracking system and method for optimized dataflow within integrated circuit architectures
CN114217966A (en) Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
CN108470211B (en) Method and device for realizing convolution calculation and computer storage medium
CN112748993A (en) Task execution method and device, storage medium and electronic equipment
US20190286971A1 (en) Reconfigurable prediction engine for general processor counting
CN110069284B (en) Compiling method and compiler based on OPU instruction set
Feljan et al. Task allocation optimization for multicore embedded systems
CN111767121A (en) Operation method, device and related product
CN111027688A (en) Neural network calculator generation method and device based on FPGA
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
CN112732638B (en) Heterogeneous acceleration system and method based on CTPN network
CN114090219A (en) Scheduling system, method, device, chip, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20200616

Address after: Room 305, building 9, meizhuang new village, 25 Yangzi Jiangbei Road, Weiyang District, Yangzhou City, Jiangsu Province 225000

Applicant after: Liang Lei

Address before: 610094 China (Sichuan) Free Trade Pilot Area, Chengdu City, Sichuan Province, 1402, Block 199, Tianfu Fourth Street, Chengdu High-tech Zone

Applicant before: Chengdu Star Innovation Technology Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20221215

Address after: 518017 1110, Building 3, Northwest Shenjiu Science and Technology Pioneer Park, the intersection of Taohua Road and Binglang Road, Fubao Community, Fubao Street, Shenzhen, Guangdong

Applicant after: Shenzhen biong core technology Co.,Ltd.

Address before: Room 305, Building 9, Meizhuang New Village, No. 25, Yangzijiang North Road, Weiyang District, Yangzhou City, Jiangsu Province, 225000

Applicant before: Liang Lei

GR01 Patent grant
GR01 Patent grant