CN110502278B - Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof - Google Patents

Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof

Info

Publication number
CN110502278B
Authority
CN
China
Prior art keywords
instruction
unit
extended
control unit
cpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910671987.2A
Other languages
Chinese (zh)
Other versions
CN110502278A (en)
Inventor
廖裕民
张义航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rockchip Electronics Co Ltd
Original Assignee
Rockchip Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rockchip Electronics Co Ltd filed Critical Rockchip Electronics Co Ltd
Priority to CN201910671987.2A priority Critical patent/CN110502278B/en
Publication of CN110502278A publication Critical patent/CN110502278A/en
Application granted granted Critical
Publication of CN110502278B publication Critical patent/CN110502278B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181Instruction operation extension or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Neurology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The invention provides a neural network coprocessor based on RiscV extended instructions and a coprocessing method thereof. The neural network coprocessor comprises an extended instruction arithmetic unit connected to a RiscV CPU. When receiving an extended instruction request from the RiscV CPU, the extended instruction arithmetic unit classifies each extended instruction into one of several instruction operation levels according to its dependency relationships, configures the operation units required by each extended instruction at each operation level, the connection relationships among those operation units, and their degree of parallelism, completes the instruction operation according to this configuration information, and outputs the operation result to the RiscV CPU. By giving the coprocessor programmable and configurable capability, the invention provides support for realizing new functions and adapting to any new algorithm.

Description

Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof
Technical Field
The invention relates to a coprocessor and a coprocessing method thereof.
Background
A coprocessor is a chip that relieves the system microprocessor of specific processing tasks. For example, a math coprocessor handles numerical processing and a graphics coprocessor handles video rendering; an Intel Pentium microprocessor, for instance, includes a built-in math coprocessor.
A coprocessor may be attached to an ARM processor. It extends the core's processing functionality by extending the instruction set or providing configuration registers. One or more coprocessors may be connected to the ARM core through the coprocessor interface. An ARM microprocessor can support up to 16 coprocessors for various coprocessing operations; during program execution, each coprocessor executes only the coprocessor instructions addressed to it and ignores instructions intended for the ARM processor and for other coprocessors. ARM coprocessor instructions are mainly used by the ARM processor to initiate data processing operations on the coprocessor, to transfer data between ARM processor registers and coprocessor registers, and to transfer data between coprocessor registers and memory.
However, a conventional coprocessor is a fixed circuit once it leaves the factory: it cannot be programmed or reconfigured afterwards and can only perform the operation and acceleration of specific algorithms. With the rapid development of high-speed computing technology, new algorithms emerge constantly, and a conventional coprocessor that accelerates only fixed algorithms clearly cannot keep pace with this development.
Therefore, the invention provides a programmable and configurable coprocessor that realizes its programmability and configurability through RiscV CPU extended instructions, so as to overcome the fixed-circuit limitation of the prior art.
RiscV (RISC-V, pronounced "risk-five") is a completely new instruction set architecture originally created in 2010 by Krste Asanovic, Andrew Waterman and Yunsup Lee, developers in the Computer Science Division of the EECS department at the University of California, Berkeley. "RISC" stands for reduced instruction set computer, and "V" marks it as the fifth generation of RISC instruction sets designed at Berkeley since RISC-I.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a RiscV extended instruction-based neural network coprocessor and a coprocessing method thereof, in which the coprocessor realizes new functions through programming and configuration so that it can adapt to any new algorithm.
The coprocessor of the invention is realized as follows: a neural network coprocessor based on RiscV extended instructions comprises an extended instruction arithmetic unit, wherein the extended instruction arithmetic unit is connected to a RiscV CPU; when receiving an extended instruction request from the RiscV CPU, the extended instruction arithmetic unit classifies each extended instruction into one of several instruction operation levels according to its dependency relationships, configures the operation units required by each extended instruction at each operation level, the connection relationships among those operation units, and their degree of parallelism, completes the instruction operation according to this configuration information, and outputs the operation result to the RiscV CPU.
Furthermore, the extended instruction arithmetic unit comprises an instruction decoding distribution unit, an instruction level mapping storage unit, a result output selection unit, an arithmetic basic unit array consisting of a plurality of basic arithmetic circuits, a plurality of levels of interconnection control units, and a plurality of levels of interconnection configuration units;
the instruction level mapping storage unit is respectively connected with the instruction decoding distribution unit and the result output selection unit;
the input and the output of each basic arithmetic circuit are respectively connected, through the interconnection control unit of each level, to the output and the input of any other basic arithmetic circuit;
the interconnection control unit of each level is connected with the instruction decoding distribution unit, the result output selection unit and the interconnection configuration unit of the corresponding level.
Further, the arithmetic basic unit array includes:
an adder group composed of a plurality of adders;
a multiplier group composed of a plurality of multipliers;
an OR operation group composed of a plurality of OR operators;
an inverse operation group composed of a plurality of inverse operators;
the LUT lookup operation group consists of a plurality of LUT table storage units and a plurality of LUT lookup configuration units, and each LUT table storage unit is correspondingly connected with one LUT lookup configuration unit;
the interconnection control unit includes:
a first-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
a second-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
a third-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
the interconnection configuration unit includes:
a first-level interconnection configuration unit connected with the first-level interconnection control unit;
a second-level interconnection configuration unit connected with the second-level interconnection control unit;
and a third-level interconnection configuration unit connected with the third-level interconnection control unit.
Furthermore, the coprocessor further comprises a multi-beat instruction delay mapping storage unit and a completion feedback control unit; the multi-beat instruction delay mapping storage unit is connected to the RiscV CPU through the completion feedback control unit;
The multi-beat instruction delay mapping storage unit stores the multi-beat delay information corresponding to every extended instruction. After receiving an extended instruction request from the RiscV CPU, it looks up the number of delay beats corresponding to that extended instruction and sends this number to the completion feedback control unit. The completion feedback control unit then, according to the delay clock beat number of the current extended instruction sent by the multi-beat instruction delay mapping storage unit, sets the completion state feedback signal valid once the number of beats elapsed since the instruction request reaches that value, and informs the RiscV CPU to sample and receive the operation result.
Furthermore, the coprocessor further comprises an extended instruction pipeline state storage unit and an extended instruction pipeline control unit; the extended instruction pipeline state storage unit is connected to the RiscV CPU through the extended instruction pipeline control unit;
The extended instruction pipeline state storage unit stores, for every extended instruction, information on whether the instruction requires the CPU to stop the pipeline; after receiving an extended instruction request from the RiscV CPU, it looks up whether the current extended instruction requires the CPU to stop the pipeline and sends this information to the extended instruction pipeline control unit. The extended instruction pipeline control unit then transmits the received information to the RiscV CPU and informs the RiscV CPU whether to stop the current pipeline operation until the extended instruction finishes the operation.
The method of the invention is realized as follows: in a neural network coprocessing method based on RiscV extended instructions, the configurable coprocessor is provided with an extended instruction arithmetic unit; the extended instruction arithmetic unit receives an extended instruction request sent by the RiscV CPU, classifies each extended instruction into one of several instruction operation levels according to its dependency relationships, configures the arithmetic units required by each extended instruction at each operation level, the connection relations among those arithmetic units, and the parallelism of each arithmetic unit, completes the instruction operation according to the configuration information, and outputs the operation result to the RiscV CPU.
Furthermore, the extended instruction arithmetic unit comprises an instruction decoding distribution unit, an instruction level mapping storage unit, a result output selection unit, an arithmetic basic unit array consisting of a plurality of basic arithmetic circuits, a plurality of levels of interconnection control units, and a plurality of levels of interconnection configuration units;
the instruction level mapping storage unit is respectively connected with the instruction decoding distribution unit and the result output selection unit; the input and the output of each basic arithmetic circuit are respectively connected, through the interconnection control unit of each level, to the output and the input of any other basic arithmetic circuit; the interconnection control unit of each level is connected with the instruction decoding distribution unit, the result output selection unit and the interconnection configuration unit of the corresponding level.
The instruction decoding distribution unit reads, from the instruction level mapping storage unit, the level to which the current extended instruction corresponds and distributes the current extended instruction to the interconnection control unit of the corresponding level for operation;
after receiving the extended instruction, the interconnection control unit at that level queries the interconnection configuration unit at the corresponding level for the arithmetic units required by the current extended instruction, the connection relations among those units, and the configuration information of their degree of parallelism; it then configures the corresponding interconnection form according to the configuration information and controls the data flow;
the interconnection control unit at that level then starts the instruction operation through the basic arithmetic circuits, and sends the result to the result output selection unit after the instruction operation is finished.
Furthermore, the configurable coprocessor further comprises a multi-beat instruction delay mapping storage unit and a completion feedback control unit;
the multi-beat instruction delay mapping storage unit stores multi-beat delay information corresponding to all the extended instructions, inquires the delay beat number corresponding to the extended instruction after receiving the extended instruction request from the RiscV CPU, and sends the delay beat number to the completion feedback control unit;
and the completion feedback control unit, according to the delay clock beat number of the current extended instruction sent by the multi-beat instruction delay mapping storage unit, sets the completion state feedback signal valid once the number of beats elapsed since the instruction request reaches that value, and informs the RiscV CPU to sample and receive the operation result.
Further, the configurable coprocessor further comprises an extended instruction pipeline state storage unit and an extended instruction pipeline control unit;
the extended instruction pipeline state storage unit stores information whether all extended instructions need the CPU to stop the pipeline, inquires whether the current extended instructions need the CPU to stop the pipeline after receiving an extended instruction request from the RiccV CPU, and sends the information to the extended instruction pipeline control unit;
and the extended instruction pipeline control unit transmits the received information to the RiscV CPU and informs the RiscV CPU whether to stop the current pipeline operation until the extended instruction finishes the operation.
The invention has the following advantages:
1. The extended instruction arithmetic unit in the coprocessor is configurable: the circuit can realize new functions through programming and configuration, so it can adapt at will to new neural network structures and new algorithms, overcoming the fixed-circuit limitation of the prior art;
2. By extending the instruction set, the coprocessor couples tightly with the emerging RiscV CPU instruction set, so the circuit can be seamlessly attached to a RiscV CPU to form a neural network coprocessor based on RiscV extended instructions;
3. Through the extended instruction arithmetic unit, the coprocessor realizes an instruction system classified into multiple complexity levels, so that the programmed operation structure of each level can be called by higher-level instructions, and the circuit works more efficiently.
Drawings
The invention will be further described with reference to the following embodiments and the accompanying drawings.
FIG. 1 is a block diagram of the overall circuit structure of a coprocessor according to an embodiment of the present invention.
FIG. 2 is a block diagram of an extended instruction arithmetic unit in the coprocessor of the present invention.
FIG. 3 is a block diagram of the overall circuit structure of another embodiment of the coprocessor of the present invention.
FIG. 4 is a block diagram of the overall circuit structure of a coprocessor according to another embodiment of the present invention.
Detailed Description
Example one
Fig. 1 shows an embodiment of the coprocessor of the present invention, which is a RiscV extended instruction-based neural network coprocessor, including an extended instruction arithmetic unit connected to a RiscV CPU.
The RiscV CPU is the main CPU and is responsible for running programs. When a native instruction (i.e., a non-extended instruction) is executed, the instruction operation is completed inside the RiscV CPU; when the instruction to be executed is an extended instruction, it follows the user's definition of that extended instruction (for example, the user may define the extended instruction as a convolution operation instruction, which is not a native instruction but an extended one), and the extended instruction request is then sent to the extended instruction operation unit.
When receiving an extended instruction request from the RiscV CPU, the extended instruction operation unit classifies each extended instruction into one of several instruction operation levels according to its dependency relationships, configures the operation units required by each extended instruction at each operation level, the connection relationships among those operation units, and their degree of parallelism, completes the instruction operation according to this configuration information, and outputs the operation result to the RiscV CPU.
As shown in fig. 2, the extended instruction arithmetic unit includes an instruction decoding distribution unit, an instruction level mapping storage unit, a result output selection unit, an arithmetic basic unit array composed of a plurality of basic arithmetic circuits, a plurality of levels of interconnection control units, and a plurality of levels of interconnection configuration units;
the instruction level mapping storage unit is respectively connected with the instruction decoding distribution unit and the result output selection unit;
the input and the output of each basic arithmetic circuit are respectively connected, through the interconnection control unit of each level, to the output and the input of any other basic arithmetic circuit;
the interconnection control unit of each level is connected with the instruction decoding distribution unit, the result output selection unit and the interconnection configuration unit of the corresponding level.
Wherein,
the instruction level mapping storage unit is responsible for storing the instruction level corresponding to each extended instruction, according to which the instruction decoding distribution unit distributes the current instruction to the interconnection control unit of the corresponding level for operation;
the instruction decoding distribution unit is responsible for querying the instruction level mapping storage unit for the level corresponding to the instruction after receiving the extended instruction, and then distributing the instruction to the interconnection control unit of the corresponding level for operation;
the arithmetic basic unit array provides the arithmetic units that complete the computation of the extended instructions;
the interconnection configuration unit is responsible for storing, for each extended instruction of its level, the arithmetic units required, the connection relations among those units, and their degree of parallelism;
the interconnection control unit is responsible for, after receiving an instruction, querying its corresponding interconnection configuration unit for the arithmetic units required by the instruction, the connection relations among them, and their degree of parallelism; it then configures itself into the interconnection form described by this configuration information, controls the data flow, starts the instruction operation, and sends the result to the result output selection unit once the operation is finished;
the result output selection unit gates the output result of the interconnection control unit of the corresponding level according to the instruction level information in the instruction level mapping storage unit. A minimal behavioral sketch of this lookup-and-dispatch flow is given below.
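To make the cooperation between these units concrete, the following is a minimal behavioral sketch in Python. It is not part of the patent: the opcode names, table contents and function names are illustrative assumptions, chosen only to show how the level lookup, the per-level interconnection configuration (required units, connection relation, parallelism) and the dispatch fit together.

```python
# Minimal behavioral sketch of the lookup-and-dispatch flow described above.
# All names and table contents are illustrative assumptions, not taken from the patent.

# Instruction level mapping storage unit: extended opcode -> instruction level.
LEVEL_MAP = {"MULADD": 1, "CONV": 2, "CONVACT": 3}

# Interconnection configuration units, one per level: for each extended instruction,
# the operation units needed, the connection relation among them, and the parallelism.
INTERCONNECT_CONFIG = {
    1: {"MULADD": {"units": ["multiplier", "adder"],
                   "connections": [("input", "multiplier"), ("multiplier", "adder")],
                   "parallelism": 8}},
    2: {"CONV": {"units": ["multiplier", "adder"],
                 "connections": [("input", "multiplier"), ("multiplier", "adder"),
                                 ("adder", "adder")],  # adder feedback accumulates
                 "parallelism": 8}},
    3: {"CONVACT": {"units": ["multiplier", "adder", "lut"],
                    "connections": [("input", "multiplier"), ("multiplier", "adder"),
                                    ("adder", "adder"), ("adder", "lut")],
                    "parallelism": 8}},
}

def decode_and_dispatch(opcode):
    """Instruction decoding distribution unit: look up the instruction's level, then
    hand it to that level's interconnection control unit, which reads the required
    units, connections and parallelism from its interconnection configuration unit."""
    level = LEVEL_MAP[opcode]
    config = INTERCONNECT_CONFIG[level][opcode]
    return level, config  # the result output selection unit later gates the output by this level

print(decode_and_dispatch("CONV"))
```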
Specifically, as shown in fig. 2, in this embodiment the arithmetic basic unit array includes:
the adder group consists of a plurality of adders, and the number of the adders is unlimited;
the multiplier group is composed of a plurality of multipliers, and the number of the multipliers is unlimited;
an OR operation group composed of a plurality of OR operators, the number of which is unlimited;
the inverse operation group is composed of a plurality of inverse operators, and the number is unlimited;
the LUT lookup operation group consists of a plurality of LUT table storage units and a plurality of LUT lookup configuration units, the number of the LUT table storage units is not limited, and each LUT table storage unit is correspondingly connected with one LUT lookup configuration unit;
the interconnection control unit includes:
a first-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
a second-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
a third-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
the interconnection configuration unit includes:
a first-level interconnection configuration unit connected with the first-level interconnection control unit;
a second-level interconnection configuration unit connected with the second-level interconnection control unit;
and a third-level interconnection configuration unit connected with the third-level interconnection control unit.
Therefore, each extended instruction can be classified into one of three instruction levels according to its dependency relationships in order to complete its computation. From level one to level three, instruction complexity increases, and higher levels call lower ones. Low-level instructions can adapt to a wide variety of new algorithms and are therefore highly adaptable, but because they must be executed many times, data throughput and instruction count increase and operation efficiency drops; high-level instructions are highly complex, are composed of low-level instructions, and can complete a high-complexity computation in one pass, but their adaptability is lower and they only support some mature high-complexity operations. The multi-level instruction system therefore combines adaptability with execution efficiency.
Examples are:
the first-stage instruction is a multiply-add operation, a multiply-add operation group and an add operation group are needed, the connection relation is that an input value is firstly given to the multiplier group, then the output of the multiplier group is connected to the input of the adder group, the parallelism degree is N, and the method can be used in any new algorithm used for the multiply-add operation.
The second-level instruction is convolution operation, can directly call the structure of the first-level instruction multiply-add operation, and then adds an output result on the basis of the multiply-add operation instruction, and completes the accumulation of the multiply-add result through an adder, thereby realizing the convolution.
The three-level instruction and convolution activation operation can directly call the structure of the convolution operation of the two-level instruction, and then the output result is added on the basis of the convolution instruction and output after the activation operation is completed through the LUT table look-up unit, so that the convolution activation is realized.
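The following Python sketch models these three example levels behaviorally, assuming invented numeric values and an invented sigmoid-like LUT table; it shows how a level-2 convolution reuses the level-1 multiply-add structure and how level 3 adds the LUT activation on top, but it is not the patent's circuit.

```python
import math

# Behavioral sketch of the three example instruction levels above; the activation
# table and all numeric values are invented for illustration.

def lut_lookup(x, table_size=256, lo=-8.0, hi=8.0):
    """Stand-in for a LUT table storage unit holding a sigmoid-like activation."""
    idx = min(table_size - 1, max(0, int((x - lo) / (hi - lo) * table_size)))
    mid = lo + (idx + 0.5) * (hi - lo) / table_size   # centre of the selected table entry
    return 1.0 / (1.0 + math.exp(-mid))

def level1_muladd(a, b, c):
    """Level 1: the multiplier group feeds the adder group (a * b + c)."""
    return a * b + c

def level2_conv(weights, window):
    """Level 2: repeatedly calls the level-1 multiply-add structure and accumulates
    the results through an adder, giving one convolution output."""
    acc = 0.0
    for w, x in zip(weights, window):
        acc = level1_muladd(w, x, acc)
    return acc

def level3_conv_act(weights, window):
    """Level 3: calls the level-2 convolution structure and passes its output
    through the LUT lookup unit to apply the activation."""
    return lut_lookup(level2_conv(weights, window))

print(level3_conv_act([0.5, -0.25, 1.0], [1.0, 2.0, 3.0]))
```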
Example two
As shown in fig. 3, compared with the first embodiment, the coprocessor of the present embodiment further includes a multi-beat instruction delay mapping storage unit and a completion feedback control unit; the multi-beat instruction delay mapping storage unit is connected to the RiscV CPU through the completion feedback control unit; when the extended instruction arithmetic unit needs several clock beats to finish the operation, once the delay beats reach the corresponding value the completion state feedback signal is set valid and the RiscV CPU is informed to sample and receive the operation result.
The multi-beat instruction delay mapping storage unit stores the multi-beat delay information corresponding to every extended instruction. After receiving an extended instruction request from the RiscV CPU, it looks up the number of delay beats corresponding to that extended instruction and sends this number to the completion feedback control unit. The completion feedback control unit then, according to the delay clock beat number of the current extended instruction sent by the multi-beat instruction delay mapping storage unit, sets the completion state feedback signal valid once the number of beats elapsed since the instruction request reaches that value, and informs the RiscV CPU to sample and receive the operation result.
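A cycle-level sketch of this mechanism, with assumed per-instruction beat counts, might look as follows; the table contents and names are illustrative, not taken from the patent.

```python
# Cycle-level sketch of the multi-beat delay mapping and completion feedback
# described above; the per-instruction beat counts are invented examples.

DELAY_BEATS = {"MULADD": 1, "CONV": 4, "CONVACT": 6}  # multi-beat instruction delay mapping

def completion_feedback(opcode, beats_since_request):
    """Completion feedback control unit: the completion state feedback signal goes
    valid once the beats elapsed since the request reach the mapped delay."""
    return beats_since_request >= DELAY_BEATS[opcode]

for beat in range(8):                       # the RiscV CPU samples only when valid
    if completion_feedback("CONV", beat):
        print(f"beat {beat}: completion signal valid, RiscV CPU samples the result")
        break
```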
Example three
As shown in fig. 4, compared with the coprocessor of the first or the second embodiment, the coprocessor of the present embodiment further includes an extended instruction pipeline state storage unit and an extended instruction pipeline control unit; the extended instruction pipeline state storage unit is connected to the RiscV CPU through the extended instruction pipeline control unit; when the RiscV CPU needs to stop its pipeline to wait for the instruction to complete the operation, it is notified to do so.
The extended instruction pipeline state storage unit stores, for every extended instruction, information on whether the instruction requires the CPU to stop the pipeline; after receiving an extended instruction request from the RiscV CPU, it looks up whether the current extended instruction requires the CPU to stop the pipeline and sends this information to the extended instruction pipeline control unit. The extended instruction pipeline control unit then transmits the received information to the RiscV CPU and informs the RiscV CPU whether to stop the current pipeline operation until the extended instruction finishes the operation.
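A minimal sketch of this stall query, again with assumed table contents, could look like this.

```python
# Sketch of the pipeline-stall query described above; table contents are assumed.

NEEDS_PIPELINE_STALL = {"MULADD": False, "CONV": True, "CONVACT": True}

def pipeline_control(opcode):
    """Extended instruction pipeline control unit: tells the RiscV CPU whether it
    must stop its pipeline until this extended instruction finishes."""
    return NEEDS_PIPELINE_STALL[opcode]   # read from the pipeline state storage unit

print(pipeline_control("CONV"))  # True -> the CPU stops its pipeline until completion
```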
The overall work flow of the invention is as follows:
1. When the instruction to be executed by the RiscV CPU is an extended instruction, the RiscV CPU sends an extended instruction request to the extended instruction operation unit and at the same time queries and reads the pipeline control information and the completion state feedback information. The pipeline control information indicates whether the extended instruction requires the RiscV CPU to stop all current pipelines; during instruction execution, the completion state feedback information indicates, at the current clock beat, whether the extended instruction has finished its operation, and the RiscV CPU can sample and receive the instruction result through the instruction result return port.
2. After receiving the extended instruction request, the extended instruction arithmetic unit completes the instruction operation and sends the result back to the RiscV CPU. When the extended instruction arithmetic unit needs several clock beats to complete the operation, the multi-beat instruction delay mapping storage unit, after receiving the extended instruction request, looks up the number of delay beats corresponding to the instruction and sends it to the completion feedback control unit; once the delay beats reach that value, the completion feedback control unit sets the completion state feedback signal valid and the RiscV CPU is informed to sample and receive the operation result.
3. Meanwhile, after the extended instruction request is received, the extended instruction pipeline state storage unit queries whether the instruction requires the RiscV CPU to stop its pipeline and wait for the instruction to complete the operation, and sends the query result to the extended instruction pipeline control unit; the extended instruction pipeline control unit then sends this information to the RiscV CPU to inform it whether to stop the current pipeline operation until the extended instruction operation unit finishes the operation.
4. After the extended instruction operation unit finishes the operation, the RiscV CPU samples and receives the operation result of the extended instruction operation unit according to the completion state feedback signal, and once the instruction has finished executing, the next instruction is executed. A software sketch of this workflow, seen from the CPU side, follows below.
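Under the same assumptions as the sketches above (invented tables, latencies and a stand-in computation), the whole handshake seen from the RiscV CPU side can be modelled as follows; none of the names are the patent's actual signals.

```python
# End-to-end software model of the workflow above, seen from the RiscV CPU side.
# Tables, latencies and the computed operation are invented stand-ins.

NEEDS_STALL = {"CONVACT": True}   # pipeline state storage unit contents (assumed)
DELAY_BEATS = {"CONVACT": 6}      # multi-beat delay mapping contents (assumed)

def coprocessor_result(operands):
    """Stand-in for the extended instruction operation unit computing a
    convolution followed by an activation."""
    weights, window = operands
    acc = sum(w * x for w, x in zip(weights, window))
    return max(acc, 0.0)

def cpu_execute_extended(opcode, operands):
    # Step 1: send the extended instruction request and read the pipeline control info.
    stall_needed = NEEDS_STALL.get(opcode, False)
    # Step 3: if required, the pipeline stays stopped until the completion signal is valid.
    beats_waited = DELAY_BEATS[opcode] if stall_needed else 0
    # Steps 2 and 4: sample and receive the operation result, then run the next instruction.
    return coprocessor_result(operands), beats_waited

result, beats = cpu_execute_extended("CONVACT", ([0.5, -0.25, 1.0], [1.0, 2.0, 3.0]))
print(f"result {result} after stalling {beats} beats")
```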
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (9)

1. A neural network coprocessor based on RiscV extended instructions, comprising an extended instruction arithmetic unit, wherein the extended instruction arithmetic unit is connected to a RiscV CPU;
when receiving an extended instruction request from the RiscV CPU, the extended instruction arithmetic unit classifies each extended instruction into one of several instruction operation levels according to its dependency relationships, configures the operation units required by each extended instruction at each operation level, the connection relationships among those operation units, and their degree of parallelism, completes the instruction operation according to this configuration information, and outputs the operation result to the RiscV CPU.
2. The RiscV extended instruction-based neural network coprocessor of claim 1, wherein: the extended instruction arithmetic unit comprises an instruction decoding distribution unit, an instruction level mapping storage unit, a result output selection unit, an arithmetic basic unit array consisting of a plurality of basic arithmetic circuits, a plurality of levels of interconnection control units, and a plurality of levels of interconnection configuration units;
the instruction level mapping storage unit is respectively connected with the instruction decoding distribution unit and the result output selection unit;
the input and the output of each basic arithmetic circuit are respectively connected, through the interconnection control unit of each level, to the output and the input of any other basic arithmetic circuit;
the interconnection control unit of each level is connected with the instruction decoding distribution unit, the result output selection unit and the interconnection configuration unit of the corresponding level.
3. The RiscV extended instruction-based neural network coprocessor of claim 2, wherein:
the arithmetic basic unit array includes:
an adder group composed of a plurality of adders;
a multiplier group composed of a plurality of multipliers;
an OR operation group composed of a plurality of OR operators;
an inverse operation group composed of a plurality of inverse operators;
the LUT lookup operation group consists of a plurality of LUT table storage units and a plurality of LUT lookup configuration units, and each LUT table storage unit is correspondingly connected with one LUT lookup configuration unit;
the interconnection control unit includes:
a first-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
a second-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
a third-level interconnection control unit, which is respectively connected with each adder, each multiplier, each OR operator, each inverse operator, and each LUT table storage unit;
the interconnection configuration unit includes:
a first-level interconnection configuration unit connected with the first-level interconnection control unit;
a second-level interconnection configuration unit connected with the second-level interconnection control unit;
and a third-level interconnection configuration unit connected with the third-level interconnection control unit.
4. The RiscV extended instruction-based neural network coprocessor of any of claims 1-3, wherein: the coprocessor further comprises a multi-beat instruction delay mapping storage unit and a completion feedback control unit; the multi-beat instruction delay mapping storage unit is connected to the RiscV CPU through the completion feedback control unit;
the multi-beat instruction delay mapping storage unit stores the multi-beat delay information corresponding to every extended instruction; after receiving an extended instruction request from the RiscV CPU, it looks up the number of delay beats corresponding to that extended instruction and sends this number to the completion feedback control unit; the completion feedback control unit then, according to the delay clock beat number of the current extended instruction sent by the multi-beat instruction delay mapping storage unit, sets the completion state feedback signal valid once the number of beats elapsed since the instruction request reaches the corresponding value, and informs the RiscV CPU to sample and receive the operation result.
5. The RiscV extended instruction-based neural network coprocessor of claim 4, wherein: the coprocessor further comprises an extended instruction pipeline state storage unit and an extended instruction pipeline control unit; the extended instruction pipeline state storage unit is connected to the RiscV CPU through the extended instruction pipeline control unit;
the extended instruction pipeline state storage unit stores, for every extended instruction, information on whether the instruction requires the CPU to stop the pipeline; after receiving an extended instruction request from the RiscV CPU, it looks up whether the current extended instruction requires the RiscV CPU to stop the pipeline and sends this information to the extended instruction pipeline control unit; and the extended instruction pipeline control unit transmits the received information to the RiscV CPU and informs the RiscV CPU whether to stop the current pipeline operation until the extended instruction finishes the operation.
6. A neural network coprocessing method based on RiscV extended instructions, comprising the following steps: providing the configurable coprocessor with an extended instruction arithmetic unit; the extended instruction arithmetic unit receives an extended instruction request sent by the RiscV CPU, classifies each extended instruction into one of several instruction operation levels according to its dependency relationships, configures the arithmetic units required by each extended instruction at each operation level, the connection relations among those arithmetic units, and the parallelism of each arithmetic unit, completes the instruction operation according to the configuration information, and outputs the operation result to the RiscV CPU.
7. The RiscV extended instruction-based neural network coprocessing method of claim 6, wherein: the extended instruction arithmetic unit comprises an instruction decoding distribution unit, an instruction level mapping storage unit, a result output selection unit, an arithmetic basic unit array consisting of a plurality of basic arithmetic circuits, a plurality of levels of interconnection control units, and a plurality of levels of interconnection configuration units;
the instruction level mapping storage unit is respectively connected with the instruction decoding distribution unit and the result output selection unit; the input and the output of each basic arithmetic circuit are respectively connected, through the interconnection control unit of each level, to the output and the input of any other basic arithmetic circuit; the interconnection control unit of each level is connected with the instruction decoding distribution unit, the result output selection unit and the interconnection configuration unit of the corresponding level;
the instruction decoding distribution unit reads, from the instruction level mapping storage unit, the level to which the current extended instruction corresponds and distributes the current extended instruction to the interconnection control unit of the corresponding level for operation;
after receiving the extended instruction, the interconnection control unit at that level queries the interconnection configuration unit at the corresponding level for the arithmetic units required by the current extended instruction, the connection relations among those units, and the configuration information of their degree of parallelism; it then configures the corresponding interconnection form according to the configuration information and controls the data flow;
the interconnection control unit at that level then starts the instruction operation through the basic arithmetic circuits, and sends the result to the result output selection unit after the instruction operation is finished.
8. The RiscV extended instruction-based neural network coprocessing method according to claim 6 or 7, wherein: the configurable coprocessor further comprises a multi-beat instruction delay mapping storage unit and a completion feedback control unit;
the multi-beat instruction delay mapping storage unit stores multi-beat delay information corresponding to all the extended instructions, inquires the delay beat number corresponding to the extended instruction after receiving the extended instruction request from the RiscV CPU, and sends the delay beat number to the completion feedback control unit;
and the completion feedback control unit, according to the delay clock beat number of the current extended instruction sent by the multi-beat instruction delay mapping storage unit, sets the completion state feedback signal valid once the number of beats elapsed since the instruction request reaches the corresponding value, and informs the RiscV CPU to sample and receive the operation result.
9. The RiscV extended instruction-based neural network coprocessing method of claim 8, wherein: the configurable coprocessor further comprises an extended instruction pipeline state storage unit and an extended instruction pipeline control unit;
the extended instruction pipeline state storage unit stores, for every extended instruction, information on whether the instruction requires the CPU to stop the pipeline; after receiving an extended instruction request from the RiscV CPU, it looks up whether the current extended instruction requires the RiscV CPU to stop the pipeline and sends this information to the extended instruction pipeline control unit;
and the extended instruction pipeline control unit transmits the received information to the RiscV CPU and informs the RiscV CPU whether to stop the current pipeline operation until the extended instruction finishes the operation.
CN201910671987.2A 2019-07-24 2019-07-24 Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof Active CN110502278B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910671987.2A CN110502278B (en) 2019-07-24 2019-07-24 Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910671987.2A CN110502278B (en) 2019-07-24 2019-07-24 Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof

Publications (2)

Publication Number Publication Date
CN110502278A CN110502278A (en) 2019-11-26
CN110502278B true CN110502278B (en) 2021-07-16

Family

ID=68586692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910671987.2A Active CN110502278B (en) 2019-07-24 2019-07-24 Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof

Country Status (1)

Country Link
CN (1) CN110502278B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158756B (en) * 2019-12-31 2021-06-29 百度在线网络技术(北京)有限公司 Method and apparatus for processing information
CN112259071A (en) * 2020-09-22 2021-01-22 北京百度网讯科技有限公司 Speech processing system, speech processing method, electronic device, and readable storage medium
CN113193950B (en) * 2021-07-01 2021-12-10 广东省新一代通信与网络创新研究院 Data encryption method, data decryption method and storage medium
CN113642722A (en) * 2021-07-15 2021-11-12 深圳供电局有限公司 Chip for convolution calculation, control method thereof and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750127A (en) * 2012-06-12 2012-10-24 清华大学 Coprocessor
CN105930132A (en) * 2016-03-14 2016-09-07 上海剑桥科技股份有限公司 Coprocessor
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN108734274A (en) * 2017-04-24 2018-11-02 英特尔公司 Calculation optimization mechanism for deep neural network
CN108845828A (en) * 2018-05-29 2018-11-20 深圳市国微电子有限公司 A kind of coprocessor, matrix operation accelerated method and system
CN109542512A (en) * 2018-11-06 2019-03-29 腾讯科技(深圳)有限公司 A kind of data processing method, device and storage medium
CN109857460A (en) * 2019-02-20 2019-06-07 南京华捷艾米软件科技有限公司 Matrix convolution calculation method, interface, coprocessor and system based on RISC-V framework

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2987308B2 (en) * 1995-04-28 1999-12-06 松下電器産業株式会社 Information processing device
GB9902115D0 (en) * 1999-02-01 1999-03-24 Axeon Limited Neural networks
CN104391674B (en) * 2014-10-22 2017-09-05 积成电子股份有限公司 Sampled value linear interpolation arithmetic device and operation method based on FPGA
CN105630735A (en) * 2015-12-25 2016-06-01 南京大学 Coprocessor based on reconfigurable computational array
CN106990940B (en) * 2016-01-20 2020-05-22 中科寒武纪科技股份有限公司 Vector calculation device and calculation method
US11232346B2 (en) * 2017-06-06 2022-01-25 The Regents Of The University Of Michigan Sparse video inference processor for action classification and motion tracking
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750127A (en) * 2012-06-12 2012-10-24 清华大学 Coprocessor
CN105930132A (en) * 2016-03-14 2016-09-07 上海剑桥科技股份有限公司 Coprocessor
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN108734274A (en) * 2017-04-24 2018-11-02 英特尔公司 Calculation optimization mechanism for deep neural network
CN108845828A (en) * 2018-05-29 2018-11-20 深圳市国微电子有限公司 A kind of coprocessor, matrix operation accelerated method and system
CN109542512A (en) * 2018-11-06 2019-03-29 腾讯科技(深圳)有限公司 A kind of data processing method, device and storage medium
CN109857460A (en) * 2019-02-20 2019-06-07 南京华捷艾米软件科技有限公司 Matrix convolution calculation method, interface, coprocessor and system based on RISC-V framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Convolutional Neural Network Processor; Yan Qiang; China Master's Theses Full-text Database, Information Science and Technology; 2018-02-15; I140-170 *

Also Published As

Publication number Publication date
CN110502278A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN110502278B (en) Neural network coprocessor based on RiscV extended instruction and coprocessing method thereof
Yu et al. Light-OPU: An FPGA-based overlay processor for lightweight convolutional neural networks
CN110689138B (en) Operation method, device and related product
Van Praet et al. Instruction set definition and instruction selection for ASIPs
KR100236527B1 (en) Single instruction multiple data processing using multiple banks of vector registers
US5815715A (en) Method for designing a product having hardware and software components and product therefor
JP2005531848A (en) Reconfigurable streaming vector processor
US20060026578A1 (en) Programmable processor architecture hirarchical compilation
CN110073329A (en) Memory access equipment calculates equipment and the equipment applied to convolutional neural networks operation
JP2001142922A (en) Design method for semiconductor integrated circuit device
CN1653446A (en) High-performance hybrid processor with configurable execution units
CN114510339B (en) Computing task scheduling method and device, electronic equipment and readable storage medium
KR20130114688A (en) Architecture optimizer
US6675289B1 (en) System and method for executing hybridized code on a dynamically configurable hardware environment
CN111047036B (en) Neural network processor, chip and electronic equipment
CN110704364A (en) Automatic dynamic reconstruction method and system based on field programmable gate array
US7536534B2 (en) Processor capable of being switched among a plurality of operating modes, and method of designing said processor
Marconi Online scheduling and placement of hardware tasks with multiple variants on dynamically reconfigurable field-programmable gate arrays
Gealow et al. System design for pixel-parallel image processing
CN111091181B (en) Convolution processing unit, neural network processor, electronic device and convolution operation method
De Beeck et al. Crisp: A template for reconfigurable instruction set processors
Wittenburg et al. HiPAR-DSP: A parallel VLIW RISC processor for real time image processing applications
CN111047035B (en) Neural network processor, chip and electronic equipment
CN112988238A (en) Extensible operation device and method based on extensible instruction set CPU kernel
EP1387266A1 (en) Software pipelining for branching control flow

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 350000 building 18, 89 software Avenue, Gulou District, Fuzhou City, Fujian Province

Applicant after: Ruixin Microelectronics Co., Ltd

Address before: 350000 building 18, 89 software Avenue, Gulou District, Fuzhou City, Fujian Province

Applicant before: Fuzhou Rockchips Electronics Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant