CN115113933B - Apparatus for accelerating data operation - Google Patents


Info

Publication number
CN115113933B
CN115113933B (application CN202211023496.5A)
Authority
CN
China
Prior art keywords: instruction, data, register, operations, operated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211023496.5A
Other languages
Chinese (zh)
Other versions
CN115113933A (en)
Inventor
汪泳江
Current Assignee
Xuanzhi Electronic Technology Shanghai Co ltd
Original Assignee
Xuanzhi Electronic Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Xuanzhi Electronic Technology Shanghai Co ltd
Priority to CN202211023496.5A
Publication of CN115113933A
Application granted
Publication of CN115113933B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3005 - Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30058 - Conditional branch instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3004 - Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 - Prefetch instructions; cache control instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements
    • G06F9/3012 - Organisation of register space, e.g. banked or distributed register file
    • G06F9/30138 - Extension of register space, e.g. register cache

Abstract

Embodiments of the present disclosure relate to an apparatus for accelerating data operations, comprising: a configuration register for storing a plurality of instructions; a data register for storing data to be operated on and operation results; a controller for fetching instructions from the configuration register and executing them, transferring the data to be operated on indicated by an instruction to the cache register indicated by that instruction in response to determining that the instruction belongs to a first transfer instruction, and transferring the operation result stored in the cache register indicated by an instruction to the data register indicated by that instruction in response to determining that the instruction belongs to a second transfer instruction; and an operation accelerator for fetching the data to be operated on from the cache register, performing the operation, and storing the operation result back into the cache register.

Description

Apparatus for accelerating data operations
Technical Field
Embodiments of the present disclosure relate generally to the field of data processing, and more particularly, to an apparatus for accelerating data operations.
Background
In data processing, complex data operations are often involved. Currently, a CPU (central processing unit) is generally used to implement the relevant data operations. On the one hand, for each complex data operation the CPU typically needs tens, hundreds, or even thousands of instruction cycles to complete the operation, which takes a long time and cannot meet real-time requirements. On the other hand, in a typical application scenario the CPU often needs to handle other control requests as well; even when the computation time is theoretically acceptable, data processing therefore still carries a real-time risk.
In solutions that implement data operations with an ASIC (application-specific integrated circuit), the ASIC can implement only its customized operation functions and therefore lacks flexibility.
In summary, conventional schemes for performing data operations have the following disadvantages: data operations are time-consuming, lack real-time guarantees, and lack flexibility.
Disclosure of Invention
In view of the above problems, the present disclosure provides an apparatus for accelerating data operations, which can effectively shorten the time consumed by data operations, speed up data processing, and offer a high degree of flexibility.
According to one aspect of the present disclosure, an apparatus for accelerating data operations is provided. The apparatus includes: a configuration register for storing a plurality of instructions; a data register for storing data to be operated on and operation results; a controller for fetching instructions from the configuration register and executing them, transferring the data to be operated on indicated by an instruction to the cache register indicated by that instruction in response to determining that the instruction belongs to a first transfer instruction, and transferring the operation result stored in the cache register indicated by an instruction to the data register indicated by that instruction in response to determining that the instruction belongs to a second transfer instruction; and an operation accelerator for fetching the data to be operated on from the cache register, performing the operation, and storing the operation result back into the cache register.
In some embodiments, the number of operation accelerators is at least two. The controller is further configured, in response to determining that the instruction belongs to the first transfer instruction, to transfer the data to be operated on corresponding to the at least two operation accelerators to the cache registers respectively corresponding to the at least two operation accelerators, so that each operation accelerator can obtain its data to be operated on from its corresponding cache register and perform its operation.
In some embodiments, the number of operation accelerators is at least two. The controller is further configured, in response to a third transfer instruction, to transfer the operation result in the cache register corresponding to one operation accelerator to the cache register corresponding to another operation accelerator, to serve as the data to be operated on by that other accelerator.
In some embodiments, the configuration register is further configured to store a start indication signal; when the start indication signal is in a valid state, it indicates that the data to be operated on stored in the data register is valid. The controller is further configured, in response to determining that the start indication signal is valid, to jump to a target instruction in the configuration register and execute it, the target instruction including a first transfer instruction.
In some embodiments, the controller is further configured to jump to a target instruction in the configuration register and execute it according to a magnitude comparison between an operation result stored in the cache register and target data; the target data includes a predetermined value or another operation result stored in the cache register.
In some embodiments, the apparatus for accelerating data operations further comprises a bus configured to perform at least one of: storing the plurality of instructions to the configuration register via the bus, and storing the data to be operated on to the data register via the bus.
In some embodiments, the apparatus for accelerating data operations further comprises a compiler configured to parse a program into a combination of a plurality of operations and determine whether each operation is associated with an operation accelerator; to compile an operation into an acceleration instruction, including at least a first transfer instruction, in response to determining that the operation is associated with an operation accelerator; to compile an operation into a normal instruction in response to determining that it is not; and to form the generated acceleration instructions and normal instructions into an instruction set comprising the plurality of instructions.
In some embodiments, the number of operation accelerators is at least two. The compiler is further configured to determine whether the plurality of operations includes at least two operations respectively corresponding to the at least two operation accelerators, and, in response to determining that it does, to perform at least one of the following: compiling the at least two operations into a first transfer instruction, so that the controller, in response to the first transfer instruction, transfers the data to be operated on corresponding to the at least two operation accelerators to the cache registers respectively corresponding to those accelerators; and compiling the at least two operations into a combined instruction comprising a first transfer instruction and a third transfer instruction executed in sequence, so that the controller, in response to the first transfer instruction, transfers the data to be operated on to the cache register corresponding to one of the operation accelerators, and, in response to the third transfer instruction, transfers the operation result of that accelerator to the cache register corresponding to the other operation accelerator to serve as its data to be operated on.
In some embodiments, the operation accelerator comprises at least one of a CORDIC (Coordinate Rotation Digital Computer) operator and a multiplier-adder.
In some embodiments, the bus is connected to a host computer to perform at least one of: the host computer storing the plurality of instructions to the configuration register via the bus, and the host computer storing the data to be operated on to the data register via the bus.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.
Fig. 1 shows a schematic structural diagram of an apparatus for accelerating data operations according to an embodiment of the present disclosure.
FIG. 2 shows a flow diagram of a method of performing data operations by the apparatus according to an embodiment of the present disclosure.
FIG. 3 shows a flow diagram of a method of performing data operations by the apparatus according to an embodiment of the present disclosure.
FIG. 4 shows a flow diagram of a method of performing data operations by the apparatus according to an embodiment of the present disclosure.
FIG. 5 shows a flow diagram of a method of performing data operations by the apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
As described above, a CPU or an ASIC is generally used for data processing. In CPU-only schemes, data operations take a long time: the processing a CPU can perform in each instruction cycle is limited by its architecture and instruction set, so a complex data operation usually requires tens or even hundreds of instruction cycles to complete. Moreover, when the CPU must respond to an external interrupt to handle another control request during a data operation, the operation enters a pending state and resumes only after the CPU finishes servicing the interrupt. Data processing performed solely by a CPU therefore also lacks real-time guarantees. In solutions that implement data operations with an ASIC, the ASIC can implement only its customized operation functions and thus lacks flexibility.
To address, at least in part, one or more of the above problems and other potential problems, example embodiments of the present disclosure propose an apparatus for accelerating data operations. In the disclosed solution, the apparatus includes a configuration register, a data register, a controller, a cache register, and an operation accelerator. The controller fetches instructions from the configuration register and executes them. If it determines that the instruction to be executed belongs to a first transfer instruction (e.g., an instruction that transfers data from the data register to the cache register), the controller transfers the data to be operated on indicated by the instruction from the data register to the cache register indicated by the instruction. Correspondingly, the operation accelerator automatically fetches the data to be operated on from the cache register, performs the operation, and stores the operation result into the cache register. If the controller determines that the instruction to be executed belongs to a second transfer instruction (e.g., an instruction that transfers data from the cache register to the data register), it transfers the operation result stored in the cache register indicated by the instruction to the data register indicated by the instruction for output. With this apparatus, a program involving complex operations can send those operations to the operation accelerator, greatly improving execution speed, while simpler operations, such as basic arithmetic, can be handled by the controller itself.
As a result, the speed of data operations is effectively improved, and by setting the instructions appropriately, work can be divided sensibly between the operation accelerator and the controller, so that a wide variety of data operations can be realized with a high degree of flexibility. In addition, while the operation accelerator is executing an operation, the controller can respond to other requests without disturbing the operation in progress, which significantly improves the real-time performance of data operations.
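To make this flow concrete, the sketch below models the controller's fetch/dispatch loop in Python. The mnemonics MOV_X2R and MOV_R2X follow the instruction names defined later in the text, but the register model, operand layout, and the accelerator hook are illustrative assumptions, not the actual hardware design.

```python
# Hypothetical sketch of the controller's fetch/dispatch loop. The controller
# only moves data between registers and hands work to accelerators; the
# accelerators themselves read and write the cache register.

def run(config_reg, data_reg, cache_reg, accelerators):
    pc = 0
    while pc < len(config_reg):
        op, *fields = config_reg[pc]
        if op == "MOV_X2R":            # first transfer instruction
            src, dst = fields
            cache_reg[dst] = data_reg[src]
        elif op == "MOV_R2X":          # second transfer instruction
            src, dst = fields
            data_reg[dst] = cache_reg[src]
        elif op in accelerators:       # hand work to an operation accelerator
            accelerators[op](cache_reg)
        pc += 1
    return data_reg
```

A program then becomes a list of such tuples: move operands in, let an accelerator run, move the result out.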
Fig. 1 shows a schematic structural diagram of an apparatus 100 for accelerating data operations according to an embodiment of the present disclosure. The apparatus 100 includes a configuration register 116, a data register 118, a cache register 120, a controller 104, and a computation accelerator 108. In some embodiments, the apparatus 100 further comprises a compiler 102. In some embodiments, the apparatus 100 also includes a bus 114.
With respect to the configuration register 116, for example, it is used to store a plurality of instructions. In some embodiments, the instructions are compiled by the compiler 102 and stored in the configuration registers 116 via the bus 114 for execution by the controller 104.
In some embodiments, the configuration register 116 is further configured to store a start indication signal; when the start indication signal is in a valid state, it indicates that the data to be operated on stored in the data register is valid. The controller 104 is further operable, in response to determining that the start indication signal is valid, to jump to a target instruction in the configuration register 116, the target instruction including a first transfer instruction, and execute it. In some embodiments, the target instruction further includes, for example, a second transfer instruction or a third transfer instruction.
The data register 118 is used, for example, to store data to be operated on and an operation result. In some embodiments, a host computer (not shown) stores data to be operated on to the data register 118 via the bus 114, and reads operation results from the data register 118 via the bus 114. In some embodiments, an external device (e.g., a sensor) stores data to be operated to the data register 118 via the bus 114, and the host computer reads operation results from the data register 118 via the bus 114.
With respect to the controller 104, it is used, for example, to fetch instructions from the configuration register and execute them: in response to determining that an instruction belongs to the first transfer instruction, it transfers the data to be operated on indicated by the instruction to the cache register indicated by the instruction; in response to determining that the instruction belongs to the second transfer instruction, it transfers the operation result stored in the cache register indicated by the instruction to the data register indicated by the instruction.
The controller 104 may be, for example, a dedicated processing unit such as a GPU (graphics processing unit), an FPGA (field-programmable gate array), or an ASIC (application-specific integrated circuit), or a general-purpose processing unit such as a CPU (central processing unit).
As for the operation accelerator 108, it is used to fetch data to be operated on from a cache register, perform the operation, and store the operation result back into the cache register. In some embodiments, there are multiple operation accelerators of multiple types. As shown in Fig. 1, the operation accelerators include multiplier-adders 112 and CORDIC operators 110; the number of multiplier-adders 112 is 2, and the number of CORDIC operators 110 is 2. The multiplier-adder 112 implements a fused multiply-add operation, adding the result of a multiplication to another operand in a single step to obtain the final result, thereby reducing the execution latency of the whole multiply-add sequence. The CORDIC operator 110 implements coordinate rotation digital computation, replacing multiplication with elementary addition and shift operations, so that computing the rotation and orientation of a vector requires no trigonometric, multiplication, square-root, inverse-trigonometric, or exponential function evaluations. In some embodiments, the operation accelerators also include other acceleration units, such as an AI (artificial intelligence) acceleration unit. The type and number of accelerators can be set as needed.
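To illustrate the shift-and-add idea behind CORDIC, here is a floating-point sketch of the general CORDIC algorithm in rotation mode. It is an illustration of the technique only, not the CORDIC operator 110 itself; a hardware version would use fixed-point arithmetic, precomputed constants, and true bit shifts instead of multiplication by powers of two.

```python
import math

def cordic_sincos(theta, iterations=32):
    """Compute (cos(theta), sin(theta)) by CORDIC micro-rotations.

    theta must lie within the CORDIC convergence range (about +/-1.74 rad).
    """
    # Precomputed rotation angles atan(2^-i); in hardware this is a small ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Scaling factor K compensates for the length growth of each micro-rotation.
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        # Multiplication by 2^-i models a right shift by i bits in hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * k, y * k
```

Each iteration uses only an add/subtract and a shift, which is exactly why such an operator is cheap to build in silicon.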
In some embodiments, the cache register 120 includes a plurality of cache sub-regions 122, each corresponding to one of the operation accelerators. For example, an operation accelerator fetches the data to be operated on that its operation requires from its corresponding cache sub-region 122, and writes its operation result back to the same cache sub-region 122 for storage. It is understood that, within a cache sub-region 122, the address storing the data to be operated on may differ from the address storing the operation result.
With respect to the bus 114, it is used, for example, to store instructions to the configuration register and to store data to be operated on to the data register. In some embodiments, the final operation result is stored in the data register, and the bus 114 is also used, for example, by an external device to read that final result from the data register.
With respect to the compiler 102, it is configured to parse a program into a combination of a plurality of operations and determine whether each operation is associated with an operation accelerator; to compile an operation into an acceleration instruction, including at least a first transfer instruction, in response to determining that the operation is associated with an operation accelerator; to compile an operation into a normal instruction in response to determining that it is not; and to form the generated acceleration instructions and normal instructions into an instruction set comprising a plurality of instructions. It should be understood that an acceleration instruction is an instruction associated with the operation accelerator: when executing an acceleration instruction, the controller 104 at least transfers the data to be operated on to the cache register so that the operation accelerator can perform the operation. A normal instruction is not directly related to the operation accelerator, and the controller 104 performs the corresponding operation itself.
In some embodiments, the compiler 102 is further configured to determine whether the plurality of operations includes at least two operations respectively corresponding to at least two of the operation accelerators, and, in response to determining that it does, to perform at least one of the following:
compiling the at least two operations into a first transfer instruction, so that the controller, in response to the first transfer instruction, transfers the data to be operated on corresponding to the at least two operation accelerators to the cache registers respectively corresponding to those accelerators; and
compiling the at least two operations into a combined instruction comprising a first transfer instruction and a third transfer instruction executed in sequence, so that the controller, in response to the first transfer instruction, transfers the data to be operated on to the cache register corresponding to one of the operation accelerators, and, in response to the third transfer instruction, transfers the operation result of that accelerator to the cache register corresponding to the other operation accelerator to serve as its data to be operated on.
In some embodiments, the instructions generated by the compiler 102 fall into three types: operation-class instructions, jump-class instructions, and transfer-class instructions.
The operation class instructions include, for example:
(1) NOP: a no-operation instruction. When the controller executes a NOP instruction, it performs no operation;
(2) INT: an interrupt instruction. When the controller executes an INT instruction, it sends an interrupt request to an external device (e.g., the host computer);
(3) ASSIGN: an assignment instruction. When the controller executes an ASSIGN instruction, it assigns a value to a target register in the cache register;
(4) ADD/SUB/OR/AND/XOR/INV: arithmetic or logical operation instructions. ADD is an addition instruction, SUB is a subtraction instruction, OR is a logical OR instruction, AND is a logical AND instruction, XOR is a logical exclusive-OR instruction, and INV is a logical negation instruction. When the controller executes an arithmetic or logical operation instruction, it fetches data from the cache register and performs the corresponding operation;
(5) SHL/SHR: shift instructions. SHL is a left-shift instruction and SHR is a right-shift instruction. When the controller executes a shift instruction, it fetches data from the cache register and shifts it left or right by a number of bits.
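The semantics of these operation-class instructions can be sketched over a dictionary-based cache register as follows. The operand layout (a destination address, then one or two sources) is an assumption made for illustration; the patent does not specify an encoding.

```python
# Hypothetical semantics of the operation-class instructions. For ASSIGN,
# `a` is an immediate value; for SHL/SHR, `b` is a bit count; otherwise
# `a` and `b` are cache-register addresses.

def exec_op(op, cache, dst=None, a=None, b=None):
    if op == "NOP":
        return                              # no operation
    if op == "ASSIGN":
        cache[dst] = a                      # assign immediate to target register
    elif op in ("ADD", "SUB", "OR", "AND", "XOR"):
        fns = {"ADD": lambda x, y: x + y,
               "SUB": lambda x, y: x - y,
               "OR":  lambda x, y: x | y,
               "AND": lambda x, y: x & y,
               "XOR": lambda x, y: x ^ y}
        cache[dst] = fns[op](cache[a], cache[b])
    elif op == "INV":
        cache[dst] = ~cache[a]              # logical negation
    elif op == "SHL":
        cache[dst] = cache[a] << b          # shift left by b bits
    elif op == "SHR":
        cache[dst] = cache[a] >> b          # shift right by b bits
```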
Jump-class instructions include, for example:
(1) JMP: an unconditional jump instruction. When the controller executes a JMP instruction, it unconditionally jumps to a target address in the configuration register in order to execute the instruction stored at that address;
(2) JMPEQ: a first conditional jump instruction. When the controller executes a JMPEQ instruction, it compares the data stored at a first address in the cache register with the data stored at a second address; if the two are equal, it jumps to a target address in the configuration register in order to execute the instruction stored at that address; otherwise, it executes the instructions in the configuration register in sequence;
(3) JMPGT: a second conditional jump instruction. When the controller executes a JMPGT instruction, it compares the data stored at a first address in the cache register with the data stored at a second address; if the former is greater than the latter, it jumps to a target address in the configuration register in order to execute the instruction stored at that address; otherwise, it executes the instructions in the configuration register in sequence;
(4) JMPLT: a third conditional jump instruction. When the controller executes a JMPLT instruction, it compares the data stored at a first address in the cache register with the data stored at a second address; if the former is less than the latter, it jumps to a target address in the configuration register in order to execute the instruction stored at that address; otherwise, it executes the instructions in the configuration register in sequence;
(5) JMPNEQ: a fourth conditional jump instruction. When the controller executes a JMPNEQ instruction, it compares the data stored at a first address in the cache register with the data stored at a second address; if the two are not equal, it jumps to a target address in the configuration register in order to execute the instruction stored at that address; otherwise, it executes the instructions in the configuration register in sequence.
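A compact way to summarize these five jump instructions is a next-program-counter function: each conditional variant compares two cache-register addresses and either jumps to the target or falls through. This encoding is a hypothetical sketch, not the patent's control logic.

```python
# Hedged sketch of jump-class instruction semantics. For JMP the comparison
# operands are ignored; the jump is always taken.

def next_pc(op, pc, cache, addr1, addr2, target):
    a, b = cache[addr1], cache[addr2]
    taken = {"JMP":    True,       # unconditional
             "JMPEQ":  a == b,     # jump if equal
             "JMPGT":  a > b,      # jump if first greater
             "JMPLT":  a < b,      # jump if first smaller
             "JMPNEQ": a != b}[op] # jump if not equal
    return target if taken else pc + 1
```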
Transfer-class instructions include, for example:
(1) MOV_R2R: an instruction that transfers data from a source address in the cache register to a destination address in the cache register;
(2) MOV_R2R_M: an instruction that transfers data from a plurality of source addresses in the cache register to a corresponding plurality of destination addresses in the cache register;
(3) MOV_X2R: an instruction that transfers data from a source address in the data register to a destination address in the cache register;
(4) MOV_X2R_M: an instruction that transfers data from a plurality of source addresses in the data register to a corresponding plurality of destination addresses in the cache register;
(5) MOV_R2X: an instruction that transfers data from a source address in the cache register to a destination address in the data register;
(6) MOV_R2X_M: an instruction that transfers data from a plurality of source addresses in the cache register to a corresponding plurality of destination addresses in the data register.
The MOV_X2R and MOV_X2R_M instructions transfer data from the data register to the cache register and belong to the first transfer instruction; the MOV_R2X and MOV_R2X_M instructions transfer data from the cache register to the data register and belong to the second transfer instruction; the MOV_R2R and MOV_R2R_M instructions transfer data within the cache register and belong to the third transfer instruction.
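The six transfer instructions differ only in which register spaces they connect, which the sketch below makes explicit. Pairing source i with destination i in the "_M" multi-address forms is an assumption drawn from the word "correspondingly" in the descriptions above.

```python
# Illustrative semantics of the transfer-class instructions. Register spaces
# are modelled as dicts; src/dst are single addresses, or lists for _M forms.

def transfer(op, data_reg, cache_reg, src, dst):
    spaces = {"MOV_R2R": (cache_reg, cache_reg),  # third transfer instruction
              "MOV_X2R": (data_reg, cache_reg),   # first transfer instruction
              "MOV_R2X": (cache_reg, data_reg)}   # second transfer instruction
    base = op[:-2] if op.endswith("_M") else op
    src_space, dst_space = spaces[base]
    if op.endswith("_M"):
        for s, d in zip(src, dst):                # address-by-address copy
            dst_space[d] = src_space[s]
    else:
        dst_space[dst] = src_space[src]
```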
FIG. 2 shows a flow diagram of a method 200 of performing data operations by the apparatus 100, according to an embodiment of the disclosure. It should be understood that method 200 may also include additional steps not shown and/or may omit steps shown, as the scope of the present disclosure is not limited in this respect.
At step 202, the compiler parses the program into a combination of operations.
At step 204, the compiler determines whether the operation is associated with an operation accelerator.
At step 206, if the operation is determined to be associated with the operation accelerator, the compiler compiles the operation to generate an acceleration instruction. Wherein the acceleration instruction comprises at least a first branch instruction. In some embodiments, the acceleration instructions further include, for example, a second branch instruction, a third branch instruction.
At step 208, if the operation is determined not to be associated with the operation accelerator, the compiler compiles the operation to generate a normal instruction.
It will be appreciated that the compiler 102 parses the program into a combination of operations according to a predetermined compilation algorithm, and determines one by one whether each operation is associated with an operation accelerator. For example, if the compiler 102 determines that an operation is a multiply-add operation, which is associated with the multiplier-adder 112, the operation is compiled into, for example, a MOV_X2R instruction. The operands of the MOV_X2R instruction include a source address and a destination address: the source address may be, for example, the address in the data register of the data to be operated on by the multiply-add operation, and the destination address may be, for example, the address of the corresponding cache register of the associated multiplier-adder. In some embodiments, the instructions generated by the compiler 102 for the multiply-add operation may further include, for example, a MOV_R2X instruction, so that when the associated multiplier-adder completes the operation, the controller 104 executes the MOV_R2X instruction to transfer the result of the multiplier-adder's operation to the associated data register. If the compiler 102 determines that an operation is a logical AND operation, the compiler 102 determines that the operation is not associated with an operation accelerator and therefore compiles the operation into a normal instruction. For example, the compiler 102 compiles and generates a generic instruction that does not directly involve the operation accelerator, such as an ADD instruction.
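The compile-time decision at steps 204–208 can be sketched as follows. This is a hypothetical model: the operation names, the accelerator table, and the instruction tuples are assumptions for illustration, not the patent's actual formats.

```python
# Hypothetical table mapping accelerated operations to their accelerators.
ACCELERATED_OPS = {"madd": "MADD_UNIT", "cordic": "CORDIC_UNIT"}

def compile_op(op, src, dst):
    """Compile one parsed operation into a list of instruction tuples."""
    if op in ACCELERATED_OPS:             # step 206: associated with an accelerator
        unit = ACCELERATED_OPS[op]
        return [("MOV_X2R", src, unit),   # first transfer instruction: feed the accelerator
                ("MOV_R2X", unit, dst)]   # second transfer instruction: fetch the result
    # step 208: not associated with an accelerator -> normal instruction
    return [(op.upper(), src, dst)]

program = [("madd", "d0", "d3"), ("add", "d1", "d2")]
instruction_set = [ins for op in program for ins in compile_op(*op)]
```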
At step 210, the compiler forms the compiled acceleration instructions and the compiled normal instructions into an instruction set, the instruction set comprising a plurality of instructions.
At step 212, a plurality of instructions are stored to a configuration register via a bus and data to be operated on is stored to a data register via the bus.
In some embodiments, the compiler 102 stores the plurality of compiled instructions to the configuration register via the bus and stores the associated data to be operated on to the data register via the bus.
In some embodiments, the host computer stores a plurality of instructions to the configuration register via the bus and stores data to be operated on to the data register via the bus.
At step 214, the controller transfers the data to be operated on indicated by the instruction to the cache register indicated by the instruction in response to the first transfer instruction. For example, the controller 104 fetches an instruction from the configuration register and executes it. If the fetched instruction is the compiled MOV_X2R instruction, the controller 104 determines that the current MOV_X2R instruction belongs to the first transfer instruction and transfers the data to be operated on from the source address indicated by the MOV_X2R instruction to the target address. For example, the controller 104 transfers the related data to be operated on to the corresponding cache register of the multiplier-adder 112 for the multiplier-adder to operate on.
At step 216, the operation accelerator obtains the data to be operated on from the cache register to perform the operation, and stores the operation result to the cache register. For example, the relevant multiplier-adder 112 obtains the data to be operated on from the cache register, performs the multiply-add operation, and then stores the operation result in the cache register.
At step 218, the controller transfers the operation result stored in the cache register indicated by the instruction to the data register indicated by the instruction in response to the second transfer instruction. For example, if the controller 104 fetches the compiled MOV_R2X instruction, the controller 104 determines that the current MOV_R2X instruction belongs to the second transfer instruction and transfers the operation result from the source address indicated by the MOV_R2X instruction to the target address. For example, the controller 104 transfers the operation result in the cache register corresponding to the multiplier-adder 112 to the data register. In some embodiments, an external device (e.g., a host computer) obtains the operation result from the data register via the bus 114.
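Steps 214 to 218 can be modelled behaviourally in a few lines. The register names and the a*b+c operand layout of the multiplier-adder are assumptions for illustration only.

```python
data_reg = {"d0": 3, "d1": 4, "d2": 5, "d3": 0}   # data register file
cache_reg = {"x0": 0, "x1": 0, "x2": 0, "xr": 0}  # multiplier-adder cache registers

# Step 214 (MOV_X2R): the controller moves the operands into the cache registers.
for src, dst in [("d0", "x0"), ("d1", "x1"), ("d2", "x2")]:
    cache_reg[dst] = data_reg[src]

# Step 216: the multiplier-adder computes a*b + c and stores the result
# in its cache register.
cache_reg["xr"] = cache_reg["x0"] * cache_reg["x1"] + cache_reg["x2"]

# Step 218 (MOV_R2X): the controller moves the result to the data register.
data_reg["d3"] = cache_reg["xr"]   # 3*4 + 5 = 17
```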
Based on this scheme, when a data operation involves an operation that a relevant operation accelerator can perform, the controller only needs to transfer the data to be operated on to the cache register of that operation accelerator, which is very fast. The operation accelerator then completes the relevant operation, and this accelerator-based high-speed execution improves operation efficiency. After the operation accelerator finishes and outputs an operation result, the controller acquires the result. Taking as an example an operation that a conventional CPU can complete only over many instruction cycles (for example, tens of microseconds), the above scheme can complete the operation in about one microsecond. Moreover, thanks to the flexible configuration of the configuration register, the apparatus for accelerating data operations can implement a variety of different operation programs, offering far greater flexibility than an ASIC-based data operation scheme. In addition, while the operation accelerator is executing an operation, the controller can respond to other related operations without affecting the operation in progress, which significantly improves the real-time performance of data operations.
FIG. 3 shows a flow diagram of a method 300 of device 100 performing data operations, according to an embodiment of the disclosure. It should be understood that method 300 may also include additional steps not shown and/or may omit steps shown, as the scope of the present disclosure is not limited in this respect.
At step 302, the compiler parses the program into a combination of operations.
At step 304, the compiler determines whether the plurality of operations includes at least two operations that respectively correspond to the at least two operation accelerators.
At step 306, if it is determined that the plurality of operations includes at least two operations respectively corresponding to the at least two operation accelerators, the compiler compiles the at least two operations respectively corresponding to the at least two operation accelerators into a first transfer instruction to cause the controller to transfer the data to be computed respectively corresponding to the at least two operation accelerators to the cache registers respectively corresponding to the at least two operation accelerators in response to the first transfer instruction.
For example, in the system 100, the number of multiplier-adders 112 is 2. The compiler 102 determines that the parsed operations include, for example, two multiply-add operations that can be executed in parallel, and compiles the two multiply-add operations into, for example, a MOV_X2R_M instruction. The operands of the MOV_X2R_M instruction include a plurality of source addresses of the data register and a plurality of target addresses of the cache registers. The plurality of source addresses are, for example, the addresses in the data register of the data to be operated on by the two multiply-add operations; the plurality of target addresses are, for example, addresses in the cache registers corresponding to the two multiplier-adders, used to store the data to be operated on by the two multiplier-adders. Because the compiler 102 compiles and generates the MOV_X2R_M instruction, the controller 104 can, when executing it, simultaneously transfer the data to be operated on for both multiplier-adders to the corresponding cache registers, enabling the two multiplier-adders to operate in parallel and greatly saving time. In some embodiments, the compiler 102 also compiles and generates, for example, a MOV_R2X_M instruction for the two multiply-add operations, so that the controller 104 moves the results of the two multiply-add operations to the data register at the same time.
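The multi-address semantics of MOV_X2R_M can be sketched as a single controller step over parallel address lists; the dotted register names for the two multiplier-adders are hypothetical.

```python
def mov_x2r_m(data_reg, cache_reg, srcs, dsts):
    """One instruction, many transfers: srcs[i] -> dsts[i] in a single step."""
    assert len(srcs) == len(dsts)
    for s, d in zip(srcs, dsts):
        cache_reg[d] = data_reg[s]

data_reg = {"d0": 1, "d1": 2, "d2": 3, "d3": 4}
cache_reg = {}
# Feed operand pairs to multiplier-adder 0 and multiplier-adder 1 at once,
# so both can start their multiply-add operations in parallel.
mov_x2r_m(data_reg, cache_reg,
          ["d0", "d1", "d2", "d3"],
          ["m0.a", "m0.b", "m1.a", "m1.b"])
```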
At step 308, if it is determined that at least two operations respectively corresponding to the at least two operation accelerators are not included in the plurality of operations, the compiler compiles a corresponding instruction for each operation respectively.
At step 310, the compiler forms the compile-generated acceleration instruction and the compile-generated normal instruction into an instruction set. Wherein the instruction set comprises a plurality of instructions.
At step 312, a plurality of instructions are stored to the configuration register via the bus and data to be operated on is stored to the data register via the bus.
At step 314, the controller transfers the data to be operated corresponding to the at least two operation accelerators respectively to the cache registers corresponding to the at least two operation accelerators respectively in response to the first transfer instruction, so that the at least two operation accelerators respectively obtain the data to be operated from the corresponding cache registers for operation.
For example, if the controller 104 fetches the compiled MOV_X2R_M instruction, the controller 104 determines that the current MOV_X2R_M instruction belongs to the first transfer instruction and correspondingly transfers the data in the plurality of source addresses of the data register indicated by the MOV_X2R_M instruction to the plurality of target addresses of the cache registers. For example, the data to be operated on by the two multiplier-adders is simultaneously transferred to the corresponding cache registers.
At step 316, the at least two operation accelerators respectively acquire the data to be operated on from the corresponding cache registers, perform their operations, and respectively store the operation results to the corresponding cache registers.
For example, the two multiplier-adders simultaneously acquire data to be operated on from their corresponding cache registers and execute multiply-add operations in parallel; each then stores its operation result in the corresponding cache register.
At step 318, the controller transfers operation results respectively corresponding to the at least two operation accelerators to the data register in response to the second transfer instruction.
For example, if the controller 104 fetches the compiled MOV_R2X_M instruction, the controller 104 determines that the current MOV_R2X_M instruction belongs to the second transfer instruction and simultaneously transfers the operation results of the two multiplier-adders to the corresponding data registers.
Based on the scheme, at least two operation accelerators can be called simultaneously to perform operation in parallel, so that the resources of the operation accelerators are fully utilized, the operation speed is improved, and the operation time is greatly saved.
FIG. 4 shows a flow diagram of a method 400 of device 100 performing data operations, according to an embodiment of the disclosure. It should be understood that method 400 may also include additional steps not shown and/or may omit steps shown, as the scope of the disclosure is not limited in this respect.
At step 402, a compiler parses a program into a combination of operations.
At step 404, the compiler determines whether the plurality of operations includes at least two operations corresponding to the at least two operation accelerators, respectively.
At step 406, if it is determined that the plurality of operations includes at least two operations respectively corresponding to at least two operation accelerators, the compiler compiles those operations into a combined instruction. The combined instruction includes a first transfer instruction and a third transfer instruction that are executed in sequence, so that the controller, in response to the first transfer instruction, transfers the data to be operated on to the cache register corresponding to one of the operation accelerators, and, in response to the third transfer instruction, transfers the operation result corresponding to that operation accelerator to the cache register corresponding to another operation accelerator as its data to be operated on.
For example, the compiler 102 determines that the parsed operations include two sequentially executed operations, where the first operation is, for example, a multiply-add operation, the second operation is, for example, a CORDIC operation, and the CORDIC operation takes the result of the multiply-add operation as its data to be operated on. The compiler 102 compiles these two operations into a combined instruction that includes, for example, a MOV_X2R instruction and a MOV_R2R instruction executed in sequence. The operands of the MOV_X2R instruction include a source address, such as the address in the data register of the data to be operated on by the multiply-add operation, and a target address, such as the address of the cache register corresponding to the multiplier-adder. The operands of the MOV_R2R instruction include a source address, such as the address in the cache register of the result output by the multiplier-adder, and a target address, such as the address of the corresponding cache register of the CORDIC operator. Thus, when the combined instruction is executed, the controller 104 first executes the MOV_X2R instruction and transfers the data to be operated on to the cache register corresponding to the multiplier-adder, so that the multiplier-adder obtains its data; the controller 104 then executes the MOV_R2R instruction to transfer the operation result of the multiplier-adder to the corresponding cache register of the CORDIC operator as the data to be operated on by the CORDIC operator.
By compiling the two operations into a combined instruction in this way, the execution flow omits the step of transferring the operation result of the preceding operation accelerator (such as the multiplier-adder) from the cache register to the data register and then back from the data register to a cache register as the data to be operated on by the subsequent operation accelerator (such as the CORDIC operator), which greatly saves time and improves operation efficiency.
In some embodiments, the instructions generated by the compiler 102 for these operations may further include, for example, a MOV_R2X instruction, so that after the associated CORDIC operator completes its operation, the controller 104 executes the MOV_R2X instruction to transfer the result of the CORDIC operator to the associated data register.
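The combined-instruction flow (MOV_X2R, accelerator operation, MOV_R2R, second accelerator, MOV_R2X) can be sketched end to end. The register names, the a*b+c layout, and the use of sine as the CORDIC function are all illustrative assumptions.

```python
import math

data_reg = {"d0": 0.3, "d1": 2.0, "d2": 0.0, "d3": 0.0}
cache = {}

# MOV_X2R: feed the multiplier-adder's cache registers.
cache["madd.a"], cache["madd.b"], cache["madd.c"] = (
    data_reg["d0"], data_reg["d1"], data_reg["d2"])
cache["madd.r"] = cache["madd.a"] * cache["madd.b"] + cache["madd.c"]

# MOV_R2R: forward the result directly to the CORDIC operator's cache
# register, with no round trip through the data register.
cache["cordic.in"] = cache["madd.r"]
cache["cordic.r"] = math.sin(cache["cordic.in"])  # e.g. a CORDIC sine

# MOV_R2X: the controller moves the final result to the data register.
data_reg["d3"] = cache["cordic.r"]
```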
At step 408, if it is determined that at least two operations respectively corresponding to the at least two operation accelerators are not included in the plurality of operations, the compiler generates a corresponding instruction for each respective operation.
At step 410, the compiler forms the compiled acceleration instructions and the compiled normal instructions into an instruction set, the instruction set comprising a plurality of instructions.
At step 412, a plurality of instructions are stored to a configuration register via a bus and data to be operated on is stored to a data register via the bus.
At step 414, the controller transfers the data to be operated on indicated by the instruction to the cache register indicated by the instruction in response to the first transfer instruction. For example, the controller 104 fetches an instruction from the configuration register and executes it. If the controller 104 fetches the compiled MOV_X2R instruction, the controller 104 determines that the current MOV_X2R instruction belongs to the first transfer instruction and transfers the data to be operated on from the source address indicated by the MOV_X2R instruction to the target address. For example, the controller 104 transfers the related data to be operated on to the corresponding cache register of the multiplier-adder 112 for the multiplier-adder 112 to operate on.
At step 416, one of the operation accelerators obtains the data to be operated on from the cache register to perform the operation, and stores the operation result in the cache register. For example, the multiplier-adder 112 obtains data to be operated on from the cache register, performs the relevant multiply-add operation, and then stores the operation result in the cache register.
At step 418, the controller transfers the operation result in the cache register corresponding to one of the operation accelerators to the cache register corresponding to the other operation accelerator in response to the third transfer instruction, so as to serve as the data to be operated on of the other operation accelerator.
For example, if the controller 104 fetches the compiled MOV_R2R instruction, the controller 104 determines that the current MOV_R2R instruction belongs to the third transfer instruction and transfers the data from the source address indicated by the MOV_R2R instruction to the destination address. For example, the controller 104 transfers the operation result of the multiplier-adder 112 to the corresponding cache register of the CORDIC operator 110 as the data to be operated on by the CORDIC operator 110.
At step 420, the other operation accelerator obtains the data to be operated on from the cache register, performs the operation, and stores the operation result in the cache register. For example, the CORDIC operator 110 obtains the data to be operated on from the cache register, performs the relevant CORDIC operation, and stores the operation result in the cache register.
At step 422, the controller transfers the operation result stored in the cache register indicated by the instruction to the data register indicated by the instruction in response to the second transfer instruction. For example, if the controller 104 fetches the compiled MOV_R2X instruction, the controller 104 determines that the current MOV_R2X instruction belongs to the second transfer instruction and transfers the operation result from the source address indicated by the MOV_R2X instruction to the target address. For example, the controller 104 transfers the operation result in the cache register corresponding to the CORDIC operator 110 to the data register.
Based on this scheme, the compiler automatically compiles and generates a combined instruction for sequentially invoking at least two operation accelerators in series; after the preceding operation accelerator finishes, its operation result is transferred directly to the cache register of the subsequent operation accelerator as that accelerator's data to be operated on. This omits the round trip in which the result of the preceding operation accelerator is first transferred to the data register and then from the data register to the cache register of the subsequent operation accelerator, effectively saving time and improving operation efficiency.
Fig. 5 shows a flow diagram of a method 500 of performing data operations by the apparatus 100 according to an embodiment of the disclosure. It should be understood that method 500 may also include additional steps not shown and/or may omit steps shown, as the scope of the present disclosure is not limited in this respect.
At step 502, the controller determines whether the enable indication signal is in an active state.
At step 504, if the controller determines that the enable indication signal is in an inactive state, it returns to step 502.
At step 506, if the controller determines that the enable indication signal is in an active state, it jumps to the target instruction in the configuration register and executes it. The target instruction includes, for example, a first transfer instruction.
For example, the state of the associated enable indication signal is stored to the configuration register by an external device (e.g., a sensor) via the bus 114. When the external device has not yet acquired data to be operated on (for example, the sensor has not detected such data), the enable indication signal is in an inactive state. When the external device acquires data to be operated on (for example, the sensor detects such data), the data is stored in the data register via the bus 114, and the state of the enable indication signal is set to the active state via the bus 114. The controller 104 reads the data at the address in the configuration register corresponding to the enable indication signal to determine whether the signal is in the active state. If the enable indication signal is determined to be in an inactive state, the controller 104 performs the read operation again; if it is determined to be in the active state, the controller jumps to the target instruction and executes it. The target instruction includes, for example, a first transfer instruction, whereby the controller 104 transfers the data to be operated on to the associated cache register for the associated operation accelerator to perform the operation.
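The polling behaviour of steps 502–506 can be sketched as follows; the flag word, the modelled sequence of sensor writes, and the target program counter are assumptions for illustration.

```python
config_reg = {"enable_flag": 0, "target_pc": 2}
sensor_writes = iter([0, 0, 1])   # the external device sets the flag on the third read

def poll_and_jump():
    while True:
        # Model the sensor updating the flag via the bus, then the
        # controller reading the corresponding configuration-register word.
        config_reg["enable_flag"] = next(sensor_writes)
        if config_reg["enable_flag"]:       # step 506: active state -> jump
            return config_reg["target_pc"]
        # step 504: inactive state -> read again

pc = poll_and_jump()   # returns the target address after two inactive reads
```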
Based on this scheme, the controller can determine whether the data to be operated on is ready according to the state of the enable indication signal, and transfers the data to the relevant cache register only when it is ready, so that the relevant operation accelerator can execute the operation, which reduces resource occupation and improves execution efficiency.
In some embodiments, the controller 104 in principle executes the instructions in the configuration register sequentially. If the controller 104 determines that the current instruction belongs to a jump-class instruction, the controller 104 jumps to the target address in the configuration register indicated by that instruction in order to execute the instruction stored at the target address. Jump-class instructions enable a richer variety of programs and improve the flexibility of the apparatus for accelerating data operations.
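Sequential execution with a jump-class instruction can be sketched as a tiny fetch loop; the two-field instruction format below is a hypothetical stand-in for the configuration-register contents.

```python
def run(program):
    """Execute sequentially, except when a JMP names a target address."""
    pc, trace = 0, []
    while pc < len(program):
        op, arg = program[pc]
        trace.append(pc)
        if op == "JMP":
            pc = arg      # jump to the target address in the configuration register
        else:
            pc += 1       # default: execute the next instruction in order
    return trace

trace = run([("NOP", None), ("JMP", 3), ("NOP", None), ("NOP", None)])
# Visits addresses 0 and 1, then skips address 2 and executes address 3.
```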
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. An apparatus for accelerating data operations, comprising:
a configuration register to store a plurality of instructions;
the data register is used for storing data to be operated and an operation result;
a controller to fetch an instruction from a configuration register for execution, to transfer data to be operated on indicated by the instruction to a cache register indicated by the instruction in response to determining that the instruction belongs to a first transfer instruction, and to transfer an operation result stored by the cache register indicated by the instruction to a data register indicated by the instruction in response to determining that the instruction belongs to a second transfer instruction; and
the operation accelerator is used for acquiring data to be operated from the cache register to perform operation and storing an operation result to the cache register;
the device further comprises:
a compiler to parse a program into a combination of a plurality of operations and determine whether each operation is associated with an operation accelerator, compile the operation into an acceleration instruction in response to determining that the operation is associated with the operation accelerator, the acceleration instruction including at least a first transfer instruction, compile the operation into a normal instruction in response to determining that the operation is not associated with the operation accelerator, and form the compiled acceleration instructions and the compiled normal instructions into an instruction set, the instruction set including a plurality of instructions.
2. The apparatus of claim 1, wherein the number of computation accelerators is at least two;
the controller is further used for responding to the first transfer instruction determined that the instruction belongs to the first transfer instruction, transferring the data to be operated corresponding to the at least two operation accelerators to cache registers corresponding to the at least two operation accelerators respectively, so that the at least two operation accelerators can obtain the data to be operated from the corresponding cache registers respectively to operate.
3. The apparatus of claim 1, wherein the number of computation accelerators is at least two;
the controller is also used for responding to the third transfer instruction, transferring the operation result in the cache register corresponding to one operation accelerator to the cache register corresponding to the other operation accelerator to serve as the data to be operated of the other operation accelerator.
4. The apparatus according to claim 1, wherein the configuration register is further configured to store an enable indication signal, and when the enable indication signal is in a valid state, it indicates that the data to be operated on stored in the data register is valid; and
the controller is further configured to jump to a target instruction in the configuration register and execute it in response to determining that the enable indication signal is in the valid state, the target instruction comprising a first transfer instruction.
5. The apparatus according to claim 1, wherein the controller is further configured to jump to a target instruction in the configuration register and execute the target instruction according to a magnitude relationship between an operation result stored in the cache register and target data;
the target data includes a predetermined value or another operation result stored by the cache register.
6. The apparatus of claim 1, further comprising:
a bus to perform at least one of:
storing the plurality of instructions to the configuration register via the bus, and
storing the data to be operated on to the data register via the bus.
7. The apparatus of claim 1, wherein the number of computation accelerators is at least two;
the compiler is further configured to determine whether at least two operations of the plurality of operations respectively corresponding to the at least two operation accelerators are included, and in response to determining that at least two operations of the plurality of operations respectively corresponding to the at least two operation accelerators are included, perform at least one of:
compiling at least two operations respectively corresponding to the at least two operation accelerators into a first transfer instruction so that the controller responds to the first transfer instruction to transfer the data to be operated respectively corresponding to the at least two operation accelerators to cache registers respectively corresponding to the at least two operation accelerators; and
compiling the at least two operations respectively corresponding to the at least two operation accelerators into a combined instruction, the combined instruction comprising a first transfer instruction and a third transfer instruction executed in sequence, so that the controller, in response to the first transfer instruction, transfers the data to be operated on to the cache register corresponding to one of the operation accelerators, and, in response to the third transfer instruction, transfers the operation result corresponding to the one operation accelerator to the cache register corresponding to another operation accelerator as its data to be operated on.
8. The apparatus of claim 1, wherein the computation accelerator comprises at least one of a CORDIC operator, a multiplier-adder.
9. The apparatus of claim 5, wherein the bus is connected with a host computer for performing at least one of:
causing the host computer to store the plurality of instructions to the configuration register via the bus, and
causing the host computer to store the data to be operated on to the data register via the bus.
CN202211023496.5A 2022-08-25 2022-08-25 Apparatus for accelerating data operation Active CN115113933B (en)

Publications (2)

Publication Number Publication Date
CN115113933A CN115113933A (en) 2022-09-27
CN115113933B (en) 2022-11-15





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant