CN113157636B - Coprocessor, near data processing device and method - Google Patents



Publication number
CN113157636B
Authority
CN
China
Prior art keywords: register, instruction, calculation, decoding, unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110358261.0A
Other languages
Chinese (zh)
Other versions
CN113157636A
Inventor
山蕊
冯雅妮
蒋林
杨博文
高旭
Current Assignee
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202110358261.0A
Publication of CN113157636A
Application granted
Publication of CN113157636B
Legal status: Active


Classifications

    • G06F15/7867 — Architectures of general purpose stored program computers comprising a single central processing unit, with reconfigurable architecture
    • G06F15/781 — System on chip: on-chip cache; off-chip memory
    • G06F7/575 — Basic arithmetic logic units [ALU], i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention relates to the field of chip technology, and in particular to a coprocessor and a near data processing apparatus and method. The coprocessor includes: an instruction register unit, which receives and caches the calculation instructions issued by the main processor and sends each instruction to the decoding and fetching unit upon receiving a decode enable signal; a decoding and fetching unit, which receives and decodes the calculation instruction, reads the operands required for its execution from their storage locations according to the decoding result, and sends the operands to the calculation unit upon receiving a calculation enable signal; a calculation unit, which receives the operands sent by the decoding and fetching unit, performs the corresponding calculation, and writes the result to the corresponding storage address; and an accumulation register, which stores the accumulation of the data written by the calculation unit with the value previously held in the accumulation register. Because the coprocessor reads main-memory data directly, computes on it, and writes the result back, the operating efficiency of the main processor is improved.

Description

Coprocessor, near data processing device and method
Technical Field
The present invention relates to the field of chip technologies, and in particular, to a coprocessor, a near data processing apparatus and a near data processing method.
Background
In a reconfigurable system, the operands of the main processor's calculation instructions all come from registers or from immediates embedded in the instruction. When main-memory data must be processed, the instructions supported by the main processor can only load data from main memory into registers with an LD instruction, or write register values back to main memory with an ST instruction; main-memory data cannot be operated on directly.
The main processor in the prior art therefore suffers from slow data access and low operating efficiency when computing on main-memory data.
These drawbacks await a solution by those skilled in the art.
Disclosure of Invention
(I) Technical problem to be solved
In order to solve the above problems in the prior art, the present invention provides a coprocessor and a near data processing apparatus and method, which address the slow data access and low operating efficiency of prior-art main processors when computing on main-memory data.
(II) Technical scheme
In order to achieve the above purpose, the invention mainly adopts the following technical solutions:
in a first aspect, embodiments of the present application provide a coprocessor comprising an instruction register unit, a decoding and fetching unit, a calculation unit, and an accumulation register;
the instruction register unit is connected with the main processor, and is used for receiving and caching the calculation instruction issued by the main processor, and sending the calculation instruction to the decoding and fetching unit when receiving the decoding enabling signal sent by the decoding and fetching unit;
the decoding and fetching unit is connected with the instruction register unit and is used for receiving the calculation instruction, decoding, reading an operand required by the execution of the calculation instruction from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding and fetching unit and is used for receiving the operand sent by the decoding and fetching unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding and fetching unit and the calculation unit, and is used for storing the accumulation of the data written by the calculation unit with the value previously held in the accumulation register.
Optionally, the locations where operands may be stored include main memory, the general registers, and the accumulation register; when an operand is stored in main memory, the coprocessor accesses the data memory by register-indirect addressing.
Optionally, the communication flow between the instruction register unit and the main processor includes:
when the cached-instruction valid signal output by the instruction register unit is low, or the feedback signal from the decoding fetch unit is high, the valid bit register and the instruction register are enabled;
when the main processor issues a calculation instruction and the issued valid signal and the register enable signal are both high, the instruction register unit receives and caches the calculation instruction;
the instruction register unit then sends a feedback signal to the main processor, indicating that the main processor may issue another calculation instruction to the coprocessor when the next clock cycle arrives.
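The handshake above can be modeled behaviorally. The sketch below is an assumed Python model, not the patented circuit: the signal names (`ir_valid`, `ready_in`, `ready_out`) follow the description in this document, while the class and method names are hypothetical.

```python
# Behavioral sketch (not RTL) of the instruction-register handshake: the unit
# accepts a new instruction when it is idle (ir_valid low) or when the decoder
# reports it has consumed the buffered one (ready_in high).

class InstructionRegisterUnit:
    def __init__(self):
        self.instr = None
        self.ir_valid = False   # cached-instruction-valid flag

    def enable(self, ready_in: bool) -> bool:
        """Register enable: idle (valid low) or decoder finished (ready_in high)."""
        return (not self.ir_valid) or ready_in

    def clock(self, cpu_valid: bool, cpu_instr, ready_in: bool) -> bool:
        """One clock edge; returns the ready_out fed back to the main processor."""
        en = self.enable(ready_in)
        if cpu_valid and en:
            self.instr = cpu_instr      # cache the issued instruction
            self.ir_valid = True
            return True                 # main processor may issue again next cycle
        if ready_in:                    # decoder consumed the buffered instruction
            self.ir_valid = False
        return False
```

In this model, `ready_out` goes high exactly in the cycle where the issued valid signal and the register enable are both high, mirroring the flow described above.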
In a second aspect, embodiments of the present application provide a near data processing apparatus based on a reconfigurable array processor, the apparatus comprising a coprocessor as described in any one of the first aspects above;
the coprocessor is connected with a main processor; the main processor is a reconfigurable array processor, and is used for performing first decoding on the currently issued calculation instruction and, according to the instruction opcode, sending the first decoding result to the coprocessor;
the coprocessor is used for performing second decoding according to a preset field of the calculation instruction, and for reading the data to be processed from a memory according to the second decoding result, wherein the memory is one of the main memory, a general register, and the accumulation register;
the coprocessor is further used for calculating on the data to be processed according to the second decoding result, and for storing the calculation result in a corresponding storage location, wherein the storage location is one of the main memory, a general register, and the accumulation register, determined based on the calculation instruction.
In a third aspect, embodiments of the present application provide a near data processing method based on a reconfigurable array processor, the method including:
the coprocessor receives a first decoding result sent by the main processor after the main processor performs first decoding on the currently issued calculation instruction;
the coprocessor performs second decoding according to a preset field of the calculation instruction and reads the data to be processed from a memory according to the second decoding result, the memory being one of the main memory, a general register, and the accumulation register;
the coprocessor calculates on the data to be processed according to the second decoding result and stores the calculation result in a corresponding storage location, the storage location being one of the main memory, a general register, and the accumulation register, determined based on the calculation instruction.
Optionally, the coprocessor adopts a three-stage pipeline, wherein:
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction register unit;
the second stage comprises receiving the calculation instruction through a decoding fetch unit, decoding, and reading an operand required by the execution of the calculation instruction according to a decoding result;
the third stage includes receiving the operand sent by the decoding fetch unit by the computing unit, executing the computing instruction, and writing the computing result to the corresponding storage location.
Optionally, when the calculation instruction is a convolution calculation in a neural network,
the main memory is used for storing original image data and a convolutional neural network calculation result;
the general register is used for storing operands or intermediate calculation results used in the instruction execution process;
the accumulation register is used for storing the data written by the calculation unit and the accumulation result of the original value in the accumulation register.
Optionally, receiving, by the calculation unit, the operand sent by the decode fetch unit, executing the calculation instruction, and writing the calculation result to the corresponding storage address includes:
carrying out convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculation unit;
accumulating the convolution result with the value previously accumulated in the accumulation register, and storing the sum in the accumulation register;
when the convolution calculation of the current image is completed, the calculation result in the accumulation register is read through the STRM instruction, and the calculation result is written into the main memory or the general register according to the highest bit of the general register field.
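The flow above can be illustrated with a pure-Python model: per-channel convolution results are summed into the accumulation register (the MAC behavior), and an STRM-style read returns the total and clears the register. All names here are my own; this sketches the described dataflow, not the patented datapath.

```python
# Valid (no-padding) 2D convolution of one channel with one kernel.
def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + r][j + c] * kernel[r][c]
                 for r in range(kh) for c in range(kw))
             for j in range(ow)] for i in range(oh)]

class AccumRegister:
    """Models the accumulation register: MAC adds in, STRM reads and clears."""
    def __init__(self):
        self.value = None
    def mac(self, tile):                      # accumulate element-wise
        if self.value is None:
            self.value = [row[:] for row in tile]
        else:
            for i, row in enumerate(tile):
                for j, v in enumerate(row):
                    self.value[i][j] += v
    def strm(self):                           # read the result, then clear
        out, self.value = self.value, None
        return out
```

A usage sketch: convolve each input channel, `mac` the per-channel result into the register, then `strm` the final sum out to main memory or a general register.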
Optionally, when a non-multiply-accumulate instruction is executed and the most significant bit of the register Rd field is 0, the storage location is a general register; when a non-multiply-accumulate instruction is executed and the most significant bit of the register Rd field is 1, the storage location is main memory.
Optionally, the preset field is the Func field. When the coprocessor reads the data to be processed from memory according to the second decoding result and performs the calculation, it judges the type and operand sources of the current calculation instruction from the least significant bit of the Func field, as follows:
when the lowest bit of the Func field of the current calculation instruction is 0: if the highest bit of the register field is 0, the operand of the calculation instruction is read directly from a general register; if the highest bit of the register field is 1, the corresponding location in the data memory is read, using the address stored in the register, to obtain the operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction comes from an immediate, and the other is determined by the highest bit of the register Rs field: if the highest bit of the Rs field is 0, the operand is read directly from a general register; if the highest bit of the Rs field is 1, the corresponding location in the data memory is read, using the address stored in the register, to obtain the operand of the calculation instruction;
when a STRM instruction is executed, the instruction operands come directly from the accumulation registers.
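The three cases can be sketched as a small operand-fetch function. This is an assumed model: `func_lsb` stands for the lowest bit of the Func field, and each 4-bit register field is split into its highest bit (source select) and low 3 bits (register index), as described later for the decoding fetch unit.

```python
# Fetch one operand: register direct (MSB 0) or register-indirect main-memory
# read (MSB 1), where the register holds the main-memory address.
def fetch_operand(reg_msb, reg_index, regs, main_mem):
    if reg_msb == 0:
        return regs[reg_index]                 # operand directly from register
    return main_mem[regs[reg_index]]           # register-indirect main-memory read

def fetch_operands(func_lsb, rs, rt, imm, regs, main_mem, is_strm=False, rm=0):
    """Resolve the (left, right) operand pair for the three cases above."""
    if is_strm:                                # case 3: STRM reads Rm directly
        return (rm, None)
    left = fetch_operand(rs >> 3, rs & 0b111, regs, main_mem)
    if func_lsb == 1:                          # case 2: right operand is the immediate
        return (left, imm)
    return (left, fetch_operand(rt >> 3, rt & 0b111, regs, main_mem))  # case 1
```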
(III) Beneficial effects
The beneficial effects of the invention are as follows: the coprocessor provided by the embodiments of the invention reads main-memory data directly, computes on it, and writes the result back, which increases the access speed of main-memory data and thereby improves the operating efficiency of the main processor.
Furthermore, the near data processing apparatus based on a reconfigurable array processor provided by the embodiments of the invention not only improves the efficiency with which the processor handles main-memory data, but also further improves operation speed and data processing efficiency through the accumulation register provided in the coprocessor.
Drawings
FIG. 1 is a schematic diagram of a coprocessor according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a coprocessor according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating a coprocessor instruction encoding format according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of an instruction register unit according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of a decoding fetch unit according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of a computing unit according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of a near data processing apparatus based on a reconfigurable array processor according to another embodiment of the present invention;
FIG. 8 is a flow chart of a method for near data processing based on a reconfigurable array processor according to another embodiment of the invention.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
All technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
FIG. 1 is a schematic diagram of a coprocessor according to an embodiment of the present invention, as shown in FIG. 1, the coprocessor includes: the device comprises an instruction register unit, a decoding and fetching unit, a calculation unit and an accumulation register;
the instruction register unit is connected with the main processor, and is used for receiving and caching the calculation instruction issued by the main processor, and transmitting the calculation instruction to the decoding and fetching unit when receiving the decoding enabling signal transmitted by the decoding and fetching unit;
the decoding and fetching unit is connected with the instruction register unit and is used for receiving and decoding the calculation instruction, reading an operand required by the execution of the calculation instruction from the operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding and fetching unit and is used for receiving the operand sent by the decoding and fetching unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding and fetching unit and the calculation unit, and is used for storing the accumulation of the data written by the calculation unit with the value previously held in the accumulation register.
The coprocessor shown in fig. 1 also includes a register file, which serves as the coprocessor's general purpose registers.
In the technical scheme provided by the embodiment of the invention shown in fig. 1, the access speed of the main memory data is improved by directly reading the main memory data for calculation and then writing back, so that the operation efficiency of the main processor is improved.
FIG. 2 is a schematic diagram of a coprocessor according to another embodiment of the present invention, and the embodiment shown in FIG. 2 is described in detail below:
in this embodiment, the co-processing includes an instruction register unit (IR), a decode fetch unit (dec-fet), a compute unit (ALU), an accumulation register (Rm), a register file (RegisterFile).
The locations where operands may be stored include main memory, the general registers, and the accumulation register. The main memory stores large amounts of original data or calculation results; when the highest bit of the Rs or Rt register field is 1, the source operand is read from main memory, and when the highest bit of the Rd register field is 1, the instruction's calculation result is written back to main memory.
The general registers store operands or intermediate results used during instruction execution; when the highest bit of the Rs or Rt register field is 0, the source operand is read from a general register, and when the highest bit of the Rd register field is 0, the instruction's calculation result is written back to a general register.
The accumulation register stores the execution result of the multiply-accumulate instruction MAC, accumulated with the data already held in the register; when an STRM instruction is executed to read the accumulation register, the stored data is read out and the accumulation register is then cleared.
When an operand is stored in main memory, the storage space is accessed by register-indirect addressing because of the main memory's large capacity; in this case, the address of the target data in main memory is held in a general register. The accumulation register stores the convolution results. The signal interaction between the units in the figure, and the specific instructions, are described below in the discussion of each unit module.
FIG. 3 is a diagram of a coprocessor instruction encoding format according to another embodiment of the present invention. As shown in FIG. 3, I denotes an immediate-type instruction and non-I a non-immediate-type instruction. The high 6 bits of the 32-bit instruction are the operation code (OP) of the calculation instruction, which distinguishes instructions supported by the main processor from those supported by the coprocessor; when the opcode is 6'b100111, the calculation instruction is issued to the coprocessor for execution.
Bits [25:22] and [21:18] (the numbers in brackets denote bit positions) are the general register Rd field and the general register Rs field respectively, and the [5:0] Func field indicates the calculation type of the instruction. Immediate-type instructions differ from non-immediate-type instructions in whether an immediate is used as a source operand. In an immediate-type instruction, [17:6] holds a 12-bit immediate used as a source operand, and the lowest bit of the Func field is 1; a non-immediate instruction uses [17:14] as the general register Rt field, and the lowest bit of its Func field is 0. The remaining 8 bits of a non-immediate-type instruction are left unfilled and do not participate in the calculation when a non-I-type instruction is executed.
Apart from the lowest bit of the Func field, which determines whether an immediate is used as a source operand, the source of each remaining source operand is determined by the highest bit of its register field: when the highest bit of the register field is 0, the source operand comes from the register; when it is 1, the source operand comes from the main-memory address to which the register points. The execution result of the instruction may be written back to a register or used to update main memory.
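The field layout above can be checked with a small encoder/decoder. Bit positions follow the description (OP [31:26], Rd [25:22], Rs [21:18], immediate [17:6] or Rt [17:14], Func [5:0]); the helper names are my own.

```python
COPROC_OP = 0b100111  # opcode marking instructions issued to the coprocessor

def encode_i_type(rd, rs, imm12, func):
    """Assemble an immediate-type word; forces the Func LSB to 1 (I-type)."""
    return (COPROC_OP << 26) | (rd << 22) | (rs << 18) | (imm12 << 6) | (func | 1)

def decode(word):
    """Split a 32-bit word back into its fields per the described layout."""
    op = (word >> 26) & 0x3F
    fields = {
        "op": op,
        "rd": (word >> 22) & 0xF,
        "rs": (word >> 18) & 0xF,
        "func": word & 0x3F,
        "is_coproc": op == COPROC_OP,
    }
    if fields["func"] & 1:                     # I-type: [17:6] is a 12-bit immediate
        fields["imm"] = (word >> 6) & 0xFFF
    else:                                      # non-I-type: [17:14] is the Rt field
        fields["rt"] = (word >> 14) & 0xF
    return fields
```

Round-tripping a word through `encode_i_type` and `decode` confirms the fields do not overlap and that the Func LSB correctly selects the immediate interpretation.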
Fig. 4 is a schematic diagram of an instruction register unit according to another embodiment of the present invention. As shown in fig. 4, the instruction register unit is composed of two parts: the valid bit register receives the valid signal issued by the main processor, and the instruction register receives the instruction issued by the main processor. When the valid bit register and the instruction register are both empty and the valid signal issued by the main processor is asserted, the instruction received by the instruction register is valid and can be used by the next-stage decoding fetch unit.
The main function of the instruction register unit IR is to cache the instructions issued by the main processor to the coprocessor. After a calculation instruction has been decoded in the main processor, if the instruction's high 6-bit opcode is 6'b100111, the low 26 bits of the instruction and the instruction valid signal are sent to the coprocessor, and the instruction register unit IR in the coprocessor caches them.
The instruction register unit starts receiving a new calculation instruction when it is idle or after the previous instruction has finished decoding, i.e. when the IR_valid signal is low or the post-decoding feedback signal ready_in is high; the valid bit register and the instruction register are then enabled and can buffer the instruction and valid signal issued by the main processor. When the valid signal issued by the main processor and the register enable signal are both high, the instruction register unit is receiving the instruction issued by the main processor and will finish buffering in the next clock cycle; at that moment a feedback signal ready_out is therefore sent to the main processor, indicating that the main processor may issue another calculation instruction to the coprocessor when the next clock cycle arrives. Table 1 is the instruction register unit interface information specification table.
TABLE 1
Fig. 5 is a schematic diagram of a decoding fetch unit according to another embodiment of the present invention. As shown in fig. 5, the decoding fetch unit is composed of decoding control, selectors (MUX), registers, and a feedback circuit. The decoding control unit decodes each field of the instruction; the two selectors, using the decoding result as their select signals, choose the two operands required by the instruction's calculation, which are latched in a register for one cycle and then output to the next-stage calculation unit; the feedback circuit controls the pipeline as a whole.
The main function of the decoding fetch unit dec_fet is to decode the instruction and obtain the source operands required for its execution according to the decoding result. When the decoding fetch unit is idle or has completed decoding and fetching for the current calculation instruction, the instruction register unit IR sends it the next instruction for second decoding, and it reads all source operands required for execution from the corresponding addresses according to the decoding result. Once all operands required by the calculation instruction are ready, i.e. the decode-and-fetch operation is complete, a ready feedback signal is sent to the upper-stage instruction register unit, indicating that decoding and fetching for the current instruction are finished and the next instruction can be decoded. At the same time, the decoding fetch unit sends the operands-ready valid signal and the source operands to the next-stage calculation unit ALU for instruction execution.
The type and operand sources of the current calculation instruction are judged from the least significant bit of the Func field, and fall into the following three cases:
(1) when the lowest bit of the instruction Func field is 0, if the highest bit of the register field is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
(2) when the lowest order bit of the instruction Func field is 1, one of the operands of the calculation instruction comes from the immediate and the other operand source is determined by the highest order bit of the Rs field of the register, as in (1);
(3) when executing a STRM instruction, the instruction operands come directly from the accumulation registers Rm.
In the decoding fetch unit, the En enable signal indicates that the decoding fetch unit is idle or has completed decoding and fetching for the current instruction, so that the next calculation instruction can be decoded. When all of an instruction's operands come from general registers, the accumulation register, or an immediate in the instruction, the decoding fetch unit completes the fetch immediately, the En enable signal is pulled high, and the next calculation instruction is received and decoded in the next clock cycle. When one or both operands come from main memory, the corresponding register is selected by the low 3 bits of the Rs or Rt field, the data stored in that register is used as the main-memory address, and the En enable signal is driven low; En is not pulled high again until the destination data returned from main memory and its valid feedback signal are received, completing the fetch for the instruction.
When the En enable signal is valid, the two operands required by the instruction are assigned to the left data L_Data and the right data R_Data respectively, and the left-data fetch-complete valid signal L_V and the right-data fetch-complete valid signal R_V are set valid. The ALU_O_V signal indicates that all operands of the instruction are ready. For an instruction that needs only left data, such as NOT, the left data valid signal L_V is assigned directly to ALU_O_V; for an instruction that needs only right data, such as LUI, the right data valid signal R_V is assigned directly to ALU_O_V; for an instruction that needs both operands, ALU_O_V is valid only when L_V and R_V are valid simultaneously. After decoding and fetching are complete, the left data L_Data, the right data R_Data, and the operands-ready signal ALU_O_V are output to the ALU calculation unit. When the instruction valid signal IR_valid from the upper-stage instruction register unit and the enable signal En are valid at the same time, a feedback signal rdy_out_fetch is sent to the upper-stage instruction register unit, indicating that the current calculation instruction has completed decoding and fetching and a new calculation instruction can be received. Table 2 is the decoding fetch unit interface information specification table.
TABLE 2
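The ALU_O_V generation described above can be captured in a few lines. Names mirror the text (L_V, R_V, ALU_O_V); the `needs_*` flags are an assumed encoding of which operands the instruction uses.

```python
# Operands-ready signal: depends on which operands the instruction needs.
def alu_o_v(l_v: bool, r_v: bool, needs_left: bool, needs_right: bool) -> bool:
    if needs_left and needs_right:
        return l_v and r_v        # e.g. ADD: both fetches must have completed
    if needs_left:
        return l_v                # e.g. NOT: left data only
    return r_v                    # e.g. LUI: right data only
```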
Because some calculation instructions require two operands from main memory while the main memory has only one set of read/write access ports, requests issued to main memory simultaneously are arbitrated at the top level: one request is served first, its feedback signal and destination data are cached, and then the other access request is served. After both access requests have been served, the destination data and feedback signals from the two accesses are returned to the decoding fetch unit together. When a read access to main memory, a general register, or the accumulation register collides with a write access, the data being written is output directly to the decoding fetch unit as the read data, which improves access speed while guaranteeing access correctness.
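The arbitration and read/write forwarding just described can be sketched behaviorally. This is a hypothetical model: the function names and the `pending_write` representation are my own, chosen only to show serialization over one port and write-data forwarding.

```python
# Serialize two read requests over a single read port; the first request wins
# arbitration and its result is held until the second returns.
def arbitrate(port_read, addr_a, addr_b):
    data_a = port_read(addr_a)      # first request served, result cached
    data_b = port_read(addr_b)      # second request served after the first
    return data_a, data_b

# If a write to the same address is in flight, forward the written data as the
# read data instead of going to memory (the collision case described above).
def read_with_forwarding(mem, addr, pending_write):
    if pending_write is not None and pending_write[0] == addr:
        return pending_write[1]
    return mem[addr]
```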
Fig. 6 is a schematic diagram of a calculation unit according to another embodiment of the present invention. As shown in fig. 6, the calculation unit is composed of registers, a selector MUX, an arithmetic logic unit ALU, and a feedback circuit. The register receives the destination register field Rd and the function field Func transmitted by the decoding and fetching unit, delays them by one clock beat, and uses them as control signals of the selector to select the destination address to which the ALU calculation result is written back; the feedback circuit controls the pipeline.
The main function of the calculation unit ALU is to execute an instruction and write the execution result back to the destination address. When the calculation unit is idle or the calculation of the previous instruction has finished, after receiving from the upper-stage decoding and fetching unit dec_fet the valid signal indicating that all source operands of the instruction are ready, it executes the corresponding calculation according to the high 5 bits of the Func field of the instruction. After the instruction calculation is completed, a ready feedback signal is sent to the upper-stage decoding and fetching unit, indicating that the current instruction has finished calculating and the calculation unit can execute the next instruction. Table 3 lists the calculation types corresponding to different Func fields.
TABLE 3

Calculation type                   Func[5:1]
Addition ADD                       00001
Subtraction SUB                    00010
Bitwise AND                        00011
Bitwise OR                         00100
Bitwise NOT                        00101
Left shift SLL                     00110
Right shift SRL                    00111
Multiplication MUL                 01000
Multiply-accumulate MAC            01001
Immediate left shift LUI           01010
Set less than SLT                  01011
Read accumulation register STRM    01100
After the calculation is completed, the calculation result may be written back to a general register, the accumulation register, or the main memory, which is determined by the highest bit of the Rd field of the executed calculation instruction. When a multiply-accumulate instruction MAC is executed, the calculation result is written back into the accumulation register Rm; when a non-MAC instruction is executed and the highest bit of the Rd field is 0, the calculation result is written back to the Rd register; when a non-MAC instruction is executed and the highest bit of the Rd field is 1, the calculation result is written back to the main memory location corresponding to the address stored in the Rd register.
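The write-back selection above reduces to a small decision on the instruction type and the most significant bit of Rd. The sketch below is illustrative; the 4-bit Rd field width is an assumption for the example, and the string return values merely name the three destinations from the text.

```python
def writeback_destination(is_mac, rd_field, rd_width=4):
    """Pick the write-back target from the instruction type and the Rd MSB.

    MAC results always go to the accumulation register Rm; otherwise the
    most significant bit of Rd selects between a general register (0)
    and a main-memory location addressed indirectly through Rd (1).
    """
    if is_mac:
        return "accumulation_register"
    msb = (rd_field >> (rd_width - 1)) & 1
    return "main_memory" if msb else "general_register"

assert writeback_destination(True, 0b0000) == "accumulation_register"
assert writeback_destination(False, 0b0011) == "general_register"   # MSB = 0
assert writeback_destination(False, 0b1011) == "main_memory"        # MSB = 1
```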
In the calculation unit, the enable signal En indicates that the calculation unit is idle or has finished calculating the current instruction and written back the result; when En is valid, the calculation unit can receive new operands input by the upper-stage decoding and fetching unit, then calculate and write back. When the left data L_Data and right data R_Data input by the upper-stage decoding and fetching unit and the operands-ready signal ALU_O_V are received, the calculation unit immediately completes the corresponding calculation according to the Func field delayed by one beat, outputs the calculation result ALU_R, and generates an enable signal for updating the general register, the accumulation register, or the main memory. When the general register or the accumulation register is updated with the calculation result, the write access is completed in the clock cycle following the update enable; when the main memory is updated with the calculation result, the corresponding register is accessed according to the lower 3 bits of the Rd field delayed by one beat, the data stored in that register is used as the destination address for updating the main memory, the main-memory update enable signal and the calculation result are sent out simultaneously, and the enable signal En is pulled low and is not set valid again until the feedback signal indicating that the update is complete is received from the main memory. When the operands-ready signal ALU_O_V input by the upper-stage decoding and fetching unit and the enable signal En are valid at the same time, a feedback signal rdy_out_alu is sent to the upper-stage decoding and fetching unit, indicating that the current instruction has completed the calculation and write-back operations. The calculation unit interface information is shown in Table 4.
In convolution calculation, the multiply-accumulate operation stores and accumulates the products of each pixel in a block region with the convolution kernel. With the multiply-accumulate instruction MAC designed for this characteristic, after an input-image pixel is multiplied by a convolution-kernel element, the result is written into the accumulation register Rm and automatically accumulated with the result of the next MAC instruction. When the convolution of this region of the input image is completed and its accumulated result is needed as the input image of the next convolution or pooling layer, the accumulated convolution result of the region is output by executing the STRM instruction. Executing the STRM instruction reads out the data in the accumulation register, outputs it to a general register or to the main memory according to the highest bit of the Rd field, and clears the accumulation register in the clock cycle following the read-out, ensuring that the convolution accumulation of other regions of the input image is not affected by the result of the previous region.
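The MAC/STRM behavior of the accumulation register can be summarized with a toy model. The class below is a behavioral sketch only, not the hardware implementation; the method names simply mirror the two instruction mnemonics from the description.

```python
class AccumulationRegister:
    """Toy model of the accumulation register Rm driven by MAC and STRM.

    MAC adds a new pixel-weight product to the running sum; STRM reads
    the sum out and clears the register so that the next image region
    starts accumulating from zero, as the description requires.
    """
    def __init__(self):
        self.value = 0

    def mac(self, pixel, weight):
        self.value += pixel * weight    # multiply-accumulate into Rm

    def strm(self):
        result = self.value             # read out the accumulated sum
        self.value = 0                  # cleared in the next cycle
        return result

acc = AccumulationRegister()
acc.mac(2, 3)                 # Rm = 6
acc.mac(4, 5)                 # Rm = 26
assert acc.strm() == 26       # STRM returns the region's result...
assert acc.value == 0         # ...and the register is reset for the next region
```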
Table 4 is a calculation unit interface information specification table.
TABLE 4
FIG. 7 is a schematic diagram of a near data processing apparatus based on a reconfigurable array processor according to another embodiment of the present invention. As shown in FIG. 7, the apparatus may be used for any data access to and computation on a main memory, and includes a coprocessor according to any of the above embodiments;
the coprocessor is connected with the main processor; the main processor is a reconfigurable array processor and is used for performing first decoding on the currently issued calculation instruction and, according to the instruction operation code, sending the first decoding result to the coprocessor;
the coprocessor is used for performing secondary decoding according to a preset field of the calculation instruction and reading the data to be processed from the main memory according to the secondary decoding result;
the coprocessor is further used for calculating the data to be processed according to the secondary decoding result and storing the calculation result into the main memory.
It should be noted that the storage locations of both the data to be processed and the calculation result are determined by the calculation instruction; each may be the main memory, a general register, or the accumulation register, which is not further elaborated in the present application.
In this embodiment, the coprocessor is controlled by the main processor at the top layer. When the coprocessor executes an instruction, the main processor PE first decodes the instruction and determines from the instruction operation code whether the currently issued instruction is a calculation instruction to be executed in the coprocessor. After the calculation instruction is issued to the coprocessor, secondary decoding is performed according to the Func field of the instruction, and corresponding control signals are generated according to the decoding result to complete the corresponding calculation.
This embodiment can be used for convolution calculation of a neural network. Exploiting the high parallelism of the array processor, the neural network is mapped onto the array structure in a computation-parallel manner, which improves the calculation speed of the neural network; however, data in the main memory still needs to be read into a register before it can be processed. To further improve the calculation speed of the network, the coprocessor is combined with the array processor: a multiply-accumulate instruction and an accumulation register are designed in the coprocessor specifically for the calculation characteristics of convolution in a neural network, so that pixels of the input feature map from the main memory and elements of the convolution kernel can be multiplied directly and accumulated with the result of the previous convolution in the accumulation register, and finally the result in the accumulation register is read out through the STRM instruction.
By means of the accumulation register arranged in the coprocessor, the near data processing apparatus based on the reconfigurable array processor not only improves the access speed of main memory data, but also improves the operation speed of the convolutional neural network and the efficiency of image processing.
FIG. 8 is a flow chart of a method for near data processing based on a reconfigurable array processor according to still another embodiment of the present invention, as shown in FIG. 8, the method includes:
the coprocessor receives a first decoding result sent by the main processor after performing first decoding on the currently issued calculation instruction;
the coprocessor performs secondary decoding according to a preset field of the calculation instruction, reads data to be processed from a memory according to a secondary decoding result, and the memory is one of a main memory, a general register and an accumulation register;
the coprocessor calculates the data to be processed according to the secondary decoding result and stores the calculation result into a corresponding storage location, wherein the storage location is one of the main memory, a general register, and the accumulation register, determined based on the calculation instruction.
In this embodiment, the coprocessor may employ a three-stage pipeline; wherein:
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction register unit;
the second stage comprises receiving and decoding the calculation instruction through a decoding and fetching unit, and reading the operands required for executing the calculation instruction according to the decoding result; the operands may be stored in the main memory, a general register, or the dedicated accumulation register; when the calculation instruction is a convolution calculation in a neural network, the operands include the convolution kernel, the channel feature map, and the result of the previous convolution calculation, the result of the previous convolution calculation being stored in the accumulation register.
The third stage comprises receiving the operand sent by the decoding and fetching unit through the computing unit, executing a computing instruction, and writing a computing result to a corresponding storage position; wherein the storage location comprises a main memory, a general purpose register, or an accumulation register.
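The three stages above can be sketched end to end for a single instruction stream. The model below is a deliberate simplification: the En/rdy handshakes are abstracted away and the stages run back-to-back rather than overlapped, so it illustrates only the stage ordering; the toy register file and instruction tuples are assumptions for the example.

```python
def run_pipeline(instructions):
    """Run (op, src_a, src_b, dst) tuples through the three stages.

    Stage 1 buffers the instruction (here: the loop takes it from the
    list), stage 2 decodes it and fetches operands from a toy register
    file, stage 3 executes and writes the result back.
    """
    regs = {"r1": 2, "r2": 3}                  # assumed initial register file
    results = []
    for op, a, b, dst in instructions:         # stage 1: instruction register
        lhs, rhs = regs[a], regs[b]            # stage 2: decode and fetch
        if op == "ADD":                        # stage 3: execute ...
            regs[dst] = lhs + rhs
        elif op == "MUL":
            regs[dst] = lhs * rhs
        results.append(regs[dst])              # ... and write back
    return results

assert run_pipeline([("ADD", "r1", "r2", "r3"),
                     ("MUL", "r1", "r2", "r4")]) == [5, 6]
```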
In this embodiment, when the calculation instruction is a convolution calculation in a neural network, receiving, by the calculation unit, the operands sent by the decoding and fetching unit, executing the calculation instruction, and writing the calculation result to the corresponding storage address comprises:
performing, by the calculation unit, the convolution calculation of the current channel feature map with the convolution kernel;
accumulating the convolution calculation result with the result of the previous convolution in the accumulation register, and storing the accumulated result in the accumulation register;
when the convolution calculation of the current image is completed, reading the calculation result in the accumulation register through the STRM instruction, and writing the calculation result to the corresponding storage address according to the highest bit of the register field.
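The steps above amount to a multiply-accumulate loop followed by a read-and-clear. The following plain-Python sketch shows the dataflow of one output region under that MAC/STRM pattern; it is not the coprocessor's actual instruction stream, and the nested-list layout of the region and kernel is an assumption for the example.

```python
def convolve_region(region, kernel):
    """Compute one convolution output using the MAC/STRM pattern.

    Each pixel-weight product is accumulated MAC-style into a running
    accumulator, and the final STRM-style read returns the region's
    result while resetting the accumulator to zero.
    """
    acc = 0                                    # accumulation register starts at 0
    for row_px, row_k in zip(region, kernel):
        for px, k in zip(row_px, row_k):
            acc += px * k                      # MAC: multiply and accumulate
    result, acc = acc, 0                       # STRM: read out and clear
    return result

# 2x2 region against an identity-like kernel: 1*1 + 2*0 + 3*0 + 4*1 = 5.
assert convolve_region([[1, 2], [3, 4]], [[1, 0], [0, 1]]) == 5
```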
In this embodiment, when a non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 0, the storage location is a general register;
when a non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 1, the storage location is the main memory.
In this embodiment, the preset field is the Func field. The coprocessor reads the data to be processed from the corresponding storage location according to the secondary decoding result and, during calculation, determines the type and the operand source of the current calculation instruction according to the least significant bit of the Func field, as follows:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the register field is 0, the operand of the calculation instruction is read directly from the general register; if the highest bit of the register field is 1, the corresponding location in the data memory is read according to the address stored in the register to obtain the operand of the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction comes from an immediate, and the other is determined by the highest bit of the register Rs field: if the highest bit of the Rs field is 0, the operand is read directly from the general register; if it is 1, the corresponding location in the data memory is read according to the address stored in the register to obtain the operand;
when executing a STRM instruction, the instruction operands come directly from the accumulation registers Rm.
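The operand-source rules above can be condensed into one selection function. The sketch below is illustrative only: the string return values name the sources described in the text, and passing `None` for the field MSB to mark the immediate slot is a convention invented for this example, not part of the patent's encoding.

```python
def operand_source(func_lsb, field_msb, is_strm=False):
    """Resolve where an operand comes from, per the decode rules above.

    STRM reads the accumulation register directly.  Otherwise the
    lowest Func bit says whether this operand slot holds an immediate
    (modelled here as field_msb=None), and the highest bit of the
    register field selects a general register (0) or an indirect
    main-memory access through that register (1).
    """
    if is_strm:
        return "accumulation_register"
    if func_lsb == 1 and field_msb is None:
        return "immediate"                  # the operand is the immediate itself
    return "main_memory_indirect" if field_msb else "general_register"

assert operand_source(0, 0) == "general_register"
assert operand_source(0, 1) == "main_memory_indirect"
assert operand_source(1, None) == "immediate"
assert operand_source(0, 0, is_strm=True) == "accumulation_register"
```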
By adopting the above near data processing apparatus based on the reconfigurable array processor, the method improves both the operation speed and the data processing efficiency.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A near data processing apparatus based on a reconfigurable array processor, the apparatus comprising a coprocessor; the coprocessor includes: the device comprises an instruction register unit, a decoding and fetching unit, a calculation unit and an accumulation register;
the instruction register unit is connected with the main processor, and is used for receiving and caching the calculation instruction issued by the main processor, and sending the calculation instruction to the decoding and fetching unit when receiving the decoding enabling signal sent by the decoding and fetching unit;
the decoding and fetching unit is connected with the instruction register unit and is used for receiving the calculation instruction, decoding, reading an operand required by the execution of the calculation instruction from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding and fetching unit and is used for receiving the operand sent by the decoding and fetching unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding and fetching unit and the calculating unit and is used for storing the data written by the calculating unit and the accumulation result of the original value in the accumulation register;
the coprocessor is connected with a main processor, the main processor is a reconfigurable array processor and is used for performing first decoding on a currently issued calculation instruction and sending a first decoding result to the coprocessor according to an instruction operation code;
the coprocessor is used for performing secondary decoding according to a preset field of the calculation instruction, and reading data to be processed from a memory according to a secondary decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
the coprocessor is used for calculating the data to be processed according to the second decoding result, and storing the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register which are determined based on the calculation instruction;
the preset field is a Func field, the coprocessor reads data to be processed from the memory according to a second decoding result, and when calculating, the coprocessor judges the type and operand source of the current calculation instruction according to the least significant bit of the Func field, and the method comprises the following steps:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the general register field is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain an operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction is from an immediate number, the other operand is determined by the highest bit of the Rs field of the register, and if the highest bit of the Rs field of the register is 0, the operand of the calculation instruction is directly read from a general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
when a STRM instruction is executed, the instruction operands come directly from the accumulation registers.
2. A near data processing device based on a reconfigurable array processor according to claim 1,
the operand storage position comprises a main memory, a general register and the accumulation register; when the storage position of the operand is the main memory, the coprocessor accesses the data storage device in a register indirect addressing mode.
3. A near data processing device based on a reconfigurable array processor according to claim 1,
the communication flow between the instruction register unit and the main processor comprises the following steps:
when the cache calculation instruction effective signal output by the instruction register unit is in a low level or the feedback signal of the decoding fetch unit is in a high level, enabling the effective bit register and the instruction register to be effective;
the main processor sends a calculation instruction, an effective signal and a register enabling signal issued by the main processor are high at the same time, and the instruction register unit receives and caches the calculation instruction;
the instruction register unit sends a feedback output signal to the main processor, wherein the feedback output signal indicates that the main processor can issue a calculation instruction to the coprocessor again when the next clock cycle arrives.
4. A method of near data processing based on a reconfigurable array processor, the method comprising:
the coprocessor receives a first decoding result sent after the main processor decodes the currently issued calculation instruction for the first time;
the coprocessor performs secondary decoding according to a preset field of the calculation instruction, reads data to be processed from a memory according to a secondary decoding result, and the memory is one of a main memory, a general register and an accumulation register;
the coprocessor calculates the data to be processed according to the second decoding result and stores the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register which are determined based on the calculation instruction;
the preset field is a Func field, the coprocessor reads data to be processed from the memory according to a second decoding result, and when calculating, the coprocessor judges the type and operand source of the current calculation instruction according to the least significant bit of the Func field, and the method comprises the following steps:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the general register field is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain an operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction is from an immediate number, the other operand is determined by the highest bit of the Rs field of the register, and if the highest bit of the Rs field of the register is 0, the operand of the calculation instruction is directly read from a general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
when a STRM instruction is executed, the instruction operands come directly from the accumulation registers.
5. The near data processing method of claim 4, wherein the coprocessor adopts a three-stage pipeline mode; wherein:
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction register unit;
the second stage comprises receiving the calculation instruction through a decoding fetch unit, decoding, and reading an operand required by the execution of the calculation instruction according to a decoding result;
the third stage includes receiving the operand sent by the decoding fetch unit by the computing unit, executing the computing instruction, and writing the computing result to the corresponding storage location.
6. The near data processing method of claim 5, wherein, when the calculation instruction is a convolution calculation in a neural network,
the main memory is used for storing original image data and a convolutional neural network calculation result;
the general register is used for storing operands or intermediate calculation results used in the instruction execution process;
the accumulation register is used for storing the data written by the calculation unit and the accumulation result of the original value in the accumulation register.
7. The near data processing method of claim 6, wherein receiving, by a computing unit, the operand sent by the decode fetch unit and executing the computing instruction, and writing a result of the computation to a corresponding memory address, comprises:
carrying out convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculation unit;
accumulating the convolution calculation result and the calculation result of the last convolution in an accumulation register, and storing the accumulation result in the accumulation register;
when the convolution calculation of the current image is completed, the calculation result in the accumulation register is read through the STRM instruction, and the calculation result is written into the main memory or the general register according to the highest bit of the general register field.
8. The method of claim 7, wherein,
when the non-multiply accumulate instruction is executed and the highest bit of the Rd field of the register is 0, the storage location is the general register,
when the non-multiply accumulate instruction is executed and the most significant bit of the register Rd field is 1, the storage location is main memory.
CN202110358261.0A 2021-04-01 2021-04-01 Coprocessor, near data processing device and method Active CN113157636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110358261.0A CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110358261.0A CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Publications (2)

Publication Number Publication Date
CN113157636A CN113157636A (en) 2021-07-23
CN113157636B true CN113157636B (en) 2023-07-18

Family

ID=76886130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110358261.0A Active CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Country Status (1)

Country Link
CN (1) CN113157636B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743599B (en) * 2021-08-08 2023-08-25 苏州浪潮智能科技有限公司 Computing device and server of convolutional neural network
CN113702700B (en) * 2021-09-01 2022-08-19 上海交通大学 Special integrated circuit for calculating electric energy and electric energy quality parameters
CN116610362B (en) * 2023-04-27 2024-02-23 合芯科技(苏州)有限公司 Method, system, equipment and storage medium for decoding instruction set of processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004005738A (en) * 1998-03-11 2004-01-08 Matsushita Electric Ind Co Ltd Data processor, and instruction set expansion method
CN111159094A (en) * 2019-12-05 2020-05-15 天津芯海创科技有限公司 RISC-V based near data stream type calculation acceleration array

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62145340A (en) * 1985-12-20 1987-06-29 Toshiba Corp Cache memory control system
JP2672532B2 (en) * 1987-11-20 1997-11-05 株式会社日立製作所 Coprocessor system
JPH06309270A (en) * 1993-04-22 1994-11-04 Fujitsu Ltd Interruption control circuit built in dpram
JP2987308B2 (en) * 1995-04-28 1999-12-06 松下電器産業株式会社 Information processing device
EP2267597A3 (en) * 1999-05-12 2012-01-04 Analog Devices, Inc. Digital signal processor having a pipeline structure
JP2011138308A (en) * 2009-12-28 2011-07-14 Sony Corp Processor, coprocessor, information processing system, and control method in them
JP6075157B2 (en) * 2013-03-28 2017-02-08 富士通株式会社 Calculation method, calculation program, and calculation apparatus
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN111930426A (en) * 2020-08-14 2020-11-13 西安邮电大学 Reconfigurable computing dual-mode instruction set architecture and application method thereof
CN112099762B (en) * 2020-09-10 2024-03-12 上海交通大学 Synergistic processing system and method for rapidly realizing SM2 cryptographic algorithm
CN112181496A (en) * 2020-09-30 2021-01-05 中国电力科学研究院有限公司 AI extended instruction execution method and device based on open source instruction set processor, storage medium and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004005738A (en) * 1998-03-11 2004-01-08 Matsushita Electric Ind Co Ltd Data processor, and instruction set expansion method
CN111159094A (en) * 2019-12-05 2020-05-15 天津芯海创科技有限公司 RISC-V based near data stream type calculation acceleration array

Also Published As

Publication number Publication date
CN113157636A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113157636B (en) Coprocessor, near data processing device and method
CN109522254B (en) Arithmetic device and method
US7339592B2 (en) Simulating multiported memories using lower port count memories
JP3871883B2 (en) Method for calculating indirect branch targets
US6216219B1 (en) Microprocessor circuits, systems, and methods implementing a load target buffer with entries relating to prefetch desirability
US7707397B2 (en) Variable group associativity branch target address cache delivering multiple target addresses per cache line
EP1399824B1 (en) Using type bits to track storage of ecc and predecode bits in a level two cache
JP4195006B2 (en) Instruction cache way prediction for jump targets
US5848432A (en) Data processor with variable types of cache memories
US7746348B2 (en) Multi-thread graphics processing system
US7234040B2 (en) Program-directed cache prefetching for media processors
KR100586057B1 (en) Using ecc/parity bits to store predecode information
US5761720A (en) Pixel engine pipeline processor data caching mechanism
JPH10124391A (en) Processor and method for executing store convergence by merged store operation
JPH10187533A (en) Cache system, processor, and method for operating processor
US11301250B2 (en) Data prefetching auxiliary circuit, data prefetching method, and microprocessor
US5898852A (en) Load instruction steering in a dual data cache microarchitecture
US6460132B1 (en) Massively parallel instruction predecoding
CN115563027B (en) Method, system and device for executing stock instruction
CN117453594A (en) Data transmission device and method
US20070233963A1 (en) Data processing system and method for processing data
US5951671A (en) Sharing instruction predecode information in a multiprocessor system
US20230267079A1 (en) Processing apparatus, method and system for executing data processing on a plurality of channels
CN113886286B (en) Two-dimensional structure compatible data reading and writing system and method
US7430657B2 (en) System, method and device for queuing branch predictions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant