CN113157636B - Coprocessor, near data processing device and method - Google Patents



Publication number
CN113157636B
Authority
CN
China
Prior art keywords: register, instruction, calculation, decoding, unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110358261.0A
Other languages
Chinese (zh)
Other versions
CN113157636A
Inventor
山蕊
冯雅妮
蒋林
杨博文
高旭
Current Assignee
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202110358261.0A
Publication of CN113157636A
Application granted
Publication of CN113157636B
Legal status: Active


Classifications

    • G06F15/7867 — Architectures of general purpose stored program computers comprising a single central processing unit, with reconfigurable architecture
    • G06F15/781 — System on chip: on-chip cache; off-chip memory
    • G06F7/575 — Basic arithmetic logic units [ALU], i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention relates to the field of chip technology, and in particular to a coprocessor and a near data processing apparatus and method. The coprocessor includes: an instruction register unit, which receives and caches the calculation instructions issued by the main processor and sends each instruction to the decoding and fetching unit upon receiving a decode enable signal; a decoding and fetching unit, which receives and decodes the calculation instruction, reads the operands required for its execution from their storage locations according to the decoding result, and sends the operands to the calculation unit upon receiving a calculation enable signal; a calculation unit, which receives the operands sent by the decoding and fetching unit, performs the corresponding calculation, and writes the result to the corresponding storage address; and an accumulation register, which stores the accumulation of the data written by the calculation unit with the value previously held in the accumulation register. Because the coprocessor reads main-memory data directly, computes on it, and writes the result back, the operating efficiency of the main processor is improved.

Description

Coprocessor, near data processing device and method
Technical Field
The present invention relates to the field of chip technologies, and in particular, to a coprocessor, a near data processing apparatus and a near data processing method.
Background
In a reconfigurable system, the operands of the main processor's calculation instructions all come from registers or from immediates embedded in the instruction. When main-memory data must be processed, the instructions supported by the main processor can only load data from main memory into registers with an LD instruction, or write register values back to main memory with an ST instruction; main-memory data cannot be operated on directly.
The main processor in the prior art therefore suffers from slow data access and low operating efficiency when computing on main-memory data.
These drawbacks await a solution by those skilled in the art.
Disclosure of Invention
(I) Technical problem to be solved
In order to solve the above problems in the prior art, the present invention provides a coprocessor and a near data processing apparatus and method, which address the slow data access and low operating efficiency of prior-art main processors when computing on main-memory data.
(II) Technical scheme
In order to achieve the above purpose, the invention mainly adopts the following technical solutions:
in a first aspect, embodiments of the present application provide a coprocessor comprising an instruction register unit, a decoding and fetching unit, a calculation unit, and an accumulation register;
the instruction register unit is connected with the main processor, and is used for receiving and caching the calculation instruction issued by the main processor, and sending the calculation instruction to the decoding and fetching unit when receiving the decoding enabling signal sent by the decoding and fetching unit;
the decoding and fetching unit is connected with the instruction register unit and is used for receiving the calculation instruction, decoding, reading an operand required by the execution of the calculation instruction from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding and fetching unit and is used for receiving the operand sent by the decoding and fetching unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding and fetching unit and the calculation unit, and is used for storing the accumulation of the data written by the calculation unit with the value previously held in the accumulation register.
Optionally, the locations where operands may be stored include main memory, the general registers, and the accumulation register; when an operand is stored in main memory, the coprocessor accesses the data memory by register-indirect addressing.
Optionally, the communication flow between the instruction register unit and the main processor includes:
when the cached-instruction valid signal output by the instruction register unit is low, or the feedback signal from the decoding fetch unit is high, the valid bit register and the instruction register are enabled;
when the main processor issues a calculation instruction and the issued valid signal and the register enable signal are both high, the instruction register unit receives and caches the calculation instruction;
the instruction register unit then sends a feedback signal to the main processor, indicating that the main processor may issue another calculation instruction to the coprocessor when the next clock cycle arrives.
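The handshake above can be modeled behaviorally. The sketch below is an assumed Python model, not the patented circuit: the signal names (`ir_valid`, `ready_in`, `ready_out`) follow the description in this document, while the class and method names are hypothetical.

```python
# Behavioral sketch (not RTL) of the instruction-register handshake: the unit
# accepts a new instruction when it is idle (ir_valid low) or when the decoder
# reports it has consumed the buffered one (ready_in high).

class InstructionRegisterUnit:
    def __init__(self):
        self.instr = None
        self.ir_valid = False   # cached-instruction-valid flag

    def enable(self, ready_in: bool) -> bool:
        """Register enable: idle (valid low) or decoder finished (ready_in high)."""
        return (not self.ir_valid) or ready_in

    def clock(self, cpu_valid: bool, cpu_instr, ready_in: bool) -> bool:
        """One clock edge; returns the ready_out fed back to the main processor."""
        en = self.enable(ready_in)
        if cpu_valid and en:
            self.instr = cpu_instr      # cache the issued instruction
            self.ir_valid = True
            return True                 # main processor may issue again next cycle
        if ready_in:                    # decoder consumed the buffered instruction
            self.ir_valid = False
        return False
```

In this model, `ready_out` goes high exactly in the cycle where the issued valid signal and the register enable are both high, mirroring the flow described above.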
In a second aspect, embodiments of the present application provide a near data processing apparatus based on a reconfigurable array processor, the apparatus comprising a coprocessor as described in any one of the first aspects above;
the coprocessor is connected with a main processor; the main processor is a reconfigurable array processor, and is used for performing first decoding on the currently issued calculation instruction and, according to the instruction opcode, sending the first decoding result to the coprocessor;
the coprocessor is used for performing second decoding according to a preset field of the calculation instruction, and for reading the data to be processed from a memory according to the second decoding result, wherein the memory is one of the main memory, a general register, and the accumulation register;
the coprocessor is further used for calculating on the data to be processed according to the second decoding result, and for storing the calculation result in a corresponding storage location, wherein the storage location is one of the main memory, a general register, and the accumulation register, determined based on the calculation instruction.
In a third aspect, embodiments of the present application provide a near data processing method based on a reconfigurable array processor, the method including:
the coprocessor receives a first decoding result sent by the main processor after the main processor performs first decoding on the currently issued calculation instruction;
the coprocessor performs second decoding according to a preset field of the calculation instruction and reads the data to be processed from a memory according to the second decoding result, the memory being one of the main memory, a general register, and the accumulation register;
the coprocessor calculates on the data to be processed according to the second decoding result and stores the calculation result in a corresponding storage location, the storage location being one of the main memory, a general register, and the accumulation register, determined based on the calculation instruction.
Optionally, the coprocessor adopts a three-stage pipeline, wherein:
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction register unit;
the second stage comprises receiving the calculation instruction through a decoding fetch unit, decoding, and reading an operand required by the execution of the calculation instruction according to a decoding result;
the third stage includes receiving the operand sent by the decoding fetch unit by the computing unit, executing the computing instruction, and writing the computing result to the corresponding storage location.
Optionally, when the calculation instruction is a convolution calculation in a neural network,
the main memory is used for storing original image data and a convolutional neural network calculation result;
the general register is used for storing operands or intermediate calculation results used in the instruction execution process;
the accumulation register is used for storing the data written by the calculation unit and the accumulation result of the original value in the accumulation register.
Optionally, receiving, by the calculation unit, the operand sent by the decode fetch unit, executing the calculation instruction, and writing the calculation result to the corresponding storage address includes:
carrying out convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculation unit;
accumulating the convolution result with the value previously accumulated in the accumulation register, and storing the sum in the accumulation register;
when the convolution calculation of the current image is completed, the calculation result in the accumulation register is read through the STRM instruction, and the calculation result is written into the main memory or the general register according to the highest bit of the general register field.
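The flow above can be illustrated with a pure-Python model: per-channel convolution results are summed into the accumulation register (the MAC behavior), and an STRM-style read returns the total and clears the register. All names here are my own; this sketches the described dataflow, not the patented datapath.

```python
# Valid (no-padding) 2D convolution of one channel with one kernel.
def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + r][j + c] * kernel[r][c]
                 for r in range(kh) for c in range(kw))
             for j in range(ow)] for i in range(oh)]

class AccumRegister:
    """Models the accumulation register: MAC adds in, STRM reads and clears."""
    def __init__(self):
        self.value = None
    def mac(self, tile):                      # accumulate element-wise
        if self.value is None:
            self.value = [row[:] for row in tile]
        else:
            for i, row in enumerate(tile):
                for j, v in enumerate(row):
                    self.value[i][j] += v
    def strm(self):                           # read the result, then clear
        out, self.value = self.value, None
        return out
```

A usage sketch: convolve each input channel, `mac` the per-channel result into the register, then `strm` the final sum out to main memory or a general register.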
Optionally, when a non-multiply-accumulate instruction is executed and the most significant bit of the register Rd field is 0, the storage location is a general register; when a non-multiply-accumulate instruction is executed and the most significant bit of the register Rd field is 1, the storage location is main memory.
Optionally, the preset field is the Func field. When the coprocessor reads the data to be processed from memory according to the second decoding result and performs the calculation, it judges the type and operand sources of the current calculation instruction from the least significant bit of the Func field, as follows:
when the lowest bit of the Func field of the current calculation instruction is 0: if the highest bit of the register field is 0, the operand of the calculation instruction is read directly from a general register; if the highest bit of the register field is 1, the corresponding location in the data memory is read, using the address stored in the register, to obtain the operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction comes from an immediate, and the other is determined by the highest bit of the register Rs field: if the highest bit of the Rs field is 0, the operand is read directly from a general register; if the highest bit of the Rs field is 1, the corresponding location in the data memory is read, using the address stored in the register, to obtain the operand of the calculation instruction;
when a STRM instruction is executed, the instruction operands come directly from the accumulation registers.
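The three cases can be sketched as a small operand-fetch function. This is an assumed model: `func_lsb` stands for the lowest bit of the Func field, and each 4-bit register field is split into its highest bit (source select) and low 3 bits (register index), as described later for the decoding fetch unit.

```python
# Fetch one operand: register direct (MSB 0) or register-indirect main-memory
# read (MSB 1), where the register holds the main-memory address.
def fetch_operand(reg_msb, reg_index, regs, main_mem):
    if reg_msb == 0:
        return regs[reg_index]                 # operand directly from register
    return main_mem[regs[reg_index]]           # register-indirect main-memory read

def fetch_operands(func_lsb, rs, rt, imm, regs, main_mem, is_strm=False, rm=0):
    """Resolve the (left, right) operand pair for the three cases above."""
    if is_strm:                                # case 3: STRM reads Rm directly
        return (rm, None)
    left = fetch_operand(rs >> 3, rs & 0b111, regs, main_mem)
    if func_lsb == 1:                          # case 2: right operand is the immediate
        return (left, imm)
    return (left, fetch_operand(rt >> 3, rt & 0b111, regs, main_mem))  # case 1
```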
(III) Beneficial effects
The beneficial effects of the invention are as follows: the coprocessor provided by the embodiments of the invention reads main-memory data directly, computes on it, and writes the result back, which increases the access speed of main-memory data and thereby improves the operating efficiency of the main processor.
Furthermore, the near data processing apparatus based on a reconfigurable array processor provided by the embodiments of the invention not only improves the efficiency with which the processor handles main-memory data, but also further improves operation speed and data processing efficiency through the accumulation register provided in the coprocessor.
Drawings
FIG. 1 is a schematic diagram of a coprocessor according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a coprocessor according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating a coprocessor instruction encoding format according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of an instruction register unit according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of a decoding fetch unit according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of a computing unit according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of a near data processing apparatus based on a reconfigurable array processor according to another embodiment of the present invention;
FIG. 8 is a flow chart of a method for near data processing based on a reconfigurable array processor according to another embodiment of the invention.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
All technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
FIG. 1 is a schematic diagram of a coprocessor according to an embodiment of the present invention, as shown in FIG. 1, the coprocessor includes: the device comprises an instruction register unit, a decoding and fetching unit, a calculation unit and an accumulation register;
the instruction register unit is connected with the main processor, and is used for receiving and caching the calculation instruction issued by the main processor, and transmitting the calculation instruction to the decoding and fetching unit when receiving the decoding enabling signal transmitted by the decoding and fetching unit;
the decoding and fetching unit is connected with the instruction register unit and is used for receiving and decoding the calculation instruction, reading an operand required by the execution of the calculation instruction from the operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding and fetching unit and is used for receiving the operand sent by the decoding and fetching unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding and fetching unit and the calculation unit, and is used for storing the accumulation of the data written by the calculation unit with the value previously held in the accumulation register.
The coprocessor shown in fig. 1 also includes a register file, which serves as the coprocessor's general purpose registers.
In the technical scheme provided by the embodiment of the invention shown in fig. 1, the access speed of the main memory data is improved by directly reading the main memory data for calculation and then writing back, so that the operation efficiency of the main processor is improved.
FIG. 2 is a schematic diagram of a coprocessor according to another embodiment of the present invention, and the embodiment shown in FIG. 2 is described in detail below:
in this embodiment, the co-processing includes an instruction register unit (IR), a decode fetch unit (dec-fet), a compute unit (ALU), an accumulation register (Rm), a register file (RegisterFile).
The locations where operands may be stored include main memory, the general registers, and the accumulation register. The main memory stores large amounts of original data or calculation results; when the highest bit of the Rs or Rt register field is 1, the source operand is read from main memory, and when the highest bit of the Rd register field is 1, the instruction's calculation result is written back to main memory.
The general registers store operands or intermediate results used during instruction execution; when the highest bit of the Rs or Rt register field is 0, the source operand is read from a general register, and when the highest bit of the Rd register field is 0, the instruction's calculation result is written back to a general register.
The accumulation register stores the execution result of the multiply-accumulate instruction MAC, accumulated with the data already held in the register; when an STRM instruction is executed to read the accumulation register, the stored data is read out and the accumulation register is then cleared.
When an operand is stored in main memory, the storage space is accessed by register-indirect addressing because of the main memory's large capacity; in this case, the address of the target data in main memory is held in a general register. The accumulation register stores the convolution results. The signal interaction between the units in the figure, and the specific instructions, are described below in the discussion of each unit module.
FIG. 3 is a diagram of a coprocessor instruction encoding format according to another embodiment of the present invention. As shown in FIG. 3, I denotes an immediate-type instruction and non-I a non-immediate-type instruction. The high 6 bits of the 32-bit instruction are the operation code (OP) of the calculation instruction, which distinguishes instructions supported by the main processor from those supported by the coprocessor; when the opcode is 6'b100111, the calculation instruction is issued to the coprocessor for execution.
Bits [25:22] and [21:18] (the numbers in brackets denote bit positions) are the general register Rd field and the general register Rs field respectively, and the [5:0] Func field indicates the calculation type of the instruction. Immediate-type instructions differ from non-immediate-type instructions in whether an immediate is used as a source operand. In an immediate-type instruction, [17:6] holds a 12-bit immediate used as a source operand, and the lowest bit of the Func field is 1; a non-immediate instruction uses [17:14] as the general register Rt field, and the lowest bit of its Func field is 0. The remaining 8 bits of a non-immediate-type instruction are left unfilled and do not participate in the calculation when a non-I-type instruction is executed.
Apart from the lowest bit of the Func field, which determines whether an immediate is used as a source operand, the source of each remaining source operand is determined by the highest bit of its register field: when the highest bit of the register field is 0, the source operand comes from the register; when it is 1, the source operand comes from the main-memory address to which the register points. The execution result of the instruction may be written back to a register or used to update main memory.
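The field layout above can be checked with a small encoder/decoder. Bit positions follow the description (OP [31:26], Rd [25:22], Rs [21:18], immediate [17:6] or Rt [17:14], Func [5:0]); the helper names are my own.

```python
COPROC_OP = 0b100111  # opcode marking instructions issued to the coprocessor

def encode_i_type(rd, rs, imm12, func):
    """Assemble an immediate-type word; forces the Func LSB to 1 (I-type)."""
    return (COPROC_OP << 26) | (rd << 22) | (rs << 18) | (imm12 << 6) | (func | 1)

def decode(word):
    """Split a 32-bit word back into its fields per the described layout."""
    op = (word >> 26) & 0x3F
    fields = {
        "op": op,
        "rd": (word >> 22) & 0xF,
        "rs": (word >> 18) & 0xF,
        "func": word & 0x3F,
        "is_coproc": op == COPROC_OP,
    }
    if fields["func"] & 1:                     # I-type: [17:6] is a 12-bit immediate
        fields["imm"] = (word >> 6) & 0xFFF
    else:                                      # non-I-type: [17:14] is the Rt field
        fields["rt"] = (word >> 14) & 0xF
    return fields
```

Round-tripping a word through `encode_i_type` and `decode` confirms the fields do not overlap and that the Func LSB correctly selects the immediate interpretation.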
Fig. 4 is a schematic diagram of an instruction register unit according to another embodiment of the present invention. As shown in fig. 4, the instruction register unit is composed of two parts: the valid bit register receives the valid signal issued by the main processor, and the instruction register receives the instruction issued by the main processor. When the valid bit register and the instruction register are both empty and the valid signal issued by the main processor is asserted, the instruction received by the instruction register is valid and can be used by the next-stage decoding fetch unit.
The main function of the instruction register unit IR is to cache the instructions issued by the main processor to the coprocessor. After a calculation instruction has been decoded in the main processor, if the instruction's high 6-bit opcode is 6'b100111, the low 26 bits of the instruction and the instruction valid signal are sent to the coprocessor, and the instruction register unit IR in the coprocessor caches them.
The instruction register unit starts receiving a new calculation instruction when it is idle or after the previous instruction has finished decoding, i.e. when the IR_valid signal is low or the post-decoding feedback signal ready_in is high; the valid bit register and the instruction register are then enabled and can buffer the instruction and valid signal issued by the main processor. When the valid signal issued by the main processor and the register enable signal are both high, the instruction register unit is receiving the instruction issued by the main processor and will finish buffering in the next clock cycle; at that moment a feedback signal ready_out is therefore sent to the main processor, indicating that the main processor may issue another calculation instruction to the coprocessor when the next clock cycle arrives. Table 1 is the instruction register unit interface information specification table.
TABLE 1
Fig. 5 is a schematic diagram of a decoding fetch unit according to another embodiment of the present invention. As shown in fig. 5, the decoding fetch unit is composed of decoding control, selectors (MUX), registers, and a feedback circuit. The decoding control unit decodes each field of the instruction; the two selectors, using the decoding result as their select signals, choose the two operands required by the instruction's calculation, which are latched in a register for one cycle and then output to the next-stage calculation unit; the feedback circuit controls the pipeline as a whole.
The main function of the decoding fetch unit dec_fet is to decode the instruction and obtain the source operands required for its execution according to the decoding result. When the decoding fetch unit is idle or has completed decoding and fetching for the current calculation instruction, the instruction register unit IR sends it the next instruction for second decoding, and it reads all source operands required for execution from the corresponding addresses according to the decoding result. Once all operands required by the calculation instruction are ready, i.e. the decode-and-fetch operation is complete, a ready feedback signal is sent to the upper-stage instruction register unit, indicating that decoding and fetching for the current instruction are finished and the next instruction can be decoded. At the same time, the decoding fetch unit sends the operands-ready valid signal and the source operands to the next-stage calculation unit ALU for instruction execution.
The type and operand sources of the current calculation instruction are judged from the least significant bit of the Func field, and fall into the following three cases:
(1) when the lowest bit of the instruction Func field is 0, if the highest bit of the register field is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
(2) when the lowest order bit of the instruction Func field is 1, one of the operands of the calculation instruction comes from the immediate and the other operand source is determined by the highest order bit of the Rs field of the register, as in (1);
(3) when executing a STRM instruction, the instruction operands come directly from the accumulation registers Rm.
In the decoding fetch unit, the En enable signal indicates that the decoding fetch unit is idle or has completed decoding and fetching for the current instruction, so that the next calculation instruction can be decoded. When all of an instruction's operands come from general registers, the accumulation register, or an immediate in the instruction, the decoding fetch unit completes the fetch immediately, the En enable signal is pulled high, and the next calculation instruction is received and decoded in the next clock cycle. When one or both operands come from main memory, the corresponding register is selected by the low 3 bits of the Rs or Rt field, the data stored in that register is used as the main-memory address, and the En enable signal is driven low; En is not pulled high again until the destination data returned from main memory and its valid feedback signal are received, completing the fetch for the instruction.
When the En enable signal is valid, the two operands required by the instruction are assigned to the left data L_Data and the right data R_Data respectively, and the left-data fetch-complete valid signal L_V and the right-data fetch-complete valid signal R_V are set valid. The ALU_O_V signal indicates that all operands of the instruction are ready. For an instruction that needs only left data, such as NOT, the left data valid signal L_V is assigned directly to ALU_O_V; for an instruction that needs only right data, such as LUI, the right data valid signal R_V is assigned directly to ALU_O_V; for an instruction that needs both operands, ALU_O_V is valid only when L_V and R_V are valid simultaneously. After decoding and fetching are complete, the left data L_Data, the right data R_Data, and the operands-ready signal ALU_O_V are output to the ALU calculation unit. When the instruction valid signal IR_valid from the upper-stage instruction register unit and the enable signal En are valid at the same time, a feedback signal rdy_out_fetch is sent to the upper-stage instruction register unit, indicating that the current calculation instruction has completed decoding and fetching and a new calculation instruction can be received. Table 2 is the decoding fetch unit interface information specification table.
TABLE 2
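The ALU_O_V generation described above can be captured in a few lines. Names mirror the text (L_V, R_V, ALU_O_V); the `needs_*` flags are an assumed encoding of which operands the instruction uses.

```python
# Operands-ready signal: depends on which operands the instruction needs.
def alu_o_v(l_v: bool, r_v: bool, needs_left: bool, needs_right: bool) -> bool:
    if needs_left and needs_right:
        return l_v and r_v        # e.g. ADD: both fetches must have completed
    if needs_left:
        return l_v                # e.g. NOT: left data only
    return r_v                    # e.g. LUI: right data only
```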
Because some calculation instructions require two operands from main memory while the main memory has only one set of read/write access ports, requests issued to main memory simultaneously are arbitrated at the top level: one request is served first, its feedback signal and destination data are cached, and then the other access request is served. After both access requests have been served, the destination data and feedback signals from the two accesses are returned to the decoding fetch unit together. When a read access to main memory, a general register, or the accumulation register collides with a write access, the data being written is output directly to the decoding fetch unit as the read data, which improves access speed while guaranteeing access correctness.
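The arbitration and read/write forwarding just described can be sketched behaviorally. This is a hypothetical model: the function names and the `pending_write` representation are my own, chosen only to show serialization over one port and write-data forwarding.

```python
# Serialize two read requests over a single read port; the first request wins
# arbitration and its result is held until the second returns.
def arbitrate(port_read, addr_a, addr_b):
    data_a = port_read(addr_a)      # first request served, result cached
    data_b = port_read(addr_b)      # second request served after the first
    return data_a, data_b

# If a write to the same address is in flight, forward the written data as the
# read data instead of going to memory (the collision case described above).
def read_with_forwarding(mem, addr, pending_write):
    if pending_write is not None and pending_write[0] == addr:
        return pending_write[1]
    return mem[addr]
```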
Fig. 6 is a schematic diagram of a calculation unit according to another embodiment of the present invention. As shown in fig. 6, the calculation unit is composed of registers, a selector MUX, an arithmetic logic unit ALU, and a feedback circuit. The register receives the destination register field Rd and the function field Func transmitted by the decoding and fetching unit, delays them by one clock beat, and uses them as control signals of the selector to select the destination address to which the ALU calculation result is written back; the feedback circuit controls the pipeline.
The main function of the calculation unit ALU is to execute an instruction and write the execution result back to the destination address. When the calculation unit is idle or the calculation of the previous instruction has finished, after receiving from the upper-stage decoding and fetching unit dec_fet the valid signal indicating that all source operands of the instruction are ready, it executes the corresponding calculation according to the high 5 bits of the Func field of the instruction. After the instruction calculation is completed, a ready feedback signal is sent to the upper-stage decoding and fetching unit, indicating that the current instruction has finished calculating and the calculation unit can execute the next instruction. Table 3 lists the calculation types corresponding to different Func fields.
TABLE 3

Calculation type                   Func[5:1]
Addition ADD                       00001
Subtraction SUB                    00010
Bitwise AND                        00011
Bitwise OR                         00100
Bitwise NOT                        00101
Left shift SLL                     00110
Right shift SRL                    00111
Multiplication MUL                 01000
Multiply-accumulate MAC            01001
Immediate left shift LUI           01010
Set less than SLT                  01011
Read accumulation register STRM    01100
After the calculation is completed, the calculation result may be written back to a general register, the accumulation register, or the main memory, which is determined by the highest bit of the Rd field of the executed calculation instruction. When a multiply-accumulate instruction MAC is executed, the calculation result is written back into the accumulation register Rm; when a non-MAC instruction is executed and the highest bit of the Rd field is 0, the calculation result is written back to the Rd register; when a non-MAC instruction is executed and the highest bit of the Rd field is 1, the calculation result is written back to the main memory location corresponding to the address stored in the Rd register.
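The write-back selection above reduces to a small decision on the instruction type and the most significant bit of Rd. The sketch below is illustrative; the 4-bit Rd field width is an assumption for the example, and the string return values merely name the three destinations from the text.

```python
def writeback_destination(is_mac, rd_field, rd_width=4):
    """Pick the write-back target from the instruction type and the Rd MSB.

    MAC results always go to the accumulation register Rm; otherwise the
    most significant bit of Rd selects between a general register (0)
    and a main-memory location addressed indirectly through Rd (1).
    """
    if is_mac:
        return "accumulation_register"
    msb = (rd_field >> (rd_width - 1)) & 1
    return "main_memory" if msb else "general_register"

assert writeback_destination(True, 0b0000) == "accumulation_register"
assert writeback_destination(False, 0b0011) == "general_register"   # MSB = 0
assert writeback_destination(False, 0b1011) == "main_memory"        # MSB = 1
```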
In the calculation unit, the enable signal En indicates that the calculation unit is idle or has finished calculating the current instruction and written back the result; when En is valid, the calculation unit can receive new operands input by the upper-stage decoding and fetching unit, then calculate and write back. When the left data L_Data and right data R_Data input by the upper-stage decoding and fetching unit and the operands-ready signal ALU_O_V are received, the calculation unit immediately completes the corresponding calculation according to the Func field delayed by one beat, outputs the calculation result ALU_R, and generates an enable signal for updating the general register, the accumulation register, or the main memory. When the general register or the accumulation register is updated with the calculation result, the write access is completed in the clock cycle following the update enable; when the main memory is updated with the calculation result, the corresponding register is accessed according to the lower 3 bits of the Rd field delayed by one beat, the data stored in that register is used as the destination address for updating the main memory, the main-memory update enable signal and the calculation result are sent out simultaneously, and the enable signal En is pulled low and is not set valid again until the feedback signal indicating that the update is complete is received from the main memory. When the operands-ready signal ALU_O_V input by the upper-stage decoding and fetching unit and the enable signal En are valid at the same time, a feedback signal rdy_out_alu is sent to the upper-stage decoding and fetching unit, indicating that the current instruction has completed the calculation and write-back operations. The calculation unit interface information is shown in Table 4.
In convolution calculation, the multiply-accumulate operation stores and accumulates the products of each pixel in a block region with the convolution kernel. With the multiply-accumulate instruction MAC designed for this characteristic, after an input-image pixel is multiplied by a convolution-kernel element, the result is written into the accumulation register Rm and automatically accumulated with the result of the next MAC instruction. When the convolution of this region of the input image is completed and its accumulated result is needed as the input image of the next convolution or pooling layer, the accumulated convolution result of the region is output by executing the STRM instruction. Executing the STRM instruction reads out the data in the accumulation register, outputs it to a general register or to the main memory according to the highest bit of the Rd field, and clears the accumulation register in the clock cycle following the read-out, ensuring that the convolution accumulation of other regions of the input image is not affected by the result of the previous region.
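The MAC/STRM behavior of the accumulation register can be summarized with a toy model. The class below is a behavioral sketch only, not the hardware implementation; the method names simply mirror the two instruction mnemonics from the description.

```python
class AccumulationRegister:
    """Toy model of the accumulation register Rm driven by MAC and STRM.

    MAC adds a new pixel-weight product to the running sum; STRM reads
    the sum out and clears the register so that the next image region
    starts accumulating from zero, as the description requires.
    """
    def __init__(self):
        self.value = 0

    def mac(self, pixel, weight):
        self.value += pixel * weight    # multiply-accumulate into Rm

    def strm(self):
        result = self.value             # read out the accumulated sum
        self.value = 0                  # cleared in the next cycle
        return result

acc = AccumulationRegister()
acc.mac(2, 3)                 # Rm = 6
acc.mac(4, 5)                 # Rm = 26
assert acc.strm() == 26       # STRM returns the region's result...
assert acc.value == 0         # ...and the register is reset for the next region
```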
Table 4 is a calculation unit interface information specification table.
TABLE 4
FIG. 7 is a schematic diagram of a near data processing apparatus based on a reconfigurable array processor according to another embodiment of the present invention. As shown in FIG. 7, the apparatus may be used for any data access to and computation on a main memory, and includes a coprocessor according to any of the above embodiments;
the coprocessor is connected with the main processor; the main processor is a reconfigurable array processor and is used for performing first decoding on the currently issued calculation instruction and, according to the instruction operation code, sending the first decoding result to the coprocessor;
the coprocessor is used for performing secondary decoding according to a preset field of the calculation instruction and reading the data to be processed from the main memory according to the secondary decoding result;
the coprocessor is further used for calculating the data to be processed according to the secondary decoding result and storing the calculation result into the main memory.
It should be noted that the storage locations of both the data to be processed and the calculation result are determined by the calculation instruction; each may be the main memory, a general register, or the accumulation register, which is not further elaborated in the present application.
In this embodiment, the coprocessor is controlled by the main processor at the top layer. When the coprocessor executes an instruction, the main processor PE first decodes the instruction and determines from the instruction operation code whether the currently issued instruction is a calculation instruction to be executed in the coprocessor. After the calculation instruction is issued to the coprocessor, secondary decoding is performed according to the Func field of the instruction, and corresponding control signals are generated according to the decoding result to complete the corresponding calculation.
This embodiment can be used for convolution calculation of a neural network. Exploiting the high parallelism of the array processor, the neural network is mapped onto the array structure in a computation-parallel manner, which improves the calculation speed of the neural network; however, data in the main memory still needs to be read into a register before it can be processed. To further improve the calculation speed of the network, the coprocessor is combined with the array processor: a multiply-accumulate instruction and an accumulation register are designed in the coprocessor specifically for the calculation characteristics of convolution in a neural network, so that pixels of the input feature map from the main memory and elements of the convolution kernel can be multiplied directly and accumulated with the result of the previous convolution in the accumulation register, and finally the result in the accumulation register is read out through the STRM instruction.
By means of the accumulation register arranged in the coprocessor, the near data processing apparatus based on the reconfigurable array processor not only improves the access speed of main memory data, but also improves the operation speed of the convolutional neural network and the efficiency of image processing.
FIG. 8 is a flow chart of a method for near data processing based on a reconfigurable array processor according to still another embodiment of the present invention, as shown in FIG. 8, the method includes:
the coprocessor receives a first decoding result sent by the main processor after performing first decoding on the currently issued calculation instruction;
the coprocessor performs secondary decoding according to a preset field of the calculation instruction, reads data to be processed from a memory according to a secondary decoding result, and the memory is one of a main memory, a general register and an accumulation register;
the coprocessor calculates the data to be processed according to the secondary decoding result and stores the calculation result into a corresponding storage location, wherein the storage location is one of the main memory, a general register, and the accumulation register, determined based on the calculation instruction.
In this embodiment, the coprocessor may employ a three-stage pipeline; wherein:
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction register unit;
the second stage comprises receiving and decoding the calculation instruction through a decoding and fetching unit, and reading the operands required for executing the calculation instruction according to the decoding result; the operands may be stored in the main memory, a general register, or the dedicated accumulation register; when the calculation instruction is a convolution calculation in a neural network, the operands include the convolution kernel, the channel feature map, and the result of the previous convolution calculation, the result of the previous convolution calculation being stored in the accumulation register.
The third stage comprises receiving the operand sent by the decoding and fetching unit through the computing unit, executing a computing instruction, and writing a computing result to a corresponding storage position; wherein the storage location comprises a main memory, a general purpose register, or an accumulation register.
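The three stages above can be sketched end to end for a single instruction stream. The model below is a deliberate simplification: the En/rdy handshakes are abstracted away and the stages run back-to-back rather than overlapped, so it illustrates only the stage ordering; the toy register file and instruction tuples are assumptions for the example.

```python
def run_pipeline(instructions):
    """Run (op, src_a, src_b, dst) tuples through the three stages.

    Stage 1 buffers the instruction (here: the loop takes it from the
    list), stage 2 decodes it and fetches operands from a toy register
    file, stage 3 executes and writes the result back.
    """
    regs = {"r1": 2, "r2": 3}                  # assumed initial register file
    results = []
    for op, a, b, dst in instructions:         # stage 1: instruction register
        lhs, rhs = regs[a], regs[b]            # stage 2: decode and fetch
        if op == "ADD":                        # stage 3: execute ...
            regs[dst] = lhs + rhs
        elif op == "MUL":
            regs[dst] = lhs * rhs
        results.append(regs[dst])              # ... and write back
    return results

assert run_pipeline([("ADD", "r1", "r2", "r3"),
                     ("MUL", "r1", "r2", "r4")]) == [5, 6]
```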
In this embodiment, when the calculation instruction is a convolution calculation in a neural network, receiving, by the calculation unit, the operands sent by the decoding and fetching unit, executing the calculation instruction, and writing the calculation result to the corresponding storage address comprises:
performing, by the calculation unit, the convolution calculation of the current channel feature map with the convolution kernel;
accumulating the convolution calculation result with the result of the previous convolution in the accumulation register, and storing the accumulated result in the accumulation register;
when the convolution calculation of the current image is completed, reading the calculation result in the accumulation register through the STRM instruction, and writing the calculation result to the corresponding storage address according to the highest bit of the register field.
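The steps above amount to a multiply-accumulate loop followed by a read-and-clear. The following plain-Python sketch shows the dataflow of one output region under that MAC/STRM pattern; it is not the coprocessor's actual instruction stream, and the nested-list layout of the region and kernel is an assumption for the example.

```python
def convolve_region(region, kernel):
    """Compute one convolution output using the MAC/STRM pattern.

    Each pixel-weight product is accumulated MAC-style into a running
    accumulator, and the final STRM-style read returns the region's
    result while resetting the accumulator to zero.
    """
    acc = 0                                    # accumulation register starts at 0
    for row_px, row_k in zip(region, kernel):
        for px, k in zip(row_px, row_k):
            acc += px * k                      # MAC: multiply and accumulate
    result, acc = acc, 0                       # STRM: read out and clear
    return result

# 2x2 region against an identity-like kernel: 1*1 + 2*0 + 3*0 + 4*1 = 5.
assert convolve_region([[1, 2], [3, 4]], [[1, 0], [0, 1]]) == 5
```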
In this embodiment, when a non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 0, the storage location is a general register;
when a non-multiply-accumulate instruction is executed and the highest bit of the register Rd field is 1, the storage location is the main memory.
In this embodiment, the preset field is the Func field. The coprocessor reads the data to be processed from the corresponding storage location according to the secondary decoding result and, during calculation, determines the type and the operand source of the current calculation instruction according to the least significant bit of the Func field, as follows:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the register field is 0, the operand of the calculation instruction is read directly from the general register; if the highest bit of the register field is 1, the corresponding location in the data memory is read according to the address stored in the register to obtain the operand of the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction comes from an immediate, and the other is determined by the highest bit of the register Rs field: if the highest bit of the Rs field is 0, the operand is read directly from the general register; if it is 1, the corresponding location in the data memory is read according to the address stored in the register to obtain the operand;
when executing a STRM instruction, the instruction operands come directly from the accumulation registers Rm.
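The operand-source rules above can be condensed into one selection function. The sketch below is illustrative only: the string return values name the sources described in the text, and passing `None` for the field MSB to mark the immediate slot is a convention invented for this example, not part of the patent's encoding.

```python
def operand_source(func_lsb, field_msb, is_strm=False):
    """Resolve where an operand comes from, per the decode rules above.

    STRM reads the accumulation register directly.  Otherwise the
    lowest Func bit says whether this operand slot holds an immediate
    (modelled here as field_msb=None), and the highest bit of the
    register field selects a general register (0) or an indirect
    main-memory access through that register (1).
    """
    if is_strm:
        return "accumulation_register"
    if func_lsb == 1 and field_msb is None:
        return "immediate"                  # the operand is the immediate itself
    return "main_memory_indirect" if field_msb else "general_register"

assert operand_source(0, 0) == "general_register"
assert operand_source(0, 1) == "main_memory_indirect"
assert operand_source(1, None) == "immediate"
assert operand_source(0, 0, is_strm=True) == "accumulation_register"
```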
By adopting the above near data processing apparatus based on the reconfigurable array processor, the method improves both the operation speed and the data processing efficiency.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A near data processing apparatus based on a reconfigurable array processor, the apparatus comprising a coprocessor; the coprocessor includes: the device comprises an instruction register unit, a decoding and fetching unit, a calculation unit and an accumulation register;
the instruction register unit is connected with the main processor, and is used for receiving and caching the calculation instruction issued by the main processor, and sending the calculation instruction to the decoding and fetching unit when receiving the decoding enabling signal sent by the decoding and fetching unit;
the decoding and fetching unit is connected with the instruction register unit and is used for receiving the calculation instruction, decoding, reading an operand required by the execution of the calculation instruction from an operand storage position according to a decoding result, and sending the operand to the calculation unit when receiving a calculation enabling signal sent by the calculation unit;
the computing unit is connected with the decoding and fetching unit and is used for receiving the operand sent by the decoding and fetching unit, executing corresponding computing operation and writing a computing result to a corresponding storage address;
the accumulation register is connected with the decoding and fetching unit and the calculating unit and is used for storing the data written by the calculating unit and the accumulation result of the original value in the accumulation register;
the coprocessor is connected with a main processor, the main processor is a reconfigurable array processor and is used for performing first decoding on a currently issued calculation instruction and sending a first decoding result to the coprocessor according to an instruction operation code;
the coprocessor is used for performing secondary decoding according to a preset field of the calculation instruction, and reading data to be processed from a memory according to a secondary decoding result, wherein the memory is one of a main memory, a general register and an accumulation register;
the coprocessor is used for calculating the data to be processed according to the second decoding result, and storing the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register which are determined based on the calculation instruction;
the preset field is a Func field, the coprocessor reads data to be processed from the memory according to a second decoding result, and when calculating, the coprocessor judges the type and operand source of the current calculation instruction according to the least significant bit of the Func field, and the method comprises the following steps:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the general register field is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain an operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction is from an immediate number, the other operand is determined by the highest bit of the Rs field of the register, and if the highest bit of the Rs field of the register is 0, the operand of the calculation instruction is directly read from a general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
when a STRM instruction is executed, the instruction operands come directly from the accumulation registers.
2. A near data processing device based on a reconfigurable array processor according to claim 1,
the operand storage position comprises a main memory, a general register and the accumulation register; when the storage position of the operand is the main memory, the coprocessor accesses the data storage device in a register indirect addressing mode.
3. A near data processing device based on a reconfigurable array processor according to claim 1,
the communication flow between the instruction register unit and the main processor comprises the following steps:
when the cache calculation instruction effective signal output by the instruction register unit is in a low level or the feedback signal of the decoding fetch unit is in a high level, enabling the effective bit register and the instruction register to be effective;
the main processor sends a calculation instruction, an effective signal and a register enabling signal issued by the main processor are high at the same time, and the instruction register unit receives and caches the calculation instruction;
the instruction register unit sends a feedback output signal to the main processor, wherein the feedback output signal indicates that the main processor can issue a calculation instruction to the coprocessor again when the next clock cycle arrives.
4. A method of near data processing based on a reconfigurable array processor, the method comprising:
the coprocessor receives a first decoding result sent after the main processor decodes the currently issued calculation instruction for the first time;
the coprocessor performs secondary decoding according to a preset field of the calculation instruction, reads data to be processed from a memory according to a secondary decoding result, and the memory is one of a main memory, a general register and an accumulation register;
the coprocessor calculates the data to be processed according to the second decoding result and stores the calculated result into a corresponding storage position, wherein the storage position is one of a main memory, a general register and an accumulation register which are determined based on the calculation instruction;
the preset field is a Func field, the coprocessor reads data to be processed from the memory according to a second decoding result, and when calculating, the coprocessor judges the type and operand source of the current calculation instruction according to the least significant bit of the Func field, and the method comprises the following steps:
when the lowest bit of the Func field of the current calculation instruction is 0, if the highest bit of the general register field is 0, directly reading the operand of the calculation instruction from the general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain an operand required by the calculation instruction;
when the lowest bit of the Func field of the current calculation instruction is 1, one operand of the calculation instruction is from an immediate number, the other operand is determined by the highest bit of the Rs field of the register, and if the highest bit of the Rs field of the register is 0, the operand of the calculation instruction is directly read from a general register; if the highest bit of the register field is 1, reading the corresponding position in the data storage according to the address stored in the register to obtain the operand of the calculation instruction;
when a STRM instruction is executed, the instruction operands come directly from the accumulation registers.
5. The near data processing method of claim 4, wherein the coprocessor adopts a three-stage pipeline mode; wherein:
the first stage comprises receiving and caching a calculation instruction issued by the main processor through an instruction register unit;
the second stage comprises receiving the calculation instruction through a decoding fetch unit, decoding, and reading an operand required by the execution of the calculation instruction according to a decoding result;
the third stage includes receiving the operand sent by the decoding fetch unit by the computing unit, executing the computing instruction, and writing the computing result to the corresponding storage location.
6. The near data processing method of claim 5, wherein, when the calculation instruction is a convolution calculation in a neural network,
the main memory is used for storing original image data and a convolutional neural network calculation result;
the general register is used for storing operands or intermediate calculation results used in the instruction execution process;
the accumulation register is used for storing the data written by the calculation unit and the accumulation result of the original value in the accumulation register.
7. The near data processing method of claim 6, wherein receiving, by a computing unit, the operand sent by the decode fetch unit and executing the computing instruction, and writing a result of the computation to a corresponding memory address, comprises:
carrying out convolution calculation of the current channel characteristic diagram and a convolution kernel through a calculation unit;
accumulating the convolution calculation result and the calculation result of the last convolution in an accumulation register, and storing the accumulation result in the accumulation register;
when the convolution calculation of the current image is completed, the calculation result in the accumulation register is read through the STRM instruction, and the calculation result is written into the main memory or the general register according to the highest bit of the general register field.
8. The method of claim 7, wherein,
when the non-multiply accumulate instruction is executed and the highest bit of the Rd field of the register is 0, the storage location is the general register,
when the non-multiply accumulate instruction is executed and the most significant bit of the register Rd field is 1, the storage location is main memory.
CN202110358261.0A 2021-04-01 2021-04-01 Coprocessor, near data processing device and method Active CN113157636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110358261.0A CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110358261.0A CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Publications (2)

Publication Number Publication Date
CN113157636A CN113157636A (en) 2021-07-23
CN113157636B true CN113157636B (en) 2023-07-18

Family

ID=76886130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110358261.0A Active CN113157636B (en) 2021-04-01 2021-04-01 Coprocessor, near data processing device and method

Country Status (1)

Country Link
CN (1) CN113157636B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743599B (en) * 2021-08-08 2023-08-25 苏州浪潮智能科技有限公司 Computing device and server of convolutional neural network
CN113702700B (en) * 2021-09-01 2022-08-19 上海交通大学 Special integrated circuit for calculating electric energy and electric energy quality parameters
CN116610362B (en) * 2023-04-27 2024-02-23 合芯科技(苏州)有限公司 Method, system, equipment and storage medium for decoding instruction set of processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004005738A (en) * 1998-03-11 2004-01-08 Matsushita Electric Ind Co Ltd Data processor, and instruction set expansion method
CN111159094A (en) * 2019-12-05 2020-05-15 天津芯海创科技有限公司 RISC-V based near data stream type calculation acceleration array

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62145340A (en) * 1985-12-20 1987-06-29 Toshiba Corp Cache memory control system
JP2672532B2 (en) * 1987-11-20 1997-11-05 株式会社日立製作所 Coprocessor system
JPH06309270A (en) * 1993-04-22 1994-11-04 Fujitsu Ltd Interruption control circuit built in dpram
JP2987308B2 (en) * 1995-04-28 1999-12-06 松下電器産業株式会社 Information processing device
EP2267597A3 (en) * 1999-05-12 2012-01-04 Analog Devices, Inc. Digital signal processor having a pipeline structure
JP2011138308A (en) * 2009-12-28 2011-07-14 Sony Corp Processor, coprocessor, information processing system, and control method in them
JP6075157B2 (en) * 2013-03-28 2017-02-08 富士通株式会社 Calculation method, calculation program, and calculation apparatus
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN111930426A (en) * 2020-08-14 2020-11-13 西安邮电大学 Reconfigurable computing dual-mode instruction set architecture and application method thereof
CN112099762B (en) * 2020-09-10 2024-03-12 上海交通大学 Synergistic processing system and method for rapidly realizing SM2 cryptographic algorithm
CN112181496A (en) * 2020-09-30 2021-01-05 中国电力科学研究院有限公司 AI extended instruction execution method and device based on open source instruction set processor, storage medium and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004005738A (en) * 1998-03-11 2004-01-08 Matsushita Electric Ind Co Ltd Data processor, and instruction set expansion method
CN111159094A (en) * 2019-12-05 2020-05-15 天津芯海创科技有限公司 RISC-V based near data stream type calculation acceleration array

Also Published As

Publication number Publication date
CN113157636A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113157636B (en) Coprocessor, near data processing device and method
CN109522254B (en) Arithmetic device and method
US7339592B2 (en) Simulating multiported memories using lower port count memories
JP3871883B2 (en) Method for calculating indirect branch targets
US6216219B1 (en) Microprocessor circuits, systems, and methods implementing a load target buffer with entries relating to prefetch desirability
US7707397B2 (en) Variable group associativity branch target address cache delivering multiple target addresses per cache line
EP1399824B1 (en) Using type bits to track storage of ecc and predecode bits in a level two cache
JP4195006B2 (en) Instruction cache way prediction for jump targets
US5848432A (en) Data processor with variable types of cache memories
US7746348B2 (en) Multi-thread graphics processing system
US7234040B2 (en) Program-directed cache prefetching for media processors
KR100586057B1 (en) Using ecc/parity bits to store predecode information
US5761720A (en) Pixel engine pipeline processor data caching mechanism
JPH10124391A (en) Processor and method for executing store convergence by merged store operation
JPH10187533A (en) Cache system, processor, and method for operating processor
US11301250B2 (en) Data prefetching auxiliary circuit, data prefetching method, and microprocessor
US5898852A (en) Load instruction steering in a dual data cache microarchitecture
US6460132B1 (en) Massively parallel instruction predecoding
CN115563027B (en) Method, system and device for executing stock instruction
CN117453594A (en) Data transmission device and method
US20070233963A1 (en) Data processing system and method for processing data
US5951671A (en) Sharing instruction predecode information in a multiprocessor system
US20230267079A1 (en) Processing apparatus, method and system for executing data processing on a plurality of channels
CN113886286B (en) Two-dimensional structure compatible data reading and writing system and method
US7430657B2 (en) System, method and device for queuing branch predictions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant