WO2019093352A1 - Data processing device - Google Patents

Data processing device

Info

Publication number
WO2019093352A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
transfer
instruction
register
internal memory
Prior art date
Application number
PCT/JP2018/041281
Other languages
English (en)
Japanese (ja)
Inventor
悠記 小林
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Publication of WO2019093352A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/36 Handling requests for interconnection or transfer for access to common bus or bus system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode

Definitions

  • the present invention relates to a data processing apparatus that performs data transfer and arithmetic processing.
  • calculation may be repeated on data in which hundreds of millions of data are collected into millions of entries.
  • operations such as matrix multiplication, vector-matrix multiplication, and element-by-element multiplication of vectors on a matrix of several million dimensions by several hundred dimensions may occur.
  • CPU central processing unit
  • GPGPU general-purpose computing on graphics processing units
  • the FPGA is provided with a general-purpose logic element called a LUT (Look Up Table) and a variable wiring network connecting between a plurality of LUTs.
  • LUT Look Up Table
  • various arithmetic devices can be realized by rewriting the contents of the LUT and the wiring network.
  • an FPGA in which dedicated resources such as a digital signal processor (DSP) and a static random access memory (SRAM) are mounted.
  • DSP digital signal processor
  • SRAM static random access memory
  • Such an FPGA can realize an efficient computing device.
  • In such an FPGA, the physical positions of the DSPs and SRAMs are fixed. Unless the architecture uses the DSPs and SRAMs appropriately, the wiring becomes congested and routes must detour around the congested portion.
  • the problem is that the wiring length then becomes long.
  • As a result, the wire delay increases and the operating frequency of the arithmetic device is lowered.
  • There are FPGAs in which the SRAM module can be configured as a True Dual Port, i.e., a completely independent dual-port RAM (Random Access Memory).
  • a True Dual Port, i.e., a completely independent dual-port RAM (Random Access Memory).
  • Such an FPGA can be used as a memory having two systems of clock input and address input. It is desirable to make full use of these features in order to maximize the capabilities of the FPGA.
  • Patent Document 1 discloses a neural network apparatus configured by connecting a plurality of ring registers for performing a product-sum operation of a neural network in a ring.
  • the device disclosed in Patent Document 1 includes a ring register path configured by connecting a plurality of ring registers having a transfer function in a ring, a plurality of arithmetic devices each connected to at least one of the ring registers, and a plurality of storage devices each connected to an arithmetic device.
  • Non-Patent Document 1 discloses a method of performing matrix operation using an FPGA.
  • Patent Document 2 discloses a parallel computer that processes matrix products at high speed.
  • the computer of Patent Document 2 includes a plurality of processor elements, and a control device that distributes data to each processor element and collects operation results. Further, the computer of Patent Document 2 includes a first communication path connecting the control device and each processor element, and a second communication path connecting the logically adjacent processor elements.
  • matrix products can be calculated using processing elements arranged in one dimension.
  • An object of the present invention is to provide a data processing apparatus capable of continuously executing data transfer and arithmetic processing to improve the operation rate of a computing unit in order to solve the problems described above.
  • a data processing apparatus according to one aspect of the present invention includes: a first ring bus; a transfer element group including a plurality of transfer elements connected in series by the first ring bus; transfer control means connected to at least two of the transfer elements via the first ring bus and connected to an external memory;
  • a second ring bus independent of the first ring bus; a processing element group including a plurality of processing elements connected in series by the second ring bus;
  • overall control means connected to at least two of the processing elements via the second ring bus; and an internal memory group including a plurality of internal memories each connected to a corresponding transfer element and a corresponding processing element.
  • According to the present invention, it is possible to provide a data processing apparatus capable of continuously executing data transfer and arithmetic processing to improve the operation rate of a computing unit.
  • a data processing apparatus according to a first embodiment of the present invention will be described with reference to the drawings.
  • the data processing device of the present embodiment is mounted on an FPGA (Field-Programmable Gate Array)
  • the data processing apparatus of the present embodiment may be realized as a dedicated circuit (ASIC: Application Specific Integrated Circuit).
  • FIG. 1 is a block diagram showing the configuration of the data processing apparatus 1 of the present embodiment.
  • the data processing apparatus 1 includes a transfer element group 12, an internal memory group 13, a processing element group 14, a transfer control unit 15, an overall control unit 16, a first ring bus 17, and a second ring bus 18.
  • the transfer element group 12 includes a plurality of transfer elements 20 (transfer elements 20-1 to 20-n) connected in series by the first ring bus 17, (n is a natural number).
  • the transfer elements 20 constituting the transfer element group 12 are connected to the adjacent transfer elements 20 via the first ring bus 17. Further, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15 via the first ring bus 17.
  • Each of the plurality of transfer elements 20 writes the data included in the transfer data into the internal memory 30 corresponding to itself according to the analysis result of the transfer data transferred by the first ring bus 17. Also, the transfer element 20 transmits transfer data to the adjacent transfer element 20 via the first ring bus 17. Also, each of the plurality of transfer elements 20 reads output data from the internal memory 30 corresponding to itself. The transfer element 20 transmits the read output data to the transfer control unit 15 through the first ring bus 17.
  • the processing element group 14 includes a plurality of processing elements 40 (processing elements 40-1 to 40-n) connected in series by the second annular bus 18 (n is a natural number).
  • the processing elements 40 constituting the processing element group 14 are connected to the adjacent processing elements 40 by the second annular bus 18.
  • the input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16 via the second annular bus 18.
  • Each of the plurality of processing elements 40 reads data from the internal memory 30 corresponding to itself in accordance with the operation instruction received from the overall control unit 16 via the second ring bus 18.
  • the processing element 40 writes the operation result of the operation using the read data into the internal memory 30 as output data.
  • the internal memory group 13 includes a plurality of internal memories 30 (internal memories 30-1 to 30-n) (n is a natural number).
  • An internal memory 30 constituting the internal memory group 13 is connected between the corresponding transfer element 20 and processing element 40. That is, the internal memories 30-1 to 30-n are connected to the transfer elements 20-1 to 20-n and to the processing elements 40-1 to 40-n, respectively.
  • the transfer control unit 15 (also referred to as transfer control means) is connected to the transfer element group 12 via the first ring bus 17. That is, the transfer control unit 15 is connected to at least two transfer elements 20 constituting the transfer element group 12 via the first ring bus 17. Since the transfer elements 20 adjacent to each other are connected via the first ring bus 17, the transfer control unit 15 is connected, via the first ring bus 17, to the input of the transfer element 20-1 and the output of the transfer element 20-n.
  • the transfer control unit 15 is also connected to the external memory 100.
  • the transfer control unit 15 receives data to be processed from the external memory 100.
  • the transfer control unit 15 transmits the input data to the transfer element group 12 through the first ring bus 17.
  • the transfer control unit 15 also writes the output data, received from the internal memory group 13 via the first ring bus 17, to the external memory 100.
  • the overall control unit 16 (also referred to as overall control means) is connected to the processing element group 14 via the second annular bus 18. That is, the overall control unit 16 is connected to at least two processing elements 40 via the second annular bus 18. The overall control unit 16 transmits an operation instruction to the processing element group 14 through the second ring bus 18. The transfer control unit 15 and the overall control unit 16 are connected to each other.
  • the first annular bus 17 is a one-dimensional annular bus.
  • the first ring bus 17 connects a plurality of transfer elements 20 included in the transfer element group 12 in series. Further, the first ring bus 17 is connected to the transfer control unit 15.
  • the second annular bus 18 is a one-dimensional annular bus independent of the first annular bus 17.
  • the second annular bus 18 connects a plurality of processing elements 40 included in the processing element group 14 in series.
  • the second annular bus 18 is connected to the overall control unit 16.
  • FIG. 2 is a block diagram showing the configuration of the transfer elements 20-1 to 20-n included in the transfer element group 12. Hereinafter, the transfer element 20-1, the transfer element 20-2, ..., and the transfer element 20-n will be referred to as the transfer element 20 when they need not be distinguished. Although FIG. 2 shows only the connections between adjacent transfer elements 20, the input of the transfer element 20-1 and the output of the transfer element 20-n are connected to the transfer control unit 15.
  • the transfer element 20 is connected to the first annular bus 17.
  • the transfer element 20 includes an annular bus register 21 forming a part of the first annular bus 17 and a memory interface unit 22.
  • the ring bus register 21 includes a first register unit 211, a second register unit 212, and a third register unit 213.
  • the ring bus register 21 (also referred to as a first ring bus register) analyzes transfer data transferred from the transfer element 20 in the previous stage through the first ring bus 17.
  • the ring bus register 21 issues an access instruction to the internal memory 30 to the memory interface unit 22 according to the analysis result of the transfer data.
  • the ring bus register 21 transfers the transfer data to the transfer element 20 of the next stage as it is.
  • Alternatively, the ring bus register 21 transfers the transfer data, updated using the data read from the internal memory 30, to the transfer element 20 of the next stage.
  • FIG. 3 is a conceptual view showing a configuration example (transfer data 170) of transfer data flowing on the first ring bus 17.
  • the transfer data 170 includes a command field cmd, an identification field peid, an address field addr, and a data field data.
  • the command field cmd represents the type of data transfer (such as reading from an external memory or writing to an external memory).
  • the address field addr indicates which address in the internal memory 30 is to be accessed.
  • the data field data holds data to be read from or written to the internal memory 30.
  • FIG. 4 is a table summarizing an example of transfer data flowing on the first ring bus 17.
  • FIG. 4 shows an example of transfer data when eight 32-bit data are read from the external memory 100 and sequentially stored at address 0 of the internal memories 30-1 to 30-8.
  • When the command field cmd is 0x1, it indicates that data from the external memory 100 is written to the internal memory 30.
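As an illustration of the transfer data of FIGS. 3 and 4, the following Python sketch builds eight write packets, one per internal memory. The field names cmd, peid, addr, and data and the command value 0x1 come from the text. The class name `TransferData`, the function `build_write_packets`, the sample data words, and the assumption that peid values 1 to 8 identify the internal memories 30-1 to 30-8 are hypothetical.

```python
from dataclasses import dataclass

CMD_WRITE_INTERNAL = 0x1  # from the text: external memory data is written to internal memory


@dataclass(frozen=True)
class TransferData:
    """One unit of transfer data on the first ring bus (fields as in FIG. 3)."""
    cmd: int   # type of data transfer
    peid: int  # identifier of the target transfer element (numbering is assumed)
    addr: int  # address in the internal memory to access
    data: int  # 32-bit payload


def build_write_packets(words, addr=0):
    """Create one write packet per word, targeting elements 1..len(words)."""
    return [
        TransferData(cmd=CMD_WRITE_INTERNAL, peid=i + 1, addr=addr, data=w & 0xFFFFFFFF)
        for i, w in enumerate(words)
    ]


# Eight hypothetical 32-bit words read from the external memory.
packets = build_write_packets([0x10 + i for i in range(8)])
```

Each packet carries the same address and command and differs only in peid and data, which is how one stream on the ring can fill the same address in every internal memory.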
  • the first register unit 211 (also referred to as a first register) analyzes the transfer data transferred from the transfer element 20 in the previous stage.
  • the first register unit 211 issues an access instruction to the internal memory 30 to the memory interface unit 22 according to the analysis result of the transfer data.
  • When the identification field peid of the transfer data received from the transfer element 20 in the previous stage matches the identifier of the first register unit 211, the first register unit 211 determines that the command is addressed to itself.
  • When the command field cmd is a write command to the internal memory 30, the first register unit 211 sends the value of the data field data, the address of the address field addr, and a write instruction to the memory interface unit 22.
  • When the command field cmd is a read command from the internal memory 30, the first register unit 211 sends the address of the address field addr and a read instruction to the memory interface unit 22.
  • the memory interface unit 22 accesses the internal memory 30 in accordance with the instruction received from the first register unit 211.
  • When the memory interface unit 22 receives a write instruction from the first register unit 211, it writes data to the internal memory 30 according to the received write instruction.
  • When the memory interface unit 22 receives a read instruction from the first register unit 211, it reads data from the internal memory 30 according to the received read instruction. Then, the memory interface unit 22 sends the data read from the internal memory 30 to the third register unit 213.
  • the second register unit 212 (also referred to as a second register) is a buffer that is set in accordance with the access latency of the internal memory 30.
  • the second register unit 212 transfers the transfer data transferred from the first register unit 211 to the third register unit 213.
  • the second register unit 212 may be configured as a plurality of stages of shift registers in accordance with the access latency of the internal memory 30.
  • the third register unit 213 (also referred to as a third register) transfers the transfer data transferred from the second register unit 212 to the transfer element 20 of the next stage.
  • When no data has been read from the internal memory 30, the third register unit 213 sends the transfer data that has arrived via the second register unit 212 to the transfer element 20 of the next stage as it is.
  • When data has been read from the internal memory 30, the third register unit 213 replaces the data field data included in the transfer data that has arrived via the second register unit 212 with the data read from the internal memory 30, and sends the result to the transfer element 20 of the next stage.
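The behavior of a transfer element described above can be sketched as a small behavioral model in Python. It is an illustration only: it ignores the three-register staging and the access latency, the read command value 0x2 is an assumption (the text only gives 0x1 for a write), and the names `TransferElement`, `step`, and `send_around_ring` are hypothetical.

```python
from dataclasses import dataclass, replace

CMD_WRITE = 0x1  # write to internal memory (value from the text)
CMD_READ = 0x2   # read from internal memory (value assumed for illustration)


@dataclass(frozen=True)
class TransferData:
    cmd: int
    peid: int
    addr: int
    data: int = 0


class TransferElement:
    """Behavioral model of one transfer element and its internal memory."""

    def __init__(self, peid, mem_size=16):
        self.peid = peid
        self.memory = [0] * mem_size  # stands in for internal memory 30

    def step(self, td):
        """Handle one transfer data item and return what goes to the next stage."""
        if td.peid != self.peid:
            return td  # not addressed to this element: forward unchanged
        if td.cmd == CMD_WRITE:
            self.memory[td.addr] = td.data
            return td
        if td.cmd == CMD_READ:
            # The data field is replaced with the value read from memory.
            return replace(td, data=self.memory[td.addr])
        return td


def send_around_ring(elements, td):
    """Pass the transfer data through every element in ring order."""
    for element in elements:
        td = element.step(td)
    return td  # what arrives back at the transfer control unit


ring = [TransferElement(peid) for peid in range(1, 5)]
send_around_ring(ring, TransferData(CMD_WRITE, peid=3, addr=0, data=0xCAFE))
returned = send_around_ring(ring, TransferData(CMD_READ, peid=3, addr=0))
```

Only the element whose identifier matches peid touches its memory. Every other element forwards the transfer data unchanged, so a read result travels the rest of the ring back to the transfer control unit.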
  • FIG. 5 is a block diagram showing the configuration of the internal memory 30. The arrows between the blocks shown in FIG. 5 conceptually indicate the flow of the write instruction, the address, the read data, and the write data, and do not limit their directions.
  • the internal memory 30 includes a dual port memory 31.
  • the dual port memory 31 includes two access ports of a port A 311 (hereinafter referred to as port A) and a port B 312 (hereinafter referred to as port B).
  • a signal line from the transfer element 20 is connected to the port A (also referred to as a first port).
  • a signal line from the processing element 40 is connected to the port B (also referred to as a second port).
  • These signal lines are wires for transmitting addresses for writing and reading, writing instructions, writing data, reading data, and the like.
  • FIG. 6 is a block diagram showing the configuration of the processing element 40. Hereinafter, the processing element 40-1, the processing element 40-2, ..., and the processing element 40-n will be referred to as the processing element 40 when they need not be distinguished. Although FIG. 6 shows only the connections between adjacent processing elements 40, the input of the processing element 40-1 and the output of the processing element 40-n are connected to the overall control unit 16.
  • the processing element 40 includes a ring bus register 41, an instruction decoder 42, a memory interface unit 43, and an arithmetic unit 44.
  • the ring bus register 41 (also referred to as a second ring bus register) is connected to the second ring bus 18 and is one of the elements constituting the second ring bus 18.
  • the ring bus register 41 is connected to the instruction decoder 42.
  • the ring bus register 41 may be a single register or a shift register composed of a plurality of stages.
  • the ring bus register 41 receives an operation instruction from the preceding processing element 40 connected to the second ring bus 18 and sends the received operation instruction to the processing element 40 of the next stage. Among the received operation instructions, the ring bus register 41 sends the operation instruction to be processed by itself to the instruction decoder 42.
  • the instruction decoder 42 is connected to the ring bus register 41. Also, the instruction decoder 42 is connected to the memory interface unit 43 and the arithmetic unit 44. The instruction decoder 42 analyzes the operation instruction received from the ring bus register 41 and generates a control signal according to the operation instruction. The instruction decoder 42 outputs the generated control signal to the memory interface unit 43 and the computing unit 44.
  • the memory interface unit 43 (also referred to as a second memory interface) is connected to the instruction decoder 42 and the arithmetic unit 44. Also, the memory interface unit 43 is connected to the internal memory 30. The memory interface unit 43 reads data from the internal memory 30 in response to a control signal from the instruction decoder 42, and transmits the read data to the computing unit 44. Also, the memory interface unit 43 writes the calculation result of the arithmetic unit 44 in the internal memory 30 as output data.
  • the arithmetic unit 44 is connected to the instruction decoder 42 and the memory interface unit 43. Arithmetic unit 44 executes an operation using data received from memory interface unit 43 in response to a control signal from instruction decoder 42. The arithmetic unit 44 transmits the operation result to the memory interface unit 43.
  • the computing unit 44 can be realized by a DSP (Digital Signal Processor) of an FPGA (Field-Programmable Gate Array).
  • a register file may be provided in the computing unit 44 so that operations on the registers in the register file can be performed.
  • FIG. 7 is a conceptual diagram showing a configuration example (operation instruction 420) of an operation instruction.
  • the operation instruction 420 includes fields of an 8-bit opcode opc, a first source operand rs, a second source operand rt, a destination operand rd, and a 32-bit immediate operand imm.
  • FIG. 8 is a table summarizing an example of operation instructions.
  • the table of FIG. 8 shows the value of the opcode opc and the operation corresponding to the opcode opc.
  • the description of instructions other than the opcode opc is omitted.
  • the instruction of the operation code MACI and the operation code MACR will be described later.
  • For example, when opc is 0x01, rs is 0x00, rt is 0x40, and rd is 0x80, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to read data from the addresses 0x00 and 0x40 in the internal memory 30.
  • the instruction decoder 42 outputs a control signal instructing the arithmetic unit 44 to perform an addition operation on the input data supplied from the memory interface unit 43. Then, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 to the address 0x80 of the internal memory 30.
  • the instruction decoder 42 outputs, to the computing unit 44, a control signal instructing to perform a multiplication operation on the input data supplied from the memory interface unit 43 and the value of the immediate field imm. Then, the instruction decoder 42 outputs a control signal instructing the memory interface unit 43 to write the output data of the arithmetic unit 44 at the address 0x46 of the internal memory 30.
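The addition example above (opc 0x01, rs 0x00, rt 0x40, rd 0x80) can be traced with a short Python sketch. Treating opcode 0x01 as addition follows the example in the text. The function name `execute`, the memory size, the sample operand values, and the 32-bit wraparound are assumptions made for illustration.

```python
OPC_ADD = 0x01  # opcode value taken from the example in the text


def execute(memory, opc, rs, rt, rd):
    """Mimic the decoder's three control signals for a two-operand instruction."""
    if opc != OPC_ADD:
        raise ValueError(f"opcode {opc:#04x} is not modeled in this sketch")
    a, b = memory[rs], memory[rt]      # memory interface reads both source addresses
    result = (a + b) & 0xFFFFFFFF      # arithmetic unit adds (32-bit wrap is assumed)
    memory[rd] = result                # memory interface writes the output data
    return result


memory = [0] * 0x100   # hypothetical internal memory contents
memory[0x00] = 7
memory[0x40] = 35
total = execute(memory, opc=0x01, rs=0x00, rt=0x40, rd=0x80)
```

Here rs, rt, and rd are addresses in the internal memory rather than register indices, which matches the way the processing element reads and writes the internal memory 30 directly.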
  • FIG. 9 is a block diagram showing the configuration of the overall control unit 16.
  • the overall control unit 16 has a program counter 61, a command memory 62, a command decoder 63, and an overall control unit data path 64.
  • the command decoder 63 is connected to the first processing element 40-1.
  • the overall control unit data path 64 is connected to the last processing element 40-n.
  • the overall control unit 16 operates in the same manner as a general instruction set processor.
  • the program counter 61 stores a value indicating a command to be executed next. If the content of the command is other than a branch instruction, the program counter 61 is automatically incremented. On the other hand, when the content of the command is a branch instruction, the value of the program counter 61 is changed in accordance with the branch instruction.
  • the command memory 62 stores a command including a flag indicating a subject that executes an instruction.
  • the command memory 62 outputs a command corresponding to the value of the program counter 61 to the command decoder 63.
  • the command decoder 63 analyzes the command output from the command memory 62 and generates a control signal according to the analysis result. When the command decoder 63 interprets the command as an instruction of the overall control unit 16, the command decoder 63 outputs the generated control signal to the overall control unit data path 64. On the other hand, when the command decoder 63 interprets the command as an instruction of the processing element 40, the command decoder 63 outputs the generated control signal to the processing element 40-1 of the first stage included in the processing element group 14.
  • the overall control unit data path 64 (also referred to as an overall control data path) performs an operation according to the content of the command in accordance with the control signal generated by the command decoder 63.
  • the overall control unit data path 64 performs operations such as addition and branching.
  • the overall control unit data path 64 may include elements included in a general instruction set processor such as a register file. If the content of the command is a branch instruction, the overall control unit data path 64 changes the value of the program counter 61 in accordance with the branch instruction.
  • FIG. 10 is a conceptual diagram showing a configuration example of the command 620 stored in the command memory 62.
  • the command 620 in the example of FIG. 10 includes a 1-bit flag pf and a 64-bit instruction inst. When pf is 0, the command is interpreted as an instruction of the overall control unit 16. On the other hand, when pf is 1, it is interpreted as an instruction of the processing element 40. For a command 620 whose pf is 1, the command decoder 63 shown in FIG. 9 transmits the instruction inst to the first processing element 40-1 on the second ring bus 18.
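The routing by the flag pf can be sketched in Python as follows. The 1-bit pf and the 64-bit inst come from the text. Placing pf in the most significant bit of a 65-bit word, and the names `make_command` and `dispatch` and the target labels, are assumptions for illustration.

```python
INST_BITS = 64
INST_MASK = (1 << INST_BITS) - 1


def make_command(pf, inst):
    """Pack a 1-bit flag pf and a 64-bit inst into one 65-bit command word."""
    if pf not in (0, 1):
        raise ValueError("pf must be 0 or 1")
    return (pf << INST_BITS) | (inst & INST_MASK)


def dispatch(command):
    """Return the destination and the instruction, as the command decoder would."""
    pf = (command >> INST_BITS) & 1
    inst = command & INST_MASK
    target = "processing_element_40_1" if pf == 1 else "overall_control_data_path"
    return target, inst


to_pe = dispatch(make_command(1, 0x0123456789ABCDEF))
to_ctrl = dispatch(make_command(0, 0x42))
```

A single command memory can therefore hold both kinds of instruction in program order, with the flag deciding which unit executes each one.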
  • the operation instruction that has arrived from the last processing element 40-n is stored in a register (not shown) in the overall control unit data path 64.
  • the storage destination of the operation instruction may be a specific register in the register file or may be a dedicated register.
  • Alternatively, the overall control unit data path 64 may be provided with a dedicated FIFO (First In First Out) for storing operation instructions, or the storage destination register in the register file may be designated separately by a flag or the like in the operation instruction.
  • FIFO First In First Out
  • FIG. 11 is a conceptual diagram showing a configuration example of the instruction 160 of the overall control unit 16.
  • the instruction 160 of the general control unit 16 includes fields of an opcode opc, a first source operand rs, a second source operand rt, a destination operand rd, and an immediate operand imm.
  • In the example of FIG. 11, the operation code opc is 8 bits, the first source operand rs is 5 bits, the second source operand rt is 5 bits, the destination operand rd is 5 bits, and the immediate operand imm is 32 bits.
  • the instruction 160 of the overall control unit 16 of FIG. 11 may be stored left-justified in the 64-bit-wide inst shown in FIG. 10.
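The field widths above add up to 8 + 5 + 5 + 5 + 32 = 55 bits, so a left-justified instruction leaves the 9 low-order bits of the 64-bit inst unused. The following Python sketch packs and unpacks such a word. The field order (opc, rs, rt, rd, imm from the most significant bit) follows the order in which the text lists the fields and is an assumption, as are the function names `encode` and `decode`.

```python
FIELDS = (("opc", 8), ("rs", 5), ("rt", 5), ("rd", 5), ("imm", 32))
USED_BITS = sum(width for _, width in FIELDS)  # 55 bits carry information
PAD_BITS = 64 - USED_BITS                      # 9 unused low-order bits


def encode(**values):
    """Pack the fields MSB-first and left-justify them in a 64-bit word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word << PAD_BITS


def decode(inst):
    """Recover the fields from a left-justified 64-bit word."""
    word = inst >> PAD_BITS
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


inst = encode(opc=0x01, rs=1, rt=2, rd=3, imm=0xDEADBEEF)
fields = decode(inst)
```

With this layout the opcode always occupies the top 8 bits of inst, so a decoder can identify the instruction without knowing how many low-order bits are padding.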
  • FIG. 12 is a table summarizing an example of an instruction of the overall control unit 16.
  • RF [rs] represents the register value of the index specified by rs in the register file.
  • PC represents a program counter value.
  • dmactrl represents an instruction register to the transfer control unit 15.
  • dmastatus represents the status register of the transfer control unit 15.
  • "{RF [rs], RF [rt]}" represents a value obtained by concatenating two register values RF [rs] and RF [rt].
  • Although the bit width of a register in the register file is assumed to be 32 bits in the example of FIG. 12, the bit width of the register is not limited to 32 bits.
  • FIG. 13 is a block diagram showing the configuration of the transfer control unit 15. As shown in FIG. 13, the transfer control unit 15 includes an instruction register 51, a status register 52, and a control circuit 53. The instruction register 51 and the status register 52 are connected to the overall control unit 16. The control circuit 53 is connected to the external memory 100. Also, the control circuit 53 is connected to the first transfer element 20-1 and the last transfer element 20-n.
  • the status register 52 holds a value indicating whether transfer data is being transferred or has been completed in the first ring bus 17.
  • Control circuit 53 is connected to external memory 100. Control circuit 53 receives data to be processed from external memory 100. The control circuit 53 transmits the input data to the first stage transfer element 20-1 included in the transfer element group 12 through the first ring bus 17. The control circuit 53 also writes the output data received from the internal memory group 13 to the external memory 100 via the first ring bus 17.
  • the control circuit 53 starts a transfer if the instruction register 51 contains a valid transfer instruction. In addition, the control circuit 53 updates the status register 52 as needed with a value indicating whether the transfer is in progress or completed, thereby notifying the overall control unit 16. That is, when the instruction register 51 contains a valid transfer instruction, the control circuit 53 transfers data between the external memory 100 and the transfer element group 12 and updates the value of the status register 52.
  • A value is written to the instruction register 51 by the ivkdma instruction of the overall control unit 16 shown in FIG. 12. Further, the value of the status register 52 is read by the chkdma instruction of the overall control unit 16 shown in FIG. 12.
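The interaction between the instruction register 51, the status register 52, and the control circuit 53 can be sketched as a polling handshake in Python. The instruction names ivkdma and chkdma come from the text. The class `TransferControlUnit`, the status encodings, the use of a word count as the transfer instruction, and the one-word-per-tick timing are all assumptions for illustration.

```python
class TransferControlUnit:
    """Toy model of instruction register 51, status register 52, and control circuit 53."""

    IDLE, BUSY = 0, 1  # status encodings are assumed

    def __init__(self):
        self.instruction_register = None  # None stands for "no valid instruction"
        self.status_register = self.IDLE
        self._remaining = 0

    def ivkdma(self, word_count):
        """Overall control unit side: post a transfer request."""
        self.instruction_register = word_count

    def chkdma(self):
        """Overall control unit side: read the status register."""
        return self.status_register

    def tick(self):
        """Control circuit side: accept a request or move one word, then update status."""
        if self.instruction_register is not None:
            self._remaining = self.instruction_register
            self.instruction_register = None
        elif self._remaining > 0:
            self._remaining -= 1
        self.status_register = self.BUSY if self._remaining > 0 else self.IDLE


tcu = TransferControlUnit()
tcu.ivkdma(3)   # request a hypothetical three-word transfer
tcu.tick()      # control circuit accepts the request
ticks = 0
while tcu.chkdma() == TransferControlUnit.BUSY:  # poll until completion
    tcu.tick()
    ticks += 1
```

Because the overall control unit only polls a status value, it remains free to issue operation instructions on the second ring bus while the transfer proceeds on the first.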
  • the above is the description of the components of the data processing device 1.
  • the above configuration of the data processing apparatus 1 is an example, and various configurations may be added or deleted as long as the functions of the data processing apparatus 1 of the present embodiment can be exhibited.
  • FIG. 15 is a flowchart for explaining the operation of the processing element 40.
  • the processing element 40 determines whether or not an operation instruction has come from the processing element 40 of the previous stage (step S11).
  • When an operation instruction has arrived (Yes in step S11), the processing element 40 receives the operation instruction (step S12). On the other hand, when no operation instruction has arrived (No in step S11), the processing element 40 waits for the arrival of an operation instruction (returns to step S11).
  • the processing element 40 performs an operation according to the received arithmetic instruction (step S13). For example, the processing element 40 performs the following operations 1 to 4 according to the received operation instruction. (1) Read values from the addresses shown in rs and rt in the internal memory 30. (2) Perform an operation on the read value. (3) Write the operation result to the address indicated by rd in the internal memory 30. (4) Rewrite imm in the operation instruction with the operation result.
  • the processing element 40 sends the updated operation instruction to the processing element 40 of the next stage (step S14).
  • If the transfer continues (Yes in step S15), the process returns to step S11. When the transfer is completed (No in step S15), the process according to the flowchart of FIG. 15 ends.
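The flow of steps S11 to S15 can be sketched in Python with queues standing in for the second ring bus. The function name `run_processing_element`, the dictionary representation of an instruction, and the use of addition as the operation are assumptions. Real hardware would wait at step S11, whereas this model simply stops when its input queue is empty.

```python
from collections import deque


def run_processing_element(incoming, memory, outgoing):
    """Steps S11 to S15 for one processing element, driven by a queue of instructions."""
    while incoming:                                        # S11 / S15: an instruction is pending
        inst = dict(incoming.popleft())                    # S12: receive the instruction
        result = memory[inst["rs"]] + memory[inst["rt"]]   # S13 (1)-(2): read and operate
        memory[inst["rd"]] = result                        # S13 (3): write the result
        inst["imm"] = result                               # S13 (4): rewrite imm
        outgoing.append(inst)                              # S14: send to the next stage


memory = {0: 2, 1: 3, 2: 0}   # hypothetical internal memory contents
incoming = deque([{"rs": 0, "rt": 1, "rd": 2, "imm": 0}])
outgoing = deque()
run_processing_element(incoming, memory, outgoing)
```

Rewriting imm before forwarding lets a result travel along the ring with the instruction, so the next processing element or the overall control unit can use it.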
  • FIG. 16 is a calculation example of the matrix product by the data processing device 1.
  • the six elements (A00 to A21) of the matrix A are stored in the register files (RF [0] to RF [5]) of the overall control unit 16.
  • Although FIG. 17 illustrates an example in which the elements of the matrix A are stored in the register file, the present invention is not limited to this.
  • A memory such as a scratch pad memory may be configured in the overall control unit 16, and the elements of the matrix A may be stored in the scratch pad memory.
  • The elements of the matrix B and the matrix C are stored in the internal memory 30.
  • B17 represents an element of row 1 column 7 of the matrix B.
  • the transfer control unit 15 reads in advance the matrix B, which is input data, into the 0th address to the 4th address of the internal memory 30.
  • the matrix C, which is output data, is stored in the area of addresses 400 to 408 initialized to zero.
  • the MAC Immediate instruction shown in FIG. 8 is used to calculate the matrix product.
  • In the MAC Immediate instruction, the value of the rt-th register of the register file of the overall control unit 16 is set in imm,
  • the operation instruction is sent to the second ring bus 18,
  • imm is multiplied by the value at address rs of the internal memory 30 in the processing element 40, and
  • the product is accumulated at address rd of the internal memory 30.
  • FIG. 18 shows an example of an assembly program of matrix products (assembly program 171).
  • MACI is a mnemonic that represents a MAC Immediate instruction.
  • FIGS. 19 to 23 show the values of the internal memory 30 in each cycle of the matrix multiplication, the operation instructions (instructions 1 to 6 in FIG. 18) flowing through the second ring bus 18, and the values set in the imm field of each operation instruction.
  • The first processing element 40-1 on the second ring bus 18 receives instruction 1.
  • Data A00 is set in the imm field of instruction 1.
  • In accordance with instruction 1, "MACI 0, 0, 0x400", the processing element 40-1 multiplies the value at address 0 (B00) of its own internal memory 30-1 by A00 and accumulates the product, which is the operation result, at address 0x400.
  • “A00 * B00” as the operation result is stored at address 400 of the internal memory 30-1.
  • “A00 * B00” indicates the product of "A00" and "B00".
  • the processing element 40-1 receives the instruction 2 and the processing element 40-2 receives the instruction 1.
  • the processing element 40-1 performs a product-sum operation of multiplying B10 and A01 and accumulating the product (A01 * B10) which is the operation result at address 400.
  • the processing element 40-2 performs multiplication of B01 and A00, and accumulates the product (A00 * B01), which is the operation result, at address 400.
  • "A00 * B00 + A01 * B10" is stored at address 400 of the internal memory 30-1, and "A00 * B01" is stored at address 400 of the internal memory 30-2.
  • "A00 * B00 + A01 * B10" indicates the sum of "A00 * B00" and "A01 * B10".
  • the processing element 40-1 receives the instruction 3
  • the processing element 40-2 receives the instruction 2
  • the processing element 40-3 receives the instruction 1.
  • the processing element 40-1 performs multiplication of B00 and A10, and accumulates the product (A10 * B00), which is the operation result, at address 404.
  • the processing element 40-2 performs multiplication of B11 and A01 and executes a product-sum operation of accumulating the product (A01 * B11) which is the operation result at address 400.
  • the processing element 40-3 performs multiplication of B02 and A00, and accumulates the product (A00 * B02), which is the operation result, at address 400.
  • "A10 * B00" is stored at address 404 of internal memory 30-1,
  • "A00 * B01 + A01 * B11" is stored at address 400 of internal memory 30-2, and
  • "A00 * B02" is stored at address 400 of the internal memory 30-3.
  • processing element 40-1 receives instruction 4
  • processing element 40-2 receives instruction 3
  • processing element 40-3 receives instruction 2
  • processing element 40-4 receives instruction 1.
  • the processing elements 40-1 to 4 execute the operation in the same manner as in FIGS. 19 to 21, and store the operation result in the designated address of the internal memory 30-1 to 4.
  • FIG. 23 shows the state of the internal memories 30-1 to 8 in cycle 14 (cyc14) when the matrix product calculation is completed. At respective addresses of the internal memories 30-1 to 8, the operation result according to the operation instruction is stored.
  • the data processing device 1 calculates the matrix product.
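  • The cycle-by-cycle walkthrough above can be reproduced with a small software model. The following is a sketch under stated assumptions, not the program of FIG. 18: it assumes a 3x2 matrix A held row by row in RF[0] to RF[5], a 2x8 matrix B with column j held by processing element j at addresses 0 and 4, rows of C at addresses 0x400, 0x404 and 0x408, an operand order of (rs, rt, rd), and one instruction issued per cycle that advances by one processing element per cycle. The function and constant names are hypothetical.

```python
NUM_PE = 8
B_ADDR = (0, 4)                  # assumed addresses of B[0][j] and B[1][j]
C_ADDR = (0x400, 0x404, 0x408)   # assumed addresses of C[0][j], C[1][j], C[2][j]


def build_program():
    """Six MACI instructions as (rs, rt, rd), one per element of A."""
    return [(B_ADDR[k], 2 * i + k, C_ADDR[i]) for i in range(3) for k in range(2)]


def matrix_product_on_ring(A, B):
    """Simulate MACI instructions flowing through NUM_PE processing elements."""
    rf = [A[i][k] for i in range(3) for k in range(2)]        # RF[0..5]
    # One internal memory per processing element: a column of B plus zeroed C.
    mems = [
        {B_ADDR[0]: B[0][j], B_ADDR[1]: B[1][j], **{a: 0 for a in C_ADDR}}
        for j in range(NUM_PE)
    ]
    # The overall control unit copies RF[rt] into imm before issuing.
    issued = [(rs, rd, rf[rt]) for rs, rt, rd in build_program()]
    trace = []
    for cycle in range(len(issued) + NUM_PE - 1):
        active = []
        for pe in range(NUM_PE):
            n = cycle - pe                   # index of the instruction at this PE
            if 0 <= n < len(issued):
                rs, rd, imm = issued[n]
                mems[pe][rd] += imm * mems[pe][rs]   # multiply and accumulate
                active.append((pe + 1, n + 1))       # (PE number, instruction number)
        trace.append(active)
    C = [[mems[j][C_ADDR[i]] for j in range(NUM_PE)] for i in range(3)]
    return C, trace
```

  • In this model, trace[2] lists the pairs (1, 3), (2, 2) and (3, 1), which corresponds to the third cycle described above: processing element 40-1 holds instruction 3, 40-2 holds instruction 2, and 40-3 holds instruction 1.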
  • The processing element 40 performs an operation and stores the calculation result in the internal memory 30. That is, the processing element 40 stores, in the internal memory 30 corresponding to itself, the calculation result calculated using the immediate data received through the second ring bus 18 and the data stored in the internal memory 30 corresponding to itself.
  • FIG. 24 is an example of calculation of the inner product of vectors by the data processing device 1.
  • the data processing device 1 obtains an inner product d of a 1-row-8-column matrix A and a 1-row-8-column matrix B.
  • the elements of matrix A and matrix B are stored in the internal memory 30.
  • the MAC reduction instruction of FIG. 8 is used to calculate the inner product.
  • The MAC reduction instruction is an instruction for multiplying the values at addresses rs and rt of the internal memory 30 in each processing element 40, adding the product to the value of imm in the operation instruction, and transferring the instruction to the next processing element 40. That is, with the MAC reduction instruction, each time the operation instruction passes through a processing element 40, the operation result is added to the value of the imm field of the operation instruction.
  • FIG. 26 shows an example of an inner product assembly program (assembly program 172).
  • MACR is a mnemonic that represents a MAC reduction instruction.
  • In cycle 1 shown in FIG. 27, the value of the imm field of instruction 1 that has arrived at the processing element 40-1 is zero.
  • the processing element 40-1 performs an operation of A00 * B00 and adds it to the imm field.
  • the processing element 40-1 transfers the operation result to the processing element 40-2 of the next stage.
  • “A00 * B00” is stored in the imm field of the processing element 40-2.
  • the processing element 40-2 performs an operation of A01 * B01 and adds it to the imm field.
  • the processing element 40-2 transfers the operation result to the processing element 40-3 of the next stage.
  • “A00 * B00 + A01 * B01” is stored in the imm field of the processing element 40-3.
  • the processing element 40-3 performs an operation of A02 * B02 and adds it to the imm field.
  • the processing element 40-3 transfers the operation result to the processing element 40-4 of the next stage.
  • “A00 * B00 + A01 * B01 + A02 * B02” is stored in the imm field of the processing element 40-4.
  • In cycle 4 shown in FIG. 30, the processing element 40-4 performs the operation of A03 * B03 and adds it to the imm field.
  • the processing element 40-4 transfers the operation result to the next stage processing element 40-5.
  • the description of cycles 5 to 7 is omitted.
  • The last processing element, the processing element 40-8, adds the calculation result of A07 * B07 to the value "A00 * B00 + A01 * B01 + ... + A06 * B06" of the imm field.
  • the operation instruction that the last processing element 40-8 outputs to the second ring bus 18 is stored in a register or the like in the overall control unit data path 64 in the overall control unit 16.
  • When the overall control unit 16 outputs an operation instruction including a field for storing immediate data to the second ring bus 18, the processing element 40 performs the operation using the immediate data. That is, the processing element 40 rewrites the immediate data according to the calculation result calculated using the immediate data received through the second ring bus 18 and the data stored in the internal memory 30 corresponding to itself. Then, the processing element 40 outputs the rewritten immediate data to the second ring bus 18 as output data.
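  • The accumulation carried by the imm field can be sketched as follows. This is a minimal model under assumptions: the addresses A_ADDR and B_ADDR at which each processing element holds its elements of the matrices A and B are hypothetical (the text does not give them), each internal memory is modeled as a dictionary, and the function name macr_reduce is introduced here for illustration.

```python
from typing import Dict, List

A_ADDR = 0  # hypothetical address of A0j in the internal memory of element j
B_ADDR = 4  # hypothetical address of B0j in the internal memory of element j


def macr_reduce(
    memories: List[Dict[int, int]],
    rs: int = A_ADDR,
    rt: int = B_ADDR,
    imm: int = 0,
) -> List[int]:
    """Return the imm value leaving each processing element, in ring order."""
    partials = []
    for mem in memories:            # the instruction visits 40-1, 40-2, ... in turn
        imm += mem[rs] * mem[rt]    # each element adds its local product to imm
        partials.append(imm)        # imm as forwarded to the next stage
    return partials
```

  • The last entry of the returned list corresponds to the value that the final processing element outputs to the second ring bus 18, that is, the inner product d.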
  • The second ring bus 18 and the first ring bus 17 can operate independently. Therefore, as shown in FIG. 32, processing and transfer can be performed in parallel. That is, while performing a given matrix multiplication, the data processing device 1 can simultaneously transfer the matrix A and the matrix B for the next matrix multiplication and transfer the matrix C that is the output of the previous matrix multiplication.
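  • The overlap of processing and transfer can be pictured as a three-stage schedule. The sketch below is an illustration only: it assumes that input transfer, computation and output transfer each take one equal-length time slot, which the text does not state, and the function name overlapped_schedule is hypothetical.

```python
from typing import Dict, List, Optional


def overlapped_schedule(num_stages: int) -> List[Dict[str, Optional[int]]]:
    """For each time slot, list which matrix multiplication is loaded, computed, stored."""
    def valid(n: int) -> Optional[int]:
        return n if 0 <= n < num_stages else None

    slots = []
    for t in range(num_stages + 2):
        slots.append({
            "load": valid(t),         # transfer of A and B for the next multiplication
            "compute": valid(t - 1),  # multiplication running on the second ring bus
            "store": valid(t - 2),    # transfer of C from the previous multiplication
        })
    return slots
```

  • For three multiplications, slot 2 of this schedule loads the inputs of multiplication 2, computes multiplication 1, and stores the output of multiplication 0 at the same time.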
  • the data to be processed by the data processing apparatus of the present embodiment is not limited to a matrix, and may be data of another form.
  • the data processing apparatus according to the present embodiment may process vector data.
  • the inside of the overall control unit and the processing element included in the data processing apparatus of the present embodiment may be realized as a pipeline processor.
  • Implementing the inside of the overall control unit and the processing elements as a pipeline processor can increase the throughput of operations.
  • Since the internal memory must be accessed at more than one address simultaneously (for example, rs and rd must be accessed simultaneously in the MACI instruction), the internal memory may be configured as a plurality of banks to allow simultaneous access.
  • Although the case where the number of processing elements is eight has been described as an example in the present embodiment, the number of processing elements is not limited. For example, even if the number of processing elements is 256 or 512, the configuration of the present embodiment is applicable. Also, the number of processing elements may be less than eight or more than 512.
  • the first effect of maintaining the operating rate of the computing unit can be obtained.
  • By transmitting the operation content of the processing element as an operation instruction through the ring bus, long wiring is eliminated except for the signal lines for the clock signal and the reset signal, which makes it possible to improve the operating frequency.
  • the second effect is obtained. That is, according to the present embodiment, the matrix product and the inner product of the vectors can be efficiently calculated by making the bus for data transfer and the bus for data processing independent.
  • the data processing apparatus is used for flexible and efficient execution on a field-programmable gate array (FPGA) with respect to an application such as analysis processing of big data that performs matrix operation such as large-scale matrix product or inner product.
  • the data processing device of the present embodiment can be realized not only on the FPGA but also as a dedicated circuit (ASIC: Application Specific Integrated Circuit).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)

Abstract

In order to continuously perform data transfer and arithmetic processing and to improve the operating rate of an arithmetic unit, this data processing device comprises: a first ring bus; a transfer element group that includes a plurality of transfer elements connected in series by the first ring bus; a transfer control means that is connected to at least two of the transfer elements via the first ring bus and is connected to an external memory; a second ring bus that is independent of the first ring bus; a processing element group that includes a plurality of processing elements connected in series by the second ring bus; an overall control means that is connected to at least two of the processing elements via the second ring bus; and an internal memory group that includes a plurality of internal memories connected to corresponding transfer elements and processing elements.
PCT/JP2018/041281 2017-11-10 2018-11-07 Data processing device WO2019093352A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017217026 2017-11-10
JP2017-217026 2017-11-10

Publications (1)

Publication Number Publication Date
WO2019093352A1 true WO2019093352A1 (fr) 2019-05-16

Family

ID=66437805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/041281 WO2019093352A1 (fr) 2017-11-10 2018-11-07 Data processing device

Country Status (1)

Country Link
WO (1) WO2019093352A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09185592A (ja) * 1995-12-19 1997-07-15 Commiss Energ Atom 多重並列構造プロセッサのアレイシステムアーキテクチャ
JP2003036248A (ja) * 2001-07-25 2003-02-07 Nec Software Tohoku Ltd シングルチップマイクロプロセッサに用いる小規模プロセッサ
JP2010079921A (ja) * 2003-07-25 2010-04-08 Rmi Corp プロセッサ

Similar Documents

Publication Publication Date Title
JP6960479B2 (ja) 再構成可能並列処理
KR102292349B1 (ko) 처리 장치 및 처리 방법
CN111580865B (zh) 一种向量运算装置及运算方法
JP7361133B2 (ja) 深層学習アルゴリズムのための効率的なアーキテクチャ
US7350054B2 (en) Processor having array of processing elements whose individual operations and mutual connections are variable
US20030061601A1 (en) Data processing apparatus and method, computer program, information storage medium, parallel operation apparatus, and data processing system
TW544603B (en) Designer configurable multi-processor system
US11907681B2 (en) Semiconductor device and method of controlling the semiconductor device
JP7131115B2 (ja) データ処理装置、データ処理方法、およびプログラム
US7509479B2 (en) Reconfigurable global cellular automaton with RAM blocks coupled to input and output feedback crossbar switches receiving clock counter value from sequence control unit
WO2019093352A1 (fr) Dispositif de traitement de données
CN117421048A (zh) 多线程计算中的混合的标量操作和向量操作
JP4962305B2 (ja) リコンフィギュラブル回路
US11443014B1 (en) Sparse matrix multiplier in hardware and a reconfigurable data processor including same
US11250105B2 (en) Computationally efficient general matrix-matrix multiplication (GeMM)
WO2001097054A2 (fr) Systeme de calcul synergetique
Chalamalasetti et al. A low cost reconfigurable soft processor for multimedia applications: Design synthesis and programming model
Heenes et al. FPGA implementations of the massively parallel GCA model
US7107478B2 (en) Data processing system having a Cartesian Controller
EP3030963B1 (fr) Unité de diffusion en flux matérielle à configuration flexible
CN114968911B (zh) 算子频度压缩及上下文配置调度的fir可重构处理器
JPH05324694A (ja) 再構成可能並列プロセッサ
JPH1063647A (ja) 行列演算装置
US20150046687A1 (en) Hardware Streaming Unit
CN117785287A (zh) 多线程计算中的私有存储器模式顺序存储器访问

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18876682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18876682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP