Disclosure of Invention
To address the problems in the prior art, the invention provides a data transmission method and device for memory access and on-chip communication of a many-core processor. The technical problem to be solved by the invention is to provide a data transmission method and device for memory access and on-chip communication of a many-core processor that allow batch data transmission between the cores and the memory (DMA) and batch data transmission among the cores (RMA) to proceed in parallel, thereby reducing hardware logic overhead.
The purpose of the invention can be realized by the following technical scheme:
The data transmission method for memory access and on-chip communication of a many-core processor comprises the following steps:
S1: the channel instruction buffer unit acquires one or more channel instructions sent by the source core processor;
S2: a DMA channel instruction or an RMA channel instruction is extracted from the channel instruction buffer unit;
S3: a DMA micro-access is resolved from the DMA channel instruction and sent to the memory, and an RMA micro-access is resolved from the RMA channel instruction and sent to the target core processor;
S4: a reply word operation is initiated after a response returned by the memory or a response returned by the target core processor is acquired.
Preferably, step S2 specifically comprises: upon obtaining an extraction request for extracting a DMA channel instruction or an RMA channel instruction from the channel instruction buffer unit, performing unified allocation on the channel state register set to obtain the DMA channel instruction or the RMA channel instruction, and then dispatching the DMA channel instruction to the DMA splitting station and the RMA channel instruction to the RMA splitting station; in step S3, the DMA splitting station splits the DMA channel instruction to resolve the DMA micro-access, and the RMA splitting station splits the RMA channel instruction to resolve the RMA micro-access.
Preferably, step S4 specifically comprises: after acquiring a response returned by the memory or a response returned by the target core processor, updating the internal state of the channel state register set in real time; when the channel state register has collected all responses, a reply word operation is initiated, where the reply word operation sets a flag in the local memory of the source core processor or the target core processor.
Preferably, the splitting of the DMA channel instruction by the DMA splitting station and the splitting of the RMA channel instruction by the RMA splitting station are performed concurrently.
Preferably, after the DMA channel instruction or the RMA channel instruction is obtained, arbitration is performed according to the operation type and sequence of the channel instructions so as to dispatch the DMA channel instruction to the DMA splitting station and the RMA channel instruction to the RMA splitting station; after the DMA micro-access or the RMA micro-access is resolved, arbitration is performed according to the operation type and sequence of the micro-accesses so as to send the DMA micro-access to the memory and the RMA micro-access to the target core processor.
Preferably, the method further comprises a channel barrier instruction set for controlling the flow of the channel instruction stream, the channel barrier instruction set comprising a DMA barrier instruction, an RMA barrier instruction, and a full barrier instruction. After the DMA barrier instruction is received, subsequent DMA channel instructions are executed only after the transmissions initiated by the previously received DMA channel instructions are determined to be complete; after the RMA barrier instruction is received, subsequent RMA channel instructions are executed only after the transmissions initiated by the previously received RMA channel instructions are determined to be complete; and after the full barrier instruction is received, subsequent DMA or RMA instructions are executed only after the transmissions initiated by all previously received DMA and RMA instructions are determined to be complete.
Preferably, the split DMA micro-accesses or RMA micro-accesses are forwarded in parallel over the network on chip, either by rotation or by weight, so as to send the DMA micro-accesses to the memory and the RMA micro-accesses to the target core processor.
A data transmission device for memory access and on-chip communication of a many-core processor is used for performing data transmission between a source core processor and a memory and between the source core processor and a target core processor, and comprises a channel instruction buffer unit, a channel instruction extraction unit, a channel instruction splitting unit, and a channel instruction distribution unit. The channel instruction buffer unit is used for receiving and storing a DMA channel instruction or an RMA channel instruction sent by the source core processor; the channel instruction extraction unit is used for extracting the DMA channel instruction or the RMA channel instruction from the channel instruction buffer unit; the channel instruction splitting unit is used for splitting a DMA micro-access from the DMA channel instruction and sending it to the memory, and for splitting an RMA micro-access from the RMA channel instruction and sending it to the target core processor; the channel instruction distribution unit is used for receiving the DMA channel instruction or the RMA channel instruction extracted by the channel instruction extraction unit and sending it to the channel instruction splitting unit. The memory sends a response to the channel instruction distribution unit after receiving the DMA micro-access, the target core processor sends a response to the channel instruction distribution unit after receiving the RMA micro-access, and the channel instruction distribution unit sets a flag in the local memory of the source core processor or the target core processor after acquiring the response returned by the memory or the response returned by the target core processor.
Preferably, the channel instruction splitting unit comprises a DMA splitting station configured to split the DMA micro-access from the DMA channel instruction and send it to the memory, and an RMA splitting station configured to split the RMA micro-access from the RMA channel instruction and send it to the target core processor.
Preferably, the device further comprises a barrier instruction set management unit configured to issue, to the channel instruction buffer unit, a channel barrier instruction set for controlling the flow of the channel instruction stream, the channel barrier instruction set comprising a DMA barrier instruction, an RMA barrier instruction, and a full barrier instruction. After receiving the DMA barrier instruction issued by the barrier instruction set management unit, the channel instruction buffer unit continues to execute subsequent DMA channel instructions only after determining that the transmissions initiated by the previously received DMA channel instructions are complete; after receiving the RMA barrier instruction, it continues to execute subsequent RMA channel instructions only after determining that the transmissions initiated by the previously received RMA channel instructions are complete; and after receiving the full barrier instruction, it continues to execute subsequent DMA or RMA instructions only after determining that the transmissions initiated by all previously received DMA and RMA instructions are complete.
The channel instruction buffer unit acquires one or more channel instructions sent by the source core processor and buffers them; a DMA channel instruction or an RMA channel instruction is then extracted from the channel instruction buffer unit; next, a DMA micro-access is resolved from the DMA channel instruction and sent to the memory, and an RMA micro-access is resolved from the RMA channel instruction and sent to the target core processor; finally, a reply word operation is initiated after a response returned by the memory or a response returned by the target core processor is acquired.
Detailed Description
Specific embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to these embodiments.
Referring to fig. 1, the data transmission method for memory access and on-chip communication of a many-core processor in this embodiment includes the following steps:
S1: the channel instruction buffer unit acquires one or more channel instructions sent by the source core processor;
S2: a DMA channel instruction or an RMA channel instruction is extracted from the channel instruction buffer unit;
S3: a DMA micro-access is resolved from the DMA channel instruction and sent to the memory, and an RMA micro-access is resolved from the RMA channel instruction and sent to the target core processor;
S4: a reply word operation is initiated after a response returned by the memory or a response returned by the target core processor is acquired.
The channel instruction buffer unit acquires one or more channel instructions sent by the source core processor and buffers them; a DMA channel instruction or an RMA channel instruction is then extracted from the channel instruction buffer unit; next, a DMA micro-access is resolved from the DMA channel instruction and sent to the memory, and an RMA micro-access is resolved from the RMA channel instruction and sent to the target core processor; finally, a reply word operation is initiated after a response returned by the memory or a response returned by the target core processor is acquired. The channel instruction defines parameters such as the operation type, transmission length, source and target addresses, access mode, and reply word address of the batch transmission operation. After sending the channel instruction, the source core processor can proceed to execute its computing task.
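The channel-instruction parameters listed above can be pictured as a simple record. The following Python sketch is purely illustrative of the parameter set, not the patented hardware encoding; all field names, types, and example values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ChannelInstruction:
    """Illustrative model of a channel instruction; the text names the
    parameter kinds but not their encoding, so every field is hypothetical."""
    op_type: str      # "DMA" (core <-> memory) or "RMA" (core <-> core)
    length: int       # transmission length in bytes
    src_addr: int     # source address
    dst_addr: int     # target address
    access_mode: str  # e.g. point-to-point, row transmission, column multicast
    reply_addr: int   # local-memory address where the completion flag is written

# Example: a DMA channel instruction describing a 4 KB batch transfer.
ci = ChannelInstruction("DMA", 4096, 0x1000, 0x8000_0000, "p2p", 0x200)
```

Once such an instruction is handed off, the source core is free to compute, which matches the decoupling of computation and transmission described above.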
Step S2 may specifically comprise: when an extraction request for extracting a DMA channel instruction or an RMA channel instruction from the channel instruction buffer unit is obtained, the channel state register set performs unified allocation to obtain the DMA channel instruction or the RMA channel instruction, which is then dispatched to the DMA splitting station or the RMA splitting station respectively; in step S3, the DMA splitting station splits the DMA channel instruction to resolve the DMA micro-access, and the RMA splitting station splits the RMA channel instruction to resolve the RMA micro-access. When such an extraction request is obtained, the channel state register set alternately performs unified allocation of transmission channel numbers to obtain the instruction for each channel number, and each channel instruction is dispatched to the dedicated DMA splitting station or RMA splitting station according to its operation type for parallel split processing.
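The dispatch step above can be sketched behaviorally as follows. This is an illustration in software of the by-operation-type routing, not the hardware logic, and the dictionary keys are assumed:

```python
from collections import deque

def dispatch(channel_instructions):
    """Dispatch each extracted channel instruction to the dedicated
    splitting-station queue according to its operation type, preserving
    the arrival order within each type."""
    dma_station, rma_station = deque(), deque()
    for ci in channel_instructions:
        (dma_station if ci["op"] == "DMA" else rma_station).append(ci)
    return dma_station, rma_station

dma_q, rma_q = dispatch([{"op": "DMA", "id": 0},
                         {"op": "RMA", "id": 1},
                         {"op": "DMA", "id": 2}])
```

The two queues can then be drained concurrently, mirroring the parallel split processing performed by the two stations.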
Step S4 may specifically comprise: after acquiring a response returned by the memory or a response returned by the target core processor, updating the internal state of the channel state register set in real time; when the channel state register has collected all responses, a reply word operation is initiated, where the reply word operation sets a flag in the local memory of the source core processor or the target core processor. Whether the data transmission is complete is determined by polling the reply word address in the local memory (this address can be specified in the channel instruction).
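The reply word mechanism amounts to counting outstanding responses per channel and writing a flag once the count reaches zero. A minimal behavioral sketch, with all names and the flag value assumed for illustration:

```python
class ChannelState:
    """Tracks outstanding micro-access responses for one channel; when every
    response has been collected, the reply word (a completion flag) is
    written into local memory at the address given in the channel instruction."""
    def __init__(self, expected_responses, reply_addr):
        self.pending = expected_responses
        self.reply_addr = reply_addr

    def on_response(self, local_mem):
        self.pending -= 1
        if self.pending == 0:
            local_mem[self.reply_addr] = 1  # reply word operation: set the flag

local_mem = {}
state = ChannelState(expected_responses=3, reply_addr=0x200)
for _ in range(3):            # responses arriving from the memory / target core
    state.on_response(local_mem)
# The source core detects completion by polling local_mem at the reply address.
```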
The splitting of the DMA channel instruction by the DMA splitting station and the splitting of the RMA channel instruction by the RMA splitting station are performed concurrently, so that two data transmissions are executed in parallel and transmission efficiency is improved.
After the DMA channel instruction or the RMA channel instruction is obtained, arbitration is performed according to the operation type and sequence of the channel instructions so as to dispatch the DMA channel instruction to the DMA splitting station and the RMA channel instruction to the RMA splitting station; after the DMA micro-access or the RMA micro-access is resolved, arbitration is performed according to the operation type and sequence of the micro-accesses so as to send the DMA micro-access to the memory and the RMA micro-access to the target core processor.
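One arbitration policy consistent with this description (select by operation type while keeping per-type order) is a simple round-robin arbiter. The sketch below is an assumption, since the text does not fix the exact policy:

```python
from collections import deque

def arbitrate(dma_q, rma_q):
    """Alternate between the two operation types while preserving the
    issue order within each type (one possible arbitration policy)."""
    issued = []
    while dma_q or rma_q:
        if dma_q:
            issued.append(dma_q.popleft())
        if rma_q:
            issued.append(rma_q.popleft())
    return issued

# Two DMA requests interleaved with three RMA requests.
order = arbitrate(deque(["D0", "D1"]), deque(["R0", "R1", "R2"]))
```

Within each type the original sequence is preserved, which is the property the arbitration step requires.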
The data transmission method for memory access and on-chip communication of the many-core processor in this embodiment may further comprise a channel barrier instruction set for controlling the flow of the channel instruction stream, the channel barrier instruction set comprising a DMA barrier instruction, an RMA barrier instruction, and a full barrier instruction. After the DMA barrier instruction is received, subsequent DMA channel instructions continue to execute only after the transmissions initiated by the previously received DMA channel instructions are determined to be complete; after the RMA barrier instruction is received, subsequent RMA channel instructions continue to execute only after the transmissions initiated by the previously received RMA channel instructions are determined to be complete; and after the full barrier instruction is received, subsequent DMA or RMA instructions continue to execute only after the transmissions initiated by all previously received DMA and RMA instructions are determined to be complete. Under the control of these three barrier instructions, the device can control the ordering between the two data streams and within each data stream, providing low-level support for the many-core processor to implement different communication models. The two data transmissions are executed concurrently, and with matching channel instructions their ordering can be controlled and nested mixed operation of the two transmissions can be realized. Finally, data transmission between the source core processor and the target core processor is fully parallel with data transmission between the source core processor and the memory, on-chip data reuse is realized efficiently, and the computing capability of the many-core processor is improved.
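The effect of the three barrier instructions can be summarized as selecting which pending transmissions must drain before later instructions may issue. The Python sketch below illustrates that selection rule only; the barrier names and data layout are invented for illustration:

```python
def must_wait_for(barrier, pending):
    """Return the pending channel instructions whose transmissions must
    complete before instructions after the barrier may execute."""
    if barrier == "DMA_BARRIER":
        return [p for p in pending if p["op"] == "DMA"]   # DMA stream only
    if barrier == "RMA_BARRIER":
        return [p for p in pending if p["op"] == "RMA"]   # RMA stream only
    if barrier == "FULL_BARRIER":
        return list(pending)                               # both data streams
    raise ValueError(f"unknown barrier: {barrier}")

pending = [{"op": "DMA", "id": 0}, {"op": "RMA", "id": 1}]
```

A DMA barrier thus never stalls the RMA stream (and vice versa), while the full barrier orders both streams together.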
The split DMA micro-accesses or RMA micro-accesses are forwarded in parallel over the network on chip, either by rotation or by weight, so as to send the DMA micro-accesses to the memory and the RMA micro-accesses to the target core processor. When forwarding by weight, the micro-accesses are forwarded in parallel in order of priority from the largest weight to the smallest; otherwise they are forwarded by rotation in sequence order.
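The two forwarding orders (rotation and weight-ordered) can be sketched behaviorally as follows; the queue and record layouts are assumptions made for illustration:

```python
def round_robin(queues):
    """Rotate over the per-channel queues, taking one micro-access from
    each non-empty queue per turn (plain rotation)."""
    out = []
    while any(queues):
        for q in queues:
            if q:
                out.append(q.pop(0))
    return out

def by_weight(micro_accesses):
    """Order micro-accesses by priority weight, largest first."""
    return sorted(micro_accesses, key=lambda m: -m["weight"])

rotated = round_robin([["a1", "a2"], ["b1"]])
weighted = by_weight([{"id": "x", "weight": 1}, {"id": "y", "weight": 3}])
```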
Referring to fig. 2 and 3, a data transmission device for memory access and on-chip communication of a many-core processor is used for performing data transmission between a source core processor and a memory and between the source core processor and a target core processor. It comprises a channel instruction buffer unit for receiving and storing a DMA channel instruction or an RMA channel instruction sent by the source core processor; a channel instruction extraction unit for extracting the DMA channel instruction or the RMA channel instruction from the channel instruction buffer unit; a channel instruction splitting unit for splitting a DMA micro-access from the DMA channel instruction and sending it to the memory, and for splitting an RMA micro-access from the RMA channel instruction and sending it to the target core processor; and a channel instruction distribution unit for receiving the DMA channel instruction or the RMA channel instruction extracted by the channel instruction extraction unit and sending it to the channel instruction splitting unit. The memory sends a response to the channel instruction distribution unit after receiving the DMA micro-access, the target core processor sends a response to the channel instruction distribution unit after receiving the RMA micro-access, and the channel instruction distribution unit sets a flag in the local memory of the source core processor or the target core processor after acquiring the response returned by the memory or the response returned by the target core processor.
The device is compatible with batch data transmission between the local data memory (LDM) of each core and the main memory (DMA) and batch data transmission among the LDMs of the cores (RMA), and supports mixed operation of the two transmission modes. By adopting a unified framework, part of the resources are shared between the two modes, minimizing the implementation overhead. For batch data transmission among the cores, multiple transmission modes such as point-to-point transmission, row transmission, and column multicast are supported, and each type of transmission can carry multiple batches of data in parallel, so that overhead is saved and resources are dynamically adjusted and used. The many-core processor can thus completely decouple computation from data transmission, allowing users to conveniently construct large-scale, complex scientific computing programs.
The device reserves a small number of resources (such as message channel numbers) for each type of channel instruction and dynamically allocates the remaining resources between the two types of channel instructions, and sets up dedicated DMA and RMA splitting stations to support the splitting and processing of the two types of data transmission. The DMA and RMA transmission order can be flexibly scheduled and controlled, and a flexible, configurable reply word completion notification mechanism is implemented. A DMA transfer is a transfer between the source core processor and the memory; an RMA transfer is a transfer between the source core processor and the target core processor.
The channel instruction distribution unit can dynamically allocate and recycle the channel state register set.
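Dynamic allocation and recycling of the channel state registers behaves like a small free-list pool. The sketch below is illustrative only; the pool size, register numbering, and stall-on-exhaustion behavior are assumptions:

```python
class ChannelRegisterPool:
    """Free-list model of the channel state register set: a register is
    allocated when a channel instruction is accepted and recycled once its
    reply word operation has completed."""
    def __init__(self, size):
        self.free = list(range(size))
        self.in_use = set()

    def allocate(self):
        if not self.free:
            return None  # caller must stall until a register is recycled
        reg = self.free.pop(0)
        self.in_use.add(reg)
        return reg

    def recycle(self, reg):
        self.in_use.discard(reg)
        self.free.append(reg)

pool = ChannelRegisterPool(2)
a, b = pool.allocate(), pool.allocate()   # both registers now in use
stalled = pool.allocate()                 # None: the pool is exhausted
pool.recycle(a)                           # register a becomes reusable
```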
The channel instruction extraction unit may include a DMA extraction module to extract DMA channel instructions from the channel instruction buffer unit and an RMA extraction module to extract RMA channel instructions from the channel instruction buffer unit.
The data transmission device for memory access and on-chip communication of the many-core processor in this embodiment may further include an arbitration unit that, after the DMA channel instruction or the RMA channel instruction is obtained, performs arbitration according to the operation type and sequence of the channel instructions so as to dispatch the DMA channel instruction to the DMA splitting station and the RMA channel instruction to the RMA splitting station, and that, after the DMA micro-access or the RMA micro-access is resolved, performs arbitration according to the operation type and sequence of the micro-accesses so as to send the DMA micro-access to the memory and the RMA micro-access to the target core processor.
The channel instruction splitting unit comprises a DMA splitting station for splitting the DMA micro-access from the DMA channel instruction and sending it to the memory, and an RMA splitting station for splitting the RMA micro-access from the RMA channel instruction and sending it to the target core processor; the two split operations are performed separately, improving transmission efficiency. The DMA splitting station sends the DMA micro-access to the memory through the network on chip.
The data transmission device for memory access and on-chip communication of the many-core processor in this embodiment may further comprise a barrier instruction set management unit for issuing, to the channel instruction buffer unit, a channel barrier instruction set for controlling the flow of the channel instruction stream. The channel barrier instruction set comprises a DMA barrier instruction, an RMA barrier instruction, and a full barrier instruction. After receiving the DMA barrier instruction issued by the barrier instruction set management unit, the channel instruction buffer unit continues to execute subsequent DMA channel instructions only after determining that the transmissions initiated by the previously received DMA channel instructions are complete; after receiving the RMA barrier instruction, it continues to execute subsequent RMA channel instructions only after determining that the transmissions initiated by the previously received RMA channel instructions are complete; and after receiving the full barrier instruction, it continues to execute subsequent DMA or RMA instructions only after determining that the transmissions initiated by all previously received DMA and RMA instructions are complete. In other words, after receiving the DMA barrier instruction, the channel instruction buffer unit must wait for the transmissions initiated by the previously received DMA channel instructions to complete before allowing subsequent DMA channel instructions to begin executing, and after receiving the RMA barrier instruction, it must wait for the transmissions initiated by the previously received RMA channel instructions to complete before allowing subsequent RMA channel instructions to begin executing.
After receiving the full barrier instruction, the channel instruction buffer unit must wait for the transmissions initiated by all previously received channel transmission instructions to complete before allowing subsequent channel instructions to begin executing. A unified channel transmission and barrier instruction set is thus defined; under the control of the three barrier instructions, the channel instruction buffer unit can control the ordering between the two data streams and within each data stream, providing low-level support for the many-core processor to implement different communication models.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute alternatives for them, without departing from the spirit of the invention or exceeding the scope defined in the appended claims.