CN117196929B - Software and hardware interaction system based on fixed-length data packet - Google Patents

Software and hardware interaction system based on fixed-length data packet

Info

Publication number
CN117196929B
CN117196929B
Authority
CN
China
Prior art keywords
fixed
length data
target
data packet
sequence number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311242401.3A
Other languages
Chinese (zh)
Other versions
CN117196929A (en)
Inventor
高卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Muxi Integrated Circuit Shanghai Co ltd
Original Assignee
Muxi Integrated Circuit Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Muxi Integrated Circuit Shanghai Co ltd filed Critical Muxi Integrated Circuit Shanghai Co ltd
Priority to CN202311242401.3A priority Critical patent/CN117196929B/en
Publication of CN117196929A publication Critical patent/CN117196929A/en
Application granted granted Critical
Publication of CN117196929B publication Critical patent/CN117196929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Transfer Systems (AREA)

Abstract

The invention relates to the field of computer technology, and in particular to a software and hardware interaction system based on fixed-length data packets. The system comprises software, GPU hardware, and a storage area accessible to both, wherein the software comprises a driver and M processes, and the fixed-length data packet corresponding to each process has length R. The GPU hardware comprises a doorbell interface, N processing modules, and W groups of status registers; the doorbell interface is connected to the N processing modules, each processing module is connected to X groups of status registers, and each processing module can process the fixed-length data packets corresponding to X processes by time-division multiplexing. During system initialization, the driver sets up W ring buffers in the storage area and establishes a mapping between the ring buffers and doorbell addresses; during fixed-length data packet storage and processing, the target write pointer sequence number and the target read pointer sequence number are updated. The invention can reasonably issue fixed-length data packets to GPU hardware resources for processing, thereby improving GPU hardware performance.

Description

Software and hardware interaction system based on fixed-length data packet
Technical Field
The present invention relates to the field of computer technology, and in particular to a software and hardware interaction system based on fixed-length data packets.
Background
When a computer handles complex workloads such as artificial intelligence (AI) computation, software and graphics processing unit (GPU) hardware must cooperate: the software issues fixed-length data packets of equal length to the GPU hardware for processing. The software may include thousands of processes, each corresponding to a number of fixed-length data packets, while GPU hardware resources are limited. Existing software-hardware interaction mechanisms are complex, and GPU hardware performance suffers as a result. Therefore, how to reasonably issue the fixed-length data packets of software processes to GPU hardware resources for processing, so as to improve GPU hardware performance, has become a technical problem to be solved.
Disclosure of Invention
The object of the invention is to provide a software and hardware interaction system based on fixed-length data packets, which can reasonably issue fixed-length data packets to GPU hardware resources for processing, thereby improving GPU hardware performance.
According to one aspect of the invention, a software and hardware interaction system based on fixed-length data packets is provided. The system comprises software, GPU hardware, and a storage area accessible to both the software and the GPU hardware, wherein the software comprises a driver and M processes, and the fixed-length data packet corresponding to each process has length R; the GPU hardware comprises a doorbell interface, N processing modules, and W groups of status registers, where M > W > N, the doorbell interface is connected to the N processing modules, each processing module is connected to X groups of status registers, each processing module can process the fixed-length data packets corresponding to X processes by time-division multiplexing, and W = N × X;
during system initialization, the driver is configured to set up W ring buffers in the storage area, set corresponding state information in the W groups of status registers, and establish a mapping between the ring buffers and doorbell addresses, wherein each ring buffer corresponds to one doorbell address, the state information comprises a usage state, a doorbell address, a ring buffer start address, a read pointer sequence number and a write pointer sequence number, the initial values of the read pointer sequence number and the write pointer sequence number are 0, the ring buffer length L = Y × R, and Y is the number of fixed-length data packets the ring buffer can hold;
during fixed-length data packet storage, the driver selects, based on the usage states in the W groups of status registers, a target ring buffer corresponding to a process to be processed, and determines the number of distributable fixed-length data packets corresponding to the process to be processed according to a target read pointer sequence number and a target write pointer sequence number corresponding to the target ring buffer; the target processing module corresponding to the target ring buffer stores the fixed-length data packets to be distributed for the process to be processed into the target ring buffer according to the number of distributable fixed-length data packets, and updates the target write pointer sequence number according to the number of stored fixed-length data packets through the doorbell interface;
and during fixed-length data packet processing, the target processing module is configured to read fixed-length data packets from the target ring buffer for processing when the target read pointer sequence number and the target write pointer sequence number are unequal, and, as processing completes, to update the target read pointer sequence number in real time according to the number of processed fixed-length data packets through the doorbell interface.
Compared with the prior art, the invention has obvious advantages and beneficial effects. With the above technical solution, the software and hardware interaction system based on fixed-length data packets achieves considerable technical progress and practicality, has broad industrial value, and provides at least the following benefits:
according to the system, the mapping relation between the W doorbell addresses and the annular buffer zone arranged in the storage area which can be accessed by both software and GPU hardware is established through the W group of state registers, the cooperative interaction of the software and GPU hardware resources is realized based on the information in the state registers, fixed-length data packets can be reasonably issued to the GPU hardware resources for processing, and the GPU hardware performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a software and hardware interaction system based on a fixed-length data packet according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the present invention.
The embodiment of the invention provides a software and hardware interaction system based on fixed-length data packets, suitable for application scenarios that issue fixed-length data packets, such as AI computation; a fixed-length data packet is a data packet of fixed size. As shown in Fig. 1, the system includes software, GPU hardware, and a storage area accessible to both the software and the GPU hardware. The software runs on a central processing unit (CPU) and includes a driver and M processes, and the fixed-length data packet corresponding to each process has length R. The GPU hardware includes a doorbell interface, N processing modules, and W groups of status registers, with M > W > N. It should be noted that hardware resources are limited, so W is typically on the order of ten, while there may be thousands of software processes, so M is much larger than W. The doorbell interface is connected to the N processing modules, each processing module is connected to X groups of status registers, and each processing module can process the fixed-length data packets corresponding to X processes by time-division multiplexing, where W = N × X. The processing modules are based on the RISC-V instruction set; RISC-V is an open instruction set architecture (ISA) built on reduced instruction set computing (RISC) principles, where V denotes the fifth generation of RISC, four generations of RISC processor prototype chips having preceded it.
During system initialization, the driver is configured to set up W ring buffers in the storage area, set corresponding state information in the W groups of status registers, and establish a mapping between the ring buffers and doorbell addresses, with each ring buffer corresponding to one doorbell address. The state information includes a usage state, a doorbell address, a ring buffer start address, a read pointer sequence number, and a write pointer sequence number; the initial values of the read pointer sequence number and write pointer sequence number are 0, the ring buffer length is L = Y × R, and Y is the number of fixed-length data packets the ring buffer can hold. It should be noted that the usage state stored in a status register indicates whether the corresponding ring buffer can be reassigned to another process. The doorbell address together with the ring buffer start address defines the mapping between the ring buffer and the doorbell address; the next write start address is determined by combining the write pointer sequence number with the ring buffer start address and the fixed-length data packet length R, and the next read start address is determined by combining the read pointer sequence number with the ring buffer start address and R.
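To make the initialization step concrete, the following C sketch models one group of status registers and derives the next write and read start addresses from the sequence numbers. It is illustrative only: the structure, field, and function names, and the example values of R and Y, are assumptions and not definitions taken from the patent.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of one group of status registers (one group per ring buffer). */
typedef struct {
    bool     in_use;         /* usage state: false = selectable, true = non-selectable */
    uint64_t doorbell_addr;  /* doorbell address mapped to this ring buffer            */
    uint64_t ring_base;      /* ring buffer start address in the shared storage area   */
    uint32_t rd_seq;         /* read pointer sequence number D, initially 0            */
    uint32_t wr_seq;         /* write pointer sequence number C, initially 0           */
} status_regs_t;

#define R  64u   /* fixed-length packet size in bytes (example value)   */
#define Y  256u  /* packets per ring buffer, so ring length L = Y * R   */

/* Driver-side initialization of the W register groups (sketch). */
static void init_status_regs(status_regs_t *regs, int w,
                             uint64_t storage_base, const uint64_t *doorbells)
{
    for (int i = 0; i < w; i++) {
        regs[i].in_use        = false;                              /* selectable        */
        regs[i].doorbell_addr = doorbells[i];                       /* doorbell mapping  */
        regs[i].ring_base     = storage_base + (uint64_t)i * Y * R; /* ring of length Y*R */
        regs[i].rd_seq        = 0;
        regs[i].wr_seq        = 0;
    }
}

/* Next write / read start addresses derived from the sequence numbers. */
static uint64_t next_write_addr(const status_regs_t *s) { return s->ring_base + (uint64_t)s->wr_seq * R; }
static uint64_t next_read_addr (const status_regs_t *s) { return s->ring_base + (uint64_t)s->rd_seq * R; }
```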
During fixed-length data packet storage, the driver selects, based on the usage states in the W groups of status registers, a target ring buffer corresponding to a process to be processed, and determines the number of distributable fixed-length data packets corresponding to the process to be processed according to the target read pointer sequence number and target write pointer sequence number corresponding to the target ring buffer; the target processing module corresponding to the target ring buffer stores the fixed-length data packets to be distributed for the process to be processed into the target ring buffer according to the number of distributable fixed-length data packets, and updates the target write pointer sequence number according to the number of stored fixed-length data packets through the doorbell interface.
During fixed-length data packet processing, the target processing module is configured to read fixed-length data packets from the target ring buffer for processing when the target read pointer sequence number and the target write pointer sequence number are unequal, and, as processing completes, to update the target read pointer sequence number in real time according to the number of processed fixed-length data packets through the doorbell interface.
The system updates the target read pointer sequence number and target write pointer sequence number in real time according to how fixed-length data packets are written and processed, and then determines the target read start address and target write start address from these sequence numbers, thereby realizing cooperative software-hardware interaction; fixed-length data packets can be reasonably issued to GPU hardware resources for processing, improving GPU hardware performance.
In one embodiment, each ring buffer can hold the fixed-length data packets of only one process at a time, and the usage state of a ring buffer is either selectable or non-selectable. When fixed-length data packets are stored in the ring buffer and have not all been processed, the corresponding usage state is set to non-selectable; when no fixed-length data packets are stored in the ring buffer, or all fixed-length data packets of the corresponding process have been processed, the corresponding usage state is set to selectable. It will be understood that the ring buffer is a circular, end-to-end first-in first-out queue: while fixed-length data packets of one process are allocated in the ring buffer and not all of them have been processed, packets of other processes cannot be stored; once all of them have been processed, packets of another process can be stored without emptying the ring buffer. Only the corresponding status register needs to be changed to update the mapping between the ring buffer and the doorbell address, and the fixed-length data packets of the new process simply overwrite those already in the ring buffer.
In one embodiment, during fixed-length data packet storage, if at least one ring buffer has a selectable usage state, the driver selects a ring buffer whose usage state is selectable as the target ring buffer corresponding to the process to be processed. That is, when there is a ring buffer to which no process has yet been allocated, one can be selected directly as the target ring buffer for the process to be processed.
In one embodiment, each process carries priority information. During fixed-length data packet storage, if the usage states in all W groups of status registers are non-selectable and, among the processes currently distributing fixed-length data packets, there is a process whose priority is lower than that of the process to be processed, the driver allocates an additional ring buffer in the storage area as the target ring buffer corresponding to the process to be processed, selects a process to be adjusted from among the processes distributing fixed-length data packets, sets the ring buffer start address in the corresponding status register of the process to be adjusted to the start address of the target ring buffer, and resets the corresponding read pointer and write pointer sequence numbers to 0. It should be noted that, during processing, a low-priority process may depend on the result of a high-priority process, yet during distribution the low-priority process may have been distributed first; when hardware resources are insufficient to handle a newly added ring buffer, a ring buffer can be added directly by adjusting the mapping, and the newly added ring buffer can be processed preferentially.
In one embodiment, the system further includes a preset buffer. The driver is further configured to store, in the preset buffer, the ring buffer start address, read pointer sequence number, and write pointer sequence number corresponding to the process to be adjusted. When the usage state of at least one group of status registers is updated to selectable, the driver selects a group of status registers whose usage state is selectable and writes into it the ring buffer start address, read pointer sequence number, and write pointer sequence number saved for the process to be adjusted. By saving these values in the preset buffer, the mapping between the ring buffer of the process to be adjusted and a doorbell address can be re-established quickly once a group of status registers becomes selectable, the process to be adjusted can continue executing, and software-hardware interaction efficiency is improved.
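For illustration, and reusing the hypothetical status_regs_t structure from the initialization sketch above, saving and restoring the context of the process to be adjusted might look as follows; all names are assumptions rather than elements defined by the patent.

```c
/* Snapshot of the suspended process's ring context, held in the preset buffer. */
typedef struct {
    uint64_t ring_base;   /* ring buffer start address       */
    uint32_t rd_seq;      /* read pointer sequence number D  */
    uint32_t wr_seq;      /* write pointer sequence number C */
} ring_ctx_t;

/* Park the process to be adjusted: save its context into the preset buffer. */
static void park_process(ring_ctx_t *preset, const status_regs_t *regs)
{
    preset->ring_base = regs->ring_base;
    preset->rd_seq    = regs->rd_seq;
    preset->wr_seq    = regs->wr_seq;
}

/* Resume it later: copy the saved context into a status-register group that has
 * become selectable, re-establishing the ring-to-doorbell mapping quickly. */
static void resume_process(status_regs_t *regs, const ring_ctx_t *preset)
{
    regs->ring_base = preset->ring_base;
    regs->rd_seq    = preset->rd_seq;
    regs->wr_seq    = preset->wr_seq;
    regs->in_use    = true;   /* mark the group non-selectable again */
}
```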
In one embodiment, during fixed-length data packet storage, the driver is further configured to generate a fixed-length data packet distribution instruction based on the target doorbell address corresponding to the selected target ring buffer and on the information of the fixed-length data packets to be distributed, and to send the instruction through the doorbell interface to the target processing module corresponding to the target doorbell address, where the information of the fixed-length data packets to be distributed includes the number K of fixed-length data packets to be distributed and the fixed-length data packet data to be distributed.
In one embodiment, during fixed-length data packet storage, the driver determines the number of distributable fixed-length data packets corresponding to the process to be processed according to the target read pointer sequence number and target write pointer sequence number of the target ring buffer as follows: obtain the target read pointer sequence number D and the target write pointer sequence number C, which correspond to the sequence numbers of storage units in the target ring buffer and are therefore maintained in a cyclic counting mode. If D < C, the target write pointer is ahead of the target read pointer, and the number of distributable fixed-length data packets is E = Y − (C − D); if D > C, the target write pointer has wrapped around behind the target read pointer, and E = D − C.
In another embodiment, the values of D and C are maintained as accumulated (monotonically increasing) counts. If C − D < Y, the number of distributable fixed-length data packets is E = Y − (C − D); if C = D + Y, that is, C − D = Y and the ring buffer is full, then E = 0.
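As an illustration only, the two ways of computing E described above can be sketched in C as follows; the function names, and the handling of the D = C case in the cyclic variant, are assumptions not specified by the patent.

```c
#include <stdint.h>

/* Cyclic counting: D and C are slot indices within the target ring buffer. */
static uint32_t distributable_cyclic(uint32_t D, uint32_t C, uint32_t Y)
{
    if (D < C)                    /* write pointer ahead of read pointer          */
        return Y - (C - D);       /* E = Y - (C - D)                              */
    if (D > C)                    /* write pointer has wrapped behind read pointer */
        return D - C;             /* E = D - C                                    */
    return Y;                     /* D == C: assumed empty (not specified above)  */
}

/* Accumulated counting: D and C increase monotonically, so C - D packets are pending. */
static uint32_t distributable_accumulated(uint32_t D, uint32_t C, uint32_t Y)
{
    uint32_t pending = C - D;     /* 0 <= pending <= Y                            */
    return Y - pending;           /* E = Y - (C - D); E = 0 when the ring is full */
}
```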
During fixed-length data packet storage, the target processing module stores the fixed-length data packets to be distributed into the target ring buffer according to the number of distributable fixed-length data packets as follows: the target processing module parses the information of the fixed-length data packets to be distributed to obtain the number K of packets to be distributed and their data. If the current E ≥ K, the free space is sufficient for the K packets, and the K fixed-length data packets to be distributed are stored sequentially in the target ring buffer; if E < K, the free space is insufficient, and the module waits until E ≥ K before storing the K fixed-length data packets sequentially in the target ring buffer.
In one embodiment, updating the target write pointer sequence number through the doorbell interface according to the number of stored fixed-length data packets comprises:
if C + K ≤ Y, the K stored fixed-length data packets to be distributed do not cross the start of the target ring buffer, and C is updated to C + K; if C + K > Y, the K stored packets wrap past the start of the target ring buffer, and C is updated to C + K − Y.
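The E ≥ K check and the write pointer update can be combined into one routine. The following C sketch is illustrative only: the names and the byte-array view of the ring buffer are assumptions, and it simply mirrors the update C = C + K, or C + K − Y on wrap, described above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Store K fixed-length packets of R bytes each into the target ring buffer and
 * advance the write pointer sequence number C through the doorbell (sketch). */
static bool store_packets(uint8_t *ring, uint32_t *C, uint32_t Y, uint32_t R,
                          const uint8_t *pkts, uint32_t K, uint32_t E)
{
    if (E < K)
        return false;                       /* free space insufficient: wait until E >= K */

    for (uint32_t i = 0; i < K; i++) {
        uint32_t slot = (*C + i) % Y;       /* slot index within the ring (wraps at Y) */
        memcpy(ring + (size_t)slot * R, pkts + (size_t)i * R, R);
    }

    /* Doorbell update of C: C = C + K, or C + K - Y if the write crossed the ring start. */
    *C = (*C + K <= Y) ? (*C + K) : (*C + K - Y);
    return true;
}
```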
In one embodiment, during fixed-length data packet processing, the target processing module reading fixed-length data packets from the target ring buffer for processing when the target read pointer sequence number and target write pointer sequence number are unequal, and updating the target read pointer sequence number in real time through the doorbell interface according to the number of processed packets, comprises: when the target read pointer sequence number and target write pointer sequence number are unequal and the computing resources of the target processing module suffice to process at least one fixed-length data packet, packets are read and processed one by one starting from the packet addressed by the target read pointer sequence number; if D + 1 ≤ Y, D is updated to D + 1, and if D + 1 > Y, D is updated to D + 1 − Y. It should be noted that existing GPU hardware technology can obtain the current hardware resource occupancy in real time and determine whether the computing resources suffice to process at least one fixed-length data packet, which is not described further here. After a fixed-length data packet has been processed, in addition to updating the corresponding target read pointer sequence number, the processing module can actively send information about the packet just read to the driver, so that the driver can quickly learn whether the corresponding data packets can be distributed to the target ring buffer.
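A matching sketch of the processing side is given below. The names and the resource-availability callback are illustrative assumptions; the loop simply follows the rule of reading one packet at a time while D ≠ C and updating D = D + 1, or D + 1 − Y past the ring end.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

typedef void (*process_fn)(const uint8_t *pkt, uint32_t len);  /* per-packet handler (assumed) */
typedef bool (*resources_fn)(void);  /* true while compute resources suffice (assumed)         */

/* Read and process packets one by one while the read and write sequence numbers differ. */
static void drain_ring(const uint8_t *ring, uint32_t *D, uint32_t C,
                       uint32_t Y, uint32_t R,
                       process_fn process, resources_fn resources_ok)
{
    while (*D != C && resources_ok()) {
        const uint8_t *pkt = ring + (size_t)(*D % Y) * R;   /* packet at the read pointer */
        process(pkt, R);
        /* Doorbell update of the read pointer: D = D + 1, or D + 1 - Y past the ring end. */
        *D = (*D + 1 <= Y) ? (*D + 1) : (*D + 1 - Y);
    }
}
```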
The system provided by the embodiment of the invention establishes, through the W groups of status registers, a mapping between the W doorbell addresses and the ring buffers set up in the storage area accessible to both the software and the GPU hardware, and realizes cooperative interaction between the software and GPU hardware resources based on the information in the status registers, so that fixed-length data packets can be reasonably issued to GPU hardware resources for processing and GPU hardware performance is improved.
The present invention is not limited to the above-mentioned embodiments; any modifications, equivalent substitutions, and improvements made without departing from the spirit and scope of the invention fall within the scope of the invention.

Claims (8)

1. A software and hardware interaction system based on fixed-length data packets, characterized in that:
the system comprises software, GPU hardware, and a storage area accessible to both the software and the GPU hardware, wherein the software comprises a driver and M processes, and the fixed-length data packet corresponding to each process has length R; the GPU hardware comprises a doorbell interface, N processing modules, and W groups of status registers, where M > W > N, the doorbell interface is connected to the N processing modules, each processing module is connected to X groups of status registers, each processing module can process the fixed-length data packets corresponding to X processes by time-division multiplexing, and W = N × X;
during system initialization, the driver is configured to set up W ring buffers in the storage area, set corresponding state information in the W groups of status registers, and establish a mapping between the ring buffers and doorbell addresses, wherein each ring buffer corresponds to one doorbell address, the state information comprises a usage state, a doorbell address, a ring buffer start address, a read pointer sequence number and a write pointer sequence number, the initial values of the read pointer sequence number and the write pointer sequence number are 0, the ring buffer length L = Y × R, and Y is the number of fixed-length data packets the ring buffer can hold;
during fixed-length data packet storage, the driver selects, based on the usage states in the W groups of status registers, a target ring buffer corresponding to a process to be processed, and determines the number of distributable fixed-length data packets corresponding to the process to be processed according to a target read pointer sequence number and a target write pointer sequence number corresponding to the target ring buffer; the target processing module corresponding to the target ring buffer stores the fixed-length data packets to be distributed for the process to be processed into the target ring buffer according to the number of distributable fixed-length data packets, and updates the target write pointer sequence number according to the number of stored fixed-length data packets through the doorbell interface;
and during fixed-length data packet processing, the target processing module is configured to read fixed-length data packets from the target ring buffer for processing when the target read pointer sequence number and the target write pointer sequence number are unequal, and, as processing completes, to update the target read pointer sequence number in real time according to the number of processed fixed-length data packets through the doorbell interface.
2. The system according to claim 1, characterized in that:
each ring buffer can hold the fixed-length data packets of only one process at a time, and the usage state of a ring buffer is either selectable or non-selectable; when fixed-length data packets are stored in the ring buffer and have not all been processed, the corresponding usage state is set to non-selectable; when no fixed-length data packets are stored in the ring buffer, or all fixed-length data packets of the corresponding process have been processed, the corresponding usage state is set to selectable.
3. The system according to claim 2, characterized in that:
in the process of storing the fixed-length data packet, if at least one ring buffer with the use state being the optional state exists, the driver program selects the ring buffer with the use state being the optional state as a target ring buffer corresponding to the process to be processed.
4. The system according to claim 2, characterized in that:
each process carries priority information; during fixed-length data packet storage, if the usage states in all W groups of status registers are non-selectable and, among the processes currently distributing fixed-length data packets, there is a process whose priority is lower than that of the process to be processed, the driver allocates an additional ring buffer in the storage area as the target ring buffer corresponding to the process to be processed, selects a process to be adjusted from among the processes distributing fixed-length data packets, sets the ring buffer start address in the corresponding status register of the process to be adjusted to the start address of the target ring buffer, and resets the corresponding read pointer and write pointer sequence numbers to 0.
5. The system according to claim 4, characterized in that:
the system also comprises a preset buffer zone, the driver is further used for storing a ring buffer zone starting address, a read pointer sequence number and a write pointer sequence number corresponding to the process to be adjusted in the preset buffer zone, when the use state of at least one group of state registers is updated to be the selectable state, the driver selects one group of state registers with the use state being the selectable state, and updates the ring buffer zone starting address, the read pointer sequence number and the write pointer sequence number in the selected state registers to the ring buffer zone starting address, the read pointer sequence number and the write pointer sequence number corresponding to the process to be adjusted.
6. The system according to claim 1, characterized in that:
in the process of storing the fixed-length data packets, the driver is further used for generating a fixed-length data packet distribution instruction based on a target doorbell address corresponding to the selected target annular buffer zone and fixed-length data packet information to be distributed, the fixed-length data packet distribution instruction is sent to a target processing module corresponding to the target doorbell address through a doorbell interface, and the fixed-length data packet information to be distributed comprises the number K of the fixed-length data packets to be distributed and the fixed-length data packet data to be distributed.
7. The system according to claim 6, characterized in that:
in the process of storing the fixed-length data packets, the driver determines the number of the distributable fixed-length data packets corresponding to the to-be-processed process according to the target read pointer sequence number and the target write pointer sequence number corresponding to the target annular buffer area, and the method comprises the following steps:
acquiring a target read pointer sequence number D and a target write pointer sequence number C, and if D is smaller than C, determining the number E=Y- (C-D) of distributable fixed-length data packets; if D > C, the number of distributable fixed-length packets e=d-C.
8. The system according to claim 7, characterized in that:
in the process of storing the fixed-length data packets, the target processing module stores the fixed-length data packets to be distributed into the target annular buffer area according to the number of the distributable fixed-length data packets, and the method comprises the following steps:
the target processing module analyzes the fixed-length data packet information to be distributed, acquires the number K of the fixed-length data packets to be distributed and the fixed-length data packet data to be distributed, sequentially stores the K fixed-length data packets to be distributed in a target annular buffer area if the current E is more than or equal to K, and sequentially stores the K fixed-length data packets to be distributed in the target annular buffer area if the current E is more than or equal to K otherwise.
CN202311242401.3A 2023-09-25 2023-09-25 Software and hardware interaction system based on fixed-length data packet Active CN117196929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311242401.3A CN117196929B (en) 2023-09-25 2023-09-25 Software and hardware interaction system based on fixed-length data packet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311242401.3A CN117196929B (en) 2023-09-25 2023-09-25 Software and hardware interaction system based on fixed-length data packet

Publications (2)

Publication Number Publication Date
CN117196929A CN117196929A (en) 2023-12-08
CN117196929B (en) 2024-03-08

Family

ID=88984892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311242401.3A Active CN117196929B (en) 2023-09-25 2023-09-25 Software and hardware interaction system based on fixed-length data packet

Country Status (1)

Country Link
CN (1) CN117196929B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546343A (en) * 2016-01-15 2017-07-19 Stmicroelectronics (Grenoble2) Sas Apparatus and methods implementing dispatch mechanisms for offloading executable functions

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102006241A (en) * 2010-12-17 2011-04-06 天津曙光计算机产业有限公司 Method for receiving message through buffer area shared by multiple applications
CN107124286A (en) * 2016-02-24 2017-09-01 深圳市知穹科技有限公司 A kind of mass data high speed processing, the system and method for interaction
CN110998649A (en) * 2017-08-04 2020-04-10 微软技术许可有限责任公司 Flexible buffer size adjustment in a graphics processor
CN109933438A (en) * 2019-01-31 2019-06-25 西南电子技术研究所(中国电子科技集团公司第十研究所) High speed shared drive data receiving-transmitting system
CN110167197A (en) * 2019-04-16 2019-08-23 武汉虹信通信技术有限责任公司 GTP downlink data transmission optimization method and device
CN114008588A (en) * 2019-06-26 2022-02-01 Ati科技无限责任公司 Sharing multimedia physical functions in a virtualized environment of processing units
CN110704335A (en) * 2019-09-03 2020-01-17 苏州浪潮智能科技有限公司 Data reading and writing method and device based on asynchronous ring buffer
CN110855610A (en) * 2019-09-30 2020-02-28 视联动力信息技术股份有限公司 Data packet processing method and device and storage medium
CN111292222A (en) * 2020-01-22 2020-06-16 中国科学院新疆天文台 Pulsar de-dispersion device and method
CN113535395A (en) * 2021-07-14 2021-10-22 西安电子科技大学 Descriptor queue and memory optimization method, system and application of network storage service
CN116523729A (en) * 2023-06-27 2023-08-01 深流微智能科技(深圳)有限公司 Graphics processing device, graphics rendering pipeline distribution method and related devices

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Cadence advanced packaging EDA tools efficiently enable CoWoS-S silicon interposer design and sign-off; Gu Yu et al.; China Integrated Circuit; 2023-10-31; Vol. 32, No. 10; pp. 76-82 *
GPU Synthesis of RF Channeliser Outputs for a Variable Bandwidth Microwave Digital Receiver;Simon Faulkner等;《2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS)》;20190203;第1-8页 *
Design of a guide device for the blind based on embedded image processing and ultrasonic-assisted detection; Gao Wei; China Masters' Theses Full-text Database; 2021-02-28; Engineering Science and Technology II, C028-163 *
Design and implementation of a GPU graphics acceleration driver on the Phytium platform; Li Rongzhen et al.; Computer Engineering and Applications; 2014-03-31; Vol. 50, No. 5; pp. 126-131 *

Also Published As

Publication number Publication date
CN117196929A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US6401155B1 (en) Interrupt/software-controlled thread processing
US7913034B2 (en) DRAM access command queuing
US7752349B2 (en) Apparatus and method for performing DMA data transfer
CN114185818B (en) GPU (graphics processing Unit) memory access self-adaptive optimization method and device based on extended page table
US20040148606A1 (en) Multi-thread computer
CN113836184A (en) Service persistence method and device
US6457121B1 (en) Method and apparatus for reordering data in X86 ordering
CN114610472A (en) Multi-process management method in heterogeneous computing and computing equipment
US8972693B2 (en) Hardware managed allocation and deallocation evaluation circuit
CN110515872B (en) Direct memory access method, device, special computing chip and heterogeneous computing system
JPH0358150A (en) Memory controller
CN117196929B (en) Software and hardware interaction system based on fixed-length data packet
JPH1196072A (en) Memory access control circuit
EP2437159A1 (en) Operation apparatus and control method thereof
TW201351276A (en) Scheduling and execution of compute tasks
US8307165B1 (en) Sorting requests to the DRAM for high page locality
CN111863139B (en) Gene comparison acceleration method and system based on near-memory computing structure
JPH08212178A (en) Parallel computer
CN112711442A (en) Host command writing method, device and system and readable storage medium
JP2009059276A (en) Data processing apparatus and program
US9367487B1 (en) Mitigating main crossbar load using dedicated connections for certain traffic types
US5875299A (en) disk access apparatus for performing a stride processing of data
US6633928B2 (en) Fast process context switching buffer
CN114265694A (en) Memory replacement method and device
CN118151837A (en) FPGA-based dispersion/aggregation processing method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant