WO2022121287A1 - Command issuing method and apparatus, processing device, computer device, and storage medium


Info

Publication number: WO2022121287A1
Authority: WO (WIPO, PCT)
Application number: PCT/CN2021/102943
Prior art keywords: processing device, command, buffer, stream, host
Other languages: French (fr), Chinese (zh)
Inventors: 冷祥纶, 孙海涛
Original assignee: 上海阵量智能科技有限公司
Application filed by 上海阵量智能科技有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38: Information transfer, e.g. on bus
    • G06F 13/40: Bus structure
    • G06F 13/4004: Coupling between buses
    • G06F 13/4022: Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G06F 2213/00: Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026: PCI express

Definitions

  • The present disclosure relates to the field of computer technology, and in particular, to a command issuing method, apparatus, processing device, computer device, and storage medium.
  • In the field of deep learning, artificial intelligence (AI) chips, like graphics processing units (GPUs), are usually used as accelerator cards for the host/CPU.
  • The AI chip or GPU can be called a processing device, which is scheduled and controlled by the host.
  • The present disclosure provides a command issuing method, apparatus, processing device, computer device, and storage medium.
  • According to a first aspect, a command issuing method is provided, comprising: generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; inserting the at least one command stream into a buffer; and transmitting the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
  • In some optional embodiments, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device includes: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  • In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating a write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and sending the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
  • In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and, when the number of updates of the write pointer of the buffer reaches a preset number, sending the pointer information of the most recently updated write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
  • In some optional embodiments, the method further includes: receiving pointer information of a read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of the read operation on the buffer; and updating the copy of the read pointer on the host side according to the pointer information of the read pointer.
  • In some optional embodiments, the communication link is a PCI-Express link, a high-speed serial computer expansion bus standard.
  • According to a second aspect, another command issuing method is provided, comprising: pulling at least one command stream from a buffer on the host side through a communication link between a processing device and a host; and reading the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • In some optional embodiments, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  • In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: in the case where multiple command streams are pulled from the buffer on the host side, reading the multiple command streams into different local stream queues of the processing device respectively; the method further includes: executing the command streams in the different local stream queues in parallel.
  • In some optional embodiments, the method further includes: receiving pointer information of the write pointer sent by the host through the communication link; and updating the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
  • In some optional embodiments, pulling at least one command stream from the buffer on the host side includes: determining the number of command streams to be issued in the buffer according to the pointer information of the processing device's local read pointer and copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
  • In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: each time one command stream is read into a local stream queue of the processing device, updating the local read pointer of the processing device; and sending the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
  • In some optional embodiments, the communication link is a PCI-Express link.
  • According to a third aspect, a command issuing apparatus is provided, comprising: a command stream generation module configured to generate at least one command stream according to multiple commands to be issued to a processing device for processing, wherein each command stream includes at least one command; an insertion module configured to insert the at least one command stream into a buffer; and a transmission module configured to transmit the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
  • According to a fourth aspect, a processing device is provided, comprising: a queue memory configured to store stream queues; and a microprocessor configured to pull at least one command stream from a buffer on the host side through a communication link between the processing device and the host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • According to a fifth aspect, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the command issuing method according to any one of the first aspect or the second aspect when executing the program.
  • According to a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
  • A computer program product is also provided, including a computer program that, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
  • In the embodiments of the present disclosure, at least one command stream may be generated from the multiple commands to be issued to the processing device, and the commands may be issued to the processing device in the form of command streams.
  • With this command issuing method, multiple commands can be delivered by issuing a single command stream, that is, through one communication of the communication link.
  • The communication frequency between the host and the processing device is thus effectively reduced, the communication overhead between the host and the processing device is lowered, and the scheduling efficiency of the host is improved.
  • FIG. 1 is a flowchart of a method for issuing commands according to an exemplary embodiment
  • FIG. 2 is a flowchart of another method for issuing commands according to an exemplary embodiment
  • FIG. 3 is an interactive flowchart of a method for issuing commands according to an exemplary embodiment
  • FIG. 4 is a schematic diagram of an apparatus for issuing commands according to an exemplary embodiment
  • FIG. 5 is a schematic diagram of another device for issuing commands according to an exemplary embodiment
  • FIG. 6 is a schematic diagram of another apparatus for issuing commands according to an exemplary embodiment
  • FIG. 7 is a schematic diagram of another apparatus for issuing commands according to an exemplary embodiment
  • FIG. 8 is a schematic diagram of a processing device according to an exemplary embodiment
  • FIG. 9 is a schematic diagram of another processing device according to an exemplary embodiment.
  • FIG. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure.
  • The word "if" as used herein can be interpreted as "at the time of", "when", or "in response to determining".
  • When the host schedules and controls the processing device, it needs to issue operation commands frequently, and the communication link between the host and the processing device (such as a PCI-Express link) also needs to carry a large amount of communication data and model code.
  • As a result, the communication link frequently hits a communication bottleneck: its communication overhead is too large and the scheduling efficiency is low.
  • To this end, in the embodiments of the present disclosure, a host generates at least one command stream according to multiple commands to be issued to a processing device, inserts the at least one command stream into a buffer, and transmits the command streams in the buffer to the processing device.
  • FIG. 1 is a flowchart of a method for issuing commands according to an embodiment of the present disclosure. The method is applied to the host. As shown in Figure 1, the process includes:
  • Step 101 Generate at least one command stream according to multiple commands to be sent to the processing device for processing; wherein each of the command streams includes at least one command.
  • The commands used to generate the command streams are commands that are generated by the host and need to be sent to the processing device for processing.
  • For example, they may be commands generated by multiple processes of the application layer.
  • Suppose the application layer includes an application used for payment and an application used for photo beautification.
  • The commands generated by the payment application's process or the photo-beautification application's process need to be sent to a processing device (such as an AI chip) for processing.
  • Such commands are the commands to be sent to the processing device for processing.
  • The commands in a command stream may include various operators (kernels) of a deep learning model, data movement (memcpy) commands, and event synchronization commands.
  • At least one command stream may be generated according to multiple commands to be issued.
  • a command stream may include one command, and may also include multiple commands. Commands in the same command stream need to be executed sequentially, and different command streams can be executed in parallel.
  • the command stream here is similar to the "stream” in CUDA (Compute Unified Device Architecture, which is a computing platform launched by the graphics card manufacturer NVIDIA).
  • For example, suppose the commands to be issued include command 1, command 2, command 3, command A, command B, and command C.
  • This step may generate a command stream 1 from command 1, command 2, and command 3, so that command stream 1 includes command 1, command 2, and command 3.
  • Similarly, a command stream A may be generated from command A and command B, so that command stream A includes command A and command B; and a command stream C may be generated from command C alone.
  • The execution of the respective command streams does not affect each other, so the three command streams can be executed in parallel.
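  • As a non-limiting illustration, the following C sketch shows one way such grouping could be represented in a host-side driver; the type and function names (command_t, command_stream_t, stream_append) and the capacity are hypothetical assumptions, not taken from this disclosure.

```c
#include <stddef.h>

/* Hypothetical command and command-stream layout, for illustration only. */
typedef enum { CMD_KERNEL, CMD_MEMCPY, CMD_EVENT_SYNC } cmd_type_t;

typedef struct {
    cmd_type_t type;      /* operator, data movement, or event synchronization */
    const void *payload;  /* command-specific arguments */
    size_t payload_len;
} command_t;

#define MAX_CMDS_PER_STREAM 64   /* assumed capacity */

typedef struct {
    command_t cmds[MAX_CMDS_PER_STREAM]; /* executed in order within the stream */
    size_t count;
} command_stream_t;

/* Append a command to a stream; commands inside one stream run sequentially,
 * while separate command_stream_t instances may be executed in parallel. */
static int stream_append(command_stream_t *s, command_t c)
{
    if (s->count >= MAX_CMDS_PER_STREAM)
        return -1;               /* stream full */
    s->cmds[s->count++] = c;
    return 0;
}
```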
  • Step 102 Insert the at least one command stream into a buffer.
  • The buffer in this embodiment may be a ring buffer (Ring Buffer). It can be understood that any buffer that meets the usage requirements of this step can serve as the buffer of this embodiment, which is not limited to a ring buffer.
  • A ring buffer follows a typical "producer-consumer" model.
  • The host is the producer and can insert command streams into the ring buffer;
  • the processing device is the consumer and can pull command streams down from the ring buffer into its local stream queue (Stream Queue).
  • In this step, one or more command streams may be inserted into the ring buffer.
  • For example, the driver can create a separate command stream buffer (Stream Buffer) for each command stream and insert each Stream Buffer into the Ring Buffer under a lock.
  • Each Stream Buffer corresponds to one entry of the Ring Buffer.
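  • A minimal producer-side sketch of this insertion step is given below, using a pthread mutex to stand in for the locking mentioned above; the names (ring_buffer_t, ring_insert) and the ring size are illustrative assumptions, not the actual driver implementation.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct command_stream command_stream_t;  /* one Stream Buffer (opaque here) */

#define RING_ENTRIES 256u   /* assumed ring size */

typedef struct {
    command_stream_t *entries[RING_ENTRIES]; /* each entry holds one Stream Buffer */
    unsigned write_ptr;       /* master write pointer, owned by the host */
    unsigned read_ptr_copy;   /* host-side copy of the device's read pointer */
    pthread_mutex_t lock;
} ring_buffer_t;

/* Producer: insert one command stream into the ring buffer under a lock. */
static bool ring_insert(ring_buffer_t *rb, command_stream_t *stream)
{
    bool ok = false;
    pthread_mutex_lock(&rb->lock);
    if (rb->write_ptr - rb->read_ptr_copy < RING_ENTRIES) { /* ring not full */
        rb->entries[rb->write_ptr % RING_ENTRIES] = stream;
        rb->write_ptr++;       /* advance the master write pointer */
        ok = true;
    }
    pthread_mutex_unlock(&rb->lock);
    return ok;
}
```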
  • Step 103 Transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • The communication link between the host and the processing device may be a PCI-Express (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) link. It can be understood that, in addition to PCI-Express, other types of communication links may also be used between the host and the processing device, which is not limited in the present disclosure.
  • the command stream in the buffer can be transmitted to the processing device through the PCI-Express link.
  • a command stream can be transmitted to the processing device through one communication of PCI-Express.
  • multiple commands can be streamed to the processing device through a single PCI-Express communication.
  • In the process of transmitting the command streams in the buffer to the processing device, the host may actively send the command streams to the processing device, or the processing device may actively pull the command streams from the buffer.
  • The specific manner in which the command streams in the buffer are transmitted to the processing device can likewise take various forms, which is not limited in this embodiment.
  • the host can actively send a certain number of command streams in the buffer to the processing device, and the processing device further processes the issued command streams. For example, when the number of command streams in the buffer reaches a certain preset number, the host may send a certain preset number of command streams to the processing device at one time through the PCI-Express link.
  • the processing device can also actively pull a certain number of command streams from the buffer.
  • For example, the processing device can poll the pointer information of the buffer on the host side. If there is a command stream to be issued in the buffer, the processing device can pull a command stream into its local stream queue through the PCI-Express link. Alternatively, the processing device can pull multiple command streams from the ring buffer on the host side into its local stream queues at one time through the PCI-Express link.
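  • The following C sketch illustrates such a batched pull: up to a chosen maximum number of consecutive ring entries are fetched in a single link transaction. pcie_read_block() stands in for whatever PCI-Express read/DMA primitive the platform actually provides, and all names here are assumptions for illustration (wrap-around of the ring is not handled).

```c
#include <stddef.h>

typedef struct command_stream command_stream_t;   /* opaque command stream */

/* Assumed platform helper: copy 'len' bytes from a host address over PCIe. */
extern int pcie_read_block(void *dst, unsigned long host_addr, size_t len);

/* Consumer: pull up to 'max' pending command streams in one communication.
 * 'read_slot' is the ring slot index on the host side (already wrapped). */
static unsigned pull_streams(unsigned long host_ring_base, size_t entry_size,
                             unsigned read_slot, unsigned pending,
                             command_stream_t *dst, unsigned max)
{
    unsigned n = pending < max ? pending : max;
    if (n == 0)
        return 0;
    /* A single transfer over the link fetches all n ring entries at once. */
    pcie_read_block(dst,
                    host_ring_base + (unsigned long)read_slot * entry_size,
                    (size_t)n * entry_size);
    return n;
}
```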
  • In summary, in the embodiments of the present disclosure, the host can generate at least one command stream according to multiple commands to be issued to the processing device, and issue commands to the processing device in the form of command streams.
  • Issuing one command stream realizes the delivery of multiple commands, which reduces the number of communications between the host and the processing device, lowers the communication overhead between them, and improves the scheduling efficiency of the host.
  • In recent years, the computing capability (computing power) of AI chips has kept increasing, even reaching 256/512 TOPS.
  • If the host cannot issue operation commands to the processing device in time for scheduling and control, the computing power of the processing device cannot be fully utilized and computing resources are wasted.
  • With the command issuing method of this embodiment, the computing power of the processing device can be more fully utilized.
  • In some embodiments, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device may include: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  • That is, when multiple command streams have been inserted into the buffer, the host can transmit the multiple command streams in the buffer to the processing device in one batch through a single communication of the communication link.
  • For example, the host may transmit all the command streams in the buffer to the processing device in a batch through one communication of the communication link.
  • Alternatively, the host may transmit only some of the command streams in the buffer (more than one command stream) to the processing device in a batch through one communication of the communication link.
  • the host can transmit multiple command streams to the processing device at one time through the communication link, which further reduces the number of communications between the host and the processing device.
  • the communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • FIG. 2 is a flowchart of another method for issuing commands according to an embodiment of the present disclosure.
  • The method is applied to a processing device. As shown in Figure 2, the process includes:
  • Step 201 Pull at least one command stream from a buffer on the host side through the communication link between the processing device and the host.
  • the host generates the command stream to be sent to the processing device for processing, and buffers the command stream in the buffer. For example, the host may buffer the commands to be issued in the form of a command stream in a ring buffer.
  • the processing device can pull one command stream from the buffer at a time through the communication link with the host, or pull multiple command streams in batches at one time.
  • The number of command streams that the processing device pulls from the buffer at one time needs to be determined jointly according to the number of command streams to be delivered in the buffer and the number of idle local stream queues of the processing device.
  • For example, when there is one command stream to be delivered, the processing device may determine that there is at least one idle stream queue locally, pull the command stream to be issued from the buffer, and read it into the corresponding idle stream queue. This completes the delivery of the multiple commands included in that one command stream from the host side to the processing device side.
  • Similarly, when there are multiple command streams to be delivered, the processing device may determine that there are enough idle stream queues locally, pull the multiple command streams from the buffer in one batch, and read them into different stream queues respectively. This completes the delivery of the commands included in the multiple command streams from the host side to the processing device side.
  • Before pulling, the processing device needs to determine the number of command streams to be issued in the host-side buffer. Only when there are command streams to be issued in the buffer and an idle stream queue exists locally on the processing device does the processing device pull a certain number of command streams from the buffer.
  • To determine the number of command streams to be issued in the host-side buffer, the processing device can poll the read and write pointers of the host-side buffer through the communication link with the host, and determine from these pointers whether there are command streams to be issued in the buffer.
  • Step 202 Read the pulled at least one command stream into a local stream queue, where the stream queue is used to store the command stream to be executed.
  • the processing device may include a stream queue for storing command streams to be executed.
  • The processing device can read the multiple command streams pulled from the buffer into different stream queues respectively. The processing device can then use a command dispatcher to distribute the commands in the stream queues to different computing units for computation.
  • In other words, the processing device may read the multiple command streams into different local stream queues respectively and execute the command streams in the different stream queues in parallel, which improves the efficiency with which the processing device executes commands.
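  • A sketch of this queue placement is shown below; the queue count, structure, and function names are assumptions for illustration. Streams placed in different queues can then be handed to separate compute units.

```c
#include <stddef.h>

typedef struct command_stream command_stream_t;   /* opaque command stream */

#define NUM_STREAM_QUEUES 8     /* assumed number of local stream queues */

typedef struct {
    command_stream_t *current;  /* NULL when the queue is idle */
} stream_queue_t;

static stream_queue_t stream_queues[NUM_STREAM_QUEUES];

/* Place each pulled stream into its own idle queue; command streams sitting in
 * different queues may then be executed in parallel by separate compute units. */
static unsigned enqueue_streams(command_stream_t *pulled[], unsigned n)
{
    unsigned placed = 0;
    for (unsigned q = 0; q < NUM_STREAM_QUEUES && placed < n; q++) {
        if (stream_queues[q].current == NULL)
            stream_queues[q].current = pulled[placed++];
    }
    return placed;   /* number of streams actually accepted */
}
```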
  • the processing device may pull one command stream from the buffer on the host side at a time through the communication link with the host.
  • pulling one command stream can implement the issuance of multiple commands, reducing the number of communications between the host and the processing device.
  • the communication overhead of the communication link between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • In addition, the computing power of the processing device can also be more fully utilized.
  • In some embodiments, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  • That is, when the buffer on the host side includes multiple command streams that can be pulled, the processing device can pull the multiple command streams from the host-side buffer in one batch through a single communication of the communication link.
  • For example, the processing device may pull all the command streams in the host-side buffer in a batch through one communication of the communication link.
  • Alternatively, the processing device may pull only some of the command streams (more than one command stream) from the host-side buffer in a batch through one communication of the communication link.
  • In this way, the processing device can pull multiple command streams from the host-side buffer at one time, which further reduces the number of communications between the host and the processing device, greatly lowers the communication overhead of the communication link, and improves the scheduling efficiency of the host.
  • In addition, the computing power of the processing device can also be more fully utilized.
  • As mentioned in step 201, the processing device needs to determine the number of command streams to be issued in the host-side buffer.
  • The processing device pulls command streams from the buffer only when there are command streams to be issued in the buffer and an idle stream queue exists locally on the processing device.
  • To determine the number of command streams to be issued in the buffer, the processing device needs to obtain the read and write pointers of the buffer. In a related approach, the processing device polls the read and write pointers of the host-side buffer through the communication link with the host. This "polling" way of obtaining the read and write pointers requires the processing device to access the host a large number of times over the communication link, which inevitably adds communication overhead to the link.
  • To this end, the present disclosure provides a new pointer acquisition method, which enables the processing device to obtain the read and write pointers of the host-side buffer with fewer communications.
  • Specifically, corresponding read and write pointers are set locally on the processing device side, and the read and write pointers on the two sides are kept in sync according to certain rules.
  • For example, the read and write pointers of the buffer on the host side can be stored in the host's local main memory, the corresponding read and write pointers on the processing device side can be stored in the processing device's local registers, and the pointers stored on the two sides are synchronized according to certain rules.
  • Since the read and write pointers of the buffer are mirrored on the processing device side, the processing device does not need to access the host; it only needs to poll its local read and write pointers, and from them it can determine the number of command streams to be issued in the host-side buffer, which greatly reduces the number of communications over the communication link.
  • In this embodiment, the read and write pointers of the buffer may be set up in a master-copy manner.
  • On the host side, the write-pointer is the master and the read-pointer is the copy; on the processing device side, the write-pointer is the copy and the read-pointer is the master.
  • For ease of description, the write-pointer on the host side is called the write pointer and the read-pointer on the host side is called the copy of the read pointer; the write-pointer on the processing device side is called the copy of the write pointer and the read-pointer on the processing device side is called the read pointer.
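  • For illustration, the master-copy arrangement described above could be laid out as in the following C sketch; the struct and field names are assumptions, and the storage locations follow the host main memory / device register split mentioned earlier.

```c
/* Host side: kept in the host's local main memory. */
struct host_ring_pointers {
    unsigned write_ptr;       /* master: advanced whenever a command stream is inserted */
    unsigned read_ptr_copy;   /* copy: refreshed when the device reports its read pointer */
};

/* Processing device side: kept in the device's local registers. */
struct device_ring_pointers {
    unsigned write_ptr_copy;  /* copy: refreshed when the host reports its write pointer */
    unsigned read_ptr;        /* master: advanced whenever a command stream is consumed */
};
```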
  • On this basis, after the host inserts the at least one command stream into the buffer in step 102, the method further includes the following.
  • The host updates the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer, and sends the updated pointer information of the write pointer to the processing device through the communication link.
  • There are various possible timings at which the host sends the updated pointer information of the write pointer to the processing device.
  • For example, each time the host inserts a command stream into the buffer and updates the write pointer, it may send the pointer information of the updated write pointer to the processing device. That is, every time the host updates the write pointer, it sends the updated pointer information to the processing device through the communication link.
  • With this approach, the communication link is used to send the pointer information of the write pointer only when the write pointer of the buffer is actually updated, which reduces the number of communications.
  • Alternatively, the host may send the latest pointer information of the write pointer to the processing device only after the write pointer of the buffer has been updated multiple times.
  • The number of write-pointer updates to accumulate may be preset. For example, if this number is preset to 8, then only after 8 command streams have been inserted into the buffer and the write pointer has been updated 8 times is the pointer information of the 8th updated write pointer sent to the processing device.
  • That is, the pointer information of the most recently updated write pointer is sent to the processing device over the communication link only after the write pointer of the host-side buffer has accumulated a preset number of updates.
  • This further reduces the number of communications over the communication link and lowers its communication overhead.
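  • A sketch of this threshold-batched notification is shown below, using a preset count of 8 as in the example above; pcie_write_reg() is an assumed helper for writing a device register over the link, not a real API.

```c
/* Assumed platform helper: write one 32-bit value to a device register. */
extern void pcie_write_reg(unsigned long dev_reg_addr, unsigned value);

#define WRITE_PTR_NOTIFY_THRESHOLD 8   /* preset number of updates to accumulate */

static unsigned updates_since_notify;

/* Called on the host each time the buffer's write pointer has been advanced. */
static void on_write_pointer_updated(unsigned long write_ptr_copy_reg,
                                     unsigned new_write_ptr)
{
    if (++updates_since_notify >= WRITE_PTR_NOTIFY_THRESHOLD) {
        /* One link transaction carries only the latest write-pointer value. */
        pcie_write_reg(write_ptr_copy_reg, new_write_ptr);
        updates_since_notify = 0;
    }
}
```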
  • After receiving the pointer information of the write pointer through the communication link, the processing device can update its locally stored copy of the write pointer accordingly.
  • Correspondingly, in step 202, each time the processing device reads a command stream into a stream queue, it updates its local read pointer and sends the updated pointer information of the read pointer to the host.
  • The host receives, through the communication link, the pointer information of the read pointer sent by the processing device, where the read pointer indicates the current position of the read operation on the buffer, and updates the copy of the read pointer on the host side according to this pointer information.
  • After updating the copy of the read pointer on the host side, the host can release the buffer entries holding the command streams that the processing device has already read into its stream queues, thereby freeing buffer space.
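  • The release step could look like the following sketch, which walks the host-side copy of the read pointer forward and marks the entries the device has already consumed as reusable; the names are assumptions for illustration.

```c
#define RING_ENTRIES 256u                      /* assumed ring size, as in the earlier sketch */

extern void *ring_entries[RING_ENTRIES];       /* host-side ring of Stream Buffers */
static unsigned host_read_ptr_copy;            /* host-side copy of the read pointer */

/* Release every entry the device has consumed since the last notification. */
static void release_consumed_entries(unsigned new_read_ptr_copy)
{
    while (host_read_ptr_copy != new_read_ptr_copy) {
        ring_entries[host_read_ptr_copy % RING_ENTRIES] = NULL; /* slot reusable */
        host_read_ptr_copy++;                  /* advance the local copy */
    }
}
```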
  • In this way, two sets of read and write pointers are maintained on the host side and the processing device side in a master-copy manner, and the pointers stored on the two sides are updated according to the method of the above embodiments.
  • The processing device does not need to access the host through the communication link for pointer polling; it only needs to poll its local read pointer and copy of the write pointer, from which it can determine whether there are command streams to be delivered in the buffer and how many there are.
  • The processing device may then pull one or more command streams from the buffer upon determining that there is at least one idle stream queue locally.
  • Since the processing device only needs to poll the read and write pointers stored locally, it does not need to frequently access the host to poll the buffer's read and write pointers. This greatly reduces the number of times the processing device accesses the host over the communication link, effectively eases the communication overhead between the host and the processing device, and improves the scheduling efficiency of the host.
  • For the overall command issuing flow, refer to the interactive flowchart of the command issuing method shown in FIG. 3.
  • Below, the command issuing method is described in the form of an interaction between the host and the processing device.
  • Step 301 the host generates at least one command stream according to multiple commands to be sent to the processing device for processing.
  • the host can generate at least one command stream according to multiple commands to be issued. For example, multiple commands that need to be executed in sequence can be generated as a complete command stream; or, a command that needs to be executed individually can be generated as a complete command stream.
  • Step 302 the host inserts at least one command stream into the buffer.
  • the buffer can play the role of temporarily buffering the command stream, so that when a command needs to be issued, multiple command streams temporarily buffered in the buffer can be delivered to the processing device in batches at one time.
  • Step 303 the host updates the write pointer of the buffer.
  • After the host inserts a command stream into the buffer, the corresponding write pointer of the buffer needs to be updated.
  • In this way, the host can perform successive write operations according to the continuously updated pointer information of the write pointer and keep inserting command streams into the buffer.
  • Step 304 the host sends the updated pointer information of the write pointer to the processing device.
  • Since the read and write pointers of the buffer are set up in a master-copy manner, after the host updates the write pointer of the buffer, the pointer information of the write pointer needs to be sent to the processing device so that the copy of the write pointer on the processing device side can be updated accordingly.
  • For example, each time the write pointer is updated, the pointer information of the updated write pointer may be sent to the processing device.
  • Alternatively, the pointer information of the final write pointer after multiple updates may be sent to the processing device. This further reduces the number of times the host sends pointer information to the processing device and lowers the communication overhead of the communication link between the two.
  • Step 305 the processing device updates the copy of the write pointer on the processing device side.
  • After receiving the pointer information of the write pointer sent by the host, the processing device updates its local copy of the write pointer according to the received pointer information.
  • Step 306 The processing device determines the number of command streams to be issued in the buffer according to the pointer information of the local read pointer and the copy of the write pointer.
  • In this step, the processing device can directly access its local pointer information to determine the number of command streams to be issued in the host-side buffer. Since the processing device does not need to access the pointers on the host side, the number of communications over the communication link is greatly reduced compared with polling the host-side buffer pointers, and the communication overhead of the link is lowered.
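  • As a small illustration (names assumed), the number of command streams to pull can be computed purely from the device's local pointer registers, without any access to the host over the link:

```c
/* Pointers are treated as monotonically increasing counters, so unsigned
 * subtraction yields the number of command streams still waiting in the
 * host-side buffer; the result is capped by the number of idle queues. */
static unsigned streams_to_pull(unsigned write_ptr_copy, unsigned read_ptr,
                                unsigned idle_queue_count)
{
    unsigned pending = write_ptr_copy - read_ptr;
    return pending < idle_queue_count ? pending : idle_queue_count;
}
```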
  • Step 307 The host transmits the at least one command stream in the buffer to the processing device.
  • In this step, the processing device may actively pull a certain number of command streams from the host-side buffer. For example, all command streams in the buffer can be pulled to the processing device at once. In this way, multiple command streams can be issued in a batch through one communication of the communication link, reducing the communication overhead of the link.
  • Step 308 the processing device reads at least one command stream into a local stream queue.
  • After the processing device pulls a command stream, it reads the command stream into a local stream queue, which stores the pulled command stream. The processing device can then use a command dispatcher to distribute the commands in the stream queue to different computing units for computation.
  • Step 309 Each time the processing device reads a command stream into a local stream queue, it updates its local read pointer.
  • Step 310 the processing device sends the updated pointer information of the read pointer to the host.
  • Step 311 the host updates the copy of the read pointer on the host side.
  • After the processing device has read a command stream into its local stream queue, the location in the host-side buffer that stored that command stream can be released.
  • To this end, the processing device updates its local read pointer and sends the pointer information of the updated read pointer to the host.
  • When the host receives this pointer information, it updates its local copy of the read pointer accordingly.
  • The host can then release the cache at the corresponding position in the host-side buffer according to the update of the read-pointer copy.
  • The above describes the complete implementation of the command issuing method in the form of an interaction between the host and the processing device.
  • With this method, commands are issued in the form of command streams, and issuing a single command stream delivers multiple commands, thereby reducing the communication overhead of the communication link.
  • Moreover, the method can send multiple command streams to the processing device in one communication, which further reduces the communication overhead of the communication link and improves the scheduling efficiency of the host.
  • the present disclosure provides a command issuing apparatus, and the apparatus can execute the command issuing method of any embodiment of the present disclosure.
  • The apparatus may include a command stream generation module 401, an insertion module 402, and a transmission module 403, wherein:
  • the command stream generation module 401 is configured to generate at least one command stream according to multiple commands to be sent to the processing device for processing; wherein, each of the command streams includes at least one command;
  • an inserting module 402 configured to insert the at least one command stream into a buffer
  • the transmission module 403 is configured to transmit at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • In some embodiments, when the transmission module 403 is configured to transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, it is further configured to: in the case where the buffer includes at least two command streams, transmit the at least two command streams to the processing device through one communication of the communication link.
  • the device further includes:
  • a first write pointer update module 501 configured to update the write pointer of the buffer, where the write pointer is used to indicate the current position of the write operation to the buffer;
  • the first pointer information sending module 502 is configured to send the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device can update the copy of the write pointer on the processing device side.
  • the device further includes:
  • a second write pointer update module 601, configured to update the write pointer of the buffer, where the write pointer is used to indicate the current position of the write operation to the buffer;
  • the second pointer information sending module 602 is configured to send the last updated pointer information of the write pointer to the processing device when the number of updates of the write pointer of the buffer reaches a preset number of times.
  • the device further includes:
  • a pointer information receiving module 701 configured to receive pointer information of a read pointer sent by the processing device through the communication link, where the read pointer is used to indicate the current position of the read operation on the buffer;
  • the read pointer copy update module 702 is configured to update the read pointer copy on the host side according to the pointer information of the read pointer.
  • the communication link is a PCI-Express link.
  • the present disclosure provides a processing device, and the processing device can execute the command issuing method of any embodiment of the present disclosure.
  • The processing device may include a queue memory 801 and a microprocessor 802, wherein:
  • the queue memory 801 is configured to store stream queues;
  • the microprocessor 802 is configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • In some embodiments, when the microprocessor is configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, it is further configured to: in the case where the buffer on the host side includes at least two command streams, pull the at least two command streams from the host-side buffer through one communication of the communication link.
  • In some embodiments, when the microprocessor is configured to read the pulled at least one command stream into a local stream queue of the processing device, it is further configured to: in the case where multiple command streams are pulled from the buffer on the host side, read the multiple command streams into different local stream queues of the processing device respectively. The processing device further includes a parallel scheduling module 901, which is configured to schedule the corresponding computing modules so that the command streams in the different local stream queues are executed in parallel.
  • the microprocessor is further configured to receive pointer information of the write pointer sent by the host through the communication link; and update the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
  • In some embodiments, when the microprocessor is configured to pull at least one command stream from the buffer on the host side, it is further configured to: determine the number of command streams to be issued in the buffer according to the pointer information of the processing device's local read pointer and copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pull at least one command stream from the buffer.
  • In some embodiments, when the microprocessor is configured to read the pulled at least one command stream into a local stream queue of the processing device, it is further configured to: each time one command stream is read into a local stream queue of the processing device, update the local read pointer of the processing device; and send the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
  • the processing device is an AI chip or a GPU.
  • the communication link is a PCI-Express link.
  • Since the apparatus embodiments and the processing device embodiments basically correspond to the method embodiments, reference may be made to the corresponding parts of the description of the method embodiments for related details.
  • The apparatus embodiments and processing device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of at least one embodiment of the present disclosure. Those of ordinary skill in the art can understand and implement this without creative effort.
  • The present disclosure also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the command issuing method of any embodiment of the present disclosure when executing the program.
  • The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050.
  • The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 communicate with one another within the device through the bus 1050.
  • The processor 1010 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of this specification.
  • The memory 1020 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, and the like.
  • the memory 1020 may store an operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, relevant program codes are stored in the memory 1020 and invoked by the processor 1010 for execution.
  • the input/output interface 1030 is used to connect the input/output module to realize information input and output.
  • The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1040 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices.
  • the communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
  • Bus 1050 includes a path to transfer information between the various components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
  • It should be noted that although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may only include components necessary to implement the solutions of the embodiments of the present specification, rather than all the components shown in the figures.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the command issuing method of any embodiment of the present disclosure can be implemented.
  • non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc., which is not limited in this application.
  • Embodiments of the present disclosure further provide a computer program product comprising computer-readable code; when the computer-readable code runs on a device, a processor in the device executes the command issuing method of any of the above embodiments.
  • the computer program product can be specifically implemented by hardware, software or a combination thereof.

Abstract

A command issuing method and apparatus, a processing device, a computer device, and a storage medium, the method comprising: on the basis of a plurality of commands to be issued to a processing device for processing, generating at least one command stream, each command stream comprising at least one command (101); inserting the at least one command stream into a buffer (102); and, by means of the communication link between a host computer and the processing device, transmitting the at least one command stream in the buffer to the processing device (103). The communication overhead between the host computer and the processing device is reduced, and the scheduling efficiency of the host computer is increased.

Description

Command issuing method, apparatus, processing device, computer device and storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application claims priority to Chinese patent application No. 202011459860.3, filed on December 11, 2020 and entitled "Command issuing method, apparatus, processing device, computer device and storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, and in particular, to a command issuing method, apparatus, processing device, computer device, and storage medium.
BACKGROUND
In the field of deep learning, artificial intelligence (AI) chips, like graphics processing units (GPUs), are usually used as accelerator cards for the host/CPU. The AI chip or GPU can be called a processing device, which is scheduled and controlled by the host.
With the widespread use of AI, the size of deep learning models and the amount of data keep growing. When the host schedules and controls the processing device, it not only needs to transmit a large amount of data, but also needs to issue operation commands frequently. As a result, the communication link between the host and the processing device often hits a communication bottleneck, and the communication overhead of the link is too large, resulting in low host scheduling efficiency.
SUMMARY OF THE INVENTION
The present disclosure provides a command issuing method, apparatus, processing device, computer device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, a command issuing method is provided, the method comprising: generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; inserting the at least one command stream into a buffer; and transmitting the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
In some optional embodiments, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device includes: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating a write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and sending the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and, when the number of updates of the write pointer of the buffer reaches a preset number, sending the pointer information of the most recently updated write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
In some optional embodiments, the method further includes: receiving pointer information of a read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of the read operation on the buffer; and updating the copy of the read pointer on the host side according to the pointer information of the read pointer.
In some optional embodiments, the communication link is a PCI-Express link, a high-speed serial computer expansion bus standard.
According to a second aspect of the embodiments of the present disclosure, another command issuing method is provided, the method comprising: pulling at least one command stream from a buffer on the host side through a communication link between a processing device and a host; and reading the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
In some optional embodiments, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: in the case where multiple command streams are pulled from the buffer on the host side, reading the multiple command streams into different local stream queues of the processing device respectively; the method further includes: executing the command streams in the different local stream queues in parallel.
In some optional embodiments, the method further includes: receiving pointer information of the write pointer sent by the host through the communication link; and updating the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
In some optional embodiments, pulling at least one command stream from the buffer on the host side includes: determining the number of command streams to be issued in the buffer according to the pointer information of the processing device's local read pointer and copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: each time one command stream is read into a local stream queue of the processing device, updating the local read pointer of the processing device; and sending the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
In some optional embodiments, the communication link is a PCI-Express link.
According to a third aspect of the embodiments of the present disclosure, a command issuing apparatus is provided, the apparatus comprising: a command stream generation module configured to generate at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; an insertion module configured to insert the at least one command stream into a buffer; and a transmission module configured to transmit the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
According to a fourth aspect of the embodiments of the present disclosure, a processing device is provided, the processing device comprising: a queue memory configured to store stream queues; and a microprocessor configured to pull at least one command stream from a buffer on the host side through a communication link between the processing device and the host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
According to a fifth aspect of the embodiments of the present disclosure, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the command issuing method according to any one of the first aspect or the second aspect when executing the program.
According to a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
根据本公开实施例的第七方面,提供一种计算机程序产品,包括计算机程序,所述程序被处理器执行时实现第一方面或第二方面中任一所述的命令下发方法。According to a seventh aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program, which implements the command issuing method according to any one of the first aspect or the second aspect when the program is executed by a processor.
本公开实施例中,可以根据待下发到处理设备的多个命令,将多个命令生成一个命令流,以命令流的方式向处理设备下发命令。这种命令下发方式中,一次命令流的下发可以实现多个命令的下发,通过通信链路的一次通信即可下发多个命令。有效减少了主机与处理设备的通信次数,减轻了主机与处理设备之间的通信开销,提高了主机的调度效率。In the embodiment of the present disclosure, a command stream may be generated from the plurality of commands according to the commands to be issued to the processing device, and the command may be issued to the processing device in the form of a command stream. In this command delivery method, multiple commands can be delivered by one command stream delivery, and multiple commands can be delivered through one communication of the communication link. The communication frequency between the host and the processing device is effectively reduced, the communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.
Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Fig. 1 is a flowchart of a command issuing method according to an exemplary embodiment;
Fig. 2 is a flowchart of another command issuing method according to an exemplary embodiment;
Fig. 3 is an interaction flowchart of a command issuing method according to an exemplary embodiment;
Fig. 4 is a schematic diagram of a command issuing apparatus according to an exemplary embodiment;
Fig. 5 is a schematic diagram of another command issuing apparatus according to an exemplary embodiment;
Fig. 6 is a schematic diagram of yet another command issuing apparatus according to an exemplary embodiment;
Fig. 7 is a schematic diagram of yet another command issuing apparatus according to an exemplary embodiment;
Fig. 8 is a schematic diagram of a processing device according to an exemplary embodiment;
Fig. 9 is a schematic diagram of another processing device according to an exemplary embodiment;
Fig. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment.
Detailed Description of the Embodiments
Exemplary embodiments will be described in detail here, with examples illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless indicated otherwise. The implementations described in the following exemplary embodiments do not represent all solutions consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used here may be interpreted as "when", "while", or "in response to determining".
With the widespread use of artificial intelligence, deep learning models and data volumes keep growing. When the host schedules and controls the processing device, it needs to issue operation commands frequently, and the communication link between the host and the processing device (such as a PCI-Express link) has to carry a large amount of communication data and model code, so the communication overhead of the link becomes excessive and scheduling efficiency is low.
Based on the above, the present disclosure provides a command issuing method: the host generates at least one command stream according to multiple commands to be issued to the processing device, inserts the at least one command stream into a buffer, and transmits the command streams in the buffer to the processing device through the communication link.
By issuing commands as command streams, multiple commands can be delivered to the processing device through a single command stream, which reduces the number of communications over the communication link, lowers the communication overhead of the link between the host and the processing device, and improves the scheduling efficiency of the host.
To make the command issuing method provided by the present disclosure clearer, the execution process of the solution provided by the present disclosure is described in detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, Fig. 1 is a flowchart of a command issuing method according to an embodiment of the present disclosure. The method is applied to a host. As shown in Fig. 1, the process includes:
Step 101: generate at least one command stream according to multiple commands to be issued to the processing device for processing, where each command stream includes at least one command.
In this embodiment, the commands used to generate the command streams are commands generated by the host that need to be issued to the processing device for processing.
For example, they may be multiple commands produced by multiple processes at the application layer.
Suppose the application layer includes a payment application "Pay ×" and a photo-beautification application "Meitu ××". While these two applications are in use, the commands produced by the "Pay ×" process or the "Meitu ××" process need to be issued to the processing device (such as an AI chip) for processing. The commands produced by the "Pay ×" process or the "Meitu ××" process are the commands to be issued to the processing device for processing.
For example, in the deep learning field, when an AI chip serves as the processing device, the commands in a command stream may include various operators (kernels) of a deep learning model, data movement (memcpy) commands, and event synchronization commands.
In this step, at least one command stream can be generated from the multiple commands to be issued. A command stream may contain one command or multiple commands. Commands within the same command stream must be executed in order, while different command streams can be executed in parallel.
The command stream here is similar to a "stream" in CUDA (Compute Unified Device Architecture, the computing platform introduced by the graphics card vendor NVIDIA).
For example, suppose the commands to be issued include: command 1, command 2, command 3, command A, command B, and command C.
If command 1, command 2, and command 3 need to be executed in sequence, this step can generate a command stream 1 from them, where command stream 1 includes command 1, command 2, and command 3.
Similarly, if command A and command B need to be executed in sequence, this step can generate a command stream A from them, where command stream A includes command A and command B.
If command C is unrelated to the execution of the other commands, this step can generate a command stream C from command C alone.
With command stream 1, command stream A, and command stream C generated, the execution of the individual command streams does not affect one another. For example, the three command streams can be executed in parallel.
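As a concrete illustration of the grouping above, the following is a minimal sketch in C. The types command_t and command_stream_t, the depends_on field, and the grouping heuristic are assumptions made for illustration only; the disclosure does not prescribe a particular data layout or grouping rule.

```c
/* A minimal sketch: commands that must follow one another go into the same
 * stream, independent commands open new streams. All names are illustrative
 * assumptions, not the driver interface described in this disclosure. */
#include <stddef.h>

#define MAX_CMDS_PER_STREAM 16

typedef struct {
    int id;           /* e.g. a kernel, memcpy or event-sync command */
    int depends_on;   /* id of the command that must run just before, or -1 */
} command_t;

typedef struct {
    command_t cmds[MAX_CMDS_PER_STREAM];  /* executed strictly in order */
    size_t    count;
} command_stream_t;

size_t build_streams(const command_t *pending, size_t n,
                     command_stream_t *streams, size_t max_streams)
{
    size_t n_streams = 0;
    for (size_t i = 0; i < n; i++) {
        size_t target = n_streams;                /* default: open a new stream */
        for (size_t s = 0; s < n_streams; s++) {  /* must it follow an existing one? */
            command_stream_t *st = &streams[s];
            if (st->count > 0 && st->count < MAX_CMDS_PER_STREAM &&
                st->cmds[st->count - 1].id == pending[i].depends_on) {
                target = s;
                break;
            }
        }
        if (target == n_streams) {                /* start a new, independent stream */
            if (n_streams == max_streams)
                break;                            /* no stream slot left */
            streams[n_streams++].count = 0;
        }
        streams[target].cmds[streams[target].count++] = pending[i];
    }
    return n_streams;   /* {1,2,3}, {A,B} and {C} would become three streams */
}
```

With the six commands of the example above, this sketch would yield three streams, which the processing device could then execute independently of one another.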
Step 102: insert the at least one command stream into a buffer.
Exemplarily, the buffer in this embodiment may be a ring buffer. It can be understood that any buffer able to meet the requirements of this step can serve as the buffer of this embodiment; it is not limited to a ring buffer.
A ring buffer is a typical "producer-consumer" model. In this embodiment, the host is the producer and can insert command streams into the ring buffer; the processing device is the consumer and can pull command streams from the ring buffer down into its local stream queues (stream queue).
Taking a ring buffer as an example, this step may insert one or more command streams into the ring buffer.
For example, the driver can create a separate command stream buffer (stream buffer) for each command stream and insert each stream buffer into the ring buffer under a lock. Each stream buffer corresponds to one entry of the ring buffer.
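A minimal sketch of this producer-side insertion under a lock is given below, assuming a fixed-size ring whose entries each hold one stream buffer. The ring_buffer_t layout, the entry count, and the use of a pthread mutex are illustrative assumptions; the lock would need to be initialized (for example with pthread_mutex_init) before use.

```c
/* Host-side "producer": insert one stream buffer into the ring under a lock. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 64u   /* assumed ring size */

typedef struct {
    void    *entries[RING_ENTRIES];  /* each entry holds one stream buffer */
    uint32_t write_ptr;              /* monotonically increasing write position */
    uint32_t read_ptr;               /* monotonically increasing read position  */
    pthread_mutex_t lock;
} ring_buffer_t;

/* Returns false when the ring is full; the caller may retry later. */
bool ring_insert_stream(ring_buffer_t *rb, void *stream_buffer)
{
    bool ok = false;
    pthread_mutex_lock(&rb->lock);
    if (rb->write_ptr - rb->read_ptr < RING_ENTRIES) {   /* space left? */
        rb->entries[rb->write_ptr % RING_ENTRIES] = stream_buffer;
        rb->write_ptr++;                                 /* advance the write pointer */
        ok = true;
    }
    pthread_mutex_unlock(&rb->lock);
    return ok;
}
```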
Step 103: transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
Exemplarily, the communication link between the host and the processing device (such as an AI chip) in this embodiment may be PCI-Express (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard). It can be understood that, besides PCI-Express, other types of communication links may also be used between the host and the processing device; the present disclosure does not limit this.
Taking a PCI-Express link as an example, this step may transmit the command streams in the buffer to the processing device over the PCI-Express link. One PCI-Express communication may transmit one command stream to the processing device, or one PCI-Express communication may transmit multiple command streams to the processing device.
In this embodiment, the command streams in the buffer may reach the processing device either because the host actively pushes them to the processing device or because the processing device actively pulls them from the buffer. The specific way in which the command streams in the buffer are transmitted to the processing device can take many forms, and this embodiment does not limit it.
The host may, according to the number of command streams waiting in the buffer, actively push a certain number of them to the processing device, which then processes the issued command streams further. For example, once the number of command streams in the buffer reaches a preset number, the host may deliver that preset number of command streams to the processing device in one go over the PCI-Express link.
The processing device may also actively pull a certain number of command streams from the buffer.
For example, the processing device may poll the pointer information of the host-side buffer, and when there are command streams waiting in the buffer, pull one command stream at a time from the host-side ring buffer into a local stream queue over the PCI-Express link, or pull multiple command streams at once from the host-side ring buffer into local stream queues over the PCI-Express link.
In this embodiment, the host can generate at least one command stream from multiple commands to be issued to the processing device and issue commands to the processing device as command streams. Issuing one command stream delivers multiple commands, which reduces the number of communications between the host and the processing device, lowers their communication overhead, and improves the scheduling efficiency of the host.
In addition, as the field of artificial intelligence develops, the computing power of AI chips keeps climbing, even reaching 256/512 TOPS. When scheduling efficiency is low, the host cannot issue operation commands to the processing device for scheduling and control in time, the computing power of the processing device cannot be fully utilized, and computing resources are wasted.
Because the command issuing method of this embodiment improves the scheduling efficiency of the host, the computing power of the processing device can be utilized more fully.
In some optional embodiments, in step 103, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device may include: when the buffer contains at least two command streams, transmitting the at least two command streams to the processing device through one communication over the communication link.
In the above embodiment, when multiple command streams have already been inserted into the buffer, the host can transmit them to the processing device in one batch through a single communication over the link. In one possible implementation, the host transmits all the command streams in the buffer to the processing device in one batch through a single communication. In another possible implementation, the host transmits some of the command streams in the buffer (more than one) in one batch through a single communication.
In the above embodiment, by inserting multiple command streams into the buffer, the host can transmit multiple command streams to the processing device at once over the communication link, further reducing the number of communications between the host and the processing device, lowering their communication overhead, and improving the scheduling efficiency of the host.
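The sketch below illustrates the batching idea on the host side, building on the ring_buffer_t from the earlier sketch. The helpers pcie_write_block() and describe_stream(), the stream_desc_t type, and the choice to transfer descriptors rather than the stream contents themselves are all assumptions standing in for whatever single PCI-Express transfer (for example one DMA) a real driver would issue; none of them is an API from this disclosure.

```c
/* Push the descriptors of every pending command stream in one link transaction. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t dma_addr;   /* device-visible address of one stream buffer (assumed) */
    uint32_t length;     /* size of that stream buffer in bytes                   */
} stream_desc_t;

extern int pcie_write_block(uint64_t device_addr, const void *src, size_t len);
extern stream_desc_t describe_stream(void *stream_buffer);   /* assumed helper */

int flush_pending_streams(ring_buffer_t *rb, uint64_t device_queue_addr)
{
    stream_desc_t batch[RING_ENTRIES];
    size_t n = 0;

    pthread_mutex_lock(&rb->lock);
    while (n < rb->write_ptr - rb->read_ptr && n < RING_ENTRIES) {
        batch[n] = describe_stream(rb->entries[(rb->read_ptr + n) % RING_ENTRIES]);
        n++;
    }
    pthread_mutex_unlock(&rb->lock);

    if (n == 0)
        return 0;
    /* one communication carries the descriptors of all n pending streams */
    return pcie_write_block(device_queue_addr, batch, n * sizeof(stream_desc_t));
}
```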
Referring to Fig. 2, Fig. 2 is a flowchart of another command issuing method according to an embodiment of the present disclosure. The method is applied to a processing device. As shown in Fig. 2, the process includes:
Step 201: pull at least one command stream from a buffer on the host side through the communication link between the processing device and the host.
The host has generated command streams from the commands to be issued to the processing device for processing, and has cached the command streams in the buffer. For example, the host may cache the commands to be issued, in the form of command streams, in a ring buffer.
The processing device can pull one command stream from the buffer at a time over the communication link with the host, or pull multiple command streams in one batch. The number of command streams the processing device pulls from the buffer at one time needs to be determined jointly by the number of command streams waiting in the buffer and the number of idle stream queues local to the processing device.
For example, when there is one command stream waiting in the buffer, the processing device may determine that at least one idle stream queue exists locally. The processing device can then pull that command stream from the buffer and read it into the corresponding idle stream queue, thereby completing the delivery of the multiple commands contained in that command stream from the host side to the processing device side.
For example, when there are multiple command streams waiting in the buffer, the processing device may determine that enough idle stream queues exist locally. The processing device can then pull those command streams from the buffer in one batch and read them into different stream queues, thereby completing the delivery of the multiple commands contained in those command streams from the host side to the processing device side.
In this embodiment, the processing device needs to determine the number of command streams waiting in the host-side buffer. Only when there are command streams waiting in the buffer and idle stream queues exist locally on the processing device does the processing device pull a certain number of command streams from the buffer.
When determining the number of command streams waiting in the host-side buffer, the processing device may poll the read and write pointers of the host-side buffer over the communication link with the host, and judge from those pointers whether there are command streams waiting in the buffer.
Step 202: read the pulled at least one command stream into a local stream queue, where the stream queue is used to store command streams to be executed.
In this embodiment, the processing device may include stream queues for storing command streams to be executed. The processing device can read the command streams pulled from the buffer into different stream queues. The processing device can then use a command dispatcher to distribute the commands in the stream queues to different computing units for computation.
In some optional embodiments, when multiple command streams are pulled from the buffer, the processing device can read them into different local stream queues and execute the command streams in the different stream queues in parallel, which improves the processing device's command execution efficiency.
In this embodiment, the processing device can pull a command stream from the host-side buffer in one go over the communication link with the host. Pulling commands from the host side in the form of command streams means that pulling one command stream delivers multiple commands, which reduces the number of communications between the host and the processing device, lowers the communication overhead of the link between them, and improves the scheduling efficiency of the host. The computing power of the processing device can also be utilized more fully.
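The following is a minimal sketch of the consumer-side step of placing pulled command streams into idle local stream queues so that they can run in parallel. The stream_queue_t type, the queue count, and dispatch_to_compute_units() are assumptions for illustration; the disclosure does not specify how the dispatcher or the queues are implemented.

```c
/* Assign each pulled command stream to an idle local stream queue. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_STREAM_QUEUES 8   /* assumed number of local stream queues */

typedef struct {
    bool  busy;
    void *stream;             /* the command stream currently queued */
} stream_queue_t;

/* Assumed dispatcher: hands the queued commands to the computing units. */
extern void dispatch_to_compute_units(stream_queue_t *q);

/* Returns how many of the pulled streams were placed into a queue. */
size_t enqueue_pulled_streams(stream_queue_t queues[NUM_STREAM_QUEUES],
                              void **pulled, size_t n_pulled)
{
    size_t placed = 0;
    for (size_t q = 0; q < NUM_STREAM_QUEUES && placed < n_pulled; q++) {
        if (!queues[q].busy) {
            queues[q].busy   = true;
            queues[q].stream = pulled[placed++];
            dispatch_to_compute_units(&queues[q]);  /* queues proceed independently */
        }
    }
    return placed;
}
```

Because each queue is handed to the dispatcher as soon as it is filled, streams placed into different queues can execute in parallel, matching the behavior described above.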
In some optional embodiments, in step 201, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: when the host-side buffer contains at least two command streams, pulling the at least two command streams from the host-side buffer through one communication over the communication link.
In the above embodiment, when the host-side buffer contains multiple command streams that can be pulled, the processing device can pull them in one batch from the host-side buffer through a single communication over the link. In one possible implementation, the processing device pulls all the command streams in the host-side buffer in one batch through a single communication. In another possible implementation, the processing device pulls some of the command streams in the host-side buffer (more than one) in one batch through a single communication.
In the above embodiment, the processing device can pull multiple command streams from the host-side buffer at once, further reducing the number of communications between the host and the processing device. This greatly lowers the communication overhead of the link between the host and the processing device and improves the scheduling efficiency of the host. The computing power of the processing device can also be utilized more fully.
In step 201, the processing device needs to determine the number of command streams waiting in the host-side buffer. Only when there are command streams waiting in the buffer and idle stream queues exist locally does the processing device pull command streams from the buffer.
To determine the number of command streams waiting in the buffer, the processing device needs the buffer's read and write pointers. In the related way of obtaining them, the processing device has to poll the read and write pointers of the host-side buffer over the communication link with the host. This "polling" approach requires the processing device to access the host heavily over the link, which inevitably adds communication overhead to the link.
For this reason, the present disclosure provides a new pointer acquisition scheme that lets the processing device obtain the read and write pointers of the host-side buffer with fewer communications.
Corresponding to the read and write pointers of the host-side buffer, matching read and write pointers are set locally on the processing device side, and the pointers on both sides are updated synchronously according to certain rules.
For example, the read and write pointers of the host-side buffer can be stored in the host's local main memory, while the corresponding read and write pointers on the processing device side are stored in local registers of the processing device, and the pointers stored on both sides are synchronized according to certain rules.
In this scheme, because read and write pointers are also kept on the processing device side, the processing device does not need to access the host side; it only needs to poll its local read and write pointers to determine, from them, the number of command streams waiting in the host-side buffer, which greatly reduces the number of communications over the link.
The way for the processing device to obtain the buffer's read and write pointers given above is only a description of the principle. The new pointer acquisition scheme provided by the present disclosure is described in detail below in combination with the command issuing method provided by the present disclosure.
In this embodiment, the read and write pointers of the buffer can be arranged in a master/copy fashion. On the host side, the write pointer (write-pointer) is the master and the read pointer (read-pointer) is a copy; on the processing device side, the write pointer is a copy and the read pointer is the master.
To distinguish the pointers on the two sides conveniently, the write-pointer on the host side is called the write pointer and the read-pointer there is called the read pointer copy; the write-pointer on the processing device side is called the write pointer copy and the read-pointer there is called the read pointer.
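A minimal sketch of this master/copy layout is shown below. The struct and field names are assumptions chosen to mirror the naming convention just described; where each struct physically lives (host main memory versus a device register block) follows the example given above.

```c
/* Master/copy pointer layout, names assumed for illustration. */
#include <stdint.h>

typedef struct {               /* kept in host main memory */
    uint32_t write_ptr;        /* master: the host advances it on each insert   */
    uint32_t read_ptr_copy;    /* copy:   updated from the processing device    */
} host_pointers_t;

typedef struct {               /* kept in device-local registers */
    uint32_t write_ptr_copy;   /* copy:   updated from the host                 */
    uint32_t read_ptr;         /* master: the device advances it on each read   */
} device_pointers_t;
```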
In step 102, after the host inserts the at least one command stream into the buffer, the method further includes:
the host updates the write pointer of the buffer, where the write pointer indicates the current position of write operations on the buffer, and sends pointer information of the updated write pointer to the processing device through the communication link.
The timing at which the host sends the updated write pointer information to the processing device can vary.
For example, each time the host inserts a command stream into the buffer and updates the write pointer, it sends the pointer information of that update to the processing device. That is, every time the host updates the write pointer, it sends the updated pointer information to the processing device once over the communication link.
This way of updating the write pointer keeps the write pointers on both sides synchronized in real time, so that the processing device can obtain the latest state of the host-side buffer's write pointer more promptly. Compared with having the processing device poll the write pointer on the host side, this approach uses the link to send pointer information only when the buffer's write pointer is actually updated, which reduces the number of communications.
In one possible implementation, the host may send the updated write pointer information to the processing device only after the buffer's write pointer has been updated several times, sending the latest pointer information at that point.
In the above implementation, the number of write pointer updates can be preset. For example, if the preset number of updates is 8, the pointer information of the eighth update is sent to the processing device only after 8 command streams have been inserted into the buffer and the write pointer has been updated 8 times.
With this way of updating the write pointer, the link is used to send the most recently updated pointer information to the processing device only after the host-side buffer's write pointer has accumulated several updates, which further reduces the number of communications over the link and lowers its communication overhead.
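The batched-update variant can be sketched as follows. The preset count, the wptr_doorbell_t bookkeeping, and pcie_write_u32() (standing in for whatever single link transaction carries the pointer information) are all assumptions for illustration.

```c
/* Forward the write pointer to the device only every WPTR_SYNC_PERIOD updates. */
#include <stdint.h>

#define WPTR_SYNC_PERIOD 8          /* assumed preset number of updates */

extern int pcie_write_u32(uint64_t device_reg_addr, uint32_t value);

typedef struct {
    uint32_t write_ptr;             /* host-side master write pointer          */
    uint32_t updates_since_sync;    /* updates since the copy was last synced  */
    uint64_t device_wptr_reg;       /* where the device keeps its pointer copy */
} wptr_doorbell_t;

/* Called once each time a command stream has been inserted into the buffer. */
void on_stream_inserted(wptr_doorbell_t *d)
{
    d->write_ptr++;                                        /* update the master */
    if (++d->updates_since_sync >= WPTR_SYNC_PERIOD) {
        pcie_write_u32(d->device_wptr_reg, d->write_ptr);  /* one communication */
        d->updates_since_sync = 0;
    }
}
```

Sending the pointer every update instead corresponds to setting the period to 1; the trade-off is freshness of the device-side copy versus link traffic.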
After receiving the write pointer information over the communication link, the processing device can update the corresponding write pointer copy stored locally according to that pointer information.
In step 202, each time the processing device has read one command stream into a stream queue, it updates its local read pointer and sends the updated pointer information of the read pointer to the host.
The host receives the read pointer information sent by the processing device through the communication link, where the read pointer indicates the current position of read operations on the buffer, and updates the read pointer copy on the host side according to that pointer information.
After updating the host-side read pointer copy according to the read pointer information sent by the processing device, the host can, based on the read pointer copy, release the command streams that the processing device has already read into its stream queues, thereby freeing buffer space.
As described above, two sets of read and write pointers are kept on the host side and the processing device side in a master/copy fashion, and the pointers stored on both sides can be updated in the manner of the above embodiments.
In this way, the processing device does not need to access the host over the communication link to poll pointers; it only needs to poll its local read pointer and write pointer copy, and based on them it can determine whether there are command streams waiting in the buffer and how many.
Thus, once it determines that at least one idle stream queue exists locally, the processing device can pull one or more command streams from the buffer. When pulling multiple command streams, it reads them into different stream queues for processing.
Because the processing device only needs to poll the locally stored read and write pointers and does not need to access the host frequently to poll the buffer's pointers, the number of times the processing device accesses the host over the communication link is greatly reduced, which effectively relieves the communication overhead between the host and the processing device and improves the scheduling efficiency of the host.
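A minimal sketch of this purely local check is given below. It assumes monotonically increasing 32-bit pointers (so the unsigned difference is the number of entries written but not yet read) and redeclares the same device_pointers_t layout sketched earlier; the function names are illustrative.

```c
/* Decide whether to pull, using only device-local pointer state. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {               /* same layout as sketched earlier */
    uint32_t write_ptr_copy;   /* mirrored from the host          */
    uint32_t read_ptr;         /* master, owned by the device     */
} device_pointers_t;

/* Number of command streams waiting in the host-side buffer. */
static inline uint32_t pending_streams(const device_pointers_t *p)
{
    return p->write_ptr_copy - p->read_ptr;
}

/* Pull only when there is work waiting and an idle local stream queue exists. */
static inline bool should_pull(const device_pointers_t *p, unsigned idle_queues)
{
    return pending_streams(p) > 0 && idle_queues > 0;
}
```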
Refer to the interaction flowchart of the command issuing method shown in Fig. 3. In the following embodiments, the command issuing method is described in the form of interaction between the host and the processing device.
Step 301: the host generates at least one command stream according to multiple commands to be issued to the processing device for processing.
The host can generate at least one command stream according to the multiple commands to be issued. For example, multiple commands that need to be executed in sequence can be formed into one complete command stream; or a single command that needs to be executed on its own can be formed into one complete command stream.
Step 302: the host inserts the at least one command stream into the buffer.
After generating the command streams, the host needs to insert them into the buffer for caching. The buffer serves as temporary storage for command streams, so that when commands need to be issued, the multiple command streams temporarily cached in the buffer can be delivered to the processing device in one batch.
Step 303: the host updates the write pointer of the buffer.
After the host inserts a command stream into the buffer, it needs to update the buffer's write pointer accordingly. Based on the continually updated write pointer information, the host can perform multiple write operations and insert command streams into the buffer.
Step 304: the host sends the updated write pointer information to the processing device.
In the embodiments of the present disclosure, because the buffer's read and write pointers are arranged in a master/copy fashion, after the host side updates the buffer's write pointer it needs to send the write pointer information to the processing device so that the write pointer copy on the processing device side is updated accordingly.
In one possible implementation, the host may send the updated write pointer information to the processing device after every update of the write pointer. In another possible implementation, the host may update the write pointer several times and then send the final pointer information to the processing device, which further reduces the number of times the host sends pointer information to the processing device and lowers the communication overhead of the link between them.
Step 305: the processing device updates the write pointer copy on the processing device side.
After receiving the write pointer information sent by the host, the processing device needs to update its local write pointer copy accordingly based on the received pointer information.
Step 306: the processing device determines, from the pointer information of its local read pointer and write pointer copy, the number of command streams waiting in the buffer.
In the embodiments of the present disclosure, because the buffer's read and write pointers are arranged in a master/copy fashion and the pointers on both sides are updated synchronously according to certain rules, the processing device can determine the number of command streams waiting in the host-side buffer simply by accessing its local pointer information. Since the processing device does not need to access the pointers on the host side, the number of communications over the link is greatly reduced compared with polling the host-side buffer's pointers, which lowers the link's communication overhead.
Step 307: the host transmits the at least one command stream in the buffer to the processing device.
In one possible implementation, after determining the number of command streams waiting in the host-side buffer, the processing device may actively pull a certain number of them from the host-side buffer. For example, all the command streams in the buffer can be pulled to the processing device at once. In this way, a single communication over the link delivers multiple command streams in one batch, reducing the link's communication overhead.
Step 308: the processing device reads the at least one command stream into a local stream queue.
After pulling a command stream, the processing device needs to read it into a local stream queue to store the pulled command stream. The processing device can then use the command dispatcher to distribute the commands in the stream queues to different computing units for computation.
Step 309: each time the processing device has read one command stream into a local stream queue, it updates its local read pointer.
Step 310: the processing device sends the updated read pointer information to the host.
Step 311: the host updates the read pointer copy on the host side.
After the processing device has read the pulled command stream into a local stream queue, the cache location in the host-side buffer that held the pulled command stream can be released. In the embodiments of the present disclosure, each time a command stream has been read into a local stream queue, the local read pointer is updated and its pointer information is sent to the host. On receiving the pointer information, the host updates its local read pointer copy accordingly, and based on this update it can release the corresponding location in the host-side buffer.
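Steps 309 through 311 can be sketched as a small read-pointer feedback path, shown below. pcie_write_u32() and free_entry() are assumed helpers standing in for one link transaction and for the host's buffer-release action respectively; the calling context and argument names are illustrative.

```c
/* Read-pointer feedback: device reports each read, host frees consumed entries. */
#include <stdint.h>

extern int  pcie_write_u32(uint64_t host_reg_addr, uint32_t value);
extern void free_entry(uint32_t index);   /* releases one ring-buffer slot (assumed) */

/* Device side: called once per command stream read into a local stream queue. */
void device_on_stream_read(uint32_t *read_ptr, uint64_t host_rptr_copy_addr)
{
    (*read_ptr)++;                                   /* advance the master read pointer */
    pcie_write_u32(host_rptr_copy_addr, *read_ptr);  /* one link transaction to the host */
}

/* Host side: called when new read-pointer information arrives from the device. */
void host_on_read_ptr_update(uint32_t *read_ptr_copy, uint32_t new_value,
                             uint32_t ring_entries)
{
    while (*read_ptr_copy != new_value) {            /* free each consumed entry */
        free_entry(*read_ptr_copy % ring_entries);
        (*read_ptr_copy)++;
    }
}
```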
In the above embodiments, the implementation of the command issuing method has been fully described through the interaction between the host and the processing device. With this method, commands can be issued in the form of command streams: issuing one command stream delivers multiple commands, which reduces the communication overhead of the link. In addition, the method can deliver multiple command streams to the processing device at the same time in a single communication, further reducing the link's communication overhead and improving the scheduling efficiency of the host.
As shown in Fig. 4, the present disclosure provides a command issuing apparatus, which can perform the command issuing method of any embodiment of the present disclosure. The apparatus may include a command stream generation module 401, an insertion module 402, and a transmission module 403, where:
the command stream generation module 401 is configured to generate at least one command stream according to multiple commands to be issued to a processing device for processing, where each command stream includes at least one command;
the insertion module 402 is configured to insert the at least one command stream into a buffer;
the transmission module 403 is configured to transmit the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
Optionally, when transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, the transmission module 403 is further configured to: when the buffer contains at least two command streams, transmit the at least two command streams to the processing device through one communication over the communication link.
Optionally, as shown in Fig. 5, the apparatus further includes:
a first write pointer update module 501, configured to update the write pointer of the buffer, where the write pointer indicates the current position of write operations on the buffer;
a first pointer information sending module 502, configured to send pointer information of the updated write pointer to the processing device through the communication link, so that the processing device updates the write pointer copy on the processing device side.
Optionally, as shown in Fig. 6, the apparatus further includes:
a second write pointer update module 601, configured to update the write pointer of the buffer, where the write pointer indicates the current position of write operations on the buffer;
a second pointer information sending module 602, configured to send, when the number of write pointer updates of the buffer reaches a preset number, pointer information of the most recently updated write pointer to the processing device.
Optionally, as shown in Fig. 7, the apparatus further includes:
a pointer information receiving module 701, configured to receive pointer information of the read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of read operations on the buffer;
a read pointer copy update module 702, configured to update the read pointer copy on the host side according to the pointer information of the read pointer.
Optionally, the communication link is a PCI-Express link.
As shown in Fig. 8, the present disclosure provides a processing device, which can perform the command issuing method of any embodiment of the present disclosure. The processing device may include a queue memory 801 and a microprocessor 802, where:
the queue memory 801 is configured to store stream queues;
the microprocessor 802 is configured to pull at least one command stream from a buffer on the host side through the communication link between the processing device and a host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
Optionally, when pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host, the microprocessor is further configured to: when the host-side buffer contains at least two command streams, pull the at least two command streams from the host-side buffer through one communication over the communication link.
Optionally, as shown in Fig. 9, when reading the pulled at least one command stream into a local stream queue of the processing device, the microprocessor is further configured to: when multiple command streams are pulled from the host-side buffer, read the multiple command streams into different local stream queues of the processing device. The processing device further includes a parallel scheduling module 901, configured to schedule the corresponding computing modules in parallel so as to execute the command streams in the different local stream queues in parallel.
Optionally, the microprocessor is further configured to receive pointer information of the write pointer sent by the host through the communication link, and to update the write pointer copy on the processing device side according to the pointer information of the write pointer.
Optionally, when pulling at least one command stream from the buffer on the host side, the microprocessor is further configured to: determine the number of command streams to be issued in the buffer according to the pointer information of the local read pointer and write pointer copy of the processing device; and, when the buffer contains at least one command stream to be issued, pull at least one command stream from the buffer.
Optionally, when reading the pulled at least one command stream into a local stream queue of the processing device, the microprocessor is further configured to: update the local read pointer of the processing device each time one command stream has been read into a local stream queue; and send pointer information of the updated read pointer to the host, so that the host updates the read pointer copy on the host side.
Optionally, the processing device is an AI chip or a GPU.
Optionally, the communication link is a PCI-Express link.
Since the apparatus embodiments and the processing device embodiments basically correspond to the method embodiments, for relevant details reference may be made to the descriptions of the method embodiments. The apparatus embodiments and processing device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of at least one embodiment of the present disclosure. A person of ordinary skill in the art can understand and implement this without creative effort.
The present disclosure further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can implement the command issuing method of any embodiment of the present disclosure.
FIG. 10 shows a more specific schematic diagram of the hardware structure of a computer device provided by an embodiment of the present disclosure. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 are communicatively connected with one another within the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs so as to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1020 and invoked and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to realize communication interaction between this device and other devices. The communication module may communicate in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, WiFi, Bluetooth).
The bus 1050 includes a path for transferring information between the various components of the device (e.g., the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040).
It should be noted that although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation the device may further include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solutions of the embodiments of this specification, rather than all the components shown in the figure.
The present disclosure further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, can implement the command issuing method of any embodiment of the present disclosure.
The non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like, which is not limited in this application.
In some optional embodiments, an embodiment of the present disclosure provides a computer program product, including computer-readable code. When the computer-readable code runs on a device, a processor in the device executes the command issuing method provided by any of the above embodiments. The computer program product may be implemented by hardware, software, or a combination thereof.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
The above descriptions are merely preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (20)

  1. A command issuing method, characterized in that the method comprises:
    generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command;
    inserting the at least one command stream into a buffer; and
    transmitting the at least one command stream in the buffer to the processing device through a communication link between a host and the processing device.
  2. The method according to claim 1, wherein the transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device comprises:
    in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  3. The method according to claim 1 or 2, wherein after the inserting the at least one command stream into the buffer, the method further comprises:
    updating a write pointer of the buffer, the write pointer being used to indicate a current position of a write operation on the buffer; and
    sending pointer information of the updated write pointer to the processing device through the communication link, so that the processing device updates a write pointer copy on the processing device side.
  4. The method according to claim 1 or 2, wherein after the inserting the at least one command stream into the buffer, the method further comprises:
    updating a write pointer of the buffer, the write pointer being used to indicate a current position of a write operation on the buffer; and
    in the case where the number of updates of the write pointer of the buffer reaches a preset number, sending pointer information of the last-updated write pointer to the processing device through the communication link, so that the processing device updates a write pointer copy on the processing device side.
  5. The method according to any one of claims 1 to 4, wherein the method further comprises:
    receiving pointer information of a read pointer sent by the processing device through the communication link, the read pointer being used to indicate a current position of a read operation on the buffer; and
    updating a read pointer copy on the host side according to the pointer information of the read pointer.
  6. The method according to any one of claims 1 to 5, wherein the communication link is a high-speed serial computer expansion bus standard (PCI-Express) link.
  7. A command issuing method, characterized in that the method comprises:
    pulling at least one command stream from a buffer on a host side through a communication link between a processing device and a host; and
    reading the pulled at least one command stream into a local stream queue of the processing device, the stream queue being used to store command streams to be executed.
  8. The method according to claim 7, wherein the pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host comprises:
    in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  9. The method according to claim 7 or 8, wherein the reading the pulled at least one command stream into the local stream queue of the processing device comprises:
    in the case where a plurality of command streams are pulled from the buffer on the host side, reading the plurality of command streams into different local stream queues of the processing device respectively;
    the method further comprising: executing the command streams in the different local stream queues in parallel.
  10. The method according to any one of claims 7 to 9, wherein the method further comprises:
    receiving pointer information of a write pointer sent by the host through the communication link; and
    updating a write pointer copy on the processing device side according to the pointer information of the write pointer.
  11. The method according to any one of claims 7 to 10, wherein the pulling at least one command stream from the buffer on the host side comprises:
    determining, according to pointer information of the processing device's local read pointer and write pointer copy, the number of command streams to be issued in the buffer; and
    in the case where the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
  12. The method according to any one of claims 7 to 11, wherein the reading the pulled at least one command stream into the local stream queue of the processing device comprises:
    updating the processing device's local read pointer each time after one command stream has been read into the local stream queue of the processing device; and
    sending pointer information of the updated read pointer to the host, so that the host updates a read pointer copy on the host side.
  13. The method according to any one of claims 7 to 12, wherein the communication link is a PCI-Express link.
  14. A command issuing apparatus, characterized in that the apparatus comprises:
    a command stream generation module, configured to generate at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command;
    an insertion module, configured to insert the at least one command stream into a buffer; and
    a transmission module, configured to transmit the at least one command stream in the buffer to the processing device through a communication link between a host and the processing device.
  15. The apparatus according to claim 14, wherein the transmission module, when configured to transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, is further configured to:
    in the case where the buffer includes at least two command streams, transmit the at least two command streams to the processing device through one communication of the communication link.
  16. A processing device, characterized in that the processing device comprises:
    a queue memory, configured to store stream queues; and
    a microprocessor, configured to pull at least one command stream from a buffer on a host side through a communication link between the processing device and a host, and to read the pulled at least one command stream into a local stream queue of the processing device, the stream queue being used to store command streams to be executed.
  17. The processing device according to claim 16, wherein the microprocessor, when configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, is further configured to:
    in the case where the buffer on the host side includes at least two command streams, pull the at least two command streams from the buffer on the host side through one communication of the communication link.
  18. The processing device according to claim 16 or 17, wherein the microprocessor, when configured to read the pulled at least one command stream into the local stream queue of the processing device, is further configured to:
    in the case where a plurality of command streams are pulled from the buffer on the host side, read the plurality of command streams into different local stream queues of the processing device respectively;
    the processing device further comprising:
    a parallel scheduling module, configured to schedule corresponding computing modules in parallel so as to execute the command streams in the different local stream queues in parallel.
  19. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 6, or implements the method according to any one of claims 7 to 13.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6, or implements the method according to any one of claims 7 to 13.
PCT/CN2021/102943 2020-12-11 2021-06-29 Command issuing method and apparatus, processing device, computer device, and storage medium WO2022121287A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011459860.3A CN114626541A (en) 2020-12-11 2020-12-11 Command issuing method, command issuing device, processing equipment, computer equipment and storage medium
CN202011459860.3 2020-12-11

Publications (1)

Publication Number Publication Date
WO2022121287A1 true WO2022121287A1 (en) 2022-06-16

Family

ID=81895512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102943 WO2022121287A1 (en) 2020-12-11 2021-06-29 Command issuing method and apparatus, processing device, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN114626541A (en)
WO (1) WO2022121287A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1495621A (en) * 2002-06-24 2004-05-12 Parallel input/output data transmission controller
CN107209665A (en) * 2015-01-07 2017-09-26 美光科技公司 Produce and perform controlling stream
CN111124993A (en) * 2018-10-31 2020-05-08 伊姆西Ip控股有限责任公司 Method, apparatus and program product for reducing cache data mirroring latency during I/O processing
CN111143234A (en) * 2018-11-02 2020-05-12 三星电子株式会社 Storage device, system including such storage device and method of operating the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OASIS-OPEN.ORG: "Virtual I/O Device (VIRTIO) Version 1.1", 20 December 2018 (2018-12-20), pages 1 - 118, XP055800181, Retrieved from the Internet <URL:https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html> [retrieved on 20210429] *

Also Published As

Publication number Publication date
CN114626541A (en) 2022-06-14

Similar Documents

Publication Publication Date Title
KR102245247B1 (en) GPU remote communication using triggered actions
TWI531958B (en) Mass storage virtualization for cloud computing
US7835897B2 (en) Apparatus and method for connecting hardware to a circuit simulation
JP5137434B2 (en) Data processing apparatus, distributed processing system, data processing method, and data processing program
US9418181B2 (en) Simulated input/output devices
US20180219797A1 (en) Technologies for pooling accelerator over fabric
US10540301B2 (en) Virtual host controller for a data processing system
US8448172B2 (en) Controlling parallel execution of plural simulation programs
CN104094235A (en) Multithreaded computing
US11308008B1 (en) Systems and methods for handling DPI messages outgoing from an emulator system
CN107729050A (en) Real-time system and task construction method based on LET programming models
US8468006B2 (en) Method of combined simulation of the software and hardware parts of a computer system, and associated system
WO2022121287A1 (en) Command issuing method and apparatus, processing device, computer device, and storage medium
JP2007011720A (en) System simulator, system simulation method, control program, and readable recording medium
US11151074B2 (en) Methods and apparatus to implement multiple inference compute engines
CN115168256A (en) Interrupt control method, interrupt controller, electronic device, medium, and chip
US20180011804A1 (en) Inter-Process Signaling Mechanism
CN116711279A (en) System and method for simulation and testing of multiple virtual ECUs
US8572631B2 (en) Distributed control of devices using discrete device interfaces over single shared input/output
US20120065953A1 (en) Computer-readable, non-transitory medium storing simulation program, simulation apparatus and simulation method
US11941722B2 (en) Kernel optimization and delayed execution
WO2023207829A1 (en) Device virtualization method and related device
EP3630318B1 (en) Selective acceleration of emulation
WO2023010232A1 (en) Processor and communication method
CN116933698A (en) Verification method and device for computing equipment, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21901999; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21901999; Country of ref document: EP; Kind code of ref document: A1)