WO2022121287A1 - Command issuing method and apparatus, processing device, computer device, and storage medium


Info

Publication number: WO2022121287A1
Authority: WO (WIPO, PCT)
Application number: PCT/CN2021/102943
Prior art keywords: processing device, command, buffer, stream, host
Other languages: French (fr), Chinese (zh)
Inventors: 冷祥纶, 孙海涛
Original assignee: 上海阵量智能科技有限公司
Application filed by 上海阵量智能科技有限公司

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38: Information transfer, e.g. on bus
    • G06F 13/40: Bus structure
    • G06F 13/4004: Coupling between buses
    • G06F 13/4022: Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G06F 2213/00: Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026: PCI express

Definitions

  • The present disclosure relates to the field of computer technology, and in particular, to a command issuing method, apparatus, processing device, computer device, and storage medium.
  • In the field of deep learning, artificial intelligence (AI) chips, like graphics processing units (GPUs), are usually used as accelerator cards for the host/CPU.
  • The AI chip or GPU can be called a processing device, which is scheduled and controlled by the host.
  • The present disclosure provides a command issuing method, apparatus, processing device, computer device, and storage medium.
  • According to a first aspect, a command issuing method is provided, comprising: generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; inserting the at least one command stream into a buffer; and transmitting the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
  • In some optional embodiments, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device includes: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  • In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating a write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and sending the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
  • In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and, when the number of updates of the write pointer of the buffer reaches a preset number, sending the pointer information of the most recently updated write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
  • In some optional embodiments, the method further includes: receiving pointer information of a read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of the read operation on the buffer; and updating the copy of the read pointer on the host side according to the pointer information of the read pointer.
  • In some optional embodiments, the communication link is a PCI-Express link, a high-speed serial computer expansion bus standard.
  • According to a second aspect, another command issuing method is provided, comprising: pulling at least one command stream from a buffer on the host side through a communication link between a processing device and a host; and reading the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • In some optional embodiments, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  • In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: in the case where multiple command streams are pulled from the buffer on the host side, reading the multiple command streams into different local stream queues of the processing device respectively; the method further includes: executing the command streams in the different local stream queues in parallel.
  • In some optional embodiments, the method further includes: receiving pointer information of the write pointer sent by the host through the communication link; and updating the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
  • In some optional embodiments, pulling at least one command stream from the buffer on the host side includes: determining the number of command streams to be issued in the buffer according to the pointer information of the processing device's local read pointer and copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
  • In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: each time one command stream is read into a local stream queue of the processing device, updating the local read pointer of the processing device; and sending the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
  • In some optional embodiments, the communication link is a PCI-Express link.
  • According to a third aspect, a command issuing apparatus is provided, comprising: a command stream generation module configured to generate at least one command stream according to multiple commands to be issued to a processing device for processing, wherein each command stream includes at least one command; an insertion module configured to insert the at least one command stream into a buffer; and a transmission module configured to transmit the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
  • According to a fourth aspect, a processing device is provided, comprising: a queue memory configured to store stream queues; and a microprocessor configured to pull at least one command stream from a buffer on the host side through a communication link between the processing device and the host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • According to a fifth aspect, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the command issuing method according to any one of the first aspect or the second aspect when executing the program.
  • According to a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
  • A computer program product is also provided, including a computer program that, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
  • In the embodiments of the present disclosure, at least one command stream may be generated from the multiple commands to be issued to the processing device, and the commands may be issued to the processing device in the form of command streams.
  • With this command issuing method, multiple commands can be delivered by issuing a single command stream, that is, through one communication of the communication link.
  • The communication frequency between the host and the processing device is thus effectively reduced, the communication overhead between the host and the processing device is lowered, and the scheduling efficiency of the host is improved.
  • FIG. 1 is a flowchart of a method for issuing commands according to an exemplary embodiment
  • FIG. 2 is a flowchart of another method for issuing commands according to an exemplary embodiment
  • FIG. 3 is an interactive flowchart of a method for issuing commands according to an exemplary embodiment
  • FIG. 4 is a schematic diagram of an apparatus for issuing commands according to an exemplary embodiment
  • FIG. 5 is a schematic diagram of another device for issuing commands according to an exemplary embodiment
  • FIG. 6 is a schematic diagram of another apparatus for issuing commands according to an exemplary embodiment
  • FIG. 7 is a schematic diagram of another apparatus for issuing commands according to an exemplary embodiment
  • FIG. 8 is a schematic diagram of a processing device according to an exemplary embodiment
  • FIG. 9 is a schematic diagram of another processing device according to an exemplary embodiment.
  • FIG. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure.
  • The word "if" as used herein can be interpreted as "at the time of", "when", or "in response to determining".
  • When the host schedules and controls the processing device, it needs to issue operation commands frequently, and the communication link between the host and the processing device (such as a PCI-Express link) also needs to carry a large amount of communication data and model code.
  • As a result, the communication link frequently hits a communication bottleneck: its communication overhead is too large and the scheduling efficiency is low.
  • To this end, in the embodiments of the present disclosure, a host generates at least one command stream according to multiple commands to be issued to a processing device, inserts the at least one command stream into a buffer, and transmits the command streams in the buffer to the processing device.
  • FIG. 1 is a flowchart of a method for issuing commands according to an embodiment of the present disclosure. The method is applied to the host. As shown in Figure 1, the process includes:
  • Step 101 Generate at least one command stream according to multiple commands to be sent to the processing device for processing; wherein each of the command streams includes at least one command.
  • The commands used to generate the command streams are commands that are generated by the host and need to be sent to the processing device for processing.
  • For example, they may be commands generated by multiple processes of the application layer.
  • Suppose the application layer includes an application used for payment and an application used for photo beautification.
  • The commands generated by the payment application's process or the photo-beautification application's process need to be sent to a processing device (such as an AI chip) for processing.
  • Such commands are the commands to be sent to the processing device for processing.
  • The commands in a command stream may include various operators (kernels) of a deep learning model, data movement (memcpy) commands, and event synchronization commands.
  • At least one command stream may be generated according to multiple commands to be issued.
  • a command stream may include one command, and may also include multiple commands. Commands in the same command stream need to be executed sequentially, and different command streams can be executed in parallel.
  • the command stream here is similar to the "stream” in CUDA (Compute Unified Device Architecture, which is a computing platform launched by the graphics card manufacturer NVIDIA).
  • For example, suppose the commands to be issued include command 1, command 2, command 3, command A, command B, and command C.
  • This step may generate a command stream 1 from command 1, command 2, and command 3, so that command stream 1 includes command 1, command 2, and command 3.
  • Similarly, a command stream A may be generated from command A and command B, so that command stream A includes command A and command B; and a command stream C may be generated from command C alone.
  • The execution of the respective command streams does not affect each other, so the three command streams can be executed in parallel.
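  • As a non-limiting illustration, the following C sketch shows one way such grouping could be represented in a host-side driver; the type and function names (command_t, command_stream_t, stream_append) and the capacity are hypothetical assumptions, not taken from this disclosure.

```c
#include <stddef.h>

/* Hypothetical command and command-stream layout, for illustration only. */
typedef enum { CMD_KERNEL, CMD_MEMCPY, CMD_EVENT_SYNC } cmd_type_t;

typedef struct {
    cmd_type_t type;      /* operator, data movement, or event synchronization */
    const void *payload;  /* command-specific arguments */
    size_t payload_len;
} command_t;

#define MAX_CMDS_PER_STREAM 64   /* assumed capacity */

typedef struct {
    command_t cmds[MAX_CMDS_PER_STREAM]; /* executed in order within the stream */
    size_t count;
} command_stream_t;

/* Append a command to a stream; commands inside one stream run sequentially,
 * while separate command_stream_t instances may be executed in parallel. */
static int stream_append(command_stream_t *s, command_t c)
{
    if (s->count >= MAX_CMDS_PER_STREAM)
        return -1;               /* stream full */
    s->cmds[s->count++] = c;
    return 0;
}
```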
  • Step 102 Insert the at least one command stream into a buffer.
  • The buffer in this embodiment may be a ring buffer (Ring Buffer). It can be understood that any buffer that meets the usage requirements of this step can serve as the buffer of this embodiment, which is not limited to a ring buffer.
  • A ring buffer follows a typical "producer-consumer" model.
  • The host is the producer and can insert command streams into the ring buffer;
  • the processing device is the consumer and can pull command streams down from the ring buffer into its local stream queue (Stream Queue).
  • In this step, one or more command streams may be inserted into the ring buffer.
  • For example, the driver can create a separate command stream buffer (Stream Buffer) for each command stream and insert each Stream Buffer into the Ring Buffer under a lock.
  • Each Stream Buffer corresponds to one entry of the Ring Buffer.
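  • A minimal producer-side sketch of this insertion step is given below, using a pthread mutex to stand in for the locking mentioned above; the names (ring_buffer_t, ring_insert) and the ring size are illustrative assumptions, not the actual driver implementation.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct command_stream command_stream_t;  /* one Stream Buffer (opaque here) */

#define RING_ENTRIES 256u   /* assumed ring size */

typedef struct {
    command_stream_t *entries[RING_ENTRIES]; /* each entry holds one Stream Buffer */
    unsigned write_ptr;       /* master write pointer, owned by the host */
    unsigned read_ptr_copy;   /* host-side copy of the device's read pointer */
    pthread_mutex_t lock;
} ring_buffer_t;

/* Producer: insert one command stream into the ring buffer under a lock. */
static bool ring_insert(ring_buffer_t *rb, command_stream_t *stream)
{
    bool ok = false;
    pthread_mutex_lock(&rb->lock);
    if (rb->write_ptr - rb->read_ptr_copy < RING_ENTRIES) { /* ring not full */
        rb->entries[rb->write_ptr % RING_ENTRIES] = stream;
        rb->write_ptr++;       /* advance the master write pointer */
        ok = true;
    }
    pthread_mutex_unlock(&rb->lock);
    return ok;
}
```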
  • Step 103 Transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • The communication link between the host and the processing device may be a PCI-Express (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) link. It can be understood that, in addition to PCI-Express, other types of communication links may also be used between the host and the processing device, which is not limited in the present disclosure.
  • the command stream in the buffer can be transmitted to the processing device through the PCI-Express link.
  • a command stream can be transmitted to the processing device through one communication of PCI-Express.
  • multiple commands can be streamed to the processing device through a single PCI-Express communication.
  • In the process of transmitting the command streams in the buffer to the processing device, the host may actively send the command streams to the processing device, or the processing device may actively pull the command streams from the buffer.
  • The specific manner in which the command streams in the buffer are transmitted to the processing device can likewise take various forms, which is not limited in this embodiment.
  • the host can actively send a certain number of command streams in the buffer to the processing device, and the processing device further processes the issued command streams. For example, when the number of command streams in the buffer reaches a certain preset number, the host may send a certain preset number of command streams to the processing device at one time through the PCI-Express link.
  • the processing device can also actively pull a certain number of command streams from the buffer.
  • For example, the processing device can poll the pointer information of the buffer on the host side. If there is a command stream to be issued in the buffer, the processing device can pull a command stream into its local stream queue through the PCI-Express link. Alternatively, the processing device can pull multiple command streams from the ring buffer on the host side into its local stream queues at one time through the PCI-Express link.
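  • The following C sketch illustrates such a batched pull: up to a chosen maximum number of consecutive ring entries are fetched in a single link transaction. pcie_read_block() stands in for whatever PCI-Express read/DMA primitive the platform actually provides, and all names here are assumptions for illustration (wrap-around of the ring is not handled).

```c
#include <stddef.h>

typedef struct command_stream command_stream_t;   /* opaque command stream */

/* Assumed platform helper: copy 'len' bytes from a host address over PCIe. */
extern int pcie_read_block(void *dst, unsigned long host_addr, size_t len);

/* Consumer: pull up to 'max' pending command streams in one communication.
 * 'read_slot' is the ring slot index on the host side (already wrapped). */
static unsigned pull_streams(unsigned long host_ring_base, size_t entry_size,
                             unsigned read_slot, unsigned pending,
                             command_stream_t *dst, unsigned max)
{
    unsigned n = pending < max ? pending : max;
    if (n == 0)
        return 0;
    /* A single transfer over the link fetches all n ring entries at once. */
    pcie_read_block(dst,
                    host_ring_base + (unsigned long)read_slot * entry_size,
                    (size_t)n * entry_size);
    return n;
}
```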
  • In summary, in the embodiments of the present disclosure, the host can generate at least one command stream according to multiple commands to be issued to the processing device, and issue commands to the processing device in the form of command streams.
  • Issuing one command stream realizes the delivery of multiple commands, which reduces the number of communications between the host and the processing device, lowers the communication overhead between them, and improves the scheduling efficiency of the host.
  • In recent years, the computing capability (computing power) of AI chips has kept increasing, even reaching 256/512 TOPS.
  • If the host cannot issue operation commands to the processing device in time for scheduling and control, the computing power of the processing device cannot be fully utilized and computing resources are wasted.
  • With the command issuing method of this embodiment, the computing power of the processing device can be more fully utilized.
  • In some embodiments, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device may include: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  • That is, when multiple command streams have been inserted into the buffer, the host can transmit the multiple command streams in the buffer to the processing device in one batch through a single communication of the communication link.
  • For example, the host may transmit all the command streams in the buffer to the processing device in a batch through one communication of the communication link.
  • Alternatively, the host may transmit only some of the command streams in the buffer (more than one command stream) to the processing device in a batch through one communication of the communication link.
  • the host can transmit multiple command streams to the processing device at one time through the communication link, which further reduces the number of communications between the host and the processing device.
  • the communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • FIG. 2 is a flowchart of another method for issuing commands according to an embodiment of the present disclosure.
  • The method is applied to a processing device. As shown in Figure 2, the process includes:
  • Step 201 Pull at least one command stream from a buffer on the host side through the communication link between the processing device and the host.
  • the host generates the command stream to be sent to the processing device for processing, and buffers the command stream in the buffer. For example, the host may buffer the commands to be issued in the form of a command stream in a ring buffer.
  • the processing device can pull one command stream from the buffer at a time through the communication link with the host, or pull multiple command streams in batches at one time.
  • The number of command streams that the processing device pulls from the buffer at one time needs to be determined jointly according to the number of command streams to be delivered in the buffer and the number of idle local stream queues of the processing device.
  • For example, when there is one command stream to be delivered, the processing device may determine that there is at least one idle stream queue locally, pull the command stream to be issued from the buffer, and read it into the corresponding idle stream queue. This completes the delivery of the multiple commands included in that one command stream from the host side to the processing device side.
  • Similarly, when there are multiple command streams to be delivered, the processing device may determine that there are enough idle stream queues locally, pull the multiple command streams from the buffer in one batch, and read them into different stream queues respectively. This completes the delivery of the commands included in the multiple command streams from the host side to the processing device side.
  • Before pulling, the processing device needs to determine the number of command streams to be issued in the host-side buffer. Only when there are command streams to be issued in the buffer and an idle stream queue exists locally on the processing device does the processing device pull a certain number of command streams from the buffer.
  • To determine the number of command streams to be issued in the host-side buffer, the processing device can poll the read and write pointers of the host-side buffer through the communication link with the host, and determine from these pointers whether there are command streams to be issued in the buffer.
  • Step 202 Read the pulled at least one command stream into a local stream queue, where the stream queue is used to store the command stream to be executed.
  • the processing device may include a stream queue for storing command streams to be executed.
  • The processing device can read the multiple command streams pulled from the buffer into different stream queues respectively. The processing device can then use a command dispatcher to distribute the commands in the stream queues to different computing units for computation.
  • In other words, the processing device may read the multiple command streams into different local stream queues respectively and execute the command streams in the different stream queues in parallel, which improves the efficiency with which the processing device executes commands.
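  • A sketch of this queue placement is shown below; the queue count, structure, and function names are assumptions for illustration. Streams placed in different queues can then be handed to separate compute units.

```c
#include <stddef.h>

typedef struct command_stream command_stream_t;   /* opaque command stream */

#define NUM_STREAM_QUEUES 8     /* assumed number of local stream queues */

typedef struct {
    command_stream_t *current;  /* NULL when the queue is idle */
} stream_queue_t;

static stream_queue_t stream_queues[NUM_STREAM_QUEUES];

/* Place each pulled stream into its own idle queue; command streams sitting in
 * different queues may then be executed in parallel by separate compute units. */
static unsigned enqueue_streams(command_stream_t *pulled[], unsigned n)
{
    unsigned placed = 0;
    for (unsigned q = 0; q < NUM_STREAM_QUEUES && placed < n; q++) {
        if (stream_queues[q].current == NULL)
            stream_queues[q].current = pulled[placed++];
    }
    return placed;   /* number of streams actually accepted */
}
```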
  • the processing device may pull one command stream from the buffer on the host side at a time through the communication link with the host.
  • pulling one command stream can implement the issuance of multiple commands, reducing the number of communications between the host and the processing device.
  • the communication overhead of the communication link between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • In addition, the computing power of the processing device can also be more fully utilized.
  • In some embodiments, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  • That is, when the buffer on the host side includes multiple command streams that can be pulled, the processing device can pull the multiple command streams from the host-side buffer in one batch through a single communication of the communication link.
  • For example, the processing device may pull all the command streams in the host-side buffer in a batch through one communication of the communication link.
  • Alternatively, the processing device may pull only some of the command streams (more than one command stream) from the host-side buffer in a batch through one communication of the communication link.
  • In this way, the processing device can pull multiple command streams from the host-side buffer at one time, which further reduces the number of communications between the host and the processing device, greatly lowers the communication overhead of the communication link, and improves the scheduling efficiency of the host.
  • In addition, the computing power of the processing device can also be more fully utilized.
  • As mentioned in step 201, the processing device needs to determine the number of command streams to be issued in the host-side buffer.
  • The processing device pulls command streams from the buffer only when there are command streams to be issued in the buffer and an idle stream queue exists locally on the processing device.
  • To determine the number of command streams to be issued in the buffer, the processing device needs to obtain the read and write pointers of the buffer. In a related approach, the processing device polls the read and write pointers of the host-side buffer through the communication link with the host. This "polling" way of obtaining the read and write pointers requires the processing device to access the host a large number of times over the communication link, which inevitably adds communication overhead to the link.
  • To this end, the present disclosure provides a new pointer acquisition method, which enables the processing device to obtain the read and write pointers of the host-side buffer with fewer communications.
  • Specifically, corresponding read and write pointers are set locally on the processing device side, and the read and write pointers on the two sides are kept in sync according to certain rules.
  • For example, the read and write pointers of the buffer on the host side can be stored in the host's local main memory, the corresponding read and write pointers on the processing device side can be stored in the processing device's local registers, and the pointers stored on the two sides are synchronized according to certain rules.
  • Since the read and write pointers of the buffer are mirrored on the processing device side, the processing device does not need to access the host; it only needs to poll its local read and write pointers, and from them it can determine the number of command streams to be issued in the host-side buffer, which greatly reduces the number of communications over the communication link.
  • In this embodiment, the read and write pointers of the buffer may be set up in a master-copy manner.
  • On the host side, the write-pointer is the master and the read-pointer is the copy; on the processing device side, the write-pointer is the copy and the read-pointer is the master.
  • For ease of description, the write-pointer on the host side is called the write pointer and the read-pointer on the host side is called the copy of the read pointer; the write-pointer on the processing device side is called the copy of the write pointer and the read-pointer on the processing device side is called the read pointer.
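  • For illustration, the master-copy arrangement described above could be laid out as in the following C sketch; the struct and field names are assumptions, and the storage locations follow the host main memory / device register split mentioned earlier.

```c
/* Host side: kept in the host's local main memory. */
struct host_ring_pointers {
    unsigned write_ptr;       /* master: advanced whenever a command stream is inserted */
    unsigned read_ptr_copy;   /* copy: refreshed when the device reports its read pointer */
};

/* Processing device side: kept in the device's local registers. */
struct device_ring_pointers {
    unsigned write_ptr_copy;  /* copy: refreshed when the host reports its write pointer */
    unsigned read_ptr;        /* master: advanced whenever a command stream is consumed */
};
```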
  • On this basis, after the host inserts the at least one command stream into the buffer in step 102, the method further includes the following.
  • The host updates the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer, and sends the updated pointer information of the write pointer to the processing device through the communication link.
  • There are various possible timings at which the host sends the updated pointer information of the write pointer to the processing device.
  • For example, each time the host inserts a command stream into the buffer and updates the write pointer, it may send the pointer information of the updated write pointer to the processing device. That is, every time the host updates the write pointer, it sends the updated pointer information to the processing device through the communication link.
  • With this approach, the communication link is used to send the pointer information of the write pointer only when the write pointer of the buffer is actually updated, which reduces the number of communications.
  • Alternatively, the host may send the latest pointer information of the write pointer to the processing device only after the write pointer of the buffer has been updated multiple times.
  • The number of write-pointer updates to accumulate may be preset. For example, if this number is preset to 8, then only after 8 command streams have been inserted into the buffer and the write pointer has been updated 8 times is the pointer information of the 8th updated write pointer sent to the processing device.
  • That is, the pointer information of the most recently updated write pointer is sent to the processing device over the communication link only after the write pointer of the host-side buffer has accumulated a preset number of updates.
  • This further reduces the number of communications over the communication link and lowers its communication overhead.
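  • A sketch of this threshold-batched notification is shown below, using a preset count of 8 as in the example above; pcie_write_reg() is an assumed helper for writing a device register over the link, not a real API.

```c
/* Assumed platform helper: write one 32-bit value to a device register. */
extern void pcie_write_reg(unsigned long dev_reg_addr, unsigned value);

#define WRITE_PTR_NOTIFY_THRESHOLD 8   /* preset number of updates to accumulate */

static unsigned updates_since_notify;

/* Called on the host each time the buffer's write pointer has been advanced. */
static void on_write_pointer_updated(unsigned long write_ptr_copy_reg,
                                     unsigned new_write_ptr)
{
    if (++updates_since_notify >= WRITE_PTR_NOTIFY_THRESHOLD) {
        /* One link transaction carries only the latest write-pointer value. */
        pcie_write_reg(write_ptr_copy_reg, new_write_ptr);
        updates_since_notify = 0;
    }
}
```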
  • After receiving the pointer information of the write pointer through the communication link, the processing device can update its locally stored copy of the write pointer accordingly.
  • Correspondingly, in step 202, each time the processing device reads a command stream into a stream queue, it updates its local read pointer and sends the updated pointer information of the read pointer to the host.
  • The host receives, through the communication link, the pointer information of the read pointer sent by the processing device, where the read pointer indicates the current position of the read operation on the buffer, and updates the copy of the read pointer on the host side according to this pointer information.
  • After updating the copy of the read pointer on the host side, the host can release the buffer entries holding the command streams that the processing device has already read into its stream queues, thereby freeing buffer space.
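  • The release step could look like the following sketch, which walks the host-side copy of the read pointer forward and marks the entries the device has already consumed as reusable; the names are assumptions for illustration.

```c
#define RING_ENTRIES 256u                      /* assumed ring size, as in the earlier sketch */

extern void *ring_entries[RING_ENTRIES];       /* host-side ring of Stream Buffers */
static unsigned host_read_ptr_copy;            /* host-side copy of the read pointer */

/* Release every entry the device has consumed since the last notification. */
static void release_consumed_entries(unsigned new_read_ptr_copy)
{
    while (host_read_ptr_copy != new_read_ptr_copy) {
        ring_entries[host_read_ptr_copy % RING_ENTRIES] = NULL; /* slot reusable */
        host_read_ptr_copy++;                  /* advance the local copy */
    }
}
```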
  • In this way, two sets of read and write pointers are maintained on the host side and the processing device side in a master-copy manner, and the pointers stored on the two sides are updated according to the method of the above embodiments.
  • The processing device does not need to access the host through the communication link for pointer polling; it only needs to poll its local read pointer and copy of the write pointer, from which it can determine whether there are command streams to be delivered in the buffer and how many there are.
  • The processing device may then pull one or more command streams from the buffer upon determining that there is at least one idle stream queue locally.
  • Since the processing device only needs to poll the read and write pointers stored locally, it does not need to frequently access the host to poll the buffer's read and write pointers. This greatly reduces the number of times the processing device accesses the host over the communication link, effectively eases the communication overhead between the host and the processing device, and improves the scheduling efficiency of the host.
  • For the overall command issuing flow, refer to the interactive flowchart of the command issuing method shown in FIG. 3.
  • Below, the command issuing method is described in the form of an interaction between the host and the processing device.
  • Step 301 the host generates at least one command stream according to multiple commands to be sent to the processing device for processing.
  • the host can generate at least one command stream according to multiple commands to be issued. For example, multiple commands that need to be executed in sequence can be generated as a complete command stream; or, a command that needs to be executed individually can be generated as a complete command stream.
  • Step 302 the host inserts at least one command stream into the buffer.
  • the buffer can play the role of temporarily buffering the command stream, so that when a command needs to be issued, multiple command streams temporarily buffered in the buffer can be delivered to the processing device in batches at one time.
  • Step 303 the host updates the write pointer of the buffer.
  • After the host inserts a command stream into the buffer, the corresponding write pointer of the buffer needs to be updated.
  • In this way, the host can perform successive write operations according to the continuously updated pointer information of the write pointer and keep inserting command streams into the buffer.
  • Step 304 the host sends the updated pointer information of the write pointer to the processing device.
  • Since the read and write pointers of the buffer are set up in a master-copy manner, after the host updates the write pointer of the buffer, the pointer information of the write pointer needs to be sent to the processing device so that the copy of the write pointer on the processing device side can be updated accordingly.
  • For example, each time the write pointer is updated, the pointer information of the updated write pointer may be sent to the processing device.
  • Alternatively, the pointer information of the final write pointer after multiple updates may be sent to the processing device. This further reduces the number of times the host sends pointer information to the processing device and lowers the communication overhead of the communication link between the two.
  • Step 305 the processing device updates the copy of the write pointer on the processing device side.
  • After receiving the pointer information of the write pointer sent by the host, the processing device updates its local copy of the write pointer according to the received pointer information.
  • Step 306 The processing device determines the number of command streams to be issued in the buffer according to the pointer information of the local read pointer and the copy of the write pointer.
  • In this step, the processing device can directly access its local pointer information to determine the number of command streams to be issued in the host-side buffer. Since the processing device does not need to access the pointers on the host side, the number of communications over the communication link is greatly reduced compared with polling the host-side buffer pointers, and the communication overhead of the link is lowered.
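  • As a small illustration (names assumed), the number of command streams to pull can be computed purely from the device's local pointer registers, without any access to the host over the link:

```c
/* Pointers are treated as monotonically increasing counters, so unsigned
 * subtraction yields the number of command streams still waiting in the
 * host-side buffer; the result is capped by the number of idle queues. */
static unsigned streams_to_pull(unsigned write_ptr_copy, unsigned read_ptr,
                                unsigned idle_queue_count)
{
    unsigned pending = write_ptr_copy - read_ptr;
    return pending < idle_queue_count ? pending : idle_queue_count;
}
```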
  • Step 307 The host transmits the at least one command stream in the buffer to the processing device.
  • In this step, the processing device may actively pull a certain number of command streams from the host-side buffer. For example, all command streams in the buffer can be pulled to the processing device at once. In this way, multiple command streams can be issued in a batch through one communication of the communication link, reducing the communication overhead of the link.
  • Step 308 the processing device reads at least one command stream into a local stream queue.
  • After the processing device pulls a command stream, it reads the command stream into a local stream queue, which stores the pulled command stream. The processing device can then use a command dispatcher to distribute the commands in the stream queue to different computing units for computation.
  • Step 309 Each time the processing device reads a command stream into a local stream queue, it updates its local read pointer.
  • Step 310 the processing device sends the updated pointer information of the read pointer to the host.
  • Step 311 the host updates the copy of the read pointer on the host side.
  • After the processing device has read a command stream into its local stream queue, the location in the host-side buffer that stored that command stream can be released.
  • To this end, the processing device updates its local read pointer and sends the pointer information of the updated read pointer to the host.
  • When the host receives this pointer information, it updates its local copy of the read pointer accordingly.
  • The host can then release the cache at the corresponding position in the host-side buffer according to the update of the read-pointer copy.
  • The above describes the complete implementation of the command issuing method in the form of an interaction between the host and the processing device.
  • With this method, commands are issued in the form of command streams, and issuing a single command stream delivers multiple commands, thereby reducing the communication overhead of the communication link.
  • Moreover, the method can send multiple command streams to the processing device in one communication, which further reduces the communication overhead of the communication link and improves the scheduling efficiency of the host.
  • the present disclosure provides a command issuing apparatus, and the apparatus can execute the command issuing method of any embodiment of the present disclosure.
  • The apparatus may include a command stream generation module 401, an insertion module 402, and a transmission module 403, wherein:
  • the command stream generation module 401 is configured to generate at least one command stream according to multiple commands to be sent to the processing device for processing; wherein, each of the command streams includes at least one command;
  • an inserting module 402 configured to insert the at least one command stream into a buffer
  • the transmission module 403 is configured to transmit at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • In some embodiments, when the transmission module 403 is configured to transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, it is further configured to: in the case where the buffer includes at least two command streams, transmit the at least two command streams to the processing device through one communication of the communication link.
  • the device further includes:
  • a first write pointer update module 501 configured to update the write pointer of the buffer, where the write pointer is used to indicate the current position of the write operation to the buffer;
  • the first pointer information sending module 502 is configured to send the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device can update the copy of the write pointer on the processing device side.
  • the device further includes:
  • a second write pointer update module 601, configured to update the write pointer of the buffer, where the write pointer is used to indicate the current position of the write operation to the buffer;
  • the second pointer information sending module 602 is configured to send the last updated pointer information of the write pointer to the processing device when the number of updates of the write pointer of the buffer reaches a preset number of times.
  • the device further includes:
  • a pointer information receiving module 701 configured to receive pointer information of a read pointer sent by the processing device through the communication link, where the read pointer is used to indicate the current position of the read operation on the buffer;
  • the read pointer copy update module 702 is configured to update the read pointer copy on the host side according to the pointer information of the read pointer.
  • the communication link is a PCI-Express link.
  • the present disclosure provides a processing device, and the processing device can execute the command issuing method of any embodiment of the present disclosure.
  • The processing device may include a queue memory 801 and a microprocessor 802, wherein:
  • the queue memory 801 is configured to store stream queues;
  • the microprocessor 802 is configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • In some embodiments, when the microprocessor is configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, it is further configured to: in the case where the buffer on the host side includes at least two command streams, pull the at least two command streams from the host-side buffer through one communication of the communication link.
  • In some embodiments, when the microprocessor is configured to read the pulled at least one command stream into a local stream queue of the processing device, it is further configured to: in the case where multiple command streams are pulled from the buffer on the host side, read the multiple command streams into different local stream queues of the processing device respectively. The processing device further includes a parallel scheduling module 901, which is configured to schedule the corresponding computing modules so that the command streams in the different local stream queues are executed in parallel.
  • the microprocessor is further configured to receive pointer information of the write pointer sent by the host through the communication link; and update the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
  • In some embodiments, when the microprocessor is configured to pull at least one command stream from the buffer on the host side, it is further configured to: determine the number of command streams to be issued in the buffer according to the pointer information of the processing device's local read pointer and copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pull at least one command stream from the buffer.
  • In some embodiments, when the microprocessor is configured to read the pulled at least one command stream into a local stream queue of the processing device, it is further configured to: each time one command stream is read into a local stream queue of the processing device, update the local read pointer of the processing device; and send the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
  • the processing device is an AI chip or a GPU.
  • the communication link is a PCI-Express link.
  • Since the apparatus embodiments and the processing device embodiments basically correspond to the method embodiments, reference may be made to the corresponding parts of the description of the method embodiments for related details.
  • The apparatus embodiments and processing device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of at least one embodiment of the present disclosure. Those of ordinary skill in the art can understand and implement this without creative effort.
  • The present disclosure also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the command issuing method of any embodiment of the present disclosure when executing the program.
  • The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050.
  • The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 communicate with one another within the device through the bus 1050.
  • The processor 1010 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of this specification.
  • The memory 1020 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, and the like.
  • the memory 1020 may store an operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, relevant program codes are stored in the memory 1020 and invoked by the processor 1010 for execution.
  • the input/output interface 1030 is used to connect the input/output module to realize information input and output.
  • The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1040 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices.
  • the communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
  • Bus 1050 includes a path to transfer information between the various components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
  • It should be noted that although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may only include components necessary to implement the solutions of the embodiments of the present specification, rather than all the components shown in the figures.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the command issuing method of any embodiment of the present disclosure can be implemented.
  • non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc., which is not limited in this application.
  • Embodiments of the present disclosure further provide a computer program product comprising computer-readable code; when the computer-readable code runs on a device, a processor in the device executes the command issuing method of any of the above embodiments.
  • the computer program product can be specifically implemented by hardware, software or a combination thereof.

Abstract

A command issuing method and apparatus, a processing device, a computer device, and a storage medium, the method comprising: on the basis of a plurality of commands to be issued to a processing device for processing, generating at least one command stream, each command stream comprising at least one command (101); inserting the at least one command stream into a buffer (102); and, by means of the communication link between a host computer and the processing device, transmitting the at least one command stream in the buffer to the processing device (103). The communication overhead between the host computer and the processing device is reduced, and the scheduling efficiency of the host computer is increased.

Description

Command issuing method, apparatus, processing device, computer device and storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application claims priority to Chinese patent application No. 202011459860.3, filed on December 11, 2020 and entitled "Command issuing method, apparatus, processing device, computer device and storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, and in particular, to a command issuing method, apparatus, processing device, computer device, and storage medium.
BACKGROUND
In the field of deep learning, artificial intelligence (AI) chips, like graphics processing units (GPUs), are usually used as accelerator cards for the host/CPU. The AI chip or GPU can be called a processing device, which is scheduled and controlled by the host.
With the widespread use of AI, the size of deep learning models and the amount of data keep growing. When the host schedules and controls the processing device, it not only needs to transmit a large amount of data, but also needs to issue operation commands frequently. As a result, the communication link between the host and the processing device often hits a communication bottleneck, and the communication overhead of the link is too large, resulting in low host scheduling efficiency.
SUMMARY OF THE INVENTION
The present disclosure provides a command issuing method, apparatus, processing device, computer device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, a command issuing method is provided, the method comprising: generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; inserting the at least one command stream into a buffer; and transmitting the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
In some optional embodiments, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device includes: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating a write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and sending the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
In some optional embodiments, after the at least one command stream is inserted into the buffer, the method further includes: updating the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and, when the number of updates of the write pointer of the buffer reaches a preset number, sending the pointer information of the most recently updated write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
In some optional embodiments, the method further includes: receiving pointer information of a read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of the read operation on the buffer; and updating the copy of the read pointer on the host side according to the pointer information of the read pointer.
In some optional embodiments, the communication link is a PCI-Express link, a high-speed serial computer expansion bus standard.
According to a second aspect of the embodiments of the present disclosure, another command issuing method is provided, the method comprising: pulling at least one command stream from a buffer on the host side through a communication link between a processing device and a host; and reading the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
In some optional embodiments, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: in the case where multiple command streams are pulled from the buffer on the host side, reading the multiple command streams into different local stream queues of the processing device respectively; the method further includes: executing the command streams in the different local stream queues in parallel.
In some optional embodiments, the method further includes: receiving pointer information of the write pointer sent by the host through the communication link; and updating the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
In some optional embodiments, pulling at least one command stream from the buffer on the host side includes: determining the number of command streams to be issued in the buffer according to the pointer information of the processing device's local read pointer and copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
In some optional embodiments, reading the pulled at least one command stream into a local stream queue of the processing device includes: each time one command stream is read into a local stream queue of the processing device, updating the local read pointer of the processing device; and sending the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
In some optional embodiments, the communication link is a PCI-Express link.
According to a third aspect of the embodiments of the present disclosure, a command issuing apparatus is provided, the apparatus comprising: a command stream generation module configured to generate at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; an insertion module configured to insert the at least one command stream into a buffer; and a transmission module configured to transmit the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
According to a fourth aspect of the embodiments of the present disclosure, a processing device is provided, the processing device comprising: a queue memory configured to store stream queues; and a microprocessor configured to pull at least one command stream from a buffer on the host side through a communication link between the processing device and the host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
According to a fifth aspect of the embodiments of the present disclosure, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the command issuing method according to any one of the first aspect or the second aspect when executing the program.
According to a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
根据本公开实施例的第七方面,提供一种计算机程序产品,包括计算机程序,所述程序被处理器执行时实现第一方面或第二方面中任一所述的命令下发方法。According to a seventh aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program, which implements the command issuing method according to any one of the first aspect or the second aspect when the program is executed by a processor.
本公开实施例中,可以根据待下发到处理设备的多个命令,将多个命令生成一个命令流,以命令流的方式向处理设备下发命令。这种命令下发方式中,一次命令流的下发可以实现多个命令的下发,通过通信链路的一次通信即可下发多个命令。有效减少了主机与处理设备的通信次数,减轻了主机与处理设备之间的通信开销,提高了主机的调度效率。In the embodiment of the present disclosure, a command stream may be generated from the plurality of commands according to the commands to be issued to the processing device, and the command may be issued to the processing device in the form of a command stream. In this command delivery method, multiple commands can be delivered by one command stream delivery, and multiple commands can be delivered through one communication of the communication link. The communication frequency between the host and the processing device is effectively reduced, the communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.
Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Fig. 1 is a flowchart of a command issuing method according to an exemplary embodiment;
Fig. 2 is a flowchart of another command issuing method according to an exemplary embodiment;
Fig. 3 is an interaction flowchart of a command issuing method according to an exemplary embodiment;
Fig. 4 is a schematic diagram of a command issuing apparatus according to an exemplary embodiment;
Fig. 5 is a schematic diagram of another command issuing apparatus according to an exemplary embodiment;
Fig. 6 is a schematic diagram of yet another command issuing apparatus according to an exemplary embodiment;
Fig. 7 is a schematic diagram of yet another command issuing apparatus according to an exemplary embodiment;
Fig. 8 is a schematic diagram of a processing device according to an exemplary embodiment;
Fig. 9 is a schematic diagram of another processing device according to an exemplary embodiment;
Fig. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment.
Detailed Description of the Embodiments
Exemplary embodiments will be described in detail here, with examples illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless indicated otherwise. The implementations described in the following exemplary embodiments do not represent all solutions consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used here may be interpreted as "when", "while", or "in response to determining".
With the widespread use of artificial intelligence, deep learning models and data volumes keep growing. When the host schedules and controls the processing device, it needs to issue operation commands frequently, and the communication link between the host and the processing device (such as a PCI-Express link) has to carry a large amount of communication data and model code, so the communication overhead of the link becomes excessive and scheduling efficiency is low.
Based on the above, the present disclosure provides a command issuing method: the host generates at least one command stream according to multiple commands to be issued to the processing device, inserts the at least one command stream into a buffer, and transmits the command streams in the buffer to the processing device through the communication link.
By issuing commands as command streams, multiple commands can be delivered to the processing device through a single command stream, which reduces the number of communications over the communication link, lowers the communication overhead of the link between the host and the processing device, and improves the scheduling efficiency of the host.
To make the command issuing method provided by the present disclosure clearer, the execution process of the solution provided by the present disclosure is described in detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, Fig. 1 is a flowchart of a command issuing method according to an embodiment of the present disclosure. The method is applied to a host. As shown in Fig. 1, the process includes:
Step 101: generate at least one command stream according to multiple commands to be issued to the processing device for processing, where each command stream includes at least one command.
In this embodiment, the commands used to generate the command streams are commands generated by the host that need to be issued to the processing device for processing.
For example, they may be multiple commands produced by multiple processes at the application layer.
Suppose the application layer includes a payment application "Pay ×" and a photo-beautification application "Meitu ××". While these two applications are in use, the commands produced by the "Pay ×" process or the "Meitu ××" process need to be issued to the processing device (such as an AI chip) for processing. The commands produced by the "Pay ×" process or the "Meitu ××" process are the commands to be issued to the processing device for processing.
For example, in the deep learning field, when an AI chip serves as the processing device, the commands in a command stream may include various operators (kernels) of a deep learning model, data movement (memcpy) commands, and event synchronization commands.
In this step, at least one command stream can be generated from the multiple commands to be issued. A command stream may contain one command or multiple commands. Commands within the same command stream must be executed in order, while different command streams can be executed in parallel.
The command stream here is similar to a "stream" in CUDA (Compute Unified Device Architecture, the computing platform introduced by the graphics card vendor NVIDIA).
For example, suppose the commands to be issued include: command 1, command 2, command 3, command A, command B, and command C.
If command 1, command 2, and command 3 need to be executed in sequence, this step can generate a command stream 1 from them, where command stream 1 includes command 1, command 2, and command 3.
Similarly, if command A and command B need to be executed in sequence, this step can generate a command stream A from them, where command stream A includes command A and command B.
If command C is unrelated to the execution of the other commands, this step can generate a command stream C from command C alone.
With command stream 1, command stream A, and command stream C generated, the execution of the individual command streams does not affect one another. For example, the three command streams can be executed in parallel.
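As a concrete illustration of the grouping above, the following is a minimal sketch in C. The types command_t and command_stream_t, the depends_on field, and the grouping heuristic are assumptions made for illustration only; the disclosure does not prescribe a particular data layout or grouping rule.

```c
/* A minimal sketch: commands that must follow one another go into the same
 * stream, independent commands open new streams. All names are illustrative
 * assumptions, not the driver interface described in this disclosure. */
#include <stddef.h>

#define MAX_CMDS_PER_STREAM 16

typedef struct {
    int id;           /* e.g. a kernel, memcpy or event-sync command */
    int depends_on;   /* id of the command that must run just before, or -1 */
} command_t;

typedef struct {
    command_t cmds[MAX_CMDS_PER_STREAM];  /* executed strictly in order */
    size_t    count;
} command_stream_t;

size_t build_streams(const command_t *pending, size_t n,
                     command_stream_t *streams, size_t max_streams)
{
    size_t n_streams = 0;
    for (size_t i = 0; i < n; i++) {
        size_t target = n_streams;                /* default: open a new stream */
        for (size_t s = 0; s < n_streams; s++) {  /* must it follow an existing one? */
            command_stream_t *st = &streams[s];
            if (st->count > 0 && st->count < MAX_CMDS_PER_STREAM &&
                st->cmds[st->count - 1].id == pending[i].depends_on) {
                target = s;
                break;
            }
        }
        if (target == n_streams) {                /* start a new, independent stream */
            if (n_streams == max_streams)
                break;                            /* no stream slot left */
            streams[n_streams++].count = 0;
        }
        streams[target].cmds[streams[target].count++] = pending[i];
    }
    return n_streams;   /* {1,2,3}, {A,B} and {C} would become three streams */
}
```

With the six commands of the example above, this sketch would yield three streams, which the processing device could then execute independently of one another.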
Step 102: insert the at least one command stream into a buffer.
Exemplarily, the buffer in this embodiment may be a ring buffer. It can be understood that any buffer able to meet the requirements of this step can serve as the buffer of this embodiment; it is not limited to a ring buffer.
A ring buffer is a typical "producer-consumer" model. In this embodiment, the host is the producer and can insert command streams into the ring buffer; the processing device is the consumer and can pull command streams from the ring buffer down into its local stream queues (stream queue).
Taking a ring buffer as an example, this step may insert one or more command streams into the ring buffer.
For example, the driver can create a separate command stream buffer (stream buffer) for each command stream and insert each stream buffer into the ring buffer under a lock. Each stream buffer corresponds to one entry of the ring buffer.
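A minimal sketch of this producer-side insertion under a lock is given below, assuming a fixed-size ring whose entries each hold one stream buffer. The ring_buffer_t layout, the entry count, and the use of a pthread mutex are illustrative assumptions; the lock would need to be initialized (for example with pthread_mutex_init) before use.

```c
/* Host-side "producer": insert one stream buffer into the ring under a lock. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 64u   /* assumed ring size */

typedef struct {
    void    *entries[RING_ENTRIES];  /* each entry holds one stream buffer */
    uint32_t write_ptr;              /* monotonically increasing write position */
    uint32_t read_ptr;               /* monotonically increasing read position  */
    pthread_mutex_t lock;
} ring_buffer_t;

/* Returns false when the ring is full; the caller may retry later. */
bool ring_insert_stream(ring_buffer_t *rb, void *stream_buffer)
{
    bool ok = false;
    pthread_mutex_lock(&rb->lock);
    if (rb->write_ptr - rb->read_ptr < RING_ENTRIES) {   /* space left? */
        rb->entries[rb->write_ptr % RING_ENTRIES] = stream_buffer;
        rb->write_ptr++;                                 /* advance the write pointer */
        ok = true;
    }
    pthread_mutex_unlock(&rb->lock);
    return ok;
}
```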
Step 103: transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
Exemplarily, the communication link between the host and the processing device (such as an AI chip) in this embodiment may be PCI-Express (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard). It can be understood that, besides PCI-Express, other types of communication links may also be used between the host and the processing device; the present disclosure does not limit this.
Taking a PCI-Express link as an example, this step may transmit the command streams in the buffer to the processing device over the PCI-Express link. One PCI-Express communication may transmit one command stream to the processing device, or one PCI-Express communication may transmit multiple command streams to the processing device.
In this embodiment, the command streams in the buffer may reach the processing device either because the host actively pushes them to the processing device or because the processing device actively pulls them from the buffer. The specific way in which the command streams in the buffer are transmitted to the processing device can take many forms, and this embodiment does not limit it.
The host may, according to the number of command streams waiting in the buffer, actively push a certain number of them to the processing device, which then processes the issued command streams further. For example, once the number of command streams in the buffer reaches a preset number, the host may deliver that preset number of command streams to the processing device in one go over the PCI-Express link.
The processing device may also actively pull a certain number of command streams from the buffer.
For example, the processing device may poll the pointer information of the host-side buffer, and when there are command streams waiting in the buffer, pull one command stream at a time from the host-side ring buffer into a local stream queue over the PCI-Express link, or pull multiple command streams at once from the host-side ring buffer into local stream queues over the PCI-Express link.
In this embodiment, the host can generate at least one command stream from multiple commands to be issued to the processing device and issue commands to the processing device as command streams. Issuing one command stream delivers multiple commands, which reduces the number of communications between the host and the processing device, lowers their communication overhead, and improves the scheduling efficiency of the host.
In addition, as the field of artificial intelligence develops, the computing power of AI chips keeps climbing, even reaching 256/512 TOPS. When scheduling efficiency is low, the host cannot issue operation commands to the processing device for scheduling and control in time, the computing power of the processing device cannot be fully utilized, and computing resources are wasted.
Because the command issuing method of this embodiment improves the scheduling efficiency of the host, the computing power of the processing device can be utilized more fully.
In some optional embodiments, in step 103, transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device may include: when the buffer contains at least two command streams, transmitting the at least two command streams to the processing device through one communication over the communication link.
In the above embodiment, when multiple command streams have already been inserted into the buffer, the host can transmit them to the processing device in one batch through a single communication over the link. In one possible implementation, the host transmits all the command streams in the buffer to the processing device in one batch through a single communication. In another possible implementation, the host transmits some of the command streams in the buffer (more than one) in one batch through a single communication.
In the above embodiment, by inserting multiple command streams into the buffer, the host can transmit multiple command streams to the processing device at once over the communication link, further reducing the number of communications between the host and the processing device, lowering their communication overhead, and improving the scheduling efficiency of the host.
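The sketch below illustrates the batching idea on the host side, building on the ring_buffer_t from the earlier sketch. The helpers pcie_write_block() and describe_stream(), the stream_desc_t type, and the choice to transfer descriptors rather than the stream contents themselves are all assumptions standing in for whatever single PCI-Express transfer (for example one DMA) a real driver would issue; none of them is an API from this disclosure.

```c
/* Push the descriptors of every pending command stream in one link transaction. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t dma_addr;   /* device-visible address of one stream buffer (assumed) */
    uint32_t length;     /* size of that stream buffer in bytes                   */
} stream_desc_t;

extern int pcie_write_block(uint64_t device_addr, const void *src, size_t len);
extern stream_desc_t describe_stream(void *stream_buffer);   /* assumed helper */

int flush_pending_streams(ring_buffer_t *rb, uint64_t device_queue_addr)
{
    stream_desc_t batch[RING_ENTRIES];
    size_t n = 0;

    pthread_mutex_lock(&rb->lock);
    while (n < rb->write_ptr - rb->read_ptr && n < RING_ENTRIES) {
        batch[n] = describe_stream(rb->entries[(rb->read_ptr + n) % RING_ENTRIES]);
        n++;
    }
    pthread_mutex_unlock(&rb->lock);

    if (n == 0)
        return 0;
    /* one communication carries the descriptors of all n pending streams */
    return pcie_write_block(device_queue_addr, batch, n * sizeof(stream_desc_t));
}
```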
Referring to Fig. 2, Fig. 2 is a flowchart of another command issuing method according to an embodiment of the present disclosure. The method is applied to a processing device. As shown in Fig. 2, the process includes:
Step 201: pull at least one command stream from a buffer on the host side through the communication link between the processing device and the host.
The host has generated command streams from the commands to be issued to the processing device for processing, and has cached the command streams in the buffer. For example, the host may cache the commands to be issued, in the form of command streams, in a ring buffer.
The processing device can pull one command stream from the buffer at a time over the communication link with the host, or pull multiple command streams in one batch. The number of command streams the processing device pulls from the buffer at one time needs to be determined jointly by the number of command streams waiting in the buffer and the number of idle stream queues local to the processing device.
For example, when there is one command stream waiting in the buffer, the processing device may determine that at least one idle stream queue exists locally. The processing device can then pull that command stream from the buffer and read it into the corresponding idle stream queue, thereby completing the delivery of the multiple commands contained in that command stream from the host side to the processing device side.
For example, when there are multiple command streams waiting in the buffer, the processing device may determine that enough idle stream queues exist locally. The processing device can then pull those command streams from the buffer in one batch and read them into different stream queues, thereby completing the delivery of the multiple commands contained in those command streams from the host side to the processing device side.
In this embodiment, the processing device needs to determine the number of command streams waiting in the host-side buffer. Only when there are command streams waiting in the buffer and idle stream queues exist locally on the processing device does the processing device pull a certain number of command streams from the buffer.
When determining the number of command streams waiting in the host-side buffer, the processing device may poll the read and write pointers of the host-side buffer over the communication link with the host, and judge from those pointers whether there are command streams waiting in the buffer.
Step 202: read the pulled at least one command stream into a local stream queue, where the stream queue is used to store command streams to be executed.
In this embodiment, the processing device may include stream queues for storing command streams to be executed. The processing device can read the command streams pulled from the buffer into different stream queues. The processing device can then use a command dispatcher to distribute the commands in the stream queues to different computing units for computation.
In some optional embodiments, when multiple command streams are pulled from the buffer, the processing device can read them into different local stream queues and execute the command streams in the different stream queues in parallel, which improves the processing device's command execution efficiency.
In this embodiment, the processing device can pull a command stream from the host-side buffer in one go over the communication link with the host. Pulling commands from the host side in the form of command streams means that pulling one command stream delivers multiple commands, which reduces the number of communications between the host and the processing device, lowers the communication overhead of the link between them, and improves the scheduling efficiency of the host. The computing power of the processing device can also be utilized more fully.
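The following is a minimal sketch of the consumer-side step of placing pulled command streams into idle local stream queues so that they can run in parallel. The stream_queue_t type, the queue count, and dispatch_to_compute_units() are assumptions for illustration; the disclosure does not specify how the dispatcher or the queues are implemented.

```c
/* Assign each pulled command stream to an idle local stream queue. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_STREAM_QUEUES 8   /* assumed number of local stream queues */

typedef struct {
    bool  busy;
    void *stream;             /* the command stream currently queued */
} stream_queue_t;

/* Assumed dispatcher: hands the queued commands to the computing units. */
extern void dispatch_to_compute_units(stream_queue_t *q);

/* Returns how many of the pulled streams were placed into a queue. */
size_t enqueue_pulled_streams(stream_queue_t queues[NUM_STREAM_QUEUES],
                              void **pulled, size_t n_pulled)
{
    size_t placed = 0;
    for (size_t q = 0; q < NUM_STREAM_QUEUES && placed < n_pulled; q++) {
        if (!queues[q].busy) {
            queues[q].busy   = true;
            queues[q].stream = pulled[placed++];
            dispatch_to_compute_units(&queues[q]);  /* queues proceed independently */
        }
    }
    return placed;
}
```

Because each queue is handed to the dispatcher as soon as it is filled, streams placed into different queues can execute in parallel, matching the behavior described above.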
In some optional embodiments, in step 201, pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: when the host-side buffer contains at least two command streams, pulling the at least two command streams from the host-side buffer through one communication over the communication link.
In the above embodiment, when the host-side buffer contains multiple command streams that can be pulled, the processing device can pull them in one batch from the host-side buffer through a single communication over the link. In one possible implementation, the processing device pulls all the command streams in the host-side buffer in one batch through a single communication. In another possible implementation, the processing device pulls some of the command streams in the host-side buffer (more than one) in one batch through a single communication.
In the above embodiment, the processing device can pull multiple command streams from the host-side buffer at once, further reducing the number of communications between the host and the processing device. This greatly lowers the communication overhead of the link between the host and the processing device and improves the scheduling efficiency of the host. The computing power of the processing device can also be utilized more fully.
In step 201, the processing device needs to determine the number of command streams waiting in the host-side buffer. Only when there are command streams waiting in the buffer and idle stream queues exist locally does the processing device pull command streams from the buffer.
To determine the number of command streams waiting in the buffer, the processing device needs the buffer's read and write pointers. In the related way of obtaining them, the processing device has to poll the read and write pointers of the host-side buffer over the communication link with the host. This "polling" approach requires the processing device to access the host heavily over the link, which inevitably adds communication overhead to the link.
For this reason, the present disclosure provides a new pointer acquisition scheme that lets the processing device obtain the read and write pointers of the host-side buffer with fewer communications.
Corresponding to the read and write pointers of the host-side buffer, matching read and write pointers are set locally on the processing device side, and the pointers on both sides are updated synchronously according to certain rules.
For example, the read and write pointers of the host-side buffer can be stored in the host's local main memory, while the corresponding read and write pointers on the processing device side are stored in local registers of the processing device, and the pointers stored on both sides are synchronized according to certain rules.
In this scheme, because read and write pointers are also kept on the processing device side, the processing device does not need to access the host side; it only needs to poll its local read and write pointers to determine, from them, the number of command streams waiting in the host-side buffer, which greatly reduces the number of communications over the link.
The way for the processing device to obtain the buffer's read and write pointers given above is only a description of the principle. The new pointer acquisition scheme provided by the present disclosure is described in detail below in combination with the command issuing method provided by the present disclosure.
In this embodiment, the read and write pointers of the buffer can be arranged in a master/copy fashion. On the host side, the write pointer (write-pointer) is the master and the read pointer (read-pointer) is a copy; on the processing device side, the write pointer is a copy and the read pointer is the master.
To distinguish the pointers on the two sides conveniently, the write-pointer on the host side is called the write pointer and the read-pointer there is called the read pointer copy; the write-pointer on the processing device side is called the write pointer copy and the read-pointer there is called the read pointer.
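A minimal sketch of this master/copy layout is shown below. The struct and field names are assumptions chosen to mirror the naming convention just described; where each struct physically lives (host main memory versus a device register block) follows the example given above.

```c
/* Master/copy pointer layout, names assumed for illustration. */
#include <stdint.h>

typedef struct {               /* kept in host main memory */
    uint32_t write_ptr;        /* master: the host advances it on each insert   */
    uint32_t read_ptr_copy;    /* copy:   updated from the processing device    */
} host_pointers_t;

typedef struct {               /* kept in device-local registers */
    uint32_t write_ptr_copy;   /* copy:   updated from the host                 */
    uint32_t read_ptr;         /* master: the device advances it on each read   */
} device_pointers_t;
```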
In step 102, after the host inserts the at least one command stream into the buffer, the method further includes:
the host updates the write pointer of the buffer, where the write pointer indicates the current position of write operations on the buffer, and sends pointer information of the updated write pointer to the processing device through the communication link.
The timing at which the host sends the updated write pointer information to the processing device can vary.
For example, each time the host inserts a command stream into the buffer and updates the write pointer, it sends the pointer information of that update to the processing device. That is, every time the host updates the write pointer, it sends the updated pointer information to the processing device once over the communication link.
This way of updating the write pointer keeps the write pointers on both sides synchronized in real time, so that the processing device can obtain the latest state of the host-side buffer's write pointer more promptly. Compared with having the processing device poll the write pointer on the host side, this approach uses the link to send pointer information only when the buffer's write pointer is actually updated, which reduces the number of communications.
In one possible implementation, the host may send the updated write pointer information to the processing device only after the buffer's write pointer has been updated several times, sending the latest pointer information at that point.
In the above implementation, the number of write pointer updates can be preset. For example, if the preset number of updates is 8, the pointer information of the eighth update is sent to the processing device only after 8 command streams have been inserted into the buffer and the write pointer has been updated 8 times.
With this way of updating the write pointer, the link is used to send the most recently updated pointer information to the processing device only after the host-side buffer's write pointer has accumulated several updates, which further reduces the number of communications over the link and lowers its communication overhead.
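The batched-update variant can be sketched as follows. The preset count, the wptr_doorbell_t bookkeeping, and pcie_write_u32() (standing in for whatever single link transaction carries the pointer information) are all assumptions for illustration.

```c
/* Forward the write pointer to the device only every WPTR_SYNC_PERIOD updates. */
#include <stdint.h>

#define WPTR_SYNC_PERIOD 8          /* assumed preset number of updates */

extern int pcie_write_u32(uint64_t device_reg_addr, uint32_t value);

typedef struct {
    uint32_t write_ptr;             /* host-side master write pointer          */
    uint32_t updates_since_sync;    /* updates since the copy was last synced  */
    uint64_t device_wptr_reg;       /* where the device keeps its pointer copy */
} wptr_doorbell_t;

/* Called once each time a command stream has been inserted into the buffer. */
void on_stream_inserted(wptr_doorbell_t *d)
{
    d->write_ptr++;                                        /* update the master */
    if (++d->updates_since_sync >= WPTR_SYNC_PERIOD) {
        pcie_write_u32(d->device_wptr_reg, d->write_ptr);  /* one communication */
        d->updates_since_sync = 0;
    }
}
```

Sending the pointer every update instead corresponds to setting the period to 1; the trade-off is freshness of the device-side copy versus link traffic.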
After receiving the write pointer information over the communication link, the processing device can update the corresponding write pointer copy stored locally according to that pointer information.
In step 202, each time the processing device has read one command stream into a stream queue, it updates its local read pointer and sends the updated pointer information of the read pointer to the host.
The host receives the read pointer information sent by the processing device through the communication link, where the read pointer indicates the current position of read operations on the buffer, and updates the read pointer copy on the host side according to that pointer information.
After updating the host-side read pointer copy according to the read pointer information sent by the processing device, the host can, based on the read pointer copy, release the command streams that the processing device has already read into its stream queues, thereby freeing buffer space.
As described above, two sets of read and write pointers are kept on the host side and the processing device side in a master/copy fashion, and the pointers stored on both sides can be updated in the manner of the above embodiments.
In this way, the processing device does not need to access the host over the communication link to poll pointers; it only needs to poll its local read pointer and write pointer copy, and based on them it can determine whether there are command streams waiting in the buffer and how many.
Thus, once it determines that at least one idle stream queue exists locally, the processing device can pull one or more command streams from the buffer. When pulling multiple command streams, it reads them into different stream queues for processing.
Because the processing device only needs to poll the locally stored read and write pointers and does not need to access the host frequently to poll the buffer's pointers, the number of times the processing device accesses the host over the communication link is greatly reduced, which effectively relieves the communication overhead between the host and the processing device and improves the scheduling efficiency of the host.
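A minimal sketch of this purely local check is given below. It assumes monotonically increasing 32-bit pointers (so the unsigned difference is the number of entries written but not yet read) and redeclares the same device_pointers_t layout sketched earlier; the function names are illustrative.

```c
/* Decide whether to pull, using only device-local pointer state. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {               /* same layout as sketched earlier */
    uint32_t write_ptr_copy;   /* mirrored from the host          */
    uint32_t read_ptr;         /* master, owned by the device     */
} device_pointers_t;

/* Number of command streams waiting in the host-side buffer. */
static inline uint32_t pending_streams(const device_pointers_t *p)
{
    return p->write_ptr_copy - p->read_ptr;
}

/* Pull only when there is work waiting and an idle local stream queue exists. */
static inline bool should_pull(const device_pointers_t *p, unsigned idle_queues)
{
    return pending_streams(p) > 0 && idle_queues > 0;
}
```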
Refer to the interaction flowchart of the command issuing method shown in Fig. 3. In the following embodiments, the command issuing method is described in the form of interaction between the host and the processing device.
Step 301: the host generates at least one command stream according to multiple commands to be issued to the processing device for processing.
The host can generate at least one command stream according to the multiple commands to be issued. For example, multiple commands that need to be executed in sequence can be formed into one complete command stream; or a single command that needs to be executed on its own can be formed into one complete command stream.
Step 302: the host inserts the at least one command stream into the buffer.
After generating the command streams, the host needs to insert them into the buffer for caching. The buffer serves as temporary storage for command streams, so that when commands need to be issued, the multiple command streams temporarily cached in the buffer can be delivered to the processing device in one batch.
Step 303: the host updates the write pointer of the buffer.
After the host inserts a command stream into the buffer, it needs to update the buffer's write pointer accordingly. Based on the continually updated write pointer information, the host can perform multiple write operations and insert command streams into the buffer.
Step 304: the host sends the updated write pointer information to the processing device.
In the embodiments of the present disclosure, because the buffer's read and write pointers are arranged in a master/copy fashion, after the host side updates the buffer's write pointer it needs to send the write pointer information to the processing device so that the write pointer copy on the processing device side is updated accordingly.
In one possible implementation, the host may send the updated write pointer information to the processing device after every update of the write pointer. In another possible implementation, the host may update the write pointer several times and then send the final pointer information to the processing device, which further reduces the number of times the host sends pointer information to the processing device and lowers the communication overhead of the link between them.
Step 305: the processing device updates the write pointer copy on the processing device side.
After receiving the write pointer information sent by the host, the processing device needs to update its local write pointer copy accordingly based on the received pointer information.
Step 306: the processing device determines, from the pointer information of its local read pointer and write pointer copy, the number of command streams waiting in the buffer.
In the embodiments of the present disclosure, because the buffer's read and write pointers are arranged in a master/copy fashion and the pointers on both sides are updated synchronously according to certain rules, the processing device can determine the number of command streams waiting in the host-side buffer simply by accessing its local pointer information. Since the processing device does not need to access the pointers on the host side, the number of communications over the link is greatly reduced compared with polling the host-side buffer's pointers, which lowers the link's communication overhead.
Step 307: the host transmits the at least one command stream in the buffer to the processing device.
In one possible implementation, after determining the number of command streams waiting in the host-side buffer, the processing device may actively pull a certain number of them from the host-side buffer. For example, all the command streams in the buffer can be pulled to the processing device at once. In this way, a single communication over the link delivers multiple command streams in one batch, reducing the link's communication overhead.
Step 308: the processing device reads the at least one command stream into a local stream queue.
After pulling a command stream, the processing device needs to read it into a local stream queue to store the pulled command stream. The processing device can then use the command dispatcher to distribute the commands in the stream queues to different computing units for computation.
Step 309: each time the processing device has read one command stream into a local stream queue, it updates its local read pointer.
Step 310: the processing device sends the updated read pointer information to the host.
Step 311: the host updates the read pointer copy on the host side.
After the processing device has read the pulled command stream into a local stream queue, the cache location in the host-side buffer that held the pulled command stream can be released. In the embodiments of the present disclosure, each time a command stream has been read into a local stream queue, the local read pointer is updated and its pointer information is sent to the host. On receiving the pointer information, the host updates its local read pointer copy accordingly, and based on this update it can release the corresponding location in the host-side buffer.
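Steps 309 through 311 can be sketched as a small read-pointer feedback path, shown below. pcie_write_u32() and free_entry() are assumed helpers standing in for one link transaction and for the host's buffer-release action respectively; the calling context and argument names are illustrative.

```c
/* Read-pointer feedback: device reports each read, host frees consumed entries. */
#include <stdint.h>

extern int  pcie_write_u32(uint64_t host_reg_addr, uint32_t value);
extern void free_entry(uint32_t index);   /* releases one ring-buffer slot (assumed) */

/* Device side: called once per command stream read into a local stream queue. */
void device_on_stream_read(uint32_t *read_ptr, uint64_t host_rptr_copy_addr)
{
    (*read_ptr)++;                                   /* advance the master read pointer */
    pcie_write_u32(host_rptr_copy_addr, *read_ptr);  /* one link transaction to the host */
}

/* Host side: called when new read-pointer information arrives from the device. */
void host_on_read_ptr_update(uint32_t *read_ptr_copy, uint32_t new_value,
                             uint32_t ring_entries)
{
    while (*read_ptr_copy != new_value) {            /* free each consumed entry */
        free_entry(*read_ptr_copy % ring_entries);
        (*read_ptr_copy)++;
    }
}
```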
In the above embodiments, the implementation of the command issuing method has been fully described through the interaction between the host and the processing device. With this method, commands can be issued in the form of command streams: issuing one command stream delivers multiple commands, which reduces the communication overhead of the link. In addition, the method can deliver multiple command streams to the processing device at the same time in a single communication, further reducing the link's communication overhead and improving the scheduling efficiency of the host.
As shown in Fig. 4, the present disclosure provides a command issuing apparatus, which can perform the command issuing method of any embodiment of the present disclosure. The apparatus may include a command stream generation module 401, an insertion module 402, and a transmission module 403, where:
the command stream generation module 401 is configured to generate at least one command stream according to multiple commands to be issued to a processing device for processing, where each command stream includes at least one command;
the insertion module 402 is configured to insert the at least one command stream into a buffer;
the transmission module 403 is configured to transmit the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
Optionally, when transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, the transmission module 403 is further configured to: when the buffer contains at least two command streams, transmit the at least two command streams to the processing device through one communication over the communication link.
Optionally, as shown in Fig. 5, the apparatus further includes:
a first write pointer update module 501, configured to update the write pointer of the buffer, where the write pointer indicates the current position of write operations on the buffer;
a first pointer information sending module 502, configured to send pointer information of the updated write pointer to the processing device through the communication link, so that the processing device updates the write pointer copy on the processing device side.
Optionally, as shown in Fig. 6, the apparatus further includes:
a second write pointer update module 601, configured to update the write pointer of the buffer, where the write pointer indicates the current position of write operations on the buffer;
a second pointer information sending module 602, configured to send, when the number of write pointer updates of the buffer reaches a preset number, pointer information of the most recently updated write pointer to the processing device.
Optionally, as shown in Fig. 7, the apparatus further includes:
a pointer information receiving module 701, configured to receive pointer information of the read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of read operations on the buffer;
a read pointer copy update module 702, configured to update the read pointer copy on the host side according to the pointer information of the read pointer.
Optionally, the communication link is a PCI-Express link.
As shown in Fig. 8, the present disclosure provides a processing device, which can perform the command issuing method of any embodiment of the present disclosure. The processing device may include a queue memory 801 and a microprocessor 802, where:
the queue memory 801 is configured to store stream queues;
the microprocessor 802 is configured to pull at least one command stream from a buffer on the host side through the communication link between the processing device and a host, and to read the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
Optionally, when pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host, the microprocessor is further configured to: when the host-side buffer contains at least two command streams, pull the at least two command streams from the host-side buffer through one communication over the communication link.
Optionally, as shown in Fig. 9, when reading the pulled at least one command stream into a local stream queue of the processing device, the microprocessor is further configured to: when multiple command streams are pulled from the host-side buffer, read the multiple command streams into different local stream queues of the processing device. The processing device further includes a parallel scheduling module 901, configured to schedule the corresponding computing modules in parallel so as to execute the command streams in the different local stream queues in parallel.
Optionally, the microprocessor is further configured to receive pointer information of the write pointer sent by the host through the communication link, and to update the write pointer copy on the processing device side according to the pointer information of the write pointer.
Optionally, when pulling at least one command stream from the buffer on the host side, the microprocessor is further configured to: determine the number of command streams to be issued in the buffer according to the pointer information of the local read pointer and write pointer copy of the processing device; and, when the buffer contains at least one command stream to be issued, pull at least one command stream from the buffer.
Optionally, when reading the pulled at least one command stream into a local stream queue of the processing device, the microprocessor is further configured to: update the local read pointer of the processing device each time one command stream has been read into a local stream queue; and send pointer information of the updated read pointer to the host, so that the host updates the read pointer copy on the host side.
Optionally, the processing device is an AI chip or a GPU.
Optionally, the communication link is a PCI-Express link.
Since the apparatus embodiments and the processing device embodiments basically correspond to the method embodiments, for relevant details reference may be made to the descriptions of the method embodiments. The apparatus embodiments and processing device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of at least one embodiment of the present disclosure. A person of ordinary skill in the art can understand and implement this without creative effort.
The present disclosure further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can implement the command issuing method of any embodiment of the present disclosure.
FIG. 10 shows a more specific schematic diagram of the hardware structure of a computer device provided by an embodiment of the present disclosure. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 are communicatively connected with one another within the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs so as to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1020 and invoked and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to realize communication interaction between this device and other devices. The communication module may communicate in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, WiFi, Bluetooth).
The bus 1050 includes a path for transferring information between the various components of the device (e.g., the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040).
It should be noted that although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation the device may further include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solutions of the embodiments of this specification, rather than all the components shown in the figure.
The present disclosure further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, can implement the command issuing method of any embodiment of the present disclosure.
The non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like, which is not limited in this application.
In some optional embodiments, an embodiment of the present disclosure provides a computer program product, including computer-readable code. When the computer-readable code runs on a device, a processor in the device executes the command issuing method provided by any of the above embodiments. The computer program product may be implemented by hardware, software, or a combination thereof.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
The above descriptions are merely preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (20)

  1. A command issuing method, characterized in that the method comprises:
    generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command;
    inserting the at least one command stream into a buffer; and
    transmitting the at least one command stream in the buffer to the processing device through a communication link between a host and the processing device.
  2. The method according to claim 1, wherein the transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device comprises:
    in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  3. The method according to claim 1 or 2, wherein after the inserting the at least one command stream into the buffer, the method further comprises:
    updating a write pointer of the buffer, the write pointer being used to indicate a current position of a write operation on the buffer; and
    sending pointer information of the updated write pointer to the processing device through the communication link, so that the processing device updates a write pointer copy on the processing device side.
  4. The method according to claim 1 or 2, wherein after the inserting the at least one command stream into the buffer, the method further comprises:
    updating a write pointer of the buffer, the write pointer being used to indicate a current position of a write operation on the buffer; and
    in the case where the number of updates of the write pointer of the buffer reaches a preset number, sending pointer information of the last-updated write pointer to the processing device through the communication link, so that the processing device updates a write pointer copy on the processing device side.
  5. The method according to any one of claims 1 to 4, wherein the method further comprises:
    receiving pointer information of a read pointer sent by the processing device through the communication link, the read pointer being used to indicate a current position of a read operation on the buffer; and
    updating a read pointer copy on the host side according to the pointer information of the read pointer.
  6. The method according to any one of claims 1 to 5, wherein the communication link is a high-speed serial computer expansion bus standard (PCI-Express) link.
  7. A command issuing method, characterized in that the method comprises:
    pulling at least one command stream from a buffer on a host side through a communication link between a processing device and a host; and
    reading the pulled at least one command stream into a local stream queue of the processing device, the stream queue being used to store command streams to be executed.
  8. The method according to claim 7, wherein the pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host comprises:
    in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  9. The method according to claim 7 or 8, wherein the reading the pulled at least one command stream into the local stream queue of the processing device comprises:
    in the case where a plurality of command streams are pulled from the buffer on the host side, reading the plurality of command streams into different local stream queues of the processing device respectively;
    the method further comprising: executing the command streams in the different local stream queues in parallel.
  10. The method according to any one of claims 7 to 9, wherein the method further comprises:
    receiving pointer information of a write pointer sent by the host through the communication link; and
    updating a write pointer copy on the processing device side according to the pointer information of the write pointer.
  11. The method according to any one of claims 7 to 10, wherein the pulling at least one command stream from the buffer on the host side comprises:
    determining, according to pointer information of the processing device's local read pointer and write pointer copy, the number of command streams to be issued in the buffer; and
    in the case where the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
  12. The method according to any one of claims 7 to 11, wherein the reading the pulled at least one command stream into the local stream queue of the processing device comprises:
    updating the processing device's local read pointer each time after one command stream has been read into the local stream queue of the processing device; and
    sending pointer information of the updated read pointer to the host, so that the host updates a read pointer copy on the host side.
  13. The method according to any one of claims 7 to 12, wherein the communication link is a PCI-Express link.
  14. A command issuing apparatus, characterized in that the apparatus comprises:
    a command stream generation module, configured to generate at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command;
    an insertion module, configured to insert the at least one command stream into a buffer; and
    a transmission module, configured to transmit the at least one command stream in the buffer to the processing device through a communication link between a host and the processing device.
  15. The apparatus according to claim 14, wherein the transmission module, when configured to transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, is further configured to:
    in the case where the buffer includes at least two command streams, transmit the at least two command streams to the processing device through one communication of the communication link.
  16. A processing device, characterized in that the processing device comprises:
    a queue memory, configured to store stream queues; and
    a microprocessor, configured to pull at least one command stream from a buffer on a host side through a communication link between the processing device and a host, and to read the pulled at least one command stream into a local stream queue of the processing device, the stream queue being used to store command streams to be executed.
  17. The processing device according to claim 16, wherein the microprocessor, when configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, is further configured to:
    in the case where the buffer on the host side includes at least two command streams, pull the at least two command streams from the buffer on the host side through one communication of the communication link.
  18. The processing device according to claim 16 or 17, wherein the microprocessor, when configured to read the pulled at least one command stream into the local stream queue of the processing device, is further configured to:
    in the case where a plurality of command streams are pulled from the buffer on the host side, read the plurality of command streams into different local stream queues of the processing device respectively;
    the processing device further comprising:
    a parallel scheduling module, configured to schedule corresponding computing modules in parallel so as to execute the command streams in the different local stream queues in parallel.
  19. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1 to 6, or implements the method according to any one of claims 7 to 13.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6, or implements the method according to any one of claims 7 to 13.
PCT/CN2021/102943 2020-12-11 2021-06-29 Command issuing method and apparatus, processing device, computer device, and storage medium WO2022121287A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011459860.3A CN114626541A (en) 2020-12-11 2020-12-11 Command issuing method, command issuing device, processing equipment, computer equipment and storage medium
CN202011459860.3 2020-12-11

Publications (1)

Publication Number Publication Date
WO2022121287A1 true WO2022121287A1 (en) 2022-06-16

Family

ID=81895512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102943 WO2022121287A1 (en) 2020-12-11 2021-06-29 Command issuing method and apparatus, processing device, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN114626541A (en)
WO (1) WO2022121287A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1495621A (en) * 2002-06-24 2004-05-12 Parallel input/output data transmission controller
CN107209665A (en) * 2015-01-07 2017-09-26 美光科技公司 Produce and perform controlling stream
CN111124993A (en) * 2018-10-31 2020-05-08 伊姆西Ip控股有限责任公司 Method, apparatus and program product for reducing cache data mirroring latency during I/O processing
CN111143234A (en) * 2018-11-02 2020-05-12 三星电子株式会社 Storage device, system including such storage device and method of operating the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OASIS-OPEN.ORG: "Virtual I/O Device (VIRTIO) Version 1.1", 20 December 2018 (2018-12-20), pages 1 - 118, XP055800181, Retrieved from the Internet <URL:https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html> [retrieved on 20210429] *

Also Published As

Publication number Publication date
CN114626541A (en) 2022-06-14

Similar Documents

Publication Publication Date Title
KR102245247B1 (en) GPU remote communication using triggered actions
TWI531958B (en) Mass storage virtualization for cloud computing
US7835897B2 (en) Apparatus and method for connecting hardware to a circuit simulation
JP5137434B2 (en) Data processing apparatus, distributed processing system, data processing method, and data processing program
US9418181B2 (en) Simulated input/output devices
US20180219797A1 (en) Technologies for pooling accelerator over fabric
US10540301B2 (en) Virtual host controller for a data processing system
US8448172B2 (en) Controlling parallel execution of plural simulation programs
CN104094235A (en) Multithreaded computing
US11308008B1 (en) Systems and methods for handling DPI messages outgoing from an emulator system
CN107729050A (en) Real-time system and task construction method based on LET programming models
US8468006B2 (en) Method of combined simulation of the software and hardware parts of a computer system, and associated system
WO2022121287A1 (en) Command issuing method and apparatus, processing device, computer device, and storage medium
JP2007011720A (en) System simulator, system simulation method, control program, and readable recording medium
US11151074B2 (en) Methods and apparatus to implement multiple inference compute engines
CN115168256A (en) Interrupt control method, interrupt controller, electronic device, medium, and chip
US20180011804A1 (en) Inter-Process Signaling Mechanism
CN116711279A (en) System and method for simulation and testing of multiple virtual ECUs
US8572631B2 (en) Distributed control of devices using discrete device interfaces over single shared input/output
US20120065953A1 (en) Computer-readable, non-transitory medium storing simulation program, simulation apparatus and simulation method
US11941722B2 (en) Kernel optimization and delayed execution
WO2023207829A1 (en) Device virtualization method and related device
EP3630318B1 (en) Selective acceleration of emulation
WO2023010232A1 (en) Processor and communication method
CN116933698A (en) Verification method and device for computing equipment, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21901999; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21901999; Country of ref document: EP; Kind code of ref document: A1)