WO2022121287A1 - Command issuing method and apparatus, processing device, computer device, and storage medium - Google Patents

Command issuing method and apparatus, processing device, computer device, and storage medium

Info

Publication number
WO2022121287A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing device
command
buffer
stream
host
Prior art date
Application number
PCT/CN2021/102943
Other languages
English (en)
Chinese (zh)
Inventor
冷祥纶
孙海涛
Original Assignee
上海阵量智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海阵量智能科技有限公司
Publication of WO2022121287A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4022 Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a command issuing method and apparatus, a processing device, a computer device, and a storage medium.
  • AI chips, like graphics processing units (GPUs), are usually used as accelerator cards for the host/CPU.
  • The AI chip or GPU can be called a processing device, which is scheduled and controlled by the host.
  • The present disclosure provides a command issuing method and apparatus, a processing device, a computer device, and a storage medium.
  • A command issuing method, comprising: generating at least one command stream according to a plurality of commands to be issued to a processing device for processing, wherein each command stream includes at least one command; inserting the at least one command stream into a buffer; and transmitting the at least one command stream in the buffer to the processing device through a communication link between the host and the processing device.
  • Transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device includes: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  • The method further includes: updating a write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and sending the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
  • The method further includes: updating a write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer; and, when the number of updates of the write pointer of the buffer reaches a preset number, sending the pointer information of the most recently updated write pointer to the processing device through the communication link, so that the processing device updates the copy of the write pointer on the processing device side.
  • The method further includes: receiving pointer information of a read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of the read operation on the buffer; and updating the copy of the read pointer on the host side according to the pointer information of the read pointer.
  • The communication link is a PCI-Express link (PCI-Express being a high-speed serial computer expansion bus standard).
  • Another method for issuing commands includes: pulling at least one command stream from a buffer on the host side through a communication link between a processing device and a host; and reading the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • Pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  • Reading the pulled at least one command stream into a local stream queue of the processing device includes: in the case where a plurality of command streams are pulled from the buffer on the host side, reading the plurality of command streams into different local stream queues of the processing device respectively; the method further includes: executing the command streams in the different local stream queues in parallel.
  • The method further includes: receiving pointer information of the write pointer sent by the host through the communication link; and updating the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
  • Pulling at least one command stream from the buffer on the host side includes: determining the number of command streams to be issued in the buffer according to the local read pointer of the processing device and the pointer information of the copy of the write pointer; and, when the buffer includes at least one command stream to be issued, pulling at least one command stream from the buffer.
  • Reading the pulled at least one command stream into a local stream queue of the processing device includes: each time one command stream is read into the local stream queue of the processing device, updating the local read pointer of the processing device; and sending the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
  • The communication link is a PCI-Express link.
  • An apparatus for issuing commands, comprising: a command stream generation module configured to generate at least one command stream according to multiple commands to be issued to a processing device for processing, wherein each of the command streams includes at least one command; an inserting module configured to insert the at least one command stream into a buffer; and a transmission module configured to transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • A processing device, comprising: a queue memory for storing stream queues; [...] pulling at least one command stream from a buffer on the host side; and reading the pulled at least one command stream into a local stream queue of the processing device, where the stream queue is used to store command streams to be executed.
  • A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the command issuing method according to any one of the first aspect or the second aspect.
  • A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
  • A computer program product including a computer program, wherein the program, when executed by a processor, implements the command issuing method according to any one of the first aspect or the second aspect.
  • A command stream may be generated from the plurality of commands to be issued to the processing device, and the commands may be issued to the processing device in the form of a command stream.
  • With this command delivery method, multiple commands can be delivered by delivering one command stream, that is, through one communication of the communication link.
  • The communication frequency between the host and the processing device is effectively reduced, the communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • FIG. 1 is a flowchart of a method for issuing commands according to an exemplary embodiment
  • FIG. 2 is a flowchart of another method for issuing commands according to an exemplary embodiment
  • FIG. 3 is an interactive flowchart of a method for issuing commands according to an exemplary embodiment
  • FIG. 4 is a schematic diagram of an apparatus for issuing commands according to an exemplary embodiment
  • FIG. 5 is a schematic diagram of another device for issuing commands according to an exemplary embodiment
  • FIG. 6 is a schematic diagram of another apparatus for issuing commands according to an exemplary embodiment
  • FIG. 7 is a schematic diagram of another apparatus for issuing commands according to an exemplary embodiment
  • FIG. 8 is a schematic diagram of a processing device according to an exemplary embodiment
  • FIG. 9 is a schematic diagram of another processing device according to an exemplary embodiment
  • FIG. 10 is a schematic structural diagram of a computer device according to an exemplary embodiment
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
  • The word "if" as used herein can be interpreted as "at the time of", "when", or "in response to determining".
  • When the host schedules and controls the processing device, it needs to issue operation commands frequently, and the communication link (such as the PCI-Express link) between the host and the processing device needs to transmit a large amount of communication data and model code. As a result, the communication overhead of the communication link is too large and the scheduling efficiency is low.
  • A host generates at least one command stream according to multiple commands to be issued to a processing device, inserts the at least one command stream into a buffer, and transmits the command stream in the buffer to the processing device.
  • FIG. 1 is a flowchart of a method for issuing commands according to an embodiment of the present disclosure. The method is applied to the host. As shown in FIG. 1, the process includes:
  • Step 101: Generate at least one command stream according to multiple commands to be sent to the processing device for processing, wherein each of the command streams includes at least one command.
  • The commands used to generate the command stream are commands that are generated by the host and need to be sent to the processing device for processing.
  • For example, they may be multiple commands generated by multiple processes included in the application layer.
  • For example, the application layer includes a payment application “Pay ⁇ ” and a beauty application “Meitu ⁇ ⁇ ”.
  • The commands generated by the “Pay ⁇ ” process or the “Meitu ⁇ ⁇ ” process need to be sent to a processing device (such as an AI chip) for processing.
  • These commands are therefore the commands to be sent to the processing device for processing.
  • The commands in the command stream may include: operators (kernels) of the deep learning model, data movement (memcpy) commands, and event synchronization commands.
  • At least one command stream may be generated according to multiple commands to be issued.
  • a command stream may include one command, and may also include multiple commands. Commands in the same command stream need to be executed sequentially, and different command streams can be executed in parallel.
  • The command stream here is similar to the "stream" in CUDA (Compute Unified Device Architecture, a computing platform launched by the graphics card manufacturer NVIDIA).
  • For example, the commands to be issued include: command 1, command 2, command 3, command A, command B, and command C.
  • This step can generate a command stream 1 from command 1, command 2, and command 3.
  • Command stream 1 includes: command 1, command 2, and command 3.
  • A command stream A can be generated from command A and command B.
  • Command stream A includes: command A and command B.
  • This step can generate a command stream C from command C.
  • The execution of the respective command streams does not affect each other.
  • The three command streams can be executed in parallel.
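The grouping in this example can be sketched in Python as follows. This is only a minimal illustration, not the disclosed implementation: the `CommandStream` class, the `build_streams` helper, and the idea of tagging each command with a stream identifier are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CommandStream:
    """Hypothetical container: commands inside one stream run in order."""
    stream_id: str
    commands: list = field(default_factory=list)


def build_streams(tagged_commands):
    """Group (stream_id, command) pairs into streams, keeping issue order."""
    streams = {}
    for stream_id, command in tagged_commands:
        stream = streams.setdefault(stream_id, CommandStream(stream_id))
        stream.commands.append(command)
    return list(streams.values())


# The six commands of the example, tagged with the stream they belong to.
streams = build_streams([
    ("1", "command 1"), ("1", "command 2"), ("1", "command 3"),
    ("A", "command A"), ("A", "command B"),
    ("C", "command C"),
])
```

Each resulting stream keeps its commands in issue order, which matches the requirement that commands in the same stream execute sequentially while different streams may run in parallel.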
  • Step 102: Insert the at least one command stream into a buffer.
  • The buffer in this embodiment may be a ring buffer (Ring Buffer). It can be understood that any buffer that meets the usage requirements of this step can serve as the buffer of this embodiment, which is not limited to a ring buffer.
  • A ring buffer is a typical "producer-consumer" model.
  • The host is the producer and can insert command streams into the ring buffer.
  • The processing device is the consumer and can pull command streams from the ring buffer into the local stream queue (Stream Queue).
  • This step may insert one or more command streams into the ring buffer.
  • The driver can create a separate command stream buffer (Stream Buffer) for each command stream, and insert each Stream Buffer into the Ring Buffer while holding a lock.
  • Each Stream Buffer corresponds to one Entry of the Ring Buffer.
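A ring buffer with locked insertion could look roughly like the sketch below. The class name `RingBuffer`, the fixed capacity, and the convention of leaving one entry unused to tell "full" from "empty" are assumptions made for illustration; the disclosure does not specify them.

```python
import threading


class RingBuffer:
    """Hypothetical host-side ring buffer; each entry holds one Stream Buffer."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.capacity = capacity
        self.write_ptr = 0  # next entry the producer (host) writes
        self.read_ptr = 0   # next entry the consumer (processing device) reads
        self.lock = threading.Lock()

    def pending(self):
        """Number of entries written but not yet read."""
        return (self.write_ptr - self.read_ptr) % self.capacity

    def insert(self, stream_buffer):
        """Insert one Stream Buffer under the lock; return False when full."""
        with self.lock:
            if self.pending() == self.capacity - 1:
                return False  # assumption: one entry stays empty when full
            self.entries[self.write_ptr] = stream_buffer
            self.write_ptr = (self.write_ptr + 1) % self.capacity
            return True
```

The lock serializes producers on the host, so several driver threads can insert their Stream Buffers without corrupting the write pointer.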
  • Step 103: Transmit the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • The communication link between the host and the processing device may be PCI-Express (peripheral component interconnect express, a high-speed serial computer expansion bus standard). It can be understood that, in addition to PCI-Express, other types of communication links may also be used between the host and the processing device, which is not limited in the present disclosure.
  • The command stream in the buffer can be transmitted to the processing device through the PCI-Express link.
  • A command stream can be transmitted to the processing device through one PCI-Express communication.
  • In this way, multiple commands can be transmitted to the processing device through a single PCI-Express communication.
  • In the process of transmitting the command stream in the buffer to the processing device, the host may actively send the command stream to the processing device, or the processing device may actively pull the command stream from the buffer.
  • The command stream in the buffer may be transmitted to the processing device in various specific ways, which are not limited in this embodiment.
  • The host can actively send a certain number of command streams in the buffer to the processing device, and the processing device further processes the issued command streams. For example, when the number of command streams in the buffer reaches a preset number, the host may send that preset number of command streams to the processing device at one time through the PCI-Express link.
  • The processing device can also actively pull a certain number of command streams from the buffer.
  • The processing device can poll the pointer information of the buffer on the host side. If there is a command stream to be issued in the buffer, the processing device can pull one command stream into the local stream queue through the PCI-Express link. Alternatively, the processing device can pull multiple command streams from the ring buffer on the host side into the local stream queues at one time through the PCI-Express link.
  • The host can generate at least one command stream according to multiple commands to be issued to the processing device, and issue commands to the processing device in the form of a command stream.
  • Issuing one command stream delivers multiple commands. This reduces the number of communications between the host and the processing device, reduces the communication overhead between the host and the processing device, and improves the scheduling efficiency of the host.
  • The computing power of AI chips has been increasing, and has even reached 256/512 TOPS.
  • If the host cannot issue operation commands to the processing device for scheduling and control in time, the computing power of the processing device cannot be fully utilized, and computing resources are wasted.
  • With the command issuing method of this embodiment, the computing power of the processing device can be more fully utilized.
  • Transmitting the at least one command stream in the buffer to the processing device through the communication link between the host and the processing device may include: in the case where the buffer includes at least two command streams, transmitting the at least two command streams to the processing device through one communication of the communication link.
  • When multiple command streams have been inserted into the buffer, the host can transmit the multiple command streams in the buffer to the processing device in one batch through one communication of the communication link.
  • The host may transmit all the command streams in the buffer to the processing device in one batch through one communication of the communication link.
  • Alternatively, the host may transmit only some of the command streams in the buffer (more than one command stream) to the processing device in one batch through one communication of the communication link.
  • The host can transmit multiple command streams to the processing device at one time through the communication link, which further reduces the number of communications between the host and the processing device.
  • The communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • FIG. 2 is a flowchart of another method for issuing commands according to an embodiment of the present disclosure.
  • The method is applied to the processing device. As shown in FIG. 2, the process includes:
  • Step 201: Pull at least one command stream from a buffer on the host side through the communication link between the processing device and the host.
  • The host generates the command stream to be sent to the processing device for processing, and buffers the command stream in the buffer. For example, the host may buffer the commands to be issued in a ring buffer in the form of command streams.
  • The processing device can pull one command stream from the buffer at a time through the communication link with the host, or pull multiple command streams in one batch.
  • The number of command streams that the processing device pulls from the buffer at one time is determined jointly by the number of command streams to be delivered in the buffer and the number of idle stream queues local to the processing device.
  • For example, the processing device may determine that there is at least one idle stream queue locally. It can then pull a command stream to be issued from the buffer and read it into the corresponding idle stream queue. This completes the delivery of the multiple commands included in this one command stream from the host side to the processing device side.
  • Alternatively, the processing device may determine that there are enough idle stream queues locally. It can then pull multiple command streams from the buffer in one batch and read them into different stream queues respectively. This completes the delivery of the commands included in the multiple command streams from the host side to the processing device side.
  • The processing device needs to determine the number of command streams to be issued in the host-side buffer. The processing device pulls a certain number of command streams from the buffer only when there are command streams to be issued in the buffer and an idle stream queue exists locally on the processing device.
  • To determine the number of command streams to be issued in the host-side buffer, the processing device can poll the read and write pointers of the host-side buffer through the communication link with the host, and determine from these pointers whether there is a command stream to be issued in the buffer.
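The pull decision described above can be written as a small function. This is a sketch under stated assumptions: the function name `streams_to_pull` is hypothetical, pointers are taken to be entry indices that wrap at the buffer capacity, and the batch size is assumed to be the smaller of the pending count and the idle queue count, which the text implies but does not state as a formula.

```python
def streams_to_pull(read_ptr, write_ptr, capacity, idle_queues):
    """Return how many command streams the device may pull in one batch.

    Pending streams are the entries between the read and write pointers,
    computed modulo the ring capacity so that wrap-around is handled.
    """
    pending = (write_ptr - read_ptr) % capacity
    return min(pending, idle_queues)
```

For instance, with a capacity of 8, a read pointer at 6, and a write pointer that has wrapped to 1, three streams are pending; if only two stream queues are idle, the device pulls two.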
  • Step 202: Read the pulled at least one command stream into a local stream queue, where the stream queue is used to store command streams to be executed.
  • The processing device may include stream queues for storing command streams to be executed.
  • The processing device can read multiple command streams pulled from the buffer into different stream queues respectively. The processing device can then use a command distributor to distribute the commands in the stream queues to different computing units for calculation.
  • The processing device may read the multiple command streams into different local stream queues respectively, and execute the command streams in the different stream queues in parallel. This improves the efficiency with which the processing device executes commands.
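The execution model (in order within a stream queue, in parallel across stream queues) can be imitated on a CPU with a thread pool. This is only an analogy for illustration: the helpers `run_stream` and `run_all` are hypothetical, commands are modelled as plain Python callables, and a real processing device would dispatch to hardware computing units rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stream(queue):
    """Execute the commands of one stream queue strictly in order."""
    results = []
    for command in queue:
        results.append(command())
    return results


def run_all(stream_queues):
    """Run every stream queue on its own worker; queues proceed in parallel."""
    workers = max(1, len(stream_queues))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the input queues.
        return list(pool.map(run_stream, stream_queues))
```

A call such as `run_all([[step_a, step_b], [step_c]])` runs `step_a` before `step_b`, while `step_c` may overlap with either of them.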
  • The processing device may pull one command stream from the buffer on the host side at a time through the communication link with the host.
  • Pulling one command stream delivers multiple commands, reducing the number of communications between the host and the processing device.
  • The communication overhead of the communication link between the host and the processing device is reduced, and the scheduling efficiency of the host is improved.
  • The computing power of the processing device can also be more fully utilized.
  • Pulling at least one command stream from the buffer on the host side through the communication link between the processing device and the host includes: in the case where the buffer on the host side includes at least two command streams, pulling the at least two command streams from the buffer on the host side through one communication of the communication link.
  • When the buffer on the host side includes multiple command streams that can be pulled, the processing device can pull multiple command streams from the buffer on the host side in one batch through one communication of the communication link.
  • The processing device may pull all of the command streams in the buffer on the host side in one batch through one communication of the communication link.
  • Alternatively, the processing device may pull only some of the command streams (more than one command stream) from the buffer on the host side in one batch through one communication of the communication link.
  • The processing device can pull multiple command streams from the buffer on the host side at one time, which further reduces the number of communications between the host and the processing device. Therefore, the communication overhead of the communication link between the host and the processing device is greatly reduced, and the scheduling efficiency of the host is improved.
  • The computing power of the processing device can also be more fully utilized.
  • In step 201, the processing device needs to determine the number of command streams to be issued in the host-side buffer.
  • The processing device pulls the command stream from the buffer only when there is a command stream to be issued in the buffer and an idle stream queue exists locally on the processing device.
  • To determine the number of command streams to be issued in the buffer, the processing device needs to obtain the read and write pointers of the buffer. In the related approach, the processing device polls the read and write pointers of the buffer on the host side through the communication link with the host. Obtaining the pointers by polling in this way requires the processing device to access the host a large number of times through the communication link, which adds communication overhead to the link.
  • The present disclosure provides a new pointer acquisition method, which enables the processing device to acquire the read and write pointers of the host-side buffer with fewer communications.
  • Corresponding read and write pointers are set locally on the processing device side, and the read and write pointers on both sides are updated synchronously according to certain rules.
  • The read and write pointers of the buffer on the host side can be stored in the local main memory of the host; the corresponding read and write pointers set on the processing device side can be stored in local registers of the processing device; and the pointers stored on the two sides are synchronized according to certain rules.
  • Since read and write pointers of the buffer are also set on the processing device side, the processing device does not need to access the host side. It only needs to poll its local read and write pointers to determine the number of command streams to be issued in the host-side buffer, which greatly reduces the number of communications through the communication link.
  • The read and write pointers of the buffer may be set in a master-copy manner.
  • On the host side, the write pointer (write-pointer) is the master, and the read pointer (read-pointer) is the copy.
  • On the processing device side, the write pointer (write-pointer) is the copy, and the read pointer (read-pointer) is the master.
  • The write-pointer on the host side can be called the write pointer, and the read-pointer on the host side can be called the copy of the read pointer; the write-pointer on the processing device side can be called the copy of the write pointer, and the read-pointer on the processing device side can be called the read pointer.
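The master-copy arrangement can be pictured as two small records, one per side. The sketch below is illustrative only: the class names `HostPointers` and `DevicePointers`, the field names, and the helper `pending_streams` are assumptions, and a plain assignment stands in for a pointer message sent over the communication link.

```python
from dataclasses import dataclass


@dataclass
class HostPointers:
    write_ptr: int = 0      # master: advanced by the host after each insert
    read_ptr_copy: int = 0  # copy: refreshed from messages sent by the device


@dataclass
class DevicePointers:
    read_ptr: int = 0        # master: advanced by the device after each read
    write_ptr_copy: int = 0  # copy: refreshed from messages sent by the host


def pending_streams(device, capacity):
    """Count streams awaiting pull using only device-local pointer state."""
    return (device.write_ptr_copy - device.read_ptr) % capacity


host, device = HostPointers(), DevicePointers()
host.write_ptr = 3                       # host inserted three command streams
device.write_ptr_copy = host.write_ptr   # stands in for one link message
```

Because `pending_streams` reads only fields of `DevicePointers`, the device can poll for work without touching the link.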
  • After the host inserts at least one command stream into the buffer in step 102, the method further includes:
  • The host updates the write pointer of the buffer, where the write pointer indicates the current position of the write operation on the buffer, and sends the updated pointer information of the write pointer to the processing device through the communication link.
  • The host may send the updated pointer information of the write pointer to the processing device at various times.
  • In one option, every time the host inserts a command stream into the buffer and updates the write pointer, it sends the updated pointer information of the write pointer to the processing device through the communication link.
  • This approach uses the communication link to send the pointer information of the write pointer only when the write pointer of the buffer is updated, which reduces the number of communications.
  • In another option, the host sends the latest pointer information of the write pointer to the processing device only after the write pointer of the buffer has been updated multiple times.
  • The number of write pointer updates may be preset. For example, if the preset number is 8, the pointer information of the write pointer after the 8th update is sent to the processing device only after 8 command streams have been inserted into the buffer and the write pointer has been updated 8 times.
  • In this approach, the pointer information of the most recently updated write pointer is sent to the processing device over the communication link only after the write pointer of the host-side buffer has accumulated a preset number of updates.
  • The number of communications over the communication link is further reduced, and the communication overhead of the communication link is reduced.
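Both notification timings can be captured by one counter. In the sketch below, the class `HostWritePointer`, its `notify_every` parameter, and the `send` callback (standing in for a message over the communication link) are hypothetical names chosen for illustration. Setting `notify_every` to 1 gives the per-update variant; 8 reproduces the example above.

```python
class HostWritePointer:
    """Hypothetical host-side write pointer with batched notification."""

    def __init__(self, capacity, notify_every, send):
        self.capacity = capacity
        self.notify_every = notify_every
        self.send = send          # assumption: callable that uses the link
        self.value = 0
        self.unsent_updates = 0

    def advance(self):
        """Advance after one insert; notify once enough updates accumulate."""
        self.value = (self.value + 1) % self.capacity
        self.unsent_updates += 1
        if self.unsent_updates >= self.notify_every:
            self.send(self.value)  # only the latest pointer value is sent
            self.unsent_updates = 0


sent = []
write_ptr = HostWritePointer(capacity=16, notify_every=8, send=sent.append)
for _ in range(8):
    write_ptr.advance()
```

After these eight inserts exactly one message has been sent, carrying the pointer value 8.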
  • After receiving the pointer information of the write pointer through the communication link, the processing device can update the locally stored copy of the write pointer according to that pointer information.
  • In step 202, each time the processing device reads a command stream into the stream queue, it updates the local read pointer and sends the updated pointer information of the read pointer to the host.
  • The host receives the pointer information of the read pointer sent by the processing device through the communication link, where the read pointer indicates the current position of the read operation on the buffer, and updates the copy of the read pointer on the host side according to that pointer information.
  • After updating the copy of the read pointer on the host side according to the pointer information sent by the processing device, the host can use the copy of the read pointer to release the command streams that the processing device has already read into its stream queues, thereby freeing buffer space.
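The release step on the host could be sketched as follows. The function name `release_consumed` and the representation of the buffer as a Python list with `None` marking a free entry are assumptions made for illustration.

```python
def release_consumed(entries, read_ptr_copy, new_read_ptr):
    """Free entries the device has read and return the updated copy.

    Walks from the host's old copy of the read pointer to the value just
    received from the device, wrapping at the end of the buffer.
    """
    capacity = len(entries)
    freed = 0
    while read_ptr_copy != new_read_ptr:
        entries[read_ptr_copy] = None  # entry can now be reused by the host
        read_ptr_copy = (read_ptr_copy + 1) % capacity
        freed += 1
    return read_ptr_copy, freed
```

With four entries, an old copy of 3, and a received read pointer of 1, the entries at indices 3 and 0 are freed and the copy becomes 1.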
  • two sets of read and write pointers are set on both sides of the host and the processing device in a master-copy manner, and the read and write pointers stored on both sides can be updated according to the method of the above embodiment.
  • the processing device does not need to access the host through the communication link for pointer polling, but only needs to poll the local copy of the read pointer and the write pointer, and based on the local copy of the read pointer and the write pointer, it can determine whether the buffer is in the buffer. There are command streams to be delivered, and the number of command streams to be delivered is determined.
  • the processing device may pull one or more command streams from the buffer upon determining that there is at least one idle stream queue locally.
  • since the processing device only needs to poll the read and write pointers stored locally, it does not need to frequently access the host to poll the read and write pointers of the buffer. This greatly reduces the number of times the processing device accesses the host through the communication link, which effectively reduces the communication overhead between the host and the processing device and improves the scheduling efficiency of the host.
  • for the command issuing method, refer to the interaction flowchart shown in FIG. 3.
  • the command issuing method is described in the form of interaction between the host and the processing device.
  • Step 301 the host generates at least one command stream according to multiple commands to be sent to the processing device for processing.
  • the host can generate at least one command stream according to multiple commands to be issued. For example, multiple commands that need to be executed in sequence can be generated as a complete command stream; or, a command that needs to be executed individually can be generated as a complete command stream.
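  • the grouping described in step 301 can be sketched as follows. This is only an illustration: the function name `build_command_streams` and the representation of a command as a `(name, chain)` pair, where commands sharing a chain identifier must run in sequence, are assumptions and are not taken from the disclosure.

```python
def build_command_streams(commands):
    """Group commands into command streams.

    Each command is a (name, chain) pair (a hypothetical representation).
    Commands sharing a chain id must execute in sequence and form one
    stream; a command whose chain is None is executed individually and
    becomes a stream containing just that command.
    """
    streams = []
    by_chain = {}
    for name, chain in commands:
        if chain is None:
            streams.append([name])
        elif chain in by_chain:
            by_chain[chain].append(name)      # keep issue order within a chain
        else:
            by_chain[chain] = [name]
            streams.append(by_chain[chain])   # stream position = first command
    return streams


commands = [("load", "c0"), ("memset", None), ("conv", "c0"), ("store", "c0")]
streams = build_command_streams(commands)
```

  • here the three chained commands end up in a single command stream and the independent command forms its own stream, so every stream contains at least one command.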
  • Step 302 the host inserts at least one command stream into the buffer.
  • the buffer can play the role of temporarily buffering the command stream, so that when a command needs to be issued, multiple command streams temporarily buffered in the buffer can be delivered to the processing device in batches at one time.
  • Step 303 the host updates the write pointer of the buffer.
  • after the host inserts the command stream into the buffer, the corresponding write pointer of the buffer needs to be updated.
  • the host can perform multiple write operations according to the constantly updated pointer information of the write pointer, and insert the command stream into the buffer.
  • Step 304 the host sends the updated pointer information of the write pointer to the processing device.
  • since the read and write pointers of the buffer are set in a master-copy manner, after the host updates the write pointer of the buffer, it needs to send the pointer information of the write pointer to the processing device so that the copy of the write pointer on the processing device side is updated accordingly.
  • in one example, each time the write pointer is updated, the pointer information of the updated write pointer may be sent to the processing device.
  • in another example, the pointer information of the final write pointer after multiple updates may be sent to the processing device. This further reduces the number of times the host sends pointer information to the processing device and reduces the communication overhead of the communication link between the two.
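  • the second option, sending only the final write pointer after several updates, might look like the sketch below. The class `WritePointerNotifier`, the `send` callable standing in for one transfer over the communication link, and the explicit `flush` method are hypothetical names introduced for illustration.

```python
class WritePointerNotifier:
    """Coalesces write pointer updates into fewer link transfers."""

    def __init__(self, send, threshold):
        self.send = send            # stand-in for one communication on the link
        self.threshold = threshold  # preset number of updates per notification
        self.write_ptr = 0
        self.pending_updates = 0

    def advance(self):
        """Called after each insertion of a command stream into the buffer."""
        self.write_ptr += 1
        self.pending_updates += 1
        if self.pending_updates >= self.threshold:
            self.flush()

    def flush(self):
        """Send the latest pointer value if anything is still unsent."""
        if self.pending_updates:
            self.send(self.write_ptr)
            self.pending_updates = 0


sent = []
notifier = WritePointerNotifier(send=sent.append, threshold=4)
for _ in range(10):
    notifier.advance()
notifier.flush()   # make sure the last two updates reach the device
```

  • ten pointer updates result in three transfers instead of ten. The final `flush` is a design choice of this sketch so that a partially filled batch is not withheld indefinitely.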
  • Step 305 the processing device updates the copy of the write pointer on the processing device side.
  • after receiving the pointer information of the write pointer sent by the host, the processing device needs to update its local copy of the write pointer according to the received pointer information.
  • Step 306 The processing device determines the number of command streams to be issued in the buffer according to the pointer information of the local read pointer and the copy of the write pointer.
  • the processing device can directly access the local pointer information to determine the number of command streams to be issued in the buffer on the host side. Since the processing device does not need to access the pointers on the host side, the number of communications over the communication link is greatly reduced compared with the processing device polling the buffer pointers on the host side, and the communication overhead of the communication link is reduced.
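  • a minimal sketch of step 306 is given below, assuming the buffer is a ring whose pointers are slot indices that wrap at the buffer capacity, with one slot always left empty so that equal pointers mean an empty buffer. The function name `pending_streams` and this pointer convention are assumptions; the disclosure does not state how the pointers are encoded.

```python
def pending_streams(read_ptr, write_ptr_copy, capacity):
    """Number of command streams the host has written but the device has
    not yet read, computed purely from device-local pointer values.

    Assumes both pointers are slot indices in [0, capacity) and that the
    host keeps one slot empty, so equal pointers always mean "empty".
    """
    return (write_ptr_copy - read_ptr) % capacity


no_wrap = pending_streams(read_ptr=2, write_ptr_copy=5, capacity=8)
wrapped = pending_streams(read_ptr=6, write_ptr_copy=1, capacity=8)
empty = pending_streams(read_ptr=4, write_ptr_copy=4, capacity=8)
```

  • the modulo handles the case where the write pointer has wrapped past the end of the buffer while the read pointer has not; no access to the host is needed for the computation.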
  • Step 307 the host transmits at least one command stream in the buffer to the processing device.
  • the processing device may actively pull a certain number of command streams from the host-side buffer. For example, all command streams in the buffer can be pulled to the processing device at once. In this way, through one communication of the communication link, batches of multiple command streams can be issued, and the communication overhead of the communication link can be reduced.
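  • the effect of pulling all buffered command streams in one communication can be illustrated with a small counter. The `Link` class below merely counts transfers and stands in for the real communication link; it and the two pull functions are hypothetical and only serve to compare the two strategies.

```python
class Link:
    """Counts communications over the host-device link (illustrative stand-in)."""

    def __init__(self):
        self.transfers = 0

    def transfer(self, payload):
        self.transfers += 1
        return list(payload)   # the device receives a copy of the payload


def pull_all(link, host_buffer):
    """Pull every buffered command stream with a single communication."""
    return link.transfer(host_buffer)


def pull_one_by_one(link, host_buffer):
    """Baseline for comparison: one communication per command stream."""
    pulled = []
    for stream in host_buffer:
        pulled.extend(link.transfer([stream]))
    return pulled


host_buffer = [["load", "conv"], ["memset"], ["store"]]
batched_link, single_link = Link(), Link()
batched = pull_all(batched_link, host_buffer)
single = pull_one_by_one(single_link, host_buffer)
```

  • both strategies deliver the same command streams, but the batched pull uses one communication where the per-stream pull uses three.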
  • Step 308 the processing device reads at least one command stream into a local stream queue.
  • after the processing device pulls the command stream, it needs to read the command stream into the local stream queue, which stores the pulled command stream. The processing device can then use the command distributor to distribute the commands in the stream queue to different computing units for calculation.
  • Step 309 After each time the processing device reads a command stream into the local stream queue, it updates the local read pointer.
  • Step 310 the processing device sends the updated pointer information of the read pointer to the host.
  • Step 311 the host updates the copy of the read pointer on the host side.
  • once the processing device has read a command stream into the local stream queue, the buffer location that stored this command stream on the host side can be released.
  • to this end, the processing device updates its local read pointer and sends the pointer information of the updated read pointer to the host.
  • after the host receives the pointer information, it updates its local copy of the read pointer accordingly.
  • the host can release the cache of the corresponding position in the host-side buffer according to the update of the read pointer copy.
  • in this embodiment, the implementation process of the command issuing method is described in full in terms of the interaction between the host and the processing device.
  • the command issuing method can be used to issue commands in the form of a command stream, and a single command stream issuing can realize the issuing of multiple commands, thereby reducing the communication overhead of the communication link.
  • the method can send multiple command streams to the processing device at the same time by one communication, which further reduces the communication overhead of the communication link and improves the scheduling efficiency of the host.
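  • steps 301 to 311 can be put together in a single-process simulation such as the one below. It is a sketch under simplifying assumptions: the host buffer is an unbounded Python list indexed by pointer value, the communication link is replaced by direct assignments, and the names `Host`, `Device` and `issue` are invented for illustration.

```python
from collections import deque


class Host:
    def __init__(self):
        self.buffer = []          # buffered command streams, indexed by pointer
        self.write_ptr = 0        # master write pointer
        self.read_ptr_copy = 0    # copy of the device's read pointer


class Device:
    def __init__(self):
        self.stream_queue = deque()
        self.read_ptr = 0         # master read pointer
        self.write_ptr_copy = 0   # copy of the host's write pointer


def issue(host, device, streams):
    """Run one round of steps 301-311; returns the number of streams pulled."""
    # Steps 301-303: insert each generated stream and advance the write pointer.
    for stream in streams:
        host.buffer.append(list(stream))
        host.write_ptr += 1
    # Steps 304-305: one pointer message updates the device-side copy.
    device.write_ptr_copy = host.write_ptr
    # Step 306: the device counts pending streams from local values only.
    pending = device.write_ptr_copy - device.read_ptr
    # Step 307: all pending streams cross the link as one batch.
    batch = host.buffer[device.read_ptr:device.read_ptr + pending]
    # Steps 308-309: read each stream into the queue, advancing the read pointer.
    for stream in batch:
        device.stream_queue.append(stream)
        device.read_ptr += 1
    # Steps 310-311: the host updates its copy and releases the consumed slots.
    for i in range(host.read_ptr_copy, device.read_ptr):
        host.buffer[i] = None
    host.read_ptr_copy = device.read_ptr
    return pending


host, device = Host(), Device()
first = issue(host, device, [["a", "b"], ["c"]])
second = issue(host, device, [["d"]])
```

  • after each round the host-side copy of the read pointer equals the write pointer, meaning every buffered command stream has been delivered and its slot released.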
  • the present disclosure provides a command issuing apparatus, and the apparatus can execute the command issuing method of any embodiment of the present disclosure.
  • the apparatus may include a command stream generation module 401, an insertion module 402 and a transmission module 403, wherein:
  • the command stream generation module 401 is configured to generate at least one command stream according to multiple commands to be sent to the processing device for processing; wherein, each of the command streams includes at least one command;
  • the insertion module 402 is configured to insert the at least one command stream into a buffer;
  • the transmission module 403 is configured to transmit at least one command stream in the buffer to the processing device through the communication link between the host and the processing device.
  • when the transmission module 403 is configured to transmit at least one command stream in the buffer to the processing device through the communication link between the host and the processing device, it is further configured to: in the case where the buffer includes at least two command streams, transmit the at least two command streams to the processing device through one communication of the communication link.
  • the device further includes:
  • a first write pointer update module 501 configured to update the write pointer of the buffer, where the write pointer is used to indicate the current position of the write operation to the buffer;
  • the first pointer information sending module 502 is configured to send the updated pointer information of the write pointer to the processing device through the communication link, so that the processing device can update the copy of the write pointer on the processing device side.
  • the device further includes:
  • a second write pointer update module 601, configured to update the write pointer of the buffer, where the write pointer is used to indicate the current position of the write operation to the buffer;
  • the second pointer information sending module 602 is configured to send the last updated pointer information of the write pointer to the processing device when the number of updates of the write pointer of the buffer reaches a preset number of times.
  • the device further includes:
  • a pointer information receiving module 701 configured to receive pointer information of a read pointer sent by the processing device through the communication link, where the read pointer is used to indicate the current position of the read operation on the buffer;
  • the read pointer copy update module 702 is configured to update the read pointer copy on the host side according to the pointer information of the read pointer.
  • the communication link is a PCI-Express link.
  • the present disclosure provides a processing device, and the processing device can execute the command issuing method of any embodiment of the present disclosure.
  • the processing device may include a queue memory 801 and a microprocessor 802, wherein:
  • the queue memory 801 is used for storing stream queues;
  • the microprocessor 802 is configured to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, and to read the pulled at least one command stream into the local stream queue of the processing device, where the stream queue is used to store the command streams to be executed.
  • when the microprocessor is used to pull at least one command stream from the buffer on the host side through the communication link between the processing device and the host, it is further configured to: when the buffer on the host side includes at least two command streams, pull the at least two command streams from the buffer on the host side through one communication of the communication link.
  • when the microprocessor is configured to read the pulled at least one command stream into a local stream queue of the processing device, it is further configured to: in the case of pulling multiple command streams from the buffer on the host side, read the multiple command streams into different local stream queues of the processing device respectively; the processing device further includes a parallel scheduling module 901, which is used to schedule the corresponding computing modules to execute the command streams in the different local stream queues in parallel.
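  • the idea of executing command streams from different stream queues in parallel, while keeping the order of commands inside each stream, can be sketched in software as follows. Threads from Python's standard `concurrent.futures` module stand in for the computing modules, and commands are modelled as plain callables; both are assumptions for illustration and say nothing about how the parallel scheduling module 901 is actually built.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stream(stream):
    """Execute the commands of one stream strictly in order.

    A command is modelled as a zero-argument callable (hypothetical)."""
    return [command() for command in stream]


def run_in_parallel(stream_queues):
    """Run each stream queue on its own worker; results keep queue order."""
    workers = max(1, len(stream_queues))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_stream, stream_queues))


queues = [
    [lambda: "load", lambda: "conv"],   # stream queue 0: two ordered commands
    [lambda: "memset"],                 # stream queue 1: independent command
]
results = run_in_parallel(queues)
```

  • commands within one stream queue never overtake each other, whereas different stream queues may make progress at the same time.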
  • the microprocessor is further configured to receive pointer information of the write pointer sent by the host through the communication link; and update the copy of the write pointer on the processing device side according to the pointer information of the write pointer.
  • when the microprocessor is used to pull at least one command stream from the buffer on the host side, it is further configured to: determine the number of command streams to be issued in the buffer according to the pointer information of the local read pointer and the copy of the write pointer of the processing device; and when the buffer includes at least one command stream to be issued, pull at least one command stream from the buffer.
  • when the microprocessor is configured to read the pulled at least one command stream into the local stream queue of the processing device, it is further configured to: each time one command stream is read into the local stream queue of the processing device, update the local read pointer of the processing device; and send the updated pointer information of the read pointer to the host, so that the host can update the copy of the read pointer on the host side.
  • the processing device is an AI chip or a GPU.
  • the communication link is a PCI-Express link.
  • since the apparatus embodiment and the processing device embodiment basically correspond to the method embodiment, reference may be made to the description of the method embodiment for the related parts.
  • the apparatus embodiments and processing device embodiments described above are only illustrative, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of at least one embodiment of the present disclosure. Those of ordinary skill in the art can understand and implement it without creative effort.
  • the present disclosure also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor can implement the command issuing method of any embodiment of the present disclosure when executing the program.
  • the device may include: a processor 1010 , a memory 1020 , an input/output interface 1030 , a communication interface 1040 and a bus 1050 .
  • the processor 1010 , the memory 1020 , the input/output interface 1030 and the communication interface 1040 realize the communication connection among each other within the device through the bus 1050 .
  • the processor 1010 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs to implement the technical solutions provided by the embodiments of this specification.
  • the memory 1020 may be implemented in the form of a ROM (Read Only Memory, read-only memory), a RAM (Random Access Memory, random access memory), a static storage device, a dynamic storage device, and the like.
  • the memory 1020 may store an operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, relevant program codes are stored in the memory 1020 and invoked by the processor 1010 for execution.
  • the input/output interface 1030 is used to connect the input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1040 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices.
  • the communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
  • Bus 1050 includes a path to transfer information between the various components of the device (eg, processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
  • although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may only include components necessary to implement the solutions of the embodiments of the present specification, rather than all the components shown in the figures.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the command issuing method of any embodiment of the present disclosure can be implemented.
  • the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc., which is not limited in this application.
  • embodiments of the present disclosure provide a computer program product comprising computer-readable code; when the computer-readable code is executed on a device, a processor in the device executes the method of any of the above implementations.
  • the computer program product can be specifically implemented by hardware, software or a combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Information Transfer Systems (AREA)

Abstract

A command issuing method and apparatus, a processing device, a computer device and a storage medium are described. The method comprises: generating at least one command stream on the basis of a plurality of commands to be issued to a processing device for processing, each command stream comprising at least one command (101); inserting the at least one command stream into a buffer (102); and transmitting the at least one command stream in the buffer to the processing device by means of the communication link between a host and the processing device (103). The communication overhead between the host and the processing device is reduced, and the scheduling efficiency of the host is increased.
PCT/CN2021/102943 2020-12-11 2021-06-29 Command issuing method and apparatus, processing device, computer device and storage medium WO2022121287A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011459860.3 2020-12-11
CN202011459860.3A CN114626541A (zh) 2020-12-11 2020-12-11 Command issuing method and apparatus, processing device, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2022121287A1 true WO2022121287A1 (fr) 2022-06-16

Family

ID=81895512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/102943 WO2022121287A1 (fr) 2020-12-11 2021-06-29 Command issuing method and apparatus, processing device, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114626541A (fr)
WO (1) WO2022121287A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
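CN1495621A (zh) * 2002-06-24 2004-05-12 �¿���˹�����ɷ����޹�˾ Parallel input/output data transmission controller
CN107209665A (zh) * 2015-01-07 2017-09-26 美光科技公司 Generating and executing a control flow
CN111124993A (zh) * 2018-10-31 2020-05-08 伊姆西Ip控股有限责任公司 Method, device and program product for reducing the latency of cache data mirroring during I/O processing
CN111143234A (zh) * 2018-11-02 2020-05-12 三星电子株式会社 Storage device, system including such a storage device, and operating method thereof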


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OASIS-OPEN.ORG: "Virtual I/O Device (VIRTIO) Version 1.1", 20 December 2018 (2018-12-20), pages 1 - 118, XP055800181, Retrieved from the Internet <URL:https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html> [retrieved on 20210429] *

Also Published As

Publication number Publication date
CN114626541A (zh) 2022-06-14

Similar Documents

Publication Publication Date Title
KR102245247B1 (ko) GPU remote communication using triggered operations
TWI531958B (zh) Mass storage virtualization for cloud computing
US7835897B2 (en) Apparatus and method for connecting hardware to a circuit simulation
US9418181B2 (en) Simulated input/output devices
US20180219797A1 (en) Technologies for pooling accelerator over fabric
US10540301B2 (en) Virtual host controller for a data processing system
US8448172B2 (en) Controlling parallel execution of plural simulation programs
CN104094235A (zh) Multithreaded computing
US11308008B1 (en) Systems and methods for handling DPI messages outgoing from an emulator system
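CN107729050A (zh) Real-time system and task construction method based on the LET programming model
CN114827048B (zh) Dynamically configurable high-performance queue scheduling method, system, processor and protocol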
US20100332209A1 (en) Method of combined simulation of the software and hardware parts of a computer system, and associated system
CN115168256A (zh) Interrupt control method, interrupt controller, electronic device, medium and chip
WO2022121287A1 (fr) Command issuing method and apparatus, processing device, computer device and storage medium
JP2007011720A (ja) System simulator, system simulation method, control program and readable recording medium
US11151074B2 (en) Methods and apparatus to implement multiple inference compute engines
US20180011804A1 (en) Inter-Process Signaling Mechanism
US11392406B1 (en) Alternative interrupt reporting channels for microcontroller access devices
US8572631B2 (en) Distributed control of devices using discrete device interfaces over single shared input/output
US11941722B2 (en) Kernel optimization and delayed execution
WO2023207829A1 (fr) Device virtualization method and related device
EP3630318B1 (fr) Selective acceleration of emulation
WO2023010232A1 (fr) Processor and communication method
CN116933698A (zh) Verification method and apparatus for a computing device, electronic device and storage medium
KR20220069773A (ko) Electronic device and control method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901999

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901999

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 231123)