CN112860321A - Command issuing method, processing device and storage medium - Google Patents

Command issuing method, processing device and storage medium Download PDF

Info

Publication number
CN112860321A
CN112860321A
Authority
CN
China
Prior art keywords
command
stream
sub
queue
issued
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110130220.6A
Other languages
Chinese (zh)
Inventor
冷祥纶
孙海涛
郭文
殷文达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd filed Critical Shanghai Power Tensors Intelligent Technology Co Ltd
Priority to CN202110130220.6A priority Critical patent/CN112860321A/en
Publication of CN112860321A publication Critical patent/CN112860321A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present disclosure provides a command issuing method, a processing device, and a storage medium. The method includes: in response to the length of a command stream to be issued being greater than a preset length threshold, issuing a first sub-command stream, corresponding to the length threshold, of the command stream to be issued to a preset stream queue, the command stream to be issued including at least one command to be issued; and issuing the remaining second sub-command stream, other than the first sub-command stream, of the command stream to be issued to a device memory.

Description

Command issuing method, processing device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a command issuing method, a processing device, and a storage medium.
Background
In the deep learning field, some chips, such as GPUs, are usually used as accelerator cards for a host/CPU. Such a chip or GPU may be referred to as a processing device, and is scheduled and controlled by the host.
As the amount of data to be processed grows, the host needs to transmit large amounts of data and frequently issue operation commands when scheduling and controlling the processing device. How to improve command issuing efficiency has therefore become a practical problem.
Disclosure of Invention
The disclosure provides a command issuing method, a processing device and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a command issuing method, the method including: in response to the length of a command stream to be issued being greater than a preset length threshold, issuing a first sub-command stream, corresponding to the length threshold, of the command stream to be issued to a preset stream queue, the command stream to be issued including at least one command to be issued; and issuing the remaining second sub-command stream, other than the first sub-command stream, of the command stream to be issued to the device memory.
In some optional embodiments, issuing the remaining second sub-command stream, other than the first sub-command stream, of the command stream to be issued to the device memory includes: sending the second sub-command stream to an extended queue created in the device memory.
In some optional embodiments, the length threshold is less than or equal to the queue length of the stream queue; the queue length is the maximum length of commands that the stream queue can store.
In some optional embodiments, issuing the first sub-command stream corresponding to the length threshold to the preset stream queue includes: issuing the first sub-command stream to the preset stream queue through one DMA descriptor in a chained DMA manner; and issuing the remaining second sub-command stream, other than the first sub-command stream, to the device memory includes: issuing the second sub-command stream into the device memory through another DMA descriptor.
In some optional embodiments, the method further includes: sequentially distributing the commands stored in the stream queue to corresponding arithmetic units for processing; and loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before all the commands stored in the stream queue have been distributed.
In some optional embodiments, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before all the commands stored in the stream queue have been distributed includes: during distribution of the commands stored in the stream queue, each time a preset first number of commands has been distributed, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue.
According to a second aspect of the embodiments of the present disclosure, there is provided a processing device comprising: a preset stream queue, a device memory, a first sub-command stream issuing module, and a second sub-command stream issuing module. The first sub-command stream issuing module is configured to, in response to the length of a command stream to be issued being greater than a preset length threshold, issue a first sub-command stream, corresponding to the length threshold, of the command stream to be issued to the stream queue, the command stream to be issued including at least one command to be issued; and the second sub-command stream issuing module is configured to issue the remaining second sub-command stream, other than the first sub-command stream, of the command stream to be issued to the device memory.
In some optional embodiments, the second sub-command stream issuing module, when issuing the remaining second sub-command stream to the device memory, is configured to: send the second sub-command stream to an extended queue created in the device memory.
In some optional embodiments, the length threshold is less than or equal to the queue length of the stream queue; the queue length is the maximum length of commands that the stream queue can store.
In some optional embodiments, the first sub-command stream issuing module, when issuing the first sub-command stream corresponding to the length threshold to the stream queue, is configured to: issue the first sub-command stream to the stream queue through one DMA descriptor in a chained DMA manner; and the second sub-command stream issuing module, when issuing the remaining second sub-command stream to the device memory, is configured to: issue the second sub-command stream into the device memory through another DMA descriptor.
In some optional embodiments, the processing device further comprises: arithmetic units for processing the distributed commands; a command distributor for sequentially distributing the commands stored in the stream queue to the corresponding arithmetic units for processing; and a micro control unit for loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before the command distributor has finished distributing the commands stored in the stream queue.
In some optional embodiments, the micro control unit, when loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before the command distributor has finished distributing the commands stored in the stream queue, is configured to: during distribution of the commands stored in the stream queue, each time a preset first number of commands has been distributed, load at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue.
In some alternative embodiments, the processing device comprises a chip.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the command issuing method of any one of the first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the command issuing method of any one of the first aspects.
According to a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device comprising a host and a processing device;
the host is configured to send a command stream to be issued to the processing device;
the processing device is configured to: in response to the length of the command stream to be issued being greater than a preset length threshold, issue a first sub-command stream, corresponding to the length threshold, of the command stream to be issued to a preset stream queue, the command stream to be issued including at least one command to be issued;
and issue the remaining second sub-command stream, other than the first sub-command stream, of the command stream to be issued to the device memory.
In the embodiments of the present disclosure, when the length of the command stream to be issued is greater than the preset length threshold, the first sub-command stream of the command stream may be issued to the stream queue, and the second sub-command stream may be issued to the device memory. With this issuing mode, the length of the issued command stream is not limited by the queue length of the stream queue, and a command stream of any length can be issued at one time. A command stream longer than the queue length of the stream queue need not be issued in multiple batches, which effectively prevents the host from being blocked and thereby improves command issuing efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method for command issuing in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a chip architecture according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a flow queue shown in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram of a processing device according to an exemplary embodiment;
FIG. 5 is a schematic view of yet another processing device shown in accordance with an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The specific manner described in the following exemplary embodiments does not represent all aspects consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
With the widespread application of artificial intelligence, deep learning models and data volumes continue to grow. When the host schedules and controls the processing device, the communication link between them (such as a PCI-Express link) has to transmit large amounts of data and model code, and operation commands have to be issued frequently, so the communication overhead of the link becomes too large and host scheduling efficiency is low.
Based on the above, in the related art, the host may generate a command stream from a plurality of commands to be issued to the processing device and transmit the generated command stream to the processing device over the communication link. By issuing commands as a command stream, a plurality of commands can be issued to the processing device in a single command-stream transfer. This reduces the communication frequency and the communication overhead of the link between the host and the processing device, and improves the scheduling efficiency of the host.
When commands are issued between the host and the processing device in the form of command streams, the processing device needs to store the complete command stream in a preset stream queue after receiving it. The stream queue set in the processing device is a hardware queue, and its queue length is fixed once the processing device has been designed and manufactured. Issuing commands in the form of a command stream is therefore limited by the queue length of the stream queue set in the processing device. When the length of the command stream generated by the host is greater than the queue length of the stream queue, the stream queue cannot store the complete command stream, so a complete command stream has to be issued in multiple batches. This complicates the command issuing procedure between the host and the processing device and reduces command issuing efficiency.
Based on the above, the present disclosure provides a command issuing method: when the length of the command stream to be issued is greater than a preset length threshold, a first sub-command stream of the command stream to be issued is issued to a stream queue, and the remaining second sub-command stream is issued to the device memory.
With this command issuing method, for a command stream whose length is greater than the queue length of the stream queue, the portion exceeding the queue length can be issued to the device memory as the second sub-command stream. The length of the issued command stream is thus not limited by the queue length of the stream queue, and a command stream of any length can be issued at one time, without issuing commands in multiple batches. This effectively prevents the host from being blocked and improves command issuing efficiency.
In order to make the command issuing method provided by the present disclosure clearer, the following describes in detail the scheme execution process provided by the present disclosure with reference to the drawings and the specific embodiments.
Referring to fig. 1, fig. 1 is a flowchart illustrating a command issuing method according to an embodiment of the disclosure. The method may be performed by a processing device; illustratively, the processing device may be an AI chip, a GPU, or the like. As shown in fig. 1, the process includes:
Step 101: in response to the length of a command stream to be issued being greater than a preset length threshold, issue a first sub-command stream, corresponding to the length threshold, of the command stream to be issued to a preset stream queue; the command stream to be issued includes at least one command to be issued.
In the embodiment of the present disclosure, the command to be issued includes a command generated by the host and to be issued to the processing device for processing. For example, the command to be issued may be a command generated by a plurality of processes included in the application layer.
For example, assume that the application layer includes a payment application "Pay X" and a photo-retouching application "Beauty X". While these two applications are in use, the commands generated by the "Pay X" process or the "Beauty X" process need to be issued to a processing device (e.g., an AI chip) for processing. These are the commands to be issued to the processing device.
For example, in the deep learning field, when an AI chip is used as the processing device, the commands to be issued in the command stream to be issued may include, but are not limited to, at least one of the following: operators (kernels) of the deep learning model, data movement (memcpy) commands, and event synchronization commands.
In some optional embodiments, the host may generate at least one command stream to be issued from the at least one command to be issued. One command stream to be issued may include a single command to be issued or a plurality of commands to be issued. In addition, commands in the same command stream need to be executed in sequence, while different command streams can be executed in parallel.
The command stream here is similar to a "stream" in CUDA (Compute Unified Device Architecture, a computing platform from the graphics card vendor NVIDIA).
For example, assume that the commands to be issued include command 1, command 2, command 3, command A, command B, and command C. If command 1, command 2, and command 3 need to be executed in sequence, the host may group them into a command stream 1, which includes command 1, command 2, and command 3. Similarly, if command A and command B need to be executed in sequence, the host may group them into a command stream A, which includes command A and command B. If command C is not related to the execution of the other commands, the host may generate a command stream C containing command C alone. Once command stream 1, command stream A, and command stream C are generated, the execution of the respective streams does not affect one another; for example, the three command streams may be executed in parallel.
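The grouping described above can be sketched as follows; the `CommandStream` class and `build_streams` helper are illustrative names for this sketch, not part of the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class CommandStream:
    # Commands within one stream execute in order; distinct streams may run in parallel.
    commands: list = field(default_factory=list)

def build_streams(dependency_groups):
    # Each ordered group of mutually dependent commands becomes one command stream.
    return [CommandStream(commands=list(group)) for group in dependency_groups]

# The example from the text: {1, 2, 3} are sequential, {A, B} are sequential, C is independent.
streams = build_streams([["cmd1", "cmd2", "cmd3"], ["cmdA", "cmdB"], ["cmdC"]])
```

Each resulting stream preserves its internal order, while the three streams themselves carry no mutual ordering constraint.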
In the disclosed embodiment, an appropriate length threshold may be set in advance. The length threshold may be used to describe or measure the length of the command stream, and the length threshold may be flexibly set in various different forms, which is not limited in this embodiment.
For example, the number of commands may be used as the statistical unit of length, and a certain number of commands may be set as the length threshold, e.g., 100 commands. Alternatively, the byte may be used as the statistical unit, and a certain byte length may be set as the length threshold, e.g., 64 bytes. Similarly, the bit may be used as the statistical unit, e.g., a length threshold of 512 bits.
In one possible implementation, the length threshold is equal to the queue length of the stream queue; the queue length is the maximum length of commands that the stream queue can store. It will be appreciated that, similar to the setting of the length threshold, the queue length of the stream queue of the processing device may be measured in different units.
In this implementation, the length threshold may be preset to the queue length of the stream queue of the processing device. For example, if the queue length of the stream queue is 100 commands, the length threshold can be set to 100 commands; or, if the queue length of the stream queue is 64 bytes, the length threshold can be set to 64 bytes.
In another possible implementation, the length threshold is less than the queue length of the stream queue. For example, if the queue length of the stream queue is 100 commands, the length threshold can be set to 99 commands; or, if the queue length of the stream queue is 64 bytes, the length threshold can be set to 56 bytes.
In this step, it is determined whether the length of the command stream to be issued is greater than the preset length threshold.
In one possible implementation, the processing device may obtain the length of the command stream to be issued by accessing the host, and compare that length with the preset length threshold to determine whether it exceeds the threshold. In another possible implementation, the processing device may determine the length of the command stream currently being issued while receiving it, and compare that length with the preset length threshold to determine whether it exceeds the threshold.
In the embodiment of the present disclosure, a stream queue needs to be preset in the processing device. The stream queue has a certain queue length and can store a command stream of a certain length. Multiple stream queues may be provided in a processing device to store different command streams issued to it. Illustratively, 50 stream queues may be set in the processing device, each with a queue length of 100 commands.
In this step, in response to the length of the command stream to be issued being greater than the preset length threshold, the commands at the front of the command stream, up to the length threshold, are determined in front-to-back order as the first sub-command stream, and the commands of the first sub-command stream are issued to the preset stream queue.
For example, assume that the length of the command stream to be issued is 20 commands and the preset length threshold is 5 commands. In this step, the first 5 commands of the command stream to be issued can be determined as the first sub-command stream and issued to a preset stream queue. Alternatively, assuming that the length of the command stream to be issued is 1.5k and the preset length threshold is 1k, the first 1k of commands of the command stream to be issued may be determined as the first sub-command stream and issued to a preset stream queue.
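A minimal sketch of this front-to-back split; the function name and the list-based representation of a command stream are assumptions for illustration only:

```python
def split_command_stream(commands, length_threshold):
    # Front-to-back split: the first `length_threshold` commands form the first
    # sub-command stream (destined for the stream queue); the remainder forms
    # the second sub-command stream (destined for device memory).
    if len(commands) <= length_threshold:
        return list(commands), []   # fits entirely in the stream queue
    return list(commands[:length_threshold]), list(commands[length_threshold:])

# The 20-command / threshold-5 example from the text:
first_sub, second_sub = split_command_stream(list(range(20)), 5)
```

When the stream fits under the threshold, the second sub-stream is empty and nothing is sent to device memory, matching the behavior described later in the text.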
Step 102: issue the remaining second sub-command stream, other than the first sub-command stream, of the command stream to be issued to the device memory.
In the embodiment of the present disclosure, the command stream to be issued may be divided into a first sub-command stream and a second sub-command stream according to the preset length threshold. When the length of the command stream to be issued is determined to be greater than the preset length threshold, the first sub-command stream, located at the front of the command stream to be issued, can be issued to the preset stream queue, and the second sub-command stream, located at the rear, can be issued to the device memory.
For example, assume that the length of the command stream to be issued is 20 commands, the queue length of the stream queue of the processing device is 5 commands, and the preset length threshold is 5 commands. In the embodiment of the present disclosure, the first 5 commands of the command stream to be issued may be determined as the first sub-command stream, and the remaining 15 commands as the second sub-command stream. When the length of the command stream to be issued (20 commands) is determined to be greater than the preset length threshold (5 commands), the first sub-command stream (the first 5 commands) can be issued to the preset stream queue, and the remaining second sub-command stream (the remaining 15 commands) can be issued to the device memory.
In the embodiments of the present disclosure, when the length of the command stream to be issued is greater than the preset length threshold, the first sub-command stream of the command stream may be issued to the stream queue, and the remaining second sub-command stream may be issued to the device memory. With this issuing mode, for a command stream longer than the queue length of the stream queue, the portion exceeding the queue length may be issued to the device memory as the second sub-command stream. Since the storage space of the device memory is typically large, the length of the second sub-command stream is highly extensible. That is, the length of the issued command stream is not limited by the queue length of the stream queue, and a command stream of any length can be issued at one time, without issuing commands in multiple batches. This effectively prevents the host from being blocked and improves command issuing efficiency.
In some alternative embodiments, the specific implementation of step 102 may include: sending the second sub-command stream to an extended queue created in the device memory.
Because the length of the command stream to be issued is uncertain, and the stream queue is long enough to hold a complete command stream when the length of the command stream to be issued is not greater than the preset length threshold, in that case the command stream does not need to be issued into the device memory. Only when the length of the command stream to be issued is greater than the preset length threshold does the complete command stream need to be divided into a first sub-command stream and a second sub-command stream, with the first sub-command stream issued to the stream queue and the second sub-command stream issued to the device memory. At that point, an extended queue corresponding to the stream queue storing the first sub-command stream needs to be created in the device memory to store the corresponding second sub-command stream.
The specific way of creating the corresponding extended queue in the device memory is not limited in this embodiment. For example, when it is determined that the length of the command stream to be issued is greater than the preset length threshold, an extended queue of corresponding length may be created in the device memory according to the length of the second sub-command stream. The queue length of the created extended queue may be set flexibly according to the length of the second sub-command stream, which this embodiment does not limit. Since the storage space of the device memory of the processing device is typically large, the queue length of the extended queue created for the second sub-command stream can be highly extensible.
In one possible implementation, after the commands of the second sub-command stream in the created extended queue have been distributed and processed, the memory area used by the extended queue in the device memory may be recycled, and when an extended queue is needed again, a new extended queue may be created by reusing that memory area. In another possible implementation, the extended queue may be retained after the commands of its second sub-command stream have been distributed and processed. The retained extended queue may then store a new second sub-command stream during subsequent command stream issuance, serving as a new extended queue. This reduces the need to create extended queues and allows created extended queues to be multiplexed.
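The create-or-reuse behavior of extended queues described above can be modeled roughly as follows; `DeviceMemory` and its methods are hypothetical names used only for this sketch, with a plain list standing in for a memory area:

```python
class DeviceMemory:
    """Toy model of creating and reusing extended queues in device memory."""

    def __init__(self):
        self._free_queues = []  # recycled extended queues available for reuse

    def create_extended_queue(self, second_sub_stream):
        # Reuse a recycled extended queue when one is available; otherwise
        # back a new extended queue with a fresh list.
        queue = self._free_queues.pop() if self._free_queues else []
        queue.clear()
        queue.extend(second_sub_stream)
        return queue

    def recycle(self, queue):
        # All commands distributed: return the queue's memory area for reuse.
        self._free_queues.append(queue)

mem = DeviceMemory()
eq = mem.create_extended_queue(["cmd6", "cmd7"])
mem.recycle(eq)                             # later: commands fully distributed
eq2 = mem.create_extended_queue(["cmdX"])   # reuses the recycled memory area
```

Recycling trades a small amount of bookkeeping for avoiding repeated allocation, which mirrors the multiplexing benefit the text describes.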
In some optional embodiments, the issuing of the first sub command stream and the second sub command stream in the command stream to be issued may be implemented by using a Direct Memory Access (DMA) technology.
Referring to fig. 2, an internal structure of a chip may include an AI chip. The system comprises an MCU (micro controller Unit), a stream queue module, an extended queue module, a command distributor and n arithmetic units. Taking the chip shown in fig. 2 as the processing device of this embodiment as an example, the issuing process of the first sub-command stream and the issuing process of the second sub-command stream are exemplarily described.
In response to that the length of the command stream to be issued is greater than the preset length threshold, the MCU may issue the first sub-command stream and the second sub-command stream in a chain DMA manner using the two DMA descriptors. Specifically, the MCU may issue the command corresponding to the first sub-command stream to the stream queue through one DMA descriptor, and issue the command corresponding to the second sub-command stream to the device memory through another DMA descriptor.
For example, the MCU may issue the commands of the first sub-command stream to the stream queue sq1 through one DMA descriptor, and issue the commands of the second sub-command stream to the extended queue esq1 in the device memory through the other. The complete command stream is thus issued through the stream queue sq1 and its corresponding extended queue esq1 in the device memory.
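The split performed by the two chained DMA descriptors can be modeled as follows. This is a hedged sketch: in hardware the two transfers are DMA operations, while here they are modeled as list copies, and the function name `issue_command_stream` is an assumption made for the example.

```python
# Illustrative model (not the patent's literal implementation) of splitting an
# oversized command stream: the first sub-stream, up to the length threshold,
# goes to the stream queue; the remainder goes to an extended queue in device
# memory. Each extend() stands in for one chained DMA descriptor transfer.

def issue_command_stream(commands, length_threshold, stream_queue, extended_queue):
    if len(commands) > length_threshold:
        # First DMA descriptor: first sub-command stream -> stream queue.
        stream_queue.extend(commands[:length_threshold])
        # Second DMA descriptor: second sub-command stream -> device memory.
        extended_queue.extend(commands[length_threshold:])
    else:
        # A short stream fits in the stream queue directly; no extended queue needed.
        stream_queue.extend(commands)
```

For instance, a ten-command stream with a threshold of six leaves six commands in the stream queue and four in the extended queue.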
Further, the MCU may invoke the command distributor to distribute the commands in the stream queue to the corresponding arithmetic units for processing. Before all commands in the stream queue have been distributed, the MCU can load commands from the corresponding extended queue into the stream queue, so that undistributed commands remain in the stream queue until the command distributor has distributed the complete command stream and the arithmetic units have finished processing it.
After the first sub-command stream has been issued to the stream queue and the second sub-command stream has been issued to the device memory, the entire command stream to be issued has been issued. The processing device then needs to distribute the issued commands to the corresponding arithmetic units for processing; in other words, it must distribute both the commands of the first sub-command stream stored in the stream queue and the commands of the second sub-command stream stored in the device memory, so as to complete the processing of the entire command stream.
In some optional embodiments, the command issuing method further comprises: sequentially distributing the commands stored in the stream queue to the corresponding arithmetic units for processing; and loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before all commands stored in the stream queue have been distributed.
In the above embodiment, the commands stored in the stream queue may be distributed sequentially to the corresponding arithmetic units for processing. For example, a command distributor provided in the processing device may read a command from the stream queue and distribute it to the corresponding arithmetic unit. In addition, before all commands stored in the stream queue have been distributed, and while commands are still stored in the device memory, part or all of the commands in the second sub-command stream stored in the device memory may be loaded into the stream queue.
The loading of commands from the device memory into the stream queue can likewise be implemented with DMA. Taking fig. 2 as an example: upon determining that the number of commands in the stream queue sq1 is smaller than a preset number threshold, the MCU may load a certain number of commands from the corresponding extended queue esq1 in the device memory into the stream queue based on a DMA descriptor, so that sq1 always holds commands not yet dispatched by the command distributor, until all commands of the complete command stream stored in sq1 and esq1 have been dispatched and processed.
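The interplay between draining and refilling described above can be simulated in a few lines. This is a minimal sketch under assumed names (`dispatch_all`, `refill_threshold`, `batch_size`); the real mechanism is a DMA transfer triggered by the MCU, not a Python loop.

```python
# Minimal simulation of the refill loop: the dispatcher drains the stream
# queue, and whenever its fill level drops below a threshold the MCU loads a
# batch from the extended queue, so the dispatcher never observes an empty
# queue before the complete command stream has been processed.

from collections import deque

def dispatch_all(stream_queue, extended_queue, refill_threshold, batch_size):
    dispatched = []
    while stream_queue or extended_queue:
        # MCU side: top the stream queue up before it runs dry.
        while len(stream_queue) < refill_threshold and extended_queue:
            for _ in range(min(batch_size, len(extended_queue))):
                stream_queue.append(extended_queue.popleft())
        # Dispatcher side: read one command and hand it to an arithmetic unit.
        dispatched.append(stream_queue.popleft())
    return dispatched
```

Because refilling happens before the queue empties, the commands come out in the original stream order and the dispatcher never stalls.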
It should be noted that the commands stored in the stream queue may include commands from the first sub-command stream, commands from the second sub-command stream, or both. In the embodiments of the present disclosure, it is only necessary that commands be loaded from the device memory into the stream queue before all commands stored in the stream queue have been distributed.
In the above embodiment, undistributed commands can be kept in the stream queue at all times, so the command distributor of the processing device can continuously read and distribute commands from the stream queue until the entire command stream, including the commands of both the first and second sub-command streams, has been distributed and processed.
Because the command distributor always reads commands from the stream queue while distributing the entire command stream, reading and distribution proceed without interruption. The command distributor therefore has no perception that the command stream being distributed has overflowed the queue. Here, queue overflow of a command stream means that the entire command stream cannot be stored completely in the stream queue, so the "overflow" portion must be stored outside it.
In a possible implementation, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before all commands stored in the stream queue have been distributed comprises: while distributing the commands stored in the stream queue, each time a preset first number of commands has been distributed, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue.
In this implementation, the first number may be preset, and while the command distributor distributes the commands stored in the stream queue, it may trigger a load operation each time the first number of commands has been distributed, causing the processing device to load part or all of the commands in the second sub-command stream stored in the device memory into the stream queue.
The preset first number may be a fixed value or a variable value; the embodiments of the present disclosure are not limited in this respect. For example, the first number may be preset to 3: while distributing the commands stored in the stream queue, the command distributor triggers a load operation each time 3 commands have been distributed, so that the processing device loads part or all of the commands stored in the device memory into the stream queue. The case where the first number is a variable value is analogous and is not illustrated in detail.
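The counter-based trigger can be sketched as follows. The function name `dispatch_counting` and the batch size are assumptions made for this example; in practice, as noted later in the text, the parameters must be chosen so that the refill keeps pace with dispatch.

```python
# Hedged sketch of the counter-based trigger: after every `first_number`
# dispatched commands, a load operation moves a batch of commands from the
# extended queue in device memory into the stream queue.

from collections import deque

def dispatch_counting(stream_queue, extended_queue, first_number, batch_size):
    dispatched = []
    since_load = 0
    while stream_queue:
        dispatched.append(stream_queue.popleft())
        since_load += 1
        if since_load == first_number:
            # Load operation triggered by the dispatch counter. Parameters
            # must ensure the queue is refilled before it drains completely.
            since_load = 0
            for _ in range(min(batch_size, len(extended_queue))):
                stream_queue.append(extended_queue.popleft())
    return dispatched
```

With `first_number = 3`, the load fires after the third dispatched command, matching the worked example in the text.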
In a possible implementation, before the commands stored in the stream queue are distributed, flags are set on the stored commands according to a preset rule, so that during distribution a load operation may be triggered according to the flags set on the commands, causing the processing device to load commands from the device memory into the stream queue.
For example, assuming 2K commands are stored in the stream queue, a flag (e.g., set to 1) may be applied to both the K-th command and the 2K-th command. When the command distributor distributes the 2K commands stored in the stream queue, reaching the K-th command triggers a load operation, so that the processing device loads commands stored in the device memory into the stream queue; similarly, a load operation is triggered upon reaching the 2K-th command.
Referring to fig. 3, a schematic diagram of the stream queue sq1 is shown, where sq1 holds 2k commands cmd_1 to cmd_2k. It can be predefined that for every k commands issued from stream queue sq1, k commands are loaded from the corresponding extended queue esq1 into sq1. In this embodiment, the flags of commands cmd_k and cmd_2k in sq1 may be set to 1. The command distributor distributes the commands in sq1 in sequence from cmd_1 to cmd_k and triggers a load operation when it reaches cmd_k. In response to the triggered load operation, the MCU may load k commands from the extended queue esq1 in device memory into the stream queue sq1 based on a DMA descriptor.
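The flag-driven variant above can be simulated as follows. This is an illustrative Python model, not the patent's hardware implementation; the name `dispatch_with_flags`, the boolean flag encoding, and the re-flagging of refill batches are all assumptions made for the example.

```python
# Sketch of the flag-driven refill: every k-th command in the stream queue
# carries a flag, and dispatching a flagged command triggers a load of up to
# k more commands from the extended queue, with the last command of each
# refill batch flagged in turn so the pattern continues.

from collections import deque

def dispatch_with_flags(stream_queue, extended_queue, k):
    # Mark every k-th queued command (cmd_k, cmd_2k, ...) with a load flag.
    flagged = deque(
        (cmd, (i + 1) % k == 0) for i, cmd in enumerate(stream_queue)
    )
    dispatched = []
    while flagged:
        cmd, flag = flagged.popleft()
        dispatched.append(cmd)
        if flag:
            # Triggered load: move up to k commands from device memory into
            # the queue, flagging the last command of the refill batch.
            batch = [extended_queue.popleft()
                     for _ in range(min(k, len(extended_queue)))]
            for j, new_cmd in enumerate(batch):
                flagged.append((new_cmd, j == len(batch) - 1))
    return dispatched
```

Once the extended queue is empty, flagged commands simply trigger no further loads and the remaining queue drains in order.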
The MCU can also load all remaining commands in the extended queue into the corresponding stream queue in a single operation based on a DMA descriptor, ensuring that every command in the extended queue is loaded back into the corresponding stream queue. Finally, through the distribution of the commands in the stream queue, the command distributor distributes all commands of the complete command stream stored jointly in the stream queue and its extended queue.
It should be noted that the way of flagging commands stored in the stream queue in the above example is only illustrative. In practical applications, the flagging scheme can be determined by jointly considering the queue length of the stream queue, the length of the first sub-command stream, the processing rate of the arithmetic units, the loading rate from device memory to the stream queue, and other factors.
As shown in fig. 4, the present disclosure provides a processing device that can perform the command issuing method of any embodiment of the present disclosure. The device may include a preset stream queue 401, a device memory 402, a first sub-command stream issuing module 403, and a second sub-command stream issuing module 404, wherein:
the first sub-command stream issuing module 403 is configured to, in response to the length of a command stream to be issued being greater than a preset length threshold, issue a first sub-command stream corresponding to the length threshold in the command stream to be issued to the stream queue, wherein the command stream to be issued comprises at least one command to be issued;
the second sub-command stream issuing module 404 is configured to issue the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory.
Optionally, the second sub-command stream issuing module 404, when configured to issue the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory, includes: sending the second sub-command stream to an extended queue created in the device memory.
Optionally, the length threshold is less than or equal to the queue length of the stream queue, where the queue length is the maximum length of commands that the stream queue can store.
Optionally, the first sub-command stream issuing module 403, when configured to issue the first sub-command stream corresponding to the length threshold in the command stream to be issued to the stream queue, includes: issuing the first sub-command stream to the stream queue through one DMA descriptor in a chained-DMA manner; and the second sub-command stream issuing module 404, when configured to issue the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory, includes: issuing the second sub-command stream into the device memory through another DMA descriptor.
Optionally, as shown in fig. 5, the processing apparatus further includes:
an arithmetic unit 501 for processing the distributed command;
the command distributor 502 is configured to sequentially distribute the commands stored in the stream queue to the corresponding arithmetic units for processing;
a micro control unit 503, configured to load at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before the command distributor has finished distributing the commands stored in the stream queue.
Optionally, the micro control unit 503, when configured to load at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before the command distributor has finished distributing the commands stored in the stream queue, includes: while distributing the commands stored in the stream queue, each time a preset first number of commands has been distributed, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue.
Optionally, the processing device comprises a chip.
For the apparatus embodiments, since they basically correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of at least one embodiment of the present disclosure. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The present disclosure also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the command issuing method of any embodiment of the present disclosure when executing the program.
Fig. 6 is a schematic diagram illustrating a more specific hardware structure of a computer device according to an embodiment of the present disclosure, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided in the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module for inputting and outputting information. The input/output module may be configured as a component within the device (not shown) or may be external to the device to provide the corresponding function. Input devices may include a keyboard, mouse, touch screen, microphone, and various sensors; output devices may include a display, speaker, vibrator, indicator lights, and so on.
The communication interface 1040 is used to connect a communication module (not shown in the drawings) to enable communication between this device and other devices. The communication module may communicate in a wired manner (e.g., USB or network cable) or wirelessly (e.g., mobile network, WiFi, or Bluetooth).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above device shows only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in specific implementations the device may also include other components necessary for normal operation. Moreover, those skilled in the art will appreciate that the device may include only the components necessary to implement the embodiments of this specification, rather than all the components shown in the figure.
The present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the command issuing method of any embodiment of the present disclosure.
The non-transitory computer-readable storage medium may be, among others, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device; the present disclosure is not limited in this respect.
In some optional embodiments, the embodiments of the present disclosure provide a computer program product comprising computer-readable code which, when run on a device, causes a processor in the device to execute the command issuing method provided in any of the above embodiments. The computer program product may be implemented in hardware, software, or a combination thereof.
In some optional embodiments, the disclosed embodiments provide an electronic device comprising a host and a processing device;
the host is used for sending a command stream to be issued to the processing equipment;
the processing device is configured to respond that the length of the command stream to be issued is greater than a preset length threshold, and issue a first sub-command stream corresponding to the length threshold in the command stream to be issued to a preset stream queue; the command stream to be issued comprises at least one command to be issued;
and to issue the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory.
The processing device in the electronic device may further execute any method provided in the embodiments of the present application, which is not described in detail here.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (15)

1. A method for issuing a command, the method comprising:
in response to the length of a command stream to be issued being greater than a preset length threshold, issuing a first sub-command stream corresponding to the length threshold in the command stream to be issued to a preset stream queue, wherein the command stream to be issued comprises at least one command to be issued;
and issuing a remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to a device memory.
2. The method of claim 1, wherein issuing the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory comprises:
sending the second sub-command stream to an extended queue created in the device memory.
3. The method according to claim 1 or 2, wherein the length threshold is less than or equal to a queue length of the stream queue, the queue length being the maximum length of commands that the stream queue can store.
4. The method according to any one of claims 1 to 3, wherein issuing the first sub-command stream corresponding to the length threshold in the command stream to be issued to the preset stream queue comprises:
issuing the first sub-command stream to the preset stream queue through one DMA descriptor in a chained-DMA manner;
and wherein issuing the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory comprises:
issuing the second sub-command stream into the device memory through another DMA descriptor.
5. The method according to any one of claims 1 to 4, further comprising:
the commands stored in the flow queue are sequentially distributed to corresponding arithmetic units for processing; and loading at least a part of commands in the second sub-command stream stored in the device memory into the stream queue before the commands stored in the stream queue are distributed completely.
6. The method of claim 5, wherein loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before all commands stored in the stream queue have been distributed comprises:
while distributing the commands stored in the stream queue, each time a preset first number of commands has been distributed, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue.
7. A processing device, characterized in that the processing device comprises: a preset stream queue, a device memory, a first sub-command stream issuing module, and a second sub-command stream issuing module;
the first sub-command stream issuing module is configured to, in response to the length of a command stream to be issued being greater than a preset length threshold, issue a first sub-command stream corresponding to the length threshold in the command stream to be issued to the stream queue, wherein the command stream to be issued comprises at least one command to be issued;
and the second sub-command stream issuing module is configured to issue a remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory.
8. The processing device according to claim 7, wherein the second sub-command stream issuing module, when configured to issue the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory, includes:
sending the second sub-command stream to an extended queue created in the device memory.
9. The processing device according to claim 7 or 8, wherein the length threshold is less than or equal to a queue length of the stream queue, the queue length being the maximum length of commands that the stream queue can store.
10. The processing device according to any one of claims 7 to 9, wherein
the first sub-command stream issuing module, when configured to issue the first sub-command stream corresponding to the length threshold in the command stream to be issued to the stream queue, includes:
issuing the first sub-command stream to the stream queue through one DMA descriptor in a chained-DMA manner;
and the second sub-command stream issuing module, when configured to issue the remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to the device memory, includes:
issuing the second sub-command stream into the device memory through another DMA descriptor.
11. The processing device according to any one of claims 7 to 10, further comprising:
an arithmetic unit for processing the distributed commands;
a command distributor for sequentially distributing the commands stored in the stream queue to the corresponding arithmetic units for processing;
and a micro control unit for loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before the command distributor has finished distributing the commands stored in the stream queue.
12. The processing device of claim 11, wherein the micro control unit, when configured to load at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue before the command distributor has finished distributing the commands stored in the stream queue, includes:
while distributing the commands stored in the stream queue, each time a preset first number of commands has been distributed, loading at least a part of the commands in the second sub-command stream stored in the device memory into the stream queue.
13. The processing device according to any one of claims 7 to 12, wherein the processing device comprises a chip.
14. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method of any one of claims 1 to 6.
15. An electronic device, comprising a host and a processing device;
the host is used for sending a command stream to be issued to the processing equipment;
the processing device is configured to, in response to the length of the command stream to be issued being greater than a preset length threshold, issue a first sub-command stream corresponding to the length threshold in the command stream to be issued to a preset stream queue, wherein the command stream to be issued comprises at least one command to be issued;
and to issue a remaining second sub-command stream in the command stream to be issued, excluding the first sub-command stream, to a device memory.
CN202110130220.6A 2021-01-29 2021-01-29 Command issuing method, processing device and storage medium Pending CN112860321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110130220.6A CN112860321A (en) 2021-01-29 2021-01-29 Command issuing method, processing device and storage medium


Publications (1)

Publication Number Publication Date
CN112860321A true CN112860321A (en) 2021-05-28

Family

ID=75987181


Country Status (1)

Country Link
CN (1) CN112860321A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105938418A (en) * 2015-03-06 2016-09-14 爱思开海力士有限公司 Memory system and operation method thereof
US20160299690A1 * 2015-04-09 2016-10-13 Samsung Electronics Co., Ltd. Data storage device and data processing system including the same
CN109656479A (en) * 2018-12-11 2019-04-19 湖南国科微电子股份有限公司 A kind of method and device constructing memory command sequence
CN110121114A (en) * 2018-02-07 2019-08-13 华为技术有限公司 Send the method and data transmitting equipment of flow data
CN111465966A (en) * 2018-05-31 2020-07-28 华为技术有限公司 Apparatus and method for command stream optimization and enhancement
CN111897665A (en) * 2020-08-04 2020-11-06 北京泽石科技有限公司 Data queue processing method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination