CN115640052A - Multi-core multi-pipeline parallel execution optimization method for graphics processor - Google Patents

Multi-core multi-pipeline parallel execution optimization method for graphics processor

Info

Publication number: CN115640052A (application CN202211300379.9A)
Authority: CN (China)
Prior art keywords: instruction, request, execution, data, pipeline
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 邹凌君, 张利峰
Current assignee: Jinling Institute of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Jinling Institute of Technology
Priority date: 2022-10-24 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-10-24
Publication date: 2023-01-24
Application filed by Jinling Institute of Technology

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Advance Control (AREA)

Abstract

The invention provides a multi-core multi-pipeline parallel execution optimization method for a graphics processor, specifically a design in which multiple stream processing core units in the processor issue several instructions simultaneously into a pipeline for out-of-order execution. Independent instructions issued by the single instruction multiple data (SIMD) threads running on several GPU processing cores enter the pipeline for out-of-order execution, are re-segmented at instruction granularity, and are executed at an executable-request granularity smaller than a whole instruction. This fully exploits the parallel read performance of each data channel, so that the arithmetic logic units (ALUs) or execution units in the pipeline stay busy, achieving out-of-order parallel execution of instructions and improving execution efficiency.

Description

Multi-core multi-pipeline parallel execution optimization method for graphics processor
Technical Field
The invention relates to a multi-core multi-pipeline parallel execution optimization method for a graphics processor.
Background
In current graphics processing unit (GPU) designs, a multi-pipeline structure is usually adopted to increase parallelism. In addition to the integer and floating-point instruction computation units, there are various other instruction processing units that execute in parallel. These instruction pipelines can be divided by type into texture sampling (Sample) instructions, data load and store (Load/Store) instructions, special math function (SFU) instructions, shared memory access (Shared Memory) instructions, and so on. After an instruction is issued from the instruction issue unit, each pipeline independently reads the relevant data and performs the corresponding operation.
However, in a multi-core design, multiple cores typically issue the same type of single instruction multiple data (SIMD) processing instruction simultaneously into the same shared instruction-processing module for execution. The instruction execution pipeline also typically executes at instruction granularity: constrained by data transfer bandwidth, an instruction must collect all of its source operand (source) data before the whole instruction can begin execution. Because the GPU itself has a multi-core structure, and to save the large overall hardware cost of duplicating execution pipelines, only one set of this type of execution pipeline is shared among the graphics processor's cores, so the upstream instruction issue module has to stall, and a blocked instruction can execute only after the current instruction completes, as shown in FIG. 1.
At this point, if the currently executing instruction has not finished reading its source operand data, or arbitration introduces delays while the source operands are being read, so that the instruction is not ready in time, the pipeline stalls and the other cores cannot issue instructions of this type for execution.
Instruction execution therefore effectively becomes serial, as shown by the serial execution pipeline processing in FIG. 2, which depicts the instruction execution process of the multi-stream processing cores from issue, through operand fetch, to execution. The green regions are the phases that continuously send requests to read operand data; the red regions are the instruction execution phases. Once execution enters the pipeline, the downstream stages are not blocked, so an instruction can be assumed complete once it is dispatched, and cycles need not be counted over the full pipeline depth. The horizontal axis is execution time in cycles; the vertical axis shows the execution of instructions issued alternately by four stream processors (Stream Processor, SP) across those cycles. As FIG. 2 shows, operand fetch and pipeline execution must wait for each other, so neither parallel execution nor out-of-order execution is achieved.
The current approach of polling reception followed by in-order execution has the following main problems:
1. When a single execution pipeline receives instructions, execution becomes serial: reading source operands and executing instructions are serialized, reducing execution efficiency.
2. Instructions issued by multiple cores without data dependences must still wait in series, and the waiting degrades performance.
3. Execution currently follows the order of the receive queue and proceeds per complete instruction, so the execution granularity is large and fully out-of-order execution cannot be achieved.
Disclosure of Invention
Purpose of the invention: the invention aims to solve the problem that when multiple cores simultaneously issue instructions of the same type into the corresponding pipeline for execution, the cores must queue and wait in arrival order, and provides a design for out-of-order execution among different cores within the pipeline. Because the reading of instruction source operand data is independent across cores, there is no read-write data conflict. The out-of-order parallel execution method therefore allows instructions issued simultaneously by several cores, or instructions of a single core belonging to different SIMD threads with no dependence between them, to execute out of order as soon as the related resources are ready, improving the utilization of the pipeline's control and execution logic and of some special math function units, and achieving better performance.
The invention specifically provides a multi-core multi-pipeline parallel execution optimization method for a graphics processor, which converts processing performed at single-instruction granularity into a parallel execution process operating on executable small-granularity requests. The design of the finer execution granularity of the execution pipeline in the processor comprises the following steps (a behavioral sketch in code follows the step list):
step 1, an instruction buffer area receives and caches instructions;
step 2, selecting an out-of-order executable instruction;
step 3, splitting the instruction request and reading data;
step 4, receiving and storing data returned by the reading request;
step 5, controlling the executable small-granularity request;
step 6, extracting instruction information;
step 7, executing the executable small-granularity requests;
step 8, outputting the result of the request for completing execution;
and 9, clearing the executed instruction.
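Although the patent describes hardware, the control flow of these nine steps can be outlined in software. The following Python sketch is illustrative only: every name in it (Instruction, Request, split, ready_queue, and so on) is an assumption of this example rather than terminology from the filing, and the data return that would drive step 5 in hardware is simulated directly.

```python
# Minimal behavioral sketch of the nine-step flow (all names are assumptions
# of this sketch; the patent describes hardware, not a software API).
from collections import deque
from dataclasses import dataclass

@dataclass
class Instruction:
    core_id: int        # issuing stream processing core
    thread_id: int      # executed SIMD thread number
    priority: int
    opcode: str
    num_instances: int  # e.g. 32 for a SIMD32 instruction

@dataclass
class Request:          # executable small-granularity request
    instr_index: int
    first_instance: int
    num_instances: int
    is_last: bool       # request-executable tag on the final split

def split(instr: Instruction, instr_index: int, granularity: int):
    """Step 3: re-segment one instruction into sub-instruction requests."""
    for first in range(0, instr.num_instances, granularity):
        n = min(granularity, instr.num_instances - first)
        yield Request(instr_index, first, n,
                      is_last=(first + n == instr.num_instances))

# Steps 1-2: buffer the issued instruction and pick it out of order.
instr_buffer = {0: Instruction(core_id=0, thread_id=3, priority=1,
                               opcode="SAMPLE", num_instances=32)}
ready_queue = deque()                   # step 5: requests whose data is back

# Steps 3-4: split, then pretend the operand reads have all returned.
for req in split(instr_buffer[0], 0, granularity=8):
    ready_queue.append(req)             # in hardware: enqueued on data return

# Steps 6-8: execute each ready request and report its result range.
while ready_queue:
    req = ready_queue.popleft()
    instr = instr_buffer[req.instr_index]
    print(f"exec {instr.opcode} instances {req.first_instance}.."
          f"{req.first_instance + req.num_instances - 1}")

# Step 9: all requests done, so the instruction entry can be cleared.
del instr_buffer[0]
```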
Step 1 comprises: the parallel pipeline first polls and receives instructions of the same type, destined for a shared external processing unit, issued simultaneously from the N stream processing cores of the graphics processor (typically 2 or 4; each stream processing core has an independent instruction issue system and can issue dozens of instruction kinds such as floating-point, integer, special-function, and texture sampling instructions; instructions other than the floating-point and integer ones are handled in instruction processing pipelines outside the stream processor, typically one pipeline of each kind shared by several stream processing cores, such as the data-load pipeline), and writes the received instructions into an instruction buffer dedicated to storing instruction information. When the write of an instruction into the buffer completes, an instruction entry is obtained and serves as the index for looking up the instruction's specific information; at the same time, the instruction index number, the executed single instruction multiple data thread number, the stream processing core number, and the priority information are placed into an instruction retrieval queue.
Step 2 comprises: an instruction selector of the execution pipeline selects, from the instruction retrieval queue, the instructions corresponding to each independent stream processing core according to the priority information carried by the instructions, the stream processing core number, and the executed single instruction multiple data thread number. SIMD threads from different stream processing cores may be selected preferentially to obtain instruction index numbers.
Step 3 comprises: according to the obtained instruction index information, the instruction pipeline reads the instruction and splits it internally into executable small-granularity requests, which are sent to the data memory in each independent core to fetch the corresponding operand data, carrying the stream processing core number, the executed single instruction multiple data thread number, the instruction index number, the instance start and instance count information, the source address information, the request index within the instruction, and the request-executable tag.
Step 4 comprises: the instruction request splitting unit in the instruction pipeline splits an instruction that must read source operands into executable granularity, and when splitting a request it applies to the data buffer for the location address at which the returned data will be stored. When a data read request returns, the data is received and written into the data buffer at that location address. After operands are fetched, the order in which each core's reads of the general-purpose data registers return is out of order, so the data buffer is written in polling fashion. At the same time, the request-executable flag bit is checked: if the request-executable tag carried in a returned executable small-granularity request is valid, the request is sent to the request controller of the execution pipeline, which controls the execution of small-granularity requests according to the returned ready-state information.
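A minimal sketch of this step-4 hand-off, under the same caveat that all names are illustrative: each read return lands at the data-buffer slot reserved at split time, and only the return carrying a valid request-executable tag notifies the request controller.

```python
# Sketch of step 4 (illustrative names): out-of-order read returns are stored
# at the buffer address reserved at split time; the return that carries the
# request-executable tag marks the whole request ready for the controller.
from dataclasses import dataclass

@dataclass
class ReadReturn:
    slot_addr: int        # data-buffer address applied for at split time
    payload: list         # operand data read from the register file
    executable_tag: bool  # valid on the request's last outstanding read
    request_id: int

data_buffer = {}          # slot address -> operand data
ready_queue = []          # consumed in ready order by step 5

def on_data_return(ret: ReadReturn) -> None:
    data_buffer[ret.slot_addr] = ret.payload   # polled write into the buffer
    if ret.executable_tag:                     # all data for this request is in
        ready_queue.append(ret.request_id)     # notify the request controller

# returns may arrive in any order; only the tagged one enqueues the request
on_data_return(ReadReturn(7, [1, 2], executable_tag=False, request_id=0))
on_data_return(ReadReturn(8, [3, 4], executable_tag=True, request_id=0))
assert ready_queue == [0]
```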
Step 5 comprises: when two or more executable small-granularity requests are ready, the instruction pipeline enqueues them into a ready queue in ready order; the executable small-granularity request control reads ready requests from the ready queue in sequence and sends them to the instruction information extraction unit and the data buffer module to read the related information (the instruction entry is found via the instruction index, and the instruction information needed for execution is extracted according to the required information). The instruction information extraction unit extracts the valid information of the currently executing instruction from the instruction buffer. The data buffer module manages the allocation and release of the data buffer.
Step 6 comprises: when the corresponding small-granularity request is executed, the request controller sends its information to the information extraction unit to extract the execution-state data, locates the valid instruction request in the instruction buffer via the instruction index carried in the return, extracts the valid instance mask and the instruction opcode corresponding to the request according to the request's instance start position and instance count, and sends them, together with the source data read under request control, to the request execution subunit for actual execution.
Step 7 comprises: the request execution subunit is the execution body for executable small-granularity requests and contains the arithmetic logic operators associated with the specific functions of the pipeline. For example, if the pipeline computes special math functions, the ALU may compute the reciprocal operation using a third-order Taylor-formula table-lookup method. Once the request execution subunit has collected the opcode, the valid instance mask, the source data, and the instruction execution state information, it starts the pipeline computation and obtains the computation result.
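The filing names the reciprocal technique but not its parameters, so the following sketch fixes them by assumption (a 64-entry table over the mantissa range [1, 2), midpoint anchors): it looks up r0 = 1/x0 for the segment containing x and applies the third-order truncation of the series 1/(1+t) = 1 - t + t^2 - t^3 + ...

```python
# Sketch of a third-order table-lookup reciprocal (the patent names the
# technique only; the table size, segmentation, and demo range here are
# assumptions of this example).
N = 64                                          # table segments over [1, 2)
TABLE = [1.0 / (1.0 + (k + 0.5) / N) for k in range(N)]  # r0 = 1/x0 per segment

def recip(x: float) -> float:
    """Approximate 1/x for x in [1, 2); other inputs reduce to this range
    by factoring out the power-of-two exponent."""
    k = min(int((x - 1.0) * N), N - 1)
    x0 = 1.0 + (k + 0.5) / N                    # segment midpoint anchor
    r0 = TABLE[k]
    t = (x - x0) * r0                           # |t| <= 1/(2N)
    # third-order truncation of 1/(1+t); exact value is r0 / (1 + t)
    return r0 * (1.0 - t + t * t - t * t * t)

# worst-case relative error over [1, 2) is about (1/(2N))**4, i.e. ~4e-9
worst = max(abs(recip(1.0 + i / 4096.0) * (1.0 + i / 4096.0) - 1.0)
            for i in range(4096))
assert worst < 1e-8
```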
Step 8 comprises: according to the state type in the instruction's destination address information, the pipeline's computation result is either written back to the general-purpose data register file or output to a designated buffer for storage.
Step 9 comprises: when a request finishes executing, the current instruction is checked to confirm whether all of its requests have completed. If not, the remaining small-granularity requests continue to execute in the loop; if so, the corresponding instruction entry in the instruction retrieval queue is retired and the subsequent instruction index entries are moved down, keeping the region contiguous and available. At the same time, the corresponding instruction entry in the instruction buffer is cleared so that subsequently issued instructions can be received.
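A small sketch of this retirement step, with hypothetical names: a per-instruction counter stands in for the check that all split requests have completed, and removing the queue entry models the shift-down that keeps the retrieval queue contiguous.

```python
# Sketch of step-9 retirement (illustrative): when the last split request of
# an instruction completes, its entry leaves the retrieval queue (later
# entries shift down, keeping the region contiguous) and its buffer slot is
# freed for a newly issued instruction.
def retire_if_done(instr_index, done, total, retrieval_queue, instr_buffer):
    done[instr_index] += 1
    if done[instr_index] < total[instr_index]:
        return False                       # keep looping remaining requests
    retrieval_queue.remove(instr_index)    # later entries shift down
    del instr_buffer[instr_index]          # entry can accept a new instruction
    return True

queue, buf = [0, 1], {0: "SAMPLE", 1: "LOAD"}
done, total = {0: 0}, {0: 2}               # instruction 0 split into 2 requests
assert retire_if_done(0, done, total, queue, buf) is False
assert retire_if_done(0, done, total, queue, buf) is True and queue == [1]
```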
The advantages and notable effects of the invention are: (1) Multiple instructions issued by multiple cores enter their respective pipelines for execution, are fully split inside the pipelines into compatible execution granularities, and are issued out of order to read the instruction-source data resources inside and outside the cores. (2) As soon as the source data of any executable granularity is ready, that is, the last read of that small-granularity request has returned, the request can enter the request controller and then the execution pipeline to actually perform the related ALU or conversion processing. (3) In this process, the selected instructions are split into different small-granularity requests that interleave out of order; source data is read back and requests enter the execution pipeline out of order, keeping the execution pipeline busy. This raises the utilization of the execution pipeline, increases parallelism between cores, and achieves the out-of-order goal. (4) Issuing small-granularity requests that do not depend on whole-instruction execution also raises the utilization of the data buffer and achieves the Vertical-to-Horizontal conversion effect.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of instructions issued by multiple cores entering the same pipeline, with data issued alternately into the pipeline for execution using a polling strategy.
FIG. 2 is a diagram of the execution effect of a read source operand and an execution pipeline in a conventional serial design, where the green region is the number of requested cycles for reading the operand, the red region is the number of requested cycles for executing instructions in the pipeline, and the red and green parts are connected together to form a complete life cycle of an instruction.
FIG. 3 is an out-of-order execution diagram of the execution pipeline of the present invention.
FIG. 4 is a flowchart illustrating a process for completing executable instructions according to the present invention.
FIG. 5 is a diagram illustrating selection of executable instructions according to the present invention.
FIG. 6 is a diagram illustrating the splitting of a single-channel executable instruction into requests according to the present invention.
FIG. 7 is a diagram illustrating the splitting of a multi-channel executable instruction into requests according to the present invention.
FIG. 8 is a diagram illustrating the comparison between the present invention and the conventional design.
FIG. 9 is a flow chart illustrating instruction pipeline processing and execution.
Detailed Description
The invention provides an out-of-order execution optimization method for multi-core multi-pipeline parallel execution in a graphics processing unit (GPU), specifically a design in which multiple cores issue multiple instructions simultaneously into a pipeline for out-of-order execution. It is typically used where each stream processing core issues texture sampling instructions into the texture processing unit (usually a dedicated texture processing unit in each graphics processor) for execution. A texture sampling instruction fetches image data at a specified position, given by pixel coordinates, from a texture image supplied by the application, performs sampling, linear interpolation, filtering, and so on, and returns the sampled result to the general-purpose data registers for use in program mapping and the like. If the instruction processes a three-dimensional image, the three-dimensional coordinate address information of all instances contained in the whole instruction would conventionally be collected before execution. In the invention, the instruction is divided into smaller execution granularities, and execution can start as soon as the three-dimensional coordinate address information of one executable granularity has been collected. Multi-core simultaneous issue of multiple instructions means that independent instructions issued by the SIMD threads of multiple GPU processing cores enter the pipeline for out-of-order execution, are re-segmented at instruction granularity, and execute at an executable-request granularity smaller than a whole instruction. The parallel read performance of each data channel is fully used, and the arithmetic logic units (ALUs) or execution units in the pipeline stay busy, achieving out-of-order parallel instruction execution and improving efficiency, as shown in FIG. 3.
First, inside the execution pipeline, an instruction buffer receives instruction-information requests related to instruction issue from multiple cores, containing the information necessary to execute each instruction, such as the source operand data address, the instruction operation code (opcode), and the per-instance valid processing mask of the SIMD instruction. Each entry in the instruction buffer corresponds to one instruction, and when instruction information must be looked up or used, it is indexed by an internal instruction index. When instruction information is needed later during execution, or data must be written out after execution completes, the related information, including the instruction destination data, destination address type, valid data mask, per-instance valid processing mask, and so on, can be obtained through the internal instruction index.
An instruction selector is designed, as shown in FIG. 5, to select among instructions according to thread, priority, and source-data dependences, and to split them into requests for execution. Selection is first by priority: an instruction specified as high priority must be selected and executed first. Second, selection prefers different cores, because there are fewer data dependences between cores. Third, it prefers different thread ids. Finally, different instructions of the same core and the same thread must be split and issued strictly in order, according to the last-valid-instance signal carried by each instruction, to prevent errors caused by a later instruction completely overtaking an earlier one.
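The selection order just described can be condensed into a sort key; the sketch below is an illustrative rendering (QueueEntry and pick_next are names of this example, not of the patent), with the oldest-entry filter enforcing program order within one core and thread.

```python
# Sketch of the selection order described above (illustrative): priority
# first, then prefer a core other than the last pick, then another thread;
# only the oldest entry per (core, thread) is eligible, which keeps
# same-core, same-thread instructions in program order.
from dataclasses import dataclass

@dataclass
class QueueEntry:
    instr_index: int
    core_id: int
    thread_id: int
    priority: int

def pick_next(retrieval_queue, last_core, last_thread):
    seen, eligible = set(), []
    for e in retrieval_queue:              # queue is oldest-first
        if (e.core_id, e.thread_id) not in seen:
            seen.add((e.core_id, e.thread_id))
            eligible.append(e)
    if not eligible:
        return None
    return min(eligible, key=lambda e: (-e.priority,
                                        e.core_id == last_core,
                                        e.thread_id == last_thread))

q = [QueueEntry(0, core_id=1, thread_id=0, priority=0),
     QueueEntry(1, core_id=1, thread_id=0, priority=0),  # behind entry 0
     QueueEntry(2, core_id=2, thread_id=1, priority=0)]
assert pick_next(q, last_core=1, last_thread=0).instr_index == 2
```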
The selected instruction is disassembled into multiple requests according to the execution granularity of the particular pipeline and the number of instruction source data channels required, as shown in FIG. 6. Each disassembled request is sent to the general-purpose data register file (General Register File) to read instruction source data, carrying a request index and a request-termination signal for that execution granularity. For sample and load/store instructions, the multi-channel source data is read channel by channel, according to the different data or address channels, converting the Vertical format to the Horizontal format. When the requested data is read back, it is written into the designated data buffer, as shown in the data buffer portion of FIG. 3.
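In software terms, the Vertical-to-Horizontal conversion amounts to a transpose between the per-channel layout delivered by the register-file reads and the per-instance layout the execution unit consumes; a minimal sketch, with illustrative names:

```python
# Sketch of the Vertical-to-Horizontal conversion: the register file returns
# one channel per read request (vertical rows), while the execution unit
# consumes all channels of one instance per lane (horizontal columns), so
# the data buffer in effect performs a transpose.
def vertical_to_horizontal(channel_returns):
    """channel_returns[c][i] is channel c of instance i; the result holds
    one tuple of all channels per instance."""
    return list(zip(*channel_returns))

u = [1, 2, 3, 4]                   # channel u for instances 0..3, one read
v = [5, 6, 7, 8]                   # channel v for instances 0..3, another read
assert vertical_to_horizontal([u, v]) == [(1, 5), (2, 6), (3, 7), (4, 8)]
```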
When the data buffer receives data returned from a read of the general-purpose data register file and it carries the last-data tag of an executable-granularity request, all data of that executable-granularity request is ready and the request can be executed.
At this point the request is sent to the request controller, informing it that the request is ready to execute. The request controller then obtains the instruction information from the instruction buffer through the instruction index, extracts information such as the valid masks of all instances corresponding to the request, and sends it to the request execution subunit for execution of that portion.
After execution finishes, the result data, along with the destination-address-related information, is written to the destination address of the resource specified by the destination data type.
In this process, the instructions extracted from the instruction buffer are disassembled, and their requests fetch data resources from resources in different cores; the return timing is inconsistent, so the instructions interleave out of order. However, the resources within one instruction are kept in order by the order in which each resource is received, and therefore the requests within an instruction must execute in order.
As shown in the upper portion of FIG. 8, the conventional serial execution process executes at instruction granularity: data must be fully ready before execution begins, leaving the pipeline in an instruction-serial configuration. The lower portion is the execution process of the invention, where the green parts are the split single read-operand requests, the red parts are single execution requests, and green and red overlap because the two operations proceed simultaneously. With this technical scheme, the invention obtains the effect shown in the lower part of FIG. 8: the processing granularity is reduced and the read-data information is buffered, so the data-request part of a read proceeds simultaneously with execution. To show the parallel process more clearly, the green read-operand part and the red execution part are drawn on two separate rows. In this process, time is mainly spent on the initial portion of operand reading; when four cores read operands simultaneously, the time spent on the operand-read portion is greatly reduced. The operands returned by each core are then sufficient to keep the execution pipeline saturated, guaranteeing its efficiency. Compared with the conventional approach in the upper half of FIG. 8, the overall lifetime of executing the same number of the same instructions is greatly reduced.
Because the conventional design can execute only after collecting all the data of a single instruction, the continuous large green area in the upper part of FIG. 8 is the time spent reading and collecting the operand data of one instruction. When the operands are ready, the whole instruction begins to execute, so the complete lifetime of an instruction is a continuous green-plus-red span. When the first green-plus-red span finishes, one instruction has completed and execution switches to the next instruction. Since a polling strategy is used when multiple cores issue instructions, the black rectangular area in FIG. 8 marks the span in which the second instruction issued by core 1 is fully executed. The instructions execute sequentially, and the total number of cycles along the horizontal axis is the time needed to finish executing the current run of consecutive instructions. FIG. 8 shows the process of alternately executing a total of 10 instructions.
It should be noted that the cycle count for executing 10 instructions is not always the same; in FIG. 8 it is about 90 cycles, determined by the randomly generated differences in the number of channels per instruction. For a single-channel instruction executed at a granularity of 16 instances, the 32 data items processed in parallel in SIMD32 state are read in 2 passes, with 1 cycle in the middle waiting for the data return (waiting for the instruction's last read request to complete), and executed in 2 passes: 5 steps in total. If the multi-channel data volume is large, a granularity of 4 instances may be used instead: the 32 data items processed in parallel in SIMD32 state are read in 8 passes, with 1 wait cycle in the middle, and executed in 8 passes: 17 steps in total. Under conventional serial execution, different instructions therefore show green regions of different lengths.
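These two step counts follow from simple arithmetic under the simplified model above (one step per read pass, one wait for the last return, one step per execute pass); serial_steps below is a name of this example only.

```python
# The per-instruction step counts quoted above, reproduced as arithmetic
# under the simplified model: one step per read pass, one wait cycle for the
# last return, one step per execute pass.
def serial_steps(simd_width: int, granularity: int) -> int:
    passes = simd_width // granularity     # split requests per instruction
    return passes + 1 + passes             # reads + wait + executes

assert serial_steps(32, 16) == 5           # single-channel case in the text
assert serial_steps(32, 4) == 17           # multi-channel case in the text
```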
In the proposed design, shown in the lower area of FIG. 8, the processing granularity becomes smaller and the data of each core is buffered at the same time, so the data-request part of a read proceeds simultaneously with execution. To show the process more clearly, the green read-operand part and the red execution part are drawn as two separate rows, small-granularity operand reads and small-granularity execution requests. Each stream processing core's issued instruction must first read its source operand portion, so each core's row begins with a green region. After a stretch of operand fetching, the red execution cycles begin to overlap the green data-fetch cycles on the horizontal axis: once the smaller execution granularity takes effect, the read of operand Req_3 and the execution of Req_0 for stream processing core 0 in FIG. 8 happen at the same time, achieving a true parallel effect with no need to wait for operand reading to fully complete (see Table 1). In this process, the issue time of each stream processing core is dominated only by the small leading portion of operand reading; when the four cores read operands simultaneously, the operand-read cycles overlap the execution-pipeline cycles, hiding the latency and saving a large amount of time. In the small-execution-granularity design, the red regions add up to a nearly continuous band on the horizontal axis, indicating that enough operands are available to keep the execution pipeline saturated, so the overall lifetime of executing the same number of instructions is greatly reduced compared with the conventional approach.
TABLE 1
(Table 1 appears only as an image, Figure BDA0003904385020000081, in the original publication; its contents are not available in this text.)
Examples
The out-of-order execution process of the invention applies to single-operand as well as multi-operand instructions; apart from differences in small-granularity request splitting and execution granularity, the overall processing is the same. As shown in FIG. 4, it comprises the following steps:
step 1, an instruction buffer receives an instruction;
step 2, selecting an out-of-order executable instruction;
step 3, splitting the instruction request and reading data;
step 4, receiving and storing data returned by the reading request;
step 5, controlling executable small-granularity requests;
step 6, extracting instruction information;
step 7, executing the executable small-granularity request;
step 8, outputting the result of the request for completing execution;
and 9, clearing the executed instruction.
Step 1 comprises: the execution pipeline first polls and receives the instructions issued simultaneously from multiple cores and writes the received instructions into the instruction buffer. If a core has not issued an instruction, that core's path to this execution pipeline is skipped. When the write of an instruction into the buffer completes, the instruction entry is obtained and serves as the index for looking up the instruction's specific information, which facilitates later extraction and execution of the related information. At the same time, the instruction index number, the executed single instruction multiple data thread number, the stream processing core number, and the priority information are placed into the instruction retrieval queue.
Step 2 comprises: the instruction selector selects from the instruction retrieval queue the oldest instruction of each independent stream processing core for processing, according to the priority information carried by the instruction, the stream processing core number, and the executed single instruction multiple data thread number. Different executed SIMD thread numbers may be processed in parallel, thereby obtaining an instruction index.
Step 3 comprises: according to the obtained instruction index information, the instruction is read and split internally into small-granularity requests, which are sent to the data memories, such as the general-purpose data register files, in the independent cores to fetch the corresponding operand data, carrying the stream processing core number, the executed single instruction multiple data thread number, the instruction index number, the instance start and instance count information, the source address information, the request index within the instruction, and the request-executable tag.
Step 4 comprises: when splitting, the instruction request splitting unit also applies to the data buffer for the location address at which the return data will be stored. When the data returns, it is received and written to the data buffer at that address. After operands are fetched, the order in which the cores' reads of the general-purpose data registers return is out of order, so the data buffer is written in polling fashion, and the request-executable flag bit is checked at the same time. If the request-executable tag carried in a returned small-granularity request is valid, the request is sent to the request controller to control the execution of the returned ready request.
Step 5 comprises: multiple requests may be ready at the same time; they enter the ready queue in ready order, the executable small-granularity request control reads the ready requests from the ready queue in sequence, and the requests are sent to the instruction information extraction module and the data buffer module to read the related information.
Step 6 comprises: when the corresponding small-granularity request is executed, its information is sent to the information extraction unit to extract the execution-state data, and the valid instruction request in the instruction buffer is located via the instruction index carried in the return. Then, according to the request's instance start position, instance count, and similar information, the valid instance mask, the instruction opcode, and other key information corresponding to the request are extracted and sent, together with the source data read under request control, to the request execution subunit for actual execution.
Step 7 comprises: the request execution subunit collects the opcode, the valid instance mask, the source data, and the instruction execution state information, then starts the related ALU computation and obtains the computation result.
Step 8 comprises: according to the instruction state, the computation result is written back to the general-purpose data register file or output to a designated buffer for storage.
Step 9 comprises: when a request finishes executing, the current instruction is checked to confirm whether all of its requests have completed. If not, the executable small-granularity requests continue to execute in the loop. If so, the corresponding entry in the instruction retrieval queue is retired, the subsequent instruction index entries are moved down, and the region is kept contiguous and available. At the same time, the corresponding instruction entry in the instruction buffer is cleared so that subsequent instructions can be accepted.
In this embodiment, when an instruction uses multiple operands, it is executed with finer-grained request splitting, divided in step 3 according to the executable-request granularity. The multiple operands complete the Vertical-to-Horizontal conversion in the data receive buffer. For texture sampling instructions or data load and store instructions, there is no dependence between the instructions received from multiple cores; a multi-channel source-data request is converted from Vertical to Horizontal in the data buffer, and multiple requests may be outstanding. When the last channel's data request returns, several executable granularities may be ready for execution at once, in which case the ready resources are processed in sequence, as shown in FIG. 7.
FIG. 9 shows the flow of processing and executing a specific instruction in this embodiment: a single instruction is split into multiple small-granularity processing requests; each executable small-granularity request enters the request controller as soon as it is ready and is dispatched to the execution pipeline for execution once its instruction information is ready.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit, where the computer storage medium can store a computer program which, when executed by the data processing unit, can perform some or all of the steps of the multi-core multi-pipeline parallel execution optimization method for a graphics processor provided in each embodiment of the invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied, essentially or in part, in the form of a computer program or software product, which may be stored in a storage medium and includes instructions for enabling a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU), or a network device) to execute the methods of the embodiments, or parts thereof, of the present invention.
The present invention provides a method for optimizing the parallel execution of multiple processing cores and multiple pipelines in a graphics processor, and there are many methods and approaches for implementing this technical solution; the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art may make several improvements and refinements without departing from the principle of the invention, and these should also be considered within the protection scope of the invention. All components not specified in this embodiment can be implemented with existing technology.

Claims (10)

1. A method for optimizing multi-core multi-pipeline parallel execution in a graphics processor, comprising the following steps:
step 1, an instruction buffer area receives and caches instructions;
step 2, selecting an out-of-order executable instruction;
step 3, splitting the instruction request and reading data;
step 4, receiving and storing data returned by the reading request;
step 5, controlling the executable small-granularity request;
step 6, extracting instruction information;
step 7, executing the executable small-granularity requests;
step 8, outputting the result of the request for completing execution;
and 9, clearing the executed instruction.
2. The method of claim 1, wherein step 1 comprises: the parallel pipeline first polls and receives the same type of external processing instructions issued simultaneously from the N stream processing cores of the graphics processor, and writes the received instructions into an instruction buffer dedicated to storing instruction information; if a stream processing core does not issue an instruction, the processing of that stream processing core and this execution pipeline is skipped; when the write of an instruction into the buffer completes, the instruction entry is obtained and serves as the index for looking up the instruction's specific information, and at the same time the instruction index number, the executed single instruction multiple data thread number, the stream processing core number, and the priority information are placed into an instruction retrieval queue.
3. The method of claim 2, wherein step 2 comprises: an instruction selector of the execution pipeline selects, from the instruction retrieval queue, the instructions corresponding to each independent stream processing core according to the priority information carried by the instructions, the stream processing core number, and the executed single instruction multiple data thread number.
4. The method of claim 3, wherein step 3 comprises: according to the obtained instruction index information, the instruction pipeline reads the instruction and splits it internally into executable small-granularity requests, which are sent to the data memory in each independent core to fetch the corresponding operand data, carrying the stream processing core number, the executed single instruction multiple data thread number, the instruction index number, the instance start and instance count information, the source address information, the request index within the instruction, and the request-executable tag.
5. The method of claim 4, wherein step 4 comprises: when the instruction request splitting unit in the instruction pipeline splits a request, it applies to the data buffer for the location address at which the returned data will be stored, and when a data read request returns, the data is received and written into the data buffer at that location address; after operands are fetched, the order in which each core's reads of the general-purpose data registers return is out of order, so the data buffer is written in polling fashion and the request-executable flag bit is checked; if the request-executable tag carried in a returned executable small-granularity request is valid, the request is sent to the request controller of the execution pipeline, which controls the execution of small-granularity requests according to the returned ready-state information.
6. The method of claim 5, wherein step 5 comprises: when two or more executable small-granularity requests are ready, the execution pipeline enqueues them into a ready queue in ready order; the executable small-granularity request control reads the ready requests from the ready queue in sequence, finds the instruction entries according to the instruction indexes, and then extracts the corresponding instruction information required for execution according to the required information.
7. The method of claim 6, wherein step 6 comprises: when the corresponding small-granularity request is executed, the request controller sends its information to the information extraction unit to extract the execution-state data, locates the valid instruction request in the instruction buffer via the instruction index carried in the return, extracts the valid instance mask and the instruction opcode corresponding to the request according to the request's instance start position and instance count, and sends them, together with the source data read under request control, to the request execution subunit for actual execution.
8. The method of claim 7, wherein step 7 comprises: the request execution subunit collects the opcode, the valid instance mask, the source data, and the instruction execution state information, then starts the pipeline computation and obtains the computation result.
9. The method of claim 8, wherein step 8 comprises: according to the state type in the instruction's destination address information, the pipeline's computation result is either written back to the general-purpose data register file or output to a designated buffer for storage.
10. The method of claim 9, wherein step 9 comprises: when a request finishes executing, the current instruction is checked to confirm whether all of its requests have completed; if not, each small-granularity request continues to execute in the loop; if so, the corresponding instruction entry in the instruction retrieval queue is retired, the subsequent instruction index entries are moved down to keep the region contiguous and available, and the corresponding instruction entry in the instruction buffer is cleared so that subsequently issued instructions can be received.
Application CN202211300379.9A, filed 2022-10-24 (priority date 2022-10-24), published as CN115640052A (Pending): Multi-core multi-pipeline parallel execution optimization method for graphics processor.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211300379.9A | 2022-10-24 | 2022-10-24 | Multi-core multi-pipeline parallel execution optimization method for graphics processor

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211300379.9A | 2022-10-24 | 2022-10-24 | Multi-core multi-pipeline parallel execution optimization method for graphics processor

Publications (1)

Publication Number | Publication Date
CN115640052A | 2023-01-24

Family

ID=84945667

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211300379.9A (Pending; published as CN115640052A) | Multi-core multi-pipeline parallel execution optimization method for graphics processor | 2022-10-24 | 2022-10-24

Country Status (1)

Country Link
CN (1) CN115640052A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115993949A (en) * 2023-03-21 2023-04-21 苏州浪潮智能科技有限公司 Vector data processing method and device for multi-core processor
CN116485629A (en) * 2023-06-21 2023-07-25 芯动微电子科技(珠海)有限公司 Graphic processing method and system for multi-GPU parallel geometry processing
CN117193858A (en) * 2023-11-07 2023-12-08 芯来智融半导体科技(上海)有限公司 Method, device, equipment and storage medium for receiving and sending access/fetch instruction
CN117193858B (en) * 2023-11-07 2024-03-15 芯来智融半导体科技(上海)有限公司 Method, device, equipment and storage medium for receiving and sending access/fetch instruction

Similar Documents

Publication Publication Date Title
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
US11710209B2 (en) Multi-thread graphics processing system
CN115640052A (en) Multi-core multi-pipeline parallel execution optimization method for graphics processor
CN110908716B (en) Method for implementing vector aggregation loading instruction
US20240045593A1 (en) Apparatus and method for accessing data, processing apparatus and computer system
US20230267000A1 (en) Processing apparatus and system for executing data processing on a plurality of pieces of channel information
CN115129480B (en) Scalar processing unit and access control method thereof
US20230267079A1 (en) Processing apparatus, method and system for executing data processing on a plurality of channels
US20150143378A1 (en) Multi-thread processing apparatus and method for sequentially processing threads
CN113900712B (en) Instruction processing method, instruction processing apparatus, and storage medium
CN113138804B (en) Stream processor for extracting stream data characteristics in transmission process and implementation method thereof
CN116820333B (en) SSDRAID-5 continuous writing method based on multithreading
CN116185497B (en) Command analysis method, device, computer equipment and storage medium
CN117472439A (en) System and method for realizing atomic operation in processor
CN115905038A (en) Cache data reading method and device, computer equipment and storage medium
CN116244005A (en) Multithreading asynchronous data transmission system and method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination