CN117492965A - Thread group scheduling method, general graphics processor and storage medium - Google Patents


Info

Publication number
CN117492965A
CN117492965A (application CN202311590204.0A)
Authority
CN
China
Prior art keywords
thread group
thread
sequence
target
group sequence
Prior art date
Legal status
Pending
Application number
CN202311590204.0A
Other languages
Chinese (zh)
Inventor
王明羽
王旺广
张奕聪
虞志益
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202311590204.0A
Publication of CN117492965A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching

Abstract

The application discloses a thread group scheduling method, a general-purpose graphics processor, and a storage medium. The method includes the following steps: acquiring first data locality information, sorting a first initial thread group sequence to obtain a first transition thread group sequence, and sorting a second initial thread group sequence to obtain a second transition thread group sequence. The sorting operation includes: traversing each thread group and, when a priority thread group is determined to exist according to the first data locality information, raising the priority of that thread group and sorting the thread groups by priority, where a priority thread group is a thread group that requests more cache lines of the same prefetch block than a preset request-number threshold, or one of several thread groups that simultaneously request the same prefetch block and together request more cache lines than that threshold. The first and second transition thread group sequences are then merged to obtain a target thread group sequence, and a scheduling operation is performed on the target thread group in the target thread group sequence.

Description

Thread group scheduling method, general graphics processor and storage medium
Technical Field
The present disclosure relates to the field of general-purpose graphics processors, and in particular to a thread group scheduling method, a general-purpose graphics processor, and a storage medium.
Background
With the growing demand for processor performance in the big-data era, general-purpose graphics processors, with their high computing power, are widely used across general-purpose computing fields.
From a hardware perspective, a general-purpose graphics processor consists of multiple streaming multiprocessors, each of which contains multiple stream processors; the stream processors can be understood as the computing units of the device. The pipeline inside each streaming multiprocessor is divided into five stages: fetch, decode, issue, execute, and write-back, with the thread group as the unit of execution, each thread group consisting of 32 threads. The fetch stage fetches one or two instructions from the instruction cache for a ready thread group. The decode stage decodes the fetched instructions and stores them in the instruction buffer of that thread group. The issue stage selects a ready thread group according to the thread group scheduling policy and issues instructions from its instruction buffer into the execute stage. The execute stage comprises pipelines for compute instructions and memory access instructions: compute instructions perform their operations in the stream processors, while memory access instructions enter the load/store unit to access the data cache. The write-back stage writes the result of an executed instruction back to its designated location. Thread scheduling in a general-purpose graphics processor refers to the thread group scheduling policy of the issue stage, which determines the order of instruction execution within the processor.
However, in existing thread scheduling methods, contention among the many concurrent memory accesses of a general-purpose graphics processor causes frequent cache thrashing, which degrades the processor's performance.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the prior art. To this end, the application provides a thread group scheduling method, a general-purpose graphics processor, and a storage medium that alleviate the frequent cache thrashing of existing thread scheduling methods.
The thread group scheduling method according to the first aspect of the embodiments of the present application includes the following steps: acquiring first data locality information, sorting a first initial thread group sequence to obtain a first transition thread group sequence, and sorting a second initial thread group sequence to obtain a second transition thread group sequence, where the first initial thread group sequence includes multiple compute thread groups (thread groups that execute compute instructions) and the second initial thread group sequence includes multiple memory access thread groups (thread groups that execute memory access instructions);

wherein the sorting operation includes: traversing each thread group and, when a priority thread group is determined to exist according to the first data locality information, raising the priority of that thread group and sorting the thread groups by priority. A prefetch block comprises multiple cache lines, and the first data locality information represents, for each prefetch block requested by a thread group, the number of its cache lines held in the miss status holding register (MSHR) unit. A priority thread group is a thread group that requests more cache lines of the same prefetch block than a preset request-number threshold, or one of several thread groups that simultaneously request the same prefetch block and together request more cache lines than that threshold;

merging the first transition thread group sequence and the second transition thread group sequence to obtain a target thread group sequence;

and performing a scheduling operation on a target thread group in the target thread group sequence, where the target thread group is the thread group to be scheduled in the target thread group sequence.
The thread group scheduling method according to the embodiment of the first aspect of the application has at least the following advantages:
According to the first data locality information, the thread groups with data locality in the first and second initial thread group sequences are identified as priority thread groups and their priority is raised; the two initial sequences are then each sorted by priority to obtain the first and second transition thread group sequences, which are merged into a target thread group sequence, and a scheduling operation is performed on the thread group to be scheduled in that sequence. Compared with conventional thread scheduling methods, the thread group scheduling method of the first aspect preferentially schedules thread groups with data locality, reducing cache thrashing and improving the performance of the general-purpose graphics processor.
According to some embodiments of the application, the method further comprises:

acquiring second data locality information, where the second data locality information represents, for each prefetch block requested by a thread group, the total number of its cache lines held in the miss status holding register (MSHR) unit and in the level-one data cache;

identifying a target prefetch block among the prefetch blocks according to the second data locality information, where the target prefetch block is a prefetch block whose combined cache line count in the MSHR unit and the level-one data cache exceeds a preset prefetch block locality threshold;

generating a prefetch request, where the prefetch request asks that the to-be-fetched cache lines of the target prefetch block be sent to the level-one data cache, the to-be-fetched cache lines being those that have not been sent to the MSHR unit or the level-one data cache;

and sending the prefetch request to the memory system.
According to some embodiments of the present application, the generating of the prefetch request includes:

acquiring the occupancy of the miss status holding register (MSHR) unit, and generating the prefetch request if the occupancy is below a preset first data-amount threshold.
According to some embodiments of the present application, the sending of the prefetch request to the memory system includes:

storing the prefetch request in the load/store unit;

acquiring the number of memory access requests in the load/store unit;

and, when the number of memory access requests is zero, sending the prefetch request to the memory system.
According to some embodiments of the present application, the sorting of the thread groups according to their priorities includes:

performing fine-grained sorting of the thread groups according to their priorities.
According to some embodiments of the present application, after the scheduling operation is performed on the target thread group in the target thread group sequence, the method further includes: and reducing the priority of the target thread group.
According to some embodiments of the present application, the merging of the first transition thread group sequence and the second transition thread group sequence includes:

acquiring the occupancy of the miss status holding register (MSHR) unit;

and, if the occupancy exceeds a preset second data-amount threshold, appending the second transition thread group sequence to the tail of the first transition thread group sequence, otherwise appending the first transition thread group sequence to the tail of the second transition thread group sequence, to obtain the target thread group sequence.
According to some embodiments of the present application, the scheduling operation for the target thread group in the target thread group sequence includes:
numbering a plurality of thread groups in the target thread group sequence in sequence;
and performing scheduling operation on a plurality of thread groups in the target thread group sequence according to the numbering order.
A general-purpose graphics processor according to an embodiment of the second aspect of the present application is configured to execute the thread group scheduling method described above, and has at least the following advantages:
According to the first data locality information, the thread groups with data locality in the first and second initial thread group sequences are identified as priority thread groups and their priority is raised; the two initial sequences are then each sorted by priority to obtain the first and second transition thread group sequences, which are merged into a target thread group sequence, and a scheduling operation is performed on the thread group to be scheduled in that sequence. Compared with a conventional general-purpose graphics processor, the general-purpose graphics processor of the second aspect preferentially schedules thread groups with data locality, reducing cache thrashing and improving performance.
A computer-readable storage medium according to an embodiment of the third aspect of the present application stores a processor-executable program which, when executed by a processor, implements the thread group scheduling method described above.
The computer-readable storage medium according to the embodiment of the third aspect of the present application has at least the following beneficial effects:
According to the first data locality information, the thread groups with data locality in the first and second initial thread group sequences are identified as priority thread groups and their priority is raised; the two initial sequences are then each sorted by priority to obtain the first and second transition thread group sequences, which are merged into a target thread group sequence, and a scheduling operation is performed on the thread group to be scheduled in that sequence. Compared with a conventional general-purpose graphics processor, the computer-readable storage medium of the third aspect enables preferential scheduling of thread groups with data locality, reducing cache thrashing and improving the performance of the general-purpose graphics processor.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The application is further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method of thread group scheduling in an embodiment of the present application;
FIG. 2 is a flow chart of prefetch request generation and transmission in one embodiment of the present application;
FIG. 3 is a flow chart of sending a prefetch request to a memory system according to one embodiment of the present application;
FIG. 4 is a flow chart of merging the first transition thread group sequence and the second transition thread group sequence in one embodiment of the present application;
FIG. 5 is a flow chart of a scheduling operation for a target thread group in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
In the description of the present application, it should be understood that descriptions of orientation or positional relationships, such as above and below, are based on the orientations shown in the drawings; they are intended only to simplify the description, do not indicate or imply that the apparatus or element in question must have a specific orientation or be constructed and operated in a specific orientation, and should therefore not be construed as limiting the present application.
In the description of the present application, "plural" means two or more. Terms such as "first" and "second" are used only to distinguish technical features and should not be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present application, unless explicitly defined otherwise, terms such as "arrangement", "installation", and "electrical connection" should be construed broadly; their specific meaning in the present application can be reasonably determined by a person skilled in the art in light of the specific content of the technical solution.
A thread group scheduling method, a general-purpose graphic processor, and a storage medium according to an embodiment of the present application are described below with reference to fig. 1 to 5.
The thread group scheduling method in the embodiment of the present application, as shown in fig. 1, includes:
step S100: the method comprises the steps of obtaining first data locality information, performing sequencing operation on a first initial thread group sequence to obtain a first transition thread group sequence, and performing sequencing operation on a second initial thread group sequence to obtain a second transition thread group sequence, wherein the first initial thread group sequence comprises a plurality of calculation thread groups, each calculation thread group is a thread group for executing a calculation instruction, and the second initial thread group sequence comprises a plurality of access thread groups, each access thread group is a thread group for executing an access instruction;
wherein the sorting operation comprises: traversing each thread group, when the existence of the priority thread group is determined according to the first data locality information, the priority of the priority thread group is improved, the priority thread group is used for sequencing a plurality of thread groups, the prefetching block comprises a plurality of cache lines, the first data locality information is used for representing the cache line number of a register storage unit in a missing information state of the prefetching block requested by the thread group, the priority thread group is a thread group with the cache line number of the same prefetching block being greater than a preset request number threshold, or is a plurality of thread groups simultaneously requesting the same prefetching block, and the cache line number requested by the thread groups is greater than a preset request number threshold;
in this step, according to the first data locality information, determining a thread group having data locality in the first initial thread group sequence and the second initial thread group sequence, and increasing the priority of the thread group having data locality, so as to sort the plurality of thread groups in the first initial thread group sequence and the second initial thread group sequence according to the priority to obtain a first transition thread group sequence and a second transition thread group sequence, so that the thread group having data locality is preferentially scheduled.
It can be appreciated that the preset request-number threshold is not limited. For example, with a threshold of 2, when a thread group requests more than 2 cache lines of the same prefetch block, it is confirmed to have intra-group data locality and is thus a priority thread group; when several thread groups simultaneously request more than 2 cache lines in the same prefetch block, those thread groups are confirmed to have inter-group data locality and are all priority thread groups.
It should be noted that the number of cache lines contained in one prefetch block is not limited; for example, a prefetch block may contain 5 or 6 cache lines, and those cache lines may be contiguous or non-contiguous.
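As a concrete illustration, the two locality conditions above can be sketched in code. This is a minimal sketch under stated assumptions, not the patent's implementation: the bookkeeping shape (`{group_id: {prefetch_block: set_of_cache_lines}}`), the function name, and the threshold value 2 (borrowed from the example above) are all hypothetical.

```python
from collections import defaultdict

REQUEST_THRESHOLD = 2  # preset request-number threshold (example value from the text)

def find_priority_groups(mshr_requests):
    """Return the ids of thread groups that exhibit intra- or inter-group locality.
    mshr_requests maps group_id -> {prefetch_block: set of requested cache lines}
    (a hypothetical view of the MSHR unit's contents)."""
    priority = set()
    # Intra-group locality: one group requests more than the threshold number
    # of cache lines of the same prefetch block.
    for gid, blocks in mshr_requests.items():
        if any(len(lines) > REQUEST_THRESHOLD for lines in blocks.values()):
            priority.add(gid)
    # Inter-group locality: several groups request the same prefetch block and
    # together ask for more than the threshold number of cache lines.
    per_block = defaultdict(lambda: (set(), set()))  # block -> (groups, lines)
    for gid, blocks in mshr_requests.items():
        for block, lines in blocks.items():
            groups, all_lines = per_block[block]
            groups.add(gid)
            all_lines |= lines
    for groups, all_lines in per_block.values():
        if len(groups) > 1 and len(all_lines) > REQUEST_THRESHOLD:
            priority |= groups
    return priority
```

For instance, a group requesting three lines of one block qualifies on its own, while two groups that split three lines of a shared block both qualify together.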
Step S200: combining the first transition thread group sequence and the second transition thread group sequence to obtain a target thread group sequence;
step S300: and carrying out scheduling operation on a target thread group in the target thread group sequence, wherein the target thread group is a thread group to be scheduled in the target thread group sequence.
In this step, the instructions of the thread groups in the target thread group sequence are scheduled in priority order; in other words, the thread group with the highest priority in the target thread group sequence is scheduled first.
After the scheduling operation on the target thread group completes, the above steps are executed again for the next scheduling operation.
In this embodiment, according to the first data locality information, the thread groups with data locality in the first and second initial thread group sequences are identified as priority thread groups and their priority is raised; the two initial sequences are each sorted by priority to obtain the first and second transition thread group sequences, which are merged into a target thread group sequence, and a scheduling operation is performed on the thread group to be scheduled in that sequence. Compared with conventional thread scheduling methods, the thread group scheduling method of this embodiment preferentially schedules thread groups with data locality, reducing cache thrashing and improving the performance of the general-purpose graphics processor.
In one embodiment of the present application, as shown in fig. 2, the thread group scheduling method further includes, but is not limited to, step S400, step S500, step S600, and step S700.
Step S400: acquiring second data locality information, where the second data locality information represents, for each prefetch block requested by a thread group, the total number of its cache lines held in the miss status holding register (MSHR) unit and in the level-one data cache;
In this step, acquiring the second data locality information reveals how many cache lines of each prefetch block are in the level-one data cache and the MSHR unit, so that prefetch blocks with data locality can be identified.
Step S500: identifying a target prefetch block among the prefetch blocks according to the second data locality information, where the target prefetch block is a prefetch block whose combined cache line count in the MSHR unit and the level-one data cache exceeds a preset prefetch block locality threshold;
In this step, the target prefetch block is identified among the prefetch blocks according to the second data locality information: if the combined number of a prefetch block's cache lines in the MSHR unit and the level-one data cache exceeds the preset prefetch block locality threshold, that prefetch block has data locality and is a target prefetch block.
It should be noted that the value of the preset prefetch block locality threshold is not limited; for example, it may be 1, 3, 4, and so on.
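Step S500 then reduces to a simple filter. The sketch below assumes hypothetical bookkeeping (each argument maps a prefetch block to the cache lines tracked in the MSHR unit or resident in the L1 data cache), a hypothetical function name, and the example threshold value 3.

```python
PREFETCH_LOCALITY_THRESHOLD = 3  # preset prefetch-block locality threshold (example value)

def select_target_blocks(mshr_lines, l1_lines):
    """Return prefetch blocks whose combined cache-line count in the MSHR unit
    and the L1 data cache exceeds the locality threshold."""
    targets = []
    for block in set(mshr_lines) | set(l1_lines):
        total = len(mshr_lines.get(block, set())) + len(l1_lines.get(block, set()))
        if total > PREFETCH_LOCALITY_THRESHOLD:
            targets.append(block)
    return sorted(targets)
```

A block with two lines in the MSHR unit and two in the L1 cache (4 > 3) would be selected, while one line in each (2) would not.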
Step S600: generating a prefetch request, where the prefetch request asks that the to-be-fetched cache lines of the target prefetch block be sent to the level-one data cache, the to-be-fetched cache lines being those that have not been sent to the miss status holding register (MSHR) unit or the level-one data cache;
step S700: the prefetch request is sent to the memory system.
In this embodiment, the prefetch blocks with data locality are identified via the second data locality information and a prefetch request is generated, so that all cache lines of a target prefetch block other than those already in the miss status holding register (MSHR) unit or the level-one data cache are prefetched into the level-one data cache, reducing on-chip storage resource contention and cache thrashing.
It should be noted that the to-be-fetched cache lines reside in the level-two data cache, DRAM, or another cache structure.
In one embodiment of the present application, the "generate prefetch request" in step S600 is further described, and step S600 includes, but is not limited to, step S610.
Step S610: acquiring the occupancy of the miss status holding register (MSHR) unit, and generating the prefetch request if the occupancy is below a preset first data-amount threshold.
In this embodiment, on existing general-purpose graphics processor architectures, the level-one data cache rejects new access requests when on-chip storage resources are saturated, serializing subsequent memory accesses and reducing the processor's memory-level parallelism. Therefore, the occupancy of the MSHR unit is checked: if it exceeds the preset first data-amount threshold, on-chip storage resources are saturated, generation of prefetch requests is stopped, and the thread groups' contention for storage resources is prevented from worsening.
It will be appreciated that the value of the preset first data-amount threshold is not limited.
In one embodiment of the present application, the sending of the prefetch request to the memory system in step S700 is further described; as shown in fig. 3, step S700 includes, but is not limited to, step S710, step S720, and step S730.
Step S710: storing the prefetch request in the load/store unit;
Step S720: acquiring the number of memory access requests in the load/store unit;
Step S730: when the number of memory access requests is zero, sending the prefetch request to the memory system.
In this embodiment, the prefetch request is buffered in the load/store unit; when the load/store unit has no memory access request ready to send, the prefetch request is sent to the memory system, requesting that the to-be-fetched cache lines of the target prefetch block be brought into the level-one data cache.
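Steps S610 and S710–S730 can be sketched together as below. The threshold value, the per-block line count, and the list-based queue are assumptions for illustration only; in hardware this state lives in the MSHR unit and the load/store unit.

```python
MSHR_CAPACITY_THRESHOLD = 8  # preset first data-amount threshold (assumed value)
BLOCK_LINES = 4              # cache lines per prefetch block (assumed; the text allows 5, 6, ...)

def make_prefetch_request(block, mshr_lines, l1_lines, mshr_occupancy):
    """Step S610 sketch: suppress generation when the MSHR unit is near saturation;
    otherwise request the block's cache lines not yet in the MSHR unit or L1."""
    if mshr_occupancy >= MSHR_CAPACITY_THRESHOLD:
        return None  # on-chip storage saturated: stop generating prefetch requests
    pending = mshr_lines.get(block, set()) | l1_lines.get(block, set())
    to_fetch = set(range(BLOCK_LINES)) - pending
    return (block, to_fetch) if to_fetch else None

def issue_prefetch(lsu_queue, prefetch_request, pending_demand_requests):
    """Steps S710-S730 sketch: buffer the prefetch in the load/store unit and
    release it to the memory system only when no demand request is waiting."""
    lsu_queue.append(prefetch_request)
    if pending_demand_requests == 0:
        return lsu_queue.pop(0)  # sent to the memory system
    return None                  # prefetch stays buffered behind demand traffic
```

With this shape, demand memory accesses always take precedence and prefetches consume only idle memory bandwidth.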
In one embodiment of the present application, step S100 is further described as "sorting the plurality of thread groups according to the priority of the plurality of thread groups", and step S100 includes, but is not limited to, step S110.
Step S110: performing fine-grained sorting of the thread groups according to their priorities.
In this embodiment, the thread groups are sorted at fine granularity based on their priorities, and thread groups with higher priority are scheduled first.
In an embodiment of the present application, after the scheduling operation is performed on the target thread group in the target thread group sequence in step S300, the method further includes, but is not limited to, step S310.
Step S310: the priority of the target thread group is reduced.
In this embodiment, after the target thread group is scheduled, its priority is lowered to avoid executing the same target thread group continuously.
It should be noted that the thread groups are given the same initial priority score, and a thread group's priority is determined by its score. When a thread group is confirmed as a priority thread group, a first score value is added to its score to raise its priority; after the target thread group is scheduled, a second score value is subtracted to lower its priority.
It will be appreciated that the second score value is subtracted after the target thread group is scheduled, and the score is not reduced once it has fallen back to the initial priority score.
Note that the initial priority score, the first score value, and the second score value are not limited; for example, the initial priority score may be 100, the first score value 30, and the second score value 1.
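With the example values above (initial score 100, first score value 30, second score value 1), the score mechanics might look like this; the function and variable names are illustrative, not taken from the patent.

```python
INITIAL_SCORE = 100  # initial priority score (example value from the text)
BOOST = 30           # first score value: added when a group is identified as a priority group
DECAY = 1            # second score value: subtracted after the group is scheduled

def boost_priority(scores, gid):
    """Raise a priority thread group's score by the first score value."""
    scores[gid] = scores.get(gid, INITIAL_SCORE) + BOOST

def decay_priority(scores, gid):
    """Lower a scheduled group's score, but never below the initial score."""
    scores[gid] = max(INITIAL_SCORE, scores.get(gid, INITIAL_SCORE) - DECAY)
```

A boosted group thus keeps an edge for roughly BOOST/DECAY scheduling rounds before it falls back to the baseline.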
In one embodiment of the present application, step S200 is further described as "merging the first transition thread group sequence and the second transition thread group sequence", and as shown in fig. 4, step S200 includes, but is not limited to, step S210 and step S220.
Step S210: acquiring the occupancy of the miss status holding register (MSHR) unit;
Step S220: if the occupancy exceeds a preset second data-amount threshold, appending the second transition thread group sequence to the tail of the first transition thread group sequence, otherwise appending the first transition thread group sequence to the tail of the second transition thread group sequence, to obtain the target thread group sequence.
It should be noted that, the existing thread scheduling method basically adopts a polling scheduling policy, and each thread group has an equal scheduling opportunity. However, under the polling scheduling policy, the situation that the computing instructions of each computing thread group are sequentially executed and then the memory accessing instructions of each memory accessing thread group are sequentially executed easily occurs, and at this time, the instructions of all thread groups are long-delay memory accessing instructions, and no more schedulable thread groups are used for hiding the delay, so that the memory accessing delay in the general graphics processor is not effectively hidden, and the performance is lost.
In this embodiment, the amount of data in the storage unit of the missing information state holding register is acquired. If the amount of data is smaller than the preset second data amount threshold, the first transition thread group sequence is appended to the tail of the second transition thread group sequence, so that the memory-access thread groups are scheduled preferentially and their memory-access latency is effectively hidden by the compute instructions of the compute thread groups. If the amount of data is greater than the preset second data amount threshold, the utilization of on-chip storage resources is close to saturation; the second transition thread group sequence is then appended to the tail of the first transition thread group sequence, so that the compute thread groups are scheduled and executed preferentially, which alleviates contention for on-chip storage resources, reduces cache thrashing, and improves the performance of the graphics processor. It will be appreciated that the value of the preset second data amount threshold is not limited.
It should be noted that the preset first data amount threshold may be the same as the preset second data amount threshold; that is, when the amount of data in the storage unit of the missing information state holding register is detected to be greater than the preset first data amount threshold, generation of prefetch requests is stopped and the priority of the compute thread groups is raised, effectively reducing contention for on-chip storage resources.
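The merge decision of steps S210 and S220 can be sketched as follows; the function signature and the representation of the register occupancy as a simple entry count are assumptions made for illustration only.

```python
def merge_sequences(compute_seq, memory_seq, mshr_entries, threshold):
    """Sketch of steps S210/S220: merge the two transition sequences into
    the target thread group sequence. `mshr_entries` stands in for the
    amount of data in the missing information state holding register
    storage unit, and `threshold` for the preset second data amount
    threshold."""
    if mshr_entries > threshold:
        # On-chip storage is near saturation: append the memory-access
        # sequence after the compute sequence, so compute groups run first.
        return compute_seq + memory_seq
    # Otherwise schedule the memory-access groups first, so that their
    # latency can be hidden by the compute groups' compute instructions.
    return memory_seq + compute_seq
```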
Some embodiments of the present application further describe "performing a scheduling operation on a target thread group in the target thread group sequence" in step S300. As shown in fig. 5, step S300 includes, but is not limited to, step S310 and step S320.
Step S310: numbering a plurality of thread groups in a target thread group sequence in sequence;
step S320: and scheduling the plurality of thread groups in the target thread group sequence according to the numbering order.
In this embodiment, the plurality of thread groups in the target thread group sequence are numbered in ascending order, so that the target thread group with the smallest number is scheduled first according to the thread group numbering information.
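Steps S310 and S320 can be sketched as follows; the function name and the list representation of the sequence are illustrative assumptions rather than the patent's implementation.

```python
def schedule_in_order(target_sequence):
    """Sketch of steps S310/S320: number the thread groups in the target
    thread group sequence from small to large, then dispatch the
    lowest-numbered group first."""
    numbered = {i: group for i, group in enumerate(target_sequence)}  # S310
    dispatched = []
    while numbered:
        smallest = min(numbered)  # smallest remaining number is scheduled next (S320)
        dispatched.append(numbered.pop(smallest))  # stand-in for issuing the group
    return dispatched
```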
In addition, an embodiment of the present application also discloses a general-purpose graphics processor configured to execute the thread group scheduling method described above.
In this embodiment, according to the first data locality information, thread groups in the first initial thread group sequence and the second initial thread group sequence that exhibit data locality are identified as priority thread groups, and the priority of the priority thread groups is raised. The first initial thread group sequence and the second initial thread group sequence are each sorted according to priority to obtain a first transition thread group sequence and a second transition thread group sequence, respectively; the first transition thread group sequence and the second transition thread group sequence are merged to obtain a target thread group sequence; and a scheduling operation is performed on the thread group to be scheduled in the target thread group sequence. The general-purpose graphics processor provided by the embodiment of the application can preferentially schedule thread groups with data locality, reducing cache thrashing and improving the performance of the general-purpose graphics processor.
In addition, an embodiment of the present application further discloses a computer-readable storage medium storing a processor-executable program, where the processor-executable program is used to implement the thread group scheduling method described above.
In this embodiment, according to the first data locality information, thread groups in the first initial thread group sequence and the second initial thread group sequence that exhibit data locality are identified as priority thread groups, and the priority of the priority thread groups is raised. The first initial thread group sequence and the second initial thread group sequence are each sorted according to priority to obtain a first transition thread group sequence and a second transition thread group sequence, respectively; the first transition thread group sequence and the second transition thread group sequence are merged to obtain a target thread group sequence; and a scheduling operation is performed on the thread group to be scheduled in the target thread group sequence. The computer-readable storage medium of the embodiment of the application can preferentially schedule thread groups with data locality, reducing cache thrashing and improving the performance of the general-purpose graphics processor.
Those of ordinary skill in the art will appreciate that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVDs) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present application.

Claims (10)

1. A method for scheduling a thread group, comprising:
acquiring first data locality information, performing a sorting operation on a first initial thread group sequence to obtain a first transition thread group sequence, and performing the sorting operation on a second initial thread group sequence to obtain a second transition thread group sequence, wherein the first initial thread group sequence comprises a plurality of compute thread groups, a compute thread group being a thread group that executes compute instructions, and the second initial thread group sequence comprises a plurality of memory-access thread groups, a memory-access thread group being a thread group that executes memory-access instructions;
wherein the sorting operation comprises: traversing each thread group; when it is determined according to the first data locality information that a priority thread group exists, raising the priority of the priority thread group; and sorting the plurality of thread groups according to the priorities of the thread groups; wherein a prefetch block comprises a plurality of cache lines, the first data locality information represents the number of cache lines of a prefetch block requested by a thread group that are held in the missing information state holding register storage unit, and the priority thread group is a thread group for which the number of requested cache lines of the same prefetch block is greater than a preset request number threshold, or a thread group that requests the same prefetch block simultaneously with another thread group and whose number of requested cache lines is greater than the preset request number threshold;
combining the first transition thread group sequence and the second transition thread group sequence to obtain a target thread group sequence;
and carrying out scheduling operation on a target thread group in the target thread group sequence, wherein the target thread group is the thread group to be scheduled in the target thread group sequence.
2. The thread group scheduling method of claim 1, further comprising:
obtaining second data locality information, wherein the second data locality information represents, for a prefetch block requested by a thread group, the sum of the numbers of its cache lines held in the missing information state holding register storage unit and in a first-level data cache;
determining a target prefetch block according to the second data locality information and a plurality of prefetch blocks, wherein the target prefetch block is a prefetch block for which the sum of the numbers of cache lines in the missing information state holding register storage unit and the first-level data cache is greater than a preset prefetch block locality threshold;
generating a prefetch request, wherein the prefetch request is used for requesting that a to-be-fetched cache line of the target prefetch block be sent to the first-level data cache, the to-be-fetched cache line being a cache line that has not been sent to the missing information state holding register storage unit or the first-level data cache;
and sending the prefetch request to a memory system.
3. The thread group scheduling method of claim 2, wherein generating the prefetch request comprises:
acquiring the amount of data in the storage unit of the missing information state holding register, and generating the prefetch request if the amount of data is smaller than a preset first data amount threshold.
4. The thread group scheduling method of claim 2, wherein the sending the prefetch request to a memory system comprises:
storing the prefetch request in a load store unit;
acquiring the number of memory access requests in the load store unit;
and when the number of memory access requests is zero, sending the prefetch request to the memory system.
5. The thread group scheduling method of claim 1, wherein said ordering a plurality of said thread groups according to their priorities comprises:
performing fine-grained sorting on the plurality of thread groups according to the priorities of the plurality of thread groups.
6. The thread group scheduling method of claim 1, wherein after performing the scheduling operation on the target thread group in the target thread group sequence, further comprising: and reducing the priority of the target thread group.
7. The thread group scheduling method of claim 1, wherein the merging the first transition thread group sequence and the second transition thread group sequence comprises:
acquiring the amount of data in the storage unit of the missing information state holding register;
and if the amount of data is greater than a preset second data amount threshold, appending the second transition thread group sequence to the tail of the first transition thread group sequence; otherwise, appending the first transition thread group sequence to the tail of the second transition thread group sequence to obtain the target thread group sequence.
8. The thread group scheduling method of claim 1, wherein the scheduling the target thread group in the target thread group sequence comprises:
numbering a plurality of thread groups in the target thread group sequence in sequence;
and performing scheduling operation on a plurality of thread groups in the target thread group sequence according to the numbering order.
9. A general purpose graphics processor configured to perform the thread group scheduling method of any one of claims 1 to 8.
10. A computer readable storage medium, wherein a processor executable program is stored, the processor executable program when executed by a processor being for implementing the thread group scheduling method of any one of claims 1 to 8.
CN202311590204.0A 2023-11-24 2023-11-24 Thread group scheduling method, general graphics processor and storage medium Pending CN117492965A (en)


Publications (1)

Publication Number Publication Date
CN117492965A true CN117492965A (en) 2024-02-02

Family

ID=89678215



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination