CN112083957B - Bandwidth control device, multithread controller system and memory access bandwidth control method - Google Patents


Info

Publication number
CN112083957B
CN112083957B (application CN202010991780.6A)
Authority
CN
China
Prior art keywords
thread
rate
processor core
bandwidth control
limiting
Prior art date
Legal status
Active
Application number
CN202010991780.6A
Other languages
Chinese (zh)
Other versions
CN112083957A (en)
Inventor
姚涛
时兴
贾琳黎
林江
Current Assignee
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202010991780.6A
Publication of CN112083957A
Application granted
Publication of CN112083957B
Legal status: Active
Anticipated expiration

Classifications

    • G06F9/30047: Prefetch instructions; cache control instructions
    • G06F12/0811: Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G06F12/0842: Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G06F2212/1016: Performance improvement
    • G06F2212/6042: Allocation of cache space to multiple users or processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application provides a bandwidth control device, a multithreaded processor system, and a memory access bandwidth control method. The bandwidth control device is connected to the LLC and to the processor cores; each processor core supports multithreading and communicates with a multi-level cache. The bandwidth control device obtains a first memory access instruction sent by the LLC to a lower-level storage unit, determines a first processing priority corresponding to the first thread identifier carried by that instruction, and determines the limiting rate at which the first thread may send memory access instructions after a preset clock period. The bandwidth control device then sends the limiting rate to the first processor core, instructing it to limit, according to the limiting rate, the number of memory access instructions sent by the first thread within a preset clock cycle. Because the memory access bandwidth of low-priority threads is limited at the processor core, more cache resources remain available for the memory access instructions generated by high-priority threads, so that the bandwidth resources of low-priority threads are restricted while high-priority threads run smoothly.

Description

Bandwidth control device, multithread controller system and memory access bandwidth control method
Technical Field
The application relates to the field of computers, in particular to a bandwidth control device, a multithread controller system and a memory access bandwidth control method.
Background
In a multi-core, multithreaded processor, cache bandwidth management that supports quality of service (QoS) enables programs to execute in a more orderly manner. QoS provides better service to high-priority threads by limiting the bandwidth resources occupied by low-priority threads.
In practice, however, multiple threads often share resources in the multi-level cache system of a processor, so there is resource competition between threads. Because the last-level cache (LLC) must serve multiple processor cores and multiple threads, resource competition in the LLC is particularly intense. Moreover, when low-priority threads are restricted only in the LLC, they still occupy various resources inside the LLC for long periods, leaving fewer resources available to high-priority threads; requests from high-priority threads may even be blocked because resources are occupied. The prior art therefore cannot limit the bandwidth resources of low-priority threads without interfering with the operation of high-priority threads.
Disclosure of Invention
An objective of the embodiments of the present application is to provide a bandwidth control device, a multithreaded processor system, and a memory access bandwidth control method that limit the bandwidth resources of low-priority threads without interfering with the operation of high-priority threads.
In a first aspect, an embodiment of the present application provides a bandwidth control device that is connected to the last-level cache (LLC) of a multi-level cache and to at least one processor core, where the at least one processor core supports multithreading and communicates with the multi-level cache. The bandwidth control device is configured to obtain a first memory access instruction sent by the LLC to a lower-level storage unit, where the first memory access instruction carries a first core identifier of the first processor core that generated it and a first thread identifier of the first thread, run by that core, that generated it. The bandwidth control device is configured to determine a first processing priority corresponding to the first thread identifier and, according to the first processing priority, determine the limiting rate at which the first thread may send memory access instructions after a preset clock period. The bandwidth control device is configured to send the limiting rate to the first processor core, instructing it to limit, according to the limiting rate, the number of memory access instructions sent by the first thread after the preset clock period.
In the above embodiment, the bandwidth control device obtains the memory access instruction sent by the LLC and reads, from that instruction, the thread identifier of the thread that generated it. It determines the processing priority of that thread from the thread identifier, calculates the limiting rate from the processing priority, and then sends the limiting rate to the processor core from which the memory access instruction came. The processor core can then limit the number of memory access instructions issued by that thread per unit clock cycle according to the limiting rate; the limited thread is typically one of lower priority. In this way, the memory access bandwidth of low-priority threads is limited at the processor core (that is, at the dispatch stage of processor instructions), which correspondingly reduces the need to restrict low-priority threads inside the multi-level cache. With fewer memory access instructions from low-priority threads in the multi-level cache, more cache resources remain available to the memory access instructions generated by high-priority threads, so that both the bandwidth limitation of low-priority threads and the smooth operation of high-priority threads are achieved.
In one possible design, the bandwidth control device includes a plurality of control calculation units, at least one of which is in an operating state; the number of control calculation units in the operating state equals the number of processing priorities of all threads supported by the multithreaded processor system, and the control calculation units in the operating state correspond one-to-one to those processing priorities. The bandwidth control device is configured to determine the first control calculation unit corresponding to the first processing priority, and to use the first control calculation unit to calculate the limiting rate at which the first thread may send memory access instructions after a preset clock period.
In the above embodiment, the control calculation units in the operating state in the bandwidth control device correspond one-to-one to the processing priorities of all threads supported by the multithreaded processor system. The bandwidth control device can select, from the control calculation units in the operating state, the first control calculation unit corresponding to the relevant processing priority and use it to calculate the limiting rate. Each control calculation unit in the operating state has components whose parameters correspond to its processing priority, so the limiting rate it calculates matches that priority.
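As a rough software sketch of this one-to-one mapping (not the patent's actual hardware; the class and method names here are hypothetical), the selection of a control calculation unit by priority could look like:

```python
class ControlCalcUnit:
    """Placeholder for one control calculation unit; its parameters
    (token period, etc.) would be configured per processing priority."""
    def __init__(self, priority):
        self.priority = priority

class BandwidthControlDevice:
    """Sketch of the one-to-one mapping between processing priorities
    and control calculation units in the operating state."""
    def __init__(self, priorities):
        # Activate exactly one control calculation unit per priority.
        self.units = {p: ControlCalcUnit(p) for p in priorities}

    def unit_for(self, priority):
        # Select the control calculation unit matching the priority of
        # the thread that generated the memory access instruction.
        return self.units[priority]
```

A device configured for two priorities would then hold exactly two active units and route each lookup to the matching one.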
In one possible design, the first control calculation unit includes a token generator, a token counter, and a limiting rate calculation unit, where the token generator and the limiting rate calculation unit are both connected to the token counter; the token generator is configured to generate a corresponding token for each thread belonging to the first priority, and the generation period of the tokens is the same as the preset sending period of memory access instructions corresponding to the first priority; the token counter is configured to increment the current number of tokens of the first thread by one when receiving a token newly generated by the token generator for the first thread; the token counter is further configured to decrement the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to a lower-level storage unit; and the limiting rate calculation unit is configured to calculate the difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and to calculate, according to the difference, the limiting rate at which the first thread may send memory access instructions after the preset clock period.
In the above embodiment, the token generation period of the token generator in the first control calculation unit may equal a preset sending period of memory access instructions, where the preset sending period is derived from a configured memory access bandwidth that a user may set according to the processing priority of the thread. The token counter increments the current number of tokens by one whenever the token generator produces a new token, and decrements it by one whenever the LLC sends one of the thread's memory access instructions to a lower-level storage unit. From this, the limiting rate calculation unit can obtain the trend and magnitude of the change in the number of tokens over the current clock cycle and calculate the limiting rate accordingly. Since the memory access bandwidth can be set according to the processing priority of the thread, the calculated limiting rate is tied to that priority, allowing threads of different priorities to be treated differently and improving processing efficiency.
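The token-counter behavior described above resembles a token bucket. A minimal functional sketch (illustrative only; the patent describes hardware, and the class name and initial-token parameter here are assumptions) is:

```python
class TokenCounter:
    """Tracks the token balance for one thread of a given priority.

    Tokens are added at a fixed generation period (derived from the
    bandwidth configured for the thread's priority) and removed each
    time the LLC forwards one of the thread's memory access
    instructions to the lower-level storage unit.
    """

    def __init__(self, initial_tokens: int):
        self.initial_tokens = initial_tokens
        self.current_tokens = initial_tokens

    def on_token_generated(self) -> None:
        # Token generator fired for this thread: balance goes up by one.
        self.current_tokens += 1

    def on_llc_miss_forwarded(self) -> None:
        # LLC sent one of this thread's access instructions downstream.
        self.current_tokens -= 1

    def deficit(self) -> int:
        # Difference between the initial and current token counts; a
        # positive value means the thread consumed bandwidth faster
        # than its priority's token generation rate allows.
        return self.initial_tokens - self.current_tokens
```

The `deficit()` value is the per-cycle difference that the limiting rate calculation unit consumes.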
In one possible design, the limiting rate calculation unit is configured to calculate a limiting rate change amount from the difference between the initial and current numbers of tokens of the first thread in the current clock cycle and the corresponding differences in the latest one or more clock cycles; the limiting rate calculation unit is configured to calculate the sum of the historical limiting rate and the limiting rate change amount, and this sum is the limiting rate at which the first thread may send memory access instructions after a preset clock period.
In the above embodiment, the change in the limiting rate may be calculated from the token difference of the current clock cycle and the token differences of historical clock cycles; the sum of that change and the historical limiting rate gives the new limiting rate, which is the rate at which the processor core may send memory access instructions after the preset clock period. Calculating the limiting rate with historical data helps it satisfy the requirements of memory access bandwidth control.
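One plausible reading of this update rule can be sketched as a small control loop. The patent does not fix the exact coefficients, so the averaging of past deficits and the `gain` factor below are illustrative assumptions:

```python
def next_limit_rate(history_rate: float,
                    current_deficit: int,
                    past_deficits: list,
                    gain: float = 0.1) -> float:
    """Compute the limiting rate to apply after the preset clock period.

    The change in the limiting rate is derived from the current cycle's
    token deficit and the deficits of the most recent cycles; the new
    rate is the historical rate plus that change.  The simple average
    and the `gain` coefficient are assumptions, not the patent's exact
    formula.
    """
    recent_avg = sum(past_deficits) / len(past_deficits) if past_deficits else 0.0
    # A deficit growing relative to recent history tightens the limit.
    rate_change = gain * (current_deficit - recent_avg)
    return history_rate + rate_change
```

With a stable deficit the rate stays put; a sudden burst of forwarded instructions raises the deficit and shifts the rate accordingly.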
In a second aspect, the present application provides a multithreaded processor system comprising at least one processor core, a multi-level cache, and the bandwidth control device of the first aspect or any optional implementation of the first aspect, where the at least one processor core supports multithreading and communicates with the multi-level cache, the multi-level cache includes a last-level cache (LLC), and the LLC and the at least one processor core are each connected to the bandwidth control device.
In this embodiment, the memory access bandwidth of low-priority threads is limited at the processor core, which correspondingly reduces the need to restrict low-priority threads inside the multi-level cache. With fewer memory access instructions from low-priority threads in the multi-level cache, more cache resources remain available to the memory access instructions generated by high-priority threads, so that both the bandwidth limitation of low-priority threads and the smooth operation of high-priority threads are achieved.
In one possible design, the processor core includes an instruction output logic module and an execution/memory access unit, the instruction output logic module is connected to the execution/memory access unit, and an emission limiting module is arranged in the instruction output logic module. The emission limiting module in the processor core is configured to receive the limiting rate sent by the bandwidth control device and to limit, according to the limiting rate, the number of memory access instructions sent by the first thread after a preset clock period.
In a third aspect, an embodiment of the present application provides a memory access bandwidth control method applied to the multithreaded processor system of the second aspect or any optional implementation thereof. The method includes: the bandwidth control device obtains a first memory access instruction sent by the LLC to a lower-level storage unit, where the first memory access instruction carries a first core identifier of the first processor core that generated it and a first thread identifier of the first thread, run by that core, that generated it; the bandwidth control device determines a first processing priority corresponding to the first thread identifier and, according to the first processing priority, determines the limiting rate at which the first thread may send memory access instructions after a preset clock period; and the bandwidth control device sends the limiting rate to the first processor core, instructing it to limit, according to the limiting rate, the number of memory access instructions sent by the first thread after the preset clock period.
In the above embodiment, the bandwidth control device obtains the memory access instruction sent by the LLC and reads, from that instruction, the thread identifier of the thread that generated it. It determines the processing priority of that thread from the thread identifier, calculates the limiting rate from the processing priority, and then sends the limiting rate to the processor core from which the memory access instruction came. The processor core can then limit the number of memory access instructions issued by that thread per unit clock cycle according to the limiting rate; the limited thread is typically one of lower priority. In this way, the memory access bandwidth of low-priority threads is limited at the processor core (that is, at the dispatch stage of processor instructions), which correspondingly reduces the need to restrict low-priority threads inside the multi-level cache. With fewer memory access instructions from low-priority threads in the multi-level cache, more cache resources remain available to the memory access instructions generated by high-priority threads, so that both the bandwidth limitation of low-priority threads and the smooth operation of high-priority threads are achieved.
In one possible design, the bandwidth control device includes a plurality of control calculation units, at least one of which is in an operating state; the number of control calculation units in the operating state equals the number of processing priorities of all threads supported by the multithreaded processor system, and the control calculation units in the operating state correspond one-to-one to those processing priorities. Determining, according to the first processing priority, the limiting rate at which the first thread may send memory access instructions after a preset clock period includes: the bandwidth control device determines the first control calculation unit corresponding to the first processing priority; and the bandwidth control device uses the first control calculation unit to calculate the limiting rate at which the first thread may send memory access instructions after the preset clock period.
In the above embodiment, the control calculation units in the operating state in the bandwidth control device correspond one-to-one to the processing priorities of all threads supported by the multithreaded processor system. The bandwidth control device can select, from the control calculation units in the operating state, the first control calculation unit corresponding to the relevant processing priority and use it to calculate the limiting rate. Each control calculation unit in the operating state has components whose parameters correspond to its processing priority, so the limiting rate it calculates matches that priority.
In one possible design, the first control calculation unit includes a token generator, a token counter, and a limiting rate calculation unit, where the token generator and the limiting rate calculation unit are both connected to the token counter; the token generator is configured to generate a corresponding token for each thread belonging to the first priority, and the generation period of the tokens is the same as the preset sending period of memory access instructions corresponding to the first priority. The step in which the bandwidth control device uses the first control calculation unit to calculate the limiting rate at which memory access instructions are sent after a preset clock period includes: the token counter increments the current number of tokens of the first thread by one when receiving a token newly generated by the token generator for the first thread; the token counter decrements the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to a lower-level storage unit; and the limiting rate calculation unit calculates the difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and calculates, according to the difference, the limiting rate at which the first thread may send memory access instructions after the preset clock period.
In the above embodiment, the token generation period of the token generator in the first control calculation unit may equal a preset sending period of memory access instructions, where the preset sending period is derived from a configured memory access bandwidth that a user may set according to the processing priority of the thread. The token counter increments the current number of tokens by one whenever the token generator produces a new token, and decrements it by one whenever the LLC sends one of the thread's memory access instructions to a lower-level storage unit. From this, the limiting rate calculation unit can obtain the trend and magnitude of the change in the number of tokens over the current clock cycle and calculate the limiting rate accordingly. Since the memory access bandwidth can be set according to the processing priority of the thread, the calculated limiting rate is tied to that priority, allowing threads of different priorities to be treated differently and improving processing efficiency.
In one possible design, calculating, according to the difference, the limiting rate at which the first thread may send memory access instructions after the preset clock period includes: the limiting rate calculation unit calculates a limiting rate change amount from the difference between the initial and current numbers of tokens of the first thread in the current clock cycle and the corresponding differences in the latest one or more clock cycles; and the limiting rate calculation unit calculates the sum of the historical limiting rate and the limiting rate change amount, this sum being the limiting rate at which the first thread may send memory access instructions after the preset clock period.
In the above embodiment, the change in the limiting rate may be calculated from the token difference of the current clock cycle and the token differences of historical clock cycles; the sum of that change and the historical limiting rate gives the new limiting rate, which is the rate at which the processor core may send memory access instructions after the preset clock period. Calculating the limiting rate with historical data helps it satisfy the requirements of memory access bandwidth control.
In one possible design, the first processor core includes an instruction output logic module and an execution/memory access unit, the instruction output logic module is connected to the execution/memory access unit, and an emission limiting module is disposed in the instruction output logic module. The step of the bandwidth control device sending the limiting rate to the first processor core includes: the bandwidth control device sends the limiting rate to the emission limiting module in the first processor core, instructing the emission limiting module to limit, according to the limiting rate, the number of memory access instructions sent by the first thread after a preset clock period.
In the above embodiment, the bandwidth control device sends the limiting rate specifically to the emission limiting module in the processor core, so that the emission limiting module limits the issue rate of the processor core's instruction output logic module according to the limiting rate, thereby limiting the number of memory access instructions sent by the first thread after the preset clock period.
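A minimal software analogy of such an emission (issue) limiter, assuming a fixed-size window of clock cycles and an integer instruction budget per window (both assumptions; the patent leaves the exact mechanism to the implementation), might look like:

```python
class IssueLimiter:
    """Caps how many memory access instructions a thread may issue per
    window of clock cycles, according to the received limiting rate."""

    def __init__(self, window_cycles: int):
        self.window_cycles = window_cycles
        self.limit = None          # instructions allowed per window
        self.issued_in_window = 0
        self.cycle = 0

    def set_limit_rate(self, limit: int) -> None:
        # Limiting rate received from the bandwidth control device.
        self.limit = limit

    def tick(self) -> None:
        # Advance one clock cycle; reset the count at window boundaries.
        self.cycle += 1
        if self.cycle % self.window_cycles == 0:
            self.issued_in_window = 0

    def try_issue(self) -> bool:
        # Returns True if the thread may issue an access instruction now.
        if self.limit is not None and self.issued_in_window >= self.limit:
            return False
        self.issued_in_window += 1
        return True
```

Once the budget for the window is spent, further issues are refused until the next window begins, which is the throttling effect the instruction output logic module applies per thread.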
In order to make the above objects, features and advantages of the embodiments of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be regarded as limiting its scope; a person skilled in the art can obtain other related drawings from them without inventive effort.
FIG. 1 shows a schematic block diagram of a multithreaded processor system provided by an embodiment of the application;
FIG. 2 shows a schematic block diagram of one embodiment of the multithreaded processor system provided by an embodiment of the application;
FIG. 3 shows a schematic block diagram of a bandwidth control device;
FIG. 4 shows a schematic block diagram of a single control calculation unit in a bandwidth control device;
FIG. 5 shows a schematic block diagram of any one of a plurality of processor cores;
FIG. 6 shows a schematic flow chart of a memory access bandwidth control method according to an embodiment of the application;
FIG. 7 shows a schematic flow chart of the specific steps of step S120 in FIG. 6;
FIG. 8 shows a flowchart of the specific steps of step S122 in FIG. 7.
Detailed Description
Compared with the embodiments of the present application, conventional QoS-supporting cache bandwidth management limits low-priority threads inside the multi-level cache system, with the result that low-priority threads occupy various resources in the cache for long periods. For example, in the LLC, memory access instructions from low-priority threads tend to be stored for a long time in the miss cache queue (MissQueue) or the request cache queue (ReqQueue), leaving fewer MissQueue or ReqQueue resources available for high-priority threads. The MissQueue stores memory access instructions to be sent to the lower-level storage unit, and the ReqQueue stores memory access instructions received from the upper-level cache.
The bandwidth control device, multithreaded processor system, and memory access bandwidth control method provided by the embodiments of the present application limit the memory access bandwidth of low-priority threads at the processor core, namely at the dispatch/issue stage of processor instructions. This reduces the need to restrict low-priority threads inside the multi-level cache, leaves more cache resources available for the memory access instructions generated by high-priority threads, and thereby achieves both the bandwidth resource limitation of low-priority threads and the smooth operation of high-priority threads.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
Referring to FIG. 1, FIG. 1 shows a schematic block diagram of a multithreaded processor system 10 according to an embodiment of the application, including a plurality of processor cores 110, a multi-level cache, and a bandwidth control device 210. Each of the processor cores 110 may support multithreading, and each processor core 110 is in communication with the multi-level cache.
As shown in FIG. 1, the number of processor cores 110 may be n, and the multi-level cache has m levels, including a first-level cache 120, a second-level cache 130, …, and an m-level cache; the last-level cache in the multi-level cache is denoted LLC 140, i.e., in the multithreaded processor system shown in FIG. 1, the m-level cache is the LLC 140. The LLC 140 and the n processor cores 110 are each connected to the bandwidth control device 210.
Any of the multiple threads supported by any processor core 110 may generate a number of memory access instructions within a clock cycle of the processor core 110, where a memory access instruction is an instruction that accesses specific data in a storage unit. A memory access instruction first checks whether the specific data to be accessed is cached in a higher-level cache; if the data is not cached there, the instruction is said to "miss" in that cache, and if it is cached there, the instruction is said to "hit".
Referring to fig. 1, a memory access instruction generated by a thread in the current clock cycle of a processor core 110 first searches for the specific data to be accessed in the primary cache 120. If the specific data is not found in the primary cache 120, a miss in the primary cache 120 is determined, and the primary cache 120 sends the memory access instruction to the secondary cache 130; if the specific data is not found in the secondary cache 130, a miss in the secondary cache 130 is determined, and the secondary cache 130 sends the memory access instruction to the tertiary cache, and so on, until a miss occurs in the (m-1)-level cache and the memory access instruction is sent by the (m-1)-level cache to the m-level cache, i.e., the LLC 140. If the LLC 140 also misses, the LLC 140 issues the memory access instruction to the lower-level memory unit. The generated memory access instruction carries identity information SrcID, where the SrcID comprises a processor core identifier Core ID and a thread identifier TID: the Core ID identifies the processor core 110 from which the memory access instruction comes, and the TID identifies the thread from which it comes.
The memory access instruction is sent to a lower level memory system, typically a dynamic random access memory (Dynamic Random Access Memory, DRAM for short), via an on-chip interconnect system. The on-chip interconnect system may be connected to IO devices and other processor systems in addition to DRAM.
Optionally, in a specific embodiment, the level-one cache 120 may include a level-one instruction cache (L1I for short) and a level-one data cache (L1D for short), and a memory access instruction may be selectively sent by the processor core 110 to one of L1I and L1D according to the data type of the specific data to be accessed. For example, if the data type of the specific data to be accessed is an instruction, the processor core 110 may send the memory access instruction to L1I, so that on a miss in L1I the instruction is sent by L1I to the secondary cache 130; if the data type is data, the processor core 110 may send the memory access instruction to L1D, so that on a miss in L1D the instruction is sent by L1D to the secondary cache 130. In another possible implementation, on a miss in L1I the memory access instruction may continue searching for the specific data in L1D, and is sent to the secondary cache 130 only when L1D also misses; or, on a miss in L1D, the instruction may continue searching in L1I and is sent to the secondary cache 130 when L1I also misses. The particular process by which memory access instructions are looked up in the primary cache 120 should not be construed as limiting the application.
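The L1 routing and miss-escalation behavior described above can be expressed as a minimal sketch. The patent describes hardware; the Python below is only an illustrative model with made-up names (`Cache`, `route_in_l1`), implementing the second variant in which the other L1 array is also consulted before escalating to L2:

```python
class Cache:
    """Toy cache model: a set of cached addresses."""
    def __init__(self, contents):
        self.contents = set(contents)

    def hit(self, addr):
        return addr in self.contents

def route_in_l1(addr, data_type, l1i, l1d):
    """Return which level services the access: 'L1I', 'L1D', or 'L2' on an L1 miss."""
    # Instruction fetches are looked up in L1I first, data accesses in L1D first.
    if data_type == "instruction":
        first, second = (l1i, "L1I"), (l1d, "L1D")
    else:
        first, second = (l1d, "L1D"), (l1i, "L1I")
    if first[0].hit(addr):
        return first[1]
    if second[0].hit(addr):        # optional cross-lookup in the other L1 array
        return second[1]
    return "L2"                    # miss in both L1 arrays: forward to the L2 cache
```

With `l1i = Cache({0x10})` and `l1d = Cache({0x20})`, an instruction fetch of `0x10` is serviced by L1I, while an access to `0x30` misses both arrays and is forwarded to L2.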
The bandwidth control device 210 may obtain the memory access instruction sent by the LLC 140 to the lower-level memory unit, obtain through a series of calculations the limiting rate at which the thread that generated the memory access instruction may send memory access instructions after a preset clock cycle, and send the limiting rate to the processor core 110 that supports that thread. The specific structure of the bandwidth control device 210 and the calculation of the limiting rate are described in detail below.
The processor core 110 may, according to the limiting rate, limit the number of new memory access instructions sent by that thread in a unit clock cycle after the preset clock cycle. The preset clock cycle refers to a number of clock cycles elapsed from the current clock cycle of the processor core 110. Because signaling takes time, the limiting rate calculated in the current clock cycle is applied to the memory access instructions issued by the processor core 110 only after the preset clock cycle has elapsed. The preset number of clock cycles may be 10, or another number such as 12; the specific number of clock cycles should not be construed as limiting the application.
For ease of description, the multithreaded controller system shown in fig. 2 is taken as a non-limiting example. That is, the number of processor cores 110 may be 4, namely Core 0, Core 1, Core 2, and Core 3, where each of the 4 processor cores 110 supports two threads; the multi-level cache may have 3 levels, namely L1, L2, and L3, where L1 comprises L1D and L1I.
Referring to fig. 3, fig. 3 shows a schematic block diagram of a bandwidth control apparatus 210, and the bandwidth control apparatus 210 includes a request allocation unit 211, a plurality of control calculation units 212, a control signal allocation unit 213, and a QoS control management register 214.
The request allocation unit 211 is configured to determine, according to the srclid of the access instruction, a thread generating the access instruction and a processing priority corresponding to the thread, where the processing priority is a Class of Service (CoS for short). The request allocation unit 211 may determine the control calculation unit 212 corresponding to the CoS from among the plurality of control calculation units 212, and allocate the memory access instruction to the control calculation unit 212 described above.
The number of the plurality of control calculation units 212 may be the same as the total number of threads supported by the LLC 140. For example, in the multithreaded controller system shown in fig. 2, there are 4 processor cores 110, each supporting two threads, all served by the LLC 140 (i.e., the three-level cache); the total number of threads supported by the LLC 140 is therefore 4×2 = 8, and the number of control calculation units 212 is likewise 8, namely control calculation unit 0, control calculation unit 1, control calculation unit 2, …, through control calculation unit 7.
The 8 threads each have a corresponding CoS. Alternatively, one thread may correspond to one CoS, or a plurality of threads may correspond to one CoS, i.e., the number of CoS is less than or equal to the number of threads supported by LLC 140. The number of control computing units 212 in an operating state among the plurality of control computing units 212 is the same as the number of CoS, i.e., the number of control computing units 212 in an operating state is less than or equal to the total number of control computing units 212.
The control calculation unit 212 is configured to calculate a restriction rate according to the CoS corresponding to itself and the received memory access instruction, and input the restriction rate to the control signal distribution unit 213. The specific process of the control calculation unit 212 to calculate the restriction rate is described in detail below.
The control signal distribution unit 213 is configured to convert the limiting rate into a transmission control signal, and feed back the transmission control signal to the processor core 110 supporting the thread according to the mapping relationship between the thread and the CoS.
The QoS control management register 214 is used to maintain the bandwidth quota of each CoS in the system and the parameters required by the control calculation units 212, and also maintains the mapping relationship between each thread and its CoS, i.e., the mapping between SrcID and CoS. The control calculation unit 212 corresponding to each CoS may also be dynamically allocated by the QoS control management register 214. For example, suppose, without limitation, that the CoS has x levels; in an embodiment of the present application, only y of the x levels may be used, so that y control calculation units 212 are in operation, in one-to-one correspondence with the y CoS levels.
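The request-allocation step — SrcID to CoS, then CoS to control calculation unit — can be sketched as two lookup tables. All table contents below are made-up examples, not values from the patent; only the two-stage mapping itself follows the description above:

```python
# SrcID -> CoS mapping, as held by the QoS control management register.
# SrcID is modeled as a (Core ID, TID) pair; CoS levels are illustrative.
SRCID_TO_COS = {
    (0, 0): 1, (0, 1): 2,
    (1, 0): 1, (1, 1): 2,
}

# CoS -> control calculation unit index (dynamically allocated in the patent).
COS_TO_UNIT = {1: 0, 2: 1}

def allocate(core_id, thread_id):
    """Return the control-calculation-unit index for a memory access request."""
    cos = SRCID_TO_COS[(core_id, thread_id)]   # identify the thread's CoS
    return COS_TO_UNIT[cos]                    # dispatch to that CoS's unit
```

In this example two CoS levels are in use, so only two of the eight control calculation units would be in the operating state, matching the y-out-of-x allocation described above.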
Referring to fig. 4, fig. 4 shows a schematic block diagram of a single control calculation unit 212, where the control calculation unit 212 includes a token generator 2121, a token counter 2122, and a restriction rate calculation unit 2123, and the token generator 2121 and the restriction rate calculation unit 2123 are connected to the token counter 2122.
The token generator 2121 is configured to generate a token according to a generation cycle of the token corresponding to itself. The generation period of the token is the same as the preset sending period of the access instruction, the preset sending period of the access instruction is obtained according to the set access bandwidth, and the access bandwidth can be set by a user according to the CoS of the thread.
Alternatively, in one embodiment, the token generation process of the token generator 2121 may be calculated as follows:
the frequency of the LLC 140 may be F GHz, the memory access bandwidth quota of the thread's CoS is N × 1/8 GByte/s, the thread sends N memory access instructions in T clock cycles, and the data length requested by each memory access instruction is 64 Byte. From the relationship between clock cycles and clock frequency:

N × 64 Byte / T × F GHz = N × 1/8 GB/s

from which it follows that T = 512 × F.

The N memory access instructions need to be sent within T clock cycles, so the preset sending period of one memory access instruction is T/N = 512 × F/N, and the generation period of the token generator 2121 is the same as this preset sending period, namely 512 × F/N clock cycles.
Alternatively, the token generator 2121 may be a step-1 counter that increments by 1 every clock cycle; when the count reaches 512 × F/N, a token is output, the counter is cleared, and accumulation restarts from zero.
Alternatively, the token generator 2121 may be a counter with a configurable step. For example, for a memory access bandwidth of N × 1/8 GByte/s, the step of the token generator 2121 may be configured so that the counter increments by N every clock cycle; whenever the value of the counter is greater than or equal to T, a token is output, T is subtracted from the counter, and accumulation restarts.
In another embodiment, T may be an integer power of 2. For example, let T be 2048, and let M memory access instructions be issued in T clock cycles; then:

M × 64 Byte / 2048 × F GHz = N × 1/8 GB/s

from which it follows that M = 4 × N/F.

Thus, the counter bit width of the token generator 2121 may be set to 11 bits (2^11 = 2048). The counter accumulates M every clock cycle, and whenever the accumulated result is greater than or equal to 2048, a carry out of the highest bit occurs and a token is output.
It should be appreciated that the generation of tokens by token generator 2121 may be accomplished by other schemes than those described above, and the specific generation of tokens by token generator 2121 should not be construed as limiting the present application.
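The two counter schemes above can be sketched in a few lines. This is an illustrative software model, not the hardware implementation; `F` is the clock frequency in GHz and `N` the bandwidth quota in units of 1/8 GByte/s, as in the derivation above:

```python
class StepCounter:
    """Step-configurable counter: add N each cycle; emit a token at threshold T = 512*F."""
    def __init__(self, f_ghz, n):
        self.threshold = 512 * f_ghz   # T = 512 * F clock cycles
        self.step = n                  # increment by N every clock cycle
        self.acc = 0

    def tick(self):
        self.acc += self.step
        if self.acc >= self.threshold:
            self.acc -= self.threshold # subtract T, keep the remainder
            return 1                   # one token generated this cycle
        return 0

class CarryCounter:
    """11-bit accumulator variant: add M = 4*N/F each cycle; a carry past 2048
    (i.e., out of the highest of 11 bits) generates a token."""
    def __init__(self, f_ghz, n):
        self.step = 4 * n / f_ghz      # M = 4 * N / F
        self.acc = 0.0

    def tick(self):
        self.acc += self.step
        if self.acc >= 2048:
            self.acc -= 2048
            return 1
        return 0
```

For F = 2 GHz and N = 8 (i.e., 1 GByte/s), both counters emit one token every 128 clock cycles, matching the preset sending period T/N = 512 × F/N = 128.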
The token counter 2122 is configured to, when receiving the token generated by the token generator 2121, increment the current number of tokens by one, and take the number of tokens obtained by increment of the current number of tokens as a new current number of tokens; and when the access instruction is received, subtracting one from the current number of tokens, and subtracting one from the current number of tokens to serve as a new current number of tokens.
Alternatively, to avoid a negative token count, the token counter 2122 may be initialized with a larger fixed value, which may be set based on the size of the miss cache queue of the LLC 140 and the memory access delay. For example, if the size of the miss cache queue is 100 and the memory access delay is 100 clock cycles, at most 100 memory access requests can be outstanding at once, so the initial fixed value may be set to a value greater than 100, such as 256, thereby avoiding a negative token count.
If the number of tokens stays at the initial fixed value over a period of time, this indicates that the number of LLC 140 memory access instructions from the thread(s) corresponding to the CoS meets the bandwidth quota; if the number of tokens is less than the initial fixed value, the memory access bandwidth used by the thread(s) corresponding to the CoS is too large; and if the number of tokens is greater than the initial fixed value, the memory access bandwidth used by the thread(s) corresponding to the CoS is below the quota.
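The increment/decrement behavior of the token counter and its comparison against the initial fixed value can be sketched as follows (an illustrative model with made-up names; the initial value 256 is the example figure given above):

```python
class TokenCounter:
    """Token counter with a large initial value so the count never needs to go negative."""
    def __init__(self, initial=256):   # initial value > miss-queue depth (e.g., 100)
        self.initial = initial
        self.count = initial

    def on_token(self):
        self.count += 1                # token generated by the token generator

    def on_access(self):
        self.count -= 1                # LLC sent a memory access instruction downstream

    def deficit(self):
        """Err = initial - current: positive when the CoS is over its bandwidth quota,
        negative when it is under, zero when usage matches the quota."""
        return self.initial - self.count
```

A positive `deficit()` is exactly the difference the restriction rate calculation unit 2123 consumes in the next step.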
The restriction rate calculating unit 2123 may calculate a difference between the initial number of tokens and the current number of tokens in the current clock cycle, and calculate, according to the difference, a restriction rate of a thread supported by the processor core 110 to send a memory access instruction after a preset clock cycle.
Alternatively, in one embodiment, to limit a thread, the token issued by token generator 2121 may carry identification information characterizing the identity of the thread, such as Core ID and thread ID TID, and token counter 2122 may calculate respective differences for different threads based on the Core ID and thread ID TID, respectively.
Alternatively, in another embodiment, the number of token generators 2121 may be the same as the number of threads, and the number of token counters 2122 may be the same as the number of threads, i.e., each thread corresponds to a respective token generator and token counter, where the difference in the number of tokens obtained by each token counter corresponds to the respective thread.
The specific process of calculating the restriction rate by the restriction rate calculation unit 2123 is described in detail later. If the initial number of the tokens is larger than the current number of the tokens, the difference value is a positive number; if the initial number of tokens is smaller than the current number of tokens, the difference is a negative number.
Referring to FIG. 5, FIG. 5 shows a schematic block diagram of any one of the plurality of processor cores 110, taking processor Core 0 as a non-limiting example.
The processor core 110 includes an instruction output logic module 111 and an execution/memory-access unit 112, with the instruction output logic module 111 connected to the execution/memory-access unit 112, and issue limiting modules disposed in the instruction output logic module 111. Since the processor core 110 supports two threads, the processor core 110 further includes a fetch/decode module 113 for thread 0 and a fetch/decode module 114 for thread 1. There are two issue limiting modules: issue limiting module 1111 is coupled to the fetch/decode module of thread 0, and issue limiting module 1112 is coupled to the fetch/decode module of thread 1. Alternatively, the number of issue limiting modules may be multiple, i.e., in one-to-one correspondence with the threads, or there may be a single issue limiting module that manages multiple threads.
The control signal allocation unit 213 in the bandwidth control device 210 is specifically configured to convert the limiting rate into a transmit control signal and send it to the corresponding issue limiting module in the processor core 110 according to the thread to which the signal corresponds. For example, if the limiting rate corresponds to thread 0 of Core 0, it may be converted into the transmit control signal limiting rate_T0 and sent to the issue limiting module 1111 connected to the fetch/decode module of thread 0; if the limiting rate corresponds to thread 1 of Core 0, it may be converted into the transmit control signal limiting rate_T1 and sent to the issue limiting module 1112 connected to the fetch/decode module of thread 1. The issue limiting module is configured to limit the issue rate of the instruction output logic module 111 according to the limiting rate, thereby limiting the number of memory access instructions the thread sends after the preset clock cycle. The instruction output logic module 111 may output instructions by dispatching or by issuing; the specific manner of outputting instructions should not be construed as limiting the application.
Fig. 6 is a flow chart of a memory bandwidth control method according to an embodiment of the present application, where the memory bandwidth control method is executed by the bandwidth control device 210, and the memory bandwidth control method shown in fig. 6 includes steps S110 to S130 as follows:
in step S110, bandwidth control device 210 obtains a first memory instruction sent by LLC 140 to a lower level storage unit.
The first memory access instruction carries a first core identifier of the first processor core that generated it and a first thread identifier of the first thread, run by the first processor core, that generated it. For ease of illustration, assume, without limitation, that the first memory access instruction comes from thread L0 of core 0 in FIG. 2, i.e., the first core identifier is core 0 and the first thread identifier is L0.
The identity information is the SrcID described above, comprising the Core ID and the thread identifier TID.
The lower-level memory unit is typically DRAM, but may be another memory unit; the specific type of the lower-level memory unit should not be construed as limiting the application.
After the memory access instruction is sent to the lower-level memory unit, the lower-level memory unit is searched for the specific data to be accessed; if the specific data is not cached there, the instruction is said to miss in the lower-level memory unit, and if it is cached there, to hit in the lower-level memory unit.
In step S120, the bandwidth control device determines a first processing priority corresponding to the first thread identifier, and determines a limiting rate of the first thread sending the access instruction after a preset clock period according to the first processing priority.
The processing priority is the CoS described above; each thread has a corresponding CoS, and either one thread or a plurality of threads may correspond to one CoS. The bandwidth control device 210 may determine the thread that generated the memory access instruction according to the SrcID of the memory access instruction, and calculate the limiting rate according to the CoS corresponding to that thread.
Based on the correspondence between threads and CoS, the CoS of thread L0 of core 0 in fig. 2 may be identified as CoS10. The bandwidth control device may then determine, according to the first processing priority CoS10, the limiting rate at which the first thread (thread L0 of core 0) sends memory access instructions after the preset clock cycle.
Step S130, the bandwidth control device sends the limiting rate to the first processor core, and instructs the processor core to limit the number of memory access instructions sent by the first thread after a preset clock period according to the limiting rate.
Continuing with the above example, the processor core 0 may limit the thread L0 supported by the processor core 0 according to the limiting rate sent by the bandwidth control device, which is specifically shown as follows: the thread L0 is restricted from sending the number of memory access instructions after a preset clock cycle.
The bandwidth control device 210 may obtain a memory access instruction issued by the LLC 140 and obtain from it the thread identifier of the thread that generated it. Based on the thread identifier, the processing priority of that thread is determined and the limiting rate is calculated from the processing priority; the bandwidth control device 210 then sends the limiting rate to the processor core 110 from which the memory access instruction originated. The processor core 110 may limit the number of memory access instructions issued by that thread per unit clock cycle based on the limiting rate; the processing priority of the limited thread is typically a lower priority.
In the above manner, the limitation of the memory access bandwidth of low-priority threads is realized within the processor core 110, namely at the dispatch stage of processor instructions, and the memory access traffic of low-priority threads in the multi-level cache is correspondingly reduced. With fewer memory access instructions generated by low-priority threads in the multi-level cache, more cache resources are available for the memory access instructions generated by high-priority threads, so that bandwidth resource limitation of low-priority threads and smooth operation of high-priority threads are both achieved.
Referring to fig. 7, fig. 7 shows a flowchart of specific steps of step S120, specifically including the following steps S121 to S122:
In step S121, the bandwidth control apparatus determines a first control calculation unit corresponding to the first processing priority.
The processing priority and the control calculation unit are mapped in a one-to-one correspondence, and the first control calculation unit corresponding to the first processing priority CoS10 may be the control calculation unit 1 in fig. 3.
Since the QoS control management register 214 in the bandwidth control apparatus 210 holds the mapping relationship between the thread and the processing priority CoS and also holds the correspondence relationship between the CoS and the control calculation unit 212, the CoS corresponding to the thread may be selected first, then the control calculation unit 212 corresponding to the CoS may be selected, and the control calculation unit 212 may be referred to as the first control calculation unit 212.
In step S122, the bandwidth control device calculates, using the first control calculation unit, a limiting rate of the first thread sending the memory access instruction after a preset clock period.
The control calculation units 212 in the bandwidth control apparatus 210 in the running state are in one-to-one correspondence with the processing priorities corresponding to the thread identifications. The bandwidth control apparatus 210 may select a first control calculation unit 212 of a corresponding processing priority from among the plurality of control calculation units 212 in an operation state, and implement calculation of the restriction rate using the first control calculation unit 212. Each of the control calculation units 212 in the operation state has components whose parameters correspond to the processing priority, thereby calculating the restriction rate corresponding to the processing priority.
The first control calculation unit, i.e., control calculation unit 1, may perform the calculation of the limiting rate specifically as follows:
referring to fig. 8, fig. 8 shows a flowchart of specific steps of step S122, specifically including the following steps S1221 to S1223:
in step S1221, the token counter increments the current number of tokens of the first thread by one when receiving the tokens newly generated by the token generator for the first thread.
In step S1222, the token counter decrements the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to the lower level memory unit.
Wherein, the processing priority corresponding to the thread identifier of the target access instruction corresponds to the first control computing unit 212.
In step S1223, the restriction rate calculating unit calculates a difference between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock cycle, and calculates a restriction rate of the first thread to send the memory access instruction after the preset clock cycle according to the difference.
If the initial number of tokens is larger than the current number of tokens, the difference value is a positive number; if the initial number of tokens is smaller than the current number of tokens, the difference is a negative number.
The token generation period of the token generator 2121 included in the first control calculation unit 212 may be the same as the preset sending period of memory access instructions, which is obtained from the set memory access bandwidth; the memory access bandwidth may be set by a user according to the processing priority of the thread. The token counter 2122 increments the current number of tokens by one upon receiving a token newly generated by the token generator 2121, and decrements it by one upon receiving a memory access instruction. In this way, the limiting rate calculation unit 2123 can obtain the trend and magnitude of the change in the number of tokens over the current clock cycle and calculate the limiting rate accordingly. Since the memory access bandwidth is set according to the processing priority of the thread, the calculated limiting rate is tied to the processing priority, thereby realizing differentiated treatment of threads with different processing priorities and improving processing efficiency.
Alternatively, step S1223 may specifically calculate the limiting rate as follows: the limiting rate calculation unit calculates a limiting rate change amount from the difference between the initial number of tokens of the first thread and its current number of tokens in the current clock cycle, together with the corresponding differences in the most recent one or more clock cycles; the limiting rate calculation unit then computes the sum of the historical limiting rate and the limiting rate change amount, which is the limiting rate at which the first thread sends memory access instructions after the preset clock cycle.
The number of the most recent clock cycles may be two or more, and for convenience of explanation, two clock cycles may be taken as an example:
For example, Err0, Err1, and Err2 may be introduced, where Err0 represents the difference between the initial number of tokens and the current number of tokens in the current clock cycle, Err1 the corresponding difference one clock cycle earlier, and Err2 the corresponding difference two clock cycles earlier.

The value of Err1 may be obtained by latching Err0 for one clock cycle, and the value of Err2 by latching Err1 for one clock cycle.
After determining Err0, Err1, and Err2, the limiting rate change amount is calculated according to the formula:

Throttle_Value_delta = K0 × Err0 + K1 × Err1 + K2 × Err2

where K0, K1, and K2 are three parameters that can be obtained through repeated experimental verification; for ease of calculation, K0, K1, and K2 may be integer powers of 2.
After the limiting rate change amount Throttle_Value_delta is calculated, the limiting rate Throttle_Value at which the processor core 110 sends memory access instructions after the preset clock cycle is obtained from the formula:

Throttle_Value = Throttle_Value_d1 + Throttle_Value_delta

where Throttle_Value_d1 is the historical limiting rate, obtained by latching Throttle_Value for one clock cycle.
That is, the limiting rate change amount is calculated from the token difference of the current clock cycle and the token differences of the historical clock cycles, and the sum of the change amount and the historical limiting rate gives the new limiting rate, i.e., the limiting rate at which the processor core 110 sends memory access instructions after the preset clock cycle. Calculating the limiting rate with historical data in this way makes the limiting rate meet the requirements of memory access bandwidth control.
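The update rule above can be sketched as a small model. The gain values below are placeholders, not the experimentally verified K0, K1, K2 of the patent (they are merely chosen as powers of two, as the patent suggests for cheap hardware multiplies):

```python
# Placeholder gains; the patent obtains K0, K1, K2 by repeated experiment.
K0, K1, K2 = 4, 2, 1

class ThrottleCalc:
    """Limiting-rate update: Throttle_Value += K0*Err0 + K1*Err1 + K2*Err2."""
    def __init__(self):
        self.err1 = 0     # Err0 latched one clock cycle  -> Err1
        self.err2 = 0     # Err1 latched one clock cycle  -> Err2
        self.value = 0    # previous limiting rate (Throttle_Value_d1)

    def update(self, err0):
        # err0 = initial token count - current token count, this clock cycle
        delta = K0 * err0 + K1 * self.err1 + K2 * self.err2  # Throttle_Value_delta
        self.value += delta                                  # Throttle_Value
        self.err2, self.err1 = self.err1, err0               # shift the latch chain
        return self.value
```

A one-cycle token deficit of 1 followed by two balanced cycles raises the limiting rate by 4, then 2, then 1, as the deficit propagates through the Err0/Err1/Err2 latch chain; a negative difference (bandwidth under-used) lowers the rate symmetrically.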
Optionally, in an embodiment, after the limiting rate Throttle_Value is calculated, the specific limit number may be determined according to the correspondence between Throttle_Value and the limit number. The limiting rate is used to limit the number of memory access instructions sent, in a unit clock cycle after the preset clock cycle, by the same thread that sent the original memory access instruction; the specific limit number can therefore be defined by the correspondence between Throttle_Value and the limit number.
For example, the limiting rate Throttle_Value may correspond to the following five mapping intervals: (0, 1/8·Max_Rate], (1/8·Max_Rate, 3/8·Max_Rate], (3/8·Max_Rate, 5/8·Max_Rate], (5/8·Max_Rate, 7/8·Max_Rate], and (7/8·Max_Rate, Max_Rate]. The limit numbers corresponding to these intervals are 0, 1, 2, 3, and 4, respectively. Max_Rate is a set value that can be obtained through repeated experiments.
That is, if the limiting rate Throttle_Value falls within the mapping interval (0, 1/8·Max_Rate], the corresponding limit number is 0, i.e., the processor core 110 does not need to limit the thread corresponding to the limiting rate;

if Throttle_Value falls within (1/8·Max_Rate, 3/8·Max_Rate], the corresponding limit number is 1, i.e., the number of memory access instructions the corresponding thread may send in a unit clock cycle is reduced by 1 from the set value;

if Throttle_Value falls within (3/8·Max_Rate, 5/8·Max_Rate], the corresponding limit number is 2, i.e., the number of memory access instructions the corresponding thread may send in a unit clock cycle is reduced by 2 from the set value;

if Throttle_Value falls within (5/8·Max_Rate, 7/8·Max_Rate], the corresponding limit number is 3, i.e., the number of memory access instructions the corresponding thread may send in a unit clock cycle is reduced by 3 from the set value;

if Throttle_Value falls within (7/8·Max_Rate, Max_Rate], the corresponding limit number is 4, i.e., the number of memory access instructions the corresponding thread may send in a unit clock cycle is reduced by 4 from the set value.
The set value for the number of memory access instructions a thread may send per unit clock cycle can be chosen according to the actual requirements of the multithreaded processor system; for example, it may be set to 4. It should be understood that the set value may also take other values, and the specific value should not be construed as limiting the application.
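The interval lookup described above can be sketched as a small helper. This is an illustrative sketch only, not the patent's implementation: the function name, the choice max_rate = 64, and the set value of 4 instructions per cycle are assumptions.

```python
def limit_count(throttle_value: float, max_rate: float) -> int:
    """Map the constraint rate Throttle_Value onto one of the five intervals.

    Returns the limit number, i.e. how many fewer memory access instructions
    the thread may send per unit clock cycle relative to the set value.
    Intervals are half-open on the left: (k/8*max_rate, m/8*max_rate].
    """
    if throttle_value <= 0 or throttle_value > max_rate:
        raise ValueError("Throttle_Value must lie in (0, max_rate]")
    # Upper bounds of the first four intervals, as fractions of max_rate.
    for limit, frac in enumerate((1 / 8, 3 / 8, 5 / 8, 7 / 8)):
        if throttle_value <= frac * max_rate:
            return limit          # limit numbers 0..3
    return 4                      # (7/8*max_rate, max_rate]


# Example with assumed values: max_rate = 64, set value = 4 instructions/cycle.
set_value = 4
# 30 falls in (3/8*64, 5/8*64] = (24, 40], so the limit number is 2
# and the thread may send set_value - 2 = 2 instructions per cycle.
issue_width = set_value - limit_count(30.0, 64.0)
```

With these assumed values, a Throttle_Value at the very bottom of the range (e.g. 5.0) yields limit number 0, so the thread is not throttled at all.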
In a specific implementation provided in this embodiment of the present application, the restriction rate calculated by the bandwidth control device 210 may be sent to a fetch unit in the processor core 110, or to a prefetch unit in the multi-level cache, so as to dynamically control the running speed of those units and thereby control the access bandwidth consumed in the LLC 140 by the thread corresponding to the restriction rate.
Because the present application applies limits through resources private to each thread, when a low-priority thread is throttled, high-priority threads can use the resources shared between threads, such as LLC 140 bandwidth, and thereby achieve higher performance. Meanwhile, thread control takes place within the processor core 110, namely at the Dispatch/Issue stage of the processor pipeline, so mutual interference among threads is avoided: low-priority threads release resources, and the execution performance of high-priority threads improves.
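The token-based rate measurement recited in claims 3 and 4 (a per-priority token generator feeding a per-thread token counter, with the limit rate updated from token-count deviations) can be sketched as follows. The class and function names and the gain values `kp` and `kd` are illustrative assumptions, not the patent's implementation.

```python
class ThreadTokenCounter:
    """Per-thread token counter (names are illustrative, per claim 3).

    The token generator adds one token per generation period (the preset
    issue period of the thread's priority); each memory access instruction
    the LLC forwards to lower-level storage consumes one token.  The gap
    between the initial and current token counts over a clock window
    measures how far the thread's traffic deviates from its allowed rate.
    """

    def __init__(self, initial_tokens: int):
        self.initial = initial_tokens
        self.current = initial_tokens

    def on_token_generated(self) -> None:
        # Token generator fires once per generation period.
        self.current += 1

    def on_memory_access(self) -> None:
        # LLC sends one of this thread's accesses downstream.
        self.current -= 1

    def deviation(self) -> int:
        """Initial minus current count: positive when the thread overran."""
        return self.initial - self.current


def next_limit_rate(history_rate: float, dev_now: int, dev_prev: int,
                    kp: float = 0.5, kd: float = 0.25) -> float:
    """Sketch of claim 4: new limit rate = historical rate + a change amount
    computed from the current and recent deviations (gains are assumed)."""
    delta = kp * dev_now + kd * (dev_now - dev_prev)
    return history_rate + delta
```

For instance, a counter seeded with 4 tokens that receives 2 new tokens but forwards 5 accesses ends the window with a deviation of 3, which raises the thread's limit rate on the next update.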
In the embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is merely a logical function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be implemented through communication interfaces, as indirect couplings or communication connections of devices or units, and may be electrical, mechanical, or of other form.
Further, the units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A bandwidth control device, wherein the bandwidth control device is respectively connected with a last level cache LLC in a multi-level cache and at least one processor core, the at least one processor core supporting multithreading, the at least one processor core being in communication with the multi-level cache;
the bandwidth control device is configured to obtain a first access instruction sent by the LLC to a lower-level storage unit, where the first access instruction carries a first core identifier of a first processor core that generates the first access instruction and a first thread identifier of a first thread that generates the first access instruction and is operated by the first processor core;
The bandwidth control equipment is used for determining a first processing priority corresponding to the first thread identifier, and determining the limiting rate of the first thread sending the access instruction after a preset clock period according to the first processing priority; the limiting rate is information for limiting the number of memory access instructions which can be sent by the same thread in a unit clock period;
the bandwidth control device is configured to send the restriction rate to the first processor core, instruct the processor core to restrict, according to the restriction rate, the number of access instructions sent by the first thread after a preset clock cycle.
2. The bandwidth control apparatus according to claim 1, characterized in that the bandwidth control apparatus includes a plurality of control calculation units, at least one of the plurality of control calculation units is in an operation state, the number of control calculation units in the operation state is the same as the number of all processing priorities of all threads supported by the multithreaded processor system, and the control calculation units in the operation state are in one-to-one correspondence with the processing priorities;
the bandwidth control device is used for determining a first control calculation unit corresponding to the first processing priority;
The bandwidth control device is used for calculating the limiting rate of the first thread sending the memory access instruction after a preset clock period by using the first control calculation unit.
3. The bandwidth control apparatus according to claim 2, wherein the first control calculation unit includes a token generator, a token counter, and a restriction rate calculation unit, the token generator and the restriction rate calculation unit being both connected to the token counter;
the token generator is used for generating a corresponding token for each thread belonging to the first priority, and the generation period of the generated token is the same as the preset sending period of the access instruction corresponding to the first priority;
the token counter is used for adding one to the current number of tokens of the first thread when receiving the tokens newly generated by the token generator for the first thread;
the token counter is further configured to decrease the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to a lower storage unit;
the limiting rate calculating unit is used for calculating a difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock period, and calculating the limiting rate of the first thread for sending the memory access instruction after the preset clock period according to the difference value.
4. The bandwidth control apparatus according to claim 3, wherein,
the limiting rate calculating unit is used for calculating a limiting rate change amount according to the difference value between the initial number of tokens of the first thread in the current clock period and the current number of tokens of the first thread and the difference value between the initial number of tokens of the first thread in the latest one or more clock periods and the current number of tokens of the first thread;
the limiting rate calculating unit is used for calculating the sum of the historical limiting rate and the limiting rate variation, wherein the sum is the limiting rate of the first thread sending the memory access instruction after a preset clock period.
5. A multithreaded processor system comprising at least one processor core, a multi-level cache, and a bandwidth control device recited in any of claims 1-4, the at least one processor core supporting multithreading, the at least one processor core in communication with the multi-level cache, the multi-level cache comprising a last-level cache LLC, the LLC and the at least one processor core each coupled to the bandwidth control device.
6. The multithreaded processor system of claim 5, wherein the processor core comprises an instruction output logic module and an execution/access unit, the instruction output logic module being coupled to the execution/access unit, an emission limiting module being disposed within the instruction output logic module;
and the processor core is configured to receive the limiting rate sent by the bandwidth control device and instruct the emission limiting module to limit, according to the limiting rate, the number of memory access instructions sent by the first thread after a preset clock period.
7. A method of access bandwidth control, applied to the multithreaded processor system of any one of claims 5-6, the method comprising:
the bandwidth control equipment acquires a first access instruction sent by LLC to a lower-level storage unit, wherein the first access instruction carries a first core identifier of a first processor core generating the first access instruction and a first thread identifier of a first thread operated by the first processor core and generating the first access instruction;
the bandwidth control equipment determines a first processing priority corresponding to the first thread identifier, and determines the limiting rate of the first thread sending the memory access instruction after a preset clock period according to the first processing priority; the limiting rate is information for limiting the number of memory access instructions which can be sent by the same thread in a unit clock period;
and the bandwidth control equipment sends the limiting rate to the first processor core, and instructs the processor core to limit the number of memory access instructions sent by the first thread after a preset clock period according to the limiting rate.
8. The method of claim 7, wherein determining the restriction rate of the first thread to send memory access instructions after a predetermined clock cycle based on the first processing priority comprises:
the bandwidth control device determines a first control calculation unit corresponding to the first processing priority;
and the bandwidth control equipment calculates the limiting rate of the first thread sending the memory access instruction after a preset clock period by using the first control calculation unit.
9. The method of claim 8, wherein the bandwidth control apparatus calculating, using the first control calculation unit, a restriction rate of the processor core to transmit the memory access instruction after a preset clock period, includes:
the token counter adds one to the current number of tokens of the first thread when receiving the tokens newly generated by the token generator for the first thread;
the token counter reduces the current number of tokens of the first thread by one when the LLC sends a memory access instruction of the first thread to a lower-level storage unit;
the limiting rate calculating unit calculates the difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock period, and calculates the limiting rate of the first thread for sending the memory access instruction after the preset clock period according to the difference value.
10. The method of claim 9, wherein calculating the limit rate of the first thread to send memory access instructions after a predetermined clock cycle based on the difference value comprises:
the limiting rate calculating unit calculates a limiting rate change amount according to the difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the current clock period and the difference value between the initial number of tokens of the first thread and the current number of tokens of the first thread in the latest one or more clock periods;
the limiting rate calculating unit calculates the sum of the historical limiting rate and the limiting rate change, wherein the sum is the limiting rate of the first thread sending the memory access instruction after a preset clock period.
11. The method of claim 7, wherein the bandwidth control device transmitting the restriction rate to the first processor core comprises:
and the bandwidth control device sends the limiting rate to an emission limiting module in the first processor core, and instructs the emission limiting module to limit, according to the limiting rate, the number of memory access instructions sent by the first thread after a preset clock period.
CN202010991780.6A 2020-09-18 2020-09-18 Bandwidth control device, multithread controller system and memory access bandwidth control method Active CN112083957B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010991780.6A CN112083957B (en) 2020-09-18 2020-09-18 Bandwidth control device, multithread controller system and memory access bandwidth control method

Publications (2)

Publication Number Publication Date
CN112083957A CN112083957A (en) 2020-12-15
CN112083957B true CN112083957B (en) 2023-10-20

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505084B (en) * 2021-06-24 2023-09-12 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
CN113821324B (en) * 2021-09-17 2022-08-09 海光信息技术股份有限公司 Cache system, method, apparatus and computer medium for processor
CN117170957A (en) * 2022-05-26 2023-12-05 华为技术有限公司 Control device, control method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716183A (en) * 2004-06-30 2006-01-04 中国科学院计算技术研究所 A kind of charge system of getting devices and methods therefor of multiline procedure processor simultaneously that is applied to
CN1987825A (en) * 2005-12-23 2007-06-27 中国科学院计算技术研究所 Fetching method and system for multiple line distance processor using path predicting technology
CN101763251A (en) * 2010-01-05 2010-06-30 浙江大学 Instruction decode buffer device of multithreading microprocessor
CN110032266A (en) * 2018-01-10 2019-07-19 广东欧珀移动通信有限公司 Information processing method, device, computer equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477526B2 (en) * 2013-09-04 2016-10-25 Nvidia Corporation Cache utilization and eviction based on allocated priority tokens



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant