CN116719643B - Multi-core processor scheduling method and device for optimizing three-level cache access delay

Info

Publication number
CN116719643B
Authority
CN
China
Prior art keywords
processor
processor core
scheduling
core
cache block
Prior art date
Legal status
Active
Application number
CN202310878869.5A
Other languages
Chinese (zh)
Other versions
CN116719643A (en)
Inventor
刘丹 (Liu Dan)
苟鹏飞 (Gou Pengfei)
Current Assignee
Beijing Hexin Digital Technology Co ltd
Hexin Technology Co ltd
Original Assignee
Beijing Hexin Digital Technology Co ltd
Hexin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Hexin Digital Technology Co ltd, Hexin Technology Co ltd filed Critical Beijing Hexin Digital Technology Co ltd
Priority to CN202310878869.5A
Publication of CN116719643A
Application granted
Publication of CN116719643B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038: Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G06F 9/5088: Techniques for rebalancing the load in a distributed system involving task migration
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application belongs to the technical field of processors and discloses a multi-core processor scheduling method and device for optimizing three-level (L3) cache access delay. The method comprises the following steps: constructing a jump distance (hop count) sorting table of each three-level cache block and each processor core; constructing a mapping relation configuration table of each three-level cache block and its memory address range; acquiring a new task and a candidate processor core queue; querying the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain a target processor core; and assigning the new task to the target processor core. The method and the device can reduce the delay of each processor core's accesses to the three-level cache.

Description

Multi-core processor scheduling method and device for optimizing three-level cache access delay
Technical Field
The application relates to the technical field of processors, in particular to a multi-core processor scheduling method and device for optimizing three-level cache access delay.
Background
When the number of processor cores in a multi-core processor chip is relatively large, the inter-core interconnect tends to use a mesh structure, such as the ARM CMN (Coherent Mesh Network). In this case, the L1 Cache (first-level cache) and the L2 Cache (second-level cache) are generally private to each processor core (in some architectures a few cores may share an L2 Cache), while the L3 Cache (third-level cache) is distributed across the mesh interconnect and shared by all processor cores.
Therefore, the technical problem to be solved by the application is how to reduce the delay of the multi-core processor's accesses to the three-level cache.
Disclosure of Invention
The multi-core processor scheduling method and device for optimizing three-level cache access delay provided by the application can reduce the delay of each processor core's accesses to the three-level cache and improve processor performance.
In a first aspect, the present application provides a method for scheduling a multi-core processor for optimizing a three-level cache access delay, where the method includes:
constructing a jump distance sorting table of each three-level cache block and each processor core;
constructing a mapping relation configuration table of each three-level cache block and a memory address;
acquiring a new task and a candidate processor core queue;
a querying step: querying the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain a target processor core;
assigning the new task to the target processor core.
Further, the constructing the jump distance sorting table of each three-level cache block and each processor core includes:
constructing the jump distance sorting table of each three-level cache block and each processor core according to the chip topology.
This embodiment obtains the distance information between the three-level cache blocks and the processor cores.
Further, the constructing the jump distance sorting table of each three-level cache block and each processor core according to the chip topology includes: acquiring the position of each processor core and each three-level cache block in the chip topology;
a distance acquisition step: obtaining the jump distance from the processor core's position to each three-level cache block based on a routing algorithm;
a sorting step: sorting the jump distances to obtain a jump distance matrix for the processor core;
executing the distance acquisition step and the sorting step for each processor core to obtain a plurality of jump distance matrices;
and obtaining the jump distance sorting table from the jump distance matrices.
The above embodiment can quickly and accurately obtain the jump distance between the processor core and each three-level cache block by using the routing algorithm, so that an accurate jump distance sorting table can be obtained.
Further, the candidate processor core queue is obtained by screening the processor cores according to heuristic rules.
In this embodiment, the candidate processor core queue is obtained through heuristic rules, so that the finally assigned target processor core optimizes multi-core scheduling performance from several perspectives.
Further, the heuristic rules include a load balancing policy, a core sensitivity policy, and an access delay policy.
In this embodiment, the candidate processor core queue is obtained according to the load balancing, core sensitivity, and access delay policies, so that the selected target processor core is simultaneously optimized for load balancing, core sensitivity, and access delay.
Further, the method further comprises: after obtaining the candidate processor core queue, detecting whether the number of processor cores in the candidate processor core queue is greater than a first preset value; if yes, executing the querying step.
This embodiment avoids needless queries of the jump distance sorting table and the mapping relation configuration table, further streamlining the heuristic flow of task allocation on the multi-core processor.
Further, the querying the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain the target processor core includes:
querying the mapping relation configuration table according to the access address of the new task to obtain the cache block to be accessed;
and querying the jump distance sorting table according to the cache block to be accessed to obtain the processor core in the candidate processor core queue closest to that cache block, and taking it as the target processor core.
In this embodiment, the three-level cache block to be accessed is determined from the access address of the new task, so that the processor core closest to that block is selected, avoiding the large delay incurred by accessing a distant three-level cache block to complete the task.
Further, the method further comprises:
after obtaining the cache blocks to be accessed, detecting whether the number of the cache blocks to be accessed is larger than a second preset value;
if yes, obtaining the target processor core according to the access address of the new task, the cache block to be accessed, the jump distance sorting table and the mapping relation configuration table.
This embodiment avoids the erroneous queries of the jump distance sorting table that easily arise when a new task needs to access multiple three-level cache blocks, and enables the allocation of new tasks that access multiple three-level cache blocks.
Further, the obtaining the target processor core according to the access address of the new task, the cache block to be accessed, the jump distance sorting table and the mapping relation configuration table includes:
obtaining the address weight of each cache block to be accessed according to the access address of the new task and the mapping relation configuration table;
according to the jump distance sorting table and the address weight of each cache block to be accessed, calculating the access delay of each processor core in the candidate processor core queue; the processor core with the lowest access delay is taken as the target processor core.
This embodiment accurately calculates each processor core's three-level cache access delay for the new task based on the address weights of the new task across the three-level cache blocks, ensuring that the target processor core has the minimum access delay.
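Written compactly (notation ours, not from the application): the access delay of a candidate core c is delay(c) = Σ_b w_b · d(c, b), where b ranges over the cache blocks to be accessed, w_b is the address weight of block b, and d(c, b) is the jump distance from core c to block b; the target core is the candidate core that minimizes delay(c).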
Further, the calculating the access delay of each processor core in the candidate processor core queue according to the jump distance sorting table and the address weight of each to-be-accessed cache block includes:
obtaining the distance from the processor core to each cache block to be accessed according to the jump distance sorting table; multiplying the distance from the processor core to the to-be-accessed cache block by the address weight of the to-be-accessed cache block to obtain the sub-delay from the processor core to the to-be-accessed cache block;
and adding the sub-delays from the processor core to each cache block to be accessed to obtain the access delay of the processor core.
This embodiment avoids the flow errors that can arise when a new task accesses multiple three-level cache blocks: the access delay is calculated from the address weights of the accessed blocks, ensuring that the target processor core has the lowest delay in accessing the three-level cache blocks.
Further, the method further comprises:
responding to the scheduling interrupt instruction, and performing scheduling domain traversal based on each scheduling domain to obtain a to-be-balanced processor;
and migrating the process in the processor to be balanced to the current processor.
This embodiment realizes load balancing and process migration within the multi-core processor, avoiding process delays and performance degradation caused by an excessively high load on a single processor.
Further, the performing the scheduling domain traversal based on each scheduling domain to obtain the processor to be balanced includes:
taking the lowest scheduling domain as the current scheduling domain, and traversing the current scheduling domain;
maximum load acquisition step: obtaining a maximum load processor according to the load value of each scheduling group of the current scheduling domain;
judging whether the load imbalance of the current scheduling domain is less than the migration cost;
if yes, taking the parent scheduling domain of the current scheduling domain as the current scheduling domain and returning to the maximum load acquisition step; otherwise, taking the maximum-load processor as the processor to be balanced.
The above embodiment provides the steps of judging and acquiring the processors to be balanced according to the scheduling domain and the scheduling group, so that accurate judgment of the processors needing load balancing and process migration is realized.
Further, the scheduling domains include a plurality of three-level scheduling domains at the same level; a three-level scheduling domain comprises the processor cores that access the same three-level cache block at the same distance.
By adding the three-level scheduling domain, load balancing no longer selects an arbitrary core among the CPUs of the same node, but preferentially selects cores with the same three-level cache access distance, thereby optimizing the processors' three-level cache access delay during load balancing.
Further, the obtaining the maximum load processor according to the load value of each scheduling group of the current scheduling domain includes:
if the scheduling group in the current scheduling domain is a three-level scheduling domain, calculating a load value of the three-level scheduling domain according to the number of processor cores in the three-level scheduling domain and the load of each processor core.
The method for calculating the load value of the three-level scheduling domain provided by this embodiment avoids load-value calculation errors, and hence erroneous load balancing results, caused by three-level scheduling domains having different compositions.
In a second aspect, the present application provides a multi-core processor scheduling apparatus for optimizing a three-level cache access delay, the apparatus comprising:
the jump distance building module (101) is used for building a jump distance sorting table of each three-level cache block and each processor core;
the mapping relation construction module (102) is used for constructing a mapping relation configuration table of each three-level cache block and the memory address;
an acquisition module (103) for acquiring a new task and a candidate processor core queue;
the query module (104) is used for querying the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain a target processor core;
An allocation module (105) for allocating the new task to the target processor core.
With the multi-core processor scheduling device for optimizing three-level cache access delay provided above, a jump distance sorting table of the three-level cache blocks and the processor cores and a mapping relation configuration table of the three-level cache blocks and memory addresses are constructed. After a new task is received, the two tables are queried according to the new task and the candidate processor core queue to obtain the target processor core closest to the three-level cache blocks the new task will access, and the new task is assigned to that core. The processor core therefore does not need to access distant three-level cache blocks to complete the new task, which reduces the delay of each processor core's accesses to the three-level cache and improves processor performance.
Further, the device further comprises:
the traversing module (201) is used for responding to the scheduling interrupt instruction, and performing scheduling domain traversing based on each scheduling domain to obtain a processor to be balanced;
and the migration module (202) is used for migrating the process in the processor to be balanced to the current processor.
This embodiment provides the device that realizes load balancing and processor process migration within the multi-core processor, avoiding process delays and performance degradation caused by an excessively high load on a single processor.
Further, the traversing module (201) includes:
the bottom layer traversing unit (21) is used for taking the scheduling domain at the bottommost layer as the current scheduling domain and traversing the current scheduling domain;
a load processing unit (22) for obtaining a maximum load processor according to the load values of each scheduling group of the current scheduling domain;
a judging unit (23) for judging whether the load imbalance of the current scheduling domain is less than the migration cost;
an equalizing unit (24) for taking the parent scheduling domain of the current scheduling domain as the current scheduling domain and returning to the load processing unit (22), or for taking the maximum-load processor as the processor to be balanced.
The above embodiment provides a unit structure for judging and acquiring the to-be-balanced processor in the traversing module (201), so that accurate judgment of the processor needing load balancing and process migration is realized.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the multi-core processor scheduling method for optimizing three-level cache access latency of any of the embodiments described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a multi-core processor scheduling method of optimizing three-level cache access latency of any of the embodiments described above.
In summary, compared with the prior art, the technical scheme provided by the embodiment of the application has the beneficial effects that at least:
according to the multi-core processor scheduling method for optimizing the three-level cache access delay, the three-level cache blocks, the jump distance sorting table of the processor cores and the mapping relation configuration table of the three-level cache blocks and the memory addresses are constructed, after a new task is received, the two tables are queried according to the new task and the candidate processor core queue, so that a target processor core closest to the three-level cache blocks to be accessed by the new task is obtained, the new task is distributed to the target processor core, and the processor core does not need to access the far three-level cache blocks for completing the new task, so that the access delay of each processor for checking the three-level cache is reduced, and the performance of the processor is improved.
Drawings
Fig. 1 is a flowchart of a multi-core processor scheduling method for optimizing three-level cache access delay according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a chip topology structure provided in an embodiment of the present application.
Fig. 3 is a flowchart of a target processor core obtaining step provided in an embodiment of the present application.
Fig. 4 is a flowchart of a load balancing step provided in an embodiment of the present application.
Fig. 5 is a schematic diagram of a task queue of a processor during load balancing according to an embodiment of the present application.
Fig. 6 is a flowchart of the step of determining the processor to be balanced according to another embodiment of the present application.
Fig. 7 is a block diagram of a multi-core processor scheduling device for optimizing three-level cache access delay according to an embodiment of the present application.
Fig. 8 is a block diagram of a multi-core processor scheduling apparatus for optimizing three-level cache access delay according to an embodiment of the present application.
Fig. 9 is a block diagram of a traversal module (201) according to an embodiment of the present application.
Detailed Description
The following clearly and fully describes the embodiments of the present application with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without creative effort fall within the scope of the present disclosure.
Referring to fig. 1, an embodiment of the present application provides a multi-core processor scheduling method for optimizing three-level cache access delay, including the following steps:
and S01, constructing a jump step distance sorting table of each three-level cache block and each processor core.
And S02, constructing a mapping relation configuration table of each three-level cache block and the memory address.
Step S03, obtaining a new task and a candidate processor core queue.
And step S04, inquiring a jump step distance sorting table and a mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain the target processor core.
Step S05, distributing the new task to the target processor core.
According to the multi-core processor scheduling method for optimizing three-level cache access delay described above, a jump distance sorting table of the three-level cache blocks and the processor cores and a mapping relation configuration table of the three-level cache blocks and memory addresses are constructed. After a new task is received, the two tables are queried according to the new task and the candidate processor core queue to obtain the target processor core closest to the three-level cache blocks the new task will access, and the new task is assigned to that core. The processor core therefore does not need to access distant three-level cache blocks to complete the new task, which reduces the delay of each processor core's accesses to the three-level cache and improves processor performance.
There are many ways to construct the mapping relation configuration table, for example: an address sharing (equal-division) policy, dividing the entire memory address space equally according to the number of three-level cache blocks so that each block covers an address range of the same size; an address hashing policy, hashing memory addresses and then distributing them across the three-level cache blocks; a hierarchical policy, placing memory addresses into the three-level cache blocks according to a hierarchical structure; and so on. Specifically, the jump distance sorting tables of the three-level cache blocks SLC (system level cache) 21 and SLC22 are as follows:
SLC21 Distance
Core    Distance
EC21    1
EC22    2
EC14    2
EC15    3
EC23    3

SLC22 Distance
Core    Distance
EC22    1
EC21    2
EC23    2
EC15    2
EC14    3
The mapping relation configuration table of each three-level cache block and the memory addresses is as follows (note: the address space corresponding to an SLC is not necessarily contiguous):

SLC (three-level cache block)    Address Range
SLC21    0x0 ~ 0xffffffff
SLC22    0x100000000 ~ 0x1ffffffff
SLC23    0x200000000 ~ 0x2ffffffff
Specifically, given the candidate processor core queue, the ID of the corresponding L3 Cache block is looked up in the mapping relation configuration table according to the memory address range allocated to the new task; then, according to that L3 Cache block ID, the processor core closest to the block among all candidate cores is found in the jump distance sorting table, and the new task is placed in that core's task queue.
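For illustration, the following is a minimal C sketch of the two tables and the target-core query described above. The table sizes, core and SLC numbering, and data-structure layout are illustrative assumptions, not part of the application:

/* A minimal sketch of the two lookup tables and the target-core query.
 * Table sizes, IDs, and struct layout are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>

#define NUM_CORES 16
#define NUM_SLC    8

/* Mapping relation configuration table: one address range per L3 (SLC) block. */
struct slc_range {
    uint64_t base;   /* inclusive lower bound */
    uint64_t limit;  /* inclusive upper bound */
};
static struct slc_range slc_map[NUM_SLC];

/* Jump distance sorting table: hop[c][s] = hops from core c to SLC block s. */
static int hop[NUM_CORES][NUM_SLC];

/* Step 1: find the SLC block whose address range covers the task's address. */
static int slc_of_addr(uint64_t addr)
{
    for (int s = 0; s < NUM_SLC; s++)
        if (addr >= slc_map[s].base && addr <= slc_map[s].limit)
            return s;
    return -1;  /* address not mapped to any SLC block */
}

/* Step 2: among the candidate cores, pick the one nearest to that block. */
static int pick_target_core(uint64_t addr, const int *cand, size_t n)
{
    if (n == 0)
        return -1;
    if (n == 1)               /* single candidate: skip both table queries */
        return cand[0];
    int s = slc_of_addr(addr);
    if (s < 0)
        return cand[0];       /* fall back to the heuristic choice */
    int best = cand[0];
    for (size_t i = 1; i < n; i++)
        if (hop[cand[i]][s] < hop[best][s])
            best = cand[i];
    return best;
}

The single-candidate shortcut mirrors the first-preset-value check described below: with only one core in the candidate queue, no table query is needed.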
Referring to fig. 2, in some embodiments, the constructing the jump distance sorting table of each three-level cache block and each processor core includes: constructing the jump distance sorting table of each three-level cache block and each processor core according to the chip topology.
This embodiment obtains the distance information between the three-level cache blocks and the processor cores.
Referring to fig. 2, in some embodiments, the constructing the jump distance sorting table of each three-level cache block and each processor core according to the chip topology includes:
acquiring the position of each processor core and each three-level cache block in the chip topology.
A distance acquisition step: obtaining the jump distance from the processor core's position to each three-level cache block based on a routing algorithm.
A sorting step: sorting the jump distances to obtain a jump distance matrix for the processor core.
Executing the distance acquisition step and the sorting step for each processor core to obtain a plurality of jump distance matrices.
And obtaining the jump distance sorting table from the jump distance matrices.
Taking the jump distance between processor core EC21 and three-level cache block SLC22 in fig. 2 as an example, the routing algorithm works as follows: the distance from EC21 to node XP05 is 1, and the distance from node XP05 to node XP15 is 1; adding these gives a distance of 2 from EC21 to SLC22. Proceeding in the same way yields the distance from EC21 to every three-level cache block; sorting these distances produces EC21's jump distance matrix, from which the jump distance sorting table is obtained.
The above embodiment can quickly and accurately obtain the jump distance between the processor core and each three-level cache block by using the routing algorithm, so that an accurate jump distance sorting table can be obtained.
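As an illustration, the following C sketch computes the jump distance on a 2D mesh, assuming nodes carry (x, y) grid coordinates and dimension-ordered (XY) routing, under which the hop count reduces to the Manhattan distance; the application only requires "a routing algorithm", so XY routing is one possible choice:

/* A sketch of the jump distance computation, assuming a 2D mesh whose nodes
 * have (x, y) grid coordinates and dimension-ordered (XY) routing, so the
 * hop count equals the Manhattan distance. */
struct xy { int x, y; };

static int hops(struct xy a, struct xy b)
{
    int dx = a.x > b.x ? a.x - b.x : b.x - a.x;
    int dy = a.y > b.y ? a.y - b.y : b.y - a.y;
    return dx + dy;  /* e.g. EC21 -> XP05 -> XP15 (SLC22's node): 2 hops */
}

/* Fill one core's row and sort the SLC indices ascending by distance,
 * producing the per-core "jump distance matrix" described in the text. */
static void build_row(struct xy core, const struct xy *slc, int n,
                      int *dist, int *order)
{
    for (int s = 0; s < n; s++) {
        dist[s] = hops(core, slc[s]);
        order[s] = s;
    }
    for (int i = 1; i < n; i++)          /* insertion sort by distance */
        for (int j = i; j > 0 && dist[order[j]] < dist[order[j - 1]]; j--) {
            int t = order[j];
            order[j] = order[j - 1];
            order[j - 1] = t;
        }
}

Running build_row once per processor core, before the OS starts scheduling, yields the complete jump distance sorting table.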
In some embodiments, the candidate processor core queue is obtained by screening the processor cores according to heuristic rules.
Here, the heuristic rules are the existing methods, other than the method proposed in this application, for assigning a new task to a processor's scheduling queue.
In this embodiment, the candidate processor core queue is obtained through heuristic rules, so that the finally assigned target processor core optimizes multi-core scheduling performance from several perspectives.
In some embodiments, the heuristic rules include a load balancing policy, a core sensitivity policy, and an access latency policy.
In the MQMS (Multi-Queue Scheduling) method, when a new task enters the system, the system puts the task into a corresponding scheduling queue according to certain heuristic rules, and each scheduling queue schedules independently. If a task blocks, it is placed back in the same queue as much as possible when it wakes up, so as to reduce the extra overhead caused by a cold cache or excessive access delay.
The heuristic rules before improvement mainly consider factors such as load balancing (e.g., selecting a CPU core queue with a relatively light load), core sensitivity (e.g., big/little cores), and access delay (e.g., preferentially selecting the CPU core corresponding to local memory).
In this application, the candidate processor core queue is likewise generated according to these heuristic rules.
In this embodiment, the candidate processor core queue is obtained according to the load balancing, core sensitivity, and access delay policies, so that the selected target processor core is simultaneously optimized for load balancing, core sensitivity, and access delay.
In some embodiments, the method further comprises: after obtaining the candidate processor core queue, detecting whether the number of processor cores in the candidate processor core queue is larger than a first preset value; if yes, executing the query step.
The first preset value is 1, and the querying step is step S04 in the above embodiment, i.e., querying the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain the target processor core. If the number of cores in the candidate processor core queue is 1, the querying step may be omitted and the sole processor core used directly as the target processor core.
Specifically, if the existing heuristic rules select only one processor core, there is no need to query the jump distance sorting table or the mapping relation configuration table, and the new task is assigned directly to that sole processor core.
This embodiment avoids needless queries of the jump distance sorting table and the mapping relation configuration table, further streamlining the heuristic flow of task allocation on the multi-core processor.
Referring to fig. 3, in some embodiments, the querying the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain the target processor core includes:
and step S041, inquiring a mapping relation configuration table according to the access address of the new task to obtain a cache block to be accessed.
And step S042, inquiring a jump distance sorting table according to the to-be-accessed cache block to obtain the processor core closest to the to-be-accessed cache block in the candidate processor core queue, and taking the processor core as a target processor core.
The ID of the corresponding L3 Cache block is searched in a mapping relation configuration table according to the memory address range allocated by the new task, then the processor cores closest to the L3 Cache block in all candidate processor cores are searched in a jump distance sorting table according to the ID number of the L3 Cache block, and the new task is put into a task queue of the processor cores.
According to the embodiment, the three-level cache block to be accessed is determined according to the access address of the new task, so that the processor core closest to the accessed three-level cache block is selected, and the huge delay caused by accessing the remote three-level cache block for completing the task is avoided.
In some embodiments, the method may further comprise:
after the cache blocks to be accessed are obtained, detecting whether the number of cache blocks to be accessed is greater than a second preset value.
If yes, obtaining the target processor core according to the access address of the new task, the cache block to be accessed, the jump distance sorting table and the mapping relation configuration table. Wherein the second preset value is 1.
Specifically, if the number of cache blocks to be accessed is less than or equal to the second preset value, step S042 is executed: the jump distance sorting table is queried directly according to the cache block to be accessed, without querying the mapping relation configuration table again, and the processor core in the candidate processor core queue closest to that cache block is taken as the target processor core.
This embodiment avoids the erroneous queries of the jump distance sorting table that easily arise when a new task needs to access multiple three-level cache blocks, and enables the allocation of new tasks that access multiple three-level cache blocks.
In some embodiments, the obtaining the target processor core according to the access address of the new task, the cache block to be accessed, the jump distance sorting table and the mapping relation configuration table includes:
and obtaining the address weight of each cache block to be accessed according to the access address of the new task and the mapping relation configuration table.
According to the jump distance sorting table and the address weight of each cache block to be accessed, calculating the access delay of each processor core in the candidate processor core queue; the processor core with the lowest access delay is taken as the target processor core.
Specifically, if the mapping relation configuration table shows that the memory address range of the new task involves multiple L3 Cache blocks, then when calculating the distance from each candidate processor core to the L3 Cache, the distance from the core to every involved L3 Cache block is found in the jump distance sorting table, and the total distance is computed by weighting each block's distance by the fraction of the address range it holds.
This embodiment accurately calculates each processor core's three-level cache access delay for the new task based on the address weights of the new task across the three-level cache blocks, ensuring that the target processor core has the minimum access delay.
In some embodiments, the calculating the access delay of each processor core in the candidate processor core queue according to the jump distance sorting table and the address weight of each to-be-accessed cache block includes:
Obtaining the distance from the processor core to each cache block to be accessed according to the jump distance sorting table; multiplying the distance from the processor core to the to-be-accessed cache block by the address weight of the to-be-accessed cache block to obtain the sub-delay from the processor core to the to-be-accessed cache block.
And adding the sub-delays from the processor core to each cache block to be accessed to obtain the access delay of the processor core.
For example, using the tables above, if 80% of the memory space accessed by the new task is distributed in SLC21 and 20% in SLC22, the delay of processor core EC21 accessing the SLC21 and SLC22 blocks of the L3 Cache is 80% × 1 + 20% × 2 = 1.2, while the delay of processor core EC22 is 80% × 2 + 20% × 1 = 1.8. Since 1.2 < 1.8, the access delay of EC21 is lower, and the new task is assigned to EC21.
This embodiment avoids the flow errors that can arise when a new task accesses multiple three-level cache blocks: the access delay is calculated from the address weights of the accessed blocks, ensuring that the target processor core has the lowest delay in accessing the three-level cache blocks.
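A minimal C sketch of this weighted-delay selection follows; the struct layout and the NUM_SLC bound are illustrative assumptions:

#define NUM_SLC 8                 /* illustrative number of L3 (SLC) blocks */

/* One entry per three-level cache block the task touches. */
struct access {
    int    slc;     /* index of the SLC block */
    double weight;  /* fraction of the task's address range in that block */
};

/* Weighted access delay of one core: sum of (hop distance x address weight). */
static double access_delay(int core, const struct access *a, int nblocks,
                           int hop[][NUM_SLC])
{
    double d = 0.0;
    for (int b = 0; b < nblocks; b++)
        d += hop[core][a[b].slc] * a[b].weight;  /* sub-delay per block */
    return d;
}

/* Pick the candidate core with the lowest weighted access delay (n >= 1). */
static int min_delay_core(const int *cand, int n,
                          const struct access *a, int nblocks,
                          int hop[][NUM_SLC])
{
    int best = cand[0];
    double bestd = access_delay(best, a, nblocks, hop);
    for (int i = 1; i < n; i++) {
        double d = access_delay(cand[i], a, nblocks, hop);
        if (d < bestd) {
            bestd = d;
            best = cand[i];
        }
    }
    return best;
}

With weights {0.8, 0.2} and the hop distances from the tables above, access_delay() yields 1.2 for EC21 and 1.8 for EC22, so min_delay_core() selects EC21, matching the example.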
Referring to fig. 4 and 5, in some embodiments, the method further comprises:
Step S11, responding to the scheduling interrupt instruction, and performing a scheduling domain traversal based on each scheduling domain to obtain the processor to be balanced.
Step S12, migrating a process in the processor to be balanced to the current processor.
The scheduling interrupt instruction is implemented by an interrupt function provided in the operating system; the number of clock ticks between scheduling interrupts is set in that function, i.e., a scheduling interrupt occurs once every fixed period or number of clock ticks.
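A minimal C sketch of such a periodic scheduling interrupt follows; the hook name, the tick period, and the trigger_load_balance() function are hypothetical:

/* Hypothetical periodic scheduling interrupt: every TICKS_PER_BALANCE
 * clock ticks the timer handler triggers a load-balancing pass. */
#define TICKS_PER_BALANCE 100     /* assumed period, in clock ticks */

extern void trigger_load_balance(void);   /* walks the scheduling domains */

static unsigned long ticks;

void timer_interrupt_handler(void)        /* hypothetical OS timer hook */
{
    if (++ticks % TICKS_PER_BALANCE == 0)
        trigger_load_balance();
}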
As shown in fig. 5, assuming CPU0 is allocated tasks A and C and CPU1 tasks B and D, if after a period of time task C finishes while A remains, the load of CPU0 will clearly be lower than that of CPU1. To solve this problem, processes must be "migrated" in time, moving a process from one CPU to another. The existing load balancing method controls task migration through scheduling domains and scheduling groups, where the scheduling group is the basic unit of load balancing and migration can only occur between scheduling groups within the same scheduling domain. The following are the definitions of the scheduling domains and scheduling groups in Linux before improvement.
Scheduling domain    Domain composition       Scheduling group
SMT                  Core                     Thread
MC                   Cores sharing an L2      Core
DIE                  Node                     Cores sharing an L2
NUMA                 Cluster                  Node
This embodiment realizes load balancing and process migration within the multi-core processor, avoiding process delays and performance degradation caused by an excessively high load on a single processor.
Referring to fig. 6, in some embodiments, the foregoing performing the scheduling domain traversal based on each scheduling domain to obtain the processor to be balanced may specifically include the following steps:
step S111, using the bottom scheduling domain as the current scheduling domain, and traversing the current scheduling domain.
Step S112, obtaining the maximum load processor according to the load value of each scheduling group of the current scheduling domain.
Step S113, judging whether the load imbalance of the current scheduling domain is less than the migration cost.
Step S114, if yes, taking the parent scheduling domain of the current scheduling domain as the current scheduling domain and returning to step S112; otherwise, taking the maximum-load processor as the processor to be balanced.
The bottommost scheduling domain is an SMT scheduling domain, namely a core scheduling domain taking threads as a scheduling group.
Specifically, when the scheduling interrupt occurs, the basic flow for implementing load balancing is:
1. Traverse the scheduling domains from bottom to top starting from the current CPU, balancing load from the lowest level.
2. Find the busiest scheduling group in the current scheduling domain, then find the busiest CPU within that group.
3. If the current scheduling domain does not meet the migration condition (its load imbalance is less than the migration cost), move up one level: the current domain is treated as one scheduling group of its parent domain and compared with the parent's other groups.
4. If the current scheduling domain meets the migration condition, migrate a process from the busiest CPU to the current CPU.
During the bottom-up traversal, the load value of the busiest processor in each level's scheduling domain and the corresponding CPU are recorded. For example, while balancing at the SMT level, the most loaded CPU within the current core (the current scheduling domain) is recorded; on entering the MC level, the load value of the most loaded CPU in each core sharing the L2 is obtained first, and these values are then compared to determine which CPU's load is the largest.
The above embodiment provides the steps of judging and acquiring the processors to be balanced according to the scheduling domain and the scheduling group, so that accurate judgment of the processors needing load balancing and process migration is realized.
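For illustration, the following C sketch outlines this bottom-up traversal; the sched_domain and sched_group structures and their precomputed load fields are hypothetical simplifications (the corresponding Linux code path, load_balance() and related functions, is considerably more involved):

/* Hypothetical, simplified scheduling structures. */
struct sched_group {
    int  busiest_cpu;            /* most loaded CPU inside this group */
    long load;                   /* aggregate load of the group */
    struct sched_group *next;    /* sibling groups within the domain */
};

struct sched_domain {
    struct sched_group  *groups;
    struct sched_domain *parent;         /* next level up (MC, DIE, ...) */
    long imbalance;                      /* measured load imbalance */
    long migration_cost;                 /* threshold for migrating */
};

/* Walk from the lowest (SMT) domain upward; return a CPU to pull work
 * from, or -1 if every level is balanced enough. */
static int find_cpu_to_balance(struct sched_domain *sd)
{
    for (; sd; sd = sd->parent) {
        struct sched_group *g, *busiest = sd->groups;
        for (g = sd->groups; g; g = g->next)      /* busiest group */
            if (g->load > busiest->load)
                busiest = g;
        /* migrate only when the imbalance outweighs the migration cost */
        if (sd->imbalance >= sd->migration_cost)
            return busiest->busiest_cpu;
    }
    return -1;
}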
In some embodiments, the scheduling domains include a plurality of three-level scheduling domains at the same level; a three-level scheduling domain comprises the processor cores that access the same three-level cache block at the same distance.
Specifically, one level, ML3C, is added to the scheduling domain/scheduling group hierarchy, namely the three-level scheduling domain: a scheduling domain at this level consists of the cores whose distance for accessing a given L3 Cache block is the same, and its scheduling groups are the scheduling domains of the level below (cores sharing an L2).
Taking the above topology as an example, assuming that all processor cores have independent L1 and L2 caches and that the L3 Cache block accessed by the current task is SLC21, the relevant scheduling domains and scheduling groups follow from the SLC21 Distance table (only the processor cores listed there are considered): EC21 at distance 1, EC22 and EC14 at distance 2, and EC15 and EC23 at distance 3 form three separate ML3C scheduling domains.
according to the embodiment, the three-level scheduling domain is added, so that when load balancing is carried out, one core is not selected from the plurality of CPUs of the same node at will, but the cores with the same three-level cache access distance can be selected preferentially, and therefore the access delay of the processor to the three-level cache in load balancing is optimized.
In some embodiments, the obtaining the maximum load processor according to the load value of each scheduling group of the current scheduling domain includes:
if the scheduling group in the current scheduling domain is a three-level scheduling domain, calculating a load value of the three-level scheduling domain according to the number of processor cores in the three-level scheduling domain and the load of each processor core.
Specifically, since different ML3C scheduling domains may contain different numbers of cores, when ML3C serves as the scheduling group and DIE is the current scheduling domain, the comparison used to find the busiest, most loaded scheduling group within the DIE must take the core count into account; that is, the number of cores and the load of each core together determine the total load value of an ML3C group.
The method for calculating the load value of the three-level scheduling domain provided by this embodiment avoids load-value calculation errors, and hence erroneous load balancing results, caused by three-level scheduling domains having different compositions.
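A minimal C sketch of this comparison follows. Because ML3C groups can contain different numbers of cores, comparing raw load sums would favor larger groups; one reading of the text is to compare per-core average load. The struct and the normalization choice are assumptions for illustration:

#define MAX_CORES_PER_GROUP 8     /* illustrative upper bound */

/* Hypothetical ML3C scheduling-group descriptor. */
struct ml3c_group {
    int  nr_cores;                          /* cores in this group */
    long core_load[MAX_CORES_PER_GROUP];    /* per-core load values */
};

/* Per-core average load, so groups of different sizes compare fairly. */
static long ml3c_avg_load(const struct ml3c_group *g)
{
    long sum = 0;
    for (int i = 0; i < g->nr_cores; i++)
        sum += g->core_load[i];
    return g->nr_cores ? sum / g->nr_cores : 0;
}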
Referring to fig. 7, another embodiment of the present application provides a multi-core processor scheduling apparatus for optimizing three-level cache access delay, where the apparatus may specifically include:
the step distance building module 101 is configured to build a step distance ordering table of each three-level cache block and each processor core.
The mapping relation construction module 102 is configured to construct a mapping relation configuration table of each three-level cache block and the memory address.
An obtaining module 103, configured to obtain the new task and the candidate processor core queue.
The query module 104 is configured to query the jump distance sorting table and the mapping relation configuration table according to the access address of the new task and the candidate processor core queue to obtain the target processor core.
An allocation module 105 for allocating new tasks to the target processor cores.
With the multi-core processor scheduling device for optimizing three-level cache access delay described above, a jump distance sorting table of the three-level cache blocks and the processor cores and a mapping relation configuration table of the three-level cache blocks and memory addresses are constructed. After a new task is received, the two tables are queried according to the new task and the candidate processor core queue to obtain the target processor core closest to the three-level cache blocks the new task will access, and the new task is assigned to that core. The processor core therefore does not need to access distant three-level cache blocks to complete the new task, which reduces the delay of each processor core's accesses to the three-level cache and improves processor performance.
Referring to fig. 8, in some embodiments, the apparatus further comprises:
The traversing module 201 is configured to respond to the scheduling interrupt instruction and perform a scheduling domain traversal based on each scheduling domain to obtain the processor to be balanced. The migration module 202 is configured to migrate a process in the processor to be balanced to the current processor.
This embodiment provides the device that realizes load balancing and processor process migration within the multi-core processor, avoiding process delays and performance degradation caused by an excessively high load on a single processor.
Referring to fig. 9, in some embodiments, the traversing module 201 includes:
the bottom layer traversing unit 21 is configured to traverse the current scheduling domain by using the scheduling domain at the bottom layer as the current scheduling domain.
The load processing unit 22 is configured to obtain a maximum load processor according to the load values of the scheduling groups in the current scheduling domain.
A judging unit 23, configured to judge whether the load imbalance of the current scheduling domain is less than the migration cost.
An equalizing unit 24, configured to take the parent scheduling domain of the current scheduling domain as the current scheduling domain and return to the load processing unit 22, or to take the maximum-load processor as the processor to be balanced.
The above embodiment provides a unit structure for judging and acquiring the to-be-balanced processor in the traversal module 201, so as to realize accurate judgment of the processor needing load balancing and process migration.
For the specific limitations of the multi-core processor scheduling device for optimizing three-level cache access delay provided in the present application, reference may be made to the embodiments of the multi-core processor scheduling method above; details are not repeated here. Each module of the device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in software in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
The implementation process of the multi-core processor scheduling method for optimizing the three-level cache access delay is shown by a specific example:
Multi-core CPU scheduling methods fall into two categories: SQMS (Single-Queue Scheduling) and MQMS (Multi-Queue Scheduling). SQMS schedules from a single global queue. Its advantages are simplicity and good load balance. It has two main disadvantages:
1. Poor scalability: locks are needed to ensure correctness, but as the number of processors grows, contention increases and the lock overhead directly hurts performance.
2. Poor cache affinity: because each CPU picks its next task directly from the global queue, the same task switches between different CPUs, and previously cached data cannot be reused.
The MQMS method was proposed to address the problems of SQMS. It has multiple distributed scheduling queues, typically one per processor core, each scheduled independently according to certain scheduling rules (e.g., round robin). When a new task enters the system, the system places it into a corresponding scheduling queue according to certain heuristic rules. If a task blocks, it is placed back in the same queue as much as possible when it wakes up, so as to reduce the overhead caused by a cold cache or excessive access delay (for example, accessing remote memory in NUMA).
MQMS has two main advantages: first, it avoids the lock overhead of sharing one queue; second, its cache affinity is better. Its main disadvantage is the load balancing problem. As shown in fig. 5, assuming CPU0 is allocated tasks A and C and CPU1 tasks B and D, if after a period of time task C finishes, the load of CPU0 will clearly be lower than that of CPU1. To solve this problem, processes must be "migrated" in time, moving a process from one CPU to another. For example, a CPU may occasionally peek at the other queues to see whether they are noticeably fuller and, if so, "steal" a task into its own queue. Both the O(1) scheduler and CFS (Completely Fair Scheduler) used in Linux are of the MQMS type.
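For illustration, the following C sketch shows such "peeking" migration between per-CPU queues; the queue layout, the fullness threshold, and the locking scheme are illustrative assumptions:

#include <pthread.h>

#define MAX_TASKS 64              /* illustrative queue capacity */

struct task;                      /* opaque task descriptor */

struct run_queue {
    pthread_mutex_t lock;
    int nr_tasks;
    struct task *tasks[MAX_TASKS];
};

/* If the other queue is noticeably fuller, steal one task into our own.
 * The two locks are never held together, so no lock-ordering issue arises;
 * self->nr_tasks is read unlocked, a stale value only skews the heuristic. */
static void try_steal(struct run_queue *self, struct run_queue *other)
{
    struct task *t = NULL;

    pthread_mutex_lock(&other->lock);
    if (other->nr_tasks > self->nr_tasks + 1)   /* fullness threshold */
        t = other->tasks[--other->nr_tasks];
    pthread_mutex_unlock(&other->lock);

    if (t) {
        pthread_mutex_lock(&self->lock);
        self->tasks[self->nr_tasks++] = t;      /* enqueue stolen task */
        pthread_mutex_unlock(&self->lock);
    }
}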
For MQMS, both the heuristic of selecting processor core queues and the load balancing method directly affect the scheduling performance when new tasks enter the system. Common heuristic rules for selecting processor core queues when a new task enters the system include:
1. and (5) randomizing.
2. Find a queue with fewer current tasks.
3.Asymmetric multiprocessor (asymmetric multiprocessor) considers Core sensitivity, i.e. large cores in a large and small Core structure tend to serve low latency or relatively high IPC requirements of a workload, while other workload may not necessarily bring about performance improvement if dispatched to a large Core, and may cause increased power consumption. Sensitivity is the effect of a thread on performance when changing from a large core to a small core.
And selecting the corresponding CPU to be stored locally as far as possible in NUMA.
The load balancing method controls task migration through scheduling domains and scheduling groups, where the scheduling group is the basic unit of load balancing and migration can only occur between scheduling groups within the same scheduling domain. The following are the definitions of the scheduling domains and scheduling groups in Linux before improvement.
Scheduling domain    Domain composition       Scheduling group
SMT                  Core                     Thread
MC                   Cores sharing an L2      Core
DIE                  Node                     Cores sharing an L2
NUMA                 Cluster                  Node
When the scheduling interrupt occurs, the basic flow of load balancing by the CPU is as follows:
1. Traverse the scheduling domains from bottom to top starting from the current CPU, balancing load from the lowest level (SMT).
2. Find the busiest scheduling group in the current scheduling domain, then find the busiest CPU within that group.
3. If the current scheduling domain does not meet the migration condition (its load imbalance is less than the migration cost), move up one level: the current domain is treated as one scheduling group of its parent domain and compared with the parent's other groups.
4. If the current scheduling domain meets the migration condition, migrate a process from the busiest CPU to the current CPU.
When the number of processor cores in a multi-core processor chip is relatively large, the inter-core interconnect tends to use a mesh structure, such as ARM's CMN network. In this case, the L1 Cache and the L2 Cache are generally private to each processor core (in some architectures a few cores may share an L2 Cache), while the L3 Cache is distributed across the mesh interconnect and shared by all processor cores. Each distributed L3 Cache block caches memory data within a certain address range, such as the SLC in an ARM CMN network. Because the mesh interconnect covers a large area, the difference in latency for processor cores at different locations accessing the same L3 Cache block can be on the order of tens of processor cycles. If multi-core scheduling takes into account each processor core's delay in accessing each L3 Cache block, single-core processing performance can be improved. Based on this idea, the application improves both the heuristic rules for selecting a processor core when a new task enters the system and the load balancing method.
1. Heuristic for selecting processor cores when a new task enters the system.
In the MQMS method, when a new task enters the system, the system puts the task into a corresponding scheduling queue according to certain heuristic rules, and each scheduling queue schedules independently. If a task blocks, it is placed back in the same queue as much as possible when it wakes up, so as to reduce the extra overhead caused by a cold cache or excessive access delay. The heuristic rules before improvement mainly consider factors such as load balancing (e.g., selecting a CPU core queue with a relatively light load), core sensitivity (e.g., big/little cores), and access delay (e.g., in a NUMA system, preferentially selecting the CPU core corresponding to local memory). In this method, the influence of L3 Cache access delay is also considered: a heuristic policy is added that selects a relatively close processor core according to the distribution of the memory space the task needs to access. The specific process is as follows:
(1) Before the OS starts, a jump distance sorting table from each L3 Cache block to each processor core is generated according to the chip topology. For example, for the topology shown in fig. 2, the jump distance sorting tables for SLC21 and SLC22 are as follows:
SLC21 Distance
Core (processor core)    Distance
EC21    1
EC22    2
EC14    2
EC15    3
EC23    3

SLC22 Distance
Core    Distance
EC22    1
EC21    2
EC23    2
EC15    2
EC14    3
(2) Fill in the mapping relation configuration table of each L3 Cache block and the memory addresses in the mesh interconnect, for example the SAM (System Address Map) of the ARM CMN. Note that the address space corresponding to an SLC is not necessarily contiguous.
SLC (three-level cache block)    Address Range
SLC21    0x0 ~ 0xffffffff
SLC22    0x100000000 ~ 0x1ffffffff
SLC23    0x200000000 ~ 0x2ffffffff
(3) When a new task enters the system, factors such as load balancing and core sensitivity are still considered first, producing a candidate processor core queue. If there are multiple candidate processor cores, the corresponding L3 Cache block ID is looked up in the mapping relation configuration table according to the memory address range allocated to the new task; then, according to that L3 Cache block ID, the CPU core closest to the block among all candidates is found in the jump distance sorting table; finally, the new task is placed in the task queue of that CPU core.
If the memory address range of the current new task is found in the mapping relation configuration table to involve multiple L3 Cache blocks, then when calculating the distances from all candidate CPU cores to the L3 Cache, the distance from each core to every involved L3 Cache block is found in the jump distance sorting table, and the total distance is computed according to the weights of the address ranges.
For example, using the tables above: if 80% of the memory space accessed by the new task falls in SLC21 and 20% in SLC22, the weighted access delay of CPU core EC21 to the SLC21 and SLC22 blocks of the L3 Cache is 80% × 1 + 20% × 2 = 1.2, while that of CPU core EC22 is 80% × 2 + 20% × 1 = 1.8. EC21 therefore has the lower access delay.
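The weighted-delay calculation can be sketched in C as follows; the inputs are hard-coded to reproduce the worked example, and the names are illustrative only.

```c
#include <stdio.h>

#define NCAND 2 /* candidate cores */
#define NSLC  2 /* L3 Cache blocks touched by the task */

int main(void) {
    const char  *cand[NCAND]       = { "EC21", "EC22" };
    const double dist[NCAND][NSLC] = { { 1, 2 },   /* EC21 -> SLC21, SLC22 */
                                       { 2, 1 } }; /* EC22 -> SLC21, SLC22 */
    const double weight[NSLC]      = { 0.8, 0.2 }; /* share of task's memory */

    int best = 0;
    double best_delay = 0;
    for (int c = 0; c < NCAND; c++) {
        double delay = 0;
        for (int s = 0; s < NSLC; s++)
            delay += weight[s] * dist[c][s];  /* sum of weighted sub-delays */
        printf("%s: %.1f\n", cand[c], delay); /* EC21: 1.2, EC22: 1.8 */
        if (c == 0 || delay < best_delay) { best = c; best_delay = delay; }
    }
    printf("target core: %s\n", cand[best]);  /* -> EC21 */
    return 0;
}
```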
2. Load balancing method
Considering the influence of L3 Cache access latency, the method adds a level, ML3C, to the hierarchy of scheduling domains and scheduling groups. A scheduling domain at this level contains the cores that have the same distance to each L3 Cache block; its scheduling groups are the scheduling domains of the level below.
Scheduling domain    Domain scope    Scheduling group
SMT    Core    Hardware thread
MC    Cores sharing an L2    Core
ML3C    Cores with the same distance to each L3 Cache block    Cores sharing an L2
DIE    Node    Cores with the same L3 Cache distance
NUMA    Cluster    Node
Taking the above topology as an example, and assuming that every processor core has a private L1 and L2 and that the L3 Cache block accessed by the current task is SLC21, the relevant scheduling domains and scheduling groups are shown in the following table (only the processor cores listed in the SLC21 Distance table are considered here):
During load balancing, the L3 Cache blocks involved in the current task are taken into account, and the task is migrated, as far as possible, to a core in the same ML3C scheduling domain. Note that because different ML3C scheduling domains may contain different numbers of cores, the comparison used to find the busiest scheduling group within a DIE domain is normalized by the number of cores.
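A minimal sketch of this per-core normalization is shown below. The structures are simplified assumptions for illustration and do not correspond to the actual scheduler data structures of any particular kernel.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified sketch: pick the busiest scheduling group inside a DIE domain.
 * ML3C groups may contain different numbers of cores, so groups are compared
 * by average load per core rather than by raw total load. */
typedef struct {
    const char   *name;       /* e.g. the cores at one distance from SLC21 */
    unsigned long total_load; /* sum of runqueue loads of the group's cores */
    unsigned long nr_cores;   /* number of cores in this ML3C group */
} sched_group_t;

static const sched_group_t *busiest_group(const sched_group_t *g, size_t n) {
    const sched_group_t *busiest = NULL;
    unsigned long num = 0, den = 1; /* best load/cores ratio seen so far */
    for (size_t i = 0; i < n; i++) {
        if (g[i].nr_cores == 0)
            continue;
        /* load_i/cores_i > num/den  <=>  load_i*den > num*cores_i
         * (integer cross-multiplication; loads assumed small enough) */
        if (!busiest || g[i].total_load * den > num * g[i].nr_cores) {
            busiest = &g[i];
            num = g[i].total_load;
            den = g[i].nr_cores;
        }
    }
    return busiest; /* candidate source group for task migration */
}

int main(void) {
    const sched_group_t groups[] = {
        { "dist-1 cores", 300, 1 }, /* average load 300 per core */
        { "dist-2 cores", 500, 2 }, /* average load 250 per core */
        { "dist-3 cores", 450, 2 }, /* average load 225 per core */
    };
    printf("busiest: %s\n",
           busiest_group(groups, sizeof groups / sizeof groups[0])->name);
    return 0;                       /* -> dist-1 cores */
}
```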
The present application can optimize the latency with which each processor core accesses the L3 Cache, thereby improving the performance of individual processor cores. According to an earlier evaluation based on a Gem5 Power single-core simulator, when the L3 Cache access latency is reduced from 72 processor clock cycles (the average L3 Cache access latency computed from CMN network delays) to 24 processor clock cycles (the latency, computed from CMN network delays, of a processor core accessing its nearest L3 Cache block), the performance of a single processor core on SPEC17 can increase by nearly 10%.
Embodiments of the present application provide a computer device, which may include a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and the computer programs in the non-volatile storage medium. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program causes the processor to perform the steps of the multi-core processor scheduling method for optimizing three-level cache access latency in any of the embodiments described above. For the working process, working details and technical effects of the computer device provided in this embodiment, reference may be made to the foregoing method embodiments; they are not repeated here.
The present application also provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the multi-core processor scheduling method for optimizing three-level cache access latency in any of the embodiments described above. A computer readable storage medium is a carrier for storing data and may include, but is not limited to, a floppy disk, an optical disc, a hard disk, a flash memory, and/or a memory stick; the computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device.
For the working process, working details and technical effects of the computer readable storage medium provided in this embodiment, reference may be made to the foregoing method embodiments; they are not repeated here.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer readable storage medium which, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination contains no contradiction, it should be considered within the scope of this description. The above examples represent only a few embodiments of the present application; although described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (18)

1. A method for scheduling a multi-core processor for optimizing three-level cache access delay, the method comprising:
constructing a jump distance sorting table of each three-level cache block and each processor core;
constructing a mapping relation configuration table of each three-level cache block and a memory address;
acquiring a new task and a candidate processor core queue;
inquiring: inquiring the mapping relation configuration table according to the access address of the new task to obtain a cache block to be accessed; inquiring the jump distance sorting table according to the to-be-accessed cache block to obtain the processor core closest to the to-be-accessed cache block in the candidate processor core queue, and taking the processor core as a target processor core;
the access address of the new task being the memory address range allocated to the new task; and
allocating the new task to the target processor core.
2. The method of claim 1, wherein said constructing a jump distance sorting table of each three-level cache block and each processor core comprises:
constructing the jump distance sorting table of each three-level cache block and each processor core according to a chip topological structure.
3. The method of claim 2, wherein said constructing the jump distance sorting table of each three-level cache block and each processor core according to a chip topological structure comprises:
acquiring the positions of each processor core and each three-level cache block in the chip topological structure;
distance acquisition: obtaining the jump distance from the position of the processor core to each three-level cache block based on a routing algorithm;
sequencing: sequencing the jump distances to obtain a jump distance matrix of the processor core;
executing the distance acquisition step and the sorting step on each processor core to obtain a plurality of jump distance matrixes;
and obtaining the jump distance sorting table according to a plurality of jump distance matrixes.
4. The method of claim 1, wherein the candidate processor core queue is obtained by screening the processor cores according to heuristic rules.
5. The method of claim 4, wherein the heuristic rules comprise a load balancing policy, a core sensitivity policy, and an access latency policy.
6. The method according to claim 4, wherein the method further comprises:
after the candidate processor core queue is obtained, detecting whether the number of the processor cores in the candidate processor core queue is larger than a first preset value; if yes, executing the query step.
7. The method according to claim 1, wherein the method further comprises:
after the to-be-accessed cache blocks are obtained, detecting whether the number of the to-be-accessed cache blocks is larger than a second preset value or not;
if yes, the target processor core is obtained according to the access address of the new task, the cache block to be accessed, the jump distance sorting table and the mapping relation configuration table.
8. The method of claim 7, wherein the obtaining the target processor core according to the access address of the new task, the cache block to be accessed, the jump distance sorting table, and the mapping relation configuration table comprises:
Obtaining the address weight of each cache block to be accessed according to the access address of the new task and the mapping relation configuration table;
calculating access delay of each processor core in the candidate processor core queue according to the jump distance sorting table and the address weight of each to-be-accessed cache block;
and taking the processor core with the lowest access delay as the target processor core.
9. The method of claim 8, wherein said calculating access delays for each of said processor cores in said candidate processor core queue based on said jump distance sorting table and said address weights for each of said cache blocks to be accessed comprises:
obtaining the distance from the processor core to each cache block to be accessed according to the jump distance sorting table;
multiplying the distance from the processor core to the to-be-accessed cache block by the address weight of the to-be-accessed cache block to obtain the sub-delay from the processor core to the to-be-accessed cache block;
and adding the sub-delays from the processor core to each cache block to be accessed to obtain the access delay of the processor core.
10. The method according to claim 1, wherein the method further comprises:
responding to a scheduling interrupt instruction, performing scheduling domain traversal based on each scheduling domain to obtain a processor to be balanced;
and migrating the process in the processor to be balanced to the current processor.
11. The method of claim 10, wherein performing the scheduling domain traversal based on each scheduling domain results in the to-be-equalized processor, comprising:
taking the bottommost scheduling domain as a current scheduling domain, and traversing the current scheduling domain;
maximum load acquisition step: obtaining a maximum load processor according to the load value of each scheduling group of the current scheduling domain;
judging whether the load unbalance degree of the current dispatching domain is smaller than migration cost or not;
if yes, taking the parent scheduling domain of the current scheduling domain as the current scheduling domain, and returning to the maximum load obtaining step; otherwise, the maximum load processor is used as the processor to be balanced.
12. The method of claim 11, wherein the scheduling domain comprises a plurality of three-level scheduling domains located at the same level; the tertiary scheduling domain includes a plurality of processor cores that access the same tertiary cache block at the same distance.
13. The method of claim 12, wherein the deriving the maximum load processor from the load values of each scheduling group of the current scheduling domain comprises:
If the scheduling group in the current scheduling domain is the three-level scheduling domain, calculating a load value of the three-level scheduling domain according to the number of the processor cores in the three-level scheduling domain and the load of each processor core.
14. A multi-core processor scheduling apparatus that optimizes three-level cache access latency, the apparatus comprising:
the jump distance building module (101) is used for building a jump distance sorting table of each three-level cache block and each processor core;
the mapping relation construction module (102) is used for constructing a mapping relation configuration table of each three-level cache block and the memory address;
an acquisition module (103) for acquiring a new task and a candidate processor core queue;
the query module (104) is used for querying the mapping relation configuration table according to the access address of the new task to obtain a cache block to be accessed; inquiring the jump distance sorting table according to the to-be-accessed cache block to obtain the processor core closest to the to-be-accessed cache block in the candidate processor core queue, and taking the processor core as a target processor core; the access address of the new task is a memory address range allocated to the new task;
-an allocation module (105) for allocating said new task to said target processor core.
15. The apparatus of claim 14, wherein the apparatus further comprises:
the traversing module (201) is used for responding to the scheduling interrupt instruction, and performing scheduling domain traversing based on each scheduling domain to obtain a processor to be balanced;
and the migration module (202) is used for migrating the process in the processor to be balanced to the current processor.
16. The apparatus of claim 15, wherein the traversal module (201) comprises:
a bottom layer traversing unit (21) for taking the scheduling domain at the bottommost layer as the current scheduling domain and traversing the current scheduling domain;
a load processing unit (22) for obtaining a maximum load processor according to the load value of each scheduling group of the current scheduling domain;
a judging unit (23) for judging whether the load unbalance of the current scheduling domain is smaller than the migration cost;
an equalizing unit (24) configured to take a parent scheduling domain of the current scheduling domain as the current scheduling domain; and returns to the load handling unit (22); or the maximum load processor is used as the processor to be balanced.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 13 when the computer program is executed.
18. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
CN202310878869.5A 2023-07-17 2023-07-17 Multi-core processor scheduling method and device for optimizing three-level cache access delay Active CN116719643B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310878869.5A CN116719643B (en) 2023-07-17 2023-07-17 Multi-core processor scheduling method and device for optimizing three-level cache access delay

Publications (2)

Publication Number Publication Date
CN116719643A CN116719643A (en) 2023-09-08
CN116719643B true CN116719643B (en) 2024-04-05

Family

ID=87869888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310878869.5A Active CN116719643B (en) 2023-07-17 2023-07-17 Multi-core processor scheduling method and device for optimizing three-level cache access delay

Country Status (1)

Country Link
CN (1) CN116719643B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729261A (en) * 2017-09-28 2018-02-23 中国人民解放军国防科技大学 Cache address mapping method in multi-core/many-core processor
CN108885579A (en) * 2015-12-30 2018-11-23 华为技术有限公司 For tracking the method and apparatus for carrying out data mining according to core
CN109857562A (en) * 2019-02-13 2019-06-07 北京理工大学 A kind of method of memory access distance optimization on many-core processor
CN113434440A (en) * 2021-08-27 2021-09-24 广东省新一代通信与网络创新研究院 Method and system for reducing memory access delay

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8443376B2 (en) * 2010-06-01 2013-05-14 Microsoft Corporation Hypervisor scheduler


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40095784)
GR01 Patent grant