CN115495246A - Hybrid remote memory scheduling method under separated memory architecture - Google Patents


Info

Publication number
CN115495246A
Authority
CN
China
Prior art keywords
memory
task
node
far
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211212624.0A
Other languages
Chinese (zh)
Other versions
CN115495246B (en)
Inventor
李超 (Chao Li)
王靖 (Jing Wang)
贺昊 (Hao He)
梅君夷 (Junyi Mei)
汪陶磊 (Taolei Wang)
过敏意 (Minyi Guo)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202211212624.0A priority Critical patent/CN115495246B/en
Publication of CN115495246A publication Critical patent/CN115495246A/en
Application granted granted Critical
Publication of CN115495246B publication Critical patent/CN115495246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, the resource being the memory
    • G06F 9/5027: Allocation of resources, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/505: Allocation of resources considering the load
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A hybrid far-memory scheduling method under a separated memory architecture. Runtime data are first collected by limiting each application's local-memory usage, and tasks are accordingly divided into far-memory-insensitive tasks, far-memory-sensitive tasks, and far-memory-forbidden tasks. Memory-insensitive and memory-sensitive tasks are allocated to the same compute node under a complementary-sensitivity principle, and each task yields as much memory as it can under a common performance constraint. When the overall yieldable memory values of the corresponding servers differ substantially, cross-node memory resource adjustment is performed to determine each server's yielded memory value or rented far-memory value; intra-node memory resource adjustment then allocates resources to each task according to the server's current remaining memory and the principle that sensitive tasks receive more additional local memory, thereby realizing hybrid far-memory scheduling. The invention fully exploits application characteristics in a far-memory environment and improves data-center memory utilization and memory-use efficiency through an efficient far-memory allocation strategy.

Description

Hybrid remote memory scheduling method under separated memory architecture
Technical Field
The invention relates to a technology in the field of distributed computing, and in particular to a method and system for hybrid remote memory scheduling under a separated memory architecture.
Background
Inefficient use of memory resources in data centers is one of the main reasons many complex computing applications hit performance bottlenecks. Remote memory access enables sharing of memory and storage resources and thus better resource utilization. Existing techniques use high-performance storage such as SSDs as vertical far memory for data swapping, and use high-speed networks such as RDMA to realize horizontal far-memory reads and writes. However, existing servers and job schedulers neither consider nor solve application resource allocation under a combined horizontal-and-vertical hybrid far-memory architecture: they cannot allocate far-memory resources according to application characteristics, cannot capture task dynamics to readjust the deployment of memory resources for load balancing, and cannot achieve high-performance resource sharing under limited memory resources.
Disclosure of Invention
Aiming at the shortcomings of existing memory scheduling techniques, namely ignoring tasks' far-memory sensitivity, failing to capture task dynamics so as to readjust the deployment of memory resources, insufficient throughput for tasks that use far memory, and poor overall resource-use efficiency in scheduling, which together cause cluster load imbalance, the invention provides a hybrid remote memory scheduling method under a separated memory architecture.
The invention is realized by the following technical scheme:
the invention relates to a mixed remote memory scheduling method under a separated memory architecture, which comprises the steps of firstly collecting data during operation in a mode of limiting the use of a local memory, thereby dividing tasks into a remote memory insensitive task, a remote memory sensitive task and a remote memory forbidden task; allocating memory insensitive tasks and memory sensitive tasks to the same computing node according to a sensitivity degree complementation principle, yielding memory to the maximum extent under the condition of the same performance limit by the tasks, performing cross-node memory resource adjustment when the integral yielding memory value difference between corresponding servers is large, determining yielding memory value or rented far memory value of the server, then performing internal memory resource adjustment of the nodes, and performing resource allocation for each task according to the current residual memory resource of the server and the principle of more additional local memory resources of the sensitive tasks, thereby realizing mixed far memory scheduling.
The separated (disaggregated) memory architecture refers to flexibly combining the CPUs and memories of multiple data-center servers over a network connection: servers whose function is computation serve as compute nodes, and servers whose function is memory access serve as memory nodes. A task's main program runs on a compute node while accessing a memory node's memory as far memory. A server may act as a compute node while also contributing memory, or act as a memory node providing far-memory access.
In the far-memory architecture, each server is equipped with an RDMA network card, and the network cards are connected by copper cable. Each server uses its CPU as the computing core and its DRAM as the memory unit, with the RDMA card attached to the server's motherboard via PCIe; each server's CPU uses its local memory directly and accesses remote memory through the RDMA card without occupying the remote CPU's resources.
The hybrid remote memory comprises horizontal far memory and vertical far memory, wherein: the horizontal far memory comprises remote DRAM space accessed over an RDMA network card connection, and the vertical far memory comprises storage space of the same server, such as disk and SSD devices, accessed through the Linux swap mechanism and the I/O interface. An application's total memory space Mem is the sum of three parts: the local memory space Mem_lm, the horizontal far-memory space Mem_hfm, and the vertical far-memory space Mem_vfm, i.e. Mem = Mem_lm + Mem_hfm + Mem_vfm.
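The three-part memory space above can be sketched as a small data type. This is a minimal illustration of the relation Mem = Mem_lm + Mem_hfm + Mem_vfm; the class and field names are illustrative, not from the patent's implementation.

```python
# Illustrative sketch: a task's memory space is the sum of local memory,
# horizontal far memory (remote DRAM over RDMA), and vertical far memory
# (local SSD/disk reached via swap). Names are assumptions.
from dataclasses import dataclass

@dataclass
class TaskMemory:
    lm: int   # local DRAM (MB)
    hfm: int  # horizontal far memory: remote DRAM over RDMA (MB)
    vfm: int  # vertical far memory: SSD/disk via Linux swap (MB)

    @property
    def total(self) -> int:
        # Mem = Mem_lm + Mem_hfm + Mem_vfm
        return self.lm + self.hfm + self.vfm

m = TaskMemory(lm=2048, hfm=1024, vfm=512)
print(m.total)  # 3584
```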
Collecting runtime data by limiting the application's local-memory usage means: activating vertical far-memory access by capping the application's local-memory ratio, and recording each local-memory ratio together with the corresponding overall application performance.
Allocating memory-insensitive and memory-sensitive tasks to the same compute node means computing, from each task i's minimum local memory Min_lm^i, its allowable (yieldable) memory value SI_i, and each server j's id and remaining memory capacity C_j, the server node id on which each task should be placed, specifically:
i) when the total remaining resource Res = Σ_j C_j exceeds the minimum resource allocation of the current task group, Min_Allo = Σ_i Min_lm^i, i.e. Res > Min_Allo, perform step ii);
ii) sort all tasks in the current task group by the difference between each task's maximum allowable memory SI_i^max and the in-group mean SI_avg, i.e. by (SI_i^max − SI_avg); allocate using a knapsack algorithm over each task's minimum local memory Min_lm^i, and preferentially place tasks whose (SI_i^max − SI_avg) values are close in magnitude but opposite in sign on the same server;
iii) compute each server's predicted remaining capacity C_j in real time; when C_j can no longer accommodate the next task's minimum local memory Min_lm^i, i.e. C_j − Min_lm^i < 0, the current server is considered full and placement continues on the next server;
iv) return to step i) until every task has been traversed.
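The placement steps above can be sketched as follows. This is a hedged simplification: ordering tasks by how far their allowable memory SI_i deviates from the group mean interleaves strongly sensitive and strongly insensitive tasks (the complementary-sensitivity principle), and a plain first-fit by minimum local memory stands in for the knapsack allocation the text mentions. All names and the exact pairing heuristic are assumptions.

```python
# Hedged sketch of steps i)-iv): feasibility check, sensitivity-complementary
# ordering, first-fit placement, and the "server full" test.
def place_tasks(tasks, capacities):
    """tasks: list of (task_id, min_lm, si); capacities: list of C_j.
    Returns {task_id: server_index}, or None if Res < Min_Allo."""
    res = sum(capacities)
    min_allo = sum(t[1] for t in tasks)
    if res < min_allo:                       # step i): Res must exceed Min_Allo
        return None
    si_avg = sum(t[2] for t in tasks) / len(tasks)
    # step ii): largest |SI_i - SI_avg| first, so tasks with deviations of
    # similar magnitude but opposite sign end up adjacent and co-located.
    ordered = sorted(tasks, key=lambda t: abs(t[2] - si_avg), reverse=True)
    remaining = list(capacities)
    placement, server = {}, 0
    for tid, min_lm, _si in ordered:
        while server < len(remaining) and remaining[server] < min_lm:
            server += 1                      # step iii): server full, move on
        if server == len(remaining):
            return None
        placement[tid] = server
        remaining[server] -= min_lm
    return placement

tasks = [("a", 4, 10.0), ("b", 4, -10.0), ("c", 2, 1.0), ("d", 2, -1.0)]
print(place_tasks(tasks, [8, 8]))
```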
Preferably, the compute node proportionally adjusts each task's local-memory, horizontal far-memory, and vertical far-memory usage according to the task's maximum allowable memory.
The cross-node memory resource adjustment specifically comprises:
i) computing each server's overall allowable memory value ServerSI_j, i.e. the difference between the server's remaining memory capacity before task allocation and the minimum local memory of the tasks allocated to it: ServerSI_j = C_j − Min_Allo;
ii) computing the mean ServerSI_avg of all servers' ServerSI_j; when ServerSI_j − ServerSI_avg > 0, server j yields ServerSI_j − ServerSI_avg of memory capacity; otherwise it borrows |ServerSI_j − ServerSI_avg| of memory capacity.
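The two cross-node steps can be sketched directly. The helper below computes ServerSI_j = C_j − Min_Allo per server, then each server's deviation from the mean: positive deviations are memory to yield, negative ones memory to borrow. Function and parameter names are illustrative.

```python
# Hedged sketch of the cross-node adjustment: deviation of each server's
# overall allowable memory from the cluster mean decides yield vs. borrow.
def cross_node_adjust(c, min_allo):
    """c[j]: remaining capacity of server j before placement;
    min_allo[j]: minimum local memory of tasks placed on server j.
    Returns deltas per server: > 0 means yield, < 0 means borrow |delta|."""
    server_si = [cj - mj for cj, mj in zip(c, min_allo)]   # ServerSI_j
    si_avg = sum(server_si) / len(server_si)               # ServerSI_avg
    return [s - si_avg for s in server_si]

deltas = cross_node_adjust(c=[100, 60, 80], min_allo=[20, 20, 30])
print(deltas)
```

By construction the deltas sum to zero, so the total memory yielded across the cluster matches the total borrowed.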
The intra-node memory resource adjustment adjusts the sizes of each task's local, horizontal far, and vertical far memory within a server node, so that sensitive tasks obtain a relatively larger share of the allowable memory value SI, specifically:
i) collect the minimum local memory value Min_lm^i and allowable memory value SI_i of each task on the current server, and compute the server's maximum allocatable memory resource ServerSI = min(ServerSI_j, ServerSI_avg); servers that yield memory only need to provide the total yielded memory, which serves as horizontal far memory for tasks on other servers, with each task's far-memory usage computed in the following steps;
ii) compute the additional local memory Δlm_i each task can receive, each task's share being in inverse proportion to its own allowable memory value SI_i, i.e. Δlm_i ∝ 1/SI_i; the task's local memory space is then Mem_lm = Min_lm + Δlm;
iii) compute each task's horizontal far-memory space Mem_hfm^ii, where the index ii ranges only over the far-memory-sensitive tasks in the server, subject to ServerSI_j − ServerSI_avg > 0 on the lending side;
iv) compute each task's vertical far-memory value Mem_vfm = Mem − Mem_lm − Mem_hfm;
v) return each node's final memory allocation.
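The intra-node steps can be sketched as below, under explicit assumptions: the server's slack ServerSI is split as extra local memory in inverse proportion to each task's SI_i (so the most sensitive task, with the smallest SI, gets the most extra local DRAM), the rented horizontal far memory is handed to the most sensitive task, and whatever remains of a task's demand falls through to vertical far memory. The exact proportions and the single-recipient simplification are assumptions, not the patent's formula.

```python
# Hedged sketch of intra-node allocation steps i)-v).
def intra_node_allocate(tasks, server_si, borrowed_hfm):
    """tasks: list of (total_mem, min_lm, si); returns (lm, hfm, vfm) per task.
    borrowed_hfm: horizontal far memory rented cross-node, given here to the
    task with the smallest SI (most far-memory-sensitive) as a simplification."""
    inv = [1.0 / si for _, _, si in tasks]         # dlm_i proportional to 1/SI_i
    inv_sum = sum(inv)
    sensitive = min(range(len(tasks)), key=lambda k: tasks[k][2])
    out = []
    for k, (mem, min_lm, _si) in enumerate(tasks):
        dlm = server_si * inv[k] / inv_sum
        lm = min_lm + dlm                          # Mem_lm = Min_lm + dlm
        hfm = borrowed_hfm if k == sensitive else 0.0
        vfm = max(mem - lm - hfm, 0.0)             # Mem_vfm = Mem - Mem_lm - Mem_hfm
        out.append((lm, hfm, vfm))
    return out

alloc = intra_node_allocate([(32, 8, 2.0), (32, 8, 8.0)], server_si=10, borrowed_hfm=6)
print(alloc)
```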
Preferably, when a compute node's far-memory sensitivity changes, or when one task ends while other tasks continue, the compute node performs cross-node memory resource adjustment and intra-node resource adjustment again, releases the related resources after the run ends, and recomputes the allocation of tasks in the next task queue.
A change in a compute node's far-memory sensitivity is judged by periodically sampling the difference in the page-fault count. When the page-fault difference is positive and exceeds the average page-fault count more than three consecutive times, the task is considered to have changed from far-memory-insensitive to far-memory-sensitive; when the page-fault difference is negative and its magnitude exceeds the average page-fault count more than three consecutive times, the task is considered to have changed from far-memory-sensitive to far-memory-insensitive.
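The periodic check above can be sketched as a small detector. The three-sample streak follows the text; the windowing and the use of a fixed running mean are illustrative assumptions.

```python
# Hedged sketch: flag a sensitivity transition after three consecutive
# page-fault deltas whose magnitude exceeds the average page-fault count.
def detect_transition(pf_deltas, mean_pf, streak=3):
    """Return 'to_sensitive', 'to_insensitive', or None."""
    up = down = 0
    for d in pf_deltas:
        up = up + 1 if (d > 0 and abs(d) > mean_pf) else 0
        down = down + 1 if (d < 0 and abs(d) > mean_pf) else 0
        if up >= streak:
            return "to_sensitive"      # insensitive -> sensitive
        if down >= streak:
            return "to_insensitive"    # sensitive -> insensitive
    return None

print(detect_transition([120, 150, 130], mean_pf=100))
```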
Performing the cross-node and intra-node adjustment again means: when a task changes from far-memory-sensitive to insensitive, or when one task completes while others continue, the change ΔSI in the server's current yieldable memory value SI is obtained first; when the change exceeds a threshold, the cross-node adjustment module is invoked to yield part of the memory for use as far memory, after which the intra-node adjustment module allocates an appropriate proportion of memory to each running task. Conversely, when a task changes from far-memory-insensitive to sensitive, ΔSI is likewise obtained first; when it exceeds the threshold, the cross-node adjustment module borrows part of another server's memory as far-memory access for local tasks, and the intra-node adjustment module then allocates an appropriate proportion of far memory to each running task.
The resource allocation for each task comprises allocating the sizes of the local memory resource, the horizontal far-memory resource, and the vertical far memory.
The invention further relates to a system implementing the above method, comprising: a far-memory-based staged application sensitivity analysis unit, a task grouping unit based on sensitivity and task characteristics, a load-balance-based compute node selection unit, a cross-node memory resource adjustment unit, and an intra-node memory resource adjustment unit, wherein: the application sensitivity analysis unit collects online runtime data by limiting the application's local-memory usage and computes sensitivity-related application parameters; the task grouping unit performs far-memory sensitivity analysis on the collected runtime data and parameters and divides tasks into far-memory-insensitive, far-memory-sensitive, and far-memory-forbidden tasks; the compute node selection unit assigns memory-insensitive and memory-sensitive tasks to the same compute node under the complementary-sensitivity principle, using the task sensitivity information from the analysis and grouping units; the cross-node memory resource adjustment unit computes each server's overall allowable memory value from the tasks' maximum allowable memory under a common performance constraint and, when the values differ substantially between servers, adjusts memory resources across nodes and determines each server's yielded memory value or rented far-memory value; and the intra-node memory resource adjustment unit allocates local memory to each task according to the server's current remaining memory and the principle that sensitive tasks receive more additional local memory, determines the horizontal far-memory size by combining the cross-node unit's result, and computes the vertical far-memory size, finally realizing efficient hybrid far-memory scheduling.
Technical effects
By limiting local-memory usage, the method collects online runtime data of the application and computes sensitivity-related parameters; combined with task grouping by sensitivity and task characteristics, it performs far-memory sensitivity analysis and divides tasks into far-memory-insensitive, far-memory-sensitive, and far-memory-forbidden tasks; load-balance-based compute node selection then assigns memory-insensitive and memory-sensitive tasks to the same compute node under the complementary-sensitivity principle, using the collected task sensitivity information. Cross-node memory resource adjustment determines each server's yielded memory value or rented far-memory value from the tasks' maximum yieldable memory under a common performance constraint and each server's overall yieldable memory value; intra-node memory resource adjustment then allocates local memory to each task according to the server's current remaining memory and the principle that sensitive tasks receive more additional local memory, and determines the horizontal and vertical far-memory sizes.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a module framework embodying the present invention;
FIG. 3 is a flow chart of cross-node and intra-node resource readjustment;
FIG. 4 is a diagram illustrating the optimization results of the embodiment;
in the figure: (a) memory utilization optimization; (b) overall performance optimization; (c) overall memory-use efficiency optimization;
FIG. 5 is a diagram illustrating the sensitivity and effect of different tasks in one embodiment;
in the figure: (a) memory use efficiency of the S-Trace; (b) memory use efficiency of M-Trace; and (c) the memory use efficiency of the L-Trace.
Detailed Description
In this embodiment, several real application frameworks are taken as examples, including general computation, graph computation, video processing, AI training, AI inference, image recognition, and video recognition tasks. RDMA serves as the far-memory medium when collecting runtime data, and the system environment is: two servers, each with two 20-core Intel(R) Xeon(R) Gold 6148 CPUs, 256 GB of memory, two 1 TB hard disks, and a dual-channel Mellanox ConnectX-5 RDMA network card. One server serves as the compute node and the other as the far-memory access node (remote node). For simulated scheduling, a Python program simulates three scenarios (10 nodes with 200 tasks, 20 nodes with 500 tasks, and 50 nodes with 2000 tasks) to give and compare optimization results, and task mixes with far-memory-sensitive task proportions of 10%, 30%, 50%, 70%, and 90% demonstrate the method's memory-use efficiency optimization.
As shown in FIG. 1, the hybrid far-memory scheduling method under a separated memory architecture according to this embodiment comprises:
i) First, collect runtime data by limiting the application's local-memory usage, so as to collect and analyze application characteristics stage by stage in a far-memory environment, specifically: activate the vertical far memory by capping the application's local-memory ratio, and record:
(1) the local-memory usage ratio and the corresponding application run time;
(2) the task performance under each setting compared with the performance without far memory; record the maximum used memory capacity Mem_max, local-memory ratio L, and local-memory value Mem_lm at which the performance ratio does not exceed the SLO (default value 1.2); compute the maximum memory offload ratio R = 1 − L and the maximum allowable memory value SI_max = Mem_max − Mem_lm; each task's current allowable memory value SI equals its allocated memory capacity Mem minus its local-memory value, i.e. SI = Mem − Mem_lm;
(3) task stage-division points according to sensitivity behavior, determined from the page-fault count PF recorded at fixed time interval Interval: a time point T at which the page-fault difference ΔPF exceeds mean(PF(T)) three consecutive times is selected as a stage-division point.
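The profiling computation in step i) can be sketched as follows: from the recorded (local-memory ratio L, slowdown) samples, keep the smallest L whose slowdown stays within the SLO, then derive R = 1 − L and SI_max = Mem_max − Mem_lm. The sample data and helper name are illustrative; the code assumes at least one sample meets the SLO.

```python
# Hedged sketch of deriving R and SI_max from profiling samples.
SLO = 1.2  # default performance-ratio limit from the text

def profile(samples, mem_max):
    """samples: list of (L, slowdown), L being the local-memory ratio.
    Returns (R, mem_lm, si_max) for the most aggressive feasible offload."""
    feasible = [(l, s) for l, s in samples if s <= SLO]
    l_best = min(l for l, _ in feasible)   # smallest local ratio within SLO
    mem_lm = l_best * mem_max              # Mem_lm = L * Mem_max
    return 1 - l_best, mem_lm, mem_max - mem_lm  # R, Mem_lm, SI_max

r, mem_lm, si_max = profile([(1.0, 1.0), (0.6, 1.1), (0.4, 1.18), (0.2, 1.5)],
                            mem_max=100)
print(r, mem_lm, si_max)
```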
ii) Group the tasks by sensitivity and task characteristics according to the collected and analyzed application data, dividing them into far-memory-insensitive, far-memory-sensitive, and far-memory-forbidden tasks, specifically: compute each task's allowable memory value SI and the mean SI_avg over the task group to be allocated; when SI > SI_avg: R < 0.2 is far-memory-forbidden, 0.2 < R < 0.5 is far-memory-sensitive, and R > 0.5 is far-memory-insensitive; when SI < SI_avg: R < 0.5 is far-memory-forbidden, 0.5 < R < 0.8 is far-memory-sensitive, and R > 0.8 is far-memory-insensitive.
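The grouping rule in step ii) maps directly to a small classifier. The thresholds are taken from the text; how exact boundary values (e.g. R = 0.5) are handled is an assumption, since the text leaves the comparisons strict on both sides.

```python
# Hedged sketch of the task-grouping thresholds on the offload ratio R,
# which depend on whether SI is above or below the group mean SI_avg.
def classify(si, si_avg, r):
    if si > si_avg:
        if r < 0.2:
            return "forbidden"      # far-memory-forbidden
        if r < 0.5:
            return "sensitive"      # far-memory-sensitive
        return "insensitive"        # far-memory-insensitive
    if r < 0.5:
        return "forbidden"
    if r < 0.8:
        return "sensitive"
    return "insensitive"

print(classify(si=12, si_avg=10, r=0.3))  # above-mean SI tolerates more offload
print(classify(si=8, si_avg=10, r=0.3))
```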
iii) Select a compute node for each task, specifically: given a group of task_num tasks with local memory values Min_lm^i and allowable memory values SI_i, and a group of server_num servers with ids and remaining memory capacities C_j, where the current task group's minimum allocation Min_Allo = Σ_i Min_lm^i is less than the total remaining resource Res = Σ_j C_j: sort all tasks in the current task group by the difference between each task's maximum allowable memory SI_i^max and the in-group mean SI_avg, i.e. by (SI_i^max − SI_avg); allocate using a knapsack algorithm over each task's minimum local memory Min_lm^i, preferentially placing tasks whose (SI_i^max − SI_avg) values are close in magnitude but opposite in sign on the same server; compute each server's remaining capacity C_j in real time, and when C_j can no longer accommodate the next task's minimum local memory Min_lm^i, consider the current server full and begin placing tasks on the next server; traverse every task and output the server id corresponding to each task.
iv) Perform cross-node memory resource adjustment, comprising: 1) compute each server's overall allowable memory value ServerSI_j, i.e. the difference between the server's remaining memory capacity before task allocation and the minimum local memory of the allocated tasks: ServerSI_j = C_j − Min_Allo; 2) compute the mean ServerSI_avg of all servers' ServerSI_j; when ServerSI_j − ServerSI_avg > 0, server j yields ServerSI_j − ServerSI_avg of memory capacity; otherwise it borrows |ServerSI_j − ServerSI_avg| of memory capacity.
v) Adjust the tasks' local memory within the node, comprising: 1) collect each task's minimum local memory value Min_lm^i and allowable memory value SI_i on the current server, and compute the server's maximum allocatable memory resource ServerSI = min(ServerSI_j, ServerSI_avg); 2) compute the additional local memory Δlm_i each task can receive, each task's share being in inverse proportion to its own allowable memory value SI_i, i.e. Δlm_i ∝ 1/SI_i; the task's local memory space is then Mem_lm = Min_lm + Δlm; 3) compute each task's horizontal far-memory space Mem_hfm^ii, where the index ii ranges only over the far-memory-sensitive tasks in the server, subject to ServerSI_j − ServerSI_avg > 0 on the lending side; 4) compute each task's vertical far-memory value Mem_vfm = Mem − Mem_lm − Mem_hfm; and return each node's final memory allocation.
vi) Adjust cross-node and intra-node resources again, judging the tasks' far-memory sensitivity by periodically sampling the difference in the page-fault count, comprising: 1) when the page-fault difference is positive and exceeds the average page-fault count more than three consecutive times, the task is considered to have changed from far-memory-insensitive to far-memory-sensitive; when the page-fault difference is negative and its magnitude exceeds the average page-fault count more than three consecutive times, the task is considered to have changed from far-memory-sensitive to far-memory-insensitive; 2) when a task changes from far-memory-sensitive to insensitive, or when one task completes while others continue, the change ΔSI in the server's current SI value is obtained first; when it exceeds a threshold, the cross-node adjustment module is invoked to yield part of the memory for use as far memory, and the intra-node adjustment module then allocates an appropriate proportion of memory to each running task; 3) conversely, when a task changes from far-memory-insensitive to sensitive, ΔSI is likewise obtained first; when it exceeds the threshold, the cross-node adjustment module borrows part of another server's memory as far-memory access for local tasks, and the intra-node adjustment module then allocates an appropriate proportion of far memory to each running task.
vii) Release the related resources after the run ends, and recompute the task allocation in the next task queue.
In practical experiments under the RDMA-based far-memory environment described above, running graph computation, video processing, AI training, AI inference, image recognition, and automatically generated sample applications, the implemented far-memory scheduling strategy successfully simulated task mixes with far-memory-sensitive proportions of 10%, 30%, 50%, 70%, and 90% under the three scenarios of 10 nodes with 200 tasks (S-Trace), 20 nodes with 500 tasks (M-Trace), and 50 nodes with 2000 tasks (L-Trace). Compared with a strategy that uses no far memory (No-FM), a last-in-first-compressed strategy (LIFC), and a first-in-first-compressed strategy (FIFC), the invention improves overall memory utilization by 17.6%, far-memory application performance by 20.7%, and memory-use efficiency by 20.5%, while reaching a maximum memory utilization of 98%.
Compared with the prior art, the method achieves higher memory utilization, more flexible far-memory access, better application performance in a far-memory environment and higher memory use efficiency.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention; the scope of protection is defined by the appended claims rather than by the preceding embodiments, and all embodiments falling within that scope are covered by the invention.

Claims (9)

1. A hybrid far memory scheduling method under a separated memory architecture, characterized in that runtime data is collected by limiting an application's use of local memory, whereby tasks are divided into far-memory-insensitive tasks, far-memory-sensitive tasks and far-memory-forbidden tasks; the far-memory-insensitive tasks and far-memory-sensitive tasks are allocated to the same computing node according to a sensitivity complementation principle, each task yields as much memory as possible under the same performance limit, cross-node memory resource adjustment is performed when the difference in overall yieldable memory between servers is large, determining the memory each server yields or the far memory it rents; intra-node memory resource adjustment is then performed, allocating resources to each task according to the server's current remaining memory resources and the principle that sensitive tasks receive more additional local memory resources, thereby realizing hybrid far memory scheduling.
2. The method according to claim 1, characterized in that the hybrid far memory comprises transverse far memory and longitudinal far memory, wherein: the transverse far memory comprises remote DRAM space accessed over an RDMA network card connection, and the longitudinal far memory comprises storage space on the same server, such as magnetic disk and SSD devices, accessed through the Linux swap mechanism and the I/O interface; the total memory space Mem of an application is the sum of three parts: the local memory space Mem_lm, the transverse far memory space Mem_hfm and the longitudinal far memory space Mem_vfm, i.e. Mem = Mem_lm + Mem_hfm + Mem_vfm.
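The three-part memory decomposition can be illustrated with a small sketch; the `TaskMemory` name and the MiB units are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class TaskMemory:
    """One task's hybrid memory layout (all sizes in MiB)."""
    lm: int   # local DRAM
    hfm: int  # transverse far memory: remote DRAM over RDMA
    vfm: int  # longitudinal far memory: same-server disk/SSD via swap

    @property
    def total(self) -> int:
        # Mem = Mem_lm + Mem_hfm + Mem_vfm
        return self.lm + self.hfm + self.vfm
```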
3. The method according to claim 1, characterized in that allocating the far-memory-insensitive tasks and far-memory-sensitive tasks to the same computing node means: according to the minimum local memory space Min_lm^i of each task i, its yieldable memory value SI_i, the server id and the server's remaining memory capacity C_j, calculating the server node id on which each task should be placed, specifically:
i) when the total remaining resource Res, i.e. the sum of the remaining capacities C_j of all servers, is larger than the minimum resource allocation of the current task group Min_Allo, i.e. the sum of the minimum local memory values Min_lm^i of all tasks in the group, performing step ii);
ii) sorting all tasks in the current task group by the difference between each task's maximum yieldable memory value SI_i and the in-group average yieldable memory SI_avg, i.e. by SI_i − SI_avg; allocating according to each task's minimum local memory Min_lm^i using a knapsack algorithm, and preferentially placing tasks whose SI_i − SI_avg values are close in magnitude but opposite in sign on the same server;
iii) calculating the server's predicted remaining capacity C_j in real time; when the predicted remaining capacity can no longer hold the minimum local memory Min_lm^i of the next task to be placed, the current server is considered full, and tasks are placed on the next server;
iv) returning to step i) until every task has been traversed.
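The placement loop of steps i)-iv) can be sketched as follows. This is an illustrative greedy first-fit pass standing in for the knapsack formulation of the claim; the `Task` and `place_tasks` names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    min_lm: int  # minimum local memory Min_lm^i
    si: int      # maximum yieldable memory SI_i

def place_tasks(tasks, capacities):
    """Return {task index: server id}, steering tasks whose SI deviation
    from the group mean is large and of opposite sign onto the same server."""
    # i) feasibility: total remaining resources must cover Min_Allo
    if sum(capacities) < sum(t.min_lm for t in tasks):
        raise ValueError("insufficient total memory for the task group")
    si_avg = sum(t.si for t in tasks) / len(tasks)
    # ii) order by |SI_i - SI_avg| descending so that large positive and
    # large negative deviations get interleaved onto the same node
    order = sorted(range(len(tasks)), key=lambda i: -abs(tasks[i].si - si_avg))
    remaining = list(capacities)
    placement = {}
    for i in order:
        # iii) pick a server whose predicted remaining capacity C_j still
        # holds the task's minimum local memory (first-fit here)
        server = next(j for j, c in enumerate(remaining)
                      if c >= tasks[i].min_lm)
        remaining[server] -= tasks[i].min_lm
        placement[i] = server
    return placement
```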
4. The method according to claim 1 or 3, characterized in that the computing node proportionally adjusts, for each task, the local memory usage and the transverse and longitudinal far-memory usage according to the task's maximum yieldable memory.
5. The method according to claim 1, characterized in that the cross-node memory resource adjustment comprises:
i) calculating each server's overall yieldable memory value ServerSI, i.e. the difference between the server's remaining memory capacity before tasks are allocated and the minimum local memory resources of the allocated tasks: ServerSI_j = C_j − Min_Allo;
ii) calculating the average ServerSI_avg of the overall yieldable memory ServerSI_j across servers; when ServerSI_j − ServerSI_avg > 0, server j needs to yield ServerSI_j − ServerSI_avg of memory capacity; otherwise server j needs to borrow |ServerSI_j − ServerSI_avg| of memory capacity.
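The cross-node computation can be sketched in a few lines; the function name `cross_node_adjust` and the list-based interface are illustrative assumptions.

```python
def cross_node_adjust(capacities, min_allocs):
    """For each server j compute ServerSI_j = C_j - Min_Allo_j and return
    how much memory it yields (positive) or must borrow (negative)
    relative to the cluster average ServerSI_avg."""
    server_si = [c - m for c, m in zip(capacities, min_allocs)]
    si_avg = sum(server_si) / len(server_si)
    # ServerSI_j - ServerSI_avg > 0 -> yield that much memory;
    # otherwise borrow |ServerSI_j - ServerSI_avg|
    return [si - si_avg for si in server_si]
```

By construction the yields and borrows cancel out across the cluster, so the transfers are always balanced.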
6. The method according to claim 1 or 2, characterized in that the intra-node memory resource adjustment is: inside a server node, adjusting the local memory, transverse far memory and longitudinal far memory resource sizes of each task so that sensitive tasks receive a relatively larger proportion of the SI value, with the following specific steps:
i) collecting the minimum local memory value Min_lm^i and yieldable memory value SI_i of each task in the current server, and calculating the maximum allocatable memory resource value of the current server, ServerSI = min(ServerSI_j, ServerSI_avg); a server that yields memory only needs to provide the total yielded memory, these memory spaces being used as transverse far memory by tasks on other servers, with each task's far-memory usage calculated in the following steps;
ii) calculating the increasable local memory resource value Δlm of each task, where each task's share of the increasable local resources is inversely proportional to its own minimum local memory Min_lm^i; the local memory space of a task is then Mem_lm = Min_lm + Δlm;
iii) calculating each task's transverse far memory space Mem_hfm, where only the far-memory-sensitive tasks in the server are considered, the memory being drawn from servers satisfying ServerSI_j − ServerSI_avg > 0;
iv) calculating each task's longitudinal far memory value as Mem_vfm = Mem − Mem_lm − Mem_hfm;
v) returning the final memory allocation of each node.
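Steps i)-v) can be sketched as below. This is a minimal illustration under assumptions: the `adjust_node` name, the dict-based task records, and treating each task's transverse far-memory share as already decided by the cross-node step are not specified by the patent.

```python
def adjust_node(tasks, server_si):
    """tasks: dicts with minimum local memory 'min_lm', total demand 'mem',
    and the transverse far-memory share 'hfm' granted across nodes.
    server_si: the server's allocatable pool ServerSI."""
    # ii) each task's extra local memory d_lm is inversely proportional
    # to its own minimum local memory Min_lm^i
    inv = [1.0 / t["min_lm"] for t in tasks]
    inv_sum = sum(inv)
    plan = []
    for t, w in zip(tasks, inv):
        d_lm = server_si * w / inv_sum
        lm = t["min_lm"] + d_lm        # Mem_lm = Min_lm + d_lm
        hfm = t["hfm"]                 # iii) transverse far memory
        vfm = t["mem"] - lm - hfm      # iv) Mem_vfm = Mem - Mem_lm - Mem_hfm
        plan.append({"lm": lm, "hfm": hfm, "vfm": vfm})
    return plan                        # v) final per-task allocation
```

Note how the task with the smaller minimum local memory receives the larger local-memory increment, matching the inverse-proportion rule of step ii).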
7. The method according to claim 1, 2, 3 or 5, characterized in that when the far-memory sensitivity of a task on a computing node changes, or when one task ends while other tasks have not, the computing node performs the cross-node memory resource adjustment and the intra-node resource adjustment again, releases the related resources after the tasks finish, and recalculates the task allocation for the next task queue;
the change of far-memory sensitivity is judged by periodically detecting the difference in page-fault counts: when the difference is positive and exceeds the average page-fault count more than three consecutive times, the task is considered to have changed from far-memory insensitive to far-memory sensitive; when the difference is negative and its magnitude exceeds the average page-fault count more than three consecutive times, the task is considered to have changed from far-memory sensitive to far-memory insensitive;
performing the cross-node memory resource adjustment and intra-node resource adjustment again means: when a task changes from far-memory sensitive to insensitive, or when one task completes while other tasks have not finished, the change ΔSI of the server's current yieldable memory value SI is obtained first; when the change exceeds a threshold, the cross-node adjustment module is invoked to yield part of the memory for use as far-memory access, and the intra-node adjustment module is then invoked to allocate an appropriate proportion of memory to each running task; conversely, when a task changes from far-memory insensitive to sensitive, ΔSI is likewise obtained first; when the change exceeds the threshold, the cross-node adjustment module is invoked to borrow part of the memory as far-memory access for local tasks, and the intra-node adjustment module is then invoked to allocate an appropriate proportion of far memory to each running task.
8. The method according to claim 1, characterized in that allocating resources to each task comprises allocating the local memory resource size, the transverse far memory resource size and the longitudinal far memory size.
9. A system for implementing the hybrid far memory scheduling method under a separated memory architecture according to any one of claims 1 to 8, comprising: an application sensitivity analysis unit based on far memory, a task grouping unit based on sensitivity and task characteristics, a computing node selection unit based on load balancing, a cross-node memory resource adjustment unit and an intra-node memory resource adjustment unit, wherein: the application sensitivity analysis unit collects runtime data online by limiting an application's use of local memory and calculates the application parameters related to sensitivity; the task grouping unit performs far-memory sensitivity analysis and calculation on the collected runtime data and related parameters, dividing tasks into far-memory-insensitive, far-memory-sensitive and far-memory-forbidden tasks; the computing node selection unit allocates far-memory-insensitive and far-memory-sensitive tasks to the same computing node according to the task sensitivity information collected by the application sensitivity analysis unit and the task grouping unit and the sensitivity complementation principle; the cross-node memory resource adjustment unit calculates each server's overall yieldable memory value from the maximum yieldable memory of its tasks under the same performance limit and, when the difference in overall yieldable memory between servers is large, performs cross-node memory resource adjustment and determines the memory each server yields or the far memory it rents; the intra-node memory resource adjustment unit allocates the local memory resource size to each task according to the server's current remaining memory resources and the principle that sensitive tasks receive more additional local memory resources, determines the transverse far memory resource size by combining the result of the cross-node memory resource adjustment unit, and simultaneously calculates the longitudinal far memory size, finally realizing efficient hybrid far memory scheduling.
CN202211212624.0A 2022-09-30 2022-09-30 Hybrid remote memory scheduling method under separated memory architecture Active CN115495246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211212624.0A CN115495246B (en) 2022-09-30 2022-09-30 Hybrid remote memory scheduling method under separated memory architecture

Publications (2)

Publication Number Publication Date
CN115495246A true CN115495246A (en) 2022-12-20
CN115495246B CN115495246B (en) 2023-04-18

Family

ID=84471597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211212624.0A Active CN115495246B (en) 2022-09-30 2022-09-30 Hybrid remote memory scheduling method under separated memory architecture

Country Status (1)

Country Link
CN (1) CN115495246B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104166597A (en) * 2013-05-17 2014-11-26 华为技术有限公司 Remote memory allocation method and device
US9535740B1 (en) * 2015-08-26 2017-01-03 International Business Machines Corporation Implementing dynamic adjustment of resources allocated to SRIOV remote direct memory access adapter (RDMA) virtual functions based on usage patterns
US20170277655A1 (en) * 2016-03-25 2017-09-28 Microsoft Technology Licensing, Llc Memory sharing for working data using rdma
CN109885381A (en) * 2019-02-15 2019-06-14 合肥谐桐科技有限公司 The method and its system of memory share management and running are realized based on KVM virtualization
CN112817887A (en) * 2021-02-24 2021-05-18 上海交通大学 Far memory access optimization method and system under separated combined architecture
CN114756388A (en) * 2022-03-28 2022-07-15 北京航空航天大学 RDMA (remote direct memory Access) -based method for sharing memory among cluster system nodes as required

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Youmin: "A survey of RDMA-based distributed storage systems" *

Also Published As

Publication number Publication date
CN115495246B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
WO2021136137A1 (en) Resource scheduling method and apparatus, and related device
US20120221730A1 (en) Resource control system and resource control method
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
CN109005130B (en) Network resource allocation scheduling method and device
Zhang et al. Virtual machine placement strategy using cluster-based genetic algorithm
CN107273200B (en) Task scheduling method for heterogeneous storage
CN110018781B (en) Disk flow control method and device and electronic equipment
CN116089051A (en) Task allocation method, device and system
CN111522885A (en) Distributed database system collaborative optimization method based on dynamic programming
CN117349026B (en) Distributed computing power scheduling system for AIGC model training
CN115495246B (en) Hybrid remote memory scheduling method under separated memory architecture
CN110069319B (en) Multi-target virtual machine scheduling method and system for cloud resource management
CN108616583B (en) Storage space allocation method based on computer cloud
CN115357368A (en) MapReduce job scheduling method based on heterogeneous environment perception
CN112988363B (en) Resource scheduling method, device, server and storage medium
CN111208943B (en) IO pressure scheduling system of storage system
CN110580192B (en) Container I/O isolation optimization method in mixed scene based on service characteristics
CN115344358A (en) Resource scheduling method, device and management node
CN113656150A (en) Deep learning computing power virtualization system
JP2012038275A (en) Transaction calculation simulation system, method, and program
CN115379014B (en) Data request distribution method and device and electronic equipment
CN113835869B (en) MPI-based load balancing method, MPI-based load balancing device, computer equipment and storage medium
CN115543222B (en) Storage optimization method, system, equipment and readable storage medium
CN113535388B (en) Task-oriented service function aggregation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant