CN116954886A - Node preselection method, pod scheduling method, device, server and medium - Google Patents


Info

Publication number
CN116954886A
CN116954886A (application number CN202211404741.7A)
Authority
CN
China
Prior art keywords
node
nodes
cards
cluster
idle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211404741.7A
Other languages
Chinese (zh)
Inventor
闫晓瑞 (Yan Xiaorui)
丛鹏宇 (Cong Pengyu)
冯俊兰 (Feng Junlan)
邓超 (Deng Chao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN202211404741.7A
Publication of CN116954886A
Legal status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Small-Scale Networks (AREA)

Abstract

The application provides a node preselection method, a pod scheduling method, an apparatus, a server and a medium. The node preselection method comprises the following steps: after receiving a training task, determining the total number of cards required by the training task; traversing the cluster according to the training task and selecting, as first nodes, the X nodes with the largest numbers of idle cards in the cluster; and selecting, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count. Embodiments of the application can reduce the communication overhead between nodes.

Description

Node preselection method, pod scheduling method, device, server and medium
Technical Field
The embodiments of the present application relate to the field of communication technology, and in particular to a node preselection method, a pod scheduling method, an apparatus, a server and a medium.
Background
The existing Ring-AllReduce communication strategy delivers a clear speed-up for distributed training across a small number of nodes on a single host, but when the training task is too large, a ring spanning multiple nodes on multiple hosts has to be adopted.
At present, the nodes that form the ring are mainly chosen by random scheduling: any node that satisfies the resource requirement of the training task may be scheduled. As a result, the selected nodes may end up on multiple hosts, cross-host communication may be required among those hosts, and the communication overhead is relatively large.
Disclosure of Invention
The embodiment of the application provides a node preselection method, a pod scheduling method, a device, a server and a medium, aiming at reducing communication overhead among nodes.
To solve the above problems, the present application is achieved as follows:
in a first aspect, an embodiment of the present application provides a Kubernetes-based node preselection method, where the method includes:
after receiving a training task, determining the total number of cards required by the training task;
traversing the cluster according to the training task, and selecting, as first nodes, the X nodes with the largest numbers of idle cards in the cluster;
and selecting, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count.
In a second aspect, an embodiment of the present application provides a Kubernetes-based pod scheduling method, where the method includes:
after receiving a training task, determining the total number of cards required by the training task, and determining the number of idle cards on each node in the cluster;
assigning a number to each node according to its number of idle cards, wherein nodes with the same number of idle cards share the same rank number, nodes with adjacent rank numbers can communicate directly or with the fewest intermediate nodes, and the numbers of nodes in the same host are consecutive;
traversing the cluster according to the training task, and selecting, as first nodes, the X nodes with the largest numbers of idle cards in the cluster, the rank numbers of the first nodes being the same or adjacent;
selecting, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count;
and starting pod scheduling according to the numbers corresponding to the first nodes and the second nodes, so as to execute the training task.
In a third aspect, an embodiment of the present application provides a Kubernetes-based node preselection apparatus, including:
a first determining module, configured to determine, after a training task is received, the total number of cards required by the training task;
a first selection module, configured to traverse the cluster according to the training task and select, as first nodes, the X nodes with the largest numbers of idle cards in the cluster;
and a second selection module, configured to select, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count.
In a fourth aspect, an embodiment of the present application further provides a pod scheduling device based on Kubernetes, including:
a second determining module, configured to determine, after a training task is received, the total number of cards required by the training task and the number of idle cards on each node in the cluster;
a numbering module, configured to sort the nodes by their numbers of idle cards and assign a number to each node according to the sorting result, wherein nodes with the same number of idle cards share the same rank number, nodes with adjacent rank numbers can communicate directly or with the fewest intermediate nodes, and the numbers of nodes in the same host are consecutive;
a third selection module, configured to traverse the cluster according to the training task and select, as first nodes, the X nodes with the largest numbers of idle cards in the cluster, the rank numbers of the first nodes being the same or adjacent;
a fourth selection module, configured to select, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count;
and a scheduling module, configured to start pod scheduling according to the numbers corresponding to the first nodes and the second nodes.
In a fifth aspect, an embodiment of the present application further provides a server, including a memory, a processor, and a program stored in the memory and executable on the processor, where the processor is configured to read the program in the memory to implement the steps of the Kubernetes-based node preselection method according to the first aspect, or the steps of the Kubernetes-based pod scheduling method according to the second aspect.
In a sixth aspect, an embodiment of the present application further provides a readable storage medium for storing a program, where the program, when executed by a processor, implements the steps of the Kubernetes-based node preselection method according to the first aspect, or the steps of the Kubernetes-based pod scheduling method according to the second aspect.
In an embodiment of the present application, after a training task is received, the total number of cards required by the training task is determined; the cluster is traversed according to the training task, and the X nodes with the largest numbers of idle cards in the cluster are selected as first nodes; and second nodes that satisfy the remaining card count are selected from the remaining nodes in the cluster. In this way, the nodes with the most idle cards are preferred as first nodes, and second nodes covering the remaining card count are then chosen from the remaining nodes, so the total number of selected nodes is minimized, the number of hosts on which they reside is minimized, and the communication overhead between hosts during the execution of the training task can be reduced.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic communication diagram of nodes in a prior training task;
FIG. 2 is a schematic flow chart of a node preselection method based on Kubernetes provided by an embodiment of the present application;
fig. 3 is a schematic flow chart of a Kubernetes-based pod scheduling method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a Kubernetes-based node preselection apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a Kubernetes-based pod scheduling device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
The terms "first", "second", and the like in the embodiments of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Furthermore, the terms "comprises", "comprising", and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, article, or apparatus. Furthermore, "and/or" in the present application denotes at least one of the connected objects; for example, "A and/or B and/or C" covers seven cases: A alone, B alone, C alone, A and B, B and C, A and C, and A, B and C together.
Referring to fig. 1, fig. 1 is a schematic communication diagram of nodes in a conventional training task. As shown in fig. 1, the diagram includes 8 GPUs; each GPU is a computing node in a host, and the 8 GPUs may be located in at least one host, that is, each host includes at least one computing node. Each node has a forward node and a backward node; it only transmits data to its forward node and receives data from its backward node. The first N-1 steps (N being the number of GPU processes) transmit and sum, and the last N-1 steps transmit and overwrite; after 2(N-1) transmissions in total, the parameter update is complete (i.e., one round of the training task is done). When the training task is relatively large and many hosts are required, the existing Kubernetes (a container cluster management system) merely selects nodes that satisfy the resource requirement of the training task, so the selected nodes may be spread across many hosts; the more hosts selected, the more inter-host communications occur during training, and the greater the communication overhead.
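The 2(N-1)-transmission pattern described above can be simulated in a few lines. The sketch below is ours, not the patent's: it performs the N-1 transmit-and-sum (scatter-reduce) steps and the N-1 transmit-and-overwrite (all-gather) steps on a ring of simulated processes, and reports the number of transmissions performed by each process.

```python
def ring_allreduce(vectors):
    """Simulate Ring-AllReduce over n processes, each holding one vector.

    Each vector is split into n chunks. In every step, every process sends
    one chunk to its forward neighbour: the first n-1 steps sum the received
    chunk into place (scatter-reduce), the last n-1 steps overwrite with it
    (all-gather), so each process performs 2*(n-1) transmissions in total.
    """
    n = len(vectors)
    m = len(vectors[0])
    assert m % n == 0, "vector length must divide evenly into n chunks"
    c = m // n
    data = [list(v) for v in vectors]
    per_process = 0  # transmissions performed by each process

    # scatter-reduce: n-1 steps; received chunks are summed into place
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            k = (i - step) % n  # chunk index that process i sends this step
            msgs.append((i, k, data[i][k * c:(k + 1) * c]))
        for i, k, payload in msgs:  # apply all sends "simultaneously"
            dst = (i + 1) % n       # forward neighbour in the ring
            for j, v in enumerate(payload):
                data[dst][k * c + j] += v
        per_process += 1

    # all-gather: n-1 steps; received chunks overwrite, spreading the sums
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            k = (i + 1 - step) % n
            msgs.append((i, k, data[i][k * c:(k + 1) * c]))
        for i, k, payload in msgs:
            dst = (i + 1) % n
            data[dst][k * c:(k + 1) * c] = payload
        per_process += 1

    return data, per_process
```

With 4 processes, the function returns every process's copy of the element-wise sum together with 2*(4-1)=6 transmissions per process, matching the 2(N-1) count in the text.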
To solve the above problems, the node preselection method and the pod scheduling method provided by the embodiments of the present application are described below.
Referring to fig. 2, fig. 2 is a schematic flow chart of a Kubernetes-based node preselection method according to an embodiment of the present application. The node preselection method may include the steps of:
Step 210: after receiving a training task, determine the total number of cards required by the training task;
the node preselection method can be applied to any host in a network where a training task is located, and can also be applied to a server, and the server can acquire the states of all the hosts, so that at least one host in the network is selected to complete the training task. The present embodiment is described by taking an application to a server as an example.
After receiving the training task, the server analyzes the training task, thereby determining the total card number required by the training task.
Step 220, traversing the cluster according to the training task, and selecting a first node with the maximum number of X idle cards in the cluster;
and step 230, selecting a second node which meets the residual card number from the residual nodes in the cluster.
After determining the total number of cards required by the training task, the server traverses the cluster, which is formed by a plurality of hosts; each host includes a plurality of nodes, and each node has fixed resources, including its number of cards. After traversing the cluster, the server determines the number of idle cards on each node, selects the X nodes with the largest numbers of idle cards as first nodes, and, when the cards of the X first nodes cannot cover the total card count of the training task, selects second nodes that satisfy the remaining card count from the remaining nodes in the cluster.
In an embodiment, if the maximum number of idle cards on a node is N, traversing the cluster according to the training task and selecting the X nodes with the largest numbers of idle cards includes:
traversing the cluster according to the training task and selecting X first nodes each having N idle cards in the cluster, where X equals the ratio of the total card count to N, rounded down.
Specifically, in this embodiment, it is assumed that the maximum number of idle cards on any node is N and the total number of cards required by the training task is count. count/N is computed and rounded down; for convenience, the value obtained by rounding count/N down is called the first value.
If the number of nodes with N idle cards in the cluster is greater than or equal to the first value, X nodes are selected from the nodes with N idle cards as the first nodes.
If the number of nodes with N idle cards in the cluster is smaller than the first value, all nodes with N idle cards are selected as first nodes; in this case, X equals the number of nodes with N idle cards.
In another embodiment, traversing the cluster according to the training task and selecting the X nodes with the largest numbers of idle cards further includes:
if no node with N idle cards exists in the cluster, selecting from the cluster X first nodes each having N-1 idle cards, where X equals the ratio of the total card count to N-1, rounded down.
Specifically, if the maximum number of idle cards on any node in the cluster is N-1, the rule for selecting the first nodes may be the same as above, that is, X nodes are selected from the nodes with N-1 idle cards as the first nodes.
For ease of understanding, take count=10 and N=8 as an example: 10/8 rounds down to 1, so if the cluster has at least one node with 8 idle cards, X=1, and any node with 8 idle cards is selected as the first node. If the maximum number of idle cards on any node in the cluster is 7, X equals 10/7 rounded down, i.e., X=1. If the maximum is 6, X equals 10/6 rounded down, i.e., X=1. If the maximum is 5, X equals 10/5 rounded down, i.e., X=2, and so on.
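The arithmetic in this worked example is a single floor division; a minimal check (the helper name is ours, not the patent's):

```python
def first_node_count(count, n):
    """X = floor(count / n): how many maximally-idle nodes to take as first
    nodes, for count total cards and n idle cards on each such node."""
    return count // n
```

Evaluating it on the values from the example (count=10 with n=8, 7, 6, 5) reproduces X=1, 1, 1, 2.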
As described above, the present application preferentially selects the nodes with the largest current idle-card count in the cluster as the nodes for the current training task; if no node with N idle cards exists, the next best nodes, i.e., the nodes with N-1 idle cards, are selected as first nodes, and so on.
It can be understood that if the sum of all idle cards in the cluster is smaller than the total card count count required by the training task, the cluster cannot satisfy the task; the training task is temporarily suspended and executed only once the sum of idle cards in the cluster is greater than or equal to count. In this way, the number of selected nodes is reduced as much as possible.
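The first-node stage just described can be sketched as follows. This is one simplified reading of the text, not the patent's implementation, and all names are ours:

```python
def select_first_nodes(free_cards, count):
    """free_cards: {node name: idle card count}.

    Returns (first_nodes, remaining_cards), or None when the whole cluster
    cannot cover the task (the text says the task is then suspended until
    enough cards free up). First nodes are the nodes holding the current
    maximum idle-card count, X = floor(count / max_free) of them, or all
    such nodes if fewer than X exist.
    """
    if sum(free_cards.values()) < count:
        return None                      # cluster cannot satisfy the task
    max_free = max(free_cards.values())  # largest idle count available now
    x = count // max_free                # floor(count / N) from the text
    candidates = [n for n, c in free_cards.items() if c == max_free]
    first = candidates[:x]               # fewer than x such nodes: take all
    remaining = count - len(first) * max_free
    return first, remaining
```

For count=10 against one node with 8 idle cards, the sketch picks that node and reports 2 remaining cards, which the second-node stage must then cover.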
Further, in order to reduce communication among the plurality of first nodes, nodes located in the same host may be selected as the first nodes; or, when the first nodes are located in different hosts, hosts capable of communicating through PCIe-PCIe links (PCIe is a communication bus) may be selected, so as to reduce the overhead of hosts that communicate through PCIE-XX-SWITCH-XX-PCIE paths.
If cards remain after the first nodes are selected, for example when the product of X and N is smaller than count, the first nodes do not cover the number of cards needed by the training task: the difference between count and X*N is greater than 0. In this case, second nodes are further selected to cover the remaining card count, so that the total cards of the selected first and second nodes are greater than or equal to the total card count required by the training task.
In an embodiment, selecting, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count includes:
if a node whose number of idle cards equals the remaining card count exists among the remaining nodes, selecting that node as the second node.
Specifically, when the total idle cards of the first nodes cannot satisfy the total card count of the training task, second nodes must be selected. The remaining card count after the first nodes are selected is computed first: it equals count minus the total idle cards of the first nodes; if each first node has N idle cards and there are X first nodes, the remaining card count = count - X*N. As an example, a node whose number of idle cards equals the remaining card count is then selected from the remaining nodes in the cluster as the second node, so as to avoid splitting other nodes that have more idle cards than the remaining count.
In another embodiment, selecting, from the remaining nodes in the cluster, second nodes that satisfy the remaining card count further includes:
if no node whose number of idle cards equals the remaining card count exists among the remaining nodes, selecting, from the remaining nodes, the nodes whose idle-card counts differ least from the remaining card count as the second nodes.
Specifically, if no remaining node has exactly as many idle cards as the remaining card count, the node whose idle-card count differs least from the remaining card count is selected, and nodes with an idle-card count equal to that value are taken as second nodes. For example, if the remaining card count is 2 and no remaining node has 2 idle cards, two nodes each with 1 idle card are selected as second nodes. If no node matches that difference either, the node whose idle-card count differs least from the still-remaining count is selected again, and so on.
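The two second-node rules above (exact match first, otherwise the smallest idle-card difference, repeated until the remainder is covered) can be sketched as follows; this is a simplified reading with names of our choosing, not the patent's code:

```python
def select_second_nodes(remaining_nodes, remaining_cards):
    """remaining_nodes: {node: idle cards}.

    Prefer a node whose idle count exactly equals the remaining card count
    (avoids splitting larger nodes); otherwise repeatedly take the node
    minimising |idle - still needed| until the remainder is covered.
    """
    for node, cards in remaining_nodes.items():
        if cards == remaining_cards:
            return [node]                 # exact match: one node suffices
    chosen, pool, need = [], dict(remaining_nodes), remaining_cards
    while need > 0 and pool:
        best = min(pool, key=lambda n: abs(pool[n] - need))
        chosen.append(best)
        need -= pool.pop(best)
    return chosen if need <= 0 else None  # None: cluster ran out of cards
```

On the example in the text (remaining count 2, no node with 2 idle cards, several nodes with 1), the sketch picks two one-card nodes.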
In a further embodiment, if no node whose number of idle cards equals the remaining card count exists among the remaining nodes, a node whose idle-card count differs least from the remaining card count while being greater than it is selected as the second node; this reduces the number of nodes and further reduces communication among the nodes of the training task.
It can be understood that, to reduce communication among the nodes of the training task as much as possible, the two modes above may be combined when selecting second nodes: if no node with an idle-card count equal to the remaining card count exists, a node whose idle-card count differs least from the remaining count and is greater than or equal to it is selected as the second node. Alternatively, both modes may be tried, their selection results compared, and the mode yielding the fewer second nodes chosen.
Further, the first nodes and the second nodes all satisfy preset conditions corresponding to the training task, where the preset conditions include at least one of the number of cores of the node, the maximum number of cards of the node, and the network type of the node.
In an embodiment of the present application, after a training task is received, the total number of cards required by the training task is determined; the cluster is traversed according to the training task, and the X nodes with the largest numbers of idle cards in the cluster are selected as first nodes; and second nodes that satisfy the remaining card count are selected from the remaining nodes in the cluster. In this way, the nodes with the most idle cards are preferred as first nodes, and second nodes covering the remaining card count are then chosen from the remaining nodes, so the total number of selected nodes is minimized, the number of hosts on which they reside is minimized, and the communication overhead between hosts during the execution of the training task can be reduced.
Referring to fig. 3, fig. 3 is a flowchart of a Kubernetes-based pod scheduling method provided by an embodiment of the present application. The Kubernetes-based pod scheduling method may include the steps of:
Step 310: after receiving a training task, determine the total number of cards required by the training task and the number of idle cards on each node in the cluster;
The Kubernetes-based pod scheduling method of this embodiment may be applied to any host in the network where the training task is located, or to a server; the server can obtain the status of every host and thus select at least one host in the network to complete the training task. This embodiment is described taking application to a server as an example.
After receiving the training task, the server parses it, thereby determining the total number of cards required by the training task and the number of idle cards on each node in the cluster of the network where the training task is located.
Step 320, assigning numbers to the nodes according to the number of idle cards of the nodes, wherein the numbers of the nodes with the same number of idle cards correspond to the same rank number, and the nodes adjacent to the rank number can directly communicate or the number of the nodes passing through the nodes is the least when communicating with each other, so that the numbers of the nodes in the same host are continuous;
after the number of idle cards of each node in the cluster is determined, each node is numbered, each node after the numbering meets the condition that the rank numbers corresponding to the numbers of the nodes with the same number of idle cards are the same, the nodes adjacent to the rank numbers can directly communicate or the number of the nodes passing through the nodes is the least when communicating with each other, and the numbers of the nodes in the same host are continuous. Illustratively, the nodes in the cluster where there are free cards include: the number of nodes A with the number of the idle cards is N, the number of the nodes B with the number of the idle cards is N-1, and the number of the nodes C with the number of the idle cards is N-2, and in the process of numbering, the number of the nodes with the number of the idle cards of A is N firstly 0 To N A-1 The B nodes with the number of the idle cards being N-1 are respectively numbered as (N-1) A Until (N-1) A+B-1 The C nodes with the number of the free cards being N-2 are respectively numbered as (N-2) A+B Until (N-2) A+B+C-1 And so on. The rank number is N, (N-1), (N-2) and the like, and the N-1 are adjacent, and particularly when the numbers are allocated, the numbers of a plurality of nodes in the same host are continuous, and the rank numbers between the nodes with the least number of passing nodes can be adjacent when the nodes are in direct communication or mutual communication.
In another embodiment, step 320 may be performed after node preselection; that is, the preselected nodes are numbered according to the numbering rule, as long as nodes with the same number of idle cards share the same rank number, nodes with adjacent rank numbers can communicate directly or with the fewest intermediate nodes, and the numbers of nodes in the same host are consecutive.
Step 330, traversing the cluster according to the training task, and selecting the first nodes with the maximum number of X idle cards in the cluster, wherein the rank numbers of the first nodes are the same or adjacent;
step 340, selecting a second node satisfying the remaining card number from the remaining nodes in the cluster;
after determining the total number of cards required by the training task, the server traverses a cluster formed by a plurality of hosts according to the training task, wherein each host comprises a plurality of nodes, and each node has a fixed resource comprising the number of cards. After traversing the cluster, determining the number of idle cards existing in each node in the cluster, selecting the node with the maximum number of X idle cards from the cluster as a first node, and selecting a second node which meets the number of the remaining cards from the remaining nodes in the cluster after the number of cards of the X first nodes cannot meet the total number of cards of the training task.
In an embodiment, if the maximum number of idle cards on a node is N, traversing the cluster according to the training task and selecting the X nodes with the largest numbers of idle cards includes:
traversing the cluster according to the training task and selecting X first nodes each having N idle cards in the cluster, where X equals the ratio of the total card count to N, rounded down.
Specifically, in this embodiment, it is assumed that the maximum number of idle cards on any node is N and the total number of cards required by the training task is count. count/N is computed and rounded down; for convenience, the value obtained by rounding count/N down is called the first value.
If the number of nodes with N idle cards in the cluster is greater than or equal to the first value, X nodes are selected from the nodes with N idle cards as the first nodes; the rank numbers of the X selected first nodes are the same or adjacent, and if they cannot all be the same or adjacent, the nodes with the smallest rank-number difference, that is, the smallest communication distance, are selected from the nodes with N idle cards.
If the number of nodes with N idle cards in the cluster is smaller than the first value, all nodes with N idle cards are selected as first nodes; in this case, X equals the number of nodes with N idle cards. Likewise, the rank numbers of the X selected first nodes are the same or adjacent, and if they cannot all be the same or adjacent, the nodes with the smallest rank-number difference, that is, the smallest communication distance, are selected.
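The "smallest difference" tie-break could be implemented as a sliding window over the candidates' assigned numbers, taking the X candidates whose numbers span the smallest range. The sketch below is our reading, not the patent's code, and the names are ours:

```python
def pick_adjacent(candidates, x):
    """candidates: list of (name, number) for the equally-idle nodes.

    Chooses x candidates whose assigned numbers span the smallest window,
    i.e. the smallest communication distance between the selected nodes.
    """
    ordered = sorted(candidates, key=lambda t: t[1])
    if x >= len(ordered):
        return [name for name, _ in ordered]
    # slide a window of size x and keep the one with the tightest spread
    start = min(range(len(ordered) - x + 1),
                key=lambda i: ordered[i + x - 1][1] - ordered[i][1])
    return [name for name, _ in ordered[start:start + x]]
```

Given candidates numbered 0, 5, 6 and 7 with x=2, the window over numbers 5 and 6 is the tightest, so those two nodes are picked.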
In another embodiment, traversing the cluster according to the training task and selecting the X nodes with the largest numbers of idle cards further includes:
if no node with N idle cards exists in the cluster, selecting from the cluster X first nodes each having N-1 idle cards, where X equals the ratio of the total card count to N-1, rounded down.
Specifically, if the maximum number of idle cards on any node in the cluster is N-1, the rule for selecting the first nodes may be the same as above, that is, X nodes are selected from the nodes with N-1 idle cards as the first nodes. Again, the rank numbers of the X selected first nodes are the same or adjacent, and if they cannot all be the same or adjacent, the nodes with the smallest rank-number difference, that is, the smallest communication distance, are selected.
For ease of understanding, take count = 10 and N = 8 as an example. 10/8 rounded down is 1, so if at least one node in the cluster has 8 idle cards, X = 1 and any node with 8 idle cards is selected as the first node. If the maximum number of idle cards of any node in the cluster is 7, X equals 10/7 rounded down, i.e., X = 1. If the maximum is 6, X equals 10/6 rounded down, i.e., X = 1. If the maximum is 5, X equals 10/5 rounded down, i.e., X = 2; in this case two nodes with the same or adjacent rank numbers (or, failing that, the smallest rank-number difference) are selected from the nodes with 5 idle cards, and so on.
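The count = 10 walkthrough above can be expressed as a small sketch (the helper name is hypothetical; it assumes the per-node free-card counts are already known):

```python
def plan_first_selection(count, free_counts):
    """Determine the free-card level m to target and the number of first
    nodes X. m is the largest per-node free-card count present in the
    cluster; X = floor(count / m), as in the text's count = 10 example.
    """
    m = max(free_counts)  # largest number of idle cards on any node
    x = count // m        # nodes needed at m cards each, rounded down
    return m, x

# Worked examples from the text, with count = 10 total cards required:
assert plan_first_selection(10, [8, 3, 2]) == (8, 1)  # 10 // 8 = 1
assert plan_first_selection(10, [7, 7, 1]) == (7, 1)  # 10 // 7 = 1
assert plan_first_selection(10, [5, 5, 4]) == (5, 2)  # 10 // 5 = 2
```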
As described above, the present application preferentially selects the nodes with the largest current number of idle cards in the cluster for the current training task; if no node with N idle cards exists, the nodes with the next largest number of idle cards, i.e., N-1 idle cards, are selected, and so on. This keeps the number of selected nodes as small as possible.
It can be understood that if the sum of all idle cards in the cluster is smaller than the total number of cards count required by the training task, the cluster cannot satisfy the training task; the training task is temporarily suspended and is executed only after the sum of idle cards in the cluster becomes greater than or equal to count.
Nodes with the same or adjacent rank numbers are used in order to reduce communication among the plurality of first nodes: preferably, the selected first nodes are located in the same host, or, when the first nodes are located in different hosts, hosts that can communicate through PCIE-PCIE inside the host (PCIE is a communication bus) are preferred over hosts that communicate through PCIE-XX-SWITCH-XX-PCIE, thereby reducing communication overhead.
If cards remain after the first nodes are selected, for example when the product of X and N is smaller than count, the selected first nodes do not cover the number of cards required by the training task, and the difference between count and X×N is greater than 0. In this case, second nodes are further selected to satisfy the remaining number of cards, so that the sum of the card numbers of the selected first and second nodes is greater than or equal to the total number of cards required by the training task.
As an embodiment, selecting a second node satisfying the remaining cards from the remaining nodes in the cluster includes:
if a node whose number of idle cards equals the remaining number of cards exists among the remaining nodes, that node is selected as the second node.
Specifically, when the total number of idle cards of the first nodes cannot meet the total number of cards of the training task, second nodes need to be selected. The remaining number of cards after the first nodes are selected is calculated first; it equals the difference between count and the total number of idle cards of the first nodes. If each first node has N idle cards and there are X first nodes, the remaining number of cards = count − X×N. As one embodiment, a node whose number of idle cards equals the remaining number of cards is first selected from the remaining nodes in the cluster as the second node, so as to avoid splitting up other nodes whose number of idle cards is greater than the remaining number.
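A minimal sketch of this exact-fit second-node selection, under the same assumed (rank, free_cards) representation (the function name is hypothetical):

```python
def select_second_exact(count, x, n, remaining_nodes):
    """After x first nodes with n idle cards each are chosen, the shortfall
    is count - x * n. Prefer a remaining node whose idle-card count equals
    the shortfall exactly, so nodes with more idle cards are not split up.

    `remaining_nodes` is a list of (rank, free_cards) tuples.
    Returns (selected_ranks, cards_still_needed).
    """
    shortfall = count - x * n
    for rank, free in remaining_nodes:
        if free == shortfall:
            return [rank], 0   # exact fit, nothing left over
    return [], shortfall       # no exact fit; fall back to other strategies
```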
As another embodiment, selecting a second node that satisfies the remaining cards from the remaining nodes in the cluster further includes:
if no node whose number of idle cards equals the remaining number of cards exists among the remaining nodes, the idle-card count with the smallest difference from the remaining number of cards is determined among the remaining nodes, and nodes with that number of idle cards are taken as the second nodes, wherein the rank numbers of the second nodes are the same or adjacent.
Specifically, if no node whose number of idle cards equals the remaining number of cards exists among the remaining nodes, the idle-card count with the smallest difference from the remaining number of cards is determined, and nodes with that number of idle cards are taken as the second nodes. For example, assuming the remaining number of cards is 2 and no remaining node has exactly 2 idle cards, two nodes with 1 idle card each are selected as the second nodes; when a plurality of second nodes are selected, their rank numbers are the same or adjacent. If no node with that number of idle cards exists either, the node with the next smallest difference between its number of idle cards and the remaining number of cards is selected from the remaining nodes, and so on.
As a further embodiment, if no node whose number of idle cards equals the remaining number of cards exists among the remaining nodes, a node whose number of idle cards is greater than the remaining number of cards and has the smallest difference from it is selected as the second node; this reduces the number of nodes and further reduces communication between the nodes of the training task.
It can be understood that, in order to reduce communication between the nodes of the training task as much as possible, the two modes may be combined when selecting the second node: if no node whose number of idle cards equals the remaining number of cards exists among the remaining nodes, the node whose number of idle cards is greater than or equal to the remaining number of cards and has the smallest difference from it is selected as the second node. Alternatively, both modes may be applied simultaneously, their selection results compared, and the mode yielding the smaller number of second nodes adopted.
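The comparison of the two fallback modes described above, keeping whichever uses fewer second nodes, might look like this. This is a hedged sketch: the greedy closest-below strategy used for mode (a) is one possible reading of the text, and all names are hypothetical:

```python
def select_second_fallback(shortfall, remaining_nodes):
    """Two fallback modes when no node's idle cards equal the shortfall:
    (a) repeatedly take nodes whose idle-card count is at most the
        remaining shortfall (closest below, largest first), or
    (b) take one node with the smallest idle-card count >= the shortfall.
    Run both and keep whichever uses fewer second nodes.

    `remaining_nodes` is a list of (rank, free_cards) tuples.
    """
    # Mode (b): a single node that covers the shortfall outright.
    covering = [(free, rank) for rank, free in remaining_nodes if free >= shortfall]
    mode_b = [min(covering)[1]] if covering else None

    # Mode (a): greedily pick closest-below nodes until the shortfall is met.
    mode_a, need = [], shortfall
    pool = sorted(remaining_nodes, key=lambda rf: rf[1], reverse=True)
    for rank, free in pool:
        if need <= 0:
            break
        if 0 < free <= need:
            mode_a.append(rank)
            need -= free
    if need > 0:
        mode_a = None  # mode (a) could not satisfy the shortfall

    choices = [m for m in (mode_a, mode_b) if m is not None]
    return min(choices, key=len) if choices else None
```

For a shortfall of 2 with remaining idle counts {1, 1, 5}, mode (a) would use two one-card nodes while mode (b) uses the single five-card node, so the single node wins.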
Further, the first node and the second node both conform to preset conditions corresponding to the training task, where the preset conditions include at least one of a core number of the node, a maximum card number of the node, and a network type of the node.
And step 350, starting pod scheduling according to the numbers corresponding to the first node and the second node so as to execute the training task.
After the numbers of the first node and the second node are assigned, pod scheduling is started to begin executing the training task and the communication between the nodes. Experiments show that, taking the fastest node communication of 100G Infiniband RDMA as an example, the acceleration ratio is about 80%, the time saved is (Count − (a + b + ... + 1)) × (1 − 80%), and the maximum communication time that can be saved is (Count − Count/N + 1) × 20%.
The various optional embodiments described in the embodiments of the present application may be implemented in combination with each other, provided they do not conflict, or may be implemented separately; the embodiments of the present application are not limited in this respect.
Referring to fig. 4, fig. 4 is a block diagram of the Kubernetes-based node preselection apparatus provided in an embodiment of the present application. As shown in fig. 4, the Kubernetes-based node preselection apparatus includes:
a first determining module 410, configured to determine a total number of cards corresponding to a training task after receiving the training task;
a first selection module 420, configured to traverse the cluster according to the training task and select the X first nodes with the largest number of idle cards in the cluster;
a second selecting module 430, configured to select a second node that satisfies the remaining number of cards from the remaining nodes in the cluster.
In one embodiment, the first selection module 420 is further configured to: traverse the cluster according to the training task and select X first nodes with N idle cards in the cluster, where X is equal to the ratio of the total number of cards to N, rounded down.
In one embodiment, the first selection module 420 is further configured to: if no node with N idle cards exists in the cluster, select X first nodes with N-1 idle cards from the cluster, where X is equal to the ratio of the total number of cards to N-1, rounded down.
In one embodiment, the second selection module 430 is further configured to: and if the node with the idle card number equal to the residual card number exists in the residual nodes, selecting the node with the idle card number equal to the residual card number as the second node.
In one embodiment, the second selection module 430 is further configured to: and if the node with the idle card number equal to the residual card number does not exist in the residual nodes, selecting a node with the smallest difference value between the idle card number and the residual card number from the residual nodes, and taking the node with the idle card number equal to the difference value as the second node.
In one embodiment, the first node and the second node both conform to preset conditions corresponding to the training task, where the preset conditions include at least one of a core number of the node, a maximum card number of the node, and a network type of the node.
The Kubernetes-based node preselection apparatus can implement the processes of the method embodiment of fig. 2 in the embodiments of the present application and achieve the same beneficial effects; to avoid repetition, details are not repeated here.
Referring to fig. 5, fig. 5 is a block diagram of a pod scheduling apparatus based on Kubernetes according to an embodiment of the present application. As shown in fig. 5, the Kubernetes-based pod scheduling apparatus includes:
a second determining module 510, configured to determine, after receiving a training task, a total number of cards corresponding to the training task, and determine a number of idle cards of each node in the cluster;
the numbering module 520 is configured to sort the nodes according to their number of idle cards and assign numbers to the nodes according to the sorting result, where the numbers of nodes with the same number of idle cards correspond to the same rank number, nodes with adjacent rank numbers can communicate directly or via the fewest intermediate nodes, and the numbers of processes in the same node are continuous;
a third selecting module 530, configured to traverse the cluster according to the training task and select the X first nodes with the largest number of idle cards in the cluster, where the rank numbers of the first nodes are the same or adjacent;
a fourth selecting module 540, configured to select a second node that satisfies a remaining card number from remaining nodes in the cluster;
and the scheduling module 550 is configured to start pod scheduling according to the numbers corresponding to the first node and the second node.
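The sorting and rank-numbering performed by the numbering module 520 above can be sketched as follows (hypothetical names; the only assumption is that nodes tied on idle-card count share a rank number):

```python
def assign_rank_numbers(free_counts):
    """Sort nodes by their idle-card count (descending) and give nodes with
    the same count the same rank number, so that rank-adjacent nodes are
    the ones preferred for joint selection.

    `free_counts` maps node name -> number of idle cards (assumed layout).
    """
    ordered = sorted(free_counts.items(), key=lambda kv: kv[1], reverse=True)
    ranks, rank, prev = {}, 0, None
    for name, free in ordered:
        if free != prev:       # new idle-card level -> new rank number
            rank += 1
            prev = free
        ranks[name] = rank     # same idle-card count -> same rank number
    return ranks
```

For example, two nodes with 8 idle cards share rank 1, and a node with 4 idle cards receives rank 2.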
In one embodiment, the third selection module 530 is further configured to: traverse the cluster according to the training task and select X first nodes with N idle cards in the cluster, where X is equal to the ratio of the total number of cards to N, rounded down.
In one embodiment, the third selection module 530 is further configured to: if no node with N idle cards exists in the cluster, select X first nodes with N-1 idle cards from the cluster, where X is equal to the ratio of the total number of cards to N-1, rounded down.
In one embodiment, the fourth selection module 540 is further configured to:
and if the node with the idle card number equal to the residual card number exists in the residual nodes, selecting the node with the idle card number equal to the residual card number as the second node.
In one embodiment, the fourth selection module 540 is further configured to:
and if the node with the idle card number equal to the residual card number does not exist in the residual nodes, selecting a node with the smallest difference value between the idle card number and the residual card number from the residual nodes, and taking the node with the idle card number equal to the difference value as the second nodes, wherein the rank numbers of the second nodes are the same or adjacent.
The pod scheduling device based on Kubernetes can realize the processes of the method embodiment of fig. 3 in the embodiment of the present application and achieve the same beneficial effects, and in order to avoid repetition, the description is omitted here.
An embodiment of the present application also provides a server. Referring to fig. 6, the server may include a processor 601, a memory 602, and a program 6021 stored in the memory 602 and executable on the processor 601 to implement any of the steps of the method embodiments corresponding to fig. 2 or fig. 3 and achieve the same beneficial effects, which are not described again here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by hardware executing program instructions, where the program may be stored on a readable medium. An embodiment of the present application further provides a readable storage medium storing a computer program which, when executed by a processor, can implement any step of the method embodiments corresponding to fig. 2 and fig. 3 and achieve the same technical effects; to avoid repetition, details are not repeated here. The readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (15)

1. A Kubernetes-based node preselection method, the method comprising:
after receiving a training task, determining the total card number corresponding to the training task;
traversing the cluster according to the training task, and selecting the X first nodes with the largest number of idle cards in the cluster;
and selecting a second node which meets the residual card number from the residual nodes in the cluster.
2. The method of claim 1, wherein if the maximum number of free cards of the node is N, traversing the cluster according to the training task, selecting a first node with the maximum number of X free cards in the cluster comprises:
and traversing the cluster according to the training task, and selecting X first nodes with N idle cards in the cluster, wherein X is equal to the ratio of the total cards to N and is rounded downwards.
3. The method of claim 2, wherein traversing the cluster according to the training task selects a first node of the cluster having a largest number of X free cards, further comprising:
and if no node with N idle cards exists in the cluster, selecting X first nodes with N-1 idle cards from the cluster, wherein X is equal to the ratio of the total number of cards to N-1, rounded down.
4. The method of claim 1, wherein selecting a second node from the remaining nodes in the cluster that satisfies a remaining number of cards comprises:
and if the node with the idle card number equal to the residual card number exists in the residual nodes, selecting the node with the idle card number equal to the residual card number as the second node.
5. The method of claim 4, wherein selecting a second node from the remaining nodes in the cluster that meets a remaining number of cards further comprises:
and if the node with the idle card number equal to the residual card number does not exist in the residual nodes, selecting a node with the smallest difference value between the idle card number and the residual card number from the residual nodes, and taking the node with the idle card number equal to the difference value as the second node.
6. The method of claim 1, wherein the first node and the second node each meet a preset condition corresponding to the training task, and the preset condition includes at least one of a core number of the node, a maximum card number of the node, and a network type of the node.
7. A Kubernetes-based pod scheduling method, the method comprising:
after receiving a training task, determining the total card number corresponding to the training task, and determining the idle card number of each node in the cluster;
assigning numbers to the nodes according to the number of idle cards of each node, wherein the numbers of nodes with the same number of idle cards correspond to the same rank number, and nodes with adjacent rank numbers can communicate directly or via the fewest intermediate nodes, so that the numbers of the nodes in the same host are continuous;
traversing the cluster according to the training task, and selecting X first nodes with the largest idle card numbers in the cluster, wherein the rank numbers of the first nodes are the same or adjacent;
selecting a second node meeting the residual card number from the residual nodes in the cluster;
and starting pod scheduling according to the numbers corresponding to the first node and the second node so as to execute the training task.
8. The method of claim 7, wherein if the maximum number of free cards of the node is N, traversing the cluster according to the training task, selecting a first node with the maximum number of X free cards in the cluster comprises:
and traversing the cluster according to the training task, and selecting X first nodes with N idle cards in the cluster, wherein X is equal to the ratio of the total cards to N and is rounded downwards.
9. The method of claim 8, wherein traversing the cluster according to the training task selects a first node of the cluster having a largest number of X free cards, further comprising:
and if no node with N idle cards exists in the cluster, selecting X first nodes with N-1 idle cards from the cluster, wherein X is equal to the ratio of the total number of cards to N-1, rounded down.
10. The method of claim 7, wherein selecting a second node from the remaining nodes in the cluster that satisfies a remaining number of cards comprises:
and if the node with the idle card number equal to the residual card number exists in the residual nodes, selecting the node with the idle card number equal to the residual card number as the second node.
11. The method of claim 10, wherein selecting a second node from the remaining nodes in the cluster that meets a remaining number of cards further comprises:
and if the node with the idle card number equal to the residual card number does not exist in the residual nodes, selecting a node with the smallest difference value between the idle card number and the residual card number from the residual nodes, and taking the node with the idle card number equal to the difference value as the second nodes, wherein the rank numbers of the second nodes are the same or adjacent.
12. A Kubernetes-based node preselection apparatus comprising:
the first determining module is used for determining the total card number corresponding to the training task after receiving the training task;
the first selection module is used for traversing the cluster according to the training task and selecting the X first nodes with the largest number of idle cards in the cluster;
and the second selection module is used for selecting a second node which meets the residual card number from the residual nodes in the cluster.
13. A Kubernetes-based pod scheduling apparatus, comprising:
the second determining module is used for determining the total card number corresponding to the training task after receiving the training task and determining the idle card number of each node in the cluster;
the numbering module is used for sorting the nodes according to their number of idle cards and assigning numbers to the nodes according to the sorting result, wherein the numbers of nodes with the same number of idle cards correspond to the same rank number, and nodes with adjacent rank numbers can communicate directly or via the fewest intermediate nodes, so that the numbers of processes in the same node are continuous;
the third selection module is used for traversing the cluster according to the training task and selecting the X first nodes with the largest number of idle cards in the cluster, wherein the rank numbers of the first nodes are the same or adjacent;
a fourth selecting module, configured to select a second node that meets the remaining number of cards from remaining nodes in the cluster;
and the scheduling module is used for starting pod scheduling according to the numbers corresponding to the first node and the second node.
14. A server comprising a memory, a processor, and a program stored on the memory and executable on the processor; characterized in that the processor is configured to read a program in a memory to implement the steps in the Kubernetes-based node pre-selection method according to any one of claims 1 to 6 or the Kubernetes-based pod scheduling method according to any one of claims 7 to 11.
15. A readable storage medium storing a program, wherein the program when executed by a processor implements the steps of the Kubernetes-based node pre-selection method of any one of claims 1 to 6 or the Kubernetes-based pod scheduling method of any one of claims 7 to 11.
CN202211404741.7A 2022-11-10 2022-11-10 Node preselection method, pod scheduling method, device, server and medium Pending CN116954886A (en)

Publications (1)

Publication Number: CN116954886A; Publication Date: 2023-10-27


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination