CN113377498A - Resource scheduling method and device, electronic equipment and storage medium


Info

Publication number: CN113377498A
Application number: CN202110702487.8A
Authority: CN (China)
Prior art keywords: resource, job, containers, node, manager
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 钱瀚, 史少晨, 师锐
Current and original assignee: Beijing ByteDance Network Technology Co Ltd
Priority and filing date: 2021-06-24
Publication date: 2021-09-10

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Factory Administration (AREA)

Abstract

The disclosure relates to a resource scheduling method, a resource scheduling apparatus, an electronic device and a storage medium. The method comprises the following steps: a resource manager receives a resource request carrying the number of resource containers required for running a job and, in response, sends a resource allocation result to the application manager of the job, the result comprising node information of the nodes where the resource containers allocated for the job are located and resource container information; based on the allocation result, the application manager communicates with the node managers of the corresponding nodes to start the allocated resource containers; the application manager then judges whether the number of started resource containers meets the number of resource containers required for running the job; and if so, it sends a job running instruction to all the started resource containers to run the job. The job is therefore started only when the number of started resource containers meets the number required by the job, so that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.

Description

Resource scheduling method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a resource scheduling method, a resource scheduling apparatus, an electronic device implementing the resource scheduling method, and a computer-readable storage medium.
Background
YARN (Yet Another Resource Negotiator) is a universal resource management system that can provide resource management and allocation for various computing frameworks such as MapReduce (a programming model) and Spark (a fast, general-purpose computing engine designed for large-scale data processing). YARN consists of an RM (Resource Manager) and multiple NMs (Node Managers), where the RM is responsible for managing and scheduling the resources on each NM. A Container is the resource abstraction in YARN; it encapsulates multi-dimensional resources on one NM, such as memory, CPU (Central Processing Unit), disk, and network. An application applies to the RM for resources, the RM selects appropriate NMs for allocation, and Containers are started on those NMs to perform the corresponding job.
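For illustration only, this request path can be sketched against the public Hadoop YARN Java client API. This is a minimal sketch under assumed names and sizes (host name, port, memory and core counts), not part of the disclosure:

```java
// Illustrative sketch: an application-side manager asks the RM for a Container
// via the Hadoop YARN client API (Hadoop 2.x/3.x assumed; error handling omitted).
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRmSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        // Register this application's manager with the RM.
        rm.registerApplicationMaster("am-host", 0, "");

        // One Container = a multi-dimensional resource slice on one NM.
        Resource capability = Resource.newInstance(10 * 1024, 10); // 10 GB, 10 vCores
        rm.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));
        // Allocated Containers are returned by subsequent rm.allocate(...) calls.
    }
}
```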
A particular class of computing frameworks is machine learning frameworks such as TensorFlow (a symbolic math library based on dataflow programming) and PyTorch (an open-source machine learning library), which can also run on YARN. A common requirement of machine learning frameworks is rigid scheduling (often called gang scheduling), i.e., a job can run only after it has acquired all of its resources.
However, rigid scheduling cannot currently be achieved on YARN. In particular, when multiple such jobs need to run at the same time, YARN's current resource allocation strategy may leave each job holding only part of its resources, so that none of the jobs can run normally.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a resource scheduling method, a resource scheduling apparatus, an electronic device implementing the resource scheduling method, and a computer-readable storage medium, so as to implement rigid scheduling on YARN.
In a first aspect, the present disclosure provides a resource scheduling method, including:
a resource manager receives a resource request, wherein the resource request carries the number of resource containers required for running a job;
the resource manager responds to the resource request and sends a resource allocation result to the application manager corresponding to the job, wherein the resource allocation result comprises node information of the node where a resource container allocated for the job is located and resource container information;
the application manager sends the resource container information to the node manager of the node indicated by the node information, so that the node manager starts the allocated resource container;
the application manager judges whether the number of started resource containers meets the number of resource containers required for running the job;
and if so, the application manager sends a job running instruction to all the started resource containers to run the job.
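A minimal sketch of the judgment in the last two steps follows. The helper names (startedContainers, requiredContainers, sendRunCommand) are hypothetical placeholders for the sketch and are not part of the disclosure:

```java
// Sketch of the rigid-scheduling check: the job run command is sent only once
// every required Container has been started. Helper names are hypothetical.
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;

class GangLaunchCheck {
    private final int requiredContainers;            // number needed to run the job
    private final List<Container> startedContainers; // Containers the NMs have started

    GangLaunchCheck(int required, List<Container> started) {
        this.requiredContainers = required;
        this.startedContainers = started;
    }

    /** Returns true (and triggers the job) only when all Containers are up. */
    boolean maybeRunJob() {
        if (startedContainers.size() < requiredContainers) {
            return false;            // keep waiting / keep requesting resources
        }
        for (Container c : startedContainers) {
            sendRunCommand(c);       // hypothetical: notify each started Container
        }
        return true;
    }

    private void sendRunCommand(Container c) { /* AM-to-Container signalling */ }
}
```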
Optionally, in some embodiments of the present disclosure, the method further includes:
when the number of started resource containers does not meet the number of resource containers required for running the job, the application manager sends a new resource request to the resource manager;
the resource manager responds to the new resource request and returns a new resource allocation result to the application manager, wherein the new resource allocation result comprises node information of the node where a new resource container allocated for the job is located and new resource container information;
the application manager sends the new resource container information to the node manager of the target node indicated by the node information of the node where the new resource container is located, so that the node manager of the target node starts the allocated new resource container;
the application manager judges whether the number of resource containers started at the current moment meets the number of resource containers required for running the job;
and if so, the application manager sends the job running instruction to all the resource containers started at the current moment.
Optionally, in some embodiments of the present disclosure, the method further includes:
and if the application manager judges that the number of resource containers started at the current moment still does not meet the number of resource containers required for running the job, returning to the step in which the application manager sends a new resource request to the resource manager, until the number of resource containers required for running the job is met.
Optionally, in some embodiments of the present disclosure, the method further includes:
and if, after a preset time period, the number of resource containers started for the job still does not meet the number of resource containers required for running the job, the application manager releases all the resource containers started for the job and exits.
Optionally, in some embodiments of the present disclosure, the method further includes:
the application manager periodically sends heartbeat information to the node manager, wherein the heartbeat information carries a message indicating whether all the allocated resource containers have been started;
the step in which the application manager sends the job running instruction to all the started resource containers comprises:
carrying, in the heartbeat information sent to the node manager, the message that all the allocated resource containers have been started, so that all the started resource containers run the job.
In a second aspect, an embodiment of the present disclosure provides an apparatus for scheduling resources, where the apparatus includes:
the request receiving module is used for enabling the resource manager to receive a resource request, wherein the resource request carries the number of resource containers required for running a job;
the first resource allocation module is used for enabling the resource manager to respond to the resource request and send a resource allocation result to the application manager corresponding to the job, wherein the resource allocation result comprises node information of the node where a resource container allocated for the job is located and resource container information;
a resource starting module, configured to enable the application manager to send the resource container information to the node manager of the node indicated by the node information, so that the node manager starts the allocated resource container;
a resource judging module, configured to enable the application manager to judge whether the number of started resource containers meets the number of resource containers required for running the job;
and a job running module, configured to enable the application manager, when the judgment result of the resource judging module is that the number is met, to send a job running instruction to all the started resource containers to run the job.
Optionally, in some embodiments of the present disclosure, the apparatus further includes:
the resource application module is used for enabling the application manager to send a new resource request to the resource manager when the number of started resource containers does not meet the number of resource containers required for running the job;
a second resource allocation module, configured to enable the resource manager to respond to the new resource request and return a new resource allocation result to the application manager, wherein the new resource allocation result comprises node information of the node where a new resource container allocated for the job is located and new resource container information;
the resource starting module is further configured to enable the application manager to send the new resource container information to the node manager of the target node indicated by the node information of the node where the new resource container is located, so that the node manager of the target node starts the allocated new resource container;
the resource judging module is further configured to enable the application manager to judge whether the number of resource containers started at the current moment meets the number of resource containers required for running the job;
and the job running module is further configured to enable the application manager to send the job running instruction to all the resource containers started at the current moment when the judgment result of the resource judging module at the current moment is that the number is met.
Optionally, in some embodiments of the present disclosure, the apparatus further includes:
and a resource application control module, configured to trigger the resource application module to send a new resource request to the resource manager if the application manager judges that the number of resource containers started at the current moment still does not meet the number of resource containers required for running the job, the process ending once the number of resource containers required for running the job is met.
Optionally, in some embodiments of the present disclosure, the apparatus further includes:
and a resource release module, configured to enable the application manager to release all the resource containers started for the job and exit if, after a preset time period, the number of resource containers started for the job still does not meet the number of resource containers required for running the job.
Optionally, in some embodiments of the present disclosure, the apparatus further includes:
a heartbeat information sending module, configured to enable the application manager to periodically send heartbeat information to the node manager, where the heartbeat information carries a message indicating whether all the allocated resource containers have been started;
the job running module is configured to enable the application manager to send the job running instruction to all the started resource containers, specifically by:
carrying, in the heartbeat information sent to the node manager, the message that all the allocated resource containers have been started, so that all the started resource containers run the job.
In a third aspect, the present disclosure provides an electronic device, including a processor and a storage medium, where the storage medium stores executable instructions capable of being executed by the processor, and the processor is caused by the executable instructions to implement the resource scheduling method according to the embodiment provided in the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing executable instructions that, when invoked and executed by a processor, implement a resource scheduling method as provided by the first aspect of the present disclosure.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the resource scheduling method, the resource scheduling device, the electronic device and the storage medium provided by the embodiment of the disclosure, when the application manager corresponding to the job judges that the number of the started resource containers allocated to the job meets the number of the resource containers required by the job running, the application manager sends a job running instruction to all the started resource containers to run the job. Therefore, only when the number of the started resource containers meets the requirement of the job on the number of the job resource containers, all the started resource containers are informed to run the job, so that the job can be rigidly pulled up after all the resources are taken by the job, and rigid scheduling on the YARN is realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to illustrate the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a resource scheduling method according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a resource scheduling method according to another embodiment of the disclosure;
fig. 3 is an interaction flowchart under the YARN architecture of the embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a resource scheduling apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In a YARN cluster system, a user can submit a job, such as a machine learning framework job, through a Client, and the resource manager RM creates an application manager AM (the ApplicationMaster in YARN terminology) for managing the job submitted by the user. The AM applies to the resource manager RM for resources for the job; the RM allocates resource containers on the nodes to the AM according to the resource usage information of each node; after obtaining the resource container allocation result from the RM, the AM communicates with the node manager NM of each corresponding node and starts running the subtasks of the job on those nodes, where a job generally comprises a plurality of subtasks. With this, the job completes the full flow from submission to execution.
The embodiment of the disclosure provides a resource scheduling method, a resource scheduling device, an electronic device for implementing the resource scheduling method and a computer readable storage medium. Next, a resource scheduling method provided in the embodiment of the present disclosure is first described.
The resource scheduling method provided by the embodiment of the disclosure can be applied to the YARN cluster system. The resource scheduling method provided by the embodiment of the present disclosure may be implemented by at least one of software, a hardware circuit, and a logic circuit.
As shown in fig. 1, an embodiment of the present disclosure provides a resource scheduling method, which may include the following steps:
step S101: the resource manager receives a resource request, wherein the resource request carries the number of resource containers required by operation.
Illustratively, a job may be a machine learning training task, but is not limited thereto. The user describes how many subtasks the machine learning training task needs to pull up and how many containers each subtask needs, from which the number of resource Containers required for running the job is determined; the required number of Containers can be submitted to the resource manager RM through the client together with the job, and the RM then creates a corresponding application manager AM for the job.
The AM is configured to manage resource scheduling, operation management, and the like for the job. Based on the number of resource containers required for running the job, the AM constructs a resource request and sends it to the RM; that is, the RM receives the resource request. This can be understood with reference to the prior art and is not described in detail here.
For example, suppose the machine learning training requires 8 subtasks, and each Worker requires 10 GB of memory, 10 CPU cores, and 1 GPU (Graphics Processing Unit). That is, to run the machine learning training task normally, 8 Containers are needed in total, where one Container corresponds to the above 10 GB of memory, 10 CPU cores, and 1 GPU.
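As a sketch only, the per-Worker specification in this example could be expressed as follows. It assumes the Hadoop 3.x Java client API with the "yarn.io/gpu" resource type enabled on the cluster; none of these names are part of the disclosure:

```java
// Sketch of the per-Worker Container specification from the example above
// (8 Containers, each 10 GB memory, 10 vCores, 1 GPU). The GPU line assumes
// a Hadoop 3.x cluster with the "yarn.io/gpu" resource type configured.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class WorkerSpec {
    static final int NUM_WORKERS = 8;

    static void requestWorkers(AMRMClient<ContainerRequest> rm) {
        Resource perWorker = Resource.newInstance(10 * 1024, 10); // 10 GB, 10 cores
        perWorker.setResourceValue("yarn.io/gpu", 1);             // 1 GPU per Worker
        for (int i = 0; i < NUM_WORKERS; i++) {
            rm.addContainerRequest(
                    new ContainerRequest(perWorker, null, null, Priority.newInstance(0)));
        }
    }
}
```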
Step S102: and the resource manager responds to the resource request and sends a resource allocation result to the application manager corresponding to the job, wherein the resource allocation result comprises node information of a node where a resource container allocated for the job is located and resource container information.
Specifically, the AM may send a resource request to the RM to apply for resources according to the requirement of the machine learning training task, that is, the required number of resource containers, and the RM allocates the resources and returns a resource allocation result, such as a Container allocation result, to the AM. The Container allocation result includes node information of the node where each Container is located and the Container information. The node information may be, for example, a node name or an IP address, and the Container information may be, but is not limited to, Container identification information. For example, the allocated Containers are on a node named "node 1", and the Container identification information denotes the Containers numbered 1 to 8 on node 1; this is only an illustration and not a specific limitation, and the allocated Containers may also be located on different nodes.
Step S103: and the application manager sends the resource container information to the node manager of the node indicated by the node information so as to enable the node manager to start the allocated resource container.
Specifically, as an example, the AM may send the Container identification information, such as the numbers 1 to 8, to the NM of node 1, and the NM of node 1 starts the 8 Containers based on the numbers 1 to 8.
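A sketch of this step with the Hadoop YARN Java client API follows; the empty launch contexts and the placeholder command are illustrative assumptions, not the disclosure's actual launch command:

```java
// Sketch: the AM asks each NM to start its allocated Containers. The launch
// command below is a placeholder; in the disclosure, the Container then waits
// for the job running instruction before doing real work.
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;

class ContainerStarter {
    static void startAll(NMClient nm, List<Container> allocated) throws Exception {
        for (Container container : allocated) {
            ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),                      // local resources
                    Collections.emptyMap(),                      // environment
                    Collections.singletonList("sleep infinity"), // placeholder command
                    null, null, null);
            nm.startContainer(container, ctx);  // the NM starts the Container
        }
    }
}
```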
Step S104: and the application manager judges whether the number of the started resource containers meets the number of the resource containers required by the operation of the operation.
Specifically, the RM may not be able to allocate enough resources at once, due to limitations such as the amount of currently available resources. The AM therefore needs to determine whether the started Containers, for example the 8 Containers on node 1, satisfy the number of resource Containers needed for the job to run.
Step S105: and if so, the application manager sends a job operation instruction to all the started resource containers to operate the job.
Specifically, when the started Containers, for example the 8 Containers above, satisfy the number of resource Containers required for the job to run, the AM sends a job running instruction to all the started resource Containers to notify them to run the job.
In the resource scheduling method provided by the foregoing embodiment of the present disclosure, when it is determined that the number of started Containers for the job satisfies the number of Containers required for running the job, the AM corresponding to the job sends a job running instruction to all the started Containers to run the job. All the started Containers are therefore notified to run the job only when their number meets the job's requirement, so that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
Optionally, in some embodiments of the present disclosure, referring to fig. 2 in combination, the method may further include the steps of:
step S201: and when the AM judges that the number of the started resource containers does not meet the number of the resource containers required by the operation of the job, the AM sends a new resource request to the resource manager RM.
Specifically, for example, the resource allocation result returned by the RM includes, for example, 1 Container on the node 1, which indicates that 7 containers are needed to normally run the task of machine learning training. After, for example, starting 1 Container in step S103, the AM determines that the started 1 Container does not satisfy the number 8 of resource Container containers required for the job running, and thus may send a new resource request to the resource manager RM.
Step S202: and the resource manager RM responds to the new resource request and returns a new resource allocation result to the AM, wherein the new resource allocation result comprises node information of a node where a new resource container allocated for the job is located and new resource container information.
Specifically, the RM continues to allocate Containers based on the new resource request, and returns the new Container allocation result to the AM. Illustratively, the new Container allocation result includes node information of the node where each new Container is located and the new Container information. The node information may be, for example, a node name or an IP address, and the new Container information may be, but is not limited to, Container identification information. For example, the newly allocated Containers are on a node named "node 2", and the Container identification information denotes the Containers numbered 2 to 8 on node 2; this is only an illustration and not a specific limitation, and the allocated Containers may also be located on different nodes.
Step S203: and the AM sends the new resource container information to the NM of the target node indicated by the node information of the node where the new resource container is located, so that the NM of the target node starts the allocated new resource container.
As an example, the AM sends the new resource Container information, such as the numbers 2 to 8, to the NM of node 2, and the NM of node 2 starts the 7 resource Containers indicated by the numbers 2 to 8.
Step S204: and the AM judges whether the number of the resource containers started at the current moment meets the number of the resource containers required by the operation.
Specifically, once the NM of node 2 has started the 7 resource Containers, adding the 1 resource Container previously started on node 1 gives 8 started resource Containers in total, and the AM judges whether these 8 resource Containers now satisfy the number of resource Containers required for running the job.
Step S205: and if so, the AM sends the operation running instruction to all the started resource containers at the current moment.
Specifically, since the number of resource Containers required for running the job, namely 8, is now satisfied, the AM sends a job running instruction to all the started resource Containers, i.e., the 1 Container on node 1 and the 7 resource Containers started on node 2, to start running the job.
In the above solution of this embodiment, if the AM determines that the number of started Containers for the job cannot satisfy the number of Containers required for running the job, the AM requests the RM to continue allocating Containers to the job. After receiving each further Container allocation result returned by the RM, the AM communicates with the corresponding NM so that the NM starts the newly allocated Containers. Only when the number of started Containers meets the number required for running the job does the AM send the job running command to all the started Containers to run the job. It can thus be guaranteed that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
Optionally, on the basis of the above embodiments, in some embodiments of the present disclosure, the method may further include: if the AM determines that the number of resource Containers started at the current moment still does not satisfy the number of resource Containers required for running the job, the flow returns to the step of sending a new resource request to the resource manager RM in step S201 (step S205 being skipped for the time being), until the number of resource Containers required for running the job is satisfied. In this way, the job running command is sent to all the started Containers only once the number of started Containers meets the job's requirement, so that the job is not pulled up until it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
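A consolidated sketch of this request-start-judge loop follows, again using the Hadoop client API; the startOne(...) helper and the back-off interval are hypothetical assumptions:

```java
// Sketch of the allocate-until-satisfied loop from steps S201-S205: keep
// heartbeating the RM, start whatever new Containers arrive, and return only
// once the required count is reached (the run command is then sent).
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class AllocateLoop {
    static List<Container> waitForAll(AMRMClient<ContainerRequest> rm,
                                      int required) throws Exception {
        List<Container> started = new ArrayList<>();
        while (started.size() < required) {
            AllocateResponse resp = rm.allocate(0.0f);  // heartbeat / new request
            for (Container c : resp.getAllocatedContainers()) {
                startOne(c);                            // via the node's NM
                started.add(c);
            }
            Thread.sleep(1000);                         // back off between rounds
        }
        return started;                                 // now send the run command
    }

    private static void startOne(Container c) { /* see ContainerStarter above */ }
}
```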
Optionally, in some embodiments of the present disclosure, the method may further include: if, after a preset time period, the number of started resource Containers for the job still does not meet the number of resource Containers required for running the job, the AM releases all the resource Containers started for the job and exits.
For example, the preset time period may be set as needed and is not limited here. Specifically, the AM may record the duration for which the job has been applying to the RM for resources; if this duration exceeds the preset time period and the number of started Containers for the job still does not satisfy the number of Containers required for running the job, this indicates that the job cannot be executed normally within a short time, so the Containers started for it are released rather than held idle.
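The timeout fallback can be sketched as follows; the deadline bookkeeping and the failure message are illustrative assumptions, while releaseAssignedContainer and unregisterApplicationMaster are public Hadoop client calls:

```java
// Sketch of the timeout fallback: if the required count is not reached within
// a preset period, release every started Container and exit the application.
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

class GangTimeout {
    static void enforce(AMRMClient<ContainerRequest> rm, List<Container> started,
                        int required, long deadlineMillis) throws Exception {
        if (System.currentTimeMillis() > deadlineMillis && started.size() < required) {
            for (Container c : started) {
                rm.releaseAssignedContainer(c.getId());   // give resources back
            }
            rm.unregisterApplicationMaster(
                    FinalApplicationStatus.FAILED, "rigid-scheduling timeout", null);
        }
    }
}
```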
Optionally, in some embodiments of the present disclosure, the method may further include: the AM periodically sends heartbeat information to the NM, wherein the heartbeat information carries a message indicating whether the allocated resource Containers have all been started. Correspondingly, the step in step S105 in which the application manager sends a job running instruction to all the started resource containers may specifically include: carrying, in the heartbeat information sent to the NM, the message that all the allocated resource Containers have been started, so that all the started resource Containers run the job.
Specifically, if the started resource Containers are located on different nodes, heartbeats are maintained between the AM and the NMs of those nodes, and heartbeat information is periodically sent to the NM of each node; the heartbeat information may carry a message indicating whether all the Containers allocated for the job have been started, so that each NM can know whether all the Containers required by the job are up. On this basis, in practical applications, when it is determined that the number of started Containers for the job meets the number of Containers required for running the job, the AM need not send a separate run command; instead, it carries the information that the Containers have all been started in the heartbeat information sent to the NMs, and after receiving this heartbeat information, each NM knows that all the Containers required by the job have been started and then starts to run the job. In this way, processing efficiency can be improved to some extent.
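Since this AM-to-NM heartbeat is the disclosure's own mechanism rather than a stock Hadoop API, the message can only be sketched hypothetically; the shape and transport below are assumptions for illustration:

```java
// Hypothetical sketch of the AM-to-NM heartbeat message described above.
public class AmToNmHeartbeat {
    public final String jobId;
    public final boolean allContainersStarted; // true once every required Container is up

    public AmToNmHeartbeat(String jobId, boolean allContainersStarted) {
        this.jobId = jobId;
        this.allContainersStarted = allContainersStarted;
    }
    // Sent periodically to every NM hosting one of the job's Containers; when
    // allContainersStarted first becomes true, the heartbeat itself doubles as
    // the job running instruction, so no separate run command is needed.
}
```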
The overall architecture of the embodiment of the present disclosure is shown in Fig. 3. A user can submit a job and its job configuration information to the RM (the resource manager in Fig. 3) through a Client (the client in Fig. 3), where the job configuration information includes the number of subtasks of the job and the number of Containers required. The RM creates the AM (the application manager in Fig. 3) for the job. The AM acquires the job configuration information and then sends a resource request to the RM, the resource request carrying the number of Containers required for running the job. The RM allocates resources accordingly and returns a resource allocation result to the AM, the result carrying the information of the Containers allocated to the job and the node information of the nodes where those Containers are located. According to the node information, the AM sends the corresponding Container information to the node managers NM (such as node managers 1 and 2 in Fig. 3) of the indicated nodes, and after receiving the Container information, each NM starts the corresponding Containers (the resource containers in Fig. 3) on its node. In this embodiment, when determining that the number of started Containers for the job satisfies the number of Containers required for running the job, the AM sends a job running instruction to all the started Containers to execute the job. All the started Containers are therefore notified to run the job only when their number meets the job's requirement, so that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
Based on the same inventive concept as the method embodiment described above, the embodiment of the present disclosure further provides a resource scheduling apparatus, as shown in fig. 4, the apparatus may include a request receiving module 401, a first resource allocating module 402, a resource starting module 403, a resource judging module 404, and a job running module 405:
the request receiving module 401 is configured to enable the resource manager RM to receive a resource request, where the resource request carries a resource Container number required by job running.
A first resource allocation module 402, configured to enable the resource manager to respond to the resource request, and send a resource allocation result to the application manager AM corresponding to the job, where the resource allocation result includes node information of a node where a resource container allocated for the job is located and resource container information.
A resource starting module 403, configured to enable the application manager to send the resource container information to the node manager NM of the node indicated by the node information, so that the node manager starts the allocated resource container.
A resource determining module 404, configured to enable the application manager to determine whether the number of the started resource containers meets the number of the resource containers required by the job operation;
a job running module 405, configured to, when the determination result of the resource determining module 404 is satisfied, enable the application manager to send a job running instruction to all the started resource containers, so as to run the job.
Optionally, in some embodiments of the present disclosure, the apparatus may further include a resource application module and a second resource allocation module. The resource application module is configured to enable the application manager to send a new resource request to the resource manager when the number of started resource containers does not meet the number of resource containers required for running the job. The second resource allocation module is configured to enable the resource manager to respond to the new resource request and return a new resource allocation result to the application manager, where the new resource allocation result includes node information of the node where a new resource container allocated for the job is located and new resource container information. The resource starting module is further configured to enable the application manager to send the new resource container information to the node manager of the target node indicated by the node information of the node where the new resource container is located, so that the node manager of the target node starts the allocated new resource container. The resource judging module is further configured to enable the application manager to judge whether the number of resource containers started at the current moment meets the number of resource containers required for running the job. The job running module is further configured to enable the application manager to send the job running instruction to all the resource containers started at the current moment when the judgment result of the resource judging module at the current moment is that the number is met.
Optionally, in some embodiments of the present disclosure, the apparatus further includes a resource application control module, configured to trigger the resource application module to send a new resource request to the resource manager if the application manager judges that the number of resource containers started at the current moment still does not meet the number of resource containers required for running the job, the process ending once the number of resource containers required for running the job is met.
Optionally, in some embodiments of the present disclosure, the apparatus further includes a resource release module, configured to enable the application manager to release all the resource containers started for the job and exit if, after a preset time period, the number of resource containers started for the job still does not satisfy the number of resource containers required for running the job.
Optionally, in some embodiments of the present disclosure, the apparatus further includes a heartbeat information sending module, configured to enable the application manager to periodically send heartbeat information to the node manager, where the heartbeat information carries a message indicating whether all the allocated resource containers have been started. The job running module is configured to enable the application manager to send the job running instruction to all the started resource containers, specifically by carrying, in the heartbeat information sent to the node manager, the message that all the allocated resource containers have been started, so that all the started resource containers run the job.
By applying the solution of the embodiment of the present disclosure, when the AM judges that the number of started Containers for the job meets the number of Containers required for running the job, the AM sends the job running instruction to all the started Containers to run the job. All the started Containers are therefore notified to run the job only when their number meets the job's requirement, so that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
And if the AM judges that the number of started Containers cannot meet the number of Containers required for running the job, the AM requests the RM to continue allocating Containers to the job, and after receiving each further Container allocation result returned by the RM, communicates with the corresponding NM so that the NM starts the newly allocated Containers. Only when the number of started Containers meets the number required for running the job does the AM send the job running command to all the started Containers to run the job, so that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
The disclosed embodiment also provides an electronic device, as shown in fig. 5, the electronic device 50 may include a processor 501 and a storage medium 502, the storage medium 502 stores executable instructions capable of being executed by the processor 501, and the processor 501 is caused by the executable instructions to implement: the resource scheduling method provided by the embodiment of the present disclosure is described above.
By applying the solution of the embodiment of the present disclosure, when the AM judges that the number of started Containers for the job meets the number of Containers required for running the job, the AM sends the job running instruction to all the started Containers to run the job. All the started Containers are therefore notified to run the job only when their number meets the job's requirement, so that the job is pulled up only after it has obtained all of its resources, thereby realizing rigid scheduling on YARN.
The storage medium may include a RAM (Random Access Memory) or an NVM (Non-volatile Memory), such as at least one disk Memory. In the alternative, the storage medium may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor including a CPU, an NP (Network Processor), and the like; but also a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
The storage medium 502 and the processor 501 may be connected by wire or wirelessly for data transmission, and the electronic device may communicate with other devices through a wired or wireless communication interface. Fig. 5 shows only an example of data transmission through a bus; the connection manner is not limited to any specific one.
Additionally, the disclosed embodiments provide a computer-readable storage medium storing executable instructions that, when invoked and executed by a processor, implement: the resource scheduling method provided by the embodiment of the present disclosure is described above.
There is also provided in an embodiment of the present disclosure a computer program product containing instructions which, when run on a computer, cause the computer to perform: the resource scheduling method provided by the embodiment of the present disclosure is described above.
As for the resource scheduling apparatus, electronic device and storage medium embodiments, since their content is substantially similar to that of the foregoing method embodiments, the description is relatively brief, and for relevant details reference may be made to the partial description of the method embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the disclosure are, in whole or in part, generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber, DSL (Digital Subscriber Line)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD (Digital Versatile Disk)), or a semiconductor medium (e.g., a SSD (Solid State Disk)), etc.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for scheduling resources, the method comprising:
a resource manager receives a resource request, wherein the resource request carries the number of resource containers required for running a job;
the resource manager responds to the resource request and sends a resource allocation result to the application manager corresponding to the job, wherein the resource allocation result comprises node information of the node where a resource container allocated for the job is located and resource container information;
the application manager sends the resource container information to the node manager of the node indicated by the node information, so that the node manager starts the allocated resource container;
the application manager judges whether the number of started resource containers meets the number of resource containers required for running the job;
and if so, the application manager sends a job running instruction to all the started resource containers to run the job.
2. The method of claim 1, further comprising:
when the number of started resource containers does not meet the number of resource containers required for running the job, the application manager sends a new resource request to the resource manager;
the resource manager responds to the new resource request and returns a new resource allocation result to the application manager, wherein the new resource allocation result comprises node information of the node where a new resource container allocated for the job is located and new resource container information;
the application manager sends the new resource container information to the node manager of the target node indicated by the node information of the node where the new resource container is located, so that the node manager of the target node starts the allocated new resource container;
the application manager judges whether the number of resource containers started at the current moment meets the number of resource containers required for running the job;
and if so, the application manager sends the job running instruction to all the resource containers started at the current moment.
3. The method of claim 2, further comprising:
and if the application manager judges that the number of resource containers started at the current moment still does not meet the number of resource containers required for running the job, returning to the step in which the application manager sends a new resource request to the resource manager, until the number of resource containers required for running the job is met.
4. The method according to any one of claims 1 to 3, further comprising:
and if, after a preset time period, the number of resource containers started for the job still does not meet the number of resource containers required for running the job, the application manager releases all the resource containers started for the job and exits.
5. The method according to any one of claims 1 to 3, further comprising:
the application manager periodically sends heartbeat information to the node manager, wherein the heartbeat information carries a message indicating whether all the allocated resource containers have been started;
the step in which the application manager sends the job running instruction to all the started resource containers comprises:
carrying, in the heartbeat information sent to the node manager, the message that all the allocated resource containers have been started, so that all the started resource containers run the job.
6. An apparatus for scheduling resources, the apparatus comprising:
the request receiving module is used for enabling the resource manager to receive a resource request, wherein the resource request carries the number of resource containers required for running a job;
the first resource allocation module is used for enabling the resource manager to respond to the resource request and send a resource allocation result to the application manager corresponding to the job, wherein the resource allocation result comprises node information of the node where a resource container allocated for the job is located and resource container information;
a resource starting module, configured to enable the application manager to send the resource container information to the node manager of the node indicated by the node information, so that the node manager starts the allocated resource container;
a resource judging module, configured to enable the application manager to judge whether the number of started resource containers meets the number of resource containers required for running the job;
and a job running module, configured to enable the application manager, when the judgment result of the resource judging module is that the number is met, to send a job running instruction to all the started resource containers to run the job.
7. The apparatus of claim 6, further comprising:
the resource application module is used for enabling the application manager to send a new resource request to the resource manager when the number of started resource containers does not meet the number of resource containers required for running the job;
a second resource allocation module, configured to enable the resource manager to respond to the new resource request and return a new resource allocation result to the application manager, wherein the new resource allocation result comprises node information of the node where a new resource container allocated for the job is located and new resource container information;
the resource starting module is further configured to enable the application manager to send the new resource container information to the node manager of the target node indicated by the node information of the node where the new resource container is located, so that the node manager of the target node starts the allocated new resource container;
the resource judging module is further configured to enable the application manager to judge whether the number of resource containers started at the current moment meets the number of resource containers required for running the job;
and the job running module is further configured to enable the application manager to send the job running instruction to all the resource containers started at the current moment when the judgment result of the resource judging module at the current moment is that the number is met.
8. The apparatus of claim 7, further comprising:
and a resource application control module, configured to trigger the resource application module to send a new resource request to the resource manager if the application manager judges that the number of resource containers started at the current moment still does not meet the number of resource containers required for running the job, the process ending once the number of resource containers required for running the job is met.
9. The apparatus of any one of claims 6 to 8, further comprising:
and a resource release module, configured to enable the application manager to release all the resource containers started for the job and exit if, after a preset time period, the number of resource containers started for the job still does not meet the number of resource containers required for running the job.
10. The apparatus of any one of claims 6 to 8, further comprising:
a heartbeat information sending module, configured to enable the application manager to periodically send heartbeat information to the node manager, where the heartbeat information carries a message indicating whether all the allocated resource containers have been started;
the job running module is configured to enable the application manager to send the job running instruction to all the started resource containers, specifically by:
carrying, in the heartbeat information sent to the node manager, the message that all the allocated resource containers have been started, so that all the started resource containers run the job.
11. An electronic device comprising a processor and a storage medium storing executable instructions executable by the processor, the processor being caused by the executable instructions to implement the resource scheduling method of any one of claims 1-5.
12. A computer-readable storage medium storing executable instructions that, when invoked and executed by a processor, implement the resource scheduling method of any one of claims 1-5.
CN202110702487.8A, filed 2021-06-24 (priority date 2021-06-24): Resource scheduling method and device, electronic equipment and storage medium. Status: Pending. Published as CN113377498A (en).

Priority Applications (1)

Application number: CN202110702487.8A
Priority date / filing date: 2021-06-24
Title: Resource scheduling method and device, electronic equipment and storage medium

Publications (1)

CN113377498A, published 2021-09-10

Family ID: 77578996

Family Applications (1)

CN202110702487.8A (priority and filing date 2021-06-24), pending: Resource scheduling method and device, electronic equipment and storage medium

Country Status (1)

CN: CN113377498A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579298A (en) * 2022-01-27 2022-06-03 浙江大华技术股份有限公司 Resource management method, resource manager, and computer-readable storage medium
CN116643880A (en) * 2023-05-06 2023-08-25 上海楷领科技有限公司 Cluster node processing method, system, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203424A (en) * 2017-04-17 2017-09-26 北京奇虎科技有限公司 A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies
CN108681777A (en) * 2018-05-07 2018-10-19 北京京东尚科信息技术有限公司 A kind of method and apparatus of the machine learning program operation based on distributed system
CN108737270A (en) * 2018-05-07 2018-11-02 北京京东尚科信息技术有限公司 A kind of method for managing resource and device of server cluster
CN109117252A (en) * 2017-06-26 2019-01-01 北京京东尚科信息技术有限公司 Method, system and the container cluster management system of task processing based on container
US20200183751A1 (en) * 2018-12-06 2020-06-11 International Business Machines Corporation Handling expiration of resources allocated by a resource manager running a data integration job


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination