WO2020233120A1 - Scheduling method, apparatus and related device - Google Patents

Scheduling method, apparatus and related device

Info

Publication number
WO2020233120A1
WO2020233120A1 (PCT/CN2019/128545)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
scheduling
central
task
scheduling request
Prior art date
Application number
PCT/CN2019/128545
Other languages
English (en)
French (fr)
Inventor
刘志飘
周明耀
邓慧财
李哲
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP19929870.4A (EP3866441B1)
Publication of WO2020233120A1
Priority to US17/530,560 (US20220075653A1)


Classifications

    • H04L67/1008: Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H04L67/1012: Server selection for load balancing based on compliance of requirements or conditions with available server resources
    • H04L67/1036: Load balancing of requests to servers for services different from user content provisioning, e.g. load balancing across domain name servers
    • H04L67/60: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/63: Routing a service request depending on the request content or context
    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/5038: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F2209/506: Constraint (indexing scheme relating to resource allocation)

Definitions

  • The present invention relates to the field of information technology, and in particular to a scheduling method, apparatus, and related device.
  • A distributed system is a unified computer system composed of multiple physically dispersed computers connected through an interconnection network. The physical and logical resources of the computers cooperate with one another while remaining highly autonomous; the system can manage resources and share data across the whole system, dynamically allocate tasks and functions, and run distributed programs in parallel. It emphasizes the overall distribution of resources, tasks, functions, data, and control: they are distributed across the physically dispersed computer nodes, which communicate with one another through the interconnection network to form a unified processing system.
  • Current communication technology is based on communication inside a single cluster (that is, a single distributed system), so dynamic task allocation and resource scheduling can only be performed within one cluster.
  • Embodiments of the present invention disclose a scheduling method, apparatus, and related device that can dispatch tasks across clusters, realize resource sharing between different clusters, and improve resource utilization.
  • The present application provides a scheduling method. The method includes: a central cluster receives a scheduling request sent by a first cluster and determines a second cluster that satisfies the scheduling request; the central cluster then instructs the first cluster to use the second cluster to perform the task.
  • The central cluster serves as a unified scheduling and control center and performs unified scheduling after receiving the scheduling request sent by the first cluster. Specifically, the central cluster determines, among all the clusters it manages, a second cluster that meets the scheduling request; after determining the second cluster, the central cluster instructs the first cluster to use the resources in the second cluster to perform the task. In this way, tasks can be scheduled across clusters, that is, a task originally to be executed by the first cluster can be executed by the second cluster, which realizes resource sharing between different clusters and improves resource utilization.
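  • As a concrete illustration of this decision logic, the following is a minimal Python sketch (not part of the patent; the registry layout, field names, and the selection rule are assumptions based on the description in this application):

```python
# Minimal sketch of the central cluster's scheduling decision.
# All data structures and field names here are illustrative assumptions.

class CentralScheduler:
    def __init__(self):
        # cluster_id -> {"address": "ip:port", "remaining": {"cpu": ..., "memory_gb": ...}}
        self.clusters = {}
        # (image_name, image_version) -> {grantee_cluster_id: [owner_cluster_id, ...]}
        self.image_grants = {}

    def handle_scheduling_request(self, request):
        """Determine a second cluster that satisfies the request; return its address and ID."""
        image_key = (request["image_name"], request["image_version"])
        # Clusters that have authorized this image for use by the requesting (first) cluster.
        candidates = self.image_grants.get(image_key, {}).get(request["from_cluster"], [])
        # Keep only clusters whose remaining resources cover what the task needs.
        feasible = [
            cid for cid in candidates
            if all(self.clusters[cid]["remaining"].get(res, 0) >= amount
                   for res, amount in request["resources"].items())
        ]
        if not feasible:
            return None  # no managed cluster satisfies the scheduling request
        # One possible rule mentioned later: pick the cluster with the most remaining resources.
        second = max(feasible, key=lambda cid: sum(self.clusters[cid]["remaining"].values()))
        return {"cluster_id": second, "address": self.clusters[second]["address"]}
```

  • The first cluster would then use the returned address and identifier to contact the second cluster directly.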
  • The scheduling request is generated by the first cluster when the first cluster does not have enough remaining resources to run the task, or when the first cluster does not have the image for running the task.
  • Before sending the scheduling request to the central cluster, the first cluster determines whether it has the image of the task and whether it has enough remaining resources to run the task. Only when the first cluster does not have the image and/or does not have sufficient resources to run the task does it send a scheduling request to the central cluster.
  • The central cluster determines the second cluster according to image information that matches the task specified by the scheduling request. The image information includes the name of the image and the version number of the image, and the second cluster is a cluster that has authorized the image specified by the image information for use by the first cluster.
  • When the first cluster sends the scheduling request to the central cluster, it also sends the image information matching the task to the central cluster.
  • The image information includes the name of the image and the version number of the image.
  • After the central cluster receives the image information sent by the first cluster, it can find the second cluster that has authorized the image specified by the image information to the first cluster, thereby further realizing resource sharing between the first cluster and the second cluster.
  • Among at least one cluster that has the image specified by the image information, the central cluster determines a second cluster that satisfies the resources required to run the task specified by the scheduling request.
  • The central cluster may find multiple clusters that have authorized the image specified by the image information for use by the first cluster. Among the clusters found, the remaining resources of some clusters may be insufficient to support running the task; the central cluster therefore needs to further determine, from these clusters, a second cluster whose remaining resources support running the task specified by the scheduling request. This ensures that the determined second cluster can satisfy the scheduling request sent by the first cluster.
  • The central cluster sends the address and identifier of the second cluster to the first cluster, and the address and identifier of the second cluster are used by the first cluster to access the second cluster.
  • After determining the second cluster that satisfies the scheduling request, the central cluster sends the address and identifier of the second cluster to the first cluster. This ensures that the first cluster can correctly establish communication with the second cluster and use the second cluster to perform the task, realizing resource sharing between clusters and improving resource utilization.
  • The central cluster authorizes the image for running the task in the second cluster to the first cluster.
  • The central cluster can mutually authorize image information between clusters, for example, authorize the image for running the task in the second cluster to the first cluster. In this way, tasks can be scheduled across clusters, thereby realizing resource sharing between clusters and improving resource utilization.
  • The present application provides a scheduling method, including: a first cluster sends a scheduling request to a central cluster, where the scheduling request is used by the central cluster to determine a second cluster that satisfies the scheduling request; the first cluster receives an instruction sent by the central cluster corresponding to the scheduling request; and the first cluster uses the second cluster determined by the instruction to perform the task.
  • The first cluster sends the scheduling request to the central cluster, and the central cluster serves as a unified scheduling and control center.
  • After receiving the scheduling request sent by the first cluster, the central cluster performs unified scheduling, determines among all the clusters it manages a second cluster that meets the scheduling request, and instructs the first cluster to use the resources in the determined second cluster to perform the task. This achieves cross-cluster scheduling of tasks.
  • The first cluster can thus use the second cluster to perform the task, realizing resource sharing between different clusters and improving resource utilization.
  • The scheduling request is generated by the first cluster when the first cluster does not have enough remaining resources to run the task, or when the first cluster does not have the image for running the task.
  • the first cluster receives the address and identifier of the second cluster sent by the central cluster, and the address and identifier of the second cluster are used by the first cluster to access the second cluster.
  • The first cluster uses a central authentication service (CAS) to authenticate in the second cluster; after the authentication is passed, the first cluster sends the task to the second cluster and receives the execution result obtained by the second cluster from executing the task.
  • The first cluster completes inter-cluster authentication through CAS, which ensures that the number of clusters is not limited during cross-cluster access authentication, improves cluster scalability, and keeps the authentication process simple and reliable.
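  • A rough sketch of the first cluster's side of this exchange is shown below (illustrative only; the login and submit calls are placeholders standing in for the CAS-style login and the cluster interface, not a real API):

```python
# Illustrative sketch of the first cluster dispatching a task after CAS-style authentication.
# The login/submit calls below are assumed placeholders, not a real CAS or cluster API.

def dispatch_task(second_cluster, task, auth_user, auth_password, expected_cluster_id):
    # Authenticate by performing a login operation in the second cluster with the
    # unified inter-cluster authentication user (same account and password in every cluster).
    session = second_cluster.login(auth_user, auth_password)
    if session is None:
        raise PermissionError("CAS-style authentication in the second cluster failed")
    # After authentication passes, send the task together with the second cluster's unique ID,
    # which the second cluster checks against its own ID before accepting the task.
    result = session.submit(task, cluster_id=expected_cluster_id)
    return result  # execution result produced by the second cluster
```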
  • The present application provides a scheduling device for a central cluster, including: a receiving module, configured to receive a scheduling request sent by a first cluster; and a processing module, configured to determine a second cluster that satisfies the scheduling request and instruct the first cluster to use the second cluster to perform the task.
  • The scheduling request is generated by the first cluster when the first cluster does not have enough remaining resources to run the task, or when the first cluster does not have the image for running the task.
  • The processing module is configured to determine the second cluster according to the image information that matches the task specified by the scheduling request, where the image information includes the name of the image and the version number of the image, and the second cluster is a cluster that has authorized the image specified by the image information for use by the first cluster.
  • The processing module is configured to determine, among at least one cluster that has the image specified by the image information, a second cluster that satisfies the resources required to run the task specified by the scheduling request.
  • The scheduling device further includes a sending module configured to send the address and identifier of the second cluster to the first cluster, where the address and identifier of the second cluster are used by the first cluster to access the second cluster.
  • the processing module is further configured to authorize the image running the task in the second cluster to the first cluster.
  • The present application provides a scheduling device for a first cluster, including: a sending module, configured to send a scheduling request to a central cluster, where the scheduling request is used by the central cluster to determine a second cluster that satisfies the scheduling request; a receiving module, configured to receive an instruction sent by the central cluster in response to the scheduling request; and a processing module, configured to perform the task using the second cluster determined by the instruction.
  • The scheduling request is generated by the scheduling device when the scheduling device does not have enough remaining resources to run the task, or when the scheduling device does not have the image for running the task.
  • The receiving module is configured to receive the address and identifier of the second cluster sent by the central cluster, where the address and identifier of the second cluster are used by the scheduling device to access the second cluster.
  • The processing module is configured to use the central authentication service (CAS) to authenticate in the second cluster; the sending module is further configured to send the task to the second cluster after the authentication is passed; and the receiving module is further configured to receive the execution result obtained by the second cluster from executing the task.
  • the present application provides a computing device, which includes a processor and a memory.
  • The processor executes the computer instructions stored in the memory, so that the computing device executes the method of the foregoing first aspect or of any implementation of the foregoing first aspect.
  • The present application provides a computing device including a processor and a memory.
  • The processor executes the computer instructions stored in the memory, so that the computing device executes the method of the foregoing second aspect or of any implementation of the foregoing second aspect.
  • The present application provides a computer storage medium that stores a computer program which, when executed by a computing device, implements the flow of the scheduling method provided by the foregoing first aspect or by any implementation of the foregoing first aspect.
  • The present application provides a computer storage medium that stores a computer program which, when executed by a computing device, implements the flow of the scheduling method provided by the foregoing second aspect or by any implementation of the foregoing second aspect.
  • The present application provides a computer program product. The computer program product includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can execute the method of the above first aspect or of any implementation of the above first aspect.
  • The present application provides a computer program product.
  • The computer program product includes computer instructions.
  • When the computer instructions are executed by a computing device, the computing device can execute the method of the above second aspect or of any implementation of the above second aspect.
  • FIG. 1 is a schematic diagram of a cross-cluster task scheduling scenario provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a scheduling method provided by an embodiment of the present application.
  • FIG. 3A is a schematic structural diagram of a cluster network provided by an embodiment of the present application.
  • FIG. 3B is a schematic structural diagram of another cluster network provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another scheduling method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a scheduling device for a central cluster provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a scheduling device for a first cluster provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Terms such as "component" used in this specification denote computer-related entities, which may be hardware, firmware, a combination of hardware and software, software, or software in execution.
  • A component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer.
  • Both an application running on a computing device and the computing device itself can be components.
  • One or more components may reside in a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers.
  • These components can execute from various computer-readable media having various data structures stored thereon.
  • A component may communicate through local and/or remote processes, for example based on a signal having one or more data packets (such as data from one component interacting with another component in a local system or distributed system, and/or interacting with other systems across a network such as the Internet through signals).
  • a task is a basic work element to be completed by a computer in a multi-program or multi-process environment, and it includes one or more instruction sequences.
  • tasks are scheduled running entities that implement specific business logic, for example, a task is used to execute an algorithm logic.
  • a cluster is a group of independent computers interconnected by a high-speed network. They form a group and are managed in a single system mode. When a user interacts with the cluster, the cluster is like an independent server, and the cluster configuration can improve availability and scalability. Through the cluster technology, relatively high benefits in performance, high reliability, and flexibility can be obtained at a lower cost. Task scheduling is the core technology in the cluster system.
  • Yet another resource negotiator (YARN) is a resource management system that includes a global resource manager (RM) and a per-application master (AM), where the RM is responsible for resource management and allocation for the entire system, and the AM is responsible for managing a single application.
  • YARN generally has a master/slave structure.
  • The RM is responsible for unified management and scheduling of the resources on each node manager. When a user submits an application, an AM must be provided to track and manage the program; it is responsible for applying to the RM for resources and requesting the node manager to start tasks that occupy certain resources. Since different AMs are distributed on different nodes, they do not affect each other.
  • RM includes a scheduler (scheduler) and an application manager (applications manager, ASM).
  • The scheduler allocates the resources of the system to each running application based on constraints such as capacity and queues (for example, each queue is allocated a certain amount of resources and executes at most a certain number of jobs).
  • The scheduler is no longer responsible for work related to specific applications, such as monitoring or tracking the execution status of an application, or restarting tasks that failed because of application execution failures or hardware failures; these are handled by the AM associated with the application.
  • The scheduler only allocates resources according to the resource requirements of each application.
  • The resource allocation unit can be represented by a resource container.
  • A container is a dynamic resource allocation unit that encapsulates memory, central processing unit (CPU), disk, network, and other resources together to limit the amount of resources used by each task.
  • the scheduler is a pluggable component, and users can design according to their needs, such as a capacity scheduler (capacity scheduler) and a fair scheduler (fair scheduler).
  • ASM is responsible for all applications in the entire system, including application submission, resource negotiation with the scheduler to start AM, monitoring the AM running status and restarting it if it fails, etc.
  • Kerberos is a network authentication protocol used to authenticate personal communications by secure means on insecure networks. Its design goal is to provide strong authentication services for client/server applications through a key system. The authentication process does not depend on the authentication of the host operating system, does not require trust based on host addresses, does not require the physical security of all hosts on the network, and assumes that data packets transmitted on the network can be arbitrarily read, modified, and inserted. Under these circumstances, Kerberos, as a trusted third-party authentication service, implements authentication through traditional cryptographic techniques (such as shared keys).
  • The authentication process is as follows: the client sends a request to an authentication server (AS) to obtain a certificate for a certain server; the response of the AS contains the certificate encrypted with the client's key, and the certificate consists of a server ticket and a session key.
  • The server ticket includes the client's identity information encrypted with the server key and a copy of the session key. The client encrypts its own identity information and a timestamp with the session key and transmits them to the server together with the server ticket.
  • The central authentication service (CAS) is an independent open protocol that can provide a reliable single sign-on (SSO) method for world wide web (web) application systems.
  • the problems to be solved in this application include how to support cross-cluster scheduling of tasks to realize resource sharing between different clusters, and support cross-cluster access authentication to improve cluster scalability.
  • FIG. 1 is a schematic diagram of a cross-cluster task scheduling scenario provided by the present application.
  • The central cluster receives the cluster information and resource information reported by cluster 1 and cluster 2 (that is, step S110) and manages the received cluster information and resource information, where the cluster information may include cluster image information, cluster identification information, cluster level information, cluster address information for external connections, and the like.
  • The cluster image information includes a list of the images owned by the cluster and their version information.
  • The resource information may include resource usage information, resource remaining information, and resource change information.
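  • To make the reported data concrete, one possible shape for this cluster information and resource information is sketched below; this layout and these field names are purely illustrative assumptions, not a format defined by the application:

```python
# Hypothetical example of the information a cluster reports to the central cluster.
# Field names are assumptions chosen to mirror the description, not a defined format.

cluster_info = {
    "cluster_id": "cluster-1",
    "level": 1,                                    # cluster level information
    "address": {"ip": "10.0.0.10", "port": 8443},  # address used for external connections
    "images": [                                    # cluster image information
        {"name": "image-A", "version": "1.0"},
        {"name": "image-B", "version": "2.1"},
    ],
}

resource_info = {
    "cluster_id": "cluster-1",
    "usage": {"cpu_cores_used": 12, "memory_gb_used": 48, "disk_gb_used": 900},
    "remaining": {"cpu_cores": 20, "memory_gb": 80, "disk_gb": 1100},
    "changes": [],                                 # resource change information, if any
}
```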
  • Cluster 1 sends a scheduling request to the central cluster (step S120).
  • The scheduling request includes the required memory resources and CPU resources, as well as the image information matching the task.
  • After receiving the scheduling request sent by cluster 1, the central cluster performs scheduling according to the previously received cluster information and resource information and determines a schedulable cluster that meets the scheduling request. For example, it determines that cluster 2 is a schedulable cluster that meets the scheduling request and returns the address information of cluster 2 to cluster 1 (that is, step S130). After cluster 1 receives the address information of cluster 2, it sends the task to cluster 2 (that is, step S140), uses the resources of cluster 2 to execute the task, and thereby completes cross-cluster scheduling of the task. It should be understood that FIG. 1 only exemplarily shows cross-cluster scheduling of a task when there are two clusters. When there are more clusters, the task scheduling logic is the same as above and is not repeated here.
  • The central cluster, cluster 1, and cluster 2 may be deployed on one or more instances in a data center; the instances include, for example, virtual machines, containers, and physical machines. This application does not limit the specific implementation form of the instances.
  • the present application provides a scheduling method, a scheduling device, and a computing device, which can support cross-cluster scheduling of tasks, realize resource sharing between different clusters, and improve resource utilization.
  • FIG. 2 is a schematic flowchart of a scheduling method provided by an embodiment of the present application. As shown in FIG. 2, the method includes, but is not limited to, the following steps:
  • S210 The central cluster receives the scheduling request sent by the first cluster.
  • The first cluster needs to be configured in advance to specify whether cross-cluster scheduling of tasks is allowed.
  • If the first cluster is configured to allow cross-cluster scheduling of tasks and its remaining resources cannot support normal execution of the task, the artificial intelligence (AI) business service of the first cluster automatically sends a scheduling request to the central cluster to request cross-cluster scheduling of the task.
  • If the first cluster does not have the image for running the task, the task cannot be executed in the first cluster, and the first cluster also needs to send a scheduling request to the central cluster to request cross-cluster scheduling of the task.
  • The first cluster sends the required resources to the central cluster through the scheduling request, and the resources may include the memory resources, graphics processing unit (GPU) resources, and CPU resources required for running the task.
  • The first cluster also sends the image name and version number matching the task to the central cluster through the scheduling request.
  • An image can have different version numbers; even images with the same name are treated as different images if their version numbers differ, and their functions, startup commands, and configuration files differ accordingly.
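  • The decision described above, and the content of the resulting scheduling request, can be sketched as follows; this is only an illustrative sketch, and the helper arguments and field names are assumptions rather than anything defined by the application:

```python
# Illustrative construction of a scheduling request by the first cluster.
# local_images and local_remaining are assumed inputs describing the local cluster's state.

def maybe_build_scheduling_request(task, local_images, local_remaining, cluster_id):
    image = (task["image_name"], task["image_version"])
    has_image = image in local_images
    has_resources = all(local_remaining.get(res, 0) >= amount
                        for res, amount in task["resources"].items())
    if has_image and has_resources:
        return None  # the task can run locally; no cross-cluster scheduling is needed
    # Otherwise request cross-cluster scheduling from the central cluster.
    return {
        "from_cluster": cluster_id,
        "image_name": task["image_name"],        # an image is identified by its name ...
        "image_version": task["image_version"],  # ... together with its version number
        "resources": task["resources"],          # e.g. {"cpu": 8, "memory_gb": 16, "gpu": 1}
    }
```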
  • The central cluster configures all lower-level clusters, including the first cluster.
  • The central cluster can configure the level relationship between the clusters it manages.
  • The central cluster can configure all clusters to the same level, that is, all clusters are equal and there is no superior-subordinate relationship; in this case the entire cluster network has a honeycomb (cellular) structure.
  • The central cluster can also configure the clusters at different levels, that is, there is a superior-subordinate relationship between different clusters, and the entire cluster network has a tree structure.
  • For example, the clusters managed by the central cluster include cluster 1, cluster 2, cluster 3, and cluster 4.
  • The central cluster configures these four clusters to the same level, that is, the cluster network formed by these four clusters is a cellular network, and each cluster directly accepts the management of the central cluster.
  • In another example, cluster 1, cluster 2, cluster 3, cluster 4, cluster 5, and cluster 6 form a tree-like cluster network managed by the central cluster.
  • The central cluster configures the hierarchical relationship of the clusters: cluster 1 is configured as a first-level cluster, cluster 2 and cluster 3 are configured as second-level clusters, and cluster 4, cluster 5, and cluster 6 are configured as third-level clusters.
  • Cluster 2 and cluster 3 are lower-level clusters of cluster 1, cluster 5 is a lower-level cluster of cluster 2, and cluster 6 is a lower-level cluster of cluster 3.
  • Whether the cluster network is configured as a cellular network, a tree network, or another network structure, and how many levels the clusters in the cluster network are divided into, can be configured by the central cluster according to requirements; this is not limited in this application.
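  • The two example topologies can be written down as simple configuration data; the encoding below is only an illustrative assumption, since the application does not prescribe a format:

```python
# Illustrative encodings of the two example cluster networks managed by the central cluster.

# FIG. 3A style: all clusters at the same level (cellular structure, no superior/subordinate relation).
cellular_network = {
    "cluster-1": {"level": 1, "parent": None},
    "cluster-2": {"level": 1, "parent": None},
    "cluster-3": {"level": 1, "parent": None},
    "cluster-4": {"level": 1, "parent": None},
}

# FIG. 3B style: a tree structure with superior/subordinate relations between levels.
tree_network = {
    "cluster-1": {"level": 1, "parent": None},
    "cluster-2": {"level": 2, "parent": "cluster-1"},
    "cluster-3": {"level": 2, "parent": "cluster-1"},
    "cluster-4": {"level": 3, "parent": None},       # parent not specified in the excerpt above
    "cluster-5": {"level": 3, "parent": "cluster-2"},
    "cluster-6": {"level": 3, "parent": "cluster-3"},
}
```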
  • The central cluster can also configure other information for each cluster.
  • For example, the central cluster can configure the address information, identification information, and bandwidth information of each cluster.
  • The address information includes the network address and port number of the cluster, and the network address may be an Internet Protocol (IP) address used by the cluster for external connections.
  • The identification information of the cluster can be a string, for example, a random string configured by the backend; the string is the unique identifier (ID) of the cluster in the network, and the ID of each cluster is different and unique.
  • the central cluster adopts a heartbeat mechanism to receive cluster information periodically reported by each cluster.
  • all clusters in the cluster network will perform service configuration when installing services.
  • the configuration includes the network address and reporting period of the central cluster, and each cluster can communicate with the central cluster through the configured network address of the central cluster.
  • each cluster will periodically report its own cluster information to the central cluster according to the configured reporting period.
  • the cluster information may include the cluster address and cluster identifier and the remaining resources of the cluster.
  • the central cluster maintains the online status of each cluster and manages the life cycle of each cluster according to the cluster information reported by each cluster. If a certain cluster does not report to the central cluster within a preset time period, the central cluster determines that the cluster is offline, and removes the cluster from the cluster network.
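  • A minimal sketch of the heartbeat bookkeeping described here is given below; the timeout value, data layout, and method names are assumptions for illustration only:

```python
import time

# Illustrative heartbeat tracking on the central cluster; the timeout is an assumed value.

class HeartbeatRegistry:
    def __init__(self, offline_after_seconds=90):
        self.offline_after = offline_after_seconds
        self.last_report = {}   # cluster_id -> timestamp of the most recent report
        self.cluster_info = {}  # cluster_id -> last reported cluster information

    def report(self, cluster_id, info):
        """Called when a cluster periodically reports its cluster information."""
        self.last_report[cluster_id] = time.time()
        self.cluster_info[cluster_id] = info

    def evict_offline(self):
        """Remove clusters that have not reported within the preset time period."""
        now = time.time()
        for cluster_id, seen in list(self.last_report.items()):
            if now - seen > self.offline_after:
                self.last_report.pop(cluster_id)
                self.cluster_info.pop(cluster_id, None)  # cluster is treated as offline
```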
  • Each cluster has its own image repository, and the images in the repositories of different clusters may be entirely different, partially the same, or all the same.
  • For the same image, its image name is the same in all clusters.
  • Each cluster needs to report its own image information to the central cluster; optionally, each cluster can report its image information when it periodically reports its cluster information.
  • The image information can include the name of the image and the corresponding version number.
  • The central cluster can configure authorization for each cluster's images and their corresponding versions.
  • For example, it can authorize any image of any cluster to other clusters for use.
  • Image 1 and image 2 of cluster 1 are authorized to cluster 2 for use, image 3 of cluster 1 is authorized to cluster 3 for use, image 4 of cluster 2 is authorized to cluster 1 for use, image 5 of cluster 2 is authorized to cluster 3 for use, image 6 and image 7 of cluster 3 are authorized to cluster 1 for use, and image 6 of cluster 3 is authorized to cluster 2 for use.
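  • The example authorizations above can be captured as a simple mapping; the layout below is an illustrative assumption, not a format defined by the application:

```python
# The example image authorizations, written as {owner_cluster: {image: [authorized_clusters]}}.
# This layout is an illustrative assumption, not a format defined by the application.

image_authorizations = {
    "cluster-1": {"image-1": ["cluster-2"], "image-2": ["cluster-2"], "image-3": ["cluster-3"]},
    "cluster-2": {"image-4": ["cluster-1"], "image-5": ["cluster-3"]},
    "cluster-3": {"image-6": ["cluster-1", "cluster-2"], "image-7": ["cluster-1"]},
}

def is_authorized(owner, image, requester):
    """True if the owner cluster has authorized this image for use by the requesting cluster."""
    return requester in image_authorizations.get(owner, {}).get(image, [])
```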
  • Each cluster can set which image information needs to be reported, that is, each cluster can selectively report part of its image information; it is not necessary to report all of it.
  • the central cluster receives resource information periodically reported by each cluster.
  • The resource information periodically reported by each cluster may include the node type, host name, total number of CPU cores, total disk, total memory, host IP, number of used CPU cores, disk usage, memory usage, and the like, of each server in the cluster.
  • The node types of the cluster servers can include data nodes and computing nodes. It should be understood that the central cluster may also receive the remaining amount of resources periodically reported by each cluster, such as the remaining amount of CPU, the remaining amount of memory, and so on. After receiving the resource information or remaining resource amounts periodically reported by each cluster, the central cluster manages the resource usage information and remaining resource information of each cluster.
  • If the resources of a cluster change, the cluster needs to report the resource change information to the central cluster immediately. For example, if the total number of CPU cores of a server in cluster 1 increases from eight to ten, cluster 1 needs to report the change in the total number of CPU cores to the central cluster in time.
  • S220 The central cluster determines a second cluster that satisfies the scheduling request according to the scheduling request.
  • After receiving the scheduling request sent by the first cluster, the central cluster uses the image name in the scheduling request and the version number corresponding to that image name to find the second cluster for which the matching image (with that image name and version number) has been configured as authorized to the first cluster.
  • Among at least one cluster that has the image specified by the image information, the central cluster determines a second cluster that satisfies the resources required to run the task specified by the scheduling request.
  • The central cluster may find multiple clusters that have authorized the image specified by the image information for use by the first cluster. Among the clusters found, some clusters may have insufficient remaining resources to support the task, so the central cluster needs to further determine, from these clusters, a second cluster whose remaining resources support running the task specified by the scheduling request. For example, the central cluster may determine, from these clusters, the second cluster whose remaining resources can meet the task's running requirements according to the resource information (memory resources, CPU resources, GPU resources, and the like) in the scheduling request.
  • The central cluster determines the cluster with the most remaining resources among the one or more clusters as the second cluster.
  • The central cluster may also determine the second cluster based on other conditions, for example according to the network bandwidth of the clusters, or according to the distance from the first cluster, or by randomly selecting one of the clusters that meet the scheduling request.
  • The specific rule used to determine the second cluster from the one or more clusters is not limited in this application.
  • For example, the image name and corresponding version number in the scheduling request sent by cluster 1 to the central cluster are image A 1.0, and the central cluster finds the cluster for which image A 1.0 was previously configured as authorized to cluster 1.
  • Assuming cluster 2 is such a cluster, the central cluster can further determine, according to the resource information reported by cluster 2, whether the remaining resources of cluster 2 can support successful execution of the task, that is, whether the remaining resources of cluster 2 are greater than the resources required by the scheduling request; if so, the central cluster can determine that cluster 2 is a cluster that satisfies the scheduling request.
  • S230 The central cluster instructs the first cluster to use the second cluster to perform tasks.
  • The central cluster sends the first information matching the second cluster to the first cluster.
  • After determining the second cluster that satisfies the scheduling request, the central cluster sends the first information matching the second cluster, that is, the IP address, port number, and unique ID of the second cluster, to the first cluster, so that the first cluster can communicate with the second cluster.
  • The first cluster sends the task to the second cluster.
  • CAS is used to complete the authentication of the first cluster in the second cluster. That is, the first cluster uses the inter-cluster authentication user to perform a login operation in the second cluster; if the login succeeds, the authentication succeeds. After the login succeeds, the task and the unique ID of the second cluster are sent to the second cluster.
  • After receiving the unique ID sent by the first cluster, the second cluster checks whether it is consistent with its own ID. After the verification succeeds, the second cluster allows the first cluster to call the AI business service of the second cluster to send the task, uses the resources of the second cluster to run the task, and, after the task is completed, sends the result information to the first cluster.
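  • On the receiving side, the checks and hand-off described here could look roughly like the sketch below; run_with_yarn is an assumed placeholder for the cluster's internal YARN-based scheduling and is not a real API:

```python
# Illustrative sketch of the second cluster accepting and executing a cross-cluster task.
# run_with_yarn() stands in for the cluster's internal resource scheduling; it is not a real API.

class SecondClusterService:
    def __init__(self, own_cluster_id, run_with_yarn):
        self.own_cluster_id = own_cluster_id
        self.run_with_yarn = run_with_yarn   # callable: task -> result information

    def accept_task(self, task, claimed_cluster_id):
        # The unique ID sent by the first cluster must match this cluster's own ID.
        if claimed_cluster_id != self.own_cluster_id:
            raise ValueError("cluster ID verification failed; task rejected")
        # After verification succeeds, run the task with this cluster's resources
        # and return the result information to the first cluster.
        return self.run_with_yarn(task)
```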
  • each cluster can create a user as an authentication user, and all clusters can create the same user to complete inter-cluster authentication. For example, there are cluster 1, cluster 2 and cluster 3, and the same user A is created in cluster 1, cluster 2 and cluster 3 as the unified authentication user between clusters.
  • There are currently two authentication methods for inter-cluster authentication: Kerberos and CAS. To use Kerberos to complete authentication between clusters, the mutual trust of the authentication servers between the clusters must be configured first, for example, the authentication server of cluster 1, the authentication server of cluster 2, and the authentication server of cluster 3 must be configured to trust one another, and Kerberos allows at most 16 mutually trusting cluster authentication servers to be configured.
  • If inter-service communication needs to be completed across clusters, for example, service A in cluster 1 needs to communicate with service B in cluster 2, then user A in cluster 1 must first generate a key.tab file and then use the file to authenticate with the authentication server of cluster 2. If user A passes authentication in the authentication server of cluster 2, cluster 1 and cluster 2 can communicate, that is, service A in cluster 1 can communicate with service B in cluster 2. It can be seen that, to complete communication between cross-cluster services with Kerberos authentication, the authentication-server mutual trust between the clusters must be configured first; in addition, the number of clusters that can be accessed is limited, there are other authentication restrictions, and the steps are cumbersome, so the requirement that inter-cluster authentication not limit the number of clusters cannot be met.
  • Using CAS to complete authentication between clusters does not require configuring mutual trust between the clusters' authentication servers. It only requires that, when each cluster creates the unified inter-cluster authentication user, the account and password of that user are consistent across clusters, that is, the account and password of the aforementioned user A are consistent. In this way, if inter-service communication needs to be completed across clusters, for example, service A in cluster 1 needs to communicate with service B in cluster 2, then user A in cluster 1 directly performs a login operation on cluster 2 or on the corresponding service node of cluster 2. If the login succeeds, the authentication is passed and cluster 1 and cluster 2 can communicate, that is, service A in cluster 1 can communicate with service B in cluster 2.
  • This application uses CAS authentication to complete authentication between clusters, which ensures that the number of clusters is not limited during cross-cluster access authentication, improves cluster scalability, and keeps the authentication process simple and reliable.
  • the second cluster sends the execution result obtained by executing the task to the first cluster.
  • The second cluster can implement dynamic task allocation and resource scheduling within the cluster through YARN, obtain the result information after the task is completed, and return the result information to the first cluster.
  • FIG. 4 is a schematic flow diagram of another scheduling method provided by an embodiment of the application.
  • the central cluster includes an association relationship service module, a computing distribution service module, and a resource management service module.
  • The local cluster and the schedulable cluster each include an algorithm warehouse service module and a YARN service module.
  • The algorithm warehouse service module stores and manages the image information of the cluster, and the YARN service module is responsible for task allocation and resource scheduling within the cluster.
  • the method includes but is not limited to the following steps:
  • the association relationship service module receives the configuration of the user.
  • the user can configure the level relationship between the clusters in the cluster network, and configure other information of the cluster, such as the address of the cluster, the identifier of the cluster, and the bandwidth of the cluster.
  • the association service module receives the user's configuration, and stores and manages the user's configuration information.
  • the algorithm warehouse service module obtains resource information of the local cluster.
  • the resource information of the local cluster may include the node type, host name, total number of CPU cores, total disks, total memory, host IP, number of used CPU cores, disk usage, and memory usage of each local cluster server
  • the node types of cluster servers can include data nodes and computing nodes.
  • the local cluster is configured with the network address and reporting period of the central cluster when installing the service, and the algorithm warehouse service module reports the cluster heartbeat to the association service module according to the reporting period.
  • The association relationship service module maintains the online status of the local cluster and manages the life cycle of the cluster according to the reported cluster heartbeat.
  • The algorithm warehouse service module reports the image information owned by the local cluster to the association relationship service module when the heartbeat is reported.
  • The image information includes the name of the image and the version corresponding to the image name.
  • The association relationship service module can configure and authorize each image of each cluster and its corresponding version, and can authorize any image of any cluster to other clusters.
  • The algorithm warehouse service module can itself set which image information is reported, without having to report all the image information of the local cluster.
  • the algorithm warehouse service module periodically reports resource information to the resource management service module, and the period can be set according to actual needs.
  • the algorithm warehouse service module needs to immediately report the resource change information to the resource management service module.
  • After the resource management service module receives the resource information reported by the local cluster, it manages the resource usage information and remaining resource information of the local cluster.
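  • The reporting behaviour described in these steps (periodic reports plus an immediate report whenever resources change) can be sketched as follows; the period, the transport, and the two helper callables are all assumptions:

```python
import time

# Illustrative reporting loop for the algorithm warehouse service module.
# collect_resource_info() and send_report() are assumed placeholders, not defined interfaces.

def report_resources(collect_resource_info, send_report, period_seconds=60):
    last_sent = None
    last_time = 0.0
    while True:
        info = collect_resource_info()
        changed = info != last_sent
        due = time.time() - last_time >= period_seconds
        # Report on the configured period, and immediately whenever the resources change.
        if changed or due:
            send_report(info)
            last_sent, last_time = info, time.time()
        time.sleep(1)
```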
  • the computing distribution service module obtains the cluster association relationship and basic information of the cluster.
  • the computing distribution service module obtains the cluster association relationship and basic information of the cluster from the association relationship service module.
  • the cluster association relationship may be the level relationship between each cluster, which is configured by the user in the association relationship service module.
  • the basic information of the cluster may be the address information of the cluster, the identification information of the cluster, or the bandwidth information of the cluster, etc., which are also configured by the user in the association relationship service module.
  • the computing distribution service module obtains cluster resource information.
  • the computing distribution service module obtains the resource information of the cluster from the resource management service module.
  • The cluster resource information can be the resource information periodically reported to the resource management service module by the local cluster and the schedulable clusters, and may specifically include the node type, host name, total number of CPU cores, total disk, total memory, host IP, number of used CPU cores, disk usage, memory usage, and other information of each server in the cluster.
  • The computing distribution service module summarizes and integrates all the information obtained from the association relationship service module and the resource management service module into final data.
  • When the computing distribution service module receives a scheduling request, it can allocate a schedulable cluster based on the final data.
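  • As a purely illustrative picture of this "final data", the aggregation might merge the two sources per cluster as sketched below; the field names are assumptions:

```python
# Illustrative aggregation performed by the computing distribution service module.
# It merges the association/basic information with the resource information per cluster.

def build_final_data(association_info, resource_info):
    """association_info: cluster_id -> {"level", "parent", "address", "id", "bandwidth"}
       resource_info:    cluster_id -> {"usage": {...}, "remaining": {...}}"""
    final = {}
    for cluster_id, basic in association_info.items():
        final[cluster_id] = {
            **basic,
            "usage": resource_info.get(cluster_id, {}).get("usage", {}),
            "remaining": resource_info.get(cluster_id, {}).get("remaining", {}),
        }
    return final  # used to allocate a schedulable cluster when a scheduling request arrives
```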
  • the algorithm warehouse service module sends a scheduling request to the computing distribution service module.
  • If the algorithm warehouse service module of the local cluster needs cross-cluster scheduling while running tasks, it sends a multi-level scheduling request to the computing distribution service module of the central cluster, so that the computing distribution service module can determine a schedulable cluster that meets the requirements of running the task according to the multi-level scheduling request.
  • the multi-level scheduling request may specifically include the image name of the image matching the task and its corresponding version, and resource information required to run the task.
  • the computing distribution service module determines a schedulable cluster.
  • After the computing distribution service module receives the scheduling request sent by the algorithm warehouse service module, it uses the image name contained in the scheduling request and the version number corresponding to the image to find the one or more clusters for which a matching image (with that image name and version number) has been configured as authorized to the local cluster, and then determines, from the one or more clusters, a schedulable cluster whose remaining resources meet the requirements of running the task.
  • the computing distribution service module returns the address information of the schedulable cluster to the algorithm warehouse service module.
  • the computing distribution service module sends the IP address, port number, and unique ID of the schedulable cluster to the algorithm warehouse service module of the local cluster.
  • the algorithm warehouse service module of the local cluster sends tasks to the algorithm warehouse service module of the schedulable cluster.
  • the local cluster uses CAS to complete the authentication of the local cluster in the schedulable cluster.
  • The algorithm warehouse service module of the local cluster communicates with the algorithm warehouse service module of the schedulable cluster according to the IP address and port number of the schedulable cluster, and sends the unique ID of the schedulable cluster to the algorithm warehouse service module of the schedulable cluster; the algorithm warehouse service module of the schedulable cluster verifies the unique ID and accepts the tasks sent by the local cluster only after the verification succeeds.
  • The algorithm warehouse service module of the schedulable cluster sends the task to the YARN service module.
  • After the YARN service module receives the task issued by the algorithm warehouse service module, it performs resource scheduling within the cluster to run the task and obtains the result information after the task is completed.
  • S4130 The YARN service module of the schedulable cluster returns result information to the algorithm warehouse service module of the schedulable cluster.
  • the algorithm warehouse service module of the schedulable cluster returns result information to the algorithm warehouse service module of the local cluster.
  • The above structures of the central cluster, the local cluster, and the schedulable cluster, and the scheduling process for cross-cluster task scheduling, are merely examples and should not constitute specific limitations.
  • The modules in the central cluster, the local cluster, and the schedulable cluster can be added, reduced, or merged as needed.
  • The operations and/or functions of the modules in the central cluster, the local cluster, and the schedulable cluster are respectively intended to implement the corresponding processes of the methods in FIG. 1 to FIG. 4.
  • As shown in FIG. 5, the scheduling device 500 for the central cluster includes a receiving module 510 and a processing module 520, where:
  • the receiving module 510 is configured to receive a scheduling request sent by the first cluster.
  • the processing module 520 is configured to determine a second cluster that satisfies the scheduling request, and instruct the first cluster to use the second cluster to perform tasks.
  • The scheduling request is generated by the first cluster when the first cluster does not have enough remaining resources to run the task, or when the first cluster does not have the image for running the task.
  • The processing module 520 is configured to determine the second cluster according to the image information that matches the task specified by the scheduling request, where the image information includes the name of the image and the version number of the image, and the second cluster is a cluster that has authorized the image specified by the image information for use by the first cluster.
  • The processing module 520 is configured to determine, among at least one cluster that has the image specified by the image information, a second cluster that satisfies the resources required to run the task specified by the scheduling request.
  • The scheduling device 500 further includes a sending module 530, where the sending module 530 is configured to send the address and identifier of the second cluster to the first cluster, and the address and identifier are used by the first cluster to access the second cluster.
  • the processing module 520 is further configured to authorize the image running the task in the second cluster to the first cluster.
  • the above-mentioned structure of the scheduling device for the central cluster and the scheduling process for task cross-cluster scheduling are merely examples, and should not constitute a specific limitation.
  • The modules in the scheduling device for the central cluster can be added, reduced, or merged as needed.
  • the operations and/or functions of each module in the scheduling device for the central cluster are used to implement the corresponding procedures of the methods described in FIG. 2 and FIG. 4, and are not repeated here for brevity.
  • FIG. 6 is a schematic structural diagram of a scheduling apparatus for a first cluster provided by an embodiment of the present application.
  • The scheduling device 600 for the first cluster includes a sending module 610, a receiving module 620, and a processing module 630, where:
  • the sending module 610 is configured to send a scheduling request to a central cluster, where the scheduling request is used by the central cluster to determine a second cluster that satisfies the scheduling request.
  • the receiving module 620 is configured to receive an indication sent by the central cluster in response to the scheduling request.
  • the processing module 630 is configured to execute a task by using the second cluster determined by the instruction.
  • The scheduling request is generated by the scheduling device when the scheduling device does not have enough remaining resources to run the task, or when the scheduling device does not have the image for running the task.
  • the receiving module 620 is configured to receive the address and identifier of the second cluster sent by the central cluster, and the address and identifier of the second cluster are used by the scheduling device 600 to access the The second cluster.
  • the processing module 630 is configured to use the central authentication service (CAS) to authenticate with the second cluster; the sending module 610 is further configured to send the task to the second cluster after the authentication is passed; and the receiving module 620 is further configured to receive the execution result obtained by the second cluster executing the task.
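A hedged sketch of this sequence from the first cluster's side is given below; the helper callables stand in for a CAS client and the clusters' service interfaces and are not real APIs.

```python
# Sketch of the first-cluster flow: authenticate with the second cluster via
# CAS-style single sign-on (a shared inter-cluster user), pass along the
# second cluster's identifier so the second cluster can verify it against its
# own ID, submit the task, and wait for the result. The callables are
# placeholders, not a real CAS client.
def run_task_on_second_cluster(indication, task, cas_login, send_task, fetch_result):
    # `indication` is assumed to carry the second cluster's address and identifier.
    session = cas_login(indication.ip, indication.port, user="inter_cluster_user")
    if session is None:
        raise PermissionError("CAS authentication with the second cluster failed")
    # The second cluster is expected to check this identifier against its own ID.
    accepted = send_task(session, indication.cluster_id, task)
    if not accepted:
        raise RuntimeError("second cluster rejected the task (identifier mismatch?)")
    return fetch_result(session, task)
```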
  • the structure of the scheduling device for the first cluster is merely an example and should not constitute a specific limitation; the modules in the scheduling device for the first cluster can be added, reduced, or merged as needed. In addition, the operations and/or functions of each module in the scheduling device for the first cluster are used to implement the corresponding procedures of the methods described in FIG. 2 and FIG. 4, and are not repeated here for brevity.
  • FIG. 7 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the computing device 700 includes a processor 710, a communication interface 720, and a memory 730.
  • the processor 710, the communication interface 720, and the memory 730 are connected to each other through an internal bus 740.
  • the computing device may be a computer or a server.
  • the processor 710 may be composed of one or more general-purpose processors, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the bus 740 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus 740 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 7, but it does not mean that there is only one bus or one type of bus.
  • the memory 730 may include a volatile memory, such as a random access memory (RAM); the memory 730 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 730 may also include a combination of the above types.
  • the memory 730 may be used to store programs and data, so that the processor 710 can call the program codes and data stored in the memory 730 to implement the functions of the aforementioned processing modules.
  • the program code can be used to implement the functional modules of the scheduling device for the central cluster shown in FIG. 5 or of the scheduling device for the first cluster shown in FIG. 6, or to implement the method steps, in the methods shown in FIG. 2 and FIG. 4, for which the central cluster or the first cluster is the execution subject.
  • This application also provides a computer storage medium, wherein the computer storage medium stores a computer program, and when the computer program is executed by a processor, it can implement part or all of the steps of any one of the above method embodiments, and realize the function of any one of the functional modules described in FIG. 5 and FIG. 6.
  • the embodiment of the present invention also provides a computer program, which includes computer instructions.
  • when the computer instructions are executed by a computer, the computer can execute part or all of the steps of any one of the scheduling methods, and perform the function of any one of the functional modules described in FIG. 5 and FIG. 6.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the above-mentioned units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

Abstract

This application provides a scheduling method, apparatus, and related device. The method includes: a central cluster receives a scheduling request sent by a first cluster and determines a second cluster that satisfies the scheduling request; the central cluster instructs the first cluster to use the second cluster to execute a task. The method supports cross-cluster task scheduling, so that resources can be shared between different clusters and resource utilization is improved.

Description

一种调度方法、装置及相关设备 技术领域
本发明涉及信息技术领域,尤其涉及一种调度方法、装置及相关设备。
背景技术
随着信息技术的发展,数据中心已全面进入基于分布式虚拟化技术的云数据中心时代,通过计算虚拟化、网络虚拟化、存储虚拟化为代表的虚拟化技术,数据中心实现了本地资源的高可靠性、高稳定性、高弹性的按需分配和充分利用。
分布式系统是由多个分散的计算机经过互连网络构成的统一计算机系统。其中各个计算机的物理资源和逻辑资源既相互配合又高度自治,能在全系统范围内实现资源管理和数据共享,动态的实现任务分配和功能分配,且能并行的运行分布式程序,它强调资源、任务、功能、数据和控制的全面分布,它们分布于各个物理上分散的计算机节点中,各个节点经过互连网络相互通信,构成统一的处理系统。
当前通信技术都是基于单个集群(即单个分布式系统)内部通信,任务动态分配和资源调度只能在一个集群内部进行。
发明内容
本发明实施例公开了一种调度方法、装置及相关设备,能够跨集群发送任务,实现不同集群之间资源共享,提高资源的利用效率。
第一方面,本申请提供了一种调度方法,所述方法包括:中心集群接收第一集群发送的调度请求,确定满足所述调度请求的第二集群;所述中心集群指示所述第一集群利用所述第二集群执行任务。
在本申请实施例中,中心集群作为统一调度和控制中心,在接收到第一集群发送的调度请求后,进行统一调度。具体地,中心集群在所管理的所有集群中确定满足所述调度请求的第二集群;中心集群在确定第二集群之后,指示第一集群可以利用第二集群中的资源执行任务。这样可以实现任务跨集群调度,即原先第一集群执行的任务可以利用第二集群去执行,实现了不同集群之间资源共享,提高了资源的利用率。
在一种可能的实现方式中,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
在本申请实施例中,第一集群在向中心集群发送调度请求之前,会判断自身是否具有该任务的镜像,以及是否具有足够的剩余资源运行该任务。只有在第一集群不具有该镜像和/或没有足够资源运行该任务的情况下,第一集群才会向中心集群发送调度请求。
在一种可能的实现方式中,所述中心集群根据所述调度请求指定的与所述任务匹配的镜像信息,确定第二集群,所述镜像信息包括镜像的名称以及所述镜像的版本号,所述第二集群为已授权所述镜像信息指定的镜像给所述第一集群使用的集群。
在本申请实施例中,第一集群在向中心集群发送调度请求时,会将与任务匹配的镜 像信息一起发送给中心集群。其中,该镜像信息包括镜像的名称以及镜像的版本号。中心集群在接收到第一集群发送的镜像信息之后,可以查找到已将该镜像信息指定的镜像授权给第一集群使用的第二集群,从而可以进一步实现第一集群和第二集群之间的资源共享。
在一种可能的实现方式中,中心集群在具有所述镜像信息指定的镜像的至少一个集群中,确定满足所述调度请求指定的运行所述任务所需资源的第二集群。
在本申请实施例中,中心集群可能找到多个将镜像信息指定的镜像授权给第一集群使用的多个集群。在找到的多个集群中,可能存在某些集群的剩余资源不足以支持运行该任务;中心集群需要进一步从该多个集群中确定剩余资源支持运行调度请求指定任务的第二集群。这样可以保证确定的第二集群能够满足第一集群发送的调度请求。
在一种可能的实现方式中,中心集群向所述第一集群发送所述第二集群的地址和标识,所述第二集群的地址和标识用于所述第一集群访问所述第二集群。
在本申请实施例中,中心集群在确定了满足调度请求的第二集群之后,将该第二集群的地址和标识发送给第一集群。从而可以保证第一集群能够准确的与第二集群建立通信,进而可以利用第二集群执行任务,实现集群间资源共享,提高资源利用率。
在一种可能的实现方式中,中心集群将所述第二集群中运行所述任务的镜像授权给所述第一集群。
在本申请实施例中,中心集群可以对集群间的镜像信息进行相互授权,例如将第二集群中运行所述任务的镜像授权给第一集群。从而可以实现任务跨集群调度,进而实现集群间资源共享,提高资源的利用率。
第二方面,本申请提供一种调度方法,包括:第一集群向中心集群发送调度请求,所述调度请求用于所述中心集群确定满足所述调度请求的第二集群;所述第一集群接收所述中心集群相应所述调度请求所发送的指示;所述第一集群利用所述指示确定的所述第二集群执行任务。
在本申请实施例中,第一集群向中心集群发送调度请求,中心集群作为统一调度和控制中心,接收到第一集群发送的调度请求后进行统一调度,在所管理的所有集群中确定满足该调度请求的第二集群,并指示第一集群利用确定的第二集群中的资源执行任务,这样实现了任务跨集群调度,第一集群可以利用第二集群执行任务,实现了不同集群之间资源共享,提高了资源的利用率。
在一种可能的实现方式中,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
在一种可能的实现方式中,第一集群接收所述中心集群发送的第二集群的地址和标识,所述第二集群的地址和标识用于所述第一集群访问所述第二集群。
在一种可能的实现方式中,第一集群利用中央认证服务(central authentication service,CAS)在所述第二集群认证;在认证通过之后,所述第一集群向所述第二集群发送所述任务以及接收所述第二集群执行所述任务所得的执行结果。
在本申请实施例中,第一集群通过CAS完成集群间的认证,可以保证在跨集群访问 认证时,对集群数量不作限制,提高了集群可扩展性,且认证过程简单可靠。
第三方面,本申请提供了一种用于中心集群的调度装置,包括:接收模块,用于接收第一集群发送的调度请求;处理模块,用于确定满足所述调度请求的第二集群,指示所述第一集群利用所述第二集群执行任务。
在一种可能的实现方式中,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
在一种可能的实现方式中,所述处理模块用于:根据所述调度请求指定的与所述任务匹配的镜像信息,确定第二集群,所述镜像信息包括镜像的名称以及所述镜像的版本号,所述第二集群为已授权所述镜像信息指定的镜像给所述第一集群使用的集群。
在一种可能的实现方式中,所述处理模块用于:在具有所述镜像信息指定的镜像的至少一个集群中,确定满足所述调度请求指定的运行所述任务所需资源的第二集群。
在一种可能的实现方式中,所述调度装置还包括发送模块,所述发送模块,用于向所述第一集群发送所述第二集群的地址和标识,所述第二集群的地址和标识用于所述第一集群访问所述第二集群。
在一种可能的实现方式中,所述处理模块还用于,将所述第二集群中运行所述任务的镜像授权给所述第一集群。
第四方面,本申请提供了一种用于第一集群的调度装置,包括:发送模块,用于向中心集群发送调度请求,所述调度请求用于所述中心集群确定满足所述调度请求的第二集群;接收模块,用于接收所述中心集群响应所述调度请求所发送的指示;处理模块,用于利用所述指示确定的所述第二集群执行任务。
在一种可能的实现方式中,所述调度请求为所述调度装置在所述调度装置没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述调度装置在所述调度装置没有运行所述任务的镜像时生成的。
在一种可能的实现方式中,所述接收模块用于,接收所述中心集群发送的第二集群的地址和标识,所述第二集群的地址和标识用于所述调度装置访问所述第二集群。
在一种可能的实现方式中,所述处理模块用于:利用中央认证服务CAS在所述第二集群认证;所述发送模块,还用于在认证通过之后,向所述第二集群发送所述任务;所述接收模块,还用于接收所述第二集群执行所述任务所得的执行结果。
第五方面,本申请提供了一种计算设备,所述计算设备包括处理器和存储器。所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行上述第一方面以及结合上述第一方面中的任意一种实现方式的方法。
第六方面,本申请提供了一种计算设备,所述计算设备包括处理器和存储器。所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行上述第二方面以及结合上述第二方面中的任意一种实现方式的方法。
第七方面,本申请提供了一种计算机存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序在被计算设备执行时实现上述第一方面以及结合上述第一方面中的任意一种实现方式所提供的调度方法的流程。
第八方面,本申请提供了一种计算机存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序在被计算设备执行时实现上述第二方面以及结合上述第二方面中的任意一种实现方式所提供的调度方法的流程。
第九方面,本申请提供了一种计算机程序产品,所述计算机程序产品包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行上述第一方面以及结合上述第一方面中的任意一种实现方式的方法。
第十方面,本申请提供了一种计算机程序产品,所述计算机程序产品包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行上述第二方面以及结合上述第二方面中的任意一种实现方式的方法。
附图说明
为了更清楚地说明本发明实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种任务跨集群调度的场景示意图;
图2是本申请实施例提供的一种调度方法的流程示意图;
图3A是本申请实施例提供的一种集群网络的结构示意图;
图3B是本申请实施例提供的又一种集群网络的结构示意图;
图4是本申请实施例提供的又一种调度方法的流程示意图;
图5是本申请实施例提供的一种用于中心集群的调度装置的结构示意图;
图6是本申请实施例提供的一种用于第一集群的调度装置的结构示意图;
图7是本申请实施例提供的一种计算设备的结构示意图。
具体实施方式
下面结合附图对本申请实施例中的技术方案进行清楚、完整的描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三” 和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
首先,结合附图对本申请中所涉及的部分用语和相关技术进行解释说明,以便于本领域技术人员理解。
任务(task)是在多道程序或多进程环境中,要由计算机来完成的基本工作元,它包括一个或多个指令序列。在实际应用中,任务是实现特定业务逻辑的被调度的运行实体,例如一个任务用于执行一个算法逻辑。
集群(cluster)是一组相互独立、通过高速网络互联的计算机,它们构成了一个组,并以单一系统的模式加以管理。一个用户与集群相互作用时,集群像是一个独立的服务器,集群配置可以提高可用性和可缩放性。通过集群技术,可以在付出较低成本的情况下获得在性能、高可靠性、灵活性方面相对较高的收益,任务调度是集群系统中的核心技术。
另一种资源协调者(yet another resource negotiator,YARN)是一种资源管理系统,包括一个全局资源管理器(resource manager,RM)和每个应用程序特有的管理器(application master,AM),其中,RM负责整个系统的资源管理和分配,AM负责单个应用程序的管理。YARN总体上是主/从结构,RM负责对各个节点管理器上的资源进行统一管理和调度,当用户提交一个应用程序时,需要提供一个用以跟踪和管理这个程序的AM,它负责向RM申请资源,并要求节点管理器启动可以占用一定资源的任务。由于不同的AM被分布到不同的节点上,因此它们之间不会相互影响。RM包括调度器(scheduler)和应用程序管理器(applications manager,ASM),调度器根据容量、队列等限制条件(例如每个队列分配一定的资源,最多执行一定数量的作业等),将系统中的资源分配 给各个正在运行的应用程序,不再负责与具体应用程序相关的工作,例如不再负责监控或者跟踪应用的执行状态,也不负责重新启动因应用执行失败或者硬件故障而产生的失败任务,这些均由应用程序相关的AM完成。调度器仅根据各个应用程序的资源需求进行资源分配,资源分配单位可以用资源容器(container)表示,container是一个动态资源分配单位,它将内存、中央处理器(central processing unit,CPU)、磁盘、网络等资源封装在一起,从而限定每个任务使用的资源量。此外,该调度器是一个可插拔的组件,用户可以根据需要进行设计,例如能力调度器(capacity scheduler)和公平调度器(fair scheduler)等。ASM负责整个系统中所有应用程序,包括应用程序提交、与调度器协商资源以启动AM、监控AM运行状态并在失败时重新启动它等。
网络认证协议(kerberos)是一种计算机网络授权协议,用来在非安全网络中,对个人通信以安全的手段进行身份认证。其设计目标是通过密钥系统为客户机/服务器应用程序提供强大的认证服务。该认证过程的实现不依赖于主机操作系统的认证,无需基于主机地址的信任,不要求网络上所有主机的物理安全,并假定网络上传送的数据包可以被任意的读取、修改和插入数据。在以上情况下,kerberos作为一种可信任的第三方认证服务,是通过传统的密码技术(例如共享密钥)执行认证服务的。其认证过程具体为:客户机向认证服务器(authentication server,AS)发送请求,要求得到某服务器的证书,然后AS的响应中包含这些用客户端密钥加密的证书,证书由服务器票据(ticket)和一个会话密钥(session key)构成。服务器票据包括用服务器密钥加密的客户机身份信息和一份会话密钥的拷贝,客户机将自身的身份信息和时间戳通过会话密钥加密后,连同服务器票据一起传送到服务器。
中央认证服务(central authentication service,CAS)是一种独立开放指令协议,可以为万维网(world wide web,web)应用系统提供一种可靠的单点登录(single sign on,SSO)方法,SSO使得在多个应用系统中,用户只需要登录一次就可以访问所有相互信任的应用系统。
目前,是通过YARN的方式实现任务动态分配以及资源调度,从而提高资源利用率。但是,这种方式只能在单个集群内部实现资源共享,当存在多个集群的情况下,将不支持任务跨集群调度,不能实现不同集群之间资源共享。此外,目前使用kerberos认证来完成的服务间通信,但是kerberos认证对集群访问数量存在限制,最多只能支持16个集群相互认证,而且还存在一些其它的认证限制,例如需要开放后台认证用户白名单等,不能满足集群数量不做限制的认证需求。
综上,本申请需解决的问题包括,如何支持任务跨集群调度以实现不同集群之间资源共享,以及支持跨集群访问认证,提高集群可扩展性。
参见图1,图1是本申请提供的一种任务跨集群调度的场景示意图。如图1所示,中心集群接收集群1和集群2上报的集群信息和资源信息(即步骤S110),并对接收到的集群信息和资源信息进行管理,其中,集群信息可以包括集群的镜像信息、集群的标识信息、集群的级别信息和集群用于外部连接的地址信息等,集群的镜像信息包括集群所拥有的镜像以及版本信息的列表。资源信息可以包括资源使用信息、资源剩余信息和资源变更信息。集群1在资源不足的情况下,任务无法在本地执行,因此集群1向中心集 群发送调度请求(即步骤S120),该调度请求中包括所需要的内存资源和CPU资源等,以及与任务匹配的镜像名称与版本号。中心集群在接收到集群1发送的调度请求之后,根据之前接收到的集群信息和资源信息进行调度,确定满足调度请求的可调度集群,例如确定集群2为满足调度请求的可调度集群,并将集群2的地址信息返回给集群1(即步骤S130)。集群1在接收到集群2的地址信息之后,向集群2发送任务(即步骤S140),利用集群2的资源执行任务,完成任务的跨集群调度。应理解,图1只是示例性的列举了两个集群时对应的任务跨集群调度,对于存在更多的集群的情况下,任务调度逻辑与上述一致,在此不再赘述。
需要说明的是,上述中心集群、集群1和集群2可以部署在一个或者多个数据中心的实例上,实例包括虚拟机、容器和物理机,本申请对实例的具体实现形式不作限定。
本申请提供了一种调度方法、调度装置及计算设备,能够支持任务跨集群调度,实现不同集群之间资源共享,提高资源的利用率。
基于上述,下面对本申请实施例提供的调度方法、装置及相关设备进行描述。参见图2,图2为本申请实施例提供的一种调度方法的流程示意图。如图2所示,该方法包括但不限于以下步骤:
S210:中心集群接收第一集群发送的调度请求。
具体地,第一集群需要预先配置是否允许进行任务跨集群调度,在第一集群配置了允许任务跨集群调度的情况下,如果第一集群资源不足,任务无法在本地执行,即第一集群的资源不能支持任务正常运行,此时,第一集群的人工智能(artificial intelligence,AI)业务服务会自动向中心集群发送调度请求,请求进行任务跨集群调度。可选地,当第一集群没有运行任务的镜像时,任务也无法在第一集群中执行,此时,第一集群也需要向中心集群发送调度请求,请求对该任务进行跨集群调度。
进一步的,第一集群将需要的资源通过调度请求发送给中心集群,该资源可以包括运行任务所需要的内存资源、图形处理器(graphics processing unit,GPU)资源和CPU资源。此外,第一集群将与所述任务匹配的镜像名称与版本号也通过调度请求发送给中心集群。
如果任务用于执行算法,而镜像是用于执行该算法的载体,即一个任务匹配一个镜像。其中,一个镜像可以存在不同的版本号,即使是同一个镜像,若其版本号不同,则也属于不同的镜像,它们的功能、启动命令以及配置文件都是不一样的。例如,对于镜像A,其存在三个不同的版本号,分别为镜像A1.0、镜像A2.0和镜像A3.0,那么镜像A1.0、镜像A2.0和镜像A3.0为三个不同的镜像。因此,第一集群若要实现任务跨集群调度,需要将与任务相匹配的镜像的名称以及相应的版本号发送给中心集群,以使中心集群能够对与该镜像名称及版本号相匹配的任务进行调度。
在一种可能的实现方式中,中心集群对包括第一集群在内的所有下层集群进行配置。具体地,中心集群可以对所管理的各个集群间的级别关系进行配置,例如,中心集群可以将所有的集群的级别配置为相同的,即所有的集群都是平等的,不存在上下级关系,整个集群网络是一个蜂窝状结构。或者是,中心集群也可以将所有的集群的级别配置为不同的,即不同集群之间存在上下级关系,整个集群网络是一个树状结构。
示例性的,参见图3A,中心集群所管理的集群包括集群1、集群2、集群3和集群4,中心集群将这四个集群配置为同一级别,即这四个集群所构成的集群网络是一个蜂窝状网络,直接接受中心集群的管理。参见图3B,集群1、集群2、集群3、集群4、集群5和集群6构成一个树状的集群网络,由中心集群统一管理,中心集群在对集群的级别关系进行配置的时候,将集群1配置为一级集群,将集群2和集群3配置为二级集群,将集群4、集群5和集群6配置为三级集群,其中,集群2和集群3为集群1的下层集群,集群4和集群5为集群2的下层集群,集群6为集群3的下层集群。值得说明的是,在实际应用中,集群网络具体配置为蜂窝状网络或者树状网络,或者是其它网络结构,以及将集群网络中的集群划分为多少级,都可以根据需求,由中心集群进行配置,本申请对此不作限定。
此外,中心集群还可以对各个集群的一些其它信息进行配置。可选的,中心集群可以配置各个集群的地址信息、集群的标识信息和集群的带宽信息。该地址信息包括集群的网络地址和端口号,该网络地址可以是集群用于外部连接的互联网协议(internet protocol,IP)地址。集群的标识信息可以是一个字符串,例如该字符串是由后台配置的一段乱码;该字符串是该集群在网络中的唯一身份标识(identification,ID),每个集群的ID都是不一样的,具有唯一性。
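For illustration, the per-cluster configuration described in the preceding paragraphs (cluster levels, external address, identifier string, bandwidth) might be recorded by the central cluster in a form like the following; all keys and values below are hypothetical.

```python
# Purely illustrative configuration data, assuming hypothetical keys: the
# central cluster records each managed cluster's level (flat "honeycomb" or
# tree levels), external address and port, unique identifier string, and bandwidth.
cluster_config = {
    "cluster-1": {"level": 1, "address": "198.51.100.11:8443",
                  "id": "9f3ac2d41b7e", "bandwidth_mbps": 1000, "parent": None},
    "cluster-2": {"level": 2, "address": "198.51.100.12:8443",
                  "id": "7b21d09de4a1", "bandwidth_mbps": 500, "parent": "cluster-1"},
    "cluster-3": {"level": 2, "address": "198.51.100.13:8443",
                  "id": "e04c51aa83f6", "bandwidth_mbps": 500, "parent": "cluster-1"},
}
# A flat, honeycomb-style network is the special case in which every cluster
# has the same level and no parent.
```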
在一种可能的实现方式中,中心集群采用心跳机制接收各个集群周期性上报的集群信息。具体地,集群网络中的所有集群在安装服务时,都会进行服务配置。该配置包括中心集群的网络地址和上报周期,每个集群可以通过配置的中心集群的网络地址与中心集群进行通信。此外,每个集群都会根据配置的上报周期向中心集群周期性的上报自身的集群信息,该集群信息可以包括集群的地址和集群的标识以及集群的剩余资源。中心集群根据各个集群上报的集群信息,维护每个集群的在线状态,管理每个集群的生命周期。若某个集群没有在预设的时间周期内向中心集群进行上报,则中心集群判定该集群处于离线状态,并将该集群从集群网络中移除。
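A minimal sketch of the timeout-based liveness bookkeeping described in this paragraph, with illustrative names only, is:

```python
# Sketch of heartbeat bookkeeping at the central cluster, under the assumption
# of a simple timeout policy: each cluster reports periodically; if no report
# arrives within `timeout_s`, the cluster is treated as offline and removed
# from the cluster network. Names are illustrative, not the disclosed design.
import time


class HeartbeatRegistry:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}      # cluster_id -> timestamp of last report
        self.cluster_info = {}   # cluster_id -> latest reported info (address, images, resources)

    def report(self, cluster_id: str, info: dict) -> None:
        self.last_seen[cluster_id] = time.monotonic()
        self.cluster_info[cluster_id] = info

    def evict_offline(self) -> list:
        now = time.monotonic()
        offline = [cid for cid, ts in self.last_seen.items() if now - ts > self.timeout_s]
        for cid in offline:
            self.last_seen.pop(cid, None)
            self.cluster_info.pop(cid, None)
        return offline
```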
在一种可能的实现方式中,每个集群都拥有自己的镜像仓库,各个集群的镜像仓库中的镜像可以是不同的,或者各个集群的镜像仓库中的镜像可以是部分相同的或者全部相同的。对于同一个镜像,其在所有集群中的镜像名称是相同的。例如,集群1和集群2同时存在一个相同的镜像,该镜像在集群1中的名称为镜像A,则在集群2中,该镜像的名称也为镜像A。每个集群需要将自己拥有的镜像信息上报给中心集群;可选的,每个集群可以在周期性上报自身的集群信息时,将自己所拥有的镜像信息一同上报给中心集群,镜像信息可以包括镜像的名称以及相应的版本号。中心集群可以对每个集群的镜像以及该镜像对应的版本进行配置授权,例如可以将任意一个集群的任意一个镜像授权给其它集群进行使用。示例性的,集群1的镜像仓库中存在镜像1、镜像2和镜像3,集群2的镜像仓库中存在镜像4和镜像5,集群3的镜像仓库中存在镜像6和镜像7;中心集群可以将集群1的镜像1和镜像2授权给集群2进行使用,将集群1的镜像3授权给集群3进行使用,将集群2的镜像4授权给集群1进行使用,将集群2的镜像5授权给集群3进行使用,将集群3的镜像6和镜像7授权给集群1进行使用,将集群3的镜像6授权给集群2进行使用。
可选的,每个集群可以自行设置需要上报的镜像信息,即每个集群可以选择性的上 报部分镜像的镜像信息,即不需要上报全部镜像的镜像信息。例如,集群1的镜像仓库中存在镜像A1.0、镜像A2.0和镜像B,集群1可以只选择上报镜像A1.0和镜像B,则中心集群只能对集群1上报的镜像A1.0和镜像B进行配置授权给其它集群使用。
在一种可能的实现方式中,中心集群接收各个集群周期性上报的资源信息。具体地,各个集群周期性上报的资源信息可以包括每台集群服务器的节点类型、主机名称、CPU核总数、磁盘总量、内存总量、主机IP、已用CPU核数、磁盘使用量、内存使用量等,集群服务器的节点类型可以包括数据节点和计算节点。应理解,中心集群也可以接收各个集群周期性上报的资源剩余量,例如CPU剩余量、内存剩余量等。中心集群在接收到各个集群周期性上报的资源信息或资源剩余量后,对各个集群的资源使用信息和资源剩余信息进行管理。
可选地,当某个集群的资源发生变化,则该集群需要立即向中心集群上报资源变更信息,例如,集群1的某台服务器的CPU核总数由原来的八个增加到十个,则集群1需要及时将CPU核总数变化详情上报给中心集群。
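The periodic per-server resource report described above could, purely as an illustration, take a shape like the following; the field names are assumptions, not the disclosed format.

```python
# Illustrative shape of a per-server resource report, based on the fields
# listed above (node type, hostname, host IP, CPU cores, memory, disk).
from dataclasses import dataclass


@dataclass
class ServerResourceReport:
    node_type: str        # e.g. "data" or "compute"
    hostname: str
    host_ip: str
    cpu_cores_total: int
    cpu_cores_used: int
    memory_total_mb: int
    memory_used_mb: int
    disk_total_gb: int
    disk_used_gb: int

    @property
    def cpu_cores_free(self) -> int:
        return self.cpu_cores_total - self.cpu_cores_used
```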
S220:中心集群根据所述调度请求,确定满足所述调度请求的第二集群。
具体地,中心集群在接收到第一集群发送的调度请求后,根据调度请求中的镜像名称和该镜像名称对应的版本号,查找到匹配的镜像(具有该镜像名称及其对应的版本号)已配置授权给第一集群使用的第二集群。
在一种可能的实现方式中,中心集群在具有所述镜像信息指定的镜像的至少一个集群中,确定满足所述调度请求指定的运行所述任务所需资源的第二集群。
具体地,中心集群可能找到多个将镜像信息指定的镜像授权给第一集群使用的多个集群。在找到的多个集群中,可能存在某些集群的剩余资源不足以支持运行该任务。中心集群需要进一步从该多个集群中确定剩余资源支持运行调度请求指定任务的第二集群。例如,中心集群可以根据调度请求中的资源信息(内存资源、CPU资源和GPU资源等),从该多个集群中确定剩余资源能够满足任务运行需求的第二集群。
可选的,中心集群将该一个或多个集群中剩余资源最多的集群确定为第二集群。可选地,中心集群也可以通过其它的条件确定第二集群,例如根据集群的网络带宽进行确定,或者根据与第一集群的距离进行确定,或者是从满足调度请求的集群中随机选取一个。
因此,具体采用何种规则从该一个或多个集群中确定第二集群,本申请对此不作限定。
示例性的,集群1向中心集群发送的调度请求中的镜像名称及对应的版本号为镜像A1.0,中心集群查找到之前将镜像A1.0配置授权给集群1使用的集群。在查找到的集群中,若集群2将镜像A1.0已经授权给集群1使用,中心集群可以再根据集群2上报的资源信息,进一步确定集群2剩余资源是否能够支持任务成功运行,即集群2的剩余资源是否大于调度请求所需要的资源,若大于,则中心集群可以确定集群2为满足调度请求的集群。
S230:中心集群指示所述第一集群利用所述第二集群执行任务。
在一种可能的实现方式中,中心集群将与第二集群匹配的第一信息发送给第一集群。
具体地,中心集群在确定了满足调度请求的第二集群后,将与该第二集群匹配的第一信息,即该第二集群的IP地址、端口号和唯一ID发送给第一集群,以使第一集群可以和第二集群进行通信。
第一集群向第二集群发送任务。可选地,第一集群接收到中心集群发送的第二集群的IP地址、端口号和唯一ID之后,使用CAS完成第一集群在第二集群的认证。即第一集群使用集群间认证用户在第二集群中执行登录操作,若登录成功,即说明认证成功,并在登录成功后向第二集群发送任务以及第二集群的唯一ID,第二集群在接收到第一集群发送的唯一ID之后,检验其与自己的ID是否吻合。在检验成功后,第二集群允许第一集群调用第二集群的AI业务服务进行任务发送,利用第二集群的资源运行任务,并在任务完成后,将结果信息发送给第一集群。
需要说明的是,每个集群可以创建一个用户作为认证用户,所有的集群可以创建同一个用户来完成集群间的认证。例如,存在集群1、集群2和集群3,在集群1、集群2和集群3中分别创建一个相同的用户A作为集群间统一认证用户。而当前集群间认证存在kerberos和CAS两种认证方式。若使用kerberos完成集群之间的认证,首先要配置集群间的认证服务器互信,例如配置集群1的认证服务器、集群2的认证服务器和集群3的认证服务器互信,而kerberos最多只允许配置16个互信的集群认证服务器。此外,若需要跨集群完成服务间通信,例如上述集群1中的服务A需要与集群2中的服务B进行通信,则集群1中的用户A需要先生成key.tab文件,然后通过该文件在集群2的认证服务器中进行认证。若用户A在集群2的认证服务器中认证通过,则集群1和集群2可以进行通信,即集群1中的服务A可以与集群2中的服务B进行通信。可以看出,使用kerberos认证完成跨集群服务间的通信,必须首先要配置集群间的认证服务器互信。此外集群的访问数量也会受到限制,而且还存在一些其它认证限制,步骤比较繁琐,不能满足集群间认证对集群数量不作限制的需求。
而使用CAS完成集群之间的认证,不需要配置集群间的认证服务器互信。只需要每个集群在创建集群间统一认证用户时,该集群间统一认证用户的账号和密码一致,即上述用户A的账号和密码一致。这样,若需要跨集群完成服务间通信,例如上述集群1中的服务A需要与集群2中的服务B进行通信,则集群1中的用户A直接在集群2或集群2相应的服务节点上执行登录操作。若登录成功,则说明认证通过,集群1和集群2可以进行通信,即集群1中的服务A可以与集群2中的服务B进行通信。
可以理解,本申请采用CAS认证完成集群间的认证,可以保证在跨集群访问认证时,对集群数量不作限制,提高了集群可扩展性,且认证过程简单可靠。
S240:第二集群向第一集群发送执行所述任务所得的执行结果。
具体地,第二集群在接收到第一集群发送的任务之后,第二集群可以通过YARN的方式在集群内部实现任务动态分配以及资源调度,在任务运行完成后得到结果信息,并将结果信息返回给第一集群。
为了更好的理解本申请实施例,本申请提供了又一种调度方法的流程示意图。参见 图4,图4为本申请实施例的提供的又一种调度方法的流程示意图,如图4所示,中心集群包括关联关系服务模块、计算分配服务模块和资源管理服务模块,本地集群和可调度集群分别包括算法仓库服务模块和YARN服务模块,其中,算法仓库服务模块存储并管理着集群的镜像信息,YARN服务模块负责集群内部的任务分配和资源调度。该方法包括但不限于以下步骤:
S410:关联关系服务模块接收用户的配置。
具体地,用户可以对集群网络中各个集群间的级别关系进行配置,以及对集群的其它信息进行配置,例如集群的地址、集群的标识和集群的带宽等。关联关系服务模块接收用户的配置,并对用户配置的信息进行存储和管理。
S420:算法仓库服务模块获取本地集群的资源信息。
具体地,本地集群的资源信息可以包括每台本地集群服务器的节点类型、主机名称、CPU核总数、磁盘总量、内存总量、主机IP、已用CPU核数、磁盘使用量、内存使用量等,集群服务器的节点类型可以包括数据节点和计算节点。
S430:算法仓库服务模块向关联关系服务模块上报集群心跳。
具体地,本地集群在安装服务时配置了中心集群的网络地址和上报周期,算法仓库服务模块根据上报周期向关联关系服务模块上报集群心跳。关联关系服务模块根据上报的集群心跳,维护本地集群的在线状态,管理集群的生命周期。
可选地,算法仓库服务将本地集群拥有的镜像信息在心跳上报时一同上报给关联关系服务模块。镜像信息包括镜像的名称以及该镜像名称对应的版本,关联关系服务模块可以对每个集群的每个镜像及其对应的版本进行配置授权,可以将任意一个集群的任意一个镜像授权给其它集群使用。
可选的,算法仓库服务模块可以自行设置上报的镜像信息,不需要上报本地集群全部的镜像信息。
S440:算法仓库服务模块向资源管理服务模块上报资源信息。
具体地,算法仓库服务模块周期性的向资源管理服务模块上报资源信息,其周期可以按照实际需求进行设置。可选地,当本地集群的资源发生变化时,算法仓库服务模块需要立即向资源管理服务模块上报资源变更信息。资源管理服务在接收到本地集群上报的资源信息之后,对本地集群的资源使用信息和资源剩余信息进行管理。
S450:计算分配服务模块获取集群关联关系和集群的基本信息。
具体地,计算分配服务模块从关联关系服务模块中获取集群关联关系以及集群的基本信息。其中,集群关联关系可以是各个集群间的级别关系,是由用户在关联关系服务模块中进行配置的。集群的基本信息可以是集群的地址信息、集群的标识信息或集群的带宽信息等,也是由用户在关联关系服务模块中进行配置的。
S460:计算分配服务模块获取集群资源信息。
具体地,计算分配服务模块从资源管理服务模块中获取集群的资源信息。集群资源信息可以是本地集群和可调度集群周期性上报给资源管理服务模块的资源信息,具体可以包括每台集群服务器的节点类型、主机名称、CPU核总数、磁盘总量、内存总量、主机IP、已用CPU核数、磁盘使用量、内存使用量等信息。
S470:计算分配服务模块整合数据。
具体地,计算分配服务模块将从关联关系服务模块和资源管理服务模块中获取的所有信息汇总整合为最终数据,当计算分配服务模块接收到调度请求时,可以根据该最终数据来分配可调度集群。
S480:算法仓库服务模块向计算分配服务模块发送调度请求。
具体地,本地集群的算法仓库服务模块在运行任务的过程中,当本地资源不能满足运行任务的需求时,算法仓库服务模块将会向中心集群的计算分配服务模块发送多级调度请求,以使计算分配服务模块可以根据该多级调度请求确定满足运行任务需求的可调度集群。多级调度请求具体可以包括与任务相匹配的镜像的镜像名称以及其对应的版本,以及运行任务所需要的资源信息。
S490:计算分配服务模块确定可调度集群。
具体地,计算分配服务模块在接收到算法仓库服务模块发送的调度请求后,根据调度请求中包含的镜像名称和该镜像对应的版本号,查找到匹配的镜像(具有该镜像名称及其对应的版本号)已配置授权给本地集群使用的一个或多个集群,然后从该一个或多个集群中确定剩余资源满足运行任务所需要的可调度集群。
S4100:计算分配服务模块向算法仓库服务模块返回可调度集群的地址信息。
具体地,计算分配服务模块在确定可调度集群之后,将该可调度集群的IP地址、端口号和唯一ID发送给本地集群的算法仓库服务模块。
S4110:本地集群的算法仓库服务模块向可调度集群的算法仓库服务模块发送任务。
具体地,本地集群使用CAS完成本地集群在可调度集群的认证。
进一步的,本地集群的算法仓库服务模块在接收到可调度集群的IP地址、端口号和唯一ID之后,将根据可调度集群的IP地址和端口号与可调度集群的算法仓库服务模块进行通信,并将可调度集群的唯一ID发送给可调度集群的算法仓库服务模块,可调度集群的算法仓库服务模块会对该唯一ID进行验证,并在验证成功后才会接受本地集群发送的任务。
S4120:可调度集群的算法仓库服务模块向YARN服务模块发送任务。
具体地,YARN服务模块接收到算法仓库服务模块下发的任务后,进行集群内部的资源调度以运行该任务,并在任务运行完成之后得到结果信息。
S4130:可调度集群的YARN服务模块向可调度集群的算法仓库服务模块返回结果信息。
S4140:可调度集群的算法仓库服务模块向本地集群的算法仓库服务模块返回结果信息。
需要说明的是,步骤S410-S4140的具体实现过程可以参照上述图1至图3的相关描述,为了叙述简洁,在此不再赘述。
应理解,上述中心集群、本地集群和可调度集群的结构以及针对任务跨集群调度的调度过程仅仅作为一种示例,不应构成具体的限定,可以根据需要对中心集群、本地集群和可调度集群中的各个模块进行增加、减少或合并。此外,中心集群、本地集群和可调度集群中的各个模块的操作和/或功能分别为了实现图1至图4中的方法的相应流程。
上述详细阐述了本申请实施例的方法,为了便于更好的实施本申请实施例的上述方 案,相应地,下面还提供用于配合实施上述方案的相关设备。
参见图5,图5是本申请实施例提供的一种用于中心集群的调度装置的结构示意图。如图5所示,用于中心集群的调度装置500包括接收模块510和处理模块520。其中,
接收模块510,用于接收第一集群发送的调度请求。
处理模块520,用于确定满足所述调度请求的第二集群,指示所述第一集群利用所述第二集群执行任务。
在一种可能的实现中,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
在一种可能的实现中,所述处理模块520用于:根据所述调度请求指定的与所述任务匹配的镜像信息,确定第二集群,所述镜像信息包括镜像的名称以及所述镜像的版本号,所述第二集群为已授权所述镜像信息指定的镜像给所述第一集群使用的集群。
在一种可能的实现中,所述处理模块520用于:在具有所述镜像信息指定的镜像的至少一个集群中,确定满足所述调度请求指定的运行所述任务所需资源的第二集群。
在一种可能的实现中,所述调度装置500还包括发送模块530,所述发送模块530,用于向所述第一集群发送所述第二集群的地址和标识,所述第二集群的地址和标识用于所述第一集群访问所述第二集群。
在一种可能的实现中,所述处理模块520还用于,将所述第二集群中运行所述任务的镜像授权给所述第一集群。
应理解,上述用于中心集群的调度装置的结构以及针对任务跨集群调度的调度过程仅仅作为一种示例,不应构成具体限定,可以根据需要对用于中心集群的调度装置中的各个模块进行增加、减少或合并。此外,用于中心集群的调度装置中的各个模块的操作和/或功能分别为了实现上述图2以及图4所描述的方法的相应流程,为了简洁,在此不再赘述。
参见图6,图6是本申请实施例提供的一种用于第一集群的调度装置的结构示意图。如图6所示,用于第一集群的调度装置600包括发送模块610、接收模块620和处理模块630。其中,
发送模块610,用于向中心集群发送调度请求,所述调度请求用于所述中心集群确定满足所述调度请求的第二集群。
接收模块620,用于接收所述中心集群响应所述调度请求所发送的指示。
处理模块630,用于利用所述指示确定的所述第二集群执行任务。
在一种可能的实现中,所述调度请求为所述调度装置在所述调度装置没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述调度装置在所述调度装置没有运行所述任务的镜像时生成的。
在一种可能的实现中,所述接收模块620用于,接收所述中心集群发送的第二集群的地址和标识,所述第二集群的地址和标识用于所述调度装置600访问所述第二集群。
在一种可能的实现中,所述处理模块630用于:利用中央认证服务CAS在所述第二集群认证;所述发送模块610,还用于在认证通过之后,向所述第二集群发送所述任务; 所述接收模块620,还用于接收所述第二集群执行所述任务所得的执行结果。
应理解,上述用于第一集群的调度装置的结构以及针对任务跨集群调度的调度过程仅仅作为一种示例,不应构成具体限定,可以根据需要对用于第一集群的调度装置中的各个模块进行增加、减少或合并。此外,用于第一集群的调度装置中的各个模块的操作和/或功能分别为了实现上述图2以及图4所描述的方法的相应流程,为了简洁,在此不再赘述。
参见图7,图7是本申请实施例提供的一种计算设备的结构示意图。如图7所示,该计算设备700包括:处理器710、通信接口720以及存储器730,所述处理器710、通信接口720以及存储器730通过内部总线740相互连接。应理解,该计算设备可以是计算机,或者可以是服务器。
所述处理器710可以由一个或者多个通用处理器构成,例如中央处理器(central processing unit,CPU),或者CPU和硬件芯片的组合。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC)、可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。
总线740可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。所述总线740可以分为地址总线、数据总线、控制总线等。为便于表示,图7中仅用一条粗线表示,但不表示仅有一根总线或一种类型的总线。
存储器730可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM);存储器730也可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM)、快闪存储器(flash memory)、硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储器730还可以包括上述种类的组合。存储器730可用于存储程序和数据,以便于处理器710调用存储器730中存储的程序代码和数据以实现上述处理模块的功能。程序代码可以是用来实现图5所示的用于中心集群的调度装置或图6所示的用于第一集群的调度装置的功能模块,或者用于实现图2以及图4所示的方法实施例中以中心集群为执行主体或以第一集群为执行主体的方法步骤。
本申请还提供一种计算机存储介质,其中,所述计算机存储介质存储有计算机程序,所述计算机程序在被处理器执行时,可以实现上述方法实施例中记载的任意一种的部分或全部步骤,以及实现上述图5和图6所描述的任意一个功能模块的功能。
本发明实施例还提供一种计算机程序,该计算机程序包括计算机指令,当所述计算机指令被计算机执行时,所述计算机可以执行任意一种调度方法的部分或全部步骤,以及执行上述图5和图6所描述的任意一个功能模块的功能。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。

Claims (26)

  1. 一种调度方法,其特征在于,包括:
    中心集群接收第一集群发送的调度请求,确定满足所述调度请求的第二集群;
    所述中心集群指示所述第一集群利用所述第二集群执行任务。
  2. 如权利要求1所述的方法,其特征在于,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
  3. 如权利要求1或2所述的方法,其特征在于,所述中心集群确定满足所述调度请求的第二集群包括:
    所述中心集群根据所述调度请求指定的与所述任务匹配的镜像信息,确定第二集群,所述镜像信息包括镜像的名称以及所述镜像的版本号,所述第二集群为已授权所述镜像信息指定的镜像给所述第一集群使用的集群。
  4. 如权利要求3所述的方法,其特征在于,所述中心集群确定第二集群包括:
    所述中心集群在具有所述镜像信息指定的镜像的至少一个集群中,确定满足所述调度请求指定的运行所述任务所需资源的第二集群。
  5. 如权利要求1至4任一项所述的方法,其特征在于,所述中心集群指示所述第一集群利用第二集群执行任务包括:
    所述中心集群向所述第一集群发送所述第二集群的地址和标识,所述第二集群的地址和标识用于所述第一集群访问所述第二集群。
  6. 如权利要求1至5任一项所述的方法,其特征在于,所述方法包括:
    所述中心集群将所述第二集群中运行所述任务的镜像授权给所述第一集群。
  7. 一种调度方法,其特征在于,包括:
    第一集群向中心集群发送调度请求,所述调度请求用于所述中心集群确定满足所述调度请求的第二集群;
    所述第一集群接收所述中心集群响应所述调度请求所发送的指示;
    所述第一集群利用所述指示确定的所述第二集群执行任务。
  8. 如权利要求7所述的方法,其特征在于,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
  9. 如权利要求7或8所述的方法,其特征在于,所述第一集群接收所述中心集群响应所述调度请求所发送的指示包括:
    所述第一集群接收所述中心集群发送的第二集群的地址和标识,所述第二集群的地 址和标识用于所述第一集群访问所述第二集群。
  10. 如权利要求7至9任一项所述的方法,其特征在于,所述第一集群利用所述指示确定的所述第二集群执行任务包括:
    所述第一集群利用中央认证服务CAS在所述第二集群认证;
    在认证通过之后,所述第一集群向所述第二集群发送所述任务,和接收所述第二集群执行所述任务所得的执行结果。
  11. 用于中心集群的调度装置,其特征在于,包括:
    接收模块,用于接收第一集群发送的调度请求;
    处理模块,用于确定满足所述调度请求的第二集群,指示所述第一集群利用所述第二集群执行任务。
  12. 如权利要求11所述的调度装置,其特征在于,所述调度请求为所述第一集群在所述第一集群没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述第一集群在所述第一集群没有运行所述任务的镜像时生成的。
  13. 如权利要求11或12所述的调度装置,其特征在于,
    所述处理模块用于:根据所述调度请求指定的与所述任务匹配的镜像信息,确定第二集群,所述镜像信息包括镜像的名称以及所述镜像的版本号,所述第二集群为已授权所述镜像信息指定的镜像给所述第一集群使用的集群。
  14. 如权利要求13所述的调度装置,其特征在于,
    所述处理模块用于:在具有所述镜像信息指定的镜像的至少一个集群中,确定满足所述调度请求指定的运行所述任务所需资源的第二集群。
  15. 如权利要求11至14任一项所述的调度装置,其特征在于,所述调度装置还包括:
    发送模块,用于向所述第一集群发送所述第二集群的地址和标识,所述第二集群的地址和标识用于所述第一集群访问所述第二集群。
  16. 如权利要求11至15任一项所述的计算装置,其特征在于,
    所述处理模块还用于:将所述第二集群中运行所述任务的镜像授权给所述第一集群。
  17. 用于第一集群的调度装置,其特征在于,包括:
    发送模块,用于向中心集群发送调度请求,所述调度请求用于所述中心集群确定满足所述调度请求的第二集群;
    接收模块,用于接收所述中心集群响应所述调度请求所发送的指示;
    处理模块,用于利用所述指示确定的所述第二集群执行任务。
  18. 如权利要求17所述的调度装置,其特征在于,所述调度请求为所述调度装置在所述调度装置没有足够的剩余资源来运行所述任务时生成的,或者所述调度请求为所述调度装置在所述调度装置没有运行所述任务的镜像时生成的。
  19. 如权利要求17或18所述的调度装置,其特征在于,
    所述接收模块用于:接收所述中心集群发送的第二集群的地址和标识,所述第二集群的地址和标识用于所述调度装置访问所述第二集群。
  20. 如权利要求17至20任一项所述的调度装置,其特征在于,
    所述处理模块用于:利用中央认证服务CAS在所述第二集群认证;
    所述发送模块还用于:在认证通过之后,向所述第二集群发送所述任务;
    所述接收模块还用于:接收所述第二集群执行所述任务所得的执行结果。
  21. 一种计算设备,其特征在于,所述计算设备包括处理器和存储器,所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行权利要求1至6任一项所述的方法。
  22. 一种计算设备,其特征在于,所述计算设备包括处理器和存储器,所述处理器执行所述存储器存储的计算机指令,使得所述计算设备执行权利要求7至10任一项所述的方法。
  23. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有计算机程序,所述计算机程序在被计算设备执行时实现权利要求要求1至6任一项所述的方法。
  24. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有计算机程序,所述计算机程序在被计算设备执行时实现权利要求要求7至10任一项所述的方法。
  25. 一种计算机程序产品,所述计算机程序产品包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行权利要求1至6任一项所述的方法。
  26. 一种计算机程序产品,所述计算机程序产品包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行权利要求7至10任一项所述的方法。
PCT/CN2019/128545 2019-05-20 2019-12-26 一种调度方法、装置及相关设备 WO2020233120A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19929870.4A EP3866441B1 (en) 2019-05-20 2019-12-26 Scheduling method and apparatus, and related device
US17/530,560 US20220075653A1 (en) 2019-05-20 2021-11-19 Scheduling method and apparatus, and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910420049.5 2019-05-20
CN201910420049.5A CN110120979B (zh) 2019-05-20 2019-05-20 一种调度方法、装置及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/530,560 Continuation US20220075653A1 (en) 2019-05-20 2021-11-19 Scheduling method and apparatus, and related device

Publications (1)

Publication Number Publication Date
WO2020233120A1 true WO2020233120A1 (zh) 2020-11-26

Family

ID=67522861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/128545 WO2020233120A1 (zh) 2019-05-20 2019-12-26 一种调度方法、装置及相关设备

Country Status (4)

Country Link
US (1) US20220075653A1 (zh)
EP (1) EP3866441B1 (zh)
CN (1) CN110120979B (zh)
WO (1) WO2020233120A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040020A (zh) * 2021-10-08 2022-02-11 杭州隆埠科技有限公司 跨集群服务调用的方法及系统

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110120979B (zh) * 2019-05-20 2023-03-10 华为云计算技术有限公司 一种调度方法、装置及相关设备
CN111124688A (zh) * 2019-12-31 2020-05-08 青梧桐有限责任公司 服务器资源控制方法和系统
CN113315719B (zh) * 2020-02-27 2024-09-13 阿里巴巴集团控股有限公司 流量调度方法、设备、系统及存储介质
CN113364892B (zh) * 2020-03-04 2023-03-24 阿里巴巴集团控股有限公司 跨多集群服务的域名解析方法、相关方法、装置和系统
CN113886058A (zh) * 2020-07-01 2022-01-04 中国联合网络通信集团有限公司 一种跨集群资源调度方法和装置
CN111885123B (zh) * 2020-07-06 2022-06-03 苏州浪潮智能科技有限公司 一种跨K8s目标服务访问通道的构建方法及装置
CN113746887B (zh) * 2020-11-05 2024-06-18 北京沃东天骏信息技术有限公司 一种跨集群数据请求处理方法、设备及存储介质
CN112738203A (zh) * 2020-12-25 2021-04-30 中孚安全技术有限公司 一种基于私有协议的数据处理集群组件方法及系统
CN113010531B (zh) * 2021-02-05 2022-11-01 成都库珀创新科技有限公司 一种基于有向无环图的区块链baas系统任务调度框架
CN113806066A (zh) * 2021-04-06 2021-12-17 京东科技控股股份有限公司 大数据资源调度方法、系统和存储介质
CN113391902B (zh) * 2021-06-22 2023-03-31 未鲲(上海)科技服务有限公司 一种任务调度方法及设备、存储介质
CN114189482A (zh) * 2021-12-14 2022-03-15 郑州阿帕斯数云信息科技有限公司 一种集群资源的控制方法、装置和系统
CN116954877A (zh) * 2022-04-15 2023-10-27 华为技术有限公司 一种分布式资源共享方法及相关装置
CN116866438B (zh) * 2023-09-04 2023-11-21 金网络(北京)数字科技有限公司 一种跨集群任务调度方法、装置、计算机设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107621973A (zh) * 2016-07-13 2018-01-23 阿里巴巴集团控股有限公司 一种跨集群的任务调度方法及装置
CN109347982A (zh) * 2018-11-30 2019-02-15 网宿科技股份有限公司 一种数据中心的调度方法及装置
CN109379774A (zh) * 2018-11-08 2019-02-22 网宿科技股份有限公司 智能调度方法、终端设备、边缘节点集群与智能调度系统
WO2019094369A1 (en) * 2017-11-09 2019-05-16 Qualcomm Incorporated Intra-cell interference management for device-to-device communication using grant-free resource
CN110120979A (zh) * 2019-05-20 2019-08-13 华为技术有限公司 一种调度方法、装置及相关设备

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7861246B2 (en) * 2004-06-17 2010-12-28 Platform Computing Corporation Job-centric scheduling in a grid environment
US9495649B2 (en) * 2011-05-24 2016-11-15 International Business Machines Corporation Workload-to-cloud migration analysis based on cloud aspects
CN102841759B (zh) * 2012-05-10 2016-04-20 天津兆民云计算科技有限公司 一种针对超大规模虚拟机集群的存储系统
CN103207814B (zh) * 2012-12-27 2016-10-19 北京仿真中心 一种去中心化的跨集群资源管理与任务调度系统与调度方法
CN104052820A (zh) * 2014-06-27 2014-09-17 国家计算机网络与信息安全管理中心 一种分布式云计算平台的动态节能资源调度系统及方法
US9977699B2 (en) * 2014-11-17 2018-05-22 Mediatek, Inc. Energy efficient multi-cluster system and its operations
CN104461740B (zh) * 2014-12-12 2018-03-20 国家电网公司 一种跨域集群计算资源聚合和分配的方法
CN105159769B (zh) * 2015-09-11 2018-06-29 国电南瑞科技股份有限公司 一种适用于计算能力异构集群的分布式作业调度方法
CN105871580A (zh) * 2015-11-02 2016-08-17 乐视致新电子科技(天津)有限公司 跨集群自动化部署运维系统及方法
CN106371889B (zh) * 2016-08-22 2020-05-29 浪潮(北京)电子信息产业有限公司 一种调度镜像的高性能集群系统实现方法及装置
CN108011862A (zh) * 2016-10-31 2018-05-08 中兴通讯股份有限公司 镜像仓库授权、访问、管理方法及服务器和客户端
CN106790483A (zh) * 2016-12-13 2017-05-31 武汉邮电科学研究院 基于容器技术的Hadoop集群系统及快速构建方法
US10382565B2 (en) * 2017-01-27 2019-08-13 Red Hat, Inc. Capacity scaling of network resources
CN107493191B (zh) * 2017-08-08 2020-12-22 深信服科技股份有限公司 一种集群节点及自调度容器集群系统
CN109471705B (zh) * 2017-09-08 2021-08-13 杭州海康威视数字技术股份有限公司 任务调度的方法、设备及系统、计算机设备
CN108038153A (zh) * 2017-12-04 2018-05-15 北京小度信息科技有限公司 Hbase的数据在线迁移方法和装置
US11134013B1 (en) * 2018-05-31 2021-09-28 NODUS Software Solutions LLC Cloud bursting technologies
US11068312B2 (en) * 2019-03-28 2021-07-20 Amazon Technologies, Inc. Optimizing hardware platform utilization for heterogeneous workloads in a distributed computing environment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107621973A (zh) * 2016-07-13 2018-01-23 阿里巴巴集团控股有限公司 一种跨集群的任务调度方法及装置
WO2019094369A1 (en) * 2017-11-09 2019-05-16 Qualcomm Incorporated Intra-cell interference management for device-to-device communication using grant-free resource
CN109379774A (zh) * 2018-11-08 2019-02-22 网宿科技股份有限公司 智能调度方法、终端设备、边缘节点集群与智能调度系统
CN109347982A (zh) * 2018-11-30 2019-02-15 网宿科技股份有限公司 一种数据中心的调度方法及装置
CN110120979A (zh) * 2019-05-20 2019-08-13 华为技术有限公司 一种调度方法、装置及相关设备

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040020A (zh) * 2021-10-08 2022-02-11 杭州隆埠科技有限公司 跨集群服务调用的方法及系统

Also Published As

Publication number Publication date
EP3866441B1 (en) 2023-05-31
EP3866441A4 (en) 2022-02-16
EP3866441A1 (en) 2021-08-18
CN110120979B (zh) 2023-03-10
CN110120979A (zh) 2019-08-13
US20220075653A1 (en) 2022-03-10

Similar Documents

Publication Publication Date Title
WO2020233120A1 (zh) 一种调度方法、装置及相关设备
US10768955B1 (en) Executing commands within virtual machine instances
US10516623B2 (en) Pluggable allocation in a cloud computing system
US9544289B2 (en) Method and system for identity-based authentication of virtual machines
US9471384B2 (en) Method and system for utilizing spare cloud resources
EP3716107B1 (en) Technologies for accelerated orchestration and attestation with edge device trust chains
JP4876170B2 (ja) グリッド・システムにおいてセキュリティ強制を追跡するシステムおよび方法
US9350682B1 (en) Compute instance migrations across availability zones of a provider network
WO2014031473A2 (en) Multi-level cloud computing system
US20220057947A1 (en) Application aware provisioning for distributed systems
US7657945B2 (en) Systems and arrangements to adjust resource accessibility based upon usage modes
US10817327B2 (en) Network-accessible volume creation and leasing
US20210311798A1 (en) Dynamic microservices allocation mechanism
US20230216847A1 (en) Determining session duration for device authentication
EP2852893A1 (en) Pluggable allocation in a cloud computing system
US20220046014A1 (en) Techniques for device to device authentication

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929870

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019929870

Country of ref document: EP

Effective date: 20210512

NENP Non-entry into the national phase

Ref country code: DE