CN109117265A - Method, apparatus, device and storage medium for scheduling jobs in a cluster - Google Patents

Method, apparatus, device and storage medium for scheduling jobs in a cluster

Info

Publication number
CN109117265A
CN109117265A (application CN201810761530.6A)
Authority
CN
China
Prior art keywords
node
pod
cluster
resource
condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810761530.6A
Other languages
Chinese (zh)
Inventor
周倜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810761530.6A
Publication of CN109117265A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5083 - Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, apparatus, device and storage medium for scheduling jobs in a cluster. The method includes: obtaining Pod data corresponding to a job; selecting one or more target nodes from the cluster according to a node scheduling condition in the Pod data and the node state of each node in the cluster; and deploying a Pod on each target node according to the Pod data, then running the job's processes in the deployed Pods. This technical solution deploys job processes in containers, placing different jobs in separate, independent containers, so that the jobs of different applications on the cluster do not interfere with one another yet can still communicate in scenarios that require interaction. This makes effective use of cluster resources, and projects that require the pipelined cooperation of many applications, such as deep learning, can achieve stable job scheduling.

Description

Method, apparatus, device and storage medium for scheduling jobs in a cluster
[Technical field]
The present invention relates to the field of job scheduling, and in particular to a method, apparatus, device and storage medium for scheduling jobs in a cluster.
[Background]
Building a cluster from multiple physical machines and deploying services on it is the conventional means by which Internet enterprises deliver projects. How to reasonably schedule the various kinds of jobs served on a cluster has therefore always been a subject of continuous study by engineers.
Taking a deep learning project as an example, engineers often wish to run all parts of the project on the same infrastructure platform, while the sample data for deep learning usually comes from the products of the service lines; that is, multiple types of jobs need to be scheduled together in the same cluster. Existing scheduling frameworks, however, cannot achieve this well.
[Summary of the invention]
In view of this, the present invention provides a method, apparatus, device and storage medium for scheduling jobs in a cluster, to solve the problem of scheduling the jobs of different applications or services in the same cluster.
The specific technical solution is as follows:
A method for scheduling jobs in a cluster, comprising:
obtaining container pod (Pod) data corresponding to a job;
selecting one or more target nodes from the cluster according to a node scheduling condition in the Pod data and the node state of each node in the cluster;
deploying a Pod on each target node according to the Pod data, and running the job's processes in the deployed Pods.
Optionally, the node scheduling condition includes a hard scheduling condition and/or a soft scheduling condition;
selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster includes:
if the node state of a node satisfies the hard scheduling condition, the node is a target node;
and/or
scoring each node according to its node state and the soft scheduling condition, and selecting one or more target nodes according to the scoring results.
Optionally, the hard scheduling condition includes hardware information of the node and/or zone information of the node.
Optionally, selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster includes:
selecting multiple nodes located in the same availability zone as target nodes;
and deploying a Pod on each target node according to the Pod data includes: deploying one Pod on each target node according to the Pod data, so as to form multiple instances of the job.
Optionally, the node scheduling condition further includes an instance count lower limit and an instance count upper limit; selecting multiple nodes located in the same availability zone as target nodes further includes:
when the number of selected nodes is greater than or equal to the instance count lower limit, taking the smaller of the number of selected nodes and the instance count upper limit as the number of target nodes;
when the number of selected nodes is less than the instance count lower limit, terminating this job scheduling.
Optionally, the soft scheduling condition includes a job affinity condition, and the node state includes the job corresponding to each Pod deployed on the node.
Optionally, the job corresponds to one or more of the following applications and/or services:
a deep learning system, a Web service, a log collector, a distributed queue service, a log connector.
Optionally, the job is a deep learning training job, and running the job's processes in the deployed Pods includes:
running one parameter server process and one trainer process in each deployed Pod; obtaining, by the trainer process, deep learning tasks from a metadata management node of the deep learning system, sending the gradients obtained by training the local deep learning model to the parameter server process, and obtaining updated parameters from the parameter server process;
saving, by the parameter server process, training snapshots to distributed storage at predetermined intervals, so that a restarted Pod, or process within a Pod, resumes training from the training snapshot;
storing, by the parameter server process and/or the trainer process, the deep learning training model to distributed storage.
Optionally, the node scheduling condition includes a resource request lower limit and a resource request upper limit corresponding to each computing resource;
selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster includes:
calculating the schedulable resource upper limit and the schedulable resource lower limit of each computing resource according to the Pods already deployed on each node;
when the resource request lower limit of each computing resource in the node scheduling condition is less than or equal to the schedulable resource lower limit of the corresponding computing resource, selecting one or more target nodes from the cluster.
Optionally, the node scheduling condition further includes a job priority;
selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster further includes:
when a resource request lower limit in the node scheduling condition is greater than the schedulable resource lower limit, killing or throttling already-deployed Pods according to job priority, or terminating this job scheduling.
Optionally, killing or throttling a Pod includes:
when the resource corresponding to the node scheduling condition is a compressible resource, throttling the Pod;
when the resource corresponding to the node scheduling condition is an incompressible resource, killing the Pod.
Optionally, the method further includes:
allocating computing resources to the Pods deployed on each node according to the node scheduling conditions; wherein, if the sum of the resource request upper limits for a compressible resource of the Pods deployed on a node is less than the node's upper limit for that compressible resource, the unallocated compressible resource is distributed proportionally among the Pods deployed on the node.
Optionally, the method further includes:
calculating a memory usage score for each job process, and killing a job process when its calculated memory usage score reaches the preset value corresponding to that job process.
Optionally, the method further includes:
obtaining the CPU utilization of each Pod deployed from the same Pod data, and calculating an adjusted Pod count according to the arithmetic mean of the CPU utilizations and the node scheduling condition in the Pod data.
Optionally, the method further includes:
monitoring whether there is a Pod in the cluster that has not been successfully scheduled, and if so, further determining whether there are nodes available for scale-up;
if so, starting at least some of the nodes available for scale-up, and scheduling the Pods that were not successfully scheduled onto the newly started nodes.
Optionally, the method further includes:
judging whether a node satisfies a scale-down condition according to the node state of each node, and if so, shutting down the corresponding node and, when Pods have been deployed on the corresponding node, rescheduling the deployed Pods onto other nodes in the cluster.
Optionally, the scale-down condition includes one or more of the following:
the computing resource utilization of the node is below a preset value;
the Pods deployed on the node can be scheduled onto other nodes in the cluster;
the Pods deployed on the node are confirmed as movable according to a PodDisruptionBudget controller;
the node has no local storage.
Optionally, cluster scale-up and/or scale-down is performed according to one or more of the following strategies:
selecting nodes at random;
selecting nodes according to the number of deployed Pods;
selecting nodes according to computing resource utilization;
selecting nodes according to the usage price of the physical machine;
pausing scale-up and/or scale-down when a preset number and/or preset proportion of nodes in the cluster are abnormal.
A device for scheduling jobs in a cluster, characterized in that the device includes:
a Pod data obtaining unit, configured to obtain container pod (Pod) data corresponding to a job;
a scheduling unit, configured to select one or more target nodes from the cluster according to a node scheduling condition in the Pod data and the node state of each node in the cluster;
a Pod deployment unit, configured to deploy a Pod on each target node according to the Pod data, and run the job's processes in the deployed Pods.
Optionally, the node scheduling condition includes a hard scheduling condition and/or a soft scheduling condition;
the scheduling unit treats a node as a target node if its node state satisfies the hard scheduling condition; and/or scores each node according to its node state and the soft scheduling condition, and selects one or more target nodes according to the scoring results.
Optionally, the hard scheduling condition includes hardware information of the node and/or zone information of the node.
Optionally, the scheduling unit is configured to select multiple nodes located in the same availability zone as target nodes;
the Pod deployment unit is configured to deploy one Pod on each target node according to the Pod data, so as to form multiple instances of the job.
Optionally, the node scheduling condition further includes an instance count lower limit and an instance count upper limit;
the scheduling unit is configured to, when the number of selected nodes is greater than or equal to the instance count lower limit, take the smaller of the number of selected nodes and the instance count upper limit as the number of target nodes, and, when the number of selected nodes is less than the instance count lower limit, terminate this job scheduling.
Optionally, the soft scheduling condition includes a job affinity condition, and the node state includes the job corresponding to each Pod deployed on the node.
Optionally, the job corresponds to one or more of the following applications and/or services:
a deep learning system, a Web service, a log collector, a distributed queue service, a log connector.
Optionally, the job is a deep learning training job;
the deployment unit is configured to run one parameter server process and one trainer process in each deployed Pod; the trainer process obtains deep learning tasks from the metadata management node of the deep learning system, sends the gradients obtained by training the local deep learning model to the parameter server process, and obtains updated parameters from the parameter server process; the parameter server process saves training snapshots to distributed storage at predetermined intervals, so that a restarted Pod, or process within a Pod, resumes training from the training snapshot; the parameter server process and/or the trainer process stores the deep learning training model to distributed storage.
Optionally, the node scheduling condition includes a resource request lower limit and a resource request upper limit corresponding to each computing resource;
the scheduling unit is configured to calculate the schedulable resource upper limit and the schedulable resource lower limit of each computing resource according to the Pods already deployed on each node, and, when the resource request lower limit of each computing resource in the node scheduling condition is less than or equal to the schedulable resource lower limit of the corresponding computing resource, select one or more target nodes from the cluster.
Optionally, the node scheduling condition further includes a job priority;
the scheduling unit is configured to, when a resource request lower limit in the node scheduling condition is greater than the schedulable resource lower limit, kill or throttle already-deployed Pods according to job priority, or terminate this job scheduling.
Optionally, the scheduling unit is configured to throttle a Pod when the resource corresponding to the node scheduling condition is a compressible resource, and to kill a Pod when the resource corresponding to the node scheduling condition is an incompressible resource.
Optionally, the scheduling unit is further configured to allocate computing resources to the Pods deployed on each node according to the node scheduling conditions; wherein, if the sum of the resource request upper limits for a compressible resource of the Pods deployed on a node is less than the node's upper limit for that compressible resource, the unallocated compressible resource is distributed proportionally among the Pods deployed on the node.
Optionally, the scheduling unit is further configured to calculate a memory usage score for each job process, and kill a job process when its calculated memory usage score reaches the preset value corresponding to that job process.
Optionally, the scheduling unit is configured to obtain the CPU utilization of each Pod deployed from the same Pod data, and calculate an adjusted Pod count according to the arithmetic mean of the CPU utilizations and the node scheduling condition in the Pod data.
Optionally, the scheduling unit is further configured to monitor whether there is a Pod in the cluster that has not been successfully scheduled, and if so, further determine whether there are nodes available for scale-up; if so, start at least some of the nodes available for scale-up, and schedule the Pods that were not successfully scheduled onto the newly started nodes.
Optionally, the scheduling unit is configured to judge whether a node satisfies a scale-down condition according to the node state of each node, and if so, shut down the corresponding node and, when Pods have been deployed on the corresponding node, reschedule the deployed Pods onto other nodes in the cluster.
Optionally, the scale-down condition includes one or more of the following: the computing resource utilization of the node is below a preset value; the Pods deployed on the node can be scheduled onto other nodes in the cluster; the Pods deployed on the node are confirmed as movable according to a PodDisruptionBudget controller; the node has no local storage.
Optionally, the scheduling unit is configured to perform cluster scale-up and/or scale-down according to one or more of the following strategies: selecting nodes at random; selecting nodes according to the number of deployed Pods; selecting nodes according to computing resource utilization; selecting nodes according to the usage price of the physical machine; pausing scale-up and/or scale-down when a preset number and/or preset proportion of nodes in the cluster are abnormal.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the method described above when executing the program.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described above.
Based on the above description, it can be seen that with the solution of the present invention, after container pod (Pod) data corresponding to a job is obtained, several target nodes are selected from the cluster according to the node scheduling condition in the data and the node state of each node in the cluster, a Pod is then deployed on each target node according to the Pod data, and the job's processes are run in the deployed Pods. This technical solution deploys job processes in containers, placing different jobs in separate, independent containers, so that the jobs of different applications on the cluster do not interfere with one another yet can still communicate in scenarios that require interaction. This makes effective use of cluster resources, and projects that require the pipelined cooperation of many applications, such as deep learning, can achieve stable job scheduling.
[Brief description of the drawings]
Fig. 1 shows a schematic flowchart of a method for scheduling jobs in a cluster according to an embodiment of the present invention.
Fig. 2 shows a schematic structural diagram of a device for scheduling jobs in a cluster according to an embodiment of the present invention.
Fig. 3 shows a schematic diagram of a deep learning system architecture according to an embodiment of the present invention.
Fig. 4 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention.
[Detailed description of the embodiments]
In order to make the technical solution of the present invention clearer, the solution is further described below with reference to the drawings and embodiments.
Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 shows a schematic flowchart of a method for scheduling jobs in a cluster according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S110: obtaining container pod (Pod) data corresponding to a job.
A container provides an isolated running environment, and in this respect it is similar to a virtual machine. In embodiments of the present invention, however, job processes are deployed following the idea of containerization rather than with virtual machines, because containers are more lightweight: their efficiency and utilization are significantly higher than those of virtual machines.
A container pod (Pod) is a set of one or more containers; generally it includes a root container plus the containers that run the job processes. In embodiments of the present invention, a Pod may correspond to one or more instances of a job, but in general a single Pod is not used to realize multiple instances.
Pod data can be stored in etcd, a key-value store used for shared configuration and service discovery. The Pod data of new jobs and of killed Pods can be stored in etcd and retrieved when the corresponding job is scheduled. Specifically, the Pod data can be generated from a job request submitted by a user.
Step S120: selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster.
There can be many kinds of node scheduling conditions; a concrete form of description is the label (Label), and node states can likewise be described as labels. A label is a key-value pair whose key and value are specified by the user. Labels can be attached to various resource objects, a resource object can define any number of labels, and resource objects can be queried and filtered with a LabelSelector (label selector).
Step S130: deploying a Pod on each target node according to the Pod data, and running the job's processes in the deployed Pods.
As can be seen, in the method shown in Fig. 1, after container pod (Pod) data corresponding to a job is obtained, several target nodes are selected from the cluster according to the node scheduling condition in the data and the node state of each node, a Pod is then deployed on each target node according to the Pod data, and the job's processes are run in the deployed Pods. This technical solution deploys job processes in containers, placing different jobs in separate, independent containers, so that the jobs of different applications on the cluster do not interfere with one another yet can still communicate in scenarios that require interaction. This makes effective use of cluster resources, and projects that require the pipelined cooperation of many applications, such as deep learning, can achieve stable job scheduling.
In one embodiment of the present invention, in the above method, the node scheduling condition includes a hard scheduling condition and/or a soft scheduling condition; and selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster includes: if the node state of a node satisfies the hard scheduling condition, the node is a target node; and/or scoring each node according to its node state and the soft scheduling condition, and selecting one or more target nodes according to the scoring results.
As can be seen, the hard scheduling condition is the stricter requirement. For example, in one embodiment of the present invention, the hard scheduling condition includes hardware information of the node and/or zone information of the node.
In one example, a user wants a job deployed only on nodes whose CPUs are Intel models; this is clearly a piece of node hardware information. In another example, a user wants a job deployed on nodes in zone A; this is a piece of node zone information. To summarize: when every item in a job's hard scheduling condition is a subset of the items in a node's state, that node is a selected target node. Concretely, both the hard scheduling condition and the node state can be marked as labels, and the matching can then be realized with a NodeSelector (node selector).
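To make the subset test concrete, here is a minimal Go sketch of hard-condition matching over labels; the map-based data structures are an assumption for illustration, not something the patent prescribes.

```go
// Minimal sketch of hard-condition matching: a node qualifies when every
// label demanded by the job's hard scheduling condition appears, with the
// same value, among the node's state labels. Data structures are assumed.
package main

import "fmt"

func matchesHardCondition(condition, nodeLabels map[string]string) bool {
	for key, want := range condition {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	condition := map[string]string{"cpu.vendor": "intel", "zone": "A"}
	node := map[string]string{"cpu.vendor": "intel", "zone": "A", "disk": "ssd"}
	fmt.Println(matchesHardCondition(condition, node)) // true: the condition is a subset of the node state
}
```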
Hard scheduling conditions help filter out usable nodes quickly, but they also easily lead to situations where no node is available. For jobs whose demands are less strict, a soft scheduling condition can therefore be set instead: each node is scored according to its node state and the soft scheduling condition, and one or more target nodes are selected according to the scoring results. Since matching is no longer strict, the node with the highest affinity to the job can be selected preferentially even when it is not a perfect match.
In one embodiment of the present invention, in the above method, selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster includes: selecting multiple nodes located in the same availability zone as target nodes; and deploying a Pod on each target node according to the Pod data includes: deploying one Pod on each target node according to the Pod data, so as to form multiple instances of the job.
This embodiment provides a job scheduling idea: deploy multiple instances (i.e. Pods) of a job (a job here means an application or service, such as a log collector) within one AZ (Availability Zone), i.e. affinity of a single application's instances at the AZ level. An availability zone is one or more data centers whose infrastructure, such as power and network, is isolated from the others. A region contains one or more availability zones, and the failure of one availability zone does not affect the use of the others.
Within the AZ, one instance is deployed on each target node, which also means that no node carries two instances, so the failure of one node affects only one instance. Another idea is to apply anti-affinity at the rack (cabinet) level, since an entire rack may fail at once. Essentially these are all just labels on nodes; at scheduling time it suffices to group dynamically by these special labels to handle the affinity and anti-affinity relations.
In one embodiment of the present invention, in the above method, the node scheduling condition further includes an instance count lower limit and an instance count upper limit; and selecting multiple nodes located in the same availability zone as target nodes further includes: when the number of selected nodes is greater than or equal to the instance count lower limit, taking the smaller of the number of selected nodes and the instance count upper limit as the number of target nodes; when the number of selected nodes is less than the instance count lower limit, terminating this job scheduling.
This is a problem that many scheduling frameworks, such as Slurm, cannot solve. A brief introduction: Slurm (originally Simple Linux Utility for Resource Management, taking the initials SLURM) is a free, open-source task scheduler for Linux and Unix-like kernels, widely used by supercomputers and computer clusters worldwide. It provides three key functions. First, it allocates exclusive or non-exclusive resources (compute nodes) to users for some period of time so that they can perform work. Second, it provides a framework for starting, executing, and monitoring tasks (usually parallel tasks, e.g. MPI) on the allocated nodes. Third, it arbitrates resources by managing a queue of pending tasks. About 60% of the TOP500 supercomputers run Slurm, including Tianhe-2, the world's fastest computer until 2016. Slurm uses a best-fit algorithm based on Hilbert curve scheduling or fat-tree network topology to optimize task assignment on parallel computers.
MPI is a cross-language communications protocol for programming parallel computers, supporting both point-to-point and broadcast communication. MPI is a message-passing application programming interface that includes protocol and semantic specifications of how its features must behave in any implementation. MPI's goals are high performance, scalability, and portability, and it remains the dominant model in high-performance computing today. The main MPI-1 model has no shared-memory concept, and MPI-2 has only a limited distributed shared-memory concept; nevertheless, MPI programs are routinely run on shared-memory machines. Designing programs around the MPI model (as opposed to explicit shared-memory models such as NUMA architectures) encourages memory locality. Although MPI belongs to layers 5 and higher of the OSI reference model, implementations may cover most layers, using sockets and the Transmission Control Protocol (TCP) in the transport layer. Most MPI implementations consist of a specified set of routines (an API) directly callable from C, C++, Fortran, and any language able to interface with such libraries, such as C#, Java, or Python. MPI's advantages over older message-passing libraries are portability and speed.
However, with a Slurm or MPI framework, when there are 99 usable nodes and a job needs 100 instances to be submitted, the job has to wait without using any of the usable nodes. And if an error occurs in the cluster, the entire task is marked as failed, wasting a large amount of cluster resources.
According to the present embodiment, no such problem arises. Because an instance count lower limit and an instance count upper limit are set (together also called the replica count, since the instances are realized from the same Pod data), if the replica count of the submitted job is set to 80 to 100, the instance count lower limit is 80, and 99 usable nodes clearly meet the demand, so the job can be scheduled onto those 99 usable nodes.
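As a minimal sketch of this instance-count rule (function and variable names are assumed for illustration): with 80 to 100 replicas requested and 99 usable nodes, it returns 99.

```go
// Sketch of the instance-count decision described above.
package main

import (
	"errors"
	"fmt"
)

func targetNodeCount(selected, minInstances, maxInstances int) (int, error) {
	if selected < minInstances {
		return 0, errors.New("too few usable nodes: terminate this scheduling round")
	}
	if selected > maxInstances {
		return maxInstances, nil // cap at the instance count upper limit
	}
	return selected, nil // otherwise use every selected node
}

func main() {
	n, err := targetNodeCount(99, 80, 100)
	fmt.Println(n, err) // 99 <nil>
}
```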
In one embodiment of the present invention, in the above method, the soft scheduling condition includes a job affinity condition, and the node state includes the job corresponding to each Pod deployed on the node.
This gives an example of a soft scheduling condition, namely the job affinity condition, which may also be called an application affinity condition. For example, a business job has high affinity with the jobs that process its monitoring logs and local data; if their Pods are far apart, the network overhead of access causes inefficiency. This embodiment therefore effectively provides for deploying affine applications close together, for example on the same node, which naturally reduces network overhead.
Affinity is a mutual relationship. Therefore, every job that may be affine with other jobs uses a job affinity condition (which may also be a label) to mark which jobs it is affine or anti-affine with, and the check is performed at Pod deployment time, realizing symmetry.
Another common question is what happens when an affine application is migrated. Two points need explaining. First, the algorithm design takes symmetry into account: whether an application was deployed first or later, even if it crashes, when it is rebuilt and rescheduled the system can still check which Pods it is affine with, or which Pods are affine with it, in the current system, and preferentially place it together with them. Second, at present an RC/RS (replica set, a stateless application) only rebuilds a Pod when its node dies; if the node is not dead, a Pod that exits abnormally is restarted in place. These two levels together guarantee the demand that affine applications stay together and anti-affine applications stay apart.
In one embodiment of the present invention, in the above method, the job corresponds to one or more of the following applications and/or services: a deep learning system, a Web service, a log collector, a distributed queue service, a log connector.
Such a deep learning project can be used for artificial intelligence (AI) research while meeting industrial requirements. Industrial users tend to run deep learning jobs as one stage of a larger data pipeline that includes web servers and log collectors. This kind of general-purpose cluster needs flexible, priority-based scheduling: it runs more web server processes, and less deep learning, during periods of higher network traffic, and preferentially runs deep learning when network traffic is low. Slurm and MPI cannot satisfy this demand for flexible scheduling.
The deep learning training framework itself needs to be designed to support distributed training. There are three roles in a deep learning cluster: the parameter server (Parameter Server), the trainer (Trainer), and the metadata management node (Master). Each parameter server process maintains a shard of the global model. Each trainer has its own local model copy and updates the model with its local data. During training, trainers send model updates to the parameter servers, and the parameter servers are responsible for aggregating these updates so that the trainers can synchronize their local copies with the global model.
Cluster training comprises the following modules. A single metadata management node is responsible for distributing tasks: it divides the dataset into tasks, distributes them to the trainers, and keeps training tasks traceable by using a task queue. Multiple trainers train the model by SGD (stochastic gradient descent): they receive tasks from the master, process them, compute and upload gradients to the parameter servers, and download the latest gradients (also called the parameters, or the model) into their own local models. Multiple parameter servers are responsible for storing and updating the training model: concretely, they obtain gradients from the trainers, update the parameters, return the latest parameters to the trainers, and periodically store the parameters to a distributed file system or etcd, overwriting the previous parameters. The concrete training architecture is shown in Fig. 3, where the deep learning model is divided into two shards, each managed by one of two parameter servers.
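The following Go sketch illustrates the gradient exchange between a trainer and a parameter server described above; every name, signature, and the learning-rate constant is a hypothetical stand-in for the real RPC-based implementation.

```go
// Schematic of one training step: compute gradients locally, upload them
// to the parameter server, then synchronize the local copy with the
// global model. All types and methods here are hypothetical.
package main

import "fmt"

type paramServer struct{ model []float32 } // one shard of the global model

func (ps *paramServer) push(grads []float32) {
	for i, g := range grads {
		ps.model[i] -= 0.01 * g // aggregate the update into the global shard
	}
}

func (ps *paramServer) pull() []float32 {
	return append([]float32(nil), ps.model...) // latest parameters
}

type trainer struct {
	local []float32 // the trainer's local model copy
	ps    *paramServer
}

func (t *trainer) step() {
	grads := make([]float32, len(t.local)) // placeholder for real SGD backprop
	t.ps.push(grads)
	t.local = t.ps.pull()
}

func main() {
	ps := &paramServer{model: make([]float32, 4)}
	tr := &trainer{local: ps.pull(), ps: ps}
	tr.step()
	fmt.Println(tr.local) // local copy now matches the global shard
}
```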
When the master starts, it takes a master lock and checks whether the task queue to be created already exists; if it does, the master restores that task queue, and if not, it creates one. It watches the /trainer/ directory to find existing trainers, distributes tasks to them, and updates the task queue at the same time. On master failure recovery, the master is restarted automatically and restores its data from etcd.
When a trainer starts, it watches the parameter server directory /ps/ and waits for the parameter servers to reach the specified count. It generates a unique id and writes it under /trainer/ in etcd; because of the lease, the master knows whether the trainer is online or offline. The trainer then waits for tasks to be assigned. On trainer failure recovery, the trainer is restarted automatically, pulls a task from the todo (pending) queue, and continues training.
When a parameter server starts, it reads the target total number of parameter servers, searches the etcd keys under /ps/ whose index is less than the target total, and checks which keys do not yet exist; if there is a vacancy, it fills it. The parameter server then reads the data stored under that path into memory and begins serving externally.
In one embodiment of the present invention, in the above method, the job is a deep learning training job, and running the job's processes in the deployed Pods includes: running one parameter server process and one trainer process in each deployed Pod; obtaining, by the trainer process, deep learning tasks from the metadata management node of the deep learning system, sending the gradients obtained by training the local deep learning model to the parameter server process, and obtaining updated parameters from the parameter server process; saving, by the parameter server process, training snapshots to distributed storage at predetermined intervals, so that a restarted Pod, or process within a Pod, resumes training from the training snapshot; and storing, by the parameter server process and/or the trainer process, the deep learning training model to distributed storage.
Implementing model data checkpoints effectively guards against single-point, or simultaneous multi-point, failures of the parameter servers. A model parameter checkpoint works by periodically saving to disk a complete image of the model data held in parameter server memory, guaranteeing that the training process can be restarted from an intermediate state. For an uninterruptible training task without backups, disaster recovery can be achieved by periodically saving a data snapshot of each parameter server to a distributed storage service, for example a fresh snapshot every 10 minutes, deleting the earlier snapshots. On a single-point failure, it suffices to recover this node, or to move it to another node and start it, to resume the training task.
For example, this can be realized with a lock mechanism: every 10 minutes, the parameter server requests a read lock and saves a checkpoint. Meanwhile the lock blocks write operations until the pending checkpoint completes. The parameter server then writes the newest snapshot to distributed storage and deletes the other, older snapshots; when the operation completes it releases the read lock, and writes can continue.
When a snapshot is read, the checkpoint file's uuid is read from etcd, the checkpoint snapshot file is loaded from disk, and the parameters in it are loaded. If loading is unsuccessful, the parameters are initialized from the original data.
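A minimal Go sketch of this periodic checkpointing under a read-write lock follows; the storage call is a hypothetical placeholder, and a real implementation would also record the snapshot uuid in etcd and delete older snapshots as described.

```go
// Sketch of the periodic checkpoint described above: take a read lock,
// copy the in-memory parameters, release the lock, then persist the copy.
package main

import (
	"sync"
	"time"
)

type shard struct {
	mu     sync.RWMutex
	params []float32 // this parameter server's slice of the global model
}

func (s *shard) checkpointLoop(interval time.Duration) {
	for range time.Tick(interval) {
		s.mu.RLock() // parameter writes block until the copy is taken
		snapshot := append([]float32(nil), s.params...)
		s.mu.RUnlock()
		saveToDistributedStorage(snapshot) // newest snapshot replaces older ones
	}
}

func saveToDistributedStorage(snapshot []float32) { /* placeholder */ }

func main() {
	s := &shard{params: make([]float32, 4)}
	go s.checkpointLoop(10 * time.Minute) // every 10 minutes, as in the text
	time.Sleep(10 * time.Millisecond)     // keep the demo process alive briefly
}
```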
A concrete implementation can use the public cloud's bosfs file system for shared data storage, or the public cloud's newest NFS storage system. Data can be uniformly converted to the RecordIO format, providing a standardized conversion interface.
For storing the model there are two choices: the parameter server process and/or the trainer process stores the deep learning training model to distributed storage. Since the data in each parameter server is only a shard, while a trainer holds dense updates and thus possesses the entire model, for ease of use it is preferable to have a trainer store the model. Concretely, the trainers hold an election through etcd to choose one of their nodes to export and store the model.
In one embodiment of the present invention, in the above method, the node scheduling condition includes a resource request lower limit and a resource request upper limit corresponding to each computing resource; and selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster includes: calculating the schedulable resource upper limit and the schedulable resource lower limit of each computing resource according to the Pods already deployed on each node; and, when the resource request lower limit of each computing resource in the node scheduling condition is less than or equal to the schedulable resource lower limit of the corresponding computing resource, selecting one or more target nodes from the cluster.
This embodiment provides a resource-based scheduling mode: a Pod can set request conditions for CPU and memory, specifically a resource request upper limit and a resource request lower limit. For each resource, 0 ≤ resource request lower limit ≤ resource request upper limit ≤ infinity. If a container is successfully scheduled onto a node, the container's resource request is guaranteed.
The whole cluster can then maintain and calculate the schedulable resource upper and lower limits of each computing resource; if the resource request lower limit of every computing resource in the node scheduling condition is less than or equal to the schedulable resource lower limit of the corresponding computing resource, the resources are clearly sufficient. For example, suppose a Pod requires a memory lower limit of 1024 MB, that is, the job cannot proceed unless 1024 MB of memory is provided; if a node can currently provide 2048 MB as its schedulable resource lower limit, the job can clearly be scheduled onto that node.
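As an illustration of this check, a small Go sketch (resource names and types are assumptions):

```go
// A node is schedulable for a Pod when, for every computing resource,
// the Pod's resource request lower limit fits within what the node can
// still guarantee (its schedulable resource lower limit).
package main

import "fmt"

type quantity = int64 // e.g. MB of memory or milli-CPU

func fits(requestLower, schedulableLower map[string]quantity) bool {
	for res, need := range requestLower {
		if need > schedulableLower[res] {
			return false // this node cannot guarantee the request
		}
	}
	return true
}

func main() {
	pod := map[string]quantity{"memoryMB": 1024}
	node := map[string]quantity{"memoryMB": 2048, "milliCPU": 4000}
	fmt.Println(fits(pod, node)) // true: 1024 MB fits within 2048 MB
}
```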
In one embodiment of the present invention, in the above method, the node scheduling condition further includes a job priority; and selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster further includes: when a resource request lower limit in the node scheduling condition is greater than the schedulable resource lower limit, killing or throttling already-deployed Pods according to job priority, or terminating this job scheduling.
Quality of Service (QoS) management divides Pods into three priorities. Best-effort: neither a resource request lower limit nor a resource request upper limit is written in the node scheduling condition; such Pods can use the most resources when resources are plentiful (for example, a deep learning training job can be set to this priority so that it occupies as many resources as possible at night when business traffic is low), but they are also the first to be killed when resources are tight (for example, during the day when business traffic is heavy and the stability of the business must be guaranteed first). Burstable: as long as one container in the Pod has a resource request lower limit, or the containers' resource request upper limits are set inconsistently, the QoS of the Pod is Burstable. Guaranteed: all containers must uniformly set both the resource request lower limit and the resource request upper limit, and the values must all be identical.
In one embodiment of the present invention, in the above method, killing or throttling a Pod includes: when the resource corresponding to the node scheduling condition is a compressible resource, throttling the Pod; when the resource corresponding to the node scheduling condition is an incompressible resource, killing the Pod. This follows from the nature of the computing resources: occupied memory must be released before it can be reused, so memory is an incompressible resource, whereas CPU usage can be adjusted dynamically, so CPU is a compressible resource. The corresponding handling therefore differs.
In one embodiment of the present invention, the above method further includes: allocating computing resources to the Pods deployed on each node according to the node scheduling conditions; wherein, if the sum of the resource request upper limits for a compressible resource of the Pods deployed on a node is less than the node's upper limit for that compressible resource, the unallocated compressible resource is distributed proportionally among the Pods deployed on the node.
The smallest CPU grant is 10m (milli-CPU), a limit determined by the Linux kernel. A container is guaranteed the amount of CPU it requests; whether it can obtain additional CPU time depends on the other running tasks. Beyond the requested CPU quantities, additional CPU is shared. For example, suppose container A requests 60% of the CPU and container B requests 30%, and both containers use as much CPU as they can; then the extra 10% of CPU is allocated to container A and container B in a 2:1 ratio. A container whose resource use exceeds its resource limit is throttled; if no resource limit is specified, the container may use extra CPU whenever CPU is available.
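A small Go sketch of the proportional split in the example above (a 2:1 division of the spare 10%):

```go
// Spare compressible resource is divided in the ratio of the requests.
package main

import "fmt"

func shareSpare(requests []float64, spare float64) []float64 {
	var total float64
	for _, r := range requests {
		total += r
	}
	extra := make([]float64, len(requests))
	for i, r := range requests {
		extra[i] = spare * r / total // each Pod's share of the spare CPU
	}
	return extra
}

func main() {
	// Containers A and B request 60% and 30% of the CPU; 10% is spare.
	fmt.Println(shareSpare([]float64{0.60, 0.30}, 0.10)) // ~[0.0667 0.0333], a 2:1 split
}
```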
A container is guaranteed the amount of memory it requests; if it exceeds its memory request it may be killed (when another container needs the memory), but if a container consumes less than its resource request lower limit it will not be killed (unless a system task or daemon needs more memory). When a container's memory usage exceeds its memory resource request upper limit, the container is killed.
In one embodiment of the present invention, the above method further includes: calculating a memory usage score for each job process, and killing a job process when its calculated memory usage score reaches the preset value corresponding to that job process.
In this embodiment the memory usage score is also called the OOM (out of memory) score. A process's OOM score is 10 times the percentage of memory the process consumes, adjusted by OOM_SCORE_ADJ (the preset value), and processes with higher OOM scores are killed. The base OOM score is between 0 and 1000, and a process's final OOM score is also between 0 and 1000. The OOM_SCORE_ADJ settings for the three priorities are as follows:
Best-effort: OOM_SCORE_ADJ is 1000, so the OOM_SCORE of the processes in the container will be 1000;
Guaranteed: OOM_SCORE_ADJ is -998, so the OOM_SCORE of the processes in the container will be 0 or 1;
Burstable: OOM_SCORE_ADJ is set to 1000 - 10 * (the percentage of the entire node's memory occupied by the memory resource request lower limit), which ensures OOM_SCORE > 1 for Burstable replicas. If the memory request is 0, OOM_SCORE_ADJ is set to 999.
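These three rules can be written down directly; the following Go sketch reproduces them (integer arithmetic and unit choices are assumptions):

```go
// OOM_SCORE_ADJ for the three QoS levels, per the rules above.
package main

import "fmt"

func oomScoreAdj(qos string, requestLowerBytes, nodeMemoryBytes int64) int {
	switch qos {
	case "Best-effort":
		return 1000
	case "Guaranteed":
		return -998
	default: // Burstable
		if requestLowerBytes == 0 {
			return 999
		}
		// 1000 - 10 * (request lower limit as a percent of node memory)
		return 1000 - int(10*100*requestLowerBytes/nodeMemoryBytes)
	}
}

func main() {
	// A Burstable Pod requesting 1 GiB on a 16 GiB node: 1000 - 62 = 938.
	fmt.Println(oomScoreAdj("Burstable", 1<<30, 16<<30))
}
```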
In one embodiment of the present invention, the above method further includes: obtaining the CPU utilization of each Pod deployed from the same Pod data, and calculating an adjusted Pod count according to the arithmetic mean of the CPU utilizations and the node scheduling condition in the Pod data.
This is also called horizontal autoscaling of instances or replicas (in practice, of Pods). The autoscaler (Autoscaler) is implemented as a control loop that periodically collects the CPU utilization of a Pod's replicas by querying the node states. It then compares the arithmetic mean of the replicas' CPU utilization with the target defined in the node scheduling condition, and adjusts the replica count as needed to match the target, subject to: MinReplicas (instance count lower limit) ≤ Replicas (instance count) ≤ MaxReplicas (instance count upper limit).
The autoscaler's period is controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag, whose default value is 30 seconds. CPU utilization is a replica's recent CPU usage (its average over the last minute) divided by the CPU requested by the Pod.
The target number of Pods is calculated by the formula TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target), where ceil() is the ceiling operation, i.e. the nearest integer greater than or equal to a number; sum is the arithmetic sum; CurrentPodsCPUUtilization is a Pod's average CPU usage over the last minute; and Target is the CPU resource request upper limit.
Starting and stopping Pods may add noise to the measurements within the window (for example, starting a Pod may temporarily raise CPU usage). Therefore, after each action the autoscaler should wait for a while to obtain reliable data: it scales up only if no rescaling has occurred within the past 3 minutes, and it waits 5 minutes after the last rescaling before scaling down. Moreover, any scaling happens only when the ratio of the arithmetic mean of the replicas' CPU utilization to the resource request lower limit drops below 0.9 or rises above 1.1 (a 10% tolerance).
This approach has two benefits. On one hand, the autoscaler works conservatively: if new user load appears, what matters is quickly increasing the number of Pods so as not to reject user requests, while decreasing the number of Pods is not as urgent. On the other hand, the autoscaler needs to avoid thrashing: it prevents rapid, conflicting decisions when the load is unstable.
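The formula and the 10% tolerance band combine into a small function; the following Go sketch is illustrative only and, for simplicity, measures the tolerance against the same target used in the formula:

```go
// TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target),
// skipped when the mean utilization is within 0.9 to 1.1 of the target.
package main

import (
	"fmt"
	"math"
)

func desiredReplicas(current int, podsCPUUtilization []float64, target float64) int {
	var sum float64
	for _, u := range podsCPUUtilization {
		sum += u
	}
	mean := sum / float64(len(podsCPUUtilization))
	if ratio := mean / target; ratio > 0.9 && ratio < 1.1 {
		return current // within the 10% tolerance: no scaling
	}
	return int(math.Ceil(sum / target))
}

func main() {
	// Three replicas averaging 0.9 utilization against a 0.5 target: scale to 6.
	fmt.Println(desiredReplicas(3, []float64{0.9, 0.8, 1.0}, 0.5)) // 6
}
```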
In one embodiment of the present invention, the above method further includes: monitoring whether there is a Pod in the cluster that has not been successfully scheduled, and if so, further determining whether there are nodes available for scale-up; if so, starting at least some of the nodes available for scale-up, and scheduling the Pods that were not successfully scheduled onto the newly started nodes.
This embodiment provides a way to resolve Pods that have not been successfully scheduled, namely meeting the demand by scaling up the nodes in the cluster, since the nodes of a cluster are not necessarily all in the started state. For example, it can be realized with a scale-up component that creates a watch on all Pods and checks every 10 seconds whether any Pod cannot be scheduled; a Pod generally falls into the unschedulable state because there is no node it can be scheduled onto. Unschedulable Pods can be detected by watching for their PodCondition (state) to be unscheduled (not scheduled). When this situation occurs, the scale-up component finds a new node for the Pod to be scheduled onto. It can also ensure that all Pods in the replica set containing that Pod reside in the same node group, so that the type of the newly created machine is consistent with the other machines in that node group.
Considered from the other direction, in one embodiment of the present invention the above method further includes: judging whether a node satisfies a scale-down condition according to the node state of each node, and if so, shutting down the corresponding node and, when Pods have been deployed on the corresponding node, rescheduling the deployed Pods onto other nodes in the cluster. That is, the waste of resources is avoided.
For example, it can be realized with a scale-down component that checks every 10 seconds whether there is a suitable node that can be removed. In one embodiment of the present invention, in the above method, the scale-down condition includes one or more of the following: the computing resource utilization of the node is below a preset value; the Pods deployed on the node can be scheduled onto other nodes in the cluster; the Pods deployed on the node are confirmed as movable according to a PodDisruptionBudget controller; the node has no local storage.
In one embodiment of the present invention, in the above method, cluster scale-up and/or scale-down is performed according to one or more of the following strategies: selecting nodes at random; selecting nodes according to the number of deployed Pods; selecting nodes according to computing resource utilization; selecting nodes according to the usage price of the physical machine; pausing scale-up and/or scale-down when a preset number and/or preset proportion of nodes in the cluster are abnormal. For example, to prevent large-scale node unavailability caused by network or other problems from making Pods undeployable, and thereby creating an avalanche of further unavailable nodes, certain rules can be formulated, e.g. when 30% of the nodes, or at most 3 nodes, are abnormal, the scaling function is paused until the cluster's nodes recover as a whole. A sketch of such a guard follows.
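A minimal Go sketch of such an avalanche guard, using the example thresholds from the text:

```go
// Pause scale-up/scale-down when too many nodes are abnormal.
package main

import "fmt"

func pauseScaling(abnormal, total int) bool {
	const maxAbnormalNodes = 3     // example absolute threshold from the text
	const maxAbnormalRatio = 0.30  // example proportional threshold from the text
	return abnormal >= maxAbnormalNodes ||
		float64(abnormal)/float64(total) >= maxAbnormalRatio
}

func main() {
	fmt.Println(pauseScaling(2, 10)) // false: 2 nodes, 20%, keep scaling
	fmt.Println(pauseScaling(4, 50)) // true: 4 abnormal nodes reaches the limit, pause
}
```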
Fig. 2 shows a schematic structural diagram of a device for scheduling jobs in a cluster according to an embodiment of the present invention. As shown in Fig. 2, the device 200 for scheduling jobs in a cluster includes:
a Pod data obtaining unit 210, configured to obtain container pod (Pod) data corresponding to a job;
a scheduling unit 220, configured to select one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster;
a Pod deployment unit 230, configured to deploy a Pod on each target node according to the Pod data, and run the job's processes in the deployed Pods.
As can be seen, with the device shown in Fig. 2, after container pod (Pod) data corresponding to a job is obtained, several target nodes are selected from the cluster according to the node scheduling condition in the data and the node state of each node, a Pod is then deployed on each target node according to the Pod data, and the job's processes are run in the deployed Pods. This technical solution deploys job processes in containers, placing different jobs in separate, independent containers, so that the jobs of different applications on the cluster do not interfere with one another yet can still communicate in scenarios that require interaction. This makes effective use of cluster resources, and projects that require the pipelined cooperation of many applications, such as deep learning, can achieve stable job scheduling.
In one embodiment of the invention, in above-mentioned apparatus, node scheduling condition includes rigid schedulable condition and/or soft Property schedulable condition;Scheduling unit 220 belongs to target section if the node state for a node meets rigid schedulable condition Point;And/or scoring is scheduled according to the node state of each node and soft schedulable condition, one is selected according to appraisal result Or multiple destination nodes.
In one embodiment of the invention, in the above device, the hard scheduling condition includes hardware information of the node and/or zone information of the node.
In one embodiment of the invention, in the above device, the scheduling unit 220 is configured to select multiple nodes located in the same availability zone as target nodes, and the Pod deployment unit 230 is configured to deploy one Pod on each target node according to the Pod data, so as to form multiple instances of the job.
In one embodiment of the invention, in the above device, the node scheduling condition further includes an instance-number lower limit and an instance-number upper limit; the scheduling unit 220 is configured to, when the number of selected nodes is greater than or equal to the instance-number lower limit, take the smaller of the number of selected nodes and the instance-number upper limit as the number of target nodes, and, when the number of selected nodes is less than the instance-number lower limit, terminate this job scheduling.
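This decision rule is simple enough to state directly; a sketch, with names of our own choosing:

    def decide_instance_count(selected: int, lower: int, upper: int):
        # Returns the number of target nodes, or None to terminate scheduling.
        if selected < lower:
            return None              # too few nodes: terminate this job scheduling
        return min(selected, upper)  # otherwise the smaller of the two values

    assert decide_instance_count(5, 2, 3) == 3   # capped by the instance upper limit
    assert decide_instance_count(2, 2, 3) == 2   # node count is the smaller value
    assert decide_instance_count(1, 2, 3) is None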
In one embodiment of the invention, in the above device, the soft scheduling condition includes a job affinity condition, and the node state includes the job corresponding to each Pod deployed on the node.
In one embodiment of the invention, in the above device, the job corresponds to one or more of the following applications and/or services: a deep learning system, a Web service, a log collector, a distributed queue service, a log connector.
In one embodiment of the invention, in the above device, the job is a deep learning training job; the deployment unit is configured to run a parameter server process and a trainer process in each deployed Pod. The trainer process obtains deep learning tasks from the meta-information management node of the deep learning system, trains the local deep learning model to obtain gradients, sends the gradients to the parameter server process, and obtains updated parameters from the parameter server process. The parameter server process saves training snapshots to distributed storage at predetermined intervals, so that training can resume from a snapshot when a Pod or a process in a Pod restarts. The parameter server process and/or the trainer process store the deep learning model to distributed storage.
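The following toy Python sketch mirrors this division of labor. The one-parameter "model", the gradient formula, and the local snapshot file standing in for distributed storage are all assumptions made for the sketch, not the deep learning framework the embodiment relies on.

    import json
    import os
    import time

    SNAPSHOT = "/tmp/ps_snapshot.json"   # local stand-in for distributed storage
    SNAPSHOT_INTERVAL = 5.0              # the "predetermined interval", in seconds

    class ParameterServer:
        def __init__(self):
            self.params = {"w": 0.0}
            self.last_snapshot = time.monotonic()
            if os.path.exists(SNAPSHOT):        # a restarted Pod/process
                with open(SNAPSHOT) as f:       # resumes from the snapshot
                    self.params = json.load(f)

        def apply_gradient(self, grad, lr=0.1):
            self.params["w"] -= lr * grad["w"]
            if time.monotonic() - self.last_snapshot >= SNAPSHOT_INTERVAL:
                with open(SNAPSHOT, "w") as f:  # periodic training snapshot
                    json.dump(self.params, f)
                self.last_snapshot = time.monotonic()
            return self.params                  # updated parameters

    class Trainer:
        def __init__(self, server: ParameterServer):
            self.server = server
            self.local_w = server.params["w"]

        def step(self, task: float):
            # "Train" on one task pulled from the meta-information management
            # node (here just a number): gradient of (w - task)**2 w.r.t. w.
            grad = {"w": 2 * (self.local_w - task)}
            updated = self.server.apply_gradient(grad)   # send the gradient,
            self.local_w = updated["w"]                  # pull updated parameters

    server = ParameterServer()
    trainer = Trainer(server)
    for task in [1.0, 1.0, 1.0]:
        trainer.step(task)
    print(server.params)                 # w has moved toward the task value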
In one embodiment of the invention, in the above device, the node scheduling condition includes a resource-request lower limit and a resource-request upper limit corresponding to each computing resource; the scheduling unit 220 is configured to calculate, according to the Pods already deployed on each node, the schedulable-resource upper limit and lower limit of each computing resource, and to select one or more target nodes from the cluster when the resource-request lower limit of each computing resource in the node scheduling condition is less than or equal to the schedulable-resource lower limit of the corresponding computing resource.
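The disclosure does not spell out how the schedulable bounds are computed; one plausible reading, sketched below under that assumption, derives them from the node capacity minus the requests of the Pods already deployed.

    from typing import List, Tuple

    def schedulable_bounds(node_capacity: float,
                           deployed: List[Tuple[float, float]]) -> Tuple[float, float]:
        # deployed holds the (lower, upper) resource requests of each Pod
        # already on the node. The schedulable lower limit assumes every
        # deployed Pod may consume up to its upper request; the schedulable
        # upper limit assumes each consumes only its guaranteed lower request.
        used_upper = sum(hi for _, hi in deployed)
        used_lower = sum(lo for lo, _ in deployed)
        return (max(0.0, node_capacity - used_upper),
                max(0.0, node_capacity - used_lower))

    def node_admits(job_request_lower: float, node_capacity: float,
                    deployed: List[Tuple[float, float]]) -> bool:
        schedulable_lower, _ = schedulable_bounds(node_capacity, deployed)
        # The job's resource-request lower limit must not exceed the
        # schedulable-resource lower limit of the node.
        return job_request_lower <= schedulable_lower

    print(node_admits(2.0, 8.0, [(1.0, 3.0), (1.0, 2.0)]))   # True: 8 - 5 = 3 >= 2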
In one embodiment of the invention, in the above device, the node scheduling condition further includes a job priority; the scheduling unit 220 is configured to, when a resource-request lower limit in the node scheduling condition is greater than the schedulable-resource lower limit, kill or block deployed Pods according to the job priority, or terminate this job scheduling.
In one embodiment of the invention, in the above device, the scheduling unit 220 is configured to block a Pod when the resource corresponding to the node scheduling condition is a compressible resource, and to kill the Pod when the resource is an incompressible resource.
In one embodiment of the invention, in the above device, the scheduling unit 220 is further configured to allocate computing resources to the Pods deployed on each node according to the node scheduling condition; if the sum of the resource-request upper limits of a compressible resource across the Pods deployed on a node is less than the node's upper limit for that compressible resource, the unallocated compressible resource is distributed proportionally among the Pods deployed on the node.
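A sketch of this proportional hand-out of the surplus compressible resource; we assume "proportionally" means in proportion to each Pod's resource-request upper limit, which the text does not state explicitly.

    from typing import Dict

    def distribute_compressible(node_limit: float,
                                pod_uppers: Dict[str, float]) -> Dict[str, float]:
        # Split a node's compressible resource (e.g. CPU) among its Pods;
        # pod_uppers maps Pod name -> resource-request upper limit.
        requested = sum(pod_uppers.values())
        if requested >= node_limit or requested == 0:
            return dict(pod_uppers)          # nothing left over to distribute
        surplus = node_limit - requested
        # Hand the unallocated resource out in proportion to each upper limit.
        return {name: upper + surplus * (upper / requested)
                for name, upper in pod_uppers.items()}

    print(distribute_compressible(10.0, {"pod-a": 2.0, "pod-b": 6.0}))
    # -> {'pod-a': 2.5, 'pod-b': 7.5}: the surplus of 2.0 is split 1:3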
In one embodiment of the invention, in the above device, the scheduling unit 220 is further configured to calculate a memory-occupation score for each job process and to kill a job process when its calculated memory-occupation score reaches the preset value corresponding to that process.
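The text states only that a memory-occupation score is computed and compared to a preset value; the scoring formula below (actual usage relative to the request) is therefore an illustrative assumption.

    from typing import List, Tuple

    def memory_score(rss_bytes: int, request_bytes: int) -> float:
        # Illustrative score: actual usage relative to the requested memory.
        return rss_bytes / max(request_bytes, 1)

    def processes_to_kill(processes: List[Tuple[str, int, int]],
                          preset: float = 1.0) -> List[str]:
        # processes holds (name, rss_bytes, request_bytes) per job process;
        # returns the names whose score reached the preset value.
        return [name for name, rss, req in processes
                if memory_score(rss, req) >= preset]

    procs = [("train-1", 900 << 20, 1024 << 20),    # within its request
             ("train-2", 2048 << 20, 1024 << 20)]   # double its request
    print(processes_to_kill(procs))                  # -> ['train-2']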
In one embodiment of the invention, in the above device, the scheduling unit 220 is configured to obtain the CPU utilization of each Pod deployed from the same Pod data, and to calculate an adjusted Pod count according to the arithmetic mean of the CPU utilizations and the node scheduling condition in the Pod data.
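The adjustment formula is not given in the text; the sketch below assumes the familiar proportional rule (desired count scales with mean utilization over a target level), which is consistent with, but not dictated by, the description.

    import math
    from typing import List

    def adjusted_pod_count(cpu_utilizations: List[float], target_utilization: float,
                           min_pods: int = 1, max_pods: int = 100) -> int:
        # cpu_utilizations: current utilization of each Pod deployed from
        # the same Pod data; target_utilization: the level implied by the
        # node scheduling condition.
        current = len(cpu_utilizations)
        mean = sum(cpu_utilizations) / current              # arithmetic mean
        desired = math.ceil(current * mean / target_utilization)
        return max(min_pods, min(max_pods, desired))

    print(adjusted_pod_count([0.9, 0.7, 0.8], target_utilization=0.5))   # -> 5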
In one embodiment of the invention, in the above device, the scheduling unit 220 is further configured to monitor whether there are Pods in the cluster that have not been successfully scheduled; if so, to further judge whether there are nodes available for scale-up; and if so, to start at least some of those nodes and schedule the unscheduled Pods onto the newly started nodes.
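A minimal sketch of this scale-up path, with start_node standing in for whatever mechanism actually boots a node:

    from typing import Callable, List

    def scale_up_if_needed(pending_pods: List[str], stopped_nodes: List[str],
                           start_node: Callable[[str], str]) -> List[str]:
        # pending_pods: Pods not yet successfully scheduled; stopped_nodes:
        # nodes whose capacity can still be brought online. Returns the nodes
        # actually started; the pending Pods would then be scheduled onto them.
        if not pending_pods or not stopped_nodes:
            return []
        # Start at most as many nodes as there are pending Pods.
        to_start = stopped_nodes[:len(pending_pods)]
        return [start_node(name) for name in to_start]

    started = scale_up_if_needed(["pod-a", "pod-b"],
                                 ["node-x", "node-y", "node-z"],
                                 start_node=lambda name: name)   # stand-in starter
    print(started)   # -> ['node-x', 'node-y']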
In one embodiment of the invention, in the above device, the scheduling unit 220 is configured to judge, according to the node state of each node, whether a node meets the scale-down condition; if so, to close the corresponding node and, when Pods have been deployed on that node, to schedule the deployed Pods to other nodes in the cluster.
In one embodiment of the invention, in the above device, the scale-down condition includes one or more of the following: the computing-resource utilization of the node is below a preset value; the Pods deployed on the node can be scheduled to other nodes in the cluster; the Pods deployed on the node are confirmed as able to drift according to the PodDisruptionBudget controller; the node has no local storage.
In one embodiment of the invention, in the above device, the scheduling unit 220 is configured to perform scale-up and/or scale-down of the cluster according to one or more of the following strategies: selecting nodes at random; selecting nodes according to the number of deployed Pods; selecting nodes according to computing-resource utilization; selecting nodes according to the usage price of the physical machines; and pausing scale-up and/or scale-down when a preset number and/or a preset proportion of nodes in the cluster become abnormal.
The specific implementations of the above device embodiments may refer to the corresponding method embodiments described earlier and are not repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiments are expressed as series of action combinations; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, since according to the present invention certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Each of the above embodiments is described with its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
Fig. 4 shows a block diagram of an exemplary computer system/server 12 suitable for implementing embodiments of the present invention. The computer system/server 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 4, the computer system/server 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processor 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 12 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the computer system/server 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as random-access memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. Merely as an example, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a disk drive for reading and writing removable non-volatile magnetic disks (such as "floppy disks") and an optical disk drive for reading and writing removable non-volatile optical disks (such as CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored in, for example, the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the present invention.
The computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, the computer system/server 12 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in Fig. 4, the network adapter 20 communicates with the other modules of the computer system/server 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 16 performs various functional applications and data processing by running programs stored in the memory 28, for example implementing the method in the embodiment shown in Fig. 1.
The present invention discloses a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method in the embodiment shown in Fig. 1 is implemented.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted using any suitable medium, including, but not limited to, wireless, wireline, optical cable, RF, or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed devices, methods, and the like may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only one kind of logical functional division, and other division manners are possible in actual implementation.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disk.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (38)

1. A method of scheduling jobs in a cluster, characterized in that the method comprises:
obtaining container group (Pod) data corresponding to a job;
selecting one or more target nodes from the cluster according to a node scheduling condition in the Pod data and the node state of each node in the cluster;
deploying Pods on the target nodes according to the Pod data, and running a job process of the job in the deployed Pods.
2. The method according to claim 1, characterized in that the node scheduling condition comprises a hard scheduling condition and/or a soft scheduling condition;
the selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster comprises:
if the node state of a node meets the hard scheduling condition, treating the node as a target node;
and/or
scoring each node according to its node state and the soft scheduling condition, and selecting one or more target nodes according to the scoring result.
3. The method according to claim 2, characterized in that the hard scheduling condition comprises hardware information of a node and/or zone information of a node.
4. The method according to claim 3, characterized in that the selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster comprises:
selecting multiple nodes located in the same availability zone as the target nodes;
and the deploying Pods on the target nodes according to the Pod data comprises: deploying one Pod on each target node according to the Pod data, so as to form multiple instances of the job.
5. The method according to claim 4, characterized in that the node scheduling condition further comprises an instance-number lower limit and an instance-number upper limit; the selecting multiple nodes located in the same availability zone as the target nodes further comprises:
when the number of selected nodes is greater than or equal to the instance-number lower limit, taking the smaller of the number of selected nodes and the instance-number upper limit as the number of target nodes;
when the number of selected nodes is less than the instance-number lower limit, terminating this job scheduling.
6. The method according to claim 2, characterized in that the soft scheduling condition comprises a job affinity condition, and the node state comprises the job corresponding to each Pod deployed on the node.
7. The method according to claim 1, characterized in that the job corresponds to one or more of the following applications and/or services:
a deep learning system, a Web service, a log collector, a distributed queue service, a log connector.
8. The method according to claim 7, characterized in that the job is a deep learning training job, and the running a job process of the job in the deployed Pods comprises:
running a parameter server process and a trainer process in each deployed Pod, the trainer process obtaining deep learning tasks from a meta-information management node of the deep learning system, training the local deep learning model to obtain gradients, sending the gradients to the parameter server process, and obtaining updated parameters from the parameter server process;
the parameter server process saving training snapshots to distributed storage at predetermined intervals, so that training is resumed from a training snapshot when a Pod or a process in a Pod restarts;
the parameter server process and/or the trainer process storing the deep learning model to distributed storage.
9. The method according to claim 1, characterized in that the node scheduling condition comprises a resource-request lower limit and a resource-request upper limit corresponding to each computing resource;
the selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster comprises:
calculating, according to the Pods already deployed on each node, a schedulable-resource upper limit and a schedulable-resource lower limit of each computing resource;
when the resource-request lower limit of each computing resource in the node scheduling condition is less than or equal to the schedulable-resource lower limit of the corresponding computing resource, selecting one or more target nodes from the cluster.
10. The method according to claim 9, characterized in that the node scheduling condition further comprises a job priority;
the selecting one or more target nodes from the cluster according to the node scheduling condition in the Pod data and the node state of each node in the cluster further comprises:
when a resource-request lower limit in the node scheduling condition is greater than the schedulable-resource lower limit, killing or blocking deployed Pods according to the job priority, or terminating this job scheduling.
11. The method according to claim 10, characterized in that the killing or blocking of Pods comprises:
blocking a Pod when the resource corresponding to the node scheduling condition is a compressible resource;
killing a Pod when the resource corresponding to the node scheduling condition is an incompressible resource.
12. The method according to claim 9, characterized in that the method further comprises:
allocating computing resources to the Pods deployed on each node according to the node scheduling condition; wherein, if the sum of the resource-request upper limits of a compressible resource across the Pods deployed on a node is less than the node's upper limit for that compressible resource, the unallocated compressible resource is distributed proportionally among the Pods deployed on the node.
13. The method according to claim 12, characterized in that the method further comprises:
calculating a memory-occupation score for each job process, and killing a job process when its calculated memory-occupation score reaches a preset value corresponding to that job process.
14. The method according to claim 9, characterized in that the method further comprises:
obtaining the CPU utilization of each Pod deployed from the same Pod data, and calculating an adjusted Pod count according to the arithmetic mean of the CPU utilizations and the node scheduling condition in the Pod data.
15. The method according to claim 1, characterized in that the method further comprises:
monitoring whether there are Pods in the cluster that have not been successfully scheduled, and if so, further judging whether there are nodes available for scale-up;
if so, starting at least some of the nodes available for scale-up, and scheduling the Pods that have not been successfully scheduled onto the newly started nodes.
16. The method according to claim 1, characterized in that the method further comprises:
judging, according to the node state of each node, whether a node meets a scale-down condition; if so, closing the corresponding node and, when Pods have been deployed on the node, scheduling the deployed Pods to other nodes in the cluster.
17. The method according to claim 16, characterized in that the scale-down condition comprises one or more of the following:
the computing-resource utilization of a node is below a preset value;
the Pods deployed on a node can be scheduled to other nodes in the cluster;
the Pods deployed on a node are confirmed as able to drift according to a PodDisruptionBudget controller;
a node has no local storage.
18. The method according to any one of claims 15-17, characterized in that scale-up and/or scale-down of the cluster is performed according to one or more of the following strategies:
selecting nodes at random;
selecting nodes according to the number of deployed Pods;
selecting nodes according to computing-resource utilization;
selecting nodes according to the usage price of the physical machine;
pausing scale-up and/or scale-down when a preset number and/or a preset proportion of nodes in the cluster are abnormal.
19. A device for scheduling jobs in a cluster, characterized in that the device comprises:
a Pod data acquisition unit, configured to obtain container group (Pod) data corresponding to a job;
a scheduling unit, configured to select one or more target nodes from the cluster according to a node scheduling condition in the Pod data and the node state of each node in the cluster;
a Pod deployment unit, configured to deploy Pods on the target nodes according to the Pod data, a job process of the job being run in the deployed Pods.
20. The device according to claim 19, characterized in that the node scheduling condition comprises a hard scheduling condition and/or a soft scheduling condition;
the scheduling unit is configured to treat a node as a target node if its node state meets the hard scheduling condition; and/or to score each node according to its node state and the soft scheduling condition and select one or more target nodes according to the scoring result.
21. The device according to claim 20, characterized in that the hard scheduling condition comprises hardware information of a node and/or zone information of a node.
22. The device according to claim 21, characterized in that:
the scheduling unit is configured to select multiple nodes located in the same availability zone as the target nodes;
the Pod deployment unit is configured to deploy one Pod on each target node according to the Pod data, so as to form multiple instances of the job.
23. The device according to claim 22, characterized in that the node scheduling condition further comprises an instance-number lower limit and an instance-number upper limit;
the scheduling unit is configured to, when the number of selected nodes is greater than or equal to the instance-number lower limit, take the smaller of the number of selected nodes and the instance-number upper limit as the number of target nodes, and, when the number of selected nodes is less than the instance-number lower limit, terminate this job scheduling.
24. The device according to claim 20, characterized in that the soft scheduling condition comprises a job affinity condition, and the node state comprises the job corresponding to each Pod deployed on the node.
25. The device according to claim 19, characterized in that the job corresponds to one or more of the following applications and/or services:
a deep learning system, a Web service, a log collector, a distributed queue service, a log connector.
26. The device according to claim 25, characterized in that the job is a deep learning training job;
the deployment unit is configured to run a parameter server process and a trainer process in each deployed Pod, the trainer process obtaining deep learning tasks from a meta-information management node of the deep learning system, training the local deep learning model to obtain gradients, sending the gradients to the parameter server process, and obtaining updated parameters from the parameter server process; the parameter server process saving training snapshots to distributed storage at predetermined intervals, so that training is resumed from a training snapshot when a Pod or a process in a Pod restarts; and the parameter server process and/or the trainer process storing the deep learning model to distributed storage.
27. The device according to claim 19, characterized in that the node scheduling condition comprises a resource-request lower limit and a resource-request upper limit corresponding to each computing resource;
the scheduling unit is configured to calculate, according to the Pods already deployed on each node, a schedulable-resource upper limit and a schedulable-resource lower limit of each computing resource, and to select one or more target nodes from the cluster when the resource-request lower limit of each computing resource in the node scheduling condition is less than or equal to the schedulable-resource lower limit of the corresponding computing resource.
28. The device according to claim 27, characterized in that the node scheduling condition further comprises a job priority;
the scheduling unit is configured to, when a resource-request lower limit in the node scheduling condition is greater than the schedulable-resource lower limit, kill or block deployed Pods according to the job priority, or terminate this job scheduling.
29. The device according to claim 28, characterized in that:
the scheduling unit is configured to block a Pod when the resource corresponding to the node scheduling condition is a compressible resource, and to kill the Pod when the resource corresponding to the node scheduling condition is an incompressible resource.
30. The device according to claim 27, characterized in that:
the scheduling unit is further configured to allocate computing resources to the Pods deployed on each node according to the node scheduling condition; wherein, if the sum of the resource-request upper limits of a compressible resource across the Pods deployed on a node is less than the node's upper limit for that compressible resource, the unallocated compressible resource is distributed proportionally among the Pods deployed on the node.
31. The device according to claim 30, characterized in that:
the scheduling unit is further configured to calculate a memory-occupation score for each job process, and to kill a job process when its calculated memory-occupation score reaches a preset value corresponding to that job process.
32. The device according to claim 27, characterized in that:
the scheduling unit is configured to obtain the CPU utilization of each Pod deployed from the same Pod data, and to calculate an adjusted Pod count according to the arithmetic mean of the CPU utilizations and the node scheduling condition in the Pod data.
33. The device according to claim 19, characterized in that:
the scheduling unit is further configured to monitor whether there are Pods in the cluster that have not been successfully scheduled, and if so, to further judge whether there are nodes available for scale-up; if so, to start at least some of the nodes available for scale-up and schedule the Pods that have not been successfully scheduled onto the newly started nodes.
34. The device according to claim 19, characterized in that:
the scheduling unit is configured to judge, according to the node state of each node, whether a node meets a scale-down condition, and if so, to close the corresponding node and, when Pods have been deployed on the node, to schedule the deployed Pods to other nodes in the cluster.
35. The device according to claim 34, characterized in that the scale-down condition comprises one or more of the following: the computing-resource utilization of a node is below a preset value; the Pods deployed on a node can be scheduled to other nodes in the cluster; the Pods deployed on a node are confirmed as able to drift according to a PodDisruptionBudget controller; a node has no local storage.
36. The device according to any one of claims 33-35, characterized in that:
the scheduling unit is configured to perform scale-up and/or scale-down of the cluster according to one or more of the following strategies: selecting nodes at random; selecting nodes according to the number of deployed Pods; selecting nodes according to computing-resource utilization; selecting nodes according to the usage price of the physical machine; pausing scale-up and/or scale-down when a preset number and/or a preset proportion of nodes in the cluster are abnormal.
37. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1-18.
38. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-18.