CN110502340A

CN110502340A - Dynamic resource adjustment method, apparatus, device, and storage medium

Info

Publication number: CN110502340A
Application number: CN201910736569.7A
Authority: CN
Inventors: 王超
Original assignee: Guangdong Inspur Big Data Research Co Ltd
Current assignee: Guangdong Inspur Smart Computing Technology Co Ltd
Priority date: 2019-08-09
Filing date: 2019-08-09
Publication date: 2019-11-26

Abstract

The invention discloses a dynamic resource adjustment method, apparatus, and device based on a Kubernetes cluster, and a computer-readable storage medium. During the training of a training task, adjustment information can be generated from the idle resource information of the cluster and the current resource usage information of each training task, and the container group (Pod) of the training task is re-created according to that adjustment information. This dynamically adjusts the resources used by the training task, makes reasonable use of cluster resources, and greatly reduces training time.

Description

Dynamic resource adjustment method, apparatus, device, and storage medium

Technical field

The present invention relates to the technical field of training resource adjustment, and more specifically to a dynamic resource adjustment method, apparatus, and device based on a Kubernetes cluster, and a computer-readable storage medium.

Background

At present, deep learning training tasks can be executed using container technology. This is a fast and effective way to train: it removes the user's dependence on particular system configurations and environments, and Kubernetes (a container cluster management technology) can be used to manage the containers, so that large-scale tasks can be trained in containers. However, when a deep learning training task is executed using container technology, each resource is preset and cannot be changed during training. As a result, when the training model is complex or the training data is very large, training can take too long because resources are used unreasonably.

Summary of the invention

The purpose of the present invention is to provide a dynamic resource adjustment method, apparatus, and device based on a Kubernetes cluster, and a computer-readable storage medium, so as to make reasonable use of cluster resources and reduce the execution time of training tasks.

To achieve the above object, the present invention provides a dynamic resource adjustment method based on a Kubernetes cluster, including:

obtaining idle resource information of the cluster, and current resource usage information corresponding to the container group of each training task;

determining a target container group using the current resource usage information, the target container group being the container group of a target training task whose resources are to be adjusted;

determining adjustment information for adjusting the resources of the target container group, the adjustment information being determined using the idle resource information and the current resource usage information; and

re-creating the container group using the adjustment information, so that the target training task continues to execute in the re-created container group.

Optionally, after determining the target container group using the current resource usage information, the method further includes:

displaying the idle resource information and the current resource usage information corresponding to the container group of each training task on a visual interface; and

generating prompt information for adjusting the resources of the target container group.

Optionally, determining the adjustment information for adjusting the resources of the target container group includes:

receiving adjustment information, sent by a user, for adjusting the resources of the target container group; or

automatically generating the adjustment information for adjusting the resources of the target container group using a preset resource adjustment rule.

Optionally, the adjustment information includes adjustment information for at least one of the number of CPUs, the number of GPUs, and the amount of memory used.

Optionally, re-creating the container group using the adjustment information, so that the target training task continues to execute in the re-created container group, includes:

terminating the training task of the target container group; and

re-creating the container group using the adjustment information and the Checkpoint file corresponding to the target container group, so that the target training task continues to execute in the re-created container group.

To achieve the above object, the present invention further provides a dynamic resource adjustment apparatus based on a Kubernetes cluster, including:

a first information obtaining module, configured to obtain idle resource information of the cluster;

a second information obtaining module, configured to obtain current resource usage information corresponding to the container group of each training task;

a target container group determining module, configured to determine a target container group using the current resource usage information, the target container group being the container group of a target training task whose resources are to be adjusted;

an adjustment information determining module, configured to determine adjustment information for adjusting the resources of the target container group, the adjustment information being determined using the idle resource information and the current resource usage information; and

a container group creation module, configured to re-create the container group using the adjustment information, so that the target training task continues to execute in the re-created container group.

Optionally, the dynamic resource adjustment apparatus further includes:

a display module, configured to display the idle resource information and the current resource usage information corresponding to the container group of each training task on a visual interface; and

a prompt module, configured to generate prompt information for adjusting the resources of the target container group.

To achieve the above object, the present invention further provides a dynamic resource adjustment device based on a Kubernetes cluster, including: a memory, configured to store a computer program; and a processor, configured to implement the steps of the above dynamic resource adjustment method when executing the computer program.

To achieve the above object, the present invention further provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above dynamic resource adjustment method when executed by a processor.

As can be seen from the above scheme, the dynamic resource adjustment method based on a Kubernetes cluster provided by an embodiment of the present invention includes: obtaining idle resource information of the cluster, and current resource usage information corresponding to the container group of each training task; determining a target container group using the idle resource information and the current resource usage information, the target container group being the container group of a target training task whose resources are to be adjusted; determining adjustment information for adjusting the resources of the target container group; and re-creating the container group using the adjustment information, so that the target training task continues to execute in the re-created container group.

It can be seen that, during training, the application can generate adjustment information from the idle resource information of the cluster and the current resource usage information of each training task, and re-create the container group of the training task according to that adjustment information. This dynamically adjusts the resources used by the training task, makes reasonable use of cluster resources, and greatly reduces training time. The present invention also discloses a dynamic resource adjustment apparatus and device based on a Kubernetes cluster and a computer-readable storage medium, which achieve the same technical effect.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a dynamic resource adjustment method based on a Kubernetes cluster disclosed by an embodiment of the present invention;

Fig. 2 is a schematic diagram of resource expansion for the single-machine single-card type disclosed by an embodiment of the present invention;

Fig. 3 is a schematic diagram of resource expansion for the single-machine multi-card type disclosed by an embodiment of the present invention;

Fig. 4 is a schematic diagram of resource expansion for the multi-machine multi-card type disclosed by an embodiment of the present invention;

Fig. 5 is a schematic flowchart of a dynamic resource adjustment scheme based on a Kubernetes cluster disclosed by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a dynamic resource adjustment apparatus based on a Kubernetes cluster disclosed by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a dynamic resource adjustment device based on a Kubernetes cluster disclosed by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The embodiments of the present invention disclose a dynamic resource adjustment method, apparatus, and device based on a Kubernetes cluster, and a computer-readable storage medium, so as to make reasonable use of cluster resources and reduce the execution time of training tasks.

Referring to Fig. 1, a dynamic resource adjustment method based on a Kubernetes cluster provided by an embodiment of the present invention includes:

S101: obtaining idle resource information of the cluster, and current resource usage information corresponding to the container group of each training task.

It can be understood that a large Kubernetes cluster has many idle computing resources. Therefore, in this application, if a task with a complex training model or very large training data is detected, the resources of the training task can be expanded to accelerate it.

In this application, the idle resource information of the cluster is the current usage of cluster resources, obtained by analyzing the resource monitoring information that the underlying Kubernetes collects for the computing cluster. The idle resource information may include idle CPU (Central Processing Unit) resources, idle GPU (Graphics Processing Unit) resources, idle memory resources, and so on, and is not specifically limited here. The computing cluster runs the training tasks that are currently being trained, and these may be deep learning tasks. For each training task, the deep learning training is optimized through the Pod scaling mechanism. A Pod is the smallest unit that can be created and deployed in Kubernetes; it is one application instance in the Kubernetes cluster and is always deployed on a single node. A Pod contains one or more containers, together with resources shared by those containers, such as storage and network.

Therefore, this application uses the Pod scaling mechanism of Kubernetes to optimize deep learning training tasks. Adjusting resources through Pods allows a training task to be expanded dynamically when its running resources are insufficient, so that cluster resources are fully used and the training task is accelerated. In addition to the idle resource information, this application therefore also needs to obtain the current resource usage information corresponding to the container group of each training task in training. The current resource usage information may likewise include CPU usage information, GPU usage information, memory usage information, and so on, and may also include how long the training task has been running on its current resources.
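As an illustration of step S101, the following minimal sketch derives idle resources by subtracting the resources held by running training Pods from the cluster capacity. The `Resources` type, the `idle_resources` function, and the sample figures are hypothetical names and values introduced here for illustration; the patent does not specify how the monitoring data is represented or queried.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for the three resource kinds named in the text."""
    cpus: int        # number of CPU cores
    gpus: int        # number of GPU cards
    memory_gb: int   # memory in GB


def idle_resources(capacity: Resources, pod_usage: List[Resources]) -> Resources:
    """Idle = cluster capacity minus what all running training Pods hold."""
    used_cpus = sum(u.cpus for u in pod_usage)
    used_gpus = sum(u.gpus for u in pod_usage)
    used_mem = sum(u.memory_gb for u in pod_usage)
    # Clamp at zero so inconsistent monitoring data never yields negatives.
    return Resources(
        cpus=max(capacity.cpus - used_cpus, 0),
        gpus=max(capacity.gpus - used_gpus, 0),
        memory_gb=max(capacity.memory_gb - used_mem, 0),
    )


# Assumed sample data: a small cluster running two training Pods.
cluster = Resources(cpus=16, gpus=4, memory_gb=64)
pods = [Resources(cpus=1, gpus=1, memory_gb=2), Resources(cpus=2, gpus=1, memory_gb=8)]
print(idle_resources(cluster, pods))
```

In a real deployment the capacity and usage figures would come from the cluster's monitoring data rather than from literals.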

S102: determining a target container group using the current resource usage information, the target container group being the container group of a target training task whose resources are to be adjusted.

It can be understood that, after the idle resource information and the current resource usage information of the training tasks are obtained, the target container group can be determined using this information. The target container group is the container group of the target training task that needs resource adjustment. For example, if the current resource usage information of a training task shows that GPU utilization is too high (above 90%) and the idle resource information of the cluster shows idle GPU resources, that training task can be taken as the target training task and its Pod as the target container group, so that the resources used by the target training task can be adjusted in the subsequent steps.

It should be noted that resource adjustment in this application may be dynamic expansion or dynamic reduction. Resource expansion means adding resources to a training task in order to shorten its training time. Resource reduction means that, when a training task has too many unused resources, its resources are reduced so that other training tasks can use them.

S103: determining adjustment information for adjusting the resources of the target container group, the adjustment information being determined using the idle resource information and the current resource usage information, wherein the adjustment information includes adjustment information for at least one of the number of CPUs, the number of GPUs, and the amount of memory used.

In this application, once the target container group to be adjusted is determined, the adjustment information can be generated from the current resource usage information of the target container group and the idle resource information of the cluster. For example, if the current resource usage information of the target training task shows that GPU utilization is too high (above 90%) and the idle resource information of the cluster shows idle GPU resources, the number of GPUs of the target training task can be increased. If the resources before adjustment are 1 GPU, 1 CPU, and 2 GB of memory and the generated adjustment information adds 1 GPU, the resources after adjustment are 2 GPUs, 1 CPU, and 2 GB of memory.
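The worked example above (1 GPU, 1 CPU, 2 GB becoming 2 GPUs, 1 CPU, 2 GB) can be sketched as follows. `PodResources` and `add_gpus` are hypothetical names, and refusing an adjustment that exceeds the idle GPU count is an assumption about how the idle resource information constrains the adjustment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PodResources:
    """Hypothetical resource configuration of one training Pod."""
    gpus: int
    cpus: int
    memory_gb: int


def add_gpus(current: PodResources, idle_gpus: int, extra: int = 1) -> PodResources:
    """Return the adjusted configuration, or fail if the cluster lacks idle GPUs."""
    if extra > idle_gpus:
        raise ValueError(f"requested {extra} GPU(s) but only {idle_gpus} idle")
    return replace(current, gpus=current.gpus + extra)


before = PodResources(gpus=1, cpus=1, memory_gb=2)
after = add_gpus(before, idle_gpus=2, extra=1)
print(before, "->", after)
```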

S104: re-creating the container group using the adjustment information, so that the target training task continues to execute in the re-created container group.

It should be noted that, after the adjustment information for the target container group is determined, the dynamic resource adjustment can be achieved by restarting the Pod. Restarting causes the training to terminate. Therefore, in order to continue the training task after the Pod restarts, the training script needs to use the API (Application Programming Interface) that the corresponding deep learning framework provides for reading Checkpoint files, so that the training script reads the Checkpoint file through that API and continues the training task. The user needs to configure the training script to generate the Checkpoint file. A Checkpoint file is a file, generated by training frameworks such as TensorFlow and PyTorch, that saves the model and the training progress; it allows training to be restarted after a task is interrupted unexpectedly, or from a specified training position.

Therefore, in this application, re-creating the container group using the adjustment information, so that the target training task continues to execute in the re-created container group, includes: terminating the training task of the target container group; and re-creating the container group using the adjustment information and the Checkpoint file corresponding to the target container group, so that the target training task continues to execute in the re-created container group.

That is, after the adjustment information is determined, the training task of the target container group is terminated, and a Pod carrying the training task is re-created with the new resource configuration. Using the Kubernetes mounting mechanism, the training files that the user has persisted (checkpoint or model files) are mounted into the newly created Pod, so that the new Pod continues the training task.

As can be seen, the dynamic resource expansion scheme for deep learning training tasks on a Kubernetes cluster disclosed in this application can dynamically adjust the resources used by deep learning training running on the cluster. For long-running training on large-scale datasets, expanding resources in this way speeds up training while improving the utilization of cluster resources.
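The resume-from-checkpoint behaviour that the training script must provide can be illustrated without any particular framework. The sketch below is a framework-agnostic stand-in that uses only the Python standard library: a real script would call the checkpoint API of TensorFlow or PyTorch instead, and the `train` function, the JSON file layout, and the `stop_after` parameter (which simulates the Pod being terminated) are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Optional


def train(ckpt_path: str, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Toy training loop that resumes from ckpt_path when the file exists."""
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):
        # Resume: read the saved global step and model state.
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        state["weight"] += 0.5          # stand-in for one real update step
        state["step"] += 1
        with open(ckpt_path, "w") as f:  # persist progress after every step
            json.dump(state, f)
        if stop_after is not None and state["step"] >= stop_after:
            break                        # simulate the Pod being terminated
    return state


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "ckpt.json")
    print(train(path, total_steps=10, stop_after=4))  # first Pod, interrupted
    print(train(path, total_steps=10))                # re-created Pod resumes
```

The second call starts from the saved step rather than from zero, which is the property the re-created Pod relies on.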
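A Pod specification with the new resource configuration and the persisted training files mounted might look like the sketch below, written as a plain Python dictionary. The function name, the container name `trainer`, the mount path, and the use of a PersistentVolumeClaim are assumptions introduced here; the patent only states that persisted files are mounted. The `nvidia.com/gpu` resource name additionally assumes that the NVIDIA device plugin is installed in the cluster.

```python
def build_pod_manifest(name: str, image: str, cpus: int, gpus: int,
                       memory_gb: int, pvc_name: str,
                       mount_path: str = "/workspace/ckpt") -> dict:
    """Build a Pod manifest carrying the adjusted resource limits."""
    limits = {"cpu": str(cpus), "memory": f"{memory_gb}Gi"}
    if gpus > 0:
        # Assumption: GPUs are exposed by the NVIDIA device plugin.
        limits["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                "resources": {"limits": limits},
                # Persisted checkpoint/model files become visible here.
                "volumeMounts": [{"name": "train-files", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "train-files",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }


# Hypothetical names throughout; the image and claim do not refer to real objects.
manifest = build_pod_manifest("train-job-1", "example/trainer:latest",
                              cpus=1, gpus=2, memory_gb=2, pvc_name="ckpt-pvc")
print(manifest["spec"]["containers"][0]["resources"])
```

Such a dictionary could be serialized to YAML or JSON and submitted after the old Pod has been deleted.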

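Based on the above embodiment, in this embodiment, after the target container group is determined using the current resource usage information, the method further includes: displaying the idle resource information and the current resource usage information corresponding to the container group of each training task on a visual interface; and generating prompt information for adjusting the resources of the target container group.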

It should be noted that, in this application, the current resource usage information of the training tasks in the whole cluster, obtained through Kubernetes, represents the state of each training task, and the idle resource information of the whole cluster represents how cluster resources are being used. Both can serve as references for the subsequent dynamic expansion of task resources.

Further, after obtaining the above information, this application can present it on a visual interface: the display interface shows the resource usage of the different training tasks, and target container groups whose resources are to be adjusted are highlighted by the prompt information. In this embodiment, the target container group can be determined from four factors: GPU utilization, CPU utilization, memory utilization, and training duration. These four factors are only an example, and the application is not limited to them. For example, a training task that has high GPU utilization, high CPU utilization, or high memory utilization, and whose total training duration reaches a predetermined threshold, can be taken as the target training task. That is, if the GPU utilization of a training task is greater than a first predetermined threshold, its CPU utilization is greater than a second predetermined threshold, or its memory utilization is greater than a third predetermined threshold, and its total training duration is greater than a fourth predetermined threshold, the training task is determined to be the target training task.

Specifically, the first predetermined threshold may be set to 90% and the second predetermined threshold to 80%. CPU utilization is usually not very high in deep learning, so high CPU utilization indicates that a significant part of the computation or operations is being carried out on the CPU, and this resource needs to be expanded. The third predetermined threshold may be set to 90% and the fourth predetermined threshold to 5 h. The above thresholds can of course be set by the user as needed, to achieve customized monitoring.
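The four-factor rule can be sketched as a simple predicate. The thresholds below follow the example values in the text (90% GPU, 80% CPU, 90% memory, 5 h), while the `TaskUsage` type, the function names, and the sample tasks are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskUsage:
    """Hypothetical snapshot of one training task's current resource usage."""
    name: str
    gpu_util: float   # fraction in [0, 1]
    cpu_util: float
    mem_util: float
    hours: float      # total training duration so far


# Example thresholds from the text; users may set their own.
GPU_T, CPU_T, MEM_T, HOURS_T = 0.90, 0.80, 0.90, 5.0


def is_target(t: TaskUsage) -> bool:
    """Any one utilization above its threshold AND a long enough run."""
    high_load = t.gpu_util > GPU_T or t.cpu_util > CPU_T or t.mem_util > MEM_T
    return high_load and t.hours > HOURS_T


def select_targets(tasks: List[TaskUsage]) -> List[str]:
    return [t.name for t in tasks if is_target(t)]


tasks = [
    TaskUsage("a", 0.95, 0.30, 0.40, 6.0),   # high GPU, long run
    TaskUsage("b", 0.95, 0.30, 0.40, 2.0),   # high GPU, but too short
    TaskUsage("c", 0.50, 0.85, 0.40, 10.0),  # high CPU, long run
    TaskUsage("d", 0.50, 0.50, 0.50, 10.0),  # nothing above threshold
]
print(select_targets(tasks))
```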

It should be noted that the adjustment information can be determined in two ways: manual adjustment and automatic adjustment. For manual adjustment, the adjustment information for the target container group sent by the user is received. For automatic adjustment, the adjustment information for the target container group is generated automatically using a preset resource adjustment rule. The two cases are described below, taking resource expansion as an example.

In manual adjustment, the user can, based on the prompt information on the visual interface, manually trigger dynamic resource expansion for the relevant task and thereby generate the adjustment information. The visual interface can display the training tasks as a list in which the target training tasks are highlighted. From this task list the user selects the target training task to expand and then, according to the idle resources of the cluster and the resources the task needs, enters the adjustment information in the system. The adjustment information may be any of: adding GPUs, adding CPUs, or increasing the amount of memory. For example, if the resource usage before the change is 1 GPU, 1 CPU, and 2 GB of memory, the user can manually change it to 2 GPUs, 2 CPUs, and 4 GB of memory to expand the resources. After the user confirms the selection, the interface calls the Kubernetes API to terminate the Pod and re-create a Pod carrying the training task with the new configuration, which completes the resource expansion.

In automatic adjustment, the user needs to preset a resource adjustment rule when creating the task. The resource adjustment rule includes resource utilization thresholds and a resource adjustment strategy. The thresholds may be a GPU utilization threshold, a CPU utilization threshold, a memory utilization threshold, a running duration threshold, and so on; when the system detects that a resource utilization of the target training task reaches the corresponding threshold, it automatically carries out dynamic resource expansion for the task. The resource adjustment strategy may be an adjustment ratio or an adjustment quantity, for example increasing the resource amount by 40%. If the user has not set a strategy, or the overall resources are insufficient, the system expands according to the minimum expansion strategy or does not expand; the minimum expansion strategy may be an expansion of 20%. The user is informed of the actual expansion result by a notification, which may include whether the expansion succeeded or failed, the adjustment information, the resource utilization after expansion, and so on.

It can be understood that the adjustment process is the same whichever adjustment mode is used. This application can dynamically expand training tasks of different training types. The training type may be any of single-machine single-card, single-machine multi-card, and multi-machine multi-card; GPU resources are used as the example here.

For example, Fig. 2 is a schematic diagram of resource expansion for the single-machine single-card type provided by this application. As Fig. 2 shows, changing the number of GPUs of a single-machine single-card task turns it into a single-machine multi-card task. To expand its resources, the training task is terminated and then restarted, and training resumes from the most recent checkpoint. Fig. 3 is a schematic diagram of resource expansion for the single-machine multi-card type provided by this application. As Fig. 3 shows, for a single-machine multi-card training task, all tasks in the Pod are terminated, then the task is restarted and reads the Checkpoint file to resume training from the most recent training position.

Fig. 4 is a schematic diagram of resource expansion for the multi-machine multi-card type provided by this application. As Fig. 4 shows, the multi-machine multi-card type is the distributed type and includes parameter server (PS) nodes and Worker nodes. Distributed training is either asynchronous or synchronous. For resource expansion of Worker nodes in asynchronous mode, the Pods of one or more Worker nodes can be expanded at the same time; after dynamic expansion, the Workers with the new configuration load the global step from the Checkpoint file and continue training. In synchronous mode, when one Worker node stops, the other Worker nodes are also paused, so this scheme can use either a one-by-one update strategy or a synchronized update strategy to expand the resources of multiple Pods. A PS node is responsible for collecting and computing gradients, so expanding it dynamically may cause problems; in practice it can be expanded selectively according to the actual situation, which is not described in detail here.

It should be noted that, when the resources of a multi-machine multi-card training task are expanded, the number of Worker nodes is not increased. Instead, resources such as CPU, GPU, and memory are added to the Pod of each Worker that needs expansion, and that Pod is then restarted to complete the expansion.

Fig. 5 is a schematic flowchart of a dynamic resource adjustment scheme based on a Kubernetes cluster provided by this application. As the flowchart shows, the scheme first collects the idle resource information of the cluster and the current resource usage information of the running tasks, such as CPU, GPU, and memory information. This information is then displayed in the user interface, together with prompts indicating which tasks can have their resources expanded dynamically. For training tasks of the single-machine single-card, single-machine multi-card, and multi-machine multi-card types, resource expansion is achieved by restarting the corresponding Pod. After the expansion completes, the task resumes training, and the display interface continues to show the idle resource information of the cluster and the current resource usage information of the running tasks.

In summary, the scheme proposed in this application for dynamically expanding the resources of deep learning training tasks on a Kubernetes cluster uses the Pod scaling mechanism of Kubernetes to optimize deep learning training. When the running resources of a training task are insufficient, resources can be expanded dynamically, either automatically or manually, so that cluster resources are used reasonably, training is accelerated, and the utilization of cluster resources is improved.
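One possible reading of the automatic strategy is sketched below: expand by the configured percentage, fall back to the minimum 20% expansion when no strategy is set or the idle resources cannot cover the request, and leave the allocation unchanged when even the minimum does not fit. The function name, the use of whole-unit rounding (rounding up), and this particular fallback order are assumptions; the text does not fix these details.

```python
from typing import Optional

MIN_PERCENT = 20  # minimum expansion strategy from the text's example


def ceil_percent(amount: int, percent: int) -> int:
    """Integer ceiling of amount * percent / 100 (avoids float rounding)."""
    return (amount * percent + 99) // 100


def expanded_amount(current: int, idle: int, percent: Optional[int] = None) -> int:
    """Return the new allocation of one resource (e.g. GPUs or GB of memory)."""
    if percent is None:
        percent = MIN_PERCENT            # user did not set a strategy
    extra = ceil_percent(current, percent)
    if extra > idle:
        # Not enough idle resources: try the minimum expansion instead.
        extra = ceil_percent(current, MIN_PERCENT)
    if extra > idle:
        return current                   # still does not fit: do not expand
    return current + extra


print(expanded_amount(10, idle=8, percent=40))  # full 40% expansion fits
print(expanded_amount(10, idle=3, percent=40))  # falls back to 20%
print(expanded_amount(10, idle=1, percent=40))  # no expansion possible
```

Returning the unchanged amount makes it easy for a caller to detect the "not expanded" case and report it in the notification to the user.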
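The two update strategies mentioned for Worker Pods in synchronous mode can be contrasted with a small planning helper. This is a sketch only: `restart_batches` and the mode names are hypothetical, and the function merely decides the order in which Pods would be re-created, not how they are re-created.

```python
from typing import List


def restart_batches(worker_pods: List[str], mode: str) -> List[List[str]]:
    """Group Worker Pods into batches that are re-created together."""
    if mode == "one_by_one":
        # Each Pod is re-created in its own batch, one after another.
        return [[pod] for pod in worker_pods]
    if mode == "synchronized":
        # All Pods are re-created together in a single batch.
        return [list(worker_pods)]
    raise ValueError(f"unknown mode: {mode}")


workers = ["worker-0", "worker-1", "worker-2"]
print(restart_batches(workers, "one_by_one"))
print(restart_batches(workers, "synchronized"))
```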

The dynamic resource adjustment apparatus provided by an embodiment of the present invention is described below. The apparatus described below and the method described above may be cross-referenced.

Referring to Fig. 6, a dynamic resource adjustment apparatus based on a Kubernetes cluster provided by an embodiment of the present invention includes:

a first information obtaining module 100, configured to obtain idle resource information of the cluster;

a second information obtaining module 200, configured to obtain current resource usage information corresponding to the container group of each training task;

a target container group determining module 300, configured to determine a target container group using the current resource usage information, the target container group being the container group of a target training task whose resources are to be adjusted;

an adjustment information determining module 400, configured to determine adjustment information for adjusting the resources of the target container group, the adjustment information being determined using the idle resource information and the current resource usage information; and

a container group creation module 500, configured to re-create the container group using the adjustment information, so that the target training task continues to execute in the re-created container group.

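The dynamic resource adjustment apparatus further includes: a display module, configured to display the idle resource information and the current resource usage information corresponding to the container group of each training task on a visual interface; and a prompt module, configured to generate prompt information for adjusting the resources of the target container group.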

The adjustment information determining module 400 includes:

an information receiving unit, configured to receive adjustment information, sent by a user, for adjusting the resources of the target container group.

Alternatively, the adjustment information determining module 400 includes:

an adjustment unit, configured to automatically generate the adjustment information for adjusting the resources of the target container group using a preset resource adjustment rule.

The adjustment information includes adjustment information for at least one of the number of CPUs, the number of GPUs, and the amount of memory used.

The container group creation module is specifically configured to: terminate the training task of the target container group; and re-create the container group using the adjustment information and the Checkpoint file corresponding to the target container group, so that the target training task continues to execute in the re-created container group.

Referring to Fig. 7, the resource dynamic adjustment equipment based on Kubernetes cluster that the embodiment of the invention also discloses a kind of, It include: memory 11, for storing computer program；Processor 12 is realized when for executing the computer program as above-mentioned The step of resource dynamic regulation method described in any means embodiment.

In the present embodiment, equipment 1 can be PC (Personal Computer, PC), be also possible to plate electricity The terminal devices such as brain, palm PC, portable computer.

The equipment 1 may include memory 11, processor 12 and bus 13.

Wherein, memory 11 include at least a type of readable storage medium storing program for executing, the readable storage medium storing program for executing include flash memory, Hard disk, multimedia card, card-type memory (for example, SD or DX memory etc.), magnetic storage, disk, CD etc..Memory 11 It can be the internal storage unit of equipment 1, such as the hard disk of the equipment 1 in some embodiments.Memory 11 is in other realities Apply the plug-in type hard disk being equipped on the External memory equipment for being also possible to equipment 1 in example, such as equipment 1, intelligent memory card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card, flash card (Flash Card) etc..Into One step, memory 11 can also both internal storage units including equipment 1 or including External memory equipment.Memory 11 is not only It can be used for storing the application software and Various types of data for being installed on equipment 1, such as execute the code etc. of resource dynamic regulation method, It can be also used for temporarily storing the data that has exported or will export.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the code of the resource dynamic adjustment method.

The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one thick line in Fig. 7, but this does not mean that there is only one bus or only one type of bus.

Further, the equipment may also include a network interface 14. The network interface 14 may optionally include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is typically used to establish a communication connection between the equipment 1 and other electronic devices.

Optionally, the equipment 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-control liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also appropriately be called a display screen or a display unit, and is used to display the information processed in the equipment 1 and to display a visual user interface.

Fig. 7 shows only the equipment 1 with components 11-14. Those skilled in the art will understand that the structure shown in Fig. 7 does not limit the equipment 1, which may include fewer or more components than illustrated, combine certain components, or use a different arrangement of components.

An embodiment of the present invention further discloses a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the resource dynamic adjustment method described in any of the above method embodiments are implemented.

Wherein, the storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A resource dynamic adjustment method based on a Kubernetes cluster, characterized by comprising:

obtaining idle resource information of the cluster, and current resource usage information corresponding to the container group of each training task;

determining a target container group using the current resource usage information, the target container group being the container group of a target training task on which resource adjustment is to be performed;

determining adjustment information for performing resource adjustment on the target container group, the adjustment information being determined using the idle resource information and the current resource usage information;

re-creating the container group using the adjustment information, so as to continue executing the target training task through the re-created container group.

2. The resource dynamic adjustment method according to claim 1, characterized in that, after determining the target container group using the current resource usage information, the method further comprises:

displaying the idle resource information and the current resource usage information corresponding to the container group of each training task on a visualization interface;

3. The resource dynamic adjustment method according to claim 2, characterized in that determining the adjustment information for performing resource adjustment on the target container group comprises:

4. The resource dynamic adjustment method according to claim 1 or 2, characterized in that determining the adjustment information for performing resource adjustment on the target container group comprises:

automatically generating, using a preset resource adjustment rule, the adjustment information for performing resource adjustment on the target container group.

5. The resource dynamic adjustment method according to claim 1, characterized in that the adjustment information includes adjustment information for at least one of the number of CPUs, the number of GPUs, and the memory resource usage.

6. The resource dynamic adjustment method according to claim 1, characterized in that re-creating the container group using the adjustment information, so as to continue executing the target training task through the re-created container group, comprises:

terminating the training task of the target container group;

re-creating the container group using the adjustment information and the Checkpoint file corresponding to the target container group, so as to continue executing the target training task through the re-created container group.

7. A resource dynamic adjustment device based on a Kubernetes cluster, characterized by comprising:

a second information obtaining module, configured to obtain current resource usage information corresponding to the container group of each training task;

an adjustment information determining module, configured to determine adjustment information for performing resource adjustment on the target container group, the adjustment information being determined using the idle resource information and the current resource usage information;

a container group creation module, configured to re-create the container group using the adjustment information, so as to continue executing the target training task through the re-created container group.

8. The resource dynamic adjustment device according to claim 7, characterized by further comprising:

a display module, configured to display the idle resource information and the current resource usage information corresponding to the container group of each training task on a visualization interface;

9. Resource dynamic adjustment equipment based on a Kubernetes cluster, characterized by comprising:

a memory, configured to store a computer program;

a processor, configured to implement, when executing the computer program, the steps of the resource dynamic adjustment method according to any one of claims 1 to 6.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the resource dynamic adjustment method according to any one of claims 1 to 6 are implemented.
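For illustration only, the selection and adjustment steps recited in claim 1 (determining the target container group from current usage, then determining adjustment information from the idle resources) might be sketched in Python as follows. This is a simplified example under stated assumptions, not the preset rule of the invention: it considers CPU only, and the names `PodUsage`, `pick_target`, `plan_adjustment`, the 0.9 utilization threshold, and the `max_step` cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    cpu_limit: int    # CPUs currently allocated to the container group (assumed > 0)
    cpu_used: float   # CPUs the training task is currently using


def pick_target(usages, threshold=0.9):
    """Return the most saturated container group at or above the threshold, or None."""
    saturated = [u for u in usages if u.cpu_used / u.cpu_limit >= threshold]
    return max(saturated, key=lambda u: u.cpu_used / u.cpu_limit, default=None)


def plan_adjustment(target, idle_cpus, max_step=4):
    """Grant up to max_step extra CPUs, bounded by the idle CPUs of the cluster."""
    extra = min(idle_cpus, max_step)
    if extra <= 0:
        return None  # nothing idle, so no adjustment information is produced
    return {"cpu": target.cpu_limit + extra}
```

With `usages = [PodUsage("job-a", 4, 3.9), PodUsage("job-b", 8, 2.0)]`, `pick_target(usages)` selects `job-a`, and `plan_adjustment` with two idle CPUs proposes a CPU count of 6 for the re-created container group.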