CN109117248A - Elastic scaling system and method for deep learning tasks based on the Kubernetes platform

Info

Publication number: CN109117248A
Application number: CN201810798693.1A
Authority: CN
Inventors: 刘娜
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2018-07-19
Filing date: 2018-07-19
Publication date: 2019-01-01

Abstract

An embodiment of the invention discloses an elastic scaling system and method for deep learning tasks based on the Kubernetes platform. The system includes a data acquisition module, a data monitoring module, and a data storage module. The data acquisition module is deployed on the compute nodes of the Kubernetes platform; the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform; and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence. In the invention, the data acquisition module on each compute node collects the memory utilization of the containers. The data monitoring module on the management node gathers the container utilization collected on every compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train under the best conditions and deep learning training efficiency improves.

Description

Elastic scaling system and method for deep learning tasks based on the Kubernetes platform

Technical field

The present invention relates to the field of deep learning technology, and more particularly to an elastic scaling system and method for deep learning tasks based on the Kubernetes platform.

Background art

The rise of artificial intelligence has been called a hallmark of the "fourth industrial revolution". More and more artificial intelligence applications are entering our lives, including face recognition, image recognition, speech recognition, intelligent driving, and intelligent wealth management. The essence of artificial intelligence is to take a specific data model and train it repeatedly on a large amount of historical data until the model is able to make decisions on its own. Tensorflow, an outstanding distributed machine learning system in the artificial intelligence field, is already widely used in production environments.

In the prior art, deep learning with the Tensorflow system proceeds as follows: training containers are set up on the compute nodes of the Kubernetes platform, the training data is divided into several equal parts, and the control node assigns the training data to the containers for parallel training.

However, in the prior art the number of training containers is specified manually, and there is no way to know in advance whether that number is optimal. When too few containers are set, the computing capacity of each container is constrained; when too many are set, resources are wasted. Either way, overall deep learning training efficiency suffers.

Summary of the invention

The embodiments of the invention provide an elastic scaling system and method for deep learning tasks based on the Kubernetes platform, to solve the problem of low deep learning training efficiency in the prior art.

To solve the above technical problem, the embodiments of the invention disclose the following technical solutions:

A first aspect of the invention provides an elastic scaling system for deep learning tasks based on the Kubernetes platform, including: a data acquisition module, a data monitoring module, and a data storage module, wherein the data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence.

Preferably, the system further includes a container setup module, which is deployed on the compute nodes of the Kubernetes platform and is communicatively connected to the data storage module.

Preferably, the system further includes a resource scheduling module, which is deployed on the compute nodes of the Kubernetes platform, is connected to the container setup module, and is used to distribute training data to the containers.

Preferably, the data acquisition module includes the cAdvisor tool;

the data monitoring module includes the heapster tool;

the data storage module includes the influxDB tool.

A second aspect of the invention provides an elastic scaling method for deep learning tasks based on the Kubernetes platform, characterized by including:

setting an initial number of containers;

obtaining the container memory utilization;

determining whether the container utilization is greater than a preset memory utilization;

if so, adding containers according to a first preset capacity value, and otherwise continuing to obtain the container memory utilization;

calculating a utilization increment;

adding containers according to the utilization increment.

Preferably, calculating the utilization increment specifically includes:

subtracting the preset memory utilization from the container utilization.

Preferably, adding containers according to the utilization increment specifically includes:

for every preset number of percentage points by which the utilization increment rises, adding containers according to a second preset capacity value.

Preferably, the method further includes:

distributing the waiting training data to the newly added containers.

As the above technical solutions show, in the invention a data acquisition module deployed on the compute nodes of the Kubernetes platform collects the memory utilization of the containers. A data monitoring module and a data storage module are deployed on the management node of the Kubernetes platform; the data monitoring module gathers the container utilization collected on every compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train under the best conditions and deep learning training efficiency improves.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a structural schematic diagram of an elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention;

Fig. 2 is a structural schematic diagram of another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention;

Fig. 3 is a structural schematic diagram of yet another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention;

Fig. 4 is a flow diagram of an elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention;

Fig. 5 is a flow diagram of another elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

Referring to Fig. 1, which is a structural schematic diagram of an elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the system includes: a data acquisition module, a data monitoring module, and a data storage module.

The data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence.

In the embodiment of the invention, the data acquisition module includes the cAdvisor tool; the data monitoring module includes the heapster tool; the data storage module includes the influxDB tool.

A Kubernetes platform generally has one management node (master) and multiple compute nodes (node), and the containers of a training task are likewise distributed across different nodes. cAdvisor is a tool deployed on each node to collect the running state of containers. Heapster aggregates the data that cAdvisor collects on each node and stores the aggregated data in influxDB, so the running state of all containers used by a model training job can be obtained from influxDB.
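The collection path described above (a collector on each node, central aggregation, then storage) can be sketched in plain Python. This is an in-memory illustration only: `NodeCollector`, `Aggregator`, `MetricStore`, and the `Sample` record layout are hypothetical stand-ins for the roles played by cAdvisor, heapster, and influxDB, and they do not use the real interfaces of those tools.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """One memory reading for one container (hypothetical record layout)."""
    node: str
    container: str
    used_bytes: int
    limit_bytes: int

    @property
    def utilization(self) -> float:
        # Memory utilization as a percentage of the container's limit.
        return 100.0 * self.used_bytes / self.limit_bytes


@dataclass
class NodeCollector:
    """Stands in for the per-node collection role (cAdvisor in the text)."""
    node: str
    readings: dict  # container name -> (used_bytes, limit_bytes)

    def collect(self) -> list:
        return [Sample(self.node, name, used, limit)
                for name, (used, limit) in self.readings.items()]


@dataclass
class MetricStore:
    """Stands in for the storage role (influxDB in the text)."""
    rows: list = field(default_factory=list)

    def write(self, samples) -> None:
        self.rows.extend(samples)

    def utilization_by_container(self) -> dict:
        return {row.container: row.utilization for row in self.rows}


class Aggregator:
    """Stands in for the aggregation role (heapster in the text)."""

    def __init__(self, collectors, store):
        self.collectors = collectors
        self.store = store

    def run_once(self) -> None:
        # Pull from every node, then persist everything in one place.
        for collector in self.collectors:
            self.store.write(collector.collect())


collectors = [
    NodeCollector("node-1", {"train-0": (512, 1024)}),
    NodeCollector("node-2", {"train-1": (256, 1024), "train-2": (768, 1024)}),
]
store = MetricStore()
Aggregator(collectors, store).run_once()
print(store.utilization_by_container())
```

Keeping the store as the single place that later steps read from mirrors the text: the management node only needs to query one location to see every container of a training job.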

Referring to Fig. 2, which is a structural schematic diagram of another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the system further includes a container setup module.

The container setup module is deployed on the compute nodes of the Kubernetes platform and is communicatively connected to the data storage module.

The container setup module increases the number of containers as appropriate, according to the container memory utilization obtained. To avoid wasting containers, the embodiment of the invention first sets a minimum number of containers and then gradually increases the number of containers according to memory utilization. In one embodiment, the minimum number of containers is set according to the model training volume; when the memory utilization of the containers exceeds 40%, the number of containers is increased by 3; thereafter, each time the container memory utilization rises by a further 10 percentage points, one more container is added.
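The rule in this embodiment can be written as a small function. The sketch below is one reading of the text, not a definitive implementation: it assumes that the 10-percentage-point steps are counted from the 40% threshold and that only complete steps count. The function name `containers_to_add` and its parameter names are introduced here for illustration.

```python
import math


def containers_to_add(utilization: float,
                      threshold: float = 40.0,
                      first_step: int = 3,
                      percent_step: float = 10.0,
                      second_step: int = 1) -> int:
    """Return how many containers to add for a given memory utilization (%)."""
    if utilization <= threshold:
        return 0  # at or below the threshold: keep the current count
    # Each complete `percent_step` above the threshold adds `second_step` more.
    extra_steps = math.floor((utilization - threshold) / percent_step)
    return first_step + extra_steps * second_step


for u in (35.0, 45.0, 50.0, 65.0):
    print(u, containers_to_add(u))
```

Under these assumptions, a reading of 45% adds 3 containers, 50% adds 4, and 65% adds 5, while 35% adds none.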

Referring to Fig. 3, which is a structural schematic diagram of yet another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the system further includes a resource scheduling module.

The resource scheduling module is deployed on the compute nodes of the Kubernetes platform, is connected to the container setup module, and is used to distribute training data to the containers.

Before model training, the training data is divided into multiple equal parts according to certain rules, and each container can run only one part of the training data. During training, the Tensorflow platform distributes different training data to each node in the model according to a certain algorithm. When the number of containers is smaller than the number of training data parts, the remaining training data can only wait as long as the algorithm in a container has not finished executing. Because the embodiment of the invention adds new containers, the resource scheduling module can distribute the waiting training data to the newly added containers for training, which reduces the waiting time and greatly improves training efficiency.
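A minimal sketch of this scheduling idea follows, assuming the waiting parts are handed out in first-in, first-out order with one part per new container. The helper names `split_equal` and `assign_waiting`, the container names, and the FIFO policy are assumptions made for illustration; the text does not specify how the resource scheduling module chooses which waiting part goes to which container.

```python
from collections import deque


def split_equal(data: list, parts: int) -> list:
    """Split `data` into `parts` shards of equal size (length must divide evenly)."""
    if parts <= 0 or len(data) % parts != 0:
        raise ValueError("data length must be a positive multiple of parts")
    size = len(data) // parts
    return [data[i * size:(i + 1) * size] for i in range(parts)]


def assign_waiting(waiting: deque, idle_containers: list) -> dict:
    """Give each idle container one waiting shard, in FIFO order."""
    assignment = {}
    for container in idle_containers:
        if not waiting:
            break  # nothing left to hand out
        assignment[container] = waiting.popleft()
    return assignment


shards = split_equal(list(range(12)), parts=6)          # 6 shards of 2 samples each
running = {"train-0": shards[0], "train-1": shards[1]}  # 2 initial containers
waiting = deque(shards[2:])                             # 4 shards must wait

# Three containers are added by the scaling step; they pick up waiting shards.
new_assignment = assign_waiting(waiting, ["train-2", "train-3", "train-4"])
print(new_assignment)
print(len(waiting))
```

In this example three of the four waiting shards start training immediately on the new containers, and only one shard is still waiting.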

Referring to Fig. 4, which is a flow diagram of an elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the method includes:

S10: setting the initial number of containers.

The corresponding minimum number of containers is set according to the training volume of the model to be trained.

S20: obtaining the container memory utilization.

The memory utilization of each container is obtained by the cAdvisor tool and then aggregated into the influxDB tool by the heapster tool.

S30: determining whether the container utilization is greater than the preset memory utilization.

It is determined whether the memory utilization obtained for the containers on each node is greater than the preset memory utilization.

If so, step S40 is executed: adding containers according to the first preset capacity value. Otherwise, the method returns to step S20: obtaining the container memory utilization.

S50: calculating the utilization increment.

For containers whose memory utilization is greater than the preset memory utilization, the increment in memory utilization must be calculated, and the number of containers is increased according to that increment. Specifically, the increment is calculated by subtracting the preset memory utilization from the container utilization.

S60: adding containers according to the utilization increment.

For every preset number of percentage points by which the utilization increment rises, containers are added according to the second preset capacity value. The specific settings are as described above and are not repeated here.
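Steps S40 to S60 can be summarized in a single expression, under the assumption that the increment is measured from the preset utilization and that only complete steps of the preset percentage count. The symbols are introduced here for illustration and do not appear in the original: u is the measured container memory utilization, u_0 the preset memory utilization, c_1 and c_2 the first and second preset capacity values, and p the preset number of percentage points.

```latex
\Delta u = u - u_0, \qquad
N_{\text{add}} =
\begin{cases}
0, & u \le u_0,\\[4pt]
c_1 + c_2 \left\lfloor \dfrac{\Delta u}{p} \right\rfloor, & u > u_0.
\end{cases}
```

With the values of the embodiment described earlier (u_0 = 40%, c_1 = 3, p = 10 percentage points, c_2 = 1), a measured utilization of u = 65% gives \Delta u = 25 and N_add = 3 + 1 x floor(2.5) = 5.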

Referring to Fig. 5, which is a flow diagram of another elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the method further includes:

S70: distributing the waiting training data to the newly added containers.

The resource scheduling module distributes the waiting training data to the newly added containers for training, which improves training efficiency.
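Steps S20 to S70 can be combined into one scaling round, sketched below. Several details are assumptions rather than part of the original: the round acts on the highest utilization among the readings, complete percentage steps are counted from the preset utilization, waiting data is handed out in FIFO order, and the names `run_scaling_round`, `train-N`, and `shard-N` are hypothetical. The default values are those of the embodiment with a 40% threshold.

```python
import math
from collections import deque


def run_scaling_round(containers: list, utilization: dict, waiting: deque,
                      threshold: float = 40.0, first_step: int = 3,
                      percent_step: float = 10.0, second_step: int = 1):
    """One pass of S20-S70 over utilization readings (percent per container)."""
    # S20/S30: take the highest reading and compare it with the preset value.
    peak = max(utilization.values())
    if peak <= threshold:
        return [], {}  # nothing to do; the caller obtains utilization again later
    # S40-S60: first preset amount, plus one step per complete percent_step.
    increment = peak - threshold
    count = first_step + second_step * math.floor(increment / percent_step)
    start = len(containers)
    added = [f"train-{start + i}" for i in range(count)]
    containers.extend(added)
    # S70: hand waiting training data to the new containers in FIFO order.
    assignment = {}
    for name in added:
        if not waiting:
            break
        assignment[name] = waiting.popleft()
    return added, assignment


containers = ["train-0", "train-1"]  # S10: initial number of containers
waiting = deque(["shard-2", "shard-3", "shard-4", "shard-5"])
added, assignment = run_scaling_round(
    containers, {"train-0": 55.0, "train-1": 38.0}, waiting)
print(added)
print(assignment)
```

Here the peak reading of 55% is 15 percentage points above the threshold, so 3 + 1 = 4 containers are added and all four waiting shards are assigned.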

In the invention, a data acquisition module deployed on the compute nodes of the Kubernetes platform collects the memory utilization of the containers. A data monitoring module and a data storage module are deployed on the management node of the Kubernetes platform; the data monitoring module gathers the container utilization collected on every compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train under the best conditions and deep learning training efficiency improves.

The above are only specific embodiments of the invention, provided so that those skilled in the art can understand or implement the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An elastic scaling system for deep learning tasks based on the Kubernetes platform, characterized by including: a data acquisition module, a data monitoring module, and a data storage module, wherein the data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence.

2. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 1, characterized in that the system further includes a container setup module, which is deployed on the compute nodes of the Kubernetes platform and is communicatively connected to the data storage module.

3. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 1 or 2, characterized in that the system further includes a resource scheduling module, which is deployed on the compute nodes of the Kubernetes platform, is connected to the container setup module, and is used to distribute training data to the containers.

4. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 3, characterized in that the data acquisition module includes the cAdvisor tool;

the data monitoring module includes the heapster tool;

the data storage module includes the influxDB tool.

5. An elastic scaling method for deep learning tasks based on the Kubernetes platform, characterized by including:

setting an initial number of containers;

obtaining the container memory utilization;

calculating a utilization increment;

adding containers according to the utilization increment.

6. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to claim 5, characterized in that calculating the utilization increment specifically includes:

subtracting the preset memory utilization from the container utilization.

7. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to claim 5, characterized in that adding containers according to the utilization increment specifically includes:

8. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to any one of claims 5 to 7, characterized in that the method further includes:

distributing the waiting training data to the newly added containers.