CN109117248A - Elastic scaling system and method for deep learning tasks based on the Kubernetes platform
- Publication number: CN109117248A
- Application number: CN201810798693.1A
- Authority: CN (China)
- Prior art keywords: container, module, data, kubernetes platform, deep learning
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/301—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is a virtual computing platform, e.g. logically partitioned systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3089—Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45583—Memory management, e.g. access or allocation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- Mathematical Physics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An embodiment of the invention discloses an elastic scaling system and method for deep learning tasks based on the Kubernetes platform. The system comprises a data acquisition module, a data monitoring module, and a data storage module. The data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the three modules are communicatively connected in sequence. The data acquisition module on each compute node collects the memory utilization of the containers. The data monitoring module gathers the container utilization collected on every compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train under the best conditions and deep learning training efficiency improves.
Description
Technical field
The present invention relates to the field of deep learning technology, and in particular to an elastic scaling system and method for deep learning tasks based on the Kubernetes platform.
Background
The rise of artificial intelligence has been called a hallmark of the "fourth industrial revolution". More and more artificial intelligence applications are entering our lives, including face recognition, image recognition, speech recognition, intelligent driving, and intelligent financial management. The essence of artificial intelligence is to use a large amount of historical data to train a specific data model repeatedly until the model is able to discriminate on its own. TensorFlow, an outstanding distributed machine learning system in the field of artificial intelligence, is widely used in production environments.
In the prior art, deep learning with the TensorFlow system proceeds as follows: training containers are set up on the compute nodes of the Kubernetes platform, the training data is divided into several equal parts, and the control node distributes the training data to the containers for parallel training.
However, in the prior art the number of training containers is specified manually, and there is no way to know in advance whether that number is optimal. When too few containers are configured, the computing capacity of each container is constrained; when too many are configured, resources are wasted. Either way, overall deep learning training efficiency suffers.
Summary of the invention
Embodiments of the invention provide an elastic scaling system and method for deep learning tasks based on the Kubernetes platform, to solve the prior-art problem of low deep learning training efficiency.
To solve the above technical problem, the embodiments of the invention disclose the following technical solutions.
A first aspect of the invention provides an elastic scaling system for deep learning tasks based on the Kubernetes platform, comprising: a data acquisition module, a data monitoring module, and a data storage module, wherein the data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, data monitoring module, and data storage module are communicatively connected in sequence.
Preferably, the system further comprises a container setup module, which is deployed on a compute node of the Kubernetes platform and communicatively connected to the data storage module.
Preferably, the system further comprises a resource scheduling module, which is deployed on a compute node of the Kubernetes platform, is connected to the container setup module, and allocates training data to containers.
Preferably, the data acquisition module comprises the cAdvisor tool;
the data monitoring module comprises the Heapster tool;
the data storage module comprises the InfluxDB tool.
A second aspect of the invention provides an elastic scaling method for deep learning tasks based on the Kubernetes platform, characterized by comprising:
setting an initial number of containers;
obtaining the container memory utilization;
determining whether the container utilization is greater than a preset memory utilization;
if so, adding containers according to a first preset capacity value, and otherwise continuing to obtain the container memory utilization;
calculating the utilization increase;
adding containers according to the utilization increase.
Preferably, calculating the utilization increase specifically comprises:
subtracting the preset memory utilization from the container utilization.
Preferably, adding containers according to the utilization increase specifically comprises:
each time the utilization increase rises by a preset number of percentage points, adding containers according to a second preset capacity value.
Preferably, the method further comprises:
assigning waiting training data to the newly added containers.
As the above technical solutions show, in the invention a data acquisition module deployed on the compute nodes of the Kubernetes platform collects the memory utilization of the containers, and a data monitoring module and a data storage module are deployed on the management node of the Kubernetes platform. The data monitoring module gathers the container utilization collected on every compute node and saves it in the data storage module. The management node adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train under the best conditions and deep learning training efficiency improves.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a structural schematic diagram of an elastic scaling system for deep learning tasks based on the Kubernetes platform, provided by an embodiment of the invention;
Fig. 2 is a structural schematic diagram of another elastic scaling system for deep learning tasks based on the Kubernetes platform, provided by an embodiment of the invention;
Fig. 3 is a structural schematic diagram of a further elastic scaling system for deep learning tasks based on the Kubernetes platform, provided by an embodiment of the invention;
Fig. 4 is a flow diagram of an elastic scaling method for deep learning tasks based on the Kubernetes platform, provided by an embodiment of the invention;
Fig. 5 is a flow diagram of another elastic scaling method for deep learning tasks based on the Kubernetes platform, provided by an embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, a structural schematic diagram of an elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the system comprises: a data acquisition module, a data monitoring module, and a data storage module.
The data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, data monitoring module, and data storage module are communicatively connected in sequence.
In this embodiment, the data acquisition module comprises the cAdvisor tool; the data monitoring module comprises the Heapster tool; the data storage module comprises the InfluxDB tool.
A Kubernetes cluster generally has one management node (master) and multiple compute nodes (node), and the containers of a training task are distributed across different nodes. cAdvisor is a tool deployed on each node to collect the running status of containers. Heapster aggregates the data that cAdvisor collects on each node and stores the aggregated data in InfluxDB, so the running status of all containers used by a model training job can be obtained from InfluxDB.
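The collect, aggregate, and store flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MemorySample` record and the `aggregate` function are hypothetical names, the node and container names are made up, and the sketch does not call the real cAdvisor, Heapster, or InfluxDB interfaces. A plain dictionary stands in for the data storage module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySample:
    """One memory reading for one training container (hypothetical record shape)."""
    node: str
    container: str
    usage_bytes: int
    limit_bytes: int

    @property
    def utilization(self) -> float:
        # Memory utilization as a percentage of the container's memory limit.
        return 100.0 * self.usage_bytes / self.limit_bytes


def aggregate(samples_by_node: dict[str, list[MemorySample]]) -> dict[str, float]:
    """Merge per-node readings into one "node/container" -> utilization map.

    Stands in for the monitoring module gathering what the acquisition
    module on each compute node reported; the returned dict plays the
    role of the storage module.
    """
    store: dict[str, float] = {}
    for node, samples in samples_by_node.items():
        for sample in samples:
            store[f"{node}/{sample.container}"] = sample.utilization
    return store


# Example readings from two compute nodes (made-up values).
readings = {
    "node-1": [MemorySample("node-1", "trainer-0", 512, 1024)],
    "node-2": [MemorySample("node-2", "trainer-1", 768, 1024)],
}
central_store = aggregate(readings)
```

Keying the store by node and container keeps readings from different nodes distinct, which matches the text's point that one query against the store covers every container of a training job.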
Referring to Fig. 2, a structural schematic diagram of another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the system further comprises a container setup module.
The container setup module is deployed on a compute node of the Kubernetes platform and communicatively connected to the data storage module.
The container setup module adds an appropriate number of containers according to the container memory utilization it obtains. To avoid wasting containers, this embodiment sets a minimum number of containers and then increases the number gradually according to memory utilization. In one embodiment, the minimum number of containers is set according to the model training workload. When the memory utilization of the containers exceeds 40%, 3 containers are added; thereafter, each time the container memory utilization rises by a further 10 percentage points, one more container is added.
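The example rule in this embodiment (add 3 containers above 40%, then one more for every further 10 percentage points) can be written as a small function. The sketch below is one possible reading, not the patent's implementation: it computes the cumulative number of containers to add from a single utilization reading, and the function name and parameter names are hypothetical. The defaults mirror the figures quoted in the embodiment.

```python
def containers_to_add(
    utilization_pct: float,
    threshold_pct: float = 40.0,   # preset memory utilization from the example
    first_step: int = 3,           # containers added once the threshold is exceeded
    step_pct: float = 10.0,        # further rise that triggers another addition
    second_step: int = 1,          # containers added per further rise
) -> int:
    """Return the total number of containers to add for one utilization reading."""
    if utilization_pct <= threshold_pct:
        return 0
    # Utilization increase: container utilization minus the preset utilization.
    increase = utilization_pct - threshold_pct
    # Count only the complete 10-point rises above the threshold.
    full_steps = int(increase // step_pct)
    return first_step + full_steps * second_step
```

With the defaults, a reading of 45% yields 3 added containers and a reading of 65% yields 5, since 25 points above the threshold contain two complete 10-point rises.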
Referring to Fig. 3, a structural schematic diagram of a further elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the system further comprises a resource scheduling module.
The resource scheduling module is deployed on a compute node of the Kubernetes platform, is connected to the container setup module, and allocates training data to containers.
Before model training, the training data can be divided into multiple equal parts according to some rule, and each container can run only one part of the training data. During training, the TensorFlow platform distributes different training data to each node in the model according to some algorithm. When the number of containers is smaller than the number of training data parts, the remaining training data must wait as long as the algorithms in the containers have not finished. Because this embodiment adds new containers, the resource scheduling module can assign the waiting training data to the newly added containers for training, which shortens the waiting time and greatly improves training efficiency.
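A minimal sketch of how waiting data parts might be handed to newly added containers follows. It assumes first-in, first-out order and one data part per container, which the text implies but does not spell out; `assign_waiting_shards` and the shard and container names are hypothetical.

```python
from collections import deque


def assign_waiting_shards(
    waiting: list[str], new_containers: list[str]
) -> tuple[dict[str, str], list[str]]:
    """Give each newly added container one waiting data part, oldest first.

    Returns the container -> data part assignment and the parts still waiting.
    """
    queue = deque(waiting)
    assignment: dict[str, str] = {}
    for container in new_containers:
        if not queue:
            break  # more new containers than waiting parts
        assignment[container] = queue.popleft()
    return assignment, list(queue)


# Three parts are waiting and two containers were just added.
assigned, still_waiting = assign_waiting_shards(
    ["part-3", "part-4", "part-5"], ["container-3", "container-4"]
)
```

Any part that is not assigned stays in the waiting list until a running container finishes or another container is added.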
Referring to Fig. 4, a flow diagram of an elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the method comprises:
S10: Set the initial number of containers.
The corresponding minimum number of containers is set according to the training workload of the model to be trained.
S20: Obtain the container memory utilization.
The memory utilization of each container is obtained by the cAdvisor tool and then aggregated into the InfluxDB tool by the Heapster tool.
S30: Determine whether the container utilization is greater than the preset memory utilization.
Determine whether the obtained memory utilization of the containers on each node is greater than the preset memory utilization.
If so, perform step S40: add containers according to the first preset capacity value. Otherwise, return to step S20: obtain the container memory utilization.
S50: Calculate the utilization increase.
For containers whose memory utilization exceeds the preset memory utilization, the increase in memory utilization must be calculated so that containers can be added according to it. Specifically, the preset memory utilization is subtracted from the container utilization.
S60: Add containers according to the utilization increase.
Each time the utilization increase rises by a preset number of percentage points, containers are added according to the second preset capacity value. The specific settings are as described above and are not repeated here.
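Steps S10 to S60 can be strung together as a simple loop over successive utilization readings. The sketch below is an assumption-laden illustration rather than the patented implementation: `run_scaling` is a hypothetical name, the readings are passed in as a plain sequence instead of being queried from a store, and the defaults reuse the 40%, 3 container, 10 point, 1 container figures from the earlier embodiment. Unlike a stateless rule, it remembers which additions have already been made so that the same rise is not counted twice.

```python
from typing import Iterable


def run_scaling(
    readings: Iterable[float],
    initial: int,
    threshold: float = 40.0,   # preset memory utilization
    first_step: int = 3,       # first preset capacity value
    step_pct: float = 10.0,    # preset percentage points
    second_step: int = 1,      # second preset capacity value
) -> int:
    """Return the container count after processing all utilization readings."""
    count = initial                    # S10: initial number of containers
    crossed = False                    # whether S40 has already been applied
    steps_applied = 0                  # 10-point rises already acted on
    for utilization in readings:       # S20: obtain memory utilization
        if utilization <= threshold:   # S30: compare with the preset value
            continue                   # not exceeded, keep monitoring
        if not crossed:                # S40: first addition
            count += first_step
            crossed = True
        increase = utilization - threshold        # S50: utilization increase
        steps = int(increase // step_pct)
        if steps > steps_applied:      # S60: add for each new full rise
            count += (steps - steps_applied) * second_step
            steps_applied = steps
    return count
```

For example, starting from 2 containers with readings of 30, 45, 52, and 61 percent, the loop adds 3 containers at 45%, one more at 52%, and one more at 61%.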
Referring to Fig. 5, a flow diagram of another elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the invention, the method further comprises:
S70: Assign waiting training data to the newly added containers.
The resource scheduling module assigns the waiting training data to the newly added containers for training, which improves training efficiency.
In the invention, a data acquisition module deployed on the compute nodes of the Kubernetes platform collects the memory utilization of the containers, and a data monitoring module and a data storage module are deployed on the management node of the Kubernetes platform. The data monitoring module gathers the container utilization collected on every compute node and saves it in the data storage module. The management node adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train under the best conditions and deep learning training efficiency improves.
The above is only a specific embodiment of the invention, provided so that those skilled in the art can understand or implement the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. An elastic scaling system for deep learning tasks based on the Kubernetes platform, characterized by comprising: a data acquisition module, a data monitoring module, and a data storage module, wherein the data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, data monitoring module, and data storage module are communicatively connected in sequence.
2. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 1, characterized in that the system further comprises a container setup module, which is deployed on a compute node of the Kubernetes platform and communicatively connected to the data storage module.
3. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 1 or 2, characterized in that the system further comprises a resource scheduling module, which is deployed on a compute node of the Kubernetes platform, is connected to the container setup module, and allocates training data to containers.
4. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 3, characterized in that the data acquisition module comprises the cAdvisor tool;
the data monitoring module comprises the Heapster tool;
the data storage module comprises the InfluxDB tool.
5. An elastic scaling method for deep learning tasks based on the Kubernetes platform, characterized by comprising:
setting an initial number of containers;
obtaining the container memory utilization;
determining whether the container utilization is greater than a preset memory utilization;
if so, adding containers according to a first preset capacity value, and otherwise continuing to obtain the container memory utilization;
calculating the utilization increase;
adding containers according to the utilization increase.
6. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to claim 5, characterized in that calculating the utilization increase specifically comprises:
subtracting the preset memory utilization from the container utilization.
7. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to claim 5, characterized in that adding containers according to the utilization increase specifically comprises:
each time the utilization increase rises by a preset number of percentage points, adding containers according to a second preset capacity value.
8. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to any one of claims 5 to 7, characterized in that the method further comprises:
assigning waiting training data to the newly added containers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810798693.1A CN109117248A (en) | 2018-07-19 | 2018-07-19 | A kind of deep learning task elastic telescopic system and method based on kubernetes platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109117248A true CN109117248A (en) | 2019-01-01 |
Family
ID=64863034
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810798693.1A Pending CN109117248A (en) | 2018-07-19 | 2018-07-19 | A kind of deep learning task elastic telescopic system and method based on kubernetes platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109117248A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9836328B2 (en) * | 2013-01-10 | 2017-12-05 | International Business Machines Corporation | System and method for improving memory usage in virtual machines at a cost of increasing CPU usage |
CN105554102A (en) * | 2015-12-14 | 2016-05-04 | 中电科华云信息技术有限公司 | Elastic expansion method based on container cluster and application system thereof |
CN105912403A (en) * | 2016-04-14 | 2016-08-31 | 青岛海信传媒网络技术有限公司 | Resource management method and device of Docker container |
CN106888254A (en) * | 2017-01-20 | 2017-06-23 | 华南理工大学 | A kind of exchange method between container cloud framework based on Kubernetes and its each module |
CN106897147A (en) * | 2017-02-24 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of application container engine container resource regulating method and device |
CN107370816A (en) * | 2017-07-26 | 2017-11-21 | 郑州云海信息技术有限公司 | A kind of dispositions method and device of Web applications |
Non-Patent Citations (2)
Title |
---|
PEROFU: "kubernetes+docker监控之简介", 《HTTPS://MY.OSCHINA.NET/FUFANGCHUN/BLOG/714530》 * |
陈林,廖恩红,曹杰: "《"互联网+"智慧校园技术与工程实施》", 30 September 2017, 电子科技大学出版社 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109885389A (en) * | 2019-02-19 | 2019-06-14 | 山东浪潮云信息技术有限公司 | A kind of parallel deep learning scheduling training method and system based on container |
CN109885389B (en) * | 2019-02-19 | 2021-07-16 | 浪潮云信息技术股份公司 | Parallel deep learning scheduling training method and system based on container |
CN111158908A (en) * | 2019-12-27 | 2020-05-15 | 重庆紫光华山智安科技有限公司 | Kubernetes-based scheduling method and device for improving GPU utilization rate |
CN111158908B (en) * | 2019-12-27 | 2021-05-25 | 重庆紫光华山智安科技有限公司 | Kubernetes-based scheduling method and device for improving GPU utilization rate |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105184367B (en) | The model parameter training method and system of deep neural network | |
CN108197849A (en) | A kind of intelligent worksheet processing system and worksheet processing method based on automation | |
CN105677000B (en) | The system and method for dynamic voltage frequency adjustment | |
CN104639626A (en) | Multi-level load forecasting and flexible cloud resource configuring method and monitoring and configuring system | |
CN109117248A (en) | A kind of deep learning task elastic telescopic system and method based on kubernetes platform | |
CN104794687A (en) | Point clouds simplifying system and method | |
CN101625735A (en) | FPGA implementation method based on LS-SVM classification and recurrence learning recurrence neural network | |
CN110414778A (en) | Case work dispatching method and device | |
CN113515382B (en) | Cloud resource allocation method and device, electronic equipment and storage medium | |
CN102541622B (en) | Method for placing load-related virtual machine | |
CN109598250A (en) | Feature extracting method, device, electronic equipment and computer-readable medium | |
CN115981863A (en) | Intelligent cloud resource elastic expansion method and system combining business characteristics | |
CN109697090A (en) | A kind of method, terminal device and the storage medium of controlling terminal equipment | |
CN107223046A (en) | intelligent blind-guiding method and device | |
CN111324644B (en) | Method and device for monitoring database connection storm under large-scale micro-service architecture | |
CN110994613B (en) | Power plant load scheduling system and scheduling method thereof | |
CN109788061B (en) | Computing task deployment method and device | |
CN107729218A (en) | A kind of system and method for monitoring processing computing resource equipment | |
CN105335135A (en) | Data processing method and center node | |
CN110276452A (en) | Pruning method, device, equipment and the artificial intelligence chip of neural network model | |
CN109787247A (en) | A kind of reactive compensation planing method based on multi-parametric programming | |
CN109301820A (en) | A kind of enterprise's electrical control method and system | |
CN104468379B (en) | Virtual Hadoop clustered nodes system of selection and device based on most short logical reach | |
CN112052087B (en) | Deep learning training system and method for dynamic resource adjustment and migration | |
CN109117457A (en) | Bodily form curvature big data calculation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190101 |