CN109117248A - Elastic scaling system and method for deep learning tasks based on the Kubernetes platform - Google Patents


Info

Publication number
CN109117248A
Authority
CN
China
Prior art keywords
container
module
data
kubernetes platform
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810798693.1A
Other languages
Chinese (zh)
Inventor
刘娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810798693.1A
Publication of CN109117248A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/301 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is a virtual computing platform, e.g. logically partitioned systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5077 Logical partitioning of resources; Management or configuration of virtualized resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45583 Memory management, e.g. access or allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the invention disclose an elastic scaling system and method for deep learning tasks based on the Kubernetes platform. The system comprises a data acquisition module, a data monitoring module, and a data storage module. The data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the three modules are communicatively connected in sequence. In the invention, the data acquisition module on each compute node collects the memory utilization of the containers. The data monitoring module on the management node aggregates the container utilization collected on each compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train in their best state and deep learning training efficiency improves.

Description

Elastic scaling system and method for deep learning tasks based on the Kubernetes platform
Technical field
The present invention relates to the field of deep learning technology, and in particular to an elastic scaling system and method for deep learning tasks based on the Kubernetes platform.
Background art
The rise of artificial intelligence has been called a hallmark of the "fourth industrial revolution". More and more artificial intelligence applications are entering our lives, including face recognition, image recognition, speech recognition, intelligent driving, and intelligent financial management. The essence of artificial intelligence is to use a large amount of historical data to train a specific data model repeatedly until the model can make judgments on its own. TensorFlow, an outstanding distributed machine learning system in the artificial intelligence field, is widely used in production environments.
In the prior art, deep learning with the TensorFlow system proceeds as follows: training containers are set up on the compute nodes of the Kubernetes platform, the training data is divided into several equal parts, and the control node allocates the training data to the containers for parallel training.
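The prior-art split described above can be illustrated with a minimal sketch. The function name `split_into_shards` and the choice of contiguous slices are assumptions made for illustration; the patent only states that the training data is divided into several equal parts before being allocated to containers.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_shards(samples: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split samples into num_shards contiguous parts whose sizes differ by at most one.

    Hypothetical helper: the patent does not say how the equal parts are formed.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(len(samples), num_shards)
    shards: List[List[T]] = []
    start = 0
    for index in range(num_shards):
        # The first `extra` shards take one additional sample each.
        size = base + (1 if index < extra else 0)
        shards.append(list(samples[start:start + size]))
        start += size
    return shards


if __name__ == "__main__":
    for shard in split_into_shards(list(range(10)), 4):
        print(shard)
```

When the sample count is not divisible by the shard count, the parts cannot be exactly equal, so this sketch keeps them within one sample of each other.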
However, in the prior art the number of training containers is specified manually, and there is no way to know in advance whether that number is optimal. When too few containers are set, the computing capacity of each container is constrained; when too many are set, resources are wasted. Either way, overall deep learning training efficiency suffers.
Summary of the invention
The embodiments of the present invention provide an elastic scaling system and method for deep learning tasks based on the Kubernetes platform, to solve the problem of low deep learning training efficiency in the prior art.
To solve the above technical problem, the embodiments of the present invention disclose the following technical solutions:
A first aspect of the present invention provides an elastic scaling system for deep learning tasks based on the Kubernetes platform, comprising: a data acquisition module, a data monitoring module, and a data storage module, wherein the data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence.
Preferably, the system further comprises a container setting module, which is deployed on the compute nodes of the Kubernetes platform and is communicatively connected to the data storage module.
Preferably, the system further comprises a resource scheduling module, which is deployed on the compute nodes of the Kubernetes platform, is connected to the container setting module, and allocates training data to containers.
Preferably, the data acquisition module comprises the cAdvisor tool;
the data monitoring module comprises the Heapster tool;
the data storage module comprises the InfluxDB tool.
A second aspect of the present invention provides an elastic scaling method for deep learning tasks based on the Kubernetes platform, characterized by comprising:
setting an initial number of containers;
obtaining the container memory utilization;
judging whether the container memory utilization is greater than a preset memory utilization;
if so, adding containers according to a first preset capacity value; otherwise, continuing to obtain the container memory utilization;
calculating the utilization increase;
adding containers according to the utilization increase.
Preferably, calculating the utilization increase specifically comprises:
subtracting the preset memory utilization from the container memory utilization.
Preferably, adding containers according to the utilization increase specifically comprises:
for every preset number of percentage points by which the utilization increase rises, adding containers according to a second preset capacity value.
Preferably, the method further comprises:
allocating the waiting training data to the newly added containers.
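One possible reading of the method steps above can be written as a formula. The symbols are introduced here for illustration only and do not appear in the patent: u is the measured container memory utilization, u_0 the preset memory utilization, n_1 the first preset capacity value, n_2 the second preset capacity value, and p the preset number of percentage points. The formula also assumes that the two additions are cumulative, which the text does not state explicitly.

```latex
\Delta u = u - u_0, \qquad
N_{\mathrm{add}} =
\begin{cases}
0, & u \le u_0, \\[4pt]
n_1 + n_2 \left\lfloor \dfrac{\Delta u}{p} \right\rfloor, & u > u_0.
\end{cases}
```

For example, with u_0 = 40, n_1 = 3, p = 10, n_2 = 1 and a reading of u = 72, the increase is 32 percentage points, the floor term is 3, and 6 containers would be added in total under this reading.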
As can be seen from the above technical solutions, in the present invention a data acquisition module deployed on the compute nodes of the Kubernetes platform collects the memory utilization of the containers. A data monitoring module and a data storage module are deployed on the management node of the Kubernetes platform. The data monitoring module aggregates the container utilization collected on each compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train in their best state and deep learning training efficiency improves.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a structural schematic diagram of an elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a further elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of an elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention;
Fig. 5 is a flow diagram of another elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, which is a structural schematic diagram of an elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention, the system comprises: a data acquisition module, a data monitoring module, and a data storage module.
The data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence.
In the embodiment of the present invention, the data acquisition module comprises the cAdvisor tool, the data monitoring module comprises the Heapster tool, and the data storage module comprises the InfluxDB tool.
A Kubernetes cluster generally has one management node (master) and multiple compute nodes (node), and the containers of a training task are distributed across different nodes. cAdvisor is a tool deployed on each node to collect the running status of containers. Heapster aggregates the data collected by cAdvisor on each node and stores the aggregated data in InfluxDB. The running status of all containers used by a model training job can then be obtained from InfluxDB.
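The collect, aggregate, and store chain described above can be mimicked with a small in-memory sketch. It does not call cAdvisor, Heapster, or InfluxDB; the classes `NodeCollector` and `MetricsStore`, the function `aggregate`, and the byte-based sample format are all hypothetical stand-ins chosen to show the data flow only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MemorySample:
    node: str
    container: str
    used_bytes: int
    limit_bytes: int

    @property
    def utilization(self) -> float:
        """Fraction of the container's memory limit currently in use."""
        return self.used_bytes / self.limit_bytes


class NodeCollector:
    """Stand-in for the per-node data acquisition module (cAdvisor in the patent)."""

    def __init__(self, node: str, readings: Dict[str, Tuple[int, int]]):
        self.node = node
        self._readings = readings  # container name -> (used bytes, limit bytes)

    def collect(self) -> List[MemorySample]:
        return [MemorySample(self.node, name, used, limit)
                for name, (used, limit) in self._readings.items()]


class MetricsStore:
    """Stand-in for the data storage module (InfluxDB in the patent)."""

    def __init__(self) -> None:
        self._by_job: Dict[str, List[MemorySample]] = {}

    def write(self, job: str, samples: List[MemorySample]) -> None:
        self._by_job.setdefault(job, []).extend(samples)

    def utilization_by_container(self, job: str) -> Dict[str, float]:
        # Later samples overwrite earlier ones, so the latest reading wins.
        return {s.container: s.utilization for s in self._by_job.get(job, [])}


def aggregate(job: str, collectors: List[NodeCollector], store: MetricsStore) -> None:
    """Stand-in for the data monitoring module (Heapster in the patent)."""
    for collector in collectors:
        store.write(job, collector.collect())
```

A caller would create one `NodeCollector` per compute node, call `aggregate` on the management side, and read `utilization_by_container` to see every container that belongs to one training job.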
Referring to Fig. 2, which is a structural schematic diagram of another elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention, the system further comprises a container setting module.
The container setting module is deployed on the compute nodes of the Kubernetes platform and is communicatively connected to the data storage module.
The container setting module adds an appropriate number of containers according to the container memory utilization it obtains. To avoid wasting containers, the embodiment of the present invention first sets a minimum number of containers and then gradually increases the number of containers according to memory utilization. In one embodiment, the minimum number of containers is set according to the model training workload. When the memory utilization of the containers exceeds 40%, 3 containers are added; thereafter, for every further 10 percentage points by which the container memory utilization rises, one more container is added.
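The example rule in this embodiment (threshold 40%, 3 containers at the threshold, then 1 container per further 10 percentage points) can be sketched as a pure function. The function name `containers_to_add` and its parameter names are hypothetical, and the sketch assumes that the two additions are cumulative and that only full 10-point steps count, which the text does not spell out.

```python
def containers_to_add(utilization_pct: float,
                      threshold_pct: float = 40.0,
                      first_increment: int = 3,
                      step_pct: float = 10.0,
                      step_increment: int = 1) -> int:
    """Return how many containers to add for a given memory utilization (in percent).

    Assumption: the total is the first increment plus one step increment for
    every full step_pct of utilization above the threshold.
    """
    if utilization_pct <= threshold_pct:
        return 0
    excess = utilization_pct - threshold_pct
    full_steps = int(excess // step_pct)
    return first_increment + full_steps * step_increment


if __name__ == "__main__":
    for reading in (35.0, 45.0, 50.0, 65.0):
        print(reading, containers_to_add(reading))
```

Keeping the thresholds as parameters makes it easy to try values other than the ones named in the embodiment.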
Referring to Fig. 3, which is a structural schematic diagram of a further elastic scaling system for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention, the system further comprises a resource scheduling module.
The resource scheduling module is deployed on the compute nodes of the Kubernetes platform, is connected to the container setting module, and allocates training data to containers.
Before model training, the training data is divided into multiple equal parts according to certain rules, and each container can run only one part of the training data at a time. During training, the TensorFlow platform distributes different training data to each node in the model according to a certain algorithm. When the number of containers is smaller than the number of training data parts, the remaining training data can only wait until the algorithm in a container has finished. Because the embodiment of the present invention adds new containers, the resource scheduling module can allocate the waiting training data to the newly added containers for training, which reduces waiting time and improves training efficiency.
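The hand-off of waiting training data to newly added containers can be sketched with a simple queue. The function `assign_waiting_shards`, the shard and container names, and the first-in-first-out order are assumptions for illustration; the patent does not describe the allocation order.

```python
from collections import deque
from typing import Deque, Dict, List


def assign_waiting_shards(waiting: Deque[str],
                          running: Dict[str, str],
                          new_containers: List[str]) -> Dict[str, str]:
    """Give each newly added container one waiting shard, oldest shard first.

    `running` maps container name to the shard it is training on and is
    updated in place. Returns only the assignments made by this call.
    """
    assignments: Dict[str, str] = {}
    for name in new_containers:
        if not waiting:
            break  # more new containers than waiting shards
        shard = waiting.popleft()
        running[name] = shard
        assignments[name] = shard
    return assignments


if __name__ == "__main__":
    waiting = deque(["shard-3", "shard-4", "shard-5"])
    running = {"c0": "shard-0", "c1": "shard-1", "c2": "shard-2"}
    print(assign_waiting_shards(waiting, running, ["c3", "c4"]))
    print(list(waiting))
```

Each container receives a single shard, matching the statement that a container runs only one part of the training data at a time.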
Referring to Fig. 4, which is a flow diagram of an elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention, the method comprises:
S10: Set the initial number of containers.
Set the corresponding minimum number of containers according to the training workload of the model to be trained.
S20: Obtain the container memory utilization.
The memory utilization of each container is obtained by the cAdvisor tool and then aggregated into the InfluxDB tool by the Heapster tool.
S30: Judge whether the container memory utilization is greater than the preset memory utilization.
Judge whether the memory utilization obtained for the containers on each node is greater than the preset memory utilization.
If so, execute step S40: add containers according to the first preset capacity value. Otherwise, return to step S20: obtain the container memory utilization.
S50: Calculate the utilization increase.
For containers whose memory utilization is greater than the preset memory utilization, the increase in memory utilization must be calculated, and the number of containers is increased according to that increase. Specifically, the preset memory utilization is subtracted from the container memory utilization.
S60: Add containers according to the utilization increase.
For every preset number of percentage points by which the utilization increase rises, containers are added according to the second preset capacity value. The specific settings are as described above and are not repeated here.
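Steps S10 to S60 can be replayed over a series of utilization readings in a short sketch. The function `run_scaling_flow` and its parameters are hypothetical. The sketch also assumes that the controller remembers which increments it has already applied, so that a repeated or lower reading does not add containers twice; the patent does not describe this bookkeeping.

```python
from typing import Iterable, List


def run_scaling_flow(readings_pct: Iterable[float],
                     initial_containers: int,
                     threshold_pct: float,
                     first_increment: int,
                     step_pct: float,
                     step_increment: int) -> List[int]:
    """Replay utilization readings and return the container count after each one."""
    containers = initial_containers              # S10: initial number of containers
    threshold_crossed = False
    steps_applied = 0
    history: List[int] = []
    for utilization in readings_pct:             # S20: obtain memory utilization
        if utilization > threshold_pct:          # S30: compare with the preset value
            if not threshold_crossed:            # S40: first preset capacity value
                containers += first_increment
                threshold_crossed = True
            increase = utilization - threshold_pct   # S50: utilization increase
            steps = int(increase // step_pct)
            if steps > steps_applied:            # S60: second preset capacity value
                containers += (steps - steps_applied) * step_increment
                steps_applied = steps
        history.append(containers)
    return history


if __name__ == "__main__":
    print(run_scaling_flow([30.0, 45.0, 55.0, 52.0, 75.0],
                           initial_containers=2, threshold_pct=40.0,
                           first_increment=3, step_pct=10.0, step_increment=1))
```

The flow only ever adds containers, in line with the method as described, which has no scale-down step.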
Referring to Fig. 5, which is a flow diagram of another elastic scaling method for deep learning tasks based on the Kubernetes platform provided by an embodiment of the present invention, the method further comprises:
S70: Allocate the waiting training data to the newly added containers.
The resource scheduling module allocates the waiting training data to the newly added containers for training, which improves training efficiency.
In the present invention, a data acquisition module deployed on the compute nodes of the Kubernetes platform collects the memory utilization of the containers. A data monitoring module and a data storage module are deployed on the management node of the Kubernetes platform. The data monitoring module aggregates the container utilization collected on each compute node and saves it in the data storage module. The management node then adds containers according to preset rules, based on the memory utilization saved in the data storage module, so that the containers train in their best state and deep learning training efficiency improves.
The above are only specific embodiments of the present invention, provided so that those skilled in the art can understand or implement the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An elastic scaling system for deep learning tasks based on the Kubernetes platform, characterized by comprising: a data acquisition module, a data monitoring module, and a data storage module, wherein the data acquisition module is deployed on the compute nodes of the Kubernetes platform, the data monitoring module and the data storage module are deployed on the management node of the Kubernetes platform, and the data acquisition module, the data monitoring module, and the data storage module are communicatively connected in sequence.
2. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 1, characterized in that the system further comprises a container setting module, which is deployed on the compute nodes of the Kubernetes platform and is communicatively connected to the data storage module.
3. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 1 or 2, characterized in that the system further comprises a resource scheduling module, which is deployed on the compute nodes of the Kubernetes platform, is connected to the container setting module, and allocates training data to containers.
4. The elastic scaling system for deep learning tasks based on the Kubernetes platform according to claim 3, characterized in that the data acquisition module comprises the cAdvisor tool;
the data monitoring module comprises the Heapster tool;
the data storage module comprises the InfluxDB tool.
5. An elastic scaling method for deep learning tasks based on the Kubernetes platform, characterized by comprising:
setting an initial number of containers;
obtaining the container memory utilization;
judging whether the container memory utilization is greater than a preset memory utilization;
if so, adding containers according to a first preset capacity value; otherwise, continuing to obtain the container memory utilization;
calculating the utilization increase;
adding containers according to the utilization increase.
6. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to claim 5, characterized in that calculating the utilization increase specifically comprises:
subtracting the preset memory utilization from the container memory utilization.
7. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to claim 5, characterized in that adding containers according to the utilization increase specifically comprises:
for every preset number of percentage points by which the utilization increase rises, adding containers according to a second preset capacity value.
8. The elastic scaling method for deep learning tasks based on the Kubernetes platform according to any one of claims 5 to 7, characterized in that the method further comprises:
allocating the waiting training data to the newly added containers.
CN201810798693.1A 2018-07-19 2018-07-19 A kind of deep learning task elastic telescopic system and method based on kubernetes platform Pending CN109117248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810798693.1A CN109117248A (en) 2018-07-19 2018-07-19 A kind of deep learning task elastic telescopic system and method based on kubernetes platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810798693.1A CN109117248A (en) 2018-07-19 2018-07-19 A kind of deep learning task elastic telescopic system and method based on kubernetes platform

Publications (1)

Publication Number Publication Date
CN109117248A (en) 2019-01-01

Family

ID=64863034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810798693.1A Pending CN109117248A (en) 2018-07-19 2018-07-19 A kind of deep learning task elastic telescopic system and method based on kubernetes platform

Country Status (1)

Country Link
CN (1) CN109117248A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9836328B2 (en) * 2013-01-10 2017-12-05 International Business Machines Corporation System and method for improving memory usage in virtual machines at a cost of increasing CPU usage
CN105554102A (en) * 2015-12-14 2016-05-04 中电科华云信息技术有限公司 Elastic expansion method based on container cluster and application system thereof
CN105912403A (en) * 2016-04-14 2016-08-31 青岛海信传媒网络技术有限公司 Resource management method and device of Docker container
CN106888254A (en) * 2017-01-20 2017-06-23 华南理工大学 A kind of exchange method between container cloud framework based on Kubernetes and its each module
CN106897147A (en) * 2017-02-24 2017-06-27 郑州云海信息技术有限公司 A kind of application container engine container resource regulating method and device
CN107370816A (en) * 2017-07-26 2017-11-21 郑州云海信息技术有限公司 A kind of dispositions method and device of Web applications

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PEROFU: "kubernetes+docker监控之简介", 《HTTPS://MY.OSCHINA.NET/FUFANGCHUN/BLOG/714530》 *
陈林,廖恩红,曹杰: "《"互联网+"智慧校园技术与工程实施》", 30 September 2017, 电子科技大学出版社 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109885389A (en) * 2019-02-19 2019-06-14 山东浪潮云信息技术有限公司 A kind of parallel deep learning scheduling training method and system based on container
CN109885389B (en) * 2019-02-19 2021-07-16 浪潮云信息技术股份公司 Parallel deep learning scheduling training method and system based on container
CN111158908A (en) * 2019-12-27 2020-05-15 重庆紫光华山智安科技有限公司 Kubernetes-based scheduling method and device for improving GPU utilization rate
CN111158908B (en) * 2019-12-27 2021-05-25 重庆紫光华山智安科技有限公司 Kubernetes-based scheduling method and device for improving GPU utilization rate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190101