CN109284184A - Method for building a distributed machine learning platform based on containerization technology - Google Patents
Method for building a distributed machine learning platform based on containerization technology
- Publication number
- CN109284184A CN109284184A CN201810186485.6A CN201810186485A CN109284184A CN 109284184 A CN109284184 A CN 109284184A CN 201810186485 A CN201810186485 A CN 201810186485A CN 109284184 A CN109284184 A CN 109284184A
- Authority
- CN
- China
- Prior art keywords
- task
- service
- file
- platform
- kubernetes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F2009/45562—Creating, deleting, cloning virtual machine instances
- G06F2009/45575—Starting, stopping, suspending or resuming virtual machine instances
- G06F2209/5015—Service provider selection
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
Abstract
The present invention provides a method for building a distributed machine learning platform based on containerization technology. A platform built with this method improves resource utilization and computational efficiency, simplifies task management and submission, and lets users focus on deep learning research rather than on hardware and other concerns.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a method for building a distributed machine learning platform based on containerization technology.
Background art
In recent years, with advances in computing power and algorithms, machine learning, and deep learning in particular, has developed rapidly. It has become one of the most active research fields and is being applied to more and more problems in more and more domains. Many of these problems involve large datasets, heavy computation, and long running times. The resources (CPU, GPU, memory, disk, etc.) and performance of a single computer quickly become a bottleneck and cannot meet the requirements of machine learning tasks. Distributed computing is the core technology of current big data processing: a task is divided into parts that can run in parallel, the parts are distributed to the nodes of a cluster, each node executes in parallel, and the results are then aggregated. Machine learning tasks contain a great deal of computation that can run in parallel, and most mainstream machine learning frameworks support distributed operation. Integrating machine learning with distributed computing platforms is therefore a natural trend.
Containerization technology, with Docker as its representative, has matured. An image creates a virtualized running environment that contains all required dependencies; its lightweight and easily managed nature has made it widely popular. Deploying the components of a distributed platform with Docker and combining them into the final platform can therefore save a great deal of work. Container orchestration tools, with Kubernetes as their representative, can manage containers effectively and ensure their high availability, rolling upgrades, load balancing, and so on, providing strong support for the robustness of the platform. Building a distributed machine learning platform on Docker and Kubernetes can support machine learning well while remaining concise and usable, enabling a customized machine learning platform.
Summary of the invention
The present invention provides a method for building a distributed machine learning platform based on containerization technology; a platform built with this method achieves high resource utilization and computational efficiency.
To achieve the above technical effect, the technical solution of the present invention is as follows:
A method for building a distributed machine learning platform based on containerization technology, comprising the following steps:
S1: prepare a Docker registry. Both the deployment and the use of the platform require pushing and pulling images, so a Docker registry is necessary; it may be self-hosted, or a public registry may be used;
S2: fill in the cluster description file. The cluster description file is mainly used to build the Kubernetes cluster; it must contain the information of the nodes on which the Kubernetes cluster will run, the configuration of the various components, and the services each node needs to run;
S3: deploy the Kubernetes cluster. A Python script reads the cluster description file, generates the required shell scripts and Kubernetes description files, and sends these files to each host over ssh for execution, which completes the deployment of the Kubernetes cluster;
S4: fill in the service description file. The service description file is mainly used to start and run the various services of the machine learning platform; it must contain the configuration of each service and the nodes on which each service needs to run;
S5: package the services of the machine learning platform into Docker images. A Python script reads the description of each service from the service description file and generates a Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: deploy the services of the machine learning platform. Using the service description file of step S4, generate the Kubernetes description file for each service, then call kubectl, the command-line tool of Kubernetes, to create these services;
S7: wait for the services to finish synchronizing. After the services of step S6 have been deployed, the various components need some time to synchronize, and some services may restart several times during this period; once all services run stably, the machine learning platform is fully built.
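The driver script of step S3 can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the schema of the cluster description (`machine-list`, `role`, `ip`, `docker-registry`) and the bootstrap commands are assumptions chosen for the example.

```python
# Hypothetical sketch of step S3: read a cluster description and generate,
# per host, the shell script that would bootstrap Kubernetes there.
CLUSTER_DESC = {
    "machine-list": [
        {"hostname": "master-1", "ip": "10.0.0.10", "role": "master"},
        {"hostname": "worker-1", "ip": "10.0.0.11", "role": "worker"},
        {"hostname": "worker-2", "ip": "10.0.0.12", "role": "worker"},
    ],
    "docker-registry": "registry.example.com/mlplatform",
}

def render_bootstrap(node: dict, registry: str) -> str:
    """Render the shell script a node would run during deployment."""
    lines = [
        "#!/bin/sh",
        f"# bootstrap for {node['hostname']} ({node['role']})",
        f"docker pull {registry}/kubelet:latest",
    ]
    if node["role"] == "master":
        # kubelet then starts the other control-plane components as static pods
        lines.append("systemctl start kubelet  # master: control plane follows")
    else:
        lines.append("systemctl start kubelet  # worker joins the cluster")
    return "\n".join(lines)

scripts = {n["hostname"]: render_bootstrap(n, CLUSTER_DESC["docker-registry"])
           for n in CLUSTER_DESC["machine-list"]}
# In the real workflow these scripts would be copied to each host over ssh
# and executed there to complete the cluster deployment.
print(sorted(scripts))
```

In the described method, the same configuration file also drives steps S5 and S6, so a single `CLUSTER_DESC`-style structure would feed all three generation passes.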
Further, the workflow of the platform is:
Step 1: upload code and data to HDFS. In the design of this platform, the code and data of a program are both obtained from HDFS, so before a task is executed, the data and the machine learning code required by the task must be transferred to HDFS using pai-fs;
Step 2: fill in the task description file. The task description file describes the computing resources (CPU, GPU, memory, disk, etc.) the task to be run needs to use, the commands each node needs to execute, the program address, the data address, the output address, and so on, as well as the number of retries when an error occurs;
Step 3: submit the task through the WebPortal. On the task submission page of the WebPortal, click the submit button and select the task description file prepared in step 2; after a successful submission, the page indicates that the task has been submitted successfully;
Step 4: the RESTServer prepares the run scripts and the framework description file. The RESTServer parses the task description uploaded by the user, prepares the scripts and the framework description file required by the FrameworkLauncher, and notifies the FrameworkLauncher through its interface to start the task;
Step 5: the FrameworkLauncher starts the task. After a new task is received through the interface, the FrameworkLauncher adds it to the task queue to wait for execution; when the task's turn comes and the resources it requires are available, the FrameworkLauncher notifies Hadoop YARN to execute the task;
Step 6: Hadoop YARN starts containers to execute the task. As a global resource manager, Hadoop YARN schedules global resources to execute machine learning tasks, even across machines; YARN starts new containers to execute the task as required and allocates the requested resources to these containers;
Step 7: wait for execution to finish and obtain the result. After the task has finished executing, its result can be obtained through the WebPortal, or the output data can be downloaded directly from HDFS while the task is running.
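A task description of the kind filled in at step 2 might be built and checked as below. Every field name here (`jobName`, `codeDir`, `taskRoles`, and so on) is an assumption for illustration; the patent does not publish the actual schema.

```python
import json

# Hypothetical task description for step 2 of the workflow.
task = {
    "jobName": "mnist-train",
    "codeDir": "hdfs://namenode:9000/user/alice/mnist/code",
    "dataDir": "hdfs://namenode:9000/user/alice/mnist/data",
    "outputDir": "hdfs://namenode:9000/user/alice/mnist/output",
    "retryCount": 3,  # retries when an error occurs
    "taskRoles": [
        {"name": "worker", "taskNumber": 2,
         "cpuNumber": 4, "gpuNumber": 1, "memoryMB": 8192,
         "command": "python train.py"},
    ],
}

def validate(desc: dict) -> list:
    """Return a list of problems; an empty list means the description looks submittable."""
    problems = []
    for key in ("jobName", "codeDir", "dataDir", "outputDir", "taskRoles"):
        if key not in desc:
            problems.append(f"missing field: {key}")
    for role in desc.get("taskRoles", []):
        if role.get("gpuNumber", 0) < 0:
            problems.append(f"negative gpuNumber in role {role.get('name')}")
    return problems

# What the WebPortal would POST to the RESTServer at step 3.
payload = json.dumps(task, indent=2)
print(validate(task))
```

Validating before submission mirrors step 4, where the RESTServer parses the uploaded description before handing it to the FrameworkLauncher.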
Compared with the prior art, the beneficial effects of the technical solution of the present invention are:
The purpose of the invention is to provide a general and concise way to build a deep learning platform. A platform built with this method improves resource utilization and computational efficiency, simplifies task management and submission, and lets users focus on deep learning research rather than on hardware and other concerns.
Brief description of the drawings
Fig. 1 is the architecture diagram of the entire platform;
Fig. 2 is the deployment diagram of Kubernetes, showing the high-availability and load-balancing functions;
Fig. 3 is the complete flow of a machine learning task, from submission to the return of the result.
Detailed description of the embodiments
The drawings are for illustrative purposes only and shall not be construed as limiting the patent;
For better illustration of this embodiment, certain components in the drawings may be omitted, enlarged, or reduced; they do not represent the size of the actual product;
For those skilled in the art, the omission of some known structures and their descriptions in the drawings is understandable.
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
A method for building a distributed machine learning platform based on containerization technology, comprising the following steps:
S1: prepare a Docker registry. Both the deployment and the use of the platform require pushing and pulling images, so a Docker registry is necessary; it may be self-hosted, or a public registry may be used;
S2: fill in the cluster description file. The cluster description file is mainly used to build the Kubernetes cluster; it must contain the information of the nodes on which the Kubernetes cluster will run, the configuration of the various components, and the services each node needs to run;
S3: deploy the Kubernetes cluster. A Python script reads the cluster description file, generates the required shell scripts and Kubernetes description files, and sends these files to each host over ssh for execution, which completes the deployment of the Kubernetes cluster;
S4: fill in the service description file. The service description file is mainly used to start and run the various services of the machine learning platform; it must contain the configuration of each service and the nodes on which each service needs to run;
S5: package the services of the machine learning platform into Docker images. A Python script reads the description of each service from the service description file and generates a Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: deploy the services of the machine learning platform. Using the service description file of step S4, generate the Kubernetes description file for each service, then call kubectl, the command-line tool of Kubernetes, to create these services;
S7: wait for the services to finish synchronizing. After the services of step S6 have been deployed, the various components need some time to synchronize, and some services may restart several times during this period; once all services run stably, the machine learning platform is fully built.
Further, the workflow of the platform is:
Step 1: upload code and data to HDFS. In the design of this platform, the code and data of a program are both obtained from HDFS, so before a task is executed, the data and the machine learning code required by the task must be transferred to HDFS using pai-fs;
Step 2: fill in the task description file. The task description file describes the computing resources (CPU, GPU, memory, disk, etc.) the task to be run needs to use, the commands each node needs to execute, the program address, the data address, the output address, and so on, as well as the number of retries when an error occurs;
Step 3: submit the task through the WebPortal. On the task submission page of the WebPortal, click the submit button and select the task description file prepared in step 2; after a successful submission, the page indicates that the task has been submitted successfully;
Step 4: the RESTServer prepares the run scripts and the framework description file. The RESTServer parses the task description uploaded by the user, prepares the scripts and the framework description file required by the FrameworkLauncher, and notifies the FrameworkLauncher through its interface to start the task;
Step 5: the FrameworkLauncher starts the task. After a new task is received through the interface, the FrameworkLauncher adds it to the task queue to wait for execution; when the task's turn comes and the resources it requires are available, the FrameworkLauncher notifies Hadoop YARN to execute the task;
Step 6: Hadoop YARN starts containers to execute the task. As a global resource manager, Hadoop YARN schedules global resources to execute machine learning tasks, even across machines; YARN starts new containers to execute the task as required and allocates the requested resources to these containers;
Step 7: wait for execution to finish and obtain the result. After the task has finished executing, its result can be obtained through the WebPortal, or the output data can be downloaded directly from HDFS while the task is running.
At the bottom of the platform, physical machines or virtual machines running Ubuntu 16.04 LTS serve as the hosts for the various components. All hosts must be in the same subnet and able to reach one another over ssh, the user must have root privileges, and the worker nodes must be equipped with GPUs.
Docker is used as the containerization engine and is installed on every host; its lightweight, manageable, and highly portable nature greatly simplifies installation, deployment, and operational management.
On top of Docker, Kubernetes manages the Docker containers in a unified way, and all platform components are installed through Kubernetes. Likewise for ease of management, all Kubernetes components other than docker and kubectl are deployed as containers: kubelet is the most basic component and must be started through Docker, while the other components are deployed and installed by kubelet as static pods.
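The static-pod mechanism just mentioned can be illustrated with a small sketch: kubelet watches a manifest directory and runs whatever pod specs it finds there, so the remaining control-plane components can be deployed simply by writing files. The manifest path, image name, and version below are assumptions for illustration.

```python
import json

# Directory kubelet is assumed to watch for static-pod manifests.
MANIFEST_DIR = "/etc/kubernetes/manifests"

def static_pod_manifest(name: str, image: str) -> str:
    """Build a minimal static-pod spec as kubelet would consume it."""
    pod = {
        "apiVersion": "v1", "kind": "Pod",
        "metadata": {"name": name, "namespace": "kube-system"},
        "spec": {
            "hostNetwork": True,  # control-plane pods bind the host's network
            "containers": [{"name": name, "image": image}],
        },
    }
    return json.dumps(pod, indent=2)

manifest = static_pod_manifest("kube-apiserver", "k8s.gcr.io/kube-apiserver:v1.9.0")
# Writing this string to f"{MANIFEST_DIR}/kube-apiserver.json" would make
# kubelet start, and keep restarting, the apiserver container.
print("kube-apiserver" in manifest)
```

This is why only docker and kubectl need conventional installation: everything else rides on kubelet's manifest directory.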
Hadoop is one of the most important service components of the entire platform and the basis for parallel computation and improved efficiency. Several of its modules play important roles in the platform's design. HDFS stores the data and code used by machine learning tasks; it is assumed here that all machine learning tasks take their input from HDFS, and since most current machine learning frameworks support HDFS well, this requirement causes little trouble beyond becoming familiar with the programming mode. MapReduce is the main form in which parallel computation is carried out; most scripts run on the platform are eventually converted into MapReduce form. YARN is the resource manager of Hadoop; by establishing one global resource manager and one application controller per application, it manages and schedules the resources of each node in a unified way and improves resource utilization. ZooKeeper is the module for automatic node management; when a node fails or a service terminates unexpectedly, it continues running the service on a new node, guaranteeing the stability and availability of the entire cluster.
On top of Hadoop, a FrameworkLauncher layer is added. The FrameworkLauncher extends Hadoop: it packages some Hadoop-based functions for convenient interconnection with other components, reduces the coupling between components, and simplifies the design.
The RESTServer provides HTTP services to the outside, opening ports for third-party services and as the entry point for the web pages, which facilitates the extension and commercial operation of the service.
The WebPortal is the web entry point of the platform; through it, the cluster and services of the platform can be managed visually, tasks can be submitted and managed, and task results and statistics can be obtained.
At the same time, to achieve convenience and high availability, several mechanisms and strategies are adopted:
Automatically generated scripts: in this design, only a single configuration file is presented to the user. The configuration file must provide the node configuration of each node (IP address, user name, password, role, the services run on each node, etc.) and the information required by each service component (such as the address at which it connects to the platform and the port the service opens). During deployment, the configuration is first read from the configuration file, and a Python script then fills pre-prepared templates: mainly the shell scripts used to deploy the platform, the Dockerfiles used to build images, and the yaml configuration files required to deploy services. The generated scripts and configuration files are then distributed to all nodes over ssh, and the startup script is executed to begin deployment. The process falls roughly into three parts: deploying Kubernetes, building images, and deploying services. As long as no error aborts the deployment, these three parts execute in sequence, fully automatically.
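The template-filling step can be sketched with the standard library's `string.Template`. The template fields, service names, base images, and ports below are assumptions for illustration, not the platform's real values.

```python
from string import Template

# A Dockerfile template completed from each service's configuration.
DOCKERFILE_TEMPLATE = Template("""\
FROM ${base_image}
COPY ${service_name}/ /opt/${service_name}/
EXPOSE ${port}
CMD ["/opt/${service_name}/start.sh"]
""")

# Service configuration as it might be read from the single config file.
services = {
    "rest-server": {"base_image": "node:8", "port": 9186},
    "webportal":   {"base_image": "node:8", "port": 9286},
}

dockerfiles = {
    name: DOCKERFILE_TEMPLATE.substitute(service_name=name, **cfg)
    for name, cfg in services.items()
}
# Each rendered Dockerfile would then be built with `docker build` and the
# resulting image pushed to the registry named in the cluster description.
print(dockerfiles["webportal"].splitlines()[0])  # → FROM node:8
```

The same pattern applies to the shell scripts and yaml files: one template per artifact, filled from the one configuration file the user edits.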
Deployment with images: as mentioned above, images are built automatically by scripts, and the reason for using images is that they greatly simplify deployment. After the configuration of a service is read from the configuration file, this information completes a Dockerfile, which is used to build the image; the image is then pushed to a remote registry (public or private). During deployment, a node only needs to be told which image to pull, and the container for the corresponding service can be created and run. This is very efficient and fast, and the advantage grows with the scale of the deployment.
High availability of Kubernetes: as the underlying container orchestration tool and the platform on which services are deployed, Kubernetes plays a large role; if it crashes, none of the components can operate normally, so its high availability must be guaranteed and its recovery time after downtime must be as short as possible. The nodes of a Kubernetes cluster fall into two kinds. Master nodes are responsible for the management and control of the entire platform, for data storage, and for external interaction; they are the core of the cluster. Worker nodes are mainly responsible for executing tasks. For a worker node, downtime or termination of a service causes all containers running on that node to be lost or terminated, but the master automatically starts new containers on other nodes to continue the tasks, so the risk is small. For a master node, however, the impact and risk of downtime are both large. Therefore, multiple master nodes are designed here, and a VIP (virtual IP address) implements an active/standby high-availability mode; combined with load balancing, all master nodes are utilized and the traffic load is shared.
High availability of services: the service components running on the platform also need guaranteed high availability, which is relatively easy to implement using the DaemonSet and Service functions of Kubernetes. DaemonSet provides a daemon for a service on an individual node: when the container running the service crashes or is killed, a new container is started automatically so the service continues. Service provides load balancing for a set of containers with identical functions: it periodically sends heartbeats to check the availability of each container, and if a container fails, later requests are forwarded to the other containers. If Service is combined with the Deployment function, then after a container or node fails, containers are started automatically on other nodes to maintain the replica count, and the new containers are added to the Service.
The same or similar reference signs denote the same or similar components;
The positional relationships described in the drawings are for illustration only and shall not be construed as limiting the patent;
Obviously, the above embodiments are merely examples given for clarity of illustration, not a limitation of the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms can also be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.
Claims (2)
1. A method for building a distributed machine learning platform based on containerization technology, characterized in that it comprises the following steps:
S1: prepare a Docker registry. Both the deployment and the use of the platform require pushing and pulling images, so a Docker registry is necessary; it may be self-hosted, or a public registry may be used;
S2: fill in the cluster description file. The cluster description file is mainly used to build the Kubernetes cluster; it must contain the information of the nodes on which the Kubernetes cluster will run, the configuration of the various components, and the services each node needs to run;
S3: deploy the Kubernetes cluster. A Python script reads the cluster description file, generates the required shell scripts and Kubernetes description files, and sends these files to each host over ssh for execution, which completes the deployment of the Kubernetes cluster;
S4: fill in the service description file. The service description file is mainly used to start and run the various services of the machine learning platform; it must contain the configuration of each service and the nodes on which each service needs to run;
S5: package the services of the machine learning platform into Docker images. A Python script reads the description of each service from the service description file and generates a Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: deploy the services of the machine learning platform. Using the service description file of step S4, generate the Kubernetes description file for each service, then call kubectl, the command-line tool of Kubernetes, to create these services;
S7: wait for the services to finish synchronizing. After the services of step S6 have been deployed, the various components need some time to synchronize, and some services may restart several times during this period; once all services run stably, the machine learning platform is fully built.
2. the building method of the distributed machines learning platform according to claim 1 based on containerization technique, feature
It is, the learning process of the platform is:
Step 1: upload code and data to HDFS. In the design of this platform, the code and data of a program are all obtained from HDFS, so before executing a task the data and machine learning code required by that task must be transferred to HDFS using pai-fs;
Step 2: fill in the task description file. The task description file required to run a job describes the task to be run: the computing resources it needs, such as CPU, GPU, memory and disk; the command each node must execute; the program address, data address, output address, etc.; and the number of retries when an error occurs;
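The fields enumerated in step 2 could be captured in a task description like the following (the schema and field names are illustrative, not the platform's actual format):

```python
# An illustrative task description covering the fields the text enumerates.
task_description = {
    "jobName": "mnist-train",
    "retryCount": 3,                       # retries when an error occurs
    "taskRoles": [{
        "name": "worker",
        "instances": 2,
        "cpu": 4, "gpu": 1,                # computing resources
        "memoryMB": 8192, "diskMB": 20480,
        "command": "python train.py",      # command each node executes
        "codeDir": "hdfs://namenode:9000/user/demo/code",     # program address
        "dataDir": "hdfs://namenode:9000/user/demo/data",     # data address
        "outputDir": "hdfs://namenode:9000/user/demo/output", # output address
    }],
}

def validate_task(task: dict) -> bool:
    """Minimal sanity check a submission page might run on the file."""
    required = {"name", "cpu", "memoryMB", "command", "codeDir"}
    return bool(task.get("jobName")) and all(
        required <= role.keys() for role in task.get("taskRoles", [])
    )
```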
Step 3: submit the task through the WebPortal. On the task submission page of the WebPortal, click the submit-task button and select the task description file prepared in step 2; after a successful submission, the page indicates that the task was submitted successfully;
Step 4: RESTServer prepares the run scripts and framework description file. RESTServer parses the task description file uploaded by the user, prepares the scripts and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
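A sketch of the translation step 4 performs, turning a user task description into a framework description for the launcher (the output shape and key names are assumptions):

```python
def to_framework_description(task: dict) -> dict:
    """Sketch of how a REST server might translate a user task description
    into the framework description handed to the launcher."""
    return {
        "frameworkName": task["jobName"],
        "retryPolicy": {"maxRetryCount": task.get("retryCount", 0)},
        "taskRoles": {
            role["name"]: {
                "taskNumber": role.get("instances", 1),
                "resource": {
                    "cpuNumber": role["cpu"],
                    "gpuNumber": role.get("gpu", 0),
                    "memoryMB": role["memoryMB"],
                },
                "entryPoint": role["command"],
            }
            for role in task["taskRoles"]
        },
    }
```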
Step 5: start the task through FrameworkLauncher. After a new task is received through the interface, FrameworkLauncher adds it to the task queue to await execution; when the task's turn comes and the resources it requires can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
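The queue-then-dispatch behaviour of step 5 can be modeled as follows (a toy FIFO model counting only GPUs; the real launcher tracks all resource dimensions):

```python
from collections import deque

class LauncherSketch:
    """Toy model of step 5: tasks wait in FIFO order and are dispatched
    only when their resource demand can be met."""
    def __init__(self, free_gpus: int):
        self.free_gpus = free_gpus
        self.queue = deque()
        self.dispatched = []

    def submit(self, name: str, gpus: int):
        self.queue.append((name, gpus))
        self._try_dispatch()

    def finish(self, name: str, gpus: int):
        self.free_gpus += gpus  # a completed task releases its resources
        self._try_dispatch()

    def _try_dispatch(self):
        # Dispatch from the head while the head task's demand can be met;
        # here "dispatch" stands in for notifying Hadoop YARN.
        while self.queue and self.queue[0][1] <= self.free_gpus:
            name, gpus = self.queue.popleft()
            self.free_gpus -= gpus
            self.dispatched.append(name)
```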
Step 6: start containers with Hadoop YARN to execute the task. As a global resource manager, Hadoop YARN can schedule resources across the whole cluster to execute machine learning tasks, even scheduling across machines; YARN starts new containers to execute the task as required and assigns the requested resources to these containers;
Step 7: wait for execution to finish and obtain the task's results. After the task has completed, its execution results can be obtained through the WebPortal, and the output data can also be downloaded directly from HDFS.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810186485.6A CN109284184A (en) | 2018-03-07 | 2018-03-07 | A kind of building method of the distributed machines learning platform based on containerization technique |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109284184A true CN109284184A (en) | 2019-01-29 |
Family
ID=65186146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810186485.6A Pending CN109284184A (en) | 2018-03-07 | 2018-03-07 | A kind of building method of the distributed machines learning platform based on containerization technique |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109284184A (en) |
2018-03-07: application CN201810186485.6A filed in China (published as CN109284184A); status: Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106790483A (en) * | 2016-12-13 | 2017-05-31 | 武汉邮电科学研究院 | Hadoop group systems and fast construction method based on container technique |
CN106888254A (en) * | 2017-01-20 | 2017-06-23 | 华南理工大学 | A kind of exchange method between container cloud framework based on Kubernetes and its each module |
CN107659609A (en) * | 2017-07-26 | 2018-02-02 | 北京天云融创软件技术有限公司 | A kind of deep learning support platform and deep learning training method based on cloud computing |
CN107450961A (en) * | 2017-09-22 | 2017-12-08 | 济南浚达信息技术有限公司 | A kind of distributed deep learning system and its building method, method of work based on Docker containers |
Non-Patent Citations (3)
Title |
---|
YIN,XUGANG ET AL: "Research and Implementation PaaS platform based on Docker", 《4TH NATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS AND COMPUTER ENGINEERING (NCEECE)》 * |
ZHANG, YI ET AL: "Construction Scheme of a Lightweight PaaS Platform for Power Grids Based on Docker", 《COMPUTER ENGINEERING APPLICATION TECHNOLOGY》 * |
ZOU, TUN ET AL: "Architecture Design of a Container-Cloud-Based PaaS Platform for Tobacco Commercial Enterprises", 《2017 ACADEMIC ANNUAL CONFERENCE OF CHINA TOBACCO SOCIETY》 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109740765A (en) * | 2019-01-31 | 2019-05-10 | 成都品果科技有限公司 | A kind of machine learning system building method based on Amazon server |
CN109740765B (en) * | 2019-01-31 | 2023-05-02 | 成都品果科技有限公司 | Machine learning system building method based on Amazon network server |
CN110297670A (en) * | 2019-05-17 | 2019-10-01 | 北京瀚海星云科技有限公司 | A kind of method and system improving distributed task scheduling training effectiveness on container cloud |
CN110297670B (en) * | 2019-05-17 | 2023-06-27 | 深圳致星科技有限公司 | Method and system for improving training efficiency of distributed tasks on container cloud |
WO2020259081A1 (en) * | 2019-06-25 | 2020-12-30 | 深圳前海微众银行股份有限公司 | Task scheduling method, apparatus, and device, and computer-readable storage medium |
CN110378463B (en) * | 2019-07-15 | 2021-05-14 | 北京智能工场科技有限公司 | Artificial intelligence model standardization training platform and automatic system |
CN110378463A (en) * | 2019-07-15 | 2019-10-25 | 北京智能工场科技有限公司 | A kind of artificial intelligence model standardized training platform and automated system |
CN110471767A (en) * | 2019-08-09 | 2019-11-19 | 上海寒武纪信息科技有限公司 | A kind of dispatching method of equipment |
CN110471767B (en) * | 2019-08-09 | 2021-09-03 | 上海寒武纪信息科技有限公司 | Equipment scheduling method |
CN112395039B (en) * | 2019-08-16 | 2024-01-19 | 北京神州泰岳软件股份有限公司 | Method and device for managing Kubernetes cluster |
CN112395039A (en) * | 2019-08-16 | 2021-02-23 | 北京神州泰岳软件股份有限公司 | Management method and device for Kubernetes cluster |
CN111026414B (en) * | 2019-12-12 | 2023-09-08 | 杭州安恒信息技术股份有限公司 | HDP platform deployment method based on kubernetes |
CN111026414A (en) * | 2019-12-12 | 2020-04-17 | 杭州安恒信息技术股份有限公司 | HDP platform deployment method based on kubernets |
CN111338854A (en) * | 2020-05-25 | 2020-06-26 | 南京云信达科技有限公司 | Kubernetes cluster-based method and system for quickly recovering data |
CN111930525B (en) * | 2020-10-10 | 2021-02-02 | 北京世纪好未来教育科技有限公司 | GPU resource use method, electronic device and computer readable medium |
CN111930525A (en) * | 2020-10-10 | 2020-11-13 | 北京世纪好未来教育科技有限公司 | GPU resource use method, electronic device and computer readable medium |
WO2022134001A1 (en) * | 2020-12-25 | 2022-06-30 | 深圳晶泰科技有限公司 | Machine learning model framework development method and system based on containerization technology |
CN112817581A (en) * | 2021-02-20 | 2021-05-18 | 中国电子科技集团公司第二十八研究所 | Lightweight intelligent service construction and operation support method |
CN113641343B (en) * | 2021-10-15 | 2022-02-11 | 中汽数据(天津)有限公司 | High-concurrency python algorithm calling method and medium based on environment isolation |
CN113641343A (en) * | 2021-10-15 | 2021-11-12 | 中汽数据(天津)有限公司 | High-concurrency python algorithm calling method and medium based on environment isolation |
DE202022104275U1 (en) | 2022-07-28 | 2022-08-25 | Ahmed Alemran | System for intelligent resource management for distributed machine learning tasks |
CN115357256A (en) * | 2022-10-18 | 2022-11-18 | 安徽华云安科技有限公司 | CDH cluster deployment method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109284184A (en) | A kind of building method of the distributed machines learning platform based on containerization technique | |
US11593149B2 (en) | Unified resource management for containers and virtual machines | |
CN107431696B (en) | Method and cloud management node for application automation deployment | |
WO2019179453A1 (en) | Virtual machine creation method and apparatus | |
Bui et al. | Work queue+ python: A framework for scalable scientific ensemble applications | |
CN110442396B (en) | Application program starting method and device, storage medium and electronic equipment | |
CN104506620A (en) | Extensible automatic computing service platform and construction method for same | |
CN103064742A (en) | Automatic deployment system and method of hadoop cluster | |
CN105786603B (en) | Distributed high-concurrency service processing system and method | |
US10860364B2 (en) | Containerized management services with high availability | |
CN104579792A (en) | Architecture and method for achieving centralized management of various types of virtual resources based on multiple adaptive modes | |
CN111045786B (en) | Container creation system and method based on mirror image layering technology in cloud environment | |
US10042673B1 (en) | Enhanced application request based scheduling on heterogeneous elements of information technology infrastructure | |
CN109740765A (en) | A kind of machine learning system building method based on Amazon server | |
Justino et al. | Outsourcing resource-intensive tasks from mobile apps to clouds: Android and aneka integration | |
US20120059938A1 (en) | Dimension-ordered application placement in a multiprocessor computer | |
CN105100180A (en) | Cluster node dynamic loading method, device and system | |
CN105144107A (en) | Method, processing modules and system for executing an executable code | |
CN104714843A (en) | Method and device supporting multiple processors through multi-kernel operating system living examples | |
CN110782040A (en) | Method, device, equipment and medium for training tasks of pitorch | |
CN113110920B (en) | Operation method, device, equipment and storage medium of block chain system | |
Wu et al. | An automatic artificial intelligence training platform based on kubernetes | |
CN115237547A (en) | Unified container cluster hosting system and method for non-intrusive HPC computing cluster | |
CN115033290A (en) | Instruction set-based micro-service splitting method and device and terminal equipment | |
CN110807018A (en) | Method, device, equipment and storage medium for migrating master-slave architecture of business data to cluster architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
Effective date of abandoning: 20220830 |