CN109284184A - Method for building a distributed machine learning platform based on containerization technology - Google Patents

Method for building a distributed machine learning platform based on containerization technology

Info

Publication number
CN109284184A
CN109284184A
Authority
CN
China
Prior art keywords
task
service
file
platform
kubernetes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810186485.6A
Other languages
Chinese (zh)
Inventor
Xu Xiaoxin
Lin Xiaola
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810186485.6A priority Critical patent/CN109284184A/en
Publication of CN109284184A publication Critical patent/CN109284184A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45575Starting, stopping, suspending or resuming virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5015Service provider selection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention provides a method for building a distributed machine learning platform based on containerization technology. A platform built with this method improves resource utilization and computational efficiency, makes it convenient to manage and submit tasks, and lets users focus on deep learning research instead of worrying about hardware and other issues.

Description

Method for building a distributed machine learning platform based on containerization technology
Technical field
The present invention relates to the field of machine learning, and more particularly, to a method for building a distributed machine learning platform based on containerization technology.
Background technique
In recent years, with advances in computing power and algorithms, machine learning, and deep learning in particular, has developed enormously. It has become one of the hottest current research fields and is being applied in more and more fields, to more and more problems. Many of these complex problems involve larger data sets, heavier computation, and longer running times. At that point, the resources of a single computer (CPU, GPU, memory, disk, etc.) and its performance easily become a bottleneck that cannot satisfy the requirements of machine learning tasks. Distributed computing is the core technology of today's big data processing: it divides a task into multiple parts that can be executed in parallel, distributes them across the nodes, lets each node execute in parallel, and finally aggregates the execution results. Machine learning tasks contain a large amount of computation that can be executed in parallel, and the most mainstream machine learning frameworks all support distributed operation. Integrating machine learning onto distributed computing platforms is therefore the trend.
Containerization technology, with Docker as its representative, has grown increasingly mature. It uses images to create virtualized running environments that contain all required dependencies; its light weight and manageability have made it widely popular. Deploying the components of a distributed platform with Docker and combining them into the final platform can therefore save a great deal of work. Container orchestration tools, with Kubernetes as their representative, can manage containers effectively, ensuring high availability, rolling upgrades, load balancing, and so on, which provides great support for the robustness of the platform. Building a distributed machine learning platform on these Docker-based technologies can support machine learning well while guaranteeing simplicity and usability, realizing a customized machine learning platform.
Summary of the invention
The present invention provides a method for building a distributed machine learning platform based on containerization technology; platforms built with this method achieve high resource utilization and computational efficiency.
To achieve the above technical effect, the technical solution of the present invention is as follows:
A method for building a distributed machine learning platform based on containerization technology, comprising the following steps:
S1: Prepare a Docker registry. The platform must push and pull images during deployment and use, so a Docker registry is required; it can be self-hosted or a public registry;
S2: Fill in the cluster description file. The cluster description file is mainly used for building the Kubernetes cluster; it must contain information about the nodes the Kubernetes cluster will run on, the configuration of the various components, and the services each node needs to run;
S3: Deploy the Kubernetes cluster. A Python script reads the cluster description file and generates the required shell scripts and Kubernetes description files; these files are sent to each host over SSH and executed, which completes the deployment of the Kubernetes cluster;
S4: Fill in the service description file. The service description file is mainly used for starting and running the various services of the machine learning platform; it must contain the configuration of each service and the nodes each service needs to run on;
S5: Package the machine learning platform services as Docker images. A Python script reads the description of each service from the service description file and generates a corresponding Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: Deploy the machine learning platform services. A Python script reads the service description file from step S4 and generates the corresponding Kubernetes description file for each service, then calls the Kubernetes command-line tool kubectl to create these services;
S7: Wait for the services to finish synchronizing. After the services in step S6 have been deployed, the components need a short period of time to synchronize, and some services may restart several times during this period. Once all services are running stably, the machine learning platform is fully built.
Further, the working process of the platform is:
Step 1: Upload code and data to HDFS. In this platform's design, program code and data are both obtained from HDFS, so before executing a task, the data and machine learning code the task requires must be transferred to HDFS using pai-fs;
Step 2: Fill in the task description file. The task description file describes the computing resources the task to be run needs, such as CPU, GPU, memory, and disk, the commands each node must execute, the program address, data address, and output address, and the number of retries when an error occurs;
Step 3: Submit the task through the WebPortal. On the WebPortal task submission page, click the submit-task button and select the task description file prepared in step 2; after the submission succeeds, a prompt confirms that the task has been submitted;
Step 4: The RESTServer prepares the run scripts and the framework description file. The RESTServer parses the task description uploaded by the user, prepares the scripts and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
Step 5: FrameworkLauncher starts the task. After receiving a new task through its interface, FrameworkLauncher adds the task to the task queue to wait for execution; when the task's turn comes and its resource requirements can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
Step 6: Hadoop YARN starts containers to execute the task. As a global resource manager, Hadoop YARN can schedule resources across the whole cluster, even across machines, to execute machine learning tasks; YARN starts new containers to execute the task as required and allocates the requested resources to these containers;
Step 7: Wait for execution to finish and obtain the task's result. After the task finishes executing, its result can be obtained through the WebPortal, and output data can also be downloaded directly from HDFS while the task is running.
Compared with prior art, the beneficial effect of technical solution of the present invention is:
The purpose of the invention is to provide a general and concise way to build a deep learning platform. A platform built with this method improves resource utilization and computational efficiency and makes it convenient to manage and submit tasks, letting users focus on deep learning research instead of worrying about hardware and other issues.
Detailed description of the invention
Fig. 1 is the architecture diagram of the entire platform;
Fig. 2 is a deployment diagram of Kubernetes, showing its high-availability and load-balancing functions;
Fig. 3 shows the complete flow of a machine learning task, from submission to the return of its result.
Specific embodiment
The attached drawings are for illustrative purposes only and shall not be construed as limiting this patent;
To better illustrate this embodiment, certain components in the drawings are omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;
For those skilled in the art, the omission of certain known structures and their descriptions in the drawings is understandable.
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
A method for building a distributed machine learning platform based on containerization technology, comprising the following steps:
S1: Prepare a Docker registry. The platform must push and pull images during deployment and use, so a Docker registry is required; it can be self-hosted or a public registry;
S2: Fill in the cluster description file. The cluster description file is mainly used for building the Kubernetes cluster; it must contain information about the nodes the Kubernetes cluster will run on, the configuration of the various components, and the services each node needs to run;
S3: Deploy the Kubernetes cluster. A Python script reads the cluster description file and generates the required shell scripts and Kubernetes description files; these files are sent to each host over SSH and executed, which completes the deployment of the Kubernetes cluster;
S4: Fill in the service description file. The service description file is mainly used for starting and running the various services of the machine learning platform; it must contain the configuration of each service and the nodes each service needs to run on;
S5: Package the machine learning platform services as Docker images. A Python script reads the description of each service from the service description file and generates a corresponding Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: Deploy the machine learning platform services. A Python script reads the service description file from step S4 and generates the corresponding Kubernetes description file for each service, then calls the Kubernetes command-line tool kubectl to create these services;
S7: Wait for the services to finish synchronizing. After the services in step S6 have been deployed, the components need a short period of time to synchronize, and some services may restart several times during this period. Once all services are running stably, the machine learning platform is fully built.
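As a concrete illustration of the cluster description file in step S2, a minimal sketch might look like the following. The schema, field names, and addresses here are assumptions for illustration only; the patent does not specify the exact file format.

```yaml
# Hypothetical cluster description file (schema assumed for illustration)
cluster:
  docker-registry: registry.example.com/mlplatform   # registry from step S1
machine-list:
  - hostip: 192.168.1.10
    username: root
    password: secret
    k8s-role: master        # runs the Kubernetes control plane
    services: [hadoop-namenode, rest-server, webportal]
  - hostip: 192.168.1.11
    username: root
    password: secret
    k8s-role: worker        # GPU node that executes tasks
    services: [hadoop-datanode, hadoop-nodemanager]
```

The Python deployment script of step S3 would read a file of this shape and generate the per-host shell scripts and Kubernetes description files from it.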
Further, the working process of the platform is:
Step 1: Upload code and data to HDFS. In this platform's design, program code and data are both obtained from HDFS, so before executing a task, the data and machine learning code the task requires must be transferred to HDFS using pai-fs;
Step 2: Fill in the task description file. The task description file describes the computing resources the task to be run needs, such as CPU, GPU, memory, and disk, the commands each node must execute, the program address, data address, and output address, and the number of retries when an error occurs;
Step 3: Submit the task through the WebPortal. On the WebPortal task submission page, click the submit-task button and select the task description file prepared in step 2; after the submission succeeds, a prompt confirms that the task has been submitted;
Step 4: The RESTServer prepares the run scripts and the framework description file. The RESTServer parses the task description uploaded by the user, prepares the scripts and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
Step 5: FrameworkLauncher starts the task. After receiving a new task through its interface, FrameworkLauncher adds the task to the task queue to wait for execution; when the task's turn comes and its resource requirements can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
Step 6: Hadoop YARN starts containers to execute the task. As a global resource manager, Hadoop YARN can schedule resources across the whole cluster, even across machines, to execute machine learning tasks; YARN starts new containers to execute the task as required and allocates the requested resources to these containers;
Step 7: Wait for execution to finish and obtain the task's result. After the task finishes executing, its result can be obtained through the WebPortal, and output data can also be downloaded directly from HDFS while the task is running.
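To make the task description file of step 2 concrete, a minimal sketch could look as follows. Every field name, path, and image here is a hypothetical example; the patent does not define the actual schema.

```json
{
  "jobName": "mnist-train",
  "image": "registry.example.com/tensorflow-gpu:latest",
  "codeDir": "hdfs://namenode:9000/user/alice/mnist/code",
  "dataDir": "hdfs://namenode:9000/user/alice/mnist/data",
  "outputDir": "hdfs://namenode:9000/user/alice/mnist/output",
  "retryCount": 3,
  "taskRoles": [
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 4,
      "gpuNumber": 1,
      "memoryMB": 8192,
      "command": "python train.py"
    }
  ]
}
```

A file of this shape covers everything step 2 lists: the compute resources per node, the command each node executes, the program, data, and output addresses, and the retry count on error.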
The bottom layer of the platform uses physical machines or virtual machines running Ubuntu 16.04 LTS as the hosts for the various components. All hosts must be in the same subnet and able to connect to one another over SSH, the user must have root privileges, and worker nodes must be equipped with GPUs.
Docker is used as the containerization engine and is installed on every host; its light weight, manageability, and strong portability significantly simplify installation, deployment, and operations management.
On top of Docker, Kubernetes is used to manage the Docker containers in a unified way, and all platform components are installed through Kubernetes. Again for ease of management, all Kubernetes components other than Docker and kubectl are deployed in containers. Among them, kubelet is the most basic component and must be started through Docker; the other components are all deployed and installed by kubelet as static pods.
Hadoop is one of the most important service components in the entire platform and is the basis for parallel computation and improved efficiency. Several of its modules play important roles in the platform's design. HDFS stores the data and code used by machine learning tasks; it is assumed here that all machine learning tasks read their input from HDFS. Since most current machine learning frameworks support HDFS well, this requirement should not cause much trouble; one only needs to be familiar with this programming style. MapReduce is the main way parallel computation is carried out, and most of the scripts run on the platform are ultimately converted into MapReduce form. YARN is Hadoop's resource manager; by establishing one global resource manager and one application controller for each application, it can manage and schedule the resources of all nodes in a unified way, improving resource utilization. ZooKeeper is the module for automatic node management: when a node fails or a service terminates unexpectedly, it resumes the service on a new node, guaranteeing the stability and availability of the entire cluster.
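The MapReduce pattern mentioned above can be illustrated with a minimal single-process Python sketch. This is purely illustrative of the programming model; a real job would run distributed under Hadoop, with the shuffle and reduce phases executed across nodes.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs record by record; each record can be
    # processed independently, which is what makes the phase parallelizable.
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: group values by key and aggregate. Hadoop performs this
    # shuffle/reduce across machines, but the logic is the same.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data big compute", "big model"]))
# counts maps each word to its total occurrence count
```

The same split of independent mapping followed by keyed aggregation is what lets the platform convert scripts into MapReduce form and run them in parallel.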
On top of Hadoop, a FrameworkLauncher layer is encapsulated. FrameworkLauncher extends Hadoop, packaging some Hadoop-based functionality so that it can conveniently be connected to other components; this reduces coupling between components and simplifies the design.
The RESTServer provides HTTP connection services to the outside, opening some ports for use by third-party services and the web entry point, which facilitates extending and commercializing the service.
The WebPortal is the web entry point of the platform. Through it, the platform's cluster and services can be managed visually, tasks can be submitted and managed, and task results and statistics can be obtained.
At the same time, to achieve convenience and high availability, several mechanisms and strategies are adopted:
Automatically generated scripts: the design provides the user with a single configuration file, which must contain node configuration information, such as each node's IP address, user name, password, role, and the services to run on it, together with the information each service component requires, such as the platform address to connect to and the ports the service opens. During deployment, the configuration is first read from the configuration file; a Python script then fills in prepared templates, mainly the shell scripts used to deploy the platform, the Dockerfiles used to build images, and the YAML configuration files required for deploying services. The generated scripts and configuration files are distributed to all nodes over SSH, and the startup scripts are executed to begin the deployment. The process is roughly divided into three parts: deploying Kubernetes, building images, and deploying services. If no error terminates the deployment, the three parts execute in sequence, and the deployment is fully automated.
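The template-filling step described above can be sketched with Python's standard `string.Template`. The template text and configuration keys below are assumptions for illustration, not the patent's actual files.

```python
from string import Template

# Hypothetical per-node deployment-script template; the ${...} placeholders
# are filled from the node configuration read out of the single config file.
START_SCRIPT = Template(
    "#!/bin/bash\n"
    "# auto-generated for ${hostip}\n"
    "docker pull ${registry}/${service}:latest\n"
    "docker run -d --name ${service} ${registry}/${service}:latest\n"
)

def render_start_script(node_cfg):
    # substitute() raises KeyError if the config is missing a placeholder,
    # surfacing configuration mistakes before anything is shipped over SSH.
    return START_SCRIPT.substitute(node_cfg)

script = render_start_script({
    "hostip": "192.168.1.10",
    "registry": "registry.example.com",
    "service": "rest-server",
})
```

A real implementation would render one such script per node and per service, then distribute and execute them over SSH as the text describes.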
Image-based deployment: as mentioned above, images are built automatically by scripts, and the reason for using images is that they greatly simplify deployment. After the service configuration is read from the configuration file, this information is used to complete the Dockerfiles, the images are built from them, and the images are pushed to a remote registry (public or private are both fine). During deployment, one only needs to specify the image to pull on a given node to create and run the container for the corresponding service. This is very efficient and fast, and the advantage becomes more pronounced as the scale grows.
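In the spirit of step S5, generating a Dockerfile from a service description can be sketched as below. The base image, port, and field names are all illustrative assumptions; a real implementation would read these fields from the service description file.

```python
def make_dockerfile(service):
    # Build Dockerfile text from a (hypothetical) service description dict.
    lines = [f"FROM {service['base-image']}"]
    for port in service.get("ports", []):
        lines.append(f"EXPOSE {port}")          # ports the service opens
    lines.append(f'CMD ["{service["entrypoint"]}"]')  # service start command
    return "\n".join(lines) + "\n"

dockerfile = make_dockerfile({
    "base-image": "ubuntu:16.04",
    "ports": [8080],
    "entrypoint": "/usr/local/bin/start-rest-server.sh",
})
```

The generated text would then be fed to `docker build` and the resulting image pushed to the remote registry, as the paragraph above describes.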
High availability of Kubernetes: as the underlying container orchestration tool and service deployment platform, Kubernetes plays a very large role. Once it crashes, none of the components can operate normally, so its high availability must be ensured and the recovery time after any downtime kept as short as possible. The nodes of a Kubernetes cluster fall into two kinds. Master nodes are responsible for managing and controlling the entire platform as well as data storage and external interaction, and are the core of the cluster; worker nodes are mainly responsible for executing tasks. For a worker node, a crash or service termination causes all containers running on that node to be lost or terminated, but the master automatically starts new containers on other nodes to continue executing tasks, so the risk is small. For a master node, however, the impact and risk of a crash are both very large. Therefore, multiple master nodes are designed here, and a virtual IP address (VIP) is used to implement an active/standby high-availability scheme that makes use of all master nodes; combined with load balancing, this also spreads the traffic load.
High availability of services: the various service components run by the platform also need to guarantee high availability, which is relatively easy to implement here using the DaemonSet and Service features of Kubernetes. A DaemonSet provides daemon-like behavior for a service on an individual node: when the container running the service crashes or is killed, a new container is started automatically so the service can continue. A Service provides load balancing for a set of containers with identical functionality; it periodically sends heartbeats to check the availability of each container, and if a container fails, subsequent requests are forwarded to the other containers. If a Service is combined with a Deployment, then when a container or node fails, containers are automatically started on other nodes to maintain the replica count, and the new containers are added to the Service.
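The service high-availability scheme above can be sketched as Kubernetes manifests. The names, image, and ports are illustrative assumptions: the Deployment maintains the replica count when a container or node fails, and the Service forwards traffic only to pods matching its selector.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rest-server
spec:
  replicas: 2                 # Kubernetes replaces failed replicas automatically
  selector:
    matchLabels:
      app: rest-server
  template:
    metadata:
      labels:
        app: rest-server
    spec:
      containers:
        - name: rest-server
          image: registry.example.com/rest-server:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: rest-server
spec:
  selector:
    app: rest-server          # requests are load-balanced across healthy pods
  ports:
    - port: 80
      targetPort: 8080
```

Manifests of this shape are what the platform's Python scripts would generate in step S6 and create via kubectl.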
The same or similar reference labels correspond to the same or similar components;
The positional relationships described in the drawings are for illustration only and shall not be construed as limiting this patent;
Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention, and are not a limitation on its embodiments. For those of ordinary skill in the art, changes or variations in other forms can also be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (2)

1. A method for building a distributed machine learning platform based on containerization technology, characterized by comprising the following steps:
S1: Prepare a Docker registry. The platform must push and pull images during deployment and use, so a Docker registry is required; it can be self-hosted or a public registry;
S2: Fill in the cluster description file. The cluster description file is mainly used for building the Kubernetes cluster; it must contain information about the nodes the Kubernetes cluster will run on, the configuration of the various components, and the services each node needs to run;
S3: Deploy the Kubernetes cluster. A Python script reads the cluster description file and generates the required shell scripts and Kubernetes description files; these files are sent to each host over SSH and executed, which completes the deployment of the Kubernetes cluster;
S4: Fill in the service description file. The service description file is mainly used for starting and running the various services of the machine learning platform; it must contain the configuration of each service and the nodes each service needs to run on;
S5: Package the machine learning platform services as Docker images. A Python script reads the description of each service from the service description file and generates a corresponding Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: Deploy the machine learning platform services. A Python script reads the service description file from step S4 and generates the corresponding Kubernetes description file for each service, then calls the Kubernetes command-line tool kubectl to create these services;
S7: Wait for the services to finish synchronizing. After the services in step S6 have been deployed, the components need a short period of time to synchronize, and some services may restart several times during this period. Once all services are running stably, the machine learning platform is fully built.
2. the building method of the distributed machines learning platform according to claim 1 based on containerization technique, feature It is, the learning process of the platform is:
Step 1: upload code and data are to HDFS, in the design of this platform, the code and data of program all be need from It is obtained on HDFS, therefore needs to want this required by task using pai-fs the generation of data and machine learning before execution task Code is transferred on HDFS;
Step 2: filling in task description file, and the task description file that operation required by task is wanted is appointed for describe to be run Business needs the computing resources such as CPU, GPU, memory and disk to be used, the order that each node needs to be implemented, program address, data Address, output address etc., and the number etc. for needing to retry when mistake occurs;
Step 3: submitting task by WebPortal, submits the page in the task of WebPortal, clicks and submit task button, Selection can prompt task successfully to submit in the ready task description file of step 2 after submitting successfully;
Step 4: The RESTServer prepares the run script and framework description file. The RESTServer parses the task description file uploaded by the user, prepares the script and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
Step 5: Start the task through FrameworkLauncher. After receiving the new task through its interface, FrameworkLauncher adds the task to a task queue to await execution; when the task's turn comes and the resources it requires can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
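The queue-then-dispatch behaviour of Step 5 can be modelled with a toy scheduler. The class, its fields, and the GPU-only resource model are simplifying assumptions; the `started` list stands in for the notification sent to Hadoop YARN:

```python
from collections import deque

class FrameworkLauncher:
    """Toy model of Step 5: tasks wait in a FIFO queue and are dispatched
    only when the cluster can satisfy their resource request."""

    def __init__(self, free_gpus):
        self.free_gpus = free_gpus
        self.queue = deque()
        self.started = []  # stand-in for "notify Hadoop YARN to execute"

    def submit(self, task):
        """Add a new task to the queue, then try to dispatch."""
        self.queue.append(task)
        self.schedule()

    def schedule(self):
        # Dispatch queued tasks in order while resources suffice.
        while self.queue and self.queue[0]["gpus"] <= self.free_gpus:
            task = self.queue.popleft()
            self.free_gpus -= task["gpus"]
            self.started.append(task["name"])

    def finish(self, gpus):
        """Return a finished task's resources and re-run the scheduler."""
        self.free_gpus += gpus
        self.schedule()
```

In this model a large task at the head of the queue blocks later ones until resources free up, which matches the text's "wait until the task's turn comes and its resource requirements can be met".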
Step 6: Start containers with Hadoop YARN to execute the task. As a global resource manager, Hadoop YARN can schedule global resources to execute machine learning tasks, even across machines; YARN starts new containers to execute the task as required and allocates the requested resources to those containers;
Step 7: Wait for execution to finish and obtain the task's result. After the task has finished executing, its result can be obtained through the WebPortal; the output data can also be downloaded directly from HDFS while the task is running.
CN201810186485.6A 2018-03-07 2018-03-07 A kind of building method of the distributed machines learning platform based on containerization technique Pending CN109284184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810186485.6A CN109284184A (en) 2018-03-07 2018-03-07 A kind of building method of the distributed machines learning platform based on containerization technique


Publications (1)

Publication Number Publication Date
CN109284184A true CN109284184A (en) 2019-01-29

Family

ID=65186146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810186485.6A Pending CN109284184A (en) 2018-03-07 2018-03-07 A kind of building method of the distributed machines learning platform based on containerization technique

Country Status (1)

Country Link
CN (1) CN109284184A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790483A (en) * 2016-12-13 2017-05-31 武汉邮电科学研究院 Hadoop group systems and fast construction method based on container technique
CN106888254A (en) * 2017-01-20 2017-06-23 华南理工大学 A kind of exchange method between container cloud framework based on Kubernetes and its each module
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN107659609A (en) * 2017-07-26 2018-02-02 北京天云融创软件技术有限公司 A kind of deep learning support platform and deep learning training method based on cloud computing


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIN,XUGANG ET AL: "Research and Implementation PaaS platform based on Docker", 《4TH NATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS AND COMPUTER ENGINEERING (NCEECE)》 *
张羿等: "基于Docker 的电网轻量级PaaS平台构建方案", 《计算机工程应用技术》 *
邹暾等: "基于容器云的烟草商业企业PaaS平台架构设计", 《中国烟草学会2017年学术年会》 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740765A (en) * 2019-01-31 2019-05-10 成都品果科技有限公司 A kind of machine learning system building method based on Amazon server
CN109740765B (en) * 2019-01-31 2023-05-02 成都品果科技有限公司 Machine learning system building method based on Amazon network server
CN110297670A (en) * 2019-05-17 2019-10-01 北京瀚海星云科技有限公司 A kind of method and system improving distributed task scheduling training effectiveness on container cloud
CN110297670B (en) * 2019-05-17 2023-06-27 深圳致星科技有限公司 Method and system for improving training efficiency of distributed tasks on container cloud
WO2020259081A1 (en) * 2019-06-25 2020-12-30 深圳前海微众银行股份有限公司 Task scheduling method, apparatus, and device, and computer-readable storage medium
CN110378463B (en) * 2019-07-15 2021-05-14 北京智能工场科技有限公司 Artificial intelligence model standardization training platform and automatic system
CN110378463A (en) * 2019-07-15 2019-10-25 北京智能工场科技有限公司 A kind of artificial intelligence model standardized training platform and automated system
CN110471767A (en) * 2019-08-09 2019-11-19 上海寒武纪信息科技有限公司 A kind of dispatching method of equipment
CN110471767B (en) * 2019-08-09 2021-09-03 上海寒武纪信息科技有限公司 Equipment scheduling method
CN112395039B (en) * 2019-08-16 2024-01-19 北京神州泰岳软件股份有限公司 Method and device for managing Kubernetes cluster
CN112395039A (en) * 2019-08-16 2021-02-23 北京神州泰岳软件股份有限公司 Management method and device for Kubernetes cluster
CN111026414B (en) * 2019-12-12 2023-09-08 杭州安恒信息技术股份有限公司 HDP platform deployment method based on kubernetes
CN111026414A (en) * 2019-12-12 2020-04-17 杭州安恒信息技术股份有限公司 HDP platform deployment method based on kubernetes
CN111338854A (en) * 2020-05-25 2020-06-26 南京云信达科技有限公司 Kubernetes cluster-based method and system for quickly recovering data
CN111930525B (en) * 2020-10-10 2021-02-02 北京世纪好未来教育科技有限公司 GPU resource use method, electronic device and computer readable medium
CN111930525A (en) * 2020-10-10 2020-11-13 北京世纪好未来教育科技有限公司 GPU resource use method, electronic device and computer readable medium
WO2022134001A1 (en) * 2020-12-25 2022-06-30 深圳晶泰科技有限公司 Machine learning model framework development method and system based on containerization technology
CN112817581A (en) * 2021-02-20 2021-05-18 中国电子科技集团公司第二十八研究所 Lightweight intelligent service construction and operation support method
CN113641343B (en) * 2021-10-15 2022-02-11 中汽数据(天津)有限公司 High-concurrency python algorithm calling method and medium based on environment isolation
CN113641343A (en) * 2021-10-15 2021-11-12 中汽数据(天津)有限公司 High-concurrency python algorithm calling method and medium based on environment isolation
DE202022104275U1 (en) 2022-07-28 2022-08-25 Ahmed Alemran System for intelligent resource management for distributed machine learning tasks
CN115357256A (en) * 2022-10-18 2022-11-18 安徽华云安科技有限公司 CDH cluster deployment method and system

Similar Documents

Publication Publication Date Title
CN109284184A (en) A kind of building method of the distributed machines learning platform based on containerization technique
US11593149B2 (en) Unified resource management for containers and virtual machines
CN107431696B (en) Method and cloud management node for application automation deployment
WO2019179453A1 (en) Virtual machine creation method and apparatus
Bui et al. Work queue+ python: A framework for scalable scientific ensemble applications
CN110442396B (en) Application program starting method and device, storage medium and electronic equipment
CN104506620A (en) Extensible automatic computing service platform and construction method for same
CN103064742A (en) Automatic deployment system and method of hadoop cluster
CN105786603B (en) Distributed high-concurrency service processing system and method
US10860364B2 (en) Containerized management services with high availability
CN104579792A (en) Architecture and method for achieving centralized management of various types of virtual resources based on multiple adaptive modes
CN111045786B (en) Container creation system and method based on mirror image layering technology in cloud environment
US10042673B1 (en) Enhanced application request based scheduling on heterogeneous elements of information technology infrastructure
CN109740765A (en) A kind of machine learning system building method based on Amazon server
Justino et al. Outsourcing resource-intensive tasks from mobile apps to clouds: Android and aneka integration
US20120059938A1 (en) Dimension-ordered application placement in a multiprocessor computer
CN105100180A (en) Cluster node dynamic loading method, device and system
CN105144107A (en) Method, processing modules and system for executing an executable code
CN104714843A (en) Method and device supporting multiple processors through multi-kernel operating system living examples
CN110782040A (en) Method, device, equipment and medium for training tasks of pitorch
CN113110920B (en) Operation method, device, equipment and storage medium of block chain system
Wu et al. An automatic artificial intelligence training platform based on kubernetes
CN115237547A (en) Unified container cluster hosting system and method for non-intrusive HPC computing cluster
CN115033290A (en) Instruction set-based micro-service splitting method and device and terminal equipment
CN110807018A (en) Method, device, equipment and storage medium for migrating master-slave architecture of business data to cluster architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20220830