CN109284184A - Method for building a distributed machine learning platform based on containerization technology - Google Patents

Method for building a distributed machine learning platform based on containerization technology

Info

Publication number
CN109284184A
CN109284184A
Authority
CN
China
Prior art keywords
task
service
file
platform
kubernetes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810186485.6A
Other languages
Chinese (zh)
Inventor
Xu Xiaoxin
Lin Xiaola
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810186485.6A priority Critical patent/CN109284184A/en
Publication of CN109284184A publication Critical patent/CN109284184A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45575Starting, stopping, suspending or resuming virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5015Service provider selection

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention provides a method for building a distributed machine learning platform based on containerization technology. A platform built with this method improves resource utilization and computational efficiency, makes it convenient to manage and submit tasks, and lets users focus on deep learning research instead of worrying about hardware and other issues.

Description

Method for building a distributed machine learning platform based on containerization technology
Technical field
The present invention relates to the field of machine learning, and more particularly, to a method for building a distributed machine learning platform based on containerization technology.
Background technique
In recent years, with advances in computing power and algorithms, machine learning, and deep learning in particular, has developed enormously. It has become one of the hottest current research fields and is being applied in more and more fields, to more and more problems. Many of these complex problems involve larger data sets, heavier computation, and longer running times. At that point, the resources of a single computer (CPU, GPU, memory, disk, etc.) and its performance easily become a bottleneck that cannot satisfy the requirements of machine learning tasks. Distributed computing is the core technology of today's big data processing: it divides a task into multiple parts that can be executed in parallel, distributes them across the nodes, lets each node execute in parallel, and finally aggregates the execution results. Machine learning tasks contain a large amount of computation that can be executed in parallel, and the most mainstream machine learning frameworks all support distributed operation. Integrating machine learning onto distributed computing platforms is therefore the trend.
Containerization technology, with Docker as its representative, has grown increasingly mature. It uses images to create virtualized running environments that contain all required dependencies; its light weight and manageability have made it widely popular. Deploying the components of a distributed platform with Docker and combining them into the final platform can therefore save a great deal of work. Container orchestration tools, with Kubernetes as their representative, can manage containers effectively, ensuring high availability, rolling upgrades, load balancing, and so on, which provides great support for the robustness of the platform. Building a distributed machine learning platform on these Docker-based technologies can support machine learning well while guaranteeing simplicity and usability, realizing a customized machine learning platform.
Summary of the invention
The present invention provides a method for building a distributed machine learning platform based on containerization technology; platforms built with this method achieve high resource utilization and computational efficiency.
To achieve the above technical effect, the technical solution of the present invention is as follows:
A method for building a distributed machine learning platform based on containerization technology, comprising the following steps:
S1: Prepare a Docker registry. The platform must push and pull images during deployment and use, so a Docker registry is required; it can be self-hosted or a public registry;
S2: Fill in the cluster description file. The cluster description file is mainly used for building the Kubernetes cluster; it must contain information about the nodes the Kubernetes cluster will run on, the configuration of the various components, and the services each node needs to run;
S3: Deploy the Kubernetes cluster. A Python script reads the cluster description file and generates the required shell scripts and Kubernetes description files; these files are sent to each host over SSH and executed, which completes the deployment of the Kubernetes cluster;
S4: Fill in the service description file. The service description file is mainly used for starting and running the various services of the machine learning platform; it must contain the configuration of each service and the nodes each service needs to run on;
S5: Package the machine learning platform services as Docker images. A Python script reads the description of each service from the service description file and generates a corresponding Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: Deploy the machine learning platform services. A Python script reads the service description file from step S4 and generates the corresponding Kubernetes description file for each service, then calls the Kubernetes command-line tool kubectl to create these services;
S7: Wait for the services to finish synchronizing. After the services in step S6 have been deployed, the components need a short period of time to synchronize, and some services may restart several times during this period. Once all services are running stably, the machine learning platform is fully built.
Further, the working process of the platform is:
Step 1: Upload code and data to HDFS. In this platform's design, program code and data are both obtained from HDFS, so before executing a task, the data and machine learning code the task requires must be transferred to HDFS using pai-fs;
Step 2: Fill in the task description file. The task description file describes the computing resources the task to be run needs, such as CPU, GPU, memory, and disk, the commands each node must execute, the program address, data address, and output address, and the number of retries when an error occurs;
Step 3: Submit the task through the WebPortal. On the WebPortal task submission page, click the submit-task button and select the task description file prepared in step 2; after the submission succeeds, a prompt confirms that the task has been submitted;
Step 4: The RESTServer prepares the run scripts and the framework description file. The RESTServer parses the task description uploaded by the user, prepares the scripts and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
Step 5: FrameworkLauncher starts the task. After receiving a new task through its interface, FrameworkLauncher adds the task to the task queue to wait for execution; when the task's turn comes and its resource requirements can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
Step 6: Hadoop YARN starts containers to execute the task. As a global resource manager, Hadoop YARN can schedule resources across the whole cluster, even across machines, to execute machine learning tasks; YARN starts new containers to execute the task as required and allocates the requested resources to these containers;
Step 7: Wait for execution to finish and obtain the task's result. After the task finishes executing, its result can be obtained through the WebPortal, and output data can also be downloaded directly from HDFS while the task is running.
Compared with prior art, the beneficial effect of technical solution of the present invention is:
The purpose of the invention is to provide a general and concise way to build a deep learning platform. A platform built with this method improves resource utilization and computational efficiency and makes it convenient to manage and submit tasks, letting users focus on deep learning research instead of worrying about hardware and other issues.
Detailed description of the invention
Fig. 1 is the architecture diagram of the entire platform;
Fig. 2 is a deployment diagram of Kubernetes, showing its high-availability and load-balancing functions;
Fig. 3 shows the complete flow of a machine learning task, from submission to the return of its result.
Specific embodiment
The attached drawings are for illustrative purposes only and shall not be construed as limiting this patent;
To better illustrate this embodiment, certain components in the drawings are omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;
For those skilled in the art, the omission of certain known structures and their descriptions in the drawings is understandable.
The technical solution of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
A method for building a distributed machine learning platform based on containerization technology, comprising the following steps:
S1: Prepare a Docker registry. The platform must push and pull images during deployment and use, so a Docker registry is required; it can be self-hosted or a public registry;
S2: Fill in the cluster description file. The cluster description file is mainly used for building the Kubernetes cluster; it must contain information about the nodes the Kubernetes cluster will run on, the configuration of the various components, and the services each node needs to run;
S3: Deploy the Kubernetes cluster. A Python script reads the cluster description file and generates the required shell scripts and Kubernetes description files; these files are sent to each host over SSH and executed, which completes the deployment of the Kubernetes cluster;
S4: Fill in the service description file. The service description file is mainly used for starting and running the various services of the machine learning platform; it must contain the configuration of each service and the nodes each service needs to run on;
S5: Package the machine learning platform services as Docker images. A Python script reads the description of each service from the service description file and generates a corresponding Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: Deploy the machine learning platform services. A Python script reads the service description file from step S4 and generates the corresponding Kubernetes description file for each service, then calls the Kubernetes command-line tool kubectl to create these services;
S7: Wait for the services to finish synchronizing. After the services in step S6 have been deployed, the components need a short period of time to synchronize, and some services may restart several times during this period. Once all services are running stably, the machine learning platform is fully built.
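As a concrete illustration of the cluster description file in step S2, a minimal sketch might look like the following. The schema, field names, and addresses here are assumptions for illustration only; the patent does not specify the exact file format.

```yaml
# Hypothetical cluster description file (schema assumed for illustration)
cluster:
  docker-registry: registry.example.com/mlplatform   # registry from step S1
machine-list:
  - hostip: 192.168.1.10
    username: root
    password: secret
    k8s-role: master        # runs the Kubernetes control plane
    services: [hadoop-namenode, rest-server, webportal]
  - hostip: 192.168.1.11
    username: root
    password: secret
    k8s-role: worker        # GPU node that executes tasks
    services: [hadoop-datanode, hadoop-nodemanager]
```

The Python deployment script of step S3 would read a file of this shape and generate the per-host shell scripts and Kubernetes description files from it.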
Further, the working process of the platform is:
Step 1: Upload code and data to HDFS. In this platform's design, program code and data are both obtained from HDFS, so before executing a task, the data and machine learning code the task requires must be transferred to HDFS using pai-fs;
Step 2: Fill in the task description file. The task description file describes the computing resources the task to be run needs, such as CPU, GPU, memory, and disk, the commands each node must execute, the program address, data address, and output address, and the number of retries when an error occurs;
Step 3: Submit the task through the WebPortal. On the WebPortal task submission page, click the submit-task button and select the task description file prepared in step 2; after the submission succeeds, a prompt confirms that the task has been submitted;
Step 4: The RESTServer prepares the run scripts and the framework description file. The RESTServer parses the task description uploaded by the user, prepares the scripts and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
Step 5: FrameworkLauncher starts the task. After receiving a new task through its interface, FrameworkLauncher adds the task to the task queue to wait for execution; when the task's turn comes and its resource requirements can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
Step 6: Hadoop YARN starts containers to execute the task. As a global resource manager, Hadoop YARN can schedule resources across the whole cluster, even across machines, to execute machine learning tasks; YARN starts new containers to execute the task as required and allocates the requested resources to these containers;
Step 7: Wait for execution to finish and obtain the task's result. After the task finishes executing, its result can be obtained through the WebPortal, and output data can also be downloaded directly from HDFS while the task is running.
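To make the task description file of step 2 concrete, a minimal sketch could look as follows. Every field name, path, and image here is a hypothetical example; the patent does not define the actual schema.

```json
{
  "jobName": "mnist-train",
  "image": "registry.example.com/tensorflow-gpu:latest",
  "codeDir": "hdfs://namenode:9000/user/alice/mnist/code",
  "dataDir": "hdfs://namenode:9000/user/alice/mnist/data",
  "outputDir": "hdfs://namenode:9000/user/alice/mnist/output",
  "retryCount": 3,
  "taskRoles": [
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 4,
      "gpuNumber": 1,
      "memoryMB": 8192,
      "command": "python train.py"
    }
  ]
}
```

A file of this shape covers everything step 2 lists: the compute resources per node, the command each node executes, the program, data, and output addresses, and the retry count on error.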
The bottom layer of the platform uses physical machines or virtual machines running Ubuntu 16.04 LTS as the hosts for the various components. All hosts must be in the same subnet and able to connect to one another over SSH, the user must have root privileges, and worker nodes must be equipped with GPUs.
Docker is used as the containerization engine and is installed on every host; its light weight, manageability, and strong portability significantly simplify installation, deployment, and operations management.
On top of Docker, Kubernetes is used to manage the Docker containers in a unified way, and all platform components are installed through Kubernetes. Again for ease of management, all Kubernetes components other than Docker and kubectl are deployed in containers. Among them, kubelet is the most basic component and must be started through Docker; the other components are all deployed and installed by kubelet as static pods.
Hadoop is one of the most important service components in the entire platform and is the basis for parallel computation and improved efficiency. Several of its modules play important roles in the platform's design. HDFS stores the data and code used by machine learning tasks; it is assumed here that all machine learning tasks read their input from HDFS. Since most current machine learning frameworks support HDFS well, this requirement should not cause much trouble; one only needs to be familiar with this programming style. MapReduce is the main way parallel computation is carried out, and most of the scripts run on the platform are ultimately converted into MapReduce form. YARN is Hadoop's resource manager; by establishing one global resource manager and one application controller for each application, it can manage and schedule the resources of all nodes in a unified way, improving resource utilization. ZooKeeper is the module for automatic node management: when a node fails or a service terminates unexpectedly, it resumes the service on a new node, guaranteeing the stability and availability of the entire cluster.
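The MapReduce pattern mentioned above can be illustrated with a minimal single-process Python sketch. This is purely illustrative of the programming model; a real job would run distributed under Hadoop, with the shuffle and reduce phases executed across nodes.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs record by record; each record can be
    # processed independently, which is what makes the phase parallelizable.
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: group values by key and aggregate. Hadoop performs this
    # shuffle/reduce across machines, but the logic is the same.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data big compute", "big model"]))
# counts maps each word to its total occurrence count
```

The same split of independent mapping followed by keyed aggregation is what lets the platform convert scripts into MapReduce form and run them in parallel.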
On top of Hadoop, a FrameworkLauncher layer is encapsulated. FrameworkLauncher extends Hadoop, packaging some Hadoop-based functionality so that it can conveniently be connected to other components; this reduces coupling between components and simplifies the design.
The RESTServer provides HTTP connection services to the outside, opening some ports for use by third-party services and the web entry point, which facilitates extending and commercializing the service.
The WebPortal is the web entry point of the platform. Through it, the platform's cluster and services can be managed visually, tasks can be submitted and managed, and task results and statistics can be obtained.
At the same time, to achieve convenience and high availability, several mechanisms and strategies are adopted:
Automatically generated scripts: the design provides the user with a single configuration file, which must contain node configuration information, such as each node's IP address, user name, password, role, and the services to run on it, together with the information each service component requires, such as the platform address to connect to and the ports the service opens. During deployment, the configuration is first read from the configuration file; a Python script then fills in prepared templates, mainly the shell scripts used to deploy the platform, the Dockerfiles used to build images, and the YAML configuration files required for deploying services. The generated scripts and configuration files are distributed to all nodes over SSH, and the startup scripts are executed to begin the deployment. The process is roughly divided into three parts: deploying Kubernetes, building images, and deploying services. If no error terminates the deployment, the three parts execute in sequence, and the deployment is fully automated.
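The template-filling step described above can be sketched with Python's standard `string.Template`. The template text and configuration keys below are assumptions for illustration, not the patent's actual files.

```python
from string import Template

# Hypothetical per-node deployment-script template; the ${...} placeholders
# are filled from the node configuration read out of the single config file.
START_SCRIPT = Template(
    "#!/bin/bash\n"
    "# auto-generated for ${hostip}\n"
    "docker pull ${registry}/${service}:latest\n"
    "docker run -d --name ${service} ${registry}/${service}:latest\n"
)

def render_start_script(node_cfg):
    # substitute() raises KeyError if the config is missing a placeholder,
    # surfacing configuration mistakes before anything is shipped over SSH.
    return START_SCRIPT.substitute(node_cfg)

script = render_start_script({
    "hostip": "192.168.1.10",
    "registry": "registry.example.com",
    "service": "rest-server",
})
```

A real implementation would render one such script per node and per service, then distribute and execute them over SSH as the text describes.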
Image-based deployment: as mentioned above, images are built automatically by scripts, and the reason for using images is that they greatly simplify deployment. After the service configuration is read from the configuration file, this information is used to complete the Dockerfiles, the images are built from them, and the images are pushed to a remote registry (public or private are both fine). During deployment, one only needs to specify the image to pull on a given node to create and run the container for the corresponding service. This is very efficient and fast, and the advantage becomes more pronounced as the scale grows.
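In the spirit of step S5, generating a Dockerfile from a service description can be sketched as below. The base image, port, and field names are all illustrative assumptions; a real implementation would read these fields from the service description file.

```python
def make_dockerfile(service):
    # Build Dockerfile text from a (hypothetical) service description dict.
    lines = [f"FROM {service['base-image']}"]
    for port in service.get("ports", []):
        lines.append(f"EXPOSE {port}")          # ports the service opens
    lines.append(f'CMD ["{service["entrypoint"]}"]')  # service start command
    return "\n".join(lines) + "\n"

dockerfile = make_dockerfile({
    "base-image": "ubuntu:16.04",
    "ports": [8080],
    "entrypoint": "/usr/local/bin/start-rest-server.sh",
})
```

The generated text would then be fed to `docker build` and the resulting image pushed to the remote registry, as the paragraph above describes.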
High availability of Kubernetes: as the underlying container orchestration tool and service deployment platform, Kubernetes plays a very large role. Once it crashes, none of the components can operate normally, so its high availability must be ensured and the recovery time after any downtime kept as short as possible. The nodes of a Kubernetes cluster fall into two kinds. Master nodes are responsible for managing and controlling the entire platform as well as data storage and external interaction, and are the core of the cluster; worker nodes are mainly responsible for executing tasks. For a worker node, a crash or service termination causes all containers running on that node to be lost or terminated, but the master automatically starts new containers on other nodes to continue executing tasks, so the risk is small. For a master node, however, the impact and risk of a crash are both very large. Therefore, multiple master nodes are designed here, and a virtual IP address (VIP) is used to implement an active/standby high-availability scheme that makes use of all master nodes; combined with load balancing, this also spreads the traffic load.
High availability of services: the various service components run by the platform also need to guarantee high availability, which is relatively easy to implement here using the DaemonSet and Service features of Kubernetes. A DaemonSet provides daemon-like behavior for a service on an individual node: when the container running the service crashes or is killed, a new container is started automatically so the service can continue. A Service provides load balancing for a set of containers with identical functionality; it periodically sends heartbeats to check the availability of each container, and if a container fails, subsequent requests are forwarded to the other containers. If a Service is combined with a Deployment, then when a container or node fails, containers are automatically started on other nodes to maintain the replica count, and the new containers are added to the Service.
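The service high-availability scheme above can be sketched as Kubernetes manifests. The names, image, and ports are illustrative assumptions: the Deployment maintains the replica count when a container or node fails, and the Service forwards traffic only to pods matching its selector.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rest-server
spec:
  replicas: 2                 # Kubernetes replaces failed replicas automatically
  selector:
    matchLabels:
      app: rest-server
  template:
    metadata:
      labels:
        app: rest-server
    spec:
      containers:
        - name: rest-server
          image: registry.example.com/rest-server:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: rest-server
spec:
  selector:
    app: rest-server          # requests are load-balanced across healthy pods
  ports:
    - port: 80
      targetPort: 8080
```

Manifests of this shape are what the platform's Python scripts would generate in step S6 and create via kubectl.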
The same or similar reference labels correspond to the same or similar components;
The positional relationships described in the drawings are for illustration only and shall not be construed as limiting this patent;
Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention, and are not a limitation on its embodiments. For those of ordinary skill in the art, changes or variations in other forms can also be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (2)

1. A method for building a distributed machine learning platform based on containerization technology, characterized by comprising the following steps:
S1: Prepare a Docker registry. The platform must push and pull images during deployment and use, so a Docker registry is required; it can be self-hosted or a public registry;
S2: Fill in the cluster description file. The cluster description file is mainly used for building the Kubernetes cluster; it must contain information about the nodes the Kubernetes cluster will run on, the configuration of the various components, and the services each node needs to run;
S3: Deploy the Kubernetes cluster. A Python script reads the cluster description file and generates the required shell scripts and Kubernetes description files; these files are sent to each host over SSH and executed, which completes the deployment of the Kubernetes cluster;
S4: Fill in the service description file. The service description file is mainly used for starting and running the various services of the machine learning platform; it must contain the configuration of each service and the nodes each service needs to run on;
S5: Package the machine learning platform services as Docker images. A Python script reads the description of each service from the service description file and generates a corresponding Dockerfile for each service; images are built from these Dockerfiles and pushed to the remote registry;
S6: Deploy the machine learning platform services. A Python script reads the service description file from step S4 and generates the corresponding Kubernetes description file for each service, then calls the Kubernetes command-line tool kubectl to create these services;
S7: Wait for the services to finish synchronizing. After the services in step S6 have been deployed, the components need a short period of time to synchronize, and some services may restart several times during this period. Once all services are running stably, the machine learning platform is fully built.
2. the building method of the distributed machines learning platform according to claim 1 based on containerization technique, feature It is, the learning process of the platform is:
Step 1: upload code and data are to HDFS, in the design of this platform, the code and data of program all be need from It is obtained on HDFS, therefore needs to want this required by task using pai-fs the generation of data and machine learning before execution task Code is transferred on HDFS;
Step 2: filling in task description file, and the task description file that operation required by task is wanted is appointed for describe to be run Business needs the computing resources such as CPU, GPU, memory and disk to be used, the order that each node needs to be implemented, program address, data Address, output address etc., and the number etc. for needing to retry when mistake occurs;
Step 3: submitting task by WebPortal, submits the page in the task of WebPortal, clicks and submit task button, Selection can prompt task successfully to submit in the ready task description file of step 2 after submitting successfully;
Step 4: The RESTServer prepares the run script and framework description file. The RESTServer parses the task description file uploaded by the user, prepares the script and framework description file that FrameworkLauncher requires, and notifies FrameworkLauncher through its interface to start the task;
Step 5: Start the task through FrameworkLauncher. After receiving the new task through its interface, FrameworkLauncher adds the task to a task queue to await execution; when the task's turn comes and the resources it requires can be satisfied, FrameworkLauncher notifies Hadoop YARN to execute the task;
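The queue-then-dispatch behaviour of Step 5 can be modelled with a toy scheduler. The class, its fields, and the GPU-only resource model are simplifying assumptions; the `started` list stands in for the notification sent to Hadoop YARN:

```python
from collections import deque

class FrameworkLauncher:
    """Toy model of Step 5: tasks wait in a FIFO queue and are dispatched
    only when the cluster can satisfy their resource request."""

    def __init__(self, free_gpus):
        self.free_gpus = free_gpus
        self.queue = deque()
        self.started = []  # stand-in for "notify Hadoop YARN to execute"

    def submit(self, task):
        """Add a new task to the queue, then try to dispatch."""
        self.queue.append(task)
        self.schedule()

    def schedule(self):
        # Dispatch queued tasks in order while resources suffice.
        while self.queue and self.queue[0]["gpus"] <= self.free_gpus:
            task = self.queue.popleft()
            self.free_gpus -= task["gpus"]
            self.started.append(task["name"])

    def finish(self, gpus):
        """Return a finished task's resources and re-run the scheduler."""
        self.free_gpus += gpus
        self.schedule()
```

In this model a large task at the head of the queue blocks later ones until resources free up, which matches the text's "wait until the task's turn comes and its resource requirements can be met".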
Step 6: Start containers with Hadoop YARN to execute the task. As a global resource manager, Hadoop YARN can schedule global resources to execute machine learning tasks, even across machines; YARN starts new containers to execute the task as required and allocates the requested resources to those containers;
Step 7: Wait for execution to finish and obtain the task's result. After the task has finished executing, its result can be obtained through the WebPortal; the output data can also be downloaded directly from HDFS while the task is running.
CN201810186485.6A 2018-03-07 2018-03-07 A kind of building method of the distributed machines learning platform based on containerization technique Pending CN109284184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810186485.6A CN109284184A (en) 2018-03-07 2018-03-07 A kind of building method of the distributed machines learning platform based on containerization technique


Publications (1)

Publication Number Publication Date
CN109284184A true CN109284184A (en) 2019-01-29

Family

ID=65186146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810186485.6A Pending CN109284184A (en) 2018-03-07 2018-03-07 A kind of building method of the distributed machines learning platform based on containerization technique

Country Status (1)

Country Link
CN (1) CN109284184A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790483A (en) * 2016-12-13 2017-05-31 武汉邮电科学研究院 Hadoop group systems and fast construction method based on container technique
CN106888254A (en) * 2017-01-20 2017-06-23 华南理工大学 A kind of exchange method between container cloud framework based on Kubernetes and its each module
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN107659609A (en) * 2017-07-26 2018-02-02 北京天云融创软件技术有限公司 A kind of deep learning support platform and deep learning training method based on cloud computing


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIN,XUGANG ET AL: "Research and Implementation PaaS platform based on Docker", 《4TH NATIONAL CONFERENCE ON ELECTRICAL, ELECTRONICS AND COMPUTER ENGINEERING (NCEECE)》 *
张羿等: "基于Docker 的电网轻量级PaaS平台构建方案", 《计算机工程应用技术》 *
邹暾等: "基于容器云的烟草商业企业PaaS平台架构设计", 《中国烟草学会2017年学术年会》 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740765A (en) * 2019-01-31 2019-05-10 成都品果科技有限公司 A kind of machine learning system building method based on Amazon server
CN109740765B (en) * 2019-01-31 2023-05-02 成都品果科技有限公司 Machine learning system building method based on Amazon network server
CN110297670A (en) * 2019-05-17 2019-10-01 北京瀚海星云科技有限公司 A kind of method and system improving distributed task scheduling training effectiveness on container cloud
CN110297670B (en) * 2019-05-17 2023-06-27 深圳致星科技有限公司 Method and system for improving training efficiency of distributed tasks on container cloud
WO2020259081A1 (en) * 2019-06-25 2020-12-30 深圳前海微众银行股份有限公司 Task scheduling method, apparatus, and device, and computer-readable storage medium
CN110378463B (en) * 2019-07-15 2021-05-14 北京智能工场科技有限公司 Artificial intelligence model standardization training platform and automatic system
CN110378463A (en) * 2019-07-15 2019-10-25 北京智能工场科技有限公司 A kind of artificial intelligence model standardized training platform and automated system
CN110471767A (en) * 2019-08-09 2019-11-19 上海寒武纪信息科技有限公司 A kind of dispatching method of equipment
CN110471767B (en) * 2019-08-09 2021-09-03 上海寒武纪信息科技有限公司 Equipment scheduling method
CN112395039B (en) * 2019-08-16 2024-01-19 北京神州泰岳软件股份有限公司 Method and device for managing Kubernetes cluster
CN112395039A (en) * 2019-08-16 2021-02-23 北京神州泰岳软件股份有限公司 Management method and device for Kubernetes cluster
CN111026414B (en) * 2019-12-12 2023-09-08 杭州安恒信息技术股份有限公司 HDP platform deployment method based on kubernetes
CN111026414A (en) * 2019-12-12 2020-04-17 杭州安恒信息技术股份有限公司 HDP platform deployment method based on kubernetes
CN111338854A (en) * 2020-05-25 2020-06-26 南京云信达科技有限公司 Kubernetes cluster-based method and system for quickly recovering data
CN111930525B (en) * 2020-10-10 2021-02-02 北京世纪好未来教育科技有限公司 GPU resource use method, electronic device and computer readable medium
CN111930525A (en) * 2020-10-10 2020-11-13 北京世纪好未来教育科技有限公司 GPU resource use method, electronic device and computer readable medium
WO2022134001A1 (en) * 2020-12-25 2022-06-30 深圳晶泰科技有限公司 Machine learning model framework development method and system based on containerization technology
CN112817581A (en) * 2021-02-20 2021-05-18 中国电子科技集团公司第二十八研究所 Lightweight intelligent service construction and operation support method
CN113641343B (en) * 2021-10-15 2022-02-11 中汽数据(天津)有限公司 High-concurrency python algorithm calling method and medium based on environment isolation
CN113641343A (en) * 2021-10-15 2021-11-12 中汽数据(天津)有限公司 High-concurrency python algorithm calling method and medium based on environment isolation
DE202022104275U1 (en) 2022-07-28 2022-08-25 Ahmed Alemran System for intelligent resource management for distributed machine learning tasks
CN115357256A (en) * 2022-10-18 2022-11-18 安徽华云安科技有限公司 CDH cluster deployment method and system

Similar Documents

Publication Publication Date Title
CN109284184A (en) A kind of building method of the distributed machines learning platform based on containerization technique
US11593149B2 (en) Unified resource management for containers and virtual machines
CN107431696B (en) Method and cloud management node for application automation deployment
WO2019179453A1 (en) Virtual machine creation method and apparatus
Bui et al. Work queue+ python: A framework for scalable scientific ensemble applications
CN110442396B (en) Application program starting method and device, storage medium and electronic equipment
CN104506620A (en) Extensible automatic computing service platform and construction method for same
CN103064742A (en) Automatic deployment system and method of hadoop cluster
CN105786603B (en) Distributed high-concurrency service processing system and method
US10860364B2 (en) Containerized management services with high availability
CN104579792A (en) Architecture and method for achieving centralized management of various types of virtual resources based on multiple adaptive modes
CN111045786B (en) Container creation system and method based on mirror image layering technology in cloud environment
US10042673B1 (en) Enhanced application request based scheduling on heterogeneous elements of information technology infrastructure
CN109740765A (en) A kind of machine learning system building method based on Amazon server
Justino et al. Outsourcing resource-intensive tasks from mobile apps to clouds: Android and aneka integration
US20120059938A1 (en) Dimension-ordered application placement in a multiprocessor computer
CN105100180A (en) Cluster node dynamic loading method, device and system
CN105144107A (en) Method, processing modules and system for executing an executable code
CN104714843A (en) Method and device supporting multiple processors through multi-kernel operating system living examples
CN110782040A (en) Method, device, equipment and medium for training tasks of pitorch
CN113110920B (en) Operation method, device, equipment and storage medium of block chain system
Wu et al. An automatic artificial intelligence training platform based on kubernetes
CN115237547A (en) Unified container cluster hosting system and method for non-intrusive HPC computing cluster
CN115033290A (en) Instruction set-based micro-service splitting method and device and terminal equipment
CN110807018A (en) Method, device, equipment and storage medium for migrating master-slave architecture of business data to cluster architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20220830