CN109086134A - Method and device for running a deep learning job - Google Patents

Method and device for running a deep learning job Download PDF

Info

Publication number
CN109086134A
CN109086134A CN201810793520.0A
Authority
CN
China
Prior art keywords
deep learning
compute node
docker
image
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810793520.0A
Other languages
Chinese (zh)
Inventor
袁绍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810793520.0A
Publication of CN109086134A
Legal status: Pending (current)

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a method and device for running a deep learning job. The method comprises: receiving a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job; scheduling the deep learning job according to the usage and load of the compute nodes; when the deep learning job is scheduled to compute nodes, pushing the user-selected docker image from the image repository and creating a docker container on each compute node in the cluster; and mapping the hardware resources that each compute node allocates according to the deep learning job into the docker image, and running the deep learning job using the hardware resources mapped into the docker image and the docker container. In this way, the time and effort a user spends running deep learning jobs on a cluster are reduced.

Description

Method and device for running a deep learning job
Technical field
The present invention relates to computer clusters, and in particular to a method and device for running a deep learning job.
Background art
The concept of deep learning originates from research on artificial neural networks; a multilayer perceptron with several hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. The mainstream deep learning frameworks at present include tensorflow, caffe, pytorch and mxnet. A cluster is a high-performance, highly scalable and cost-effective computer cluster formed by interconnecting a group of computer systems through a high-performance network or a local area network so that they present a single system image. As cluster systems are widely used in scientific computing, commercial operations and other fields, the role they play becomes more and more important, and they are increasingly becoming an indispensable tool in these fields.
When a cluster is applied to deep learning, the deep learning jobs require a large amount of computation, so the cluster needs a large number of compute nodes to provide abundant hardware resources (for example, GPU (Graphics Processing Unit) resources). However, because the number of cluster nodes is large, it is difficult to schedule and provide the hardware resources in a unified way, so the utilization of the cluster's hardware resources is low, and scheduling the hardware resources of the cluster's nodes costs the user a great deal of time and effort. In addition, different deep learning frameworks have different framework dependencies, and the user has to configure a different training environment for each framework before model training, which also takes considerable time and effort.
Summary of the invention
In order to solve the above technical problems, the present invention provides a method and device for running a deep learning job, which can reduce the time and effort a user spends running deep learning jobs on a cluster.
To achieve the above objective, in one aspect, an embodiment of the present invention provides a method for running a deep learning job, the method comprising:
receiving a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
scheduling the deep learning job according to the usage and load of the compute nodes;
when the deep learning job is scheduled to compute nodes, pushing the user-selected docker image from the image repository and creating a docker container on each compute node in the cluster;
mapping the hardware resources that the compute node allocates according to the deep learning job into the docker image, and running the deep learning job using the hardware resources mapped into the docker image and the docker container.
Further, in an alternative embodiment, the required resources include:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of the Network File System (NFS);
after the step of running the deep learning job using the hardware resources mapped into the docker image and the docker container, the method further includes:
storing the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
Further, in an alternative embodiment, after the step of creating a docker container on each compute node in the cluster, the method further includes:
configuring the cluster using the overlay network tool flannel.
To achieve the above objective, in another aspect, an embodiment of the present invention provides a device for running a deep learning job, the device comprising a user selection receiving module, a job scheduling module, a container creation module and a job running module; wherein,
the user selection receiving module is configured to receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
the job scheduling module is configured to schedule the deep learning job according to the usage and load of the compute nodes;
the container creation module is configured to, when the deep learning job is scheduled to compute nodes, push the user-selected docker image from the image repository and create a docker container on each compute node in the cluster;
the job running module is configured to map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and to run the deep learning job using the hardware resources mapped into the docker image and the docker container.
Further, in an alternative embodiment, the required resources include:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of the Network File System (NFS);
the device further includes a model file storage module, which is configured to: after the job running module runs the deep learning job using the hardware resources mapped into the docker image and the docker container, store the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
Further, in an alternative embodiment, the device further includes a cluster configuration module, which is configured to: after the container creation module creates a docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
The beneficial effect of the embodiments of the present invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and the CPU and GPU resources among the hardware resources are allocated dynamically by the scheduler software. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, so the user does not have to configure a different framework environment for each framework; and because the deep learning job runs in a docker container at the bottom layer, dependency conflicts between different frameworks are avoided. The time and effort the user spends configuring environments are thereby reduced.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description, or may be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present invention and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the present invention and do not constitute a limitation of it.
Fig. 1 is a flowchart of a method for running a deep learning job provided by an embodiment of the present invention;
Fig. 2 is a block diagram of a device for running a deep learning job provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments in the present application and the features in the embodiments may be combined with one another in any way.
The steps shown in the flowchart of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
In one aspect, the embodiments of the present invention provide a method for running a deep learning job. As shown in Fig. 1, the method includes steps S101 to S107.
Step S101: receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job.
Here, the user selects, through a web page of the client, the resources required to run the deep learning job and the docker image of the deep learning job, and may also select or enter a training script. The client uses the B/S (Browser/Server) architecture, a network structure that emerged with the rise of the web, in which the web browser is the main application software on the client side. This mode unifies the client side and concentrates the core of the system's functionality on the server, which simplifies the development, maintenance and use of the system. The client only needs a browser installed, such as Netscape Navigator or Internet Explorer, while the server runs a database such as SQL Server, Oracle or MySQL. The browser exchanges data with the database through the web server.
After this, the client sends a request to the management node in the cluster; the request may be an HTTP (HyperText Transfer Protocol) request. Upon receiving the request, the management node forwards it to the scheduler software.
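The patent does not specify the format of this request. Purely as an illustrative sketch, a job submission from the client to the management node might look like the following, where the endpoint path, port and field names are assumptions rather than part of the disclosure:

```python
import requests

# Hypothetical job-submission payload: the fields mirror the resources the user
# selects in the web page (CPU, GPU, framework type, queue) plus the docker
# image and training script.
job_request = {
    "cpu_cores": 8,
    "gpus": 2,
    "framework": "tensorflow",
    "queue": "gpu_queue",
    "docker_image": "registry.example.com:5000/tensorflow-gpu:latest",
    "train_script": "/share/jobs/train.py",
}

# The management-node address and API endpoint are assumed for illustration only.
response = requests.post("http://mgmt-node:8080/api/jobs", json=job_request, timeout=10)
response.raise_for_status()
print("submitted job id:", response.json().get("job_id"))
```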
Step S103: schedule the deep learning job according to the usage and load of the compute nodes.
Here, the resource management software TORQUE works with the job scheduler software Maui to schedule the deep learning job according to the usage and load of each compute node in the cluster, assigning the deep learning job to the compute nodes, and each compute node provides the hardware resources required to run the deep learning job.
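The disclosure does not give a concrete submission command for TORQUE/Maui. As a hedged sketch only, a job of this kind might be submitted from the management node roughly as follows; the queue name, resource line and script path are assumptions, and the exact qsub options available depend on the TORQUE installation:

```python
import subprocess
import textwrap

# A minimal PBS/TORQUE job script; Maui then places the job on compute nodes
# according to their current usage and load.
pbs_script = textwrap.dedent("""\
    #!/bin/bash
    #PBS -N dl-job
    #PBS -q gpu_queue
    #PBS -l nodes=1:ppn=8:gpus=2
    #PBS -l walltime=24:00:00
    cd $PBS_O_WORKDIR
    python /share/jobs/train.py
""")

with open("dl_job.pbs", "w") as f:
    f.write(pbs_script)

# qsub prints the new job id on stdout when submission succeeds.
job_id = subprocess.check_output(["qsub", "dl_job.pbs"], text=True).strip()
print("submitted:", job_id)
```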
Step S105: when the deep learning job is scheduled to compute nodes, push the user-selected docker image from the image repository to each compute node in the cluster, and create a docker container on each compute node in the cluster.
Here, the user-selected docker image is pushed to each compute node, and a docker container is created on each compute node that will execute the deep learning job.
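The patent does not state which API distributes the image. One plausible sketch, assuming the docker-py client is available on a compute node and that the cluster runs a private registry at an illustrative address, is simply to pull the selected image:

```python
import docker

# Connect to the local docker daemon on the compute node.
client = docker.from_env()

# Pull the user-selected image from the cluster's private image repository;
# the registry address and tag are illustrative assumptions.
image = client.images.pull("registry.example.com:5000/tensorflow-gpu", tag="latest")
print("pulled image:", image.tags)
```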
Step S107: map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and run the deep learning job using the hardware resources mapped into the docker image and the docker container.
Docker is an open-source application container engine that lets developers package their application (here, the deep learning job) and its dependencies into a portable container, which can then be published to any popular Linux machine; virtualization can also be achieved. Containers use a sandbox mechanism and have no interfaces to one another. Because docker does not depend on any particular language, framework or system, running the deep learning job in docker at the bottom layer avoids conflicts between the framework dependencies (framework dependency packages) of different deep learning frameworks.
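As an illustration of step S107 (a sketch under assumptions, not the patent's prescribed implementation), a compute node could start the container with the allocated GPUs and the job's working directory mapped in roughly as follows; the image name, paths and GPU indices are hypothetical, and GPU visibility is assumed to be controlled through the nvidia container runtime's NVIDIA_VISIBLE_DEVICES variable:

```python
import docker

client = docker.from_env()

# Map the allocated hardware resources into the container: the job's working
# directory is mounted as a volume, and the GPUs assigned by the scheduler are
# exposed through the nvidia container runtime.
container = client.containers.run(
    "registry.example.com:5000/tensorflow-gpu:latest",
    command="python /workspace/train.py",
    runtime="nvidia",                               # assumes nvidia-docker2 is installed
    environment={"NVIDIA_VISIBLE_DEVICES": "0,1"},  # GPUs allocated to this job
    volumes={"/share/jobs/job-001": {"bind": "/workspace", "mode": "rw"}},
    detach=True,
)
print("started container:", container.short_id)
```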
The beneficial effect of the embodiments of the present invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and the CPU and GPU resources among the hardware resources are allocated dynamically by the scheduler software. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, so the user does not have to configure a different framework environment for each framework; and because the deep learning job runs in a docker container at the bottom layer, dependency conflicts between different frameworks are avoided. The time and effort the user spends configuring environments are thereby reduced.
Further, in an alternative embodiment, the required resources include: the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of NFS (Network File System). NFS is one of the file systems supported by FreeBSD; it allows computers in a network to share resources with one another over a TCP/IP network.
After step S107, the method further includes: storing the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node. The user can then obtain the model file from the management node.
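The NFS configuration itself is not detailed in the disclosure. A minimal sketch of how the shared directory might be set up is given below; the shared path, export options, network range and host name are assumptions, and the commands used are the standard Linux NFS tools:

```python
import subprocess

SHARE_DIR = "/share"       # assumed shared directory for model files
MGMT_HOST = "mgmt-node"    # assumed management-node host name


def export_share_on_management_node() -> None:
    """Run on the management node: export the shared directory to the cluster network."""
    with open("/etc/exports", "a") as exports:
        exports.write(f"{SHARE_DIR} 192.168.0.0/24(rw,sync,no_root_squash)\n")
    subprocess.check_call(["exportfs", "-ra"])


def mount_share_on_compute_node() -> None:
    """Run on a compute node: mount the export so model files reach the management node."""
    subprocess.check_call(["mount", "-t", "nfs", f"{MGMT_HOST}:{SHARE_DIR}", SHARE_DIR])
```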
Further, in one embodiment, after step S105 the method further includes: configuring the cluster using the overlay network tool flannel.
When docker containers are created on the compute nodes, the docker containers on two different compute nodes cannot communicate with each other because of the nature of docker containers. The cluster is therefore configured by deploying the overlay network tool flannel, which plans the IP addresses of the docker containers so that docker containers on different compute nodes can communicate with one another. The working directory is mapped in from the compute node that serves as the docker host, the GPU resource mapping is set, and the GPU runtime environment is configured.
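The flannel deployment is not spelled out in the patent. As a hedged sketch of one common pattern (assumed here, not taken from the disclosure), the overlay-network definition is written into etcd, flanneld runs on every node, and the docker daemon is then restarted to use the subnet that flannel assigns:

```python
import json
import subprocess

# Overlay-network definition that flanneld reads from etcd; the CIDR and the
# vxlan backend are illustrative choices, not values taken from the patent.
flannel_config = {"Network": "10.10.0.0/16", "Backend": {"Type": "vxlan"}}

# Store the configuration in etcd (etcd v2 key layout assumed).
subprocess.check_call(
    ["etcdctl", "set", "/coreos.com/network/config", json.dumps(flannel_config)]
)

# Once flanneld is running on a node it writes its subnet lease to
# /run/flannel/subnet.env; restarting the docker daemon with that subnet makes
# container IP addresses routable across compute nodes.
```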
In another aspect, the embodiments of the present invention provide a device for running a deep learning job. As shown in Fig. 2, the device includes a user selection receiving module 201, a job scheduling module 203, a container creation module 205 and a job running module 207; wherein,
the user selection receiving module 201 is configured to receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
the job scheduling module 203 is configured to schedule the deep learning job according to the usage and load of the compute nodes;
the container creation module 205 is configured to, when the deep learning job is scheduled to compute nodes, push the user-selected docker image from the image repository and create a docker container on each compute node in the cluster;
the job running module 207 is configured to map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and to run the deep learning job using the hardware resources mapped into the docker image and the docker container.
The beneficial effect of the embodiments of the present invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and the CPU and GPU resources among the hardware resources are allocated dynamically by the scheduler software. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, so the user does not have to configure a different framework environment for each framework; and because the deep learning job runs in a docker container at the bottom layer, dependency conflicts between different frameworks are avoided. The time and effort the user spends configuring environments are thereby reduced.
Further, in an alternative embodiment, the required resources include:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
Further, in an alternative embodiment, the compute nodes and the management node in the cluster share stored files by means of the Network File System (NFS);
the device further includes a model file storage module, which is configured to: after the job running module 207 runs the deep learning job using the hardware resources mapped into the docker image and the docker container, store the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
Further, in an alternative embodiment, the device further includes a cluster configuration module, which is configured to: after the container creation module creates a docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
Although the embodiments disclosed herein are as described above, the foregoing is provided only to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the field to which the present invention pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.

Claims (8)

1. A method for running a deep learning job, characterized by comprising:
receiving a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
scheduling the deep learning job according to the usage and load of compute nodes;
when the deep learning job is scheduled to compute nodes, pushing the user-selected docker image from an image repository and creating a docker container on each compute node in the cluster;
mapping the hardware resources that the compute node allocates according to the deep learning job into the docker image, and running the deep learning job using the hardware resources mapped into the docker image and the docker container.
2. The method according to claim 1, wherein the required resources comprise:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
3. The method according to claim 1, wherein the compute nodes and a management node in the cluster share stored files by means of the Network File System (NFS);
after the step of running the deep learning job using the hardware resources mapped into the docker image and the docker container, the method further comprises:
storing the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
4. The method according to claim 1, wherein after the step of creating a docker container on each compute node in the cluster, the method further comprises:
configuring the cluster using the overlay network tool flannel.
5. A device for running a deep learning job, characterized by comprising: a user selection receiving module, a job scheduling module, a container creation module and a job running module; wherein,
the user selection receiving module is configured to receive a user's selection of the resources required to run the deep learning job and of the docker image used to submit the deep learning job;
the job scheduling module is configured to schedule the deep learning job according to the usage and load of compute nodes;
the container creation module is configured to, when the deep learning job is scheduled to compute nodes, push the user-selected docker image from an image repository and create a docker container on each compute node in the cluster;
the job running module is configured to map the hardware resources that the compute node allocates according to the deep learning job into the docker image, and to run the deep learning job using the hardware resources mapped into the docker image and the docker container.
6. The device according to claim 5, wherein the required resources comprise:
the CPU resources, GPU resources, framework type and queue information used for training the deep learning task.
7. The device according to claim 5, wherein the compute nodes and a management node in the cluster share stored files by means of the Network File System (NFS);
the device further comprises a model file storage module, which is configured to: after the job running module runs the deep learning job using the hardware resources mapped into the docker image and the docker container, store the model file obtained by training the deep learning task to the compute node, so that the compute node shares the model file with the management node.
8. The device according to claim 5, characterized in that the device further comprises a cluster configuration module, which is configured to: after the container creation module creates a docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
CN201810793520.0A 2018-07-19 2018-07-19 Method and device for running a deep learning job Pending CN109086134A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810793520.0A CN109086134A (en) 2018-07-19 2018-07-19 Method and device for running a deep learning job

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810793520.0A CN109086134A (en) 2018-07-19 2018-07-19 Method and device for running a deep learning job

Publications (1)

Publication Number Publication Date
CN109086134A true CN109086134A (en) 2018-12-25

Family

ID=64837778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810793520.0A Pending CN109086134A (en) 2018-07-19 2018-07-19 Method and device for running a deep learning job

Country Status (1)

Country Link
CN (1) CN109086134A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120317447A1 (en) * 2011-06-07 2012-12-13 Microsoft Corporation Propagating unobserved exceptions in distributed execution environments
CN102880832A (en) * 2012-08-28 2013-01-16 曙光信息产业(北京)有限公司 Method for implementing mass data management system under colony
CN105224256A (en) * 2015-10-13 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of storage system
CN106708622A (en) * 2016-07-18 2017-05-24 腾讯科技(深圳)有限公司 Cluster resource processing method and system, and resource processing cluster
CN106790660A (en) * 2017-01-18 2017-05-31 咪咕视讯科技有限公司 A kind of dispositions method and device for realizing distributed memory system
CN107135257A (en) * 2017-04-28 2017-09-05 东方网力科技股份有限公司 Task is distributed in a kind of node cluster method, node and system
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN107783818A (en) * 2017-10-13 2018-03-09 北京百度网讯科技有限公司 Deep learning task processing method, device, equipment and storage medium
CN108062246A (en) * 2018-01-25 2018-05-22 北京百度网讯科技有限公司 For the resource regulating method and device of deep learning frame

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857475A (en) * 2018-12-27 2019-06-07 深圳云天励飞技术有限公司 A kind of method and device of frame management
CN109857475B (en) * 2018-12-27 2020-06-16 深圳云天励飞技术有限公司 Framework management method and device
CN109508238A (en) * 2019-01-05 2019-03-22 咪付(广西)网络技术有限公司 A kind of resource management system and method for deep learning
CN112114931B (en) * 2019-06-21 2023-12-26 富联精密电子(天津)有限公司 Deep learning program configuration method and device, electronic equipment and storage medium
CN112114931A (en) * 2019-06-21 2020-12-22 鸿富锦精密电子(天津)有限公司 Deep learning program configuration method and device, electronic equipment and storage medium
CN112148348A (en) * 2019-06-28 2020-12-29 杭州海康威视数字技术股份有限公司 Task processing method and device and storage medium
CN112148348B (en) * 2019-06-28 2023-10-20 杭州海康威视数字技术股份有限公司 Task processing method, device and storage medium
CN110688230B (en) * 2019-10-17 2022-06-24 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN110688230A (en) * 2019-10-17 2020-01-14 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN111090456A (en) * 2019-12-06 2020-05-01 浪潮(北京)电子信息产业有限公司 Construction method, device, equipment and medium for deep learning development environment
CN111190713A (en) * 2019-12-26 2020-05-22 曙光信息产业(北京)有限公司 Job scheduling management method and device
WO2021155667A1 (en) * 2020-02-05 2021-08-12 北京百度网讯科技有限公司 Model training method and apparatus, and clustering system
US11249749B2 (en) 2020-03-26 2022-02-15 Red Hat, Inc. Automatic generation of configuration files
CN112422651A (en) * 2020-11-06 2021-02-26 电子科技大学 Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN112416585B (en) * 2020-11-20 2024-03-15 南京大学 Deep learning-oriented GPU resource management and intelligent scheduling method
CN112203291A (en) * 2020-12-03 2021-01-08 中国科学院自动化研究所 Cluster control method for area coverage and connectivity maintenance based on knowledge embedding
CN113542352A (en) * 2021-06-08 2021-10-22 支付宝(杭州)信息技术有限公司 Node joint modeling method and node
CN113542352B (en) * 2021-06-08 2024-04-09 支付宝(杭州)信息技术有限公司 Node joint modeling method and node
CN114090183A (en) * 2021-11-25 2022-02-25 北京字节跳动网络技术有限公司 Application starting method and device, computer equipment and storage medium
WO2023174163A1 (en) * 2022-03-15 2023-09-21 之江实验室 Neural model storage system for brain-inspired computer operating system, and method

Similar Documents

Publication Publication Date Title
CN109086134A (en) Method and device for running a deep learning job
CN103516777B (en) Method and system for provisioning in a cloud computing environment
JP6659544B2 (en) Automated experimental platform
CN105593813B (en) Presentation interpreter for visualizing data provided from a constrained-environment container
US11481616B2 (en) Framework for providing recommendations for migration of a database to a cloud computing system
CN108958892A (en) Method and apparatus for creating a container for a deep learning job
Srinivasan et al. An overview of service-oriented architecture, web services and grid computing
US8719833B2 (en) Adaptive demand-driven load balancing
CN102760074B (en) Method and system for scalability of high-load business processes
US10178163B2 (en) Server-processor hybrid system for processing data
EP3032442B1 (en) Modeling and simulation of infrastructure architecture for big data
CN103425529A (en) System and method for migrating virtual machines between networked computing environments based on resource utilization
Marinescu Cloud Computing and Computer Clouds
Kale Guide to cloud computing for business and technology managers: from distributed computing to cloudware applications
CN102291445A (en) Cloud computing management system based on virtual resources
CN109117252A (en) Container-based task processing method and system, and container cluster management system
CN115202729A (en) Container service-based mirror image generation method, device, equipment and medium
US20090132582A1 (en) Processor-server hybrid system for processing data
JP5822414B2 (en) General-purpose simulation system using social network interface
Chen et al. Web-FEM: An internet-based finite-element analysis framework with 3D graphics and parallel computing environment
Willis et al. Container-based analysis environments for low-barrier access to research data
Rahmatulloh et al. Event-Driven Architecture to Improve Performance and Scalability in Microservices-Based Systems
Tahboub et al. Novel Approach for Remote Energy Meter Reading Using Mobile Agents
AU2015101031A4 (en) System and a method for modelling the performance of information systems
CN115361382A (en) Data processing method, device, equipment and storage medium based on data group

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181225