CN109034386A - A deep learning system and method based on a resource scheduler - Google Patents

Info

Publication number
CN109034386A
Authority
CN
China
Prior art keywords
deep learning
resource scheduler
resource
performance calculation
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810668856.4A
Other languages
Chinese (zh)
Inventor
王珏
刘芳
王彦棡
曹荣强
王晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Priority to CN201810668856.4A
Publication of CN109034386A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a deep learning system and method based on a resource scheduler. The system comprises multiple high-performance computing nodes, each containing multiple graphics processors, together with a resource scheduler and a deep learning framework. The resource scheduler selects the required resources from the multiple high-performance computing nodes and allocates them to a user according to the demand the user states. A parsing plug-in parses the environment variables of the resources the resource scheduler allocated to the user and obtains the corresponding parameters. The deep learning framework forms a job process according to these parameters and starts executing the deep learning program. After the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process. The present invention gives all kinds of deep learning frameworks a single, centrally managed system and effectively improves the operating efficiency of distributed learning frameworks.

Description

A deep learning system and method based on a resource scheduler
Technical field
The present invention relates to the technical field of artificial intelligence and deep learning, and more particularly to a deep learning system based on a resource scheduler and its method.
Background technique
With the arrival of the Internet of Things and the mobile Internet era, data is generated in every form and from every aspect of production and daily life, for example: sensors, log files, e-mail, social media, and all kinds of pictures and videos. An estimated 80% of today's data is unstructured, and unstructured data is growing 15 times as fast as structured data; global data volume was expected to reach 40 zettabytes (10^21 bytes) by 2020. Humanity has truly stepped into a data-centered era. Traditionally, high-performance computing (HPC) has been tightly coupled to large-scale scientific computing and big data applications, and it naturally possesses a complete, mature, highly optimized family of system technologies, for example: dedicated performance-optimized interconnects (InfiniBand, the IBM Blue Gene interconnect), high-performance message-passing libraries (MPI), rich mathematical libraries accelerated for all kinds of architectures (BLAS, LAPACK), efficient parallel file systems (Lustre, ParaStor), and the schedulers that bind all this software together (Slurm, LSF).
To make mature high-performance computing accommodate algorithms that, like deep learning, emphasize big-data computation, and to meet demands in artificial intelligence and machine learning on facilities built around high-performance computing clusters, the first problem we currently need to solve is how to make the scheduler accommodate distributed deep learning frameworks, so that deep learning study and training can be carried out on large-scale data.
Summary of the invention
To solve the above problems, in a first aspect, the present invention provides a deep learning system based on a resource scheduler, comprising: multiple high-performance computing nodes, each containing multiple graphics processors; and further comprising a resource scheduler and a deep learning framework. The resource scheduler selects the required resources from the multiple high-performance computing nodes and allocates them to a user according to the user's stated demand; a parsing plug-in parses the environment variables of the resources allocated to the user and obtains the corresponding parameters; the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
Preferably, the parsing plug-in is an application container engine, the application container engine including Singularity, Shifter, or Docker.
Preferably, parsing the environment variables of the resources the resource scheduler allocated to the user with the pre-written parsing plug-in to obtain the corresponding parameters comprises: the pre-written parsing plug-in parses the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
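As an illustration of this parsing step, here is a minimal sketch in Python, in the spirit of the patent but not taken from it: the split into one "ps" task plus workers, the fixed port, and the helper name cluster_from_slurm are assumptions for illustration. Expanding the compressed node list with `scontrol show hostnames` is standard Slurm usage.

```python
import os
import subprocess

import tensorflow as tf  # TensorFlow 1.x distributed API


def cluster_from_slurm(port=2222):
    """Derive (cluster, job_name, task_index) from Slurm's environment."""
    # "scontrol show hostnames" expands a compressed node list such as
    # "gpu[01-04]" into one hostname per line.
    hosts = subprocess.check_output(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        text=True,
    ).split()
    me = os.environ["SLURMD_NODENAME"]  # the node this task runs on

    # Assumed policy: first node is the parameter server, the rest are workers.
    cluster = tf.train.ClusterSpec({
        "ps": ["%s:%d" % (hosts[0], port)],
        "worker": ["%s:%d" % (h, port) for h in hosts[1:]],
    })
    if me == hosts[0]:
        return cluster, "ps", 0
    return cluster, "worker", hosts.index(me) - 1


if __name__ == "__main__":
    cluster, job_name, task_index = cluster_from_slurm()
    # The deep learning framework "forms the job process" from these values:
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
    if job_name == "ps":
        server.join()  # the parameter server blocks and serves variables
```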
Preferably, the number of high-performance computing nodes is 48, and each high-performance computing node contains 8 graphics processors.
Preferably, the resource scheduler is the Slurm resource scheduler.
Preferably, the deep learning framework is the TensorFlow deep learning framework.
In a second aspect, the present invention provides a deep learning method based on a resource scheduler, comprising the following steps: the resource scheduler selects the required resources from multiple high-performance computing nodes and allocates them to a user according to the user's stated demand; a pre-written parsing plug-in parses the environment variables of the resources allocated to the user and obtains the corresponding parameters; the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
Preferably, the parsing plug-in is an application container engine, the application container engine including Singularity, Shifter, or Docker.
Preferably, parsing the environment variables of the resources the resource scheduler allocated to the user with the pre-written parsing plug-in to obtain the corresponding parameters comprises: the pre-written parsing plug-in parses the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
Preferably, the number of high-performance computing nodes is 48, and each high-performance computing node contains 8 graphics processors.
The present invention gives all kinds of deep learning frameworks a single, centrally managed system. Combining the scheduler with a distributed storage system lets all deep learning frameworks coexist in one management system, and a pre-written plug-in lets the scheduler schedule distributed learning frameworks, so that a distributed deep learning framework is invoked and released by the scheduler just like an ordinary program. This effectively improves the operating efficiency of distributed learning frameworks.
Detailed description of the invention
Fig. 1 is a schematic flow chart of the deep learning method based on a resource scheduler provided by Embodiment 1 of the present invention;
Fig. 2 is a chart of the single-node performance test results in Embodiment 1 of the present invention;
Fig. 3 is a chart of the distributed TensorFlow results in Embodiment 1 of the present invention.
Specific embodiment
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
An embodiment of the present invention provides a deep learning system based on a resource scheduler. The system comprises multiple high-performance computing nodes, each containing multiple graphics processors; for example, 48 high-performance computing nodes with 8 graphics processors each. The system further comprises a resource scheduler and a deep learning framework, where the resource scheduler is the Slurm resource scheduler and the deep learning framework is the TensorFlow deep learning framework. The resource scheduler selects the required resources from the multiple high-performance computing nodes and allocates them to a user according to the user's stated demand; a parsing plug-in parses the environment variables of the resources the scheduler allocated to the user and obtains the corresponding parameters; the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
Here the parsing plug-in is an application container engine, such as Singularity, Shifter, or Docker. Over the past decade, virtualization technology has grown from the personal hobby of some engineers into a basic need of global industry. As a container, Singularity wraps whole development environments, so engineers and scientists can load their own or required software environments into Singularity: build once, then use seamlessly across many platforms. This improves the working efficiency of engineering and research staff and lets them concentrate on the essence of the core questions that really matter, rather than being bogged down by trifles. For system administrators it is likewise a great relief: new software now springs up like mushrooms after rain, every user's demands may differ, and installing all software uniformly is both unrealistic and unnecessary. For the same software, user A may need version 1.0 while user B needs version 1.2; software A may be compiled against library version 0.5 while software B is compiled against version 0.8. Singularity, as an encapsulation of an environment, ideally meets the demands of both users and system administrators.
Singularity itself was designed and developed as a container for HPC. It natively supports the key technologies of high-performance computing, such as InfiniBand and Lustre, and integrates seamlessly with all the common high-performance resource managers, such as Slurm, Torque, and SGE. In particular, Singularity can be integrated into Slurm as a plug-in, so that Slurm jobs naturally run inside Singularity containers. The common container engines Singularity, Shifter, and Docker are compared below:
(Table: comparison of Singularity, Shifter, and Docker.)
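As a concrete illustration of that integration, the sketch below launches a training script inside a Singularity image across a Slurm allocation. The image name tensorflow.simg and the script train.py are illustrative assumptions, not names from the patent; `--nv` is Singularity's standard flag for exposing NVIDIA GPUs inside the container.

```python
import subprocess

# Run the training script inside the container on every node of the current
# Slurm allocation; srun launches one task per allocated node/task slot.
subprocess.run(
    ["srun", "singularity", "exec", "--nv",
     "tensorflow.simg", "python", "train.py"],
    check=True,
)
```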
As above, parsing the environment variables of the resources the resource scheduler allocated to the user with the pre-written parsing plug-in to obtain the corresponding parameters comprises: the pre-written parsing plug-in parses the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, and starts executing the deep learning program.
This embodiment of the present invention gives all kinds of deep learning frameworks a single, centrally managed system. Combining the scheduler with a distributed storage system lets all deep learning frameworks coexist in one management system, and the pre-written plug-in lets the scheduler schedule distributed learning frameworks, so that a distributed deep learning framework is invoked and released by the scheduler just like an ordinary program. This effectively improves the operating efficiency of distributed learning frameworks.
Correspondingly, an embodiment of the present invention provides a deep learning method based on a resource scheduler. As shown in Fig. 1, the method comprises the following steps:
S101: the resource scheduler selects the required resources from multiple high-performance computing nodes and allocates them to a user according to the user's stated demand;
S102: a pre-written parsing plug-in parses the environment variables of the resources the resource scheduler allocated to the user and obtains the corresponding parameters;
S103: the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
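A hedged sketch of step S101 from the user's side: the demand is expressed to Slurm as a batch submission, and S102-S103 then run on the allocation. The node and GPU counts and the script name job.sh are illustrative assumptions; `--gres=gpu:8` is Slurm's standard syntax for requesting GPU cards.

```python
import subprocess

# S101 in miniature: ask Slurm for 4 nodes with 8 GPUs each. "job.sh" is a
# hypothetical batch script whose payload carries out S102-S103.
subprocess.run(
    ["sbatch", "--nodes=4", "--ntasks-per-node=1", "--gres=gpu:8",
     "--job-name=dl-train", "job.sh"],
    check=True,
)
```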
This embodiment of the present invention gives all kinds of deep learning frameworks a centrally managed method. Combining the scheduler with a distributed storage system lets all deep learning frameworks coexist in one management system, and the pre-written plug-in lets the scheduler schedule distributed learning frameworks, so that a distributed deep learning framework is invoked and released by the scheduler just like an ordinary program. This effectively improves the operating efficiency of distributed learning frameworks.
The performance of the deep learning system is illustrated experimentally below:
A deep learning system must perform intensive computation over large data sets, so it first places high demands on read/write bandwidth for massive data; only then can the network latency of each deep learning model during training be kept under control. This deep learning system uses ParaStor storage networked over InfiniBand, which guarantees the data transfer rate.
The deep learning system has 48 nodes with 8 Tesla P100 GPUs per node and a Slurm-based scheduler. The platform dynamically allocates resources down to the granularity of individual GPU cards, which satisfies the scheduling needs of all kinds of deep learning workloads.
The deep learning system supports all kinds of deep learning frameworks with excellent compatibility; the current mainstream frameworks are already supported, e.g. TensorFlow, Caffe, MXNet, and PyTorch. To balance the demands of different versions and of user-customized frameworks, the platform provides container-level virtualization and can reuse Docker containers: once a user has built an environment successfully, it can be transplanted onto this platform and used for the relevant basic experiments without any further change, which greatly improves experimental convenience.
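Reusing a Docker container typically means converting a published Docker image into a Singularity image once, after which the image file moves between platforms unchanged. A minimal sketch; the TensorFlow tag is an illustrative assumption:

```python
import subprocess

# Pull a Docker Hub image and convert it into a local Singularity image.
subprocess.run(
    ["singularity", "pull", "docker://tensorflow/tensorflow:1.8.0-gpu"],
    check=True,
)
```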
To show representative use of the deep learning system, this test chose the most typical deep learning framework, TensorFlow, and ran four configurations: a direct bare run on a single node; dynamic node allocation through Slurm; a TensorFlow environment test inside a virtualized container; and Slurm scheduling and loading the TensorFlow Singularity container. The single-node tests were repeated with 1, 2, 4, and 8 GPUs. The results are tabulated below.
Single-node TensorFlow test results (images per second):

GPUs    Python    Slurm     Singularity    Singularity on Slurm
1       159.3     159.9     158.9          160.49
2       315.5     307.8     310.9          313.1
4       630.4     615.2     621.9          613.0
8       1133.6    1108.7    1135.2         1099.8
In the single-node experiments the reported metric is the number of images processed per second. The chosen data set is the ImageNet 2012 set, the deep learning network is uniformly ResNet-50, and a batch size of 32 is used throughout training.
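A sweep of this kind can be reproduced with tf_cnn_benchmarks from TensorFlow's public benchmarks repository; the patent does not name its exact harness, so the invocation below is an assumption consistent with the stated ResNet-50, batch-size-32 setup.

```python
import subprocess

# Sweep 1, 2, 4 and 8 GPUs with ResNet-50 at batch size 32 per GPU; the
# script reports throughput in images/second, as in the table above.
for num_gpus in (1, 2, 4, 8):
    subprocess.run(
        ["python", "tf_cnn_benchmarks.py", "--model=resnet50",
         "--batch_size=32", "--num_gpus=%d" % num_gpus],
        check=True,
    )
```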
Judging from the single-node results, all four groups of experiments reach a linear speedup, and neither Slurm scheduling nor Singularity virtualization brings any appreciable performance loss; the four configurations are essentially indistinguishable. (Note: Python: log in to the compute node and test directly; Slurm: Slurm dynamically schedules the compute node; Singularity: log in to the compute node and test inside Singularity; Singularity on Slurm: Slurm schedules the Singularity container on the node.) The performance of the four groups is shown in Fig. 2.
Eight GPUs already satisfy the essential training requirements of most deep learning work. To support training and acceleration of ultra-large models, distributed TensorFlow is also seamlessly integrated with Slurm's dynamic scheduling. The table below shows the performance of distributed TensorFlow from 1 to 8 nodes (images per second).
Nodes     1         2         4         8
Python    1133.7    1888.4    3868.9    7561.5
Viewed from these throughput results, scaling from 8 GPUs (one node) to 64 GPUs (eight nodes) also achieves an essentially linear speedup, demonstrating the platform's outstanding scalability. The detailed results are shown in Fig. 3.
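The claimed scaling can be checked directly from the table above with a few lines of arithmetic:

```python
# Speedup relative to a single node (8 GPUs), using the figures above.
results = {1: 1133.7, 2: 1888.4, 4: 3868.9, 8: 7561.5}
for nodes, images_per_sec in results.items():
    print("%d node(s): %.2fx" % (nodes, images_per_sec / results[1]))
# 1 node(s): 1.00x, 2: 1.67x, 4: 3.41x, 8: 6.67x -- close to linear
```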
Judged by the experimental results, the Slurm-based deep learning system achieves linear GPU speedup from a single node to multiple nodes and comfortably meets the computational performance that deep learning requires. Dynamic allocation of the platform's computing resources conveniently serves many simultaneous users. Seamless integration of the mainstream frameworks, with multiple versions coexisting as virtualized containers, realizes broad user customization and excellent portability. The whole platform combines a traditional high-performance architecture with today's hottest deep learning technology and has produced good application value.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A deep learning system based on a resource scheduler, comprising multiple high-performance computing nodes, each high-performance computing node comprising multiple graphics processors; characterized by further comprising a resource scheduler and a deep learning framework, wherein
the resource scheduler is configured to select, according to a demand stated by a user, the required resources from the multiple high-performance computing nodes and allocate them to the user;
the environment variables of the resources the resource scheduler allocated to the user are parsed by a parsing plug-in to obtain corresponding parameters; and
the deep learning framework forms a job process according to the parameters, thereby starting to execute a deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
2. The system according to claim 1, wherein the parsing plug-in is an application container engine, the application container engine comprising Singularity, Shifter, or Docker.
3. The system according to claim 1, wherein parsing, by the pre-written parsing plug-in, the environment variables of the resources the resource scheduler allocated to the user to obtain the corresponding parameters comprises: parsing, by the pre-written parsing plug-in, the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources the resource scheduler allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; and
the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
4. The system according to claim 1, wherein the number of high-performance computing nodes is 48 and each high-performance computing node comprises 8 graphics processors.
5. The system according to claim 1, wherein the resource scheduler is the Slurm resource scheduler.
6. The system according to claim 1, wherein the deep learning framework is the TensorFlow deep learning framework.
7. A deep learning method based on a resource scheduler, characterized by comprising the following steps: the resource scheduler selects, according to a demand stated by a user, the required resources from multiple high-performance computing nodes and allocates them to the user;
the environment variables of the resources the resource scheduler allocated to the user are parsed by a pre-written parsing plug-in to obtain corresponding parameters; and
the deep learning framework forms a job process according to the parameters, thereby starting to execute a deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
8. The method according to claim 7, wherein the parsing plug-in is an application container engine, the application container engine comprising Singularity, Shifter, or Docker.
9. The method according to claim 7, wherein parsing, by the pre-written parsing plug-in, the environment variables of the resources the resource scheduler allocated to the user to obtain the corresponding parameters comprises: parsing, by the pre-written parsing plug-in, the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources the resource scheduler allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; and
the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
10. The method according to claim 7, wherein the number of high-performance computing nodes is 48 and each high-performance computing node comprises 8 graphics processors.
CN201810668856.4A 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler Pending CN109034386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810668856.4A CN109034386A (en) 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810668856.4A CN109034386A (en) 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler

Publications (1)

Publication Number Publication Date
CN109034386A 2018-12-18

Family

ID=64610901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810668856.4A Pending CN109034386A (en) 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler

Country Status (1)

Country Link
CN (1) CN109034386A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739514A (en) * 2018-12-21 2019-05-10 北京中科寒武纪科技有限公司 Parameter processing method and Related product
CN109976911A (en) * 2019-03-25 2019-07-05 哈尔滨工程大学 A kind of adaptive resource dispatching method
CN110096356A (en) * 2019-03-22 2019-08-06 北京达佳互联信息技术有限公司 Resource regulating method, device, electronic equipment and storage medium
CN110389834A (en) * 2019-06-28 2019-10-29 苏州浪潮智能科技有限公司 A kind of method and apparatus for submitting deep learning training mission
CN111221541A (en) * 2019-12-26 2020-06-02 曙光信息产业(北京)有限公司 Cluster parallel program deployment method and device
CN111695672A (en) * 2019-03-14 2020-09-22 百度(美国)有限责任公司 Method for improving AI engine MAC utilization rate
CN113065642A (en) * 2021-04-09 2021-07-02 中电科数字科技(集团)有限公司 Artificial intelligence acceleration method and system based on heterogeneous computing
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
US11699073B2 (en) 2018-12-29 2023-07-11 Cambricon Technologies Corporation Limited Network off-line model processing method, artificial intelligence processing device and related products

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
US9934147B1 (en) * 2015-06-26 2018-04-03 Emc Corporation Content-aware storage tiering techniques within a job scheduling system
CN108170417A (en) * 2017-12-29 2018-06-15 曙光信息产业(北京)有限公司 A kind of method and apparatus that high performance job scheduling frame is integrated in MESOS clusters

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9934147B1 (en) * 2015-06-26 2018-04-03 Emc Corporation Content-aware storage tiering techniques within a job scheduling system
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN108170417A (en) * 2017-12-29 2018-06-15 曙光信息产业(北京)有限公司 A kind of method and apparatus that high performance job scheduling frame is integrated in MESOS clusters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JACOB_WJJ: "Implementing TensorFlow job submission with Slurm", https://blog.csdn.net/jiangbo1017/article/details/78591846 *
LU, ZHONGHUA et al.: "Design of a Slurm-based high-performance computing platform for deep learning and its scheduling implementation", e-Science Technology & Application *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739514A (en) * 2018-12-21 2019-05-10 北京中科寒武纪科技有限公司 Parameter processing method and Related product
US11699073B2 (en) 2018-12-29 2023-07-11 Cambricon Technologies Corporation Limited Network off-line model processing method, artificial intelligence processing device and related products
CN111695672A (en) * 2019-03-14 2020-09-22 百度(美国)有限责任公司 Method for improving AI engine MAC utilization rate
CN111695672B (en) * 2019-03-14 2023-09-08 百度(美国)有限责任公司 Method for improving MAC utilization rate of AI engine
CN110096356A (en) * 2019-03-22 2019-08-06 北京达佳互联信息技术有限公司 Resource regulating method, device, electronic equipment and storage medium
CN110096356B (en) * 2019-03-22 2022-06-03 北京达佳互联信息技术有限公司 Resource scheduling method, device, electronic equipment and storage medium
CN109976911A (en) * 2019-03-25 2019-07-05 哈尔滨工程大学 A kind of adaptive resource dispatching method
CN110389834A (en) * 2019-06-28 2019-10-29 苏州浪潮智能科技有限公司 A kind of method and apparatus for submitting deep learning training mission
CN110389834B (en) * 2019-06-28 2022-07-12 苏州浪潮智能科技有限公司 Method and device for submitting deep learning training task
CN111221541A (en) * 2019-12-26 2020-06-02 曙光信息产业(北京)有限公司 Cluster parallel program deployment method and device
CN113065642A (en) * 2021-04-09 2021-07-02 中电科数字科技(集团)有限公司 Artificial intelligence acceleration method and system based on heterogeneous computing
CN113065642B (en) * 2021-04-09 2023-04-07 中电科数字科技(集团)有限公司 Artificial intelligence acceleration method and system based on heterogeneous computing
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN114968559B (en) * 2022-05-06 2023-12-01 苏州国科综合数据中心有限公司 LSF-based multi-host multi-GPU distributed arrangement deep learning model method

Similar Documents

Publication Publication Date Title
CN109034386A (en) A kind of deep learning system and method based on Resource Scheduler
Taylor Distributed simulation: state-of-the-art and potential for operational research
Mao et al. Analysis of average shortest‐path length of scale‐free network
CN106649085A (en) Cloud computing-based software test system
CN106199696B (en) Earthquake data processing system and method
CN104053179B (en) A kind of C RAN system integration project platforms
Cui et al. Multiple DAGs workflow scheduling algorithm based on reinforcement learning in cloud computing
CN107480717A (en) Train job processing method and system, computing device, computer-readable storage medium
Li et al. Intermediate data placement and cache replacement strategy under Spark platform
Yilmaz et al. Panel: The future of research in modeling & simulation
CN109992715A (en) Information displaying method, device, medium and calculating equipment
Scarpiniti et al. VirtFogSim: A parallel toolbox for dynamic energy-delay performance testing and optimization of 5G mobile-fog-cloud virtualized platforms
CN110351145A (en) A kind of radio network functions method of combination of the virtualization based on economic benefit
Varghese et al. DocLite: A docker-based lightweight cloud benchmarking tool
CN106293947A (en) GPU CPU mixing resource allocation system and method under virtualization cloud environment
CN107301088A (en) A kind of method and apparatus for managing virtual machine batch migration
CN109976873A (en) The scheduling scheme acquisition methods and dispatching method of containerization distributed computing framework
Xhafa et al. A parallel grid-based implementation for real-time processing of event log data of collaborative applications
Liu et al. KubFBS: A fine‐grained and balance‐aware scheduling system for deep learning tasks based on kubernetes
Depasquale et al. Dynamics of research into modeling the power consumption of virtual entities used in the telco cloud
Cao et al. Throughput optimization for Storm-based processing of stream data on clouds
Mondal et al. Toward optimal load prediction and customizable autoscaling scheme for kubernetes
Wang et al. GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters
Muniyandi et al. A representation of membrane computing with a clustering algorithm on the graphical processing unit
Nasonov et al. The multi-level adaptive approach for efficient execution of multi-scale distributed applications with dynamic workload

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181218