CN109034386A - A deep learning system and method based on a resource scheduler - Google Patents

Info

Publication number
CN109034386A
Authority
CN
China
Prior art keywords
deep learning
resource scheduler
resource
performance calculation
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810668856.4A
Other languages
Chinese (zh)
Inventor
王珏
刘芳
王彦棡
曹荣强
王晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Network Information Center of CAS
Priority to CN201810668856.4A
Publication of CN109034386A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a deep learning system and method based on a resource scheduler. The system comprises multiple high-performance computing nodes, each containing multiple graphics processors, together with a resource scheduler and a deep learning framework. The resource scheduler selects the required resources from the multiple high-performance computing nodes and allocates them to a user according to the demand the user states. A parsing plug-in parses the environment variables of the resources the resource scheduler allocated to the user and obtains the corresponding parameters. The deep learning framework forms a job process according to these parameters and starts executing the deep learning program. After the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process. The present invention gives all kinds of deep learning frameworks a single, centrally managed system and effectively improves the operating efficiency of distributed learning frameworks.

Description

A deep learning system and method based on a resource scheduler
Technical field
The present invention relates to the technical field of artificial intelligence and deep learning, and more particularly to a deep learning system based on a resource scheduler and its method.
Background technique
With the arrival of the Internet of Things and the mobile Internet era, data is generated in every form and from every aspect of production and daily life, for example: sensors, log files, e-mail, social media, and all kinds of pictures and videos. An estimated 80% of today's data is unstructured, and unstructured data is growing 15 times as fast as structured data; global data volume was expected to reach 40 zettabytes (10^21 bytes) by 2020. Humanity has truly stepped into a data-centered era. Traditionally, high-performance computing (HPC) has been tightly coupled to large-scale scientific computing and big data applications, and it naturally possesses a complete, mature, highly optimized family of system technologies, for example: dedicated performance-optimized interconnects (InfiniBand, the IBM Blue Gene interconnect), high-performance message-passing libraries (MPI), rich mathematical libraries accelerated for all kinds of architectures (BLAS, LAPACK), efficient parallel file systems (Lustre, ParaStor), and the schedulers that bind all this software together (Slurm, LSF).
To make mature high-performance computing accommodate algorithms that, like deep learning, emphasize big-data computation, and to meet demands in artificial intelligence and machine learning on facilities built around high-performance computing clusters, the first problem we currently need to solve is how to make the scheduler accommodate distributed deep learning frameworks, so that deep learning study and training can be carried out on large-scale data.
Summary of the invention
To solve the above problems, in a first aspect, the present invention provides a deep learning system based on a resource scheduler, comprising: multiple high-performance computing nodes, each containing multiple graphics processors; and further comprising a resource scheduler and a deep learning framework. The resource scheduler selects the required resources from the multiple high-performance computing nodes and allocates them to a user according to the user's stated demand; a parsing plug-in parses the environment variables of the resources allocated to the user and obtains the corresponding parameters; the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
Preferably, the parsing plug-in is an application container engine, the application container engine including Singularity, Shifter, or Docker.
Preferably, parsing the environment variables of the resources the resource scheduler allocated to the user with the pre-written parsing plug-in to obtain the corresponding parameters comprises: the pre-written parsing plug-in parses the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
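As an illustration of this parsing step, here is a minimal sketch in Python, in the spirit of the patent but not taken from it: the split into one "ps" task plus workers, the fixed port, and the helper name cluster_from_slurm are assumptions for illustration. Expanding the compressed node list with `scontrol show hostnames` is standard Slurm usage.

```python
import os
import subprocess

import tensorflow as tf  # TensorFlow 1.x distributed API


def cluster_from_slurm(port=2222):
    """Derive (cluster, job_name, task_index) from Slurm's environment."""
    # "scontrol show hostnames" expands a compressed node list such as
    # "gpu[01-04]" into one hostname per line.
    hosts = subprocess.check_output(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        text=True,
    ).split()
    me = os.environ["SLURMD_NODENAME"]  # the node this task runs on

    # Assumed policy: first node is the parameter server, the rest are workers.
    cluster = tf.train.ClusterSpec({
        "ps": ["%s:%d" % (hosts[0], port)],
        "worker": ["%s:%d" % (h, port) for h in hosts[1:]],
    })
    if me == hosts[0]:
        return cluster, "ps", 0
    return cluster, "worker", hosts.index(me) - 1


if __name__ == "__main__":
    cluster, job_name, task_index = cluster_from_slurm()
    # The deep learning framework "forms the job process" from these values:
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
    if job_name == "ps":
        server.join()  # the parameter server blocks and serves variables
```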
Preferably, the number of high-performance computing nodes is 48, and each high-performance computing node contains 8 graphics processors.
Preferably, the resource scheduler is the Slurm resource scheduler.
Preferably, the deep learning framework is the TensorFlow deep learning framework.
In a second aspect, the present invention provides a deep learning method based on a resource scheduler, comprising the following steps: the resource scheduler selects the required resources from multiple high-performance computing nodes and allocates them to a user according to the user's stated demand; a pre-written parsing plug-in parses the environment variables of the resources allocated to the user and obtains the corresponding parameters; the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
Preferably, the parsing plug-in is an application container engine, the application container engine including Singularity, Shifter, or Docker.
Preferably, parsing the environment variables of the resources the resource scheduler allocated to the user with the pre-written parsing plug-in to obtain the corresponding parameters comprises: the pre-written parsing plug-in parses the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
Preferably, the number of high-performance computing nodes is 48, and each high-performance computing node contains 8 graphics processors.
The present invention gives all kinds of deep learning frameworks a single, centrally managed system. Combining the scheduler with a distributed storage system lets all deep learning frameworks coexist in one management system, and a pre-written plug-in lets the scheduler schedule distributed learning frameworks, so that a distributed deep learning framework is invoked and released by the scheduler just like an ordinary program. This effectively improves the operating efficiency of distributed learning frameworks.
Detailed description of the invention
Fig. 1 is a schematic flow chart of the deep learning method based on a resource scheduler provided by Embodiment 1 of the present invention;
Fig. 2 is a chart of the single-node performance test results in Embodiment 1 of the present invention;
Fig. 3 is a chart of the distributed TensorFlow results in Embodiment 1 of the present invention.
Specific embodiment
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
An embodiment of the present invention provides a deep learning system based on a resource scheduler. The system comprises multiple high-performance computing nodes, each containing multiple graphics processors; for example, 48 high-performance computing nodes with 8 graphics processors each. The system further comprises a resource scheduler and a deep learning framework, where the resource scheduler is the Slurm resource scheduler and the deep learning framework is the TensorFlow deep learning framework. The resource scheduler selects the required resources from the multiple high-performance computing nodes and allocates them to a user according to the user's stated demand; a parsing plug-in parses the environment variables of the resources the scheduler allocated to the user and obtains the corresponding parameters; the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
Here the parsing plug-in is an application container engine, such as Singularity, Shifter, or Docker. Over the past decade, virtualization technology has grown from the personal hobby of some engineers into a basic need of global industry. As a container, Singularity wraps whole development environments, so engineers and scientists can load their own or required software environments into Singularity: build once, then use seamlessly across many platforms. This improves the working efficiency of engineering and research staff and lets them concentrate on the essence of the core questions that really matter, rather than being bogged down by trifles. For system administrators it is likewise a great relief: new software now springs up like mushrooms after rain, every user's demands may differ, and installing all software uniformly is both unrealistic and unnecessary. For the same software, user A may need version 1.0 while user B needs version 1.2; software A may be compiled against library version 0.5 while software B is compiled against version 0.8. Singularity, as an encapsulation of an environment, ideally meets the demands of both users and system administrators.
Singularity itself was designed and developed as a container for HPC. It natively supports the key technologies of high-performance computing, such as InfiniBand and Lustre, and integrates seamlessly with all the common high-performance resource managers, such as Slurm, Torque, and SGE. In particular, Singularity can be integrated into Slurm as a plug-in, so that Slurm jobs naturally run inside Singularity containers. The common container engines Singularity, Shifter, and Docker are compared below:
(Table: comparison of Singularity, Shifter, and Docker.)
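As a concrete illustration of that integration, the sketch below launches a training script inside a Singularity image across a Slurm allocation. The image name tensorflow.simg and the script train.py are illustrative assumptions, not names from the patent; `--nv` is Singularity's standard flag for exposing NVIDIA GPUs inside the container.

```python
import subprocess

# Run the training script inside the container on every node of the current
# Slurm allocation; srun launches one task per allocated node/task slot.
subprocess.run(
    ["srun", "singularity", "exec", "--nv",
     "tensorflow.simg", "python", "train.py"],
    check=True,
)
```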
As above, parsing the environment variables of the resources the resource scheduler allocated to the user with the pre-written parsing plug-in to obtain the corresponding parameters comprises: the pre-written parsing plug-in parses the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, and starts executing the deep learning program.
This embodiment of the present invention gives all kinds of deep learning frameworks a single, centrally managed system. Combining the scheduler with a distributed storage system lets all deep learning frameworks coexist in one management system, and the pre-written plug-in lets the scheduler schedule distributed learning frameworks, so that a distributed deep learning framework is invoked and released by the scheduler just like an ordinary program. This effectively improves the operating efficiency of distributed learning frameworks.
Correspondingly, an embodiment of the present invention provides a deep learning method based on a resource scheduler. As shown in Fig. 1, the method comprises the following steps:
S101: the resource scheduler selects the required resources from multiple high-performance computing nodes and allocates them to a user according to the user's stated demand;
S102: a pre-written parsing plug-in parses the environment variables of the resources the resource scheduler allocated to the user and obtains the corresponding parameters;
S103: the deep learning framework forms a job process according to the parameters and starts executing the deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
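A hedged sketch of step S101 from the user's side: the demand is expressed to Slurm as a batch submission, and S102-S103 then run on the allocation. The node and GPU counts and the script name job.sh are illustrative assumptions; `--gres=gpu:8` is Slurm's standard syntax for requesting GPU cards.

```python
import subprocess

# S101 in miniature: ask Slurm for 4 nodes with 8 GPUs each. "job.sh" is a
# hypothetical batch script whose payload carries out S102-S103.
subprocess.run(
    ["sbatch", "--nodes=4", "--ntasks-per-node=1", "--gres=gpu:8",
     "--job-name=dl-train", "job.sh"],
    check=True,
)
```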
This embodiment of the present invention gives all kinds of deep learning frameworks a centrally managed method. Combining the scheduler with a distributed storage system lets all deep learning frameworks coexist in one management system, and the pre-written plug-in lets the scheduler schedule distributed learning frameworks, so that a distributed deep learning framework is invoked and released by the scheduler just like an ordinary program. This effectively improves the operating efficiency of distributed learning frameworks.
The performance of the deep learning system is illustrated experimentally below:
A deep learning system must perform intensive computation over large data sets, so it first places high demands on read/write bandwidth for massive data; only then can the network latency of each deep learning model during training be kept under control. This deep learning system uses ParaStor storage networked over InfiniBand, which guarantees the data transfer rate.
The deep learning system has 48 nodes with 8 Tesla P100 GPUs per node and a Slurm-based scheduler. The platform dynamically allocates resources down to the granularity of individual GPU cards, which satisfies the scheduling needs of all kinds of deep learning workloads.
The deep learning system supports all kinds of deep learning frameworks with excellent compatibility; the current mainstream frameworks are already supported, e.g. TensorFlow, Caffe, MXNet, and PyTorch. To balance the demands of different versions and of user-customized frameworks, the platform provides container-level virtualization and can reuse Docker containers: once a user has built an environment successfully, it can be transplanted onto this platform and used for the relevant basic experiments without any further change, which greatly improves experimental convenience.
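Reusing a Docker container typically means converting a published Docker image into a Singularity image once, after which the image file moves between platforms unchanged. A minimal sketch; the TensorFlow tag is an illustrative assumption:

```python
import subprocess

# Pull a Docker Hub image and convert it into a local Singularity image.
subprocess.run(
    ["singularity", "pull", "docker://tensorflow/tensorflow:1.8.0-gpu"],
    check=True,
)
```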
To show representative use of the deep learning system, this test chose the most typical deep learning framework, TensorFlow, and ran four configurations: a direct bare run on a single node; dynamic node allocation through Slurm; a TensorFlow environment test inside a virtualized container; and Slurm scheduling and loading the TensorFlow Singularity container. The single-node tests were repeated with 1, 2, 4, and 8 GPUs. The results are tabulated below.
Single-node TensorFlow test results (images per second):

GPUs    Python    Slurm     Singularity    Singularity on Slurm
1       159.3     159.9     158.9          160.49
2       315.5     307.8     310.9          313.1
4       630.4     615.2     621.9          613.0
8       1133.6    1108.7    1135.2         1099.8
In the single-node experiments the reported metric is the number of images processed per second. The chosen data set is the ImageNet 2012 set, the deep learning network is uniformly ResNet-50, and a batch size of 32 is used throughout training.
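A sweep of this kind can be reproduced with tf_cnn_benchmarks from TensorFlow's public benchmarks repository; the patent does not name its exact harness, so the invocation below is an assumption consistent with the stated ResNet-50, batch-size-32 setup.

```python
import subprocess

# Sweep 1, 2, 4 and 8 GPUs with ResNet-50 at batch size 32 per GPU; the
# script reports throughput in images/second, as in the table above.
for num_gpus in (1, 2, 4, 8):
    subprocess.run(
        ["python", "tf_cnn_benchmarks.py", "--model=resnet50",
         "--batch_size=32", "--num_gpus=%d" % num_gpus],
        check=True,
    )
```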
Judging from the single-node results, all four groups of experiments reach a linear speedup, and neither Slurm scheduling nor Singularity virtualization brings any appreciable performance loss; the four configurations are essentially indistinguishable. (Note: Python: log in to the compute node and test directly; Slurm: Slurm dynamically schedules the compute node; Singularity: log in to the compute node and test inside Singularity; Singularity on Slurm: Slurm schedules the Singularity container on the node.) The performance of the four groups is shown in Fig. 2.
Eight GPUs already satisfy the essential training requirements of most deep learning work. To support training and acceleration of ultra-large models, distributed TensorFlow is also seamlessly integrated with Slurm's dynamic scheduling. The table below shows the performance of distributed TensorFlow from 1 to 8 nodes (images per second).
Nodes     1         2         4         8
Python    1133.7    1888.4    3868.9    7561.5
Viewed from these throughput results, scaling from 8 GPUs (one node) to 64 GPUs (eight nodes) also achieves an essentially linear speedup, demonstrating the platform's outstanding scalability. The detailed results are shown in Fig. 3.
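The claimed scaling can be checked directly from the table above with a few lines of arithmetic:

```python
# Speedup relative to a single node (8 GPUs), using the figures above.
results = {1: 1133.7, 2: 1888.4, 4: 3868.9, 8: 7561.5}
for nodes, images_per_sec in results.items():
    print("%d node(s): %.2fx" % (nodes, images_per_sec / results[1]))
# 1 node(s): 1.00x, 2: 1.67x, 4: 3.41x, 8: 6.67x -- close to linear
```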
Judged by the experimental results, the Slurm-based deep learning system achieves linear GPU speedup from a single node to multiple nodes and comfortably meets the computational performance that deep learning requires. Dynamic allocation of the platform's computing resources conveniently serves many simultaneous users. Seamless integration of the mainstream frameworks, with multiple versions coexisting as virtualized containers, realizes broad user customization and excellent portability. The whole platform combines a traditional high-performance architecture with today's hottest deep learning technology and has produced good application value.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A deep learning system based on a resource scheduler, comprising multiple high-performance computing nodes, each high-performance computing node comprising multiple graphics processors; characterized by further comprising a resource scheduler and a deep learning framework, wherein
the resource scheduler is configured to select, according to a demand stated by a user, the required resources from the multiple high-performance computing nodes and allocate them to the user;
the environment variables of the resources the resource scheduler allocated to the user are parsed by a parsing plug-in to obtain corresponding parameters; and
the deep learning framework forms a job process according to the parameters, thereby starting to execute a deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
2. The system according to claim 1, wherein the parsing plug-in is an application container engine, the application container engine comprising Singularity, Shifter, or Docker.
3. The system according to claim 1, wherein parsing, by the pre-written parsing plug-in, the environment variables of the resources the resource scheduler allocated to the user to obtain the corresponding parameters comprises: parsing, by the pre-written parsing plug-in, the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources the resource scheduler allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; and
the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
4. The system according to claim 1, wherein the number of high-performance computing nodes is 48 and each high-performance computing node comprises 8 graphics processors.
5. The system according to claim 1, wherein the resource scheduler is the Slurm resource scheduler.
6. The system according to claim 1, wherein the deep learning framework is the TensorFlow deep learning framework.
7. A deep learning method based on a resource scheduler, characterized by comprising the following steps: the resource scheduler selects, according to a demand stated by a user, the required resources from multiple high-performance computing nodes and allocates them to the user;
the environment variables of the resources the resource scheduler allocated to the user are parsed by a pre-written parsing plug-in to obtain corresponding parameters; and
the deep learning framework forms a job process according to the parameters, thereby starting to execute a deep learning program; after the deep learning program completes, the resource scheduler reclaims all allocated resources, completing the entire deep learning process.
8. The method according to claim 7, wherein the parsing plug-in is an application container engine, the application container engine comprising Singularity, Shifter, or Docker.
9. The method according to claim 7, wherein parsing, by the pre-written parsing plug-in, the environment variables of the resources the resource scheduler allocated to the user to obtain the corresponding parameters comprises: parsing, by the pre-written parsing plug-in, the environment variables SLURM_JOB_NODELIST and SLURMD_NODENAME of the resources the resource scheduler allocated to the user, obtaining the corresponding parameters cluster, job_name, and task_index; and
the deep learning framework forms a job process according to the parameters cluster, job_name, and task_index, thereby starting to execute the deep learning program.
10. The method according to claim 7, wherein the number of high-performance computing nodes is 48 and each high-performance computing node comprises 8 graphics processors.
CN201810668856.4A 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler Pending CN109034386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810668856.4A CN109034386A (en) 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810668856.4A CN109034386A (en) 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler

Publications (1)

Publication Number Publication Date
CN109034386A 2018-12-18

Family

ID=64610901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810668856.4A Pending CN109034386A (en) 2018-06-26 2018-06-26 A kind of deep learning system and method based on Resource Scheduler

Country Status (1)

Country Link
CN (1) CN109034386A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739514A (en) * 2018-12-21 2019-05-10 北京中科寒武纪科技有限公司 Parameter processing method and Related product
CN109976911A (en) * 2019-03-25 2019-07-05 哈尔滨工程大学 A kind of adaptive resource dispatching method
CN110096356A (en) * 2019-03-22 2019-08-06 北京达佳互联信息技术有限公司 Resource regulating method, device, electronic equipment and storage medium
CN110389834A (en) * 2019-06-28 2019-10-29 苏州浪潮智能科技有限公司 A kind of method and apparatus for submitting deep learning training mission
CN111221541A (en) * 2019-12-26 2020-06-02 曙光信息产业(北京)有限公司 Cluster parallel program deployment method and device
CN111695672A (en) * 2019-03-14 2020-09-22 百度(美国)有限责任公司 Method for improving AI engine MAC utilization rate
CN113065642A (en) * 2021-04-09 2021-07-02 中电科数字科技(集团)有限公司 Artificial intelligence acceleration method and system based on heterogeneous computing
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
US11699073B2 (en) 2018-12-29 2023-07-11 Cambricon Technologies Corporation Limited Network off-line model processing method, artificial intelligence processing device and related products

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
US9934147B1 (en) * 2015-06-26 2018-04-03 Emc Corporation Content-aware storage tiering techniques within a job scheduling system
CN108170417A (en) * 2017-12-29 2018-06-15 曙光信息产业(北京)有限公司 A kind of method and apparatus that high performance job scheduling frame is integrated in MESOS clusters

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9934147B1 (en) * 2015-06-26 2018-04-03 Emc Corporation Content-aware storage tiering techniques within a job scheduling system
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN108170417A (en) * 2017-12-29 2018-06-15 曙光信息产业(北京)有限公司 A kind of method and apparatus that high performance job scheduling frame is integrated in MESOS clusters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JACOB_WJJ: "Implementing TensorFlow job submission with Slurm", https://blog.csdn.net/jiangbo1017/article/details/78591846 *
LU, ZHONGHUA et al.: "Design of a Slurm-based high-performance computing platform for deep learning and its scheduling implementation", e-Science Technology & Application *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739514A (en) * 2018-12-21 2019-05-10 北京中科寒武纪科技有限公司 Parameter processing method and Related product
US11699073B2 (en) 2018-12-29 2023-07-11 Cambricon Technologies Corporation Limited Network off-line model processing method, artificial intelligence processing device and related products
CN111695672A (en) * 2019-03-14 2020-09-22 百度(美国)有限责任公司 Method for improving AI engine MAC utilization rate
CN111695672B (en) * 2019-03-14 2023-09-08 百度(美国)有限责任公司 Method for improving MAC utilization rate of AI engine
CN110096356A (en) * 2019-03-22 2019-08-06 北京达佳互联信息技术有限公司 Resource regulating method, device, electronic equipment and storage medium
CN110096356B (en) * 2019-03-22 2022-06-03 北京达佳互联信息技术有限公司 Resource scheduling method, device, electronic equipment and storage medium
CN109976911A (en) * 2019-03-25 2019-07-05 哈尔滨工程大学 A kind of adaptive resource dispatching method
CN110389834A (en) * 2019-06-28 2019-10-29 苏州浪潮智能科技有限公司 A kind of method and apparatus for submitting deep learning training mission
CN110389834B (en) * 2019-06-28 2022-07-12 苏州浪潮智能科技有限公司 Method and device for submitting deep learning training task
CN111221541A (en) * 2019-12-26 2020-06-02 曙光信息产业(北京)有限公司 Cluster parallel program deployment method and device
CN113065642A (en) * 2021-04-09 2021-07-02 中电科数字科技(集团)有限公司 Artificial intelligence acceleration method and system based on heterogeneous computing
CN113065642B (en) * 2021-04-09 2023-04-07 中电科数字科技(集团)有限公司 Artificial intelligence acceleration method and system based on heterogeneous computing
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN114968559B (en) * 2022-05-06 2023-12-01 苏州国科综合数据中心有限公司 LSF-based multi-host multi-GPU distributed arrangement deep learning model method

Similar Documents

Publication Publication Date Title
CN109034386A (en) A kind of deep learning system and method based on Resource Scheduler
Taylor Distributed simulation: state-of-the-art and potential for operational research
Mao et al. Analysis of average shortest‐path length of scale‐free network
CN106649085A (en) Cloud computing-based software test system
CN106199696B (en) Earthquake data processing system and method
CN104053179B (en) A kind of C RAN system integration project platforms
Cui et al. Multiple DAGs workflow scheduling algorithm based on reinforcement learning in cloud computing
CN107480717A (en) Train job processing method and system, computing device, computer-readable storage medium
Li et al. Intermediate data placement and cache replacement strategy under Spark platform
Yilmaz et al. Panel: The future of research in modeling & simulation
CN109992715A (en) Information displaying method, device, medium and calculating equipment
Scarpiniti et al. VirtFogSim: A parallel toolbox for dynamic energy-delay performance testing and optimization of 5G mobile-fog-cloud virtualized platforms
CN110351145A (en) A kind of radio network functions method of combination of the virtualization based on economic benefit
Varghese et al. DocLite: A docker-based lightweight cloud benchmarking tool
CN106293947A (en) GPU CPU mixing resource allocation system and method under virtualization cloud environment
CN107301088A (en) A kind of method and apparatus for managing virtual machine batch migration
CN109976873A (en) The scheduling scheme acquisition methods and dispatching method of containerization distributed computing framework
Xhafa et al. A parallel grid-based implementation for real-time processing of event log data of collaborative applications
Liu et al. KubFBS: A fine‐grained and balance‐aware scheduling system for deep learning tasks based on kubernetes
Depasquale et al. Dynamics of research into modeling the power consumption of virtual entities used in the telco cloud
Cao et al. Throughput optimization for Storm-based processing of stream data on clouds
Mondal et al. Toward optimal load prediction and customizable autoscaling scheme for kubernetes
Wang et al. GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters
Muniyandi et al. A representation of membrane computing with a clustering algorithm on the graphical processing unit
Nasonov et al. The multi-level adaptive approach for efficient execution of multi-scale distributed applications with dynamic workload

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181218