CN109240814A - A kind of deep learning intelligent dispatching method and system based on TensorFlow - Google Patents

A kind of deep learning intelligent dispatching method and system based on TensorFlow

Info

Publication number
CN109240814A
CN109240814A (application CN201810962198.XA)
Authority
CN
China
Prior art keywords
resource
tensorflow
task
resource information
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810962198.XA
Other languages
Chinese (zh)
Inventor
Wang Yu (王宇)
Cao Xue (曹雪)
Current Assignee
Hunan Shunkang Information Technology Co ltd
Original Assignee
Hunan Shunkang Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Shunkang Information Technology Co Ltd
Priority to CN201810962198.XA
Publication of CN109240814A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a TensorFlow-based deep learning intelligent dispatching method, comprising: S1, receiving the number of tasks contained in a TensorFlow application sent by a user terminal, together with the resource information requested by each task; S2, collecting the resource information of the cluster; S3, calculating an optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster; S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node; S5, publishing the TensorFlow application. The invention also proposes a TensorFlow-based deep learning intelligent dispatching system. Users no longer need to establish the mapping between tasks and resource nodes by hand, which greatly shortens the time spent building those mappings; further, the optimal set of resource nodes is selected automatically from the collected resource information and the application's tasks are published onto those optimal nodes, so resources are used rationally and to the fullest and waste of resources is effectively avoided.

Description

A kind of deep learning intelligent dispatching method and system based on TensorFlow
Technical field
The present invention relates to the field of deep learning applications, and in particular to a TensorFlow-based deep learning intelligent dispatching method and system.
Background technique
In recent years, deep learning has emerged as a new direction in machine-learning research: it builds neural networks that imitate the human brain in performing analysis and learning. By means of deep learning algorithms, humanity can at last find a way to handle the age-old problem of "abstract concepts".
TensorFlow is the computational framework that Google formally open-sourced on November 9, 2015. It supports the various algorithms of deep learning well and is one of the most popular deep-learning libraries; Google formed it by distilling the experience and lessons of its predecessor, DistBelief. It is inherently portable, efficient, and scalable, and can run on different computers.
Fig. 1 shows how a TensorFlow-based distributed application cluster currently operates in production. To run a TensorFlow distributed application cluster, the user must configure all parameters of the cluster in advance and specify which task runs on which port of which host. A necessary condition for creating the TensorFlow distributed application cluster is to start one service for each task. The following work is done for each task:
1. Create a tf.train.ClusterSpec that describes all tasks in the TensorFlow distributed application cluster; this description should be identical for every task.
2. Create a tf.train.Server, pass the parameters from the tf.train.ClusterSpec to its constructor, and write the job name and the index of the current task into the local task.
Under a traditional distributed TensorFlow cluster environment, a TensorFlow distributed application can indeed be run by manually configuring tf.train.ClusterSpec and tf.train.Server. For an ordinary environment with few distributed TensorFlow cluster nodes this is feasible, simple and convenient, but for large-scale TensorFlow distributed applications in a big-data setting it becomes exceedingly complex and hard to maintain. In such settings the number of nodes of a TensorFlow distributed application can reach hundreds or thousands, which means that, when publishing a TensorFlow application, different tf.train.ClusterSpec and tf.train.Server parameters must be configured for each resource node. If every publication of a TensorFlow application required the user to configure tf.train.ClusterSpec and tf.train.Server parameters by hand, that would be unacceptable for any user. Moreover, when publishing an application, the user cannot judge whether the host node a task is published to has enough resources to run it, for example, whether there is enough CPU for task scheduling, or whether the node is equipped with a GPU, so this publication method is not an optimal choice for any user.
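The repetitive manual configuration described above can be sketched as follows (TF 1.x distributed API; the host addresses, port numbers, and the `server_kwargs` helper are illustrative assumptions, not part of the patent):

```python
# Hand-written cluster description of the kind the background describes.
# All addresses and ports below are hypothetical placeholders.
cluster_def = {
    "worker": ["192.168.1.10:2222", "192.168.1.11:2222"],
    "ps": ["192.168.1.20:2223"],
}

def server_kwargs(job_name, task_index):
    """Arguments the user must repeat by hand for every task: the shared
    cluster description plus this task's own job name and index."""
    return {"cluster": cluster_def, "job_name": job_name, "task_index": task_index}

# In an actual TF 1.x deployment each task would then run:
#   cluster = tf.train.ClusterSpec(cluster_def)
#   server = tf.train.Server(cluster, job_name=..., task_index=...)
kwargs = server_kwargs("worker", 1)
```

With hundreds of nodes, writing one such declaration per task is exactly the burden the invention removes.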
Summary of the invention
To solve the prior-art problem that the optimal set of resource nodes cannot be selected automatically when publishing an application, the present invention proposes a TensorFlow-based deep learning intelligent dispatching method and system.
The technical problem of the invention is solved by the following technical solution:
A TensorFlow-based deep learning intelligent dispatching method includes the following steps:
S1, receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, calculating an optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster;
S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
In some preferred embodiments, the tasks include worker tasks and ps tasks.
In some preferred embodiments, the resource information requested by each task includes the resource devices and the requested amount of each resource.
In some preferred embodiments, the resource information of the cluster includes the usage of cluster resources and the total amount of cluster resources.
In some preferred embodiments, the resource devices include CPU, GPU, MEM, IO and bandwidth.
In some preferred embodiments, step S3 is achieved by the following steps:
T1, calculating all satisfying resource nodes from the resource information of the cluster and the resource information requested by each task;
T2, from the satisfying resource nodes of step T1, using the analytic hierarchy process (AHP) to calculate each node's weight for the current task, the node with the smallest weight being the optimal resource node for the current task;
T3, subtracting the usage of the optimal resource node of step T2 from the resource information of the cluster, and repeating steps T1 and T2 to obtain the optimal set of resource nodes for all tasks.
In some further preferred embodiments, step T1 is achieved by the following steps:
T11, collecting the resource information of the cluster;
T12, rejecting the unsatisfying resource nodes according to the resource information requested by each task and the resource information of the cluster, thereby obtaining all satisfying resource nodes.
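Steps T11 and T12 amount to a feasibility filter over the cluster's nodes. A minimal sketch under assumed data shapes (the node and request dictionaries are hypothetical; the patent does not fix a representation):

```python
def free_resources(node):
    """Idle amount of each resource device on a node: total minus used (step T11)."""
    return {dev: node["total"][dev] - node["used"][dev] for dev in node["total"]}

def filter_nodes(nodes, request):
    """Step T12: reject nodes whose idle CPU/GPU/MEM/IO/bandwidth cannot
    cover the task's requested amounts; keep the satisfying nodes."""
    ok = []
    for name, node in nodes.items():
        free = free_resources(node)
        if all(free.get(dev, 0) >= amount for dev, amount in request.items()):
            ok.append(name)
    return ok

nodes = {
    "n1": {"total": {"cpu": 8, "gpu": 1, "mem": 32}, "used": {"cpu": 6, "gpu": 1, "mem": 8}},
    "n2": {"total": {"cpu": 16, "gpu": 2, "mem": 64}, "used": {"cpu": 2, "gpu": 0, "mem": 16}},
}
request = {"cpu": 4, "gpu": 1, "mem": 16}
satisfying = filter_nodes(nodes, request)  # n1 has no idle GPU and too little CPU
```

Only the nodes surviving this filter are passed on to the AHP weighting of step T2.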
The present invention also proposes a TensorFlow-based deep learning intelligent dispatching system, comprising:
an application configuration unit for receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task;
a resource management unit for obtaining the resource information requested by each task from the application configuration unit and collecting the resource information of the cluster;
a strategy analysis unit for obtaining, from the resource management unit, the resource information requested by each task and the resource information of the cluster, and calculating the optimal set of resource nodes;
an application configuration analysis unit for obtaining the number of tasks from the application configuration unit, obtaining the optimal set of resource nodes from the strategy analysis unit, and establishing the mapping between each task and its corresponding optimal resource node;
a configuration parameter submission unit for submitting the resource information requested by each task in the application configuration unit to the resource management unit;
an application publishing unit for publishing the TensorFlow application.
The present invention also proposes an electronic device, comprising:
a memory and a processor;
the memory stores computer-executable instructions, and the processor executes the computer-executable instructions to perform:
S1, receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, calculating an optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster;
S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
The beneficial effects of the present invention over the prior art include:
The TensorFlow-based deep learning intelligent dispatching method of the present invention includes the following steps: S1, receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; S2, collecting the resource information of the cluster; S3, calculating the optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster; S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node; S5, publishing the TensorFlow application. Because the optimal set of resource nodes is calculated from the resource information requested by each task and the resource information of the cluster, and the task-to-node mappings are then derived from the number of tasks and that node set before the application is published, the user no longer has to build the mapping between each task and its resource node by hand. This greatly shortens the time spent establishing the mappings while also effectively improving their correctness. Further, selecting the optimal set of resource nodes automatically from the collected resource information and publishing the application's tasks onto those optimal nodes makes the most rational use of resources and effectively avoids waste of resources.
Detailed description of the invention
Fig. 1 is a flowchart of the TensorFlow-based deep learning intelligent dispatching method in an embodiment of the present invention;
Fig. 2 is a flowchart of the concrete realization of step S3 in Fig. 1;
Fig. 3 is a flowchart of the concrete realization of step T1 in Fig. 2;
Fig. 4 is the system architecture diagram of the TensorFlow-based deep learning intelligent dispatching system in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the electronic device in an embodiment of the present invention.
Wherein: 1, application management center; 101, TensorFlow cluster initialization unit; 102, application publishing unit; 2, application configuration center; 201, application configuration unit; 202, application configuration analysis unit; 203, configuration parameter submission unit; 3, resource scheduling center; 301, resource management unit; 302, strategy analysis unit; 4, resource pool; 401, resource node; 5, electronic device; 501, processor; 502, memory.
Specific embodiment
The invention will be further described below with reference to the accompanying drawings and in conjunction with preferred embodiments.
With reference to Figs. 1-3, the TensorFlow-based deep learning intelligent dispatching method in this embodiment includes the following steps:
S1, the application configuration unit 201 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks includes the number of worker tasks and the number of ps tasks; the resource information requested by each task includes the resource devices and the requested amount of each resource device.
S2, the resource management unit 301 collects the resource information of the cluster, including the total amount and the usage of cluster resources, that is, the total amount and usage of resources in the resource pool 4.
S3, the strategy analysis unit 302 calculates the optimal set of resource nodes from the resource devices and requested amounts of each task in step S1 and the total amount and usage of resources in the resource pool 4. The optimal set of resource nodes is calculated as follows:
T1, the strategy analysis unit 302 obtains from the resource management unit 301 the resource information of the resource pool 4, including the total amount and usage of resources, obtains from the application configuration unit 201 the resource information requested by each task, including the resource devices and requested amounts, and calculates all satisfying resource nodes 401. All satisfying resource nodes 401 are calculated as follows:
T11, the resource management unit 301 obtains the total amount and usage of resources in the resource pool 4 (the resource devices include CPU, GPU, MEM, IO and bandwidth) and calculates the idle CPU, GPU, MEM, IO and bandwidth in the resource pool 4;
T12, the resource management unit 301 rejects, according to the CPU, GPU, MEM, IO and bandwidth requested by the application, the resource nodes 401 that cannot satisfy the requested resources; the resource information of the cluster is thus filtered and screened, yielding all satisfying resource nodes 401;
T2, from the satisfying resource nodes 401 of step T1, the strategy analysis unit 302 uses the analytic hierarchy process to calculate each node's weight for the current task; the node with the smallest weight is the optimal resource node 401 for the current task. The weight of each node for the current task is calculated with the analytic hierarchy process as follows:
A judgment matrix is constructed with the analytic hierarchy process (AHP), and each node's weight for the current task is derived from it. The judgment matrix has the form of formula (1):
A = (a_ij)_{n x n}, where a_ii = 1 and a_ji = 1/a_ij   (1)
where a_ij indicates the importance of index i relative to index j. After the weights are obtained, a consistency check decides whether they are acceptable. The consistency index and consistency ratio are given by formula (2):
CI = (λ_max - n) / (n - 1),  CR = CI / RI   (2)
where λ_max is the largest eigenvalue of the judgment matrix and n is the order of the judgment matrix. RI is the random consistency index, whose values are listed in Table 1:
Table 1: Random consistency index (RI) values
n   1  2  3     4     5     6     7     8     9     10    11
RI  0  0  0.58  0.90  1.12  1.24  1.32  1.41  1.45  1.49  1.51
When the consistency ratio CR < 0.1, the constructed judgment matrix is considered acceptable and can be used to calculate the weights.
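The patent names the analytic hierarchy process but does not spell out how the eigenvector is computed. The sketch below uses the common geometric-mean approximation of the principal eigenvector together with the standard consistency check CI = (λ_max - n)/(n - 1), CR = CI/RI, with the RI values of Table 1; the example judgment matrix is hypothetical:

```python
import math

RI = (0, 0, 0.58, 0.90, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49, 1.51)  # Table 1

def ahp_weights(matrix):
    """Return (weights, CR): the geometric-mean approximation of the
    principal eigenvector of a judgment matrix, and its consistency ratio."""
    n = len(matrix)
    geo = [math.prod(row) ** (1.0 / n) for row in matrix]  # row geometric means
    total = sum(geo)
    w = [g / total for g in geo]                           # normalized weights
    # lambda_max estimated as the mean of (A w)_i / w_i
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)                               # consistency index
    cr = ci / RI[n - 1] if RI[n - 1] else 0.0              # consistency ratio
    return w, cr

# A perfectly consistent 3x3 matrix (a_ij = w_i / w_j): CR should be ~0
m = [[1, 2, 4], [0.5, 1, 2], [0.25, 0.5, 1]]
w, cr = ahp_weights(m)
```

For this matrix the weights come out near (4/7, 2/7, 1/7) and CR is essentially zero, so by the CR < 0.1 rule the matrix would be accepted.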
T3, the strategy analysis unit 302 subtracts the usage of the optimal resource node 401 of step T2 from the resource information of the cluster, that is, from the resources of the resource pool 4, and repeats steps T1 and T2 to obtain the optimal set of resource nodes for all tasks, A = {w0, w1, ..., wn, p0, p1, ..., pm}, where w denotes a worker node and p denotes a ps node.
S4, the application configuration unit 201 establishes, from the number of worker tasks and ps tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node 401. The mapping is established as follows:
The tf.train.ClusterSpec and tf.train.Server parameters of each task in the TensorFlow application are configured automatically, according to the following rules:
tf.train.Server parameter configuration:
for a worker task, the server is declared as tf.train.Server(cluster, job_name="worker", task_index=N), where N is the index of the worker node in the set A;
for a ps task, the server is declared as tf.train.Server(cluster, job_name="ps", task_index=M), where M is the index of the ps node in the set A.
tf.train.ClusterSpec parameter configuration:
the cluster spec is tf.train.ClusterSpec({"worker": ["w0:port", ..., "wn:port"], "ps": ["p0:port", ..., "pm:port"]}), where w0 to wn in the worker list are the IP addresses of all worker nodes in the set A, p0 to pm are the IP addresses of all ps nodes in the set A, and port is the default port number configured in the application configuration unit 201.
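The auto-configuration rules above can be sketched as a small generator of the ClusterSpec argument and the per-task Server keyword arguments; the IP addresses and the default port are illustrative assumptions:

```python
def build_task_params(workers, ps, port=2222):
    """From an optimal node set A = {w0..wn, p0..pm}, auto-generate the
    tf.train.ClusterSpec argument and one tf.train.Server kwargs dict per
    task, following the rules above (worker indices first, then ps)."""
    cluster_def = {
        "worker": [f"{ip}:{port}" for ip in workers],
        "ps": [f"{ip}:{port}" for ip in ps],
    }
    servers = (
        [{"job_name": "worker", "task_index": i} for i in range(len(workers))]
        + [{"job_name": "ps", "task_index": i} for i in range(len(ps))]
    )
    return cluster_def, servers

# Hypothetical optimal node set: two worker nodes, one ps node
cluster_def, servers = build_task_params(["10.0.0.1", "10.0.0.2"], ["10.0.0.3"])
# Each entry of `servers`, together with cluster_def, would be passed to
# tf.train.ClusterSpec / tf.train.Server when the application is published.
```

This is exactly the by-hand configuration of the background section, now derived mechanically from the scheduling result.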
S5, after the configuration parameter submission unit 203 of the application configuration center 2 has gathered all the configuration information of the TensorFlow application and the optimal node set, it submits them to the application publishing unit 102, which publishes all worker tasks and ps tasks of the TensorFlow application onto the designated resource nodes 401 according to the above information.
With reference to Fig. 4, the TensorFlow-based deep learning intelligent dispatching system in this embodiment includes an application configuration center 2, a resource scheduling center 3, an application management center 1 and a resource pool 4.
The application configuration center 2 includes the application configuration unit 201, the application configuration analysis unit 202 and the configuration parameter submission unit 203; the resource scheduling center 3 includes the resource management unit 301 and the strategy analysis unit 302; the application management center 1 includes the application publishing unit 102 and the TensorFlow cluster initialization unit 101.
The application configuration center 2 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks is the sum of worker tasks and ps tasks; the resource information requested by each task includes the resource devices and the requested amounts; the resource devices include CPU, MEM, GPU, IO and bandwidth. The application configuration analysis unit 202 obtains the number of tasks from the application configuration unit 201, obtains the optimal set of resource nodes from the strategy analysis unit 302, and establishes the mapping between each task and its corresponding optimal resource node 401. The configuration parameter submission unit 203 submits the resource information requested by each task in the application configuration unit 201 to the resource management unit 301 and, after gathering all configuration information of the TensorFlow application and the optimal node set, submits them to the application publishing unit 102.
The resource management unit 301 obtains the resource information requested by each task from the application configuration unit 201 and collects the resource information of the cluster; the strategy analysis unit 302 obtains from the resource management unit 301 the resource information requested by each task and the resource information of the cluster, and calculates the optimal set of resource nodes.
The application publishing unit 102 publishes the TensorFlow application: according to the configuration information and the optimal node set, it publishes all tasks of the TensorFlow application onto the designated resource nodes 401. The TensorFlow cluster initialization unit 101 initializes the TensorFlow cluster.
The resource pool 4 includes all available resource nodes 401 in the TensorFlow cluster; the resource devices on each node include CPU, MEM, GPU, IO and bandwidth.
When a user publishes a TensorFlow application, the resource information requested by each task in the application and the number of tasks it contains are sent to the application configuration center 2, and the TensorFlow cluster is initialized. The application configuration unit 201, application configuration analysis unit 202 and configuration parameter submission unit 203 of the application configuration center 2 analyze the tf.train.ClusterSpec and tf.train.Server parameter information each task requires, along with the resource information requested by each task. The resource management unit 301 and strategy analysis unit 302 of the resource scheduling center 3 then calculate the optimal set of resource nodes for scheduling the TensorFlow application and deliver the optimal node set information to the application configuration analysis unit 202, which calculates the tf.train.ClusterSpec and tf.train.Server parameter information of each task from the optimal resource node 401 information and the number of tasks. Finally, all information is submitted to the application publishing unit 102, which, using the tf.train.ClusterSpec parameter information and the tf.train.Server parameter information, publishes the application onto the optimal set of resource nodes in the TensorFlow cluster without the user having to configure any parameter of the TensorFlow application by hand.
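Put together, the publication flow reduces to a greedy loop: filter the feasible nodes for a task, rank them, place the task, then deduct its usage before scheduling the next task. The sketch below substitutes a most-idle-CPU rule for the AHP ranking and uses hypothetical resource dictionaries; it is an illustration of the loop's shape, not the patent's exact weighting:

```python
def schedule(tasks, nodes):
    """Greedy placement: per task, keep nodes whose free resources cover the
    request, pick one (here: most idle CPU, standing in for the AHP weight
    ranking of step T2), then subtract the request (step T3)."""
    placement = {}
    for name, req in tasks.items():
        feasible = [
            n for n, free in nodes.items()
            if all(free.get(d, 0) >= amount for d, amount in req.items())
        ]
        best = max(feasible, key=lambda n: nodes[n]["cpu"])
        placement[name] = best
        for d, amount in req.items():
            nodes[best][d] -= amount  # deduct so later tasks see updated capacity
    return placement

# Hypothetical free capacities and task requests
nodes = {"n1": {"cpu": 8, "mem": 32}, "n2": {"cpu": 4, "mem": 16}}
tasks = {"worker0": {"cpu": 6, "mem": 8}, "ps0": {"cpu": 2, "mem": 4}}
placement = schedule(tasks, nodes)
```

Here worker0 can only fit on n1; after its usage is deducted, ps0 goes to n2, which then has the most idle CPU.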
With reference to Fig. 5, the electronic device 5 in this embodiment includes a memory 502 and a processor 501;
the memory 502 stores computer-executable instructions, and the processor 501 executes the computer-executable instructions to perform:
S1, the application configuration unit 201 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks includes the number of worker tasks and the number of ps tasks; the resource information requested by each task includes the resource devices and the requested amount of each resource device;
S2, the resource management unit 301 collects the resource information of the cluster, including the total amount and the usage of cluster resources, that is, the total amount and usage of resources in the resource pool 4;
S3, the strategy analysis unit 302 calculates the optimal set of resource nodes from the resource devices and requested amounts of each task in step S1 and the total amount and usage of resources in the resource pool 4. The optimal set of resource nodes is calculated as follows:
T1, the strategy analysis unit 302 obtains from the resource management unit 301 the resource information of the resource pool 4, including the total amount and usage of resources, obtains from the application configuration unit 201 the resource information requested by each task, including the resource devices and requested amounts, and calculates all satisfying resource nodes 401. All satisfying resource nodes 401 are calculated as follows:
T11, the resource management unit 301 obtains the total amount and usage of resources in the resource pool 4 (the resource devices include CPU, GPU, MEM, IO and bandwidth) and calculates the idle CPU, GPU, MEM, IO and bandwidth in the resource pool 4;
T12, the resource management unit 301 rejects, according to the CPU, GPU, MEM, IO and bandwidth requested by the application, the resource nodes 401 that cannot satisfy the requested resources; the resource information of the cluster is thus filtered and screened, yielding all satisfying resource nodes 401;
T2, from the satisfying resource nodes 401 of step T1, the strategy analysis unit 302 uses the analytic hierarchy process to calculate each node's weight for the current task; the node with the smallest weight is the optimal resource node 401 for the current task. The weight of each node for the current task is calculated with the analytic hierarchy process as follows:
A judgment matrix is constructed with the analytic hierarchy process (AHP), and each node's weight for the current task is derived from it. The judgment matrix has the form of formula (1):
A = (a_ij)_{n x n}, where a_ii = 1 and a_ji = 1/a_ij   (1)
where a_ij indicates the importance of index i relative to index j. After the weights are obtained, a consistency check decides whether they are acceptable. The consistency index and consistency ratio are given by formula (2):
CI = (λ_max - n) / (n - 1),  CR = CI / RI   (2)
where λ_max is the largest eigenvalue of the judgment matrix and n is the order of the judgment matrix. RI is the random consistency index, whose values are listed in Table 1:
Table 1: Random consistency index (RI) values
n   1  2  3     4     5     6     7     8     9     10    11
RI  0  0  0.58  0.90  1.12  1.24  1.32  1.41  1.45  1.49  1.51
When the consistency ratio CR < 0.1, the constructed judgment matrix is considered acceptable and can be used to calculate the weights.
T3, the strategy analysis unit 302 subtracts the usage of the optimal resource node 401 of step T2 from the resource information of the cluster, that is, from the resources of the resource pool 4, and repeats steps T1 and T2 to obtain the optimal set of resource nodes for all tasks, A = {w0, w1, ..., wn, p0, p1, ..., pm}, where w denotes a worker node and p denotes a ps node.
S4, the application configuration unit 201 establishes, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node 401. The mapping is established as follows:
The tf.train.ClusterSpec and tf.train.Server parameters of each task in the TensorFlow application are configured automatically, according to the following rules:
tf.train.Server parameter configuration:
for a worker task, the server is declared as tf.train.Server(cluster, job_name="worker", task_index=N), where N is the index of the worker node in the set A;
for a ps task, the server is declared as tf.train.Server(cluster, job_name="ps", task_index=M), where M is the index of the ps node in the set A.
tf.train.ClusterSpec parameter configuration:
the cluster spec is tf.train.ClusterSpec({"worker": ["w0:port", ..., "wn:port"], "ps": ["p0:port", ..., "pm:port"]}), where w0 to wn in the worker list are the IP addresses of all worker nodes in the set A, p0 to pm are the IP addresses of all ps nodes in the set A, and port is the default port number configured in the application configuration unit 201.
S5, after the configuration parameter submission unit 203 of the application configuration center 2 has gathered all the configuration information of the TensorFlow application and the optimal node set, it submits them to the application publishing unit 102, which publishes all tasks of the TensorFlow application onto the designated resource nodes 401 according to the above information.
The foregoing is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the concrete implementation of the invention cannot be regarded as confined to these descriptions. For those of ordinary skill in the art to which the invention belongs, several equivalent substitutions or obvious modifications with identical performance or use may be made without departing from the concept of the invention, and all of them should be considered as falling within the protection scope of the invention.

Claims (9)

1. A deep learning intelligent scheduling method based on TensorFlow, characterized by comprising the following steps:
S1. receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2. obtaining the resource information of the cluster;
S3. calculating an optimal resource node set according to the resource information requested by each task and the resource information of the cluster;
S4. establishing a mapping relationship between each task and its corresponding optimal resource node according to the number of tasks and the optimal resource node set;
S5. releasing the TensorFlow application.
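For illustration only, the method steps of claim 1 could be sketched as the following greedy loop. The data shapes, the capacity-based node ranking (a simple stand-in for the AHP weighting of claim 6), and all names are assumptions, not the patented implementation.

```python
# Hypothetical sketch of steps S1-S5: map each task of a TensorFlow
# application to a satisfying resource node. Data shapes are assumptions.

def schedule(tasks, cluster_free):
    """tasks: list of (name, demand dict); cluster_free: node -> free resources.
    Returns a task -> node mapping (S4), picking per task the satisfying node
    with the most free capacity (stand-in for the AHP weighting of claim 6)."""
    mapping = {}
    free = {n: dict(r) for n, r in cluster_free.items()}
    for name, demand in tasks:                        # S1: requested resources
        # S3: keep only nodes whose free resources satisfy the demand
        fit = [n for n, r in free.items()
               if all(r.get(k, 0) >= v for k, v in demand.items())]
        if not fit:
            raise RuntimeError(f"no node satisfies task {name}")
        best = max(fit, key=lambda n: sum(free[n].values()))
        mapping[name] = best                          # S4: task -> node mapping
        for k, v in demand.items():                   # deduct the placed usage
            free[best][k] -= v
    return mapping                                    # S5 releases with this map

mapping = schedule(
    [("ps0", {"cpu": 1, "mem": 2}), ("worker0", {"cpu": 2, "mem": 4})],
    {"nodeA": {"cpu": 2, "mem": 4}, "nodeB": {"cpu": 8, "mem": 16}},
)
# both tasks land on nodeB, the node with the most remaining capacity
```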
2. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the tasks include worker tasks and ps tasks.
3. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the resource information requested by each task includes the resource devices and the requested amounts of resources.
4. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the resource information of the cluster includes the usage amount of cluster resources and the total amount of cluster resources.
5. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the resource devices include CPU, GPU, MEM, IO and bandwidth.
6. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that step S3 is achieved by the following steps:
T1. calculating all satisfying resource nodes according to the resource information of the cluster and the resource information requested by each task;
T2. according to the satisfying resource nodes in step T1, calculating the weight of each node for the current task using the analytic hierarchy process (AHP), the node with the smallest weight being the optimal resource node for the current task;
T3. subtracting the usage amount of the optimal resource node in step T2 from the resource information of the cluster, and repeating step T1 and step T2 to obtain the optimal resource node set for each task.
7. The deep learning intelligent scheduling method based on TensorFlow according to claim 6, characterized in that step T1 is achieved by the following steps:
T11. obtaining the resource information of the cluster;
T12. rejecting the unsatisfying resource nodes according to the resource information requested by each task and the resource information of the cluster, to obtain all the satisfying resource nodes.
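Steps T1–T3 and T11–T12 could be sketched as below. The fixed criterion weights stand in for AHP-derived ones, and the utilization-ratio scoring is an assumption, since the claims do not fix a concrete weight formula.

```python
# Sketch of T1-T3: filter satisfying nodes (T11-T12), score them with fixed
# criterion weights standing in for AHP-derived ones (T2), pick the smallest
# weight, then deduct the chosen node's usage and repeat (T3). The weights
# and the scoring formula are illustrative assumptions.

CRITERIA = {"cpu": 0.4, "mem": 0.35, "io": 0.25}  # hypothetical AHP weights

def node_weight(free, demand):
    """Lower is better: weighted utilization ratio of placing the task here."""
    return sum(w * (demand.get(k, 0) / free[k]) for k, w in CRITERIA.items())

def pick_nodes(demands, cluster_free):
    """demands: list of per-task demand dicts; returns the chosen node per task."""
    free = {n: dict(r) for n, r in cluster_free.items()}
    chosen = []
    for demand in demands:
        ok = [n for n, r in free.items()                    # T12: reject nodes
              if all(r.get(k, 0) >= demand.get(k, 0) for k in CRITERIA)]
        best = min(ok, key=lambda n: node_weight(free[n], demand))  # T2
        chosen.append(best)
        for k in CRITERIA:                                  # T3: subtract usage
            free[best][k] -= demand.get(k, 0)
    return chosen
```

With two nodes of capacity {cpu: 4, mem: 8, io: 100} and {cpu: 8, mem: 16, io: 100} and a task demanding {cpu: 2, mem: 4, io: 10}, the larger node scores the lower weight and is selected first.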
8. A deep learning intelligent scheduling system based on TensorFlow, characterized by comprising:
an application configuration unit, for receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
a resource management unit, for obtaining the resource information requested by each task from the application configuration unit, and collecting the resource information of the cluster;
a strategy analysis unit, for obtaining the resource information requested by each task and the resource information of the cluster from the resource management unit, and calculating the optimal resource node set;
an application configuration analysis unit, for obtaining the number of tasks from the application configuration unit, obtaining the optimal resource node set from the strategy analysis unit, and establishing the mapping relationship between each task and its corresponding optimal resource node;
a configuration parameter submission unit, for submitting the resource information requested by each task in the application configuration unit to the resource management unit;
an application release unit, for releasing the TensorFlow application.
9. An electronic device, characterized by comprising:
a memory and a processor;
the memory being configured to store computer-executable instructions, and the processor being configured to execute the computer-executable instructions to perform:
S1. receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2. obtaining the resource information of the cluster;
S3. calculating an optimal resource node set according to the resource information requested by each task and the resource information of the cluster;
S4. establishing a mapping relationship between each task and its corresponding optimal resource node according to the number of tasks and the optimal resource node set;
S5. releasing the TensorFlow application.
CN201810962198.XA 2018-08-22 2018-08-22 A kind of deep learning intelligent dispatching method and system based on TensorFlow Pending CN109240814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810962198.XA CN109240814A (en) 2018-08-22 2018-08-22 A kind of deep learning intelligent dispatching method and system based on TensorFlow

Publications (1)

Publication Number Publication Date
CN109240814A true CN109240814A (en) 2019-01-18

Family

ID=65068722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810962198.XA Pending CN109240814A (en) 2018-08-22 2018-08-22 A kind of deep learning intelligent dispatching method and system based on TensorFlow

Country Status (1)

Country Link
CN (1) CN109240814A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111124634A (en) * 2019-12-06 2020-05-08 广东浪潮大数据研究有限公司 Training method and device, electronic equipment and storage medium
CN111400000A (en) * 2020-03-09 2020-07-10 百度在线网络技术(北京)有限公司 Network request processing method, device, equipment and storage medium
CN111984398A (en) * 2019-05-22 2020-11-24 富士通株式会社 Method and computer readable medium for scheduling operations
CN112134812A (en) * 2020-09-08 2020-12-25 华东师范大学 Distributed deep learning performance optimization method based on network bandwidth allocation
WO2021120550A1 (en) * 2019-12-19 2021-06-24 Huawei Technologies Co., Ltd. Methods and apparatus for resource scheduling of resource nodes of a computing cluster or a cloud computing platform
WO2022083777A1 (en) * 2020-10-23 2022-04-28 Huawei Cloud Computing Technologies Co., Ltd. Resource scheduling methods using positive and negative caching, and resource manager implementing the methods
CN114661480A (en) * 2022-05-23 2022-06-24 阿里巴巴(中国)有限公司 Deep learning task resource allocation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster
CN107203424A (en) * 2017-04-17 2017-09-26 北京奇虎科技有限公司 A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies
CN107370796A (en) * 2017-06-30 2017-11-21 香港红鸟科技股份有限公司 A kind of intelligent learning system based on Hyper TF
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 A kind of extensive resource scheduling system and method based on deep learning neutral net
US20180137445A1 (en) * 2016-11-14 2018-05-17 Apptio, Inc. Identifying resource allocation discrepancies
CN108062246A (en) * 2018-01-25 2018-05-22 北京百度网讯科技有限公司 For the resource regulating method and device of deep learning frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI, Guangzhang et al.: "Improvement of Container Cloud Resource Scheduling Strategy", Computer & Digital Engineering *


Similar Documents

Publication Publication Date Title
CN109240814A (en) A kind of deep learning intelligent dispatching method and system based on TensorFlow
CN105550323B (en) Load balance prediction method and prediction analyzer for distributed database
US10354201B1 (en) Scalable clustering for mixed machine learning data
WO2020024442A1 (en) Resource allocation method and apparatus, computer device and computer-readable storage medium
CN108537440A (en) A kind of building scheme project management system based on BIM
CN109491790A (en) Industrial Internet of Things edge calculations resource allocation methods and system based on container
CN106776005A (en) A kind of resource management system and method towards containerization application
CN107508901A (en) Distributed data processing method, apparatus, server and system
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
CN110389820A (en) A kind of private clound method for scheduling task carrying out resources based on v-TGRU model
CN104731595A (en) Big-data-analysis-oriented mixing computing system
CN106775632A (en) A kind of operation flow can flexible expansion high-performance geographic information processing method and system
CN107370796A (en) A kind of intelligent learning system based on Hyper TF
CN112579273B (en) Task scheduling method and device and computer readable storage medium
CN109478147A (en) Adaptive resource management in distributed computing system
CN103116525A (en) Map reduce computing method under internet environment
Cheng et al. Heterogeneity aware workload management in distributed sustainable datacenters
CN115134371A (en) Scheduling method, system, equipment and medium containing edge network computing resources
CN104035819B (en) Scientific workflow scheduling method and device
CN108132840A (en) Resource regulating method and device in a kind of distributed system
Wu et al. Optimizing end-to-end performance of data-intensive computing pipelines in heterogeneous network environments
CN109858789A (en) Human resources visible processing method, device, equipment and readable storage medium storing program for executing
CN109614210A (en) Storm big data energy-saving scheduling method based on energy consumption perception
Liu et al. A probabilistic strategy for setting temporal constraints in scientific workflows
CN106572191A (en) Cross-data center collaborative calculation method and system thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190118)