CN109240814A - A kind of deep learning intelligent dispatching method and system based on TensorFlow - Google Patents

A kind of deep learning intelligent dispatching method and system based on TensorFlow

Info

Publication number
CN109240814A
CN109240814A (application CN201810962198.XA)
Authority
CN
China
Prior art keywords
resource
tensorflow
task
resource information
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810962198.XA
Other languages
Chinese (zh)
Inventor
Wang Yu (王宇)
Cao Xue (曹雪)
Current Assignee
Hunan Shunkang Information Technology Co ltd
Original Assignee
Hunan Shunkang Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Shunkang Information Technology Co Ltd
Priority to CN201810962198.XA
Publication of CN109240814A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a TensorFlow-based deep learning intelligent dispatching method, comprising: S1, receiving the number of tasks contained in a TensorFlow application sent by a user terminal, together with the resource information requested by each task; S2, collecting the resource information of the cluster; S3, calculating an optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster; S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node; S5, publishing the TensorFlow application. The invention also proposes a TensorFlow-based deep learning intelligent dispatching system. Users no longer need to establish the mapping between tasks and resource nodes by hand, which greatly shortens the time spent building those mappings; further, the optimal set of resource nodes is selected automatically from the collected resource information and the application's tasks are published onto those optimal nodes, so resources are used rationally and to the fullest and waste of resources is effectively avoided.

Description

A kind of deep learning intelligent dispatching method and system based on TensorFlow
Technical field
The present invention relates to the field of deep learning applications, and in particular to a TensorFlow-based deep learning intelligent dispatching method and system.
Background technique
In recent years, deep learning has emerged as a new direction in machine-learning research: it builds neural networks that imitate the human brain in performing analysis and learning. By means of deep learning algorithms, humanity can at last find a way to handle the age-old problem of "abstract concepts".
TensorFlow is the computational framework that Google formally open-sourced on November 9, 2015. It supports the various algorithms of deep learning well and is one of the most popular deep-learning libraries; Google formed it by distilling the experience and lessons of its predecessor, DistBelief. It is inherently portable, efficient, and scalable, and can run on different computers.
Fig. 1 shows how a TensorFlow-based distributed application cluster currently operates in production. To run a TensorFlow distributed application cluster, the user must configure all parameters of the cluster in advance and specify which task runs on which port of which host. A necessary condition for creating the TensorFlow distributed application cluster is to start one service for each task. The following work is done for each task:
1. Create a tf.train.ClusterSpec that describes all tasks in the TensorFlow distributed application cluster; this description should be identical for every task.
2. Create a tf.train.Server, pass the parameters from the tf.train.ClusterSpec to its constructor, and write the job name and the index of the current task into the local task.
Under a traditional distributed TensorFlow cluster environment, a TensorFlow distributed application can indeed be run by manually configuring tf.train.ClusterSpec and tf.train.Server. For an ordinary environment with few distributed TensorFlow cluster nodes this is feasible, simple and convenient, but for large-scale TensorFlow distributed applications in a big-data setting it becomes exceedingly complex and hard to maintain. In such settings the number of nodes of a TensorFlow distributed application can reach hundreds or thousands, which means that, when publishing a TensorFlow application, different tf.train.ClusterSpec and tf.train.Server parameters must be configured for each resource node. If every publication of a TensorFlow application required the user to configure tf.train.ClusterSpec and tf.train.Server parameters by hand, that would be unacceptable for any user. Moreover, when publishing an application, the user cannot judge whether the host node a task is published to has enough resources to run it, for example, whether there is enough CPU for task scheduling, or whether the node is equipped with a GPU, so this publication method is not an optimal choice for any user.
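The repetitive manual configuration described above can be sketched as follows (TF 1.x distributed API; the host addresses, port numbers, and the `server_kwargs` helper are illustrative assumptions, not part of the patent):

```python
# Hand-written cluster description of the kind the background describes.
# All addresses and ports below are hypothetical placeholders.
cluster_def = {
    "worker": ["192.168.1.10:2222", "192.168.1.11:2222"],
    "ps": ["192.168.1.20:2223"],
}

def server_kwargs(job_name, task_index):
    """Arguments the user must repeat by hand for every task: the shared
    cluster description plus this task's own job name and index."""
    return {"cluster": cluster_def, "job_name": job_name, "task_index": task_index}

# In an actual TF 1.x deployment each task would then run:
#   cluster = tf.train.ClusterSpec(cluster_def)
#   server = tf.train.Server(cluster, job_name=..., task_index=...)
kwargs = server_kwargs("worker", 1)
```

With hundreds of nodes, writing one such declaration per task is exactly the burden the invention removes.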
Summary of the invention
To solve the prior-art problem that the optimal set of resource nodes cannot be selected automatically when publishing an application, the present invention proposes a TensorFlow-based deep learning intelligent dispatching method and system.
The technical problem of the invention is solved by the following technical solution:
A TensorFlow-based deep learning intelligent dispatching method includes the following steps:
S1, receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, calculating an optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster;
S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
In some preferred embodiments, the tasks include worker tasks and ps tasks.
In some preferred embodiments, the resource information requested by each task includes the resource devices and the requested amount of each resource.
In some preferred embodiments, the resource information of the cluster includes the usage of cluster resources and the total amount of cluster resources.
In some preferred embodiments, the resource devices include CPU, GPU, MEM, IO and bandwidth.
In some preferred embodiments, step S3 is achieved by the following steps:
T1, calculating all satisfying resource nodes from the resource information of the cluster and the resource information requested by each task;
T2, from the satisfying resource nodes of step T1, using the analytic hierarchy process (AHP) to calculate each node's weight for the current task, the node with the smallest weight being the optimal resource node for the current task;
T3, subtracting the usage of the optimal resource node of step T2 from the resource information of the cluster, and repeating steps T1 and T2 to obtain the optimal set of resource nodes for all tasks.
In some further preferred embodiments, step T1 is achieved by the following steps:
T11, collecting the resource information of the cluster;
T12, rejecting the unsatisfying resource nodes according to the resource information requested by each task and the resource information of the cluster, thereby obtaining all satisfying resource nodes.
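Steps T11 and T12 amount to a feasibility filter over the cluster's nodes. A minimal sketch under assumed data shapes (the node and request dictionaries are hypothetical; the patent does not fix a representation):

```python
def free_resources(node):
    """Idle amount of each resource device on a node: total minus used (step T11)."""
    return {dev: node["total"][dev] - node["used"][dev] for dev in node["total"]}

def filter_nodes(nodes, request):
    """Step T12: reject nodes whose idle CPU/GPU/MEM/IO/bandwidth cannot
    cover the task's requested amounts; keep the satisfying nodes."""
    ok = []
    for name, node in nodes.items():
        free = free_resources(node)
        if all(free.get(dev, 0) >= amount for dev, amount in request.items()):
            ok.append(name)
    return ok

nodes = {
    "n1": {"total": {"cpu": 8, "gpu": 1, "mem": 32}, "used": {"cpu": 6, "gpu": 1, "mem": 8}},
    "n2": {"total": {"cpu": 16, "gpu": 2, "mem": 64}, "used": {"cpu": 2, "gpu": 0, "mem": 16}},
}
request = {"cpu": 4, "gpu": 1, "mem": 16}
satisfying = filter_nodes(nodes, request)  # n1 has no idle GPU and too little CPU
```

Only the nodes surviving this filter are passed on to the AHP weighting of step T2.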
The present invention also proposes a TensorFlow-based deep learning intelligent dispatching system, comprising:
an application configuration unit for receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task;
a resource management unit for obtaining the resource information requested by each task from the application configuration unit and collecting the resource information of the cluster;
a strategy analysis unit for obtaining, from the resource management unit, the resource information requested by each task and the resource information of the cluster, and calculating the optimal set of resource nodes;
an application configuration analysis unit for obtaining the number of tasks from the application configuration unit, obtaining the optimal set of resource nodes from the strategy analysis unit, and establishing the mapping between each task and its corresponding optimal resource node;
a configuration parameter submission unit for submitting the resource information requested by each task in the application configuration unit to the resource management unit;
an application publishing unit for publishing the TensorFlow application.
The present invention also proposes an electronic device, comprising:
a memory and a processor;
the memory stores computer-executable instructions, and the processor executes the computer-executable instructions to perform:
S1, receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, calculating an optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster;
S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
The beneficial effects of the present invention over the prior art include:
The TensorFlow-based deep learning intelligent dispatching method of the present invention includes the following steps: S1, receiving the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; S2, collecting the resource information of the cluster; S3, calculating the optimal set of resource nodes from the resource information requested by each task and the resource information of the cluster; S4, establishing, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node; S5, publishing the TensorFlow application. Because the optimal set of resource nodes is calculated from the resource information requested by each task and the resource information of the cluster, and the task-to-node mappings are then derived from the number of tasks and that node set before the application is published, the user no longer has to build the mapping between each task and its resource node by hand. This greatly shortens the time spent establishing the mappings while also effectively improving their correctness. Further, selecting the optimal set of resource nodes automatically from the collected resource information and publishing the application's tasks onto those optimal nodes makes the most rational use of resources and effectively avoids waste of resources.
Detailed description of the invention
Fig. 1 is a flowchart of the TensorFlow-based deep learning intelligent dispatching method in an embodiment of the present invention;
Fig. 2 is a flowchart of the concrete realization of step S3 in Fig. 1;
Fig. 3 is a flowchart of the concrete realization of step T1 in Fig. 2;
Fig. 4 is the system architecture diagram of the TensorFlow-based deep learning intelligent dispatching system in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the electronic device in an embodiment of the present invention.
Wherein: 1, application management center; 101, TensorFlow cluster initialization unit; 102, application publishing unit; 2, application configuration center; 201, application configuration unit; 202, application configuration analysis unit; 203, configuration parameter submission unit; 3, resource scheduling center; 301, resource management unit; 302, strategy analysis unit; 4, resource pool; 401, resource node; 5, electronic device; 501, processor; 502, memory.
Specific embodiment
The invention will be further described below with reference to the accompanying drawings and in conjunction with preferred embodiments.
With reference to Figs. 1-3, the TensorFlow-based deep learning intelligent dispatching method in this embodiment includes the following steps:
S1, the application configuration unit 201 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks includes the number of worker tasks and the number of ps tasks; the resource information requested by each task includes the resource devices and the requested amount of each resource device.
S2, the resource management unit 301 collects the resource information of the cluster, including the total amount and the usage of cluster resources, that is, the total amount and usage of resources in the resource pool 4.
S3, the strategy analysis unit 302 calculates the optimal set of resource nodes from the resource devices and requested amounts of each task in step S1 and the total amount and usage of resources in the resource pool 4. The optimal set of resource nodes is calculated as follows:
T1, the strategy analysis unit 302 obtains from the resource management unit 301 the resource information of the resource pool 4, including the total amount and usage of resources, obtains from the application configuration unit 201 the resource information requested by each task, including the resource devices and requested amounts, and calculates all satisfying resource nodes 401. All satisfying resource nodes 401 are calculated as follows:
T11, the resource management unit 301 obtains the total amount and usage of resources in the resource pool 4 (the resource devices include CPU, GPU, MEM, IO and bandwidth) and calculates the idle CPU, GPU, MEM, IO and bandwidth in the resource pool 4;
T12, the resource management unit 301 rejects, according to the CPU, GPU, MEM, IO and bandwidth requested by the application, the resource nodes 401 that cannot satisfy the requested resources; the resource information of the cluster is thus filtered and screened, yielding all satisfying resource nodes 401;
T2, from the satisfying resource nodes 401 of step T1, the strategy analysis unit 302 uses the analytic hierarchy process to calculate each node's weight for the current task; the node with the smallest weight is the optimal resource node 401 for the current task. The weight of each node for the current task is calculated with the analytic hierarchy process as follows:
A judgment matrix is constructed with the analytic hierarchy process (AHP), and each node's weight for the current task is derived from it. The judgment matrix has the form of formula (1):
A = (a_ij)_{n x n}, where a_ii = 1 and a_ji = 1/a_ij   (1)
where a_ij indicates the importance of index i relative to index j. After the weights are obtained, a consistency check decides whether they are acceptable. The consistency index and consistency ratio are given by formula (2):
CI = (λ_max - n) / (n - 1),  CR = CI / RI   (2)
where λ_max is the largest eigenvalue of the judgment matrix and n is the order of the judgment matrix. RI is the random consistency index, whose values are listed in Table 1:
Table 1: Random consistency index (RI) values
n   1  2  3     4     5     6     7     8     9     10    11
RI  0  0  0.58  0.90  1.12  1.24  1.32  1.41  1.45  1.49  1.51
When the consistency ratio CR < 0.1, the constructed judgment matrix is considered acceptable and can be used to calculate the weights.
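The patent names the analytic hierarchy process but does not spell out how the eigenvector is computed. The sketch below uses the common geometric-mean approximation of the principal eigenvector together with the standard consistency check CI = (λ_max - n)/(n - 1), CR = CI/RI, with the RI values of Table 1; the example judgment matrix is hypothetical:

```python
import math

RI = (0, 0, 0.58, 0.90, 1.12, 1.24, 1.32, 1.41, 1.45, 1.49, 1.51)  # Table 1

def ahp_weights(matrix):
    """Return (weights, CR): the geometric-mean approximation of the
    principal eigenvector of a judgment matrix, and its consistency ratio."""
    n = len(matrix)
    geo = [math.prod(row) ** (1.0 / n) for row in matrix]  # row geometric means
    total = sum(geo)
    w = [g / total for g in geo]                           # normalized weights
    # lambda_max estimated as the mean of (A w)_i / w_i
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)                               # consistency index
    cr = ci / RI[n - 1] if RI[n - 1] else 0.0              # consistency ratio
    return w, cr

# A perfectly consistent 3x3 matrix (a_ij = w_i / w_j): CR should be ~0
m = [[1, 2, 4], [0.5, 1, 2], [0.25, 0.5, 1]]
w, cr = ahp_weights(m)
```

For this matrix the weights come out near (4/7, 2/7, 1/7) and CR is essentially zero, so by the CR < 0.1 rule the matrix would be accepted.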
T3, the strategy analysis unit 302 subtracts the usage of the optimal resource node 401 of step T2 from the resource information of the cluster, that is, from the resources of the resource pool 4, and repeats steps T1 and T2 to obtain the optimal set of resource nodes for all tasks, A = {w0, w1, ..., wn, p0, p1, ..., pm}, where w denotes a worker node and p denotes a ps node.
S4, the application configuration unit 201 establishes, from the number of worker tasks and ps tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node 401. The mapping is established as follows:
The tf.train.ClusterSpec and tf.train.Server parameters of each task in the TensorFlow application are configured automatically, according to the following rules:
tf.train.Server parameter configuration:
for a worker task, the server is declared as tf.train.Server(cluster, job_name="worker", task_index=N), where N is the index of the worker node in the set A;
for a ps task, the server is declared as tf.train.Server(cluster, job_name="ps", task_index=M), where M is the index of the ps node in the set A.
tf.train.ClusterSpec parameter configuration:
the cluster spec is tf.train.ClusterSpec({"worker": ["w0:port", ..., "wn:port"], "ps": ["p0:port", ..., "pm:port"]}), where w0 to wn in the worker list are the IP addresses of all worker nodes in the set A, p0 to pm are the IP addresses of all ps nodes in the set A, and port is the default port number configured in the application configuration unit 201.
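The auto-configuration rules above can be sketched as a small generator of the ClusterSpec argument and the per-task Server keyword arguments; the IP addresses and the default port are illustrative assumptions:

```python
def build_task_params(workers, ps, port=2222):
    """From an optimal node set A = {w0..wn, p0..pm}, auto-generate the
    tf.train.ClusterSpec argument and one tf.train.Server kwargs dict per
    task, following the rules above (worker indices first, then ps)."""
    cluster_def = {
        "worker": [f"{ip}:{port}" for ip in workers],
        "ps": [f"{ip}:{port}" for ip in ps],
    }
    servers = (
        [{"job_name": "worker", "task_index": i} for i in range(len(workers))]
        + [{"job_name": "ps", "task_index": i} for i in range(len(ps))]
    )
    return cluster_def, servers

# Hypothetical optimal node set: two worker nodes, one ps node
cluster_def, servers = build_task_params(["10.0.0.1", "10.0.0.2"], ["10.0.0.3"])
# Each entry of `servers`, together with cluster_def, would be passed to
# tf.train.ClusterSpec / tf.train.Server when the application is published.
```

This is exactly the by-hand configuration of the background section, now derived mechanically from the scheduling result.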
S5, after the configuration parameter submission unit 203 of the application configuration center 2 has gathered all the configuration information of the TensorFlow application and the optimal node set, it submits them to the application publishing unit 102, which publishes all worker tasks and ps tasks of the TensorFlow application onto the designated resource nodes 401 according to the above information.
With reference to Fig. 4, the TensorFlow-based deep learning intelligent dispatching system in this embodiment includes an application configuration center 2, a resource scheduling center 3, an application management center 1 and a resource pool 4.
The application configuration center 2 includes the application configuration unit 201, the application configuration analysis unit 202 and the configuration parameter submission unit 203; the resource scheduling center 3 includes the resource management unit 301 and the strategy analysis unit 302; the application management center 1 includes the application publishing unit 102 and the TensorFlow cluster initialization unit 101.
The application configuration center 2 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks is the sum of worker tasks and ps tasks; the resource information requested by each task includes the resource devices and the requested amounts; the resource devices include CPU, MEM, GPU, IO and bandwidth. The application configuration analysis unit 202 obtains the number of tasks from the application configuration unit 201, obtains the optimal set of resource nodes from the strategy analysis unit 302, and establishes the mapping between each task and its corresponding optimal resource node 401. The configuration parameter submission unit 203 submits the resource information requested by each task in the application configuration unit 201 to the resource management unit 301 and, after gathering all configuration information of the TensorFlow application and the optimal node set, submits them to the application publishing unit 102.
The resource management unit 301 obtains the resource information requested by each task from the application configuration unit 201 and collects the resource information of the cluster; the strategy analysis unit 302 obtains from the resource management unit 301 the resource information requested by each task and the resource information of the cluster, and calculates the optimal set of resource nodes.
The application publishing unit 102 publishes the TensorFlow application: according to the configuration information and the optimal node set, it publishes all tasks of the TensorFlow application onto the designated resource nodes 401. The TensorFlow cluster initialization unit 101 initializes the TensorFlow cluster.
The resource pool 4 includes all available resource nodes 401 in the TensorFlow cluster; the resource devices on each node include CPU, MEM, GPU, IO and bandwidth.
When a user publishes a TensorFlow application, the resource information requested by each task in the application and the number of tasks it contains are sent to the application configuration center 2, and the TensorFlow cluster is initialized. The application configuration unit 201, application configuration analysis unit 202 and configuration parameter submission unit 203 of the application configuration center 2 analyze the tf.train.ClusterSpec and tf.train.Server parameter information each task requires, along with the resource information requested by each task. The resource management unit 301 and strategy analysis unit 302 of the resource scheduling center 3 then calculate the optimal set of resource nodes for scheduling the TensorFlow application and deliver the optimal node set information to the application configuration analysis unit 202, which calculates the tf.train.ClusterSpec and tf.train.Server parameter information of each task from the optimal resource node 401 information and the number of tasks. Finally, all information is submitted to the application publishing unit 102, which, using the tf.train.ClusterSpec parameter information and the tf.train.Server parameter information, publishes the application onto the optimal set of resource nodes in the TensorFlow cluster without the user having to configure any parameter of the TensorFlow application by hand.
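Put together, the publication flow reduces to a greedy loop: filter the feasible nodes for a task, rank them, place the task, then deduct its usage before scheduling the next task. The sketch below substitutes a most-idle-CPU rule for the AHP ranking and uses hypothetical resource dictionaries; it is an illustration of the loop's shape, not the patent's exact weighting:

```python
def schedule(tasks, nodes):
    """Greedy placement: per task, keep nodes whose free resources cover the
    request, pick one (here: most idle CPU, standing in for the AHP weight
    ranking of step T2), then subtract the request (step T3)."""
    placement = {}
    for name, req in tasks.items():
        feasible = [
            n for n, free in nodes.items()
            if all(free.get(d, 0) >= amount for d, amount in req.items())
        ]
        best = max(feasible, key=lambda n: nodes[n]["cpu"])
        placement[name] = best
        for d, amount in req.items():
            nodes[best][d] -= amount  # deduct so later tasks see updated capacity
    return placement

# Hypothetical free capacities and task requests
nodes = {"n1": {"cpu": 8, "mem": 32}, "n2": {"cpu": 4, "mem": 16}}
tasks = {"worker0": {"cpu": 6, "mem": 8}, "ps0": {"cpu": 2, "mem": 4}}
placement = schedule(tasks, nodes)
```

Here worker0 can only fit on n1; after its usage is deducted, ps0 goes to n2, which then has the most idle CPU.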
With reference to Fig. 5, the electronic device 5 in this embodiment includes a memory 502 and a processor 501;
the memory 502 stores computer-executable instructions, and the processor 501 executes the computer-executable instructions to perform:
S1, the application configuration unit 201 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks includes the number of worker tasks and the number of ps tasks; the resource information requested by each task includes the resource devices and the requested amount of each resource device;
S2, the resource management unit 301 collects the resource information of the cluster, including the total amount and the usage of cluster resources, that is, the total amount and usage of resources in the resource pool 4;
S3, the strategy analysis unit 302 calculates the optimal set of resource nodes from the resource devices and requested amounts of each task in step S1 and the total amount and usage of resources in the resource pool 4. The optimal set of resource nodes is calculated as follows:
T1, the strategy analysis unit 302 obtains from the resource management unit 301 the resource information of the resource pool 4, including the total amount and usage of resources, obtains from the application configuration unit 201 the resource information requested by each task, including the resource devices and requested amounts, and calculates all satisfying resource nodes 401. All satisfying resource nodes 401 are calculated as follows:
T11, the resource management unit 301 obtains the total amount and usage of resources in the resource pool 4 (the resource devices include CPU, GPU, MEM, IO and bandwidth) and calculates the idle CPU, GPU, MEM, IO and bandwidth in the resource pool 4;
T12, the resource management unit 301 rejects, according to the CPU, GPU, MEM, IO and bandwidth requested by the application, the resource nodes 401 that cannot satisfy the requested resources; the resource information of the cluster is thus filtered and screened, yielding all satisfying resource nodes 401;
T2, from the satisfying resource nodes 401 of step T1, the strategy analysis unit 302 uses the analytic hierarchy process to calculate each node's weight for the current task; the node with the smallest weight is the optimal resource node 401 for the current task. The weight of each node for the current task is calculated with the analytic hierarchy process as follows:
A judgment matrix is constructed with the analytic hierarchy process (AHP), and each node's weight for the current task is derived from it. The judgment matrix has the form of formula (1):
A = (a_ij)_{n x n}, where a_ii = 1 and a_ji = 1/a_ij   (1)
where a_ij indicates the importance of index i relative to index j. After the weights are obtained, a consistency check decides whether they are acceptable. The consistency index and consistency ratio are given by formula (2):
CI = (λ_max - n) / (n - 1),  CR = CI / RI   (2)
where λ_max is the largest eigenvalue of the judgment matrix and n is the order of the judgment matrix. RI is the random consistency index, whose values are listed in Table 1:
Table 1: Random consistency index (RI) values
n   1  2  3     4     5     6     7     8     9     10    11
RI  0  0  0.58  0.90  1.12  1.24  1.32  1.41  1.45  1.49  1.51
When the consistency ratio CR < 0.1, the constructed judgment matrix is considered acceptable and can be used to calculate the weights.
T3, the strategy analysis unit 302 subtracts the usage of the optimal resource node 401 of step T2 from the resource information of the cluster, that is, from the resources of the resource pool 4, and repeats steps T1 and T2 to obtain the optimal set of resource nodes for all tasks, A = {w0, w1, ..., wn, p0, p1, ..., pm}, where w denotes a worker node and p denotes a ps node.
S4, the application configuration unit 201 establishes, from the number of tasks and the optimal set of resource nodes, the mapping between each task and its corresponding optimal resource node 401. The mapping is established as follows:
The tf.train.ClusterSpec and tf.train.Server parameters of each task in the TensorFlow application are configured automatically, according to the following rules:
tf.train.Server parameter configuration:
for a worker task, the server is declared as tf.train.Server(cluster, job_name="worker", task_index=N), where N is the index of the worker node in the set A;
for a ps task, the server is declared as tf.train.Server(cluster, job_name="ps", task_index=M), where M is the index of the ps node in the set A.
tf.train.ClusterSpec parameter configuration:
the cluster spec is tf.train.ClusterSpec({"worker": ["w0:port", ..., "wn:port"], "ps": ["p0:port", ..., "pm:port"]}), where w0 to wn in the worker list are the IP addresses of all worker nodes in the set A, p0 to pm are the IP addresses of all ps nodes in the set A, and port is the default port number configured in the application configuration unit 201.
S5, after the configuration parameter submission unit 203 of the application configuration center 2 has gathered all the configuration information of the TensorFlow application and the optimal node set, it submits them to the application publishing unit 102, which publishes all tasks of the TensorFlow application onto the designated resource nodes 401 according to the above information.
The foregoing is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the concrete implementation of the invention cannot be regarded as confined to these descriptions. For those of ordinary skill in the art to which the invention belongs, several equivalent substitutions or obvious modifications with identical performance or use may be made without departing from the concept of the invention, and all of them should be considered as falling within the protection scope of the invention.

Claims (9)

1. A deep learning intelligent scheduling method based on TensorFlow, characterized by comprising the following steps:
S1. receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2. obtaining the resource information of the cluster;
S3. calculating an optimal resource node set according to the resource information requested by each task and the resource information of the cluster;
S4. establishing a mapping relationship between each task and its corresponding optimal resource node according to the number of tasks and the optimal resource node set;
S5. releasing the TensorFlow application.
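For illustration only, the method steps of claim 1 could be sketched as the following greedy loop. The data shapes, the capacity-based node ranking (a simple stand-in for the AHP weighting of claim 6), and all names are assumptions, not the patented implementation.

```python
# Hypothetical sketch of steps S1-S5: map each task of a TensorFlow
# application to a satisfying resource node. Data shapes are assumptions.

def schedule(tasks, cluster_free):
    """tasks: list of (name, demand dict); cluster_free: node -> free resources.
    Returns a task -> node mapping (S4), picking per task the satisfying node
    with the most free capacity (stand-in for the AHP weighting of claim 6)."""
    mapping = {}
    free = {n: dict(r) for n, r in cluster_free.items()}
    for name, demand in tasks:                        # S1: requested resources
        # S3: keep only nodes whose free resources satisfy the demand
        fit = [n for n, r in free.items()
               if all(r.get(k, 0) >= v for k, v in demand.items())]
        if not fit:
            raise RuntimeError(f"no node satisfies task {name}")
        best = max(fit, key=lambda n: sum(free[n].values()))
        mapping[name] = best                          # S4: task -> node mapping
        for k, v in demand.items():                   # deduct the placed usage
            free[best][k] -= v
    return mapping                                    # S5 releases with this map

mapping = schedule(
    [("ps0", {"cpu": 1, "mem": 2}), ("worker0", {"cpu": 2, "mem": 4})],
    {"nodeA": {"cpu": 2, "mem": 4}, "nodeB": {"cpu": 8, "mem": 16}},
)
# both tasks land on nodeB, the node with the most remaining capacity
```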
2. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the tasks include worker tasks and ps tasks.
3. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the resource information requested by each task includes the resource devices and the requested amounts of resources.
4. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the resource information of the cluster includes the usage amount of cluster resources and the total amount of cluster resources.
5. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that the resource devices include CPU, GPU, MEM, IO and bandwidth.
6. The deep learning intelligent scheduling method based on TensorFlow according to claim 1, characterized in that step S3 is achieved by the following steps:
T1. calculating all satisfying resource nodes according to the resource information of the cluster and the resource information requested by each task;
T2. according to the satisfying resource nodes in step T1, calculating the weight of each node for the current task using the analytic hierarchy process (AHP), the node with the smallest weight being the optimal resource node for the current task;
T3. subtracting the usage amount of the optimal resource node in step T2 from the resource information of the cluster, and repeating step T1 and step T2 to obtain the optimal resource node set for each task.
7. The deep learning intelligent scheduling method based on TensorFlow according to claim 6, characterized in that step T1 is achieved by the following steps:
T11. obtaining the resource information of the cluster;
T12. rejecting the unsatisfying resource nodes according to the resource information requested by each task and the resource information of the cluster, to obtain all the satisfying resource nodes.
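Steps T1–T3 and T11–T12 could be sketched as below. The fixed criterion weights stand in for AHP-derived ones, and the utilization-ratio scoring is an assumption, since the claims do not fix a concrete weight formula.

```python
# Sketch of T1-T3: filter satisfying nodes (T11-T12), score them with fixed
# criterion weights standing in for AHP-derived ones (T2), pick the smallest
# weight, then deduct the chosen node's usage and repeat (T3). The weights
# and the scoring formula are illustrative assumptions.

CRITERIA = {"cpu": 0.4, "mem": 0.35, "io": 0.25}  # hypothetical AHP weights

def node_weight(free, demand):
    """Lower is better: weighted utilization ratio of placing the task here."""
    return sum(w * (demand.get(k, 0) / free[k]) for k, w in CRITERIA.items())

def pick_nodes(demands, cluster_free):
    """demands: list of per-task demand dicts; returns the chosen node per task."""
    free = {n: dict(r) for n, r in cluster_free.items()}
    chosen = []
    for demand in demands:
        ok = [n for n, r in free.items()                    # T12: reject nodes
              if all(r.get(k, 0) >= demand.get(k, 0) for k in CRITERIA)]
        best = min(ok, key=lambda n: node_weight(free[n], demand))  # T2
        chosen.append(best)
        for k in CRITERIA:                                  # T3: subtract usage
            free[best][k] -= demand.get(k, 0)
    return chosen
```

With two nodes of capacity {cpu: 4, mem: 8, io: 100} and {cpu: 8, mem: 16, io: 100} and a task demanding {cpu: 2, mem: 4, io: 10}, the larger node scores the lower weight and is selected first.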
8. A deep learning intelligent scheduling system based on TensorFlow, characterized by comprising:
an application configuration unit, for receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
a resource management unit, for obtaining the resource information requested by each task from the application configuration unit, and collecting the resource information of the cluster;
a strategy analysis unit, for obtaining the resource information requested by each task and the resource information of the cluster from the resource management unit, and calculating the optimal resource node set;
an application configuration analysis unit, for obtaining the number of tasks from the application configuration unit, obtaining the optimal resource node set from the strategy analysis unit, and establishing the mapping relationship between each task and its corresponding optimal resource node;
a configuration parameter submission unit, for submitting the resource information requested by each task in the application configuration unit to the resource management unit;
an application release unit, for releasing the TensorFlow application.
9. An electronic device, characterized by comprising:
a memory and a processor;
the memory being configured to store computer-executable instructions, and the processor being configured to execute the computer-executable instructions to perform:
S1. receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2. obtaining the resource information of the cluster;
S3. calculating an optimal resource node set according to the resource information requested by each task and the resource information of the cluster;
S4. establishing a mapping relationship between each task and its corresponding optimal resource node according to the number of tasks and the optimal resource node set;
S5. releasing the TensorFlow application.
CN201810962198.XA 2018-08-22 2018-08-22 A kind of deep learning intelligent dispatching method and system based on TensorFlow Pending CN109240814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810962198.XA CN109240814A (en) 2018-08-22 2018-08-22 A kind of deep learning intelligent dispatching method and system based on TensorFlow

Publications (1)

Publication Number Publication Date
CN109240814A true CN109240814A (en) 2019-01-18

Family

ID=65068722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810962198.XA Pending CN109240814A (en) 2018-08-22 2018-08-22 A kind of deep learning intelligent dispatching method and system based on TensorFlow

Country Status (1)

Country Link
CN (1) CN109240814A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111124634A (en) * 2019-12-06 2020-05-08 广东浪潮大数据研究有限公司 Training method and device, electronic equipment and storage medium
CN111400000A (en) * 2020-03-09 2020-07-10 百度在线网络技术(北京)有限公司 Network request processing method, device, equipment and storage medium
CN111984398A (en) * 2019-05-22 2020-11-24 富士通株式会社 Method and computer readable medium for scheduling operations
CN112134812A (en) * 2020-09-08 2020-12-25 华东师范大学 Distributed deep learning performance optimization method based on network bandwidth allocation
WO2021120550A1 (en) * 2019-12-19 2021-06-24 Huawei Technologies Co., Ltd. Methods and apparatus for resource scheduling of resource nodes of a computing cluster or a cloud computing platform
WO2022083777A1 (en) * 2020-10-23 2022-04-28 Huawei Cloud Computing Technologies Co., Ltd. Resource scheduling methods using positive and negative caching, and resource manager implementing the methods
CN114661480A (en) * 2022-05-23 2022-06-24 阿里巴巴(中国)有限公司 Deep learning task resource allocation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster
CN107203424A (en) * 2017-04-17 2017-09-26 北京奇虎科技有限公司 A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies
CN107370796A (en) * 2017-06-30 2017-11-21 香港红鸟科技股份有限公司 A kind of intelligent learning system based on Hyper TF
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 A kind of extensive resource scheduling system and method based on deep learning neutral net
US20180137445A1 (en) * 2016-11-14 2018-05-17 Apptio, Inc. Identifying resource allocation discrepancies
CN108062246A (en) * 2018-01-25 2018-05-22 北京百度网讯科技有限公司 For the resource regulating method and device of deep learning frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI, Guangzhang et al.: "Improvement of Container Cloud Resource Scheduling Strategy", Computer & Digital Engineering *


Similar Documents

Publication Publication Date Title
CN109240814A (en) A kind of deep learning intelligent dispatching method and system based on TensorFlow
CN105550323B (en) Load balance prediction method and prediction analyzer for distributed database
US10354201B1 (en) Scalable clustering for mixed machine learning data
WO2020024442A1 (en) Resource allocation method and apparatus, computer device and computer-readable storage medium
CN108537440A (en) A kind of building scheme project management system based on BIM
CN109491790A (en) Industrial Internet of Things edge calculations resource allocation methods and system based on container
CN106776005A (en) A kind of resource management system and method towards containerization application
CN107508901A (en) Distributed data processing method, apparatus, server and system
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
CN110389820A (en) A kind of private clound method for scheduling task carrying out resources based on v-TGRU model
CN104731595A (en) Big-data-analysis-oriented mixing computing system
CN106775632A (en) A kind of operation flow can flexible expansion high-performance geographic information processing method and system
CN107370796A (en) A kind of intelligent learning system based on Hyper TF
CN112579273B (en) Task scheduling method and device and computer readable storage medium
CN109478147A (en) Adaptive resource management in distributed computing system
CN103116525A (en) Map reduce computing method under internet environment
Cheng et al. Heterogeneity aware workload management in distributed sustainable datacenters
CN115134371A (en) Scheduling method, system, equipment and medium containing edge network computing resources
CN104035819B (en) Scientific workflow scheduling method and device
CN108132840A (en) Resource regulating method and device in a kind of distributed system
Wu et al. Optimizing end-to-end performance of data-intensive computing pipelines in heterogeneous network environments
CN109858789A (en) Human resources visible processing method, device, equipment and readable storage medium storing program for executing
CN109614210A (en) Storm big data energy-saving scheduling method based on energy consumption perception
Liu et al. A probabilistic strategy for setting temporal constraints in scientific workflows
CN106572191A (en) Cross-data center collaborative calculation method and system thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190118)