CN109240814A - A kind of deep learning intelligent dispatching method and system based on TensorFlow - Google Patents
- Publication number
- CN109240814A (application number CN201810962198.XA)
- Authority
- CN
- China
- Prior art keywords
- resource
- tensorflow
- task
- resource information
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a TensorFlow-based deep learning intelligent scheduling method, comprising: S1, receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task; S2, collecting the resource information of the cluster; S3, computing an optimal resource node set from the resource information requested by each task and the resource information of the cluster; S4, establishing, according to the number of tasks and the optimal resource node set, a mapping between each task and its corresponding optimal resource node; S5, publishing the TensorFlow application. The invention also proposes a TensorFlow-based deep learning intelligent scheduling system. The user no longer needs to establish the mapping between each task and its resource node manually, which greatly shortens the time spent building that mapping; furthermore, the optimal resource node set is selected automatically from the collected resource information and the application's tasks are published onto the optimal resource nodes, making maximal, rational use of resources and effectively avoiding waste.
Description
Technical field
The present invention relates to the field of deep learning applications, and more particularly to a TensorFlow-based deep learning intelligent scheduling method and system.
Background technique
In recent years, deep learning has emerged as a new technique in machine learning research. Its motivation lies in building neural networks that imitate the human brain's mechanisms of analysis and learning. With the help of deep learning algorithms, mankind has finally found an approach to the age-old problem of handling "abstract concepts".
TensorFlow is the computational framework that Google officially open-sourced on November 9, 2015. The TensorFlow framework supports the various algorithms of deep learning well and is one of the most popular deep learning libraries; Google distilled it from the experience and lessons of its predecessor, DistBelief. It is inherently portable, efficient, and scalable, and can run on many different machines.
Fig. 1 shows how a distributed TensorFlow application cluster currently operates in a production environment. To run such a cluster, the user must configure all of its parameters in advance and specify which task runs on which port of which host. A necessary condition for creating the distributed application cluster is starting one service per task. The following work is done for each task:
1. Create a tf.train.ClusterSpec describing all of the tasks in the distributed application cluster; this description should be identical for every task.
2. Create a tf.train.Server, passing the tf.train.ClusterSpec to its constructor together with the job name and task index of the current task.
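As a minimal, hypothetical sketch of the manual configuration just described, the plain-Python snippet below builds the job-to-address dictionary that a tf.train.ClusterSpec would be constructed from and enumerates the (job_name, task_index) pair each task's tf.train.Server would receive. The hosts and ports are invented for illustration, and the actual TensorFlow calls appear only in comments so the sketch carries no TensorFlow dependency.

```python
# Manual configuration of a distributed TensorFlow cluster. This mapping is
# the dictionary a tf.train.ClusterSpec is built from; it must be identical
# for every task. Hosts and ports are hypothetical.
cluster_def = {
    "worker": ["192.168.0.1:2222", "192.168.0.2:2222"],
    "ps": ["192.168.0.3:2223"],
}

def server_args(cluster_def):
    """List the (job_name, task_index) pair that each task's
    tf.train.Server would be constructed with."""
    args = []
    for job_name, addresses in cluster_def.items():
        for task_index in range(len(addresses)):
            # In real TensorFlow 1.x code this step would be:
            #   spec = tf.train.ClusterSpec(cluster_def)
            #   tf.train.Server(spec, job_name=job_name, task_index=task_index)
            args.append((job_name, task_index))
    return args

print(server_args(cluster_def))
```

For the two-worker, one-ps cluster above, this yields one server description per task, which is exactly the per-task bookkeeping the user must otherwise maintain by hand.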
Under a traditional distributed TensorFlow cluster environment, a distributed TensorFlow application can be run by manually configuring tf.train.ClusterSpec and tf.train.Server. This is feasible, simple, and convenient for environments with relatively few cluster nodes, but for large-scale TensorFlow applications under big data it becomes extremely complex and hard to maintain. In that setting, the number of nodes of a distributed TensorFlow application can reach into the hundreds or thousands, which means that, when publishing the application, different tf.train.ClusterSpec and tf.train.Server parameters must be configured for each resource node. Requiring the user to configure these parameters by hand for every publication is unacceptable for any user. In addition, when publishing an application, the user cannot judge whether the host node a task is published to has enough resources to run it — for example, whether enough CPU is available for task scheduling, or whether a GPU is installed — so this publication method is not an optimal choice for any user.
Summary of the invention
To solve the prior-art problem that an optimal resource node set cannot be selected automatically for application publication, the present invention proposes a TensorFlow-based deep learning intelligent scheduling method and system.
The technical problem of the invention is solved by the following technical solution:
A TensorFlow-based deep learning intelligent scheduling method comprises the following steps:
S1, receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, computing an optimal resource node set from the resource information requested by each task and the resource information of the cluster;
S4, establishing, according to the number of tasks and the optimal resource node set, a mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
In some preferred embodiments, the tasks include worker tasks and ps tasks.
In some preferred embodiments, the resource information requested by each task includes the resource devices and the requested amounts.
In some preferred embodiments, the resource information of the cluster includes the cluster resource usage and the cluster resource total.
In some preferred embodiments, the resource devices include CPU, GPU, MEM, IO, and bandwidth.
In some preferred embodiments, step S3 is achieved by the following steps:
T1, computing, from the resource information of the cluster and the resource information requested by each task, all of the satisfying resource nodes;
T2, computing, with the analytic hierarchy process (AHP), the weight of each satisfying node from step T1 with respect to the current task; the node with the smallest weight is the current task's optimal resource node;
T3, subtracting the usage of the optimal resource node from step T2 from the resource information of the cluster, then repeating steps T1 and T2 to obtain the optimal resource node set over all tasks.
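The loop of steps T1–T3 can be sketched as below. This is an illustrative reading of the method, not its implementation: a simple best-fit score (free CPU remaining after placement) stands in for the AHP weight that step T2 actually computes, and the node and request dictionaries are hypothetical.

```python
def pick_nodes(free, requests):
    """Greedy loop of steps T1-T3: keep the feasible nodes (T1), take the
    node with the smallest weight (T2), subtract the task's usage (T3)."""
    chosen = []
    for req in requests:
        # T1: a node is feasible iff every requested device amount fits.
        feasible = [n for n, cap in free.items()
                    if all(cap.get(dev, 0) >= amt for dev, amt in req.items())]
        if not feasible:
            raise RuntimeError("no resource node satisfies the request")
        # T2 (placeholder weight): free CPU left after placement, smallest
        # first. The patent computes this weight with AHP instead.
        best = min(feasible, key=lambda n: free[n].get("cpu", 0) - req.get("cpu", 0))
        # T3: charge the task's usage against the chosen node.
        for dev, amt in req.items():
            free[best][dev] -= amt
        chosen.append(best)
    return chosen

free = {"n1": {"cpu": 8, "mem": 32}, "n2": {"cpu": 4, "mem": 16}}
tasks = [{"cpu": 4, "mem": 8}, {"cpu": 4, "mem": 8}]
print(pick_nodes(free, tasks))
```

With these numbers the first task lands on n2 (tightest fit) and, n2's CPU being exhausted, the second task falls back to n1 — showing why T3's subtraction must precede the next T1 pass.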
In some more preferred embodiments, step T1 is achieved by the following steps:
T11, obtaining the resource information of the cluster;
T12, rejecting, according to the resource information requested by each task and the resource information of the cluster, the resource nodes that cannot satisfy the request, yielding all of the satisfying resource nodes.
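Step T12's rejection of unsatisfying nodes amounts to a per-device capacity check against each node's idle resources. A hedged sketch, with hypothetical node names and an invented request:

```python
def satisfies(idle, request):
    """Step T12's test: a node is kept only if every requested device
    (e.g. CPU, GPU, MEM, IO, bandwidth) fits within its idle capacity."""
    return all(idle.get(dev, 0) >= amt for dev, amt in request.items())

def filter_nodes(cluster_idle, request):
    """Step T1: reject the unsatisfying nodes, keep the rest."""
    return [name for name, idle in cluster_idle.items()
            if satisfies(idle, request)]

cluster_idle = {
    "node-a": {"cpu": 8, "gpu": 1, "mem": 32, "io": 100, "bw": 1000},
    "node-b": {"cpu": 2, "gpu": 0, "mem": 8,  "io": 100, "bw": 1000},
}
request = {"cpu": 4, "gpu": 1, "mem": 16}
print(filter_nodes(cluster_idle, request))
```

Here node-b is rejected on both CPU and GPU, so only node-a survives into step T2's weighting.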
The present invention also proposes a TensorFlow-based deep learning intelligent scheduling system, comprising:
an application configuration unit for receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
a resource management unit for obtaining the resource information requested by each task from the application configuration unit and collecting the resource information of the cluster;
a strategy analysis unit for obtaining the resource information requested by each task and the resource information of the cluster from the resource management unit, and computing the optimal resource node set;
an application configuration analysis unit for obtaining the number of tasks from the application configuration unit and the optimal resource node set from the strategy analysis unit, and establishing the mapping between each task and its corresponding optimal resource node;
a configuration parameter submission unit for submitting the resource information requested by each task in the application configuration unit to the resource management unit;
an application publication unit for publishing the TensorFlow application.
The present invention also proposes an electronic device, comprising:
a memory and a processor;
the memory storing computer-executable instructions, and the processor executing the computer-executable instructions:
S1, receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, computing an optimal resource node set from the resource information requested by each task and the resource information of the cluster;
S4, establishing, according to the number of tasks and the optimal resource node set, a mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
Compared with the prior art, the beneficial effects of the present invention include:
The TensorFlow-based deep learning intelligent scheduling method of the present invention comprises steps S1 through S5 above. The optimal resource node set is computed from the resource information requested by each task and the resource information of the cluster, the mapping between each task and its corresponding optimal resource node is then established from the number of tasks and the optimal resource node set, and finally the TensorFlow application is published. The user no longer needs to establish the mapping between tasks and resource nodes manually, which greatly shortens the time spent building that mapping and, at the same time, improves the correctness of the mapping between each task and its resource node. Furthermore, the optimal resource node set is selected automatically from the collected resource information and the application's tasks are published onto it, making maximal, rational use of resources and effectively avoiding waste.
Brief description of the drawings
Fig. 1 is a flow chart of the TensorFlow-based deep learning intelligent scheduling method in an embodiment of the present invention;
Fig. 2 is a flow chart of the concrete realization of step S3 in Fig. 1;
Fig. 3 is a flow chart of the concrete realization of step T1 in Fig. 2;
Fig. 4 is a system architecture diagram of the TensorFlow-based deep learning intelligent scheduling system in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the electronic device in an embodiment of the present invention.
Reference numerals: 1, application management center; 101, TensorFlow cluster initialization unit; 102, application publication unit; 2, application configuration center; 201, application configuration unit; 202, application configuration analysis unit; 203, configuration parameter submission unit; 3, resource scheduling center; 301, resource management unit; 302, strategy analysis unit; 4, resource pool; 401, resource node; 5, electronic device; 501, processor; 502, memory.
Detailed description
The invention is further described below with reference to the drawings and in conjunction with preferred embodiments.
With reference to Figs. 1-3, the TensorFlow-based deep learning intelligent scheduling method in this embodiment includes the following steps:
S1, the application configuration unit 201 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks comprises the number of worker tasks and the number of ps tasks, and the resource information requested by each task includes the resource devices and the requested amount of each device.
S2, the resource management unit 301 collects the resource information of the cluster, including the cluster's resource total and resource usage, i.e. the resource total and resource usage of resource pool 4.
S3, the strategy analysis unit 302 computes the optimal resource node set from the resource devices and requested amounts of step S1 together with the resource total and resource usage of resource pool 4. The optimal resource node set is computed as follows:
T1, the strategy analysis unit 302 obtains the resource information of resource pool 4 (resource total and resource usage) from the resource management unit 301 and the resource information requested by each task (resource devices and requested amounts) from the application configuration unit 201, and computes all of the satisfying resource nodes 401. The satisfying resource nodes 401 are computed as follows:
T11, the resource management unit 301 obtains the resource total and resource usage of resource pool 4; the resource devices include CPU, GPU, MEM, IO, and bandwidth, and the idle CPU, GPU, MEM, IO, and bandwidth of resource pool 4 are computed;
T12, the resource management unit 301 rejects, according to the CPU, GPU, MEM, IO, and bandwidth requested by the application, the resource nodes 401 that cannot satisfy the request; the resource information of the cluster is thereby filtered and screened, yielding all of the satisfying resource nodes 401.
T2, the strategy analysis unit 302 uses the analytic hierarchy process to compute, over the satisfying resource nodes 401 of step T1, the weight of each node with respect to the current task; the node with the smallest weight is the current task's optimal resource node 401. The weight of each node with respect to the current task is computed with the analytic hierarchy process as follows:
A judgment matrix is constructed with the analytic hierarchy process (AHP), and the weight of each node with respect to the current task is derived from it. The judgment matrix has the form of formula (1):

A = (a_ij)_{n×n}, with a_ii = 1 and a_ji = 1/a_ij    (1)

where a_ij indicates the importance of index i relative to index j. After the weights are obtained, a consistency check decides whether they are up to standard. The consistency index and consistency ratio are given by formula (2):

CI = (λ_max − n) / (n − 1),  CR = CI / RI    (2)

where λ_max is the maximum eigenvalue of the judgment matrix and n is the order of the judgment matrix. RI is the random consistency index, whose values are listed in Table 1:

Table 1. Random consistency index RI
n  | 1 | 2 | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11
RI | 0 | 0 | 0.58 | 0.90 | 1.12 | 1.24 | 1.32 | 1.41 | 1.45 | 1.49 | 1.51

When the consistency ratio CR < 0.1, the constructed judgment matrix is considered to meet the condition and can be used for computing the weights.
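A small, self-contained sketch of the AHP computation described above: the weights are the principal eigenvector of the judgment matrix (obtained here by power iteration), λ_max is estimated from A·w, and CI and CR follow formula (2) with the Table 1 RI values. The 3×3 example matrix is illustrative, not taken from the patent.

```python
# Table 1: random consistency index RI, keyed by matrix order n.
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24,
      7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49, 11: 1.51}

def ahp_weights(A, iters=100):
    """Weights = principal eigenvector of judgment matrix A (power
    iteration); returns (weights, consistency ratio CR = CI / RI)."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        w = [x / s for x in v]
    # lambda_max estimated component-wise from A.w, then CI per formula (2).
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1) if n > 1 else 0.0
    return w, (ci / RI[n] if RI[n] else 0.0)

# Illustrative, perfectly consistent judgment matrix
# (a_ij = importance of index i relative to index j).
A = [[1.0, 2.0, 4.0],
     [0.5, 1.0, 2.0],
     [0.25, 0.5, 1.0]]
w, cr = ahp_weights(A)
print(w, cr)
```

For this consistent matrix the weights converge to roughly (0.571, 0.286, 0.143) and CR is essentially 0, well under the 0.1 threshold, so the matrix would be accepted.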
T3, the strategy analysis unit 302 subtracts the usage of the optimal resource node 401 of step T2 from the resource information of the cluster, i.e. from the resources of resource pool 4, then repeats steps T1 and T2 to obtain the optimal resource node set over all tasks, A = {w0, w1, ..., wn, p0, p1, ..., pm}, where w denotes a worker node and p denotes a ps node.
S4, the application configuration unit 201 establishes, according to the number of worker tasks and ps tasks and the optimal resource node set, the mapping between each task and its corresponding optimal resource node 401. The mapping is established as follows:
The tf.train.ClusterSpec and tf.train.Server parameters of each task in the TensorFlow application are configured automatically, according to these rules:
tf.train.Server parameter configuration:
for a worker task, tf.train.Server is written as tf.train.Server(cluster, job_name="worker", task_index=N), where N is the index of the worker node in set A;
for a ps task, tf.train.Server is written as tf.train.Server(cluster, job_name="ps", task_index=M), where M is the index of the ps node in set A.
tf.train.ClusterSpec parameter configuration:
the parameter of tf.train.ClusterSpec is tf.train.ClusterSpec({"worker": ["w0:port", ..., "wn:port"], "ps": ["p0:port", ..., "pm:port"]}), where w0-wn in the worker value are the IP addresses of all worker nodes in set A, p0-pm are the IP addresses of all ps nodes in set A, and port is the default port number configured in the application configuration unit 201.
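Under the rules above, the parameter generation can be sketched as a pure function of the optimal node set. `build_configs`, its arguments, and the IP addresses are hypothetical names for illustration; the dictionaries it returns mirror what would be handed to tf.train.ClusterSpec and tf.train.Server.

```python
def build_configs(workers, ps_nodes, port=2222):
    """Apply the auto-configuration rules: one cluster-spec dictionary
    shared by every task, plus per-task tf.train.Server arguments.
    `workers` / `ps_nodes` are the IP lists taken from the optimal
    node set A; the default port stands in for the one configured in
    the application configuration unit."""
    cluster = {
        "worker": ["%s:%d" % (ip, port) for ip in workers],
        "ps": ["%s:%d" % (ip, port) for ip in ps_nodes],
    }
    # One Server argument set per task: index N over workers, M over ps.
    servers = ([{"job_name": "worker", "task_index": n}
                for n in range(len(workers))]
               + [{"job_name": "ps", "task_index": m}
                  for m in range(len(ps_nodes))])
    return cluster, servers

cluster, servers = build_configs(["10.0.0.1", "10.0.0.2"], ["10.0.0.3"])
print(cluster["worker"], servers[2])
```

Because the cluster dictionary is derived once from set A and shared, every task sees an identical ClusterSpec while receiving its own (job_name, task_index) pair, which is exactly the invariant the manual procedure in the background section had to maintain by hand.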
S5, after the configuration parameter submission unit 203 of the application configuration center 2 has obtained all of the configuration information of the TensorFlow application and the optimal node set, it submits them to the application publication unit 102, which publishes all worker tasks and ps tasks of the TensorFlow application onto the designated resource nodes 401 according to the above information.
With reference to Fig. 4, the TensorFlow-based deep learning intelligent scheduling system in this embodiment includes an application configuration center 2, a resource scheduling center 3, an application management center 1, and a resource pool 4.
The application configuration center 2 includes the application configuration unit 201, the application configuration analysis unit 202, and the configuration parameter submission unit 203; the resource scheduling center 3 includes the resource management unit 301 and the strategy analysis unit 302; the application management center 1 includes the application publication unit 102 and the TensorFlow cluster initialization unit 101.
The application configuration center 2 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks is the sum of worker tasks and ps tasks; the resource information requested by each task includes the resource devices and the requested amounts; the resource devices include CPU, MEM, GPU, IO, and bandwidth. The application configuration analysis unit 202 obtains the number of tasks from the application configuration unit 201 and the optimal resource node set from the strategy analysis unit 302, and establishes the mapping between each task and its corresponding optimal resource node 401. The configuration parameter submission unit 203 submits the resource information requested by each task in the application configuration unit 201 to the resource management unit 301 and, after obtaining all of the configuration information of the TensorFlow application and the optimal node set, submits them to the application publication unit 102.
The resource management unit 301 obtains the resource information requested by each task from the application configuration unit 201 and collects the resource information of the cluster. The strategy analysis unit 302 obtains the resource information requested by each task and the resource information of the cluster from the resource management unit 301, and computes the optimal resource node set.
The application publication unit 102 publishes the TensorFlow application: according to the configuration information and the optimal node set, it publishes all tasks of the TensorFlow application onto the designated resource nodes 401. The TensorFlow cluster initialization unit 101 initializes the TensorFlow cluster.
The resource pool 4 contains all available resource nodes 401 in the TensorFlow cluster; the resource devices on each node include CPU, MEM, GPU, IO, and bandwidth.
When the user publishes a TensorFlow application, the resource information requested by each task in the application and the number of tasks it contains are sent to the application configuration center 2, and the TensorFlow cluster is initialized. The application configuration unit 201, application configuration analysis unit 202, and configuration parameter submission unit 203 of the application configuration center 2 analyze the tf.train.ClusterSpec parameter information and tf.train.Server parameter information required by each task, together with the resource information each task requests. The resource management unit 301 and strategy analysis unit 302 of the resource scheduling center 3 compute the optimal resource node set for scheduling the application and deliver it to the application configuration analysis unit 202, which computes the tf.train.ClusterSpec and tf.train.Server parameter information of each task from the optimal resource node 401 information and the number of tasks. Finally, all of this information is submitted to the application publication unit 102, which, using the tf.train.ClusterSpec parameter information and the tf.train.Server parameter information, publishes the application onto the optimal resource node set of the TensorFlow cluster without the user manually configuring any parameter of the TensorFlow application.
With reference to Fig. 5, the electronic device 5 in this embodiment includes a memory 502 and a processor 501; the memory 502 stores computer-executable instructions, and the processor 501 executes the computer-executable instructions:
S1, the application configuration unit 201 receives the number of tasks contained in the TensorFlow application sent by the user terminal and the resource information requested by each task; the number of tasks comprises the number of worker tasks and the number of ps tasks, and the resource information requested by each task includes the resource devices and the requested amount of each device;
S2, the resource management unit 301 collects the resource information of the cluster, including the cluster's resource total and resource usage, i.e. the resource total and resource usage of resource pool 4;
S3, the strategy analysis unit 302 computes the optimal resource node set from the resource devices and requested amounts of step S1 together with the resource total and resource usage of resource pool 4. The optimal resource node set is computed as follows:
T1, the strategy analysis unit 302 obtains the resource information of resource pool 4 (resource total and resource usage) from the resource management unit 301 and the resource information requested by each task (resource devices and requested amounts) from the application configuration unit 201, and computes all of the satisfying resource nodes 401. The satisfying resource nodes 401 are computed as follows:
T11, the resource management unit 301 obtains the resource total and resource usage of resource pool 4; the resource devices include CPU, GPU, MEM, IO, and bandwidth, and the idle CPU, GPU, MEM, IO, and bandwidth of resource pool 4 are computed;
T12, the resource management unit 301 rejects, according to the CPU, GPU, MEM, IO, and bandwidth requested by the application, the resource nodes 401 that cannot satisfy the request; the resource information of the cluster is thereby filtered and screened, yielding all of the satisfying resource nodes 401;
T2, the strategy analysis unit 302 uses the analytic hierarchy process to compute, over the satisfying resource nodes 401 of step T1, the weight of each node with respect to the current task; the node with the smallest weight is the current task's optimal resource node 401. The weight of each node with respect to the current task is computed with the analytic hierarchy process as follows:
A judgment matrix is constructed with the analytic hierarchy process (AHP), and the weight of each node with respect to the current task is derived from it. The judgment matrix has the form of formula (1):

A = (a_ij)_{n×n}, with a_ii = 1 and a_ji = 1/a_ij    (1)

where a_ij indicates the importance of index i relative to index j. After the weights are obtained, a consistency check decides whether they are up to standard. The consistency index and consistency ratio are given by formula (2):

CI = (λ_max − n) / (n − 1),  CR = CI / RI    (2)

where λ_max is the maximum eigenvalue of the judgment matrix and n is the order of the judgment matrix. RI is the random consistency index, whose values are listed in Table 1:

Table 1. Random consistency index RI
n  | 1 | 2 | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11
RI | 0 | 0 | 0.58 | 0.90 | 1.12 | 1.24 | 1.32 | 1.41 | 1.45 | 1.49 | 1.51

When the consistency ratio CR < 0.1, the constructed judgment matrix is considered to meet the condition and can be used for computing the weights.
T3, the strategy analysis unit 302 subtracts the usage of the optimal resource node 401 of step T2 from the resource information of the cluster, i.e. from the resources of resource pool 4, then repeats steps T1 and T2 to obtain the optimal resource node set over all tasks, A = {w0, w1, ..., wn, p0, p1, ..., pm}, where w denotes a worker node and p denotes a ps node.
S4, the application configuration unit 201 establishes, according to the number of tasks and the optimal resource node set, the mapping between each task and its corresponding optimal resource node 401. The mapping is established as follows:
The tf.train.ClusterSpec and tf.train.Server parameters of each task in the TensorFlow application are configured automatically, according to these rules:
tf.train.Server parameter configuration:
for a worker task, tf.train.Server is written as tf.train.Server(cluster, job_name="worker", task_index=N), where N is the index of the worker node in set A;
for a ps task, tf.train.Server is written as tf.train.Server(cluster, job_name="ps", task_index=M), where M is the index of the ps node in set A.
tf.train.ClusterSpec parameter configuration:
the parameter of tf.train.ClusterSpec is tf.train.ClusterSpec({"worker": ["w0:port", ..., "wn:port"], "ps": ["p0:port", ..., "pm:port"]}), where w0-wn in the worker value are the IP addresses of all worker nodes in set A, p0-pm are the IP addresses of all ps nodes in set A, and port is the default port number configured in the application configuration unit 201.
S5, after the configuration parameter submission unit 203 of the application configuration center 2 has obtained all of the configuration information of the TensorFlow application and the optimal node set, it submits them to the application publication unit 102, which publishes all tasks of the TensorFlow application onto the designated resource nodes 401 according to the above information.
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the concrete implementation of the invention cannot be considered limited to these descriptions. For persons skilled in the art to which the present invention belongs, several equivalent substitutions or obvious modifications with identical performance or use may be made without departing from the inventive concept, and all such modifications shall be deemed to fall within the protection scope of the present invention.
Claims (9)
1. A TensorFlow-based deep learning intelligent scheduling method, characterized by comprising the following steps:
S1, receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2, collecting the resource information of the cluster;
S3, computing an optimal resource node set from the resource information requested by each task and the resource information of the cluster;
S4, establishing, according to the number of tasks and the optimal resource node set, a mapping between each task and its corresponding optimal resource node;
S5, publishing the TensorFlow application.
2. the deep learning intelligent dispatching method based on TensorFlow as described in claim 1, which is characterized in that described
Business includes worker task and ps task.
3. the deep learning intelligent dispatching method based on TensorFlow as described in claim 1, which is characterized in that described each
The resource information of task requests includes the request amount of resource apparatus and resource.
4. the deep learning intelligent dispatching method based on TensorFlow as described in claim 1, which is characterized in that the collection
Resource information in group includes the usage amount and cluster total resources of cluster resource.
5. The TensorFlow-based deep learning intelligent dispatching method of claim 1, characterized in that the resource devices include CPU, GPU, memory (MEM), I/O and bandwidth.
6. The TensorFlow-based deep learning intelligent dispatching method of claim 1, characterized in that step S3 is achieved by the following steps:
T1: computing all satisfying resource nodes according to the resource information of the cluster and the resource information requested by each task;
T2: computing, for the satisfying resource nodes obtained in step T1, the weight of each node with respect to the current task using the analytic hierarchy process (AHP), the node with the smallest weight being the optimal resource node for the current task;
T3: subtracting the usage of the optimal resource node obtained in step T2 from the resource information of the cluster, and repeating steps T1 and T2 to obtain the optimal resource node set for all tasks.
7. The TensorFlow-based deep learning intelligent dispatching method of claim 6, characterized in that step T1 is achieved by the following steps:
T11: collecting the resource information of the cluster;
T12: rejecting the resource nodes that cannot satisfy the requests according to the resource information requested by each task and the resource information of the cluster, so as to obtain all satisfying resource nodes.
8. A TensorFlow-based deep learning intelligent dispatching system, characterized by comprising:
an application configuration unit, for receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
a resource management unit, for obtaining the resource information requested by each task from the application configuration unit and collecting the resource information of the cluster;
a strategy analysis unit, for obtaining the resource information requested by each task and the resource information of the cluster from the resource management unit, and computing an optimal resource node set;
an application configuration analysis unit, for obtaining the number of tasks from the application configuration unit and the optimal resource node set from the strategy analysis unit, and establishing the mapping relationship between each task and its corresponding optimal resource node;
a configuration parameter submission unit, for submitting the resource information requested by each task in the application configuration unit to the resource management unit;
an application release unit, for publishing the TensorFlow application.
9. An electronic device, characterized by comprising:
a memory and a processor;
the memory being configured to store computer-executable instructions, and the processor being configured to execute the computer-executable instructions to perform:
S1: receiving the number of tasks contained in a TensorFlow application sent by a user terminal and the resource information requested by each task;
S2: collecting the resource information of the cluster;
S3: computing an optimal resource node set according to the resource information requested by each task and the resource information of the cluster;
S4: establishing a mapping relationship between each task and its corresponding optimal resource node according to the number of tasks and the optimal resource node set;
S5: publishing the TensorFlow application.
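The S1–S5 flow claimed above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `Task` class, the `schedule` function, the headroom-based tie-break and the sample node names are all hypothetical assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str       # e.g. "worker-0" or "ps-0" (the task kinds of claim 2)
    request: dict   # requested amount per resource device, e.g. {"cpu": 4, "gpu": 1}

def schedule(tasks, cluster_free):
    """S3/S4: map every task to a satisfying node, reserving resources as we go."""
    mapping = {}
    for task in tasks:
        # keep only nodes whose free resources cover the request (claim 7, step T12)
        candidates = [n for n, free in cluster_free.items()
                      if all(free.get(k, 0) >= v for k, v in task.request.items())]
        if not candidates:
            raise RuntimeError(f"no node satisfies {task.name}")
        # placeholder policy: prefer the node with the most headroom after placement
        best = max(candidates, key=lambda n: sum(
            cluster_free[n][k] - v for k, v in task.request.items()))
        mapping[task.name] = best
        for k, v in task.request.items():   # step T3: subtract the reserved usage
            cluster_free[best][k] -= v
    return mapping

# S1/S2 inputs, then S4's mapping; S5 would publish the application with it
tasks = [Task("worker-0", {"cpu": 4, "gpu": 1}), Task("ps-0", {"cpu": 2})]
free = {"node-a": {"cpu": 8, "gpu": 2}, "node-b": {"cpu": 4, "gpu": 0}}
print(schedule(tasks, free))  # → {'worker-0': 'node-a', 'ps-0': 'node-a'}
```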
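Step T2 of claim 6 names the analytic hierarchy process (AHP) but supplies no comparison matrices, so the sketch below fills that gap with illustrative assumptions: criterion weights are approximated by the normalized row geometric means of a hypothetical pairwise-comparison matrix over CPU, GPU and MEM, and the node with the smallest weighted utilization is treated as optimal.

```python
import math

def ahp_criterion_weights(pairwise):
    """Approximate AHP's principal eigenvector by normalized row geometric means."""
    n = len(pairwise)
    gm = [math.prod(row) ** (1.0 / n) for row in pairwise]
    return [g / sum(gm) for g in gm]

def node_weight(utilization, weights):
    """Weighted utilization of one node; T2 selects the node with the smallest value."""
    return sum(w * u for w, u in zip(weights, utilization))

# Illustrative 3x3 comparison of CPU, GPU and MEM importance (claim 5 also
# lists I/O and bandwidth; they are omitted here for brevity).
pairwise = [[1,   1/2, 2],
            [2,   1,   4],
            [1/2, 1/4, 1]]
w = ahp_criterion_weights(pairwise)      # ≈ [0.286, 0.571, 0.143]
nodes = {"node-a": [0.5, 0.2, 0.8], "node-b": [0.3, 0.6, 0.4]}
best = min(nodes, key=lambda n: node_weight(nodes[n], w))
print(best)  # node-a has the smaller weighted utilization here
```

The same loop would then be repeated per task (claim 6, step T3) after subtracting the placed task's usage from the cluster state.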
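The six units of claim 8 can be wired together roughly as follows. Every class and method name is a hypothetical rendering of the claimed data flow, node capacity is reduced to one scalar per node for brevity, and the strategy unit uses a first-fit placeholder rather than the AHP computation of claim 6.

```python
class AppConfigUnit:
    """Holds the task count and per-task requests received from the user terminal."""
    def __init__(self, task_count, requests):
        self.task_count = task_count
        self.requests = requests            # one scalar request per task

class ResourceManagementUnit:
    """Collects cluster resource information and forwards the per-task requests."""
    def __init__(self, cluster):
        self.cluster = cluster              # node name -> free capacity

class StrategyAnalysisUnit:
    """Computes the optimal resource node set from requests and cluster state."""
    def best_nodes(self, requests, cluster):
        nodes = []
        for r in requests:                  # placeholder: first node that fits
            nodes.append(next(n for n, free in cluster.items() if free >= r))
        return nodes

def analyse_and_release(config, rm, strategy):
    """Application configuration analysis unit: builds the task -> node mapping;
    the application release unit would then publish the application with it."""
    nodes = strategy.best_nodes(config.requests, rm.cluster)
    return dict(zip(range(config.task_count), nodes))

config = AppConfigUnit(2, [2, 4])
rm = ResourceManagementUnit({"node-a": 3, "node-b": 8})
print(analyse_and_release(config, rm, StrategyAnalysisUnit()))  # {0: 'node-a', 1: 'node-b'}
```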
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810962198.XA CN109240814A (en) | 2018-08-22 | 2018-08-22 | A kind of deep learning intelligent dispatching method and system based on TensorFlow |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109240814A true CN109240814A (en) | 2019-01-18 |
Family
ID=65068722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810962198.XA Pending CN109240814A (en) | 2018-08-22 | 2018-08-22 | A kind of deep learning intelligent dispatching method and system based on TensorFlow |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109240814A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111124634A (en) * | 2019-12-06 | 2020-05-08 | 广东浪潮大数据研究有限公司 | Training method and device, electronic equipment and storage medium |
CN111400000A (en) * | 2020-03-09 | 2020-07-10 | 百度在线网络技术(北京)有限公司 | Network request processing method, device, equipment and storage medium |
CN111984398A (en) * | 2019-05-22 | 2020-11-24 | 富士通株式会社 | Method and computer readable medium for scheduling operations |
CN112134812A (en) * | 2020-09-08 | 2020-12-25 | 华东师范大学 | Distributed deep learning performance optimization method based on network bandwidth allocation |
WO2021120550A1 (en) * | 2019-12-19 | 2021-06-24 | Huawei Technologies Co., Ltd. | Methods and apparatus for resource scheduling of resource nodes of a computing cluster or a cloud computing platform |
WO2022083777A1 (en) * | 2020-10-23 | 2022-04-28 | Huawei Cloud Computing Technologies Co., Ltd. | Resource scheduling methods using positive and negative caching, and resource manager implementing the methods |
CN114661480A (en) * | 2022-05-23 | 2022-06-24 | 阿里巴巴(中国)有限公司 | Deep learning task resource allocation method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106529682A (en) * | 2016-10-28 | 2017-03-22 | 北京奇虎科技有限公司 | Method and apparatus for processing deep learning task in big-data cluster |
CN107203424A (en) * | 2017-04-17 | 2017-09-26 | 北京奇虎科技有限公司 | A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies |
CN107370796A (en) * | 2017-06-30 | 2017-11-21 | 香港红鸟科技股份有限公司 | A kind of intelligent learning system based on Hyper TF |
CN107888669A (en) * | 2017-10-31 | 2018-04-06 | 武汉理工大学 | A kind of extensive resource scheduling system and method based on deep learning neutral net |
US20180137445A1 (en) * | 2016-11-14 | 2018-05-17 | Apptio, Inc. | Identifying resource allocation discrepancies |
CN108062246A (en) * | 2018-01-25 | 2018-05-22 | 北京百度网讯科技有限公司 | For the resource regulating method and device of deep learning frame |
Non-Patent Citations (1)
Title |
---|
CUI Guangzhang et al.: "Improvement of Container Cloud Resource Scheduling Strategy", Computer & Digital Engineering *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109240814A (en) | A kind of deep learning intelligent dispatching method and system based on TensorFlow | |
CN105550323B (en) | Load balance prediction method and prediction analyzer for distributed database | |
US10354201B1 (en) | Scalable clustering for mixed machine learning data | |
WO2020024442A1 (en) | Resource allocation method and apparatus, computer device and computer-readable storage medium | |
CN108537440A (en) | A kind of building scheme project management system based on BIM | |
CN109491790A (en) | Industrial Internet of Things edge calculations resource allocation methods and system based on container | |
CN106776005A (en) | A kind of resource management system and method towards containerization application | |
CN107508901A (en) | Distributed data processing method, apparatus, server and system | |
CN104750780B (en) | A kind of Hadoop configuration parameter optimization methods based on statistical analysis | |
CN110389820A (en) | A kind of private clound method for scheduling task carrying out resources based on v-TGRU model | |
CN104731595A (en) | Big-data-analysis-oriented mixing computing system | |
CN106775632A (en) | A kind of operation flow can flexible expansion high-performance geographic information processing method and system | |
CN107370796A (en) | A kind of intelligent learning system based on Hyper TF | |
CN112579273B (en) | Task scheduling method and device and computer readable storage medium | |
CN109478147A (en) | Adaptive resource management in distributed computing system | |
CN103116525A (en) | Map reduce computing method under internet environment | |
Cheng et al. | Heterogeneity aware workload management in distributed sustainable datacenters | |
CN115134371A (en) | Scheduling method, system, equipment and medium containing edge network computing resources | |
CN104035819B (en) | Scientific workflow scheduling method and device | |
CN108132840A (en) | Resource regulating method and device in a kind of distributed system | |
Wu et al. | Optimizing end-to-end performance of data-intensive computing pipelines in heterogeneous network environments | |
CN109858789A (en) | Human resources visible processing method, device, equipment and readable storage medium storing program for executing | |
CN109614210A (en) | Storm big data energy-saving scheduling method based on energy consumption perception | |
Liu et al. | A probabilistic strategy for setting temporal constraints in scientific workflows | |
CN106572191A (en) | Cross-data center collaborative calculation method and system thereof |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190118 |