CN108768876A - A traffic scheduling method for machine learning frameworks - Google Patents

A traffic scheduling method for machine learning frameworks Download PDF

Info

Publication number
CN108768876A
CN108768876A CN201810569876.6A CN201810569876A CN108768876A
Authority
CN
China
Prior art keywords
stream
group
priority
detection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810569876.6A
Other languages
Chinese (zh)
Other versions
CN108768876B (en)
Inventor
江勇
李清
杨光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201810569876.6A priority Critical patent/CN108768876B/en
Publication of CN108768876A publication Critical patent/CN108768876A/en
Application granted granted Critical
Publication of CN108768876B publication Critical patent/CN108768876B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/50Queue scheduling
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/50Queue scheduling
    • H04L47/62Queue scheduling characterised by scheduling criteria
    • H04L47/625Queue scheduling characterised by scheduling criteria for service slots or service orders
    • H04L47/6275Queue scheduling characterised by scheduling criteria for service slots or service orders based on priority

Abstract

The present invention proposes a traffic scheduling method for machine learning frameworks: an efficient traffic scheduling mechanism for distributed machine learning frameworks in data centers. In scenarios where the application's flow information cannot be obtained in advance, it exploits the self-similarity of machine learning traffic to realize an efficient scheduling strategy at the coflow level. The mechanism organically combines flow rate control with traffic scheduling: timely rate control helps complete effective flow-information inference during flow transmission, while the scheduling strategy based on the inference results in turn reasonably guides the rate control of flows under different network conditions.

Description

A traffic scheduling method for machine learning frameworks
Technical field
The present invention relates to a scheme for improving the traffic scheduling performance of machine learning applications in data center networks, and belongs to the field of computer networks.
Background technology
Technological breakthroughs in machine learning in recent years have led more and more large commercial companies to increase their investment in the research and development of artificial intelligence applications. To accelerate development, companies have proposed different machine learning frameworks to make full use of the computing resources of physical clusters. Resource scheduling is critical to completing machine learning tasks efficiently in a cluster, and the allocation of network resources is especially important. Machine learning tasks generate large volumes of traffic as they progress; this traffic easily causes congestion in the cluster network (the data center network) and thereby extends task completion time. Network congestion arises mainly for two reasons: (1) the lack of an application-semantics-aware mechanism — because a traditional data center network cannot distinguish the differentiated network demands of different applications, the fair service it provides makes the network a bottleneck for application performance; (2) traditional rate control mechanisms for network transmission are not suited to data center networks — the traffic aggregation pattern in a data center network easily causes packet loss, which degrades transmission performance.
As machine learning tasks come to occupy a large share of cluster workloads, the coflow has proven to be an effective model for improving the network performance of distributed computing frameworks in data centers. Coflow-based scheduling schemes outperform traditional flow-based schemes because a coflow captures the application's real requirements on the network. For example, multiple flows from the same distributed application reach the same receiver over different links, and the application requires the receiver to finish receiving all flows before entering the next computation stage. If the link carrying one of these flows becomes congested, then under a flow-based scheduling scheme the congestion signal can only affect the rate control of that single flow; a coflow scheduling scheme, which logically treats the flows above as one coflow, can instead appropriately reduce the rates of the other flows and avoid unnecessary bandwidth occupation.
It is therefore expected that realizing coflow-level traffic scheduling in distributed machine learning frameworks will bring substantial network performance gains.
Unfortunately, coflow schemes that lack flow information are of limited effectiveness. Once coflow-level traffic scheduling is realized in a distributed machine learning framework, the coflow scheduling policy determines coflow scheduling performance when congestion occurs. Unlike flow-based scheduling, the definition of a coflow itself determines its optimal scheduling policy: the priority of a coflow is determined by its slowest flow, because the completion time of a coflow depends on the completion of its last flow. Because prior-art coflow schemes lack flow information, their effectiveness is limited: they must predict flow sizes from the number of packets a flow has sent and place flows into different priority queues accordingly, so scheduling performance depends on prediction accuracy, while the rate control of the flows cannot be explicitly controlled.
Invention content
The purpose of the present invention is to solve the above problems of coflow schemes by proposing a traffic scheduling method for machine learning frameworks.
In the traffic scheduling method for machine learning frameworks of the present invention, the machine learning obtains different machine learning models on a large-scale dataset under the data-parallel model. In this setting, the large-scale dataset is partitioned across multiple distributed nodes for storage; a working instance running on a distributed node trains on its local partition of the dataset, obtains gradient values of the model parameters, and sends them to a parameter server to update the model; the parameter server aggregates multiple groups of gradient values for model training and sends the updated model parameters back to the working instances. The method is characterized in that the flows sent by multiple working instances to the same parameter server are organized into one coflow, and the flows sent by a parameter server to multiple instances are organized into another coflow, thereby realizing coflow-level traffic scheduling in the distributed machine learning framework.
Further, the method includes a coflow information inference mechanism for detecting the potential congestion capability of a coflow as quickly as possible. The coflow information inference mechanism includes the following steps: S1, after a machine learning task starts, counting the number of active flows via the coflow scheduling framework to obtain the number n of flows in a coflow; S2, randomly selecting one flow in the coflow as a probe flow and ensuring that it completes its packet transmission as early as possible, thereby obtaining its flow size f; S3, combining the self-similarity of machine learning coflows, obtaining the coflow size as n*f; the coflow size is equivalent to the potential congestion capability against the shared forwarding node, and serves as a basis for judging the priority of the coflow; S4, performing a priority update.
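Steps S1-S3 above can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures — the `Coflow` record and its field names do not appear in the patent:

```python
from dataclasses import dataclass
from typing import List, Optional
import random

@dataclass
class Coflow:
    flows: List[str]              # ids of the active flows in this coflow
    probe: Optional[str] = None   # flow selected as the probe (S2)
    probe_size: int = 0           # bytes the probe flow has transmitted so far

def infer_coflow_size(coflow: Coflow, rng=random) -> int:
    """S1: count active flows; S2: pick a random probe flow if none is
    chosen yet; S3: self-similarity gives the coflow size as n * f."""
    n = len(coflow.flows)                        # S1
    if coflow.probe is None:
        coflow.probe = rng.choice(coflow.flows)  # S2
    return n * coflow.probe_size                 # S3: n * f
```

The inferred size n*f then serves as the congestion-capability estimate consumed by the priority update of step S4.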
Further, coflows from a newly generated machine learning task must pass through information inference before they can be added to the set of active coflows. During information inference, one coflow is randomly chosen from the multiple coflows of the new task according to the edge switch to which the receiving-end parameter server belongs; within the chosen coflow, one flow is randomly selected as the probe flow according to the physical host on which the sending-end working instance resides.
Further, packets from the probe flow are tagged and enter the probe queue, enjoying the highest priority; the probe flow is randomly assigned within a randomly selected coflow, i.e., probe-flow selection adopts a dual-random design.
Further, the highest priority of the probe flow is combined with the elastic rate control algorithm to ensure rapid growth of the probe flow's transmission rate.
Further, before the priority update starts, non-probe flows from the new task enter the active queue to guarantee a minimum sending rate so that they are not starved under the current network configuration.
Further, when the flow-priority-update timer expires, all non-probe flows, including those of unfinished coflows, enter the corresponding priority queues according to the probing result of their associated probe flow; the probe flow's size determines the priority of the non-probe flows by influencing the inferred coflow size, and the priority update guarantees a shortest-job-first scheduling strategy.
Further, the elastic rate control algorithm includes: when sufficient bandwidth is available, the rate of a flow grows multiplicatively to occupy the available bandwidth quickly and improve link utilization.
Further, the end system collects the RTT of the link traversed by each flow in the coflow to estimate the available bandwidth, realizing end-system-based rate control; and compatibility with the existing TCP/IP protocol is achieved in a manner similar to setting a send window size.
Further, for a flow f, the target rate, i.e., the number of packets expected to be sent in the next period, is calculated by the following formula:
elastic_gain * f.max_rtt / f.min_rtt
where the preset value elastic_gain determines the minimum sending rate of the flow, f.max_rtt denotes the maximum queue length the link can accommodate, whose size is related to the network configuration, and f.min_rtt denotes the queue length in the link during one measurement period.
The invention also includes a traffic scheduling system for machine learning frameworks for realizing the above method, comprising a controller that receives the latest flow information periodically reported by the sending end and realizes the coflow semantic analysis function. The controller contains a coflow information collection module, a communication-pattern matching module, and a coflow size inference module. The coflow information collection module organizes the flows from machine learning tasks into coflows according to the existing flow information, and identifies coflows from different training epochs. The communication-pattern matching module records completed coflows in order to match currently unfinished coflows; a successfully matched coflow need not enter the coflow size inference module, and the decision is directly issued to the receiving end of its flows using the matched result. The coflow size inference module realizes the above coflow information inference algorithm, divides new coflows into probe flows and non-probe flows, and updates the priority of non-probe flows according to the results of the coflow information collection module.
Further, the controller is also used to periodically issue scheduling decisions to the receiving end. A module located at the terminal realizes the received coflow policy through elastic traffic scheduling, and contains a priority tagging module, a measurement-result update module, and a transport-layer rate control module. After receiving the controller's priority update information for each flow in a coflow, the priority tagging module tags the packets of each flow with the corresponding priority and ensures they enter the multi-level feedback priority queues of the operating system to realize the shortest-job-first strategy. The measurement-result update module collects the RTT of the end-to-end link as well as the numbers of packets sent and received in the previous period, providing the data basis for the transport-layer rate control module of the next step. The transport-layer rate control module calculates the number of packets to send in the next period and, with the help of a rate limiter, ensures that the corresponding packets are sent out normally from the NIC.
Compared with the prior art, the present invention proposes an efficient traffic scheduling mechanism for distributed machine learning frameworks in data centers, realizing an efficient scheduling strategy at the coflow level in scenarios where the application's flow information cannot be obtained.
Further, the mechanism organically combines flow rate control with traffic scheduling: timely rate control helps complete effective flow-information inference during flow transmission, while the scheduling strategy based on the inference results reasonably guides the rate control of flows under different network conditions.
Description of the drawings
Fig. 1 is a schematic diagram of the deployment of a distributed machine learning framework in a data center and of a typical data center network topology;
Fig. 2 is a schematic diagram of a coflow example in a distributed machine learning framework in an embodiment of the invention;
Fig. 3 is a flow diagram of coflow information inference in an embodiment of the invention;
Fig. 4 is a schematic diagram of the link-queue-based rate control strategy of an embodiment of the invention;
Fig. 5 is a schematic flow diagram of elastic rate control in an embodiment of the invention;
Fig. 6 is a schematic diagram of the model framework of an embodiment of the invention;
Figs. 7a-7c are examples of the coflow scheduling algorithm of an embodiment of the invention.
Reference numerals: Tn: the edge switch numbered n; Hn-m: the m-th physical host attached to the n-th edge switch; VM1: the virtual host numbered 1; D1: Hadoop-based working instance No. 1; S1: Spark-based working instance No. 1.
Specific implementation mode
The present invention is described in further detail below with reference to embodiments and the accompanying drawings, in which identical reference numerals denote identical components unless otherwise stated. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its applications.
The deployment of a distributed machine learning framework in a data center and the data center network topology are shown in Fig. 1: a physical host H usually runs multiple distributed working instances as virtual machines VM (for example, D from the computation framework Hadoop, or S from Spark); multiple working instances on the same physical host share physical resources such as CPU, memory, disk, and network bandwidth; each physical host accesses the internal network through an edge switch T (the data center internal network refers to the network built from multiple high-performance switches and accessed through edge switches), and the edge switch in turn distributes traffic from the internal network to its attached physical hosts.
It is generally accepted that the internal network does not become congested; congestion occurs mainly in the part of the data center network formed by the edge switches and physical hosts, i.e., the external network. Common data center network congestion can be divided into sender congestion and edge-switch congestion. The former arises because multiple working instances running on a single host contend for bandwidth, so packets cannot leave the host and enter the network; the latter arises because the traffic from the internal network exceeds the forwarding capability of the edge switch, so packets are dropped at the last hop of the routing path.
Combining this with the coflow example above, it can be seen that if we organize the flows of multiple working instances on the same physical host into one coflow, or organize multiple flows arriving at the same edge switch into one coflow, the scheduling advantages of coflows can be fully exploited.
Fortunately, the computation semantics of distributed machine learning frameworks support coflow scheduling. Most mainstream distributed machine learning frameworks are built on the parameter server model shown in Fig. 2. In such systems, the goal of a machine learning task is to obtain different machine learning models on a large-scale dataset under the data-parallel model. The large-scale dataset is partitioned across multiple distributed nodes for storage; a working instance running on a distributed node trains on its local partition of the dataset, obtains gradient values of the model parameters, and sends them to the parameter server to update the model; the parameter server aggregates multiple groups of gradient values for model training and sends the updated model parameters back to the working instances. Since the parameter server must wait for a certain number of gradient results before starting model training, and a working instance must wait until the parameter server finishes delivering the updated model parameters before starting a new round of gradient computation, we can organize the flows sent by multiple working instances to the same parameter server into one coflow (the coflow from task 1), and organize the flows sent by a parameter server to multiple instances into another coflow (the coflow from task 2). Comparing with the congestion types in data center networks mentioned above, the former helps relieve edge-switch congestion on the parameter server's downlink, and the latter helps relieve sender congestion on the parameter server's uplink.
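The two coflow organizations just described (worker-to-server flows to the same parameter server form one coflow; server-to-worker flows from the same parameter server form another) can be sketched as follows, under an assumed flow-record shape that is not part of the patent:

```python
from collections import defaultdict

def group_into_coflows(flows):
    """Group flow records into coflows keyed by (task, direction, parameter
    server endpoint). Each record is a dict with 'task', 'src', 'dst', and
    'direction' keys ('worker_to_ps' or 'ps_to_worker') — an assumed shape."""
    coflows = defaultdict(list)
    for fl in flows:
        # the parameter server is the receiver on the uplink, the sender on the downlink
        ps = fl['dst'] if fl['direction'] == 'worker_to_ps' else fl['src']
        coflows[(fl['task'], fl['direction'], ps)].append(fl)
    return dict(coflows)
```

Grouping on the parameter-server endpoint is what ties each coflow to a single shared bottleneck (the server's downlink or uplink), which is why the coflow size can later stand in for potential congestion at that point.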
The inventors discovered that, unlike the coflows of other distributed applications, machine learning coflows have the following features of great potential value: 1. Unlike other multi-stage application tasks, a machine learning task, as seen in the example above, has only two stages, which means the endpoints of a coflow do not migrate across stages, guaranteeing the stability of the coflow structure (the sender and receiver of each flow); 2. The application semantics are similar, and the sizes of the data blocks partitioned within one machine learning task are fixed, so flow sizes are consistent across different training epochs; 3. Since in the parameter server model the model parameters each parameter server is responsible for are partitioned equally, the same-type coflows corresponding to different parameter servers are mutually similar. The above features can collectively be called the self-similarity of machine learning coflows.
The inventors discovered that the self-similarity of machine learning coflows helps simplify the design of coflow policies. The embodiments of the present invention exploit these discoveries to construct an efficient traffic scheduling mechanism for distributed machine learning frameworks in data centers, described in detail as follows:
1. Coflow information inference mechanism
The purpose of the coflow information inference mechanism is to detect the potential congestion capability of a coflow as quickly as possible. Considering that shortest-job-first is the currently known optimal algorithm for shortening task completion time, and under the premise that flow transmission rates remain stable, we here regard the largest coflow as the one most likely to cause congestion.
How to compute the completion time of the longest flow in a coflow is the key problem in designing an efficient coflow scheduling scheme. Traditional coflow schemes require the size of every flow in the coflow to be known, compute the target result in combination with the available bandwidth, and update the rate of each flow according to the result. Such schemes are hard to apply to scenarios where flow information is unknown, depend on the accuracy of bandwidth probing, and incur great resource overhead by updating flow transmission rates at scale.
Since the coflow structure in machine learning frameworks is stable, after a machine learning task starts we can obtain the number n of flows in a coflow by counting the number of active flows through the existing coflow scheduling framework. If we randomly select one flow in the coflow as the probe flow and ensure it completes its packet transmission as early as possible, we obtain its flow size f; combining the self-similarity of machine learning coflows, the coflow size is then obtained as n*f. Since the coflow size is equivalent to the potential congestion capability against the shared forwarding node, the coflow size serves as the main basis for judging the priority of a coflow.
Information inference and priority update are the two cores of the coflow information inference algorithm, as shown in Fig. 3. Coflows from a newly generated machine learning task must pass through information inference before they can be added to the set of active coflows. During information inference, one coflow is randomly chosen from the multiple coflows of the new task according to the edge switch to which the receiving end (parameter server) belongs. Within the chosen coflow, one flow is randomly selected as the probe flow according to the physical host on which the sending end (working instance) resides. Packets from the probe flow are tagged and enter the probe queue, enjoying the highest priority. The probe flow is randomly assigned within a randomly selected coflow (see steps 2 and 3 in Fig. 3); the dual-random design of probe-flow selection is intended to prevent probe flows from colliding and competing for bandwidth as far as possible. The highest priority of the probe flow ensures that the link round-trip time (RTT) measured by its own packets is minimal, which, combined with the elastic rate control algorithm, ensures rapid growth of the probe flow's transmission rate.
Before the priority update starts, non-probe flows from the new task enter the active queue to guarantee a minimum sending rate so that they are not starved under the current network configuration. When the flow-priority-update timer expires, all non-probe flows (including those of unfinished coflows) enter the corresponding priority queues according to the probing result of their associated probe flow. The probe flow's size determines the priority of the non-probe flows by influencing the inferred coflow size; hence, combined with the shortest-job-first scheduling principle, the smaller the probe flow's size, the higher the priority of the corresponding non-probe flows. The priority update guarantees the implementation of the shortest-job-first strategy, because as the size of a probe flow keeps growing, the priority of the non-probe flows it determines is gradually lowered (see step 9 in Fig. 3).
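The dual-random probe selection described above (a random coflow chosen per receiving-end edge switch, then a random flow within it chosen per sending-end host) can be sketched as follows; the mapping shape of `task_coflows` is an assumption for illustration:

```python
import random

def pick_probe(task_coflows, rng=random):
    """Dual-random probe selection: first pick one coflow of the new task
    at random, then pick one of its flows at random as the probe.
    `task_coflows` maps a coflow id to a list of flow ids (assumed shape)."""
    coflow_id = rng.choice(list(task_coflows))        # random coflow
    flow_id = rng.choice(task_coflows[coflow_id])     # random flow within it
    return coflow_id, flow_id
```

Because both choices are uniform and independent across tasks, two concurrently probed tasks rarely place their probes on the same bottleneck, which is the collision-avoidance property the text attributes to the dual-random design.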
The coflow inference mechanism enables the coflow scheduling scheme to realize an efficient scheduling strategy. Compared with the traditional way of deciding coflow priority by the amount of data a flow has already sent (LAS, least attained service), the probing results provided by probe flows can realize the theoretically superior shortest-job-first strategy (SJF), as shown in Fig. 7a. In this example, two tasks J1 and J2 arrive simultaneously, where J1 contains two coflows C1 and C2, and C1 contains three flows of size 1 sharing one receiver H1; the other coflow structures are shown in Fig. 7a and are not repeated here. Assuming the link processes one unit of data per time unit, from the perspective of coflow completion time the SJF scheduling scheme proposed in this patent (Fig. 7c) outperforms the traditional LAS scheduling strategy (Fig. 7b). This is because LAS cannot differentiate coflows from different tasks until t=3, whereas SJF can already tell at t=2, through the probe flows of J1 toward H1 and of J2 toward H3, that J2's coflow size is greater than J1's.
2. Elastic rate control algorithm
Probe flows and non-probe flows impose different requirements on the rate control strategy. The ideal rate control strategy is shown in Fig. 4. A probe flow carries the mission of coflow size inference, for which additive increase is clearly less effective than the multiplicative increase of traditional rate control. Therefore, when sufficient bandwidth is available, the rate of the flow should grow multiplicatively to occupy the available bandwidth quickly and improve link utilization. As shown in the bandwidth-probing part of Fig. 4, when the number of packets sent by a flow in the previous probing period is m and no packet loss occurred, the number of packets sent in this period can grow to 2*m. At the same time, a higher rate should not be drastically reduced by random network fluctuations, such as the loss of a small number of packets, which would cause significant rate oscillation; a reasonable strategy should perceive the small queue appearing in the link and then slow down moderately, so as to preserve the current rate level. As shown in the rate-protection part of Fig. 4, when n packets were lost in the flow's previous period, the number of packets sent in this period can be adjusted to m-n. In contrast, the goal of a non-probe flow is to match its inherent rate to the available bandwidth in the link as closely as possible. A non-probe flow can accept significant rate fluctuation to avoid congestion, especially when probe flows occupying large amounts of bandwidth exist in the link. As shown in the congestion-avoidance part of Fig. 4, when each of the m packets sent in the flow's previous period had to wait on average for the transmission of n packets, the number of packets sent in this period can be adjusted to m/n.
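The three per-period update rules just described (bandwidth probing 2*m, rate protection m-n, congestion avoidance m/n) can be collected into one update function. How the cases are dispatched between probe and non-probe flows below is my reading of the text, not code from the patent:

```python
def next_period_packets(sent_m: int, lost_n: int, avg_wait_n: float,
                        is_probe: bool) -> int:
    """Compute the packet budget for the next period.
    sent_m: packets sent in the previous period; lost_n: packets lost;
    avg_wait_n: average number of packets queued ahead of each of ours."""
    if is_probe:
        if lost_n == 0:
            return 2 * sent_m               # bandwidth probing: double
        return max(sent_m - lost_n, 1)      # rate protection: back off gently
    # non-probe flow: congestion avoidance, yield bandwidth to probe flows
    if avg_wait_n > 1:
        return max(int(sent_m / avg_wait_n), 1)
    return sent_m
```

The `max(..., 1)` floors are an added safeguard so a flow never drops to zero packets per period, echoing the minimum-rate guarantee elsewhere in the text.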
The elastic rate control algorithm proposed in this patent can satisfy the above requirements as far as possible with little additional overhead, as shown in Fig. 5. Since machine learning coflows share the same receiver (or sender), we can collect the RTT of the link traversed by each flow in the coflow at the end system to estimate the available bandwidth. End-system-based rate control avoids the computation and communication overhead caused by centralized rate calculation, and achieves compatibility with the existing TCP/IP protocol in a manner similar to setting a send window size, thereby reducing development cost. It should be noted that no hard limit is imposed here on the upper bound of a flow's transmission rate.
The elastic rate control algorithm treats an arbitrarily complex link as a direct link with a single RTT and bottleneck bandwidth. For a flow f, f.max_rtt denotes the maximum queue length the link can accommodate, whose size is related to the network configuration; f.min_rtt denotes the queue length in the link during one measurement period — valid because in modern high-speed data center networks the forwarding, processing, and transmission delays of a packet are negligible compared with the queueing delay. The current sending rate can logically be represented by the number of packets the sender transmitted in the previous period (since the period length is fixed, more packets sent means a higher rate); the current receiving rate denotes the number of packets the receiver received in the previous period, used to realize the rate-protection strategy; the target rate denotes the number of packets expected to be sent in the next period, calculated by the formula in Fig. 5 (see step 1 in Fig. 5):
elastic_gain * f.max_rtt / f.min_rtt
where the preset value elastic_gain determines the minimum sending rate of the flow, because when the queue in the link reaches its limit, f.max_rtt/f.min_rtt = 1; the finally computed sending rate denotes the number of packets actually sent in the next period, where multi_gain indicates the aggressiveness of bandwidth probing.
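A minimal numeric sketch of the target-rate formula follows. Units and values are illustrative (RTTs given in microseconds), and the multi_gain scaling of the final sending rate is omitted because its formula is only given in Fig. 5:

```python
def target_rate(elastic_gain: int, max_rtt: float, min_rtt: float) -> int:
    """Packets desired in the next period: elastic_gain * max_rtt / min_rtt.
    When the link queue is full, max_rtt == min_rtt, so the target
    degenerates to elastic_gain, the flow's minimum sending rate."""
    return int(elastic_gain * max_rtt / min_rtt)

# an idle link (measured RTT well below the maximum) yields a high target:
#   target_rate(4, 400, 100) -> 16 packets
# a saturated link collapses to the elastic_gain floor:
#   target_rate(4, 400, 400) -> 4 packets
```

The ratio f.max_rtt/f.min_rtt thus acts as a headroom estimate: the further the measured queueing delay sits below the link's maximum, the more aggressively the next period's budget scales above the floor.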
Through the coflow information inference mechanism, the present embodiment avoids the dependence of coflow scheduling schemes on acquiring coflow information in advance: it reduces the complexity of scheduling-scheme design to exploiting the self-similarity of machine learning coflows, and it simplifies the existing coflow scheduling framework so as to ease deployment. Elastic rate control improves link utilization while optimizing the performance of the scheduling policy, reducing the average completion time of machine learning coflows.
This scheme simplifies the existing centralized coflow scheduling framework by migrating part of its modules to the receiving end, as shown in Fig. 6.
The algorithm shown in Fig. 3 resides in the coflow semantic analysis module of Fig. 6, and the algorithm shown in Fig. 5 resides in the elastic flow scheduling module of Fig. 6. The controller mainly implements the coflow semantic analysis function and contains a coflow information collection module, a communication pattern matching module, and a coflow size inference module. The transmitting end periodically reports the latest flow information to the controller; the coflow information collection module organizes flows from the same machine learning task into coflows according to the existing flow information and distinguishes coflows belonging to different training epochs. The communication pattern matching module records completed coflows in order to match currently unfinished coflows; a successfully matched coflow need not enter the coflow size inference module, and priority decisions for its member flows are issued directly to the receiving end using the matched result. The coflow size inference module implements the coflow information inference algorithm described above: it divides a new coflow into a probe flow and non-probe flows, and updates the priority of the non-probe flows according to the results of the coflow information collection module.
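The grouping performed by the coflow information collection module can be sketched as follows; the flow-record fields (task, server, epoch) are illustrative assumptions, not identifiers fixed by the patent:

```python
from collections import defaultdict

def group_into_coflows(flow_records):
    """Organize flow records into coflows: all flows of one task that
    share the same receiving parameter server and training epoch form
    one coflow, so coflows from different epochs stay distinct."""
    coflows = defaultdict(list)
    for rec in flow_records:
        key = (rec["task"], rec["server"], rec["epoch"])
        coflows[key].append(rec)
    return dict(coflows)
```

Keying on the epoch is what lets the framework tell apart coflows of the same task from different training rounds.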
The controller periodically issues scheduling decisions to the receiving end. The modules located at the end host implement the received coflow policy through elastic flow scheduling; they comprise a priority marking module, a measurement update module, and a transport-layer rate control module. After the priority marking module receives the controller's priority update for each flow in a coflow, it marks the corresponding priority on each flow's packets and ensures that they enter the multi-level feedback priority queues in the operating system, realizing the shortest-job-first policy. The measurement update module collects the RTT of the end-to-end link together with the numbers of packets sent and received in the previous cycle, providing the data basis for the transport-layer rate control module of the next step. The transport-layer rate control module implements the algorithm shown in Fig. 5: it computes the number of packets to send in the next cycle and, with the help of a rate limiter, ensures that the corresponding packets are sent out normally from the NIC.
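The interaction between the transport-layer rate control module and the rate limiter can be sketched as a minimal per-cycle packet budget; the class and method names below are illustrative, not taken from the patent:

```python
class CycleRateLimiter:
    """Minimal sketch of the rate limiter: at each cycle boundary it
    is loaded with the packet budget computed by the transport-layer
    rate control module, and each send consumes one unit of budget."""

    def __init__(self):
        self.budget = 0

    def start_cycle(self, packets_allowed: int):
        # Budget for the next cycle, e.g. the output of the Fig. 5
        # rate computation.
        self.budget = packets_allowed

    def try_send(self) -> bool:
        if self.budget > 0:
            self.budget -= 1
            return True
        return False  # hold the packet until the next cycle begins
```

A real implementation would pace packets within the cycle rather than release them in a burst; the budget semantics are the same.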
The foregoing is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the present invention shall not be deemed limited to these descriptions. Those skilled in the art to which the present invention belongs may, without departing from the concept of the present invention, make several equivalent substitutions or obvious modifications with identical performance or use, all of which shall be deemed to fall within the protection scope of the present invention.

Claims (12)

1. A traffic scheduling method for a machine learning framework, wherein the machine learning obtains, through a data-parallel model, different machine learning models over a large-scale data set; in the machine learning, the large-scale data set is partitioned and stored across multiple distributed nodes; working instances running on the distributed nodes compute gradient values of the model parameters by training on their local partial data sets and send them to a parameter server for model updating; the parameter server aggregates multiple groups of gradient values to perform model training and distributes the updated model parameters back to the working instances; characterized in that the flows sent by multiple working instances to the same parameter server are organized into one coflow, and the flows sent by the parameter server to the multiple instances are organized into another coflow, thereby realizing coflow scheduling at the flow level within the distributed machine learning framework.
2. The traffic scheduling method for a machine learning framework according to claim 1, characterized by comprising a coflow information inference mechanism for detecting the potential congestion contribution of a coflow as early as possible, the coflow information inference mechanism comprising the following steps:
S1: after a machine learning task starts, the coflow-based scheduling framework obtains the number n of flows in a coflow by counting the active flows;
S2: one flow in the coflow is randomly selected as the probe flow, and its packet transmission is guaranteed to complete as early as possible, so as to obtain its flow size f;
S3: exploiting the self-similarity of machine learning coflows, the coflow size is obtained as n*f; this coflow size is treated as equivalent to the coflow's potential congestion contribution at the shared forwarding node and serves as a basis for judging the priority of the coflow;
S4: the priorities are updated.
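Steps S1-S3 above can be sketched as follows, assuming each flow object exposes its completed size once the probe finishes at top priority (an illustrative API, not fixed by the claims):

```python
import random

def infer_coflow_size(active_flows):
    """Coflow size inference (steps S1-S3).

    active_flows: flow objects of one coflow, each exposing
    completed_size once its transfer has finished.
    """
    n = len(active_flows)                # S1: count the active flows
    probe = random.choice(active_flows)  # S2: random probe flow,
                                         #     finished first at top priority
    f = probe.completed_size             #     its size f is now known
    return n * f                         # S3: self-similarity gives n*f
```

Because the flows of one coflow carry gradients of the same model, their sizes are near-identical, which is why measuring one probe flow suffices.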
3. The traffic scheduling method for a machine learning framework according to claim 2, characterized in that coflows from a newly created machine learning task must be added to the active coflow set through information inference; during information inference, one of the multiple coflows of the new task is randomly selected according to the edge switch to which the receiving-end parameter server belongs; within the selected coflow, one flow is randomly selected as the probe flow according to the physical host on which the transmitting-end working instance resides.
4. The traffic scheduling method for a machine learning framework according to claim 3, characterized in that packets from the probe flow are labeled and enter the probe queue, where they enjoy the highest priority; the probe flow is randomly designated within a randomly selected coflow, i.e., probe flow selection adopts a dual-random design.
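The dual-random selection can be sketched as follows; grouping the new task's coflows in a mapping keyed by the edge switch of the receiving parameter server is an assumed data layout:

```python
import random

def select_probe_flow(coflows_by_switch):
    """Dual-random probe selection: first pick one coflow at random
    (grouped here by the edge switch of its parameter server), then
    pick one of its flows at random (grouped in the patent by the
    sending instance's physical host)."""
    switch = random.choice(sorted(coflows_by_switch))
    coflow = coflows_by_switch[switch]
    return random.choice(coflow)
```

The two independent random draws spread probe traffic across both switches and hosts, so no single queue is monopolized by probes.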
5. The traffic scheduling method for a machine learning framework according to claim 4, characterized in that the highest priority of the probe flow, combined with the elastic rate control algorithm, guarantees rapid growth of the probe flow's sending rate.
6. The traffic scheduling method for a machine learning framework according to claim 2, characterized in that before the priority update starts, non-probe flows from a new task enter the active queue, guaranteeing them a minimum sending rate at which they are not starved under the current network configuration.
7. The traffic scheduling method for a machine learning framework according to claim 6, characterized in that when the flow-priority update timer expires, all non-probe flows, including those of unfinished coflows, enter the corresponding priority queue according to the probing result of the probe flow to which they belong; the probe flow size determines the priority of the non-probe flows by influencing the coflow size inference, and the priority update enforces the shortest-job-first scheduling policy.
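The mapping from inferred coflow size to a multi-level feedback queue level can be sketched as follows; the concrete thresholds are illustrative, since the patent does not fix numeric values:

```python
def priority_for(coflow_size_bytes, thresholds=(10e6, 100e6, 1e9)):
    """Map an inferred coflow size (n*f) to a priority level for the
    multi-level feedback queues, 0 being the highest priority.
    Smaller coflows get higher priority, realizing the
    shortest-job-first policy of the priority update."""
    for level, limit in enumerate(thresholds):
        if coflow_size_bytes <= limit:
            return level
    return len(thresholds)  # the largest coflows get lowest priority
```

On timer expiry, every non-probe flow is re-marked with the level computed from its coflow's inferred size.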
8. The traffic scheduling method for a machine learning framework according to claim 5, characterized in that the elastic rate control algorithm comprises: when sufficient available bandwidth exists, the rate of a flow grows multiplicatively so as to occupy the available bandwidth quickly and raise link utilization.
9. The traffic scheduling method for a machine learning framework according to claim 8, characterized in that: the RTT of the link traversed by each flow in a coflow is collected at the end system to estimate the available bandwidth, realizing end-system-based rate control; and compatibility with the existing TCP/IP protocol stack is achieved in a manner similar to setting the send window size.
10. The traffic scheduling method for a machine learning framework according to claim 9, characterized in that, for a flow f, the target rate, i.e., the number of packets expected to be sent in the next cycle, is calculated by the following formula:
elastic_gain*f.max_rtt/f.min_rtt
wherein the preset value elastic_gain determines the minimum sending rate of the flow; f.max_rtt denotes the maximum acceptable queue length on the link, whose size depends on the network configuration; and f.min_rtt denotes the queue length on the link within one measurement period.
11. A traffic scheduling system for a machine learning framework, for implementing the method according to any one of claims 1-10, characterized by comprising a controller that receives the latest flow information periodically reported by the transmitting end and implements the coflow semantic analysis function; the controller comprises a coflow information collection module, a communication pattern matching module, and a coflow size inference module;
the coflow information collection module is configured to organize flows from a machine learning task into coflows according to the existing flow information and to distinguish coflows from different training epochs;
the communication pattern matching module is configured to record completed coflows in order to match currently unfinished coflows; a successfully matched coflow need not enter the coflow size inference module, and decisions for its member flows are issued directly to the receiving end using the matched result;
the coflow size inference module is configured to implement the above coflow information inference algorithm, dividing a new coflow into a probe flow and non-probe flows and updating the priority of the non-probe flows according to the results of the coflow information collection module.
12. The traffic scheduling system for a machine learning framework according to claim 11, characterized in that the controller is further configured to periodically issue scheduling decisions to the receiving end; the modules located at the end host implement the received coflow policy through elastic flow scheduling and comprise a priority marking module, a measurement update module, and a transport-layer rate control module;
after the priority marking module receives the controller's priority update for each flow in a coflow, it marks the corresponding priority on each flow's packets and ensures that they enter the multi-level feedback priority queues in the operating system, so as to realize the shortest-job-first policy;
the measurement update module collects the RTT of the end-to-end link together with the numbers of packets sent and received in the previous cycle, providing the data basis for the transport-layer rate control module of the next step;
the transport-layer rate control module is configured to calculate the number of packets to send in the next cycle and, with the help of a rate limiter, ensure that the corresponding packets are sent out normally from the NIC.
CN201810569876.6A 2018-06-05 2018-06-05 Traffic scheduling method facing machine learning framework Active CN108768876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810569876.6A CN108768876B (en) 2018-06-05 2018-06-05 Traffic scheduling method facing machine learning framework

Publications (2)

Publication Number Publication Date
CN108768876A true CN108768876A (en) 2018-11-06
CN108768876B CN108768876B (en) 2022-01-11

Family

ID=63999879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810569876.6A Active CN108768876B (en) 2018-06-05 2018-06-05 Traffic scheduling method facing machine learning framework

Country Status (1)

Country Link
CN (1) CN108768876B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110958187A (en) * 2019-12-17 2020-04-03 电子科技大学 Distributed machine learning parameter-oriented synchronous differential data transmission method
CN111078659A (en) * 2019-12-20 2020-04-28 腾讯科技(深圳)有限公司 Model updating method, model updating device, computer readable storage medium and computer equipment
CN111612155A (en) * 2020-05-15 2020-09-01 湖南大学 Distributed machine learning system and communication scheduling method suitable for same
CN111628940A (en) * 2020-05-15 2020-09-04 清华大学深圳国际研究生院 Flow scheduling method, device, system, switch and computer storage medium
CN113194086A (en) * 2021-04-27 2021-07-30 新华三信息安全技术有限公司 Anti-attack method and device
CN113839884A (en) * 2020-06-24 2021-12-24 华为技术有限公司 Flow control method and device
CN115062771A (en) * 2022-08-16 2022-09-16 之江实验室 Distributed machine learning gradient convergence method and device and model training method
CN115271102A (en) * 2022-09-26 2022-11-01 太极计算机股份有限公司 Task-oriented priority method and system for machine learning engine
CN115499306A (en) * 2022-07-29 2022-12-20 天翼云科技有限公司 Method and device for constructing traffic scheduling model, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030133411A1 (en) * 1997-10-23 2003-07-17 Kabushiki Kaisha Toshiba Communication resource management method and node control device using priority control and admission control
CN101692657A (en) * 2009-10-22 2010-04-07 北京交通大学 Differentiated service core router and data forwarding method thereof
US20100135158A1 (en) * 2008-12-01 2010-06-03 Razoom, Inc. Flow State Aware QoS Management Without User Signalling
CN104852887A (en) * 2014-02-17 2015-08-19 上海宽带技术及应用工程研究中心 Network flow tracing system and method based on OpenFlow technology
EP3200410A1 (en) * 2016-01-28 2017-08-02 Alcatel Lucent Method and system for queueing packets in communication networks

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110958187A (en) * 2019-12-17 2020-04-03 电子科技大学 Distributed machine learning parameter-oriented synchronous differential data transmission method
CN110958187B (en) * 2019-12-17 2021-05-18 电子科技大学 Distributed machine learning parameter-oriented synchronous differential data transmission method
CN111078659A (en) * 2019-12-20 2020-04-28 腾讯科技(深圳)有限公司 Model updating method, model updating device, computer readable storage medium and computer equipment
CN111078659B (en) * 2019-12-20 2023-04-21 腾讯科技(深圳)有限公司 Model updating method, device, computer readable storage medium and computer equipment
CN111612155A (en) * 2020-05-15 2020-09-01 湖南大学 Distributed machine learning system and communication scheduling method suitable for same
CN111628940A (en) * 2020-05-15 2020-09-04 清华大学深圳国际研究生院 Flow scheduling method, device, system, switch and computer storage medium
CN111612155B (en) * 2020-05-15 2023-05-05 湖南大学 Distributed machine learning system and communication scheduling method suitable for same
CN113839884A (en) * 2020-06-24 2021-12-24 华为技术有限公司 Flow control method and device
CN113839884B (en) * 2020-06-24 2023-08-22 华为技术有限公司 Flow control method and device
CN113194086A (en) * 2021-04-27 2021-07-30 新华三信息安全技术有限公司 Anti-attack method and device
CN113194086B (en) * 2021-04-27 2022-05-27 新华三信息安全技术有限公司 Anti-attack method and device
CN115499306A (en) * 2022-07-29 2022-12-20 天翼云科技有限公司 Method and device for constructing traffic scheduling model, electronic equipment and storage medium
CN115499306B (en) * 2022-07-29 2024-03-12 天翼云科技有限公司 Method and device for constructing flow scheduling model, electronic equipment and storage medium
CN115062771B (en) * 2022-08-16 2022-11-25 之江实验室 Distributed machine learning gradient convergence method and device and model training method
CN115062771A (en) * 2022-08-16 2022-09-16 之江实验室 Distributed machine learning gradient convergence method and device and model training method
CN115271102B (en) * 2022-09-26 2022-12-16 太极计算机股份有限公司 Task-oriented priority method and system for machine learning engine
CN115271102A (en) * 2022-09-26 2022-11-01 太极计算机股份有限公司 Task-oriented priority method and system for machine learning engine

Also Published As

Publication number Publication date
CN108768876B (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN108768876A (en) A kind of traffic scheduling method of Machine oriented learning framework
Meng et al. Online deadline-aware task dispatching and scheduling in edge computing
Chen et al. Round-robin synchronization: Mitigating communication bottlenecks in parameter servers
CN104714852B (en) A kind of parameter synchronization optimization method and its system suitable for distributed machines study
CN104993941B (en) One kind is based on Openflow network high fault tolerance virtual network mapping algorithms
CN102724103B (en) Proxy server, hierarchical network system and distributed workload management method
CN108282415A (en) A kind of dispatching method and equipment
CN109104373A (en) The processing method of network congestion, apparatus and system
CN110381541A (en) A kind of smart grid slice distribution method and device based on intensified learning
CN103825838B (en) A kind of data center removes bandwidth fragmentation stream scheduling method
CN109492753A (en) A kind of method of the stochastic gradient descent of decentralization
Zhang et al. Tuning the aggressive TCP behavior for highly concurrent HTTP connections in intra-datacenter
CN110990140B (en) Method for scheduling distributed machine learning flow in photoelectric switching network
CN106934454B (en) Test-schedule method in network on three-dimensional chip based on Petri network
CN109445386A (en) A kind of most short production time dispatching method of the cloud manufacturing operation based on ONBA
CN103699433A (en) Method and system for performing dynamic adjustment on number of tasks in Hadoop platform
CN111767146A (en) Distributed machine learning system acceleration method based on network reconfiguration
CN109714795A (en) A kind of method for managing resource, resource management system and device based on SDN network slice
CN114938372B (en) Federal learning-based micro-grid group request dynamic migration scheduling method and device
CN109976873A (en) The scheduling scheme acquisition methods and dispatching method of containerization distributed computing framework
CN105591790B (en) Data communication connection pool management device
Tao et al. DRL-Driven Digital Twin Function Virtualization for Adaptive Service Response in 6G Networks
Zhang et al. ATFQ: a fair and efficient packet scheduling method in multi-resource environments
CN108616569B (en) A kind of Based on Distributed calculates the light Measurement Request dispatching method of application
CN106533756B (en) A kind of communication feature extracts, flow generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant