CN105956021B - Automated task parallelization method and system for distributed machine learning - Google Patents
Automated task parallelization method and system for distributed machine learning
- Publication number
- CN105956021B CN201610255970.5A CN201610255970A CN 105956021 B
- Authority
- CN
- China
- Prior art keywords
- node
- module
- stage
- key
- variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24532—Query optimisation of parallel queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2323—Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Discrete Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Multi Processors (AREA)
Abstract
The present invention provides an automated task parallelization method and system for distributed machine learning, which addresses a defect of existing distributed machine learning programming interfaces: because only a key-value read/write interface is provided, the system's data-access behavior is tightly coupled with the application logic. This defect aggravates contention for network bandwidth in a distributed cluster and makes it difficult for programmers to parallelize tasks. The system of the present invention comprises a worker node module, a service node module, a master node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. By providing higher-level programming abstractions, the present invention decouples read/write access behavior from the application logic: the runtime system first partitions tasks dynamically according to the load of the service nodes, and then executes the machine learning task in parallel automatically, significantly reducing the burden on programmers of writing highly concurrent machine learning applications.
Description
Technical field
The present invention belongs to the field at the intersection of distributed computing and machine learning, and relates in particular to an automated task parallelization method and system for distributed machine learning.
Background art
Machine learning algorithms, as a conventional means of mining the value of data, are widely used in fields such as natural language processing, text analysis, speech recognition, autonomous driving, and bioinformatics. With the arrival of the big data era, the value of data, and especially the commercial value it contains, has become increasingly prominent, and machine learning has therefore received growing attention. However, as the scale of the data and of the model parameters to be learned grows, a single compute node can no longer satisfy the demands of large-scale machine learning because of the limits of its memory, computing resources, and memory bandwidth. Distributing traditional single-node machine learning has therefore become a new and necessary trend. Once machine learning is distributed, more compute nodes can be used to process larger data sets, the time needed to train the resulting model is shortened, and the accuracy of the learned model is improved. Distributed machine learning has received wide attention in both industry and academia; for example, Google trained a cat-face recognition model with its distributed system DistBelief, the Apache Software Foundation developed the Hadoop-based open-source machine learning framework Mahout, and the AMP Lab at UC Berkeley developed Spark, a distributed computing system suitable for machine learning algorithms.
Most distributed machine learning algorithms are iterative in nature: training ends only after a predetermined number of iterations has been run or the model parameters have converged to a stable state. Traditional distributed frameworks such as MapReduce, because of defects in their synchronization mechanisms, perform poorly on such iterative computations.
A newer type of distributed machine learning system is the parameter server architecture. The parameters referred to here are the key-value pairs (key, value), two-dimensional matrices, or multi-dimensional matrices used to describe model parameters in machine learning; a multi-dimensional matrix is also called a tensor. In the parameter server architecture, the compute nodes in the cluster are divided into two classes: one class is called worker nodes and the other is called service nodes. The service nodes are responsible for maintaining the global model parameters, including responding to worker nodes' queries on and updates to the model parameters. A worker node loads a partial data set from the global training data into local memory, uses the algorithm specified by the application logic to work out which model parameters its computation needs, issues a query to the service nodes, and transfers the required model parameters into local memory over the network; it then uses the application-specified algorithm and the fetched model parameters to compute the new model parameters w or the parameter updates Δw. After one round of iterative computation, the worker node issues update and synchronization operations on the global model parameters to the service nodes. The behavior of a worker node within one complete iteration of distributed machine learning can be summarized as the following steps:
1. The worker node loads its partial data set;
2. The worker node works out which model parameters it needs and obtains them through the model access interface provided by the underlying system;
3. The application logic computes the new model parameters w or the parameter updates Δw;
4. The worker node pushes the newly computed model parameters w or parameter updates Δw to the service nodes for parameter update and synchronization.
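To make these four steps concrete, the following Python sketch shows one worker's training loop against a toy in-memory parameter server. The ToyServer class, the sparse linear model, and the gradient rule are illustrative assumptions for this example only, not the system described by the present invention.

```python
# Minimal, self-contained sketch of the four worker steps in a parameter-server setup:
# load data, pull the needed parameters, compute an update, and push the update back.

class ToyServer:
    """Stands in for the service nodes that maintain the global model parameters."""
    def __init__(self, dim):
        self.params = {k: 0.0 for k in range(dim)}    # (key, value) model parameters

    def pull(self, keys):
        return {k: self.params[k] for k in keys}      # answer a worker's query

    def push(self, delta):
        for k, v in delta.items():                    # merge a worker's pushed update
            self.params[k] += v

def worker(server, samples, lr=0.1, iterations=5):
    for _ in range(iterations):
        keys = {k for x, _ in samples for k in x}     # step 2: keys this worker needs
        w = server.pull(keys)                         # pull the required parameters
        delta = {k: 0.0 for k in keys}
        for x, y in samples:                          # step 3: gradient of squared error
            pred = sum(w[k] * v for k, v in x.items())
            for k, v in x.items():
                delta[k] += -lr * (pred - y) * v
        server.push(delta)                            # step 4: push the update Δw

server = ToyServer(dim=4)
data = [({0: 1.0, 2: 0.5}, 1.0), ({1: 1.0, 3: 2.0}, 0.0)]  # sparse (features, label) pairs
worker(server, data)
print(server.params)
```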
Steps 2, 3, and 4 above are the key steps of the iterative computation. Obtaining the model parameters needed for the computation through the global model read/write interface, and pushing the newly computed model parameters or parameter updates to the service nodes, are the main sources of network traffic in the system.
Regarding step 2: because the model parameters are huge, the resulting volume of network traffic is also huge. With a fixed amount of network bandwidth, the network transmission delay within an iteration can exceed the computation time for a single worker node, which lengthens the overall model training time; when multiple worker nodes trigger network transfers at the same time, contention for bandwidth arises and the transmission delay becomes even longer. The behavior with which worker nodes trigger access to model parameters is closely tied to the upper-layer application logic. The bottom-layer interface provided by current parameter server architectures is a unified interface for global parameter access, so the system's global-parameter access behavior is tightly coupled with the application logic, which makes optimization from the system level difficult.
Regarding step 3: computing the model parameters on the worker node is a compute-intensive operation. In the current many-core and multi-core era, how to maximally parallelize this computation is crucial for improving system concurrency. Current distributed machine learning systems provide no corresponding parallelization programming interface and offer only a global model read/write interface, so a programmer needs parallel programming experience in order to write a highly concurrent machine learning application.
Regarding step 4: for the network transmission bottleneck in parameter synchronization there are two existing solutions. One is to change the synchronization model, i.e. to allow the iteration progress of different worker nodes to differ to a certain extent and to perform bulk synchronization (BSP, Bulk Synchronous Parallel) only after the difference in progress reaches a threshold; this scheme alleviates contention for network bandwidth to some degree. The other solution is to control the resource occupancy of the parameter server and to choose different synchronization intervals for different worker nodes so as to avoid bursts of requests, while ensuring that the chosen intervals both reduce communication frequency and preserve training accuracy.
Summary of the invention
In view of the above drawbacks of, or needs for improvement in, the prior art, the present invention provides an automated task parallelization method and system for distributed machine learning. First, the model parameter access interface is decoupled from the application logic, so that the system's access behavior for model parameters becomes adjustable at run time, which lays the foundation for optimizations such as network transmission and system-level parallelization. Second, the application is logically decomposed into several stages, from which a directed acyclic graph (DAG) is constructed to describe the dependencies between the computation stages; the runtime system uses the DAG to divide tasks automatically and execute them in parallel, improving the degree of system concurrency. The above method and system effectively resolve the network transmission bottleneck in existing distributed machine learning systems and improve system concurrency, thereby improving overall system performance.
To achieve the above goals, according to one aspect of the present invention, an automated task parallelization method and system for distributed machine learning is provided, which specifically includes a worker node module, a service node module, a master node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. The stage module and the scheduler module are both connected to the tensor module; the stage module is connected to the stage group module; the execution engine module is connected to the stage module; and the scheduler module, the tensor module, and the stage group module are all connected to the message tracking module.
The worker node module and the service node module are abstract descriptions of the behavior of the worker nodes and the parameter service nodes respectively, and both modules are transparent to the machine learning programmer.
The master node module is an abstract description of the master node. The role of the master node is to coordinate the workflow of the whole system, for example the initialization and the termination of the system. Of the system modules mentioned above, all modules other than the worker node module, the service node module, and the master node module are present on every node.
The tensor module is used to describe the key-value pairs (key, value) of model parameters in machine learning. An application needs multiple tensor objects to describe the model parameters required for training, and each tensor object has a tensor_id attribute as its unique identifier. There are three types of tensor object: globally shared (global shared), globally unique (global unique), and local (local). Globally shared means the tensor object is maintained by distributed nodes and the data maintained by different nodes may intersect; globally unique means the tensor object is maintained by distributed nodes and the data maintained by different nodes do not intersect; local means the tensor object exists only on a single node. A tensor object provides operation interfaces such as load (load), pull (pull), and push (push) for the programmer to use.
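The Python sketch below illustrates the tensor abstraction just described: a unique tensor_id, one of three sharing types, and load/pull/push interfaces. The class and the duck-typed server argument are simplifying assumptions made for this example, not the actual implementation of the invention.

```python
# Illustrative sketch of a tensor object with tensor_id, sharing type, and
# load/pull/push operation interfaces.

from enum import Enum

class TensorType(Enum):
    GLOBAL_SHARED = "global_shared"   # maintained by many nodes, key sets may overlap
    GLOBAL_UNIQUE = "global_unique"   # maintained by many nodes, key sets are disjoint
    LOCAL = "local"                   # exists only on one node

class Tensor:
    def __init__(self, tensor_id, ttype):
        self.tensor_id = tensor_id    # unique identifier of the tensor object
        self.type = ttype
        self.data = {}                # (key, value) pairs of model parameters

    def load(self, pairs):
        """Load an initial set of (key, value) pairs into local memory."""
        self.data.update(pairs)

    def pull(self, keys, server):
        """Fetch the listed keys from the node(s) that maintain them."""
        self.data.update(server.pull(keys))
        return {k: self.data[k] for k in keys}

    def push(self, updates, server):
        """Send locally computed updates back to the maintaining node(s)."""
        server.push(updates)
```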
The stage module is used to describe a segment of program logic in the application. The present invention decomposes the overall logic of the application into different stages, and each stage object contains a stage_id attribute as its unique identifier. Dependencies between stage objects are declared by calling the dependency-setting function set_dependency. A stage object takes several tensor objects as its input and an optional output. A stage's inputs are of two types: one is called the primary variable primary_variable, the other is called the secondary variable secondary_variable. The (key, value) pairs of a primary variable have no dependencies between their keys, whereas the (key, value) pairs of a secondary variable do have dependencies between their keys. For each stage the programmer must provide a kernel function kernel_function, which is the core logic of that stage. The programmer must also provide a mapping function between the keys of the primary variable and the keys of the secondary variable (the key_projection function); the runtime system derives the keys of the secondary variable automatically from the keys of the primary variable and the key_projection function. Each primary variable and secondary variable of a stage has a corresponding variable called its update_variable. The update_variable is used to update the corresponding variable, and the update logic is defined by a user-provided update_function.
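The following sketch shows how a stage could bundle these pieces: a stage_id, primary and secondary variables, the user-supplied kernel_function, the key_projection mapping, the update_function, and set_dependency edges. The class itself is a simplified assumption that mirrors the names above; it is not the invention's actual code.

```python
# Illustrative sketch of the stage abstraction and its user-supplied functions.

class Stage:
    def __init__(self, stage_id, primary_variable, secondary_variable,
                 kernel_function, key_projection, update_function):
        self.stage_id = stage_id
        self.primary_variable = primary_variable        # keys have no mutual dependence
        self.secondary_variable = secondary_variable    # keys derived from primary keys
        self.kernel_function = kernel_function          # core logic of this stage
        self.key_projection = key_projection            # maps a primary key to secondary key(s)
        self.update_function = update_function          # merges an update_variable into its variable
        self.dependencies = []

    def set_dependency(self, other_stage):
        """Declare that this stage must run after other_stage (an edge in the DAG)."""
        self.dependencies.append(other_stage)

    def secondary_keys(self, primary_keys):
        """Derive the secondary key set from the primary key set, as the runtime does."""
        keys = set()
        for k in primary_keys:
            keys.update(self.key_projection(k))
        return keys
```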
The stage group module is used to describe a group of stages. The stages represented by a stage group are closely related. A stage group has a group_id attribute as its unique identifier and exposes two interfaces, run and set_barrier. The optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed. The set_barrier interface sets a synchronization operation: it indicates that, after the current stage group has finished executing, all worker nodes must enter a barrier wait state and may continue only after the stage group has finished on every worker node.
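A minimal sketch of the stage group interfaces follows. The wait_at_barrier stub and the assumption that each stage exposes its own run() method are illustrative; in a real system the barrier would block until every worker node arrives.

```python
# Illustrative sketch of a stage group with run(num_run) and set_barrier().

def wait_at_barrier():
    """Stub for the cluster-wide barrier; a real system would block until all workers arrive."""
    pass

class StageGroup:
    def __init__(self, group_id, stages):
        self.group_id = group_id
        self.stages = stages          # closely related stages, already in topological order
        self.barrier = False

    def set_barrier(self):
        """Require a global barrier after this group completes on every worker node."""
        self.barrier = True

    def run(self, num_run=1):
        """Execute every stage in the group num_run times, in dependency order."""
        for _ in range(num_run):
            for stage in self.stages:
                stage.run()           # assumes each stage exposes its own run() method
        if self.barrier:
            wait_at_barrier()
```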
The scheduler module decides, for a given tensor object, the set of keys key_set that the worker node will process in the next phase. The scheduler module on a service node periodically broadcasts the bandwidth information of its node; the scheduler module on a worker node, based on the service-node bandwidth information it has obtained, decides the key set of model parameters assigned to the worker node for the next round of processing.
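The sketch below illustrates the scheduling decision: given the load reported by each service node and the distribution of a tensor's keys across those nodes, the worker is assigned its next key set from the still-unprocessed keys held by the least-loaded node. Function and variable names are assumptions made for the example.

```python
# Illustrative sketch of load-aware key assignment by the scheduler module.

def next_key_set(server_loads, keys_by_server, processed_keys):
    """server_loads: {server_id: load}; keys_by_server: {server_id: set of keys}."""
    # Consider only servers that still hold unprocessed keys for this tensor.
    candidates = {s: keys_by_server[s] - processed_keys
                  for s in keys_by_server if keys_by_server[s] - processed_keys}
    if not candidates:
        return set()
    least_loaded = min(candidates, key=lambda s: server_loads[s])
    return candidates[least_loaded]

loads = {"s1": 0.8, "s2": 0.2}
keys = {"s1": {1, 2, 3}, "s2": {4, 5, 6}}
print(next_key_set(loads, keys, processed_keys={4}))   # -> {5, 6}, from the lighter node s2
```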
The execution engine module describes the stages of a stage group and their dependency relations as a directed acyclic graph (DAG). In this DAG, a node represents a stage and an edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at its head.
The message tracking module records, while the program is running, the messages submitted by the tensor module, the scheduler module, the stage group module, the worker node module, the service node module, and the master node module. When a message is submitted to the message tracking module, the module is responsible for delivering it to the recipient; after the recipient returns a message receipt, the message tracking module notifies the original sender of the message and delivers the acknowledgement.
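The sketch below illustrates that submit / deliver / receipt / acknowledge cycle with simplified in-process classes; the direct method calls stand in for real network messaging and are assumptions for the example only.

```python
# Illustrative sketch of the message tracking behaviour: submit, deliver, receipt, acknowledge.

class MessageTracker:
    def __init__(self):
        self.log = []                                    # record of every submitted message

    def submit(self, sender, recipient, payload):
        self.log.append((sender.name, recipient.name, payload))
        receipt = recipient.receive(payload)             # deliver and wait for the receipt
        sender.acknowledge(receipt)                      # notify the original sender

class Node:
    def __init__(self, name):
        self.name = name

    def receive(self, payload):
        return f"{self.name} received: {payload}"

    def acknowledge(self, receipt):
        print(f"{self.name} got ack: {receipt}")

tracker = MessageTracker()
tracker.submit(Node("worker-0"), Node("master"), "job_done")
```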
Correspondingly, the present invention also provides an automated task parallelization method for distributed machine learning, used for automatic task division and automatic parallel execution in distributed machine learning scenarios. It includes a system initialization step, a parallel training step, and a system termination step, wherein:
(1) System initialization step: initialize the node topology information and the application logic, which specifically includes the following sub-steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, the role being worker node, service node, or master node; go to sub-step (1.2);
(1.2) The worker nodes and service nodes each communicate with the master node to report their node information, and the master node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the worker nodes and service nodes receive the node information sent by the master node, they initialize the node topology information, which is used for subsequent inter-node communication; go to step (1.4);
(1.4) The worker nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups from the order in which they appear in the program code, and constructs the DAG corresponding to each stage group; go to step (2);
(2) Parallel training step: the master node and the service nodes skip the specific training logic and proceed to step (3); each worker node enters the model training state and performs iterative parallel training on its input training data subset until the predefined iteration termination condition is met. The behavior of a worker node specifically includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the runtime system of the worker node topologically sorts the nodes of its DAG to determine the execution order of all stages within each stage group; go to step (2.2);
(2.2) Call the stage group that has not yet been executed next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3), or go to step (2.6) if there is currently no unexecuted stage group;
(2.3) Execute the run method of the stage group indicated by next_group num_run times (num_run is a parameter supplied by the user when the program starts). For a single invocation of the run method, the runtime creates a batch of threads and, following the stage execution order within the stage group determined in step (2.1), executes the run methods of all stages in turn: stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion. After the stage group has executed num_run times, go to step (2.4);
(2.4) The runtime system of the worker node checks whether set_barrier is set on the stage group that has just executed num_run times; if set_barrier is set, it performs the barrier synchronization operation; go to step (2.5);
(2.5) If there is still a stage group that has not been executed, set next_group to that stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The runtime system of the worker node checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1);
(3) System termination step: the worker nodes inform the master node that their work is complete; after the master node detects that the work of all worker nodes is complete, it notifies all nodes to exit the program in a coordinated way, which specifically includes the following sub-steps:
(3.1) All worker nodes send a job_done message to the master node; after the master node has received the job_done messages of all worker nodes, it sends a sys_exit message to all worker nodes and service nodes; go to step (3.2);
(3.2) After the worker nodes and service nodes receive the sys_exit message, they send a sys_exit_ack message to the master node; go to step (3.3);
(3.3) The master node receives the sys_exit_ack messages sent by all worker nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
In step (2.1) above, the process of determining the execution order of the stages within a stage group specifically includes the following sub-steps:
(2.1.1) Set the current unassigned number order to 0 and set the node set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add the nodes whose in-degree in the current DAG is 0 to the set nodes; number all nodes in the set nodes with order, and increment order by 1; remove the nodes in the set nodes and all of their outgoing edges from the DAG, and set the set nodes back to empty; go to step (2.1.3);
(2.1.3) Check whether the current DAG is empty; if it is empty, go to step (2.2), otherwise go to step (2.1.2).
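The sketch below illustrates sub-steps (2.1.1)–(2.1.3) as a layered topological numbering of a stage-group DAG: stages with the same number have no mutual dependence and may be pipelined, and smaller numbers execute first. The edge representation and example stage names are assumptions made for the example.

```python
# Illustrative sketch of layered topological numbering over a stage-group DAG.
# edges: set of (u, v) meaning stage u must finish before stage v.

def layered_topological_order(edges, stages):
    order, numbering = 0, {}
    remaining = set(stages)
    while remaining:                                           # (2.1.3): loop until the DAG is empty
        indeg0 = {s for s in remaining
                  if not any(u in remaining and v == s for u, v in edges)}
        for s in indeg0:                                       # (2.1.2): number the in-degree-0 layer
            numbering[s] = order
        order += 1
        remaining -= indeg0                                    # remove the layer and its out-edges
    return numbering

edges = {("load", "compute"), ("compute", "push"), ("load", "log")}
print(layered_topological_order(edges, ["load", "compute", "push", "log"]))
# e.g. {'load': 0, 'compute': 1, 'log': 1, 'push': 2}
```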
The run method of a stage described in step (2.3) above specifically includes the following sub-steps:
(2.3.1) The runtime system of the worker node calls the stage's prepare_variables method to determine the key set primary_key_set of the current stage's primary variable primary_variable to be processed. Specifically, based on the service node loads (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the keys of the primary variable primary_variable across the service nodes, the portion of keys maintained on the service node with the lowest network load that has not yet been processed by the worker node is assigned to the worker node as the key set primary_key_set to be processed next; go to sub-step (2.3.2);
(2.3.2) From the key_projection function provided by the user and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull methods of the tensor objects such as the primary variable and the secondary variable to pull the required model parameters; go to step (2.3.3);
(2.3.3) Execute the stage's kernel function kernel_function: the runtime system automatically divides the key set key_set of the primary variable into num_threads parts and creates num_threads threads to execute the kernel function in parallel, where num_threads is a parameter provided by the user; go to step (2.3.4);
(2.3.4) Running the kernel function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the update_function provided by the user. If the type of the variable v is globally shared or globally unique, the runtime calls the push function of the variable v to propagate the update: the (key, value) pairs to be updated are serialized, and the serialized data are sent to all service nodes that maintain this segment of keys; after a service node receives the update data, it updates the data it maintains.
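The sketch below strings sub-steps (2.3.1)–(2.3.4) together: choose the primary key set, derive the secondary key set via key_projection, pull both variables, split the primary keys across num_threads threads that each run kernel_function, then merge the partial updates with update_function and push them. All names are stand-ins for the interfaces described above, and the assumption that kernel_function returns a dictionary of partial updates and that update_function takes (old, new) is made only for this example.

```python
# Illustrative sketch of a stage's run method on a worker node.

from concurrent.futures import ThreadPoolExecutor

def run_stage(stage, scheduler, server, num_threads):
    primary_keys = scheduler.next_key_set(stage.primary_variable)                  # (2.3.1)
    secondary_keys = {k2 for k in primary_keys for k2 in stage.key_projection(k)}  # (2.3.2)
    primary = server.pull(primary_keys)
    secondary = server.pull(secondary_keys)

    # (2.3.3): split the primary key set into num_threads parts and run the kernel in parallel.
    chunks = [list(primary_keys)[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial_updates = list(pool.map(
            lambda ks: stage.kernel_function(ks, primary, secondary), chunks))

    # (2.3.4): merge the partial updates via update_function and push the result.
    update = {}
    for part in partial_updates:
        for key, value in part.items():
            update[key] = stage.update_function(update.get(key), value)
    server.push(update)                                   # serialized and sent to the service nodes
```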
The pull method described in step (2.3.2) above has the following sub-steps:
(2.3.2.1) The key set key_set that the tensor object is to pull is serialized, and the serialized data are sent to the service node that maintains this segment of keys; go to step (2.3.2.2);
(2.3.2.2) After the service node receives the pull message, it serializes the (key, value) two-tuple data corresponding to key_set and returns the serialized data to the requester.
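As an illustration of this request/response exchange, the sketch below serializes the key set on the worker side, looks it up on the service-node side, and returns the serialized (key, value) pairs. The use of JSON and a direct function call in place of real network transport is an assumption made purely for the example.

```python
# Illustrative sketch of the pull protocol: serialize keys, look them up, return serialized pairs.

import json

def worker_pull(key_set, service_node_params):
    request = json.dumps(sorted(key_set))                # (2.3.2.1): serialize the key set
    reply = service_node_handle_pull(service_node_params, request)
    return {int(k): v for k, v in json.loads(reply).items()}

def service_node_handle_pull(parameters, request):
    keys = json.loads(request)                           # (2.3.2.2): look up the requested keys
    return json.dumps({k: parameters[k] for k in keys if k in parameters})

params = {1: 0.5, 2: -1.2, 3: 0.0}
print(worker_pull({1, 3}, params))                       # -> {1: 0.5, 3: 0.0}
```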
Through the above method, the technical solution conceived by the present invention has, in general, the following advantages and technical effects compared with the prior art:
(1) The present invention provides programming modules whose level of abstraction is higher than that of a global read/write access interface; these modules decouple read/write access behavior from the application logic, which on the one hand greatly facilitates the programmer in writing applications and on the other hand lays the foundation for system-level optimization;
(2) The present invention realizes automated parallel execution of machine learning tasks, which greatly reduces the burden on application programmers of writing highly concurrent machine learning applications;
(3) The runtime system developed by the present invention automatically divides tasks dynamically according to the load of each service node, making full use of the network bandwidth resources.
Brief description of the drawings
Fig. 1 is the module block diagram of the automated task parallelization system of the present invention;
Fig. 2 is the overall workflow diagram of the automated task parallelization method of the present invention;
Fig. 3 is the system-initialization sub-workflow diagram of the automated task parallelization method of the present invention;
Fig. 4 is the parallel-training sub-workflow diagram of the automated task parallelization method of the present invention;
Fig. 5 is the system-termination sub-workflow diagram of the automated task parallelization method of the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Fig. 1 is the module block diagram of the automated task parallel execution system of the present invention. As shown in Fig. 1, the automated task parallelization system of the present invention specifically includes a worker node module, a service node module, a master node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. The stage module and the scheduler module are both connected to the tensor module; the stage module is connected to the stage group module; the execution engine module is connected to the stage module; and the scheduler module, the tensor module, and the stage group module are all connected to the message tracking module.
The worker node module and the service node module are abstract descriptions of the behavior of the worker nodes and the parameter service nodes respectively, and both modules are transparent to the machine learning programmer.
The master node module is an abstract description of the master node. The role of the master node is to coordinate the workflow of the whole system, for example the initialization and the termination of the system. Of the system modules mentioned above, all modules other than the worker node module, the service node module, and the master node module are present on every node.
The tensor module is used to describe the key-value pairs (key, value) of model parameters in machine learning. An application needs multiple tensor objects to describe the model parameters required for training, and each tensor object has a tensor_id attribute as its unique identifier. There are three types of tensor object: globally shared (global shared), globally unique (global unique), and local (local). Globally shared means the tensor object is maintained by distributed nodes and the data maintained by different nodes may intersect; globally unique means the tensor object is maintained by distributed nodes and the data maintained by different nodes do not intersect; local means the tensor object exists only on a single node. A tensor object provides operation interfaces such as load (load), pull (pull), and push (push) for the programmer to use.
The stage module is used to describe a segment of program logic in the application. The present invention decomposes the overall logic of the application into different stages, and each stage object contains a stage_id attribute as its unique identifier. Dependencies between stage objects can be declared by calling the dependency-setting function set_dependency. A stage object takes several tensor objects as its input and an optional output. A stage's inputs are of two types: one is called the primary variable primary_variable, the other is called the secondary variable secondary_variable. The (key, value) pairs of a primary variable have no dependencies between their keys, whereas the (key, value) pairs of a secondary variable do have dependencies between their keys. For each stage the programmer must provide a kernel function kernel_function, which is the core logic of that stage. The programmer must also provide the mapping function key_projection between the keys of the primary variable and the keys of the secondary variable; the runtime system derives the keys of the secondary variable automatically from the keys of the primary variable and the key_projection function. Each primary variable and secondary variable of a stage has a corresponding variable called its update_variable. The update_variable is used to update the corresponding variable, and the update logic is defined by a user-provided update_function.
The stage group module is used to describe a group of stages. The stages represented by a stage group are closely related. A stage group has a group_id attribute as its unique identifier and exposes two interfaces, run and set_barrier. The optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed. The set_barrier interface sets a synchronization operation: it indicates that, after the current stage group has finished executing, all worker nodes must enter a barrier wait state and may continue only after the stage group has finished on every worker node.
The scheduler module decides, for a given tensor object, the set of keys key_set that the worker node will process in the next phase. The scheduler module on a service node periodically broadcasts the bandwidth information of its node; the scheduler module on a worker node, based on the service-node bandwidth information it has obtained, decides the key set of model parameters assigned to the worker node for the next round of processing.
The execution engine module describes the stages of a stage group and their dependency relations as a directed acyclic graph (DAG). In this DAG, a node represents a stage and an edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at its head.
The message tracking module records, while the program is running, the messages submitted by the tensor module, the scheduler module, the stage group module, the worker node module, the service node module, and the master node module. When a message is submitted to the message tracking module, the module is responsible for delivering it to the recipient; after the recipient returns a message receipt, the message tracking module notifies the original sender of the message and delivers the acknowledgement.
Fig. 2 is the overall workflow diagram of the automated task parallelization method of the present invention. As shown in Fig. 2, the overall workflow of the automated task parallelization method of the present invention includes the following steps:
(1) System initialization step: initialize the node topology information and the application logic;
(2) Parallel training step: the master node and the service nodes skip the specific training logic and proceed to step (3); each worker node enters the model training state and performs iterative parallel training on its input training data subset until the predefined iteration termination condition is met;
(3) System termination step: the worker nodes inform the master node that their work is complete; after the master node detects that the work of all worker nodes is complete, it notifies all nodes to exit the program in a coordinated way.
Fig. 3 is the system-initialization sub-workflow diagram of the automated task parallel execution method of the present invention. As shown in Fig. 3, the system-initialization workflow of the automated task parallel execution method of the present invention includes the following steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, the role being worker node, service node, or master node; go to sub-step (1.2);
(1.2) The worker nodes and service nodes each communicate with the master node to report their node information, and the master node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the worker nodes and service nodes receive the node information sent by the master node, they initialize the node topology information, which is used for subsequent inter-node communication; go to step (1.4);
(1.4) The worker nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups from the order in which they appear in the program code, and constructs the DAG corresponding to each stage group; go to step (2).
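A minimal sketch of this initialization flow is shown below: every node reads the configuration to learn its role, workers and service nodes report to the master, and the master broadcasts the collected topology. The configuration format and the stubbed communication helpers are assumptions made for the example, not the invention's actual protocol.

```python
# Illustrative sketch of the initialization workflow in Fig. 3 with stubbed communication.

def initialize(config):
    """config: {'role': 'worker'|'server'|'master', 'nodes': [...], 'master_address': ...}."""
    role = config["role"]                                    # (1.1): determine own role
    if role == "master":
        topology = collect_node_info(config["nodes"])        # (1.2): gather node info
        broadcast(topology, config["nodes"])                 # broadcast it to all other nodes
    else:
        report_to_master(config["master_address"], role)     # (1.2): report own info
        topology = wait_for_topology()                       # (1.3): receive the broadcast
    return topology                                          # used for later communication

# Stubs standing in for real network communication.
def collect_node_info(nodes):
    return {n: "info" for n in nodes}

def broadcast(topology, nodes):
    pass

def report_to_master(address, role):
    pass

def wait_for_topology():
    return {}
```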
Fig. 4 is the parallel-training sub-workflow diagram of the automated task parallelization method of the present invention. As shown in Fig. 4, the parallel-training sub-workflow of a worker node includes the following steps:
(2) Parallel training step: the master node and the service nodes skip the specific training logic and proceed to step (3); each worker node enters the model training state and performs iterative parallel training on its input training data subset until the predefined iteration termination condition is met. The behavior of a worker node specifically includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the runtime system of the worker node topologically sorts the nodes of its DAG to determine the execution order of all stages within each stage group; go to step (2.2);
(2.2) Call the stage group that has not yet been executed next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3), or go to step (2.6) if there is currently no unexecuted stage group;
(2.3) Execute the run method of the stage group indicated by next_group num_run times (num_run is a parameter supplied by the user when the program starts). For a single invocation of the run method, the runtime creates a batch of threads and, following the stage execution order within the stage group determined in step (2.1), executes the run methods of all stages in turn: stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion. After the stage group has executed num_run times, go to step (2.4);
(2.4) The runtime system of the worker node checks whether set_barrier is set on the stage group that has just executed num_run times; if set_barrier is set, it performs the barrier synchronization operation; go to step (2.5);
(2.5) If there is still a stage group that has not been executed, set next_group to that stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The runtime system of the worker node checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1).
Fig. 5 is the system-termination sub-workflow diagram of the automated task parallelization method of the present invention. As shown in Fig. 5, the system-termination sub-workflow of the automated task parallelization method of the present invention includes the following steps:
(3.1) All worker nodes send a job_done message to the master node; after the master node has received the job_done messages of all worker nodes, it sends a sys_exit message to all worker nodes and service nodes; go to step (3.2);
(3.2) After the worker nodes and service nodes receive the sys_exit message, they send a sys_exit_ack message to the master node; go to step (3.3);
(3.3) The master node receives the sys_exit_ack messages sent by all worker nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
Further, the process of determining the execution order of the stages within a stage group in step (2.1) specifically includes the following sub-steps:
(2.1.1) Set the current unassigned number order to 0 and set the node set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add the nodes whose in-degree in the current DAG is 0 to the set nodes; number all nodes in the set nodes with order, and increment order by 1; remove the nodes in the set nodes and all of their outgoing edges from the DAG, and set the set nodes back to empty; go to step (2.1.3);
(2.1.3) Check whether the current DAG is empty; if it is empty, go to step (2.2), otherwise go to step (2.1.2).
Further, the run method of a stage described in step (2.3) specifically includes the following sub-steps:
(2.3.1) The runtime system of the worker node calls the stage's prepare_variables method to determine the key set primary_key_set of the current stage's primary variable primary_variable to be processed. Specifically, based on the service node loads (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the keys of the primary variable primary_variable across the service nodes, the portion of keys maintained on the service node with the lowest network load that has not yet been processed by the worker node is assigned to the worker node as the key set to be processed next; go to sub-step (2.3.2);
(2.3.2) From the key_projection function provided by the user and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull methods of the tensor objects such as the primary variable and the secondary variable to pull the required model parameters; go to step (2.3.3);
(2.3.3) Execute the stage's kernel function kernel_function: the runtime system automatically divides the key set key_set of the primary variable into num_threads parts and creates num_threads threads to execute the kernel function in parallel, where num_threads is a parameter provided by the user; go to step (2.3.4);
(2.3.4) Running the kernel function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the update_function provided by the user. If the type of the variable v is globally shared or globally unique, the runtime calls the push function of the variable v to propagate the update: the (key, value) pairs to be updated are serialized, and the serialized data are sent to all service nodes that maintain this segment of keys; after a service node receives the update data, it updates the data it maintains.
Further, the pull method described in step (2.3.2) has the following sub-steps:
(2.3.2.1) The key set key_set that the tensor object is to pull is serialized, and the serialized data are sent to the service node that maintains this segment of keys; go to step (2.3.2.2);
(2.3.2.2) After the service node receives the pull message, it serializes the (key, value) two-tuple data corresponding to key_set and returns the serialized data to the requester.
As will be readily understood by those skilled in the art, the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (5)
1. An automated task parallelization system for distributed machine learning, characterized in that it comprises a worker node module, a service node module, a master node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module; wherein the stage module and the scheduler module are both connected to the tensor module, the stage module is connected to the stage group module, the execution engine module is connected to the stage module, and the scheduler module, the tensor module, and the stage group module are all connected to the message tracking module;
the worker node module and the service node module are used respectively to describe abstractly the behavior of the worker nodes and the parameter service nodes;
the master node module is an abstract description of the master node; the master node coordinates the workflow of the whole system, including the initialization and the termination of the system;
the tensor module is used to describe the key-value pairs (key, value) of model parameters in machine learning; an application needs multiple tensor objects to describe the model parameters required for training, and each tensor object has a tensor_id attribute as its unique identifier; there are three types of tensor object: globally shared (global shared), globally unique (global unique), and local (local); globally shared means the tensor object is maintained by distributed nodes and the data maintained by different nodes may intersect; globally unique means the tensor object is maintained by distributed nodes and the data maintained by different nodes do not intersect; local means the tensor object exists only on a single node; a tensor object provides operation interfaces such as load (load), pull (pull), and push (push) for the programmer to use;
the stage module is used to describe a segment of program logic in the application; the overall logic of the application is decomposed into different stages, and each stage object contains a stage_id attribute as its unique identifier; dependencies between stage objects can be declared by calling the dependency-setting function set_dependency; a stage object takes several tensor objects as its input and an optional output; the inputs of a stage object are of two types: one is called the primary variable primary_variable, the other is called the secondary variable secondary_variable; the (key, value) pairs of a primary variable have no dependencies between their keys, whereas the (key, value) pairs of a secondary variable do have dependencies between their keys; for each stage the programmer must provide a kernel function kernel_function as the core logic of that stage; the programmer must also provide the mapping function key_projection between the keys of the primary variable and the keys of the secondary variable, and the runtime system derives the keys of the secondary variable automatically from the keys of the primary variable and the key_projection function; each primary variable and secondary variable of a stage has a corresponding variable called its update_variable; the update_variable is used to update the corresponding variable, and the update logic is defined by a user-provided update_function;
the stage group module is used to describe a group of stages; the stages represented by a stage group are closely related; a stage group has a group_id attribute as its unique identifier and exposes two interfaces, run and set_barrier; the optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed; the set_barrier interface sets a synchronization operation, indicating that after the current stage group has finished executing, all worker nodes must enter a barrier wait state and may continue only after the stage group has finished on every worker node;
the scheduler module decides, for a given tensor object, the set of keys key_set that the worker node will process in the next phase; the scheduler on a service node periodically broadcasts the bandwidth information of its node, and the scheduler on a worker node, based on the service-node bandwidth information it has obtained, decides the key set of model parameters assigned to the worker node for the next round of processing;
the execution engine module is used to describe the stages of a stage group and their dependency relations as a directed acyclic graph (DAG); in this DAG, a node represents a stage and an edge represents a dependency between stages, the stage at the tail of an edge executing before the stage at its head;
the message tracking module is used to record, while the program is running, the messages submitted by the tensor module, the scheduler module, the stage group module, the worker node module, the service node module, and the master node module; when a message is submitted to the message tracking module, the module is responsible for delivering it to the recipient; after the recipient returns a message receipt, the message tracking module notifies the original sender of the message and delivers the acknowledgement.
2. An automated task parallelization method for distributed machine learning, characterized in that it comprises a system initialization step, a parallel training step, and a system termination step, wherein:
(1) System initialization step: initialize the node topology information and the application logic, which specifically includes the following sub-steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, the role being worker node, service node, or master node; go to sub-step (1.2);
(1.2) The worker nodes and service nodes each communicate with the master node to report their node information, and the master node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the worker nodes and service nodes receive the node information sent by the master node, they initialize the node topology information, which is used for subsequent inter-node communication; go to step (1.4);
(1.4) The worker nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups from the order in which they appear in the program code, and constructs the DAG corresponding to each stage group; go to step (2);
(2) Parallel training step: the master node and the service nodes skip the specific training logic and proceed to step (3); each worker node enters the model training state and performs iterative parallel training on its input training data subset until the predefined iteration termination condition is met; the behavior of a worker node specifically includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the runtime system of the worker node topologically sorts the nodes of its DAG to determine the execution order of all stages within each stage group; go to step (2.2);
(2.2) Call the stage group that has not yet been executed next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3), or go to step (2.6) if there is currently no unexecuted stage group;
(2.3) Execute the run method of the stage group indicated by next_group num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and, following the stage execution order within the stage group determined in step (2.1), executes the run methods of all stages in turn, stages with smaller numbers executing first and stages with the same number executing in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The runtime system of the worker node checks whether set_barrier is set on the stage group that has just executed num_run times; if set_barrier is set, it performs the barrier synchronization operation; go to step (2.5); the set_barrier interface sets a synchronization operation, indicating that after the current stage group has finished executing, all worker nodes must enter a barrier wait state and may continue only after the stage group has finished on every worker node;
(2.5) If there is still a stage group that has not been executed, set next_group to that stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The runtime system of the worker node checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1);
(3) System termination step: the worker nodes inform the master node that their work is complete; after the master node detects that the work of all worker nodes is complete, it notifies all nodes to exit the program in a coordinated way, which specifically includes the following sub-steps:
(3.1) All worker nodes send a job_done message to the master node; after the master node has received the job_done messages of all worker nodes, it sends a sys_exit message to all worker nodes and service nodes; go to step (3.2);
(3.2) After the worker nodes and service nodes receive the sys_exit message, they send a sys_exit_ack message to the master node; go to step (3.3);
(3.3) The master node receives the sys_exit_ack messages sent by all worker nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
3. The automated task parallelization method for distributed machine learning according to claim 2, characterized in that the process of determining the execution order of the stages within a stage group in step (2.1) specifically includes the following sub-steps:
(2.1.1) Set the current unassigned DAG node number order to 0 and set the node set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add the nodes whose in-degree in the current DAG is 0 to the set nodes; number all nodes in the set nodes with order, and increment order by 1; remove the nodes in the set nodes and all of their outgoing edges from the DAG, and set the set nodes back to empty; go to step (2.1.3);
(2.1.3) Check whether the current DAG is empty; if it is empty, go to step (2.2), otherwise go to step (2.1.2).
4. The automated task parallelization method for distributed machine learning according to claim 2, characterized in that the run method of a stage described in step (2.3) specifically includes the following sub-steps:
(2.3.1) The runtime system of the worker node calls the stage's prepare_variables method to determine the key set primary_key_set of the current stage's primary variable primary_variable to be processed; specifically, based on the service node loads (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the keys of the primary variable primary_variable across the service nodes, the portion of keys maintained on the service node with the lowest network load that has not yet been processed by the worker node is assigned to the worker node as the key set primary_key_set to be processed next; go to sub-step (2.3.2);
(2.3.2) From the key_projection function provided by the user and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull methods of the tensor objects such as the primary variable and the secondary variable to pull the required model parameters; go to step (2.3.3);
(2.3.3) Execute the stage's kernel function kernel_function: the runtime system automatically divides the key set key_set of the primary variable into num_threads parts and creates num_threads threads to execute the kernel function in parallel, where num_threads is a parameter provided by the user; go to step (2.3.4);
(2.3.4) Running the kernel function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the update_function provided by the user; if the type of the variable v is globally shared or globally unique, the runtime calls the push function of the variable v to propagate the update: the (key, value) pairs to be updated are serialized, and the serialized data are sent to all service nodes that maintain this segment of keys; after a service node receives the update data, it updates the data it maintains.
5. The automated task parallelization method for distributed machine learning according to claim 4, characterized in that the pull method described in step (2.3.2) has the following sub-steps:
(2.3.2.1) The key set key_set that the tensor object is to pull is serialized, and the serialized data are sent to the service node that maintains this segment of keys; go to step (2.3.2.2);
(2.3.2.2) After the service node receives the pull message, it serializes the (key, value) two-tuple data corresponding to key_set and returns the serialized data to the requester.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610255970.5A CN105956021B (en) | 2016-04-22 | 2016-04-22 | Automated task parallelization method and system for distributed machine learning
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610255970.5A CN105956021B (en) | 2016-04-22 | 2016-04-22 | Automated task parallelization method and system for distributed machine learning
Publications (2)
Publication Number | Publication Date |
---|---|
CN105956021A CN105956021A (en) | 2016-09-21 |
CN105956021B true CN105956021B (en) | 2019-05-21 |
Family
ID=56915367
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610255970.5A Active CN105956021B (en) | 2016-04-22 | 2016-04-22 | Automated task parallelization method and system for distributed machine learning
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105956021B (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108009642B (en) * | 2016-10-31 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Distributed machine learning method and system |
CN108229686B (en) * | 2016-12-14 | 2022-07-05 | 阿里巴巴集团控股有限公司 | Model training and predicting method and device, electronic equipment and machine learning platform |
EP3376441B1 (en) * | 2017-03-15 | 2021-07-14 | Siemens Aktiengesellschaft | A method for execution of a machine learning model on memory restricted industrial device |
CN108733461B (en) * | 2017-04-18 | 2021-09-14 | 北京京东尚科信息技术有限公司 | Distributed task scheduling method and device |
CN107231558B (en) * | 2017-05-23 | 2019-10-22 | 江苏火米互动科技有限公司 | A kind of implementation method of the H.264 parallel encoder based on CUDA |
CN111597187B (en) * | 2017-08-30 | 2023-09-01 | 第四范式(北京)技术有限公司 | Distributed system for performing machine learning and method thereof |
CN109447274B (en) * | 2017-08-30 | 2021-02-09 | 第四范式(北京)技术有限公司 | Distributed system for performing machine learning and method thereof |
CN111079942B (en) * | 2017-08-30 | 2023-03-24 | 第四范式(北京)技术有限公司 | Distributed system for performing machine learning and method thereof |
CN109814986B (en) * | 2017-11-20 | 2021-01-05 | 上海寒武纪信息科技有限公司 | Task parallel processing method, storage medium, computer equipment, device and system |
CN107944566B (en) * | 2017-11-28 | 2020-12-22 | 杭州云脑科技有限公司 | Machine learning method, main node, working node and system |
CN109960570B (en) * | 2017-12-14 | 2021-09-03 | 北京图森智途科技有限公司 | Multi-module scheduling method, device and system |
CN108681777B (en) * | 2018-05-07 | 2021-07-20 | 北京京东尚科信息技术有限公司 | Method and device for running machine learning program based on distributed system |
CN109871958B (en) * | 2019-02-01 | 2023-07-28 | 东软医疗系统股份有限公司 | Method, device and equipment for training model |
WO2020243973A1 (en) * | 2019-06-06 | 2020-12-10 | 华为技术有限公司 | Model-based signal inference method and apparatus |
US11907770B2 (en) | 2019-09-19 | 2024-02-20 | Huawei Cloud Computing Technologies Co., Ltd. | Method and apparatus for vectorized resource scheduling in distributed computing systems using tensors |
CN110990059B (en) * | 2019-11-28 | 2021-11-19 | 中国科学院计算技术研究所 | Stream type calculation engine operation method and system for tilt data |
TWI780382B (en) * | 2019-12-05 | 2022-10-11 | 新唐科技股份有限公司 | Microcontroller updating system and method |
CN111506402B (en) * | 2020-03-31 | 2023-06-27 | 上海氪信信息技术有限公司 | Computer task scheduling method, device, equipment and medium for machine learning modeling |
CN111580970B (en) * | 2020-05-07 | 2023-02-03 | 电子科技大学 | Transmission scheduling method for model distribution and aggregation of federated learning |
CN111753997B (en) * | 2020-06-28 | 2021-08-27 | 北京百度网讯科技有限公司 | Distributed training method, system, device and storage medium |
US11954611B2 (en) | 2020-08-27 | 2024-04-09 | International Business Machines Corporation | Tensor comparison across a distributed machine learning environment |
CN112214256B (en) * | 2020-09-30 | 2024-02-02 | 招商局金融科技有限公司 | Machine learning operation control method and device, electronic equipment and storage medium |
CN113157413B (en) * | 2021-04-16 | 2022-04-26 | 上海交通大学 | Deep learning task resource optimization configuration method and system based on service quality requirement |
CN113703980B (en) * | 2021-08-31 | 2024-09-06 | 西安电子科技大学 | Distributed machine learning system and communication scheduling method suitable for same |
CN114461392B (en) * | 2022-01-25 | 2023-03-31 | 西南交通大学 | Bandwidth-aware selective data multicast method |
CN115314397B (en) * | 2022-08-05 | 2023-07-21 | 中科计算技术西部研究院 | Network simulation method, system, device and storage medium for distributed training |
CN116483580B (en) * | 2022-09-29 | 2024-05-28 | 陕西震旦纪信息技术有限公司 | System and method for scheduling server computing resources based on Kubernetes |
CN116662039B (en) * | 2023-07-25 | 2024-01-23 | 菲特(天津)检测技术有限公司 | Industrial information parallel detection method, device and medium based on shared memory |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8027938B1 (en) * | 2007-03-26 | 2011-09-27 | Google Inc. | Discriminative training in machine learning |
CN102546247A (en) * | 2011-12-29 | 2012-07-04 | 华中科技大学 | Massive data continuous analysis system suitable for stream processing |
CN103763378A (en) * | 2014-01-24 | 2014-04-30 | 中国联合网络通信集团有限公司 | Task processing method and system and nodes based on distributive type calculation system |
CN104360903A (en) * | 2014-11-18 | 2015-02-18 | 北京美琦华悦通讯科技有限公司 | Method for realizing task data decoupling in spark operation scheduling system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9563670B2 (en) * | 2013-03-14 | 2017-02-07 | Leidos, Inc. | Data analytics system |
2016-04-22: Application CN201610255970.5A filed in China (CN); patent CN105956021B; current status: Active.
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8027938B1 (en) * | 2007-03-26 | 2011-09-27 | Google Inc. | Discriminative training in machine learning |
CN102546247A (en) * | 2011-12-29 | 2012-07-04 | 华中科技大学 | Massive data continuous analysis system suitable for stream processing |
CN103763378A (en) * | 2014-01-24 | 2014-04-30 | 中国联合网络通信集团有限公司 | Task processing method and system and nodes based on distributive type calculation system |
CN104360903A (en) * | 2014-11-18 | 2015-02-18 | 北京美琦华悦通讯科技有限公司 | Method for realizing task data decoupling in spark operation scheduling system |
Non-Patent Citations (4)
Title |
---|
Distributed GraphLab: A Framework for Machine Learning; Yucheng Low; Proceedings of the VLDB Endowment; 2012-05-31; full text |
Petuum: A Framework for Iterative-Convergent; Wei Dai; Proceedings of Advances in Neural Information Processing Systems; 2013-12-31; full text |
Scaling Distributed Machine Learning; Mu Li; Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation; 2014-12-31; full text |
Cache coherence maintenance mechanism in a transactional-memory-based distributed programming environment; Yu Linchen et al.; Microelectronics & Computer; 2013-03-31; full text |
Also Published As
Publication number | Publication date |
---|---|
CN105956021A (en) | 2016-09-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105956021B (en) | A kind of automation task suitable for distributed machines study parallel method and its system | |
Kim et al. | Strads: A distributed framework for scheduled model parallel machine learning | |
CN105117286B (en) | The dispatching method of task and streamlined perform method in MapReduce | |
Ward et al. | Colmena: Scalable machine-learning-based steering of ensemble simulations for high performance computing | |
US10754709B2 (en) | Scalable task scheduling systems and methods for cyclic interdependent tasks using semantic analysis | |
US20240111586A1 (en) | Multi-policy intelligent scheduling method and apparatus oriented to heterogeneous computing power | |
Yu et al. | Automated runtime-aware scheduling for multi-tenant dnn inference on gpu | |
CN109891438B (en) | Numerical quantum experiment method and system | |
CN112416585B (en) | Deep learning-oriented GPU resource management and intelligent scheduling method | |
CN112764893B (en) | Data processing method and data processing system | |
CN104243617A (en) | Task scheduling method and system facing mixed load in heterogeneous cluster | |
US20210390405A1 (en) | Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof | |
CN112948066A (en) | Spark task scheduling method based on heterogeneous resources | |
CN112052081A (en) | Task scheduling method and device and electronic equipment | |
Rocha et al. | Pipetune: Pipeline parallelism of hyper and system parameters tuning for deep learning clusters | |
CN117009038B (en) | Graph computing platform based on cloud native technology | |
Feljan et al. | Task allocation optimization for multicore embedded systems | |
CN106844024B (en) | GPU/CPU scheduling method and system of self-learning running time prediction model | |
US20240193721A1 (en) | System and method for adaptive graph-to-stream scheduling | |
US10719903B2 (en) | On-the fly scheduling of execution of dynamic hardware behaviors | |
CN114925591A (en) | Automatic parallel strategy searching method based on polyhedron model modeling and related equipment | |
US20100131740A1 (en) | Data processing system and data processing method | |
CN116991878A (en) | Method and system for generating distributed execution plan based on Q-learning | |
Zhou et al. | Scheduling-efficient framework for neural network on heterogeneous distributed systems and mobile edge computing systems | |
CN113902567B (en) | Method and device for executing tasks and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |