CN105956021A - Automated task parallel method suitable for distributed machine learning and system thereof - Google Patents

Automated task parallel method suitable for distributed machine learning and system thereof

Info

Publication number
CN105956021A
CN105956021A CN201610255970.5A
Authority
CN
China
Prior art keywords
node
module
stage
key
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610255970.5A
Other languages
Chinese (zh)
Other versions
CN105956021B (en)
Inventor
廖小飞
曹镇山
郭人通
刘海坤
金海
陆枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201610255970.5A priority Critical patent/CN105956021B/en
Publication of CN105956021A publication Critical patent/CN105956021A/en
Application granted granted Critical
Publication of CN105956021B publication Critical patent/CN105956021B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2323Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Discrete Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides an automated task-parallel method and system for distributed machine learning. The method and system address a defect of the programming interface in existing distributed machine learning frameworks: because only a key-value read/write interface is provided, the system's data access behavior is tightly coupled with the application logic. This defect intensifies competition for network bandwidth in a distributed cluster and makes it difficult for programmers to parallelize tasks. The system comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. By providing higher-level programming abstractions, the system decouples read/write access behavior from application logic. At run time, the system first dynamically partitions tasks according to the load of each service node and then executes the machine learning task automatically in parallel, greatly reducing the burden on programmers of writing highly concurrent machine learning applications.

Description

Automated task-parallel method and system for distributed machine learning
Technical field
The invention belongs to the intersecting field of distributed computing and machine learning, and specifically relates to an automated task-parallel method and system for distributed machine learning.
Background art
As a traditional approach to mining the value of data, machine learning algorithms are widely used in fields such as natural language processing, text analysis, speech recognition, autonomous driving, and bioinformatics. With the arrival of the big-data era, the value contained in data, including its commercial value, has become increasingly apparent, and machine learning has therefore received growing attention. However, as the scale of the data and of the model parameters to be learned keeps increasing, a single compute node, limited in its memory, computing, and memory-bandwidth resources, can no longer meet the demands of large-scale machine learning. Distributing traditional single-node machine learning has therefore become a necessary trend. Once machine learning is distributed, more compute nodes can be used to process larger data, the time required to train a model is shortened, and the accuracy of the learned model can be improved. Distributed machine learning has attracted wide attention in both industry and academia; for example, Google used the distributed system DistBelief to train a cat-face recognition model, the Apache Software Foundation developed the Hadoop-based distributed machine learning framework Mahout, and the UC Berkeley AMP Lab open-sourced Spark, a distributed computing system suitable for machine learning algorithms.
Most distributed machine learning algorithms are iterative: training ends either after a predetermined number of iterations or once the model parameters converge to a stable state. Traditional distributed frameworks such as MapReduce perform poorly on iterative workloads because of the defects of their synchronization mechanisms, so their performance on such tasks is unsatisfactory.
A newer type of distributed machine learning system is the parameter-server architecture. The parameters here are the key-value pairs (key, value) that describe model parameters in machine learning, or two-dimensional matrices, or multi-dimensional matrices; a multi-dimensional matrix is also called a tensor. In the parameter-server architecture, the compute nodes in the cluster are divided into two classes: working nodes and service nodes. Service nodes maintain the global model parameters and handle the working nodes' query and update operations on those parameters. A working node loads a subset of the global training data into local memory, uses the algorithm specified by the application logic to decide which model parameters it needs, issues query operations to the service nodes, and transfers the required model parameters over the network into local memory; it then uses the application logic and the retrieved parameters to compute new model parameters w or parameter updates Δw. After one round of iterative computation, the working node issues update operations to the service nodes to synchronize the global model parameters. In distributed machine learning, the behavior of a working node within one complete iteration can be summarized as the following steps:
1. The working node loads its partition of the data set;
2. The working node determines which model parameters it needs and fetches them through the model access interface provided by the underlying system;
3. The working node computes new model parameters w or parameter updates Δw according to the application logic;
4. The working node pushes the newly computed parameters w or updates Δw to the service nodes, which perform parameter update and synchronization.
Steps 2, 3, and 4 above are the critical steps of the iterative computation. Fetching the required model parameters through the global read/write interface and pushing the newly computed parameters or parameter updates to the service nodes are the main sources of network traffic in the system; the sketch below illustrates this loop.
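For concreteness, the following Python-style sketch shows how one working node runs this loop under the parameter-server model. It is an illustration only; the names server.pull, server.push, needed_keys_fn, and compute_update are assumptions standing in for the global read/write interface and the application logic, not interfaces defined by the invention.

    # Minimal sketch of a working node's iterative loop in a parameter-server setting.
    # `server`, `needed_keys_fn`, and `compute_update` are hypothetical stand-ins for
    # the global read/write interface and the application logic discussed above.
    def train_worker(local_dataset, server, num_iterations, needed_keys_fn, compute_update):
        for _ in range(num_iterations):
            for batch in local_dataset:                  # step 1: local data partition
                keys = needed_keys_fn(batch)             # step 2: decide which parameters are needed
                params = server.pull(keys)               # step 2: fetch them over the network
                delta_w = compute_update(batch, params)  # step 3: compute new w or update delta_w
                server.push(delta_w)                     # step 4: push the update for synchronization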
Regarding step 2: because the model parameters are huge, the resulting transfer volume is also huge. With a fixed amount of network bandwidth, a single working node may spend more time waiting on the network during an iteration than computing, which lengthens the overall training time; when multiple working nodes trigger network transfers simultaneously, they compete for bandwidth and the waiting time becomes even longer. The way a working node accesses model parameters is closely tied to the upper-layer application logic. The interface provided by current parameter-server architectures is a unified global-parameter access interface, which tightly couples the system's global-parameter access behavior with the application logic and makes it hard to optimize from the system side.
Regarding step 3: computing the model parameters is a compute-intensive operation. In the current many-core and multi-core era, maximizing the parallelism of this computation is essential for improving system concurrency. Existing distributed machine learning systems do not provide a corresponding parallel programming interface; they only provide a global read/write interface for the model, so programmers need parallel-programming experience to write highly concurrent machine learning applications.
Regarding step 4: for the network bottleneck in parameter synchronization, two kinds of solutions exist. One is to change the synchronization model, i.e., to allow the iteration progress of different working nodes to differ within a bound and to perform bulk synchronization (BSP, Bulk Synchronous Parallel) once the difference reaches a threshold; this alleviates bandwidth competition to some extent. The other is to control how the parameter servers' resources are used by choosing different synchronization intervals for different working nodes, avoiding bursts of simultaneous requests while ensuring that the chosen intervals both reduce communication frequency and preserve training accuracy.
Summary of the invention
To address the above defects and improvement needs of the prior art, the invention provides an automated task-parallel method and system for distributed machine learning. First, the model-parameter access interface is decoupled from the application logic, so that the system can adjust its parameter-access behavior at run time; this lays the foundation for optimizations such as network transfer scheduling and system-level parallelization. Second, the application logic is decomposed into stages, and a directed acyclic graph (DAG) is built to describe the dependencies among the computation stages; the runtime system uses the DAG to divide tasks automatically and execute them in parallel, improving system concurrency. The method and system effectively address the network-transfer bottleneck of existing distributed machine learning systems and improve system concurrency, thereby improving overall system performance.
To achieve these goals, according to one aspect of the present invention, an automated task-parallel method and system for distributed machine learning are provided. The system comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. The stage module and the scheduler module are both connected with the tensor module; the stage module is connected with the stage group module; the execution engine module is connected with the stage module; the scheduler module, the tensor module, and the stage group module are all connected with the message tracking module.
The working node module and the service node module are abstract descriptions of the behavior of working nodes and parameter service nodes, respectively, and both modules are transparent to machine learning programmers.
The host node module is an abstract description of the host node. The host node coordinates the overall workflow of the system, such as system initialization and system shutdown. Of the modules above, all except the working node module, the service node module, and the host node module are present on every node.
The tensor module describes the key-value pairs (key, value) of model parameters in machine learning. An application uses several tensor objects to describe the model parameters needed for training; each tensor object has a tensor_id attribute as its unique identifier. There are three types of tensor objects: globally shared (global shared), globally unique (global unique), and local (local). A globally shared tensor object is maintained by distributed nodes, and the data maintained on different nodes may overlap; a globally unique tensor object is maintained by distributed nodes, and the data maintained on different nodes do not overlap; a local tensor object exists on only one node. A tensor object provides operation interfaces such as load, pull, and push for programmers.
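As an illustration of this abstraction, a minimal tensor object might look like the sketch below; the class names, method signatures, and in-memory dictionary layout are assumptions, not the invention's actual implementation.

    from enum import Enum

    class TensorType(Enum):
        GLOBAL_SHARED = 1   # maintained by several nodes; maintained data may overlap
        GLOBAL_UNIQUE = 2   # maintained by several nodes; maintained data is disjoint
        LOCAL = 3           # exists on only one node

    class Tensor:
        """Hypothetical tensor object identified by tensor_id, as described above."""
        def __init__(self, tensor_id, tensor_type):
            self.tensor_id = tensor_id
            self.type = tensor_type
            self.data = {}                  # locally held (key, value) pairs

        def load(self, pairs):
            """Load (key, value) pairs into local memory."""
            self.data.update(pairs)

        def pull(self, key_set, server):
            """Fetch the values for key_set from the service nodes."""
            self.data.update(server.pull(self.tensor_id, key_set))

        def push(self, updates, server):
            """Send updated (key, value) pairs to the service nodes that maintain them."""
            server.push(self.tensor_id, updates)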
The stage module describes a segment of program logic in the application. The invention decomposes the overall logic of the application into different stages, and each stage object has a stage_id attribute as its unique identifier. Dependencies between stage objects can be declared with the set_dependency function. A stage object takes several tensor objects as its input and has one optional output. A stage's inputs have two types: one is called the primary variable (primary_variable), the other the secondary variable (secondary_variable). The (key, value) pairs of a primary variable have no dependencies among their keys, while the (key, value) pairs of a secondary variable do have dependencies among their keys. For each stage, the programmer provides a core function kernel_function as the stage's core logic, together with a mapping function (key_projection) from primary-variable keys to secondary-variable keys; the runtime system derives the secondary-variable keys automatically from the primary-variable keys and the key_projection function. Each primary and secondary variable of a stage has a corresponding variable called update_variable, which is used to update that variable; the update logic is defined by a user-provided update_function.
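The stage abstraction can be pictured with the following sketch; every class and attribute name here is an assumption chosen to mirror the description above, not the invention's published API. For example, a gradient-computation stage could take the training samples as its primary variable, the model weights it touches as its secondary variable, and declare a dependency on a data-loading stage via set_dependency.

    class Stage:
        """Hypothetical stage object: one segment of application logic."""
        def __init__(self, stage_id, primary_variable, secondary_variable,
                     kernel_function, key_projection, update_function):
            self.stage_id = stage_id
            self.primary_variable = primary_variable      # keys have no mutual dependencies
            self.secondary_variable = secondary_variable  # keys derived via key_projection
            self.kernel_function = kernel_function        # user-provided core logic
            self.key_projection = key_projection          # maps a primary key to secondary key(s)
            self.update_function = update_function        # how update_variable updates its variable
            self.dependencies = []

        def set_dependency(self, other_stage):
            """Declare that this stage depends on other_stage (one edge of the DAG)."""
            self.dependencies.append(other_stage)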
The stage group module describes a group of closely related stages. A stage group has a group_id attribute as its unique identifier and two interfaces: run and set_barrier. The optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed. The set_barrier interface sets up a synchronization barrier: after the current stage group finishes, all working nodes must enter a barrier wait state, and execution continues only after all working nodes have finished this stage group.
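A usage sketch of the run and set_barrier interfaces is shown below; the StageGroup class and the trivial stand-in stages are assumptions for illustration only.

    class StageGroup:
        """Hypothetical stage group: a set of closely related stages."""
        def __init__(self, group_id, stage_run_methods):
            self.group_id = group_id
            self.stage_run_methods = stage_run_methods  # run methods already ordered by the DAG
            self.barrier = False

        def set_barrier(self):
            """After this group finishes, all working nodes wait at a barrier."""
            self.barrier = True

        def run(self, num_run=1):
            """Execute the whole group num_run times, stages in their determined order."""
            for _ in range(num_run):
                for run_stage in self.stage_run_methods:
                    run_stage()

    # Example: run the group twice, then (conceptually) synchronize all working nodes.
    group = StageGroup("g0", [lambda: print("compute stage"), lambda: print("update stage")])
    group.set_barrier()
    group.run(num_run=2)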
The scheduler module decides, for a given tensor object, the set of keys (key_set) that the working node will process in the next stage. The scheduler module on each service node periodically broadcasts the bandwidth information of its node; the scheduler module on each working node uses the bandwidth information it receives from the service nodes to decide the set of model-parameter keys that the working node will process next.
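One way such a load-based decision could look is sketched below; the dictionary layout and function name are assumptions for illustration, not the invention's scheduling algorithm.

    def choose_next_key_set(server_loads, keys_by_server, processed_keys):
        """Pick the unprocessed keys maintained by the least-loaded service node.

        server_loads   : {server_id: current network load reported by its scheduler}
        keys_by_server : {server_id: set of keys that service node maintains}
        processed_keys : keys this working node has already processed
        """
        candidates = {
            sid: keys_by_server[sid] - processed_keys
            for sid in server_loads
            if keys_by_server[sid] - processed_keys
        }
        if not candidates:
            return set()
        lightest = min(candidates, key=lambda sid: server_loads[sid])
        return candidates[lightest]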
The execution engine module describes the stages of a stage group and their dependencies as a directed acyclic graph (directed acyclic graph, DAG). In this DAG, each node represents a stage and each directed edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at the head of the edge.
The message tracking module records the messages submitted at run time by the tensor module, the scheduler module, the stage group module, the working node module, the service node module, and the host node module. When a message is submitted to the message tracking module, the module delivers it to the recipient; after the recipient returns a receipt, the message tracking module notifies the original sender and delivers the acknowledgement.
Correspondingly, the present invention also provides an automated task-parallel method for distributed machine learning, which automatically divides tasks and executes them in parallel in distributed machine learning scenarios. The method includes a system initialization step, a parallel training step, and a system finishing step, wherein:
(1) System initialization step: initialize the node topology information and the application logic, including the following sub-steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, which is working node, service node, or host node; go to sub-step (1.2);
(1.2) Working nodes and service nodes each communicate with the host node to report their node information; the host node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the working nodes and service nodes receive the node information sent by the host node, they initialize the node topology information for subsequent inter-node communication; go to step (1.4);
(1.4) The working nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups according to the order in which they appear in the program code, and builds the DAG corresponding to each stage group; go to step (2);
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met; the behavior of a working node includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the working node's runtime system topologically sorts the nodes of the group's DAG to determine the execution order of all stages within the group; go to step (2.2);
(2.2) Call the currently unexecuted stage group next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3); if there is currently no unexecuted stage group, go to step (2.6);
(2.3) The run method of the stage group denoted next_group is executed num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and executes the run methods of all stages in the order determined in step (2.1): stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The working node's runtime system checks whether set_barrier is set on the stage group that has just executed num_run times; if so, it performs a barrier synchronization; go to step (2.5);
(2.5) If there is still an unexecuted stage group, set next_group to the currently unexecuted stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The working node's runtime system checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1);
(3) System finishing step: each working node informs the host node that its work is complete; after the host node detects that all working nodes have finished, it notifies all nodes to exit in a coordinated manner, including the following sub-steps:
(3.1) All working nodes send job_done messages to the host node; after the host node has received the job_done messages of all working nodes, it sends a sys_exit message to all working nodes and service nodes; go to step (3.2);
(3.2) After the working nodes and service nodes receive the sys_exit message, they send sys_exit_ack messages to the host node; go to step (3.3);
(3.3) The host node receives the sys_exit_ack messages sent by all working nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
The flow in step (2.1) above for determining the execution order of the stages within a stage group includes the following sub-steps (a code sketch of the procedure follows the sub-steps):
(2.1.1) Set the current unassigned number order to 0 and set the set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add every node whose in-degree in the current DAG is 0 to the set nodes; number all nodes in nodes with the current order, then increment order by 1; remove the nodes in nodes and all their outgoing edges from the DAG; set nodes back to empty; go to step (2.1.3);
(2.1.3) If the current DAG is empty, go to step (2.2); otherwise go to step (2.1.2).
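Sub-steps (2.1.1)-(2.1.3) are essentially Kahn's topological sort with batch numbering: all stages whose in-degree has dropped to zero receive the same order number and may be pipelined together. The sketch below is one reading of that procedure; the dependency-dictionary representation of the DAG is an assumption for illustration.

    def order_stages(dag):
        """Assign an order number to every stage of a DAG.

        dag: {stage_id: set of stage_ids it depends on}. Stages that receive the
        same number have no remaining dependencies on each other.
        """
        remaining = {s: set(deps) for s, deps in dag.items()}
        numbering, order = {}, 0                                      # step (2.1.1)
        while remaining:                                              # step (2.1.3)
            ready = [s for s, deps in remaining.items() if not deps]  # in-degree 0
            if not ready:
                raise ValueError("dependency cycle: not a DAG")
            for s in ready:                                           # step (2.1.2): number and remove
                numbering[s] = order
                del remaining[s]
            for deps in remaining.values():                           # drop their outgoing edges
                deps.difference_update(ready)
            order += 1
        return numbering

    # Example: "c" depends on "a" and "b", so "a" and "b" share number 0 and "c" gets 1.
    print(order_stages({"a": set(), "b": set(), "c": {"a", "b"}}))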
The run method of a stage described in step (2.3) above includes the following sub-steps (summarized by the sketch after the sub-steps):
(2.3.1) The working node's runtime system calls the stage's prepare_variables method to determine the key set primary_key_set of the primary variable primary_variable to be processed in the current phase. Specifically, based on the service-node load information (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the primary variable's keys across the service nodes, the set of not-yet-processed keys maintained on the service node with the lowest network load is assigned to the working node as the next key set primary_key_set to process; go to sub-step (2.3.2);
(2.3.2) From the user-provided key_projection function and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull method of the corresponding tensor objects, i.e. the primary and secondary variables, to fetch the required model parameters; go to step (2.3.3);
(2.3.3) The stage's core function kernel_function is executed: the runtime system automatically divides the primary variable's key set key_set into num_threads parts and creates num_threads threads to execute the core function in parallel, where num_threads is a user-provided parameter; go to step (2.3.4);
(2.3.4) Running the core function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the user-provided update_function. If the type of variable v is globally shared or globally unique, the runtime calls the push function of variable v to propagate the update: it serializes the variable's (key, value) pairs to be updated and sends the serialized data to all service nodes that maintain keys in that range; after a service node receives the update data, it updates the data it maintains.
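Taken together, sub-steps (2.3.1)-(2.3.4) can be summarized by the sketch below; the helper objects (the scheduler, the tensor variables, and their is_global/pull/push methods) are assumptions standing in for the runtime components described above, not the invention's actual code.

    from concurrent.futures import ThreadPoolExecutor

    def run_stage(stage, scheduler, num_threads):
        """Hypothetical run method of one stage on a working node."""
        # (2.3.1) decide which primary keys to process, based on service-node load
        primary_key_set = scheduler.prepare_variables(stage.primary_variable)

        # (2.3.2) derive the secondary keys and pull both variables' parameters
        secondary_key_set = {k2 for k in primary_key_set for k2 in stage.key_projection(k)}
        stage.primary_variable.pull(primary_key_set)
        stage.secondary_variable.pull(secondary_key_set)

        # (2.3.3) split the primary key set into num_threads parts and run the kernel in parallel
        keys = list(primary_key_set)
        chunks = [keys[i::num_threads] for i in range(num_threads)]
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            partial_updates = list(pool.map(stage.kernel_function, chunks))

        # (2.3.4) merge the per-thread updates via update_function, then push global variables
        v_update = stage.update_function(partial_updates)
        if stage.primary_variable.is_global():
            stage.primary_variable.push(v_update)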
The pull method described in step (2.3.2) above has the following sub-steps (illustrated by the sketch that follows):
(2.3.2.1) Serialize the key set key_set to be pulled for the tensor object and send the serialized data to the service nodes that maintain that key range; go to step (2.3.2.2);
(2.3.2.2) After a service node receives the pull message, it serializes the (key, value) pairs corresponding to key_set and returns the serialized data to the requester.
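The pull round trip of sub-steps (2.3.2.1)-(2.3.2.2) reduces to serializing a key set on the requester and serializing the matching (key, value) pairs on the service node. The sketch below uses Python's pickle purely as an illustrative serialization format, not the one used by the invention.

    import pickle

    def pull_request(key_set):
        """Working-node side, step (2.3.2.1): serialize the keys to be pulled."""
        return pickle.dumps(sorted(key_set))

    def pull_reply(request_bytes, maintained_params):
        """Service-node side, step (2.3.2.2): serialize and return the matching pairs."""
        key_set = pickle.loads(request_bytes)
        reply = {k: maintained_params[k] for k in key_set if k in maintained_params}
        return pickle.dumps(reply)

    # Example round trip against an in-process stand-in for a service node.
    server_data = {"w1": 0.5, "w2": -1.2, "w3": 3.0}
    wire = pull_reply(pull_request({"w1", "w3"}), server_data)
    print(pickle.loads(wire))   # {'w1': 0.5, 'w3': 3.0}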
In summary, compared with the prior art, the technical scheme of the invention has the following advantages and technical effects:
(1) The invention provides programming modules at a higher level of abstraction than a global read/write interface; these modules decouple read/write access behavior from the application logic, which on the one hand greatly simplifies writing application programs and on the other hand provides the foundation for system-level optimizations;
(2) The invention achieves automated parallel execution of machine learning tasks, which greatly reduces the burden on application programmers of writing highly concurrent machine learning applications;
(3) The runtime system developed by the invention dynamically divides tasks according to the load of each service node, making full use of network bandwidth resources.
Brief description of the drawings
Fig. 1 is the module block diagram of the automated task-parallel system of the invention;
Fig. 2 is the overall workflow diagram of the automated task-parallel method of the invention;
Fig. 3 is the system-initialization sub-workflow diagram of the automated task-parallel method of the invention;
Fig. 4 is the parallel-training sub-workflow diagram of the automated task-parallel method of the invention;
Fig. 5 is the system-finishing sub-workflow diagram of the automated task-parallel method of the invention.
Detailed description of the invention
To make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention, not to limit it. In addition, the technical features involved in the embodiments described below can be combined with each other as long as they do not conflict.
Fig. 1 is the module block diagram of the automated task-parallel system of the invention. As shown in Fig. 1, the automated task-parallel system of the invention comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. The stage module and the scheduler module are both connected with the tensor module; the stage module is connected with the stage group module; the execution engine module is connected with the stage module; the scheduler module, the tensor module, and the stage group module are all connected with the message tracking module.
The working node module and the service node module are abstract descriptions of the behavior of working nodes and parameter service nodes, respectively, and both modules are transparent to machine learning programmers.
The host node module is an abstract description of the host node. The host node coordinates the overall workflow of the system, such as system initialization and system shutdown. Of the modules above, all except the working node module, the service node module, and the host node module are present on every node.
The tensor module describes the key-value pairs (key, value) of model parameters in machine learning. An application uses several tensor objects to describe the model parameters needed for training; each tensor object has a tensor_id attribute as its unique identifier. There are three types of tensor objects: globally shared (global shared), globally unique (global unique), and local (local). A globally shared tensor object is maintained by distributed nodes, and the data maintained on different nodes may overlap; a globally unique tensor object is maintained by distributed nodes, and the data maintained on different nodes do not overlap; a local tensor object exists on only one node. A tensor object provides operation interfaces such as load, pull, and push for programmers.
The stage module describes a segment of program logic in the application. The overall logic of the application is decomposed into different stages, and each stage object has a stage_id attribute as its unique identifier. Dependencies between stage objects can be declared with the set_dependency function. A stage object takes several tensor objects as its input and has one optional output. A stage's inputs have two types: the primary variable (primary_variable) and the secondary variable (secondary_variable). The (key, value) pairs of a primary variable have no dependencies among their keys, while the (key, value) pairs of a secondary variable do. For each stage, the programmer provides a core function kernel_function as the stage's core logic, together with a mapping function key_projection from primary-variable keys to secondary-variable keys; the runtime system derives the secondary-variable keys automatically from the primary-variable keys and the key_projection function. Each primary and secondary variable of a stage has a corresponding variable called update_variable, which is used to update that variable; the update logic is defined by a user-provided update_function.
The stage group module describes a group of closely related stages. A stage group has a group_id attribute as its unique identifier and two interfaces: run and set_barrier. The optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed. The set_barrier interface sets up a synchronization barrier: after the current stage group finishes, all working nodes must enter a barrier wait state, and execution continues only after all working nodes have finished this stage group.
The scheduler module decides, for a given tensor object, the set of keys (key_set) that the working node will process in the next stage. The scheduler on each service node periodically broadcasts the bandwidth information of its node; the scheduler on each working node uses the bandwidth information it receives from the service nodes to decide the set of model-parameter keys that the working node will process next.
The execution engine module describes the stages of a stage group and their dependencies as a directed acyclic graph (directed acyclic graph, DAG). In this DAG, each node represents a stage and each directed edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at the head of the edge.
The message tracking module records the messages submitted at run time by the tensor module, the scheduler module, the stage group module, the working node module, the service node module, and the host node module. When a message is submitted to the message tracking module, the module delivers it to the recipient; after the recipient returns a receipt, the message tracking module notifies the original sender and delivers the acknowledgement.
Fig. 2 is the overall workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 2, the overall workflow of the automated task-parallel method of the invention comprises the following steps:
(1) System initialization step: initialize the node topology information and the application logic;
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met;
(3) System finishing step: each working node informs the host node that its work is complete; after the host node detects that all working nodes have finished, it notifies all nodes to exit in a coordinated manner.
Fig. 3 is the system-initialization sub-workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 3, the system-initialization workflow comprises the following steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, which is working node, service node, or host node; go to sub-step (1.2);
(1.2) Working nodes and service nodes each communicate with the host node to report their node information; the host node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the working nodes and service nodes receive the node information sent by the host node, they initialize the node topology information for subsequent inter-node communication; go to step (1.4);
(1.4) The working nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups according to the order in which they appear in the program code, and builds the DAG corresponding to each stage group; go to step (2).
Fig. 4 is the parallel-training sub-workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 4, the parallel-training sub-workflow of a working node comprises the following steps:
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met; the behavior of a working node includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the working node's runtime system topologically sorts the nodes of the group's DAG to determine the execution order of all stages within the group; go to step (2.2);
(2.2) Call the currently unexecuted stage group next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3); if there is currently no unexecuted stage group, go to step (2.6);
(2.3) The run method of the stage group denoted next_group is executed num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and executes the run methods of all stages in the order determined in step (2.1): stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The working node's runtime system checks whether set_barrier is set on the stage group that has just executed num_run times; if so, it performs a barrier synchronization; go to step (2.5);
(2.5) If there is still an unexecuted stage group, set next_group to the currently unexecuted stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The working node's runtime system checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1).
Fig. 5 is the system-finishing sub-workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 5, the system-finishing sub-workflow comprises the following steps:
(3.1) All working nodes send job_done messages to the host node; after the host node has received the job_done messages of all working nodes, it sends a sys_exit message to all working nodes and service nodes; go to step (3.2);
(3.2) After the working nodes and service nodes receive the sys_exit message, they send sys_exit_ack messages to the host node; go to step (3.3);
(3.3) The host node receives the sys_exit_ack messages sent by all working nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
Further, the flow in step (2.1) for determining the execution order of the stages within a stage group includes the following sub-steps:
(2.1.1) Set the current unassigned number order to 0 and set the set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add every node whose in-degree in the current DAG is 0 to the set nodes; number all nodes in nodes with the current order, then increment order by 1; remove the nodes in nodes and all their outgoing edges from the DAG; set nodes back to empty; go to step (2.1.3);
(2.1.3) If the current DAG is empty, go to step (2.2); otherwise go to step (2.1.2).
Further, the run method of a stage described in step (2.3) includes the following sub-steps:
(2.3.1) The working node's runtime system calls the stage's prepare_variables method to determine the key set primary_key_set of the primary variable primary_variable to be processed in the current phase; specifically, based on the service-node load information (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the primary variable's keys across the service nodes, the set of not-yet-processed keys maintained on the service node with the lowest network load is assigned to the working node as the next key set to process; go to sub-step (2.3.2);
(2.3.2) From the user-provided key_projection function and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull method of the corresponding tensor objects, i.e. the primary and secondary variables, to fetch the required model parameters; go to step (2.3.3);
(2.3.3) The stage's core function kernel_function is executed: the runtime system automatically divides the primary variable's key set key_set into num_threads parts and creates num_threads threads to execute the core function in parallel, where num_threads is a user-provided parameter; go to step (2.3.4);
(2.3.4) Running the core function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the user-provided update_function; if the type of variable v is globally shared or globally unique, the runtime calls the push function of variable v to propagate the update: it serializes the variable's (key, value) pairs to be updated and sends the serialized data to all service nodes that maintain keys in that range; after a service node receives the update data, it updates the data it maintains.
Further, the pull method described in step (2.3.2) has the following sub-steps:
(2.3.2.1) Serialize the key set key_set to be pulled for the tensor object and send the serialized data to the service nodes that maintain that key range; go to step (2.3.2.2);
(2.3.2.2) After a service node receives the pull message, it serializes the (key, value) pairs corresponding to key_set and returns the serialized data to the requester.
As will be readily appreciated by those skilled in the art, the foregoing is only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, and improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (5)

1. An automated task-parallel system for distributed machine learning, characterized in that it comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module; wherein the stage module and the scheduler module are both connected with the tensor module; the stage module is connected with the stage group module; the execution engine module is connected with the stage module; and the scheduler module, the tensor module, and the stage group module are all connected with the message tracking module;
the working node module and the service node module are respectively used to describe abstractly the behavior of working nodes and parameter service nodes;
the host node module is an abstract description of the host node, which coordinates the overall workflow of the system, including system initialization and system shutdown;
the tensor module describes the key-value pairs (key, value) of model parameters in machine learning; an application uses several tensor objects to describe the model parameters needed for training, and each tensor object has a tensor_id attribute as its unique identifier; there are three types of tensor objects: globally shared (global shared), globally unique (global unique), and local (local); a globally shared tensor object is maintained by distributed nodes, and the data maintained on different nodes may overlap; a globally unique tensor object is maintained by distributed nodes, and the data maintained on different nodes do not overlap; a local tensor object exists on only one node; a tensor object provides operation interfaces such as load, pull, and push for programmers;
the stage module describes a segment of program logic in the application; the overall logic of the application is decomposed into different stages, and each stage object has a stage_id attribute as its unique identifier; dependencies between stage objects can be declared with the set_dependency function; a stage object takes several tensor objects as its input and has one optional output; a stage object's inputs have two types, one called the primary variable (primary_variable) and the other the secondary variable (secondary_variable); the (key, value) pairs of the primary variable have no dependencies among their keys, while the (key, value) pairs of the secondary variable have dependencies among their keys; for each stage, the programmer provides a core function kernel_function as the stage's core logic, together with a mapping function key_projection from primary-variable keys to secondary-variable keys, and the runtime system derives the secondary-variable keys automatically from the primary-variable keys and the key_projection function; each primary and secondary variable of a stage has a corresponding variable called update_variable, which is used to update that variable, and the update logic is defined by a user-provided update_function;
the stage group module describes a group of closely related stages; a stage group has a group_id attribute as its unique identifier and two interfaces: run and set_barrier; the optional parameter of the run method is an integer num_run that specifies the number of times the stage group is executed; the set_barrier interface sets up a synchronization barrier, meaning that after the current stage group finishes, all working nodes must enter a barrier wait state, and execution continues only after all working nodes have finished this stage group;
the scheduler module decides, for a given tensor object, the set of keys (key_set) that the working node will process in the next stage; the scheduler on each service node periodically broadcasts the bandwidth information of its node, and the scheduler on each working node uses the bandwidth information it receives from the service nodes to decide the set of model-parameter keys that the working node will process next;
the execution engine module describes the stages of a stage group and their dependencies as a directed acyclic graph (directed acyclic graph, DAG), in which each node represents a stage and each directed edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at the head of the edge;
the message tracking module records the messages submitted at run time by the tensor module, the scheduler module, the stage group module, the working node module, the service node module, and the host node module; when a message is submitted to the message tracking module, the module delivers it to the recipient, and after the recipient returns a receipt, the message tracking module notifies the original sender and delivers the acknowledgement.
2. An automated task-parallel method for distributed machine learning, characterized in that it comprises a system initialization step, a parallel training step, and a system finishing step, wherein:
(1) System initialization step: initialize the node topology information and the application logic, including the following sub-steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, which is working node, service node, or host node; go to sub-step (1.2);
(1.2) Working nodes and service nodes each communicate with the host node to report their node information; the host node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the working nodes and service nodes receive the node information sent by the host node, they initialize the node topology information for subsequent inter-node communication; go to step (1.4);
(1.4) The working nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups according to the order in which they appear in the program code, and builds the DAG corresponding to each stage group; go to step (2);
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met; the behavior of a working node includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the working node's runtime system topologically sorts the nodes of the group's DAG to determine the execution order of all stages within the group; go to step (2.2);
(2.2) Call the currently unexecuted stage group next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3); if there is currently no unexecuted stage group, go to step (2.6);
(2.3) The run method of the stage group denoted next_group is executed num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and executes the run methods of all stages in the order determined in step (2.1): stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The working node's runtime system checks whether set_barrier is set on the stage group that has just executed num_run times; if so, it performs a barrier synchronization; go to step (2.5);
(2.5) If there is still an unexecuted stage group, set next_group to the currently unexecuted stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The working node's runtime system checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1);
(3) System finishing step: each working node informs the host node that its work is complete; after the host node detects that all working nodes have finished, it notifies all nodes to exit in a coordinated manner, including the following sub-steps:
(3.1) All working nodes send job_done messages to the host node; after the host node has received the job_done messages of all working nodes, it sends a sys_exit message to all working nodes and service nodes; go to step (3.2);
(3.2) After the working nodes and service nodes receive the sys_exit message, they send sys_exit_ack messages to the host node; go to step (3.3);
(3.3) The host node receives the sys_exit_ack messages sent by all working nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
3. The automated task-parallel method for distributed machine learning of claim 2, characterized in that the flow in step (2.1) for determining the execution order of the stages within a stage group includes the following sub-steps:
(2.1.1) Set the current unassigned number order to 0 and set the set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add every node whose in-degree in the current DAG is 0 to the set nodes; number all nodes in nodes with the current order, then increment order by 1; remove the nodes in nodes and all their outgoing edges from the DAG; set nodes back to empty; go to step (2.1.3);
(2.1.3) If the current DAG is empty, go to step (2.2); otherwise go to step (2.1.2).
4. The automated task-parallel method for distributed machine learning of claim 2, characterized in that the run method of a stage in step (2.3) includes the following sub-steps:
(2.3.1) The working node's runtime system calls the stage's prepare_variables method to determine the key set primary_key_set of the primary variable primary_variable to be processed in the current phase; specifically, based on the service-node load information (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the primary variable's keys across the service nodes, the set of not-yet-processed keys maintained on the service node with the lowest network load is assigned to the working node as the next key set primary_key_set to process; go to sub-step (2.3.2);
(2.3.2) From the user-provided key_projection function and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull method of the corresponding tensor objects, i.e. the primary and secondary variables, to fetch the required model parameters; go to step (2.3.3);
(2.3.3) The stage's core function kernel_function is executed: the runtime system automatically divides the primary variable's key set key_set into num_threads parts and creates num_threads threads to execute the core function in parallel, where num_threads is a user-provided parameter; go to step (2.3.4);
(2.3.4) Running the core function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the user-provided update_function; if the type of variable v is globally shared or globally unique, the runtime calls the push function of variable v to propagate the update: it serializes the variable's (key, value) pairs to be updated and sends the serialized data to all service nodes that maintain keys in that range; after a service node receives the update data, it updates the data it maintains.
5. The automated task-parallel method for distributed machine learning of claim 4, characterized in that the pull method in step (2.3.2) has the following sub-steps:
(2.3.2.1) Serialize the key set key_set to be pulled for the tensor object and send the serialized data to the service nodes that maintain that key range; go to step (2.3.2.2);
(2.3.2.2) After a service node receives the pull message, it serializes the (key, value) pairs corresponding to key_set and returns the serialized data to the requester.
CN201610255970.5A 2016-04-22 2016-04-22 Automated task parallel method suitable for distributed machine learning and system thereof Active CN105956021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610255970.5A CN105956021B (en) 2016-04-22 2016-04-22 Automated task parallel method suitable for distributed machine learning and system thereof

Publications (2)

Publication Number Publication Date
CN105956021A true CN105956021A (en) 2016-09-21
CN105956021B CN105956021B (en) 2019-05-21

Family

ID=56915367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610255970.5A Active CN105956021B (en) 2016-04-22 2016-04-22 Automated task parallel method suitable for distributed machine learning and system thereof

Country Status (1)

Country Link
CN (1) CN105956021B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027938B1 (en) * 2007-03-26 2011-09-27 Google Inc. Discriminative training in machine learning
CN102546247A (en) * 2011-12-29 2012-07-04 华中科技大学 Massive data continuous analysis system suitable for stream processing
US20140280142A1 (en) * 2013-03-14 2014-09-18 Science Applications International Corporation Data analytics system
CN103763378A (en) * 2014-01-24 2014-04-30 中国联合网络通信集团有限公司 Task processing method and system and nodes based on distributive type calculation system
CN104360903A (en) * 2014-11-18 2015-02-18 北京美琦华悦通讯科技有限公司 Method for realizing task data decoupling in spark operation scheduling system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MU LI: "Scaling Distributed Machine Learning", Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation *
WEI DAI: "Petuum: A Framework for Iterative-Convergent", Proceedings of Advances in Neural Information Processing Systems *
YUCHENG LOW: "Distributed GraphLab: A Framework for Machine Learning", Proceedings of the VLDB Endowment *
YU LINCHEN et al.: "Cache Coherence Maintenance Mechanism in a Transactional-Memory-Based Distributed Programming Environment", Microelectronics & Computer *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009642B (en) * 2016-10-31 2021-12-14 腾讯科技(深圳)有限公司 Distributed machine learning method and system
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
CN108229686B (en) * 2016-12-14 2022-07-05 阿里巴巴集团控股有限公司 Model training and predicting method and device, electronic equipment and machine learning platform
CN108229686A (en) * 2016-12-14 2018-06-29 阿里巴巴集团控股有限公司 Model training, Forecasting Methodology, device, electronic equipment and machine learning platform
CN110612539B (en) * 2017-03-15 2024-04-12 西门子股份公司 Method for executing a machine learning model on a memory-constrained industrial device
CN110612539A (en) * 2017-03-15 2019-12-24 西门子股份公司 Method for executing machine learning model on memory-limited industrial equipment
CN108733461A (en) * 2017-04-18 2018-11-02 北京京东尚科信息技术有限公司 Distributed task dispatching method and apparatus
CN107231558A (en) * 2017-05-23 2017-10-03 江苏火米互动科技有限公司 A kind of implementation method of the H.264 parallel encoder based on CUDA
CN107231558B (en) * 2017-05-23 2019-10-22 江苏火米互动科技有限公司 A kind of implementation method of the H.264 parallel encoder based on CUDA
CN107729353B (en) * 2017-08-30 2020-04-07 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
CN111597187A (en) * 2017-08-30 2020-08-28 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
EP3678068A4 (en) * 2017-08-30 2020-11-25 The Fourth Paradigm (Beijing) Tech Co Ltd Distributed system for executing machine learning and method therefor
CN107622310A (en) * 2017-08-30 2018-01-23 第四范式(北京)技术有限公司 For performing the distributed system and its method of machine learning
CN109447274A (en) * 2017-08-30 2019-03-08 第四范式(北京)技术有限公司 For executing the distributed system and its method of machine learning
CN107729353A (en) * 2017-08-30 2018-02-23 第四范式(北京)技术有限公司 For performing the distributed system and its method of machine learning
CN111597187B (en) * 2017-08-30 2023-09-01 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
CN109814986B (en) * 2017-11-20 2021-01-05 上海寒武纪信息科技有限公司 Task parallel processing method, storage medium, computer equipment, device and system
CN109814986A (en) * 2017-11-20 2019-05-28 上海寒武纪信息科技有限公司 Task method for parallel processing, storage medium, computer equipment, device and system
CN107944566A (en) * 2017-11-28 2018-04-20 杭州云脑科技有限公司 A kind of machine learning method, host node, working node and system
WO2019104713A1 (en) * 2017-11-28 2019-06-06 杭州云脑科技有限公司 Machine learning method, master node, work node, and system
CN109960570A (en) * 2017-12-14 2019-07-02 北京图森未来科技有限公司 A kind of multimode dispatching method, device and system
CN108681777A (en) * 2018-05-07 2018-10-19 北京京东尚科信息技术有限公司 A kind of method and apparatus of the machine learning program operation based on distributed system
CN108681777B (en) * 2018-05-07 2021-07-20 北京京东尚科信息技术有限公司 Method and device for running machine learning program based on distributed system
CN109871958A (en) * 2019-02-01 2019-06-11 东软医疗系统股份有限公司 The method, device and equipment of training pattern
WO2020243973A1 (en) * 2019-06-06 2020-12-10 华为技术有限公司 Model-based signal inference method and apparatus
WO2021051772A1 (en) * 2019-09-19 2021-03-25 Huawei Technologies Co., Ltd. Method and apparatus for vectorized resource scheduling in distributed computing systems using tensors
US11907770B2 (en) 2019-09-19 2024-02-20 Huawei Cloud Computing Technologies Co., Ltd. Method and apparatus for vectorized resource scheduling in distributed computing systems using tensors
CN110990059B (en) * 2019-11-28 2021-11-19 中国科学院计算技术研究所 Stream type calculation engine operation method and system for tilt data
CN110990059A (en) * 2019-11-28 2020-04-10 中国科学院计算技术研究所 Stream type calculation engine operation method and system for tilt data
TWI780382B (en) * 2019-12-05 2022-10-11 新唐科技股份有限公司 Microcontroller updating system and method
CN111506402B (en) * 2020-03-31 2023-06-27 上海氪信信息技术有限公司 Computer task scheduling method, device, equipment and medium for machine learning modeling
CN111506402A (en) * 2020-03-31 2020-08-07 上海氪信信息技术有限公司 Computer task scheduling method, device, equipment and medium for machine learning modeling
CN111580970A (en) * 2020-05-07 2020-08-25 电子科技大学 Transmission scheduling method for model distribution and aggregation of federated learning
CN111753997B (en) * 2020-06-28 2021-08-27 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium
CN111753997A (en) * 2020-06-28 2020-10-09 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium
US11954611B2 (en) 2020-08-27 2024-04-09 International Business Machines Corporation Tensor comparison across a distributed machine learning environment
CN112214256B (en) * 2020-09-30 2024-02-02 招商局金融科技有限公司 Machine learning operation control method and device, electronic equipment and storage medium
CN112214256A (en) * 2020-09-30 2021-01-12 招商局金融科技有限公司 Operation control method and device for machine learning, electronic equipment and storage medium
CN113157413B (en) * 2021-04-16 2022-04-26 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN114461392B (en) * 2022-01-25 2023-03-31 西南交通大学 Bandwidth-aware selective data multicast method
CN114461392A (en) * 2022-01-25 2022-05-10 西南交通大学 Bandwidth-aware selective data multicast method
CN115314397A (en) * 2022-08-05 2022-11-08 中科计算技术西部研究院 Network simulation method, system, device and storage medium for distributed training
CN116483580A (en) * 2022-09-29 2023-07-25 陕西震旦纪信息技术有限公司 System and method for scheduling server computing resources based on Kubernetes
CN116483580B (en) * 2022-09-29 2024-05-28 陕西震旦纪信息技术有限公司 System and method for scheduling server computing resources based on Kubernetes
CN116662039A (en) * 2023-07-25 2023-08-29 菲特(天津)检测技术有限公司 Industrial information parallel detection method, device and medium based on shared memory
CN116662039B (en) * 2023-07-25 2024-01-23 菲特(天津)检测技术有限公司 Industrial information parallel detection method, device and medium based on shared memory

Also Published As

Publication number Publication date
CN105956021B (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN105956021A (en) Automated task parallel method suitable for distributed machine learning and system thereof
CN107239335B (en) Job scheduling system and method for distributed system
CN105117286B (en) The dispatching method of task and streamlined perform method in MapReduce
US8443351B2 (en) Parallel loops in a workflow
CN111738434B (en) Method for executing deep neural network on heterogeneous processing unit
US11915101B2 (en) Numerical quantum experimentation
CN109669452A (en) A kind of cloud robot task dispatching method and system based on parallel intensified learning
US20150331713A1 (en) Parallel simulation using multiple co-simulators
CN103377035A (en) Pipeline parallelization method for coarse-grained streaming application
Yu et al. Automated runtime-aware scheduling for multi-tenant dnn inference on gpu
US20200272896A1 (en) System for deep learning training using edge devices
Ward et al. Colmena: Scalable machine-learning-based steering of ensemble simulations for high performance computing
CN109918199A (en) Distributed figure processing system based on GPU
CN112764893B (en) Data processing method and data processing system
CN105719126A (en) System and method for internet big data task scheduling based on life cycle model
CN106681820A (en) Message combination based extensible big data computing method
Yi et al. Fast training of deep learning models over multiple gpus
CN117009038B (en) Graph computing platform based on cloud native technology
Busch et al. Dynamic scheduling in distributed transactional memory
CN113010296B (en) Formalized model based task analysis and resource allocation method and system
Feljan et al. Task allocation optimization for multicore embedded systems
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN113568747A (en) Cloud robot resource scheduling method and system based on task classification and time sequence prediction
CN106055862A (en) Novel efficient heuristic-type two-stage parallel branch-and-bound method
Yi et al. Optimizing DNN compilation for distributed training with joint OP and tensor fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant