CN105956021A - Automated task parallel method suitable for distributed machine learning and system thereof - Google Patents

Automated task parallel method suitable for distributed machine learning and system thereof

Info

Publication number
CN105956021A
CN105956021A CN201610255970.5A
Authority
CN
China
Prior art keywords
node
module
stage
key
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610255970.5A
Other languages
Chinese (zh)
Other versions
CN105956021B (en)
Inventor
廖小飞
曹镇山
郭人通
刘海坤
金海
陆枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201610255970.5A priority Critical patent/CN105956021B/en
Publication of CN105956021A publication Critical patent/CN105956021A/en
Application granted granted Critical
Publication of CN105956021B publication Critical patent/CN105956021B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2323Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Discrete Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides an automated task-parallel method and system for distributed machine learning. The method and system address a defect of the programming interface in existing distributed machine learning frameworks: because only a key-value read/write interface is provided, the system's data access behavior is tightly coupled with the application logic. This defect intensifies competition for network bandwidth in a distributed cluster and makes it difficult for programmers to parallelize tasks. The system comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. By providing higher-level programming abstractions, the system decouples read/write access behavior from application logic. At run time, the system first dynamically partitions tasks according to the load of each service node and then executes the machine learning task automatically in parallel, greatly reducing the burden on programmers of writing highly concurrent machine learning applications.

Description

Automated task-parallel method and system for distributed machine learning
Technical field
The invention belongs to the intersecting field of distributed computing and machine learning, and specifically relates to an automated task-parallel method and system for distributed machine learning.
Background art
As a traditional approach to mining the value of data, machine learning algorithms are widely used in fields such as natural language processing, text analysis, speech recognition, autonomous driving, and bioinformatics. With the arrival of the big-data era, the value contained in data, including its commercial value, has become increasingly apparent, and machine learning has therefore received growing attention. However, as the scale of the data and of the model parameters to be learned keeps increasing, a single compute node, limited in its memory, computing, and memory-bandwidth resources, can no longer meet the demands of large-scale machine learning. Distributing traditional single-node machine learning has therefore become a necessary trend. Once machine learning is distributed, more compute nodes can be used to process larger data, the time required to train a model is shortened, and the accuracy of the learned model can be improved. Distributed machine learning has attracted wide attention in both industry and academia; for example, Google used the distributed system DistBelief to train a cat-face recognition model, the Apache Software Foundation developed the Hadoop-based distributed machine learning framework Mahout, and the UC Berkeley AMP Lab open-sourced Spark, a distributed computing system suitable for machine learning algorithms.
Most distributed machine learning algorithms are iterative: training ends either after a predetermined number of iterations or once the model parameters converge to a stable state. Traditional distributed frameworks such as MapReduce perform poorly on iterative workloads because of the defects of their synchronization mechanisms, so their performance on such tasks is unsatisfactory.
A newer type of distributed machine learning system is the parameter-server architecture. The parameters here are the key-value pairs (key, value) that describe model parameters in machine learning, or two-dimensional matrices, or multi-dimensional matrices; a multi-dimensional matrix is also called a tensor. In the parameter-server architecture, the compute nodes in the cluster are divided into two classes: working nodes and service nodes. Service nodes maintain the global model parameters and handle the working nodes' query and update operations on those parameters. A working node loads a subset of the global training data into local memory, uses the algorithm specified by the application logic to decide which model parameters it needs, issues query operations to the service nodes, and transfers the required model parameters over the network into local memory; it then uses the application logic and the retrieved parameters to compute new model parameters w or parameter updates Δw. After one round of iterative computation, the working node issues update operations to the service nodes to synchronize the global model parameters. In distributed machine learning, the behavior of a working node within one complete iteration can be summarized as the following steps:
1. The working node loads its partition of the data set;
2. The working node determines which model parameters it needs and fetches them through the model access interface provided by the underlying system;
3. The working node computes new model parameters w or parameter updates Δw according to the application logic;
4. The working node pushes the newly computed parameters w or updates Δw to the service nodes, which perform parameter update and synchronization.
Steps 2, 3, and 4 above are the critical steps of the iterative computation. Fetching the required model parameters through the global read/write interface and pushing the newly computed parameters or parameter updates to the service nodes are the main sources of network traffic in the system; the sketch below illustrates this loop.
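For concreteness, the following Python-style sketch shows how one working node runs this loop under the parameter-server model. It is an illustration only; the names server.pull, server.push, needed_keys_fn, and compute_update are assumptions standing in for the global read/write interface and the application logic, not interfaces defined by the invention.

    # Minimal sketch of a working node's iterative loop in a parameter-server setting.
    # `server`, `needed_keys_fn`, and `compute_update` are hypothetical stand-ins for
    # the global read/write interface and the application logic discussed above.
    def train_worker(local_dataset, server, num_iterations, needed_keys_fn, compute_update):
        for _ in range(num_iterations):
            for batch in local_dataset:                  # step 1: local data partition
                keys = needed_keys_fn(batch)             # step 2: decide which parameters are needed
                params = server.pull(keys)               # step 2: fetch them over the network
                delta_w = compute_update(batch, params)  # step 3: compute new w or update delta_w
                server.push(delta_w)                     # step 4: push the update for synchronization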
Regarding step 2: because the model parameters are huge, the resulting transfer volume is also huge. With a fixed amount of network bandwidth, a single working node may spend more time waiting on the network during an iteration than computing, which lengthens the overall training time; when multiple working nodes trigger network transfers simultaneously, they compete for bandwidth and the waiting time becomes even longer. The way a working node accesses model parameters is closely tied to the upper-layer application logic. The interface provided by current parameter-server architectures is a unified global-parameter access interface, which tightly couples the system's global-parameter access behavior with the application logic and makes it hard to optimize from the system side.
Regarding step 3: computing the model parameters is a compute-intensive operation. In the current many-core and multi-core era, maximizing the parallelism of this computation is essential for improving system concurrency. Existing distributed machine learning systems do not provide a corresponding parallel programming interface; they only provide a global read/write interface for the model, so programmers need parallel-programming experience to write highly concurrent machine learning applications.
Regarding step 4: for the network bottleneck in parameter synchronization, two kinds of solutions exist. One is to change the synchronization model, i.e., to allow the iteration progress of different working nodes to differ within a bound and to perform bulk synchronization (BSP, Bulk Synchronous Parallel) once the difference reaches a threshold; this alleviates bandwidth competition to some extent. The other is to control how the parameter servers' resources are used by choosing different synchronization intervals for different working nodes, avoiding bursts of simultaneous requests while ensuring that the chosen intervals both reduce communication frequency and preserve training accuracy.
Summary of the invention
To address the above defects and improvement needs of the prior art, the invention provides an automated task-parallel method and system for distributed machine learning. First, the model-parameter access interface is decoupled from the application logic, so that the system can adjust its parameter-access behavior at run time; this lays the foundation for optimizations such as network transfer scheduling and system-level parallelization. Second, the application logic is decomposed into stages, and a directed acyclic graph (DAG) is built to describe the dependencies among the computation stages; the runtime system uses the DAG to divide tasks automatically and execute them in parallel, improving system concurrency. The method and system effectively address the network-transfer bottleneck of existing distributed machine learning systems and improve system concurrency, thereby improving overall system performance.
To achieve these goals, according to one aspect of the present invention, an automated task-parallel method and system for distributed machine learning are provided. The system comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. The stage module and the scheduler module are both connected with the tensor module; the stage module is connected with the stage group module; the execution engine module is connected with the stage module; the scheduler module, the tensor module, and the stage group module are all connected with the message tracking module.
The working node module and the service node module are abstract descriptions of the behavior of working nodes and parameter service nodes, respectively, and both modules are transparent to machine learning programmers.
The host node module is an abstract description of the host node. The host node coordinates the overall workflow of the system, such as system initialization and system shutdown. Of the modules above, all except the working node module, the service node module, and the host node module are present on every node.
The tensor module describes the key-value pairs (key, value) of model parameters in machine learning. An application uses several tensor objects to describe the model parameters needed for training; each tensor object has a tensor_id attribute as its unique identifier. There are three types of tensor objects: globally shared (global shared), globally unique (global unique), and local (local). A globally shared tensor object is maintained by distributed nodes, and the data maintained on different nodes may overlap; a globally unique tensor object is maintained by distributed nodes, and the data maintained on different nodes do not overlap; a local tensor object exists on only one node. A tensor object provides operation interfaces such as load, pull, and push for programmers.
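As an illustration of this abstraction, a minimal tensor object might look like the sketch below; the class names, method signatures, and in-memory dictionary layout are assumptions, not the invention's actual implementation.

    from enum import Enum

    class TensorType(Enum):
        GLOBAL_SHARED = 1   # maintained by several nodes; maintained data may overlap
        GLOBAL_UNIQUE = 2   # maintained by several nodes; maintained data is disjoint
        LOCAL = 3           # exists on only one node

    class Tensor:
        """Hypothetical tensor object identified by tensor_id, as described above."""
        def __init__(self, tensor_id, tensor_type):
            self.tensor_id = tensor_id
            self.type = tensor_type
            self.data = {}                  # locally held (key, value) pairs

        def load(self, pairs):
            """Load (key, value) pairs into local memory."""
            self.data.update(pairs)

        def pull(self, key_set, server):
            """Fetch the values for key_set from the service nodes."""
            self.data.update(server.pull(self.tensor_id, key_set))

        def push(self, updates, server):
            """Send updated (key, value) pairs to the service nodes that maintain them."""
            server.push(self.tensor_id, updates)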
The stage module describes a segment of program logic in the application. The invention decomposes the overall logic of the application into different stages, and each stage object has a stage_id attribute as its unique identifier. Dependencies between stage objects can be declared with the set_dependency function. A stage object takes several tensor objects as its input and has one optional output. A stage's inputs have two types: one is called the primary variable (primary_variable), the other the secondary variable (secondary_variable). The (key, value) pairs of a primary variable have no dependencies among their keys, while the (key, value) pairs of a secondary variable do have dependencies among their keys. For each stage, the programmer provides a core function kernel_function as the stage's core logic, together with a mapping function (key_projection) from primary-variable keys to secondary-variable keys; the runtime system derives the secondary-variable keys automatically from the primary-variable keys and the key_projection function. Each primary and secondary variable of a stage has a corresponding variable called update_variable, which is used to update that variable; the update logic is defined by a user-provided update_function.
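The stage abstraction can be pictured with the following sketch; every class and attribute name here is an assumption chosen to mirror the description above, not the invention's published API. For example, a gradient-computation stage could take the training samples as its primary variable, the model weights it touches as its secondary variable, and declare a dependency on a data-loading stage via set_dependency.

    class Stage:
        """Hypothetical stage object: one segment of application logic."""
        def __init__(self, stage_id, primary_variable, secondary_variable,
                     kernel_function, key_projection, update_function):
            self.stage_id = stage_id
            self.primary_variable = primary_variable      # keys have no mutual dependencies
            self.secondary_variable = secondary_variable  # keys derived via key_projection
            self.kernel_function = kernel_function        # user-provided core logic
            self.key_projection = key_projection          # maps a primary key to secondary key(s)
            self.update_function = update_function        # how update_variable updates its variable
            self.dependencies = []

        def set_dependency(self, other_stage):
            """Declare that this stage depends on other_stage (one edge of the DAG)."""
            self.dependencies.append(other_stage)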
The stage group module describes a group of closely related stages. A stage group has a group_id attribute as its unique identifier and two interfaces: run and set_barrier. The optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed. The set_barrier interface sets up a synchronization barrier: after the current stage group finishes, all working nodes must enter a barrier wait state, and execution continues only after all working nodes have finished this stage group.
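A usage sketch of the run and set_barrier interfaces is shown below; the StageGroup class and the trivial stand-in stages are assumptions for illustration only.

    class StageGroup:
        """Hypothetical stage group: a set of closely related stages."""
        def __init__(self, group_id, stage_run_methods):
            self.group_id = group_id
            self.stage_run_methods = stage_run_methods  # run methods already ordered by the DAG
            self.barrier = False

        def set_barrier(self):
            """After this group finishes, all working nodes wait at a barrier."""
            self.barrier = True

        def run(self, num_run=1):
            """Execute the whole group num_run times, stages in their determined order."""
            for _ in range(num_run):
                for run_stage in self.stage_run_methods:
                    run_stage()

    # Example: run the group twice, then (conceptually) synchronize all working nodes.
    group = StageGroup("g0", [lambda: print("compute stage"), lambda: print("update stage")])
    group.set_barrier()
    group.run(num_run=2)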
The scheduler module decides, for a given tensor object, the set of keys (key_set) that the working node will process in the next stage. The scheduler module on each service node periodically broadcasts the bandwidth information of its node; the scheduler module on each working node uses the bandwidth information it receives from the service nodes to decide the set of model-parameter keys that the working node will process next.
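One way such a load-based decision could look is sketched below; the dictionary layout and function name are assumptions for illustration, not the invention's scheduling algorithm.

    def choose_next_key_set(server_loads, keys_by_server, processed_keys):
        """Pick the unprocessed keys maintained by the least-loaded service node.

        server_loads   : {server_id: current network load reported by its scheduler}
        keys_by_server : {server_id: set of keys that service node maintains}
        processed_keys : keys this working node has already processed
        """
        candidates = {
            sid: keys_by_server[sid] - processed_keys
            for sid in server_loads
            if keys_by_server[sid] - processed_keys
        }
        if not candidates:
            return set()
        lightest = min(candidates, key=lambda sid: server_loads[sid])
        return candidates[lightest]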
The execution engine module describes the stages of a stage group and their dependencies as a directed acyclic graph (directed acyclic graph, DAG). In this DAG, each node represents a stage and each directed edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at the head of the edge.
The message tracking module records the messages submitted at run time by the tensor module, the scheduler module, the stage group module, the working node module, the service node module, and the host node module. When a message is submitted to the message tracking module, the module delivers it to the recipient; after the recipient returns a receipt, the message tracking module notifies the original sender and delivers the acknowledgement.
Correspondingly, the present invention also provides an automated task-parallel method for distributed machine learning, which automatically divides tasks and executes them in parallel in distributed machine learning scenarios. The method includes a system initialization step, a parallel training step, and a system finishing step, wherein:
(1) System initialization step: initialize the node topology information and the application logic, including the following sub-steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, which is working node, service node, or host node; go to sub-step (1.2);
(1.2) Working nodes and service nodes each communicate with the host node to report their node information; the host node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the working nodes and service nodes receive the node information sent by the host node, they initialize the node topology information for subsequent inter-node communication; go to step (1.4);
(1.4) The working nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups according to the order in which they appear in the program code, and builds the DAG corresponding to each stage group; go to step (2);
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met; the behavior of a working node includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the working node's runtime system topologically sorts the nodes of the group's DAG to determine the execution order of all stages within the group; go to step (2.2);
(2.2) Call the currently unexecuted stage group next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3); if there is currently no unexecuted stage group, go to step (2.6);
(2.3) The run method of the stage group denoted next_group is executed num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and executes the run methods of all stages in the order determined in step (2.1): stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The working node's runtime system checks whether set_barrier is set on the stage group that has just executed num_run times; if so, it performs a barrier synchronization; go to step (2.5);
(2.5) If there is still an unexecuted stage group, set next_group to the currently unexecuted stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The working node's runtime system checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1);
(3) System finishing step: each working node informs the host node that its work is complete; after the host node detects that all working nodes have finished, it notifies all nodes to exit in a coordinated manner, including the following sub-steps:
(3.1) All working nodes send job_done messages to the host node; after the host node has received the job_done messages of all working nodes, it sends a sys_exit message to all working nodes and service nodes; go to step (3.2);
(3.2) After the working nodes and service nodes receive the sys_exit message, they send sys_exit_ack messages to the host node; go to step (3.3);
(3.3) The host node receives the sys_exit_ack messages sent by all working nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
The flow in step (2.1) above for determining the execution order of the stages within a stage group includes the following sub-steps (a code sketch of the procedure follows the sub-steps):
(2.1.1) Set the current unassigned number order to 0 and set the set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add every node whose in-degree in the current DAG is 0 to the set nodes; number all nodes in nodes with the current order, then increment order by 1; remove the nodes in nodes and all their outgoing edges from the DAG; set nodes back to empty; go to step (2.1.3);
(2.1.3) If the current DAG is empty, go to step (2.2); otherwise go to step (2.1.2).
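Sub-steps (2.1.1)-(2.1.3) are essentially Kahn's topological sort with batch numbering: all stages whose in-degree has dropped to zero receive the same order number and may be pipelined together. The sketch below is one reading of that procedure; the dependency-dictionary representation of the DAG is an assumption for illustration.

    def order_stages(dag):
        """Assign an order number to every stage of a DAG.

        dag: {stage_id: set of stage_ids it depends on}. Stages that receive the
        same number have no remaining dependencies on each other.
        """
        remaining = {s: set(deps) for s, deps in dag.items()}
        numbering, order = {}, 0                                      # step (2.1.1)
        while remaining:                                              # step (2.1.3)
            ready = [s for s, deps in remaining.items() if not deps]  # in-degree 0
            if not ready:
                raise ValueError("dependency cycle: not a DAG")
            for s in ready:                                           # step (2.1.2): number and remove
                numbering[s] = order
                del remaining[s]
            for deps in remaining.values():                           # drop their outgoing edges
                deps.difference_update(ready)
            order += 1
        return numbering

    # Example: "c" depends on "a" and "b", so "a" and "b" share number 0 and "c" gets 1.
    print(order_stages({"a": set(), "b": set(), "c": {"a", "b"}}))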
The run method of a stage described in step (2.3) above includes the following sub-steps (summarized by the sketch after the sub-steps):
(2.3.1) The working node's runtime system calls the stage's prepare_variables method to determine the key set primary_key_set of the primary variable primary_variable to be processed in the current phase. Specifically, based on the service-node load information (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the primary variable's keys across the service nodes, the set of not-yet-processed keys maintained on the service node with the lowest network load is assigned to the working node as the next key set primary_key_set to process; go to sub-step (2.3.2);
(2.3.2) From the user-provided key_projection function and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull method of the corresponding tensor objects, i.e. the primary and secondary variables, to fetch the required model parameters; go to step (2.3.3);
(2.3.3) The stage's core function kernel_function is executed: the runtime system automatically divides the primary variable's key set key_set into num_threads parts and creates num_threads threads to execute the core function in parallel, where num_threads is a user-provided parameter; go to step (2.3.4);
(2.3.4) Running the core function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the user-provided update_function. If the type of variable v is globally shared or globally unique, the runtime calls the push function of variable v to propagate the update: it serializes the variable's (key, value) pairs to be updated and sends the serialized data to all service nodes that maintain keys in that range; after a service node receives the update data, it updates the data it maintains.
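Taken together, sub-steps (2.3.1)-(2.3.4) can be summarized by the sketch below; the helper objects (the scheduler, the tensor variables, and their is_global/pull/push methods) are assumptions standing in for the runtime components described above, not the invention's actual code.

    from concurrent.futures import ThreadPoolExecutor

    def run_stage(stage, scheduler, num_threads):
        """Hypothetical run method of one stage on a working node."""
        # (2.3.1) decide which primary keys to process, based on service-node load
        primary_key_set = scheduler.prepare_variables(stage.primary_variable)

        # (2.3.2) derive the secondary keys and pull both variables' parameters
        secondary_key_set = {k2 for k in primary_key_set for k2 in stage.key_projection(k)}
        stage.primary_variable.pull(primary_key_set)
        stage.secondary_variable.pull(secondary_key_set)

        # (2.3.3) split the primary key set into num_threads parts and run the kernel in parallel
        keys = list(primary_key_set)
        chunks = [keys[i::num_threads] for i in range(num_threads)]
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            partial_updates = list(pool.map(stage.kernel_function, chunks))

        # (2.3.4) merge the per-thread updates via update_function, then push global variables
        v_update = stage.update_function(partial_updates)
        if stage.primary_variable.is_global():
            stage.primary_variable.push(v_update)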
The pull method described in step (2.3.2) above has the following sub-steps (illustrated by the sketch that follows):
(2.3.2.1) Serialize the key set key_set to be pulled for the tensor object and send the serialized data to the service nodes that maintain that key range; go to step (2.3.2.2);
(2.3.2.2) After a service node receives the pull message, it serializes the (key, value) pairs corresponding to key_set and returns the serialized data to the requester.
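The pull round trip of sub-steps (2.3.2.1)-(2.3.2.2) reduces to serializing a key set on the requester and serializing the matching (key, value) pairs on the service node. The sketch below uses Python's pickle purely as an illustrative serialization format, not the one used by the invention.

    import pickle

    def pull_request(key_set):
        """Working-node side, step (2.3.2.1): serialize the keys to be pulled."""
        return pickle.dumps(sorted(key_set))

    def pull_reply(request_bytes, maintained_params):
        """Service-node side, step (2.3.2.2): serialize and return the matching pairs."""
        key_set = pickle.loads(request_bytes)
        reply = {k: maintained_params[k] for k in key_set if k in maintained_params}
        return pickle.dumps(reply)

    # Example round trip against an in-process stand-in for a service node.
    server_data = {"w1": 0.5, "w2": -1.2, "w3": 3.0}
    wire = pull_reply(pull_request({"w1", "w3"}), server_data)
    print(pickle.loads(wire))   # {'w1': 0.5, 'w3': 3.0}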
In summary, compared with the prior art, the technical scheme of the invention has the following advantages and technical effects:
(1) The invention provides programming modules at a higher level of abstraction than a global read/write interface; these modules decouple read/write access behavior from the application logic, which on the one hand greatly simplifies writing application programs and on the other hand provides the foundation for system-level optimizations;
(2) The invention achieves automated parallel execution of machine learning tasks, which greatly reduces the burden on application programmers of writing highly concurrent machine learning applications;
(3) The runtime system developed by the invention dynamically divides tasks according to the load of each service node, making full use of network bandwidth resources.
Brief description of the drawings
Fig. 1 is the module block diagram of the automated task-parallel system of the invention;
Fig. 2 is the overall workflow diagram of the automated task-parallel method of the invention;
Fig. 3 is the system-initialization sub-workflow diagram of the automated task-parallel method of the invention;
Fig. 4 is the parallel-training sub-workflow diagram of the automated task-parallel method of the invention;
Fig. 5 is the system-finishing sub-workflow diagram of the automated task-parallel method of the invention.
Detailed description of the invention
To make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention, not to limit it. In addition, the technical features involved in the embodiments described below can be combined with each other as long as they do not conflict.
Fig. 1 is the module block diagram of the automated task-parallel system of the invention. As shown in Fig. 1, the automated task-parallel system of the invention comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module. The stage module and the scheduler module are both connected with the tensor module; the stage module is connected with the stage group module; the execution engine module is connected with the stage module; the scheduler module, the tensor module, and the stage group module are all connected with the message tracking module.
The working node module and the service node module are abstract descriptions of the behavior of working nodes and parameter service nodes, respectively, and both modules are transparent to machine learning programmers.
The host node module is an abstract description of the host node. The host node coordinates the overall workflow of the system, such as system initialization and system shutdown. Of the modules above, all except the working node module, the service node module, and the host node module are present on every node.
The tensor module describes the key-value pairs (key, value) of model parameters in machine learning. An application uses several tensor objects to describe the model parameters needed for training; each tensor object has a tensor_id attribute as its unique identifier. There are three types of tensor objects: globally shared (global shared), globally unique (global unique), and local (local). A globally shared tensor object is maintained by distributed nodes, and the data maintained on different nodes may overlap; a globally unique tensor object is maintained by distributed nodes, and the data maintained on different nodes do not overlap; a local tensor object exists on only one node. A tensor object provides operation interfaces such as load, pull, and push for programmers.
The stage module describes a segment of program logic in the application. The overall logic of the application is decomposed into different stages, and each stage object has a stage_id attribute as its unique identifier. Dependencies between stage objects can be declared with the set_dependency function. A stage object takes several tensor objects as its input and has one optional output. A stage's inputs have two types: the primary variable (primary_variable) and the secondary variable (secondary_variable). The (key, value) pairs of a primary variable have no dependencies among their keys, while the (key, value) pairs of a secondary variable do. For each stage, the programmer provides a core function kernel_function as the stage's core logic, together with a mapping function key_projection from primary-variable keys to secondary-variable keys; the runtime system derives the secondary-variable keys automatically from the primary-variable keys and the key_projection function. Each primary and secondary variable of a stage has a corresponding variable called update_variable, which is used to update that variable; the update logic is defined by a user-provided update_function.
The stage group module describes a group of closely related stages. A stage group has a group_id attribute as its unique identifier and two interfaces: run and set_barrier. The optional parameter of the run method is an integer num_run that specifies how many times the stage group is executed. The set_barrier interface sets up a synchronization barrier: after the current stage group finishes, all working nodes must enter a barrier wait state, and execution continues only after all working nodes have finished this stage group.
The scheduler module decides, for a given tensor object, the set of keys (key_set) that the working node will process in the next stage. The scheduler on each service node periodically broadcasts the bandwidth information of its node; the scheduler on each working node uses the bandwidth information it receives from the service nodes to decide the set of model-parameter keys that the working node will process next.
The execution engine module describes the stages of a stage group and their dependencies as a directed acyclic graph (directed acyclic graph, DAG). In this DAG, each node represents a stage and each directed edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at the head of the edge.
The message tracking module records the messages submitted at run time by the tensor module, the scheduler module, the stage group module, the working node module, the service node module, and the host node module. When a message is submitted to the message tracking module, the module delivers it to the recipient; after the recipient returns a receipt, the message tracking module notifies the original sender and delivers the acknowledgement.
Fig. 2 is the overall workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 2, the overall workflow of the automated task-parallel method of the invention comprises the following steps:
(1) System initialization step: initialize the node topology information and the application logic;
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met;
(3) System finishing step: each working node informs the host node that its work is complete; after the host node detects that all working nodes have finished, it notifies all nodes to exit in a coordinated manner.
Fig. 3 is the system-initialization sub-workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 3, the system-initialization workflow comprises the following steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, which is working node, service node, or host node; go to sub-step (1.2);
(1.2) Working nodes and service nodes each communicate with the host node to report their node information; the host node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the working nodes and service nodes receive the node information sent by the host node, they initialize the node topology information for subsequent inter-node communication; go to step (1.4);
(1.4) The working nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups according to the order in which they appear in the program code, and builds the DAG corresponding to each stage group; go to step (2).
Fig. 4 is the parallel-training sub-workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 4, the parallel-training sub-workflow of a working node comprises the following steps:
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met; the behavior of a working node includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the working node's runtime system topologically sorts the nodes of the group's DAG to determine the execution order of all stages within the group; go to step (2.2);
(2.2) Call the currently unexecuted stage group next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3); if there is currently no unexecuted stage group, go to step (2.6);
(2.3) The run method of the stage group denoted next_group is executed num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and executes the run methods of all stages in the order determined in step (2.1): stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The working node's runtime system checks whether set_barrier is set on the stage group that has just executed num_run times; if so, it performs a barrier synchronization; go to step (2.5);
(2.5) If there is still an unexecuted stage group, set next_group to the currently unexecuted stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The working node's runtime system checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1).
Fig. 5 is the system-finishing sub-workflow diagram of the automated task-parallel method of the invention. As shown in Fig. 5, the system-finishing sub-workflow comprises the following steps:
(3.1) All working nodes send job_done messages to the host node; after the host node has received the job_done messages of all working nodes, it sends a sys_exit message to all working nodes and service nodes; go to step (3.2);
(3.2) After the working nodes and service nodes receive the sys_exit message, they send sys_exit_ack messages to the host node; go to step (3.3);
(3.3) The host node receives the sys_exit_ack messages sent by all working nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
Further, the flow in step (2.1) for determining the execution order of the stages within a stage group includes the following sub-steps:
(2.1.1) Set the current unassigned number order to 0 and set the set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add every node whose in-degree in the current DAG is 0 to the set nodes; number all nodes in nodes with the current order, then increment order by 1; remove the nodes in nodes and all their outgoing edges from the DAG; set nodes back to empty; go to step (2.1.3);
(2.1.3) If the current DAG is empty, go to step (2.2); otherwise go to step (2.1.2).
Further, the run method of a stage described in step (2.3) includes the following sub-steps:
(2.3.1) The working node's runtime system calls the stage's prepare_variables method to determine the key set primary_key_set of the primary variable primary_variable to be processed in the current phase; specifically, based on the service-node load information (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the primary variable's keys across the service nodes, the set of not-yet-processed keys maintained on the service node with the lowest network load is assigned to the working node as the next key set to process; go to sub-step (2.3.2);
(2.3.2) From the user-provided key_projection function and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull method of the corresponding tensor objects, i.e. the primary and secondary variables, to fetch the required model parameters; go to step (2.3.3);
(2.3.3) The stage's core function kernel_function is executed: the runtime system automatically divides the primary variable's key set key_set into num_threads parts and creates num_threads threads to execute the core function in parallel, where num_threads is a user-provided parameter; go to step (2.3.4);
(2.3.4) Running the core function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the user-provided update_function; if the type of variable v is globally shared or globally unique, the runtime calls the push function of variable v to propagate the update: it serializes the variable's (key, value) pairs to be updated and sends the serialized data to all service nodes that maintain keys in that range; after a service node receives the update data, it updates the data it maintains.
Further, the pull method described in step (2.3.2) has the following sub-steps:
(2.3.2.1) Serialize the key set key_set to be pulled for the tensor object and send the serialized data to the service nodes that maintain that key range; go to step (2.3.2.2);
(2.3.2.2) After a service node receives the pull message, it serializes the (key, value) pairs corresponding to key_set and returns the serialized data to the requester.
As will be readily appreciated by those skilled in the art, the foregoing is only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, and improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (5)

1. An automated task-parallel system for distributed machine learning, characterized in that it comprises a working node module, a service node module, a host node module, a tensor module, a scheduler module, a message tracking module, a stage module, a stage group module, and an execution engine module; wherein the stage module and the scheduler module are both connected with the tensor module; the stage module is connected with the stage group module; the execution engine module is connected with the stage module; and the scheduler module, the tensor module, and the stage group module are all connected with the message tracking module;
the working node module and the service node module are respectively used to describe abstractly the behavior of working nodes and parameter service nodes;
the host node module is an abstract description of the host node, which coordinates the overall workflow of the system, including system initialization and system shutdown;
the tensor module describes the key-value pairs (key, value) of model parameters in machine learning; an application uses several tensor objects to describe the model parameters needed for training, and each tensor object has a tensor_id attribute as its unique identifier; there are three types of tensor objects: globally shared (global shared), globally unique (global unique), and local (local); a globally shared tensor object is maintained by distributed nodes, and the data maintained on different nodes may overlap; a globally unique tensor object is maintained by distributed nodes, and the data maintained on different nodes do not overlap; a local tensor object exists on only one node; a tensor object provides operation interfaces such as load, pull, and push for programmers;
the stage module describes a segment of program logic in the application; the overall logic of the application is decomposed into different stages, and each stage object has a stage_id attribute as its unique identifier; dependencies between stage objects can be declared with the set_dependency function; a stage object takes several tensor objects as its input and has one optional output; a stage object's inputs have two types, one called the primary variable (primary_variable) and the other the secondary variable (secondary_variable); the (key, value) pairs of the primary variable have no dependencies among their keys, while the (key, value) pairs of the secondary variable have dependencies among their keys; for each stage, the programmer provides a core function kernel_function as the stage's core logic, together with a mapping function key_projection from primary-variable keys to secondary-variable keys, and the runtime system derives the secondary-variable keys automatically from the primary-variable keys and the key_projection function; each primary and secondary variable of a stage has a corresponding variable called update_variable, which is used to update that variable, and the update logic is defined by a user-provided update_function;
the stage group module describes a group of closely related stages; a stage group has a group_id attribute as its unique identifier and two interfaces: run and set_barrier; the optional parameter of the run method is an integer num_run that specifies the number of times the stage group is executed; the set_barrier interface sets up a synchronization barrier, meaning that after the current stage group finishes, all working nodes must enter a barrier wait state, and execution continues only after all working nodes have finished this stage group;
the scheduler module decides, for a given tensor object, the set of keys (key_set) that the working node will process in the next stage; the scheduler on each service node periodically broadcasts the bandwidth information of its node, and the scheduler on each working node uses the bandwidth information it receives from the service nodes to decide the set of model-parameter keys that the working node will process next;
the execution engine module describes the stages of a stage group and their dependencies as a directed acyclic graph (directed acyclic graph, DAG), in which each node represents a stage and each directed edge represents a dependency between stages: the stage at the tail of an edge must execute before the stage at the head of the edge;
the message tracking module records the messages submitted at run time by the tensor module, the scheduler module, the stage group module, the working node module, the service node module, and the host node module; when a message is submitted to the message tracking module, the module delivers it to the recipient, and after the recipient returns a receipt, the message tracking module notifies the original sender and delivers the acknowledgement.
2. An automated task-parallel method for distributed machine learning, characterized in that it comprises a system initialization step, a parallel training step, and a system finishing step, wherein:
(1) System initialization step: initialize the node topology information and the application logic, including the following sub-steps:
(1.1) All nodes start running and each reads the configuration file to determine its own role, which is working node, service node, or host node; go to sub-step (1.2);
(1.2) Working nodes and service nodes each communicate with the host node to report their node information; the host node broadcasts the collected node information to all other nodes; go to step (1.3);
(1.3) After the working nodes and service nodes receive the node information sent by the host node, they initialize the node topology information for subsequent inter-node communication; go to step (1.4);
(1.4) The working nodes and service nodes initialize the application logic; the runtime system determines the execution order of the stage groups according to the order in which they appear in the program code, and builds the DAG corresponding to each stage group; go to step (2);
(2) Parallel training step: the host node and the service nodes skip the concrete training logic and proceed to step (3); each working node enters the model-training state and performs iterative parallel training on its subset of the input training data until the predefined iteration termination condition is met; the behavior of a working node includes the following sub-steps:
(2.1) For each stage group whose order has been determined, the working node's runtime system topologically sorts the nodes of the group's DAG to determine the execution order of all stages within the group; go to step (2.2);
(2.2) Call the currently unexecuted stage group next_group; set next_group to the stage group that appears first in the machine learning application logic; go to step (2.3); if there is currently no unexecuted stage group, go to step (2.6);
(2.3) The run method of the stage group denoted next_group is executed num_run times (num_run is a parameter supplied by the user when the program starts); for a single invocation of the run method, the runtime creates a batch of threads and executes the run methods of all stages in the order determined in step (2.1): stages with smaller numbers execute first, and stages with the same number execute in a pipelined fashion; after the stage group has executed num_run times, go to step (2.4);
(2.4) The working node's runtime system checks whether set_barrier is set on the stage group that has just executed num_run times; if so, it performs a barrier synchronization; go to step (2.5);
(2.5) If there is still an unexecuted stage group, set next_group to the currently unexecuted stage group and go to step (2.3); otherwise go to step (2.6);
(2.6) The working node's runtime system checks whether the iteration termination condition has been reached; if so, go to step (3), otherwise go to step (2.1);
(3) System finishing step: each working node informs the host node that its work is complete; after the host node detects that all working nodes have finished, it notifies all nodes to exit in a coordinated manner, including the following sub-steps:
(3.1) All working nodes send job_done messages to the host node; after the host node has received the job_done messages of all working nodes, it sends a sys_exit message to all working nodes and service nodes; go to step (3.2);
(3.2) After the working nodes and service nodes receive the sys_exit message, they send sys_exit_ack messages to the host node; go to step (3.3);
(3.3) The host node receives the sys_exit_ack messages sent by all working nodes and service nodes; go to step (3.4);
(3.4) All nodes terminate the program.
3. The automated task-parallel method for distributed machine learning of claim 2, characterized in that the flow in step (2.1) for determining the execution order of the stages within a stage group includes the following sub-steps:
(2.1.1) Set the current unassigned number order to 0 and set the set nodes of nodes with in-degree 0 to empty; go to step (2.1.2);
(2.1.2) Add every node whose in-degree in the current DAG is 0 to the set nodes; number all nodes in nodes with the current order, then increment order by 1; remove the nodes in nodes and all their outgoing edges from the DAG; set nodes back to empty; go to step (2.1.3);
(2.1.3) If the current DAG is empty, go to step (2.2); otherwise go to step (2.1.2).
4. The automated task-parallel method for distributed machine learning of claim 2, characterized in that the run method of a stage in step (2.3) includes the following sub-steps:
(2.3.1) The working node's runtime system calls the stage's prepare_variables method to determine the key set primary_key_set of the primary variable primary_variable to be processed in the current phase; specifically, based on the service-node load information (L1, L2, ..., Ln) obtained from the scheduler module and the distribution of the primary variable's keys across the service nodes, the set of not-yet-processed keys maintained on the service node with the lowest network load is assigned to the working node as the next key set primary_key_set to process; go to sub-step (2.3.2);
(2.3.2) From the user-provided key_projection function and the primary-variable key set primary_key_set determined in (2.3.1), the runtime system derives the secondary-variable key set secondary_key_set, and calls the pull method of the corresponding tensor objects, i.e. the primary and secondary variables, to fetch the required model parameters; go to step (2.3.3);
(2.3.3) The stage's core function kernel_function is executed: the runtime system automatically divides the primary variable's key set key_set into num_threads parts and creates num_threads threads to execute the core function in parallel, where num_threads is a user-provided parameter; go to step (2.3.4);
(2.3.4) Running the core function kernel_function produces the update variable v_update; the runtime system updates the corresponding variable v according to the user-provided update_function; if the type of variable v is globally shared or globally unique, the runtime calls the push function of variable v to propagate the update: it serializes the variable's (key, value) pairs to be updated and sends the serialized data to all service nodes that maintain keys in that range; after a service node receives the update data, it updates the data it maintains.
5. The automated task-parallel method for distributed machine learning of claim 4, characterized in that the pull method in step (2.3.2) has the following sub-steps:
(2.3.2.1) Serialize the key set key_set to be pulled for the tensor object and send the serialized data to the service nodes that maintain that key range; go to step (2.3.2.2);
(2.3.2.2) After a service node receives the pull message, it serializes the (key, value) pairs corresponding to key_set and returns the serialized data to the requester.
CN201610255970.5A 2016-04-22 2016-04-22 Automated task parallel method suitable for distributed machine learning and system thereof Active CN105956021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610255970.5A CN105956021B (en) 2016-04-22 2016-04-22 Automated task parallel method suitable for distributed machine learning and system thereof

Publications (2)

Publication Number Publication Date
CN105956021A true CN105956021A (en) 2016-09-21
CN105956021B CN105956021B (en) 2019-05-21

Family

ID=56915367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610255970.5A Active CN105956021B (en) 2016-04-22 2016-04-22 Automated task parallel method suitable for distributed machine learning and system thereof

Country Status (1)

Country Link
CN (1) CN105956021B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027938B1 (en) * 2007-03-26 2011-09-27 Google Inc. Discriminative training in machine learning
CN102546247A (en) * 2011-12-29 2012-07-04 华中科技大学 Massive data continuous analysis system suitable for stream processing
US20140280142A1 (en) * 2013-03-14 2014-09-18 Science Applications International Corporation Data analytics system
CN103763378A (en) * 2014-01-24 2014-04-30 中国联合网络通信集团有限公司 Task processing method and system and nodes based on distributive type calculation system
CN104360903A (en) * 2014-11-18 2015-02-18 北京美琦华悦通讯科技有限公司 Method for realizing task data decoupling in spark operation scheduling system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MU LI: "Scaling Distributed Machine Learning", Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation *
WEI DAI: "Petuum: A Framework for Iterative-Convergent", Proceedings of Advances in Neural Information Processing Systems *
YUCHENG LOW: "Distributed GraphLab: A Framework for Machine Learning", Proceedings of the VLDB Endowment *
YU LINCHEN et al.: "Cache Coherence Maintenance Mechanism in a Transactional-Memory-Based Distributed Programming Environment", Microelectronics & Computer *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009642B (en) * 2016-10-31 2021-12-14 腾讯科技(深圳)有限公司 Distributed machine learning method and system
CN108009642A (en) * 2016-10-31 2018-05-08 腾讯科技(深圳)有限公司 Distributed machines learning method and system
CN108229686B (en) * 2016-12-14 2022-07-05 阿里巴巴集团控股有限公司 Model training and predicting method and device, electronic equipment and machine learning platform
CN108229686A (en) * 2016-12-14 2018-06-29 阿里巴巴集团控股有限公司 Model training, Forecasting Methodology, device, electronic equipment and machine learning platform
CN110612539B (en) * 2017-03-15 2024-04-12 西门子股份公司 Method for executing a machine learning model on a memory-constrained industrial device
CN110612539A (en) * 2017-03-15 2019-12-24 西门子股份公司 Method for executing machine learning model on memory-limited industrial equipment
CN108733461A (en) * 2017-04-18 2018-11-02 北京京东尚科信息技术有限公司 Distributed task dispatching method and apparatus
CN107231558A (en) * 2017-05-23 2017-10-03 江苏火米互动科技有限公司 A kind of implementation method of the H.264 parallel encoder based on CUDA
CN107231558B (en) * 2017-05-23 2019-10-22 江苏火米互动科技有限公司 A kind of implementation method of the H.264 parallel encoder based on CUDA
CN107729353B (en) * 2017-08-30 2020-04-07 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
CN111597187A (en) * 2017-08-30 2020-08-28 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
EP3678068A4 (en) * 2017-08-30 2020-11-25 The Fourth Paradigm (Beijing) Tech Co Ltd Distributed system for executing machine learning and method therefor
CN107622310A (en) * 2017-08-30 2018-01-23 第四范式(北京)技术有限公司 For performing the distributed system and its method of machine learning
CN109447274A (en) * 2017-08-30 2019-03-08 第四范式(北京)技术有限公司 For executing the distributed system and its method of machine learning
CN107729353A (en) * 2017-08-30 2018-02-23 第四范式(北京)技术有限公司 For performing the distributed system and its method of machine learning
CN111597187B (en) * 2017-08-30 2023-09-01 第四范式(北京)技术有限公司 Distributed system for performing machine learning and method thereof
CN109814986B (en) * 2017-11-20 2021-01-05 上海寒武纪信息科技有限公司 Task parallel processing method, storage medium, computer equipment, device and system
CN109814986A (en) * 2017-11-20 2019-05-28 上海寒武纪信息科技有限公司 Task method for parallel processing, storage medium, computer equipment, device and system
CN107944566A (en) * 2017-11-28 2018-04-20 杭州云脑科技有限公司 A kind of machine learning method, host node, working node and system
WO2019104713A1 (en) * 2017-11-28 2019-06-06 杭州云脑科技有限公司 Machine learning method, master node, work node, and system
CN109960570A (en) * 2017-12-14 2019-07-02 北京图森未来科技有限公司 A kind of multimode dispatching method, device and system
CN108681777A (en) * 2018-05-07 2018-10-19 北京京东尚科信息技术有限公司 A kind of method and apparatus of the machine learning program operation based on distributed system
CN108681777B (en) * 2018-05-07 2021-07-20 北京京东尚科信息技术有限公司 Method and device for running machine learning program based on distributed system
CN109871958A (en) * 2019-02-01 2019-06-11 东软医疗系统股份有限公司 The method, device and equipment of training pattern
WO2020243973A1 (en) * 2019-06-06 2020-12-10 华为技术有限公司 Model-based signal inference method and apparatus
WO2021051772A1 (en) * 2019-09-19 2021-03-25 Huawei Technologies Co., Ltd. Method and apparatus for vectorized resource scheduling in distributed computing systems using tensors
US11907770B2 (en) 2019-09-19 2024-02-20 Huawei Cloud Computing Technologies Co., Ltd. Method and apparatus for vectorized resource scheduling in distributed computing systems using tensors
CN110990059B (en) * 2019-11-28 2021-11-19 中国科学院计算技术研究所 Stream type calculation engine operation method and system for tilt data
CN110990059A (en) * 2019-11-28 2020-04-10 中国科学院计算技术研究所 Stream type calculation engine operation method and system for tilt data
TWI780382B (en) * 2019-12-05 2022-10-11 新唐科技股份有限公司 Microcontroller updating system and method
CN111506402B (en) * 2020-03-31 2023-06-27 上海氪信信息技术有限公司 Computer task scheduling method, device, equipment and medium for machine learning modeling
CN111506402A (en) * 2020-03-31 2020-08-07 上海氪信信息技术有限公司 Computer task scheduling method, device, equipment and medium for machine learning modeling
CN111580970A (en) * 2020-05-07 2020-08-25 电子科技大学 Transmission scheduling method for model distribution and aggregation of federated learning
CN111753997B (en) * 2020-06-28 2021-08-27 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium
CN111753997A (en) * 2020-06-28 2020-10-09 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium
US11954611B2 (en) 2020-08-27 2024-04-09 International Business Machines Corporation Tensor comparison across a distributed machine learning environment
CN112214256B (en) * 2020-09-30 2024-02-02 招商局金融科技有限公司 Machine learning operation control method and device, electronic equipment and storage medium
CN112214256A (en) * 2020-09-30 2021-01-12 招商局金融科技有限公司 Operation control method and device for machine learning, electronic equipment and storage medium
CN113157413B (en) * 2021-04-16 2022-04-26 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN114461392B (en) * 2022-01-25 2023-03-31 西南交通大学 Bandwidth-aware selective data multicast method
CN114461392A (en) * 2022-01-25 2022-05-10 西南交通大学 Bandwidth-aware selective data multicast method
CN115314397A (en) * 2022-08-05 2022-11-08 中科计算技术西部研究院 Network simulation method, system, device and storage medium for distributed training
CN116483580A (en) * 2022-09-29 2023-07-25 陕西震旦纪信息技术有限公司 System and method for scheduling server computing resources based on Kubernetes
CN116483580B (en) * 2022-09-29 2024-05-28 陕西震旦纪信息技术有限公司 System and method for scheduling server computing resources based on Kubernetes
CN116662039A (en) * 2023-07-25 2023-08-29 菲特(天津)检测技术有限公司 Industrial information parallel detection method, device and medium based on shared memory
CN116662039B (en) * 2023-07-25 2024-01-23 菲特(天津)检测技术有限公司 Industrial information parallel detection method, device and medium based on shared memory

Also Published As

Publication number Publication date
CN105956021B (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN105956021A (en) Automated task parallel method suitable for distributed machine learning and system thereof
CN107239335B (en) Job scheduling system and method for distributed system
CN105117286B (en) The dispatching method of task and streamlined perform method in MapReduce
US8443351B2 (en) Parallel loops in a workflow
CN111738434B (en) Method for executing deep neural network on heterogeneous processing unit
US11915101B2 (en) Numerical quantum experimentation
CN109669452A (en) A kind of cloud robot task dispatching method and system based on parallel intensified learning
US20150331713A1 (en) Parallel simulation using multiple co-simulators
CN103377035A (en) Pipeline parallelization method for coarse-grained streaming application
Yu et al. Automated runtime-aware scheduling for multi-tenant dnn inference on gpu
US20200272896A1 (en) System for deep learning training using edge devices
Ward et al. Colmena: Scalable machine-learning-based steering of ensemble simulations for high performance computing
CN109918199A (en) Distributed figure processing system based on GPU
CN112764893B (en) Data processing method and data processing system
CN105719126A (en) System and method for internet big data task scheduling based on life cycle model
CN106681820A (en) Message combination based extensible big data computing method
Yi et al. Fast training of deep learning models over multiple gpus
CN117009038B (en) Graph computing platform based on cloud native technology
Busch et al. Dynamic scheduling in distributed transactional memory
CN113010296B (en) Formalized model based task analysis and resource allocation method and system
Feljan et al. Task allocation optimization for multicore embedded systems
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN113568747A (en) Cloud robot resource scheduling method and system based on task classification and time sequence prediction
CN106055862A (en) Novel efficient heuristic-type two-stage parallel branch-and-bound method
Yi et al. Optimizing DNN compilation for distributed training with joint OP and tensor fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant