CN114650288A - Distributed training method and system, terminal device and computer readable storage medium - Google Patents
- Publication number
- CN114650288A (application number CN202011399386.XA)
- Authority
- CN
- China
- Prior art keywords
- node
- grouping
- working
- model parameters
- working node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
Abstract
The invention provides a distributed training method and system, a terminal device and a computer-readable storage medium. The distributed training method comprises the following steps: a working node trains a local model with local training data according to initial model parameters to obtain model parameters of the local model; the working node obtains updated model parameters according to the model parameters of the local model and its grouping situation; the updated model parameters obtained in the current round of training are used as the initial model parameters of the next round, and multiple rounds of training are executed in a loop until a training stop condition is met. The distributed training method is applied to a distributed training system composed of at least 3 mobile terminals, avoiding the high price, high energy consumption, large volume and poor mobility of implementing distributed machine learning on conventional large-scale servers.
Description
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a distributed training method and system, a terminal device, and a computer-readable storage medium.
Background
Machine learning has proven to be an effective way to extract information from data automatically and has enjoyed great success in fields such as image recognition, speech processing, machine translation, gaming and healthcare. Obtaining a useful machine learning model requires training on large data sets for a long time; as data sizes grow, a single machine can no longer complete the learning task, so the workload must be distributed across multiple machines. Such distributed training improves learning speed and thereby promotes wider application of machine learning. Existing distributed machine learning runs on large-scale servers. Large-scale equipment offers large memory, good performance and stability, but large-scale servers are generally expensive, power-hungry, bulky and immobile, so implementing machine learning on them is very costly.
Disclosure of Invention
In order to solve these defects of the prior art, the invention provides a distributed training method and system, a terminal device and a computer-readable storage medium, applied to a distributed training system comprising at least 3 mobile terminals, thereby reducing the cost of machine learning.
The specific technical scheme provided by the invention is as follows: a distributed training method is applied to a distributed training system that comprises at least 3 mobile terminals, where any one mobile terminal serves as a grouping node and the remaining mobile terminals serve as working nodes. The distributed training method comprises the following steps:
the working node trains the local model by using local training data according to the initial model parameters to obtain model parameters of the local model;
the working node obtains updated model parameters according to the model parameters of the local model and grouping conditions;
and taking the updated model parameters obtained in the current round of training as the initial model parameters of the next round, and executing multiple rounds of training in a loop until the training stop condition is met.
Further, the obtaining, by the working node, the updated model parameter according to the model parameter of the local model and the grouping condition includes:
the working node sends a grouping request to the grouping node;
the grouping node generates grouping information after receiving the grouping request and sends the grouping information to the working node;
the working node judges whether the grouping information is empty or not;
if the grouping information is not empty, the working node judges whether an abnormal working node exists in its group; if not, the average of the model parameters of the local models of all working nodes in the group is taken as the updated model parameters; if an abnormal working node exists, the average of the model parameters of the local models of the non-abnormal working nodes in the group is taken as the updated model parameters;
and if the grouping information is empty, taking the model parameter of the local model as the updated model parameter.
Further, the grouping node generates grouping information after receiving the grouping request, including:
judging whether the grouping information of the working node is empty or not;
if the grouping information of the working nodes is empty, regrouping the working nodes and generating grouping information;
and if the grouping information of the working node is not empty, taking the existing grouping information corresponding to the working node in the working node state table as the grouping information of the working node.
Further, regrouping the working nodes and generating grouping information includes:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
respectively acquiring the number of grouping requests that each working node in the node set to be grouped has sent;
calculating the difference between the number of grouping requests sent by each working node in the node set to be grouped and the number sent by the requesting working node, and adding the working nodes whose difference is smaller than a preset threshold to the group of the requesting working node to generate grouping information;
and updating the working node state table.
Further, regrouping the working nodes and generating grouping information includes:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
judging whether a grouping request sent by a working node in a node set to be grouped is received within a preset time;
and if a grouping request sent by the working nodes in the node set to be grouped is received within a preset time, adding the working nodes sending the grouping request in the node set to be grouped into groups of the working nodes to generate grouping information.
Further, after the working node obtains the updated model parameters according to the model parameters of the local model and the grouping condition, the distributed training method further includes:
the working node sends a release request to the grouping node;
and the grouping node deletes the grouping information of the working node and updates the working node state table.
Further, the training stop condition is that the number of completed training rounds reaches a preset number of rounds.
In order to solve the defects of the prior art, the invention also provides a distributed training system, wherein the system comprises at least 3 mobile terminals, any one mobile terminal serves as a grouping node, the remaining mobile terminals serve as working nodes, and the system trains local models by the distributed training method described above.
The invention also provides a terminal device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to realize the distributed training method.
The present invention also provides a computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the distributed training method as described above.
The distributed training method is applied to a distributed training system composed of at least 3 mobile terminals, in which any one mobile terminal serves as the grouping node and the remaining mobile terminals serve as working nodes. Each working node trains a local model with local training data according to initial model parameters to obtain the model parameters of the local model, and then obtains updated model parameters according to those model parameters and its grouping situation; the updated model parameters obtained in the current round of training are used as the initial model parameters of the next round, and multiple rounds of training are executed in a loop until the training stop condition is met. This avoids the high price, high energy consumption, large volume and poor mobility of implementing distributed machine learning on existing large-scale servers.
Drawings
The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments of the present invention, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic diagram of a distributed training system of the present invention;
FIG. 2 is a flowchart illustrating a distributed training method according to a first embodiment of the present invention;
fig. 3 is a flowchart illustrating a step S2 according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device in the third embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to explain the principles of the invention and its practical application to thereby enable others skilled in the art to understand the invention for various embodiments and with various modifications as are suited to the particular use contemplated. In the drawings, like reference numerals will be used to refer to like elements throughout.
Example one
Referring to fig. 1 and 2, the distributed training system provided in this embodiment includes at least 3 mobile terminals. Any one of them serves as the grouping node (master node) and the remaining mobile terminals serve as working nodes (worker nodes), forming a mobile terminal cluster. Each worker node independently trains a local model using local data, and the master node groups the worker nodes and sends grouping information to them. The master node and the worker nodes train the local models by the following distributed training method:
step S1, the worker node trains the local model with local training data according to the initial model parameters to obtain the model parameters of the local model;
step S2, the worker node obtains updated model parameters according to the model parameters of the local model and its grouping situation;
step S3, taking the updated model parameters obtained in this round of training as the initial model parameters of the next round, and executing multiple rounds of training in a loop until the training stop condition is met.
In step S1, before training the local model, each worker node first acquires its local training data, which it reuses as the training data of all subsequent rounds. The training data of the mobile terminal cluster is divided evenly into data blocks, the number of blocks matching the number of worker nodes, and each worker takes the block corresponding to its sequence number. For example, suppose the training data is {a1, a2, ..., a12} and the distributed training system includes 6 worker nodes with sequence numbers Rank0 to Rank5. The training data is divided into 6 data blocks {a1, a2}, {a3, a4}, ..., {a11, a12}; the worker Rank0 takes {a1, a2}, Rank1 takes {a3, a4}, Rank2 takes {a5, a6}, and so on, until each worker node holds its own local training data for the subsequent rounds of training.
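The even split described above can be sketched in a few lines of Python; the function name and the data labels are illustrative, not taken from the patent:

```python
def partition(data, num_workers):
    """Split the cluster's training data into num_workers equal contiguous blocks."""
    block = len(data) // num_workers
    return [data[i * block:(i + 1) * block] for i in range(num_workers)]

# 12 samples a1..a12 shared among 6 worker nodes, two per node
data = [f"a{i}" for i in range(1, 13)]
blocks = partition(data, 6)
# Rank0 takes blocks[0], Rank1 takes blocks[1], and so on.
```

Each worker then keeps its block as the local training data for every subsequent round.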
After obtaining its local training data, the worker node trains the local model with it according to the initial model parameters. Note that in the first round of training, the initial model parameters are the model parameters of the local model before any training has taken place.
Referring to fig. 3, step S2 specifically includes:
step S21, the worker node sends a grouping request to the master node;
step S22, after receiving the grouping request, the master node generates grouping information and sends the grouping information to the worker node;
step S23, the worker node judges whether the grouping information is empty, if the grouping information is not empty, the step S24 is executed, and if the grouping information is empty, the step S25 is executed;
step S24, judging whether an abnormal worker node exists in the group where the worker node is located; if not, taking the average of the model parameters of the local models of all worker nodes in the group as the updated model parameters; if an abnormal worker node exists, taking the average of the model parameters of the local models of the non-abnormal worker nodes in the group as the updated model parameters;
and step S25, taking the model parameters of the local model as the updated model parameters.
The grouping request in this embodiment is a communication instruction. It may be the serial number of a worker node, i.e. the worker node sends its serial number to the master node; this is shown here only as an example. In other embodiments the grouping request may instead be a preset flag bit whose value of 1 represents a grouping request; this is not used to limit the embodiment.
In step S22, the master node stays in a listening state, monitoring messages sent by the worker nodes through its listening port. When a message arrives, the master node first judges whether it is a grouping request. If it is, the master node generates grouping information and sends it to the worker node; if not, the master node updates the state of that worker node to idle and updates the worker node state table.
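The dispatch logic of step S22 can be sketched as follows; the message constant, the state-table fields and the stub grouping function are assumptions made for illustration:

```python
# Minimal sketch of the master's listening logic: a grouping request yields
# grouping information, any other message marks the sender as idle.
state_table = {0: {"state": 1, "group": []}}   # rank -> state and grouping info

def make_grouping(rank):
    # Placeholder: the real grouping policies are given in steps S2221-S2224.
    state_table.setdefault(rank, {"state": 1, "group": []})
    return state_table[rank]["group"]

def handle_message(rank, msg):
    if msg == "GROUP_REQUEST":
        return make_grouping(rank)
    state_table[rank]["state"] = 0              # mark the worker as idle
    return None
```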
The worker node state table in this embodiment is maintained by a master node, and may include a serial number, a state, and grouping information of each worker node, and the following table gives an example of the worker node state table:
TABLE 1 worker node state table

Serial number   State   Grouping information
Rank0           1       {Rank0, Rank2, Rank5}
Rank1           1       ...
Rank2           1       {Rank0, Rank2, Rank5}
Rank3           0       { }
Rank4           1       ...
Rank5           1       {Rank0, Rank2, Rank5}

There are 6 worker nodes, with serial numbers Rank0, Rank1, Rank2, Rank3, Rank4 and Rank5. A state flag of 1 indicates that a worker node is busy and a flag of 0 that it is idle: the worker nodes Rank0, Rank1, Rank2, Rank4 and Rank5 are busy, while Rank3 is idle. The grouping information of Rank0 is {Rank0, Rank2, Rank5}, which indicates that the worker nodes Rank0, Rank2 and Rank5 belong to the same group; the grouping information of Rank3 is { }, i.e. the grouping information of Rank3 is empty.
In step S23, after the master node sends the grouping information, the worker node first judges whether the received grouping information is empty. If it is not, the worker node establishes connections with the other worker nodes belonging to the same group; for example, since the grouping information of Rank0 in the table is {Rank0, Rank2, Rank5}, the worker nodes Rank0, Rank2 and Rank5 establish connections with one another.
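Table 1 can be modelled as a dictionary keyed by serial number (the field names are illustrative); the connection step then links each worker to the other members of its group:

```python
# Worker node state table: state 1 = busy, 0 = idle; an empty group = ungrouped.
state_table = {
    "Rank0": {"state": 1, "group": ["Rank0", "Rank2", "Rank5"]},
    "Rank2": {"state": 1, "group": ["Rank0", "Rank2", "Rank5"]},
    "Rank3": {"state": 0, "group": []},
    "Rank5": {"state": 1, "group": ["Rank0", "Rank2", "Rank5"]},
}

def peers(rank):
    """The worker nodes that rank must establish connections with."""
    return [r for r in state_table[rank]["group"] if r != rank]
```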
The mobile terminals in this embodiment are ARM-based mobile phones; the master node communicates with the worker nodes through a socket server, and the PyTorch deep learning framework is installed on each worker node. ARM-based phones may produce abnormal values (NaN, not-a-number) during distributed training. A NaN is contagious: if a NaN appears on one worker node, that node will infect normal worker nodes during distributed training, so that NaN appears on them as well; eventually all worker nodes in the whole distributed training run are infected and training fails. This embodiment therefore proposes an epidemic-prevention mechanism, whose basic principle is described in step S24 and which is detailed below.
The first step is self-detection: each worker node first checks itself, i.e. judges whether it has become abnormal. In this embodiment a health flag is added to each worker node. The flag is a Boolean variable taking the value 0 or 1, where 0 indicates that the worker node is abnormal and 1 that it is not; when an abnormality occurs on a worker node, its flag is set to 0. In the self-detection step it therefore suffices to check whether the flag of the worker node is 0. Because connections have been established between the worker nodes of the same group, each worker node sends its detection result to the other worker nodes in the group once its own check is finished. In this embodiment an anomaly-detection function of the PyTorch deep learning framework (likely torch.autograd's detect_anomaly) is used to trace whether an abnormal worker node exists.
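The health flag can be sketched with plain Python floats instead of tensors; with PyTorch one would test the parameters with torch.isnan instead:

```python
import math

def self_check(params):
    """Self-detection: return the health flag, 0 if any parameter is NaN, else 1."""
    return 0 if any(math.isnan(p) for p in params) else 1

healthy_flag = self_check([0.1, -0.5, 2.0])      # no NaN -> flag stays 1
infected_flag = self_check([0.1, float("nan")])  # NaN present -> flag drops to 0
```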
The second step is self-isolation: when an abnormal worker node exists in a group, only the model parameters of the local models of the non-abnormal worker nodes are considered when the model parameters are updated. That is, the abnormal worker node isolates itself and does not send the model parameters of its local model to the other worker nodes.
The third step is self-repair: if an abnormal worker node exists in a group, the non-abnormal worker nodes in the group each send the model parameters of their local models to the other non-abnormal worker nodes. After receiving them, each non-abnormal worker node averages its own model parameters with the received ones and takes the average as the updated model parameters. The non-abnormal worker nodes then send the updated model parameters to the self-isolated worker nodes, and all worker nodes in the group update their local models with the updated model parameters. The abnormal worker nodes thus adopt the updated model parameters computed by the non-abnormal nodes, completing the self-repair.
If no abnormal worker node exists in the group, each worker node in the group sends the model parameters of its local model to the other worker nodes. After receiving the model parameters of the others' local models, each worker node averages its own model parameters with the received ones and takes the average as the updated model parameters, and all worker nodes in the group update their local models with them.
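Both cases reduce to the same update rule: the parameter vectors of the non-abnormal workers are averaged element-wise. A minimal sketch:

```python
def average_params(param_lists):
    """Element-wise mean over the parameter vectors of the healthy workers."""
    n = len(param_lists)
    return [sum(vals) / n for vals in zip(*param_lists)]

# Two healthy workers exchange and average their local parameters.
updated = average_params([[1.0, 2.0], [3.0, 4.0]])   # -> [2.0, 3.0]
```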
This epidemic-prevention mechanism avoids NaN propagation in the mobile terminal cluster and ensures the stability of the whole distributed training system; moreover, abnormal worker nodes can be repaired quickly, saving training time.
In step S25, when the grouping information of the worker node is empty, the worker node has not been grouped in this round; in that case it takes the model parameters of the local model obtained in step S1 as the updated model parameters, and it may be grouped in the next round of training.
Specifically, in step S22, the master node generates grouping information upon receiving the grouping request, including:
step S221, judging whether the grouping information of the worker node is empty, if the grouping information of the worker node is empty, entering step S222, and if the grouping information of the worker node is not empty, entering step S223;
step S222, regrouping worker nodes and generating grouping information;
and S223, taking the existing grouping information corresponding to the worker node in the worker node state table as the grouping information of the worker node.
Because the training speed of each worker node may differ, worker nodes may send grouping requests to the master node at different times. For example, the worker node Rank0 sends its grouping request first; the master node groups the worker nodes and generates the grouping information {Rank0, Rank2, Rank5}. At this point the master node updates the worker node state table, setting the grouping information of Rank2 and Rank5 to {Rank0, Rank2, Rank5} as well. When the master node later receives the grouping request sent by Rank2 or Rank5, it must therefore first query the state table for that node's grouping information, so that Rank2 or Rank5 is not grouped a second time. Concretely, on receiving a grouping request the master node judges whether the requesting worker node's grouping information is empty. If it is empty, the worker node has not been grouped, and the master node regroups it and generates grouping information. If it is not empty, the worker node has already been grouped and has corresponding grouping information, so the master node only needs to query the worker node state table, obtain the existing grouping information and return it as the grouping information of that worker node.
In step S222, the step of regrouping worker nodes and generating grouping information specifically includes:
step S2221, taking the worker nodes with empty grouping information in the worker node state table as a node set to be grouped;
step S2222, the times of the sent grouping requests of each worker node in the node set to be grouped are respectively obtained;
step S2223, calculating the difference value between the number of times of the sent grouping request of each worker node in the node set to be grouped and the number of times of the sent grouping request of the worker node, and adding the worker node of which the difference value is smaller than a preset threshold value into the grouping of the worker node to generate grouping information;
and step S2224, updating the worker node state table.
Specifically, the master node first queries the worker node state table and adds the worker nodes whose grouping information is empty to the set of nodes to be grouped. In this embodiment a counting flag is kept in each worker node, representing the number of grouping requests that node has sent. After obtaining the request counts of all worker nodes in the set, the master node computes the difference between each node's count and the requesting worker node's count, adds the worker nodes whose difference is smaller than a predetermined threshold to the requesting node's group, generates the grouping information, and updates the worker node state table, completing the grouping of the worker nodes. The predetermined threshold may be set according to actual needs and is not specifically limited here.
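The count-difference policy can be sketched as follows; the counts and the threshold value are illustrative:

```python
def regroup(requester, request_counts, ungrouped, threshold):
    """Group the requester with the ungrouped workers whose sent-request count
    differs from the requester's count by less than the threshold."""
    group = [requester]
    for node in ungrouped:
        if node == requester:
            continue
        if abs(request_counts[node] - request_counts[requester]) < threshold:
            group.append(node)
    return group

counts = {"Rank0": 5, "Rank2": 5, "Rank3": 1, "Rank5": 4}
group = regroup("Rank0", counts, ["Rank0", "Rank2", "Rank3", "Rank5"], threshold=2)
# Rank3 has fallen too far behind in its training rounds and stays ungrouped.
```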
After step S2, the distributed training method in this embodiment further includes:
step S4, the worker node sends a release request to the master node;
and step S5, the master node deletes the grouping information of the worker node and updates the worker node state table.
In this embodiment each worker node must be regrouped after completing a round of training. Therefore, after each round of training the worker node sends a release request to the master node, and on receiving it the master node deletes the grouping information of that worker node from the worker node state table, completing the update of the table.
In this embodiment, when the grouping information of a worker node is empty in step S25, the worker node also sends a release request to the master node, and the master node deletes the grouping information of the worker node and updates the worker node state table.
In step S3, the training stop condition is that the number of completed rounds reaches a preset number of rounds. In other embodiments the stop condition may instead be convergence of the local model; other training stop conditions known in the art may of course also be used, and no limitation is made here.
The distributed training method in this embodiment is applied to a mobile terminal cluster, avoiding the high price, high energy consumption, large volume and poor mobility of implementing distributed machine learning on existing large-scale servers. An epidemic-prevention mechanism is added, which prevents NaN propagation in the mobile terminal cluster, ensures the stability of the whole distributed training system and allows abnormal worker nodes to be repaired quickly, saving training time.
Example two
The difference between the distributed training method provided in this embodiment and the first embodiment is as follows: the specific implementation manner of regrouping the worker nodes and generating the grouping information is different, and other steps in this embodiment are the same as those in the first embodiment, which is not described herein again, and only the specific implementation manner of regrouping the worker nodes and generating the grouping information in this embodiment is described in detail.
In this embodiment, the regrouping the worker nodes and generating grouping information includes:
step S2221, taking the worker nodes with empty grouping information in the worker node state table as a node set to be grouped;
step S2222, judging whether a grouping request sent by a worker node in the set of nodes to be grouped is received within a predetermined time; if a grouping request from another worker node is received within that time, proceeding to step S2223;
step S2223, adding worker nodes which are to be grouped and send grouping requests in a node set into the groups of the worker nodes, and generating grouping information.
Specifically, the master node first queries the worker node state table and adds every worker node whose grouping information is empty to the set of nodes to be grouped. In this embodiment a timer is provided in the master node: when the master node receives a grouping request from a worker node, the timer starts and runs until it reaches a predetermined time. The master node then judges whether, within that predetermined time, it received grouping requests from other worker nodes in the set of nodes to be grouped; if so, the worker nodes in the set that sent grouping requests are added to the group of the requesting worker node, and grouping information is generated accordingly. The predetermined time may be set according to actual needs and is not particularly limited here.
In another implementation of this embodiment, the predetermined time may instead be a monitoring period of the master node: the master node determines which monitoring period the arrival time of a worker node's grouping request falls into, and all worker nodes in the set of nodes to be grouped whose grouping requests arrive within the same monitoring period are added to one group, generating the grouping information.
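The timer-based variant above can be sketched as follows (hypothetical names and data shapes; a sketch of the idea, not the patent's implementation): the grouping node collects every grouping request from the to-be-grouped set that arrives within a predetermined window of the first request, and forms one group from those requesters:

```python
def form_group(requests, window):
    """Group the grouping requests that arrive within `window` of the first.

    requests: list of (worker_id, arrival_time) pairs sorted by arrival_time,
    all from workers whose grouping information is currently empty.
    Returns (group, remaining): `group` is the new grouping; `remaining`
    holds the requests left over for the next timer window.
    """
    if not requests:
        return [], []
    start = requests[0][1]  # the timer starts at the first grouping request
    group = [w for w, t in requests if t - start <= window]
    remaining = [(w, t) for w, t in requests if t - start > window]
    return group, remaining

# w1 and w2 arrive within 2.0 time units of the first request; w3 waits.
group, rest = form_group([("w1", 0.0), ("w2", 1.5), ("w3", 5.0)], window=2.0)
```

The monitoring-period variant differs only in how the window boundaries are chosen (fixed periods rather than a timer started by the first request).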
EXAMPLE III
Referring to fig. 4, the present embodiment provides a terminal device, which includes a memory 100, a processor 200, and a computer program stored in the memory 100, wherein the processor 200 executes the computer program to implement the distributed training method according to the first embodiment and the second embodiment.
The Memory 100 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the distributed training method described in the first and second embodiments may be carried out by integrated hardware logic circuits in the processor 200 or by instructions in the form of software. The Processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 100 is used to store the computer program; after receiving an execution instruction, the processor 200 executes the computer program to implement the distributed training method according to the first and second embodiments.
This embodiment further provides a computer storage medium in which a computer program is stored; the processor 200 is configured to read and execute the computer program stored in the computer storage medium to implement the distributed training method according to the first and second embodiments.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer storage medium or transmitted from one computer storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired link (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless link (e.g., infrared, radio, microwave). The computer storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to embodiments of the present application. It is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of the present application.
Claims (10)
1. A distributed training method is characterized by being applied to a distributed training system, wherein the distributed training system comprises at least 3 mobile terminals, any one mobile terminal is used as a grouping node, and the rest mobile terminals are used as working nodes, and the distributed training method comprises the following steps:
the working node trains the local model by using local training data according to the initial model parameters to obtain model parameters of the local model;
the working node obtains updated model parameters according to the model parameters of the local model and the grouping condition;
and taking the updated model parameters obtained in the current round of training as the initial model parameters for the next round, and cyclically executing multiple rounds of training until the training stop condition is met.
2. The distributed training method of claim 1, wherein the obtaining, by the working node, updated model parameters according to the model parameters of the local model and the grouping condition comprises:
the working node sends a grouping request to the grouping node;
the grouping node generates grouping information after receiving the grouping request and sends the grouping information to the working node;
the working node judges whether the grouping information is empty;
if the grouping information is not empty, the working node judges whether an abnormal working node exists in the group where the working node is located; if not, the average of the model parameters of the local models of all working nodes in the group is used as the updated model parameters; if so, the average of the model parameters of the local models of the non-abnormal working nodes in the group is used as the updated model parameters;
and if the grouping information is empty, taking the model parameter of the local model as the updated model parameter.
3. The distributed training method of claim 2, wherein the grouping node generates grouping information upon receiving the grouping request, comprising:
judging whether the grouping information of the working node is empty or not;
if the grouping information of the working nodes is empty, regrouping the working nodes and generating grouping information;
and if the grouping information of the working node is not empty, taking the existing grouping information corresponding to the working node in the working node state table as the grouping information of the working node.
4. The distributed training method of claim 3, wherein regrouping the worker nodes and generating grouping information comprises:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
respectively acquiring, for each working node in the node set to be grouped, the number of grouping requests it has sent;
calculating the difference between the number of grouping requests sent by each working node in the node set to be grouped and the number of grouping requests sent by the requesting working node, and adding the working nodes whose difference is smaller than a preset threshold into the group of the working node to generate grouping information;
and updating the working node state table.
5. The distributed training method of claim 3, wherein regrouping the worker nodes and generating grouping information comprises:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
judging whether a grouping request sent by a working node in a node set to be grouped is received within a preset time;
and if a grouping request sent by the working nodes in the node set to be grouped is received within a preset time, adding the working nodes sending the grouping request in the node set to be grouped into groups of the working nodes to generate grouping information.
6. The distributed training method according to any one of claims 1 to 5, wherein after the working node obtains updated model parameters according to the model parameters of the local model and the grouping condition, the distributed training method further comprises:
the working node sends a release request to the grouping node;
and the grouping node deletes the grouping information of the working node and updates a working node state table.
7. The distributed training method according to claim 6, wherein the training stop condition is that the number of training times reaches a preset number of training times.
8. A distributed training system, characterized in that the distributed training system comprises at least 3 mobile terminals, any one mobile terminal serves as a grouping node and the remaining mobile terminals serve as working nodes, and the distributed training system trains a local model by the distributed training method according to any one of claims 1 to 7.
9. A terminal device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the distributed training method of any one of claims 1 to 7.
10. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the distributed training method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011399386.XA CN114650288B (en) | 2020-12-02 | 2020-12-02 | Distributed training method and system, terminal equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114650288A true CN114650288A (en) | 2022-06-21 |
CN114650288B CN114650288B (en) | 2024-03-08 |
Family
ID=81990908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011399386.XA Active CN114650288B (en) | 2020-12-02 | 2020-12-02 | Distributed training method and system, terminal equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114650288B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115600512A (en) * | 2022-12-01 | 2023-01-13 | 深圳先进技术研究院(Cn) | Tool life prediction method based on distributed learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170220949A1 (en) * | 2016-01-29 | 2017-08-03 | Yahoo! Inc. | Method and system for distributed deep machine learning |
CN109558909A (en) * | 2018-12-05 | 2019-04-02 | 清华大学深圳研究生院 | Combined depth learning method based on data distribution |
CN110263928A (en) * | 2019-06-18 | 2019-09-20 | 中国科学技术大学 | Protect the mobile device-based distributed deep learning training method of data-privacy |
CN110503207A (en) * | 2019-08-28 | 2019-11-26 | 深圳前海微众银行股份有限公司 | Federation's study credit management method, device, equipment and readable storage medium storing program for executing |
CN111158902A (en) * | 2019-12-09 | 2020-05-15 | 广东工业大学 | Mobile edge distributed machine learning system and method |
CN111178503A (en) * | 2019-12-16 | 2020-05-19 | 北京邮电大学 | Mobile terminal-oriented decentralized target detection model training method and system |
Also Published As
Publication number | Publication date |
---|---|
CN114650288B (en) | 2024-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112468372B (en) | Method and device for detecting equipment state in power line communication network | |
CN111538763B (en) | Method for determining master node in cluster, electronic equipment and storage medium | |
CN110070445B (en) | Transaction processing method and device based on blockchain system | |
CN107870948A (en) | Method for scheduling task and device | |
CN111010318A (en) | Method and system for discovering loss of connection of terminal equipment of Internet of things and equipment shadow server | |
CN111431802A (en) | Block chain node communication optimization system and method | |
CN114650288A (en) | Distributed training method and system, terminal device and computer readable storage medium | |
CN110995851A (en) | Message processing method, device, storage medium and equipment | |
CN114285795A (en) | State control method, device, equipment and storage medium of virtual equipment | |
CN107786390B (en) | Method and device for correcting networking nodes | |
CN113132160B (en) | Method and system for detecting network sub-health state of client node | |
CN107360012A (en) | A kind of Link State processing method and apparatus for network node | |
CN110224872B (en) | Communication method, device and storage medium | |
CN112532467B (en) | Method, device and system for realizing fault detection | |
CN105045224A (en) | Data transmission method and device | |
CN106793093B (en) | Service processing method and device | |
CN111092956A (en) | Resource synchronization method, device, storage medium and equipment | |
US10951732B2 (en) | Service processing method and device | |
CN111757371A (en) | Statistical method of transmission delay, server and storage medium | |
CN107210996B (en) | Service chain management method and device | |
CN112804115B (en) | Method, device and equipment for detecting abnormity of virtual network function | |
JP2018137637A (en) | Communication network and communication terminal | |
CN112073987A (en) | State monitoring method, device, equipment and storage medium | |
CN113396573A (en) | Migration of computing services | |
CN115002020B (en) | OSPF-based data processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||