CN114650288A - Distributed training method and system, terminal device and computer readable storage medium - Google Patents

Distributed training method and system, terminal device and computer readable storage medium

Info

Publication number
CN114650288A
Authority
CN
China
Prior art keywords
node
grouping
working
model parameters
working node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011399386.XA
Other languages
Chinese (zh)
Other versions
CN114650288B (en)
Inventor
李惠娟
曾思棋
喻之斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011399386.XA priority Critical patent/CN114650288B/en
Publication of CN114650288A publication Critical patent/CN114650288A/en
Application granted granted Critical
Publication of CN114650288B publication Critical patent/CN114650288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a distributed training method and system, a terminal device, and a computer-readable storage medium. The distributed training method comprises the following steps: a working node trains a local model with local training data according to initial model parameters to obtain model parameters of the local model; the working node obtains updated model parameters according to the model parameters of the local model and the grouping condition; the updated model parameters obtained in the current round of training are taken as the initial model parameters of the next round, and multiple rounds of training are executed cyclically until the training stop condition is met. The distributed training method is applied to a distributed training system consisting of at least 3 mobile terminals, thereby avoiding the high price, high energy consumption, large volume, and poor mobility that come with implementing distributed machine learning on existing large-scale servers.

Description

Distributed training method and system, terminal device and computer readable storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a distributed training method and system, a terminal device, and a computer-readable storage medium.
Background
Machine learning has proven to be an effective method for automatically extracting information from data and has achieved great success in fields such as image recognition, speech processing, machine translation, gaming, and healthcare. Obtaining a useful machine learning model requires training on large data sets for a long time, and as data sizes grow, a single machine can no longer complete the learning task on its own. The machine learning workload therefore needs to be distributed among multiple machines to realize distributed training and improve learning speed, thereby promoting wider application of machine learning. Existing distributed machine learning is implemented on large-scale servers. Such equipment offers large memory, good performance, and stability, but large-scale servers are generally expensive, energy-hungry, bulky, and poorly mobile, so the cost of implementing machine learning on them is very high.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a distributed training method and system, a terminal device, and a computer-readable storage medium, which are applied to a distributed training system comprising at least 3 mobile terminals, thereby reducing the cost of machine learning.
The specific technical solution provided by the invention is as follows: a distributed training method applied to a distributed training system, wherein the distributed training system comprises at least 3 mobile terminals, any one mobile terminal serves as a grouping node, and the remaining mobile terminals serve as working nodes, the distributed training method comprising the following steps:
the working node trains the local model by using local training data according to the initial model parameters to obtain model parameters of the local model;
the working node obtains updated model parameters according to the model parameters of the local model and grouping conditions;
and taking the updated model parameters obtained in the current round of training as the initial model parameters of the next round of training, and cyclically executing multiple rounds of training until the training stop condition is met.
Further, the obtaining, by the working node, the updated model parameter according to the model parameter of the local model and the grouping condition includes:
the working node sends a grouping request to the grouping node;
the grouping node generates grouping information after receiving the grouping request and sends the grouping information to the working node;
the working node judges whether the grouping information is empty or not;
if the grouping information is not empty, the working node judges whether an abnormal working node exists in the group where the working node is located; if not, the average value of the model parameters of the local models of all working nodes in the group is taken as the updated model parameters; if an abnormal working node exists, the average value of the model parameters of the local models of the non-abnormal working nodes in the group is taken as the updated model parameters;
and if the grouping information is empty, taking the model parameter of the local model as the updated model parameter.
Further, the grouping node generates grouping information after receiving the grouping request, including:
judging whether the grouping information of the working node is empty or not;
if the grouping information of the working nodes is empty, regrouping the working nodes and generating grouping information;
and if the grouping information of the working node is not empty, taking the existing grouping information corresponding to the working node in the working node state table as the grouping information of the working node.
Further, regrouping the working nodes and generating grouping information includes:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
respectively acquiring the number of times of the sent grouping request of each working node in the node set to be grouped;
calculating the difference between the number of grouping requests sent by each working node in the node set to be grouped and the number of grouping requests sent by the requesting working node, and adding the working nodes whose difference is smaller than a preset threshold to the group of the requesting working node to generate grouping information;
and updating the working node state table.
Further, regrouping the working nodes and generating grouping information includes:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
judging whether a grouping request sent by a working node in a node set to be grouped is received within a preset time;
and if a grouping request sent by the working nodes in the node set to be grouped is received within a preset time, adding the working nodes sending the grouping request in the node set to be grouped into groups of the working nodes to generate grouping information.
Further, after the working node obtains the updated model parameters according to the model parameters of the local model and the grouping condition, the distributed training method further includes:
the working node sends a release request to the grouping node;
and the grouping node deletes the grouping information of the working node and updates the working node state table.
Further, the training stop condition is that the number of training rounds reaches a preset number of training rounds.
To overcome the defects of the prior art, the invention also provides a distributed training system, wherein the distributed training system comprises at least 3 mobile terminals, any one mobile terminal serves as a grouping node, the remaining mobile terminals serve as working nodes, and the distributed training system trains local models by the distributed training method described above.
The invention also provides a terminal device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to realize the distributed training method.
The present invention also provides a computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the distributed training method as described above.
The distributed training method is applied to a distributed training system consisting of at least 3 mobile terminals, in which any one mobile terminal serves as a grouping node and the remaining mobile terminals serve as working nodes. Each working node trains a local model with local training data according to initial model parameters to obtain model parameters of the local model, and obtains updated model parameters according to the model parameters of the local model and the grouping condition; the updated model parameters obtained in the current round of training are taken as the initial model parameters of the next round, and multiple rounds of training are executed cyclically until the training stop condition is met. This avoids the high price, high energy consumption, large volume, and poor mobility that come with implementing distributed machine learning on existing large-scale servers.
Drawings
The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments of the present invention, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic diagram of a distributed training system of the present invention;
FIG. 2 is a flowchart illustrating a distributed training method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S2 according to the first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a terminal device according to the third embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to explain the principles of the invention and its practical application to thereby enable others skilled in the art to understand the invention for various embodiments and with various modifications as are suited to the particular use contemplated. In the drawings, like reference numerals will be used to refer to like elements throughout.
Example one
Referring to FIG. 1 and FIG. 2, the distributed training system provided in this embodiment includes at least 3 mobile terminals. Any one of the at least 3 mobile terminals serves as the grouping node (master node), and the remaining mobile terminals serve as working nodes (worker nodes), together forming a mobile terminal cluster. Each worker node independently trains a local model using local data, and the master node groups the worker nodes and sends grouping information to them. The master node and the worker nodes train the local models by the following distributed training method:
s1, training a local model by the worker node according to the initial model parameters by using local training data to obtain model parameters of the local model;
step S2, the worker node obtains updated model parameters according to the model parameters of the local model and the grouping condition;
and step S3, taking the updated model parameters obtained in the training of the round as the initial model parameters in the next training round, and circularly executing multiple rounds of training until the training is finished when the training stop conditions are met.
In step S1, before training the local model, each worker node first acquires its local training data, which serves as the training data for the subsequent rounds of training. The training data of the mobile terminal cluster is divided equally into a number of data blocks equal to the number of worker nodes, and each worker node acquires the data block corresponding to its sequence number. For example, if the training data is {a1, a2, …, a12} and the distributed training system includes 6 worker nodes with sequence numbers Rank0 to Rank5, the training data is divided into 6 data blocks {a1, a2}, {a3, a4}, …, {a11, a12}; the worker node with sequence number Rank0 acquires the data block {a1, a2}, the worker node with sequence number Rank1 acquires the data block {a3, a4}, and so on. In this way, each worker node finally obtains its own local training data as the training data for the subsequent rounds of training.
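As an illustrative sketch only (not part of the original disclosure), the block assignment described above can be written as follows in Python; the function name partition_training_data and the concrete sample counts are assumptions introduced here for clarity:

# Hypothetical sketch of the data-block assignment described above: each of the
# num_workers worker nodes takes the block whose index equals its rank.
def partition_training_data(training_data, num_workers, rank):
    block_size = len(training_data) // num_workers
    start = rank * block_size
    return training_data[start:start + block_size]

# Example: 12 samples split among 6 worker nodes; the node with rank 1 gets samples 2 and 3.
data = list(range(12))
local_data = partition_training_data(data, num_workers=6, rank=1)  # [2, 3]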
After obtaining its local training data, the worker node trains the local model with the local training data according to the initial model parameters. It should be noted that in the first round of training, the initial model parameters are the model parameters of the local model before it has been trained.
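A minimal PyTorch sketch of this local training step is given below, assuming a supervised task; the function name train_local_model, the SGD optimizer, the mean-squared-error loss, and the learning rate are assumptions for illustration and are not specified by the embodiment:

import torch
import torch.nn as nn

def train_local_model(model, initial_params, local_data, epochs=1, lr=0.01):
    # Load the round's initial model parameters, train on the local data,
    # and return the resulting model parameters (state_dict).
    model.load_state_dict(initial_params)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in local_data:          # local_data yields (input, target) tensor pairs
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model.state_dict()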
Referring to FIG. 3, step S2 specifically includes:
step S21, the worker node sends a grouping request to the master node;
step S22, after receiving the grouping request, the master node generates grouping information and sends the grouping information to the worker node;
step S23, the worker node judges whether the grouping information is empty; if the grouping information is not empty, step S24 is executed, and if the grouping information is empty, step S25 is executed;
step S24, judging whether an abnormal worker node exists in the group where the worker node is located; if no abnormal worker node exists, taking the average value of the model parameters of the local models of all worker nodes in the group as the updated model parameters; if an abnormal worker node exists, taking the average value of the model parameters of the local models of the non-abnormal worker nodes in the group as the updated model parameters;
and step S25, taking the model parameters of the local model as the updated model parameters.
The grouping request in this embodiment represents a communication instruction and may be the sequence number of the worker node, that is, the worker node sends its sequence number to the master node. This is only an example; in other embodiments the grouping request may also be a preset flag bit whose value of 1 represents a grouping request, and the example is not intended to limit this embodiment.
In step S22, the master node remains in a listening state before receiving a grouping request and monitors messages sent by the worker nodes through a listening port. When it receives a message sent by a worker node, the master node first determines whether the received message is a grouping request. If it is, the master node generates grouping information and sends it to the worker node; if it is not, the master node updates the state of the worker node to idle and updates the worker node state table.
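The embodiment only states that the master node listens for worker messages through a SocketServer; the following Python sketch shows one possible shape of that listening loop. The newline-delimited JSON message format, the field names, the port number, and the generate_grouping_info placeholder are all assumptions introduced here:

import json
import socketserver

# Assumed in-memory worker node state table: rank -> {"state": 0 or 1, "group": [...]}
STATE_TABLE = {}

def generate_grouping_info(rank):
    # Placeholder only; the regrouping strategies of the two embodiments are sketched later.
    entry = STATE_TABLE.setdefault(rank, {"state": 1, "group": []})
    return entry["group"]

class MasterHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One newline-terminated JSON message per connection, e.g. {"type": "group", "rank": 2}
        msg = json.loads(self.rfile.readline().decode())
        rank = msg["rank"]
        if msg["type"] == "group":                  # grouping request
            group = generate_grouping_info(rank)
            self.wfile.write((json.dumps(group) + "\n").encode())
        else:                                       # any other message: mark the node idle
            STATE_TABLE.setdefault(rank, {"state": 1, "group": []})["state"] = 0

# Example (not executed here):
#   server = socketserver.ThreadingTCPServer(("0.0.0.0", 9000), MasterHandler)
#   server.serve_forever()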
The worker node state table in this embodiment is maintained by the master node and may include the sequence number, state, and grouping information of each worker node. The following table gives an example of the worker node state table:
TABLE 1 worker node status Table
(Table 1 is provided as an image in the original publication; its contents are described in the following paragraph.)
In the table, the number of worker nodes is 6, with sequence numbers Rank0, Rank1, Rank2, Rank3, Rank4, and Rank5. A state flag bit of 1 indicates that a worker node is busy, and a state flag bit of 0 indicates that it is idle. The worker nodes with sequence numbers Rank0, Rank1, Rank2, Rank4, and Rank5 are busy, while the worker node with sequence number Rank3 is idle. The grouping information of the worker node with sequence number Rank0 is {Rank0, Rank2, Rank5}, indicating that the worker nodes with sequence numbers Rank0, Rank2, and Rank5 belong to the same group. The grouping information of the worker node with sequence number Rank3 is { }, that is, the grouping information of the worker node with sequence number Rank3 is empty.
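For illustration, the state table of Table 1 could be held in memory as the following Python dictionary; the field names are assumptions, and the grouping information of Rank1 and Rank4 is not given in the excerpt above and is shown empty here only as an assumption:

# Hypothetical in-memory form of the worker node state table of Table 1.
# "state" 1 = busy, 0 = idle; "group" lists the ranks in the node's current group.
worker_state_table = {
    "Rank0": {"state": 1, "group": ["Rank0", "Rank2", "Rank5"]},
    "Rank1": {"state": 1, "group": []},
    "Rank2": {"state": 1, "group": ["Rank0", "Rank2", "Rank5"]},
    "Rank3": {"state": 0, "group": []},
    "Rank4": {"state": 1, "group": []},
    "Rank5": {"state": 1, "group": ["Rank0", "Rank2", "Rank5"]},
}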
In step S23, after the master node sends the grouping information to the worker node, the worker node first determines whether the received grouping information is empty. If it is not empty, the worker node establishes connections with the other worker nodes belonging to the same group. For example, if the grouping information of the worker node with sequence number Rank0 in the table is {Rank0, Rank2, Rank5}, the worker nodes with sequence numbers Rank0, Rank2, and Rank5 establish connections with one another.
The mobile terminals in this embodiment are mobile phones based on the ARM architecture; the master node communicates with the worker nodes through a SocketServer, and the PyTorch deep learning framework is installed on each worker node. When distributed training is performed on ARM-based mobile phones, anomalies (NaN values) may occur. Because NaN is infectious, a worker node on which NaN occurs will infect normal worker nodes during distributed training, so that NaN also appears on the normal worker nodes; eventually, all worker nodes in the entire distributed training may be infected and the training fails. Therefore, this embodiment proposes an epidemic prevention mechanism, the basic principle of which is described in step S24; the epidemic prevention mechanism is described in detail below.
The first step is self-detection: each worker node first performs self-detection, that is, it judges whether it is abnormal. In this embodiment, a flag bit indicating whether the worker node is healthy is added to each worker node. The flag bit is a Boolean variable whose value is 0 or 1: a value of 0 indicates that the worker node is abnormal, and a value of 1 indicates that it is not. When an abnormality occurs, the flag bit is set to 0, so in the self-detection step it is only necessary to check whether the flag bit of the worker node is 0 to determine whether it is abnormal. Because connections have been established between the worker nodes belonging to the same group, after each worker node finishes its detection it sends its detection result to the other worker nodes in the group. In this embodiment, the detect_anomaly() function of the PyTorch deep learning framework is used to track whether any worker node is abnormal.
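A minimal sketch of the self-detection step is shown below, assuming the health flag is derived by checking the local model's parameters for NaN values; the function name self_detect is an assumption, while torch.isnan and the torch.autograd.detect_anomaly context manager are standard PyTorch facilities:

import torch

def self_detect(model):
    # Return 1 (healthy) or 0 (abnormal) depending on whether any parameter
    # of the local model contains NaN values; this value can serve as the flag bit.
    for p in model.parameters():
        if torch.isnan(p).any():
            return 0
    return 1

# Anomalies arising during the backward pass can additionally be tracked with
# PyTorch's built-in anomaly detection, e.g.:
#   with torch.autograd.detect_anomaly():
#       loss.backward()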
The second step is self-isolation: when an abnormal worker node exists in a group, only the model parameters of the local models of the non-abnormal worker nodes are considered when the model parameters are updated. That is, the abnormal worker node isolates itself and does not send the model parameters of its local model to the other worker nodes.
The third step is self-repair: if an abnormal worker node exists in a group, the non-abnormal worker nodes in the group each send the model parameters of their local models to the other non-abnormal worker nodes. After receiving the model parameters of the local models of the other non-abnormal worker nodes, each non-abnormal worker node averages its own model parameters with the received ones and takes the average value as the updated model parameters. The non-abnormal worker nodes then send the updated model parameters to the self-isolated worker node, all worker nodes in the group update their local models with the updated model parameters, and the abnormal worker node adopts the updated model parameters calculated by the non-abnormal worker nodes, thereby completing self-repair.
If no abnormal worker node exists in the group, each worker node in the group sends the model parameters of its local model to the other worker nodes. After receiving the model parameters of the local models of the other worker nodes, each worker node averages its own model parameters with the received ones and takes the average value as the updated model parameters, and all worker nodes in the group update their local models with the updated model parameters.
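The in-group averaging described above can be sketched as follows; the function name average_parameters is an assumption, and the inputs are the state_dicts of this node and of the other (non-abnormal) nodes in its group:

import torch

def average_parameters(own_params, received_params_list):
    # Average the local-model parameters (state_dicts) element-wise across
    # this worker node and the other worker nodes whose parameters were received.
    all_params = [own_params] + list(received_params_list)
    averaged = {}
    for key in own_params:
        stacked = torch.stack([p[key].float() for p in all_params])
        averaged[key] = stacked.mean(dim=0)
    return averaged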
Adopting this epidemic prevention mechanism prevents NaN values from spreading within the mobile terminal cluster, ensures the stability of the whole distributed training system, and allows abnormal worker nodes to be repaired quickly, thereby saving training time.
In step S25, when the grouping information of the worker node is empty, it indicates that the worker node has not been grouped. In this case the worker node is left to be grouped in the next round of training, and the model parameters of the local model obtained in step S1 are used as the updated model parameters.
Specifically, in step S22, the master node generates grouping information upon receiving the grouping request, including:
step S221, judging whether the grouping information of the worker node is empty, if the grouping information of the worker node is empty, entering step S222, and if the grouping information of the worker node is not empty, entering step S223;
step S222, regrouping worker nodes and generating grouping information;
and S223, taking the existing grouping information corresponding to the worker node in the worker node state table as the grouping information of the worker node.
Because the training speed of each worker node may differ, the worker nodes may send grouping requests to the master node at different times. For example, if the worker node with sequence number Rank0 sends a grouping request to the master node first, the master node groups the worker nodes and the generated grouping information is {Rank0, Rank2, Rank5}. At this time the master node updates the worker node state table and updates the grouping information of the worker nodes with sequence numbers Rank2 and Rank5 to {Rank0, Rank2, Rank5}. Therefore, when the master node later receives a grouping request sent by the worker node with sequence number Rank2 or Rank5, it must first query the grouping information of that worker node in the worker node state table, so as to avoid regrouping it. In other words, upon receiving a grouping request, the master node first determines whether the grouping information of the requesting worker node is empty. If the grouping information of the worker node is empty, the worker node has not been grouped, and the master node regroups the worker node and generates grouping information; if the grouping information of the worker node is not empty, the worker node has already been grouped and has corresponding grouping information, so the master node only needs to query the worker node state table, obtain the existing grouping information of the worker node, and use it as the grouping information of the worker node.
In step S222, the step of regrouping worker nodes and generating grouping information specifically includes:
step S2221, taking the worker nodes with empty grouping information in the worker node state table as a node set to be grouped;
step S2222, the times of the sent grouping requests of each worker node in the node set to be grouped are respectively obtained;
step S2223, calculating the difference between the number of grouping requests sent by each worker node in the node set to be grouped and the number of grouping requests sent by the requesting worker node, and adding the worker nodes whose difference is smaller than a preset threshold to the group of the requesting worker node to generate grouping information;
and step S2224, updating the worker node state table.
The master node first queries the worker node state table and adds the worker nodes whose grouping information is empty to the node set to be grouped. In this embodiment a counting flag bit is provided in each worker node to record the number of grouping requests it has sent. After obtaining the number of grouping requests sent by each worker node in the node set to be grouped, the master node calculates the difference between that number and the number of grouping requests sent by the requesting worker node, adds the worker nodes whose difference is smaller than a preset threshold to the group of the requesting worker node, generates the grouping information, and updates the worker node state table, thereby completing the grouping of the worker nodes. The preset threshold may be set according to actual needs and is not specifically limited here.
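A sketch of this count-based grouping, combined with the reuse of existing grouping information from steps S221 to S223, is given below; the function name, the dictionary field names, and the default threshold value of 1 are assumptions introduced here:

def generate_grouping_info(state_table, rank, threshold=1):
    # Reuse existing grouping information if it is not empty; otherwise regroup
    # by the difference in grouping-request counts (embodiment one).
    entry = state_table.setdefault(rank, {"state": 1, "group": [], "requests": 0})
    entry["requests"] += 1                 # counting flag bit: grouping requests sent so far
    if entry["group"]:                     # existing grouping information is reused
        return entry["group"]
    # node set to be grouped: worker nodes whose grouping information is empty
    candidates = [r for r, e in state_table.items() if not e["group"]]
    my_count = entry["requests"]
    group = [r for r in candidates
             if abs(state_table[r].get("requests", 0) - my_count) < threshold]
    for r in group:                        # update the worker node state table
        state_table[r]["group"] = group
    return group

# Example: Rank0 and Rank2 are ungrouped and their request counts differ by less
# than the threshold, so a request from Rank0 groups them together.
table = {"Rank0": {"state": 1, "group": [], "requests": 0},
         "Rank2": {"state": 1, "group": [], "requests": 1}}
print(generate_grouping_info(table, "Rank0"))   # ['Rank0', 'Rank2']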
After step S2, the distributed training method in this embodiment further includes:
step S4, the worker node sends a release request to the master node;
and step S5, the master node deletes the grouping information of the worker node and updates the worker node state table.
In this embodiment, each worker node needs to be regrouped after completing one round of training. Therefore, after each round of training is completed, the worker node sends a release request to the master node, and upon receiving the release request the master node deletes the grouping information of the worker node in the worker node state table, thereby completing the update of the worker node state table.
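A minimal master-side sketch of this release handling, under the same assumed state-table layout as above, is:

def handle_release_request(state_table, rank):
    # On a release request, delete the worker node's grouping information so the
    # node can be regrouped in the next round, thereby updating the state table.
    entry = state_table.get(rank)
    if entry is not None:
        entry["group"] = []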
In step S25, in this embodiment, when the packet information of the worker node is empty, the worker node also sends a release request to the master node, and the master node deletes the packet information of the worker node and updates the worker node status table.
In step S3, the training stop condition is that the number of training rounds reaches a preset number of training rounds. In other embodiments, the training stop condition may be set to convergence of the local model, and of course other training stop conditions known in the art may also be used, which is not limited here.
The distributed training method in this embodiment is applied to a mobile terminal cluster, which avoids the high price, high energy consumption, large volume, and poor mobility that come with implementing distributed machine learning on existing large-scale servers. In addition, an epidemic prevention mechanism is added when the distributed training method is applied to the mobile terminal cluster, which prevents NaN values from spreading within the cluster, ensures the stability of the whole distributed training system, and allows abnormal worker nodes to be repaired quickly, thereby saving training time.
Example two
The distributed training method provided in this embodiment differs from the first embodiment in the specific implementation of regrouping the worker nodes and generating grouping information. The other steps of this embodiment are the same as those of the first embodiment and are not repeated here; only the specific implementation of regrouping the worker nodes and generating grouping information in this embodiment is described in detail.
In this embodiment, the regrouping the worker nodes and generating grouping information includes:
step S2221, taking the worker nodes with empty grouping information in the worker node state table as a node set to be grouped;
step S2222, judging whether grouping requests sent by worker nodes in the node set to be grouped are received within a predetermined time period; if a grouping request sent by another worker node is received within the predetermined time period, entering step S2223;
step S2223, adding the worker nodes in the node set to be grouped that sent grouping requests to the group of the requesting worker node, and generating grouping information.
Specifically, the master node first queries the worker node state table and adds the worker nodes whose grouping information is empty to the node set to be grouped. In this embodiment a timer is provided in the master node. When the master node receives a grouping request from a worker node, the timer starts counting until it reaches the predetermined time period. The master node judges whether grouping requests sent by worker nodes in the node set to be grouped are received within the predetermined time period; if so, the worker nodes in the node set to be grouped that sent grouping requests are added to the group of the requesting worker node, and the grouping information is generated. The predetermined time period may be set according to actual needs and is not specifically limited here.
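A sketch of this timer-based grouping is given below, assuming that a separate listening thread collects the ranks of grouping requests arriving during the window into pending_requests; the function name and the default window of 2 seconds are assumptions:

import time

def regroup_by_time_window(state_table, rank, pending_requests, window_seconds=2.0):
    # After the first grouping request arrives, wait for the predetermined time;
    # every ungrouped worker node whose grouping request was collected into
    # pending_requests during that window joins the same group (embodiment two).
    time.sleep(window_seconds)             # timer started when the first request is received
    ungrouped = {r for r, e in state_table.items() if not e["group"]}
    group = sorted({rank} | (set(pending_requests) & ungrouped))
    for r in group:                        # update the worker node state table
        state_table.setdefault(r, {"state": 1, "group": []})["group"] = group
    return group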
In another implementation of this embodiment, the predetermined time period may also be a monitoring period of the master node. The master node determines the monitoring period in which a grouping request sent by a worker node falls, and all worker nodes in the node set to be grouped that send grouping requests within that monitoring period are added to the group of the requesting worker node to generate grouping information.
EXAMPLE III
Referring to FIG. 4, the present embodiment provides a terminal device, which includes a memory 100, a processor 200, and a computer program stored in the memory 100, wherein the processor 200 executes the computer program to implement the distributed training method according to the first embodiment and the second embodiment.
The Memory 100 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the distributed training method described in the first and second embodiments may be implemented by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may also be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, or may be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 100 is used for storing a computer program, which is executed by the processor 200 after receiving the execution instruction to implement the distributed training method according to the first embodiment and the second embodiment.
This embodiment further provides a computer storage medium in which a computer program is stored; the processor 200 reads and executes the computer program stored in the computer storage medium 201 to implement the distributed training method according to the first embodiment and the second embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer storage medium or transmitted from one computer storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer storage media may be any available media that can be accessed by a computer or a data storage device, such as a server, data center, etc., that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims (10)

1. A distributed training method is characterized by being applied to a distributed training system, wherein the distributed training system comprises at least 3 mobile terminals, any one mobile terminal is used as a grouping node, and the rest mobile terminals are used as working nodes, and the distributed training method comprises the following steps:
the working node trains the local model by using local training data according to the initial model parameters to obtain model parameters of the local model;
the working node obtains updated model parameters according to the model parameters of the local model and the grouping condition;
and taking the updated model parameters obtained by the training of the current round as initial model parameters in the next round of training, and executing multiple rounds of training in a circulating way until the training stopping conditions are met.
2. The distributed training method of claim 1, wherein the obtaining, by the working node, updated model parameters according to the model parameters of the local model and the grouping condition comprises:
the working node sends a grouping request to the grouping node;
the grouping node generates grouping information after receiving the grouping request and sends the grouping information to the working node;
the working node judges whether the grouping information is empty;
if the grouping information is not empty, the working node judges whether an abnormal working node exists in the group where the working node is located; if not, the average value of the model parameters of the local models of all working nodes in the group is taken as the updated model parameters; if an abnormal working node exists, the average value of the model parameters of the local models of the non-abnormal working nodes in the group is taken as the updated model parameters;
and if the grouping information is empty, taking the model parameter of the local model as the updated model parameter.
3. The distributed training method of claim 2, wherein the grouping node generates grouping information upon receiving the grouping request, comprising:
judging whether the grouping information of the working node is empty or not;
if the grouping information of the working nodes is empty, regrouping the working nodes and generating grouping information;
and if the grouping information of the working node is not empty, taking the existing grouping information corresponding to the working node in the working node state table as the grouping information of the working node.
4. The distributed training method of claim 3, wherein regrouping the worker nodes and generating grouping information comprises:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
respectively acquiring the number of times of the sent grouping request of each working node in the node set to be grouped;
calculating the difference value between the number of times of the grouping request sent by each working node in the node set to be grouped and the number of times of the grouping request sent by the working node, and adding the working nodes with the difference values smaller than a preset threshold value into the grouping of the working nodes to generate grouping information;
and updating the working node state table.
5. The distributed training method of claim 3, wherein regrouping the worker nodes and generating grouping information comprises:
taking the working node with empty grouping information in the working node state table as a node set to be grouped;
judging whether a grouping request sent by a working node in a node set to be grouped is received within a preset time;
and if a grouping request sent by the working nodes in the node set to be grouped is received within a preset time, adding the working nodes sending the grouping request in the node set to be grouped into groups of the working nodes to generate grouping information.
6. The distributed training method according to any one of claims 1 to 5, wherein after the working node obtains updated model parameters according to the model parameters of the local model and the grouping condition, the distributed training method further comprises:
the working node sends a release request to the grouping node;
and the grouping node deletes the grouping information of the working node and updates a working node state table.
7. The distributed training method according to claim 6, wherein the training stop condition is that the number of training times reaches a preset number of training times.
8. A distributed training system, characterized in that the distributed system comprises at least 3 mobile terminals, any mobile terminal is used as a grouping node, the rest mobile terminals are used as working nodes, and the distributed system trains a local model by the distributed training method according to any one of claims 1 to 7.
9. A terminal device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the distributed training method of any one of claims 1 to 7.
10. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the distributed training method of any of claims 1-7.
CN202011399386.XA 2020-12-02 2020-12-02 Distributed training method and system, terminal equipment and computer readable storage medium Active CN114650288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011399386.XA CN114650288B (en) 2020-12-02 2020-12-02 Distributed training method and system, terminal equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011399386.XA CN114650288B (en) 2020-12-02 2020-12-02 Distributed training method and system, terminal equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114650288A true CN114650288A (en) 2022-06-21
CN114650288B CN114650288B (en) 2024-03-08

Family

ID=81990908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011399386.XA Active CN114650288B (en) 2020-12-02 2020-12-02 Distributed training method and system, terminal equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114650288B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600512A (en) * 2022-12-01 2023-01-13 深圳先进技术研究院(Cn) Tool life prediction method based on distributed learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220949A1 (en) * 2016-01-29 2017-08-03 Yahoo! Inc. Method and system for distributed deep machine learning
CN109558909A (en) * 2018-12-05 2019-04-02 清华大学深圳研究生院 Combined depth learning method based on data distribution
CN110263928A (en) * 2019-06-18 2019-09-20 中国科学技术大学 Protect the mobile device-based distributed deep learning training method of data-privacy
CN110503207A (en) * 2019-08-28 2019-11-26 深圳前海微众银行股份有限公司 Federation's study credit management method, device, equipment and readable storage medium storing program for executing
CN111158902A (en) * 2019-12-09 2020-05-15 广东工业大学 Mobile edge distributed machine learning system and method
CN111178503A (en) * 2019-12-16 2020-05-19 北京邮电大学 Mobile terminal-oriented decentralized target detection model training method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220949A1 (en) * 2016-01-29 2017-08-03 Yahoo! Inc. Method and system for distributed deep machine learning
CN109558909A (en) * 2018-12-05 2019-04-02 清华大学深圳研究生院 Combined depth learning method based on data distribution
CN110263928A (en) * 2019-06-18 2019-09-20 中国科学技术大学 Protect the mobile device-based distributed deep learning training method of data-privacy
CN110503207A (en) * 2019-08-28 2019-11-26 深圳前海微众银行股份有限公司 Federation's study credit management method, device, equipment and readable storage medium storing program for executing
CN111158902A (en) * 2019-12-09 2020-05-15 广东工业大学 Mobile edge distributed machine learning system and method
CN111178503A (en) * 2019-12-16 2020-05-19 北京邮电大学 Mobile terminal-oriented decentralized target detection model training method and system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600512A (en) * 2022-12-01 2023-01-13 深圳先进技术研究院(Cn) Tool life prediction method based on distributed learning

Also Published As

Publication number Publication date
CN114650288B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN112468372B (en) Method and device for detecting equipment state in power line communication network
CN111538763B (en) Method for determining master node in cluster, electronic equipment and storage medium
CN110070445B (en) Transaction processing method and device based on blockchain system
CN107870948A (en) Method for scheduling task and device
CN111010318A (en) Method and system for discovering loss of connection of terminal equipment of Internet of things and equipment shadow server
CN111431802A (en) Block chain node communication optimization system and method
CN114650288A (en) Distributed training method and system, terminal device and computer readable storage medium
CN110995851A (en) Message processing method, device, storage medium and equipment
CN114285795A (en) State control method, device, equipment and storage medium of virtual equipment
CN107786390B (en) Method and device for correcting networking nodes
CN113132160B (en) Method and system for detecting network sub-health state of client node
CN107360012A (en) A kind of Link State processing method and apparatus for network node
CN110224872B (en) Communication method, device and storage medium
CN112532467B (en) Method, device and system for realizing fault detection
CN105045224A (en) Data transmission method and device
CN106793093B (en) Service processing method and device
CN111092956A (en) Resource synchronization method, device, storage medium and equipment
US10951732B2 (en) Service processing method and device
CN111757371A (en) Statistical method of transmission delay, server and storage medium
CN107210996B (en) Service chain management method and device
CN112804115B (en) Method, device and equipment for detecting abnormity of virtual network function
JP2018137637A (en) Communication network and communication terminal
CN112073987A (en) State monitoring method, device, equipment and storage medium
CN113396573A (en) Migration of computing services
CN115002020B (en) OSPF-based data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant