WO2021147620A1 - Communication method, apparatus and system based on model training - Google Patents

Communication method, apparatus and system based on model training

Info

Publication number
WO2021147620A1
Authority
WO
WIPO (PCT)
Prior art keywords
model parameter
value
communication device
model
central server
Prior art date
Application number
PCT/CN2020/140434
Other languages
English (en)
French (fr)
Inventor
陈晨
王森
张弓
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20916011.8A (EP4087198A4)
Publication of WO2021147620A1
Priority to US17/814,073 (US20220360539A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/28 Flow control; Congestion control in relation to timing considerations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/098 Distributed learning, e.g. federated learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0813 Configuration setting characterised by the conditions triggering a change of settings
    • H04L41/082 Configuration setting characterised by the conditions triggering a change of settings the condition being updates or upgrades of network functionality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/02 Topology update or discovery
    • H04L45/08 Learning-based routing, e.g. using neural networks or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/42 Centralised routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/70 Routing based on monitoring results
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/80 Responding to QoS
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/75 Media network packet handling
    • H04L65/756 Media network packet handling adapting media to device capabilities

Definitions

  • This application relates to the field of communication technology, and in particular to a communication method, device and system based on model training.
  • Federated learning (FL) is an emerging foundational artificial intelligence technology. Its main idea is to build a machine learning model from the data sets held on multiple communication devices through cooperation between a central server and those devices. In the process of building the machine learning model, the communication devices do not need to share data with one another, which prevents data leakage.
  • In a typical round, the central server sends the value of each parameter of the machine learning model to each communication device. Each communication device, acting as a cooperative unit, trains the model locally, updates the value of each parameter of the machine learning model, and sends the gradient of each parameter value to the central server.
  • The central server then generates new model parameter values from the gradients of the parameter values. These steps are repeated until the machine learning model converges, which completes the model training process.
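  • As a rough illustration of this training loop, the following Python sketch shows one gradient-averaging round under simplifying assumptions; `local_gradient`, the learning rate, and the aggregation rule are illustrative and not taken from this application.

```python
import numpy as np

def federated_round(params, devices, lr=0.1):
    """One round: the server sends params, each device trains locally on its own
    user data and returns a gradient, and the server averages the gradients to
    produce new parameter values."""
    gradients = [dev.local_gradient(params) for dev in devices]  # local training step
    avg_grad = np.mean(gradients, axis=0)                        # server-side aggregation
    return params - lr * avg_grad                                # new model parameter values

def train(params, devices, rounds=100):
    for _ in range(rounds):          # repeated until the model converges in practice
        params = federated_round(params, devices)
    return params
```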
  • As models grow, the amount of model parameter data that needs to be transmitted between the communication devices and the central server also increases. Limited by Internet connection speed, bandwidth, and similar constraints, the model parameter transmission delay between a communication device and the central server becomes long, and the model parameter values are updated slowly.
  • the communication method, device, and system based on model training provided in the present application can effectively reduce the amount of parameter transmission data between a communication device and a server, and improve communication efficiency without losing model training accuracy.
  • In a first aspect, this application provides a communication method based on model training, which is applied to a system including a central server and communication devices.
  • the method may include: the communication device determines the amount of change in the value of the first model parameter. If the communication device determines that the first model parameter is stable according to the amount of change in the value of the first model parameter, the communication device stops sending the update amount of the value of the first model parameter to the central server within a preset time period. Wherein, the update amount of the value of the first model parameter is determined by the communication device according to user data in the process of model training. The communication device receives the value of the second model parameter sent by the central server. Wherein, in the preset time period, the value of the second model parameter does not include the value of the first model parameter.
  • The first model parameter is a model parameter whose stability is evaluated; its number is not limited, and it can be any one or more model parameters. If the first model parameter is determined to be stable, it is a model parameter that does not participate in the transmission of update amounts and values within the preset time period.
  • the second model parameter is a model parameter that participates in the update amount and value transmission between the central server and the communication device. The number of second model parameters is also not limited, and may be one or more. Optionally, all the first model parameters and all the second model parameters constitute all the model parameters.
  • The amount of change in the value of the first model parameter is used to determine whether the first model parameter is stable. If the first model parameter is stable, it has converged, and in the subsequent communication process the change in its value is mainly a small-amplitude oscillation that is of little significance to model training. Therefore, a preset time period can be set, and within the preset time period the transmission of the update amount of the stable first model parameter value is stopped. It is understandable that if the communication device does not send the update amount of the value of the first model parameter to the central server within the preset time period, the central server will not generate and send the updated value of the first model parameter to the communication device during that period, so the amount of bidirectional data between the communication device and the central server is reduced.
  • After the communication device receives the updated model parameter values sent by the central server, it can construct an updated local training model and train the model using local user data. During local training, the value of each model parameter is adjusted based on the user data, and the update amount of the value of the model parameter, that is, the gradient of the value of the model parameter, is obtained.
  • If the communication device determines, according to the change in the value of a model parameter, that the model parameter is stable, it can stop transmitting the update amount of that model parameter value within a preset time period. This effectively reduces the model parameter data transmitted between the communication device and the server and improves communication efficiency while ensuring that model training accuracy is not lost.
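  • A minimal device-side sketch of this suppression mechanism is shown below; the bookkeeping names (`paused_until`, the round counter) are hypothetical and only illustrate the idea of leaving paused parameters out of the transmitted update.

```python
def updates_to_send(grads, paused_until, current_round):
    """grads: {parameter index: update amount} computed from local user data.
    paused_until: {parameter index: round at which transmission resumes} for first
    model parameters currently judged stable. Paused entries are simply omitted."""
    return {i: g for i, g in grads.items()
            if current_round >= paused_until.get(i, 0)}
```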
  • the method further includes: after a preset period of time, the communication device sends an update amount of the value of the first model parameter to the central server and receives the value of the first model parameter sent by the central server.
  • the communication device can adjust the amount of data transmitted between the communication device and the central server according to the amount of change in the value of the model parameter.
  • the value of the model parameter and the update amount of the value are transmitted between the communication device and the central server through a message.
  • In one implementation, the communication device stopping sending the update amount of the first model parameter value to the central server within the preset time period includes: the communication device sends a message to the central server, and the message contains the update amount of the value of a third model parameter and the value of the corresponding flag bit. The update amount of the value of the third model parameter includes the update amount of the value of the second model parameter and the update amount of the value of the first model parameter. The value of the flag bit corresponding to the update amount of the value of the first model parameter is used to indicate that the communication device has not transmitted the update amount of the value of the first model parameter to the central server.
  • The identification bits included in the message may, for example, take the form of a bitmap, in which each bit corresponds to the data of one model parameter and the value of each identification bit can be set to 0 or 1 for distinction. After the central server receives the message sent by the communication device, it can determine, according to the value of each identification bit, whether the update amount of the corresponding model parameter value is carried in the message, and can then quickly read the transmitted update amounts of model parameter values. It can also quickly determine which update amounts of first model parameter values have stopped being transmitted, and exclude the updated values of those first model parameters from the data returned to the communication device.
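  • The following sketch illustrates one possible bitmap-style encoding of such a message; the field names and layout are assumptions made for illustration, not the message format defined by this application.

```python
def encode_message(updates, num_params):
    """updates: {parameter index: update amount} for parameters still being transmitted.
    Each bit of the bitmap marks whether the corresponding parameter's data is carried."""
    bitmap = [1 if i in updates else 0 for i in range(num_params)]
    payload = [updates[i] for i in range(num_params) if i in updates]
    return {"bitmap": bitmap, "payload": payload}

def decode_message(msg):
    """Server side: rebuild {parameter index: update amount}; a 0 bit means the device
    has stopped transmitting that (first) model parameter for now."""
    out, it = {}, iter(msg["payload"])
    for i, bit in enumerate(msg["bitmap"]):
        if bit:
            out[i] = next(it)
    return out
```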
  • In another implementation, the communication device sends a message to the central server that includes only the update amount of the second model parameter value and the value of the corresponding identification bit. The update amount of the value of the second model parameter does not include the update amount of the value of the first model parameter, and the value of the identification bit corresponding to the second model parameter is used to indicate that the communication device has transmitted the update amount of the second model parameter to the central server. Because the message contains only the update amount of the second model parameter value and its corresponding identification bit, the central server can directly determine the transmitted update amount of the second model parameter value from the received message and return only the updated value of the second model parameter. This further reduces the amount of transmitted data and speeds up the model training process.
  • In yet another implementation, the message does not directly include the update amount of the value of the first model parameter. The transmission order of the update amounts of the third model parameter values in the message is agreed in advance between the communication device and the central server. After the communication device determines that the first model parameter is stable, even though the message transmitted to the central server does not directly contain the update amount of the first model parameter value, the central server can still determine from the vacant positions which update amounts of first model parameter values have stopped being transmitted, and can also determine which of the transmitted data corresponds to which update amounts of second model parameter values.
  • the communication device determining the amount of change in the value of the first model parameter includes: the communication device obtains the amount of change in the value of the first model parameter according to historical information of the value of the first model parameter. For example, the historical value of the first model parameter or the change amount of the historical value is obtained, and then the change amount of the value of the first model parameter this time can be calculated by combining the value of the first model parameter this time.
  • the value history information includes: the effective change of the value of the first model parameter and the cumulative change of the value of the first model parameter.
  • The effective change amount and the cumulative change amount of the value of the first model parameter for this detection can be obtained from the effective change amounts and cumulative change amounts of all previously obtained values of the first model parameter, or from those obtained in the last few detections, or from those obtained in the previous detection. The change amount of the value of the first model parameter can then be obtained from the ratio of the effective change amount to the cumulative change amount of the value of the first model parameter.
  • For example, an exponential moving average (EMA) method may be used: from the effective change amount and cumulative change amount of the value of the first model parameter obtained in the previous detection, together with the values of the first model parameter obtained in the previous and current detections, the effective change amount and cumulative change amount of the value of the first model parameter for this detection are obtained, and from them the change amount of the value of the first model parameter is obtained.
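  • The sketch below illustrates one way such an EMA-based change amount could be computed, under the assumption that the effective change is an EMA of signed deltas and the cumulative change is an EMA of absolute deltas; the smoothing factor and threshold are illustrative values.

```python
class StabilityTracker:
    """Tracks one model parameter's value across detections."""

    def __init__(self, beta=0.9, threshold=0.05):
        self.beta, self.threshold = beta, threshold
        self.effective = 0.0    # EMA of signed changes: oscillations largely cancel out
        self.cumulative = 0.0   # EMA of absolute changes: total movement
        self.prev = None

    def update(self, value):
        if self.prev is not None:
            delta = value - self.prev
            self.effective = self.beta * self.effective + (1 - self.beta) * delta
            self.cumulative = self.beta * self.cumulative + (1 - self.beta) * abs(delta)
        self.prev = value
        if self.cumulative == 0.0:
            return 1.0          # not enough history yet: treat as still changing
        return abs(self.effective) / self.cumulative   # change amount of the value

    def is_stable(self, value):
        return self.update(value) < self.threshold     # compare against preset threshold
```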
  • The communication device determining that the first model parameter is stable according to the amount of change in the value of the first model parameter includes: if the amount of change in the value of the first model parameter is less than a preset threshold, determining that the first model parameter is stable. The preset threshold may be determined based on experimental data, expert experience values, and so on. If the change in the value of the first model parameter is less than the preset threshold, the first model parameter has converged and stabilized, and continued transmission would not make a significant contribution to model training, so the transmission of the update amount of the first model parameter value can be stopped.
  • the communication device determining the amount of change in the value of the first model parameter includes: the communication device determining the amount of change in the value of the first model parameter M times; where M is a positive integer greater than or equal to 2.
  • The communication device determining, according to the stability state of the first model parameter at the k-th detection, the preset condition that the preset time period satisfies includes: if the first model parameter is determined to be stable for the k-th time according to the change in the value of the first model parameter, the duration of the preset time period is a first duration, and the first duration is greater than a second duration. The second duration is either the duration of the preset time period for stopping sending the update amount of the first model parameter value to the central server when the first model parameter was determined to be stable for the (k-1)-th time, or the duration obtained when the first model parameter was determined to be unstable for the (k-1)-th time and used to adjust the preset time period corresponding to the next time the first model parameter is stable; k is a positive integer and k ≤ M.
  • The preset condition further includes: if the first model parameter is determined to be unstable for the k-th time, a third duration is obtained and used to adjust the preset time period corresponding to the next time the first model parameter is stable, and the third duration is less than a fourth duration. The fourth duration is either the duration of the preset time period for stopping sending the update amount of the first model parameter value to the central server when the first model parameter was determined to be stable for the (k-1)-th time, or the duration obtained when the first model parameter was determined to be unstable for the (k-1)-th time and used to adjust the preset time period corresponding to the next time the first model parameter is stable.
  • In other words, each detection produces a duration. If the first model parameter is determined to be stable this time, it has converged and the length of time for which its transmission is stopped should be increased; the obtained duration is the duration of the preset time period and is greater than the duration obtained in the previous detection. If the first model parameter is determined to be unstable this time, it has not yet converged and the next stop-transmission period should be reduced; the obtained duration is used to adjust the duration obtained in the next detection and is less than the duration obtained in the previous detection.
  • The communication device stopping sending the update amount of the value of the first model parameter to the central server within the preset time period includes: if, at the k-th detection, the communication device determines that the first model parameter is stable according to the amount of change in its value, it stops sending the update amount of the value of the first model parameter to the central server for n sending cycles, where n is a positive integer. If, at the (k+1)-th detection, the communication device again determines that the first model parameter is stable according to the change in its value, it stops sending the update amount of the value of the first model parameter to the central server for (n+m) sending cycles, where m is a positive integer. If, at the (k+1)-th detection, the communication device determines that the first model parameter is not stable, and then at the (k+2)-th detection determines that it is stable, it stops sending the update amount of the value of the first model parameter to the central server for (n/r+m) sending cycles, where r is a positive integer greater than or equal to 2 and (n/r) ≥ 1.
  • the sending period is the period for the communication device to send the update amount of the value of the model parameter to the central server.
  • In other words, the duration of the preset time period for stopping the transmission of the update amount of the value of the first model parameter may be adjusted in units of the sending cycle. After the first model parameter is determined to be stable, an integer number of sending cycles is added to the duration obtained last time to obtain this time's preset stop-transmission period; after the first model parameter is determined to be unstable, the duration obtained last time is reduced proportionally to obtain this time's duration. For example, the preset time period obtained when the first model parameter is determined to be stable may be 2 sending cycles.
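  • The following sketch captures this pause-length schedule under the assumption that the pause grows by m sending cycles after each stable detection and shrinks by a factor of r after an unstable one; the values of n, m, and r are illustrative.

```python
def next_pause(prev_duration, stable, m=1, r=2):
    """prev_duration: the duration (in sending cycles) carried over from the last
    detection, whether it was an actual pause or an adjustment value."""
    if stable:
        return prev_duration + m            # parameter still converged: pause longer
    return max(prev_duration // r, 1)       # parameter moved again: shorten next pause

# Starting from n = 2 cycles at a stable detection, the sequence
# unstable -> stable yields 1 (= n/r) and then 2 (= n/r + m),
# while stable -> stable would instead yield 3 (= n + m).
duration = 2
for stable in (False, True):
    duration = next_pause(duration, stable)
```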
  • Optionally, the method further includes: the communication device determines the amount of change in the value of the first model parameter according to a detection period, and the sending period is less than the detection period. Judging the stability of the first model parameter is only meaningful after the communication device has sent the update amount of the first model parameter value and obtained the updated value of the first model parameter, so the sending period should be shorter than the detection period.
  • the method further includes: if the communication device determines that the ratio of the number of stable first model parameters to the number of third model parameters is greater than the preset ratio, reducing the value of the preset threshold.
  • In this way, the preset threshold is dynamically adjusted to prevent too many model parameters from stopping transmission, which would prolong model training or affect the final accuracy of the model. Otherwise, within the preset time period the model could only be trained with the values of the stopped model parameters held constant, and the training effect might not be ideal; alternatively, the updated values of those model parameters could only be obtained after each stopped parameter's preset time period ended and training continued, so the time needed for the model to reach the convergence condition would be too long.
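  • A possible form of this threshold adjustment is sketched below; the maximum ratio and decay factor are illustrative values, not values specified by this application.

```python
def adjust_threshold(threshold, num_stable, num_tracked, max_ratio=0.5, decay=0.8):
    """If too large a fraction of the tracked (third) model parameters are paused,
    tighten the threshold so that fewer parameters qualify as stable."""
    if num_tracked and num_stable / num_tracked > max_ratio:
        return threshold * decay
    return threshold
```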
  • Optionally, the method further includes: the communication device receives the value of the second model parameter sent by the central server. For example, suppose the sending cycle is 5 s and the detection cycle is 10 s. The communication device then sends the update amount of the value of the first model parameter to the central server every 5 s, and likewise the central server sends the value of the first model parameter to the communication device every 5 s. The communication device determines the change in the value of the first model parameter every 10 s and judges whether the first model parameter is stable based on that change.
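  • The toy loop below illustrates the relationship between the 5 s sending cycle and the 10 s detection cycle in this example; `exchange_updates` and `check_stability` are hypothetical helper methods.

```python
SEND_PERIOD_S, DETECT_PERIOD_S = 5, 10

def run(device, total_time_s):
    for t in range(0, total_time_s, SEND_PERIOD_S):
        device.exchange_updates()        # every 5 s: send update amounts, receive values
        if t % DETECT_PERIOD_S == 0:
            device.check_stability()     # every 10 s: recompute the change amounts
```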
  • In a second aspect, this application provides a communication method based on model training, which is applied to a system including a central server and communication devices.
  • the method includes: the central server receives the update amount of the second model parameter value sent by the communication device; within a preset time period, the update amount of the second model parameter value does not include the update amount of the first model parameter value.
  • The first model parameter is a model parameter that is determined to be stable according to the amount of change in the value of the first model parameter.
  • the central server determines the updated value of the second model parameter according to the update amount of the second model parameter value.
  • the central server sends the updated value of the second model parameter to the communication device. In the preset time period, the updated value of the second model parameter does not include the updated value of the first model parameter.
  • the update amount of the value of the first model parameter and the update amount of the second model parameter are determined by the communication device according to user data in the process of model training. That is, the communication device will determine the update amount of the value of the model parameter according to the user data in the process of model training. In addition, the communication device will determine the update amount of the value of the model parameter that needs to stop transmitting to the central server according to the stable state of the model parameter.
  • the central server will determine whether the communication device has stopped transmitting part of the model parameters according to the update amount of the received model parameter values. If the transmission of part of the model parameters has been stopped, the central server will also stop transmitting the part of the model parameters that the communication device has stopped transmitting within the preset time period. In turn, it can reduce the amount of data transmitted in both directions and improve communication efficiency.
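  • The server-side behaviour can be sketched as follows, assuming the per-device messages have already been decoded into {parameter index: update amount} maps (for example by the decode_message sketch above); the aggregation rule and learning rate are illustrative.

```python
import numpy as np

def server_step(params, device_messages, lr=0.1):
    """params: list of current model parameter values.
    device_messages: one {parameter index: update amount} map per device.
    Returns only the updated values of the (second) parameters that were transmitted."""
    new_values = {}
    for i in range(len(params)):
        grads = [msg[i] for msg in device_messages if i in msg]
        if grads:                                # at least one device still sends i
            params[i] -= lr * float(np.mean(grads))
            new_values[i] = params[i]            # only these values go back to devices
        # paused (first) parameters: nothing aggregated, nothing returned this period
    return new_values
```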
  • the method further includes: after a preset period of time, the central server sends the value of the first model parameter to the communication device and receives the update amount of the value of the first model parameter sent by the communication device.
  • After the preset time period, the communication device automatically resumes sending the update amount of the model parameter value to the central server, and the central server determines the updated value of the model parameter from that update amount. In this way, after the preset time period the central server transmits to the communication device the updated values of the model parameters whose transmission had previously been stopped, achieving bidirectional, adaptive adjustment of the communication data volume between the central server and the communication device.
  • In one implementation, the central server sending the updated value of the second model parameter to the communication device, where within the preset time period the updated value of the second model parameter does not include the updated value of the first model parameter, includes: the central server sends a message to the communication device, the message includes the updated value of the third model parameter and the value of the corresponding flag bit, and the updated value of the third model parameter includes the updated value of the second model parameter and the updated value of the first model parameter. The value of the flag bit corresponding to the updated value of the first model parameter is used to indicate that the central server has not transmitted the updated value of the first model parameter to the communication device. In another implementation, the central server sends a message to the communication device that includes the updated value of the second model parameter and the value of the corresponding identification bit. The updated value of the second model parameter does not include the updated value of the first model parameter, and the value of the flag bit corresponding to the updated value of the second model parameter is used to indicate that the central server has transmitted the updated value of the second model parameter to the communication device.
  • The first model parameter being determined to be a stable model parameter according to the amount of change in its value includes: if the amount of change in the value of the first model parameter is less than a preset threshold, determining that the first model parameter is stable.
  • the method before the central server receives the update amount of the value of the second model parameter sent by the communication device, the method further includes: the central server sends the value of the second model parameter to the communication device.
  • The transmission of model parameters between the central server and the communication device is a cyclical interaction process: the communication device can obtain the update amount of the model parameter value from the value of the model parameter, and the central server can obtain the updated value of the model parameter from the update amount of the model parameter value.
  • the present application provides a communication device based on model training.
  • the device includes: a processing unit, a sending unit, and a receiving unit.
  • the processing unit is used to determine the amount of change in the value of the first model parameter.
  • the processing unit is further configured to determine whether the first model parameter is stable according to the amount of change in the value of the first model parameter.
  • The sending unit is configured to send the update amount of the value of the first model parameter to the central server. If the processing unit determines, according to the change in the value of the first model parameter, that the first model parameter is stable, the sending unit stops sending the update amount of the value of the first model parameter to the central server within a preset time period.
  • the update amount of the value of the first model parameter is determined by the processing unit according to user data in the process of model training.
  • the receiving unit is configured to receive the value of the second model parameter sent by the central server; wherein, within the preset time period, the value of the second model parameter does not include the value of the first model parameter.
  • the sending unit is further configured to send the update amount of the value of the first model parameter to the central server after a preset period of time.
  • the receiving unit is further configured to receive the value of the first model parameter sent by the central server after the preset time period.
  • The sending unit is specifically configured to: send a message to the central server, the message including the update amount of the value of the third model parameter and the value of the corresponding flag bit; the update amount of the value of the third model parameter includes the update amount of the second model parameter value and the update amount of the first model parameter value.
  • the value of the flag corresponding to the update amount of the value of the first model parameter is used to indicate that the sending unit has not transmitted the update amount of the value of the first model parameter to the central server.
  • Alternatively, a message is sent to the central server that includes the update amount of the value of the second model parameter and the value of the corresponding identification bit. The update amount of the value of the second model parameter does not include the update amount of the first model parameter value, and the value of the flag bit corresponding to the update amount of the second model parameter value is used to indicate that the sending unit has transmitted the update amount of the second model parameter value to the central server.
  • the processing unit is specifically configured to: obtain the amount of change in the value of the first model parameter according to the historical information of the value of the first model parameter.
  • the value history information includes: the effective change of the value of the first model parameter and the cumulative change of the value of the first model parameter.
  • The processing unit is specifically configured to: determine whether the first model parameter is stable according to the amount of change in the value of the first model parameter, and if the amount of change in the value of the first model parameter is less than a preset threshold, determine that the first model parameter is stable.
  • the processing unit is specifically configured to determine the amount of change in the value of the first model parameter M times.
  • M is a positive integer greater than or equal to 2.
  • Determining the preset condition that the preset time period satisfies includes: if the first model parameter is determined to be stable for the k-th time according to the change in the value of the first model parameter, the duration of the preset time period is a first duration, and the first duration is greater than a second duration. The second duration is either the duration of the preset time period for stopping sending the update amount of the first model parameter value to the central server when the first model parameter was determined to be stable for the (k-1)-th time, or the duration obtained when the first model parameter was determined to be unstable for the (k-1)-th time and used to adjust the preset time period corresponding to the next time the first model parameter is stable; k is a positive integer and k ≤ M.
  • The preset condition further includes: if the first model parameter is determined to be unstable for the k-th time, a third duration is obtained and used to adjust the preset time period corresponding to the next time the first model parameter is stable, and the third duration is less than a fourth duration. The fourth duration is either the duration of the preset time period for stopping sending the update amount of the first model parameter value to the central server when the first model parameter was determined to be stable for the (k-1)-th time, or the duration obtained when the first model parameter was determined to be unstable for the (k-1)-th time and used to adjust the preset time period corresponding to the next time the first model parameter is stable.
  • If the processing unit determines, for the k-th time, that the first model parameter is stable according to the change in its value, the sending unit stops sending the update amount of the first model parameter value to the central server within n sending cycles, where n is a positive integer. If the processing unit determines, for the (k+1)-th time, that the first model parameter is stable according to the change in the value of the first model parameter, the sending unit stops sending the update amount of the value of the first model parameter to the central server within (n+m) sending cycles, where m is a positive integer. If the processing unit determines, for the (k+1)-th time, that the first model parameter is not stable, and then determines, for the (k+2)-th time, that it is stable, the sending unit stops sending the update amount of the value of the first model parameter to the central server within (n/r+m) sending cycles, where r is a positive integer greater than or equal to 2 and (n/r) ≥ 1.
  • the sending period is the period for the sending unit to send the update amount of the value of the model parameter to the central server.
  • the processing unit is further configured to: determine the amount of change in the value of the first model parameter according to the detection period; the sending period is less than the detection period.
  • the processing unit is further configured to: if the processing unit determines that the ratio of the number of stable first model parameters to the number of third model parameters is greater than the preset ratio, reduce the value of the preset threshold .
  • the receiving unit receives the value of the second model parameter sent by the central server.
  • the present application provides a communication device based on model training.
  • the device includes: a receiving unit, a processing unit, and a sending unit.
  • the receiving unit is configured to receive the update amount of the value of the second model parameter sent by the communication device; within the preset time period, the update amount of the value of the second model parameter does not include the update amount of the value of the first model parameter; wherein,
  • the first model parameter is a stable model parameter determined according to the amount of change in the value of the first model parameter.
  • the processing unit is configured to determine the updated value of the second model parameter according to the update amount of the second model parameter value.
  • the sending unit is configured to send the updated value of the second model parameter to the communication device; within a preset time period, the updated value of the second model parameter does not include the updated value of the first model parameter.
  • the sending unit is further configured to send the value of the first model parameter to the communication device after a preset period of time.
  • the receiving unit is further configured to receive the update amount of the value of the first model parameter sent by the communication device.
  • the sending unit is specifically configured to: send a message to the communication device, the message including the updated value of the third model parameter and the value of the corresponding flag; the updated third model parameter The value includes the updated value of the second model parameter and the updated value of the first model parameter.
  • the value of the flag corresponding to the updated value of the first model parameter is used to indicate that the sending unit has not transmitted the updated value of the first model parameter to the communication device.
  • The first model parameter being determined to be a stable model parameter according to the amount of change in its value includes: if the amount of change in the value of the first model parameter is less than a preset threshold, determining that the first model parameter is stable.
  • Before the receiving unit receives the update amount of the value of the second model parameter sent by the communication device, the sending unit sends the value of the second model parameter to the communication device.
  • The present application provides a communication device, including: one or more processors, a memory, and a computer program, where the computer program is stored in the memory and includes instructions. When the instructions are executed by the communication device, the communication device executes the communication method based on model training in the above first aspect and any one of its possible implementations.
  • the present application provides a communication device, including: one or more processors; a memory; and a computer program, wherein the computer program is stored in the memory, and the computer program includes instructions; when the instructions are executed by a central server, The central server executes the communication method based on model training in the above-mentioned second aspect and any one of its possible implementation manners.
  • The present application provides a communication system, including a central server and at least one communication device. The at least one communication device executes the communication method based on model training in the above first aspect and any one of its possible implementations, and the central server executes the communication method based on model training in the above second aspect and any one of its possible implementations.
  • this application provides a communication device that has the function of implementing the communication method based on model training as described in the first aspect to the second aspect and any one of the possible implementations.
  • This function can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above-mentioned functions.
  • The present application provides a computer storage medium including computer instructions. When the computer instructions run on a communication device based on model training, the communication device executes the communication method based on model training described in the above first aspect to second aspect and any one of the possible implementations.
  • The present application provides a computer program product. When the computer program product runs on a communication device based on model training, the communication device executes the communication method based on model training described in the above first aspect to second aspect and any one of the possible implementations.
  • In an eleventh aspect, the present application provides a circuit system, which includes a processing circuit configured to perform the communication method based on model training described in the above first aspect to second aspect and any one of the possible implementations.
  • an embodiment of the present application provides a chip system, which includes at least one processor and at least one interface circuit.
  • The at least one interface circuit is configured to perform receiving and sending functions and to send instructions to the at least one processor. When the instructions are executed by the processor, the at least one processor executes the communication method based on model training described in the above first aspect to second aspect and any one of the possible implementations.
  • FIG. 1 is a schematic diagram of an application scenario of a communication method based on model training provided by an embodiment of this application;
  • FIG. 2 is a schematic diagram of the hardware structure of a communication device provided by an embodiment of the application.
  • FIG. 3 is a first schematic flowchart of a communication method based on model training provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of a message structure provided by an embodiment of the application.
  • FIG. 5 is a second schematic flowchart of a communication method based on model training provided by an embodiment of this application;
  • FIG. 6 is a third schematic flowchart of a communication method based on model training provided by an embodiment of this application.
  • FIG. 7 is an analysis diagram 1 of experimental data provided by an embodiment of this application.
  • FIG. 8 is an analysis diagram 2 of experimental data provided by an embodiment of this application;
  • FIG. 9 is an analysis diagram 3 of experimental data provided by an embodiment of this application;
  • FIG. 10 is a schematic structural diagram 1 of a communication device based on model training provided by an embodiment of the application;
  • FIG. 11 is a second structural diagram of a communication device based on model training provided by an embodiment of this application.
  • FIG. 12 is a schematic structural diagram of a chip system provided by an embodiment of the application.
  • the solutions of the embodiments of the present application are mainly applied to a distributed system of machine learning, and the distributed system of machine learning includes a central server and multiple communication devices.
  • Each communication device has its own processor and memory, and each has its own independent data processing function.
  • Each communication device has equal status and stores its own user data, and user data is not shared between communication devices. This ensures information security during large-scale data exchange and enables efficient machine learning to be carried out across multiple communication devices while protecting terminal data and personal data privacy.
  • A distributed machine learning system 100 includes a central server 10 and at least one communication device 20, such as the communication device 1, the communication device 2, the communication device 3, and the communication device 4 in FIG. 1.
  • the central server 10 and the at least one communication device 20 may be connected through a wired network or a wireless network.
  • the embodiment of the present application does not specifically limit the connection mode between the central server 10 and the at least one communication device 20.
  • the central server 10 may be a device or server with computing functions such as a cloud server or a network server.
  • the central server 10 may be a server, a server cluster composed of multiple servers, or a cloud computing service center.
  • a neural network model is stored in the central server 10, and parameter values of the neural network model can be sent to at least one communication device 20 through an interactive interface.
  • the communication device 20 may also be referred to as a host, a model training host, and so on.
  • the communication device 20 may be a server or a terminal device.
  • A related human-computer interaction interface can be provided to collect the local user data on which model training is based.
  • For example, the communication device may be a mobile phone, a tablet computer (Pad), a computer with a wireless transceiver function, a virtual reality (VR) electronic device, an augmented reality (AR) electronic device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, an in-vehicle terminal, an artificial intelligence (AI) terminal, and so on.
  • the embodiments of the present application do not impose special restrictions on the specific form of the communication device.
  • The communication device 20 may receive the neural network model parameter values sent by the central server 10 to construct a local model, and use local data as training data to train the local model. It is understandable that the larger the amount of training data, the better the performance of the trained model. Because the amount of local data held by a single communication device 20 is limited, the accuracy of the locally trained model is limited. To increase the amount of data contained in the training data, the values of model parameters or the update amounts of those values would have to be transmitted among multiple communication devices 20, which is not conducive to protecting private data.
  • the distributed machine learning system 100 can optimize the local model in each communication device 20 without sharing local data among the communication devices 20.
  • Specifically, each communication device 20 does not need to share local data; instead, it sends to the central server 10 the update amount of each parameter of the local model before and after training. The central server 10 uses the update amounts of the model parameter values sent by the communication devices 20 to train the model, and sends the trained model parameter values to each communication device 20. Each communication device 20 then uses local data to train a local model constructed from the updated model parameter values.
  • the central server 10 can obtain a neural network model with better performance and send it to each communication device 20.
  • the update amount can also be called a gradient.
  • the above-mentioned machine learning process can also be described as a distributed system of federated learning.
  • the technical solution in the embodiments of the present application may be a communication method based on model training, or may be described as a communication method based on federated learning. That is, the federated learning process can also be described as a model training process.
  • the aforementioned machine learning distributed system 100 can be applied in the following scenarios:
  • Scenario 1: improving the performance of a mobile phone input method.
  • the mobile phone input method can predict and display the vocabulary that may be input in the future based on the vocabulary currently input by the user.
  • the prediction model used therein needs to be trained based on user data to improve its prediction accuracy.
  • However, certain personal user data, such as sensitive data like the websites a user visits and the user's travel locations, cannot be shared directly.
  • Google introduced a distributed system based on machine learning to improve the prediction model of mobile phone input vocabulary.
  • The central server at Google sends the prediction model parameter values to multiple mobile phones, and each mobile phone trains the prediction model on its own local user data to obtain the update amounts of the model parameter values. The central server averages and superimposes the update amounts of the prediction model parameter values to obtain a unified new prediction model and feeds it back to each mobile phone. In this way, without the mobile phones having to share local user data, continuous iterative updates eventually yield a prediction model with high prediction accuracy, thereby improving the performance of the mobile phone input method.
  • the application scenario of the distributed system of machine learning may include a scenario based on a large number of communication devices in the above scenario 1, and may also include a scenario based on a limited number of communication devices, such as the following scenario 2.
  • Scenario 2: improving a medical prediction model.
  • In a hospital, the training data required to build a prediction model for developing treatment methods and obtaining treatment prediction results is patient data. If patient data were shared directly for model training, the consequences of actual or potential violations of patient privacy could be serious.
  • Therefore, a machine-learning-based distributed system can be used without the hospitals sharing their respective patient data. Each hospital uses its own patient data to train the prediction model, and the central server integrates the updated values of all the prediction model parameters to finally obtain a comprehensive prediction model with higher accuracy over all hospital patient data. The central server then sends the final model parameter values to each hospital, thereby improving the medical prediction model of each hospital.
  • FIG. 2 shows a schematic diagram of the hardware structure of the communication device 20 provided by an embodiment of the application.
  • the communication device 20 includes a bus 110, a processor 120, a memory 130, a user input module 140, a display module 150, a communication interface 160, and other similar and/or suitable components.
  • the bus 110 may be a circuit that connects the aforementioned elements to each other and transfers communication between the aforementioned elements.
  • the processor 120 can receive commands from the above-mentioned other elements (such as the memory 130, the user input module 140, the display module 150, the communication interface 160, etc.) through the bus 110, can interpret the received commands, and can perform calculations according to the interpreted commands Or data processing.
  • The processor 120 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of this application.
  • the processor 120 is configured to establish a local model according to the received model parameter values sent by the central server 10, and use local user data to train the local model to obtain an update amount of the model parameter values.
  • the memory 130 may store commands or data received from the processor 120 or other elements (for example, the user input module 140, the display module 150, the communication interface 160, etc.) or commands or data generated by the processor 120 or other elements.
  • The memory 130 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, without being limited thereto.
  • the memory 130 may exist independently, and is connected to the processor 120 through the bus 110.
  • the memory 130 may also be integrated with the processor 120.
  • the memory 130 is used to store computer-executable instructions used to implement the solution of the present application, and the processor 120 controls the execution.
  • the processor 120 is configured to execute computer-executable instructions stored in the memory 130, so as to implement the communication method based on model training provided in the following embodiments of the present application.
  • the computer execution instructions in the embodiments of the present application may also be referred to as application program codes, instructions, computer programs, or other names, which are not specifically limited in the embodiments of the present application.
  • the memory 130 stores user data input by the communication device 20 through the user input module 140 based on human-computer interaction, or user data obtained by the communication device 20 in other ways.
  • the processor 120 may perform model training by calling user data stored in the memory 130. Further, the processor 120 may also obtain user data online for model training, which is not specifically limited in the embodiment of the present application.
  • the user input module 140 may receive data or commands input from a user via input-output means (for example, a sensor, a keyboard, a touch screen, etc.), and may transmit the received data or commands to the processor 120 or the memory 130 through the bus 110.
  • the display module 150 may display various information (for example, multimedia data, text data) received from the above-mentioned elements.
  • the communication interface 160 can control the communication between the communication device 20 and the central server 10.
  • the communication interface 160 can receive the model parameter values sent by the central server 10, send the update amounts of the model parameter values to the central server, and also control the sending cycle of the update amounts of the model parameter values. These operations may also be executed by the processor 120 controlling the communication interface 160.
  • the communication interface 160 may communicate with the central server 10 directly or through the network 161.
  • the communication interface 160 may operate to connect the communication device 20 to the network 161.
  • the hardware structure of the illustrated communication device 20 is only an example, and the communication device 20 may have more or fewer components than those shown in FIG. 2, two or more components may be combined, or a different component configuration may be used.
  • in a communication method for model training in a machine-learning-based distributed system, the central server generally sends all model parameter values to the communication device, and the communication device obtains the update amounts of all the model parameter values according to those values and sends the update amounts of all the model parameter values back to the central server.
  • as a result, the amount of data transmitted between the central server and the communication device is relatively large. In fact, in the stage before the model converges, some model parameters have already converged and stabilized, and the subsequent changes in their values are only small oscillations, which contribute little to model training.
  • therefore, the embodiment of the present application provides a communication method based on model training, which can confirm whether a model parameter is stable based on the change in the value of that transmitted model parameter, and then confirm whether to stop transmitting the update amount of the value of the corresponding model parameter. In this way, the amount of transmitted data can be reduced without losing model training accuracy, and communication efficiency can be improved.
  • FIG. 3 is a schematic flowchart of a communication method based on model training provided by this embodiment of the application, and the method may include S101-S103:
  • S101 The communication device determines the amount of change in the value of the first model parameter.
  • the first model parameter may be at least one of the model parameters received by the communication device from the central server; the communication device may determine the amount of change in the values of all the received model parameters, or only the amount of change in the values of some of the received model parameters.
  • for example, the communication device receives 100 model parameter values, and the values of two of these model parameters have a large impact on the model training process, so it must be ensured during training that the values of these two model parameters are continuously updated; the number of first model parameters is therefore 98. After the communication device receives the 100 model parameter values sent by the central server, it determines the amount of change in the values of the 98 model parameters other than the above two model parameters.
  • the communication device receives the value of the first model parameter sent by the central server, and can obtain the amount of change in the value of the first model parameter according to the historical information of the received values.
  • the historical information may include the effective change of the value of the first model parameter and the cumulative change of the value of the first model parameter.
  • for example, an exponential moving average (EMA) method may be used to obtain the amount of change in the value of the first model parameter according to the effective change and the cumulative change in the value of the first model parameter.
  • specifically, the communication device obtains the change amount P of the value of the first model parameter as the ratio of the effective change E_k of the value of the first model parameter acquired at the kth time to the cumulative change A_k of the value acquired at the kth time, that is, P = E_k / A_k. E_k and A_k are obtained by exponential moving average from the effective change E_{k-1} and the cumulative change A_{k-1} at the (k-1)th time, the weight parameter β, and the difference Δ_k; the initial values of the effective change and the cumulative change of the value of the first model parameter are 0.
  • β is the weight parameter, which indicates the attenuation degree of the weight, with 0 < β < 1; β is a variable that decreases exponentially with the number of calculations. E_{k-1} is the effective change of the value of the first model parameter at the (k-1)th time; A_{k-1} is the cumulative change of the value of the first model parameter at the (k-1)th time; Δ_k is the difference between the values of the first model parameter obtained at the kth time and the (k-1)th time.
  • in this way, the memory of the communication device only needs to store the value of the first model parameter received last time, together with the effective change amount and the cumulative change amount of the value determined last time. Based on the value of the first model parameter received this time, the effective change amount and the cumulative change amount of this time can be determined, and then the amount of change in the value of the first model parameter. The amount of change in the value of the first model parameter can therefore be determined with a small storage space.
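  • to make the EMA-based determination above concrete, the following Python sketch tracks one parameter with minimal storage (only the last value and the last effective and cumulative changes). The exact recurrences for the effective change and the cumulative change, the names beta/effective/cumulative, the use of the absolute value in the ratio, and the rate at which beta decays are assumptions for illustration, not the formulas of this application.

    class EmaChangeDetector:
        """Tracks the change amount P of one model parameter value with an
        exponential moving average; the update rules below are assumed forms."""

        def __init__(self, beta=0.9):
            self.beta = beta        # weight parameter, 0 < beta < 1
            self.effective = 0.0    # effective change E, initial value 0
            self.cumulative = 0.0   # cumulative change A, initial value 0
            self.last_value = None  # value of the parameter received last time

        def update(self, value):
            """Feed the value received this time and return the change amount P."""
            if self.last_value is None:
                self.last_value = value
                return 1.0  # no history yet: treat the parameter as not stable
            delta = value - self.last_value  # difference between this time and last time
            self.effective = self.beta * self.effective + (1 - self.beta) * delta         # assumed EMA update
            self.cumulative = self.beta * self.cumulative + (1 - self.beta) * abs(delta)  # assumed EMA update
            self.last_value = value
            self.beta *= 0.99  # beta decreases with the number of calculations (assumed rate)
            if self.cumulative == 0.0:
                return 0.0
            return abs(self.effective) / self.cumulative  # P as the ratio of effective to cumulative change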
  • the communication device may determine the amount of change in the value of the first model parameter this time based on the amount of change in the value of the first model parameter in each of the most recent detections.
  • for example, the communication device saves the changes in the value of the first model parameter for the most recent 10 detections, and all the saved changes are used to obtain the change in the value of the first model parameter this time. If fewer than 10 detections have occurred, for example at the sixth detection only the data of the first 5 detections are retained in memory, the data of these 5 detections are used to obtain the change in the value of the first model parameter this time.
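  • the window-based variant above can be sketched in the same spirit; the document does not say how the saved changes are combined into this time's change, so the plain averaging used here is an assumption.

    from collections import deque

    def windowed_change(history: deque, new_change: float, window: int = 10) -> float:
        """Keep only the changes from the most recent `window` detections and combine
        them (here by averaging, an assumed choice) into this time's change amount."""
        history.append(new_change)
        while len(history) > window:
            history.popleft()
        return sum(history) / len(history)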
  • S102 If the communication device determines that the first model parameter is stable according to the amount of change in the value of the first model parameter, the communication device stops sending the update amount of the value of the first model parameter to the central server within a preset time period.
  • the communication device is used for model training according to user data, and the update amount of the value of the model parameter is determined during the model training process, and the update amount may also be referred to as a gradient.
  • after receiving the model parameter values sent by the central server, the communication device constructs a local model and trains the model using local user data.
  • the model parameter values are adjusted based on user data to obtain the update amount of the model parameter values, and the update amount of the model parameter values is sent to the central server according to the sending cycle.
  • the transmission period may be determined based on the bandwidth of the communication device, and the specific determination method may refer to the prior art, which is not specifically limited in the embodiment of the present application.
  • a detection period may be set, and the communication device obtains the amount of change in the value of the first model parameter according to the detection period, that is, periodically determines whether the first model parameter is stable according to the detection period. It is understandable that the communication device will also send the update amount of the value of the first model parameter to the central server according to the preset sending cycle. In addition, the change in the value of the first model parameter will only occur after the data is sent and received, so the sending period is shorter than the detection period. For example: the sending period is 5s, and the detection period is 10s. Then the communication device sends the update amount of the value of the first model parameter to the central server every 5s, and the same central server sends the value of the first model parameter to the communication device every 5s. The communication device confirms the change in the value of the first model parameter every 10s, and determines whether the first model parameter is stable based on the change.
  • the amount of change in the value of the first model parameter may be periodically determined through the above step S101, and then whether the first model parameter is stable is determined according to the amount of change in the value of the first model parameter.
  • a preset threshold may be set according to experimental data, historical experience values, and the like. If the change in the value of the first model parameter obtained by the method in step S101 is less than the preset threshold, it is determined that the first model parameter is stable. If the first model parameter is stable, the communication device stops sending the update amount of the value of the first model parameter to the central server within the preset time period; correspondingly, the central server also stops sending the value of the first model parameter to the communication device within the preset time period.
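  • the decision in step S102 can then be sketched as follows; the threshold value, the `suppressed_until` bookkeeping, and the helper names are illustrative assumptions built on the detector sketched earlier.

    import time

    PRESET_THRESHOLD = 0.01  # assumed value; in practice set from experiments or experience

    def check_and_suppress(param_name, change_amount, preset_seconds, suppressed_until):
        """If the change amount is below the preset threshold, judge the parameter stable
        and stop sending its update amount for `preset_seconds`."""
        if change_amount < PRESET_THRESHOLD:
            suppressed_until[param_name] = time.time() + preset_seconds
            return True   # stable: transmission of this update amount stops
        return False      # unstable: keep sending

    def is_transmittable(param_name, suppressed_until):
        """A parameter participates in transmission again once its preset period has passed."""
        return time.time() >= suppressed_until.get(param_name, 0.0)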
  • the first model parameter is a model parameter that participates in determining whether it is stable, and its number is not limited, and can be any one or more model parameters. If it is determined that the first model parameter is stable, the first model parameter is a model parameter that does not participate in transmission within the preset time period.
  • the second model parameter is a model parameter that participates in the update amount and value transmission between the central server and the communication device. The number of second model parameters is also not limited, and may be one or more.
  • all the first model parameters and all the second model parameters constitute all the model parameters (may also be referred to as the third model parameters).
  • after the preset time period, the communication device again sends the update amount of the value of the first model parameter to the central server and receives the value of the first model parameter sent by the central server. That is, once the preset time period is reached, the first model parameter automatically starts to participate in transmission again, and the historical information, for example the effective change and the cumulative change of the value of the first model parameter, also starts to be automatically recorded again.
  • the value and update amount of model parameters can be transmitted between the communication device and the central server through a message.
  • a message structure provided by an embodiment of this application includes data of a message header and the message body.
  • the message header information includes the message destination address (destination address), source address (source address), header, and length/type (length/type).
  • the source address and the destination address mentioned above are both media access control (MAC) addresses, and for the specific structure of the message, reference may be made to the prior art.
  • the header includes an identification bit corresponding to the update amount of the value of the model parameter.
  • the message carries the update amount of the value of the third model parameter (all model parameters included in the model) and the value of the corresponding flag, and the value of the flag is used to indicate whether the communication device transmits the update amount of the value of the third model parameter to the central server.
  • special characters may be set in the identification bit.
  • the value of the identification bit is "1”, which means that the communication device sends the update amount of the model parameter value corresponding to the identification bit to the central server.
  • the value of the flag bit is "0", which means that the communication device has not sent the update amount of the model parameter value corresponding to the flag bit to the central server.
  • the central server knows, according to the values of the identification bits in the header, that the communication device has not transmitted the update amounts of the values of the first model parameters within the preset time period, but has transmitted the update amounts of the values of the second model parameters. That is, in this design the message is organized around the update amounts of the values of the third model parameters, where the third model parameters include the second model parameters judged to be transmittable and the first model parameters whose update amounts are not transmitted within the preset time period.
  • the communication device judges that the three model parameters of the first model parameters B, E, and F are stable, and the value of the flag corresponding to the update amount of the values of the three first model parameters of B, E, and F takes a value of 0.
  • the value of the flag bit corresponding to the update amount of the values of the three second model parameters A, C, and D is 1. It means that the data corresponding to the model parameters of A, C, and D are included in the data segment of the message transmitted this time, and the data corresponding to the model parameters of B, E, and F are not included.
  • in another design of the message, the message sent by the communication device to the central server contains the update amounts of the values of the second model parameters and the values of the corresponding identification bits. Within the preset time period, the update amounts of the values of the second model parameters do not include the update amounts of the values of the first model parameters, and the value of the identification bit corresponding to the update amount of a second model parameter value is used to indicate that the communication device transmits the update amount of that second model parameter value to the central server.
  • for example, the values of the identification bits corresponding to the update amounts of the values of the three second model parameters A, C, and D carried in the message can be set to 1, which means that the data segment of the message transmitted this time contains the data corresponding to the A, C, and D model parameters.
  • in other possible designs, the order of the update amounts of the values of the third model parameters in the message to be transmitted is preset between the communication device and the central server. After the communication device determines that the first model parameters are stable, the message transmitted to the central server simply does not contain the update amounts of the values of those first model parameters. Based on the vacant bits, the central server can determine which first model parameters have stopped being transmitted, and can also know which update amounts of second model parameter values the transmitted data corresponds to.
  • alternatively, the identification bits corresponding to the update amounts of the values of the first model parameters may be left empty. According to the empty identification bits in the message, the central server knows that, within the preset time period, the update amounts of the values of the model parameters whose identification bits are empty are the update amounts of the values of the first model parameters not transmitted by the communication device.
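  • the first message design above (a flag bit per model parameter in the header, with the data segment carrying only the transmitted update amounts) could be packed roughly as in the following sketch; the bit layout, the 4-byte float encoding, and the helper names are assumptions, not the message format defined by this application.

    import struct

    def pack_updates(all_param_names, updates, transmittable):
        """Build (flag_bytes, payload): one flag bit per third model parameter in the
        preset order; the payload carries only the update amounts whose flag is 1."""
        flags, payload = 0, b""
        for i, name in enumerate(all_param_names):          # preset order shared with the server
            if transmittable(name):
                flags |= 1 << i                             # flag 1: update amount is carried
                payload += struct.pack("!f", updates[name]) # assumed 4-byte float encoding
            # flag stays 0 for stable first model parameters: nothing enters the data segment
        n_bytes = (len(all_param_names) + 7) // 8
        return flags.to_bytes(n_bytes, "big"), payload

    def unpack_updates(all_param_names, flag_bytes, payload):
        """Server side: read the flags to know which update amounts were transmitted."""
        flags = int.from_bytes(flag_bytes, "big")
        received, offset = {}, 0
        for i, name in enumerate(all_param_names):
            if flags >> i & 1:
                received[name] = struct.unpack_from("!f", payload, offset)[0]
                offset += 4
        return received  # stopped first model parameters are simply absent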
  • in addition, every time the communication device determines that first model parameters are stable and the transmission of the update amounts of their values needs to be stopped, it also needs to determine the ratio of the number of first model parameters to the number of third model parameters. If the communication device determines that the ratio of the number of stable first model parameters to the number of third model parameters is greater than a preset ratio, the value of the preset threshold is reduced. That is to say, after the number of model parameters whose transmission is stopped reaches a certain proportion, the threshold for judging whether a model parameter is stable needs to be lowered, which in turn reduces the number of model parameters whose transmission is stopped and ensures that enough of the model parameters participating in model training have updated values.
  • in this way, the preset threshold is adjusted dynamically, avoiding an excessive number of model parameters whose transmission is stopped, which would prolong model training or affect the final accuracy of the model.
  • for example, when a large number of model parameters stop being transmitted, within the preset time period the model can only be trained with the unchanged values of those model parameters, and the training effect may be unsatisfactory.
  • in that case, the updated values of these model parameters can only be obtained after the preset time period of each stopped model parameter is reached and training continues, resulting in an overly long training time for the model to reach the convergence condition.
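  • a minimal sketch of this dynamic threshold adjustment follows; the preset ratio and the shrink factor are assumed values.

    def adjust_threshold(num_stopped, num_total, threshold, preset_ratio=0.5, shrink=0.5):
        """If too large a share of the third model parameters has stopped transmission,
        lower the stability threshold so that fewer parameters are judged stable next time."""
        if num_total and num_stopped / num_total > preset_ratio:
            return threshold * shrink  # reduce the preset threshold (factor is an assumption)
        return threshold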
  • S103 The communication device receives the value of the second model parameter sent by the central server. Wherein, in the preset time period, the value of the second model parameter does not include the value of the first model parameter.
  • the central server obtains the updated values of the second model parameters according to the update amounts of the model parameter values sent by one or more communication devices, and sends the updated values of the second model parameters to the corresponding communication devices.
  • the first model parameter does not participate in the transmission, and the updated value of the second model parameter does not include the updated value of the first model parameter.
  • after the preset time period, the communication device sends the update amount of the value of the first model parameter to the central server and receives the value of the first model parameter sent by the central server. Because the central server does not generate and send the updated value of the first model parameter to the communication device during the preset time period, the amount of data transmitted in both directions between the communication device and the central server is reduced. After the preset time period, the communication device and the central server automatically start to transmit the value of the first model parameter and the update amount of the value again, so as to realize adaptive adjustment of the amount of transmitted data.
  • step S104 may be further included.
  • S104 The communication device receives the value of the second model parameter sent by the central server.
  • for example, the communication device may receive the values of the third model parameters sent by the central server, that is, the values of all the model parameters used for model training. The set of model parameters whose values are subsequently received changes according to the update amounts of the model parameter values sent by the communication device to the central server, for example changing to the second model parameters other than the first model parameters whose transmission has been stopped. That is, after the communication device sends the update amounts of the model parameter values to the central server, the central server determines the updated values of the model parameters according to those update amounts, and then sends the updated values back to the communication device.
  • the process of model training includes an interactive process of transmitting the update amount or value of model parameters between the communication device and the central server. In the process of interaction, the convergence of the model is achieved and the training process is completed.
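  • putting the pieces together, the interaction on the communication-device side might look like the sketch below, which reuses the earlier sketches; the helper names (train_one_round, send_message, receive_values, detectors) are hypothetical, and the nesting simply reflects that the sending period is shorter than the detection period.

    def client_detection_cycle(params, sending_period_s, detection_period_s,
                               detectors, suppressed_until,
                               train_one_round, send_message, receive_values):
        """One detection period of the cyclic interaction: exchange updates every
        sending period, then check parameter stability once per detection period."""
        rounds = int(detection_period_s // sending_period_s)
        for _ in range(rounds):
            updates = train_one_round(params)  # update amounts (gradients) from local user data
            send_message({n: u for n, u in updates.items()
                          if is_transmittable(n, suppressed_until)})  # skip stopped parameters
            params.update(receive_values())    # only second model parameter values come back
        # detection: decide which parameters stop transmitting in the next preset period
        for name, value in params.items():
            change = detectors[name].update(value)
            check_and_suppress(name, change,
                               preset_seconds=2 * sending_period_s,  # assumed initial stop period
                               suppressed_until=suppressed_until)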
  • in the embodiment of the present application, each communication device judges the stability of the values of the first model parameters and then determines whether to stop transmitting the update amounts of those values. It is understandable that the central server may also judge the stability of the values of the first model parameters to confirm whether to stop transmitting the values of the first model parameters, and may in addition send a notification to each corresponding communication device to inform it of the judgment result, thereby ensuring that each communication device knows whether to stop sending the update amounts of the values of the first model parameters to the central server in the next sending period.
  • with the communication method based on model training provided by the embodiment of the present application, if it is determined that the value of the first model parameter is stable, the transmission of the update amount of the value of the first model parameter is stopped within a preset time period. This flexibly reduces the amount of data transmitted between the communication device and the central server without losing model training accuracy, and improves communication efficiency.
  • the communication device sends the update amount of the value of the model parameter to the central server according to the sending cycle, and receives the value of the model parameter sent by the central server.
  • the communication device detects the stability of the values of the model parameters according to the detection period, determines the update amounts of the values of the model parameters whose transmission can be stopped, and obtains a duration at each detection. If the first model parameter is stable, the obtained duration is the duration of the preset time period during which transmission is stopped; if the first model parameter is unstable, the obtained duration is used to adjust the duration obtained the next time the first model parameter is determined to be stable.
  • the following steps can be used to realize the cyclic interactive transmission of model parameter values and update amounts of the values between the communication device and the central server, and to realize flexible adjustment of the obtained duration.
  • the duration of the preset period for stopping the transmission of the first model parameter may be an integer multiple of the transmission period, or may be any customized duration, which is not specifically limited in the embodiment of the present application.
  • Step 1 Determine whether the first model parameters are stable.
  • the communication device will receive the value of the third model parameter sent by the central server.
  • the communication device uses the values of the third model parameters to construct a local model, uses local data to train the model, and sends the update amounts of the values of the third model parameters to the central server according to the sending cycle.
  • the value of the third model parameter is the value of all the model parameters used to build the model.
  • the communication device then receives the updated values of the third model parameters sent by the central server, and determines whether the values of the first model parameters are stable through the methods in step S101 and step S102 described above. If the first model parameters are stable, step 2 is performed; if the first model parameters are unstable, the update amounts of the values of the first model parameters continue to be transmitted.
  • Step 2 Obtain the preset period duration for stopping the transmission of the update amount of the value of the first model parameter as the initial duration.
  • the transmission of the update amount of the value of the first model parameter can be stopped within a preset time period, thereby reducing the quantity of data communicated between the communication device and the central server.
  • the preset time period required to stop transmitting the update amount of the value of the first model parameter is obtained as the initial time length.
  • the initial duration can be an integer number of sending cycles or a custom time period, and the duration can subsequently be adjusted by increasing or decreasing the number of sending cycles or the custom time period.
  • the preset ratio can be customized according to experience values.
  • Step 3 When the initial duration is reached, the update amount of the value of the first model parameter is started to be transmitted.
  • the first model parameter that stops transmission automatically participates in transmission according to the transmission period.
  • Step 4 Obtain a preset time period duration for stopping the transmission of the update amount of the value of the first model parameter that is greater than the initial duration.
  • since the duration obtained at the last judgment of the first model parameter was the initial duration, the duration of the preset time period obtained when the parameter is judged to be stable this time is greater than the initial duration obtained last time.
  • for example, model parameter 1 obtained an initial duration of 1 sending cycle in the last detection period, and model parameter 1 automatically participates in transmission again after its transmission has been stopped for 1 sending cycle. Because the sending period is shorter than the detection period, model parameter 1 is tested again after participating in transmission for a period of time. If it is determined this time that model parameter 1 is stable, a preset time period duration greater than 1 sending cycle, for example 2 sending cycles, can be obtained.
  • as in step 2, it is necessary to calculate the ratio of the number of first model parameters whose transmission is to be stopped to the number of third model parameters, and if the ratio is greater than the preset ratio, the preset threshold is reduced.
  • Step 5 When the time reaches the preset time period, start to transmit the update amount of the value of the first model parameter.
  • the first model parameter whose transmission was stopped automatically starts to participate in transmission again and continues to participate in the stability judgment. If the first model parameter is stable, step 6 is executed; if the first model parameter is unstable, step 7 is executed.
  • Step 6 Obtain a preset time period duration for stopping the transmission of the update amount of the value of the first model parameter that is greater than the duration obtained last time.
  • the duration of the preset time period for stopping transmission is greater than the duration obtained by the previous detection.
  • as in step 2, it is necessary to calculate the ratio of the number of first model parameters whose transmission is to be stopped to the number of third model parameters, and if the ratio is greater than the preset ratio, the preset threshold is reduced.
  • after completing step 6, return to step 5; that is, after transmission starts automatically once the preset time period is reached, continue to determine whether the value of the first model parameter is stable. In other words, the entire model training process is a cyclic interaction between the communication device and the central server over the values of model parameters and the update amounts of the values. During model training, the stability of the model parameters must be judged repeatedly according to the detection period, and the corresponding duration obtained each time.
  • the communication device determines the amount of change in the value of the first model parameter M times.
  • M is a positive integer greater than or equal to 2.
  • the preset condition that the preset time period satisfies, determined by the communication device according to the stable state of the first model parameter at the kth time, includes: if the first model parameter is determined to be stable at the kth time according to the change in its value, the duration of the preset time period is a first duration, and the first duration is greater than a second duration.
  • the second duration is the duration of the preset time period during which sending the update amount of the value of the first model parameter to the central server was stopped when the first model parameter was determined to be stable at the (k-1)th time; alternatively, the second duration is the duration obtained when the first model parameter was determined to be unstable at the (k-1)th time, which is used to adjust the preset time period duration corresponding to the next time the first model parameter is stable; where k is a positive integer, k ≤ M.
  • Step 7 Continue to transmit the update amount of the value of the first model parameter, and obtain a duration less than the duration obtained last time.
  • since the first model parameters are unstable this time, a duration used to adjust the duration obtained at the next detection needs to be obtained, and the duration obtained this time is less than the duration obtained at the previous detection.
  • the value of the first model parameter in the last detection may be stable or unstable.
  • after completing step 7, return to step 5; that is, continue to determine whether the value of the first model parameter is stable in the next detection period.
  • the communication device determines the amount of change in the value of the first model parameter M times.
  • M is a positive integer greater than or equal to 2.
  • the preset condition determined by the communication device for the preset time period according to the stable state of the first model parameter at the kth time further includes: if the first model parameter is determined to be unstable at the kth time, a third duration is obtained, which is used to adjust the preset time period duration corresponding to the next time the first model parameter is stable, and the third duration is less than a fourth duration.
  • the fourth duration is the duration of the preset time period during which sending the update amount of the value of the first model parameter to the central server was stopped when the first model parameter was determined to be stable at the (k-1)th time; alternatively, the fourth duration is the duration obtained when the first model parameter was determined to be unstable at the (k-1)th time, which is used to adjust the preset time period duration corresponding to the next time the first model parameter is stable.
  • Step 8 Continue to transmit the update amount of the value of the first model parameter, and obtain a duration less than the initial duration.
  • since the first model parameter is unstable at this detection, transmission continues, and the duration obtained this time, which is used to adjust the duration obtained at the next detection, is less than the initial duration.
  • in other words, in the process of transmitting the update amounts or values of model parameters between the communication device and the central server, a duration is obtained each time the change in the value of the first model parameter is determined. If the first model parameter is determined to be stable this time, it indicates that the first model parameter has converged and the duration for which the transmission of its update amount is stopped needs to be increased; the obtained duration is then the duration of the preset time period, and it is greater than the duration obtained at the previous detection.
  • if the first model parameter is determined to be unstable this time, it indicates that the first model parameter has not yet converged and the duration for which its transmission is next stopped needs to be reduced; the obtained duration is then used to adjust the duration obtained at the next detection, and it is less than the duration obtained at the previous detection.
  • for example, if at the kth detection the communication device determines, according to the change in the value of the first model parameter, that the first model parameter is stable, it stops sending the update amount of the value of the first model parameter to the central server within n sending cycles, where n is a positive integer.
  • after the n sending cycles, if at the (k+1)th detection the communication device determines that the first model parameter is stable, it stops sending the update amount of the value of the first model parameter to the central server within (n+m) sending cycles, where m is a positive integer.
  • after the n sending cycles, if at the (k+1)th detection the communication device determines that the first model parameter is unstable, and at the (k+2)th detection it determines that the first model parameter is stable, it stops sending the update amount of the value of the first model parameter to the central server within (n/r+m) sending cycles, where r is a positive integer greater than or equal to 2 and (n/r) ≥ 1.
  • suppose, for example, that at the 5th (kth) detection the preset time period obtained when the first model parameter is stable is 2 sending cycles. If the first model parameter is stable at the 6th ((k+1)th) detection after the preset time period, the duration of the preset time period may be 2+1=3 sending cycles; if it is unstable at the 6th detection, the obtained duration may be 2/2=1 sending cycle, and if the 7th ((k+2)th) detection then finds it stable, the obtained preset time period may be 2/2+1=2 sending cycles.
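  • the duration rule running through steps 1-8 (grow the stop period when the parameter keeps testing stable, shrink the obtained duration when it turns out unstable) can be sketched as below; m=1 and r=2 reproduce the 2 -> 3 -> 1 -> 2 sending-cycle example above but are otherwise assumptions.

    def next_duration(prev_duration_cycles, stable_now, m=1, r=2):
        """Return the duration (in sending cycles) obtained at this detection.
        Stable: it becomes the next preset stop period, m cycles longer than before.
        Unstable: it only adjusts the next stable duration, shrunk by factor r, never below 1."""
        if stable_now:
            return prev_duration_cycles + m       # e.g. 2 -> 3 sending cycles
        return max(1, prev_duration_cycles // r)  # e.g. 2 -> 1 sending cycle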
  • FIG. 6 is a schematic flowchart of another communication method based on model training provided by this embodiment of the application, and the method may include S201-S203:
  • S201 The central server receives the update amounts of the values of the second model parameters sent by the communication device. Within the preset time period, the update amounts of the values of the second model parameters do not include the update amounts of the values of the first model parameters.
  • the first model parameter is a stable model parameter that is determined according to the amount of change in the value of the first model parameter.
  • the process of determining that the first model parameter is stable includes determining that the first model parameter is stable if the amount of change in the value of the first model parameter is less than a preset threshold.
  • for the specific method of determining the preset threshold, reference may be made to the related description of step S101, which is not repeated here.
  • the update amount of the value of the first model parameter and the update amount of the second model parameter are determined by the communication device according to user data in the process of model training. That is, the communication device will determine the update amount of the value of the model parameter according to the user data in the process of model training. In addition, the communication device will determine the update amount of the value of the model parameter that needs to stop transmitting to the central server according to the stable state of the model parameter.
  • after the preset time period, the central server sends the value of the first model parameter to the communication device and receives the update amount of the value of the first model parameter sent by the communication device.
  • the central server will determine whether the communication device has stopped transmitting part of the model parameters according to the update amount of the received model parameter values. If the transmission of part of the model parameters has been stopped, the central server will also stop transmitting the part of the model parameters that the communication device has stopped transmitting within the preset time period. In turn, it can reduce the amount of data transmitted in both directions and improve communication efficiency.
  • for the rest of the content, reference may be made to the related descriptions of step S101 to step S103, which are not repeated here.
  • S202 The central server determines the updated value of the second model parameter according to the update amount of the second model parameter value.
  • for example, after receiving the update amounts of the model parameter values sent by each communication device, the central server averages and superimposes the update amounts of the model parameters fed back by the communication devices to obtain unified new model parameter values. For the process of obtaining the values of the new model parameters, reference may be made to the prior art, which is not repeated in the embodiment of the present application.
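  • a minimal sketch of this averaging step on the central server follows; since the exact aggregation rule is referenced to the prior art, this FedAvg-style average with an assumed step size is only illustrative.

    def aggregate(global_values, client_updates, step=1.0):
        """Average the update amounts fed back by the communication devices and superimpose
        them onto the current values; parameters reported by no device (stopped first model
        parameters) keep their values and are not sent back within the preset period."""
        new_values = {}
        for name, value in global_values.items():
            reported = [u[name] for u in client_updates if name in u]
            if reported:
                new_values[name] = value + step * sum(reported) / len(reported)
        return new_values  # updated second model parameter values only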
  • S203 The central server sends the updated value of the second model parameter to the communication device.
  • the updated value of the second model parameter does not include the updated value of the first model parameter.
  • the value and update amount of model parameters are transmitted between the central server and the communication device through a message.
  • the central server sends a message to the communication device, and the message includes the updated value of the third model parameter and the value of the corresponding identification bit.
  • the value of the flag corresponding to the updated value of the first model parameter is used to indicate that the central server has not transmitted the updated value of the first model parameter to the communication device.
  • the central server sends a message to the communication device, and the message includes the updated value of the second model parameter and the value of the corresponding identification bit.
  • within the preset time period, the updated values of the second model parameters do not include the updated values of the first model parameters, and the value of the identification bit corresponding to the updated value of a second model parameter is used to indicate that the central server transmits the updated value of that second model parameter to the communication device.
  • for the rest of the content, reference may be made to the related descriptions of step S101 to step S103, which are not repeated here.
  • step S204 may be further included.
  • S204 The central server sends the second model parameter value to the communication device.
  • for the rest of the content, reference may be made to the related description of step S104, which is not repeated here.
  • the model-training communication methods in the machine-learning distributed system provided in the embodiment of this application and in the prior art are both implemented on the PyTorch platform for comparison.
  • the client in the distributed system of machine learning contains 50 communication devices.
  • the configuration of each communication device is m5.xlarge, including 2vCPUs, 8GB of memory, download bandwidth of 9Mbps, and upload bandwidth of 3Mbps.
  • the server side is a central server with a configuration of c5.9xlarge and a bandwidth of 10Gbps.
  • the PyTorch platform is installed on the communication devices and the central server to form a cluster, and the cluster is an elastic compute cloud (EC2) cluster.
  • the embodiment of the application mainly analyzes the experimental data results from the following three aspects to reflect the beneficial effects of the technical solution provided by this application: overall performance, convergence, and overhead.
  • the training data set includes a small data set CIFAR-10 for identifying universal objects and a keyword spotting data set (KWS).
  • the neural network models include LeNet, residual network (ResNet)-18, and a long short-term memory (LSTM) network with two recurrent layers.
  • LeNet and ResNet18 use the data set CIFAR-10 to train neural network models.
  • the recurrent layer LSTM uses the data set KWS to train the neural network model.
  • the first aspect is the comparison of overall performance.
  • the communication method for model training provided in the embodiments of this application and the communication method for model training in the prior art are adopted respectively, and the total amount of data transmitted between the communication device and the central server and the average time of each transmission are obtained and compared.
  • the second aspect is convergence analysis.
  • the abscissa in Fig. 7, Fig. 8 and Fig. 9 represents the communication rounds between the communication device and the central server, and the ordinate represents the accuracy rate of the neural network model or the proportion of the number of model parameters that stop transmission.
  • the thick solid line indicates the accuracy of the neural network model obtained by adopting the technical solution of this application.
  • the thin solid line represents the accuracy of the neural network model obtained by adopting the technical solution in the prior art.
  • the dashed line represents the ratio of the number of model parameters to the number of all model parameters that are stopped during the model training process after the technical solution of the embodiment of the present application is adopted.
  • it can be seen that, after the technical solution of this application is adopted, model convergence can still be achieved after a limited number of communication rounds, the accuracy is stable, and the accuracy of the neural network model is not reduced.
  • the third aspect is the cost comparison.
  • the overhead includes the time overhead and memory overhead of the three neural network models when the technical solution provided by the embodiments of this application is used for model-training communication, and the ratios by which the time overhead and the memory overhead increase compared with the model-training communication method in the prior art.
  • the technical solution provided by the embodiments of the present application can effectively reduce the total amount of data transmission and the average time of each transmission in terms of overall performance. In addition, it will not have a major impact on the accuracy of the neural network model and the overhead in the training process.
  • Fig. 10 shows a schematic diagram of a possible structure of the communication device based on model training involved in the foregoing embodiment.
  • the communication device 1000 based on model training includes: a processing unit 1001, a sending unit 1002, and a receiving unit 1003.
  • the processing unit 1001 is used to support the communication device 1000 based on model training to perform steps S101 and S102 in FIG. 3 and/or other processes used in the technology described herein.
  • the sending unit 1002 is configured to support the communication device 1000 based on model training to perform step S102 in FIG. 3 and/or other processes used in the technology described herein.
  • the receiving unit 1003 is configured to support the communication device 1000 based on model training to perform step S103 in FIG. 3 and/or other processes used in the technology described herein.
  • FIG. 11 shows a schematic diagram of a possible structure of the communication device based on model training involved in the foregoing embodiment.
  • a communication device 1100 based on model training includes: a receiving unit 1101, a processing unit 1102, and a sending unit 1103.
  • the receiving unit 1101 is configured to support the communication device 1100 based on model training to perform step S201 in FIG. 6 and/or other processes used in the technology described herein.
  • the processing unit 1102 is configured to support the communication device 1100 based on model training to perform step S202 in FIG. 6 and/or other processes used in the technology described herein.
  • the sending unit 1103 is configured to support the communication device 1100 based on model training to perform step S203 in FIG. 6 and/or other processes used in the technology described herein.
  • the embodiment of the present application also provides a chip system.
  • the chip system includes at least one processor 1201 and at least one interface circuit 1202.
  • the processor 1201 and the interface circuit 1202 may be interconnected by wires.
  • the interface circuit 1202 can be used to receive signals from other devices.
  • the interface circuit 1202 may be used to send signals to other devices (such as the processor 1201).
  • the interface circuit 1202 can read an instruction stored in the memory, and send the instruction to the processor 1201.
  • when the instruction is executed by the processor 1201, the communication device based on model training can be made to execute each step of the communication method based on model training in the foregoing embodiments.
  • the chip system may also include other discrete devices, which are not specifically limited in the embodiment of the present application.
  • the embodiment of the present application also provides a computer storage medium. The computer storage medium stores computer instructions, and when the computer instructions run on a communication device based on model training, the communication device based on model training executes the above related method steps, so as to implement the communication method based on model training in the above embodiments.
  • the embodiments of the present application also provide a computer program product, which when the computer program product runs on a computer, causes the computer to execute the above-mentioned related steps, so as to realize the communication method based on model training in the above-mentioned embodiment.
  • the embodiments of the present application also provide a device, which may specifically be a component or a module.
  • the device may include a processor and a memory that are connected, where the memory is used to store computer-executable instructions. When the device runs, the processor can execute the computer-executable instructions stored in the memory, so that the device executes the communication method based on model training in the foregoing method embodiments.
  • the device, computer storage medium, computer program product, or chip provided in the embodiments of the present application are all used to execute the corresponding method provided above. Therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method provided above, which are not repeated here.
  • the disclosed method can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, modules or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: flash memory, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk and other media that can store program instructions.


Abstract

This application provides a communication method, apparatus, and system based on model training, relates to the field of communication technologies, and is applied to a system including a central server and communication devices. It can effectively reduce the amount of data transmitted for parameter transfer between the communication device and the central server, and improve the communication efficiency of the federated learning process without losing model training accuracy. The method includes: the communication device determines the amount of change in the value of a first model parameter; if the communication device determines, according to the amount of change in the value of the first model parameter, that the first model parameter is stable, the communication device stops sending the update amount of the value of the first model parameter to the central server within a preset time period, where the update amount of the value of the first model parameter is determined by the communication device according to user data during model training; and the communication device receives the value of a second model parameter sent by the central server, where, within the preset time period, the value of the second model parameter does not include the value of the first model parameter.

Description

Communication method, apparatus, and system based on model training
This application claims priority to Chinese Patent Application No. 202010077048.8, filed with the China National Intellectual Property Administration on January 23, 2020 and entitled "Communication method, apparatus, and system based on model training", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communication technologies, and in particular, to a communication method, apparatus, and system based on model training.
Background
A federated learning (FL) system is an emerging basic artificial-intelligence technology. Its main idea is that a central server cooperates with multiple communication devices to build a machine learning model based on the data sets on the multiple communication devices. During the construction of the machine learning model, no data needs to be shared among the communication devices, which prevents data leakage.
In a federated learning system, the central server sends the values of the parameters of the machine learning model to the communication devices. Each communication device, acting as a cooperating unit, trains the model locally, updates the values of the parameters of the machine learning model, and sends the gradients of the values of the parameters to the central server. The central server generates new model parameter values according to the gradients of the values of the parameters. The above steps are repeated until the machine learning model converges, completing the entire model training process.
As machine learning models grow larger, the amount of model parameter data that needs to be transmitted between the communication devices and the central server also grows. Limited by the connection speed and bandwidth of the Internet, the delay of model parameter transmission between the communication devices and the central server is long, and the update rate of the model parameter values is slow.
发明内容
本申请提供的基于模型训练的通信方法、装置及系统,能够有效减小通信设备和服务器之间参数传输的数据量,在保证不损失模型训练精度的前提下,提高通信效率。
为达到上述目的,本申请采用如下技术方案:
第一方面,本申请提供一种基于模型训练的通信方法,应用于包括中心服务器和通信设备的系统中。该方法可以包括:通信设备确定第一模型参数取值的变化量。若通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则通信设备在预设时段内停止向中心服务器发送第一模型参数取值的更新量。其中,第一模型参数取值的更新量由通信设备在进行模型训练的过程中根据用户数据确定。通信设备接收中心服务器发送的第二模型参数取值。其中,在预设时段内,第二模型参数取值不包括第一模型参数取值。
其中,第一模型参数为参与确定是否稳定的模型参数,其数量不限定,可以为任意一个或多个模型参数。如果确定第一模型参数稳定,则该第一模型参数为预设时段内不参与传输更新量和取值的模型参数。第二模型参数为参与中心服务器和通信设备之间的更新量和取值传输的模型参数。第二模型参数的数量也不限定,可以为一个或多个。可选的,全部的第一模型参数和全部的第二模型参数组成了全部的模型参数。
其中,第一模型参数取值的变化量用于确定第一模型参数是否稳定。若第一模型 参数稳定,则表示第一模型参数已经收敛。后续通信过程中,其取值的变化量主要为小幅度振荡变化,对模型训练的意义不大。所以可以设置预设时段,在预设时段内,停止传输稳定的第一模型参数取值的更新量。可以理解的是,若通信设备在预设时段内不向中心服务器发送第一模型参数取值的更新量,则在该预设时段内中心服务器也不会生成并向通信设备发送第一模型参数更新后的取值,进而实现通信设备与中心服务器之间双向的数据量均减小。
其中,通信设备接收到中心服务器发送更新后的模型参数取值后,可以构建更新后的本地训练模型,并利用本地用户数据训练模型。在训练过程中,会基于用户数据对模型参数取值进行调整,进而获得模型参数取值的更新量,也即模型参数取值的梯度。
如此,在通信设备根据模型参数取值的变化量确定模型参数稳定后,可以在预设时段内停止传输模型参数取值的更新量,能够有效减小通信设备和服务器之间模型参数传输的数据量,在保证不损失模型训练精度的前提下,提高通信效率。
在一种可能的实现方式中,该方法还包括:在预设时段之后,通信设备向中心服务器发送第一模型参数取值的更新量以及接收中心服务器发送的第一模型参数取值。
也就是说,在预设时段之后,停止传输的第一模型参数自动开始参与传输。即通信设备可以实现自适应的根据模型参数取值的变化量调整与中心服务器之间传输的数据量的大小。
在一种可能的实现方式中,通信设备和中心服务器之间通过报文传输模型参数取值和取值的更新量。
在报文的一种设计中,通信设备在预设时段内停止向中心服务器发送第一模型参数取值的更新量,包括:通信设备向中心服务器发送报文,报文包含第三模型参数取值的更新量与对应的标识位的取值;第三模型参数取值的更新量包括第二模型参数取值的更新量和第一模型参数取值的更新量。其中,在预设时段内,第一模型参数取值的更新量对应的标识位的取值用于表示通信设备未向中心服务器传输第一模型参数取值的更新量。
其中,报文中包含的标识位例如可以包括比特位映射,每个比特位对应一个模型参数的数据。标识位的取值例如可以设置0/1进行区分等。
如此,中心服务器接收到通信设备发送的报文后,可以根据报文中携带的模型参数取值的更新量对应各个标识位的取值,确定标识位对应的模型参数取值的更新量是否传输,进而可以快速读取已传输模型参数的取值的更新量。也可以快速确定已停止传输的第一模型参数取值的更新量,在返回给通信设备的数据中,不包含该第一模型参数更新后的取值。
在报文的另一种设计中,通信设备向中心服务器发送报文,报文包含第二模型参数取值的更新量与对应的标识位的取值。其中,在预设时段内,第二模型参数取值的更新量不包括第一模型参数取值的更新量,第二模型参数对应的标识位的取值用于表示通信设备向中心服务器传输第二模型参数取值的更新量。
如此,报文中只包含第二模型参数取值的更新量以及其对应的标识位,进而中心服务器可以直接根据接收到的报文确定传输的第二模型参数取值的更新量,后续直接 返回第二模型参数更新后的取值即可。可以进一步减少传输的数据量并且可以加快模型训练的进程。
在其他可能的设计中,报文中直接不包含第一模型参数取值的更新量。通信设备和中心服务器之间预先设置好需要传输的报文中第三模型参数取值的更新量的传输顺序。之后,通信设备确定第一模型参数稳定后,即使得传输给中心服务器的报文中直接不包含第一模型参数取值的更新量,中心服务器也可以根据空缺的比特位,确定停止传输的是哪些第一模型参数取值的更新量,也可以获知传输的数据对应的是哪些第二模型参数取值的更新量。
在一种可能的实现方式中,通信设备确定第一模型参数取值的变化量,包括:通信设备根据第一模型参数取值的历史信息获得第一模型参数取值的变化量。比如,获得第一模型参数的历史取值,或者历史取值的变化量,进而可以结合此次第一模型参数取值计算出此次第一模型参数取值的变化量。
在一种可能的实现方式中,取值的历史信息包括:第一模型参数取值的有效变化量和第一模型参数取值的累计变化量。
其中,此次第一模型参数取值的有效变化量和累计变化量可以根据获得的所有第一模型参数取值的有效变化量和累计变化量获得。或者,可以根据最近几次检测获得的第一模型参数取值的有效变化量和累计变化量获得。或者,可以根据上一次检测获得的第一模型参数取值的有效变化量和累计变化量获得。之后,根据第一模型参数取值的有效变化量和累计变化量的比值可以获得第一模型参数取值的变化量。
示例性的,可以采用指数移动平均(exponential moving average,EMA)的方法,根据上一次检测获得的第一模型参数取值的有效变化量和第一模型参数取值的累计变化量,上一次以及此次检测获得的第一模型参数取值,获得此次第一模型参数取值的有效变化量和累计变化量,进而获得第一模型参数取值的变化量。并且,采用EMA的方法获得第一模型参数取值的变化量,只需保存上一次获得的第一模型参数取值的有效变化量和累计变化量,以及上一次获得的第一模型参数取值,可以有效减小占用的存储空间。
在一种可能的实现方式中,通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,包括:若第一模型参数取值的变化量小于预设阈值,则确定第一模型参数稳定。
示例性的,可以根据实验数据、专家经验值等确定预设阈值。若第一模型参数取值的变化量小于预设阈值,则表明当前第一模型参数已经收敛稳定,继续传输也不会对模型的训练产生较大的贡献,故可以停止传输该第一模型参数取值的更新量。
在一种可能的实现方式中,通信设备确定第一模型参数取值的变化量,包括:通信设备M次确定第一模型参数取值的变化量;其中,M为大于等于2的正整数。通信设备根据第一模型参数第k次的稳定状态确定预设时段满足的预设条件包括:若第k次根据第一模型参数取值的变化量确定第一模型参数稳定,则预设时段的时长为第一时长,第一时长大于第二时长。其中,第二时长为第k-1次确定第一模型参数稳定时,停止向中心服务器发送第一模型参数取值的更新量的预设时段的时长;或者,第二时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定 时对应的预设时段时长的时长;其中,k为正整数,k≤M。
在一种可能的实现方式中,预设条件还包括:若第k次确定第一模型参数未稳定,则得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的第三时长,第三时长小于第四时长;其中,第四时长为第k-1次确定第一模型参数稳定时,停止向中心服务器发送第一模型参数取值的更新量的预设时段的时长,或者,第四时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长。
也就是说,在通信设备和中心服务器之间传输模型参数的更新量或取值的过程中,每一次检测确定第一模型参数取值的变化量后,都会获得一个时长。若此次确定第一模型参数稳定,表明第一模型参数收敛,需要增加该第一模型参数停止传输的时长,则获得的时长为预设时段的时长,并且该时长大于上一次检测获得的时长。若此次确定第一模型参数不稳定,表明第一模型参数还未收敛,需要减少该第一模型参数下一次停止传输的时长,则获得的时长为用于调节下一次检测获得时长的时长,并且该时长小于上一次检测获得的时长。
如此,可以根据第一模型参数的稳定状态动态的调节停止传输的预设时段时长的时长,灵活的控制通信设备和中心服务器之间传输的参数模型训练的模型参数数量,进而保证最后获得的模型的精度满足要求。
在一种可能的实现方式中,若通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则通信设备在预设时段内停止向中心服务器发送第一模型参数取值的更新量,包括:若第k次通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则在n个发送周期内停止向中心服务器发送第一模型参数取值的更新量;其中,n为正整数。在n个发送周期之后,若第k+1次通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则在(n+m)个发送周期内停止向中心服务器发送第一模型参数取值的更新量;其中,m为正整数。在n个发送周期之后,若第k+1次通信设备根据第一模型参数取值的变化量确定第一模型参数未稳定,则若第k+2次通信设备根据变化量确定第一模型参数稳定,在(n/r+m)个发送周期内停止向中心服务器发送第一模型参数取值的更新量;其中,r为大于等于2的正整数,(n/r)≥1。其中,发送周期为通信设备向中心服务器发送模型参数取值的更新量的周期。
也就是说,停止传输第一模型参数取值的更新量的预设时段时长可以基于发送周期的时长进行调整。在确定第一模型参数稳定后,在上一次的得到的时长上增加整数个发送周期时长得到此次停止传输的预设时段。在确定第一模型参数不稳定后,将上一次得到的时长等比例减小得到此次的时长。
比如,假设第5次(第k次)检测,第一模型参数稳定获得的预设时段时长为2个发送周期时长。
若到达预设时段之后的第6次(第k+1次)检测第一模型参数稳定,则预设时段的时长可以为2+1=3个发送周期时长。
若到达预设时段之后的第6次(第k+1次)检测第一模型参数不稳定,则获得的时长可以为2/2=1个发送周期时长。若第7次(第k+2次)检测第一模型参数稳定,则获得的预设时段的时长可以为2/2+1=2个发送周期时长。
在一种可能的实现方式中,该方法还包括:通信设备按照检测周期确定第一模型参数取值的变化量;发送周期小于检测周期。
由于通信设备在发送第一模型参数取值的更新量并获得更新后的第一模型参数取值后,判断第一模型参数取值的稳定状态才有意义,所以发送周期要小于检测周期。
在一种可能的实现方式中,该方法还包括:若通信设备确定稳定的第一模型参数的数量与第三模型参数的数量的比例大于预设比例,则减小预设阈值的取值。
也就是说,在停止传输的模型参数的数量到达一定比例后,需要降低模型参数是否稳定的判定阈值,进而减少停止传输的模型参数的数量,保证参与模型训练的模型参数中为更新后的取值的模型参数的数量。
如此,动态的调节预设阈值,避免停止传输的模型参数的数量过多,使得模型训练的过程延长或者影响模型最终的精度。比如,当大量模型参数停止传输后,在预设时段内只能应用各个停止传输的模型参数的不变的取值训练模型,可能训练效果并不理想。只能等待各个停止传输的模型参数的预设时段的时长到达后,才可以获得这些模型参数更新后的取值,继续训练模型,导致模型到达模型收敛条件的训练时间过长。
在一种可能的实现方式中,通信设备确定第一模型参数取值的变化量之前,方法还包括:通信设备接收中心服务器发送的第二模型参数取值。例如发送周期为5s,检测周期为10s。则通信设备每隔5s向中心服务器发送第一模型参数取值的更新量,同样的中心服务器每隔5s向通信设备发送第一模型参数取值。而通信设备每隔10s确认第一模型参数取值的变化量,基于变化量确定第一模型参数是否稳定。
第二方面,本申请提供一种基于模型训练的通信方法,应用于包括中心服务器和通信设备的系统中。该方法包括:中心服务器接收通信设备发送的第二模型参数取值的更新量;在预设时段内,第二模型参数取值的更新量中不包含第一模型参数取值的更新量。其中,第一模型参数为根据第一模型参数取值的变化量确定稳定的模型参数。中心服务器根据第二模型参数取值的更新量确定第二模型参数更新后的取值。中心服务器向通信设备发送第二模型参数更新后的取值。在预设时段内,第二模型参数更新后的取值中不包含第一模型参数更新后的取值。
其中,第一模型参数取值的更新量和第二模型参数的更新量由通信设备在进行模型训练的过程中根据用户数据确定。即,通信设备在进行模型训练的过程中会根据用户数据确定模型参数取值的更新量。并且,通信设备会根据模型参数的稳定状态确定需要向中心服务器停止传输的模型参数的取值的更新量。
也就是说,中心服务器会根据接收的模型参数取值的更新量判断通信设备是否已经停止传输部分模型参数。若已经停止传输部分模型参数,则中心服务器也会在预设时段内,停止传输这部分通信设备停止传输的模型参数。进而实现双向减少传输的数据量,提高通信效率。
在一种可能的实现方式中,该方法还包括:在预设时段之后,中心服务器向通信设备发送第一模型参数取值以及接收通信设备发送的第一模型参数取值的更新量。
在预设时段之后,通信设备会自动开始向中心服务器发送模型参数取值的更新量,进而中心服务器会根据模型参数取值的更新量确定模型参数更新后的取值,则实现在预设时段之后,中心服务器向通信设备传输之前停止传输的模型参数更新后的取值。 如此,实现了中心服务器与通信设备之间双向的自适应通信数据量的调整。
在一种可能的实现方式中,中心服务器向通信设备发送第二模型参数更新后的取值;在预设时段内,第二模型参数更新后的取值中不包含第一模型参数更新后的取值,包括:中心服务器向通信设备发送报文,报文包含第三模型参数更新后的取值与对应的标识位的取值;第三模型参数更新后的取值包括第二模型参数更新后的取值和第一模型参数更新后的取值。其中,在预设时段内,第一模型参数更新后的取值对应的标识位的取值用于表示中心服务器未向通信设备传输第一模型参数更新后的取值。或者,中心服务器向通信设备发送报文,报文包含第二模型参数更新后的取值与对应的标识位的取值。其中,在预设时段内,第二模型参数更新后的取值不包括第一模型参数更新后的取值,第二模型参数更新后的取值对应的标识位的取值用于表示中心服务器向通信设备传输第二模型参数更新后的取值。
在一种可能的实现方式中,第一模型参数为根据第一模型参数取值的变化量确定稳定的模型参数,包括:若第一模型参数取值的变化量小于预设阈值,则确定第一模型参数稳定。
在一种可能的实现方式中,中心服务器接收通信设备发送的第二模型参数取值的更新量之前,方法还包括:中心服务器向通信设备发送第二模型参数取值。
也就是说,中心服务器和通信设备之间模型参数传输的过程为一种循环交互的过程,需要中心服务器向通信设备发送模型参数取值后,通信设备可以基于模型参数取值获得模型参数取值的更新量,进而中心服务器可以获得该模型参数取值的更新量。
第三方面,本申请提供一种基于模型训练的通信装置,该装置包括:处理单元,发送单元,接收单元。处理单元,用于确定第一模型参数取值的变化量。处理单元,还用于根据第一模型参数取值的变化量确定第一模型参数是否稳定。发送单元,用于向中心服务器发送第一模型参数取值的更新量;若处理单元根据第一模型参数取值的变化量确定第一模型参数稳定,则发送单元在预设时段内停止向中心服务器发送第一模型参数取值的更新量。其中,第一模型参数的取值的更新量由处理单元在进行模型训练的过程中根据用户数据确定。接收单元,用于接收中心服务器发送的第二模型参数取值;其中,在预设时段内,第二模型参数取值不包括第一模型参数取值。
在一种可能的实现方式中,发送单元,还用于在预设时段之后,向中心服务器发送第一模型参数取值的更新量。接收单元,还用于在预设时段之后,接收中心服务器发送的第一模型参数取值。
在一种可能的实现方式中,发送单元具体用于:向中心服务器发送报文,报文包含第三模型参数取值的更新量与对应的标识位的取值;第三模型参数取值的更新量包括第二模型参数取值的更新量和第一模型参数取值的更新量。其中,在预设时段内,第一模型参数取值的更新量对应的标识位的取值用于表示发送单元未向中心服务器传输第一模型参数取值的更新量。或者,向中心服务器发送报文,报文包含第二模型参数取值的更新量与对应的标识位的取值。其中,在预设时段内,第二模型参数取值的更新量不包括第一模型参数取值的更新量,第二模型参数取值的更新量对应的标识位的取值用于表示发送单元向中心服务器传输第二模型参数取值的更新量。
在一种可能的实现方式中,处理单元具体用于:根据第一模型参数取值的历史信 息获得第一模型参数取值的变化量。
在一种可能的实现方式中,取值的历史信息包括:第一模型参数取值的有效变化量和第一模型参数取值的累计变化量。
在一种可能的实现方式中,处理单元具体用于:根据第一模型参数取值的变化量确定第一模型参数是否稳定,若第一模型参数取值的变化量小于预设阈值,则确定第一模型参数稳定。
在一种可能的实现方式中,处理单元具体用于:M次确定第一模型参数取值的变化量。其中,M为大于等于2的正整数。根据第一模型参数第k次的稳定状态确定预设时段满足的预设条件包括:若第k次根据第一模型参数取值的变化量确定第一模型参数稳定,则预设时段的时长为第一时长,第一时长大于第二时长;其中,第二时长为第k-1次确定第一模型参数稳定时,停止向中心服务器发送第一模型参数取值的更新量的预设时段的时长,或者,第二时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长;其中,k为正整数,k≤M。
在一种可能的实现方式中,预设条件还包括:若第k次确定第一模型参数未稳定,则得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的第三时长,第三时长小于第四时长;其中,第四时长为第k-1次确定第一模型参数稳定时,停止向中心服务器发送第一模型参数取值的更新量的预设时段的时长,或者,第四时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长。
在一种可能的实现方式中,若处理单元第k次根据第一模型参数取值的变化量确定第一模型参数稳定,则在n个发送周期内,发送单元停止向中心服务器发送第一模型参数取值的更新量。其中,n为正整数。在n个发送周期之后,若处理单元第k+1次根据第一模型参数取值的变化量确定第一模型参数稳定,则在(n+m)个发送周期内,发送单元停止向中心服务器发送第一模型参数取值的更新量。其中,m为正整数。在n个发送周期之后,若第k+1次处理单元根据第一模型参数取值的变化量确定第一模型参数未稳定,则若处理单元第k+2次根据变化量确定第一模型参数稳定,在(n/r+m)个发送周期内,发送单元停止向中心服务器发送第一模型参数取值的更新量;其中,r为大于等于2的正整数,(n/r)≥1。其中,发送周期为发送单元向中心服务器发送模型参数取值的更新量的周期。
在一种可能的实现方式中,处理单元还用于:按照检测周期确定第一模型参数取值的变化量;发送周期小于检测周期。
在一种可能的实现方式中,处理单元还用于:若处理单元确定稳定的第一模型参数的数量与第三模型参数的数量的比例大于预设比例,则减小预设阈值的取值。
在一种可能的实现方式中,处理单元确定第一模型参数取值的变化量之前,接收单元接收中心服务器发送的第二模型参数取值。
第四方面,本申请提供一种基于模型训练的通信装置,该装置包括:接收单元,处理单元,发送单元。接收单元,用于接收通信设备发送的第二模型参数取值的更新量;在预设时段内,第二模型参数取值的更新量中不包含第一模型参数取值的更新量; 其中,第一模型参数为根据第一模型参数取值的变化量确定稳定的模型参数。处理单元,用于根据第二模型参数取值的更新量确定第二模型参数更新后的取值。发送单元,用于向通信设备发送第二模型参数更新后的取值;在预设时段内,第二模型参数更新后的取值中不包含第一模型参数更新后的取值。
在一种可能的实现方式中,发送单元,还用于在预设时段之后,向通信设备发送第一模型参数取值。接收单元,还用于接收通信设备发送的第一模型参数取值的更新量。
在一种可能的实现方式中,发送单元具体用于:向通信设备发送报文,报文包含第三模型参数更新后的取值与对应的标识位的取值;第三模型参数更新后的取值包括第二模型参数更新后的取值和第一模型参数更新后的取值。其中,在预设时段内,第一模型参数更新后的取值对应的标识位的取值用于表示发送单元未向通信设备传输第一模型参数更新后的取值。或者,向通信设备发送报文,报文包含第二模型参数更新后的取值与对应的标识位的取值;其中,在预设时段内,第二模型参数更新后的取值不包括第一模型参数更新后的取值,第二模型参数更新后的取值对应的标识位的取值用于表示发送单元向通信设备传输第二模型参数更新后的取值。
在一种可能的实现方式中,第一模型参数为根据第一模型参数取值的变化量确定稳定的模型参数,包括:若第一模型参数取值的变化量小于预设阈值,则确定第一模型参数稳定。
在一种可能的实现方式中,接收单元接收通信设备发送的第二模型参数取值的更新量之前,发送单元向通信设备发送第二模型参数取值。
第五方面,本申请提供一种通信设备,包括:一个或多个处理器;存储器;以及计算机程序,其中计算机程序被存储在存储器中,计算机程序包括指令;当指令被通信设备执行时,使得通信设备执行如上述第一方面及其中任一种可能的实现方式中的基于模型训练的通信方法。
第六方面,本申请提供一种通信设备,包括:一个或多个处理器;存储器;以及计算机程序,其中计算机程序被存储在存储器中,计算机程序包括指令;当指令被中心服务器执行时,使得中心服务器执行如上述第二方面及其中任一种可能的实现方式中的基于模型训练的通信方法。
第七方面,本申请提供一种通信系统,包括:中心服务器和至少一个通信设备,至少一个通信设备执行如上述第一方面及其中任一种可能的实现方式中的基于模型训练的通信方法。中心服务器执行如上述第二方面及其中任一种可能的实现方式中的基于模型训练的通信方法。
第八方面,本申请提供一种通信设备,该通信设备具有实现如上述第一方面至第二方面,以及其中任一种可能的实现方式中所述的基于模型训练的通信方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第九方面,本申请提供一种计算机存储介质,包括计算机指令,当计算机指令在基于模型训练的通信装置上运行时,使得基于模型训练的通信装置执行如上述第一方面至第二方面,以及其中任一种可能的实现方式中所述的基于模型训练的通信方法。
第十方面,本申请提供一种计算机程序产品,当计算机程序产品在基于模型训练的通信装置上运行时,使得基于模型训练的通信装置执行如上述第一方面至第二方面,以及其中任一种可能的实现方式中所述的基于模型训练的通信方法。
第十一方面,提供一种电路系统,电路系统包括处理电路,处理电路被配置为执行如上述第一方面至第二方面,以及其中任一种可能的实现方式中所述的基于模型训练的通信方法。
第十二方面,本申请实施例提供一种芯片系统,包括至少一个处理器和至少一个接口电路,至少一个接口电路用于执行收发功能,并将指令发送给至少一个处理器,当至少一个处理器执行指令时,至少一个处理器执行如上述第一方面至第二方面,以及其中任一种可能的实现方式中所述的基于模型训练的通信方法。
其中,第二方面至第十二方面中任一种设计方式所带来的技术效果可参见第一方面中不同设计方式所带来的技术效果,此处不再赘述。
附图说明
图1为本申请实施例提供的一种基于模型训练的通信方法的应用场景示意图;
图2为本申请实施例提供的一种通信设备的硬件结构示意图;
图3为本申请实施例提供的基于模型训练的通信方法的流程示意图一;
图4为本申请实施例提供的一种报文结构示意图;
图5为本申请实施例提供的基于模型训练的通信方法的流程示意图二;
图6为本申请实施例提供的基于模型训练的通信方法的流程示意图三;
图7为本申请实施例提供的一种实验数据结果分析图一;
图8为本申请实施例提供的一种实验数据结果分析图二;
图9为本申请实施例提供的一种实验数据结果分析图三;
图10为本申请实施例提供的一种基于模型训练的通信装置的结构示意图一;
图11为本申请实施例提供的一种基于模型训练的通信装置的结构示意图二;
图12为本申请实施例提供的一种芯片系统的结构示意图。
具体实施方式
下面结合附图对本申请实施例提供的基于模型训练的通信方法、装置及系统进行详细地描述。
本申请实施例的方案主要应用于机器学习的分布式系统,该机器学习的分布式系统中包含中心服务器和多个通信设备。每一通信设备拥有各自的处理器以及存储器,各自具有独立的数据处理功能。在该机器学习的分布式系统中,各个通信设备地位相同,存储有用户数据,且各个通信设备之间不共享用户数据,能够在保障大数据交换时的信息安全、保护终端数据和个人数据隐私的前提下,在多通信设备之间开展高效率的机器学习。
示例性的,如图1所示,机器学习的分布式系统100包括中心服务器10和至少一个通信设备20,例如图1中的通信设备1、通信设备2、通信设备3和通信设备4。中心服务器10和至少一个通信设备20之间可以通过有线网络或者无线网络连接。本申请实施例对中心服务器10和至少一个通信设备20之间的连接方式不做具体限定。
其中，中心服务器10可以是云服务器或者网络服务器等具有计算功能的设备或服务器。中心服务器10可以是一台服务器，也可以是由多台服务器组成的服务器集群，或者是一个云计算服务中心。中心服务器10中存储有神经网络模型，可以通过交互接口将神经网络模型参数取值发送至至少一个通信设备20。
其中,通信设备20也可以称之为主机,模型训练主机等。通信设备20既可以为服务器,也可以为终端设备。可以提供相关的人机交互界面,以便采集基于模型训练的本地用户数据。例如可以包括手机(mobile phone)、平板电脑(Pad)、带无线收发功能的电脑、虚拟现实(virtual reality,VR)电子设备、增强现实(augmented reality,AR)电子设备、工业控制(industrial control)中的无线终端、无人驾驶(self driving)中的无线终端、远程医疗(remote medical)中的无线终端、智能电网(smart grid)中的无线终端、运输安全(transportation safety)中的无线终端、智慧城市(smart city)中的无线终端、智慧家庭(smart home)中的无线终端、车载终端、人工智能(Artificial Intelligence,AI)终端等。本申请实施例对通信设备的具体形态不作特殊限制。
通信设备20中存储有本地数据。通信设备20可以接收中心服务器10发送的神经网络模型参数取值构建本地模型,并将本地数据作为训练数据训练本地模型。可以理解的是,数据量越大的训练数据训练出的模型性能越优。由于单个通信设备20中包含的本地数据的数据量有限,导致训练后的本地模型的精度有限。若为增大训练数据中包含的数据量,而在多个通信设备20间进行传输模型参数取值或取值的更新量,不利于保护隐私数据。
出于保护隐私数据以及数据安全的考虑,机器学习的分布式系统100可以在无需通信设备20间共享本地数据的前提下,优化各个通信设备20中的本地模型。如图1所示,各个通信设备20间不必进行本地数据共享,而是将训练前后本地模型各个参数的取值的更新量发送至中心服务器10,中心服务器10利用各个通信设备20发送的模型参数取值的更新量训练模型,并将训练好的模型参数取值发送至各个通信设备20。各个通信设备20再利用本地数据训练利用更新后的模型参数取值构建的本地模型。如此,在循环上述步骤后,中心服务器10可以获得性能更优的神经网络模型并将其发送至各个通信设备20。其中,更新量也可以称之为梯度。
需要说明的是,有些文献中将上述机器学习的过程描述为联邦学习。因此,上述机器学习的分布式系统也可以描述为联邦学习的分布式系统。本申请实施例中的技术方案可以为基于模型训练的通信方法,也可以描述为基于联邦学习的通信方法。即联邦学习过程也可以描述为模型训练过程。
上述机器学习的分布式系统100可以应用于如下场景中:
场景1:改善手机输入法输入性能的场景。
示例性的,手机输入法可以根据用户当前输入词汇预测并显示后续可能会输入的词汇。其中运用的预测模型需要基于用户数据训练以实现提高其预测精度。但是,某些用户个人数据,例如用户访问的网站,用户的旅行地点等敏感数据,不能被直接共享。
如此，Google推出了基于机器学习的分布式系统改善手机输入词汇预测模型的方法。首先，Google中的中心服务器会将预测模型参数取值发送给多个手机，之后可以获得各个手机基于各自本地用户数据训练预测模型得到的模型参数取值的更新量，中心服务器通过对各个手机反馈的预测模型参数取值的更新量进行平均叠加得到统一的新的预测模型。如此，在手机不必共享本地用户数据的前提下，通过不断的迭代更新，最终可以得到预测精度高的预测模型，进而改善手机输入法性能。
机器学习的分布式系统的应用场景可以包括上述场景1中基于大量通信设备的场景,还可以包括基于有限数量的通信设备的场景,如下述场景2。
场景2:改善医疗预测模型的场景。
示例性的,医院中用于开发治疗方法以及获得治疗预测结果的预测模型的建立需要的训练数据为病人数据。应用病人数据进行模型训练,实际和潜在侵犯病人隐私的后果可能很严重。
如此,可以基于机器学习的分布式系统,在不共享各自医院病人数据的前提下,每个医院利用各自拥有的病人数据训练预测模型,中心服务器综合所有预测模型参数取值的更新量,最终获得综合所有医院病人数据的精度较高的预测模型,并将最终的模型参数取值发送至各个医院,进而改善各个医院的医疗预测模型。
图2所示为本申请实施例提供的通信设备20的硬件结构示意图。通信设备20包括总线110、处理器120、存储器130、用户输入模块140、显示模块150、通信接口160和其它相似和/或合适组件。
总线110可以是将上述元件相互连接并在上述元件之间传递通信的电路。
处理器120可以通过总线110从上述其它元件(例如存储器130、用户输入模块140、显示模块150、通信接口160等)接收命令,可以解释接收到的命令,并可以根据所解释的命令来执行计算或数据处理。
处理器120可以是一个通用中央处理器(central processing unit,CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制本申请方案程序执行的集成电路。
在一些实施例中,处理器120用于根据接收到的中心服务器10发送的模型参数取值建立本地模型,并利用本地用户数据训练本地模型,获得模型参数取值的更新量。
存储器130可以存储从处理器120或其它元件(例如用户输入模块140、显示模块150、通信接口160等)接收的命令或数据或者由处理器120或其它元件产生的命令或数据。
存储器130可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器130可以是独立存在,通过总线110与处理器120相连接。存储器130也可以和处理器120集成在一起。
其中，存储器130用于存储用于实现本申请方案的计算机执行指令，并由处理器120来控制执行。处理器120用于执行存储器130中存储的计算机执行指令，从而实现本申请下述实施例提供的基于模型训练的通信方法。
可选的,本申请实施例中的计算机执行指令也可以称之为应用程序代码、指令、计算机程序或者其它名称,本申请实施例对此不作具体限定。
在一些实施例中,存储器130中存储有通信设备20通过用户输入模块140,基于人机交互的方式输入的用户数据,或者通信设备20通过其他方式获得的用户数据。处理器120可以通过调用存储器130中存储的用户数据进行模型训练。进一步的处理器120也可以在线获取用户数据进行模型训练,对此本申请实施例不做具体限定。
用户输入模块140可以接收经由输入-输出手段(例如,传感器、键盘、触摸屏等)从用户输入的数据或命令,并可以通过总线110向处理器120或存储器130传送接收到的数据或命令。
显示模块150可以显示从上述元件接收到的各种信息(例如多媒体数据、文本数据)。
通信接口160可以控制通信设备20与中心服务器10的通信。当通信设备20与中心服务器10配对连接时,通信接口160可以接收中心服务器10发送的模型参数取值,以及向中心服务器发送模型参数取值的更新量,并且也可以控制模型参数取值的更新量的发送周期。其中,模型参数取值的更新量的发送周期也可以由处理器120控制通信接口160执行。
根据本申请公开的各实施例,通信接口160可以直接或者通过网络161与中心服务器10进行通信。例如,通信接口160可以操作为将通信设备20连接至网络161。
应该理解的是,图示通信设备20的硬件结构仅是一个范例,并且通信设备20可以具有比图2中所示出的更多的或者更少的部件,可以组合两个或更多的部件,或者可以具有不同的部件配置。
目前,基于机器学习的分布式系统的模型训练的通信方法,一般是中心服务器将所有模型参数取值均发送到通信设备,通信设备根据模型参数取值获得所有模型参数取值的更新量,并将所有模型参数取值的更新量均发送回中心服务器。数据传输过程中,中心服务器和通信设备之间传输的数据量较大。但实际上,在模型收敛之前的早期阶段,部分模型参数已经收敛稳定,后续的模型参数取值的变化仅为小幅度振荡变化,对模型训练的意义不大。
有鉴于此，本申请的实施例提供了一种基于模型训练的通信方法，可以基于传输的模型参数取值的变化量确认模型参数是否稳定，进而确认是否停止传输对应的模型参数取值的更新量。如此，可以在不损失模型训练精度的前提下，减少传输的数据量，提高通信效率。
如图3所示，为本申请实施例提供的一种基于模型训练的通信方法的流程示意图，该方法可以包括S101-S103：
S101、通信设备确定第一模型参数取值的变化量。
其中，第一模型参数可以为通信设备接收中心服务器发送的模型参数中的至少一个模型参数，即可以确定接收的所有模型参数取值的变化量，也可以仅确定接收的所有模型参数取值中部分模型参数取值的变化量。比如，通信设备接收到100个模型参数取值，这100个模型参数取值中的2个模型参数取值对模型训练过程影响较大。因此，在模型训练过程中，需要保证持续更新这2个模型参数取值。故，第一模型参数取值的数量为98个，在通信设备接收到中心服务器发送的100个模型参数取值后，通信设备会确定其中除上述2个模型参数取值之外的98个第一模型参数取值的变化量。
示例性的,通过上文描述可知,在机器学习的分布式系统中,通信设备会接收中心服务器发送的第一模型参数取值,通信设备可以根据接收的第一模型参数取值的历史信息获得第一模型参数取值的变化量。
其中,历史信息可以包括第一模型参数取值的有效变化量和第一模型参数取值的累计变化量。
在一些实施例中,采用指数移动平均(exponential moving average,EMA)的方法,根据第一模型参数取值的有效变化量和第一模型参数取值的累计变化量获得第一模型参数取值的变化量。
比如，通信设备根据第k次获取的第一模型参数取值的有效变化量E_k和第一模型参数取值的累计变化量A_k的比值，得到第一模型参数取值的变化量P；其中，变化量P=|E_k|/A_k，有效变化量E_k和第一模型参数取值的累计变化量A_k的初始值均为0。
其中，若k=1，则E_1=(1-α)Δ_1，A_1=(1-α)|Δ_1|，第一模型参数取值的变化量P=1。
其中，若k为大于等于2的正整数，则有效变化量E_k=αE_(k-1)+(1-α)Δ_k，累计变化量A_k=αA_(k-1)+(1-α)|Δ_k|。其中，α为权重参数，用于表示权重的衰减程度，0<α<1；E_(k-1)为第一模型参数取值在k-1次的有效变化量；A_(k-1)为第一模型参数取值在k-1次的累计变化量；Δ_k为第k次与第k-1次获得的第一模型参数取值的差值。其中，α为变量，会根据计算次数指数递减，其具体的获得方式可参见现有技术，对此本申请实施例不再赘述。
示例性的，若k=2，即当前为第2次通信设备获得第一模型参数取值，假设第1次第一模型参数取值为-4，第2次第一模型参数取值为6，α=0.5。则第2次第一模型参数取值的有效变化量E_2=αE_1+(1-α)Δ_2=α(1-α)Δ_1+(1-α)Δ_2=0.5*(1-0.5)*(-4)+(1-0.5)*(6-(-4))=4，第一模型参数取值的累计变化量A_2=αA_1+(1-α)|Δ_2|=0.5*(1-0.5)*|-4|+(1-0.5)*|6-(-4)|=6，进而得到第一模型参数取值的变化量P=|E_2|/A_2=4/6≈0.67。
如此,通信设备的存储器中只需要存储上一次接收到的第一模型参数取值,上一次确定的第一模型参数取值的有效变化量以及累计变化量,就可以根据此次接收到的第一模型参数取值,确定此次第一模型参数取值的有效变化量以及累计变化量,进而确定第一模型参数取值的变化量。利用较低的存储空间,就可以确定第一模型参数取值的变化量。
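为便于理解上述基于指数移动平均计算第一模型参数取值变化量的过程，下面给出一段示意性的Python代码（仅为在上述公式假设下的示例草图，其中的类名、变量名均为便于说明而假设，并非本申请限定的实现方式）：

```python
class EmaStabilityTracker:
    """按上文公式维护单个模型参数的有效变化量E、累计变化量A，并计算变化量P=|E|/A。"""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # 权重参数α，0<α<1
        self.last_value = 0.0   # 上一次接收到的参数取值（初始为0）
        self.effective = 0.0    # 有效变化量E，初始为0
        self.accumulated = 0.0  # 累计变化量A，初始为0

    def update(self, new_value):
        delta = new_value - self.last_value                 # Δ_k：本次与上一次取值之差
        a = self.alpha
        self.effective = a * self.effective + (1 - a) * delta          # E_k
        self.accumulated = a * self.accumulated + (1 - a) * abs(delta) # A_k
        self.last_value = new_value
        if self.accumulated == 0:
            return 1.0                                      # 尚无有效历史时视为未稳定
        return abs(self.effective) / self.accumulated       # 变化量P

# 复现正文中k=2的算例：第1次取值-4，第2次取值6，α=0.5
tracker = EmaStabilityTracker(alpha=0.5)
tracker.update(-4)       # 第1次，P=1
p = tracker.update(6)    # 第2次，E_2=4，A_2=6
print(p)                 # 约0.6667
```

如上所示，每个参数只需保存上一次取值、上一次的有效变化量和累计变化量三个标量即可完成更新，与正文所述的低存储开销一致。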
在又一些实施例中,通信设备可以基于最近几次检测中每一次的第一模型参数取值的变化量确定此次第一模型参数取值的变化量。
比如，通信设备获得最近a次第一模型参数取值的变化量分别为k_1,k_2,k_3…k_a；则第一模型参数取值的有效变化量为这a次变化量之和E=k_1+k_2+…+k_a，第一模型参数取值的累计变化量为这a次变化量绝对值之和A=|k_1|+|k_2|+…+|k_a|。由此，第一模型参数取值的变化量P=|E|/A。其中，b为自然数，1≤b≤a，k_b为用于计算的中间数据。例如，a=3，则第6次检测，第一模型参数取值的有效变化量为最近3次差值之和E=k_0+k_1+k_2；其中，k_0表示第6次检测与第5次检测获得的第一模型参数取值的差值。
可以理解的是,若通信设备中保存的数据量不足时,可以将保存的全部数据用于确定第一模型参数取值的变化量。示例性的,假设a=10,则通信设备中需要保存最近10次第一模型参数取值的变化量。在保存的第一模型参数取值的变化量的数量不足时,利用全部保存的第一模型参数取值的变化量获得此次第一模型参数取值的变化量。例如,第6次检测,存储器中只保留了前5次的数据,不足10次,则利用这5次的数据获得此次第一模型参数取值的变化量。
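对于上述基于最近a次变化量的计算方式，下面给出一段示意性的Python代码（同样仅为示例草图，假设变化量取窗口内差值之和与差值绝对值之和的比值，窗口大小、数据均为假设值）：

```python
from collections import deque

def windowed_variation(recent_deltas, window=10):
    """recent_deltas为最近若干次检测得到的参数取值差值（新值在末尾）。
    保存的数据不足window次时，使用全部已保存的数据。"""
    window_deltas = list(recent_deltas)[-window:]
    effective = sum(window_deltas)                    # 有效变化量E
    accumulated = sum(abs(d) for d in window_deltas)  # 累计变化量A
    return abs(effective) / accumulated if accumulated else 1.0

deltas = deque(maxlen=10)        # 只保留最近10次差值
for d in [-4, 10, -3, 2, 1]:     # 假设的历次差值
    deltas.append(d)
print(windowed_variation(deltas, window=10))  # 数据不足10次时用全部5次计算
```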
S102、若通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则通信设备在预设时段内停止向中心服务器发送第一模型参数取值的更新量。
其中,通信设备用于根据用户数据进行模型训练,在模型训练过程中确定模型参数取值的更新量,该更新量也可以称之为梯度。在通信设备接收到中心服务器发送更新后的模型参数取值后,会构建模型,并利用本地的用户数据训练该模型。在训练过程中,模型参数取值会基于用户数据进行调整,进而获得模型参数取值的更新量,并按照发送周期向中心服务器发送模型参数取值的更新量。其中,发送周期可以基于通信设备的带宽确定,具体确定方法可以参考现有技术,本申请实施例对此不作具体限定。
示例性的,可以设置检测周期,通信设备按照检测周期获得第一模型参数取值的变化量,也即按照检测周期,周期性的确定第一模型参数是否稳定。可以理解的是,通信设备也会按照预先设置的发送周期向中心服务器发送第一模型参数取值的更新量。并且,发送并接收到数据后,才会产生第一模型参数取值的变化量,故发送周期要小于检测周期。例如:发送周期为5s,检测周期为10s。则通信设备每隔5s向中心服务器发送第一模型参数取值的更新量,同样的中心服务器每隔5s向通信设备发送第一模型参数取值。而通信设备每隔10s确认第一模型参数取值的变化量,基于变化量确定第一模型参数是否稳定。
如此,可以通过上述步骤S101,周期性的确定第一模型参数取值的变化量,进而根据第一模型参数取值的变化量确定第一模型参数是否稳定。示例性的,可以根据实验数据或者历史经验值等设定预设阈值,若通过上述步骤S101中的方法获得的第一模型参数取值的变化量小于预设阈值,则判断第一模型参数稳定。若第一模型参数稳定,则通信设备在预设时段内停止向中心服务器发送第一模型参数取值的更新量,相应的,中心服务器在预设时段内也停止向通信设备发送第一模型参数取值。也即,第一模型参数稳定后,通信设备和中心服务器之间不传输第一模型参数取值以及取值的更新量,该过程也可以理解为第一模型参数稳定后,将第一模型参数冻结。也就是说,第一模型参数为参与确定是否稳定的模型参数,其数量不限定,可以为任意一个或多个模型参数。如果确定第一模型参数稳定,则该第一模型参数为预设时段内不参与传输的模型参数。第二模型参数为参与中心服务器和通信设备之间的更新量和取值传输的模型参数。第二模型参数的数量也不限定,可以为一个或多个。可选的,全部的第一模型参数和全部的第二模型参数组成了全部的模型参数(也可以称之为第三模型参数)。
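结合上述发送周期、检测周期、预设阈值与预设时段的关系，通信设备侧的处理流程可以用如下示意性的Python代码概括（该代码只是在若干假设下对单个第一模型参数的简化模拟，其中的时长取值、随机数据、预设时段长度等均为便于说明而假设，并非对实际实现的限定）：

```python
import random

SEND_PERIOD = 5      # 发送周期：5个时间单位（假设值）
DETECT_PERIOD = 10   # 检测周期：10个时间单位，需大于发送周期

def simulate(total_time=100, threshold=0.1):
    """对单个第一模型参数的简化模拟：按发送周期交互、按检测周期判稳，
    稳定则在预设时段内停止传输（此处预设时段假设为2个发送周期）。"""
    value, last, eff, acc = 0.0, 0.0, 0.0, 0.0
    frozen_until = -1                                   # 预设时段结束时刻
    for t in range(0, total_time, SEND_PERIOD):         # 每个发送周期交互一次
        if t < frozen_until:
            continue                                    # 预设时段内：该参数不发送、不接收
        value += random.gauss(0, max(0.01, 1 - t / total_time))  # 模拟交互后得到的新取值
        if t % DETECT_PERIOD == 0:                      # 按检测周期判断稳定性
            delta = value - last
            eff = 0.5 * eff + 0.5 * delta               # 有效变化量（EMA）
            acc = 0.5 * acc + 0.5 * abs(delta)          # 累计变化量（EMA）
            last = value
            p = abs(eff) / acc if acc else 1.0          # 变化量P
            if p < threshold:                           # P小于预设阈值 -> 判定稳定
                frozen_until = t + 2 * SEND_PERIOD      # 停止传输2个发送周期（假设值）
    return value

simulate()
```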
在一些实施例中,在第一模型参数停止传输的预设时段内,第一模型参数的历史信息不再变化。在预设时段之后,通信设备向中心服务器发送第一模型参数取值的更新量以及接收中心服务器发送的第一模型参数取值。那么,在到达预设时段之后,第一模型参数自动开始参与传输后,也会开始自动记录历史信息。如自动开始记录第一模型参数取值的有效变化量和第一模型参数取值的累计变化量。
在一些实施例中,通信设备与中心服务器之间可以通过报文传输模型参数取值和更新量。如图4中的(a)所示,为本申请实施例提供的一种报文结构,包括报文头和报文本身的数据。其中,报文头信息包括报文的目的地址(destination address)、源地址(source address)、报头、长度/类型(length/type)。
其中，上文所指的源地址和目的地址均是媒体访问控制（media access control，MAC）地址，该报文的具体结构可参考现有技术。在本申请实施例中，报头包括与模型参数取值的更新量对应的标识位。
在报文的一种可能的设计中，报文中携带有第三模型参数(模型包含的全部模型参数)取值的更新量与对应的标识位的取值，标识位的取值用于表示通信设备是否向中心服务器传输第三模型参数取值的更新量。示例性的，可以在标识位设置特殊的字符，如标识位取值为“1”表示通信设备向中心服务器发送该标识位对应的模型参数取值的更新量。标识位取值为“0”表示通信设备未向中心服务器发送该标识位对应的模型参数取值的更新量。这样，中心服务器根据报头的标识位的取值，获知在预设时段内，通信设备未传输第一模型参数取值的更新量，而传输了第二模型参数取值的更新量。也就是说，通信设备与中心服务器之间传输的模型参数取值的更新量为第三模型参数取值的更新量，第三模型参数取值的更新量中包括判断为可以传输的第二模型参数取值的更新量和预设时段内不可以传输的第一模型参数取值的更新量。比如，如图4中的(a)所示，一种模型参数取值的更新量对应的标识位的可能的实现方式，假设第三模型参数包含A、B、C、D、E和F六个模型参数。其中，通信设备判断其中第一模型参数B、E和F三个模型参数稳定，则B、E和F三个第一模型参数取值的更新量对应的标识位的取值为0。A、C和D三个第二模型参数取值的更新量对应的标识位的取值为1。则表示在此次传输的报文的数据段中包含A、C和D模型参数对应的数据，不包含B、E和F模型参数对应的数据。
在报文的另一种可能的设计中,报文的包头中包含第二模型参数取值的更新量与对应的标识位的取值。其中,在预设时段内,第二模型参数取值的更新量不包括第一模型参数取值的更新量,第二模型参数取值的更新量对应的标识位的取值用于表示通信设备向中心服务器传输第二模型参数取值的更新量。比如,如图4中的(b)所示,一种模型参数取值的更新量对应的标识位的可能的实现方式,报文中携带的A、C和D三个第二模型参数取值的更新量对应的标识位的取值,该取值可以设置为1,则表示在此次传输的报文的数据段中包含A、C和D模型参数对应的数据。
可以理解的是，还可以有其他报文报头的设计方式，用于表示传输的模型参数的含义。比如，通信设备和中心服务器之间预先设置好需要传输的报文中第三模型参数取值的更新量的顺序。之后，通信设备确定第一模型参数稳定后，传输给中心服务器的报文中直接不包含第一模型参数取值的更新量，中心服务器也可以根据空缺的比特位，确定停止传输的是哪些第一模型参数取值的更新量，也可以获知传输的数据对应的是哪些第二模型参数取值的更新量。又比如，在通信设备未向中心服务器发送第一模型参数取值的更新量时，该第一模型参数取值的更新量对应的标识位可以为空。这样，中心服务器可以根据报文的标识位的取值为空，得知在预设时段内，标识位为空的模型参数取值的更新量为通信设备未传输的第一模型参数取值的更新量。
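针对上述报文设计，下面给出一段示意性的Python编码/解码示例（字段布局、标识位顺序、参数个数、数值类型等均为便于说明而假设，并非对报文格式的限定）：

```python
import struct

PARAM_ORDER = ["A", "B", "C", "D", "E", "F"]   # 通信设备与中心服务器预先约定的第三模型参数顺序

def build_payload(updates, frozen):
    """updates: 参数名->取值的更新量；frozen: 预设时段内停止传输的第一模型参数集合。
    报头对每个参数用1个标识位：1表示本报文携带该参数取值的更新量，0表示未传输。"""
    flags = 0
    values = []
    for i, name in enumerate(PARAM_ORDER):
        if name not in frozen:
            flags |= 1 << i                    # 标识位置1：携带该参数对应的数据
            values.append(updates[name])
    header = struct.pack("!B", flags)                   # 1字节标识位（参数数<=8时）
    body = struct.pack(f"!{len(values)}f", *values)     # 仅携带第二模型参数取值的更新量
    return header + body

def parse_payload(payload):
    flags = struct.unpack("!B", payload[:1])[0]
    names = [n for i, n in enumerate(PARAM_ORDER) if flags & (1 << i)]
    values = struct.unpack(f"!{len(names)}f", payload[1:1 + 4 * len(names)])
    return dict(zip(names, values))            # 接收方据标识位恢复哪些更新量被传输

# 对应正文示例：B、E、F稳定而停止传输，A、C、D正常传输
payload = build_payload({"A": 0.1, "C": -0.2, "D": 0.3}, frozen={"B", "E", "F"})
print(parse_payload(payload))   # 约为 {'A': 0.1, 'C': -0.2, 'D': 0.3}（float32精度）
```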
在一些实施例中,每一次通信设备确定第一模型参数稳定,需要停止传输第一模型参数取值的更新量时,还需要确定第一模型参数的数量占第三模型参数的数量的比例。若通信设备确定稳定的第一模型参数的数量与第三模型参数的数量的比例大于预设比例,则减小预设阈值的取值。也就是说,在停止传输的模型参数的数量到达一定比例后,需要降低模型参数是否稳定的判定阈值,进而减少停止传输的模型参数的数量,保证参与模型训练的模型参数中为更新后的取值的模型参数的数量。
如此,动态的调节预设阈值,避免停止传输的模型参数的数量过多,使得模型训练的过程延长或者影响模型最终的精度。比如,当大量模型参数停止传输后,在预设时段内只能应用各个停止传输的模型参数不变的取值训练模型,可能训练效果并不理想。只能等待各个停止传输的模型参数的预设时段时长到达后,才可以获得这些模型参数更新后的取值,继续训练模型,导致模型到达模型收敛条件的训练时间过长。
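上述根据停止传输的模型参数比例动态减小预设阈值的做法，可以用如下示意性的Python函数表示（预设比例与衰减系数均为假设值）：

```python
def adjust_threshold(threshold, num_frozen, num_total,
                     max_frozen_ratio=0.8, decay=0.9):
    """若已稳定（停止传输）的第一模型参数数量占第三模型参数数量的比例
    大于预设比例，则减小判定稳定所用的预设阈值。"""
    if num_total and num_frozen / num_total > max_frozen_ratio:
        return threshold * decay        # 减小预设阈值，使后续更难判定为稳定
    return threshold

print(adjust_threshold(0.1, num_frozen=85, num_total=100))  # 约0.09：比例0.85大于预设比例0.8
```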
S103、通信设备接收中心服务器发送的第二模型参数取值。其中,在预设时段内,第二模型参数取值不包括第一模型参数取值。
示例性的,通信设备向中心服务器发送第二模型参数取值的更新量后,中心服务器会根据获取到的一个或多个通信设备发送的模型参数取值的更新量得到更新后的第二模型参数取值,并将第二模型参数更新后的取值发送至对应的通信设备。其中,在预设时段内,第一模型参数不参与传输,则第二模型参数更新后的取值不包括第一模型参数更新后的取值。
在一些实施例中,在预设时段之后,通信设备向中心服务器发送第一模型参数取值的更新量以及接收中心服务器发送的第一模型参数取值。
也就是说,通信设备若在预设时段内不向中心服务器发送第一模型参数取值的更新量,则在该预设时段内中心服务器也不会生成并向通信设备发送第一模型参数更新后的取值,进而实现通信设备与中心服务器之间双向的数据量均减小。而在预设时段后,通信设备和中心服务器又会自动开始传输第一模型参数取值以及取值的更新量,实现自适应的传输数据量的调整。
示例性的,若通信设备确定第一模型参数取值的变化量,需要先接收到中心服务器发送的第一模型参数取值,才可以根据此次接收到的第一模型参数取值与之前接收到的第一模型参数取值确定其变化量,故在上述步骤S101之前,还可以包括步骤S104。
S104、通信设备接收中心服务器发送的第二模型参数取值。
在一些实施例中,初始时,通信设备接收中心服务器发送的第三模型参数取值为用于模型训练的全部模型参数取值。后续交互过程中,第三模型参数的数量会根据通信设备发送给中心服务器的模型参数取值的更新量的数量变化,如变化为除停止传输的第一模型参数之外的第二模型参数。也即通信设备向中心服务器发送模型参数取值的更新量后,中心服务器才会根据模型参数取值的更新量确定模型参数更新后的取值, 进而才会将更新后的取值发送至通信设备。后续按照检测周期,在通信设备接收到第二模型参数取值后,才会对其中全部或者部分模型参数进行稳定性的判断,该全部或者部分模型参数即为上述步骤S101中的第一模型参数。机器学习的分布式系统中,模型训练的过程中包括通信设备与中心服务器之间传输模型参数的更新量或取值的交互过程,在交互过程中,实现模型的收敛,完成训练过程。
上述步骤S101-步骤S103是由各个通信设备判断第一模型参数取值的稳定性,进而确定是否停止传输第一模型参数取值的更新量。可以理解的是,也可以由中心服务器对第一模型参数取值的稳定性进行判断,确认是否停止传输第一模型参数取值。并且,也可以向对应的各个通信设备发送通知,将判断结果告知对应的各个通信设备,进而保证各个通信设备在获知是否在下一个发送周期停止向中心服务器发送第一模型参数取值的更新量。
由此,本申请实施例提供的模型训练的通信方法,若判断第一模型参数取值的稳定,则在预设时段内停止传输第一模型参数取值的更新量。可以在不损失模型训练精度的前提下,灵活的减小通信设备和中心服务器之间传输的数据量,提高通信效率。
在上述步骤S102中,通信设备按照发送周期向中心服务器发送模型参数取值的更新量,并接收中心服务器发送的模型参数取值。通信设备按照检测周期检测模型参数取值的稳定性,确定其中可以停止传输的模型参数取值的更新量,并且每一次检测,还会获得一个时长。若第一模型参数稳定,则获得的时长为停止传输的预设时段时长;若第一模型参数不稳定,则获得的时长为用于调节下一次判断第一模型参数取值是否稳定时获得时长的时长。如此,可以实现在通信设备与中心服务器之间模型参数取值以及取值的更新量的交互传输的过程中,灵活的调整可以停止传输的模型参数以及对应的时长。参见图5所示的流程图,可以通过下述步骤,实现通信设备和中心服务器之间循环交互的模型参数取值以及取值的更新量的传输过程,以及实现灵活的调整获得的时长。其中,停止传输第一模型参数的预设时段的时长可以为发送周期的整数倍,也可以为自定义的任意时长,本申请实施例对此不做具体限定。
步骤一、判断第一模型参数是否稳定。
示例性的,开始模型训练过程后,通信设备会接收中心服务器发送的第三模型参数取值,通信设备利用第三模型参数取值构建本地模型,并利用本地数据训练模型,按照发送周期向中心服务器发送第三模型参数取值的更新量。其中,第三模型参数取值即为用于构建模型的全部模型参数取值,首次开始模型训练过程,中心服务器会将全部模型参数取值发送至通信设备,用于构建初始模型。
之后,接收中心服务器发送更新后的第三模型参数取值,通过上述步骤S101和步骤S102中的方法,确定其中第一模型参数取值是否稳定。若第一模型参数稳定,则执行步骤二;若第一模型参数不稳定,则继续传输第一模型参数取值的更新量。
步骤二、获得停止传输第一模型参数取值的更新量的预设时段时长为初始时长。
示例性的，确定第一模型参数稳定后，表明第一模型参数已经收敛，可以在预设时段内停止传输第一模型参数取值的更新量，进而减少通信设备与中心服务器之间通信的数据量。此时，获得需要停止传输第一模型参数取值的更新量的预设时段时长为初始时长。初始时长可以为整数个发送周期，也可以为自定义的时间段，后续可以通过增减发送周期的数量或自定义的时间段来调节时长。
在一些实施例中,在判断需要停止传输第一模型参数取值的更新量后,还需要计算停止传输的第一模型参数的数量占第三模型参数的数量的比例。若大于预设比例,则减小预设阈值。减小判断第一模型参数是否稳定的预设阈值,可以在后续判断第一模型参数是否稳定的过程中,减少停止传输的模型参数的数量,保证模型训练的精度。其中,预设比例可以根据经验值自定义设定。
步骤三、到达初始时长,开始传输第一模型参数取值的更新量。
示例性的,在到达预设时段后,停止传输的第一模型参数自动按照发送周期参与传输。在传输过程中,按照检测周期判断第一模型参数是否稳定。若第一模型参数稳定,则执行步骤四;若第一模型参数不稳定,则执行步骤八。
步骤四、获得停止传输第一模型参数取值的更新量的预设时段时长大于初始时长。
示例性的,在确认第一模型参数稳定后,需要获得停止传输第一模型参数取值的更新量的预设时段时长,并在预设时段内停止传输。由于该第一模型参数上一次的判断时获得的时长为初始时长,故此次判断稳定获得的预设时段的时长要大于上一次获得的初始时长。例如,模型参数1在上一次检测周期获得初始时长为1个发送周期,则在停止传输1个发送周期后,模型参数1自动参与传输。发送周期小于检测周期,故,模型参数1在参与传输一段时间后,会再次进行检测。此次确定模型参数1稳定,则可以获得大于1个发送周期的预设时段的时长,如2个发送周期。
在一些实施例中,同上述步骤二,需要计算停止传输的第一模型参数的数量占第三模型参数的数量的比例,若大于预设比例,则减小预设阈值。
步骤五、到达预设时段的时长,开始传输第一模型参数取值的更新量。
示例性的,到达预设时段时长后,停止传输的第一模型参数自动开始参与传输。并继续参与判断其稳定性。若第一模型参数稳定,则执行步骤六;若第一模型参数不稳定,则执行步骤七。
步骤六、获得停止传输第一模型参数取值的更新量的预设时段时长大于上一次获得的时长。
示例性的,由于此次第一模型参数稳定,则需要获得停止传输第一模型参数取值的更新量的预设时段时长并在预设时段内停止传输。其中,停止传输的预设时段的时长要大于上一次检测获得的时长。
在一些实施例中,同上述步骤二,需要计算停止传输的第一模型参数的数量占第三模型参数的数量的比例,若大于预设比例,则减小预设阈值。
完成步骤六后,返回步骤五,即在达到预设时段自动开始传输之后,继续判断第一模型参数取值是否稳定。即整个模型训练过程是通信设备和中心服务器之间模型参数取值以及取值的更新量的循环交互过程,在模型训练过程中,需按照检测周期反复判断模型参数的稳定性,并获得对应的时长。
也就是说，通信设备M次确定第一模型参数取值的变化量。其中，M为大于等于2的正整数。通信设备根据第一模型参数第k次的稳定状态确定预设时段满足的预设条件包括：若第k次根据第一模型参数取值的变化量确定第一模型参数稳定，则预设时段的时长为第一时长，第一时长大于第二时长。其中，第二时长为第k-1次确定第一模型参数稳定时，停止向中心服务器发送第一模型参数取值的更新量的预设时段的时长。或者，第二时长为第k-1次确定第一模型参数未稳定时，得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长；其中，k为正整数，k≤M。
步骤七、继续传输第一模型参数取值的更新量,并获得小于上一次获得时长的时长。
示例性的,由于此次第一模型参数不稳定,则需要获得用于调节下一次检测获得时长的时长,此次获得的时长要小于上一次检测获得的时长。其中,上一次检测第一模型参数取值可以稳定也可以不稳定。
完成步骤七后,返回步骤五,即在下一个检测周期继续判断第一模型参数取值是否稳定。
也就是说,通信设备M次确定第一模型参数取值的变化量。其中,M为大于等于2的正整数。通信设备根据第一模型参数第k次的稳定状态确定预设时段满足的预设条件还包括:若第k次确定第一模型参数未稳定,则得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的第三时长,第三时长小于第四时长。其中,第四时长为第k-1次确定第一模型参数稳定时,停止向中心服务器发送第一模型参数取值的更新量的预设时段的时长。或者,第四时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长。
步骤八、继续传输第一模型参数取值的更新量,并获得小于初始时长的时长。
示例性的,第一模型参数不稳定,则继续传输,并且此次可以获得用于调节下一次检测获得时长的时长,此次的时长会小于初始时长。
也就是说,在通信设备和中心服务器之间传输模型参数取值以及取值的更新量的过程中,每一次检测确定第一模型参数取值的变化量后,都会获得一个时长。若此次确定第一模型参数稳定,表明第一模型参数收敛,需要增加该第一模型参数停止传输取值的更新量的时长,则获得的时长为预设时段的时长,并且该时长大于上一次检测获得的时长。若此次确定第一模型参数不稳定,表明第一模型参数还未收敛,需要减少该第一模型参数下一次停止传输取值的更新量的时长,则获得的时长为用于调节下一次检测获得时长的时长,并且该时长小于上一次检测获得的时长。
如此，可以根据第一模型参数的稳定状态动态的调节停止传输的预设时段的时长，灵活的控制通信设备和中心服务器之间传输的参与模型训练的模型参数数量，进而保证最后获得的模型的精度满足要求。
基于上述步骤一至步骤八,举例来说,若基于发送周期的时长调节预设时段时长,那么,若第k次通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则在n个发送周期内停止向中心服务器发送第一模型参数取值的更新量。其中,n为正整数。
在n个发送周期之后,若第k+1次通信设备根据第一模型参数取值的变化量确定第一模型参数稳定,则在(n+m)个发送周期内停止向中心服务器发送第一模型参数取值的更新量。其中,m为正整数。
在n个发送周期之后，若第k+1次通信设备根据第一模型参数取值的变化量确定第一模型参数未稳定，则若第k+2次通信设备根据变化量确定第一模型参数稳定，在(n/r+m)个发送周期内停止向中心服务器发送第一模型参数取值的更新量。其中，r为大于等于2的正整数，(n/r)≥1。
示例性的,假设第5次(第k次)检测,第一模型参数稳定获得的预设时段时长为2个发送周期时长。
若到达预设时段之后的第6次(第k+1次)检测第一模型参数稳定,则预设时段的时长可以为2+1=3个发送周期时长。
若到达预设时段之后的第6次(第k+1次)检测第一模型参数不稳定,则获得的时长可以为2/2=1个发送周期时长。若第7次(第k+2次)检测第一模型参数稳定,则获得的预设时段的时长可以为2/2+1=2个发送周期时长。
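上述按发送周期调节时长的规则可以用如下示意性的Python函数概括（其中m取1个发送周期、r取2，与正文算例一致，均为假设值）：

```python
def next_duration(prev_duration, stable, m=1, r=2):
    """prev_duration: 上一次检测得到的时长（单位：发送周期个数）。
    stable为True时返回本次停止传输的预设时段时长；
    stable为False时返回用于调节下一次时长的时长（本次继续传输）。"""
    if stable:
        return prev_duration + m        # 稳定：在上一次时长基础上增加m个发送周期
    return max(1, prev_duration // r)   # 未稳定：将上一次时长按r等比例缩小，且(n/r)≥1

# 复现正文算例：第5次检测稳定得到2个发送周期
d = next_duration(1, stable=True)       # 假设此前时长为1 -> 2
print(d)                                # 2
print(next_duration(d, stable=True))    # 第6次稳定：2+1=3
d = next_duration(d, stable=False)      # 第6次未稳定：2/2=1（继续传输）
print(next_duration(d, stable=True))    # 第7次稳定：2/2+1=2
```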
如图6所示,为本申请实施例提供的又一种基于模型训练的通信方法的流程示意图,该方法可以包括S201-S203:
S201、中心服务器接收通信设备发送的第二模型参数取值的更新量。在预设时段内,第二模型参数取值的更新量中不包含第一模型参数取值的更新量。
其中,第一模型参数为根据第一模型参数取值的变化量确定稳定的模型参数。确定第一模型参数稳定的过程包括若第一模型参数取值的变化量小于预设阈值,则确定第一模型参数稳定。具体方法可以参见上述步骤S101的相关描述,在此不再赘述。
其中,第一模型参数取值的更新量和第二模型参数的更新量由通信设备在进行模型训练的过程中根据用户数据确定。即,通信设备在进行模型训练的过程中会根据用户数据确定模型参数取值的更新量。并且,通信设备会根据模型参数的稳定状态确定需要向中心服务器停止传输的模型参数的取值的更新量。
示例性的,在预设时段之后,中心服务器向通信设备发送第一模型参数取值以及接收通信设备发送的第一模型参数取值的更新量。
也就是说,中心服务器会根据接收的模型参数取值的更新量判断通信设备是否已经停止传输部分模型参数。若已经停止传输部分模型参数,则中心服务器也会在预设时段内,停止传输这部分通信设备停止传输的模型参数。进而实现双向减少传输的数据量,提高通信效率。
其余内容可以参考步骤S101至步骤S103的相关描述,在此不再赘述。
S202、中心服务器根据第二模型参数取值的更新量确定第二模型参数更新后的取值。
示例性的，中心服务器接收到各个通信设备发送的模型参数取值的更新量后，会对各个通信设备反馈的模型参数的更新量进行平均叠加得到统一的新的模型参数取值。具体获得新的模型参数取值的过程可以参见现有技术，本申请实施例对此不再进行赘述。
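中心服务器对各通信设备反馈的更新量进行平均叠加的过程，可以用如下示意性的Python代码表示（此处以等权平均并叠加到当前取值为例，仅为说明性示例草图，实际聚合方式可参见现有技术，不作限定）：

```python
def aggregate(current_values, device_updates):
    """current_values: 参数名->中心服务器当前保存的取值；
    device_updates: 各通信设备上报的 参数名->取值的更新量 列表
    （预设时段内被停止传输的第一模型参数不会出现在上报内容中）。"""
    new_values = dict(current_values)
    for name in current_values:
        reports = [u[name] for u in device_updates if name in u]
        if reports:                                    # 仅更新收到更新量的第二模型参数
            new_values[name] += sum(reports) / len(reports)
    return new_values

server_values = {"A": 0.5, "B": -1.2, "C": 2.0}
updates = [{"A": 0.1, "C": -0.2}, {"A": 0.3, "C": 0.0}]   # 两台设备均未上报已冻结的B
print(aggregate(server_values, updates))  # {'A': 0.7, 'B': -1.2, 'C': 1.9}
```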
S203、中心服务器向通信设备发送第二模型参数更新后的取值。在预设时段内,第二模型参数更新后的取值中不包含第一模型参数更新后的取值。
示例性的,中心服务器和通信设备之间通过报文传输模型参数取值和更新量。
在报文的一种可能的设计中，中心服务器向通信设备发送报文，报文包含第三模型参数更新后的取值与对应的标识位的取值。其中，在预设时段内，第一模型参数更新后的取值对应的标识位的取值用于表示中心服务器未向通信设备传输第一模型参数更新后的取值。
在报文的又一种可能的设计中,中心服务器向通信设备发送报文,报文包含第二模型参数更新后的取值与对应的标识位的取值。其中,在预设时段内,第二模型参数更新后的取值不包括第一模型参数更新后的取值,第二模型参数更新后的取值对应的标识位的取值用于表示中心服务器向通信设备传输第二模型参数更新后的取值。
其余内容可以参考步骤S101至步骤S103的相关描述,在此不再赘述。
示例性的,中心服务器若接收通信设备发送的第二模型参数取值的更新量,需要先向通信设备发送模型参数取值,通信设备才可以根据模型参数取值确定模型参数取值的变化量,进而确定可以传输的第二模型参数取值的更新量,并将第二模型参数取值的更新量发送至中心服务器,故在步骤S201之前,还可以包括步骤S204。
S204、中心服务器向通信设备发送第二模型参数取值。
示例性的,其余内容可以参考步骤S104的相关描述,在此不再赘述。
基于本申请提供的模型训练的通信方法，下面结合具体实验数据，对本申请实施例所提供的技术方案的效果进行说明。
将本申请实施例以及现有技术中提供的机器学习的分布式系统中的模型训练方法,在Pytorch平台上实现。其中,机器学习的分布式系统中的客户端包含50台通信设备。每台通信设备的配置为m5.xlarge,包含2vCPUs,内存8GB,下载带宽为9Mbps,上传带宽为3Mbps。服务器端为1个中心服务器,配置为c5.9xlarge,带宽为10Gbps。在通信设备和中心服务器中安装Pytorch平台构成集群,集群信息包括弹性云计算(elastic compute cloud,EC2)集群。
本申请实施例主要从以下三个方面的实验数据结果分析，体现本申请提供的技术方案的有益效果：整体性能，收敛性和开销。其中，实验结果主要与标准的机器学习的分布式系统算法进行比较。训练数据集包括用于识别普适物体的小型数据集CIFAR-10和关键字定位数据集（keyword spotting dataset，KWS）。神经网络模型包括LeNet，残差网络（residual network，ResNet）18和两层回路（recurrent layer）的长短期记忆网络（long short-term memory，LSTM）。其中，LeNet和ResNet18利用数据集CIFAR-10训练神经网络模型。两层回路（recurrent layer）的LSTM利用数据集KWS训练神经网络模型。
第一方面,整体性能的比较。
将训练三种神经网络模型的过程,分别采用本申请实施例提供的模型训练的通信方法以及采用现有技术中的模型训练的通信方法,获得的通信设备与中心服务器之间的数据传输总量以及每一次传输的平均时间进行对比。
如下表1所示,为数据传输总量的比较。
表1
模型 LeNet ResNet18 KWS-LSTM
本申请 239MB 2.62GB 194MB
现有技术 651MB 3.12GB 428MB
改进比例 63.29% 16.02% 54.35%
如下表2所示,为平均训练时间的比较。
表2
模型 LeNet ResNet18 KWS-LSTM
本申请 0.74s 139s 1.8s
现有技术 1.02s 158s 2.2s
改进比例 27.45% 12.03% 18.18%
如此,通过上表1和表2中的实验数据可知,本申请实施例提供的技术方案可以有效的减少神经网络模型训练过程中,通信设备与中心服务器之间传输的模型参数的数据量,以及模型参数传输的时间开销。
第二方面,收敛性分析。
通过对三种神经网络模型的准确率和停止传输的模型参数的数量比例分析,获得如图7所示的LeNet的实验数据结果分析图、图8所示的ResNet18的实验数据结果分析图和图9所示的KWS-LSTM的实验数据结果分析图。
图7、图8和图9中横坐标表示通信设备和中心服务器之间的通讯轮次,纵坐标表示神经网络模型的准确率或停止传输的模型参数的数量比例。其中的粗实线表示采用本申请技术方案获得的神经网络模型准确率。细实线表示采用现有技术中的技术方案获得的神经网络模型准确率。虚线表示采用本申请实施例的技术方案后,在模型训练过程中停止传输模型参数的数量占所有模型参数的数量的比例。
可以看出，采用本申请实施例提供的技术方案，停止传输部分模型参数后，仍可以在有限次通信轮次后实现模型收敛，精度（准确率）稳定，并且不会降低神经网络模型的精度。
第三方面,开销对比。
其中,开销包括三种神经网络模型在采用本申请实施例提供的技术方案进行模型训练通信后,产生的时间开销和内存开销,以及相对于现有技术中的模型训练通信方法增加的时间开销比例和内存开销比例。
表3
模型 LeNet ResNet18 KWS-LSTM
时间开销 0.009s 1.278s 0.011s
时间开销增加比例 1.93% 4.50% 1.42%
内存开销 1.2MB 142MB 4.8MB
内存开销增加比例 0.18% 8.51% 2.35%
根据上表3中的实验数据可以看出,本申请实施例提供的技术方案对于时间开销以及内存开销增加的比例较小,不会对模型训练过程产生过大的影响。
综上所述,相对于现有技术中标准的机器学习的分布式系统算法,采用本申请实施例提供的技术方案可以在整体性能上有效减少数据传输总量以及每一次传输的平均时间。并且,不会对神经网络模型精度以及训练过程中产生的开销产生较大的影响。
图10示出了上述实施例中所涉及的基于模型训练的通信装置的一种可能的结构示意图。如图10所示，基于模型训练的通信装置1000包括：处理单元1001，发送单元1002以及接收单元1003。
其中,处理单元1001,用于支持基于模型训练的通信装置1000执行图3中的步骤S101,S102和/或用于本文所描述的技术的其它过程。
发送单元1002,用于支持基于模型训练的通信装置1000执行图3中的步骤S102和/或用于本文所描述的技术的其它过程。
接收单元1003,用于支持基于模型训练的通信装置1000执行图3中的步骤S103和/或用于本文所描述的技术的其它过程。
其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能单元的功能描述,在此不再赘述。
图11示出了上述实施例中所涉及的基于模型训练的通信装置的一种可能的结构示意图。如图11所示,基于模型训练的通信装置1100包括:接收单元1101,处理单元1102以及发送单元1103。
其中,接收单元1101,用于支持基于模型训练的通信装置1100执行图6中的步骤S201和/或用于本文所描述的技术的其它过程。
处理单元1102,用于支持基于模型训练的通信装置1100执行图6中的步骤S202和/或用于本文所描述的技术的其它过程。
发送单元1103,用于支持基于模型训练的通信装置1100执行图6中的步骤S203和/或用于本文所描述的技术的其它过程。
其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能单元的功能描述,在此不再赘述。
本申请实施例还提供一种芯片系统,如图12所示,该芯片系统包括至少一个处理器1201和至少一个接口电路1202。处理器1201和接口电路1202可通过线路互联。例如,接口电路1202可用于从其它装置接收信号。又例如,接口电路1202可用于向其它装置(例如处理器1201)发送信号。示例性的,接口电路1202可读取存储器中存储的指令,并将该指令发送给处理器1201。当所述指令被处理器1201执行时,可使得基于模型训练的通信装置执行上述实施例中的基于模型训练的通信方法中的各个步骤。当然,该芯片系统还可以包含其他分立器件,本申请实施例对此不作具体限定。
本申请实施例还提供一种计算机存储介质,该计算机存储介质中存储有计算机指令,当该计算机指令在基于模型训练的通信装置上运行时,使得基于模型训练的通信装置执行上述相关方法步骤实现上述实施例中的基于模型训练的通信方法。
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述实施例中的基于模型训练的通信方法。
另外,本申请的实施例还提供一种装置,该装置具体可以是组件或模块,该装置可包括相连的处理器和存储器;其中,存储器用于存储计算机执行指令,当装置运行时,处理器可执行存储器存储的计算机执行指令,以使装置执行上述各方法实施例中的基于模型训练的通信方法。
其中，本申请实施例提供的装置、计算机存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的方法，因此，其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果，此处不再赘述。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,模块或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序指令的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (38)

  1. 一种基于模型训练的通信方法,其特征在于,应用于包括中心服务器和通信设备的系统中;所述方法包括:
    所述通信设备确定第一模型参数取值的变化量;
    若所述通信设备根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则所述通信设备在预设时段内停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,所述第一模型参数取值的更新量由所述通信设备在进行模型训练的过程中根据用户数据确定;
    所述通信设备接收所述中心服务器发送的第二模型参数取值;其中,在所述预设时段内,所述第二模型参数取值不包括所述第一模型参数取值。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    在所述预设时段之后,所述通信设备向所述中心服务器发送所述第一模型参数取值的更新量以及接收所述中心服务器发送的所述第一模型参数取值。
  3. 根据权利要求1所述的方法,其特征在于,所述通信设备在预设时段内停止向所述中心服务器发送所述第一模型参数取值的更新量,包括:
    所述通信设备向所述中心服务器发送报文,所述报文包含第三模型参数取值的更新量与对应的标识位的取值;所述第三模型参数取值的更新量包括所述第二模型参数取值的更新量和所述第一模型参数取值的更新量;其中,在所述预设时段内,所述第一模型参数取值的更新量对应的标识位的取值用于表示所述通信设备未向所述中心服务器传输所述第一模型参数取值的更新量;
    或者,所述通信设备向所述中心服务器发送报文,所述报文包含所述第二模型参数取值的更新量与对应的标识位的取值;其中,在所述预设时段内,所述第二模型参数取值的更新量不包括所述第一模型参数取值的更新量,所述第二模型参数对应的标识位的取值用于表示所述通信设备向所述中心服务器传输所述第二模型参数取值的更新量。
  4. 根据权利要求1所述的方法,其特征在于,所述通信设备确定第一模型参数取值的变化量,包括:
    所述通信设备根据所述第一模型参数取值的历史信息获得所述第一模型参数取值的变化量。
  5. 根据权利要求4所述的方法,其特征在于,所述取值的历史信息包括:所述第一模型参数取值的有效变化量和所述第一模型参数取值的累计变化量。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述通信设备根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,包括:
    若所述第一模型参数取值的变化量小于预设阈值,则确定所述第一模型参数稳定。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述通信设备确定第一模型参数取值的变化量,包括:
    所述通信设备M次确定所述第一模型参数取值的变化量;其中,M为大于等于2的正整数;
    所述通信设备根据所述第一模型参数第k次的稳定状态确定预设时段满足的预设条件包括：
    若第k次根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则所述预设时段的时长为第一时长,所述第一时长大于第二时长;其中,所述第二时长为第k-1次确定第一模型参数稳定时,停止向所述中心服务器发送所述第一模型参数取值的更新量的预设时段的时长;或者,所述第二时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长;其中,所述k为正整数,k≤M。
  8. 根据权利要求7所述的方法,其特征在于,所述预设条件还包括:
    若第k次确定所述第一模型参数未稳定,则得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的第三时长,所述第三时长小于第四时长;其中,所述第四时长为第k-1次确定第一模型参数稳定时,停止向所述中心服务器发送所述第一模型参数取值的更新量的预设时段的时长;或者,所述第四时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长。
  9. 根据权利要求7或8所述的方法,其特征在于,所述若通信设备根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则所述通信设备在预设时段内停止向所述中心服务器发送所述第一模型参数取值的更新量,包括:
    若第k次所述通信设备根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则在n个发送周期内停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,n为正整数;
    在所述n个发送周期之后,若第k+1次所述通信设备根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则在(n+m)个发送周期内停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,m为正整数;
    在所述n个发送周期之后,若第k+1次所述通信设备根据所述第一模型参数取值的变化量确定所述第一模型参数未稳定,则若第k+2次所述通信设备根据所述变化量确定所述第一模型参数稳定,在(n/r+m)个发送周期内停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,r为大于等于2的正整数,(n/r)≥1;
    其中,所述发送周期为通信设备向所述中心服务器发送模型参数取值的更新量的周期。
  10. 根据权利要求9所述的方法,其特征在于,所述方法还包括:
    所述通信设备按照检测周期确定所述第一模型参数取值的变化量;所述发送周期小于所述检测周期。
  11. 根据权利要求6至10任一项所述的方法,其特征在于,所述方法还包括:
    若所述通信设备确定稳定的所述第一模型参数的数量与第三模型参数的数量的比例大于预设比例,则减小预设阈值的取值。
  12. 根据权利要求1至11任一项所述的方法,其特征在于,所述通信设备确定第一模型参数取值的变化量之前,所述方法还包括:
    所述通信设备接收所述中心服务器发送的所述第二模型参数取值。
  13. 一种基于模型训练的通信方法，其特征在于，应用于包括中心服务器和通信设备的系统中；所述方法包括：
    所述中心服务器接收所述通信设备发送的第二模型参数取值的更新量;在预设时段内,所述第二模型参数取值的更新量中不包含第一模型参数取值的更新量;其中,所述第一模型参数为根据所述第一模型参数取值的变化量确定稳定的模型参数;
    所述中心服务器根据所述第二模型参数取值的更新量确定所述第二模型参数更新后的取值;
    所述中心服务器向所述通信设备发送所述第二模型参数更新后的取值;在预设时段内,所述第二模型参数更新后的取值中不包含第一模型参数更新后的取值。
  14. 根据权利要求13所述的方法,其特征在于,所述方法还包括:
    在所述预设时段之后,所述中心服务器向所述通信设备发送所述第一模型参数取值以及接收所述通信设备发送的所述第一模型参数取值的更新量。
  15. 根据权利要求13所述的方法,其特征在于,所述中心服务器向所述通信设备发送所述第二模型参数更新后的取值;在预设时段内,所述第二模型参数更新后的取值中不包含第一模型参数更新后的取值,包括:
    所述中心服务器向所述通信设备发送报文,所述报文包含第三模型参数更新后的取值与对应的标识位的取值;所述第三模型参数更新后的取值包括所述第二模型参数更新后的取值和所述第一模型参数更新后的取值;其中,在所述预设时段内,所述第一模型参数更新后的取值对应的标识位的取值用于表示所述中心服务器未向所述通信设备传输所述第一模型参数更新后的取值;
    或者,所述中心服务器向所述通信设备发送报文,所述报文包含所述第二模型参数更新后的取值与对应的标识位的取值;其中,在所述预设时段内,所述第二模型参数更新后的取值不包括所述第一模型参数更新后的取值,所述第二模型参数对应的标识位的取值用于表示所述中心服务器向所述通信设备传输所述第二模型参数更新后的取值。
  16. 根据权利要求13至15任一项所述的方法,其特征在于,所述第一模型参数为根据所述第一模型参数取值的变化量确定稳定的模型参数,包括:
    若所述第一模型参数取值的变化量小于预设阈值,则确定所述第一模型参数稳定。
  17. 根据权利要求13至16任一项所述的方法,其特征在于,所述中心服务器接收所述通信设备发送的第二模型参数取值的更新量之前,所述方法还包括:
    所述中心服务器向所述通信设备发送所述第二模型参数取值。
  18. 一种基于模型训练的通信装置,其特征在于,所述装置包括:处理单元,发送单元,接收单元;
    所述处理单元,用于确定第一模型参数取值的变化量;
    所述处理单元,还用于根据所述第一模型参数取值的变化量确定所述第一模型参数是否稳定;
    发送单元，用于向中心服务器发送所述第一模型参数取值的更新量；若所述处理单元根据所述第一模型参数取值的变化量确定所述第一模型参数稳定，则所述发送单元在预设时段内停止向所述中心服务器发送所述第一模型参数取值的更新量；其中，所述第一模型参数的取值的更新量由所述处理单元在进行模型训练的过程中根据用户数据确定；
    所述接收单元,用于接收所述中心服务器发送的第二模型参数取值;其中,在所述预设时段内,所述第二模型参数取值不包括所述第一模型参数取值。
  19. 根据权利要求18所述的装置,其特征在于,
    所述发送单元,还用于在所述预设时段之后,向所述中心服务器发送所述第一模型参数取值的更新量;
    所述接收单元,还用于在所述预设时段之后,接收所述中心服务器发送的所述第一模型参数取值。
  20. 根据权利要求18所述的装置,其特征在于,所述发送单元具体用于:
    向所述中心服务器发送报文,所述报文包含第三模型参数取值的更新量与对应的标识位的取值;所述第三模型参数取值的更新量包括所述第二模型参数取值的更新量和所述第一模型参数取值的更新量;其中,在所述预设时段内,所述第一模型参数取值的更新量对应的标识位的取值用于表示所述发送单元未向所述中心服务器传输所述第一模型参数取值的更新量;
    或者,向所述中心服务器发送报文,所述报文包含所述第二模型参数取值的更新量与对应的标识位的取值;其中,在所述预设时段内,所述第二模型参数取值的更新量不包括所述第一模型参数取值的更新量,所述第二模型参数对应的标识位的取值用于表示所述发送单元向所述中心服务器传输所述第二模型参数取值的更新量。
  21. 根据权利要求18所述的装置,其特征在于,所述处理单元具体用于:
    根据所述第一模型参数取值的历史信息获得所述第一模型参数取值的变化量。
  22. 根据权利要求21所述的装置,其特征在于,所述取值的历史信息包括:所述第一模型参数取值的有效变化量和所述第一模型参数取值的累计变化量。
  23. 根据权利要求18至22任一项所述的装置,其特征在于,所述处理单元具体用于:
    根据所述第一模型参数取值的变化量确定所述第一模型参数是否稳定,若所述第一模型参数取值的变化量小于预设阈值,则确定所述第一模型参数稳定。
  24. 根据权利要求18至23任一项所述的装置,其特征在于,所述处理单元具体用于:
    M次确定所述第一模型参数取值的变化量;其中,M为大于等于2的正整数;
    根据所述第一模型参数第k次的稳定状态确定预设时段满足的预设条件包括:
    若第k次根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则所述预设时段的时长为第一时长,所述第一时长大于第二时长;其中,所述第二时长为第k-1次确定第一模型参数稳定时,停止向所述中心服务器发送所述第一模型参数取值的更新量的预设时段的时长;或者,所述第二时长为第k-1次确定第一模型参数未稳定时,得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长;其中,所述k为正整数,k≤M。
  25. 根据权利要求24所述的装置,其特征在于,所述预设条件还包括:
    若第k次确定所述第一模型参数未稳定，则得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的第三时长，所述第三时长小于第四时长；其中，所述第四时长为第k-1次确定第一模型参数稳定时，停止向所述中心服务器发送所述第一模型参数取值的更新量的预设时段的时长；或者，所述第四时长为第k-1次确定第一模型参数未稳定时，得到的用于调节下一次第一模型参数稳定时对应的预设时段时长的时长。
  26. 根据权利要求24或25所述的装置,其特征在于,
    若所述处理单元第k次根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则在n个发送周期内,所述发送单元停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,n为正整数;
    在所述n个发送周期之后,若所述处理单元第k+1次根据所述第一模型参数取值的变化量确定所述第一模型参数稳定,则在(n+m)个发送周期内,所述发送单元停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,m为正整数;
    在所述n个发送周期之后,若所述处理单元第k+1次根据所述第一模型参数取值的变化量确定所述第一模型参数未稳定,则若第k+2次所述处理单元根据所述变化量确定所述第一模型参数稳定,在(n/r+m)个发送周期内,所述发送单元停止向所述中心服务器发送所述第一模型参数取值的更新量;其中,r为大于等于2的正整数,(n/r)≥1;
    其中,所述发送周期为所述发送单元向所述中心服务器发送模型参数取值的更新量的周期。
  27. 根据权利要求26所述的装置,其特征在于,所述处理单元还用于:
    按照检测周期确定所述第一模型参数取值的变化量;所述发送周期小于所述检测周期。
  28. 根据权利要求23至27任一项所述的装置,其特征在于,所述处理单元还用于:
    若所述处理单元确定稳定的所述第一模型参数的数量与第三模型参数的数量的比例大于预设比例,则减小预设阈值的取值。
  29. 根据权利要求18至28任一项所述的装置,其特征在于,
    所述处理单元确定第一模型参数取值的变化量之前,所述接收单元接收所述中心服务器发送的所述第二模型参数取值。
  30. 一种基于模型训练的通信装置,其特征在于,所述装置包括:接收单元,处理单元,发送单元;
    所述接收单元,用于接收通信设备发送的第二模型参数取值的更新量;在预设时段内,所述第二模型参数取值的更新量中不包含第一模型参数取值的更新量;其中,所述第一模型参数为根据所述第一模型参数取值的变化量确定稳定的模型参数;
    所述处理单元,用于根据所述第二模型参数取值的更新量确定所述第二模型参数更新后的取值;
    所述发送单元,用于向所述通信设备发送所述第二模型参数更新后的取值;在预设时段内,所述第二模型参数更新后的取值中不包含第一模型参数更新后的取值。
  31. 根据权利要求30所述的装置,其特征在于,
    所述发送单元，还用于在所述预设时段之后，向所述通信设备发送所述第一模型参数取值；
    所述接收单元,还用于接收所述通信设备发送的所述第一模型参数取值的更新量。
  32. 根据权利要求30所述的装置,其特征在于,所述发送单元具体用于:
    向所述通信设备发送报文,所述报文包含第三模型参数更新后的取值与对应的标识位的取值;所述第三模型参数更新后的取值包括所述第二模型参数更新后的取值和所述第一模型参数更新后的取值;其中,在所述预设时段内,所述第一模型参数更新后的取值对应的标识位的取值用于表示所述发送单元未向所述通信设备传输所述第一模型参数更新后的取值;
    或者,向所述通信设备发送报文,所述报文包含所述第二模型参数更新后的取值与对应的标识位的取值;其中,在所述预设时段内,所述第二模型参数更新后的取值不包括所述第一模型参数更新后的取值,所述第二模型参数更新后的取值对应的标识位的取值用于表示所述发送单元向所述通信设备传输所述第二模型参数更新后的取值。
  33. 根据权利要求30至32任一项所述的装置,其特征在于,所述第一模型参数为根据所述第一模型参数取值的变化量确定稳定的模型参数,包括:若所述第一模型参数取值的变化量小于预设阈值,则确定所述第一模型参数稳定。
  34. 根据权利要求30至33任一项所述的装置,其特征在于,
    所述接收单元接收所述通信设备发送的第二模型参数取值的更新量之前,所述发送单元向所述通信设备发送所述第二模型参数取值。
  35. 一种通信设备,其特征在于,包括:
    一个或多个处理器;
    存储器;
    以及计算机程序,其中所述计算机程序被存储在所述存储器中,所述计算机程序包括指令;当所述指令被所述通信设备执行时,使得所述通信设备执行如权利要求1-12中任一项所述的基于模型训练的通信方法;或者,使得所述通信设备执行如权利要求13-17中任一项所述的基于模型训练的通信方法。
  36. 一种通信系统,其特征在于,包括:中心服务器和至少一个通信设备,所述至少一个通信设备执行如权利要求1-12中任一项所述的基于模型训练的通信方法;所述中心服务器执行如权利要求13-17中任一项所述的基于模型训练的通信方法。
  37. 一种计算机存储介质,其特征在于,包括计算机指令,当所述计算机指令在基于模型训练的通信装置上运行时,使得所述基于模型训练的通信装置执行如权利要求1-12中任一项所述的基于模型训练的通信方法;或者,使得所述通信设备执行如权利要求13-17中任一项所述的基于模型训练的通信方法。
  38. 一种芯片系统,其特征在于,包括至少一个处理器和至少一个接口电路,所述至少一个接口电路用于执行收发功能,并将指令发送给所述至少一个处理器,当所述至少一个处理器执行所述指令时,所述至少一个处理器执行如权利要求1-12中任一项所述的基于模型训练的通信方法;或者,使得所述通信设备执行如权利要求13-17中任一项所述的基于模型训练的通信方法。
PCT/CN2020/140434 2020-01-23 2020-12-28 一种基于模型训练的通信方法、装置及系统 WO2021147620A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20916011.8A EP4087198A4 (en) 2020-01-23 2020-12-28 METHOD, DEVICE AND COMMUNICATION SYSTEM BASED ON MODEL TRAINING
US17/814,073 US20220360539A1 (en) 2020-01-23 2022-07-21 Model training-based communication method and apparatus, and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010077048.8 2020-01-23
CN202010077048.8A CN113162861A (zh) 2020-01-23 2020-01-23 一种基于模型训练的通信方法、装置及系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/814,073 Continuation US20220360539A1 (en) 2020-01-23 2022-07-21 Model training-based communication method and apparatus, and system

Publications (1)

Publication Number Publication Date
WO2021147620A1 true WO2021147620A1 (zh) 2021-07-29

Family

ID=76882112

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/140434 WO2021147620A1 (zh) 2020-01-23 2020-12-28 一种基于模型训练的通信方法、装置及系统

Country Status (4)

Country Link
US (1) US20220360539A1 (zh)
EP (1) EP4087198A4 (zh)
CN (1) CN113162861A (zh)
WO (1) WO2021147620A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3143855A1 (en) * 2020-12-30 2022-06-30 Atb Financial Systems and methods for federated learning on blockchain

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3229152A1 (en) * 2016-04-08 2017-10-11 HERE Global B.V. Distributed online learning for privacy-preserving personal predictive models
CN108287763A (zh) * 2018-01-29 2018-07-17 中兴飞流信息科技有限公司 参数交换方法、工作节点以及参数服务器系统
US20180268282A1 (en) * 2017-03-17 2018-09-20 Wipro Limited. Method and system for predicting non-linear relationships
CN109117953A (zh) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 网络参数训练方法和系统、服务器、客户端及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11182674B2 (en) * 2017-03-17 2021-11-23 International Business Machines Corporation Model training by discarding relatively less relevant parameters
RU2702980C1 (ru) * 2018-12-14 2019-10-14 Самсунг Электроникс Ко., Лтд. Распределённое обучение моделей машинного обучения для персонализации
CN110262819B (zh) * 2019-06-04 2021-02-26 深圳前海微众银行股份有限公司 一种联邦学习的模型参数更新方法及装置


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4087198A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113965314A (zh) * 2021-12-22 2022-01-21 深圳市洞见智慧科技有限公司 同态加密处理方法及相关设备
CN113965314B (zh) * 2021-12-22 2022-03-11 深圳市洞见智慧科技有限公司 同态加密处理方法及相关设备

Also Published As

Publication number Publication date
EP4087198A4 (en) 2023-05-03
CN113162861A (zh) 2021-07-23
US20220360539A1 (en) 2022-11-10
EP4087198A1 (en) 2022-11-09

