WO2024094094A1 - A model training method and device
- Publication number
- WO2024094094A1 (application PCT/CN2023/129211)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gradients
- model
- target model
- updated
- target
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present application relates to the field of artificial intelligence, and in particular to a model training method and device.
- Artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence.
- Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- a federated learning system trains machine learning models based on data generated from a large number of users interacting with their devices (e.g., smartphones, etc.) without removing the data from the devices. For example, each loop selects a subset of online devices, and the current version of the machine learning model is sent to those selected devices. Each of those selected devices is tasked with computing updates to the model using their own locally generated and locally stored data. The model updates are then sent back to the server, averaged, and applied to the server's model to produce a new version of the model for the next iteration of users (e.g., the next subset of devices).
- Federated learning is divided into two steps: model download and model upload.
- The central node sends the model to the terminal devices through the network; each terminal device uses its local data to calculate the gradient of the model; each terminal device, acting as a distributed node, encrypts the gradient and uploads it to the central node; the central node aggregates the gradients from the distributed nodes and uses a parameter averaging algorithm to update the parameters of the central node model.
- The server sends an initial model to each end-side device. Subsequently, the end-side devices use their local data to perform several model iterations and then feed back the model changes (that is, the gradients corresponding to the parameters) to the server.
- The server performs a weighted average of the gradients fed back, uses the resulting average gradient to update the initial model, and sends the updated model to each end-side user, which starts the next round of iteration.
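The baseline round described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the toy local objective, the single local gradient step (the text mentions several local iterations), the weighting by local sample count, and the names `client_gradient`, `weighted_average` and `federated_round` are all assumptions made for the example.

```python
def client_gradient(weights, local_samples):
    # Assumed toy objective: 0.5 * mean ||w - s||^2 over the local samples s,
    # whose gradient with respect to w is mean(w - s).
    n = len(local_samples)
    return [sum(w - s[i] for s in local_samples) / n
            for i, w in enumerate(weights)]


def weighted_average(gradients, sample_counts):
    # Weight each device's gradient by its number of local samples (assumption).
    total = sum(sample_counts)
    dim = len(gradients[0])
    return [sum(g[i] * c for g, c in zip(gradients, sample_counts)) / total
            for i in range(dim)]


def federated_round(weights, clients, lr=0.5):
    # 1. "Download": every device receives the current weights.
    # 2. Each device computes a gradient on its own data.
    grads = [client_gradient(weights, samples) for samples in clients]
    counts = [len(samples) for samples in clients]
    # 3. "Upload" and aggregation on the server.
    avg = weighted_average(grads, counts)
    # 4. The server applies the averaged gradient to produce the next model.
    return [w - lr * g for w, g in zip(weights, avg)]


clients = [
    [[1.0, 1.0]],                              # device with one sample
    [[3.0, 5.0], [3.0, 5.0], [3.0, 5.0]],      # device with three samples
]
new_weights = federated_round([0.0, 0.0], clients, lr=0.5)
```

Note that every round moves a full-size gradient in both directions, which is the communication cost the application sets out to reduce.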
- the problem with the existing federated training framework is that when the user data is not independent and identically distributed, the server cannot obtain an effective model gradient update direction due to the large differences in the gradient directions after iteration of each user node, which leads to slow convergence of the server model. A large number of gradients need to be transmitted back and forth between the end-side user and the server, consuming a large amount of communication traffic. In the current network environment, the growth rate of the overall network bandwidth is much lower than the growth rate of the neural network model size. Therefore, how to effectively reduce the communication overhead is an urgent problem to be solved in federated learning.
- the present application provides a model training method that can effectively reduce the amount of gradient transmission from a server to a terminal.
- the present application provides a model training method, which is applied to a server, and the server communicates with multiple terminals.
- the method includes: obtaining multiple first gradients and multiple second gradients; the multiple first gradients are gradients corresponding to multiple first parameters in the target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters are updated in the previous round of iteration of federated learning, and the multiple second parameters are not updated in the previous round of iteration of federated learning; selecting partial gradients from the multiple first gradients and the multiple second gradients, and the partial gradients are used to update the target model in the current round of iteration of federated learning; transmitting the updated information of the target model to multiple first devices; wherein the multiple first devices belong to the multiple terminals.
- Because only some of the parameters of the target model are selected and updated in each round, the amount of gradient transmission from the server to the terminal can be effectively reduced.
- The partial gradients are the largest gradients among the plurality of first gradients and the plurality of second gradients.
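A minimal sketch of this selection step is given below. It assumes that "largest" means largest absolute value, that gradients are keyed by parameter index, and that the first and second gradients cover disjoint parameters; the function names, the count `k` and the learning rate `lr` are hypothetical and do not come from the application.

```python
def select_partial_gradients(first_grads, second_grads, k):
    # first_grads / second_grads: dict mapping parameter index -> gradient value.
    merged = {**first_grads, **second_grads}
    # Keep the k entries with the largest magnitude (assumed ranking criterion).
    top = sorted(merged, key=lambda i: abs(merged[i]), reverse=True)[:k]
    return {i: merged[i] for i in top}


def apply_partial_update(weights, partial, lr):
    # Only the selected parameters change; all others keep their old values.
    new_weights = list(weights)
    for i, g in partial.items():
        new_weights[i] -= lr * g
    return new_weights


first = {0: 0.1, 1: -0.9}     # parameters updated in the previous round
second = {2: 0.5, 3: -0.05}   # parameters not updated in the previous round
partial = select_partial_gradients(first, second, k=2)
updated = apply_partial_update([1.0, 1.0, 1.0, 1.0], partial, lr=1.0)
```

Because only `k` parameters change per round, the server has at most `k` values to describe when it sends the update to the terminals.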
- The multiple first gradients are obtained by aggregating multiple third gradients sent from multiple second devices in the current iteration round; the multiple second gradients are obtained according to the gradients corresponding to the multiple second parameters determined in the previous iteration round and multiple fourth gradients sent by the multiple second devices in the current iteration round; the multiple second devices belong to the multiple terminals.
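One way to read this aggregation is sketched below. The text does not say how the gradients carried over from the previous round are combined with the newly received fourth gradients; the sketch assumes per-parameter averaging across devices followed by simple addition to the carried-over value. The names `aggregate`, `build_gradients` and `carried_second` are hypothetical.

```python
def aggregate(reports):
    # Average each parameter's gradient over the devices that reported it.
    sums, counts = {}, {}
    for report in reports:
        for idx, g in report.items():
            sums[idx] = sums.get(idx, 0.0) + g
            counts[idx] = counts.get(idx, 0) + 1
    return {idx: sums[idx] / counts[idx] for idx in sums}


def build_gradients(third_reports, fourth_reports, carried_second):
    # First gradients: aggregated third gradients (parameters updated last round).
    first = aggregate(third_reports)
    # Second gradients: gradients kept from the previous round for parameters
    # that were not updated, merged (here: added, an assumption) with the
    # aggregated fourth gradients of the current round.
    second = dict(carried_second)
    for idx, g in aggregate(fourth_reports).items():
        second[idx] = second.get(idx, 0.0) + g
    return first, second


third_reports = [{0: 1.0}, {0: 3.0, 1: 2.0}]
fourth_reports = [{2: 0.5}, {2: 1.5}]
carried = {2: 0.25, 3: -0.5}
first, second = build_gradients(third_reports, fourth_reports, carried)
```

Carrying the unselected gradients forward means that a parameter skipped in one round can still be selected later once its accumulated gradient becomes large enough.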
- the information of the target model includes: the parameter update amount of the updated target model relative to the first model, and the first model is a model obtained by updating the target model in the iteration round before the current round of iteration; before obtaining the multiple first gradients and the multiple second gradients, the method also includes: broadcasting the updated parameter values of the first model to the multiple terminals.
- the number of gradient differences between the latest target model and the end-side user model is reduced, thereby ensuring that the compression effect of downlink communication volume is not degraded.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a second model, where the second model is an initial model of the target model.
- the information of the target model includes: the parameter update amount of the updated target model relative to the third model; the multiple first devices include a first target device; the third model is a model obtained by updating the target model by the first target device in an iteration round before the current iteration round; the information of the updated target model is transmitted to the multiple first devices, including: the parameter update amount of the updated target model relative to the third model is transmitted to the first target device.
- the iteration round before the current iteration round is specifically: the iteration round in which the first target device most recently updated the target model before the current iteration round.
- The server can maintain a list of the target models currently held by the end-side users and, when sending the update amount, compute the difference relative to the model parameters held by the corresponding end-side user, which improves the communication compression of the transmitted gradients.
- Because the server maintains the latest model parameters of each end-side user, the difference between the latest target model and the end-side user model is smaller when the gradient is sent, which maximizes the compression effect on the downlink communication volume.
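The per-device bookkeeping described above might look like the following sketch. It assumes the server stores a full copy of the model last sent to each device and transmits only the entries that differ; the class `DeltaServer`, its methods and the helper `apply_delta` are hypothetical names introduced for illustration.

```python
class DeltaServer:
    def __init__(self, initial_weights, device_ids):
        self.weights = list(initial_weights)
        # Model each device is known to hold (the per-device reference model).
        self.device_models = {d: list(initial_weights) for d in device_ids}

    def apply_partial_update(self, partial, lr):
        # Update only the selected parameters of the server model.
        for i, g in partial.items():
            self.weights[i] -= lr * g

    def delta_for(self, device_id):
        # Sparse update amount relative to what this device already holds.
        base = self.device_models[device_id]
        delta = {i: w - b
                 for i, (w, b) in enumerate(zip(self.weights, base))
                 if w != b}
        # Record that the device will hold the latest model after applying it.
        self.device_models[device_id] = list(self.weights)
        return delta


def apply_delta(local_weights, delta):
    # Device side: add the received update amount to the local model.
    new_weights = list(local_weights)
    for i, d in delta.items():
        new_weights[i] += d
    return new_weights


server = DeltaServer([0.0, 0.0, 0.0, 0.0], ["a", "b"])
server.apply_partial_update({1: 1.0}, lr=0.5)
delta_a_round1 = server.delta_for("a")   # device "a" takes part in round 1
server.apply_partial_update({3: 2.0}, lr=0.5)
delta_a_round2 = server.delta_for("a")   # only the newest change is sent
delta_b_round2 = server.delta_for("b")   # "b" skipped round 1, gets both changes
```

A device that participates often receives small deltas, while a device that was idle for several rounds receives the accumulated difference. In floating point, adding a delta to the base does not always reproduce the server value bit for bit, so a real system would need to decide how to handle that drift.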
- the present application provides a system, the system comprising a server and a plurality of terminals, the server communicating with the plurality of terminals, wherein:
- the server is used to obtain multiple first gradients and multiple second gradients; the multiple first gradients are gradients corresponding to multiple first parameters in the target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters are updated in the previous round of iteration of federated learning, and the multiple second parameters are not updated in the previous round of iteration of federated learning;
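- the server is further used to select partial gradients from the multiple first gradients and the multiple second gradients, the partial gradients being used to update the target model in the current round of iteration of federated learning, and to transmit the updated information of the target model to a plurality of first devices; wherein the plurality of first devices and the plurality of second devices belong to the plurality of terminals.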
- The partial gradients are the largest gradients among the plurality of first gradients and the plurality of second gradients.
- multiple second devices among the multiple terminals are used to send multiple third gradients and multiple fourth gradients to the server;
- the multiple third gradients are gradients corresponding to multiple first parameters in the target model;
- the multiple fourth gradients are gradients corresponding to multiple second parameters in the target model;
- the multiple first parameters are updated in the previous round of iteration of federated learning;
- the multiple second parameters are not updated in the previous round of iteration of federated learning;
- the server is specifically configured to aggregate the multiple third gradients to obtain the multiple first gradients;
- the server is further configured to aggregate the multiple fourth gradients and merge them with the gradients corresponding to the multiple second parameters determined in the previous round of iteration to obtain the multiple second gradients.
- the multiple second devices among the multiple terminals are specifically used to determine multiple gradients corresponding to the target model in the previous round of iteration of federated learning; and randomly select the multiple third gradients and the multiple fourth gradients from the multiple gradients.
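The device-side random selection can be sketched as below. The sketch assumes that third gradients are drawn from parameters updated in the previous round and fourth gradients from parameters that were not, which matches the definitions given earlier; the sample sizes `n_third` and `n_fourth` and the function name are hypothetical.

```python
import random


def pick_gradients_to_report(gradients, updated_last_round, n_third, n_fourth, rng):
    # Split parameter indices by whether they were updated in the previous round.
    updated = [i for i in range(len(gradients)) if i in updated_last_round]
    stale = [i for i in range(len(gradients)) if i not in updated_last_round]
    # Uniform random selection without replacement from each group.
    third_idx = rng.sample(updated, min(n_third, len(updated)))
    fourth_idx = rng.sample(stale, min(n_fourth, len(stale)))
    third = {i: gradients[i] for i in third_idx}
    fourth = {i: gradients[i] for i in fourth_idx}
    return third, fourth


rng = random.Random(0)  # fixed seed so the example is repeatable
gradients = [0.1, 0.2, 0.3, 0.4, 0.5]
third, fourth = pick_gradients_to_report(
    gradients, updated_last_round={0, 1}, n_third=1, n_fourth=2, rng=rng)
```

Sending only a random subset reduces uplink traffic, at the cost of each device reporting a partial view of its gradient in any single round.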
- the plurality of second devices in the plurality of terminals are specifically configured to apply lossless compression or linear unbiased compression to the information indicating the plurality of third gradients and the plurality of fourth gradients, and to send the compression result to the server.
- Linear unbiased compression refers to a compression method that is both linear and unbiased.
- Linear means that the compressed data can be subjected to linear operations and then decompressed, and the result obtained is the same as the result obtained by performing the same linear operation on the uncompressed original data.
- Unbiased compression means that the error between the compressed result and the original data is zero mean.
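As an illustration of these two properties, the sketch below uses random sparsification with a shared index set and rescaling. This scheme is chosen here only as a well-known example of a linear, unbiased compressor; the application does not name a specific scheme, and the function names are hypothetical. Linearity holds in the sense that summing compressed vectors gives the same payload as compressing the sum, and unbiasedness holds in expectation over the random choice of indices.

```python
from itertools import combinations


def compress(vec, idx):
    # Keep only the coordinates in idx, rescaled by dim / k so that the
    # reconstruction is correct on average over a uniformly random idx.
    scale = len(vec) / len(idx)
    return [vec[i] * scale for i in idx]


def decompress(payload, idx, dim):
    # Put the kept coordinates back in place; all others are zero.
    out = [0.0] * dim
    for value, i in zip(payload, idx):
        out[i] = value
    return out


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def mean_reconstruction(vec, k):
    # Average the reconstruction over every equally likely index set of size k.
    dim = len(vec)
    subsets = list(combinations(range(dim), k))
    total = [0.0] * dim
    for idx in subsets:
        total = add(total, decompress(compress(vec, idx), idx, dim))
    return [t / len(subsets) for t in total]


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, -1.0, 2.0, 0.0]
idx = (0, 2)  # index set shared by all devices in the round (assumption)

# Linearity: the server can add compressed payloads without decompressing first.
summed_payload = add(compress(a, idx), compress(b, idx))
# Unbiasedness: the reconstruction error has zero mean over the random index set.
expected = mean_reconstruction(a, k=2)
```

The linearity property is what lets a server aggregate compressed uploads directly, decompressing once after the sum rather than once per device.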
- the information of the target model includes: an update amount of parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
- the server is further configured to broadcast the updated parameter values of the first model to the multiple terminals before acquiring the multiple first gradients and the multiple second gradients.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a second model, where the second model is an initial model of the target model.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by updating the target model by the first target device in an iteration round before the current iteration round;
- the server is specifically configured to transmit the parameter update amount of the updated target model relative to the third model to the first target device.
- the iteration round before the current iteration round is specifically: the iteration round in which the first target device most recently updated the target model before the current iteration round.
- the present application provides a model training device, which is applied to a server, the server communicates with multiple terminals, and the device includes:
- An acquisition module used to acquire a plurality of first gradients and a plurality of second gradients; the plurality of first gradients are gradients corresponding to a plurality of first parameters in a target model; the plurality of second gradients are gradients corresponding to a plurality of second parameters in the target model; the plurality of first parameters are updated in a previous round of iteration of federated learning, and the plurality of second parameters are not updated in a previous round of iteration of federated learning;
- a gradient selection module configured to select partial gradients from the plurality of first gradients and the plurality of second gradients, wherein the partial gradients are used to update the target model in a current round of iteration of federated learning;
- a sending module is used to transmit the updated information of the target model to multiple first devices; wherein the multiple first devices belong to the multiple terminals.
- The partial gradients are the largest gradients among the plurality of first gradients and the plurality of second gradients.
- the multiple first gradients are obtained by aggregating multiple third gradients sent from multiple second devices in the current iteration round; the multiple second gradients are obtained according to the gradients corresponding to the multiple second parameters determined in the previous iteration round, and the multiple fourth gradients sent by the multiple second devices in the current iteration round; the multiple second devices belong to the multiple terminals.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
- the sending module is further used to: before acquiring the multiple first gradients and the multiple second gradients, broadcast the updated parameter values of the first model to the multiple terminals.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a second model, where the second model is an initial model of the target model.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by updating the target model by the first target device in an iteration round before the current iteration round;
- the sending module is specifically used to transmit the parameter update amount of the updated target model relative to the third model to the first target device.
- the iteration round before the current iteration round is specifically: the iteration round in which the first target device most recently updated the target model before the current iteration round.
- an embodiment of the present application provides a model training device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to execute the above-mentioned first aspect and any optional method of the first aspect.
- an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored.
- When the computer program is run on a computer, the computer executes the above-mentioned first aspect or any optional method thereof.
- an embodiment of the present application provides a computer program, which, when executed on a computer, enables the computer to execute the above-mentioned first aspect and any optional method thereof.
- the present application provides a chip system, which includes a processor for supporting an execution device in implementing the functions involved in the above aspects, such as sending or processing the data or information involved in the above methods.
- the chip system also includes a memory, which is used to store program instructions and data necessary for the execution device or training device.
- the chip system can be composed of chips, or it can include chips and other discrete devices.
- FIG. 1 is a schematic diagram of a structure of an artificial intelligence main framework.
- FIG. 2 is a schematic diagram of a computing system for performing model training in an embodiment of the present application.
- FIG. 3 is a schematic diagram of a system architecture provided in an embodiment of the present application.
- FIG. 4 is a schematic diagram of the architecture of a model training method provided in an embodiment of the present application.
- FIG. 5 is a flowchart of a model training method provided in an embodiment of the present application.
- FIG. 6 to FIG. 9 are examples of a model training method provided in an embodiment of the present application.
- FIG. 10 is a schematic diagram of a model training device provided in an embodiment of the present application.
- FIG. 11 is a schematic diagram of a structure of an execution device provided in an embodiment of the present application.
- FIG. 12 is a schematic diagram of a structure of a training device provided in an embodiment of the present application.
- FIG. 13 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
- Figure 1 shows a structural diagram of the main framework of artificial intelligence.
- the following is an explanation of the above artificial intelligence theme framework from two dimensions: “intelligent information chain” (horizontal axis) and “IT value chain” (vertical axis).
- The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data passes through the stages "data-information-knowledge-wisdom".
- The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (provision and processing technology implementation) to the industrial ecology of the system.
- the infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and supports it through the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform includes distributed computing frameworks and networks and other related platform guarantees and support, which can include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, text, and IoT data of traditional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
- Model training usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
- machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.
- Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- some general capabilities can be further formed based on the results of the model training, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
- Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical applications. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, safe cities, etc.
- FIG2 is a schematic diagram of a computing system for performing model training in an embodiment of the present application.
- the computing system includes a terminal device 102 (hereinafter referred to as a first device and a second device) and a server 130 (also referred to as a central node) coupled via network communication.
- the terminal device 102 may be any type of computing device, such as, for example, a personal computing device (e.g., a laptop or desktop computer), a mobile computing device (e.g., a smart phone or a tablet computer), a game console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
- the terminal device 102 may include a processor 112 and a memory 114.
- the processor 112 may be any suitable processing device (e.g., a processor core, a microprocessor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a controller, a microcontroller, etc.).
- the memory 114 may include, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM).
- the memory 114 may store data 116 and instructions 118 executed by the processor 112 to enable the terminal device 102 to perform operations.
- the memory 114 can store one or more models 120.
- the model 120 can be or can additionally include various machine learning models, such as a neural network (e.g., a deep neural network) or other types of machine learning models, including nonlinear models and/or linear models.
- the neural network can include a feedforward neural network, a recurrent neural network (e.g., a long short-term memory recurrent neural network), a convolutional neural network, or other forms of neural networks.
- one or more models 120 may be received from server 130 via network 180, stored in memory 114, and then used or otherwise implemented by one or more processors 112.
- the terminal device 102 may also include one or more user input components 122 for receiving user input.
- the user input component 122 may be a touch-sensitive component (e.g., a touch-sensitive display screen or touchpad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
- a touch-sensitive component can be used to implement a virtual keyboard.
- Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
- the terminal device 102 may further include a communication interface 123, through which the terminal device 102 may be communicatively connected to the server 130.
- the server 130 may include a communication interface 133, through which the terminal device 102 may be communicatively connected to the communication interface 133 of the server 130, thereby realizing data interaction between the terminal device 102 and the server 130.
- the server 130 may include a processor 132 and a memory 134.
- the processor 132 may be any suitable processing device (e.g., a processor core, a microprocessor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a controller, a microcontroller, etc.).
- the memory 134 may include, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM).
- the memory 134 may store data 136 and instructions 138 executed by the processor 132 to enable the server 130 to perform operations.
- the memory 134 can store one or more machine learning models 140.
- the model 140 can be or can additionally include various machine learning models.
- Example machine learning models include neural networks or other multi-layer nonlinear models.
- Example neural networks include feedforward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
- Figure 3 is a schematic diagram of a system 100 architecture provided in an embodiment of the present application.
- the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with an external device.
- the user can input data to the I/O interface 112 through the client device 140.
- the input data may include: various tasks to be scheduled, callable resources and other parameters in the embodiment of the present application.
- When the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs calculation and other related processing (such as implementing the function of the neural network in the present application), the execution device 110 can call the data, code, etc. in the data storage system 150 for the corresponding processing, and can also store the data, instructions, etc. obtained by the corresponding processing in the data storage system 150.
- the I/O interface 112 returns the processing result to the client device 140 so as to provide it to the user.
- the training device 120 can generate corresponding target models/rules based on different training data for different goals or different tasks.
- the corresponding target models/rules can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
- the user can manually give input data, and the manual giving can be operated through the interface provided by the I/O interface 112.
- the client device 140 can automatically send input data to the I/O interface 112. If automatically sending input data requires the user's authorization, the user can set the corresponding permission in the client device 140.
- the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form can be a specific method such as display, sound, action, etc.
- the client device 140 can also be used as a data acquisition terminal to collect the input data fed to the I/O interface 112 and the output results returned by the I/O interface 112, as shown in the figure, as new sample data, and store them in the database 130.
- alternatively, the I/O interface 112 may directly store the input data it receives and the output results it returns as new sample data in the database 130.
- FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 can also be placed in the execution device 110.
- a neural network can be obtained by training according to the training device 120.
- the embodiment of the present application also provides a chip, which includes a neural network processor NPU.
- the chip can be set in the execution device 110 as shown in Figure 3 to complete the calculation work of the calculation module 111.
- the chip can also be set in the training device 120 as shown in Figure 3 to complete the training work of the training device 120 and output the target model/rule.
- the neural network processor (NPU) is mounted on the main central processing unit (host CPU) as a coprocessor, and the host CPU assigns tasks to it.
- the core part of NPU is the operation circuit, and the controller controls the operation circuit to extract data from the memory (weight memory or input memory) and perform operations.
- the arithmetic circuit includes multiple processing engines (PEs) internally.
- the arithmetic circuit is a two-dimensional systolic array.
- the arithmetic circuit can also be a one-dimensional systolic array or other electronic devices capable of performing mathematical operations such as multiplication and addition.
- the arithmetic circuit is a general-purpose matrix processor.
- the operation circuit takes the corresponding data of matrix B from the weight memory and caches it on each PE in the operation circuit.
- the operation circuit takes the matrix A data from the input memory 1 and performs matrix operations with matrix B.
- the partial results or final results of the matrix are stored in the accumulator.
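- as a software analogy only (it does not model the hardware dataflow of the PEs), the accumulation of partial results for the product of matrix A and matrix B can be sketched as follows. The function name `matmul_accumulate` and the loop order are illustrative assumptions, not taken from the application.

```python
def matmul_accumulate(A, B):
    """Multiply A (rows x inner) by B (inner x cols), accumulating partial results."""
    rows, inner, cols = len(A), len(B), len(B[0])
    # The accumulator holds partial results until all inner steps are done.
    acc = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):
        # Each step k adds one partial product to every accumulator cell.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += A[i][k] * B[k][j]
    return acc


A = [[1.0, 2.0], [3.0, 4.0]]   # stands in for data read from the input memory
B = [[5.0, 6.0], [7.0, 8.0]]   # stands in for weights read from the weight memory
C = matmul_accumulate(A, B)
```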
- the vector calculation unit can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
- the vector calculation unit can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.
- the vector computation unit can store the processed output vector to a unified buffer.
- the vector computation unit can apply a nonlinear function to the output of the computation circuit, such as a vector of accumulated values, to generate an activation value.
- the vector computation unit generates a normalized value, a merged value, or both.
- the processed output vector can be used as an activation input to the computation circuit, such as for use in a subsequent layer in a neural network.
- the unified memory is used to store input data and output data.
- the weight data is directly transferred from the external memory to the input memory 1 and/or the unified memory through the direct memory access controller (DMAC), the weight data in the external memory is stored in the weight memory, and the data in the unified memory is stored in the external memory.
- the bus interface unit (BIU) is used to enable interaction between the main CPU, DMAC and instruction fetch memory through the bus.
- an instruction fetch buffer is connected to the controller and is used to store the instructions used by the controller.
- the controller is used to call the instructions cached in the memory to control the working process of the computing accelerator.
- the unified memory, input memory 1, weight memory and instruction fetch memory are all on-chip memories
- the external memory is a memory outside the NPU, which can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or other readable and writable memories.
- a neural network may be composed of neural units, and a neural unit may refer to an operation unit that takes $x_s$ and an intercept of 1 as input; the output of the operation unit may be: $f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where $s = 1, 2, \ldots, n$.
- n is a natural number greater than 1
- Ws is the weight of xs
- b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal.
- the output signal of the activation function can be used as the input of the next convolutional layer.
- the activation function can be a sigmoid function.
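- the neural unit described above can be sketched in a few lines. This is a minimal illustration assuming a sigmoid activation; the names `sigmoid` and `neural_unit` are hypothetical and chosen only for this example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs, ws, b, f=sigmoid):
    """Output of one neural unit: f(sum over s of W_s * x_s, plus b)."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return f(z)


# The weighted sum is 0.5 * 1.0 + 0.25 * (-2.0) + 0.0 = 0.0, and sigmoid(0) = 0.5.
out = neural_unit(xs=[1.0, -2.0], ws=[0.5, 0.25], b=0.0)
```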
- a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- the work of each layer in the neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. From a physical level, the work of each layer in a neural network can be understood as completing the transformation from input space to output space (i.e., from row space to column space of a matrix) through five operations on the input space (a set of input vectors). These five operations include: 1. Dimension increase/reduction; 2. Enlargement/reduction; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are completed by $W \cdot \vec{x}$, operation 4 is completed by $+b$, and operation 5 is implemented by $a()$.
- space is used here because the classified object is not a single thing, but a class of things, and space refers to the collection of all individuals of this class of things.
- W is a weight vector, and each value in the vector represents the weight value of a neuron in this layer of the neural network.
- the vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
- the purpose of training a neural network is to eventually obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by many layers of vectors W). Therefore, the training process of a neural network is essentially about learning how to control spatial transformations, and more specifically, learning the weight matrix.
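- the per-layer transformation (a weight matrix applied to the input, a translation by b, then a nonlinear "bending" function a) can be illustrated as below. The choice of `math.tanh` for a and the name `layer_forward` are assumptions made for this sketch.

```python
import math


def layer_forward(W, x, b, a=math.tanh):
    """Compute y = a(W . x + b) for one layer; W is given as a list of rows."""
    return [
        a(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 x 2 matrix: raises the dimension from 2 to 3
x = [0.5, -0.5]
b = [0.0, 1.0, 0.0]                       # translation
y = layer_forward(W, x, b)
```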
- Neural networks can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward transmission of the input signal to the output will generate error loss, and the error loss information is back-propagated to update the parameters in the initial neural network model, so that the error loss converges.
- the back propagation algorithm is a back propagation movement dominated by error loss, which aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
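- the idea of propagating the error loss backwards to update parameters can be shown on the smallest possible case. The sketch below assumes a single linear unit with a mean squared error loss rather than a multi-layer network; the function name, learning rate and sample data are hypothetical.

```python
def train_linear_unit(samples, lr=0.1, epochs=50):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    losses = []
    n = len(samples)
    for _ in range(epochs):
        # Forward pass: prediction error for every sample.
        errors = [(w * x + b) - y for x, y in samples]
        losses.append(sum(e * e for e in errors) / n)
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Parameter update in the direction that reduces the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


w, b, losses = train_linear_unit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

- with these settings the recorded loss shrinks from round to round, which is the convergence behaviour the text refers to.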
- federated learning is a privacy-preserving distributed machine learning modeling method. Compared with traditional centralized modeling, in federated learning each party does not directly share data, but instead performs distributed training by sharing models.
- the current mainstream federated learning model mainly consists of several edge nodes and a central node.
- the edge node receives the model distributed by the central node, and uses local data for training on this basis, and then sends the model to the central node; the central node collects the models of each edge node, aggregates them into a global model, and distributes them to each edge node, starting a new round of training iterations.
- FIG. 4 is a schematic diagram of the architecture of a model training method provided in an embodiment of the present application.
- the architecture provided in an embodiment of the present application includes: a cloud-side central node, which may be, for example, a cloud-side server.
- A1, A2, ... are distributed nodes of type A (which may be referred to as terminals in this application), such as mobile phone products held by users.
- B1, B2, ... are distributed nodes of type B, such as personal computers held by users.
- when the administrator of the distributed node (such as a user of a mobile phone or computer) agrees, the administrator voluntarily shares the data generated in the process of daily use of the device under the condition of privacy protection and joins the model training plan, and the device becomes a distributed node in the architecture.
- the system in this embodiment may also include more types of distributed nodes, such as smart watches, etc.
- the distributed node will not upload data to the central node, but only save data locally.
- the distributed node is connected to the cloud server via a communication network.
- the cloud-side central node can run a large model, while each distributed node can only run a small model due to hardware capabilities, and A and B can have different model training capabilities.
- the server sends an initial model to each end side. Subsequently, different end sides use local data to perform several model iterations and then feed back the model changes (that is, the gradients corresponding to the parameters) to the server.
- the server performs a weighted average of the feedback gradients, uses the obtained average gradient to update the initial model, and sends the updated model to each end side user, restarting the next round of iteration.
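- one server-side step of this baseline framework might look like the following sketch. Weighting by local sample count, the learning rate of 1.0 and the subtraction convention are assumptions; the text only states that a weighted average of the fed-back gradients is used to update the model.

```python
def weighted_average(gradients, weights):
    """Weighted average of per-device gradient vectors."""
    total = sum(weights)
    dim = len(gradients[0])
    return [
        sum(w * g[i] for g, w in zip(gradients, weights)) / total
        for i in range(dim)
    ]


def server_update(model, gradients, weights, lr=1.0):
    """Apply the averaged gradient to the current model."""
    avg = weighted_average(gradients, weights)
    return [p - lr * g for p, g in zip(model, avg)]


model = [1.0, 1.0]
device_grads = [[0.2, 0.0], [0.6, 0.4]]  # gradients fed back by two end sides
sample_counts = [1, 3]                   # hypothetical weights
new_model = server_update(model, device_grads, sample_counts)
```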
- the problem with the existing federated training framework is that when the user data is not independent and identically distributed, the server cannot obtain an effective model gradient update direction due to the large differences in the gradient directions after iteration of each user node, which leads to slow convergence of the server model. A large number of gradients need to be transmitted back and forth between the end-side user and the server, consuming a large amount of communication traffic. In the current network environment, the growth rate of the overall network bandwidth is much lower than the growth rate of the neural network model size. Therefore, how to effectively reduce the communication overhead is an urgent problem to be solved in federated learning.
- FIG. 5 is a flow chart of a model training method provided in an embodiment of the present application.
- the model training method provided in an embodiment of the present application includes:
- the server obtains multiple first gradients and multiple second gradients; the multiple first gradients are gradients corresponding to multiple first parameters in the target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters are updated in the previous round of iteration of federated learning, and the multiple second parameters are not updated in the previous round of iteration of federated learning.
- the target model may be a model training object for federated learning, and the target model may include a neural network or other multi-layer nonlinear models.
- the neural network may include a feedforward neural network, a deep neural network, a recursive neural network, and a convolutional neural network.
- after receiving the initial model of the target model sent by the server, multiple devices can use local data to train the target model and obtain gradients during the training process, and then upload the gradients to the server.
- in order to protect the privacy of information during the transmission process, the device can upload encrypted gradients to the server.
- the server may receive gradients sent by some of the multiple devices, and select some gradients from the gradients (and the gradient errors in the previous iteration, that is, the gradients not used for model updating) to perform model updating.
- the target model may include parameter A, parameter B, parameter C, parameter D, and parameter E.
- the server may receive the gradient sent by the terminal and aggregate it (aggregation may be weighted average) to obtain the gradient corresponding to parameter A, the gradient corresponding to parameter B, and the gradient corresponding to parameter C.
- Parameter C and parameter D are the parameters updated when the server iterates and updates the model in the previous round
- parameter A, parameter B, and parameter E are the parameters not updated when the server iterates and updates the model in the previous round.
- the server may fuse (for example, sum) the gradients corresponding to parameter A and parameter B obtained by aggregating the gradients sent by the terminals in the current iteration round with the gradients corresponding to parameter A and parameter B left over from the previous round, to obtain the gradients of parameter A and parameter B in the current iteration round.
- the server may use the gradient of parameter C obtained by aggregating the gradient sent by the terminal as the gradient of parameter C based on the current iteration round.
- the server may use the gradient corresponding to parameter E when the server iterates and updates the model in the previous round as the gradient of parameter E based on the current iteration round.
- the server may select the gradients corresponding to some parameters from parameter A, parameter B, parameter C, and parameter E to update the model.
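- the parameter A to E example can be traced with a small dictionary-based sketch. The numeric values and the name `fuse_with_residual` are hypothetical, and fusion is assumed to be a plain sum (one of the options the text mentions).

```python
def fuse_with_residual(aggregated, residual):
    """Sum this round's aggregated gradients with gradients left unused last round."""
    fused = dict(residual)
    for name, grad in aggregated.items():
        fused[name] = fused.get(name, 0.0) + grad
    return fused


# Aggregated from the terminals in the current round (A, B and C were reported).
aggregated = {"A": 0.5, "B": -0.25, "C": 1.0}
# Left over from the previous round: A, B and E were not updated then.
# C and D were updated last round, so they carry no leftover gradient.
residual = {"A": 0.25, "B": -0.5, "E": 0.75}

candidates = fuse_with_residual(aggregated, residual)
```

- the server would then pick a subset of `candidates` for the actual model update; D does not appear because it has neither a new gradient nor a leftover one.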
- the embodiment of the present application is described by taking the example that the multiple first parameters are updated in the previous round of iteration of federated learning, and the multiple second parameters are not updated in the previous round of iteration of federated learning.
- multiple second devices among the multiple devices may obtain the gradients of multiple parameters in the target model during a certain iteration, for example, the gradients corresponding to multiple first parameters in the target model (that is, multiple third gradients) may be obtained, and the gradients corresponding to multiple second parameters in the target model (that is, multiple fourth gradients) may be obtained.
- the multiple second devices may transmit the multiple third gradients and the multiple fourth gradients to the server.
- the server may obtain multiple first gradients by aggregating the multiple third gradients sent from multiple second devices in the current iteration round; the server may obtain multiple gradients by aggregating the multiple fourth gradients sent from multiple second devices in the current iteration round, and fuse the multiple gradients with the gradients corresponding to the multiple second parameters determined in the previous iteration round to obtain multiple second gradients.
- the momentum gradient update method may also be used. That is, the gradient formed by the current merger is no longer a simple sum of the end-side gradients, but a weighted sum of the current gradient and the previous round of gradients.
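- a weighted sum of the current and previous gradients could be written as below. The specific weighting `beta * previous + (1 - beta) * current` and the default `beta` are assumptions; the text only says that a weighted sum is used.

```python
def momentum_merge(current, previous, beta=0.9):
    """Element-wise weighted sum of the previous and current gradients."""
    return [beta * p + (1.0 - beta) * c for c, p in zip(current, previous)]


merged = momentum_merge(current=[1.0, -2.0], previous=[0.0, 2.0], beta=0.5)
```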
- the multiple second devices among the multiple terminals are specifically used to determine multiple gradients corresponding to the target model in the previous round of iteration of federated learning; and randomly select the multiple third gradients and the multiple fourth gradients from the multiple gradients.
- after receiving the model parameters sent by the server, the client can use local data to update the model, calculate the updated gradient after the local iteration round ends, and use the rand-k method to randomly select k gradients from the updated gradients and upload them to the server.
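- rand-k selection on the client can be sketched as follows. Returning an index-to-value mapping is an assumed upload format, and no rescaling of the kept values is applied because the text does not mention one.

```python
import random


def rand_k(gradient, k, rng=None):
    """Keep k randomly chosen entries of the gradient as {index: value}."""
    if rng is None:
        rng = random.Random()
    k = min(k, len(gradient))
    chosen = sorted(rng.sample(range(len(gradient)), k))
    return {i: gradient[i] for i in chosen}


local_gradient = [0.1, -0.4, 0.0, 0.9, 0.3]
upload = rand_k(local_gradient, k=2, rng=random.Random(0))
```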
- the plurality of second devices in the plurality of terminals are specifically configured to perform lossless compression or linear unbiased compression on the information indicating the plurality of third gradients and the plurality of fourth gradients, and send the compression result to the server.
- FIG. 6 is a schematic diagram of interaction between a server and a terminal.
- the following describes an embodiment of further compressing uplink traffic using a linear unbiased compression scheme.
- the interaction process is shown in FIG. 7 below.
- the difference from the first embodiment is that after the end side performs rand-k compression, the linear unbiased compression scheme is continued to be used to compress the traffic (for example, using the sketch compression method).
- the linear unbiased compression scheme can be built into the network equipment, so as to make full use of the processing capacity of the network equipment to reduce the network transmission pressure, thereby further reducing the communication volume uploaded from the end side to the server.
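- to illustrate why linearity matters for in-network processing, here is a single-row count-sketch style compressor, used as a stand-in for the sketch compression method mentioned above (the exact scheme is not specified in the text). All names, the table width and the seeding are assumptions. Because the sketch is linear, sketches from different end sides can be added before they reach the server.

```python
import random


def make_hashes(dim, width, seed=0):
    """Shared random bucket and sign assignment for each coordinate."""
    rng = random.Random(seed)
    buckets = [rng.randrange(width) for _ in range(dim)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return buckets, signs


def sketch(vec, buckets, signs, width):
    """Compress vec into a table of `width` signed bucket sums."""
    table = [0.0] * width
    for v, h, s in zip(vec, buckets, signs):
        table[h] += s * v
    return table


def estimate(table, buckets, signs):
    """Per-coordinate estimate; unbiased in expectation over the random signs."""
    return [s * table[h] for h, s in zip(buckets, signs)]


width = 2
buckets, signs = make_hashes(dim=4, width=width, seed=1)
a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, 3.0, 2.0, 1.0]
sum_of_sketches = [x + y for x, y in zip(sketch(a, buckets, signs, width),
                                         sketch(b, buckets, signs, width))]
sketch_of_sum = sketch([x + y for x, y in zip(a, b)], buckets, signs, width)
```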
- the partial gradient is a plurality of maximum gradients among the plurality of first gradients and the plurality of second gradients.
- the gradient error left over from the previous round is err_{t-1}
- the top-k gradient values in (g_t+err_{t-1}) are taken as update values, and the current public model is updated to w_{t+1}.
- (g_t+err_{t-1})-top-k(g_t+err_{t-1}) is err_t, and this error can be used in the next round of iteration.
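- the top-k selection with a carried-over error can be sketched as follows. Ranking by absolute value is an assumption about what "maximum gradients" means here; the variable names mirror g_t, err_{t-1} and err_t from the text.

```python
def top_k_with_error_feedback(g_t, err_prev, k):
    """Return (update, err_t): the top-k entries of g_t + err_prev and the remainder."""
    corrected = [g + e for g, e in zip(g_t, err_prev)]
    # Indices of the k entries with the largest magnitude.
    order = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(order[:k])
    update = [c if i in keep else 0.0 for i, c in enumerate(corrected)]
    # Whatever was not selected becomes the error carried to the next round.
    err_t = [c - u for c, u in zip(corrected, update)]
    return update, err_t


g_t = [0.5, -2.0, 0.25, 1.0]
err_prev = [0.0, 0.0, 1.0, -0.5]
update, err_t = top_k_with_error_feedback(g_t, err_prev, k=2)
```

- here the corrected gradient is [0.5, -2.0, 1.25, 0.5], so the third coordinate is selected only because of its leftover error from the previous round.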
- the updated target model may be w_{t+1}, and the information of the updated target model may be sent to the terminals that need to participate in the next round of iteration (that is, the multiple first devices in the embodiment of the present application).
- because only the values of some parameters of the target model are selected and updated each time, the amount of gradient data transmitted from the server to the terminal can be effectively reduced.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a second model, where the second model is an initial model of the target model.
- the server can send w_{t+1}-w_0 to the client as the compressed gradient value, where w_0 is the initial model of the target model.
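- sending the difference relative to w_0 can be sketched as a sparse delta that the client adds back to its copy of the initial model. The index-to-value encoding and the function names are assumptions for illustration.

```python
def sparse_delta(new_model, reference):
    """Entries of new_model - reference that are non-zero, as {index: difference}."""
    return {
        i: n - r
        for i, (n, r) in enumerate(zip(new_model, reference))
        if n != r
    }


def apply_delta(reference, delta):
    """Rebuild the new model on the client from its reference model and the delta."""
    rebuilt = list(reference)
    for i, d in delta.items():
        rebuilt[i] += d
    return rebuilt


w_0 = [1.0, 2.0, 3.0, 4.0]
w_next = [1.0, 2.5, 3.0, 3.0]         # only two parameters have changed
delta = sparse_delta(w_next, w_0)     # what the server would send
client_model = apply_delta(w_0, delta)
```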
- the information of the target model includes: the parameter update amount of the updated target model relative to the third model; the multiple first devices include a first target device; the third model is a model obtained by updating the target model by the first target device in an iteration round before the current iteration round; the server can transmit the parameter update amount of the updated target model relative to the third model to the first target device.
- the iteration round before the current iteration round is specifically: the iteration round in which the first target device most recently updated the target model before the current iteration round.
- the server side can maintain a list of the target model currently held by each end-side user and, when sending the update amount, compute the difference against the model parameters held by the corresponding end side, so as to improve the compression of the transmitted update.
- the specific scheme is shown in Figure 9.
- the server maintains the latest model parameter list of the end-side user, thereby reducing the difference in the gradient between the latest target model and the end-side user model when sending the gradient, thereby maximizing the compression effect of the downlink communication volume.
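- a per-user model list on the server might be organised as in the sketch below. The class name `DownlinkTracker`, the user identifiers and the fallback to the initial model for unseen users are hypothetical choices.

```python
class DownlinkTracker:
    """Tracks which model each end-side user holds and computes per-user deltas."""

    def __init__(self, initial_model):
        self.initial = list(initial_model)
        self.held = {}  # user id -> model that user currently holds

    def delta_for(self, user_id, latest_model):
        """Difference between the latest model and the model this user holds."""
        base = self.held.get(user_id, self.initial)
        delta = {
            i: n - b
            for i, (n, b) in enumerate(zip(latest_model, base))
            if n != b
        }
        # After sending, the user is assumed to hold the latest model.
        self.held[user_id] = list(latest_model)
        return delta


tracker = DownlinkTracker([0.0, 0.0, 0.0])
d1 = tracker.delta_for("u1", [1.0, 0.0, 0.0])  # u1 joins in round 1
d2 = tracker.delta_for("u1", [1.0, 2.0, 0.0])  # u1 again in round 2: small delta
d3 = tracker.delta_for("u2", [1.0, 2.0, 0.0])  # u2 first joins in round 2: larger delta
```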
- the information of the target model includes: the parameter update amount of the updated target model relative to the first model, the first model being a model updated in an iteration round before the current iteration round of the target model; before obtaining the multiple first gradients and the multiple second gradients, the parameter values of the updated first model can be broadcast to the multiple terminals.
- the server can broadcast (for example, periodically broadcast) the parameters of the current latest target model to reduce the number of model gradient changes sent down from the cloud side. If the model gradient sent down from the cloud side is w_{t+1}-w_0, then when t is large the downlink model compression effect will be significantly reduced.
- the server periodically broadcasts the current latest public model parameters (or the difference between the latest target model and the next latest target model) to all federated end-side users participating in the training to reduce the number of model gradient changes sent down from the cloud side.
- w_anchor is the latest public model parameter periodically broadcast by the cloud-side server. In the above manner, the number of gradient differences between the latest target model and the end-side user model is reduced, thereby ensuring that the compression effect of the downlink communication volume is not degraded.
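- the effect of a periodically refreshed anchor can be seen in a toy simulation. It assumes each round changes exactly one coordinate and that the anchor is rebroadcast every three rounds; both are illustrative assumptions, not values from the application.

```python
def nonzero_count(model, reference):
    """Number of coordinates in which model differs from reference."""
    return sum(1 for m, r in zip(model, reference) if m != r)


dim, period = 12, 3
w_0 = [0.0] * dim
w = list(w_0)
w_anchor = list(w_0)

for t in range(1, 11):
    # Hypothetical sparse update: round t changes coordinate t only.
    w[t % dim] += 1.0
    if t % period == 0:
        # Periodic broadcast: every end side now holds this full model.
        w_anchor = list(w)

vs_initial = nonzero_count(w, w_0)      # entries to send if the reference is w_0
vs_anchor = nonzero_count(w, w_anchor)  # entries to send if the reference is w_anchor
```

- after ten rounds the delta against w_0 has ten non-zero entries, while the delta against the most recent anchor (broadcast in round 9) has one.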
- An embodiment of the present application provides a model training method, which is applied to a server, and the server communicates with multiple terminals.
- the method includes: obtaining multiple first gradients and multiple second gradients; the multiple first gradients are gradients corresponding to multiple first parameters in the target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters are updated in the previous round of iteration of federated learning, and the multiple second parameters are not updated in the previous round of iteration of federated learning; selecting partial gradients from the multiple first gradients and the multiple second gradients, and the partial gradients are used to update the target model in the current round of iteration of federated learning; transmitting the updated information of the target model to multiple first devices; wherein the multiple first devices belong to the multiple terminals.
- because only the values of some parameters of the target model are selected and updated each time, the amount of gradient data transmitted from the server to the terminal can be effectively reduced.
- the present application provides a system, the system comprising a server and a plurality of terminals, the server communicating with the plurality of terminals, wherein:
- the server is used to obtain multiple first gradients and multiple second gradients; the multiple first gradients are gradients corresponding to multiple first parameters in the target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters are updated in the previous round of iteration of federated learning, and the multiple second parameters are not updated in the previous round of iteration of federated learning;
- partial gradients are selected from the multiple first gradients and the multiple second gradients, and the partial gradients are used to update the target model in the current round of iteration of federated learning;
- the updated information of the target model is transmitted to a plurality of first devices; wherein the plurality of first devices and the plurality of second devices belong to the plurality of terminals.
- the partial gradient is a plurality of maximum gradients among the plurality of first gradients and the plurality of second gradients.
- multiple second devices among the multiple terminals are used to send multiple third gradients and multiple fourth gradients to the server;
- the multiple third gradients are gradients corresponding to multiple first parameters in the target model;
- the multiple fourth gradients are gradients corresponding to multiple second parameters in the target model;
- the multiple first parameters are updated in the previous round of iteration of federated learning;
- the multiple second parameters are not updated in the previous round of iteration of federated learning;
- the server is specifically configured to aggregate the multiple third gradients to obtain multiple first gradients
- the multiple fourth gradients are aggregated and merged with the gradients corresponding to the multiple second parameters determined in the previous round of iteration to obtain multiple second gradients.
- the multiple second devices among the multiple terminals are specifically used to determine multiple gradients corresponding to the target model in the previous round of iteration of federated learning; and randomly select the multiple third gradients and the multiple fourth gradients from the multiple gradients.
- the plurality of second devices in the plurality of terminals are specifically configured to perform lossless compression or linear unbiased compression on the information indicating the plurality of third gradients and the plurality of fourth gradients, and send the compression result to the server.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
- the server is further configured to broadcast the updated parameter values of the first model to the multiple terminals before acquiring the multiple first gradients and the multiple second gradients.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a second model, where the second model is an initial model of the target model.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by updating the target model by the first target device in an iteration round before the current iteration round;
- the server is specifically configured to transmit the updated parameter update amount of the updated target model relative to the third model to the first target device.
- the iteration round before the current iteration round is specifically: the iteration round in which the first target device most recently updated the target model before the current iteration round.
- the model training device 1000 provided in the embodiment of the present application includes:
- An acquisition module 1001 is used to acquire a plurality of first gradients and a plurality of second gradients; the plurality of first gradients are gradients corresponding to a plurality of first parameters in a target model; the plurality of second gradients are gradients corresponding to a plurality of second parameters in the target model; the plurality of first parameters are updated in a previous round of iteration of federated learning, and the plurality of second parameters are not updated in a previous round of iteration of federated learning;
- for the specific description of the acquisition module 1001, reference may be made to the description of step 501 in the above embodiment, which will not be repeated here.
- a gradient selection module 1002 configured to select a partial gradient from the plurality of first gradients and the plurality of second gradients, wherein the partial gradient is used to update the target model in a current round of iteration of federated learning;
- for a detailed description of the gradient selection module 1002, reference may be made to the description of step 502 in the above embodiment, which will not be repeated here.
- the sending module 1003 is used to transmit the updated information of the target model to multiple first devices; wherein the multiple first devices belong to the multiple terminals.
- the partial gradient is a plurality of maximum gradients among the plurality of first gradients and the plurality of second gradients.
- the multiple first gradients are obtained by aggregating the multiple third gradients sent from the multiple second devices in the current iteration round;
- the multiple second gradients are obtained according to the gradients corresponding to the multiple second parameters determined in the previous round of iteration, and the multiple fourth gradients sent by the multiple second devices in the current iteration round; the multiple second devices belong to the multiple terminals.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
- the sending module is further used to: before acquiring the multiple first gradients and the multiple second gradients, broadcast the updated parameter values of the first model to the multiple terminals.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a second model, where the second model is an initial model of the target model.
- the information of the target model includes: an update amount of parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by updating the target model by the first target device in an iteration round before the current iteration round;
- the sending module is specifically used to transmit the updated parameter update amount of the target model relative to the third model to the first target device.
- the iteration round before the current iteration round is specifically: the iteration round in which the first target device most recently updated the target model before the current iteration round.
- FIG. 11 is a structural schematic diagram of an execution device provided by an embodiment of the present application.
- the execution device 1100 can be specifically expressed as a mobile phone, a tablet, a laptop computer, an intelligent wearable device, a server, etc., which is not limited here.
- the model training device described in the corresponding embodiment of Figure 10 can be deployed on the execution device 1100 to implement the function of model training in the corresponding embodiment of Figure 10.
- the execution device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103 and a memory 1104 (wherein the number of processors 1103 in the execution device 1100 can be one or more, and one processor is taken as an example in Figure 11), wherein the processor 1103 may include an application processor 11031 and a communication processor 11032.
- the receiver 1101, the transmitter 1102, the processor 1103 and the memory 1104 can be connected via a bus or other means.
- the memory 1104 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1103. A portion of the memory 1104 may also include a non-volatile random access memory (NVRAM).
- the memory 1104 stores processor and operation instructions, executable modules or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
- the processor 1103 controls the operation of the execution device.
- the various components of the execution device are coupled together through a bus system, wherein the bus system includes not only a data bus but also a power bus, a control bus, and a status signal bus, etc.
- various buses are referred to as bus systems in the figure.
- the method disclosed in the above embodiment of the present application can be applied to the processor 1103, or implemented by the processor 1103.
- the processor 1103 can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit in the processor 1103 or the instruction in the form of software.
- the above processor 1103 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- the processor 1103 can implement or execute the methods, steps and logic block diagrams disclosed in the embodiment of the present application.
- the general processor can be a microprocessor or the processor can also be any conventional processor, etc.
- the steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to execute, or a combination of hardware and software modules in the decoding processor to execute.
- the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, etc.
- the storage medium is located in the memory 1104, and the processor 1103 reads the information in the memory 1104 and completes the steps of the above method in combination with its hardware.
- the receiver 1101 can be used to receive input digital or character information and generate signal input related to the relevant settings and function control of the execution device.
- the transmitter 1102 can be used to output digital or character information through the first interface; the transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1102 can also include a display device such as a display screen.
- the processor 1103 is used to execute the method executed by the server in the embodiment corresponding to Figure 5.
- FIG. 12 is a structural schematic diagram of the training device provided by the embodiment of the present application
- the training device 1200 may be deployed with the neural network training device described in the embodiment corresponding to FIG. 10, to implement the functions of that device. Specifically, the training device 1200 is implemented by one or more servers and may vary considerably with configuration or performance. It may include one or more central processing units (CPU) 1212 (for example, one or more processors), memory 1232, and one or more storage media 1230 (for example, one or more mass storage devices) storing application programs 1242 or data 1244.
- the memory 1232 and the storage medium 1230 may be short-term storage or persistent storage.
- the program stored in the storage medium 1230 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device.
- the central processor 1212 can be configured to communicate with the storage medium 1230 and execute a series of instruction operations in the storage medium 1230 on the training device 1200.
- the training device 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
- the central processing unit 1212 is used to execute the steps related to the training method in the above embodiment.
- Also provided in an embodiment of the present application is a computer program product which, when executed on a computer, enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
- a computer-readable storage medium is also provided in an embodiment of the present application, which stores a program for signal processing.
- when the program stored in the computer-readable storage medium is run on a computer, it enables the computer to execute the steps executed by the aforementioned execution device, or the steps executed by the aforementioned training device.
- the execution device, training device or terminal device provided in the embodiments of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
- the processing unit may execute the computer execution instructions stored in the storage unit so that the chip in the execution device executes the model training method described in the above embodiment, or so that the chip in the training device executes the model training method described in the above embodiment.
- the storage unit is a storage unit in the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit located outside the chip in the wireless access device end, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
- FIG. 13 is a schematic diagram of a structure of a chip provided in an embodiment of the present application.
- the chip can be expressed as a neural network processor NPU 1300.
- NPU 1300 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU.
- the core part of the NPU is the operation circuit 1303, which is controlled by the controller 1304 to extract matrix data from the memory and perform multiplication operations.
- the operation circuit 1303 includes multiple processing units (Process Engine, PE) inside.
- the operation circuit 1303 is a two-dimensional systolic array.
- the operation circuit 1303 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
- the operation circuit 1303 is a general-purpose matrix processor.
- the operation circuit takes the corresponding data of matrix B from the weight memory 1302 and caches it on each PE in the operation circuit.
- the operation circuit takes the matrix A data from the input memory 1301 and performs matrix operation with matrix B, and the partial result or final result of the matrix is stored in the accumulator 1308.
- the unified memory 1306 is used to store input data and output data.
- the weight data is directly transferred to the weight memory 1302 through the direct memory access controller (DMAC) 1305.
- the input data is also transferred to the unified memory 1306 through the DMAC.
- BIU stands for Bus Interface Unit, that is, the bus interface unit 1310, which is used for the interaction between AXI bus and DMAC and instruction fetch buffer (IFB) 1309.
- the bus interface unit 1310 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1309 to obtain instructions from the external memory, and is also used for the storage unit access controller 1305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1306 or to transfer weight data to the weight memory 1302 or to transfer input data to the input memory 1301.
- the vector calculation unit 1307 includes a plurality of operation processing units and, when necessary, further processes the output of the operation circuit 1303, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or size comparison. It is mainly used for the computations of non-convolutional/non-fully-connected layers in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
- the vector calculation unit 1307 can store the processed output vector to the unified memory 1306.
- the vector calculation unit 1307 can apply a linear or nonlinear function to the output of the operation circuit 1303, for example linear interpolation of the feature planes extracted by a convolutional layer or, as another example, a function applied to a vector of accumulated values to generate activation values.
- the vector calculation unit 1307 generates a normalized value, a pixel-level summed value, or both.
- the processed output vector can be used as an activation input to the operation circuit 1303, for example, for use in a subsequent layer in a neural network.
- An instruction fetch buffer 1309 connected to the controller 1304, for storing instructions used by the controller 1304;
- Unified memory 1306, input memory 1301, weight memory 1302 and instruction fetch memory 1309 are all On-Chip memories. External memories are private to the NPU hardware architecture.
- the processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
- the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
- the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes a number of instructions that enable a computer device (which can be a personal computer, a training device, a network device, etc.) to execute the methods described in each embodiment of the present application.
- all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
- all or part of the embodiments may be implemented in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media.
- the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).
Abstract
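The present application discloses a model training method applied to the field of federated learning, including: obtaining multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model, and the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning; selecting some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning; and transmitting information about the updated target model to multiple first devices, where the multiple first devices belong to the multiple terminals. In the present application, because the server selects and updates the values of only some parameters of the target model in each round, the amount of gradient data transmitted from the server to the terminals can be effectively reduced.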
Description
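This application claims priority to Chinese patent application No. 202211364320.6, entitled "一种模型训练方法及装置" (Model Training Method and Apparatus), filed with the China National Intellectual Property Administration on November 2, 2022, which is incorporated herein by reference in its entirety.
The present application relates to the field of artificial intelligence, and in particular to a model training method and apparatus.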
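Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making.
A federated learning system trains a machine learning model on data generated by a large number of users interacting with their devices (for example, smartphones) without taking the data off the devices. For example, in each cycle a subset of online devices is selected, and the current version of the machine learning model is sent to the selected devices. Each selected device is tasked with computing an update to the model using its own locally generated and locally stored data. The model updates are then sent back to the server, averaged, and applied to the server's model to produce a new version of the model for the next iteration (for example, the next subset of devices).
Federated learning is divided into two steps, model distribution and model upload: the central node distributes the model to the terminal devices over the network; each terminal device computes the gradients of the model using local data; each distributed node encrypts its gradients and uploads them to the central node; the central node aggregates the gradients from the distributed terminal nodes and updates the parameters of the central node's model using a parameter averaging algorithm.
At the start of training, the server sends an initial model to each device side. Each device side then runs several iterations on the model using its local data and feeds the model change (that is, the gradients corresponding to the parameters) back to the server. The server computes a weighted average of the returned gradients, updates the initial model with the resulting average gradient, distributes the updated model to the device-side users, and starts the next iteration round.
The problem with existing federated training frameworks is as follows: when user data is not independent and identically distributed (non-IID), the gradient directions obtained by the user nodes after iteration differ considerably, so the server cannot obtain an effective gradient update direction for the model. The server model therefore converges slowly, gradients must be passed back and forth many times between the device-side users and the server, and a large amount of communication is consumed. Meanwhile, in current network environments, overall network bandwidth grows far more slowly than the size of neural network models. How to effectively reduce communication overhead is therefore a problem that federated learning urgently needs to solve.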
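Summary of the Invention
The present application provides a model training method that can effectively reduce the amount of gradient data transmitted from the server to the terminals.
In a first aspect, the present application provides a model training method applied to a server that communicates with multiple terminals. The method includes: obtaining multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning; selecting some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning; and transmitting information about the updated target model to multiple first devices, where the multiple first devices belong to the multiple terminals.
Because only the values of some parameters of the target model are selected and updated in each round, the amount of gradient data transmitted from the server to the terminals can be effectively reduced.
In a possible implementation, the selected gradients are the largest gradients among the multiple first gradients and the multiple second gradients.
In a possible implementation, the multiple first gradients are obtained by aggregating multiple third gradients sent by multiple second devices in the current iteration round; the multiple second gradients are obtained from the gradients corresponding to the multiple second parameters determined in the previous iteration round and from multiple fourth gradients sent by the multiple second devices in the current iteration round; the multiple second devices belong to the multiple terminals.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round; before the obtaining of the multiple first gradients and the multiple second gradients, the method further includes: broadcasting the updated parameter values of the first model to the multiple terminals.
In this way, the number of gradient differences between the latest target model and the device-side user models is reduced, which ensures that the downlink compression effect does not degrade.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a second model, where the second model is the initial model of the target model.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by the first target device updating the target model in an iteration round before the current iteration round; the transmitting of the information about the updated target model to the multiple first devices includes: transmitting the update amount of the parameters of the updated target model relative to the third model to the first target device.
In a possible implementation, the iteration round before the current iteration round is specifically the iteration round in which the first target device most recently updated the target model before the current iteration round.
The server side may maintain a list of the target models currently held by the device-side users and, when distributing an update amount, compute the difference against the public model parameters of the corresponding device side, thereby improving the communication compression of the distributed gradients. With this method, the server maintains a list of the latest model parameters of the device-side users, which reduces the number of gradient differences between the latest target model and the device-side user models when gradients are distributed, and thus preserves the downlink compression effect as far as possible.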
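In a second aspect, the present application provides a system that includes a server and multiple terminals, the server communicating with the multiple terminals, where:
the server is configured to obtain multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning;
to select some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning; and
to transmit information about the updated target model to multiple first devices, where the multiple first devices and the multiple second devices belong to the multiple terminals.
In a possible implementation, the selected gradients are the largest gradients among the multiple first gradients and the multiple second gradients.
In a possible implementation, multiple second devices among the multiple terminals are configured to send multiple third gradients and multiple fourth gradients to the server; the multiple third gradients are gradients corresponding to the multiple first parameters in the target model; the multiple fourth gradients are gradients corresponding to the multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning; the multiple second parameters were not updated in the previous iteration round of federated learning;
the server is specifically configured to aggregate the multiple third gradients to obtain the multiple first gradients; and
to aggregate the multiple fourth gradients and fuse the result with the gradients corresponding to the multiple second parameters determined in the previous iteration round, to obtain the multiple second gradients.
In a possible implementation, the multiple second devices among the multiple terminals are specifically configured to determine multiple gradients corresponding to the target model in the previous iteration round of federated learning, and to randomly select the multiple third gradients and the multiple fourth gradients from the multiple gradients.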
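In a possible implementation, the multiple second devices among the multiple terminals are specifically configured to perform lossless compression or linear unbiased compression on information indicating the multiple third gradients and the multiple fourth gradients, and to send the compression result to the server.
Here, linear unbiased compression refers to a compression method with linearity: the compressed data can be subjected to linear operations and then decompressed, and the result is the same as that of applying the same linear operations to the uncompressed original data.
Unbiased compression means that the error between the compressed result and the original data has zero mean.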
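The two properties just defined can be written compactly. In the sketch below, $C$ and $D$ are hypothetical names for the compression and decompression operators, $x$ and $y$ are gradient vectors, and $\alpha$ and $\beta$ are scalars; none of these symbols appear in the original text.

$$C(\alpha x + \beta y) = \alpha\, C(x) + \beta\, C(y) \qquad \text{(linearity)}$$

$$\mathbb{E}\big[D(C(x)) - x\big] = 0 \qquad \text{(unbiasedness)}$$

Taken together, these give $\mathbb{E}\big[D(\alpha\, C(x) + \beta\, C(y))\big] = \alpha x + \beta y$, which is why compressed gradients from several terminals can be combined before decompression.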
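In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
the server is further configured to broadcast the updated parameter values of the first model to the multiple terminals before obtaining the multiple first gradients and the multiple second gradients.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a second model, where the second model is the initial model of the target model.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by the first target device updating the target model in an iteration round before the current iteration round;
the server is specifically configured to transmit the update amount of the parameters of the updated target model relative to the third model to the first target device.
In a possible implementation, the iteration round before the current iteration round is specifically the iteration round in which the first target device most recently updated the target model before the current iteration round.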
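In a third aspect, the present application provides a model training apparatus applied to a server that communicates with multiple terminals. The apparatus includes:
an obtaining module, configured to obtain multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning;
a gradient selection module, configured to select some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning; and
a sending module, configured to transmit information about the updated target model to multiple first devices, where the multiple first devices belong to the multiple terminals.
In a possible implementation, the selected gradients are the largest gradients among the multiple first gradients and the multiple second gradients.
In a possible implementation, the multiple first gradients are obtained by aggregating multiple third gradients sent by multiple second devices in the current iteration round; the multiple second gradients are obtained from the gradients corresponding to the multiple second parameters determined in the previous iteration round and from multiple fourth gradients sent by the multiple second devices in the current iteration round; the multiple second devices belong to the multiple terminals.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
the sending module is further configured to broadcast the updated parameter values of the first model to the multiple terminals before the multiple first gradients and the multiple second gradients are obtained.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a second model, where the second model is the initial model of the target model.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by the first target device updating the target model in an iteration round before the current iteration round;
the sending module is specifically configured to transmit the update amount of the parameters of the updated target model relative to the third model to the first target device.
In a possible implementation, the iteration round before the current iteration round is specifically the iteration round in which the first target device most recently updated the target model before the current iteration round.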
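In a fourth aspect, an embodiment of the present application provides a model training apparatus that may include a memory, a processor, and a bus system, where the memory is configured to store a program and the processor is configured to execute the program in the memory so as to perform the method of the first aspect or any optional implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method of the first aspect or any optional implementation of it.
In a sixth aspect, an embodiment of the present application provides a computer program that, when run on a computer, causes the computer to perform the method of the first aspect or any optional implementation of it.
In a seventh aspect, the present application provides a chip system that includes a processor configured to support an execution device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete components.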
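FIG. 1 is a schematic structural diagram of the main framework of artificial intelligence;
FIG. 2 is a schematic diagram of a computing system that performs model training in an embodiment of the present application;
FIG. 3 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the architecture of a model training method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of a model training method provided by an embodiment of the present application;
FIG. 6 to FIG. 9 are examples of a model training method provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a model training apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an execution device provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a training device provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a chip provided by an embodiment of the present application.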
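The embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. The terms used in the description of the embodiments are intended only to explain specific embodiments of the present invention and are not intended to limit the present invention.
The embodiments of the present application are described below with reference to the accompanying drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
The terms "first", "second", and so on in the specification, claims, and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way objects with the same attributes are distinguished when describing the embodiments of the present application. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not clearly listed or that are inherent to such a process, method, product, or device.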
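First, the overall workflow of an artificial intelligence system is described. Refer to FIG. 1, which shows a schematic structural diagram of the main framework of artificial intelligence. The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition to data processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data, information, knowledge, wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside world takes place through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes platform assurance and support related to distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to obtain data, and the data is provided to the intelligent chips in the distributed computing system offered by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet of Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Model training
Model training generally includes data training, machine learning, deep learning, search, reasoning, decision-making, and similar approaches.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on, on the data.
Reasoning refers to the process in which a computer or intelligent system simulates human intelligent reasoning and, following a reasoning control strategy, uses formalized information to carry out machine thinking and solve problems; its typical functions are search and matching.
Decision-making refers to the process of making decisions after reasoning over intelligent information, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has gone through the model training described above, some general capabilities can be formed based on the results of that training, such as an algorithm or a general-purpose system, for example translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They package the overall artificial intelligence solution, turn intelligent information decision-making into products, and put it into practical use. The main application fields include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, and safe cities.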
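FIG. 2 is a schematic diagram of a computing system that performs model training in an embodiment of the present application. The computing system includes a terminal device 102 (which may be referred to below as a first device or a second device) and a server 130 (which may also be referred to as a central node) that are communicatively coupled through a network. The terminal device 102 may be any type of computing device, such as a personal computing device (for example, a laptop or desktop computer), a mobile computing device (for example, a smartphone or tablet computer), a game console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
The terminal device 102 may include a processor 112 and a memory 114. The processor 112 may be any suitable processing device (for example, a processor core, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a controller, or a microcontroller). The memory 114 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). The memory 114 may store data 116 and instructions 118 that are executed by the processor 112 to cause the terminal device 102 to perform operations.
In some implementations, the memory 114 may store one or more models 120. For example, the model 120 may be or may otherwise include various machine learning models, such as neural networks (for example, deep neural networks) or other types of machine learning models, including nonlinear and/or linear models. Neural networks may include feedforward neural networks, recurrent neural networks (for example, long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks.
In some implementations, one or more models 120 may be received from the server 130 over a network 180, stored in the memory 114, and then used or otherwise implemented by one or more processors 112.
The terminal device 102 may also include one or more user input components 122 that receive user input. For example, the user input component 122 may be a touch-sensitive component (for example, a touch-sensitive display screen or touchpad) that is sensitive to the touch of a user input object (for example, a finger or stylus). The touch-sensitive component may be used to implement a virtual keyboard. Other example user input components include a microphone, a conventional keyboard, or other means by which a user can provide input.
The terminal device 102 may also include a communication interface 123, and the server 130 may include a communication interface 133. The terminal device 102 can connect to the communication interface 133 of the server 130 through the communication interface 123, thereby enabling data exchange between the terminal device 102 and the server 130.
The server 130 may include a processor 132 and a memory 134. The processor 132 may be any suitable processing device (for example, a processor core, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a controller, or a microcontroller). The memory 134 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). The memory 134 may store data 136 and instructions 138 that are executed by the processor 132 to cause the server 130 to perform operations.
As described above, the memory 134 may store one or more machine learning models 140. For example, the model 140 may be or may otherwise include various machine learning models. Example machine learning models include neural networks or other multi-layer nonlinear models. Example neural networks include feedforward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.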
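FIG. 3 is a schematic diagram of the architecture of a system 100 provided by an embodiment of the present application. In FIG. 3, the execution device 110 is configured with an input/output (I/O) interface 112 for exchanging data with external devices. A user can input data to the I/O interface 112 through a client device 140. In the embodiments of the present application, the input data may include tasks to be scheduled, callable resources, and other parameters.
While the execution device 110 preprocesses the input data, or while the computing module 111 of the execution device 110 performs computation and other related processing (for example, implementing the functions of the neural network in the present application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained from that processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing result to the client device 140, which provides it to the user.
It is worth noting that the training device 120 can generate corresponding target models/rules based on different training data for different goals or tasks. The corresponding target models/rules can then be used to achieve those goals or complete those tasks, thereby providing the user with the desired results.
In the case shown in FIG. 3, the user can manually specify the input data through an interface provided by the I/O interface 112. In another case, the client device 140 can automatically send input data to the I/O interface 112; if the client device 140 needs the user's authorization to send input data automatically, the user can set the corresponding permission in the client device 140. The user can view the results output by the execution device 110 on the client device 140, and the results may be presented as a display, sound, action, or the like. The client device 140 can also serve as a data collection terminal that collects the input data fed to the I/O interface 112 and the output results from the I/O interface 112, as shown in the figure, as new sample data and stores them in the database 130. Alternatively, the collection need not go through the client device 140: the I/O interface 112 may directly store the input data fed to it and the output results it produces, as shown in the figure, in the database 130 as new sample data.
It should be noted that FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, and modules shown in the figure do not constitute any limitation. For example, in FIG. 3 the data storage system 150 is external memory relative to the execution device 110; in other cases, the data storage system 150 may be placed inside the execution device 110. As shown in FIG. 3, a neural network can be obtained through training by the training device 120.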
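An embodiment of the present application also provides a chip that includes a neural network processor (NPU). The chip can be placed in the execution device 110 shown in FIG. 3 to carry out the computation of the computing module 111. The chip can also be placed in the training device 120 shown in FIG. 3 to carry out the training work of the training device 120 and output the target model/rule.
The neural network processor (NPU) is mounted on the host central processing unit (host CPU) as a coprocessor, and the host CPU assigns tasks to it. The core of the NPU is the operation circuit; the controller controls the operation circuit to fetch data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the operation circuit contains multiple processing units (process engines, PEs). In some implementations, the operation circuit is a two-dimensional systolic array. The operation circuit may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit is a general-purpose matrix processor.
For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory and caches it on each PE in the operation circuit. The operation circuit then fetches the data of matrix A from the input memory 1, performs a matrix operation with matrix B, and stores the partial or final results of the matrix in an accumulator.
The vector calculation unit can further process the output of the operation circuit, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or size comparison. For example, the vector calculation unit can be used for the computations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit can store the processed output vector in the unified buffer. For example, the vector calculation unit may apply a nonlinear function to the output of the operation circuit, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit, for example for use in a subsequent layer of the neural network.
The unified memory is used to store input data and output data.
The direct memory access controller (DMAC) transfers input data in the external memory to the input memory 1 and/or the unified memory, stores weight data in the external memory into the weight memory, and stores data in the unified memory into the external memory.
The bus interface unit (BIU) is used for interaction among the host CPU, the DMAC, and the instruction fetch buffer over a bus.
The instruction fetch buffer, connected to the controller, is used to store the instructions used by the controller.
The controller is used to invoke the instructions cached in the instruction fetch buffer to control the working process of the operation accelerator.
Generally, the unified memory, the input memory 1, the weight memory, and the instruction fetch buffer are all on-chip memories. The external memory is memory outside the NPU, and may be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.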
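Because the embodiments of the present application involve extensive use of neural networks, the related terms and concepts, such as neural networks, are introduced first for ease of understanding.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:
$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces nonlinearity into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function can serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units together; that is, the output of one neural unit can be the input of another. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field may be a region composed of several neural units.
The work of each layer in a neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. At the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by $W \cdot \vec{x}$, operation 4 by $+b$, and operation 5 by $a()$. The word "space" is used here because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. $W$ is a weight vector, and each value in the vector represents the weight of one neuron in that layer of the neural network. The vector $W$ determines the spatial transformation from the input space to the output space described above; that is, the weight $W$ of each layer controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors $W$ of many layers). The training process of a neural network is therefore essentially learning how to control the spatial transformation, and more specifically learning the weight matrices.
Because the output of the neural network should be as close as possible to the value that is actually to be predicted, the weight vectors of each layer can be updated by comparing the current prediction of the network with the desired target value and adjusting according to the difference between them (there is usually an initialization process before the first update, in which parameters are preconfigured for each layer of the neural network). For example, if the prediction of the network is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the desired target value. It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value. This is the loss function or objective function, an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network becomes the process of reducing this loss as much as possible.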
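(2) Backpropagation algorithm
A neural network can use the error backpropagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, passing the input signal forward to the output produces an error loss, and the parameters of the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The backpropagation algorithm is a backward propagation process driven by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
(3) Federated learning
A privacy-preserving distributed machine learning modeling method. Compared with traditional centralized modeling, in federated learning the participants do not share data directly but instead carry out distributed training by sharing models.
(4) Federated learning under the parameter server framework
Currently the more mainstream federated learning mode, consisting mainly of several edge nodes and one central node. An edge node receives the model distributed by the central node, trains it further using local data, and then sends the model to the central node. The central node collects the models from the edge nodes, aggregates them into a global model, and distributes it to the edge nodes to start a new training iteration round.
(5) Non-independent and identically distributed (non-IID)
In the context of federated learning, this means that the data distributions on the device-side users differ. "Independent and identically distributed" means that the data distributions of the terminals are mutually independent and identical.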
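Refer to FIG. 4, which is a schematic diagram of the architecture of a model training method provided by an embodiment of the present application. As shown in FIG. 4, the architecture includes a cloud-side central node, which may be, for example, a cloud-side server. A1, A2, ... are distributed nodes of type A (which may be referred to as terminals in the present application), such as mobile phones held by users. B1, B2, ... are distributed nodes of type B, such as personal computers held by users. With the consent of the administrator of a distributed node (such as the user of the phone or computer), the administrator voluntarily shares, with privacy protected, the data generated during everyday use of the device and joins the model training program, and the device becomes a distributed node in the architecture. The system in this embodiment may also include more types of distributed nodes, such as smart watches. To protect data privacy, a distributed node does not upload data to the central node and stores the data only locally. Distributed nodes are connected to the cloud server through a communication network. The cloud-side central node can run a large model, whereas each distributed node, limited by its hardware capability, can run only a small model, and A and B may have different model training capabilities.
At the start of training, the server sends an initial model to each device side. Each device side then runs several iterations on the model using its local data and feeds the model change (that is, the gradients corresponding to the parameters) back to the server. The server computes a weighted average of the returned gradients, updates the initial model with the resulting average gradient, distributes the updated model to the device-side users, and starts the next iteration round.
The problem with existing federated training frameworks is as follows: when user data is not independent and identically distributed (non-IID), the gradient directions obtained by the user nodes after iteration differ considerably, so the server cannot obtain an effective gradient update direction for the model. The server model therefore converges slowly, gradients must be passed back and forth many times between the device-side users and the server, and a large amount of communication is consumed. Meanwhile, in current network environments, overall network bandwidth grows far more slowly than the size of neural network models. How to effectively reduce communication overhead is therefore a problem that federated learning urgently needs to solve.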
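Refer to FIG. 5, which is a schematic flowchart of a model training method provided by an embodiment of the present application. As shown in FIG. 5, the model training method provided by the embodiment of the present application includes:
501. The server obtains multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning.
In a possible implementation, the target model may be the model being trained by federated learning. The target model may include a neural network or another multi-layer nonlinear model; for example, the neural network may include a feedforward neural network, a deep neural network, a recurrent neural network, or a convolutional neural network.
In the embodiments of the present application, after receiving the initial model of the target model sent by the server, the multiple devices can train the target model using local data, obtain gradients during training, and then upload the gradients to the server. Optionally, to protect the privacy of information in transit, a device may upload the gradients to the server in encrypted form.
In a possible implementation, the server may receive gradients sent by some of the multiple devices and select some gradients from those gradients (together with the gradient error from the previous iteration, that is, the gradients that were not used for the model update) to update the model.
For example, the target model may include parameter A, parameter B, parameter C, parameter D, and parameter E. The server may receive the gradients sent by the terminals and aggregate them (the aggregation may be a weighted average) to obtain the gradient corresponding to parameter A, the gradient corresponding to parameter B, and the gradient corresponding to parameter C. Parameters C and D are the parameters that were updated when the server updated the model in the previous round, and parameters A, B, and E are the parameters that were not updated in the previous round. The server may fuse (for example, by summation) the gradients corresponding to parameters A and B from the previous round's model update with the gradients of parameters A and B aggregated from the gradients sent by the terminals in the current iteration round, to obtain the gradients of parameters A and B for the current iteration round. The server may use the gradient of parameter C aggregated from the gradients sent by the terminals as the gradient of parameter C for the current iteration round, and may use the gradient corresponding to parameter E from the previous round's model update as the gradient of parameter E for the current iteration round. The server may then select the gradients corresponding to some of parameters A, B, C, and E to update the model.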
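The bookkeeping in the parameter A to E example can be traced with a few lines of Python. This is a minimal sketch: the numeric values and the function name `merge_gradients` are hypothetical, and fusion is taken to be plain summation, which the text gives only as one example.

```python
def merge_gradients(aggregated, residual):
    """Fuse this round's aggregated gradients with last round's unused gradients.

    Both arguments map a parameter name to a gradient value. A parameter that
    appears in only one of them keeps that value unchanged.
    """
    merged = {}
    for name in set(aggregated) | set(residual):
        merged[name] = aggregated.get(name, 0.0) + residual.get(name, 0.0)
    return merged


# Gradients aggregated from the terminals in the current round (A, B, C reported).
aggregated = {"A": 0.25, "B": -0.5, "C": 1.0}
# Gradients left over from the previous round for parameters not updated then.
# C and D were updated last round, so they carry no leftover gradient.
residual = {"A": 0.5, "B": 0.25, "E": -2.0}

current = merge_gradients(aggregated, residual)
# current holds candidates for A, B, C and E; D has no gradient this round.
```

The server would then pick a subset of `current` for the model update, as described in step 502.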
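In a possible implementation, the embodiments of the present application are described using the example in which the multiple first parameters were updated in the previous iteration round of federated learning and the multiple second parameters were not.
In a possible implementation, multiple second devices among the multiple devices (the multiple second devices may be a subset of the multiple devices) may, in a given iteration, obtain the gradients of multiple parameters in the target model, for example the gradients corresponding to the multiple first parameters (that is, the multiple third gradients) and the gradients corresponding to the multiple second parameters (that is, the multiple fourth gradients). The multiple second devices may transmit the multiple third gradients and the multiple fourth gradients to the server.
In a possible implementation, the server may obtain the multiple first gradients by aggregating the multiple third gradients sent by the multiple second devices in the current iteration round. The server may aggregate the multiple fourth gradients sent by the multiple second devices in the current iteration round to obtain multiple gradients, and fuse those gradients with the gradients corresponding to the multiple second parameters determined in the previous iteration round to obtain the multiple second gradients.
Optionally, a momentum gradient update method may be used during aggregation: the local gradient formed by the current merge is no longer simply the sum of the device-side gradients, but a weighted sum of the current gradient and the previous round's gradient.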
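One way to read the momentum option is sketched below. The coefficient `beta`, the `(1 - beta)` weighting, and the use of a plain mean over terminals are assumptions made for illustration; the text only states that the merged gradient is a weighted sum of the current gradient and the previous round's gradient.

```python
def momentum_aggregate(client_grads, prev_grad, beta=0.9):
    """Average the terminals' gradients, then blend with the previous round.

    client_grads: list of equal-length gradient lists, one per terminal.
    prev_grad:    merged gradient from the previous round.
    beta:         hypothetical weight given to the previous round.
    """
    n = len(client_grads)
    dim = len(prev_grad)
    mean = [sum(g[i] for g in client_grads) / n for i in range(dim)]
    return [beta * prev_grad[i] + (1.0 - beta) * mean[i] for i in range(dim)]


merged = momentum_aggregate([[1.0, 2.0], [3.0, 4.0]], prev_grad=[0.0, 0.0], beta=0.5)
# merged == [1.0, 1.5]
```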
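In a possible implementation, the multiple second devices among the multiple terminals are specifically configured to determine multiple gradients corresponding to the target model in the previous iteration round of federated learning, and to randomly select the multiple third gradients and the multiple fourth gradients from the multiple gradients.
That is, after receiving the model parameters distributed by the server, the device side may update the model using local data, compute the update gradients after the local iteration rounds finish, randomly select k gradients from the update gradients using the rand-k method, and upload them to the server.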
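The rand-k step on the device side can be sketched as follows. The function name and the `rescale` flag are hypothetical. Rescaling the kept values by d/k is a common way to make rand-k unbiased in expectation, but the text does not say whether any rescaling is applied, so it is shown as an option.

```python
import random


def rand_k(grad, k, rng=None, rescale=True):
    """Keep k uniformly chosen coordinates of grad; return {index: value}.

    With rescale=True each kept value is multiplied by d/k, so that the
    expectation of the sparse vector equals the original gradient.
    """
    if rng is None:
        rng = random.Random()
    d = len(grad)
    kept = rng.sample(range(d), k)
    scale = d / k if rescale else 1.0
    return {i: grad[i] * scale for i in kept}


update = [1.0, 2.0, 3.0, 4.0]
sparse = rand_k(update, k=2, rng=random.Random(0))
# sparse has exactly 2 entries, each equal to 2.0 * update[i]
```

Only the index and value pairs in `sparse` would need to be uploaded, which is where the uplink saving comes from.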
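In this way, an unbiased compression scheme is implemented on the device side, which avoids the risk that arises because device-side users do not take part in federated training in every iteration round.
In a possible implementation, the multiple second devices among the multiple terminals are specifically configured to perform lossless compression or linear unbiased compression on information indicating the multiple third gradients and the multiple fourth gradients, and to send the compression result to the server.
Refer to FIG. 6, which is a schematic diagram of the interaction between a server and terminals.
The following describes an embodiment that uses a linear unbiased compression scheme to further compress the uplink communication volume. The interaction flow is shown in FIG. 7. The difference from Embodiment 1 is that, after performing rand-k compression, the device side continues to compress the communication volume using a linear unbiased compression scheme (for example, a sketch compression method).
With the above method, the linear unbiased compression scheme can be hardened into network devices, making full use of the processing capability and performance of the network devices to reduce the network transmission load and thereby further reduce the volume of communication uploaded from the device side to the server.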
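The linearity that makes in-network processing possible can be illustrated with a count sketch, one common sketch-based compressor. The text does not specify which sketch is used, so the class below, its parameters, and its hashing are all assumptions. The example shows only the linearity property (sketches of two gradients can be added without decompressing) and leaves decompression out.

```python
import random


class CountSketch:
    """Tiny count sketch: `rows` hash rows, each with `cols` buckets."""

    def __init__(self, dim, rows=3, cols=8, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        # Per-row bucket index and +/-1 sign for every coordinate.
        self.bucket = [[rng.randrange(cols) for _ in range(dim)] for _ in range(rows)]
        self.sign = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(rows)]

    def compress(self, vec):
        """Project vec into a rows x cols table."""
        table = [[0.0] * self.cols for _ in range(self.rows)]
        for r in range(self.rows):
            for i, v in enumerate(vec):
                table[r][self.bucket[r][i]] += self.sign[r][i] * v
        return table


def add_tables(a, b):
    """Element-wise sum of two sketch tables."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


sketch = CountSketch(dim=5)
g1 = [1.0, -2.0, 3.0, 0.0, 5.0]   # gradient from one terminal
g2 = [4.0, 0.0, -1.0, 2.0, 2.0]   # gradient from another terminal
summed_in_network = add_tables(sketch.compress(g1), sketch.compress(g2))
sketch_of_sum = sketch.compress([a + b for a, b in zip(g1, g2)])
# summed_in_network equals sketch_of_sum: adding sketches is the same as
# sketching the sum, so an intermediate device can aggregate compressed data.
```

All terminals must share the same hash tables (here, the same `seed`) for the sketches to be addable.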
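502. Select some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning.
In a possible implementation, the selected gradients are the largest gradients among the multiple first gradients and the multiple second gradients.
In a possible implementation, denote the gradient error left over from the previous round as err_{t-1}. In the current round, the top-k gradient values of (g_t+err_{t-1}) are taken as the update values and the current public model is updated to w_{t+1}; (g_t+err_{t-1})-top-k(g_t+err_{t-1}) is recorded as err_t, and this error can take part in the next iteration round.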
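The top-k selection with a carried-over error term can be sketched as below. The function name is hypothetical, and ranking by absolute value is an assumption: the text says only that the largest gradients are chosen.

```python
def top_k_with_error_feedback(grad, err_prev, k):
    """One server round of top-k selection with error feedback.

    grad:     aggregated gradient g_t for the current round.
    err_prev: leftover error err_{t-1} from the previous round.
    Returns (selected, err_next): selected is top-k(g_t + err_{t-1}) as a
    dense list with zeros elsewhere; err_next is the part left unused.
    """
    acc = [g + e for g, e in zip(grad, err_prev)]
    # Indices of the k entries with the largest magnitude (assumed criterion).
    order = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)
    keep = set(order[:k])
    selected = [acc[i] if i in keep else 0.0 for i in range(len(acc))]
    err_next = [a - s for a, s in zip(acc, selected)]
    return selected, err_next


g_t = [0.5, -3.0, 1.0, 0.25]
err_prev = [0.5, 0.0, 1.0, 0.0]
selected, err_t = top_k_with_error_feedback(g_t, err_prev, k=2)
# selected == [0.0, -3.0, 2.0, 0.0]
# err_t    == [1.0, 0.0, 0.0, 0.25]
```

Because `selected` has at most k non-zero entries, the change from w_t to w_{t+1} touches at most k parameters, which is what keeps the downlink update sparse.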
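503. Transmit information about the updated target model to multiple first devices, where the multiple first devices belong to the multiple terminals.
In a possible implementation, the updated target model may be w_{t+1}, and the information about the updated target model may be distributed to the terminals that need to take part in the next iteration round (that is, the multiple first devices in the embodiments of the present application).
Because only the values of some parameters of the target model are selected and updated in each round, the amount of gradient data transmitted from the server to the terminals can be effectively reduced.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a second model, where the second model is the initial model of the target model.
That is, the server may distribute w_{t+1}-w_0 to the device side as the compressed gradient value, where w_0 is the initial model of the target model.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by the first target device updating the target model in an iteration round before the current iteration round; the server may transmit the update amount of the parameters of the updated target model relative to the third model to the first target device.
In a possible implementation, the iteration round before the current iteration round is specifically the iteration round in which the first target device most recently updated the target model before the current iteration round.
In a possible implementation, the server side may maintain a list of the target models currently held by the device-side users and, when distributing an update amount, compute the difference against the public model parameters of the corresponding device side, thereby improving the communication compression of the distributed gradients. The specific scheme is shown in FIG. 9. With this method, the server maintains a list of the latest model parameters of the device-side users, which reduces the number of gradient differences between the latest target model and the device-side user models when gradients are distributed, and thus preserves the downlink compression effect as far as possible.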
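The per-device bookkeeping described above might look like the following sketch. The class name `DeltaServer`, its fields, and the sparse dictionary format are hypothetical; a terminal that has never received a model is assumed to hold the initial model w_0.

```python
class DeltaServer:
    """Tracks which model version each terminal last received."""

    def __init__(self, initial):
        self.initial = list(initial)        # w_0
        self.global_model = list(initial)   # latest public model
        self.last_sent = {}                 # device_id -> model held by that terminal

    def delta_for(self, device_id):
        """Return {index: change} relative to the model this terminal holds."""
        base = self.last_sent.get(device_id, self.initial)
        delta = {
            i: w - b
            for i, (w, b) in enumerate(zip(self.global_model, base))
            if w != b
        }
        # Record that the terminal is now up to date.
        self.last_sent[device_id] = list(self.global_model)
        return delta


server = DeltaServer([0.0, 0.0, 0.0])
server.global_model = [1.0, 0.0, 0.0]
first = server.delta_for("phone-1")     # {0: 1.0}
server.global_model = [1.0, 2.0, 0.0]
second = server.delta_for("phone-1")    # {1: 2.0}
late = server.delta_for("laptop-7")     # {0: 1.0, 1: 2.0}
```

A terminal that takes part in consecutive rounds receives only what changed since its last round, while a terminal that joins late receives a larger difference. The cost is that the server stores one model copy per terminal.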
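In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round; before the multiple first gradients and the multiple second gradients are obtained, the updated parameter values of the first model may be broadcast to the multiple terminals.
In a possible implementation, the server may broadcast (for example, periodically) the parameters of the current latest target model to reduce the number of model gradient changes distributed by the cloud side. If the model gradient distributed by the cloud side is w_{t+1}-w_0, the downlink compression effect drops noticeably once t becomes large. In that case, the server periodically broadcasts the current latest public model parameters (or the difference between the latest and the second-latest target model) to all federated device-side users taking part in training, which reduces the number of model gradient changes distributed by the cloud side. As shown in FIG. 8, w_anchor is the latest public model parameters periodically broadcast by the cloud-side server. In this way, the number of gradient differences between the latest target model and the device-side user models is reduced, which ensures that the downlink compression effect does not degrade.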
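The effect of the periodic anchor broadcast on downlink size can be seen in a small simulation. Everything here is illustrative: the function names, the additive update, and the `broadcast_every` period are assumptions, and each round's update is given as a sparse dictionary of at most k changed parameters.

```python
def downlink_delta(model, anchor):
    """Sparse difference between the latest model and the anchor model."""
    return {i: w - a for i, (w, a) in enumerate(zip(model, anchor)) if w != a}


def run_rounds(updates, initial, broadcast_every):
    """Apply sparse updates round by round and record each downlink size.

    updates:         list of {index: change} dicts, one per round.
    broadcast_every: period (in rounds) of the full w_anchor broadcast.
    Returns (final_model, sizes), where sizes[t] is the number of entries in
    the delta sent in round t+1.
    """
    anchor = list(initial)   # starts as w_0
    model = list(initial)
    sizes = []
    for t, sparse_update in enumerate(updates, start=1):
        for i, v in sparse_update.items():
            model[i] += v
        sizes.append(len(downlink_delta(model, anchor)))
        if t % broadcast_every == 0:
            anchor = list(model)   # full broadcast refreshes w_anchor
    return model, sizes


updates = [{0: 1.0}, {1: 1.0}, {2: 1.0}, {3: 1.0}]
_, with_anchor = run_rounds(updates, [0.0] * 4, broadcast_every=2)
_, without_anchor = run_rounds(updates, [0.0] * 4, broadcast_every=100)
# with_anchor    == [1, 2, 1, 2]
# without_anchor == [1, 2, 3, 4]
```

Without a refreshed anchor the difference against w_0 keeps growing with t. The periodic broadcast bounds it, at the price of an occasional full-model transmission.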
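An embodiment of the present application provides a model training method applied to a server that communicates with multiple terminals. The method includes: obtaining multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning; selecting some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning; and transmitting information about the updated target model to multiple first devices, where the multiple first devices belong to the multiple terminals.
Because only the values of some parameters of the target model are selected and updated in each round, the amount of gradient data transmitted from the server to the terminals can be effectively reduced.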
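In addition, the present application provides a system that includes a server and multiple terminals, the server communicating with the multiple terminals, where:
the server is configured to obtain multiple first gradients and multiple second gradients, where the multiple first gradients are gradients corresponding to multiple first parameters in a target model; the multiple second gradients are gradients corresponding to multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning, and the multiple second parameters were not updated in the previous iteration round of federated learning;
to select some gradients from the multiple first gradients and the multiple second gradients, where the selected gradients are used to update the target model in the current iteration round of federated learning; and
to transmit information about the updated target model to multiple first devices, where the multiple first devices and the multiple second devices belong to the multiple terminals.
In a possible implementation, the selected gradients are the largest gradients among the multiple first gradients and the multiple second gradients.
In a possible implementation, multiple second devices among the multiple terminals are configured to send multiple third gradients and multiple fourth gradients to the server; the multiple third gradients are gradients corresponding to the multiple first parameters in the target model; the multiple fourth gradients are gradients corresponding to the multiple second parameters in the target model; the multiple first parameters were updated in the previous iteration round of federated learning; the multiple second parameters were not updated in the previous iteration round of federated learning;
the server is specifically configured to aggregate the multiple third gradients to obtain the multiple first gradients; and
to aggregate the multiple fourth gradients and fuse the result with the gradients corresponding to the multiple second parameters determined in the previous iteration round, to obtain the multiple second gradients.
In a possible implementation, the multiple second devices among the multiple terminals are specifically configured to determine multiple gradients corresponding to the target model in the previous iteration round of federated learning, and to randomly select the multiple third gradients and the multiple fourth gradients from the multiple gradients.
In a possible implementation, the multiple second devices among the multiple terminals are specifically configured to perform lossless compression or linear unbiased compression on information indicating the multiple third gradients and the multiple fourth gradients, and to send the compression result to the server.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a first model, where the first model is a model obtained by updating the target model in an iteration round before the current iteration round;
the server is further configured to broadcast the updated parameter values of the first model to the multiple terminals before obtaining the multiple first gradients and the multiple second gradients.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a second model, where the second model is the initial model of the target model.
In a possible implementation, the information about the target model includes an update amount of the parameters of the updated target model relative to a third model; the multiple first devices include a first target device; the third model is a model obtained by the first target device updating the target model in an iteration round before the current iteration round;
the server is specifically configured to transmit the update amount of the parameters of the updated target model relative to the third model to the first target device.
In a possible implementation, the iteration round before the current iteration round is specifically the iteration round in which the first target device most recently updated the target model before the current iteration round.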
接下来从装置的角度对本申请实施例提供的模型训练装置进行描述,参照图10,图10为本申请实施例提供的一种模型训练装置的示意,如图10中示出的那样,本申请实施例提供的模型训练装置1000,包括:
获取模块1001,用于获取多个第一梯度和多个第二梯度;所述多个第一梯度为目标模型中多个第一参数对应的梯度;所述多个第二梯度为所述目标模型中多个第二参数对应的梯度;所述多个第一参数在联邦学习的上一轮次迭代中被更新,所述多个第二参数在联邦学习的上一轮次迭代中未被更新;
关于获取模块1001的具体描述可以参照上述实施例中步骤501的描述,这里不再赘述。
梯度选择模块1002,用于从所述多个第一梯度和所述多个第二梯度中选择部分梯度,所述部分梯度用于在联邦学习的当前轮次迭代中更新所述目标模型;
关于梯度选择模块1002的具体描述可以参照上述实施例中步骤502的描述,这里不再赘述。
发送模块1003,用于将更新后的所述目标模型的信息传递至多个第一设备;其中,所述多个第一设备属于所述多个终端。
关于发送模块1003的具体描述可以参照上述实施例中步骤503的描述,这里不再赘述。
在一种可能的实现中,所述部分梯度为所述多个第一梯度和所述多个第二梯度中最大的多个梯度。
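In a possible implementation, the plurality of first gradients are obtained by aggregating a plurality of third gradients sent by a plurality of second devices in the current iteration round; the plurality of second gradients are obtained from the gradients corresponding to the plurality of second parameters determined in the previous iteration round and from a plurality of fourth gradients sent by the plurality of second devices in the current iteration round; the plurality of second devices belong to the plurality of terminals.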
在一种可能的实现中,所述目标模型的信息包括:更新后的所述目标模型相对于第一模型的参数更新量,所述第一模型为对所述目标模型在所述当前轮次迭代之前的迭代轮次中更新得到的模型;
所述发送模块,还用于:在所述获取多个第一梯度和多个第二梯度之前,向所述多个终端广播所述更新后的所述第一模型的参数数值。
在一种可能的实现中,所述目标模型的信息包括:更新后的所述目标模型相对于第二模型的参数更新量,所述第二模型为所述目标模型的初始模型。
在一种可能的实现中,所述目标模型的信息包括:更新后的所述目标模型相对于第三模型的参数更新量;所述多个第一设备包括第一目标设备;所述第三模型为所述第一目标设备在所述当前轮次迭代之前的迭代轮次中对所述目标模型更新得到的模型;
所述发送模块,具体用于:将所述更新后的所述目标模型相对于第三模型的参数更新量传递至所述第一目标设备。
在一种可能的实现中,所述当前轮次迭代之前的迭代轮次具体为:所述第一目标设备在所述当前轮次迭代之前最近一次对所述目标模型更新的迭代轮次。
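The following describes an execution device provided in an embodiment of this application. Refer to FIG. 11, which is a schematic structural diagram of the execution device. The execution device 1100 may specifically be a mobile phone, a tablet, a laptop computer, a smart wearable device, a server, or the like, which is not limited here. The model training apparatus described in the embodiment corresponding to FIG. 10 may be deployed on the execution device 1100 to implement the model training function of that embodiment. Specifically, the execution device 1100 includes a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the execution device 1100 may have one or more processors 1103; FIG. 11 uses one processor as an example), where the processor 1103 may include an application processor 11031 and a communication processor 11032. In some embodiments of this application, the receiver 1101, the transmitter 1102, the processor 1103, and the memory 1104 may be connected by a bus or in another manner.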
存储器1104可以包括只读存储器和随机存取存储器,并向处理器1103提供指令和数据。存储器1104的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1104存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1103控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
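The methods disclosed in the foregoing embodiments of this application may be applied to the processor 1103 or implemented by the processor 1103. The processor 1103 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by integrated logic circuits in hardware in the processor 1103 or by instructions in the form of software. The processor 1103 may be a general-purpose processor, a digital signal processor (digital signal processor, DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1103 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed with reference to the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1104; the processor 1103 reads the information in the memory 1104 and completes the steps of the foregoing methods in combination with its hardware.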
接收器1101可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1102可用于通过第一接口输出数字或字符信息;发射器1102还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1102还可以包括显示屏等显示设备。
本申请实施例中,在一种情况下,处理器1103,用于执行图5对应实施例中的服务器执行的方法。
本申请实施例还提供了一种训练设备,请参阅图12,图12是本申请实施例提供的训练设备一种结构示意图,训练设备1200上可以部署有图10对应实施例中所描述的神经网络训练装置,用于实现图10对应实施例中神经网络训练装置的功能,具体的,训练设备1200由一个或多个服务器实现,训练设备1200可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1212(例如,一个或一个以上处理器)和存储器1232,一个或一个以上存储应用程序1242或数据1244的存储介质1230(例如一个或一个以上海量存储设备)。其中,存储器1232和存储介质1230可以是短暂存储或持久存储。存储在存储介质1230的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器1212可以设置为与存储介质1230通信,在训练设备1200上执行存储介质1230中的一系列指令操作。
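The training device 1200 may further include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, for example Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.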
本申请实施例中,中央处理器1212,用于执行上述实施例中的训练方法相关的步骤。
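An embodiment of this application further provides a computer program product which, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing training device.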
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的模型训练方法,或者,以使训练设备内的芯片执行上述实施例描述的模型训练方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图13,图13为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 1300,NPU 1300作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1303,通过控制器1304控制运算电路1303提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1303内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1303是二维脉动阵列。运算电路1303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1301中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1308中。
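The accumulate-partial-results pattern in this example can be mimicked in software. The sketch below is only an analogy for the data flow and says nothing about how the arithmetic circuit 1303 is actually built; the function name `matmul_tiled` and the `tile` size are assumptions.

```python
def matmul_tiled(a, b, tile):
    """Compute C = A x B, adding one tile of partial products at a time."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # Plays the role of the accumulator: holds partial results of C.
    acc = [[0.0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)
        # Rows k0..k1-1 of B stand in for the weight slice cached on the PEs.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc


matrix_a = [[1, 2, 3], [4, 5, 6]]       # input matrix A
matrix_b = [[7, 8], [9, 10], [11, 12]]  # weight matrix B
matrix_c = matmul_tiled(matrix_a, matrix_b, tile=2)
```

After the first tile the accumulator holds a partial result; after the last tile it holds the final matrix C, here [[58.0, 64.0], [139.0, 154.0]].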
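The unified memory 1306 is used to store input data and output data. Weight data is transferred directly to the weight memory 1302 through the direct memory access controller (Direct Memory Access Controller, DMAC) 1305. Input data is also transferred to the unified memory 1306 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 1310 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1309. Specifically, the instruction fetch buffer 1309 obtains instructions from the external memory through the BIU 1310, and the direct memory access controller 1305 obtains the original data of the input matrix A or the weight matrix B from the external memory through the BIU 1310.
The DMAC is mainly used to transfer input data from the external memory (DDR) to the unified memory 1306, to transfer weight data to the weight memory 1302, or to transfer input data to the input memory 1301.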
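The vector computation unit 1307 includes a plurality of arithmetic processing units and, where needed, further processes the output of the arithmetic circuit 1303, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as batch normalization, pixel-wise summation, and upsampling of feature planes.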
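In some implementations, the vector computation unit 1307 can store the processed output vector in the unified memory 1306. For example, the vector computation unit 1307 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 1303, for example performing linear interpolation on the feature planes extracted by a convolutional layer, or, as another example, applying the function to a vector of accumulated values to generate activation values. In some implementations, the vector computation unit 1307 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as activation input to the arithmetic circuit 1303, for example for use in subsequent layers of the neural network.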
控制器1304连接的取指存储器(instruction fetch buffer)1309,用于存储控制器1304使用的指令;
统一存储器1306,输入存储器1301,权重存储器1302以及取指存储器1309均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
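The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium on which a computer can store data, or a data storage device, such as a training device or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (Solid State Disk, SSD)).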
Claims (26)
- 一种模型训练方法,其特征在于,应用于服务器,所述服务器与多个终端通信,所述方法包括:获取多个第一梯度和多个第二梯度;所述多个第一梯度为目标模型中多个第一参数对应的梯度;所述多个第二梯度为所述目标模型中多个第二参数对应的梯度;所述多个第一参数在联邦学习的上一轮次迭代中被更新,所述多个第二参数在联邦学习的上一轮次迭代中未被更新;从所述多个第一梯度和所述多个第二梯度中选择部分梯度,所述部分梯度用于在联邦学习的当前轮次迭代中更新所述目标模型;将更新后的所述目标模型的信息传递至多个第一设备;其中,所述多个第一设备属于所述多个终端。
- 根据权利要求1所述的方法,其特征在于,所述部分梯度为所述多个第一梯度和所述多个第二梯度中最大的多个梯度。
- 根据权利要求1或2所述的方法,其特征在于,所述多个第一梯度为通过对来自多个第二设备在当前迭代轮次发送的多个第三梯度进行聚合得到的;所述多个第二梯度为根据所述上一轮次迭代中确定的所述多个第二参数对应的梯度、以及所述多个第二设备在当前迭代轮次发送的多个第四梯度得到的;所述多个第二设备属于所述多个终端。
- 根据权利要求1至3任一所述的方法,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第一模型的参数更新量,所述第一模型为对所述目标模型在所述当前轮次迭代之前的迭代轮次中更新得到的模型;在所述获取多个第一梯度和多个第二梯度之前,所述方法还包括:向所述多个终端广播所述更新后的所述第一模型的参数数值。
- 根据权利要求1至3任一所述的方法,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第二模型的参数更新量,所述第二模型为所述目标模型的初始模型。
- 根据权利要求1至3任一所述的方法,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第三模型的参数更新量;所述多个第一设备包括第一目标设备;所述第三模型为所述第一目标设备在所述当前轮次迭代之前的迭代轮次中对所述目标模型更新得到的模型;所述将更新后的所述目标模型的信息传递至多个第一设备,包括:将所述更新后的所述目标模型相对于第三模型的参数更新量传递至所述第一目标设备。
- 根据权利要求6所述的方法,其特征在于,所述当前轮次迭代之前的迭代轮次具体为:所述第一目标设备在所述当前轮次迭代之前最近一次对所述目标模型更新的迭代轮次。
- 一种系统,其特征在于,所述系统包括服务器和多个终端,所述服务器与所述多个终端通信,其中,所述服务器用于获取多个第一梯度和多个第二梯度;所述多个第一梯度为目标模型中多个第一参数对应的梯度;所述多个第二梯度为所述目标模型中多个第二参数对应的梯度;所述多个第一参数在联邦学习的上一轮次迭代中被更新,所述多个第二参数在联邦学习的上一轮次迭代中未被更新;从所述多个第一梯度和所述多个第二梯度中选择部分梯度,所述部分梯度用于在联邦学习的当前轮次迭代中更新所述目标模型;将更新后的所述目标模型的信息传递至多个第一设备;其中,所述多个第一设备和所述多个第二设备属于所述多个终端。
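- The system according to claim 8, wherein the partial gradients are the largest plurality of gradients among the plurality of first gradients and the plurality of second gradients.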
- 根据权利要求8或9所述的系统,其特征在于,所述多个终端中的多个第二设备用于向所述服务器发送多个第三梯度和多个第四梯度;所述多个第三梯度为所述目标模型中多个第一参数对应的梯度;所述多个第四梯度为所述目标模型中多个第二参数对应的梯度;所述多个第一参数在联邦学习的上一轮次迭代中被更新;所述多个第二参数在联邦学习的上一轮次迭代中未被更新;所述服务器具体用于对所述多个第三梯度进行聚合,得到多个第一梯度;对所述多个第四梯度进行聚合并和上一轮次迭代中确定的所述多个第二参数对应的梯度进行融合,得到多个第二梯度。
- 根据权利要求10所述的系统,其特征在于,所述多个终端中的多个第二设备具体用于在联邦学习的上一轮次迭代中确定所述目标模型对应的多个梯度;从所述多个梯度中随机选择所述多个第三梯度和多个第四梯度。
- 根据权利要求11所述的系统,其特征在于,所述多个终端中的多个第二设备具体用于对指示所述多个第三梯度和所述多个第四梯度的信息进行无损压缩或者线性无偏压缩,并向所述服务器发送压缩结果。
- 根据权利要求8至12任一所述的系统,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第一模型的参数更新量,所述第一模型为对所述目标模型在所述当前轮次迭代之前的迭代轮次中更新得到的模型;所述服务器还用于在所述获取多个第一梯度和多个第二梯度之前,向所述多个终端广播所述更新后的所述第一模型的参数数值。
- 根据权利要求8至12任一所述的系统,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第二模型的参数更新量,所述第二模型为所述目标模型的初始模型。
- 根据权利要求8至12任一所述的系统,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第三模型的参数更新量;所述多个第一设备包括第一目标设备;所述第三模型为所述第一目标设备在所述当前轮次迭代之前的迭代轮次中对所述目标模型更新得到的模型;所述服务器具体用于将所述更新后的所述目标模型相对于第三模型的参数更新量传递至所述第一目标设备。
- 根据权利要求15所述的系统,其特征在于,所述当前轮次迭代之前的迭代轮次具体为:所述第一目标设备在所述当前轮次迭代之前最近一次对所述目标模型更新的迭代轮次。
- 一种模型训练装置,其特征在于,应用于服务器,所述服务器与多个终端通信,所述装置包括:获取模块,用于获取多个第一梯度和多个第二梯度;所述多个第一梯度为目标模型中多个第一参数对应的梯度;所述多个第二梯度为所述目标模型中多个第二参数对应的梯度;所述多个第一参数在联邦学习的上一轮次迭代中被更新,所述多个第二参数在联邦学习的上一轮次迭代中未被更新;梯度选择模块,用于从所述多个第一梯度和所述多个第二梯度中选择部分梯度,所述部分梯度用于在联邦学习的当前轮次迭代中更新所述目标模型;发送模块,用于将更新后的所述目标模型的信息传递至多个第一设备;其中,所述多个第一设备属于所述多个终端。
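- The apparatus according to claim 17, wherein the partial gradients are the largest plurality of gradients among the plurality of first gradients and the plurality of second gradients.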
- 根据权利要求17或18所述的装置,其特征在于,所述多个第一梯度为通过对来自多个第二设备在当前迭代轮次发送的多个第三梯度进行聚合得到的;所述多个第二梯度为根据所述上一轮次迭代中确定的所述多个第二参数对应的梯度、以及所述多个第二设备在当前迭代轮次发送的多个第四梯度得到的;所述多个第二设备属于所述多个终端。
- 根据权利要求17至19任一所述的装置,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第一模型的参数更新量,所述第一模型为对所述目标模型在所述当前轮次迭代之前的迭代轮次中更新得到的模型;所述发送模块,还用于:在所述获取多个第一梯度和多个第二梯度之前,向所述多个终端广播所述更新后的所述第一模型的参数数值。
- 根据权利要求17至20任一所述的装置,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第二模型的参数更新量,所述第二模型为所述目标模型的初始模型。
- 根据权利要求17至20任一所述的装置,其特征在于,所述目标模型的信息包括:更新后的所述目标模型相对于第三模型的参数更新量;所述多个第一设备包括第一目标设备;所述第三模型为所述第一目标设备在所述当前轮次迭代之前的迭代轮次中对所述目标模型更新得到的模型;所述发送模块,具体用于:将所述更新后的所述目标模型相对于第三模型的参数更新量传递至所述第一目标设备。
- 根据权利要求22所述的装置,其特征在于,所述当前轮次迭代之前的迭代轮次具体为:所述第一目标设备在所述当前轮次迭代之前最近一次对所述目标模型更新的迭代轮次。
- 一种模型训练装置,其特征在于,所述装置包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为执行所述代码,并实现如权利要求1至7任一所述的方法。
- 一种计算机存储介质,其特征在于,所述计算机存储介质存储有一个或多个指令,所述指令在由一个或多个计算机执行时使得所述一个或多个计算机实施权利要求1至7任一所述的方法。
- 一种计算机程序产品,包括代码,其特征在于,在所述代码被执行时用于实现如权利要求1至7任一所述的方法。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211364320.6 | 2022-11-02 | ||
CN202211364320.6A CN115907041A (zh) | 2022-11-02 | 2022-11-02 | 一种模型训练方法及装置 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024094094A1 true WO2024094094A1 (zh) | 2024-05-10 |
Family
ID=86481571
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/129211 WO2024094094A1 (zh) | 2022-11-02 | 2023-11-02 | 一种模型训练方法及装置 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115907041A (zh) |
WO (1) | WO2024094094A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115907041A (zh) * | 2022-11-02 | 2023-04-04 | 华为技术有限公司 | 一种模型训练方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN115907041A (zh) | 2023-04-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23885020 Country of ref document: EP Kind code of ref document: A1 |