WO2024000344A1 - Model training method and related apparatus - Google Patents

Model training method and related apparatus

Info

Publication number
WO2024000344A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
exit
nodes
model
inference
Prior art date
Application number
PCT/CN2022/102635
Other languages
English (en)
Chinese (zh)
Inventor
叶德仕
孙武杰
徐晨
李榕
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2022/102635
Publication of WO2024000344A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models

Definitions

  • This application relates to the field of communications, in particular to a model training method and related devices.
  • Multi-exit networks usually refer to neural networks with multiple exits at different locations. These multiple exits can complete the same task, but differ in the overhead or accuracy required for inference, so they can meet the needs of different inference tasks.
  • A method is known that combines knowledge distillation technology with a multi-exit network and uses distillation learning to implement multi-exit network training.
  • Through distillation learning, the performance of multi-exit networks can be improved, and at the same time, multiple models of different complexity can be obtained in one training.
  • This application provides a model training method and related devices, in order to improve the training effect of the neural network model.
  • In a first aspect, a model training method is provided, which can be applied to a first exit node among multiple nodes, where a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model to be trained.
  • The sub-models deployed in the multiple exit nodes are combined together to obtain a neural network model with multiple exits, which can be referred to as a multi-exit network.
  • The first exit node may be any one of the multiple exit nodes.
  • the method provided in the first aspect may be executed by the first exit node.
  • The first exit node may be, for example, a communication device, such as a terminal device, or a component configured in the communication device, such as a chip, a chip system, or a module or software implementation that can implement part or all of the functions of the first exit node; this application does not limit this.
  • The method includes: the first exit node uses a locally deployed sub-model to perform inference on the received data to obtain an inference result of the first exit node; and the first exit node receives inference results from other exit nodes, where the other exit nodes are exit nodes other than the first exit node among multiple exit nodes, and the multiple exit nodes belong to the multiple nodes.
  • The first exit node obtains a soft label of the first exit node based on the inference result of each of the multiple exit nodes and its weight, the soft label of the first exit node being a weighted sum of the inference results of the multiple exit nodes; and the first exit node trains the locally deployed sub-model based on the soft label of the first exit node, a predefined hard label, and a predefined loss function to obtain a trained sub-model, where the loss function includes a distillation loss determined by the soft label and a student loss determined by the hard label.
  • soft labels can be regarded as the output of the teacher model
  • hard labels can be regarded as labels in the training samples.
  • the process of training based on soft labels and hard labels is a process of knowledge distillation, and the training object can be called a student model.
  • the weight of the inference result of each exit node may be predefined.
  • The weight of the inference result of each exit node is related to the neural network capability of each exit node. For example, if the sub-model deployed in the first exit node has a relatively complex structure and a large capacity, the first exit node can be considered to have strong neural network capabilities;
  • it can then apply a higher weight to the inference results of exit nodes deploying sub-models with relatively complex structures, and a lower weight to the inference results of exit nodes deploying sub-models with simpler structures. If the sub-model deployed in the first exit node has a simpler structure and a smaller capacity, the first exit node can be considered to have weak neural network capabilities; it can then apply a lower weight to the inference results of exit nodes deploying sub-models with more complex structures, and a higher weight to the inference results of exit nodes deploying sub-models with simpler structures.
  • the first exit node obtains the output of the teacher model based on the weighting of the inference results output by each exit node, uses it as a soft label, and combines it with the hard label to train (or distill) the local sub-model.
  • In this way, the impact of the inference result of each exit node on the soft label, that is, on the distillation of the sub-model, can be adjusted through the weights, taking neural networks with different capabilities into account.
  • Training is thus controllable, so that different student models can obtain distillation losses adapted to their neural network capabilities, which is beneficial to improving the training effect of the student models.
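  • A rough sketch of this weighting (hypothetical names; it assumes the exit nodes' inference results are logits tensors of the same shape and that the weights are already known):

```python
import torch

def soft_label(exit_logits, weights):
    """Soft label of an exit node: weighted sum of all exit nodes'
    inference results z_1..z_M, one weight per exit node."""
    return sum(w * z for w, z in zip(weights, exit_logits))

# Example with M = 3 exits, a batch of 4 samples, and 10 classes; a node
# with strong neural network capability could weight complex exits higher.
z = [torch.randn(4, 10) for _ in range(3)]
q = soft_label(z, weights=[0.2, 0.3, 0.5])
```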
  • This application does not limit the location of the first exit node among multiple nodes.
  • The first exit node is the first node among the multiple nodes, and the data received by the first exit node includes the training data used to train the model and the hard label.
  • the data received by the first node includes training data and hard labels.
  • the training data and hard labels may be received from the device used to configure the training task.
  • the device used to configure the training task may be, for example, a network device, which is not limited in this application.
  • Alternatively, the first exit node is not the first node among the multiple nodes, and the data received by the first exit node includes the hard label and the features of the previous-hop node of the first exit node, where the features of the previous-hop node are obtained by the previous-hop node through inference based on the data it received.
  • the data received by other nodes except the first node can include hard labels and features from the previous hop node.
  • the hard labels may be received from the device used to configure the training task, and the features from the previous hop node may be inferred by the previous hop node based on the data received by itself.
  • Alternatively, the weight of the inference result of each exit node is determined by the first exit node according to the features of the multiple exit nodes. The method further includes: the first exit node uses the locally deployed sub-model to perform inference on the received data to obtain the features of the first exit node; the first exit node receives features from the other exit nodes; and the first exit node uses a preset function to determine, based on the features of the multiple exit nodes, the weight of the inference result of each of the multiple exit nodes, where the parameters in the preset function are obtained by the first exit node through the previous round of training of the local sub-model.
  • In this way, the weight of the inference result of each exit node is determined based on the parameters obtained in each round of training and the features obtained by inference at each exit node, so that the soft label obtained by each weighting is based on the most recent inference results and the most recent features.
  • Therefore, the soft labels can be updated in real time, and the updated soft labels are more conducive to obtaining a distillation loss that matches the capability of the neural network model, which is thus conducive to obtaining a better training effect.
  • the above-mentioned preset function is, for example, a gate function or an attention function.
  • The distillation loss includes: a loss when training the sub-model based on the soft labels of the multiple exit nodes, and a loss when training the sub-model based on the soft label of the first exit node.
  • the distillation loss mainly considers the loss when training the sub-model based on soft labels.
  • In addition to the loss of the soft labels of all exit nodes when training the local sub-model of the first exit node, the distillation loss additionally considers the loss of the soft label of the first exit node when training the local sub-model. By strengthening the local distillation loss of the first exit node, the training effect and training efficiency can be improved.
  • The method further includes: the first exit node sends the inference results and features of the first exit node to the other exit nodes, where the inference results and features of the first exit node are used by the other exit nodes to determine their respective soft labels.
  • the first exit node sends inference results and features to other exit nodes, so that other exit nodes can determine the soft labels for local distillation based on the above method, and then perform local knowledge distillation. As a result, local knowledge distillation can be achieved for the entire multi-exit network.
  • This step can be replaced by: the first exit node sends the inference result of the first exit node to the other exit nodes, and the inference result of the first exit node is used by the other exit nodes to determine their respective soft labels.
  • That the first exit node trains the locally deployed sub-model based on the soft labels of the multiple exit nodes, the predefined hard label, and the predefined loss function includes: the first exit node calculates the loss when training the sub-model based on the soft labels of the multiple exit nodes, the predefined hard label, and the predefined loss function; the first exit node calculates a local gradient based on the loss; the first exit node receives the intermediate gradients backpropagated by the other exit nodes and by the next-hop node of the first exit node; and the first exit node updates the parameters in the sub-model based on the received intermediate gradients and the local gradient.
  • the parameters of the sub-models deployed in each node can be converged, thereby achieving better training results.
  • The method further includes: the first exit node backpropagates intermediate gradients to the other exit nodes and to the previous-hop node of the first exit node.
  • By backpropagating intermediate gradients, the first exit node can make the parameters in the other exit nodes or in the previous-hop node of the first exit node converge, thereby enabling the entire multi-exit network to obtain a better training effect.
  • The first exit node is not the last node among the multiple nodes; and, after the first exit node calculates the loss based on the soft labels of the multiple exit nodes, the predefined hard label, and the predefined loss function,
  • the method further includes: the first exit node sends the calculated loss to the last node among the multiple nodes, where the loss calculated by the first exit node is used by the last node to determine a total loss, and the intermediate gradient from the last node is related to the total loss.
  • Alternatively, the first exit node is the last node among the multiple nodes; and the method further includes: the first exit node receives the respective losses from the other exit nodes; the first exit node determines a total loss based on its own calculated loss and the losses received from the other exit nodes; and the first exit node determines, based on the total loss, the intermediate gradients to be backpropagated to the other nodes, including the other exit nodes.
  • The multiple nodes further include an inference node, and the inference node is the next-hop node of the first exit node; the method further includes: the first exit node sends the features extracted through the sub-model to the inference node.
  • the multiple nodes may also include inference nodes.
  • The inference node is used for inference and can output features but not inference results. Both features and inference results can be obtained through neural network inference; the difference is that, for the sub-model deployed in the same node, the number of neural network layers traversed from the input data to the output inference result is larger than the number of neural network layers traversed from the input data to the output features.
  • By sending features to the next-hop node, the next-hop node can perform inference based on the received data. In this way, distributed inference can be achieved.
  • The method further includes: the first exit node receives first indication information, where the first indication information is used to indicate the multiple exit nodes and the adjacent nodes of the first exit node.
  • the first indication information may be sent by, for example, a device used to configure the training task, such as a network device.
  • Based on the first indication information, the first exit node may determine the multiple exit nodes in the multi-exit network and the adjacent nodes of the first exit node.
  • the adjacent nodes include previous hop nodes and/or next hop nodes.
  • The method further includes: the first exit node receives second indication information, where the second indication information is used to indicate the sub-model to be deployed on the first exit node.
  • The second indication information may also be sent, for example, by the device used to configure the training task.
  • The first exit node may deploy the sub-model locally based on the second indication information.
  • The structures and parameters indicated in the second indication information sent to different exit nodes may be different.
  • the structures and parameters of the sub-models deployed in different egress nodes may be different. This may be determined based on capability information and/or status information of each egress node. In this way, sub-models that can adapt to exit nodes with different capabilities and different states can be deployed. This is conducive to obtaining better training results.
  • In a second aspect, an inference method is provided, which can be applied to a first exit node among multiple nodes, where a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model.
  • the method provided in the second aspect may be executed by the first exit node.
  • The model may be obtained based on the model training method provided in the first aspect, or may be obtained through other training methods, which is not limited in this application.
  • The first exit node may be, for example, a communication device, such as a terminal device, or a component configured in the communication device, such as a chip, a chip system, or a module or software implementation that can implement part or all of the functions of the first exit node; this is not limited in this application.
  • The method includes: the first exit node obtains a task end condition, which is used to determine whether the inference task in each sub-model can be stopped; if the first exit node does not meet the task end condition, it continues to send the features obtained by inference to the next-hop node and outputs the inference result of the inference task; or, if the first exit node meets the task end condition, it outputs the inference result of the inference task and no longer sends features to the next-hop node.
  • the neural network model can end the task as early as possible when the requirements are met, thereby avoiding unnecessary waiting delays, unnecessary waste of computing resources, and reducing signaling overhead.
  • The task end condition includes one or more of the following: the classification probability in the inference result reaches the corresponding probability threshold; the entropy of the classification probability in the inference result reaches the corresponding entropy threshold; and the inference duration reaches the corresponding duration threshold.
  • In other words, the task end condition includes both an inference accuracy indicator and an inference duration indicator.
  • In this way, the end of the inference task can be controlled from the dimensions of inference accuracy and/or inference duration, thereby meeting the needs of different inference tasks.
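  • A minimal sketch of such a task end check (illustrative helper name and threshold values; the source does not fix how the conditions are combined, so any single condition ends the task here):

```python
import time
import torch.nn.functional as F

def task_ended(logits, start_time, p_thresh=0.9, h_thresh=0.5, t_thresh=0.1):
    """Return True if any configured end condition is met."""
    probs = F.softmax(logits, dim=-1)
    top_p = probs.max().item()                       # classification probability
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    elapsed = time.monotonic() - start_time          # inference duration (seconds)
    # A low entropy means a confident result, so "reaching" the entropy
    # threshold is interpreted here as falling below it.
    return top_p >= p_thresh or entropy <= h_thresh or elapsed >= t_thresh
```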
  • In a third aspect, a model training device is provided, including modules or units for implementing the method in the first aspect and any possible implementation of the first aspect. It should be understood that each module or unit can implement the corresponding function by executing a computer program.
  • In a fourth aspect, an inference device is provided, including modules or units for implementing the method in the second aspect and any possible implementation of the second aspect. It should be understood that each module or unit can implement the corresponding function by executing a computer program.
  • In a fifth aspect, a model training device is provided, including a processor configured to execute the method described in the first aspect and any possible implementation of the first aspect.
  • the device may also include memory for storing instructions and data.
  • the memory is coupled to the processor, and when the processor executes instructions stored in the memory, the methods described in the above aspects can be implemented.
  • the device may also include a communication interface for the device to communicate with other devices.
  • the communication interface may be a transceiver, a circuit, a bus, a module or other types of communication interfaces.
  • In a sixth aspect, this application provides an inference device, including a processor configured to execute the method described in the second aspect and any possible implementation of the second aspect.
  • the device may also include memory for storing instructions and data.
  • the memory is coupled to the processor, and when the processor executes instructions stored in the memory, the methods described in the above aspects can be implemented.
  • the device may also include a communication interface for the device to communicate with other devices.
  • the communication interface may be a transceiver, a circuit, a bus, a module or other types of communication interfaces.
  • the devices in the above third to sixth aspects may be communication equipment, such as terminal equipment, or components configured in the communication equipment, such as chips, chip systems, etc.
  • In a seventh aspect, a communication system is provided, including multiple nodes, where a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model to be trained.
  • The multiple nodes further include at least one inference node; each of the at least one inference node is configured to: perform inference based on the received data to obtain the features of the inference node; and send the features to the next-hop node.
  • A first inference node in the at least one inference node is the next-hop node of a first exit node among the multiple exit nodes, and the first exit node is further configured to send the features obtained by inference to the first inference node.
  • The first inference node is a previous-hop node of a second exit node among the multiple exit nodes; the first inference node is further configured to: receive the intermediate gradient backpropagated by the second exit node; and backpropagate the intermediate gradient to the first exit node.
  • In an eighth aspect, a communication system is provided, including multiple nodes, where a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model.
  • the model may be obtained based on the model training method provided in the first aspect, or may be obtained through other training methods, which is not limited in this application.
  • In a ninth aspect, a chip system is provided, including at least one processor, configured to support implementation of the functions in the above first aspect or second aspect and any possible implementation thereof, for example, receiving or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory, the memory is used to store program instructions and data, and the memory is located within the processor or outside the processor.
  • the chip system can be composed of chips or include chips and other discrete devices.
  • A computer-readable storage medium is provided, including a computer program that, when run on a computer, enables the computer to implement the method in the first aspect or the second aspect and any possible implementation thereof.
  • A computer program product is provided, including a computer program (which may also be called code or instructions) that, when run, causes a computer to execute the method in the first aspect or the second aspect and any possible implementation thereof.
  • Figure 1 is a schematic diagram of knowledge distillation
  • Figure 2 is a schematic diagram of a multi-exit network
  • Figure 3 is a schematic diagram of a multi-layer fully connected neural network
  • Figure 4 is a schematic diagram of a scenario applicable to the model training method provided by the embodiment of the present application.
  • Figure 5 is a schematic diagram of a multi-exit network used for knowledge distillation
  • Figure 6 is a schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 7 is another schematic flow chart of the model training method provided by the embodiment of the present application.
  • Figure 8 is a schematic diagram of a distributed multi-exit network provided by an embodiment of the present application.
  • Figure 9 is a schematic flow chart for sending configuration information provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of inference on a multi-exit network provided by an embodiment of the present application.
  • Figure 11 is a schematic flow chart of the inference method provided by the embodiment of the present application.
  • Figure 12 is a schematic block diagram of a device provided by an embodiment of the present application.
  • Figure 13 is another schematic block diagram of a device provided by an embodiment of the present application.
  • Knowledge distillation can make a small model achieve performance comparable to a large model while reducing the number of parameters and shortening the inference delay, thereby achieving model compression and acceleration. Directly training a small model on massive data often does not easily yield good performance, whereas training a large model on massive data and then performing knowledge distillation from the large model to the small model can achieve a better training effect. In addition, knowledge distillation can also achieve integration and migration of data sets in different fields.
  • Knowledge distillation adopts the teacher-student model.
  • the teacher model is a complex large model
  • the student model is a simple small model.
  • Knowledge distillation is to use the teacher model to assist the training of the student model. Since the teacher model has strong learning ability, it can transfer the knowledge it has learned to the student model with relatively weak learning ability, thereby enhancing the generalization ability of the student model.
  • The complicated and bulky but effective teacher model is not deployed online; it serves only as a tutor. What is actually deployed online for prediction tasks is the flexible and lightweight student model.
  • Figure 1 is a schematic diagram of knowledge distillation.
  • the teacher model contains more layers of neural networks and more neurons than the student model.
  • the teacher model is more complex and the student model is simpler.
  • the teacher model generally uses offline training, and the student model is trained under the guidance of the data set and the teacher model.
  • the loss function for training a student model can contain two parts, namely student loss and distillation loss.
  • If the inference task of the neural network is a classification task, the above total loss is the sum of the classification loss and the distillation loss.
  • If the inference task of the neural network is a regression task, the above total loss is the sum of the regression loss and the distillation loss.
  • The student loss can be, for example, mean square error (MSE), normalized minimum mean square error (NMSE), mean absolute error (MAE), maximum absolute error (also known as absolute error bound), correlation coefficient, cross entropy, mutual information, etc.; this application includes but is not limited to these.
  • For example, the total loss can be written as $L = L_{CE}(p, y) + L_{KL}(p, q)$, where:
  • L represents the total loss;
  • $L_{CE}(p, y)$ is the cross entropy (CE) between the output p of the student model and the label y, which is an expression of the student loss;
  • $L_{KL}(p, q)$ is the Kullback-Leibler divergence (KL divergence, also called relative entropy) between the output p of the student model and the output q of the teacher model, which is an expression of the distillation loss.
  • The label y is the label in the training sample, which can be used to train the student model; it is to be distinguished from the output of the teacher model.
  • The label y can also be called a hard label or hard target, and can be used to calculate the student loss.
  • the output of the teacher model is also used to train the student model.
  • Correspondingly, the output of the teacher model can be called a soft label or soft target, and can be used to calculate the distillation loss.
  • Soft and hard labels may be given other names that achieve the same or similar functionality.
  • For example, the output of the teacher model may also be given another name; this application does not limit this.
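  • A minimal sketch of this total loss for a classification task (PyTorch; the temperature T and the T² scaling are common knowledge-distillation conventions assumed here, not mandated by this application):

```python
import torch.nn.functional as F

def kd_total_loss(p_logits, q_logits, y, T=2.0):
    """Total loss L = L_CE(p, y) + L_KL(p, q): student loss from the hard
    label y plus distillation loss from the teacher output q."""
    l_ce = F.cross_entropy(p_logits, y)
    l_kl = F.kl_div(F.log_softmax(p_logits / T, dim=-1),
                    F.softmax(q_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return l_ce + l_kl
```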
  • A multi-exit network usually refers to a neural network with multiple exits set at positions of different depths. These multiple exits can complete the same task, but the overhead or accuracy required for inference differs.
  • Figure 2 is a schematic diagram of a multi-egress network.
  • the network shown in Figure 2 includes M (M is an integer greater than 1) exits.
  • For one input, the network obtains an output queue $\{p_1, p_2, \ldots, p_M\}$, where $p_m$ represents the inference result output by the m-th of the M exits, $1 \le m \le M$, and m is an integer.
  • feature 1 can be output through feature extraction of block 1.
  • Feature 1 can be input to block 2 and exit 1 at the same time.
  • Feature 1 input to exit 1 is further processed to obtain the inference result (logits) 1 output by exit 1.
  • Feature 1 input to block 2 undergoes feature extraction in block 2, resulting in feature 2.
  • Feature 2 is input to both block 3 and exit 2, and so on, until feature M-1 output by block M-1 is input to block M.
  • Feature M-1 undergoes feature extraction in block M to obtain feature M.
  • Feature M is input to exit M.
  • the inference result M output from exit M can be obtained.
  • A block may include one or more convolutional layers, and an exit may include an output layer.
  • The entire neural network model deployed in the multi-exit network is divided into multiple sub-models.
  • For example, it can include: sub-model 1, including block 1 and exit 1; sub-model 2, including block 2 and exit 2; …; sub-model M, including block M and exit M.
  • Model 1 includes: block 1 and exit 1; model 2 includes: block 1, block 2, and exit 2; …; model M includes: block 1, block 2, …, block M, and exit M. It can be seen that these multiple models of different complexity each contain one or more of the aforementioned sub-models.
  • Features and inference results are both data output after processing by the neural network. The difference is that features are data obtained after feature extraction: they can be input to an exit for inference, or input to the next block for further feature extraction. Inference results are the data output at the exits. For example, if the inference task is a classification task, the inference result of the model can be the predicted category; if the inference task is a regression task, the inference result of the model can be the fitting result of the input data.
  • both features and inference results can be understood as being obtained through inference.
  • the number of neural network layers experienced from input data to output inference results is more than the number of neural network layers experienced from input data to output features.
  • features are obtained by inference based on input data, and the inference results can be obtained by further inference based on the features. Therefore, features can also be called intermediate inference results.
  • In other words, the features are obtained by inference through the neural network layers in the block, and the inference result is obtained by inference through the neural network layers in both the block and the exit.
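  • The block/exit structure described above can be sketched as follows (a toy example using linear blocks instead of convolutional ones; all names are illustrative):

```python
import torch
import torch.nn as nn

class MultiExitNet(nn.Module):
    """Toy multi-exit network: block m extracts feature m, which is fed both
    to exit m (producing logits m) and to block m+1."""

    def __init__(self, dims, num_classes):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dims[m], dims[m + 1])
                                    for m in range(len(dims) - 1))
        self.exits = nn.ModuleList(nn.Linear(dims[m + 1], num_classes)
                                   for m in range(len(dims) - 1))

    def forward(self, x):
        features, logits = [], []
        for block, exit_head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))     # feature m
            features.append(x)
            logits.append(exit_head(x))  # inference result (logits) m
        return features, logits          # output queue {p_1, ..., p_M}

net = MultiExitNet(dims=[16, 32, 32, 32], num_classes=10)   # M = 3 exits
feats, outs = net(torch.randn(4, 16))
```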
  • the neural network model is split into multiple sub-models and deployed on multiple nodes.
  • each layer of neural network outputs features to the next layer, which will affect the output of the next layer of neural network. That is, the inference results output by the downstream neural network layer are affected by the features output by the upstream neural network layer. On the other hand, the inference results output by each layer of neural network are also affected by the local neural network. Therefore, the parameters in each layer of neural network can be converged by backpropagating gradients from the downstream neural network to the upstream neural network.
  • the gradient includes local gradient and intermediate gradient.
  • the local gradient is mainly introduced to consider the impact of the local neural network layer on the inference results
  • the intermediate gradient is mainly introduced to consider the impact of the received features from other layers of neural networks on the inference results. The following is explained in conjunction with the three neural network layers in Figure 3:
  • For example, denote the output of the $k$-th layer by $y_k$ and its parameters by $\theta_k$. For the third (last) layer, the gradient of parameter $\theta_3$ satisfies: $\partial L/\partial \theta_3 = g_3 \cdot \partial y_3/\partial \theta_3$,
  • where $g_3 = \partial L/\partial y_3$ is the intermediate gradient obtained by deriving the loss function with respect to the layer output.
  • The third layer backpropagates the intermediate gradient $g_2 = g_3 \cdot \partial y_3/\partial y_2$ to the second layer, which computes $\partial L/\partial \theta_2 = g_2 \cdot \partial y_2/\partial \theta_2$ in the same way. The gradient of parameter $\theta_1$ satisfies: $\partial L/\partial \theta_1 = g_1 \cdot \partial y_1/\partial \theta_1$,
  • where $g_1 = g_2 \cdot \partial y_2/\partial y_1$ is the received intermediate gradient from the next layer, i.e., the second layer of the neural network.
  • Since the first layer is the first layer of the neural network, there is no need to continue backpropagating the intermediate gradient.
  • the relationship between the three-layer neural network shown in Figure 3 and the gradients corresponding to the parameters of each layer of the neural network are only examples.
  • the interaction between the neural network layers is not limited to what is shown in Figure 3.
  • For example, the exit nodes can also send inference results and features to each other, in which case the corresponding intermediate gradients include more terms than in the relationships shown above.
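  • The split backpropagation described above can be sketched in PyTorch as follows (three single-layer "nodes"; each node detaches the feature it sends, receives an intermediate gradient from its next hop, and continues backpropagation locally):

```python
import torch
import torch.nn as nn

theta1, theta2, theta3 = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 1)
x = torch.randn(4, 8)

# Forward: each node detaches its output before "sending" it onward.
y1 = theta1(x)
y1_recv = y1.detach().requires_grad_(True)   # received by node 2
y2 = theta2(y1_recv)
y2_recv = y2.detach().requires_grad_(True)   # received by node 3
y3 = theta3(y2_recv)
loss = y3.pow(2).mean()                      # stand-in for the real loss

# Node 3: local gradients for theta3 plus the intermediate gradient w.r.t. y2.
loss.backward()
g2 = y2_recv.grad                            # backpropagated to node 2

# Node 2: resume backprop from the received intermediate gradient.
y2.backward(g2)
g1 = y1_recv.grad                            # backpropagated to node 1

# Node 1: first layer, so no further intermediate gradient is propagated.
y1.backward(g1)
```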
  • Figure 4 is a schematic diagram of a scenario applicable to the model training method provided by the embodiment of the present application.
  • the scene of Figure 4 shows a network device 410 and a plurality of terminal devices 421-425.
  • The network device may include a central server, and the central server may store training samples and the neural network model to be trained. The training samples include training data and hard labels.
  • the neural network model to be trained can be divided into multiple sub-models and deployed on multiple terminal devices respectively. The sub-models deployed separately in the multiple terminal devices are combined together to form a distributed training neural network model.
  • Terminal devices can exchange information with network devices and with each other to complete a given model training task and, after the training is completed, to complete a given inference task.
  • the network device can send the sub-models to be trained to the terminal device respectively.
  • The network device can also send the training samples to the first terminal device among the multiple terminal devices, and can distribute the hard labels to each terminal device, so that each terminal device can train the locally deployed sub-model based on the received hard labels.
  • Terminal devices can exchange the features obtained by their respective inference; for example, the previous hop sends features to the next hop.
  • Terminal devices can also exchange gradients; for example, the next hop sends gradients to the previous hop.
  • the terminal device may be located within the beam/cell coverage of the network device.
  • The terminal device can communicate with the network device over the air interface in the uplink (UL) or downlink (DL) direction.
  • For example, the terminal device can send uplink data to the network device through the physical uplink shared channel (PUSCH) in the UL direction;
  • the network device can send downlink data to the terminal device through the physical downlink shared channel (PDSCH) in the DL direction.
  • The terminal device in Figure 4 can be a terminal device that supports the new radio (NR) air interface, can access the communication system through the air interface, and can initiate services such as calls and Internet access.
  • the terminal equipment can also be called user equipment (user equipment, UE) or mobile station (mobile station, MS) or mobile terminal (mobile terminal, MT), etc.
  • The terminal device in Figure 4 can be a cellular phone, a mobile phone, a smartphone, a wireless data card, a personal digital assistant (PDA), a tablet computer, a wireless modem, a laptop computer, an MTC terminal, or a computer with wireless transceiver functions.
  • It can also be a wireless terminal in a smart city, a wireless terminal in a smart home, a vehicle-mounted terminal, a handset with wireless communication functions, a wearable device, a computing device, another processing device connected to a wireless modem, a vehicle with vehicle-to-vehicle (V2V) communication capability, an intelligent connected vehicle, a drone with unmanned aerial vehicle (UAV) to UAV (U2U) communication capability, etc.
  • The network device in Figure 4 can be any device with wireless transceiver functions. It is mainly used to implement functions such as wireless physical control, resource scheduling and wireless resource management, wireless access control, and mobility management, and to provide reliable wireless transmission protocols, data encryption protocols, etc.
  • the network device may be a device deployed in a wireless access network to provide wireless communication functions for terminal devices.
  • the network device in Figure 4 may be a device that supports wired access or a device that supports wireless access.
  • the network device may be an access network (AN)/radio access network (RAN) device, which is composed of multiple AN/RAN nodes.
  • AN/RAN nodes can be: an access point (AP), a base station (NodeB, NB), an enhanced base station (enhanced NodeB, eNB), a next-generation base station (NR NodeB, gNB), a transmission reception point (TRP), a transmission point (TP), or some other access node. The base station can include various forms of macro base stations, micro base stations, relay stations, etc.; it can also be a device-to-device (D2D), vehicle-to-everything (V2X), or machine-to-machine (M2M) communication device that performs base station functions; and it can also include the centralized unit (CU) and distributed unit (DU) in a cloud radio access network (C-RAN).
  • The above-mentioned central server can be a server deployed in existing network equipment.
  • For example, a base station can be reused to implement the functions of the central server, or a separately deployed network-side device can implement the functions of the central server. This application does not limit this.
  • Figure 5 is a schematic diagram of a multi-exit network used for knowledge distillation.
  • the multi-exit network shown in Figure 5 is similar to Figure 2.
  • Exits 1 to M can respectively output the inference results obtained through inference of the respective models.
  • exit 1 outputs the inference results of sub-model 1
  • exit 2 outputs the inference results of sub-model 2
  • ... exit M outputs the inference results of sub-model M.
  • model M is the model with the most complex structure among the M models.
  • The total loss M of model M (i.e., the teacher model) includes only the student loss,
  • while the total loss of the other models includes both a student loss and a distillation loss.
  • student loss includes but is not limited to classification loss, for example, it can also be regression loss, and this application includes but is not limited to this.
  • the student losses of each sub-model illustrated in Figure 5 are all classification losses.
  • The total loss 1 of model 1 includes classification loss 1 and distillation loss 1, where classification loss 1 is determined by the inference result 1 output by model 1 and the hard label, and distillation loss 1 is determined by the inference result 1 output by model 1 and the inference result M output by model M. By analogy, the total loss of each student model can be obtained.
  • model M includes block 1, block 2, ..., block M and outlet M, that is, it includes a multi-layer neural network.
  • This model is relatively complex. When it is used as the teacher model to perform knowledge distillation on the student models, the difference in capacity between the neural networks of the different student models is not taken into account. This affects the training effect of shallow network exits (such as exit 1 of sub-model 1 in Figure 5) and even the training effect of the entire neural network model. For example, the convergence speed of shallow network exits and even of the entire neural network model is affected, as is the performance of the network exits and of the entire neural network after convergence.
  • this application provides a model training method.
  • multiple teacher models of different complexity can be obtained based on the different complexities of each sub-model.
  • the weighted output of multiple teacher models is used as a soft label to perform knowledge distillation on the student model. Due to the different complexity and capacity of each sub-model, the capabilities of each sub-model are also different.
  • This solution can obtain multiple teacher models of different complexity by deploying sub-models at different exit nodes, and the impact of the inference results output by each teacher model on the soft labels, that is, the impact on the training of the student models, can be adjusted through the weights.
  • In this way, neural networks with different capabilities are taken into account, so that different student models can obtain distillation losses adapted to their neural network capabilities, which can help improve the training effect of each student model.
  • the method provided by this application mainly includes the process of obtaining the output of the teacher model and the process of knowledge distillation of the student model.
  • the output of the teacher model can be used to perform knowledge distillation on the student model, and the parameters in the distilled student model can be used to obtain the output of the teacher model in the next round of training.
  • a round of training includes: obtaining the output of the teacher model and performing knowledge distillation on the student model.
  • The preset conditions refer to the conditions under which the trained sub-model can be used online. For example, some performance parameters can be used as evaluation indicators to determine whether the sub-model meets the preset conditions.
  • the preset condition includes: the average loss is less than the first preset threshold, or the maximum loss is less than the second preset threshold, or the number of training rounds is greater than the third preset threshold, and so on. This application includes but is not limited to this.
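  • A sketch of such a check (hypothetical function name and illustrative threshold values):

```python
def training_done(avg_loss, max_loss, rounds,
                  avg_thresh=0.05, max_thresh=0.1, round_thresh=1000):
    """Stop when the average or maximum loss is small enough,
    or when enough training rounds have been performed."""
    return avg_loss < avg_thresh or max_loss < max_thresh or rounds > round_thresh
```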
  • It should be noted that distillation is also a training process: distilling the student model means training the student model. In other words, training involves distillation. This application only distinguishes between the two processes of obtaining the output of the teacher model and distilling the student model.
  • M: the number of exit nodes, where M is an integer greater than 1.
  • L: the loss function, where $L_{CE}$ represents the student loss and $L_{KL}$ represents the distillation loss.
  • y: the hard label, from the training samples.
  • f: the features output by the sub-model deployed in an exit node, which can be referred to as the features output by the exit node; $f_i$ represents the features output by exit node i, where $1 \le i \le M$ and i is an integer; exit node i is any one of the M exit nodes.
  • z: the inference result output by the sub-model deployed in an exit node;
  • $z_i$: the inference result output by exit node i.
  • $q_i$: the output of the teacher model (i.e., the soft label) obtained by exit node i.
  • $\sigma(z_i)$: the output of the sub-model deployed in exit node i in the knowledge distillation stage, which can be referred to as the output of the student model in exit node i; σ() represents the activation function.
  • Figure 6 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • Figure 6 uses the first exit node as an example of an exit node to describe the model training method provided by this application.
  • The first exit node is one of multiple nodes; a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model to be trained (for example, model M shown in Figure 5).
  • the sub-models deployed in the one or more nodes are combined to form multiple neural network models of different complexity (for example, Model 1 to Model M shown in Figure 5).
  • the multiple nodes include multiple exit nodes
  • In other words, the neural network model is a neural network model with multiple exits, which may be referred to as a multi-exit network model.
  • The first exit node may include, for example, block 1 and exit 1 in Figure 5, or block 2 and exit 2 in Figure 5, or block M and exit M in Figure 5, and so on, which will not be listed here.
  • the plurality of nodes further includes at least one inference node. It should be understood that whether it is an exit node or an inference node, each node requires inference. The difference between the exit node and the inference node is that the exit node outputs inference results and features after inference; the inference node outputs features after inference but does not output the inference results. Essentially, the exit node contains more neural network layers than the inference node. For example, the exit node contains an output layer, but the inference node does not contain an output layer.
  • the multiple nodes where the multiple sub-models are deployed may be, for example, multiple terminal devices in the communication system as shown in Figure 4, with one sub-model deployed in each terminal device.
  • each terminal device may be one of the plurality of nodes.
  • the first exit node is one of the exit nodes, that is, it can correspond to one of the terminal devices.
  • All of the multiple terminal devices may be exit nodes, or only some of them may be exit nodes.
  • When some of the terminal devices are exit nodes, the remaining terminal devices are inference nodes.
  • the method 600 shown in FIG. 6 includes steps 610 to 640. Each step in method 600 is described in detail below.
  • the first exit node uses the sub-model deployed locally to perform inference on the received data to obtain the inference result of the first exit node.
  • The data received by the first exit node is related to the position of the first exit node among the multiple nodes, or in other words, to the position of the sub-model deployed in the first exit node in the multi-exit network.
  • If the first exit node is the first node in the multi-exit network,
  • the data received by the first exit node includes the training data and hard labels from the network device.
  • the training data can be used for inference at the first exit node to obtain the inference result of the first exit node.
  • If the first exit node is not the first node in the multi-exit network,
  • the data received by the first exit node includes data from the previous-hop node and hard labels from the network device.
  • The data from the previous-hop node may be features obtained by the previous-hop node by processing the data it received. These features can be used for inference at the first exit node to obtain the inference result of the first exit node.
  • the inference result output by the first exit node can be used by the first exit node and other exit nodes to determine the output of the teacher model, that is, to determine the soft label, and then used for training of the student model.
  • In step 620, the first exit node receives inference results from other exit nodes.
  • the multiple nodes include M exit nodes. Therefore, in addition to obtaining the inference results through self-inference, the first exit node can also receive inference results from other exit nodes (ie, M-1 exit nodes). Therefore, the first exit node can obtain the inference results of multiple exit nodes including itself.
  • the first exit node obtains the soft label of the first exit node based on the inference result and its weight of each exit node in the plurality of exit nodes.
  • Soft labels are the labels used when training the student model through the teacher model, that is, the output of the teacher model.
  • its soft label can also be determined based on the obtained inference results of multiple exit nodes.
  • The output of the teacher model may be a weighted sum of the inference results of the above M exit nodes. For example, if the inference result of exit node i is denoted $z_i$, then the output $q_i$ of the teacher model of exit node i satisfies: $q_i = \sum_{j=1}^{M} w_{ij} z_j$, where $w_{ij}$ represents the weight applied to the inference result of exit node j when the output of the teacher model is calculated on exit node i.
  • the first exit node can be any exit node among the M exit nodes.
  • the output of the teacher model of the first exit node can also be determined by the above formula.
  • the first exit node can perform one or more rounds of training on the local sub-model so that the sub-model reaches the preset conditions.
  • Similarly, each exit node can perform one or more rounds of training on the local sub-model so that its respective sub-model reaches the preset conditions. Therefore, the output of the teacher model in each round of training can be regarded as a weighted combination of the outputs of the student models obtained in the previous round of training.
  • the weight of the inference results of each exit node is predefined.
  • The weight of the inference result of each exit node may be determined based on the neural network capability of the first exit node. If the sub-model deployed in the first exit node has a more complex structure and a larger capacity, the first exit node can be considered to have strong neural network capabilities; it can then apply a higher weight to
  • the inference results of exit nodes deploying sub-models with more complex structures, and a lower weight to the inference results of exit nodes deploying sub-models with simpler structures. If the sub-model deployed in the first exit node has a simpler structure and a smaller capacity, the first exit node can be considered to have weak neural network capabilities; it can then apply a lower weight to the inference results of exit nodes deploying sub-models with more complex structures, and a higher weight to the inference results of exit nodes deploying sub-models with simpler structures. In this way, the first exit node can take neural networks with different capabilities into account when training the local sub-model, so that the sub-model can obtain a distillation loss adapted to its neural network capability, which in turn helps improve the training effect of the student model.
  • In another possible implementation, the weight of the inference result of each exit node is determined based on the features of each exit node.
  • In other words, the weights are functions of the features.
  • For example, the first exit node may use a preset function to determine the weight of the inference result of each of the M exit nodes based on the features of the M exit nodes.
  • The preset function may include, for example, but is not limited to: a gate function, an attention function, a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a graph neural network (GNN), and combinations of the above neural networks, etc.
  • The weights of the inference results of the 1st, 2nd, …, M-th exit nodes, as determined by the i-th ($1 \le i \le M$, i an integer) exit node among the M exit nodes, are $w_{i1}, w_{i2}, \ldots, w_{iM}$ and satisfy: $(w_{i1}, w_{i2}, \ldots, w_{iM}) = \mathrm{gate}(F)$,
  • where gate() represents the gate function.
  • the gate function is composed of a neural network, which may include, for example, a fully connected layer, a batch normalization layer, and an activation layer.
  • The activation function of the activation layer is, for example, a rectified linear unit (ReLU) function.
  • $F = \mathrm{concat}(f_1, f_2, \ldots, f_M)$, where concat() represents the concatenation (cascade) of features $f_1$ to $f_M$.
  • Features $f_1$ to $f_M$ are the features output by the above M exit nodes respectively.
  • A specific example of the attention function can be expressed as: $(w_{i1}, \ldots, w_{iM}) = \mathrm{softmax}(Q_i K_i^{\mathrm{T}} / \sqrt{d})$, where $Q_i$ and $K_i$ denote the results obtained by passing F through neural networks, and d denotes the dimension of $Q_i$ or $K_i$.
  • softmax() represents the softmax function. In this example, softmax() satisfies: $\mathrm{softmax}(x)_j = e^{x_j} / \sum_{k} e^{x_k}$.
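  • A minimal sketch of the gate-function variant (the fully connected + batch normalization + activation structure described above; the final softmax normalization is an added assumption so that the weights sum to 1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateWeights(nn.Module):
    """Gate function: maps F = concat(f_1..f_M) to one weight per exit node."""

    def __init__(self, feat_dim, num_exits):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * num_exits, num_exits),  # fully connected layer
            nn.BatchNorm1d(num_exits),                   # batch normalization layer
            nn.ReLU(),                                   # activation layer
        )

    def forward(self, feats):             # feats: list of M tensors [batch, feat_dim]
        f_cat = torch.cat(feats, dim=-1)  # F = concat(f_1, ..., f_M)
        return F.softmax(self.net(f_cat), dim=-1)

gate = GateWeights(feat_dim=32, num_exits=3)
w = gate([torch.randn(4, 32) for _ in range(3)])  # shape [4, 3]; rows sum to 1
```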
  • the method further includes:
  • the first exit node uses the locally deployed sub-model to perform inference on the received data to obtain the features of the first exit node;
  • the first exit node receives features from other exit nodes
  • the first exit node uses a preset function to determine the weight of the inference result of each of the multiple exit nodes based on the characteristics of the multiple exit nodes.
  • The process by which the first exit node uses the locally deployed sub-model to perform inference on the received data to obtain the features of the first exit node is similar to the aforementioned step 610, except that the output is different.
  • the first exit node can perform one inference on the received data through the local sub-model to obtain the inference results and characteristics of the first exit node without performing multiple inference operations.
  • The first exit node can also receive the inference result and the features from the same other exit node at the same time. It should be noted that, because the other exit nodes occupy different positions among the multiple exit nodes, the timing at which the first exit node receives inference results and features from different other exit nodes also differs. For example, in the entire multi-exit network, upstream exit nodes can determine their inference results and features earlier than downstream exit nodes. The output of each node and its sequence will be explained later, and will not be described in detail here.
  • To determine the weights, in addition to the features of each exit node, some parameters also need to be determined.
  • One possible implementation is to obtain these parameters through the previous round of training of the local sub-model of each exit node. It should be understood that the previous round of training here specifically refers to the most recent process of obtaining the output of the teacher model and performing knowledge distillation on the student model based on the output of the teacher model.
  • the first exit node trains the locally deployed sub-model based on the soft labels, predefined hard labels, and predefined loss functions of the multiple exit nodes to obtain the trained sub-model.
  • a training round consists of obtaining the output of the teacher model and performing knowledge distillation on the student model. Therefore, training the local sub-model can also be called knowledge distillation of the local sub-model.
  • When the preset conditions are met, the training can be stopped, and the student model can be used online.
  • The total loss of each sub-model can include a student loss and a distillation loss, where the student loss is the loss when training the sub-model based on the hard label, and the distillation loss is the loss when distilling the sub-model (i.e., the student model) based on the soft labels.
  • both are training losses, but the labels used for training come from different sources.
  • For example, the loss function of exit node i can be expressed as: $L_i = L_{CE}(\sigma(z_i), y) + \alpha T^2 \sum_{j=1}^{M} L_{KL}(\sigma(z_i/T), \sigma(q_j/T)) + \alpha T^2 L_{KL}(\sigma(z_i/T), \sigma(q_i/T))$,
  • where $\sigma(z_i)$ is the output of student model i, such as a classification probability,
  • y is the hard label,
  • and T and α are hyperparameters (T is the distillation temperature and α scales the distillation loss).
  • In this loss function, the first term is the student loss, and the second and third terms are the distillation loss. The second term represents the distillation loss imposed on the student model by the outputs of the teacher model calculated at all exit nodes, and the third term represents the distillation loss imposed on the student model by the output of the teacher model calculated locally.
  • the third item is optional and can be understood as strengthening the distillation loss of the second item.
  • The loss function can also be transformed into other forms.
  • The loss function used to train the sub-model is not limited to those listed above. Based on the same concept, those skilled in the art can make simple transformations to the loss functions illustrated above, and these transformations shall fall within the protection scope of this application.
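  • A sketch of the per-exit loss reconstructed above (PyTorch; the placement of T and α follows a common distillation convention and is an assumption):

```python
import torch.nn.functional as F

def exit_loss(z_i, q_all, q_i, y, T=2.0, alpha=0.5):
    """Loss of exit node i: student loss + distillation against the teacher
    outputs of all exit nodes + the optional strengthened local term."""
    def kl(z, q):  # temperature-softened KL divergence
        return F.kl_div(F.log_softmax(z / T, dim=-1),
                        F.softmax(q / T, dim=-1),
                        reduction="batchmean") * T * T

    student = F.cross_entropy(z_i, y)                 # first term (hard label)
    distill_all = sum(kl(z_i, q_j) for q_j in q_all)  # second term (all soft labels)
    distill_local = kl(z_i, q_i)                      # third, optional term
    return student + alpha * (distill_all + distill_local)
```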
  • the first exit node can calculate the local loss, and then optimize the parameters in the local sub-model, so that the local loss of the first exit node can be reduced. Thereafter, the first exit node can perform a new round of training based on the process described above until the preset conditions are met and the training is stopped.
  • Based on the above solution, the first exit node obtains the output of the teacher model by weighting the inference results output by each exit node, uses it as a soft label, and trains the local sub-model together with the hard label. In this way, the impact of each exit node's inference result on the soft label, that is, on the distillation of the sub-model, is adjusted through the weights, taking neural networks with different capabilities into account, so that different student models can obtain distillation losses adapted to their neural network capabilities, which can help improve the training effect of each student model.
• the weight used to obtain the output of the teacher model can be determined based on the parameters obtained in the previous round of training and the features from each exit node. The parameters in the model therefore converge continuously, so that the output of the student model approximates the label faster, which increases the training speed and shortens the training time.
• each exit node can perform self-distillation of its local sub-model based on the method executed by the above-mentioned first exit node, so that multiple models of different complexity and different capacity can be obtained in the multi-exit network, and different weights can be introduced for different models when obtaining the soft labels, in order to control the impact that the inference results of sub-models of different complexity and capacity have on distillation. This allows the sub-model in each exit node to obtain a distillation loss adapted to its capability, which is conducive to obtaining better training results, as illustrated in the sketch below.
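• The weighting described above can be sketched as follows; the scoring function and the parameters W (one parameter vector per exit node, carried over from the previous round of training) are hypothetical placeholders for the preset function mentioned in the text:

    import numpy as np

    def exit_weights(feats, W):
        # score each exit node's feature with last-round parameters, then normalize
        scores = np.array([float(w @ f.ravel()) for w, f in zip(W, feats)])
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def teacher_soft_label(results, feats, W):
        # weighted combination of the exit nodes' inference results (the soft label)
        weights = exit_weights(feats, W)
        return sum(wi * ri for wi, ri in zip(weights, results))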
  • multiple nodes in the following embodiments include multiple exit nodes and at least one inference node.
  • a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model to be trained.
  • the five terminals are terminal 1 to terminal 5 respectively, among which terminal 2, terminal 3 and terminal 5 are exit nodes, and terminal 1 and terminal 4 are inference nodes. It should be understood that whether it is an exit node or an inference node, each terminal needs to perform inference.
  • Figure 7 is another schematic flow chart of the model training method provided by the embodiment of the present application.
• the method shown in Figure 7 includes steps 710 to 750. Each step in method 700 is described in detail below. It should be noted that in Figure 7, to make it easy to associate different terminals with their respective data (such as features, inference results, and soft labels), the information or data sent by each terminal is named with the number of that terminal.
  • step 710 the network device sends configuration information to multiple terminals. Accordingly, multiple terminals receive configuration information.
  • This configuration information can be used to configure respective sub-models for each terminal, and can also indicate to each terminal its respective adjacent nodes, such as the previous hop node and/or the next hop node.
  • the network device can also send the identity of the exit node to each terminal serving as an exit node.
  • the network device may determine the number of terminals that can currently be scheduled based on the status information and/or capability information reported by multiple terminals.
  • the network equipment scheduling terminal specifically refers to the scheduling terminal for model training. Network equipment can schedule multiple terminals for model training to achieve distributed training.
  • the status information of the terminal includes one or more of the following: resource information occupied by the terminal, device adjacency matrix of the terminal, device status of the terminal, whether the terminal agrees to participate in training or inference, and the validity period of the status information of the terminal.
  • the resource information occupied by the terminal may include information such as frequency band and bandwidth occupied by the terminal.
  • the device status of a terminal can be idle or busy. When the device status of the terminal is idle, the terminal can support inference. When the device status of the terminal is busy, the terminal cannot support inference.
  • the device adjacency matrix of the terminal can be used to indicate the connection relationship between the terminal and other terminals, and the device adjacency matrix can also be used to indicate the channel status or communication quality between the terminal and other terminals.
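• As a purely illustrative example (the chain topology and the 0/1 entries are assumptions; a real report could carry channel-quality values instead), a device adjacency matrix for five terminals connected in a chain might look as follows:

    import numpy as np

    # rows/columns: terminal 1 ... terminal 5; A[i][j] = 1 means the two are connected
    A = np.array([[0, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]])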
• the validity period of the terminal's status information indicates the time period during which the reported status information remains valid.
  • the capability information of the terminal includes: the storage space size of the terminal and/or the maximum computing power of the terminal.
• After the network device determines the number of schedulable terminals, it can split the model to obtain multiple sub-models that match the number of schedulable terminals.
  • Network devices can use existing splitting and device selection algorithms to perform model splitting and device selection, which is not limited in this application.
  • the model will be split into 5 sub-models.
  • Two of the terminals (terminal 1 and terminal 4 in the figure) are inference terminals and can be used for inference; the other three terminals (terminal 2, terminal 3 and terminal 5 in the figure) are export terminals and can be used for training.
  • the network device can send the parameters and structure of each sub-model to its corresponding terminal respectively, so that each terminal can build the sub-model locally based on the received parameters and structure.
  • a distributed training architecture can be obtained, which can also be called a distributed split learning architecture.
  • Figure 8 shows a distributed multi-exit network obtained by using the five terminals as nodes. The output of each terminal will be described in detail below in conjunction with the specific process, and will not be described in detail here.
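• A minimal sketch of the splitting and assignment step is given below; the layer names, split points, and role map are hypothetical, and a real system would obtain them from the split and device-selection algorithms mentioned above:

    # stand-ins for the layers of the model to be trained
    layers = [f"block{k}" for k in range(10)]
    split_points = [2, 4, 6, 8]  # chosen by the split algorithm
    sub_models, start = [], 0
    for end in split_points + [len(layers)]:
        sub_models.append(layers[start:end])
        start = end

    roles = {1: "inference", 2: "exit", 3: "exit", 4: "inference", 5: "exit"}
    config = {t: {"sub_model": sub_models[t - 1], "role": r} for t, r in roles.items()}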
• After the network device determines that multiple terminals are to be used for model training, it can send configuration information to each terminal.
  • Step 710 will be described in detail below with reference to Figure 9 .
  • the sending configuration information shown in Figure 9 may include:
  • Step 910 The network device sends second indication information to multiple terminals, where the second indication information is used to indicate the sub-model deployed on each terminal. Correspondingly, each terminal receives the second indication information.
  • the second indication information may be specifically used to indicate to each terminal the structure and/or parameters of the sub-model deployed thereon.
  • the second indication information is also used to indicate a hard label.
  • the network device configures sub-models for each terminal, it also assigns hard labels for training to each terminal. It can be understood that the hard tags assigned to each terminal may be the same.
  • Step 920 The network device sends first indication information to each terminal. Correspondingly, each terminal receives the first indication information.
  • the first indication information may include identifiers of multiple egress nodes, and may also include identifiers of adjacent nodes.
  • step 920 may specifically include:
  • Step 9201 The network device sends the identifiers of multiple exit nodes to each terminal.
  • Step 9202 The network device sends the identification of the adjacent node to each terminal.
• the identities of the multiple exit nodes include, for example, the identities of the above-mentioned terminal 2, terminal 3, and terminal 5.
• the identification of a terminal may include, for example: the terminal's identity (ID), the terminal's Internet protocol (IP) address, the terminal's media access control (MAC) address, the terminal's international mobile subscriber identity (IMSI), the terminal's globally unique temporary identity (GUTI), the terminal's subscription permanent identifier (SUPI), or the terminal's generic public subscription identifier (GPSI), etc.
  • each node it needs to exchange data with neighboring nodes, such as receiving data from the previous hop node and sending data to the next hop node, so each node needs to know its neighboring nodes.
  • the adjacent nodes of each node are different.
• for the first node, its adjacent node includes the next hop node; for example, the identity of the adjacent node sent to terminal 1 shown in the figure is the identity of terminal 2.
• for the last node, its adjacent node includes the previous hop node; for example, the identity of the adjacent node sent to terminal 5 shown in the figure is the identity of terminal 4.
• for an intermediate node, its adjacent nodes include both the previous hop node and the next hop node. Here, an intermediate node refers to a node other than the first node and the last node among the multiple nodes.
• the identities of the adjacent nodes sent to terminal 2, terminal 3, and terminal 4 respectively are not listed one by one here.
• steps 9201 and 9202 can be combined into one step; for example, one step is used to send first indication information that indicates both the identities of the multiple exit nodes and the identities of the adjacent nodes. For an inference node, step 9201 can be skipped.
  • step 910 and step 920 can be combined into one step to be executed, or can be divided into multiple steps to be executed, which is not limited in this application.
  • each terminal can train the local sub-model.
  • the training process may specifically include the following steps 720 to 750.
  • Each terminal can obtain a sub-model that meets the preset conditions through one or more rounds of training, and then obtain a multi-exit network model that meets the preset conditions.
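• The per-round flow at an exit node, as detailed in steps 720 to 750 below, can be outlined as follows; every method on node is a hypothetical placeholder for the corresponding step:

    def train_until_converged(node, max_rounds=100):
        for _ in range(max_rounds):
            data = node.receive_data()                    # training data or upstream feature
            feat, result = node.infer(data)               # steps 720/730: local inference
            node.send_to_other_exits(feat, result)
            others = node.collect_from_other_exits()
            soft = node.compute_soft_label(others, feat, result)   # step 740
            loss = node.distill(soft, node.hard_label)    # step 750: train local sub-model
            if node.preset_condition_met(loss):           # e.g. loss below a threshold
                break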
  • each terminal performs inference based on the received data, obtains and sends its own characteristics.
  • the data it receives includes training data and hard labels from network devices.
  • Terminal 1 can perform inference based on the received training data, obtain feature 1, and send it to the next hop node terminal 2
  • the data it receives includes feature 1 from the previous hop node terminal 1 and the hard label from the network device.
  • Terminal 2 can perform inference based on the received feature 1 and obtain feature 2. Since terminal 2 is an exit node, it can send the obtained feature 2 to the next hop node terminal 3 and other exit node terminals 5.
  • the data it receives includes feature 2 from the previous hop node terminal 2 and the hard label from the network device.
  • Terminal 3 can perform inference based on the received feature 2 and obtain feature 3. Since terminal 3 is also an exit node, it can send the obtained feature 3 to the next hop node terminal 4, as well as other exit nodes terminal 2 and terminal 5.
  • the data it receives includes the feature 3 from the previous hop node terminal 3 and the hard tag from the network device.
  • Terminal 4 can perform inference based on the received feature 3, obtain feature 4, and send it to the next hop node terminal 5.
  • the data it receives includes the characteristics 4 from the previous hop node terminal 4 and the hard tag from the network device.
  • the terminal 5 can perform inference based on the received feature 4 and obtain the feature 5. Since terminal 5 is an exit node, it can send the obtained feature 5 to other exit nodes terminal 2 and terminal 3.
  • each terminal can receive hard tags from the network device when receiving configuration information, and when performing a training task, receive data related to the training task, such as training data or features output by the previous hop node.
  • this application does not limit the order in which features are transmitted between terminals.
  • the specific execution order can be determined based on its internal logic and resource scheduling.
  • each terminal serving as an exit node performs inference based on the received features, obtains and sends its own inference results.
  • the exit node includes terminal 2, terminal 3 and terminal 5.
  • each exit node receives the characteristics from the previous hop node, it can perform inference and obtain its own inference results. For example, after receiving feature 1, terminal 2 can infer and obtain inference result 2, and send inference result 2 to other exit nodes terminal 3 and terminal 5. For another example, after receiving feature 2, terminal 3 can infer and obtain inference result 3, and send inference result 3 to other exit node terminals 2 and 5.
  • the terminal 5 can infer the inference result 5 and send the inference result 5 to other exit node terminals 2 and 3.
• the division into step 720 and step 730 is only to facilitate distinction and does not limit the execution order of the specific actions therein.
  • the sequence of each action can be determined based on its internal logic. For example, after terminal 2 receives feature 1, it can obtain inference result 2 and send it to terminal 3 and terminal 5. It does not necessarily have to wait until terminal 5 receives feature 4 before executing it.
  • the terminal 2 can send the feature 2 and the inference result 2 to other exit nodes by performing one sending action, and does not necessarily need to be divided into two sending actions. For the sake of brevity, no examples are given here.
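• The forward pass of steps 720 and 730 can be sketched as a relay over the chain of five terminals; backbones and heads below are hypothetical stand-ins for each terminal's sub-model and for the exit nodes' classifier heads:

    def relay(x, backbones, heads, exit_ids=(2, 3, 5)):
        feats, results = {}, {}
        for k, f in enumerate(backbones, start=1):
            x = f(x)  # feature k produced by terminal k
            feats[k] = x
            if k in exit_ids:
                results[k] = heads[k](x)  # inference result k of exit node k
        return feats, results

    # toy usage with trivial sub-models and heads
    backbones = [lambda v: v + 1 for _ in range(5)]
    heads = {k: (lambda v: 2 * v) for k in (2, 3, 5)}
    feats, results = relay(0, backbones, heads)  # feats[5] == 5, results[5] == 10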
  • each terminal serving as an exit node obtains the soft label of each terminal based on the inference results and characteristics of each exit node.
• the soft label of each terminal, that is, the output of the teacher model for that terminal's local sub-model, can be used to perform knowledge distillation on the local sub-model.
  • each terminal serving as an exit node can obtain the inference results and characteristics of each exit node including itself.
  • terminal 2 can obtain inference result 2 and feature 2 through self-inference, and can receive inference result 3 and feature 3 from terminal 3, and receive inference result 5 and feature 5 from terminal 5.
  • the terminal 2 can obtain the inference results and characteristics of each exit node, and then calculate the soft label 2.
  • the terminal 2 to calculate the soft label 2 please refer to the relevant description of step 630 in the method 600, and will not be described again here.
  • Terminal 3 and terminal 5 can obtain soft label 3 and soft label 5 respectively based on the same method.
• terminals that do not serve as exit nodes do not need to exchange their respective inference results and features.
  • the soft labels obtained by each exit node can be used to perform knowledge distillation on the local sub-model.
  • each terminal serving as an exit node performs training based on soft labels, hard labels and loss functions.
  • each terminal serving as an exit node performs step 750, it may specifically perform the following steps:
• Step i) based on the soft labels, the hard label, and the loss function, calculate the loss for training the local sub-model;
• Step ii) calculate the local gradient based on the loss;
• Step iii) receive the intermediate gradients backpropagated by other nodes;
• Step iv) update the parameters in the sub-model based on the local gradient and the received intermediate gradients.
  • each terminal serving as an exit node performs inference based on the received data to obtain an inference result. Then, soft labels, hard labels and loss functions are combined to calculate the distillation loss and student loss corresponding to the inference results output by the local sub-model.
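• Steps i) to iv) can be condensed into the following sketch; treating parameters and gradients as flat numpy vectors and simply summing the incoming intermediate gradients are simplifying assumptions:

    import numpy as np

    def training_step(params, local_grad, incoming_grads, lr=0.01):
        # steps i)-ii): local_grad is the gradient of the local loss
        g = local_grad.copy()
        # step iii): accumulate intermediate gradients backpropagated by other nodes
        for ig in incoming_grads:
            g = g + ig
        # step iv): update the sub-model parameters (plain gradient descent)
        return params - lr * g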
  • Figure 10 takes the above five terminals as an example to show the reasoning process of each terminal as an exit node.
  • the network device can send data for inference to the first node terminal 1. It should be understood that this inference process is used to determine the loss and is the inference in the training stage, so the data used for inference is training data. Since terminal 1 is an inference node, it can perform inference based on the received data and obtain and output feature 1. The next hop node of terminal 1 is terminal 2, and terminal 1 can send feature 1 to terminal 2. Since terminal 2 is the exit node, after receiving feature 1, it can perform inference based on feature 1 and obtain and output inference result 2 and feature 2. Inference result 2 is the result obtained by the inference of terminal 2 and can be output as the result of the inference task.
  • terminal 2 can send feature 2 to terminal 3 for further inference through terminal 3. Since terminal 3 is the exit node, after receiving feature 2, it can perform inference based on feature 2 and obtain and output inference result 3 and feature 3. Inference result 3 is the result obtained by the inference of terminal 3 and can be output as the result of the inference task.
  • terminal 3 can send feature 3 to terminal 4 for further inference through terminal 4.
• Terminal 4 is an inference node, and after performing inference based on the received feature 3, it can obtain and output feature 4.
  • the next hop node of terminal 4 is terminal 5, and terminal 4 can send feature 4 to terminal 5. Since terminal 5 is the exit node and the last node, after receiving feature 4, it can perform inference based on feature 4 and obtain and output inference result 5.
• terminal 2, terminal 3, and terminal 5, which are the exit nodes, all output inference results based on the received data. It is understandable that since terminal 5 is at the end and its data passes through the largest number of neural network layers, the inference result obtained after its inference may be closer to the real value. In other words, the later exit nodes have higher accuracy and better performance; however, as more neural network layers are traversed, the delay and complexity also increase.
  • Each terminal serving as an exit node can calculate the loss based on local inference results, and then perform local parameter updates. Without loss of generality, the specific implementation process of the above steps i) to iv) will be described below using terminal 2 as an example of an exit node.
• the terminal 2 can calculate the loss 2 for training the local sub-model based on the locally calculated soft label 2, the hard label received from the network device, and the preset loss function, and then obtain the local gradient 2 based on the loss 2.
• When terminal 2 performs step iii), it can receive the intermediate gradients backpropagated from its next hop node terminal 3 and from the other exit node terminal 5.
  • terminal 2 may first receive intermediate gradients 2-3a from terminal 3 and intermediate gradients 2-5 from terminal 5. Since the terminal 5 also back-propagates the intermediate gradient 3-5 to the terminal 3, the terminal 3 can calculate the intermediate gradient 2-3b and feed it back to the terminal 2 based on the received intermediate gradient 3-5. Therefore, terminal 2 can receive two intermediate gradients 2-3 from terminal 3. For distinction and explanation, they are respectively recorded as intermediate gradient 2-3a and intermediate gradient 2-3b. It should be understood that the two received intermediate gradients 2-3 are different.
• the earlier-received intermediate gradient 2-3a is determined only by the loss calculated by terminal 3 when distilling its local sub-model, while the later-received intermediate gradient 2-3b is additionally determined by the intermediate gradient 3-5 that terminal 3 received from terminal 5.
  • the parameters in the local sub-model can be updated based on the local gradient 2 and the received intermediate gradient.
  • intermediate gradient 3-5 propagates before intermediate gradient 5-3 and intermediate gradient 2-3b
  • intermediate gradient 2-5 propagates before intermediate gradient 3-2 and intermediate gradient 5-2.
  • the order of propagation of the intermediate gradient 3-5, the intermediate gradient 2-5, the intermediate gradient 3-2, and the intermediate gradient 5-2 is not limited.
• the order of propagation among intermediate gradient 2-5, intermediate gradient 3-5, intermediate gradient 5-3, and intermediate gradient 2-3b is also not limited.
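• How terminal 3 turns the received intermediate gradient 3-5 into intermediate gradient 2-3b is ordinary backpropagation through its own sub-model. With a hypothetical linear sub-model, the operation reduces to a vector-Jacobian product:

    import numpy as np

    W3 = np.array([[0.5, 0.1],
                   [0.2, 0.4]])  # toy sub-model of terminal 3: f3(x) = W3 @ x

    def backprop_through_terminal3(upstream_grad):
        # chain rule: gradient w.r.t. terminal 3's input, passed back to terminal 2
        return W3.T @ upstream_grad  # e.g. yields intermediate gradient 2-3b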
  • each terminal performs step 750, its specific implementation is not limited to the process shown in steps i) to iv) above.
  • the last node can also receive the loss from each exit node, then calculate the total loss, and back-propagate the intermediate gradient based on the total loss.
• when performing step 750, the exit node may also perform the following step: sending the calculated loss to the last node among the multiple nodes.
• in this case, the intermediate gradients backpropagated to other nodes include: the intermediate gradients backpropagated to other exit nodes, and the intermediate gradient backpropagated to the previous hop node of the last node.
  • each exit node can obtain soft labels based on local inference results and features, as well as inference results and features received from other exit nodes, and then combine with hard labels to train local sub-models. Since the model training is deployed in each communication device (such as a terminal) in the communication system, the status information and capability information of each communication device are comprehensively considered to deploy sub-models of different complexity for different communication devices.
• the impact of sub-models of different complexity on the soft labels of different exit nodes is adjusted by adjusting the weights of the inference results of the exit nodes, so that each student model can obtain a distillation loss matched to its neural network capability, which is beneficial to improving the training effect of each student model.
• the weight used to obtain the output of the teacher model can be determined based on the parameters obtained in the previous round of training and the features from each exit node. The parameters in the model therefore converge continuously, so that the output of the student model approximates the label faster, which increases the training speed and shortens the training time.
• scheme 1 shown in the above table is a single-exit network model, while schemes 2 to 6 and the present scheme are multi-exit network models.
• each model uses the Resnet18 neural network structure, and the classification accuracy is tested on the data sets "CUB-200-2011", "Stanford Dogs", and "FGVC-Aircraft".
• the last column of the table shows the improvement of the maximum accuracy of the model under the present scheme (shown in bold font in the table) compared with the other schemes.
• the foregoing describes, with reference to multiple figures, the model training method based on the multi-exit network and knowledge distillation.
  • the model trained based on the above method can be used to perform inference tasks.
• the process of performing inference through this model is similar to the inference process in the training phase explained above in conjunction with Figure 10.
• the difference is that the data received by the first node in the training phase is training data, whereas the data received when performing an inference task is the data that requires inference.
• for the rest of the process, please refer to the relevant explanations above, which will not be repeated here.
  • each node needs to reason and interact to collaborate to complete the reasoning task.
  • the complexity of each exit node is different. The later the exit node, the higher the complexity and the greater the reasoning delay. In some cases, the inference results obtained at an intermediate exit node may achieve higher accuracy, but the task still needs to continue. This may cause unnecessary waiting delays, unnecessary waste of computing resources and signaling overhead.
• this application also provides an inference method that introduces task end conditions and performance indicators to judge whether the inference result of each exit node reaches the corresponding threshold. If the corresponding threshold is reached, the inference task is ended in advance, which avoids unnecessary waiting delays and unnecessary waste of computing resources, reduces signaling overhead, and can meet the needs of different inference tasks.
  • Figure 11 is a schematic flowchart of the inference method 1100 provided by the embodiment of the present application.
  • the process shown in Figure 11 shows the method with the first exit node as the execution subject.
  • the first exit node can be one of multiple nodes.
  • a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the neural network model.
  • the neural network model may be trained through the model training method provided above, or may be trained based on other methods, which is not limited in this application.
  • the method 1100 shown in FIG. 11 includes steps 1110 to 1150. Each step in the method 1100 is described in detail below.
  • step 1110 the first exit node obtains the task end condition.
• the task end condition may be set by staff for different inference tasks.
• for example, staff can set it on the network device, which delivers it to each exit node, or set it directly on each exit node. This application does not limit this.
  • the network device sends a task end condition to each egress node.
  • the network device sends indication information, and the indication information is used to indicate the task end conditions of each sub-model.
  • the task end conditions include one or more of the following:
  • the classification probability in the inference result reaches the corresponding probability threshold
  • the entropy of the classification probability in the inference result reaches the corresponding entropy threshold
  • the inference duration reaches the corresponding duration threshold.
  • the classification probability can be used to evaluate the accuracy of the model. Setting a threshold for the classification probability can make the accuracy of the inference results output by the model meet the needs of the inference task when it ends the inference task. This can meet the different accuracy requirements of different reasoning tasks.
• when the sub-model outputs classification results for multiple categories, there are multiple corresponding classification probabilities. In this case, whether the probability threshold is reached, and thus whether the task end condition is met, can be judged according to the maximum value of the classification probabilities.
• suppose this sub-model is used to classify the animal that is the target in an image.
• the categories include "cat", "dog", and "rabbit".
• the inference result output by the sub-model then includes the probabilities that the target is a cat, a dog, and a rabbit, respectively. If the classification probability of "rabbit" is the maximum value, then that classification probability is used to determine whether the probability threshold is reached, and hence whether the task end condition is met.
• this condition can be expressed, for example, as max(c_1(x), ..., c_N(x)) ≥ P_th, where max() represents the function that finds the maximum value, x is the data input to the sub-model, c_n(x) is the classification probability output by the sub-model for category n, n takes values from 1 to N, N represents the number of categories included in the classification result, and P_th is the corresponding probability threshold.
• correspondingly, there are also multiple entropy values of the classification probabilities. In this case, whether the entropy threshold has been reached can be judged based on the minimum value of the entropies of the classification probabilities. It can be understood that the smaller the entropy of a classification probability, the higher the confidence of the sub-model in the current inference result, and the higher the accuracy can be considered to be.
• the entropy H of the classification probabilities can be expressed, for example, as H = -a · Σ_{n=1}^{N} c_n(x) · log c_n(x), where a is a predefined coefficient and the other parameters are understood as above.
• the greater the complexity of the inference, the longer the inference time may be.
• if the inference time is too long, the model may not be suitable for inference tasks with strict latency requirements. Therefore, introducing the task end condition that the inference duration reaches the duration threshold makes it possible to control the inference time, so that the model does not wait for a long time when executing an inference task, thereby satisfying the needs of inference tasks with strict duration requirements.
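• The three task end conditions can be checked, for example, as follows; the threshold values and the coefficient a are illustrative and would be configured per inference task:

    import numpy as np

    def should_end_task(probs, elapsed, p_th=0.9, h_th=0.3, t_th=0.05, a=1.0):
        p_max = probs.max()                                   # max classification probability
        entropy = -a * np.sum(probs * np.log(probs + 1e-12))  # entropy of the probabilities
        return p_max >= p_th or entropy <= h_th or elapsed >= t_th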
  • step 1120 the first exit node performs inference based on the received data to obtain features and inference results.
  • the data received by the first exit node is used for inference, and the process of obtaining features and inference results can be found in the previous description in conjunction with Figure 10 and will not be described again.
  • step 1130 the first exit node determines whether the inference result satisfies the task end condition.
  • the first exit node can determine whether the locally output inference result meets the task end condition based on the received task end condition. If satisfied, execute step 1140 to end the task; if not satisfied, execute step 1150 to send the characteristics to the next hop node.
  • step 1140 the first exit node ends the task, which means that the first exit node no longer continues to send data to the next hop node, and the entire inference task is stopped.
• if the task end condition is not satisfied, the first exit node sends the features to the next hop node so that the next hop node can continue inference; the task can be ended once the inference result of some exit node meets the task end condition.
  • Step 1160 may be executed before step 1130 or may be executed after step 1130, which is not limited in this application.
  • each exit node can execute the above process to complete local inference tasks when receiving data.
  • the steps performed by each exit node are the same as described above. For the sake of brevity, they will not be repeated here.
  • the neural network model can end the task as early as possible while meeting the requirements, thereby avoiding unnecessary waiting delays, unnecessary waste of computing resources, and reducing signaling overhead.
  • the task end condition includes both the inference accuracy indicator and the inference duration indicator.
  • the end of the reasoning task can be controlled from the dimensions of reasoning accuracy and/or reasoning duration, thereby meeting the needs of different reasoning tasks.
  • Figure 12 is a schematic block diagram of a device provided by an embodiment of the present application. As shown in Figure 12, the device 1200 may include a transceiver module 1210 and a processing module 1220.
  • the device 1200 can be a model training device, which can be used to implement the model training method in the previous embodiments. Specifically, it can be used to implement the steps performed by the exit node in the embodiments shown in FIGS. 6 to 10 .
  • the device 1200 may correspond to a first exit node, the device 1200 is a node among a plurality of nodes, each of the plurality of nodes has a sub-model deployed, and the sub-model deployed in each node is One of multiple sub-models obtained by splitting the model to be trained.
• the processing module 1220 can be used to perform inference on the received data using the locally deployed sub-model to obtain the inference result of the device 1200; the transceiver module 1210 can be used to receive inference results from other exit nodes, where the other exit nodes are exit nodes other than the device 1200 among multiple exit nodes, and the multiple exit nodes belong to the multiple nodes; the processing module 1220 can also be used to obtain the soft label of the device 1200 based on the inference result and weight of each exit node among the multiple exit nodes, where the soft label of the device 1200 is the weighting of the inference results of the multiple exit nodes, and to train the locally deployed sub-model based on the soft label of the device 1200, a predefined hard label, and a predefined loss function to obtain the trained sub-model; where the loss function includes a distillation loss determined by the soft label and a student loss determined by the hard label.
• each module in the device 1200 can be used to implement the corresponding process executed by the first exit node in the model training method 600 shown in Figure 6.
  • the transceiving module 1210 can be used to perform step 620 in the method 600
  • the processing module 1220 can be used to perform steps 610, 630 and 640 in the method 600.
  • the weight of the inference result of each exit node is determined by the device 1200 based on the characteristics of multiple exit nodes;
• the processing module 1220 can also be used to perform inference on the received data using the locally deployed sub-model to obtain the characteristics of the device 1200; the transceiver module 1210 can also be used to receive characteristics from other exit nodes; the processing module 1220 can also be used to determine, by using a preset function and based on the characteristics of the multiple exit nodes, the weight of the inference result of each exit node among the multiple exit nodes; wherein the parameters in the preset function are obtained by the device 1200 from the previous round of training of the local sub-model.
• the distillation loss includes: a loss when training the sub-model based on the soft labels of the multiple exit nodes, and a loss when training the sub-model based on the soft label of the device 1200.
• the transceiver module 1210 can also be used to send the inference result and characteristics of the device 1200 to other exit nodes, where the inference result and characteristics of the device 1200 are used by the other exit nodes to determine their respective soft labels.
  • the data received by the device 1200 includes training data and hard labels for training the model.
• the data received by the device 1200 includes the characteristics of the previous hop node of the device 1200 and the hard label, where the characteristics of the previous hop node are obtained by the previous hop node through inference based on the data it received.
• the processing module 1220 can be specifically configured to calculate the loss for training the sub-model based on the soft labels of the multiple exit nodes, the predefined hard label, and the predefined loss function, and to calculate the local gradient based on the loss;
• the transceiver module 1210 can be specifically configured to receive the intermediate gradients backpropagated by other exit nodes and by the next hop node of the device 1200; the processing module 1220 can be specifically configured to update the parameters in the sub-model based on the received intermediate gradients and the local gradient.
• the transceiver module 1210 can also be used to backpropagate intermediate gradients to other exit nodes and to the previous hop node of the device 1200.
• the device 1200 is not the last node among the multiple nodes; the transceiver module 1210 can also be used to send the calculated loss to the last node among the multiple nodes, where the loss calculated by the device 1200 is used by the last node to determine the total loss, and the intermediate gradient from the last node is related to the total loss.
• the device 1200 is the last node among the multiple nodes; the transceiver module 1210 can also be used to receive the respective losses from other exit nodes; the processing module 1220 can also be used to determine the total loss based on the locally calculated loss and the losses received from the other exit nodes, and to determine, based on the total loss, the intermediate gradients to be backpropagated to other nodes, including the other exit nodes and the previous hop node of the device 1200.
  • the multiple nodes also include an inference node, which is the next hop node of the device 1200.
• the transceiver module 1210 can also be used to send the features extracted through the sub-model to the inference node.
• the transceiver module 1210 may also be configured to receive first indication information, where the first indication information is used to indicate the multiple exit nodes and the adjacent nodes of the device 1200.
• the transceiver module 1210 may also be configured to receive second indication information, the second indication information being used to indicate the structure and/or parameters of the sub-model deployed in the device 1200; the processing module 1220 may also be configured to deploy the sub-model locally based on the second indication information.
  • the device 1200 can be an inference device, which can be used to implement the inference method in the previous embodiment, and specifically can be used to implement the steps performed by the first exit node in the embodiment shown in FIG. 11 .
• when the device is a first exit node, the device 1200 is a node among multiple nodes, a sub-model is deployed in each of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model.
• the transceiver module 1210 can be used to obtain the task end condition, which is used to determine whether the inference task can be stopped; the processing module 1220 can be used to control the transceiver module 1210 to continue sending the features obtained by inference to the next hop node when the task end condition is not met, and to output the inference result of the inference task; or, when the task end condition is met, the device 1200 outputs the inference result of the inference task and no longer sends the features to the next hop node.
  • each module in the device 1200 can be used to implement the corresponding process executed by the first exit node in the inference method 1100 shown in FIG. 11 .
• the transceiving module 1210 can be used to perform steps 1110, 1150 and 1160 in the method 1100
  • the processing module 1220 can be used to perform steps 1120 to 1140 in the method 1100.
  • the task end conditions include one or more of the following: the classification probability in the inference result reaches the corresponding probability threshold; the entropy of the classification probability in the inference result reaches the corresponding entropy threshold; and the inference duration reaches the corresponding duration threshold.
  • FIG. 13 is another schematic block diagram of a device provided by an embodiment of the present application.
  • the device 1300 can be used to implement the function of the first exit node in the above method.
  • the device 1300 may be a system on a chip. In the embodiments of this application, the chip system may be composed of chips, or may include chips and other discrete devices.
  • the device 1300 may include at least one processor 1310, used to implement the function of the egress node in the method provided by the embodiment of the present application.
  • the device 1300 can be a model training device.
• the processor 1310 can be used to: perform inference on the received data using the locally deployed sub-model to obtain the inference result of the first exit node; receive inference results from other exit nodes, which are exit nodes other than the first exit node among the multiple exit nodes, the multiple exit nodes belonging to the multiple nodes; obtain the soft label of the first exit node based on the inference result and weight of each exit node among the multiple exit nodes, where the soft label of the first exit node is the weighting of the inference results of the multiple exit nodes; and train the locally deployed sub-model based on the soft label of the first exit node, a predefined hard label, and a predefined loss function to obtain the trained sub-model; where the loss function includes a distillation loss determined by the soft label and a student loss determined by the hard label.
  • the device 1300 may be an inference device.
• the processor 1310 can be used to receive first indication information, where the first indication information is used to indicate the task end condition of the sub-model, and the task end condition is used to determine whether the inference task can be stopped; and, if the task end condition is not met, the first exit node continues to send the inferred features to the next hop node and outputs the inference result of the inference task, or, when the task end condition is met, the first exit node outputs the inference result of the inference task without sending the features to the next hop node.
  • the apparatus 1300 may also include at least one memory 1320 for storing program instructions and/or data.
  • Memory 1320 and processor 1310 are coupled.
  • the coupling in the embodiment of this application is an indirect coupling or communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information interaction between devices, units or modules.
  • Processor 1310 may cooperate with memory 1320.
• Processor 1310 may execute the program instructions stored in memory 1320. At least one of the at least one memory may be integrated in the processor.
  • the device 1300 may also include a communication interface 1330 for communicating with other devices through a transmission medium, so that the device 1300 can communicate with other devices.
  • the other device may be a device used to implement other exit nodes;
• the communication interface 1330 may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of implementing transceiving functions.
• the processor 1310 can send and receive data and/or information through the communication interface 1330, so as to implement the method executed by the exit node in the embodiments corresponding to Figures 6 to 10, or the method executed by the first exit node in the embodiment corresponding to Figure 11.
• the specific connection medium among the processor 1310, the memory 1320, and the communication interface 1330 is not limited in this embodiment of the present application.
  • the processor 1310, the memory 1320 and the communication interface 1330 are connected through a bus 1340.
  • the bus 1340 is represented by a thick line in FIG. 13 , and the connection methods between other components are only schematically illustrated and not limited thereto.
  • the bus 1340 can be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one thick line is used in Figure 13, but it does not mean that there is only one bus or one type of bus.
  • the processor in the embodiment of the present application may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method embodiment can be completed through an integrated logic circuit of hardware in the processor or instructions in the form of software.
• the above-mentioned processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute each method, step, and logical block diagram disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
• the memory may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
• non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which is used as an external cache.
• by way of example but not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
• the communication system includes multiple nodes, a sub-model is deployed in each node of the multiple nodes, and the sub-model deployed in each node is one of multiple sub-models obtained by splitting the model to be trained; the multiple nodes include multiple exit nodes, and each exit node is used to perform the method performed by the exit node in the embodiments shown in Figures 6 to 10, or the method performed by the first exit node in the embodiment shown in Figure 11.
• the multiple nodes further include at least one inference node; each inference node in the at least one inference node is used to perform inference based on the received data to obtain the characteristics of the inference node, and to send the characteristics to the next hop node.
• the first inference node in the at least one inference node is the next hop node of a first exit node among the plurality of exit nodes, and the first exit node is also used to send the features obtained by inference to the first inference node.
• a second exit node among the multiple exit nodes is the next hop node of the first inference node; the first inference node is also used to receive the intermediate gradient backpropagated by the second exit node, and to backpropagate the intermediate gradient to the first exit node.
• This application also provides a computer program product, the computer program product including: a computer program (which may also be called code or instructions). When the computer program is run, the computer is caused to perform the method performed by the exit node in the embodiments shown in Figures 6 to 10.
• This application also provides a computer-readable storage medium that stores a computer program (which may also be called code or instructions). When the computer program is run, the computer is caused to perform the method performed by the exit node in the embodiments shown in Figures 6 to 10, or the method performed by the first exit node in the embodiment shown in Figure 11.
  • unit may be used to refer to computer-related entities, hardware, firmware, a combination of hardware and software, software, or software in execution.
  • the unit described as a separate component may or may not be physically separated, and the component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • each functional unit may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • software When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions (programs). When the computer program instructions (program) are loaded and executed on the computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
• the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
• the available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., digital video discs (DVD)), or semiconductor media (e.g., solid state disks (SSD)), etc.
• if this function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
• the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of various embodiments of the present application.
• the aforementioned storage media include: USB flash drives, removable hard disks, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present application relate to a model training method and a related apparatus. The method is applied to a first exit node among a plurality of exit nodes, and a sub-model is deployed in each node of the plurality of nodes. The method comprises: performing inference on received data using a locally deployed sub-model to obtain an inference result of the first exit node; receiving inference results from other exit nodes, and obtaining a soft label of the first exit node on the basis of the inference results of the exit nodes and the weights of the inference results; and training the locally deployed sub-model on the basis of the soft label of the first exit node, a predefined hard label, and a predefined loss function to obtain trained sub-models. The weights of the inference results of the exit nodes are related to the complexity of the neural networks of the exit nodes, so that controllable training can be performed according to the complexity, capacity, etc. of the sub-models to match the neural network capabilities of different models, thereby improving the training effect.
PCT/CN2022/102635 2022-06-30 2022-06-30 Procédé d'entraînement de modèle et appareil associé WO2024000344A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/102635 WO2024000344A1 (fr) 2022-06-30 2022-06-30 Procédé d'entraînement de modèle et appareil associé

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/102635 WO2024000344A1 (fr) 2022-06-30 2022-06-30 Procédé d'entraînement de modèle et appareil associé

Publications (1)

Publication Number Publication Date
WO2024000344A1 true WO2024000344A1 (fr) 2024-01-04

Family

ID=89383513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/102635 WO2024000344A1 (fr) 2022-06-30 2022-06-30 Procédé d'entraînement de modèle et appareil associé

Country Status (1)

Country Link
WO (1) WO2024000344A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN114049513A (zh) * 2021-09-24 2022-02-15 中国科学院信息工程研究所 一种基于多学生讨论的知识蒸馏方法和系统
CN114611670A (zh) * 2022-03-15 2022-06-10 重庆理工大学 一种基于师生协同的知识蒸馏方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN114049513A (zh) * 2021-09-24 2022-02-15 中国科学院信息工程研究所 一种基于多学生讨论的知识蒸馏方法和系统
CN114611670A (zh) * 2022-03-15 2022-06-10 重庆理工大学 一种基于师生协同的知识蒸馏方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAILIN ZHANG; DEFANG CHEN; CAN WANG: "Confidence-Aware Multi-Teacher Knowledge Distillation", ARXIV.ORG, 11 February 2022 (2022-02-11), XP091148103 *

Similar Documents

Publication Publication Date Title
WO2021233053A1 (fr) Procédé de délestage de calcul et appareil de communication
Peng et al. Deep reinforcement learning based resource management for multi-access edge computing in vehicular networks
US12041692B2 (en) User equipment (UE) capability report for machine learning applications
CN110809306B (zh) 一种基于深度强化学习的终端接入选择方法
TW202143668A (zh) 用於通道狀態反饋(csf)學習的可配置神經網路
US20230292387A1 (en) Method and device for jointly serving user equipment by wireless access network nodes
KR102178880B1 (ko) 디바이스 클러스터링에 기반한 로라 통신 네트워크 시스템 및 데이터 전송 방법
CN113490184A (zh) 一种面向智慧工厂的随机接入资源优化方法及装置
US20220417108A1 (en) Zone-based federated learning
CN113133087A (zh) 针对终端设备配置网络切片的方法及装置
WO2023020502A1 (fr) Procédé et appareil de traitement de données
CN114024639B (zh) 一种无线多跳网络中分布式信道分配方法
US20240073732A1 (en) Method and device for adjusting split point in wireless communication system
CN116848828A (zh) 机器学习模型分布
WO2024000344A1 (fr) Procédé d'entraînement de modèle et appareil associé
WO2022262734A1 (fr) Procédé d'accès à un canal et appareil associé
US11997693B2 (en) Lower analog media access control (MAC-A) layer and physical layer (PHY-A) functions for analog transmission protocol stack
US12040905B2 (en) Determination of maximum number of uplink retransmissions
CN117597969A (zh) Ai数据的传输方法、装置、设备及存储介质
WO2024031535A1 (fr) Procédé de communication sans fil, dispositif terminal et dispositif réseau
WO2024067143A1 (fr) Procédé, appareil et système de transmission d'informations
WO2024027511A1 (fr) Procédé d'accès à un canal et appareil associé
WO2022199315A1 (fr) Procédé et appareil de traitement de données
US12069558B2 (en) Secondary cell group (SCG) failure prediction and traffic redistribution
WO2024036526A1 (fr) Procédé et appareil de planification de modèle

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22948457

Country of ref document: EP

Kind code of ref document: A1