WO2020199914A1 - Method and apparatus for training a neural network (训练神经网络的方法和装置) - Google Patents

Method and apparatus for training a neural network

Info

Publication number
WO2020199914A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
training
trained
loss function
output value
Prior art date
Application number
PCT/CN2020/079808
Other languages
English (en)
French (fr)
Inventor
徐晨
李榕
王坚
黄凌晨
王俊
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020199914A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence (AI), and in particular to a method and device for training a neural network.
  • Neural networks are the foundation of AI technology and have broad application prospects at both the network layer (e.g., network optimization, mobility management, resource allocation) and the physical layer (e.g., channel coding and decoding, channel prediction, receivers).
  • The present application provides a method and device for training a neural network, so that training devices with limited storage space can be used to train a larger-scale neural network.
  • In a first aspect, a method for training a neural network is provided, including: obtaining a neural network to be trained; sending a first neural network to a first training device, where the first neural network is a sub-network of the neural network to be trained and the first training device is used to train the first neural network; sending a second neural network to a second training device, where the second neural network is a sub-network of the neural network to be trained and the second training device is used to train the second neural network; receiving an output value of the neural network to be trained from a target training device, where the target training device is the training device in a training device set that includes the output layer of the neural network to be trained, and the training device set includes the first training device and the second training device; determining a loss function of the neural network to be trained according to the output value of the neural network to be trained; and sending the loss function, or a gradient corresponding to the loss function, to the target training device.
  • In the above method, the control device splits the neural network to be trained into multiple sub-networks, each of which contains fewer parameters. Therefore, a training device with a smaller storage space can also store at least one sub-network, so that multiple training devices with smaller storage space can be used to train a larger-scale neural network (i.e., a neural network containing more parameters). The above method is particularly suitable for terminal devices with limited storage capacity.
  • the first neural network and the second neural network belong to different layers of the neural network to be trained.
  • The above scheme divides the neural network to be trained by depth. Since different sub-networks contain one or more complete layers of the neural network, multiple sub-networks can be processed in series without changing the structure of the neural network to be trained. The solution is therefore simple and easy to implement, and can reduce the load on the control device when dividing the neural network to be trained.
  • The second neural network includes the output layer of the neural network to be trained. Receiving the output value of the neural network to be trained from the target training device includes: receiving the output value of the neural network to be trained from the second training device. Sending the loss function or the gradient corresponding to the loss function to the target training device includes: sending the loss function to the second training device.
  • That is, the second neural network includes the output layer of the neural network to be trained and is the sub-network directly connected to the control device.
  • The control device receives the output value of the neural network to be trained from the second training device and calculates the loss function of the neural network to be trained based on that output value, making full use of the control device's ability to perform complex calculations (calculating the loss function) and allowing the training devices to focus on a large number of simple calculations (the neural network training process).
  • the first neural network and the second neural network belong to the same layer of the neural network to be trained.
  • The above scheme divides the neural network to be trained by width. Since one layer of the neural network to be trained is split into multiple sub-networks, the sub-networks need to be processed in parallel and their output values combined. Compared with dividing by depth, dividing by width requires adding a fully connected layer, but because each sub-network can update its parameters in parallel, training by width can improve training efficiency.
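  • To make the two division strategies concrete, the following is a minimal Python sketch (the function names, the even split, and the rounding rule are illustrative assumptions, not part of the claimed method): depth-wise division assigns whole consecutive layers to each training device, while width-wise division splits the width of every layer across the devices.

```python
def split_by_depth(layer_widths, num_devices):
    """Assign whole consecutive layers to each device (division by depth)."""
    n = len(layer_widths)
    per_device = -(-n // num_devices)          # ceiling division
    return [layer_widths[i:i + per_device] for i in range(0, n, per_device)]

def split_by_width(layer_widths, num_devices):
    """Split every layer's width across the devices (division by width);
    the per-device widths of each layer sum back to the original width."""
    return [[w // num_devices + (1 if j < w % num_devices else 0)
             for w in layer_widths]
            for j in range(num_devices)]

# Example: the 4-layer network (widths 8, 16, 16, 12) used in the embodiments below
print(split_by_depth([8, 16, 16, 12], 2))   # [[8, 16], [16, 12]]
print(split_by_width([8, 16, 16, 12], 2))   # [[4, 8, 8, 6], [4, 8, 8, 6]]
```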
  • The first neural network and the second neural network include the output layer of the neural network to be trained. Receiving the output value of the neural network to be trained from the target training device includes: receiving a first output value from the first training device, where the first output value is the output value of the first neural network, and receiving a second output value from the second training device, where the second output value is the output value of the second neural network. Determining the loss function of the neural network to be trained according to the output value of the neural network to be trained includes: processing the first output value and the second output value through a fully connected layer to obtain the loss function of the neural network to be trained. Sending the loss function or the gradient corresponding to the loss function to the target training device includes: sending the gradient corresponding to the loss function to the first training device and the second training device.
  • That is, the first neural network and the second neural network include the output layer of the neural network to be trained and are the sub-networks directly connected to the control device.
  • The control device receives the output values of the neural network to be trained from the training devices and calculates the loss function of the neural network to be trained based on those output values, making full use of the control device's ability to perform complex calculations (calculating the loss function) and allowing the training devices to focus on a large number of simple calculations (the neural network training process).
  • In a second aspect, the present application also provides a method for training a neural network. The method is applied to a first training device and includes: receiving a first neural network from a control device, where the first neural network is a sub-network of the neural network to be trained and does not include the output layer of the neural network to be trained; training the first neural network; and sending the trained first neural network to the control device.
  • In the above method, the neural network to be trained is split into multiple sub-networks, each of which contains fewer parameters. Therefore, a training device with a small storage space can also store at least one sub-network, so that multiple training devices with small storage space can be used to train a larger-scale neural network (i.e., a neural network containing more parameters). The above method is especially suitable for terminal devices with limited storage capacity.
  • Training the first neural network includes: sending the output value of the first neural network to a second training device, where the output value of the first neural network is used to determine the loss function of the neural network to be trained; receiving a first gradient from the second training device, where the first gradient is the gradient of the input layer of a second neural network in the second training device, the second neural network is another sub-network of the neural network to be trained, and the first gradient is determined based on the loss function; and training the first neural network according to the first gradient.
  • the first training device can directly perform backpropagation calculations based on the received gradients without additional processing of the gradients. Therefore, the solution is simple and easy to implement.
  • the method further includes: determining whether the first neural network has completed the training according to whether the training parameters meet the termination condition.
  • The training parameters include at least one of the number of training rounds, the training time, and the bit error rate. Determining whether the first neural network has completed training according to whether the training parameters meet the termination condition includes: when the value of the loss function of the neural network to be trained is less than or equal to a loss function threshold, determining that the first neural network has completed training; and/or, when the training time is greater than or equal to a time threshold, determining that the first neural network has completed training; and/or, when the bit error rate is less than or equal to a bit error rate threshold, determining that the first neural network has completed training.
  • Whether the neural network has completed training is determined by whether different training parameters meet the termination conditions, so the neural network can be trained flexibly according to the actual situation. For example, when the training device is heavily loaded, the control device can set fewer training rounds, a shorter training time, or a larger bit error rate; when the training device is lightly loaded, the control device can set more training rounds, a longer training time, or a smaller bit error rate. This improves the flexibility of training the neural network.
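  • As an illustration only (the parameter names and the combination logic are assumptions, not part of the claimed method), a termination check covering the conditions described above might look as follows.

```python
import time

def training_finished(rounds, start_time, loss=None, bit_error_rate=None,
                      max_rounds=1000, max_seconds=3600.0,
                      loss_threshold=None, ber_threshold=None):
    """Return True if any configured termination condition is met.

    Each criterion can be enabled independently, matching the "and/or"
    combinations of training rounds, training time, loss function value,
    and bit error rate described above.
    """
    if rounds >= max_rounds:
        return True
    if time.time() - start_time >= max_seconds:
        return True
    if loss_threshold is not None and loss is not None and loss <= loss_threshold:
        return True
    if ber_threshold is not None and bit_error_rate is not None \
            and bit_error_rate <= ber_threshold:
        return True
    return False
```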
  • In a third aspect, this application also provides a method for training a neural network. The method is applied to a second training device and includes: receiving a second neural network from a control device, where the second neural network is a sub-network of the neural network to be trained and includes the output layer of the neural network to be trained; training the second neural network; and sending the trained second neural network to the control device.
  • In the above method, the neural network to be trained is split into multiple sub-networks, each of which contains fewer parameters. Therefore, a training device with a small storage space can also store at least one sub-network, so that multiple training devices with small storage space can be used to train a larger-scale neural network (i.e., a neural network containing more parameters). The above method is especially suitable for terminal devices with limited storage capacity.
  • The method further includes: sending the output value of the second neural network to the control device, where the output value of the second neural network is used to determine the loss function of the neural network to be trained; receiving the loss function or the gradient corresponding to the loss function from the control device; and determining the gradient of the input layer of the second neural network according to the loss function or the gradient corresponding to the loss function.
  • The method further includes: sending the gradient of the input layer of the second neural network to a first training device, where the gradient is used for training a first neural network in the first training device, the input layer of the second neural network is connected to the output layer of the first neural network, and the first neural network is another sub-network of the neural network to be trained.
  • That is, the second training device also needs to send the gradient of the input layer of the second neural network to the first training device, so that the first training device can calculate the gradient of each layer of the first neural network from the gradient of the input layer of the second neural network and update the parameters of the first neural network.
  • training the second neural network includes: determining whether the second neural network completes the training according to whether the training parameters meet the termination condition.
  • The training parameters include at least one of the number of training rounds, the training time, the loss function of the neural network to be trained, and the bit error rate. When the training time is greater than or equal to a time threshold, it is determined that the second neural network has completed training; and/or, when the value of the loss function is less than or equal to a loss function threshold, it is determined that the second neural network has completed training.
  • Whether the neural network has completed training is determined by whether different training parameters meet the termination conditions, so the neural network can be trained flexibly according to the actual situation. For example, when the training device is heavily loaded, the control device can set fewer training rounds, a shorter training time, or a larger bit error rate; when the training device is lightly loaded, the control device can set more training rounds, a longer training time, or a smaller bit error rate. This improves the flexibility of training the neural network.
  • In a fourth aspect, the present application provides a control device that can implement the functions corresponding to the method of the first aspect above.
  • the functions can be realized by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units or modules corresponding to the above-mentioned functions.
  • the device includes a processor, and the processor is configured to support the device to execute the method involved in the first aspect.
  • the device may also include a memory for coupling with the processor and storing programs and data.
  • the device further includes a communication interface for supporting communication between the device and the neural network training device.
  • the communication interface may include a circuit with integrated transceiver functions.
  • In a fifth aspect, this application provides a training device that can implement the functions corresponding to the methods of the second or third aspect above.
  • the functions can be implemented by hardware, or corresponding software can be executed by hardware.
  • the hardware or software includes one or more units or modules corresponding to the above-mentioned functions.
  • the device includes a processor, and the processor is configured to support the device to execute the method related to the second aspect or the third aspect.
  • the device may also include a memory for coupling with the processor and storing programs and data.
  • the device further includes a communication interface for supporting communication between the device and the control device and/or other neural network training devices.
  • the communication interface may include a circuit with integrated transceiver functions.
  • this application provides a neural network training system, including at least one control device described in the fourth aspect and at least two training devices described in the fifth aspect.
  • This application provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the processor executes the method described in the first aspect.
  • This application provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the processor executes the method described in the second or third aspect.
  • This application provides a computer program product comprising computer program code; when the computer program code is executed by a processor, the processor executes the method described in the first aspect.
  • This application provides a computer program product comprising computer program code; when the computer program code is executed by a processor, the processor executes the method described in the second or third aspect.
  • this application provides a chip, which includes a processor and a communication interface.
  • The processor is, for example, a processor core, and the core may include at least one execution unit, for example an arithmetic logic unit (ALU). The communication interface may be an input/output interface, a pin, a circuit, or the like. The processor executes the program code stored in the memory, so that the chip executes the method described in the first aspect.
  • the memory may be a storage unit (for example, a register, a cache, etc.) located inside the chip, or a storage unit (for example, a read-only memory, a random access memory, etc.) located outside the chip.
  • this application also provides a chip, which includes a processor and a communication interface.
  • The processor is, for example, a streaming multiprocessor, which may include at least one execution unit, for example a compute unified device architecture (CUDA) core.
  • the communication interface may be an input/output interface, a pin or a circuit, etc.; the processor executes the program code stored in the memory, so that the chip executes the method described in the second or third aspect.
  • the memory may be a storage unit (for example, a register, a cache, etc.) located inside the chip, or a storage unit (for example, a read-only memory, a random access memory, etc.) located outside the chip.
  • Figure 1 is a schematic diagram of a fully connected neural network applicable to this application;
  • Figure 2 is a schematic diagram of a method for updating neural network parameters based on a loss function;
  • Figure 3 is a schematic diagram of a method for calculating the gradient of a loss function;
  • Figure 4 is a schematic diagram of a neural network training system provided by this application;
  • Figure 5 is a schematic diagram of a method for training a neural network provided by this application;
  • Figure 6 is a schematic diagram of a neural network training method based on depth division provided by this application;
  • Figure 7 is a schematic diagram of a neural network training method based on width division provided by this application;
  • Figure 8 is a schematic diagram of a device for training a neural network provided by this application;
  • Figure 9 is a schematic diagram of another device for training a neural network provided by this application;
  • Figure 10 is a schematic diagram of another device for training a neural network provided by this application.
  • a neural network can also be called an artificial neural network (ANN), and a neural network with a large number of hidden layers is called a deep neural network.
  • The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). From a physical perspective, the work of each layer can be understood as a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) performed through five operations on the input space (the set of input vectors): 1. raising/lowering dimensionality; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a().
  • Here W is a weight vector, and each value in the vector represents the weight value of a neuron in this layer of the neural network. W determines the spatial transformation from the input space to the output space described above; that is, the W of each layer controls how the space is transformed. The purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, or more specifically, learning the weight matrices.
  • The loss function is usually a multivariate function, and its gradient reflects the rate of change of the output value of the loss function when a variable changes: the greater the absolute value of the gradient, the greater the rate of change of the output value of the loss function. Therefore, when updating different parameters, the gradient of the loss function can be calculated and the parameters updated along the direction in which the gradient descends fastest, so as to reduce the output value of the loss function as quickly as possible.
  • a fully connected neural network is also called a multilayer perceptron (MLP).
  • an MLP includes an input layer (left), an output layer (right), and multiple hidden layers (middle), each layer contains several nodes, called neurons. The neurons in two adjacent layers are connected in pairs.
  • The output h of a neuron in the next layer is obtained by processing, with the activation function (i.e., the "a" mentioned above, denoted f below), the weighted sum of all neurons x of the previous layer connected to it. It can be expressed as h = f(Σᵢ wᵢxᵢ + b).
  • MLP can be understood as a mapping relationship from the input data set to the output data set. Generally, MLP is initialized randomly, and the process of obtaining this mapping relationship from random w and b using existing data is called MLP training.
  • the loss function can be used to evaluate the output results of the MLP, and through back propagation, the gradient descent method can iteratively optimize w and b until the loss function reaches the minimum value.
  • The loss function of the MLP can be obtained through forward propagation calculation: the output of each layer is fed as input to the next layer until the output of the output layer of the MLP is obtained, and this result is compared with the target value to obtain the loss function of the MLP. After the loss function is obtained by forward propagation, backpropagation is performed based on the loss function to obtain the gradient of each layer, and w and b are adjusted along the direction in which the gradient descends fastest until the loss function reaches its minimum value.
  • The process of gradient descent can be expressed as θ ← θ − η·∂L/∂θ, where θ is the parameter to be optimized (such as w and b), L is the loss function, and η is the learning rate, which controls the step size of gradient descent, as shown by the arrow in Figure 2.
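  • As a worked one-step example with assumed values (θ = 2, η = 0.1, ∂L/∂θ = 4; these numbers are illustrative only and not taken from this application):

    $$\theta_{\text{new}} = \theta - \eta\,\frac{\partial L}{\partial \theta} = 2 - 0.1 \times 4 = 1.6$$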
  • the chain rule for obtaining partial derivatives can be used for backpropagation calculations, that is, the gradient of the parameters of the previous layer can be calculated recursively from the gradient of the parameters of the latter layer.
  • The chain rule can be expressed as ∂L/∂w_ij = (∂L/∂s_i)·(∂s_i/∂w_ij), where w_ij is the weight connecting node j to node i, and s_i is the weighted sum of the inputs at node i.
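  • To make forward propagation, the chain rule, and the gradient-descent update concrete, the following is a minimal NumPy sketch of one training step for a two-layer MLP with a squared-error loss; the layer sizes, the sigmoid activation, and the learning rate are illustrative assumptions, not taken from this application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer MLP: x -> hidden layer (sigmoid) -> output layer (linear)
x = rng.normal(size=(8, 1))                     # input vector
y = rng.normal(size=(3, 1))                     # target value
W1, b1 = rng.normal(size=(16, 8)), np.zeros((16, 1))
W2, b2 = rng.normal(size=(3, 16)), np.zeros((3, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward propagation: each layer computes a(W.x + b)
s1 = W1 @ x + b1
h1 = sigmoid(s1)
s2 = W2 @ h1 + b2
y_hat = s2                                      # linear output layer
L = 0.5 * np.sum((y_hat - y) ** 2)              # squared-error loss

# Backpropagation: gradients of later layers are reused for earlier layers (chain rule)
dL_ds2 = y_hat - y                              # dL/ds2
dL_dW2 = dL_ds2 @ h1.T
dL_db2 = dL_ds2
dL_dh1 = W2.T @ dL_ds2
dL_ds1 = dL_dh1 * h1 * (1.0 - h1)               # sigmoid derivative
dL_dW1 = dL_ds1 @ x.T
dL_db1 = dL_ds1

# Gradient descent: theta <- theta - eta * dL/dtheta
eta = 0.01
W2 -= eta * dL_dW2; b2 -= eta * dL_db2
W1 -= eta * dL_dW1; b1 -= eta * dL_db1
```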
  • Fig. 4 is a schematic diagram of a training system suitable for this application.
  • the training system includes a control device and at least two training devices.
  • the control device and each training device can communicate with each other.
  • different training devices can also communicate with each other.
  • the control device is, for example, a central processing unit (CPU), the above training device is, for example, a GPU, and the training device may also be a tensor processing unit (TPU) or a CPU or other types of computing units.
  • This application does not limit the specific types of control devices and training devices.
  • The control device and the at least two training devices can be integrated on one chip, for example on a system-on-chip (SoC), or they can be integrated on different chips.
  • Figure 5 shows a method for training a neural network provided by this application.
  • The method 500 can be applied to the training system shown in FIG. 4, where the control device executes the following steps after acquiring the neural network to be trained.
  • S510: Send a first neural network to a first training device, where the first neural network is a sub-network of the neural network to be trained, and the first training device is used to train the first neural network.
  • S520: Send a second neural network to a second training device, where the second neural network is a sub-network of the neural network to be trained and is different from the first neural network, and the second training device is used to train the second neural network.
  • the control device can divide the neural network to be trained into the first neural network and the second neural network according to the depth, or divide the neural network to be trained into the first neural network and the second neural network according to the width.
  • The first neural network and the second neural network may be the same (that is, include the same parameters) or different (that is, include different parameters); the specific forms of the first neural network and the second neural network are not limited in this application. It should be understood that even if the first neural network and the second neural network contain the same parameters, because the two neural networks are two sub-networks of the neural network to be trained, that is, they are different parts of the neural network to be trained, the two neural networks still belong to two different neural networks.
  • the two division methods and the corresponding training methods will be described in detail below.
  • dividing the neural network to be trained into two sub-networks is only an example, and the neural network to be trained can also be divided into more sub-networks.
  • A neural network is composed of multiple parameters. Therefore, the control device sending the first neural network to the first training device can be interpreted as: the control device sends, to the first training device, the parameters constituting the first neural network and information indicating the connection relationship of these parameters. Similarly, the control device sending the second neural network to the second training device can be interpreted as: the control device sends, to the second training device, the parameters constituting the second neural network and information indicating the connection relationship of these parameters.
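  • As an illustration of what sending a sub-network might amount to in practice, the sketch below packs a sub-network's parameters together with information describing how they are connected into a single message; the field names and the dictionary format are hypothetical and not specified by this application.

```python
import numpy as np

def pack_subnetwork(layer_weights, layer_biases):
    """Pack the parameters of a sub-network plus the information
    indicating their connection relationship into one message.

    layer_weights / layer_biases are ordered from input to output, so the
    ordering and the recorded shapes encode how the parameters are connected.
    """
    return {
        "num_layers": len(layer_weights),
        "layer_shapes": [w.shape for w in layer_weights],   # connection relationship
        "weights": [w.tolist() for w in layer_weights],     # parameters
        "biases": [b.tolist() for b in layer_biases],
    }

# Example: a two-layer sub-network with widths 8 -> 16 -> 16
weights = [np.zeros((16, 8)), np.zeros((16, 16))]
biases = [np.zeros(16), np.zeros(16)]
message = pack_subnetwork(weights, biases)   # this object would be sent to a training device
```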
  • After the first training device and the second training device receive their respective neural networks, they can each perform the following steps.
  • the neural network to be trained has 4 layers.
  • the first two layers are divided into the first neural network, and the latter two layers are divided into the second neural network.
  • the foregoing division method is only an example, and the neural network to be trained can also be divided into other types of sub-networks, and each sub-network includes at least one layer of parameters of the neural network to be trained.
  • In Figure 6, the CPU is the control device, GPU0 is the first training device, and GPU1 is the second training device.
  • GPU0 inputs training samples, processes the training samples through the first neural network, and sends the processing results of the training samples to GPU1.
  • GPU1 processes the output of GPU0 through the second neural network to obtain the output of the second neural network, and sends this output to the CPU; the CPU calculates the loss function (L) of the neural network to be trained based on this output.
  • the training samples can be: log likelihood ratio and codeword, or log likelihood ratio and real information. That is, the input of the neural network is the log-likelihood ratio, and the output of the neural network is the estimation of the codeword or the estimation of the real information.
  • the loss function is the difference between the estimate of the codeword and the codeword, or the loss function is the difference between the estimate of the information and the actual information.
  • the training samples can be: historical channel data and future channel data. That is, the input of the neural network is historical channel data, and the output of the neural network is predicted future channel data.
  • the loss function is the difference between the predicted future channel and the real future channel.
  • the training samples can be: the current system state and the optimal scheduling strategy. That is, the input of the neural network is the state information of the system, such as: currently schedulable time-frequency resources, users that need to be scheduled, and the user's quality of service (QoS) level; the output of the neural network is the expected scheduling strategy.
  • the loss function is the difference between the predicted scheduling strategy and the optimal scheduling strategy.
  • The above descriptions of training samples are applicable to all the embodiments of this application. The preferred training samples described above are only examples; since the method of this application is widely applicable to fields involving artificial intelligence, including wireless communication, the Internet of Vehicles, computing, deep learning, pattern recognition and cloud computing, the training samples can be designed for the specific application.
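  • As one hedged illustration of the first kind of training sample (log-likelihood ratios paired with the transmitted codeword), the sketch below generates samples assuming BPSK modulation over an AWGN channel; the bit mapping, the noise level, and the use of random bits instead of a real channel code are assumptions made only for illustration.

```python
import numpy as np

def make_llr_samples(codewords, noise_std, rng):
    """codewords: array of shape (batch, n) with bits in {0, 1}.

    Returns (llr, codewords): the LLRs are the neural network input and
    the codeword bits are the training label.
    """
    symbols = 1.0 - 2.0 * codewords                  # BPSK mapping: 0 -> +1, 1 -> -1
    received = symbols + noise_std * rng.normal(size=codewords.shape)
    llr = 2.0 * received / noise_std ** 2            # LLR for BPSK over AWGN
    return llr, codewords

# Example: random 16-bit "codewords" (a real channel code would impose structure)
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(4, 16)).astype(float)
llr, labels = make_llr_samples(bits, noise_std=0.5, rng=rng)
```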
  • An optimizer is deployed on each training device, and each optimizer is used to calculate the gradient of the neural network deployed on each training device.
  • The input of optimizer 1 is the loss function of the entire neural network to be trained, and the input of optimizer 0 is the gradient output by optimizer 1.
  • Gradient 1 in FIG. 6 represents the gradient of each layer of the second neural network, and gradient 0 represents the gradient of each layer of the first neural network.
  • the optimizer in the various embodiments of the present application may be a software module (for example, program code) or a hardware module (for example, a logic circuit).
  • Here f is the activation function, l is the loss function, and N denotes the topology of the second neural network.
  • the training device calculates the gradient of each layer and updates the parameters of each layer according to the gradient of each layer.
  • GPU1 calculates the gradient of the fourth layer, ∂L/∂θ₄, according to the loss function L, then calculates the gradient of the third layer, ∂L/∂θ₃, from ∂L/∂θ₄, and subsequently sends ∂L/∂θ₃ to GPU0. GPU0 calculates the gradient of the second layer, ∂L/∂θ₂, from the received gradient, and then calculates the gradient of the first layer, ∂L/∂θ₁, from ∂L/∂θ₂. Here the first layer is the input layer of the neural network to be trained, the second and third layers are its hidden layers, and the fourth layer is its output layer.
  • ⁇ 4 ⁇ 1 represent the parameters of each layer, and the update of ⁇ 4 ⁇ 1 can be performed after the GPU corresponding to each parameter has completed the gradient calculation, or after all the GPUs have completed the gradient calculation.
  • After the parameter update, GPU0 processes the training sample through the updated first neural network and sends the processing result to GPU1.
  • GPU1 processes the output result of GPU0 through the second neural network after parameter update, obtains the output result, and sends the output result to the CPU.
  • the CPU calculates the loss function again according to the output result. If the loss function does not meet the requirements, the loss function can be sent to GPU1, and the aforementioned parameter update steps are repeated to continue training; if the loss function meets the requirements, the training can be stopped.
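  • The following is a minimal NumPy sketch of the depth-wise training flow described above, with GPU0, GPU1 and the CPU simulated as ordinary Python objects; the layer sizes, the tanh activation, the squared-error loss, and the learning rate are illustrative assumptions. For simplicity, the gradient exchanged between the two simulated devices is the gradient with respect to the first sub-network's output (i.e., the second sub-network's input), which is one common way to realize the gradient that is sent back to GPU0.

```python
import numpy as np

class SubNetwork:
    """Simulates a training device (e.g., a GPU) holding a stack of layers."""
    def __init__(self, sizes, rng):
        self.W = [rng.normal(scale=0.1, size=(o, i)) for i, o in zip(sizes, sizes[1:])]
        self.b = [np.zeros((o, 1)) for o in sizes[1:]]

    def forward(self, x):
        self.acts = [x]
        for W, b in zip(self.W, self.b):
            x = np.tanh(W @ x + b)
            self.acts.append(x)
        return x

    def backward(self, grad_out, lr=0.01):
        """Backpropagate grad_out (dL/d output) through the local layers,
        update the local parameters, and return dL/d input for the previous device."""
        for k in reversed(range(len(self.W))):
            grad_pre = grad_out * (1.0 - self.acts[k + 1] ** 2)   # derivative of tanh
            dW, db = grad_pre @ self.acts[k].T, grad_pre
            grad_out = self.W[k].T @ grad_pre                     # gradient w.r.t. this layer's input
            self.W[k] -= lr * dW
            self.b[k] -= lr * db
        return grad_out

rng = np.random.default_rng(0)
gpu0 = SubNetwork([8, 16, 16], rng)     # first neural network (layers 1-2)
gpu1 = SubNetwork([16, 16, 12], rng)    # second neural network (layers 3-4, incl. output layer)

x = rng.normal(size=(8, 1))             # training sample
target = rng.normal(size=(12, 1))

for step in range(100):
    h = gpu0.forward(x)                 # GPU0 processes the sample
    y = gpu1.forward(h)                 # GPU1 processes GPU0's output
    loss_grad = y - target              # CPU: gradient of 0.5 * ||y - target||^2
    g_in = gpu1.backward(loss_grad)     # GPU1 updates its layers, returns its input gradient
    gpu0.backward(g_in)                 # GPU0 continues backpropagation with that gradient
```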
  • the CPU or GPU0 can also determine whether the first neural network has completed the training according to whether the training parameters meet the termination condition.
  • The above training parameters include at least one of the number of training rounds, the training time, and the bit error rate. When the number of training rounds is greater than or equal to the round threshold, the CPU or GPU0 determines that the first neural network has completed training; when the training time is greater than or equal to the time threshold, it determines that the first neural network has completed training; when the bit error rate is less than or equal to the bit error rate threshold, it determines that the first neural network has completed training.
  • The CPU or GPU0 can stop training the first neural network when one parameter among the loss function, the number of training rounds, the training time, and the bit error rate meets the termination condition, or it can stop training the first neural network when multiple of these parameters meet the termination conditions.
  • The CPU or GPU1 can stop training the second neural network when one parameter among the loss function, the number of training rounds, the training time, and the bit error rate meets the termination condition, or it can stop training the second neural network when multiple of these parameters meet the termination conditions.
  • Whether the neural network has completed training is determined by whether different training parameters meet the termination conditions, so the neural network can be trained flexibly according to the actual situation. For example, when the CPU or training device is heavily loaded, the CPU can set fewer training rounds, a shorter training time, or a larger bit error rate; when the CPU or training device is lightly loaded, the CPU can set more training rounds, a longer training time, or a smaller bit error rate. This improves the flexibility of training the neural network.
  • Method two divides the neural network as follows. If the number of GPUs is M and the neural network to be trained has N layers in total, with the width of each layer being w_i, i ∈ [0, N) (i.e., each layer contains w_i parameters), then each GPU can deploy N layers of the neural network, with the width of each layer being v_{i,j}, i ∈ [0, N), j ∈ [0, M), satisfying Σ_j v_{i,j} = w_i for every i. In addition, the width of the fully connected layer is w_{N-1}.
  • The neural network to be trained has 4 layers, whose widths are 8, 16, 16, and 12 respectively. That is, the first layer contains 8 parameters, the second layer contains 16 parameters, the third layer contains 16 parameters, and the fourth layer contains 12 parameters.
  • The first layer is the input layer of the neural network to be trained, the second and third layers are its hidden layers, and the fourth layer is its output layer.
  • The parameters of each layer can be divided into two groups: the first neural network contains one group of parameters and the second neural network contains the other group. The widths of the layers of the first neural network and of the second neural network are both 4, 8, 8, and 6.
  • the above division method is only an example, and the parameters of each layer may not be divided equally.
  • a fully connected layer needs to be deployed in the CPU.
  • The width of the fully connected layer is the same as the sum of the width of the output layer of the first neural network and the width of the output layer of the second neural network; in the neural network shown in Figure 7, the width of the fully connected layer is 12.
  • GPU0 and GPU1 each input training samples. GPU0 processes its training sample through the first neural network and sends the output value of the first neural network (i.e., the first output value) to the CPU; GPU1 processes its training sample through the second neural network and sends the output value of the second neural network (i.e., the second output value) to the CPU.
  • the CPU processes the first output value and the second output value through the fully connected layer to obtain the output value of the neural network to be trained, and determines the loss function of the neural network to be trained based on the output value.
  • the gradient of the fully connected layer is determined according to the loss function, and the gradient of the fully connected layer is sent to GPU0 and GPU1 respectively, so that GPU0 determines the gradient of the first neural network and GPU1 determines the gradient of the second neural network.
  • the training samples input by GPU0 and GPU1 may be the same or different, which is not limited in this application.
  • GPU0 and GPU1 input the same or similar training samples. This solution can improve the training effect of the neural network to be trained.
  • Optimizers are deployed on the CPU and on each training device, and each optimizer is used to calculate the gradient of the neural network deployed on that device: the optimizer on the CPU is used to calculate the gradient of the fully connected layer, optimizer 0 on GPU0 is used to calculate the gradient of each layer of the first neural network, and optimizer 1 on GPU1 is used to calculate the gradient of each layer of the second neural network. In Figure 7, "gradient" represents the gradient of the loss function, gradient 1 represents the gradient of each layer of the second neural network, and gradient 0 represents the gradient of each layer of the first neural network.
  • the training device calculates the gradient of each layer and updates the parameters of each layer according to the gradient of each layer.
  • The CPU calculates the gradient of the fully connected layer, ∂L/∂θ_fc, according to the loss function L, where θ_fc denotes the parameters of the fully connected layer.
  • After GPU0 receives this gradient, it sequentially calculates the gradients of the four layers of the first neural network based on it; after GPU1 receives this gradient, it sequentially calculates the gradients of the four layers of the second neural network based on it.
  • ⁇ 4 ⁇ 0 represent the parameters of each layer of the first neural network
  • ⁇ 4 ′ ⁇ 0 ′ represent the parameters of each layer of the second neural network
  • GPU0 and GPU1 can calculate the gradient of each layer in parallel and update their parameters.
  • After GPU0 completes its parameter update, it processes the training sample through the updated first neural network and sends the output value to the CPU. After GPU1 completes its parameter update, it processes the training sample through the updated second neural network and sends the output value to the CPU. The CPU calculates the loss function again according to the two output values. If the loss function does not meet the requirements, the aforementioned parameter update steps can be repeated to continue training; if the loss function meets the requirements, the training can be stopped.
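  • Below is a minimal NumPy sketch of the width-wise training flow described above: each simulated GPU holds a narrow copy of every layer, the CPU concatenates the two output values, passes them through a fully connected layer, computes the loss, and sends the gradient back so that both GPUs can update their parameters in parallel. The layer sizes, the way the input sample is split between the devices, the tanh activation, and the squared-error loss are illustrative assumptions.

```python
import numpy as np

class SubNetwork:
    """Narrow copy of every layer, held by one simulated GPU
    (same class as in the depth-wise sketch above)."""
    def __init__(self, sizes, rng):
        self.W = [rng.normal(scale=0.1, size=(o, i)) for i, o in zip(sizes, sizes[1:])]
        self.b = [np.zeros((o, 1)) for o in sizes[1:]]

    def forward(self, x):
        self.acts = [x]
        for W, b in zip(self.W, self.b):
            x = np.tanh(W @ x + b)
            self.acts.append(x)
        return x

    def backward(self, grad_out, lr=0.01):
        for k in reversed(range(len(self.W))):
            grad_pre = grad_out * (1.0 - self.acts[k + 1] ** 2)
            dW, db = grad_pre @ self.acts[k].T, grad_pre
            grad_out = self.W[k].T @ grad_pre
            self.W[k] -= lr * dW
            self.b[k] -= lr * db
        return grad_out

rng = np.random.default_rng(1)
gpu0 = SubNetwork([4, 8, 8, 6], rng)    # half the width of every layer (full network: 8-16-16-12)
gpu1 = SubNetwork([4, 8, 8, 6], rng)

W_fc = rng.normal(scale=0.1, size=(12, 12))   # fully connected layer on the CPU: 6 + 6 -> 12
b_fc = np.zeros((12, 1))

x0 = rng.normal(size=(4, 1))            # each simulated GPU receives its share of the input
x1 = rng.normal(size=(4, 1))
target = rng.normal(size=(12, 1))
lr = 0.01

for step in range(100):
    out0 = gpu0.forward(x0)             # first output value
    out1 = gpu1.forward(x1)             # second output value
    z = np.vstack([out0, out1])         # CPU concatenates the two output values
    y = W_fc @ z + b_fc                 # fully connected layer (linear, for simplicity)
    loss_grad = y - target              # gradient of 0.5 * ||y - target||^2

    dW_fc, db_fc = loss_grad @ z.T, loss_grad   # CPU: gradient of the fully connected layer
    dz = W_fc.T @ loss_grad                     # gradient sent back to the GPUs
    W_fc -= lr * dW_fc
    b_fc -= lr * db_fc

    gpu0.backward(dz[:6])               # the GPUs update their sub-networks in parallel
    gpu1.backward(dz[6:])
```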
  • the CPU or GPU0 can also determine whether the first neural network has completed the training according to whether the training parameters meet the termination condition.
  • The above training parameters include at least one of the number of training rounds, the training time, and the bit error rate. When the number of training rounds is greater than or equal to the round threshold, the CPU or GPU0 determines that the first neural network has completed training; when the training time is greater than or equal to the time threshold, it determines that the first neural network has completed training; when the bit error rate is less than or equal to the bit error rate threshold, it determines that the first neural network has completed training.
  • The CPU or GPU0 can stop training the first neural network when one parameter among the loss function, the number of training rounds, the training time, and the bit error rate meets the termination condition, or it can stop training the first neural network when multiple of these parameters meet the termination conditions.
  • The CPU or GPU1 can stop training the second neural network when one parameter among the loss function, the number of training rounds, the training time, and the bit error rate meets the termination condition, or it can stop training the second neural network when multiple of these parameters meet the termination conditions.
  • Whether the neural network has completed training is determined by whether different training parameters meet the termination conditions, so the neural network can be trained flexibly according to the actual situation. For example, when the CPU or training device is heavily loaded, the CPU can set fewer training rounds, a shorter training time, or a larger bit error rate; when the CPU or training device is lightly loaded, the CPU can set more training rounds, a longer training time, or a smaller bit error rate. This improves the flexibility of training the neural network.
  • Although dividing by width requires changing the architecture of the neural network to be trained (i.e., adding a fully connected layer), each sub-network can update its parameters in parallel, so training by width can improve the training efficiency of the neural network.
  • the first training device and the second training device can perform the following steps respectively.
  • S550: The first training device sends the trained first neural network to the control device.
  • S560: The second training device sends the trained second neural network to the control device.
  • the first training device sending the trained first neural network to the control device can be interpreted as: the first training device sends the updated parameters of the first neural network and the information indicating the connection relationship of these parameters to the control device.
  • The second training device sending the trained second neural network to the control device can be interpreted as: the second training device sends, to the control device, the updated parameters of the second neural network and information indicating the connection relationship of these parameters.
  • the two neural networks can be merged to obtain the trained neural network.
  • Since each training device stores only part of the parameters of the neural network to be trained, training devices with small storage space can also complete the training of a large-scale neural network using the above method. The above method is especially suitable for terminal equipment with limited storage capacity.
  • Table 1 shows the time required for training using the two division methods described above.
  • the neural network to be trained is a fully connected neural network with a depth of 10 and a width of 1024.
  • the device for training a neural network includes hardware structures and/or software modules corresponding to each function.
  • This application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer-software-driven hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
  • an apparatus for training a neural network may include a processing unit for executing the determined action in the above method example, a receiving unit for implementing the receiving action in the above method example, and a sending unit for implementing the sending action in the above method example.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit. It should be noted that the division of units in this application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • Fig. 8 shows a schematic structural diagram of a device for training a neural network provided by the present application.
  • the device 800 for training a neural network can be used to implement the method described in the above method embodiment.
  • The apparatus 800 for training a neural network may be a chip, a network device, or a terminal device.
  • the apparatus 800 for training a neural network includes one or more processors 801, and the one or more processors 801 can support the apparatus 800 for training a neural network to implement the method in the method embodiment corresponding to FIG. 5.
  • the processor 801 may be a general-purpose processor or a special-purpose processor.
  • the processor 801 may be a CPU.
  • the CPU can be used to control the training device (for example, GPU), execute software programs, and process data of the software programs.
  • the device 800 for training a neural network may further include a communication interface 805 for realizing signal input (reception) and output (send).
  • the communication interface 805 may be an input and/or output circuit of the chip, and the chip may be used as a terminal device or a network device or a component of other wireless communication devices.
  • the apparatus 800 for training a neural network may include one or more memories 802, on which a program 804 is stored.
  • The program 804 can be run by the processor 801 to generate instructions 803, so that the processor 801 executes, according to the instructions 803, the method described in the above method embodiments.
  • the memory 802 may also store data.
  • The processor 801 may also read data stored in the memory 802 (for example, the neural network to be trained). The data may be stored at the same storage address as the program 804, or at a different storage address from the program 804.
  • the processor 801 and the memory 802 may be provided separately or integrated together, for example, integrated on a single board or SoC.
  • the processor 801 is used to control the communication interface 805 to execute:
  • the first neural network is a sub-network of the neural network to be trained, and the first training device is used to train the first neural network;
  • the second neural network is a sub-network of the neural network to be trained, and the second training device is used to train the second neural network;
  • the target training device is a training device that includes the output layer of the neural network to be trained in the training device set, and the training device set includes a first training device and a second training device;
  • the processor 801 is configured to execute: determine the loss function of the neural network to be trained according to the output value of the neural network to be trained;
  • the processor 801 is also configured to control the communication interface 805 to execute: send the loss function or the gradient corresponding to the loss function to the target training device.
  • the first neural network and the second neural network belong to different layers of the neural network to be trained
  • the second neural network includes the output layer of the neural network to be trained
  • the processor 801 is further configured to control the communication interface 805 to execute:
  • The first neural network and the second neural network belong to the same layer of the neural network to be trained, the first neural network and the second neural network include the output layer of the neural network to be trained, and the processor 801 is further configured to control the communication interface 805 to execute:
  • the processor 801 is further configured to execute: processing the first output value and the second output value through the fully connected layer to obtain the loss function of the neural network to be trained;
  • the processor 801 is further configured to control the communication interface 805 to execute: send the gradient corresponding to the loss function to the first training device and the second training device.
  • the processor 801 is used to control the communication interface 805 to execute:
  • the first neural network is a sub-network of the neural network to be trained, and the first neural network does not include the output layer of the neural network to be trained;
  • the processor 801 is used for executing: training the first neural network
  • the processor 801 is further configured to control the communication interface 805 to execute: send the trained first neural network to the control device.
  • the processor 801 is further configured to control the communication interface 805 to execute:
  • the first gradient is the gradient of the input layer of the second neural network in the second training device, the second neural network is another sub-network of the neural network to be trained, and the first gradient is The gradient determined based on the loss function;
  • the processor 801 is further configured to execute: training the first neural network according to the first gradient.
  • the processor 801 is further configured to execute: determining whether the first neural network completes the training according to whether the training parameters meet the termination condition.
  • the processor 801 is used to control the communication interface 805 to execute:
  • the second neural network is a sub-network of the neural network to be trained, and the second neural network includes the output layer of the neural network to be trained;
  • the processor 801 is used for executing: training a second neural network
  • the processor 801 is further configured to control the communication interface 805 to execute: send the trained second neural network to the control device.
  • the processor 801 is further configured to control the communication interface 805 to execute:
  • the processor 801 is further configured to perform: determining the gradient of the input layer of the second neural network according to the loss function or the gradient corresponding to the loss function.
  • the processor 801 is further configured to control the communication interface 805 to execute:
  • The gradient of the input layer of the second neural network is sent to the first training device, and the gradient is used for training the first neural network in the first training device, where the input layer of the second neural network is connected to the output layer of the first neural network, and the first neural network is another sub-network of the neural network to be trained.
  • The processor 801 is further configured to execute: determining whether the second neural network has completed training according to whether the training parameters meet the termination condition.
  • each step of the method embodiment may be completed by a logic circuit in the form of hardware or instructions in the form of software in the processor 801.
  • the processor 801 may also be a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, For example, discrete gates, transistor logic devices, or discrete hardware components.
  • This application also provides a computer program product, which, when executed by the processor 801, implements the method described in any method embodiment in this application.
  • the computer program product may be stored in the memory 802, for example, a program 804.
  • the program 804 is finally converted into an executable object file that can be executed by the processor 801 through processing processes such as preprocessing, compilation, assembly, and linking.
  • This application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a computer, the method described in any method embodiment in this application is implemented.
  • the computer program can be a high-level language program or an executable target program.
  • the computer-readable storage medium is, for example, the memory 802.
  • the memory 802 may be a volatile memory or a non-volatile memory, or the memory 802 may include both a volatile memory and a non-volatile memory.
  • The non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • Many forms of RAM are available, for example: static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • The wireless communication in this application includes the fifth generation (5G) mobile communication system, wireless fidelity (WiFi), satellite communication, other existing communication methods, and various possible future communication methods, and involves two aspects: terminal equipment and network equipment.
  • FIG. 9 shows a schematic structural diagram of a terminal device provided in this application.
  • the terminal device 900 can implement the function of training a neural network in the foregoing method embodiment.
  • FIG. 9 only shows the main components of the terminal device 900.
  • the terminal device 900 includes a processor, a memory, a control circuit, an antenna, and an input and output device.
  • the processor 901 is mainly used to process the communication protocol and communication data, and to control the terminal device 900.
  • For example, the processor 901 receives, through an antenna and a control circuit, information encoded with a polar code.
  • the processor 901 is also configured to read the neural network to be trained stored in the memory 904, split it into at least two sub-networks, and send them to the processor 902 and the processor 903, respectively.
  • the processor 902 and the processor 903 are used to train the sub-network after the neural network to be trained is split.
  • the processor 901, the processor 902, and the processor 903 may be the devices shown in FIG. 8.
  • The processor 901, the processor 902, and the processor 903 may be referred to as a neural network training system, and the terminal device 900 including the three processors may also be referred to as a neural network training system.
  • the memory 904 is mainly used to store programs and data.
  • the memory 904 stores the neural network to be trained in the foregoing method embodiment.
  • the control circuit is mainly used for the conversion of baseband signals and radio frequency signals and the processing of radio frequency signals.
  • the control circuit and the antenna together can also be called a transceiver, which is mainly used to send and receive radio frequency signals in the form of electromagnetic waves.
  • the input and output device is, for example, a touch screen or a keyboard, and is mainly used to receive data input by the user and output data to the user.
  • the processor 901 can read the program in the memory 904, interpret and execute the instructions contained in the program, and process data in the program.
  • the processor 901 performs baseband processing on the information to be sent, and outputs the baseband signal to the radio frequency circuit.
  • The radio frequency circuit performs radio frequency processing on the baseband signal to obtain a radio frequency signal, and sends the radio frequency signal out through the antenna in the form of an electromagnetic wave.
  • When an electromagnetic wave carrying information (that is, a radio frequency signal) reaches the terminal device 900, the radio frequency circuit receives the radio frequency signal through the antenna, converts the radio frequency signal into a baseband signal, and outputs the baseband signal to the processor, and the processor 901 converts the baseband signal into information and processes the information.
  • FIG. 9 only shows one memory and three processors. In actual terminal devices, there may be more memories and processors.
  • the memory may also be called a storage medium or a storage device, etc., which is not limited in this application.
  • the processor 901 in FIG. 9 may integrate the functions of a baseband processor and a CPU.
  • The baseband processor and the CPU may also be independent processors interconnected through technologies such as a bus.
  • the terminal device 900 may include multiple baseband processors to adapt to different network standards, the terminal device 900 may include multiple CPUs to enhance its processing capabilities, and the various components of the terminal device 900 may be connected through various buses.
  • the baseband processor may also be referred to as a baseband processing circuit or a baseband processing chip.
  • the function of processing the communication protocol and the communication data may be built in the processor, or may be stored in the memory 904 in the form of a program, and the processor 901 executes the program in the memory 904 to realize the baseband processing function.
  • FIG. 10 is a schematic structural diagram of a network device provided in this application, and the network device may be, for example, a base station.
  • the base station can realize the function of training the neural network in the above method embodiment.
  • the base station 1000 may include one or more radio frequency units, such as a remote radio unit (RRU) 1001 and at least one baseband unit (BBU) 1002.
  • the BBU 1002 may include a distributed unit (DU), or may include a DU and a centralized unit (CU).
  • the RRU 1001 may be referred to as a transceiver unit, a transceiver, a transceiver circuit or a transceiver, and it may include at least one antenna 10011 and a radio frequency unit 10012.
  • The RRU 1001 is mainly used for sending and receiving radio frequency signals and for conversion between radio frequency signals and baseband signals, for example, to support the base station in implementing the sending and receiving functions.
  • The BBU 1002 is mainly used for baseband processing and for controlling the base station.
  • The RRU 1001 and the BBU 1002 may be physically disposed together, or may be physically separated, that is, in a distributed base station.
  • the BBU1002 can also be called a processing unit, which is mainly used to complete baseband processing functions, such as channel coding, multiplexing, modulation, and spreading.
  • the BBU 1002 may be used to control the base station to execute the operation procedure in the foregoing method embodiment.
  • the BBU 1002 can be composed of one or more single boards, and multiple single boards can jointly support a wireless access network of a single access standard, or can respectively support wireless access networks of different access standards.
  • the BBU 1002 also includes a processor 10021 and a memory 10024, and the memory 10024 is used to store necessary instructions and data.
  • The memory 10024 stores the neural network to be trained in the foregoing method embodiment.
  • the processor 10021 is used to control the base station to perform necessary actions.
  • For example, the processor receives, through an antenna and a control circuit, information encoded with a polar code.
  • the processor 10021 is further configured to read the neural network to be trained stored in the memory 10024, split it into at least two sub-networks, and send them to the processor 10022 and the processor 10023 respectively.
  • the processor 10022 and the processor 10023 are used to train the sub-network after the neural network to be trained is split.
  • The processor 10021, the processor 10022, and the processor 10023 may each be the device shown in FIG. 8.
  • In addition, the processor 10021, the processor 10022, and the processor 10023 may be referred to as a neural network training system, and the network device 1000 including the three processors may also be referred to as a neural network training system.
  • the processor 10021 and the memory 10024 may serve one or more single boards.
  • That is, a memory and a processor may be disposed separately on each board, or multiple boards may share the same memory and processor.
  • necessary circuits can be provided on each board.
  • FIG. 10 only shows one memory and three processors. In actual network devices, there may be more memories and processors.
  • the memory may also be called a storage medium or a storage device, etc., which is not limited in this application.
  • the base station shown in FIG. 10 is only an example, and the network device suitable for this application may also be an active antenna unit (AAU) in an active antenna system (AAS).
  • the disclosed system, device, and method may be implemented in other ways. For example, some features of the method embodiments described above may be ignored or not implemented.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods, and multiple units or components may be combined or integrated into another system.
  • the coupling between the units or the coupling between the components may be direct coupling or indirect coupling, and the foregoing coupling includes electrical, mechanical, or other forms of connection.
  • The size of the sequence numbers of the processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of this application.
  • The terms "system" and "network" are often used interchangeably in this document.
  • The term "and/or" in this document describes only an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists.
  • the character “/” in this text generally indicates that the associated objects before and after are in an "or” relationship.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

本申请提供了一种训练神经网络的方法和装置。该方法包括:控制装置将待训练的神经网络拆分为多个子网络,并将该多个子网络部署在多个训练装置上进行训练。由于规模较大的待训练的神经网络被拆分为多个规模较小的子网络,因此,存储空间较小的训练装置也能够存储至少一个子网络,从而可以利用多个存储空间较小的训练装置训练规模较大的神经网络。上述方法尤其适用于存储能力有限的终端设备。

Description

训练神经网络的方法和装置
本申请要求于2019年04月03日提交中国专利局、申请号为201910267854.9、申请名称为“训练神经网络的方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能(artificial intelligence,AI)领域,尤其涉及一种训练神经网络的方法和装置。
背景技术
AI技术将对未来移动通信网络技术的演进产生重要的推动作用。神经网络(neural network,NN)是AI技术的基础,其在网络层(如网络优化,移动性管理,资源分配等)和物理层(如信道编译码,信道预测、接收机等)等方面有广泛的应用前景。
在物理层的应用中,例如在极化(polar)码的应用中,将会涉及到维度非常大的输入数据或输出数据。通常情况下,在神经网络的隐藏层的数量大于输入数据或输出数据的维度的情况下,神经网络才能学习到输入数据或输出数据的特性,所以应用于物理层的神经网络往往具有比较多的参数,这对训练加速硬件的存储能力要求较高,如何利用存储空间较小的训练加速硬件训练较大规模的神经网络是当前亟需解决的问题。
发明内容
本申请提供了一种训练神经网络的方法和装置,通过将待训练的神经网络拆分为多个子网络,并将该多个子网络部署在多个训练装置上,从而能够利用存储空间较小的训练装置训练较大规模的神经网络。
第一方面,提供了一种训练神经网络的方法,包括:获取待训练的神经网络;向第一训练装置发送第一神经网络,第一神经网络为待训练的神经网络的子网络,第一训练装置用于训练第一神经网络;向第二训练装置发送第二神经网络,第二神经网络为待训练的神经网络的子网络,第二训练装置用于训练第二神经网络;从目标训练装置接收待训练的神经网络的输出值,目标训练装置为训练装置集合中包含待训练的神经网络的输出层的训练装置,训练装置集合包括第一训练装置和第二训练装置;根据待训练的神经网络的输出值确定待训练的神经网络的损失函数;向目标训练装置发送损失函数或者损失函数对应的梯度。
控制装置将待训练的神经网络拆分为多个子网络,每个子网络包含的参数较少,因此,存储空间较小的训练装置也能够存储至少一个子网络,从而可以利用多个存储空间较小的训练装置训练规模较大的神经网络(即,包含较多参数的神经网络),上述方法尤其适用于存储能力有限的终端设备。
可选地,第一神经网络与第二神经网络属于待训练的神经网络的不同层。
上述方案即按照深度划分待训练的神经网络,由于不同的子网络包含一个或多个完整的神经网络层,将多个子网络进行串联处理即可,无需改变待训练的神经网络的架构,因此,该方案具有简单易实施的特点,能够减小控制装置在划分待训练的神经网络时的负载。
可选地,第二神经网络包括待训练的神经网络的输出层,从目标训练装置接收待训练的神经网络的输出值,包括:从第二训练装置接收待训练的神经网络的输出值;向目标训练装置发送损失函数或者损失函数对应的梯度,包括:向第二训练装置发送损失函数。
由于第二神经网络包含待训练的神经网络的输出层,因此,第二神经网络为与控制装置直接连接的子网络。控制装置从第二神经网络接收待训练的神经网络的输出值,并根据该输出值计算出待训练的神经网络的损失函数,充分利用了控制装置能够进行复杂运算(计算损失函数)的特点,使得训练装置能够专注于大量的简易运算(神经网络的训练过程)。
可选地,第一神经网络与第二神经网络属于待训练的神经网络的相同层。
上述方案即按照宽度划分待训练的神经网络,由于待训练的神经网络的一个层被拆分为多个子网络,因此,需要将多个子网络进行并联处理,并且,需要通过全连接层将多个并联的子网络的输出值进行合并处理。相比于按照深度划分待训练的神经网络的方案,按照宽度划分待训练的神经网络虽然需要增加一个全连接层,但是,由于各个子网络能够并行更新参数,因此,按照宽度划分进行训练能够提高神经网络的训练效率。
可选地,第一神经网络和第二神经网络包括待训练的神经网络的输出层,从目标训练装置接收待训练的神经网络的输出值,包括:从第一训练装置接收第一输出值,第一输出值为第一神经网络的输出值;从第二训练装置接收第二输出值,第二输出值为第二神经网络的输出值;根据待训练的神经网络的输出值确定待训练的神经网络的损失函数,包括:通过全连接层处理第一输出值和第二输出值,得到待训练的神经网络的损失函数;向目标训练装置发送损失函数或者损失函数对应的梯度,包括:向第一训练装置和第二训练装置发送损失函数对应的梯度。
由于第一神经网络和第二神经网络包含待训练的神经网络的输出层,因此,第一神经网络和第二神经网络为与控制装置直接连接的子网络。控制装置从第二神经网络接收待训练的神经网络的输出值,并根据该输出值计算出待训练的神经网络的损失函数,充分利用了控制装置能够进行复杂运算(计算损失函数)的特点,使得训练装置能够专注于大量的简易运算(神经网络的训练过程)。
第二方面,本申请还提供了一种训练神经网络的方法,该方法应用于第一训练装置,包括:从控制装置接收第一神经网络,第一神经网络为待训练的神经网络的一个子网络,且第一神经网络不包含待训练的神经网络的输出层;训练第一神经网络;向控制装置发送训练完成的第一神经网络。
待训练的神经网络被拆分为多个子网络,每个子网络包含的参数较少,因此,存储空间较小的训练装置也能够存储至少一个子网络,从而可以利用多个存储空间较小的训练装置训练规模较大的神经网络(即,包含较多参数的神经网络),上述方法尤其适用于存储能力有限的终端设备。
可选地,训练第一神经网络,包括:向第二训练装置发送第一神经网络的输出值,第 一神经网络的输出值用于确定待训练的神经网络的损失函数;从第二训练装置接收第一梯度,第一梯度为第二训练装置中的第二神经网络的输入层的梯度,第二神经网络为待训练的神经网络的另一个子网络,第一梯度为基于损失函数确定的梯度;根据第一梯度训练第一神经网络。
由于多个子网络之间为串联关系,第一训练装置基于接收到的梯度直接进行反向传播计算即可,无需对梯度进行额外的处理,因此,该方案具有简单易实施的特点。
可选地,所述方法还包括:根据训练参数是否满足终止条件确定第一神经网络是否完成训练。
可选地,所述训练参数包括训练轮数、训练时间和误码率中的至少一种,根据训练参数是否满足终止条件确定第一神经网络是否完成训练,包括:
当待训练的神经网络的损失函数的值小于或等于损失函数阈值时,确定第二神经网络完成训练;和/或,
当训练轮数大于或等于轮数阈值时,确定第一神经网络完成训练;和/或,
当训练时间大于或等于时间阈值时,确定第一神经网络完成训练;和/或,
当误码率小于或等于误码率阈值时,确定第一神经网络完成训练。
通过不同的训练参数是否满足终止条件确定神经网络是否完成训练,能够根据实际情况灵活训练神经网络。例如,当训练装置负担较重时,控制装置可以设定较少的训练轮数或者较短的训练时间或者较大的误码率;当训练装置负担较轻时,控制装置可以设定较多的训练轮数或者较长的训练时间或者较小的误码率。从而提高了训练神经网络的灵活性。
第三方面,本申请还提供了一种训练神经网络的方法,该方法应用于第二训练装置,包括:从控制装置接收第二神经网络,第二神经网络为待训练的神经网络的一个子网络,且第二神经网络包含待训练的神经网络的输出层;训练第二神经网络;向控制装置发送训练完成的第二神经网络。
待训练的神经网络被拆分为多个子网络,每个子网络包含的参数较少,因此,存储空间较小的训练装置也能够存储至少一个子网络,从而可以利用多个存储空间较小的训练装置训练规模较大的神经网络(即,包含较多参数的神经网络),上述方法尤其适用于存储能力有限的终端设备。
可选地,所述方法还包括:向控制装置发送第二神经网络的输出值,第二神经网络的输出值用于确定待训练的神经网络的损失函数;从控制装置接收损失函数或者损失函数对应的梯度;根据损失函数或者损失函数对应的梯度确定第二神经网络的输入层的梯度。
可选地,所述方法还包括:向第一训练装置发送所述第二神经网络的输入层的梯度,该梯度用于第一训练装置中的第一神经网络的训练,其中,第二神经网络的输入层与第一神经网络的输出层相连,第一神经网络为待训练的神经网络的另一个子网络。
若第二训练装置与第一训练装置为串联关系,则第二训练装置还需要向第一训练装置发送第二神经网络的输入层的梯度,以便于第一训练装置利用该第二神经网络的输入层的梯度计算第一神经网络的各层的梯度,并更新第一神经网络的参数。
可选地,训练所述第二神经网络,包括:根据训练参数是否满足终止条件确定第二神经网络是否完成训练。
可选地,所述训练参数包括训练轮数、训练时间、所述待训练的神经网络的损失函数 和误码率中的至少一种,
根据训练参数是否满足终止条件确定第二神经网络是否完成训练,包括:
当训练轮数大于或等于轮数阈值时,确定第二神经网络完成训练;和/或,
当训练时间大于或等于时间阈值时,确定第二神经网络完成训练;和/或,
当损失函数的值小于或等于损失函数阈值时,确定第二神经网络完成训练;和/或,
当误码率小于或等于误码率阈值时,确定第二神经网络完成训练。
通过不同的训练参数是否满足终止条件确定神经网络是否完成训练,能够根据实际情况灵活训练神经网络。例如,当训练装置负担较重时,控制装置可以设定较少的训练轮数或者较短的训练时间或者较大的误码率;当训练装置负担较轻时,控制装置可以设定较多的训练轮数或者较长的训练时间或者较小的误码率。从而提高了训练神经网络的灵活性。
第四方面,本申请提供了一种控制装置,该装置可以实现上述第一方面所涉及的方法所对应的功能,所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元或模块。
在一种可能的设计中,该装置包括处理器,该处理器被配置为支持该装置执行上述第一方面所涉及的方法。该装置还可以包括存储器,该存储器用于与处理器耦合,其保存有程序和数据。可选地,该装置还包括通信接口,该通信接口用于支持该装置与神经网络训练装置之间的通信。其中,所述通信接口可以包括集成收发功能的电路。
第五方面,本申请提供了一种训练装置,该装置可以实现上述第二方面或第三方面所涉及的方法所对应的功能,所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元或模块。
在一种可能的设计中,该装置包括处理器,该处理器被配置为支持该装置执行上述第二方面或第三方面所涉及的方法。该装置还可以包括存储器,该存储器用于与处理器耦合,其保存有程序和数据。可选地,该装置还包括通信接口,该通信接口用于支持该装置与控制装置和/或其它神经网络训练装置之间的通信。其中,所述通信接口可以包括集成收发功能的电路。
第六方面,本申请提供了一种神经网络训练系统,包括至少一个第四方面所述的控制装置和至少两个第五方面所述的训练装置。
第七方面,本申请提供了一种计算机可读存储介质,该计算机可读存储介质中存储了计算机程序,该计算机程序被处理器执行时,使得处理器执行第一方面所述的方法。
第八方面,本申请提供了一种计算机可读存储介质,该计算机可读存储介质中存储了计算机程序,该计算机程序被处理器执行时,使得处理器执行第二方面或第三方面所述的方法。
第九方面,本申请提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,当该计算机程序代码被处理器运行时,使得处理器执行第一方面所述的方法。
第十方面,本申请提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,当该计算机程序代码被处理器运行时,使得处理器执行第二方面或第三方面所述的方法。
第十一方面,本申请提供了一种芯片,该芯片包括:处理器和通信接口。该处理器例如是核(core),该核可以包括至少一种执行单元(execution unit),该执行单元例如是 算术逻辑单元(arithmetic and logic unit,ALU);该通信接口可以是输入/输出接口、管脚或电路等;该处理器执行存储器中存储的程序代码,以使该芯片执行第一方面所述的方法。该存储器可以是位于该芯片内部的存储单元(例如,寄存器、缓存等),也可以是位于该芯片外部的存储单元(例如,只读存储器、随机存取存储器等)。
第十二方面,本申请还提供了一种芯片,该芯片包括:处理器和通信接口。该处理器例如是流式多处理器(streaming multiprocessor),该流式多处理器可以包括至少一种执行单元(execution unit),该执行单元例如是统一计算设备架构(compute unified device architecture,CUDA);该通信接口可以是输入/输出接口、管脚或电路等;该处理器执行存储器中存储的程序代码,以使该芯片执行第二方面或第三方面所述的方法。该存储器可以是位于该芯片内部的存储单元(例如,寄存器、缓存等),也可以是位于该芯片外部的存储单元(例如,只读存储器、随机存取存储器等)。
附图说明
图1是一种适用于本申请的全连接神经网络的示意图;
图2是一种基于损失函数更新神经网络参数的方法的示意图;
图3是计算损失函数的梯度的方法的示意图;
图4是本申请提供的一种神经网络训练系统的示意图;
图5是本申请提供的一种训练神经网络的方法的示意图;
图6是本申请提供的一种基于深度划分的神经网络的训练方法的示意图;
图7是本申请提供的一种基于宽度划分的神经网络的训练方法的示意图;
图8是本申请提供的一种训练神经网络的装置的示意图;
图9是本申请提供的另一种训练神经网络的装置的示意图;
图10是本申请提供的再一种训练神经网络的装置的示意图。
具体实施方式
为了便于理解本申请的技术方案,首先对本申请所涉及的概念做简要介绍。
神经网络也可以称为人工神经网络(artificial neural network,ANN),隐藏层数量较多的神经网络称为深度神经网络。神经网络中的每一层的工作可以用数学表达式
y = a(w·x + b)
来描述。从物理层面看,神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中,操作1、2、3的由
w·x
完成,操作4由+b完成,操作5则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,w是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该w决定着上文所述的输入空间到输出空间的空间变换,即每一层的w控制着如何变换空间。训练神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的w形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
因为希望神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较神经网 络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量。在第一次更新之前通常会有初始化的过程,即为神经网络中的各个层预先配置参数。在训练的过程中,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么神经网络的训练就变成了尽可能缩小这个输出值的过程。
损失函数通常是多变量函数,而梯度可以反映变量发生变化时损失函数的输出值的变化速率,梯度的绝对值越大,损失函数的输出值的变化率越大,可以计算更新不同参数时损失函数的梯度,沿着梯度下降最快的方向不断更新参数,尽快缩小损失函数的输出值。
下面以全连接神经网络为例,对本申请中的训练方法进行简要介绍。
全连接神经网络又叫多层感知器(multilayer perceptron,MLP)。如图1所示,一个MLP包含一个输入层(左侧),一个输出层(右侧),及多个隐藏层(中间),每层包含数个节点,称为神经元。其中相邻两层的神经元间两两相连。
考虑相邻两层的神经元,下一层的神经元的输出h为所有与之相连的上一层神经元x的加权和经过激活函数(即,上文所述的“a”)处理后的值。用矩阵可以表示为
h=f(wx+b)
其中w为权重向量,b为偏置向量,f为激活函数。则MLP的输出可以递归表达为
y = f_n(w_n·f_{n-1}(...) + b_n)
可以将MLP理解为一个从输入数据集合到输出数据集合的映射关系。而通常MLP都是随机初始化的,用已有数据从随机的w和b得到这个映射关系的过程被称为MLP的训练。
可以采用损失函数对MLP的输出结果进行评价，并通过反向传播，利用梯度下降的方法迭代优化w和b，直到损失函数达到最小值。
可以通过前向传播(forward propagation)计算获取MLP的损失函数。即,将前一层的输出结果输入后一层,直至得到MLP的输出层的输出结果,将该结果与目标值进行比较,获得MLP的损失函数。在得到前向传播计算的损失函数后,基于损失函数进行反向传播(back propagation)计算,以求得各层的梯度,沿着梯度下降最快的方向调整w和b,直到损失函数达到最小值。
梯度下降的过程可以表示为:
θ ← θ − η·(∂L/∂θ)
其中,θ为待优化参数(如w和b),L为损失函数,η为学习率,用于控制梯度下降的步长,步长如图2中的箭头所示。
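As a minimal sketch of the gradient-descent update θ ← θ − η·(∂L/∂θ) described above, the following Python snippet iterates the update on a one-parameter quadratic loss; the loss function, the learning rate, and the starting point are illustrative assumptions, not part of the embodiment.

```python
# One-dimensional illustration: L(theta) = (theta - 1)^2, so dL/dtheta = 2*(theta - 1).
def gradient_step(theta, grad, lr=0.1):
    """One update of the rule theta <- theta - lr * dL/dtheta."""
    return theta - lr * grad

theta = 3.0                        # parameter to be optimized
for _ in range(20):
    grad = 2.0 * (theta - 1.0)     # gradient of the loss at the current parameter
    theta = gradient_step(theta, grad)
print(theta)                       # approaches the minimizer theta = 1
```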
可以使用求偏导的链式法则进行反向传播计算,即,前一层参数的梯度可以由后一层参数的梯度递推计算得到,如图3所示,链式法则可以表达为:
∂L/∂w_ij = (∂L/∂s_i)·(∂s_i/∂w_ij)
其中w ij为节点j连接节点i的权重,s i为节点i上输入的加权和。
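The chain-rule recursion can be made concrete with a toy two-layer network; the sketch below computes the gradients of the later layer first and then derives the gradient of the earlier layer from it. All numeric values and the choice of ReLU are assumptions made only for illustration.

```python
# A toy two-layer network y = w2 * relu(w1 * x); values are illustrative assumptions.
x, target = 1.5, 2.0
w1, w2 = 0.8, -0.3

s1 = w1 * x                      # weighted sum of the hidden node
h = max(s1, 0.0)                 # activation (ReLU, purely for illustration)
y = w2 * h                       # network output
L = 0.5 * (y - target) ** 2      # squared-error loss

# Back propagation: the gradient of an earlier layer is derived from the later one.
dL_dy = y - target
dL_dw2 = dL_dy * h               # chain rule: dL/dw2 = dL/dy * dy/dw2
dL_dh = dL_dy * w2
dL_ds1 = dL_dh * (1.0 if s1 > 0 else 0.0)
dL_dw1 = dL_ds1 * x              # chain rule: dL/dw1 = dL/ds1 * ds1/dw1
print(dL_dw2, dL_dw1)
```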
由于神经网络的训练是一个计算量相对较大但计算类型相对简单的计算过程,故一般会采用图形处理器(graphics processing unit,GPU)等硬件来加速训练的过程。但是由于GPU显存有限,对于一个较大的神经网络,可能需要多个GPU才能部署整个神经网络。
图4是一种适用于本申请的训练系统的示意图。
该训练系统包括一个控制装置和至少两个训练装置,控制装置与每个训练装置之间能够互相通信,可选地,不同训练装置之间也可以相互通信。
该控制装置例如是中央处理器(central processing unit,CPU),上述训练装置例如是GPU,训练装置还可以是张量处理器(tensor processing unit,TPU)或CPU或其它类型的计算单元。本申请对控制装置和训练装置的具体类型不作限定。
此外,控制装置和至少两个训练装置可以集成在一个芯片上,例如,集成在系统级芯片(system on chip,SoC)上。控制装置和至少两个训练装置也可以集成在不同的芯片上。
图5示出了本申请提供的一种训练神经网络的方法。该方法500可以应用于图4所示的训练系统，其中，控制装置获取待训练的神经网络后，执行以下步骤。
S510,向第一训练装置发送第一神经网络,第一神经网络为待训练的神经网络的子网络,第一训练装置用于训练第一神经网络。
S520,向第二训练装置发送第二神经网络,第二神经网络为待训练的神经网络的子网络,且第二神经网络与第一神经网络相异,第二训练装置用于训练第二神经网络。
控制装置可以按深度将待训练的神经网络划分为第一神经网络和第二神经网络,也可以按宽度将待训练的神经网络划分为第一神经网络和第二神经网络,第一神经网络与第二神经网络可以相同(即,包含相同的参数),也可以不同(即,包含不同的参数),本申请对第一神经网络和第二神经网络的具体形式不作限定。应理解,即使第一神经网络和第二神经网络包含相同的参数,由于该两个神经网络为待训练的神经网络的两个子网络,即,该两个神经网络属于待训练的神经网络的不同部分,因此,该两个神经网络仍然属于两个不同的神经网络。下文将详细描述这两种划分方法以及对应的训练方法。此外,将待训练的神经网络划分为两个子网络仅是举例说明,还可以将待训练的神经网络划分为更多个子网络。
由图1可知,神经网络由多个参数组成,因此,控制装置向第一训练装置发送第一神经网络可以被解释为:控制装置向第一训练装置发送组成第一神经网络的参数以及指示这些参数的连接关系的信息。类似地,控制装置向第二训练装置发送第二神经网络可以被解释为:控制装置向第二训练装置发送组成第二神经网络的参数以及指示这些参数的连接关系的信息。
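One possible reading of "sending a sub-network", sketched below with PyTorch: the network to be trained is split by depth, and each sub-network is represented by its parameters (a state_dict) together with the module definition that encodes their connection relationship. The layer widths follow the 8-16-16-12 example used later in this description; the PyTorch realization and the variable names are assumptions.

```python
import torch.nn as nn

# Hypothetical 4-layer fully connected network to be trained (widths 8, 16, 16, 12).
full_net = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 12),
)

# Depth-wise split: the first linear layer (with its activation) forms the first
# sub-network, the remaining layers form the second sub-network.
subnet1 = nn.Sequential(*list(full_net.children())[:2])
subnet2 = nn.Sequential(*list(full_net.children())[2:])

# "Sending a neural network" then amounts to sending its parameters plus the
# information describing how they are connected: the state_dict carries the
# parameters, the module definition carries the connection relationship.
payload_for_first_device = subnet1.state_dict()
payload_for_second_device = subnet2.state_dict()
```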
第一训练装置和第二训练装置分别收到各自的神经网络之后,即可分别执行以下步骤。
S530,训练所述第一神经网络。
S540,训练所述第二神经网络。
下面,将根据待训练的神经网络的划分方法分别描述训练第一神经网络和第二神经网络的方法。
方法一,按深度划分。
如图6所示,待训练的神经网络有4层,将前两层划分为第一神经网络,将后两层划分为第二神经网络。上述划分方式仅是举例说明,还可以将待训练的神经网络划分为其它类型的子网络,每个子网络包括待训练的神经网络的至少一层参数。
CPU即控制装置,GPU0即第一训练装置,GPU1为第二训练装置。GPU0输入训练样本,通过第一神经网络处理该训练样本,并将训练样本的处理结果发送至GPU1。GPU1通过第二神经网络处理GPU0的输出结果,并得到第二神经网络的输出结果,将第二神经网络的输出结果发送至CPU,由CPU根据该输出结果计算出待训练的神经网络的损失函数(L)。
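A hedged sketch of this depth-wise forward pass is given below: the device holding the first sub-network processes the training sample, hands the intermediate activation to the device holding the second sub-network, and the controller computes the loss from the final output. Device names, the mean-squared-error loss, and the mapping of the four layers onto PyTorch modules are assumptions; with a single GPU or only a CPU, both sub-networks simply land on the same device.

```python
import torch
import torch.nn as nn

dev0 = "cuda:0" if torch.cuda.is_available() else "cpu"        # stands in for GPU0
dev1 = "cuda:1" if torch.cuda.device_count() > 1 else dev0     # stands in for GPU1

subnet1 = nn.Sequential(nn.Linear(8, 16), nn.ReLU()).to(dev0)  # first part of the 8-16-16-12 network
subnet2 = nn.Sequential(nn.Linear(16, 16), nn.ReLU(),
                        nn.Linear(16, 12)).to(dev1)            # remaining layers, including the output layer

sample = torch.randn(32, 8, device=dev0)        # a batch of training samples fed to GPU0
target = torch.randn(32, 12, device=dev1)

hidden = subnet1(sample)                        # GPU0 processes the training sample
hidden = hidden.to(dev1)                        # the processing result is sent to GPU1
output = subnet2(hidden)                        # GPU1 produces the output of the whole network
loss = nn.functional.mse_loss(output, target)   # the controller computes the loss from this output
```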
以极化码译码器的训练过程为例,训练样本可以是:对数似然比和码字,或者,对数似然比和真实信息。即,神经网络的输入为对数似然比,神经网络的输出为码字的估计或真实信息的估计。损失函数则为码字的估计和码字之间的差异,或者,损失函数则为信息的估计和真实信息之间的差异。
对于信道预测器的训练过程,训练样本可以是:历史信道数据和未来信道数据。即,神经网络的输入为历史信道数据,神经网络的输出为预测的未来信道数据。损失函数则为预测的未来信道和真实的未来信道之间的差异。
对于资源调度器的训练过程,训练样本可以是:目前系统的状态和最优的调度策略。即,神经网络的输入为系统的状态信息,如:目前可调度的时频资源,需要调度的用户,用户的服务质量(quality of service,QoS)等级;神经网络的输出为预计的调度策略。损失函数为预计调度策略和最优调度策略之间的差异。
上述有关训练样本的描述适用于本申请的所有实施例。此外,由于上述优选训练样本的描述仅是举例说明,由于本申请的方法广泛适用于包括无线通信、车联网、计算机、深度学习、模式识别、云计算等涉及人工智能的领域,训练样本可以根据具体应用进行设计。
每个训练装置上部署有优化器,各个优化器用于计算各个训练装置上部署的神经网络的梯度,其中,优化器1的输入信息为整个待训练的神经网络的损失函数,优化器0的输入信息为优化器1输出的梯度。图6中的梯度1表示第二神经网络的各个层的梯度,梯度0表示第一神经网络的各个层的梯度。
本申请的各个实施例中的优化器可以是软件模块（例如，程序代码），也可以是硬件模块（例如，逻辑电路）。以图6为例，优化器1可以通过公式 g_4 = f(l, θ_4, N) 确定待优化参数 θ_4（第二神经网络的输出层的参数）的梯度 g_4。其中，f为激活函数，l为损失函数，N为第二神经网络的拓扑结构。优化器1还可以通过公式 g_3 = f(g_4, θ_3, N) 确定待优化参数 θ_3（第二神经网络的输入层的参数）的梯度 g_3。
优化器0可以通过公式 g_2 = f(g_3, θ_2, N′) 确定待优化参数 θ_2（第一神经网络的输出层的参数）的梯度 g_2。其中，N′为第一神经网络的拓扑结构。优化器0可以通过公式 g_1 = f(g_2, θ_1, N′) 确定待优化参数 θ_1（第一神经网络的输入层的参数）的梯度 g_1。
在反向传播计算过程中，训练装置计算各层的梯度并根据各层的梯度更新各层的参数。例如，GPU1根据损失函数（L）计算出第四层的梯度 ∂L/∂θ_4，随后根据 ∂L/∂θ_4 计算第三层的梯度 ∂L/∂θ_3，随后将 ∂L/∂θ_3 发送至GPU0。GPU0根据 ∂L/∂θ_3 计算第二层的梯度 ∂L/∂θ_2，随后根据 ∂L/∂θ_2 计算第一层的梯度 ∂L/∂θ_1。
其中，第一层为待训练的神经网络的输入层，第二层和第三层为待训练的神经网络的隐藏层，第四层为待训练的神经网络的输出层。θ_4～θ_1 表示各层的参数，θ_4～θ_1 的更新可以在各个参数对应的GPU完成梯度计算后进行，也可以在所有的GPU完成梯度计算后进行。
GPU0完成参数更新后,通过参数更新后的第一神经网络对训练样本进行处理,并将训练样本的处理结果发送至GPU1。GPU1通过参数更新后的第二神经网络处理GPU0的输出结果,并得到输出结果,将该输出结果发送至CPU。CPU根据该输出结果再次计算损失函数,若损失函数不满足要求,则可以将该损失函数发送至GPU1,重复前述参数更新的步骤,继续进行训练;若损失函数满足要求,则可以停止训练。
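The corresponding backward pass can be sketched as follows, under the same assumptions as the forward sketch above: the device holding the output side computes the gradients of its layers, the gradient at its input layer is sent back across the cut, and the device holding the input side continues the back propagation before both update their parameters. Automatic differentiation stands in for the explicit optimizer formulas g_i = f(·) given earlier; the SGD step and all sizes are assumptions.

```python
import torch
import torch.nn as nn

dev0 = "cuda:0" if torch.cuda.is_available() else "cpu"
dev1 = "cuda:1" if torch.cuda.device_count() > 1 else dev0
subnet1 = nn.Sequential(nn.Linear(8, 16), nn.ReLU()).to(dev0)
subnet2 = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 12)).to(dev1)
sample = torch.randn(32, 8, device=dev0)
target = torch.randn(32, 12, device=dev1)

hidden0 = subnet1(sample)                                  # forward pass on GPU0
hidden1 = hidden0.detach().to(dev1).requires_grad_(True)   # a fresh leaf tensor marks the cut between devices
loss = nn.functional.mse_loss(subnet2(hidden1), target)

loss.backward()                                            # GPU1 side: gradients of its layers
grad_at_cut = hidden1.grad.to(dev0)                        # gradient of subnet2's input layer, sent back to GPU0
hidden0.backward(grad_at_cut)                              # GPU0 side: gradients of its layers

with torch.no_grad():                                      # plain SGD update on both devices
    for p in list(subnet1.parameters()) + list(subnet2.parameters()):
        p -= 0.01 * p.grad
        p.grad = None
```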
此外,CPU或者GPU0还可以根据训练参数是否满足终止条件确定第一神经网络是否完成训练。上述训练参数包括训练轮数、训练时间和误码率中的至少一种,
例如,当训练轮数大于或等于轮数阈值时,CPU或者GPU0确定第一神经网络完成训练;当训练时间大于或等于时间阈值时,确定第一神经网络完成训练;当误码率小于或等于误码率阈值时,确定第一神经网络完成训练。
CPU或者GPU0可以根据损失函数、训练轮数、训练时间和误码率中的一种参数满足终止条件停止训练第一神经网络,CPU或者GPU0也可以在损失函数、训练轮数、训练时间和误码率中的多种参数满足终止条件时停止训练第一神经网络。
类似地,CPU或者GPU1可以根据损失函数、训练轮数、训练时间和误码率中的一种参数满足终止条件停止训练第二神经网络,CPU或者GPU1也可以在损失函数、训练轮数、训练时间和误码率中的多种参数满足终止条件时停止训练第二神经网络。
通过不同的训练参数是否满足终止条件确定神经网络是否完成训练,能够根据实际情况灵活训练神经网络。例如,当CPU或者训练装置负担较重时,CPU可以设定较少的训练轮数或者较短的训练时间或者较大的误码率;当CPU或者训练装置负担较轻时,CPU可以设定较多的训练轮数或者较长的训练时间或者较小的误码率。从而提高了训练神经网络的灵活性。
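A possible form of the termination test is sketched below; all threshold names and values are assumptions, and in practice any subset of the conditions (training rounds, training time, loss, bit error rate) may be configured by the controller.

```python
import time

# Illustrative thresholds; any subset of the conditions may be configured.
def training_finished(rounds, start_time, loss, bit_error_rate,
                      max_rounds=1000, max_seconds=3600.0,
                      loss_threshold=1e-3, ber_threshold=1e-5):
    """Return True once any configured termination condition is met."""
    return (rounds >= max_rounds
            or time.time() - start_time >= max_seconds
            or loss <= loss_threshold
            or bit_error_rate <= ber_threshold)

print(training_finished(rounds=200, start_time=time.time(), loss=5e-4, bit_error_rate=1e-3))  # True: loss condition met
```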
此外,按照深度划分无需改变待训练的神经网络的架构,具有简单易实施的特点,能够减小控制装置在划分待训练的神经网络时的负载。
方法二,按宽度划分。
方法二可以按照如下所示的内容进行划分。若GPU的个数为M，待训练的神经网络共有N层，每一层的宽度为 w_i，i∈[0,N)，即，每一层包含 w_i 个参数。每个GPU可以部署N层神经网络，每一层的宽度为 v_{i,j}，i∈[0,N)，j∈[0,M)，且满足 ∑_{j=0}^{M-1} v_{i,j} = w_i。另外，全连接层的宽度为 w_{N-1}。
如图7所示,待训练的神经网络有4层,该4层的宽度分别为8,16,16和12,即,第一层包含8个参数,第二层包含16个参数,第三层包含16个参数,第四层包含12个参数。其中,第一层为待训练的神经网络的输入层,第二层和第三层为待训练的神经网络的隐藏层,第四层为待训练的神经网络的输出层。
可以将每层参数平均划分为两组参数，第一神经网络包含其中的一组参数，第二神经网络包含另外一组参数，则第一神经网络的宽度和第二神经网络的宽度均为4,8,8和6。上述划分方式仅是举例说明，还可以不按照均分的方式对每层参数进行划分。待训练的神经网络划分完成后，还需要在CPU中部署一个全连接层，全连接层的宽度与第一神经网络的输出层的宽度和第二神经网络的输出层的宽度之和相同，在图7所示的神经网络中，该全连接层的宽度为12。
GPU0和GPU1分别输入训练样本,GPU0通过第一神经网络处理该训练样本,并将第一神经网络的输出值(即,第一输出值)发送至CPU;GPU1通过第二神经网络处理该训练样本,并将第二神经网络的输出值(即,第二输出值)发送至CPU。CPU通过全连接层处理第一输出值和第二输出值,得到待训练的神经网络的输出值,并基于该输出值确定待训练的神经网络的损失函数。随后,根据该损失函数确定全连接层的梯度,并分别向GPU0和GPU1发送全连接层的梯度,以便于GPU0确定第一神经网络的梯度以及GPU1确定第二神经网络的梯度。
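A hedged sketch of this width-wise arrangement is given below. Each device holds half of every layer (widths 4-8-8-6 out of 8-16-16-12), and the controller holds the fully connected merge layer of width 12. Treating the 8 input features as split 4/4 between the two halves, the choice of loss, and the device names are all assumptions made for illustration; the backward hand-off is sketched after the following paragraphs.

```python
import torch
import torch.nn as nn

dev0 = "cuda:0" if torch.cuda.is_available() else "cpu"        # stands in for GPU0
dev1 = "cuda:1" if torch.cuda.device_count() > 1 else dev0     # stands in for GPU1

# Each half carries half of every layer's width: 4-8-8-6 out of 8-16-16-12.
half0 = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                      nn.Linear(8, 8), nn.ReLU(),
                      nn.Linear(8, 6)).to(dev0)
half1 = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                      nn.Linear(8, 8), nn.ReLU(),
                      nn.Linear(8, 6)).to(dev1)
merge_fc = nn.Linear(12, 12)                     # fully connected merge layer on the controller

sample = torch.randn(32, 8)                      # a batch of training samples
target = torch.randn(32, 12)

out0 = half0(sample[:, :4].to(dev0)).cpu()       # first output value, produced on GPU0
out1 = half1(sample[:, 4:].to(dev1)).cpu()       # second output value, produced on GPU1
output = merge_fc(torch.cat([out0, out1], dim=1))  # controller merges the two output values
loss = nn.functional.mse_loss(output, target)    # loss of the neural network to be trained
```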
图7中,GPU0和GPU1输入的训练样本可以相同,也可以不同,本申请对此不作限定。可选地,GPU0和GPU1输入相同或相近的训练样本,该方案能够提高待训练的神经网络的训练效果。
CPU和每个训练装置上均部署有优化器,各个优化器用于计算各个训练装置上部署的神经网络的梯度,其中,CPU上的优化器用于计算全连接层的梯度,GPU0上的优化器0用于计算第一神经网络的各层的梯度,GPU1上的优化器1用于计算第二神经网络的各层的梯度。图7中,梯度表示损失函数的梯度,梯度1表示第二神经网络的各个层的梯度,梯度0表示第一神经网络的各个层的梯度。
在反向传播计算过程中，训练装置计算各层的梯度并根据各层的梯度更新各层的参数。例如，CPU根据损失函数（L）计算出全连接层的梯度 ∂L/∂θ_fc，其中，θ_fc 为全连接层的参数。GPU0收到该梯度之后，根据该梯度依次计算出第一神经网络的四个层的梯度 ∂L/∂θ_4～∂L/∂θ_0；GPU1收到该梯度之后，根据该梯度依次计算出第二神经网络的四个层的梯度 ∂L/∂θ_4′～∂L/∂θ_0′。
θ_4～θ_0 表示第一神经网络各层的参数，θ_4′～θ_0′ 表示第二神经网络各层的参数，GPU0和GPU1可以并行计算各层的梯度并更新各自的参数。
GPU0完成参数更新后,通过参数更新后的第一神经网络对训练样本进行处理,并将输出值发送至CPU。GPU1完成参数更新后,通过参数更新后的第二神经网络对训练样本进行处理,并将输出值发送至CPU。CPU根据该两个输出值再次计算损失函数,若损失函数不满足要求,则可以重复前述参数更新的步骤,继续进行训练;若损失函数满足要求,则可以停止训练。
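Continuing the width-split sketch above (it reuses half0, half1, merge_fc, sample, target, dev0 and dev1 from that snippet), the following lines show the controller computing the gradient of the merge layer and handing each half the gradient of its own output, after which the two devices can update their parameters in parallel; the plain SGD step is an assumption.

```python
import torch
import torch.nn as nn

# Assumes half0, half1, merge_fc, sample, target, dev0 and dev1 from the previous sketch.
out0 = half0(sample[:, :4].to(dev0))
out1 = half1(sample[:, 4:].to(dev1))
cut0 = out0.detach().cpu().requires_grad_(True)   # leaf tensors mark the cut to the controller
cut1 = out1.detach().cpu().requires_grad_(True)

loss = nn.functional.mse_loss(merge_fc(torch.cat([cut0, cut1], dim=1)), target)
loss.backward()                                   # controller: gradients of the merge layer and of each cut

out0.backward(cut0.grad.to(dev0))                 # GPU0 back-propagates its half ...
out1.backward(cut1.grad.to(dev1))                 # ... and GPU1 does the same, in parallel

with torch.no_grad():                             # each device updates its own parameters (plain SGD)
    for p in list(half0.parameters()) + list(half1.parameters()) + list(merge_fc.parameters()):
        p -= 0.01 * p.grad
        p.grad = None
```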
此外,CPU或者GPU0还可以根据训练参数是否满足终止条件确定第一神经网络是否完成训练。上述训练参数包括训练轮数、训练时间和误码率中的至少一种,
例如,当训练轮数大于或等于轮数阈值时,CPU或者GPU0确定第一神经网络完成训练;当训练时间大于或等于时间阈值时,确定第一神经网络完成训练;当误码率小于或等于误码率阈值时,确定第一神经网络完成训练。
CPU或者GPU0可以根据损失函数、训练轮数、训练时间和误码率中的一种参数满足 终止条件停止训练第一神经网络,CPU或者GPU0也可以在损失函数、训练轮数、训练时间和误码率中的多种参数满足终止条件时停止训练第一神经网络。
类似地,CPU或者GPU1可以根据损失函数、训练轮数、训练时间和误码率中的一种参数满足终止条件停止训练第二神经网络,CPU或者GPU1也可以在损失函数、训练轮数、训练时间和误码率中的多种参数满足终止条件时停止训练第二神经网络。
通过不同的训练参数是否满足终止条件确定神经网络是否完成训练,能够根据实际情况灵活训练神经网络。例如,当CPU或者训练装置负担较重时,CPU可以设定较少的训练轮数或者较短的训练时间或者较大的误码率;当CPU或者训练装置负担较轻时,CPU可以设定较多的训练轮数或者较长的训练时间或者较小的误码率。从而提高了训练神经网络的灵活性。
此外,按照宽度划分需要改变待训练的神经网络的架构,即,增加了一个全连接层。由于各个子网络能够并行更新参数,因此,按照宽度划分进行训练能够提高神经网络的训练效率。
第一神经网络和第二神经网络训练完成后,第一训练装置和第二训练装置可以分别执行下述步骤。
S550,第一训练装置向控制装置发送训练完成的第一神经网络。
S560,第二训练装置向控制装置发送训练完成的第二神经网络。
第一训练装置向控制装置发送训练完成的第一神经网络可以被解释为:第一训练装置向控制装置发送第一神经网络更新后的参数以及指示这些参数的连接关系的信息。类似地,第二训练装置向控制装置发送训练完成的第二神经网络可以被解释为:第二训练装置向控制装置发送第二神经网络更新后的参数以及指示这些参数的连接关系的信息。
控制装置获取训练完成的第一神经网络和训练完成的第二神经网络之后,可以将该两个神经网络进行合并处理,得到训练完成的神经网络。
由于每个训练装置存储待训练的神经网络的部分参数,因此,存储空间较小的训练装置在应用上述方法后也能够完成大规模神经网络的训练,上述方法尤其适用于存储能力有限的终端设备。
表1示出了分别采用上文所述的两种划分方法进行训练所需的时间。其中,待训练的神经网络为深度为10,宽度为1024的全连接神经网络。
表1
  显存占用 运行时间/每1000次迭代
对照 单显卡:1136MB 263s
方法一 双显卡,每显卡:624MB 300s
方法二 双显卡,每显卡:624MB 200s
由表1可知,采用方法一和方法二进行训练,对每个显卡的显存需求明显降低,此外,采用方法二进行训练的效率也显著提升。
上文详细介绍了本申请提供的训练神经网络的方法的示例。下面,将详细介绍本申请提供的实现上述方法的装置。可以理解的是,训练神经网络的装置为了实现训练神经网络的方法中的功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申 请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请可以根据上述方法示例对训练神经网络的装置进行功能单元的划分,例如,可以将各个功能划分为各个功能单元,也可以将两个或两个以上的功能集成在一个功能单元中。例如,训练神经网络的装置可包括用于执行上述方法示例中确定动作的处理单元、用于实现上述方法示例中接收动作的接收单元和用于实现上述方法示例中发送动作的发送单元。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。需要说明的是,本申请中对单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
图8示出了本申请提供的一种训练神经网络的装置的结构示意图。训练神经网络的装置800可用于实现上述方法实施例中描述的方法。该通信装置800可以是芯片、网络设备或终端设备。
训练神经网络的装置800包括一个或多个处理器801,该一个或多个处理器801可支持训练神经网络的装置800实现图5所对应方法实施例中的方法。处理器801可以是通用处理器或者专用处理器。例如,处理器801可以是CPU。CPU可以用于对训练装置(例如,GPU)进行控制,执行软件程序,处理软件程序的数据。训练神经网络的装置800还可以包括通信接口805,用以实现信号的输入(接收)和输出(发送)。
例如,若训练神经网络的装置800为芯片,通信接口805可以是该芯片的输入和/或输出电路,该芯片可以作为终端设备或网络设备或其它无线通信设备的组成部分。
训练神经网络的装置800中可以包括一个或多个存储器802,其上存有程序804,程序804可被处理器801运行,生成指令803,使得处理器801根据指令803执行上述方法实施例中描述的方法。可选地,存储器802中还可以存储有数据。可选地,处理器801还可以读取存储器802中存储的数据(例如,待训练的神经网络),该数据可以与程序804存储在相同的存储地址,该数据也可以与程序804存储在不同的存储地址。
处理器801和存储器802可以单独设置,也可以集成在一起,例如,集成在单板或者SoC上。
在一种可能的设计中,处理器801用于控制通信接口805执行:
获取待训练的神经网络;
向第一训练装置发送第一神经网络,第一神经网络为待训练的神经网络的子网络,第一训练装置用于训练第一神经网络;
向第二训练装置发送第二神经网络,第二神经网络为待训练的神经网络的子网络,第二训练装置用于训练第二神经网络;
从目标训练装置接收待训练的神经网络的输出值,目标训练装置为训练装置集合中包含待训练的神经网络的输出层的训练装置,训练装置集合包括第一训练装置和第二训练装置;
处理器801用于执行:根据待训练的神经网络的输出值确定待训练的神经网络的损失函数;
处理器801还用于控制通信接口805执行:向目标训练装置发送损失函数或者损失函数对应的梯度。
可选地,第一神经网络与第二神经网络属于待训练的神经网络的不同层,第二神经网络包括待训练的神经网络的输出层,处理器801还用于控制通信接口805执行:
从第二训练装置接收待训练的神经网络的输出值;
向第二训练装置发送损失函数。
可选地,第一神经网络与第二神经网络属于待训练的神经网络的相同层,第一神经网络和第二神经网络包括待训练的神经网络的输出层,处理器801还用于控制通信接口805执行:
从第一训练装置接收第一输出值,第一输出值为第一神经网络的输出值;
从第二训练装置接收第二输出值,第二输出值为第二神经网络的输出值;
处理器801还用于执行:通过全连接层处理第一输出值和第二输出值,得到待训练的神经网络的损失函数;
处理器801还用于控制通信接口805执行:向第一训练装置和第二训练装置发送损失函数对应的梯度。
在另一种可能的设计中,处理器801用于控制通信接口805执行:
从控制装置接收第一神经网络,第一神经网络为待训练的神经网络的一个子网络,且第一神经网络不包含待训练的神经网络的输出层;
处理器801用于执行:训练第一神经网络;
处理器801还用于控制通信接口805执行:向控制装置发送训练完成的第一神经网络。
可选地,处理器801还用于控制通信接口805执行:
向第二训练装置发送第一神经网络的输出值,第一神经网络的输出值用于确定待训练的神经网络的损失函数;
从第二训练装置接收第一梯度,第一梯度为第二训练装置中的第二神经网络的输入层的梯度,第二神经网络为待训练的神经网络的另一个子网络,第一梯度为基于损失函数确定的梯度;
处理器801还用于执行:根据第一梯度训练第一神经网络。
可选地,处理器801还用于执行:根据训练参数是否满足终止条件确定第一神经网络是否完成训练。
在另一种可能的设计中,处理器801用于控制通信接口805执行:
从控制装置接收第二神经网络,第二神经网络为待训练的神经网络的一个子网络,且第二神经网络包含待训练的神经网络的输出层;
处理器801用于执行:训练第二神经网络;
处理器801还用于控制通信接口805执行:向控制装置发送训练完成的第二神经网络。
可选地,处理器801还用于控制通信接口805执行:
向控制装置发送第二神经网络的输出值,第二神经网络的输出值用于确定待训练的神经网络的损失函数;
从控制装置接收损失函数或者损失函数对应的梯度;
处理器801还用于执行:根据损失函数或者损失函数对应的梯度确定第二神经网络的 输入层的梯度。
可选地,处理器801还用于控制通信接口805执行:
向第一训练装置发送所述第二神经网络的输入层的梯度,该梯度用于第一训练装置中的第一神经网络的训练,其中,第二神经网络的输入层与第一神经网络的输出层相连,第一神经网络为待训练的神经网络的另一个子网络。
可选地,处理器801还用于执行:
根据训练参数是否满足终止条件确定第二神经网络是否完成训练。
应理解,方法实施例的各步骤可以通过处理器801中的硬件形式的逻辑电路或者软件形式的指令完成。处理器801还可以是数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或者其它可编程逻辑器件,例如,分立门、晶体管逻辑器件或分立硬件组件。
本申请还提供了一种计算机程序产品,该计算机程序产品被处理器801执行时实现本申请中任一方法实施例所述的方法。
该计算机程序产品可以存储在存储器802中,例如是程序804,程序804经过预处理、编译、汇编和链接等处理过程最终被转换为能够被处理器801执行的可执行目标文件。
本申请还提供了一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被计算机执行时实现本申请中任一方法实施例所述的方法。该计算机程序可以是高级语言程序,也可以是可执行目标程序。
该计算机可读存储介质例如是存储器802。存储器802可以是易失性存储器或非易失性存储器,或者,存储器802可以同时包括易失性存储器和非易失性存储器。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
如前所述,本申请的方法广泛适用于包括无线通信、车联网、计算机、深度学习、模式识别、云计算等涉及人工智能的领域,这里用无线通信的装置例子进行说明。这里的无线通信包括第五代(the fifth generation,5G)移动通信系统、无线保真(wireless-fidelity,WiFi)、卫星通信等各类已有的通信方式以及未来可能的各种通信方式,主要涉及终端设备和网络设备两个方面。
在装置800为终端设备中的芯片的情况下,图9示出了本申请提供的一种终端设备的结构示意图。终端设备900可实现上述方法实施例中训练神经网络的功能。为了便于说明,图9仅示出了终端设备900的主要部件。
如图9所示,终端设备900包括处理器、存储器、控制电路、天线以及输入输出装置。其中,处理器901主要用于对通信协议以及通信数据进行处理,以及用于对终端设备900进行控制。例如,处理器901通过天线和控制电路接收通过极化码编码的信息。处理器901还用于读取存储器904中存储的待训练的神经网络,并将其拆分为至少两个子网络,分别发送至处理器902和处理器903。处理器902和处理器903用于训练待训练的神经网络被拆分后的子网络。处理器901、处理器902和处理器903可以是图8所示的装置,此外,处理器901、处理器902和处理器903可以被称为神经网络训练系统,包含该三个处理器的终端设备900也可以被称为神经网络训练系统。
存储器904主要用于存储程序和数据,例如,存储器904存储上述方法实施例中的待训练的神经网络。控制电路主要用于基带信号与射频信号的转换以及对射频信号的处理。控制电路和天线一起也可以叫做收发器,主要用于收发电磁波形式的射频信号。输入输出装置例如是触摸屏或键盘,主要用于接收用户输入的数据以及对用户输出数据。
终端设备900开机后,处理器901可以读取存储器904中的程序,解释并执行该程序所包含的指令,处理程序中的数据。当需要通过天线发送信息时,处理器901对待发送的信息进行基带处理后,输出基带信号至射频电路,射频电路将基带信号进行射频处理后得到射频信号,并将射频信号通过天线以电磁波的形式向外发送。当承载信息的电磁波(即,射频信号)到达终端设备900时,射频电路通过天线接收到射频信号,将射频信号转换为基带信号,并将基带信号输出至处理器,处理器901将基带信号转换为信息并对该信息进行处理。
本领域技术人员可以理解,为了便于说明,图9仅示出了一个存储器和三个处理器。在实际的终端设备中,可以存在更多的存储器和处理器。存储器也可以称为存储介质或者存储设备等,本申请对此不做限定。
作为一种可选的实现方式,图9中的处理器901可以集成基带处理器和CPU的功能,本领域技术人员可以理解,基带处理器和CPU也可以是相互独立的处理器,通过总线等技术互联。本领域技术人员可以理解,终端设备900可以包括多个基带处理器以适应不同的网络制式,终端设备900可以包括多个CPU以增强其处理能力,终端设备900的各个部件可以通过各种总线连接。基带处理器也可以被称为基带处理电路或者基带处理芯片。对通信协议以及通信数据进行处理的功能可以内置在处理器中,也可以以程序的形式存储在存储器904中,由处理器901执行存储器904中的程序以实现基带处理功能。
在通信装置800为网络设备中的芯片的情况下,图10是本申请提供的一种网络设备的结构示意图,该网络设备例如可以为基站。如图10所示,该基站可实现上述方法实施例中训练神经网络的功能。基站1000可包括一个或多个射频单元,如远端射频单元(remote radio unit,RRU)1001和至少一个基带单元(baseband unit,BBU)1002。其中,BBU1002可以包括分布式单元(distributed unit,DU),也可以包括DU和集中单元(central unit,CU)。
RRU1001可以称为收发单元、收发机、收发电路或者收发器,其可以包括至少一个天线10011和射频单元10012。RRU1001主要用于射频信号的收发以及射频信号与基带信号的转换,例如用于支持基站实现发送功能和接收功能。BBU1002主要用于进行基带处理,对基站进行控制等。RRU1001与BBU1002可以是物理上设置在一起的,也可以物理 上分离设置的,即分布式基站。
BBU1002也可以称为处理单元,主要用于完成基带处理功能,如信道编码,复用,调制,扩频等等。例如,BBU1002可以用于控制基站执行上述方法实施例中的操作流程。
BBU1002可以由一个或多个单板构成,多个单板可以共同支持单一接入制式的无线接入网,也可以分别支持不同接入制式的无线接入网。BBU1002还包括处理器10021和存储器10024,存储器10024用于存储必要的指令和数据。例如,存储器10021存储上述方法实施例中的待训练的神经网络。处理器10021用于控制基站进行必要的动作,例如,处理器通过天线和控制电路接收通过极化码编码的信息。处理器10021还用于读取存储器10024中存储的待训练的神经网络,并将其拆分为至少两个子网络,分别发送至处理器10022和处理器10023。处理器10022和处理器10023用于训练待训练的神经网络被拆分后的子网络。处理器1001、处理器1002和处理器1003可以是图8所示的装置,此外,处理器1001、处理器1002和处理器1003处理器10021、处理器10022和处理器10023可以被称为神经网络训练系统,包含该三个处理器的网络设备1000也可以被称为神经网络训练系统。
处理器10021和存储器10024可以服务于一个或多个单板。也就是说,可以每个单板上单独设置存储器和处理器。也可以是多个单板共用相同的存储器和处理器。此外每个单板上还可以设置有必要的电路。
本领域技术人员可以理解,为了便于说明,图10仅示出了一个存储器和三个处理器。在实际的网络设备中,可以存在更多的存储器和处理器。存储器也可以称为存储介质或者存储设备等,本申请对此不做限定。
此外,图10所示的基站仅是一个示例,适用于本申请的网络设备还可以是有源天线系统(active antenna system,AAS)中的有源天线单元(active antenna unit,AAU)。
本领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的方法实施例的一些特征可以忽略,或不执行。以上所描述的装置实施例仅仅是示意性的,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,多个单元或组件可以结合或者可以集成到另一个系统。另外,各单元之间的耦合或各个组件之间的耦合可以是直接耦合,也可以是间接耦合,上述耦合包括电的、机械的或其它形式的连接。
应理解,在本申请的各种实施例中,各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请的实施例的实施过程构成任何限定。
在本申请的各个实施例中,如果没有特殊说明以及逻辑冲突,不同的实施例之间的术语和/或描述具有一致性、且可以相互引用,不同的实施例中的技术特征根据其内在的逻辑关系可以组合形成新的实施例。
另外,本文中术语“系统”和“网络”在本文中常被可互换使用。本文中的术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一 般表示前后关联对象是一种“或”的关系。
总之,以上所述仅为本申请技术方案的较佳实施例而已,并非用于限定本申请的保护范围。凡在本申请的原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (39)

  1. 一种训练神经网络的方法,其特征在于,包括:
    获取待训练的神经网络;
    向第一训练装置发送第一神经网络,所述第一神经网络为所述待训练的神经网络的子网络,所述第一训练装置用于训练所述第一神经网络;
    向第二训练装置发送第二神经网络,所述第二神经网络为所述待训练的神经网络的子网络,且所述第二神经网络与所述第一神经网络相异,所述第二训练装置用于训练所述第二神经网络;
    从目标训练装置接收所述待训练的神经网络的输出值,所述目标训练装置为训练装置集合中包含所述待训练的神经网络的输出层的训练装置,所述训练装置集合包括所述第一训练装置和所述第二训练装置;
    根据所述待训练的神经网络的输出值确定所述待训练的神经网络的损失函数;
    向所述目标训练装置发送所述损失函数或者所述损失函数对应的梯度。
  2. 根据权利要求1所述的方法,其特征在于,所述第一神经网络与所述第二神经网络属于所述待训练的神经网络的不同层。
  3. 根据权利要求2所述的方法,其特征在于,所述第二神经网络包括所述待训练的神经网络的输出层,
    所述从目标训练装置接收所述待训练的神经网络的输出值,包括:
    从所述第二训练装置接收所述待训练的神经网络的输出值;
    所述向所述目标训练装置发送所述损失函数或者所述损失函数对应的梯度,包括:
    向所述第二训练装置发送所述损失函数。
  4. 根据权利要求1所述的方法,其特征在于,所述第一神经网络与所述第二神经网络属于所述待训练的神经网络的相同层。
  5. 根据权利要求4所述的方法,其特征在于,
    所述从目标训练装置接收所述待训练的神经网络的输出值,包括:
    从所述第一训练装置接收第一输出值,所述第一输出值为所述第一神经网络的输出值;
    从所述第二训练装置接收第二输出值,所述第二输出值为所述第二神经网络的输出值;
    所述根据所述待训练的神经网络的输出值确定所述待训练的神经网络的损失函数,包括:
    通过全连接层处理所述第一输出值和所述第二输出值,得到所述待训练的神经网络的损失函数;
    所述向所述目标训练装置发送所述损失函数或者所述损失函数对应的梯度,包括:
    向所述第一训练装置和所述第二训练装置发送所述损失函数对应的梯度。
  6. 一种训练神经网络的方法,其特征在于,所述方法应用于第一训练装置,所述方法包括:
    从控制装置接收第一神经网络,所述第一神经网络为待训练的神经网络的一个子网 络,且所述第一神经网络不包含所述待训练的神经网络的输出层;
    训练所述第一神经网络;
    向所述控制装置发送训练完成的所述第一神经网络。
  7. 根据权利要求6所述的方法,其特征在于,所述训练所述第一神经网络,包括:
    向第二训练装置发送所述第一神经网络的输出值,所述第一神经网络的输出值用于确定所述待训练的神经网络的损失函数;
    从所述第二训练装置接收第一梯度,所述第一梯度为所述第二训练装置中的第二神经网络的输入层的梯度,所述第二神经网络为所述待训练的神经网络的另一个子网络,所述第一梯度为基于所述损失函数确定的梯度;
    根据所述第一梯度训练所述第一神经网络。
  8. 根据权利要求6或7所述的方法,其特征在于,所述方法还包括:
    根据训练参数是否满足终止条件确定所述第一神经网络是否完成训练。
  9. 根据权利要求8所述的方法,其特征在于,所述训练参数包括训练轮数、训练时间和误码率中的至少一种,
    所述根据训练参数是否满足终止条件确定所述第一神经网络是否完成训练,包括:
    当所述待训练的神经网络的损失函数的值小于或等于损失函数阈值时,确定所述第二神经网络完成训练;和/或,
    当所述训练轮数大于或等于轮数阈值时,确定所述第一神经网络完成训练;和/或,
    当所述训练时间大于或等于时间阈值时,确定所述第一神经网络完成训练;和/或,
    当所述误码率小于或等于误码率阈值时,确定所述第一神经网络完成训练。
  10. 一种训练神经网络的方法,其特征在于,所述方法应用于第二训练装置,包括:
    从控制装置接收第二神经网络,所述第二神经网络为待训练的神经网络的一个子网络,且所述第二神经网络包含所述待训练的神经网络的输出层;
    训练所述第二神经网络;
    向所述控制装置发送训练完成的所述第二神经网络。
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:
    向所述控制装置发送所述第二神经网络的输出值,所述第二神经网络的输出值用于确定所述待训练的神经网络的损失函数;
    从所述控制装置接收所述损失函数或者所述损失函数对应的梯度;
    根据所述损失函数或者所述损失函数对应的梯度确定所述第二神经网络的输入层的梯度。
  12. 根据权利要求11所述的方法,其特征在于,所述方法还包括:
    向第一训练装置发送所述第二神经网络的输入层的梯度,所述梯度用于所述第一训练装置中的第一神经网络的训练,其中,所述第二神经网络的输入层与所述第一神经网络的输出层相连,所述第一神经网络为所述待训练的神经网络的另一个子网络。
  13. 根据权利要求10至12中任一项所述的方法,其特征在于,所述训练所述第二神经网络,包括:
    根据训练参数是否满足终止条件确定所述第二神经网络是否完成训练。
  14. 根据权利要求13所述的方法,其特征在于,所述训练参数包括训练轮数、训练 时间、所述待训练的神经网络的损失函数和误码率中的至少一种,
    所述根据训练参数是否满足终止条件确定所述第二神经网络是否完成训练,包括:
    当所述训练轮数大于或等于轮数阈值时,确定所述第二神经网络完成训练;和/或,
    当所述训练时间大于或等于时间阈值时,确定所述第二神经网络完成训练;和/或,
    当所述损失函数的值小于或等于损失函数阈值时,确定所述第二神经网络完成训练;和/或,
    当所述误码率小于或等于误码率阈值时,确定所述第二神经网络完成训练。
  15. 一种训练神经网络的装置,其特征在于,包括处理单元和通信接口,
    所述处理单元用于控制所述通信接口执行:
    获取待训练的神经网络;
    向第一训练装置发送第一神经网络,所述第一神经网络为所述待训练的神经网络的子网络,所述第一训练装置用于训练所述第一神经网络;
    向第二训练装置发送第二神经网络,所述第二神经网络为所述待训练的神经网络的子网络,且所述第二神经网络与所述第一神经网络相异,所述第二训练装置用于训练所述第二神经网络;
    从目标训练装置接收所述待训练的神经网络的输出值,所述目标训练装置为训练装置集合中包含所述待训练的神经网络的输出层的训练装置,所述训练装置集合包括所述第一训练装置和所述第二训练装置;
    所述处理单元还用于执行:
    根据所述待训练的神经网络的输出值确定所述待训练的神经网络的损失函数;
    所述处理单元还用于控制所述通信接口执行:
    向所述目标训练装置发送所述损失函数或者所述损失函数对应的梯度。
  16. 根据权利要求15所述的装置,其特征在于,所述第一神经网络与所述第二神经网络属于所述待训练的神经网络的不同层。
  17. 根据权利要求16所述的装置,其特征在于,所述第二神经网络包括所述待训练的神经网络的输出层,
    所述处理单元具体用于控制所述通信接口执行:
    从所述第二训练装置接收所述待训练的神经网络的输出值;
    向所述第二训练装置发送所述损失函数。
  18. 根据权利要求15所述的装置,其特征在于,所述第一神经网络与所述第二神经网络属于所述待训练的神经网络的相同层。
  19. 根据权利要求18所述的装置,其特征在于,
    所述处理单元具体用于控制所述通信接口执行:
    从所述第一训练装置接收第一输出值,所述第一输出值为所述第一神经网络的输出值;
    从所述第二训练装置接收第二输出值,所述第二输出值为所述第二神经网络的输出值;
    所述处理单元具体用于:
    通过全连接层处理所述第一输出值和所述第二输出值,得到所述待训练的神经网络的 损失函数;
    所述处理单元具体用于控制所述通信接口执行:
    向所述第一训练装置和所述第二训练装置发送所述损失函数对应的梯度。
  20. 一种训练神经网络的装置，其特征在于，包括处理单元和通信接口，
    所述处理单元用于控制所述通信接口执行:
    从控制装置接收第一神经网络,所述第一神经网络为待训练的神经网络的一个子网络,且所述第一神经网络不包含所述待训练的神经网络的输出层;
    所述处理单元还用于执行:
    训练所述第一神经网络;
    所述处理单元还用于控制所述通信接口执行:
    向所述控制装置发送训练完成的所述第一神经网络。
  21. 根据权利要求20所述的装置,其特征在于,
    所述处理单元具体用于控制所述通信接口执行:
    向第二训练装置发送所述第一神经网络的输出值,所述第一神经网络的输出值用于确定所述待训练的神经网络的损失函数;
    从所述第二训练装置接收第一梯度,所述第一梯度为所述第二训练装置中的第二神经网络的输入层的梯度,所述第二神经网络为所述待训练的神经网络的另一个子网络,所述第一梯度为基于所述损失函数确定的梯度;
    所述处理单元具体用于执行:
    根据所述第一梯度训练所述第一神经网络。
  22. 根据权利要求20或21所述的装置,其特征在于,所述处理单元还用于执行:
    根据训练参数是否满足终止条件确定所述第一神经网络是否完成训练。
  23. 根据权利要求22所述的装置,其特征在于,所述训练参数包括训练轮数、训练时间和误码率中的至少一种,
    所述处理单元具体用于执行:
    当所述待训练的神经网络的损失函数的值小于或等于损失函数阈值时,确定所述第二神经网络完成训练;和/或,
    当所述训练轮数大于或等于轮数阈值时,确定所述第一神经网络完成训练;和/或,
    当所述训练时间大于或等于时间阈值时,确定所述第一神经网络完成训练;和/或,
    当所述误码率小于或等于误码率阈值时,确定所述第一神经网络完成训练。
  24. 一种训练神经网络的装置，其特征在于，包括处理单元和通信接口，
    所述处理单元用于控制所述通信接口执行:
    从控制装置接收第二神经网络,所述第二神经网络为待训练的神经网络的一个子网络,且所述第二神经网络包含所述待训练的神经网络的输出层;
    所述处理单元还用于执行:
    训练所述第二神经网络;
    所述处理单元还用于控制所述通信接口执行:
    向所述控制装置发送训练完成的所述第二神经网络。
  25. 根据权利要求24所述的装置,其特征在于,
    所述处理单元具体用于控制所述通信接口执行:
    向所述控制装置发送所述第二神经网络的输出值,所述第二神经网络的输出值用于确定所述待训练的神经网络的损失函数;
    从所述控制装置接收所述损失函数或者所述损失函数对应的梯度;
    所述处理单元具体用于执行:
    根据所述损失函数或者所述损失函数对应的梯度确定所述第二神经网络的输入层的梯度。
  26. 根据权利要求25所述的装置,其特征在于,所述处理单元还用于控制所述通信接口执行:
    向第一训练装置发送所述第二神经网络的输入层的梯度,所述梯度用于所述第一训练装置中的第一神经网络的训练,其中,所述第二神经网络的输入层与所述第一神经网络的输出层相连,所述第一神经网络为所述待训练的神经网络的另一个子网络。
  27. 根据权利要求24至26中任一项所述的装置,其特征在于,所述处理单元还用于执行:
    根据训练参数是否满足终止条件确定所述第二神经网络是否完成训练。
  28. 根据权利要求27所述的装置,其特征在于,所述训练参数包括训练轮数、训练时间、所述待训练的神经网络的损失函数和误码率中的至少一种,
    所述处理单元具体用于执行:
    当所述训练轮数大于或等于轮数阈值时,确定所述第二神经网络完成训练;和/或,
    当所述训练时间大于或等于时间阈值时,确定所述第二神经网络完成训练;和/或,
    当所述损失函数的值小于或等于损失函数阈值时,确定所述第二神经网络完成训练;和/或,
    当所述误码率小于或等于误码率阈值时,确定所述第二神经网络完成训练。
  29. 一种训练神经网络的装置,其特征在于,包括处理器和接口电路,所述接口电路用于接收来自所述控制装置之外的其它装置的信号并传输至所述处理器,或将来自所述处理器的信号发送给所述控制装置之外的其它装置,所述处理器通过逻辑电路或执行代码指令用于实现如权利要求1至5中任一项所述的方法。
  30. 根据权利要求29所述的装置，其特征在于，所述装置还包括存储器，所述存储器用于存储所述代码指令。
  31. 根据权利要求29或30所述的装置,其特征在于,所述装置为芯片。
  32. 一种训练神经网络的装置,其特征在于,包括处理器和接口电路,所述接口电路用于接收来自所述控制装置之外的其它装置的信号并传输至所述处理器,或将来自所述处理器的信号发送给所述控制装置之外的其它装置,所述处理器通过逻辑电路或执行代码指令用于实现如权利要求6至9中任一项或权利要求10至14中任一项所述的方法。
  33. 根据权利要求32所述的装置，其特征在于，所述装置还包括存储器，所述存储器用于存储所述代码指令。
  34. 根据权利要求32或33所述的装置,其特征在于,所述装置为芯片。
  35. 一种训练神经网络的系统,其特征在于,包括:
    如权利要求15至19中任一项所述的装置、如权利要求20至23中任一项所述的装置 以及如权利要求24至28中任一项所述的装置;或者,
    如权利要求29或30所述的装置以及至少一个如权利要求31或32所述的装置。
  36. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有程序或指令,当所述程序或指令被运行时,实现如权利要求1至5中任一项所述的方法。
  37. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有程序或指令,当所述程序或指令被运行时,实现如权利要求6至9中任一项或权利要求10至14中任一项所述的方法。
  38. 一种计算机程序产品,其特征在于,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码被处理器运行时,使得所述处理器执行如权利要求1至5中任一项所述的方法。
  39. 一种计算机程序产品,其特征在于,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码被处理器运行时,使得所述处理器执行如权利要求6至9中任一项或权利要求10至14中任一项所述的方法。
PCT/CN2020/079808 2019-04-03 2020-03-18 训练神经网络的方法和装置 WO2020199914A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910267854.9 2019-04-03
CN201910267854.9A CN111783932B (zh) 2019-04-03 2019-04-03 训练神经网络的方法和装置

Publications (1)

Publication Number Publication Date
WO2020199914A1 true WO2020199914A1 (zh) 2020-10-08

Family

ID=72664695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/079808 WO2020199914A1 (zh) 2019-04-03 2020-03-18 训练神经网络的方法和装置

Country Status (2)

Country Link
CN (1) CN111783932B (zh)
WO (1) WO2020199914A1 (zh)



Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143327B (zh) * 2013-07-10 2015-12-09 腾讯科技(深圳)有限公司 一种声学模型训练方法和装置
CN106297774B (zh) * 2015-05-29 2019-07-09 中国科学院声学研究所 一种神经网络声学模型的分布式并行训练方法及系统
KR102494139B1 (ko) * 2015-11-06 2023-01-31 삼성전자주식회사 뉴럴 네트워크 학습 장치 및 방법과, 음성 인식 장치 및 방법
CN107346448B (zh) * 2016-05-06 2021-12-21 富士通株式会社 基于深度神经网络的识别装置、训练装置及方法
US11715009B2 (en) * 2016-05-20 2023-08-01 Deepmind Technologies Limited Training neural networks using synthetic gradients
US10949746B2 (en) * 2016-10-27 2021-03-16 International Business Machines Corporation Efficient parallel training of a network model on multiple graphics processing units
CN107480774A (zh) * 2017-08-11 2017-12-15 山东师范大学 基于集成学习的动态神经网络模型训练方法和装置
US11941516B2 (en) * 2017-08-31 2024-03-26 Micron Technology, Inc. Cooperative learning neural networks and systems
CN108460457A (zh) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 一种面向卷积神经网络的多机多卡混合并行异步训练方法
CN109241880B (zh) * 2018-08-22 2021-02-05 北京旷视科技有限公司 图像处理方法、图像处理装置、计算机可读存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426859A (zh) * 2017-08-22 2019-03-05 华为技术有限公司 神经网络训练系统、方法和计算机可读存储介质
CN108229343A (zh) * 2017-12-18 2018-06-29 北京市商汤科技开发有限公司 目标对象关键点检测方法、深度学习神经网络及装置
CN109492761A (zh) * 2018-10-30 2019-03-19 深圳灵图慧视科技有限公司 实现神经网络的fpga加速装置、方法和系统

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NAI-JIE GU, ZENG ZHAO , LV YA-FEI, ZHANG ZHI-JIANG: "Algorithm of Depth Neural Network Training Based on Multi-GPU", JOURNAL OF CHINESE COMPUTER SYSTEMS, vol. 36, no. 5, 15 May 2015 (2015-05-15), pages 1042 - 1046, XP055740403, ISSN: 1000-1220 *
YANG NING: "Multi-GPU Parallel Framework of Deep Convolutional Neural Networks", COMPUTER AND MODERNIZATION, 21 November 2016 (2016-11-21), pages 95 - 98, XP055740401, ISSN: 1006-2475, DOI: 10.3969/j.issn.1006-2475.2016.11.017 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356540A (zh) * 2021-10-30 2022-04-15 腾讯科技(深圳)有限公司 一种参数更新方法、装置、电子设备和存储介质
WO2023116787A1 (zh) * 2021-12-22 2023-06-29 华为技术有限公司 智能模型的训练方法和装置

Also Published As

Publication number Publication date
CN111783932B (zh) 2024-07-23
CN111783932A (zh) 2020-10-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20781768

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20781768

Country of ref document: EP

Kind code of ref document: A1