WO2022161387A1 - Neural network training method and related device - Google Patents

Neural network training method and related device

Info

Publication number
WO2022161387A1
WO2022161387A1 PCT/CN2022/073955 CN2022073955W WO2022161387A1 WO 2022161387 A1 WO2022161387 A1 WO 2022161387A1 CN 2022073955 W CN2022073955 W CN 2022073955W WO 2022161387 A1 WO2022161387 A1 WO 2022161387A1
Authority
WO
WIPO (PCT)
Prior art keywords
function
neural network
gradient
layer
binarization
Application number
PCT/CN2022/073955
Inventor
许奕星
韩凯
唐业辉
王云鹤
许春景
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP22745261.2A (EP4273754A1)
Publication of WO2022161387A1
Priority to US18/362,435 (US20240005164A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • The embodiments of the present application relate to the technical field of deep learning, and in particular to a neural network training method and related device.
  • Deep learning is a research direction within machine learning (ML); it was introduced to bring machine learning closer to its original goal, artificial intelligence (AI).
  • ML Machine Learning
  • AI Artificial Intelligence
  • DNN Deep Neural Networks
  • CNN Convolutional Neural Network
  • Applying convolutional neural networks requires huge computing resources, so it is difficult to deploy them directly on devices with limited computing power, such as mobile phones, cameras, and robots.
  • One approach is to binarize the weights, which occupy a large amount of space, to obtain a binary neural network (BNN) and thereby reduce the storage space required by the convolutional neural network; the activation values can likewise be binarized to improve the operation speed of the neural network.
  • BNN Binary Neural Network
  • The Sign function is used to binarize the 32-bit floating-point weights and activation values of the convolutional neural network, converting each of them to +1 or -1.
  • Weights and activation values that originally required 32 bits of storage then need only 1 bit each, which saves storage space.
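  • As a minimal sketch of the storage saving described above, the following example binarizes a small list of weights with a sign rule and packs the result one bit per weight. The helper names `sign_binarize` and `pack_bits`, the sample weights, and the choice to map 0 to +1 are assumptions made for illustration; they do not come from the application.

```python
def sign_binarize(values):
    """Map each real value to +1 or -1 (0 is mapped to +1 here by assumption)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(binary_values):
    """Pack +1/-1 values into bytes, one bit per value (+1 -> bit 1, -1 -> bit 0)."""
    packed = bytearray((len(binary_values) + 7) // 8)
    for i, b in enumerate(binary_values):
        if b == 1:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


# Hypothetical 32-bit floating-point weights.
weights = [0.73, -1.20, 0.05, -0.40, 2.10, -0.01, 0.00, -3.30]

binary = sign_binarize(weights)
packed = pack_bits(binary)

float32_bytes = 4 * len(weights)  # 32 bits per original weight
binary_bytes = len(packed)        # 1 bit per binarized weight
```

  • For these eight weights, the 32-bit representation needs 32 bytes while the packed binary representation needs a single byte.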
  • The gradient of the Sign function is an impulse function: it is infinite at 0 and is 0 everywhere else. Therefore, the gradient of the Sign function cannot be used for backpropagation when training a binary neural network.
  • The straight-through estimator (STE) is the main technique used to work around this problem. Specifically, during backpropagation the gradient of the Sign function is not calculated; instead, the gradient received from the layer above the layer containing the Sign function is passed back directly.
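  • The straight-through behaviour can be sketched for a single scalar as follows. The function names and the sample values are hypothetical; the only point illustrated is that the backward step returns the incoming gradient unchanged, which amounts to treating the gradient of the Sign function as 1.

```python
def sign(x):
    """Forward pass: binarize a scalar to +1.0 or -1.0 (0 maps to +1.0 by assumption)."""
    return 1.0 if x >= 0 else -1.0


def ste_backward(upstream_grad):
    """Backward pass of the straight-through estimator.

    The gradient of sign() is never evaluated; the gradient arriving from
    the layer above is handed back as-is.
    """
    return upstream_grad


x = 0.3
y = sign(x)            # forward: y is +1.0
upstream = -0.8        # hypothetical dL/dy from the layer above
grad_x = ste_backward(upstream)  # dL/dx under the STE
```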
  • the embodiments of the present application provide a neural network training method and related equipment.
  • the training method adopts the gradient of the fitting function to replace the gradient of the binarization function, so that the accuracy of the trained neural network can be improved.
  • A first aspect of the embodiments of the present application provides a method for training a neural network, including: in the forward propagation process, using a binarization function to binarize a target weight to obtain the weight of a first neural network layer in the neural network; and, in the back propagation process, calculating the gradient of the loss function with respect to the target weight by using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
  • The first neural network layer is one layer of the neural network, specifically a convolutional layer. A binarization function is a function whose dependent variable takes exactly two values over the different value ranges of the independent variable.
  • For example, the target weight can be converted to +1 or -1, or to +1 or 0.
  • Forward propagation refers to calculating the intermediate variables of each layer of the neural network in order from the input layer to the output layer; the intermediate variables can be the output values of each layer. Back propagation refers to calculating, in order from the output layer to the input layer, the intermediate variables of each layer and the derivative of the loss function with respect to each parameter; again, the intermediate variables can be the output values of each layer.
  • In the forward propagation process, the target weight is binarized by the binarization function to obtain the weight of the first neural network layer, which reduces the storage space occupied by the first neural network layer.
  • Because the weight after binarization is +1 or -1, or +1 or 0, multiplication operations can be replaced by addition operations, which reduces the amount of computation.
  • In the back propagation process, the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function with respect to the target weight, which solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, because the fitting function is determined based on the series expansion of the binarization function, it fits the binarization function more closely, which can improve the training effect and help ensure that the trained neural network has high accuracy.
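  • To illustrate why binarized weights turn multiplications into additions, the sketch below computes a dot product with +1/-1 weights using only additions and subtractions and compares it with the ordinary multiply-accumulate result. The function name `binary_dot` and the sample data are hypothetical.

```python
def binary_dot(inputs, binary_weights):
    """Dot product with +1/-1 weights using only additions and subtractions."""
    total = 0.0
    for x, w in zip(inputs, binary_weights):
        if w == 1:
            total += x   # multiplying by +1 is an addition
        else:
            total -= x   # multiplying by -1 is a subtraction
    return total


inputs = [0.5, -1.5, 2.0, 4.0]
binary_weights = [1, -1, -1, 1]

result = binary_dot(inputs, binary_weights)

# Reference result using explicit multiplications.
reference = sum(x * w for x, w in zip(inputs, binary_weights))
```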
  • The data type of the target weight is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • The data type of the target weight can also be another data type, as long as the target weight occupies more storage space than the binarized weight.
  • This implementation provides multiple possible data types for target weights.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting degree of the fitting function and the binarization function is high, which can improve the training effect of the neural network.
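  • As one possible illustration of sub-functions obtained from a series expansion, the sketch below uses the standard Fourier series of a square wave, in which the Sign function on one period is approximated by (4/π) Σ sin((2i+1)ωx)/(2i+1), so each sine term plays the role of one sub-function. Choosing the Fourier variant, the number of terms `n_terms`, the frequency `omega`, and the function names are all assumptions made for this example rather than details stated in this passage.

```python
import math


def fourier_sign(x, n_terms=10, omega=1.0):
    """Truncated Fourier series of a square wave, used here to approximate Sign(x)."""
    return (4.0 / math.pi) * sum(
        math.sin((2 * i + 1) * omega * x) / (2 * i + 1) for i in range(n_terms)
    )


def fourier_sign_grad(x, n_terms=10, omega=1.0):
    """Derivative of fourier_sign with respect to x (sum of cosine terms)."""
    return (4.0 * omega / math.pi) * sum(
        math.cos((2 * i + 1) * omega * x) for i in range(n_terms)
    )


# Unlike the impulse-shaped gradient of Sign, this gradient is finite everywhere
# and is largest near x = 0.
value_near_one = fourier_sign(math.pi / 2, n_terms=200)
grad_at_zero = fourier_sign_grad(0.0, n_terms=10)
```

  • Increasing `n_terms` makes the approximation closer to the square wave at the cost of more terms to evaluate in the backward pass.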
  • The fitting function consists of multiple sub-functions and an error function; the multiple sub-functions are determined based on the series expansion of the binarization function, and the error function is obtained by neural network fitting.
  • Introducing an error function into the fitting function compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient, so that the influence of these errors on the gradient of the fitting function is reduced and the accuracy of the gradient of the fitting function is improved.
  • The error function is fitted by a two-layer fully connected neural network with residuals. A two-layer fully connected neural network is one in which every neuron in one layer is connected to all neurons in the other layer; a residual is the difference between the actual observed value and the estimated value (the value fitted by the neural network). Because the two-layer fully connected neural network with residuals is used to fit the error function, it can also be called an error fitting module.
  • This implementation provides a specific fitting method for the error function.
  • The error function is fitted by at least one neural network layer. In the back propagation process, calculating the gradient of the loss function with respect to the target weight by using the gradient of the fitting function as the gradient of the binarization function includes: calculating the gradient of the multiple sub-functions with respect to the target weight; calculating the gradient of the at least one neural network layer with respect to the target weight; and calculating the gradient of the loss function with respect to the target weight based on those two gradients.
  • Specifically, the sum of the two gradients can be calculated first, and the sum is then multiplied by the gradient of the loss function with respect to the weight of the first neural network layer to obtain the gradient of the loss function with respect to the target weight.
  • Because the at least one neural network layer is used to fit the error function, its gradient with respect to the target weight compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient. The resulting gradient of the loss function with respect to the target weight is therefore more accurate, which improves the training effect of the neural network.
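  • The gradient combination described above (add the sub-function gradient and the error-network gradient, then multiply by the gradient returned from the first neural network layer) can be sketched as follows. The cosine-sum form of the sub-function gradient assumes a Fourier-type expansion, and `error_net_grad` is a hypothetical stand-in that returns a constant instead of backpropagating through a real error fitting module.

```python
import math


def sub_function_grad(w, n_terms=5, omega=1.0):
    """Gradient of the summed sub-functions with respect to the target weight w.

    Assumes sine sub-functions from a Fourier expansion of the Sign function.
    """
    return (4.0 * omega / math.pi) * sum(
        math.cos((2 * i + 1) * omega * w) for i in range(n_terms)
    )


def error_net_grad(w):
    """Placeholder gradient of the error-fitting network with respect to w.

    A real module would backpropagate through its layers; a constant is
    returned here purely for illustration.
    """
    return 0.05


def grad_loss_wrt_target_weight(w, grad_loss_wrt_binary_weight):
    """Sum the two gradients, then multiply by the gradient returned from above."""
    total = sub_function_grad(w) + error_net_grad(w)
    return total * grad_loss_wrt_binary_weight


grad = grad_loss_wrt_target_weight(0.0, grad_loss_wrt_binary_weight=0.5)
```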
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • This implementation provides multiple feasible solutions for the series expansion of the binarization function.
  • A second aspect of the embodiments of the present application provides a method for training a neural network, including: in the forward propagation process, using a binarization function to binarize the activation value of a second neural network layer to obtain the input of a first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network; and, in the back propagation process, calculating the gradient of the loss function with respect to the activation value by using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
  • A binarization function is a function whose dependent variable takes exactly two values over the different value ranges of the independent variable; for example, the value being binarized can be converted to +1 or -1, or to +1 or 0.
  • The activation value refers to the value processed by the activation function. The activation function is a function running on the neurons of the neural network, usually a nonlinear function, that maps the input of a neuron to its output; activation functions include but are not limited to the Sigmoid function, the Tanh function, and the ReLU function.
  • Forward propagation refers to calculating the intermediate variables of each layer of the neural network in order from the input layer to the output layer; the intermediate variables can be the output values of each layer. Back propagation refers to calculating, in order from the output layer to the input layer, the intermediate variables of each layer and the derivative of the loss function with respect to each parameter; again, the intermediate variables can be the output values of each layer.
  • In the forward propagation process, the activation value of the second neural network layer is binarized by the binarization function to obtain the input of the first neural network layer, which reduces the storage space occupied. Because the value after binarization is +1 or -1, or +1 or 0, multiplication operations can be replaced by addition operations, which reduces the amount of computation. In the back propagation process, the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function with respect to the activation value, which solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, because the fitting function is determined based on the series expansion of the binarization function, it fits the binarization function more closely, which can improve the training effect and help ensure that the trained neural network has high accuracy.
  • the data type of the activation value is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • This implementation provides several possible data types for activation values.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting degree of the fitting function and the binarization function is high, which can improve the training effect of the neural network.
  • The fitting function consists of multiple sub-functions and an error function; the multiple sub-functions are determined based on the series expansion of the binarization function, and the error function is obtained by neural network fitting.
  • Introducing an error function into the fitting function compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient, so that the influence of these errors on the gradient of the fitting function is reduced and the accuracy of the gradient of the fitting function is improved.
  • The error function is fitted by a two-layer fully connected neural network with residuals. A two-layer fully connected neural network is one in which every neuron in one layer is connected to all neurons in the other layer; a residual is the difference between the actual observed value and the estimated value (the value fitted by the neural network). Because the two-layer fully connected neural network with residuals is used to fit the error function, it can also be called an error fitting module.
  • This implementation provides a specific fitting method for the error function.
  • The error function is fitted by at least one neural network layer. In the back propagation process, calculating the gradient of the loss function with respect to the activation value by using the gradient of the fitting function as the gradient of the binarization function includes: calculating the gradient of the multiple sub-functions with respect to the activation value; calculating the gradient of the at least one neural network layer with respect to the activation value; and calculating the gradient of the loss function with respect to the activation value based on those two gradients.
  • Specifically, the sum of the two gradients can be calculated first, and the sum is then multiplied by the returned gradient (that is, the gradient of the loss function with respect to the activation value of the first neural network layer) to obtain the gradient of the loss function with respect to the activation value of the second neural network layer.
  • The gradient of the error function with respect to the activation value compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient, so the final gradient of the loss function with respect to the activation value is more accurate, which improves the training effect of the neural network.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • a third aspect of the embodiments of the present application provides a network structure of a neural network.
  • the neural network includes a first neural network module, a second neural network module, and a first neural network layer, and the first neural network module is composed of one or more layers of neural network.
  • The second neural network module is composed of one or more neural network layers and is used to implement the gradient calculation steps in any possible implementation of the first aspect.
  • a fourth aspect of the embodiments of the present application provides a network structure of a neural network.
  • the neural network includes a first neural network module, a second neural network module, and a first neural network layer, and the first neural network module is composed of one or more layers of neural network.
  • The second neural network module is composed of one or more neural network layers and is used to implement the gradient calculation steps in any possible implementation of the second aspect.
  • A fifth aspect of the embodiments of the present application provides a neural network training device, including: a binarization processing unit, configured to use a binarization function to binarize a target weight in the forward propagation process to obtain the weight of a first neural network layer in the neural network, where the first neural network layer is one layer of the neural network; and a gradient calculation unit, configured to calculate, in the back propagation process, the gradient of the loss function with respect to the target weight by using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
  • the data type of the target weight is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function.
  • the error function is fitted using a two-layer fully connected neural network with residuals.
  • The error function is fitted by at least one neural network layer; the gradient calculation unit is specifically configured to, in the back propagation process: calculate the gradient of the multiple sub-functions with respect to the target weight; calculate the gradient of the at least one neural network layer with respect to the target weight; and calculate the gradient of the loss function with respect to the target weight based on those two gradients.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • A sixth aspect of the embodiments of the present application provides a neural network training device, including: a binarization processing unit, configured to use a binarization function to binarize the activation value of a second neural network layer in the forward propagation process to obtain the input of a first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network; and a gradient calculation unit, configured to calculate, in the back propagation process, the gradient of the loss function with respect to the activation value by using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
  • the data type of the activation value is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function.
  • the error function is fitted using a two-layer fully connected neural network with residuals.
  • The error function is fitted by at least one neural network layer; the gradient calculation unit is specifically configured to, in the back propagation process: calculate the gradient of the multiple sub-functions with respect to the activation value; calculate the gradient of the at least one neural network layer with respect to the activation value; and calculate the gradient of the loss function with respect to the activation value based on those two gradients.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • A seventh aspect of the embodiments of the present application provides a training device, including one or more processors and a memory, where the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the training device to implement the method according to any implementation of the first aspect or the second aspect.
  • An eighth aspect of the embodiments of the present application provides a computer-readable storage medium including computer-readable instructions; when the computer-readable instructions are executed on a computer, the computer is caused to perform the method according to any implementation of the first aspect or the second aspect.
  • A ninth aspect of the embodiments of the present application provides a chip including one or more processors. Some or all of the processors are used to read and execute a computer program stored in a memory, so as to execute the method in any possible implementation of the first aspect or the second aspect.
  • Optionally, the chip includes the memory, and the processor is connected to the memory through a circuit or a wire. Further optionally, the chip includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used for receiving data and/or information to be processed, the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface may be an input-output interface.
  • some of the one or more processors may also implement some steps in the above method by means of dedicated hardware, for example, the processing involving the neural network model may be performed by a dedicated neural network processor or graphics processor.
  • the methods provided in the embodiments of the present application may be implemented by one chip, or may be implemented collaboratively by multiple chips.
  • A tenth aspect of the embodiments of the present application provides a computer program product including computer software instructions, where the computer software instructions can be loaded by a processor to implement the method according to any implementation of the first aspect or the second aspect.
  • An eleventh aspect of the embodiments of the present application provides a server, where the server may be a cloud server, and is configured to execute the method in any possible implementation manner of the first aspect or the second aspect.
  • a twelfth aspect of the embodiments of the present application provides a terminal device, where a neural network trained by the method described in any one of the first aspect or the second aspect is deployed on the terminal device.
  • FIG. 1 is a schematic diagram of the calculation process of a neuron.
  • FIG. 2 is a schematic diagram of an embodiment of gradient calculation using a straight-through estimator.
  • FIG. 3 is a schematic structural diagram of the main artificial intelligence framework.
  • FIG. 4 is a schematic diagram of an application scenario in an embodiment of the present application.
  • FIG. 5 is a system architecture diagram of a task processing system provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an embodiment of a neural network training method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram comparing the fitting effects of various functions.
  • FIG. 8 is a schematic flowchart of calculating the gradient of the loss function with respect to the target weight.
  • FIG. 9 is a schematic diagram of an example of calculating the gradient of the loss function with respect to the target weight.
  • FIG. 10 is a schematic diagram of another embodiment of a neural network training method provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an embodiment of a neural network training apparatus provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of another embodiment of a neural network training apparatus provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of an embodiment of a training device in an embodiment of the present application.
  • The embodiments of the present application provide a neural network training method and related device, in which the fitting function of the binarization function is determined based on the series expansion of the binarization function, and the gradient of the fitting function is back-propagated in place of the gradient of the binarization function. This avoids the loss of accuracy caused by ignoring the gradient of the binarization function, so the embodiments of the present application can improve the accuracy of the neural network.
  • A commonly used BNN binarizes the weights and activation values of an existing neural network; that is, each weight in the weight matrix of each layer of the original neural network, and each activation value of each layer, is assigned one of +1 and -1, or one of +1 and 0.
  • BNN does not change the network structure of the original neural network. It mainly performs some optimization processing on gradient descent, weight update, and convolution operations.
  • Binarizing the weights of the neural network not only reduces the storage space they occupy, but also turns complex multiplication operations into addition and subtraction operations, which reduces the amount of computation and improves the operation speed; similarly, binarizing the activation values of the neural network also reduces the amount of computation and increases the operation speed.
  • The activation value refers to the value processed by the activation function. The activation function is a function running on the neurons of the neural network, usually a nonlinear function, that maps the input of a neuron to its output.
  • Activation functions include but are not limited to Sigmoid functions, Tanh functions, and ReLU functions.
  • the activation function and activation value are described below with specific examples.
  • z1 and z2 are input to the neuron, which performs the operation w1*z1+w2*z2, where w1 and w2 are weights; the activation function then converts the linear value w1*z1+w2*z2 into a nonlinear value. This nonlinear value is the output of the neuron and can also be called the activation value of the neuron.
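  • The neuron computation above can be written out directly. In this sketch the Sigmoid function is chosen as the activation function and the input and weight values are made up for illustration; the text does not fix either choice.

```python
import math


def sigmoid(v):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-v))


def neuron(z1, z2, w1, w2, activation=sigmoid):
    """Compute the activation value of a two-input neuron."""
    linear_value = w1 * z1 + w2 * z2  # weighted sum of the inputs
    return activation(linear_value)   # nonlinear value = activation value


# Hypothetical inputs and weights: 0.5 * 1.0 + (-0.25) * 2.0 = 0.0
activation_value = neuron(z1=1.0, z2=2.0, w1=0.5, w2=-0.25)
```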
  • If no activation function is used, the output of each layer of the neural network is a linear function of its input, so no matter how many layers the network has, its output is a linear combination of the input; this is the most primitive perceptron. If an activation function is used, it introduces nonlinearity to the neurons, so that the neural network can approximate any nonlinear function arbitrarily closely and can therefore be applied to many nonlinear models.
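  • A small numerical check of the statement that stacked linear layers remain linear: two scalar layers without an activation function are compared with a single equivalent linear layer. The weights and biases are arbitrary values chosen for the example.

```python
def layer(x, w, b):
    """A scalar linear layer with no activation function."""
    return w * x + b


def two_linear_layers(x):
    """Layer 1 (w=2.0, b=1.0) followed by layer 2 (w=-3.0, b=0.5)."""
    return layer(layer(x, 2.0, 1.0), -3.0, 0.5)


def collapsed(x):
    """Single linear layer equivalent to the two stacked layers."""
    w = -3.0 * 2.0          # product of the two weights
    b = -3.0 * 1.0 + 0.5    # second weight times first bias, plus second bias
    return w * x + b


samples = [-2.0, 0.0, 1.5]
stacked_outputs = [two_linear_layers(x) for x in samples]
collapsed_outputs = [collapsed(x) for x in samples]
```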
  • There are two methods of binarization: the first is a deterministic method based on the sign function (also called the Sign function), and the second is a random method (also called a statistical method). In theory the second method is more reasonable, but in practice it requires hardware to generate random numbers, which is difficult. The second method is therefore not yet used in practical applications, and the first method is adopted; that is, binarization is performed through the Sign function.
  • Here, W is the weight of each layer in the neural network, and W_b is the weight after binarization.
  • the gradient of the Sign function is an impulse function, that is, the gradient at the 0 point is infinite, and the gradient at other positions is 0.
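  • The following sketch illustrates Sign-based binarization and why its gradient is unusable. It assumes W_b = Sign(W) with Sign(0) = -1 (one possible convention) and uses a finite difference as a stand-in for the true derivative; all names are illustrative.

```python
def sign(x):
    """Sign function with the assumed convention Sign(0) = -1."""
    return 1.0 if x > 0 else -1.0


def binarize(weights):
    """W_b = Sign(W), applied element by element."""
    return [sign(w) for w in weights]


def central_difference(f, x, h):
    """Finite-difference estimate of df/dx at x with step h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


w = [0.7, -0.2, 0.05, -1.3]        # hypothetical real-valued weights
w_b = binarize(w)                   # every entry becomes +1 or -1

# Away from 0 the estimated gradient is exactly 0 ...
grad_away = central_difference(sign, 0.7, 1e-3)
# ... while across 0 it grows like 1/h as the step shrinks (impulse-like).
steps = (1e-1, 1e-2, 1e-3)
grad_at_zero = [central_difference(sign, 0.0, h) for h in steps]
```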
  • the training process of the neural network includes two processes: forward propagation and backward propagation.
  • Forward propagation refers to calculating the intermediate variables of each layer of the neural network in order from the input layer to the output layer, where the intermediate variables can be the output values of each layer of the neural network.
  • Back propagation refers to calculating, in order from the output layer to the input layer, the intermediate variables of each layer of the neural network and the derivative of the loss function with respect to each parameter, where the intermediate variables can be the output values of each layer of the neural network.
  • The loss function, which can also be called the cost function, is a function that maps the value of a random event or its related random variables to a non-negative real number to represent the "risk" or "loss" of the random event.
  • Although the straight-through estimator (Straight Through Estimator) can solve the problem that the gradient of the Sign function cannot be used for backpropagation, it also brings other problems. Specifically, if the straight-through estimator is used for backpropagation, the gradient of the Sign function is not calculated during backpropagation (equivalently, the gradient of the Sign function is regarded as 1). This method ignores the gradient of the Sign function, which makes the trained neural network inaccurate.
  • The straight-through estimator is further described below with reference to FIG. 2.
  • FIG. 2 shows three neural network layers, denoted A, B, and C, where layer B is used to fit the Sign function.
  • During backpropagation, the gradient dl/dy returned by layer C must be used to calculate the gradient dl/dx, and the gradient dl/dx is returned to layer A, where dl/dy represents the gradient of the loss function with respect to the output y of layer B
  • Since the gradient of Sign(x) is infinite at 0 and 0 everywhere else, the straight-through estimator is used for backpropagation. In this example, if the straight-through estimator is used, the gradient of Sign(x) is not calculated (equivalently, the gradient of Sign(x) is regarded as 1), and the gradient dl/dy is returned directly to layer A; that is, dl/dx is considered equal to dl/dy.
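  • A minimal sketch of layer B under the straight-through estimator, assuming Sign(0) = -1 and hypothetical function names: the forward pass applies Sign, while the backward pass returns dl/dy unchanged.

```python
def sign(x):
    """Sign function with the assumed convention Sign(0) = -1."""
    return 1.0 if x > 0 else -1.0


def layer_b_forward(x):
    """Forward pass of layer B: y = Sign(x)."""
    return sign(x)


def layer_b_backward_ste(dl_dy):
    """Backward pass with the straight-through estimator.

    The gradient of Sign is treated as 1, so dl/dx is simply dl/dy.
    """
    return dl_dy


x = 0.3                       # hypothetical output of layer A
y = layer_b_forward(x)        # value passed on to layer C
dl_dy = -0.8                  # hypothetical gradient returned by layer C
dl_dx = layer_b_backward_ste(dl_dy)  # gradient returned to layer A
```

  • Note that `dl_dx` does not depend on `x` at all, which is the information loss described above.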
  • an embodiment of the present application provides a method for training a neural network.
  • In this method, a differentiable fitting function replaces the binarization function (the Sign function is one binarization function) during backpropagation, so that the gradient of the loss function can be calculated using the derivative of the fitting function, which improves the accuracy of the neural network obtained by training. Moreover, the fitting function is determined based on the series expansion of the binarization function; that is, the fitting function in the embodiments of the present application is determined based on mathematical theory.
  • Compared with fitting the binarization function with a fixed function, the fitting function adopted in the embodiments of the present application is closer to the Sign function, which reduces the fitting error and improves the fitting effect, thereby improving the accuracy of the trained neural network.
  • FIG. 3 shows a schematic structural diagram of the main framework of artificial intelligence. The framework is explained below in two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing.
  • it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output.
  • data has gone through the process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementation) to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, communicates with the outside world, and is supported by the basic platform. Communication with the outside world takes place through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes a distributed computing framework, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, turning intelligent information decision-making into products and putting it into practical use. Application areas mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, smart city, and smart terminals.
  • The embodiments of the present application can be applied to the optimal design of the network structure of a neural network, and the neural network trained by the present application can be applied to various sub-fields of artificial intelligence, such as image processing, computer vision, and semantic analysis; specifically, it can be used for image classification, image segmentation, target detection, and image super-resolution reconstruction.
  • The data in the data set obtained by the infrastructure in the embodiments of the present application may be multiple pieces of data of different types obtained through sensors such as cameras and radars (also called training data; multiple pieces of training data constitute a training set), multiple pieces of image data or video data, or data such as voice and text. As long as the training set meets the requirements for iterative training of the neural network and can be used to train the neural network of the present application, the data types in the training set are not limited here.
  • the initialized binary neural network is trained by using the training data set.
  • the training process includes the process of gradient backpropagation;
  • the terminal can use the neural network for image classification.
  • the photo to be classified is a picture of a cat, and the trained neural network is used to classify it, and the classification result is a cat.
  • FIG. 5 is a system architecture diagram of a task processing system provided by an embodiment of the application.
  • The task processing system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250, and a data acquisition device 260, and the execution device 210 includes a computing module 211.
  • The data acquisition device 260 is used to obtain the open-source large-scale data set (i.e., the training set) required by the user and to store the training set in the database 230. The training device 220 trains the target model/rule 201 based on the training set maintained in the database 230, and the trained neural network obtained by the training is then used on the execution device 210.
  • the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
  • the data storage system 250 may be placed in the execution device 210 , or the data storage system 250 may be an external memory relative to the execution device 210 .
  • The trained neural network obtained after the training device 220 trains the target model/rule 201 can be applied to different systems or devices (that is, the execution device 210), which can specifically be edge devices or end-side devices, such as mobile phones, tablets, laptops, surveillance systems (e.g., cameras), security systems, etc.
  • the execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a “user” can input data to the I/O interface 212 through the client device 240 .
  • For example, the client device 240 may be a camera device of a surveillance system: the target image captured by the camera device is input to the computing module 211 of the execution device 210 as input data, the computing module 211 performs detection on the input target image to obtain a detection result, and the detection result is then output to the camera device or displayed directly on the display interface (if any) of the execution device 210. In addition, in some embodiments of the present application, the client device 240 may also be integrated in the execution device 210. For example, when the execution device 210 is a mobile phone, the target task can be obtained directly through the mobile phone (for example, a target image captured by the camera of the mobile phone, or a target voice recorded by the recording module of the mobile phone; the target task is not limited here) or received from another device (such as another mobile phone); the computing module 211 in the mobile phone then performs detection on the target task to obtain a detection result, and the detection result is presented directly on the display interface of the mobile phone.
  • FIG. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 5 the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 can also be placed in the execution device 210.
  • Likewise, in FIG. 5 the client device 240 is an external device relative to the execution device 210; in other cases, the client device 240 may also be integrated in the execution device 210.
  • the training process of the binary neural network is as follows.
  • An initial neural network with the same topology as the convolutional neural network is constructed first. The weight of each layer of the convolutional neural network is then binarized, the binarized weights are input to the initial neural network, and the initial neural network is trained using the binarized weights.
  • During training, the weights of the initial neural network are updated through multiple iterations. Each iteration includes one forward propagation and one back propagation, and the weights of the initial neural network can be updated through the gradient of back propagation.
  • Throughout training, the weight of each layer of the convolutional neural network needs to be retained, and in each back-propagation pass the weight of each layer of the convolutional neural network is updated.
  • After training is completed, the weight of each layer of the convolutional neural network is binarized again, and the binarized weights are used as the weights of the corresponding layers of the initial neural network, so as to obtain a binary neural network.
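  • The training procedure above can be sketched on a toy single-layer model. This is only an illustration under several assumptions: the retained real-valued weights are called `latent`, the loss is a squared error, the data is made up, and the backward pass uses the plain straight-through shortcut (gradient of Sign treated as 1) purely for brevity, not the fitting function proposed in the present application.

```python
def sign(x):
    """Sign function with the assumed convention Sign(0) = -1."""
    return 1.0 if x > 0 else -1.0


def train(latent, samples, lr=0.1, epochs=3):
    """Train binarized weights while retaining real-valued (latent) weights."""
    latent = list(latent)
    for _ in range(epochs):
        for inputs, target in samples:
            # Forward propagation uses the binarized copy of the weights.
            binary = [sign(w) for w in latent]
            out = sum(b * x for b, x in zip(binary, inputs))
            # Back propagation of the squared error (out - target) ** 2.
            dl_dout = 2.0 * (out - target)
            for i, x in enumerate(inputs):
                dl_dbinary = dl_dout * x
                # Straight-through shortcut: update the retained weight directly.
                latent[i] -= lr * dl_dbinary
    # After training, binarize the retained weights once more.
    return latent, [sign(w) for w in latent]


# Hypothetical data generated by the binary weights [+1, -1, +1].
samples = [([1.0, 0.0, 0.0], 1.0), ([0.0, 1.0, 0.0], -1.0), ([0.0, 0.0, 1.0], 1.0)]
final_latent, final_binary = train([0.1, 0.1, -0.1], samples)
```

  • Only `final_binary` would be stored in the deployed network; `final_latent` exists solely to accumulate small gradient updates during training.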
  • The activation value of each layer of the neural network can also be binarized during the training process, so that the input of each layer of the neural network is binarized.
  • In the embodiments of the present application, only the weights of the convolutional neural network may be binarized, only the activation values of the convolutional neural network may be binarized, or both the weights and the activation values of the convolutional neural network may be binarized.
  • The training process of the neural network is introduced below through two embodiments: in the training process of one embodiment, the weights of the neural network are binarized; in the training process of the other embodiment, the activation values of the neural network are binarized.
  • an embodiment of the present application provides an embodiment of a neural network training method, which includes:
  • In the forward propagation process, a binarization function is used to binarize the target weight, so as to obtain the weight of the first neural network layer in the neural network, where the first neural network layer is one layer of the neural network.
  • A binarization function refers to a function whose dependent variable takes exactly two values over the different value ranges of the independent variable.
  • There are various types of binarization functions, which are not specifically limited in this embodiment of the present application. Taking one binarization function as an example, when the independent variable is greater than 0, the value of the binarization function is +1, and when the independent variable is less than or equal to 0, the value of the binarization function is -1. Taking another binarization function as an example, when the independent variable is greater than 0, the value of the binarization function is +1, and when the independent variable is less than or equal to 0, the value of the binarization function is 0.
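  • The two example binarization functions can be written directly as follows; the function names are illustrative assumptions.

```python
def binarize_pm1(x):
    """+1 when x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def binarize_01(x):
    """+1 when x > 0, otherwise 0."""
    return 1 if x > 0 else 0


values = [-2.0, -0.1, 0.0, 0.3, 5.0]          # hypothetical inputs
outputs_pm1 = [binarize_pm1(v) for v in values]
outputs_01 = [binarize_01(v) for v in values]
```

  • Whatever the inputs, each function produces exactly two distinct output values, which is the defining property stated above.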
  • the target weight may be the weight of the neural network layer corresponding to the first neural network layer in the neural network to be compressed.
  • A target neural network with the same topology as the convolutional neural network must first be constructed.
  • Each neural network layer in the convolutional neural network corresponds to a neural network layer in the target neural network, so the convolutional neural network contains a neural network layer corresponding to the first neural network layer, and the weight of that layer is the target weight.
  • the first neural network layer may be any layer of neural networks in the neural network, which is not specifically limited in this embodiment of the present application.
  • a neural network includes an input layer, a hidden layer (also known as an intermediate layer) and an output layer.
  • The input layer is used to input data, the hidden layer is used to process the input data, and the output layer is used to output the processing result. To prevent a change in the weights of the input layer from changing the input data, and a change in the weights of the output layer from changing the output result, the first neural network layer is usually a hidden layer.
  • The convolutional layer (which belongs to the hidden layers) includes many convolution operators, which can also be called kernels.
  • The number of parameters in each convolutional layer can often reach tens of thousands, hundreds of thousands, or even more, so it is necessary to binarize the weights of the convolutional layers; therefore, when the neural network is a convolutional neural network, the first neural network layer can specifically be a convolutional layer.
  • The purpose of binarizing the target weight is to reduce the storage space it occupies, so the data type of the target weight is not limited, as long as the storage space occupied by the target weight is greater than the storage space occupied by the weight of the first neural network layer.
  • the data type of the target weight may be a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
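  • As a rough illustration of the storage saving, the sketch below compares the listed data types with a binarized weight. It assumes each binarized weight is packed into a single bit, and the layer size (a 3x3 convolution with 64 input and 64 output channels) is a hypothetical example.

```python
BITS_PER_WEIGHT = {"float32": 32, "float64": 64, "int32": 32, "int8": 8}
BINARY_BITS = 1  # assumption: one bit per binarized weight


def storage_bytes(num_weights, bits_per_weight):
    """Bytes needed to store the weights, rounded up to whole bytes."""
    return (num_weights * bits_per_weight + 7) // 8


num_weights = 3 * 3 * 64 * 64  # hypothetical convolutional layer
full_precision = storage_bytes(num_weights, BITS_PER_WEIGHT["float32"])
binarized = storage_bytes(num_weights, BINARY_BITS)
ratio = full_precision / binarized
```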
  • the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function to the target weight, and the fitting function is determined based on the series expansion of the binarization function.
  • That is, a fitting function is used instead of the binarization function in the process of backpropagation: the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function to the target weight.
  • Taking FIG. 2 as an example, since the gradient of the fitting function is used as the gradient of the binarization function, the gradient dl/dx = the gradient of the fitting function * the gradient dl/dy.
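  • A minimal sketch of this backward step. The fitting-function gradient is passed in as a callable; the example callable is the derivative of a single sine term (4/π)·sin(x), which is only an assumed stand-in for whichever fitting function is actually chosen.

```python
import math


def backward_through_binarization(x, dl_dy, fitting_grad):
    """dl/dx = (gradient of the fitting function at x) * dl/dy."""
    return fitting_grad(x) * dl_dy


def one_term_sine_grad(x):
    """Derivative of the assumed stand-in fitting function (4/pi) * sin(x)."""
    return (4.0 / math.pi) * math.cos(x)


dl_dy = 0.5  # hypothetical gradient returned by the next layer
dl_dx_near_zero = backward_through_binarization(0.0, dl_dy, one_term_sine_grad)
dl_dx_far = backward_through_binarization(math.pi / 2, dl_dy, one_term_sine_grad)
```

  • Unlike the straight-through estimator, the returned gradient now depends on x: it is largest near 0 and shrinks away from it.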
  • the embodiment of the present application determines the fitting function based on the series expansion of the binarization function.
  • the fitting function determined in this way has a good degree of fit with the binarization function, and the fitting error is small.
  • series expansion can be performed for any periodic function.
  • There are various types of series expansion, so there can also be various types of series expansion of the binarization function, which are not specifically limited in this embodiment of the present application.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • n is a non-negative integer, and the value of n can be set according to actual needs, which is not specifically limited in this embodiment of the present application.
  • The fitting function is fitted by at least one layer of neural network; that is, the function of one or more layers of neural network is equivalent to the fitting function. In the above example, the fitting function is formed by the superposition of multiple sine functions, so the at least one layer of neural network can be called a sine module.
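  • A sketch of such a sine-based fitting function, assuming it is the truncated standard Fourier series of a square wave with period 2π, namely (4/π)·Σ_{i=0..n} sin((2i+1)x)/(2i+1). The exact formula, any frequency factor, and the names `sine_fit` and `sine_fit_grad` are assumptions for illustration and are not taken from the application.

```python
import math


def sine_fit(x, n):
    """Sum of n + 1 sine terms approximating Sign(x) on (-pi, pi)."""
    return (4.0 / math.pi) * sum(
        math.sin((2 * i + 1) * x) / (2 * i + 1) for i in range(n + 1)
    )


def sine_fit_grad(x, n):
    """Derivative of sine_fit with respect to x: a sum of cosine terms."""
    return (4.0 / math.pi) * sum(
        math.cos((2 * i + 1) * x) for i in range(n + 1)
    )


# With more terms the value at x = pi/2 moves closer to Sign(pi/2) = 1.
error_1_term = abs(sine_fit(math.pi / 2, 0) - 1.0)
error_10_terms = abs(sine_fit(math.pi / 2, 9) - 1.0)
# The gradient at 0 grows with the number of terms, like an impulse.
grad_at_zero_10_terms = sine_fit_grad(0.0, 9)
```

  • Under this indexing, n = 9 corresponds to a superposition of 10 sine functions.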
  • For example, the binarization function is the Sign(x) function.
  • graph (a) represents the Sign(x) function;
  • graph (b) represents the gradient of the Sign(x) function;
  • graph (c) represents the Clip(x, -1, +1) function;
  • graph (d) represents the gradient of the Clip(x, -1, +1) function;
  • graph (e) represents a sine function SIN1(x);
  • graph (f) represents the gradient of the sine function SIN1(x);
  • graph (g) represents the fitting function SIN10(x) obtained by superposing 10 sine functions in the embodiment of the present application;
  • graph (h) represents the gradient of the fitting function SIN10(x) in the embodiment of the present application.
  • the Clip(x, -1, +1) function refers to a function used by the straight-through estimator instead of the Sign function.
  • The gradient of the Clip(x, -1, +1) function is 0 outside the range of -1 to 1 and 1 within the range of -1 to 1, which is equivalent to passing the gradient returned by the preceding layer straight back; as a result, the training result of the neural network is poor and the accuracy of the trained neural network is low. By contrast, the fitting function in the embodiment of the present application is close to the Sign function, so using the gradient of the fitting function for backpropagation ensures a better training effect and makes the trained neural network more accurate.
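  • The Clip-based gradient can be sketched as follows; the function names and the treatment of the endpoints -1 and +1 as inside the range are assumptions for illustration.

```python
def clip(x, lo=-1.0, hi=1.0):
    """Clip(x, lo, hi): limit x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def clip_grad(x, lo=-1.0, hi=1.0):
    """Gradient of Clip: 1 inside the range, 0 outside it."""
    return 1.0 if lo <= x <= hi else 0.0


def backward_with_clip(x, dl_dy):
    """Backward pass that uses the Clip gradient in place of the Sign gradient."""
    return clip_grad(x) * dl_dy


dl_dy = -0.8  # hypothetical gradient returned by the next layer
inside = backward_with_clip(0.4, dl_dy)    # passed back unchanged
outside = backward_with_clip(1.7, dl_dy)   # blocked
```

  • Inside the range every input receives the same gradient regardless of how close it is to 0, which is the limitation noted above.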
  • In the forward propagation process, the target weight is binarized by using a binarization function, so as to obtain the weight of the first neural network layer in the neural network;
  • In the back propagation process, the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function to the target weight, which solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, the fitting function is determined based on the series expansion of the binarization function, so the fitting function fits the binarization function more closely and the fitting effect is better, which improves the training effect of the neural network and ensures that the trained neural network has high accuracy.
  • the fitting function is determined based on the series expansion of the binarization function.
  • the fitting function can have various forms.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the series expansion of the binarization function includes an infinite number of functions, and the multiple sub-functions are part of the infinite number of functions.
  • the series expansion of the binarization function includes an infinite number of sine functions, and multiple sub-functions are multiple sine functions among them, that is, the fitting function is formed by the superposition of multiple sine functions.
  • Multiple sub-functions are determined based on the series expansion of the binarization function, and the multiple sub-functions constitute the fitting function, so that the fitting function fits the binarization function closely and the fitting effect is improved.
  • The series expansion of the binarization function includes an infinite number of functions, whereas the embodiment of the present application uses a plurality of sub-functions (that is, a finite number of sub-functions) to obtain the fitting function, so there is an error between the fitting function and the binarization function.
  • The gradient of the Sign function is an impulse function and cannot be used for backpropagation, so the gradient of the Sign function is not an ideal gradient; even if the fitting function fits the Sign function well, the gradient of the fitting function is also not an ideal gradient.
  • there is an unknown ideal gradient (which can also be understood as the optimal gradient), which can well guide the training of neural networks.
  • the embodiment of the present application introduces an error function into the fitting function, so as to reduce the influence of the error on the gradient of the fitting function, and improve the accuracy of the gradient of the fitting function.
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function.
  • an error function is added to the fitting function, that is, the fitting function is composed of a plurality of sub-functions and an error function, wherein the relevant description of the plurality of sub-functions can be understood with reference to the foregoing embodiments.
  • The error function can be fitted by at least one layer of neural network, with the aim that the error function fitted by the network is as accurate as possible so as to reduce the error.
  • the error function is fitted using a two-layer fully connected neural network with residuals.
  • The two-layer fully connected neural network is a neural network in which every neuron in one layer is connected to all neurons in the other layer; the residual refers to the difference between the actual observed value and the estimated value (the value fitted by the neural network).
  • a two-layer fully-connected neural network can fit any function, so the embodiment of the present application uses a two-layer fully-connected neural network to fit an error function.
  • the two-layer fully-connected neural network with residual is used to fit the error function, so the two-layer fully-connected neural network with residual can also be called an error fitting module.
  • There are various forms of the residual module, which are not specifically limited in this embodiment of the present application.
  • the residual module may be 0, x, or sin(x).
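  • One possible reading of the error fitting module is sketched below for a scalar input: two fully connected layers whose output is added to a residual term chosen from 0, x, or sin(x). The ReLU between the two layers, the additive combination, the hidden width, and all weight values are assumptions for illustration; the application does not fix them in this passage.

```python
import math

# The three residual forms mentioned in the text.
RESIDUALS = {
    "zero": lambda x: 0.0,
    "identity": lambda x: x,
    "sin": math.sin,
}


def error_function(x, w1, b1, w2, b2, residual="sin"):
    """e(x) = FC2(ReLU(FC1(x))) + residual(x) for a scalar input x."""
    # First fully connected layer followed by an assumed ReLU activation.
    hidden = [max(0.0, w * x + b) for w, b in zip(w1, b1)]
    # Second fully connected layer producing a single output.
    fc_out = sum(w * h for w, h in zip(w2, hidden)) + b2
    return fc_out + RESIDUALS[residual](x)


# Hypothetical parameters with a hidden width of 2.
w1, b1 = [0.5, -0.5], [0.0, 0.0]
w2, b2 = [0.1, 0.1], 0.0
e_identity = error_function(1.0, w1, b1, w2, b2, residual="identity")
e_zero = error_function(1.0, w1, b1, w2, b2, residual="zero")
```

  • In training, `w1`, `b1`, `w2`, and `b2` would be learned together with the rest of the network rather than fixed as they are here.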
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function, so that the fitting function and the binarization function have a good fit.
  • the error function can reduce the error between the fitting function and the binarization function, and reduce the error between the gradient of the fitting function and the theoretical gradient, thereby improving the training results of the neural network.
  • As shown in FIG. 8, when the fitting function is composed of multiple sub-functions and an error function, and the error function is fitted by at least one layer of neural network, operation 102 includes:
  • the gradient of the at least one layer of neural network to the target weight is calculated.
  • the gradient of the loss function to the target weight is calculated based on the gradient of the multiple sub-functions to the target weight and the gradient of the at least one layer of neural network to the target weight.
  • For example, the sum of the gradients of the multiple sub-functions to the target weight and the gradient of the at least one layer of neural network to the target weight can be calculated first, and this sum is then multiplied by the gradient of the loss function to the weight of the first neural network layer, so as to obtain the gradient of the loss function to the target weight.
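  • The combination described above can be sketched as a single function. The argument names and the numeric values are hypothetical; in practice the individual gradients would come from the sub-functions and from the network that fits the error function.

```python
def loss_grad_wrt_target_weight(sub_function_grads, error_net_grad, dl_dwb):
    """Combine the gradients as described in the text.

    sub_function_grads: gradients of the sub-functions to the target weight
    error_net_grad:     gradient of the error-fitting network to the target weight
    dl_dwb:             gradient of the loss to the first layer's binarized weight
    """
    fitting_grad = sum(sub_function_grads) + error_net_grad
    return fitting_grad * dl_dwb


# Hypothetical values for one target weight.
grad = loss_grad_wrt_target_weight(
    sub_function_grads=[1.2, -0.4, 0.3],
    error_net_grad=0.05,
    dl_dwb=-2.0,
)
```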
  • In the example of FIG. 9, the error function is shown in place of the at least one layer of neural network for ease of description.
  • S_n(x) represents the multiple sub-functions, and e(x) represents the error function.
  • In the embodiment of the present application, the gradients of the multiple sub-functions to the target weight and the gradient of the at least one layer of neural network to the target weight are calculated, and the gradient of the loss function to the target weight is calculated based on them. Since the at least one layer of neural network is used to fit the error function, its gradient to the target weight compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient, so that the resulting gradient of the loss function to the target weight is more accurate and the training effect of the neural network is improved.
  • the embodiment of the present application also provides another embodiment of a neural network training method, which includes:
  • In the forward propagation process, the activation value of the second neural network layer is binarized by using a binarization function, so as to obtain the input of the first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network.
  • The second neural network layer and the first neural network layer are two connected neural network layers, and the activation value of the second neural network layer (that is, the output of the second neural network layer) is the input of the first neural network layer.
  • the embodiment of the present application uses a binarization function to perform binarization processing on the activation value of the second neural network layer, so that the value input to the first neural network layer is a value after binarization processing.
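  • The data flow between the two layers can be sketched as follows. The Tanh activation, the layer sizes, the convention Sign(0) = -1, and all numeric values are assumptions for illustration.

```python
import math


def sign(x):
    """Sign function with the assumed convention Sign(0) = -1."""
    return 1.0 if x > 0 else -1.0


def second_layer(inputs, weights):
    """Second neural network layer: produces real-valued activation values."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def first_layer(binary_inputs, weights):
    """First neural network layer: consumes the binarized activation values."""
    return [sum(w * x for w, x in zip(row, binary_inputs)) for row in weights]


inputs = [1.0, -2.0]                             # hypothetical data
second_weights = [[0.5, 0.1], [0.2, 0.7]]        # hypothetical weights
first_weights = [[1.0, 2.0]]                     # hypothetical weights

activations = second_layer(inputs, second_weights)    # real-valued
binary_input = [sign(a) for a in activations]          # binarization step
output = first_layer(binary_input, first_weights)
```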
  • the gradient of the loss function to the activation value is calculated by taking the gradient of the fitting function as the gradient of the binarization function, and the fitting function is determined based on the series expansion of the binarization function.
  • the data type of the activation value is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function.
  • the error function is fitted using a two-layer fully connected neural network with residuals.
  • When the fitting function is composed of a plurality of sub-functions and an error function, and the error function is fitted by at least one layer of neural network, operation 302 includes:
  • the gradient of the loss function to the activation value is calculated based on the gradient of the multiple sub-functions to the activation value and the gradient of the at least one layer of neural network to the activation value.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • The difference between the embodiment shown in FIG. 10 and the embodiment shown in FIG. 6 lies in the object being processed.
  • In the embodiment shown in FIG. 10, the activation value of the second neural network layer is binarized in the forward propagation process, and the gradient of the loss function to the activation value is calculated in the back propagation process.
  • In the embodiment shown in FIG. 6, the target weight is binarized in the forward propagation process, and the gradient of the loss function to the target weight is calculated in the back propagation process.
  • In other respects, the embodiment shown in FIG. 10 is the same as the embodiment shown in FIG. 6, and can therefore be understood with reference to the embodiment shown in FIG. 6.
  • the activation value of the second neural network layer is binarized by using a binarization function, so as to obtain the input of the first neural network layer in the neural network;
  • the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function with respect to the activation value, which solves the problem that the gradient of the binarization function cannot be used for backpropagation; moreover, the fitting function is determined based on the series expansion of the binarization function, so it fits the binarization function more closely, which improves the training of the neural network and helps ensure that the trained neural network is accurate.
  • the embodiments of the present application also provide a network structure of a neural network, where the neural network includes a first neural network module, a second neural network module and a first neural network layer; the first neural network module is composed of one or more neural network layers and implements the binarization step in the embodiment shown in FIG. 6; the second neural network module is composed of one or more neural network layers and implements the gradient calculation step in the embodiment shown in FIG. 6.
  • the embodiments of the present application also provide a network structure of a neural network, where the neural network includes a first neural network module, a second neural network module and a first neural network layer; the first neural network module is composed of one or more neural network layers and implements the binarization step in the embodiment shown in FIG. 10; the second neural network module is composed of one or more neural network layers and implements the gradient calculation step in the embodiment shown in FIG. 10.
  • an embodiment of the present application also provides a training device for a neural network, including:
  • the binarization processing unit 401 is configured to binarize the target weight with a binarization function during forward propagation, so as to obtain the weight of the first neural network layer in the neural network, where the first neural network layer is one layer of the neural network.
  • the gradient calculation unit 402 is configured to calculate, during backpropagation, the gradient of the loss function with respect to the target weight by taking the gradient of the fitting function as the gradient of the binarization function, where the fitting function is determined based on the series expansion of the binarization function.
  • the data type of the target weight is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function.
  • the error function is fitted using a two-layer fully connected neural network with residuals.
  • the error function is fitted by at least one layer of neural network; the gradient calculation unit 402 is specifically configured to, during backpropagation: calculate the gradients of the multiple sub-functions with respect to the target weight; calculate the gradient of the at least one layer of neural network with respect to the target weight; and calculate the gradient of the loss function with respect to the target weight based on the gradients of the multiple sub-functions with respect to the target weight and the gradient of the at least one layer of neural network with respect to the target weight.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • an embodiment of the present application further provides a training device for a neural network, including:
  • the binarization processing unit 501 is configured to use a binarization function to binarize the activation value of the second neural network layer in the forward propagation process, so as to obtain the input of the first neural network layer, the first neural network layer and the second neural network layer belong to the same neural network;
  • the gradient calculation unit 502 is configured to calculate, during backpropagation, the gradient of the loss function with respect to the activation value by taking the gradient of the fitting function as the gradient of the binarization function, where the fitting function is determined based on the series expansion of the binarization function.
  • the data type of the activation value is a 32-bit floating point type, a 64-bit floating point type, a 32-bit integer type, or an 8-bit integer type.
  • the fitting function is composed of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  • the fitting function is composed of a plurality of sub-functions and an error function, and the plurality of sub-functions are determined based on the series expansion of the binarization function.
  • the error function is fitted using a two-layer fully connected neural network with residuals.
  • the error function is fitted by at least one layer of neural network; the gradient calculation unit 502 is specifically configured to, during backpropagation: calculate the gradients of the multiple sub-functions with respect to the activation value; calculate the gradient of the at least one layer of neural network with respect to the activation value; and calculate the gradient of the loss function with respect to the activation value based on the gradients of the multiple sub-functions with respect to the activation value and the gradient of the at least one layer of neural network with respect to the activation value.
  • the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  • FIG. 13 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training device 1800 may be deployed with the neural network training apparatus described in the embodiment corresponding to FIG. 11 or FIG. 12, and implements the functions of that training apparatus.
  • the training device 1800 is composed of one or more servers.
  • Training device 1800 may also include one or more power supplies 1826, one or more wired or wireless network interfaces 1850, one or more input and output interfaces 1858, and/or, one or more operating systems 1841, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the central processing unit 1822 is configured to execute the training method executed by the training apparatus of the neural network in the embodiment corresponding to FIG. 11 and FIG. 12 .
  • the central processing unit 1822 can be used to:
  • a binarization function is used to binarize the target weight to obtain the weight of the first neural network layer in the neural network, where the first neural network layer is one layer of the neural network;
  • the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function with respect to the target weight, where the fitting function is determined based on the series expansion of the binarization function.
  • the central processing unit 1822 can also be used to:
  • the activation value of the second neural network layer is binarized by using a binarization function, so as to obtain the input of the first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network;
  • the gradient of the fitting function is used as the gradient of the binarization function to calculate the gradient of the loss function with respect to the activation value, where the fitting function is determined based on the series expansion of the binarization function.
  • Embodiments of the present application further provide a chip including one or more processors. Part or all of the processor is used to read and execute the computer program stored in the memory, so as to execute the methods of the foregoing embodiments.
  • the chip includes a memory, and the memory is connected to the processor through a circuit or a wire. Further optionally, the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used for receiving data and/or information to be processed, the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface may be an input-output interface.
  • some of the one or more processors may also implement some steps in the above method by means of dedicated hardware, for example, the processing involving the neural network model may be performed by a dedicated neural network processor or graphics processor.
  • the methods provided in the embodiments of the present application may be implemented by one chip, or may be implemented collaboratively by multiple chips.
  • Embodiments of the present application also provide a computer storage medium for storing the computer software instructions used by the above-mentioned computer device, including a program designed to be executed by the computer device.
  • the computer equipment can be the training device of the neural network described in the aforementioned FIG. 11 or FIG. 12 .
  • Embodiments of the present application also provide a computer program product, where the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the processes in the methods shown in the foregoing embodiments.
  • An embodiment of the present application further provides a server, which may be a common server or a cloud server, for executing the method in the above-mentioned embodiment shown in FIG. 6 and/or FIG. 10 .
  • An embodiment of the present application further provides a terminal device, where a neural network trained by the method in the embodiment shown in FIG. 6 and/or FIG. 10 is deployed on the terminal device.
  • the terminal device may be any terminal device capable of deploying a neural network; because the neural network trained by the embodiments of the present application is a compressed binary neural network, it occupies little storage space and computes quickly, although its accuracy is slightly lower than that of a traditional uncompressed neural network.
  • the neural networks trained by the embodiments of the present application are therefore mostly deployed in terminal devices with limited storage space or limited computing capability; for example, the storage space and computing capability of mobile terminal devices are limited, so the terminal device in the embodiments of the present application may be a mobile terminal device, specifically a mobile phone, a tablet computer, a vehicle-mounted device, a camera, a robot, and the like.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Image Analysis (AREA)

Abstract

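A neural network training method and related device. The method includes: during forward propagation, binarizing a target weight with a binarization function and using the binarized data as the weight of a first neural network layer in the neural network; and during backpropagation, computing the gradient of the loss function with respect to the target weight by taking the gradient of a fitting function as the gradient of the binarization function. Because the fitting function is determined based on a series expansion of the binarization function, it fits the binarization function well, which improves training and raises the accuracy of the trained neural network.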

Description

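Neural Network Training Method and Related Device
This application claims priority to Chinese Patent Application No. 202110132041.6, filed on January 30, 2021 and entitled "Neural Network Training Method and Related Device", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of this application relate to the field of deep learning, and in particular to a neural network training method and related device.
Background
Deep learning (DL) is a new research direction within machine learning (ML); it was introduced to bring machine learning closer to its original goal, artificial intelligence (AI).
With the development of deep learning, deep neural networks (DNNs) have been widely applied in many fields. For example, the convolutional neural network (CNN), one kind of deep neural network, has been applied successfully to image classification, object detection, and similar tasks. However, a CNN requires enormous computing resources, so it is difficult to deploy one directly on devices with limited computing power such as mobile phones, cameras, and robots.
To address this problem, many compression and acceleration algorithms for neural networks have been proposed. Applied to deep neural networks, such algorithms can achieve very high compression and speed-up ratios with very little impact on the accuracy of the original network. One approach binarizes the weights, which occupy a large amount of space, to obtain a binary neural network (BNN) and thereby reduce the storage a CNN requires; it likewise binarizes the activation values, which also occupy a large amount of space, to increase the computing speed of the network.
Typically, the Sign function is used to binarize the 32-bit floating-point weights and activation values of a CNN, converting each of them to +1 or -1. Weights and activation values that originally needed 32 bits of storage then need only 1 bit, which saves storage space.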
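The storage saving described above can be made concrete with a small sketch. It assumes NumPy, maps 0 to -1 (the text does not fix the value at 0 here), and uses hypothetical helper names `binarize_sign`, `pack_bits`, and `unpack_bits`; it is an illustration, not the implementation used in the application.

```python
import numpy as np

def binarize_sign(w):
    """Map each weight to +1 or -1. Sending 0 to -1 is an assumption of this sketch."""
    return np.where(w > 0, 1, -1).astype(np.int8)

def pack_bits(wb):
    """Store a +1/-1 array with one bit per entry (1 encodes +1, 0 encodes -1)."""
    return np.packbits((wb.ravel() > 0).astype(np.uint8))

def unpack_bits(packed, shape):
    """Recover the +1/-1 array from its packed form."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed)[:n]
    return (bits.astype(np.int8) * 2 - 1).reshape(shape)

rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)
binary = binarize_sign(weights)
packed = pack_bits(binary)

print(weights.nbytes, "bytes as float32")     # 64 * 64 * 4 bytes
print(packed.nbytes, "bytes as packed bits")  # 64 * 64 / 8 bytes
```

The packed form is 32 times smaller than the float32 array, which matches the 32-bit to 1-bit reduction stated in the text.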
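However, the gradient of the Sign function is an impulse function: it is infinite at 0 and 0 everywhere else. The gradient of the Sign function therefore cannot be used for backpropagation when training a binary neural network.
At present, the straight-through estimator (STE) is the main way of working around this problem. Specifically, the gradient of the Sign function is not computed during backpropagation; instead, the gradient from the layer above the layer containing the Sign function is passed back directly.
Because the gradient of the Sign function is ignored during backpropagation, a binary neural network trained with the straight-through estimator has low accuracy.
Summary
Embodiments of this application provide a neural network training method and related device. The training method replaces the gradient of the binarization function with the gradient of a fitting function, and can thereby improve the accuracy of the trained neural network.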
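A first aspect of the embodiments of this application provides a neural network training method, including: during forward propagation, binarizing a target weight with a binarization function to obtain the weight of a first neural network layer in the neural network, where the first neural network layer is one layer of the neural network and may specifically be a convolutional layer; and during backpropagation, computing the gradient of the loss function with respect to the target weight by taking the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function. A binarization function is a function whose dependent variable takes exactly two values across the different ranges of its independent variable. There are many kinds of binarization function; for example, one may convert the target weight to +1 or -1, and another may convert it to +1 or 0.
Forward propagation means computing the intermediate variables of each layer in order from the input layer to the output layer, where an intermediate variable may be the output value of a layer. Backpropagation means computing, in order from the output layer to the input layer, the intermediate variables of each layer and the derivatives of the loss function with respect to each parameter, where an intermediate variable may again be the output value of a layer.
During forward propagation, binarizing the target weight to obtain the weight of the first neural network layer reduces the storage space that layer occupies. Because the binarized weight is +1 or -1, or +1 or 0, multiplications can also be replaced by additions, which reduces the amount of computation. During backpropagation, computing the gradient of the loss function with respect to the target weight by taking the gradient of the fitting function as the gradient of the binarization function solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, because the fitting function is determined based on a series expansion of the binarization function, it fits the binarization function more closely, which improves training and helps ensure that the trained neural network is accurate.
In one implementation, the data type of the target weight is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer. The target weight may also have any other data type, provided that it occupies more storage than the binarized weight.
This implementation provides several possible data types for the target weight.
In one implementation, the fitting function is composed of multiple sub-functions, which are determined based on the series expansion of the binarization function.
Because the sub-functions are determined based on the series expansion of the binarization function and the fitting function is composed of them, the fitting function fits the binarization function closely, which improves the training of the neural network.
In one implementation, the fitting function is composed of multiple sub-functions and an error function. The sub-functions are determined based on the series expansion of the binarization function; the error function may take many forms and may be fitted by one or more neural network layers.
Introducing an error function into the fitting function compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient. This reduces the effect of those errors on the gradient of the fitting function and makes that gradient more accurate.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual. A two-layer fully connected neural network is one in which every neuron in one layer is connected to all neurons in the other layer; the residual is the difference between the actual observed value and the estimated value (the value fitted by the neural network). Because it is used to fit the error function, the two-layer fully connected neural network with a residual may also be called the error fitting module.
The two-layer fully connected neural network can be regarded as part of the neural network that contains the first neural network layer. If it consists of a third neural network layer and a fourth neural network layer, the error function can be written as $e(x) = \sigma(xW_1)W_2 + \delta(x)$, where $W_1$ is the weight of the third neural network layer, $W_2$ is the weight of the fourth neural network layer, $\sigma(xW_1)$ is the activation function, $\delta(x)$ is the residual module, and $x$ is the target weight. The residual module $\delta(x)$ may take many forms, which the embodiments of this application do not specifically limit; for example, it may be 0, $x$, or $\sin(x)$.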
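A minimal sketch of the error function e(x) = σ(xW₁)W₂ + δ(x) and its derivative with respect to x may help. It makes several assumptions that the text does not state: NumPy is used, tanh stands in for the unspecified activation σ, x is treated as a 1-D array of scalars with W₁ of shape (1, k) and W₂ of shape (k, 1), and the names `residual` and `error_fn` are hypothetical.

```python
import numpy as np

def residual(x, kind):
    """delta(x) and its derivative for the three example forms named in the text."""
    if kind == "zero":
        return np.zeros_like(x), np.zeros_like(x)
    if kind == "identity":
        return x, np.ones_like(x)
    if kind == "sin":
        return np.sin(x), np.cos(x)
    raise ValueError(kind)

def error_fn(x, W1, W2, kind="identity"):
    """Return e(x) and de/dx for each scalar in the 1-D array x.

    W1 has shape (1, k) and W2 has shape (k, 1); tanh stands in for sigma
    (an assumption of this sketch).
    """
    h = np.tanh(x[:, None] @ W1)                  # (N, k) hidden activations
    value = (h @ W2)[:, 0]                        # (N,) output of the two layers
    dvalue = (((1.0 - h ** 2) * W1) @ W2)[:, 0]   # chain rule through tanh
    d, dd = residual(x, kind)
    return value + d, dvalue + dd
```

Returning the derivative alongside the value reflects how the module is used here: only its gradient with respect to x enters backpropagation.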
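This implementation provides a specific way of fitting the error function.
In one implementation, the error function is fitted by at least one neural network layer, and computing the gradient of the loss function with respect to the target weight during backpropagation, by taking the gradient of the fitting function as the gradient of the binarization function, includes: computing the gradients of the multiple sub-functions with respect to the target weight; computing the gradient of the at least one neural network layer with respect to the target weight; and computing the gradient of the loss function with respect to the target weight based on those gradients. Specifically, the gradients of the sub-functions and the gradient of the at least one neural network layer with respect to the target weight may first be summed, and the sum then multiplied by the gradient of the loss function with respect to the weight of the first neural network layer, which yields the gradient of the loss function with respect to the target weight.
During backpropagation, the gradients of the multiple sub-functions and of the at least one neural network layer with respect to the target weight are computed, and the gradient of the loss function with respect to the target weight is computed from them. Because the at least one neural network layer is used to fit the error function, its gradient with respect to the target weight compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and for the error between the gradient of the binarization function and the ideal gradient. The resulting gradient of the loss function with respect to the target weight is therefore more accurate, which improves the training of the neural network.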
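The sum-then-multiply step described above can be sketched in a few lines. The function name `grad_wrt_target_weight` and its argument layout are hypothetical; the local gradients are assumed to have been computed elsewhere and to share one shape.

```python
import numpy as np

def grad_wrt_target_weight(sub_fn_grads, error_net_grad, upstream_grad):
    """Combine local gradients into dLoss/d(target weight).

    sub_fn_grads:   list of arrays, one per sub-function,
                    each d(sub-function)/d(target weight)
    error_net_grad: array, d(error-fitting network)/d(target weight)
    upstream_grad:  array, dLoss/d(weight of the first neural network layer)
    """
    local = np.sum(sub_fn_grads, axis=0) + error_net_grad  # gradient of the fitting function
    return local * upstream_grad                           # chain rule
```

For example, with sub-function gradients `[1.0, 2.0]` and `[0.5, -1.0]`, an error-network gradient `[0.1, 0.2]`, and an upstream gradient `[2.0, -1.0]`, the result is `[(1.5 + 0.1) * 2.0, (1.0 + 0.2) * -1.0]`.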
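In one implementation, the series expansion of the binarization function is its Fourier series expansion, its wavelet series expansion, or its discrete Fourier series expansion.
This implementation provides several feasible options for the series expansion of the binarization function.
A second aspect of the embodiments of this application provides a neural network training method, including: during forward propagation, binarizing the activation value of a second neural network layer with a binarization function to obtain the input of a first neural network layer, where the first and second neural network layers belong to the same neural network; and during backpropagation, computing the gradient of the loss function with respect to the activation value by taking the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function. A binarization function is a function whose dependent variable takes exactly two values across the different ranges of its independent variable; there are many kinds, for example one that maps its input to +1 or -1 and one that maps it to +1 or 0. An activation value is a value produced by an activation function. An activation function is a function that runs on a neuron of the neural network, is usually non-linear, and maps the neuron's input to its output; activation functions include, but are not limited to, the Sigmoid, Tanh, and ReLU functions.
Forward propagation means computing the intermediate variables of each layer in order from the input layer to the output layer, where an intermediate variable may be the output value of a layer. Backpropagation means computing, in order from the output layer to the input layer, the intermediate variables of each layer and the derivatives of the loss function with respect to each parameter, where an intermediate variable may again be the output value of a layer.
During forward propagation, binarizing the activation value of the second neural network layer to obtain the input of the first neural network layer reduces the storage space occupied by the first neural network layer. Because the binarized values are +1 or -1, or +1 or 0, multiplications can also be replaced by additions, which reduces the amount of computation. During backpropagation, computing the gradient of the loss function with respect to the activation value by taking the gradient of the fitting function as the gradient of the binarization function solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, because the fitting function is determined based on a series expansion of the binarization function, it fits the binarization function more closely, which improves training and helps ensure that the trained neural network is accurate.
In one implementation, the data type of the activation value is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
This implementation provides several possible data types for the activation value.
In one implementation, the fitting function is composed of multiple sub-functions, which are determined based on the series expansion of the binarization function.
Because the sub-functions are determined based on the series expansion of the binarization function and the fitting function is composed of them, the fitting function fits the binarization function closely, which improves the training of the neural network.
In one implementation, the fitting function is composed of multiple sub-functions and an error function. The sub-functions are determined based on the series expansion of the binarization function; the error function may take many forms and may be fitted by one or more neural network layers.
Introducing an error function into the fitting function compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient. This reduces the effect of those errors on the gradient of the fitting function and makes that gradient more accurate.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual. A two-layer fully connected neural network is one in which every neuron in one layer is connected to all neurons in the other layer; the residual is the difference between the actual observed value and the estimated value (the value fitted by the neural network). Because it is used to fit the error function, the two-layer fully connected neural network with a residual may also be called the error fitting module.
The two-layer fully connected neural network can be regarded as part of the neural network that contains the first neural network layer. If it consists of a third neural network layer and a fourth neural network layer, the error function can be written as $e(x) = \sigma(xW_1)W_2 + \delta(x)$, where $W_1$ is the weight of the third neural network layer, $W_2$ is the weight of the fourth neural network layer, $\sigma(xW_1)$ is the activation function, $\delta(x)$ is the residual module, and $x$ is the activation value. The residual module $\delta(x)$ may take many forms, which the embodiments of this application do not specifically limit; for example, it may be 0, $x$, or $\sin(x)$.
This implementation provides a specific way of fitting the error function.
In one implementation, the error function is fitted by at least one neural network layer, and computing the gradient of the loss function with respect to the activation value during backpropagation, by taking the gradient of the fitting function as the gradient of the binarization function, includes: computing the gradients of the multiple sub-functions with respect to the activation value; computing the gradient of the at least one neural network layer with respect to the activation value; and computing the gradient of the loss function with respect to the activation value based on those gradients. Specifically, the gradients of the sub-functions and the gradient of the at least one neural network layer with respect to the activation value may first be summed, and the sum then multiplied by the gradient passed back (that is, the gradient of the loss function with respect to the activation value of the first neural network layer), which yields the gradient of the loss function with respect to the activation value of the second neural network layer.
During backpropagation, the gradients of the multiple sub-functions and of the error function with respect to the activation value are computed, and the gradient of the loss function with respect to the activation value is computed from them. The gradient of the error function with respect to the activation value compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and for the error between the gradient of the binarization function and the ideal gradient. The resulting gradient of the loss function with respect to the activation value is therefore more accurate, which improves the training of the neural network.
In one implementation, the series expansion of the binarization function is its Fourier series expansion, its wavelet series expansion, or its discrete Fourier series expansion.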
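A third aspect of the embodiments of this application provides a network structure of a neural network. The neural network includes a first neural network module, a second neural network module, and a first neural network layer. The first neural network module consists of one or more neural network layers and implements the binarization step in any possible implementation of the first aspect; the second neural network module consists of one or more neural network layers and implements the gradient calculation step in any possible implementation of the first aspect.
A fourth aspect of the embodiments of this application provides a network structure of a neural network. The neural network includes a first neural network module, a second neural network module, and a first neural network layer. The first neural network module consists of one or more neural network layers and implements the binarization step in any possible implementation of the second aspect; the second neural network module consists of one or more neural network layers and implements the gradient calculation step in any possible implementation of the second aspect.
A fifth aspect of the embodiments of this application provides a neural network training apparatus, including: a binarization processing unit, configured to binarize a target weight with a binarization function during forward propagation to obtain the weight of a first neural network layer in the neural network, where the first neural network layer is one layer of the neural network; and a gradient calculation unit, configured to compute, during backpropagation, the gradient of the loss function with respect to the target weight by taking the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
In one implementation, the data type of the target weight is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
In one implementation, the fitting function is composed of multiple sub-functions, which are determined based on the series expansion of the binarization function.
In one implementation, the fitting function is composed of multiple sub-functions and an error function, and the sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual.
In one implementation, the error function is fitted by at least one neural network layer; the gradient calculation unit is specifically configured to, during backpropagation: compute the gradients of the multiple sub-functions with respect to the target weight; compute the gradient of the at least one neural network layer with respect to the target weight; and compute the gradient of the loss function with respect to the target weight based on the gradients of the multiple sub-functions and of the at least one neural network layer with respect to the target weight.
In one implementation, the series expansion of the binarization function is its Fourier series expansion, its wavelet series expansion, or its discrete Fourier series expansion.
For the specific implementation of the above units, related descriptions, and technical effects, refer to the description of the first aspect of the embodiments of this application.
A sixth aspect of the embodiments of this application provides a neural network training apparatus, including: a binarization processing unit, configured to binarize the activation value of a second neural network layer with a binarization function during forward propagation to obtain the input of a first neural network layer, where the first and second neural network layers belong to the same neural network; and a gradient calculation unit, configured to compute, during backpropagation, the gradient of the loss function with respect to the activation value by taking the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
In one implementation, the data type of the activation value is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
In one implementation, the fitting function is composed of multiple sub-functions, which are determined based on the series expansion of the binarization function.
In one implementation, the fitting function is composed of multiple sub-functions and an error function, and the sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual.
In one implementation, the error function is fitted by at least one neural network layer; the gradient calculation unit is specifically configured to, during backpropagation: compute the gradients of the multiple sub-functions with respect to the activation value; compute the gradient of the at least one neural network layer with respect to the activation value; and compute the gradient of the loss function with respect to the activation value based on the gradients of the multiple sub-functions and of the at least one neural network layer with respect to the activation value.
In one implementation, the series expansion of the binarization function is its Fourier series expansion, its wavelet series expansion, or its discrete Fourier series expansion.
For the specific implementation of the above units, related descriptions, and technical effects, refer to the description of the second aspect of the embodiments of this application.
A seventh aspect of the embodiments of this application provides a training device, including one or more processors and a memory, where the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions so that the training device implements the method of any implementation of the first aspect or the second aspect.
An eighth aspect of the embodiments of this application provides a computer-readable storage medium including computer-readable instructions which, when run on a computer, cause the computer to execute the method of any implementation of the first aspect or the second aspect.
A ninth aspect of the embodiments of this application provides a chip including one or more processors. Some or all of the processors are configured to read and execute a computer program stored in a memory, so as to execute the method in any possible implementation of the first aspect or the second aspect.
Optionally, the chip further includes a memory, which is connected to the processor through a circuit or a wire. Further optionally, the chip also includes a communication interface, to which the processor is connected. The communication interface receives data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes it, and outputs the result through the communication interface. The communication interface may be an input/output interface.
In some implementations, some of the one or more processors may implement some steps of the above method by means of dedicated hardware; for example, processing that involves the neural network model may be carried out by a dedicated neural network processor or a graphics processor.
The method provided in the embodiments of this application may be implemented by one chip or by multiple chips working together.
A tenth aspect of the embodiments of this application provides a computer program product including computer software instructions that can be loaded by a processor to implement the method of any implementation of the first aspect or the second aspect.
An eleventh aspect of the embodiments of this application provides a server, which may be a cloud server, configured to execute the method in any possible implementation of the first aspect or the second aspect.
A twelfth aspect of the embodiments of this application provides a terminal device on which a neural network trained by the method of any implementation of the first aspect or the second aspect is deployed.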
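Brief Description of the Drawings
FIG. 1 is a schematic diagram of the computation performed by a neuron;
FIG. 2 is a schematic diagram of an embodiment of gradient calculation using the straight-through estimator;
FIG. 3 is a schematic structural diagram of the main framework of artificial intelligence;
FIG. 4 is a schematic diagram of an application scenario in the embodiments of this application;
FIG. 5 is a system architecture diagram of the task processing system provided by the embodiments of this application;
FIG. 6 is a schematic diagram of one embodiment of the neural network training method provided by the embodiments of this application;
FIG. 7 is a schematic comparison of the fitting effect of several functions;
FIG. 8 is a schematic flowchart of computing the gradient of the loss function with respect to the target weight;
FIG. 9 is a schematic diagram of an example of computing the gradient of the loss function with respect to the target weight;
FIG. 10 is a schematic diagram of another embodiment of the neural network training method provided by the embodiments of this application;
FIG. 11 is a schematic diagram of one embodiment of the neural network training apparatus provided by the embodiments of this application;
FIG. 12 is a schematic diagram of another embodiment of the neural network training apparatus provided by the embodiments of this application;
FIG. 13 is a schematic diagram of an embodiment of the training device in the embodiments of this application.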
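Detailed Description
Embodiments of this application provide a neural network training method and related device that determine a fitting function for the binarization function based on a series expansion of the binarization function, and use the gradient of that fitting function in place of the gradient of the binarization function for backpropagation. This avoids the loss of accuracy caused by ignoring the gradient of the binarization function, so the embodiments of this application can improve the accuracy of the neural network.
The terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Terms used in this way are interchangeable where appropriate; they are merely the means of distinguishing objects with the same attributes when describing the embodiments of this application. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to the process, method, product, or device.
Before the embodiments of this application are described, current neural network binarization techniques and the related background are briefly introduced to make the embodiments easier to follow.
In deep learning, neural networks are used everywhere. The central processing unit (CPU) is gradually becoming unable to meet the high-concurrency and high-computation demands of deep neural networks such as convolutional neural networks (CNNs). The graphics processing unit (GPU) partly solves these problems, but its high power consumption and high price limit its use on mobile devices (including end-side and edge devices); in general only enterprises and research institutes can buy high-end GPUs for training, testing, and applying neural networks. Some mobile phone chips now integrate a neural network processing unit (NPU), but balancing power consumption and performance remains an open problem.
Two technical problems mainly limit the use of deep neural networks on mobile devices: (1) the amount of computation is too large; and (2) the number of parameters is too large. Take a CNN as an example. Convolution is computationally expensive: for a convolution kernel with several hundred thousand parameters, the number of floating-point operations (FLOPs) of the convolution can reach tens of millions, and the total for an ordinary n-layer CNN can reach billions of FLOPs. A CNN that runs in real time on a GPU is very slow on a mobile device, so when mobile computing resources cannot support real-time operation of existing CNNs, the amount of convolution computation has to be reduced. In addition, in commonly used CNNs each convolutional layer often has tens of thousands, hundreds of thousands, or more parameters; summed over all n layers, the parameters can number in the tens of millions, and each is represented as a 32-bit floating-point number, so hundreds of megabytes of memory or cache are needed to store them. Memory and cache are very limited on mobile devices, so reducing the number of convolutional-layer parameters to fit CNNs onto such devices is also an urgent problem. The binary neural network (BNN) emerged against this background.
The BNN commonly used today binarizes the weights and activation values of an existing neural network: each weight in the weight matrix of each layer, and each activation value of each layer, is assigned either +1 or -1, or alternatively either +1 or 0. A BNN does not change the structure of the original network; it mainly optimizes gradient descent, weight updating, and the convolution operation. Binarizing the weights not only reduces the storage they occupy but also turns expensive multiplications into additions and subtractions, which reduces the amount of computation and increases speed. Likewise, binarizing the activation values reduces computation and increases speed.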
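The claim that binarized weights turn multiplications into additions and subtractions can be illustrated with a dot product. The sketch below assumes NumPy and +1/-1 weights; `binary_dot` is a hypothetical name, and a real BNN kernel would typically use bit-level operations rather than array indexing.

```python
import numpy as np

def binary_dot(x, wb):
    """Dot product of x with a +1/-1 weight vector using only sums:
    add the inputs whose weight is +1, subtract those whose weight is -1."""
    return x[wb > 0].sum() - x[wb < 0].sum()

x = np.array([0.5, -1.25, 2.0, 3.0])
wb = np.array([1, -1, -1, 1])

result = binary_dot(x, wb)   # (0.5 + 3.0) - (-1.25 + 2.0)
reference = x @ wb           # ordinary multiply-accumulate
```

Both `result` and `reference` evaluate to 2.75, so no multiplication is needed once the weights are restricted to +1 and -1.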
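An activation value is a value produced by an activation function. An activation function is a function that runs on a neuron of the neural network, is usually non-linear, and maps the neuron's input to its output. Activation functions include, but are not limited to, the Sigmoid, Tanh, and ReLU functions.
A concrete example illustrates activation functions and activation values. As shown in FIG. 1, z1 and z2 are input to a neuron, which computes w1*z1+w2*z2, where w1 and w2 are weights. The activation function then converts the linear value w1*z1+w2*z2 into a non-linear value, which is the output of the neuron and may also be called its activation value.
It follows that without an activation function, the output of each layer is a linear function of its input, so however many layers the network has, its output is a linear combination of its input; this is the original perceptron. With an activation function, non-linearity is introduced into the neurons, so the network can approximate any non-linear function arbitrarily closely and can be applied to many non-linear models.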
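The statement that stacked layers without an activation function collapse to a single linear map can be checked numerically. This sketch assumes NumPy, random weights, and ReLU as the example activation; the shapes and names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))   # weights of a first layer
W2 = rng.standard_normal((4, 2))   # weights of a second layer
x = rng.standard_normal((5, 3))    # five input samples

def relu(z):
    return np.maximum(z, 0.0)

two_linear_layers = (x @ W1) @ W2     # no activation between the layers
one_linear_layer = x @ (W1 @ W2)      # a single equivalent weight matrix
with_activation = relu(x @ W1) @ W2   # a non-linearity breaks the equivalence
```

The first two results agree to floating-point precision, while the third generally differs, which is why the activation function is what lets the network represent non-linear functions.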
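There are currently two main approaches to binarization. The first is the deterministic method based on the sign function (also called the Sign function); the second is the stochastic method (also called the statistical method). In theory the second is more reasonable, but in practice it requires hardware to generate random numbers, which is difficult. The second approach has therefore not yet been used in practice, and the first, binarization with the Sign function, is the one adopted.
The sign function is defined as follows: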
Figure PCTCN2022073955-appb-000001
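where $W$ is the weight of each layer of the neural network and $W_b$ is the binarized weight.
As the formula shows, the gradient of the Sign function is an impulse function: it is infinite at 0 and 0 everywhere else.
Training a neural network consists of two processes, forward propagation and backpropagation. Forward propagation means computing the intermediate variables of each layer in order from the input layer to the output layer, where an intermediate variable may be the output value of a layer. Backpropagation means computing, in order from the output layer to the input layer, the intermediate variables of each layer and the derivatives of the loss function with respect to each parameter, where an intermediate variable may again be the output value of a layer.
The loss function, also called the cost function, maps a random event, or the values of its associated random variables, to a non-negative real number that represents the "risk" or "loss" of that event.
Because the gradient of the Sign function is infinite at 0 and 0 everywhere else, it cannot be used for backpropagation. The straight-through estimator (STE) solves this problem but introduces another. Specifically, when the STE is used for backpropagation, the gradient of the Sign function is not computed (equivalently, it is treated as 1). This ignores the gradient of the Sign function and makes the trained neural network less accurate.
The straight-through estimator is described further below with reference to FIG. 2.
FIG. 2 shows three neural network layers, A, B, and C, where layer B fits the Sign function. During backpropagation, the gradient dl/dy passed back from layer C is used to compute the gradient dl/dx, which is passed back to layer A. Here dl/dy is the gradient of the loss function with respect to the output y of layer B, dl/dx is the gradient of the loss function with respect to the input x of layer B, and y = Sign(x). Because the gradient of Sign(x) is infinite at 0 and 0 everywhere else, the straight-through estimator is used for backpropagation. In this example, the STE does not compute the gradient of Sign(x) (equivalently, it treats that gradient as 1) and passes dl/dy straight back to layer A; that is, dl/dy and dl/dx are taken to be equal.
The example in FIG. 2 makes it clear that if the gradient of Sign(x) is not computed, the gradient passed back is inaccurate, and the trained neural network is therefore less accurate.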
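The backward step in the FIG. 2 example can be sketched as follows. The code assumes NumPy, maps 0 to -1 in the forward pass, and also shows a clipped variant that uses Clip(x, -1, +1) as the surrogate; the function names are hypothetical.

```python
import numpy as np

def sign_forward(x):
    """Layer B in the example: y = Sign(x), with 0 mapped to -1 by assumption."""
    return np.where(x > 0, 1.0, -1.0)

def ste_backward(grad_y):
    """Straight-through estimator: treat dy/dx as 1, so dl/dx equals dl/dy."""
    return grad_y.copy()

def clip_backward(x, grad_y):
    """Clip(x, -1, +1) surrogate: dy/dx is 1 inside [-1, 1] and 0 outside."""
    return grad_y * (np.abs(x) <= 1.0)

x = np.array([-2.0, -0.5, 0.3, 1.5])      # input of layer B
grad_y = np.array([0.1, 0.2, 0.3, 0.4])   # dl/dy passed back from layer C

grad_x_ste = ste_backward(grad_y)          # identical to grad_y
grad_x_clip = clip_backward(x, grad_y)     # zeroed where |x| > 1
```

In both variants the value passed back carries no information about how sharply Sign(x) changes near 0, which is the inaccuracy the text attributes to the straight-through estimator.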
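To this end, the embodiments of this application provide a neural network training method in which, during backpropagation, a differentiable fitting function replaces the binarization function (the Sign function is one binarization function), so that the derivative of the fitting function can be used to compute the gradient of the loss function and the trained neural network becomes more accurate. Furthermore, the fitting function is determined based on a series expansion of the binarization function, so it rests on mathematical theory. Compared with fitting the binarization function with a single fixed function, the fitting function used in the embodiments of this application is more similar to the Sign function, which lowers the fitting error, improves the fit, and in turn improves the accuracy of the trained neural network.
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.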
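First, the overall workflow of an artificial intelligence system is described. FIG. 3 is a schematic structural diagram of the main framework of artificial intelligence, which is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects the series of processes from acquiring data to processing it. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data is refined from data to information to knowledge to wisdom.
The "IT value chain" runs from the underlying infrastructure of artificial intelligence and information (the technologies for providing and processing it) to the industrial ecosystem of the system, and reflects the value that artificial intelligence brings to the information technology industry.
(1) Infrastructure:
The infrastructure provides computing power for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside world takes place through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes platform assurance and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to obtain data, and the data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet of Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on, on data.
Reasoning is the process, in a computer or intelligent system, of simulating human intelligent reasoning and using formalized information to think and solve problems according to a reasoning control strategy; its typical functions are search and matching.
Decision-making is the process of making decisions after intelligent information has been reasoned over, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing described above, some general capabilities can be formed from the results, such as an algorithm or a general system, for example translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications
Intelligent products and industry applications are the products and applications of artificial intelligence systems in various fields. They package the overall artificial intelligence solution, turn intelligent information decision-making into products, and put it into practical use. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, smart city, and intelligent terminals.
The embodiments of this application can be applied to the optimized design of neural network structures, and a neural network trained according to this application can be applied in the various subfields of artificial intelligence, such as image processing, computer vision, and semantic analysis. Specifically, it can be used for image classification, image segmentation, object detection, and image super-resolution reconstruction.
Taking FIG. 3 as an example, the data in the data set obtained by the infrastructure in the embodiments of this application may be multiple data items of different types obtained by sensors such as cameras and radar (also called training data; multiple training data items form a training set), multiple image or video data items, or data such as speech and text. The only requirement is that the training set can be used to train the neural network iteratively and to implement the neural network training of this application; the data types in the training set are not limited here.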
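A specific application scenario of the embodiments of this application is described below with reference to FIG. 4.
As shown in FIG. 4, on the server side, an initialized binary neural network is trained with a training data set, and the training process includes gradient backpropagation. The trained neural network is deployed on a mobile device, where it can be used for image classification. Specifically, as shown in FIG. 4, the photo to be classified is a picture of a cat; the trained neural network classifies it, and the classification result is "cat".
FIG. 5 is a system architecture diagram of the task processing system provided by the embodiments of this application. In FIG. 5, the task processing system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250, and a data collection device 260; the execution device 210 includes a calculation module 211. The data collection device 260 obtains the large-scale open-source data set required by the user (the training set) and stores it in the database 230. The training device 220 trains the target model/rule 201 based on the training set maintained in the database 230, and the resulting trained neural network is then used on the execution device 210. The execution device 210 can call data, code, and the like in the data storage system 250, and can also store data, instructions, and the like in the data storage system 250. The data storage system 250 may be located in the execution device 210 or may be external memory relative to the execution device 210.
The trained neural network obtained after the training device 220 trains the target model/rule 201 can be applied in different systems or devices (that is, the execution device 210), specifically edge devices or end-side devices such as mobile phones, tablets, laptops, monitoring systems (for example, cameras), and security systems. In FIG. 5, the execution device 210 is configured with an I/O interface 212 for exchanging data with external devices, and a "user" can input data to the I/O interface 212 through the client device 240. For example, the client device 240 may be the camera of a monitoring system; a target image captured by the camera is input as input data to the calculation module 211 of the execution device 210, which performs detection on the target image and produces a detection result, and the result is then output to the camera or displayed directly on the display interface (if any) of the execution device 210. In some implementations of this application, the client device 240 may also be integrated into the execution device 210. For example, when the execution device 210 is a mobile phone, the target task can be obtained directly through the phone (for example, a target image captured by the phone's camera or target speech recorded by the phone's recording module; the target task is not limited here) or received from another device (for example, another phone); the calculation module 211 in the phone then performs detection on the target task, produces a detection result, and presents it directly on the phone's display interface. The product forms of the execution device 210 and the client device 240 are not limited here.
It should be noted that FIG. 5 is only a schematic diagram of one system architecture provided by the embodiments of this application, and the positional relationships between the devices, components, and modules shown do not constitute any limitation. For example, in FIG. 5 the data storage system 250 is external memory relative to the execution device 210, but in other cases it may be located in the execution device 210; in FIG. 5 the client device 240 is external to the execution device 210, but in other cases it may be integrated into the execution device 210.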
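The neural network training method provided by the embodiments of this application is described next.
For ease of understanding, the training process of a binary neural network is introduced first. It is as follows.
Suppose the binary neural network corresponding to a convolutional neural network is to be trained. An initial neural network with the same topology as the convolutional neural network is constructed first. The weights of each layer of the convolutional neural network are then binarized, the binarized weights of each layer are fed into the initial neural network, and the initial neural network is trained with them. During training, the weights of the initial neural network are updated through multiple iterations; each iteration includes one forward propagation and one backpropagation, and the gradients from backpropagation are used to update the weights of the initial neural network.
It should be noted that the weights of each layer of the convolutional neural network must be retained during training and are updated in each backpropagation. After training is complete, the weights of each layer of the convolutional neural network are binarized once more, and the binarized weights are used as the weights of the corresponding layers of the initial neural network, which yields the binary neural network.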
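The training procedure just described (keep full-precision weights, binarize them for every forward pass, update the full-precision copy in every backward pass, and binarize once more at the end) can be sketched on a toy problem. Everything specific here is an assumption: NumPy, a single linear layer, a squared-error loss, a data matrix constructed so that its columns are orthogonal, and a surrogate gradient of 1 for the binarization step (that is, the straight-through rule rather than the fitting function of this application).

```python
import numpy as np

def sign(w):
    """Binarize to +1/-1 (0 is mapped to -1 by assumption)."""
    return np.where(w > 0, 1.0, -1.0)

def train_binary_linear(X, y, steps=50, lr=0.1, seed=0):
    """Toy loop: full-precision latent weights are retained and updated,
    while every forward pass uses their binarized copy."""
    rng = np.random.default_rng(seed)
    latent = 0.1 * rng.standard_normal(X.shape[1])   # full-precision weights
    for _ in range(steps):
        wb = sign(latent)                            # forward: binarize
        err = X @ wb - y
        grad_wb = 2.0 * X.T @ err / len(y)           # dLoss/d(binarized weights)
        latent -= lr * grad_wb                       # backward: surrogate gradient taken as 1
    return latent, sign(latent)                      # binarize once more at the end

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((32, 6)))
X = Q * np.sqrt(32)                                  # scaled so that X.T @ X / 32 is the identity
w_true = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
y = X @ w_true
latent, w_binary = train_binary_linear(X, y)
```

With this orthogonal toy data the binarized weights recover `w_true`. In the method of this application, the line that updates `latent` would multiply `grad_wb` by the gradient of the fitting function instead of by 1.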
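In the process above, only the weights of the convolutional neural network are binarized. To obtain a more thoroughly binarized neural network, the activation values of each layer can also be binarized during training, so that the input of every layer is binarized.
In the embodiments of this application, only the weights of the convolutional neural network may be binarized, only its activation values may be binarized, or both its weights and its activation values may be binarized. Accordingly, the training process is described below through two embodiments: in one, the weights of the neural network are binarized during training; in the other, the activation values of the neural network are binarized during training.
The training process in which the weights of the neural network are binarized is described first.
Specifically, referring to FIG. 6, the embodiments of this application provide one embodiment of a neural network training method, which includes the following operations.
Operation 101: during forward propagation, binarize a target weight with a binarization function to obtain the weight of a first neural network layer in the neural network, where the first neural network layer is one layer of the neural network.
A binarization function is a function whose dependent variable takes exactly two values across the different ranges of its independent variable. There are many kinds of binarization function, and the embodiments of this application do not specifically limit them. For example, one binarization function takes the value +1 when the independent variable is greater than 0 and -1 when it is less than or equal to 0; another takes the value +1 when the independent variable is greater than 0 and 0 when it is less than or equal to 0.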
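The two example binarization functions described above can be written directly. This sketch assumes NumPy; the names `binarize_pm1` and `binarize_01` are hypothetical.

```python
import numpy as np

def binarize_pm1(x):
    """+1 where x > 0, otherwise -1."""
    return np.where(x > 0, 1, -1)

def binarize_01(x):
    """+1 where x > 0, otherwise 0."""
    return np.where(x > 0, 1, 0)

x = np.array([-1.5, 0.0, 0.2, 3.0])
pm1 = binarize_pm1(x)   # values in {-1, +1}
b01 = binarize_01(x)    # values in {0, +1}
```

Both functions send 0 to the lower value, following the "less than or equal to 0" case in the text.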
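The target weight may be the weight of the layer, in the neural network to be compressed, that corresponds to the first neural network layer. Specifically, as the training process of a binary neural network described above shows, to compress a convolutional neural network, a target neural network with the same topology as the convolutional neural network is constructed first, and each layer of the convolutional neural network corresponds to one layer of the target neural network. The convolutional neural network therefore contains a layer that corresponds to the first neural network layer, and the weight of that layer is the target weight.
The first neural network layer may be any layer of the neural network; the embodiments of this application do not specifically limit this.
A neural network includes an input layer, hidden layers (also called intermediate layers), and an output layer. The input layer receives input data, the hidden layers process the input data, and the output layer outputs the processing result. To avoid changing the input data by changing the weights of the input layer, and to avoid changing the output result by changing the weights of the output layer, the first neural network layer is usually a hidden layer.
In a convolutional neural network, a convolutional layer (which is a hidden layer) contains many convolution operators, also called kernels, and the number of parameters of each convolutional layer often reaches tens of thousands, hundreds of thousands, or more, so binarizing the weights of the convolutional layers is well worthwhile. Therefore, when the neural network is a convolutional neural network, the first neural network layer may specifically be a convolutional layer.
The purpose of binarizing the target weight is to reduce the storage space it occupies. Whatever the data type of the target weight, it is sufficient that the target weight occupies more storage space than the weight of the first neural network layer.
For example, the data type of the target weight may be 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
Operation 102: during backpropagation, compute the gradient of the loss function with respect to the target weight by taking the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
Because the gradient of the binarization function is unusable, the embodiments of this application replace the binarization function with a fitting function during backpropagation; that is, the gradient of the loss function with respect to the target weight is computed by taking the gradient of the fitting function as the gradient of the binarization function.
Taking FIG. 2 as an example, if the gradient of the fitting function is taken as the gradient of the binarization function, then dl/dx = (gradient of the fitting function) * dl/dy.
It should be noted that there are many ways to determine a fitting function. One class of methods fits the binarization function with a fixed function; however, a fixed function differs considerably from the binarization function, so the fitting error is large and the fit is poor. The embodiments of this application instead determine the fitting function based on a series expansion of the binarization function; a fitting function determined in this way fits the binarization function well, with a small fitting error.
The process of determining the fitting function based on a series expansion of the binarization function is described in detail below.
Any periodic function can be expanded as a series. There are many kinds of series expansion, so the binarization function can also be expanded in many ways; the embodiments of this application do not specifically limit this.
For example, the series expansion of the binarization function is its Fourier series expansion, its wavelet series expansion, or its discrete Fourier series expansion.
Taking the Fourier series expansion as an example, any periodic function can be expanded into a Fourier series: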
Figure PCTCN2022073955-appb-000002
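The binarization function can therefore also be expanded into the Fourier series above, where $t$ is the independent variable, $a_0$ is a constant, $a_i = 0$ for every positive integer $i$, and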
Figure PCTCN2022073955-appb-000003
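The Fourier series above consists of sine and cosine functions, so the binarization function can be fitted by a superposition of infinitely many sine functions, that is: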
Figure PCTCN2022073955-appb-000004
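In practice, a superposition of infinitely many sine functions cannot be used; only a finite number of sine functions can be superposed to fit the binarization function, that is: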
Figure PCTCN2022073955-appb-000005
where $n$ is a non-negative integer whose value can be set as needed; the embodiments of this application do not specifically limit it.
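For orientation, the textbook Fourier series of a unit square wave has the same shape as the expansion described above: only sine terms appear, which is consistent with $a_i = 0$. The block below is that standard result, not a transcription of the equation images; the period $T$, the angular frequency $\omega = 2\pi/T$, and the symbol $\hat{s}_n$ are assumptions, and the original figures may use a different normalization.

```latex
\begin{align}
\mathrm{Sign}(t) &= \frac{4}{\pi}\sum_{i=0}^{\infty}\frac{\sin\big((2i+1)\,\omega t\big)}{2i+1},
  \qquad 0 < |t| < \tfrac{T}{2},\quad \omega = \tfrac{2\pi}{T}, \\
\hat{s}_n(t) &= \frac{4}{\pi}\sum_{i=0}^{n}\frac{\sin\big((2i+1)\,\omega t\big)}{2i+1}, \\
\frac{\mathrm{d}\hat{s}_n}{\mathrm{d}t} &= \frac{4\omega}{\pi}\sum_{i=0}^{n}\cos\big((2i+1)\,\omega t\big).
\end{align}
```

The last line shows why the truncated series is usable in backpropagation: its derivative is finite everywhere and peaks at $t = 0$, where it equals $4\omega(n+1)/\pi$, so it approaches the impulse-like gradient of the Sign function as $n$ grows.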
需要说明的是,在神经网络中,拟合函数是通过至少一层神经网络进行拟合,即一层或多层神经网络的功能相当于拟合函数;而在上述示例中,拟合函数是由多个正弦函数叠加形成,所以该至少一层神经网络可以称为正弦模块。
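A concrete example is used below to show that the fitting function of the embodiments of this application fits better than other, fixed functions. In this example, the binarization function is the Sign(x) function.
Specifically, as shown in Figure 7, plot (a) shows the Sign(x) function and plot (e) shows its gradient; plot (b) shows the Clip(x, -1, +1) function and plot (f) shows its gradient; plot (c) shows a single sine function SIN1(x) and plot (g) shows its gradient; plot (d) shows the fitting function SIN10(x), obtained in the embodiments of this application by superimposing 10 sine functions, and plot (h) shows its gradient.
Here, Clip(x, -1, +1) is the function that the straight-through estimator uses in place of the Sign function.
Comparing plots (a), (b), (c), and (d) shows that the fitting function SIN10(x) is more similar to the Sign(x) function; comparing plots (e), (f), (g), and (h) shows that the derivative of SIN10(x) at 0 is closer to the derivative of Sign(x) at 0.
As plot (f) shows, the gradient of Clip(x, -1, +1) is 0 outside the range -1 to 1 and 1 inside that range, which amounts to passing the gradient of the previous neural network layer straight back; this leads to poor training results and a less accurate trained neural network. The gradient of the fitting function of the embodiments of this application, by contrast, is close to the gradient of the Sign function, so backpropagating with the gradient of the fitting function ensures a good training effect and a more accurate trained neural network.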
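The difference between the two surrogate gradients can be checked numerically. The sketch below assumes that SIN10(x) is the sum of the first 10 odd sine harmonics of a 2π-periodic square wave; the function names are hypothetical and the sample points are arbitrary.

```python
import math


def clip_grad(x: float) -> float:
    """Gradient of Clip(x, -1, +1): 1 inside [-1, 1], 0 outside."""
    return 1.0 if -1.0 <= x <= 1.0 else 0.0


def sin_sum_grad(x: float, terms: int = 10) -> float:
    """Gradient of the assumed SIN10-style fit: (4/pi) * sum cos((2i+1)x)."""
    return 4.0 / math.pi * sum(math.cos((2 * i + 1) * x) for i in range(terms))


# Print both gradients at a few sample points for comparison.
for x in (0.0, 0.5, 1.5):
    print(x, clip_grad(x), round(sin_sum_grad(x), 3))
```

Near 0 the sine-based gradient is much larger than 1, concentrating the update around the jump of Sign(x), whereas the Clip gradient is flat across the whole interval from -1 to 1.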
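In the embodiments of this application, in forward propagation, the target weight is binarized using the binarization function to obtain the weight of the first neural network layer in the neural network; in backpropagation, the gradient of the loss function with respect to the target weight is computed using the gradient of the fitting function as the gradient of the binarization function, which solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, because the fitting function is determined based on the series expansion of the binarization function, it fits the binarization function more closely, which improves the training effect and ensures that the trained neural network is more accurate.
As described above, the fitting function is determined based on the series expansion of the binarization function; on this premise, the fitting function can take many forms.
In one implementation, the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
It can be understood that the series expansion of the binarization function contains infinitely many functions, and the multiple sub-functions are a subset of them. Taking the Fourier series expansion as an example, the series expansion of the binarization function contains infinitely many sine functions, and the multiple sub-functions are several of those sine functions; that is, the fitting function is formed by superimposing multiple sine functions.
In the embodiments of this application, the multiple sub-functions are determined based on the series expansion of the binarization function and together constitute the fitting function, so the fitting function fits the binarization function closely and the fitting effect is improved.
It should be understood that, although a fitting function determined based on the series expansion of the binarization function fits the binarization function closely, an error still exists between the gradient of that fitting function and the gradient of the binarization function.
This error arises for two reasons.
On the one hand, the series expansion of the binarization function contains infinitely many functions, whereas the embodiments of this application obtain the fitting function from multiple sub-functions (that is, a finite number of sub-functions). An error therefore exists between the fitting function and the binarization function, and consequently between the gradient of the fitting function and the gradient of the binarization function.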
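The truncation error described here can be observed at a single point. The sketch below assumes the standard square-wave Fourier form and evaluates it at x = π/2, where Sign(x) = 1; `partial_sum` is a hypothetical helper name.

```python
import math


def partial_sum(x: float, n: int) -> float:
    """Truncated series (4/pi) * sum_{i=0..n} sin((2i+1)x)/(2i+1)."""
    return 4.0 / math.pi * sum(
        math.sin((2 * i + 1) * x) / (2 * i + 1) for i in range(n + 1)
    )


# Absolute error against Sign(pi/2) = 1 for 1, 5, and 50 sine terms.
errors = [abs(partial_sum(math.pi / 2, n) - 1.0) for n in (0, 4, 49)]
```

At this point the error shrinks as more terms are added but never reaches zero for finite n. Close to the jump at 0 the partial sums additionally overshoot (the Gibbs phenomenon), so more terms alone do not remove the discrepancy everywhere.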
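On the other hand, taking the Sign function as an example, the gradient of the Sign function is an impulse function and cannot be used for backpropagation, so the gradient of the Sign function is not an ideal gradient; even if the fitting function fits the Sign function well, its gradient is not the ideal gradient either. In fact, there exists an unknown ideal gradient (which can also be understood as the optimal gradient) that guides the training of the neural network well. An error therefore exists between the gradient of the Sign function and this ideal gradient, that is, between the gradient of the binarization function and the ideal gradient; equivalently, the gradient of the binarization function equals the sum of the ideal gradient and a noise gradient. Because the fitting function is used to fit the binarization function, an error also exists between the gradient of the fitting function and the ideal gradient.
For this reason, the embodiments of this application introduce an error function into the fitting function, to reduce the influence of the error on the gradient of the fitting function and to improve the accuracy of that gradient.
In one implementation, the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In the embodiments of this application, an error function is added to the fitting function; that is, the fitting function consists of multiple sub-functions and an error function. For the multiple sub-functions, refer to the description in the preceding embodiment.
The error function can take many forms, and the embodiments of this application do not specifically limit this.
It should be noted that the error function can be fitted by at least one neural network layer. While the neural network is being trained, the at least one neural network layer used to fit the error function can also be trained, so that the error function it fits is as accurate as possible and the error is reduced.
For example, the error function is fitted by a two-layer fully connected neural network with a residual.
Here, a two-layer fully connected neural network is a neural network in which every neuron in one layer is connected to all neurons in the other layer; the residual is the difference between the actually observed value and the estimated value (the value fitted by the neural network).
In theory, a two-layer fully connected neural network can fit any function, so the embodiments of this application use a two-layer fully connected neural network to fit the error function.
It can be understood that the two-layer fully connected neural network with a residual is used to fit the error function, so it can also be called an error fitting module.
The two-layer fully connected neural network can be regarded as part of the neural network in which the first neural network layer is located. If the two-layer fully connected neural network consists of a third neural network layer and a fourth neural network layer, the error function can be expressed as e(x) = σ(xW_1)W_2 + δ(x), where W_1 denotes the weight of the third neural network layer in the neural network, W_2 denotes the weight of the fourth neural network layer in the neural network, σ(·) denotes the activation function, δ(x) is the residual module, and x denotes the target weight.
The residual module δ(x) can take many forms, and the embodiments of this application do not specifically limit this; for example, the residual module may be 0, x, or sin(x).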
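The expression e(x) = σ(xW_1)W_2 + δ(x) can be sketched in plain Python as below. Several choices are assumptions made only for illustration: ReLU is used as the activation σ, the residual module defaults to δ(x) = x (one of the listed options), x is treated as a row vector, and the weight matrices and helper names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def row_times_matrix(x: Vector, w: Matrix) -> Vector:
    """Row vector x (length d) times matrix w (d rows, k columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, used here as the activation sigma (assumption)."""
    return [max(0.0, a) for a in v]


def error_function(
    x: Vector,
    w1: Matrix,
    w2: Matrix,
    residual: Callable[[Vector], Vector] = lambda v: v,
) -> Vector:
    """e(x) = sigma(x W_1) W_2 + delta(x)."""
    out = row_times_matrix(relu(row_times_matrix(x, w1)), w2)
    return [o + r for o, r in zip(out, residual(x))]


# Hypothetical 2x2 weights: W_1 is the identity, W_2 halves each hidden unit.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
e_identity_residual = error_function([1.0, -2.0], w1, w2)
e_zero_residual = error_function([1.0, -2.0], w1, w2, residual=lambda v: [0.0] * len(v))
```

In training, W_1 and W_2 would be learned together with the rest of the network, which is what lets this module absorb the gradient error described above.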
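In the embodiments of this application, the fitting function consists of multiple sub-functions and an error function. Because the multiple sub-functions are determined based on the series expansion of the binarization function, the fitting function fits the binarization function well; the error function in turn reduces the error between the fitting function and the binarization function, as well as the error between the gradient of the fitting function and the ideal gradient, thereby improving the training result of the neural network.
In one implementation, when the fitting function consists of multiple sub-functions and an error function, and the error function is fitted by at least one neural network layer, operation 102 includes the following, as shown in Figure 8:
Operation 201: in backpropagation, compute the gradient of the multiple sub-functions with respect to the target weight.
Operation 202: compute the gradient of the at least one neural network layer with respect to the target weight.
Operation 203: based on the gradient of the multiple sub-functions with respect to the target weight and the gradient of the at least one neural network layer with respect to the target weight, compute the gradient of the loss function with respect to the target weight.
Specifically, the sum of the gradient of the multiple sub-functions with respect to the target weight and the gradient of the at least one neural network layer with respect to the target weight may be computed first; that sum is then multiplied by the gradient of the loss function with respect to the weight of the first neural network layer, yielding the gradient of the loss function with respect to the target weight.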
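Operations 201 to 203 can be summarized in one expression. The notation is an assumption introduced here: l is the loss, x the target weight, w_b the binarized weight of the first neural network layer, S_n(x) the multiple sub-functions, and e(x) the error function fitted by the at least one neural network layer; the product is taken element-wise.

```latex
\frac{\partial l}{\partial x}
  = \left( \frac{\partial S_n(x)}{\partial x} + \frac{\partial e(x)}{\partial x} \right)
    \cdot \frac{\partial l}{\partial w_b}
```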
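The above process is described below with reference to Figure 9. Because the at least one neural network layer is used to fit the error function, the example of Figure 9 uses the error function in place of the at least one neural network layer.
As shown in Figure 9, S_n(x) denotes the multiple sub-functions and e(x) denotes the error function. First, the gradient of the loss function with respect to the weight of the first neural network layer, passed back from the previous neural network layer, is obtained:
[Formula image PCTCN2022073955-appb-000006]
Then the gradient of the multiple sub-functions S_n(x) with respect to the target weight and the gradient of the error function e(x) with respect to the target weight are computed. Finally, based on the gradient
[Formula image PCTCN2022073955-appb-000007]
the gradient of S_n(x) with respect to the target weight, and the gradient of e(x) with respect to the target weight, the gradient of the loss function with respect to the target weight is computed:
[Formula image PCTCN2022073955-appb-000008]
Specifically, the sum of the gradient of S_n(x) with respect to the target weight and the gradient of e(x) with respect to the target weight may be computed first, and that sum is then multiplied by the gradient
[Formula image PCTCN2022073955-appb-000009]
to obtain the gradient of the loss function with respect to the target weight:
[Formula image PCTCN2022073955-appb-000010]
In backpropagation, the gradient of the multiple sub-functions with respect to the target weight and the gradient of the at least one neural network layer with respect to the target weight are computed, and the gradient of the loss function with respect to the target weight is computed from these two gradients. Because the at least one neural network layer is used to fit the error function, its gradient with respect to the target weight compensates for the error between the gradient of the fitting function and the gradient of the binarization function, and also for the error between the gradient of the binarization function and the ideal gradient. The resulting gradient of the loss function with respect to the target weight is therefore more accurate, which improves the training effect of the neural network.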
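For a single scalar target weight, the gradient combination described above can be sketched as follows. The sketch rests on several assumptions made only for illustration: the sub-functions are the odd sine harmonics of a 2π-periodic square wave, the error function is a two-layer network with ReLU activation and a residual module of 0, and all function and parameter names are hypothetical.

```python
import math
from typing import List


def sine_terms_grad(x: float, n: int) -> float:
    """Gradient of the sub-functions: (4/pi) * sum_{i=0..n} cos((2i+1)x)."""
    return 4.0 / math.pi * sum(math.cos((2 * i + 1) * x) for i in range(n + 1))


def error_grad(x: float, w1: List[float], w2: List[float]) -> float:
    """Gradient of e(x) = sum_j relu(x * w1[j]) * w2[j], residual taken as 0.

    Hidden unit j contributes w1[j] * w2[j] only where its ReLU is active.
    """
    return sum(a * b for a, b in zip(w1, w2) if x * a > 0.0)


def grad_wrt_target_weight(
    x: float, upstream: float, n: int, w1: List[float], w2: List[float]
) -> float:
    """(dS_n/dx + de/dx) * upstream, where upstream is dl/d(binarized weight)."""
    return (sine_terms_grad(x, n) + error_grad(x, w1, w2)) * upstream


# Hypothetical values: two hidden units, upstream gradient 0.5.
g = grad_wrt_target_weight(1.0, 0.5, n=1, w1=[1.0, -1.0], w2=[0.2, 0.3])
```

The learned weights `w1` and `w2` shift the surrogate gradient away from the pure sine-series value, which is the correction role the text assigns to the error function.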
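The training process in which the weights of a neural network are binarized has been described above; the following describes the training process in which the activation values of a neural network are binarized.
As shown in Figure 10, an embodiment of this application further provides another embodiment of a neural network training method, which includes:
Operation 301: in forward propagation, binarize the activation value of a second neural network layer using a binarization function to obtain the input of a first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network.
It can be understood that the second neural network layer and the first neural network layer are two connected neural network layers, and the activation value of the second neural network layer (that is, its output) is the input of the first neural network layer. The embodiments of this application binarize the activation value of the second neural network layer using the binarization function, so that the value input to the first neural network layer is a binarized value.
Activation values have been described earlier, so the activation value in this embodiment can be understood with reference to that description.
Operation 302: in backpropagation, compute the gradient of the loss function with respect to the activation value, using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
In one implementation, the data type of the activation value is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
In one implementation, the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual.
In one implementation, when the fitting function consists of multiple sub-functions and an error function, and the error function is fitted by at least one neural network layer, operation 302 includes:
in backpropagation, computing the gradient of the multiple sub-functions with respect to the activation value;
computing the gradient of the at least one neural network layer with respect to the activation value;
computing the gradient of the loss function with respect to the activation value based on the gradient of the multiple sub-functions with respect to the activation value and the gradient of the at least one neural network layer with respect to the activation value.
In one implementation, the series expansion of the binarization function is a Fourier series expansion, a wavelet series expansion, or a discrete Fourier series expansion of the binarization function.
It should be noted that the embodiment shown in Figure 10 differs from the embodiment shown in Figure 6 in the object being processed. Specifically, the embodiment shown in Figure 10 binarizes the activation value of the second neural network layer in forward propagation and computes the gradient of the loss function with respect to the activation value in backpropagation, whereas the embodiment shown in Figure 6 binarizes the target weight in forward propagation and computes the gradient of the loss function with respect to the target weight in backpropagation. In all other respects the two embodiments are the same, so the embodiment shown in Figure 10 can be understood with reference to the embodiment shown in Figure 6.
In the embodiments of this application, in forward propagation, the activation value of the second neural network layer is binarized using the binarization function to obtain the input of the first neural network layer in the neural network; in backpropagation, the gradient of the loss function with respect to the activation value is computed using the gradient of the fitting function as the gradient of the binarization function, which solves the problem that the gradient of the binarization function cannot be used for backpropagation. Moreover, because the fitting function is determined based on the series expansion of the binarization function, it fits the binarization function more closely, which improves the training effect and ensures that the trained neural network is more accurate.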
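The activation-value variant can be sketched in the same way as the weight variant; only the object being binarized changes. The sketch assumes Sign as the binarization function and a sum of odd sine harmonics as the fitting function, without an error function; the names `forward_binarize` and `backward_binarize` are hypothetical.

```python
import math
from typing import List


def forward_binarize(activations: List[float]) -> List[float]:
    """Forward pass: binarize the second layer's activation values with Sign."""
    return [1.0 if a > 0 else -1.0 for a in activations]


def backward_binarize(
    activations: List[float], grad_inputs: List[float], n: int = 9
) -> List[float]:
    """Backward pass: scale dl/d(first-layer input) by the fitting-function gradient."""

    def fit_grad(a: float) -> float:
        return 4.0 / math.pi * sum(math.cos((2 * i + 1) * a) for i in range(n + 1))

    return [fit_grad(a) * g for a, g in zip(activations, grad_inputs)]


# Hypothetical activation values produced by the second neural network layer.
first_layer_input = forward_binarize([0.3, -1.2, 0.0])
grad_activations = backward_binarize([0.3, -1.2, 0.0], [0.1, 0.0, 1.0], n=0)
```

Here `grad_inputs` stands for the gradient of the loss with respect to the binarized input of the first neural network layer, which backpropagation supplies from the layers that follow.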
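An embodiment of this application further provides a network structure of a neural network. The neural network includes a first neural network module, a second neural network module, and a first neural network layer. The first neural network module consists of one or more neural network layers and is used to implement the binarization step in the embodiment shown in Figure 6; the second neural network module consists of one or more neural network layers and is used to implement the gradient computation step in the embodiment shown in Figure 6.
An embodiment of this application further provides a network structure of a neural network. The neural network includes a first neural network module, a second neural network module, and a first neural network layer. The first neural network module consists of one or more neural network layers and is used to implement the binarization step in the embodiment shown in Figure 10; the second neural network module consists of one or more neural network layers and is used to implement the gradient computation step in the embodiment shown in Figure 10.
Referring to Figure 11, an embodiment of this application further provides a neural network training apparatus, including:
a binarization processing unit 401, configured to binarize, in forward propagation, a target weight using a binarization function to obtain the weight of a first neural network layer in the neural network, where the first neural network layer is one layer of the neural network;
a gradient computation unit 402, configured to compute, in backpropagation, the gradient of the loss function with respect to the target weight, using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
In one implementation, the data type of the target weight is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
In one implementation, the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual.
In one implementation, the error function is fitted by at least one neural network layer, and the gradient computation unit 402 is specifically configured to: in backpropagation, compute the gradient of the multiple sub-functions with respect to the target weight; compute the gradient of the at least one neural network layer with respect to the target weight; and compute the gradient of the loss function with respect to the target weight based on the gradient of the multiple sub-functions with respect to the target weight and the gradient of the at least one neural network layer with respect to the target weight.
In one implementation, the series expansion of the binarization function is a Fourier series expansion, a wavelet series expansion, or a discrete Fourier series expansion of the binarization function.
For the specific implementation, related description, and technical effects of the above units, refer to the description of the embodiment shown in Figure 6.
Referring to Figure 12, an embodiment of this application further provides a neural network training apparatus, including:
a binarization processing unit 501, configured to binarize, in forward propagation, the activation value of a second neural network layer using a binarization function to obtain the input of a first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network;
a gradient computation unit 502, configured to compute, in backpropagation, the gradient of the loss function with respect to the activation value, using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
In one implementation, the data type of the activation value is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
In one implementation, the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
In one implementation, the error function is fitted by a two-layer fully connected neural network with a residual.
In one implementation, the error function is fitted by at least one neural network layer, and the gradient computation unit 502 is specifically configured to: in backpropagation, compute the gradient of the multiple sub-functions with respect to the activation value; compute the gradient of the at least one neural network layer with respect to the activation value; and compute the gradient of the loss function with respect to the activation value based on the gradient of the multiple sub-functions with respect to the activation value and the gradient of the at least one neural network layer with respect to the activation value.
In one implementation, the series expansion of the binarization function is a Fourier series expansion, a wavelet series expansion, or a discrete Fourier series expansion of the binarization function.
For the specific implementation, related description, and technical effects of the above units, refer to the description of the embodiment shown in Figure 10.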
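An embodiment of this application further provides an embodiment of a training device, which may be a server. Refer to Figure 13, which is a schematic structural diagram of a training device provided by an embodiment of this application. The neural network training apparatus described in the embodiment corresponding to Figure 11 or Figure 12 may be deployed on training device 1800 to implement the functions of that apparatus. Specifically, training device 1800 is implemented by one or more servers and may vary considerably with configuration or performance. It may include one or more central processing units (CPUs) 1822 (for example, one or more processors), a memory 1832, and one or more storage media 1830 (for example, one or more mass storage devices) that store application programs 1842 or data 1844. The memory 1832 and the storage medium 1830 may provide transient or persistent storage. A program stored in the storage medium 1830 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations on the training device. Further, the central processing unit 1822 may be configured to communicate with the storage medium 1830 and to execute, on training device 1800, the series of instruction operations in the storage medium 1830.
Training device 1800 may further include one or more power supplies 1826, one or more wired or wireless network interfaces 1850, one or more input/output interfaces 1858, and/or one or more operating systems 1841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
In the embodiments of this application, the central processing unit 1822 is configured to perform the training method performed by the neural network training apparatus in the embodiments corresponding to Figure 11 and Figure 12. Specifically, the central processing unit 1822 may be configured to:
in forward propagation, binarize a target weight using a binarization function to obtain the weight of a first neural network layer in a neural network, where the first neural network layer is one layer of the neural network;
in backpropagation, compute the gradient of the loss function with respect to the target weight, using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.
The central processing unit 1822 may further be configured to:
in forward propagation, binarize the activation value of a second neural network layer using a binarization function to obtain the input of a first neural network layer, where the first neural network layer and the second neural network layer belong to the same neural network;
in backpropagation, compute the gradient of the loss function with respect to the activation value, using the gradient of a fitting function as the gradient of the binarization function, where the fitting function is determined based on a series expansion of the binarization function.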
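An embodiment of this application further provides a chip that includes one or more processors. Some or all of the processors are configured to read and execute a computer program stored in a memory, so as to perform the methods of the foregoing embodiments.
Optionally, the chip further includes a memory, and the memory is connected to the processor by a circuit or wire. Further optionally, the chip also includes a communication interface to which the processor is connected. The communication interface receives the data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes it, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.
In some implementations, some of the one or more processors may implement some of the steps of the above methods by means of dedicated hardware; for example, processing involving the neural network model may be implemented by a dedicated neural network processor or a graphics processor.
The method provided by the embodiments of this application may be implemented by one chip, or by multiple chips working together.
An embodiment of this application further provides a computer storage medium for storing computer software instructions used by the above computer device, including a program designed to be executed by the computer device.
The computer device may be the neural network training apparatus described above with reference to Figure 11 or Figure 12.
An embodiment of this application further provides a computer program product that includes computer software instructions, which can be loaded by a processor to implement the procedures in the methods shown in the foregoing embodiments.
An embodiment of this application further provides a server, which may be an ordinary server or a cloud server, configured to perform the method in the embodiment shown in Figure 6 and/or Figure 10.
An embodiment of this application further provides a terminal device on which a neural network trained by the method in the embodiment shown in Figure 6 and/or Figure 10 is deployed.
The terminal device may be any terminal device on which a neural network can be deployed. Because the neural network trained by the embodiments of this application is a compressed binary neural network, it occupies little storage space and computes quickly; however, its accuracy is slightly lower than that of a conventional uncompressed neural network.
Therefore, neural networks trained by the embodiments of this application are mostly deployed on terminal devices with limited storage space or limited computing power. For example, mobile terminal devices are limited in both storage space and computing power, so the terminal device in the embodiments of this application may be a mobile terminal device, specifically a mobile phone, a tablet computer, an in-vehicle device, a camera, a robot, or the like.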
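A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.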

Claims (32)

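  1. A neural network training method, comprising:
    in forward propagation, binarizing a target weight using a binarization function to obtain a weight of a first neural network layer in a neural network, wherein the first neural network layer is one layer of the neural network;
    in backpropagation, computing a gradient of a loss function with respect to the target weight by using a gradient of a fitting function as a gradient of the binarization function, wherein the fitting function is determined based on a series expansion of the binarization function.
  2. The training method according to claim 1, wherein the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  3. The training method according to claim 2, wherein the error function is fitted by a two-layer fully connected neural network with a residual.
  4. The training method according to claim 2, wherein the error function is fitted by at least one neural network layer;
    the computing, in backpropagation, a gradient of a loss function with respect to the target weight by using a gradient of a fitting function as a gradient of the binarization function comprises:
    in backpropagation, computing a gradient of the multiple sub-functions with respect to the target weight;
    computing a gradient of the at least one neural network layer with respect to the target weight;
    computing the gradient of the loss function with respect to the target weight based on the gradient of the multiple sub-functions with respect to the target weight and the gradient of the at least one neural network layer with respect to the target weight.
  5. The training method according to claim 1, wherein the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  6. The training method according to any one of claims 1 to 5, wherein the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  7. The training method according to any one of claims 1 to 6, wherein a data type of the target weight is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.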
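  8. A neural network training method, comprising:
    in forward propagation, binarizing an activation value of a second neural network layer using a binarization function to obtain an input of a first neural network layer, wherein the first neural network layer and the second neural network layer belong to a same neural network;
    in backpropagation, computing a gradient of a loss function with respect to the activation value by using a gradient of a fitting function as a gradient of the binarization function, wherein the fitting function is determined based on a series expansion of the binarization function.
  9. The training method according to claim 8, wherein the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  10. The training method according to claim 9, wherein the error function is fitted by a two-layer fully connected neural network with a residual.
  11. The training method according to claim 9, wherein the error function is fitted by at least one neural network layer;
    the computing, in backpropagation, a gradient of a loss function with respect to the activation value by using a gradient of a fitting function as a gradient of the binarization function comprises:
    in backpropagation, computing a gradient of the multiple sub-functions with respect to the activation value;
    computing a gradient of the at least one neural network layer with respect to the activation value;
    computing the gradient of the loss function with respect to the activation value based on the gradient of the multiple sub-functions with respect to the activation value and the gradient of the at least one neural network layer with respect to the activation value.
  12. The training method according to claim 8, wherein the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  13. The training method according to any one of claims 8 to 12, wherein the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  14. The training method according to any one of claims 8 to 13, wherein a data type of the activation value is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.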
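  15. A neural network training apparatus, comprising:
    a binarization processing unit, configured to binarize, in forward propagation, a target weight using a binarization function to obtain a weight of a first neural network layer in a neural network, wherein the first neural network layer is one layer of the neural network;
    a gradient computation unit, configured to compute, in backpropagation, a gradient of a loss function with respect to the target weight by using a gradient of a fitting function as a gradient of the binarization function, wherein the fitting function is determined based on a series expansion of the binarization function.
  16. The training apparatus according to claim 15, wherein the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  17. The training apparatus according to claim 16, wherein the error function is fitted by a two-layer fully connected neural network with a residual.
  18. The training apparatus according to claim 16, wherein the error function is fitted by at least one neural network layer, and the gradient computation unit is configured to:
    in backpropagation, compute a gradient of the multiple sub-functions with respect to the target weight;
    compute a gradient of the at least one neural network layer with respect to the target weight;
    compute the gradient of the loss function with respect to the target weight based on the gradient of the multiple sub-functions with respect to the target weight and the gradient of the at least one neural network layer with respect to the target weight.
  19. The training apparatus according to claim 15, wherein the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  20. The training apparatus according to any one of claims 15 to 19, wherein the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  21. The training apparatus according to any one of claims 15 to 19, wherein a data type of the target weight is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.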
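  22. A neural network training apparatus, comprising:
    a binarization processing unit, configured to binarize, in forward propagation, an activation value of a second neural network layer using a binarization function to obtain an input of a first neural network layer, wherein the first neural network layer and the second neural network layer belong to a same neural network;
    a gradient computation unit, configured to compute, in backpropagation, a gradient of a loss function with respect to the activation value by using a gradient of a fitting function as a gradient of the binarization function, wherein the fitting function is determined based on a series expansion of the binarization function.
  23. The training apparatus according to claim 22, wherein the fitting function consists of multiple sub-functions and an error function, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  24. The training apparatus according to claim 23, wherein the error function is fitted by a two-layer fully connected neural network with a residual.
  25. The training apparatus according to claim 23, wherein the error function is fitted by at least one neural network layer, and the gradient computation unit is configured to:
    in backpropagation, compute a gradient of the multiple sub-functions with respect to the activation value;
    compute a gradient of the at least one neural network layer with respect to the activation value;
    compute the gradient of the loss function with respect to the activation value based on the gradient of the multiple sub-functions with respect to the activation value and the gradient of the at least one neural network layer with respect to the activation value.
  26. The training apparatus according to claim 22, wherein the fitting function consists of multiple sub-functions, and the multiple sub-functions are determined based on the series expansion of the binarization function.
  27. The training apparatus according to any one of claims 22 to 26, wherein the series expansion of the binarization function is a Fourier series expansion of the binarization function, a wavelet series expansion of the binarization function, or a discrete Fourier series expansion of the binarization function.
  28. The training apparatus according to any one of claims 22 to 27, wherein a data type of the activation value is 32-bit floating point, 64-bit floating point, 32-bit integer, or 8-bit integer.
  29. A training device, comprising one or more processors and a memory, wherein the memory stores computer-readable instructions;
    the one or more processors read the computer-readable instructions to cause the training device to implement the method according to any one of claims 1 to 14.
  30. A computer-readable storage medium, comprising computer-readable instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 14.
  31. A computer program product, comprising computer-readable instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 14.
  32. A chip system, comprising one or more processors, wherein some or all of the processors are configured to read and execute a computer program stored in a memory, so as to perform the method according to any one of claims 1 to 14.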
PCT/CN2022/073955 2021-01-30 2022-01-26 Neural network training method and related device WO2022161387A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22745261.2A EP4273754A1 (en) 2021-01-30 2022-01-26 Neural network training method and related device
US18/362,435 US20240005164A1 (en) 2021-01-30 2023-07-31 Neural Network Training Method and Related Device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110132041.6A CN113159273B (zh) 2021-01-30 2021-01-30 Neural network training method and related device
CN202110132041.6 2021-01-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/362,435 Continuation US20240005164A1 (en) 2021-01-30 2023-07-31 Neural Network Training Method and Related Device

Publications (1)

Publication Number Publication Date
WO2022161387A1 true WO2022161387A1 (zh) 2022-08-04

Family

ID=76879118

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/073955 WO2022161387A1 (zh) 2021-01-30 2022-01-26 Neural network training method and related device

Country Status (4)

Country Link
US (1) US20240005164A1 (zh)
EP (1) EP4273754A1 (zh)
CN (1) CN113159273B (zh)
WO (1) WO2022161387A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
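CN113159273B (zh) * 2021-01-30 2024-04-30 华为技术有限公司 Neural network training method and related device
CN115660046A (zh) * 2022-10-24 2023-01-31 中电金信软件有限公司 Gradient reconstruction method, apparatus, and device for binary neural network, and storage medium
CN115906936A (zh) * 2022-11-01 2023-04-04 鹏城实验室 Neural network training and inference method, apparatus, terminal, and storage medium
CN117492380B (zh) * 2023-12-29 2024-05-03 珠海格力电器股份有限公司 Control method and control apparatus for central control system of smart home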

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286830A1 (en) * 2016-04-04 2017-10-05 Technion Research & Development Foundation Limited Quantized neural network training and inference
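CN111505706A (zh) * 2020-04-28 2020-08-07 长江大学 Microseismic P-wave first-arrival picking method and apparatus based on deep T-Net network
CN111523637A (zh) * 2020-01-23 2020-08-11 北京航空航天大学 Method and apparatus for generating information retention network
CN113159273A (zh) * 2021-01-30 2021-07-23 华为技术有限公司 Neural network training method and related device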

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPQ896000A0 (en) * 2000-07-24 2000-08-17 Seeing Machines Pty Ltd Facial image processing system
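CN108389187A (zh) * 2018-01-30 2018-08-10 李家菊 Radiology department image recognition method based on convolutional neural network method and support vector machine method
CN109190753A (zh) * 2018-08-16 2019-01-11 新智数字科技有限公司 Neural network processing method and apparatus, storage medium, and electronic apparatus
CN111639751A (zh) * 2020-05-26 2020-09-08 北京航空航天大学 Non-zero padding training method for binary convolutional neural network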


Also Published As

Publication number Publication date
US20240005164A1 (en) 2024-01-04
EP4273754A1 (en) 2023-11-08
CN113159273A (zh) 2021-07-23
CN113159273B (zh) 2024-04-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22745261; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022745261; Country of ref document: EP; Effective date: 20230803)
NENP Non-entry into the national phase (Ref country code: DE)